
Human-Computer Interaction An Empirical Research Perspective

I. Scott MacKenzie

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann is an imprint of Elsevier

CHAPTER 1

Historical Context

Human-computer interaction. In the beginning, there were humans. In the 1940s came computers. Then in the 1980s came interaction. Wait! What happened between 1940 and 1980? Were humans not interacting with computers then? Well, yes, but not just any human. Computers in those days were too precious, too complicated, to allow the average human to mess with them. Computers were carefully guarded. They lived a secluded life in large air-conditioned rooms with raised floors and locked doors in corporate or university research labs or government facilities. The rooms often had glass walls to show off the unique status of the behemoths within. If you were of that breed of human who was permitted access, you were probably an engineer or a scientist—specifically, a computer scientist. And you knew what to do. Whether it was connecting relays with patch cords on an ENIAC (1940s), changing a magnetic memory drum on a UNIVAC (1950s), adjusting the JCL stack on a System/360 (1960s), or greping and awking around the unix command set on a PDP-11 (1970s), you were on home turf. Unix commands like grep, for global regular expression print, were obvious enough. Why consult the manual? You probably wrote it! As for unix’s vi editor, if some poor soul was stupid enough to start typing text while in command mode, well, he got what he deserved.1 Who gave him a login account, anyway? And what’s all this talk about make the state of the system visible to the user? What user? Sounds a bit like … well … socialism!

Interaction was not on the minds of the engineers and scientists who designed, built, configured, and programmed the early computers. But by the 1980s interaction was an issue. The new computers were not only powerful, they were useable—by anyone! With usability added, computers moved from their earlier secure confines onto people’s desks in workplaces and, more important, into people’s homes. One reason human–computer interaction (HCI) is so exciting is that the field’s emergence and progress are aligned with, and in good measure responsible for, this dramatic shift in computing practices.

1 One of the classic UI foibles—told and re-told by HCI educators around the world—is the vi editor’s lack of feedback when switching between modes. Many a user made the mistake of providing input while in command mode or entering a command while in input mode.

Human-Computer Interaction. © 2013 Elsevier Inc. All rights reserved.


This book is about research in human-computer interaction. As in all fields, research in HCI is the force underlying advances that migrate into products and processes that people use, whether for work or pleasure. While HCI itself is broad and includes a substantial applied component—most notably in design—the focus in this book is narrow. The focus is on research—the what, the why, and the how—with a few stories to tell along the way. Many people associate research in HCI with developing a new or improved interaction or interface and testing it in a user study. The term “user study” sometimes refers to an informal evaluation of a user interface. But this book takes a more formal approach, where a user study is “an experiment with human participants.” HCI experiments are discussed throughout the book. The word empirical is added to this book’s title to give weight to the value of experimental research. The research espoused here is empirical because it is based on observation and experience and is carried out and reported on in a manner that allows results to be verified or refuted through the efforts of other researchers. In this way, each item of HCI research joins a large body of work that, taken as a whole, defines the field and sets the context for applying HCI knowledge in real products or processes.

1.1 Introduction

Although HCI emerged in the 1980s, it owes a lot to older disciplines. The most central of these is the field of human factors, or ergonomics. Indeed, the name of the preeminent annual conference in HCI—the Association for Computing Machinery Conference on Human Factors in Computing Systems (ACM SIGCHI)—uses that term. SIGCHI is the special interest group on computer-human interaction sponsored by the ACM.2

Human factors is both a science and a field of engineering. It is concerned with human capabilities, limitations, and performance, and with the design of systems that are efficient, safe, comfortable, and even enjoyable for the humans who use them. It is also an art in the sense of respecting and promoting creative ways for practitioners to apply their skills in designing systems. One need only change systems in that statement to computer systems to make the leap from human factors to HCI. HCI, then, is human factors, but narrowly focused on human interaction with computing technology of some sort. That said, HCI itself does not feel “narrowly focused.” On the contrary, HCI is tremendously broad in scope. It draws upon interests and expertise in disciplines such as psychology (particularly cognitive psychology and experimental psychology), sociology, anthropology, cognitive science, computer science, and linguistics.

2 The Association for Computing Machinery (ACM), founded in 1947, is the world’s leading educational and scientific computing society, with over 95,000 members. The ACM is organized into over 150 special interest groups, or “SIGs.” Among the services offered is the ACM Digital Library, a repository of online publications which includes 45+ ACM journals, 85+ ACM conference proceedings, and numerous other publications from affiliated organizations. See www.acm.org.


FIGURE 1.1 Timeline of notable events in the history of human-computer interaction (HCI).

Figure 1.1 presents a timeline of a few notable events leading to the birth and emergence of HCI as a field of study, beginning in the 1940s.

1.2 Vannevar Bush’s “As We May Think” (1945)

Vannevar Bush’s prophetic essay “As We May Think,” published in the Atlantic Monthly in July, 1945 (Bush, 1945), is required reading in many HCI courses even today. The article has garnered 4,000+ citations in scholarly publications.3 Attesting to the importance of Bush’s vision to HCI is the 1996 reprint of the entire essay in the ACM’s interactions magazine, complete with annotations, sketches, and biographical notes.

Bush (see Figure 1.2) was the U.S. government’s Director of the Office of Scientific Research and Development and a scientific advisor to President Franklin D. Roosevelt. During World War II, he was charged with leading some 6,000 American scientists in the application of science to warfare. But Bush was keenly aware of the possibilities that lay ahead in peacetime in applying science to more lofty and humane pursuits.

3 Google Scholar search using author: “v bush.”


FIGURE 1.2 Vannevar Bush at work (circa 1940–1944).

His essay concerned the dissemination, storage, and access to scholarly knowledge. Bush wrote:

the summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships (p. 37).4

Aside from the reference to antiquated square-rigged ships, what Bush says is something we can fully relate to today, especially his mention of the expanding human experience in relation to HCI. For most people, nothing short of Olympian talent is needed to keep abreast of the latest advances in the information age. Bush’s consequent maze is today’s information overload or lost in hyperspace. Bush’s momentarily important item sounds a bit like a blog posting or a tweet. Although blogs and tweets didn’t exist in 1945, Bush clearly anticipated them.

Bush proposed navigating the knowledge maze with a device he called memex. Among the features of memex is associative indexing, whereby points of interest can be connected and joined so that selecting one item immediately and automatically selects another: “When the user is building a trail, he names it, inserts the name in his code book, and taps it out on his keyboard” (Bush, 1945, p. 44). This sounds like a description of hyperlinks and bookmarks. Although today it is easy to equate memex with hypertext and the World Wide Web, Bush’s inspiration for this idea came from the contemporary telephone exchange, which he described as a “spider web of metal, sealed in a thin glass container” (viz. vacuum tubes) (p. 38). The maze of connections in a telephone exchange gave rise to Bush’s more general theme of a spider web of connections for the information in one’s mind, linking one’s experiences. It is not surprising that some of Bush’s ideas, for instance, dry photography, today seem naïve. Yet the ideas are naïve only when juxtaposed with Bush’s brilliant foretelling of a world we are still struggling with and are still fine-tuning and perfecting.

4 For convenience, page references are to the March 1996 reprint in the ACM’s interactions.

FIGURE 1.3 (a) Demo of Ivan Sutherland’s Sketchpad. (b) A light pen dragging (“rubber banding”) lines, subject to constraints.

1.3 Ivan Sutherland’s Sketchpad (1962)

Ivan Sutherland developed Sketchpad in the early 1960s as part of his PhD research in electrical engineering at the Massachusetts Institute of Technology (M.I.T.). Sketchpad was a graphics system that supported the manipulation of geometric shapes and lines (objects) on a display using a light pen. To appreciate the inferior usability of the computers available to Sutherland at the time of his studies, consider these introductory comments in a paper he published in 1963:

Heretofore, most interaction between man and computers has been slowed by the need to reduce all communication to written statements that can be typed. In the past we have been writing letters to, rather than conferring with, our computers (Sutherland, 1963, p. 329).

With Sketchpad, commands were not typed. Users did not “write letters to” the computer. Instead, objects were drawn, resized, grabbed and moved, extended, deleted—directly, using the light pen (see Figure 1.3). Object manipulations worked with constraints to maintain the geometric relationships and properties of objects. The use of a pointing device for input makes Sketchpad the first direct manipulation interface—a sign of things to come. The term “direct manipulation” was coined many years later by Ben Shneiderman at the University of Maryland to provide a psychological context for a suite of related features that naturally came together in this new genre of human–computer interface (Shneiderman, 1983). These features included visibility of objects, incremental action, rapid feedback, reversibility, exploration, syntactic correctness of all actions, and replacing language with action. While Sutherland’s Sketchpad was one of the earliest examples
of a direct manipulation system, others soon followed, most notably the Dynabook concept system by Alan Kay of the Xerox Palo Alto Research Center (PARC) (Kay and Goldberg, 1977). I will say more about Xerox PARC throughout this chapter. Sutherland’s work was presented at the Institute of Electrical and Electronics Engineers (IEEE) conference in Detroit in 1963 and subsequently published in its proceedings (Sutherland, 1963). The article is available in the ACM Digital Library (http://portal.acm.org). Demo videos of Sketchpad are available on YouTube (www.youtube.com). Not surprisingly, a user study of Sketchpad was not conducted, since Sutherland was a student of electrical engineering. Had his work taken place in the field of industrial engineering (where human factors is studied), user testing would have been more likely.

1.4 Invention of the mouse (1963)

If there is one device that symbolizes the emergence of HCI, it is the computer mouse. Invented by Douglas Engelbart in 1963, the mouse was destined to fundamentally change the way humans interact with computers.5 Instead of typing commands, a user could manipulate a mouse to control an on-screen tracking symbol, or cursor. With the cursor positioned over a graphic image representing the command, the command is issued with a select operation—pressing and releasing a button on the mouse.

Engelbart was among a group of researchers at the Stanford Research Institute (SRI) in Menlo Park, California. An early hypertext system called NLS, for oNLine System, was the project for which an improved pointing device was needed. Specifically, the light pen needed to be replaced. The light pen was an established technology, but it was awkward. The user held the pen in the air in front of the display. After a few minutes of interaction, fatigue would set in. A more natural and comfortable device might be something on the desktop, something in close proximity to the keyboard. The keyboard is where the user’s hands are normally situated, so a device beside the keyboard made the most sense. Engelbart’s invention met this requirement.

The first prototype mouse is seen in Figure 1.4a. The device included two potentiometers positioned at right angles to each other. Large metal wheels were attached to the shafts of the potentiometers and protruded slightly from the base of the housing. The wheels rotated as the device was moved across a surface. Side-to-side motion rotated one wheel; to-and-fro motion rotated the other. With diagonal movement, both wheels rotated, in accordance with the amount of movement in each direction. The amount of rotation of each wheel altered the voltage at the wiper terminal of the potentiometer. The voltages were passed on to the host system for processing. The x and y positions of an on-screen object or cursor were indirectly controlled by the two voltage signals.

5 Engelbart’s patent for the mouse was filed on June 21, 1967 and issued on November 17, 1970 (Engelbart, 1970). U.S. patent laws allow one year between public disclosure and filing; thus, it can be assumed that prior to June 21, 1966, Engelbart’s invention was not disclosed to the public.


FIGURE 1.4 (a) The first mouse. (b) Inventor Douglas Engelbart holding his invention in his left hand and an early three-button variation in his right hand.

In Figure 1.4a, a selection button can be seen under the user’s index finger. In Figure 1.4b, Engelbart is shown with his invention in his left hand and a three-button version of a mouse, which was developed much later, in his right.

Initial testing of the mouse focused on selecting and manipulating text, rather than drawing and manipulating graphic objects. Engelbart was second author of the first published evaluation of the mouse. This was, arguably, HCI’s first user study, so a few words are in order here. Engelbart, along with English and Berman, conducted a controlled experiment comparing several input devices capable of both selection and x-y position control of an on-screen cursor (English, Engelbart, and Berman, 1967). Besides the mouse, the comparison included a light pen, a joystick, a knee-controlled lever, and a Grafacon.

The joystick (Figure 1.5a) had a moving stick and was operated in two control modes. In absolute or position-control mode, the cursor’s position on the display had an absolute correspondence to the position of the stick. In rate-control mode, the cursor’s velocity was determined by the amount of stick deflection, while the direction of the cursor’s motion was determined by the direction of the stick. An embedded switch was included for selection and was activated by pressing down on the stick. The light pen (Figure 1.5b) was operated much like the pen used by Sutherland (see Figure 1.3). The device was picked up and moved to the display surface with the pen pointing at the desired object. A projected circle of orange light indicated the target to the lens system. Selection involved pressing a switch on the barrel of the pen. The knee-controlled lever (Figure 1.5c) was connected to two potentiometers. Side-to-side knee motion controlled side-to-side (x-axis) cursor movement; up-and-down knee motion controlled up-and-down (y-axis) cursor movement. Up-and-down knee motion was achieved by a “rocking motion on the ball of the foot” (p. 7). The device did not include an integrated method for selection. Instead, a key on the system’s keyboard was used.
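All of the devices just described reduce to a common abstraction: two analog channels of position data plus a select action, echoed by an on-screen tracker. The following minimal sketch (not from the book; the class, gain constant, and screen size are illustrative assumptions) shows how such signals might be turned into cursor updates and a point-select test.

# Minimal sketch: two analog channels (x, y) plus a select action driving a cursor.
class Cursor:
    def __init__(self, width, height):
        self.width, self.height = width, height
        self.x, self.y = width // 2, height // 2

    def update(self, dx_volts, dy_volts, gain=50.0):
        # Convert voltage changes on the two channels into clamped cursor motion.
        self.x = min(max(self.x + int(dx_volts * gain), 0), self.width - 1)
        self.y = min(max(self.y + int(dy_volts * gain), 0), self.height - 1)

    def select(self, target):
        # A point-select: report whether the cursor lies inside the target rectangle.
        x0, y0, x1, y1 = target
        return x0 <= self.x <= x1 and y0 <= self.y <= y1

cursor = Cursor(1024, 768)
cursor.update(dx_volts=0.4, dy_volts=-0.2)      # side-to-side and to-and-fro motion
hit = cursor.select(target=(500, 350, 540, 390))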


FIGURE 1.5 Additional devices used in the first comparative evaluation of a mouse: (a) Joystick. (b) Lightpen. (c) Knee-controlled lever. (d) Grafacon. (Source: a, b, d, adapted from English et al., 1967; c, 1967 IEEE. Reprinted with permission)

The Grafacon (Figure 1.5d) was a commercial device used for tracing curves. As noted, the device consisted “of an extensible arm connected to a linear potentiometer, with the housing for the linear potentiometer pivoted on an angular potentiometer” (1967, 6). Originally, there was a pen at the end of the arm; however, this was replaced with a knob-and-switch assembly (see Figure 1.5). The user gripped the knob and moved it about to control the on-screen cursor. Pressing the knob caused a selection.

The knee-controlled lever and Grafacon are interesting alternatives to the mouse. They illustrate and suggest the processes involved in empirical research. It is not likely that Engelbart simply woke up one morning and invented the mouse. While it may be true that novel ideas sometimes arise through “eureka” moments, typically there is more to the process of invention. Refining ideas—deciding what works and what doesn’t—is an iterative process that involves a good deal of trial and error. No doubt, Engelbart and colleagues knew from the outset that they needed a device that would involve some form of human action as input and would produce two channels (x-y) of analog positioning data as output. A select operation was also needed to produce a command or generate closure at the end of a positioning operation. Of course, we know this today as a point-select, or point-and-click, operation. Operating the device away from the display meant some form of on-screen tracker (a cursor) was needed to establish correspondence between the device space and the display space. While this seems obvious today, it was a newly emerging form of human-to-computer interaction in the 1960s.


In the comparative evaluation, English et al. (1967) measured users’ access time (the time to move the hand from the keyboard to the device) and motion time (the time from the onset of cursor movement to the final selection). The evaluation included 13 participants (eight experienced in working with the devices and three inexperienced). For each trial, a character target (with surrounding distracter targets) appeared on the display. The trial began with the participant pressing and releasing the spacebar on the system’s keyboard, whereupon a cursor appeared on the display. The participant moved his or her hand to the input device and then manipulated the device to move the cursor to the target. With the cursor over the target, a selection was made using the method associated with the device.

Examples of the test results from the inexperienced participants are shown in Figure 1.6. Each bar represents the mean for ten sequences. Every sequence consists of eight target-patterns. Results are shown for the mean task completion time (Figure 1.6a) and error rate (Figure 1.6b), where the error rate is the ratio of missed target selections to all selections. While it might appear that the knee-controlled lever is the best device in terms of time, each bar in Figure 1.6a includes both the access time and the motion time. The access time for the knee-controlled lever is, of course, zero. The authors noted that considering motion time only, the knee-controlled lever “no longer shows up so favorably” (p. 12). At 2.43 seconds per trial, the light pen had a slight advantage over the mouse at 2.62 seconds per trial; however, this must be viewed with consideration for the inevitable discomfort in continued use of a light pen, which is operated in the air at the surface of the display. Besides, the mouse was the clear winner in terms of accuracy. The mouse error rate was less than half that of any other device condition in the evaluation (see Figure 1.6b).

The mouse evaluation by English et al. (1967) marks an important milestone in empirical research in HCI. The methodology was empirical and the write-up included most of what is expected today in a user study destined for presentation at a conference and publication in a conference proceedings. For example, the write-up contained a detailed description of the participants, the apparatus, and the procedure. The study could be reproduced if other researchers wished to verify or refute the findings. Of course, reproducing the evaluation today would be difficult, as the devices are no longer available. The evaluation included an independent variable, input method, with six levels: mouse, light pen, joystick (position-control), joystick (rate-control), knee-controlled lever, and Grafacon. There were two dependent variables, task completion time and error rate. The order of administering the device conditions was different for each participant, a practice known today as counterbalancing. While testing for statistically significant differences using an analysis of variance (ANOVA) was not done, it is important to remember that the authors did not have at their disposal the many tools taken for granted today, such as spreadsheets and statistics applications. The next published comparative evaluation involving a mouse was by Card, English, and Burr (1978), about 10 years later. Card et al.’s work was carried out at Xerox PARC and was part of a larger effort that eventually produced the first windows-based graphical user interface, or GUI (see next section).
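To see what the summary measures in an evaluation like this look like in practice, here is a minimal sketch (not from the original study; the trial data and names are invented for illustration) that computes mean task completion time per device and error rate, the ratio of missed selections to all selections.

# Minimal sketch: summarizing (device, completion time, missed?) trials.
trials = [
    ("mouse", 2.62, False), ("mouse", 2.70, False), ("mouse", 2.55, True),
    ("light pen", 2.43, False), ("light pen", 2.51, True), ("light pen", 2.40, True),
]

def summarize(trials):
    by_device = {}
    for device, seconds, missed in trials:
        times, misses = by_device.setdefault(device, ([], []))
        times.append(seconds)
        misses.append(missed)
    return {
        device: {
            "mean_time_s": sum(times) / len(times),
            "error_rate": sum(misses) / len(misses),   # missed selections / all selections
        }
        for device, (times, misses) in by_device.items()
    }

print(summarize(trials))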


FIGURE 1.6 Results of first comparative evaluation of a computer mouse: (a) Task completion time in seconds. (b) Error rate as the ratio of missed selections to all selections. (Adapted from English et al., 1967)

The mouse underwent considerable refining and reengineering at PARC. Most notably, the potentiometer wheels were replaced with a rolling ball assembly, developed by Rider (1974). The advantage of the refined mouse over competing devices was reconfirmed by Card et al. (1978) and has been demonstrated in countless comparative evaluations since and throughout the history of HCI. It was becoming clear that Engelbart’s invention was changing the face of human-computer interaction. Years later, Engelbart would receive the ACM Turing Award (1997) and the ACM SIGCHI Lifetime Achievement Award (1998; 1st recipient). It is interesting that Engelbart’s seminal invention dates to the early 1960s, yet commercialization of the mouse did not occur until 1981, when the Xerox Star was launched.


1.5 Xerox Star (1981)

There was a buzz around the floor of the National Computer Conference (NCC) in May 1981. In those days, the NCC was the yearly conference for computing. It was both a gathering of researchers (sponsored by the American Federation of Information Processing Societies, or AFIPS) and a trade show. The trade show was huge.6 All the players were there. There were big players, like IBM, and little players, like Qupro Data Systems of Kitchener, Ontario, Canada. I was there, “working the booth” for Qupro. Our main product was a small desktop computer system based on a single-board computer known as the Pascal MicroEngine. The buzz at the NCC wasn’t about Qupro. It wasn’t about IBM, either. The buzz was about Xerox. “Have you been to the Xerox booth?” I would hear. “You gotta check it out. It’s really cool.” And indeed it was. The Xerox booth had a substantial crowd gathered around it throughout the duration of the conference. There were scripted demonstrations every hour or so, and the crowd was clearly excited by what they were seeing. The demos were of the Star, or the Xerox 8010 Star Information System, as it was formally named.

The excitement was well deserved, as the 1981 launch of the Xerox Star at the NCC marks a watershed moment in the history of computing. The Star was the first commercially released computer system with a GUI. It had windows, icons, menus, and a pointing device (WIMP). It supported direct manipulation and what-you-see-is-what-you-get (WYSIWYG) interaction. The Star had what was needed to bring computing to the people.

The story of the Star began around 1970, when Xerox established its research center, PARC, in Palo Alto, California. The following year, Xerox signed an agreement with SRI licensing Xerox to use Engelbart’s invention, the mouse (Johnson et al., 1989, p. 22). Over the next 10 years, development proceeded along a number of fronts. The most relevant development for this discussion is that of the Alto, the Star’s predecessor, which began in 1973. The Alto also included a GUI and mouse. It was used widely at Xerox and at a few external test sites. However, the Alto was never released commercially—a missed opportunity on a grand scale, according to some (D. K. Smith and Alexander, 1988).

Figure 1.7 shows the Star workstation, which is unremarkable by today’s standards. The graphical nature of the information on the system’s display can be seen in the image. This was novel at the time. The display was bit-mapped, meaning images were formed by mapping bits in memory to pixels on the display. Most systems at the time used character-mapped displays, meaning the screen image was composed of sequences of characters, each limited to a fixed pattern (e.g., 7 × 10 pixels) retrieved from read-only memory. Character-mapped displays required considerably less memory, but limited the richness of the display image. The mouse—a two-button variety—can be seen beside the system’s keyboard.
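To get a feel for the memory argument, here is a rough, illustrative calculation (the resolution and character-grid figures are assumptions for illustration, not the Star’s actual specifications).

# Rough comparison: bit-mapped versus character-mapped display memory.
BITMAP_W, BITMAP_H = 1024, 768            # assumed resolution, 1 bit per pixel
CHAR_COLS, CHAR_ROWS = 80, 24             # assumed character grid
bitmap_bytes = BITMAP_W * BITMAP_H // 8   # 98,304 bytes (about 96 KB)
charmap_bytes = CHAR_COLS * CHAR_ROWS     # 1,920 bytes; glyph patterns sit in ROM
print(bitmap_bytes, charmap_bytes)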

6 Attendance figures for 1981 are unavailable, but the NCC was truly huge. In 1983, NCC attendance exceeded 100,000 (Abrahams, 1987).


FIGURE 1.7 Xerox Star workstation.

As the designers noted, the Star was intended as an office automation system (Johnson et al., 1989). Business professionals would have Star workstations on their desks and would use them to create, modify, and manage documents, graphics, tables, presentations, etc. The workstations were connected via high-speed Ethernet cables and shared centralized resources, such as printers and file servers. A key tenet in the Star philosophy was that workers wanted to get their work done, not fiddle with computers. Obviously, the computers had to be easy to use, or invisible, so to speak.

One novel feature of the Star was use of the desktop metaphor. Metaphors are important in HCI. When a metaphor is present, the user has a jump-start on knowing what to do. The user exploits existing knowledge from another domain. The desktop metaphor brings concepts from the office desktop to the system’s display. On the display the user finds pictorial representations (icons) for things like documents, folders, trays, and accessories such as a calculator, printer, or notepad. A few examples of the Star’s icons are seen in Figure 1.8. By using existing knowledge of a desktop, the user has an immediate sense of what to do and how things work. The Star designers, and others since, pushed the limits of the metaphor to the point where it is now more like an office metaphor than a desktop metaphor. There are windows, printers, and a trashcan on the display, but of course these artifacts are not found on an office desktop. However, the metaphor seemed to work, as we hear even today that the GUI is an example of the desktop metaphor. I will say more about metaphors in Chapter 3.

In making the system usable (invisible), the Star developers created interactions that deal with files, not programs. So users “open a document,” rather than “invoke an editor.” This means that files are associated with applications, but these details are hidden from the user. Opening a spreadsheet document launches the spreadsheet application, while opening a text document opens a text editor.
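The “open a document, not a program” idea amounts to a simple association between document types and applications. A minimal sketch follows (hypothetical, not the Star’s implementation; the table and names are invented for illustration).

# Minimal sketch: the system, not the user, chooses the application.
APP_FOR_TYPE = {
    "text": "text editor",
    "spreadsheet": "spreadsheet application",
    "drawing": "graphics editor",
}

def open_document(name, doc_type):
    app = APP_FOR_TYPE.get(doc_type)
    if app is None:
        raise ValueError(f"no application associated with type '{doc_type}'")
    return f"launching {app} for {name}"

print(open_document("quarterly-report", "spreadsheet"))   # the user never names the app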


FIGURE 1.8 Examples of icons appearing on the Xerox Star desktop. (Adapted from Smith, Irby, Kimball, and Harslem, 1982)

With a GUI and point-select interaction, the Star interface was the archetype of direct manipulation. The enabling work on graphical interaction (e.g., Sutherland) and pointing devices (e.g., Engelbart) was complete. By comparison, previous command-line interfaces had a single channel of input. For every action, a command was needed to invoke it. The user had to learn and remember the syntax of the system’s commands and type them in to get things done. Direct manipulation systems, like the Star, have numerous input channels, and each channel has a direct correspondence to a task. Furthermore, interaction with the channel is tailored to the properties of the task. A continuous property, such as display brightness or sound volume, has a continuous control, such as a slider. A discrete property, such as font size or family, has a discrete control, such as a multi-position switch or a menu item. Each control also has a dedicated location on the display and is engaged using a direct point-select operation. Johnson et al. (1989, 14) compares direct manipulation to driving a car. A gas pedal controls the speed, a lever controls the wiper blades, a knob controls the radio volume. Each control is a dedicated channel, each has a dedicated location, and each is operated according to the property it controls. When operating a car, the driver can adjust the radio volume and then turn on the windshield wipers. Or the driver can first turn on the windshield wipers and then adjust the radio volume. The car is capable of responding to the driver’s inputs in any order, according to the driver’s wishes. In computing, direct manipulation brings the same flexibility. This is no small feat. Command-line interfaces, by comparison, are simple. They follow a software paradigm known as sequential programming. Every action occurs in a sequence under the system’s control. When the system needs a specific input, the user is prompted to enter it. Direct manipulation interfaces require a different approach because they must accept the user’s actions according to the user’s wishes. While manipulating hello in a text editor, for example, the user might change the font to Courier (hello) and then change the style to bold (hello). Or the user might first set the style to bold (hello) and then
change the font to Courier (hello). The result is the same, but the order of actions differs. The point here is that the user is in control, not the system. To support this, direct manipulation systems are designed using a software paradigm known as event-driven programming, which is substantially more complicated than sequential programming. Although event-driven programming was not new (it was, and still is, used in process-control to respond to sensor events), designing systems that responded asynchronously to user events was new in the early 1970s when work began on the Star. Of course, from the user’s perspective, this detail is irrelevant (remember the invisible computer). We mention it here only to give credit to the Herculean effort that was invested in designing the Star and bringing it to market. Designing the Star was not simply a matter of building an interface using windows, icons, menus, and a pointing device (WIMP), it was about designing a system on which these components could exist and work. A team at PARC led by Alan Kay developed such a system beginning around 1970. The central ingredients were a new object-oriented programming language known as Smalltalk and a software architecture known as Model-View-Controller. This was a complex programming environment that evolved in parallel with the design of the Star. It is not surprising, then, that the development of the Star spanned about 10 years, since the designers were not only inventing a new style of human-computer interaction, they were inventing the architecture on which this new style was built. In the end, the Star was not a commercial success. While many have speculated on why (e.g., D. K. Smith and Alexander, 1988), probably the most significant reason is that the Star was not a personal computer. In the article by the Star interface designers Johnson et al. (1989), there are numerous references to the Star as a personal computer. But it seems they had a different view of “personal.” They viewed the Star as a beefed-up version of a terminal connected to a central server, “a collection of personal computers” (p. 12). In another article, designers Smith and Irby call the Star “a personal computer designed for office professionals” (1998, 17). “Personal”? Maybe, but without a doubt the Star was, first and foremost, a networked workstation connected to a server and intended for an office environment. And it was expensive: $16,000 for the workstation alone. That’s a distant world from personal computing as we know it today. It was also a distant world from personal computing as it existed in the late 1970s and early 1980s. Yes, even then personal computing was flourishing. The Apple II, introduced in 1977 by Apple Computer, was hugely successful. It was the platform on which VisiCalc, the first spreadsheet application, was developed. VisiCalc eventually sold over 700,000 copies and became known as the first “killer app.” Notably, the Star did not have a spreadsheet application, nor could it run any spreadsheet or other application available in the market place. The Star architecture was “closed”—it could only run applications developed by Xerox. Other popular personal computer systems available around the same time were the PET, VIC-20, and Commodore 64, all by Commodore Business Machines, and the TRS-80 by Tandy Corp. These systems were truly personal. Most of them were located in people’s homes. But the user interface was terrible. These systems worked with a traditional command-line interface. 
The operating system—if you
could call it that—usually consisted of a BASIC-language interpreter and a console prompt. LOAD, SAVE, RUN, EDIT, and a few other commands were about the extent of it. Although these systems were indeed personal, a typical user was a hobbyist, computer enthusiast, or anyone with enough technical skill to connect components together and negotiate the inevitable software and hardware hiccups. But users loved them, and they were cheap. However, they were tricky to use. So while the direct manipulation user interface of the Star may have been intuitive and had the potential to be used by people with no technical skill (or interest in having it!), the system just didn’t reach the right audience.
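To make the contrast between the two software paradigms described above concrete, here is a minimal sketch (not from the book; the state, events, and handler names are invented for illustration). In the sequential style the program dictates the order of input; in the event-driven style handlers are registered up front and the user chooses the order.

# Sequential style: the program prompts for each property in a fixed order.
def sequential_editor():
    font = input("Font? ")
    style = input("Style? ")
    return {"font": font, "style": style}

# Event-driven style: the state waits; user events arrive in any order.
state = {"text": "hello", "font": "default", "style": "regular"}
handlers = {
    "set_font": lambda value: state.update(font=value),
    "set_style": lambda value: state.update(style=value),
}

def dispatch(event, value):
    handlers[event](value)            # react to whatever the user did

dispatch("set_style", "bold")         # bold first, then Courier...
dispatch("set_font", "Courier")       # ...or the reverse; the result is the same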

1.6 Birth of HCI (1983)

Nineteen eighty-three is a good year to peg as the birth of HCI. There are at least three key events as markers: the first ACM SIGCHI conference, the publication of Card, Moran, and Newell’s The Psychology of Human-Computer Interaction (1983), and the arrival of the Apple Macintosh, pre-announced with flyers in December 1983. The Mac launch was in January 1984, but I’ll include it here anyway.

1.6.1 First ACM SIGCHI conference (1983)

Human-computer interaction’s roots reach as early as 1969, when ACM’s Special Interest Group on Social and Behavioral Computing (SIGSOC) was formed (Borman, 1996). Initially, SIGSOC focused on computers in the social sciences. However, emphasis soon shifted to the needs and behavioral characteristics of the users, with talk about the user interface or the human factors of computing. Beginning in 1978, SIGSOC lobbied the ACM for a name change. This happened at the 1982 Conference on Human Factors in Computing Systems in Gaithersburg, Maryland, where the formation of the ACM Special Interest Group on Computer-Human Interaction (SIGCHI) was first publicly announced. Today, the ACM provides the following articulate statement of SIGCHI and its mission:

The ACM Special Interest Group on Computer-Human Interaction is the world’s largest association of professionals who work in the research and practice of computer-human interaction. This interdisciplinary group is composed of computer scientists, software engineers, psychologists, interaction designers, graphic designers, sociologists, and anthropologists, just to name some of the domains whose special expertise come to bear in this area. They are brought together by a shared understanding that designing useful and usable technology is an interdisciplinary process, and believe that when done properly it has the power to transform persons’ lives.7

The interdisciplinary nature of the field is clearly evident in the list of disciplines that contribute to, and have a stake in, HCI.

7 Retrieved from http://www.acm.org/sigs#026 on September 10, 2012.


FIGURE 1.9 Number of papers submitted and accepted by year for the ACM SIGCHI Conference on Human Factors in Computing Systems (“CHI”). Statistics from the ACM Digital Library.

In the following year, 1983, the first SIGCHI conference was held in Boston. Fifty-nine technical papers were presented. The conference adopted a slightly modified name to reflect its new stature: ACM SIGCHI Conference on Human Factors in Computing Systems. “CHI,” as it is known (pronounced with a hard “k” sound), has been held yearly ever since and in recent years has had an attendance of about 2,500 people. The CHI conference brings together both researchers and practitioners. The researchers are there for the technical program (presentation of papers), while the practitioners are there to learn about the latest themes of research in academia and industry. Actually, both groups are also there to network (meet and socialize) with like-minded HCI enthusiasts from around the world. Simply put, CHI is the event in HCI, and the yearly pilgrimage to attend is often the most important entry in the calendar for those who consider HCI their field.

The technical program is competitive. Research papers are peer reviewed, and acceptance requires rising above a relatively high bar for quality. Statistics compiled from 1982 to 2011 indicate a total of 12,671 paper submissions with 3,018 acceptances, for an overall acceptance rate of 24 percent. Figure 1.9 shows the breakdown by year, as provided on the ACM Digital Library website.8 The technical program is growing rapidly. For example, the number of accepted contributions in 2011 (410) exceeded the number of submissions in 2005 (372). Once accepted, researchers present their work at the conference, usually in a 15–20 minute talk augmented with visual slides and perhaps a video demonstration of the research. Acceptance also means the final submitted paper is published in the conference proceedings and archived in the ACM Digital Library. Some tips on writing and publishing a research paper are presented in Chapter 8.

8 Data retrieved from http://portal.acm.org. (Click on “Proceedings,” scroll down to any CHI conference proceedings, click on it, then click on the “Publication” tab.)


CHI papers have high visibility, meaning they reach a large community of researchers and practitioners in the field. One indication of the quality of the work is impact, the number of citations credited to a paper. Since the standards for acceptance are high, one might expect CHI papers to have high impact on the field of HCI. And indeed this is the case (MacKenzie, 2009a). I will say more about research impact in Chapter 4. Although the annual CHI conference is SIGCHI’s flagship event, other conferences are sponsored or co-sponsored by SIGCHI. These include the annual ACM Symposium on User Interface Software and Technology (UIST), specialized conferences such as the ACM Symposium on Eye Tracking Research and Applications (ETRA) and the ACM Conference on Computers and Accessibility (ASSETS), and regional conferences such as the Nordic Conference on Computer-Human Interaction (NordiCHI).

1.6.2 The psychology of human-computer interaction (1983)

If two HCI researchers speaking of “Card, Moran, and Newell” are overheard, there is a good chance they are talking about The Psychology of Human-Computer Interaction—the book published in 1983 and co-authored by Stuart Card, Tom Moran, and Allen Newell. (See Figure 1.10.) The book emerged from work done at Xerox PARC. Card and Moran arrived at PARC in 1974 and soon after joined PARC’s Applied Information-Processing Psychology Project (AIP). Newell, a professor of computer science and cognitive psychology at Carnegie Mellon University in Pittsburgh, Pennsylvania, was a consultant to the project. The AIP mission was “to create an applied psychology of human-computer interaction by conducting requisite basic research within a context of application” (Card et al., 1983, p. ix). The book contains 13 chapters organized roughly as follows: scientific foundation (100 pages), text editing examples (150 pages), modeling (80 pages), and extensions and generalizations (100 pages).

So what is an “applied psychology of human-computer interaction”? Applied psychology is built upon basic research in psychology. The first 100 or so pages in the book provide a comprehensive overview of core knowledge in basic psychology as it pertains to the human sensory, cognitive, and motor systems. In the 1980s, many computer science students (and professionals) were challenged with building simple and intuitive interfaces for computer systems, particularly in view of emerging interaction styles based on a GUI. For many students, Card, Moran, and Newell’s book was their first formalized exposure to human perceptual input (e.g., the time to visually perceive a stimulus), cognition (e.g., the time to decide on the appropriate reaction), and motor output (e.g., the time to react and move the hand or cursor to a target). Of course, research in human sensory, cognitive, and motor behavior was well developed at the time. What Card, Moran, and Newell did was connect low-level human processes with the seemingly innocuous interactions humans have with computers (e.g., typing or using a mouse). The framework for this was the model human processor (MHP). (See Figure 1.11.)


FIGURE 1.10 Card, Moran, and Newell’s The Psychology of Human-Computer Interaction. (Published by Erlbaum in 1983)

The MHP had an eye and an ear (for sensory input to a perceptual processor), a brain (with a cognitive processor, short-term memory, and long-term memory), and an arm, hand, and finger (for motor responses).

The application selected to frame the analyses in the book was text editing. This might seem odd today, but it is important to remember that 1983 predates the World Wide Web and most of today’s computing environments such as mobile computing, touch-based input, virtual reality, texting, tweeting, and so on. Text editing seemed like the right framework in which to develop an applied psychology of human-computer interaction.9 Fortunately, all the issues pertinent to text editing are applicable across a broad spectrum of human-computer interaction.

An interesting synergy between psychology and computer science—and it is well represented in the book—is the notion that human behavior can be understood, even modeled, as an information processing activity. In the 1940s and 1950s the work of Shannon (1949), Huffman (1952), and others, on the transmission of information through electronic channels, was quickly picked up by psychologists like Miller (1956), Fitts (1954), and Welford (1968) as a way to characterize human perceptual, cognitive, and motor behavior. Card, Moran, and Newell adapted information processing models of human behavior to interactive systems.

9 At a panel session at CHI 2008, Moran noted that the choice was between text editing and programming.


FIGURE 1.11 The model human processor (MHP) (Card et al., 1983, p. 26).

The two most prominent examples in the book are Hick’s law for choice reaction time (Hick, 1952) and Fitts’ law for rapid aimed movement (Fitts, 1954). I will say more about these in Chapter 7, Modeling Interaction. Newell later reflected on the objectives of The Psychology of Human-Computer Interaction:

We had in mind the need for a theory for designers of interfaces. The design of the interface is the leverage point in human-computer interaction. The classical emphasis of human factors and man-machine psychology on experimental
analysis requires that the system or a suitable mock-up be available for experimentation, but by the time such a concrete system exists, most of the important degrees of freedom in the interface have been bound. What is needed are tools for thought for the designer—so at design time the properties and constraints of the user can be brought to bear in making the important choices. Our objective was to develop an engineering-style theory of the user that permitted approximate, back-of-the-envelope calculations of how the user would interact with the computer when operating at a terminal. (Newell, 1990, pp. 29–30)

There are some interesting points here. For one, Newell astutely identifies a dilemma in the field: experimentation cannot be done until it is too late. As he put it, the system is built and the degrees of freedom are bound. This is an overstatement, perhaps, but it is true that novel interactions in new products always seem to be followed by a flurry of research papers identifying weaknesses and suggesting and evaluating improvements.

There is more to the story, however. Consider the Apple iPhone’s two-finger gestures, the Nintendo Wii’s acceleration sensing flicks, the Microsoft IntelliMouse’s scrolling wheel, or the Palm Pilot’s text-input gestures (aka Graffiti). These “innovations” were not fresh ideas born out of engineering or design brilliance. These breakthroughs, and many more, have context, and that context is the milieu of basic research in human-computer interaction and related fields.10 For the examples just cited, the research preceded commercialization. Research by its very nature requires dissemination through publication. It is not surprising, then, that conferences like CHI and books like The Psychology of Human-Computer Interaction are fertile ground for discovering and spawning new and exciting interaction techniques.

Newell also notes that an objective in the book was to generate “tools for thought.” This is a casual reference to models—models of interaction. The models may be quantitative and predictive or qualitative and descriptive. Either way, they are tools, the carver’s knife, the cobbler’s needle. Whether generating quantitative predictions across alternative design choices or delimiting a problem space to reveal new relationships, a model’s purpose is to tease out strengths and weaknesses in a hypothetical design and to elicit opportunities to improve the design. The book includes exemplars, such as the keystroke-level model (KLM) and the goals, operators, methods, and selection rules model (GOMS). Both of these models were presented in earlier work (Card, Moran, and Newell, 1980), but were presented again in the book, with additional discussion and analysis. The book’s main contribution on modeling, however, was to convincingly demonstrate why and how models are important and to teach us how to build them. For this, HCI’s debt to Card, Moran, and Newell is considerable.

10 Of the four examples cited, research papers anticipating each are found in the HCI literature. On multi-touch finger gestures, there is Rekimoto’s “pick-and-drop” (1997), Dietz and Leigh’s DiamondTouch (Dietz and Leigh, 2001), or, much earlier, Herot and Weinzapfel’s two-finger rotation gesture on a touchscreen (Herot and Weinzapfel, 1978). On acceleration sensing, there is Harrison et al.’s “tilt me!” (1998). On the wheel mouse, there is Venolia’s “roller mouse” (Venolia, 1993). On single-stroke handwriting, there is Goldberg and Richardson’s “Unistrokes” (1993).


I will discuss descriptive and predictive models further in Chapter 7, Modeling Interaction.

Newell suggests using approximate “back of the envelope” calculations as a convenient way to describe or predict user interaction. In The Psychology of Human-Computer Interaction, these appear, among other ways, through a series of 19 interaction examples in Chapter 2 (pp. 23–97). The examples are presented as questions about a user interaction. The solutions use rough calculations but are based on data and concepts gleaned from basic research in experimental psychology. Example 10 is typical:

A user is presented with two symbols, one at a time. If the second symbol is identical to the first, he is to push the key labeled YES. Otherwise he is to push NO. What is the time between signal and response for the YES case? (Card et al., 1983, p. 66)

Before giving the solution, let us consider a modern context for the example. Suppose a user is texting a friend and is entering the word hello on a mobile phone using predictive text entry (T9). Since the mobile phone keypad is ambiguous for text entry, the correct word does not always appear. After entering 4(GHI), 3(DEF), 5(JKL), 5(JKL), 6(MNO), a word appears on the display. This is the signal in the example (see above). There are two possible responses. If the word is hello, it matches the word in the user’s mind and the user presses 0(Space) to accept the word and append a space. This is the yes response in the example. If the display shows some other word, a collision has occurred, meaning there are multiple candidates for the key sequence. The user presses *(Next) to display the next word in the ambiguous set. This is the no response in the example. As elaborated by Card, Moran, and Newell, the interaction just described is a type of simple decision known as physical matching. The reader is walked through the solution using the model human processor to illustrate each step, from stimulus to cognitive processing to motor response. The solution is approximate. There is a nominal prediction accompanied by a fastman prediction and a slowman prediction. Here’s the solution:

Reaction time = tP + 2(tC) + tM
             = 100[50 ∼ 200] + 2(70[25 ∼ 170]) + 70[30 ∼ 100]
             = 310[130 ∼ 640] ms    (1)

There are four low-level processing cycles (Card et al., 1983, p. 69): a perceptual processor cycle (tP), two cognitive processor cycles (tC), and a motor processor cycle (tM). For each, the nominal value is bracketed by an expected minimum and maximum. The values in Equation 1 are obtained from basic research in experimental psychology, as cited in the book. The fastman–slowman range is large and demonstrates the difficulty in accurately predicting human behavior. The book has many other examples like this. There are also modern contexts for the examples, just waiting to be found and applied. It might not be apparent that predicting the time for a task that takes only one-third of a second is relevant to the bigger picture of designing interactive systems.
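For readers who want to reproduce the arithmetic, here is a minimal Python sketch (illustrative only; the structure and names are not from the book) that computes the same physical-matching prediction, framed as the T9 accept-or-next decision described above. The cycle times and their fastman/slowman bounds are the values quoted in Equation 1.

# Model human processor prediction for a physical-match decision
# (Card, Moran, and Newell, 1983, pp. 66-69). Each processor cycle is a
# (nominal, fastman, slowman) triple in milliseconds, as quoted in Equation 1.
T_P = (100, 50, 200)   # perceptual processor cycle
T_C = (70, 25, 170)    # cognitive processor cycle
T_M = (70, 30, 100)    # motor processor cycle

def physical_match_prediction():
    """Reaction time = tP + 2tC + tM, with fastman and slowman bounds."""
    cycles = [T_P, T_C, T_C, T_M]   # one perceptual, two cognitive, one motor cycle
    nominal = sum(c[0] for c in cycles)
    fastman = sum(c[1] for c in cycles)
    slowman = sum(c[2] for c in cycles)
    return nominal, fastman, slowman

nominal, fast, slow = physical_match_prediction()
print(f"Predicted time to accept or reject the T9 word: {nominal} [{fast} ~ {slow}] ms")
# Prints 310 [130 ~ 640] ms, matching Equation 1.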


But don’t be fooled. If a complex task can be deconstructed into primitive actions, there is a good chance the time to do the task can be predicted by dividing the task into a series of motor actions interlaced with perceptual and cognitive processing cycles. This idea is presented in Card, Moran, and Newell’s book as a keystroke-level model (KLM), which I will address again in Chapter 7. The Psychology of Human-Computer Interaction is still available (see http://www.amazon.com) and is regularly and highly cited in research papers (5,000+ citations according to Google Scholar). At the ACM SIGCHI conference in Florence, Italy, in 2008, there was a panel session celebrating the book’s 25th anniversary. Both Card and Moran spoke on the book’s history and on the challenges they faced in bringing a psychological science to the design of interactive computing systems. Others spoke on how the book affected and influenced their own research in human-computer interaction.
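To make the keystroke-level idea mentioned above concrete, here is a hedged sketch (not an implementation from the book) that predicts a task time by summing primitive operator times. The operator durations are the commonly cited KLM estimates for a skilled user; the example operator sequence is hypothetical.

# Illustrative keystroke-level decomposition: predict a task time by summing
# primitive operator times. The durations below are the commonly cited KLM
# estimates in seconds; the example operator sequence is hypothetical.
KLM_OPERATORS = {
    "K": 0.20,   # keystroke or button press (average skilled typist)
    "P": 1.10,   # point at a target with the mouse
    "H": 0.40,   # home hands between keyboard and mouse
    "M": 1.35,   # mental preparation
}

def predict_task_time(sequence):
    """Sum operator times for a sequence such as 'MPK' (prepare, point, click)."""
    return sum(KLM_OPERATORS[op] for op in sequence)

# Hypothetical task: open a menu (M P K), pick a command (M P K), then move to
# the keyboard and type a three-letter argument (H M K K K).
print(round(predict_task_time("MPKMPKHMKKK"), 2), "seconds")   # 7.65 seconds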

1.6.3 Launch of the Apple Macintosh (1984)
January 22, 1984 was a big day in sports. It was the day of Super Bowl XVIII, the championship game of the National Football League in the United States. It was also a big day in advertising. With a television audience of millions, companies were jockeying (and paying!) to deliver brief jolts of hype to viewers who were hungry for entertainment and primed to purchase the latest must-have products. One ad—played during the third quarter—was a 60-second stint for the Apple Macintosh (the Mac) personal computer. The ad, which is viewable on YouTube, used Orwell’s Nineteen Eighty-Four as a theme, portraying the Mac as a computer that would shatter the conventional image of the home computer.[11] The ad climaxed with a female athlete running toward, and tossing a sledgehammer through, the face of Big Brother. The disintegration of Big Brother signaled the triumph of the human spirit over the tyranny and oppression of the corporation. Directed by Ridley Scott,[12] the ad was a hit and was even named the 1980s Commercial of the Decade by Advertising Age magazine.[13] It never aired again. The ad worked. Soon afterward, computer enthusiasts scooped up the Mac. It was sleek and sported the latest input device, a computer mouse. (See Figure 1.12.) The operating system and applications software heralded the new age of the GUI with direct manipulation and point-select interaction. The Mac was not only cool; the interface was simple and intuitive. Anyone could use it. Part of the simplicity was its one-button mouse. With one button, there was no confusion about which button to press. There are plenty of sources chronicling the history of Apple and the events leading to the release of the Mac (Levy, 1995; Linzmayer, 2004; Moritz, 1984). Unfortunately, along with the larger-than-life stature of Apple and its flamboyant leaders comes plenty of folklore to untangle.

[11] Search using “1984 Apple Macintosh commercial.”
[12] Known for his striking visual style, Scott directed many off-beat feature-length films such as Alien (1979), Blade Runner (1982), Thelma and Louise (1991), and Gladiator (2000).
[13] http://en.wikipedia.org/wiki/1984_(advertisement).


FIGURE 1.12 The Apple Macintosh.

A few notable events are listed in Figure 1.13. Names of the key players are deliberately omitted.

1.7 Growth of HCI and graphical user interfaces (GUIs)
With the formation of ACM SIGCHI in 1983 and the release and success of the Apple Macintosh in 1984, human-computer interaction was off and running. GUIs entered the mainstream and, consequently, a much broader community of users and researchers was exposed to this new genre of interaction. Microsoft was a latecomer to GUIs. Early versions of Microsoft Windows appeared in 1985, but it was not until the release of Windows 3.0 (1990) and in particular Windows 3.1 (1992) that Microsoft Windows was considered a serious alternative to the Macintosh operating system. Microsoft increased its market share with improved versions of Windows, most notably Windows 95 (1995), Windows 98 (1998), Windows XP (2001), and Windows 7 (2009). Today, Microsoft operating systems for desktop computers have a market share of about 84 percent, compared to 15 percent for Apple.[14] With advancing interest in human-computer interaction, all major universities introduced courses in HCI or user interface (UI) design, with graduate students often choosing a topic in HCI for their thesis research. Many such programs of study were in computer science departments; however, HCI also emerged as a legitimate and popular focus in other areas such as psychology, cognitive science, industrial engineering, information systems, and sociology. And it wasn’t just universities that recognized the importance of the emerging field. Companies soon realized that designing good user interfaces was good business.

[14] www.statowl.com.


FIGURE 1.13 Some notable events leading to the release of the Apple Macintosh.[15]

But it wasn’t easy. Stories of bad UIs are legion in HCI (e.g., Cooper, 1999; Johnson, 2007; Norman, 1988). So there was work to be done. Practitioners—that is, specialists applying HCI principles in industry—are important members of the HCI community, and they form a significant contingent at many HCI conferences today.

1.8 Growth of HCI research
Research interest in human-computer interaction, at least initially, was in the quality, effectiveness, and efficiency of the interface. How quickly and accurately can people do common tasks using a GUI versus a text-based command-line interface? Or, given two or more variations in a GUI implementation, which one is quicker or more accurate? These or similar questions formed the basis of much empirical research in the early days of HCI. The same is still true today. A classic example of a research topic in HCI is the design of menus. With a GUI, the user issues a command to the computer by selecting the command from a menu rather than typing it on the keyboard. Menus require recognition; typing requires recall.

[15] www.theapplemuseum.com, http://en.wikipedia.org/wiki/History_of_Apple, and www.guidebookgallery.org/articles/lisainterview, with various other sources to confirm dates and events.


FIGURE 1.14 Breadth versus depth in menu design: (a) 8×8 choices in a broad hierarchy. (b) 2×2×2×2×2×2 choices in a deep hierarchy.

It is known that recognition is preferred over recall in user interfaces (Bailey, 1996, p. 144; Hodgson and Ruth, 1985; Howes and Payne, 1990), at least for novices, but a new problem then surfaces. If there are numerous commands in a menu, how should they be organized? One approach is to organize menu commands in a hierarchy that includes depth and breadth. The question arises: what is the best structure for the hierarchy? Consider the case of 64 commands organized in a menu. The menu could be organized with breadth = 8 and depth = 2, or with breadth = 2 and depth = 6. Both structures provide access to 64 menu items. The breadth-emphasis case gives 8² = 64 choices (Figure 1.14a). The depth-emphasis case gives 2⁶ = 64 choices (Figure 1.14b). Which organization is better? Is another organization better still (e.g., 4³ = 64)? Given these questions, it is not surprising that menu design issues were actively pursued as research topics in the early days of HCI (e.g., Card, 1982; Kiger, 1984; Landauer and Nachbar, 1985; D. P. Miller, 1981; Snowberry, Parkinson, and Sisson, 1983; Tullis, 1985). Depth versus breadth is not the only research issue in menu design; there are many others. Should items be ordered alphabetically or by function (Card, 1982; Mehlenbacher, Duffy, and Palmer, 1989)? Does the presence of a title on a submenu improve menu access (J. Gray, 1986)? Is access improved if an icon is added to the label (Hemenway, 1982)? Do people in different age groups respond differently to broad versus deep menu hierarchies (Zaphiris, Kurniawan, and Ellis, 2003)? Is there a depth versus breadth advantage for menus on mobile devices (Geven, Sefelin, and Tscheligi, 2006)? Does auditory feedback improve menu access (Zhao, Dragicevic, Chignell, Balakrishnan, and Baudisch, 2007)? Can the tilt of a mobile phone be used for menu navigation (Rekimoto, 1996)? Can menu lists be pie shaped, rather than linear (Callahan, Hopkins, Weiser, and Shneiderman, 1988)? Can pie menus be used for text entry (D. Venolia and Neiberg, 1994)? The answers to these research questions can be found in the papers cited. They are examples of the kinds of research questions that create opportunities for empirical research in HCI. There are countless such topics of research in HCI. While we’ve seen many in this chapter, we will find many more in the chapters to come.
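The arithmetic behind these alternatives is easy to enumerate. The short sketch below (illustrative only, not drawn from the cited studies) lists every uniform hierarchy, with the same breadth at every level, that provides access to exactly 64 commands.

# Enumerate uniform menu hierarchies (same breadth at every level) that give
# access to exactly 64 commands, i.e., breadth ** depth == 64.
TARGET = 64

for breadth in range(2, TARGET + 1):
    depth = 1
    while breadth ** depth < TARGET:
        depth += 1
    if breadth ** depth == TARGET:
        print(f"breadth = {breadth}, depth = {depth}")

# Output: breadth 2 / depth 6, breadth 4 / depth 3, breadth 8 / depth 2, and the
# degenerate breadth 64 / depth 1 -- the structures compared in Figure 1.14 plus
# the 4-cubed alternative and a single flat menu.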


1.9 Other readings
Two other papers considered important in the history of HCI are:



● “Personal Dynamic Media” by A. Kay and A. Goldberg (1977). This article describes Dynabook. Although never built, Dynabook provided the conceptual basis for laptop computers, tablet PCs, and e-books.
● “The Computer for the 21st Century” by M. Weiser (1991). This is the essay that presaged ubiquitous computing. Weiser begins, “The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it” (p. 94).

Other sources taking a historical view of human-computer interaction include: Baecker, Grudin, Buxton, and Greenberg, 1995; Erickson and McDonald, 2007; Grudin, 2012; Myers, 1998.

1.10 Resources
The following online resources are useful for conducting research in human-computer interaction:

● Google Scholar: http://scholar.google.ca
● ACM Digital Library: http://portal.acm.org
● HCI Bibliography: http://hcibib.org

This website is available as a resource accompanying this book:

● www.yorku.ca/mack/HCIbook
Many downloads are available to accompany the examples presented herein.

STUDENT EXERCISES 1-1. The characteristics of direct manipulation include visibility of objects, incremental action, rapid feedback, reversibility, exploration, syntactic correctness of all actions, and replacing language with action. For each characteristic consider and discuss an example task performed with modern GUIs. Contrast the task with the same task as performed in a command-line environment such as unix, linux, or DOS.

THE DESIGN OF EVERYDAY THINGS

CHAPTER ONE

THE PSYCHOPATHOLOGY OF EVERYDAY THINGS

If I were placed in the cockpit of a modern jet airliner, my inability to perform well would neither surprise nor bother me. But why should I have trouble with doors and light switches, water faucets and stoves? “Doors?” I can hear the reader saying. “You have trouble opening doors?” Yes. I push doors that are meant to be pulled, pull doors that should be pushed, and walk into doors that neither pull nor push, but slide. Moreover, I see others having the same troubles—unnecessary troubles. My problems with doors have become so well known that confusing doors are often called “Norman doors.” Imagine becoming famous for doors that don’t work right. I’m pretty sure that’s not what my parents planned for me. (Put “Norman doors” into your favorite search engine—be sure to include the quote marks: it makes for fascinating reading.) How can such a simple thing as a door be so confusing? A door would seem to be about as simple a device as possible. There is not much you can do to a door: you can open it or shut it. Suppose you are in an office building, walking down a corridor. You come to a door. How does it open? Should you push or pull, on the left or the right? Maybe the door slides. If so, in which direction? I have seen doors that slide to the left, to the right, and even up into the ceiling.

FIGURE 1.1. Coffeepot for Masochists. The French artist Jacques Carelman in his series of books Catalogue d’objets introuvables (Catalog of unfindable objects) provides delightful examples of everyday things that are deliberately unworkable, outrageous, or otherwise ill-formed. One of my favorite items is what he calls “coffeepot for masochists.” The photograph shows a copy given to me by colleagues at the University of California, San Diego. It is one of my treasured art objects.

(Photograph by Aymin Shamma for the author.)

The design of the door should indicate how to work it without any need for signs, certainly without any need for trial and error. A friend told me of the time he got trapped in the doorway of a post office in a European city. The entrance was an imposing row of six glass swinging doors, followed immediately by a second, identical row. That’s a standard design: it helps reduce the airflow and thus maintain the indoor temperature of the building. There was no visible hardware: obviously the doors could swing in either direction: all a person had to do was push the side of the door and enter. My friend pushed on one of the outer doors. It swung inward, and he entered the building. Then, before he could get to the next row of doors, he was distracted and turned around for an instant. He didn’t realize it at the time, but he had moved slightly to the right. So when he came to the next door and pushed it, nothing happened. “Hmm,” he thought, “must be locked.” So he pushed the side of the adjacent door. Nothing. Puzzled, my friend decided to go outside again. He turned around and pushed against the side of a door. Nothing. He pushed the adjacent door. Nothing. The door he had just entered no longer worked. He turned around once more and tried the inside doors again. Nothing. Concern, then mild panic. He was trapped! Just then, a group of people on the other side of the entranceway (to my friend’s right) passed easily through both sets of doors. My friend hurried over to follow their path.


How could such a thing happen? A swinging door has two sides. One contains the supporting pillar and the hinge, the other is unsupported. To open the door, you must push or pull on the unsupported edge. If you push on the hinge side, nothing happens. In my friend’s case, he was in a building where the designer aimed for beauty, not utility. No distracting lines, no visible pillars, no visible hinges. So how can the ordinary user know which side to push on? While distracted, my friend had moved toward the (invisible) supporting pillar, so he was pushing the doors on the hinged side. No wonder nothing happened. Attractive doors. Stylish. Probably won a design prize. Two of the most important characteristics of good design are discoverability and understanding. Discoverability: Is it possible to even figure out what actions are possible and where and how to perform them? Understanding: What does it all mean? How is the product supposed to be used? What do all the different controls and settings mean? The doors in the story illustrate what happens when discoverability fails. Whether the device is a door or a stove, a mobile phone or a nuclear power plant, the relevant components must be visible, and they must communicate the correct message: What actions are possible? Where and how should they be done? With doors that push, the designer must provide signals that naturally indicate where to push. These need not destroy the aesthetics. Put a vertical plate on the side to be pushed. Or make the supporting pillars visible. The vertical plate and supporting pillars are natural signals, naturally interpreted, making it easy to know just what to do: no labels needed. With complex devices, discoverability and understanding require the aid of manuals or personal instruction. We accept this if the device is indeed complex, but it should be unnecessary for simple things. Many products defy understanding simply because they have too many functions and controls. I don’t think that simple home appliances—stoves, washing machines, audio and television sets—should look like Hollywood’s idea of a spaceship control room. They already do, much to our consternation. Faced with a bewildering array of controls and displays, we simply memorize one or two fixed settings to approximate what is desired.


In England I visited a home with a fancy new Italian washer-dryer combination, with super-duper multisymbol controls, all to do everything anyone could imagine doing with the washing and drying of clothes. The husband (an engineering psychologist) said he refused to go near it. The wife (a physician) said she had simply memorized one setting and tried to ignore the rest. I asked to see the manual: it was just as confusing as the device. The whole purpose of the design is lost.

The Complexity of Modern Devices
All artificial things are designed. Whether it is the layout of furniture in a room, the paths through a garden or forest, or the intricacies of an electronic device, some person or group of people had to decide upon the layout, operation, and mechanisms. Not all designed things involve physical structures. Services, lectures, rules and procedures, and the organizational structures of businesses and governments do not have physical mechanisms, but their rules of operation have to be designed, sometimes informally, sometimes precisely recorded and specified. But even though people have designed things since prehistoric times, the field of design is relatively new, divided into many areas of specialty. Because everything is designed, the number of areas is enormous, ranging from clothes and furniture to complex control rooms and bridges. This book covers everyday things, focusing on the interplay between technology and people to ensure that the products actually fulfill human needs while being understandable and usable. In the best of cases, the products should also be delightful and enjoyable, which means that not only must the requirements of engineering, manufacturing, and ergonomics be satisfied, but attention must be paid to the entire experience, which means the aesthetics of form and the quality of interaction. The major areas of design relevant to this book are industrial design, interaction design, and experience design. None of the fields is well defined, but the focus of the efforts does vary, with industrial designers emphasizing form and material, interactive designers emphasizing understandability and usability, and experience designers emphasizing the emotional impact.


Thus:
Industrial design: The professional service of creating and developing concepts and specifications that optimize the function, value, and appearance of products and systems for the mutual benefit of both user and manufacturer (from the Industrial Design Society of America’s website).
Interaction design: The focus is upon how people interact with technology. The goal is to enhance people’s understanding of what can be done, what is happening, and what has just occurred. Interaction design draws upon principles of psychology, design, art, and emotion to ensure a positive, enjoyable experience.
Experience design: The practice of designing products, processes, services, events, and environments with a focus placed on the quality and enjoyment of the total experience.

Design is concerned with how things work, how they are controlled, and the nature of the interaction between people and technology. When done well, the results are brilliant, pleasurable products. When done badly, the products are unusable, leading to great frustration and irritation. Or they might be usable, but force us to behave the way the product wishes rather than as we wish. Machines, after all, are conceived, designed, and constructed by people. By human standards, machines are pretty limited. They do not maintain the same kind of rich history of experiences that people have in common with one another, experiences that enable us to interact with others because of this shared understanding. Instead, machines usually follow rather simple, rigid rules of behavior. If we get the rules wrong even slightly, the machine does what it is told, no matter how insensible and illogical. People are imaginative and creative, filled with common sense; that is, a lot of valuable knowledge built up over years of experience. But instead of capitalizing on these strengths, machines require us to be precise and accurate, things we are not very good at. Machines have no leeway or common sense.


Moreover, many of the rules followed by a machine are known only by the machine and its designers. When people fail to follow these bizarre, secret rules, and the machine does the wrong thing, its operators are blamed for not understanding the machine, for not following its rigid specifications. With everyday objects, the result is frustration. With complex devices and commercial and industrial processes, the resulting difficulties can lead to accidents, injuries, and even deaths. It is time to reverse the situation: to cast the blame upon the machines and their design. It is the machine and its design that are at fault. It is the duty of machines and those who design them to understand people. It is not our duty to understand the arbitrary, meaningless dictates of machines. The reasons for the deficiencies in human-machine interaction are numerous. Some come from the limitations of today’s technology. Some come from self-imposed restrictions by the designers, often to hold down cost. But most of the problems come from a complete lack of understanding of the design principles necessary for effective human-machine interaction. Why this deficiency? Because much of the design is done by engineers who are experts in technology but limited in their understanding of people. “We are people ourselves,” they think, “so we understand people.” But in fact, we humans are amazingly complex. Those who have not studied human behavior often think it is pretty simple. Engineers, moreover, make the mistake of thinking that logical explanation is sufficient: “If only people would read the instructions,” they say, “everything would be all right.” Engineers are trained to think logically. As a result, they come to believe that all people must think this way, and they design their machines accordingly. When people have trouble, the engineers are upset, but often for the wrong reason. “What are these people doing?” they will wonder. “Why are they doing that?” The problem with the designs of most engineers is that they are too logical. We have to accept human behavior the way it is, not the way we would wish it to be.


I used to be an engineer, focused upon technical requirements, quite ignorant of people. Even after I switched into psychology and cognitive science, I still maintained my engineering emphasis upon logic and mechanism. It took a long time for me to realize that my understanding of human behavior was relevant to my interest in the design of technology. As I watched people struggle with technology, it became clear that the difficulties were caused by the technology, not the people. I was called upon to help analyze the American nuclear power plant accident at Three Mile Island (the island name comes from the fact that it is located on a river, three miles south of Middletown in the state of Pennsylvania). In this incident, a rather simple mechanical failure was misdiagnosed. This led to several days of difficulties and confusion, total destruction of the reactor, and a very close call to a severe radiation release, all of which brought the American nuclear power industry to a complete halt. The operators were blamed for these failures: “human error” was the immediate analysis. But the committee I was on discovered that the plant’s control rooms were so poorly designed that error was inevitable: design was at fault, not the operators. The moral was simple: we were designing things for people, so we needed to understand both technology and people. But that’s a difficult step for many engineers: machines are so logical, so orderly. If we didn’t have people, everything would work so much better. Yup, that’s how I used to think. My work with that committee changed my view of design. Today, I realize that design presents a fascinating interplay of technology and psychology, that the designers must understand both. Engineers still tend to believe in logic. They often explain to me in great, logical detail, why their designs are good, powerful, and wonderful. “Why are people having problems?” they wonder. “You are being too logical,” I say. “You are designing for people the way you would like them to be, not for the way they really are.” When the engineers object, I ask whether they have ever made an error, perhaps turning on or off the wrong light, or the wrong stove burner.


“Oh yes,” they say, “but those were errors.” That’s the point: even experts make errors. So we must design our machines on the assumption that people will make errors. (Chapter 5 provides a detailed analysis of human error.)

Human-Centered Design
People are frustrated with everyday things. From the ever-increasing complexity of the automobile dashboard, to the increasing automation in the home with its internal networks, complex music, video, and game systems for entertainment and communication, and the increasing automation in the kitchen, everyday life sometimes seems like a never-ending fight against confusion, continued errors, frustration, and a continual cycle of updating and maintaining our belongings. In the multiple decades that have elapsed since the first edition of this book was published, design has gotten better. There are now many books and courses on the topic. But even though much has improved, the rapid rate of technology change outpaces the advances in design. New technologies, new applications, and new methods of interaction are continually arising and evolving. New industries spring up. Each new development seems to repeat the mistakes of the earlier ones; each new field requires time before it, too, adopts the principles of good design. And each new invention of technology or interaction technique requires experimentation and study before the principles of good design can be fully integrated into practice. So, yes, things are getting better, but as a result, the challenges are ever present. The solution is human-centered design (HCD), an approach that puts human needs, capabilities, and behavior first, then designs to accommodate those needs, capabilities, and ways of behaving. Good design starts with an understanding of psychology and technology. Good design requires good communication, especially from machine to person, indicating what actions are possible, what is happening, and what is about to happen. Communication is especially important when things go wrong. It is relatively easy to design things that work smoothly and harmoniously as long as things go right.


TABLE 1.1. The Role of HCD and Design Specializations
Experience design, industrial design, and interaction design: these are areas of focus.
Human-centered design: the process that ensures that the designs match the needs and capabilities of the people for whom they are intended.

But as soon as there is a problem or a misunderstanding, the problems arise. This is where good design is essential. Designers need to focus their attention on the cases where things go wrong, not just on when things work as planned. Actually, this is where the most satisfaction can arise: when something goes wrong but the machine highlights the problems, then the person understands the issue, takes the proper actions, and the problem is solved. When this happens smoothly, the collaboration of person and device feels wonderful. Human-centered design is a design philosophy. It means starting with a good understanding of people and the needs that the design is intended to meet. This understanding comes about primarily through observation, for people themselves are often unaware of their true needs, even unaware of the difficulties they are encountering. Getting the specification of the thing to be defined is one of the most difficult parts of the design, so much so that the HCD principle is to avoid specifying the problem as long as possible but instead to iterate upon repeated approximations. This is done through rapid tests of ideas, and after each test modifying the approach and the problem definition. The results can be products that truly meet the needs of people. Doing HCD within the rigid time, budget, and other constraints of industry can be a challenge: Chapter 6 examines these issues. Where does HCD fit into the earlier discussion of the several different forms of design, especially the areas called industrial, interaction, and experience design? These are all compatible. HCD is a philosophy and a set of procedures, whereas the others are areas of focus (see Table 1.1). The philosophy and procedures of HCD add deep consideration and study of human needs to the design process, whatever the product or service, whatever the major focus.



Fundamental Principles of Interaction
Great designers produce pleasurable experiences. Experience: note the word. Engineers tend not to like it; it is too subjective. But when I ask them about their favorite automobile or test equipment, they will smile delightedly as they discuss the fit and finish, the sensation of power during acceleration, their ease of control while shifting or steering, or the wonderful feel of the knobs and switches on the instrument. Those are experiences. Experience is critical, for it determines how fondly people remember their interactions. Was the overall experience positive, or was it frustrating and confusing? When our home technology behaves in an uninterpretable fashion we can become confused, frustrated, and even angry—all strong negative emotions. When there is understanding it can lead to a feeling of control, of mastery, and of satisfaction or even pride—all strong positive emotions. Cognition and emotion are tightly intertwined, which means that the designers must design with both in mind. When we interact with a product, we need to figure out how to work it. This means discovering what it does, how it works, and what operations are possible: discoverability. Discoverability results from appropriate application of five fundamental psychological concepts covered in the next few chapters: affordances, signifiers, constraints, mappings, and feedback. But there is a sixth principle, perhaps most important of all: the conceptual model of the system. It is the conceptual model that provides true understanding. So I now turn to these fundamental principles, starting with affordances, signifiers, mappings, and feedback, then moving to conceptual models. Constraints are covered in Chapters 3 and 4.
AFFORDANCES

We live in a world filled with objects, many natural, the rest artificial. Every day we encounter thousands of objects, many of them new to us. Many of the new objects are similar to ones we already know, but many are unique, yet we manage quite well.


How do we do this? Why is it that when we encounter many unusual natural objects, we know how to interact with them? Why is this true with many of the artificial, human-made objects we encounter? The answer lies with a few basic principles. Some of the most important of these principles come from a consideration of affordances. The term affordance refers to the relationship between a physical object and a person (or for that matter, any interacting agent, whether animal or human, or even machines and robots). An affordance is a relationship between the properties of an object and the capabilities of the agent that determine just how the object could possibly be used. A chair affords (“is for”) support and, therefore, affords sitting. Most chairs can also be carried by a single person (they afford lifting), but some can only be lifted by a strong person or by a team of people. If young or relatively weak people cannot lift a chair, then for these people, the chair does not have that affordance, it does not afford lifting. The presence of an affordance is jointly determined by the qualities of the object and the abilities of the agent that is interacting. This relational definition of affordance gives considerable difficulty to many people. We are used to thinking that properties are associated with objects. But affordance is not a property. An affordance is a relationship. Whether an affordance exists depends upon the properties of both the object and the agent. Glass affords transparency. At the same time, its physical structure blocks the passage of most physical objects. As a result, glass affords seeing through and support, but not the passage of air or most physical objects (atomic particles can pass through glass). The blockage of passage can be considered an anti-affordance—the prevention of interaction. To be effective, affordances and anti-affordances have to be discoverable—perceivable. This poses a difficulty with glass. The reason we like glass is its relative invisibility, but this aspect, so useful in the normal window, also hides its anti-affordance property of blocking passage. As a result, birds often try to fly through windows. And every year, numerous people injure themselves when they walk (or run) through closed glass doors or large picture windows.


If an affordance or anti-affordance cannot be perceived, some means of signaling its presence is required: I call this property a signifier (discussed in the next section). The notion of affordance and the insights it provides originated with J. J. Gibson, an eminent psychologist who provided many advances to our understanding of human perception. I had interacted with him over many years, sometimes in formal conferences and seminars, but most fruitfully over many bottles of beer, late at night, just talking. We disagreed about almost everything. I was an engineer who became a cognitive psychologist, trying to understand how the mind works. He started off as a Gestalt psychologist, but then developed an approach that is today named after him: Gibsonian psychology, an ecological approach to perception. He argued that the world contained the clues and that people simply picked them up through “direct perception.” I argued that nothing could be direct: the brain had to process the information arriving at the sense organs to put together a coherent interpretation. “Nonsense,” he loudly proclaimed; “it requires no interpretation: it is directly perceived.” And then he would put his hand to his ears, and with a triumphant flourish, turn off his hearing aids: my counterarguments would fall upon deaf ears—literally. When I pondered my question—how do people know how to act when confronted with a novel situation—I realized that a large part of the answer lay in Gibson’s work. He pointed out that all the senses work together, that we pick up information about the world by the combined result of all of them. “Information pickup” was one of his favorite phrases, and Gibson believed that the combined information picked up by all of our sensory apparatus—sight, sound, smell, touch, balance, kinesthetic, acceleration, body position—determines our perceptions without the need for internal processing or cognition. Although he and I disagreed about the role played by the brain’s internal processing, his brilliance was in focusing attention on the rich amount of information present in the world. Moreover, the physical objects conveyed important information about how people could interact with them, a property he named “affordance.”


Affordances exist even if they are not visible. For designers, their visibility is critical: visible affordances provide strong clues to the operations of things. A flat plate mounted on a door affords pushing. Knobs afford turning, pushing, and pulling. Slots are for inserting things into. Balls are for throwing or bouncing. Perceived affordances help people figure out what actions are possible without the need for labels or instructions. I call the signaling component of affordances signifiers.
SIGNIFIERS

Are affordances important to designers? The first edition of this book introduced the term affordances to the world of design. The design community loved the concept and affordances soon propagated into the instruction and writing about design. I soon found mention of the term everywhere. Alas, the term became used in ways that had nothing to do with the original. Many people find affordances difficult to understand because they are relationships, not properties. Designers deal with fixed properties, so there is a temptation to say that the property is an affordance. But that is not the only problem with the concept of affordances. Designers have practical problems. They need to know how to design things to make them understandable. They soon discovered that when working with the graphical designs for electronic displays, they needed a way to designate which parts could be touched, slid upward, downward, or sideways, or tapped upon. The actions could be done with a mouse, stylus, or fingers. Some systems responded to body motions, gestures, and spoken words, with no touching of any physical device. How could designers describe what they were doing? There was no word that fit, so they took the closest existing word—affordance. Soon designers were saying such things as, “I put an affordance there,” to describe why they displayed a circle on a screen to indicate where the person should touch, whether by mouse or by finger. “No,” I said, “that is not an affordance. That is a way of communicating where the touch should be. You are communicating where to do the touching: the affordance of touching exists on the entire screen: you are trying to signify where the touch should take place. That’s not the same thing as saying what action is possible.”


Not only did my explanation fail to satisfy the design community, but I myself was unhappy. Eventually I gave up: designers needed a word to describe what they were doing, so they chose affordance. What alternative did they have? I decided to provide a better answer: signifiers. Affordances determine what actions are possible. Signifiers communicate where the action should take place. We need both. People need some way of understanding the product or service they wish to use, some sign of what it is for, what is happening, and what the alternative actions are. People search for clues, for any sign that might help them cope and understand. It is the sign that is important, anything that might signify meaningful information. Designers need to provide these clues. What people need, and what designers must provide, are signifiers. Good design requires, among other things, good communication of the purpose, structure, and operation of the device to the people who use it. That is the role of the signifier. The term signifier has had a long and illustrious career in the exotic field of semiotics, the study of signs and symbols. But just as I appropriated affordance to use in design in a manner somewhat different than its inventor had intended, I use signifier in a somewhat different way than it is used in semiotics. For me, the term signifier refers to any mark or sound, any perceivable indicator that communicates appropriate behavior to a person. Signifiers can be deliberate and intentional, such as the sign push on a door, but they may also be accidental and unintentional, such as our use of the visible trail made by previous people walking through a field or over a snow-covered terrain to determine the best path. Or how we might use the presence or absence of people waiting at a train station to determine whether we have missed the train. (I explain these ideas in more detail in my book Living with Complexity.)


FIGURE 1.2. Problem Doors: Signifiers Are Needed. Door hardware can signal whether to push or pull without signs, but the hardware of the two doors in the upper photo, A, are identical even though one should be pushed, the other pulled. The flat, ribbed horizontal bar has the obvious perceived affordance of pushing, but as the signs indicate, the door on the left is to be pulled, the one on the right is to be pushed. In the bottom pair of photos, B and C, there are no visible signifiers or affordances. How does one know which side to push? Trial and error. When external signifiers—signs—have to be added to something as simple as a door, it indicates bad design. (Photographs by the author.)

The signifier is an important communication device to the recipient, whether or not communication was intended. It doesn’t matter whether the useful signal was deliberately placed or whether it is incidental: there is no necessary distinction. Why should it matter whether a flag was placed as a deliberate clue to wind direction (as is done at airports or on the masts of sailboats) or was there as an advertisement or symbol of pride in one’s country (as is done on public buildings).


Once I interpret a flag’s motion to indicate wind direction, it does not matter why it was placed there. Consider a bookmark, a deliberately placed signifier of one’s place in reading a book. But the physical nature of books also makes a bookmark an accidental signifier, for its placement also indicates how much of the book remains. Most readers have learned to use this accidental signifier to aid in their enjoyment of the reading. With few pages left, we know the end is near. And if the reading is torturous, as in a school assignment, one can always console oneself by knowing there are “only a few more pages to get through.” Electronic book readers do not have the physical structure of paper books, so unless the software designer deliberately provides a clue, they do not convey any signal about the amount of text remaining.

FIGURE 1.3. Sliding Doors: Seldom Done Well. Sliding doors are seldom signified properly. The top two photographs show the sliding door to the toilet on an Amtrak train in the United States. The handle clearly signifies “pull,” but in fact, it needs to be rotated and the door slid to the right. The owner of the store in Shanghai, China, Photo C, solved the problem with a sign. “don’t push!” it says, in both English and Chinese. Amtrak’s toilet door could have used a similar kind of sign. (Photographs by the author.)


Whatever their nature, planned or accidental, signifiers provide valuable clues as to the nature of the world and of social activities. For us to function in this social, technological world, we need to develop internal models of what things mean, of how they operate. We seek all the clues we can find to help in this enterprise, and in this way, we are detectives, searching for whatever guidance we might find. If we are fortunate, thoughtful designers provide the clues for us. Otherwise, we must use our own creativity and imagination.

FIGURE 1.4. The Sink That Would Not Drain: Where Signifiers Fail. I washed my hands in my hotel sink in London, but then, as shown in Photo A, was left with the question of how to empty the sink of the dirty water. I searched all over for a control: none. I tried prying open the sink stopper with a spoon (Photo B): failure. I finally left my hotel room and went to the front desk to ask for instructions. (Yes, I actually did.) “Push down on the stopper,” I was told. Yes, it worked (Photos C and D). But how was anyone to ever discover this? And why should I have to put my clean hands back into the dirty water to empty the sink? The problem here is not just the lack of signifier, it is the faulty decision to produce a stopper that requires people to dirty their clean hands to use it. (Photographs by the author.)


Affordances, perceived affordances, and signifiers have much in common, so let me pause to ensure that the distinctions are clear. Affordances represent the possibilities in the world for how an agent (a person, animal, or machine) can interact with something. Some affordances are perceivable, others are invisible. Signifiers are signals. Some signifiers are signs, labels, and drawings placed in the world, such as the signs labeled “push,” “pull,” or “exit” on doors, or arrows and diagrams indicating what is to be acted upon or in which direction to gesture, or other instructions. Some signifiers are simply the perceived affordances, such as the handle of a door or the physical structure of a switch. Note that some perceived affordances may not be real: they may look like doors or places to push, or an impediment to entry, when in fact they are not. These are misleading signifiers, oftentimes accidental but sometimes purposeful, as when trying to keep people from doing actions for which they are not qualified, or in games, where one of the challenges is to figure out what is real and what is not.
FIGURE 1.5. Accidental Affordances Can Become Strong Signifiers. This wall, at the Industrial Design department of KAIST, in Korea, provides an anti-affordance, preventing people from falling down the stair shaft. Its top is flat, an accidental by-product of the design. But flat surfaces afford support, and as soon as one person discovers it can be used to dispose of empty drink containers, the discarded container becomes a signifier, telling others that it is permissible to discard their items there. (Photographs by the author.)


My favorite example of a misleading signifier is a row of vertical pipes across a service road that I once saw in a public park. The pipes obviously blocked cars and trucks from driving on that road: they were good examples of anti-affordances. But to my great surprise, I saw a park vehicle simply go through the pipes. Huh? I walked over and examined them: the pipes were made of rubber, so vehicles could simply drive right over them. A very clever signifier, signaling a blocked road (via an apparent anti-affordance) to the average person, but permitting passage for those who knew. To summarize:
• Affordances are the possible interactions between people and the environment. Some affordances are perceivable, others are not.
• Perceived affordances often act as signifiers, but they can be ambiguous.
• Signifiers signal things, in particular what actions are possible and how they should be done. Signifiers must be perceivable, else they fail to function.

In design, signifiers are more important than affordances, for they communicate how to use the design. A signifier can be words, a graphical illustration, or just a device whose perceived affordances are unambiguous. Creative designers incorporate the signifying part of the design into a cohesive experience. For the most part, designers can focus upon signifiers. Because affordances and signifiers are fundamentally important principles of good design, they show up frequently in the pages of this book. Whenever you see hand-lettered signs pasted on doors, switches, or products, trying to explain how to work them, what to do and what not to do, you are also looking at poor design.
AFFORDANCES AND SIGNIFIERS: A CONVERSATION

A designer approaches his mentor. He is working on a system that recommends restaurants to people, based upon their preferences and those of their friends. But in his tests, he discovered that people never used all of the features. “Why not?” he asks his mentor. (With apologies to Socrates.)


DESIGNER: I’m frustrated; people aren’t using our application properly.
MENTOR: Can you tell me about it?
DESIGNER: The screen shows the restaurant that we recommend. It matches their preferences, and their friends like it as well. If they want to see other recommendations, all they have to do is swipe left or right. To learn more about a place, just swipe up for a menu or down to see if any friends are there now. People seem to find the other recommendations, but not the menus or their friends? I don’t understand.
MENTOR: Why do you think this might be?
DESIGNER: I don’t know. Should I add some affordances? Suppose I put an arrow on each edge and add a label saying what they do.
MENTOR: That is very nice. But why do you call these affordances? They could already do the actions. Weren’t the affordances already there?
DESIGNER: Yes, you have a point. But the affordances weren’t visible. I made them visible.
MENTOR: Very true. You added a signal of what to do.
DESIGNER: Yes, isn’t that what I said?
MENTOR: Not quite—you called them affordances even though they afford nothing new: they signify what to do and where to do it. So call them by their right name: “signifiers.”
DESIGNER: Oh, I see. But then why do designers care about affordances? Perhaps we should focus our attention on signifiers.
MENTOR: You speak wisely. Communication is a key to good design. And a key to communication is the signifier.
DESIGNER: Oh. Now I understand my confusion. Yes, a signifier is what signifies. It is a sign. Now it seems perfectly obvious.
MENTOR: Profound ideas are always obvious once they are understood.

MAPPING

Mapping is a technical term, borrowed from mathematics, meaning the relationship between the elements of two sets of things. Suppose there are many lights in the ceiling of a classroom or auditorium and a row of light switches on the wall at the front of the room.


FIGURE 1.6. Signifiers on a Touch Screen. The arrows and icons are signifiers: they provide signals about the permissible operations for this restaurant guide. Swiping left or right brings up new restaurant recommendations. Swiping up reveals the menu for the restaurant being displayed; swiping down, friends who recommend the restaurant.

The mapping of switches to lights specifies which switch controls which light. Mapping is an important concept in the design and layout of controls and displays. When the mapping uses spatial correspondence between the layout of the controls and the devices being controlled, it is easy to determine how to use them. In steering a car, we rotate the steering wheel clockwise to cause the car to turn right: the top of the wheel moves in the same direction as the car. Note that other choices could have been made. In early cars, steering was controlled by a variety of devices, including tillers, handlebars, and reins. Today, some vehicles use joysticks, much as in a computer game. In cars that used tillers, steering was done much as one steers a boat: move the tiller to the left to turn to the right. Tractors, construction equipment such as bulldozers and cranes, and military tanks that have tracks instead of wheels use separate controls for the speed and direction of each track: to turn right, the left track is increased in speed, while the right track is slowed or even reversed. This is also how a wheelchair is steered. All of these mappings for the control of vehicles work because each has a compelling conceptual model of how the operation of the control affects the vehicle. Thus, if we speed up the left wheel of a wheelchair while stopping the right wheel, it is easy to imagine the chair’s pivoting on the right wheel, circling to the right.


In a small boat, we can understand the tiller by realizing that pushing the tiller to the left causes the ship’s rudder to move to the right and the resulting force of the water on the rudder slows down the right side of the boat, so that the boat rotates to the right. It doesn’t matter whether these conceptual models are accurate: what matters is that they provide a clear way of remembering and understanding the mappings. The relationship between a control and its results is easiest to learn wherever there is an understandable mapping between the controls, the actions, and the intended result. Natural mapping, by which I mean taking advantage of spatial analogies, leads to immediate understanding. For example, to move an object up, move the control up. To make it easy to determine which control works which light in a large room or auditorium, arrange the controls in the same pattern as the lights. Some natural mappings are cultural or biological, as in the universal standard that moving the hand up signifies more, moving it down signifies less, which is why it is appropriate to use vertical position to represent intensity or amount. Other natural mappings follow from the principles of perception and allow for the natural grouping or patterning of controls and feedback. Groupings and proximity are important principles from Gestalt psychology that can be used to map controls to function: related controls should be grouped together. Controls should be close to the item being controlled. Note that there are many mappings that feel “natural” but in fact are specific to a particular culture: what is natural for one culture is not necessarily natural for another. In Chapter 3, I discuss how different cultures view time, which has important implications for some kinds of mappings.
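As a small, hypothetical illustration of the switch-to-light rule above (not an example from the book), the sketch below arranges switches in the same grid as the ceiling lights they control, so that spatial position alone signals which switch operates which light.

# Hypothetical natural mapping: lay out the switches in the same 2-by-3 grid as
# the ceiling lights they control, so position alone tells the user which
# switch operates which light. Names and layout are illustrative.
LIGHT_GRID = [
    ["front-left", "front-center", "front-right"],
    ["back-left",  "back-center",  "back-right"],
]

# Natural mapping: the switch at (row, col) controls the light at (row, col).
switch_to_light = {
    (row, col): light
    for row, row_of_lights in enumerate(LIGHT_GRID)
    for col, light in enumerate(row_of_lights)
}

def toggle_switch(row, col):
    """Report which light a switch toggles, given its position on the panel."""
    print(f"Switch at ({row}, {col}) toggles the {switch_to_light[(row, col)]} light")

toggle_switch(0, 2)   # the front-right switch controls the front-right light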

FIGURE 1.7. Good Mapping: Automobile Seat Adjustment Control. This is an excellent example of natural mapping. The control is in the shape of the seat itself: the mapping is straightforward. To move the front edge of the seat higher, lift up on the front part of the button. To make the seat back recline, move the button back. The same principle could be applied to much more common objects. This particular control is from Mercedes-Benz, but this form of mapping is now used by many automobile companies. (Photograph by the author.)


A device is easy to use when the set of possible actions is visible, when the controls and displays exploit natural mappings. The principles are simple but rarely incorporated into design. Good design takes care, planning, thought, and an understanding of how people behave.
FEEDBACK

Ever watch people at an elevator repeatedly push the Up button, or repeatedly push the pedestrian button at a street crossing? Ever drive to a traffic intersection and wait an inordinate amount of time for the signals to change, wondering all the time whether the detection circuits noticed your vehicle (a common problem with bicycles)? What is missing in all these cases is feedback: some way of letting you know that the system is working on your request. Feedback—communicating the results of an action—is a well-known concept from the science of control and information theory. Imagine trying to hit a target with a ball when you cannot see the target. Even as simple a task as picking up a glass with the hand requires feedback to aim the hand properly, to grasp the glass, and to lift it. A misplaced hand will spill the contents, too hard a grip will break the glass, and too weak a grip will allow it to fall. The human nervous system is equipped with numerous feedback mechanisms, including visual, auditory, and touch sensors, as well as vestibular and proprioceptive systems that monitor body position and muscle and limb movements. Given the importance of feedback, it is amazing how many products ignore it. Feedback must be immediate: even a delay of a tenth of a second can be disconcerting. If the delay is too long, people often give up, going off to do other activities. This is annoying to the people, but it can also be wasteful of resources when the system spends considerable time and effort to satisfy the request, only to find that the intended recipient is no longer there. Feedback must also be informative. Many companies try to save money by using inexpensive lights or sound generators for feedback. These simple light flashes or beeps are usually more annoying than useful.

23

or beeps are usually more annoying than useful. They tell us that something has happened, but convey very little information about what has happened, and then nothing about what we should do about it. When the signal is auditory, in many cases we cannot even be certain which device has created the sound. If the signal is a light, we may miss it unless our eyes are on the correct spot at the correct time. Poor feedback can be worse than no feedback at all, because it is distracting, uninformative, and in many cases irritating and anxiety-provoking. Too much feedback can be even more annoying than too little. My dishwasher likes to beep at three a.m. to tell me that the wash is done, defeating my goal of having it work in the middle of the night so as not to disturb anyone (and to use less expensive electricity). But worst of all is inappropriate, uninterpretable feedback. The irritation caused by a “backseat driver” is well enough known that it is the staple of numerous jokes. Backseat drivers are often correct, but their remarks and comments can be so numerous and continuous that instead of helping, they become an irritating distraction. Machines that give too much feedback are like backseat drivers. Not only is it distracting to be subjected to continual flashing lights, text announcements, spoken voices, or beeps and boops, but it can be dangerous. Too many announcements cause people to ignore all of them, or wherever possible, disable all of them, which means that critical and important ones are apt to be missed. Feedback is essential, but not when it gets in the way of other things, including a calm and relaxing environment. Poor design of feedback can be the result of decisions aimed at reducing costs, even if they make life more difficult for people. Rather than use multiple signal lights, informative displays, or rich, musical sounds with varying patterns, the focus upon cost reduction forces the design to use a single light or sound to convey multiple types of information. If the choice is to use a light, then one flash might mean one thing; two rapid flashes, something else. A long flash might signal yet another state; and a long flash followed by a brief one, yet another. If the choice is to use a sound, quite often the least expensive sound device is selected, one that 24

The Design of Everyday Things

can only produce a high-frequency beep. Just as with the lights, the only way to signal different states of the machine is by beeping different patterns. What do all these different patterns mean? How can we possibly learn and remember them? It doesn’t help that every different machine uses a different pattern of lights or beeps, sometimes with the same patterns meaning contradictory things for different machines. All the beeps sound alike, so it often isn’t even possible to know which machine is talking to us. Feedback has to be planned. All actions need to be confirmed, but in a manner that is unobtrusive. Feedback must also be prioritized, so that unimportant information is presented in an unobtrusive fashion, but important signals are presented in a way that does capture attention. When there are major emergencies, then even important signals have to be prioritized. When every device is signaling a major emergency, nothing is gained by the resulting cacophony. The continual beeps and alarms of equipment can be dangerous. In many emergencies, workers have to spend valuable time turning off all the alarms because the sounds interfere with the concentration required to solve the problem. Hospital operating rooms, emergency wards. Nuclear power control plants. Airplane cockpits. All can become confusing, irritating, and lifeendangering places because of excessive feedback, excessive alarms, and incompatible message coding. Feedback is essential, but it has to be done correctly. Appropriately. CONCEPTUAL MODELS
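The single-light, single-beeper coding problem can be made concrete with a short sketch of a hypothetical appliance (our own, not one from the book): every machine state has to be squeezed into an arbitrary pattern, and nothing in the pattern itself tells the user what happened or what to do.

```python
# Hypothetical appliance with one cheap beeper: each state gets an arbitrary code.
BEEP_CODES = {
    "cycle finished": "beep",
    "door ajar":      "beep beep",    # two rapid beeps
    "filter clogged": "beeeep",       # one long beep
    "needs service":  "beeeep beep",  # long beep followed by a short one
}

def announce(state: str) -> str:
    """Return the beep pattern for a state -- informative only to someone who
    has already memorized this particular device's code."""
    return BEEP_CODES[state]

print(announce("door ajar"))  # 'beep beep': nothing about the sound says "door"
```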

CONCEPTUAL MODELS

A conceptual model is an explanation, usually highly simplified, of how something works. It doesn't have to be complete or even accurate as long as it is useful. The files, folders, and icons you see displayed on a computer screen help people create the conceptual model of documents and folders inside the computer, or of apps or applications residing on the screen, waiting to be summoned. In fact, there are no folders inside the computer—those are effective conceptualizations designed to make them easier to use. Sometimes these depictions can add to the confusion, however. When reading e-mail or visiting a website, the material appears to be on the device, for that is where it is displayed and manipulated. But in fact, in many cases the actual material is "in the cloud," located on some distant machine. The conceptual model is of one, coherent image, whereas it may actually consist of parts, each located on different machines that could be almost anywhere in the world. This simplified model is helpful for normal usage, but if the network connection to the cloud services is interrupted, the result can be confusing. Information is still on their screen, but users can no longer save it or retrieve new things: their conceptual model offers no explanation. Simplified models are valuable only as long as the assumptions that support them hold true.

There are often multiple conceptual models of a product or device. People's conceptual models for the way that regenerative braking in a hybrid or electrically powered automobile works are quite different for average drivers than for technically sophisticated drivers, different again for whoever must service the system, and yet different again for those who designed the system. Conceptual models found in technical manuals and books for technical use can be detailed and complex. The ones we are concerned with here are simpler: they reside in the minds of the people who are using the product, so they are also "mental models." Mental models, as the name implies, are the conceptual models in people's minds that represent their understanding of how things work. Different people may hold different mental models of the same item. Indeed, a single person might have multiple models of the same item, each dealing with a different aspect of its operation: the models can even be in conflict.

Conceptual models are often inferred from the device itself. Some models are passed on from person to person. Some come from manuals. Usually the device itself offers very little assistance, so the model is constructed by experience. Quite often these models are erroneous, and therefore lead to difficulties in using the device. The major clues to how things work come from their perceived structure—in particular from signifiers, affordances, constraints, and mappings. Hand tools for the shop, gardening, and the house tend to make their critical parts sufficiently visible that conceptual models of their operation and function are readily derived.

Consider a pair of scissors: you can see that the number of possible actions is limited. The holes are clearly there to put something into, and the only logical things that will fit are fingers. The holes are both affordances—they allow the fingers to be inserted—and signifiers—they indicate where the fingers are to go. The sizes of the holes provide constraints to limit the possible fingers: a big hole suggests several fingers; a small hole, only one. The mapping between holes and fingers—the set of possible operations—is signified and constrained by the holes. Moreover, the operation is not sensitive to finger placement: if you use the wrong fingers (or the wrong hand), the scissors still work, although not as comfortably. You can figure out the scissors because their operating parts are visible and the implications clear. The conceptual model is obvious, and there is effective use of signifiers, affordances, and constraints.

FIGURE 1.8. Junghans Mega 1000 Digital Radio Controlled Watch. There is no good conceptual model for understanding the operation of my watch. It has five buttons with no hints as to what each one does. And yes, the buttons do different things in their different modes. But it is a very nice-looking watch, and always has the exact time because it checks official radio time stations. (The top row of the display is the date: Wednesday, February 20, the eighth week of the year.) (Photograph by the author.)

What happens when the device does not suggest a good conceptual model? Consider my digital watch with five buttons: two along the top, two along the bottom, and one on the left side (Figure 1.8). What is each button for? How would you set the time? There is no way to tell—no evident relationship between the operating controls and the functions, no constraints, no apparent mappings. Moreover, the buttons have multiple ways of being used. Two of the buttons do different things when pushed quickly or when kept depressed for several seconds. Some operations require simultaneous depression of several of the buttons. The only way to tell how to work the watch is to read the manual, over and over again. With the scissors, moving the handle makes the blades move. The watch provides no visible relationship between the buttons and the possible actions, no discernible relationship between the actions and the end results. I really like the watch: too bad I can't remember all the functions.

Conceptual models are valuable in providing understanding, in predicting how things will behave, and in figuring out what to do when things do not go as planned. A good conceptual model allows us to predict the effects of our actions. Without a good model, we operate by rote, blindly; we do operations as we were told to do them; we can't fully appreciate why, what effects to expect, or what to do if things go wrong. As long as things work properly, we can manage. When things go wrong, however, or when we come upon a novel situation, then we need a deeper understanding, a good model.

For everyday things, conceptual models need not be very complex. After all, scissors, pens, and light switches are pretty simple devices. There is no need to understand the underlying physics or chemistry of each device we own, just the relationship between the controls and the outcomes. When the model presented to us is inadequate or wrong (or, worse, nonexistent), we can have difficulties. Let me tell you about my refrigerator.

I used to own an ordinary, two-compartment refrigerator—nothing very fancy about it. The problem was that I couldn't set the temperature properly. There were only two things to do: adjust the temperature of the freezer compartment and adjust the temperature of the fresh food compartment. And there were two controls, one labeled "freezer," the other "refrigerator." What's the problem? Oh, perhaps I'd better warn you. The two controls are not independent. The freezer control also affects the fresh food temperature, and the fresh food control also affects the freezer. Moreover, the manual warns that one should "always allow twenty-four (24) hours for the temperature to stabilize whether setting the controls for the first time or making an adjustment."

FIGURE 1.9. Refrigerator Controls. Two compartments—fresh food and freezer—and two controls (in the fresh food unit). Your task: Suppose the freezer is too cold, the fresh food section just right. How would you adjust the controls so as to make the freezer warmer and keep the fresh food the same? (Photograph by the author.)

It was extremely difficult to regulate the temperature of my old refrigerator. Why? Because the controls suggest a false conceptual model. Two compartments, two controls, which implies that each control is responsible for the temperature of the compartment that carries its name: this conceptual model is shown in Figure 1.10A. It is wrong. In fact, there is only one thermostat and only one cooling mechanism. One control adjusts the thermostat setting, the other the relative proportion of cold air sent to each of the two compartments of the refrigerator. This is why the two controls interact: this conceptual model is shown in Figure 1.10B. In addition, there must be a temperature sensor, but there is no way of knowing where it is located.

FIGURE 1.10. Two Conceptual Models for a Refrigerator. The conceptual model A is provided by the system image of the refrigerator as gleaned from the controls. Each control determines the temperature of the named part of the refrigerator. This means that each compartment has its own temperature sensor and cooling unit. This is wrong. The correct conceptual model is shown in B. There is no way of knowing where the temperature sensor is located, so it is shown outside the refrigerator. The freezer control determines the freezer temperature (so is this where the sensor is located?). The refrigerator control determines how much of the cold air goes to the freezer and how much to the refrigerator.

With the conceptual model suggested by the controls, adjusting the temperatures is almost impossible and always frustrating. Given the correct model, life would be much easier. Why did the manufacturer suggest the wrong conceptual model? We will never know. In the twenty-five years since the publication of the first edition of this book, I have had many letters from people thanking me for explaining their confusing refrigerator, but never any communication from the manufacturer (General Electric). Perhaps the designers thought the correct model was too complex, that the model they were giving was easier to understand. But with the wrong conceptual model, it was impossible to set the controls. And even though I am convinced I knew the correct model, I still couldn't accurately adjust the temperatures because the refrigerator design made it impossible to discover which control was for the temperature sensor, which for the relative proportion of cold air, and in which compartment the sensor was located. The lack of immediate feedback for the actions did not help: it took twenty-four hours to see whether the new setting was appropriate. I shouldn't have to keep a laboratory notebook and do controlled experiments just to set the temperature of my refrigerator.

I am happy to say that I no longer own that refrigerator. Instead I have one that has two separate controls, one in the fresh food compartment, one in the freezer compartment. Each control is nicely calibrated in degrees and labeled with the name of the compartment it controls. The two compartments are independent: setting the temperature in one has no effect on the temperature in the other. This solution, although ideal, does cost more. But far less expensive solutions are possible. With today's inexpensive sensors and motors, it should be possible to have a single cooling unit with a motor-controlled valve controlling the relative proportion of cold air diverted to each compartment. A simple, inexpensive computer chip could regulate the cooling unit and valve position so that the temperatures in the two compartments match their targets. A bit more work for the engineering design team? Yes, but the results would be worth it. Alas, General Electric is still selling refrigerators with the very same controls and mechanisms that cause so much confusion. The photograph in Figure 1.9 is from a contemporary refrigerator, photographed in a store while preparing this book.
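A deliberately crude sketch of the mechanism behind Figure 1.10B may help; the linear split-the-cooling model and the numbers are our own simplifying assumptions, not the manufacturer's actual control law. The point it illustrates is that both compartment temperatures depend on both controls, which is exactly why adjusting one knob disturbs the other compartment.

```python
def compartment_temps(thermostat: float, freezer_share: float) -> tuple[float, float]:
    """Toy model of the single-thermostat, single-cooling-unit refrigerator.

    thermostat:    total cooling produced, in arbitrary "degrees of cooling"
    freezer_share: fraction of the cold air the valve diverts to the freezer (0.0-1.0)
    Returns (freezer_temp, fresh_food_temp) in degrees Fahrenheit.
    """
    ambient = 70.0
    freezer = ambient - thermostat * freezer_share
    fresh_food = ambient - thermostat * (1.0 - freezer_share)
    return round(freezer, 1), round(fresh_food, 1)

# "Freezer too cold, fresh food just right": sending less cold air to the freezer
# warms it, but it also overcools the fresh food section, so the thermostat must
# be readjusted too -- and each change takes a day to settle.
print(compartment_temps(80.0, 0.75))  # (10.0, 50.0)
print(compartment_temps(80.0, 0.65))  # (18.0, 42.0)  freezer warmer, fresh food colder
```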

The System Image

People create mental models of themselves, others, the environment, and the things with which they interact. These are conceptual models formed through experience, training, and instruction. These models serve as guides to help achieve our goals and in understanding the world.

How do we form an appropriate conceptual model for the devices we interact with? We cannot talk to the designer, so we rely upon whatever information is available to us: what the device looks like, what we know from using similar things in the past, what was told to us in the sales literature, by salespeople and advertisements, by articles we may have read, by the product website and instruction manuals. I call the combined information available to us the system image. When the system image is incoherent or inappropriate, as in the case of the refrigerator, then the user cannot easily use the device. If it is incomplete or contradictory, there will be trouble.

As illustrated in Figure 1.11, the designer of the product and the person using the product form somewhat disconnected vertices of a triangle. The designer's conceptual model is the designer's conception of the product, occupying one vertex of the triangle. The product itself is no longer with the designer, so it is isolated as a second vertex, perhaps sitting on the user's kitchen counter. The system image is what can be perceived from the physical structure that has been built (including documentation, instructions, signifiers, and any information available from websites and help lines). The user's conceptual model comes from the system image, through interaction with the product, reading, searching for online information, and from whatever manuals are provided. The designer expects the user's model to be identical to the design model, but because designers cannot communicate directly with users, the entire burden of communication is on the system image.

FIGURE 1.11. The Designer's Model, the User's Model, and the System Image. The designer's conceptual model is the designer's conception of the look, feel, and operation of a product. The system image is what can be derived from the physical structure that has been built (including documentation). The user's mental model is developed through interaction with the product and the system image. Designers expect the user's model to be identical to their own, but because they cannot communicate directly with the user, the burden of communication is with the system image.

Figure 1.11 indicates why communication is such an important aspect of good design. No matter how brilliant the product, if people cannot use it, it will receive poor reviews. It is up to the designer to provide the appropriate information to make the product understandable and usable. Most important is the provision of a good conceptual model that guides the user when things go wrong. With a good conceptual model, people can figure out what has happened and correct the things that went wrong. Without a good model, they struggle, often making matters worse. Good conceptual models are the key to understandable, enjoyable products: good communication is the key to good conceptual models.

The Paradox of Technology

Technology offers the potential to make life easier and more enjoyable; each new technology provides increased benefits. At the same time, added complexities increase our difficulty and frustration with technology. The design problem posed by technological advances is enormous.

Consider the wristwatch. A few decades ago, watches were simple. All you had to do was set the time and keep the watch wound. The standard control was the stem: a knob at the side of the watch. Turning the knob would wind the spring that provided power to the watch movement. Pulling out the knob and turning it rotated the hands. The operations were easy to learn and easy to do. There was a reasonable relationship between the turning of the knob and the resulting turning of the hands. The design even took into account human error. In its normal position, turning the stem wound the mainspring of the clock. The stem had to be pulled before it would engage the gears for setting the time. Accidental turns of the stem did no harm.

Watches in olden times were expensive instruments, manufactured by hand. They were sold in jewelry stores. Over time, with the introduction of digital technology, the cost of watches decreased rapidly, while their accuracy and reliability increased. Watches became tools, available in a wide variety of styles and shapes and with an ever-increasing number of functions. Watches were sold everywhere, from local shops to sporting goods stores to electronic stores. Moreover, accurate clocks were incorporated in many appliances, from phones to musical keyboards: many people no longer felt the need to wear a watch. Watches became inexpensive enough that the average person could own multiple watches. They became fashion accessories, where one changed the watch with each change in activity and each change of clothes.

In the modern digital watch, instead of winding the spring, we change the battery, or in the case of a solar-powered watch, ensure that it gets its weekly dose of light. The technology has allowed more functions: the watch can give the day of the week, the month, and the year; it can act as a stopwatch (which itself has several functions), a countdown timer, and an alarm clock (or two); it has the ability to show the time for different time zones; it can act as a counter and even as a calculator. My watch, shown in Figure 1.8, has many functions. It even has a radio receiver to allow it to set its time with official time stations around the world. Even so, it is far less complex than many that are available. Some watches have built-in compasses and barometers, accelerometers, and temperature gauges. Some have GPS and Internet receivers so they can display the weather and news, e-mail messages, and the latest from social networks. Some have built-in cameras. Some work with buttons, knobs, motion, or speech. Some detect gestures. The watch is no longer just an instrument for telling time: it has become a platform for enhancing multiple activities and lifestyles.

The added functions cause problems: How can all these functions fit into a small, wearable size? There are no easy answers. Many people have solved the problem by not using a watch. They use their phone instead. A cell phone performs all the functions much better than the tiny watch, while also displaying the time. Now imagine a future where instead of the phone replacing the watch, the two will merge, perhaps worn on the wrist, perhaps on the head like glasses, complete with display screen. The phone, watch, and components of a computer will all form one unit. We will have flexible displays that show only a tiny amount of information in their normal state, but that can unroll to considerable size. Projectors will be so small and light that they can be built into watches or phones (or perhaps rings and other jewelry), projecting their images onto any convenient surface. Or perhaps our devices won’t have displays, but will quietly whisper the results into our ears, or simply use whatever display happens to be available: the display in the seatback of cars or airplanes, hotel room televisions, whatever is nearby. The devices will be able to do many useful things, but I fear they will also frustrate: so many things to control, so little space for controls or signifiers. The obvious solution is to use exotic gestures or spoken commands, but how will we learn, and then remember, them? As I discuss later, the best solution is for there to be agreed upon standards, so we need learn the controls only once. But as I also discuss, agreeing upon these is a complex process, with many competing forces hindering rapid resolution. We will see. The same technology that simplifies life by providing more functions in each device also complicates life by making the device harder to learn, harder to use. This is the paradox of technology and the challenge for the designer.

The Design Challenge

Design requires the cooperative efforts of multiple disciplines. The number of different disciplines required to produce a successful product is staggering. Great design requires great designers, but that isn't enough: it also requires great management, because the hardest part of producing a product is coordinating all the many, separate disciplines, each with different goals and priorities. Each discipline has a different perspective of the relative importance of the many factors that make up a product. One discipline argues that it must be usable and understandable, another that it must be attractive, yet another that it has to be affordable. Moreover, the device has to be reliable, be able to be manufactured and serviced. It must be distinguishable from competing products and superior in critical dimensions such as price, reliability, appearance, and the functions it provides. Finally, people have to actually purchase it. It doesn't matter how good a product is if, in the end, nobody uses it.

Quite often each discipline believes its distinct contribution to be most important: "Price," argues the marketing representative, "price plus these features." "Reliable," insist the engineers. "We have to be able to manufacture it in our existing plants," say the manufacturing representatives. "We keep getting service calls," say the support people; "we need to solve those problems in the design." "You can't put all that together and still have a reasonable product," says the design team. Who is right? Everyone is right. The successful product has to satisfy all these requirements. The hard part is to convince people to understand the viewpoints of the others, to abandon their disciplinary viewpoint and to think of the design from the viewpoints of the person who buys the product and those who use it, often different people. The viewpoint of the business is also important, because it does not matter how wonderful the product is if not enough people buy it. If a product does not sell, the company must often stop producing it, even if it is a great product. Few companies can sustain the huge cost of keeping an unprofitable product alive long enough for its sales to reach profitability—with new products, this period is usually measured in years, and sometimes, as with the adoption of high-definition television, decades.

Designing well is not easy. The manufacturer wants something that can be produced economically. The store wants something that will be attractive to its customers. The purchaser has several demands. In the store, the purchaser focuses on price and appearance, and perhaps on prestige value. At home, the same person will pay more attention to functionality and usability. The repair service cares about maintainability: how easy is the device to take apart, diagnose, and service? The needs of those concerned are different and often conflict. Nonetheless, if the design team has representatives from all the constituencies present at the same time, it is often possible to reach satisfactory solutions for all the needs. It is when the disciplines operate independently of one another that major clashes and deficiencies occur. The challenge is to use the principles of human-centered design to produce positive results, products that enhance lives and add to our pleasure and enjoyment. The goal is to produce a great product, one that is successful, and that customers love. It can be done.


Chapter 5

The CHI of Teaching Online: Blurring the Lines Between User Interfaces and Learner Interfaces

David Joyner

D. Joyner, College of Computing, Georgia Institute of Technology, Atlanta, GA

© Springer Nature Switzerland AG 2018. In E. Kapros and M. Koutsombogera (eds.), Designing for the User Experience in Learning Systems, Human–Computer Interaction Series. https://doi.org/10.1007/978-3-319-94794-5_5

Abstract The growing prevalence of online education has led to an increase in user interface design for educational contexts, and especially an increase in user interfaces that serve a central role in the learning process. While much of this is straightforward user interface design, there are places where the line between interface design and learning design blur in significant ways. In this analysis, we perform a case study on a graduate-level human-computer interaction class delivered as part of an accredited online program. To evaluate the class, we borrow design principles from the HCI literature and examine how the class’s design implements usability principles like equity, flexibility, and consistency. Through this, we illustrate the unique intersection of interface design and learning design, with an emphasis on decisions that are not clearly in one design area or the other. Finally, we provide a brief evaluation of the class to endorse the class’s value for such an analysis.

5.1 Introduction

The rising role of technology in education has led to a blurring of the lines between user interface design and learning design. The requirements of teachers, students, administrators, and parents dictate elements of the design of user interfaces used in educational contexts, but the design of those interfaces in turn fundamentally alters the learning process. At times, specific design decisions or elements of instruction cannot solely be attributed to learning design or user interface design. This trend has existed for decades, from classic interfaces for correspondence learning to more modern learning management systems, but it has taken on a new significance with the advent of entirely online learning environments. While in some ways these learning environments are a natural evolution of these prior interfaces, the fundamental change that has occurred is the placement of the user interface as the core of the class experience. Rather than complementing traditional classroom experiences with learning management systems or in-classroom technologies, these online learning environments are the classroom. As a result, for perhaps the first time, the classroom itself is a user interface.

This can be taken very literally, as with synchronous virtual classroom environments (Koppelman and Vranken 2008; Martin et al. 2012; McBrien et al. 2009), or it can be taken more figuratively, where user interfaces can serve the same functional roles as traditional classrooms while eschewing the typical requirements of synchronicity and telepresence (Hiltz and Wellman 1997; Joyner et al. 2016; Swan et al. 2000). These latter classrooms are particularly notable because the interface changes the interaction more fundamentally; whereas synchronous virtual classrooms may aim to recreate in-person interactions as completely as possible, asynchronous learning environments must use these computational interfaces to create the same effects through different mechanisms. Significant work has been devoted to investigating how these interfaces may replicate components of traditional learning environments, such as peer-to-peer learning (Boud et al. 2014), peer assessment (Kulkarni et al. 2015), social presence (Tu and McIsaac 2002), laboratory activities (O'Malley et al. 2015), and academic integrity (Li et al. 2015; Northcutt et al. 2016).

This trend toward interfaces as classrooms brings new emphasis to the intersection between learning design and user interface design. The two are highly compatible: principles like rapid feedback are comparably valued in user interface design (Nielsen 1995) and learning design (Chandler 2003; Kulkarni et al. 2015). However, it is also important to understand the nature of desirable difficulties (Bjork 2013; McDaniel and Butler 2011) within the material, as an interface designer may inadvertently undermine the learning experience in pursuit of higher user satisfaction (Fishwick 2004; Norman 2013). For this reason, we must carefully prescribe principles and guidelines for designing learning interfaces that emphasize when the roles of student and user are compatible.

Thus, due to both the advent of fully online learning environments and the underlying similarities between user interface design and learning design, there is tremendous opportunity to examine the user experience in learning systems from the perspectives of both interface design and learning design. However, the different objectives of the two design paradigms—one to support immediate interaction, the other to support long-term learning gains—mean that the application of one paradigm's heuristics and guidelines to the other must be performed carefully. Toward this end, some work has already been performed evaluating user interface design specifically within the realm of digital learning environments (Cho et al. 2009; Jones and Farquhar 1997; Najjar 1998), but relatively little work has been done on specifically the user interface design of fully online learning environments.

In this analysis we perform a case study on a graduate-level class offered as part of an online Master of Science in Computer Science program at a major public university. Both the program and the class are delivered asynchronously and online, with no requirement for synchronous activities or in-person attendance.
While considerable attention could be paid to evaluating the specific user interfaces that deliver the program, this case study instead focuses on higher-level design decisions. Specifically, we are interested in transferring principles of human-computer interaction into the realm of learning design, especially insofar as their application is facilitated by the online nature of the program. To do this, we first provide some necessary background on the nature and structure of the program and this class, and then move through four prominent principles from the human-computer interaction literature: flexibility, equity, consistency, and distributed cognition. For each topic, we examine how it transfers into this online learning environment as a principle of both interface design and learning design. We also look at a smaller number of additional principles with narrower applications in this course, and then evaluate the course based on student surveys.

5.2 Background

While this case study focuses specifically on a single class, that class exists in the context of a broader online Master of Science program at a major public university in the United States. Several of the principles we observe in this class are actually derived from the broader principles of the program, especially as they relate to equity. Thus, we begin by giving a brief background on the program, and then focus more specifically on the course under evaluation in this case study.

5.2.1 Program Background

The course under evaluation in this case study is part of an online Master of Science in Computer Science program launched by a major public university in the United States in 2014. The program merges recent MOOC-based initiatives with more classical principles and approaches to distance learning. The goal is to create an online program whose learning outcomes and student experience are equivalent or comparable to the in-person experience; as such, the program carries equal accreditation to the traditional on-campus degree. In drawing inspiration from MOOC initiatives over the past several years, however, the program emphasizes low cost and high flexibility.

On the cost side, the cost of attendance is $170 per credit hour plus $200 in fees per semester of attendance. Thirty credit hours are required to graduate, and thus the total degree costs between $6,100 and $7,100, a small fraction of comparable programs or the university's own on-campus program. These costs are digestible because each class draws dramatically higher enrollment than its on-campus counterpart: as of Spring 2018, the program enrolls over 6,500 total students taking an average of 1.4 classes per semester, with individual courses enrolling as many as 600 students.

On the flexibility side, the program emphasizes that it requires no synchronous or collocated activities: students are never required to attend a virtual lecture at a specific time or visit campus, a testing center, or a remote lab for a course activity.
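As a back-of-the-envelope check of the quoted range (our arithmetic; the five-to-ten-semester pace is inferred from the range itself rather than stated by the program):

```python
# Tuition arithmetic for the quoted $6,100-$7,100 total (assumed pacing is ours).
credit_hours, per_credit_hour, per_semester_fee = 30, 170, 200
tuition = credit_hours * per_credit_hour   # 30 * $170 = $5,100
print(tuition + 5 * per_semester_fee)      # 6100 -- finishing in 5 semesters
print(tuition + 10 * per_semester_fee)     # 7100 -- finishing in 10 semesters
```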


Proctored and timed exams are typically open for three to four days at a time, while lecture material is pre-produced and assignments are published well in advance of the due date. The program thus captures an audience for whom a Master of Science in Computer Science is otherwise inaccessible, either due to high costs, geographic immobility, or scheduling constraints. Evaluations have shown that as a result, the program draws a dramatically different demographic of student from the university’s on-campus program: online students tend to be older, are more likely to be employed, have more significant prior education and professional experience, and are more likely to be from the United States (Goel and Joyner 2016; Joyner 2017). The program is forecast to increase the annual output of MSCS graduates in the United States by 8% (Goodman et al. 2016).

5.2.2 Course Background

This case study focuses on one specific course in this broader program. Fitting this analysis's contribution, the course is on human-computer interaction, and covers HCI principles, the design life cycle, and modern applications such as virtual reality and wearable computing. At the time of writing, the course has been offered four complete times, including three 17-week full semesters and one 12-week summer semester. Each week, students watch a series of custom-produced lecture videos, complete a written assignment, and participate in peer review and forum discussions. Participation is mandated by the course's grading policy, but students have multiple pathways to earning participation credit to fit their personalities and routines. Students also complete two projects—one individual, one group—and take two timed, proctored, open-book, open-note multiple choice exams. Proctoring is supplied by a digital proctoring solution, allowing students to take the exam on their own computer.

Aside from the exams, all work is manually graded by human teaching assistants. One teaching assistant is hired for approximately every 40 enrollees in the course, and teaching assistants are solely responsible for grading assignments: course administration, announcements, Q&A, office hours, etc. are all handled by the course instructor. The course generally enrolls 200–250 students per semester, supported by 5–6 teaching assistants. Its completion rate is 92%, slightly higher than the program's overall average of approximately 85%. To date, 708 students have completed the course across four semesters, with 205 more on track to complete the course this semester.

To explore the crossover between interface design principles and learning design, we take four common design principles or theories from the HCI literature—flexibility, equity, consistency, and distributed cognition—and examine their applications to the design of this online course. In some ways, these principles are applied by analogy: flexibility, for example, traditionally refers to flexible interactions with a specific interface, but in our case, refers to flexible interactions with course material. In others, the application is more literal: equity, for example, refers in part to accommodating individuals with disabilities, which is more directly supported by the course and program structure.

5.3 Flexibility

For flexibility, we apply the Principle of Flexibility from Story, Mueller and Mace's Principles of Universal Design, which they define as, "The design accommodates a wide range of individual preferences and abilities" (Story et al. 1998). We also inject the heuristic of Flexibility and Efficiency of Use from Jakob Nielsen's ten heuristics, where he writes, "Allow users to tailor frequent actions" (Nielsen 1995). The flexibility of the course generally flows from the inherent properties of the online program, although the course design takes care to preserve and accentuate this flexibility. Most importantly, these applications of the principle of flexibility support the subsequent applications of the principle of equity.

5.3.1 Geographic Flexibility

Geographic flexibility refers to the online program's ability to accept students regardless of their geographic location. At a trivial level, this relates to the program's ability to accept students who do not live within range of campus. As it pertains to flexibility as a usability guideline, however, this flexibility relates more to accommodating individual preferences for where they complete their work. This relates in part to individual circumstantial constraints, such as the need for working professionals to be able to take course material with them during work trips. It has more significant implications, however, especially as flexibility ties into equity: for example, individuals with disabilities that deter them from leaving the house may participate in a program that offers true geographic flexibility. In a computer science program, several of the abilities required for in-person attendance (e.g. walking, driving to campus, relocating to campus) are largely unrelated to the material itself, and thus this geographic flexibility resolves individual characteristics that pose threats to a student's participation in the field that are unrelated to the content. It is worth noting that geographic flexibility is inherent in distance learning as a whole; this class's instantiation of geographic flexibility is not unique except insofar as an identically-accredited distance learning program at a major public institution is still somewhat novel.


Table 5.1 Enrollment and number of instructor and student forum contributions by semester

Statistic                  Fall 2016   Spring 2017   Summer 2017   Fall 2017
Enrollment                        83           231           183         211
Student contributions          3,477         9,147         7,970       9,381
Instructor contributions         785         1,768         1,143       1,265

5.3.2 Temporal Flexibility

Temporal flexibility refers to flexibility of the student's time, allowing them to work on the class not only wherever they want, but whenever they want. Temporal flexibility offers a greater difference between this program and traditional distance learning, as the presence of live interaction has typically differentiated distance learning from correspondence learning. Given the program's goals of equality with the on-campus program, however, simplifying delivery to correspondence education would be insufficient; requiring live interaction, however, would challenge temporal flexibility.

The class balances these competing needs by maximizing the usage of asynchronous communication tools in course delivery. In most semesters, the course forum garners around ten thousand posts, with approximately 80% coming from students and 20% coming from the instructor. Table 5.1 shows the class's enrollment and contribution statistics by semester. In addition to forum participation, the class also leverages asynchronous tools for peer review and instructor feedback, as well as an asynchronous video-based method for disseminating pre-recorded custom-produced lecture videos.

This temporal flexibility refers strictly to those activities that are typically synchronous in traditional course delivery. Other activities, such as completing homework, are usually somewhat asynchronous. As a result, the design of this course accommodates individual students with a wide range of preferences or constraints on when they work on course material. We will discuss the impacts of this more in the section below on equity.

5.3.3 Preference Flexibility

The geographic and temporal flexibility described above give way to an abundance of flexible accommodations for individual students' preferences and abilities. For example, as a product of being able to watch and re-watch lectures at any pace and in any setting, students may choose to watch lectures while actively working on the assignment they target; to attempt an assignment prior to watching the lecture videos in order to pre-load questions to consider while watching; or to only watch the videos as needed, knowing that lecture material cannot be permanently missed the way a single in-person class may be missed.


For this course, flexibility is extended through the course’s participation policy as well. It is common for online courses to attempt to capture in-person participation by requiring forum participation, but most research instead focuses on incentivizing or encouraging it more authentically [e.g. Kizilcec et al. (2014)]. There are multiple reasons to focus on more organic discussion stimulation, not least among them that requiring such participation does not address recognized gender issues in forum communication (Freeman and Bamford 2004). To accommodate a greater range of student preferences, this course instead offers multiple routes to earning participation credit: students may contribute to the forums, complete peer reviews of their classmates’ work, give instructors feedback on the course, or participate in their classmates’ need finding or evaluation studies as part of their coursework. These different activities fit with different student preferences and behaviors; for instance, it is easier to set aside a block of time for peer reviews, whereas it is easier to participate in a course forum in several short moments of time.

5.4 Equity

In defining equity as a design principle, we borrow in particular the Principle of Equitable Use from Story, Mueller, and Mace, which they define as "The design is useful and marketable to people with diverse abilities" (Story et al. 1998). In particular, we note the sub-guidelines, "Provide the same means of use for all users: identical whenever possible, equivalent when not" and "Avoid segregating or stigmatizing any users" (Story et al. 1998). Our application of equity begins with the natural consequences of the flexibility described above; flexibility focuses on what students within the program can do, but equity focuses on which students can participate due to that flexibility. We then examine equity as facilitated by the program's admissions structure and pseudo-anonymity in course delivery.

5.4.1 Equity Through Flexibility

In many ways, the greatest advantage of the geographic and temporal flexibility referenced above is not in the experience of students in the program, but rather in which students may enter the program in the first place. A traditional graduate program draws from a very narrow population: individuals (a) who either live near the university or have the financial or lifestyle flexibility to relocate, and (b) have the scheduling flexibility to attend classes during the day or pre-selected evenings. Financial flexibility plays into this as well: a traditional graduate program is only available to those who have or can secure (through loans or employer reimbursement) the funds to pay high tuition rates.


Because this program is available to students regardless of location or specific scheduling availability, it is equally available to students who otherwise would lack the ability to participate in such a program. The requirements are distilled down to only those that are inherently required for the content: a significant time commitment (albeit flexible to the student’s own schedule) and sufficient prior background. The cost supports this equity as well: while still expensive, the program does not demand access to an exorbitant amount of funds. As noted previously, these factors directly correspond to the unique demographics the program draws (Goel and Joyner 2016; Joyner 2017). It is worth noting that this audience is not one for which we might stress equity: students entering the program must have a bachelor’s in computer science or a similar field with a strong GPA (or equivalent work experience); these criteria generally mean the students are advantaged in the first place. Thus, one takeaway of this program’s application of the principle of equity comes instead in how similar models may be extended to otherwise-disadvantaged populations. However, another application comes in expanding the view of the program’s audience from geographically dispersed mid-career working professionals and considering also individuals with chronic illnesses, caretakers for others with illnesses, expecting parents, and others for whom obstacles to participation exist.

5.4.2 Equity Through Admissions

One component discussed above is the program's size: at 6,500 students, it is believed to be the largest program of its kind in the world (Goodman et al. 2016; Joyner 2018). While this is often discussed as part of counterbalancing the low tuition rate, it has a profound effect on equity as well. While the program's on-campus analogue sets a minimum bar for acceptance, it draws far more qualified applicants than it has capacity to handle. As a result, the top few percent are admitted, leaving out many students who meet the minimum requirements but are not competitive with the most-decorated applicants.

As the online program lacks a set capacity, however, any student who meets the minimum requirements is admitted. This expands access to students who otherwise would be uncompetitive, typically due to a more meager prior background. These students meet the minimum requirements and stand a strong chance of succeeding, but they would not be in the top percentile of applicants typically accepted to a limited-capacity program. Thus, the limitless capacity supports the principle of equity by accepting students with the potential to succeed who may not otherwise have the opportunity.


5.4.3 Equity Through Anonymity

A classic internet aphorism states, "On the internet, no one knows you're a dog." In some ways, the principle applies to this program: although students are identified by name and work is tied to their real identity (unlike MOOCs, where a username may supplant a true name), students have considerable control over what portions of their identity they reveal to classmates and instructors. To classmates, students have the option to reveal essentially no personal information: they may select the name that is shown in discussion posts and peer review, which typically are the only communications inherently surfaced to classmates. Even to instructors, students reveal little about their personal selves.

While a systematic study of this dynamic is still in the works, we have anecdotally observed several applications. At a broad level, it is known that there are issues with perceived identity mismatches between gender or race and computer science (Whitley 1997), and that merely being reminded of such stereotypes can lessen performance and engagement (Good et al. 2003). Signifiers of these stereotypes are inherently present in traditional classrooms, but online they lack any inherent need to be disclosed. It is worth considering whether hiding these signifiers is a missed opportunity in the long run, but it nonetheless presents a path around stereotype threats worth considering.

Other applications of this anonymity are even more delicate, demanding caution in conducting more rigorous studies, but they nonetheless reveal enormous potential for equity through the relative anonymity of the online delivery mechanism. Students have on multiple occasions confided in trusted instructors or teaching assistants the presence of mitigating issues that alter their in-person interactions, including physical disabilities or deformities, obesity, speech impediments, transgenderism, and behavioral disorders. The online environment removes these as a first impression among classmates and with instructors, creating an equity of experience among those populations least likely to find it in person.

5.5 Consistency

As a design principle, consistency appears across multiple sets of guidelines and heuristics. We apply the definitions from three different such sets. First, Norman states (Norman 2013),

Consistency in design is virtuous. It means that lessons learned with one system transfer readily to others … If a new way of doing things is only slightly better than the old, it is better to be consistent.

Nielsen prescribes a similar heuristic, stating, "Users should not have to wonder whether different words, situations, or actions mean the same thing. Follow platform conventions" (Nielsen 1995). Constantine and Lockwood echo these sentiments as well with their Reuse Principle, stating (Constantine and Lockwood 1999),

The design should reuse internal and external components and behaviors, maintaining consistency with purpose rather than merely arbitrary consistency, thus reducing the need for users to rethink and remember.

With regard to this case study, we consider especially consistency within the class: just as consistency is used to set expectations among users of the outcomes of different interactions, so also consistency is used to set expectations among students of certain responsibilities or deliverables. Efforts are underway as well to extend consistency across courses, especially as they relate to administrative elements of course delivery.

5.5.1 Assignment Cadence

Early on, we observed that in delivering an asynchronous class, a forcing function for students' regular engagement was lost. On campus, that engagement came from lectures: even if assessments were only due every month, students were still incentivized to remain engaged by the fleeting lectures which could not be recovered once lost. In this online design, all lecture material is persistently available: what, then, is there to motivate students to remain engaged long before assessments are due?

Our approach to this is to manually recreate that cadence of a weekly engagement through weekly deliverables. The class requires student submissions every week of the semester, each of which directly corresponds to the recommended lecture material for the week. Flexibility (and its effect on equity) are preserved in that lectures and assignment descriptions are all provided at the start of the semester, so students who need to work around other constraints may do so by working ahead; regular deadlines, however, force the majority of students to remain engaged with the course on a weekly basis. Just as through the principle of consistency in interface design a user can interact with a new interface and understand the outcomes of their actions, so also a student can enter a new week of the course and understand the expectations without re-reading the calendar.

5.5.2 Announcement Cadence

Just as in-person lectures serve as a forcing function for continued student engagement, we also observed that they serve as a hub for course communication. A natural expectation arises (even if only in the minds of instructors) that weekly lectures will set expectations for the week or recap the week. The loss of this dynamic risks a class becoming a single amorphous semester rather than a regimented curriculum, especially with students' tendencies to do work at non-traditional times [e.g. weekends (Joyner 2017)].


To combat this, the course leverages consistent weekly announcements, sent to students every Monday morning and Friday evening. Monday announcements remind students what they are expected to watch, read, and do for the week, while Friday announcements typically recap significant occurrences or reemphasize key points from the week’s material. These announcements aim to further emphasize that classroom cadence, replicating the effect of a teacher walking in on Monday morning and beginning lecture. As an application of consistency, this replicates common interaction designs such as weekly reports or digests of activity, acting as consistent reminders that the course is ongoing. The announcement cadence plays a more significant role as well with regard to the course’s emphasis on distributed cognition, explained further in the next section. Either way, these weekly announcements are the single most well-praised element of the course’s delivery, and have been incorporated into recommendations issued to all classes in the program.

5.5.3 Administrative Decisions As a more literal application of the principle of consistency, the course makes several administrative decisions to create consistent expectations among students regarding more trivial elements of the course experience. The course's smallest unit of time is one week: there are no in-week deadlines (excepting a small incentive for early peer review discussed later). Sunday night at 11:59 PM UTC-12 (anywhere on earth) time marks the end of each week; all of the week's work is due at this time, and one minute later marks the start of the next week. Anywhere on Earth time is chosen to simplify planning for students: if it is before midnight their local time, the assignment is not due. We encourage students to submit by their own midnight for simplicity, although our experience is that students maximize the time available, and submissions roll in late in the evening on Sunday nights. Few course components are time-gated (exams, periodic course surveys), but those that are open at 12:00 AM UTC-12 on Mondays and close at the typical deadline as well. Thus, students do not devote cognitive resources each day to considering what is required; only on Sundays are students required to ensure they have accomplished the week's deliverables. As a principle of consistency, this process similarly aims to diminish students' reliance on repeated manual checks and increase the time allotted to focus on the course material and assessments. Interestingly, we have attempted to leverage the principle of consistency in other ways, such as scheduling the aforementioned announcements to go out at the exact start of the week. Feedback we have received from students, however, indicates this is actually somewhat counterproductive as it diminishes the personal feel of these announcements: students feel more connected to the class knowing the instructor was physically present to send the announcement, even if it is delayed. This suggests this principle is best applied to items around which students plan, such as deadlines and release dates, rather than every element of the course delivery. It may also be
the case that students are patient with late announcements because expectations of consistency and fairness are set in these other ways.
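To make the time-zone reasoning concrete, here is a minimal sketch, assuming Python 3.9+ and its standard zoneinfo module; the specific date and zone names are purely illustrative and not drawn from the course.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Anywhere-on-Earth (AoE) is fixed at UTC-12: if it is still Sunday anywhere
# on the planet, the submission is on time.
AOE = timezone(timedelta(hours=-12), "AoE")

# Hypothetical week-ending deadline: Sunday 11:59 PM AoE.
deadline_aoe = datetime(2024, 9, 15, 23, 59, tzinfo=AOE)

def local_deadline(tz_name: str) -> datetime:
    """Show the AoE deadline as a student's local wall-clock time."""
    return deadline_aoe.astimezone(ZoneInfo(tz_name))

# A student in New York sees the deadline on Monday morning local time,
# which is why "submit by your own midnight" is always safe.
print(local_deadline("America/New_York"))  # 2024-09-16 07:59:00-04:00
print(local_deadline("Asia/Tokyo"))        # 2024-09-16 20:59:00+09:00
```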

5.6 Distributed Cognition Where the previous four design principles were stated with some clarity in a well-known prescriptive set of guidelines, distributed cognition is a more general theory through which we may examine human-computer interfaces (Hollan et al. 2000). Key to this idea is the notion that human cognitive tasks like reasoning, remembering, and acting could be offloaded onto a computer interface to lighten the cognitive load on the user. As applied to education, this suggests using an interface to lessen the attention paid by students to course administration to support greater attention to course content.

5.6.1 Offloading Through Announcements As referenced above, in addition to creating consistent expectations, a major function of regular announcements is to offload the attention students may otherwise spend thinking about course procedures, assignment deadlines, and so on onto the interface, allowing them instead to focus on the course material. This role of these announcements comes from an early observation from students: whereas traditional in-person courses operate on a “push” structure, online courses emphasize a “pull” structure. These terms, derived from principles of HCI as well, mean that students in a traditional class can usually rely on the instructor to push information to them, such as by standing in front of a lecture hall and giving announcements. Online classes usually operate by making all information available to the students, but that relies on students pulling the right information at the right time. Weekly announcements approximate that in-person dynamic by pushing immediately pertinent information to students. Students thus do not need to trust that they have pulled all critical information at the right time; absent this trust, students devote significant cognitive resources to attending to the class’s administration, which diminishes the resources that may be devoted to learning the actual course material. As noted above, this is a small feature, but it is one of the most well-praised features in the program; student reviews on a public student-run review site praise this repeatedly, and other pieces of negative feedback could be similarly addressed by offloading these roles onto the interface.


5.6.2 Offloading Through Documentation A second application of distributed cognition to the course design leverages the student community more heavily. As referenced previously, the online environment makes heavy use of the course forum, but it takes on a unique role in the online course: it is the classroom, but it is a classroom where any student can contribute at any time (Joyner et al. 2016). Student answers to classmates’ questions are not often emphasized in traditional lectures where students inherently pose questions to the professor, but the online board affords student-to-student discussion more fully. This provides an answer to another implicit question in course design: what information should be incorporated into the course’s fundamental documentation, and what should be pushed to students through announcements and discussions? This course errs heavily on the side of the documentation specifically because it leverages this student community: the community as a whole can come to a common understanding of the course’s administration and policies because the entire documentation is available to everyone. Any single student likely will not read all the documentation, but enough students will read each part that if a student has a question that is covered in the documentation, some classmate will have the answer. Thus, knowledge of the course is distributed among the student body rather than solely relying on the communication of the instructor.

5.6.3 Offloading Through Assessment Design Finally, the course deliberately designs assessments to encourage students to leverage distributed cognition. While this is natural in essays and projects where course access is persistent during work, the course tests are also designed to be open to any noncollaborative information seeking. These open-book, open-note, open-video, open-forum tests are created with the knowledge that students will have access to course resources, and thus should focus less on the knowledge they are able to draw to mind immediately and more on their ability to solve questions quickly with the available resources. Students are informed of this paradigm in advance of the exams, and encouraged to organize their test-taking environment accordingly. Ready access to course material, their notes, the readings, and even the course's discussions are encouraged. These tests emphasize that it is the system comprised of the student, their resources, and their environment that is being assessed on the test rather than just their cognition. Distributed cognition is thus simultaneously a lesson in the course, a principle for students to apply to the course, and a theory for us to apply in evaluating the course.


5.7 Additional Principles Additional principles are at play in the course as well, although we generally note that many of these principles apply equally well to traditional courses using modern-day learning management systems. Nonetheless, they are worth including as they further broaden the view of how interface design principles may be applied to learning design.

5.7.1 Structure With regards to structure as a principle of design, we leverage the principle defined by Constantine and Lockwood (1999). In many ways, our applications of structure are not inherently restricted to online environments; however, we observe that specific details of the online environment more clearly afford visible structure. We observe, for example, that organizing lecture material into pre-produced videos allows the presentation of it in a way that brings out the underlying structure of the content rather than forcing it into a prescribed lecture schedule. This, then, allows students to construct their consumption of course material around the actual structure of the content. This similarly connects to the structure of a course calendar offered to students: without requirements that a pre-set amount of time be spent in certain weeks in lecture, the structure of the course can be more deliberately designed not only for the content, but also for the assessments. Other classes in the program, for example, implicitly require students to “attend” ten hours of lecture in the early weeks of the class, then shift to a strict project-based mode in the later weeks. Such a structure would not be possible in a traditional system of prescribed lecture times.

5.7.2 Perceptibility On perceptibility, Nielsen writes, "The system should always keep users informed about what is going on, through appropriate feedback within reasonable time" (Nielsen 1995). An education application of this heuristic has emerged as a somewhat natural consequence of the advent of learning management systems: students retain persistent access to the gradebook for immediate perceptibility of their current status in the class. Although Nielsen focuses on this as a pushing relationship where relevant information is pushed to the user, this availability instead facilitates a pulling behavior allowing the student to pull information when pertinent to them. We have seen this principle emphasized more heavily in other courses, especially those reliant more on automated evaluations. An online CS1 course offered by the same university provides automated evaluators for every course problem, all of which
feed an immediately-available gradebook (Joyner 2018). This even more dramatically increases the perceptibility of what is going on with a student's grade, and while this is compatible with traditional classes, it takes on a new emphasis when the entire experience is in an online environment based on immediately-perceptible feedback.

5.7.3 Tolerance Regarding tolerance, the Principles of Universal Design state that a good design "minimizes hazards and the adverse consequences of accidental or unintended actions" (Story et al. 1998). In education, the level of tolerance for content-specific answers is often dictated by the field rather than by the learning design. However, interface design and learning design can merge to create a tolerance for mistakes more related to administration and policies instead of content errors. In this course's learning management system, it is possible to separate an assignment deadline (shown to the students) and an assignment close date (hidden from the students); this course uses these features to set a two-hour grace window after the official deadline where submissions are still accepted. This creates a tolerance for minor errors, such as incorrectly converting the UTC-12 time zone to one's local time zone or underestimating the time it will take to move through the submission screens to upload an assignment. This course also builds tolerance for late work into its process for rapidly evaluating assignments. After an assignment's close date, a gradebook is exported with individual students assigned to individual graders. In the event that a student submits work even later than the grace period allowed by the learning management system, the course staff may quickly attach the submission to the row; if the grader has not yet completed their tasks, then accepting the late submission costs the grading team no time compared to if it had been submitted on time. While other courses address this with a strict grading policy, the size of the class means that a non-trivial number of assignments will have earnest reasons for late submission, and so the course builds tolerance into the grading workflow design.
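As an illustration of the deadline/close-date split (the actual learning management system implements this internally; all names below are hypothetical), the grace-window logic amounts to a simple classification:

```python
from datetime import datetime, timedelta

GRACE = timedelta(hours=2)  # hidden buffer between the deadline and the close date

def classify_submission(submitted_at: datetime, deadline: datetime) -> str:
    """Classify a submission against the visible deadline and hidden close date.

    - on time: at or before the published deadline
    - grace:   after the deadline but before the hidden close date; accepted silently
    - late:    after the close date; routed to manual handling by course staff
    """
    close_date = deadline + GRACE
    if submitted_at <= deadline:
        return "on time"
    if submitted_at <= close_date:
        return "grace"
    return "late (attach manually during grading)"

deadline = datetime(2024, 9, 15, 23, 59)
print(classify_submission(datetime(2024, 9, 15, 23, 40), deadline))  # on time
print(classify_submission(datetime(2024, 9, 16, 1, 15), deadline))   # grace
```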

5.7.4 Feedback Regarding the common need for feedback, Norman writes (Norman 2013), "Feedback must be immediate. … Feedback must also be informative. … Poor feedback can be worse than no feedback at all, because it is distracting, uninformative, and in many cases irritating and anxiety-provoking."

Among all usability principles, the principle of feedback is likely the most easily transferable between interface and learning design. Feedback holds the same
meaning in both domains, providing actionable information on the outcome and correctness of an action. As it relates to online course design, we see in this course two interesting applications where the course facilitates more rapid feedback. First, the scale of the course dictates heavy organization; the grading workflow described above follows almost an assembly line approach, where assignments are automatically distributed to graders, rubrics are formalized, and results are processed in batch. Research on the program shows that a significant amount of attention in the learning design process goes into exactly these grading workflows (Joyner 2018), and the result is a more rapid return rate than seen on campus due to the benefits of scale. A second component comes from the course’s method for implementing peer review. Students complete peer reviews as part of their participation grade, but as rapid feedback is more desirable, students are explicitly incentivized to complete peer reviews early. This is the only place in the course where a mid-week semi-deadline exists: students receive 50% more credit (1.5 points) for a peer review submitted within three days of its assignment’s deadline, and 50% less credit (0.5 points) for a review submitted more than a week after the deadline. With each assignment reviewed by 3–4 classmates, this raises the likelihood that feedback will arrive rapidly; in the most recent semester, 58% of all peer reviews were submitted within 3 days, and 69% within one week.
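The peer-review incentive can be restated as a small scoring rule; the function below is a hypothetical paraphrase of the policy described above, not code used by the course.

```python
from datetime import datetime, timedelta

def peer_review_credit(assignment_deadline: datetime, review_submitted: datetime) -> float:
    """Credit for one peer review, per the incentive structure described above.

    1.5 points if submitted within three days of the assignment's deadline,
    0.5 points if submitted more than a week after it,
    1.0 point otherwise.
    """
    elapsed = review_submitted - assignment_deadline
    if elapsed <= timedelta(days=3):
        return 1.5
    if elapsed > timedelta(days=7):
        return 0.5
    return 1.0
```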

5.8 Course Evaluation Course evaluation has been the topic of considerable discussion in the learning sciences literature. Attempts have been made to create explicit evaluators of course or teaching quality (Biggs and Collis 2014; Ramsden 1991), but these often require standardized tests or high-effort qualitative analyses. In place of these, student reviews are often used as a low-cost approximation of course quality. While some early research found these types of surveys are decently correlated with learning outcomes (Cohen 1981), more recent research casts doubt on this correlation (Greenwald 1997; Uttl et al. 2017), suggesting student reviews are too biased, especially by gender differences, to be useful for comparisons (Andersen and Miller 1997; Centra and Gaubatz 2000). In this analysis, we nonetheless use student reviews to add to the overall picture of the class in this case study. We acknowledge the weaknesses of student reviews as comparative tools, but note that (a) we are not using these student reviews to compare against another class, but rather merely to attest that the class is generally well-received by students, and (b) while most research on the validity of student reviews has been performed at the K-12 or undergraduate level, these reviews are submitted by graduate students who are also mid-career professionals, and thus we hypothesize are more valid assessors of course quality. Anecdotally, several professors in the program agree with the observation that online students appear to have far higher standards than their traditional counterparts.


These student surveys come from two sources: first, the institute issues a Course Instructor Opinion Survey open to every student in every class. Student identities are strictly hidden in these surveys, and the results are known to inform institute-level evaluations of teaching. Second, the course itself issues an end-of-course survey asking questions more specific to its own unique details.

5.8.1 Institutional Surveys At time of writing, the course from this case study has been offered four times: Fall 2016, Spring 2017, Summer 2017, and Fall 2017. At the end of each of these semesters, students were offered the opportunity to complete the institute's Course Instructor Opinion Survey for the course. The questions on this survey are dictated by the institute, and although no explicit incentive exists for students to participate, students are nonetheless highly encouraged to do so by the school and instructor. All questions on this survey offer 5-point Likert-scale responses. Table 5.2 provides the interpolated medians for each of these prompts. Based on these results, we make two notable observations. First, the ratings of course effectiveness and quantity learned have not changed semester to semester. This is notable because the course has undergone significant revisions semester to semester, suggesting that either these revisions do not affect the student experience (or the effect is too small for detection), or that students are unable to evaluate the effect of these changes absent a target for comparison. In particular, Fall 2017 added a significant reading component to the course requiring an additional 1–2 h per week of reading. With this change, 61.7% of the Fall 2017 class estimated they put 9 or more hours per week into the course, which is statistically significantly different from the percent reporting 9 or more hours in Spring 2017 (43.6%, χ² = 9.193, p = 0.0024) or Fall 2016 (51.5%, χ² = 5.322, p = 0.0211).¹ Despite this, student assessments of the amount of material learned did not change. Secondly, these reviews suggest that the design decisions described herein are at least somewhat effective in supporting the student experience, as students specifically comment positively on criteria that typically are considered lacking in online courses. Most notably, whereas online instructors are often considered detached or uninvolved (De Gagne and Walters 2009), students in this class specifically reflected positively on the instructor's enthusiasm (4.96/5.00), respect (4.96/5.00), availability (4.90/5.00), and ability to stimulate interest (4.89/5.00). We hypothesize this is due in part to the singular ownership over course announcements, documentation, and scheduling attributed to the instructor, in line with existing research on the effectiveness of immediacy behaviors (Arbaugh 2001).

¹ Summer 2017 is excluded from this comparison as the semester is shorter and more work is deliberately expected per week.
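The hours-per-week comparison above is a chi-square test on the proportion of students reporting nine or more hours per week. The sketch below shows the shape of such a test using scipy; the counts are purely illustrative placeholders matching the quoted percentages (the chapter does not list the underlying response totals), so the output is not expected to reproduce the reported value of 9.193.

```python
# Illustrative only: the per-semester response counts are not reported in this
# chapter, so the figures below are hypothetical placeholders, not survey data.
from scipy.stats import chi2_contingency

fall_2017 = [148, 92]     # hypothetical: [9+ hours, under 9 hours] -> 61.7%
spring_2017 = [105, 136]  # hypothetical: [9+ hours, under 9 hours] -> 43.6%

chi2, p, dof, expected = chi2_contingency([fall_2017, spring_2017])
print(f"chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")
```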

Table 5.2 Interpolated medians of student responses to eight prompts on the institute-run end-of-course opinion surveys (columns: Fall 2016, Spring 2017, Summer 2017, Fall 2017)

Response rate (%): 83, 69, 70, 61
How much would you say you learned in this course? (a): 4.53, 4.45, 4.41, 4.45
Considering everything, this was an effective course (b): 4.82, 4.74, 4.85, 4.80
The instructor clearly communicated what it would take to succeed in this course (a): 4.89, 4.89, 4.93, 4.90
Instructor's respect and concern for students (c): 4.95, 4.96, 4.96, 4.94
Instructor's level of enthusiasm about teaching the course (d): 4.95, 4.97, 4.97, 4.95
Instructor's ability to stimulate my interest in the subject matter (e): 4.90, 4.86, 4.89, 4.89
Instructor's availability for consultation (f): 4.88, 4.89, 4.93, 4.87
Considering everything, the instructor was an effective teacher (a): 4.92, 4.95, 4.94, 4.93

(a) From 5—Exceptional amount to 1—Almost nothing. (b) From 5—Strongly agree to 1—Strongly disagree. (c) From 5—Exceptional to 1—Very poor. (d) From 5—Extremely enthusiastic to 1—Detached. (e) From 5—Made me eager to 1—Ruined interest. (f) From 5—Highly accessible to 1—Hard to find.
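Both tables report interpolated medians rather than plain medians of the Likert responses. As a reference for readers unfamiliar with the statistic, here is a minimal sketch of one common grouped-data formula; this is an assumption about the variant used, which the chapter does not specify.

```python
from collections import Counter

def interpolated_median(responses, width=1.0):
    """Interpolated (grouped) median for discrete ratings such as Likert items.

    Treats each rating r as a bin spanning [r - width/2, r + width/2) and
    interpolates within the bin that contains the middle of the distribution.
    """
    counts = Counter(responses)
    n = len(responses)
    cumulative = 0
    for rating in sorted(counts):
        f = counts[rating]
        if cumulative + f >= n / 2:
            lower = rating - width / 2
            return lower + ((n / 2 - cumulative) / f) * width
        cumulative += f
    raise ValueError("empty response list")

# Example: five-point ratings clustered at the top of the scale.
print(interpolated_median([5, 5, 5, 5, 4, 4, 3]))  # 4.625
```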

5.8.2 Course Surveys While the institute-wide course surveys give some useful information, they are a bit constrained by the need to apply universally to all courses. To supplement these, the course offers its own end-of-semester survey asking questions more specifically targeted to the design and structure of the course itself. Table 5.3 provides these results. As with the institute-level survey, the course-level survey provides some interesting insights. First, the numbers across most categories do not change semester to semester. This is notable not only because of changes made to the course as time goes on, but also because of semester-specific factors. Fall 2016, for example, was the first semester of the course, and students popularly consider the first semester a "trial run"; anecdotally, many students specifically avoid first-semester classes knowing the second run will be smoother, while other students deliberately take new classes because they enjoy being early adopters.


Table 5.3 Interpolated medians of student responses to eleven prompts on the course-run end-of-semester opinion survey (columns: Fall 2016, Spring 2017, Summer 2017, Fall 2017)

Response rate (%): 63, 65, 83, 52
"The lectures were informative and easy to understand" (a): 6.71, 6.60, 6.56, 6.66
"The exercises during the lectures kept me engaged" (a): 5.95, 5.74, 5.80, 5.85
"The video lessons were valuable in helping me learn" (a): 6.78, 6.56, 6.59, 6.67
"[The forum] improved my experience in this class" (a): 5.90, 5.42, 5.56, 5.45
"The [peer review] system improved my experience in this class" (a): 5.29, 5.51, 5.46, 4.92
Jump around in the lessons instead of watching in order (b): 1.82, 1.84, 1.91, 2.13
Fall behind the recommended schedule in the syllabus (b): 2.17, 2.17, 2.12, 2.26
Watch ahead of the recommended schedule (b): 2.55, 2.05, 2.11, 2.15
Re-watch an entire lesson (b): 3.02, 2.73, 2.71, 2.93
Re-watch only a portion of a lesson after having previously finished a lesson (b): 3.72, 3.74, 3.49, 3.41
Watch videos through an app (b): 1.41, 1.36, 1.72, 1.40
Download course videos for offline viewing (b): 1.38, 1.31, 1.37, 1.34

(a) Agree or disagree, from 7—Strongly agree to 1—Strongly disagree. (b) How often, from 5—Always to 1—Never.

This may be visible in the data: students reported slightly more re-watching and watch-ahead behaviors during the first semester. It is unclear why peer review ratings are lower during Fall 2017. Second and more significant to this analysis, however, is that we see a significant incidence of behaviors corresponding to the claims regarding equity from earlier in this analysis. Nearly all students report some re-watching behaviors, with an interpolated median corresponding to 3 ("Occasionally") for rewatching lectures in their entirety and closer to 4 ("Frequently") for rewatching only specific parts. While data does not exist regarding why students engage in these behaviors, they are closely aligned with potential supports for sensory or attentional deficits. Similarly, while
behaviors related to watching ahead, falling behind, or taking lectures “on the go” are rarer, a non-trivial portion of the class still reports leveraging these capabilities. These correspond to the applications of flexibility discussed previously, allowing students to integrate their course performance flexibly into their routine and schedule. Anecdotally, students report these behaviors most commonly in working around vacation or work schedules or integrating course participation into train commutes or travel plans.

5.9 Conclusion In this case study, we have taken common principles from well-renowned literature on human-computer interaction (Constantine and Lockwood 1999; Nielsen 1995; Norman 2013; Story et al. 1998) and applied them to the design of an entirely-online for-credit graduate-level course in human-computer interaction. We find that whether by analogy or by direct application, many of these principles are strongly related to both the goals and design of online education. Just as interface design aims to accommodate flexibility with regard to user preferences, so also a major objective of online education is to accommodate audiences for whom traditional education is too inflexible to fit into their lifestyle. Just as interface design strives to accommodate all audiences regardless of experience and personal factors, so also online education aims to give access to anyone who may succeed at the course material. Just as interface design aims to shrink feedback cycles and emphasize attention to the underlying task, so also learning design in online education aims to offload non-content tasks onto the interface or leverage consistent expectations to minimize time spent thinking about course administration. Most notably, there are places where the lines between learning design and interface design blur: instructors take certain actions in the interface to implement the learning design, such as setting consistent deadlines to minimize cognitive load or pushing announcements to students to offload progress-tracking onto the interface.

References

Andersen K, Miller ED (1997) Gender and student evaluations of teaching. PS: Polit Sci Polit 30(2):216–220
Arbaugh JB (2001) How instructor immediacy behaviors affect student satisfaction and learning in web-based courses. Bus Commun Q 64(4):42–54
Biggs JB, Collis KF (2014) Evaluating the quality of learning: the SOLO taxonomy (Structure of the observed learning outcome). Academic Press
Bjork RA (2013) Desirable difficulties perspective on learning. Encycl Mind 4:243–245
Boud D, Cohen R, Sampson J (eds) (2014) Peer learning in higher education: learning from and with each other. Routledge
Centra JA, Gaubatz NB (2000) Is there gender bias in student evaluations of teaching? J Higher Educ 71(1):17–33
Chandler J (2003) The efficacy of various kinds of error feedback for improvement in the accuracy and fluency of L2 student writing. J Second Lang Writ 12(3):267–296
Cho V, Cheng TE, Lai WJ (2009) The role of perceived user-interface design in continued usage intention of self-paced e-learning tools. Comput Educ 53(2):216–227
Cohen PA (1981) Student ratings of instruction and student achievement: a meta-analysis of multisection validity studies. Rev Educ Res 51(3):281–309
Constantine L, Lockwood L (1999) Software for use: a practical guide to the models and methods of usage-centered design. Pearson Education
De Gagne J, Walters K (2009) Online teaching experience: a qualitative metasynthesis (QMS). J Online Learn Teach 5(4):577
Fishwick M (2004) Emotional design: why we love (or hate) everyday things. J Am Cult 27(2):234
Freeman M, Bamford A (2004) Student choice of anonymity for learner identity in online learning discussion forums. Int J ELearn 3(3):45
Goel A, Joyner DA (2016) An experiment in teaching cognitive systems online. In: Haynes D (ed) International journal for scholarship of technology-enhanced learning, vol 1, no 1
Good C, Aronson J, Inzlicht M (2003) Improving adolescents' standardized test performance: an intervention to reduce the effects of stereotype threat. J Appl Dev Psychol 24(6)
Goodman J, Melkers J, Pallais A (2016) Can online delivery increase access to education? (No. w22754). National Bureau of Economic Research
Greenwald AG (1997) Validity concerns and usefulness of student ratings of instruction. Am Psychol 52(11):1182
Hiltz SR, Wellman B (1997) Asynchronous learning networks as a virtual classroom. Commun ACM 40(9):44–49
Hollan J, Hutchins E, Kirsh D (2000) Distributed cognition: toward a new foundation for human-computer interaction research. ACM Trans Comput-Hum Interact (TOCHI) 7(2):174–196
Jones MG, Farquhar JD (1997) User interface design for web-based instruction. Khan 62:239–244
Joyner DA, Goel AK, Isbell C (April 2016) The unexpected pedagogical benefits of making higher education accessible. In: Proceedings of the third ACM conference on Learning @ Scale. ACM, pp 117–120
Joyner DA (April 2017) Scaling expert feedback: two case studies. In: Proceedings of the fourth (2017) ACM conference on Learning @ Scale. ACM
Joyner DA (June 2018) Squeezing the limeade: policies and workflows for scalable online degrees. In: Proceedings of the fifth (2018) ACM conference on Learning @ Scale. ACM
Joyner DA (June 2018) Towards CS1 at scale: building and testing a MOOC-for-credit candidate. In: Proceedings of the fifth (2018) ACM conference on Learning @ Scale. ACM
Kulkarni C, Wei KP, Le H, Chia D, Papadopoulos K, Cheng J, Koller D, Klemmer SR (2015) Peer and self assessment in massive online classes. In: Design thinking research. Springer, Cham, pp 131–168
Kizilcec RF, Schneider E, Cohen GL, McFarland DA (2014) Encouraging forum participation in online courses with collectivist, individualist and neutral motivational framings. In: EMOOCS 2014, Proceedings of the European MOOC stakeholder summit, pp 80–87
Koppelman H, Vranken H (June 2008) Experiences with a synchronous virtual classroom in distance education. In: ACM SIGCSE bulletin, vol 40, no 3. ACM, pp 194–198
Kulkarni CE, Bernstein MS, Klemmer SR (March 2015) PeerStudio: rapid peer feedback emphasizes revision and improves performance. In: Proceedings of the second (2015) ACM conference on Learning @ Scale. ACM, pp 75–84
Li X, Chang KM, Yuan Y, Hauptmann A (February 2015) Massive open online proctor: protecting the credibility of MOOCs certificates. In: Proceedings of the 18th ACM conference on Computer-Supported Cooperative Work & Social Computing. ACM, pp 1129–1137
Martin F, Parker MA, Deale DF (2012) Examining interactivity in synchronous virtual classrooms. Int Rev Res Open Distrib Learn 13(3):228–261
McBrien JL, Cheng R, Jones P (2009) Virtual spaces: employing a synchronous online classroom to facilitate student engagement in online learning. Int Rev Res Open Distrib Learn 10(3)
McDaniel MA, Butler AC (2011) A contextual framework for understanding when difficulties are desirable. In: Successful remembering and successful forgetting: a festschrift in honor of Robert A. Bjork
Najjar LJ (1998) Principles of educational multimedia user interface design. Hum Factors 40(2):311–323
Nielsen J (1995) 10 usability heuristics for user interface design, vol 1, no 1. Nielsen Norman Group
Norman D (2013) The design of everyday things: revised and expanded edition. Basic Books (AZ)
Northcutt CG, Ho AD, Chuang IL (2016) Detecting and preventing "multiple-account" cheating in massive open online courses. Comput Educ 100:71–80
O'Malley PJ, Agger JR, Anderson MW (2015) Teaching a chemistry MOOC with a virtual laboratory: lessons learned from an introductory physical chemistry course. J Chem Educ 92(10):1661–1666
Ramsden P (1991) A performance indicator of teaching quality in higher education: the course experience questionnaire. Stud High Educ 16(2):129–150
Story MF, Mueller JL, Mace RL (1998) The universal design file: designing for people of all ages and abilities
Swan K, Shea P, Fredericksen E, Pickett A, Pelz W, Maher G (2000) Building knowledge building communities: consistency, contact and communication in the virtual classroom. J Educ Comput Res 23(4):359–383
Tu CH, McIsaac M (2002) The relationship of social presence and interaction in online classes. Am J Distance Educ 16(3):131–150
Uttl B, White C, Gonzalez D (2017) Meta-analysis of faculty's teaching effectiveness: student evaluation of teaching ratings and student learning are not related. Stud Educ Eval 54:22–42
Whitley BE Jr (1997) Gender differences in computer-related attitudes and behavior: a meta-analysis. Comput Hum Behav 13(1)


CHAPTER 3

Cognitive Engineering

DONALD A. NORMAN

PROLOGUE Cognitive Engineering, a term invented to reflect the enterprise I find myself engaged in: neither Cognitive Psychology, nor Cognitive Science, nor Human Factors. It is a type of applied Cognitive Science, trying to apply what is known from science to the design and construction of machines. It is a surprising business. On the one hand, there actually is quite a lot known in Cognitive Science that can be applied. But on the other hand, our lack of knowledge is appalling. On the one hand, computers are ridiculously difficult to use. On the other hand, many devices are difficult to use; the problem is not restricted to computers: there are fundamental difficulties in understanding and using most complex devices. So the goal of Cognitive Engineering is to come to understand the issues, to show how to make better choices when they exist, and to show what the tradeoffs are when, as is the usual case, an improvement in one domain leads to deficits in another. In this chapter I address some of the problems of applications that have been of primary concern to me over the past few years and that have guided the selection of contributors and themes of this book. The chapter is not intended to be a coherent discourse on Cognitive Engineering. Instead, I discuss a few issues that seem central to the
way that people interact with machines. The goal is to determine what are the critical phenomena: The details can come later. Overall, I have two major goals: 1. To understand the fundamental principles behind human action and performance that are relevant for the development of engineering principles of design.

2. To devise systems that are pleasant to use-the goal is neither efficiency nor ease nor power, although these are all to be desired, but rather systems that are pleasant, even fun: to produce what Laurel calls “pleasurable engagement” (Chapter 4).

AN ANALYSIS OF TASK COMPLEXITY Start with an elementary example: how a person performs a simple task. Suppose there are two variables to be controlled. How should we build a device to control these variables? The control question seems trivial: If there are two variables to be controlled, why not simply have two controls, one for each? What is the problem? It turns out that there is more to be considered than is obvious at first thought. Even the task of controlling a single variable by means of a single control mechanism raises a score of interesting issues. One has only to watch a novice sailor attempt to steer a small boat to a compass course to appreciate how difficult it can be to use a single control mechanism (the tiller) to affect a single outcome (boat direction). The mapping from tiller motion to boat direction is the opposite of what novice sailors sometimes expect. And the mapping of compass movement to boat movement is similarly confusing. If the sailor attempts to control the boat by examining the compass, determining in which direction to move the boat, and only then moving the tiller, the task can be extremely difficult.

Experienced sailors will point out that this formulation puts the problem in its clumsiest, most difficult form: With the right formulation, or the right conceptual model, the task is not complex. That comment makes two points. First, the description I gave is a reasonable one for many novice sailors: The task is quite difficult for them. The point is not that there are simpler ways of viewing the task, but that even a task that has but a single mechanism to control a single variable can be difficult to understand, to learn, and to do. Second, the comment reveals the power of the proper conceptual model of the situation: The correct conceptual model can transform confusing, difficult tasks into simple, straightforward ones. This is an important point that forms the theme of a later section.

Psychological Variables Differ From Physical Variables There is a discrepancy between the person's psychologically expressed goals and the physical controls and variables of the task. The person starts with goals and intentions. These are psychological variables. They exist in the mind of the person and they relate directly to the needs and concerns of the person. However, the task is to be performed on a physical system, with physical mechanisms to be manipulated, resulting in changes to the physical variables and system state. Thus, the person must interpret the physical variables into terms relevant to the psychological goals and must translate the psychological intentions into physical actions upon the mechanisms. This means that there must be a stage of interpretation that relates physical and psychological variables, as well as functions that relate the manipulation of the physical variables to the resulting change in physical state. In many situations the variables that can easily be controlled are not those that the person cares about. Consider the example of bathtub water control. The person wants to control rate of total water flow and temperature. But water arrives through two pipes: hot and cold. The easiest system to build has two faucets and two spouts. As a result, the physical mechanisms control rate of hot water and rate of cold water. Thus, the variables of interest to the user interact with the two physical variables: Rate of total flow is the sum of the two physical variables; temperature is a function of their difference (or ratio). The problems come from several sources: 1. Mapping problems. Which control is hot, which is cold? Which way should each control be turned to increase or decrease the flow? (Despite the appearance of universal standards for these mappings, there are sufficient variations in the standards, idiosyncratic layouts, and violations of expectations, that each new faucet poses potential problems.)

2. Ease of control. To make the water hotter while maintaining total rate constant requires simultaneous manipulation of both faucets.

3. Evaluation. With two spouts, it is sometimes difficult to determine if the correct outcome has been reached.

Faucet technology evolved to solve the problem. First, mixing spouts were devised that aided the evaluation problem. Then, "single control" faucets were devised that varied the psychological factors directly: One dimension of movement of the control affects rate of flow, another orthogonal dimension affects temperature. These controls are clearly superior to use. They still do have a mapping problem - knowing what kind of movement to which part of the mechanism controls which variable - and because the mechanism is no longer as visible as in the two-faucet case, they are not quite so easy to understand for the first-time user. Still, faucet design can be used as a positive example of how technology has responded to provide control over the variables of psychological interest rather than over the physical variables that are easier and more obvious. It is surprisingly easy to find other examples of the two-variable, two-control task. The water faucet is one example. The loudness and balance controls on some audio sets are another. The temperature controls of some refrigerator-freezer units are another. Let me examine this latter example, for it illustrates a few more issues that need to be considered, including the invisibility of the control mechanisms and a long time delay between adjustment of the control and the resulting change of temperature.

[Illustration: the refrigerator's temperature-control instruction plate. It shows two controls, labeled FREEZER and FRESH FOOD, with recommended letter-and-number settings for "normal settings," "colder fresh food," "coldest fresh food," "colder freezer," "warmer fresh food," and "off," along with the instructions "set both controls" and "allow 24 hours to stabilize."]

There are two variables of concern to the user: the temperature of the freezer compartment and the temperature of the regular "fresh food" compartment. At first, this seems just like the water control example, but there is a difference. Consider the refrigerator that I own. It has two compartments, a freezer and a fresh foods one, and two controls, both located in the fresh foods section. One control is labeled "freezer," the other "fresh food," and there is an associated instruction plate (see the illustration). But what does each control do? What is the mapping between their settings and my goal? The labels seem clear enough, but if you read the "instructions," confusion can rapidly set in. Experience suggests that the action is not as labeled: The two controls interact with one another. The problems introduced by this example seem to exist at almost every level:

1. Matching the psychological variables of interest to the physical variables being controlled. Although the labels on the control mechanisms indicate some relationship to the desired psychological variables, in fact, they do not control those variables directly. 2. The mapping relationships. There is clearly strong interaction between the two controls, making simple mapping between control and function or control and outcome difficult.

3. Feedback. Very slow, so that by the time one is able to determine the result of an action, so much time has passed that the action is no longer remembered, making "correction" of the action difficult.

4. Conceptual model. None. The instructions seem deliberately opaque and nondescriptive of the actual operations.

I suspect that this problem results from the way this refrigerator's cooling mechanism is constructed. The two variables of psychological interest cannot be controlled directly. Instead, there is only one cooling mechanism and one thermostat, which therefore must be located in either the "fresh food" section or in the freezer, but not both. A good description of this mechanism, stating which control affected which function, would probably make matters workable. If one mechanism were clearly shown to control the thermostat and the other to control the relative proportion of cold air directed toward the freezer and fresh foods section, the task would be
much easier. The user would be able to get a clear conceptual model of the operation. Without a conceptual model, with a 24-hour delay between setting the controls and determining the results, it is almost impossible to determine how to operate the controls. Two variables: two controls. Who could believe that it would be so difficult?

Even Simple Tasks Involve a Large Number of Aspects The conclusion to draw from these examples is that even with two variables, the number of aspects that must be considered is surprisingly large. Thus, suppose the person has two psychological goals, G1 and G2. These give rise to two intentions, I1 and I2, to satisfy the goals. The system has some physical state, S, realized through the values of its variables: For convenience, let there be two variables of interest, V1 and V2. And let there be two mechanisms that control the system, M1 and M2. So we have the psychological goals and intentions (G and I) and the physical state, mechanisms, and variables (S, M, and V). First, the person must examine the current system state, S, and evaluate it with respect to the goals, G. This requires translating the physical state of the system into a form consistent with the psychological goal. Thus, in the case of steering a boat, the goal is to reach some target, but the physical state is the numerical compass heading. In writing a paper, the goal may be a particular appearance of the manuscript, but the physical state may be the presence of formatting commands in the midst of the text. The difference between desired goal and current state gives rise to an intention, again stated in psychological terms. This must get translated into an action sequence, the specification of what physical acts will be performed upon the mechanisms of the system. To go from intention to action specification requires consideration of the mapping between physical mechanisms and system state, and between system state and the resulting psychological interpretation. There may not be a simple mapping between the mechanisms and the resulting physical variables, nor between the physical variables and the resulting psychological states. Thus, each physical variable might be affected by an interaction of the control mechanisms: V1 = f(M1, M2) and V2 = g(M1, M2). In turn, the system state, S, is a function of all its variables: S = h(V1, V2). And finally, the mapping between system state and psychological interpretation is complex. All in all, the two-variable, two-mechanism situation can involve a surprising number of aspects. The list of aspects is shown and defined in Table 3.1.
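To make the notation concrete, the bathtub-faucet example can be written out directly: the mechanisms M1 and M2 are the two valve settings, the physical variables V1 and V2 are the hot and cold flow rates, and the psychological variables are total flow and temperature. The sketch below is purely illustrative, with arbitrary flow and temperature constants.

```python
# Psychological goal variables: total flow and water temperature.
# Physical mechanisms: hot-valve and cold-valve settings (0.0 to 1.0).
# The constants below are arbitrary, chosen only to illustrate the mappings
# V1 = f(M1, M2), V2 = g(M1, M2), and S = h(V1, V2).

MAX_FLOW_PER_VALVE = 6.0          # liters per minute at a fully open valve (assumed)
HOT_TEMP, COLD_TEMP = 60.0, 15.0  # degrees Celsius at the two pipes (assumed)

def physical_state(hot_valve: float, cold_valve: float):
    """Map mechanism settings (M1, M2) to physical variables (V1, V2)."""
    return MAX_FLOW_PER_VALVE * hot_valve, MAX_FLOW_PER_VALVE * cold_valve

def psychological_state(hot_flow: float, cold_flow: float):
    """Interpret physical variables as the variables the user cares about."""
    total_flow = hot_flow + cold_flow
    if total_flow == 0:
        return 0.0, None  # no water, temperature undefined
    temperature = (hot_flow * HOT_TEMP + cold_flow * COLD_TEMP) / total_flow
    return total_flow, temperature

# Making the water hotter at constant total flow requires moving BOTH valves:
print(psychological_state(*physical_state(0.5, 0.5)))  # (6.0, 37.5)
print(psychological_state(*physical_state(0.7, 0.3)))  # (6.0, 46.5)
```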


TABLE 3.1 ASPECTS OF A TASK

Goals and intentions. A goal is the state the person wishes to achieve; an intention is the decision to act so as to achieve the goal.

Specification of the action sequence. The psychological process of determining the psychological representation of the actions that are to be executed by the user on the mechanisms of the system.

Mapping from psychological goals and intentions to action sequence. In order to specify the action sequence, the user must translate the psychological goals and intentions into the desired system state, then determine what settings of the control mechanisms will yield that state, and then determine what physical manipulations of the mechanisms are required. The result is the internal, mental specification of the actions that are to be executed.

Physical state of the system. The physical state of the system, determined by the values of all its physical variables.

Control mechanisms. The physical devices that control the physical variables.

Mapping between the physical mechanisms and system state. The relationship between the settings of the mechanisms of the system and the system state.

Interpretation of system state. The relationship between the physical state of the system and the psychological goals of the user can only be determined by first translating the physical state into psychological states (perception), then interpreting the perceived system state in terms of the psychological variables of interest.

Evaluating the outcome. Evaluation of the system state requires comparing the interpretation of the perceived system state with the desired goals. This often leads to a new set of goals and intentions.

TOWARD A THEORY OF ACTION It seems clear that we need to develop theoretical tools to understand what the user is doing. We need to know more about how people actually do things, which means a theory of action. There isn't any realistic hope of getting the theory of action, at least for a long time, but
certainly we should be able to develop approximate theories.¹ And that is what follows: an approximate theory for action which distinguishes among different stages of activities, not necessarily always used nor applied in that order, but different kinds of activities that appear to capture the critical aspects of doing things. The stages have proved to be useful in analyzing systems and in guiding design. The essential components of the theory have already been introduced in Table 3.1. In the theory of action to be considered here, a person interacts with a system, in this case a computer. Recall that the person's goals are expressed in terms relevant to the person, in psychological terms, and the system's mechanisms and states are expressed in terms relative to it, in physical terms. The discrepancy between psychological and physical variables creates the major issues that must be addressed in the design, analysis, and use of systems. I represent the discrepancies as two gulfs that must be bridged: the Gulf of Execution and the Gulf of Evaluation, both shown in Figure 3.1.²

The Gulfs of Execution and Evaluation The user of the system starts off with goals expressed in psychological terms. The system, however, presents its current state in physical terms. Goals and system state differ significantly in form and content, creating the Gulfs that need to be bridged if the system is to be used (Figure 3.1). The Gulfs can be bridged by starting in either direction. The designer can bridge the Gulfs by starting at the system side and moving closer to the person by constructing the input and output characteristics of the interface so as to make better matches to the psychological needs of the user. The user can bridge the Gulfs by creating plans, action sequences, and interpretations that move the normal description of the goals and intentions closer to the description required by the physical system (Figure 3.2).

¹ There is little prior work in psychology that can act as a guide. Some of the principles come from the study of servomechanisms and cybernetics. The first study known to me in psychology - and in many ways still the most important analysis - is the book Plans and the Structure of Behavior by Miller, Galanter, and Pribram (1960), early in the history of information processing psychology. Powers (1973) applied concepts from control theory to cognitive concerns. In the work most relevant to the study of Human-Computer Interaction, Card, Moran, and Newell (1983) analyzed the cycle of activities from Goal through Selection: the GOMS model (Goal, Operator, Methods, Selection). Their work is closely related to the approach given here. This is an issue that has concerned me for some time, so some of my own work is relevant: the analysis of errors, of typing, and of the attentional control of actions (Norman, 1981a, 1984b, 1986; Norman & Shallice, 1985; Rumelhart & Norman, 1982).

² The emphasis on the discrepancy between the user and the system, and the suggestion that we should conceive of the discrepancy as a Gulf that must be bridged by the user and the system designer, came from Jim Hollan and Ed Hutchins during one of the many revisions of the Direct Manipulation chapter (Chapter 5).


FIGURE 3.1. The Gulfs of Execution and Evaluation. Each Gulf is unidirectional: The Gulf of Execution goes from Goals to Physical System; the Gulf of Evaluation goes from Physical System to Goals.


Bridging the Gulf of Execution. The gap from goals to physical system is bridged in four segments: intention formation, specifying the action sequence, executing the action, and, finally, making contact with the input mechanisms of the interface. The intention is the first step, and it starts to bridge the gulf, in part because the interaction language demanded by the physical system comes to color the thoughts of the person, a point expanded upon in Chapter 5 by Hutchins, Hollan, and Norman. Specifying the action sequence is a nontrivial exercise in planning (see Riley & O'Malley, 1985). It is what Moran calls matching the internal specification to the external (Moran, 1983). In the terms of the aspects listed in Table 3.1, specifying the action requires translating the psychological goals of the intention into the changes to be made to the physical variables actually under control of the system. This, in turn, requires following the mapping between the psychological intentions and the physical actions permitted on the mechanisms of the system, as well as the mapping between the physical mechanisms and the resulting physical state variables, and between the physical state of the system and the psychological goals and intentions. After an appropriate action sequence is determined, the actions must be executed. Execution is the first physical action in this sequence: Forming the goals and intentions and specifying the action sequence were all mental events. Execution of an action means to do something, whether it is just to say something or to perform a complex motor sequence.


FIGURE 3.2. Bridging the Gulfs of Execution and Evaluation. The Gulf of Execution is bridged from the psychology side by the user’s formation of intentions relevant to the system and the determination of an action sequence. It is bridged from the system side when the designer of the system builds the input characteristics of the interface. The Gulf of Evaluation is bridged from the psychology side by the user’s perception of the system state and the interpretation placed on that perception, which is then evaluated by comparing it with the original goals and intentions. It is bridged from the system side when the designer builds the output characteristics of the interface.

Just what physical actions are required is determined by the choice of input devices on the system, and this can make a major difference in the usability of the system. Because some physical actions are more difficult than others, the choice of input devices can affect the selection of actions, which in turn affects how well the system matches with intentions. On the whole, theorists in this business tend to ignore the input devices, but in fact, the choice of input device can often make an important impact on the usability of a system. (See Chapter 15 by Buxton for a discussion of this frequently overlooked point.)

Bridging the Gulf of Evaluation. Evaluation requires comparing the interpretation of system state with the original goals and intentions. One problem is to determine what the system state is, a task that can be assisted by appropriate output displays by the system itself. The outcomes are likely to be expressed in terms of physical variables that bear complex relationships to the psychological variables of concern to the user and in which the intentions were formulated. The gap from system to user is bridged in four segments: starting with the output
displays of the interface, moving to the perceptual processing of those displays, to its interpretation, and finally, to the evaluation-the comparison of the interpretation of system state with the original goals and intention. But in doing all this, there is one more problem, one just beginning to be understood, and one not assisted by the usual forms of displays: the problem of level. There may be many levels of outcomes that must be matched with different levels of intentions (see Norman, 1981a; Rasmussen in press; Rasmussen & Lind, 1981). And, finally, if the change in system state does not occur immediately following the execution of the action sequence, the resulting delay can severely impede the process of evaluation, for the user may no longer remember the details of the intentions or the action sequence.

Stages of User Activities A convenient summary of the analysis of tasks is that the process of performing and evaluating an action can be approximated by seven stages of user activity³ (Figure 3.3):

- Establishing the Goal
- Forming the Intention
- Specifying the Action Sequence
- Executing the Action
- Perceiving the System State
- Interpreting the State
- Evaluating the System State with respect to the Goals and Intentions

³ The last two times I spoke of an approximate theory of action (Norman, 1984a, 1985) I spoke of four stages. Now I speak of seven. An explanation seems to be in order. The answer really is simple. The full theory of action is not yet in existence, but whatever its form, it involves a continuum of stages on both the action/execution side and the perception/evaluation side. The notion of stages is a simplification of the underlying theory: I do not believe that there really are clean, separable stages. However, for practical application, approximating the activity into stages seems reasonable and useful. Just what division of stages should be made, however, seems less clear. In my original formulations, I suggested four stages: intention, action sequence, execution, and evaluation. In this chapter I separated goals and intentions and expanded the analysis of evaluation by adding perception and interpretation, thus making the stages of evaluation correspond better with the stages of execution: Perception is the evaluatory equivalent of execution, interpretation the equivalent of the action sequence, and evaluation the equivalent of forming the intention. The present formulation seems a richer, more satisfactory analysis.


FIGURE 3.3. Seven stages of user activities involved in the performance of a task. The primary, central stage is the establishment of the goal. Then, to carry out an action requires three stages: forming the intention, specifying the action sequence, and executing the action. To assess the effect of the action also requires three stages, each in some sense complementary to the three stages of carrying out the action: perceiving the system state, interpreting the state, and evaluating the interpreted state with respect to the original goals and intentions.
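As a reading aid only (the identifier names are mine, not Norman's), the seven stages and the complementary pairing described in the footnote can be written down as data:

```python
from enum import Enum

class Stage(Enum):
    """The seven approximate stages of user activity, as summarized above."""
    ESTABLISH_GOAL = "Establishing the Goal"
    FORM_INTENTION = "Forming the Intention"            # execution side
    SPECIFY_ACTION_SEQUENCE = "Specifying the Action Sequence"
    EXECUTE_ACTION = "Executing the Action"
    PERCEIVE_STATE = "Perceiving the System State"      # evaluation side
    INTERPRET_STATE = "Interpreting the State"
    EVALUATE_OUTCOME = "Evaluating the State against Goals and Intentions"

# The complementary pairing described in the text: perception mirrors
# execution, interpretation mirrors the action sequence, and evaluation
# mirrors intention formation.
COMPLEMENTS = {
    Stage.EXECUTE_ACTION: Stage.PERCEIVE_STATE,
    Stage.SPECIFY_ACTION_SEQUENCE: Stage.INTERPRET_STATE,
    Stage.FORM_INTENTION: Stage.EVALUATE_OUTCOME,
}

for act, assess in COMPLEMENTS.items():
    print(f"{act.value:<35} <-> {assess.value}")
```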

Real activity does not progress as a simple sequence of stages. Stages appear out of order, some may be skipped, some repeated. Even the analysis of relatively simple tasks demonstrates the complexities. Moreover, in some situations, the person is reactive - event or data driven - responding to events, as opposed to starting with goals and intentions. Consider the task of monitoring a complex, ongoing operation. The person's task is to respond to observations about the state of the system. Thus, when an indicator starts to move a bit out of range, or when something goes wrong and an alarm is triggered, the operator
must diagnose the situation and respond appropriately. The diagnosis leads to the formation of goals and intentions: Evaluation includes not only checking on whether the intended actions were executed properly and intentions satisfied, but whether the original diagnosis was appropriate. Thus, although the stage analysis is relevant, it must be used in ways appropriate to the situation. Consider the example of someone who has written a letter on a computer word-processing system. The overall goal is to convey a message to the intended recipient. Along the way, the person prints a draft of the letter. Suppose the person decides that the draft, shown in Figure 3.4A, doesn't look right: The person, therefore, establishes the intention "Improve the appearance of the letter." Call this first intention intention1. Note that this intention gives little hint of how the task is to be accomplished. As a result, some problem solving is required, perhaps ending with intention2: "Change the indented paragraphs to block paragraphs." To do this requires intention3: "Change the occurrences of .pp in the source code for the letter to .sp." This in turn requires the person to generate an action sequence appropriate for the text editor, and then, finally, to execute the actions on the computer keyboard. Now, to evaluate the results of the operation requires still further operations, including generation of a fourth intention, intention4: "Format the file" (in order to see whether intention2 and intention1 were satisfied). The entire sequence of stages is shown in Figure 3.4B. The final product, the reformatted letter, is shown in Figure 3.4C. Even intentions that appear to be quite simple (e.g., intention1: "Improve the appearance of the letter") lead to numerous subintentions. The intermediary stages may require generating some new subintentions.
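The bottom of that intention stack is an ordinary text substitution. A hypothetical sketch of the final action sequence is shown below; the markup commands .pp and .sp come from the example, while everything else is invented for illustration.

```python
# The high-level intention "improve the appearance of the letter" bottoms out
# in a concrete action sequence: change every indented-paragraph command .pp
# in the letter's source to the block-paragraph command .sp, then reformat
# the file to evaluate whether the higher-level intentions were satisfied.
def change_paragraph_style(source: str) -> str:
    """Replace indented-paragraph commands with block-paragraph commands."""
    return "\n".join(
        ".sp" if line.strip() == ".pp" else line
        for line in source.splitlines()
    )

letter_source = "\n".join([".pp", "Dear colleague,", ".pp", "Thank you for ..."])
print(change_paragraph_style(letter_source))
```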

Practical Implications

The existence of the two gulfs points out a critical requirement for the design of the interface: to bridge the gap between goals and system. Moreover, as we have seen, there are only two ways to do this: move the system closer to the user; move the user closer to the system. Moving from the system to the user means providing an interface that matches the user's needs, in a form that can be readily interpreted and manipulated. This confronts the designer with a large number of issues. Not only do users differ in their knowledge, skills, and needs, but for even a single user the requirements for one stage of activity can conflict with the requirements for another. Thus, menus can be thought of as information to assist in the stages of intention formation and action specification, but they frequently make execution more


FIGURE 3.4. (A) The printed draft of the letter. (B) The entire sequence of stages. (C) The final product: the reformatted letter.

Human-Centered Design Considered Harmful

Don Norman, norman@nngroup.com

Human-centered design has become such a dominant theme in design that it is now accepted by interface and application designers automatically, without thought, let alone criticism. That's a dangerous state—when things are treated as accepted wisdom. The purpose of this essay is to provoke thought, discussion, and reconsideration of some of the fundamental principles of human-centered design. These principles, I suggest, can be helpful, misleading, or wrong. At times, they might even be harmful. Activity-centered design might be superior.

Know Your User

If there is any principle that is sacred to those in the field of user-interface design and human-computer interaction, it is "know your user." After all, how can one design something for people without a deep, detailed knowledge of those people? The plethora of bad designs in the world would seem to be excellent demonstrations of the perils of ignoring the people for whom the design is intended. Human-centered design was developed to overcome the poor design of software products. By emphasizing the needs and abilities of those who were to use the software, usability and understandability of products has indeed been improved. But despite these improvements, software complexity is still with us. Even companies that pride themselves on following human-centered principles still have complex, confusing products.

If it is so critical to understand the particular users of a product, then what happens when a product is designed to be used by almost anyone in the world? There are many designs that do work well for everyone. This is paradoxical, and it is this very paradox that led me to reexamine common dogma.

Most items in the world have been designed without the benefit of user studies and the methods of human-centered design. Yet they do quite well. Moreover, these include some of the most successful objects of our modern, technological worlds. Consider two representative examples:

The Automobile. People all over the world learn to drive quite successfully with roughly the same configuration of controls. There were no systematic studies of users. Rather, early automobiles tried a variety of configurations, initially copying the seating and steering arrangements of horse-drawn carriages, going through tillers and rods, and then various hand and foot controls until the current scheme evolved.

Everyday Objects. Just look around: kitchen utensils, garden tools, woodworking tools, typewriters, cameras, and sporting equipment vary somewhat from culture to culture, but on the whole, they are more similar than not. People all over the world manage to learn them—and manage quite well.

Activity-Centered Design. Why do these devices work so well? The basic reason is that they were all developed with a deep understanding of the activities that were to be performed: Call this activity-centered design. Many were not even designed in the common sense of the term; rather, they evolved with time. Each new generation of builders slowly improved the product upon the previous generation, based on feedback from their own experiences as well as from their customers. Slow, evolutionary folk design. But even for those devices created by formal design teams, populated with people whose job title was "designer," these designers used their own understanding of the activities to be performed to determine how the device would be operated. The users were supposed to understand the task and to understand the designers' intentions.

Activities Are Not the Same as Tasks

Do note the emphasis on the word "activity" as opposed to "task." There is a subtle difference. I use the terms in a hierarchical fashion. At the highest levels are activities, which are composed of tasks, which themselves are composed of actions, and actions are made up of operations. The hierarchical structure comes from my own brand of "activity theory," heavily motivated by early Russian and Scandinavian research. To me, an activity is a coordinated, integrated set of tasks. For example, mobile phones that combine appointment books, diaries and calendars, note-taking facilities, text messaging, and cameras can do a good job of supporting communication activities. This one single device integrates several tasks: looking up numbers, dialing, talking, note taking, checking one's diary or calendar, and exchanging photographs, text messages, and emails. One activity, many tasks.

What Adapts? Technology or People?

The historical record contains numerous examples of successful devices that required people to adapt to and learn the devices. People were expected to acquire a good understanding of the activities to be performed and of the operation of the technology. None of this "tools adapt to the people" nonsense—people adapt to the tools. Think about that last point.

A fundamental corollary to the principle of human-centered design has always been that technology should adapt to people, not people to the technology. Is this really true? Consider the history of the following successful technologies.

The Clock (and Watch). An arbitrary division of the year and day into months, weeks, days, hours, minutes, and seconds, all according to physical principles that differ from psychological or biological ones, now rules our lives. We eat when our watches tell us it is meal time, not when we are hungry. We awake according to the harsh call of the alarm, not when we are rested. University classes are taught in one-hour periods, three times a week, in ten- to 15-week sessions, not because this is good for education, but because it makes for easier scheduling. The extreme reliance on time is an accidental outgrowth of the rise of the factory and the resulting technological society.

Writing Systems. Consider printing, handwriting, and typing. All are artificial and unnatural. It takes people weeks, months, or even years to learn and become skilled. One successful stylus-based text input device for the Roman alphabet is Graffiti—yet another unnatural way of writing.

Musical Instruments. Musical instruments are complex and difficult to manipulate and can cause severe medical problems. Musical notation is modal, so the same representation on a treble clef has a different interpretation on the bass clef. The usability profession has long known of the problems with modes, yet multiple staves have been with us for approximately 1,000 years. It takes considerable instruction and practice to become skilled at reading and playing. The medical problems faced by musicians are so severe that there are books, physicians, Web pages and discussion groups devoted to them. For example, repetitive stress injuries among violinists and pianists are common. Neither the instruments nor the notation would pass any human-centered design review.

Human-Centered versus Activity-Centered: What's the Difference?

What is going on? Why are such non-human-centered designs so successful? I believe there are two reasons, one the activity-centered nature, and two the communication of intention from the builders and designers. Successful devices are those that fit gracefully into the requirements of the underlying activity, supporting them in a manner understandable by people. Understand the activity, and the device is understandable. Builders and designers often have good reasons for the way they constructed the system. If these reasons can be explained, then the task of learning the system is both eased and made plausible. Yes, it takes years to learn to play the violin, but people accept this because the instrument itself communicates rather nicely the relationship between strings and the resulting sounds. Both the activity and the design are understandable, even if the body must be contorted to hold, finger, and bow the instrument.

Activity-centered design (ACD) is actually very much like human-centered design (HCD). Many of the best attributes of HCD carry over. But there are several differences, first and foremost that of attitude. Attitude? Yes, the mindset of the designer. The activities, after all, are human activities, so they reflect the possible range of actions, of conditions under which people are able to function, and the constraints of real people. A deep understanding of people is still a part of ACD. But ACD is more: It also requires a deep understanding of the technology, of the tools, and of the reasons for the activities.

Tools Define the Activity: People Really Do Adapt to Technology

HCD asserts as a basic tenet that technology adapts to the person. In ACD, we admit that much of human behavior can be thought of as an adaptation to the powers and limitations of technology. Everything, from the hours we sleep to the way we dress, eat, interact with one another, travel, learn, communicate, play, and relax. Not just the way we do these things, but with whom, when, and the way we are supposed to act, variously called mores, customs, and conventions.

People do adapt to technology. It changes social and family structure. It changes our lives. Activity-centered design not only understands this, but might very well exploit it.

Learn the activity, and the tools are understood. That's the mantra of the human-centered design community. But this is actually a misleading statement, because for many activities, the tools define the activity. Maybe the reality is just the converse: Learn the tools, and the activity is understood.

Consider art, where much time is spent learning the vagaries of the media. If you want to do oil painting, then you need to understand oil, and brushes, and painting surfaces—even how and when to clean your brush. Is this the tool wagging the dog? Yes, and that is how it always is, how it always shall be. The truly excellent artists have a deep and thorough understanding of their tools and technologies. It isn't enough to have an artistic sense. So too with sports, with cooking, with music, and with all other major activities that use tools.

To the human-centered design community, the tool should be invisible; it should not get in the way. With activity-centered design, the tool is the way.

Why Might HCD Be Harmful?

Why might a human-centered design approach ever be harmful? After all, it has evolved as a direct result of the many problems people have with existing designs, problems that lead to frustration, grief, lost time and effort, and, in safety-critical applications, errors, accidents, and death. Moreover, HCD has demonstrated clear benefits: improved usability, fewer errors during usage, and faster learning times. What, then, are the concerns?

One concern is that the focus upon individual people (or groups) might improve things for them at the cost of making it worse for others. The more something is tailored for the particular likes, dislikes, skills, and needs of a particular target population, the less likely it will be appropriate for others.

The individual is a moving target. Design for the individual of today, and the design will be wrong tomorrow. Indeed, the more successful the product, the more that it will no longer be appropriate. This is because as individuals gain proficiency in usage, they need different interfaces than were required when they were beginners. In addition, the successful product often leads to unanticipated new uses that are very apt not to be well supported by the original design.

But there are more-serious concerns: First, the focus upon humans detracts from support for the activities themselves; second, too much attention to the needs of the users can lead to a lack of cohesion and added complexity in the design. Consider the dynamic nature of applications, where any task requires a sequence of operations, and activities can comprise multiple, overlapping tasks. Here is where the difference in focus becomes evident, and where the weakness of the focus on the users shows up.

Static Screens versus Dynamic Sequences

We find that work in the kitchen does not consist of independent, separate acts, but of a series of interrelated processes. (Christine Frederick, The Labor-Saving Kitchen, 1919.)

The methods of HCD seem centered around static understanding of each set of controls, each screen on an electronic display. But as a result, the sequential operations of activities are often ill-supported. The importance of support for sequences has been known ever since the time-and-motion studies of the early 1900s, as the quotation from Frederick, above, illustrates. Simply delete the phrase "in the kitchen" and her words are still a powerful prescription for design. She was writing in 1919: What has happened in the past 100 years to make us forget this? Note that the importance of support for sequences is still deeply understood within industrial engineering and human factors and ergonomics communities. Somehow, it seems less prevalent within the human-computer interaction community.

Many of the systems that have passed through HCD design phases and usability reviews are superb at the level of the static, individual display, but fail to support the sequential requirements of the underlying tasks and activities. The HCD methods tend to miss this aspect of behavior: Activity-centered methods focus upon it.

Too Much Listening to Users

One basic philosophy of HCD is to listen to users, to take their complaints and critiques seriously. Yes, listening to customers is always wise, but acceding to their requests can lead to overly complex designs. Several major software companies, proud of their human-centered philosophy, suffer from this problem. Their software gets more complex and less understandable with each revision. Activity-centered philosophy tends to guard against this error because the focus is upon the activity, not the human. As a result, there is a cohesive, well-articulated design model. If a user suggestion fails to fit within this design model, it should be discarded. Alas, all too many companies, proud of listening to their users, would put it in.

Here, what is needed is a strong, authoritative designer who can examine the suggestions and evaluate them in terms of the requirements of the activity. When necessary, it is essential to be able to ignore the requests. This is the goal to cohesion and understandability. Paradoxically, the best way to satisfy users is sometimes to ignore them.

Note that this philosophy applies in the service domain as well. Thus, Southwest Airlines has been successful despite the fact that it ignores the two most popular complaints of its passengers: provide reserved seating and inter-airline baggage transfer. Southwest decided that its major strategic advantage was inexpensive, reliable transportation, and this required a speedy turn-around time at each destination. Passengers complain, but they still prefer the airline.

Sometimes what is needed is a design dictator who says, "Ignore what users say: I know what's best for them." The case of Apple Computer is illustrative. Apple's products have long been admired for ease of use. Nonetheless, Apple replaced its well-known, well-respected human interface design team with a single, authoritative (dictatorial) leader. Did usability suffer? On the contrary: Its new products are considered prototypes of great design.

The "listen to your users" produces incoherent designs. The "ignore your users" can produce horror stories, unless the person in charge has a clear vision for the product, what I have called the "conceptual model." The person in charge must follow that vision and not be afraid to ignore findings. Yes, listen to customers, but don't always do what they say.

Now consider the method employed by the human-centered design community. The emphasis is often upon the person, not the activity. Look at those detailed scenarios and personas: Honestly, now, did they really inform your design? Did knowing that the persona is that of a 37-year-old, single mother, studying for the MBA at night, really help lay out the control panel or determine the screen layout and, more importantly, to design the appropriate action sequence? Did user modeling, formal or informal, help determine just what technology should be employed?

Show me an instance of a major technology that was developed according to principles of human-centered design, or rapid prototype and test, or user modeling, or the technology adapting to the user. Note the word "major." I have no doubt that many projects were improved, perhaps even dramatically, by the use of these techniques. But name one fundamental, major enhancement to our technologies that came about this way.

Human-centered design does guarantee good products. It can lead to clear improvements of bad ones. Moreover, good human-centered design will avoid failures. It will ensure that products do work, that people can use them. But is good design the goal? Many of us wish for great design. Great design, I contend, comes from breaking the rules, by ignoring the generally accepted practices, by pushing forward with a clear concept of the end result, no matter what. This ego-centric, vision-directed design results in both great successes and great failures. If you want great rather than good, this is what you must do.

There is a lot more to say on this topic. My precepts here are themselves dangerous. We dare not let the entire world of designers follow their instincts and ignore conventional wisdom: Most lack the deep understanding of the activity coupled with a clear conceptual model. Moreover, there certainly are sufficient examples of poor design out in the world to argue against my position. But note, many of those bad designs are profitable products. Hmm. What does that suggest? Would they be even more profitable had human-centered design principles been followed? Perhaps. But perhaps they might not have existed at all. Think about that.

Yes, we all know of disastrous attempts to introduce computer systems into organizations where the failure was a direct result of a lack of understanding of the people and system. Or was it a result of not understanding the activities? Maybe what is needed is more activity-centered design. Maybe failures come from a shallow understanding of the needs of the activities that are to be supported. Note too that in safety-critical applications, a deep knowledge of the activity is fundamental. Safety is usually a complex system issue, and without deep understanding of all that is involved, the design is apt to be faulty.

Still, I think it's time to rethink some of our fundamental suppositions. The focus upon the human may be misguided. A focus on the activities rather than the people might bring benefits. Moreover, substituting activity-centered for human-centered design does not mean discarding all that we have learned. Activities involve people, and so any system that supports the activities must of necessity support the people who perform them. We can build upon our prior knowledge and experience, both from within the field of HCD, but also from industrial engineering and ergonomics.

All fields have fundamental presuppositions. Sometimes it is worthwhile to reexamine them, to consider the pros and cons and see whether they might be modified or even replaced. Is this the case for those of us interested in human-centered design? We will never know unless we do the exercise.

ABOUT THE AUTHOR

Don Norman wears many hats, including co-founder of the Nielsen Norman Group, professor at Northwestern University, and author; his latest book is Emotional Design. He lives at www.jnd.org.

Human-Computer Interaction: An Empirical Research Perspective

I. Scott MacKenzie

Morgan Kaufmann is an imprint of Elsevier


CHAPTER 3 Interaction Elements

FIGURE 3.18 Button arrangements for an elevator control panel. (a) Correct. (b) Incorrect.

geographic regions have experienced and learned it differently. What is accepted in one region may differ from what is accepted in another. If there is a physical contradiction, then the situation is different. Consider elevators (in buildings, not scrollbars). Early elevators didn’t have buttons to specify floors; they only had up and down buttons. Consider the two arrangements for the button controls shown in Figure 3.18. Clearly the arrangement in (a) is superior. When the up control is pressed, the display (the elevator) moves up. The stimulus (control) and response (display) are compatible beyond doubt. In (b) the position of the controls is reversed. Clearly, there is an incompatibility between the stimulus and the response. This situation is different from the scroll pane example given in Figure 3.6 because there is no physical analogy to help the user (can you think of one?). If all elevator control panels had the arrangement in Figure 3.18b, would a population stereotype emerge, as with the light switch example? Well, sort of. People would learn the relationship, because they must. But they would make more errors than if the relationship was based on a correct physical mapping. This particular point has been the subject of considerable experimental testing, dating back to the 1950s (Fitts and Seeger, 1953). See also Newell (1990, 276–278) and Kantowitz and Sorkin (1983, 323–331). The gist of this work is that people take longer and commit more errors if there is a physical misalignment between displays and controls, or between controls and the responses they effect. This work is important to HCI at the very least to highlight the challenges in designing human-computer systems. The physical analogies that human factors engineers seek out and exploit in designing better systems are few and far between in human-computer interfaces. Sure, there are physical relationships like “mouse right, cursor right,” but considering the diversity of people’s interactions with computers, the tasks with physical analogies are the exception. For example, what is the physical analogy for “file save”? Human-computer interfaces require a different way of thinking. Users need help—a lot of help. The use of metaphor is often helpful.

3.4 Mental models and metaphor

There is more to learning or adapting than simply experiencing. One of the most common ways to learn and adapt is through physical analogy (Norman, 1988, p. 23) or metaphor (Carroll and Thomas, 1982). Once we latch on to a physical understanding of an interaction based on experience, it all makes sense. We've experienced it, we know it, it seems natural. With a scroll pane, moving the slider up moves the view up. If the relationship were reversed, moving the slider up would move the content up. We could easily develop a physical sense of slider up → view up or slider up → content up. The up-up in each expression demonstrates the importance of finding a spatially congruent physical understanding. These two analogies require opposite control-display relationships, but either is fine and we could work with one just as easily as with the other, provided implementations were consistent across applications and platforms.

Physical analogies and metaphors are examples of the more general concept of mental models, also known as conceptual models. Mental models are common in HCI. The idea is simple enough: "What is the user's mental model of . . . ?" An association with human experience is required. HCI's first mental model was perhaps that of the office or desktop. The desktop metaphor helped users understand the graphical user interface. Today it is hard to imagine the pre-GUI era, but in the late 1970s and early 1980s, the GUI was strange. It required a new way of thinking. Designers exploited the metaphor of the office or desktop to give users a jump-start on the interface (Johnson et al., 1989). And it worked. Rather than learning something new and unfamiliar, users could act out with concepts already understood: documents, folders, filing cabinets, trashcans, the top of the desk, pointing, selecting, dragging, dropping, and so on. This is the essence of mental models.

Implementation models are to be avoided. These are systems that impose on the user a set of interactions that follow the inner workings of an application. Cooper and Reimann give the example of a software-based fax product where the user is paced through a series of agonizing details and dialogs (Cooper and Reimann, 2003, p. 25). Interaction follows an implementation model, rather than the user's mental model of how to send a fax. The user is prompted for information when it is convenient for the program to receive it, not when it makes sense to the user. Users often have pre-existing experiences with artifacts like faxes, calendars, media players, and so on. It is desirable to exploit these at every opportunity in designing a software-based product.

Let's examine a few other examples in human-computer interfaces. Toolbars in GUIs are fertile ground for mental models. To keep the buttons small and of a consistent size, they are adorned with an icon rather than a label. An icon is a pictorial representation. In HCI, icons trigger a mental image in the user's mind, a clue to a real-world experience that is similar to the action associated with the button or tool. Icons in drawing and painting applications provide good examples. Figure 3.19a shows the Tool Palette in Corel's Paint Shop Pro, a painting and image manipulation application.12 The palette contains 21 buttons, each displaying an icon. Each button is associated with a function and its icon is carefully chosen to elicit the association in the user's mind. Some are clear, like the magnifying glass or the paintbrush. Some are less clear.

12 www.jasc.com

FIGURE 3.19 Icons create associations. (a) Array of toolbar buttons from Corel's Paint Shop Pro. (b) Tooltip help for "Picture Tube" icon.

Have a look. Can you tell what action is

associated with each button? Probably not. But users of this application likely know the meaning of most of these buttons. Preparing this example gave me pause to consider my own experience with this toolbar. I use this application frequently, yet some of the buttons are entirely strange to me. In 1991 Apple introduced a method to help users like me. Hover the mouse pointer over a GUI button and a field pops up providing a terse elaboration on the button's purpose. Apple called the popups balloons, although today they are more commonly known as tooltips or screen tips. Figure 3.19b gives an example for a button in Paint Shop Pro. Apparently, the button's purpose is related to a picture tube. I'm still in the dark, but I take solace in knowing that I am just a typical user: "Each user learns the smallest set of features that he needs to get his work done, and he abandons the rest." (Cooper, 1999, p. 33)

Another example of mental models is the use of a compass and a clock face as metaphors for direction. Most users have an ingrained understanding of a compass and a clock. The inherent labels can serve as mental models for direction. Once there is an understanding that a metaphor is present, the user has a mental model and uses it efficiently and accurately for direction: north, for straight ahead or up, west for left, and so on. As an HCI example, Lindeman et al. (2005) used the mental model of a compass to help virtual reality users navigate a building. Users wore a vibro-tactile belt with eight actuators positioned according to compass directions. They were able to navigate the virtual building using a mental model of the compass. There is also a long history in HCI of using a compass metaphor for stylus gestures with pie menus (Callahan et al., 1988) and marking menus (G. P. Kurtenbach, Sellen, and Buxton, 1993; Li, Hinckley, Guan, and Landay, 2005).

With twelve divisions, a clock provides finer granularity than a compass ("obstacle ahead at 2 o'clock!"). Examples in HCI include numeric entry (Goldstein, Chincholle, and Backström, 2000; Isokoski and Käki, 2002; McQueen, MacKenzie, and Zhang, 1995) and locating people and objects in an environment (Sáenz and Sánchez, 2009; A. Sellen, Eardley, Izadi, and Harper, 2006). Using a clock metaphor for numeric entry with a stylus is shown in Figure 3.20. Instead of scripting numbers using Roman characters, the numbers are entered using straight-line strokes. The direction of the stroke is the number's position on a clock face. In a longitudinal study, McQueen et al. (1995) found that numeric entry was about


FIGURE 3.20 Mental model example: (a) Clock face. (b) Numeric entry with a stylus.

24 percent faster using straight-line strokes compared to handwritten digits. The 12 o'clock position was used for 0. The 10 o'clock and 11 o'clock positions were reserved for system commands.

Sáenz and Sánchez describe a system to assist the blind (Sáenz and Sánchez, 2009) using the clock metaphor. Users carried a mobile locating device that provided spoken audio information about the location of nearby objects (see Figure 3.21a). For the metaphor to work, the user is assumed to be facing the 12 o'clock position. The system allowed users to navigate a building eyes-free (Figure 3.21b). Users could request position and orientation information from the locator. Auditory responses were provided using the clock metaphor and a text-to-speech module (e.g., "door at 3 o'clock"). A similar interface is Rümelin et al.'s NaviRadar (Rümelin, Rukzio, and Hardy, 2012), which uses tactile feedback rather than auditory feedback. Although not specifically using the clock metaphor, NaviRadar leverages users' spatial sense of their surroundings to aid navigation. Users receive combinations of long and short vibratory pulses to indicate direction (Figure 3.21c). Although the patterns must be learned, the system is simple and avoids auditory feedback, which may be impractical in some situations.

The systems described by Sáenz and Sánchez (2009) and Rümelin et al. (2012) have similar aims yet were presented and evaluated in different ways. Sáenz and Sánchez emphasized and described the system architecture in detail. Although this is of interest to some in the HCI community, from the user's perspective the system architecture is irrelevant. A user test was reported, but the evaluation was not experimental. There were no independent or dependent variables. Users performed tasks with the system and then responded to questionnaire items, expressing their level of agreement to assertions such as "The software was motivating," or "I like the sounds in the software." While qualitative assessments are an essential component of any evaluation, the navigation and locating aids described in this work are well suited to experimental testing. Alternative implementations, even minor modifications to the interface, are potential independent variables. Speed (e.g., the time to complete tasks) and accuracy (e.g., the number of wrong turns, retries, direction changes, wall collisions) are potential dependent variables.
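The clock mapping lends itself to a small worked example. The sketch below is mine, not McQueen et al.'s code: it converts a single straight stroke into a clock position and then into a digit, with 12 o'clock treated as 0 and the 10 and 11 o'clock positions reserved for commands, as described above. The coordinate convention (y growing downward) is an assumption about the input device.

```typescript
// Sketch of clock-metaphor digit entry from one straight stroke (illustrative only).
// Screen coordinates are assumed: x grows to the right, y grows downward.
type ClockEntry =
  | { kind: "digit"; value: number }
  | { kind: "command"; hour: number };

function strokeToClockEntry(x1: number, y1: number, x2: number, y2: number): ClockEntry {
  const dx = x2 - x1;
  const dy = y2 - y1;
  // Clockwise angle from "straight up" (the 12 o'clock direction), in degrees [0, 360).
  const angle = (Math.atan2(dx, -dy) * 180 / Math.PI + 360) % 360;
  const hour = Math.round(angle / 30) % 12; // snap to the nearest of 12 clock positions
  if (hour === 10 || hour === 11) {
    return { kind: "command", hour };       // 10 and 11 o'clock reserved for commands
  }
  return { kind: "digit", value: hour };    // hour 0 is the 12 o'clock stroke, i.e., digit 0
}

// A stroke to the right points at 3 o'clock and enters the digit 3.
console.log(strokeToClockEntry(100, 100, 140, 100)); // { kind: "digit", value: 3 }
```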


FIGURE 3.21 Spatial metaphor: (a) Auditory feedback provides information for locating objects, such as "object at 4 o'clock." (b) Navigation task. (c) NaviRadar. (Source: b, adapted from Sáenz and Sánchez, 2009; c, adapted from Rümelin et al., 2012)

Rümelin et al. (2012) took an empirical approach to system tests. Their research included both the technical details of NaviRadar and an evaluation in a formal experiment with independent variables, dependent variables, and so on. The main independent variable included different intensities, durations, and rhythms in the tactile pulses. Since their approach was empirical, valuable analyses were possible. They reported, for example, the deviation of indicated and reported directions and how this varied according to direction and the type of tactile information given. Their approach enables other researchers to study the strengths and weaknesses in NaviRadar in empirical terms and consider methods of improvement.
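As a concrete illustration of the kind of dependent measure such an evaluation can report, the sketch below computes the angular deviation between an indicated direction and the direction a participant reports, wrapping around the circle so that 350° versus 10° counts as 20° rather than 340°. This is my own illustration of the measure, not the authors' analysis code.

```typescript
// Angular deviation (in degrees) between an indicated and a reported direction,
// wrapped to the shorter way around the circle. Illustrative only.
function angularDeviation(indicatedDeg: number, reportedDeg: number): number {
  const diff = Math.abs(indicatedDeg - reportedDeg) % 360;
  return diff > 180 ? 360 - diff : diff;
}

console.log(angularDeviation(350, 10)); // 20
console.log(angularDeviation(90, 270)); // 180
```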


3.8 Interaction errors

Most of the analyses in this chapter are directed at the physical properties of the human-machine interface, such as degrees of freedom in 2D or 3D or spatial and temporal relationships between input controllers and output displays. Human performance, although elaborated in Chapter 2, has not entered into discussions here, except through secondary observations that certain interactions are better, worse, awkward, or unintuitive. At the end of the day, however, human performance is what counts. Physical properties, although instructive and essential, are secondary. Put another way, human performance is like food, while physical properties are like plates and bowls. It is good and nutritious food that we strive for. Empirical research in HCI is largely about finding the physical properties and combinations that improve and enhance human performance.

We conclude this chapter on interaction elements with comments on that nagging aspect of human performance that frustrates users: interaction errors. Although the time to complete a task can enhance or hinder by degree, errors only hinder. Absence of errors is, for the most part, invisible. As it turns out, errors—interaction errors—are germane to the HCI experience. The big errors are the easy ones—they get fixed. It is the small errors that are interesting.

As the field of HCI matures, a common view that emerges is that the difficult problems (in desktop computing) are solved, and now researchers should focus on new frontiers: mobility, surface computing, ubiquitous computing, online social networking, gaming, and so on. This view is partially correct. Yes, the emerging themes are exciting and fertile ground for HCI research, and many frustrating UI problems from the old days are gone. But desktop computing is still fraught with problems, lots of them. Let's examine a few of these. Although the examples below are from desktop computing, there are counterparts in mobile computing. See also student exercise 3-8 at the end of this chapter.

The four examples developed in the following discussion were chosen for a specific reason. There is a progression between them. In severity, they range from serious problems causing loss of information to innocuous problems that most users rarely think about and may not even notice. In frequency, they range from rarely, if ever, occurring any more, to occurring perhaps multiple times every minute while users engage in computing activities. The big, bad problems are well-traveled in the literature, with many excellent sources providing deep analyses on what went wrong and why (e.g., Casey, 1998, 2006; Cooper, 1999; Johnson, 2007;

FIGURE 3.42 HCI has come a long way: (a) Today’s UIs consistently use the same, predictable dialog to alert the user to a potential loss of information. (b) Legacy dialog rarely (if ever) seen today.


B. H. Kantowitz and Sorkin, 1983; Norman, 1988). While the big problems get lots of attention, and generally get fixed, the little ones tend to linger. We’ll see the effect shortly. Let’s begin with one of the big problems. Most users have, at some point, lost information while working on their computers. Instead of saving new work, it was mistakenly discarded, overwritten, or lost in some way. Is there any user who has not experienced this? Of course, nearly everyone has a story of losing data in some silly way. Perhaps there was a distraction. Perhaps they just didn’t know what happened. It doesn’t matter. It happened. An example is shown in Figure 3.42. A dialog box pops up and the user responds a little too quickly. Press enter with the “Save changes?” dialog box (Figure 3.42a) and all is well, but the same response with the “Discard changes?” dialog box spells disaster (Figure 3.42b). The information is lost. This scenario, told by Cooper (1999, 14), is a clear and serious UI design flaw. The alert reader will quickly retort, “Yes, but if the ‘Discard changes?’ dialog box defaults to ‘No,’ the information is safe.” But that misses the point. The point is that a user expectation is broken. Broken expectations sooner or later cause errors. Today, systems and applications consistently use the “Save changes?” dialog box in Figure 3.42a. With time and experience, user expectations emerge and congeal. The “Save changes?” dialog box is expected, so we act without hesitating and all is well. But new users have no experiences, no expectations. They will develop them sure enough, but there will be some scars along the way. Fortunately, serious flaws like the “Discard changes?” dialog box are rare in desktop applications today. The following is another error to consider. If prompted to enter a password, and caps_lock mode is in effect, logging on will fail and the password must be reentered. The user may not know that caps_lock is on. Perhaps a key-stroking error occurred. The password is reentered, slowly and correctly, with the caps_lock mode still in effect. Oops! Commit the same error a third time and further log-on attempts may be blocked. This is not as serious as losing information by pressing enter in response to a renegade dialog box, but still, this is an interaction error. Or is it a design flaw? It is completely unnecessary, it is a nuisance, it slows our interaction, and it is easy to correct. Today, many systems have corrected this problem (Figure 3.43a), while others have not (Figure 3.43b). The caps_lock error is not so bad. But it’s bad enough that it occasionally receives enough attention to be the beneficiary of the few extra lines of code necessary to pop up a caps_lock alert.
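To make the remedy concrete, the sketch below shows one minimal way to implement such an alert in a web form, using the standard KeyboardEvent.getModifierState check. The element IDs are assumptions for illustration; this is not code from the text.

```typescript
// Minimal caps-lock alert for a password field (illustrative; element ids assumed).
const password = document.querySelector<HTMLInputElement>("#password");
const warning = document.querySelector<HTMLElement>("#caps-warning");

password?.addEventListener("keyup", (event: KeyboardEvent) => {
  // getModifierState reports whether CAPS LOCK was engaged for this key event.
  const capsLockOn = event.getModifierState("CapsLock");
  if (warning) {
    warning.hidden = !capsLockOn; // show the "Caps Lock is on" message only while it matters
  }
});
```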

FIGURE 3.43 Entering a password: (a) Many systems alert the user if CAPS_LOCK is on. (b) Others do not.


Let’s examine another small problem. In editing a document, suppose the user wishes move some text to another location in the document. The task is easy. With the pointer positioned at the beginning of the text, the user presses and holds the primary mouse button and begins dragging. But the text spans several lines and extends past the viewable region. As the dragging extent approaches the edge of the viewable region, the user is venturing into a difficult situation. The interaction is about to change dramatically. (See Figure 3.44.) Within the viewable region, the interaction is position-control—the displacement of the mouse pointer controls the position of the dragging extent. As soon as the mouse pointer moves outside the viewable region, scrolling begins and the interaction becomes velocity-control—the displacement of the mouse pointer now controls the velocity of the dragging extent. User beware! Once in velocity-control mode, it is anyone’s guess what will happen. This is a design flaw. A quick check of several applications while working on this example revealed dramatically different responses to the transition from position control to velocity control. In one case, scrolling was so fast that the dragging region extended to the end of the document in less time than the user could react (≈200 ms). In another case, the velocity of scrolling was controllable but frustratingly slow. Can you think of a way to improve this interaction? A two-handed approach, perhaps. Any technique that gets the job done and allows the user to develop an expectation of the interaction is an improvement. Perhaps there is some empirical research waiting in this area. Whether the velocity-control is too sensitive or too sluggish really doesn’t matter. What matters is that the user experience is broken or awkward. Any pretense to the interaction being facile, seamless, or transparent is gone. The user will recover, and no information will be lost, but the interaction has degraded to error recovery. This is a design error or, at the very least, a design-induced error. Let’s move on to a very minor error. When an application or a dialog box is active, one of the UI components has focus and receives an event from the keyboard if a key is pressed. For buttons,

FIGURE 3.44 On the brink of hyper-speed scrolling. As the mouse pointer is dragged toward the edge of the viewable region, the user is precipitously close to losing control over the speed of dragging.
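One way to make the transition predictable, sketched below, is to tie the scroll velocity to how far the pointer has traveled past the edge and to cap it, so the change from position control to velocity control is gradual rather than abrupt. This is a sketch under assumed element names and constants, not an implementation from the text.

```typescript
// Sketch of capped, distance-proportional auto-scrolling during a drag selection.
// The element id and the constants are illustrative assumptions.
const pane = document.querySelector<HTMLElement>("#editor-pane")!;
const MAX_SPEED = 20; // pixels per animation frame, reached about 60 px past the edge
let pointerY = 0;
let dragging = false;

pane.addEventListener("pointerdown", (e) => { dragging = true; pointerY = e.clientY; });
window.addEventListener("pointermove", (e) => { pointerY = e.clientY; });
window.addEventListener("pointerup", () => { dragging = false; });

function autoScroll(): void {
  if (dragging) {
    const rect = pane.getBoundingClientRect();
    const below = pointerY - rect.bottom; // how far the pointer is past the bottom edge
    const above = rect.top - pointerY;    // how far the pointer is past the top edge
    if (below > 0) pane.scrollTop += Math.min(MAX_SPEED, below / 3);
    if (above > 0) pane.scrollTop -= Math.min(MAX_SPEED, above / 3);
  }
  requestAnimationFrame(autoScroll);
}
requestAnimationFrame(autoScroll);
```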


Let's move on to a very minor error. When an application or a dialog box is active, one of the UI components has focus and receives an event from the keyboard if a key is pressed. For buttons, focus is usually indicated with a dashed border (see "Yes" button in Figure 3.42). For input fields, it is usually indicated with a flashing insertion bar ("|"). "Focus advancement" refers to the progression of focus from one UI component to the next. There is widespread inconsistency in current applications in the way UI widgets acquire and lose focus and in the way focus advances from one component to the next. The user is in trouble most of the time.

Here is a quick example. When a login dialog box pops up, can you immediately begin to enter your username and password? Sometimes yes, sometimes no. In the latter case, the entry field does not have focus. The user must click in the field with the mouse pointer or press tab to advance the focus point to the input field. Figure 3.43 provides examples. Both are real interfaces. The username field in (a) appears with focus; the same field in (b) appears without focus. The point is simply that users don't know. This is a small problem (or is it an interaction error?), but it is entirely common. Focus uncertainty is everywhere in today's user interfaces.

Here is another, more specific example: Many online activities, such as reserving an airline ticket or booking a vacation, require a user to enter data into a form. The input fields often require very specific information, such as a two-digit month, a seven-digit account number, and so on. When the information is entered, does focus advance automatically or is a user action required? Usually, we just don't know. So we remain "on guard." Figure 3.45 gives a real example from a typical login dialog box. The user is first requested to enter an account number. Account numbers are nine digits long, in three three-digit segments. After seeing the dialog box, the user looks at the keyboard and begins entering: 9, 8, 0, and then what? Chances are the user is looking at the keyboard while entering the numeric account number. Even though the user can enter the entire nine digits at once, interaction is halted after the first three-digit group because the user doesn't know if the focus will automatically advance to the next field. There are no expectations here, because this example of GUI interaction has not evolved and stabilized to a consistent pattern. Data entry fields have not reached the evolutionary status of, for example, dialog boxes for saving versus discarding changes (Figure 3.42a). The user either acts, with an approximately 50 percent likelihood of committing an error, or pauses to attend to the display (Has the focus advanced to the next field?).

Strictly speaking, there is no gulf of evaluation here. Although not shown in the figure, the insertion point is present. After entering 980, the insertion point is either after the 0 in the first field, if focus did not advance, or at the beginning of the next field, if focus advanced. So the system does indeed "provide a physical

FIGURE 3.45 Inconsistent focus advancement keeps the user on guard. “What do I do next?”


representation that can be perceived and that is directly interpretable in terms of the intentions and expectations of the person” (Norman, 1988, p. 51). That’s not good enough. The user’s attention is on the keyboard while the physical presentation is on the system’s display. The disconnect is small, but nevertheless, a shift in the user’s attention is required. The absence of expectations keeps the user on guard. The user is often never quite sure what to do or what to expect. The result is a slight increase in the attention demanded during interaction, which produces a slight decrease in transparency. Instead of engaging in the task, attention is diverted to the needs of the computer. The user is like a wood carver who sharpens tools rather than creates works of art. Where the consequences of errors are small, such as an extra button click or a gaze shift, errors tend to linger. For the most part, these errors aren’t on anyone’s radar. The programmers who build the applications have bigger problems to focus on, like working on their checklist of new features to add to version 2.0 of the application before an impending deadline.22 The little errors persist. Often, programmers’ discretion rules the day (Cooper, 1999, p. 47). An interaction scenario that makes sense to the programmer is likely to percolate through to the final product, particularly if it is just a simple thing like focus advancement. Do programmers ever discuss the nuances of focus advancement in building a GUI? Perhaps. But was the discussion framed in terms of the impact on the attention or gaze shifts imposed on the user? Not likely. Each time a user shifts his or her attention (e.g., from the keyboard to the display and back), the cost is two gaze shifts. Each gaze shift, or saccade, takes from 70 to 700 ms (Card et al., 1983, p. 28).23 These little bits of interaction add up. They are the fine-grained details—the microstructures and microstrategies used by, or imposed on, the user. “Microstrategies focus on what designers would regard as the mundane aspects of interface design; the ways in which subtle features of interactive technology influence the ways in which users perform tasks” (W. D. Gray and Boehm-Davis, 2000, p. 322). Designers might view these fine-grained details as a mundane sidebar to the bigger goal, but the reality is different. Details are everything. User experiences exist as collections of microstrategies. Whether booking a vacation online or just hanging out with friends on a social networking site, big actions are collections of little actions. To the extent possible, user actions form the experience, our experience. It is unfortunate that they often exist simply to serve the needs of the computer or application.
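For completeness, here is a minimal sketch of consistent automatic focus advancement for the segmented account-number field discussed above, assuming a web form whose three inputs share a class name and declare a maxlength of 3. It is an illustration of the pattern, not a prescription from the text.

```typescript
// Automatic focus advancement for a segmented account-number field (illustrative;
// assumes three <input class="account-segment" maxlength="3"> elements in order).
const segments = Array.from(
  document.querySelectorAll<HTMLInputElement>(".account-segment")
);

segments.forEach((field, i) => {
  field.addEventListener("input", () => {
    const next = segments[i + 1];
    // When this segment is full, advance focus so the user can keep typing without looking up.
    if (next && field.value.length >= field.maxLength) {
      next.focus();
    }
  });
});
```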

22 The reader who detects a modicum of sarcasm here is referred to Cooper (1999, 47–48 and elsewhere) for a full frontal assault on the insidious nature of feature bloat in software applications. The reference to version 2.0 of a nameless application is in deference to Johnson's second edition of his successful book where the same tone appears in the title: GUI Bloopers 2.0. For a more sober and academic look at software bloat, feature creep, and the like, see McGrenere and Moore (2000).

23 An eye movement involves both a saccade and fixation. A saccade—the actual movement of the eye—is fast, about 30 ms. Fixations take longer, as they involve perceiving the new stimulus and cognitive processing of the stimulus.


Another reason little errors tend to linger is that they are often deemed user errors, not design, programming, or system errors. These errors, like most, are more correctly called design-induced errors (Casey, 2006, p. 12). They occur “when designers of products, systems, or services fail to account for the characteristics and capabilities of people and the vagaries of human behavior” (Casey, 1998, p. 11). We should all do a little better. Figure 3.46 illustrates a tradeoff between the cost of errors and the frequency of errors. There is no solid ground here, so it’s just a sketch. The four errors described above are shown. The claim is that high-cost errors occur with low frequency. They receive a lot of attention and they get dealt with. As systems mature and the big errors get fixed, designers shift their efforts to fixing less costly errors, like the caps_lock design-induced error, or consistently implementing velocity-controlled scrolling. Over time, more and more systems include reasonable and appropriate implementations of these interactions. Divergence in the implementations diminishes and, taken as a whole, there is an industry-wide coalescing toward the same consistent implementation (e.g., a popup alert for caps_lock). The ground is set for user expectation to take hold. Of the errors noted in Figure 3.46, discard changes is ancient history (in computing terms), caps_lock is still a problem but is improving, scrolling frenzy is much more controlled in new applications, and focus uncertainty is, well, a mess. The cost is minor, but the error happens frequently. In many ways, the little errors are the most interesting, because they slip past designers and programmers. A little self-observation and reflection goes a long way here. Observe little errors that you encounter. What were you trying to do? Did it work the first time, just as expected? Small interactions are revealing. What were your hands and eyes doing? Were your interactions quick and natural, or were there unnecessary or awkward steps? Could a slight reworking of the interaction help? Could an attention shift be averted with the judicious use of auditory or tactile feedback? Is there a “ready for input” auditory signal that could sound when an input field receives focus? Could this reduce the need for an attention shift? Would this improve user performance? Would it improve the user experience? Would users like it, or would it be annoying? The little possibilities add up. Think of them as opportunities for empirical research in HCI.

FIGURE 3.46 Trade-off between the cost of errors and the frequency of errors.

THE DESIGN OF EVERYDAY THINGS

CHAPTER FIVE

HUMAN ERROR? NO, BAD DESIGN

Most industrial accidents are caused by human error: estimates range between 75 and 95 percent. How is it that so many people are so incompetent? Answer: They aren't. It's a design problem. If the number of accidents blamed upon human error were 1 to 5 percent, I might believe that people were at fault. But when the percentage is so high, then clearly other factors must be involved. When something happens this frequently, there must be another underlying factor.

When a bridge collapses, we analyze the incident to find the causes of the collapse and reformulate the design rules to ensure that form of accident will never happen again. When we discover that electronic equipment is malfunctioning because it is responding to unavoidable electrical noise, we redesign the circuits to be more tolerant of the noise. But when an accident is thought to be caused by people, we blame them and then continue to do things just as we have always done.

Physical limitations are well understood by designers; mental limitations are greatly misunderstood. We should treat all failures in the same way: find the fundamental causes and redesign the system so that these can no longer lead to problems. We design equipment that requires people to be fully alert and attentive for hours, or to remember archaic, confusing procedures even if they are only used infrequently, sometimes only once in a lifetime. We put people in boring environments with nothing to do for hours on end, until suddenly they must respond quickly and accurately. Or we subject them to complex, high-workload environments, where they are continually interrupted while having to do multiple tasks simultaneously. Then we wonder why there is failure.

Even worse is that when I talk to the designers and administrators of these systems, they admit that they too have nodded off while supposedly working. Some even admit to falling asleep for an instant while driving. They admit to turning the wrong stove burners on or off in their homes, and to other small but significant errors. Yet when their workers do this, they blame them for "human error." And when employees or customers have similar issues, they are blamed for not following the directions properly, or for not being fully alert and attentive.

Understanding Why There Is Error

Error occurs for many reasons. The most common is in the nature of the tasks and procedures that require people to behave in unnatural ways—staying alert for hours at a time, providing precise, accurate control specifications, all the while multitasking, doing several things at once, and subjected to multiple interfering activities. Interruptions are a common reason for error, not helped by designs and procedures that assume full, dedicated attention yet that do not make it easy to resume operations after an interruption. And finally, perhaps the worst culprit of all, is the attitude of people toward errors.

When an error causes a financial loss or, worse, leads to an injury or death, a special committee is convened to investigate the cause and, almost without fail, guilty people are found. The next step is to blame and punish them with a monetary fine, or by firing or jailing them. Sometimes a lesser punishment is proclaimed: make the guilty parties go through more training. Blame and punish; blame and train. The investigations and resulting punishments feel good: "We caught the culprit." But it doesn't cure the problem: the same error will occur over and over again. Instead, when an error happens, we should determine why, then redesign the product or the procedures being followed so that it will never occur again or, if it does, so that it will have minimal impact.

ROOT CAUSE ANALYSIS

Root cause analysis is the name of the game: investigate the accident until the single, underlying cause is found. What this ought to mean is that when people have indeed made erroneous decisions or actions, we should determine what caused them to err. This is what root cause analysis ought to be about. Alas, all too often it stops once a person is found to have acted inappropriately.

Trying to find the cause of an accident sounds good but it is flawed for two reasons. First, most accidents do not have a single cause: there are usually multiple things that went wrong, multiple events that, had any one of them not occurred, would have prevented the accident. This is what James Reason, the noted British authority on human error, has called the "Swiss cheese model of accidents" (shown in Figure 5.3 of this chapter on page 208, and discussed in more detail there). Second, why does the root cause analysis stop as soon as a human error is found? If a machine stops working, we don't stop the analysis when we discover a broken part. Instead, we ask: "Why did the part break? Was it an inferior part? Were the required specifications too low? Did something apply too high a load on the part?" We keep asking questions until we are satisfied that we understand the reasons for the failure: then we set out to remedy them. We should do the same thing when we find human error: We should discover what led to the error. When root cause analysis discovers a human error in the chain, its work has just begun: now we apply the analysis to understand why the error occurred, and what can be done to prevent it.

One of the most sophisticated airplanes in the world is the US Air Force's F-22. However, it has been involved in a number of accidents, and pilots have complained that they suffered oxygen deprivation (hypoxia). In 2010, a crash destroyed an F-22 and killed the pilot. The Air Force investigation board studied the incident and two years later, in 2012, released a report that blamed the accident on pilot error: "failure to recognize and initiate a timely dive recovery due to channelized attention, breakdown of visual scan and unrecognized spatial distortion." In 2013, the Inspector General's office of the US Department of Defense reviewed the Air Force's findings, disagreeing with the assessment. In my opinion, this time a proper root cause analysis was done. The Inspector General asked "why sudden incapacitation or unconsciousness was not considered a contributory factor." The Air Force, to nobody's surprise, disagreed with the criticism. They argued that they had done a thorough review and that their conclusion "was supported by clear and convincing evidence." Their only fault was that the report "could have been more clearly written."

It is only slightly unfair to parody the two reports this way:

Air Force: It was pilot error—the pilot failed to take corrective action.
Inspector General: That's because the pilot was probably unconscious.
Air Force: So you agree, the pilot failed to correct the problem.

THE FIVE WHYS

Root cause analysis is intended to determine the underlying cause of an incident, not the proximate cause. The Japanese have long followed a procedure for getting at root causes that they call the "Five Whys," originally developed by Sakichi Toyoda and used by the Toyota Motor Company as part of the Toyota Production System for improving quality. Today it is widely deployed. Basically, it means that when searching for the reason, even after you have found one, do not stop: ask why that was the case. And then ask why again. Keep asking until you have uncovered the true underlying causes. Does it take exactly five? No, but calling the procedure "Five Whys" emphasizes the need to keep going even after a reason has been found. Consider how this might be applied to the analysis of the F-22 crash:

Five Whys: Questions and Answers

Q1: Why did the plane crash?
    A: Because it was in an uncontrolled dive.
Q2: Why didn't the pilot recover from the dive?
    A: Because the pilot failed to initiate a timely recovery.
Q3: Why was that?
    A: Because he might have been unconscious (or oxygen deprived).
Q4: Why was that?
    A: We don't know. We need to find out.
Etc.
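For readers who like to keep investigation logs in software, the chain above is easy to represent programmatically. The sketch below is only illustrative: the function, the lookup table, and the stopping rule are my own, not part of the Toyota procedure.

def five_whys(incident, answer_why, max_depth=5):
    """Repeatedly ask 'why' about an incident, recording each answer.

    answer_why is a callable that maps a statement to the cause behind it,
    or returns None when no further cause is known (the real work, of
    course, lies in supplying honest answers).
    """
    chain = [incident]
    for _ in range(max_depth):
        cause = answer_why(chain[-1])
        if cause is None:  # no known cause yet: keep investigating, not blaming
            break
        chain.append(cause)
    return chain

# The F-22 chain from the table above, encoded as a simple lookup:
known_causes = {
    "The plane crashed": "It was in an uncontrolled dive",
    "It was in an uncontrolled dive": "The pilot failed to initiate a timely recovery",
    "The pilot failed to initiate a timely recovery": "The pilot may have been unconscious or oxygen deprived",
}
print(five_whys("The plane crashed", known_causes.get))

The point of the exercise is the stopping rule: the chain does not end simply because a human appears in it.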

The Five Whys of this example are only a partial analysis. For example, we need to know why the plane was in a dive (the report explains this, but it is too technical to go into here; suffice it to say that it, too, suggests that the dive was related to a possible oxygen deprivation). The Five Whys do not guarantee success. The question why is ambiguous and can lead to different answers by different investigators. There is still a tendency to stop too soon, perhaps when the limit of the investigator’s understanding has been reached. It also tends to emphasize the need to find a single cause for an incident, whereas most complex events have multiple, complex causal factors. Nonetheless, it is a powerful technique. The tendency to stop seeking reasons as soon as a human error has been found is widespread. I once reviewed a number of accidents in which highly trained workers at an electric utility company had been electrocuted when they contacted or came too close to the high-voltage lines they were servicing. All the investigating committees found the workers to be at fault, something even the workers (those who had survived) did not dispute. But when the committees were investigating the complex causes of the incidents, why did they stop once they found a human error? Why didn’t they keep going to find out why the error had occurred, what circumstances had led to it, and then, why those circumstances had happened? The committees never went far enough to find the deeper, root causes of the accidents. Nor did they consider redesigning the systems and procedures to make the incidents 166


either impossible or far less likely. When people err, change the system so that type of error will be reduced or eliminated. When complete elimination is not possible, redesign to reduce the impact. It wasn’t difficult for me to suggest simple changes to procedures that would have prevented most of the incidents at the utility company. It had never occurred to the committee to think of this. The problem is that to have followed my recommendations would have meant changing the culture from an attitude among the field workers that “We are supermen: we can solve any problem, repair the most complex outage. We do not make errors.” It is not possible to eliminate human error if it is thought of as a personal failure rather than as a sign of poor design of procedures or equipment. My report to the company executives was received politely. I was even thanked. Several years later I contacted a friend at the company and asked what changes they had made. “No changes,” he said. “And we are still injuring people.” One big problem is that the natural tendency to blame someone for an error is even shared by those who made the error, who often agree that it was their fault. People do tend to blame themselves when they do something that, after the fact, seems inexcusable. “I knew better,” is a common comment by those who have erred. But when someone says, “It was my fault, I knew better,” this is not a valid analysis of the problem. That doesn’t help prevent its recurrence. When many people all have the same problem, shouldn’t another cause be found? If the system lets you make the error, it is badly designed. And if the system induces you to make the error, then it is really badly designed. When I turn on the wrong stove burner, it is not due to my lack of knowledge: it is due to poor mapping between controls and burners. Teaching me the relationship will not stop the error from recurring: redesigning the stove will. We can’t fix problems unless people admit they exist. When we blame people, it is then difficult to convince organizations to restructure the design to eliminate these problems. After all, if a person is at fault, replace the person. But seldom is this the case: usually the system, the procedures, and social pressures have led five: Human Error? No, Bad Design 167

to the problems, and the problems won’t be fixed without addressing all of these factors. Why do people err? Because the designs focus upon the requirements of the system and the machines, and not upon the requirements of people. Most machines require precise commands and guidance, forcing people to enter numerical information perfectly. But people aren’t very good at great precision. We frequently make errors when asked to type or write sequences of numbers or letters. This is well known: so why are machines still being designed that require such great precision, where pressing the wrong key can lead to horrendous results? People are creative, constructive, exploratory beings. We are particularly good at novelty, at creating new ways of doing things, and at seeing new opportunities. Dull, repetitive, precise requirements fight against these traits. We are alert to changes in the environment, noticing new things, and then thinking about them and their implications. These are virtues, but they get turned into negative features when we are forced to serve machines. Then we are punished for lapses in attention, for deviating from the tightly prescribed routines. A major cause of error is time stress. Time is often critical, especially in such places as manufacturing or chemical processing plants and hospitals. But even everyday tasks can have time pressures. Add environmental factors, such as poor weather or heavy traffic, and the time stresses increase. In commercial establishments, there is strong pressure not to slow the processes, because doing so would inconvenience many, lead to significant loss of money, and, in a hospital, possibly decrease the quality of patient care. There is a lot of pressure to push ahead with the work even when an outside observer would say it was dangerous to do so. In many industries, if the operators actually obeyed all the procedures, the work would never get done. So we push the boundaries: we stay up far longer than is natural. We try to do too many tasks at the same time. We drive faster than is safe. Most of the time we manage okay. We might even be rewarded and praised for our he-



roic efforts. But when things go wrong and we fail, then this same behavior is blamed and punished.

Deliberate Violations Errors are not the only type of human failures. Sometimes people knowingly take risks. When the outcome is positive, they are often rewarded. When the result is negative, they might be punished. But how do we classify these deliberate violations of known, proper behavior? In the error literature, they tend to be ignored. In the accident literature, they are an important component. Deliberate deviations play an important role in many accidents. They are defined as cases where people intentionally violate procedures and regulations. Why do they happen? Well, almost every one of us has probably deliberately violated laws, rules, or even our own best judgment at times. Ever go faster than the speed limit? Drive too fast in the snow or rain? Agree to do some hazardous act, even while privately thinking it foolhardy to do so? In many industries, the rules are written more with a goal toward legal compliance than with an understanding of the work requirements. As a result, if workers followed the rules, they couldn’t get their jobs done. Do you sometimes prop open locked doors? Drive with too little sleep? Work with co-workers even though you are ill (and might therefore be infectious)? Routine violations occur when noncompliance is so frequent that it is ignored. Situational violations occur when there are special circumstances (example: going through a red light “because no other cars were visible and I was late”). In some cases, the only way to complete a job might be to violate a rule or procedure. A major cause of violations is inappropriate rules or procedures that not only invite violation but encourage it. Without the violations, the work could not be done. Worse, when employees feel it necessary to violate the rules in order to get the job done and, as a result, succeed, they will probably be congratulated and rewarded. This, of course, unwittingly rewards noncompliance. Cultures that encourage and commend violations set poor role models.


Although violations are a form of error, these are organizational and societal errors, important but outside the scope of the design of everyday things. The human error examined here is unintentional: deliberate violations, by definition, are intentional deviations that are known to be risky, with the potential of doing harm.

Two Types of Errors: Slips and Mistakes

Many years ago, the British psychologist James Reason and I developed a general classification of human error. We divided human error into two major categories: slips and mistakes (Figure 5.1). This classification has proved to be of value for both theory and practice. It is widely used in the study of error in such diverse areas as industrial and aviation accidents, and medical errors. The discussion gets a little technical, so I have kept technicalities to a minimum. This topic is of extreme importance to design, so stick with it.

DEFINITIONS: ERRORS, SLIPS, AND MISTAKES

Human error is defined as any deviance from "appropriate" behavior. The word appropriate is in quotes because in many circumstances, the appropriate behavior is not known or is only determined after the fact. But still, error is defined as deviance from the generally accepted correct or appropriate behavior. Error is the general term for all wrong actions. There are two major classes of error: slips and mistakes, as shown in Figure 5.1; slips are further divided into two major classes and mistakes into three. These categories of errors all have different implications for design. I now turn to a more detailed look at these classes of errors and their design implications.

FIGURE 5.1. Classification of Errors. Errors have two major forms. Slips occur when the goal is correct, but the required actions are not done properly: the execution is flawed. Mistakes occur when the goal or plan is wrong. Slips and mistakes can be further divided based upon their underlying causes. Memory lapses can lead to either slips or mistakes, depending upon whether the memory failure was at the highest level of cognition (mistakes) or at lower (subconscious) levels (slips). Although deliberate violations of procedures are clearly inappropriate behaviors that often lead to accidents, these are not considered as errors (see discussion in text).

SLIPS

A slip occurs when a person intends to do one action and ends up doing something else. With a slip, the action performed is not the same as the action that was intended. There are two major classes of slips: action-based and memory-lapse. In action-based slips, the wrong action is performed. In lapses, memory fails, so the intended action is not done or its results are not evaluated. Action-based slips and memory lapses can be further classified according to their causes.

Example of an action-based slip: I poured some milk into my coffee and then put the coffee cup into the refrigerator. This is the correct action applied to the wrong object.

Example of a memory-lapse slip: I forgot to turn off the gas burner on my stove after cooking dinner.

MISTAKES

A mistake occurs when the wrong goal is established or the wrong plan is formed. From that point on, even if the actions are executed properly, they are part of the error, because the actions themselves are inappropriate—they are part of the wrong plan. With a mistake, the action that is performed matches the plan: it is the plan that is wrong. Mistakes have three major classes: rule-based, knowledge-based, and memory-lapse. In a rule-based mistake, the person has appropriately diagnosed the situation but then decided upon an erroneous course of action: the wrong rule is being followed. In a knowledge-based mistake, the problem is misdiagnosed because of erroneous or incomplete knowledge. Memory-lapse mistakes take place when there is forgetting at the stages of goals, plans, or evaluation. Two of the mistakes leading to the "Gimli Glider" Boeing 767 emergency landing were:

Example of a knowledge-based mistake: The weight of fuel was computed in pounds instead of kilograms.

Example of a memory-lapse mistake: A mechanic failed to complete troubleshooting because of distraction.
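Before moving on, the taxonomy just defined can be summarized compactly in code, which some readers may find easier to scan than prose. The enum and class below are my own illustrative names, not part of any standard error-analysis toolkit.

from enum import Enum, auto
from dataclasses import dataclass

class ErrorClass(Enum):
    # Slips: the goal is correct, but the execution is flawed.
    ACTION_SLIP = auto()
    MEMORY_LAPSE_SLIP = auto()
    # Mistakes: the goal or plan itself is wrong.
    RULE_BASED_MISTAKE = auto()
    KNOWLEDGE_BASED_MISTAKE = auto()
    MEMORY_LAPSE_MISTAKE = auto()

@dataclass
class HumanError:
    description: str
    category: ErrorClass

    @property
    def is_slip(self) -> bool:
        return self.category in (ErrorClass.ACTION_SLIP, ErrorClass.MEMORY_LAPSE_SLIP)

# The two Gimli Glider mistakes from the text, classified:
examples = [
    HumanError("Fuel weight computed in pounds instead of kilograms",
               ErrorClass.KNOWLEDGE_BASED_MISTAKE),
    HumanError("Mechanic distracted, troubleshooting left incomplete",
               ErrorClass.MEMORY_LAPSE_MISTAKE),
]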

ERROR AND THE SEVEN STAGES OF ACTION

Errors can be understood through reference to the seven stages of the action cycle of Chapter 2 (Figure 5.2). Mistakes are errors in setting the goal or plan, and in comparing results with expectations—the higher levels of cognition. Slips happen in the execution of a plan, or in the perception or interpretation of the outcome—the lower stages. Memory lapses can happen at any of the eight transitions between stages, shown by the X's in Figure 5.2B. A memory lapse at one of these transitions stops the action cycle from proceeding, so the desired action is not completed.

FIGURE 5.2. Where Slips and Mistakes Originate in the Action Cycle. Figure A shows that action slips come from the bottom four stages of the action cycle and mistakes from the top three stages. Memory lapses impact the transitions between stages (shown by the X's in Figure B). Memory lapses at the higher levels lead to mistakes, and lapses at the lower levels lead to slips.

Slips are the result of subconscious actions getting waylaid en route. Mistakes result from conscious deliberations. The same processes that make us creative and insightful by allowing us to see relationships between apparently unrelated things, that let us leap to correct conclusions on the basis of partial or even faulty evidence, also lead to mistakes. Our ability to generalize from small amounts of information helps tremendously in new situations; but sometimes we generalize too rapidly, classifying a new situation as similar to an old one when, in fact, there are significant discrepancies. This leads to mistakes that can be difficult to discover, let alone eliminate.

The Classification of Slips

A colleague reported that he went to his car to drive to work. As he drove away, he realized that he had forgotten his briefcase, so he turned around and went back. He stopped the car, turned off the engine, and unbuckled his wristwatch. Yes, his wristwatch, instead of his seatbelt.

The story illustrates both a memory-lapse slip and an action slip. The forgetting of the briefcase is a memory-lapse slip. The unbuckling of the wristwatch is an action slip, in this case a combination of description-similarity and capture error (described later in this chapter). Most everyday errors are slips. Intending to do one action, you find yourself doing another. When a person says something clearly and distinctly to you, you “hear” something quite different. The study of slips is the study of the psychology of everyday errors— what Freud called “the psychopathology of everyday life.” Freud believed that slips have hidden, dark meanings, but most are accounted for by rather simple mental mechanisms. An interesting property of slips is that, paradoxically, they tend to occur more frequently to skilled people than to novices. Why? Because slips often result from a lack of attention to the task. Skilled people—experts—tend to perform tasks automatically, under subconscious control. Novices have to pay considerable conscious attention, resulting in a relatively low occurrence of slips. five: Human Error? No, Bad Design 173

Some slips result from the similarities of actions. Or an event in the world may automatically trigger an action. Sometimes our thoughts and actions may remind us of unintended actions, which we then perform. There are numerous different kinds of action slips, categorized by the underlying mechanisms that give rise to them. The three most relevant to design are:

• capture slips
• description-similarity slips
• mode errors

CAPTURE SLIPS

I was using a copying machine, and I was counting the pages. I found myself counting, “1, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King.” I had been playing cards recently.

The capture slip is defined as the situation where, instead of the desired activity, a more frequently or recently performed one gets done instead: it captures the activity. Capture errors require that part of the action sequences involved in the two activities be identical, with one sequence being far more familiar than the other. After doing the identical part, the more frequent or more recent activity continues, and the intended one does not get done. Seldom, if ever, does the unfamiliar sequence capture the familiar one. All that is needed is a lapse of attention to the desired action at the critical junction when the identical portions of the sequences diverge into the two different activities. Capture errors are, therefore, partial memory-lapse errors. Interestingly, capture errors are more prevalent in experienced skilled people than in beginners, in part because the experienced person has automated the required actions and may not be paying conscious attention when the intended action deviates from the more frequent one. Designers need to avoid procedures that have identical opening steps but then diverge. The more experienced the workers, the more likely they are to fall prey to capture. Whenever possible, sequences should be designed to differ from the very start. 174


DESCRIPTION-SIMILARITY SLIPS

A former student reported that one day he came home from jogging, took off his sweaty shirt, and rolled it up in a ball, intending to throw it in the laundry basket. Instead he threw it in the toilet. (It wasn’t poor aim: the laundry basket and toilet were in different rooms.)

In the slip known as a description-similarity slip, the error is to act upon an item similar to the target. This happens when the description of the target is sufficiently vague. Much as we saw in Chapter 3, Figure 3.1, where people had difficulty distinguishing among different images of money because their internal descriptions did not have sufficient discriminating information, the same thing can happen to us, especially when we are tired, stressed, or overloaded. In the example that opened this section, both the laundry basket and the toilet bowl are containers, and if the description of the target was sufficiently ambiguous, such as “a large enough container,” the slip could be triggered. Remember the discussion in Chapter 3 that most objects don’t need precise descriptions, simply enough precision to distinguish the desired target from alternatives. This means that a description that usually suffices may fail when the situation changes so that multiple similar items now match the description. Descriptionsimilarity errors result in performing the correct action on the wrong object. Obviously, the more the wrong and right objects have in common, the more likely the errors are to occur. Similarly, the more objects present at the same time, the more likely the error. Designers need to ensure that controls and displays for different purposes are significantly different from one another. A lineup of identical-looking switches or displays is very apt to lead to description-similarity error. In the design of airplane cockpits, many controls are shape coded so that they both look and feel different from one another: the throttle levers are different from the flap levers (which might look and feel like a wing flap), which are different from the landing gear control (which might look and feel like a wheel). five: Human Error? No, Bad Design 175

MEMORY-LAPSE SLIPS

Errors caused by memory failures are common. Consider these examples:

• Making copies of a document, walking off with the copy, but leaving the original inside the machine.
• Forgetting a child. This error has numerous examples, such as leaving a child behind at a rest stop during a car trip, or in the dressing room of a department store, or a new mother forgetting her one-month-old and having to go to the police for help in finding the baby.
• Losing a pen because it was taken out to write something, then put down while doing some other task. The pen is forgotten in the activities of putting away a checkbook, picking up goods, talking to a salesperson or friends, and so on. Or the reverse: borrowing a pen, using it, and then putting it away in your pocket or purse, even though it is someone else's (this is also a capture error).
• Using a bank or credit card to withdraw money from an automatic teller machine, then walking off without the card, is such a frequent error that many machines now have a forcing function: the card must be removed before the money will be delivered. Of course, it is then possible to walk off without the money, but this is less likely than forgetting the card, because money is the goal of using the machine (a sketch of this forcing function follows the list).
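For the programmers in the audience, that card-before-cash interlock can be expressed as a tiny state machine. The sketch below is purely illustrative, with invented class and method names; it is not modeled on any real ATM software.

class ATM:
    """Toy model of the card-before-cash forcing function described above."""

    def __init__(self):
        self.card_inserted = False
        self.cash_requested = 0

    def insert_card(self):
        self.card_inserted = True

    def request_cash(self, amount: int):
        self.cash_requested = amount

    def eject_card(self):
        self.card_inserted = False

    def dispense_cash(self) -> int:
        # Forcing function: the machine refuses to pay out while the card
        # is still inside, so the user cannot walk away without it.
        if self.card_inserted:
            raise RuntimeError("Remove your card before cash is dispensed")
        amount, self.cash_requested = self.cash_requested, 0
        return amount

# Usage: the card must come out before the money does.
atm = ATM()
atm.insert_card()
atm.request_cash(100)
atm.eject_card()
cash = atm.dispense_cash()  # succeeds only after the card is removed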

Memory lapses are common causes of error. They can lead to several kinds of errors: failing to do all of the steps of a procedure; repeating steps; forgetting the outcome of an action; or forgetting the goal or plan, thereby causing the action to be stopped. The immediate cause of most memory-lapse failures is interruptions, events that intervene between the time an action is decided upon and the time it is completed. Quite often the interference comes from the machines we are using: the many steps required between the start and finish of the operations can overload the capacity of short-term or working memory.

There are several ways to combat memory-lapse errors. One is to minimize the number of steps; another, to provide vivid reminders of steps that need to be completed. A superior method is to use the forcing function of Chapter 4. For example, automated teller machines often require removal of the bank card before delivering the requested money: this prevents forgetting the bank card, capitalizing on the fact that people seldom forget the goal of the activity, in this case the money. With pens, the solution is simply to prevent their removal, perhaps by chaining public pens to the counter. Not all memory-lapse errors lend themselves to simple solutions. In many cases the interruptions come from outside the system, where the designer has no control.

MODE-ERROR SLIPS

A mode error occurs when a device has different states in which the same controls have different meanings: we call these states modes. Mode errors are inevitable in anything that has more possible actions than it has controls or displays; that is, the controls mean different things in the different modes. This is unavoidable as we add more and more functions to our devices. Ever turn off the wrong device in your home entertainment system? This happens when one control is used for multiple purposes. In the home, this is simply frustrating. In industry, the confusion that results when operators believe the system to be in one mode, when in reality it is in another, has resulted in serious accidents and loss of life. It is tempting to save money and space by having a single control serve multiple purposes. Suppose there are ten different functions on a device. Instead of using ten separate knobs or switches— which would take considerable space, add extra cost, and appear intimidatingly complex, why not use just two controls, one to select the function, the other to set the function to the desired condition? Although the resulting design appears quite simple and easy to use, this apparent simplicity masks the underlying complexity of use. The operator must always be completely aware of the mode, of what function is active. Alas, the prevalence of mode errors shows this assumption to be false. Yes, if I select a mode and then immediately adjust the parameters, I am not apt to be confused about the state. But what if I select the mode and then get interrupted five: Human Error? No, Bad Design 177

by other events? Or if the mode is maintained for considerable periods? Or, as in the case of the Airbus accident discussed below, the two modes being selected are very similar in control and function, but have different operating characteristics, which means that the resulting mode error is difficult to discover? Sometimes the use of modes is justifiable, such as the need to put many controls and displays in a small, restricted space, but whatever the reason, modes are a common cause of confusion and error. Alarm clocks often use the same controls and display for setting the time of day and the time the alarm should go off, and many of us have thereby set one when we meant the other. Similarly, when time is displayed on a twelve-hour scale, it is easy to set the alarm to go off at seven a.m. only later to discover that the alarm had been set for seven p.m. The use of “a.m.” and “p.m.” to distinguish times before and after noon is a common source of confusion and error, hence the common use of 24-hour time specification throughout most of the world (the major exceptions being North America, Australia, India, and the Philippines). Watches with multiple functions have similar problems, in this case required because of the small amount of space available for controls and displays. Modes exist in most computer programs, in our cell phones, and in the automatic controls of commercial aircraft. A number of serious accidents in commercial aviation can be attributed to mode errors, especially in aircraft that use automatic systems (which have a large number of complex modes). As automobiles become more complex, with the dashboard controls for driving, heating and air-conditioning, entertainment, and navigation, modes are increasingly common. An accident with an Airbus airplane illustrates the problem. The flight control equipment (often referred to as the automatic pilot) had two modes, one for controlling vertical speed, the other for controlling the flight path’s angle of descent. In one case, when the pilots were attempting to land, the pilots thought that they were controlling the angle of descent, whereas they had accidentally



selected the mode that controlled speed of descent. The number (–3.3) that was entered into the system to represent an appropriate angle (–3.3º) was too steep a rate of descent when interpreted as vertical speed (–3,300 feet/minute: –3.3º would only be –800 feet/ minute). This mode confusion contributed to the resulting fatal accident. After a detailed study of the accident, Airbus changed the display on the instrument so that vertical speed would always be displayed with a four-digit number and angle with two digits, thus reducing the chance of confusion. Mode error is really design error. Mode errors are especially likely where the equipment does not make the mode visible, so the user is expected to remember what mode has been established, sometimes hours earlier, during which time many intervening events might have occurred. Designers must try to avoid modes, but if they are necessary, the equipment must make it obvious which mode is invoked. Once again, designers must always compensate for interfering activities.
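The arithmetic behind the Airbus confusion is easy to check. The short calculation below is my own back-of-the-envelope sketch, not anything taken from the accident report, and it assumes a typical approach ground speed of roughly 140 knots.

import math

KNOTS_TO_FEET_PER_MIN = 6076.12 / 60  # one nautical mile is about 6,076 feet

def descent_rate_from_angle(angle_deg: float, ground_speed_knots: float) -> float:
    """Vertical speed (feet per minute) produced by a given flight-path angle."""
    speed_fpm = ground_speed_knots * KNOTS_TO_FEET_PER_MIN
    return speed_fpm * math.sin(math.radians(angle_deg))

# Flight-path-angle mode: "-3.3" means a 3.3 degree descent.
print(round(descent_rate_from_angle(3.3, 140)))  # about 800 feet per minute

# Vertical-speed mode: the same "-3.3" is read as -3,300 feet per minute,
# roughly four times steeper than the crew intended.

The same two digits, interpreted in two different modes, describe two very different descents; this is why making the active mode visible matters so much.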

The Classification of Mistakes

Mistakes result from the choice of inappropriate goals and plans or from faulty comparison of the outcome with the goals during evaluation. In mistakes, a person makes a poor decision, misclassifies a situation, or fails to take all the relevant factors into account. Many mistakes arise from the vagaries of human thought, often because people tend to rely upon remembered experiences rather than on more systematic analysis. We make decisions based upon what is in our memory. But as discussed in Chapter 3, retrieval from long-term memory is actually a reconstruction rather than an accurate record. As a result, it is subject to numerous biases. Among other things, our memories tend to be biased toward overgeneralization of the commonplace and overemphasis of the discrepant.

The Danish engineer Jens Rasmussen distinguished among three modes of behavior: skill-based, rule-based, and knowledge-based. This three-level classification scheme provides a practical tool that has found wide acceptance in applied areas, such as the design of many industrial systems. Skill-based behavior occurs when workers are extremely expert at their jobs, so they can do the everyday, routine tasks with little or no thought or conscious attention. The most common form of error in skill-based behavior is the slip.

Rule-based behavior occurs when the normal routine is no longer applicable but the new situation is one that is known, so there is already a well-prescribed course of action: a rule. Rules might simply be learned behaviors from previous experience, but they also include formal procedures prescribed in courses and manuals, usually in the form of "if-then" statements, such as, "If the engine will not start, then do [the appropriate action]." Errors in rule-based behavior can be either mistakes or slips. If the wrong rule is selected, this is a mistake. If the error occurs during the execution of the rule, it is most likely a slip.

Knowledge-based behavior occurs when unfamiliar events arise, where neither existing skills nor rules apply. In this case, there must be considerable reasoning and problem-solving. Plans might be developed, tested, and then used or modified. Here, conceptual models are essential in guiding development of the plan and interpretation of the situation.

In both rule-based and knowledge-based situations, the most serious mistakes occur when the situation is misdiagnosed. As a result, an inappropriate rule is executed, or in the case of knowledge-based problems, the effort is addressed to solving the wrong problem. In addition, with misdiagnosis of the problem comes misinterpretation of the environment, as well as faulty comparisons of the current state with expectations. These kinds of mistakes can be very difficult to detect and correct.

RULE-BASED MISTAKES

When new procedures have to be invoked or when simple problems arise, we can characterize the actions of skilled people as rule-based. Some rules come from experience; others are formal procedures in manuals or rulebooks, or even less formal guides, such as cookbooks for food preparation. In either case, all we must do is identify the situation, select the proper rule, and then follow it.
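Rasmussen's "if-then" picture of rule-based behavior maps naturally onto a lookup table: identify the situation, select the rule, follow it. The toy sketch below is my own invention, with made-up situations and rules; it also marks the spot where rule-based mistakes tend to enter, namely the classification step.

# A toy rule table in the "if-then" spirit described above.
RULES = {
    "red_light": "stop the car",
    "engine_will_not_start": "check the battery, then the starter",
    "abs_vibration_on_braking": "keep full pressure on the brake pedal",
}

def classify(observations: str) -> str:
    # In real life this is the hard part: a rule-based mistake usually
    # begins here, when the situation is misclassified and the wrong
    # (but correctly executed) rule is then selected.
    if "vibration" in observations:
        return "abs_vibration_on_braking"
    if "red" in observations:
        return "red_light"
    return "engine_will_not_start"

def act(observations: str) -> str:
    situation = classify(observations)
    return RULES[situation]

print(act("pedal vibration while braking hard"))  # -> keep full pressure on the brake pedal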


When driving, behavior follows well-learned rules. Is the light red? If so, stop the car. Wish to turn left? Signal the intention to turn and move as far left as legally permitted: slow the vehicle and wait for a safe break in traffic, all the while following the traffic rules and relevant signs and lights. Rule-based mistakes occur in multiple ways:

• The situation is mistakenly interpreted, thereby invoking the wrong goal or plan and leading to following an inappropriate rule.
• The correct rule is invoked, but the rule itself is faulty, either because it was formulated improperly, because conditions are different than assumed by the rule, or through incomplete knowledge used to determine the rule. All of these lead to knowledge-based mistakes.
• The correct rule is invoked, but the outcome is incorrectly evaluated. This error in evaluation, usually rule- or knowledge-based itself, can lead to further problems as the action cycle continues.

Example 1: In 2013, at the Kiss nightclub in Santa Maria, Brazil, pyrotechnics used by the band ignited a fire that killed over 230 people. The tragedy illustrates several mistakes. The band made a knowledge-based mistake when they used outdoor flares, which ignited the ceiling's acoustic tiles: the band thought the flares were safe. Many people rushed into the restrooms, mistakenly thinking they were exits: they died. Early reports suggested that the guards, unaware of the fire, at first mistakenly blocked people from leaving the building. Why? Because nightclub attendees would sometimes leave without paying for their drinks. The mistake was in devising a rule that did not take account of emergencies. A root cause analysis would reveal that the goal was to prevent inappropriate exit but still allow the doors to be used in an emergency. One solution is doors that trigger alarms when used, deterring people trying to sneak out but allowing exit when needed.

Example 2: Turning the thermostat of an oven to its maximum temperature to get it to the proper cooking temperature faster is a mistake based upon a false conceptual model of the way the oven works. If the person wanders off and forgets to come back and check the oven temperature after a reasonable period (a memory-lapse slip), the improperly high setting of the oven temperature can lead to an accident, possibly a fire.

Example 3: A driver, unaccustomed to anti-lock brakes, encounters an unexpected object in the road on a wet, rainy day. The driver applies full force to the brakes, but the car skids, triggering the anti-lock brakes to rapidly turn the brakes on and off, as they are designed to do. The driver, feeling the vibrations, believes they indicate a malfunction and therefore lifts his foot off the brake pedal. In fact, the vibration is a signal that the anti-lock brakes are working properly. The driver's misevaluation leads to the wrong behavior.

Rule-based mistakes are difficult to avoid and then difficult to detect. Once the situation has been classified, the selection of the appropriate rule is often straightforward. But what if the classification of the situation is wrong? This is difficult to discover because there is usually considerable evidence to support the erroneous classification of the situation and the choice of rule. In complex situations, the problem is too much information: information that both supports the decision and also contradicts it. In the face of time pressures to make a decision, it is difficult to know which evidence to consider, which to reject. People usually decide by taking the current situation and matching it with something that happened earlier. Although human memory is quite good at matching examples from the past with the present situation, this doesn’t mean that the matching is accurate or appropriate. The matching is biased by recency, regularity, and uniqueness. Recent events are remembered far better than less recent ones. Frequent events are remembered through their regularities, and unique events are remembered because of their uniqueness. But suppose the current event is different from all that has been experienced before: people are still apt to find some match in memory to use as a guide. The same powers that make us so good at dealing with the common and the unique lead to severe error with novel events. What is a designer to do? Provide as much guidance as possible to ensure that the current state of things is displayed in a coherent 182


and easily interpreted format—ideally graphical. This is a difficult problem. All major decision makers worry about the complexity of real-world events, where the problem is often too much information, much of it contradictory. Often, decisions must be made quickly. Sometimes it isn’t even clear that there is an incident or that a decision is actually being made. Think of it like this. In your home, there are probably a number of broken or misbehaving items. There might be some burnt-out lights, or (in my home) a reading light that works fine for a little while, then goes out: we have to walk over and wiggle the fluorescent bulb. There might be a leaky faucet or other minor faults that you know about but are postponing action to remedy. Now consider a major process-control manufacturing plant (an oil refinery, a chemical plant, or a nuclear power plant). These have thousands, perhaps tens of thousands, of valves and gauges, displays and controls, and so on. Even the best of plants always has some faulty parts. The maintenance crews always have a list of items to take care of. With all the alarms that trigger when a problem arises, even though it might be minor, and all the everyday failures, how does one know which might be a significant indicator of a major problem? Every single one usually has a simple, rational explanation, so not making it an urgent item is a sensible decision. In fact, the maintenance crew simply adds it to a list. Most of the time, this is the correct decision. The one time in a thousand (or even, one time in a million) that the decision is wrong makes it the one they will be blamed for: how could they have missed such obvious signals? Hindsight is always superior to foresight. When the accident investigation committee reviews the event that contributed to the problem, they know what actually happened, so it is easy for them to pick out which information was relevant, which was not. This is retrospective decision making. But when the incident was taking place, the people were probably overwhelmed with far too much irrelevant information and probably not a lot of relevant information. How were they to know which to attend to and which to ignore? Most of the time, experienced operators get things right. The one time they fail, the retrospective analysis is apt to condemn five: Human Error? No, Bad Design 183

them for missing the obvious. Well, during the event, nothing may be obvious. I return to this topic later in the chapter. You will face this while driving, while handling your finances, and while just going through your daily life. Most of the unusual incidents you read about are not relevant to you, so you can safely ignore them. Which things should be paid attention to, which should be ignored? Industry faces this problem all the time, as do governments. The intelligence communities are swamped with data. How do they decide which cases are serious? The public hears about their mistakes, but not about the far more frequent cases that they got right or about the times they ignored data as not being meaningful—and were correct to do so. If every decision had to be questioned, nothing would ever get done. But if decisions are not questioned, there will be major mistakes—rarely, but often of substantial penalty. The design challenge is to present the information about the state of the system (a device, vehicle, plant, or activities being monitored) in a way that is easy to assimilate and interpret, as well as to provide alternative explanations and interpretations. It is useful to question decisions, but impossible to do so if every action—or failure to act—requires close attention. This is a difficult problem with no obvious solution.

KNOWLEDGE-BASED MISTAKES

Knowledge-based behavior takes place when the situation is novel enough that there are no skills or rules to cover it. In this case, a new procedure must be devised. Whereas skills and rules are controlled at the behavioral level of human processing and are therefore subconscious and automatic, knowledge-based behavior is controlled at the reflective level and is slow and conscious. With knowledge-based behavior, people are consciously problem solving. They are in an unknown situation and do not have any available skills or rules that apply directly. Knowledge-based behavior is required either when a person encounters an unknown situation, perhaps being asked to use some novel equipment, or even when doing a familiar task and things go wrong, leading to a novel, uninterpretable state.

The best solution to knowledge-based situations is to be found in a good understanding of the situation, which in most cases also translates into an appropriate conceptual model. In complex cases, help is needed, and here is where good cooperative problem-solving skills and tools are required. Sometimes, good procedural manuals (paper or electronic) will do the job, especially if critical observations can be used to arrive at the relevant procedures to follow. A more powerful approach is to develop intelligent computer systems, using good search and appropriate reasoning techniques (artificial-intelligence decision-making and problem-solving). The difficulties here are in establishing the interaction of the people with the automation: human teams and automated systems have to be thought of as collaborative, cooperative systems. Instead, they are often built by assigning the tasks that machines can do to the machines and leaving the humans to do the rest. This usually means that machines do the parts that are easy for people, but when the problems become complex, which is precisely when people could use assistance, that is when the machines usually fail. (I discuss this problem extensively in The Design of Future Things.)

MEMORY-LAPSE MISTAKES

Memory lapses can lead to mistakes if the memory failure leads to forgetting the goal or plan of action. A common cause of the lapse is an interruption that leads to forgetting the evaluation of the current state of the environment. These lead to mistakes, not slips, because the goals and plans become wrong. Forgetting earlier evaluations often means remaking the decision, sometimes erroneously. The design cures for memory-lapse mistakes are the same as for memory-lapse slips: ensure that all the relevant information is continuously available. The goals, plans, and current evaluation of the system are of particular importance and should be continually available. Far too many designs eliminate all signs of these items once they have been made or acted upon. Once again, the designer should assume that people will be interrupted during their activities and that they may need assistance in resuming their operations.

Social and Institutional Pressures

A subtle issue that seems to figure in many accidents is social pressure. Although at first it may not seem relevant to design, it has a strong influence on everyday behavior. In industrial settings, social pressures can lead to misinterpretation, mistakes, and accidents. To understand human error, it is essential to understand social pressure.

Complex problem-solving is required when one is faced with knowledge-based problems. In some cases, it can take teams of people days to understand what is wrong and the best ways to respond. This is especially true of situations where mistakes have been made in the diagnosis of the problem. Once the mistaken diagnosis is made, all information from then on is interpreted from the wrong point of view. Appropriate reconsideration might only take place during team turnover, when new people come into the situation with a fresh viewpoint, allowing them to form different interpretations of the events. Sometimes just asking one or more of the team members to take a few hours' break can lead to the same fresh analysis (although it is understandably difficult to convince someone who is battling an emergency to stop for a few hours).

In commercial installations, the pressure to keep systems running is immense. Considerable money might be lost if an expensive system is shut down, and operators are often under pressure not to do this. The result has at times been tragic. Nuclear power plants have been kept running longer than is safe. Airplanes have taken off before everything was ready and before the pilots had received permission. One such incident led to the largest accident in aviation history. Although the incident happened in 1977, a long time ago, the lessons learned are still very relevant today. In Tenerife, in the Canary Islands, a KLM Boeing 747 crashed during takeoff into a Pan American 747 that was taxiing on the same runway, killing 583 people. The KLM plane had not received clearance to take off, but the weather was starting to get bad and the crew had already been delayed for too long (even being on the


Canary Islands was a diversion from the scheduled flight—bad weather had prevented their landing at their scheduled destination). And the Pan American flight should not have been on the runway, but there was considerable misunderstanding between the pilots and the air traffic controllers. Furthermore, the fog was coming in so thickly that neither plane’s crew could see the other. In the Tenerife disaster, time and economic pressures were acting together with cultural and weather conditions. The Pan American pilots questioned their orders to taxi on the runway, but they continued anyway. The first officer of the KLM flight voiced minor objections to the captain, trying to explain that they were not yet cleared for takeoff (but the first officer was very junior to the captain, who was one of KLM’s most respected pilots). All in all, a major tragedy occurred due to a complex mixture of social pressures and logical explaining away of discrepant observations. You may have experienced similar pressure, putting off refueling or recharging your car until it was too late and you ran out, sometimes in a truly inconvenient place (this has happened to me). What are the social pressures to cheat on school examinations, or to help others cheat? Or to not report cheating by others? Never underestimate the power of social pressures on behavior, causing otherwise sensible people to do things they know are wrong and possibly dangerous. When I was in training to do underwater (scuba) diving, our instructor was so concerned about this that he said he would reward anyone who stopped a dive early in favor of safety. People are normally buoyant, so they need weights to get them beneath the surface. When the water is cold, the problem is intensified because divers must then wear either wet or dry suits to keep warm, and these suits add buoyancy. Adjusting buoyancy is an important part of the dive, so along with the weights, divers also wear air vests into which they continually add or remove air so that the body is close to neutral buoyancy. (As divers go deeper, increased water pressure compresses the air in their protective suits and lungs, so they become heavier: the divers need to add air to their vests to compensate.) five: Human Error? No, Bad Design 187

When divers have gotten into difficulties and needed to get to the surface quickly, or when they were at the surface close to shore but being tossed around by waves, some drowned because they were still being encumbered by their heavy weights. Because the weights are expensive, the divers didn’t want to release them. In addition, if the divers released the weights and then made it back safely, they could never prove that the release of the weights was necessary, so they would feel embarrassed, creating self-induced social pressure. Our instructor was very aware of the resulting reluctance of people to take the critical step of releasing their weights when they weren’t entirely positive it was necessary. To counteract this tendency, he announced that if anyone dropped the weights for safety reasons, he would publicly praise the diver and replace the weights at no cost to the person. This was a very persuasive attempt to overcome social pressures. Social pressures show up continually. They are usually difficult to document because most people and organizations are reluctant to admit these factors, so even if they are discovered in the process of the accident investigation, the results are often kept hidden from public scrutiny. A major exception is in the study of transportation accidents, where the review boards across the world tend to hold open investigations. The US National Transportation Safety Board (NTSB) is an excellent example of this, and its reports are widely used by many accident investigators and researchers of human error (including me). Another good example of social pressures comes from yet another airplane incident. In 1982 an Air Florida flight from National Airport, Washington, DC, crashed during takeoff into the Fourteenth Street Bridge over the Potomac River, killing seventy-eight people, including four who were on the bridge. The plane should not have taken off because there was ice on the wings, but it had already been delayed for over an hour and a half; this and other factors, the NTSB reported, “may have predisposed the crew to hurry.” The accident occurred despite the first officer’s attempt to warn the captain, who was flying the airplane (the captain and first officer—sometimes called the copilot—usually alternate flying 188


roles on different legs of a trip). The NTSB report quotes the flight deck recorder’s documenting that “although the first officer expressed concern that something ‘was not right’ to the captain four times during the takeoff, the captain took no action to reject the takeoff.” NTSB summarized the causes this way: The National Transportation Safety Board determines that the probable cause of this accident was the flight crew’s failure to use engine antiice during ground operation and takeoff, their decision to take off with snow/ice on the airfoil surfaces of the aircraft, and the captain’s failure to reject the takeoff during the early stage when his attention was called to anomalous engine instrument readings. (NTSB, 1982.)

Again we see social pressures coupled with time and economic forces. Social pressures can be overcome, but they are powerful and pervasive. We drive when drowsy or after drinking, knowing full well the dangers, but talking ourselves into believing that we are exempt. How can we overcome these kinds of social problems? Good design alone is not sufficient. We need different training; we need to reward safety and put it above economic pressures. It helps if the equipment can make the potential dangers visible and explicit, but this is not always possible. To adequately address social, economic, and cultural pressures and to improve upon company policies are the hardest parts of ensuring safe operation and behavior.

CHECKLISTS

Checklists are powerful tools, proven to increase the accuracy of behavior and to reduce error, particularly slips and memory lapses. They are especially important in situations with multiple, complex requirements, and even more so where there are interruptions. With multiple people involved in a task, it is essential that the lines of responsibility be clearly spelled out. It is always better to have two people do checklists together as a team: one to read the instruction, the other to execute it. If, instead, a single person executes the checklist and then, later, a second person checks the items, the results are not as robust. The person following the checklist, feeling confident that any errors would be caught, might do the steps too quickly. But the same bias affects the checker. Confident in the ability of the first person, the checker often does a quick, less than thorough job.

One paradox of groups is that quite often, adding more people to check a task makes it less likely that it will be done right. Why? Well, if you were responsible for checking the correct readings on a row of fifty gauges and displays, but you knew that two people before you had checked them and that one or two people who come after you will check your work, you might relax, thinking that you don't have to be extra careful. After all, with so many people looking, it would be impossible for a problem to exist without detection. But if everyone thinks the same way, adding more checks can actually increase the chance of error. A collaboratively followed checklist is an effective way to counteract these natural human tendencies.

In commercial aviation, collaboratively followed checklists are widely accepted as essential tools for safety. The checklist is done by two people, usually the two pilots of the airplane (the captain and first officer). In aviation, checklists have proven their worth and are now required in all US commercial flights. But despite the strong evidence confirming their usefulness, many industries still fiercely resist them. It makes people feel that their competence is being questioned. Moreover, when two people are involved, a junior person (in aviation, the first officer) is being asked to watch over the actions of the senior person. This is a strong violation of the lines of authority in many cultures.

Physicians and other medical professionals have strongly resisted the use of checklists. It is seen as an insult to their professional competence. "Other people might need checklists," they complain, "but not me." Too bad. To err is human: we are all subject to slips and mistakes when under stress, under time or social pressure, or after being subjected to multiple interruptions, each essential in its own right. It is not a threat to professional competence to be human.



Legitimate criticisms of particular checklists are used as an indictment against the concept of checklists. Fortunately, checklists are slowly starting to gain acceptance in medical situations. When senior personnel insist on the use of checklists, it actually enhances their authority and professional status. It took decades for checklists to be accepted in commercial aviation: let us hope that medicine and other professions will change more rapidly.

Designing an effective checklist is difficult. The design needs to be iterative, always being refined, ideally using the human-centered design principles of Chapter 6, continually adjusting the list until it covers the essential items yet is not burdensome to perform. Many people who object to checklists are actually objecting to badly designed lists: designing a checklist for a complex task is best done by professional designers in conjunction with subject matter experts.

Printed checklists have one major flaw: they force the steps to follow a sequential ordering, even where this is not necessary or even possible. With complex tasks, the order in which many operations are performed may not matter, as long as they are all completed. Sometimes items early in the list cannot be done at the time they are encountered in the checklist. For example, in aviation one of the steps is to check the amount of fuel in the plane. But what if the fueling operation has not yet been completed when this checklist item is encountered? Pilots will skip over it, intending to come back to it after the plane has been refueled. This is a clear opportunity for a memory-lapse error. In general, it is bad design to impose a sequential structure on task execution unless the task itself requires it. This is one of the major benefits of electronic checklists: they can keep track of skipped items and can ensure that the list will not be marked as complete until all items have been done.
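That last point is concrete enough to sketch. The minimal example below, with invented names and items, shows the bookkeeping an electronic checklist needs: skipped items stay on the books, and the list refuses to call itself complete until every item is done.

class Checklist:
    def __init__(self, items):
        self.status = {item: "pending" for item in items}

    def complete(self, item):
        self.status[item] = "done"

    def skip(self, item):
        # Skipping is allowed, but the item remains outstanding.
        self.status[item] = "skipped"

    def remaining(self):
        return [item for item, state in self.status.items() if state != "done"]

    def is_complete(self):
        return not self.remaining()

preflight = Checklist(["flaps set", "fuel quantity checked", "trim set"])
preflight.complete("flaps set")
preflight.skip("fuel quantity checked")   # fueling not finished yet
preflight.complete("trim set")
print(preflight.is_complete())            # False: the skipped item still blocks completion
print(preflight.remaining())              # ['fuel quantity checked']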

Reporting Error

If errors can be caught, then many of the problems they might lead to can often be avoided. But not all errors are easy to detect. Moreover, social pressures often make it difficult for people to admit to their own errors (or to report the errors of others). If people report their own errors, they might be fined or punished. Moreover, their friends may make fun of them. If a person reports that someone else made an error, this may lead to severe personal repercussions. Finally, most institutions do not wish to reveal errors made by their staff. Hospitals, courts, police systems, utility companies—all are reluctant to admit to the public that their workers are capable of error. These are all unfortunate attitudes.

The only way to reduce the incidence of errors is to admit their existence, to gather together information about them, and thereby to be able to make the appropriate changes to reduce their occurrence. In the absence of data, it is difficult or impossible to make improvements. Rather than stigmatize those who admit to error, we should thank those who do so and encourage the reporting. We need to make it easier to report errors, for the goal is not to punish, but to determine how it occurred and change things so that it will not happen again.

CASE STUDY: JIDOKA—HOW TOYOTA HANDLES ERROR

The Toyota automobile company has developed an extremely efficient error-reduction process for manufacturing, widely known as the Toyota Production System. Among its many key principles is a philosophy called Jidoka, which Toyota says is “roughly translated as ‘automation with a human touch.’” If a worker notices something wrong, the worker is supposed to report it, sometimes even stopping the entire assembly line if a faulty part is about to proceed to the next station. (A special cord, called an andon, stops the assembly line and alerts the expert crew.) Experts converge upon the problem area to determine the cause. “Why did it happen?” “Why was that?” “Why is that the reason?” The philosophy is to ask “Why?” as many times as may be necessary to get to the root cause of the problem and then fix it so it can never occur again. As you might imagine, this can be rather discomforting for the person who found the error. But the report is expected, and when it is discovered that people have failed to report errors, they are punished, all in an attempt to get the workers to be honest.


POKA-YOKE: ERROR PROOFING

Poka-yoke is another Japanese method, this one invented by Shigeo Shingo, one of the Japanese engineers who played a major role in the development of the Toyota Production System. Poka-yoke translates as “error proofing” or “avoiding error.” One of the techniques of poka-yoke is to add simple fixtures, jigs, or devices to constrain the operations so that they are correct. I practice this myself in my home. One trivial example is a device to help me remember which way to turn the key on the many doors in the apartment complex where I live. I went around with a pile of small, circular, green stick-on dots and put them on each door beside its keyhole, with the green dot indicating the direction in which the key needed to be turned: I added signifiers to the doors. Is this a major error? No. But eliminating it has proven to be convenient. (Neighbors have commented on their utility, wondering who put them there.) In manufacturing facilities, poka-yoke might be a piece of wood to help align a part properly, or perhaps plates designed with asymmetrical screw holes so that the plate could fit in only one position. Covering emergency or critical switches with a cover to prevent accidental triggering is another poka-yoke technique: this is obviously a forcing function. All the poka-yoke techniques involve a combination of the principles discussed in this book: affordances, signifiers, mapping, and constraints, and perhaps most important of all, forcing functions. NASA’S AVIATION SAFETY REPORTING SYSTEM

US commercial aviation has long had an extremely effective system for encouraging pilots to submit reports of errors. The program has resulted in numerous improvements to aviation safety. It wasn’t easy to establish: pilots had severe self-induced social pressures against admitting to errors. Moreover, to whom would they report them? Certainly not to their employers. Not even to the Federal Aviation Administration (FAA), for then they would probably be punished. The solution was to let the National Aeronautics and Space Administration (NASA) set up a voluntary accident reporting system whereby pilots could submit semi-anonymous reports
of errors they had made or observed in others (semi-anonymous because pilots put their name and contact information on the reports so that NASA could call to request more information). Once NASA personnel had acquired the necessary information, they would detach the contact information from the report and mail it back to the pilot. This meant that NASA no longer knew who had reported the error, which made it impossible for the airline companies or the FAA (which enforced penalties against errors) to find out who had submitted the report. If the FAA had independently noticed the error and tried to invoke a civil penalty or certificate suspension, the receipt of self-report automatically exempted the pilot from punishment (for minor infractions). When a sufficient number of similar errors had been collected, NASA would analyze them and issue reports and recommendations to the airlines and to the FAA. These reports also helped the pilots realize that their error reports were valuable tools for increasing safety. As with checklists, we need similar systems in the field of medicine, but it has not been easy to set up. NASA is a neutral body, charged with enhancing aviation safety, but has no oversight authority, which helped gain the trust of pilots. There is no comparable institution in medicine: physicians are afraid that self-reported errors might lead them to lose their license or be subjected to lawsuits. But we can’t eliminate errors unless we know what they are. The medical field is starting to make progress, but it is a difficult technical, political, legal, and social problem.

Detecting Error

Errors do not necessarily lead to harm if they are discovered quickly. The different categories of errors have differing ease of discovery. In general, action slips are relatively easy to discover; mistakes, much more difficult. Action slips are relatively easy to detect because it is usually easy to notice a discrepancy between the intended act and the one that got performed. But this detection can only take place if there is feedback. If the result of the action is not visible, how can the error be detected?


Memory-lapse slips are difficult to detect precisely because there is nothing to see. With a memory slip, the required action is not performed. When no action is done, there is nothing to detect. It is only when the lack of action allows some unwanted event to occur that there is hope of detecting a memory-lapse slip. Mistakes are difficult to detect because there is seldom anything that can signal an inappropriate goal. And once the wrong goal or plan is decided upon, the resulting actions are consistent with that wrong goal, so careful monitoring of the actions not only fails to detect the erroneous goal, but, because the actions are done correctly, can inappropriately provide added confidence to the decision. Faulty diagnoses of a situation can be surprisingly difficult to detect. You might expect that if the diagnosis was wrong, the actions would turn out to be ineffective, so the fault would be discovered quickly. But misdiagnoses are not random. Usually they are based on considerable knowledge and logic. The misdiagnosis is usually both reasonable and relevant to eliminating the symptoms being observed. As a result, the initial actions are apt to appear appropriate and helpful. This makes the problem of discovery even more difficult. The actual error might not be discovered for hours or days. Memory-lapse mistakes are especially difficult to detect. Just as with a memory-lapse slip the absence of something that should have been done is always more difficult to detect than the presence of something that should not have been done. The difference between memory-lapse slips and mistakes is that, in the first case, a single component of a plan is skipped, whereas in the second, the entire plan is forgotten. Which is easier to discover? At this point I must retreat to the standard answer science likes to give to questions of this sort: “It all depends.” EXPLAINING AWAY MISTAKES

Mistakes can take a long time to be discovered. Hear a noise that sounds like a pistol shot and think: “Must be a car’s exhaust backfiring.” Hear someone yell outside and think: “Why can’t my neighbors be quiet?” Are we correct in dismissing these incidents? Most of the time we are, but when we’re not, our explanations can be difficult to justify.

Explaining away errors is a common problem in commercial accidents. Most major accidents are preceded by warning signs: equipment malfunctions or unusual events. Often, there is a series of apparently unrelated breakdowns and errors that culminate in major disaster. Why didn’t anyone notice? Because no single incident appeared to be serious. Often, the people involved noted each problem but discounted it, finding a logical explanation for the otherwise deviant observation.

THE CASE OF THE WRONG TURN ON A HIGHWAY

I’ve misinterpreted highway signs, as I’m sure most drivers have. My family was traveling from San Diego to Mammoth Lakes, California, a ski area about 400 miles north. As we drove, we noticed more and more signs advertising the hotels and gambling casinos of Las Vegas, Nevada. “Strange,” we said, “Las Vegas always did advertise a long way off—there is even a billboard in San Diego—but this seems excessive, advertising on the road to Mammoth.” We stopped for gasoline and continued on our journey. Only later, when we tried to find a place to eat supper, did we discover that we had missed a turn nearly two hours earlier, before we had stopped for gasoline, and that we were actually on the road to Las Vegas, not the road to Mammoth. We had to backtrack the entire two-hour segment, wasting four hours of driving. It’s humorous now; it wasn’t then.

Once people find an explanation for an apparent anomaly, they tend to believe they can now discount it. But explanations are based on analogy with past experiences, experiences that may not apply to the current situation. In the driving story, the prevalence of billboards for Las Vegas was a signal we should have heeded, but it seemed easily explained. Our experience is typical: some major industrial incidents have resulted from false explanations of anomalous events. But do note: usually these apparent anomalies should be ignored. Most of the time, the explanation for their presence is correct. Distinguishing a true anomaly from an apparent one is difficult.

IN HINDSIGHT, EVENTS SEEM LOGICAL

The contrast in our understanding before and after an event can be dramatic. The psychologist Baruch Fischhoff has studied explanations given in hindsight, where events seem completely obvious and predictable after the fact but completely unpredictable beforehand. Fischhoff presented people with a number of situations and asked them to predict what would happen: they were correct only at the chance level. When the actual outcome was not known by the people being studied, few predicted the actual outcome. He then presented the same situations along with the actual outcomes to another group of people, asking them to state how likely each outcome was: when the actual outcome was known, it appeared to be plausible and likely and other outcomes appeared unlikely.

Hindsight makes events seem obvious and predictable. Foresight is difficult. During an incident, there are never clear clues. Many things are happening at once: workload is high, emotions and stress levels are high. Many things that are happening will turn out to be irrelevant. Things that appear irrelevant will turn out to be critical. The accident investigators, working with hindsight, knowing what really happened, will focus on the relevant information and ignore the irrelevant. But at the time the events were happening, the operators did not have information that allowed them to distinguish one from the other. This is why the best accident analyses can take a long time to do. The investigators have to imagine themselves in the shoes of the people who were involved and consider all the information, all the training, and what the history of similar past events would have taught the operators.

So, the next time a major accident occurs, ignore the initial reports from journalists, politicians, and executives who don’t have any substantive information but feel compelled to provide statements anyway. Wait until the official reports come from trusted sources. Unfortunately, this could be months or years after the accident, and the public usually wants
answers immediately, even if those answers are wrong. Moreover, when the full story finally appears, newspapers will no longer consider it news, so they won’t report it. You will have to search for the official report. In the United States, the National Transportation Safety Board (NTSB) can be trusted. NTSB conducts careful investigations of all major aviation, automobile and truck, train, ship, and pipeline incidents. (Pipelines? Sure: pipelines transport coal, gas, and oil.)

Designing for Error

It is relatively easy to design for the situation where everything goes well, where people use the device in the way that was intended, and no unforeseen events occur. The tricky part is to design for when things go wrong.

Consider a conversation between two people. Are errors made? Yes, but they are not treated as such. If a person says something that is not understandable, we ask for clarification. If a person says something that we believe to be false, we question and debate. We don’t issue a warning signal. We don’t beep. We don’t give error messages. We ask for more information and engage in mutual dialogue to reach an understanding. In normal conversations between two friends, misstatements are taken as normal, as approximations to what was really meant. Grammatical errors, self-corrections, and restarted phrases are ignored. In fact, they are usually not even detected because we concentrate upon the intended meaning, not the surface features.

Machines are not intelligent enough to determine the meaning of our actions, but even so, they are far less intelligent than they could be. With our products, if we do something inappropriate, if the action fits the proper format for a command, the product does it, even if it is outrageously dangerous. This has led to tragic accidents, especially in health care, where inappropriate design of infusion pumps and X-ray machines allowed extreme overdoses of medication or radiation to be administered to patients, leading to their deaths. In financial institutions, simple keyboard errors have led to huge financial transactions, far beyond normal limits.


Even simple checks for reasonableness would have stopped all of these errors. (This is discussed at the end of the chapter under the heading “Sensibility Checks.”) Many systems compound the problem by making it easy to err but difficult or impossible to discover error or to recover from it. It should not be possible for one simple error to cause widespread damage. Here is what should be done:

• Understand the causes of error and design to minimize those causes.
• Do sensibility checks. Does the action pass the “common sense” test?
• Make it possible to reverse actions—to “undo” them—or make it harder to do what cannot be reversed.
• Make it easier for people to discover the errors that do occur, and make them easier to correct.
• Don’t treat the action as an error; rather, try to help the person complete the action properly. Think of the action as an approximation to what is desired.

As this chapter demonstrates, we know a lot about errors. Thus, novices are more likely to make mistakes than slips, whereas experts are more likely to make slips. Mistakes often arise from ambiguous or unclear information about the current state of a system, the lack of a good conceptual model, and inappropriate procedures. Recall that most mistakes result from erroneous choice of goal or plan or erroneous evaluation and interpretation. All of these come about through poor information provided by the system about the choice of goals and the means to accomplish them (plans), and poor-quality feedback about what has actually happened.

A major source of error, especially memory-lapse errors, is interruption. When an activity is interrupted by some other event, the cost of the interruption is far greater than the loss of the time required to deal with the interruption: it is also the cost of resuming the interrupted activity. To resume, it is necessary to remember precisely the previous state of the activity: what the goal was, where one was in the action cycle, and the relevant state of the system. Most systems make it difficult to resume after an interruption.

Most discard critical information that is needed by the user to remember the numerous small decisions that had been made, the things that were in the person’s short-term memory, to say nothing of the current state of the system. What still needs to be done? Maybe I was finished? It is no wonder that many slips and mistakes are the result of interruptions.

Multitasking, whereby we deliberately do several tasks simultaneously, erroneously appears to be an efficient way of getting a lot done. It is much beloved by teenagers and busy workers, but in fact, all the evidence points to severe degradation of performance, increased errors, and a general lack of both quality and efficiency. Doing two tasks at once takes longer than the sum of the times it would take to do each alone. Even as simple and common a task as talking on a hands-free cell phone while driving leads to serious degradation of driving skills. One study even showed that cell phone usage during walking led to serious deficits: “Cell phone users walked more slowly, changed directions more frequently, and were less likely to acknowledge other people than individuals in the other conditions. In the second study, we found that cell phone users were less likely to notice an unusual activity along their walking route (a unicycling clown)” (Hyman, Boss, Wise, McKenzie, & Caggiano, 2010).

A large percentage of medical errors are due to interruptions. In aviation, where interruptions were also determined to be a major problem during the critical phases of flying—landing and takeoff—the US Federal Aviation Administration (FAA) requires what it calls a “Sterile Cockpit Configuration,” whereby pilots are not allowed to discuss any topic not directly related to the control of the airplane during these critical periods. In addition, the flight attendants are not permitted to talk to the pilots during these phases (which has at times led to the opposite error—failure to inform the pilots of emergency situations). Establishing similar sterile periods would be of great benefit to many professions, including medicine and other safety-critical operations. My wife and I follow this convention in driving: when the driver is entering or leaving a high-speed highway,
conversation ceases until the transition has been completed. Interruptions and distractions lead to errors, both mistakes and slips.

Warning signals are usually not the answer. Consider the control room of a nuclear power plant, the cockpit of a commercial aircraft, or the operating room of a hospital. Each has a large number of different instruments, gauges, and controls, all with signals that tend to sound similar because they all use simple tone generators to beep their warnings. There is no coordination among the instruments, which means that in major emergencies, they all sound at once. Most can be ignored anyway because they tell the operator about something that is already known. Each competes with the others to be heard, interfering with efforts to address the problem. Unnecessary, annoying alarms occur in numerous situations. How do people cope? By disconnecting warning signals, taping over warning lights (or removing the bulbs), silencing bells, and basically getting rid of all the safety warnings. The problem comes after such alarms are disabled, either when people forget to restore the warning systems (there are those memory-lapse slips again), or if a different incident happens while the alarms are disconnected. At that point, nobody notices. Warnings and safety methods must be used with care and intelligence, taking into account the tradeoffs for the people who are affected.

The design of warning signals is surprisingly complex. They have to be loud or bright enough to be noticed, but not so loud or bright that they become annoying distractions. The signal has to both attract attention (act as a signifier of critical information) and also deliver information about the nature of the event that is being signified. The various instruments need to have a coordinated response, which means that there must be international standards and collaboration among the many design teams from different, often competing, companies. Although considerable research has been directed toward this problem, including the development of national standards for alarm management systems, the problem still remains in many situations.

More and more of our machines present information through speech. But like all approaches, this has both strengths and
weaknesses. It allows for precise information to be conveyed, especially when the person’s visual attention is directed elsewhere. But if several speech warnings operate at the same time, or if the environment is noisy, speech warnings may not be understood. Or if conversations among the users or operators are necessary, speech warnings will interfere. Speech warning signals can be effective, but only if used intelligently.

DESIGN LESSONS FROM THE STUDY OF ERRORS

Several design lessons can be drawn from the study of errors, one for preventing errors before they occur and one for detecting and correcting them when they do occur. In general, the solutions follow directly from the preceding analyses.

ADDING CONSTRAINTS TO BLOCK ERRORS

Prevention often involves adding specific constraints to actions. In the physical world, this can be done through clever use of shape and size. For example, in automobiles, a variety of fluids are required for safe operation and maintenance: engine oil, transmission oil, brake fluid, windshield washer solution, radiator coolant, battery water, and gasoline. Putting the wrong fluid into a reservoir could lead to serious damage or even an accident. Automobile manufacturers try to minimize these errors by segregating the filling points, thereby reducing description-similarity errors. When the filling points for fluids that should be added only occasionally or by qualified mechanics are located separately from those for fluids used more frequently, the average motorist is unlikely to use the incorrect filling points. Errors in adding fluids to the wrong container can be minimized by making the openings have different sizes and shapes, providing physical constraints against inappropriate filling. Different fluids often have different colors so that they can be distinguished. All these are excellent ways to minimize errors. Similar techniques are in widespread use in hospitals and industry. All of these are intelligent applications of constraints, forcing functions, and poka-yoke.
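
The same idea carries over to software, where distinct types can play the role of differently shaped openings. The sketch below is hypothetical (the class names are invented for illustration): each reservoir accepts only its own fluid type, so a description-similarity error is rejected instead of silently carried out.

    # Hypothetical sketch: give each fluid its own type so a reservoir cannot
    # accept the wrong one, a software analogue of differently shaped openings.
    class Fluid:
        pass

    class EngineOil(Fluid):
        pass

    class WasherSolution(Fluid):
        pass

    class Reservoir:
        def __init__(self, name, accepts):
            self.name = name
            self.accepts = accepts   # the only fluid type this opening fits

        def fill(self, fluid):
            if not isinstance(fluid, self.accepts):
                raise TypeError(f"{self.name} does not accept {type(fluid).__name__}")
            return f"filled {self.name} with {type(fluid).__name__}"

    oil = Reservoir("engine oil reservoir", EngineOil)
    print(oil.fill(EngineOil()))            # allowed
    try:
        oil.fill(WasherSolution())          # blocked by the constraint
    except TypeError as error:
        print("blocked:", error)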


Electronic systems have a wide range of methods that could be used to reduce error. One is to segregate controls, so that easily confused controls are located far from one another. Another is to use separate modules, so that any control not directly relevant to the current operation is not visible on the screen, but requires extra effort to get to.

UNDO

Perhaps the most powerful tool to minimize the impact of errors is the Undo command in modern electronic systems, reversing the operations performed by the previous command, wherever possible. The best systems have multiple levels of undoing, so it is possible to undo an entire sequence of actions. Obviously, undoing is not always possible. Sometimes, it is only effective if done immediately after the action. Still, it is a powerful tool to minimize the impact of error. It is still amazing to me that many electronic and computer-based systems fail to provide a means to undo even where it is clearly possible and desirable.
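
A minimal sketch of how multiple levels of undo are commonly organized: every action records how to reverse itself, and reversals are applied in last-in, first-out order. The editor and its operations here are invented for illustration, not taken from any particular toolkit.

    # Hypothetical sketch of multi-level undo: each action pushes its inverse
    # onto a stack, and undo pops and applies the most recent inverse.
    class Editor:
        def __init__(self):
            self.text = ""
            self._undo_stack = []

        def insert(self, fragment):
            start = len(self.text)
            self.text += fragment
            end = len(self.text)
            # Record exactly how to reverse this insertion.
            self._undo_stack.append(lambda s=start, e=end: self._delete_range(s, e))

        def _delete_range(self, start, end):
            self.text = self.text[:start] + self.text[end:]

        def undo(self):
            if self._undo_stack:
                self._undo_stack.pop()()

    draft = Editor()
    draft.insert("Hello")
    draft.insert(", world")
    draft.undo()
    print(draft.text)        # Hello
    draft.undo()
    print(repr(draft.text))  # ''

Because the inverses are stacked, an entire sequence of actions can be walked back one step at a time, which is the behavior the best systems provide.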

CONFIRMATION AND ERROR MESSAGES

Many systems try to prevent errors by requiring confirmation before a command will be executed, especially when the action will destroy something of importance. But these requests are usually ill-timed because after requesting an operation, people are usually certain they want it done. Hence the standard joke about such warnings:

Person: Delete “my most important file.”
System: Do you want to delete “my most important file”?
Person: Yes.
System: Are you certain?
Person: Yes!
System: “My most important file” has been deleted.
Person: Oh. Damn.


The request for confirmation seems like an irritant rather than an essential safety check because the person tends to focus upon the action rather than the object that is being acted upon. A better check would be a prominent display of both the action to be taken and the object, perhaps with the choice of “cancel” or “do it.” The important point is making salient what the implications of the action are. Of course, it is because of errors of this sort that the Undo command is so important. With traditional graphical user interfaces on computers, not only is Undo a standard command, but when files are “deleted,” they are actually simply moved from sight and stored in the file folder named “Trash,” so that in the above example, the person could open the Trash and retrieve the erroneously deleted file. Confirmations have different implications for slips and mistakes. When I am writing, I use two very large displays and a powerful computer. I might have seven to ten applications running simultaneously. I have sometimes had as many as forty open windows. Suppose I activate the command that closes one of the windows, which triggers a confirmatory message: did I wish to close the window? How I deal with this depends upon why I requested that the window be closed. If it was a slip, the confirmation required will be useful. If it was by mistake, I am apt to ignore it. Consider these two examples: A slip leads me to close the wrong window.

Suppose I intended to type the word We, but instead of typing Shift + W for the first character, I typed Command + W (or Control + W), the keyboard command for closing a window. Because I expected the screen to display an uppercase W, when a dialog box appeared, asking whether I really wanted to delete the file, I would be surprised, which would immediately alert me to the slip. I would cancel the action (an alternative thoughtfully provided by the dialog box) and retype the Shift + W, carefully this time.

A mistake leads me to close the wrong window.

Now suppose I really intended to close a window. I often use a temporary file in a window to keep notes about the chapter I am working on. When I am finished with it, I close it without saving its contents—after all, I am finished. But because I usually have multiple windows open, it is very easy to close the wrong one. The computer assumes that all commands apply to the active window—the one where the last actions had been performed (and which contains the text cursor). But if I reviewed the temporary window prior to closing it, my visual attention is focused upon that window, and when I decide to close it, I forget that it is not the active window from the computer’s point of view. So I issue the command to shut the window, the computer presents me with a dialog box, asking for confirmation, and I accept it, choosing the option not to save my work. Because the dialog box was expected, I didn’t bother to read it. As a result, I closed the wrong window and worse, did not save any of the typing, possibly losing considerable work. Warning messages are surprisingly ineffective against mistakes (even nice requests, such as the one shown in Chapter 4, Figure 4.6, page 143).

Was this a mistake or a slip? Both. Issuing the “close” command while the wrong window was active is a memory-lapse slip. But deciding not to read the dialog box and accepting it without saving the contents is a mistake (two mistakes, actually). What can a designer do? Several things:

• Make the item being acted upon more prominent. That is, change the appearance of the actual object being acted upon to be more visible: enlarge it, or perhaps change its color.
• Make the operation reversible. If the person saves the content, no harm is done except the annoyance of having to reopen the file. If the person elects Don’t Save, the system could secretly save the contents, and the next time the person opened the file, it could ask whether it should restore it to the latest condition.
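
Both suggestions can be combined in a single close-window flow: the confirmation names the specific document rather than just the action, and choosing not to save quietly keeps a recovery copy instead of discarding the work. This is a hypothetical sketch, not the behavior of any particular application.

    # Hypothetical sketch: the confirmation names the object being acted upon,
    # and "don't save" keeps a recovery snapshot so the decision is reversible.
    recovery_snapshots = {}

    def close_window(window, ask):
        """ask(prompt) should return 'save', 'dont_save', or 'cancel'."""
        if window["unsaved"]:
            prompt = (f'Close "{window["title"]}" and discard '
                      f'{len(window["contents"])} characters of unsaved work?')
            choice = ask(prompt)
            if choice == "cancel":
                return "still open"
            if choice == "dont_save":
                recovery_snapshots[window["title"]] = window["contents"]
        return "closed"

    notes = {"title": "chapter notes", "contents": "undo, sensibility checks", "unsaved": True}
    print(close_window(notes, ask=lambda prompt: "dont_save"))   # closed
    print(recovery_snapshots)    # the contents survive for later restoration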

SENSIBILITY CHECKS

Electronic systems have another advantage over mechanical ones: they can check to make sure that the requested operation is sensible.

It is amazing that in today’s world, medical personnel can accidentally request a radiation dose a thousand times larger than normal and have the equipment meekly comply. In some cases, it isn’t even possible for the operator to notice the error. Similarly, errors in stating monetary sums can lead to disastrous results, even though a quick glance at the amount would indicate that something was badly off. For example, there are roughly 1,000 Korean won to the US dollar. Suppose I wanted to transfer $1,000 into a Korean bank account in won ($1,000 is roughly ₩1,000,000). But suppose I enter the Korean number into the dollar field. Oops—I’m trying to transfer a million dollars. Intelligent systems would take note of the normal size of my transactions, querying if the amount was considerably larger than normal. For me, it would query the million-dollar request. Less intelligent systems would blindly follow instructions, even though I did not have a million dollars in my account (in fact, I would probably be charged a fee for overdrawing my account).

Sensibility checks, of course, are also the answer to the serious errors caused when inappropriate values are entered into hospital medication and X-ray systems or in financial transactions, as discussed earlier in this chapter.
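
A sensibility check of this kind needs nothing more than a record of what is normal for the account in question. The amounts and the factor of ten below are invented for illustration; a real system would tune both.

    # Hypothetical sketch of a sensibility check: question a transfer that is far
    # outside the person's usual range instead of blindly executing it.
    def sensibility_check(amount, recent_amounts, factor=10):
        largest_recent = max(recent_amounts, default=0)
        if largest_recent and amount > factor * largest_recent:
            return (False, f"Requested {amount:,} is more than {factor} times the "
                           f"largest recent transfer ({largest_recent:,}); "
                           "please confirm the amount and the currency field.")
        return (True, "ok")

    recent_transfers_in_dollars = [120, 950, 1000, 60]
    ok, message = sensibility_check(1_000_000, recent_transfers_in_dollars)
    print(ok)        # False: the won amount was typed into the dollar field
    print(message)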

MINIMIZING SLIPS

Slips most frequently occur when the conscious mind is distracted, either by some other event or simply because the action being performed is so well learned that it can be done automatically, without conscious attention. As a result, the person does not pay sufficient attention to the action or its consequences. It might therefore seem that one way to minimize slips is to ensure that people always pay close, conscious attention to the acts being done. Bad idea. Skilled behavior is subconscious, which means it is fast, effortless, and usually accurate. Because it is so automatic, we can type at high speeds even while the conscious mind is occupied composing the words. This is why we can walk and talk while navigating traffic and obstacles. If we had to pay conscious attention to every little thing we did, we would accomplish far less in our
lives. The information processing structures of the brain automatically regulate how much conscious attention is being paid to a task: conversations automatically pause when crossing the street amid busy traffic. Don’t count on it, though: if too much attention is focused on something else, the fact that the traffic is getting dangerous might not be noted.

Many slips can be minimized by ensuring that the actions and their controls are as dissimilar as possible, or at least, as physically far apart as possible. Mode errors can be eliminated by the simple expedient of eliminating most modes and, if this is not possible, by making the modes very visible and distinct from one another. The best way of mitigating slips is to provide perceptible feedback about the nature of the action being performed, then very perceptible feedback describing the new resulting state, coupled with a mechanism that allows the error to be undone. For example, the use of machine-readable codes has led to a dramatic reduction in the delivery of wrong medications to patients. Prescriptions sent to the pharmacy are given electronic codes, so the pharmacist can scan both the prescription and the resulting medication to ensure they are the same. Then, the nursing staff at the hospital scans both the label of the medication and the tag worn around the patient’s wrist to ensure that the medication is being given to the correct individual. Moreover, the computer system can flag repeated administration of the same medication. These scans do increase the workload, but only slightly. Other kinds of errors are still possible, but these simple steps have already been proven worthwhile.

Common engineering and design practices seem as if they are deliberately intended to cause slips. Rows of identical controls or meters are a sure recipe for description-similarity errors. Internal modes that are not very conspicuously marked are a clear driver of mode errors. Situations with numerous interruptions, yet where the design assumes undivided attention, are a clear enabler of memory lapses—and almost no equipment today is designed to support the numerous interruptions that so many situations entail. And failure to provide assistance and visible reminders for performing infrequent procedures that are similar to much more
frequent ones leads to capture errors, where the more frequent actions are performed rather than the correct ones for the situation. Procedures should be designed so that the initial steps are as dissimilar as possible. The important message is that good design can prevent slips and mistakes. Design can save lives.
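
The machine-readable-code procedure described above amounts to a pair of matching checks at the point of administration, plus a check for repeats. A schematic sketch, with invented identifiers:

    # Hypothetical sketch of the double-scan check: the medication must match the
    # prescription, and the prescription must match the patient's wristband.
    def verify_administration(prescription, medication_barcode, wristband_id, given_log):
        if medication_barcode != prescription["medication_code"]:
            return "STOP: scanned medication does not match the prescription"
        if wristband_id != prescription["patient_id"]:
            return "STOP: this prescription belongs to a different patient"
        if prescription["id"] in given_log:
            return "STOP: this dose has already been recorded as given"
        given_log.add(prescription["id"])
        return "OK to administer"

    given = set()
    rx = {"id": "rx-184", "patient_id": "pt-7", "medication_code": "med-42"}
    print(verify_administration(rx, "med-42", "pt-7", given))   # OK to administer
    print(verify_administration(rx, "med-42", "pt-7", given))   # STOP: already given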

THE SWISS CHEESE MODEL OF HOW ERRORS LEAD TO ACCIDENTS

Fortunately, most errors do not lead to accidents. Accidents often have numerous contributing causes, no single one of which is the root cause of the incident. James Reason likes to explain this by invoking the metaphor of multiple slices of Swiss cheese, the cheese famous for being riddled with holes (Figure 5.3).

FIGURE 5.3. Reason’s Swiss Cheese Model of Accidents. Accidents usually have multiple causes, whereby had any single one of those causes not happened, the accident would not have occurred. The British accident researcher James Reason describes this through the metaphor of slices of Swiss cheese: unless the holes all line up perfectly, there will be no accident. This metaphor provides two lessons: First, do not try to find “the” cause of an accident; Second, we can decrease accidents and make systems more resilient by designing them to have extra precautions against error (more slices of cheese), fewer opportunities for slips, mistakes, or equipment failure (fewer holes), and very different mechanisms in the different subparts of the system (trying to ensure that the holes do not line up). (Drawing based upon one by Reason, 1990.)

If each slice of cheese represents a condition in the task being done, an accident can happen only if holes in all four slices of cheese are lined up just right. In well-designed systems, there can be many equipment failures, many errors, but they will not lead to an accident unless they all line up precisely. Any leakage—passageway through a hole—is most likely blocked at the next level. Well-designed systems are resilient against failure. This is why the attempt to find “the” cause of an accident is usually doomed to fail. Accident investigators, the press, government officials, and the everyday citizen like to find simple explanations for the cause of an accident. “See, if the hole in slice A
had been slightly higher, we would not have had the accident. So throw away slice A and replace it.” Of course, the same can be said for slices B, C, and D (and in real accidents, the number of cheese slices would sometimes measure in the tens or hundreds). It is relatively easy to find some action or decision that, had it been different, would have prevented the accident. But that does not mean that this was the cause of the accident. It is only one of the many causes: all the items have to line up. You can see this in most accidents by the “if only” statements. “If only I hadn’t decided to take a shortcut, I wouldn’t have had the accident.” “If only it hadn’t been raining, my brakes would have worked.” “If only I had looked to the left, I would have seen the car sooner.” Yes, all those statements are true, but none of them is “the” cause of the accident. Usually, there is no single cause. Yes, journalists and lawyers, as well as the public, like to know the cause so someone can be blamed and punished. But reputable investigating agencies know that there is not a single cause, which is why their investigations take so long. Their responsibility is to understand the system and make changes that would reduce the chance of the same sequence of events leading to a future accident. The Swiss cheese metaphor suggests several ways to reduce accidents: • Add more slices of cheese. • Reduce the number of holes (or make the existing holes smaller). • Alert the human operators when several holes have lined up.

Each of these has operational implications. More slices of cheese means more lines of defense, such as the requirement in aviation and other industries for checklists, where one person reads the items, another does the operation, and the first person checks the operation to confirm it was done appropriately. Reducing the number of critical safety points where error can occur is like reducing the number or size of the holes in the Swiss cheese. Properly designed equipment will reduce the opportunity for slips and mistakes, which is like reducing the number of holes and making the ones that remain smaller. This is precisely how the safety level of commercial aviation has been dramatically improved. Deborah Hersman, chair of the National Transportation Safety Board, described the design philosophy as:

U.S. airlines carry about two million people through the skies safely every day, which has been achieved in large part through design redundancy and layers of defense.

Design redundancy and layers of defense: that’s Swiss cheese. The metaphor illustrates the futility of trying to find the one underlying cause of an accident (usually some person) and punishing the culprit. Instead, we need to think about systems, about all the interacting factors that lead to human error and then to accidents, and devise ways to make the systems, as a whole, more reliable.
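
One way to see why each of these levers matters is a deliberately idealized calculation. If the defensive layers were independent (an assumption real organizations cannot count on, since a common cause can line the holes up), the chance of an accident is the product of the chances that each layer fails to block the problem.

    # Idealized illustration only: assumes the defensive layers fail independently,
    # which real systems often violate. The numbers are invented.
    def accident_probability(layer_failure_probabilities):
        probability = 1.0
        for p in layer_failure_probabilities:
            probability *= p
        return probability

    print(accident_probability([0.1, 0.1, 0.1]))           # three layers: 0.001
    print(accident_probability([0.1, 0.1, 0.1, 0.1]))      # add a slice:  0.0001
    print(accident_probability([0.05, 0.05, 0.05, 0.05]))  # shrink holes: ~6e-06

Adding a slice or shrinking the holes each cuts the product sharply; making the layers genuinely different from one another is what keeps the independence assumption from collapsing.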

When Good Design Isn’t Enough

WHEN PEOPLE REALLY ARE AT FAULT

I am sometimes asked whether it is really right to say that people are never at fault, that it is always bad design. That’s a sensible question. And yes, of course, sometimes it is the person who is at fault. Even competent people can lose competency if sleep deprived, fatigued, or under the influence of drugs. This is why we have laws banning pilots from flying if they have been drinking within some specified period and why we limit the number of hours they can fly without rest. Most professions that involve the risk of death or injury have similar regulations about drinking, sleep, and drugs. But everyday jobs do not have these restrictions. Hospitals often require their staff to go without sleep for durations that far exceed the safety requirements of airlines. Why? Would you be happy having a sleep-deprived physician operating on you? Why is sleep deprivation considered dangerous in one situation and ignored in another?

Some activities have height, age, or strength requirements. Others require considerable skills or technical knowledge: people not trained or not competent should not be doing them. That is why many activities require government-approved training and licensing. Some examples are automobile driving, airplane piloting, and medical practice. All require instructional courses and tests. In aviation, it isn’t sufficient to be trained: pilots must also keep in practice by flying some minimum number of hours per month.

Drunk driving is still a major cause of automobile accidents: this is clearly the fault of the drinker. Lack of sleep is another major culprit in vehicle accidents. But the fact that people occasionally are at fault does not justify the attitude that assumes they are always at fault. The far greater percentage of accidents is the result of poor design, either of equipment or, as is often the case in industrial accidents, of the procedures to be followed. As noted in the discussion of deliberate violations earlier in this chapter (page 169), people will sometimes deliberately violate procedures and rules, perhaps because they cannot get their jobs done otherwise, perhaps because they believe there are extenuating circumstances, and sometimes because they are taking the gamble that the relatively low probability of failure does not apply to them. Unfortunately, if someone does a dangerous activity that only results in injury or death one time in a million, that can lead to hundreds of deaths annually across the world, with its 7 billion people. One of my favorite examples in aviation is of a pilot who, after experiencing low oil-pressure readings in all three of his engines, stated that it must be an instrument failure because it was a one-in-a-million chance that the readings were true. He was right in his assessment, but unfortunately, he was the one. In the United States alone there were roughly 9 million flights in 2012. So, a one-in-a-million chance could translate into nine incidents. Sometimes, people really are at fault.

Resilience Engineering

In industrial applications, accidents in large, complex systems such as oil wells, oil refineries, chemical processing plants, electrical power systems, transportation, and medical services can have major impacts on the company and the surrounding community.


Sometimes the problems do not arise in the organization but outside it, such as when fierce storms, earthquakes, or tidal waves demolish large parts of the existing infrastructure. In either case, the question is how to design and manage these systems so that they can restore services with a minimum of disruption and damage. An important approach is resilience engineering, with the goal of designing systems, procedures, management, and the training of people so they are able to respond to problems as they arise. It strives to ensure that the design of all these things—the equipment, procedures, and communication both among workers and also externally to management and the public—are continually being assessed, tested, and improved.

Thus, major computer providers can deliberately cause errors in their systems to test how well the company can respond. This is done by deliberately shutting down critical facilities to ensure that the backup systems and redundancies actually work. Although it might seem dangerous to do this while the systems are online, serving real customers, the only way to test these large, complex systems is by doing so. Small tests and simulations do not carry the complexity, stress levels, and unexpected events that characterize real system failures. As Erik Hollnagel, David Woods, and Nancy Leveson, the authors of an early influential series of books on the topic, have skillfully summarized:

Resilience engineering is a paradigm for safety management that focuses on how to help people cope with complexity under pressure to achieve success. It strongly contrasts with what is typical today—a paradigm of tabulating error as if it were a thing, followed by interventions to reduce this count. A resilient organisation treats safety as a core value, not a commodity that can be counted. Indeed, safety shows itself only by the events that do not happen! Rather than view past success as a reason to ramp down investments, such organisations continue to invest in anticipating the changing potential for failure because they appreciate that their knowledge of the gaps is imperfect and that their environment constantly changes. One measure of resilience is therefore the ability to create foresight—to anticipate the changing shape of risk, before failure and harm occurs. (Reprinted by permission of the publishers. Hollnagel, Woods, & Leveson, 2006, p. 6.)
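
The deliberate failure-testing described earlier in this section can be sketched as a small drill: take one redundant component offline on purpose and confirm that the rest still carry the load. The components and the health check below are invented placeholders, not a description of any company’s actual practice.

    # Hypothetical sketch of a resilience drill: disable one redundant component
    # deliberately and verify the service stays healthy, then restore it.
    import random

    def resilience_drill(components, service_is_healthy):
        victim = random.choice(components)
        victim["online"] = False                 # deliberate, controlled failure
        healthy_during_drill = service_is_healthy(components)
        victim["online"] = True                  # restore after the drill
        return victim["name"], healthy_during_drill

    replicas = [{"name": f"replica-{i}", "online": True} for i in range(3)]
    quorum_check = lambda parts: sum(part["online"] for part in parts) >= 2
    victim, ok = resilience_drill(replicas, quorum_check)
    print(f"took {victim} offline; service healthy during drill: {ok}")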

The Paradox of Automation

Machines are getting smarter. More and more tasks are becoming fully automated. As this happens, there is a tendency to believe that many of the difficulties involved with human control will go away. Across the world, automobile accidents kill and injure tens of millions of people every year. When we finally have widespread adoption of self-driving cars, the accident and casualty rate will probably be dramatically reduced, just as automation in factories and aviation has increased efficiency while lowering both error and the rate of injury.

When automation works, it is wonderful, but when it fails, the resulting impact is usually unexpected and, as a result, dangerous. Today, automation and networked electrical generation systems have dramatically reduced the amount of time that electrical power is not available to homes and businesses. But when the electrical power grid goes down, it can affect huge sections of a country and take many days to recover. With self-driving cars, I predict that we will have fewer accidents and injuries, but that when there is an accident, it will be huge.

Automation keeps getting more and more capable. Automatic systems can take over tasks that used to be done by people, whether it is maintaining the proper temperature, automatically keeping an automobile within its assigned lane at the correct distance from the car in front, enabling airplanes to fly by themselves from takeoff to landing, or allowing ships to navigate by themselves. When the automation works, the tasks are usually done as well as or better than by people. Moreover, it saves people from the dull, dreary routine tasks, allowing more useful, productive use of time, reducing fatigue and error. But when the task gets too complex, automation tends to give up. This, of course, is precisely when it is needed the most. The paradox is that automation can take over the dull, dreary tasks, but fail with the complex ones.

When automation fails, it often does so without warning. This is a situation I have documented very thoroughly in my other books and many of my papers, as have many other people in the field of safety and automation. When the failure occurs, the human is “out of the loop.” This means that the person has not been paying much attention to the operation, and it takes time for the failure to be noticed and evaluated, and then to decide how to respond. In an airplane, when the automation fails, there is usually considerable time for the pilots to understand the situation and respond. Airplanes fly quite high: over 10 km (6 miles) above the earth, so even if the plane were to start falling, the pilots might have several minutes to respond. Moreover, pilots are extremely well trained. When automation fails in an automobile, the person might have only a fraction of a second to avoid an accident. This would be extremely difficult even for the most expert driver, and most drivers are not well trained. In other circumstances, such as ships, there may be more time to respond, but only if the failure of the automation is noticed.

In one dramatic case, the grounding of the cruise ship Royal Majesty in 1995, the failure lasted for several days and was only detected in the postaccident investigation, after the ship had run aground, causing several million dollars in damage. What happened? The ship’s location was normally determined by the Global Positioning System (GPS), but the cable that connected the satellite antenna to the navigation system somehow had become disconnected (nobody ever discovered how). As a result, the navigation system had switched from using GPS signals to “dead reckoning,” approximating the ship’s location by estimating speed and direction of travel, but the design of the navigation system didn’t make this apparent. As a result, as the ship traveled from Bermuda to its destination of Boston, it went too far south and went aground on Cape Cod, a peninsula jutting out of the water south of Boston. The automation had performed flawlessly for years, which increased people’s trust and reliance upon it, so the normal manual checking of location or careful perusal of the display (to see the tiny letters “dr” indicating “dead reckoning” mode) were not done. This was a huge mode error failure.


Design Principles for Dealing with Error

People are flexible, versatile, and creative. Machines are rigid, precise, and relatively fixed in their operations. There is a mismatch between the two, one that can lead to enhanced capability if used properly. Think of an electronic calculator. It doesn’t do mathematics like a person, but can solve problems people can’t. Moreover, calculators do not make errors. So the human plus calculator is a perfect collaboration: we humans figure out what the important problems are and how to state them. Then we use calculators to compute the solutions.

Difficulties arise when we do not think of people and machines as collaborative systems, but assign whatever tasks can be automated to the machines and leave the rest to people. This ends up requiring people to behave in machine-like fashion, in ways that differ from human capabilities. We expect people to monitor machines, which means keeping alert for long periods, something we are bad at. We require people to do repeated operations with the extreme precision and accuracy required by machines, again something we are not good at. When we divide up the machine and human components of a task in this way, we fail to take advantage of human strengths and capabilities but instead rely upon areas where we are genetically, biologically unsuited. Yet, when people fail, they are blamed.

What we call “human error” is often simply a human action that is inappropriate for the needs of technology. As a result, it flags a deficit in our technology. It should not be thought of as error. We should eliminate the concept of error: instead, we should realize that people can use assistance in translating their goals and plans into the appropriate form for technology. Given the mismatch between human competencies and technological requirements, errors are inevitable. Therefore, the best designs take that fact as given and seek to minimize the opportunities for errors while also mitigating the consequences. Assume that every possible mishap will happen, so protect against them. Make actions reversible; make errors less costly. Here are key design principles:

• Put the knowledge required to operate the technology in the world. Don’t require that all the knowledge must be in the head. Allow for efficient operation when people have learned all the requirements, when they are experts who can perform without the knowledge in the world, but make it possible for non-experts to use the knowledge in the world. This will also help experts who need to perform a rare, infrequently performed operation or return to the technology after a prolonged absence.
• Use the power of natural and artificial constraints: physical, logical, semantic, and cultural. Exploit the power of forcing functions and natural mappings.
• Bridge the two gulfs, the Gulf of Execution and the Gulf of Evaluation. Make things visible, both for execution and evaluation. On the execution side, provide feedforward information: make the options readily available. On the evaluation side, provide feedback: make the results of each action apparent. Make it possible to determine the system’s status readily, easily, accurately, and in a form consistent with the person’s goals, plans, and expectations.

We should deal with error by embracing it, by seeking to understand the causes and ensuring they do not happen again. We need to assist rather than punish or scold.


GOMS, Distributed Cognition, and the Knowledge Structures of Organizations

Robert L. West ([email protected]), Department of Psychology, University of Hong Kong, Hong Kong, SAR.
Alan Wong ([email protected]), Department of Psychology, University of Hong Kong, Hong Kong, SAR.
Alonso H. Vera ([email protected]), Department of Psychology, University of Hong Kong, Hong Kong, SAR.

Abstract

The idea that GOMS can be used to model HCI tasks within the organizational environment in which they occur is discussed and reviewed. An example in terms of satellite operations is provided.

Mantovani (1996) has proposed that the study of human computer interaction (HCI) is currently limited because we, “lack an integrated model of social context suitable for HCI research,” (Mantovani, 1996). However, while it is true that social context has not often been explicitly addressed in the HCI literature, this does not mean that the modeling systems currently in use cannot accommodate social context. More specifically, we propose that GOMS can be used to model HCI tasks within the relevant contextual aspects of organizations, such as companies and institutions. To support this we review the relevant issues and present the initial results of a study incorporating this goal. GOMS (Card, Moran, & Newell, 1983) is a modeling system designed to capture how experts execute well learned, routine tasks (e.g. word processing, satellite tracking). Essentially, it breaks down a task into Goals, Operators, Methods, and Selection rules (see John & Kieras, 1994 for a detailed review and discussion of the different variations of GOMS). Goals describe what the user wants to achieve; methods are combinations of subgoals and operators used to achieve goals; selection rules are rules for selecting between methods; and operators are actions, either physical (e.g. move the mouse), perceptual (e.g. search the screen), or cognitive (e.g. add two numbers). Although GOMS was originally developed to model individual humans interacting with computers, the unit of analysis is flexible. Thus a task performed by a room full of humans and computers could be described at the level of the interactions between the individual humans and computers, or at the level of the room, i.e. without reference to the specific agents involved. GOMS describes the knowledge level of the task. The granularity of the units involved depends on the goals of the researchers. Thus GOMS is equally suitable for describing the knowledge of an individual or the knowledge of a distributed cognitive

system, such as described by Hutchins (1990). As Olson and Olson (1990) note, the original formulation of GOMS had many limitations that have since been ameliorated. For example, the ability of GOMS to account for learning, errors, the limitations of working memory, and parallel processing have all been considerably improved (see Olson & Olson, 1990 for a review). What we are proposing is that organizations can be analyzed as distributed cognitive systems, and that GOMS can be used to describe the knowledge level of such systems, including the knowledge structures mediating task related social interactions (note, in this paper a social interaction is defined as any interaction between two or more people). Organizations tend to involve a high amount of routine, well learned activity. In addition, we suggest that the social and cultural rules mediating interpersonal relationships within organizations are likewise routine in nature. Therefore, GOMS, which has a good track record for modeling routine behavior, seems an appropriate modeling choice. However, note, we are not suggesting that GOMS can be used for all social situations, e.g. GOMS may not be a good choice for modeling close personal relationships and the like. In addition there are several important advantages to using GOMS. First, GOMS is a well established modeling tool with a host of studies demonstrating how to solve various specific problems. Second, modeling at the organizational level can entail a high degree of cross disciplinary work, involving diverse areas such as cognitive science, social psychology, game theory, sociology, and anthropology. Because GOMS is relatively easy to understand at the conceptual level, it is a good choice to serve as a common modeling language. Third, GOMS is currently used to create models of the types of systems commonly found in many organizations. If GOMS is used to model an organization’s structure then the model will be compatible with existing GOMS models of specific subtasks within that organization. Fourth, GOMS can be used to address specific questions such as time estimates, the efficient use of resources, possible goal conflicts, the degree to which goals can be fulfilled, and whether or not an organization would be robust in the face of changing conditions. In addition, this approach could also be applied to psychological and sociological questions, such as

determining the nature of the employee environment (e.g. does it promote undue stress? under what conditions?) or the emergent functional properties of the organization (e.g. does it perpetuate racism?). The advantage of a GOMS model for these types of questions is that it can provide a process model of how such conditions arise, as well as how they feed back into the system.
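To make the proposal concrete, the sketch below shows one way the GOMS vocabulary of goals, operators, methods, and selection rules can be written down and executed for a routine organizational sub-task. It is a minimal illustration in Python; the task, the operator names, and the time estimates are hypothetical and are not drawn from any of the studies cited here.

# A minimal sketch of a GOMS-style task description. All task and operator
# names and durations are hypothetical illustrations.

OPERATORS = {
    # primitive actions with rough time estimates (seconds)
    "read-request": 2.0,
    "open-form": 1.5,
    "fill-field": 4.0,
    "ask-colleague": 30.0,
    "submit": 1.0,
}

METHODS = {
    # a method is an ordered list of operators and/or subgoals
    "process-order": ["read-request", "complete-form", "submit"],
    "complete-form-from-memory": ["open-form", "fill-field", "fill-field"],
    "complete-form-with-help": ["open-form", "ask-colleague", "fill-field"],
}

def select_method(goal, context):
    """Selection rule: choose between competing methods for a goal."""
    if goal == "complete-form":
        return ("complete-form-from-memory"
                if context.get("clerk_is_experienced")
                else "complete-form-with-help")
    return goal  # goals with a single method use it directly

def execute(goal, context):
    """Expand a goal into operators and accumulate estimated time."""
    if goal in OPERATORS:
        return OPERATORS[goal]
    method = METHODS[select_method(goal, context)]
    return sum(execute(step, context) for step in method)

print(execute("process-order", {"clerk_is_experienced": False}))  # 38.5
print(execute("process-order", {"clerk_is_experienced": True}))   # 12.5

Even at this toy scale, the same structure can be attached to a person, a person plus a computer, or a whole department, which is the flexibility of the unit of analysis discussed above.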

Modeling Social Actions

In this section we consider the issue of modeling social actions, that is, the interactions between two or more people. From an individual's point of view, there are two broad types of social actions: 1) actions relating to other individuals, such as making the judgment, "do I trust this person?" and 2) actions relating to groups, such as making the judgment, "will the market go up?" In some cases, GOMS methods may be constructed to make these types of judgments. Such methods would reflect the knowledge level of the task. For example, to answer the question of whether the market will go up or down, an expert might execute the subgoals of gathering economic and political information. However, it may also be necessary to assume social operators for gut-level decisions. For example, to answer the question, "what is the mood of the market?" an analyst may go with his or her feeling about it. Damasio's (1994) Somatic-Marker Hypothesis is a good candidate for understanding this type of process. Social operators can also be modeled using the mechanisms proposed in social psychology, combined with specific organizational, social and cultural knowledge. For example, Fishbein and Ajzen's (1974) theory of reasoned action can be combined with real-world data to predict attitudes towards alternative actions. Social operators could also be modeled using AI architectures, such as ACT-R (Anderson, 1993) or SOAR (Newell, 1990).

However, it is not always necessary to provide models of operators. As John (1995) notes, "Operators can be defined at many different levels of abstraction but most GOMS models define them at a concrete level, like button presses and menu selections." This is certainly useful, since simple operations can be assumed to be performed correctly and within an approximate time span most of the time (Card et al., 1983), but operators need not be limited to simple operations. In principle, any process can be made into an operator. For example, a model could be constructed in which an architect calls an operator to judge the aesthetic quality of a design. Thus, by assuming that a complex action exists as an operator, it becomes possible to frame where and when it takes place within a GOMS model, as well as its functional significance. For example, the output of the aesthetic operator above could be used as the basis for a selection rule (e.g. if aesthetic, continue drawing; if not aesthetic, throw it away). The issue is one of finding the most useful level of operator abstraction for the task being modeled (Card et al., 1983).

It is also interesting to note that GOMS itself can be used as a model for certain types of social thought. Specifically, it can be argued that when one person wants to predict what another person will do in a particular situation, they construct something very much like a GOMS model of the person in the situation and then mentally simulate it (e.g. see Kahneman & Tversky, 1982). In this sense people may be very similar to intelligent software agents that use GOMS models to predict the behavior of the user (e.g. Vera and Rosenblatt, 1995). If this is the case, then it follows that GOMS would be particularly appropriate for modeling this type of behavior.
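As a minimal sketch of the point about operator abstraction, the fragment below treats a complex social/affective judgment as a single black-box operator whose output feeds a selection rule, as in the aesthetic-judgment example above. The judgment function is a hypothetical stand-in; in a real model it might come from questionnaire data, a Somatic-Marker-style mechanism, or an ACT-R or SOAR sub-model.

# A complex judgment modeled as a single operator feeding a selection rule.
# judge_aesthetics is a hypothetical stand-in for the real process.

import random

def judge_aesthetics(design):
    # Treated as a black box: we only assume it returns a verdict within
    # some time bound, not how it works internally.
    return random.random() < design.get("prior_quality", 0.5)

def architect_step(design):
    """Selection rule keyed on the operator's output."""
    if judge_aesthetics(design):
        return "continue-drawing"
    return "discard-and-restart"

print(architect_step({"prior_quality": 0.8}))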

The Importance of Goals

As Mantovani (1996) notes, people have socially and culturally based goals. Most GOMS models do not consider such higher-level goals; they implicitly assume that the user's highest goal is to do the task as well as possible. To get at the organizational, social and cultural goals of a user it is necessary to ask why the employee wants to complete the task. In many cases the answer will be that the employee will benefit from doing it (i.e. they will be paid, or they will avoid being fired). In this case the goal hierarchy need go no higher. However, in some cases the higher-level goal structure may be more complex. In particular, when an employee is free to choose between various tasks, it is necessary to understand his or her higher-level goals in order to predict what he or she will do next. In addition, in the case of multiple employees, it may be important to understand how the higher-level goal structure of one employee interacts with the higher-level goal structures of other employees. This type of analysis could be approached from a game theory perspective (i.e. assume payoffs for achieving goals and that employees will act rationally) or a social cognition perspective (i.e. modeling based on questionnaires, observation, etc.).

Another interesting and related issue is determining the goals of an organization. Organizations have goals; for example, an environmental consulting firm may state its goal as being to create a clean environment. However, individuals can work within an organization without adopting the higher-level goals of the organization. For example, a person could work for an environmental consulting firm simply for the money. The question as to whether the structure of an organization is such that the goals of the individuals result in the stated goal of the organization is an interesting and important one, especially for governmental and other public service organizations. Also, as noted in the introduction, firms may have goals implicit in their structure. For example, sociologists often refer to institutionalized racism. The benefit of using a GOMS model to examine this type of issue is that it can provide a highly specific process model of why various goals exist within a system, because goals are triggered by a clear chain of events involving higher-level goals and selection rules.
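A minimal sketch of the game-theoretic reading mentioned above: when an employee is free to choose among tasks, assume a payoff to the employee for each task and have the model pick the one with the largest payoff. The tasks and payoff values are hypothetical placeholders.

# Higher-level goal structure as task selection by assumed payoff.

def choose_next_task(available_tasks, payoffs):
    """Select the task whose assumed payoff to the employee is largest."""
    return max(available_tasks, key=lambda t: payoffs.get(t, 0.0))

payoffs = {"finish-report": 5.0, "help-colleague": 2.0, "tidy-desk": 0.5}
print(choose_next_task(["help-colleague", "finish-report", "tidy-desk"], payoffs))
# -> "finish-report"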

The Top-Down Effects of Goals

The manner in which higher-level organizational, social or cultural goals can affect the behavior of lower-level methods and operators is of critical importance. GOMS is modular in nature and thus based on the traditional cognitive science assumption that we can abstract simple problem spaces from a complex world and deal with them in relative isolation (Vera & Simon, 1993). Mantovani (1996), making the case for his interpretation of the role of social and cultural factors, argues that this is never the case, even for very simple actions such as tying shoes:

when we wear shoes, we usually have some project in our mind regarding some activity which is relevant with respect to our current interests and requires wearing shoes. Thus tying laces and wearing shoes are simple activities which depend on actors' cultural models, for example, models of healthy behavior generate broader projects like keeping fit by running in the park in the morning. (Mantovani, 1996)

However, the traditional cognitive science perspective argues that simple actions such as shoe tying can be treated in isolation and that it is unnecessary to understand the context of shoe tying beyond the goal of wanting one's shoes to be tied. A person could tie his or her shoes as part of fulfilling many divergent goals (e.g. a run in the park, a night at the opera, or overthrowing a dictatorship) and the process would remain essentially the same. This is true in two senses. First, in terms of the measurable outcome, the laces would be tied in approximately the same way each time. The results might be affected by factors such as time pressure and memory load (e.g. forgetting to tie one's shoes or tying them sloppily), but these are not direct effects of the social context; rather, they are mediated by basic cognitive variables (e.g. memory and processing speed). The second sense in which this is true is in terms of the knowledge and motor skills deployed to achieve the result. Again, allowing for mediating factors such as memory load and time, we would expect the process of tying shoes to remain constant once it has been well learned.

Low-level operators (such as clicking a mouse on an icon) will generally be unaffected by higher-level social/cultural goals. However, in cases in which a low-level operator is affected, the effect will be mediated by well-defined variables such as memory load, demands on attention, speed/accuracy tradeoffs, and so on. This type of effect, when it occurs, may be an important consideration in modeling the system. Specifically, if a low-level operator error can cause a significant problem, then the factors that mediate the likelihood of such an error should be modeled in. An example of this approach is the use of GOMS to predict the effect of workload on working memory (Card et al., 1983).

In the case of social operators we argue that the same approach can be taken. That is, for the most part social operators will not be affected by higher-level goals, but when they are, it will be through the mediation of a limited number of variables. In fact, this is the approach used in social cognition. Cognitive variables, such as those mentioned above, can be used to predict mediators such as stress, which is a major determinant of social functioning (e.g. a social operator such as "behave politely towards the customer" could be severely disrupted by stress). Other socially mediating factors, such as threats to self-esteem, may also play an important role. Thus operators are defined in terms of the general factors that could affect their operation, as well as in terms of how they affect the general factors. To keep track of this in a model, an index can be attached to each employee (or customer, or client) to record the general factors impinging on them at each step in the process. For example, an index could be used to keep track of memory load or stress due to time pressure. This type of approach could be useful for modeling systems in which human error can have serious consequences.
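The indexing idea can be sketched as follows: each employee (or customer, or client) carries a small record of mediating variables, and the assumed error likelihood of a low-level operator is modulated by that record. The particular variables, weights, and update values below are illustrative assumptions, not measured quantities.

# A per-agent index of mediating factors that modulates operator reliability.
# The weighting function and all numbers are hypothetical.

from dataclasses import dataclass, field

@dataclass
class AgentIndex:
    memory_load: float = 0.0   # 0 (idle) .. 1 (overloaded)
    stress: float = 0.0        # 0 (calm) .. 1 (severe time pressure)
    log: list = field(default_factory=list)

def operator_error_probability(base_rate, index):
    """Assumed monotone effect of load and stress on a low-level operator."""
    return min(1.0, base_rate * (1.0 + 2.0 * index.memory_load + 3.0 * index.stress))

clerk = AgentIndex()
for step, (d_load, d_stress) in enumerate([(0.1, 0.0), (0.3, 0.2), (0.2, 0.5)]):
    clerk.memory_load = min(1.0, clerk.memory_load + d_load)
    clerk.stress = min(1.0, clerk.stress + d_stress)
    clerk.log.append((step, operator_error_probability(0.01, clerk)))

print(clerk.log)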

Spin-off Effects: An Example

Although we argue that lower-level methods and operators can be treated in relative isolation from higher-level goals, their appropriateness for fulfilling specific higher-level goals is still a potentially important issue. Specifically, evaluating methods solely in terms of how they satisfy the immediate goal may not lead to the optimal solution. This is because methods can have spin-off effects which may be important for satisfying goals elsewhere in the system. For example, one of the authors (RLW) is working on the HCI for an electronic Chinese-English dictionary designed for those who cannot read Chinese. Initially the highest-level goal in the model for this task was to look up a Chinese character and find its meaning in English. This generated a lot of high-tech suggestions, such as scanning the character and using a neural network to identify it. However, by identifying the higher-level goal of the intended users, which in this case was to learn to read Chinese, it was realized that some form of the traditional system (the radical search method) would probably be better. This is because the traditional system requires the user to parse the characters into their meaningful, pictographic components, which studies show play an important role in character recognition. Thus a process for learning the structure of Chinese characters is situated in the process of looking them up in the dictionary. Of course, there are always ever higher goals being generated by ever higher system structures. Obviously we cannot consider every possibility in a model. However, we have a better chance of finding spin-off effects within an organization if we possess a model of the organization. With a detailed model, spin-off effects can be located through simulating different versions of the model (e.g. inserting different HCI structures for specific tasks), or simply by studying the model.

Reactivity to the Environment

Traditionally, GOMS models have assumed a pristine task environment, one in which interruptions unrelated to the specific task being modeled do not occur. However, in a social setting (i.e. within an organization), the user can be interrupted and information injected that can alter minor or major components of the task. Thus tasks are situated within a social/cultural/organizational environment. The issue is one of reactivity to the environment, which involves two aspects of GOMS modeling. The first aspect is the goal stack: a shallow goal stack increases reactivity by allowing the system to more frequently run checks on the environment between executing goal stacks (e.g. John & Vera, 1992). The second aspect is the level of abstraction involved in defining the operators (Card et al., 1983). Obviously, operators defined at a gross level of granularity will tend to overlook opportunities in which a person could be interrupted. Another approach to dealing with this problem is to adopt a parallel processing approach, such as CPM-GOMS (Gray, John & Atwood, 1993). CPM-GOMS uses a schedule chart to represent simultaneously occurring activity, which can be analyzed using a critical path analysis. Using this approach the environment can be monitored in parallel while other tasks are going on. For example, an engineer waiting for some plans might work on a side task while monitoring for news that the plans have arrived.
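A minimal sketch of the goal-stack point: with a shallow stack and small unit tasks, the model can check the environment between unit tasks and interleave an interruption when one has arrived. The task names and the timing of the interruption are hypothetical.

# Shallow goal stack: check the environment between unit tasks and
# interleave any interruption that has arrived. Names are illustrative.

def run(unit_tasks, events_by_step):
    trace = []
    step = 0
    pending = []
    for task in unit_tasks:
        # between unit tasks, poll the environment (shallow stack => frequent checks)
        if step in events_by_step:
            pending.append(events_by_step[step])
        while pending:
            trace.append(("handle", pending.pop()))
        trace.append(("do", task))
        step += 1
    return trace

print(run(["draft-section", "check-figures", "send-draft"],
          {1: "phone-call-from-engineer"}))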

Modeling Satellite Maneuvers

Currently we are working on a GOMS model of satellite maneuvering, a task that is very routine in nature but also demands a high level of reactivity to the environment. The satellite technicians (henceforth STs) must pay attention to the computer interface system as well as to each other. In addition, the maneuvers take place within the larger context of the satellite management organization. Here we report how we have approached modeling this activity with regard to performing an attitude maneuver.

Method

Unobtrusive observations of satellite attitude maneuvers were conducted. The satellite technicians were all fully trained, with an average of 6 months of job experience. Two observers took notes during the task. Other sources of data included the task manual, checklists and other handbooks. Separate interviews were also conducted with individual operators after the missions were completed.

The Model

The completion of the maneuver required the fulfillment of seven tasks:

1. Configure system: The satellite maneuver task is semiautomated. The STs must call up a program at the beginning of the scenario which consists of batches of commands and instructions that guide the operators' behaviors.
2. Prepare for phase check: This includes selecting data channels for information collection, switching on the printer and refilling the printer paper.
3. Execute phase check: The purpose of this is to ensure synchronization with the satellite. The STs specify the necessary parameters and then start the printer. At the end of the phase check, a time-graph is charted and measurements are taken.
4. Prepare maneuver: The STs specify the time of the maneuver execution and prepare for it (e.g. automated nutation control and antenna pointing are switched off).
5. Execute maneuver: The actual maneuver process is completely automated. However, the printer must start running 10 seconds before its execution to allow for data collection.
6. Finish maneuver: When the execution ends, the STs check the data and return the spin rate and temperature of the satellite to normal, and the automated nutation control and antenna pointing systems to their usual status.
7. Attend alarm: Alarms can occur at any time throughout the maneuver task. The STs have to acknowledge and analyze them individually before continuing the normal task.

Although the task is routine and largely automated, it was observed that the STs' actions were intimately dependent on cues provided by the computer interface (primarily the monitor screen). The interface captured many aspects of the maneuver, breaking them down into much smaller subgoals, and cueing the STs when appropriate. Hence, although the task structure appeared retrospectively as a large serial plan, the STs were actually highly reactive to their environment. This was corroborated by the observations that 1) STs constantly referred to their monitor screens, 2) STs waited for the external cue before taking action, and 3) when alarms unexpectedly interrupted the scenario, the ST typically completed the sub-goal he was engaged in, dealt with the problem specified by the alarm, and resumed his normal course of action by looking for cues on the interface.

To model the STs' reactivity to system cues we adopted the strategy of using very shallow goal stacks, prompted by the system cues. For example, when the STs perceive the cue "APE to manual" the sub-goal of turning off the antenna pointing system is pushed onto the goal stack. Note that the unit task is very small, in this case consisting of only two operators, "enter command" and "verify command," allowing the STs to frequently return to monitoring the environment. The STs, modeled in this way, do not need to understand the relationship between the sub-goals to successfully accomplish the maneuver.

However, this approach could not fully account for the STs' behavior. First, some of the sub-goals were not cued by the interface. These were: turn on printer, run printer, collect data, start counting, and select data channels. As a consequence, in one observation the STs forgot to run the printer during the phase check, and the procedure had to be repeated. Second, although none of the cued sub-goals were missed, as the procedure is very routine in nature we would expect that the STs would become aware of a missed cue after receiving the subsequent cue, which would seem sequentially inappropriate. Finally, if the system made an error, such as failing to give a cue, we would expect that the STs would be aware of this for the same reasons they would be aware of a missed cue (in fact one of the functions of the STs is to detect problems with the system).

To model this we assumed that the STs have knowledge of the sequence of sub-goals for the attitude maneuver. When an external interface representation causes a sub-goal to be pushed onto the stack, it is verified against this knowledge structure. When a sub-goal cued by the interface is in conflict with the ST's knowledge of what should be occurring, it signals that a problem has occurred. At this point the STs would engage in problem-solving behaviors (similar to Gray, Kirschenbaum, and Ehret, 1997; this process could be modeled using a problem-solving architecture such as ACT-R or SOAR). Assuming no problem is detected, after the current sub-goal is verified the next sub-goal can be retrieved while work on the current goal is going on in parallel.
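The two-source account described above can be sketched as follows: interface cues push sub-goals onto a shallow stack, and each pushed sub-goal is verified against the STs' stored knowledge of the sub-goal sequence; a mismatch (or a missing cue) triggers problem solving. The cue strings and the abbreviated sequence below are illustrative, not the full procedure.

# Cue-driven sub-goal pushing verified against stored sequence knowledge.
# The cues and the sequence are illustrative placeholders.

KNOWN_SEQUENCE = ["select data channels", "run phase check",
                  "set maneuver time", "APE to manual", "execute maneuver"]

CUE_TO_SUBGOAL = {"APE TO MANUAL": "APE to manual",
                  "PHASE CHECK READY": "run phase check",
                  "MANEUVER TIME?": "set maneuver time",
                  "EXECUTE": "execute maneuver"}

def step(cue, expected_index):
    """Return (action, new_index). A mismatch flags a sequence conflict."""
    subgoal = CUE_TO_SUBGOAL.get(cue)
    if subgoal is None:
        return ("ignore-or-problem-solve", expected_index)
    expected = KNOWN_SEQUENCE[expected_index]
    if subgoal != expected:
        return (f"problem-solve: expected '{expected}', cued '{subgoal}'", expected_index)
    # the unit task is tiny: enter command, verify command, return to monitoring
    return (f"enter+verify '{subgoal}'", expected_index + 1)

idx = 1  # 'select data channels' is not cued; assume it was done from memory
for cue in ["PHASE CHECK READY", "MANEUVER TIME?", "EXECUTE"]:
    action, idx = step(cue, idx)
    print(action)  # the final cue reveals that the 'APE to manual' cue was missed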

Levels of Analysis

As noted in the introduction, the unit of analysis for a GOMS model can vary. For example, the seven tasks comprising an attitude maneuver could easily be described without distinguishing between the computer systems and the STs. From a GOMS perspective, distributed cognitive systems, such as the satellite operations room, can be considered to possess expert knowledge in the same way that individual humans do. However, since we were interested in the HCI characteristics of the task, we differentiated between the STs and the computer systems, and found that the task knowledge is distributed across the STs and the computer systems, with some level of redundancy.

Another issue is the level of analysis with regard to the STs. In actuality the maneuver is performed in a room containing a supervisor and two operators, one to execute commands and the other to verify commands. The supervisor, in addition to supervising, also functions to communicate with other departments involved with various decisions pertaining to the maneuver. In terms of ST knowledge, our observations showed that, in addition to system cues and the knowledge stored in each ST's long-term memory, knowledge was also drawn from interactions with job manuals, checklists, and the other STs. Essentially, our model treats the STs, including their job manuals and checklists, as a single unit. The behavior of individual STs is never referred to. Treated in this way, the STs function as a distributed cognitive sub-system within the greater distributed system of the operations room. What this level of analysis leaves out is how the STs organize themselves and how they draw knowledge from sources other than the interface system. For example, our model assumes that the STs can retrieve the next sub-goal while work on the current sub-goal is going on in parallel. This could involve an individual ST retrieving the next sub-goal from memory while working on the current sub-goal, or requesting another ST to look up the next sub-goal on a checklist while he or she attends to the current sub-goal. One of the next goals in this project is to model the flow of information between the individual STs, their manuals, checklists, and the computer interface system.

To summarize, we can conceptualize this task at three different levels of analysis:
1) The room itself as a distributed cognitive system. This level focuses on a functional description of the task.
2) The computers as one distributed system and the STs (including their non-computerized reference material) as another distributed system. This level emphasizes the interaction between the STs and the computers.
3) The STs as individual systems interacting with their manuals, checklists, and the computer interface system. This level highlights the flow of information around the room.

Social Actions

The social interactions were highly constrained by the demands of the task, as well as by organizational policies as to how the STs should interact during this task. In addition, the STs frequently perform this maneuver, as well as other maneuvers which are highly similar in terms of what is required from the STs. Thus the individual behaviors of the STs towards each other are very routine in nature, and as such can be captured using GOMS at a fairly detailed level. However, social interactions need not always be modeled at the level of individual behaviors. For example, the STs can decide who will "execute" and who will "verify" each time the task is performed. Currently we are modeling this decision process as an operator attached to the distributed cognitive system comprised of all three STs. That is, for our purposes we are not interested in the interactions involved in this decision, just the fact that it is made by the STs. Social interactions between groups are also involved in this process. Specifically, the satellite operations department must interact with the control room (where instructions as to what type of maneuver to perform come from) and the orbital engineers (who calculate the parameters of the maneuver). These other departments are represented at the level of the department (i.e. as distributed cognitive entities without reference to the human and computer agents that make them up), and are modeled only in terms of their relevance to the satellite room.

Conclusions

The thesis of this paper is that GOMS can be used to model large, interactive systems such as organizations or institutions, by treating them as distributed cognitive systems. The example from our ongoing research on satellite operations is highly specific, but this is the point of GOMS modeling - to uncover a knowledge-level description of how a particular task is performed. In contrast, work on social interactions and the behavior of organizations has tended to focus on finding general psychological mechanisms (e.g. social psychology), or general principles or patterns of interactions (e.g. sociology, anthropology). GOMS therefore offers an alternative perspective which we believe would complement work in these areas. In addition, by considering the organizational context in which an HCI task occurs we gain a broader picture of the task. This is particularly relevant as organizations increasingly employ computer networks instead of isolated PCs.

References

Anderson, J. R. (1993). Rules of the mind. Hillsdale, NJ: Lawrence Erlbaum Associates.
Card, S. K., Moran, T. P., & Newell, A. (1983). The psychology of human-computer interaction. Hillsdale, NJ: Lawrence Erlbaum Associates.
Damasio, A. R. (1994). Descartes' error. New York: Avon Books.
Fishbein, M., & Ajzen, I. (1974). Attitudes towards objects as predictors of single and multiple behavioral criteria. Psychological Review, 81, 59-74.
Gray, W. D., John, B. E., & Atwood, M. E. (1993). Project Ernestine: A validation of GOMS for prediction and explanation of real-world task performance. Human-Computer Interaction, 8(3), 237-309.
Gray, W. D., Kirschenbaum, S. S., & Ehret, B. D. (1997). Subgoaling and subschemas for submariners: Cognitive models of situation assessment (GMU-ARCH 97-01-16). George Mason University.
Hutchins, E. (1994). Cognition in the wild. Cambridge, MA: The MIT Press.
John, B. E. (1995). Why GOMS? Interactions, October 1995.
John, B. E., & Kieras, D. E. (1994). The GOMS family of analysis techniques: Tools for design and evaluation (Technical Report CMU-HCII-94-106). Human-Computer Interaction Institute, Carnegie Mellon University.
John, B. E., & Vera, A. H. (1992). A GOMS analysis for a graphic, machine-paced, highly interactive task. Behaviour and Information Technology, 13(4), 255-276.
Kahneman, D., & Tversky, A. (1982). The simulation heuristic. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 201-208). New York: Cambridge University Press.
Mantovani, G. (1996). Social context in HCI: A new framework for mental models, cooperation, and communication. Cognitive Science, 20(2), 237-269.
Newell, A. (1990). Unified theories of cognition. Cambridge, MA: Harvard University Press.
Olson, J. R., & Olson, G. M. (1990). The growth of cognitive modeling in human-computer interaction since GOMS. Human-Computer Interaction, 5, 221-265.
Vera, A. H., & Simon, H. A. (1993). Situated action: A symbolic interpretation. Cognitive Science, 17(1), 7-48.

COGNITIVE SCIENCE 19, 265-288 (1995)

How a Cockpit Remembers Its Speeds

EDWIN HUTCHINS
University of California, San Diego

Cognitive science normally takes the individual agent as its unit of analysis. In many human endeavors, however, the outcomes of interest are not determined entirely by the information processing properties of individuals. Nor can they be inferred from the properties of the individual agents alone, no matter how detailed the knowledge of the properties of those individuals may be. In commercial aviation, for example, the successful completion of a flight is produced by a system that typically includes two or more pilots interacting with each other and with a suite of technological devices. This article presents a theoretical framework that takes a distributed, socio-technical system rather than an individual mind as its primary unit of analysis. This framework is explicitly cognitive in that it is concerned with how information is represented and how representations are transformed and propagated in the performance of tasks. An analysis of a memory task in the cockpit of a commercial airliner shows how the cognitive properties of such distributed systems can differ radically from the cognitive properties of the individuals who inhabit them.

Thirty years of research in cognitive psychology and other areas of cognitive science have given us powerful models of the information processing properties of individual human agents. The cognitive science approach provides a very useful frame for thinking about thinking. When this frame is applied to the individual human agent, one asks a set of questions about the mental processes that organize the behavior of the individual.1 In particular, one asks how information is represented in the cognitive system and how representations are transformed, combined, and propagated through the system (Simon, 1981). Cognitive science thus concerns itself with the nature of knowledge structures and the processes that operate on them. The properties of these representations inside the system and the processes that operate on representations are assumed to cause or explain the observed performance of the cognitive system as a whole.

In this paper, I will attempt to show that the classical cognitive science approach can be applied with little modification to a unit of analysis that is larger than a person. One can still ask the same questions of a larger, sociotechnical system that one would ask of an individual. That is, we wish to characterize the behavioral properties of the unit of analysis in terms of the structure and the processing of representations that are internal to the system. With the new unit of analysis, many of the representations can be observed directly, so in some respects, this may be a much easier task than trying to determine the processes internal to the individual that account for the individual's behavior. Posing questions in this way reveals how systems that are larger than an individual may have cognitive properties in their own right that cannot be reduced to the cognitive properties of individual persons (Hutchins, 1995). Many outcomes that concern us on a daily basis are produced by cognitive systems of this sort. Thinking of organizations as cognitive systems is not new, of course.2 What is new is the examination of the role of the material media in which representations are embodied, and in the physical processes that propagate representations across media. Applying the cognitive science approach to a larger unit of analysis requires attention to the details of these processes as they are enacted in the activities of real persons interacting with real material media. The analysis presented here shows that structure in the environment can provide much more than external memory (Norman, 1993).

I will take the cockpit of a commercial airliner as my unit of analysis and will show how the cockpit system performs the cognitive tasks of computing and remembering a set of correspondences between airspeed and wing configuration. I will not present extended examples from actual observations because I don't know how to render such observations meaningful for a non-flying audience without swamping the reader in technical detail. Instead, I will present a somewhat stylized account of the use of the small set of tools in the performance of this simple task, which is accomplished every time an airliner makes an approach to landing.

Author's note: An initial analysis of speed bugs as cognitive artifacts was completed in November of 1988. Since then, my knowledge of the actual uses of speed bugs and my understanding of their role in cockpit cognition has changed dramatically. Some of the ideas in this paper were presented in a paper titled "Information Flow in the Cockpit" at the American Institute of Aeronautics and Astronautics symposium "Challenges in Aviation Human Factors: The National Plan," in Vienna, Virginia. The current draft also benefited from the comments of members of the Flight Deck Research Group of the Boeing Commercial Airplane Company and from the participants in the second NASA Aviation Safety/Automation program researchers meeting. Thanks to Hank Strub, Christine Halverson, and Everett Palmer for discussions and written comments on earlier versions of this paper. This research was supported by grant NCC 2-591 from the Ames Research Center of the National Aeronautics and Space Administration in the Aviation Safety/Automation Program. Everett Palmer served as Technical Monitor. Correspondence and requests for reprints should be sent to Edwin Hutchins, Department of Cognitive Science, University of California, San Diego, La Jolla, CA 92093-0515; or e-mail to: [email protected].

1 This notion is widespread in cognitive science. See Simon & Kaplan, 1989. The canonical statement of what is currently accepted as the standard position appears in Newell & Simon, 1972. See also Wickens & Flach, 1988 for a direct application of this perspective to aviation.
2 March and Simon staked out this territory with their seminal book, Organizations, in 1958. For a review of conceptions of organizations see Morgan, 1986.

The procedures described below come straight from the pages of a major airline's operations manual for a midsized jet, the McDonnell Douglas MD-80. Similar procedures exist for every make and model of airliner. The explanations of the procedures are informed by my experience as a pilot and as an ethnographer of cockpits. In conducting research on aviation safety during the past 6 years,3 I have made more than 100 flights as an observer member of crews in the cockpits of commercial airliners. These observations spanned a wide range of planes, including old and new technology cockpits, domestic and international (trans-oceanic) operations, and both foreign and US-flag carriers.

3 This research was performed under a contract from the flight human factors branch of the NASA Ames Research Center. In addition to my activities as an observer, I hold a commercial pilot certificate with multiengine and instrument airplane ratings. I have completed the transition training course (both ground school and full-flight) for the Boeing 747-400 and the ground schools for the McDonnell Douglas MD-80 and the Airbus A320. I am grateful to the Boeing Commercial Airplane group, McDonnell Douglas, and America West Airlines for these training opportunities.

APPLYING THE COGNITIVE FRAME TO THE COCKPIT SYSTEM

If we want to explain the information processing properties of individuals, we have no choice but to attempt to infer what is inside the individual's mind. Cognitive scientists do this by constructing carefully selected contexts for eliciting behavior from which they can attribute internal states to actors. However, if we take the cockpit system as the unit of analysis, we can look inside it and directly observe many of the phenomena of interest. In particular, we can directly observe the many representations that are inside the cockpit system, yet outside the heads of the pilots. We can do a lot of research on the cognitive properties of such a system (i.e., we can give accounts of the system's behavioral properties in terms of its internal representations) without saying anything about the processes that operate inside individual actors (Hutchins, 1990, 1991, 1995). This suggests that rather than trying to map the findings of cognitive psychological studies of individuals directly onto the individual pilots in the cockpit, we should map the conceptualization of the cognitive system onto a new unit of analysis: the cockpit as a whole.

REMEMBERING SPEEDS

Why Speeds Must Be Remembered

For an illustration of the application of the cognitive science frame to the cockpit system, consider the events having to do with remembering speeds in the cockpit of a midsize civil transport jet (a McDonnell Douglas MD-80) on a typical descent from a cruise altitude above 30,000 feet, followed by an instrument landing system (ILS) approach and landing. Virtually all of the practices described in this paper are mandated by federal regulations, airline policy, or both. A reader may wonder how many crews do these things. The answer is that nearly all of them do these things on every flight. Exceptions are extremely rare. In all of my observations, never have I seen a crew fail to compute and set the approach speeds. This is known in the aviation world as a "killer" item. It is something that can cause a fatal accident if missed. Of course, sometimes crews do miss these procedures, and sometimes they make headlines as a result. To understand what the task is and how it is accomplished, one needs to know something about the flight characteristics of commercial jet transports as well as something about the mandated division of labor among members of the crew.

Flaps and Slats

The wings of airliners are designed to enable fast flight, yet performance and safety considerations require airliners to fly relatively slowly just after takeoff and before landing. The wings generate ample lift at high speeds, but the shapes designed for high speed cannot generate enough lift to keep the airplane flying at low speeds. To solve this problem, airplanes are equipped with devices, called slats and flaps,4 that change the shape and area of the wing. Slats and flaps are normally retracted in flight, giving the wing a very clean aerodynamic shape. For slow flight, slats and flaps are extended, enlarging the wing and increasing its coefficient of lift. The positions of the slats and flaps define configurations of the wing. In a "clean" wing configuration, the slats and flaps are entirely retracted. There is a lower limit on the speed at which the airplane can be flown in this configuration. Below this limit, the wing can no longer produce lift. This condition is called a wing stall.5 The stall has an abrupt onset and invariably leads to loss of altitude. Stalls at low altitude are very dangerous. The minimum maneuvering speed for a given configuration and aircraft weight is a speed that guarantees a reasonable margin of safety above the stall speed. Flying slower than this speed is dangerous because the airplane is nearer to a stall. Changing the configuration of the wing by extending the slats and flaps lowers the stall speed of the wing, thus permitting the airplane to fly safely at slower speeds. As the airplane nears the airport, it must slow down to maneuver for landing. To maintain safe flight at slower speeds, the crew must extend the slats and flaps to produce the appropriate wing configurations at the right speeds. The coordination of changing wing configuration with changing speed as the airplane slows down is the first part of the speed memory task.

4 Slats are normally on the leading edge of a wing; flaps are normally on the trailing edge.
5 This "stall" has nothing to do with the functioning of the engines. Under the right conditions, any airplane can stall with all engines generating maximum thrust.

The second part concerns remembering the speed at which the landing is to be made.

Vref

Within the range of speeds at which the airplane can be flown in its final flap and slat configuration, which speed is right for landing? There are tradeoffs in the determination of landing speed. High speeds are safe in the air because they provide good control response and large stall margins, but they are dangerous on the ground. Limitations on runway length, energy to be dissipated by braking, and the energy to be dissipated if there is an accident on landing all suggest that landing speed should be as slow as is feasible. The airplane should be traveling slowly enough that it is ready to quit flying when the wheels touch down, but fast enough that control can be maintained in the approach and that, if a landing cannot be made, the airplane has enough kinetic energy to climb away from the ground. This speed is called the reference speed, or Vref. Precise control of speed at the correct value is essential to a safe landing. The minimum maneuvering speeds for the various wing configurations and the speed for landing (called the reference speed) are tabulated in the FLAP/SLAT CONFIGURATION MIN MAN AND REFERENCE SPEED table (Table 1). If weight were not a factor, there would be only one set of speeds to remember, and the task would be much simpler.

Crew Division of Labor

All modern jet transports have two pilot stations, each equipped with a complete set of flight instrumentation. While the airplane is in the air, one pilot is designated the pilot flying (PF) and the other, the pilot not flying (PNF). These roles carry with them particular responsibilities with respect to the conduct of the flight. The pilot flying is concerned primarily with control of the airplane. The PNF communicates with air traffic control (ATC), operates the aircraft systems, accomplishes the checklists required in each phase of flight, and attends to other duties in the cockpit.

THREE DESCRIPTIONS OF MEMORY FOR SPEEDS

With an understanding of the problem and the basics of crew organization, we can now examine the activities in the cockpit that are involved with the generation and maintenance of representations of the maneuvering and reference speeds. I will provide three descriptions of the same activities. The first description is procedural. It is the sort of description that a pilot might provide. The second and third descriptions are cognitive in that they concern representations and processes that transform those representations. The second description treats the representations and processes that are external


to the pilots. It provides the constraints for the final description of the representations and processes that are presumed to be internal to the pilots.

A Procedural Description of Memory for Speeds

Prepare the Landing Data

After initiation of the descent from cruise altitude and before reaching 18,000 feet, the PNF should prepare the landing data. This means computing the correspondences between wing configurations and speeds for the projected landing weight. The actual procedure followed depends on the materials available, company policy, and crew preferences.6 For example, many older cockpits use the table in the operations manual (Table 1) and a hard plastic landing data card on which the arrival weather conditions, go-around thrust settings, landing gross weight, and landing speeds are indicated with a grease pencil. Still others use the table in the operations manual and write the speeds on a piece of paper (flight paperwork, printout of destination weather, and so forth). Crews of airplanes equipped with flight management computer systems can look up the approach speeds on a page display of the computer. The MD-80 uses a booklet of speed cards. The booklet contains a page for each weight interval (usually in 2,000 pound increments) with the appropriate speeds permanently printed on the card (Figure 1). The preparation of landing data consists of the following steps:

1. Determine the gross weight of the airplane and select the appropriate card in the speed card booklet. Airplane gross weight on the MD-80 is continuously computed and displayed on the fuel quantity indicator on the center flight instrument panel (Figure 2).
2. Post the selected speed card in a prominent position in the cockpit.
3. Set the speed bugs on both airspeed indicator (ASI) instruments (Figure 3) to match the speeds shown on the speed card. On the instrument depicted in Figure 3, the airspeed is shown both in knots (the black-tipped dial pointer indicating 245) and Mach (the digital indicator showing 0.735). The striped indicator at 348 knots indicates the maximum permissible indicated air speed (IAS). The four black speed bug pointers on the edge of the dial are external to the instrument

6 The procedural account given here has been constructed from in-flight observations, and from analyses of video and audio recordings of crews operating in high fidelity simulators of this and other aircraft. The activities described here are documented further in airIine operations manuais and training manuals, and in the manufacturer’s operational descriptions. Because these manuals and the documentation provided by the Douglas Aircraft company are considered proprietary, the actual sources will not be identified. Additional information came from other published sources, for example, Webb, 1971; Tenney, 1988 and from interviews with pilots. There are minor variations among the operating procedures of various airline companies, but the procedure described here can be taken as representative of this activity.


MANEUVERING FLAPS/SLATS - SPEED: 0/RET - 227; 0/EXT - 177; 11 - 155; 15 - 152; 28 - 142; 40 - 137. VREF: 28/EXT - 132; 40/EXT - 128. 122,000 LBS.

Figure 1. A speed card from the MD-80 speed card booklet

and are manually set by sliding them to the desired positions. The other speed bug (called the "salmon bug" for its orange color) is internal to the instrument and indicates the speed commanded to the flight director, the autothrottle system, or both (it is shown differing from the indicated airspeed by about 2 knots). Starting with the bug at 227 knots and moving counterclockwise, the bugs indicate: 227, the minimum maneuvering speed with no flaps or slats extended; 177, the minimum maneuvering speed with slats, but no flaps, extended; 152, the minimum maneuvering speed with flaps at 15° and slats extended; and 128, the landing speed with flaps at 40° and slats extended (also called Vref). The preparation of the landing data is usually performed about 25 to 30 minutes prior to landing. The speed bugs are set at this time because at this point crew workload is relatively light and the aircraft is near enough to the destination to make accurate projections of landing gross weight. Later in the approach, the crew workload increases dramatically.
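The relationship between gross weight, the speed card, and the bug settings can be sketched as a simple lookup. The 122,000 lb values below are taken from the speed card shown in Figure 1; the 2,000 lb rounding rule and the data structures are assumptions made for illustration only.

# Speed-card lookup and bug setting, sketched with the Figure 1 values.
# The rounding rule and data layout are illustrative assumptions.

SPEED_CARDS = {
    # gross weight (lb) -> (min maneuvering speeds by flap/slat setting, Vref by landing flap)
    122_000: ({"0/RET": 227, "0/EXT": 177, "11": 155, "15": 152, "28": 142, "40": 137},
              {"28/EXT": 132, "40/EXT": 128}),
}

def select_card(gross_weight_lb):
    """Pick the card for the 2,000 lb interval containing the projected weight."""
    key = round(gross_weight_lb / 2000) * 2000
    return SPEED_CARDS[key]

def set_bugs(gross_weight_lb, landing_flaps="40/EXT"):
    minman, vref = select_card(gross_weight_lb)
    bugs = [minman["0/RET"], minman["0/EXT"], minman["15"], vref[landing_flaps]]
    return {"captain_ASI": bugs, "first_officer_ASI": list(bugs)}  # redundant copies

print(set_bugs(121_400))  # both ASIs: [227, 177, 152, 128]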

The Descent

During the descent and the approach, the airplane will be slowed in stages, from cruise speed to final approach speed. Before descending through 10,000


Figure 2. The fuel quantity indicator

feet MSL (mean sea level), the airplane must slow to a speed at or below 250 KIAS (knots indicated air speed). This speed restriction exists primarily to give pilots more time to see and avoid other traffic as the big jets descend into the congested airspace of the terminal area, and into the realm of small, slow, light aircraft which mostly stay below 10,000 feet. At about 7,000 feet AFL (above field level), the crew must begin slowing the airplane to speeds that require slat and flap extension. At this point, they use the previously set external speed bugs on the ASI as indicators of where flap extension configuration changes should be made. Some companies specify crew coordination cross-checking procedures for the initial slat selection. For example, “After initial slat selection (O”/EXT), both pilots will visually verify that the slats have extended to the correct position (slat TAKEOFF light on) before reducing speed below O/RET Min Maneuver speed. . . ” Because it is dangerous to fly below the minimum maneuvering speed for any configuration, extending the flaps and slats well before slowing to the minimum maneuvering speed might seem to be a good idea. Doing so both would increase the safety margin on the speeds and would give the pilots a


Figure 3. Speed bugs. This illustration is modeled on the airspeed indicator instrument in the McDonnell Douglas MD-80, as described by Tenney, 1988.

wider window of speed (and therefore, of time) for selecting the next flap/slat configuration. Unfortunately, other operational considerations rule this out. As one operations manual puts it, "To minimize the air loads on the flaps/slats, avoid extension and operation near the maximum airspeeds. Extend flaps/slats near the Min Maneuver Speed for the flap/slat configuration." The extension of the flaps and slats must be coordinated precisely with the changes in airspeed. This makes accurate memory of the speeds even more important than it would be otherwise. The crew must continue configuration changes as the airplane is slowed further.

The Final Approach

After intercepting the glide slope and beginning the final approach segment, the crew will perform the final approach checklist. One of the elements on this checklist is the challenge/response pair, "Flight instruments and bugs / Set and cross-checked." The PNF reads the challenge. Both pilots check the approach and landing bug positions on their own ASI against the bug position on the other pilot's ASI and against the speeds shown on the speed card. Both crew members will confirm verbally that the bug speeds have been set and cross-checked. For example, the captain (who sits in the left seat) might say, "Set on the


left and cross-checked," whereas the first officer would respond, "Set on the right and cross-checked." A more complete cross-check would include a specification of the actual value (e.g., "One thirty two and one twenty seven set on the left and cross-checked"). At about 1,000 feet AFL, the crew selects the final flap setting of 28° or 40° and maintains the approach speed. At 500 feet AFL, the PNF calls out the altitude, the airspeed relative to the approach airspeed, and the descent rate. For example, "Five hundred feet, plus four, seven down," meaning 500 feet above the field elevation, 4 knots faster than desired approach speed, descending at 700 feet per minute. The PNF also may specify relation to the glide slope, indicating whether the airplane is below, on, or above the glide slope. Once final flaps are set during the final approach segment, the PNF calls out airspeed whenever it varies more than plus or minus 5 knots from approach speed.

A Cognitive Description of Memory for Speeds: Representations and Processes Outside the Pilots

Let us now apply the cognitive science frame to the cockpit as a cognitive system. How are the speeds represented in the cockpit? How are these representations transformed, processed, and coordinated with other representations in the descent, approach, and landing? How does the cockpit system remember the speeds at which it is necessary to change the configuration of the wing in order to maintain safe flight? The observable representations directly involved in the cockpit processes that coordinate airspeed with flap and slat settings are: the gross weight display (Figure 2), the speed card booklet (Figure 1), the two airspeed indicator instruments with internal and external bugs (Figure 3), the speed select window of the flight guidance control panel, and the speed-related verbal exchanges among the members of the crew. The speed-related verbalizations may appear in the communication of the values from PNF to PF while setting the speed bugs, in the initial slat extension cross-check, in the subsequent configuration changes, in the cross-check phase of the before-landing checklist performance, in the PNF's approach progress report at 500 feet AFL, and in any required speed deviation callouts on the final approach segment after the selection of the landing flap setting. In addition to the directly observable media listed earlier, we may also assume that some sort of representation of the speeds has been created in two media that are not directly observable: the memories of the two pilots themselves. Later, we will consider in detail the task environment in which these memories may form. For now, let us simply note that these mental memories are additional media in the cockpit system, which may support and retain internal representations of any of the available external representations of the speeds.


Accessing the Speeds and Setting the Bugs The speed card booklet is a long-term memory in the cockpit system. It stores a set of correspondences between weights and speeds that are functionally durable, in that they are applicable over the entire operating life of the airplane. The weight/speed correspondences represented in the printed booklet are also physically durable, in that short of destroying the physical medium of the cards, the memory is nonvolatile and cannot be corrupted. This memory is not changed by any crew actions. (It could be misplaced, but there is a backup in the form of the performance tables in the operating manual). The appropriate speeds for the airplane are determined by bringing the representation of the airplane gross weight into coordination with the structure of the speed card booklet. The gross weight is used as a filter on this written memory, making one set of speeds much more accessible than any other. The outcome of the filtering operation is imposed on the physical configuration of the speed card booklet by arranging the booklet such that the currently appropriate speed card is the only one visible. Once performed, the filtering need not be done again during the flight. The physical configuration of the booklet produced by opening it to the correct page becomes a representation of the cockpit system’s memory for both the projected gross weight and the appropriate speeds. That is, the questions, “Which gross weight did we select?” and “What are the speeds for the selected weight?“, can both be answered by reading the visible speed card. The correspondence of a particular gross weight to a particular set of speeds is built into the physical structure of each card by printing the corresponding weight and speed values on the same card. This is a simple but effective way to produce the computation of the speeds, because selecting the correct weight can’t help but select the correct speeds. Posting the appropriate speed card where it can be seen easily, by both pilots creates a distribution (across social space) of access to information in the system that may have important consequences for several kinds of subsequent processing. Combined with a distribution of knowledge that results from standardized training and experience, this distribution of access to information supports the development of redundant storage of the information and redundant processing. Also, it creates a new trajectory by which speed-relevant information may reach the PF. Furthermore, posting the speed card provides a temporally enduring resource for checking and cross checking speeds, so that these tasks can be done (or redone) any time. And because the card shows both a set of speeds and the weight for which the speeds are appropriate, it also provides a grounds for checking the posted gross weight against the displayed gross weight on the fuel quantity panel (Figure 2), which is just a few inches above the normal posting position of the speed card. This is very useful in deciding whether the wrong weight, and therefore, the wrong speeds, may have been selected.
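The cross-checks that the posted speed card makes cheap can be sketched as a few comparisons: the card's printed weight against the displayed gross weight, and each pilot's bug settings against the card. The tolerance value and the data format are assumptions for illustration.

# Consistency checks supported by the posted speed card. The tolerance
# and list-based bug representation are illustrative assumptions.

def cross_check(card_weight_lb, displayed_weight_lb, card_bugs,
                captain_bugs, first_officer_bugs, weight_tolerance_lb=2000):
    problems = []
    if abs(card_weight_lb - displayed_weight_lb) > weight_tolerance_lb:
        problems.append("posted card may be for the wrong gross weight")
    if captain_bugs != card_bugs:
        problems.append("captain's bugs disagree with the speed card")
    if first_officer_bugs != card_bugs:
        problems.append("first officer's bugs disagree with the speed card")
    return problems or ["set and cross-checked"]

print(cross_check(122_000, 121_400, [227, 177, 152, 128],
                  [227, 177, 152, 128], [227, 177, 152, 132]))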


In addition to creating a representation of the appropriate speeds in the configuration of the speed card booklet, the PNF creates two other representations of the same information: the values are represented as spoken words when the PNF tells the PF what the speeds are, and the speeds are represented on the airspeed indicator in the positions of the speed bugs. By announcing aloud the values to be marked, the PNF both creates yet another representation of the speeds, and notifies the PF that the activity of setting the speed bugs should commence at this time. Unlike the printed speed card, the verbal representation is ephemeral, that is, it won’t endure over time. If it is to be processed, it must be attended to at the time it is created. The required attending to can be handled by auditory rather than visual resources. The latter often are overtaxed, whereas the former often are underutilized in the cockpit.’ By reading back the values heard, the PF creates yet another representation that allows the PNF to check on the values being used by the PF to set the PF’s bugs. The PF may make use of any of the representations the PNF has prepared in order to create a representation of the bug speeds on the PF’s airspeed indicator. The spoken representation and the speed card provide the PF’s easiest access to the values, although it is also possible for the PF to read the PNF’s airspeed indicator. Because all of these representations are available simultaneously, there are multiple opportunities for consistency checks in the system of distributed representation. When the pilots set the speed bugs, the values that were listed in written form on the speed card, and were represented in spoken form by the PNF, are re-represented as marked positions adjacent to values on the scale of the airspeed indicator (ASI). Because there are two ASI’s, this is a redundant representation in the cockpit system. In addition, it provides a distribution of access to information that will be taken advantage of in later processes. The external speed bug settings capture a regularity in the environment that is of a shorter time scale than the weight/speed correspondences that are represented in the speed card booklet. The speed bug settings are a memory that is malleable, and that fits a particular period of time (this approach). Because of the location of the AS1 and the nature of the bugs, this representation is quite resistant to disruption by other activities. Using the Configuration

Using the Configuration Change Bugs

The problem to be solved is the coordination of the wing configuration changes with the changes in airspeed as the airplane slows to maneuver for the approach. The location of the airplane in the approach and/or the instructions received from ATC determine the speed to be flown at any point in the approach. The cockpit system must somehow construct and maintain an appropriate relationship between airspeed and slat/flap configuration. The information path that leads from indicated airspeed to flap/slat configuration includes several observable representations in addition to the speed bugs. The airspeed is displayed on the ASI by the position of the airspeed indicator needle. Thus, as the ASI needle nears the speed bug that represents the clean-configuration minimum maneuvering speed, the pilot flying can call for "Flaps 0." The spoken flap/slat setting name is coordinated with the labels on the flap handle quadrant. That is, the PNF positions the flap handle adjacent to the label that matches (or is equivalent to) the flap/slat setting name called by the PF. Movement of the flap handle then actuates the flaps and slats themselves, which produce the appropriate wing configurations for the present speed. The speed bugs contribute to this process by providing the bridge between the indicated airspeed and the name of the appropriate flap/slat configuration for the aircraft at its present gross weight. The cockpit procedures of some airlines require that the configuration that is produced by the initial extension of slats be verified by both crew members (by reference to an indicator on the flight instrument panel) before slowing below the clean MinMan speed. This verification activity provides a context in which disagreements between the settings of the first speed bug on the two ASIs can be discovered. Also, it may involve a consultation with the speed card by either pilot to check the MinMan speed, or even a comparison of the weight indicated by the selected speed card and the airplane gross weight as displayed on the fuel quantity panel. The fact that these other checks are so easy to perform with the available resources highlights the fact that the physical configuration of the speed card is both a memory for speed and a memory for a decision that was made earlier in the flight about the appropriate approach speed. Any of these activities may also refresh either pilot's internal memory for the speeds or the gross weight. The depth of the processing engaged in here, that is, how many of these other checks are performed, may depend on the time available and the pilots' sense about whether or not things are going well. Probably it is not possible to predict how many other checks may be precipitated by this mandated cross check, but it is important to note that several are possible and may occur at this point. When the pilot flying calls for a configuration change, the PNF can, and should, verify that the speed is appropriate for the commanded configuration change. The mandated division of labor, in which the PF calls for the flap setting and the PNF actually selects it by moving the flap handle, permits the PF to keep hands on the yoke and throttles during the flap extension. This facilitates airplane control because changes in pitch attitude normally occur during flap extension. It is likely that facilitating control was the original justification for this procedure. However, this division of labor
also has a very attractive system-level cognitive side effect in that it provides for additional redundancies in checking the bug settings and the correspondences between speeds and configuration changes.

Using the Salmon Bug

On the final approach, the salmon bug provides the speed reference for both pilots, as both have speed-related tasks to perform. The spatial relation between the ASI needle and the salmon bug provides the pilots with an indication of how well the airplane is tracking the speed target, and may give indications of the effects on airspeed of pitch changes input by the crew (or by other autoflight systems in tracking the glide slope during a coupled approach) or of local weather conditions, such as windshear. The salmon bug is also the reference against which the PNF computes the deviation from target speed. The PNF must make the mandatory call out at 500 feet AFL, as well as any other call outs required if the airspeed deviates by more than five knots from the target approach speed. In these call outs, the trajectory of task-relevant representational state is from the relationship between the ASI needle and the salmon bug, to a verbalization by the PNF directed to the PF. Because the final approach segment is visually intensive for the PF, the conversion of the airspeed information from the visual into the auditory modality by the PNF permits the PF access to this important information without requiring the allocation of precious visual resources to the ASI.
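The PNF's call-out duty described above reduces to a small decision rule. In the sketch below only the five-knot tolerance and the mandatory 500-foot call come from the text; the function name, argument names, and phrasings are illustrative assumptions.

    # Sketch of the PNF's speed call-outs on final approach. Only the 5-knot
    # tolerance and the mandatory 500-ft AFL call come from the text.
    def speed_callouts(indicated_airspeed, target_speed, crossing_500_afl=False):
        """Return the call-outs the PNF owes the PF at this moment."""
        calls = []
        if crossing_500_afl:
            calls.append("five hundred feet")      # mandatory call at 500 ft AFL
        deviation = indicated_airspeed - target_speed
        if deviation > 5:
            calls.append(f"{deviation:.0f} knots fast")
        elif deviation < -5:
            calls.append(f"{-deviation:.0f} knots slow")
        return calls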

Summary of Representations and Processes Outside the Pilot

Setting the speed bugs is a matter of producing a representation in the cockpit environment that will serve as a resource that organizes performances that are to come later. This structure is produced by bringing representations into coordination with one another (the gross weight readout, the speed card, the verbalizations, and so forth) and will provide the representational state (relations between speed bug locations and ASI needle positions) that will be coordinated with other representations (names for flap positions, flap handle quadrant labels, flap handle positions, and so forth) ten to fifteen minutes later, when the airplane begins slowing down. I call this entire process a cockpit system's "memory" because it consists of the creation, inside the system, of a representational state that is then saved and used to organize subsequent activities.

A Cognitive Description of Memory for Speeds: Representations and Processes Inside the Pilots

Having described the directly observable representational states involved in the memory for speeds in the cockpit system during the approach, we ask of that same cycle of activity, "What are the cognitive tasks facing the pilots?"

The description of transformations of the representational state in the previous section is both a description of how the system processes information and a specification of cognitive tasks facing individual pilots. It is, in fact, a better cognitive task specification than can be had by simply thinking in terms of procedural descriptions. The task specification is detailed enough, in some cases, to put constraints on the kinds of representations and processes that the individuals must use. In much of the cockpit's remembering, significant functions are achieved by a person interpreting material symbols, rather than by a person recalling those symbols from his or her memory. So we must go beyond looking for things that resemble our expectations about human memory to understand the phenomena of memory in the cockpit as a cognitive system.

Computing the Speeds and Setting the Bugs

The speeds are computed by pattern matching on the airplane gross weight and the weights provided on the cards. The pilots don't have to remember what the weights are that appear on the cards. It is necessary only to find the place of the indicated gross weight value in the cards that are provided. However, repeated exposure to the cards may lead to implicit learning of the weight intervals, and whatever such knowledge does develop may be a resource in selecting the appropriate speed card for any given gross weight. With experience, pilots may develop internal structures to coordinate with predictable structure in the task environment. Once the appropriate card has been selected, the values must be read from the card. Several design measures have been taken to facilitate this process. Frequently used speeds appear in a larger font size than do infrequently used speeds, and there is a box around the Vref speeds to help pilots find these values (Wickens & Flach, 1988). Reading is probably an overlearned skill for most pilots. Still, there is a need for working memory: transposition errors are probably the most frequent sort of error committed in this process (Norman, 1991; Wickens & Flach, 1988). Setting any single speed bug to a particular value requires the pilot to hold the target speed in memory, read the speed scale, locate the target speed on the speed scale (a search similar to that for weight in the speed card booklet), and then manually move the speed bug to the scale position. Because not all tick marks on the speed scale have printed values adjacent to them, some interpolation, or counting of ticks, also is required. Coordinating reading the speeds with setting the bugs is more complicated. The actions of reading and setting may be interleaved in many possible orders. One could read each speed before setting it, or read several speeds, retain them in memory, and then set them one by one. Other sequences are also possible. The demands on working memory will depend on the strategy chosen. If several speeds are to be remembered and then set, they may be rehearsed to maintain the memory. Such a memory is vulnerable to interference from other tasks in the same modality (Wickens & Flach, 1988), and the breakdown of such a memory may lead to a shift to a strategy that has less demanding memory requirements. The activities involved in computing the bug speeds and re-representing them in several other media may permit them to be represented in a more enduring way in the memory of the PNF. Similarly, hearing the spoken values, possibly reading them from the landing data card, and setting them on the airspeed indicator may permit a more enduring representation of the values to form in the memory of the PF. Lacking additional evidence, we cannot know the duration or quality of these memories. But we know from observation that there are ample opportunities for rehearsals and associations of the rehearsed values with representations in the environment.
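The scale-reading step described above, in which a target speed must be located on a dial where only some tick marks carry printed values, can be made concrete with a small interpolation sketch. The labeled ticks and angular layout below are invented for illustration and are not those of any actual airspeed indicator.

    # Sketch of locating a target speed on a dial scale where only some ticks
    # carry printed values, so interpolation (or tick counting) is needed.
    # Tick labels and angles are assumptions, not instrument data.
    LABELED_TICKS = {100: 0.0, 150: 90.0, 200: 180.0, 250: 270.0}  # knots -> degrees

    def bug_position(target_knots):
        """Interpolate the dial angle at which to place a speed bug."""
        labels = sorted(LABELED_TICKS)
        for low, high in zip(labels, labels[1:]):
            if low <= target_knots <= high:
                fraction = (target_knots - low) / (high - low)
                a_low, a_high = LABELED_TICKS[low], LABELED_TICKS[high]
                return a_low + fraction * (a_high - a_low)
        raise ValueError("target speed is off the labeled portion of the scale")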

Using the Configuration Change Bugs

The airspeed indicator needle moves counter-clockwise as the airplane slows. Because the airspeed scale represents speed as spatial position and numerical relations as spatial relations, the airspeed bugs segment the face of the ASI into regions that can be occupied by the ASI needle. The relation of the ASI needle to the bug positions is thus constructed as the location of the airplane's present airspeed in a space of speeds. The bugs are also associated with particular flap/slat setting names (e.g., 0°/RET, 15°/EXT, and so forth), so the regions on the face of the ASI have meaning both as speed regimes and as locations for flap/slat setting names. Once the bugs have been set, the pilots do not simply take in sensory data from the ASI; rather, the pilots impose additional meaningful structure on the image of the ASI. They use the bugs to define regions of the face of the ASI, and they associate particular meanings with those regions (Figure 4).

Figure 4. Meaningful regions of the airspeed indicator face. The pilots "see" regions of the airspeed indicator scale as having meanings in terms of the configurations required to fly the airplane at the speeds in each region.

The coordination of speed with wing configuration is achieved by superimposing representations of wing configuration and representation of speed on the same instrument. Once the bugs are set, it is not necessary actually to read the scale values where they are placed. It is necessary, however, to remember the meanings of each of the bugs with respect to names for flap/slat configurations. Since the regions of the speed scale that are associated with each configuration are not permanently marked on the jet ASI, the pilot must construct the meanings of the regions in the act of "seeing" the ASI with bugs as a set of meaningful regions. Speed bugs are part of what Luria called a functional system (Luria, 1979). It is a constellation of structures, some of them internal to the human actors, some external, involved in the performance of some invariant task. It is commonplace to refer to the speed bug as a memory aid (Norman, 1991; Tenney, 1988). Speed bugs are said to help the pilot remember the critical speeds. But now that we have looked at how speed bugs are set up and how they are used, it is not clear that they contribute to the pilot's memory at all. The functional system of interest here is the one that controls the coordination of airspeeds with wing configurations. It is possible to imagine a functional system without speed bugs, in which pilots are required to read the speeds, remember the speeds, remember which configuration change goes with each speed, read the scale, and so forth. Adding speed bugs to the system does nothing to alter the memory of the pilots, but it does permit a different set of processes to be assembled into a functional system that achieves the same results as the system without speed bugs. In the functional system with speed bugs, some of the memory requirements for the pilot are reduced. What was accomplished without speed bugs by remembering speed values, reading the ASI needle values, and comparing the two values is accomplished with the use of speed bugs by judgments of spatial proximity. Individual pilot memory has not been enhanced; rather, the memory function has now become a property of a larger system in which the individual engages in a different sort of cognitive behavior. The beauty of devices like speed bugs is that they permit these reconfigurations of functional systems in ways that
reduce the requirements for scarce cognitive resources. To call speed bugs a "memory aid" for the pilots is to mistake the cognitive properties of the reorganized functional system for the cognitive properties of one of its human components. Speed bugs do not help pilots remember speeds; rather, they are part of the process by which the cockpit system remembers speeds.

Using the Salmon Bug

Without a speed bug, on final approach the PF must remember the approach speed, read the airspeed indicator scale to find the remembered value of the approach speed on the airspeed indicator scale, and compare the position of the ASI needle on the scale with the position of the approach speed on the scale. With the salmon bug set, the pilot no longer needs to read the airspeed indicator scale. He or she simply looks to see whether or not the indicator needle is lined up with the salmon bug. Thus, a memory and scale-reading task is transformed into a judgment of spatial adjacency. It is important to make these tasks as simple as possible because there are many other things the pilot must do on the final approach. The pilot must continue monitoring the airspeed while also monitoring the glide path and runway alignment of the aircraft. Deviations in any of these may require corrective actions. In making the required speed call outs, the PNF uses the salmon bug in a way similar to the way the PF does. To determine the numerical relation between the indicated speed and the setting of the salmon bug, the PNF could use mental arithmetic and subtract the current speed from the value of Vref. This is the sort of cognitive task we imagine might face the crew if we simply examined the procedural description. A less obvious, but equally effective, method is to use the scale of the ASI as a computational medium. The base of the salmon bug is about ten knots wide in the portion of the speed scale relevant to maneuvering for approach and landing. To determine if the current speed is within 5 knots of the target, one need only see if the airspeed pointer is pointing at any part of the body of the salmon bug. This strategy permits a conceptual task to be implemented by perceptual processes. Having determined the deviation from target speed, the PNF calls it out to the PF. Notice the role of the representation of information. Twice in this example, a change in the nature of the representation of information results in a change in the nature of the cognitive task facing the pilot. In the first case, the speed bug itself permits a simple judgment of spatial proximity to be substituted for a scale-reading operation. In the second case, the PNF further transforms the task facing the PF from a judgment of spatial proximity (requiring scarce visual resources) into a task of monitoring a particular aural cue (a phrase like "five knots fast"). Notice also that the change in the task for the pilot flying changes the kinds of internal knowledge structures that must be brought into play in order to decide on an appropriate action.
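The equivalence noted here between mental arithmetic and a purely perceptual check can be stated compactly. In the sketch below, only the roughly ten-knot width of the salmon bug's base and the five-knot tolerance come from the text; everything else is illustrative.

    # Two equivalent ways to decide "within 5 knots of target", following the
    # text: arithmetic on the numbers, or a spatial check that the pointer
    # lies somewhere on the salmon bug's roughly ten-knot-wide base.
    BUG_BASE_WIDTH_KNOTS = 10   # from the text: the base spans about ten knots

    def within_tolerance_arithmetic(indicated, vref):
        return abs(indicated - vref) <= 5

    def within_tolerance_spatial(pointer_speed, bug_center_speed):
        # "Is the pointer touching the body of the bug?" A judgment of
        # spatial adjacency standing in for the subtraction above.
        half_width = BUG_BASE_WIDTH_KNOTS / 2
        return (bug_center_speed - half_width
                <= pointer_speed
                <= bug_center_speed + half_width)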

The Pilot's Memory for Speeds

Memory is normally thought of as a psychological function internal to the individual. However, memory tasks in the cockpit may be accomplished by functional systems which transcend the boundaries of the individual actor. Memory processes may be distributed among human agents, or between human agents and external representational devices. In some sense, the speeds are being remembered by the crew, partly, I suspect, in the usual sense of individual internal memory. But the speeds are also being read, written, and compared to other speeds in many representations. They are being compared to long-term memories for the typical or expected speeds for a plane of this specific weight. The comparison might be in terms of numbers; that is, "Is 225 KIAS a fast or a slow speed for initial flap extension?" The comparison could also take place in terms of the number in the pilot's head, or on the landing data card, or on the position of the first bug on the airspeed indicator, or all of these together. In this setting, the pilot's memory of these speeds may be a richly interwoven fabric of interaction with many representations that seem superficial or incomplete compared to the compact localized internal memory for which cognitive scientists usually look. The memory observed in the cockpit is a continual interaction with a world of meaningful structure. The pilots continually are reading and writing, reconstituting and reconstructing the meaning and the organization of both the internal and the external representations of the speeds. It is not just the retrieval of something from an internal storehouse, and not just a recognition or a match of an external form to an internally stored template. It is, rather, a combination of recognition, recall, pattern matching, cross-modality consistency checking, construction, and reconstruction that is conducted in interaction with a rich set of representational structures, many of which permit, but do not demand, the reconstruction of some internal representation that we would normally call the "memory" for the speed. In the cockpit's memory for speeds, we see many examples of opportunistic use of structure in the environment. Some of these were never anticipated by designers. Using the width of the salmon bug as a yardstick in local speed space is a wonderful example. The engineer who wrote the specifications for the airspeed indicator in the Boeing 757/767 reported to me that the width of the base of the command airspeed pointer (salmon bug) is not actually spelled out in the specifications. The width of the tip of the pointer is explicitly specified, but the width of the base is not. On engineering drawings, the base is shown fitting just between the large ticks at ten-knot intervals on the scale. The engineers say it has this width so that it will be easy to find but will never obscure more than one large tick mark at a time. If it covered more than one large tick mark, it might make it difficult to interpolate and read speeds. That constraint solves a design problem for the engineers that the pilots never notice (because the difficulty in reading the scale that would
be caused by a wider bug never arises), and provides a bit of structure in the world for the pilots that can be opportunistically exploited to solve an operational problem that the designers never anticipated.

COGNITIVE PROPERTIES OF THE COCKPIT SYSTEM

The task is to control the configuration of the airplane to match the changes in speed required for maneuvering in the approach and landing. The flaps are controlled by positioning the flap handle. The flap handle is controlled by aligning it with written labels for flap positions that correspond to spoken labels produced by the PF. The spoken labels are produced at the appropriate times by speaking the name of the region on the ASI face that the needle is approaching. The regions of the ASI are delimited by the settings of the speed bugs. The names of the regions are produced by the PF through the application of a schema for seeing the dial face. The speed bugs are positioned by placing them in accordance with the speeds listed on the selected speed card. And the speed card is selected by matching the weight printed on the bottom with the weight displayed on the fuel quantity panel. This system makes use of representations in many different media. The media themselves have very different properties. The speed card booklet is a relatively permanent representation. The spoken representation is ephemeral and endures only in its production. The memory is stored ultimately for use in the physical state of the speed bugs. It is represented temporarily in the spoken interchanges, and represented with unknown persistence in the memories of the individual pilots. The pilots' memories clearly are involved, but they operate in an environment where there is a great deal of support for recreating the memory. Speed bugs are involved in a distribution of cognitive labor across social space. The speed bug helps the solo pilot by simplifying the task of determining the relation of present airspeed to Vref, thereby reducing the amount of time required for the pilot's eyes to be on the airspeed indicator during the approach. With multi-pilot crews, the cognitive work of reading the airspeed indicator and monitoring the other instruments on the final approach can be divided among the pilots. The PF can dedicate visual resources to monitoring the progress of the aircraft, whereas the pilot not flying can use visual resources to monitor airspeed and transform the representation of the relation between current airspeed and Vref from a visual to an auditory form. Speed bugs permit a shift in the distribution of cognitive effort across time. They enable the crew to calculate correspondences between speeds and configurations during a low-workload phase of flight, and save the results of that computation for later use. Internal memory also supports this redistribution of effort across time, but notice the different properties of the two kinds of representation; a properly set speed bug is much less likely than a pilot's memory to "forget" its value. The robustness of the physical device as a representation permits the computation of speeds to be moved arbitrarily far in time from the moment of their use and is relatively insensitive to the interruptions, the distractions, and the delays that may disrupt internal memories. This is a surprisingly redundant system. Not only is there redundant representation in memory; there is also redundant processing and redundant checking. The interaction of the representations in the different media gives the overall system the properties it has. This is not to say that knowing about the people is not important, but rather to say that much of what we care about is in the interaction of the people with each other and with physical structure in the environment. The analog ASI display maps an abstract conceptual quantity, speed, onto an expanse of physical space. This mapping of conceptual structure onto physical space allows important conceptual operations to be defined in terms of simple perceptual procedures. Simple internal structures (the meanings of the regions on the dial face defined by the positions of the speed bugs) in interaction with simple and specialized external representations perform powerful computations.
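The chain just described runs from gross weight to card, from card to bug positions, from bug positions to named regions of the dial face, and from a region name to a spoken flap call and a handle position. The fragment below sketches only the middle link, mapping an indicated airspeed to the configuration region it falls in; the bug speeds and configuration names are invented placeholders, not values from any real card.

    import bisect

    # Hypothetical bug list for the selected card: ascending bug speeds paired
    # with the configuration associated with the region below each bug.
    BUGS = [(140, "Flaps 40"), (164, "Flaps 15"), (196, "Flaps 0 / slats extended")]

    def configuration_region(indicated_airspeed):
        """Name the region of the ASI face in which the needle currently sits."""
        speeds = [s for s, _ in BUGS]
        i = bisect.bisect_right(speeds, indicated_airspeed)
        if i == len(BUGS):
            return "clean wing"        # faster than the highest bug
        return BUGS[i][1]              # configuration for the region we are in

    # The spoken flap call is then matched by the PNF against the flap handle
    # quadrant labels; e.g. configuration_region(172) -> "Flaps 0 / slats extended".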

DISCUSSION

The cockpit system remembers its speeds, and the memory process emerges from the activity of the pilots. The memory of the cockpit, however, is not made primarily of pilot memory. A complete theory of individual human memory would not be sufficient to understand that which we wish to understand, because so much of the memory function takes place outside the individual. In some sense, what the theory of individual human memory explains is not how this system works, but why this system must contain so many components that are functionally implicated in cockpit memory, yet are external to the pilots themselves. The speed bug is one of many devices in the cockpit that participate in functional systems which accomplish memory tasks. The altitude alerting system and the many pieces of paper that appear in even the most modern glass cockpit are other examples. The properties of functional systems that are mediated by external representations differ from those that rely exclusively on internal representations, and may depend on the physical properties of the external representational media. Such factors as the endurance of a representation, the sensory modality via which it is accessed, its vulnerability to disruption, and the competition for modality-specific resources may all influence the cognitive properties of such a system. This article presents a theoretical framework that takes a socio-technical system, rather than an individual mind, as its primary unit of analysis. This theory is explicitly cognitive in the sense that it is concerned with how information is represented and how representations are transformed and propagated through the system. Such a theory can provide a bridge between the information processing properties of individuals and the information processing properties of a larger system, such as an airplane cockpit. One of the primary jobs of a theory is to help us look in the right places for answers to questions. This system-level cognitive view directs our attention beyond the cognitive properties of individuals to the properties of external representations and to the interactions between internal and external representations. Technological devices introduced into the cockpit invariably affect the flow of information in the cockpit. They may determine the possible trajectories of information or the kinds of transformations of information structure that are required for propagation. Given the current rapid pace of introduction of computational equipment, these issues are becoming increasingly important.

REFERENCES

Gras, A., Moricot, C., Poirot-Delpech, S., & Scardigli, V. (1991). Le pilote, le contrôleur et l'automate (Réédition du rapport prédéfinition PIRTTEM-CNRS et du rapport final SERT-Ministère des transports ed.). Paris: Editions de L'Iris.
Hutchins, E. (1990). The technology of team navigation. In J. Galegher, R. Kraut, & C. Egido (Eds.), Intellectual teamwork: Social and technical bases of collaborative work. Hillsdale, NJ: Erlbaum.
Hutchins, E. (1991). Organizing work by adaptation. Organization Science, 2, 14-38.
Hutchins, E. (1995). Cognition in the wild. Cambridge, MA: MIT Press.
Luria, A. R. (1979). The making of mind: A personal account of Soviet psychology (M. Cole & S. Cole, Trans.). Cambridge, MA: Harvard University Press.
March, J., & Simon, H. (1958). Organizations. New York: Wiley.
Morgan, G. (1986). Images of organization. Beverly Hills, CA: Sage.
Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice Hall.
Norman, D. A. (1991). Cognitive science in the cockpit. CSERIAC Gateway, 2, 1-6.
Norman, D. A. (1993). Things that make us smart. Reading, MA: Addison Wesley.
Simon, H. A. (1981). The sciences of the artificial (2nd ed.). Cambridge, MA: MIT Press.
Simon, H. A., & Kaplan, C. A. (1989). Foundations of cognitive science. In M. Posner (Ed.), Foundations of cognitive science. Cambridge, MA: MIT Press.
Tenney, D. C. (1988, December). Bug speeds pinpointed by autothrottles mean less jockeying but more thinking. Professional Pilot, pp. 96-99.
Webb, J. (1971). Fly the wing. Ames, IA: Iowa State University Press.
Wickens, C., & Flach, J. (1988). Information processing. In E. Wiener & D. Nagel (Eds.), Human factors in aviation. New York: Academic.

GLOSSARY

ASI: Air speed indicator.
ATC: Air traffic control.
Flap: A panel mounted on the trailing edge of the wing that can be extended to change the shape of the wing and increase its area.
IAS: Indicated air speed. The airspeed determined by the dynamic pressure of the airstream over the airplane. This may be different from true airspeed. It is the speed that is indicated on the ASI.
MinMan speed: The minimum maneuvering speed. A speed at which an airplane has a reasonable margin over a stall given the current configuration. This is usually 1.3 times the stall speed for the configuration.
PF: Pilot flying. The crewmember who is responsible for flying and navigating the airplane.
PNF: Pilot not flying. The crewmember who is responsible for communicating with ATC and operating the airplane's non-flying systems, air conditioning and pressurization, for example.
Slat: A panel mounted on the leading edge of the wing that can be extended to change the shape of the wing and increase its area. Slats are normally extended before flaps.
Vref: The approach reference speed or velocity. This is the target speed for the final approach segment.

4 Studying Context: A Comparison of Activity Theory, Situated Action Models, and Distributed Cognition

Bonnie A. Nardi

It has been recognized that system design will benefit from explicit study of the context in which users work. The unaided individual divorced from a social group and from supporting artifacts is no longer the model user. But with this realization about the importance of context come many difficult questions. What exactly is context? If the individual is no longer central, what is the correct unit of analysis? What are the relations between artifacts, individuals, and the social groups to which they belong? This chapter compares three approaches to the study of context: activity theory, situated action models, and distributed cognition. I consider the basic concepts each approach promulgates and evaluate the usefulness of each for the design of technology.1 A broad range of work in psychology (Leont'ev 1978; Vygotsky 1978; Luria 1979; Scribner 1984; Newman, Griffin, and Cole 1989; Norman 1991; Salomon 1993), anthropology (Lave 1988; Suchman 1987; Flor and Hutchins 1991; Hutchins 1991a; Nardi and Miller 1990, 1991; Gantt and Nardi 1992; Chaiklin and Lave 1993), and computer science (Clement 1990; Mackay 1990; MacLean et al. 1990) has shown that it is not possible to fully understand how people learn or work if the unit of study is the unaided individual with no access to other people or to artifacts for accomplishing the task at hand. Thus we are motivated to study context to understand relations among individuals, artifacts, and social groups. But as human-computer interaction researchers, how can we conduct studies of context that will have value to designers who seek our expertise? Brooks (1991) argues that HCI specialists will be most valuable to designers when we can provide (1) a broad background of comparative understanding over many domains, (2) high-level analyses useful for evaluating the impact of major design decisions, and (3) information that suggests actual designs rather than simply general design guidelines or metrics for evaluation. To be able to provide such expertise, we must develop an appropriate analytical abstraction that ``discards irrelevant details while isolating and emphasizing those properties of artifacts and situations that are most significant for design'' (Brooks, 1991, emphasis added). It is especially difficult to isolate and emphasize critical properties of artifacts and situations in studies that consider a full context because the scope of analysis has been widened to accommodate such holistic breadth. Taking context seriously means finding oneself in the thick of the complexities of particular situations at particular times with particular individuals. Finding commonalities across situations is difficult because studies may go off in so many different directions, making it problematic to provide the comparative understanding across domains that Brooks (1991) advocates. How can we confront the blooming, buzzing confusion that is ``context'' and still produce generalizable research results? This chapter looks at three approaches to the study of context—activity theory, situated action models, and the distributed cognition approach—to see what tools each offers to help manage the study of context.
In particular we look at the unit of analysis proposed by each approach, the categories offered to support a description of context, the extent to which each treats action as structured prior to or during activity, and the stance toward the conceptual equivalence of people and things. Activity theory, situated action models, and distributed cognition are evolving frameworks and will change and grow as each is exercised with empirical study. In this chapter I ask where each approach seems to be headed and what its emphases and perspectives are. A brief overview of each approach to studying context will be given, followed by a discussion of some critical differences among the approaches. An argument is made for the advantages of activity theory as an overall framework while at the same time recognizing the value of situated action models and distributed cognition analyses.

SITUATED ACTION MODELS

Situated action models emphasize the emergent, contingent nature of human activity, the way activity grows directly out of the particularities of a given situation.2 The focus of study is situated activity or practice, as opposed to the study of the formal or cognitive properties of artifacts, or structured social relations, or enduring cultural knowledge and values. Situated action analysts do not deny that artifacts or social relations or knowledge or values are important, but they argue that the true locus of inquiry should be the ``everyday activity of persons acting in [a] setting'' (Lave 1988).3 That this inquiry is meant to take place at a very fine-grained level of minutely observed activities, inextricably embedded in a particular situation, is reflected in Suchman's (1987) statement that ``the organization of situated action is an emergent property of moment-by-moment interactions between actors, and between actors and the environments of their action.'' Lave (1988) identifies the basic unit of analysis for situated action as ``the activity of persons-acting in setting.'' The unit of analysis is thus not the individual, not the environment, but a relation between the two. A setting is defined as ``a relation between acting persons and the arenas in relation with which they act.'' An arena is a stable institutional framework. For example, a supermarket is an arena within which activity takes place. For the individual who shops in the supermarket, the supermarket is experienced as a setting because it is a ``personally ordered, edited version'' of the institution of the supermarket. In other words, each shopper shops only for certain items in certain aisles, depending on her needs and habits. She has thus ``edited'' the institution to match her personal preferences (Lave 1988). An important aspect of the ``activity of persons-acting in setting'' as a unit of analysis is that it forces the analyst to pay attention to the flux of ongoing activity, to focus on the unfolding of real activity in a real setting. Situated action emphasizes responsiveness to the environment and the improvisatory nature of human activity (Lave 1988). By way of illustrating such improvisation, Lave's (1988) ``cottage cheese'' story has become something of a classic. A participant in the Weight Watchers program had the task of fixing a serving of cottage cheese that was to be three-quarters of the two-thirds cup of cottage cheese the program normally allotted.4 To find the correct amount of cottage cheese, the dieter, after puzzling over the problem a bit, ``filled a measuring cup two-thirds full of cheese, dumped it out on a cutting board, patted it into a circle, marked a cross on it, scooped away one quadrant, and served the rest'' (Lave 1988). In emphasizing improvisation and response to contingency, situated action deemphasizes study of more durable, stable phenomena that persist across situations. The cottage cheese story is telling: it is a one-time solution to a one-time problem, involving a personal improvisation that starts and stops with the dieter himself. It does not in any serious way involve the enduring social organization of Weight Watchers or an analysis of the design of an artifact such as the measuring cup. It is a highly particularistic accounting of a single episode that highlights an individual's creative response to a unique situation. Empirical accounts in studies of situated action tend to have this flavor.
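For what it is worth, the arithmetic the dieter sidestepped and the physical procedure he improvised give the same answer; the check below is only an illustration of that equivalence.

    # The serving is three-quarters of two-thirds of a cup. Computed directly:
    arithmetic_answer = (3 / 4) * (2 / 3)      # 0.5 cup
    # The dieter's procedure: measure out 2/3 cup, pat it into a circle,
    # remove one of four quadrants, keep the rest:
    procedure_answer = (2 / 3) * (1 - 1 / 4)   # also 0.5 cup
    assert abs(arithmetic_answer - procedure_answer) < 1e-9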
Lave (1988) provides detailed descriptions of grocery store activity such as putting apples into bags, finding enchiladas in the frozen food section, and ascertaining whether packages of cheese are mispriced. Suchman (1987) gives a detailed description of experiments in which novices tried to figure out how to use the double-sided copy function of a copier. Suchman and Trigg (1991) describe the particulars of an incident of the use of a baggage- and passenger-handling form by airport personnel. These analyses offer intricately detailed observations of the temporal sequencing of a particular train of events rather than being descriptive of enduring patterns of behavior across situations. A central tenet of the situated action approach is that the structuring of activity is not something that precedes it but can only grow directly out of the immediacy of the situation (Suchman 1987; Lave 1988). The insistence on the exigencies of particular situations and the emergent, contingent character of action is a reaction to years of influential work in artificial intelligence and cognitive science in which ``problem solving'' was seen as a ``series of objective, rational pre-specified means to ends'' (Lave 1988) and work that overemphasized the importance of plans in shaping behavior (Suchman 1987). Such work failed to recognize the opportunistic, flexible way that people engage in real activity. It failed to treat the environment as an important shaper of activity, concentrating almost exclusively on representations in the head—usually rigid, planful ones—as the object of study. Situated action models provide a useful corrective to these restrictive notions that put research into something of a cognitive straitjacket. Once one looks at real behavior in real situations, it becomes clear that rigid mental representations such as formulaic plans or simplistically conceived ``rational problem

solving'' cannot account for real human activity. Both Suchman (1987) and Lave (1988) provide excellent critiques of the shortcomings of the traditional cognitive science approach. ACTIVITY THEORY Of the approaches examined in this chapter, activity theory is the oldest and most developed, stretching back to work begun in the former Soviet Union in the 1920s. Activity theory is complex and I will highlight only certain aspects here. (For summaries see Leont'ev 1974; Bødker 1989; and Kuutti 1991; for more extensive treatment see Leont'ev 1978; Wertsch 1981; Davydov, Zinchenko, and Talyzina 1982; and Raeithel 1991.) This discussion will focus on a core set of concepts from activity theory that are fundamental for studies of technology. In activity theory the unit of analysis is an activity. Leont'ev, one of the chief architects of activity theory, describes an activity as being composed of subject, object, actions, and operations (1974). A subject is a person or a group engaged in an activity. An object (in the sense of ``objective'') is held by the subject and motivates activity, giving it a specific direction. ``Behind the object,'' he writes, ``there always stands a need or a desire, to which [the activity] always answers.'' Christiansen (this volume) uses the term ``objectified motive,'' which I find a congenial mnemonic for a word with as many meanings in English as ``object.'' One might also think of the ``object of the game'' or an ``object lesson.'' Actions are goal-directed processes that must be undertaken to fulfill the object. They are conscious (because one holds a goal in mind), and different actions may be undertaken to meet the same goal. For example, a person may have the object of obtaining food, but to do so he must carry out actions not immediately directed at obtaining food.... His goal may be to make a hunting weapon. Does he subsequently use the weapon he made, or does he pass it on to someone else and receive a portion of the total catch? In both cases, that which energizes his activity and that to which his action is directed do not coincide (Leont'ev 1974). Christiansen (this volume) provides a nice example of an object from her research on the design of the information systems used by Danish police: ``[The detective] expressed as a vision for [the] design [of his software system] that it should be strong enough to handle a `Palme case,' referring to the largest homicide investigation known in Scandinavia, when the Swedish prime minister Oluf Palme was shot down on a street in Stockholm in 1986!'' This example illustrates Raeithel and Velichkovsky's depiction of objects as actively ``held in the line of sight.'' ... the bull's eye of the archer's target, which is the original meaning of the German word Zweck (``purpose''), for example, is a symbol of any future state where a real arrow hits near it. Taking it into sight, as the desired ``end'' of the whole enterprise, literally causes this result by way of the archer's action-coupling to the physical processes that let the arrow fly and make it stop again (Raeithel and Velichkovsky, this volume). Thus, a system that can handle a ``Palme case'' is a kind of bull's eye that channels and directs the detective's actions as he designs the sofware system that he envisions. Objects can be transformed in the course of an activity; they are not immutable structures. 
As Kuutti (this volume) notes, ``It is possible that an object itself will undergo changes during the process of an activity.'' Christiansen (this volume) and Engeström and Escalante (this volume) provide case studies of this process. Objects do not, however, change on a moment-by-moment basis. There is some stability over time, and changes in objects are not trivial; they can change the nature of an activity fundamentally (see, for example, Holland and Reeves, this volume). Actions are similar to what are often referred to in the HCI literature as tasks (e.g., Norman 1991). Activities may overlap in that different subjects engaged together in a set of coordinated actions may have multiple or conflicting objects (Kuutti 1991). Actions also have operational aspects, that is, the way the action is actually carried out. Operations become routinized and unconscious with practice. When learning to drive a car, the shifting of the gears is an action with an explicit goal that must be consciously attended to. Later, shifting gears becomes operational and ``can no longer be picked out as a special goal-directed process: its goal is not picked out

and discerned by the driver; and for the driver, gear shifting psychologically ceases to exist'' (Leont'ev 1974). Operations depend on the conditions under which the action is being carried out. If a goal remains the same while the conditions under which it is to be carried out change, then ``only the operational structure of the action will be changed'' (Leont'ev 1974). Activity theory holds that the constituents of activity are not fixed but can dynamically change as conditions change. All levels can move both up and down (Leont'ev 1974). As we saw with gear shifting, actions become operations as the driver habituates to them. An operation can become an action when ``conditions impede an action's execution through previously formed operations'' (Leont'ev 1974). For example, if one's mail program ceases to work, one continues to send mail by substituting another mailer, but it is now necessary to pay conscious attention to using an unfamiliar set of commands. Notice that here the object remains fixed, but goals, actions, and operations change as conditions change. As Bødker (1989) points out, the flexibility recognized by activity theory is an important distinction between activity theory and other frameworks such as GOMS. Activity theory ``does not predict or describe each step in the activity of the user (as opposed to the approach of Card, Moran and Newell, 1983)'' as Bødker (1989) says, because activity theory recognizes that changing conditions can realign the constituents of an activity. A key idea in activity theory is the notion of mediation by artifacts (Kuutti 1991). Artifacts, broadly defined to include instruments, signs, language, and machines, mediate activity and are created by people to control their own behavior. Artifacts carry with them a particular culture and history (Kuutti 1991) and are persistent structures that stretch across activities through time and space. As Kaptelinin (chapter 3, this volume) points out, recognizing the central role of mediation in human thought and behavior may lead us to reframe the object of our work as ``computer-mediated activity,'' in which the starring role goes to the activity itself rather than as ``human-computer interaction'' in which the relationship between the user and a machine is the focal point of interest. Activity theory, then, proposes a very specific notion of context: the activity itself is the context. What takes place in an activity system composed of object, actions, and operation, is the context. Context is constituted through the enactment of an activity involving people and artifacts. Context is not an outer container or shell inside of which people behave in certain ways. People consciously and deliberately generate contexts (activities) in part through their own objects; hence context is not just ``out there.'' Context is both internal to people—involving specific objects and goals—and, at the same time, external to people, involving artifacts, other people, specific settings. The crucial point is that in activity theory, external and internal are fused, unified. In Zinchenko's discussion of functional organs (this volume) the unity of external and internal is explored (see also Kaptelinin, this volume, chapters 3 and 5). Zinchenko's example of the relationship between Rostropovich and his cello (they are inextricably implicated in one another) invalidates simplistic explanations that divide internal and external and schemes that see context as external to people. 
People transform themselves profoundly through the acquisition of functional organs; context cannot be conceived as simply a set of external ``resources'' lying about. One's ability—and choice—to marshall and use resources is, rather, the result of specific historical and developmental processes in which a person is changed. A context cannot be reduced to an enumeration of people and artifacts; rather the specific transformative relationship between people and artifacts, embodied in the activity theory notion of functional organ, is at the heart of any definition of context, or activity. DISTRIBUTED COGNITION The distributed cognition approach (which its practitioners refer to simply as distributed cognition, a convention I shall adopt here) is a new branch of cognitive science devoted to the study of: the representation of knowledge both inside the heads of individuals and in the world ...; the propagation of knowledge between different individuals and artifacts ...; and the transformations which external structures undergo when operated on by individuals and artifacts.... By studying cognitive phenomena in this fashion it is hoped that an understanding of how intelligence is manifested at the systems level, as opposed to the individual cognitive level, will be obtained. (Flor and Hutchins 1991) Distributed cognition asserts as a unit of analysis a cognitive system composed of individuals and the artifacts they use (Flor and Hutchins 1991; Hutchins 1991a). The cognitive system is something like what

activity theorists would call an activity; for example, Hutchins (1991a) describes the activity of flying a plane, focusing on ``the cockpit system.'' Systems have goals; in the cockpit, for example, the goal is the ``successful completion of a flight.''5 Because the system is not relative to an individual but to a distributed collection of interacting people and artifacts, we cannot understand how a system achieves its goal by understanding ``the properties of individual agents alone, no matter how detailed the knowledge of the properties of those individuals might be'' (Hutchins 1991a). The cockpit, with its pilots and instruments forming a single cognitive system, can be understood only when we understand, as a unity, the contributions of the individual agents in the system and the coordination necessary among the agents to enact the goal, that is, to achieve ``the successful completion of a flight.'' (Hutchins 1994 studies shipboard navigation and makes similar points.) Thus distributed cognition moves the unit of analysis to the system and finds its center of gravity in the functioning of the system, much as classic systems theory did (Weiner 1948; Ashby 1956; Bertalanffy 1968). While a distributed cognition analyst would probably, if pushed, locate system goals in the minds of the people who are part of the system, the intent is to redirect analysis to the systems level to reveal the functioning of the system itself rather than the individuals who are part of the system. Practitioners of distributed cognition sometimes refer to the ``functional system'' (instead of the ``cognitive system'') as their central unit of analysis (Hutchins 1994; Rogers and Ellis 1994), hinting at an even further distance from the notion of the individual that the term cognitive cannot help but suggest. Distributed cognition is concerned with structure—representations inside and outside the head—and the transformations these structures undergo. This is very much in line with traditional cognitive science (Newell and Simon 1972) but with the difference that cooperating people and artifacts are the focus of interest, not just individual cognition ``in the head.'' Because of the focus on representations—both internal to an individual and those created and displayed in artifacts—an important emphasis is on the study of such representations. Distributed cognition tends to provide finely detailed analyses of particular artifacts (Norman 1988; Norman and Hutchins 1988; Nardi and Miller 1990; Zhang 1990; Hutchins 1991a, Nardi et al. 1993) and to be concerned with finding stable design principles that are widely applicable across design problems (Norman 1988, 1991; Nardi and Zarmer 1993). The other major emphasis of distributed cognition is on understanding the coordination among individuals and artifacts, that is, to understand how individual agents align and share within a distributed process (Flor and Hutchins 1991; Hutchins 1991a, 1991b; Nardi and Miller 1991). For example, Flor and Hutchins (1991) studied how two programmers performing a software maintenance task coordinated the task among themselves. Nardi and Miller (1991) studied the spreadsheet as a coordinating device facilitating the distribution and exchange of domain knowledge within an organization. In these analyses, shared goals and plans, and the particular characteristics of the artifacts in the system, are important determinants of the interactions and the quality of collaboration. 
DIFFERENCES BETWEEN ACTIVITY THEORY, SITUATED ACTION MODELS, AND DISTRIBUTED COGNITION

All three frameworks for analyzing context that we have considered are valuable in underscoring the need to look at real activity in real situations and in squarely facing the conflux of multifaceted, shifting, intertwining processes that comprise human thought and behavior. The differences in the frameworks should also be considered as we try to find a set of concepts with which to confront the problem of context in HCI studies.

The Structuring of Activity

An important difference between activity theory and distributed cognition, on the one hand, and situated action, on the other hand, is the treatment of motive and goals. In activity theory, activity is shaped first and foremost by an object held by the subject; in fact, we are able to distinguish one activity from another only by virtue of their differing objects (Leont'ev 1974; Kozulin 1986; Kuutti 1991, this volume). Activity theory emphasizes motivation and purposefulness and is ``optimistic concerning human self-determination'' (Engeström 1990). A distributed cognition analysis begins with the positing of a system goal, which is similar to the activity theory notion of object, except that a system goal is an abstract systemic concept that does not involve individual consciousness.

Attention to the shaping force of goals in activity theory and distributed cognition, be they conscious human motives or systemic goals, contrasts with the contingent, responsive, improvisatory emphasis of situated action. In situated action, one activity cannot be distinguished from another by reference to an object (motive); in fact Lave (1988) argues that ``goals [are not] a condition for action.... An analytic focus on direct experience in the lived-in world leads to ... the proposition that goals are constructed, often in verbal interpretation'' (emphasis in original). In other words, goals are our musings out loud about why we did something after we have done it; goals are ``retrospective and reflexive'' (Lave 1988). In a similar vein, Suchman (1987), following Garfinkel (1967), asserts that ``a statement of intent generally says very little about the action that follows.'' If we appear to have plans to carry out our intent, it is because plans are ``an artifact of our reasoning about action, not ... the generative mechanism of action'' (emphasis in original). Suchman (1987) says that plans are ``retrospective reconstructions.''6 The position adopted by Lave (1988) and Suchman (1987) concerning goals and plans is that they are post hoc rationalizations for actions whose meaning can arise only within the immediacy of a given situation. Lave (1988) asks the obvious question about this problematic view of intentionality: ``If the meaning of activity is constructed in action ... from whence comes its intentional character, and indeed its meaningful basis?'' Her answer, that ``activity and its values are generated simultaneously,'' restates her position but does not explicate it. Winograd and Flores (1986) also subscribe to this radically situated view, using the colorful term ``thrownness'' (after Heidegger) to argue that we are actively embedded in, or ``thrown into,'' an ongoing situation that directs the flow of our actions much more than reflection or the use of durable mental representations. In activity theory and distributed cognition, by contrast, an object-goal is the beginning point of analysis. An object precedes and motivates activity. As Leont'ev (1974) states, ``Performing operations that do not realize any kind of goal-directed action [and recursively, a motive] on the subject's part is like the operation of a machine that has escaped human control.'' In activity theory and distributed cognition, an object is (partially) determinative of activity; in situated action, every activity is by definition uniquely constituted by the confluence of the particular factors that come together to form one ``situation.'' In a sense, situated action models are confined to what activity theorists would call the action and operation levels (though lacking a notion of goal at the action level in the activity theory sense). Situated action concentrates, at these levels, on the way people orient to changing conditions. Suchman's (1987) notion of ``embodied skills'' is similar to the notion of operations, though less rich than the activity theory construct, which grounds operations in consciousness and specifies that operations are dependent on certain conditions obtaining and that they may dynamically transform into actions when conditions change. While in principle one could reasonably focus one's efforts on understanding the action and operation levels while acknowledging the importance of the object level, neither Lave (1988) nor Suchman (1987), as we have seen, does this.
On the contrary, the very idea of an object's generating activity is rejected; objects (goals) and plans are ``retrospective reconstructions,'' post hoc ``artifacts of reasoning about action,'' after action has taken place. Why people would construct such explanations is an interesting question not addressed in these accounts. And why other people would demand or believe such retrospective reconstructions is another question to be addressed by this line of reasoning. Situated action models have a slightly behavioristic undercurrent in that it is the subject's reaction to the environment (the ``situation'') that finally determines action. What the analyst observes is cast as a response (the subject's actions/operations) to a stimulus (the ``situation''). The mediating influences of goals, plans, objects, and mental representations that would order the perception of a situation are absent in the situated view. There is no attempt to catalog and predict invariant reactions (as in classical behaviorism) as situations are said to vary unpredictably, but the relation between actor and environment is one of reaction in this logic.7 People ``orient to a situation'' rather than proactively generating activity rich with meaning reflective of their interests, intentions, and prior knowledge. Suchman and Trigg (1991) cataloged their research methods in describing how they conduct empirical studies. What is left out is as interesting as what is included. The authors report that they use (1) a stationary video camera to record behavior and conversation; (2) ``shadowing'' or following around an individual to study his or her movements; (3) tracing of artifacts and instrumenting of computers to audit usage, and (4) event-based analysis tracking individual tasks at different locations in a given setting. Absent from this catalog is the use of interviewing; interviews are treated as more or less unreliable accounts of idealized or rationalized behavior, such as subjectively reported goals as ``verbal interpretation'' (Lave


1988) and plans as ``retrospective reconstructions'' (Suchman 1987). Situated action analyses rely on recordable, observable behavior that is ``logged'' through analysis of a videotape or other record (Suchman and Trigg 1993; Jordan and Henderson 1994).8 Accounts from study participants describing in their own words what they think are doing, and why, such as those in this book by Bellamy, Bødker, Christiansen, Engeström and Escalante, Holland and Reeves, and Nardi, are not a focal point of situated action analyses. Activity theory has something interesting to tell us about the value of interview data. It has become a kind of received wisdom in the HCI community that people cannot articulate what they are doing (a notion sometimes used as a justification for observational studies and sometimes used to avoid talking to users at all). This generalization is true, however, primarily at the level of operations; it is certainly very difficult to say how you type, or how you see the winning pattern on the chessboard, or how you know when you have written a sentence that communicates well. But this generalization does not apply to the higher conscious levels of actions and objects; ask a secretary what the current problems are with the boss, or an effective executive what his goals are for the next quarter, and you will get an earful! Skillful interviewing or the need to teach someone how to do something often bring operations to the subject's conscious awareness so that even operations can be talked about, at least to some degree. Dancers, for example, use imagery and other verbal techniques to teach dance skills that are extremely difficult to verbalize. The ability to bring operations to a conscious level, even if only partially, is an aspect of the dynamism of the levels of activity as posited by activity theory. When the subject is motivated (e.g., by wishing to cooperate with a researcher or by the desire to teach), at least some operational material can be retrieved (see Bødker, this volume). The conditions fostering such a dynamic move to the action level of awareness may include skillful probing by an interviewer. In situated action, what constitutes a situation is defined by the researcher; there is no definitive concept such as object that marks a situation. The Leont'evian notion of object and goals remaining constant while actions and operations change because of changing conditions is not possible in the situated action framework that identifies the genesis of action as an indivisible conjunction of particularities giving rise to a unique situation. Thus we find a major difference between activity theory and situated action; in the former, the structuring of activity is determined in part, and in important ways, by human intentionality before the unfolding in a particular situation; in situated action, activity can be known only as it plays out in situ. In situated action, goals and plans cannot even be realized until after the activity has taken place, at which time they become constructed rationalizations for activity that is wholly created in the crucible of a particular situation. In terms of identifying activity, activity theory provides the more satisfying option of taking a definition of an activity directly from a subjectively defined object rather than imposing a definition from the researcher's view. 
These divergent notions of the structuring of activity, and the conceptual tools that identify one activity distinctly from another, are important for comparative work in studies of human-computer interaction. A framework that provides a clear way to demarcate one activity from another provides more comparative power than one that does not. Analyses that are entirely self-contained, in the way that a truly situated description of activity is, provide little scope for comparison. The level of analysis of situated action models—at the moment-by-moment level—would seem to be too low for comparative work. Brooks (1991) criticizes human-factors task analysis as being too low level in that all components in an analysis must ``be specified as at atomic a level as possible.'' This leads to an ad hoc set of tasks relevant only to a particular domain and makes cross-task comparison difficult (Brooks 1991). A similar criticism applies to situated action models in which a focus on moment-by-moment actions leads to detailed descriptions of highly particularistic activities (such as pricing cheeses in a bin or measuring out cottage cheese) that are not likely to be replicated across contexts. Most crucially, no tools for pulling out a higher-level description from a set of observations are offered, as they are in activity theory. Persistent Structures An important question for the study of context is the role that persistent structures such as artifacts, institutions, and cultural values play in shaping activity. To what extent should we expend effort analyzing the durable structures that stretch across situations and activities that cannot be properly described as simply an aspect of a particular situation? For both activity theory and distributed cognition, persistent structures are a central focus. Activity theory is concerned with the historical development of activity and the mediating role of artifacts. Leont'ev


(1974) (following work by Vygotsky) considered the use of tools to be crucial: ``A tool mediates activity that connects a person not only with the world of objects, but also with other people. This means that a person's activity assimilates the experience of humanity.'' Distributed cognition offers a similar notion; for example, Hutchins (1987) discusses ``collaborative manipulation,'' the process by which we take advantage of artifacts designed by others, sharing good ideas across time and space. Hutchins's example is a navigator using a map: the cartographer who created the map contributes, every time the navigator uses the map, to a remote collaboration in the navigator's task. Situated action models less readily accommodate durable structures that persist over time and across different activities. To the extent that activity is truly seen as ``situated,'' persistent, durable structures that span situations, and can thus be described and analyzed independent of a particular situation, will not be central. It is likely, however, that situated action models, especially those concerned with the design of technology, will allow some latitude in the degree of adherence to a purist view of situatedness, to allow for the study of cognitive and structural properties of artifacts and practices as they span situations. Indeed, in recent articles we find discussion of ``routine practices'' (Suchman and Trigg 1991) and ``routine competencies'' (Suchman 1993) to account for the observed regularities in the work settings studied. The studies continue to report detailed episodic events rich in minute particulars, but weave in descriptions of routine behavior as well. Situated action accounts may then exhibit a tension between an emphasis on that which is emergent, contingent, improvisatory and that which is routine and predictable. It remains to be seen just how this tension resolves—whether an actual synthesis emerges (more than simple acknowledgment that both improvisations and routines can be found in human behavior) or whether the claims to true situatedness that form the basis of the critique of cognitive science cede some importance to representations ``in the head.'' The appearance of routines in situated action models opens a chink in the situated armor with respect to mental representations; routines must be known and represented somehow. Routines still circumambulate notions of planful, intentional behavior; being canned bits of behavior, they obviate the need for active, conscious planning or the formulation of deliberate intentions or choices. Thus the positing of routines in situated action models departs from notions of emergent, contingent behavior but is consistent in staying clear of plans and motives. Of the three frameworks, distributed cognition has taken most seriously the study of persistent structures, especially artifacts. The emphasis on representations and the transformations they undergo brings persistent structures to center stage. Distributed cognition studies provide in-depth analyses of artifacts such as nomograms (Norman and Hutchins 1988), navigational tools (Hutchins 1990), airplane cockpits (Hutchins 1991a), spreadsheets (Nardi and Miller 1990, 1991), computer-aided design (CAD) systems (Petre and Green 1992), and even everyday artifacts such as door handles (Norman 1988). 
In these analyses, the artifacts are studied as they are actually used in real situations, but the properties of the artifacts are seen as persisting across situations of use, and it is believed that artifacts can be designed or redesigned with respect to their intrinsic structure as well as with respect to specific situations of use. For example, a spreadsheet table is an intrinsically good design (from a perceptual standpoint) for a system in which a great deal of dense information must be displayed and manipulated in a small space (Nardi and Miller 1990). Hutchins's (1991a) analysis of cockpit devices considers the memory requirements they impose. Norman (1988) analyzes whether artifacts are designed to prevent users from doing unintended (and unwanted) things with them. Petre and Green (1992) establish requirements for graphical notations for computer-aided design (CAD) users based on users' cognitive capabilities. In these studies, an understanding of artifacts is animated by observations made in real situations of their use, but there is also important consideration given to the relatively stable cognitive and structural properties of the artifacts that are not bound to a particular situation of use. Distributed cognition has also been productive of analyses of work practices that span specific situational contexts. For example, Seifert and Hutchins (1988) studied cooperative error correction on board large ships, finding that virtually all navigational errors were collaboratively ``detected and corrected within the navigation team.'' Gantt and Nardi (1992) found that organizations that make intensive use of CAD software may create formal in-house support systems for CAD users composed of domain experts (such as drafters) who also enjoy working with computers. Rogers and Ellis (1994) studied computermediated work in engineering practice. Symon et al. (1993) analyzed the coordination of work in a radiology department in a large hospital. Nardi et al. (1993) studied the coordination of work during neurosurgery afforded by video located within the operating room and at remote locations in the hospital. A series of studies on end user computing have found a strong pattern of cooperative work among users of a


variety of software systems in very different arenas, including users of word processing programs (Clement 1990), spreadsheets (Nardi and Miller 1990, 1991), UNIX (Mackay 1990), a scripting language (MacLean et al. 1990), and CAD systems (Gantt and Nardi 1992). In these studies the work practices described are not best analyzed as a product of a specific situation but are important precisely because they span particular situations. These studies develop points at a high level of analysis; for example, simply discovering that application development is a collaborative process has profound implications for the design of computer systems (Mackay 1990; Nardi 1993). Moment-by-moment actions, which would make generalization across contexts difficult, are not the key focus of these studies, which look for broader patterns spanning individual situations. People and Things: Symmetrical or Asymmetrical? Kaptelinin (chapter 5, this volume) points out that activity theory differs fundamentally from cognitive science in rejecting the idea that computers and people are equivalent. In cognitive science, a tight information processing loop with inputs and outputs on both sides models cognition. It is not important whether the agents in the model are humans or things produced by humans (such as computers). (See also Bødker, this volume, on the tool perspective.) Activity theory, with its emphasis on the importance of motive and consciousness—which belong only to humans—sees artifacts and people as different. Artifacts are mediators of human thought and behavior; people and things are not equivalent. Bødker (this volume) defines artifacts as instruments in the service of activities. In activity theory, people and things are unambiguously asymmetrical. Distributed cognition, by contrast, views people and things as conceptually equivalent; people and artifacts are ``agents'' in a system. This is similar to traditional cognitive science, except that the scope of the system has been widened to include a collaborating set of artifacts and people rather than the narrow ``man-machine'' dyad of cognitive science. While treating each node in a system as an ``agent'' has a certain elegance, it leads to a problematic view of cognition. We find in distributed cognition the somewhat illogical notion that artifacts are cognizing entities. Flor and Hutchins (1991) speak of ``the propagation of knowledge between different individuals and artifacts.'' But an artifact cannot know anything; it serves as a medium of knowledge for a human. A human may act on a piece of knowledge in unpredictable, self-initiated ways, according to socially or personally defined motives. A machine's use of information is always programmatic. Thus a theory that posits equivalence between human and machine damps out sources of systemic variation and contradiction (in the activity theory sense; see Kuutti, this volume) that may have important ramifications for a system. The activity theory notion of artifacts as mediators of cognition seems a more reasoned way to discuss relations between artifacts and people. Activity theory instructs us to treat people as sentient, moral beings (Tikhomirov 1972), a stance not required in relation to a machine and often treated as optional with respect to people when they are viewed simply as nodes in a system. 
The activity theory position would seem to hold greater potential for leading to a more responsible technology design in which people are viewed as active beings in control of their tools for creative purposes rather than as automatons whose operations are to be automated away, or nodes whose rights to privacy and dignity are not guaranteed. Engeström and Escalante (this volume) apply the activity theory approach of asymmetrical human-thing relations to their critique of actor-network theory. In an analysis of the role of Fitts's law in HCI studies undertaken from an activity theory perspective, Bertelsen (1994) argues that Fitts's ``law'' is actually an effect, subject to contextual variations, and throws into question the whole notion of the person as merely a predictable mechanical ``channel.'' Bertelsen notes that ``no matter how much it is claimed that Fitts' Law is merely a useful metaphor, it will make us perceive the human being as a channel. The danger is that viewing the human being as a channel will make us treat her as a mechanical device.... Our implicit or explicit choice of world view is also a choice of the world we want to live in; disinterested sciences do not exist'' (Bertelsen 1994). Seeing Fitts's findings as an effect, subject to contextual influence, helps us to avoid the depiction of the user as a mechanical part. Activity theory says, in essence, that we are what we do. Bertelsen sees Fitts's law as a tool of a particular kind of science that ``reduces the design of work environments, e.g., computer artifacts, to a matter of economical optimization.'' If we wish to design in such a manner, we will create a world of


ruthless optimization and little else, but it is certainly not inevitable that we do so. However, no amount of evidence that people are capable of behaving opportunistically, contingently, and flexibly will inhibit the development and dispersal of oppressive technologies; Taylorization has made that clear. If we wish a different world, it is necessary to design humane and liberating technologies that create the world as we wish it to be. There are never cut-and-dried answers, of course, when dealing with broad philosophical problems such as the definition of people and things, but activity theory at least engages the issue by maintaining that there is a difference and asking us to study its implications. Many years ago, Tikhomirov (1972) wrote, ``How society formulates the problem of advancing the creative content of its citizens' labor is a necessary condition for the full use of the computer's possibilities.'' Situated action models portray humans and things as qualitatively different. Suchman (1987) has been particularly eloquent on this point. But as I have noted, situated action models, perhaps inadvertently, may present people as reactive ciphers rather than fully cognizant human actors with self-generated agendas. DECIDING AMONG THE THREE APPROACHES All three approaches to the study of context have merit. The situated action perspective has provided a much-needed corrective to the rationalistic accounts of human behavior from traditional cognitive science. It exhorts us not to depend on rigidly conceived notions of inflexible plans and goals and invites us to take careful notice of what people are actually doing in the flux of real activity. Distributed cognition has shown how detailed analyses that combine the formal and cognitive properties of artifacts with observations on how artifacts are used can lead to understandings useful for design. Distributed cognition studies have also begun to generate a body of comparative data on patterns of work practices in varying arenas. Activity theory and distributed cognition are very close in spirit, as we have seen, and it is my belief that the two approaches will mutually inform, and even merge, over time, though activity theory will continue to probe questions of consciousness outside the purview of distributed cognition as it is presently formulated. The main differences with which we should be concerned here are between activity theory and situated action. Activity theory seems to me to be considerably richer and deeper than the situated action perspective.9 Although the critique of cognitive science offered by situated action analysts is cogent and has been extremely beneficial, the insistence on the ``situation'' as the primary determinant of activity is, in the long run, unsatisfying. What is a ``situation''? How do we account for variable responses to the same environment or ``situation'' without recourse to notions of object and consciousness? To take a very simple example, let us consider three individuals, each going on a nature walk. The first walker, a bird watcher, looks for birds. The second, an entomologist, studies insects as he walks, and the third, a meteorologist, gazes at clouds. The walker will carry out specific actions, such as using binoculars, or turning over leaves, or looking skyward, depending on his or her interest. The ``situation'' is the same in each case; what differs is the subject's object. 
While we might define a situation to include some notion of the subject's intentions, as we have seen, this approach is explicitly rejected by situated action analysts. (See also Lave 1993.) To take the example a step further, we observe that the bird watcher and the meteorologist might in some cases take exactly the same action from a behavioral point of view, such as looking skyward. But the observable action actually involves two very different activities for the subjects themselves. One is studying cloud formations, the other watching migrating ducks. The action of each, as seen on a videotape, for example, would appear identical; what differs is the subject's intent, interest, and knowledge of what is being looked at. If we do not consider the subject's object, we cannot account for simple things such as, in the case of the bird watcher, the presence of a field guide to birds and perhaps a ``life list'' that she marks up as she walks along.10 A bird watcher may go to great lengths to spot a tiny flycatcher high in the top of a tree; another walker will be totally unaware of the presence of the bird. The conscious actions and attention of the walker thus derive from her object. The bird watcher may also have an even longer-term object in mind as she goes along: adding all the North American birds to her life list. This object, very important to her, is in no way knowable from ``the situation'' (and not observable from a videotape). Activity theory gives us a vocabulary for talking about the walker's activity in meaningful subjective terms and gives the necessary attention to what the subject brings to a situation.11 In significant measure, the walker construes and creates


the situation by virtue of prior interest and knowledge. She is constrained by the environment in important ways, but her actions are not determined by it. As Davydov, Zinchenko, and Talyzina (1982) put it, the subject actively ```meets' the object with partiality and selectivity,'' rather than being ``totally subordinate to the effects of environmental factors ... the principle of reactivity is counterposed to the principle of the subject's activeness.'' It is also important to remember that the walker has consciously chosen an object and taken the necessary actions for carrying it out; she did not just suddenly and unexpectedly end up in the woods. Can we really say, as Suchman (1987) does, that her actions are ``ad hoc''? Situated action analyses often assume a ``situation'' that one somehow finds oneself in, without consideration of the fact that the very ``situation'' has already been created in part by the subject's desire to carry out some activity. For example, Suchman's famous canoeing example, intended to show that in the thick of things one abandons plans, is set up so that the ``situation'' is implicitly designated as ``getting your canoe through the falls'' (Suchman 1987). Surely the deck is stacked here. What about all the plotting and planning necessary to get away for the weekend, transport the canoe to the river, carry enough food, and so forth that must also be seen as definitive of the situation? It is only with the most mundane, plodding, and planful effort that one arrives ``at the falls.'' To circumscribe the ``situation'' as the glamorous, unpredictable moment of running the rapids is to miss the proverbial boat, as it were. An activity theory analysis instructs us to begin with the subjectively defined object as the point of analytical departure and thus will lead not simply to crystalline moments of improvisatory drama (whether measuring cottage cheese or running rapids) but to a more global view that encompasses the totality of an activity construed and constructed, in part, prior to its undertaking, with conscious, planful intent. Holland and Reeves (this volume) studied the differing paths taken by three groups of student programmers all enrolled in the same class and all beginning in the same ``situation.'' The professor gave each group the same specific task to accomplish during the semester and the students' ``performances were even monitored externally from an explicit and continually articulated position.'' The students were all supposed to be doing the same assignment; they heard the same lectures and had the same readings and resources. But as Holland and Reeves document, the projects took radically different courses and had extremely variable outcomes because the students themselves redefined the object of the class. Our understanding of what happened here must flow from an understanding of how each group of students construed, and reconstrued, the class situation. 
The ``situation'' by itself cannot account for the fact that one group of students produced a tool that was chosen for demonstration at a professional conference later in the year; one group produced a program with only twelve lines of code (and still got an A!); and the third group ``became so enmeshed in [interpersonal struggles] that the relationships among its members frequently became the object of its work.'' Bellamy (this volume) observes that to achieve practical results such as successfully introducing technology into the classroom, it is necessary to understand and affect the objects of educators: ``to change the underlying educational philosophy of schools, designers must design technologies that support students' learning activities and design technologies that support the activities of educators and educational administrators. Only by understanding and designing for the complete situation of education ... will it be possible for technology to bring about pervasive educational reform.'' Situated action models make it difficult to go beyond the particularities of the immediate situation for purposes of generalization and comparison. One immerses in the minutiae of a particular situation, and while the description may feel fresh, vivid, and ``on-the-ground'' as one reads it, when a larger task such as comparison is attempted, it is difficult to carry the material over. One finds oneself in a claustrophobic thicket of descriptive detail, lacking concepts with which to compare and generalize. The lack of conceptual vocabulary, the appeal to the ``situation'' itself in its moment-by-moment details, do not lend themselves to higher-order scientific tasks where some abstraction is necessary. It is appropriate to problematize notions of comparison and generalization in order to sharpen comparisons and generalizations, but it is fruitless to dispense with these foundations of scientific thought. A pure and radically situated view would by definition render comparison and generalization as logically at odds with notions of emergence, contingency, improvisation, description based on in situ detail and point of view. (I am not saying any of the situated theorists cited here are this radical; I am playing out the logical conclusion of the ideas.) Difficult though it may be to compare and generalize when the subject matter is people, it is nonetheless important if we are to do more than simply write one self-contained descriptive account after another. The more precise, careful, and sensitive comparisons and generalizations are, the better. This is true not only from the point of view of science but also of technology design. Design, a


practical activity, is going to proceed apace, and it is in our best interests to provide comparisons and generalizations based on nuanced and closely observed data, rather than rejecting the project of comparison and generalization altogether. Holland and Reeves compare their study to Suchman's (1994) study, which centers on a detailed description of how operations room personnel at an airport coordinated action to solve the problems of a dysfunctional ramp. Holland and Reeves point out that they themselves might have focused on a similar minutely observed episode such as studying how the student programmers produced time logs. However, they argue that they would then have missed the bigger picture of what the students were up to if they had, for example, concentrated on ``videotapes and transcriptions ... show[ing], the programmers' use of linguistic markers in concert with such items as physical copies of the time-log chart and the whiteboard xeroxes in order to orient joint attention, for example.'' Holland and Reeves's analysis argues for a basic theoretical orientation that accommodates a longer time horizon than is typical of a ``situation.'' They considered the entire three-month semester as the interesting frame of reference for their analysis, while Suchman looked at a much shorter episode, more easily describable as a ``situation.'' (See also Suchman and Trigg 1993, where the analysis centers on an hour and a half of videotape.) Holland and Reeves's analysis relies heavily on long-term participantobservation and verbal transcription; Suchman focuses on the videotape of a particular episode of the operations room in crisis. In comparing these two studies, we see how analytical perspective leads to a sense of what is interesting and determines where research effort is expended. Situated action models assume the primacy of a situation in which moment-by-moment interactions and events are critical, which leads away from a longer time frame of analysis. Videotape is a natural medium for this kind of analysis, and the tapes are looked at with an eye to the details of a particular interaction sequence (Jordan and Henderson 1994). By contrast, an activity theory analysis has larger scope for the kind of longer-term analysis provided by Holland and Reeves (though videotapes may certainly provide material of great interest to a particular activity theory analysis as in Bødker, this volume, and Engeström and Escalante, this volume). Of course the observation that theory and method are always entangled is not new; Hegel (1966) discussed this problem. Engeström (1993) summarized Hegel's key point: ``Methods should be developed or `derived' from the substance, as one enters and penetrates deeper into the object of study.'' And Vygotsky (1978) wrote, ``The search for method becomes one of the most important problems of the entire enterprise of understanding the uniquely human forms of psychological activity. In this case, the method is simultaneously prerequisite and product, the tool and the result of the study.'' Situated action models, then, have two key problems: (1) they do not account very well for observed regularities and durable, stable phenomena that span individual situations, and (2) they ignore the subjective. The first problem is partially addressed by situated action accounts that posit routines of one type or another (as discussed earlier). This brings situated action closer to activity theory in suggesting the importance of the historical continuity of artifacts and practice. 
It weakens true claims of ``situatedness,'' which highlight the emergent, contingent aspects of action. There has been a continuing aversion to incorporating the subjective in situated action models, which have held fast in downplaying consciousness, intentionality, plans, motives, and prior knowledge as critical components of human thought and behavior (Suchman 1983, 1987; Lave 1988, 1993; Suchman and Trigg 1991; Lave and Wenger 1991; Jordan and Henderson 1994). This aversion appears to spring from the felt need to continue to defend against the overly rationalistic models of traditional cognitive science (see Cognitive Science 17, 1993 for the continuing debate) and the desire to encourage people to look at action in situ. While these are laudable motivations, it is possible to take them too far. It is severely limiting to ignore motive and consciousness in human activity and constricting to confine analyses to observable moment-by-moment interactions. Aiming for a broader, deeper account of what people are up to as activity unfolds over time, and reaching for a way to incorporate subjective accounts of why people do what they do and how prior knowledge shapes the experience of a given situation, is the more satisfying path in the long run. Kaptelinin (chapter 5, this volume) notes that a fundamental question dictated by an activity theory analysis of human-computer interaction is: ``What are the objectives of computer use by the user and how are they related to the objectives of other people and the group/organization as a whole?'' This simple question leads to a different method of study and a different kind of result from a focus on a situation defined in its moment-by-moment particulars.

METHODOLOGICAL IMPLICATIONS OF ACTIVITY THEORY


To summarize the practical methodological implications for HCI studies of what we have been discussing in this section, we see that activity theory implies:

1. A research time frame long enough to understand users' objects, including, where appropriate, changes in objects over time and their relation to the objects of others in the setting studied. Kuutti (this volume) observes that ``activities are longer-term formations and their objects cannot be transformed into outcomes at once, but through a process consisting often of several steps or phases.'' Holland and Reeves (this volume) document changing objects in their study of student programmers. Engeström and Escalante (this volume) trace changes in the objects of the designers of the Postal Buddy. Christiansen (this volume) shows how actions can become objectified, again a process of change over time.

2. Attention to broad patterns of activity rather than narrow episodic fragments that fail to reveal the overall direction and import of an activity. The empirical studies in this book demonstrate the methods and tools useful for analyzing broad patterns of activity. Looking at smaller episodes can be useful, but not in isolation. Bødker (this volume) describes her video analysis of episodes of use of a computer artifact: ``Our ethnographic fieldwork was crucial to understanding the sessions in particular with respect to contextualization.''12 Engeström and Escalante apply the same approach.

3. The use of a varied set of data collection techniques including interviews, observations, video, and historical materials, without undue reliance on any one method (such as video). Bødker, Christiansen, Engeström and Escalante, and Holland and Reeves (this volume) show the utility of historical data (see also McGrath 1990; Engeström 1993).

4. A commitment to understanding things from users' points of view, as in, for example, Holland and Reeves (this volume). Bellamy (this volume) underscores the practical need for getting the ``natives''' point of view in her study of technology in the classroom.

For purposes of technology design, then, these four methodological considerations suggest a phased approach to design and evaluation. Laboratory-based experiments evaluating usability, the most commonly deployed HCI research technique at present, are a second phase in a longer process initiated by discovering the potential usefulness of technology through field research. Raeithel and Velichkovsky (this volume) describe an innovative technique of monitored communication for facilitating collaboration between designers and users. This technique sits somewhere between experimental and field methods and shows promise of providing a good way to encourage participatory design in a laboratory setting.

CONCLUSION

Activity theory seems the richest framework for studies of context in its comprehensiveness and engagement with difficult issues of consciousness, intentionality, and history. The empirical studies from all three frameworks are valuable and will undoubtedly mutually inform future work in the three areas. Human-computer interaction studies are a long way from the ideal set out by Brooks (1991): a corpus of knowledge that identifies the properties of artifacts and situations that are most significant for design and which permits comparison over domains, generates high-level analyses, and suggests actual designs.
However, with a concerted effort by researchers to apply a systematic conceptual framework encompassing the full context in which people and technology come together, much progress can be made. A creative synthesis of activity theory as a backbone for analysis, leavened by the focus on representations of distributed cognition, and the commitment to grappling with the perplexing flux of everyday activity of the situated action perspective, would seem a likely path to success.

ACKNOWLEDGMENTS


My grateful thanks to Rachel Bellamy, Lucy Berlin, Danielle Fafchamps, Vicki O'Day, and Jenny Watts for stimulating discussions of the problems of studying context. Kari Kuutti provided valuable commentary on an earlier draft of the chapter. Errors and omissions are my own. NOTES 1. This chapter is an expanded version of the paper that appeared in Proceedings East-West HCI Conference (pp. 352– 359), St. Petersburg, Russia. August 4–8, 1992, used with permission of the publisher. 2. I concentrate here on what Salomon (1993) calls the ``radical'' view of situatedness, to explore the most fundamental differences among the three perspectives. 3. Lave (1988) actually argues for the importance of institutions, but her analysis does not pay much attention to them, focusing instead on fine-grained descriptions of the particular activities of particular individuals in particular settings. 4. Weight Watchers is an organization that helps people lose weight. Dieters must weigh and measure their food to ensure that they will lose weight by carefully controlling their intake. 5. The word goal in everyday English usage is generally something like what activity theorists call an object in that it connotes a higher-level motive. 6. Suchman (1987) also says that plans may be ``projective accounts'' of action (as well as retrospective reconstructions), but it is not clear what the difference is between a conventional notion of plan and a ``projective account.'' 7. Rhetorically, the behavioristic cast of situated action descriptions is reflected in the use of impersonal referents to name study participants when reporting discourse. For example, study participants are referred to as ``Shopper'' in conversational exchanges with the anthropologist in Lave (1988), or become ciphers, e.g., A, B (Suchman 1987), or initials denoting the work role of interest, such as ``BP'' for baggage planner (Suchman and Trigg 1991). The use of pseudonyms to suggest actual people would be more common in a typical ethnography. 8. A good overview of the use of video for ``interaction analysis'' in which moment-by-moment interactions are the focus of study is provided by Jordan and Henderson (1994). They posit that understanding what someone ``might be thinking or intending'' must rely on ``evidence ... such as errors in verbal production or certain gestures and movements'' (emphasis in original). The ``evidence'' is not a verbal report by the study participant; it must be something visible on the tape—an observable behavior such as a verbal mistake. Jordan and Henderson observe that intentions, motivations and so forth ``can be talked about only by reference to evidence on the tape'' (emphasis in original). The evidence, judging by all their examples, does not include the content of what someone might say on the tape but only ``reactions,'' to use their word, actually seen on the tape. This is indeed a radical view of research. Does it mean that all experimental and naturalistic study in which someone is said to think or intend that has heretofore been undertaken and for which there are no video records does not have any ``evidence''? Does it mean that a researcher who has access only to the tapes has as good an idea of what study participants are up to as someone who has done lengthy participant-observation? The answers would appear to be yes since the ``evidence'' is, supposedly, encased in the tapes. 
In the laboratory where Jordan and Henderson work, the tapes are indeed analyzed by researchers who have not interacted personally with the study participants (Jordan and Henderson 1994). While certainly a great deal can be learned this way, it would also seem to limit the scope and richness of analysis. Much of interest happens outside the range of a video camera. The highly interpretive nature of video analysis has not been acknowledged by its supporters. The method is relatively new and in the first flush of enthusiastic embrace. Critiques will follow; they are being developed by various researchers taking a hard look at video. Jordan and Henderson do invite study participants into the lab to view the tapes and comment on them. This seems like a very interesting and fruitful idea. However, their philosophy is to try to steer informats toward their own epistemology—that is, that what is on the video is reality—not some other subjective reality the study participants might live with. As Jordan and Henderson (1994) say, ``elicitation'' based on viewing tapes ``has the advantage of staying much closer to the actual events [than conventional interviews]'' (emphasis added). 9. Rogers and Ellis (1994) make this same argument but for distributed cognition. However they do not consider activity theory.


10. Many bird watchers keep ``life lists'' in which they write down every individual bird species they have ever seen. They may want to see all the North American birds, or all the birds of Europe, or some other group of interest to them. 11. I use the term subjective to mean ``emanating from a subject'' (in activity theory terms), not ``lacking in objectivity'' in the sense of detachment, especially scientific detachment (a common meaning in English). 12. While Jordan and Henderson state that participant-observation is part of their method in interaction analysis, they use participant-observation to ``identify interactional `hot spots'—sites of activity for which videotaping promises to be productive'' (Jordan and Henderson 1994). Participant observation is used as a heuristic for getting at something very specific—interactions—and further, those particular interactions that will presumably be interesting on tape. In a sense, interaction analysis turns participant-observation on its head by selectively seeking events that will lend themselves to the use of a particular technology— video—rather than using video if and when a deeper understanding of some aspect of a culture is revealed in the process of getting to know the natives in their own terms, as in classic participantobservation. Note that Bødker (this volume) pairs ethnographic fieldwork with video to provide for contextualization; she thus uses ethnography to add to what can be seen on the tape, while Jordan and Henderson use it to pare down what will appear on the tape and thus what will be analyzed as ``evidence.''

REFERENCES

Ashby, W. R. (1956). Introduction to Cybernetics. London: Chapman and Hall.
Bertalanffy, L. (1968). General System Theory. New York: George Braziller.
Bertelsen, O. (1994). Fitts' law as a design artifact: A paradigm case of theory in software design. In Proceedings East-West Human Computer Interaction Conference (vol. 1, pp. 37–43). St. Petersburg, Russia, August 2–6.
Bødker, S. (1989). A human activity approach to user interfaces. Human-Computer Interaction 4:171–195.
Brooks, R. (1991). Comparative task analysis: An alternative direction for human-computer interaction science. In J. Carroll, ed., Designing Interaction: Psychology at the Human Computer Interface. Cambridge: Cambridge University Press.
Chaiklin, S., and Lave, J. (1993). Understanding Practice: Perspectives on Activity and Context. Cambridge: Cambridge University Press.
Clement, A. (1990). Cooperative support for computer work: A social perspective on the empowering of end users. In Proceedings of CSCW'90 (pp. 223–236). Los Angeles, October 7–10.
Davydov, V., Zinchenko, V., and Talyzina, N. (1982). The problem of activity in the works of A. N. Leont'ev. Soviet Psychology 21:31–42.
Engeström, Y. (1990). Activity theory and individual and social transformation. Opening address at 2d International Congress for Research on Activity Theory, Lahti, Finland, May 21–25.
Engeström, Y. (1993). Developmental studies of work as a testbench of activity theory. In S. Chaiklin and J. Lave, eds., Understanding Practice: Perspectives on Activity and Context (pp. 64–103). Cambridge: Cambridge University Press.
Fafchamps, D. (1991). Ethnographic workflow analysis. In H.-J. Bullinger, ed., Human Aspects in Computing: Design and Use of Interactive Systems and Work with Terminals (pp. 709–715). Amsterdam: Elsevier Science Publishers.
Flor, N., and Hutchins, E. (1991). Analyzing distributed cognition in software teams: A case study of team programming during perfective software maintenance. In J. Koenemann-Belliveau et al., eds., Proceedings of the Fourth Annual Workshop on Empirical Studies of Programmers (pp. 36–59). Norwood, NJ: Ablex Publishing.
Gantt, M., and Nardi, B. (1992). Gardeners and gurus: Patterns of cooperation among CAD users. In Proceedings CHI '92 (pp. 107–118). Monterey, California, May 3–7.


Garfinkel, H. (1967). Studies in Ethnomethodology. Englewood Cliffs, NJ: Prentice-Hall.
Goodwin, C., and Goodwin, M. (1993). Seeing as situated activity: Formulating planes. In Y. Engeström and D. Middleton, eds., Cognition and Communication at Work. Cambridge: Cambridge University Press.
Hegel, G. (1966). The Phenomenology of Mind. London: George Allen & Unwin.
Hutchins, E. (1987). Metaphors for interface design. ICS Report 8703. La Jolla: University of California, San Diego.
Hutchins, E. (1990). The technology of team navigation. In J. Galegher, ed., Intellectual Teamwork. Hillsdale, NJ: Lawrence Erlbaum.
Hutchins, E. (1991a). How a cockpit remembers its speeds. Ms. La Jolla: University of California, Department of Cognitive Science.
Hutchins, E. (1991b). The social organization of distributed cognition. In L. Resnick, ed., Perspectives on Socially Shared Cognition (pp. 283–287). Washington, DC: American Psychological Association.
Hutchins, E. (1994). Cognition in the Wild. Cambridge, MA: MIT Press.
Kozulin, A. (1986). The concept of activity in Soviet psychology. American Psychologist 41(3):264–274.
Kuutti, K. (1991). Activity theory and its applications to information systems research and development. In H.-E. Nissen, ed., Information Systems Research (pp. 529–549). Amsterdam: Elsevier Science Publishers.
Jordan, B., and Henderson, A. (1994). Interaction analysis: Foundations and practice. IRL Technical Report. Palo Alto, CA: IRL.
Lave, J. (1988). Cognition in Practice. Cambridge: Cambridge University Press.
Lave, J. (1993). The practice of learning. In S. Chaiklin and J. Lave, eds., Understanding Practice: Perspectives on Activity and Context. Cambridge: Cambridge University Press.
Lave, J., and Wenger, E. (1991). Situated Learning: Legitimate Peripheral Participation. Cambridge: Cambridge University Press.
Leont'ev, A. (1974). The problem of activity in psychology. Soviet Psychology 13(2):4–33.
Leont'ev, A. (1978). Activity, Consciousness, and Personality. Englewood Cliffs, NJ: Prentice-Hall.
Luria, A. R. (1979). The Making of Mind: A Personal Account of Soviet Psychology. Cambridge, MA: Harvard University Press.
McGrath, J. (1990). Time matters in groups. In J. Galegher, R. Kraut, and C. Egido, eds., Intellectual Teamwork: Social and Technological Foundations of Cooperative Work (pp. 23–61). Hillsdale, NJ: Lawrence Erlbaum.
Mackay, W. (1990). Patterns of sharing customizable software. In Proceedings CSCW'90 (pp. 209–221). Los Angeles, October 7–10.
MacLean, A., Carter, K., Lovstrand, L., and Moran, T. (1990). User-tailorable systems: Pressing the issues with buttons. In Proceedings CHI'90 (pp. 175–182). Seattle, April 1–5.
Nardi, B. (1993). A Small Matter of Programming: Perspectives on End User Computing. Cambridge, MA: MIT Press.


Nardi, B., and Miller, J. (1990). The spreadsheet interface: A basis for end user programming. In Proceedings of Interact'90 (pp. 977–983). Cambridge, England, August 27–31.
Nardi, B., and Miller, J. (1991). Twinkling lights and nested loops: Distributed problem solving and spreadsheet development. International Journal of Man-Machine Studies 34:161–184.
Nardi, B., and Zarmer, C. (1993). Beyond models and metaphors: Visual formalisms in user interface design. Journal of Visual Languages and Computing 4:5–33.
Nardi, B., Schwarz, H., Kuchinsky, A., Leichner, R., Whittaker, S., and Sclabassi, R. (1993). Turning away from talking heads: The use of video-as-data in neurosurgery. In Proceedings INTERCHI'93 (pp. 327–334). Amsterdam, April 24–28.
Newell, A., and Simon, H. (1972). Human Problem Solving. Englewood Cliffs, NJ: Prentice-Hall.
Newman, D., Griffin, P., and Cole, M. (1989). The Construction Zone: Working for Cognitive Change in School. Cambridge: Cambridge University Press.
Norman, D. (1988). The Psychology of Everyday Things. New York: Basic Books.
Norman, D. (1991). Cognitive artifacts. In J. Carroll, ed., Designing Interaction: Psychology at the Human Computer Interface. New York: Cambridge University Press.
Norman, D., and Hutchins, E. (1988). Computation via direct manipulation. Final Report to Office of Naval Research, Contract No. N00014-85-C-0133. La Jolla: University of California, San Diego.
Petre, M., and Green, T. R. G. (1992). Requirements of graphical notations for professional users: Electronics CAD systems as a case study. Le Travail humain 55:47–70.
Raeithel, A. (1991). Semiotic self-regulation and work: An activity theoretical foundation for design. In R. Floyd et al., eds., Software Development and Reality Construction. Berlin: Springer-Verlag.
Rogers, Y., and Ellis, J. (1994). Distributed cognition: An alternative framework for analysing and explaining collaborative working. Journal of Information Technology 9:119–128.
Salomon, G. (1993). Distributed Cognitions: Psychological and Educational Considerations. Cambridge: Cambridge University Press.
Scribner, S. (1984). Studying working intelligence. In B. Rogoff and J. Lave, eds., Everyday Cognition: Its Development in Social Context. Cambridge, MA: Harvard University Press.
Seifert, C., and Hutchins, E. (1988). Learning from error. Education Report Number AD-A199. Washington, DC: American Society for Engineering.
Suchman, L. (1987). Plans and Situated Actions. Cambridge: Cambridge University Press.
Suchman, L. (1993). Response to Vera and Simon's situated action: A symbolic interpretation. Cognitive Science 17:71–76.
Suchman, L. (1994). Constituting shared workspaces. In Y. Engeström and D. Middleton, eds., Cognition and Communication at Work. Cambridge: Cambridge University Press.
Suchman, L., and Trigg, R. (1991). Understanding practice: Video as a medium for reflection and design. In J. Greenbaum and M. Kyng, eds., Design at Work: Cooperative Design of Computer Systems. Hillsdale, NJ: Lawrence Erlbaum.


Suchman, L., and Trigg, R. (1993). Artificial intelligence as craftwork. In S. Chaiklin and J. Lave, eds., Understanding Practice: Perspectives on Activity and Context. Cambridge: Cambridge University Press.
Symon, G., Long, K., Ellis, J., and Hughes, S. (1993). Information sharing and communication in conducting radiological examinations. Technical report. Cardiff, UK: Psychology Department, Cardiff University.
Tikhomirov, O. (1972). The psychological consequences of computerization. In O. Tikhomirov, ed., Man and Computer. Moscow: Moscow University Press.
Vygotsky, L. S. (1978). Mind in Society. Cambridge, MA: Harvard University Press.
Wertsch, J. (ed.). (1981). The Concept of Activity in Soviet Psychology. Armonk, NY: M. E. Sharpe.
Wiener, N. (1948). Cybernetics. New York: Wiley.
Winograd, T., and Flores, F. (1986). Understanding Computers and Cognition: A New Foundation for Design. Norwood, NJ: Ablex.
Zhang, J. (1990). The interaction of internal and external information in a problem solving task. UCSD Technical Report 9005. La Jolla: University of California, Department of Cognitive Science.


Do Artifacts Have Politics? Author: Langdon Winner. Source: Daedalus, Vol. 109, No. 1, Modern Technology: Problem or Opportunity? (Winter 1980), pp. 121–136. Published by The MIT Press on behalf of the American Academy of Arts & Sciences. Stable URL: http://www.jstor.org/stable/20024652


LANGDON WINNER

Do Artifacts Have Politics?

In controversies about technology and society, there is no idea more provocative than the notion that technical things have political qualities. At issue is the claim that the machines, structures, and systems of modern material culture can be accurately judged not only for their contributions of efficiency and productivity, not merely for their positive and negative environmental side effects, but also for the ways in which they can embody specific forms of power and authority. Since ideas of this kind have a persistent and troubling presence in discussions about the meaning of technology, they deserve explicit attention.1

Writing in Technology and Culture almost two decades ago, Lewis Mumford gave classic statement to one version of the theme, arguing that "from late neolithic times in the Near East, right down to our own day, two technologies have recurrently existed side by side: one authoritarian, the other democratic, the first system-centered, immensely powerful, but inherently unstable, the other man-centered, relatively weak, but resourceful and durable."2 This thesis stands at the heart of Mumford's studies of the city, architecture, and the history of technics, and mirrors concerns voiced earlier in the works of Peter Kropotkin, William Morris, and other nineteenth century critics of industrialism. More recently, antinuclear and prosolar energy movements in Europe and America have adopted a similar notion as a centerpiece in their arguments.

Thus environmentalist Denis Hayes concludes, "The increased deployment of nuclear power facilities must lead society toward authoritarianism. Indeed, safe reliance upon nuclear power as the principal source of energy may be possible only in a totalitarian state." Echoing the views of many proponents of appropriate technology and the soft energy path, Hayes contends that "dispersed solar sources are more compatible than centralized technologies with social equity, freedom and cultural pluralism."3

An eagerness to interpret technical artifacts in political language is by no means the exclusive property of critics of large-scale high-technology systems. A long lineage of boosters have insisted that the "biggest and best" that science and industry made available were the best guarantees of freedom, democracy, and social justice. The factory system, automobile, radio, television, telephone, the space program, and of course nuclear power itself have all at one time or another been described as democratizing, liberating forces. David Lilienthal, in T.V.A.: Democracy on the March, for example, found this promise in the phosphate fertilizers and electricity that technical progress was bringing to rural Americans during the 1940s.4 In a recent essay, The Republic of Technology, Daniel Boorstin extolled television for "its power to disband armies, to cashier presidents, to create a whole new democratic world - democratic in ways never before imagined, even in America."5 Scarcely a new invention comes along that someone does not proclaim it the salvation of a free society.

It is no surprise to learn that technical systems of various kinds are deeply interwoven in the conditions of modern politics. The physical arrangements of industrial production, warfare, communications, and the like have fundamentally changed the exercise of power and the experience of citizenship. But to go beyond this obvious fact and to argue that certain technologies in themselves have political properties seems, at first glance, completely mistaken. We all know that people have politics, not things. To discover either virtues or evils in aggregates of steel, plastic, transistors, integrated circuits, and chemicals seems just plain wrong, a way of mystifying human artifice and of avoiding the true sources of freedom and oppression, justice and injustice. Blaming the hardware appears even more foolish than blaming the victims when it comes to judging conditions of public life.

Hence, the stern advice commonly given those who flirt with the notion that technical artifacts have political qualities: What matters is not technology itself, but the social or economic system in which it is embedded. This maxim, which in a number of variations is the central premise of a theory that can be called the social determination of technology, has an obvious wisdom. It serves as a needed corrective to those who focus uncritically on such things as "the computer and its social impacts" but who fail to look behind technical things to notice the social circumstances of their development, deployment, and use. This view provides an antidote to naive technological determinism - the idea that technology develops as the sole result of an internal dynamic, and then, unmediated by any other influence, molds society to fit its patterns. Those who have not recognized the ways in which technologies are shaped by social and economic forces have not gotten very far.

But the corrective has its own shortcomings; taken literally, it suggests that technical things do not matter at all. Once one has done the detective work necessary to reveal the social origins - the power holders behind a particular instance of technological change - one will have explained everything of importance. This conclusion offers comfort to social scientists: it validates what they had always suspected, namely, that there is nothing distinctive about the study of technology in the first place. Hence, they can return to their standard models of social power - those of interest group politics, bureaucratic politics, Marxist models of class struggle, and the like - and have everything they need. The social determination of technology is, in this view, essentially no different from the social determination of, say, welfare policy or taxation.

There are, however, good reasons technology has of late taken on a special fascination in its own right for historians, philosophers, and political scientists; good reasons the standard models of social science only go so far in accounting for what is most interesting and troublesome about the subject. In another place I have tried to show why so much of modern social and political thought contains recurring statements of what can be called a theory of technological politics, an odd mongrel of notions often crossbred with orthodox liberal, conservative, and socialist philosophies.6 The theory of technological politics draws attention to the momentum of large-scale sociotechnical systems, to the response of modern societies to certain technological imperatives, and to the all too common signs of the adaptation of human ends to technical means. In so doing it offers a novel framework of interpretation and explanation for some of the more puzzling patterns that have taken shape in and around the growth of modern material culture. One strength of this point of view is that it takes technical artifacts seriously. Rather than insist that we immediately reduce everything to the interplay of social forces, it suggests that we pay attention to the characteristics of technical objects and the meaning of those characteristics. A necessary complement to, rather than a replacement for, theories of the social determination of technology, this perspective identifies certain technologies as political phenomena in their own right. It points us back, to borrow Edmund Husserl's philosophical injunction, to the things themselves.

In what follows I shall offer outlines and illustrations of two ways in which artifacts can contain political properties. First are instances in which the invention, design, or arrangement of a specific technical device or system becomes a way of settling an issue in a particular community. Seen in the proper light, examples of this kind are fairly straightforward and easily understood. Second are cases of what can be called inherently political technologies, man-made systems that appear to require, or to be strongly compatible with, particular kinds of political relationships. Arguments about cases of this kind are much more troublesome and closer to the heart of the matter. By "politics," I mean arrangements of power and authority in human associations as well as the activities that take place within those arrangements. For my purposes here, "technology" is understood to mean all of modern practical artifice,7 but to avoid confusion I prefer to speak of technologies, smaller or larger pieces or systems of hardware of a specific kind. My intention is not to settle any of the issues here once and for all, but to indicate their general dimensions and significance.

Technical Arrangements as Forms of Order

Anyone who has traveled the highways of America and has become used to the normal height of overpasses may well find something a little odd about some of the bridges over the parkways on Long Island, New York. Many of the overpasses are extraordinarily low, having as little as nine feet of clearance at the curb. Even those who happened to notice this structural peculiarity would not be inclined to attach any special meaning to it. In our accustomed way of looking at things like roads and bridges we see the details of form as innocuous, and seldom give them a second thought.

It turns out, however, that the two hundred or so low-hanging overpasses on Long Island were deliberately designed to achieve a particular social effect. Robert Moses, the master builder of roads, parks, bridges, and other public works from the 1920s to the 1970s in New York, had these overpasses built to specifications that would discourage the presence of buses on his parkways. According to evidence provided by Robert A. Caro in his biography of Moses, the reasons reflect Moses's social-class bias and racial prejudice. Automobile-owning whites of "upper" and "comfortable middle" classes, as he called them, would be free to use the parkways for recreation and commuting. Poor people and blacks, who normally used public transit, were kept off the roads because the twelve-foot tall buses could not get through the overpasses. One consequence was to limit access of racial minorities and low-income groups to Jones Beach, Moses's widely acclaimed public park. Moses made doubly sure of this result by vetoing a proposed extension of the Long Island Railroad to Jones Beach.8

As a story in recent American political history, Robert Moses's life is fascinating. His dealings with mayors, governors, and presidents, and his careful manipulation of legislatures, banks, labor unions, the press, and public opinion are all matters that political scientists could study for years. But the most important and enduring results of his work are his technologies, the vast engineering projects that give New York much of its present form. For generations after Moses has gone and the alliances he forged have fallen apart, his public works, especially the highways and bridges he built to favor the use of the automobile over the development of mass transit, will continue to shape that city. Many of his monumental structures of concrete and steel embody a systematic social inequality, a way of engineering relationships among people that, after a time, becomes just another part of the landscape. As planner Lee Koppleman told Caro about the low bridges on Wantagh Parkway, "The old son-of-a-gun had made sure that buses would never be able to use his goddamned parkways."9

Histories of architecture, city planning, and public works contain many examples of physical arrangements that contain explicit or implicit political purposes. One can point to Baron Haussmann's broad Parisian thoroughfares, engineered at Louis Napoleon's direction to prevent any recurrence of street fighting of the kind that took place during the revolution of 1848. Or one can visit any number of grotesque concrete buildings and huge plazas constructed on American university campuses during the late 1960s and early 1970s to defuse student demonstrations. Studies of industrial machines and instruments also turn up interesting political stories, including some that violate our normal expectations about why technological innovations are made in the first place. If we suppose that new technologies are introduced to achieve increased efficiency, the history of technology shows that we will sometimes be disappointed. Technological change expresses a panoply of human motives, not the least of which is the desire of some to have dominion over others, even though it may require an occasional sacrifice of cost-cutting and some violence to the norm of getting more from less.

One poignant illustration can be found in the history of nineteenth century industrial mechanization. At Cyrus McCormick's reaper manufacturing plant in Chicago in the middle 1880s, pneumatic molding machines, a new and largely untested innovation, were added to the foundry at an estimated cost of $500,000. In the standard economic interpretation of such things, we would expect that this step was taken to modernize the plant and achieve the kind of efficiencies that mechanization brings. But historian Robert Ozanne has shown why the development must be seen in a broader context. At the time, Cyrus McCormick II was engaged in a battle with the National Union of Iron Molders. He saw the addition of the new machines as a way to "weed out the bad element among the men," namely, the skilled workers who had organized the union local in Chicago.10 The new machines, manned by unskilled labor, actually produced inferior castings at a higher cost than the earlier process. After three years of use the machines were, in fact, abandoned, but by that time they had served their purpose - the destruction of the union. Thus, the story of these technical developments at the McCormick factory cannot be understood adequately outside the record of workers' attempts to organize, police repression of the labor movement in Chicago during that period, and the events surrounding the bombing at Haymarket Square. Technological history and American political history were at that moment deeply intertwined.

In cases like those of Moses's low bridges and McCormick's molding machines, one sees the importance of technical arrangements that precede the use of the things in question. It is obvious that technologies can be used in ways that enhance the power, authority, and privilege of some over others, for example, the use of television to sell a candidate. To our accustomed way of thinking, technologies are seen as neutral tools that can be used well or poorly, for good, evil, or something in between. But we usually do not stop to inquire whether a given device might have been designed and built in such a way that it produces a set of consequences logically and temporally prior to any of its professed uses. Robert Moses's bridges, after all, were used to carry automobiles from one point to another; McCormick's machines were used to make metal castings; both technologies, however, encompassed purposes far beyond their immediate use. If our moral and political language for evaluating technology includes only categories having to do with tools and uses, if it does not include attention to the meaning of the designs and arrangements of our artifacts, then we will be blinded to much that is intellectually and practically crucial.

Because the point is most easily understood in the light of particular intentions embodied in physical form, I have so far offered illustrations that seem almost conspiratorial. But to recognize the political dimensions in the shapes of technology does not require that we look for conscious conspiracies or malicious intentions. The organized movement of handicapped people in the United States during the 1970s pointed out the countless ways in which machines, instruments, and structures of common use - buses, sidewalks, buildings, plumbing fixtures, and so forth - made it impossible for many handicapped persons to move about freely, a condition that systematically excluded them from public life. It is safe to say that designs unsuited for the handicapped arose more from long-standing neglect than from anyone's active intention. But now that the issue has been raised for public attention, it is evident that justice requires a remedy. A whole range of artifacts are now being redesigned and rebuilt to accommodate this minority.

Indeed, many of the most important examples of technologies that have political consequences are those that transcend the simple categories of "intended" and "unintended" altogether. These are instances in which the very process of technical development is so thoroughly biased in a particular direction that it regularly produces results counted as wonderful breakthroughs by some social interests and crushing setbacks by others. In such cases it is neither correct nor insightful to say, "Someone intended to do somebody else harm." Rather, one must say that the technological deck has been stacked long in advance to favor certain social interests, and that some people were bound to receive a better hand than others.

The mechanical tomato harvester, a remarkable device perfected by researchers at the University of California from the late 1940s to the present, offers an illustrative tale. The machine is able to harvest tomatoes in a single pass through a row, cutting the plants from the ground, shaking the fruit loose, and in the newest models sorting the tomatoes electronically into large plastic gondolas that hold up to twenty-five tons of produce headed for canning. To accommodate the rough motion of these "factories in the field," agricultural researchers have bred new varieties of tomatoes that are hardier, sturdier, and less tasty. The harvesters replace the system of handpicking, in which crews of farmworkers would pass through the fields three or four times putting ripe tomatoes in lug boxes and saving immature fruit for later harvest.11 Studies in California indicate that the machine reduces costs by approximately five to seven dollars per ton as compared to hand-harvesting.12 But the benefits are by no means equally divided in the agricultural economy. In fact, the machine has in this instance been the occasion for a thorough reshaping of social relationships of tomato production in rural California.

By their very size and cost, more than $50,000 each to purchase, the machines are compatible only with a highly concentrated form of tomato growing. With the introduction of this new method of harvesting, the number of tomato growers declined from approximately four thousand in the early 1960s to about six hundred in 1973, yet with a substantial increase in tons of tomatoes produced. By the late 1970s an estimated thirty-two thousand jobs in the tomato industry had been eliminated as a direct consequence of mechanization.13 Thus, a jump in productivity to the benefit of very large growers has occurred at a sacrifice to other rural agricultural communities.

The University of California's research and development on agricultural machines like the tomato harvester is at this time the subject of a law suit filed by attorneys for California Rural Legal Assistance, an organization representing a group of farmworkers and other interested parties. The suit charges that University officials are spending tax monies on projects that benefit a handful of private interests to the detriment of farmworkers, small farmers, consumers, and rural California generally, and asks for a court injunction to stop the practice. The University has denied these charges, arguing that to accept them "would require elimination of all research with any potential practical application."14

As far as I know, no one has argued that the development of the tomato harvester was the result of a plot. Two students of the controversy, William Friedland and Amy Barton, specifically exonerate both the original developers of the machine and the hard tomato from any desire to facilitate economic concentration in that industry.15 What we see here instead is an ongoing social process in which scientific knowledge, technological invention, and corporate profit reinforce each other in deeply entrenched patterns that bear the unmistakable stamp of political and economic power. Over many decades agricultural research and development in American land-grant colleges and universities has tended to favor the interests of large agribusiness concerns.16 It is in the face of such subtly ingrained patterns that opponents of innovations like the tomato harvester are made to seem "antitechnology" or "antiprogress." For the harvester is not merely the symbol of a social order that rewards some while punishing others; it is in a true sense an embodiment of that order.

Within a given category of technological change there are, roughly speaking, two kinds of choices that can affect the relative distribution of power, authority, and privilege in a community. Often the crucial decision is a simple "yes or no" choice - are we going to develop and adopt the thing or not? In recent years many local, national, and international disputes about technology have centered on "yes or no" judgments about such things as food additives, pesticides, the building of highways, nuclear reactors, and dam projects. The fundamental choice about an ABM or an SST is whether or not the thing is going to join society as a piece of its operating equipment. Reasons for and against are frequently as important as those concerning the adoption of an important new law.

A second range of choices, equally critical in many instances, has to do with specific features in the design or arrangement of a technical system after the decision to go ahead with it has already been made. Even after a utility company wins permission to build a large electric power line, important controversies can remain with respect to the placement of its route and the design of its towers; even after an organization has decided to institute a system of computers, controversies can still arise with regard to the kinds of components, programs, modes of access, and other specific features the system will include. Once the mechanical tomato harvester had been developed in its basic form, a design alteration of critical social significance - the addition of electronic sorters, for example - changed the character of the machine's effects on the balance of wealth and power in California agriculture. Some of the most interesting research on technology and politics at present focuses on the attempt to demonstrate in a detailed, concrete fashion how seemingly innocuous design features in mass transit systems, water projects, industrial machinery, and other technologies actually mask social choices of profound significance. Historian David Noble is now studying two kinds of automated machine tool systems that have different implications for the relative power of management and labor in the industries that might employ them. He is able to show that, although the basic electronic and mechanical components of the record/playback and numerical control systems are similar, the choice of one design over another has crucial consequences for social struggles on the shop floor. To see the matter solely in terms of cost cutting, efficiency, or the modernization of equipment is to miss a decisive element in the story.17

From such examples I would offer the following general conclusions. The things we call "technologies" are ways of building order in our world. Many technical devices and systems important in everyday life contain possibilities for many different ways of ordering human activity. Consciously or not, deliberately or inadvertently, societies choose structures for technologies that influence how people are going to work, communicate, travel, consume, and so forth over a very long time. In the processes by which structuring decisions are made, different people are differently situated and possess unequal degrees of power as well as unequal levels of awareness. By far the greatest latitude of choice exists the very first time a particular instrument, system, or technique is introduced. Because choices tend to become strongly fixed in material equipment, economic investment, and social habit, the original flexibility vanishes for all practical purposes once the initial commitments are made. In that sense technological innovations are similar to legislative acts or political foundings that establish a framework for public order that will endure over many generations. For that reason, the same careful attention one would give to the rules, roles, and relationships of politics must also be given to such things as the building of highways, the creation of television networks, and the tailoring of seemingly insignificant features on new machines. The issues that divide or unite people in society are settled not only in the institutions and practices of politics proper, but also, and less obviously, in tangible arrangements of steel and concrete, wires and transistors, nuts and bolts.

Inherently Political Technologies

None of the arguments and examples considered thus far address a stronger, more troubling claim often made in writings about technology and society - the belief that some technologies are by their very nature political in a specific way. According to this view, the adoption of a given technical system unavoidably brings with it conditions for human relationships that have a distinctive political cast - for example, centralized or decentralized, egalitarian or inegalitarian, repressive or liberating. This is ultimately what is at stake in assertions like those of Lewis Mumford that two traditions of technology, one authoritarian, the other democratic, exist side by side in Western history. In all the cases I cited above the technologies are relatively flexible in design and arrangement, and variable in their effects. Although one can recognize a particular result produced in a particular setting, one can also easily imagine how a roughly similar device or system might have been built or situated with very much different political consequences. The idea we must now examine and evaluate is that certain kinds of technology do not allow such flexibility, and that to choose them is to choose a particular form of political life.

A remarkably forceful statement of one version of this argument appears in Friedrich Engels's little essay "On Authority" written in 1872. Answering anarchists who believed that authority is an evil that ought to be abolished altogether, Engels launches into a panegyric for authoritarianism, maintaining, among other things, that strong authority is a necessary condition in modern industry. To advance his case in the strongest possible way, he asks his readers to imagine that the revolution has already occurred. "Supposing a social revolution dethroned the capitalists, who now exercise their authority over the production and circulation of wealth. Supposing, to adopt entirely the point of view of the anti-authoritarians, that the land and the instruments of labour had become the collective property of the workers who use them. Will authority have disappeared or will it have only changed its form?"18

His answer draws upon lessons from three sociotechnical systems of his day, cotton-spinning mills, railways, and ships at sea. He observes that, on its way to becoming finished thread, cotton moves through a number of different operations at different locations in the factory. The workers perform a wide variety of tasks, from running the steam engine to carrying the products from one room to another. Because these tasks must be coordinated, and because the timing of the work is "fixed by the authority of the steam," laborers must learn to accept a rigid discipline. They must, according to Engels, work at regular hours and agree to subordinate their individual wills to the persons in charge of factory operations. If they fail to do so, they risk the horrifying possibility that production will come to a grinding halt. Engels pulls no punches. "The automatic machinery of a big factory," he writes, "is much more despotic than the small capitalists who employ workers ever have been."19

Similar lessons are adduced in Engels's analysis of the necessary operating conditions for railways and ships at sea. Both require the subordination of workers to an "imperious authority" that sees to it that things run according to plan. Engels finds that, far from being an idiosyncracy of capitalist social organization, relationships of authority and subordination arise "independently of all social organization, [and] are imposed upon us together with the material conditions under which we produce and make products circulate." Again, he intends this to be stern advice to the anarchists who, according to Engels, thought it possible simply to eradicate superordination and subordination at a single stroke. All such schemes are nonsense. The roots of unavoidable authoritarianism are, he argues, deeply implanted in the human involvement with science and technology. "If man, by dint of his knowledge and inventive genius, has subdued the forces of nature, the latter avenge themselves upon him by subjecting him, insofar as he employs them, to a veritable despotism independent of all social organization."20

Attempts to justify strong authority on the basis of supposedly necessary conditions of technical practice have an ancient history. A pivotal theme in the Republic is Plato's quest to borrow the authority of techne and employ it by analogy to buttress his argument in favor of authority in the state. Among the illustrations he chooses, like Engels, is that of a ship on the high seas. Because large sailing vessels by their very nature need to be steered with a firm hand, sailors must yield to their captain's commands; no reasonable person believes that ships can be run democratically. Plato goes on to suggest that governing a state is rather like being captain of a ship or like practicing medicine as a physician. Much the same conditions that require central rule and decisive action in organized technical activity also create this need in government.

In Engels's argument, and arguments like it, the justification for authority is no longer made by Plato's classic analogy, but rather directly with reference to technology itself. If the basic case is as compelling as Engels believed it to be, one would expect that, as a society adopted increasingly complicated technical systems as its material basis, the prospects for authoritarian ways of life would be greatly enhanced. Central control by knowledgeable people acting at the top of a rigid social hierarchy would seem increasingly prudent. In this respect, his position in "On Authority" appears to stand at variance with Karl Marx's position in Volume One of Capital. Marx tries to show that increasing mechanization will render obsolete the hierarchical division of labor and the relationships of subordination that, in his view, were necessary during the early stages of modern manufacturing. "Modern Industry," he writes, "... sweeps away by technical means the manufacturing division of labor, under which each man is bound hand and foot for life to a single detail operation. At the same time, the capitalistic form of that industry reproduces this same division of labour in a still more monstrous shape; in the factory proper, by converting the workman into a living appendage of the machine. ..."21 In Marx's view, the conditions that will eventually dissolve the capitalist division of labor and facilitate proletarian revolution are conditions latent in industrial technology itself. The differences between Marx's position in Capital and Engels's in his essay raise an important question for socialism: What, after all, does modern technology make possible or necessary in political life? The theoretical tension we see here mirrors many troubles in the practice of freedom and authority that have muddied the tracks of socialist revolution.

Arguments to the effect that technologies are in some sense inherently political have been advanced in a wide variety of contexts, far too many to summarize here. In my reading of such notions, however, there are two basic ways of stating the case. One version claims that the adoption of a given technical system actually requires the creation and maintenance of a particular set of social conditions as the operating environment of that system. Engels's position is of this kind. A similar view is offered by a contemporary writer who holds that "if you accept nuclear power plants, you also accept a techno-scientific-industrial-military elite. Without these people in charge, you could not have nuclear power."22 In this conception, some kinds of technology require their social environments to be structured in a particular way in much the same sense that an automobile requires wheels in order to run. The thing could not exist as an effective operating entity unless certain social as well as material conditions were met. The meaning of "required" here is that of practical (rather than logical) necessity. Thus, Plato thought it a practical necessity that a ship at sea have one captain and an unquestioningly obedient crew.

A second, somewhat weaker, version of the argument holds that a given kind of technology is strongly compatible with, but does not strictly require, social and political relationships of a particular stripe. Many advocates of solar energy now hold that technologies of that variety are more compatible with a democratic, egalitarian society than energy systems based on coal, oil, and nuclear power; at the same time they do not maintain that anything about solar energy requires democracy. Their case is, briefly, that solar energy is decentralizing in both a technical and political sense: technically speaking, it is vastly more reasonable to build solar systems in a disaggregated, widely distributed manner than in large-scale centralized plants; politically speaking, solar energy accommodates the attempts of individuals and local communities to manage their affairs effectively because they are dealing with systems that are more accessible, comprehensible, and controllable than huge centralized sources. In this view, solar energy is desirable not only for its economic and environmental benefits, but also for the salutary institutions it is likely to permit in other areas of public life.23

Within both versions of the argument there is a further distinction to be made between conditions that are internal to the workings of a given technical system and those that are external to it. Engels's thesis concerns internal social relations said to be required within cotton factories and railways, for example; what such relationships mean for the condition of society at large is for him a separate question. In contrast, the solar advocate's belief that solar technologies are compatible with democracy pertains to the way they complement aspects of society removed from the organization of those technologies as such.

There are, then, several different directions that arguments of this kind can follow. Are the social conditions predicated said to be required by, or strongly compatible with, the workings of a given technical system? Are those conditions internal to that system or external to it (or both)? Although writings that address such questions are often unclear about what is being asserted, arguments in this general category do have an important presence in modern political discourse. They enter into many attempts to explain how changes in social life take place in the wake of technological innovation. More importantly, they are often used to buttress attempts to justify or criticize proposed courses of action involving new technology. By offering distinctly political reasons for or against the adoption of a particular technology, arguments of this kind stand apart from more commonly employed, more easily quantifiable claims about economic costs and benefits, environmental impacts, and possible risks to public health and safety that technical systems may involve. The issue here does not concern how many jobs will be created, how much income generated, how many pollutants added, or how many cancers produced. Rather, the issue has to do with ways in which choices about technology have important consequences for the form and quality of human associations.

If we examine social patterns that comprise the environments of technical systems, we find certain devices and systems almost invariably linked to specific ways of organizing power and authority. The important question is: Does this state of affairs derive from an unavoidable social response to intractable properties in the things themselves, or is it instead a pattern imposed independently by a governing body, ruling class, or some other social or cultural institution to further its own purposes?

Taking the most obvious example, the atom bomb is an inherently political artifact. As long as it exists at all, its lethal properties demand that it be controlled by a centralized, rigidly hierarchical chain of command closed to all influences that might make its workings unpredictable. The internal social system of the bomb must be authoritarian; there is no other way. The state of affairs stands as a practical necessity independent of any larger political system in which the bomb is embedded, independent of the kind of regime or character of its rulers. Indeed, democratic states must try to find ways to ensure that the social structures and mentality that characterize the management of nuclear weapons do not "spin off" or "spill over" into the polity as a whole.

The bomb is, of course, a special case. The reasons very rigid relationships of authority are necessary in its immediate presence should be clear to anyone. If, however, we look for other instances in which particular varieties of technology are widely perceived to need the maintenance of a special pattern of power and authority, modern technical history contains a wealth of examples.

In The Visible Hand, a monumental study of modern business enterprise, Alfred D. Chandler presents impressive documentation to defend the hypothesis that the construction and day-to-day operation of many systems of production, transportation, and communication in the nineteenth and twentieth centuries require the development of a particular social form - a large-scale centralized, hierarchical organization administered by highly skilled managers. Typical of Chandler's reasoning is his analysis of the growth of the railroads:

"Technology made possible fast, all-weather transportation; but safe, regular, reliable movement of goods and passengers, as well as the continuing maintenance and repair of locomotives, rolling stock, and track, roadbed, stations, roundhouses, and other equipment, required the creation of a sizable administrative organization. It meant the employment of a set of managers to supervise these functional activities over an extensive geographical area; and the appointment of an administrative command of middle and top executives to monitor, evaluate, and coordinate the work of managers responsible for the day-to-day operations."24

Throughout his book Chandler points to ways in which technologies used in the production and distribution of electricity, chemicals, and a wide range of industrial goods "demanded" or "required" this form of human association. "Hence, the operational requirements of railroads demanded the creation of the first administrative hierarchies in American business."25

Were there other conceivable ways of organizing these aggregates of people and apparatus? Chandler shows that a previously dominant social form, the small traditional family firm, simply could not handle the task in most cases. Although he does not speculate further, it is clear that he believes there is, to be realistic, very little latitude in the forms of power and authority appropriate within modern sociotechnical systems. The properties of many modern technologies - oil pipelines and refineries, for example - are such that overwhelmingly impressive economies of scale and speed are possible. If such systems are to work effectively, efficiently, quickly, and safely, certain requirements of internal social organization have to be fulfilled; the material possibilities that modern technologies make available could not be exploited otherwise. Chandler acknowledges that as one compares sociotechnical institutions of different nations, one sees "ways in which cultural attitudes, values, ideologies, political systems, and social structure affect these imperatives."26 But the weight of argument and empirical evidence in The Visible Hand suggests that any significant departure from the basic pattern would be, at best, highly unlikely.

It may be that other conceivable arrangements of power and authority, for example, those of decentralized, democratic worker self-management, could prove capable of administering factories, refineries, communications systems, and railroads as well as or better than the organizations Chandler describes. Evidence from automobile assembly teams in Sweden and worker-managed plants in Yugoslavia and other countries is often presented to salvage these possibilities. I shall not be able to settle controversies over this matter here, but merely point to what I consider to be their bone of contention. The available evidence tends to show that many large, sophisticated technological systems are in fact highly compatible with centralized, hierarchical managerial control. The interesting question, however, has to do with whether or not this pattern is in any sense a requirement of such systems, a question that is not solely an empirical one. The matter ultimately rests on our judgments about what steps, if any, are practically necessary in the workings of particular kinds of technology and what, if anything, such measures require of the structure of human associations. Was Plato right in saying that a ship at sea needs steering by a decisive hand and that this could only be accomplished by a single captain and an obedient crew? Is Chandler correct in saying that the properties of large-scale systems require centralized, hierarchical managerial control?

To answer such questions, we would have to examine in some detail the moral claims of practical necessity (including those advocated in the doctrines of economics) and weigh them against moral claims of other sorts, for example, the notion that it is good for sailors to participate in the command of a ship or that workers have a right to be involved in making and administering decisions in a factory. It is characteristic of societies based on large, complex technological systems, however, that moral reasons other than those of practical necessity appear increasingly obsolete, "idealistic," and irrelevant. Whatever claims one may wish to make on behalf of liberty, justice, or equality can be immediately neutralized when confronted with arguments to the effect: "Fine, but that's no way to run a railroad" (or steel mill, or airline, or communications system, and so on). Here we encounter an important quality in modern political discourse and in the way people commonly think about what measures are justified in response to the possibilities technologies make available. In many instances, to say that some technologies are inherently political is to say that certain widely accepted reasons of practical necessity - especially the need to maintain crucial technological systems as smoothly working entities - have tended to eclipse other sorts of moral and political reasoning.

One attempt to salvage the autonomy of politics from the bind of practical necessity involves the notion that conditions of human association found in the internal workings of technological systems can easily be kept separate from the polity as a whole. Americans have long rested content in the belief that arrangements of power and authority inside industrial corporations, public utilities, and the like have little bearing on public institutions, practices, and ideas at large. That "democracy stops at the factory gates" was taken as a fact of life that had nothing to do with the practice of political freedom. But can the internal politics of technology and the politics of the whole community be so easily separated? A recent study of American business leaders, contemporary exemplars of Chandler's "visible hand of management," found them remarkably impatient with such democratic scruples as "one man, one vote." If democracy doesn't work for the firm, the most critical institution in all of society, American executives ask, how well can it be expected to work for the government of a nation - particularly when that government attempts to interfere with the achievements of the firm? The authors of the report observe that patterns of authority that work effectively in the corporation become for businessmen "the desirable model against which to compare political and economic relationships in the rest of society."27 While such findings are far from conclusive, they do reflect a sentiment increasingly common in the land: what dilemmas like the energy crisis require is not a redistribution of wealth or broader public participation but, rather, stronger, centralized public management - President Carter's proposal for an Energy Mobilization Board and the like.

An especially vivid case in which the operational requirements of a technical system might influence the quality of public life is now at issue in debates about the risks of nuclear power. As the supply of uranium for nuclear reactors runs out, a proposed alternative fuel is the plutonium generated as a by-product in reactor cores. Well-known objections to plutonium recycling focus on its unacceptable economic costs, its risks of environmental contamination, and its dangers in regard to the international proliferation of nuclear weapons. Beyond these concerns, however, stands another less widely appreciated set of hazards - those that involve the sacrifice of civil liberties. The widespread use of plutonium as a fuel increases the chance that this toxic substance might be stolen by terrorists, organized crime, or other persons. This raises the prospect, and not a trivial one, that extraordinary measures would have to be taken to safeguard plutonium from theft and to recover it if ever the substance were stolen. Workers in the nuclear industry as well as ordinary citizens outside could well become subject to background checks, covert surveillance, wiretapping, informers, and even emergency measures under martial law - all justified by the need to safeguard plutonium.

Russell W. Ayres's study of the legal ramifications of plutonium recycling concludes: "With the passage of time and the increase in the quantity of plutonium in existence will come pressure to eliminate the traditional checks the courts and legislatures place on the activities of the executive and to develop a powerful central authority better able to enforce strict safeguards." He avers that "once a quantity of plutonium had been stolen, the case for literally turning the country upside down to get it back would be overwhelming."28 Ayres anticipates and worries about the kinds of thinking that, I have argued, characterize inherently political technologies. It is still true that, in a world in which human beings make and maintain artificial systems, nothing is "required" in an absolute sense. Nevertheless, once a course of action is underway, once artifacts like nuclear power plants have been built and put in operation, the kinds of reasoning that justify the adaptation of social life to technical requirements pop up as spontaneously as flowers in the spring. In Ayres's words, "Once recycling begins and the risks of plutonium theft become real rather than hypothetical, the case for governmental infringement of protected rights will seem compelling."28 After a certain point, those who cannot accept the hard requirements and imperatives will be dismissed as dreamers and fools.

* * *

The two varieties of interpretation I have outlined indicate how artifacts can have political qualities. In the first instance we noticed ways in which specific features in the design or arrangement of a device or system could provide a convenient means of establishing patterns of power and authority in a given setting. Technologies of this kind have a range of flexibility in the dimensions of their material form. It is precisely because they are flexible that their consequences for society must be understood with reference to the social actors able to influence which designs and arrangements are chosen. In the second instance we examined ways in which the intractable properties of certain kinds of technology are strongly, perhaps unavoidably, linked to particular institutionalized patterns of power and authority. Here, the initial choice about whether or not to adopt something is decisive in regard to its consequences. There are no alternative physical designs or arrangements that would make a significant difference; there are, furthermore, no genuine possibilities for creative intervention by different social systems - capitalist or socialist - that could change the intractability of the entity or significantly alter the quality of its political effects.

To know which variety of interpretation is applicable in a given case is often what is at stake in disputes, some of them passionate ones, about the meaning of technology for how we live. I have argued a "both/and" position here, for it seems to me that both kinds of understanding are applicable in different circumstances. Indeed, it can happen that within a particular complex of technology - a system of communication or transportation, for example - some aspects may be flexible in their possibilities for society, while other aspects may be (for better or worse) completely intractable. The two varieties of interpretation I have examined here can overlap and intersect at many points.

These are, of course, issues on which people can disagree. Thus, some proponents of energy from renewable resources now believe they have at last discovered a set of intrinsically democratic, egalitarian, communitarian technologies. In my best estimation, however, the social consequences of building renewable energy systems will surely depend on the specific configurations of both hardware and the social institutions created to bring that energy to us. It may be that we will find ways to turn this silk purse into a sow's ear. By comparison, advocates of the further development of nuclear power seem to believe that they are working on a rather flexible technology whose adverse social effects can be fixed by changing the design parameters of reactors and nuclear waste disposal systems. For reasons indicated above, I believe them to be dead wrong in that faith. Yes, we may be able to manage some of the "risks" to public health and safety that nuclear power brings. But as society adapts to the more dangerous and apparently indelible features of nuclear power, what will be the long-range toll in human freedom?

My belief that we ought to attend more closely to technical objects themselves is not to say that we can ignore the contexts in which those objects are situated. A ship at sea may well require, as Plato and Engels insisted, a single captain and obedient crew. But a ship out of service, parked at the dock, needs only a caretaker. To understand which technologies and which contexts are important to us, and why, is an enterprise that must involve both the study of specific technical systems and their history as well as a thorough grasp of the concepts and controversies of political theory. In our times people are often willing to make drastic changes in the way they live to accord with technological innovation at the same time they would resist similar kinds of changes justified on political grounds. If for no other reason than that, it is important for us to achieve a clearer view of these matters than has been our habit so far.

References

1 I would like to thank Merritt Roe Smith, Leo Marx, David Noble, Charles Weiner, Sherry Turkle, James Miller, Loren Graham, Gail Stuart, Dick Sclove, and Stephen Graubard for their comments and criticisms on earlier drafts of this essay. My thanks also to Doris Morrison of the Agriculture Library of the University of California, Berkeley, for her bibliographical help.
2 Lewis Mumford, "Authoritarian and Democratic Technics," Technology and Culture, 5 (1964): 1-8.
3 Denis Hayes, Rays of Hope: The Transition to a Post-Petroleum World (New York: W. W. Norton, 1977), pp. 71, 159.
4 David Lilienthal, T.V.A.: Democracy on the March (New York: Harper and Brothers, 1944), pp. 72-83.
5 Daniel J. Boorstin, The Republic of Technology (New York: Harper & Row, 1978), p. 7.
6 Langdon Winner, Autonomous Technology: Technics-out-of-Control as a Theme in Political Thought (Cambridge, Mass.: M.I.T. Press, 1977).
7 The meaning of "technology" I employ in this essay does not encompass some of the broader definitions of that concept found in contemporary literature, for example, the notion of "technique" in the writings of Jacques Ellul. My purposes here are more limited. For a discussion of the difficulties that arise in attempts to define "technology," see Ref. 6, pp. 8-12.
8 Robert A. Caro, The Power Broker: Robert Moses and the Fall of New York (New York: Random House, 1974), pp. 318, 481, 514, 546, 951-958.
9 Ibid., p. 952.
10 Robert Ozanne, A Century of Labor-Management Relations at McCormick and International Harvester (Madison, Wis.: University of Wisconsin Press, 1967), p. 20.
11 The early history of the tomato harvester is told in Wayne D. Rasmussen, "Advances in American Agriculture: The Mechanical Tomato Harvester as a Case Study," Technology and Culture, 9 (1968): 531-543.
12 Andrew Schmitz and David Seckler, "Mechanized Agriculture and Social Welfare: The Case of the Tomato Harvester," American Journal of Agricultural Economics, 52 (1970): 569-577.
13 William H. Friedland and Amy Barton, "Tomato Technology," Society, 13:6 (September/October 1976). See also William H. Friedland, Social Sleepwalkers: Scientific and Technological Research in California Agriculture, University of California, Davis, Department of Applied Behavioral Sciences, Research Monograph No. 13, 1974.
14 University of California Clip Sheet, 54:36, May 1, 1979.
15 Friedland and Barton, "Tomato Technology."
16 A history and critical analysis of agricultural research in the land-grant colleges is given in James Hightower, Hard Tomatoes, Hard Times (Cambridge, Mass.: Schenkman, 1978).
17 David Noble, "Social Choice in Machine Design: The Case of Automatically Controlled Machine Tools," in Case Studies in the Labor Process (New York: Monthly Review Press, forthcoming).
18 Friedrich Engels, "On Authority," in The Marx-Engels Reader, 2nd ed., Robert Tucker (ed.) (New York: W. W. Norton, 1978), p. 731.
19 Ibid.
20 Ibid., pp. 732, 731.
21 Karl Marx, Capital, vol. 1, 3rd ed., Samuel Moore and Edward Aveling (trans.) (New York: The Modern Library, 1906), p. 530.
22 Jerry Mander, Four Arguments for the Elimination of Television (New York: William Morrow, 1978), p. 44.
23 See, for example, Robert Argue, Barbara Emanuel, and Stephen Graham, The Sun Builders: A People's Guide to Solar, Wind and Wood Energy in Canada (Toronto: Renewable Energy in Canada, 1978). "We think decentralization is an implicit component of renewable energy; this implies the decentralization of energy systems, communities and of power. Renewable energy doesn't require mammoth generation sources or disruptive transmission corridors. Our cities and towns, which have been dependent on centralized energy supplies, may be able to achieve some degree of autonomy, thereby controlling and administering their own energy needs" (p. 16).
24 Alfred D. Chandler, Jr., The Visible Hand: The Managerial Revolution in American Business (Cambridge, Mass.: Belknap, Harvard University Press, 1977), p. 244.
25 Ibid.
26 Ibid., p. 500.
27 Leonard Silk and David Vogel, Ethics and Profits: The Crisis of Confidence in American Business (New York: Simon and Schuster, 1976), p. 191.
28 Russell W. Ayres, "Policing Plutonium: The Civil Liberties Fallout," Harvard Civil Rights-Civil Liberties Law Review, 10 (1975): 443, 413-4, 374.


The "Industrial Revolution" in the Home: Household Technology and Social Change in the 20th Century Author(s): Ruth Schwartz Cowan Source: Technology and Culture, Vol. 17, No. 1 (Jan., 1976), pp. 1-23 Published by: The Johns Hopkins University Press and the Society for the History of Technology Stable URL: http://www.jstor.org/stable/3103251 . Accessed: 09/12/2013 15:53 Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at . http://www.jstor.org/page/info/about/policies/terms.jsp


The "IndustrialRevolution"in the Home: Household Technologyand Social Change in the 20th Century RUTH

SCHWARTZ

COWAN

When we think about the interaction between technology and society, we tend to think in fairly grandiose terms: massive computers invading the workplace, railroad tracks cutting through vast wildernesses, armies of women and children toiling in the mills. These grand visions have blinded us to an important and rather peculiar technological revolution which has been going on right under our noses: the technological revolution in the home. This revolution has transformed the conduct of our daily lives, but in somewhat unexpected ways. The industrialization of the home was a process very different from the industrialization of other means of production, and the impact of that process was neither what we have been led to believe it was nor what students of the other industrial revolutions would have been led to predict.

* * *

Some years ago sociologists of the functionalist school formulated an explanation of the impact of industrial technology on the modern family. Although that explanation was not empirically verified, it has become almost universally accepted.1 Despite some differences in emphasis, the basic tenets of the traditional interpretation can be roughly summarized as follows: Before industrialization the family was the basic social unit. Most families were rural, large, and self-sustaining; they produced and processed almost everything that was needed for their own support and for trading in the marketplace, while at the same time perform-

DR. COWAN, associate professor of history at the State University of New York at Stony Brook, is currently engaged in further research on the development of household technology and its impact upon women. This paper is based upon a presentation by Dr. Cowan at SHOT's 1973 annual meeting in San Francisco.

1For some classic statements of the standard view, see W. F. Ogburn and M. F. Nimkoff, Technology and the Changing Family (Cambridge, Mass., 1955); Robert F. Winch, The Modern Family (New York, 1952); and William J. Goode, The Family (Englewood Cliffs, N.J., 1964).


ing a host of other functions ranging from mutual protection to entertainment. In these preindustrial families women (adult women, that is) had a lot to do, and their time was almost entirely absorbed by household tasks. Under industrialization the family is much less important. The household is no longer the focus of production; production for the marketplace and production for sustenance have been removed to other locations. Families are smaller and they are urban rather than rural. The number of social functions they perform is much reduced, until almost all that remains is consumption, socialization of small children, and tension management. As their functions diminished, families became atomized; the social bonds that had held them together were loosened. In these postindustrial families women have very little to do, and the tasks with which they fill their time have lost the social utility that they once possessed. Modern women are in trouble, the analysis goes, because modern families are in trouble; and modern families are in trouble because industrial technology has either eliminated or eased almost all their former functions, but modern ideologies have not kept pace with the change. The results of this time lag are several: some women suffer from role anxiety, others land in the divorce courts, some enter the labor market, and others take to burning their brassieres and demanding liberation. This sociological analysis is a cultural artifact of vast importance. Many Americans believe that it is true and act upon that belief in various ways: some hope to reestablish family solidarity by relearning lost productive crafts-baking bread, tending a vegetable garden -others dismiss the women's liberation movement as "simply a bunch of affluent housewives who have nothing better to do with their time." As disparate as they may seem, these reactions have a common ideological source-the standard sociological analysis of the impact of technological change on family life. As a theory this functionalist approach has much to recommend it, but at present we have very little evidence to back it up. Family history is an infant discipline, and what evidence it has produced in recent years does not lend credence to the standard view.2 Phillippe Aries has shown, for example, that in France the ideal of the small nuclear family predates industrialization by more than a century.3 Historical demographers working on data from English and French families have been surprised to find that most families were quite small and 2This point is made by Peter Laslett in "The Comparative History of Household and Family," in The AmericanFamily in Social HistoricalPerspective,ed. Michael Gordon (New York, 1973), pp. 28-29. 3Phillippe Aries, Centuries of Childhood:A Social History of Family Life (New York, 1960).


that several generations did not ordinarily reside together; the extended family, which is supposed to have been the rule in preindustrial societies, did not occur in colonial New England either.4 Rural English families routinely employed domestic servants, and even very small English villages had their butchers and bakers and candlestick makers; all these persons must have eased some of the chores that would otherwise have been the housewife's burden.5 Preindustrial housewives no doubt had much with which to occupy their time, but we may have reason to wonder whether there was quite as much pressure on them as sociological orthodoxy has led us to suppose. The large rural family that was sufficient unto itself back there on the prairies may have been limited to the prairies-or it may never have existed at all (except, that is, in the reveries of sociologists). Even if all the empirical evidence were to mesh with the functionalist theory, the theory would still have problems, because its logical structure is rather weak. Comparing the average farm family in 1750 (assuming that you knew what that family was like) with the average urban family in 1950 in order to discover the significant social changes that had occurred is an exercise rather like comparing apples with oranges; the differences between the fruits may have nothing to do with the differences in their evolution. Transferring the analogy to the case at hand, what we really need to know is the difference, say, between an urban laboring family of 1750 and an urban laboring family 100 and then 200 years later, or the difference between the rural nonfarm middle classes in all three centuries, or the difference between the urban rich yesterday and today. Surely in each of these cases the analyses will look very different from what we have been led to expect. As a guess we might find that for the urban laboring families the changes have been precisely the opposite of what the model predicted; that is, that their family structure is much firmer today than it was in centuries past. Similarly, for the rural nonfarm middle class the results might be equally surprising; we might find that married women of that class rarely did any housework at all in 1890 because they had farm girls as servants, whereas in 1950 they bore the full brunt of the work themselves. I could go on, but the point is, I hope, clear: in order to verify or falsify the functionalist theory, it will be necessary to know more than we presently do about the impact of industrialization on families of similar classes and geographical locations. *

*

*

4See Laslett, pp. 20-24; and Philip J. Greven, "Family Structure in Seventeenth Century Andover, Massachusetts," William and Mary Quarterly 23 (1966): 234-56. 5Peter Laslett, The World We Have Lost (New York, 1965), passim.


With this problem in mind I have, for the purposes of this initial study, deliberately limited myself to one kind of technological change affecting one aspect of family life in only one of the many social classes of families that might have been considered. What happened, I asked, to middle-class American women when the implements with which they did their everyday household work changed? Did the technological change in household appliances have any effect upon the structure of American households, or upon the ideologies that governed the behavior of American women, or upon the functions that families needed to perform? Middle-class American women were defined as actual or potential readers of the better-quality women's magazines, such as the Ladies' Home Journal, American Home, Parents' Magazine, Good Housekeeping, and McCall's.6 Nonfictional material (articles and advertisements) in those magazines was used as a partial indicator of some of the technological and social changes that were occurring. The Ladies' Home Journal has been in continuous publication since 1886. A casual survey of the nonfiction in the Journal yields the immediate impression that that decade between the end of World War I and the beginning of the depression witnessed the most drastic changes in patterns of household work. Statistical data bear out this impression. Before 1918, for example, illustrations of homes lit by gaslight could still be found in the Journal; by 1928 gaslight had disappeared. In 1917 only one-quarter (24.3 percent) of the dwellings in the United States had been electrified, but by 1920 this figure had doubled (47.4 percent-for rural nonfarm and urban dwellings), and by 1930 it had risen to four-fifths.7 If electrification had meant simply the change from gas or oil lamps to electric lights, the changes in the housewife's routines might not have been very great (except for eliminating the chore of cleaning and filling oil lamps);

6For purposes of historical inquiry, this definition of middle-class status corresponds to a sociological reality, although it is not, admittedly, very rigorous. Our contemporary experience confirms that there are class differences reflected in magazines, and this situation seems to have existed in the past as well. On this issue see Robert S. Lynd and Helen M. Lynd, Middletown: A Study in Contemporary American Culture (New York, 1929), pp. 240-44, where the marked difference in magazines subscribed to by the business-class wives as opposed to the working-class wives is discussed; Salme Steinberg, "Reformer in the Marketplace: E. W. Bok and The Ladies Home Journal" (Ph.D. diss., Johns Hopkins University, 1973), where the conscious attempt of the publisher to attract a middle-class audience is discussed; and Lee Rainwater et al., Workingman's Wife (New York, 1959), which was commissioned by the publisher of working-class women's magazines in an attempt to understand the attitudinal differences between working-class and middle-class women.
7Historical Statistics of the United States, Colonial Times to 1957 (Washington, D.C., 1960), p. 510.


but changes in lighting were the least of the changes that electrification implied. Small electric appliances followed quickly on the heels of the electric light, and some of those augured much more profound changes in the housewife's routine. Ironing, for example, had traditionally been one of the most dreadful household chores, especially in warm weather when the kitchen stove had to be kept hot for the better part of the day; irons were heavy and they had to be returned to the stove frequently to be reheated. Electric irons eased a good part of this burden.8 They were relatively inexpensive and very quickly replaced their predecessors; advertisements for electric irons first began to appear in the ladies' magazines after the war, and by the end of the decade the old flatiron had disappeared; by 1929 a survey of 100 Ford employees revealed that ninety-eight of them had the new electric irons in their homes.9 Data on the diffusion of electric washing machines are somewhat harder to come by; but it is clear from the advertisements in the magazines, particularly advertisements for laundry soap, that by the middle of the 1920s those machines could be found in a significant number of homes. The washing machine is depicted just about as frequently as the laundry tub by the middle of the 1920s; in 1929, forty-nine out of those 100 Ford workers had the machines in their homes. The washing machines did not drastically reduce the time that had to be spent on household laundry, as they did not go through their cycles automatically and did not spin dry; the housewife had to stand guard, stopping and starting the machine at appropriate times, adding soap, sometimes attaching the drain pipes, and putting the clothes through the wringer manually. The machines did, however, reduce a good part of the drudgery that once had been associated with washday, and this was a matter of no small consequence.10 Soap powders appeared on the market in the early 1920s, thus eliminating the need to scrape and boil bars of laundry soap.11 By the end of the 8The gas iron, which was available to women whose homes were supplied with natural gas, was an earlier improvement on the old-fashioned flatiron, but this kind of iron is so rarely mentioned in the sources that I used for this survey that I am unable to determine the extent of its diffusion. 9Hazel Kyrk, Economic Problemsof the Family (New York, 1933), p. 368, reporting a study in Monthly Labor Review 30 (1930): 1209-52. 10Although this point seems intuitively obvious, there is some evidence that it may not be true. Studies of energy expenditure during housework have indicated that by far the greatest effort is expended in hauling and lifting the wet wash, tasks which were not eliminated by the introduction of washing machines. In addition, if the introduction of the machines served to increase the total amount of wash that was done by the housewife, this would tend to cancel the energy-saving effects of the machines themselves. "Rinso was the first granulated soap; it came on the market in 1918. Lux Flakes had been available since 1906; however it was not intended to be a general laundry product


1920s Blue Monday must have been considerably less blue for some housewives-and probably considerably less "Monday," for with an electric iron, a washing machine, and a hot water heater, there was no reason to limit the washing to just one day of the week. Like the routines of washing the laundry, the routines of personal hygiene must have been transformed for many households during the 1920s-the years of the bathroom mania.12 More and more bathrooms were built in older homes, and new homes began to include them as a matter of course. Before the war most bathroom fixtures (tubs, sinks, and toilets) were made out of porcelain by hand; each bathroom was custom-made for the house in which it was installed. After the war industrialization descended upon the bathroom industry; cast iron enamelware went into mass production and fittings were standardized. In 1921 the dollar value of the production of enameled sanitary fixtures was $2.4 million, the same as it had been in 1915. By 1923, just two years later, that figure had doubled to $4.8 million; it rose again, to $5.1 million, in 1925.13 The first recessed, double-shell cast iron enameled bathtub was put on the market in the early 1920s. A decade later the standard American bathroom had achieved its standard American form: the recessed tub, plus tiled floors and walls, brass plumbing, a single-unit toilet, an enameled sink, and a medicine chest, all set into a small room which was very often 5 feet square.14 The bathroom evolved more quickly than any other room of the house; its standardized form was accomplished in just over a decade. Along with bathrooms came modernized systems for heating hot water: 61 percent of the homes in Zanesville, Ohio, had indoor plumbing with centrally heated water by 1926, and 83 percent of the homes valued over $2,000 in Muncie, Indiana, had hot and cold running but rather one for laundering delicate fabrics. "Lever Brothers," Fortune 26 (November 1940): 95. 12I take this account, and the term, from Lynd and Lynd, p. 97. Obviously, there were many American homes that had bathrooms before the 1920s, particularly urban row houses, and I have found no way of determining whether the increases of the 1920s were more marked than in previous decades. The rural situation was quite different from the urban; the President's Conference on Home Building and Home Ownership reported that in the late 1920s, 71 percent of the urban families surveyed had bathrooms, but only 33 percent of the rural families did (John M. Gries and James Ford, eds., Homemaking,Home Furnishing and Information Services, President's Conference on Home Building and Home Ownership, vol. 10 [Washington, D.C., 1932], p. 13). 13The data above come from Siegfried Giedion, Mechanization Takes Command(New York, 1948), pp. 685-703. "4For a description of the standard bathroom see Helen Sprackling, "The Modern Bathroom," Parents' Magazine 8 (February 1933): 25.


water by 1935.15 These figures may not be typical of small American cities (or even large American cities) at those times, but they do jibe with the impression that one gets from the magazines: after 1918 references to hot water heated on the kitchen range, either for laundering or for bathing, become increasingly difficult to find. Similarly, during the 1920s many homes were outfitted with central heating; in Muncie most of the homes of the business class had basement heating in 1924; by 1935 Federal Emergency Relief Administration data for the city indicated that only 22.4 percent of the dwellings valued over $2,000 were still heated by a kitchen stove.16 What all these changes meant in terms of new habits for the average housewife is somewhat hard to calculate; changes there must have been, but it is difficult to know whether those changes produced an overall saving of labor and/or time. Some chores were eliminated-hauling water, heating water on the stove, maintaining the kitchen fire-but other chores were added-most notably the chore of keeping yet another room scrupulously clean. It is not, however, difficult to be certain about the changing habits that were associated with the new American kitchen-a kitchen from which the coal stove had disappeared. In Muncie in 1924, cooking with gas was done in two out of three homes; in 1935 only 5 percent of the homes valued over $2,000 still had coal or wood stoves for cooking.17 After 1918 advertisements for coal and wood stoves disappeared from the Ladies' Home Journal; stove manufacturers purveyed only their gas, oil, or electric models. Articles giving advice to homemakers on how to deal with the trials and tribulations of starting, stoking, and maintaining a coal or a wood fire also disappeared. Thus it seems a safe assumption that most middle-class homes had switched to the new method of cooking by the time the depression began. The change in routine that was predicated on the change from coal or wood to gas or oil was profound; aside from the elimination of such chores as loading the fuel and removing the ashes, the new stoves were much easier to light, maintain, and regulate (even when they did not have thermostats, as the earliest models did not).18 Kitchens were, in addition, much easier to clean when they did not have coal dust regularly tracked through them; one writer in the Ladies'

15Zanesville, Ohio and Thirty-six Other American Cities (New York, 1927), p. 65. Also see Robert S. Lynd and Helen M. Lynd, Middletown in Transition (New York, 1936), p. 537. Middletown is Muncie, Indiana.
16Lynd and Lynd, Middletown, p. 96, and Middletown in Transition, p. 539.
17Lynd and Lynd, Middletown, p. 98, and Middletown in Transition, p. 562.
18On the advantages of the new stoves, see Boston Cooking School Cookbook (Boston, 1916), pp. 15-20; and Russell Lynes, The Domesticated Americans (New York, 1957), pp. 119-20.


HomeJournal estimated that kitchen cleaning was reduced by one-half when coal stoves were eliminated.19 Along with new stoves came new foodstuffs and new dietary habits. Canned foods had been on the market since the middle of the 19th century, but they did not become an appreciable part of the standard middle-class diet until the 1920s-if the recipes given in cookbooks and in women's magazines are a reliable guide. By 1918 the variety of foods available in cans had been considerably expanded from the peas, corn, and succotash of the 19th century; an American housewife with sufficient means could have purchased almost any fruit or vegetable and quite a surprising array of ready-made meals in a can -from Heinz's spaghetti in meat sauce to Purity Cross's lobster a la Newburg. By the middle of the 1920s home canning was becoming a lost art. Canning recipes were relegated to the back pages of the women's magazines; the business-class wives of Muncie reported that, while their mothers had once spent the better part of the summer and fall canning, they themselves rarely put up anything, except an occasional jelly or batch of tomatoes.20 In part this was also due to changes in the technology of marketing food; increased use of refrigerated railroad cars during this period meant that fresh fruits and vegetables were in the markets all year round at reasonable prices.21 By the early 1920s convenience foods were also appearing on American tables: cold breakfast cereals, pancake mixes, bouillon cubes, and packaged desserts could be found. Wartime shortages accustomed Americans to eating much lighter meals than they had previously been wont to do; and as fewer family members were taking all their meals at home (businessmen started to eat lunch in restaurants downtown, and factories and schools began installing cafeterias), there was simply less cooking to be done, and what there was of it was easier to do.22 *

*

*

Many of the changes just described-from hand power to electric power, from coal and wood to gas and oil as fuels for cooking, from one-room heating to central heating, from pumping water to running water-are enormous technological changes. Changes of a similar dimension, either in the fundamental technology of an industry, in the diffusion of that technology, or in the routines of workers, would have long since been labeled an "industrial revolution." The change from the laundry tub to the washing machine is no less profound than '9"How to Save Coal While Cooking," Ladies' Home Journal 25 (January 1908): 44. 20Lynd and Lynd, Middletown, p. 156. 21Ibid.; see also "Safeway Stores," Fortune 26 (October 1940): 60. 22Lynd and Lynd, Middletown, pp. 134-35 and 153-54.


the change from the hand loom to the power loom; the change from pumping water to turning on a water faucet is no less destructive of traditional habits than the change from manual to electric calculating. It seems odd to speak of an "industrial revolution" connected with housework, odd because we are talking about the technology of such homely things, and odd because we are not accustomed to thinking of housewives as a labor force or of housework as an economic commodity-but despite this oddity, I think the term is altogether appropriate. In this case other questions come immediately to mind, questions that we do not hesitate to ask, say, about textile workers in Britain in the early 19th century, but we have never thought to ask about housewives in America in the 20th century. What happened to this particular work force when the technology of its work was revolutionized? Did structural changes occur? Were new jobs created for which new skills were required? Can we discern new ideologies that influenced the behavior of the workers? The answer to all of these questions, surprisingly enough, seems to be yes. There were marked structural changes in the work force, changes that increased the work load and the job description of the workers that remained. New jobs were created for which new skills were required; these jobs were not physically burdensome, but they may have taken up as much time as the jobs they had replaced. New ideologies were also created, ideologies which reinforced new behavioral patterns, patterns that we might not have been led to expect if we had followed the sociologists' model to the letter. Middle-class housewives, the women who must have first felt the impact of the new household technology, were not flocking into the divorce courts or the labor market or the forums of political protest in the years immediately after the revolution in their work. What they were doing was sterilizing baby bottles, shepherding their children to dancing classes and music lessons, planning nutritious meals, shopping for new clothes, studying child psychology, and hand stitching colorcoordinated curtains-all of which chores (and others like them) the standard sociological model has apparently not provided for. The significant change in the structure of the household labor force was the disappearance of paid and unpaid servants (unmarried daughters, maiden aunts, and grandparents fall in the latter category) as household workers-and the imposition of the entire job on the housewife herself. Leaving aside for a moment the question of which was cause and which effect (did the disappearance of the servant create a demand for the new technology, or did the new technology make the servant obsolete?), the phenomenon itself is relatively easy


to document. Before World War I, when illustrators in the women's magazines depicted women doing housework, the women were very often servants. When the lady of the house was drawn, she was often the person being served, or she was supervising the serving, or she was adding an elegant finishing touch to the work. Nursemaids diapered babies, seamstresses pinned up hems, waitresses served meals, laundresses did the wash, and cooks did the cooking. By the end of the 1920s the servants had disappeared from those illustrations; all those jobs were being done by housewives-elegantly manicured and coiffed, to be sure, but housewives nonetheless (compare figs. 1 and 2). If we are tempted to suppose that illustrations in advertisements are not a reliable indicator of structural changes of this sort, we can corroborate the changes in other ways. Apparently, the illustrators really did know whereof they drew. Statistically the number of persons throughout the country employed in household service dropped from 1,851,000 in 1910 to 1,411,000 in 1920, while the number of households enumerated in the census rose from 20.3 million to 24.4 million.23 In Indiana the ratio of households to servants increased from 13.5/1 in 1890 to 30.5/1 in 1920, and in the country as a whole the number of paid domestic servants per 1,000 population dropped from 98.9 in 1900 to 58.0 in 1920.24 The business-class housewives of Muncie reported that they employed approximately one-half as many woman-hours of domestic service as their mothers had done.25 In case we are tempted to doubt these statistics (and indeed statistics about household labor are particularly unreliable, as the labor is often transient, part-time, or simply unreported), we can turn to articles on the servant problem, the disappearance of unpaid family workers, the design of kitchens, or to architectural drawings for houses. All of this evidence reiterates the same point: qualified servants were difficult to find; their wages had risen and their numbers fallen; houses were being designed without maid's rooms; daughters and unmarried aunts were finding jobs downtown; kitchens were being designed for housewives, not for servants.26 The first home with a 23HistoricalStatistics, pp. 16 and 77. 24For Indiana data, see Lynd and Lynd, Middletown, p. 169. For national data, see D. L. Kaplan and M. Claire Casey, OccupationalTrendsin the United States, 1900-1950, U.S. Bureau of the Census Working Paper no. 5 (Washington, D.C., 1958), table 6. The extreme drop in numbers of servants between 1910 and 1920 also lends credence to the notion that this demographic factor stimulated the industrial revolution in housework. 25Lynd and Lynd, Middletown, p. 169. 26On the disappearance of maiden aunts, unmarried daughters, and grandparents, see Lynd and Lynd, Middletown,pp. 25, 99, and 110; Edward Bok, "Editorial,"American Home 1 (October 1928): 15; "How to Buy Life Insurance," Ladies' Home Journal 45


FIG. 1.-The housewife as manager. (Ladies' Home Journal, April 1918. Courtesy of Lever Brothers Co.)


FIG. 2.-The housewife as laundress. (Ladies' Home Journal, August 1928. Courtesy of Colgate-Palmolive-Peet.)


kitchen that was not an entirely separate room was designed by Frank Lloyd Wright in 1934.27 In 1937 Emily Post invented a new character for her etiquette books: Mrs. Three-in-One, the woman who is her own cook, waitress, and hostess.28 There must have been many new Mrs. Three-in-Ones abroad in the land during the 1920s. As the number of household assistants declined, the number of household tasks increased. The middle-class housewife was expected to demonstrate competence at several tasks that previously had not been in her purview or had not existed at all. Child care is the most obvious example. The average housewife had fewer children than her mother had had, but she was expected to do things for her children that her mother would never have dreamed of doing: to prepare their special infant formulas, sterilize their bottles, weigh them every day, see to it that they ate nutritionally balanced meals, keep them isolated and confined when they had even the slightest illness, consult with their teachers frequently, and chauffeur them to dancing lessons, music lessons, and evening parties.29 There was very little Freudianism in this new attitude toward child care: mothers were not spending more time and effort on their children because they feared the psychological trauma of separation, but because competent nursemaids could not be found, and the new theories of child care required constant attention from well-informed persons-persons who were willing and able to read about the latest discoveries in nutrition, in the control of contagious diseases, or in the techniques of behavioral psychology. These persons simply had to be their mothers. Consumption of economic goods provides another example of the housewife's expanded job description; like child care, the new tasks associated with consumption were not necessarily physically burdensome, but they were time consuming, and they required the acquisi(March 1928): 35. The house plans appeared every month in American Home, which began publication in 1928. On kitchen design, see Giedion, pp. 603-21; "Editorial," Ladies' HomeJournal 45 (April 1928): 36; advertisement for Hoosier kitchen cabinets, Ladies' Home Journal 45 (April 1928): 117. Articles on servant problems include "The Vanishing Servant Girl," Ladies Home Journal 35 (May 1918): 48; "Housework, Then and Now," American Home 8 (June 1932): 128; "The Servant Problem," Fortune 24 (March 1938): 80-84; and Report of the YWCA Commissionon Domestic Service (Los Angeles, 1915). 27Giedion, p. 619. Wright's new kitchen was installed in the Malcolm Willey House, Minneapolis. 28Emily Post, Etiquette:The Blue Book of Social Usage, 5th ed. rev. (New York, 1937), p. 823. 29This analysis is based upon various child-care articles that appeared during the period in the Ladies'HomeJournal, AmericanHome, and Parents' Magazine. See also Lynd and Lynd, Middletown, chap. 11.


tion of new skills.30 Home economists and the editors of women's magazines tried to teach housewives to spend their money wisely. The present generation of housewives, it was argued, had been reared by mothers who did not ordinarily shop for things like clothing, bed linens, or towels; consequently modern housewives did not know how to shop and would have to be taught. Furthermore, their mothers had not been accustomed to the wide variety of goods that were now available in the modern marketplace; the new housewives had to be taught not just to be consumers, but to be informed consumers.31 Several contemporary observers believed that shopping and shopping wisely were occupying increasing amounts of housewives' time.32 Several of these contemporary observers also believed that standards of household care changed during the decade of the 1920s.33 The discovery of the "household germ" led to almost fetishistic concern about the cleanliness of the home. The amount and frequency of laundering probably increased, as bed linen and underwear were changed more often, children's clothes were made increasingly out of washable fabrics, and men's shirts no longer had replaceable collars and cuffs.34 Unfortunately all these changes in standards are difficult to document, being changes in the things that people regard as so insignificant as to be unworthy of comment; the improvement in standards seems a likely possibility, but not something that can be proved. In any event we do have various time studies which demonstrate somewhat surprisingly that housewives with conveniences were spending just as much time on household duties as were housewives without them-or, to put it another way, housework, like so many 30John Kenneth Galbraith has remarked upon the advent of woman as consumer in Economicsand the Public Purpose (Boston, 1973), pp. 29-37. 31There was a sharp reduction in the number of patterns for home sewing offered by the women's magazines during the 1920s; the patterns were replaced by articles on "what is available in the shops this season." On consumer education see, for example, "How to Buy Towels," Ladies' Home Journal 45 (February 1928): 134; "Buying Table Linen," Ladies' Home Journal 45 (March 1928): 43; and "When the Bride Goes Shopping," AmericanHome 1 (January 1928): 370. 32See, for example, Lynd and Lynd, Middletown, pp. 176 and 196; and Margaret G. Reid, Economicsof HouseholdProduction (New York, 1934), chap. 13. 33See Reid, pp. 64-68; and Kyrk, p. 98. 34See advertisement for Cleanliness Institute-"Self-respect thrives on soap and water," Ladies' Home Journal 45 (February 1928): 107. On changing bed linen, see "When the Bride Goes Shopping," AmericanHome 1 (January 1928): 370. On laundering children's clothes, see, "Making a Layette," Ladies'HomeJournal 45 (January 1928): 20; and Josephine Baker, "The Youngest Generation," Ladies'HomeJournal 45 (March 1928): 185.


other types of work, expands to fill the time available.35 A study comparing the time spent per week in housework by 288 farm families and 154 town families in Oregon in 1928 revealed 61 hours spent by farm wives and 63.4 hours by town wives; in 1929 a U.S. Department of Agriculture study of families in various states produced almost identical results.36 Surely if the standard sociological model were valid, housewives in towns, where presumably the benefits of specialization and electrification were most likely to be available, should have been spending far less time at their work than their rural sisters. However, just after World War II economists at Bryn Mawr College reported the same phenomenon: 60.55 hours spent by farm housewives, 78.35 hours by women in small cities, 80.57 hours by women in large ones-precisely the reverse of the results that were expected.37 A recent survey of time studies conducted between 1920 and 1970 concludes that the time spent on housework by nonemployed housewives has remained remarkably constant throughout the period.38 All these results point in the same direction: mechanization of the household meant that time expended on some jobs decreased, but also that new jobs were substituted, and in some cases-notably laundering-time expenditures for old jobs increased because of higher standards. The advantages of mechanization may be somewhat more dubious than they seem at first glance.

* * *

As the job of the housewife changed, the connected ideologies also changed; there was a clearly perceptible difference in the attitudes that women brought to housework before and after World War I.39 35This point is also discussed at length in my paper "What Did Labor-saving Devices Really Save?" (unpublished). 36As reported in Lyrk, p. 51. 37Bryn Mawr College Department of Social Economy, WomenDuring the War and After (Philadelphia, 1945); and Ethel Goldwater, "Woman's Place," Commentary4 (December 1947): 578-85. 38JoAnn Vanek, "Keeping Busy: Time Spent in Housework, United States, 1920-1970" (Ph.D. diss., University of Michigan, 1973). Vanek reports an average of 53 hours per week over the whole period. This figure is significantly lower than the figures reported above, because each time study of housework has been done on a different basis, including different activities under the aegis of housework, and using different methods of reporting time expenditures; the Bryn Mawr and Oregon studies are useful for the comparative figures that they report internally, but they cannot easily be compared with each other. 39This analysis is based upon my reading of the middle-class women's magazines between 1918 and 1930. For detailed documentation see my paper "Two Washes in the Morning and a Bridge Party at Night: The American Housewife between the Wars," Women'sStudies (in press). It is quite possible that the appearance of guilt as a strong


Before the war the trials of doing housework in a servantless home were discussed and they were regarded as just that-trials, necessary chores that had to be got through until a qualified servant could be found. After the war, housework changed: it was no longer a trial and a chore, but something quite different-an emotional "trip." Laundering was not just laundering, but an expression of love; the housewife who truly loved her family would protect them from the embarrassment of tattletale gray. Feeding the family was notjust feeding the family, but a way to express the housewife's artistic inclinations and a way to encourage feelings of family loyalty and affection. Diapering the baby was not just diapering, but a time to build the baby's sense of security and love for the mother. Cleaning the bathroom sink was not just cleaning, but an exercise of protective maternal instincts, providing a way for the housewife to keep her family safe from disease. Tasks of this emotional magnitude could not possibly be delegated to servants, even assuming that qualified servants could be found. Women who failed at these new household tasks were bound to feel guilt about their failure. If I had to choose one word to characterize the temper of the women's magazines during the 1920s, it would be "guilt." Readers of the better-quality women's magazines are portrayed as feeling guilty a good lot of the time, and when they are not guilty they are embarrassed: guilty if their infants have not gained enough weight, embarrassed if their drains are clogged, guilty if their children go to school in soiled clothes, guilty if all the germs behind the bathroom sink are not eradicated, guilty if they fail to notice the first signs of an oncoming cold, embarrassed if accused of having body odor, guilty if their sons go to school without good breakfasts, guilty if their daughters are unpopular because of old-fashioned, or unironed, or-heaven forbid--dirty dresses (see figs. 3 and 4). In earlier times women were made to feel guilty if they abandoned their children or were too free with their affections. In the years after World War I, American women were made to feel guilty about sending their children to school in scuffed shoes. Between the two kinds of guilt there is a world of difference. *

*

*

Let us return for a moment to the sociological model with which this essay began. The model predicts that changing patterns of element in advertising is more the result of new techniques developed by the advertising industry than the result of attitudinal changes in the audience-a possibility that I had not considered when doing the initial research for this paper. See A. Michael McMahon, "An American Courtship: Psychologists and Advertising Theory in the Progressive Era," American Studies 13 (1972): 5-18.


FIG. 3.-Sources of housewifely guilt: the good mother smells sweet. (Ladies' Home Journal, August 1928. Courtesy of Warner-Lambert, Inc.)


FIG. 4.-Sources of housewifely guilt: the good mother must be beautiful. (Ladies' Home Journal, July 1928. Courtesy of Colgate-Palmolive-Peet.)


household work will be correlated with at least two striking indicators of social change: the divorce rate and the rate of married women's labor force participation. That correlation may indeed exist, but it certainly is not reflected in the women's magazines of the 1920s and 1930s: divorce and full-time paid employment were not part of the life-style or the life pattern of the middle-class housewife as she was idealized in her magazines. There were social changes attendant upon the introduction of modern technology into the home, but they were not the changes that the traditional functionalist model predicts; on this point a close analysis of the statistical data corroborates the impression conveyed in the magazines. The divorce rate was indeed rising during the years between the wars, but it was not rising nearly so fast for the middle and upper classes (who had, presumably, easier access to the new technology) as it was for the lower classes. By almost every gauge of socioeconomic status-income, prestige of husband's work, education-the divorce rate is higher for persons lower on the socioeconomic scale-and this is a phenomenon that has been constant over time.40 The supposed connection between improved household technology and married women's labor force participation seems just as dubious, and on the same grounds. The single socioeconomic factor which correlates most strongly (in cross-sectional studies) with married women's employment is husband's income, and the correlation is strongly negative; the higher his income, the less likely it will be that she is working.41 Women's labor force participation increased during the 1920s but this increase was due to the influx of single women into the force. Married women's participation increased slightly during those years, but that increase was largely in factory labor -precisely the kind of work that middle-class women (who were, again, much more likely to have labor-saving devices at home) were least likely to do.42 If there were a necessary connection between the improvement of household technology and either of these two social indicators, we would expect the data to be precisely the reverse of what in fact has occurred: women in the higher social classes should have fewer func40For a summary of the literature on differential divorce rates, see Winch, p. 706; and William J. Goode, AfterDivorce (New York, 1956) p. 44. The earliest papers demonstrating this differential rate appeared in 1927, 1935, and 1939. 41For a summary of the literature on married women's labor force participation, see Juanita Kreps, Sex in the Marketplace:American Women at Work (Baltimore, 1971), pp. 19-24. 42Valerie Kincaid Oppenheimer, The Female Labor Force in the United States, Population Monograph Series, no. 5 (Berkeley, 1970), pp. 1-15; and Lynd and Lynd, Middletown, pp. 124-27.


tions at home and should therefore be more (rather than less) likely to seek paid employment or divorce. Thus for middle-class American housewives between the wars, the social changes that we can document are not the social changes that the functionalist model predicts; rather than changes in divorce or patterns of paid employment, we find changes in the structure of the work force, in its skills, and in its ideology. These social changes were concomitant with a series of technological changes in the equipment that was used to do the work. What is the relationship between these two series of phenomena? Is it possible to demonstrate causality or the direction of that causality? Was the decline in the number of households employing servants a cause or an effect of the mechanization of those households? Both are, after all, equally possible. The declining supply of household servants, as well as their rising wages, may have stimulated a demand for new appliances at the same time that the acquisition of new appliances may have made householders less inclined to employ the laborers who were on the market. Are there any techniques available to the historian to help us answer these questions? *

*

*

In order to establish causality, we need to find a connecting link between the two sets of phenomena, a mechanism that, in real life, could have made the causality work. In this case a connecting link, an intervening agent between the social and the technological changes, comes immediately to mind: the advertiser-by which term I mean a combination of the manufacturer of the new goods, the advertising agent who promoted the goods, and the periodical that published the promotion. All the new devices and new foodstuffs that were being offered to American households were being manufactured and marketed by large companies which had considerable amounts of capital invested in their production: General Electric, Procter & Gamble, General Foods, Lever Brothers, Frigidaire, Campbell's, Del Monte, American Can, Atlantic & Pacific Tea-these were all well-established firms by the time the household revolution began, and they were all in a position to pay for national advertising campaigns to promote their new products and services. And pay they did; one reason for the expanding size and number of women's magazines in the 1920s was, no doubt, the expansion in revenues from available advertisers.43 Those national advertising campaigns were likely to have been powerful stimulators of the social changes that occurred in the

43On the expanding size, number, and influence of women's magazines during the 1920s, see Lynd and Lynd, Middletown, pp. 150 and 240-44.


household labor force; the advertisers probably did not initiate the changes, but they certainly encouraged them. Most of the advertising campaigns manifestly worked, so they must have touched upon areas of real concern for American housewives. Appliance ads specifically suggested that the acquisition of one gadget or another would make it possible to fire the maid, spend more time with the children, or have the afternoon free for shopping.44 Similarly, many advertisements played upon the embarrassment and guilt which were now associated with household work. Ralston, Cream of Wheat, and Ovaltine were not themselves responsible for the compulsive practice of weighing infants and children repeatedly (after every meal for newborns, every day in infancy, every week later on), but the manufacturers certainly did not stint on capitalizing upon the guilt that women apparently felt if their offspring did not gain the required amounts of weight.45 And yet again, many of the earliest attempts to spread "wise" consumer practices were undertaken by large corporations and the magazines that desired their advertising: mail-order shopping guides, "producttesting" services, pseudoinformative pamphlets, and other such promotional devices were all techniques for urging the housewife to buy new things under the guise of training her in her role as skilled consumer.46

Thus the advertisers could well be called the "ideologues" of the 1920s, encouraging certain very specific social changes-as ideologues are wont to do. Not surprisingly, the changes that occurred were precisely the ones that would gladden the hearts and fatten the purses of the advertisers; fewer household servants meant a greater demand for labor and timesaving devices; more household tasks for women meant more and more specialized products that they would need to buy; more guilt and embarrassment about their failure to succeed at their work meant a greater likelihood that they would buy the products that were intended to minimize that failure. Happy,

44See, for example, the advertising campaigns of General Electric and Hotpoint from 1918 through the rest of the decade of the 1920s; both campaigns stressed the likelihood that electric appliances would become a thrifty replacement for domestic servants.
45The practice of carefully observing children's weight was initiated by medical authorities, national and local governments, and social welfare agencies, as part of the campaign to improve child health which began about the time of World War I.
46These practices were ubiquitous; American Home, for example, which was published by Doubleday, assisted its advertisers by publishing a list of informative pamphlets that readers could obtain; devoting half a page to an index of its advertisers; specifically naming manufacturers and list prices in articles about products and services; allotting almost one-quarter of the magazine to a mail-order shopping guide which was not (at least ostensibly) paid advertisement; and as part of its editorial policy, urging its readers to buy new goods.


full-time housewives in intact families spend a lot of money to maintain their households; divorced women and working women do not. The advertisers may not have created the image of the ideal American housewife that dominated the 1920s-the woman who cheerfully and skillfully set about making everyone in her family perfectly happy and perfectly healthy-but they certainly helped to perpetuate it. The role of the advertiser as connecting link between social change and technological change is at this juncture simply a hypothesis, with nothing much more to recommend it than an argument from plausibility. Further research may serve to test the hypothesis, but testing it may not settle the question of which was cause and which effect-if that question can ever be settled definitively in historical work. What seems most likely in this case, as in so many others, is that cause and effect are not separable, that there is a dynamic interaction between the social changes that married women were experiencing and the technological changes that were occurring in their homes. Viewed this way, the disappearance of competent servants becomes one of the factors that stimulated the mechanization of homes, and this mechanization of homes becomes a factor (though by no means the only one) in the disappearance of servants. Similarly, the emotionalization of housework becomes both cause and effect of the mechanization of that work; and the expansion of time spent on new tasks becomes both cause and effect of the introduction of time-saving devices. For example the social pressure to spend more time in child care may have led to a decision to purchase the devices; once purchased, the devices could indeed have been used to save time- although often they were not. *

*

*

If one holds the question of causality in abeyance, the example of household work still has some useful lessons to teach about the general problem of technology and social change. The standard sociological model for the impact of modern technology on family life clearly needs some revision: at least for middle-class nonrural American families in the 20th century, the social changes were not the ones that the standard model predicts. In these families the functions of at least one member, the housewife, have increased rather than decreased; and the dissolution of family life has not in fact occurred. Our standard notions about what happens to a work force under the pressure of technological change may also need revision. When industries become mechanized and rationalized, we expect certain general changes in the work force to occur: its structure becomes more highly differentiated, individual workers become more


specialized, managerial functions increase, and the emotional context of the work disappears. On all four counts our expectations are reversed with regard to household work. The work force became less rather than more differentiated as domestic servants, unmarried daughters, maiden aunts, and grandparents left the household and as chores which had once been performed by commercial agencies (laundries, delivery services, milkmen) were delegated to the housewife. The individual workers also became less specialized; the new housewife was now responsible for every aspect of life in her household, from scrubbing the bathroom floor to keeping abreast of the latest literature in child psychology. The housewife is just about the only unspecialized worker left in America-a veritable jane-of-all-trades at a time when the jacks-ofall-trades have disappeared. As her work became generalized the housewife was also proletarianized: formerly she was ideally the manager of several other subordinate workers; now she was idealized as the manager and the worker combined. Her managerial functions have not entirely disappeared, but they have certainly diminished and have been replaced by simple manual labor; the middle-class, fairly well educated housewife ceased to be a personnel manager and became, instead, a chauffeur, charwoman, and short-order cook. The implications of this phenomenon, the proletarianization of a work force that had previously seen itself as predominantly managerial, deserve to be explored at greater length than is possible here, because I suspect that they will explain certain aspects of the women's liberation movement of the 1960s and 1970s which have previously eluded explanation: why, for example, the movement's greatest strength lies in social and economic groups who seem, on the surface at least, to need it least-women who are white, well-educated, and middle-class. Finally, instead of desensitizing the emotions that were connected with household work, the industrial revolution in the home seems to have heightened the emotional context of the work, until a woman's sense of self-worth became a function of her success at arranging bits of fruit to form a clown's face in a gelatin salad. That pervasive social illness, which Betty Friedan characterized as "the problem that has no name," arose not among workers who found that their labor brought no emotional satisfaction, but among workers who found that their work was invested with emotional weight far out of proportion to its own inherent value: "How long," a friend of mine is fond of asking, "can we continue to believe that we will have orgasms while waxing the kitchen floor?"


Value Sensitive Design and Information Systems

BATYA FRIEDMAN, PETER H. KAHN, JR., AND ALAN BORNING
University of Washington

Forthcoming in P. Zhang & D. Galletta (Eds.), Human-Computer Interaction in Management Information Systems: Foundations. M.E. Sharpe, Inc: NY.
________________________________________________________________________

Value Sensitive Design is a theoretically grounded approach to the design of technology that accounts for human values in a principled and comprehensive manner throughout the design process. It employs an integrative and iterative tripartite methodology, consisting of conceptual, empirical, and technical investigations. We explicate Value Sensitive Design by drawing on three case studies. The first study concerns information and control of web browser cookies, implicating the value of informed consent. The second study concerns using high-definition plasma displays in an office environment to provide a “window” to the outside world, implicating the values of physical and psychological well-being and privacy in public spaces. The third study concerns an integrated land use, transportation, and environmental simulation system to support public deliberation and debate on major land use and transportation decisions, implicating the values of fairness, accountability, and support for the democratic process, as well as a highly diverse range of values that might be held by different stakeholders, such as environmental sustainability, opportunities for business expansion, or walkable neighborhoods. We conclude with direct and practical suggestions for how to engage in Value Sensitive Design.

________________________________________________________________________ 1. INTRODUCTION There is a longstanding interest in designing information and computational systems that support enduring human values. Researchers have focused, for example, on the value of privacy [Ackerman and Cranor 1999; Agre and Rotenberg 1998; Fuchs 1999; Jancke et al. 2001; Palen and Grudin 2003; Tang 1997], ownership and property [Lipinski and Britz 2000], physical welfare [Leveson 1991], freedom from bias [Friedman and Nissenbaum 1996], universal usability [Shneiderman 1999, 2000; Thomas 1997], autonomy [Suchman 1994; Winograd 1994], informed consent [Millett et al. 2001], and trust [Fogg and Tseng 1999; Palen and Grudin 2003; Riegelsberger and Sasse 2002; Rocco 1998; Zheng et al. 2001]. Still, there is a need for an overarching theoretical and methodological framework with which to handle the value dimensions of design work. Value Sensitive Design is one effort to provide such a framework (e.g., Friedman [1997a], Friedman and Kahn [2003], Friedman and Nissenbaum [1996], Hagman, Hendrickson, and Whitty [2003], Nissenbaum [1998], Tang [1997], and Thomas [1997]). Our goal in this paper is to provide an account of Value Sensitive Design, with enough detail for other researchers and designers to critically examine and systematically build on this approach. We begin by sketching the key features of Value Sensitive Design, and then describe its integrative tripartite methodology, which involves conceptual, empirical, and technical investigations, employed iteratively. Then we explicate Value Sensitive Design by drawing on three case studies. One involves cookies and informed consent in web browsers; the second involves HDTV display technology in an office environment; the third involves user interactions and interface for an integrated land use, transportation, and environmental simulation. We conclude with direct and practical suggestions for how to engage in Value Sensitive Design. 1

2. WHAT IS VALUE SENSITIVE DESIGN? Value Sensitive Design is a theoretically grounded approach to the design of technology that accounts for human values in a principled and comprehensive manner throughout the design process. 2.1. What is a Value? In a narrow sense, the word “value” refers simply to the economic worth of an object. For example, the value of a computer could be said to be two thousand dollars. However, in the work described here, we use a broader meaning of the term wherein a value refers to what a person or group of people consider important in life.1 In this sense, people find many things of value, both lofty and mundane: their children, friendship, morning tea, education, art, a walk in the woods, nice manners, good science, a wise leader, clean air. This broader framing of values has a long history. Since the time of Plato, for example, the content of value-oriented discourse has ranged widely, emphasizing “the good, the end, the right, obligation, virtue, moral judgment, aesthetic judgment, the beautiful, truth, and validity” [Frankena 1972, p. 229]. Sometimes ethics has been subsumed within a theory of values, and other times conversely, with ethical values viewed as just one component of ethics more generally. Either way, it is usually agreed [Moore 1903/1978] that values should not be conflated with facts (the “fact/value distinction”) especially insofar as facts do not logically entail value. In other words, “is” does not imply “ought” (the naturalistic fallacy). In this way, values cannot be motivated only by an empirical account of the external world, but depend substantively on the interests and desires of human beings within a cultural milieu. In Table 1 in Section 6.8, we provide a list of human values with ethical import that are often implicated in system design, along with working definitions and references to the literature. 2.2. Related Approaches to Values and System Design In the 1950’s, during the early periods of computerization, cyberneticist Norbert Wiener [1953/1985] argued that technology could help make us better human beings, and create a more just society. But for it to do so, he argued, we have to take control of the technology. We have to reject the “worshiping [of] the new gadgets which are our own creation as if they were our masters” (p. 678). Similarly, a few decades later, computer scientist Joseph Weizenbaum [1972] wrote: What is wrong, I think, is that we have permitted technological metaphors…and technique itself to so thoroughly pervade our thought processes that we have finally abdicated to technology the very duty to formulate questions…Where a simple man might ask: “Do we need these things?”, technology asks “what electronic wizardry will make them safe?” Where a simple man will ask “is it good?”, technology asks “will it work?” (pp. 611-612) More recently, supporting human values through system design has emerged within at least four important approaches. Computer Ethics advances our understanding of key values that lie at the intersection of computer technology and human lives, e.g., Bynum [1985], Johnson and Miller [1997], and Nissenbaum [1999]. Social Informatics has been successful in providing socio-technical analyses of deployed technologies, e.g., Kling, Rosenbaum, and Hert [1998], Kling and Star [1998], and Sawyer and Rosenbaum [2000]. 1

1. The Oxford English Dictionary definition of this sense of value is: “the principles or standards of a person or society, the personal or societal judgement of what is valuable and important in life.” [Simpson and Weiner 1989]

Computer Supported Cooperative Work (CSCW) has been successful in the design of new technologies to help people collaborate effectively in the workplace, e.g., Fuchs [1999], Galegher, Kraut, and Egido [1990], Olson and Teasley [1996], and Grudin [1988]. Finally, Participatory Design substantively embeds democratic values into its practice, e.g., Bjerknes & Bratteteig [1995], Bødker [1990], Ehn [1989], Greenbaum and Kyng [1991], and Kyng and Mathiassen [1997]. (See Friedman and Kahn [2003] for a review of each of these approaches.) 3. THE TRIPARTITE METHODLOGY: CONCEPTUAL, EMPIRICAL, AND TECHNICAL INVESTIGATIONS Think of an oil painting by Monet or Cézanne. From a distance it looks whole; but up close you can see many layers of paint upon paint. Some paints have been applied with careful brushstrokes, others perhaps energetically with a palate knife or fingertips, conveying outlines or regions of color. The diverse techniques are employed one on top of the other, repeatedly, and in response to what has been laid down earlier. Together they create an artifact that could not have been generated by a single technique in isolation of the others. So, too, with Value Sensitive Design. An artifact (e.g., system design) emerges through iterations upon a process that is more than the sum of its parts. Nonetheless, the parts provide us with a good place to start. Value Sensitive Design builds on an iterative methodology that integrates conceptual, empirical, and technical investigations; thus, as a step toward conveying Value Sensitive Design, we describe each investigation separately. 3.1 Conceptual Investigations Who are the direct and indirect stakeholders affected by the design at hand? How are both classes of stakeholders affected? What values are implicated? How should we engage in trade-offs among competing values in the design, implementation, and use of information systems (e.g., autonomy vs. security, or anonymity vs. trust)? Should moral values (e.g., a right to privacy) have greater weight than, or even trump, non-moral values (e.g., aesthetic preferences)? Value Sensitive Design takes up these questions under the rubric of conceptual investigations. In addition, careful working conceptualizations of specific values clarify fundamental issues raised by the project at hand, and provide a basis for comparing results across research teams. For example, in their analysis of trust in online system design, Friedman, Kahn, and Howe [2000], drawing on Baier [1986], first offer a philosophically informed working conceptualization of trust. They propose that people trust when they are vulnerable to harm from others, yet believe those others would not harm them even though they could. In turn, trust depends on people’s ability to make three types of assessments. One is about the harms they might incur. The second is about the good will others possess toward them that would keep those others from doing them harm. The third involves whether or not harms that do occur lie outside the parameters of the trust relationship. From such conceptualizations, Friedman et al. were able to define clearly what they meant by trust online. 
This definition is in some cases different from what other researchers have meant by the term – for example, the Computer Science and Telecommunications Board, in their thoughtful publication Trust in Cyberspace [Schneider 1999], adopted the terms “trust” and “trustworthy” to describe systems that perform as expected along the dimensions of correctness, security, reliability, safety, and survivability. Such a definition, which equates “trust” with expectations for machine performance, differs markedly from one that says trust is fundamentally a relationship between people (sometimes mediated by machines). 3

3.2 Empirical Investigations Conceptual investigations can only go so far. Depending on the questions at hand, many analyses will need to be informed by empirical investigations of the human context in which the technical artifact is situated. Empirical investigations are also often needed to evaluate the success of a particular design. Empirical investigations can be applied to any human activity that can be observed, measured, or documented. Thus, the entire range of quantitative and qualitative methods used in social science research is potentially applicable here, including observations, interviews, surveys, experimental manipulations, collection of relevant documents, and measurements of user behavior and human physiology. Empirical investigations can focus, for example, on questions such as: How do stakeholders apprehend individual values in the interactive context? How do they prioritize competing values in design trade-offs? How do they prioritize individual values and usability considerations? Are there differences between espoused practice (what people say) compared with actual practice (what people do)? Moreover, because the development of new technologies affects groups as well as individuals, questions emerge of how organizations appropriate value considerations in the design process. For example, regarding value considerations, what are organizations’ motivations, methods of training and dissemination, reward structures, and economic incentives? 3.3 Technical Investigations As discussed in Section 2.3 (Value Sensitive Design’s Constellation of Features), Value Sensitive Design adopts the position that technologies in general, and information and computer technologies in particular, provide value suitabilities that follow from properties of the technology. That is, a given technology is more suitable for certain activities and more readily supports certain values while rendering other activities and values more difficult to realize. In one form, technical investigations focus on how existing technological properties and underlying mechanisms support or hinder human values. For example, some videobased collaborative work systems provide blurred views of office settings, while other systems provide clear images that reveal detailed information about who is present and what they are doing. Thus the two designs differentially adjudicate the value trade-off between an individual’s privacy and the group’s awareness of individual members’ presence and activities. In the second form, technical investigations involve the proactive design of systems to support values identified in the conceptual investigation. For example, Fuchs [1999] developed a notification service for a collaborative work system in which the underlying technical mechanisms implement a value hierarchy whereby an individual’s desire for privacy overrides other group members’ desires for awareness. At times, technical investigations – particularly of the first form – may seem similar to empirical investigations insofar as both involve technological and empirical activity. However, they differ markedly on their unit of analysis. Technical investigations focus on the technology itself. Empirical investigations focus on the individuals, groups, or larger social systems that configure, use, or are otherwise affected by the technology. 4. 
VALUE SENSITIVE DESIGN IN PRACTICE: THREE CASE STUDIES To illustrate Value Sensitive Design’s integrative and iterative tripartite methodology, we draw on three case studies with real world applications, one completed and two under way. Each case study represents a unique design space. 4

4.1 Cookies and Informed Consent in Web Browsers Informed consent provides a critical protection for privacy, and supports other human values such as autonomy and trust. Yet currently there is a mismatch between industry practice and the public’s interest. According to a recent report from the Federal Trade Commission (2000), for example, 59% of Web sites that collect personal identifying information neither inform Internet users that they are collecting such information nor seek the user’s consent. Yet, according to a Harris poll (2000), 88% of users want sites to garner their consent in such situations. Against this backdrop, Friedman, Felten, and their colleagues [Friedman et al. 2002; Friedman et al. 2000; Millett et al. 2001] sought to design web-based interactions that support informed consent in a web browser through the development of new technical mechanisms for cookie management. This project was an early proof-of-concept project for Value Sensitive Design, which we use here to illustrate several key features of the methodology. 4.1.1 Conceptualizing the Value. One part of a conceptual investigation entails a philosophically informed analysis of the central value constructs. Accordingly, Friedman et al. began their project with a conceptual investigation of informed consent itself. They drew on diverse literature, such as the Belmont Report, which delineates ethical principles and guidelines for the protection of human subjects [Belmont Report 1978; Faden and Beauchamp 1986], to develop criteria for informed consent in online interactions. In brief, the idea of “informed” encompasses disclosure and comprehension. Disclosure refers to providing accurate information about the benefits and harms that might reasonably be expected from the action under consideration. Comprehension refers to the individual’s accurate interpretation of what is being disclosed. In turn, the idea of “consent” encompasses voluntariness, comprehension, and agreement. Voluntariness refers to ensuring that the action is not controlled or coerced. Competence refers to possessing the mental, emotional and physical capabilities needed to be capable of giving informed consent. Agreement refers to a reasonably clear opportunity to accept or decline to participate. Moreover, agreement should be ongoing, that is, the individual should be able to withdraw from the interaction at any time. See Friedman, Millet, and Felten [2000] for an expanded discussion of these five criteria. 4.1.2 Using a Conceptual Investigation to Analyze Existing Technical Mechanisms. With a conceptualization for informed consent online in hand, Friedman et al. conducted a retrospective analysis (one form of a technical investigation) of how the cookie and web-browser technology embedded in Netscape Navigator and Internet Explorer changed with respect to informed consent over a 5-year period, beginning in 1995. Specifically, they used the criteria of disclosure, comprehension, voluntariness, competence, and agreement to evaluate how well each browser in each stage of its development supported the users’ experience of informed consent. Through this retrospective analysis, they found that while cookie technology had improved over time regarding informed consent (e.g., increased visibility of cookies, increased options for accepting or declining cookies, and access to information about cookie content), as of 1999 some startling problems remained. 
For example: (a) While browsers disclosed to users some information about cookies, they still did not disclose the right sort of information – that is, information about the potential harms and benefits from setting a particular cookie. (b) In Internet Explorer, the burden to accept or decline all third party cookies still fell to the user, placing undue burden on the user to decline each third party cookie one at a time. (c) Users’ out-of-the-box experience of cookies (i.e., the default setting) was no different in 1999 than it was in 1995: to accept all cookies. That is, the novice user installed a 5

browser that accepted all cookies and disclosed nothing about that activity to the user. (d) Neither browser alerted a user when a site wished to use a cookie and for what purpose, as opposed to when a site wished to store a cookie.
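To make the retrospective analysis concrete, the following is a minimal, purely illustrative sketch (in Python, not part of the original study) of how the five criteria of disclosure, comprehension, voluntariness, competence, and agreement might be encoded as a rubric and applied to a description of a browser's cookie defaults. All names and ratings below are hypothetical.

```python
from dataclasses import dataclass, field

# The five informed-consent criteria used in the retrospective analysis.
# Ratings here are illustrative only, not the study's actual codings.
CRITERIA = ["disclosure", "comprehension", "voluntariness", "competence", "agreement"]

@dataclass
class BrowserAssessment:
    browser: str                                  # hypothetical label for a browser version
    year: int
    ratings: dict = field(default_factory=dict)   # criterion -> "poor" | "partial" | "good"

    def gaps(self):
        """Return the criteria this browser version supports poorly or not at all."""
        return [c for c in CRITERIA if self.ratings.get(c, "poor") == "poor"]

# Hypothetical example: a default-accept cookie policy with no notice to the user
# scores poorly on disclosure, comprehension, and agreement.
out_of_box_1995 = BrowserAssessment(
    browser="browser, out-of-the-box defaults (1995)",
    year=1995,
    ratings={"disclosure": "poor", "comprehension": "poor",
             "voluntariness": "partial", "competence": "partial",
             "agreement": "poor"},
)
print(out_of_box_1995.gaps())  # -> ['disclosure', 'comprehension', 'agreement']
```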

Figure 1. (a) Peripheral awareness mechanism; (b) just-in-time cookie management tool. Screen shot (a) of the Mozilla implementation shows the peripheral awareness of cookies interface (at the left) in the context of browsing the web. Each time a cookie is set, a color-coded entry for that cookie appears in the sidebar. Third party cookies are red; others are green. At the user’s discretion, he or she can click on any entry to bring up the Mozilla cookie manager for that cookie. Screen shot (b) after the user has clicked on an entry to bring up the just-in-time cookie management tool (in the center) for a particular cookie.
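The color coding described in Figure 1 amounts to a simple classification of each incoming cookie. The sketch below is an illustration of that idea only; it is not taken from the Mozilla code base, and the domain-matching rule is a simplification of real browser behavior. A cookie whose domain does not match the visited site is treated as third party and flagged red, and each sidebar entry carries a hook for the just-in-time cookie manager.

```python
from urllib.parse import urlparse

def classify_cookie(page_url: str, cookie_domain: str) -> str:
    """Return 'third-party' if the cookie's domain does not match the visited
    site's host, else 'first-party' (a simplification of real browser rules)."""
    page_host = urlparse(page_url).hostname or ""
    cookie_domain = cookie_domain.lstrip(".")
    if page_host == cookie_domain or page_host.endswith("." + cookie_domain):
        return "first-party"
    return "third-party"

def sidebar_entry(page_url: str, cookie_name: str, cookie_domain: str) -> dict:
    """Build a color-coded sidebar entry: red for third-party cookies, green otherwise,
    plus a placeholder action for opening a just-in-time management dialog."""
    kind = classify_cookie(page_url, cookie_domain)
    return {
        "cookie": cookie_name,
        "domain": cookie_domain,
        "kind": kind,
        "color": "red" if kind == "third-party" else "green",
        "on_click": "open just-in-time cookie manager",  # placeholder, not a real API
    }

# Hypothetical example: a tracking cookie set while visiting a news site.
print(sidebar_entry("https://news.example.org/story", "ad_id", ".tracker.example.com"))
# -> classified as 'third-party' and colored 'red'
```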

4.1.3 The Iteration and Integration of Conceptual, Technical, and Empirical Investigations. Based on the results from these conceptual and technical investigations, Friedman et al. then iteratively used the results to guide a second technical investigation: a redesign of the Mozilla browser (the open-source code for Netscape Navigator). Specifically, they developed three new types of mechanisms: (a) peripheral awareness of cookies; (b) just-in-time information about individual cookies and cookies in general; and (c) just-in-time management of cookies (see Figure 1). In the process of their technical work, Friedman et al. conducted formative evaluations (empirical investigations) which led to a further design criterion, minimal distraction, which refers to meeting the above criteria for informed consent without unduly diverting the user from the task at hand. Two situations are of concern here. First, if users are overwhelmed with queries to consent to participate in events with minor benefits and risks, they may become numbed to the informed consent process by the time participation in an event with significant benefits and risks is at hand. Thus, the user’s participation in that event may not receive the careful attention that is warranted. Second, if the overall distraction to obtain informed consent becomes so great as to be perceived to be an intolerable nuisance, users are likely to disengage from the informed consent process in its entirety and accept or decline participation by rote. Thus undue distraction can single-handedly undermine informed consent. In this way, the iterative results of the above empirical investigations not only shaped and then validated the technical work, but impacted the initial conceptual investigation by adding to the model of informed consent the criterion of minimal distraction. Thus, this project illustrates the iterative and integrative nature of Value Sensitive Design, and provides a proof-of-concept for Value Sensitive Design in the context of mainstream Internet software. 6

4.2 Room with a View: Using Plasma Displays in Interior Offices Janice is in her office, writing a report. She’s trying to conceptualize the report’s higher-level structure, but her ideas won’t quite take form. Then she looks up from her desk and rests her eyes on the fountain and plaza area outside her building. She notices the water bursting upward, and that a small group of people are gathering by the water’s edge. She rests her eyes on the surrounding pool of calm water. Her eyes then lift toward the clouds and the streaking sunshine. Twenty seconds later she returns to her writing task at hand, slightly refreshed, and with an idea taking shape. What’s particularly novel about this workplace scenario is that Janice works in an interior office. Instead of a real window looking out onto the plaza, Janice has a large screen video plasma display that continuously displays the local outdoor scene in realtime. Realistic? Beneficial? This design space is currently being researched by Kahn, Friedman, and their colleagues, using the framework of Value Sensitive Design. In Kahn et al.’s initial conceptual investigation of this design space, they drew on the psychological literature that suggests that interaction with real nature can garner physiological and psychological benefits. For example, in one study, Ulrich [1984] found that post-operative recovery improved when patients were assigned to a room with a view of a natural setting (a small stand of deciduous trees) versus a view of a brown brick wall. More generally, studies have shown that even minimal connection with nature – such as looking at a natural landscape – can reduce immediate and long-term stress, reduce sickness of prisoners, and calm patients before and during surgery. (See Beck and Katcher [1996], Kahn [1999], and Ulrich [1993] for reviews.) Thus Kahn et al. hypothesized that an “augmented window” of nature could render benefits in a work environment in terms of the human values of physical health, emotional well-being, and creativity. To investigate this question in a laboratory context, Kahn et al. are comparing the short-term benefits of working in an office with a view out the window of a beautiful nature scene versus an identical view (in real time) shown on a large video plasma display that covers the window in the same office (Figure 2a). In this latter condition, they employed a High Definition TV (HDTV) camera (Figure 2b) to capture real-time local images. The control condition involved a blank covering over the window. Their measures entailed (a) physiological data (heart rate), (b) performance data (on cognitive and creativity tasks), (c) video data that captured each subject’s eye gaze on a second-bysecond level, and time synchronized with the physiological equipment, so that analyses can determine whether physiological benefits accrued immediately following an eye gaze onto the plasma screen, and (d) social-cognitive data (based on a 50-minute interview with each subject at the conclusion of the experimental condition wherein they garnered each subject’s reasoned perspective on the experience). Data analysis is in progress. However, preliminary results are showing the following trends. First, participants looked out the plasma screen just as frequently as they did the real window, and more frequently than they stared at the blank wall. In this sense, the plasma-display window was functioning like a real window. 
But, when participants gazed for 30 seconds or more, the real window provided greater physiological recovery from low-level stress as compared to the plasma display window.
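Because the eye-gaze coding was time-synchronized with the physiological recording, the analysis described above can be framed as a simple computation over the two streams. The sketch below is an illustration of that framing only, not the authors' analysis code; the data structures, window lengths, and numbers are hypothetical.

```python
def recovery_after_long_gazes(gazes, heart_rate, min_dur=30, window=20):
    """gazes: list of (start_s, end_s) intervals of gaze onto the window or display;
    heart_rate: dict mapping second -> beats per minute, time-synchronized with the video.
    Returns the mean heart-rate change (after minus before) around gazes lasting at
    least `min_dur` seconds; negative values would suggest physiological recovery."""
    deltas = []
    for start, end in gazes:
        if end - start < min_dur:
            continue
        before = [heart_rate[t] for t in range(start - window, start) if t in heart_rate]
        after = [heart_rate[t] for t in range(end, end + window) if t in heart_rate]
        if before and after:
            deltas.append(sum(after) / len(after) - sum(before) / len(before))
    return sum(deltas) / len(deltas) if deltas else None

# Hypothetical data: one 40-second gaze followed by a small drop in heart rate.
hr = {t: 74.0 for t in range(0, 200)}
hr.update({t: 71.0 for t in range(140, 200)})
print(recovery_after_long_gazes([(100, 140)], hr))  # -> -3.0
```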


Figure 2. Plasma Display Technology Studies: (a) “The Watcher”; (b) the HDTV camera; (c) “The Watched”.

From the standpoint of illustrating Value Sensitive Design, we would like to emphasize five ideas. 4.2.1. Multiple Empirical Methods. Under the rubric of empirical investigations, Value Sensitive Design supports and encourages multiple empirical methods to be used in concert to address the question at hand. As noted above, for example, this study employed physiological data (heart rate), two types of performance data (on cognitive and creativity tasks), behavioral data (eye gaze), and reasoning data (the social-cognitive interview). From a value-oriented perspective, multiple psychological measures increase the veracity of most accounts of technology in use. 4.2.2. Direct and Indirect Stakeholders. In their initial conceptual investigation of the values implicated in this study, Kahn et al. sought to identify not only direct but also indirect stakeholders affected by such display technology. At that early point, it became clear to the researchers that an important class of indirect stakeholders (and their respective values) needed to be included: namely, the individuals who, by virtue of walking through the fountain scene, unknowingly had their images displayed on the video plasma display in the “inside” office (Figure 2c). In other words, if this application of projection technology were to come into widespread use (as web cams and surveillance cameras have begun to) then it would potentially encroach on the privacy of individuals in public spaces – an issue that has been receiving increasing attention in the field of computer ethics and public discourse [Nissenbaum 1998]. Thus, in addition to the experimental laboratory study, Kahn et al. initiated two additional but complementary empirical investigations with indirect stakeholders: (a) a survey of 750 people walking through the public plaza, and (b) in-depth social cognitive interviews with 30 individuals walking through the public plaza [Friedman, Kahn, and Hagman 2004]. Both investigations focused on indirect stakeholders’ judgments of privacy in public space, and in particular having their real-time images captured and displayed on plasma screens in nearby and distant offices. The importance of such indirect stakeholder investigations is being borne out by the results. For example, significant gender differences were found in their survey data: more women than men expressed concern about the invasion of privacy through web cameras in public places. This finding held whether their image was to be displayed locally or in another city (Tokyo), or viewed by one person, thousands, or millions. One implication of this finding is that future technical designs and implementations of such display technologies need to be responsive to ways in which men and women might perceive potential harms differently. 8

4.2.3. Coordinated Empirical Investigations. Once Kahn et al. identified an important group of indirect stakeholders, and decided to undertake empirical investigations with this group, they then coordinated these empirical investigations with the initial (direct stakeholder) study. Specifically, a subset of identical questions were asked of both the direct stakeholders (“The Watchers”) and indirect stakeholders (“The Watched”). Results show some interesting differences. For example, more men in The Watched condition expressed concerns about that people’s images might be displayed locally, nationally, or internationally than men in The Plasma Display Watcher condition. No differences were found between women in The Watcher Plasma Display Condition and women in the Watched condition. Thus, the Value Sensitive Design methodology helps to bring to the forefront values that matter not only to the direct stakeholders of a technology (such as physical health, emotional well-being, and creativity), but to the indirect stakeholders (such as privacy, informed consent, trust, and physical safety). Moreover, from the standpoint of Value Sensitive Design, the above study highlights how investigations of indirect stakeholders can be woven into the core structure of the experimental design with direct stakeholders. 4.2.4. Multiplicity of and Potential Conflicts among Human Values. Value Sensitive Design can help researchers uncover the multiplicity of and potential conflicts among human values implicated in technological implementations. In the above design space, for example, values of physical health, emotional well-being, and creativity appear to partially conflict with other values of privacy, civil rights, trust, and security. 4.2.5. Technical Investigations. Conceptual and empirical investigations can help to shape future technological investigations, particularly in terms of how nature (as a source of information) can be embedded in the design of display technologies to further human well-being. One obvious design space involves buildings. For example, if Kahn et al.’s empirical results continue to emerge in line with their initial results, then one possible design guideline is as follows: we need to design buildings with nature in mind, and within view. In other words, we cannot with psychological impunity digitize nature and display the digitized version as a substitute for the real thing (and worse, then destroy the original). At the same time, it is possible that technological representations of nature can garner some psychological benefits, especially when (as in an inside office) direct access to nature is otherwise unavailable. Other less obvious design spaces involve, for example, airplanes. In recent discussions with Boeing Corporation, for example, we were told that for economic reasons engineers might like to construct airplanes without passenger windows. After all, windows cost more to build and decrease fuel efficiency. At stake, however, is the importance of windows in the human experience of flying. In short, this case study highlights how Value Sensitive Design can help researchers employ multiple psychological methods, across several studies, with direct and indirect stakeholders, to investigate (and ultimately support) a multiplicity of human values impacted by deploying a cutting-edge information technology. 
4.3 UrbanSim: Integrated Land Use, Transportation, and Environmental Simulation In many regions in the United States (and globally), there is increasing concern about pollution, traffic jams, resource consumption, loss of open space, loss of coherent community, lack of sustainability, and unchecked sprawl. Elected officials, planners, and citizens in urban areas grapple with these difficult issues as they develop and evaluate alternatives for such decisions as building a new rail line or freeway, establishing an urban growth boundary, or changing incentives or taxes. These decisions interact in complex ways, and, in particular, transportation and land use decisions interact strongly 9

with each other. There are both legal and common sense reasons to try to understand the long-term consequences of these interactions and decisions. Unfortunately, the need for this understanding far outstrips the capability of the analytic tools used in current practice. In response to this need, Waddell, Borning, and their colleagues have been developing UrbanSim, a large simulation package for predicting patterns of urban development for periods of twenty years or more, under different possible scenarios [Waddell 2002; Noth et al. 2003; Waddell et al. 2003]. Its primary purpose is to provide urban planners and other stakeholders with tools to aid in more informed decisionmaking, with a secondary goal to support further democratization of the planning process. When provided with different scenarios – packages of possible policies and investments – UrbanSim models the resulting patterns of urban growth and redevelopment, of transportation usage, and of resource consumption and other environmental impacts.
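As a rough illustration of this workflow only (none of the models, parameters, or interfaces below come from UrbanSim itself), a scenario can be thought of as a package of policy levers that is run through a model and then summarized as indicators grouped under broad value categories:

```python
# Toy illustration of the scenario -> simulation -> indicators workflow.
# All quantities and policy levers are hypothetical.

def run_toy_model(scenario: dict, years: int = 20) -> dict:
    """Pretend 'simulation': grow jobs and land consumption as a crude
    function of the scenario's policy levers."""
    growth = 1.01 + 0.005 * scenario.get("highway_investment", 0)
    infill = scenario.get("urban_growth_boundary", False)
    jobs = 100_000 * (growth ** years)
    rural_acres_converted = (50 if infill else 400) * years
    return {"jobs": jobs, "rural_acres_converted": rural_acres_converted}

def indicators(results: dict) -> dict:
    """Group a few example indicators under broad value categories."""
    return {
        "economic": {"total_jobs": round(results["jobs"])},
        "environmental": {"rural_land_converted_acres": results["rural_acres_converted"]},
        "social": {},  # e.g., poverty segregation or mode share, omitted in this toy
    }

baseline = {"highway_investment": 1, "urban_growth_boundary": False}
compact = {"highway_investment": 0, "urban_growth_boundary": True}
for name, scenario in [("baseline", baseline), ("compact growth", compact)]:
    print(name, indicators(run_toy_model(scenario)))
```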

Figure 3. Results from UrbanSim for Eugene/Springfield, Oregon, forecasting land use patterns over a 14-year period: (a) 1980 employment; (b) change 1980-1994; (c) resulting 1994 employment. These results arise from the simulated interactions among demographic change, economic change, real estate development, transportation, and other actors and processes in the urban environment. Map (a) shows the employment density in 1980 (number of jobs located in each 150x150 meter grid cell). Darker red indicates higher density. Map (b) shows the predicted change from 1980 to 1994 (where darker red indicates a greater change), and map (c) the predicted employment density in 1994. In a historical validation of the model, this result was then compared with the actual 1994 employment, with a 0.917 correlation over a 1-cell radius.
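The “0.917 correlation over a 1-cell radius” in the caption compares predicted and observed employment on a grid while allowing each cell to be matched against its immediate neighborhood. The sketch below shows one way such a figure could be computed, under the assumption (ours, not necessarily the procedure used in the validation) that each cell is first averaged with its neighbors within a 1-cell radius before taking a Pearson correlation; the grids are tiny hypothetical examples.

```python
import math

def smooth(grid, radius=1):
    """Average each cell with its neighbors within the given radius."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            vals = [grid[i][j]
                    for i in range(max(0, r - radius), min(rows, r + radius + 1))
                    for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            out[r][c] = sum(vals) / len(vals)
    return out

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def gridded_correlation(predicted, observed, radius=1):
    """Correlation between neighborhood-smoothed predicted and observed grids."""
    p = [v for row in smooth(predicted, radius) for v in row]
    o = [v for row in smooth(observed, radius) for v in row]
    return pearson(p, o)

# Tiny hypothetical grids of jobs per cell.
pred = [[10, 12, 8], [9, 30, 11], [7, 10, 9]]
obs = [[11, 11, 9], [10, 28, 10], [8, 9, 10]]
print(round(gridded_correlation(pred, obs), 3))
```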

To date, UrbanSim has been applied in the metropolitan regions around Eugene/Springfield, Oregon (Figure 3), Honolulu, Hawaii, Salt Lake City, Utah, and Houston, Texas, with application to the Puget Sound region in Washington State under way. UrbanSim is undergoing significant redevelopment and extension in terms of its underlying architecture, interface, and social goals. Under the direction of Borning, Friedman, and Kahn, Value Sensitive Design is playing a central role in this endeavor. UrbanSim illustrates important aspects of Value Sensitive Design in addition to those described in the previous two case studies: 4.3.1 Distinguishing Explicitly Supported Values from Stakeholder Values. In their conceptual investigations, Borning et al. distinguished between explicitly supported values (i.e., ones that they explicitly want to embed in the simulation) and stakeholder values (i.e., ones that are important to some but not necessarily all of the stakeholders). Next, Borning et al. committed to three specific moral values to be supported explicitly. One is fairness, and more specifically freedom from bias. The simulation should not discriminate unfairly against any group of stakeholders, or privilege one mode of transportation or policy over another. A second is accountability. Insofar as possible, stakeholders should be able to confirm that their values are reflected in the simulation, evaluate and judge its validity, and develop an appropriate level of confidence in its 10

output. The third is democracy. The simulation should support the democratic process in the context of land use, transportation, and environmental planning. In turn, as part of supporting the democratic process, Borning et al. decided that the model should not a priori favor or rule out any given set of stakeholder values, but instead, should allow different stakeholders to articulate the values that are most important to them, and evaluate the alternatives in light of these values. 4.3.2 Handling Widely Divergent and Potentially Conflicting Stakeholder Values. From the standpoint of conceptual investigations, UrbanSim as a design space poses tremendous challenges. The research team cannot focus on a few key values, as occurred in the Web Browser project (e.g., the value of informed consent), or the Room with a View project (e.g., the values of privacy in public spaces, and physical and psychological well-being). Rather, disputing stakeholders bring to the table widely divergent values about environmental, political, moral, and personal issues. Examples of stakeholder values are environmental sustainability, walkable neighborhoods, space for business expansion, affordable housing, freight mobility, minimal government intervention, minimal commute time, open space preservation, property rights, and environmental justice. How does one characterize the wide-ranging and deeply held values of diverse stakeholders, both present and future? Moreover, how does one prioritize the values implicated in the decisions? And how can one move from values to measurable outputs from the simulation to allow stakeholders to compare alternative scenarios? As part of addressing these questions, the research group implemented a web-based interface that groups indicators into three broad value categories pertaining to the domain of urban development (economic, environmental, and social), and more specific value categories under that. To allow stakeholders to evaluate alternative urban futures, the interface provides a large collection of indicators: variables that distill some attribute of interest about the results [Gallopin 1997]. (Examples of indicators are the number of acres of rural land converted to urban use each year, the degree of poverty segregation, or the mode share between autos and transit.) These categories and indicators draw on a variety of sources, including empirical research on people’s environmental concepts and values [Kahn 1999; Kahn and Kellert 2002], community-based indicator projects [Palmer 1998; Hart 1999], and the policy literature. Stakeholders can then use the interface to select indicators that speak to values that are important to them from among these categories. This interface illustrates the interplay among conceptual, technical, and empirical investigations. The indicators are chosen to speak to different stakeholder values – responding to our distinction between explicitly supported values and stakeholder values in the initial conceptual investigation. The value categories are rooted empirically in both human psychology and policy studies, not just philosophy – and then embodied in a technical artifact (the web-based interface), which is in turn evaluated empirically. 4.3.3 Technical Choices Driven by Initial and Emergent Value Considerations. Most of the technical choices in the design of the UrbanSim software are in response to the need to generate indicators and other evaluation measures that respond to different strongly-held stakeholder values. 
For example, for some stakeholders, walkable, pedestrian-friendly neighborhoods are very important. But being able to model walking as a transportation mode makes difficult demands on the underlying simulation, requiring a finer-grained spatial scale than is needed for modeling automobile transportation alone. In turn, being able to answer questions about walking as a transportation mode is important for two explicitly supported values: fairness (not to privilege one transportation mode over another), and democracy (being able to answer questions about a value that is important to a significant number of stakeholders). As a second example of technical 11

choices being driven by value considerations, UrbanSim’s software architecture is designed to support rapid evolution in response to changed or additional requirements. For instance, the software architecture decouples the individual component models as much as possible, allowing them to evolve and new ones to be added in a modular fashion. Also, the system writes the simulation results into an SQL database, making it easy to write queries that produce new indicators quickly and as needed, rather than embedding the indicator computation code in the component models themselves. For similar reasons, the UrbanSim team uses the YP agile software development methodology [Freeman-Benson and Borning 2003], which allows the system to evolve and respond quickly to emerging stakeholder values and policy considerations. 4.3.4 Designing for Credibility, Openness, and Accountability. Credibility of the system is of great importance, particularly when the system is being used in a politically charged situation and is thus the subject of intense scrutiny. The research group has undertaken a variety of activities to help foster credibility, including using behaviorally transparent simulation techniques (i.e., simulating agents in the urban environment, such as households, businesses, and real estate developers, rather than using some more abstract and opaque simulation technique), and performing sensitivity analyses [Franklin et al. 2002] and a historical validation. In the historical validation, for example, the group started the model with 1980 data from Eugene/Springfield, simulated through 1994, and compared the simulation output with what actually happened. One of these comparisons is shown in Figure 3. In addition, our techniques for fostering openness and accountability are also intended to support credibility. These include using Open Source software (releasing the source code along with the executable), writing the code in as clear and understandable a fashion as possible, using a rigorous and extensive testing methodology, and complementing the Open Source software with an Open Process that makes the state of our development visible to anyone interested. For example, in our laboratory, a battery of tests is run whenever a new version of the software is committed to the source code repository. A traffic light (a real one) is activated by the testing regime – green means that the system has passed all tests, yellow means testing is under way, and red means that a test has failed. There is also a virtual traffic light, mirroring the physical one, visible on the web (www.urbansim.org/fireman). Similarly, the bug reports, feature requests, and plans are all on the UrbanSim project website as well. Details of this Open Process approach may be found in Freeman-Benson and Borning [2003]. Thus, in summary, Borning et al. are using Value Sensitive Design to investigate how a technology – an integrated land use, transportation, and environmental computer simulation – affects human values on both the individual and organizational levels; and how human values can continue to drive the technical investigations, including refining the simulation, data, and interaction model. Finally, employing Value Sensitive Design in a project of this scope serves to validate its use for complex, large-scale systems. 5. 
VALUE SENSITIVE DESIGN’S CONSTELLATION OF FEATURES Value Sensitive Design shares and adopts many interests and techniques from related approaches to values and system design – computer ethics, social informatics, CSCW, and Participatory Design – as discussed in Section 2.2. However, Value Sensitive Design itself brings forward a unique constellation of eight features. First, Value Sensitive Design seeks to be proactive: to influence the design of technology early in and throughout the design process. 12

Second, Value Sensitive Design enlarges the arena in which values arise to include not only the work place (as traditionally in the field of CSCW), but also education, the home, commerce, online communities, and public life. Third, Value Sensitive Design contributes a unique methodology that employs conceptual, empirical, and technical investigations, applied iteratively and integratively (see Section 3). Fourth, Value Sensitive Design enlarges the scope of human values beyond those of cooperation (CSCW) and participation and democracy (Participatory Design) to include all values, especially those with moral import. By moral, we refer to issues that pertain to fairness, justice, human welfare and virtue, encompassing within moral philosophical theory deontology [Dworkin 1978; Gewirth 1978; Kant 1785/1964; Rawls 1971], consequentialism ([Smart and Williams 1973]; see Scheffler [1982] for an analysis), and virtue [Foot 1978; MacIntyre 1984; Campbell and Christopher 1996]. Value Sensitive Design also accounts for conventions (e.g., standardization of protocols) and personal values (e.g., color preferences within a graphical user interface). Fifth, Value Sensitive Design distinguishes between usability and human values with ethical import. Usability refers to characteristics of a system that make it work in a functional sense, including that it is easy to use, easy to learn, consistent, and recovers easily from errors [Adler and Winograd 1992; Norman 1988; Nielsen 1993]. However, not all highly usable systems support ethical values. Nielsen [1993], for example, asks us to imagine a computer system that checks for fraudulent applications of people who are applying for unemployment benefits by asking applicants numerous personal questions, and then checking for inconsistencies in their responses. Nielsen’s point is that even if the system receives high usability scores some people may not find the system socially acceptable, based on the moral value of privacy. Sixth, Value Sensitive Design identifies and takes seriously two classes of stakeholders: direct and indirect. Direct stakeholders refer to parties – individuals or organizations – who interact directly with the computer system or its output. Indirect stakeholders refer to all other parties who are affected by the use of the system. Often, indirect stakeholders are ignored in the design process. For example, computerized medical records systems have often been designed with many of the direct stakeholders in mind (e.g., insurance companies, hospitals, doctors, and nurses), but with too little regard for the values, such as the value of privacy, of a rather important group of indirect stakeholders: the patients. Seventh, Value Sensitive Design is an interactional theory: values are viewed neither as inscribed into technology (an endogenous theory), nor as simply transmitted by social forces (an exogenous theory). Rather, the interactional position holds that while the features or properties that people design into technologies more readily support certain values and hinder others, the technology’s actual use depends on the goals of the people interacting with it. A screwdriver, after all, is well-suited for turning screws, and is also amenable to use as a poker, pry bar, nail set, cutting device, and tool to dig up weeds, but functions poorly as a ladle, pillow, or wheel. 
Similarly, an online calendar system that displays individuals’ scheduled events in detail readily supports accountability within an organization but makes privacy difficult. Moreover, through human interaction, technology itself changes over time. On occasion, such changes (as emphasized in the exogenous position) can mean the societal rejection of a technology, or that its acceptance is delayed. But more often it entails an iterative process whereby technologies are first invented, and then redesigned based on user interactions, which then are reintroduced to users, further interactions occur, and further redesigns 13

implemented. Typical software updates (e.g., of word processors, browsers, and operating systems) epitomize this iterative process. Eighth, Value Sensitive Design builds from the psychological proposition that certain values are universally held, although how such values play out in a particular culture at a particular point in time can vary considerably [Kahn 1999; Turiel 1998, 2002]. For example, even while living in an igloo, Inuits have conventions that ensure some forms of privacy; yet such forms of privacy are not maintained by separated rooms, as they are in most Western cultures. Generally, the more concretely (act-based) one conceptualizes a value, the more one will be led to recognizing cultural variation; conversely, the more abstractly one conceptualizes a value, the more one will be led to recognizing universals. Value Sensitive Design seeks to work both levels, the concrete and abstract, depending on the design problem at hand. Note that this is an empirical proposition, based on a large amount of psychological and anthropological data, not a philosophical one. We also make this claim only for certain values, not all – there are clearly some values that are culture-specific. The three case studies presented in Section 5 illustrate the different features in this constellation. For example, UrbanSim illustrates the goal of being proactive and influencing the design of the technology early in and throughout the design process (Feature 1), and also involves enlarging the arena in which values arise to include urban planning and democratic participation in public decision-making (Feature 2). The cookies work is a good illustration of Value Sensitive Design’s tripartite methodology (Feature 3): conceptual, technical, and empirical investigations, applied iteratively and integratively, were essential to the success of the project. Each of the three projects brings out a different set of human values (Feature 4): among others, informed consent for the cookies work; physical and psychological well-being and privacy in public spaces for Room with a View; and fairness, accountability, and democracy for UrbanSim, as well as the whole range of different sometimes competing stakeholder values. The cookies project illustrates the complex interaction between usability and human values (Feature 5): early versions of the system supported informed consent at the expense of usability, requiring additional work to develop a system that was both usable and provided reasonable support for informed consent. The Room with a View work considers and takes seriously both direct and indirect stakeholders (Feature 6): the occupants of the inside office (“The Watchers”), and passers-by in the plaza (“The Watched”). Value Sensitive Design’s position that values are neither inscribed into technology nor simply transmitted by social forces (Feature 7) is illustrated by UrbanSim: the system by itself is certainly not neutral with respect to democratic process, but at the same time does not on its own ensure democratic decision-making on land use and transportation issues. 
Finally, the proposition that certain values are universally held, but play out in very different ways in different cultures and different times (Feature 8) is illustrated by the Room with a View project: the work is informed by a substantial body of work on the importance of privacy in all cultures (for example, the deep connection between privacy and self-identity), but concerns about privacy in public spaces play out in a specific way in the United States, and might do so quite differently in another cultural context. We could draw out additional examples that illustrate Value Sensitive Design’s constellation of features, both from the three case studies presented in Section 5, and in other projects; but hope that this short description demonstrates the unique contribution that Value Sensitive Design can make to the design of technology.


6. PRACTICAL SUGGESTIONS FOR USING VALUE SENSITIVE DESIGN One natural question with Value Sensitive Design is, “How exactly do I do it?” In this section we offer some practical suggestions. 6.1. Start With a Value, Technology, or Context of Use Any of these three core aspects – a value, technology, or context of use – easily motivates Value Sensitive Design. We suggest starting with the aspect that is most central to your work and interests. In the case of Informed Consent and Cookies, for example, Friedman et al. began with a value of central interest (informed consent) and moved from that value to its implications for Web browser design. In the case of UrbanSim, Borning et al. began with a technology (urban simulation) and a context of use (the urban planning process); upon inspection of those two, values issues quickly came to the fore. 6.2. Identify Direct and Indirect Stakeholders As part of the initial conceptual investigation, systematically identify direct and indirect stakeholders. Recall that direct stakeholders are those individuals who interact directly with the technology or with the technology’s output. Indirect stakeholders are those individuals who are also impacted by the system, though they never interact directly with it. In addition, it is worthwhile to recognize the following: • Within each of these two overarching categories of stakeholders, there may be several subgroups. • A single individual may be a member of more than one stakeholder group or subgroup. For example, in the UrbanSim project, an individual who works as an urban planner and lives in the area is both a direct stakeholder (i.e., through his or her direct use of the simulation to evaluate proposed transportation plans) and an indirect stakeholder (i.e., by virtue of living in the community for which the transportation plans will be implemented). • An organizational power structure is often orthogonal to the distinction between direct and indirect stakeholders. For example, there might be low-level employees who are either direct or indirect stakeholders and who don’t have control over using the system (e.g., workers on an assembly line). Participatory Design has contributed a substantial body of analysis to these issues, as well as techniques for dealing with them, such as ways of equalizing power among groups with unequal power. (See the references cited in Section 2.1.) 6.3. Identify Benefits and Harms for Each Stakeholder Group Having identified the key stakeholders, systematically identify the benefits and harms for each group. In doing so, we suggest attention to the following points: • Indirect stakeholders will be benefited or harmed to varying degrees; and in some designs it is probably possible to claim every human as an indirect stakeholder of some sort. Thus, one rule of thumb in the conceptual investigation is to give priority to indirect stakeholders who are strongly affected, or to large groups that are somewhat affected. • Attend to issues of technical, cognitive, and physical competency. For example, children or the elderly might have limited cognitive competency. In such a case, care must be taken to ensure that their interests are represented in the design process, either by representatives from the affected groups themselves or, if this is not possible, by advocates. 15

• Personas [Pruitt and Grudin 2003] are a popular technique that can be useful for identifying the benefits and harms to each stakeholder group. However, we note two caveats. First, personas have a tendency to lead to stereotypes because they require a list of “socially coherent” attributes to be associated with the “imagined individual.” Second, while in the literature each persona represents a different user group, in Value Sensitive Design (as noted above) the same individual may be a member of more than one stakeholder group. Thus, in our practice, we have deviated from the typical use of personas that maps a single persona onto a single user group, to allow for a single persona to map onto to multiple stakeholder groups. 6.4. Map Benefits and Harms onto Corresponding Values With a list of benefits and harms in hand, one is in a strong position to recognize corresponding values. Sometimes the mapping is one of identity. For example, a harm that is characterized as invasion of privacy maps onto the value of privacy. Other times the mapping is less direct if not multifaceted. For example, with the Room with a View study, it is possible that a direct stakeholder’s mood is improved when working in an office with an augmented window (as compared with no window). Such a benefit potentially implicates not only the value of psychological welfare, but also creativity, productivity, and physical welfare (health), assuming there is a causal link between improved mood and these other factors. In some cases, the corresponding values will be obvious, but not always. Table 1 in Section 5.8 provides a table of human values with ethical import often implicated in system design. This table may be useful in suggesting values that should be considered in the investigation. 6.5. Conduct a Conceptual Investigation of Key Values Following the identification of key values in play, a conceptual investigation of each can follow. Here it is helpful to turn to the relevant literature. In particular, the philosophical ontological literature can help provide criteria for what a value is, and thereby how to assess it empirically. (For example, Section 4.1.1 described how existing literature helped provide criteria for the value of informed consent.) 6.6. Identify Potential Value Conflicts Values often come into conflict. Thus, once key values have been identified and carefully defined, a next step entails examining potential conflicts. For the purposes of design, value conflicts should usually not be conceived of as “either/or” situations, but as constraints on the design space. Admittedly, at times designs that support one value directly hinder support for another. In those instances, a good deal of discussion among the stakeholders may be warranted to identify the space of workable solutions. Typical value conflicts include accountability vs. privacy, trust vs. security, environmental sustainability vs. economic development, privacy vs. security, and hierarchical control vs. democratization. 6.7. Integrate Value Considerations Into One’s Organizational Structure Ideally, Value Sensitive Design will work in concert with organizational objectives. Within a company, for example, designers would bring values into the forefront, and in the process generate increased revenue, employee satisfaction, customer loyalty, and other desirable outcomes for their companies. 
In turn, within a government agency, designers would both better support national and community values, and enhance the organization’s ability to achieve its objectives. In the real world, of course, human values

(especially those with ethical import) may collide with economic objectives, power, and other factors. However, even in such situations, Value Sensitive Design should be able to make positive contributions, by showing alternate designs that better support enduring human values. For example, if a standards committee were considering adopting a protocol that raised serious privacy concerns, a Value Sensitive Design analysis and design might result in an alternate protocol that better addressed the issue of privacy while still retaining other needed properties. Citizens, advocacy groups, staff members, politicians, and others could then have a more effective argument against a claim that the proposed protocol was the only reasonable choice.

6.8. Human Values (with Ethical Import) Often Implicated in System Design We stated earlier that while all values fall within its purview, Value Sensitive Design emphasizes values with ethical import. In Table 1, we present a list of frequently implicated values. This table is intended as a heuristic for suggesting values that should be considered in the investigation – it is definitely not intended as a complete list of human values that might be implicated.

Table 1. Human Values (with Ethical Import) Often Implicated in System Design

Human Welfare. Definition: Refers to people’s physical, material, and psychological well-being. Sample literature: Leveson [1991]; Friedman, Kahn, & Hagman [2003]; Neumann [1995]; Turiel [1983, 1998].

Ownership and Property. Definition: Refers to a right to possess an object (or information), use it, manage it, derive income from it, and bequeath it. Sample literature: Becker [1977]; Friedman [1997b]; Herskovits [1952]; Lipinski & Britz [2000].

Privacy. Definition: Refers to a claim, an entitlement, or a right of an individual to determine what information about himself or herself can be communicated to others. Sample literature: Agre and Rotenberg [1998]; Bellotti [1998]; Boyle, Edwards, & Greenberg [2000]; Friedman [1997b]; Fuchs [1999]; Jancke, Venolia, Grudin, Cadiz, and Gupta [2001]; Palen & Dourish [2003]; Nissenbaum [1998]; Phillips [1998]; Schoeman [1984]; Svensson, Hook, Laaksolahti, & Waern [2001].

Freedom From Bias. Definition: Refers to systematic unfairness perpetrated on individuals or groups, including pre-existing social bias, technical bias, and emergent social bias. Sample literature: Friedman & Nissenbaum [1996]; cf. Nass & Gong [2000]; Reeves & Nass [1996].

Universal Usability. Definition: Refers to making all people successful users of information technology. Sample literature: Aberg & Shahmehri [2001]; Shneiderman [1999, 2000]; Cooper & Rejmer [2001]; Jacko, Dixon, Rosa, Scott, & Pappas [1999]; Stephanidis [2001].

Trust. Definition: Refers to expectations that exist between people who can experience good will, extend good will toward others, feel vulnerable, and experience betrayal. Sample literature: Baier [1986]; Camp [2000]; Dieberger, Hook, Svensson, & Lonnqvist [2001]; Egger [2000]; Fogg & Tseng [1999]; Friedman, Kahn, & Howe [2000]; Kahn & Turiel [1988]; Mayer, Davis, & Schoorman [1995]; Olson & Olson [2000]; Nissenbaum [2001]; Rocco [1998].

Autonomy. Definition: Refers to people’s ability to decide, plan, and act in ways that they believe will help them to achieve their goals. Sample literature: Friedman & Nissenbaum [1997]; Hill [1991]; Isaacs, Tang, & Morris [1996]; Suchman [1994]; Winograd [1994].

Informed Consent. Definition: Refers to garnering people’s agreement, encompassing criteria of disclosure and comprehension (for “informed”) and voluntariness, competence, and agreement (for “consent”). Sample literature: Faden & Beauchamp [1986]; Friedman, Millett, & Felten [2000]; The Belmont Report [1978].

Accountability. Definition: Refers to the properties that ensure that the actions of a person, people, or institution may be traced uniquely to the person, people, or institution. Sample literature: Friedman & Kahn [1992]; Friedman & Millett [1995]; Reeves & Nass [1996].

Courtesy. Definition: Refers to treating people with politeness and consideration. Sample literature: Bennett & Delatree [1978]; Wynne & Ryan [1993].

Identity. Definition: Refers to people’s understanding of who they are over time, embracing both continuity and discontinuity over time. Sample literature: Bers, Gonzalez-Heydrich, & DeMaso [2001]; Rosenberg [1997]; Schiano & White [1998]; Turkle [1996].

Calmness. Definition: Refers to a peaceful and composed psychological state. Sample literature: Friedman & Kahn [2003]; Weiser & Brown [1997].

Environmental Sustainability. Definition: Refers to sustaining ecosystems such that they meet the needs of the present without compromising future generations. Sample literature: United Nations [1992]; World Commission on Environment and Development [1987]; Hart [1999]; Moldan, Billharz, & Matravers [1997]; Northwest Environment Watch [2002].
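One lightweight way to operationalize the stakeholder analysis of Sections 6.2 through 6.4 is to keep a simple machine-readable record of each stakeholder group, its identified benefits and harms, and the Table 1 values they implicate. The sketch below is a minimal illustration only; the group names and findings are hypothetical examples (loosely modeled on the Room with a View case), and the final check merely lists value pairs that may deserve the conflict analysis of Section 6.6.

```python
# Minimal, illustrative bookkeeping for a Value Sensitive Design investigation.
# Group names and findings are hypothetical examples, not prescribed tooling.
from dataclasses import dataclass, field

@dataclass
class Finding:
    description: str                              # benefit or harm, in stakeholders' terms
    kind: str                                     # "benefit" or "harm"
    values: list = field(default_factory=list)    # Table 1 values implicated

@dataclass
class StakeholderGroup:
    name: str
    role: str                                     # "direct" or "indirect"; a person may appear in both
    findings: list = field(default_factory=list)

groups = [
    StakeholderGroup("office workers with an augmented window", "direct", [
        Finding("improved mood while working", "benefit",
                ["psychological welfare", "creativity", "physical welfare"]),
    ]),
    StakeholderGroup("people walking through the filmed plaza", "indirect", [
        Finding("being displayed on indoor screens without agreement", "harm",
                ["privacy", "informed consent"]),
    ]),
]

# Surface value pairs that may need the conflict analysis of Section 6.6:
# values implicated by one group's benefit and another group's harm.
benefit_values = {v for g in groups for f in g.findings if f.kind == "benefit" for v in f.values}
harm_values = {v for g in groups for f in g.findings if f.kind == "harm" for v in f.values}
for bv in sorted(benefit_values):
    for hv in sorted(harm_values):
        print(f"possible value tension to examine: {bv} (benefit) vs. {hv} (harm)")
```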

Two caveats. First, not all of these values are fundamentally distinct from one another. Nonetheless, each value has its own language and conceptualizations within its respective field, and thus warrants separate treatment here. Second, as noted above, this list is not comprehensive. Perhaps no list could be, at least within the confines of a paper. Peacefulness, respect, compassion, love, warmth, creativity, humor, originality, vision, friendship, cooperation, collaboration, purposefulness, devotion, loyalty, diplomacy, kindness, musicality, harmony – the list of other possible moral and non-moral values could get very long very quickly. Our particular list comprises many of the values that hinge on the deontological and consequentialist moral orientations noted above: human welfare, ownership and property, privacy, freedom from bias, universal usability, trust, autonomy, informed consent, and accountability. In addition, we have chosen several other values related to system design: courtesy, identity, calmness, and environmental sustainability. 6.9. Heuristics for Interviewing Stakeholders As part of an empirical investigation, it is useful to interview stakeholders, to better understand their judgments about a context of use, an existing technology, or a proposed design. A semi-structured interview often offers a good balance between addressing the

questions of interest and gathering new and unexpected insights. In these interviews, the following heuristics can prove useful: • In probing stakeholders’ reasons for their judgments, the simple question “Why?” can go a good distance. For example, seniors evaluating a ubiquitous computing video surveillance system might respond negatively to the system. When asked “Why?” a response might be: “I don’t mind my family knowing that other people are visiting me, so they don’t worry that I’m alone – I just don’t want them to know who is visiting.” The researcher can probe again: “Why don’t you want them to know?” An answer might be: “I might have a new friend I don’t want them to know about. It’s not their business.” Here the first “why” question elicits information about a value conflict (the family’s desire to know about the senior’s well-being and the senior’s desire to control some information); the second “why” question elicits further information about the value of privacy for the senior. • Ask about values not only directly, but indirectly, based on formal criteria specified in the conceptual investigation. For example, suppose that you want to conduct an empirical investigation of people’s reasoning and values about “X” (say, trust, privacy, or informed consent), and that you decided to employ an interview methodology. One option is to ask people directly about the topic. “What is X?” “How do you reason about X?” “Can you give me an example from your own life of when you encountered a problem that involved X?” There is some merit to this direct approach. Certainly it gives people the opportunity to define the problem in their own terms. But you may quickly discover that it comes up short. Perhaps the greatest problem is that people have concepts about many aspects of the topic on which they cannot directly reflect. Rather, you will usually be better served by employing an alternative approach. As is common in social cognitive research (see Kahn [1999], chap. 5, for a discussion of methods), you could interview people about a hypothetical situation, or a common everyday event in their lives, or a task that you have asked them to solve, or a behavior in which they have just engaged. But, no matter what you choose, the important point is a priori to conceptualize what the topic entails, if possible demarcating its boundaries through formal criteria, and at a minimum employing issues or tasks that engage people’s reasoning about the topic under investigation. 6.10. Heuristics for Technical Investigations When engaging in value-oriented technical investigations, the following heuristics can prove useful: • Technical mechanisms will often adjudicate multiple if not conflicting values, often in the form of design trade-offs. We have found it helpful to make explicit how a design trade-off maps onto a value conflict and differentially affects different groups of stakeholders. For example, the Room with a View study suggests real-time displays in interior offices may provide physiological benefits for those in the inside offices (the direct stakeholders), yet may impinge on the privacy and security of those walking through the outdoor scene (the indirect stakeholders), and especially women. • Unanticipated values and value conflicts often emerge after a system is developed and deployed. Thus, when possible, design flexibility into the underlying technical architecture so that it can be responsive to such emergent concerns. In UrbanSim, for example, Borning et al. 
used agile programming techniques to design an architecture that can more readily accommodate new indicators and models. • The control of information flow through underlying protocols – and the privacy concerns surrounding such control – is a strongly contested area. Ubiquitous

computing, with sensors that collect and then disseminate information at large, has only intensified these concerns. We suggest that underlying protocols that release information should be able to be turned off (and in such a way that the stakeholders are confident they have been turned off). 7. CONCLUSION There is a growing interest and challenge to address values in design. Our goal in this paper has been to provide enough detail about Value Sensitive Design so that other researchers and designers can critically examine, use, and extend this approach. Our hope is that this approach can contribute to a principled and comprehensive consideration of values in the design of information and computational systems. ACKNOWLEDGMENTS Value Sensitive Design has emerged over the past decade and benefited from discussions with many people. We would like particularly to acknowledge all the members of our respective research groups, along with Edward Felten, Jonathan Grudin, Sara Kiesler, Clifford Nass, Helen Nissenbaum, John Thomas, and Terry Winograd. This research was supported in part by NSF Awards IIS-9911185, IIS-0325035, EIA-0121326, and EIA0090832.


REFERENCES ABERG, J., and SHAHMEHRI, N. 2001. An empirical study of human Web assistants: Implications for user support in Web information systems. In Proceedings of the Conference on Human Factors in Computing Systems (CHI 2000) (pp. 404-411). New York, NY: Association for Computing Machinery Press. ACKERMAN, M. S., and CRANOR, L. 1999. Privacy critics: UI components to safeguard users’ privacy. In Extended Abstracts of CHI 1999, ACM Press, 258-259. ADLER, P. S., and WINOGRAD, T., Eds. 1992. Usability: Turning Technologies into Tools. Oxford: Oxford University Press. AGRE, P. E., and ROTENBERG, M., Eds. 1998. Technology and Privacy: The New Landscape. MIT Press, Cambridge, MA. BAIER, A. 1986. Trust and antitrust. Ethics, 231-260. BECK, A., and KATCHER, A. 1996. Between Pets and People. West Lafayette, IN: Purdue University Press. BECKER, L. C. 1977. Property Rights: Philosophical Foundations. London, England: Routledge & Kegan Paul. BELLOTTI, V. 1998. Design for privacy in multimedia computing and communications environments. In P. E. Agre and M. Rotenberg, Eds., Technology and Privacy: The New Landscape (pp. 63-98). Cambridge, MA: The MIT Press. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research. 1978. The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. BENNETT, W. J., and DELATREE, E. J. 1978. Moral education in the schools. The Public Interest, 50, 8198. BERS, M. U., GONZALEZ-HEYDRICH, J., and DEMASO, D. R. 2001. Identity construction environments: Supporting a virtual therapeutic community of pediatric patients undergoing dialysis. In Proceedings of the Conference of Human Factors in Computing Systems (CHI 2001), 380-387. New York, NY: Association for Computing Machinery. BJERKNES, G., and BRATTETEIG, T. 1995. User participation and democracy: A discussion of Scandinavian research on system development. Scandinavian Journal of Information Systems, 7(1), 73-97. BØDKER, S. 1990. Through the Interface – A Human Activity Approach to User Interface Design. Hillsdale, NJ: Lawrence Erlbaum Associates. BOYLE, M., EDWARDS, C., and GREENBERG, S. 2000. The effects of filtered video on awareness and privacy. In Proceedings of Conference on Computer Supported Cooperative Work (CSCW 2000), 1-10. New York, NY: Association for Computing Machinery. BYNUM, T. W., Ed. 1985. Metaphilosophy, 16(4). [Entire issue.] CAMP, L. J. 2000. Trust & Risk in Internet Commerce. MIT Press, Cambridge, MA. CAMPBELL, R. L., and CHRISTOPHER, J. C. 1996. Moral development theory: A critique of its Kantian presuppositions. Developmental Review, 16, 1-47. COOPER, M., and REJMER, P. 2001. Case study: Localization of an accessibility evaluation. In Extended Abstracts of the Conference on Human Factors in Computing Systems (CHI 2001), 141-142. New York, NY: Association for Computing Machinery Press.


DIEBERGER, A., HOOK, K., SVENSSON, M., and LONNQVIST, P. 2001. Social navigation research agenda. In Extended Abstracts of the Conference on Human Factors in Computing Systems (CHI 2001), 107-108. New York, NY: Association of Computing Machinery Press. DWORKIN, R. 1978. Taking Rights Seriously. Cambridge, MA: Harvard University Press. EGGER, F. N. 2000. “Trust me, I’m an online vendor”: Towards a model of trust for e-commerce system design. In Extended Abstracts of the Conference of Human Factors in Computing Systems (CHI 2000), 101102. New York, NY: Association for Computing Machinery. EHN, P. 1989. Work-Oriented Design of Computer Artifacts. Hillsdale, NJ: Lawrence Erlbaum Associates. FADEN, R. and BEAUCHAMP, T. 1986. A History and Theory of Informed Consent. New York, NY: Oxford University Press. FOGG, B.J., and TSENG, H. 1999. The elements of computer credibility. In Proceedings of CHI 1999, ACM Press, 80-87. FOOT, P. 1978. Virtues and Vices. Berkeley and Los Angeles, CA: University of California Press. FRANKENA, W. 1972. Value and valuation. In P. Edwards, Ed., The Encyclopedia of Philosophy, Vol. 7-8. (pp. 409-410). New York, NY: Macmillan. FRANKLIN, J., WADDELL, P., and BRITTING, J. 2002. Sensitivity Analysis Approach for an Integrated Land Development & Travel Demand Modeling System, Presented at the Association of Collegiate Schools of Planning 44th Annual Conference, November 21-24, 2002, Baltimore, MD. Preprint available from www.urbansim.org. FREEMAN-BENSON, B.N., and BORNING, A. 2003. YP and urban simulation: Applying an agile programming methodology in a politically tempestuous domain. In Proceedings of the 2003 Agile Programming Conference, Salt Lake City, June 2003. Preprint available from www.urbansim.org. FRIEDMAN, B., Ed. 1997a. Human Values and the Design of Computer Technology. Cambridge University Press, New York NY. FRIEDMAN, B. 1997b. Social judgments and technological innovation: Adolescents’ understanding of property, privacy, and electronic information. Computers in Human Behavior, 13(3), 327-351. FRIEDMAN, B., HOWE, D. C., and FELTEN, E. 2002. Informed consent in the Mozilla browser: Implementing Value-Sensitive Design. In Proceedings of HICSS-35, IEEE Computer Society, Abstract, p. 247; CD-ROM of full papers, OSPE101. FRIEDMAN, B. and KAHN, P. H., JR. 1992. Human agency and responsible computing: Implications for computer system design. Journal of Systems Software, 17, 7-14. FRIEDMAN, B., KAHN, P. H., JR., and HOWE, D. C. 2000. Trust online. Commun. ACM, 43, 12, 34-40. FRIEDMAN, B., and KAHN, P. H., JR. 2003. Human values, ethics, and design. In J. Jacko and A. Sears, Eds., The Human-Computer Interaction Handbook. Lawrence Erlbaum Associates, Mahwah NJ. FRIEDMAN, B., KAHN, P. H., JR., AND HAGMAN, J. 2003. Hardware companions?: What online AIBO discussion forums reveal about the human-robotic relationship. Conference Proceedings of CHI 2003, 273 – 280. New York, NY: ACM Press. FRIEDMAN, B., KAHN, P. H., JR., AND HAGMAN, J. 2004. The Watcher and The Watched: Social judgments about privacy in a public place. Online Proceedings of CHI Fringe 2004. Vienna, Austria: ACM CHI Place, 2004. (http://www.chiplace.org/chifringe/2004/198). FRIEDMAN, B., and MILLETT, L. 1995. “It's the computer's fault” – Reasoning about computers as moral agents. In Conference Companion of the Conference on Human Factors in Computing Systems (CHI 95) (pp. 226-227). New York, NY: Association for Computing Machinery Press.


FRIEDMAN, B., MILLETT, L., and FELTEN, E. 2000. Informed Consent Online: A Conceptual Model and Design Principles. University of Washington Computer Science & Engineering Technical Report 00-12-2. FRIEDMAN, B., and NISSENBAUM, H. Bias in computer systems. 1996. ACM Transactions on Information Systems, 14, 3, 330-347. FRIEDMAN, B., and NISSENBAUM, H. 1997. Software agents and user autonomy. Proceedings of the First International Conference on Autonomous Agents, 466-469. New York, NY: Association for Computing Machinery Press. FUCHS, L. 1999. AREA: A cross-application notification service for groupware. In Proceedings of ECSCW 1999, Kluwer, Dordrechet Germany, 61-80. GALEGHER, J., KRAUT, R. E., and EGIDO, C., Eds. 1990. Intellectual Teamwork: Social and Technological Foundations of Cooperative Work. Hillsdale, NJ: Lawrence Erlbaum Associates. GALLOPIN, G.C. 1997. Indicators and their use: Information for decision-making. In B. Moldan, S. Billharz and R. Matravers, Eds., Sustainability Indicators: A Report on the Project on Indicators of Sustainable Development, Wiley, Chichester, England. GEWIRTH, A. 1978. Reason and Morality. Chicago, IL: University of Chicago Press. GREENBAUM, J., and KYNG, M., Eds. 1991. Design at Work: Cooperative Design of Computer Systems. Hillsdale, NJ: Lawrence Erlbaum Associates. GRUDIN, J. 1988. Why CSCW applications fail: Problems in the design and evaluation of organizational interfaces. In Proceedings of the Conference on Computer Supported Cooperative Work (CSCW ‘88), 85-93. New York, NY: Association for Computing Machinery Press. HAGMAN, J., HENDRICKSON, A., and WHITTY, A. 2003. What’s in a barcode: Informed consent and machine scannable driver licenses. CHI 2003 Extended Abstracts of the Conference on Human Factors in Computing System, 912-913. New York, NY: ACM Press. HART, M. 1999. Guide to Sustainable Community Indicators, Hart Environmental Data, PO Box 361, North Andover, MA 01845, second edition. HERSKOVITS, M. J. 1952. Economic Anthropology: A Study of Comparative Economics. New York, NY: Alfred A. Knopf. HILL, T. E., JR. 1991. Autonomy and self-respect. Cambridge: Cambridge University Press. ISAACS, E. A., TANG, J. C., and MORRIS, T. 1996. Piazza: A desktop environment supporting impromptu and planned interactions. In Proceedings of the Conference on Computer Supported Cooperative Work (CSCW 96), 315-324. New York, NY: Association for Computing Machinery Press. JACKO, J. A., DIXON, M. A., ROSA, R. H., JR., SCOTT, I. U., and PAPPAS, C. J. 1999. Visual profiles: A critical component of universal access. In Proceedings of the Conference on Human Factors in Computing Systems (CHI 99), 330-337. New York, NY: Association for Computing Machinery Press. JANCKE, G., VENOLIA, G. D., GRUDIN, J., CADIZ, J. J. and GUPTA, A. 2001. Linking public spaces: Technical and social issues. In Proceedings of CHI 2001, 530-537. JOHNSON, E. H. 2000. Getting beyond the simple assumptions of organization impact [social informatics]. Bulletin of the American Society for Information Science, 26, 3, 18-19. JOHNSON, D. G., and MILLER, K. 1997. Ethical issues for computer scientists and engineers. In A. B. Tucker, Jr., Ed.-in-Chief, The Computer Science and Engineering Handbook (pp. 16-26). CRC Press. KAHN, P. H., JR. 1999. The Human Relationship with Nature: Development and Culture. MIT Press, Cambridge MA. KAHN, P. H., JR., and KELLERT, S. R., Eds. 2002. Children and Nature: Psychological, Sociocultural, and Evolutionary Investigations. MIT Press, Cambridge MA.


KAHN, P. H., JR., and TURIEL, E. 1988. Children's conceptions of trust in the context of social expectations. Merrill-Palmer Quarterly, 34, 403-419. KANT, I. 1964. Groundwork of the Metaphysic of Morals (H. J. Paton, Trans.). New York, NY: Harper Torchbooks. (Original work published 1785.) KLING, R., ROSENBAUM, H., and HERT, C. 1998. Social informatics in information science: An introduction. Journal of the American Society for Information Science, 49(12), 1047-1052. KLING, R., and STAR, S. L. 1998. Human centered systems in the perspective of organizational and social informatics. Computers and Society, 28(1), 22-29. KYNG, M., and MATHIASSEN, L., Eds. 1997. Computers and Design in Context. Cambridge, MA: The MIT Press. LEVESON, N. G. 1991. Software safety in embedded computer systems. Commun. ACM, 34, 2, 34-46. LIPINSKI, T. A., and BRITZ, J. J. 2000. Rethinking the ownership of information in the 21st century: Ethical implications. Ethics and Information Technology, 2, 1, 49-71. MACINTYRE, A. 1984. After Virtue. Nortre Dame: University of Nortre Dame Press. MAYER, R.C., DAVIS, J.H, AND SCHOORMAN, F.D., 1995. An integrative model of organizational trust. The Academy of Management Review, 20, 3, 709-734. MILLETT, L., FRIEDMAN, B., and FELTEN, E. 2001. Cookies and web browser design: Toward realizing informed consent online. In Proceedings of CHI 2001, ACM Press, 46-52. MOLDAN, B., BILLHARZ, S., and MATRAVERS, R., Eds., 1997. Sustainability Indicators: A Report on the Project on Indicators of Sustainable Development, Wiley, Chichester, England. MOORE, G. E. 1978. Principia ethica. Cambridge: Cambridge University Press. (Original work published 1903.) NASS, C., and GONG, L. 2000. Speech interfaces from an evolutionary perspective. Communications of the ACM, 43(9), 36-43. NEUMANN, P. G. 1995. Computer Related Risks. New York, NY: Association for Computing Machinery Press. NIELSEN, J. 1993. Usability Engineering. Boston, MA: AP Professional. NISSENBAUM, H. 1998. Protecting privacy in an information age: The problem with privacy in public. Law and Philosophy, 17, 559-596. NISSENBAUM, H. 1999. Can trust be secured online? A theoretical perspective. Etica e Politca, 2 (Electronic journal). NISSENBAUM, H. 2001. Securing trust online: Wisdom or oxymoron. Boston University Law Review, 81(3), 635-664. NORMAN, D. A. 1988. The Psychology of Everyday Things. New York, NY: Basic Books. NORTHWEST ENVIRONMENT WATCH. 2002. This Place on Earth 2002: Measuring What Matters. Northwest Environment Watch, 1402 Third Avenue, Seattle, WA 98101. NOTH, M., BORNING, A., and WADDELL, P. 2003. An extensible, modular architecture for simulating urban development, transportation, and environmental impacts. Computers, Environment and Urban Systems, 27, 2, 181-203. OLSON, J. S., and OLSON, G. M. 2000. i2i trust in e-commerce. Communications of the ACM, 43(12), 41-44.


OLSON, J. S., and TEASLEY, S. 1996. Groupware in the wild: Lessons learned from a year of virtual collaboration. In Proceedings of the Conference on Computer Supported Cooperative Work (CSCW 96), 419427. New York, NY: Association for Computing Machinery Press. ORLIKOWSI, W. J., and IACONO, C. S. 2001. Research commentary: desperately seeking the “IT” in IT research—a call to theorizing the IT artifact. Information Systems Research, 12, 2, 121-134. PALEN, L., and GRUDIN, J. 2003. Discretionary adoption of group support software: Lessons from calendar applications. In B.E. Munkvold, Ed., Implementing Collaboration Technologies in Industry, Springer Verlag, Heidelberg. PALEN, L., and DOURISH, P. 2003. Privacy and trust: Unpacking “privacy” for a networked world. In Proceedings of CHI 2003, 129-136. PALMER, K., Ed. 1998. Indicators of Sustainable Community, Sustainable Seattle, Seattle, WA. PHILLIPS, D. J. 1998. Cryptography, secrets, and structuring of trust. In P. E. Agre and M. Rotenberg, Eds., Technology and Privacy: The New Landscape (pp. 243-276). Cambridge, MA: The MIT Press. PRUITT, J., and GRUDIN, J. 2003. Personas: Practice and theory. In Proceedings of DUX 2003, ACM Press. RAWLS, J. 1971. A Theory of Justice. Cambridge, MA: Harvard University Press. REEVES, B., and NASS, C. 1996. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. New York, NY and Stanford, CA: Cambridge University Press and CSLI Publications. RIEGELSBERGER, J., and SASSE, M. A. 2002. Face it – Photos don’t make a web site trustworthy, in Extended Abstracts of CHI 2002, ACM Press, 742-743. ROCCO, E. 1998. Trust breaks down in electronic contexts but can be repaired by some initial face-to-face contact. In Proceedings of CHI 1998, ACM Press, 496-502. ROSENBERG, S. 1997. Multiplicity of selves. In R. D. Ashmore and L. Jussim, Eds., Self and Identity: Fundamental Issues (pp. 23-45). New York, NY: Oxford University Press. SAWYER, S, and ROSENBAUM, H. 2000. Social informatics in the information sciences: Current activities and emerging direction. Informing Science, 3(2), 89-95. SCHEFFLER, S. 1982. The Rejection of Consequentialism. Oxford, England: Oxford University Press. SCHIANO, D. J., and WHITE, S. 1998. The first noble truth of cyberspace: People are people (even when they MOO). In Proceedings of the Conference of Human Factors in Computing Systems (CHI 98), 352-359. New York, NY: Association for Computing Machinery. SCHNEIDER, F. B., Ed. 1999. Trust in Cyberspace. National Academy Press, Washington, D.C. SCHOEMAN, F. D., Ed. 1984. Philosophical Dimensions of Privacy: An Anthology. Cambridge, England: Cambridge University Press. SHNEIDERMAN, B. 1999. Universal usability: Pushing human-computer interaction research to empower every citizen. ISR Technical Report 99-72. University of Maryland, Institute for Systems Research. College Park, MD. SHNEIDERMAN, B. 2000. Universal usability. Commun. of the ACM, 43, 5, 84-91. SIMPSON, J.A., and WEINER, E.S.C., Eds. 1989. “value, n.” Oxford English Dictionary. Oxford: Clarendon Press, 1989. OED Online. Oxford University Press. 30 May 2003. http://dictionary.oed.com/cgi/entry/00274678 SMART, J. J. C. and WILLIAMS, B. 1973. Utilitarianism For and Against. Cambridge: Cambridge University Press.


STEPHANIDIS, C., Ed. 2001. User Interfaces for All: Concepts, Methods, and Tools. Mahwah, NJ: Lawrence Erlbaum Associates. SUCHMAN, L. 1994. Do categories have politics? The language/action perspective reconsidered. CSCW Journal, 2, 3 , 177-190. SVENSSON, M., HOOK, K., LAAKSOLAHTI, J., and WAERN, A. 2001. Social navigation of food recipes. In Proceedings of the Conference of Human Factors in Computing Systems (CHI 2001), 341-348. New York, NY: Association for Computing Machinery. TANG, J. C. 1997. Eliminating a hardware switch: Weighing economics and values in a design decision. In B. Friedman, Ed., Human Values and the Design of Computer Technology, (pp. 259-269). Cambridge Univ. Press, New York NY. THOMAS, J. C. 1997. Steps toward universal access within a communications company. In B. Friedman, Ed., Human Values and the Design of Computer Technology, (pp. 271-287). Cambridge Univ. Press, New York NY. TURIEL, E. 1983. The Development of Social Knowledge. Cambridge, England: Cambridge University Press. TURIEL, E. 1998. Moral development. In N. Eisenberg, Ed., Social, Emotional, and Personality Development (pp. 863-932). Vol. 3 of W. Damon, Ed., Handbook of Child Psychology. 5th edition. New York, NY: Wiley. TURIEL, E. 2002. The Culture of Morality: Social Development, Context, and Conflict. Cambridge, England: Cambridge University Press. TURKLE, S. 1996. Life on the Screen: Identify in the Age of the Internet. New York, NY: Simon and Schuster. ULRICH, R. S. 1984. View through a window may influence recovery from surgery. Science, 224, 420-421. ULRICH, R. S. 1993. Biophilia, biophobia, and natural landscapes. In S. R. Kellert and E. O. Wilson, Eds., The Biophilia Hypothesis (pp. 73-137). Washington, D.C.: Island Press. UNITED NATIONS. 2002. Report of the United Nations Conference on Environment and Development, held in Rio de Janeiro, Brazil, 1992. Available from http://www.un.org/esa/sustdev/documents/agenda21/english/agenda21toc.htm WADDELL, P. 2002. UrbanSim: Modeling urban development for land use, transportation, and environmental planning. Journal of the American Planning Association, 68, 3, 297-314. WADDELL, P., BORNING, A., NOTH, M., FREIER, N., BECKE, M., and ULFARSSON, G. 2003. Microsimulation of Urban Development and Location Choices: Design and Implementation of UrbanSim. Networks and Spatial Economics, 3, 1, 43-67. WEISER, M., and BROWN, J. S. 1997. The coming age of calm technology. In P. Denning and B. Metcalfe, Eds., Beyond Calculation: The Next 50 Years of Computing (pp. 75-85). New York, NY: Springer-Verlag. WEIZENBAUM, J. 1972. On the impact of the computer on society: How does one insult a machine? Science, 178, 609-614. WIENER, N. 1985. The machine as threat and promise. In P. Masani, Ed., Norbert Wiener: Collected Works and Commentaries, Vol. IV (pp. 673-678). Cambridge, MA: The MIT Press. (Reprinted from St. Louis Post Dispatch, 1953, December 13.) WINOGRAD, T. 1994. Categories, disciplines, and social coordination. CSCW Journal, 2, 3, 191-197. WORLD COMMISSION ON ENVIRONMENT AND DEVELOPMENT (Gro Harlem Brundtland, Chair). 1987. Our Common Future. Oxford University Press, Oxford. WYNNE, E. A., and RYAN, K. 1993. Reclaiming our schools: A handbook on teaching character, academics, and discipline. New York, Macmillan.


ZHENG, J., BOS, N., OLSON, J., and OLSON, G. M. 2001. Trust without touch: Jump-start trust with social chat. In Extended Abstracts of CHI 2001, ACM Press, 293-294.


Understanding Frontline Workers’ and Unhoused Individuals’ Perspectives on AI Used in Homeless Services

Tzu-Sheng Kuo∗ ([email protected], Carnegie Mellon University, Pittsburgh, PA, USA)
Hong Shen∗ ([email protected], Carnegie Mellon University, Pittsburgh, PA, USA)
Jisoo Geum ([email protected], Carnegie Mellon University, Pittsburgh, PA, USA)
Nev Jones ([email protected], University of Pittsburgh, Pittsburgh, PA, USA)
Jason I. Hong ([email protected], Carnegie Mellon University, Pittsburgh, PA, USA)
Haiyi Zhu† ([email protected], Carnegie Mellon University, Pittsburgh, PA, USA)
Kenneth Holstein† ([email protected], Carnegie Mellon University, Pittsburgh, PA, USA)

Figure 1 comic panels:
(a) “Jamie is experiencing homelessness.”
(b) “Jamie is told that the county is offering a public housing program.”
(c) “Jamie calls the county to apply for a housing unit. However, the staff person tells Jamie that there aren’t enough units to house every applicant.”
(d) “Given thousands of applications each year, what methods should the county use for prioritizing housing applicants?”
(e) “The staff person uses a computer that provides advice on who should be prioritized for housing, based on their risks of being harmed if they remain unhoused.”

Figure 1: We use an adapted version of the comicboarding method [44] to understand frontline workers’ and unhoused individuals’ perspectives on an AI system used in homeless service. In order to elicit specific stakeholder feedback and design ideas around the design and deployment of an AI system, our adaptation disaggregates the AI development lifecycle into different design components of an AI system, such as the system’s task definition (as shown in this example).

ABSTRACT
Recent years have seen growing adoption of AI-based decision-support systems (ADS) in homeless services, yet we know little about stakeholder desires and concerns surrounding their use. In this work, we aim to understand impacted stakeholders’ perspectives on a deployed ADS that prioritizes scarce housing resources. We employed AI lifecycle comicboarding, an adapted version of the comicboarding method, to elicit stakeholder feedback and design ideas across various components of an AI system’s design. We elicited feedback from county workers who operate the ADS daily, service providers whose work is directly impacted by the ADS, and unhoused individuals in the region. Our participants shared concerns and design suggestions around the AI system’s overall objective, specific model design choices, dataset selection, and use in deployment. Our findings demonstrate that stakeholders, even without AI knowledge, can provide specific and critical feedback on an AI system’s design and deployment, if empowered to do so.

∗ Co-first authors contributed equally to this research.
† Co-senior authors contributed equally to this research.

This work is licensed under a Creative Commons Attribution International 4.0 License. CHI ’23, April 23–28, 2023, Hamburg, Germany © 2023 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-9421-5/23/04. https://doi.org/10.1145/3544548.3580882

CCS CONCEPTS • Human-centered computing → HCI design and evaluation methods; Empirical studies in HCI.


KEYWORDS homelessness, AI-based decision support, comicboarding, public algorithms ACM Reference Format: Tzu-Sheng Kuo, Hong Shen, Jisoo Geum, Nev Jones, Jason I. Hong, Haiyi Zhu, and Kenneth Holstein. 2023. Understanding Frontline Workers’ and Unhoused Individuals’ Perspectives on AI Used in Homeless Services. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3544548.3580882

1 INTRODUCTION

According to the United Nations, homelessness is a “profound assault on dignity, social inclusion and the right to life” [29]. More than 1.8 billion people lack adequate housing worldwide [29]. Even in developed countries such as the United States, more than 326,000 people experienced sheltered homelessness on a single night in 2021 [22]. This number does not even account for those experiencing unsheltered homelessness and growth in the unhoused population due to the global pandemic [22, 51]. Furthermore, homelessness is often deeply stigmatized, with specific stereotypes linking unhoused individuals with dangerousness, criminality, and moral failure [32, 50]. Public prejudice further marginalizes some of the most vulnerable in our society [34].

In recent years, government agencies have increasingly turned to AI-based decision-support (ADS) systems to assist in prioritizing scarce housing resources. The use of algorithmic systems in homeless services has spread rapidly: in the past half-decade, such systems have been considered or deployed in US counties including Los Angeles [15], San Francisco [64], and Allegheny County [18], as well as in Ontario, Canada [35]. However, despite this rapid spread, the stakeholders most directly impacted by these systems have had little say in these systems’ designs [15, 64]. To date, we lack an adequate understanding of impacted stakeholders’ desires and concerns around the design and use of ADS in homeless services. Yet as a long line of research in HCI and participatory design demonstrates, without such an understanding to guide design, technology developers risk further harming already vulnerable social groups, or missing out on opportunities to better support these groups [11, 14, 17, 62, 70].

In this work, we aim to understand frontline workers’ and unhoused individuals’ perspectives on the Housing Allocation Algorithm (HAA)1, an ADS system that prioritizes housing resources for people experiencing homelessness, which has been deployed in a US county for over two years. We employ AI lifecycle comicboarding, a feedback elicitation and co-design method adapted from comicboarding [23, 44], to elicit specific feedback on various aspects of an AI system’s design from stakeholders with diverse backgrounds and literacies. Prior HCI methods aimed at broadening participation in the design and critique of AI systems have often focused either on eliciting broad feedback on AI systems’ overall designs (e.g., feedback on the problem framing and design objectives) [27, 62, 67, 70], or on eliciting specific feedback by narrowing down the elicitation process around specific aspects of the system design [5, 6, 38, 56, 60]. Our adaptation uses comicboards (see Figure 1) to scaffold both broad and specific conversations around different components of an AI system’s design, from problem formulation to data selection to model definition and deployment.

Using our adapted approach, we elicited feedback on HAA’s design from frontline workers, including county workers who operate the ADS daily and external service providers whose work is directly impacted by the ADS, as well as both current and former unhoused individuals in the region. We recruited these stakeholder groups to center the voices of those who are most directly affected by the system, yet who are currently least empowered to shape its design. Our participants shared critical concerns and specific design suggestions related to the system’s overall design objective, specific model design choices, the selection of data used to train and operate the system, and broader sociotechnical design considerations around the system’s deployment. Reflecting on their experience during the study, participants noted that our approach helped to open up conversations around otherwise hidden assumptions and decisions underlying the ADS’s design and deployment.

In summary, our work contributes an in-depth understanding of frontline workers’ and unhoused individuals’ perspectives on the use of AI in homeless services. In order to understand their perspectives, we employ AI lifecycle comicboarding, an adapted comicboarding method uniquely tailored to understand the sociotechnical implications of AI systems. Our adaptation disaggregates the AI development lifecycle to make otherwise opaque AI design choices accessible and open to critique. Overall, our findings demonstrate that stakeholders spanning a broad range of relevant literacies can provide specific and critical feedback on an AI system’s design and deployment, if empowered to do so.

1 Throughout the paper, we refer to the ADS system by this pseudonym.

2 RELATED WORK 2.1 ADS Systems in the Public Sector Over the past decade, AI-based decision support (ADS) systems, powered by machine learning (ML) techniques, have increasingly been adopted to augment decision-making across a range of public services [39]. For example, ADS systems have been used to assist judges in deciding whether defendants should be detained or released while awaiting trial [10, 16]. They have been adopted by child protection agencies to assist workers in screening child maltreatment referrals [7, 31, 57]. They have also been used by school districts to assist in assigning students to public schools [55]. The growing use of ADS in public services has been met with both enthusiasm and concern [7, 39]. While proponents have argued for its potential to improve the equity, efficiency and effectiveness of decision-making, critics have raised serious concerns about ways these systems may fail to deliver on these promises in practice, and instead amplify the problems that they were meant to address. For example, public outcry has erupted over biased and harmful outcomes caused by recidivism prediction algorithms [1], predictive analytics for child welfare [17], and predictive policing [59]. A growing body of work in ML and HCI has begun to both deepen our understanding of stakeholder concerns and implement changes in AI systems in an effort to address them. Past work in the fair ML research community has focused on proposing various statistical fairness criteria and then developing novel algorithmic methods to align ML models with these criteria [8, 9, 43]. However,


early work in this space has been critiqued for its disconnectedness from real-world stakeholders’ actual needs, values, and system constraints, which may not align with theoretical notions of fairness [6, 26, 67]. Meanwhile, HCI research has paid increasing attention to understanding stakeholders’ desires and concerns around the design of public sector algorithms [27, 31, 55, 58, 62, 67]. For example, in the context of public education, Robertson et al. [55] found that student assignment algorithms deployed in San Francisco failed because they relied on modeling assumptions that fundamentally clashed with families’ actual priorities, constraints, and goals. In contrast to the domains discussed above, the rapid spread of ADS in homeless services has received surprisingly limited attention from the HCI community to date. Yet the lack of engagement of directly impacted stakeholders in this high-stakes domain has already received attention in the popular press, regarding the potential for such systems to negatively impact already vulnerable social groups (e.g., [15, 64]).

2.2 Broadening Participation in AI Design

In current practice, stakeholders with no background in statistics or data science and even researchers and designers who have less technical knowledge of AI and machine learning are often excluded from conversations around the design and use of AI-based systems [14, 27, 31]. Directly impacted stakeholders–e.g. individuals facing criminal adjudication, children and families subject to out of home placements and state intervention, individuals with intellectual/developmental disabilities–are even less likely to be meaningfully included in algorithm development than other groups. Although top down development can reduce the burden on AI teams, in terms of having to explain complex technical concepts to laypeople, a lack of feedback from stakeholders who will use or be directly impacted by these systems can lead to serious misalignments and failures once these systems are deployed (e.g., [31, 55, 58, 62, 67]). Even when less technical stakeholders are involved in design conversations, they may not be provided with sufficiently detailed information about the system to offer specific critiques and suggestions, or they may only be invited to provide feedback on a narrow aspect of an AI system’s design. For example, stakeholders might be invited to critique the design of a system’s user interface, but not the underlying model or the overall problem formulation [14, 71]. In response, a growing body of research in HCI and ML has focused on the development of new methods and tools aimed at broadening who is able to participate in the design or critique of AIbased systems [14, 20, 24, 33, 60, 72, 73]. A body of prior work has focused on inviting high-level feedback on AI systems’ overall designs (e.g., feedback on the problem framing and design objectives), through a wide range of methods including interviews, workshops, and co-design techniques (e.g.,[27, 62, 67, 70]). The feedback that results from these approaches can be extremely valuable in informing research and design teams’ understandings of stakeholders’ broad desires and concerns around specific kinds of AI systems. For example, recent work from Stapleton et al. found that both community members and social workers [62] had major concerns about existing uses of predictive analytics in child welfare decision-making, and desired fundamentally different forms of technology-based support. However, these approaches often fail to make explicit the assumptions and design choices involved in an AI system’s design (e.g.,


specific choices of training data or proxy outcomes that a model predicts), and thus fail to provide the knowledge and context that would be necessary to elicit more specific critiques. By contrast, another body of work has focused on developing methods and tools that can elicit more detailed stakeholder feedback, by narrowing down the elicitation process around specific aspects of the system design (e.g., [5, 6, 38, 60]). For example, Lee et al. [38] developed a voting-based preference elicitation approach to solicit stakeholders preferences regarding how a matching algorithm should prioritize stakeholder groups. Such approaches are powerful in eliciting stakeholder feedback that can directly inform the design or redesign of AI systems. However, as discussed in recent work [56], these approaches also risk limiting the types of feedback that stakeholders are able to provide, by restricting their inputs to forms that are readily computable. For example, a method might ask stakeholders to weigh in on how an algorithm should prioritize services among different groups, without providing opportunities for them to reflect on whether the proposed algorithm is in fact addressing the right problem in the first place [56]. In this paper, we introduce an adaptation of the comicboarding method [23, 44], which aims to complement these prior approaches by scaffolding participants to provide targeted feedback on key components of an AI system’s design, from a system’s problem formulation to the selection of training data to the design of a model and its use in deployment.

3 STUDY CONTEXT

We conducted this research in the United States, where the unhoused population has grown significantly in the past decades [61], and the use of algorithmic systems in homeless services has rapidly expanded [15, 18, 64]. Following the definition used by the U.S. Department of Housing and Urban Development, throughout this paper, we consider the experience of homelessness as the lack of “a fixed, regular, and adequate nighttime residence” [22]. Prior work in HCI has studied the perceptions, uses, and impacts of technologies among both people experiencing homelessness and homeless service providers in the United States. For example, early work by Le Dantec et al. [36, 37] and Roberson et al. [54] explored how commodity technologies, such as mobile phones, affect unhoused individuals’ daily lives. With the rise of social media, HCI researchers investigated how unhoused individuals used social media to portray life on the streets, develop social ties, and meet survival needs [28, 68, 69]. More recently, given the rapid spread of data-driven tools used for housing allocation, Karusala et al. conducted semi-structured interviews with policymakers and homeless service providers to understand data practices around a questionnaire-based triage tool, and to understand participants’ desires for new potential uses of these data [30]. Our work builds upon this line of HCI research, centering the voices of unhoused individuals and frontline workers in homeless services. In this research, we focus on a US county where an ADS for housing resource allocation has been in use for over two years. According to the county, the housing units available through turnover can serve fewer than half of the individuals or families experiencing homelessness in the county, resulting in a housing gap of more than a thousand people a year [46]. Due to the limited housing

availability, the county began exploring ways to improve its assessment tool for housing prioritization about five years prior to our study [46]. At that time, the county used a questionnaire-based assessment tool [49]. However, according to the county [46, 47], these questionnaires were time-consuming for applicants to fill out, and the information gathered was potentially inaccurate, given that the questionnaire required applicants to answer highly sensitive questions. For example, a question asked applicants whether they have “exchange[d] sex for money, run drugs for someone, or have unprotected sex with someone you don’t know” [46]. Furthermore, the county worried that answering such sensitive questions risked retraumatizing applicants, who might be forced to relive difficult experiences while completing the questionnaire [46]. To address these concerns, the county collaborated with external researchers to develop an ADS that prioritizes housing resources for people experiencing homelessness using government administrative records instead of questionnaire responses [4, 65, 66]. In the remainder of this paper, we refer to this ADS system with the pseudonym “HAA,” an abbreviation of the Housing Allocation Algorithm. HAA’s workflow is illustrated in Figure 2.

Figure 2: Diagram showing HAA’s workflow. HAA automatically pulls an applicant’s personal information from the county’s data warehouse, predicts an applicant’s likelihood of adverse events such as emergency room visits, and generates a score that determines an applicant’s position on the housing waitlists.

The housing assessment process starts with an applicant getting in contact with either the county’s link staff in the office or a frontline worker in the field unit. After the county staff or worker asks a few filtering questions to ensure that the applicant is eligible for the assessment, they press a button on their computer to run HAA.

HAA automatically pulls the applicant’s personal information from the county’s data warehouse [48] and predicts how likely the applicant is to experience the following three situations if they remain unhoused over the next 12 months: more than four emergency room visits (based on healthcare utilization data), at least one mental health inpatient stay funded by Medicaid, and at least one jail booking. Based on these predictions, HAA generates a risk score between 1 and 10; a higher score indicates a higher risk of being harmed due to homelessness. The staff or worker then assigns applicants to the housing waitlists corresponding to their scores. Once these housing programs have openings, the county connects people with the downstream housing providers. Sometimes, county workers run the alternative assessment [65], which is an adaptation of the previous questionnaire-based assessment tool that relies on an applicant’s self-reported information. In the remainder of this paper, we refer to this alternative assessment with the pseudonym “alt HAA,” an abbreviation of the alternative Housing Allocation Algorithm. The county uses alt HAA under two circumstances [46]: first, when an applicant’s data within the county covers less than 90 days, so there is not sufficient data for the system to pull; and second, when workers believe that HAA’s score doesn’t reflect an applicant’s vulnerability. The alt HAA generates an alternative score, also between 1 and 10, that the county uses to assign applicants to waitlists. In rare cases, county workers may file a prioritization request when they believe neither HAA’s nor alt HAA’s score reflects an applicant’s vulnerability.
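As a concrete, deliberately simplified illustration of the workflow just described, the sketch below encodes its overall shape: three predicted adverse events over the next 12 months collapsed into a 1–10 score, with a fallback to the questionnaire-based alt HAA when administrative data are insufficient. All function names, thresholds, and the aggregation rule are placeholders; the deployed HAA and alt HAA are specified only in the county’s technical reports [4, 65, 66], not here.

```python
# Schematic sketch of the assessment workflow described above -- not the county's
# actual model or code. Predictors, the aggregation rule, and the score mapping
# are placeholders; the deployed system is documented in the county's reports.
from dataclasses import dataclass
from typing import Optional

MIN_DAYS_OF_DATA = 90  # below this, workers reportedly fall back to alt HAA

@dataclass
class Applicant:
    days_of_county_data: int
    admin_records: Optional[dict]          # linked administrative data, if available
    self_reported_answers: Optional[dict]  # questionnaire responses for alt HAA

def predict_adverse_events(records: dict) -> dict:
    """Placeholder for three predicted probabilities over the next 12 months."""
    return {
        "er_visits_more_than_4": 0.0,   # more than four emergency room visits
        "mh_inpatient_stay": 0.0,       # at least one Medicaid-funded inpatient stay
        "jail_booking": 0.0,            # at least one jail booking
    }

def to_score(probabilities: dict) -> int:
    """Collapse the predictions into a 1-10 score; higher means higher assessed risk."""
    combined = sum(probabilities.values()) / len(probabilities)  # placeholder aggregation
    return max(1, min(10, 1 + int(combined * 10)))

def score_alt_haa(answers: Optional[dict]) -> int:
    """Placeholder for the questionnaire-based alternative assessment (also 1-10)."""
    return 1

def assess(applicant: Applicant) -> tuple:
    """Return (score, which assessment was used); the score sets the waitlist."""
    if applicant.admin_records is None or applicant.days_of_county_data < MIN_DAYS_OF_DATA:
        return score_alt_haa(applicant.self_reported_answers), "alt HAA"
    return to_score(predict_adverse_events(applicant.admin_records)), "HAA"
```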


4 STUDY DESIGN

Drawing upon prior HCI methods, we introduced “AI lifecycle comicboarding,” an adaptation of the comicboarding method discussed below. Using this method, we conducted a series of one-on-one study sessions with unhoused individuals and frontline workers, to understand their perspectives, concerns, and desires around the use of AI in homeless services. As discussed below, we also presented participants with de-identified design ideas and feedback generated by prior participants from the two other stakeholder groups, in order to facilitate asynchronous inter-group deliberation without compromising participant comfort and safety during our study.

4.1 AI Lifecycle Comicboarding

We adapted the comicboarding method [23, 44] with the goal of scaffolding conversations around various aspects of an AI system’s design and deployment, when working with participants who span a broad spectrum of relevant reading and technology literacies (in the context of our study: unhoused individuals and frontline workers in homeless services). Comicboarding is a co-design technique that uses the structure of comic strips, including partially completed content, as scaffolding to facilitate idea generation and elicit design insights from populations who may have little experience with brainstorming. The comicboarding method has been successfully used in prior research to elicit rich design ideas and feedback from participants with lower reading literacy [23, 44]. This property made comicboarding a strong fit for our context, given that low reading literacy presents a barrier to participation in co-design for many unhoused participants [19]. We adapted this method to engage diverse stakeholders in conversations around the design and use of AI systems. In order to surface AI design choices and assumptions that might otherwise remain opaque to non-AI experts, we developed targeted comicboards to solicit design feedback and ideas around various aspects of an AI system’s design. We created a comicboard for each of eight major components of an AI system’s design, each corresponding to a different stage of the AI development lifecycle [12]: problem formulation, task definition, data curation, model definition, training, testing, deployment, and feedback processes. Each of these eight comicboards consists of a set of “story starter” panels, which set the context for a key design choice (e.g., see Figure 3 for an example in the context of model definition). The next panel in a comicboard then presents a blank space, along with an open-ended question to prompt ideation and critique. The final comicboard panel, which is revealed to a participant only after they have engaged around the open-ended panel, reveals how the given aspect of the system is currently designed in reality (or the proposed design, in the case of systems that have not yet been developed). Using this approach, in the current study we developed a set of comicboards to understand frontline workers’ and unhoused populations’ perspectives on AI used in homeless services. All our comicboards are shown in Appendix A. We based the information in each of our comicboards upon detailed technical reports published by the county about the HAA system [4, 65, 66]. As such, our comicboards served to surface key design decisions that were buried in these lengthy reports, translating this information into a


format that could be more readily scrutinized by our participants. For example, our comicboard for task definition invited design ideas regarding the AI system’s overall objective; our comicboard for model definition elicited feedback on the choice of proxies that the model used to predict a person’s risk of being harmed due to homelessness. In addition to the key design decisions we asked participants about in the blank comicboard panels, we also embedded various design choices throughout the comicboards to elicit specific feedback on each component. For example, in the comicboard for data curation, we surfaced details about the training data to elicit discussion around potential issues surrounding its representativeness of the local unhoused population. In the comicboard for feedback, we illustrated the third-party consulting firm from which the county gathers technical feedback in order to elicit responses to the values encoded in the current feedback process. Finally, through several meetings with the county, we validated that our comicboards accurately reflected HAA’s current design and use. Given the stigma surrounding issues of homelessness, we used a gender neutral persona throughout the storyboards, so that study participants had the option to refer to the persona in third-person when discussing sensitive topics, rather than directly referencing their own experiences.
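For readers who want a concrete picture of how the eight per-stage comicboards described above fit together, the sketch below models one board as a small data structure: story-starter panels, an open-ended prompt panel, and a reveal panel shown only after participants respond. The class and field names are hypothetical; the panel texts paraphrase the model definition board pictured in Figure 3.

```python
# Hypothetical sketch of the comicboard structure described above: one board per
# AI lifecycle stage, each with story-starter panels, an open-ended prompt, and a
# "reveal" panel that is shown only after the participant has responded.
from dataclasses import dataclass, field

LIFECYCLE_STAGES = [
    "problem formulation", "task definition", "data curation", "model definition",
    "training", "testing", "deployment", "feedback",
]

@dataclass
class Comicboard:
    stage: str
    story_starters: list               # panels that set the context for a key design choice
    open_prompt: str                   # blank panel question inviting critique and ideas
    reveal: str                        # how the aspect is actually designed, shown last
    responses: list = field(default_factory=list)

model_definition_board = Comicboard(
    stage="model definition",
    story_starters=[
        "The computer predicts challenging situations applicants may face if unhoused.",
        "It then calculates a score representing the risk of being harmed due to homelessness.",
        "Applicants with higher scores are more likely to receive housing.",
    ],
    open_prompt="What challenging situations should the computer consider "
                "when calculating the risk score?",
    reveal="The three situations considered are a mental health inpatient stay, "
           "a jail booking, and 4+ emergency room visits.",
)
assert model_definition_board.stage in LIFECYCLE_STAGES
```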

4.2 Study Protocol

We conducted AI lifecycle comicboarding in a series of one-on-one sessions with unhoused individuals and frontline workers in homeless services. We decided to run these sessions one-on-one, rather than in a workshop format, because we anticipated that participants in our study might not be comfortable sharing their experiences and viewpoints as openly in the presence of other participants [25, 62]. Each study session began with a brief interview portion to understand participants’ backgrounds. We then began the comicboarding activity to elicit participants’ feedback and ideas around the design of HAA, presenting one comicboard for each of eight major components of HAA’s design (as shown at the top of Figure 3). After participants provided their feedback, but before moving on to the next comicboard, we shared a few selected responses from other participants for discussion. This provided an opportunity for participants to express agreement or disagreement with others’ perspectives, or to build upon ideas generated by others, within the context of a one-on-one session. Because these responses included design ideas from other stakeholder groups, this process also provided a safe space for deliberation between unhoused individuals, county workers, and external service providers, given the power dynamics among these groups (cf. [25]). Finally, we revealed the final panel of the comicboard, describing how the relevant aspect of HAA is currently designed in reality. Participants were invited to provide feedback on the actual design choices that the developers of HAA had made, and to compare these with their own ideas, before moving on to the next comicboard. After going through all eight of the comicboards, we invited participants to reflect on their overall experience during the study, and we then wrapped up by collecting demographic information from participants.
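As a compact summary of this protocol, the sketch below outlines one session’s flow in Python. It is an illustrative reconstruction rather than tooling used in the study: the helper names (run_session, show, collect_response) are hypothetical, and it assumes the Comicboard structure sketched in Section 4.1.

def show(panel_text):
    # Placeholder for presenting a comicboard panel to the participant.
    print(panel_text)

def collect_response(prompt):
    # Placeholder for the facilitated discussion and note-taking at each step.
    return input(prompt + " > ")

def run_session(comicboards, prior_responses):
    # Outline of one one-on-one AI lifecycle comicboarding session.
    notes = {"background": collect_response("Brief background interview")}
    for board in comicboards:                  # eight boards, one per lifecycle stage
        for panel in board.story_starters:     # set the context for the design choice
            show(panel)
        # 1. Elicit the participant's own ideas and critique before any reveal.
        notes[board.stage] = collect_response(board.open_prompt)
        # 2. Share selected, de-identified responses from other stakeholder groups,
        #    enabling asynchronous inter-group deliberation within a 1:1 session.
        for other in prior_responses.get(board.stage, []):
            collect_response("Agree, disagree, or build on: " + other)
        # 3. Only then reveal the actual design choice and invite feedback on it.
        show(board.reveal)
        notes[board.stage + " (actual design)"] = collect_response(
            "Feedback on the actual design")
    notes["reflection"] = collect_response("Reflections on the overall experience")
    notes["demographics"] = collect_response("Demographic information")
    return notes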


Figure 3: We developed a set of comicboards based on eight major components of an AI system’s design [12] (problem formulation, task definition, data curation, model definition, training, testing, deployment, and feedback), as shown in the top row. We then used these comicboards as scaffolding to elicit participants’ feedback around each component. For example, the pictured comicboard was created for the model definition component, based on detailed technical reports released by the county about HAA [4, 65, 66]. Its panels describe how the computer predicts challenging situations an applicant may face if they remain unhoused, calculates a risk score from the severity of those predicted situations (the three considered situations being mental health inpatient stays, jail bookings, and 4+ emergency room visits), and prioritizes applicants with higher scores for housing.

Table 1: Potential barriers to participation in co-design and design critique in the context of ADS used in homeless services, and how our approach addresses each barrier.

Potential barriers to participation | How our approach addresses these barriers
AI literacy: Many design choices and assumptions that are baked into AI systems can be opaque to non-AI experts. | Our comicboards break down the AI development lifecycle to elicit specific feedback on different components of an AI system’s design.
Reading literacy: Low reading literacy presents a significant barrier to co-design among many unhoused individuals. | Our comicboards combine illustrations with brief, carefully crafted captions to increase accessibility to individuals with lower literacy.
Social stigma: There are significant stigmas surrounding homelessness. Unhoused individuals may be uncomfortable openly sharing their experiences and perspectives. | We create a gender-neutral persona that allows participants to self-determine when to bring in their own lived experiences and when to distance themselves from the discussion subjects at hand.
Power imbalance: There are imbalanced power dynamics across our stakeholder groups (e.g., county workers versus unhoused individuals), which risk hindering safe and open conversation. | We conduct one-on-one sessions with participants, but share de-identified responses from prior study participants from other stakeholder groups, to facilitate asynchronous inter-group deliberation.

4.3 Recruitment

We adopted a purposive sampling approach to recruit both frontline workers in homeless services and people with lived experiences of homelessness. Through our contacts in county government, we recruited county workers in the field unit that mainly focuses on street outreach. Specifically, we first got in contact with a data analyst from the county’s analytics team. Our contact then connected us with the field unit’s supervisor, who shared our recruitment message with frontline workers in the field unit. These frontline workers have direct experience using HAA to run housing assessments on a daily basis, and they also have regular, face-to-face interactions with local unhoused populations.

Using contact information collected through public websites and databases of homeless services in the region, we also recruited non-profit service providers, whose services span street outreach, medical support, education for youth, and housing for women and the LGBTQ community. We connected with these service providers through emails, phone calls, or participation in local community meetings where community leaders connected us to trusted service providers they worked with. Because many of these service providers have built longstanding trust with local unhoused individuals, they are able to provide a unique bird’s-eye view of HAA’s impacts on local unhoused populations, in addition to insights into the system’s impacts on their day-to-day work. They expanded on and extended the perspectives of the county workers and individuals with direct experiences of homelessness we interviewed.


Table 2: Participants’ self-reported demographics. Given the sensitive nature of our context, we present aggregated information.

Demographic information | Participant counts or statistics
Race | Caucasian (12), African American (9)
Age | mean: 39.3, minimum: 26, maximum: 64
Gender | female (11), male (8), non-binary (2)
Homelessness Status (unhoused individuals only) | currently unhoused (5), formerly unhoused (7)
Duration of Homelessness (unhoused individuals only) | months (4), years (5), decades (3)
Years in the Field Unit (county workers only) | mean: 2, minimum: 1.5, maximum: 3
Services Provided (service providers only) | street outreach (1), street medicine (1), education for youth (1), housing for women (1), housing for LGBTQ community (1)

Finally, we recruited participants with lived experiences of homelessness. These participants have on-the-ground knowledge of homelessness in the region and are directly impacted by HAA and/or prior assessment tools’ decisions. Recognizing the ethical complexities of conducting research with unhoused individuals, we solicited feedback on recruitment strategies from county workers, service providers, and relevant domain experts. Based on their feedback, as well as prior recruitment practices in HCI research [36], we decided to recruit unhoused and previously unhoused individuals through county workers and service providers who have established relationships in the unhoused community, acting as intermediaries.

In total, we recruited 21 participants, including 4 county workers, 5 non-profit service providers, and 12 people with current or former personal experience of homelessness. Recognizing that research sites can be a barrier to trust and acceptance among community members [21], we provided several study locations, including both in-person and virtual options, for participants to choose from based on their preferences. In the end, we met all county workers in person in the county’s building; we talked to all service providers virtually over Zoom; and we spoke with unhoused individuals either in person on campus or virtually over Zoom or by phone. Each study session lasted 108 minutes on average, and all of our study participants were compensated $60 for their participation. The amount of compensation was recommended by our contacts within the county, who had extensive experience working with unhoused individuals and recruiting them for feedback sessions.

4.4 Data Analysis

To analyze our study data, we adopted a reflexive thematic analysis approach [2, 3]. Three authors conducted open coding on transcriptions of approximately 38 hours of audio recordings and generated a total of 1023 codes. Each transcript was coded by at least two people, including the researcher who conducted the interview for that transcript. Throughout this coding process, we continuously discussed disagreements and ambiguities in the codes, and iteratively refined our codes based on these discussions. Such discussions are critical in a reflexive thematic analysis approach, where different perspectives contribute to the collaborative shaping of codes and themes via conversation [3, 42]. Accordingly, in line with standard practice for a reflexive thematic analysis, we do not calculate inter-rater reliability, given that consensus and iterative discussion of disagreements are built into the process of generating codes and themes [3, 42].

We also intentionally conducted our analysis across comicboards, rather than conducting analyses per comicboard, given that our goal was to understand broader themes in participants’ responses across the full set of comicboards. This analysis approach is common in HCI comicboarding and storyboarding studies [13, 44], and can be considered analogous to thematic analysis of results from interview studies, where coders do not necessarily organize codes based on specific prepared interview questions. Moreover, our participants sometimes returned to the comicboards they had already read and provided more feedback after learning more about how the system worked in later comicboards. This complicated the attribution of specific participant responses to specific comicboards. Many design choices in an AI system are inevitably intertwined, and participants’ responses across multiple comicboards reflected this interconnectedness. After coding, we conceptualized higher-level themes from these codes through affinity diagramming. In total, this process yielded 65 first-level themes, 10 second-level themes, and three third-level themes. We present our results in Section 5, where section headers broadly correspond to second- and third-level themes. All second- and third-level themes are shown in Appendix B.
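For readers less familiar with affinity diagramming, the toy sketch below illustrates the kind of nested code-to-theme structure this process produces and how first-, second-, and third-level themes are tallied. The labels are invented placeholders loosely inspired by our results, not our actual codebook.

# Toy illustration only; labels are placeholders, not the study's codebook.
theme_hierarchy = {
    "Feedback on HAA's Design": {                            # third-level theme
        "Concerns about proxy outcomes": [                   # second-level theme
            "ER visits used to escape the cold",             # first-level themes
            "drop-off from services can signal vulnerability",
        ],
        "Concerns about data representativeness": [
            "free clinics keep no records",
            "records go out of date quickly",
        ],
    },
}

def count_levels(hierarchy):
    # Count (third-level, second-level, first-level) themes in the nested structure.
    third = len(hierarchy)
    second = sum(len(children) for children in hierarchy.values())
    first = sum(len(codes) for children in hierarchy.values() for codes in children.values())
    return third, second, first

print(count_levels(theme_hierarchy))  # (1, 2, 4) for this toy example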

4.5 Positionality Statement

We acknowledge that our experiences shape our research, and that our relative privilege within society provides us with advantages that our study participants do not hold. Specifically, we are researchers who work and receive research training in the United States in the fields of Human-Computer Interaction, Social Work, and Communication. Our team has prior research experience in social work and public-sector technology. One author has direct work experience with unhoused communities and homelessness services. All authors live in the county where HAA is deployed. Two authors briefly experienced homelessness in the region but were never unsheltered on the streets. Another author had a family experience of homelessness when growing up outside of the region. To conduct this research, we consulted domain experts in homelessness and worked closely with frontline workers in the county and non-profit organizations. Following prior literature [45, 53], before each interview we also openly shared with participants that we are researchers working independently from the county and agency that deploy HAA.


We also verified our interpretation of participants’ quotes after the data analysis. Throughout this research, we followed prior approaches in HCI [40, 41, 63], pausing to reflect on: (1) Who would potentially benefit from the research outcome? (2) Are we truly supporting and serving the community? (3) What biases do we bring to this research? We paid particular attention to how participants could directly benefit from participating in our study, in addition to the compensation, considering that their time is valuable, especially in the case of unhoused individuals. We share reflections on our study from participants across all stakeholder groups in the Discussion (Section 6).

5 RESULTS

In this section, we organize our findings around three third-level themes identified through our analysis: (1) Desires for Feedback Opportunities, (2) Feedback on HAA’s Design, and (3) Feedback on HAA’s Deployment. Overall, we found that both unhoused individuals and frontline workers wished for opportunities to provide feedback on the design and use of the AI system. As discussed in Section 5.1, participants noted the county had previously provided regular opportunities for feedback around older assessment tools. However, once the county adopted the “black box” HAA, community members and frontline workers were no longer invited to provide such feedback. Within the county, there was significant skepticism that non-AI experts could provide meaningful feedback on an AI system. However, our findings demonstrate that community members and frontline workers can provide specific, critical feedback on an AI system’s design and deployment, if empowered to do so. As we will discuss in Sections 5.2 and 5.3, participants shared concerns and design suggestions related to the AI system’s overall design objective, specific model design choices, the selection of data used to train and run the model, and the broader sociotechnical system design around the model’s deployment.

Workers at the county’s field unit offered a unique perspective given their direct, day-to-day experience interacting with HAA, as well as their regular interactions with local unhoused populations. For example, county workers shared concerns and ideas for improvement based on their observations of systematic limitations and errors in HAA’s predictions. Complementing county workers’ perspectives, external service providers brought a bird’s-eye view of the system’s downstream impacts on both unhoused individuals and service providers in the region. Finally, unhoused individuals brought direct lived experience of homelessness, and were able to compare HAA’s design and deployment against their own on-the-ground knowledge of homelessness in the region. Throughout this section, county workers are identified with a “W,” non-profit service providers with an “N,” and unhoused participants with a “P.”

5.1 Desires for Feedback Opportunities

Both service providers and county workers shared that there used to be a more open community feedback process around previous assessment tools. However, after the county adopted the “black box” HAA, they perceived that workers and community members were no longer invited into conversations to provide such feedback.

All participants expressed a desire for regular opportunities to give feedback on the system’s design and deployment, for example through regularly-held community feedback sessions. Although the county already held regular feedback sessions related to their programs and services, these sessions typically avoided topics that were assumed to be overly technical. Participants shared that, in the past, they were able to discuss and update the questions used in the county’s previous, questionnaire-based assessment tool. For example, a service provider noted that “one of the things with [the questionnaire-based tool], you know, at least once a year, we were able to get into conversations about whether there were questions that we felt would be relevant to better understand some of these risks to be added to the system. We don’t get that opportunity now” (N2). Although some light consultations, such as focus groups and information sessions, occurred before HAA’s deployment, county workers and service providers perceive that there has been no follow-up. For example, a county worker said “I think it would be helpful to bring [unhoused individuals] back in the room. [...] I do remember when we were bringing people in to discuss it, but I didn’t hear anything about follow up after that” (W3). Similarly, a service provider recalled that “prior to [HAA]’s deployment, there is a forum that is facilitated by [the county]. [...] There were a couple of sessions introducing the algorithm. But after that point, we never had any kind of follow up” (N3).

All of our participants expressed a desire for continuous, post-deployment feedback channels around HAA. However, they perceived that frontline homeless service workers and people with lived experiences of homelessness are insufficiently involved in the design and evaluation process for tools like HAA: “Most times people that do the kind of outreach work, where they’re face-to-face with community members and doing the hardest lift, they’re the least sought out as far as like research is concerned, and that should be the exact opposite” (N1). Participants also emphasized that it is critical to get direct feedback from people with lived experiences of homelessness, who are directly impacted by HAA’s decisions: “I think [the unhoused individuals] should be allowed to be involved because this is about them. They should be allowed to be heard. Not just the staff, not the county. They get to go home and sleep at night” (P10).

Some service providers worried that the county currently intentionally avoids gathering legitimate feedback about HAA or discussing potential flaws with the system: “I think the county doesn’t actually get legitimate feedback on housing programs. It reports feedback based on predetermined criteria, but they intentionally don’t do that effectively” (N2). Meanwhile, county workers shared that there was significant skepticism within the county that community members or other non-AI experts could provide meaningful feedback on a complex system like HAA: “I don’t think you’re gonna get feedback from [them] about HAA, or the assessment, or anything like that. I think [people can give feedback based on their] experiences of the housing program” (W2).

5.2 Feedback on HAA’s Design

As participants gained insight, through our comicboarding method, into HAA’s design and use, they raised a number of concerns and suggestions for alternative designs. Reflecting on their experience during the study, several participants noted that our comicboarding approach helped to open up conversations around otherwise hidden assumptions and decisions underlying the algorithm’s design and deployment. As we discuss below, participants questioned the algorithm’s overall design objectives, aspects of its model design, and the selection of data used to train and run the algorithm. For example, as participants learned more about HAA’s overall design, they expressed concerns about ways the algorithm may be optimizing more towards the county’s interests rather than the actual needs of unhoused individuals (Section 5.2.1). Participants also expressed concerns about the validity and reliability of specific aspects of the model’s design, such as the current choices of proxy variables used to measure particular real-world outcomes (Section 5.2.2). Finally, participants shared real-life examples to illustrate how the algorithm’s use of incomplete, decontextualized, or potentially misleading data may lead to erroneous scores and disadvantage particular populations (Section 5.2.3).

5.2.1 Feedback on HAA’s objective. Some participants challenged HAA’s problem formulation – the very idea of prioritization. They believed that “everyone has the right to housing just like a survival need” (P2). A currently unhoused participant raised the question to the county: “what more do you need for me to tell you that I’m important enough to live somewhere” (P12). Participants suggested the county focus its resources on addressing the root problems rather than prioritization, such as providing more housing or preventing homelessness in the first place: “[The county] could get [homelessness] end more effectively, if they were able to take in people that had just hit homelessness, instead of they have to go through god knows what and see if they even survive it” (P3).

In addition, as participants learned more through our comicboards about HAA’s overall objectives and the specific outcomes it predicts, they perceived that HAA had been designed and evaluated to reflect the county’s values and serve their interests rather than to reflect local community members’ needs. Even when participants acknowledged that it may be necessary to prioritize housing resources, given that the system is currently set up in a way that guarantees resource scarcity, they raised concerns with the particular ways HAA implements prioritization. For example, a formerly unhoused participant said: “It looks like they’re trying to [prioritize] people that are causing problems for others in society, as opposed to people that are at risk for themselves [and] are suffering internally” (P2). A service provider shared a similar concern: “Mental health inpatient and all of these things have significant financial tags associated with them when a person experiencing homelessness. [...] It’s possible that these are the best proxies that exist, but it looks like we’re measuring more financial cost to systems that have power, than we are measuring actual harms to actual people” (N5). Meanwhile, several participants expressed concerns that the metrics used to evaluate whether HAA is successful seem to encode specific values that are not focused on the subjective experiences and outcomes of unhoused individuals.
Instead, a service provider suggested evaluating HAA based on more client-focused outcomes: “The best metric for success is going to be: did the people who went through the system and received services have better outcomes than they were having before the system was implemented? [...] Do they maintain stable housing? Do they find employment that is sustainable and satisfying?” (N5). They questioned who had been involved in or excluded from determining the evaluation metric: “What does it mean when something performs better. [...] Whose version of better is that?” (W3).

5.2.2 Feedback on HAA’s model design. Participants voiced several concerns about the validity of the HAA model. First, some participants were skeptical about whether it is truly possible to predict a person’s likelihood of being harmed on the streets, based on their administrative records. For example, a formerly unhoused participant said: “There’s just no way for a computer to accurately predict somebody’s future based on that limited data. [...] If a person is alone, scared, and suffering from fears of institutions, a computer doesn’t know that” (P2). Similarly, a service provider worried that people can be more vulnerable on the streets for many reasons that may not be reflected in administrative records: “Some people are much more likely to be victimized on the street, [...], that may or may not really get picked up in the data. People who have patterns of behavior, or relationships that are recurrent, that get them into a relationship with an abuser, or they’re taken advantage of physically, sexually, economically” (N4).

Participants were particularly concerned that the proxy outcomes that the HAA model is trained to predict (e.g., counts of hospital visits and stays in jail), while conveniently available in administrative data, could be highly misleading. For example, some county workers expected that some people may visit the emergency room in order to get out of the cold, not because they are truly sick. Similarly, as W2 noted: “If a person is, over the past two years, in the emergency room every week, but they drop off, and then we find that they have moved out into a tent. [...] Their health is probably worse than it was when they were going to the emergency room every week” (W2). Without data that could provide insight into the actual causes behind these observed outcomes, participants worried that HAA’s predictions might lead decision-makers astray: “I don’t know how a computer considers whether a person will continue to experiencing homelessness if we don’t understand what brought him there in the first place. [...] I don’t think we have the right data. We’re looking at outcomes and not the causal effects of what resulted in that outcome” (N3).

Given these concerns, even though participants generally agreed that the housing allocation process should “ideally see people more vulnerable floating to the top of the list and getting served more quickly” (W1), they were concerned that the score generated by HAA’s model could not accurately reflect who was in more urgent need of help. For example, a service provider argued that “somebody with [...] multiple years of chronic homelessness is far safer on the street than a 25-year-old white female newly on the street alone, [... but] this individual would return like a zero or one” (N2). They perceived that the current prioritization process “basically says, until you can experience the real depth of the traumas associated with living on the street, you’re gonna have to stay out there” (N2). Participants suggested that there should be multiple pathways for prioritization because “we’ll find folks that are pretty stable, but they just need an extra pick-me-up. It doesn’t always feel like there are a lot of resources for those people. So I think there should be two paths” (W3).


Participants expressed conflicting views regarding whether the HAA model should account for identity characteristics, such as ethnicity. For example, after learning from our comicboards that HAA does not consider race when making its predictions, some participants brought attention to the vulnerabilities that these characteristics introduce: “statistically speaking, Black people are discriminated against every day of their life on a macro and micro level, both professionally and personally. [...] How can you claim something is equitable, when you’re not even considering race?” (N1). Meanwhile, several unhoused participants argued for a demographic-blind prioritization process. For example, a currently unhoused participant who self-identified as African-American argued that “I really don’t think race is a big deal because everybody has their battles. [...] No matter what race we are, anything can happen to us” (P4).

Finally, participants worried that HAA’s predictions may become less and less valid as time progresses because, although the HAA model is static, they believed the true relationship between the model’s inputs and the real-world outcomes that it predicts is highly unstable across time (cf. [52]). A county worker, W2, offered an example to illustrate how a change in policy or the launch of a new program could impact the validity of the model’s predictions. W2 noted that after a new program, the Continuum of Care, was initiated, their organization expanded their efforts to proactively find unhoused individuals: “So I think the past profiles are not a perfect correlation to the present and the future because [...] we were housing people who were calling us on the phone. The vulnerability changes as we become better at reaching people who are more vulnerable” (W2). Another participant echoed this concern that HAA is trying to model a rapidly shifting target, yet it is trained on outdated data: “The living cost, income, nothing is the same. How can you predict for now when nothing in the world is the same? [...] The world evolves every day, so their criteria for the prediction should evolve with the world” (P7).

5.2.3 Feedback on HAA’s data. As described in Section 3, HAA relies on administrative data in the county’s data warehouse. Participants worried that HAA misses the data of people who are averse to accessing public services due to institutional violence and prior traumas with these systems. As a county worker argued: “the lack of engaging in a service is not a lack of need” (W2). A formerly unhoused participant shared his experience of institutional violence: “I was in a lot of pain and I couldn’t really move. [...] I went to the hospital. [...] They searched me when I went in. [...] They came and searched for me again. [...] They brought me upstairs and searched me a third time. So I left. [...] They were looking at my record and judging me based on my past experience at the hospital that I was likely to have drugs on me” (P2). Another participant shared a similar experience and how it delayed her cancer treatment: “I had cancer inside my body. That was misdiagnosed because they thought I was there for other reasons to get free opioids” (P7). In light of such fears and mistrust around public services, a county worker suggested that sometimes “the drop-off is a greater indicator of the vulnerability than the continued engagement with the service” (W2).
Participants also worried that the use of HAA could further disadvantage certain populations, as they anticipated additional causes of systematic missingness in the data. For example, a county worker (W3) and a service provider (N3) mentioned that medical outreach and free clinics do not keep track of people’s medical records. They also argued that young adults and people new to homelessness may not have records on file. They wondered whether HAA’s current design was basically saying: “because you haven’t been homeless long enough, you don’t deserve housing” (N3). Furthermore, participants mentioned that people who suffer from domestic violence might provide false information about themselves: “because of the fear of their perpetrator finding them” (N3). Finally, participants expressed concerns that: “some folks who are really struggling with their mental health might not realize how vulnerable they are” (W3). For example, a medical service provider shared that “I’m much more concerned with people that are eating out of garbage cans, that are not admitted to mental health hospitals; people who are sleeping next to busy intersections so that the noise will drown out the voices in their head, who are not going to inpatient admissions” (N4).

Furthermore, participants were concerned that the administrative records upon which HAA relies would often be outdated, failing to reflect significant changes in people’s situations. For example, a participant shared that she was likely to face homelessness again, but for a completely different reason than her prior experience of homelessness: “I was facing totally different issues than I am now. The reason for me needing them before is because I had no support. I had no family. I was a foster kid. Now, I’m an adult. I can’t work. I’m facing health issues. So my issues back then weren’t what they are today” (P7). P7 shared with us that she tried to call the county to update her information but couldn’t reach anyone on the other end. Participants also noted that a person’s situation could change rapidly, even within a few days: “that was my biggest concern because somebody’s situation could change over a weekend. [...] The incidence of violence or financial background, that kind of stuff can change pretty quickly” (N1).

Finally, participants emphasized that, to complement available quantitative data, it is critical to consider qualitative narratives when making decisions about housing prioritization. They argued that simply increasing the granularity of the numeric features currently used by the model (e.g., trying to capture broad categories of reasons why a person went to jail) would never be able to replace the need to consider such narratives. For example, a service provider suggested: “having a qualitative narrative of what those things were, I think that might be better than trying to assign a numerical value to a pot charge versus a domestic violence charge” (N5). In addition, participants believed that other important factors, such as social dynamics on the street, could only be captured through qualitative narratives: “you’re not really accessing some of the social dynamics that aren’t digitized. They’re more of a narrative of story. The situation on the streets is like following a soap opera [...] because people are relating to each other. There are dangerous domestic violence situations, people that are threatening to kill other people, and people who are beginning to give up and overdosing themselves on purpose. I don’t think that a lot of those types of data, either medical, psychiatric, or soap opera are going to be captured by the computer” (N4).
In order to capture such narratives, participants argued for the importance of human investigation, to complement analyses of quantitative data.


5.3 Feedback on HAA’s Deployment

Frontline workers both within and outside of the county shared experiences where they had observed potentially harmful errors in HAA’s scoring behavior. When they encountered such behaviors, workers often felt powerless to take action—particularly given that information about how HAA works is intentionally withheld from them. Although county workers shared ways they are currently able to exercise their discretion and advocate for individuals to be prioritized differently, they noted that they were discouraged from using these mechanisms. Participants felt strongly that frontline workers should be empowered and extensively trained to override HAA’s recommendations when appropriate (Section 5.3.1). They also suggested broader changes to the sociotechnical system around HAA’s deployment (Section 5.3.2), and proposed alternative ways of using data to streamline provision of services (Section 5.3.3).

5.3.1 Feedback on the current deployment. Throughout our study, frontline workers both within and outside the county shared multiple examples of what they believed to be erroneous, potentially harmful algorithmic behaviors. However, they felt frustrated and powerless when they encountered these situations: “When someone’s score doesn’t qualify them for anything, I kind of have to share that [with unhoused individuals]. And then, it’s just we’re both helpless in that situation” (W1). County workers noted that even though they directly interact with HAA day-to-day, they were kept from knowing how the algorithm calculates the scores or weighs different features: “[My supervisor] always says, [the model developers] tell him don’t ask what’s in the sausage.” (W1). In the absence of formal insight into how scores are computed, workers developed their own hypotheses about how the system works and where the system may be less reliable, based on their daily interactions with the system (cf. [31]).

Workers shared that, in cases where the score does not match a person’s representation, they sometimes try to override HAA by running the alternative assessment, “alt HAA,” or by submitting prioritization requests. However, they are discouraged from using these override mechanisms. For example, a county worker shared that when they disagreed with HAA’s score, “the only other thing you can do is run the alt HAA, and we’re really not supposed to do that” (W1). Meanwhile, some unhoused participants found the questionnaire-based alt HAA itself easy to game: “I’m a really good test taker. That’s how I got in” (P3). Aside from alt HAA, prioritization requests rely heavily on individual advocacy, which overloads individual workers without systematic support. For example, a county worker shared an experience submitting a prioritization request on behalf of a person who was diagnosed with schizophrenia and stayed in abandoned houses for two years but only scored a three: “I have to work really hard to be able to offer this person services [...] because he’s not going to be served through this system” (W1).

Considering these limitations, participants suggested that a system like HAA should at least blend the unique strengths of human judgment with those of data-driven algorithms. For example, a participant suggested the county “should continue [the system] but take some of the information that [comes out of] this research and update the system and have more human involvement in the calculations” (P7).


Participants also believed that frontline workers who use HAA should be empowered and extensively trained to know when to rely on HAA versus when to override HAA’s recommendations: “The person should ultimately be responsible for the decision. They should go through extensive training on how to make those decisions and how to weigh the scores versus their intuitions and the new information that the client has given” (N5). Despite these suggestions, unhoused participants still expressed worries that the county may over-rely on HAA and use it as an excuse to avoid more time-consuming but valuable investigations.

5.3.2 Feedback on the broader sociotechnical design considerations. In addition to human intervention, participants also asked for broader changes to the sociotechnical design around the algorithm’s deployment. Specifically, they argued that more upstream work is needed to connect people to HAA in the first place, and that downstream care plans are required for people to succeed after receiving a score from HAA.

Participants argued there is upstream work needed to connect people to HAA because many unhoused individuals who don’t utilize services are flying under the radar. As a medical service provider put it: “You shouldn’t just be worried about who’s in the waiting room. You should be worried about who’s not in the waiting room” (N4). Participants shared various reasons why unhoused individuals do not connect with HAA in the first place. For example, people may not have the inclination to reach out because of their distrust in institutions: “People don’t want to go [to the county] because of the trust issue” (P3). In addition, some people are too vulnerable to seek resources on their own: “It’s weird that you got to prove that you’re homeless [to get housing]. Most people aren’t in the mental state to be able to prove anything in my experience” (P3).

Participants also suggested that downstream care plans should be customized for each individual to prevent them from cycling back to homelessness. For example, a formerly unhoused participant shared that “I actually have a friend. He’s in a house, [but] he can’t stop going out and panhandling and flying signs. [...] Moving into a house doesn’t make you not in a homeless mind state” (P3). County workers also shared examples where people who receive high scores are assigned to programs with a lower level of support and have traumatic experiences in those programs: “The big problem we saw last year is that some 10s are getting rapid programs because there is more availability of resources for the lower level of support. [...] There are people experiencing some trauma from getting into those housing programs” (W2). Participants suggested there should be a systematic effort to follow through and help a person succeed in the end so that “all this information about the vulnerability isn’t just their ticket into the program door” (W2).

5.3.3 Ideas for alternative uses of data to streamline services. Participants shared multiple ideas for alternative ways to use data to streamline service provision, which they perceived as more valuable than the current design of HAA. For example, several unhoused participants suggested that the system could actively connect people to various resources, not only housing, based on the data already available to the county: “I think that’s the way a computer can help because they’ll weave out what is needed for you to best suit your needs” (P8).
County workers also asked for better information exchange about unhoused individuals to form individualized care plans and alleviate their burden: “There’s so many times where the person who’s being served has to sit down and have the same conversation and reveal the same information over and over again. Why isn’t the computer doing a better job of alleviating that burden from the person” (W2). Finally, service providers noted that changes to HAA’s primary data sources within the county’s data warehouse, such as the Homeless Management Information System (HMIS), could have an enormous impact: “HMIS is something that is part of the federal government requirements for everybody that receives funding from HUD, [...] but it’s archaic, horrible, poorly designed, and rarely updated” (N2).

6 DISCUSSION

Given the spread of ADS systems in homeless services, it is critical to understand directly impacted stakeholders’ perspectives on these systems. In this paper, we present the first in-depth understanding of frontline workers’ and unhoused individuals’ desires and concerns around the use of AI in homeless services. To elicit feedback on a deployed ADS system from stakeholders spanning a wide range of relevant literacies, we employed AI lifecycle comicboarding: a feedback elicitation and co-design method that adapts the comicboarding method [23, 44] to scaffold both broad and specific conversations around different components of an AI system’s design, from problem formulation to data selection to model definition and deployment. In this section, we highlight key takeaways, reflect on our experience using AI lifecycle comicboarding, and discuss considerations for the use of this method in future research.

As discussed in Section 5.1, within county government, there was significant skepticism that stakeholders without AI expertise could provide meaningful feedback on the design of an AI system. Given this skepticism, community members and frontline workers were given minimal opportunities to learn about the HAA system or provide feedback. Yet our findings suggest that non-AI experts can provide specific, critical feedback on an AI system’s design and use, if invited and appropriately empowered to do so. As our participants gained insight, through our comicboarding method, into HAA’s design (Section 5.2) and deployment (Section 5.3), they raised a number of concerns and suggestions.

Some of the participants’ design feedback surfaced broad concerns. For example, as participants learned more about HAA’s overall design, they expressed concerns about ways the algorithm may serve to optimize more towards the county’s interests rather than the actual needs and safety of unhoused individuals (Section 5.2.1). Participants also shared ideas for alternative ways to use the data currently available to the county, which they perceived as more likely to bring benefits and less likely to cause harm, compared with HAA’s current design (Section 5.3.3). In addition, participants provided specific feedback on particular model-level and data-level design choices. For instance, participants expressed concerns about the validity and reliability of specific aspects of the model’s design, such as the current choices of proxy variables used to measure particular real-world outcomes (Section 5.2.2). In light of the limitations of available administrative data (Section 5.2.3), participants also offered additional deployment suggestions, including ways to elevate human judgment in the decision-making process (Section 5.3.1) and ways to improve the design of the broader sociotechnical system surrounding HAA’s deployment (Section 5.3.2).


Reflecting on their experiences during the study, participants noted that our comicboarding approach helped to open up conversations around hidden assumptions and decisions underlying the AI system’s design and deployment, which would have otherwise remained opaque to them. For example, a service provider (N4) noted that although they were aware of HAA prior to the study: “I think it helped me understand [how HAA works] because I only knew [in] general terms, I didn’t really see the process laid out like that.” N4 shared that, after going through the full set of comicboards: “It also made me concerned about the limitations of you know, garbage in garbage out. Or incomplete, incomplete out.” Similarly, N3 reflected that, “[The] user friendly narrative around the storyboards was effective for me. It helped me better organize my thoughts. [...] I come out of that [...] with a much better understanding of the HAA.” N3 added that following their experience in the study, “The validity of [HAA] causes me great concern.” These concerns motivated some service providers to reach out to the county after the study with the desire to improve the system and mitigate the potential harm they identified through our comicboards.

Meanwhile, participants also reflected on how their participation in the study had direct benefits beyond monetary compensation. For example, after understanding how HAA currently generates scores based on personal information, some unhoused individuals decided to actively reach out to the county in order to update their life situations: “It gave me a better understanding for the application process, I didn’t know how they went about that at all. It let me know certain things that I can do on my end to help my chances of getting the housing. After this, I’m gonna try to reach out and provide updated documentation for my medical issues” (P7). For one county worker, who interacts with HAA day-to-day, participating in the study made them feel empowered to explain how HAA’s algorithmic decisions are made: “This is revealing to me that I can be like, more upfront with people when I do give them the score and let them know, this is exactly what you are, you know, what you’re eligible for [...] I think it would be good [...] for me to be a little bit more transparent with people” (W1).

In this study, we used AI lifecycle comicboarding to solicit feedback and suggestions for design modifications to an AI system after that system had already been deployed. While collecting post-deployment feedback is critical, as discussed in Section 5.1, we also encourage using the method early in the design process of a new AI system (e.g., at the conceptualization, prototyping, or pilot testing stages). Indeed, at later stages, once significant resources have been invested in an AI system’s development, there is a risk that system developers will be hesitant to implement broader changes such as those discussed in Section 5.2.1 and Section 5.3.3, and may instead be biased towards more incremental changes that can be implemented with fewer resources. When using the method before a system is in place, the final panel of each comicboard (e.g., Figure 1 (e)) can be used to elicit feedback on proposed designs. With this flexibility, our method is intended to empower participants not only to critique and redesign existing AI systems, but also to redirect the design of proposed systems at the earliest stages of the design process.
In the future, more formal and controlled evaluations of AI lifecycle comicboarding would help to further clarify its strengths and limitations in eliciting specific feedback, compared to standard interviews or alternative comicboarding approaches.


We envision that our method could be adapted and used to elicit feedback from other stakeholder groups who experience intersectional social disadvantages, such as children and families subject to out-of-home placements and state intervention, or individuals with intellectual/developmental or psychiatric disabilities. In future work, we also plan to develop a suite of toolkits based on the comicboards generated for this study, along with instructions on how to adapt and use them. We hope that releasing these toolkits will help HCI practitioners and developers of AI systems to better integrate the voices and perspectives of impacted stakeholders.

7 CONCLUSION

Our study demonstrates that with an appropriate feedback elicitation method, community stakeholders spanning diverse backgrounds and literacies can provide specific and critical feedback on an AI system’s design. Using our method, we have presented an in-depth understanding of frontline workers’ and unhoused individuals’ perspectives on the use of AI in homeless services. Future research should explore ways to adapt the AI lifecycle comicboarding method for use with other stakeholder groups, who may face additional barriers to participation in AI design and critique. In addition, future work should explore the design of practical processes, policies, and technical approaches that can support the effective incorporation of stakeholder feedback into AI system design in practice.

ACKNOWLEDGMENTS

We thank our participants for their time and input that shaped this research. We also thank our contacts in the county and non-profit service providers for helping with recruitment and for verifying the comicboards. Finally, we thank Laura Dabbish, Yodit Betru, Bonnie Fan, Jordan Taylor, Wesley Deng, Logan Stapleton, Anna Kawakami, Jane Hsieh, Seyun Kim, Tiffany Chih, and anonymous reviewers for their insightful feedback on the study design and paper draft. This work was supported by the National Science Foundation (NSF) under Award Nos. 1939606, 2001851, 2000782, and 1952085, and by the Carnegie Mellon University Block Center for Technology and Society (Award No. 55410.1.5007719).

REFERENCES

[1] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias. In Ethics of Data and Analytics. Auerbach Publications, 254–264.
[2] Virginia Braun and Victoria Clarke. 2012. Thematic analysis. American Psychological Association.
[3] Virginia Braun and Victoria Clarke. 2019. Reflecting on reflexive thematic analysis. Qualitative Research in Sport, Exercise and Health 11, 4 (2019), 589–597.
[4] Carlos Castillo, Mariano Martín Zamorano, Giovanna Jaramillo, and Sara Suárez Gonzalo. 2020. Algorithmic Impact Assessment of the predictive system for risk of homelessness developed for the Allegheny County. Technical Report. Eticas Research and Consulting. http://www.alleghenycountyanalytics.us/wp-content/uploads/2020/08/Eticas-assessment.pdf
[5] Valerie Chen, Umang Bhatt, Hoda Heidari, Adrian Weller, and Ameet Talwalkar. 2022. Perspectives on Incorporating Expert Feedback into Model Updates. arXiv preprint arXiv:2205.06905 (2022).
[6] Hao-Fei Cheng, Logan Stapleton, Ruiqi Wang, Paige Bullock, Alexandra Chouldechova, Zhiwei Steven Wu, and Haiyi Zhu. 2021. Soliciting stakeholders’ fairness notions in child maltreatment predictive systems. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–17.
[7] Alexandra Chouldechova, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. 2018. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In Conference on Fairness, Accountability and Transparency. PMLR, 134–148.
[8] Alexandra Chouldechova and Aaron Roth. 2018. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810 (2018).
[9] Sam Corbett-Davies and Sharad Goel. 2018. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023 (2018).
[10] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 797–806.
[11] Sasha Costanza-Chock. 2020. Design justice: Community-led practices to build the worlds we need. The MIT Press.
[12] Henriette Cramer, Jenn Wortman Vaughan, Ken Holstein, Hanna Wallach, Jean Garcia-Gathright, Hal Daumé III, Miroslav Dudík, and Sravana Reddy. 2019. Challenges of incorporating algorithmic fairness into industry practice. FAT* Tutorial (2019).
[13] Scott Davidoff, Min Kyung Lee, Anind K Dey, and John Zimmerman. 2007. Rapidly exploring application design through speed dating. In International Conference on Ubiquitous Computing. Springer, 429–446.
[14] Fernando Delgado, Stephen Yang, Michael Madaio, and Qian Yang. 2021. Stakeholder Participation in AI: Beyond "Add Diverse Stakeholders and Stir". arXiv preprint arXiv:2111.01122 (2021).
[15] Jack Denton. 2019. Will algorithmic tools help or harm the homeless. Pacific Standard (April 2019). https://psmag.com/social-justice/will-algorithmic-toolshelp-or-harm-the-homeless
[16] Julia Dressel and Hany Farid. 2018. The accuracy, fairness, and limits of predicting recidivism. Science Advances 4, 1 (2018), eaao5580.
[17] Virginia Eubanks. 2018. Automating inequality: How high-tech tools profile, police, and punish the poor. St. Martin’s Press.
[18] Pittsburgh Task Force. 2020. Report of the Pittsburgh Task Force on Public Algorithms. https://www.cyber.pitt.edu/sites/default/files/pittsburgh_task_force_on_public_algorithms_report.pdf
[19] Lenin C Grajo, Sharon A Gutman, Hannah Gelb, Katie Langan, Karen Marx, Devon Paciello, Christie Santana, Ashley Sgandurra, and Krysti Teng. 2020. Effectiveness of a functional literacy program for sheltered homeless adults. OTJR: Occupation, Participation and Health 40, 1 (2020), 17–26.
[20] Aaron Halfaker and R Stuart Geiger. 2020. Ores: Lowering barriers with participatory machine learning in Wikipedia. Proceedings of the ACM on Human-Computer Interaction 4, CSCW2 (2020), 1–37.
[21] Christina Harrington, Sheena Erete, and Anne Marie Piper. 2019. Deconstructing community-based collaborative design: Towards more equitable participatory design engagements. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–25.
[22] Meghan Henry, Tanya de Sousa, Colette Tano, Rhaia Hull, Nathaniel Dick, Tori Morris, Meghan Shea, Sean Morris, and Abt Associates. 2022. The 2021 annual homelessness assessment report to congress. (2022).
[23] Alexis Hiniker, Kiley Sobel, and Bongshin Lee. 2017. Co-designing with preschoolers using fictional inquiry and comicboarding. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 5767–5772.
[24] Kenneth Holstein, Erik Harpstead, Rebecca Gulotta, and Jodi Forlizzi. 2020. Replay enactments: Exploring possible futures through historical data. In Proceedings of the 2020 ACM Designing Interactive Systems Conference. 1607–1618.
[25] Kenneth Holstein, Bruce M McLaren, and Vincent Aleven. 2019. Designing for complementarity: Teacher and student needs for orchestration support in AI-enhanced classrooms. In International Conference on Artificial Intelligence in Education. Springer, 157–171.
[26] Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé III, Miro Dudik, and Hanna Wallach. 2019. Improving fairness in machine learning systems: What do industry practitioners need? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–16.
[27] Naja Holten Møller, Irina Shklovski, and Thomas T Hildebrandt. 2020. Shifting concepts of value: Designing algorithmic decision-support systems for public services. In Proceedings of the 11th Nordic Conference on Human-Computer Interaction: Shaping Experiences, Shaping Society. 1–12.
[28] Andrea Hu, Stevie Chancellor, and Munmun De Choudhury. 2019. Characterizing homelessness discourse on social media. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems. 1–6.
[29] United Nations Human Rights Council. 2019. Guidelines for the Implementation of the Right to Adequate Housing. https://documents-dds-ny.un.org/doc/UNDOC/GEN/G19/353/90/PDF/G1935390.pdf
[30] Naveena Karusala, Jennifer Wilson, Phebe Vayanos, and Eric Rice. 2019. Street-level realities of data practices in homeless services provision. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–23.
[31] Anna Kawakami, Venkatesh Sivaraman, Hao-Fei Cheng, Logan Stapleton, Yanghuidi Cheng, Diana Qing, Adam Perer, Zhiwei Steven Wu, Haiyi Zhu, and Kenneth Holstein. 2022. Improving Human-AI Partnerships in Child Welfare: Understanding Worker Practices, Challenges, and Desires for Algorithmic Decision Support. In CHI Conference on Human Factors in Computing Systems. 1–18.
[32] Nathan J Kim, Jessica Lin, Craig Hiller, Chantal Hildebrand, and Colette Auerswald. 2021. Analyzing US tweets for stigma against people experiencing homelessness. Stigma and Health (2021).
[33] Bogdan Kulynych, David Madras, Smitha Milli, Inioluwa Deborah Raji, Angela Zhou, and Richard Zemel. 2020. Participatory approaches to machine learning. In International Conference on Machine Learning Workshop.
[34] Robert Kurzban and Mark R Leary. 2001. Evolutionary origins of stigmatization: the functions of social exclusion. Psychological Bulletin 127, 2 (2001), 187.
[35] Liny Lamberink. 2020. A city plagued by homelessness builds AI tool to predict who’s at risk. CBC News (August 2020). https://www.cbc.ca/news/canada/london/artificial-intelligence-london-1.5684788
[36] Christopher A Le Dantec and W Keith Edwards. 2008. Designs on dignity: perceptions of technology among the homeless. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 627–636.
[37] Christopher A Le Dantec, Robert G Farrell, Jim E Christensen, Mark Bailey, Jason B Ellis, Wendy A Kellogg, and W Keith Edwards. 2011. Publics in practice: Ubiquitous computing at a shelter for homeless mothers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1687–1696.
[38] Min Kyung Lee, Daniel Kusbit, Anson Kahng, Ji Tae Kim, Xinran Yuan, Allissa Chan, Daniel See, Ritesh Noothigattu, Siheon Lee, Alexandros Psomas, et al. 2019. WeBuildAI: Participatory framework for algorithmic governance. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–35.
[39] Karen Levy, Kyla E Chasalow, and Sarah Riley. 2021. Algorithms and decision-making in the public sector. Annual Review of Law and Social Science 17 (2021), 309–334.
[40] Calvin Liang. 2021. Reflexivity, positionality, and disclosure in HCI. https://medium.com/@caliang/reflexivity-positionality-and-disclosurein-hci-3d95007e9916
[41] Calvin A Liang, Sean A Munson, and Julie A Kientz. 2021. Embracing four tensions in human-computer interaction research with marginalized people. ACM Transactions on Computer-Human Interaction (TOCHI) 28, 2 (2021), 1–47.
[42] Nora McDonald, Sarita Schoenebeck, and Andrea Forte. 2019. Reliability and inter-rater reliability in qualitative research: Norms and guidelines for CSCW and HCI practice. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–23.
[43] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54, 6 (2021), 1–35.
[44] Neema Moraveji, Jason Li, Jiarong Ding, Patrick O’Kelley, and Suze Woolf. 2007. Comicboarding: using comics as proxies for participatory design with children. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1371–1374.
[45] Ann Oakley. 2013. Interviewing women: A contradiction in terms. In Doing Feminist Research. Routledge, 52–83.
[46] Allegheny County Department of Human Services. 2020. Allegheny Housing Assessment (AHA) Frequently Asked Questions (FAQs). Retrieved January 20, 2023 from https://www.alleghenycounty.us/WorkArea/linkit.aspx?LinkIdentifier=id&ItemID=6442472819
[47] Allegheny County Department of Human Services. 2020. Allegheny Housing Assessment (AHA) Report on Client Focus Groups. Retrieved January 20, 2023 from https://www.alleghenycountyanalytics.us/wp-content/uploads/2020/08/AHA-Focus-group-report.pdf
[48] Allegheny County Department of Human Services. 2021.
Allegheny County Data Warehouse. Retrieved January 26, 2023 from https://www.alleghenycountyanalytics.us/wp-content/uploads/2021/02/DataWarehouse-updated-1-2021.pdf [49] Allegheny County Department of Human Services. 2021. A Bumpy but Worthwhile Ride: The Allegheny County Department of Human Services’ data-driven efforts to improve services for people experiencing homelessness. Retrieved January 26, 2023 from https://www.alleghenycountyanalytics.us/wp-content/uploads/2021/10/21ACDHS-07-CoordinatedEntry_v3.pdf [50] Jo Phelan, Bruce G Link, Robert E Moore, and Ann Stueve. 1997. The stigma of homelessness: The impact of the label" homeless" on attitudes toward poor persons. Social psychology quarterly (1997), 323–337. [51] Cotina Lane Pixley, Felicia A Henry, Sarah E DeYoung, and Marc R Settembrino. 2022. The role of homelessness community based organizations during COVID-19. Journal of Community Psychology 50, 4 (2022), 1816–1830. [52] Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. 2008. Dataset shift in machine learning. Mit Press. [53] Shulamit Reinharz and Lynn Davidman. 1992. Feminist methods in social research. Oxford University Press. [54] Jahmeilah Roberson and Bonnie Nardi. 2010. Survival needs and social inclusion: Technology use among the homeless. In Proceedings of the 2010 ACM conference on Computer supported cooperative work. 445–448. [55] Samantha Robertson, Tonya Nguyen, and Niloufar Salehi. 2021. Modeling assumptions clash with the real world: Transparency, equity, and community challenges for student assignment algorithms. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–14.

Kuo and Shen et al.

[56] Samantha Robertson and Niloufar Salehi. 2020. What If I Don’t Like Any Of The Choices? The Limits of Preference Elicitation for Participatory Algorithm Design. arXiv preprint arXiv:2007.06718 (2020). [57] Devansh Saxena, Karla Badillo-Urquiola, Pamela J Wisniewski, and Shion Guha. 2020. A human-centered review of algorithms used within the US child welfare system. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–15. [58] Devansh Saxena, Karla Badillo-Urquiola, Pamela J Wisniewski, and Shion Guha. 2021. A framework of high-stakes algorithmic decision-making for the public sector developed through a case study of child-welfare. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (2021), 1–41. [59] Aaron Shapiro. 2017. Reform predictive policing. Nature 541, 7638 (2017), 458– 460. [60] Hong Shen, Leijie Wang, Wesley H Deng, Ciell Brusse, Ronald Velgersdijk, and Haiyi Zhu. 2022. The Model Card Authoring Toolkit: Toward Communitycentered, Deliberation-driven AI Design. In 2022 ACM Conference on Fairness, Accountability, and Transparency. 440–451. [61] Anne B Shlay and Peter H Rossi. 1992. Social science research and contemporary studies of homelessness. Annual review of sociology (1992), 129–160. [62] Logan Stapleton, Min Hun Lee, Diana Qing, Marya Wright, Alexandra Chouldechova, Ken Holstein, Zhiwei Steven Wu, and Haiyi Zhu. 2022. Imagining new futures beyond predictive systems in child welfare: A qualitative study with impacted stakeholders. In 2022 ACM Conference on Fairness, Accountability, and Transparency. 1162–1177. [63] Denny L Starks, Tawanna Dillahunt, and Oliver L Haimson. 2019. Designing technology to support safety for transgender women & non-binary people of color. In Companion Publication of the 2019 on Designing Interactive Systems Conference 2019 Companion. 289–294. [64] Caitlin Thompson. 2021. Who’s homeless enough for housing? In San Francisco, an algorithm decides. Pacific Standard (September 2021). https://www.codastory. com/authoritarian-tech/san-francisco-homeless-algorithm [65] Rhema Vaithianathan and Chamari I Kithulgoda. 2020. Using Predictive Risk Modeling to Prioritize Services for People Experiencing Homelessness in Allegheny County. Technical Report. Centre for Social Data Analytics, Auckland, New Zealand. https://www.alleghenycounty.us/WorkArea/linkit.aspx?LinkIdentifier= id&ItemID=6442473749 [66] Rhema Vaithianathan and Chamari I Kithulgoda. 2020. Using Predictive Risk Modeling to Prioritize Services for People Experiencing Homelessness in Allegheny County - Methodology Update. Technical Report. Centre for Social Data Analytics, Auckland, New Zealand. https://www.alleghenycountyanalytics.us/wp-content/ uploads/2021/01/AHA-Methodology-Update_December-2020_v2.pdf [67] Michael Veale, Max Van Kleek, and Reuben Binns. 2018. Fairness and accountability design needs for algorithmic support in high-stakes public sector decisionmaking. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1–14. [68] Jill Palzkill Woelfer and David G Hendry. 2010. Homeless young people’s experiences with information systems: Life and work in a community technology center. In Proceedings of the SIGCHI conference on human factors in computing systems. 1291–1300. [69] Jill Palzkill Woelfer and David G Hendry. 2012. Homeless young people on social network sites. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2825–2834. 
[70] Allison Woodruff, Sarah E Fox, Steven Rousso-Schindler, and Jeffrey Warshaw. 2018. A qualitative exploration of perceptions of algorithmic fairness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1–14. [71] Qian Yang, Alex Scuito, John Zimmerman, Jodi Forlizzi, and Aaron Steinfeld. 2018. Investigating how experienced UX designers effectively work with machine learning. In Proceedings of the 2018 designing interactive systems conference. 585– 596. [72] Haiyi Zhu, Bowen Yu, Aaron Halfaker, and Loren Terveen. 2018. Value-sensitive algorithm design: Method, case study, and lessons. Proceedings of the ACM on human-computer interaction 2, CSCW (2018), 1–23. [73] Douglas Zytko, Pamela J. Wisniewski, Shion Guha, Eric PS Baumer, and Min Kyung Lee. 2022. Participatory Design of AI Systems: Opportunities and Challenges Across Diverse Users, Relationships, and Application Domains. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–4.

A COMICBOARDS

We include all the comicboards we developed and used for the study in Figure 4. Each row corresponds to one of eight major components of an AI system's design. While problem formulation is typically the first stage within an AI development lifecycle, we showed the corresponding comicboard at the end of our study in order to broaden the discussion once participants had a better understanding of the current system. We also kept the last panel of this comicboard open-ended to invite a broad range of alternative problem formulations. This presentation order is a design choice, given that HAA has been deployed for over two years. When using our method in contexts where an AI system is at an earlier stage of conceptualization, researchers are encouraged to explore alternative presentation orders (e.g., starting with the problem formulation) as appropriate.

B HIGHER-LEVEL THEMES

We provide a summary of the higher-level themes we identified through a reflexive thematic analysis approach in Table 3. These themes broadly correspond to the section headers in Section 5. Due to word-count and space limitations, we do not include the 65 first-level themes and 1,023 codes in the table.



Figure 4: The comicboards we developed and used for our study. [The figure contains eight comicboard rows, one per component of the AI lifecycle: task definition, data curation, model definition, training, testing, deployment, feedback, and problem formulation. The rows follow "Jamie," a person experiencing homelessness applying for county housing, and each row ends with an open-ended question posed to participants.]



Table 3: The three third-level themes and ten second-level themes we identified through data analysis.

Third-level themes | Second-level themes
desires for feedback opportunities | perceptions of a more open feedback process for previous assessment tools; perceptions of insufficient community involvement in HAA; desires for continuous, post-deployment feedback channels; perceptions of the county's intention of gathering legitimate feedback
feedback on HAA's design | feedback on HAA's objective; feedback on HAA's model design; feedback on HAA's data
feedback on HAA's deployment | feedback on the current deployment; feedback on the broader sociotechnical design considerations; ideas for alternative uses of data to streamline services

Evaluating Large Language Models in Generating Synthetic HCI Research Data: a Case Study

Perttu Hämäläinen∗, [email protected], Aalto University, Espoo, Finland
Mikke Tavast∗, [email protected], Aalto University, Espoo, Finland
Anton Kunnari, [email protected], University of Helsinki, Helsinki, Finland

∗ Perttu Hämäläinen and Mikke Tavast contributed equally to this work.

ABSTRACT
Collecting data is one of the bottlenecks of Human-Computer Interaction (HCI) research. Motivated by this, we explore the potential of large language models (LLMs) in generating synthetic user research data. We use OpenAI's GPT-3 model to generate open-ended questionnaire responses about experiencing video games as art, a topic not tractable with traditional computational user models. We test whether synthetic responses can be distinguished from real responses, analyze errors of synthetic data, and investigate content similarities between synthetic and real data. We conclude that GPT-3 can, in this context, yield believable accounts of HCI experiences. Given the low cost and high speed of LLM data generation, synthetic data should be useful in ideating and piloting new experiments, although any findings must obviously always be validated with real data. The results also raise concerns: if employed by malicious users of crowdsourcing services, LLMs may make crowdsourcing of self-report data fundamentally unreliable.

CCS CONCEPTS
• Human-centered computing → Empirical studies in HCI.

KEYWORDS
User experience, User models, Language models, GPT-3

ACM Reference Format:
Perttu Hämäläinen, Mikke Tavast, and Anton Kunnari. 2023. Evaluating Large Language Models in Generating Synthetic HCI Research Data: a Case Study. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 19 pages. https://doi.org/10.1145/3544548.3580688

1 INTRODUCTION

HCI Researcher: Dear AI, do you think you could role-play a user study participant for me?
AI: Sure, I can try.
Researcher: Let's imagine you're a scientist and the user study is about academic writing tools. What is your favorite writing tool and why?


AI: My favorite writing tool would have to be Microsoft Word. I find it very user-friendly and versatile, and it has everything I need to get the job done.
Researcher: Why do you prefer Word over LaTeX?
AI: LaTeX is a great tool for typesetting, but I find it to be much more difficult to use than Word. With Word, I can easily format my document the way I want it to look, and I don't have to worry about code.

The dialogue above was generated using OpenAI Playground¹, a tool that allows one to input a piece of text (a prompt) and ask the GPT-3 large language model (LLM) [10] to generate a plausible continuation. We wrote the boldface parts and let GPT-3 generate the italicized continuations. The result is characteristic of the phenomenon we investigate in this paper: through learning to model and predict various kinds of human-produced texts ranging from technical documentation to online discussions and poetry, LLMs like GPT-3 give the appearance of "understanding" human experiences such as interactive product use. Of course, the internal operation of the models differs from the internal psychological and neurophysiological processes of humans; LLMs simply learn to predict the next symbol (or impute missing symbols) in a sequence. Nevertheless, on a purely behavioral level, the results can be very human-like.

Much of HCI research is conducted using verbal data such as interviews and questionnaires (e.g., [3, 61, 72]), but collecting such data can be slow and expensive. Therefore, the above suggests that LLMs might be useful in generating synthetic/hypothetical data for HCI research, a notion we explore empirically in this paper. LLMs are typically trained on enormous Internet datasets such as Common Crawl [67], including an abundance of online discussions about interactive technology and products such as phones, computers, and games. Therefore, it seems plausible that LLMs could generate, e.g., realistic 1st-person accounts of technology use, and answer natural language questions about user experiences, motivations, and emotions. We emphasize that we do not claim that such synthetic LLM data could ever be a replacement for data from real human participants. We simply consider that synthetic data might be useful in some contexts, for example, when piloting ideas or designing an interview paradigm.

In effect, we view LLMs as a new kind of search engine into the information, opinions, and experiences described in their Internet-scale training data. Unlike traditional search engines, LLMs can be queried in the form of a narrative such as a fictional interview. Furthermore, LLMs exhibit at least some generalization capability to new tasks and data (e.g., [45, 71, 81]). This presents an untapped opportunity for counterfactual What if? exploration, e.g., allowing a researcher or designer to probe questions such as "What might users say if I ask them X?" or "Might interview topic X result in interesting answers?" The benefit of such model-based exploration is the high speed and low cost of data generation, while the obvious drawback is data quality: any findings based on generated data should be validated with real human participants, as language models are known to exhibit biases and make factual errors [29, 66, 83]. Nevertheless, we believe it worthwhile to explore the capabilities of LLMs in this context and investigate how human-like the generated data is.

In the bigger picture, LLMs have the potential to expand computational user modeling and simulation to significant new avenues. Despite Oulasvirta's call for rediscovering computational user models [57] and recent modeling successes like simulation-based prediction of touchscreen typing behavior [30] and game level difficulty [69], computational user modeling and simulation is presently limited to relatively simple behavioral measures. We are intrigued by the potential of LLMs in generating rich synthetic self-report data about user experience, motivation, and emotion. Because of the complexity of these phenomena, it is enormously challenging to construct computational models and simulations explicitly, from the bottom up. In contrast, LLMs tackle the modeling problem implicitly: the Transformer neural network architecture [77] underlying LLMs learns latent representations and procedures that can model and generate language in a surprisingly generalizable manner [43, 55, 63], e.g., utilizing novel concepts only described in the prompt and not included in training data [10] and generating chain-of-thought "inner monologue" that explains the reasoning behind question answers [79].

Contribution: Considering the above, LLMs appear to present an interesting new tool for HCI research, but their usefulness hinges on the validity of the generated data. However, the human-likeness of data generated by LLMs has not yet been evaluated in the HCI research domain. This presents the knowledge gap addressed in this paper. We contribute through a series of experiments investigating the following research questions, each probing an aspect of the human-likeness of synthetic data generated using GPT-3:

Experiment 1: Can one distinguish between GPT-3 generated synthetic question answers and real human answers? (Method: quantitative online study, N=155)
Experiment 2: What kinds of errors does GPT-3 make? (Method: qualitative evaluation)
Experiment 3: Can synthetic data provide plausible answers to real HCI research questions? What similarities and differences are there in GPT-3 and real data? (Method: computational analysis and visualization)

Each of these questions was investigated in the specific context of participants describing art experiences in video games. This allows us to compare GPT-3 generations to real human data from Bopp et al. [6, 7], a study recent enough that the data is not included in GPT-3's training data. Experiencing games as art was chosen as the domain because it would be challenging for any prior user modeling or simulation approach. We only use real data for evaluating GPT-3 generations, without using the data for any training or fine-tuning, and without including the data in the prompt to guide the generations.

¹ https://beta.openai.com/playground


An obvious limitation of our work is that we only examine LLM capabilities using one particular dataset. We make no claims about the generalizability of our results to all the other possible use cases; nevertheless, we believe that our investigation is both useful and needed to assess the application potential of GPT-3 and LLMs as synthetic HCI data sources. Our results should also help in understanding the misuse potential and risks that LLMs may present, e.g., if bots and malicious users adopt LLMs to generate fake answers on online research crowdsourcing platforms such as Prolific or Amazon Mechanical Turk. If LLM responses are highly human-like, detecting fake answers may become impossible and the platforms will need new ways to validate their users and data.

2 BACKGROUND AND RELATED WORK

2.1 Language Modeling and Generation

Language modeling and generation has a long history in AI and computational creativity research [12, 44, 70]. Typically, text generation is approached statistically as sampling each token (a character, word, or word part) conditional on previous tokens, x_t ∼ P(x_t | x_1 . . . x_{t−1}; θ), where x_t denotes the t:th token in the text sequence, and θ denotes the parameters of the sampling distribution. In this statistical view, the modeling/learning task amounts to optimizing θ based on training data, e.g., to maximize the probabilities of all tokens in the training data conditional on up to N preceding tokens, where N is the context size. In the most simple case of a very low N and a vocabulary of just a few tokens, it can be feasible to count and memorize the probabilities/frequencies of all N-token sequences in the training data. However, the number of possible sequences grows exponentially with N. Modern language models like GPT-3 abandon memorization and instead use artificial neural networks, i.e., θ denotes the parameters of the network. For the text generation/sampling task, such a neural network takes in a sequence of tokens and outputs the sampling probabilities of each possible next token. Deep neural networks are particularly suited for the task, as their expressiveness can grow exponentially with network depth [52, 62], which mitigates the exponential complexity.

While the currently used learning/optimization algorithms for deep neural networks have no convergence guarantees, there is ample empirical evidence that large enough neural language models can exhibit remarkably creative and intelligent behavior, e.g., in handling novel concepts not included in the training data and only introduced in the prompt. An example is provided by the following prompt (bold) and the continuation generated by GPT-3 (italic) [10]:

A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is: We were traveling in Africa and we saw these very cute whatpus. To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is: One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles.

This result cannot be explained through simple memorization of the training material, as "whatpu" and "farduddle" are made-up words not used in training the model [10]. Although the details are beyond the scope of this paper, recent research has begun to shed light on the mechanisms and mathematical principles underlying this kind of generalization capability. One explanation is that the commonly used next token prediction objective may force an LLM to implicitly learn a wide range of tasks [65, 71]. For example, a model might learn the format and structure of question answering by training on generic text from a web forum [71]. There is evidence that next token prediction can even result in computational operations and procedures that generalize to other data domains such as image understanding [45]. A pair of recent papers also provides a plausible mathematical and mechanistic explanation for LLM in-context learning, i.e., the capability to operate on tasks and information included in the prompt instead of the training data [21, 55].

Here, we'd like to emphasize that most modern LLMs including GPT-3 utilize the Transformer architecture [10, 77], which goes beyond simple memorization and recall. Transformer models are composed of multiple layers, each layer performing one step of a complex multi-step computational procedure, operating on internal data representations influenced by learned model parameters and an attention mechanism. Transformer "grokking" research indicates that the representations can allow highly accurate generalization to data not included in training [43, 63]. Furthermore, a "hard-attention" variant of the Transformer has been proven Turing-complete, based on the ability to compute and access the representations [60]. More generally, Transformer models have proven highly capable in generating music and images [20, 68], solving equations [39], performing logical and counterfactual reasoning with facts and rules defined using natural language [15], and generating proteins with desired properties [48]. As an extreme example of generalization, Transformer LLMs have been demonstrated as general-purpose world models that can describe how the fictional world of a text adventure game reacts to arbitrary user actions such as "invent the Internet" [16], or how a Linux virtual machine reacts to terminal commands [17]. Taken together, the evidence above makes it plausible that LLMs might produce at least somewhat realistic results when provided with a hypothetical scenario of a research interview.

From a critical perspective, neural language models require massive training datasets, which are in practice composed by automatically scraping Internet sources like Reddit discussions. Careful manual curation of such data is not feasible, and automatic heuristic measures like Reddit karma points are used instead [65]. This means the datasets are biased and may contain various kinds of questionable content. This can be mitigated to some degree by automatically detecting and regenerating undesired content [54, 74], and researchers are developing "debiasing" approaches [83]. On the other hand, model architectures, training datasets, and data curation methods are also evolving. Hence, one can expect the quality of the synthetic data generated by language models to continue improving.
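To make the sampling formulation at the start of this section concrete, the following is a minimal, self-contained Python sketch of autoregressive generation. It is our own illustration rather than code from the paper, and it uses a toy bigram count table in place of the neural network parameters θ.

# Toy illustration of autoregressive sampling x_t ~ P(x_t | x_{t-1}; theta),
# with theta approximated by bigram counts instead of a neural network.
import random
from collections import defaultdict

def train_bigram(tokens):
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def sample_next(counts, prev):
    successors = counts.get(prev)
    if not successors:          # no observed continuation for this token
        return None
    choices, weights = zip(*successors.items())
    return random.choices(choices, weights=weights, k=1)[0]

def generate(counts, prompt_tokens, max_new_tokens=8):
    out = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = sample_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)

corpus = "the game felt like art because the game told a story".split()
model = train_bigram(corpus)
print(generate(model, ["the"]))

A neural language model replaces the count table with a network that outputs next-token probabilities, but the generation loop is conceptually the same.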

2.2 GPT-3

GPT-3 is based on the Transformer architecture [10, 77]. The largest GPT-3 model has 175 billion parameters [10]; however, multiple variants of different sizes and computational costs are currently available for use via OpenAI's API. Generally, larger models yield better results, and this is expected to continue in the future [10].


However, even the largest language models have common, well-known problems. For example, neural language generators often produce unnatural repetition [29] and exhibit biases like overconfidence, recency bias, majority label bias, and common token bias [83]. Fortunately, GPT-3's performance can be improved by so-called few-shot learning, i.e., engineering the prompt to include examples such as the "whatpu" sentence above [10], and methods are being developed to estimate and counteract the biases without having to train the model again [83].

2.3 Computational User Models

In this paper, we are proposing and investigating the possibility of augmenting real HCI research data with synthetic data generated by a computational user model, which is an active research topic in HCI [57]. In HCI user modeling, there has been a recent uptick in applying AI and machine learning to predict user behavior in contexts like touchscreen typing [30], mid-air interaction gestures [13], and video game play [69]. Language models have also been used for optimizing text entry [34] and personalized web search [80]. However, although models like GPT-3 have been evaluated in natural language question answering [10, 83], the focus has been on factual knowledge and logical reasoning where the correctness of answers can be measured objectively. Here, our focus is instead on how believable the generated texts are in mimicking self-reports of human subjective experiences.

We would like to stress that the actual computations performed by the model are not grounded in cognitive science or neurophysiology. Thus, GPT-3 can only be considered a user model on a purely behavioral and observational level. Many other psychological and HCI models such as Fitts' law [23, 46] or Prospect Theory [31] fall in the same category and do not implement any explicit simulation of the underlying mechanisms of perception, cognition, or motor control. Nevertheless, such models can produce predictions of practical utility.

This paper builds on our previous work-in-progress papers [27, 76]. Our Experiment 2 is based on [27], where we analyzed what kinds of errors GPT-3 makes. We extend the analysis and complement it with our Experiment 1 and Experiment 3. In [76], we investigated whether GPT-3 can produce human-like synthetic data for questionnaires using Likert scales, whereas here we investigate open-ended answers. Concurrent with our work, Park et al. [59] have used GPT-3 to generate synthetic users and conversations for the purposes of prototyping social computing platforms, and Argyle et al. [2] demonstrate that GPT-3 can predict how demographic data affects voting behavior and political question answers. Park et al. argue that although LLMs are unlikely to perfectly predict human behavior, the generated behaviors can be realistic enough for them to be useful for designers. This conclusion aligns well with our motivation for the current study.

3 DATA

This section details the data used in the experiments of this paper.


3.1 Human Data

We compare GPT-3 generations to real human participant responses from a recent study by Bopp et al. [7] regarding art experiences in video games. As a part of the study, Bopp et al. asked the participants to write about a time when they had experienced digital games as art (the question "Please bring to mind...", shown in Section 3.2). The open dataset of Bopp et al. [6] contains 178 responses to this question. We do not filter the responses from the dataset based on any quality metrics, as we want to compare GPT-3-generated data to (raw) data typically received from online studies. We selected the Bopp et al. dataset because of its recency: the data was published after GPT-3 and is therefore not included in GPT-3's training data. This is also why we use the original GPT-3 models in our experiments instead of the variants recently added by OpenAI. Experiencing art is a deep, subjective, and fundamentally human topic, and should thus provide a challenge from an AI user modeling perspective. It also provides contrast to the widely used language model benchmark tasks such as factual question answering.

3.2 GPT-3 Data

Table 1: The three prompts used to 'replicate' the Bopp et al. [6, 7] human data collection. Note that prompts 2 and 3 "continue the interview", that is, the previous prompts and completions were inserted at the beginning of prompts 2 and 3.

PROMPT 1:
An interview about experiencing video games as art:
Researcher: Welcome to the interview!
Participant: Thanks, happy to be here. I will answer your questions as well as I can.
Researcher: Did you ever experience a digital game as art? Think of "art" in any way that makes sense to you.
Participant: Yes
Researcher: Please bring to mind an instance where you experienced a digital game as art. Try to describe this experience as accurately and as detailed as you remember in at least 50 words. Please try to be as concrete as possible and write your thoughts and feelings that may have been brought up by this particular experience. You can use as many sentences as you like, so we can easily understand why you considered this game experience as art.
Participant:

PROMPT 2:
Researcher: What is the title of the game?
Participant:

PROMPT 3:
Researcher: In your opinion, what exactly made you consider this experience as art?
Participant:

The prompts used to generate the GPT-3 data are shown in Table 1. The prompts were formulated as a partial in silico replication of Bopp et al. [7]. They include questions directly from the study ("Did you ever experience...", "Please bring to mind...", "What is the title of the game?", "...what exactly made you consider this experience as art?"), preceded by some additional context. For real human participants, similar context would be provided via experiment/study instructions. Note that our Experiment 1 and Experiment 2 only use the first prompt in Table 1.

Broadly, all three experiments used the same process to generate the GPT-3 data. The general method is described below; small changes to this procedure are noted in the methods section of each experiment. To generate the synthetic data, we used a Python script to interface with the GPT-3 public API. We used a maximum continuation length of 500 tokens and implemented the following heuristics to automatically improve the data quality:

• To avoid generating follow-up questions as part of the response, we only utilized the portion of each response until the first occurrence of the string "Researcher:".
• From the completions, we automatically cut any tokens after the first newline character. That is, we only included the first paragraph of text.
• If the resulting response length was less than 10 words, we discarded it and generated an entirely new one, reapplying the heuristics above.
• We discarded and regenerated a response also if it contained consecutive unique repetitions of over 10 characters.

The default GPT-3 parameters were used: temperature=0.7, top_p=1.0, frequency_penalty=0, presence_penalty=0, best_of=1. For the text-davinci-002 model (currently the most recent GPT-3 variant) used in Experiment 3, we used temperature=1.0 instead of 0.7, as the model does not appear to need the artificial coherence boost given by a lowered temperature. With temperature=1.0, the token sampling probabilities directly correspond to those learned from the training data. A sketch of this generation loop is given below.

4 EXPERIMENT 1: DISTINGUISHING BETWEEN GPT-3 AND REAL DATA

Our first experiment provides a quantitative study of how distinguishable GPT-3 responses are from real human responses. For the usefulness of GPT-3 synthetic data, we consider it necessary (but not sufficient) that GPT-3 responses are not clearly distinguishable from human responses. Although the distinguishability of GPT-3 generated texts from human texts has been studied before [e.g. 10], here we focus specifically on distinguishability for textual research data in the HCI domain.

4.1 Participants and Stimuli

We used Prolific to recruit the participants and the Gorilla experiment builder [1] as a data collection platform. In total, 175 adult participants were recruited from Prolific with the criteria that participants needed to have an approval rate of 100/100 and they needed to be fluent speakers of English. Participants were paid £2.4 via Prolific for attending the study (£7.57/h for an estimated 19-minute completion time). Two Prolific participants were removed from the dataset as they withdrew their consent to use their data. After exclusions (see Section 4.3), the final sample size was 155. 55.48% of the final participants identified as men, 43.23% as women, and 1.3% as other or preferred not to disclose their gender. On a scale from 1 (I barely understand) to 5 (I am a native speaker), 43.23% of the participants rated their ability to read and understand English as 5, 52.9% as 4, and 3.87% as 3. The majority of the participants were under 35 years old (ages 18-25: 56.77%, ages 26-35: 34.84%, ages 36-45: 5.81%, ages 46-55: 2.58%). All participants provided informed consent to participate in the experiment and for sharing the anonymous research data. Before collecting the data, we ran a pilot study in Prolific with 4 participants. These participants were not included in the results reported here.

The stimuli used in this experiment were 50 text passages written by humans and 50 text passages generated by OpenAI's GPT-3 Davinci model. The set of human stimuli was randomly sampled participant responses from the Bopp et al. dataset [7]. A set of 50 GPT-3 completions was generated for this experiment according to the methods in Section 3.2 (PROMPT 1). The average length of the final GPT-3 stimuli was 142.06 words (SD: 107.15, median: 116.0), and of the human stimuli 81.38 words (SD: 60.68, median: 64.0). In total, 7 GPT-3 completions were automatically discarded based on the two criteria stated in Section 3.2.

4.2 Procedure

Each participant evaluated 20 stimuli in total: 10 randomly chosen from the human stimulus set, and 10 randomly chosen from the stimulus set generated with GPT-3. The participants were presented with the text passages one by one, in random order. For each text, their task was to decide whether they thought it more likely that the text in question was written by a human or generated by an AI system. They answered by pressing (with the computer mouse) either a button with the text "Written by a human participant" or one with the text "Generated by Artificial Intelligence". Before they started the task, the participants were informed that half of the text passages they would see were written by humans and half generated by an AI, and that the order of presentation was randomized. The question of Bopp et al. [7] ("Please bring to mind...") was visible in every evaluation, and the participants knew that the human answers were written, and the GPT-3 answers generated, in response to that question. There was no time limit on individual evaluations, but the experiment was discontinued and rejected if it was not completed in 4 hours. After the 20 evaluations, the participants were asked to answer two open questions regarding their decision process: 1) "What made you consider an answer as written by a human?" and 2) "What made you consider an answer as generated by AI?".

Before the experiment, the participants were informed that the AI text passages were generated with a system called GPT-3. However, they were not provided any detailed information about what GPT-3 is, how it works, or any example texts generated with GPT-3. There were also no practice trials that would have shown examples of correct answers. Thus, the participants were kept as naive as possible in terms of the (possible) common differences between human and AI generated texts. There were very few experts in NLP methods in the sample. In the questionnaire before the experiment, we included a question regarding the participants' experience in the subfield of Artificial Intelligence called Natural Language Processing, on a scale from 1 (I have never heard the term before) to 5 (I am an expert). The percentages of participants answering 1, 2, 3, 4, and 5 were 14.19%, 36.13%, 36.13%, 11.61%, and 1.94%, respectively.
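As an illustration of the balanced random assignment described above, one participant's trial list could be assembled as in the sketch below. The study itself was built in the Gorilla experiment builder, so this Python snippet is purely our own illustration of the design, not the platform's configuration.

# Build one participant's trial list: 10 random human-written and 10 random
# GPT-3-generated stimuli, presented in a shuffled order.
import random

def build_trial_list(human_stimuli, gpt3_stimuli, n_per_source=10, seed=None):
    rng = random.Random(seed)
    trials = [("human", s) for s in rng.sample(human_stimuli, n_per_source)]
    trials += [("gpt3", s) for s in rng.sample(gpt3_stimuli, n_per_source)]
    rng.shuffle(trials)
    return trials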

Figure 1: A) The participants' median reaction times (see main text for explanation) to the 20 stimuli, shown as a histogram. Participants with reaction times faster than the exclusion limit were excluded from the final analyses. B) The reaction times plotted from slowest to fastest. The exclusion limit is at the knee point of the curve.

4.3 Data Analysis

Before conducting statistical analyses, we excluded 16 careless and inattentive participants based on their reaction times and the quality of the open question responses. Additionally, two participants were excluded as they reported their English fluency level to be below 3 on a scale from 1 (I barely understand) to 5 (I am a native speaker).

To identify participants who conducted the task implausibly fast, we divided stimulus length (i.e., word count, with word length not normalized) by the reaction time for each trial. As attentive participants should, at least in most cases, read the whole text to make an informed decision, this measure can be considered a lower bound on a participant's reading speed in words read per minute (a lower bound, as it ignores the time it takes to make the decision). Considering the reaction time distribution (see Figure 1) and a meta-analysis of reading rates of adults [11], reaction times of over 664 words per minute were deemed implausible. A similar words-per-minute exclusion criterion has been used recently in reading research [35], and is broadly comparable to a recommendation of flagging participants with reaction times over 600 words per minute in online crowdsourced data [82]. 10 participants whose median reaction time across the 20 evaluations surpassed the 664 words per minute limit were excluded.

As an additional carelessness check, two of the authors evaluated the answers to the open questions. Our criterion for inclusion was that the participants should give at least one reason per question regarding their decision-making process. Additionally, a participant could be excluded if the answers were deemed otherwise nonsensical or shallow, suggesting that the participant had not paid attention to the task. After a first independent categorization pass, participants were excluded if both authors agreed on the exclusion.
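A minimal sketch of the words-per-minute screening rule described above (our own illustration of the arithmetic, not the authors' analysis code):

# Exclude a participant if their median lower-bound reading speed across the
# 20 trials exceeds 664 words per minute.
from statistics import median

WPM_LIMIT = 664

def median_wpm(trials):
    # trials: list of (stimulus_word_count, reaction_time_in_seconds)
    return median(words / (rt_seconds / 60.0) for words, rt_seconds in trials)

def passes_speed_check(trials):
    return median_wpm(trials) <= WPM_LIMIT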


Decisions about participants whom only one of the authors had excluded in the first independent pass were resolved through discussion, reflecting on the criteria. In total, 6 participants were excluded based on the open question responses². The authors agreed in the first pass in 92% of the cases (Cohen's kappa = 0.36). The open question carelessness checks were done before looking at the AI vs. human response data.

Following previous studies that have investigated people's ability to discriminate between real and AI generated stimuli [e.g. 38, 53], we analyzed the data by inspecting the confidence intervals of the recognition accuracies, and with signal detection theory (SDT) methods. From an SDT point of view, the current experiment can be considered a Yes-No experiment, where the participants' ability to distinguish between two categories of stimuli is measured [47]. As our main interest was in how "human-like" the two categories of stimuli are perceived to be, our analysis considered recognizing a human text as human written as a correct hit, and misidentifying a GPT-3 text as human written as a false alarm. This allows us to calculate how sensitive the participants were in distinguishing human texts from GPT-3 texts (with the SDT measure d'), and how much bias (the SDT measure c) they showed in their tendency to report the texts as human written.

The discriminability of the GPT-3 and human texts was investigated with a one-sample t-test, where the participants' d' values were tested against zero. In this context, a d' value of 0 would indicate that the participant could not differentiate between the GPT-3 texts and human texts in terms of how often they are evaluated to be human-produced texts (i.e., that there are as many correct hits as false alarms). Positive d' values result from more hits than false alarms, and negative d' values from more false alarms than hits. Response bias was investigated with the SDT measure c (criterion), where a c value of zero would indicate no response bias. A participant with a liberal decision bias would be more willing to judge a text to be written by a human. For such a participant, the criterion value would be negative, which means that they would have more false alarms than misses. In a like manner, a participant with a conservative decision bias will have fewer false alarms than misses and thus a positive criterion value.

An a priori power analysis indicated that a sample size of 156 has a power of 0.8 to detect a small effect (d=0.2) in a one-tailed t-test. We based our power analysis on a one-tailed test, as our prediction for the main analysis of interest was that the human written texts would be categorized as human written more often than the GPT-3 generated texts. However, as the main effect of interest was unexpectedly in the other direction, we report here the t-test results with a two-tailed alternative hypothesis.
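The d' and criterion computations described above can be sketched as follows. This is our own illustration using the standard equal-variance SDT formulas, d' = z(H) − z(F) and c = −(z(H) + z(F)) / 2; the correction applied to extreme hit/false-alarm rates is an assumption, since the paper does not state which one was used.

# Per-participant SDT measures: a "hit" is a human text judged human,
# a "false alarm" is a GPT-3 text judged human.
from scipy.stats import norm

def dprime_and_criterion(hits, misses, false_alarms, correct_rejections):
    # Add 0.5 to each cell (an assumed correction) so rates of 0 or 1
    # do not produce infinite z-scores.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z_h, z_f = norm.ppf(hit_rate), norm.ppf(fa_rate)
    return z_h - z_f, -(z_h + z_f) / 2.0

# Example: a participant who judged 6/10 human texts and 7/10 GPT-3 texts as human
# (more false alarms than hits) gets a negative d' and a negative (liberal) criterion.
d, c = dprime_and_criterion(hits=6, misses=4, false_alarms=7, correct_rejections=3)
print(round(d, 2), round(c, 2))  # roughly -0.24 -0.35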

4.4 Results

On aggregate, human written texts were correctly recognized 54.45% of the time, with the 95% confidence interval excluding the chance level of 50% (95% CI: 51.97%-56.93%). The average accuracy of recognizing GPT-3 generated texts as AI-written was below chance level, at 40.45% (95% CI: 38.01%-42.89%).

² In total, 7 were categorized as careless based on the open question answers. However, one of these participants was also excluded based on the response speed. Categorizations and open answers are provided in the supplementary data.


Figure 2: The figure shows the average proportion across participants of responses that categorized each stimulus (dots) as human written. The boxplot shows the median and the first and third quartiles. The red line connects the two group means. GPT-3 generated stimuli were rated to have been written by a human more often than the human written stimuli.

Table 2: The cross-tabulation shows how many times the different responses were given to the two stimulus categories in Experiment 1.

Response | GPT-3 Texts | Human Texts
Generated by Artificial Intelligence | 627 | 706
Written by a human participant | 923 | 844
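As a quick arithmetic check (ours, not the paper's), the aggregate accuracies reported in the text follow directly from the counts in Table 2:

gpt3_labeled_ai, gpt3_labeled_human = 627, 923
human_labeled_ai, human_labeled_human = 706, 844

total_per_category = 1550  # 155 participants x 10 stimuli per category
print(human_labeled_human / total_per_category)           # ~0.5445 -> 54.45% correct on human texts
print(gpt3_labeled_ai / total_per_category)               # ~0.4045 -> 40.45% correct on GPT-3 texts
print((gpt3_labeled_human + human_labeled_human) / 3100)  # ~0.57   -> 57% "written by a human" overall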

Thus, participants showed a bias towards answering that the texts were written by a human, as 57% of all responses were "Written by a human participant" (see Table 2). The average participant bias was c = -0.2, with participant bias values differing significantly from zero in a one-sample t-test (t(154) = -7.74, p < 0.001).

Against our expectations, GPT-3 texts were deemed more human-like based on the d' values. The one-sample t-test testing d' values against zero was statistically significant with a small effect size (t(154) = -2.52, p = 0.013, d = -0.2). The average d' value was negative (d' = -0.15); that is, the participants were more likely to respond with false alarms (i.e., GPT-3 texts judged as written by humans) than with correct hits (i.e., human texts judged as written by humans). This tendency can also be seen visually in Figure 2, where the average proportion of "Written by a human participant" responses is plotted for each of the 100 stimuli.

Exploratory analyses of the open question answers suggest that a frequent criterion for determining whether a text was written by a human was whether the text included descriptions of emotional experiences. Although we did not conduct a thorough classification of the open question answers into different categories, the importance of emotion can be seen, for example, from word frequencies. In total, 54.19% of the responses to the question "What made you consider an answer as written by a human?" contained either the string 'emotion' or the string 'feeling'. Also, the word stem emot was the second most frequent word stem in the responses to the same question, only behind the word stem human (see Figure 3).


Figure 3: Top 10 most frequent word stems from the responses to the two open questions after discarding stop words (179 stop words from the Python NLTK corpus).
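A minimal sketch of the word-stem frequency analysis behind Figure 3 (our own illustration; the paper specifies the NLTK stop word list but not the tokenizer or stemmer, so wordpunct_tokenize and the Porter stemmer below are assumptions):

# Count word stems in the open-question answers after removing NLTK stop words.
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def top_stems(answers, n=10):
    counts = Counter()
    for answer in answers:
        for word in wordpunct_tokenize(answer.lower()):
            if word.isalpha() and word not in STOP_WORDS:
                counts[STEMMER.stem(word)] += 1
    return counts.most_common(n)

answers = [
    "It felt human because of the emotions and feelings described.",
    "The emotional detail made it read like a human wrote it.",
]
print(top_stems(answers))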

5 EXPERIMENT 2: WHAT KINDS OF ERRORS DOES GPT-3 MAKE?

It is clear that although the best-case GPT-3 responses seem very human-like, not all generations are of high quality. To better understand the limitations, we conducted a qualitative investigation of the synthetic data. We generated two sets of 100 responses and investigated the types of errors GPT-3 makes. The participants of Experiment 1 already reflected on what made them rate a response as generated by AI or written by a real human, but this provides limited information due to 1) the participants being inexperienced with AI generated text, and 2) the participants providing the reflection in hindsight, after rating all the responses. Complementing this, the following identifies common failure modes in the synthetic data and reflects on which failure modes could be automatically recognized and eliminated.

5.1 Methods

We used two versions of the PROMPT 1 described in Table 1, one ending simply with "Participant:" and the other with "Participant: I'm thinking of the game". The motivation for this was to check whether providing the extra guidance would improve response quality. Generally, more specific prompts tend to produce higher quality results [8, 10, 83]. 100 responses were generated for both prompt versions.

The responses were categorized into valid or invalid by three annotators (the authors). A response was regarded as invalid if it exhibited some clear anomaly, e.g., the model generating an answer to a different question. We disregarded grammar and fluency issues that could be considered natural variation in a diverse sample of real human participants. Initial categorization was performed by two annotators in two passes. In the first pass, the annotators independently carried out the categorization and identified distinct types of anomalies. The types of anomalies were then discussed and merged into a codebook that was used in a second categorization/refinement pass. Finally, the responses were classified into valid or invalid by a third annotator and the anomalies were categorized, using the codebook as a guide. It should be noted that the annotations are inherently subjective, and the annotators were not fully blind to the data of other annotators. They should nevertheless provide useful concrete examples of the kinds of errors that GPT-3 makes in our context, complementing previous analyses of LLM limitations.

5.2 Results

We identified 8 distinct types of anomalies plus an "other" category, examples of which are given in Table 4. For the default prompt, the three annotators considered 54%, 62%, and 76% of the responses as valid (mean 64%). For the more specific prompt ending with "I'm thinking of the game", 64%, 79%, and 94% were considered valid (mean 79%). The difference in the amounts of invalid responses is inconclusive due to our limited data, but it is in line with existing research highlighting the importance of prompt design [8, 10, 83]. More interestingly, GPT-3 provided us a lesson on how prompt design can fail due to the model coming up with unexpected yet valid ways to continue the text: we expected the prompt ending with "I'm thinking of the game" to encourage the continuations to start with a game name, but some continuations avoided that by stating, e.g., "that I played."

Perhaps the most peculiar are the generations that are well-written and coherent, but describe imaginary gameplay. An example about Journey is included in Table 4. Journey is a widely acclaimed game that was also mentioned by several of the human participants of Bopp et al. [7], and the description is superficially plausible, but there is no baby to take care of in the real game. Some of the anomalies, such as dodging the question, are easy to spot by a human reviewer, and it is likely that one could use some automatic approach to detect and regenerate such answers. This might be possible even by GPT-3 itself, if prompted with few-shot examples of valid and invalid answers. However, there are also cases where making the valid/invalid distinction is hard. The imaginary gameplay descriptions are perhaps the foremost example, as recognizing them may require in-depth knowledge of the discussed games.

6 EXPERIMENT 3: DIFFERENCES BETWEEN REAL AND GPT-3 CONTENT

In Experiment 1, our participants found GPT-3 responses surprisingly human-like, on average. However, for synthetic data to be useful in answering real research questions or piloting a real experiment's data analysis pipeline, it needs to be not only superficially human-like but also reflect real properties of user behavior and experiences. Therefore, we now investigate the data through computational analysis of the content. We compare the real and synthetic distributions of the following aspects of the data, which a researcher might investigate to answer specific research questions:


Table 3: Examples of GPT-3 and human stimuli from Experiment 1. For each stimulus, we show the percentage of "Written by a human participant" responses, followed by the rank of the stimulus when all stimuli are sorted based on the percentages. All stimuli are provided as supplementary material.

GPT-3 generated stimuli:

89.19% (1/100): I think the first time was when I was in high school and played "Journey". It was a small game and only had a few hours of gameplay but it was one of the most moving and beautiful games I had ever played. It wasn't about the story or the gameplay, it was about the experience. I remember walking around this massive desert looking for other players and talking to them. I remember being surprised that we all just happened to be in the same place at the same time and I remember how much I enjoyed the music. I remember that I couldn't wait to get home from school to play it and it was the first game I played where I felt like I was actually embodying the character. I thought about the game all day long and the experience stuck with me for a long time. That's what I would call experiencing game as art.

88.24% (2/100): I don't know if I can describe my experience in 50 words. I would have to write a novel. I will try to be brief. I experienced a digital game as art when I played The Last of Us in 2014. I played the game in the same way I would watch a movie. I sat in the dark, turned off all lights, and put on headphones to enhance the experience. I felt like I was in the game, like I was the protagonist. I felt his pain, his fear, and his struggles. I knew what he was going to do next, and I felt like he was a real person. I cried when he cried, I smiled when he smiled. I felt like the developers of that game did such a good job that I felt like I was not just playing the game, but that I was living the game. It was a beautiful experience.

88.0% (3/100): Well, I've experienced a game as art twice. The first time was when I started playing Myst. And I was amazed at how the environment was all 3D, yet, the game was in 2D. I remember the first time I was on an island, and the sun was shining, and it was just a beautiful sight to see. It was like I was there. I thought to myself how amazing it was that I was actually looking at a computer generated world, and it was beautiful. The second time was when I found a game called The Path by Tale of Tales. I was looking for games on Steam, and I found The Path. I was intrigued by the game, and downloaded it. I started playing it, and I was fascinated by the images and the music. They were like paintings, and the music was haunting, yet uplifting at the same time. I even wrote a review for it on Steam, and I gave it a positive review. I've never given a positive review for any games, but I felt that it was something different, something unique and I wanted other people to experience what I experienced.

21.88% (99/100): I would like to start with a definition of art. I think it's a set of aesthetic criteria that are as follows:

21.62% (100/100): I have been playing video games since the early 90s and I have played a lot of games.

Human stimuli (from Bopp et al. [7]):

81.25% (7/100): The first Bioshock was clearly art for me, I'd get into new areas and just wander around looking at things for as long as it took to see everything. Same thing with Skyrim, I have probably 40 hours of just wandering around following foxes to see where they'd go and admiring the view. A few side scrolling games were clearly designed by artists as well, a great example is Valiant Hearts which uses a sort of card based art to suggest three dimensions without actually modeling them.

79.31% (9/100): I became heavily invested in the story and atmosphere of Red Dead Redemption 2, to the extent that I engaged in as many quests as possible to experience as much character interaction as possible. I would also spend hours simply wandering the countryside to take in the beautiful scenery. When the story ended, I was heartbroken at the conclusion and it affected me deeply. I cried through the credits, and I felt like I had lost a close friend. Then, the epilogue began and I felt like it was too soon; I felt like I needed more time to mourn the ending of the game. As I played through the epilogue, I continued to feel sad for the loss of the main character, even as I was happy to see the evolution of the other characters.

75.76% (13/100): Just last night I was playing an online shooter and got absolutely pissed at my own poor performance.

75.68% (15/100): The game was "This War of Mine" the whole game was, to me, an artistic experience. There are too many instances to recall one in particular, but I do know that the instances that made me think "wow this is art" were the most emotional events of the game - the plot twists in the Stories mode, the difficult decisions one has to make, the moments where the gameplay took me off-guard.

25.0% (97/100): Themed, expressive worlds, exit stage left, GIANT monsters, raccoons that fly, bears that turn to stone, music that inspires. I was young still, but up to this point games were a very specific thing; this one is Mario, he steps on turtles and saves the princess, this one is Contra, they're soldiers fighting off aliens, this one is Metroid, a space soldier fighting aliens. But this one is art, you're not just moving from one level to the next, you're on the map, and every world is new and unique. Ghost houses, the sun has a face and it's mad at you, it attacks you, the every day enemies are now huge, you go to the sky and there is a castle in the clouds, a world of pipes, and a world of lava. The creativity of it all mixed into one game.

22.22% (98/100): An RPG I am playing features digital art cut scenes, a plot that could be considered a novel, and a beautifully depicted digital world.


Table 4: Examples of the 8 anomaly categories identified in GPT-3 generations from Experiment 2. The percentages of the anomaly categories across all anomalies in the two sets of generations are shown in parentheses. For brevity, long generations have been truncated.

Clearly too long or rambling incoherently (27.5%): I'm thinking of the game Journey, what a raw, emotional, and beautiful game. It's hard to explain why exactly it hit me like that, but I can try. The game is about the feeling of being disconnected from the world, and feeling like you're not part of the world. The environment is so surreal and alien, and there's no way to communicate with the other people in the game. The music made me feel the same way as I feel when I'm completely alone, and I felt like I was being brought back to the time when I was little, before I had any friends, and I didn't know what was going on with the world. I was a blank slate, and Journey made me feel that way again. I was just another person in the crowd, just like I was when I was little. I was being brought back to the sense of confusion and helplessness I felt when I was little, but at the same time I was being taken out of the world. The game made me feel like I was being taken away from the world that I know, and I felt reconnected to the world that I didn't know...

Not answering the question or dodging the question (20.5%): I can't think of a specific game, but I can tell you about a couple of experiences that I had.

Incomplete: answer ends before getting to the point (15.2%): I would start by saying that I am currently a student in the games and narrative design (gamedesign) program in the third year. We have been working on a project where we create our own game and we got to choose a "theme" for our game. I chose art as my theme.

Other (11.7%): I'm thinking of the game The Path (gameplay video), it's a very simple game. You are in the woods, you need to find The Mother, and to do that you need to go through the woods. There are paths, and the paths are blocked by barbed wire. You can jump over it or you can go around it. If you jump over it you get a point. And you get a point if you find The Mother. But you can't just find The Mother, you have to find a key in order to get in her house.

Contradicts the positive answer in the prompt by claiming not to have experienced a game as art (8.2%): Well I have to be honest, I have never experienced a game as "art" in the conventional sense. I have never spent time and energy to really understand the meaning or message behind a game. I have always enjoyed games, but I have never tried to dissect them. I guess I should give an example.

Imaginary game or gameplay (7.6%): I'm thinking of the game I played most recently, which was Journey. It was a very emotional experience for me, and I was moved by it. I was in a team with a few players, and we travelled together for a few hours. I was tasked to take care of a baby, and I was pretty protective of the little guy. When the baby grew up, we were really close.

Talks about art more generally, not about a game (4.1%): I would describe an art experience as something that's grandiose or that captures a lot of emotion. It's something that you feel a lot, something that brings out a lot of emotion. It's something that you just feel like you need to think about and you need to experience. It's something that you can't really explain. It's something that you just feel, and you need to feel...

Sentences that clearly contradict each other (3.5%): I'm not sure how to answer this. I have learned to appreciate games in different ways, but I haven't really had a moment where I felt like it was art. When I played Limbo years ago, I remember feeling like it was art. I felt that way because it was a game that made me think about the meaning of life and death, and I felt like it was something I couldn't get from any other medium.

Names a film or other type of media instead of a game (1.8%): I'm thinking of the game 'Koyaanisqatsi: Life out of Balance' (1982), directed by Godfrey Reggio. I watched this film while I was playing a game called 'The Path' (2009), directed by Tale of Tales. Both of these experiences were in the context of an art exhibition and the 'Koyaanisqatsi' film was shown in a dark room with a big screen...

• The games mentioned (RQ: "What games do players experience as art?")
• Reasons given for experiencing a game as art (RQ: "What makes players consider a game as art?")

6.1 Methods

For this experiment, we continued the synthetic interview of Experiments 1 and 2 with follow-up questions that allowed us to investigate more deeply the similarities between human and GPT-3 generated data. In this experiment, we used all three prompts shown in Table 1. In the first step of data generation, we generated descriptions of art experiences as in the previous experiments (PROMPT 1, Table 1). These responses were included in the next prompt that "continued" the interview with the question "What is the title of the game?" (PROMPT 2, Table 1). These answers were also appended to the next prompt, which further asked "In your opinion, what exactly made you consider this experience as art?" (PROMPT 3, Table 1). Thus, the questions ending prompts 2 and 3 were kept the same for all generations, but the individual prompts varied based on the previous GPT-3 completions. We generated 178 "full interviews" (i.e., 178 responses to each of the three prompts) to match the number of human responses in the Bopp et al. [6] dataset. To allow inspecting how model size and type affect the results, the set of 178 responses was created using five different GPT-3 variants: ada, babbage, curie, davinci, and text-davinci-002. In this experiment, we allowed each response to include three paragraphs of text, except for the question regarding the game titles, where the response was cut after the first newline as in the previous experiments. As the prompt regarding game titles was expected to result in shorter continuations, the maximum continuation length for this prompt was set to 50 tokens.
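As a rough illustration of this chaining, the sketch below generates one synthetic "interview" by feeding each previous completion back into the next prompt. It assumes the legacy OpenAI completions API; the PROMPT_1 placeholder and the "Researcher:"/"Participant:" speaker labels stand in for the actual PROMPT 1-3 texts of Table 1, and the sampling parameters are illustrative rather than the exact study settings.

```python
import openai

PROMPT_1 = "..."  # placeholder for PROMPT 1 of Table 1
QUESTION_TITLE = "Researcher: What is the title of the game?\nParticipant:"
QUESTION_WHY = ("Researcher: In your opinion, what exactly made you consider "
                "this experience as art?\nParticipant:")

def complete(prompt, model="davinci", max_tokens=400, stop=None):
    out = openai.Completion.create(engine=model, prompt=prompt,
                                   max_tokens=max_tokens, temperature=1.0,
                                   stop=stop)
    return out.choices[0].text.strip()

def synthetic_interview(model="davinci"):
    # PROMPT 1: experience description (truncation to three paragraphs
    # would be handled in post-processing)
    experience = complete(PROMPT_1, model)
    # PROMPT 2: game title, cut at the first newline, at most 50 tokens
    prompt2 = PROMPT_1 + " " + experience + "\n" + QUESTION_TITLE
    title = complete(prompt2, model, max_tokens=50, stop=["\n"])
    # PROMPT 3: "Why art?", again including all previous answers
    prompt3 = prompt2 + " " + title + "\n" + QUESTION_WHY
    why_art = complete(prompt3, model)
    return {"experience": experience, "title": title, "why_art": why_art}

interviews = [synthetic_interview() for _ in range(178)]
```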


Automatic Qualitative Coding. Our analysis of the "Why art?" answers (i.e., completions to PROMPT 3) is based on the observation that GPT-3 can be prompted to perform a form of qualitative inductive coding of the data, using the prompt given in Table 5. The codes provide compact descriptors of the stated reasons and allow flexible further analysis, such as grouping into broader topics and counting the topic frequencies. We perform the following steps:

(1) Code the answers using the prompt in Table 5. We used a Python script to insert each answer at the end of the prompt and extract the codes, separated by semicolons, from the GPT-3 continuations. To make the coding as unbiased as possible, the prompt in Table 5 is designed to require no deeper interpretation of the coded texts. Instead, we simply extract compact descriptions of the given reasons. For example, if the answer is "The questions it raised and the highly emotional connection that emerged between me and the game", the codes are "raising questions" and "emotional connection".

(2) Compute semantic embedding vectors of the codes. Semantic embedding maps a word or a piece of text t to a vector v_t ∈ R^D, such that the distance between vectors for similar concepts or texts is small. D depends on the embedding implementation. We use the embeddings of the text-curie-001 GPT-3 model, with D = 4096.

(3) Reduce the dimensionality of the embedding vectors using Uniform Manifold Approximation and Projection (UMAP) [4, 50]. This allows efficient visualization and clustering of the embedding vectors.

(4) Cluster the dimensionality-reduced embedding vectors using HDBSCAN [49], a variant of the popular DBSCAN algorithm [22] that automatically selects the epsilon parameter. This allows combining similar codes into larger groups or topics. To obtain a concise human-readable name for a group, we list the most representative codes of the group. Here, a code's representativeness is measured as the cosine distance between the code embedding and the average embedding of all the codes in the group.

(5) Count the frequencies of the code groups/clusters (i.e., the percentage of answers that were assigned at least one code from the group). This allows comparing topic prevalence between human and GPT-3 data. The group frequencies are more robust than individual code frequencies, as there can be two codes representing the same reason, just phrased slightly differently.

An example of the coding and grouping results is shown in Figure 4. The figure highlights the 5 highest-frequency real (i.e., human data) code groups and their closest GPT-3 counterparts, measured by the cosine distance of the normalized mean embedding vectors of the groups. Note that although the grouping was done independently for both datasets, the joint visualization required running the dimensionality reduction again for the joint data, which may cause some grouped codes to be located far from others.
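The sketch below shows how steps 1-5 could be wired together. It is a minimal sketch assuming the legacy OpenAI SDK plus the umap-learn and hdbscan packages; CODING_PROMPT stands for the few-shot prompt of Table 5, and the embedding engine name, the UMAP output dimensionality, and the HDBSCAN cluster-size parameter are assumptions rather than the exact values we used.

```python
import collections
import numpy as np
import openai
import umap
import hdbscan

CODING_PROMPT = "..."  # placeholder for the few-shot coding prompt of Table 5

def code_answer(answer, model="davinci"):
    """Step 1: extract semicolon-separated codes for one answer."""
    out = openai.Completion.create(
        engine=model,
        prompt=CODING_PROMPT + answer + "\nCodes:",
        max_tokens=40, temperature=0.0, stop=["###", "\n"],
    )
    return [c.strip() for c in out.choices[0].text.split(";") if c.strip()]

def embed(texts):
    """Step 2: semantic embeddings (engine name is an assumption)."""
    resp = openai.Embedding.create(engine="text-similarity-curie-001", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

def group_codes(answers):
    codes_per_answer = [code_answer(a) for a in answers]
    codes = [c for cs in codes_per_answer for c in cs]
    vectors = embed(codes)
    reduced = umap.UMAP(n_components=5).fit_transform(vectors)        # step 3
    labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(reduced)  # step 4
    # Step 5: fraction of answers with at least one code in each group
    # (the HDBSCAN noise label -1 is ignored).
    code_to_group = dict(zip(codes, labels))
    counts = collections.Counter()
    for cs in codes_per_answer:
        for g in {code_to_group[c] for c in cs if code_to_group[c] != -1}:
            counts[g] += 1
    return {g: n / len(answers) for g, n in counts.items()}
```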


The full coded datasets and the Python source code are included in the supplementary material. Note that although automatic coding using GPT-3 is obviously more limited than manual coding by an experienced researcher, the benefit is that the exact same biases are applied to all compared datasets, allowing a more reliable comparison. The embedding, dimensionality reduction, and clustering steps are the same as in the BERTopic topic mining approach [26]. We added the coding step because applying the embedding and clustering to the raw text data produces very noisy results, in part due to many answers listing multiple reasons, which confuses the embedding process. The coding distills the essence of the answers, reducing the noise, and naturally handles the multiplicity of reasons. Our automated two-level coding approach is analogous to stages 2 and 3 of qualitative thematic analysis as described by Braun and Clarke [9], i.e., coding and then combining the codes into themes. In the first stage, one familiarizes oneself with the data and notes down initial ideas, which in our case corresponds to crafting the coding prompt. However, although our code groups could be considered as "themes", a full thematic analysis would go further into interpreting the themes and reporting the results with illustrative quotes. For the sake of objectivity, we avoid such interpretation and only look at differences in group prevalence between datasets.

Data Quality Metrics. Using the code embedding vectors, we compute two standard metrics for generative model data. First, we compute Fréchet distances between the distributions of human and GPT-3 code embedding vectors, reduced to 5 dimensions using UMAP. The Fréchet distance is a commonly used metric in benchmarking image generators [28] and has later also been applied to text embeddings [73]. Second, we compute precision and recall metrics using the 5-dimensional code embeddings and the procedure of [37]. Intuitively, precision measures how large a portion of the generated data samples lie close to real data, and recall measures how large a portion of the real data is covered by the generated data. An ideal generator has both high precision and high recall. The metrics are visualized in Figure 5. For a more reliable comparison, Figure 5 also includes additional results based on coding the game experience descriptions in addition to the "Why art?" answers. This additional coding prompt is included in the supplementary material.

Topic Similarities and Differences. We independently code and group the compared datasets, and sort the code groups based on their frequencies. We then use a circular graph (Figure 7) to visualize the sorted groups and the connections between datasets. The visualized connection strengths correspond to the cosine similarity of the full-dimensional mean normalized embedding vectors of the groups. We only included the davinci GPT-3 variant in this analysis, as it was the most human-like model based on the data quality metrics above.
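For reference, the snippet below computes the Fréchet distance described under Data Quality Metrics above, using the standard formula of [28] on two sets of 5-dimensional reduced code embeddings. It is a minimal sketch; the precision and recall computation of [37], which additionally requires a k-nearest-neighbor manifold estimate, is omitted for brevity.

```python
import numpy as np
from scipy import linalg

def frechet_distance(x_real, x_gen):
    """Fréchet distance between Gaussians fit to two embedding sets.

    x_real, x_gen: arrays of shape (n_samples, 5) of reduced embeddings.
    """
    mu_r, mu_g = x_real.mean(axis=0), x_gen.mean(axis=0)
    cov_r = np.cov(x_real, rowvar=False)
    cov_g = np.cov(x_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```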


Table 5: The prompt used for automatic qualitative coding. Manually coded few-shot examples are separated by "###". This prompt guides GPT-3 to summarize the essential information in the answers (it will produce a wide range of codes, not only repeat the ones shown in the few-shot examples). To minimize LLM repetition bias, the example answers were selected randomly from the real human dataset, while avoiding answers that result in the same codes. For brevity, only 3 out of the total 10 few-shot examples are shown here. The full prompt can be found in the supplementary material.

The following presents a qualitative coding of answers from a video game research study. The answers explain why a participant experienced a game as art. The codes summarize the given reasons as compactly as possible. If an answer lists multiple reasons, the corresponding codes are separated by semicolons.
###
Answer: The questions it raised and the highly emotional connection that emerged between me and the game, the experience.
Codes: raising questions; emotional connection
###
Answer: For a game experience to feel like a work of art to me, it would usually be an immersive experience that creates a real emotional response. Since games accomplish this through a combination of illustration, animation, sound, music, storytelling elements all together, I would consider these types of experiences art.
Codes: immersive experience; emotional response
###
Answer: The fact that each asset was hand drawn in such a unique style.
Codes: unique visual style
###
Answer:

Answer Consistency. It is important that the separately queried answers continue an interview or a survey in a consistent manner. Our prompts are designed for this, as the previously generated answers by the same synthetic "participant" are included in the prompt for the next answer. Importantly, the "Why art?" prompt (PROMPT 3, Table 1) always included a previously generated art experience description (generated using PROMPT 1, Table 1). Previously, we have used GPT-3 to generate synthetic Likert-scale data for a psychological questionnaire (PANAS), by generating completions to questionnaire items one by one, always including the previous answers in the prompt for the next item generation [76]. The factorial structure that emerged from data generated this way was similar to human data, which suggests that GPT-3 can coherently take into account the previous answers included in a prompt. Our data consists of open-ended answers, which does not allow factor analysis. Instead, we measured consistency using text embeddings computed with the text-curie-001 model. Because both PROMPT 1 and PROMPT 3 probe different aspects of the same experience, consistently generated answers should exhibit at least some similarity, which we measured using the cosine similarity of answer embedding vectors. Moreover, PROMPT 1 and PROMPT 3 responses should be more similar when taken from a single participant (intra-participant similarity) than from two randomly chosen participants (inter-participant similarity). To investigate this, we computed and visualized both intra-participant and inter-participant similarities (Figure 6). When computing the means and standard errors, we used the intra-participant similarities of all 178 participants and the inter-participant similarities of 178 randomly shuffled (permuted) pairs of participants. For more reliable results, the random permutation of participant pairs was repeated 5000 times, and the means and standard errors in Figure 6 are averaged over these 5000 permutations.
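A minimal sketch of this intra- vs. inter-participant comparison is shown below, assuming the PROMPT 1 and PROMPT 3 answer embeddings are available as NumPy arrays with one row per participant. As a simplification, the random pairings here may occasionally pair a participant with themselves, which the actual analysis could exclude.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarities between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def consistency_summary(emb_p1, emb_p3, n_permutations=5000, seed=0):
    """Mean intra- vs. inter-participant similarity for one data source.

    emb_p1, emb_p3: (n_participants, d) embeddings of the PROMPT 1 and
    PROMPT 3 answers; row i of both arrays belongs to the same participant.
    """
    rng = np.random.default_rng(seed)
    sims = cosine_matrix(emb_p1, emb_p3)
    n = sims.shape[0]
    intra = np.diag(sims).mean()                      # same participant
    inter = np.mean([
        sims[np.arange(n), rng.permutation(n)].mean()  # random pairings
        for _ in range(n_permutations)
    ])
    return intra, inter
```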

Game Frequencies. Finally, we also counted the frequency of each mentioned game in the human and GPT-3 data. For the human data, we included responses from the same 178 participants that were included in the topic analyses. For the GPT-3 data, game frequencies were counted from the 178 completions queried with prompts ending with the question "What is the title of the game?". The frequencies were counted manually from the data. If a response mentioned two or more games, these were counted as separate mentions. Small differences in wording were disregarded when it was clear that responses referred to the same game (for example, "breath of the wild" was categorized as the same answer as The Legend of Zelda: Breath of the Wild). If a response did not include a specific game title, or we could not match the title to a published game, the response was ignored. For brevity, we only report results from the davinci and text-davinci-002 variants. Note that the game frequency analyses do not use the automatic coding step described above.

6.2 Results

The results of this experiment can be summarized as follows:

• Highly similar groups/topics emerge from both real and GPT-3 data. Figure 7 shows how many of the most frequent groups in the human data correspond to groups that are also amongst the most frequent in the GPT-3 data, such as groups relating to aspects of story (most frequent in both human and GPT-3 data) and music (2nd most frequent in both human and GPT-3 data). The visualization of code embedding vectors in Figure 4 also indicates that coding both datasets results in largely similar codes.

• Table 6 (most frequently mentioned games) shows that real and GPT-3 data discuss some of the same games, like Journey, Bioshock, and Shadow of the Colossus. However, many games in the human data are missing from the GPT-3 generated data, suggesting that LLM-generated synthetic data may have less diversity than real data. Only 17.3% of the games in the human data are mentioned in the GPT-3 davinci data (see supplementary material for details).

• Larger GPT-3 variants yield more human-like data (Figure 5). OpenAI does not disclose the exact sizes of the GPT-3 models available through its API, but the ada, babbage, curie, and davinci models have been inferred to correspond to the continuum of increasingly larger models evaluated in the original paper [24]. The ordering of the models also corresponds to increasing text generation cost, supporting the conclusion that ada is the smallest model and davinci is the largest one.


Figure 4: A scatterplot of 2D dimensionality-reduced code embeddings of both real and GPT-3 data. The colored markers show the real code groups with highest frequencies, and their closest GPT-3 equivalents. The visualization demonstrates how similar codes are located close to each other, and that similar codes and groups emerged from both datasets, i.e., there are no large clusters with only real or only GPT-3 data.

Figure 5: Fréchet embedding distances (smaller is more human-like) and precision & recall metrics (larger is more human-like) for different GPT-3 variants. Overall, human-likeness grows with model size from ada to davinci. Curiously, text-davinci-002, the latest GPT-3 variant, shows improved precision but lower recall, i.e., the generated data is of high quality but has less diversity than real data or the older davinci variant.

• The newest text-davinci-002 model has low recall and clearly less diversity than real data. This is evident in the lists of games mentioned, where 151 out of 178 answers discuss Journey (only 7 mentions in the real data). Although OpenAI recommends this model as the default, our data suggests that it should be avoided for user modeling purposes, at least when one cares about data diversity.


• As visualized in Figure 6, both the human data and all GPT-3 variants exhibit higher intra-participant than inter-participant answer similarity, indicating at least some degree of consistency in answering consecutive questions. For all of the data sources (ada, babbage, curie, davinci, text-davinci-002, human data), in all of the 5000 different inter-participant permutations, the mean inter-participant similarity was lower than the true mean intra-participant similarity. The overall slightly higher-than-human GPT-3 similarities, and the notably higher text-davinci-002 similarities, probably reflect the data diversity issues noted above.

Based on the above, one can conclude that investigating synthetic data can provide plausible answers to real research questions, with some important caveats. First, real data can have more diversity. This is especially true when using the later text-davinci-002 model. The diversity problems of this model can be explained through its training procedure: the model is based on the InstructGPT series, which has been fine-tuned based on user feedback [58]. While this procedure does improve the average sample quality, it does not encourage diversity. Second, synthetic data can also provide some misleading results; for example, many of the lower-frequency groups in the synthetic data do not have a directly corresponding group in the human data. Some of these groups are hard to interpret (e.g., "unknown, unquantifiable, unfn.."), whereas others might reasonably be expected to describe art experiences (e.g., "connection to other players"). When comparing the datasets, it should also be noted that the sample of Bopp et al. [7] was not representative; thus, the human dataset might miss some themes that would arise in a more comprehensive sample.

7 DISCUSSION

In our three experiments, we have investigated the quality of GPT-3 answers to open-ended questions about experiencing video games as art. We can now summarize the answers to the research questions we posed in the introduction (for more details, see the results sections of each experiment).

Can one distinguish between GPT-3 generated synthetic question answers and real human answers? Our Experiment 1 suggests that GPT-3 can be capable of generating human-like answers to questions regarding subjective experiences with interactive technology, at least in our specific context. Surprisingly, our participants even responded "Written by a human participant" slightly more often for GPT-3 generated texts than for actual human-written texts.

What kinds of errors does GPT-3 make? In Experiment 2, we identified multiple common failure modes. Some errors, such as the model dodging a question, could possibly be detected and the answers regenerated automatically, using a text classifier model or GPT-3 itself with a few-shot classification prompt. A particularly difficult error category is factual errors that cannot be detected based on superficial qualities of the generated text, but instead require domain knowledge about the discussed topics.


Can synthetic data provide plausible answers to real HCI research questions? What similarities and differences are there in GPT-3 and real data? Experiment 3 indicates that similar topics are discussed in both datasets, and that synthetic data can reveal plausible answers for research questions like "Why are games experienced as art?" and "What games do people experience as art?". However, although GPT-3 correctly discusses some of the same games as real participants, the GPT-3 data exhibits considerably less diversity (e.g., the "Journey bias" in our case). GPT-3 also discusses some topics not found in the real data, although some differences would be expected even between two sets of real human data, given the non-representative sample of Bopp et al. [6, 7].

Taken together, we find the results promising and intriguing, considering that even more capable models than GPT-3 have already appeared [e.g. 14]. LLM scaling laws predict that their performance will improve with new and even larger models [32, 66, 78], and the quality metrics of our Experiment 3 also indicate that larger scale yields more human-like data.

7.1 Use Cases for Synthetic Data

Regarding the possible uses for synthetic data, it is important to consider the trade-off between data quality, latency, and cost. GPT-3-generated data is of lower quality than real data—at least if disregarding the problems of online crowdsourcing such as bots and careless, insincere, or humorous responses—but GPT-3 also has very low latency and cost. The crucial question then becomes: When can low cost and latency offset issues with quality?

We believe synthetic data can be useful in initial pilot research or experiment design, where one explores possible research ideas or hypotheses, or what people might say or write, before investing in real participant recruitment and data collection. The same should apply both to academic research and to designers trying to understand their users. In such work, LLMs offer an alternative to other exploration tools such as web searches.

In comparison to web searches, LLMs have two primary benefits. First, they can directly provide data in the same format as an actual study. This allows using the synthetic data for pilot-testing and debugging of data analysis and visualization pipelines. Also, when pilot testing, seeing the data in the right format can arguably help the researcher explore the space of possibilities for the design of the actual study. Such exploration benefits from the low latency and cost of synthetic data collection, especially if combined with automatic data analysis similar to our Experiment 3. For example, reading the synthetic answers and inspecting the emerging codes and themes may give ideas for further questions to ask in an interview.

The second benefit over web searches is that LLMs can generalize to new tasks and data, as reviewed in Section 2.1. This suggests that LLMs may, at least in some cases, generate answers to questions that are not directly searchable from the training data. For instance, the real human data used in this paper was released in July 2020 [6], i.e., it cannot have been used in training the GPT-3 variants we tested, except for text-davinci-002. According to OpenAI documentation, text-davinci-002 training data ends in 2021, and the data of the older variants ends in 2019. However, it is not currently possible to predict how well a model generalizes for a specific case without actually testing it.


Figure 6: A) Mean intra-participant and inter-participant cosine similarities between the experience descriptions and "Why art?" answers. The shaded areas indicate standard errors of the mean. B) Cosine similarity matrices between the experience descriptions and "Why art?" answers of all 178 synthetic and human participants. The intra-participant similarities are on the matrix diagonals, whereas the off-diagonal elements display inter-participant similarities.

Table 6: Most common games in the human, GPT-3 davinci, and GPT-3 text-davinci-002 data. The numbers in parentheses indicate how many times the game was mentioned in the data. The table shows all the games that were mentioned more than twice in the human data. The games with corresponding frequency ranks from the GPT-3 davinci and text-davinci-002 data are shown in the second and third columns. Ties are sorted in alphabetical order.

Rank | Human data | GPT-3 davinci | GPT-3 text-davinci-002
1. | The Legend of Zelda: BOTW (10) | Journey (44) | Journey (151)
2. | Journey (7) | The Last of Us (12) | Flower (5)
3. | Nier: Automata (7) | Dear Esther (8) | That Dragon, Cancer (3)
4. | Red Dead Redemption 2 (6) | Portal (7) | Braid (2)
5. | The Last of Us Part II (6) | Bioshock (6) | Shadow of the Colossus (2)
6. | Firewatch (5) | Shadow of the Colossus (5) | Dreams of Geisha (1)
7. | Hollow Knight (5) | The Path (5) | Final Fantasy VII (1)
8. | Disco Elysium (4) | Limbo (3) | Flow (1)
9. | Life Is Strange (4) | Mirror's Edge (3) | Frog Fractions (1)
10. | Bioshock (3) | The Stanley Parable (3) | Halo 5: Guardians (1)
11. | Shadow of the Colossus (3) | Final Fantasy IX (2) | Kingdom Hearts (1)
12. | The Witcher 3 (3) | Final Fantasy VII (2) | The Legend of Zelda: BOTW (1)
13. | Undertale (3) | Flower (2) | Nier: Automata (1)
–> | ... and 97 other games (113) | ... and 65 other games (69) | ... and 10 other games (10)

To understand the limits and opportunities of generalization, consider that LLM text training data typically originates from a generative human thought process affected by multiple latent variables, such as the communicated content and the writer's emotion, intent, and style. Now, assuming that the training dataset is too large to simply memorize, an efficient way to minimize the next-token prediction error is to learn internal data representations and computational operations that allow mimicking the data-generating process.[3] For example, vector representations of words produced by language models can exhibit semantic-algebraic relations such as queen − man = king − woman [42, 51]; this allows subsequent computational operations to perform semantic manipulations. To minimize the average prediction error over all data, an LLM should prioritize representing and operating on latent variables that affect a large portion of the data. Hence, it should care less about particular facts that only affect a small subset of the data, but assign a high priority to commonly influential variables such as emotion, style, and political views. Fittingly, LLMs are known to make factual errors, but can generate text in many literary styles [8], perform sentiment analysis [64], generate human-like self-reports of emotion [76], and predict how political views affect voting behavior and which words people associate with members of different political parties [2]. LLM representations have also been observed to encode emotion and sentiment [36, 64].

[3] Recall that in a deep multilayer neural network such as an LLM, each layer performs one step of a multi-step computational process, operating on the representations produced by the previous layer(s). The power of deep learning lies in the ability to automatically learn good representations [5, 40].


Figure 7: A circular graph presenting the human and GPT-3 davinci data resulting from the automatic qualitative coding. Each human data group is connected with a line to the most similar group in the GPT-3 data. The lines are color coded based on cosine similarity. The color coding and sorting of the group nodes is based on how frequent the groups were in the two datasets (groups with the highest frequency on top). Here, group frequencies are reported as percentages.

Considering the above, a reasonable working assumption is that although LLMs can be expected to make factual errors when discussing interactive software or HCI artefacts—especially novel ones not included in training data—they may be useful in generating data about psychological latent variables such as user emotion and motivation in response to a hypothetical scenario described in the prompt, or about user experiences more generally, as in this paper. Obviously, confirming hypotheses or arriving at conclusions about what people really think, feel, or need should only be done based on real data.


LLM-based exploration could also steer interview questions in a more biased direction, which would subsequently reduce data quality in interviews with real users. On the other hand, other exploration techniques such as web searches or initial interviews with small participant samples can also be biased. More work is needed to test and evaluate LLMs in real research and design projects.

7.2 Misuse potential

Unfortunately, the quality requirements for synthetic data may be much lower for misuse than for actual research. In particular, GPT-3 and other LLMs may exacerbate the data quality problems of online crowdsourcing platforms. The reward incentives of such platforms encourage completing studies as fast as possible, in the extreme case by utilizing bots (e.g., [25]) and/or multiple accounts. Based on our experiments, it is clear that advanced language models can enable bots to generate more convincing questionnaire answers. Similarly, human participants might artificially increase their efficiency by generating answers to open-ended questions that would be slow to write for real.

Now that GPT-3 is widely available outside the initially closed beta program, there is a risk that online crowdsourcing of self-report data becomes fundamentally unreliable. If the risk is realized, new tools are needed for detecting non-human data. Unfortunately, this is likely to become increasingly harder as language models advance.

The risk may also imply a change in the cost-benefit analysis between research data sources. If one can no longer trust that online crowdsourced textual research data comes from real humans, researchers may need to rely more on time-consuming and expensive laboratory studies than previously. If this is the case, fast and cheap LLM-generated synthetic data may become even more valuable for initial exploration and piloting.

7.3 GPT-3 and Emotions

As an incidental observation, the open question data of Experiment 1 suggests that there might be a common belief that human-written texts can be recognized based on the emotion that the text conveys. If this is the case, the belief is a probable contributor to the high rate of "Written by a human participant" responses for some of our GPT-3 texts. For example, many of the GPT-3 stimuli rated as most human-like contain detailed accounts of how the player felt during the gameplay, including specific emotional responses such as "I felt his pain, his fear, and his struggles." (see Table 3). Considering this, it is not surprising that these texts fooled the participants into thinking they were written by humans.

If the belief that artificial intelligence cannot generate descriptions of experiences that the reader interprets as emotional is a more general phenomenon, it might be of importance when considering the risks related to language model misuse. For example, fake social media accounts that write about "their" emotional experiences might be perceived as more believable.

7.4 Future Directions

To mitigate the problems and better understand the biases of generated data, future efforts are needed in collecting reference human data together with extensive demographic information, and in including the demographic information in the prompt to guide synthetic data generation. This would allow a more in-depth and nuanced inspection of the similarities and differences between real and synthetic responses, including the ability of LLMs to portray different participant demographics.


In initial tests, we have observed GPT-3 adapting its output based on participant age and gender given in the prompt, when generating synthetic answers to the question "What is your favorite video game and why?" (a minimal prompt sketch is given at the end of this subsection).

In addition to training larger and better models, data quality could be improved by using bias correction techniques such as the calibration approach of Zhao et al. [83], which does not require a slow and expensive retraining of a model. However, correct use of such techniques also requires reference data—from a user modeling perspective, one should not try to remove the natural biases and imperfections of humans. On the other hand, problems of real human data such as social desirability bias and careless or humorous answers should be avoided in synthetic data. We hypothesize that with a sufficiently capable language model, this could be implemented by describing a virtual participant's motivations and attitudes as part of the prompt.

Although not yet explored in this paper, it might be possible to use LLMs to augment AI agents performing simulated user testing, which is currently focused on non-verbal data such as task difficulty or ergonomics [13, 30, 57]. LLMs could be integrated by generating textual descriptions of the test situation and agent behavior, and having the LLM generate synthetic "think aloud" descriptions of what the agent feels or thinks. This might greatly expand what kinds of data and insights simulated user testing can produce.
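To make the demographic conditioning mentioned at the start of this subsection concrete, a prompt template could look like the sketch below. The wording, the field names, and the attitude description are hypothetical illustrations, not the exact prompts used in our initial tests.

```python
# Hypothetical prompt template for demographic and attitude conditioning.
def demographic_prompt(age, gender, attitude="answers carefully and sincerely"):
    return (
        f"The following is an interview from a video game research study. "
        f"The participant is a {age}-year-old {gender} who {attitude}.\n"
        "Researcher: What is your favorite video game and why?\n"
        "Participant:"
    )

prompt = demographic_prompt(age=27, gender="woman")
# The prompt would then be passed to the completions API as in the earlier sketches.
```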

7.5 Limitations

The recruitment of Experiment 1 was not limited to native English speakers, and a sample with only native speakers might show different distinguishability scores. Additionally, as our sample was based on online crowdsourcing, where the participants have at least an indirect incentive to respond fast, it is possible that laboratory studies would show different rates of distinguishability between human and GPT-3 generated texts.

We only examined one HCI context, art experiences in video games; thus, the generalizability of our results is unclear. Future work should investigate the scope of possibilities for synthetic data more thoroughly, e.g., for what kinds of questions synthetic data is and is not helpful. For instance, we only evaluated open-ended question answers and are working on expanding our study to the quantitative Likert-scale aesthetic emotion data that Bopp et al. also collected [6].

We also only tested one kind of prompt structure for generating the synthetic data. Although we used the same questions as in the reference human data—which is logical for evaluating human-likeness—we acknowledge that results may be sensitive to the wording of the other parts of the prompt. More research is needed on prompt design for HCI data generation, although we believe that our prompt design can be a promising starting point in many cases. Also, we did not investigate how useful researchers rate LLM use when designing new interview paradigms. This is an important direction for future studies.

Each of our three experiments probes a different aspect of the human-likeness and quality of GPT-3 generations. Although our experiments complement each other, they do not yet paint a complete picture—there is no all-encompassing definition of human-likeness, and the relevant features depend on context. Future studies should investigate how realistically synthetic data can represent participants from different demographic groups, and expand the evaluation of human-likeness to other important features.


Fortunately, new benchmarks and metrics are emerging for evaluating LLM biases [18, 19, 41].

Finally, one would often like to explore the reasons and explanations for an observed data distribution. Our "Why art?" question demonstrates that GPT-3 can be directly prompted for further insights in relation to previous generations (here: the experience descriptions). Naturally, the model cannot produce real causal explanations of why it generated something; it merely samples an explanation that is probable given the earlier generations included in the prompt. This is reminiscent of the research on Chain-of-Thought (CoT) prompting: an LLM can be prompted to provide step-by-step explanations for its "thought process", which can actually improve LLM reasoning capabilities [33]. The primary limitation is that generated explanations should be treated as hypotheses to be validated with real data, rather than as trustworthy evidence. Our present experiments also focus on qualitative data—future work should explore collecting both quantitative and qualitative synthetic data, e.g., Likert-scale responses augmented with open-ended questions that probe the reasons.

8 CONCLUSION

We have explored and evaluated a general-purpose large language model (GPT-3) in generating synthetic HCI research data, in the form of open-ended question answers about experiencing video games as art. Our results indicate that GPT-3 responses can be very human-like in the best case, and can discuss largely similar topics as real human responses, although future work is needed to verify this with other datasets and research topics. On the other hand, GPT-3 responses can have less diversity than real responses, and contain various anomalies and biases. More research is needed on ways to prune anomalous responses and/or guide the model towards better and less biased responses.

Regarding use cases, we believe that LLMs can be useful in initial research exploration and pilot studies, especially as the models continue to improve. However, one must carefully consider the potential effects of the models' biases and confirm any gained insights with real data. As a downside, our results indicate that LLMs might make cheating on crowdsourcing platforms such as Amazon Mechanical Turk more prevalent and harder to detect. This poses a risk of crowdsourced self-report data becoming fundamentally unreliable.

OpenAI's GPT-3 is currently the largest and most capable publicly available language model. However, other technology companies have joined the race to train the best performing (and largest) generative language model. The past 12 months have seen the introduction of (among others) Microsoft's and NVIDIA's 530-billion-parameter Megatron-Turing NLG [75], DeepMind's 280-billion-parameter Gopher [66], and Google's 540-billion-parameter PaLM (Pathways Language Model) [14]. During this paper's review period, OpenAI also released a new and improved GPT-3 variant called ChatGPT [56]. The size of the models as well as their performance on numerous NLP benchmark tasks keeps increasing [14, 66, 75]. It will be intriguing to compare the present results to the generations of even more capable models in the future.


However, the availability of the latest models is limited, as they are too large to run on consumer hardware or even on the computing infrastructures of most academic research labs. In practice, one may have to wait for the models to be released as cloud services, similar to GPT-3.

ACKNOWLEDGMENTS

This work has been supported by the European Commission through the Horizon 2020 FET Proactive program (grant agreement 101017779) and by Academy of Finland Grant #318937.

REFERENCES

[1] Alexander L Anwyl-Irvine, Jessica Massonnié, Adam Flitton, Natasha Kirkham, and Jo K Evershed. 2020. Gorilla in our midst: An online behavioral experiment builder. Behavior research methods 52, 1 (2020), 388–407. https://doi.org/10.3758/s13428-019-01237-x
[2] Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua Gubler, Christopher Rytting, and David Wingate. 2022. Out of One, Many: Using Language Models to Simulate Human Samples. arXiv preprint arXiv:2209.06899 (2022). https://doi.org/10.48550/arXiv.2209.06899
[3] Javier A. Bargas-Avila and Kasper Hornbæk. 2011. Old Wine in New Bottles or Novel Challenges: A Critical Analysis of Empirical Studies of User Experience. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 2689–2698. https://doi.org/10.1145/1978942.1979336
[4] Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. 2019. Dimensionality reduction for visualizing single-cell data using UMAP. Nature biotechnology 37, 1 (2019), 38–44. https://doi.org/10.1038/nbt.4314
[5] Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1798–1828. https://doi.org/10.1109/TPAMI.2013.50
[6] Julia Bopp, Jan Benjamin Vornhagen, Roosa Piitulainen, Barbara Keller, and Elisa D. Mekler. 2020. GamesAsArt. (July 2020). https://doi.org/10.17605/OSF.IO/RYVT6 Publisher: OSF.
[7] Julia A. Bopp, Jan B. Vornhagen, and Elisa D. Mekler. 2021. "My Soul Got a Little Bit Cleaner": Art Experience in Videogames. Proc. ACM Hum.-Comput. Interact. 5, CHI PLAY, Article 237 (Oct 2021), 19 pages. https://doi.org/10.1145/3474664
[8] Gwern Branwen. 2020. GPT-3 creative fiction. (2020). https://www.gwern.net/GPT-3
[9] Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative research in psychology 3, 2 (2006), 77–101. https://doi.org/10.1191/1478088706qp063oa
[10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[11] Marc Brysbaert. 2019. How many words do we read per minute? A review and meta-analysis of reading rate. Journal of Memory and Language 109 (2019), 104047. https://doi.org/10.1016/j.jml.2019.104047
[12] Erik Cambria and Bebo White. 2014. Jumping NLP Curves: A Review of Natural Language Processing Research [Review Article]. IEEE Computational Intelligence Magazine 9, 2 (2014), 48–57. https://doi.org/10.1109/MCI.2014.2307227
[13] Noshaba Cheema, Laura A. Frey-Law, Kourosh Naderi, Jaakko Lehtinen, Philipp Slusallek, and Perttu Hämäläinen. 2020. Predicting Mid-Air Interaction Movements and Fatigue Using Deep Reinforcement Learning. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI '20). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376701
[14] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311 [cs.CL]

[15] Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2021. Transformers as Soft Reasoners over Language. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (Yokohama, Yokohama, Japan) (IJCAI'20). Article 537, 9 pages. https://doi.org/10.24963/ijcai.2020/537
[16] Matt Cox. 2019. This AI text adventure generator lets you do anything you want. https://www.rockpapershotgun.com/this-ai-text-adventure-generator-lets-you-do-anything-you-want
[17] Jonas Degrave. 2022. Building A Virtual Machine inside ChatGPT. https://www.engraved.blog/building-a-virtual-machine-inside/
[18] Pieter Delobelle, Ewoenam Tokpo, Toon Calders, and Bettina Berendt. 2022. Measuring fairness with biased rulers: A comparative study on bias metrics for pre-trained language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1693–1706. https://doi.org/10.18653/v1/2022.naacl-main.122
[19] Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Virtual Event, Canada) (FAccT '21). Association for Computing Machinery, New York, NY, USA, 862–872. https://doi.org/10.1145/3442188.3445924
[20] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. 2020. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341 (2020). https://doi.org/10.48550/ARXIV.2005.00341
[21] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2021. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread (2021). https://transformer-circuits.pub/2021/framework/index.html
[22] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (Portland, Oregon) (KDD'96). AAAI Press, 226–231.
[23] Paul M Fitts. 1954. The information capacity of the human motor system in controlling the amplitude of movement. Journal of experimental psychology 47, 6 (1954), 381–391.
[24] Leo Gao. 2021. On the Sizes of OpenAI API Models. Retrieved 2022-09-12 from https://blog.eleuther.ai/gpt3-model-sizes/
[25] Marybec Griffin, Richard J Martino, Caleb LoSchiavo, Camilla Comer-Carruthers, Kristen D Krause, Christopher B Stults, and Perry N Halkitis. 2022. Ensuring survey research data integrity in the era of internet bots. Quality & quantity 56, 4 (2022), 2841–2852. https://doi.org/10.1007/s11135-021-01252-1
[26] Maarten Grootendorst. 2022. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794 (2022). https://doi.org/10.48550/ARXIV.2203.05794
[27] Perttu Hämäläinen, Mikke Tavast, and Anton Kunnari. 2022. Neural Language Models as What If? -Engines for HCI Research. In 27th International Conference on Intelligent User Interfaces (Helsinki, Finland) (IUI '22 Companion). Association for Computing Machinery, New York, NY, USA, 77–80. https://doi.org/10.1145/3490100.3516458
[28] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/8a1d694707eb0fefe65871369074926d-Paper.pdf
[29] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.
[30] Jussi Jokinen, Aditya Acharya, Mohammad Uzair, Xinhui Jiang, and Antti Oulasvirta. 2021. Touchscreen Typing As Optimal Supervisory Control. In CHI '21: CHI Conference on Human Factors in Computing Systems, Virtual Event / Yokohama, Japan, May 8-13, 2021, Yoshifumi Kitamura, Aaron Quigley, Katherine Isbister, Takeo Igarashi, Pernille Bjørn, and Steven Mark Drucker (Eds.). ACM, 720:1–720:14. https://doi.org/10.1145/3411764.3445483
[31] Daniel Kahneman and Amos Tversky. 2013. Prospect theory: An analysis of decision under risk. In Handbook of the fundamentals of financial decision making: Part I. World Scientific, 99–127.

Hämäläinen et al.

Part I. World Scientifc, 99–127. [32] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jefrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020). https://doi.org/10.48550/ARXIV.2001.08361 [33] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. In Advances in Neural Information Processing Systems. https://arxiv.org/abs/2205.11916 [34] Per Ola Kristensson. 2018. Statistical Language Processing for Text Entry. In Computational Interaction. Oxford University Press, 43–64. [35] Victor Kuperman, Aki-Juhani Kyröläinen, Vincent Porretta, Marc Brysbaert, and Sophia Yang. 2021. A lingering question addressed: Reading rate and most efcient listening rate are highly similar. Journal of Experimental Psychology: Human Perception and Performance 47, 8 (2021), 1103–1112. https://doi.org/10. 1037/xhp0000932 [36] Mijin Kwon, Tor Wager, and Jonathan Phillips. 2022. Representations of emotion concepts: Comparison across pairwise, appraisal feature-based, and word embedding-based similarity spaces. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 44. [37] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. 2019. Improved Precision and Recall Metric for Assessing Generative Models. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/fle/ 0234c510bc6d908b28c70f313743079-Paper.pdf [38] Nils C. Köbis, Barbora Doležalová, and Ivan Soraperra. 2021. Fooled twice: People cannot detect deepfakes but think they can. iScience 24, 11 (2021), 103364. https://doi.org/10.1016/j.isci.2021.103364 [39] Guillaume Lample and François Charton. 2019. Deep Learning For Symbolic Mathematics. In International Conference on Learning Representations. [40] Phuc H. Le-Khac, Graham Healy, and Alan F. Smeaton. 2020. Contrastive Representation Learning: A Framework and Review. IEEE Access 8 (2020), 193907– 193934. https://doi.org/10.1109/ACCESS.2020.3031549 [41] Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2021. Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning. PMLR, 6565–6576. [42] Shusen Liu, Peer-Timo Bremer, Jayaraman J Thiagarajan, Vivek Srikumar, Bei Wang, Yarden Livnat, and Valerio Pascucci. 2017. Visual exploration of semantic relationships in neural word embeddings. IEEE transactions on visualization and computer graphics 24, 1 (2017), 553–562. [43] Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J Michaud, Max Tegmark, and Mike Williams. 2022. Towards Understanding Grokking: An Efective Theory of Representation Learning. In Advances in Neural Information Processing Systems. [44] Róisín Loughran and Michael O’Neill. 2017. Application Domains Considered in Computational Creativity.. In ICCC. 197–204. [45] Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. 2022. Frozen Pretrained Transformers as Universal Computation Engines. In Proc. AAAI 2022. 7628–7636. https://doi.org/10.1609/aaai.v36i7.20729 [46] I Scott MacKenzie and William Buxton. 1992. Extending Fitts’ law to twodimensional tasks. In Proceedings of the SIGCHI conference on Human factors in computing systems. 219–226. [47] Neil A. Macmillan. 2005. 
Detection theory : a user’s guide (2nd ed.). Lawrence Erlbaum Associates, Mahwah, N.J. [48] Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R Eguchi, Po-Ssu Huang, and Richard Socher. 2020. Progen: Language modeling for protein generation. arXiv preprint arXiv:2004.03497 (2020). [49] Leland McInnes, John Healy, and Steve Astels. 2017. hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2, 11 (2017), 205. [50] Leland McInnes, John Healy, and James Melville. 2018. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018). [51] Tomáš Mikolov, Wen-tau Yih, and Geofrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies. 746–751. [52] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the Number of Linear Regions of Deep Neural Networks. Advances in Neural Information Processing Systems 27 (2014), 2924–2932. [53] Sophie J. Nightingale and Hany Farid. 2022. AI-synthesized faces are indistinguishable from real faces and more trustworthy. Proceedings of the National Academy of Sciences 119, 8 (2022), e2120481119. https://doi.org/10.1073/pnas. 2120481119 arXiv:https://www.pnas.org/doi/pdf/10.1073/pnas.2120481119 [54] Maxwell Nye, Michael Tessler, Josh Tenenbaum, and Brenden M Lake. 2021. Improving Coherence and Consistency in Neural Sequence Models with DualSystem, Neuro-Symbolic Reasoning. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 25192–25204. https://proceedings. neurips.cc/paper/2021/fle/d3e2e8f631bd9336ed25b8162aef8782-Paper.pdf

Evaluating Large Language Models in Generating Synthetic HCI Research Data: a Case Study

[55] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfeld-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2022. In-context Learning and Induction Heads. Transformer Circuits Thread (2022). https://transformer-circuits.pub/2022/in-context-learning-and-inductionheads/index.html. [56] OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue. https: //openai.com/blog/chatgpt/ [57] Antti Oulasvirta. 2019. It’s time to rediscover HCI models. Interactions 26, 4 (2019), 52–56. https://doi.org/10.1145/3330340 [58] Long Ouyang, Jef Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. https: //doi.org/10.48550/ARXIV.2203.02155 [59] Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2022. Social Simulacra: Creating Populated Prototypes for Social Computing Systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–18. [60] Jorge Pérez, Pablo Barceló, and Javier Marinkovic. 2021. Attention is TuringComplete. J. Mach. Learn. Res. 22, 75 (2021), 1–35. [61] Ingrid Pettersson, Florian Lachner, Anna-Katharina Frison, Andreas Riener, and Andreas Butz. 2018. A Bermuda Triangle? A Review of Method Application and Triangulation in User Experience Evaluation. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1–16. https://doi.org/10.1145/3173574.3174035 [62] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. 2016. Exponential expressivity in deep neural networks through transient chaos. Advances in neural information processing systems 29 (2016), 3360– 3368. [63] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. 2022. Grokking: Generalization beyond overftting on small algorithmic datasets. arXiv preprint arXiv:2201.02177 (2022). [64] Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444 (2017). [65] Alec Radford, Jefrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9. [66] Jack W. 
Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hofmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Safron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jef Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geofrey Irving. 2021. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv:2112.11446 [cs.CL] [67] Colin Rafel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the

[68]

[69]

[70] [71] [72] [73] [74] [75]

[76]

[77] [78]

[79]

[80] [81] [82]

[83]

CHI ’23, April 23–28, 2023, Hamburg, Germany

Limits of Transfer Learning with a Unifed Text-to-Text Transformer. J. Mach. Learn. Res. 21 (2020), 140:1–140:67. http://jmlr.org/papers/v21/20-074.html Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 8821–8831. https://proceedings.mlr.press/v139/ramesh21a.html Shaghayegh Roohi, Asko Relas, Jari Takatalo, Henri Heiskanen, and Perttu Hämäläinen. 2020. Predicting Game Difculty and Churn Without Players. In CHI PLAY ’20: The Annual Symposium on Computer-Human Interaction in Play, Virtual Event, Canada, November 2-4, 2020, Pejman Mirza-Babaei, Victoria McArthur, Vero Vanden Abeele, and Max Birk (Eds.). ACM, 585–593. https://doi.org/10.1145/3410404.3414235 Ronald Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proc. IEEE 88, 8 (2000), 1270–1278. Victor Sanh, Albert Webson, Colin Rafel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chafn, Arnaud Stiegler, Teven Scao, Arun Raja, et al. 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization. (2022). Jef Sauro and James R Lewis. 2016. Quantifying the user experience: Practical statistics for user research. Morgan Kaufmann. Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly. 2018. On Accurate Evaluation of GANs for Language Generation. https://doi.org/10.48550/ARXIV. 1806.04936 Tom Simonite. 2021. It Began as an AI-Fueled Dungeon Game. It Got Much Darker. https://www.wired.com/story/ai-fueled-dungeon-game-got-much-darker/ Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 2022. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. arXiv:2201.11990 [cs.CL] Mikke Tavast, Anton Kunnari, and Perttu Hämäläinen. 2022. Language Models Can Generate Human-Like Self-Reports of Emotion. In 27th International Conference on Intelligent User Interfaces (Helsinki, Finland) (IUI ’22 Companion). Association for Computing Machinery, New York, NY, USA, 69–72. https: //doi.org/10.1145/3490100.3516464 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008. Jason Wei, Yi Tay, Rishi Bommasani, Colin Rafel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022). Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models. https://doi.org/10.48550/ARXIV.2201. 11903 Gui-Rong Xue, Jie Han, Yong Yu, and Qiang Yang. 2009. User language model for collaborative personalized search. ACM Transactions on Information Systems (TOIS) 27, 2 (2009), 1–28. Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. 2021. Crossft: A few-shot learning challenge for cross-task generalization in nlp. 
arXiv preprint arXiv:2104.08835 (2021). Camilla Zallot, Gabriele Paolacci, Jesse Chandler, and Itay Sisso. 2021. Crowdsourcing in observational and experimental research. Handbook of Computational Social Science, Volume 2: Data Science, Statistical Modelling, and Machine Learning Methods (2021), 140–157. Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate Before Use: Improving Few-shot Performance of Language Models. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 12697– 12706. https://proceedings.mlr.press/v139/zhao21c.html

Kaleidoscope: A Reflective Documentation Tool for a User Interface Design Course

Sarah Sterman
[email protected]
University of California, Berkeley, USA

Molly Jane Nicholas
[email protected]
University of California, Berkeley, USA

Jessica R Mindel
[email protected]
University of California, Berkeley, USA

Janaki Vivrekar
[email protected]
University of California, Berkeley, USA

Eric Paulos
[email protected]
University of California, Berkeley, USA

Figure 1: Kaleidoscope is a remote collaboration tool for student teams in a project-based user interface design course. Group interaction centers in the “Studio Space,” where groups document the history of their project with multimedia artifacts. Other features support assignment submission, peer feedback, portfolio creation, and instructor visibility into student process.

ABSTRACT
Documentation can support design work and create opportunities for learning and reflection. We explore how a novel documentation tool for a remote interaction design course provides insight into design process and integrates strategies from expert practice to support studio-style collaboration and reflection. Using Research through Design, we develop and deploy Kaleidoscope, an online tool for documenting design process, in an upper-level HCI class during the COVID-19 pandemic, iteratively developing it in response to student feedback and needs. We discuss key themes from the real-world deployment of Kaleidoscope, including: tensions between documentation and creation; effects of centralizing discussion; privacy and visibility in shared spaces; balancing evidence of achievement with feelings of overwhelm; and the effects of initial perceptions and incentives on tool usage. These successes and challenges provide insights to guide future tools for design documentation and HCI education that scaffold learning process as an equal partner to execution.

CCS CONCEPTS

• Human-centered computing → Interactive systems and tools; Field studies.


KEYWORDS
HCI education, reflection, documentation, studio, online learning


ACM Reference Format: Sarah Sterman, Molly Jane Nicholas, Janaki Vivrekar, Jessica R Mindel, and Eric Paulos. 2023. Kaleidoscope: A Reflective Documentation Tool for a User Interface Design Course. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 19 pages. https://doi.org/10.1145/3544548.3581255

1 INTRODUCTION

Design education is a growing area of interest among the HCI research community. Since HCI is an interdisciplinary field, teaching HCI requires covering a complex array of concepts from multiple domains. Essential in this mix is design process: how designers perform, order, and cycle between tasks and actions in pursuit of a design goal. Many HCI educators teach some form of design as part of their HCI courses [55], and HCI itself can be seen as a fundamentally design-oriented practice [19]. While there are many ways to teach design, and multiple interpretations of “design process,” a common approach is to use project-based learning and a studio environment to give hands-on experience in iteration, critique, and collaboration [44, 54]. There is no single prescriptive structure for successful design process [45], so project-based courses give students the opportunity to explore process for themselves, to try multiple approaches, and to adapt to changing needs. At the University of California, Berkeley, such a course is the upper-level undergraduate User Interface Design and Development class.

Though facility with the design process is a key learning goal of this course, instructors do not have a way to directly evaluate or view students’ process. Instructors assess student projects based on the quality of individual assignment outputs; while these assignments represent key points in the design process – for instance, submitting preliminary sketches to demonstrate early ideation, then submitting wireframes to show progress and iteration – they capture only snapshots of outputs. Instructors only have access to these singular moments of students’ process, which have been curated by the students to be “successful” submissions. Moreover, students themselves have limited visibility into the structure of their workflows even as they perform them. Without such visibility, students and instructors are limited in their ability to reflect on the design process itself.

One leverage point to make process more accessible to both instructors and students is documentation tools. Tools have significant effects on how practitioners approach process [16, 33]. Documentation tools in particular support not just individual tasks or post-hoc records, but are also active participants in the creative process, enabling iteration, branching ideas, and reuse of artifacts across the entire design process [26, 30, 40, 51]. In user interface design courses, students learn how to use specific tools for particular tasks (e.g. paper sketches for ideation, Figma for wireframing, slide decks for prototyping, etc.), but there is a gap for tools that support reflecting on the high-level aspects of process across the entire design journey. Documentation tools for design offer a unique opportunity to capture and reflect on process holistically while also supporting particular design skills.

In this paper, we ask three research questions:


(1) How can a documentation tool for user interface design make process visible to students and instructors for metacognitive reflection?
(2) How can a documentation tool directly support students’ design process in collaborative interaction design projects?
(3) How can strategies from expert process be incorporated into tools for student learning?

We present a design documentation tool, called Kaleidoscope (Fig. 1), which we developed and deployed in the upper-level undergraduate user interface design course at our institution. Using Research through Design, we seek to understand how a process-focused documentation tool can support student design processes, group collaboration, and critical reflection on personal process. This work responds to the call for research in HCI education to provide empirical evidence from real classroom deployments [46]. In our deployment, students documented over 3800 artifacts in Kaleidoscope – design sketches, notes, photographs of prototypes, code, Figma documents, etc. – and left each other over 1000 pieces of feedback. These artifacts spanned many mediums, creating a central repository for project progress and infrastructure for feedback within and between teams. At the end of the semester, students used the tool to generate final portfolios from these artifacts for the class showcase. Student interactions with Kaleidoscope provided insights into the role of documentation tools in a course setting and shaped the design directions of Kaleidoscope as it was continuously developed throughout the semester in response to student needs, usage patterns, and feedback.

We deployed Kaleidoscope in a fully remote semester during the COVID-19 pandemic. Since this course is usually taught in an in-person studio format, this offered a chance to explore how a documentation tool might assist students in remote collaboration and go “beyond” replication of a studio environment, using the digital format to add greater depth and new interactions [25].

To guide the design of Kaleidoscope, we synthesized five key design principles from prior research on design process, education, and documentation tools: collaboration, seeing the big picture, metacognition, curating the creative space, and making progress visible. Through these design principles, the tool seeks to support the learning goals of the user interface design course, including how to work together on team projects, how to give and receive feedback, the importance of iteration in design, how to communicate results, and how to design, prototype, and evaluate interfaces. Tying together each of these specific learning goals is the role of reflection in learning; Kaleidoscope’s key pedagogical philosophy is to support reflection on the design process as well as the design process itself.

We perform a thematic analysis of data collected across the semester, and discuss five key themes that arose from the tool’s deployment: tensions between documentation and creation, centralizing discussion, privacy and visibility in shared spaces, balancing evidence of achievement with feelings of overwhelm, and the effects of initial perceptions and incentives on tool usage. Kaleidoscope acts as an interpretive artifact for investigating process-focused tool design, where our vision of more concrete histories of, reflection on, and evaluation of process can be explored and critiqued in real world use [22, 49, 58]. Successes and challenges with Kaleidoscope provide insights to guide future tools for design education and process documentation, as well as for reflective documentation tools outside of educational contexts.

This paper contributes:
(1) A novel documentation tool for user interface design courses.
(2) A thematic analysis of student and instructor experiences, including how the tool supported the design process and shaped student learning experiences.
(3) An annotated portfolio of the documentation tool as an artifact for shaping and reflecting on process.

2 RELATED WORK

2.1 HCI Education and Studio Learning Environments

Recent scholarship in the HCI community has increasingly investigated how research knowledge can improve HCI education, for example exploring a research agenda for HCI education [55], integrating research with reflections on teaching [45], and testing research theories in the classroom [46]. In this work, we use a Research through Design methodology [58] to introduce a new tool into a project-based user interface design course to better understand how to support student reflection, documentation, and collaborative process in an online setting.

In a survey of HCI educators, Wilcox et al. found that the vast majority of HCI courses include design in the curriculum (92% of respondents) [55]. We deployed Kaleidoscope in one such course, which serves as both an introduction to HCI and to user interface design at our university. This course is heavily project-based, a common format for teaching design through practice. Students complete several group projects during the semester, culminating in a significant final project (see the Supplement for additional detail).

Studio environments are often essential to project-based design courses: they teach critique skills and reflection, enable learning-by-doing, and support peer interaction [44, 54]. Studio spaces make process visible through the physical presence of intentional artifacts and the detritus of process, which come together to ground learning and discussion [29]. Exploring how to bring studio interactions into the digital world, Koutsabasis et al. created a virtual studio in a 3D simulation environment where avatars can interact in group collaboration spaces [31], and found instructor awareness of student collaboration, real-time remote collaboration, and creative freedom to customize the group space as strengths of the virtual studio. Following from these works, we draw from the strengths of studio-based learning to design a custom tool for documenting and sharing design process in a fully remote design course.

This work was performed during the COVID-19 pandemic, which introduced new challenges to teaching and learning HCI. Roldan et al. report challenges as COVID interrupted their Spring 2020 HCI course, but also note opportunities such as easy recording of online meetings to support reviewing and reflecting on design behaviors [46]. Markel et al. explore design recommendations for experiential learning in the context of the pandemic [37], and Benabdallah et al. and Peek et al. both discuss the challenges of bringing hands-on making courses to remote contexts [6, 42]. We also sought opportunities within the challenges, designing Kaleidoscope not just to replicate features of in-person studios, but to provide additional capabilities to save process history, search and view multimedia design artifacts, and collaborate with teams.

2.2 Components of Design Process

The design of Kaleidoscope focuses on three specifc elements of design process: documentation, refection, and feedback. 2.2.1 Documentation. Documentation is an essential component of the creative process. The tools we use afect how we work and approach problems [16, 33], including tools for managing project histories. In domains from data science [26] to creative coding, tapestry weaving, and writing [51], to design history [30], the tools we use to document, visualize, and interact with history afect what and how we create. The same is true for education, where tools for documentation afect student behaviors. Chen et al. discuss how the structure of deliverables in design courses afected the types of documentation students created, and the way those types of documentation structured their understanding of design process into discrete stages [13]. Keune et al. show how tools for creating and sharing portfolios in makerspaces afect process, for example how providing a blog interface and specifc times to journal helped a student integrate documentation into planning and creating new ideas [27]. Kaleidoscope draws from research on how documentation tools afect process to support specifc strategies from expert practice in the classroom. Documentation tools can also shape social and community norms, such as in Mosaic, an online community for sharing in-progress work that creates norms of feedback, reduces fear of sharing unfnished pieces, and supports refection on process [28]. In studios, making past work visible and physical in a space enables transparency of process and constant critique and discussion [29]. Structures of documentation, including how writers store drafts or how ceramicists make work visible in a space, can shape responses to failure or error, creating more resilient and productive mindsets and community norms [53]. Kaleidoscope draws from these philosophies of transparency and the value of in-progress work to support remote collaboration and encourage norms that value process rather than only outcomes. Information reuse is essential to the design process, where one’s own prior work or that of colleagues is a key resource for inspiration and problem framing. Lupfer et al. discuss how interfaces for design history curation can support process through spatial organization across multiple scales of view [35, 36]. Annotated portfolios provide a way to capture a design history for a future audience, uncover underlying values, and communicate insights and learnings to a wider audience [21]. Designers keep many artifacts from the design process, relying on visual foraging to make sense of collections of artifacts [47], yet Sharmin et al. also note the difculty of keeping artifacts connected to past design process [47]. Kaleidoscope seeks to support information reuse by acting as a central history repository across multimedia sources and providing context for the design history of artifacts. Despite its importance, documentation can be difcult and underutilized. Documentation takes time and efort, and workplace value structures can deprioritize documentation in comparison to the speed of progress or generating new outputs. Specifc materials or components of the design process can be harder to document


than others; da Rocha et al. explore the challenges and importance of documenting samples, noting their value for reproduction and communication, as well as the difficulties in interrupting a workflow to document and dedicating time to documentation [23]. In this work, we discuss challenges related to prioritizing documentation in a classroom setting and communicating its value to students.

2.2.2 Reflection. Reflection on design process helps designers and students improve how they work [45]. Roldan et al. introduce reflective activities into a studio design course, showing how structured reflection on past data can improve both design outcomes as well as students’ understanding of their own process and what they might need to improve [46]. Roldan et al. focus on skills in participatory design sessions; we focus on longer-term patterns of design cycles and decision making. Feedback also plays a key role in reflection: it can be an anchor for reflection, and becomes more useful to the student when structured reflection is applied to the feedback itself [43]. Tools can help make process visible to students in order to structure these kinds of discussions and longer-term reflections [14, 34, 56]; Kaleidoscope seeks to make the design process visible to students by 1) collecting artifacts created across the entire design life cycle with many different tools and mediums into a single context, and 2) co-locating feedback on each specific artifact with the artifact itself as well as situated within the greater design context.

2.2.3 Feedback. Feedback is a key part of the student learning experience and the iterative design process. In the user interface design course we worked with in this project, feedback came from course staf, either as formative feedback during project work or at assessment points, from group members within a project group, and from peers outside the project group. Feedback contributes to the iterative design process, but also to students’ metacognition around their own learning and process, in line with Boud et al.’s framing of students as active partners in the feedback process [8]. Feedback and critique can be hard to scale; Kulkarni et al. designed PeerStudio to provide scalable feedback in MOOCs by peers [32], and Tinapple et al. designed CritViz to support critique in large design courses, considering not just the logistics of critique but the social values of community, self-perception, and social accountability [52]. Similarly, Kaleidoscope seeks to support positive community dynamics and create visibility into peers’ design process to allow peer-learning, while integrating feedback into a more comprehensive studio documentation tool. Studio critique or design critique is a specifc form of feedback present in many studio-based HCI courses. Such critique sessions tend to be collaborative, interactive, and formative, fostering discussion among instructor and peers of the work under examination rather than evaluation [41, 44, 54]. As this project focuses on the role of documentation, we have chosen primarily to support written formative feedback within the tool, though the artifacts documented in the tool can be used in synchronous critique sessions. Direct support of interactive studio critique was beyond the scope of this paper, but combining specifc strategies for studio critique with a documentation tool may be fruitful future work.


2.3 Digital Collaboration Tools in Our Classroom

Collaboration is essential to group work and successful design projects. Mercier et al. identify “creation of a joint problem space” as a key feature in successful collaboration in a design course, and emphasize the role of tools and shared artifacts in creating this space [38]. Kaleidoscope supports shared understanding by encouraging the central collection of all content related to the project, and acting as a shared reference for discussion and iteration. Diverse collaboration tools have roles in the design classroom, in both in-person and remote oferings of courses. In the user interface design course we engaged with in this work, these include course support tools like Canvas [1], used for turning in assignments, hosting course media like PDFs of readings, and recording grades; or Piazza [4], a forum for questions and discussion. Students are taught to use Figma [2], a web tool for design layouts and wireframing (for a visual reference, see the Supplement), and turn in video demos of projects by uploading them to YouTube. During the pandemic, we also noted an increase in student use of other digital tools to support their group collaboration processes, such as Miro [3], a digital whiteboarding application for brainstorming and Google Drive, Docs, and Slides for live collaboration and organizing documents. Students also relied on messaging and video calling services like Zoom, Facebook Messenger, and Discord to communicate synchronously and asynchronously during group collaboration. Kaleidoscope seeks to fll a specifc niche by focusing on design documentation and metacognition around process, incorporating or working alongside these tools rather than trying to replace them.

2.4 Action Research and Educational Deployments

Field deployments can provide real-world data from a large population of users in the environment of intended use [48]. In educational contexts in particular, Roldan et al. emphasize the importance of implementing and studying HCI research recommendations in real classrooms [46]. In this work, deploying Kaleidoscope in a semesterlong design course allowed us to see how students used it in combination with other tools, during long-term projects, and with real group dynamics, and to investigate Kaleidoscope in relation to students’ mindsets and stressors. In particular, we draw from the philosophy of action research to guide this project [24]. In introducing a new tool into a classroom, we have multiple types of stakeholders: the students in the class, who have multifacted roles as learners, group collaborators, and designers; and the course staf, both the head instructor and the TAs who support the students through grading, mentorship, and lecturing. We engaged with both the teaching team and the students as a participatory community in the iterative design of Kaleidoscope. Action research can provide frst-hand experience with practical applications of ideas, however, challenges around efort and time required make it less common than lab experiments and other research methods [39]. In the case of designing a tool for design education, we found it to be particularly appropriate to engage the students in the design and critique process. Within the frame of action research, we apply a Research through Design methodology [57, 58]. Zimmerman et al. discuss four key


components of Research through Design: process, invention, relevance, and extensibility [58]. In documenting the process of this research work, we will present a system description, details of interactions and data collection with students that led to system design decisions, and a thematic analysis of qualitative data. Kaleidoscope presents invention through a novel multimedia documentation tool that supports remote design studio interactions and course requirements. Kaleidoscope allows students to investigate their own creative process at a metacognitive level, in contrast to prior literature and tools which support specific skills or detailed reflection. Kaleidoscope addresses questions of immediate relevance to the design community, as we continue to face remote teaching challenges related to the pandemic and broader cultural shifts towards online learning, and as the HCI community expands its interest in how to teach HCI and design most effectively. We hope that the community can extend the knowledge generated by this project to design future tools for creative documentation, consider new contexts for the role of reflection in learning design, and support remote learning in studio courses.

3 METHODS

In this project, we engaged in action research [24] through a Research through Design methodology [58]. Below we describe the course context, the design process with stakeholders including course staff and students, the Kaleidoscope system, and the method of evaluation. The long-term use and iterative design of Kaleidoscope within a real-world course context allowed us to support instructors and students during the transition to an online format for the user interface and design course at our institution during the COVID pandemic, while also allowing us to generate research knowledge through the expression, evolution, and evaluation of our design goals as instantiated by a real system.

3.1 Course Context

This project occurred in the context of an upper-level undergraduate HCI and user interface design course in the Computer Science department at the University of California, Berkeley, a large public university in the United States. This course covers user interface design, technical development skills, and HCI foundations; we will refer to it here as User Interface Design (UID). Between August and December 2020, this course was taught fully online for the first time, in response to the COVID-19 pandemic (see Supplement for additional details). UID is a project-based course, with approximately 100 students, in which students learn a version of the design process that incorporates needfinding, prototyping, and evaluation techniques in an iterative cycle. The course is structured around multiple design projects across the semester, culminating in a two-month final project in which groups of four to five students design and implement a mobile application within the theme of “equity and inclusion.”

In standard offerings of this course, student project groups meet in person to collaborate on design and implementation. The course also relies on in-person studio time, where students critique each other’s work, test prototypes, and receive feedback. The remote offering of UID retained the project structure, but shifted all work online. Many students used Zoom and Discord for group meetings, Facebook Messenger for asynchronous communication, and Google Drive to collaborate in real time. Figma was a required tool for the course, which students used to brainstorm and create layouts and wireframes for prototypes.

Prior to the beginning of the Fall 2020 semester, the research team developed an initial version of Kaleidoscope, a functional documentation system for supporting collaboration and reflection. Throughout the semester, we continued to design and develop the system in response to its usage and student and instructor feedback. We collaborated closely with course staff as key stakeholders in the design and use of a new classroom tool. Two members of the research team were also members of the teaching team for this offering of UID, one as a teaching assistant, and one as the lecturer. A third member of the research team was a former lecturer for UID, and two members of the research team had taken a prior in-person offering of UID as students. Members of the teaching staff who were not research team members participated in discussions around the tool’s role in the course, their experiences using it in their teaching, and desires and needs for its design. As the second key group of stakeholders, students provided feedback and suggestions to the research team on their experiences and needs, reflected on their experiences, and communicated directly with the research team through feature requests, bug reports, and interviews.

Kaleidoscope was introduced to students at the start of the semester as a documentation tool for group collaboration. In the How-To Guide on using Kaleidoscope, we describe it as follows: “While working on a project, designers often collect lots of images and examples as they build their vision for the final outcome. This tool allows designers to see everything collected in one place. This could help a designer to stay in touch with the original plan, try out new directions, and collaborate with others. This tool also lets designers look back at earlier iterations and see what’s changed throughout the process.”

The instructors demonstrated Kaleidoscope during a course section early in the semester, and encouraged students to integrate it into their design process, for instance by using it to share feedback and materials with their teams. The course required students to turn in certain assignments through Kaleidoscope; beyond that, there were no requirements about how students used Kaleidoscope in their process, and students created individual ways to integrate Kaleidoscope with other tools in their workflows. Throughout the semester, we collected multiple types of data (see Section 3.4.1), investigating questions around the role of documentation tools in the HCI classroom, how to support remote studio environments, and how to encourage student reflection.

All human subjects research activities were approved by our IRB. Participants volunteered for interviews through a recruitment form shared with all class members, and provided informed consent prior to each interview. Interviewees were compensated $15/hr for interviews. Students provided separate informed consent to allow the use of their private Kaleidoscope artifacts in research. To preserve student privacy, artifacts included in screenshots of the interface in this paper are illustrative examples created by the researchers, not student data. Figure 4 shows a screenshot of the final showcase for UID, which is publicly available online. De-identified responses from course surveys relevant to Kaleidoscope were analyzed as secondary data. Usage statistics and student feedback on Kaleidoscope were never included in student grades. The two members of the research team who were concurrent course instructors did not participate in performing interviews, did not have access to consent data, and were not shown student interview data until after grades were submitted. Members of the research team who were not current staff had no access to any student course data beyond the data sources specified in Section 3.4.1, including no access to grades and non-Kaleidoscope assignments, such as reading responses and technical assessments. Mid-semester feedback was collected anonymously and responses related to Kaleidoscope were filtered from general course feedback by a course staff member before being provided to non-staff research team members.

3.2 Initial Design Principles

Documentation and history tools can shape creative process among expert practitioners, supporting particular strategies of refection, motivation, and mindsets [40, 51]. In this project, we explore how such strategies might be introduced to design students through a creativity support tool. By drawing from strategies used by expert creative practitioners, we hoped to guide students towards building the skills they would need in the future. Through discussions with course staf and the research team, we identifed specifc strategies from prior research on design process, education, and documentation tools that might be relevant to the UID students and support the learning goals of the course. This synthesis resulted in the guiding principles listed below, which informed the overall goal and initial design of the tool. We continually iterated on both the role of the tool in the course and the overall tool design throughout the semester, in partnership with students and instructional staf. The fve guiding principles for our studio tool were: metacognition, seeing the big picture, curating the creative space, making progress visible, and collaboration. Below we describe each of these motivating principles, with example considerations and related theory, as well as connections to specifc learning goals of UID (a complete list of learning goals can be found in the Supplement). Metacognition – Refecting on how we learn and work can improve our process. Kaleidoscope should provide visibility into students’ process so they can learn what works for them and what they might wish to change, by refecting on both their own process and others’. Metacognition and refection has been suggested as important components of design education across a broad range of research: Rivard et al. propose refexive learning as a framework for design education, emphasizing the value of critical refection to learning design [45]. Roldan et al. explore how video can support structured refection on student-led participatory design sessions in a design course [46]. Chen et al. use probes in a remote design course to encourage students to refect on their documentation practices, and found that the majority of their participants valued documentation for supporting metacognitive processes [13]. Nicholas et al. show how embodying progress can support refection as well as practitioner wellbeing [40]. Documentation tools particularly serve a role in metacognition: Yan et al. explored the benefts of visualizing version control histories for refection in computer science courses [56], providing unique opportunities for students to refect on how they


approached writing code; Sterman et al. show how extended lifetimes of records can support refection between projects and across long periods of incubation [51]. In UID, a foundational learning goal is to design, prototype, and evaluate interfaces. Metacognition helps students examine their processes in these domains and improve their skills and approaches through refection. Seeing the Big Picture –Providing a high-level view of the project history can support design process, refection, and understanding of progress. Kaleidoscope should provide a holistic view of design history, across all types of mediums and tools used in any stage of the design process. Nicholas et al. and Sterman et al. show how access to artifacts from past stages of the creative process support future work, by anchoring work to enable future exploration, maintaining an active palette of materials, and supporting refection and motivation [40, 51]. Sharmin et al. explore the value of re-use of artifacts particularly in design activities [47]; Klemmer et al. discuss the value of visibility of artifacts in studio and workshop contexts to enable communication and coordination as well as situated learning [29]. Studying design documentation at multiple scales of view, Lupfer et al. show the value of high-level views of design documentation to exploring and communicating ideas [35, 36]. As a design documentation tool, Kaleidoscope draws on multiscale approaches to representing history, and should support visual foraging and building on older artifacts. High-level views align with the learning goal of understanding the importance of iterative design for usability by allowing students to more easily build on prior artifacts and fexibly iterate. In supporting communication and coordination, this principle also addresses learning goals including how to communicate your results to a group and work together on a team project. Curating the Creative Space – The character of the studio space afects designers’ mindsets, bricolage practice, and feelings of ownership. Kaleidoscope should allow users to hide artifacts, draw attention to artifacts, and personalize the space. Klemmer et al. describe how the artifacts present in design studios provide aesthetic and structural features to support peer learning, discussion, and critique in educational design contexts [29]. Similar benefts occur across creative domains, where practitioners deliberately curate their creative spaces to be surrounded by inspirational artifacts, such as their own past work or others’ [40, 51]. In constructing a design studio in a 3D virtual world, Koutsabasis et al. found the ability to construct and decorate their virtual collaboration space was engaging for student groups [31], and Nicholas et al. discuss how “aestheticizing” can create personal motivation for creative activities by increasing the sense of value of an artifact and a desire to return to it [40]. Curation of the space can support learning goals including work together on a team project and give and receive feedback as part of design iteration. Making Progress Visible – Mindsets afect confdence, selfefcacy, and perceptions of success. Kaleidoscope should allow students to see progress made on a project and have easy access to work of which they are proud. In Mosaic, Kim et al. demonstrate how sharing works-in-progress supports productive mindsets around learning, improvement, and the value of process, as opposed to placing all value on fnal outputs [28]; similarly, Nicholas et al. 
show motivational benefits from embodying progress [40]. Especially in a domain like design, where failure is an inherent part of the process [45], growth mindsets [18] and valuing process over final output should be essential learning goals for design courses. Not only does growth mindset underlie UID’s teaching team’s philosophy of teaching and learning design; a focus on progress also helps support the learning goal of understanding the importance of iterative design for usability, drawing student attention to how designs improve over time.

Collaboration – Working with a team is integral to design and to the structure of UID. Kaleidoscope should provide context for decisions, support communication, and allow teams to get feedback on the project as a whole or on specific artifacts. Mercier et al. discuss the importance of a “joint problem space” for group collaboration, where members can concretize ideas and share context for deliverables and decisions [38]. CritViz, a system for structuring peer feedback in creative classes like a design class, shows how giving and receiving feedback leads to better outputs and creates a sense of community and teamwork [52]. Several learning goals of UID focus on teamwork, including how to work together on a team project, ability to give and receive feedback as part of design iteration, and how to communicate your results to a group.

3.3 Kaleidoscope System

Kaleidoscope is an online collaboration tool for documenting design history, supporting student reflections on their design process, and providing features for design education (Fig. 1). Kaleidoscope is written in React, and uses Google Firebase for database and server hosting. Students use their institutional Google accounts to log in to Kaleidoscope.

3.3.1 Studio Spaces. The central feature of Kaleidoscope is the “Studio Space,” where individuals or groups collect and display artifacts from their project work (Fig. 2). Each group has its own studio space for each class project; an individual can only see and edit spaces of which they are a member. Users can upload artifacts to a studio space, where they are displayed as thumbnails. Artifacts can be images, text, GitHub commits, or links to other webpages, with special support for YouTube videos and Figma layouts. These covered the core types of information created for the class, with physical sketches and prototypes documented through photographs and videos. Initially, studio spaces displayed artifacts in an automatic grid layout; later iteration introduced a whiteboard-style free-form layout feature, where students can rearrange artifacts and save layout histories (Fig. 2). Artifacts can be tagged with free-text or suggested tags during upload or later on, to track particular design stages, assignments, or ideas. Artifacts can also be associated with each other, to form conceptual groups between separate artifacts. Artifacts are displayed in the studio space, where they can be sorted and searched. They can be viewed individually on a detail page, containing the artifact, tags, description, title, and associated artifacts (Fig. 3). Detail pages also display feedback from group members, course staff, and other students. On the studio page, an icon in the corner of the artifact indicates the amount of feedback on the artifact. Artifacts can be kept private to the team and course staff, or made public for any student to view and leave feedback.

3.3.2 Course Tools. Certain features were designed specifically to support Kaleidoscope’s role as a tool for UID. Check-ins are a special type of artifact, used for submissions of course assignments. A check-in template lists the requirements for the assignment; students can select particular artifacts to include for each requirement. Check-ins are not displayed in the studio, but can be accessed through a separate page for templates and check-ins. The Explore Page displays artifacts that groups decide to make public. Instructors can make artifacts submitted with assignment check-ins public, allowing them to curate galleries of student work: for example, collecting all low-fidelity sketches from an assignment and sharing this view with all students. In this way, students can see and learn from peer work, similar to how they would in a physical studio environment. At the end of the semester, students participated in a design showcase to publicly present their work. To support the virtual version of this event, and to help students create a portfolio-style summary of their project, Portfolio Pages allows students to arrange artifacts in a curated, public-facing layout (Fig. 4).
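As a concrete illustration of the artifact model described above, the following is a minimal sketch of how such artifacts could be represented on the React and Firebase stack the system is built on. The collection name, field names, and the uploadArtifact helper are assumptions made for illustration, not Kaleidoscope's actual schema.

```typescript
// A minimal sketch (not the authors' schema) of how a Kaleidoscope artifact
// could be stored in Firestore, given the React + Firebase stack and the
// attributes described above: type, title/description, tags, associations,
// team privacy, and per-artifact feedback.
import { initializeApp } from "firebase/app";
import {
  getFirestore,
  collection,
  addDoc,
  serverTimestamp,
} from "firebase/firestore";

type ArtifactKind = "image" | "text" | "github_commit" | "link" | "youtube" | "figma";

interface Artifact {
  studioId: string;                // the group's studio space for one class project
  kind: ArtifactKind;
  title: string;
  description: string;
  url?: string;                    // storage URL or external link, depending on kind
  tags: string[];                  // free-text or suggested tags
  associatedArtifactIds: string[]; // conceptual groupings between artifacts
  isPublic: boolean;               // private to team and staff, or shown on the Explore page
  createdBy: string;               // uid from the institutional Google sign-in
}

// Hypothetical Firebase project config; real values would come from the console.
const db = getFirestore(initializeApp({ projectId: "kaleidoscope-demo" }));

// Adds an artifact document; feedback could live in a subcollection such as
// artifacts/{id}/feedback so comments stay co-located with each artifact.
async function uploadArtifact(artifact: Artifact) {
  return addDoc(collection(db, "artifacts"), {
    ...artifact,
    createdAt: serverTimestamp(),
  });
}
```

Keeping tags, associations, and feedback on or under the artifact document mirrors the design goal of co-locating discussion with the artifact it concerns.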

3.3.3 Design Iterations. Over the course of the semester, the research team solicited feedback from students, spoke with course instructors, and monitored bug reports and feature requests from students. We analyzed and discussed feedback and student use behaviors as they were collected. We continuously updated the tool, adding new features and fixing bugs in response to student needs while aligning the tool more effectively with the design goals. Two major updates were editable artifacts and customizable layouts.

Initially, all artifacts were uneditable. Once uploaded, they acted as a static archive of the design history. Deleting artifacts was possible, but not recommended. However, students were frustrated by small errors in text artifacts that then had to be re-created to fix, and wanted to be able to work with teammates to update text artifacts after they were created. This resulted in an evolution of our design goals, where the initial conception of Kaleidoscope as a static archive was relaxed to support students’ needs to co-locate creating and editing content as well as documenting it. In response, we introduced a rich text editor to the text artifact detail pages, allowing changes to text artifacts.

The initial studio space layout was a column-based layout of artifacts, running left to right in chronological order from most recent to oldest. Artifact tiles had a fixed width, which could be adjusted for all artifacts at the same time by a slider. Each artifact took the least amount of vertical space it needed to be completely visible, and so tiles varied in length. Students found this layout messy and hard to search. They expressed desire for customizable arrangements in order to explore ideas and more actively interact with the design history during brainstorming and group discussions. In response, we introduced layouts, a grid-based default view in which artifacts could be resized, moved, or hidden from a view (Fig. 2). Layouts could be saved with custom names and timestamps, and easily reloaded from a dropdown menu. Other changes included bug fixes, support for additional artifact types, and the end-of-semester portfolio feature (Sec. 3.3.2).
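The layout feature lends itself to a small, explicit data structure: each saved view is a named, timestamped list of per-artifact placements that can be reloaded on demand. The sketch below shows one possible shape for that data, offered under assumed names (StudioLayout, ArtifactPlacement, applyLayout) rather than as Kaleidoscope's actual implementation.

```typescript
// Illustrative sketch of the saved-layout feature: a named, timestamped snapshot
// of where each artifact sits in the grid view, how large it is, and whether it
// is hidden. Type and field names are assumptions, not Kaleidoscope's real code.
interface ArtifactPlacement {
  artifactId: string;
  col: number;       // grid position
  row: number;
  width: number;     // size in grid cells; artifacts are resizable
  height: number;
  hidden: boolean;   // artifacts can be hidden from a particular view
}

interface StudioLayout {
  name: string;      // custom name chosen when saving, shown in the dropdown
  savedAt: Date;     // timestamp recorded with the layout
  placements: ArtifactPlacement[];
}

// Reloading a saved layout is then just re-applying its visible placements.
function applyLayout(
  layout: StudioLayout,
  render: (placement: ArtifactPlacement) => void
): void {
  layout.placements.filter((p) => !p.hidden).forEach(render);
}
```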


Figure 2: We iterated on the Studio Space design throughout the course. We began with the design on the top, where artifacts are automatically organized chronologically to show development over time. Filtering on tags (top right) surfaced particular artifacts and allowed focused comparison across topics. Around the middle of the semester, we released flexible layouts for the studio spaces. Artifacts in the default grid (bottom left) had square aspect ratios, creating a neater initial view. Artifacts could be resized and moved freely (bottom right), and the custom views saved in a dropdown list for later review or editing (bottom center). Filtering by tags is also supported in the custom view. (Artifacts shown in screenshots are hypothetical data to demonstrate the interface.)


Figure 3: The Artifact Detail page shows information related to a specific artifact: the artifact itself, in this case a design sketch; an editable text annotation; the history of group discussion and feedback on this artifact; tags applied to the artifact; a tile view of associated artifacts for this artifact. (Artifacts shown in screenshots are hypothetical data to demonstrate the interface.)

3.4 Evaluation Methods

3.4.1 Data Collection. As Kaleidoscope was integrated with UID throughout the semester, we had access to a breadth of data collection methods, including course assignments, reflections, and feedback surveys, as well as sources specific to the research project, including semi-structured interviews with student volunteers. This breadth of data types allowed us to learn about how Kaleidoscope was used and received through multiple contexts throughout the semester. Data collected during the semester was used to guide the iterative design of Kaleidoscope.

(1) Mid-semester semi-structured interviews (N=5). Near the midpoint of the semester, the research team performed semi-structured interviews with individual students on their design process during the course, reflections on learning design, and the role of Kaleidoscope in process and learning. The interviews were performed by non-instructor members of the research team. Interviews began by discussing where the students were in the course, what stage of the current project they were in, and how they felt the project was going. Interviews then transitioned to specific questions about personal and group workflows and their usage of Kaleidoscope. Following standard semi-structured qualitative interview techniques [11, 12], the questions evolved within and between interviews; a representative selection of guiding questions can be found in the Supplement. Students volunteered for interviews, and provided separate informed consent to interview procedures. Interviews were recorded for transcription purposes. Participants were compensated at $15/hr; interviews ranged between 45 minutes and 2 hours. The identities of interview participants were not disclosed to members of the teaching team, and participation had no effect on course grades.

(2) Mid-semester course survey (N=34 students mentioned Kaleidoscope). The course staff released an anonymous mid-semester course survey in which students reflected on the class overall and gave feedback on what was going well and what could be improved, as a standard part of UID. A member of the course staff filtered the survey for responses related to Kaleidoscope before providing them to the research team. The text of questions containing responses mentioning Kaleidoscope can be found in the Supplement.

(3) Design reflection extra credit assignment (N=55). Near the middle of the semester, the teaching staff released an optional extra credit assignment in which students could reflect on their design process so far. Optional extra credit assignments during the semester in which students reflect on their design process and teamwork are a standard part of UID. The assignment came from the teaching staff as part of the course and did not mention Kaleidoscope in the instructions or the questions. The data were de-identified and analyzed as secondary data. Extra credit was given to all respondents, with no evaluation of "correct" or "incorrect" answers. The goals of the assignment were: to describe and discuss your own creative process; make explicit any subconscious behaviors and themes that affect your process; reflect on potential improvements to your process for future projects; and consider how tools can support your learning, creativity, and reflections. While the questions did not explicitly reference Kaleidoscope, many students discussed the tool's role in their process. The instructions for and questions presented in the survey can be found in the Supplement.

(4) Kaleidoscope critique session (N=18). In the middle of the semester, we moderated a voluntary critique session during which small groups of students discussed their biggest frustrations with and wishes for Kaleidoscope in a focus group for which participants provided consent. Students were given five minutes to individually add thoughts in a shared Google Doc, in response to questions about their experiences with Kaleidoscope (details in the Supplement). Next, groups took five minutes to read others' comments and add followups. Sessions concluded with 15 minutes of open discussion moderated by a single researcher, who took de-identified notes on student responses, with no recordings or other identifiable data.

(5) Post-semester semi-structured interviews (N=7). Post-semester semi-structured interviews with students focused on their design process during the course, reflections on learning design, and the role of Kaleidoscope in process and learning (performed by non-instructor members of the research team). Interviews began by discussing general reflections on the course, before transitioning to specific questions about usage of Kaleidoscope. As semi-structured interviews, the questions evolved within and between interviews; a representative selection of guiding questions can be found in the Supplement. Interview consent and compensation protocols were the same as for mid-semester interviews.

(6) Meetings with course staff (N=3 course staff, not including members of the research team). Throughout the semester, we held meetings with course staff to discuss their usage of the tool and their perceptions of student experience, and took detailed notes of the conversations.

(7) Bug reports and feature requests. We collected bug reports and feature requests from students during the semester through a Google Form linked directly from the Kaleidoscope page, through direct emails, and through Piazza posts.

(8) Usage data. We collected all materials uploaded to Kaleidoscope, and logged interactions on the platform. Over the course of the semester, 149 users across 181 teams created 3268 artifacts, including 1063 images (33%), 1892 text artifacts (58%), 116 GitHub commits (4%), 89 YouTube videos (3%), 64 Figma layouts (2%), and 44 other web page links (1%) (Fig. 5). 1077 individual pieces of feedback were left on artifacts. 553 check-ins were created for course assignments.

Figure 4: Portfolio Pages. At the end of the semester, students created interactive portfolios from their artifacts (hypothetical example to demonstrate the interface at right). Portfolios were collected as part of a publicly available online showcase (left).

3.4.2 Analysis. During the deployment semester, the research team held weekly meetings where we discussed data collected so far, including student and course staff's experiences using the tool, and newly requested features and bugs. We used these meetings to guide the direction of tool development and reflect on the tool design, role, and student experience. After the semester, we performed a thematic analysis of the qualitative data from the sources described in Section 3.4.1. We first transcribed all interviews and critique sessions, and extracted all responses from surveys and instructor meeting notes. Two researchers iteratively applied open coding to the combined corpus, creating an initial set of low-level descriptive codes. We then grouped the codes into higher-level themes, creating memos that incorporated the descriptive codes, quotes, and emergent concepts. In a reflexive process, we reapplied the higher-level codes to the corpus and refined the codes and memos. We are specifically interested in analyzing Kaleidoscope as a Research through Design artifact within the frame of process-sensitive creativity support tools [50]. Therefore we focus our findings and discussion on interpreting the effects the Kaleidoscope system had on student experience and learning, including changes across the iterative development of the tool. We present findings from the thematic analysis below. We do not report participant counts for themes, as semi-structured and evolving interviews meant not every participant was asked identical questions, and reporting 'counts' is not appropriate for this type of reflexive qualitative methodology [9, 10]. Additionally, we draw from the concept of Annotated Portfolios to present this work, demonstrating design decisions through annotated figures [21].

Figure 5: Types of artifacts uploaded to Kaleidoscope. 3,268 artifacts were created during the semester.

4 FINDINGS

In this section, we discuss the themes identified through our thematic analysis. Several of these themes explore tensions within the tool, where particular design choices enabled beneficial uses while at the same time creating challenges. Table 1 organizes these successes and challenges by Kaleidoscope's initial design principles.

4.1 Documentation Supports Reflection, Conflicts with Creation

Kaleidoscope was designed as a tool for documentation. The archival nature of Kaleidoscope supported metacognitive reflection, providing a benefit long after the act of creation itself. However, there was a tension between these later benefits and the immediate labor of artifact creation.

As a documentation tool, Kaleidoscope's design began with an archival approach to artifacts, in which artifacts were kept long-term without editing. While editing of text artifacts was introduced later in the semester, it was mostly used for minor, temporally proximal changes, and most artifacts remained static. This artifact-focused, archival design choice allowed students to collect a history of past ideas in the studio space. By keeping these visible to the team, the artifacts showed how the project developed over time and allowed students to reflect on their process at a high level: I tend to think of Kaleidoscope as timestamps of my creative thinking. It was great to see how my ideas were evolving over time. (Anon - Critique Session)

The design choice to present artifacts as a collection of visual tiles allowed students to quickly page in past context and stages of work: It facilitates my thinking process... By reviewing Kaleidoscope, it reminds me of the designing process quickly. (S67 - Reflection Assignment)

Students were prompted to carry out an explicit reflection on their creative process midway through the semester. While the assignment did not mention Kaleidoscope, many students reported using Kaleidoscope to reflect on their project history: [For the reflection assignment] I definitely took a look at my previous sketches in kaleidoscope. Which [at the start] did not seem like a great tool, but looking back really changed the way I looked at it. It almost feels like a version control for prototyping. (S103 - Reflection Assignment)

Aside from ad-hoc or prompted reflection, the final assignment encouraged additional reflection. Creating a portfolio to collect both final outputs and show a retrospective on process is a standard technique in design classes. When creating their final portfolios within Kaleidoscope, students were able to use the history of the project already collected in Kaleidoscope to reveal their design process and help their peers learn from their process: A lot of the artifacts that we added [to our portfolio] were actually artifacts that we already had... we wanted to include that step of the process to help inform other people's processes as well. (S50 - End Semester Interview)

Yet the very design choices that supported these types of reflection also interrupted creation and made it less likely that students would pause to save an artifact. When students were creating content, they had to work in other tools. This student described why they chose not to create an artifact in Kaleidoscope: I could have created an artifact in Kaleidoscope. But why do that when I want people to go to Figma and make edits? (S47 - End Semester Interview)

The archival nature of Kaleidoscope, where a Figma wireframe might have been contextualized with artifacts from other mediums, might have supported long-term reflection. However, in the moment of creation, using Kaleidoscope would have split teammates' attention between two platforms, making it less likely for them to make direct contributions to the current task. Students added artifacts to Kaleidoscope most commonly for assignment submissions and group discussion. Other parts of the project history were left in other tools; for instance, snapshots of edits in Figma were rarely incorporated into Kaleidoscope. Instead, the artifact was added to Kaleidoscope only when it was considered "finished." Stopping to create intermediate artifacts required a change in focus from "creating" to "documenting": I would like to have things more documented...but it's really hard because in the moment you don't know when you're going to change things...When I create things, I want them to be the final version. So I don't think "Oh, I should document this right now", because it's either 1) it sucks, and I don't want to document it, or 2) it's good, and then it'll stay around. (S47 - End Semester Interview)

Table 1: Summary of qualitative insights, organized by the design principles of Kaleidoscope, to inform future design process documentation tools for education.

Design Principle | Kaleidoscope Features | Successes | Challenges
Collaboration | studio space; feedback | central repository of team data; ability to collect multi-source artifacts; sharing peer feedback | lack of live collaboration; discomfort with making artifacts public to team; discomfort with permanence of artifacts
Seeing the Big Picture | tile display of artifacts; setting artifacts to public | visual display enables high-level views; peer learning about process | visual display can be messy, overwhelming, disorganized
Metacognition | archive of process history | support post-hoc reflection; understand process through documentation | tension between modes of 'creation' and 'documentation' reduces storing of history
Curating Creative Space | flexible layouts | customization of views; active interaction with history data | lack of personalization; lack of aesthetic control
Making Progress Visible | tile display of artifacts; studio space | see progress through artifact accumulation; see evidence of teamwork; see idea development | bugs and system limitations created frustration

Despite the hurdles to capturing-during-creation, students expressed a wish for easy access to those intermediate histories after the fact. For example, students found value in viewing multiple drafts of Figma documents in parallel, to see the variation between design options. In contrast, they experienced frustration with how hidden past versions are in Google Docs. They found the centralized history in Kaleidoscope helpful for reflections and creating final documentation. Yet these benefits were often realized only in retrospect; in the moment, students did not want to be removed from the activities of creation or to put their teammates in a reflective rather than generative mindset, or they did not know at the time when a change was important enough to be worth the disruption to document it. Without intermediate artifact states, metacognitive reflection is harder; yet capturing intermediate states is disruptive.

4.2 Centralizing Discussion

Feedback was one of the most successful and well-received features of Kaleidoscope among students; however, instructors felt pressure not to share potentially negative feedback in a public setting. In this section, we explore both student and teaching team reactions to centralizing discussions.

Students: Students appreciated the parallel viewing of artifacts and feedback, the ability to rapidly see how many people had left feedback on an artifact from the main studio page, and the permanence of discussions. In cases where groups did not use Kaleidoscope's feedback features, conversations were often buried in chat logs or scattered across document types. Kaleidoscope co-located group discussions, feedback from TAs, and feedback from peers with the project history, so that discussion and decision points were easily accessible and contextualized by the artifacts. Beyond ease of access, making artifacts public to other classmates for review and feedback helped students learn from each other's process: There were the times that we would do the feedback for people's artifacts ...it not only allowed me to inform people about what our team had done and see if that could potentially help provide any additional help for that team or any additional inspiration, but also our team ourselves got inspiration from what other people had to say on ours...I really did value the time that I got to look at other people's portfolios [and] look at other people's artifacts. (S50 - End Semester Interview)

We designed an additional feature for sharing artifacts and feedback with classmates, called the "Explore Page," where students could browse public artifacts from their peers on particular tags. For example, they might view a gallery of "early ideation" to see how other groups were approaching that stage of the process. However, this feature contained significant bugs for most of the semester, and we were unable to collect sufficient data on how it was used.

Teaching Team: Like the students, TAs also appreciated being able to see student comments on the artifacts, providing insight into the group's process and discussions. Such discussions were otherwise invisible to the teaching staff, as they took place in private or ephemeral channels such as group notes documents or messaging applications. Visibility helped both students and instructors access, understand, and critique process. One challenge of centralized feedback was that TAs did not feel comfortable sharing feedback that might be interpreted as negative or critical in a public area, despite the importance of critical feedback to learning. While TAs shared their positive feedback for students within the Kaleidoscope interface, they used our institution's Canvas platform for critical feedback, or comments related to grading. Separating critical feedback from positive feedback may skew students' ability to learn from peers' work, and remains a complex challenge related to privacy and visibility.
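The "Explore Page" described earlier in this section amounts to a tag-filtered query over artifacts that owners have marked public. Since the paper states Kaleidoscope uses Firebase, a Firestore-style sketch of that query is shown below; the collection and field names are assumptions, not the deployed schema.

// Hypothetical sketch of the tag-filtered browsing behind an "Explore Page":
// fetch artifacts that have been made public and that carry a given tag.
import { initializeApp } from "firebase/app";
import { getFirestore, collection, query, where, getDocs } from "firebase/firestore";

const app = initializeApp({ projectId: "kaleidoscope-demo" }); // placeholder config
const db = getFirestore(app);

async function explorePublicArtifacts(tag: string) {
  // Combining an equality filter with array-contains may require a composite
  // index in Firestore, depending on the project's existing indexes.
  const q = query(
    collection(db, "artifacts"),
    where("isPublic", "==", true),
    where("tags", "array-contains", tag)
  );
  const snapshot = await getDocs(q);
  return snapshot.docs.map((doc) => ({ id: doc.id, ...doc.data() }));
}

// Example: a gallery of how other groups approached early ideation.
explorePublicArtifacts("early ideation").then((artifacts) =>
  console.log(`${artifacts.length} public artifacts tagged "early ideation"`)
);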


4.3 Privacy and Visibility in a Shared Space

Besides tensions around what types of feedback should be public, the shared nature of a Kaleidoscope studio space both supported and stymied group collaboration and communication, with tensions between wanting to have access to teammates' work and progress, and desiring privacy during individual creation. Teams developed personal structures for managing collaboration and project state, some relying on tools like Google Drive, and some on Kaleidoscope; many groups used a combination of multiple tools. Kaleidoscope's studio space was particularly beneficial to managing team state and communication, since it combined materials from many different sources along with design discussions: It [Kaleidoscope] keeps all of our work together and we can always refer to our studio. (S79 - Reflection Assignment) We documented every design we had. And we put almost all our design discussion in kaleidoscope. Whenever I need to look for something, I would first check kaleidoscope. (S32 - Reflection Assignment)

Students who didn’t use Kaleidoscope found it more challenging to maintain an awareness of the team’s state: It’s hard to measure progress because I think also people do things on their own and then they ported over [to a shared Google Doc] just like I did...So it’s hard to see how people are progressing and what they’re thinking or where they are in their parts of the project. (S117 - Midsemester Interview)

However, studio spaces created a tension between individual and team work, or private and public artifacts. Many students only wanted team members to see polished or completed artifacts, and held personal parts of the process back. If I think that someone else is going to see it, it often hinders my ability to be as honest about whatever my ideas are or thoughts are. (S47 - Midsemester Interview)

Since other group members could see them, artifacts in the studio space felt more "permanent" (S47 - Midsemester Interview), and the inability to edit them made them feel "set in stone" (S42 - Midsemester Interview). Kaleidoscope therefore failed to capture evidence of the design process that students felt was in-progress or individual. These student reactions led to many discussions within the research team about design decisions related to visibility in the tool. The original design had assumed that group members would be comfortable sharing artifacts among themselves, but would desire privacy from peers outside of their group. Yet even within groups, students felt pressure to share only polished work with each other. This undermined the goal of Kaleidoscope as a complete record of process; to make a more effective shared record of progress will require careful sensitivity to the balance between privacy and visibility even among group members. Visibility makes many types of learning possible – reflecting on complete histories of your own team's process, learning from other students, and providing instructional staff insight into how the students are learning so they can provide better instruction. Yet fear of judgment and criticism reduces how much people are willing to share in a visible space, even knowing the goals of a complete archive. Addressing this tension will require careful design choices.


One direction might extend the idea of low-fidelity versioning from [51], so that team members can see that certain artifacts have been created by other members, but not the details; another might create temporarily private sections of the studio so that individuals can work privately before sharing. However, resolving this tension will also require deeper investigation into the motivational and mindset aspects of why students are unwilling to share certain artifacts, and reshaping the social and team structures that cause fear of judgment or criticism.
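One way to make the low-fidelity-versioning direction above concrete is a graded per-artifact visibility level, where teammates can see that something exists without seeing its content until the author promotes it. The sketch below is a design illustration of that idea under our own assumptions; it is not a feature of the deployed system, and all names are hypothetical.

// Design sketch (not implemented in Kaleidoscope): graded visibility so that
// in-progress work can be recorded without feeling public or permanent.

type Visibility =
  | "private"      // only the author can open it
  | "placeholder"  // teammates see that an artifact exists (title only), not its content
  | "team"         // full content visible to the group and course staff
  | "class";       // public to other students for feedback and peer learning

interface DraftArtifact {
  id: string;
  authorId: string;
  title: string;
  content: string;
  visibility: Visibility;
}

// What a teammate (not the author) is allowed to see at each level.
function viewForTeammate(artifact: DraftArtifact) {
  switch (artifact.visibility) {
    case "private":
      return null; // not listed at all
    case "placeholder":
      return { id: artifact.id, title: artifact.title }; // low-fidelity trace only
    case "team":
    case "class":
      return artifact; // full content
  }
}

// Authors promote work when they are ready to share more of it.
function promote(artifact: DraftArtifact, to: Visibility): DraftArtifact {
  return { ...artifact, visibility: to };
}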

4.4 History Display Creates Sense of Achievement but also Overwhelms

The tile display of artifacts in the studio space allowed students to see their project at a high level. At the same time, the same design choice could be overwhelming as more and more artifacts were added to the space. Seeing artifacts collecting in the studio space helped make progress visible, and created a sense of achievement: I also saw that with Kaleidoscope, seeing at the very beginning, you have your artifacts that you created with your team ...and then you start innovating and as you kind of look back at the check in artifacts or the feedback that you get from people, you kind of see we’re making pretty good progress and we’ve come a long way from where we started. And that’s really cool. (S50 - End Semester Interview) [Kaleidoscope is] used to help document the iteration process, which can often be really empowering for teams. (Anon - Critique Session)

In contrast, the histories in tools like Google Docs are more hidden: edited or changed materials disappear unless explicitly sought out in a separate history tab, and can only be viewed one at a time. Students did not get the same satisfaction in progress or team awareness in tools like Google Docs as in the visible Kaleidoscope history: I think the fact that you can see an artifact is kind of like a accomplishment...versus a Google Doc or Google Slides [is] just a chunk of documents put together...It's kind of fulfilling and rewarding, you actually came a long way as a team. (S42 - Midsemester Interview) It's nice to be able to scroll through and see our project's journey. Some of these things I've since forgotten so I love the visual aspect of Kaleidoscope that allows me to easily refresh my memory. (S13 - Reflection Assignment)

Since creative design is an underspecified, complex task where it can be hard to see a path to "success" while embedded in the process, making effort and progress visible to students can be an essential part of motivating students and building a sense of self-efficacy and forward progress. But the visual layout was also a challenge, especially initially when the layout was automatically generated. Many students found it messy and overwhelming: I don't really like the Kaleidoscope interface. I find it to be very messy. (Anon - Critique Session)


When I first go into Kaleidoscope, I'm greeted by a wall of all my artifacts and that's a little bit overwhelming for me. (S117 - Midsemester Interview)

Some students preferred to store artifacts in Kaleidoscope, but organize their artifacts in less "messy" interfaces. One team used Kaleidoscope's Detail pages, where feedback and annotations were co-located with the artifacts, for discussion and archiving, then copied direct links to the Detail pages into a Google Doc, which they found easier to manage and search. A common request during the early part of the semester was for more organization abilities in Kaleidoscope, for example a folder structure, to sort artifacts into conceptual groups and hide artifacts that were deemed no longer relevant. The introduction of flexible and saveable layouts partially addressed this need, but especially for students who characterized themselves as particularly neat or organization-focused, the lack of structure drove them away from Kaleidoscope. To benefit from seeing the entire project at once, they also needed to hide artifacts.

4.5 Initial Perceptions and Incentives

As a research tool under active development during the course, Kaleidoscope was less stable and polished than commercial tools that students are used to working with. The research team kept a tight response cycle on addressing bugs, listening to student feedback, and incorporating new features; however, Kaleidoscope had some severe bugs during its deployment, including a case where feedback was overwritten in the database after being submitted. While this was rapidly fixed, it undermined student trust in the system, and that distrust persisted after the issue was resolved. Some students cited specific bugs or problems with the visual layout as reasons they used other tools rather than Kaleidoscope, or the general difficulty of using a less polished tool: There were moments where my project team and I thought about just dropping random thoughts/artifacts into our studio that made me realize how great [Kaleidoscope] could be as a collaborative tool. We never ended up doing so because it was just easier to do on Google Docs even if it was messier. (Anon - Critique Session)

Beyond practical issues with the system, a second challenge arose with student perceptions of the role of the tool. Check-ins were developed as a way to make assignment submissions easier; the reasoning was that if all the material is already in Kaleidoscope, picking specific artifacts to submit should be easier than exporting materials to assemble in another tool and then uploading that result to Canvas (UID's course management system). Moreover, check-ins on Kaleidoscope support easy sharing of artifacts for feedback and peer-learning, since artifacts in check-ins can be grouped together and made public by the instructors. However, the use of check-ins for assignments fostered an early perception that Kaleidoscope was a submission platform, rather than a tool for design work. Some student groups began to use Kaleidoscope only for submissions, importing artifacts only when they needed to submit a check-in. The combination of bugs and hard-to-use interface aspects, along with the perception of Kaleidoscope as a submission system, discouraged some students from interacting with it, even after the bugs and interface issues were fixed or improved. Once the early perceptions were established, they were hard to change.

I think those initial weeks really colored a lot of our perceptions of what Kaleidoscope was possible of, and because we had already found alternative ways to work by the time Kaleidoscope start addressing those issues, it was just harder to then switch back. (S126 - End Semester Interview)

Portfolios ended up being a highly successful feature at the end of the semester, where the motivation for having all artifacts and project history centrally available was clear and aligned with both course assessment requirements and students' intrinsic motivations for showcasing their work. UID was many students' first exposure to design; the first time through the design process, students did not realize or appreciate the value of an early sketch or idea until they wanted to include it in an assignment or final presentation. If we had to change the way we record information, I would put more materials into Kaleidoscope initially. (S18 - Reflection Assignment)

In introducing a research tool into a course setting, early student interactions should be carefully aligned with desired perceptions and uses of the tool. In our case, aligning with the tool's value to the design process and reflection should have preceded any assignment submissions with the platform.

5 DISCUSSION

Having explored themes of how students interacted with Kaleidoscope in Section 4, we now turn to discussing Kaleidoscope from a pedagogical perspective, in relation to research literature around education and expert practice.

5.1 Documentation Enables Explicit and Opportunistic Reflection on Process

In this research, we asked How can a documentation tool for user interface design make process visible to students and instructors for metacognitive reflection? Key learning goals of UID included learning to design, prototype, evaluate, and iterate on interfaces; these skills combine into an overall 'design process.' Reflection can provide students opportunities to consider successes and improvements to their process. Here we discuss the design choices of Kaleidoscope that enabled different types of reflection: explicit and opportunistic.

5.1.1 Explicit Reflection. Section 4 showed how Kaleidoscope enabled explicit reflection on process, both through direct assignments like the extra credit Reflection Assignment, and when putting together communicative documents like the final portfolios. Explicit reflection depends on specific time and context, where practitioners can enter a reflective rather than creative mindset: Lin et al. discuss how computer tools that deliberately bring a student from one learning environment to another can support explicit reflection [34]; like UID's Reflection Extra Credit, Roldan et al. used explicit reflection assignments at key points in the course [46]. Having a concrete medium to ground explicit reflection is important. Trying to reflect without grounding artifacts is susceptible to memory errors, such as focusing only on particularly memorable instances, or missing subtle details [46]. Our findings showed two primary design decisions that contributed to Kaleidoscope acting as a useful medium for design history: centralizing the history of design activities from multiple sources into a single location, and presenting these traces of history in a visual way that showcased multiple artifacts at once. Different types of artifacts can be used to ground reflection: Roldan et al. explore video [46], and Fleck et al. discuss how reflecting on different mediums, such as records of events, audio recordings, or sensor data, allows returning to forgotten topics or seeing from new points of view [20]. Kaleidoscope's approach could be combined with other mediums to expand opportunities for sensemaking.

5.1.2 Opportunistic Reflection. In addition to engaging with their Kaleidoscope histories to perform intentional metacognition on process, students also used Kaleidoscope to support their design work, such as referencing old iterations of artifacts, catching up on progress from team members, or sharing feedback. Kaleidoscope's display of the project history allowed these practical tasks to become moments of opportunistic reflection. Without setting aside time for explicit reflection, students were able to see how ideas evolved, identify the benefits of iteration, and recognize their own learning accomplishments through the progress they had made. In the context of expert practitioners, Sterman et al. showed that ambient display of history and revisiting artifacts opportunistically supports reflection on personal process and creative identity that can direct and inspire future work [51]. Kaleidoscope enabled opportunistic reflection that was otherwise difficult or impossible with existing tools, by surfacing older work with newer work, and presenting a view of the entire history as a starting point for design tasks. In our remote semester, digital spaces were the only shared spaces for teams; yet even during in-person semesters, courses like UID do not have permanent physical studio spaces for undergraduate teams. For courses where studio critiques and project workspaces are ephemeral due to constraints of classroom space and resources, the digital studio may continue to provide opportunistic reflection. Kaleidoscope's process documentation created a concrete medium for explicit and opportunistic reflection; concrete representations may benefit HCI design courses where students reflect on abstract learning goals like "design process".

5.2 Challenges to Integrating Documentation with the Design Process

Our second guiding research question asked How can a documentation tool directly support students' design process in collaborative interaction design projects? Besides enabling reflection, a successful documentation tool should help students do good work and learn process by doing. Kaleidoscope directly supported students in certain parts of their design processes, especially in tasks related directly to documentation: referencing old artifacts, group collaboration, and giving and receiving feedback. In this section, we discuss three challenges Kaleidoscope encountered in supporting design work.

5.2.1 Overwhelm and Sprawl. One challenge to effective process support was that students felt overwhelmed by the quantity and clutter of artifacts in their studios (Sec. 4.4). Chen et al. identified a similar theme in their probes of documentation behaviors in a design course, where "sprawl" made it difficult for students to find artifacts and records they needed among the vast quantity of documentation they had created [13]; Dalsgaard et al. identify choosing what to document and at what level of detail as a key challenge even for experts [17]. A first reaction to solving the problem of overwhelm might be to organize the artifacts better; however, choosing which artifacts to document is a more foundational issue that contributes to sprawl. Chen et al. discuss this as the "Cartographer's Dilemma" [13]. Much like Borges's point-for-point map [7], we saw this dilemma among our students as they tried to identify what changes would matter to them later (Sec. 4.1), and what would simply clutter the studio. Both too much and too little documentation resulted in frustration. Sterman et al. explore a similar issue in version control systems in creative practices, identifying low-fidelity versioning as one solution to the Cartographer's Dilemma when practitioners prioritize flexibility and spontaneity [51]. In the cases of lower-fidelity documentation, the choice to exclude detail is deliberate and carefully aligned with the practitioners' context. Simply capturing less detail does not solve overwhelm and sprawl. Identifying important changes and managing iterations is an important skill for students to learn; a documentation tool could help scaffold students towards recognizing and practicing this skill.

5.2.2 Mindsets and 'Mode Switching'. Besides the challenges of too many artifacts, some students also struggled with documenting enough artifacts. Section 4.1 explored how the labor of documentation disrupted creation, breaking students' flow [15]. To avoid breaking flow, students sacrificed documentation. The labor of creating documentation is a well-explored challenge: da Rocha et al. discuss the tension between interrupting flow to create documentation, and the necessity to document immediately after creation [23]; Dalsgaard et al. discuss how even in research, the time and effort needed to document can be at odds with design flow [17]. But viewing documentation purely as a negative interruption may not be the whole story. In prior work, we discuss the strategy of 'mode switching' in expert process, where practitioners move between tools and tasks to intentionally alter their mindsets and focus [40]. Several of Kaleidoscope's design choices position it as a tool for reflection, but not creation: it is a standalone documentation tool, where artifacts can be only minimally edited but are easily viewable in relation to each other. Therefore students are forced to 'mode switch' as they move between their creation tools (Figma, sketching, text editors, etc.) and their documentation tool (Kaleidoscope). In the student experiences reported in Section 4.1, we see that using Kaleidoscope for documentation can either be disruptive – when students are in a flow state of creating – or beneficial – when it allows students to step back and curate or revisit their documentation, either through portfolios, feedback, or reflection. In expert practice, mode switching is defined as an intentional, productive strategy. To maintain productive mode switching during curation and reflection stages, a tool like Kaleidoscope should provide a focused, standalone view into history separate from creation tools. This mode supports higher-effort curation tasks and explicit reflection. Yet to reduce unproductive context switching during creation, a tool like Kaleidoscope might benefit from low-effort, low-interruption recording of artifacts; if reflection occurs, it should be opportunistic rather than explicit. In 'mode switching', different mindsets are often enabled by different tools. In the next section, we consider a possible way to move towards productive mode switching for student documentation and reflection using ecosystems of tools, in combination with a discussion of Kaleidoscope's relationship to the learning goals of design process and teamwork.

5.2.3 Addressing Overwhelm and Mode Switching: Integrating with an Ecosystem of Tools. Students work on their design projects in many other tools; in UID, common tools included Figma, GitHub, physical paper, Google Docs, Google Slides, and more. Some of these tools capture their own histories internally, but do not make these histories accessible in a way that supports reflection [51]. Kaleidoscope was designed as another stand-alone web platform alongside these tools. Kaleidoscope integrates with some of the tools that students use in the course, importing Figma projects from a URL with a live thumbnail and automatically importing GitHub commits from linked projects, but the primary interaction paradigm is to manually upload individual artifacts to the platform. One effect of this design choice is that users must go to Kaleidoscope specifically to record an artifact. This forces the user to mode switch, entering a documentation tool and mindset, and requires an active choice to create a record of a moment in the design history. In order to address the paired issues of overwhelm and breaking flow through unwanted mode switching, we consider here how to integrate Kaleidoscope more effectively into the broader ecosystem of tools.

One envisionment might be to rethink Kaleidoscope as a wrapper around the tools in which students do the work of design and creation. By leveraging the internal histories of tools like Figma or Google Docs, a reflective documentation tool could pull in artifacts that reflect particular points of change automatically, much as we did with GitHub commits. Students would not have to leave their design tools to make an artifact; perhaps they could even mark or annotate particular key artifacts from the tool in which they were created. Small changes would be kept within the tool's history, while important changes could be surfaced in Kaleidoscope, combating sprawl and overwhelm while still tracking the full history. Artifacts could be promoted to Kaleidoscope if they turn out to be important, and demoted to the original tool if the team decides they are insignificant. In such a design, Kaleidoscope would link back to the source tool from each artifact, allowing easy transitions between tools for creation and tools for reflection in order to continue to enable opportunistic reflection during design activities such as catching up on teammates' progress or revisiting old iterations. Explicit reflection would be supported within Kaleidoscope, where drawing multimedia histories together from multiple tools would continue to enable reflection across the entire design process in a visual manner. Integration with creation tools may also address privacy and visibility, allowing teammates to capture histories of in-progress work in the source tool without feeling that it has been made public or permanent to the broader team until they are ready.

The histories currently supported by individual digital tools are comprehensive changelogs, but are siloed and intended more for error reversion than for reflection or other aspects of process [51]. From our experience with a standalone version of Kaleidoscope, we have seen the benefits of history tools that bring together multiple mediums from the design process, and present them in an accessible format for reflection. The next iteration of reflective documentation tools may combine the benefits of both these approaches to reduce overwhelm and better support reflection. Documentation tools for reflection may benefit from integration with the broader ecosystem of tools, such that the labor of documentation does not interrupt a student's creative flow, and the labor of curation integrates with reflection.
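As one concrete but hypothetical envisionment of the automatic import idea discussed in this section, a small webhook handler could turn GitHub push events into candidate artifacts that a team can later promote or demote. The endpoint path, the candidate store, and the createCandidateArtifact helper are illustrative assumptions, not part of the deployed system.

// Hypothetical sketch: automatically importing GitHub commits as candidate
// artifacts, so documentation does not interrupt creative flow.
import express from "express";

const app = express();
app.use(express.json());

interface CandidateArtifact {
  kind: "github-commit";
  title: string;
  url: string;
  createdAt: string;
  promoted: boolean; // teams promote important changes into the studio view
}

const candidates: CandidateArtifact[] = [];

function createCandidateArtifact(artifact: CandidateArtifact) {
  // In a real integration this would write to the documentation tool's database.
  candidates.push(artifact);
}

// GitHub push webhooks deliver a JSON payload listing the pushed commits.
app.post("/webhooks/github", (req: express.Request, res: express.Response) => {
  const commits: Array<{ id: string; message: string; timestamp: string; url: string }> =
    req.body.commits ?? [];
  for (const commit of commits) {
    createCandidateArtifact({
      kind: "github-commit",
      title: commit.message.split("\n")[0], // first line of the commit message
      url: commit.url,
      createdAt: commit.timestamp,
      promoted: false, // stays in the background until the team surfaces it
    });
  }
  res.status(204).end();
});

app.listen(3000);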

5.3 Incentives and Motivations for Documentation

Chen et al. proposed an open question: "To what degree do these documentation practices carry on into professional practice in creative fields once the academic requirements of documentation processes are removed?" [13] Our third research question addressed the inverse question: How can strategies of expert process be incorporated into tools for student learning? In this discussion of Kaleidoscope's deployment, we must also ask To what degree is it appropriate for documentation practices from professional practice to be brought into the academic context? The design principles that guided Kaleidoscope drew strongly from expert practice [29, 40, 47, 51], yet the extrinsic motivators of the academic environment created challenges for applying intrinsically motivated expert practice in that setting.

In a course setting like UID, some motivations for documentation overlap with expert practice, while others are specific to the educational context. Documentation can contribute to internal project process: like experts, students document to communicate with their teammates, to structure their own workflows and design cycles, and to perform metacognitive reflection. Documentation can also be for external consumption: while professionals might document for clients or public dissemination, students must document in order to submit assignments, receive grades, or create final presentations and portfolios. Ideally these goals would balance in the learning context to support both learning goals of external communication and internal process. However, this research revealed multiple ways these motivations can work against each other. Amabile has shown dampening effects of extrinsic motivation on creativity [5]. In the academic context of our institution, as in others, students are constantly under pressure to complete the next assignment or take the next class, with little institutional support for reflection or returning to old work. This leads to a gap between process in courses and process in expert or personal practice. For example, one student expressed different mindsets around maintaining history between personal and course projects: For my personal projects [and] research I'm a little bit more cognizant of keeping things organized... so that if I'm stuck or if I don't know where to go in my research, I can just go back into those archives and try to spark something or remember what I did. But for group projects because it's more of like getting them [done] quick, and it sometimes may not apply to my own interests, I take less care to keep those things organized. (S117 - Midsemester Interview)

In Section 4.5, we discussed students' perceptions of Kaleidoscope as a submission platform, instead of a tool for design. Chen et al. note a similar effect, where the implicit and explicit expectations of the course setting shaped students' behaviors and perceptions of documentation as primarily an external requirement for communicating with instructors and peers [13]. With Kaleidoscope, students often became aware of the benefits of documentation as a medium for reflection only later in the course, when they had to manage a larger project, perform explicit reflection, and communicate with teammates and external audiences (Sec. 4.1, 4.5). Students did benefit from expert design strategies, such as opportunistic reflection, identifying progress, and enhancing collaborative discussion, which suggests that there is value in integrating documentation strategies from expert practice into educational tools. But tools alone cannot change behavior without support from the broader course structure and environment. How might we restructure the incentives and implicit and explicit expectations of the academic environment to help students practice the multifaceted purposes of documentation? One small step might be making metacognition an explicit learning goal within the course, to align student effort with extrinsic motivators, leading to earlier buy-in from students on the value of documentation and reflection.

6 LIMITATIONS

The development and deployment of Kaleidoscope in a course over four months allowed us to collect real-world user data and learn from student needs as we iteratively designed the tool. However, the time pressures of the semester and the requirement to support particular course needs also limited the features we could release, and meant students encountered bugs with the system. This made interactions with Kaleidoscope less fluid than with stable commercial tools. Among the features we could not prioritize were techniques for interactive critique. Critique is essential to a studio course like UID, and while Kaleidoscope artifacts could be used in a critique session, we did not implement specific features beyond basic feedback interactions. Both studio critique and feedback are complex domains of their own; the extensive research in these areas might be combined with our work on documentation fruitfully in the future.

Similarly, implementation of logging was constrained by the challenges of parallel development and deployment. Logging was added to features at different times across the semester, primarily to support system debugging and resolve student issues. Therefore we are unable to analyze quantitative metrics for interaction logs. Such data sources may provide additional insight in the future, but were out of scope for this research.

Kaleidoscope was designed and deployed specifically for UID, an interaction design course, and our design decisions are inextricable from that context. For example, Kaleidoscope primarily supports visual and interaction design. Physical artifacts are incorporated through photographs or other digital representations; however, other forms of design which focus more fully on other types of artifacts or processes might require alternate design decisions. Similarly, we focused on individual students, groups of students, and instructors as key stakeholders in privacy and collaboration considerations in UID. However, projects in other design courses might include sensitive data, such as interview data or photographs, or materials generated during co-design sessions. Expanding Kaleidoscope to support these aspects of the design process would require additional consideration of participant privacy in data access and representation.

7 FUTURE WORK

Future work might explore how documentation tools like Kaleidoscope can more explicitly support reflection on process, for example, creating visualizations of team interaction and artifact creation patterns, or integrating reflection prompts in the tool. Kaleidoscope drew from the strengths of existing tools by interfacing with Figma, GitHub, and YouTube, but also competed with these tools for student time, effort, and attention; we might consider how to lower the amount of effort needed to document work, either by further integration with existing tools, or pursuing documentation layers within an ecosystem of tools rather than as separate platforms. While Kaleidoscope was deployed during a fully remote semester, it may be fruitful to explore how to document and reveal process during hybrid or in-person courses as well, integrating a tool like Kaleidoscope into in-person activities, or pursuing hybrid-specific tool designs. Future work should also address student buy-in to reflection and metacognition; we might investigate when it is appropriate to introduce discussions of meta-concepts around process to students, and how to align the incentive structures of the educational context with reflection. Tool design can only go so far in the educational context; assessments, motivation, and learning goals must all be aligned to support desired behaviors.

8 CONCLUSION

In this paper, we presented Kaleidoscope, a documentation system for design process. We deployed Kaleidoscope in an upper-level undergraduate user interface design course during a remote semester. Kaleidoscope displays artifacts generated during the design process in a virtual studio space, providing a shared repository for project teams to collect their work, document and annotate their progress, and receive feedback from peers and instructors. We report data from a variety of surveys, critique sessions, discussions, and interviews with students and course staff to understand how a documentation tool like Kaleidoscope can support collaboration, metacognition, making progress visible, high-level views of project histories, and personalization of a remote studio environment. We discuss successes and challenges encountered by students and researchers, and how these insights might support HCI educators building tools for teaching design process. We envision that the lessons learned from Kaleidoscope may support a future of design tools which holistically understand the design process wherever it happens; support student learning, sharing, and metacognition; and make creative process visible for discussion, critique, and intentional modification.

ACKNOWLEDGMENTS

We thank the teaching staff of Berkeley's Fall 2020 User Interface Design and Development course for their contributions and support, and the students for making this research possible by engaging with a new technology during the disruption of COVID-19. This research was supported in part by the Berkeley Changemaker Technology Innovation Grants, and by the National Science Foundation Graduate Research Fellowship under Grant No. DGE 1752814.


REFERENCES

[1] 2022. Canvas. https://www.instructure.com/canvas/try-canvas.
[2] 2022. Figma. https://www.figma.com.
[3] 2022. Miro. http://www.miro.com.
[4] 2022. Piazza. https://piazza.com/.
[5] Teresa M Amabile. 2018. Creativity in context: Update to the social psychology of creativity. Routledge.
[6] Gabrielle Benabdallah, Sam Bourgault, Nadya Peek, and Jennifer Jacobs. 2021. Remote Learners, Home Makers: How Digital Fabrication Was Taught Online During a Pandemic. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI '21). Association for Computing Machinery, New York, NY, USA, Article 350, 14 pages. https://doi.org/10.1145/3411764.3445450
[7] Jorge Luis Borges. 1999. On exactitude in science. In Collected Fictions. Penguin, New York. Translated by Andrew Hurley.
[8] David Boud and Elizabeth Molloy. 2013. Rethinking models of feedback for learning: the challenge of design. Assessment & Evaluation in Higher Education 38, 6 (2013), 698–712.
[9] Virginia Braun and Victoria Clarke. 2021. Can I use TA? Should I use TA? Should I not use TA? Comparing reflexive thematic analysis and other pattern-based qualitative analytic approaches. Counselling and Psychotherapy Research 21, 1 (2021), 37–47.
[10] Virginia Braun and Victoria Clarke. 2021. To saturate or not to saturate? Questioning data saturation as a useful concept for thematic analysis and sample-size rationales. Qualitative Research in Sport, Exercise and Health 13, 2 (2021), 201–216.
[11] Kathy Charmaz. 2006. Constructing grounded theory: A practical guide through qualitative analysis. Sage.
[12] Kathy Charmaz and Linda Liska Belgrave. 2007. Grounded theory. The Blackwell Encyclopedia of Sociology (2007).
[13] Ricky Chen, Mychajlo Demko, Daragh Byrne, and Marti Louw. 2021. Probing Documentation Practices: Reflecting on Students' Conceptions, Values, and Experiences with Documentation in Creative Inquiry. In Creativity and Cognition (Virtual Event, Italy) (C&C '21). Association for Computing Machinery, New York, NY, USA, Article 32, 14 pages. https://doi.org/10.1145/3450741.3465391
[14] Allan Collins and John Seely Brown. 1988. The Computer as a Tool for Learning Through Reflection. In Learning Issues for Intelligent Tutoring Systems, Heinz Mandl and Alan Lesgold (Eds.). Springer US, New York, NY, 1–18. https://doi.org/10.1007/978-1-4684-6350-7_1
[15] Mihaly Csikszentmihalyi. 1990. Flow: The psychology of optimal experience. Vol. 1990. Harper & Row, New York.
[16] Peter Dalsgaard. 2017. Instruments of inquiry: Understanding the nature and role of tools in design. International Journal of Design 11, 1 (2017).
[17] Peter Dalsgaard and Kim Halskov. 2012. Reflective Design Documentation. In Proceedings of the Designing Interactive Systems Conference (Newcastle Upon Tyne, United Kingdom) (DIS '12). Association for Computing Machinery, New York, NY, USA, 428–437. https://doi.org/10.1145/2317956.2318020
[18] Carol S Dweck. 2008. Mindset: The new psychology of success. Random House Digital, Inc.
[19] Daniel Fallman. 2003. Design-oriented human-computer interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 225–232. https://doi.org/10.1145/642611.642652
[20] Rowanne Fleck and Geraldine Fitzpatrick. 2010. Reflecting on reflection: framing a design landscape. In Proceedings of the 22nd Conference of the Computer-Human Interaction Special Interest Group of Australia on Computer-Human Interaction. 216–223.
[21] Bill Gaver and John Bowers. 2012. Annotated Portfolios. Interactions 19, 4 (2012), 40–49. https://doi.org/10.1145/2212877.2212889
[22] William Gaver. 2012. What should we expect from research through design?. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 937–946.
[23] Bruna Goveia da Rocha, Janne Spork, and Kristina Andersen. 2022. Making Matters: Samples and Documentation in Digital Craftsmanship. In Sixteenth International Conference on Tangible, Embedded, and Embodied Interaction (Daejeon, Republic of Korea) (TEI '22). Association for Computing Machinery, New York, NY, USA, Article 37, 10 pages. https://doi.org/10.1145/3490149.3502261
[24] Gillian R Hayes. 2014. Knowing by doing: action research as an approach to HCI. In Ways of Knowing in HCI. Springer, 49–68.
[25] Jim Hollan and Scott Stornetta. 1992. Beyond Being There. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Monterey, California, USA) (CHI '92). Association for Computing Machinery, New York, NY, USA, 119–125. https://doi.org/10.1145/142750.142769
[26] Mary Beth Kery, Amber Horvath, and Brad Myers. 2017. Variolite: Supporting Exploratory Programming by Data Scientists. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI '17). Association for Computing Machinery, New York, NY, USA, 1265–1276. https://doi.org/10.1145/3025453.3025626
[27] Anna Keune, Naomi Thompson, Kylie Peppler, and Stephanie Chang. 2017. "My portfolio helps my making": Motivations and mechanisms for documenting creative projects. In Young & Creative: Digital Technologies Empowering Children in Everyday Life, Ilana Eleá and Lothar Mikos (Eds.). Nordicom, University of Gothenburg, Chapter 12.
[28] Joy Kim, Maneesh Agrawala, and Michael S. Bernstein. 2017. Mosaic: Designing Online Creative Communities for Sharing Works-in-Progress. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (Portland, Oregon, USA) (CSCW '17). Association for Computing Machinery, New York, NY, USA, 246–258. https://doi.org/10.1145/2998181.2998195
[29] Scott R. Klemmer, Björn Hartmann, and Leila Takayama. 2006. How Bodies Matter: Five Themes for Interaction Design. In Proceedings of the 6th Conference on Designing Interactive Systems (University Park, PA, USA) (DIS '06). Association for Computing Machinery, New York, NY, USA, 140–149. https://doi.org/10.1145/1142405.1142429
[30] Scott R. Klemmer, Michael Thomsen, Ethan Phelps-Goodman, Robert Lee, and James A. Landay. 2002. Where Do Web Sites Come from?: Capturing and Interacting with Design History. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Minneapolis, Minnesota, USA) (CHI '02). ACM, New York, NY, USA, 1–8. https://doi.org/10.1145/503376.503378
[31] Panayiotis Koutsabasis and Spyros Vosinakis. 2012. Rethinking HCI education for design: problem-based learning and virtual worlds at an HCI design studio. International Journal of Human-Computer Interaction 28, 8 (2012), 485–499.
[32] Chinmay E. Kulkarni, Michael S. Bernstein, and Scott R. Klemmer. 2015. PeerStudio: Rapid Peer Feedback Emphasizes Revision and Improves Performance. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale (Vancouver, BC, Canada) (L@S '15). Association for Computing Machinery, New York, NY, USA, 75–84. https://doi.org/10.1145/2724660.2724670
[33] Bruno Latour. 1994. On technical mediation. Common Knowledge 3, 2 (1994).
[34] Xiaodong Lin, Cindy E. Hmelo, Charles K. Kinzer, and Teresa Secules. 1999. Designing technology to support reflection. Educational Technology Research and Development 47 (1999), 43–62.
[35] Nic Lupfer. 2018. Multiscale Curation: Supporting Collaborative Design and Ideation. In Proceedings of the 2018 ACM Conference Companion Publication on Designing Interactive Systems (Hong Kong, China) (DIS '18 Companion). Association for Computing Machinery, New York, NY, USA, 351–354. https://doi.org/10.1145/3197391.3205380
[36] Nic Lupfer, Andruid Kerne, Rhema Linder, Hannah Fowler, Vijay Rajanna, Matthew Carrasco, and Alyssa Valdez. 2019. Multiscale Design Curation: Supporting Computer Science Students' Iterative and Reflective Creative Processes. In Proceedings of the 2019 on Creativity and Cognition (San Diego, CA, USA) (C&C '19). Association for Computing Machinery, New York, NY, USA, 233–245. https://doi.org/10.1145/3325480.3325483
[37] Julia M. Markel and Philip J. Guo. 2020. Designing the Future of Experiential Learning Environments for a Post-COVID World: A Preliminary Case Study. (2020). https://www.microsoft.com/en-us/research/publication/designing-the-future-of-experiential-learning-environments-for-a-post-covid-world-a-preliminary-case-study/
[38] Emma Mercier, Shelley Goldman, and Angela Booker. 2006. Collaborating to Learn, Learning to Collaborate: Finding the Balance in a Cross-Disciplinary Design Course.
In Proceedings of the 7th International Conference on Learning Sciences (Bloomington, Indiana) (ICLS ’06). International Society of the Learning Sciences, 467–473. [39] Troy Nachtigall, Daniel Tetteroo, and Panos Markopoulos. 2018. A Five-Year Review of Methods, Purposes and Domains of the International Symposium on Wearable Computing. In Proceedings of the 2018 ACM International Symposium on Wearable Computers (Singapore, Singapore) (ISWC ’18). Association for Computing Machinery, New York, NY, USA, 48–55. https://doi.org/10.1145/3267242.3267272 [40] Molly Jane Nicholas, Sarah Sterman, and Eric Paulos. 2022. Creative and Motivational Strategies Used by Expert Creative Practitioners. In Proceedings of the 2022 Conference on Creativity and Cognition. https://doi.org/10.1145/3527927.3532870 [41] Vanessa Oguamanam, Taneisha Lee, Tom McKlin, Zane Cochran, Gregory Abowd, and Betsy DiSalvo. 2020. Cultural Clash: Exploring How Studio-Based Pedagogy Impacts Learning for Students in HCI Classrooms. In Proceedings of the 2020 ACM Designing Interactive Systems Conference (Eindhoven, Netherlands) (DIS ’20). Association for Computing Machinery, New York, NY, USA, 1131–1142. https://doi.org/10.1145/3357236.3395544 [42] Nadya Peek, Jennifer Jacobs, Wendy Ju, Neil Gershenfeld, and Tom Igoe. 2021. Making at a Distance: Teaching Hands-on Courses During the Pandemic. Association for Computing Machinery, New York, NY, USA. [43] Sarah Quinton and Teresa Smallbone. 2010. Feeding forward: using feedback to promote student refection and learning–a teaching model. Innovations in Education and Teaching International 47, 1 (2010), 125–135. [44] Yolanda Jacobs Reimer and Sarah A Douglas. 2003. Teaching HCI design with the studio approach. Computer science education 13, 3 (2003), 191–205. [45] Kathryn Rivard and Haakon Faste. 2012. How Learning Works in Design Education: Educating for Creative Awareness through Formative Refexivity. In Proceedings of the Designing Interactive Systems Conference (Newcastle Upon

Kaleidoscope: A Reflective Documentation Tool for a User Interface Design Course

[46]

[47]

[48] [49] [50] [51] [52]

Tyne, United Kingdom) (DIS ’12). Association for Computing Machinery, New York, NY, USA, 298–307. https://doi.org/10.1145/2317956.2318002 Wendy Roldan, Ziyue Li, Xin Gao, Sarah Kay Strickler, Allison Marie Hishikawa, Jon E. Froehlich, and Jason Yip. 2021. Pedagogical Strategies for Refection in Project-based HCI Education with End Users. In Designing Interactive Systems Conference 2021. 1846–1860. https://doi.org/10.1145/3461778.3462113 Moushumi Sharmin, Brian P. Bailey, Cole Coats, and Kevin Hamilton. 2009. Understanding Knowledge Management Practices for Early Design Activity and Its Implications for Reuse. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Boston, MA, USA) (CHI ’09). Association for Computing Machinery, New York, NY, USA, 2367–2376. https://doi.org/10.1145/ 1518701.1519064 Katie A Siek, Gillian R Hayes, Mark W Newman, and John C Tang. 2014. Field deployments: Knowing from using in context. In Ways of Knowing in HCI. Springer, 119–142. Pieter Jan Stappers and Elisa Giaccardi. 2017. Research through design. The encyclopedia of human-computer interaction (2017), 1–94. Sarah Sterman. 2022. Process-Sensitive Creativity Support Tools. Ph. D. Dissertation. EECS Department, University of California, Berkeley. http://www2.eecs.berkeley. edu/Pubs/TechRpts/2022/EECS-2022-207.html Sarah Sterman, Molly Jane Nicholas, and Eric Paulos. 2022. Towards Creative Version Control. In Proceedings of the 2022 ACM Conference on Computer Supported Cooperative Work and Social Computing. David Tinapple, Loren Olson, and John Sadauskas. 2013. CritViz: Web-based software supporting peer critique in large creative classrooms. Bulletin of the IEEE Technical Committee on Learning Technology 15, 1 (2013), 29.

CHI ’23, April 23–28, 2023, Hamburg, Germany

[53] Cesar Torres, Sarah Sterman, Molly Nicholas, Richard Lin, Eric Pai, and Eric Paulos. 2018. Guardians of Practice: A Contextual Inquiry of Failure-Mitigation Strategies within Creative Practices. In Proceedings of the 2018 Designing Interactive Systems Conference (Hong Kong, China) (DIS ’18). Association for Computing Machinery, New York, NY, USA, 1259–1267. https://doi.org/10.1145/3196709.3196795 [54] Mihaela Vorvoreanu, Colin M. Gray, Paul Parsons, and Nancy Rasche. 2017. Advancing UX Education: A Model for Integrated Studio Pedagogy. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI ’17). Association for Computing Machinery, New York, NY, USA, 1441–1446. https://doi.org/10.1145/3025453.3025726 [55] Lauren Wilcox, Betsy DiSalvo, Dick Henneman, and Qiaosi Wang. 2019. Design in the HCI Classroom: Setting a Research Agenda. In Proceedings of the 2019 on Designing Interactive Systems Conference (San Diego, CA, USA) (DIS ’19). Association for Computing Machinery, New York, NY, USA, 871–883. https: //doi.org/10.1145/3322276.3322381 [56] Lisa Yan, Annie Hu, and Chris Piech. 2019. Pensieve: Feedback on Coding Process for Novices. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education (Minneapolis, MN, USA) (SIGCSE ’19). Association for Computing Machinery, New York, NY, USA, 253–259. https://doi.org/10. 1145/3287324.3287483 [57] John Zimmerman and Jodi Forlizzi. 2014. Research through design in HCI. In Ways of Knowing in HCI. Springer, 167–189. [58] John Zimmerman, Jodi Forlizzi, and Shelley Evenson. 2007. Research through Design as a Method for Interaction Design Research in HCI. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (San Jose, California, USA) (CHI ’07). Association for Computing Machinery, New York, NY, USA, 493–502. https://doi.org/10.1145/1240624.1240704

Deceptive Design Patterns in Safety Technologies: A Case Study of the Citizen App

Ishita Chordia

The Information School, University of Washington Seattle, USA [email protected]

Emily Parrish

The Information School, University of Washington Seattle, USA [email protected]

Lena-Phuong Tran

Human-Centered Design and Engineering, University of Washington Seattle, USA [email protected]

Sheena Erete

College of Information Studies, The University of Maryland College Park, USA [email protected]

Tala Tayebi

The Information School, University of Washington Seattle, USA [email protected]

Jason Yip

The Information School, University of Washington Seattle, USA [email protected]

Alexis Hiniker

The Information School, University of Washington Seattle, USA [email protected]

ABSTRACT

Deceptive design patterns (known as dark patterns) are interface characteristics which modify users’ choice architecture to gain users’ attention, data, and money. Deceptive design patterns have yet to be documented in safety technologies despite evidence that designers of safety technologies make decisions that can powerfully influence user behavior. To address this gap, we conduct a case study of the Citizen app, a commercially available technology which notifies users about local safety incidents. We bound our study to Atlanta and triangulate interview data with an analysis of the user interface. Our results indicate that Citizen heightens users’ anxiety about safety while encouraging the use of profit-generating features which offer security. These findings contribute to an emerging conversation about how deceptive design patterns interact with sociocultural factors to produce deceptive infrastructure. We propose the need to expand an existing taxonomy of harm to include emotional load and social injustice and offer recommendations for designers interested in dismantling the deceptive infrastructure of safety technologies.

KEYWORDS

dark patterns, safety, dark infrastructure, manipulative design, deceptive design, crime, community safety, fear, anxiety, safety technologies

CCS CONCEPTS

• Human-centered computing → Empirical studies in HCI.

This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License. CHI ’23, April 23–28, 2023, Hamburg, Germany © 2023 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-9421-5/23/04. https://doi.org/10.1145/3544548.3581258

ACM Reference Format: Ishita Chordia, Lena-Phuong Tran, Tala Tayebi, Emily Parrish, Sheena Erete, Jason Yip, and Alexis Hiniker. 2023. Deceptive Design Patterns in Safety Technologies: A Case Study of the Citizen App. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3544548.3581258

1 INTRODUCTION

In 2010, Brignull coined the term “dark patterns” to describe how interface design can be used to undermine user agency and manipulate user decision-making [18]. He describes dark patterns as “tricks used in websites and apps that make you do things that you didn’t mean to, like buying or signing up for something” [18]. Examples of dark patterns originally identified by Brignull include "Sneak Into Basket," when a site sneaks additional items into the shopping cart without your consent, and "Disguised Ads," where advertisements are disguised as content or navigation. Since 2010, research on dark patterns has grown substantially and has evolved to include both explicitly manipulative "tricks" and lighter nudges which, at scale, can cause harm to both users and society [75]. Recent literature has identified dark patterns in e-commerce [74], digital privacy [20, 32], social media [73, 79], ubiquitous computing [50], robotics [64], and gaming [111] domains. The terminology has also evolved [18, 87]; in the remainder of the paper, we refer to dark patterns as “deceptive design patterns” to avoid equating the racialized term “dark” with problematic behavior.


Understanding deceptive design patterns and how they operate in different contexts is increasingly important, as deceptive design patterns are now pervasive and, for example, employed in the vast majority (95%) of apps on the Google Play Store [37]. Researchers, however, have yet to document the existence of deceptive design patterns in safety technologies. “Safety technology” refers to any digital technologies used for the purpose of increasing user safety. Designers of commercial safety technologies make decisions that powerfully influence users’ behavior [44]. Prior literature, for example, has documented how safety technologies can influence users’ levels of civic engagement [41, 44], their interactions with other members of their communities [44, 69, 70], the social norms of the neighborhood [63, 90], and individuals’ feelings of safety [14, 58]. Safety technologies can also impact individuals who are not users of those technologies by contributing to racial profiling [44, 70] and online racism [109]. Given that designers of safety technologies make decisions that can have consequences for both users and non-users of these technologies and can shape both online and offline behavior, we were curious about how these decisions may be influenced by profit motives.

We conduct a case study [76] of the Citizen app, a commercially available location-based crime alert technology that notifies users about local incidents related to public safety. We interview fifteen users of the Citizen app who live in Atlanta, a racially diverse mid-sized city in the Southern portion of the United States. To understand how deceptive design patterns influence the user experience, we triangulate the interview data with an interface analysis of the app. We ask:

• RQ1: How, if at all, does the design of the Citizen interface reflect known deceptive design patterns?
• RQ2: How do these designs affect the user experience?

We find that Citizen employs a collection of user interface elements that together raise the salience of safety incidents, emphasizing the extent to which reported incidents pose a threat to the user. The app further presents itself as a solution to danger, leveraging a collection of common deceptive design patterns to exert purchase pressure on the user and encourage data disclosure. Participants’ experiences aligned with this feature analysis. They voiced an appreciation for receiving hyper-local, real-time safety information that helped them navigate risk, but many also reported that the app’s information-sharing practices increased fear and encouraged dependence on the app. Furthermore, users explained that Citizen influenced their offline behavior, including the neighborhoods they visited and their interactions with Black and unhoused individuals perceived to be dangerous.

Deceptive infrastructure (sometimes known as dark infrastructure) refers to the interactions between deceptive design patterns and larger social, psychological, and cultural factors that together undermine user agency at scale [108]. Our study contributes to an emerging conversation on deceptive infrastructure by demonstrating how deceptive design patterns, human biases, and sociocultural contexts interact to produce harm for both users and non-users of the Citizen app. Deceptive design patterns interact with attentional bias to create anxiety for users and interact with negative cultural stereotypes to disproportionately harm vulnerable and historically


marginalized populations. In light of these results, we identify emotional load and social injustice as two forms of harm perpetuated by deceptive design patterns that are yet to be documented [75]. We additionally offer four concrete suggestions to designers of safety technologies who are interested in dismantling the deceptive infrastructure produced by existing safety technologies.

2 RELATED WORK

2.1 Deceptive Design Patterns in HCI

A recent review of the literature on deceptive design patterns in HCI finds that while there are many different definitions, a uniting characteristic is that deceptive design patterns all “modify the underlying choice architecture for users” [75, p.9]. Deceptive design patterns grew out of manipulative practices in retail, research on persuasive design, and digital marketing as a way for companies to gain users’ attention, data, and money [80]. Deceptive design patterns use language, emotion, color, style, and cognitive biases to undermine user agency [74, 75]. They are pervasive on online platforms and have been documented in the vast majority (95%) of apps on the Google Play Store [37], including e-commerce [74], gaming [111], and social media platforms [73, 79]. Examples of common deceptive design patterns include “Infinite Scrolling” [79], where new content automatically loads as users scroll the page, and “Hard to Cancel” subscriptions [74].

Deceptive design patterns can be highly effective in manipulating user behavior [72, 83]. Prior research has found that American consumers are twice as likely to sign up for a premium theft protection service when presented with a mild deceptive design pattern and four times as likely to sign up when presented with an aggressive deceptive design pattern compared to users who are shown a neutral interface [83]. Calo and Rosenblat argue that digital technologies are uniquely effective at influencing user behavior because of their ability to capture and store information about users, their ability to architect virtually every aspect of the platforms, and their ability to translate insight about user behavior into design [21, 22]. Furthermore, the emergence of online markets, such as digital sharing economies, presents new opportunities for companies to manipulate users by modifying the choice architecture of both sellers (e.g. Uber drivers) and buyers (e.g. riders) [21, 22].

Deceptive design patterns can diminish user wellbeing through financial loss, invasion of privacy, and cognitive burdens [75]. Schull and others have found that social media platforms employ addictive deceptive design patterns, such as infinite scroll or Youtube’s autoplay, that rely on a variable reward that mimics strategies used by the gambling industry [67, 97], and prior work has even documented the prevalence of deceptive design patterns in mobile applications for children [87]. The impact of deceptive design patterns, however, is not limited to individual users. Mathur and colleagues discuss the potential for deceptive design patterns to also impact collective welfare, by decreasing trust in the marketplace and by contributing to unanticipated societal consequences [75]. They point to Cambridge Analytica’s use of personal data, collected with the help of deceptive design patterns on Facebook, to influence the 2016 U.S. presidential election as an example.

Given that deceptive design patterns are effective at manipulating user behavior and can negatively impact both individual and


collective welfare, it is important to understand how they operate in different contexts. In the present study, we investigate the incidence and influence of deceptive design patterns in safety technologies, filling a gap in the field.

2.2 Safety Technologies in HCI

There is a rich body of work on safety technologies in HCI that spans more than a decade. Much of the early literature sought to design technologies to reduce individuals’ risk of victimization [14, 58, 99, 103]. This work was influenced by victimization theory from criminology, which views victims and offenders as rational actors who use the information they have to assess their risk of being victimized or caught, respectively [68]. Digital technologies inspired by this perspective sought to provide users with information that would lower their chance of victimization. For example, Blom and colleagues designed a mobile application that allowed women to view and label spaces as “safe” or “unsafe” [14], and Shah prototyped CrowdSafe, which shared location-based crime information and traffic navigation guidance with users [99]. These technologies focused on decreasing individuals’ risk.

In contrast to the victimization theory, the social control theory focused on the community and the informal and formal controls in place to deter crime [68, 92]. Digital technologies drawing from this theory emphasized the importance of not only sharing information with individuals, but also supporting community engagement, collaboration, and problem-solving [58, 69]. Researchers in HCI studied neighborhood listservs [41, 44, 69] and social media [54, 56, 90, 91, 109, 112] to understand how to increase collaboration between citizens and local authorities [91, 112], encourage civic engagement [41, 44], support user engagement and information sharing [19, 58, 69], and decrease individuals’ fear of crime [14, 15, 58].

The most recent work examining safety technologies in HCI has investigated the potential for safety technologies to perpetuate harm against historically marginalized populations. For example, empirical work studying online communication on local neighborhood listservs and Nextdoor finds that these platforms serve as spaces for online negotiations of “suspicious behavior” that can lead to increased policing and surveillance of people of color [63, 70, 71, 78]. On Reddit, ambiguous and passive policies towards racist comments that are focused more on protecting Reddit’s image and user engagement lead to both new and old racism in discussions of safety [109]. Researchers have documented similar patterns of racism, policing, and surveillance on other apps where users organize around and discuss community safety, such as WhatsApp [78] and Amazon Neighbors [17]. A study analyzing product reviews and promotional material of Citizen, Nextdoor, and bSafe [59] found that companies encourage users to surveil members of their communities, leading users to express fear and racist beliefs. Collectively, this research suggests a need to investigate the role that design plays in perpetuating harm against historically marginalized populations. Sociologist Rahim Kurwa explains that such work is critical because surveillance and policing "relies[rely] on de-racialized governing narratives of safety that nevertheless have racist implementation and results” [63, pg.114].


Designers of safety technologies make decisions that shape users’ individual and collective behavior. Furthermore, these technologies can have harmful and far-reaching consequences. By studying deceptive design patterns, we can begin to understand the factors that motivate these influential design decisions.

3 CASE STUDY DESIGN

We employed a case study method [76] to understand how deceptive design patterns influence the user experience of safety applications. We investigated a single case, the Citizen app, and bound our study to Atlanta users and their experience with the app from 2021 to 2022. This work is a case study because of our in-depth, holistic description and analysis of a bounded phenomenon [76]. For the single case to have power, the selection of the case needs to be strategic [47]. We selected Citizen because we see it as an extreme case [110]. Citizen deviates from other safety technologies in its profit model because it does not sell advertisements nor does it sell user data [26]. Rather, Citizen’s premium feature connects users to Citizen employees who monitor a user’s surroundings; this is the only way Citizen currently generates revenue. We chose Citizen for our case because we hypothesized that this business model may have unique implications on the design of the application. At the same time, because Citizen has many of the same features as other safety technologies, including the ability to view and discuss safety incidents, receive alerts about safety incidents, and view location-specific data, we hypothesized that our findings may reveal insights about other safety technologies as well.

We triangulated data from two sources [110]. We first conducted user interviews and asked participants about the influence of individual features to allow evidence of deceptive design patterns to emerge organically. We then conducted a researcher-led review of the user interface to identify known deceptive design patterns. In the following sections, we give context for our case and describe our process for collecting and analyzing data.

3.1 Contextual Background

3.1.1 Atlanta Context. We chose to geographically bound our investigation to users of Citizen that live in and around Atlanta. We chose a city in which Citizen was available, where crime was a concern, and where the authors had access to online neighborhood groups and/or local Facebook pages for recruitment. Prior research suggests that safety technologies are used differently by different communities [41, 44], and we hoped that by geographically bounding our investigation, we may better see patterns in individual behavior. It is important to note, however, that the experience of users in Atlanta is not necessarily representative of users from other USA cities.

Atlanta is a racially diverse city in the Southeastern portion of the United States. According to the 2021 Census [6], Black people make up the largest percentage of the city (51%), followed by White people (40.9%), and Asians (4.4%). Once considered a “Black Mecca,” Atlanta’s racial demographics have, however, changed drastically in the last decade. For the first time since the 1920s, the Black population has been declining while the White population has been growing [36]. This racial shift can be attributed to the recent onset of gentrification, as well as the population growth of the city [65].



Figure 1: Citizen is made up of five main tabs: a) Home Page tab, b) Safety Network tab, c) Live Broadcast tab, d) Newsfeed tab, e) Notifications tab.

Additionally, in 2019, Atlanta had the second largest inequality gap in the country [9], with 20% of the population living below the poverty line [10]. A survey collected by the City Continuum of Care counted roughly 3,200 unhoused individuals in 2020, 88% of whom were Black [48]. In the early 2000s, Atlanta had one of the highest rates of violent crime in the country. Although crime rates have largely decreased in the 2010s, violent crimes such as homicides, aggravated assaults, and shooting incidents have gone up since 2017 [85]. Between 2019 and 2021, homicides increased by 54% and aggravated assaults by 23% [31, 53]. This is consistent with numbers in large cities across the country, which have all experienced a surge in violent crime during the COVID-19 pandemic [7]. In addition to an increase in violent crime, Atlanta has faced outrage and protests due to a number of high-visibility murders of Black people at the hands of police and White vigilantes [46, 84].

3.1.2 Citizen Context. Citizen is a location-based crime alert platform that notifies users about local incidents which can affect public safety [55]. Citizen was originally released in 2016 as Vigilante, a platform where users could develop vigilante-style networks to protect themselves from potential offenders. After being banned from the Apple App Store for its potential to incite violence, parent company Sp0n re-branded and re-released the platform as Citizen in 2017 [55]. The mission of the app, as reported on its website in August 2022, reads: “We live in a world where people can access information quickly, share effortlessly, and connect easily — but we have yet to see the power of bringing people together to watch out for each other. At Citizen, we’re developing cutting-edge technology so you can take care of the people and places you love” [27].

Citizen’s custom-built AI algorithm listens to first-responder radio transmissions. From these raw feeds, the AI algorithm automatically processes radio clips and extracts keywords. A Citizen analyst listening to the 911 dispatch then writes a short incident notification, which may be sent to users as an alert [13]. These incidents are supplemented with crowdsourced user videos, which are

reviewed by the company’s moderators before appearing on the app. The Citizen FAQ reports that they include “major incidents that are in progress, or ones that we assess could affect public safety” [24]. The radius around which a user will receive notifications varies based on a number of factors, including the "nature of the incident and the population density of the area" [23]. The basic version of the app is free, does not have ads, and CEO Frame says it does not sell or share user data [13]. However, Citizen is currently facing pressure from venture capitalists backing the platform to monetize and is experimenting with premium features, such as “Citizen Protect,” which allows users to contact company employees to virtually monitor their surroundings and dispatch emergency responders [8].

There are five tabs that users can interact with in the app, and Figure 1 shows screenshots of each tab. The five tabs are: 1) the Home tab, which displays a map with the user’s current location and nearby incidents as well as a list of nearby incidents; 2) the Safety Network tab, which displays a map of the user’s friends’ current locations. This tab also displays the safety incidents near each friend, the distance from each friend to the user, and the battery life remaining on each friend’s mobile device; 3) the Broadcast tab, which allows users to record live videos. They can choose between two types of live videos, “Incidents” or “Good Vibes,” with the app defaulting to “Incidents”; 4) the Newsfeed tab, which shows live videos captured by users. Tapping into a video takes users to a page with more information about the incident, including additional video clips (if available), a list of updates, comments and reactions from other users, and the address on a map. In addition to local incidents, users can also choose to view incidents in other major cities or a “global” category; 5) the Notifications tab, which lists a history of all reported incidents since the user joined the app.

As of January 2022, Citizen has been released in 60 cities or metro areas [25]. Citizen was made available in Atlanta in October 2020, and as of November 2020, was reported to have over 17,000 users [16].
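The alert flow described in Section 3.1.2 (radio ingestion, automated keyword extraction, an analyst-written notification, and an alert radius that varies with incident type and population density) can be pictured as a simple pipeline. The sketch below is purely illustrative: the Incident fields, keyword list, and radius numbers are invented assumptions for exposition, not Citizen's actual, non-public implementation.

```python
from dataclasses import dataclass

# Hypothetical incident record; Citizen's real data schema is not public.
@dataclass
class Incident:
    transcript: str            # text produced from a first-responder radio clip
    lat: float
    lon: float
    population_density: float  # assumed input, people per square km

# Toy keyword list standing in for the app's AI processing step.
SAFETY_KEYWORDS = {"fire", "shooting", "assault", "robbery", "missing"}

def extract_keywords(transcript: str) -> set:
    """Pull out safety-related keywords from a radio transcript."""
    words = {w.strip(".,").lower() for w in transcript.split()}
    return words & SAFETY_KEYWORDS

def draft_notification(incident: Incident):
    """Mimic the analyst step: turn extracted keywords into a short alert text."""
    keywords = extract_keywords(incident.transcript)
    if not keywords:
        return None  # nothing assessed as a public-safety incident
    return "Reported nearby: " + ", ".join(sorted(keywords))

def alert_radius_km(incident: Incident) -> float:
    """Illustrate a radius that varies with incident type and density;
    the constants here are invented, not Citizen's."""
    base = 8.0 if "shooting" in extract_keywords(incident.transcript) else 3.0
    return base / max(incident.population_density / 1000.0, 1.0)

if __name__ == "__main__":
    inc = Incident("Units responding to a structure fire on Main St", 33.749, -84.388, 3500)
    print(draft_notification(inc), f"(alert radius ~{alert_radius_km(inc):.1f} km)")
```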


Table 1: Demographic Characteristics of Study Participants

Participant ID | Race | Age | Gender | Length of Time Using Citizen | Time Spent on Citizen Each Week
P1  | White                     | 25-34 | Female      | About 4 months | 30 minutes-1 hour, maybe more
P2  | White                     | 45-54 | Female      | 7 months       | 10 minutes
P3  | Black                     | 25-34 | Female      | 5 weeks        | 60 minutes
P4  | White                     | 25-34 | Female      | 2 months       | About an hour
P5  | Undisclosed               | 35-44 | Undisclosed | Over a year    | 5 minutes
P6  | White and Native American | 35-44 | Female      | 3-6 months     | 30 minutes or so
P7  | Asian or Pacific Islander | 35-44 | Male        | 6 months       | 30 minutes
P8  | Hispanic or Latino/a      | 35-44 | Male        | 6 months       | 30 minutes
P9  | White                     | 25-34 | Female      | 1.5 years      | 10-20 minutes
P10 | White                     | 65-74 | Male        | 3 months       | Only when notified
P11 | White                     | 35-44 | Male        | 1.5 years      | 1.5 hours
P12 | White                     | 35-44 | Female      | 2 years        | 20 minutes
P13 | Black                     | 18-24 | Male        | Undisclosed    | 12 hours
P14 | Black                     | 25-34 | Male        | 6 months       | Undisclosed
P15 | Black                     | 18-24 | Male        | 2 years        | 10-15 minutes

3.2 Data Collection

3.2.1 User Interviews. Two members of the research team conducted fifteen semi-structured Zoom [113] interviews with Citizen users who live in and around Atlanta. The first twelve interviews were conducted between September and October 2021, and an additional three interviews were conducted in June and July 2022 targeting people of color so that our findings would better reflect the diversity of Atlanta. To recruit Atlanta users, we posted a screener survey on Nextdoor, Reddit, and Facebook, as these are sites where there is prior evidence of users engaging with local safety-related information [70, 90, 109]. There were 139 individuals who completed the initial screening survey. We followed up with 67 individuals and invited them for interviews. Twelve of these individuals completed the interviews. All but one of the participants we interviewed found our post on Nextdoor. The majority of people in this sample were between 35 and 44 years old, female, and White.

In our second round of recruitment, we aimed to interview more people of color and posted our screening survey on subreddits and Facebook groups for Black colleges in Atlanta. We also posted on two different Nextdoor groups in predominantly Black neighborhoods. There were 72 individuals who completed the recruitment screening survey, 24 of whom self-identified as Black residents of Atlanta. We invited nine of these individuals for interviews, and conducted interviews with the three who accepted.

Participants noted that they had used Citizen between 5 weeks and 2 years, with a rough average of 9.5 months (some participants did not give exact answers). Participants spent between five minutes and 12 hours per week on the app, with a rough average of approximately 87.5 minutes per week (some participants did not give exact answers). Table 1 lists the demographics of all 15 participants. Our participant sample includes the following: 53% of our participants identified as female (n = 8) and 40% identified as male (n = 6). One participant declined to specify their gender. 46.6% identified as White, 6.6% identified as Hispanic or Latino/a, 13.3% as Asian or Pacific Islander, 20% as Black, and 6.6% identified as White and Native American. Additionally, 1 participant declined to specify their race.

Despite our targeted recruitment strategy, Black people were underrepresented in our sample. This may be for a number of reasons. Our research team could not find data about the racial makeup of Citizen users to determine whether our participants reflect the broader population of Citizen users in Atlanta, but prior work suggests that Black people are less likely to use social media to find out about local crime activities [56]. Additionally, Black communities in Atlanta have been exploited by researchers and have high levels of distrust which has affected recruitment of this


population in the past [66]. In both the screening survey as well as follow-up emails to schedule the interview, we explained that all interviews would be recorded on Zoom, which may have biased our sample towards those participants who are more trusting of researchers or feel more lax with privacy. There is an opportunity for future research to focus specifically on the Black population regarding their usage of the app.

During interviews, we asked participants to describe: (1) the features they used, (2) their motivation for use, (3) how often they used each feature, and (4) their experience with that feature, including the way it may have shaped their behaviors and beliefs. The interviews ranged from 21 minutes to 57 minutes, with the average interview length being 42.31 minutes (SD = 11.04). Each participant was compensated with a $30 e-gift card.

3.2.2 Deceptive Design Pattern Identification. Three members of the research team conducted an interface analysis to identify deceptive design patterns employed by the Citizen app. Adapting a methodology used by Di Geronimo et al. [37] and Gunawan et al. [51], we recorded our interactions with Citizen by following six predefined user scenarios. An iPhone X, an iPhone 13 mini, and a Pixel 4a were used to record and interact with Citizen version 0.1100.0. Researchers recorded the scenarios in their city of residence, which included Atlanta as well as Seattle. Recording incidents in our city of residence was not only practical, but also enabled us to contextualize the incidents we viewed on the app. The six user scenarios were selected to capture the diversity of ways that users can interact with the app, which we learned from user interviews as well as the first author’s use of the app for research purposes over the course of one year.

User Scenarios:
(1) Download and Setup: Download the application and allow alerts. Share your location data and enter your home address when prompted by the application. Share your contacts, and add 1-2 members of the research team to your Safety Network. Navigate and explore all five tabs at the bottom of the screen. Share, follow, and comment on one incident.
(2) Incident Alert: The first time you receive an alert, tap on the alert and explore the landing page. This alert may be about a contact who is added to your Safety Network.
(3) Random Check: Explore the Home tab, the Safety Network tab, and the Notifications tab. Customize the settings to your preference.
(4) Broadcast Live Incident: Navigate to the Broadcast tab. Give the application permission to use the microphone and camera and start recording a live incident happening in the area (e.g. police cars or helicopters overhead). Submit the incident for moderators to review and stop recording.
(5) Premium Use: Upgrade to the Citizen Protect feature and sign up for the free 30-day trial.
(6) Delete and End Use: Turn off notifications and delete friends from your Safety Network. Cancel the Citizen Protect subscription. Delete account and remove the application from the phone.

After recording our interactions with the app, we had a total of 18 videos with an average length of 3.35 minutes. We used an inductive approach [33] to identify deceptive design patterns since


prior work has not yet examined deceptive design patterns in safety technologies. Using Mathur et al.’s definition of deceptive design patterns [75] and a coding methodology adapted from Radesky et al. [87], three researchers independently watched the videos and identified instances of monetization and reinforcement techniques which we believed modified the underlying choice architecture for us as users. After removing duplicates, we had a total of 34 usage experiences where we believed the design modified the user’s choice architecture. It is important to note that we did not consider designer intent during this review; as Di Geronimo and colleagues note, “understanding designers’ intentions and ethical decisions is subjective and may lead to imprecision” [37, p.4]. Instead, we chose to assess what was presented in the user interface and whether or not those designs modified the choice architecture for users [75].

3.3 Data Analysis

Our data analysis process occurred in three stages: 1) analysis of the interview data; 2) analysis of the data from the interface review; and 3) integration of the two datasets. To identify themes in the first twelve interview transcripts, four members of the research team, including the first author, independently coded the transcripts using Delve Tool [35]. The research team met for two weeks to develop the codebook; all disagreements were resolved through discussion. The first author grouped these codes into larger themes. Over the course of five weeks, our research team met weekly to discuss, refine, and iterate on the codes as well as the emerging themes. After collecting our second round of interview data, our team members coded the transcripts using the existing codebook. During this second round, we generated one new code which led us to re-code older transcripts with this new code in mind. At the end of data analysis, we had 36 codes which were grouped into six overarching themes.

To identify deceptive design patterns, the three members of the team who collected and identified the usage experiences organized these usage experiences using affinity diagramming [52] in Miro Board [77]. Affinity diagramming is an inductive approach that allows users to iteratively group data by theme. This process helped us identify six underlying deceptive design patterns which motivated the usage experiences. We renamed these six patterns using existing nomenclature by consulting deceptive design pattern taxonomies from attention capture [79], e-commerce [74], and privacy [20] domains. The final set of six deceptive design patterns and examples of corresponding usage experiences are presented in Table 2.

The first author integrated the two datasets by iteratively matching on 1) feature and 2) concept. For example, interview data that discussed the Safety Network was integrated with data from the interface analysis related to the Safety Network, and interview data that discussed the concept of community was integrated with data from the interface analysis that was related to the community. After this matching process, two other members of the research team provided feedback on the integrated data. Collection and analysis of the two datasets occurred independently, and thus, not all deceptive design patterns were reflected in the user interviews, and not all user experiences were influenced by deceptive design patterns. We present our integrated data in the Results Section, sharing the


deceptive design patterns identified by researchers as well as how those features did and did not influence the user experience.
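As a rough illustration of the integration step described above (matching interview data to interface-analysis data by feature and by concept), the sketch below groups invented example records by shared tags. The codes, tag names, and data structures are hypothetical; the authors performed this integration manually rather than in code.

```python
from collections import defaultdict

# Invented example data: interview codes and interface-analysis findings,
# each labeled with the feature or concept they concern.
interview_codes = [
    {"code": "alerts about friends caused worry", "tags": {"safety network"}},
    {"code": "comments felt racist", "tags": {"community"}},
]
interface_findings = [
    {"pattern": "Publish", "tags": {"safety network"}},
    {"pattern": "Forced Action", "tags": {"notification feed"}},
]

def integrate(codes, findings):
    """Group interview codes and interface findings that share a feature/concept tag."""
    merged = defaultdict(lambda: {"codes": [], "patterns": []})
    for c in codes:
        for t in c["tags"]:
            merged[t]["codes"].append(c["code"])
    for f in findings:
        for t in f["tags"]:
            merged[t]["patterns"].append(f["pattern"])
    return dict(merged)

for tag, group in integrate(interview_codes, interface_findings).items():
    print(tag, "->", group)
```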

4 RESULTS

The Citizen interface creates an inflated sense of danger while simultaneously positioning itself as a solution to that danger. We describe the interface components that create this effect and report on users’ experiences with these features.

4.1 Manufacturing Anxiety

The Citizen interface presents a stream of incidents that systematically include categories of events that do not pose a risk to the user. Participants consistently told us that they valued using Citizen but felt an increased sense of fear as a result of their engagement with the app. We document how the notification stream, lack of contextual detail, and lack of community contributed to their increased sense that danger lurked around every corner.

4.1.1 Interface Analysis: Indiscriminately Raising the Salience and Visibility of Safety Incidents. In reviewing the interface, we encountered five types of incidents that were shared with users but did not present a threat to their safety.

First, the app notified users about incidents that were not proximate. For example, in one instance, the notification feed displayed an incident about a missing child from a neighboring state (see Figure 2e), and in another, it showed mass shootings from another part of the country. These incidents informed users about alarming incidents that were too far away to affect their personal safety but were presented alongside incidents that occurred nearby, expanding the set of alarming events that were shared with users.

Second, we encountered incidents that were not a threat to public safety and represented minimal or no risk to those who were not directly involved. For example, one incident alerted users of an “Occupied Stuck Elevator” (see Figure 2d).

Third, we found that incidents persisted on the feed long after they were over. For example, as shown in Figure 2f, users were shown information about a “Small Brush Fire” that had been extinguished nine hours prior. Videos shared on the Live Broadcast Tab appeared to persist for 24 hours, even if the incident had been resolved.

Fourth, the app encouraged users to add friends to their Safety Network (see Figure 3c), and upon doing so, people began receiving intermittent alerts about incidents that the app framed as relevant to their friends’ safety. For example, the first author received notifications that a friend was 0.5 miles away from a reported structure fire and, later, that the same friend was 1.1 miles away from a man reported to be armed with a gun and involved in a dispute (see Figure 3d). In a dense metropolitan city where nearly half of all adults live in a home with a gun [57], this may always be the case, but the alerts signaled to the user that there was reason to be concerned for the safety of a loved one, regardless of whether or not that loved one was actually in danger.

Finally, we encountered incidents that did not provide enough information to determine whether or not the incident presented a safety threat. For example, one incident reported a “Man Threatening Staff” without additional context, leaving the user unsure of how, if at all, the incident related to broader public safety concerns (see Figure 2d). Thus, the collective set of incidents documented


events that might be reported as local news stories, with few presenting a plausible threat to the user’s safety. However, Citizen did not encourage users to consume content as local news; the app encouraged users to stay vigilant and maintain real-time awareness of safety risks like "active shooters" by enabling alerts (see Figure 2a). Citizen required users to enable alerts in order to view their Notification Feed, manufacturing an artificial dependency, what Mathur et al. call a forced action deceptive design pattern [74] (all deceptive design patterns are documented in Table 2).

4.1.2 User Experience: Constant Notifications Manufacture Anxiety. All participants reported that Citizen increased their awareness of safety-related incidents in Atlanta. P10 described the app as an “electronic bubble of information” that heightens his awareness of his surroundings no matter where he goes. Citizen left participants feeling shocked at how many criminal incidents occur in the city, commenting on the number of car thefts (P8), fires (P11), and instances of gun violence (P10). They expressed their dismay over the prevalence of danger saying things like, "there’s so much crime and you just don’t expect that" (P8), and they explained that this awareness developed through their use of Citizen, which had surfaced a backdrop of crime they had not previously realized existed. For example, P1 told us, "there’s a level of ignorance is bliss where, if you don’t know anything going on, you know everything seems safe and happy and then, when you add Citizen suddenly you’re aware that there’s danger around."

Participants in our study viewed information from Citizen as reliable because of its unfiltered nature. They trusted the reports because they came from police radio (P8, P12) and perceived incidents to be devoid of the extra commentary (P1), "sensation" (P5), and political slant (P7) that they associated with local news and social media posts (P10, P8, P11, P7, P1, P5). However, the affordance that participants found most valuable was Citizen’s ability to provide hyper-local, real-time information. Participants P3, P10, and P4 all shared that they were alerted about incidents that they could see happening outside their house or gunshots that they could hear in their neighborhood, incidents they perceived as relevant to their safety but too minor to be reported on the news. P10 shared that these alerts helped him "know what to do" and which places to avoid at what time, and P13 liked that he can find out about crime "immediately". P5 put it succinctly, "I just want to know, like locally, just straight up what’s going on near me." As these examples illustrate, for our participants, the core use case for Citizen is to cultivate a real-time awareness of nearby events that might affect their safety.

Although participants appreciated the increased awareness that came with using Citizen, they also said that the frequency of alerts was “stressful” (P9) and “anxiety-inducing” (P6). This is consistent with what users have shared on product reviews of the app [59]. Participants received five to fifteen alerts per day, with the influx becoming “really crazy” at night (P8). The incidents that participants felt were the least helpful were ones that were “far, far away” (P3) or inconsequential to their personal safety. P11, for example, guessed that maybe “one out of 20 [incident notifications] is actually useful” because “unless you’re within half a mile or a quarter-mile away from me, I really don’t care.” P3 and P12 felt similarly, voicing that it was “annoying” (P3) to receive so many “random notifications about



things that are not happening within my vicinity” (P12). Participants reported that they often received notifications about “fires” (P11) and “helicopters” (P9) that they did not care about, and P6 shared that Citizen alerts her about “a whole bunch of fluff, if you will, you know unnecessary calls to the police.” Participants expressed frustration with excessive alerts that depict “all this crime, but it’s actually not, and then it makes it not as useful, like the boy who cried wolf” (P9).

Figure 2: Users are encouraged to allow incident alerts from the Citizen App (a). Upon receiving an alert (b), users can tap it to view the story, view other nearby alerts on a map (c), or view a list of recent notifications (d). While our research team mostly received local notifications (e.g., within 5 miles), (e) shows a notification about an Amber Alert of a missing child from a different state. (f) displays a list of incidents that are "Further Away" and (g) shows a video on the Live Broadcast tab of a fire that had been resolved hours ago.

For some participants, the excessive notifications manufactured what they perceived to be an unnecessary sense of fear. For example, P2 explained that “there’s always a little action right around me because I’m by Edgewood, and there’s kind of a lot of crap going on in Edgewood, so [the constant stream of notifications] just has me super paranoid.” Other participants shared that they expected crime in a big city but seeing so much of it was "scary" (P4), "anxiety-inducing" (P6), and "not for the faint of heart" (P8). P6 described this phenomenon by explaining that, because of Citizen, she hears about “every little teeny tiny thing, whether it’s true or not. . . instead of like hearing

the things that actually matter, I see all of these different things that are probably not a concern. But then it’s like it’s overwhelming to see like, ‘wow, 15 different things have happened within a mile from me.’ ” P12 agreed, describing the app as "alarmist".

Participants suggested ways that the app might scale back irrelevant notifications and prioritize relevant ones, and as a result, inspire fear only when warranted. For example, P5 reflected on the difference between violent and nonviolent crimes with respect to his safety, and P9 explained that Citizen needed to be more discerning about the “difference in severity” between incidents. She wished there were “more ways to break down when you would get notifications and about what types. . . like I don’t want a notification about a traffic accident but, like, I would like to know if there’s a shooting right across the street, or if there was a break in near my complex within you know, a mile or two.” P11 suggested that Citizen implement a “geofence” so that he would only be alerted about notifications that were proximate. Although users did not seem to be aware of the forced action deceptive design pattern, the requirement to enable


alerts in order to view the Notification Feed disempowered users from choosing how and when they would like to view incident information.

Figure 3: During the onboarding process, users are invited to add optional emergency contacts. To do so, they are required to grant Citizen access to their phone’s contacts (a). Users are required to share their location in order to access the app’s incident map (b) and required to add friends from their contacts to use the Safety Network (c). Once they’ve added friends to the Safety Network, they can view their friends’ current location and battery life on the Safety Network tab (d). When one of their contacts joins Citizen, the user receives a notification prompting them to connect with their contact (e).

4.1.3 User Experience: Lack of Context and Quality-Control Inspires Unwarranted Concern. Participants shared that the lack of contextual information made it difficult to discern whether an incident was cause for concern and wanted Citizen to surface details that would enable them to make this judgment more readily. P1, for example, said, “It’s very important to be able to separate. . . what’s real, what’s a threat, and what’s not, because at the end of the day, if you get into a fight with your boyfriend inside your house and you call the police, I’m very sad for you and I hope that you’re okay, but, I don’t need to see an alert on my app that there is a report of like, you know, ‘brawl in the street,’ and like, ‘someone with a knife chasing a woman’ because then I get worried. . . and so I think that there’s a way of making it objective versus just the over-inundation of information that then causes you to not trust it or not wanting to know.” As P1’s quote illustrates, the perceived lack of context made it difficult to differentiate between private incidents and threats to public safety, contributing to unnecessary fear.

In other examples, P11 described an incident where the Atlanta Police Department was conducting a drill, but Citizen incorrectly transcribed it as a “full-on open assault, like, shooting between two different parties” which led to alarm throughout the neighborhood (as witnessed by P11 in the comments). These kinds of incidents prompted requests for "quality-control" (P12) and "a little fact-checking" (P9). Participants also speculated that this lack of quality control, fact-checking, and context led to consuming culturally and racially biased information. P9 reflected on this concern saying, “The issues

of, like, you know somebody like looking into a car, like, I question is like—was it a black person looking into their own car? [Or] was it actually somebody, like, checking car handles and trying to break into cars?” Similarly, P11 knows “a lot of ‘Karens’¹ in the neighborhood” who are quick to call 911, thereby inflating the Notification Feed with biased incidents that may nevertheless inspire concern.

¹ “Karen” refers to the 2020 “Karen” meme caricaturing white women who typically overreact and escalate situations, including making threats to involve the police or abusing grocery store workers for mask-wearing policies. The meme is often associated with white supremacy [81].

4.1.4 User Experience: Lack of Community Limits Users’ Resilience to Fear. Participants in the study infrequently interacted with other users on Citizen. Twelve participants shared that they had never posted on Citizen. P1 talked about how her contributions to the app consisted of once adding a “sad emoji” reaction while P11 described Citizen as a platform where people did not “make lifelong friendships.” Participants cited the following deterrents for connecting with other users: personal preference in using the app for quick news alerts (P1), online anonymity (P4), and high amounts of negative content which made it a difficult place to “hang out” (P1).

Participants said they encountered more community-building activities on other platforms, such as Nextdoor and Facebook. P1 talked about how she turns to Nextdoor and Facebook for “personal color commentary” to augment the reports she sees on Citizen. She explained that this commentary enables her to have “a more complete picture” of what was happening in her neighborhood, and made it “a lot easier to live with that danger that you [she] know[s] about from Citizen.” Reflecting on the impact of the “personal color commentary,” she shared: “It does make it less scary. . . when you add [start using] Citizen, suddenly you’re aware that there’s danger around you but you don’t know exactly who or how or why that danger exists, you just know that it is, and then with Nextdoor you get a little more understanding of why this person is waving a gun on the corner and that if you drive through they’re not going to shoot out your window, like, they’re really pissed at their ex-husband.”


Table 2: Deceptive Design Patterns Employed by Citizen

Social Investment (Domain: Attention Capture [79])
Definition: The use of social metrics to reward users for their engagement and incentivize continued use.
Example within Citizen: Sharing the number of nearby users who would presumably view, react to, and comment on user-uploaded videos.
Influence on user experience: N/A.

Obstruction (Domain: E-commerce [74])
Definition: Making a certain action harder than it needs to be in order to dissuade users from taking that action.
Example within Citizen: A hidden “Skip” button needed to avoid the premium upgrade.
Influence on user experience: N/A.

Misdirection (Domain: E-commerce [74])
Definition: The use of language, visuals, and emotions to guide users toward or away from making a particular choice.
Example within Citizen: A floating button to upgrade to Citizen Protect overlaid on safety notifications and videos.
Influence on user experience: N/A.

Forced Action (Domain: Privacy [74])
Definition: Requiring users to take certain tangential actions to complete their tasks.
Example within Citizen: A requirement to enable alerts in order to view the Notification Feed.
Influence on user experience: Users enabled alerts and received information that was not always relevant to their safety concerns.

Publish (Domain: Privacy [20])
Definition: Sharing personal data publicly.
Example within Citizen: An alert that notifies users that a contact has joined Citizen without informing that contact.
Influence on user experience: Users added contacts to their Safety Network and received alerts about contacts that were not always relevant to their safety concerns.

Obscure (Domain: Privacy [20])
Definition: Making it challenging for users to learn how their personal data is collected, stored, and/or processed.
Example within Citizen: The lack of transparency about what personal data is collected and how it is stored.
Influence on user experience: Users did not trust Citizen and felt reluctant to share information with the application.

Note: Citizen employs known deceptive design patterns from the attention capture, privacy, and e-commerce domains.

As this story illustrates, participants saw the interpersonal communication that occurs on other platforms as humanizing and as potentially mitigating the fear of crime. One participant shared that the lack of online community on Citizen left few chances for users to “make sense of what are the motivations and the kinds of things that may be incentivizing that kind of behavior,” leading users on Citizen to be more “apathetic” than “empathetic” in their comments (P5). While one participant said that he liked that the comments were uncensored (P11), others said they found the comments “gross” (P4, P5), “unkind” (P4), “violent” (P5), and “racist” (P11).

4.2 Offering a Solution to Users’ Heightened Safety Needs

Despite the frustration and anxiety that users reported, they also felt it was important to keep using the app to better manage their own safety (P2, P9, P1, P14). The users we interviewed were not unaware of the negative impacts of using Citizen, but felt beholden to the application. Since downloading Citizen, P14 described having a constant urge to know “what’s really going on” including checking whether a place he is in is “secure.” P2 shared that she felt “beholden to these sound alerts that instill panic. It’s like Pavlov’s dog: you hear the bell and you have a reaction; it’s visceral. . . I feel like a slave to it but it’s the only way I’m going to be able to control my safety as much as I can.” Others agreed—P9 voiced that she has gone back and forth on whether or not to delete the app because it induces anxiety, but decided not to get rid of it because it provided her with valuable information. Thus, we found that Citizen became both a source of and a solution to anxiety for our participants. Here, we examine the interface features that position Citizen as a solution and the steps—both with and without the app—users took to manage their safety.

4.2.1 Interface Analysis: Encouraging the Use of Lucrative Features Which Promise Protection. Citizen offers users three features for their protection, their loved ones’ protection, and their community’s protection: Citizen Protect, Safety Network, and Live Broadcast. These three features are also profitable, helping the company gain users’ money, data, and attention. Citizen Protect is Citizen’s premium feature, which was launched in 2021. The feature offers users the option to contact Citizen employees, known as Protect Agents, who can monitor the user’s surroundings, contact first responders when situations escalate, alert users’ emergency contacts, and create new incidents on behalf of a user to alert nearby users of the app. Citizen Protect is promoted as a tool that brings people together to watch out for each other [27].


Figure 4: The Citizen Protect feature is first advertised to users during the onboarding process (a) (b). In-app advertisements highlight the benefits of using Protect and spotlight “success stories” such as finding missing pets (c) or people (d). There is also a floating blue button to sign up for Citizen Protect that is constantly visible in the lower right corner of the screen (e).

In-app advertisements give the example of a Protect Agent creating an incident to alert nearby users about a missing pet and the nearby users responding en masse (see Figure 4c). Researchers found this vision of mobilizing users reminiscent of Citizen’s prior avatar as the Vigilante app. Although the app is free, we found that Citizen aggressively advertises its premium features with the use of deceptive design patterns. For example, the Citizen Protect feature is advertised twice to new users during the sign-up process. In the latter instance, researchers noted the hidden “Skip” button which made it particularly challenging to bypass the advertisement, an example of a deceptive design pattern called obstruction [74]. The most egregious deceptive design pattern, however, is a floating button to sign up for Citizen Protect which is overlaid on each screen, constantly visible as users scroll through videos and notifications, many of which do not present any threat to users’ safety but heighten fear nonetheless (see Figure 4e). We saw this as an example of a misdirection deceptive design pattern, a button which supports Citizen in translating heightened awareness and anxiety about safety into purchases. Users can purchase an individual or a family plan. Citizen also encourages users to monitor their friends’ and family’s safety by adding contacts to their Safety Network. To take advantage of this feature, Citizen requires users to share their entire contact list with the app (see Figure 3a). There is no option to add contacts individually, an example of a forced action deceptive design pattern [74] because it creates a false dependency. If users choose to share their contacts, Citizen will alert all contacts who are existing Citizen users that their friend has joined the app, without informing the user. This alert encourages contacts to add the user to their Safety Network and share location data with the user (see Figure 3e). We saw this as an example of publish, a privacy deceptive design pattern [20], where information about an individual is shared without their consent or knowledge. This deceptive design pattern has the potential to exponentially increase new users for Citizen. Researchers also discovered that the app collected data about the user without their knowledge, including data about the user’s heart rate and about their mobile device’s battery life. Battery life information was shared with friends on the Safety Network without consent. These are examples of privacy deceptive design patterns which obscure what data is being collected and how [20]. The app describes Live Broadcast as a feature that allows users to create and share videos in order to “spread awareness of safety incidents with your community in real-time.” Citizen nudges users with verbal cues and displays the number of nearby users (who would presumably see the live video) (see Figure 1c, Figure 2c). We see this as an example of a social investment deceptive design pattern because it encourages the use of the app through social metrics such as the potential number of reactions, comments, and views on user-uploaded videos [79]. Researchers also documented one instance where users were prompted with the notification: “Go Live. 600 feet away. Hit-and-Run Collision. Tap to update your community” (see Figure 1e). The research team found this notification particularly challenging to reconcile with the app’s mission to support user safety [27]. User-generated broadcasts were used to capture and engage users’ attention. For example, one researcher received an alert that there was a “live video at the scene,” encouraging them to view a video of an overturned car after a collision. Each video was also overlaid with users’ comments, reactions, and a pulsating share button to encourage users to share the video via text or social media.

4.2.2 User Experience: A Heightened Need for Safety Requires Action. Sensitized to the risks around them, users engaged Citizen’s features for protection in two ways and, in many more, responded individually by taking matters into their own hands. While we did not speak to any participants who had used Citizen Protect or Live Broadcast and could not evaluate the influence of the obstruction, misdirection, or social investment deceptive design patterns, we did speak to four participants who added friends to their Safety Networks (P1, P3, P4, P6). P6 mentioned that he has a very diverse group of friends, and given the racially-charged political climate, he appreciated the ability to make sure they were safe.


P3 similarly appreciated being able to track her family members’ locations. P1 downloaded the Citizen app when her friend invited her to join her Safety Network due to the publish deceptive design pattern. While P1 valued the information she received from the app, she decided to turn on “Ghost Mode” because alerts about P1’s nearby incidents were causing her friend undue stress and anxiety. Taking advantage of the information on Citizen, we observed how some participants began engaging in detective work. A Citizen post helped P14, an undergraduate student, create awareness about his missing friend. Other students on his campus also used the app, and P14 found that the comment section provided useful and comforting information when his friend went missing. Some participants viewed incidents on Citizen and cross-referenced that information on other platforms to get more context (P6, P4, P1, P9). P9, for example, was able to collect more information about a neighbor’s missing car using Citizen and Facebook, while P4 was able to locate a Nextdoor neighbor’s missing mail by cross-referencing information from Citizen. Others did not feel as comfortable relying on Citizen because they worried about sharing location data with the app (P9, P12, P15, P11). P11 changed his settings so that he was only sharing his location when he was using the app because he assumed Citizen had to make money, and they must be doing something with his data that he was unaware of. P12 lives in an apartment complex where she knows there is gang activity. However, she admitted that she no longer feels comfortable calling 911 because she worries identifiable information might be leaked onto Citizen. She said, “I can’t believe I question now calling 911. . . it made me think to have like who has access to 911 recordings now?” Although users did not seem to be aware of specific deceptive design patterns, the lack of transparency about Citizen’s privacy policy due to design decisions such as the obscure deceptive design pattern disempowered users from taking actions that might protect their safety.

In addition to relying on Citizen, many participants took matters into their own hands and began carrying tasers (P9), guns (P12), knives (P2), and mace (P9, P2), and investing in new home security systems (P9, P12, P7). Others began avoiding certain sub-populations perceived as dangerous. A small group of participants shared that their use of the app led to an increased fear of individuals who are homeless (P1), mentally ill (P2), Black teenagers (P2), and “Black men” (P4). P12 felt that she sees so many crime-related incidents with such little context that her mind can’t help but draw conclusions about who is committing these crimes. P1 reflected that: "Before I downloaded Citizen when I would see homeless people in the park I wouldn’t think anything of it, you know they’re there sleeping, this is a soft relatively private place for you to lay your head tonight, and I would go on my way. Since downloading Citizen, I will leave a little more space, and I will look in those bushes a little more like, ‘is there, someone that could potentially be right there waiting to pounce?’" For P11, Citizen brought to light the city’s “vagrancy problem” and the sense that more police activity and local leadership are needed. Almost every participant began avoiding certain areas of the city that they perceived as dangerous. Participants mentioned changing the routes they drove (P8), the routes they walked at night (P2, P6, P9, P11, P4), and the businesses they frequented (P9, P11). Based on the incidents that participants viewed on the app, they began to create mental models of “hot pockets” (P6) in the city to avoid. P8, for example, said that after seeing the same street names again and again, she began avoiding those areas. Similarly, P11 described how he used Citizen to figure out if he should “avoid that section of town” for the day. Furthermore, these mental models persisted beyond just the usage of the app. P4, for example, no longer attends the Castleberry art walk because she now associates that neighborhood with crime, and P2 said she no longer goes out for walks alone after six pm. For others, the data from the app has influenced long-term decisions like where to buy a house (P7, P8) and whether it makes sense to move to another state altogether (P10, P2). The areas that participants mentioned as “hot pockets” of crime include Castleberry Hill, home to one of the highest concentrations of Black-owned land and businesses in the country, and Mechanicsville, where the vibrant and predominantly Black community of the 1950s has since diminished largely due to misguided urban renewal [1, 5].

5 DISCUSSION

5.1 The Power of Deceptive Infrastructure

In 2021, HCI researchers Westin and Chiasson introduced the concept of dark digital infrastructure [108]. They observed that examining individual features in isolation neglects the ways in which these features interact with each other and with larger social and psychological factors [89, 108]. A narrow focus on individual features limits researchers’ ability to fully understand the impacts of these designs. To account for this oversight, Westin and Chiasson use “dark infrastructure” to refer to the larger sociotechnical machinery—built on deceptive design patterns—that undermines user agency at scale [108]. In our review of the Citizen app, we similarly observed that a feature-level analysis did not capture the full extent to which Citizen’s interface can modify users’ choice architecture. Here, we consider how the deceptive design patterns we identified in Citizen might intersect with cognitive biases and sociocultural contexts to produce dark infrastructure. We refer to dark infrastructure as “deceptive infrastructure” to avoid conflating the racialized term “dark” with problematic behavior.

5.1.1 Deceptive Design Patterns and Cognitive Biases. In 2014, Facebook notoriously conducted an experiment to understand how users’ emotional states are influenced by the emotional valence of the content on their feeds [61]. While this study was highly controversial, it is not the only example of technology manipulating users’ emotional states at scale [107, 108]. Our results describe how Citizen modifies users’ choice architecture by sharing information that does not present a threat to users’ safety but heightens anxiety and fear nonetheless. While individual deceptive design patterns (like requiring users to enable notifications) may seem relatively innocuous, we found that over time they created high emotional costs for participants in our study, who described their experience as “scary,” “anxiety-inducing,” “stressful,” “frustrating,” and “paranoia”-inducing. Attentional bias is a type of cognitive bias where people disproportionately attend to emotionally evocative information due to the evolutionary importance of early and fast processing of threat-related information [60]. Given the potential for deceptive design patterns to exploit users’ cognitive biases [74, 105], we hypothesize that attentional bias may explain why safety incidents evoked strong emotional responses even when they did not present a threat to user safety. Attentional bias suggests a lowered threshold for modifying users’ choice architecture with safety information, and a highly cited meta-review of attentional bias found that increased exposure to negatively valenced content has a causal, bidirectional, and mutually reinforcing relationship with anxiety [104]. The interactions between individual deceptive design patterns and cognitive biases may thus create a deceptive infrastructure that leaves users vulnerable to manipulation and creates emotional costs that persist even after users log off [96].

Figure 5: We propose an expansion to the Mathur et al. taxonomy of harm [75]. Light blue items are taken from the existing taxonomy; dark blue items are our proposed additions. All icons are taken from Flaticon [49].

5.1.2 Deceptive Design Patterns and Sociocultural Contexts. We identify the potential for deceptive design patterns to interact with the cultural and social contexts within which they exist to systematically reproduce negative stereotypes and reinforce cultural biases. Interviewees reported instances of racism in the comments section of the app and shared that the fear of crime left them feeling increasingly distrustful and suspicious of strangers, particularly Black men and unhoused individuals. This is not surprising given that decades of research has established that the fear of crime is not expressed neutrally, and in the United States, is likely to be directed in ways that reflect existing biases against Black people [40]. The fear of crime is closely associated with a narrative of Black criminality [40, 106], leading to, for example, the policing, profiling, and surveillance of Black, Brown, and low-income populations [12, 45, 63, 70, 82]. Citizen is one of many technologies that interact with their cultural and social contexts in ways that disproportionately impact vulnerable and historically marginalized populations. Technologies used by millions of users to discover new restaurants (Yelp) or buy and sell homes (Zillow) profit off of users’ engagement even as that engagement contributes to the reproduction of racial biases and gentrification [29, 114]. Researchers have documented the inconsistent enforcement of online racism on Reddit due to an interest in protecting user engagement on the platform [109], and children from lower socioeconomic backgrounds have been found to play apps with more deceptive designs [87]. These examples all point to the potential for deceptive design patterns to interact with sociocultural contexts to reproduce implicit biases and stereotypes that systematically harm vulnerable and historically marginalized populations.

5.1.3 Expanding the Taxonomy of Harm. In a 2021 meta-review of the literature, Mathur et al. identified individual and collective welfare as overarching normative concerns that underlie the discussion on deceptive design patterns [75]. They offer a taxonomy of harms organized under these two categories with the hope of providing researchers with a common language to explain why deceptive design patterns are of import and concern [75]. Their review of the literature finds that deceptive design patterns have the potential to harm individual welfare through financial loss, invasion of privacy, and cognitive burdens. They also have the potential to harm collective welfare through decreased competition, reduced price transparency, distrust in the market, and unanticipated societal consequences (see Figure 5). In light of our results, we propose the need to expand this taxonomy to include emotional load as a harm to individual welfare. Emotional load is defined as the emotional cost borne by users due to a technology’s deceptive infrastructure. We see the need for researchers to begin systematically documenting this harm; leveraging empirically validated measures from psychology to identify and measure complex emotions such as fear can support researchers in this endeavor. As an example, Westin and Chiasson use an empirically validated scale to measure users’ “fear of missing out” and the role that deceptive infrastructure plays in producing this fear [108].

Unlike individual welfare, which has been a core focus of deceptive design pattern research, collective welfare has received little attention [75]. This is an oversight given the ways that technologies can interact with social and cultural contexts to reproduce harm for whole sub-populations. We propose an expansion of Mathur et al.’s taxonomy to include social injustice as a harm to collective welfare. Social injustice refers to the inequitable distribution of harms and benefits in society [39]. This is distinct from harm due to unanticipated societal consequences, which are harms that designers are unable to predict. Mathur et al. give the example of Cambridge Analytica’s use of personal data from Facebook to initiate a disinformation campaign to influence the 2016 U.S. presidential election [75] as an unanticipated societal consequence. In contrast, social injustice can be identified and documented in design using frameworks such as those proposed by Costanza-Chock [30] and Dombrowski et al. [39]. As an example, Corbett engages a social justice framework proposed by Dombrowski et al. to identify the ways that commercially available technologies can reproduce and resist gentrification [29]. By expanding the taxonomy of harm to include social injustice, we hope to draw attention to the ways that deceptive infrastructure can contribute to harm to some populations while benefiting others.

The Federal Trade Commission (FTC) is an independent government agency whose mission is to promote competition and protect consumers from unfair or deceptive practices [102]. It has a long history of investigating and regulating seller behavior which “unreasonably creates or takes advantage of an obstacle to the free exercise of consumer decision-making” [11]. In such cases, the FTC evaluates whether the seller’s behavior “causes or is likely to cause substantial injury to consumers which is not reasonably avoidable by consumers themselves and not outweighed by countervailing benefits to consumers or to competition” [2]. Not only does the FTC have the authority to regulate such behavior, but it can also provide remedies for “significant injuries,” such as financial losses [86]. In recent years, the FTC has requested feedback on proposals to regulate how companies collect and store user data [3] and how they hide fees [4], for example. By expanding the taxonomy of harm, we hope to raise these issues as critical for researchers to document and for the FTC to start investigating.

5.2 Dismantling the Deceptive Infrastructure of Safety

With a heightened awareness of how deceptive design patterns interact with human biases and sociocultural contexts, designers can better account for the potential harm caused by the technologies they create. In this section, we offer recommendations that demonstrate how an awareness of deceptive infrastructure can be translated into concrete design suggestions. These suggestions would not have been possible if we had focused narrowly on feature-level patterns.

(1) Empowering Users to Selectively Engage With Safety Information. Because users disproportionately attend to safety information even if it does not present a threat to their safety [115], it is critical that design empowers users to selectively engage. Participants voiced a need to filter the information they received. They suggested features like a geofence as well as the ability to discriminate between violent and nonviolent incidents. Safety applications could additionally provide users the option to filter for ongoing incidents or ones occurring in public rather than in private spaces. Furthermore, implementing processes to verify reported incidents and sharing that process transparently can help users assess the potential threat an incident poses (a minimal illustration of such filtering follows this list).

(2) Contextualizing Danger Over Place and Time. Presenting an authoritative and singular representation of place that is governed entirely by crime can make it easy for users to feel scared and default to unexamined assumptions about a place and the people who live there [62]. Providing users with feeds that reflect not just crime but a diversity of events can help users maintain perspective. For example, alongside stories that highlight criminal incidents, platforms could also share community events, highlight instances of collaboration, or celebrate individual members of the community. This type of diversity can support users in developing a more nuanced understanding of place and people. Furthermore, longitudinal data can help contextualize individual incidents. For example, property crime in Atlanta decreased steadily and dramatically between 2009 and 2021, and violent crime has decreased significantly since 2009, with a slight uptick between 2018 and 2021 that nevertheless remains lower than in 2017 or any earlier year [53]. Design which communicates these longitudinal trends can support users in contextualizing safety incidents within a longer time frame.

(3) Actively Dismantling Cultural Stereotypes. Decades of research on implicit biases and cultural stereotypes have documented the ways that Blackness is associated with criminality in US culture [34, 40, 106]. Black people are more likely to be characterized by White people as violent and perceived as more likely to engage in criminal activity than White people [101]. As evidenced by Facebook [90, 91], Reddit [109], Nextdoor [63, 70], and WhatsApp [78], safety technologies that do not actively engage with these stereotypes risk reproducing them. Design, however, can play an active role in dismantling cultural stereotypes through the use of evidence-based strategies, such as promoting counter-narratives and embedding opportunities for media literacy training [88]. Prior research by Jessa Dickinson and colleagues, for example, has designed safety technology for street outreach workers to support the dissemination of counter-narratives [38].

(4) Channeling Fear Productively. Engaging with safety information is likely to inspire fear [104], but that fear can be channeled in ways that strengthen the community. Collective efficacy is a measure of a community’s level of social cohesion and how effectively that social cohesion can be mobilized towards a common goal [43, 94, 95]. Collective efficacy is a robust predictor of lower rates of violence [95], and increasing collective efficacy is a well-established strategy when designing for community crime prevention [42, 43, 58, 69]. Supporting collective efficacy both online and offline can empower users to channel their fear in productive ways. For example, designers can offer features to organize support for victims after a safety incident or features that encourage users to connect with local nonprofits. These evidence-based strategies both increase collective efficacy and empower individuals to channel fear towards efforts that strengthen a community [43, 93, 100]. Without such channels, users default to individual responses which ultimately create suspicion and distrust [28] and decrease feelings of safety [98].
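To make the filtering described in recommendation (1) concrete, the following is a minimal, purely illustrative sketch in Python. It is not part of the original study and does not reflect Citizen’s actual implementation; the data model, field names, and preference options are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """A single reported incident (hypothetical data model)."""
    lat: float
    lon: float
    violent: bool
    ongoing: bool
    in_public_space: bool
    verified: bool  # passed a (hypothetical) verification process


@dataclass
class FilterPreferences:
    """What the user has chosen to see."""
    home_lat: float
    home_lon: float
    radius_km: float              # geofence radius around a location the user picks
    violent_only: bool = False
    ongoing_only: bool = False
    public_only: bool = False
    verified_only: bool = True


def _distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle (haversine) distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def filter_incidents(incidents: List[Incident], prefs: FilterPreferences) -> List[Incident]:
    """Keep only the incidents the user has opted in to seeing."""
    kept = []
    for inc in incidents:
        if _distance_km(prefs.home_lat, prefs.home_lon, inc.lat, inc.lon) > prefs.radius_km:
            continue  # outside the user's geofence
        if prefs.violent_only and not inc.violent:
            continue
        if prefs.ongoing_only and not inc.ongoing:
            continue
        if prefs.public_only and not inc.in_public_space:
            continue
        if prefs.verified_only and not inc.verified:
            continue
        kept.append(inc)
    return kept
```

The design choice the sketch encodes is the one the recommendation argues for: which incidents reach the user is decided by preferences the user sets, rather than by the platform’s engagement goals.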

5.3 Limitations

There are a few limitations of this study. First, 12 of the 15 user interviews were conducted almost a year prior to the interface analysis of the application. While the main functionalities of the app remained the same, the first author, who used Citizen for the duration of the study, did note some incongruence. For example, by the time we conducted the interface analysis, Citizen appeared to be advertising Citizen Protect more aggressively and using nudges to encourage users to Live Broadcast. We hypothesize that the reason we were not able to interview users who had broadcasted or used Citizen Protect is that these were not popular features at the time the interviews were conducted, due to their limited advertising. This data would have further illuminated the ways that deceptive design patterns can create purchase pressure. Future work could contribute meaningfully by taking a longitudinal approach to understanding how the influence of deceptive design patterns evolves over time. A second limitation is the lack of precision in understanding users’ emotional states. Participants used words like stress, worry, insecurity, anxiety, fear, and paranoia interchangeably, limiting our ability to specify the exact nature of the user experience. For this reason, we suggest future work draw on methods from psychology to precisely define the influence technologies have on users’ emotional states. Third, our findings are unique to Atlanta users in 2020 and 2021. Users from different cities at different time periods may have very different experiences with the application. Since companies often conduct A/B testing, which provides some users with views that differ from views presented to other users, even the participants we spoke to may have had different views of the application. For this reason, we suggest that future investigations of deceptive design patterns using case methods clearly communicate the bounds of the case and refrain from generalizing beyond those bounds. Fourth, consistent with prior literature on deceptive design patterns [37, 74, 87], the research team conducted an interface analysis of the Citizen app. However, this approach likely limited the number of patterns that we were able to identify since the review was restricted by the experience of three users. Further, because the user interviews were conducted prior to the interface analysis, we may have attended more to features that were discussed by users, including incident alerts and feeds. Future work can account for these limitations by supplementing researcher reviews with users’ posts and comments directly from the application. This may be especially useful to identify deceptive design patterns in domains that are understudied. Finally, as with other interview-based research, our data is self-reported. Participants could have misremembered, selectively shared information, or may have interpreted past experiences and emotions differently than how they were originally experienced. Participants may have been especially hesitant to share the negative influences of Citizen on their emotional states or behavior due to the heightened vulnerability that such responses demand.

6 CONCLUSION

Our goal in this paper was to investigate how deceptive design patterns manifest in safety technologies and how they influence the user experience. We conducted a case study of the Citizen app, a commercially-available location-based crime alert technology. By triangulating interview data with an interface review of the app, we find that feature-level deceptive design patterns interact with sociocultural factors and cognitive biases to create unique harms to both individual and collective welfare. This work contributes to an emerging discussion about deceptive infrastructure. We propose an expansion to Mathur et al.’s existing taxonomy of harm to include emotional load and social injustice and offer suggestions for designers interested in dismantling the deceptive infrastructure of safety technologies.

ACKNOWLEDGMENTS

We thank the participants for their candidness and their time. We additionally appreciate the anonymous reviewers and members of the Georgia Tech Public Participation Lab (Christopher, Alyssa, Ashley, and Meichen) for their ideas and edits. We finally thank Sunny, Saharsh, Shipra, and Tarun Chordia as well as Kartik Shastri for user testing and unwavering support.

REFERENCES

[1] 2021. BNC Raises The Bar On Juneteenth Coverage To Create Premier TV Destination For Emancipation Day Celebrations. (2021). https://apnews.com/ article/juneteenth-lifestyle-3103c12dc772aea12121dfa833bfeb06 [2] 2021. A Brief Overview of the Federal Trade Commission’s Investigative, Law Enforcement, and Rulemaking Authority. https://www.ftc.gov/about-ftc/ mission/enforcement-authority [3] 2022. Trade Regulation Rule on Commercial Surveillance and Data Security. https://www.federalregister.gov/documents/2022/08/22/2022-17752/traderegulation-rule-on-commercial-surveillance-and-data-security [4] 2022. Unfair or Deceptive Fees Trade Regulation Rule Commission Matter No. R207011. https://www.federalregister.gov/documents/2022/11/08/202224326/unfair-or-deceptive-fees-trade-regulation-rule-commission-matterno-r207011 [5] n.d.. History: Mechanicsville. (n.d.). http://mechanicsvilleatl.org/history/ [6] Census 2021. 2021. QuickFacts Atlanta city, Georgia. (2021). https://www. census.gov/quickfacts/atlantacitygeorgia [7] David Abrams. 2021. City Crime Stats Crime in Major U.S. Cities. https: //citycrimestats.com/covid/ [8] Boone Ashworth. 2021. What is Citizen’s criteria for reporting incidents? (Aug. 2021). https://www.wired.com/story/citizen-protect-subscription/ [9] Trevor Bach. 2020. The 10 U.S. Cities With the Largest Income Inequality Gaps. (2020). https://www.usnews.com/news/cities/articles/2020-09-21/uscities-with-the-biggest-income-inequality-gaps


[10] Trevor Bach. 2020. The 10 U.S. Cities With the Largest Income Inequality Gaps. (2020). https://www.usnews.com/news/cities/articles/2020-09-21/uscities-with-the-biggest-income-inequality-gaps [11] J Howard Beales III. 2003. The Federal Trade Commission’s use of unfairness authority: Its rise, fall, and resurrection. Journal of Public Policy & Marketing 22, 2 (2003), 192–200. [12] Ruha Benjamin. 2019. Race after technology: Abolitionist tools for the new jim code. Social forces (2019). [13] Steven Bertoni. 2019. Murder! Muggings! Mayhem! How An Ex-Hacker Is Trying To Use Raw 911 Data To Turn Citizen Into The Next Billion-Dollar App. Forbes (july 2019). https://www.forbes.com/sites/stevenbertoni/2019/07/15/murdermuggings-mayhem-how-an-ex-hacker-is-trying-to-use-raw-911-data-toturn-citizen-into-the-next-billion-dollar-app/?sh=5a3e2dd21f8a [14] Jan Blom, Divya Viswanathan, Mirjana Spasojevic, Janet Go, Karthik Acharya, and Robert Ahonius. 2010. Fear and the city: role of mobile services in harnessing safety and security in urban use contexts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1841–1850. [15] Mark A Blythe, Peter C Wright, and Andrew F Monk. 2004. Little brother: could and should wearable computing technologies be applied to reducing older people’s fear of crime? Personal and Ubiquitous Computing 8, 6 (2004), 402–415. [16] Terah Boyd and Dave Huddleston. 2020. Metro leaders have mixed reaction to public safety app that lets you stream crime right to police. WSB-TV (2020). https://www.theguardian.com/world/2017/mar/12/netherlands-willpay-the-price-for-blocking-turkish-visit-erdogan [17] Lauren Bridges. 2021. Infrastructural obfuscation: unpacking the carceral logics of the Ring surveillant assemblage. Information, Communication & Society 24, 6 (2021), 830–849. [18] Harry Brignull. 2022. About. https://www.deceptive.design/ [19] AJ Bernheim Brush, Jaeyeon Jung, Ratul Mahajan, and Frank Martinez. 2013. Digital neighborhood watch: Investigating the sharing of camera data amongst neighbors. In Proceedings of the 2013 conference on Computer supported cooperative work. 693–700. [20] Christoph Bösch, Benjamin Erb, Frank Kargl, Henning Kopp, and Stefan Pfattheicher. 2016. Tales from the Dark Side: Privacy Dark Strategies and Privacy Dark Patterns. Proceedings on Privacy Enhancing Technologies 2016, 4 (Oct. 2016), 237–254. https://doi.org/10.1515/popets-2016-0038 [21] M. Ryan Calo. 2013. Digital Market Manipulation. SSRN Electronic Journal (2013). https://doi.org/10.2139/ssrn.2309703 [22] Ryan Calo and Alex Rosenblat. 2017. The Taking Economy: Uber, Information, and Power. SSRN Electronic Journal (2017). https://doi.org/10.2139/ssrn.2929643 [23] Citizen. 2021. How do I enable notifcations and location? (2021). https://support.citizen.com/hc/en-us/articles/115000606974-How-do-Ienable-notifcations-and-locationWhat is Citizen’s criteria for reporting incidents? [24] Citizen. 2021. (2021). https://support.citizen.com/hc/en-us/articles/115000603373-What-isCitizen-s-criteria-for-reporting-incidents[25] Citizen. 2021. Where is Citizen available? (2021). https://support.citizen.com/ hc/en-us/articles/115000273653-Where-is-Citizen-availableHow does Citizen [26] Citizen App Frequently Asked Questions 2021. work? https://support.citizen.com/hc/en-us/articles/115000278894-How-doesCitizen-work[27] CitizenAbout 2022. About. https://citizen.com/about [28] John E Conklin. 1975. The impact of crime. Macmillan New York. [29] Eric Corbett and Yanni Loukissas. 2019. 
Engaging gentrifcation as a social justice issue in HCI. In Proceedings of the 2019 chi conference on human factors in computing systems. 1–16. [30] Sasha Costanza-Chock. 2020. Design justice: Community-led practices to build the worlds we need. The MIT Press. [31] Atlanta Anti-Violence Advisory Council. 2021. 2021 ANTI-VIOLENCE ADVISORY COUNCIL RECOMMENDATIONS REPORT. Technical Report. City of Atlanta, GA Ofce of Communications, Atlanta, GA. https://www.atlantaga.gov/home/ showdocument?id=51962 [32] Norwegian Consumer Council. 2018. Deceived by design, how tech companies use dark patterns to discourage us from exercising our rights to privacy. Norwegian Consumer Council Report (2018). [33] John W Creswell and Cheryl N Poth. 2016. Qualitative inquiry and research design: Choosing among fve approaches. Sage publications. [34] Angela Y Davis. 2011. Are prisons obsolete? Seven Stories Press. [35] Delve Tool 2021. Delve Tool. https://delvetool.com/ software to analyze qualitative data. [36] Shaila Dewan and B Goodman. 2006. Gentrifcation changing face of new Atlanta. New York Times 11 (2006). https://www.nytimes.com/2006/03/11/us/ gentrifcation-changing-face-of-new-atlanta.html [37] Linda Di Geronimo, Larissa Braz, Enrico Fregnan, Fabio Palomba, and Alberto Bacchelli. 2020. UI dark patterns and where to fnd them: a study on mobile applications and user perception. In Proceedings of the 2020 CHI conference on human factors in computing systems. 1–14.


[38] Jessa Dickinson, Jalon Arthur, Maddie Shiparski, Angalia Bianca, Alejandra Gonzalez, and Sheena Erete. 2021. Amplifying Community-led Violence Prevention as a Counter to Structural Oppression. Proceedings of the ACM on Human-Computer Interaction 5, CSCW1 (2021), 1–28. [39] Lynn Dombrowski, Ellie Harmon, and Sarah Fox. 2016. Social justice-oriented interaction design: Outlining key design strategies and commitments. In Proceedings of the 2016 ACM Conference on Designing Interactive Systems. 656–671. [40] Jennifer L Eberhardt, Phillip Atiba Gof, Valerie J Purdie, and Paul G Davies. 2004. Seeing black: race, crime, and visual processing. Journal of personality and social psychology 87, 6 (2004), 876. [41] Sheena Erete and Jennifer O. Burrell. 2017. Empowered Participation: How Citizens Use Technology in Local Governance. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, Denver Colorado USA, 2307–2319. https://doi.org/10.1145/3025453.3025996 [42] Sheena Lewis Erete. 2013. Protecting the home: exploring the roles of technology and citizen activism from a burglar’s perspective. In Proceedings of the sigchi conference on human factors in computing systems. 2507–2516. [43] Sheena L Erete. 2014. Community, group and individual: A framework for designing community technologies. The Journal of Community Informatics 10, 1 (2014). [44] Sheena L Erete. 2015. Engaging around neighborhood issues: How online communication afects ofine behavior. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. 1590–1601. [45] Virginia Eubanks. 2018. Automating inequality: How high-tech tools profle, police, and punish the poor. St. Martin’s Press. [46] Richard Fausset. 2022. What We Know About the Shooting Death of Ahmaud Arbery. (2022). https://www.nytimes.com/article/ahmaud-arbery-shootinggeorgia.html [47] Bent Flyvbjerg. 2006. Five misunderstandings about case-study research. Qualitative inquiry 12, 2 (2006), 219–245. [48] Partners for Home. 2020. Point-in-Time Count (2020) - Partners For HOME. https://partnersforhome.org/wp-content/uploads/2020/08/2020-PITFull-Report_FINAL-1.pdf [49] noomtah ultimatearm goodware Freepik, Uniconlabs. 2022. Flaticon. https: //www.faticon.com/free-icons/ [50] Saul Greenberg, Sebastian Boring, Jo Vermeulen, and Jakub Dostal. 2014. Dark patterns in proxemic interactions: a critical perspective. In Proceedings of the 2014 conference on Designing interactive systems. 523–532. [51] Johanna Gunawan, Amogh Pradeep, David Chofnes, Woodrow Hartzog, and Christo Wilson. 2021. A Comparative Study of Dark Patterns Across Web and Mobile Modalities. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (Oct. 2021), 1–29. https://doi.org/10.1145/3479521 [52] Gunnar Harboe and Elaine M Huang. 2015. Real-world afnity diagramming practices: Bridging the paper-digital gap. In Proceedings of the 33rd annual ACM conference on human factors in computing systems. 95–104. [53] Moshe Haspel. 2022. Atlanta Crime Rates in Historical Perspective. Technical Report. Atlanta Regional Commission, Atlanta, GA. https://33n.atlantaregional. com/friday-factday/atlanta-crime-in-historical-perspective-2009-2021 [54] M. J. Hattingh. 2015. The use of Facebook by a Community Policing Forum to combat crime. In Proceedings of the 2015 Annual Research Conference on South African Institute of Computer Scientists and Information Technologists SAICSIT ’15. ACM Press, Stellenbosch, South Africa, 1–10. https://doi.org/10. 
1145/2815782.2815811 [55] David Ingram and Cyrus Farivar. 2021. Inside citizen: The public safety app pushing surveillance boundaries. https://www.nbcnews.com/tech/tech-news/ citizen-public-safety-app-pushing-surveillance-boundaries-rcna1058 [56] Aarti Israni, Sheena Erete, and Che L. Smith. 2017. Snitches, Trolls, and Social Norms: Unpacking Perceptions of Social Media Use for Crime Prevention. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. ACM, Portland Oregon USA, 1193–1209. https: //doi.org/10.1145/2998181.2998238 [57] Tammy Joyner. 2022. Your guide to Georgia’s gun laws. https://atlantaciviccircle. org/2022/05/28/your-guide-to-georgias-gun-laws/ [58] Cristina Kadar, Yiea-Funk Te, Raquel Rosés Brüngger, and Irena Pletikosa Cvijikj. 2016. Digital Neighborhood Watch: To Share or Not to Share?. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems. ACM, San Jose California USA, 2148–2155. https://doi.org/10.1145/ 2851581.2892400 [59] Liam Kennedy and Madelaine Coelho. 2022. Security, Suspicion, and Surveillance? There’s an App for That. Surveillance & Society 20, 2 (2022), 127–141. [60] Ernst HW Koster, Geert Crombez, Stefaan Van Damme, Bruno Verschuere, and Jan De Houwer. 2004. Does imminent threat capture and hold attention? Emotion 4, 3 (2004), 312. [61] Adam DI Kramer, Jamie E Guillory, and Jefrey T Hancock. 2014. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences 111, 24 (2014), 8788–8790. [62] Laura Kurgan. 2013. Close up at a distance: Mapping, technology, and politics. MIT Press.


[63] Rahim Kurwa. 2019. Building the digitally gated community: The case of Nextdoor. Surveillance & Society 17, 1/2 (2019), 111–117. [64] Cherie Lacey and Catherine Caudwell. 2019. Cuteness as a ‘dark pattern’in home robots. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 374–381. [65] Jamiles Lartey. 2018. Nowhere for people to go: Who will survive the gentrifcation of Atlanta. The Guardian 23 (2018). [66] Christopher A Le Dantec and Sarah Fox. 2015. Strangers at the gate: Gaining access, building rapport, and co-constructing community-based research. In Proceedings of the 18th ACM conference on computer supported cooperative work & social computing. 1348–1358. [67] Chris Lewis. 2014. Irresistible Apps: Motivational design patterns for apps, games, and web-based communities. Springer. [68] Dan A Lewis and Greta Salem. 2017. Community crime prevention: An analysis of a developing strategy. The Fear of Crime (2017), 507–523. [69] Sheena Lewis and Dan A Lewis. 2012. Examining technology that supports community policing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1371–1380. [70] Maria R. Lowe, Madeline Carrola, Dakota Cortez, and Mary Jalufka. 2021. “I Live Here”: How Residents of Color Experience Racialized Surveillance and Diversity Ideology in a Liberal Predominantly White Neighborhood. Social Currents (Dec 2021), 23294965211052544. https://doi.org/10.1177/23294965211052545 [71] Maria R. Lowe, Angela Stroud, and Alice Nguyen. 2017. Who Looks Suspicious? Racialized Surveillance in a Predominantly White Neighborhood. Social Currents 4, 1 (Feb 2017), 34–50. https://doi.org/10.1177/2329496516651638 [72] Jamie Luguri and Lior Jacob Strahilevitz. 2021. Shining a light on dark patterns. Journal of Legal Analysis 13, 1 (2021), 43–109. [73] Kai Lukof, Ulrik Lyngs, Himanshu Zade, J Vera Liao, James Choi, Kaiyue Fan, Sean A Munson, and Alexis Hiniker. 2021. How the design of youtube infuences user sense of agency. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–17. [74] Arunesh Mathur, Gunes Acar, Michael J. Friedman, Elena Lucherini, Jonathan Mayer, Marshini Chetty, and Arvind Narayanan. 2019. Dark Patterns at Scale: Findings from a Crawl of 11K Shopping Websites. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (Nov. 2019), 1–32. https://doi.org/10. 1145/3359183 arXiv:1907.07032 [cs]. [75] Arunesh Mathur, Jonathan Mayer, and Mihir Kshirsagar. 2021. What Makes a Dark Pattern... Dark? Design Attributes, Normative Considerations, and Measurement Methods. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–18. https://doi.org/10.1145/3411764.3445610 arXiv:2101.04843 [cs]. [76] Sharan B Merriam and Robin S Grenier. 2019. Qualitative research in practice: Examples for discussion and analysis. John Wiley and Sons. [77] Miro 2022. https://miro.com/ Visual Collaboration Software. [78] Anouk Mols and Jason Pridmore. 2019. When citizens are “actually doing police work”: The blurring of boundaries in WhatsApp neighbourhood crime prevention groups in The Netherlands. Surveillance & Society 17, 3/4 (2019), 272–287. [79] Alberto Monge Rofarello and Luigi De Russis. 2022. Towards Understanding the Dark Patterns That Steal Our Attention. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. ACM, New Orleans LA USA, 1–7. https://doi.org/10.1145/3491101.3519829 [80] Arvind Narayanan, Arunesh Mathur, Marshini Chetty, and Mihir Kshirsagar. 2020. 
Dark Patterns: Past, Present, and Future: The evolution of tricky user interfaces. Queue 18, 2 (2020), 67–92. [81] Diane Negra and Julia Leyda. 2021. Querying ‘Karen’: The rise of the angry white woman. European Journal of Cultural Studies 24, 1 (2021), 350–357. [82] Safya Umoja Noble. 2018. Algorithms of oppression. In Algorithms of Oppression. New York University Press. [83] Midas Nouwens, Ilaria Liccardi, Michael Veale, David Karger, and Lalana Kagal. 2020. Dark patterns after the GDPR: Scraping consent pop-ups and demonstrating their infuence. In Proceedings of the 2020 CHI conference on human factors in computing systems. 1–13. [84] Richard A. Oppel Jr., Derrick Taylor, and Nicholas Bogel-Burroughs. 2022. What to Know About Breonna Taylor’s Death. (2022). https://www.nytimes.com/ article/breonna-taylor-police.html [85] John Perry and Emily Merwin DiRico. 2021. Atlanta crime trends, 2011-present. The Atlanta Journal-Constitution (2021). https://www.ajc.com/news/crime/ atlanta-crime-trends-2011-present/XIIM6AMGHBHABMBVHCIHM3AOPQ/ [86] Robert Pitofsky. 1976. Beyond Nader: consumer protection and the regulation of advertising. Harv. L. Rev. 90 (1976), 661. [87] Jenny Radesky, Alexis Hiniker, Caroline McLaren, Eliz Akgun, Alexandria Schaller, Heidi M. Weeks, Scott Campbell, and Ashley N. Gearhardt. 2022. Prevalence and Characteristics of Manipulative Design in Mobile Applications Used by Children. JAMA Network Open 5, 6 (Jun 2022), e2217641. https://doi.org/10.1001/jamanetworkopen.2022.17641 [88] Srividya Ramasubramanian. 2007. Media-based strategies to reduce racial stereotypes activated by news stories. Journalism & Mass Communication Quarterly


84, 2 (2007), 249–264. [89] Yvonne Rogers, Margot Brereton, Paul Dourish, Jodi Forlizzi, and Patrick Olivier. 2021. The dark side of interaction design. In Extended abstracts of the 2021 CHI conference on human factors in computing systems. 1–2. [90] Niharika Sachdeva and Ponnurangam Kumaraguru. 2015. Deriving requirements for social media based community policing: insights from police. In Proceedings of the 16th Annual International Conference on Digital Government Research. ACM, Phoenix Arizona, 316–317. https://doi.org/10.1145/2757401.2757452 [91] Niharika Sachdeva and Ponnurangam Kumaraguru. 2015. Social networks for police and residents in India: exploring online communication for crime prevention. In Proceedings of the 16th Annual International Conference on Digital Government Research. ACM, Phoenix Arizona, 256–265. https://doi.org/10.1145/ 2757401.2757420 [92] Robert J Sampson. 1988. Local friendship ties and community attachment in mass society: A multilevel systemic model. American sociological review (1988), 766–779. [93] Robert J Sampson. 2017. Collective efcacy theory: Lessons learned and directions for future inquiry. Taking stock (2017), 149–167. [94] Robert J Sampson, Jefrey D Morenof, and Felton Earls. 1999. Beyond social capital: Spatial dynamics of collective efcacy for children. American sociological review (1999), 633–660. [95] Robert J Sampson, Stephen W Raudenbush, and Felton Earls. 1997. Neighborhoods and violent crime: A multilevel study of collective efcacy. science 277, 5328 (1997), 918–924. [96] Lisette J Schmidt, Artem V Belopolsky, and Jan Theeuwes. 2017. The time course of attentional bias to cues of threat and safety. Cognition and Emotion 31, 5 (2017), 845–857. [97] ND Schüll. 2014. Addiction by Design: Machine Gambling in Las Vegas Princeton. [98] Mary E Schwab-Stone, Tim S Ayers, Wesley Kasprow, Charlene Voyce, Charles Barone, Timothy Shriver, and Roger P Weissberg. 1995. No safe haven: A study of violence exposure in an urban community. Journal of the American Academy of Child & Adolescent Psychiatry 34, 10 (1995), 1343–1352. [99] Sumit Shah, Fenye Bao, Chang-Tien Lu, and Ing-Ray Chen. 2011. CROWDSAFE: crowd sourcing of crime incidents and safe routing on mobile devices. In Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems - GIS ’11. ACM Press, Chicago, Illinois, 521. https://doi.org/10.1145/2093973.2094064 [100] Patrick Sharkey, Gerard Torrats-Espinosa, and Delaram Takyar. 2017. Community and the crime decline: The causal efect of local nonprofts on violent crime. American Sociological Review 82, 6 (2017), 1214–1240. [101] Lee Sigelman and Steven A. Tuch. 1997. Metastereotypes: Blacks’ Perceptions of Whites’ Stereotypes of Blacks. The Public Opinion Quarterly 61, 1 (1997), 87–101. http://www.jstor.org/stable/2749513 [102] the Premerger Notifcation Ofce Staf, DPIP Staf, and CTO. 2022. Mission. https://www.ftc.gov/about-ftc/mission [103] Elliot Tan, Huichuan Xia, Cheng Ji, Ritu Virendra Joshi, and Yun Huang. 2015. Designing a mobile crowdsourcing system for campus safety. iConference 2015 Proceedings (2015). [104] Bram Van Bockstaele, Bruno Verschuere, Helen Tibboel, Jan De Houwer, Geert Crombez, and Ernst Koster. 2014. A review of current evidence for the causal impact of attentional bias on fear and anxiety. PSYCHOLOGICAL BULLETIN 140, 33 (2014), 682–721. https://doi.org/10.1037/a0034834 [105] Ari Ezra Waldman. 2020. Cognitive biases, dark patterns, and the ‘privacy paradox’. 
Current Opinion in Psychology 31 (Feb. 2020), 105–109. https://doi. org/10.1016/j.copsyc.2019.08.025 [106] Kelly Welch. 2007. Black criminal stereotypes and racial profling. Journal of contemporary criminal justice 23, 3 (2007), 276–288. [107] Fiona Westin and Sonia Chiasson. 2019. Opt out of privacy or" go home" understanding reluctant privacy behaviours through the FoMO-centric design paradigm. In Proceedings of the New Security Paradigms Workshop. 57–67. [108] Fiona Westin and Sonia Chiasson. 2021. “It’s So Difcult to Sever that Connection”: The Role of FoMO in Users’ Reluctant Privacy Behaviours. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. ACM, Yokohama Japan, 1–15. https://doi.org/10.1145/3411764.3445104 [109] Qunfang Wu, Louisa Kayah Williams, Ellen Simpson, and Bryan Semaan. 2022. Conversations About Crime: Re-Enforcing and Fighting Against Platformed Racism on Reddit. Proceedings of the ACM on Human-Computer Interaction 6, CSCW1 (Mar 2022), 1–38. https://doi.org/10.1145/3512901 [110] Robert K Yin. 2011. Applications of case study research. sage. [111] José P Zagal, Stafan Björk, and Chris Lewis. 2013. Dark patterns in the design of games. In Foundations of Digital Games 2013. [112] Min Zhang, Arosha K Bandara, Blaine Price, Graham Pike, Zoe Walkington, Camilla Elphick, Lara Frumkin, Richard Philpot, Mark Levine, Avelie Stuart, et al. 2020. Designing Technologies for Community Policing. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems. 1–9. [113] zoom 2021. Zoom(Version 5.4.9). https://zoom.us. Video Conferencing Software.


[114] Sharon Zukin, Scarlett Lindeman, and Laurie Hurson. 2017. The omnivore’s neighborhood? Online restaurant reviews, race, and gentrifcation. Journal of Consumer Culture 17, 3 (2017), 459–479.


[115] Ariel Zvielli, Amit Bernstein, and Ernst HW Koster. 2015. Temporal dynamics of attentional bias. Clinical Psychological Science 3, 5 (2015), 772–788.

Changes in Research Ethics, Openness, and Transparency in Empirical Studies between CHI 2017 and CHI 2022

Kavous Salehzadeh Niksirat, [email protected], University of Lausanne, Lausanne, Switzerland
Lahari Goswami, [email protected], University of Lausanne, Lausanne, Switzerland
Pooja S. B. Rao, [email protected], University of Lausanne, Lausanne, Switzerland
James Tyler, [email protected], University of Lausanne, Lausanne, Switzerland
Alessandro Silacci, [email protected], University of Lausanne, Lausanne, Switzerland and School of Management of Fribourg, HES-SO University of Applied Sciences and Arts Western Switzerland, Fribourg, Switzerland
Sadiq Aliyu, [email protected], University of Lausanne, Lausanne, Switzerland
Annika Aebli, [email protected], University of Lausanne, Lausanne, Switzerland
Chat Wacharamanotham, [email protected], Swansea University, Swansea, United Kingdom
Mauro Cherubini, [email protected], University of Lausanne, Lausanne, Switzerland

ABSTRACT

In recent years, various initiatives from within and outside the HCI field have encouraged researchers to improve research ethics, openness, and transparency in their empirical research. We quantify how the CHI literature might have changed in these three aspects by analyzing samples of 118 CHI 2017 and 127 CHI 2022 papers, randomly drawn and stratified across conference sessions. We operationalized research ethics, openness, and transparency into 45 criteria and manually annotated the sampled papers. The results show that the CHI 2022 sample was better in 18 criteria and showed no improvement in the remaining criteria. The most noticeable improvements were related to research transparency (10 out of 17 criteria). We also explored the possibility of assisting the verification process by developing a proof-of-concept screening system. We tested this tool on eight criteria; six of them achieved high accuracy and F1 scores. We discuss the implications for future research practices and education. This paper and all supplementary materials are freely available at https://doi.org/10.17605/osf.io/n25d6.

CCS CONCEPTS
• Human-centered computing → Empirical studies in HCI; • Social and professional topics → Codes of ethics.

KEYWORDS
replicability, reproducibility, transparency, ethics, open science, data availability, CHI

ACM Reference Format: Kavous Salehzadeh Niksirat, Lahari Goswami, Pooja S. B. Rao, James Tyler, Alessandro Silacci, Sadiq Aliyu, Annika Aebli, Chat Wacharamanotham, and Mauro Cherubini. 2023. Changes in Research Ethics, Openness, and Transparency in Empirical Studies between CHI 2017 and CHI 2022. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 23 pages. https://doi.org/10.1145/3544548.3580848

This work is licensed under a Creative Commons Attribution 4.0 International License. CHI '23, April 23–28, 2023, Hamburg, Germany. © 2023 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-9421-5/23/04. https://doi.org/10.1145/3544548.3580848

1 INTRODUCTION

Empirical research is one of the cornerstones of the Human-Computer Interaction (HCI) field. Since HCI research examines human experiences, ethical research has long been at the heart of planning and conducting studies. In the last decade, many scholarly fields increasingly recognized the value of openness and transparency in research. The field of HCI also participates in this broader discourse through various movements and research works. Let us look at these three values: Research Ethics, Openness, and Transparency. Research ethics aims to protect research participants and foster socially responsible collaboration between science and society [75]. Research ethics in HCI studies include having study plans vetted by an institutional review board (IRB), obtaining informed consent from participants, implementing measures to ensure participant safety, and protecting data collected from study participants [14, 18, 38]. Within the ACM SIGCHI community, several research publications (e.g., [1, 64, 78]) and events (e.g., [18, 37]) were dedicated to discourses on research ethics. In 2016, the SIGCHI


Executive Committee appointed an Ethics Committee to facilitate the discourses and review related policies and procedures. The UNESCO Recommendation on Open Science defnes the term “Open Science” as “an inclusive construct that combines various movements and practices aiming to make multilingual scientifc knowledge openly available, accessible and reusable for everyone” [90]. Although we appreciate the inclusiveness of this defnition, for the reason that will be apparent in the next paragraph, we use a narrower defnition in this paper: Openness refers to precisely the availability of research publications and materials. Openness initiatives have led research institutions and funding agencies to renegotiate their relationships with scientifc publishers—including the ACM.1 Consequently, ACM SIGCHI also made papers in selected conference proceedings from 2016 freely downloadable. Transparency is closely related to openness and is often mentioned together, such as in The Center of Open Science’s Transparency and Openness Promotion Guideline [65]. For this paper, we distinguish transparency from openness. We defne transparent research practices as researchers’ actions in disclosing details of methods, data, and other research artifacts. A transparent practice does not guarantee openness and vice versa. For example, describing statistical results in detail is transparent, but when the paper is behind a paywall, the results are also not open. In the HCI community, the discourse on transparency manifests in community-led events, such as RepliCHI [104–107] and Transparent Research [25, 49, 50], surveys [94, 95], and opinion pieces [27, 88]. Despite being regarded as desirable qualities, research ethics, openness, and transparency could be challenging to achieve. The limitation of research resources—fnance and human resources— and the misalignment of incentives can be barriers to openness and transparency [90, 95]. Specifcally for HCI, some research settings may cause tensions between these values. For example, research projects with participants from a vulnerable population might need to prioritize ethics over transparency. In other cases, researchers may need to sacrifce these values to ensure the quality of the knowledge. For example, a research project could emphasize transparency by creating a social network to learn about people’s behavior on social media sites. The ecological validity of the fndings from this study would be less than if the study were conducted on Facebook or Twitter where transparency of research data is limited. Previous work either investigated specifc aspects of the HCI literature such as statistical reporting [94], sample size reporting [19], or replication [45]. Other works indirectly assess the situation through self-reported surveys [95], and content-analysis of journal guidelines [11]. To determine how the feld of HCI evolved in these aspects and where the community should focus improvement efforts, we need an assessment across these aspects based on actual published papers and their research artifacts. Towards this goal, this paper makes three contributions: • We collected criteria in research ethics, openness, and transparency and operationalized them for evaluation based on published papers and research materials.


• We sampled 118 and 127 papers from CHI 2017 and 2022 and evaluated them with these criteria to provide snapshots of research practices and discuss the implications of the results. • We explored the possibility of assisting the assessment by developing a proof-of-concept screening system.

1 See ACM Plan S Compliance statement at https://authors.acm.org/open-access/plans-compliance, last accessed January 2023.
2 See, for example, ACM Code of Ethics and Professional Conduct at https://www.acm.org/code-of-ethics, last accessed January 2023.
3 In this work, we focus on research ethics. To read about design ethics, see a literature survey by Nunes Vilaza et al. [66].

2 RELATED WORK

In this section, we first review the existing work on research ethics. Next, we review the relevant studies on practices related to openness. Finally, we review studies focused on the principle of transparency. For all three practices, we review studies conducted in the HCI community and those in adjacent fields that contribute to general guidelines.

2.1 Research Ethics

The ethical guidelines for responsibly conducting experiments are often informed by national or state laws and institutional regulations. Additionally, different science communities design their own domain-specific codes of ethics [99].2 Munteanu et al. [64] make the point that although the formal process of establishing the ethical approval of a study can vary by country, the underlying principles are universal. However, they also note that new technologies present challenges to existing ethical review processes, which may need mitigating. An example is the raw power of data collection afforded by technologies, where opinions on the kind (or extent) that is acceptable are subject to changing attitudes [93]. Some researchers, such as Punchoojit and Hongwarittorrn [78], have attempted to understand how ethical concerns have evolved. They offer categories ranging from broad issues to some highly specific to HCI. It is vital that such concerns or conflicts are not oversimplified or proceduralized to an extent that researchers refrain from engaging with the issues [17]. Instead of simply writing that they followed the institutional safeguards, researchers should describe the research ethics issues they faced and how they were addressed. Such ethical considerations can also help researchers inoculate themselves against biases. Well-defined standards may help researchers engage with and report on the ethical dimensions of their work. Ethical standards in HCI can be related to both the data collection & analysis and the reporting & dissemination of results.3 For data collection & analysis, practices such as acquiring ethical approval and collecting participants' consent are discussed in HCI textbooks (see, for example, [55, section 15]). Some aspects are studied in more detail. For example, Pater et al. [71] assessed ethical challenges in compensating participants. Their systematic literature review of papers from four HCI venues (CHI, CSCW, Ubicomp/IMWUT, UIST) found that 84.2% of the studies did not sufficiently report essential decisions in participant compensation. For the ethics of reporting and the dissemination of results, Abbott et al. [1] examined reporting trends with regard to anonymization practices in CHI. They studied 509 CHI papers for health, wellness, accessibility, and aging research and found that codes and pseudonyms were the most


used techniques to protect participant privacy. They ofered further suggestions to the community that facilitate data reporting while limiting privacy risks. Finally, several studies discussed the ethical precautions that HCI researchers should consider when dealing with vulnerable populations. For example, Walker et al. [97] proposed heuristics for HCI research with vulnerable populations. This heuristic includes several actions to be conducted before research (e.g., understanding the needs and interests of vulnerable communities), during research (e.g., considering if collected data can be harmful to participants), and after research (e.g., considering researchers’ positionality in relation to the vulnerable community when presenting the results). On a diferent note, Antle [9] refected on their experience in doing research with children who live in poverty and asked fve questions to consider when working with vulnerable populations, for example, “How can we feel relatively certain that we are providing benefts to the population we are working with?” [9]. Furthermore, Gautam et al. [40] described the tension they experienced in running a participatory design study with a vulnerable population and McDonald et al. [58] discussed how privacy researchers should consider the power dynamics that may impact vulnerable populations. Despite these studies, the adherence of HCI researchers to diferent practices regarding research ethics still requires investigation.

2.2 Openness

In comparison to the studies on transparency and research ethics, the HCI literature lacks sufcient studies on openness to understand to what extent researchers publish their papers and materials freely—without locking them behind a paywall—and whether they face any challenges in meeting open science standards. More than a decade ago, several articles in ACM magazines discussed open-access publication models and their benefts for computer science [57, 98]. In order to publish open-access, authors had to pay a so-called article processing charges (APCs) fee. While APCs are mostly sponsored by the authors’ institutions or funding agencies, researchers without such support might face difculties [22]. Furthermore, awareness of open science is not globally distributed and some researchers, from developing countries, might face difculties when seeking for funding for open access. Spann et al. [86] discussed the benefts of an alternative publication model (used by some publishers) called Pay What You Want (PWYW) where researchers are allowed to pay any amount that they can aford. Some publishers (e.g., ACM) support green open access and allow authors to publish the author version of their article publicly on their personal or institutional website [4]. However, some authors might also use commercial social networking websites such as ResearchGate. Jamali [46] showed that almost half of the authors who publish their non-open-access articles on ResearchGate infringe the copyrights of their publishers. Thus, ACM strictly prohibited sharing on such websites [4]. Besides the use of open access for sharing articles, several researchers studied diferent practices for sharing supplementary materials (e.g., [15]). One of the most typical practices for sharing supplementary materials is promising to share upon request. Krawczyk and Reuben [52] showed that the compliance rate for such requests is low. Vines et al. [92] showed that it can be even


lower when papers are published far in the past. The standard approach for material sharing is the use of platforms that are compatible with the FAIR principles [101], namely being Findable (e.g., having unique identifiers), Accessible (e.g., not being locked behind a paywall), Interoperable (e.g., providing ReadMe files to clarify the structure), and Reusable (e.g., providing metadata that can support readers to understand the data and reuse it). Two well-known FAIR-compatible platforms are OSF and Zenodo.

2.3 Transparency

Transparent research practices disclose details of methods, data, and other research artifacts. In quantitative research, these practices usually lead to reproducibility and increase the likelihood of replicability [65]. Reproducibility means that re-running the same analysis on the same data yields the same results [72]. Replicability means that re-running the study to produce new data—analyzed in the same or diferent manner—should yield a similar result [72]. No study can be reproduced or replicated without having access to its detailed methodology, procedures, and materials. Replication studies—where the explicit intent is to confrm or challenge the results of prior work—are infrequent in the feld of HCI. Hornbæk et al. [45] examined 891 studies across four diferent HCI outlets and found that only 3% attempted to replicate a prior result. Upon closer examination, they found that authors of nonreplication studies could have often corroborated earlier work by, for example, analyzing data diferently and collecting additional data. Often, these choices would have required minimal additional efort [45]. That many HCI studies overlook these kinds of opportunities has led some to question the culture of the feld. Nevertheless, outside HCI, the consensus on general practices for research transparency boils down to a 36-item checklist [5]. In qualitative research, the discussion on research transparency is more complex. The term transparency has another semantics. In the Introduction chapter in an infuential ethnographic text— The Religion of Java—Geertz describes a desirable characteristic of ethnographic reports, where the “ethnographer is able to get out of the way of his data, to make himself translucent” [41, p. 7]. Cliford disagrees with this portrayal of objectivity as “too simple notions of transparency”. The word “transparency” was used as a paraphrase of Geertz’s “translucency”. To avoid confusion, we will refer to this semantics with Geertz’s original term: translucency. In our defnition, research transparency does not require or preclude translucency. In fact, despite disagreeing with translucency, Cliford praised Geertz’s practice of sharing his ethnographic feld notes extensively [26, p. 61], which is a transparency practice. In a panel discussion about transparency in qualitative research at CHI 2020 [88], the panelists concurred that in qualitative research, transparency in the method should be emphasized over transparency in data. In addition to this separation of transparency between data and method, Moravcsik [62] points out the third aspect: production transparency, which demonstrates how arguments and citations are drawn fairly from diferent points of view in the literature. We set aside production transparency because it is not possible to evaluate this aspect within each paper. In the following subsections, we distinguish transparency in method, results in the paper, data beyond the paper, and other non-data research artifacts.


2.3.1 Transparency of research methods. Numerous guidelines for reporting research methods are evidence of the importance of research method transparency. In quantitative research, there is a list of 34 research decisions that could be pertinent to �-hacking [100]. More specifcally, there are guidelines for reporting decisions on sample size [19, 53], measurements, and constructs [6]. Quantitative data analysis could also be transparent by sharing the analysis code. In a survey of CHI 2018–2019 authors [95], around 25% of the respondents shared quantitative analysis procedures. In qualitative research, the Standards for Reporting Qualitative Research (SRQR) standard is extensive in methodological decisions [67]. More specifc guides are also available for interview and focus group research [8, 89], refexive thematic analysis [16], and for using inter-rater reliability [59]. The survey of CHI 2018– 2019 authors found around 25% of the respondents shared qualitative analysis procedures; this percentage is similar to quantitative research [95]. Another practice to foster research transparency is the preregistration of study objectives and methods before collecting or analyzing data. Cockburn et al. [27] promote preregistering HCI experiments. They argue that preregistration will clarify the intent to do exploratory research and reduce the misuse of null hypothesis signifcance testing (NHST). Preregistration is also helpful in qualitative research. Haven et al. [43] conducted a Delphi study with 295 qualitative researchers; the results of their study culminated in 13 items for preregistration of qualitative studies. In the feld of HCI, preregistration is rare. Pang et al. [68] systematically reviewed CHI 2018–21 papers and found only 32 papers with preregistration. Another novel method to promote methodological transparency is Registered Report, where the research method is written and peer-reviewed before data collection [23]. Despite over 300 journals supporting this format4 , none of them is HCI.

2.3.2 Transparency of research results. In addition to research methods, the research results reported in the paper contribute to its transparency. In quantitative HCI research, problems in statistical reporting persist. In 2006, Cairns [20] looked at the use of inferential statistics in BCS HCI conferences over two years and the output of two leading HCI journals in the same year. Of the 80 papers analyzed, 41 used inferential statistics, and only one conducted inferential statistics appropriately. All others had errors in their reporting or analysis. Still, in 2020, Vornhagen et al. [94] looked at the quality of reporting statistical significance testing in CHI PLAY 2014–19. More than half of the papers employed NHST without adequate specificity in their research questions or statistical hypotheses [94]. To address these problems, several HCI books are dedicated to statistical practices and reporting, for example, [21, 80]. In qualitative research, how research results are transparent depends on the research methods. The SRQR standard only requires the results to (1) describe an analysis and (2) support it with evidence [67]. The Consolidated criteria for reporting qualitative research (COREQ) checklist, for interviews and focus group studies, adds consistency and clarity as criteria [89]. Braun & Clarke also highlight that the results must fit the assumptions made in the analysis method and the epistemology [16, Table 2].

2.3.3 Transparency of data. In Wacharamanotham et al. [95]'s survey, they found that around 40% of respondents shared some data, with around 21% sharing raw data. The respondents of their survey reported key concerns about protecting data that may be sensitive and about not having obtained permission from the participants to share data. A recent study supports these concerns: VandeVusse et al. [91] found that participants in qualitative studies volunteer to share data to be helpful. However, their participants misunderstood "sharing" as disseminating research findings instead of sharing the interview transcripts [91]. HCI research has looked into challenges in Research Data Management [34, 35] and has come up with an innovative approach to facilitate sharing despite these challenges [63].

2.3.4 Transparency of research artifacts. In addition to sharing methods, results, and data, researchers also generate other artifacts. In the survey of CHI 2018–2019 authors [95], slightly above 30% of respondents reported that they shared study materials, such as stimuli or interview guides. A slightly higher percentage, around 40%, reported sharing hardware or software. One worrisome result is that many respondents indicated that they did not see the benefits of sharing these materials. In another analysis of CHI 2016–17 papers, only around 2% of papers publicly shared source code [31]. The proliferation of guidelines, discussions, and empirical studies in the last few years might have changed the transparency practices in HCI research. In fact, CHI conferences have added a Transparency section to the Guide to Authors and Reviewers5 since CHI 2020 [42]. For the time being, empirical studies about research transparency in the HCI literature are self-reported surveys [95] or focus on individual aspects [31, 94] or policies [11]. We need a comprehensive study into how transparency is actually practiced in order to take stock of where the field currently stands and where the effort to improve should be focused.

While the previous research studied different aspects of research ethics, openness, and transparency, none provided a comprehensive picture of these practices in HCI. In particular, it is necessary to inquire into the status quo of the adopted practices and understand how much progress the field has made and which areas are lacking. One way to objectively measure this is by collecting criteria for these practices and analyzing the text of published research articles and their supplementary materials. To address this gap, we operationalize 45 criteria related to research ethics, openness, and transparency. We evaluate the HCI literature by comparing two samples of papers published in ACM CHI 2017 and CHI 2022. Additionally, given the lack of a screening tool to assess HCI articles, we explore the potential for such a system.

4 See https://www.cos.io/initiatives/registered-reports, last accessed January 2023.
5 See https://chi2020.acm.org/authors/papers/guide-to-a-successful-submission/, last accessed January 2023.

3 CRITERIA FOR RESEARCH ETHICS, OPENNESS, AND TRANSPARENCY

Towards the goal of evaluating the research ethics, openness, and transparency of HCI publications, we developed a comprehensive set of criteria for assessing published papers and their published research artifacts. This section describes the development process and highlights the insights we gained.


S2. Check of Normality assumption for parametric statistics. If the authors checked and reported the distribution of the data to justify using a parametric test.
Instruction: Search the PDF of the article for the following keywords:
● Likely Terms: 'normality', 'parametric', 'non-parametric', 'Shapiro-Wilk', 'Kolmogorov-Smirnov', 'QQ plot'
Report as "Yes" or "No" in Column BM.
Why is this point important: Reporting the assessment of statistical assumptions allows readers to determine whether the chosen statistical approach is suitable.
Citations that justify this criterion:
● Point 27 from the consensus-based transparency checklist [Aczel et al., 2020].
● Section 4.2 of a survey of HCI papers [Cairns 2007].
● Item 7b of the ARRIVE guideline for animal research [du Sert et al., 2020].
● SAMPL guideline from the field of medicine [Lang & Altman, 2016].

Figure 1: An excerpt from an instruction note provided to coders for the stat-normality criterion. The note involved a title, step-by-step instructions, keywords, rationale, and references.
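For criteria like stat-normality, the coding instructions boil down to targeted keyword searches over the paper text. The following minimal R sketch shows how such a search could be mechanized to flag candidate papers for manual checking; the folder path, function name, and use of the pdftools package are our illustration, not the authors' tooling.

    # Flag papers whose full text mentions any of the "Likely Terms" for stat-normality.
    # A hit only marks a paper as a candidate; a human coder still verifies it.
    library(pdftools)

    likely_terms <- c("normality", "parametric", "non-parametric",
                      "Shapiro-Wilk", "Kolmogorov-Smirnov", "QQ plot")
    pattern <- paste(likely_terms, collapse = "|")

    flag_paper <- function(pdf_path) {
      text <- paste(pdf_text(pdf_path), collapse = " ")  # one string per paper
      grepl(pattern, text, ignore.case = TRUE)
    }

    pdf_files  <- list.files("papers/", pattern = "\\.pdf$", full.names = TRUE)  # hypothetical folder
    candidates <- pdf_files[vapply(pdf_files, flag_paper, logical(1))]

Keyword flagging of this kind narrows the search space but cannot decide the criterion on its own, which is why the coding instructions still end with a manual "Yes"/"No" judgment.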

3.1 Development process

We drew some criteria that are already operationalized in prior works, for example, statistical reporting criteria [94]. Other criteria were inspired by high-level principles, self-report checklists, survey questionnaires, and textbook recommendations. From these sources, two co-authors created a set of distinct criteria and worked out how to inspect them solely from the papers and their published research artifacts. This initial version was discussed and refined together with two other co-authors. The second version was used to create detailed coding instructions (half an A4 page per criterion on average). Figure 1 shows an excerpt from the instruction note provided to the coders, which includes a title, step-by-step instructions, keywords, rationale, and references (for detailed examples, see Sup. 1).6 The coding instructions were refined in a collaborative coding process as detailed in Section 4.2. The study was preregistered at OSF Registries.7 In the preregistered study design, we identified 44 criteria. The total number of criteria evolved during the course of the study, as explained in Sup. 2. Table 1 presents an overview of the final version with 45 criteria. Some criteria (marked with an asterisk *) apply to a subset of empirical papers. For example, share-interview-guide is only applicable to qualitative papers that use interviews, and stat-descriptive is only applicable to quantitative or mixed papers that use frequentist statistics. The criteria are related to the distinct phases of research, including study design, data collection, data analysis, and reporting (see Table 1). The specific subset for each criterion is listed in the criteria definition document (see Tables 1–6 in this document in Sup. 3). We also provide Table 1 in Excel format (Sup. 4) for authors, reviewers, and teachers to adapt to their purposes. The criteria definition document (Sup. 3) also provides the rationale behind each criterion in detail with additional citations. We hope that knowing the rationale will better encourage the practices related to research ethics, openness, and transparency. For example, a statistical guideline prescribes reporting degrees of freedom in statistical tests [54] (see stat-parameters). In the supplement, we explain that readers could use the degrees of freedom to determine whether the choice of statistical tests and the input data are appropriate. In a different example, for the criterion about study preregistration (see prereg), we explain that preregistration is a useful practice to avoid HARKing (i.e., Hypothesizing After the Results are Known), and we provide resources for the most commonly used services for preregistration.

6 All supplementary materials of the paper are publicly available on OSF at https://doi.org/10.17605/osf.io/n25d6.
7 See the preregistration document at https://doi.org/10.17605/osf.io/k35w4.
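To make the degrees-of-freedom rationale concrete, consider a worked example of our own (not taken from the paper or its supplement): for a classic independent-samples t-test, the degrees of freedom should equal n1 + n2 − 2, so a reported df that disagrees with the reported sample sizes signals exclusions or a different test (for example, a Welch correction).

    # Sanity-check a reported df for an independent-samples t-test (illustrative values).
    expected_df <- function(n1, n2) n1 + n2 - 2

    n1 <- 20; n2 <- 20
    reported_df <- 28

    expected_df(n1, n2)                  # 38
    reported_df == expected_df(n1, n2)   # FALSE: the numbers do not line up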

3.2 Insights

Below, we describe notable insights from the criteria and the development process. Some insights are facts that, we believe, are not well known. Others are caveats for future researchers who will use this criteria set.

3.2.1 Downloading CHI papers for free (for a limited time), if you know where to look. Since 2016, SIGCHI have made the CHI proceedings available without any paywall restriction at this open-proceedings page.8 Although this page indicates that the proceedings are "permanent open access," the availability is subject to the ACM OpenTOC program, which is still in the pilot phase and could be discontinued in the future [3]. Additionally, it seems that OpenTOC pages are not indexed by search engines, which limits the discoverability of this access channel.

3.2.2 Supplementary materials are free on the ACM Digital Library. The ACM policy [2] indicates that supplementary materials on the ACM Digital Library can be downloaded for free, even if the paper itself is not. This fact makes the supplementary materials on the ACM Digital Library compatible with the FAIR principles. Nevertheless, the supplementary materials for each paper are displayed as one zip file. This presentation impairs the discoverability of its content, especially when the paper is behind a paywall.

3.2.3 Nuances among openness and transparency terms. The terms "free," "open," "public," and "transparent" are closely related. However, we found two cases where their nuances matter. In the first case, the ACM Digital Library marks the top-left corner of some paper webpages with either "Open Access," "Free Access," or "Public Access." Only Open Access papers can be accessed without a paywall at the time of publication and in perpetuity. Public Access papers become open after an embargo period mandated by the funding agencies. For the last category, Free Access papers are freely accessible for a limited period, determined by ACM, before being locked behind a paywall. The second case highlights the difference between transparency and openness. Some papers share research artifacts, such as questionnaires, in an appendix of the paper. Although this practice is transparent, the questionnaire is not open if the paper is behind a paywall. To avoid depending on the availability of the paper, research artifacts should be shared as separate materials in an open repository. In the criterion extra-fair, we assess whether research artifacts are shared at a location that meets the FAIR principles. A paper may meet this criterion by publishing its supplementary materials on FAIR repositories (e.g., OSF) or on the ACM Digital Library (as discussed in the previous subsection). Papers that share research artifacts only in the appendix meet this criterion only when the paper itself is either open-access or public-access.

8 https://sigchi.org/conferences/conference-proceedings/, last accessed January 2023.

Table 1: A summary of research ethics, openness, and transparency criteria for evaluating research papers. See Sup. 3 for full definitions. This table is also available in Excel format in Sup. 4. Columns: code | Criterion | Sources | Phase‡ | Auto§.

Criteria for Research Ethics
irb | Did the study receive approval from an institutional review board? | [55] | D | Def
consent (reported)* | Was written consent obtained from study participants? | [55] | D/C | Def
consent (form shared)* | Do supplementary materials include the consent form? | [55] | D/C | Def
study-compensation* | Was participants' compensation explained in the paper? | [55] | D/R | Def
anon | Was any data anonymization used? | [1, 103] | R | Scr
face-photo* | Are facial photos in the paper shared with consent? Is privacy being protected? | [1, 24, 87] | R | PP
vulnerable* | Were any ethical measures taken to support vulnerable participants? | [79, 97] | D | No
animal* | Were any ethical measures taken to support animals? | [30] | D | Scr

Criteria for Openness
paywall-acmdl† | Is the paper in ACM DL available as open access? | [4] | R | No
free-pdf-extern† | Is the paper PDF available on external platforms other than ACM DL? | [46] | R | PP
extra | Are any research artifacts beyond the paper provided anywhere? | [96] | R | Scr
extra-exist* | Do all provided research artifacts exist at the location specified in the paper? | [101] | R | Scr
extra-fair*† | Do any of the locations of provided artifacts satisfy the FAIR principle? | [101] | R | PP

Criteria for Transparency
prereg | Was the study preregistered? | [27, 65] | D | Def
share-stimuli* | Are study stimuli (except survey questionnaires) archived? | [95] | D/R | Scr
share-survey* | Are questionnaires or surveys archived? | [95] | D/R | Scr
share-interview-guide* | Is the interview guide archived? | [95] | D/R | Scr
share-study-protocol | Is the study protocol archived? | [73] | D/R | Scr
justify-n-qal* | Was the sample size justified (qualitative studies)? | [19] | D | Scr
justify-n-qan* | Was the sample size justified (quantitative studies)? | [53, 74] | D | Def
demographics* | Was the demographics information of the participants described? | [39] | C/R | Def
condition-assignment* | Did the study properly explain study design (e.g., grouping, IDVs)? | [94] | D/R | Scr
specify-qal-analysis* | Is the qualitative data analysis approach named or explicitly described? | [95][65] | A/R | Scr
share-analysis-code* | Is quantitative data analysis code shared? | [95][65] | A/R | Scr
qal-data-raw* | Is raw qualitative data shared? | [95][65] | R | Scr
qal-data-processed* | Is processed qualitative data shared? | [95][65] | R | Scr
qan-data-raw* | Is raw quantitative data shared? | [95][65] | R | Scr
qan-data-processed* | Is processed quantitative data shared? | [95][65] | R | Scr
share-software* | Is the source code of the software shared? | [95] | R | Scr
share-hardware* | Is the code of the hardware shared? | [95] | R | Scr
share-sketch* | Is any hand-drawn sketch shared? | — | R | Scr

Criteria for Reporting (i.e., frequentist analysis, estimation analysis, qualitative reporting)
stat-descriptive (cen. tend.)* | For each key dependent variable on the interval or ratio scale, was the sample central tendency reported? | [28][54] | A/R | Scr
stat-descriptive (variability)* | For each key dependent variable, was the sample variability reported? | [28][54] | A/R | Scr
stat-descriptive (cat. data)* | Was the sample reported for each key dependent variable on the nominal or ordinal scale (categorical data)? | [28][54] | A/R | Scr
stat-clear-procedure* | Is the statistical procedure for data analysis clearly named? | [28] | A/R | Scr
stat-normality* | When the normality assumption is required by the statistical procedure, was the assumption assessed? | [5, 20, 54, 94] | A/R | Scr
stat-other-assumptions* | When the statistical procedure requires additional assumptions, were they assessed? | [54, 94] | A/R | Scr
stat-parameters (df)* | Were degrees of freedom reported? | [54, 94] | A/R | Scr
stat-parameters (test value)* | Were the test statistic and all test parameters reported (e.g., the t-value)? | [54, 94] | A/R | Scr
stat-parameters (p-value)* | Were p-values reported? | [54, 94] | A/R | Scr
stat-effect-size* | For the effects that were tested, were effect sizes reported? | [54, 94, 110] | R | Scr
stat-ci* | For the effects that were tested, were their confidence intervals reported? | [28, 94] | R | Scr
estimates-interval* | Were interval estimates reported? | [29] | R | Scr
estimates-vis-uncertainty* | Was the uncertainty of the effect visualized? | [29] | R | Scr
qal-interview-report* | Did the study properly report themes and quotes? | [55] | R | No

* Evaluated on the applicable subset of empirical papers. See Section 3.1 for explanation.
† See additional discussion about these openness criteria in Section 3.2.
‡ Study Phase: D, C, A, and R stand for Study Design, Data Collection, Data Analysis, and Reporting, respectively.
§ Potential for Automation: Def: Definitely, Scr: Screening, PP: Potentially Possible, and No: Difficult to Automate. Details in Section 3.2.4.



3.2.4 A potential for a screening system. For the majority of the criteria, it is possible to narrow down parts of a paper for assessment based on keywords (for a complete list of keywords, see Sup. 5). This insight indicates the potential to automate (fully or partially) the assessment of some criteria. We describe a proof-of-concept system in Section 6. Based on this system, we indicate the potential of a screening system for each criterion in the last (Auto) column of Table 1. We labeled them as 'definitely' (i.e., criteria with high accuracy in our system), 'potentially possible' (i.e., criteria that might require advanced techniques like Computer Vision, not attempted in our tool), 'screening' (i.e., criteria where automation can narrow down some papers or parts of them, but manual checks are required), and 'no' (i.e., criteria that we believe require manual inspection). Six out of the eight criteria we attempted could be checked automatically with high accuracy (> 0.80) and F1 scores (> 0.75). For one of the criteria (condition-assignment), our proof-of-concept system yielded a high accuracy of 0.81 but an F1 score of 0.74, narrowly missing our desired 0.75 threshold. One criterion (anon) might benefit from machine-screening, but the content requires humans to manually do the checking. The proportions reported in Section 5 are based solely on the manual review effort; we did not rely on the results of the screener tool when reporting the results.
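The accuracy and F1 thresholds above can be checked against manual labels with a simple confusion-matrix calculation. The sketch below is ours and uses invented labels purely to show the arithmetic behind the 0.80 accuracy and 0.75 F1 cut-offs.

    # Compare screener predictions against the manual codes for one criterion.
    manual    <- c("Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No")   # human coders
    predicted <- c("Yes", "No", "Yes", "No",  "No", "Yes", "Yes", "No")  # screening tool

    tp <- sum(predicted == "Yes" & manual == "Yes")
    fp <- sum(predicted == "Yes" & manual == "No")
    fn <- sum(predicted == "No"  & manual == "Yes")

    accuracy  <- mean(predicted == manual)
    precision <- tp / (tp + fp)
    recall    <- tp / (tp + fn)
    f1        <- 2 * precision * recall / (precision + recall)

    c(accuracy = accuracy, f1 = f1)   # compare against the 0.80 / 0.75 thresholds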


Figure 2: The search results of ‘open science,’ ‘reproducibility,’ ‘replicability,’ ‘replication crisis,’ or ‘research ethics’ from the ACM DL. The y-axis shows the number of matched papers. The majority of the matched papers were from after 2017.

4 METHOD

To investigate the changes in research ethics, openness, and transparency practices in HCI, we applied the criteria above to assess papers from two proceedings of the ACM Conference on Human Factors in Computing Systems (CHI). We chose CHI for three reasons: (1) Its once-per-year camera-ready deadline is a single cut-off point. The cut-off point provides a clear separation between years, unlike journal publications where the duration between initial submission and publication varies across papers. (2) CHI conferences have considerable numbers of papers that span a broad range of HCI application areas. (3) For many years, CHI conferences hosted many events (SIG discussions, workshops, research presentations) that contributed to the discourse on research ethics, openness, and transparency. These events might have changed the awareness and understanding of these issues among their attendees. In this study, we investigate how the field of HCI has progressed in addressing issues related to research ethics, openness, and transparency. This study will help us understand the extent to which practices in research ethics, openness, and transparency have been reported and implemented in the CHI literature. Additionally, given the tension between practices in research ethics versus transparency [25, 36, 88, 95], we exploratorily investigate how transparency practices can differ between papers that deal with more ethical constraints (e.g., studies with vulnerable populations) and papers that deal with lesser ethical constraints (e.g., studies without vulnerable populations). This finding will provide an understanding of whether the tension is actually reflected in researchers' practices. Methodological deviations from the preregistered study plan are explained in Sup. 6. The study protocol had institutional review board (IRB) approval.

4.1 Samples

Proceeding Selection. We used the proceedings of CHI 2022, which was the most recent volume at the time of this research. Additionally, we searched the abstracts of SIGCHI Sponsored Conferences between 2000–2022 with any of the following terms: open science, reproducibility, replicability, replication crisis, or research ethics. These searches resulted in 91 papers (full search results are listed in Sup. 7). As shown in Figure 2, 80% of these were published after 2017, suggesting it to be a watershed moment. Therefore, we selected the proceedings of CHI 2017 and CHI 2022. We used only the "Paper" publication type because these papers have undergone rigorous referee vetting processes.9

Sample Sizes. The sheer number of papers each year (600 in CHI 2017 and 637 in CHI 2022) exceeds our resources. For this study, we analyzed samples of papers. To determine the sample size, we considered the effect size from past surveys of transparent research practices among CHI authors [95]. Among the respondents of their surveys, the transparent research practices across all dimensions were, on average, 27.6% among CHI 2017 and 31% among CHI 2018 authors. The difference is 3%. We used this information to conduct an a priori power analysis based on the z-test of the difference between two independent proportions in G*Power [33] at α = 0.05 and a power of 0.80 (for details, see Sup. 8). The power analysis suggested sampling 119 papers from CHI 2017 and 127 papers from CHI 2022.

Sampling Procedure. The paper sampling procedure is shown in Figure 3. The organization of sessions at CHI conferences groups together thematically related papers [51]. We used this fact to inform a stratified sampling [69], ensuring that we drew across the application areas covered by the conference. The number of sessions (149 and 139 in CHI 2017 and 2022, respectively) is higher than the planned sample size. Therefore, we randomly sampled the sessions, and for each session randomly sampled a paper. Then, seven co-authors read each paper's title and abstract and coded its contribution type according to Wobbrock and Kientz [108]'s taxonomy of HCI research contributions. The coding of contribution types was later re-checked by one of the co-authors (i.e., different from the person who initially coded it). In case of any mismatches, the coding was refined.10 If there were no empirical contributions, paper replacements occurred through subsequent rounds of sampling and coding with the same procedure. In one case, during data analysis (as explained in the deviations from preregistration, Sup. 6), while conducting consistency checks on the articles, we found one article from CHI 2017 that was an experience report of case studies of design processes. Although the cases contain empirical studies, the article did not report on those empirical results and rather reported on the designers' experience working on these cases. Thus, we excluded this article, and our sample size was reduced from 246 to 245.

9 See https://www.acm.org/publications/policies/pre-publication-evaluation, last accessed January 2023.
10 A reviewer pointed out that we could have better controlled this step by calculating inter-rater reliability, and we agree. We disclose that we overlooked this decision.

Figure 3: A flow diagram showing the paper sampling process. Sessions and papers were selected at random, one paper per session, and sampling was repeated until the sample sizes indicated by the power analysis were reached. From the CHI 2017 proceedings (149 sessions, 600 papers), 129 papers were screened; 10 were not empirical, 119 were analyzed, and 1 was later excluded (see Section 4.1), leaving 118 papers in the study. From the CHI 2022 proceedings (139 sessions, 637 papers), 146 papers were screened; 19 were not empirical, and 127 were analyzed and included in the study.
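The session-stratified draw can be expressed compactly in R. The sketch below is ours, not the authors' script: it assumes a data frame named proceedings with one row per paper and a session column, and it stops at the target sample size before the manual screening for empirical contributions.

    # Stratified sampling: draw sessions at random, then one paper per drawn session.
    set.seed(2022)                      # for a reproducible draw
    target_n <- 127                     # from the a priori power analysis

    sample_papers <- function(proceedings, target_n) {
      sessions <- sample(unique(proceedings$session))       # random session order
      picks <- lapply(sessions[seq_len(target_n)], function(s) {
        in_session <- proceedings[proceedings$session == s, ]
        in_session[sample(nrow(in_session), 1), ]            # one paper per session
      })
      do.call(rbind, picks)
    }
    # Non-empirical papers would then be screened out and replaced in further rounds.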

4.2 Coding procedure

Based on the title and the abstract of each paper, we coded the broad types of the method (qualitative, quantitative, mixed-method), research questions (exploratory or confirmatory), and the participants (human or animal). These broad codes allow us to subsequently subset the papers for each set of criteria. The assessment of the relevant subset of papers followed the procedure described in Sup. 3. We use papers as the unit of analysis. For papers with multiple studies, a criterion can be satisfied by any of the studies described in the paper. In contrast, a criterion was marked as violated only when all studies failed to meet it. This hysteresis and the assessment at the paper level is a lower bar to meet than assessing each study individually. Nevertheless, these choices are necessary for us to avoid making judgments about the relative importance of the studies in each paper. These choices also avoid the page limit constraint that was present only in CHI 2017. Seven co-authors contributed to coding and were each assigned to work on about 35 papers. The seven coders were two postdocs (in HCI and psychology) and five PhD students (all in HCI).

The PhD students have 2–4 years of experience working on HCI research. The overall process of criteria definition and coding was supervised by two HCI professors who are experts on topics related to transparency, openness, and research ethics. This assignment allowed each coder to become familiar with the structure and context of their papers. To prevent overload, we worked in rounds; each round focused on 4–11 criteria drawn from similar aspects. Each round comprised these steps:
(1) An expert coder created a detailed procedure (see Figure 1 or Sup. 1).
(2) Each coder independently coded their papers.
(3) Each coder independently coded an additional five papers randomized from other coders.
(4) We calculated an agreement score [61] from these twice-coded papers.
(a) If the agreement score was lower than 90%, each coder coded three additional random papers, and we calculated a second agreement score from this set.
(b) If the second agreement score was still lower than 90%, two expert coders inspected all of the twice-coded papers and resolved the inconsistencies.
(5) The resolutions were discussed and resolved in group meetings.
(6) Each coder then updated their work accordingly.
(7) Finally, each coder checked the work of another coder. The pairing in each round rotated according to a Latin square to avoid systematic influences between coders.
(8) The detailed procedure (Figure 1 or Sup. 1) was updated to incorporate insights from the discussion.
Two coders with statistical knowledge created the codes for the statistical criteria (e.g., stat-descriptive and estimates-interval). Each coder worked on half of the papers with NHST statistics (a total of 117). After the first round of coding, 23.4% of the papers were unclear. We discussed these papers with a co-author who is an expert in statistics. After the consultation, the coders revised their work. Finally, each coder independently coded five random papers from another coder. The agreement score of the twice-coded papers was 96.4% for the statistical criteria. For the other criteria, the agreement scores were 95.6% on average (SD = 7.5%). In total, 30 review meetings were conducted, and 45 criteria were extracted out of these activities.
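The agreement score in step 4 is, in essence, the proportion of twice-coded papers on which two coders assigned the same label for a criterion. A minimal sketch of that check follows; the vectors and their values are ours, and the cited agreement measure [61] may differ in detail.

    # Percent agreement between two coders on the same twice-coded papers.
    coder_a <- c("Yes", "Yes", "No", "Yes", "No")
    coder_b <- c("Yes", "No",  "No", "Yes", "No")

    agreement <- mean(coder_a == coder_b) * 100
    agreement   # 80: below the 90% bar, so more papers would be double-coded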

4.3 Data Analysis

As explained in the deviations from preregistration (see Sup. 6), in the preregistration we planned to use a two-sample z-test for proportions [113] to compare the two years on each criterion. However, several criteria have boundary probabilities (close to 0 or 1) because the cell frequencies differ greatly. Z-tests and their confidence intervals are therefore not reliable in these cases [7, p. 164]. Instead, we calculated the confidence intervals using the Miettinen-Nurminen asymptotic score method, which does not suffer from the boundary cases [32, p. 250]. We used the implementation in the diffscoreci() function from the PropCIs package [83] in R. The analysis script and data are provided in Sup. 9. If a criterion was met, we coded it as "Yes", otherwise as "No". The proportions for each criterion were calculated based on the applicable denominator subset as mentioned in Tables 1–6 in Sup. 3. For two criteria (share-study-protocol & share-survey), we used the label "partially." For both, we treated "partially" as "Yes" to consider bare minimum practices in survey and protocol sharing. For study-compensation, we coded "Paid with the amount mentioned," "Paid without the amount mentioned," and "Not paid (or voluntary)" as "Yes" (as a sign of transparency in the compensation policy) and "Not mentioned" as "No." For face-photo, we coded "Face is not clear," "Face is masked or cropped," and "Consent collected" as "Yes" since they support participants' photo privacy. For the criterion vulnerable, if any additional ethical measures were reported to protect the well-being of the concerned vulnerable population beyond general practices, we coded the criterion as "Yes," otherwise "No." To understand any potential trade-off between research ethics and transparency practices, we focused on the factor of vulnerability. Researchers usually consider data collected from vulnerable participants as sensitive, and they are concerned that transparency may disclose participants' identities and cause negative consequences for them [25, 36, 88, 95]. Therefore, we distinguished between papers that deal with more ethical constraints (i.e., studies with vulnerable populations, coded as "Yes") and papers that deal with lesser ethical constraints (i.e., studies without vulnerable populations, coded as "No"). To determine the relation between ethical constraints and comparable transparency practices, we consider data sharing to be a relevant dimension of transparency since it might include sensitive information. A paper's data sharing is coded as "Yes" if either raw or processed data has been shared, irrespective of the paper being quantitative, qualitative, or mixed-method. We visualize the relationship between participant types (i.e., being vulnerable or not) and data sharing practices through a mosaic plot using the geom_mosaic() function from the ggmosaic extension of ggplot2 in R. Additionally, to check for potential selection bias due to our sampling approach, we compared the proportions of papers with Best Paper awards or Honorable Mentions between the two years using a two-sample z-test. Finally, while we defined and extracted 45 criteria, we test and visualize 41 criteria. animal, estimates-interval, and estimates-vis-uncertainty had only n = 1 paper in their respective subsets. Also, for share-sketch, we did not test differences between years because determining a meaningful denominator for this criterion requires a deep understanding of each paper's contributions and research methods.
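As an illustration of the interval computation, here is a sketch using invented counts, under the assumption that the package referenced above is the one distributed on CRAN as PropCIs: for a criterion met by 60 of 127 CHI'22 papers and 30 of 118 CHI'17 papers, the score interval for the difference in proportions would be obtained as follows.

    # Asymptotic score confidence interval for the difference between two proportions,
    # as used for the per-criterion year comparison (counts here are hypothetical).
    library(PropCIs)

    x_2022 <- 60; n_2022 <- 127
    x_2017 <- 30; n_2017 <- 118

    diffscoreci(x_2022, n_2022, x_2017, n_2017, conf.level = 0.95)
    # An interval lying entirely above zero would indicate an improvement in CHI 2022.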

5 RESULTS

Table 2 summarizes the characteristics of the selected papers. Most papers were mixed-method (40%) or qualitative (35%), while almost one-fourth of the papers were quantitative. Moreover, there were more qualitative papers in the CHI 2022 sample (39%) compared with quantitative papers (20%), whereas in the CHI 2017 sample,11 there were equal amounts of qualitative and quantitative papers. The increase in the number of qualitative over quantitative papers might be due to the COVID-19 pandemic, which might have limited quantitative empirical research practices during the lockdowns (e.g., in-person lab experiments). In both years, around one-fifth of the papers conducted confirmatory research while the rest conducted exploratory research. In terms of contribution, all the papers were empirical. Some papers also had other contributions, with artifact contributions being the next most common in both samples (i.e., 54% of CHI'17 and 59% of CHI'22 papers). The vast majority of the papers in both years (i.e., > 98%) recruited human participants for data collection or data annotation, whereas the rest used datasets (i.e., human data collected earlier). At least one-fifth of the selected papers received either a Best Paper award or an Honorable Mention. Although the proportion of awarded papers was greater in the CHI'22 subset (25%) than in the CHI'17 subset (21%), this difference was not statistically significant (z = 0.74, p = .46). This result suggests that our samples were not biased toward higher-quality papers in one year over the other.

11 Henceforth, in the results section, we will refer to CHI 2017 and CHI 2022 as CHI'17 and CHI'22, respectively.
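The award comparison can be reproduced with a standard two-proportion test. The sketch below is our own check, not the authors' script; prop.test (without continuity correction) reports a chi-squared statistic whose square root equals the absolute value of the two-sample z statistic.

    # Two-sample test of the award proportions: 32/127 (CHI 2022) vs. 25/118 (CHI 2017).
    awards <- c(32, 25)
    papers <- c(127, 118)

    res <- prop.test(awards, papers, correct = FALSE)
    sqrt(res$statistic)   # about 0.74, matching the reported z
    res$p.value           # about 0.46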

5.1 Changes in Research Ethics

Among the criteria for research ethics, we found good improvements in CHI'22, where four out of seven criteria showed better adherence to research ethics (see Figure 4). Practices of acquiring IRB approval (irb), reporting consent collection (consent), and being transparent about participant compensation (study-compensation) all almost doubled during the last five years. While these findings show substantial improvements in ethics criteria, these practices still have room for improvement, as they were observed in only around half of the CHI'22 papers. We also observed evidence of combined transparency and ethics in CHI'22, where four papers shared their complete consent form as supplementary material. With regard to participant compensation (see Figure 5A), the most common approach was mentioning the exact amount or type of compensation (32%), whereas a few papers reported payment without any detail (3%). Practices regarding preserving photo privacy (face-photo) and anonymization (anon) did not change between CHI'17 and CHI'22 (see Figure 4). 27% of the papers used participants' photos in their figures. Among the papers with participants' photos, 42% did not show any protective measures (see Figure 5B). The two top measures were (i) not depicting participants' faces clearly (e.g., photos taken from the back side) and (ii) obfuscating their faces (e.g., masking). Surprisingly, only a few papers (8%) reported collecting consent from participants before publishing their photos. Regarding research with vulnerable populations, we found that 29% of the papers used data of participants from a vulnerable population (vulnerable), such as minorities or children. Among the papers with vulnerable populations, awareness about research ethics in the CHI'22 papers was higher than in the CHI'17 papers (50% vs. 27%); however, the confidence interval is close to zero, suggesting that at best the improvement is negligible. Figure 5C shows the details of these vulnerabilities. The most frequent types of participants were people with disabilities (27%), potentially vulnerable students (14%), and children (11%). In our sample, we found only one paper with animal participants (animal). That paper did not report any ethical measures.

Table 2: Characteristics of the paper samples in CHI 2017 and CHI 2022 (CHI 2017 | CHI 2022 | Total).

Method
Mixed-method papers: 44 (37.3%) | 53 (41.7%) | 97 (39.6%)
Qualitative papers: 37 (31.4%) | 49 (38.6%) | 86 (35.1%)
Quantitative papers: 37 (31.4%) | 25 (19.7%) | 62 (25.2%)

Hypothesis testing
Exploratory research: 94 (79.7%) | 100 (78.7%) | 194 (79.2%)
Confirmatory research: 24 (20.3%) | 27 (21.3%) | 51 (20.8%)

Contribution
Empirical: 118 (100.0%) | 127 (100.0%) | 245 (100.0%)
Artifact: 64 (54.2%) | 75 (59.1%) | 139 (56.7%)
Methodological: 7 (5.9%) | 8 (6.3%) | 15 (6.1%)
Theoretical: 0 (0.0%) | 9 (7.1%) | 9 (3.7%)
Literature survey: 5 (4.2%) | 1 (0.8%) | 6 (2.4%)
Dataset: 0 (0.0%) | 3 (2.4%) | 3 (1.2%)
Opinion: 0 (0.0%) | 3 (2.4%) | 3 (1.2%)

Participant
Papers with human participants: 116 (98.3%) | 125 (98.4%) | 241 (98.4%)
Papers with animal participants: 0 (0.0%) | 1 (0.8%) | 1 (0.4%)

Award
Papers with award: 25 (21.2%) | 32 (25.2%) | 57 (23.3%)

Total: 118 | 127 | 245

Figure 4: Research Ethics. (Right) Proportion of sampled papers meeting each of the ethics-related criteria. (Left) Confidence intervals for the difference in proportions between CHI'17 and CHI'22; a CI to the right of the red line indicates an improvement in CHI'22. n represents the number of papers applicable to each criterion.


[Figure 5 (three pie charts): A. Participant compensation practices (n = 241 eligible papers): not mentioned, paid with amount mentioned, paid without amount mentioned, not paid (or voluntary). B. Photo privacy protection practices (n = 66 eligible papers): privacy may be violated, face is not clear, face is masked or cropped, consent collected. C. Vulnerable population types (n = 71 eligible papers): people with disabilities, potentially vulnerable students, children, patients, older adults, other minorities, pregnant women, adolescents, gender-specific minorities, people with low socio-economic status, and more than one category.]

Figure 5: A. Summary of participant compensation practices in CHI'17 and CHI'22 papers. B. Authors' practices with regard to photo privacy. We acknowledge that consent for publishing photos might have been collected verbally but not reported in the paper. C. Summary of the vulnerabilities identified in papers involving participants from vulnerable populations.

[Figure 6 (chart): for each openness-related criterion (PAYWALL-ACMDL, FREE-PDF-EXTERN, EXTRA, EXTRA-EXIST, EXTRA-FAIR), the proportion of sampled papers meeting the criterion in CHI 2017 and CHI 2022, together with the confidence interval of the difference of proportions computed with diffscoreci(). For EXTRA-EXIST, the difference is not apparent, but the practice is already high in both samples.]

Figure 6: Openness practices: (Left) The confidence interval of the difference in proportions between CHI'17 and CHI'22. A CI to the right of the red line indicates improvements in CHI'22. (Right) Proportion of sampled papers meeting each of the openness-related criteria. n represents the number of papers applicable to each criterion.

In our sample, we found only one paper with animal participants (animal). The paper did not report any ethical measures.

5.2 Changes in Openness Practices

Figure 6 summarizes our findings about openness practices. On the ACM DL, papers will eventually be available without a paywall (paywall-acmdl) if they are either open access or public access (i.e., publicly accessible after an embargo period). This was the case for 31% of the CHI'17 papers and 55% of the CHI'22 papers. In both samples, open-access papers outnumbered public-access papers (82 vs. 25). The proportion of papers accessible on other platforms (free-pdf-extern) was considerably higher: 77% of the papers from CHI'22 and 92% from CHI'17 were available on external platforms.


Table 3: Summary of external sources used for sharing PDFs.

Type                                               CHI 2017      CHI 2022      Total
Long-term archival plan (e.g., ArXiv)              37 (34.3%)    34 (34.7%)    71 (34.5%)
Transitory (e.g., personal website)                37 (34.3%)    39 (39.8%)    76 (36.9%)
Commercial (e.g., ResearchGate)                    34 (31.5%)    14 (14.3%)    48 (23.3%)
Long-term but accidental (e.g., Wayback Machine)   0 (0%)        11 (11.2%)    11 (5.3%)
Total                                              108           98            206

Table 3 shows that, among these external sources, only 34% of papers are shared on repositories with a long-term archival plan, for example, university/institutional/library research information systems, OSF, or ArXiv. A slightly higher percentage (37%) are shared on personal/lab/company websites, GitHub, or Google Drive, which do not guarantee longevity (i.e., labeled as transitory). 23% are shared on commercial social networking websites such as ResearchGate or Semantic Scholar, which is not permitted by the ACM publication policy [4] and might constitute copyright infringement [46]. Interestingly, for 11 papers, their PDFs on the ACM DL were cached by the Wayback Machine and can be found by web search. In the long term, these papers will remain publicly available; however, it is unclear why these papers were crawled and cached, and for this reason we do not recommend depending on the Wayback Machine for archiving and disseminating research. The complete breakdown for Table 3 can be found in Sup. 10.

From the readers' perspective, accessing most CHI papers should be possible, as the papers are published open access or public access, or can be found elsewhere by searching Google Scholar. The higher proportion of CHI'17 papers accessible on external platforms may be explained by the short period between the release of the CHI'22 proceedings and our data collection (April to July 2022): the authors may not have had an opportunity to upload their work to external platforms, or search engines may not yet have crawled it, prior to our data collection. Moreover, given the higher rate of open access among the CHI'22 papers, some authors might not be interested in sharing their paper elsewhere.

Next, for sharing any additional research artifacts beyond the paper (extra), we found a substantial increase in sharing practices between the two CHIs: a higher proportion of CHI'22 papers (62%) shared additional materials (vs. 27% in CHI'17), through supplementary materials on the ACM DL, supplementary materials shared in external repositories such as OSF or GitHub, or appendices at the end of the papers. Additionally, with regard to the existence of purportedly shared material (extra-exist), the ratio between the two years was very close: among the papers that shared additional research artifacts, 94% of CHI'17 and 92% of CHI'22 papers properly provided the promised materials. We discovered nine cases of missing data from either shared repositories or appendices: project website (n = 3), GitHub (n = 2), Harvard Dataverse (n = 1), ACM DL (n = 1), appendix (n = 1), and a broken link (n = 1). Seven of these locations were not FAIR-compatible. Finally, regarding the availability of the shared research artifacts (extra-fair), despite the higher percentage of availability in CHI'22 (63% vs. 50% in CHI'17), the confidence interval crossed zero, suggesting that at best the improvement is negligible.

5.3 Changes in Transparency Practices

We observed improvements in CHI'22 compared with CHI'17 for 10 out of 17 transparency-related criteria (see Figure 7). We found great improvements in the sharing of interview protocols of qualitative papers (share-interview-guide): while only 2% of the CHI'17 papers shared their interview protocols, this ratio increased to 25% in CHI'22. Such improvements were also seen in other aspects of qualitative papers, where more papers from the CHI'22 sample clearly specified their data analysis procedure (specify-qual-analysis: 79% in CHI'22 vs. 58% in CHI'17). A higher proportion of the CHI'22 sample shared qualitative data (qual-data-raw: 7% vs. 0% and qual-data-processed: 17% vs. 4%). Similarly, more quantitative studies from the CHI'22 sample shared data analysis procedures (share-analysis-code: 10% vs. 1%). Sharing raw and processed quantitative data (quan-data-raw & quan-data-processed) also increased in CHI'22. Despite this improvement, overall data sharing is still low in both qualitative and quantitative studies.

With regard to justifying the sample size, for both qualitative and quantitative studies, the confidence interval capturing zero suggests that the difference is inconclusive (justify-n-qual: 95% CI [-0.046, 0.144], justify-n-quan: 95% CI [-0.010, 0.131]). In both samples, the majority of the papers described the demographics of their participants (∼90%), but the difference between the two samples was negligible (demographics: 95% CI [-0.019, 0.139]). In most papers (∼81%) from both samples, the authors clearly explained their study design, but the difference between the two samples was negligible (condition-assignment: 95% CI [-0.106, 0.192]).

We also found improvements in sharing study protocols (share-study-protocol) and multimedia stimuli (share-stimuli), which could facilitate replicability; however, the ratios are still very low (9–13%). Surprisingly, there was no increase in sharing surveys or questionnaire materials (share-survey: 95% CI [-0.056, 0.221]). Additionally, based on the confidence intervals, we cannot be certain about the improvement in sharing software and hardware (share-software: 95% CI [-0.058, 0.199]; share-hardware: 95% CI [-0.059, 0.079]). For share-sketch, we found eight CHI'17 and five CHI'22 papers that shared their sketches (i.e., we did not test the difference; see the last paragraph of Section 4.3). Finally, while no CHI'17 papers preregistered their studies, we found seven cases in CHI'22 with preregistration (prereg). This improvement equates to only 6% of CHI'22 papers preregistering their study, indicating that this practice is still far from being common.

Are papers that involved more ethically concerning entities less likely to adhere to transparency practices? Earlier debates in the CHI community [25, 36, 88, 95] revealed that some transparency practices and research ethics might be at odds with each other. Researchers in sensitive domains may need to contend with research decisions that sacrifice transparency for ethical practices. For instance, researchers might need to forgo sharing data when conducting research with vulnerable populations to minimize the likelihood of making the participants identifiable. For this reason, the prevalence of transparency practices could be dramatically different for research where ethical concerns are dominant as opposed to other research.


[Figure 7 (chart): for each transparency-related criterion (PREREG, SHARE-STIMULI, SHARE-SURVEY, SHARE-INTERVIEW-GUIDE, SHARE-STUDY-PROTOCOL, JUSTIFY-N-QUAN, JUSTIFY-N-QUAL, DEMOGRAPHICS, CONDITION-ASSIGNMENT, SPECIFY-QUAL-ANALYSIS, SHARE-ANALYSIS-CODE, QUAL-DATA-RAW, QUAL-DATA-PROCESSED, QUAN-DATA-RAW, QUAN-DATA-PROCESSED, SHARE-SOFTWARE, SHARE-HARDWARE), the proportion of sampled papers meeting the criterion in CHI 2017 and CHI 2022, together with the confidence interval of the difference of proportions computed with diffscoreci().]

Figure 7: Transparency practices: (Left) The confidence interval of the difference in proportions between CHI'17 and CHI'22. A CI to the right of the red line indicates improvements in CHI'22. (Right) Proportion of sampled papers meeting each of the transparency-related criteria. n represents the number of papers applicable to each criterion.

As a preliminary investigation, we used the vulnerability of study participants as a proxy for ethical concerns. We divided all sampled papers into two groups, those with vs. without study participants from a vulnerable population (71 vs. 174 papers), and compared the availability of any type of data. We show the results in a mosaic plot in Figure 8 to emphasize the difference in the number of papers between the two groups. We found that 17% of the papers with non-vulnerable participants shared at least one type of data (raw or processed, qualitative or quantitative), whereas only 8% of the papers with vulnerable populations shared their data. However, the confidence interval (95% CI [-0.017, 0.162]) indicates that, at best, the difference is negligible. This preliminary result only slightly supports the ethics-transparency trade-off, and it should be taken with caution due to the large difference between the number of papers in the two groups.
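The figure captions in this section report confidence intervals for the difference of proportions computed with diffscoreci(), an R function from the PropCIs package. As a hedged stand-in in Python (the language of the paper's released tool), the sketch below uses Newcombe's hybrid score interval built from two Wilson intervals, a different score-based method than the one diffscoreci() implements, so its bounds are close to but not identical with the reported values. The worked example reuses the counts behind Figure 8 (29 of 174 non-vulnerable-population papers vs. 6 of 71 vulnerable-population papers sharing data).

from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a single proportion."""
    p = successes / n
    center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return center - half, center + half

def newcombe_diff_ci(x1, n1, x2, n2, z=1.96):
    """Newcombe hybrid score CI for p1 - p2 (an approximate stand-in for R's diffscoreci())."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_ci(x1, n1, z)
    l2, u2 = wilson_ci(x2, n2, z)
    d = p1 - p2
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper

# Data sharing: non-vulnerable (29/174) vs. vulnerable (6/71) populations.
print(newcombe_diff_ci(29, 174, 6, 71))
# Approximately (-0.018, 0.159), close to the reported 95% CI [-0.017, 0.162],
# though not identical because the interval construction differs.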


[Figure 8: A. Mosaic plot of data sharing: among papers with a non-vulnerable population, 145 shared no data and 29 shared at least one type of data; among papers with a vulnerable population, 65 shared no data and 6 shared at least one type of data. B. Confidence interval of the proportional difference (non-vulnerable minus vulnerable): papers with non-vulnerable populations might be inclined to share more data, but the result is not clear-cut because the confidence interval captures zero.]

Figure 8: Ethics–Transparency Trade-off: Visualization of data sharing practices based on the involvement of participants from vulnerable populations.

5.4 Lack of Change in Reporting Practices

Overall, the transparency practices related to reporting quantitative findings did not change between the two CHIs (see Figure 9). The findings showed that the ratios for some of these unchanged practices were already high, such as reporting central tendency (stat-descriptive), clarity of statistical tests (stat-clear-procedure), and reporting main statistical values (stat-parameters) like the p-value and F-value (79–100%). Surprisingly, more CHI'17 papers reported their degrees of freedom. Regarding statistical assumptions, while more CHI'22 papers reported their normality assumption (stat-normality), the use of other statistical assumptions (stat-other-assumptions) did not improve (95% CI [-0.087, 0.215]). We also checked the reporting of effect sizes (stat-effect-size) and confidence intervals (stat-ci). These were reported slightly more often in CHI'22: while more than half of the quantitative papers in CHI'22 reported effect sizes, only around one-fifth reported confidence intervals for data variability. However, the differences are inconclusive given that the confidence intervals capture zero (stat-effect-size: 95% CI [-0.034, 0.323], stat-ci: 95% CI [-0.050, 0.226]). We found only one paper with estimation analysis, and it properly reported data using interval estimates and visualized confidence intervals (estimates-interval & estimates-vis-uncertainty). Finally, reporting practices for qualitative results improved (qual-interview-report): while only 64% of CHI'17 papers properly reported their qualitative data, the rate was 90% in CHI'22.
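Since the paragraph above recommends reporting effect sizes with confidence intervals rather than test statistics alone, here is a small illustrative sketch (not taken from the paper's materials) that computes Cohen's d for two independent groups together with a common large-sample approximation of its 95% CI; the group data and names are hypothetical.

from math import sqrt
from statistics import mean, stdev

def cohens_d_with_ci(group_a, group_b, z=1.96):
    """Cohen's d for two independent samples, with an approximate 95% CI
    (large-sample normal approximation of the standard error of d)."""
    n1, n2 = len(group_a), len(group_b)
    s1, s2 = stdev(group_a), stdev(group_b)
    # Pooled standard deviation.
    s_pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (mean(group_a) - mean(group_b)) / s_pooled
    # Approximate standard error of d.
    se = sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, (d - z * se, d + z * se)

# Hypothetical task-completion times (seconds) for two interface conditions.
baseline = [52, 47, 60, 55, 49, 58, 61, 50]
redesign = [44, 41, 50, 46, 39, 48, 45, 43]
d, ci = cohens_d_with_ci(baseline, redesign)
print(f"Cohen's d = {d:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")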

6 A PROOF-OF-CONCEPT SCREENING TOOL

We proposed 45 criteria for research ethics, openness, and transparency. This sheer number of criteria could be prohibitive for authors and reviewers to keep in mind. We envision a future where a tool for research ethics, openness, and transparency is integrated into a writing environment, similar to spelling and grammar checkers. As the authors finish drafting each section of their paper, the tool assesses their text and reminds them to consider relevant criteria. The user can then (1) add information, (2) tell the tool to remind them later, or (3) decide that the suggestion is incorrect or irrelevant to their research method or domain. Reviewers will be assisted by a different tool: after reading the paper, the reviewer can go over the list of criteria and click on relevant criteria that they forgot to pay attention to during their first read. The tool will point to the locations in the text that satisfy the criterion or indicate that it could not find such text. The reviewer can use this feedback to selectively re-read the paper to verify. During the discussion phase, the lists of criteria from all reviewers are tabulated to provide a basis for discussion. In this vision, human authors and reviewers play an active role in making judgments. Their roles are necessary because of their knowledge about the research method, the domain, and the research settings. To enable these tools, we need a system that can detect whether the text meets a criterion. Below, we describe design considerations, a proof-of-concept system, and a preliminary evaluation on eight criteria. The Python code for this proof-of-concept system is open-source on GitHub (https://github.com/petlab-unil/replica) for future research.

6.1 Design considerations

Some criteria apply to a subset of papers; for example, statistical reporting criteria do not apply to qualitative papers. Additionally, fulfilling one criterion may require sacrificing others. Combining a set of criteria into one score might inhibit nuanced discussion. Finally, one paper may present a combination of multiple studies that use different methods. Therefore, each criterion should be evaluated independently (D1) at the level of a sentence or group of sentences (D2).

An ideal system should be accurate in both (1) giving a positive response for a paper that satisfies the criterion (true positive) and (2) giving a negative response for a paper that does not meet the criterion (true negative). In reality, there is a trade-off between these goals. For example, a system could achieve a perfect true-negative rate by simply labeling that no sentences satisfy the criterion. However, this approach would incur false negatives: sentences that actually satisfy the criterion are left undetected. This approach is also unhelpful because the whole paper would then need to be checked manually. As mentioned in our vision at the beginning of Section 6, both the authors and the reviewers who use such tools will already be familiar with the paper's content. For text that is likely to fulfill the criterion, the system should bias towards highlighting the text rather than missing it; in other words, the system should prioritize reducing false negatives (D3). However, too many false positives could be distracting for the users. Therefore, the system should provide a possibility for the user to narrow down the positive results to the most confident ones (D4).


[Figure 9 (chart): for each reporting-related criterion (STAT-DESCRIPTIVE (central tendency), STAT-DESCRIPTIVE (variability), STAT-DESCRIPTIVE (categorical data), STAT-CLEAR-PROCEDURE, STAT-NORMALITY, STAT-OTHER-ASSUMPTIONS, STAT-PARAMETERS (degree of freedom), STAT-PARAMETERS (test value), STAT-PARAMETERS (p-value), STAT-EFFECT-SIZE, STAT-CI, QUAL-INTERVIEW-REPORT), the proportion of sampled papers meeting the criterion in CHI 2017 and CHI 2022, together with the confidence interval of the difference of proportions computed with diffscoreci().]

Figure 9: Reporting practices: (Left) The confidence interval of the difference in proportions between CHI'17 and CHI'22. A CI to the right of the red line indicates improvements in CHI'22. (Right) Proportions of sampled papers meeting each of the reporting criteria. n represents the number of papers applicable to each criterion. For the degrees-of-freedom criterion, we exclude two papers because they used path analysis and Cox regression. To our knowledge, degrees of freedom are not conventionally reported for each test in these models, perhaps to retain readability. The implementation of these models in R also did not output degrees of freedom per test.

6.2 Implementation

We implemented a proof-of-concept system that detects whether each sentence satisfies a criterion. The system architecture is shown in Figure 10. After preprocessing the PDF into a set of individual sentences, each sentence is independently analyzed in two steps. First, the system determines how similar the input sentence is to any reference sentence. Second, for each sentence that is adequately similar, the system assigns a probability that the sentence could be labeled by each of the criterion's keywords. The input sentences that pass both tests are positive results, and any paper with a positive sentence is classified as satisfying the criterion. Below are the implementation details.

6.2.1 Preprocessing PDF into sentences. The paper PDF files were processed with the pdfminer.six library (https://github.com/pdfminer/pdfminer.six), resulting in text with information such as hierarchical structure, font style, and blank spaces. We use this information to distinguish the body text from section titles. The body text was then segmented into sentences, and each sentence was individually used as input for the next two steps (D2).
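A minimal sketch of this preprocessing step, assuming pdfminer.six's high-level extract_text() helper and a naive regular-expression sentence splitter; the released tool additionally uses layout and font information to separate body text from section titles, which is omitted here.

import re
from pdfminer.high_level import extract_text  # pip install pdfminer.six

def pdf_to_sentences(pdf_path: str) -> list[str]:
    """Extract raw text from a PDF and split it into rough sentences."""
    text = extract_text(pdf_path)
    # Undo hyphenation and collapse line breaks introduced by the PDF layout.
    text = re.sub(r"-\n", "", text)
    text = re.sub(r"\s+", " ", text)
    # Naive sentence segmentation; a real pipeline might use spaCy or NLTK instead.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [s.strip() for s in sentences if len(s.strip()) > 20]

# Example usage (hypothetical file name):
# for sentence in pdf_to_sentences("paper.pdf"):
#     print(sentence)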


[Figure 10 (pipeline diagram): for each criterion 1..n, with its reference sentences and criterion keywords, the article PDF is parsed into all sentences; similarity matching with BERTScore keeps the sentences with a score above the threshold; a zero-shot classifier then scores these sentences against the keywords, and if any sentence reaches a probability above 0.78, the article satisfies the criterion.]

Figure 10: Architecture of the proof-of-concept screening tool.

6.2.2 Filtering based on sentence-similarity. For each criterion (D1), we manually extracted 5–10 reference sentences that make the papers fulfill the criterion. Here are two examples of reference sentences for the irb criterion: “The study had institutional research ethics approval.” and “The University of [...] institutional review board ([...]) approved our study.”. Input sentences that are adequately similar to any reference sentence are positive results. The similarity is scored with the BERTScore method [112] using contextual embeddings from the pre-trained language model DistilBERT [82]. To compare an input sentence to a reference sentence, the system uses the language model to convert each word into a vector that encodes its contextual information, and the cosine similarity is computed between the vectors of the input and reference sentence word tokens. This approach is superior to exact or approximate pattern matching because it does not restrict the matching to specific grammatical roles. For example, the two reference sentences above would have a high similarity score despite having their subject and object reversed. There are three variants of BERTScore: precision, recall, and F1. Precision is calculated by greedily matching the input to the reference, whereas recall is calculated in the opposite direction. The F1 score is the harmonic mean of precision and recall, eliminating the emphasis on either direction of the comparison. For this reason, we chose BERTScore's F1 as the similarity score. When the score exceeds a threshold, the input sentence is a hit. This threshold hyperparameter is empirically derived for each criterion by testing the system with a small set of random samples. We set the threshold relatively low to reduce false negatives (D3). The output of this step could already be used for screening purposes, with the system highlighting the hits and human authors or reviewers inspecting them to confirm.
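A minimal sketch of the similarity-filtering step, assuming the open-source bert-score package and a DistilBERT model; the reference sentences and the threshold value below are illustrative rather than the ones used in the released tool.

from bert_score import score  # pip install bert-score

# Illustrative reference sentences for the irb criterion (see the examples above).
IRB_REFERENCES = [
    "The study had institutional research ethics approval.",
    "The university institutional review board approved our study.",
]
SIMILARITY_THRESHOLD = 0.55  # hypothetical per-criterion threshold, kept low (D3)

def is_hit(sentence: str, references: list[str]) -> bool:
    """Return True if the sentence is adequately similar to any reference sentence."""
    # Compare the input sentence against every reference and keep the best F1.
    _, _, f1 = score(
        [sentence] * len(references),
        references,
        model_type="distilbert-base-uncased",
        lang="en",
    )
    return float(f1.max()) >= SIMILARITY_THRESHOLD

print(is_hit("Our protocol was approved by the IRB of our institution.", IRB_REFERENCES))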

6.2.3 Further narrowing down the hits with a text classifier. The number of hits can be further narrowed down to reduce false positives (D4). This step is formulated as a Definition-Wild Zero-shot Text Classification task [111]. The classifier infers the probability that a keyword entails, that is, logically follows, the input sentence. For example, the input sentence “The study had institutional research ethics approval.” can be logically followed by “This example is about IRB.”, which is created from IRB as the keyword. The keywords are drawn from the list in Sup. 5. A sentence with an adequately high entailment probability for any of the keywords is a hit, and any paper with a hit is considered to satisfy the criterion. The entailment probability threshold was empirically determined to be 0.78 for all criteria. This approach requires no other training data, and it is scalable to the large set of criteria we presented in this paper and possibly additional criteria in the future. We deviated from Yin et al. [111]'s work by using the BART-large language model pre-trained on the MNLI dataset [102] because it was found to perform better than the BERT model (see https://joeddav.github.io/blog/2020/05/29/ZSL.html). Both this and the previous step were implemented with the PyTorch [70] and HuggingFace [109] libraries.
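A minimal sketch of the narrowing step, assuming the HuggingFace Transformers zero-shot-classification pipeline with the facebook/bart-large-mnli checkpoint; the keyword list is illustrative (the actual keywords are listed in Sup. 5), while the hypothesis wording and the 0.78 threshold come from the description above.

from transformers import pipeline  # pip install transformers torch

# BART-large fine-tuned on MNLI, used here as a zero-shot entailment classifier.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

ENTAILMENT_THRESHOLD = 0.78                                # threshold reported above
IRB_KEYWORDS = ["IRB", "ethics approval", "review board"]  # illustrative keywords

def satisfies_criterion(hit_sentences: list[str], keywords: list[str]) -> bool:
    """A paper satisfies the criterion if any hit sentence entails any keyword."""
    for sentence in hit_sentences:
        result = classifier(
            sentence,
            candidate_labels=keywords,
            hypothesis_template="This example is about {}.",
            multi_label=True,  # score each keyword independently
        )
        if max(result["scores"]) >= ENTAILMENT_THRESHOLD:
            return True
    return False

hits = ["The study had institutional research ethics approval."]
print(satisfies_criterion(hits, IRB_KEYWORDS))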

6.3 Evaluation

We assessed the system on eight criteria, selected to strike a balance between criteria that are simpler to identify (e.g., irb) and those that are more difficult to assess (e.g., justify-n-quan). The results are shown in Table 4. Most of the criteria have an imbalanced class distribution: there are many more papers that do not satisfy a criterion than papers that do. To account for this imbalance, we report precision, recall, and the F1 score along with accuracy to better understand the tool's true performance [48, 77]. A higher recall indicates more true positives and fewer false negatives identified by the tool. A higher precision means fewer false positives, indicating that when the tool identified positives, they were indeed positive in the human coding. The F1 score is the harmonic mean of precision and recall, indicating the balance between the two.

As explained in Section 6.2, each criterion was evaluated independently (D1) in two steps: filtering based on similarity and narrowing down with a text classifier. However, the definitions of condition-assignment and demographics are more granular, with three and four sub-criteria, respectively (see Sup. 2 for the sub-criteria under condition-assignment and demographics). Hence, the evaluation was done in two steps for each sub-criterion under condition-assignment and demographics. Finally, we performed a logical operation between the respective sub-criteria (a logical AND for condition-assignment and a logical OR for demographics) to determine the outcome.
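As a small illustration of this evaluation procedure (not the authors' evaluation script), the sketch below computes accuracy, precision, recall, and F1 from per-paper tool predictions and human labels, and shows the logical AND/OR combination of hypothetical sub-criterion outcomes for condition-assignment and demographics.

def binary_metrics(predicted: list[bool], actual: list[bool]) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 for one criterion."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(actual), "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical per-paper outcomes for an imbalanced criterion.
tool_says = [True, False, False, True, False, False, True, False]
human_says = [True, False, False, False, False, False, True, False]
print(binary_metrics(tool_says, human_says))

# Combining sub-criterion outcomes for a single paper:
condition_assignment_subs = [True, True, False]   # requires all sub-criteria -> AND
demographics_subs = [False, True, False, False]   # requires any sub-criterion -> OR
print(all(condition_assignment_subs), any(demographics_subs))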


Table 4: Evaluation of our tool using accuracy, F1 score, precision, and recall for eight example criteria. The SciScore paper [60] tested three criteria in common with our work; we provide the results from their paper for comparison.

                        Our tool                                                        SciScore [60]
Criterion               Accuracy  F1 score  Precision  Recall  # of papers meeting it  F1 score  Precision  Recall
irb                     0.89      0.85      0.85       0.85    87                      0.81      0.85       0.80
consent (reported)      0.81      0.78      0.73       0.84    98                      0.95      0.96       0.93
study-compensation      0.83      0.79      0.70       0.89    85                      -         -          -
anon                    0.68      0.37      0.24       0.77    30                      -         -          -
prereg                  0.99      0.80      0.75       0.86    7                       -         -          -
justify-n-quan          0.99      0.83      0.83       0.83    6                       0.65      0.74       0.60
demographics            0.87      0.92      1.00       0.85    217                     -         -          -
condition-assignment    0.81      0.74      0.74       0.75    90                      -         -          -

The criteria irb, prereg, justify-n-quan, and demographics perform well in both accuracy and F1. The criteria consent and study-compensation have very high accuracy and a reasonable F1 score; they have higher recall than precision, indicating that very few of the articles satisfying these criteria were missed, but that there were more false positives. These results indicate that the system satisfies design consideration D3 as expected. The criterion condition-assignment was challenging for the tool because some independent variables are implicit. For example, a longitudinal study may not explicitly state that time is its independent variable; therefore, such a paper could be misclassified as not meeting the criterion. We observed high recall for anon; however, we also found many false positives, since the model could not distinguish between lines referring to anonymization, participant codes, and data exclusion.

We compared our results with SciScore (https://www.sciscore.com/) to further validate our tool. SciScore is a proprietary automated tool that assesses research articles based on their adherence to criteria on rigor and transparency in biomedical science [12, 60]. There were three criteria that SciScore and our work have in common: irb, consent, and justify-n-quan. Our system yielded a higher F1 on irb and justify-n-quan, while SciScore was better on consent. Since SciScore was trained on a large number of labeled sentences, whereas our approach does not involve any training, these preliminary results indicate that our approach is highly promising. Based on these findings, we label each criterion in Table 2 on its potential for being identified with a screener tool. For instance, irb and consent, which have a conventional reporting format, can definitely be identified with a screener tool. For criteria with high recall, such as anon, the tool can be used to screen potential sections in the articles. Criteria like face-photo and free-pdf-extern might require a combination of natural language processing and computer vision.

Our approach could still be improved in several ways. The performance of the system depends on the efficiency of the PDF parser, which is prone to errors due to the variety of PDF styles; using a more reliable format, for example, source files or HTML, could improve the performance. Our system assessed individual sentences. Incorporating information about the section of the paper where the text is located might help in increasing the confidence of the prediction.

Furthermore, information about the research methods might help to further rule out false positives, for example, by checking justify-n-quan only for quantitative papers. Such information could be obtained at submission time, for example, from PCS keyword checkboxes or subcommittee choices. Finally, the availability of labeled data and the use of a model trained only on scientific articles, such as SciBERT [13], might also improve the results.

7 DISCUSSION

Overall, our findings showed positive changes in CHI 2022: the authors of the CHI 2022 papers adhered to research ethics, openness, and transparency more than those of CHI 2017. In terms of the main practices, there were more improvements in research ethics and transparency and fewer in openness and reporting. However, despite such improvements, the overall rates are still low. For example, among the criteria related to research ethics, the highest rate was for consent forms, with 57% adherence. This is alarming: almost half of the user studies in the most recent CHI proceedings either did not use a consent form at all or did not report the consent collection in their papers. Reporting of consent collection for using photos was even lower among papers that used participants' facial photos in their figures.

The rates for transparency practices are even lower than for research ethics. For example, despite a great improvement in sharing interview protocols, 75% of the qualitative studies still do not share their interview guides. Therefore, there is room for improvement in transparency practices at CHI. One area that requires more improvement is artifact sharing: authors should be more mindful about sharing the software and hardware designed or tested in their studies. Similarly, sample size justification for both quantitative and qualitative studies is still not a common practice in HCI.

In terms of transparency in reporting results, most of the practices for reporting quantitative statistical tests are at acceptable levels. However, the use and reporting of statistical assumptions, such as normality, and the reporting of additional information, including effect sizes and confidence intervals, should be improved to provide more insight into the results. Such practices could be further improved if CHI or other HCI outlets mandated existing reporting guidelines such as the APA guidelines [10]. We also find an interesting improvement in reporting qualitative findings. We observed a considerable number of CHI 2017 papers that did not systematically report their qualitative findings; some even reported conducting interviews but did not report the findings. In contrast, in CHI 2022, most of the qualitative studies followed standard reporting practices.


Our findings showed that around 29% of the selected CHI papers conducted research with vulnerable populations; these papers deal with more sensitive data and should apply stricter ethical constraints. They shared relatively less data compared with papers with non-vulnerable populations, showing that ethical constraints may play against transparency. However, other reasons might exist, such as a lack of knowledge of software and techniques to anonymize datasets. More in-depth studies (e.g., interviewing researchers) are required to shed light on the reasons for the lack of transparency and on how to systematically enhance transparency practices in ethically constrained studies.

What do these results suggest, in general terms? We distilled four implications, ranging from measurement and raising awareness to checking adherence to research ethics, openness, and transparency.

7.1 Self-Report Surveys vs. Actual Practices

We noticed an interesting discrepancy between our findings and the results of Wacharamanotham et al. [95], where CHI 2018 and CHI 2019 authors self-reported their transparency practices. As discussed above, the success rates of the criteria for transparency practices were relatively low: for instance, the average data sharing and artifact sharing rates in CHI 2022 were 11% and 12%, respectively. However, in the study by Wacharamanotham et al. [95], the rates of similar practices were higher (17% and 40%). This difference may indicate that when researchers self-report their practices, they can be optimistic and truly believe they share what is needed to replicate their study, whereas the factual data indicate that in practice they adhere less. The difference could also be related to our fine-grained criteria: instead of considering data and artifact sharing as a general practice, we searched for specific practices for specific data types and artifacts (e.g., quantitative raw data). On a different note, participants in Wacharamanotham et al. [95]'s study might indicate that they would share data upon request, whereas in our study we assessed the actual materials or the absence thereof. It is worth mentioning that earlier studies showed that the response and compliance rates for supplementary material requests are not high among the authors of papers promising to provide "data available upon request" [52]. Additionally, the chance of data availability rapidly declines as papers become older [92].

7.2 Raising Awareness

Among the three practices of transparency, openness, and ethics, the Guide to a Successful Submission on the CHI 2022 website (https://chi2022.acm.org/for-authors/presenting/papers/guide-to-a-successful-submission/) provides the clearest instructions for transparency practices. This guide encourages authors of quantitative studies to ensure that their studies are reproducible. It also encourages authors to do beta-testing to check the steps taken for data collection and data analysis. Moreover, the authors of qualitative studies are encouraged to be transparent about their study procedure and data analysis. Finally, sharing study materials and using FAIR-compatible repositories such as OSF are advised in the guide. Some of the changes we observed, such as sharing more data and data analysis procedures, might be due to the instructions in the CHI submission guideline. Surprisingly, some of the criteria (e.g., sharing via FAIR-compatible repositories) did not improve, despite being mentioned in the submission guideline. Two criteria that remained almost unchanged between CHI 2017 and CHI 2022 were justifying sample sizes for qualitative and quantitative studies; interestingly, these criteria do not appear in the submission guide.

We recommend that HCI venues provide specific guidelines with detailed instructions on how to meet each practice. Some of these practices require special skills and training, such as transparency practices for quantitative studies [96]. The current CHI guidelines somewhat support transparency, but they should also raise awareness about research ethics and openness. For instance, the fact that many studies did not report consent collection is worrisome. Thus, it is crucial to increase the awareness and knowledge of the community to move forward in all aspects of research design, execution, and reporting.

7.3 How to Make Further Progress

A promising approach for improving research ethics, openness, and transparency would be for HCI journal editors or program chairs of HCI conferences to define sharp and measurable criteria in the submission guidelines. One might think that using checklists in the submission platforms could help improve these practices, letting authors skim through the different practices and self-report them. However, the limitation of such checklists is that most authors might be optimistic while filling in those forms and answer differently from their actual practices [95]. Another approach to improve adherence is to instruct associate chairs and reviewers about these criteria and provide specific instructions to check them upon inspection of the paper. The transparency instructions given to reviewers on the CHI 2023 website (see the Transparency paragraph in the Guide to Reviewing Papers at https://chi2023.acm.org/submission-guides/guide-to-reviewing-papers) are identical to those of the guide given to authors. We believe CHI should also provide specific guidelines for reviewers on how to assess these practices.

More recently, some venues in computer science (e.g., the PoPETs Artifact Review, https://petsymposium.org/artifacts.php) have introduced a separate review process for the research artifacts of accepted papers. Such a practice can support replication and reproducibility. It can also ensure that all promised data are available and adequately prepared, avoiding the problem of missing data, as seen in 8% of the papers in our samples. However, we acknowledge that applying a separate review process at CHI might not be feasible given the large volume of submissions and the extra workload added to reviewers in a limited period. To reduce this workload, ideally, each aspect of the criteria could be reviewed by one reviewer, either by assignment or by volunteering. At a minimum, we suggest checking these criteria for the papers nominated for the best paper award. This step would ensure that at least the distinguished papers meet the highest standards and become examples for future research.



Ideally, the review process should include all submitted papers. Looking ahead, this creates an opportunity for intelligent screening systems to play an essential role in the review process as scalable solutions. Paper submission platforms such as Precision Conference Solutions (PCS) could encourage authors to pass their submissions (i.e., paper and supplementary materials) through a screening system before the submission deadline. Even if such systems are not entirely reliable, they can produce an evaluation list of approvals and warnings, and the authors could go over the warnings and further clarify their practices for the specific criteria not approved by the system. PCS could then be used to submit the system's output and the authors' clarifications. This practice can assist reviewers by reducing their effort. Future studies on machine learning and natural language processing (NLP) should concentrate on developing reliable screening systems for assessing research ethics, openness, and transparency.

Abuse of explicit evaluation criteria and screening tools is a possibility. Authors who lack integrity could add keywords to make their manuscript pass the screening without actually having satisfied the criteria. Such actions should be rare because they require more work (i.e., gaming the system) than simply complying with the requirements. The situation would be more precarious from a reviewer's perspective: for example, when a screening tool reports that some criteria are unmet, an uncaring reviewer could misuse this result to quickly dismiss the research without a proper (and fair) evaluation. Therefore, should such tools be used to support reviewers, it is essential to educate reviewers about the usage and limitations of this approach. Authors and reviewers should use screening tools the same way we use spelling or grammar checkers: these tools should help human users focus their limited attention and time on areas requiring more in-depth evaluation. Additionally, the final decisions still require humans in the loop, precisely because of the trade-offs between transparent practices and compliance with research ethics. The criteria described in this work provide a concrete starting point for HCI sub-communities, whether organized by methodology or application domain, to discuss and develop guidelines that help authors and reviewers navigate these trade-offs. We also hope that educators will use these criteria and their subsequent refinements to teach young researchers to make their future contributions more ethical, transparent, and open.

7.4 Extra Care is Needed With Students

Among the vulnerable populations identified in our samples, the two most frequent groups were participants with disabilities and students (see Figure 5C). Research involving people with disabilities has dedicated research communities (e.g., the ASSETS conference, SIGACCESS, and a dedicated CHI subcommittee) that can promote the appropriate treatment of participants through their discourse and peer review. The student population would similarly benefit from a dedicated research community that ensures their equitable treatment as participants. University students are frequently used in HCI research: according to Linxen et al. [56], almost 70% of CHI 2016–2020 papers involved study participants who were university students or graduates.


We consider students a vulnerable population because, in some situations, they might be unable to fully protect their interests [84, p. 35][47]. Specifically, students might be subject to power dynamics because the evaluation of their learning progress might be conducted by the same institution recruiting them [81]. This power dynamic is particularly potent if the researchers are directly involved in the students' courses. Without an appropriate informed consent process, coercion to participate in a study could occur directly or through an indirect assumption that participation leads, for example, to higher grades. These situations could also threaten a study's internal validity by biasing students' responses in favor of the study condition they perceive as their instructors' work. Therefore, we call on researchers to (1) avoid recruiting students from their own courses or department as study participants unless strictly required by the research goal or method (for instance, when the research requires the researcher to take an active role, as in participant observation), (2) disclose the power relationship, or the lack thereof, explicitly, and (3) discuss ethical implications and safeguards in their paper. Reviewers should also be vigilant and ask authors to address these points. We also believe that the CHI community should examine the ethical issues of using students as study participants.

7.5 Limitations

Our study findings are subject to several limitations. First, the difference in page-limit constraints between the two proceedings could be a confounding variable. CHI 2017 had a page limit per article, whereas in CHI 2022 the authors were encouraged to adjust the length of their papers to their contribution (i.e., not a strict limit). The median page count (not including the references and appendix sections) for the sampled CHI 2017 papers was 10 pages, versus 14 pages for CHI 2022. The page limit might therefore have forced some authors to sacrifice some information or relegate it to supplementary materials. To mitigate this limitation, we thoroughly inspected the appendices and supplementary materials of the papers.

Second, our research focused on "good" research practices that can support transparency, openness, and being ethically sound. Nevertheless, a good practice is not the same as a "correct" practice. It was out of our scope to assess the correctness of the research practices (e.g., whether a paper used the correct statistical test or whether its degrees of freedom matched the sample size). Although some of our criteria could be more in-depth in that respect, given that we focused on changes between two years, our assessment of the goodness of the practices was consistent across the two years and should be reliable.

Third, our criteria list might not be exhaustive. For transparency, we could also consider the reporting of pre-processing steps such as data transformation and exclusion of outliers [44], or data blinding (a helpful step before analyzing data, particularly in randomized controlled studies, to reduce experimenter bias [76]). Additionally, our work only touched upon transparency criteria for qualitative research: our criteria only cover the interview method and generic descriptions of analysis methods. Unlike quantitative analysis, where the data analysis code details the analysis process, qualitative research has more diverse data-analytic artifacts, and not all artifacts are generated across all research methods.


Research methods also differ in how research artifacts connect to transparency. For example, sharing codebooks demonstrates transparency for coding-reliability methods such as Framework Analysis [85], whereas codebooks may reveal little about the analysis process for interpretive methods. These differences call for more nuanced criteria specific to each qualitative analysis method.

Lastly, our results with regard to condition-assignment should be interpreted with caution. Although we excluded papers without user studies and papers with non-experimental studies, we noticed different types of studies (e.g., exploratory vs. confirmatory, basic vs. factorial designs) that may lead authors to follow different reporting styles.

8 CONCLUSION

Within HCI and across scientific disciplines, there have been many initiatives in recent years to improve research ethics, openness, and transparency in empirical research. In this study, we show the current status of the adoption of research ethics, openness, and transparency in HCI by assessing the changes in the CHI literature between CHI 2017 and CHI 2022. This work makes the following contributions: we gathered pertinent criteria for research ethics, openness, and transparency, and operationalized them for evaluation based on published papers and research materials; we present the current state of practices on these issues and evaluate the developments between CHI 2017 and CHI 2022; and we propose a proof-of-concept screening system to assess a subset of the criteria. This study shows that adherence to these practices is improving overall. However, the HCI community still needs to mature further by setting the highest standards in terms of research ethics, openness, and transparency. We hope that studies like this one will contribute to raising awareness and standards.

ACKNOWLEDGMENTS

We would like to thank Vincent Vandersluis for proofreading this article and the ACM DL team for providing a clear answer to our question about the ACM publication policy. Lastly, we sincerely thank the anonymous reviewers for their very constructive feedback and encouragement.


REFERENCES [1] Jacob Abbott, Haley MacLeod, Novia Nurain, Gustave Ekobe, and Sameer Patil. 2019. Local Standards for Anonymization Practices in Health, Wellness, Accessibility, and Aging Research at CHI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–14. https://doi.org/10.1145/3290605.3300692 [2] ACM. 2019. ACM Policy on Submission, Hosting, Access, and Ownership of Digital Artifacts. https://www.acm.org/publications/policies/digital-artifacts [3] ACM. 2019. Permanent Access. https://www.acm.org/publications/policies/ permanent-access https://www.acm.org/ [4] ACM. 2022. Open Access Publication & ACM. publications/openaccess [5] Balazs Aczel, Barnabas Szaszi, Alexandra Sarafoglou, Zoltan Kekecs, Šimon Kucharský, Daniel Benjamin, Christopher D. Chambers, Agneta Fisher, Andrew Gelman, Morton A. Gernsbacher, John P. Ioannidis, Eric Johnson, Kai Jonas, Stavroula Kousta, Scott O. Lilienfeld, D. Stephen Lindsay, Candice C. Morey, Marcus Munafò, Benjamin R. Newell, Harold Pashler, David R. Shanks, Daniel J. Simons, Jelte M. Wicherts, Dolores Albarracin, Nicole D. Anderson, John Antonakis, Hal R. Arkes, Mitja D. Back, George C. Banks, Christopher Beevers, Andrew A. Bennett, Wiebke Bleidorn, Ty W. Boyer, Cristina Cacciari, Alice S. Carter, Joseph Cesario, Charles Clifton, Ronán M. Conroy, Mike Cortese, Fiammetta Cosci, Nelson Cowan, Jarret Crawford, Eveline A.

Crone, John Curtin, Randall Engle, Simon Farrell, Pasco Fearon, Mark Fichman, Willem Frankenhuis, Alexandra M. Freund, M. Gareth Gaskell, Roger Giner-Sorolla, Don P. Green, Robert L. Greene, Lisa L. Harlow, Fernando Hoces de la Guardia, Derek Isaacowitz, Janet Kolodner, Debra Lieberman, Gordon D. Logan, Wendy B. Mendes, Lea Moersdorf, Brendan Nyhan, Jeffrey Pollack, Christopher Sullivan, Simine Vazire, and Eric-Jan Wagenmakers. 2020. A Consensus-Based Transparency Checklist. Nature Human Behaviour 4, 1 (Jan. 2020), 4–6. https://doi.org/10.1038/s41562-019-0772-6 [6] Lena Fanya Aeschbach, Sebastian A.C. Perrig, Lorena Weder, Klaus Opwis, and Florian Brühlmann. 2021. Transparency in Measurement Reporting: A Systematic Literature Review of CHI PLAY. Proceedings of the ACM on Human-Computer Interaction 5, CHI PLAY (Oct. 2021), 233:1–233:21. https://doi.org/10.1145/3474660 [7] Alan Agresti. 2011. Score and Pseudo-Score Confidence Intervals for Categorical Data Analysis. Statistics in Biopharmaceutical Research 3, 2 (May 2011), 163–172. https://doi.org/10.1198/sbr.2010.09053 [8] Herman Aguinis and Angelo M. Solarino. 2019. Transparency and Replicability in Qualitative Research: The Case of Interviews with Elite Informants. Strategic Management Journal 40, 8 (2019), 1291–1315. https://doi.org/10.1002/smj.3015 [9] Alissa N. Antle. 2017. The Ethics of Doing Research with Vulnerable Populations. Interactions 24, 6 (Oct. 2017), 74–77. https://doi.org/10.1145/3137107 [10] APA. 2020. Publication Manual of the American Psychological Association, Seventh Edition. American Psychological Association (APA), IL, USA. https://apastyle.apa.org/products/publication-manual-7th-edition [11] Nick Ballou, Vivek R. Warriar, and Sebastian Deterding. 2021. Are You Open? A Content Analysis of Transparency and Openness Guidelines in HCI Journals. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI ’21). Association for Computing Machinery, New York, NY, USA, 1–10. https://doi.org/10.1145/3411764.3445584 [12] Anita Bandrowski and Martijn Roelandse. 2022. SciScore, a Tool That Can Measure Rigor Criteria Presence or Absence in a Biomedical Study. In The 1st International Conference on Drug Repurposing. ScienceOpen, Maastricht, Netherlands, 2. https://doi.org/10.14293/S2199-1006.1.SOR-.PPPXBQN6.v1 [13] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3615–3620. https://doi.org/10.18653/v1/D19-1371 [14] Pernille Bjorn, Casey Fiesler, Michael Muller, Jessica Pater, and Pamela Wisniewski. 2018. Research Ethics Town Hall Meeting. In Proceedings of the 2018 ACM Conference on Supporting Groupwork (GROUP ’18). Association for Computing Machinery, New York, NY, USA, 393–396. https://doi.org/10.1145/3148330.3154523 [15] Ángel Borrego and Francesc Garcia. 2013. Provision of Supplementary Materials in Library and Information Science Scholarly Journals. Aslib Proceedings: New Information Perspectives 65, 5 (Jan. 2013), 503–514. https://doi.org/10.1108/AP-10-2012-0083 [16] Virginia Braun and Victoria Clarke. 2006. Using Thematic Analysis in Psychology. Qualitative Research in Psychology 3, 2 (Jan. 2006), 77–101. https://doi.org/10.1191/1478088706qp063oa [17] Barry Brown, Alexandra Weilenmann, Donald McMillan, and Airi Lampinen. 2016.
Five Provocations for Ethical HCI Research. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). Association for Computing Machinery, New York, NY, USA, 852–863. https://doi.org/10.1145/2858036.2858313 [18] Amy S. Bruckman, Casey Fiesler, Jeff Hancock, and Cosmin Munteanu. 2017. CSCW Research Ethics Town Hall: Working Towards Community Norms. In Companion of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW ’17 Companion). Association for Computing Machinery, New York, NY, USA, 113–115. https://doi.org/10.1145/3022198.3022199 [19] Kelly Caine. 2016. Local Standards for Sample Size at CHI. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). Association for Computing Machinery, New York, NY, USA, 981–992. https://doi.org/10.1145/2858036.2858498 [20] Paul Cairns. 2007. HCI... Not as It Should Be: Inferential Statistics in HCI Research. In Proceedings of the 21st British HCI Group Annual Conference on People and Computers: HCI...but Not as We Know It - Volume 1 (BCS-HCI ’07). BCS Learning & Development Ltd., Swindon, GBR, 195–201. [21] Paul Cairns. 2019. Doing Better Statistics in Human-Computer Interaction. Cambridge University Press, Cambridge. https://doi.org/10.1017/9781108685139 [22] Peter Celec. 2004. Open Access and Those Lacking Funds. Science 303, 5663 (March 2004), 1467–1467. https://doi.org/10.1126/science.303.5663.1467c [23] Christopher D. Chambers. 2013. Registered Reports: A New Publishing Initiative at Cortex. Cortex 49, 3 (March 2013), 609–610. https://doi.org/10.1016/j.cortex.2012.12.016


[24] Mauro Cherubini, Kavous Salehzadeh Niksirat, Marc-Olivier Boldi, Henri Keopraseuth, Jose M. Such, and Kévin Huguenin. 2021. When Forcing Collaboration Is the Most Sensible Choice: Desirability of Precautionary and Dissuasive Mechanisms to Manage Multiparty Privacy Conficts. Proceedings of the ACM on Human-Computer Interaction 5, CSCW1 (April 2021), 53:1–53:36. https://doi.org/10.1145/3449127 [25] Lewis L. Chuang and Ulrike Pfeil. 2018. Transparency and Openness Promotion Guidelines for HCI. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems (CHI EA ’18). Association for Computing Machinery, New York, NY, USA, 1–4. https://doi.org/10.1145/3170427.3185377 [26] James Cliford (Ed.). 1990. Notes on (Field) Notes. Cornell University Press, NY, USA. https://www.jstor.org/stable/10.7591/j.ctvv4124m [27] Andy Cockburn, Carl Gutwin, and Alan Dix. 2018. HARK No More: On the Preregistration of CHI Experiments. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3173574.3173715 [28] Douglas Curran-Everett and Dale J. Benos. 2004. Guidelines for Reporting Statistics in Journals Published by the American Physiological Society. Physiological Genomics 18, 3 (Aug. 2004), 249–251. https://doi.org/10.1152/physiolgenomics. 00155.2004 [29] Pierre Dragicevic. 2016. Fair Statistical Communication in HCI. In Modern Statistical Methods for HCI, Judy Robertson and Maurits Kaptein (Eds.). Springer International Publishing, Cham, 291–330. https://doi.org/10.1007/978-3-31926633-6_13 [30] Nathalie Percie du Sert, Amrita Ahluwalia, Sabina Alam, Marc T. Avey, Monya Baker, William J. Browne, Alejandra Clark, Innes C. Cuthill, Ulrich Dirnagl, Michael Emerson, Paul Garner, Stephen T. Holgate, David W. Howells, Viki Hurst, Natasha A. Karp, Stanley E. Lazic, Katie Lidster, Catriona J. MacCallum, Malcolm Macleod, Esther J. Pearl, Ole H. Petersen, Frances Rawle, Penny Reynolds, Kieron Rooney, Emily S. Sena, Shai D. Silberberg, Thomas Steckler, and Hanno Würbel. 2020. Reporting Animal Research: Explanation and Elaboration for the ARRIVE Guidelines 2.0. PLOS Biology 18, 7 (July 2020), e3000411. https://doi.org/10.1371/journal.pbio.3000411 [31] Florian Echtler and Maximilian Häußler. 2018. Open Source, Open Science, and the Replication Crisis in HCI. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems (CHI EA ’18). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3170427.3188395 [32] Morten W Fagerland, Stian Lydersen, and Petter Laake. 2015. Recommended Confdence Intervals for Two Independent Binomial Proportions. Statistical Methods in Medical Research 24, 2 (April 2015), 224–254. https://doi.org/10. 1177/0962280211415469 [33] Franz Faul, Edgar Erdfelder, Albert-Georg Lang, and Axel Buchner. 2007. G*Power 3: A Flexible Statistical Power Analysis Program for the Social, Behavioral, and Biomedical Sciences. Behavior Research Methods 39, 2 (May 2007), 175–191. https://doi.org/10.3758/bf03193146 [34] Sebastian S. Feger, Cininta Pertiwi, and Enrico Bonaiuti. 2022. Research Data Management Commitment Drivers: An Analysis of Practices, Training, Policies, Infrastructure, and Motivation in Global Agricultural Science. Proceedings of the ACM on Human-Computer Interaction 6, CSCW2 (Nov. 2022), 322:1–322:36. https://doi.org/10.1145/3555213 [35] Sebastian S. Feger, Paweł W. 
Wozniak, Lars Lischke, and Albrecht Schmidt. 2020. ’Yes, I Comply!’: Motivations and Practices around Research Data Management and Reuse across Scientifc Fields. Proceedings of the ACM on Human-Computer Interaction 4, CSCW2 (Oct. 2020), 141:1–141:26. https://doi.org/10.1145/3415212 [36] Casey Fiesler, Christopher Frauenberger, Michael Muller, Jessica Vitak, and Michael Zimmer. 2022. Research Ethics in HCI: A SIGCHI Community Discussion. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA ’22). Association for Computing Machinery, New York, NY, USA, 1–3. https://doi.org/10.1145/3491101.3516400 [37] Casey Fiesler, Jef Hancock, Amy Bruckman, Michael Muller, Cosmin Munteanu, and Melissa Densmore. 2018. Research Ethics for HCI: A Roundtable Discussion. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems (CHI EA ’18). Association for Computing Machinery, New York, NY, USA, 1–5. https://doi.org/10.1145/3170427.3186321 [38] Christopher Frauenberger, Amy S. Bruckman, Cosmin Munteanu, Melissa Densmore, and Jenny Waycott. 2017. Research Ethics in HCI: A Town Hall Meeting. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’17). Association for Computing Machinery, New York, NY, USA, 1295–1299. https://doi.org/10.1145/3027063.3051135 [39] John Furler, Parker Magin, Marie Pirotta, and Mieke van Driel. 2012. Participant Demographics Reported in "Table 1" of Randomised Controlled Trials: A Case of "Inverse Evidence"? International Journal for Equity in Health 11, 1 (March 2012), 14. https://doi.org/10.1186/1475-9276-11-14 [40] Aakash Gautam, Chandani Shrestha, Andrew Kulak, Steve Harrison, and Deborah Tatar. 2018. Participatory Tensions in Working with a Vulnerable Population. In Proceedings of the 15th Participatory Design Conference: Short Papers, Situated Actions, Workshops and Tutorial - Volume 2 (PDC ’18). Association for Computing Machinery, New York, NY, USA, 1–5. https://doi.org/10.1145/3210604.3210629

CHI ’23, April 23–28, 2023, Hamburg, Germany

[41] Cliford Geertz. 1976. The Religion of Java. University of Chicago Press, Chicago, IL. https://press.uchicago.edu/ucp/books/book/chicago/R/bo3627129.html [42] {CHI guideline contributors in alphabetical order}, Pernille Bjørn, Fanny Chevalier, Pierre Dragicevic, Shion Guha, Steve Haroz, Helen Ai He, Elaine M. Huang, Matthew Kay, Ulrik Lyngs, Joanna McGrenere, Christian Remy, Poorna Talkad Sukumar, and Chat Wacharamanotham. 2019. Proposal for Changes to the CHI Reviewing Guidelines. Technical Report. Zenodo. https://doi.org/10. 5281/zenodo.5566172 [43] Tamarinde L. Haven, Timothy M. Errington, Kristian Skrede Gleditsch, Leonie van Grootel, Alan M. Jacobs, Florian G. Kern, Rafael Piñeiro, Fernando Rosenblatt, and Lidwine B. Mokkink. 2020. Preregistering Qualitative Research: A Delphi Study. International Journal of Qualitative Methods 19 (Jan. 2020), 1–13. https://doi.org/10.1177/1609406920976417 [44] Constance Holman, Sophie K. Piper, Ulrike Grittner, Andreas Antonios Diamantaras, Jonathan Kimmelman, Bob Siegerink, and Ulrich Dirnagl. 2016. Where Have All the Rodents Gone? The Efects of Attrition in Experimental Research on Cancer and Stroke. PLOS Biology 14, 1 (Jan. 2016), 1–12. https://doi.org/10.1371/journal.pbio.1002331 [45] Kasper Hornbæk, Søren S. Sander, Javier Andrés Bargas-Avila, and Jakob Grue Simonsen. 2014. Is Once Enough? On the Extent and Content of Replications in Human-Computer Interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’14). Association for Computing Machinery, New York, NY, USA, 3523–3532. https://doi.org/10.1145/2556288.2557004 [46] Hamid R. Jamali. 2017. Copyright Compliance and Infringement in ResearchGate Full-Text Journal Articles. Scientometrics 112, 1 (July 2017), 241–254. https: //doi.org/10.1007/s11192-017-2291-4 [47] David W. Jamieson and Kenneth W. Thomas. 1974. Power and Confict in the Student-Teacher Relationship. The Journal of Applied Behavioral Science 10, 3 (July 1974), 321–336. https://doi.org/10.1177/002188637401000304 [48] Nathalie Japkowicz and Mohak Shah. 2011. Evaluating Learning Algorithms: A Classifcation Perspective. Cambridge University Press, Cambridge, England. [49] Matthew Kay, Steve Haroz, Shion Guha, and Pierre Dragicevic. 2016. Special Interest Group on Transparent Statistics in HCI. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’16). Association for Computing Machinery, New York, NY, USA, 1081–1084. https://doi.org/10.1145/2851581.2886442 [50] Matthew Kay, Steve Haroz, Shion Guha, Pierre Dragicevic, and Chat Wacharamanotham. 2017. Moving Transparent Statistics Forward at CHI. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’17). Association for Computing Machinery, New York, NY, USA, 534–541. https://doi.org/10.1145/3027063.3027084 [51] Juho Kim, Haoqi Zhang, Paul André, Lydia B. Chilton, Wendy Mackay, Michel Beaudouin-Lafon, Robert C. Miller, and Steven P. Dow. 2013. Cobi: A Community-Informed Conference Scheduling Tool. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology (UIST ’13). Association for Computing Machinery, New York, NY, USA, 173–182. https://doi.org/10.1145/2501988.2502034 [52] Michal Krawczyk and Ernesto Reuben. 2012. (Un)Available upon Request: Field Experiment on Researchers’ Willingness to Share Supplementary Materials. Accountability in Research 19, 3 (May 2012), 175–186. 
https://doi.org/10.1080/ 08989621.2012.678688 [53] Daniël Lakens. 2022. Sample Size Justifcation. Collabra: Psychology 8, 1 (March 2022), 33267. https://doi.org/10.1525/collabra.33267 [54] Tom Lang and Douglas Altman. 2016. Statistical Analyses and Methods in the Published Literature: The SAMPL Guidelines. Medical Writing 25 (Sept. 2016), 31–36. https://journal.emwa.org/statistics/statistical-analyses-and-methodsin-the-published-literature-the-sampl-guidelines/ [55] Jonathan Lazar, Jinjuan Heidi Feng, and Harry Hochheiser. 2017. Research Methods in Human-Computer Interaction - 2nd Edition. Morgan Kaufmann, MA, USA. https://www.elsevier.com/books/research-methods-in-humancomputer-interaction/lazar/978-0-12-805390-4 [56] Sebastian Linxen, Christian Sturm, Florian Brühlmann, Vincent Cassau, Klaus Opwis, and Katharina Reinecke. 2021. How WEIRD Is CHI?. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI ’21). Association for Computing Machinery, New York, NY, USA, 1–14. https://doi. org/10.1145/3411764.3445488 [57] Florian Mann, Benedikt von Walter, Thomas Hess, and Rolf T. Wigand. 2009. Open Access Publishing in Science. Commun. ACM 52, 3 (March 2009), 135–139. https://doi.org/10.1145/1467247.1467279 [58] Nora McDonald, Karla Badillo-Urquiola, Morgan G. Ames, Nicola Dell, Elizabeth Keneski, Manya Sleeper, and Pamela J. Wisniewski. 2020. Privacy and Power: Acknowledging the Importance of Privacy Research and Design for Vulnerable Populations. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems (CHI EA ’20). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3334480.3375174 [59] Nora McDonald, Sarita Schoenebeck, and Andrea Forte. 2019. Reliability and Inter-rater Reliability in Qualitative Research: Norms and Guidelines for CSCW and HCI Practice. Proceedings of the ACM on Human-Computer Interaction 3,

CHI ’23, April 23–28, 2023, Hamburg, Germany

CSCW (Nov. 2019), 72:1–72:23. https://doi.org/10.1145/3359174 [60] Joe Menke, Martijn Roelandse, Burak Ozyurt, Maryann Martone, and Anita Bandrowski. 2020. The Rigor and Transparency Index Quality Metric for Assessing Biological and Medical Science Methods. iScience 23, 11 (Nov. 2020), 101698. https://doi.org/10.1016/j.isci.2020.101698 [61] Matthew B. Miles, A. Michael Huberman, and Johnny Saldana. 2022. Qualitative Data Analysis: A Methods Sourcebook. SAGE, CA, USA. https://us.sagepub.com/ en-us/nam/qualitative-data-analysis/book246128 [62] Andrew Moravcsik. 2014. Transparency: The Revolution in Qualitative Research. PS: Political Science & Politics 47, 1 (Jan. 2014), 48–53. https://doi.org/10.1017/ S1049096513001789 [63] Gaia Mosconi, Dave Randall, Helena Karasti, Saja Aljuneidi, Tong Yu, Peter Tolmie, and Volkmar Pipek. 2022. Designing a Data Story: A Storytelling Approach to Curation, Sharing and Data Reuse in Support of Ethnographicallydriven Research. Proceedings of the ACM on Human-Computer Interaction 6, CSCW2 (Nov. 2022), 289:1–289:23. https://doi.org/10.1145/3555180 [64] Cosmin Munteanu, Heather Molyneaux, Wendy Moncur, Mario Romero, Susan O’Donnell, and John Vines. 2015. Situational Ethics: Re-thinking Approaches to Formal Ethics Requirements for Human-Computer Interaction. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). Association for Computing Machinery, New York, NY, USA, 105–114. https://doi.org/10.1145/2702123.2702481 [65] B. A. Nosek, G. Alter, G. C. Banks, D. Borsboom, S. D. Bowman, S. J. Breckler, S. Buck, C. D. Chambers, G. Chin, G. Christensen, M. Contestabile, A. Dafoe, E. Eich, J. Freese, R. Glennerster, D. Gorof, D. P. Green, B. Hesse, M. Humphreys, J. Ishiyama, D. Karlan, A. Kraut, A. Lupia, P. Mabry, T. Madon, N. Malhotra, E. Mayo-Wilson, M. McNutt, E. Miguel, E. Levy Paluck, U. Simonsohn, C. Soderberg, B. A. Spellman, J. Turitto, G. VandenBos, S. Vazire, E. J. Wagenmakers, R. Wilson, and T. Yarkoni. 2015. Promoting an Open Research Culture. Science 348, 6242 (June 2015), 1422–1425. https://doi.org/10.1126/science.aab2374 [66] Giovanna Nunes Vilaza, Kevin Doherty, Darragh McCashin, David Coyle, Jakob Bardram, and Marguerite Barry. 2022. A Scoping Review of Ethics Across SIGCHI. In Designing Interactive Systems Conference (DIS ’22). Association for Computing Machinery, New York, NY, USA, 137–154. https://doi.org/10.1145/ 3532106.3533511 [67] Bridget C. O’Brien, Ilene B. Harris, Thomas J. Beckman, Darcy A. Reed, and David A. Cook. 2014. Standards for Reporting Qualitative Research: A Synthesis of Recommendations. Academic Medicine 89, 9 (Sept. 2014), 1245–1251. https: //doi.org/10.1097/ACM.0000000000000388 [68] Yuren Pang, Katharina Reinecke, and René Just. 2022. Apéritif: Scafolding Preregistrations to Automatically Generate Analysis Code and Methods Descriptions. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ’22). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3491102.3517707 [69] Van L. Parsons. 2017. Stratifed Sampling. In Wiley StatsRef: Statistics Reference Online. John Wiley & Sons, Ltd, NY, USA, 1–11. 
https://doi.org/10.1002/ 9781118445112.stat05999.pub2 [70] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, HighPerformance Deep Learning Library. https://doi.org/10.48550/arXiv.1912.01703 arXiv:1912.01703 [cs, stat] [71] Jessica Pater, Amanda Coupe, Rachel Pfafman, Chanda Phelan, Tammy Toscos, and Maia Jacobs. 2021. Standardizing Reporting of Participant Compensation in HCI: A Systematic Literature Review and Recommendations for the Field. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI ’21). Association for Computing Machinery, New York, NY, USA, 1–16. https://doi.org/10.1145/3411764.3445734 [72] Prasad Patil, Roger D. Peng, and Jefrey T. Leek. 2016. A Statistical Defnition for Reproducibility and Replicability. Preprint. bioRxiv. https://doi.org/10.1101/ 066803 [73] George Peat, Richard D. Riley, Peter Croft, Katherine I. Morley, Panayiotis A. Kyzas, Karel G. M. Moons, Pablo Perel, Ewout W. Steyerberg, Sara Schroter, Douglas G. Altman, Harry Hemingway, and for the PROGRESS Group. 2014. Improving the Transparency of Prognosis Research: The Role of Reporting, Data Sharing, Registration, and Protocols. PLOS Medicine 11, 7 (July 2014), e1001671. https://doi.org/10.1371/journal.pmed.1001671 [74] Marco Perugini, Marcello Gallucci, and Giulio Costantini. 2018. A Practical Primer To Power Analysis for Simple Experimental Designs. International Review of Social Psychology 31, 1 (July 2018), 20. https://doi.org/10.5334/irsp.181 [75] Kenneth D. Pimple. 2002. Six Domains of Research Ethics. Science and Engineering Ethics 8, 2 (June 2002), 191–205. https://doi.org/10.1007/s11948-002-0018-1 [76] Denise F. Polit. 2011. Blinding during the Analysis of Research Data. International Journal of Nursing Studies 48, 5 (May 2011), 636–641. https: //doi.org/10.1016/j.ijnurstu.2011.02.010 [77] David .M.W Powers. 2011. Evaluation: From Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation. Journal of Machine Learning

Salehzadeh Niksirat, et al.

Technologies 2, 1 (Dec. 2011), 37–63. https://doi.org/10.48550/arXiv.2010.16061 [78] Lumpapun Punchoojit and Nuttanont Hongwarittorrn. 2015. Research Ethics in Human-Computer Interaction: A Review of Ethical Concerns in the Past Five Years. In 2015 2nd National Foundation for Science and Technology Development Conference on Information and Computer Science (NICS). IEEE, Ho Chi Minh City, Vietnam, 180–185. https://doi.org/10.1109/NICS.2015.7302187 [79] Camille R. Quinn. 2015. General Considerations for Research with Vulnerable Populations: Ten Lessons for Success. Health & Justice 3, 1 (Jan. 2015), 1. https: //doi.org/10.1186/s40352-014-0013-z [80] Judy Robertson and Maurits Kaptein (Eds.). 2016. Modern Statistical Methods for HCI. Springer International Publishing, Cham. https://doi.org/10.1007/978-3319-26633-6 [81] Susan L. Rose and Charles E. Pietri. 2002. Workers as Research Subjects: A Vulnerable Population. Journal of Occupational and Environmental Medicine 44, 9 (2002), 801–805. https://doi.org/10.1097/00043764-200209000-00001 [82] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. https://doi.org/10.48550/arXiv.1910.01108 arXiv:1910.01108 [cs] [83] Ralph Scherer. 2018. PropCIs: Various Confdence Interval Methods for Proportions. https://CRAN.R-project.org/package=PropCIs [84] Council for International Organizations of Medical Sciences. 2017. International Ethical Guidelines for Health-Related Research Involving Humans. World Health Organization, Geneva, CH. https://www.cabdirect.org/cabdirect/abstract/ 20173377536 [85] Joanna Smith and Jill Firth. 2011. Qualitative Data Analysis: The Framework Approach. Nurse Researcher 18, 2 (2011), 52–62. https://doi.org/10.7748/nr2011. 01.18.2.52.c8284 [86] Martin Spann, Lucas Stich, and Klaus M. Schmidt. 2017. Pay What You Want as a Pricing Model for Open Access Publishing? Commun. ACM 60, 11 (Oct. 2017), 29–31. https://doi.org/10.1145/3140822 [87] Jose M. Such, Joel Porter, Sören Preibusch, and Adam Joinson. 2017. Photo Privacy Conficts in Social Media: A Large-scale Empirical Study. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17). Association for Computing Machinery, New York, NY, USA, 3821–3832. https: //doi.org/10.1145/3025453.3025668 [88] Poorna Talkad Sukumar, Ignacio Avellino, Christian Remy, Michael A. DeVito, Tawanna R. Dillahunt, Joanna McGrenere, and Max L. Wilson. 2020. Transparency in Qualitative Research: Increasing Fairness in the CHI Review Process. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems (CHI EA ’20). Association for Computing Machinery, New York, NY, USA, 1–6. https://doi.org/10.1145/3334480.3381066 [89] Allison Tong, Peter Sainsbury, and Jonathan Craig. 2007. Consolidated Criteria for Reporting Qualitative Research (COREQ): A 32-Item Checklist for Interviews and Focus Groups. International Journal for Quality in Health Care 19, 6 (Dec. 2007), 349–357. https://doi.org/10.1093/intqhc/mzm042 [90] UNESCO. 2021. UNESCO Recommendation on Open Science. https://unesdoc. unesco.org/ark:/48223/pf0000379949.locale=en [91] Alicia VandeVusse, Jennifer Mueller, and Sebastian Karcher. 2022. Qualitative Data Sharing: Participant Understanding, Motivation, and Consent. Qualitative Health Research 32, 1 (Jan. 2022), 182–191. https://doi.org/10.1177/ 10497323211054058 [92] Timothy H. Vines, Arianne Y. K. Albert, Rose L. Andrew, Florence Débarre, Dan G. 
Bock, Michelle T. Franklin, Kimberly J. Gilbert, Jean-Sébastien Moore, Sébastien Renaut, and Diana J. Rennison. 2014. The Availability of Research Data Declines Rapidly with Article Age. Current Biology 24, 1 (Jan. 2014), 94–97. https://doi.org/10.1016/j.cub.2013.11.014 [93] Jessica Vitak, Katie Shilton, and Zahra Ashktorab. 2016. Beyond the Belmont Principles: Ethical Challenges, Practices, and Beliefs in the Online Data Research Community. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW ’16). Association for Computing Machinery, New York, NY, USA, 941–953. https://doi.org/10.1145/2818048. 2820078 [94] Jan B. Vornhagen, April Tyack, and Elisa D. Mekler. 2020. Statistical Signifcance Testing at CHI PLAY: Challenges and Opportunities for More Transparency. In Proceedings of the Annual Symposium on Computer-Human Interaction in Play. Association for Computing Machinery, New York, NY, USA, 4–18. https: //doi.org/10.1145/3410404.3414229 [95] Chat Wacharamanotham, Lukas Eisenring, Steve Haroz, and Florian Echtler. 2020. Transparency of CHI Research Artifacts: Results of a Self-Reported Survey. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1–14. https://doi.org/10.1145/3313831.3376448 [96] Chat Wacharamanotham, Fumeng Yang, Xiaoying Pu, Abhraneel Sarma, and Lace Padilla. 2022. Transparent Practices for Quantitative Empirical Research. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA ’22). Association for Computing Machinery, New York, NY, USA, 1–5. https://doi.org/10.1145/3491101.3503760

Changes in Research Ethics, Openness, and Transparency in Empirical CHI Studies

[97] Ashley Marie Walker, Yaxing Yao, Christine Geeng, Roberto Hoyle, and Pamela Wisniewski. 2019. Moving beyond ’One Size Fits All’: Research Considerations for Working with Vulnerable Populations. Interactions 26, 6 (Oct. 2019), 34–39. https://doi.org/10.1145/3358904 [98] Dan S. Wallach. 2011. Rebooting the CS Publication Process. Commun. ACM 54, 10 (Oct. 2011), 32–35. https://doi.org/10.1145/2001269.2001283 [99] Shirley Wheeler. 2003. Comparing Three IS Codes of Ethics - ACM, ACS and BCS. PACIS 2003 Proceedings 107 (Dec. 2003), 1576–1589. https://aisel.aisnet. org/pacis2003/107 [100] Jelte M. Wicherts, Coosje L. S. Veldkamp, Hilde E. M. Augusteijn, Marjan Bakker, Robbie C. M. van Aert, and Marcel A. L. M. van Assen. 2016. Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking. Frontiers in Psychology 7 (2016), 12. https: //doi.org/10.3389/fpsyg.2016.01832 [101] Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair J. G. Gray, Paul Groth, Carole Goble, Jefrey S. Grethe, Jaap Heringa, Peter A. C. ’t Hoen, Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons. 2016. The FAIR Guiding Principles for Scientifc Data Management and Stewardship. Scientifc Data 3, 1 (March 2016), 160018. https://doi.org/10.1038/sdata.2016.18 [102] Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 1112–1122. https://doi.org/10.18653/v1/N18-1101 [103] Günter Wilms. 2019. Guide on Good Data Protection Practice in Research. https://www.eui.eu/documents/servicesadmin/deanofstudies/ researchethics/guide-data-protection-research.pdf [104] Max Wilson, Wendy Mackay, Ed Chi, Michael Bernstein, and Jefrey Nichols. 2012. RepliCHI SIG: From a Panel to a New Submission Venue for Replication.

CHI ’23, April 23–28, 2023, Hamburg, Germany

[105]

[106]

[107]

[108] [109]

[110] [111] [112] [113]

In CHI ’12 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’12). Association for Computing Machinery, New York, NY, USA, 1185–1188. https://doi.org/10.1145/2212776.2212419 Max L. Wilson, Ed H. Chi, Stuart Reeves, and David Coyle. 2014. RepliCHI: The Workshop II. In CHI ’14 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’14). Association for Computing Machinery, New York, NY, USA, 33–36. https://doi.org/10.1145/2559206.2559233 Max L. Wilson, Wendy Mackay, Ed Chi, Michael Bernstein, Dan Russell, and Harold Thimbleby. 2011. RepliCHI - CHI Should Be Replicating and Validating Results More: Discuss. In CHI ’11 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’11). Association for Computing Machinery, New York, NY, USA, 463–466. https://doi.org/10.1145/1979742.1979491 Max L. L. Wilson, Paul Resnick, David Coyle, and Ed H. Chi. 2013. RepliCHI: The Workshop. In CHI ’13 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’13). Association for Computing Machinery, New York, NY, USA, 3159–3162. https://doi.org/10.1145/2468356.2479636 Jacob O. Wobbrock and Julie A. Kientz. 2016. Research Contributions in HumanComputer Interaction. Interactions 23, 3 (April 2016), 38–44. https://doi.org/10. 1145/2907069 Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. https://doi.org/10.48550/arXiv.1910.03771 Koji Yatani. 2016. Efect Sizes and Power Analysis in HCI. In Modern Statistical Methods for HCI, Judy Robertson and Maurits Kaptein (Eds.). Springer International Publishing, Cham, 87–110. https://doi.org/10.1007/978-3-319-26633-6_5 Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking Zero-shot Text Classifcation: Datasets, Evaluation and Entailment Approach. https: //doi.org/10.48550/arXiv.1909.00161 Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. https://doi.org/10. 48550/arXiv.1904.09675 Kelly H. Zou, Julia R. Fielding, Stuart G. Silverman, and Clare M. C. Tempany. 2003. Hypothesis Testing I: Proportions. Radiology 226, 3 (March 2003), 609–613. https://doi.org/10.1148/radiol.2263011500

Contestable Camera Cars: A Speculative Design Exploration of Public AI That Is Open and Responsive to Dispute

Kars Alfrink
[email protected]
Delft University of Technology
Department of Sustainable Design Engineering
Delft, The Netherlands

Ianus Keller
[email protected]
Delft University of Technology
Department of Human-Centered Design
Delft, The Netherlands

Neelke Doorn
[email protected]
Delft University of Technology
Department of Values, Technology and Innovation
Delft, The Netherlands

Gerd Kortuem
[email protected]
Delft University of Technology
Department of Sustainable Design Engineering
Delft, The Netherlands

ABSTRACT

Local governments increasingly use artificial intelligence (AI) for automated decision-making. Contestability, making systems responsive to dispute, is a way to ensure they respect human rights to autonomy and dignity. We investigate the design of public urban AI systems for contestability through the example of camera cars: human-driven vehicles equipped with image sensors. Applying a provisional framework for contestable AI, we use speculative design to create a concept video of a contestable camera car. Using this concept video, we then conduct semi-structured interviews with 17 civil servants who work with AI employed by a large northwestern European city. The resulting data is analyzed using reflexive thematic analysis to identify the main challenges facing the implementation of contestability in public AI. We describe how civic participation faces issues of representation, public AI systems should integrate with existing democratic practices, and cities must expand capacities for responsible AI development and operation.

CCS CONCEPTS

• Human-centered computing → Empirical studies in interaction design; • Applied computing → Computing in government.

KEYWORDS

artificial intelligence, automated decision-making, camera cars, contestability, local government, machine learning, public administration, public AI, speculative design, urban AI, urban sensing, vehicular urban sensing

ACM Reference Format:
Kars Alfrink, Ianus Keller, Neelke Doorn, and Gerd Kortuem. 2023. Contestable Camera Cars: A Speculative Design Exploration of Public AI That Is Open and Responsive to Dispute. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3544548.3580984

This work is licensed under a Creative Commons Attribution International 4.0 License.
CHI ’23, April 23–28, 2023, Hamburg, Germany
© 2023 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9421-5/23/04.
https://doi.org/10.1145/3544548.3580984

1 INTRODUCTION

Local governments increasingly use artificial intelligence (AI) to support or entirely automate public service decision-making. We define AI broadly, following Suchman [72]: “[a] cover term for a range of techniques for data analysis and processing, the relevant parameters of which can be adjusted according to either internally or externally generated feedback.” As the use of AI in public sector decision-making increases, so do concerns over its harmful social consequences, including the undermining of the democratic rule of law and the infringement of fundamental human rights to dignity and self-determination [e.g. 19, 20]. Increasing systems’ contestability is a way to counteract such harms. Contestable AI is a small but growing field of research [2, 3, 36, 39, 66, 74]. However, the contestable AI literature lacks guidance for application in specific design situations. In general, designers need examples and instructions to apply a framework effectively [41, 55]. We, therefore, seek to answer the questions:

RQ1: What are the characteristics of a contestable public AI system?
RQ2: What are the challenges facing the implementation of contestability in public AI?

We ground our work in the use of camera cars: human-driven vehicles equipped with image sensors used for vehicular urban sensing (VUS). The primary motivation for these systems is increased efficiency (cost reduction), for example for parking enforcement. Outside of the densest urban areas, the costs of traditional means of parking enforcement quickly exceed collected fees [61]. Ethical concerns over using camera cars for these and other purposes reflect those around smart urbanism more broadly: data is captured without consent or notice, and its benefits favor those doing the capturing, leading to reductionist views and overly technocratic decision-making [49]. In this paper, we explore the shape contestable AI may take in the context of local government public services, and we describe the responses of civil servants who work with AI to these future visions.


Our design methods are drawn from speculative, critical, and future-oriented approaches [7, 26, 34, 50]. We use the ‘Contestable AI by Design’ framework [2] as a generative tool to design a concept for a contestable camera car system. Using the resulting concept video as a prompt, we conduct semi-structured interviews with civil servants employed by Amsterdam who work with AI. Our focus here is on the challenges our respondents see towards implementing these future visions and contestability more generally. We then use reflexive thematic analysis [13–15] to generate themes from the interview transcripts that together describe the major challenges facing the implementation of contestability in public AI.1

The empirical work for this study was conducted in Amsterdam. The city has previously explored ways of making camera cars more “human-friendly.” But efforts so far have been limited to up-front design adjustments to camera cars’ physical form.2

The contributions of this paper are twofold: First, we create an example near-future concept of a contestable AI system in the context of public AI, specifically camera-based VUS. The concept video is usable for debating the merits of the contestable AI concept and exploring implications for its implementation. Second, we offer an account of the challenges of implementing contestability in public AI, as perceived by civil servants employed by Amsterdam who work with AI.

We structure this paper as follows: First, we introduce Amsterdam and its current use of camera cars for parking enforcement and other purposes. Next, we discuss related work on contestable AI, public and urban AI, VUS, and speculative design. Subsequently, we describe our research approach, including our design process, interview method, and data analysis. We then report on the resulting design concept and civil servant responses. Finally, we reflect on what our findings mean for current notions of contestable AI and consider the implications for its design in the context of public and urban AI in general and camera-based VUS in particular.

2 BACKGROUND

2.1 Amsterdam

Amsterdam is the capital and largest city of the Netherlands. Its population is around 0.9 million (881.933 in 2022).3 “By Dutch standards, the city is a financial and cultural powerhouse” [65]. Amsterdam is intensely urbanized. The city covers 219.492 km2 of land (2019). The city proper has 5.333 (2021) inhabitants per km2 and 2.707 (2019) houses per km2.4 Amsterdam is considered the financial and business capital of the country. It is home to a significant number of banks and corporations. Its port is the fourth largest in terms of sea cargo in Northwest Europe.5 Amsterdam is also one of the most popular tourist destinations in Europe.6 In 2022, over a third (35%) of residents were born abroad.7 Amsterdam has relatively many households with a very low income (17%) and a very high income (14%).8 In 2020, Amsterdam’s working population (age 15-74) was relatively highly educated (48%).9

The city is governed by a directly elected municipal council, a municipal executive board, and a government-appointed mayor. The mayor is a member of the board but also has individual responsibilities. The 2022-2026 coalition agreement’s final chapter on “cooperation and organization” contains a section on “the digital city and ICT,” which frames technology as a way to improve services and increase equality and emancipation. Among other things, this section focuses on protecting citizens’ privacy, safeguarding digital rights, monitoring systems using an algorithm register10, testing systems for “integrity, discrimination and prejudice” throughout their lifecycle, and the continuing adherence to principles outlined in a local manifesto describing values for a responsible digital city11.

1 This study was preregistered at Open Science Framework: https://osf.io/26rts
2 https://responsiblesensinglab.org/projects/scan-cars
3 https://onderzoek.amsterdam.nl/interactief/kerncijfers
4 https://onderzoek.amsterdam.nl/interactief/kerncijfers
5 https://www.amsterdam.nl/bestuur-organisatie/volg-beleid/economie/haven
6 https://onderzoek.amsterdam.nl/publicatie/bezoekersprognose-2022-2024
7 https://onderzoek.amsterdam.nl/interactief/dashboard-kerncijfers

2.2 Camera car use in Amsterdam

In January 2021, 13 municipalities in the Netherlands, including Amsterdam, made use of camera cars for parking monitoring and enforcement.12 Paid parking targets the parking behavior and car use of citizens, businesses, and visitors. Its aims are to reduce the number of cars in the city, relieve public space pressures, and improve air quality. Cities expect to make alternative modes of transportation (cycling, public transport) more attractive by charging parking fees and limiting the availability of parking licenses per area.

The system in Amsterdam checks whether parked cars have paid their parking fee or have a parking permit. Community service officers use cars outfitted with cameras to patrol city parking areas. They capture images of license plates and use computer vision algorithms to recognize them. The system uses these license plates to check with a national parking register whether a vehicle has the right to park in its location at the given time. Payment must be made within 5 minutes after the vehicle has been ‘scanned.’ If not, a parking inspector employed by the company that operates the system on behalf of the city reviews the situation based on four photos to determine whether exceptional circumstances apply (e.g., curbside (un)loading, stationary at a traffic light). This human reviewer also checks whether the license plate was recognized correctly. In case of doubt, they dispatch a parking controller by motor scooter to assess the situation on-site. If no exceptional circumstances apply, the system issues a parking fine by passing the data to the municipal tax authorities. They then use the same parking register database to retrieve the personal data of the owner of the vehicle and send them a parking fine.

A dedicated website allows people to appeal a fine within six weeks of it being issued. The website provides access to the environment and license plate photos. (Any bystanders, unrelated license plates, and other privacy-sensitive information are made unrecognizable.) A third-party service also offers to object to traffic and parking fines on behalf of people, free of charge.
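To make the enforcement flow just described easier to follow, the sketch below summarizes its decision logic in Python. The register and reviewer objects, their method names, and the returned labels are hypothetical placeholders for illustration; they are not the city's actual software.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(minutes=5)  # payment window after a vehicle is scanned


@dataclass
class Scan:
    plate: str            # license plate as recognized by the computer-vision model
    location: str         # parking area identifier
    scanned_at: datetime
    photos: list          # four context photos used for human review


def handle_scan(scan: Scan, register, reviewer) -> str:
    """Illustrative decision flow for one scanned vehicle (hypothetical API)."""
    # 1. Check the national parking register for a permit or paid fee,
    #    allowing for payment within the five-minute grace period.
    deadline = scan.scanned_at + GRACE_PERIOD
    if register.has_right_to_park(scan.plate, scan.location, deadline):
        return "no_action"

    # 2. A human parking inspector reviews the four photos for misread plates
    #    or exceptional circumstances (e.g., curbside (un)loading).
    verdict = reviewer.review(scan.photos, scan.plate)
    if verdict == "exception":
        return "no_action"
    if verdict == "unclear":
        return "dispatch_controller"  # on-site check by motor scooter

    # 3. Otherwise the case goes to the municipal tax authorities, who look up
    #    the vehicle owner in the same register and issue the fine.
    return "issue_fine"
```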


Amsterdam also uses the parking monitoring camera cars to detect stolen vehicles and vehicles with a claim from the police or the public prosecutor. Cars are registered as stolen in the parking register. In case of a match with a scanned license plate, a national vehicle crime unit, possibly cooperating with the police, can take action. Data is also collected about ‘parking pressure’ and the types of license holders for municipal policy development. Finally, Amsterdam is exploring additional applications of camera cars, including outdoor advertisement taxes13 and side-placed garbage collection.14

8 https://onderzoek.amsterdam.nl/publicatie/de-staat-van-de-stad-amsterdam-xi-2020-2021
9 https://onderzoek.amsterdam.nl/publicatie/de-staat-van-de-stad-amsterdam-xi-2020-2021
10 https://algoritmeregister.amsterdam.nl
11 https://tada.city
12 https://www.rtlnieuws.nl/nieuws/nederland/artikel/5207606/scanauto-boeteaanvechten-grote-steden-amsterdam-utrecht-den-haag
13 https://responsiblesensinglab.org/projects/scan-cars
14 https://medium.com/maarten-sukel/garbage-object-detection-using-pytorch-and-yolov3-d6c4e0424a10

3 RELATED WORK

3.1 Contestable AI by design

A small but growing body of research explores the concept of contestable AI [2, 3, 36, 39, 66, 74]. Contestability helps to protect against fallible, unaccountable, unlawful, and unfair automated decision-making. It does so by ensuring the possibility of human intervention throughout the system lifecycle, and by creating arenas for adversarial debate between decision subjects and system operators.

Hirsch et al. [39] define contestability as “humans challenging machine predictions,” framing it as a way to protect against inevitably fallible machine models, by allowing human controllers to intervene before machine decisions are put into force. Vaccaro et al. [74] frame contestability as a “deep system property,” representing joint human-machine decision-making. Contestability is a form of procedural justice, giving voice to decision subjects, and increasing perceptions of fairness. Almada [3] defines contestability as the possibility for “human intervention,” which can occur not only post-hoc, in response to an individual decision, but also ex-ante, as part of AI system development processes. For this reason, they argue for a practice of “contestability by design.” Sarra [66] argues that contestability exceeds mere human intervention: it requires a “procedural relationship.” A “human in the loop” is insufficient if there is no possibility of a “dialectical exchange” between decision subjects and human controllers. Finally, Henin and Le Métayer [36] argue that the absence of contestability undermines systems’ legitimacy. They distinguish between explanations and justifications. The former are descriptive and intrinsic to systems themselves. The latter are normative and extrinsic, depending on outside references for assessing outcomes’ desirability. Because contestability seeks to show that a decision is inappropriate or inadequate, it requires justifications, in addition to explanations.

Building on these and other works, Alfrink et al. [2] define contestable AI as “open and responsive to human intervention, throughout their lifecycle, establishing a procedural relationship between decision subjects and human controllers.” They develop a preliminary design framework that synthesizes elements contributing to contestability identified through a systematic literature review. The framework comprises five system features and six development practices, mapped to major system stakeholders and typical AI system lifecycle phases. For Alfrink et al. [2], contestability is about “leveraging conflict for continuous system improvement.” Most of the works Alfrink et al. [2] include are theoretical rather than empirical, and are not derived from specific application contexts. Contexts that do feature in the works discussed are healthcare [39, 64], smart cities [47], and content moderation [27, 75, 76]. The framework has not been validated, and lacks guidance and examples for ready application by practitioners.

3.2 Public & urban AI

An increasing number of researchers report on studies into the use of AI in the public sector, i.e., public AI [16, 21, 24, 29, 31, 60, 63, 68, 69, 78]. Although some do use the term “AI” [21, 24, 29], more commonly the term used is “algorithm” or “algorithmic system” [16, 31, 63, 68, 69, 78]. These algorithmic systems are put to use for informing or automating (public) decision-making by government public service (or sector) agencies [16, 24, 29, 68, 69]. The application contexts researchers report on include child protection [16, 24, 68, 69, 78]; public housing [24]; public health [24, 63]; social protection [24, 31, 78]; public security [29, 60]; and taxation [78]. Some of the issues explored include: how transparency, explanations, and justifications may affect citizens’ trust, acceptance, and perceived legitimacy of public AI [16, 21, 24]; the politics of measurement, i.e., the human subjective choices that go into data collection, what does and does not get counted, and in what way [60, 63]; how public sector employees’ work is impacted by public AI [31], with a particular focus on discretion [68, 69]; and how research and practice might collaborate more productively [78].

An overlapping but distinct area of research focuses on the role of AI in the built environment, so-called urban AI [1, 42, 56, 57, 67, 73, 79]. Many application contexts here are mobility-related, for example smart electric vehicle charging [1], autonomous vehicles [56], and automated parking control systems [67]. The focus of this research tends to be more on how AI molds, mediates, and orchestrates the daily lived experience of urban places and spaces. Ethical questions related to AI become intertwined with city-making ethics, “who has the right to design and live in human environments” [57]. What the urban AI ‘lens’ adds to public AI discourse are questions of spatial justice [70] in addition to those of procedural and distributive justice.

3.3 Vehicular urban sensing

Vehicular (urban) sensing is when “vehicles on the road continuously gather, process, and share location-relevant sensor data” [54]. Such systems are “a prominent example of cyber-physical systems” requiring a multidisciplinary approach to their design [62]. Sensors can be mounted on the vehicle, or onboard smartphones may be used instead or in addition [28, 54]. Vehicles, here, are usually cars (automobiles). One advantage of cars is that they have few power constraints [62]. Much of the literature to date focuses on enlisting privately owned vehicles in crowdsourcing efforts [28, 53, 62, 81], as well as on networking infrastructure challenges [17, 54, 62, 81]. A wide range of sensors is discussed, but some work focuses specifically on the use of cameras [12, 17, 61, 80]. Applications include traffic monitoring and urban surveillance [17], air pollution and urban traffic [62], infrastructure monitoring (i.e., “remote assessment of structural performance”) [12], and (of particular note for our purposes here) parking monitoring and enforcement [61].

Mingardo [61] describes enforcement of on-street parking in Rotterdam, the Netherlands, using “scan-cars.” They claim the main reason for introducing this system was to reduce the cost of enforcement. Income usually covers enforcement costs in areas with high fees and large numbers of motorists. However, residents usually have affordable parking permits in peripheral areas, and the area to cover is much larger. Systems like the one in Rotterdam use so-called “automatic number plate recognition” (ANPR). Zhang et al. [80] propose an approach to segmenting license plates that can deal with a wide range of angles, lighting conditions, and distances. They report an accuracy of 95%.

3.4 Speculative design

We use ‘speculative design’ as a cover term for various forms of design futuring, including design fiction and critical design. Speculative design seeks to represent or “project” future consequences of a current issue [22]. Although early exemplars of speculative design often took the form of products, later projects usually include various forms of storytelling, primarily to aid audience interpretation and engagement [34]. Auger [5] calls this a design’s “perceptual bridge.” Sterling [71] frames design fiction as a marriage of science-fiction literature and industrial product design, which should address the inabilities of both to “imagine effectively.” Kirby [48] has described the relationship between science-fiction cinema and design. Design in service of cinema produces “diegetic prototypes,” objects that function within a film’s story world. Alternatively, as Bleecker [11] puts it, speculative design produces things that tell stories and, in the audience’s minds, create future worlds. This notion is similar to what Dunne and Raby [26] call “design as a catalyst for social dreaming.” For them, the focus of speculation is on the implications of new developments in science and technology. As such, they claim speculative design can contribute to new “sociotechnical imaginaries” [45, 46].

Speculative design can be a way to “construct publics” around “matters of concern” [11, 22, 32], to “design for debate” [59]. It is about asking questions rather than solving problems [32, 34]. To spark debate, speculative design must be provocative [8]. It evokes critical reflection using satirical wit [58]. For this satire to work, the audience must read speculative designs as objects of design, contextualized and rationalized with a narrative of use [58, 59]. Speculative designs do not lack function and can, therefore, not be dismissed as mere art. Instead, speculative design leverages a broader conception of function that goes beyond traditional notions of utility, efficiency, and optimization and instead seeks to be relational and dynamic [59]. To further support audiences’ engagement in debate, some attempts have combined speculative design with participatory approaches. In workshop-like settings, speculative designs co-created with audiences can surface controversies, and be a form of “infrastructuring” that creates “agonistic spaces” [32, 34, 38].

Early work was primarily focused on speculative design as a ‘genre,’ exploring what designs can do, and less on how it should be practiced [34]. Since then, some have explored speculative design as a method in HCI design research, particularly in ‘research through design’ or ‘constructive design research’ [6, 8, 34]. There have been a few attempts at articulating criteria by which to evaluate speculative designs [6, 7, 22, 34]. Some works offer guidelines for what makes speculative design critical [6], for reflecting on speculative designs [52], and for evaluations that match expected knowledge outcomes [9], as well as ‘tactics’ drawn from a canon of exemplars [30].

4 METHOD

Our overall approach can be characterized as constructive design research that sits somewhere between what Koskinen [51] calls the ‘field’ and ‘showroom’ modes, or as research through design using the ‘genre’ of speculative design [34]. We create a concept video of a near-future contestable camera car. We actively approach our audience to engage with the concept video through interviews. We use storytelling to aid audience interpretation, to help them recognize how a contestable camera car might fit in daily life. We seek to strike a balance between strangeness and normality. We measure success by the degree to which our audience is willing and able to thoughtfully engage with the concept video. In other words, we use speculative design to ask questions, rather than provide answers.

Our study is structured as follows: (1) we first formulate a design brief to capture the criteria that the speculative design concept video must adhere to; (2) we then conduct the speculative design project; (3) a rough cut of the resulting concept video is assessed with experts; (4) the video is then adjusted and finalized; (5) using the final cut of the speculative design concept video as a ‘prompt,’ we then conduct semi-structured interviews with civil servants; (6) finally, we use the interview transcripts for reflexive thematic analysis, exploring civil servants’ views of the challenges facing the implementation of contestability.

The data we generate consist of: (1) visual documentation of the design concepts we create; and (2) transcripts of semi-structured interviews with respondents. The visual documentation is created by the principal researcher and design collaborators as the product of the design stage. The transcripts are generated by an external transcriber on the basis of audio recordings.

4.1 Design process

We first created a design brief detailing assessment criteria for the design outcomes, derived partly from Bardzell et al. [7]. The brief also specified an application context for the speculated near-future camera car: trash detection. We drew inspiration from an existing pilot project in Amsterdam. Garbage disposal may be a banal issue, but it is also multifaceted and has real stakes. We hired a filmmaker to collaborate with on video production. Funding for this part of the project came from AMS Institute, a public-private urban innovation center.15

We first created a mood board to explore directions for the visual style. Ultimately, we opted for a collage-based approach because it is a flexible style that would allow us to depict complex actions without a lot of production overhead. It also struck a nice balance between accessibility and things feeling slightly off.

15 https://www.ams-institute.org

We then wrote a script for the video. Here, we used contestability literature in general and the ‘Contestable AI by Design’ framework [2], in particular, to determine what elements to include. We tried to include a variety of risks and related system improvements (rather than merely one of each) so that the audience would not quickly dismiss things for lack of verisimilitude. Having settled on a script, we then sketched out a storyboard. Our main challenge here was to balance the essential depiction of an intelligent system with potential risks, ways citizens would be able to contest, and the resulting system improvements. As we collaboratively refined the storyboard, our filmmaker developed style sketches that covered the most essential building blocks of the video.16 Once we were satisfied with the storyboard and style sketches, we transitioned into video production. Production was structured around reviews of weekly renders. On one occasion, this review included partners from AMS Institute.

Our next milestone was to get a rough cut of the video ‘feature complete’ for assessment with experts. For this assessment, we created an interview guide and a grading rubric. We based the rubric on the assessment criteria developed in the original design brief. All experts were colleagues at our university, selected for active involvement in the fields of design, AI, and ethics. We talked to seven experts (five male, two female; two early-career researchers, three mid-career, and two senior). Interviews took place in early February 2022. Each expert was invited for a one-on-one video call of 30-45 minutes. After a brief introduction, we went over the rubric together. We then showed the concept video rough cut. Following this, the expert would give us the grades for the video. After this, we had an open-ended discussion to discuss potential further improvements. Audio of the conversations was recorded with informed consent, and (roughly) transcribed using an automated service. We then informally analyzed the transcripts to identify the main points of improvement. We first summarized the comments of each respondent point by point. We then created an overall summary, identifying seven points of improvement. We visualized the rubric score Likert scale data as a diverging stacked bar chart.17

Once we completed the expert assessment, we identified improvements using informal analysis of the automated interview transcripts. The first author then updated the storyboard to reflect the necessary changes. We discussed these with the filmmaker, and converged on what changes were necessary and feasible. The changes were then incorporated into a final cut, adding music and sound effects created by a sound studio and a credits screen.
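As an aside, a diverging stacked bar chart of the kind mentioned above can be produced with a few lines of matplotlib. The rubric criteria and rating counts below are invented placeholders, not the study's data (which are in the supplementary material); the snippet only sketches the chart type.

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented placeholder counts of seven hypothetical expert ratings per criterion
# (columns: strongly disagree, disagree, neutral, agree, strongly agree).
criteria = ["Criterion A", "Criterion B", "Criterion C"]
counts = np.array([
    [0, 1, 1, 3, 2],
    [0, 0, 2, 4, 1],
    [1, 1, 1, 2, 2],
], dtype=float)
labels = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]
colors = ["#ca0020", "#f4a582", "#cccccc", "#92c5de", "#0571b0"]

# Shift each row left so the disagree categories (plus half of neutral) extend
# to the left of zero, making the stacked bars diverge around the midpoint.
starts = -(counts[:, 0] + counts[:, 1] + counts[:, 2] / 2.0)

fig, ax = plt.subplots(figsize=(7, 2.5))
for col, (label, color) in enumerate(zip(labels, colors)):
    widths = counts[:, col]
    ax.barh(criteria, widths, left=starts, color=color, label=label)
    starts = starts + widths

ax.axvline(0, color="black", linewidth=0.8)
ax.set_xlabel("Number of expert ratings")
ax.legend(ncol=5, fontsize="small", loc="upper center", bbox_to_anchor=(0.5, -0.35))
fig.tight_layout()
plt.show()
```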

4.2 Civil servant interviews

Interviews were conducted from early May through late September 2022. We used purposive and snowball sampling. We were specifically interested in acquiring the viewpoint of civil servants involved in using AI in public administration. We started with a hand-picked set of five respondents, whom we then asked for further people to interview. We prioritized additional respondents for their potential to provide diverse and contrasting viewpoints.

16 Design brief, script, and storyboards are available as supplementary material.
17 Interview guide, assessment form template, completed forms, tabulated assessment scores, and informal analysis report are available as supplementary material.


We stopped collecting data when additional interviews failed to generate significantly new information. We spoke to 17 respondents in total. Details about their background are summarized in Table 1. We invited respondents with a stock email. Upon expressing their willingness to participate, we provided respondents with an information sheet and consent form and set a date and time. All interviews were conducted online, using videoconferencing software. Duration was typically 30-45 minutes. Each interview started with an off-the-record introduction, after which we started audio recording with informed consent from respondents. We used an interview guide to help structure the conversation but were flexible about follow-up questions and the needs of respondents. After a few preliminary questions, we would show the video. After the video, we continued with several more questions and always ended with an opportunity for the respondents to ask questions or make additions for the record. We then ended the audio recording and asked for suggested further people to approach.

After each interview, we immediately archived audio recordings and updated our records regarding whom we spoke to and when. We then sent the audio recordings to a transcription service, which would return a document for our review. We would review the transcript, make corrections based on a review of the audio recording where necessary, and remove all identifying data. The resulting corrected and pseudonymized transcript formed the basis for our analysis.18

4.3 Analysis

Our analysis of the data is shaped by critical realist [33, 35] and contextualist [37, 44] commitments. We used reflexive thematic analysis [13–15] because it is a highly flexible method that readily adapts to a range of questions, data generation methods, and sample sizes. Because of the accessibility of its results, it is also well-suited to our participatory approach. The principal researcher took the lead in data analysis. Associate researchers contributed with partial coding and review of coding results. The procedure for turning “raw” data into analyzable form was: (1) reading and familiarization; (2) selective coding (developing a corpus of items of interest) across the entire dataset; (3) searching for themes; (4) reviewing and mapping themes; and (5) defining and naming themes. We conducted coding using Atlas.ti. We used a number of credibility strategies: member checking helped ensure our analysis reflects the views of our respondents; different researchers analyzed the data, reducing the likelihood of a single researcher’s positionality overly skewing the analysis; and reflexivity ensured that the analysis attended to the viewpoints of the researchers as they relate to the phenomenon at hand.19 In what follows, all direct quotes from respondents were translated from Dutch into English by the first author.

18 Interview guide is available as supplementary material.
19 Interview transcript summaries and code book are available as supplementary material.

Table 1: Summary of civil servant interview respondent demographics

Item         Category                                                                Number
Gender       Female                                                                  10
             Male                                                                    7
Department   Digital Strategy and Information                                        3
             Legal Affairs                                                           2
             Traffic, Public Space, and Parking                                      2
             Urban Innovation and R&D                                                10
Background   AI, arts & culture, business, data science, information science,
             law, philosophy, political science, sociology                           –

5 RESULTS

5.1 Concept video description

The concept video has a duration of 1 minute and 57 seconds. Several stills from the video can be seen in Figure 1. It consists of four parts. The first part shows a camera car identifying garbage in the streets and sending the data off to an unseen place of processing. We then see the system building a heat map from identified garbage and a resulting prioritization of collection services. Then, we see garbage trucks driving off and a sanitation worker tossing the trash in a truck.

The second part introduces three risks conceivably associated with the suggested system. The first risk is the so-called ‘chilling effect’: people feel spied on in public spaces and make less use of them. The second risk is the occurrence of ‘false positives,’ when objects that are not garbage are identified as such, leading to wasteful or harmful confrontations with collection services. The third risk is ‘model drift’: prediction models trained on historical data become out of step with reality on the ground. In this case, collection services are not dispatched to where they should be, leading to inexplicable piling up of garbage.

The third part shows how the citizens introduced in the risks section contest the system using a four-part loop. First, they use explanations to understand system behavior. Second, they use integrated channels for contacting the city about their concern. Third, they discuss their concern and point of view with a city representative. Fourth, the parties decide on how to act on the concern.

The fourth and final part shows how the system is improved based on contestation decisions. The chilling effect is addressed by explicitly calling out the camera car’s purpose on the vehicle itself, and personal data is discarded before transmission. False positives are guarded against by having a human controller review images the system believes are trash before action is taken. Finally, model drift is prevented by regularly updating models with new data. The video ends with a repeat of garbage trucks driving off and a sanitation worker collecting trash. A credits screen follows it.20
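The heat-map step shown in the first part of the video can be illustrated with a small aggregation sketch. The detection records, grid size, and confidence threshold below are invented for illustration only and do not describe the deployed pipeline.

```python
from collections import Counter

# Invented detection records: (latitude, longitude, confidence) for objects a
# model classified as garbage; a real pipeline would stream these from the car.
detections = [
    (52.370, 4.895, 0.91),
    (52.371, 4.896, 0.84),
    (52.370, 4.895, 0.77),
    (52.402, 4.932, 0.95),
]

CELL_SIZE = 0.01         # roughly one-kilometre grid in degrees (placeholder value)
MIN_CONFIDENCE = 0.8     # discard low-confidence detections


def to_cell(lat, lon):
    """Map a coordinate onto a coarse grid cell."""
    return (round(lat / CELL_SIZE), round(lon / CELL_SIZE))


# The 'heat map': a count of confident detections per grid cell.
heat = Counter(
    to_cell(lat, lon)
    for lat, lon, conf in detections
    if conf >= MIN_CONFIDENCE
)

# Prioritize collection: visit the cells with the most reported garbage first.
priority = [cell for cell, _ in heat.most_common()]
print(priority)
```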

5.2 Civil servant responses to concept video

From our analysis of civil servant responses to the concept video, we constructed three themes covering 13 challenges. See Table 2 for a summary.

5.2.1 Theme 1: Enabling civic participation.

T1.1 Citizen capacities (P1, P4, P5, P9, P10, P11, P12, P16, P17): Several respondents point out that contestability assumes sovereign, independent, autonomous, empowered, and articulate citizens. Citizens need sufficient awareness, knowledge, and understanding of systems to contest effectively. “But everything actually starts with that information position as far as I am concerned.” (P10)

20 The concept video is available as supplementary material.

It can be hard for people to understand the metrics used for evaluating model performance. For example, P17 describes how a model’s intersection over union (IoU) score of 0.8 was talked about internally as an accuracy score of 80%, even though IoU measures the overlap between a predicted region and the ground truth rather than the share of correct decisions (a brief illustration follows below). Individuals also struggle to identify systemic shortcomings. Their view is limited to the impacts directly relating to themselves only. They may not even be aware that the decision that has impacted them personally was made in part by an algorithm. In addition, citizens can have false views of what systems do. For example, citizens and civic groups believed parking enforcement camera cars recorded visual likenesses of people in the streets, which was not the case. Citizens’ ability to contest effectively further depends on how well they can navigate the city government’s complicated internal organizational structure.

Many respondents describe how citizens’ willingness to engage depends on their view of city government. Those who feel the city does not solve their problems will be reluctant to participate. Citizens’ inclination to scrutinize public algorithmic systems also depends on their general suspicion of technology. This suspicion appears to be at least somewhat generational; for example, younger people are more cautious about sharing their data. Suspicion is also contextual, depending on what is at stake in a given situation. A lack of trust can also lead to citizens rejecting explanations and justifications offered by the city. “I just think what a challenge it is to have a substantive conversation and how do you arrive at that substantive conversation.” (P16)

T1.2 Communication channels (P3, P4, P7, P8, P11, P12, P14, P16): Many respondents recognize the importance of ensuring citizens can talk to a human representative of the city. Currently, citizens can contact the city about anything using a central phone number. Reports from citizens are subsequently routed internally to the proper channels. Ideally, the city should be able to route questions related to AI to civil servants who understand the relevant systems. Citizens are neither able nor responsible for determining which issues pertain to algorithms and which do not. Triage should happen behind the scenes, as is currently the case with the central phone line. In other words, respondents would not favor a separate point of contact for ‘digital matters.’
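As an aside on the metric confusion P17 describes under T1.1, the following minimal sketch shows how IoU is computed for a single detection. The box coordinates are invented; the only point is that IoU quantifies localization overlap, which is not the same thing as “80% accuracy.”

```python
# Illustrative only: intersection over union (IoU) for two axis-aligned boxes,
# given as (x_min, y_min, x_max, y_max). The numbers below are made up.

def iou(box_a, box_b):
    """Overlap area divided by the area of the union of the two boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

predicted = (10, 10, 50, 50)
ground_truth = (12, 12, 52, 52)
print(iou(predicted, ground_truth))  # roughly 0.82: an overlap score, not a share of correct decisions
```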


Figure 1: Stills from concept video

Executive departments are responsible for work processes, including those that use AI. They should therefore be the ones answering questions, including those that relate to technology. But this is currently not always the case. Some respondents point out that development teams cannot be made responsible for answering citizens’ questions. Despite this fact, these respondents describe how their development teams do receive emails from citizens and simply answer them. Beyond a central phone line, some respondents are considering other easily accessible, lower-threshold interaction modalities for expressing disagreement or concern (cf. Item T2.3).


Theme                                          #     Challenge
Enabling civic participation (5.2.1)           T1.1  Citizen capacities
                                               T1.2  Communication channels
                                               T1.3  Feedback to development
                                               T1.4  Reporting inequality
                                               T1.5  Participation limitations
Ensuring democratic embedding (5.2.2)          T2.1  Democratic control
                                               T2.2  External oversight
                                               T2.3  Dispute resolution
Building capacity for responsibility (5.2.3)   T3.1  Organizational limits
                                               T3.2  Accountability infrastructure
                                               T3.3  Civil servant capacities
                                               T3.4  Commissioning structures
                                               T3.5  Resource constraints

Table 2: Overview of themes and associated challenges

T1.3 Feedback to development (P1, P2, P3, P4, P5, P7, P10, P13, P14, P15, P17): Respondents feel it is important for development teams to seek feedback from citizens during development. Indeed, for those systems developed internally, it is currently common practice to follow some iterative development methodology that includes testing pre-release software with citizen representatives. Most of the algorithmic systems discussed by respondents are still in this so-called pilot stage. Pilots are used to test new ideas for viability and explore the practical and ethical issues that might arise when a system is taken into regular everyday operation. “But I also think testing is necessary for these kinds of things. So if you think it through completely, you will eventually see if you test whether it is feasible. Because now I have every time with such an iteration [...] you run into other things that make you think how is this possible?” (P12) The city also conducts pilots to identify what is needed to justify the use of technology for a particular purpose. “So we start a pilot in the situation where we already think: we have to take many measures to justify that. Because bottom line, we think it is responsible, but what do you think about this if we do it exactly this way? Do you agree, or is that [...] do you use different standards?” (P7) Respondents involved with system development recognize that feedback from citizens can help eliminate blind spots and may lead to new requirements. Some respondents argue that all reports received by algorithmic system feedback channels should be open and public, or at least accessible to the municipal council, so that democratic oversight is further enabled (cf. Item T2.1). On a practical level, infrastructure is needed to close the loop between citizens’ reports and development (cf. Item T3.2). For example, the city’s service management system, which integrates with the internal software development environment, is not yet open to direct reports from citizens but only from human controllers (cf. Item T3.2). For those systems using machine learning models, there are no provisions yet for capturing feedback from citizens to retrain models (e.g., in a supervised learning approach).

T1.4 Reporting inequality (P1, P4, P6, P12, P14, P15): Several respondents mention the issue of “reporting inequality,” where some citizens are more able and inclined to report issues to the city than others (cf. Item T1.1). Some recent VUS efforts aim to counteract this reporting inequality; for example, the trash detection pilot our concept video took as a source of inspiration. Affluent neighborhoods are known to report on stray trash more than disadvantaged areas do and, as a result, are served better than is considered fair. Because of reporting inequality, respondents are wary of approaches that tie system changes directly to individual reports. For example, contestability may counteract the unequal distribution of vehicles due to system flaws, but it may just as well reintroduce the problem of reporting inequality. Contestability runs the risk of giving resourceful citizens even more outsize influence. Other respondents counter that making system changes in response to individual complaints may still be warranted if those changes benefit most citizens. Ultimately, many respondents feel it is up to developers and civil servants to interpret and weigh the signals they receive from citizens (cf. Item T1.3).

T1.5 Participation limitations (P1, P2, P4, P5, P6, P8, P10, P12, P14, P15, P16): Just as government should be aware of reporting inequality (cf. Item T1.4), it should also ensure participation and contestation are representative. A real risk is that those with technical know-how and legal clout shape the debate around algorithmic systems. Respondents repeatedly point out that existing citizen participation efforts struggle with ensuring diversity, inclusion, and representation. “For example, in [district], we also met someone who did many development projects with the neighborhood and who also agreed that, of course, the empowered people or the usual suspects often provide input and in [district] also low literacy and all sorts of other things make it much more difficult to [...] provide input if it is their neighborhood [...].” (P2) For the city, it is a struggle to find citizens willing and able to contribute to participation processes. Sometimes, as a solution, the city compensates citizens for participating. Another way to improve inclusion is to go where citizens are rather than expect them to approach the city, for example by staging events and exhibitions as part of local cultural festivals or community centers. Participation efforts assume direct representation. There is no mechanism by which individuals can represent interest groups. Citizens do not represent anyone but themselves and are not legally accountable for their decisions. Respondents point out that, as one goes up the participation ladder [4, 18], more obligations should accompany more influence. Some respondents point out that government should take responsibility and depend less on individual citizens, rather than hide behind participatory processes.

5.2.2 Theme 2: Ensuring democratic embedding.

T2.1 Democratic control (P1, P2, P3, P4, P5, P6, P7, P9, P10, P11, P12, P13, P14, P15, P16, P17): Several respondents point out that the discretion to use AI for decision-making lies with the executive branch. For this reason, the very decision to do so, and the details of how an algorithmic system will enact policy, should, in respondents’ eyes, be political. Debate in the municipal council about such decisions would improve accountability. Respondents identify a tension inherent in public AI projects: policy-makers (alderpersons) are accountable to citizens and commission public AI projects, but they often lack the knowledge to debate matters with public representatives adequately. On the other hand, those who build the systems lack accountability to citizens. Accountability is even more lacking when developers do not sit within the municipal organization but are part of a company or non-profit from which the city commissions a system (cf. Item T3.4). Respondents also point out that contestations originate not only with individual citizens or groups but also with elected representatives. In other words, the municipal council does monitor digital developments. The legislature can, for example, shape how the executive develops AI systems by introducing policy frameworks. P7 outlined three levels of legislation that embed municipal AI projects: (a) the national level, where the city must determine if there is indeed a legal basis for the project; (b) the level of local ordinances, which ideally are updated with the introduction of each new AI system so that public accountability and transparency are ensured; and finally (c) the project or application level, which focuses on the ‘how’ of an AI system, and in the eyes of P7 is also the level where direct citizen participation makes sense and adds value (cf. Item T1.5). Feedback on AI systems may be about business rules and policy, which would require a revision before a technical system can be adjusted.21 This then may lead to the executive adjusting the course on system development under its purview (cf. Item T1.3). There is also an absence of routine procedures for reviewing and updating existing AI systems in light of new policy. Political preferences of elected city councils are encoded in business rules, which are translated into code. Once a new government is installed, policy gets updated, but related business rules and software are not, as a matter of course, updated with it, although they should be.

T2.2 External oversight (P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, P12, P13, P14, P15, P16, P17): The city makes use of several forms of external review and oversight. Such reviews can be a requirement, or something the city seeks out because of, for example, citizens’ lack of trust (cf. Item T1.1). A frequently mentioned body is the local personal data commission (PDC). A PDC review is mandatory when a prospective algorithmic system processes personal data or when it is considered a high-risk application. The PDC focuses, among other things, on a system’s legal basis, proportionality, and mitigation of identified risks. One respondent proposes that such human rights impact assessments be made open for debate.22 Other review and oversight bodies include the local and national audit offices, the municipal ombudsperson, and a so-called reporting point for chain errors. One shortcoming is that many of these are incident-driven; they cannot proactively investigate systems. Naturally, the civil servants, committee members, ombudspersons, and judges handling such cases must have a sufficient understanding of the technologies involved. External review bodies sometimes, at least in respondents’ eyes, lack sufficient expertise. One example of such a case is recent negative advice delivered by a work participation council after a consultation on using AI by the work participation and income department to evaluate assistance benefit applications. At least one respondent involved in developing the system proposal felt that, despite considerable effort to explain the system design, the council did not fully grasp it. P7 considers judicial review by an administrative court of a decision that is at least in part informed by an algorithmic system the “finishing touch.” When a client file includes the data that significantly impacts a model prediction, a judge’s ruling on a municipal decision is implicitly also about the operation of the model. “If [a decision] affects citizens in their legal position, for example, in the case of a fine [...] then yes, the administrative court can look into it. That is when it gets exciting. That is the finishing touch to what we have come up with.” (P7) This sentiment was echoed by P11 when they discussed how they could show in court exactly what images the municipal parking monitoring camera car captured, which received a favorable ruling from a judge.

T2.3 Dispute resolution (P10, P11, P14, P15): Respondents feel that, for individual substantive grievances caused by algorithmic decision-making, existing complaint, objection, and appeal procedures should also work. These form an escalating ladder of procedures: complaints are evaluated by civil servants; objections go to an internal committee; if these fail, the case is handled by an ombudsperson; and finally, appeals procedures are handled by a judge. Respondents point out that existing procedures can be costly and limiting for citizens and not at all “user-friendly.” Existing procedures still rely heavily on communication by paper mail. Current procedures can be stressful because people are made to feel like an offender rather than being given the benefit of the doubt. “And we criminalize the citizen very quickly if he does not want to, a difficult citizen, annoying. Yes, no, it is just that way, and no, sorry, bye. So there is little to no space, and if you have heard [a complaint] ten times from citizens, then maybe you should think about, we have ten complaining citizens. It is not one or two. There might be something wrong so let us look at that.” (P13) Respondents agree that more effort should be put into creating alternative dispute resolution mechanisms. These should help citizens stay out of costly and stressful legal proceedings. However, these ideas are mostly considered an ‘innovation topic,’ which is to say, they are not part of daily operation. Such measures would require collaboration between the legal department and the departments executing work processes. At the moment, execution tends to consider dealing with disputes as not part of its remit. The legal department does currently call citizens who have started an appeals procedure to make them feel heard, find alternative solutions, and offer them the opportunity to withdraw. Existing mechanisms do require more integration with technology. For example, case files should include all the relevant information about the data and algorithms used. Some services, such as parking monitoring, have already built custom web interfaces for appeals that integrate with algorithmic systems and offer citizens access to the data collected on their case. These would either expedite otherwise unwieldy legacy procedures or seek to keep citizens out of formal legal appeal procedures altogether.

21 This entanglement of software and policy is well described by Jackson et al. [43].
22 Following widespread resistance against a 1971 national census, the Dutch government established a commission in 1976 to draft the first national privacy regulation. Because it collected and processed a significant amount of personal data itself, Amsterdam decided not to wait and created local regulations in 1980. Every municipal service and department was required to establish privacy regulations. The city established a special commission to review these guidelines and to decide if municipal bodies were allowed to exchange information, thereby creating the PDC (“Commissie Persoonsgegevens Amsterdam (CPA),” https://assets.amsterdam.nl/publish/pages/902156/brochure_cpa_40_jarig_bestaan.pdf). The executive board expanded the tasks of the PDC in December 2021 (https://www.amsterdam.nl/bestuurorganisatie/college/nieuws/nieuws-19-januari-2022/). It now advises the board, upon request or on its own initiative, on issues “regarding the processing of personal data, algorithms, data ethics, digital human rights and disclosure of personal data” (https://assets.amsterdam.nl/publish/pages/902156/cpa_reglement.pdf). In the lead-up to this decision, in April 2021, a coalition of green, left and social liberal parties submitted an initiative proposal to the board that aimed to “make the digital city more humane.” It too argued for the expansion of the PDC’s role (https://amsterdam.groenlinks.nl/nieuws/grip-op-technologie).

5.2.3 Theme 3: Building capacity for responsibility.

T3.1 Organizational limits (P4, P5, P7, P11, P15): Respondents point out that organizational fragmentation works against the city’s capacity to respond to citizen reports. The problem is not necessarily that signals are not received by the city; often the problem is that they are not adequately acted upon. Internal fragmentation also makes it hard for citizens to know whom they should approach with questions (cf. Items T1.1, T1.2). For example, with parking, citizens are inclined to go to their district department; these need to pass on questions to parking enforcement, who in turn, if it concerns a street-level issue, must dispatch a community service officer. “And I think that if you cut up the organization as it is now [...] then you might also have to work with other information in order to be able to deliver your service properly. So when we all had [more self-sufficient, autonomous] district councils in the past and were somewhat smaller, you could of course immediately say that this now has priority, we receive so many complaints or the alderman is working on it.” (P11) Fragmentation and the bureaucratic nature of the city organization work against the adoption of ‘agile methods.’ Although pilots are in many ways the thing that makes the innovation funnel of the city function, respondents also describe pilots as “the easy part.” The actual implementation in daily operations is a completely different matter. P3 describes this as the “innovation gap.” Transitioning a successful pilot into operation can easily take 3–5 years.

T3.2 Accountability infrastructure (P2, P4, P5, P7, P11, P12, P13): Respondents discuss various systems that are put in place to improve accountability. The city is working to ensure requirements are traceable back to the person who set them, and developers record evidence to show they are met. Evidence would include email chains that record design decisions and system logging that shows specific measures are indeed enforced (such as deletion of data); a minimal illustrative sketch of such logging appears at the end of Section 5.2.3. Regarding models, respondents indicate the importance of validating them to demonstrate that they indeed do what they are said to do. Once past the pilot stage, monitoring and maintenance become essential considerations that are currently under-served. For this purpose, developers should correctly document systems in anticipation of handover to a maintenance organization. Systems must be ensured to operate within defined boundaries, both technical and ethical (impact on citizens), and the delivery of “end-user value” must also be demonstrated. In practice, such monitoring and maintenance require the system developers’ continued involvement for some time. Another provision for accountability is the service management system integrated with the city’s software development and operations environment (cf. Item T1.3). Several respondents point out that surveillance and enforcement are two separate organizational functions. For those AI systems related to surveillance and enforcement, a ‘human-in-the-loop’ is currently already a legal requirement at the enforcement stage. Human controllers use the service management system to report system flaws, which may lead to changes and are fully traceable (cf. Item T1.3). Once in maintenance, with these systems in place, it should be possible for functional management to revise systems periodically, also in light of policy changes (cf. Item T2.1). Several respondents argue that the city should also monitor individual complaints for issues that require a system change (cf. Item T1.3).

T3.3 Civil servant capacities (P1, P3, P4, P6, P7, P15): Contestability puts demands on civil servants. “[...] I think all contestability [shown in the video] assumes a very assertive citizen who is willing to contact a city representative who is willing to listen and has time for it and is committed to doing something about it.” (P1) Civil servants need knowledge and understanding of AI systems, including those employees who speak to citizens who contact the city with questions, e.g., through the central phone number. Politicians, city council members, and alderpersons also need this understanding to debate the implications of new systems adequately. At the level of policy execution, department heads and project leads are the “first line of defense” when things go wrong (P7). So they cannot rely on the expertise of development teams but must have sufficient understanding themselves. Finally, legal department staff must also understand algorithms. P15 mentions that a guideline is in the making that should aid in this matter. Beyond updating the knowledge and skills of existing roles, new roles are necessary. In some cases, agile-methods-style ‘product owners’ act as the person who translates policy into technology. However, P7 feels the organization as a whole still lacks people who can translate legislation and regulations into system requirements. Zooming out further, respondents mention challenges with the current organizational structure and how responsibility and accountability require multidisciplinary teams that can work across technical and social issues (cf. Item T3.1).

T3.4 Commissioning structures (P1, P3, P4, P11, P12, P13, P16, P17): The city can commission AI systems in roughly three ways, with different impacts on the level of control it has over design, development, and operation: (a) by purchasing from a commercial supplier a service that may include an AI system; (b) by outsourcing policy execution to a third party, usually a non-profit entity that receives a subsidy from the city in return; or (c) by developing a system in-house. When purchasing, the city can exercise control mainly by imposing purchasing conditions, requiring a strong role as a commissioner. When outsourcing policy execution, the city has less control but can impose conditions on the use of technology as part of a subsidy provision. When developing in-house, the city owns the system completely and is therefore in full control. In all cases, however, the city is ‘policy owner’ and remains responsible for executing the law. These different collaboration structures also shape the possible dialogue between policy-makers and system developers at the start of a new project. When development happens in-house, an open conversation can happen. In the case of a tender, one party cannot be advantaged over others, so there is little room for hashing things out until an order is granted. Of course, collaboration with external developers can also have “degrees of closeness” (P4). More or less ‘agile’ ways of working can be negotiated as part of a contract, which should allow for responding to new insights mid-course. Purchasing managers sometimes perceive what they are doing as the acquisition of a service that is distinct from buying technology solutions, and can sometimes neglect to impose sufficient conditions on a service provider’s use of technology. The duration of tenders is typically three years. On occasion, the city comes to new insights related to the responsible use of technologies a service provider employs (e.g., additional transparency requirements); however, it cannot make changes until after a new tender. Respondents point out that an additional feedback loop should lead to the revision of purchasing conditions. P17 describes a project in which parts of the development and operation are outsourced, and other components are done in-house. The decision on what to outsource mainly hinges on how often the city expects legislative changes that demand system updates.

T3.5 Resource constraints (P3, P4, P12, P16, P17): Supporting contestability will require additional resource allocation. Respondents point out that the various linchpins of contestable systems suffer from limited time and money: (a) conducting sufficiently representative and meaningful participation procedures; (b) having knowledgeable personnel available to talk to citizens who have questions or complaints; (c) ensuring project leads have the time to enter information into an algorithm register; (d) performing the necessary additional development work to ensure systems’ compliance with security and privacy requirements; and (e) ensuring proper evaluations are conducted on pilot projects. P12 compares the issue to the situation with freedom of information requests, where the civil servants assigned to handle them are two years behind. Similarly, new legislation such as the European AI Act is likely to create yet more work for the city. For new projects, the city will also have to predict the volume of citizen requests so that adequate staffing can be put into place in advance. Having a face-to-face dialogue in all instances will, in many cases, be too labor-intensive (cf. Item T1.2). A challenge with reports from citizens is how to prioritize them for action by city services, given limited time and resources (cf. Item T1.4).
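To make the evidence logging mentioned under Item T3.2 slightly more concrete, the sketch below is our own illustration; it does not depict the city’s actual service management or logging systems, and the event fields, requirement identifier, and retention figures are invented.

```python
# Illustrative sketch only: structured logging of the kind respondents describe
# under T3.2, recording that a specific agreed-upon measure was enforced.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

def log_measure_enforced(requirement_id: str, measure: str, details: dict):
    """Record evidence that a requirement-backed measure (e.g., data deletion) was enforced."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "requirement_id": requirement_id,  # traceable back to the person who set the requirement
        "measure": measure,
        "details": details,
    }
    audit_log.info(json.dumps(event))
    return event

# Hypothetical example: evidence that images were deleted after the retention period
log_measure_enforced("REQ-042", "data_deletion", {"records_deleted": 128, "retention_days": 7})
```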

6 DISCUSSION

Our aim has been two-fold: (1) to explore characteristics of contestable public AI, and (2) to identify challenges facing the implementation of contestability in public AI. To this end, we created a speculative concept video of a contestable camera car and discussed it with civil servants employed by Amsterdam who work with AI.

6.1 Summary of results

6.1.1 Concept video: Example of contestable public AI. The speculative design concept argues for contestability from a risk mitigation and quality assurance perspective. First, it shows several hazards related to camera car use: chilling effect, false positives, and model drift. Then, it shows how citizens use contestability mechanisms to petition the city for system changes. These mechanisms are explanations, channels for appeal, an arena for adversarial debate, and an obligation to decide on a response. Finally, the video shows how the city improves the system in response to citizen contestations. The improvements include data minimization measures, human review, and a feedback loop back to model training. The example application of a camera car, the identified risks, and the resulting improvements are all used as provocative examples, not as a prescribed solution. Together they show how, as Alfrink et al. [2] propose, “contestability leverages conflict for continuous system improvement.”

6.1.2 Civil servant interviews: Contestability implementation challenges. From civil servant responses to the concept video, we constructed three themes:

T1 Enabling civic participation (5.2.1): Citizens need skills and knowledge to contest public AI on an equal footing. Channels must be established for citizens to engage city representatives in a dialogue about public AI system outcomes. The feedback loop from citizens back to system development teams must be closed. The city must mitigate against ‘reporting inequality’ and the limitations of direct citizen participation in AI system development.

T2 Ensuring democratic embedding (5.2.2): Public AI systems are embedded in various levels of laws and regulations. An adequate response to contestation may require policy change before technology alterations. Oversight by city council members must be expanded to include scrutiny of AI use by the executive. Alternative non-legal dispute resolution approaches that integrate tightly with technical systems should be developed to complement existing complaint, objection, and appeal procedures.

T3 Building capacity for responsibility (5.2.3): City organizations’ fragmented and bureaucratic nature fights against adequately responding to citizen signals. More mechanisms for accountability are needed, including logging of system actions and monitoring of model performance. Civil servants need more knowledge and understanding of AI to engage with citizens adequately. New roles that translate policy into technology must be created, and more multidisciplinary teams are needed. Contracts and agreements with external development parties must include responsible AI requirements and provisions for adjusting course mid-project. Contestability requires time and money investments across its various enabling components.

6.1.3 Diagram: Five contestability loops. We can assemble five contestability loops from civil servants’ accounts (Figure 2). This model’s backbone is the primary loop in which citizens elect a city council and (indirectly) its executive board (grouped as “policy-makers”). Systems developers translate the resulting policy into algorithms, data, and models. (Other policy is translated into guidance to be executed by humans directly.) The resulting “software,” along with street-level bureaucrats and policy, forms the public AI systems whose decisions impact citizens. Our model highlights two aspects that are particular to the public sector context: (1) the indirect, representative forms of citizen control at the heart of the primary policy-software-decisions loop; and (2) the second-order loops that monitor for systemic flaws which require addressing at the level of systems development or policy-making upstream. These five loops highlight specific intervention points in public AI systems. They indirectly indicate what forms of contestation could exist and between whom. To be fully contestable, we suggest that public AI systems implement all five loops. Better integration with the primary loop, and the implementation of second-order monitoring loops, deserve particular attention.

Figure 2: Diagram of our “five loops model,” showing the basic flow of policy through software into decisions (solid arrows), the direct way citizens can contest individual decisions (L1, dashed arrow), the direct ways in which citizens can contest systems development and policy-making (L2–3, dotted arrows), and the second-order feedback loops leading from all decision-appeal interactions in the aggregate back to software development and policy-making (L4–5, dashed-dotted arrows).

6.2 Results’ relation to existing literature

6.2.1 Contestable AI by design. Following Alfrink et al. [2]’s definition of contestable systems as “open and responsive to human intervention,” our respondents appear broadly sympathetic to this vision, particularly the idea that government should make more of an effort to be open and responsive to citizens. We recognize many key contestability concepts in current city efforts as described by our respondents. For example, the possibility of human intervention [39] is mandatory in cases of enforcement, which can protect against model fallibility, at least to the extent errors can be detected by individual human controllers. Nevertheless, this human-in-the-loop is implemented more for legal compliance than quality control. Respondents talk about quality assurance and ways to achieve it, e.g., through audits and monitoring, but few practical examples appear to exist as of yet. The city recognizes the need to integrate institutional contestability provisions with technical systems (i.e., contestability as a “deep system property” [74]). However, this integration is currently underdeveloped. Positive examples include the custom web interface for appealing parking enforcement decisions. Ex-ante contestability measures [3] are present mainly in pilots, in the form of civic participation in early-stage systems design. However, most participation happens on the project level and has no impact on policy decisions upstream from technology design.

A dialectical relationship [66] is present on the far ends of what we could describe as the question-complaint-object-appeal spectrum; for example, the central phone line on one end and the review of algorithmic decisions by administrative courts on the other. The middle range seems to have less opportunity for exchanging arguments; again, these measures generally lack integration with technology. In any case, executing this ideal at scale will be costly. Finally, the city appears to approach accountability and legitimacy by ensuring the availability of explanations (e.g., in the form of an algorithm register). There appears to be less interest in, or awareness of, the need for justifications [36] of decisions.

Most of the literature emphasizes contestability from below and outside but does not account for the representative democracy mechanisms in which public AI systems are embedded. In terms of our five loops model, city efforts emphasize individual appeals of decisions (L1) and direct participation in systems development (L2). Cities’ policy execution departments are not, by their nature, adept at adjusting course based on external signals. Furthermore, many cities still approach AI mostly from a pilot project perspective. Attitudes should shift to one of continuous learning and improvement. For example, Amsterdam conducts pilots with uncharacteristically high care. These pilots receive more scrutiny than systems in daily operation, to allow for operation “in the wild” while staying within acceptable boundaries. The additional scrutiny throughout, and the mandatory intensive evaluations upon completion, serve to identify risks that may arise if systems were to transition into daily operation. This careful approach transforms pilots from the non-committal testing grounds common in the business world into something more akin to a social experiment guided by bioethical principles [77].

While Amsterdam’s pilots serve as good examples, successful pilots’ transition into daily operation faces difficulties. This “innovation gap” (cf. Item T3.1) may be partially alleviated when designers stay involved after delivery. Public AI designers should consider themselves stewards, whose role is never finished [25]. Finally, it is not just AI and its development process that need ‘redesigning.’ Cities’ AI commissioning and governance structures must also be adjusted. Again referring to our five loops model, this would mean a focus on participation in policy-making (L3) and the second-order feedback loops from decision appeals to developers and policy-makers (L4–5).

6.2.2 Public & urban AI, and VUS. Our example case of camera-car-based trash detection illustrates the need for the public and urban AI fields to converse more actively with each other. Public AI tends to focus on what goes on inside city organizations; urban AI tends to focus on what happens in the streets. Our results show how the concept of contestability connects the dots between several issues focused on in the literature so far, namely between explanations and justifications [16, 21, 24], street-level bureaucrat discretion [31, 68, 69], and citizens’ daily lived experience of urban space [56, 57]. Participation in the public and urban AI literature is almost invariably of the direct kind [16, 69], as if we have given up on representative modes of democracy. There is potential in renewing existing forms of civic oversight and control. So, again, in our five loops model, a shift from focusing on individual appeals and direct participation in development (L1–2) to participation in policy-making (L3) and monitoring of appeals by policy-makers (L5).

We find it striking that the HCI design space appears to devote little or no attention to (camera-based) VUS. Camera cars appear to offer tremendous seductive appeal to administrators. More public camera car applications will likely find their way into the cities of the global north. They deserve more scrutiny from (critical) HCI scholars.

6.2.3 Speculative design as a research method. Turning to methodological aspects, we will make a few observations. As is often the case with contemporary speculative design, our concept is more a story than a product [34]. Indeed, we sought to spark the imagination of the audience [26, 71]. One respondent recognized this: “And I think the lack of imagination that you have dealt with really well with your film is what keeps the conversation going even now, which is exactly the goal.” (P9)

The story we tell explores the implications of new technology [26]. It is a projection of potential future impacts of public AI that is (or is not) contestable [26]. Nevertheless, it would go too far to say we are ‘constructing a public’ [22]. We have not engaged in “infrastructuring” or the creation of “agonistic spaces” [32, 34, 38]. We did design for one-on-one debate [59] and worked to ensure the video is sufficiently provocative and operates in the emotional register without tipping over into pure fancy or parody [58, 59]. We used speculative design to open up rather than close down [52]. In this opening up, we went one step beyond merely critiquing current public AI practice and offered a speculative solution of contestability, framed in such a way that it invited commentary. Thus, asking questions rather than solving problems may not be the best way to distinguish speculative design from ‘affirmative design.’ As Malpass [59] points out, rather than lacking function, critical design’s function goes beyond traditional notions of utility, efficiency, and optimization, and instead seeks to be relational, contextual, and dynamic.

On a more practical level, by building on the literature [6, 7, 22, 34], we defined success criteria up front. Before bringing the result to our intended audience, we built an explicit evaluation step into our design process. This step used these same criteria to gain confidence that our artifact would have the effect we sought it to have on our audience. This approach can be an effective way for other design researchers to pair speculative design with empirical work.

6.3 Transferability: Results’ relation to city and citizens

Amsterdam is not a large city in global terms, but it is populous and dense enough to struggle with the “big city issues” common in popular discourse. Amsterdam was an early poster child of the “smart cities” phenomenon. It embraced the narrative of social progress through technological innovation with great enthusiasm. Only later did it become aware of and responsive to concerns over the detrimental effects of technology. We expect that Amsterdam’s public AI efforts, the purposes technology is put to, and the technologies employed are relatively common. The city’s government structure is typical of local representative democracies globally. Furthermore, the Netherlands’s electoral system is known to be effective at ensuring representation. Many of the challenges we identify concerning integrating public AI in local democracy should be transferable to cities with similar regimes.

Amsterdam is quite mature in its policies regarding “digital,” including the responsible design, development, and operation of public AI. Less-advanced cities will likely struggle with more foundational issues before many of the challenges we have identified come into focus. For example, Amsterdam has made considerable progress concerning the transparency of its public AI systems in the form of an algorithm register, providing explanations of global system behavior. The city has also made notable progress in developing in-house capacity for ML development, enabling it to have more control over public AI projects than cities dependent on private sector contractors.

Amsterdam’s residents have a national reputation for being outspoken and skeptical of government. Indeed, city surveys show that a significant and stable share of the population is politically active. Nevertheless, a recent survey shows that few believe they have any real influence.23 Political engagement and self-efficacy are unequally divided across income and educational attainment groups, and these groups rarely encounter each other. Our respondents tended to speak broadly about citizens and the city’s challenges in ensuring their meaningful participation in public AI developments. However, in articulating strategies for addressing the challenges we have identified, it is vital to keep in mind this variation in political engagement and self-efficacy. For example, improving citizens’ information position so they can participate as equals may be relevant for politically active people but will do little to increase engagement. For that, we should rethink the form of participation itself. Likewise, improving the democratic embedding of public AI systems to increase their legitimacy is only effective if citizens believe they can influence the city government in the first place.

23 https://onderzoek.amsterdam.nl/publicatie/amsterdamse-burgermonitor-2021

6.4 Limitations

Our study is limited by the fact that we only interacted with civil servants and the particular positions these respondents occupy in the municipal organization. Over half of the civil servants interviewed hold a position in the R&D and innovation department of the city. Their direct involvement is mostly with pilot projects, less so with systems in daily operations. The themes and challenges we have constructed appear, for the most part, equally relevant across both classes of systems. It is conceivable, however, that civil servants employed in other parts of the city executive (e.g., social services) are more concerned with challenges we have not captured here.

Further work could expand on our study by including citizen, civil society, and business perspectives. This would surface the variety of interests stakeholder groups have with regard to contestability measures. Our respondents’ statements are based on a first impression of the concept video. We expect more nuanced and richer responses if we give respondents more time to engage with the underlying ideas and apply them to their own context. Finally, interviews do not allow for debate between respondents. Another approach would be to put people in dialogue with each other. This would identify how stakeholder group interests in contestability may align or conflict.

6.5 Future work

The public sector context brings with it particular challenges facing the implementation of contestability mechanisms, but also unique opportunities. For example, the existing institutional arrangements for contestation that are typical of representative democracies demand specific forms of integration on the one hand, and on the other hand offer more robust forms of participation than are typically available in the private sector. For this reason, future work should include the translation of ‘generic’ contestability design knowledge into context-specific forms. Considering the numerous examples of public AI systems with large-scale and far-reaching consequences already available to us, such work is not without urgency.

Most contestability research focuses on individual appeals (L1 in our five loops model) or participation in the early phases of AI systems development (L2, but limited to requirements definition). Future work should dig into the second-order loops we have identified (L4–5) and how citizens may contest decisions made in later phases of ML development (i.e., L2, but engaging with the ‘materiality’ of ML [10, 23, 40]). The participatory policy-making loop (L3) is investigated in a more general form in, for example, political science. However, such work likely lacks clear connections to AI systems development implications downstream.

Finally, to contribute to public AI design practice, all of the above should be translated into actionable guidance for practitioners on the ground. Practical design knowledge is often best transmitted through evocative examples. Many more artifacts like our own concept video should be created and disseminated among practitioners. HCI design research has a prominent role in assessing such practical design knowledge for efficacy, usability, and desirability.

7 CONCLUSION

City governments make increasing use of AI in the delivery of public services. Contestability, making systems open and responsive to dispute, is a way to ensure AI respects human rights to autonomy and dignity. Contestable AI is a growing field, but the knowledge produced so far lacks guidance for application in specific contexts. To this end, we sought to explore the characteristics of contestable public AI and the challenges facing its implementation, by creating a speculative concept video of a contestable camera car and conducting semi-structured interviews with civil servants who work with AI in a large northwestern European city.

The concept video illustrates how contestability can leverage disagreement for continuous system improvement. The themes we constructed from the interviews show that public AI contestability efforts must contend with the limits of direct participation, ensure systems’ democratic embedding, and seek to improve organizational capacities. ‘Traditional’ policy execution is subject to scrutiny from elected representatives, checks from the judiciary and other external oversight bodies, and direct civic participation. The shift to AI-enacted public policy has weakened these various forms of democratic control. Our findings suggest that contestability in the context of public AI does not mean merely allowing citizens to have more influence over systems’ algorithms, models, and datasets. Contestable public AI demands interventions in how executive power uses technology to enact policy.

ACKNOWLEDGMENTS

We thank Roy Bendor, for advising us on our method and approach; Thijs Turèl (Responsible Sensing Lab and AMS Institute) for supporting the production of the concept video; Simon Scheiber (Trim Tab Pictures) for creating the concept video; the interviewed experts for their productive criticism that led to many improvements to the concept video; the interviewed civil servants for taking the time to talk to us and providing valuable insights into current practice; and the reviewers for their constructive comments. This research was supported by a grant from the Dutch National Research Council NWO (grant no. CISC.CC.018).

REFERENCES

[1] Kars Alfrink, Ianus Keller, Neelke Doorn, and Gerd Kortuem. 2022. Tensions in Transparent Urban AI: Designing a Smart Electric Vehicle Charge Point. AI & SOCIETY (March 2022). https://doi.org/10/gpszwh
[2] Kars Alfrink, Ianus Keller, Gerd Kortuem, and Neelke Doorn. 2022. Contestable AI by Design: Towards a Framework. Minds and Machines (Aug. 2022). https://doi.org/10/gqnjcs
[3] Marco Almada. 2019. Human Intervention in Automated Decision-Making: Toward the Construction of Contestable Systems. In Proceedings of the 17th International Conference on Artificial Intelligence and Law, ICAIL 2019. 2–11. https://doi.org/10/gghft8
[4] Sherry R. Arnstein. 1969. A Ladder of Citizen Participation. Journal of the American Institute of Planners 35, 4 (1969), 216–224. https://doi.org/10/cvct7d
[5] James Auger. 2013. Speculative Design: Crafting the Speculation. Digital Creativity 24, 1 (March 2013), 11–35. https://doi.org/10/gd4q58
[6] Jeffrey Bardzell and Shaowen Bardzell. 2013. What Is "Critical" about Critical Design?. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, Paris, France, 3297–3306. https://doi.org/10/ggc5s7
[7] Jeffrey Bardzell, Shaowen Bardzell, and Erik Stolterman. 2014. Reading Critical Designs: Supporting Reasoned Interpretations of Critical Design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, Toronto, Ontario, Canada, 1951–1960. https://doi.org/10/f3nnk2
[8] Shaowen Bardzell, Jeffrey Bardzell, Jodi Forlizzi, John Zimmerman, and John Antanitis. 2012. Critical Design and Critical Theory: The Challenge of Designing for Provocation. In Proceedings of the Designing Interactive Systems Conference - DIS ’12. ACM Press, Newcastle upon Tyne, United Kingdom, 288. https://doi.org/10/ggpv92
[9] Eric P. S. Baumer, Mark Blythe, and Theresa Jean Tanenbaum. 2020. Evaluating Design Fiction: The Right Tool for the Job. In Proceedings of the 2020 ACM Designing Interactive Systems Conference. ACM, Eindhoven, Netherlands, 1901–1913. https://doi.org/10/ghnnv6
[10] Jesse Josua Benjamin, Arne Berger, Nick Merrill, and James Pierce. 2021. Machine Learning Uncertainty as a Design Material: A Post-Phenomenological Inquiry. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. ACM, Yokohama, Japan, 1–14. https://doi.org/10/gksmbj
[11] Julian Bleecker. 2009. Design Fiction: A Short Essay on Design, Science, Fact and Fiction.
[12] Alan Bloodworth. 2015. Using Camera Cars to Assess the Engineering Impact of Tsunamis on Buildings. Proceedings of the Institution of Civil Engineers - Civil Engineering 168, 4 (Nov. 2015), 150–150. https://doi.org/10/gqmddc
[13] Virginia Braun and Victoria Clarke. 2006. Using Thematic Analysis in Psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101. https://doi.org/10/fswdcx
[14] Virginia Braun and Victoria Clarke. 2012. Thematic Analysis. In APA Handbook of Research Methods in Psychology, Vol 2: Research Designs: Quantitative, Qualitative, Neuropsychological, and Biological, Harris Cooper, Paul M. Camic, Debra L. Long, A. T. Panter, David Rindskopf, and Kenneth J. Sher (Eds.). American Psychological Association, Washington, 57–71. https://doi.org/10.1037/13620-004
[15] Virginia Braun and Victoria Clarke. 2021. Thematic Analysis: A Practical Guide to Understanding and Doing (first ed.). SAGE Publications, Thousand Oaks.
[16] Anna Brown, Alexandra Chouldechova, Emily Putnam-Hornstein, Andrew Tobin, and Rhema Vaithianathan. 2019. Toward Algorithmic Accountability in Public Services: A Qualitative Study of Affected Community Perspectives on Algorithmic Decision-making in Child Welfare Services. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, Glasgow, Scotland, UK, 1–12. https://doi.org/10/gjgz67
[17] Raffaele Bruno and Maddalena Nurchis. 2015. Efficient Data Collection in Multimedia Vehicular Sensing Platforms. Pervasive and Mobile Computing 16 (Jan. 2015), 78–95. https://doi.org/10/f6wx2c
[18] Paolo Cardullo and Rob Kitchin. 2017. Being a ‘Citizen’ in the Smart City: Up and down the Scaffold of Smart Citizen Participation. https://doi.org/10.31235/osf.io/v24jn
[19] Fabio Chiusi, Sarah Fischer, Nicolas Kayser-Bril, and Nicolas Spielkamp. 2020. Automating Society Report 2020. Technical Report. Algorithm Watch.
[20] Kate Crawford, Roel Dobbe, Theodora Dryer, Genevieve Fried, Ben Green, Elizabeth Kazianus, Amba Kak, Varoon Mathur, Erin McElroy, Andrea Nill Sánchez, Deborah Raji, Roy Lisi Rankin, Rashida Richardson, Jason Schultz, Sarah Myers West, and Meredith Whittaker. 2019. AI Now 2019 Report. Technical Report. AI Now Institute, New York.
[21] Karl de Fine Licht and Jenny de Fine Licht. 2020. Artificial Intelligence, Transparency, and Public Decision-Making: Why Explanations Are Key When Trying to Produce Perceived Legitimacy. AI & SOCIETY 35, 4 (Dec. 2020), 917–926. https://doi.org/10/ghh5p3
[22] Carl DiSalvo. 2009. Design and the Construction of Publics. Design Issues 25, 1 (2009), 48–63. https://doi.org/10/bjgcd6
[23] Graham Dove, Kim Halskov, Jodi Forlizzi, and John Zimmerman. 2017. UX Design Innovation: Challenges for Working with Machine Learning as a Design Material. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems CHI ’17 2017-May (2017), 278–288. https://doi.org/10/cvvd
[24] Karolina Drobotowicz, Marjo Kauppinen, and Sari Kujala. 2021. Trustworthy AI Services in the Public Sector: What Are Citizens Saying About It? In Requirements Engineering: Foundation for Software Quality, Fabiano Dalpiaz and Paola Spoletini (Eds.). Vol. 12685. Springer International Publishing, Cham, 99–115. https://doi.org/10.1007/978-3-030-73128-1_7
[25] Hugh Dubberly. 2022. Why We Should Stop Describing Design as “Problem Solving”.
[26] Anthony Dunne and Fiona Raby. 2013. Speculative Everything: Design, Fiction, and Social Dreaming. The MIT Press, Cambridge, Massachusetts; London.
[27] Niva Elkin-Koren. 2020. Contesting Algorithms: Restoring the Public Interest in Content Filtering by Artificial Intelligence. Big Data & Society 7, 2 (July 2020), 205395172093229. https://doi.org/10/gg8v9r
[28] Guiyun Fan, Yiran Zhao, Zilang Guo, Haiming Jin, Xiaoying Gan, and Xinbing Wang. 2021. Towards Fine-Grained Spatio-Temporal Coverage for Vehicular Urban Sensing Systems. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. IEEE, Vancouver, BC, Canada, 1–10. https://doi.org/10/gmr355
[29] Samar Fatima, Kevin C. Desouza, Christoph Buck, and Erwin Fielt. 2022. Public AI Canvas for AI-enabled Public Value: A Design Science Approach. Government Information Quarterly (June 2022), 101722. https://doi.org/10/gqmc79
[30] Gabriele Ferri, Jeffrey Bardzell, Shaowen Bardzell, and Stephanie Louraine. 2014. Analyzing Critical Designs: Categories, Distinctions, and Canons of Exemplars. In Proceedings of the 2014 Conference on Designing Interactive Systems. ACM, Vancouver, BC, Canada, 355–364. https://doi.org/10/gnkpkh
[31] Asbjørn Ammitzbøll Flügge. 2021. Perspectives from Practice: Algorithmic Decision-Making in Public Employment Services. In Companion Publication of the 2021 Conference on Computer Supported Cooperative Work and Social Computing. ACM, Virtual Event, USA, 253–255. https://doi.org/10/gnhcvm
[32] Laura Forlano and Anijo Mathew. 2014. From Design Fiction to Design Friction: Speculative and Participatory Design of Values-Embedded Urban Technology. Journal of Urban Technology 21, 4 (2014), 7–24. https://doi.org/10/gf65fb

[33] Christopher Frauenberger. 2016. Critical Realist HCI. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems. ACM, San Jose, California, USA, 341–351. https://doi.org/10/chg6
[34] A. Galloway and C. Caudwell. 2018. Speculative Design as Research Method: From Answers to Questions and “Staying with the Trouble”. In Undesign: Critical Practices at the Intersection of Art and Design. Taylor and Francis, 85–96.
[35] Ben Green and Salomé Viljoen. 2020. Algorithmic Realism: Expanding the Boundaries of Algorithmic Thought. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. ACM, Barcelona, Spain, 19–31. https://doi.org/10/ggjpcj
[36] Clément Henin and Daniel Le Métayer. 2021. Beyond Explainability: Justifiability and Contestability of Algorithmic Decision Systems. AI & SOCIETY (July 2021). https://doi.org/10/gmg8pf
[37] Karen Henwood and Nick Pidgeon. 1994. Beyond the Qualitative Paradigm: A Framework for Introducing Diversity within Qualitative Psychology. Journal of Community & Applied Social Psychology 4, 4 (Oct. 1994), 225–238. https://doi.org/10/c94p2d
[38] P.-A. Hillgren, A. Seravalli, and A. Emilson. 2011. Prototyping and Infrastructuring in Design for Social Innovation. CoDesign 7, 3-4 (2011), 169–183. https://doi.org/10/f234v3
[39] Tad Hirsch, Kritzia Merced, Shrikanth Narayanan, Zac E. Imel, and David C. Atkins. 2017. Designing Contestability: Interaction Design, Machine Learning, and Mental Health. In DIS 2017 - Proceedings of the 2017 ACM Conference on Designing Interactive Systems. ACM Press, 95–99. https://doi.org/10/gddxqb
[40] Lars Erik Holmquist. 2017. Intelligence on Tap: Artificial Intelligence as a New Design Material. Interactions 24, 4 (June 2017), 28–33. https://doi.org/10/gc8zwk
[41] Kristina Höök and Jonas Löwgren. 2012. Strong Concepts: Intermediate-Level Knowledge in Interaction Design Research. ACM Transactions on Computer-Human Interaction 19, 3 (2012), 1–18. https://doi.org/10/f225d4
[42] Bill Howe, Jackson Maxfield Brown, Bin Han, Bernease Herman, Nic Weber, An Yan, Sean Yang, and Yiwei Yang. 2022. Integrative Urban AI to Expand Coverage, Access, and Equity of Urban Data. The European Physical Journal Special Topics 231, 9 (July 2022), 1741–1752. https://doi.org/10/gqmc7g
[43] Steven J. Jackson, Tarleton Gillespie, and Sandy Payette. 2014. The Policy Knot: Re-integrating Policy, Practice and Design. CSCW Studies of Social Computing (2014), 588–602. https://doi.org/10/gg5g9w
[44] Marianne E. Jaeger and Ralph L. Rosnow. 1988. Contextualism and Its Implications for Psychological Inquiry. British Journal of Psychology 79, 1 (Feb. 1988), 63–75. https://doi.org/10/cx4bvj
[45] Sheila Jasanoff and Sang-Hyun Kim. 2009. Containing the Atom: Sociotechnical Imaginaries and Nuclear Power in the United States and South Korea. Minerva 47, 2 (2009), 119–146. https://doi.org/10/fghmvb
[46] Sheila Jasanoff and Sang-Hyun Kim (Eds.). 2015. Dreamscapes of Modernity: Sociotechnical Imaginaries and the Fabrication of Power. The University of Chicago Press, Chicago; London.
[47] Matthew Jewell. 2018. Contesting the Decision: Living in (and Living with) the Smart City. International Review of Law, Computers and Technology (2018). https://doi.org/10/gk48xb
[48] David Kirby. 2010. The Future Is Now: Diegetic Prototypes and the Role of Popular Films in Generating Real-world Technological Development. Social Studies of Science 40, 1 (Feb. 2010), 41–70. https://doi.org/10/fcn38m
[49] Rob Kitchin. 2016. The Ethics of Smart Cities and Urban Science. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374, 2083 (Dec. 2016), 20160115. https://doi.org/10/bs3w
[50] Eva Knutz, Thomas Markussen, and Poul Rind Christensen. 2014. The Role of Fiction in Experiments within Design, Art & Architecture. Artifact 3, 2 (Dec. 2014), 8. https://doi.org/10/gmrb63
[51] Ilpo Kalevi Koskinen (Ed.). 2011. Design Research through Practice: From the Lab, Field, and Showroom. Morgan Kaufmann/Elsevier, Waltham, MA.
[52] Sandjar Kozubaev, Chris Elsden, Noura Howell, Marie Louise Juul Søndergaard, Nick Merrill, Britta Schulte, and Richmond Y. Wong. 2020. Expanding Modes of Reflection in Design Futuring. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. ACM, Honolulu, HI, USA, 1–15. https://doi.org/10/gh99ht
[53] Yongxuan Lai, Yifan Xu, Duojian Mai, Yi Fan, and Fan Yang. 2022. Optimized Large-Scale Road Sensing Through Crowdsourced Vehicles. IEEE Transactions on Intelligent Transportation Systems 23, 4 (April 2022), 3878–3889. https://doi.org/10/gqmdc3
[54] Uichin Lee and Mario Gerla. 2010. A Survey of Urban Vehicular Sensing Platforms. Computer Networks 54, 4 (March 2010), 527–544. https://doi.org/10/bqd52z
[55] Jonas Löwgren, Bill Gaver, and John Bowers. 2013. Annotated Portfolios and Other Forms of Intermediate-Level Knowledge. Interactions (2013), 30–34. https://doi.org/10/f23xgp
[56] Aale Luusua and Johanna Ylipulli. 2020. Artificial Intelligence and Risk in Design. In Proceedings of the 2020 ACM Designing Interactive Systems Conference. ACM, Eindhoven, Netherlands, 1235–1244. https://doi.org/10/gg38dj
[57] Aale Luusua and Johanna Ylipulli. 2020. Urban AI: Formulating an Agenda for the Interdisciplinary Research of Artificial Intelligence in Cities. In Companion Publication of the 2020 ACM Designing Interactive Systems Conference (DIS ’20 Companion). Association for Computing Machinery, New York, NY, USA, 373–376. https://doi.org/10/gjr4r6
[58] Matt Malpass. 2013. Between Wit and Reason: Defining Associative, Speculative, and Critical Design in Practice. Design and Culture 5, 3 (Nov. 2013), 333–356. https://doi.org/10/gc8zsz
[59] Matt Malpass. 2015. Criticism and Function in Critical Design Practice. Design Issues 31, 2 (April 2015), 59–71. https://doi.org/10/gc8ztj
[60] Vidushi Marda and Shivangi Narayan. 2020. Data in New Delhi’s Predictive Policing System. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. ACM, Barcelona, Spain, 317–324. https://doi.org/10/ggjpcs
[61] Giuliano Mingardo. 2020. Rotterdam, The Netherlands. In Parking. Elsevier, 133–145. https://doi.org/10.1016/B978-0-12-815265-2.00008-X
[62] Giovanni Pau and Rita Tse. 2012. Challenges and Opportunities in Immersive Vehicular Sensing: Lessons from Urban Deployments. Signal Processing: Image Communication 27, 8 (Sept. 2012), 900–908. https://doi.org/10/f4b22s
[63] Kathleen H. Pine and Max Liboiron. 2015. The Politics of Measurement and Action. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, Seoul, Republic of Korea, 3147–3156. https://doi.org/10/gf65dx
[64] Thomas Ploug and Søren Holm. 2020. The Four Dimensions of Contestable AI Diagnostics - A Patient-Centric Approach to Explainable AI. Artificial Intelligence in Medicine 107 (July 2020), 101901. https://doi.org/10/gpk2pk
[65] Rob Raven, Frans Sengers, Philipp Spaeth, Linjun Xie, Ali Cheshmehzangi, and Martin de Jong. 2019. Urban Experimentation and Institutional Arrangements. European Planning Studies 27, 2 (Feb. 2019), 258–281. https://doi.org/10.1080/09654313.2017.1393047
[66] Claudio Sarra. 2020. Put Dialectics into the Machine: Protection against Automatic-decision-making through a Deeper Understanding of Contestability by Design. Global Jurist 20, 3 (Oct. 2020), 20200003. https://doi.org/10/gj7sk6
[67] Nitin Sawhney. 2022. Contestations in Urban Mobility: Rights, Risks, and Responsibilities for Urban AI. AI & SOCIETY (July 2022). https://doi.org/10/gqmc7n
[68] Devansh Saxena, Karla Badillo-Urquiola, Pamela J. Wisniewski, and Shion Guha. 2021. A Framework of High-Stakes Algorithmic Decision-Making for the Public Sector Developed through a Case Study of Child-Welfare. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (Oct. 2021), 1–41. https://doi.org/10/gnhcrn
[69] Devansh Saxena and Shion Guha. 2020. Conducting Participatory Design to Improve Algorithms in Public Services: Lessons and Challenges. In Conference Companion Publication of the 2020 on Computer Supported Cooperative Work and Social Computing. ACM, Virtual Event, USA, 383–388. https://doi.org/10/gnhcvj
[70] Edward Soja. 2009. The City and Spatial Justice. Justice spatiale/Spatial justice 1, 1 (2009), 1–5.
[71] Bruce Sterling. 2009. Cover Story: Design Fiction. Interactions 16, 3 (May 2009), 20–24. https://doi.org/10/cfx568
[72] Lucy Suchman. 2018. Corporate Accountability.
[73] Yu-Shan Tseng. 2022. Assemblage Thinking as a Methodology for Studying Urban AI Phenomena. AI & SOCIETY (June 2022). https://doi.org/10/gqmc9d
[74] Kristen Vaccaro, Karrie Karahalios, Deirdre K. Mulligan, Daniel Kluttz, and Tad Hirsch. 2019. Contestability in Algorithmic Systems. In Conference Companion Publication of the 2019 on Computer Supported Cooperative Work and Social Computing. ACM, Austin, TX, USA, 523–527. https://doi.org/10/gjr4r5
[75] K. Vaccaro, C. Sandvig, and K. Karahalios. 2020.
"At the End of the Day Facebook Does What It Wants": How Users Experience Contesting Algorithmic Content Moderation. Proceedings of the ACM on Human-Computer Interaction 4 (2020). https://doi.org/10/gj7sk7 Kristen Vaccaro, Ziang Xiao, Kevin Hamilton, and Karrie Karahalios. 2021. Contestability For Content Moderation. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (Oct. 2021), 1–28. https://doi.org/10/gnct2z Ibo van de Poel. 2016. An Ethical Framework for Evaluating Experimental Technology. Science and engineering ethics 22, 3 (2016), 667–686. https://doi.org/ 10/gdnf49 Michael Veale, Max Van Kleek, and Reuben Binns. 2018. Fairness and Accountability Design Needs for Algorithmic Support in High-Stakes Public Sector DecisionMaking. Conference on Human Factors in Computing Systems - Proceedings 2018April (2018), 1–14. https://doi.org/10/ct4s Tan Yigitcanlar, Duzgun Agdas, and Kenan Degirmenci. 2022. Artifcial Intelligence in Local Governments: Perceptions of City Managers on Prospects, Constraints and Choices. AI & SOCIETY (May 2022). https://doi.org/10/gqmc9f Xianchao Zhang, Xinyue Liu, and He Jiang. 2007. A Hybrid Approach to License Plate Segmentation under Complex Conditions. In Third International Conference on Natural Computation (ICNC 2007). IEEE, Haikou, China, 68–73. https://doi. org/10/djp92v Cong Zuo, Kaitai Liang, Zoe L. Jiang, Jun Shao, and Junbin Fang. 2017. CostEfective Privacy-Preserving Vehicular Urban Sensing System. Personal and Ubiquitous Computing 21, 5 (Oct. 2017), 893–901. https://doi.org/10/gb4ph6

Human-Computer Interaction

CHAPTER 4
Scientific Foundations

In the last chapter, we examined a variety of interaction topics in HCI. By and large, the research methodology for studying these topics is empirical and scientific. Ideas are conceived, developed, and implemented and then framed as hypotheses that are tested in experiments. This chapter presents the enabling features of this methodology. Our goal is to establish the what, why, and how of research, with a focus on research that is both empirical and experimental. While much of the discussion is general, the examples are directed at HCI. We begin with the terminology surrounding research and empirical research.

4.1 What is research?

Research means different things to different people. "Being a researcher" or "conducting research" carries a certain elevated status in universities, colleges, and corporations. Consequently, the term research is bantered around in a myriad of situations. Often, the word is used simply to add weight to an assertion ("Our research shows that …"). While writing an early draft of this chapter, a television ad for an Internet service provider was airing in southern Ontario. The ad proclaimed, "Independent research proves [name_of_product] is the fastest and most reliable—period."1 One might wonder about the nature of the research, or of the independence and impartiality of the work. Of course, forwarding assertions to promote facts, observations, hypotheses, and the like is often the goal. But what is research? Surely, it is more than just a word to add force to a statement or opinion. To rise above conjecture, we demand evidence—evidence meeting a standard of credibility such that the statement is beyond dispute. Providing such credibility is the goal of research.

Returning to the word itself, research has at least three definitions. First, conducting research can be an exercise as simple as careful or diligent search.2 So carefully searching one's garden to find and remove weeds meets one standard of

1 Advertisement by Rogers Communications Inc. airing on television in southern Ontario during the winter of 2008/2009.
2 www.merriam-webster.com


conducting research. Or perhaps one undertakes a search on a computer to locate all files modified on a certain date. That’s research. It’s not the stuff of MSc or PhD theses, but it meets one definition of research. The second definition of research is collecting information about a particular subject. So surveying voters to collect information on political opinions is conducting research. In HCI we might observe people interacting with an interface and collect information about their interactions, such as the number of times they consulted the manual, clicked the wrong button, retried an operation, or uttered an expletive. That’s research. The third definition is more elaborate: research is investigation or experimentation aimed at the discovery and interpretation of facts and revision of accepted theories or laws in light of new facts. In this definition we find several key elements of research that motivate discussions in this book. We find the idea of experimentation. Conducting experiments is a central activity in a lot of HCI research. I will say more about this in the next chapter. In HCI research, an experiment is sometimes called a user study. The methodology is sometimes formal, sometimes ad hoc. A formal and standardized methodology is generally preferred because it brings consistency to a body of work and facilitates the review and comparison of research from different studies. One objective of this book is to promote the use of a consistent methodology for experimental research in HCI. To be fair, the title of this book changed a few times on the way to press. Is the book about experimental research? Well, yes, a lot of it is, but there are important forms of HCI research that are non-experimental. So as not to exclude these, the focus shifted to empirical research, a broader term that encompasses both experimental and non-experimental methodologies. Among the latter is building and testing models of interaction, which we examine formally in Chapter 7. Returning to research, the third definition speaks of facts. Facts are the building blocks of evidence, and it is evidence we seek in experimental research. For example, we might observe that a user committed three errors while entering a command with an interface. That’s a fact. Of course, context is important. Did the user have prior experience with the interface, or with similar interfaces? Was the user a child or a computer expert? Perhaps we observed and counted the errors committed by a group of users while interacting with two different interfaces over a period of time. If they committed 15 percent more errors with one interface than with the other, the facts are more compelling (but, again, context is important). Collectively, the facts form an outward sign leading to evidence—evidence that one interface is better, or less error prone, than the other. Evidence testing is presented in more detail in Chapter 6, Hypothesis Testing. Note that prove or proof is not used here. In HCI research we don’t prove things; we gather facts and formulate and test evidence. The third definition mentions theories and laws. Theory has two common meanings. In the sense of Darwin’s theory of evolution or Einstein’s theory of relativity, the term theory is synonymous with hypothesis. In fact, one definition of theory is simply “a hypothesis assumed for the sake of argument or investigation.” Of course,


through experimentation, these theories advanced beyond argument and investigation. The stringent demands of scientific inquiry confirmed the hypotheses of these great scientists. When confirmed through research, a theory becomes a scientifically accepted body of principles that explain phenomena. A law is different from a theory. A law is more specific, more constraining, more formal, more binding. In the most exacting terms, a law is a relationship or phenomenon that is “invariable under given conditions.” Because variability is germane to human behavior, laws are of questionable relevance to HCI. Of course, HCI has laws. Take HCI’s best-known law as an example. Fitts’ law refers to a body of work, originally in human motor behavior (Fitts, 1954), but now widely used in HCI. Fitts’ work pertained to rapid-aimed movements, such as rapidly moving a cursor to an object and selecting it in a graphical user interface. Fitts himself never proposed a law. He proposed a model of human motor behavior. And by all accounts, that’s what Fitts’ law is—a model, a behavioral, descriptive, and predictive model. It includes equations and such for predicting the time to do point-select tasks. It is a law only in that other researchers took up the label as a celebration of the generality and importance of Fitts’ seminal work. We should all be so lucky. Fitts’ law is presented in more detail in Chapter 7. Research, according to the third definition, involves discovery, interpretation, and revision. Discovery is obvious enough. That’s what we do—look for, or discover, things that are new and useful. Perhaps the discovery is a new style of interface or a new interaction technique. Interpretation and revision are central to research. Research does not proceed in a vacuum. Today’s research builds on what is already known or assumed. We interpret what is known; we revise and extend through discovery. There are additional characteristics of research that are not encompassed in the dictionary definitions. Let’s examine a few of these.

4.1.1 Research must be published Publication is the final step in research. It is also an essential step. Never has this rung as true as in the edict publish or perish. Researchers, particularly in academia, must publish. A weak or insufficient list of publications might spell disappointment when applying for research funds or for a tenure-track professorship at a university. Consequently, developing the skill to publish begins as a graduate student and continues throughout one’s career as a researcher, whether in academia or industry. The details and challenges in writing research papers are elaborated in Chapter 8. Publishing is crucial, and for good reason. Until it is published, the knowledge gained through research cannot achieve its critical purpose—to extend, refine, or revise the existing body of knowledge in the field. This is so important that publication bearing a high standard of scrutiny is required. Not just any publication, but publication in archived peer-reviewed journals or conference proceedings. Research results are “written up,” submitted, and reviewed for their integrity, relevance, and contribution. The review is by peers—other researchers doing similar work. Are


the results novel and useful? Does the evidence support the conclusions? Is there a contribution to the field? Does the methodology meet the expected standards for research? If these questions are satisfactorily answered, the work has a good chance of acceptance and publication. Congratulations. In the end, the work is published and archived. Archived implies the work is added to the collection of related work accessible to other researchers throughout the world. This is the “existing body of knowledge” referred to earlier. The final step is complete. Research results are sometimes developed into bona fide inventions. If an individual or a company wishes to profit from their invention, then patenting is an option. The invention is disclosed in a patent application, which also describes previous related work (prior art), how the invention addresses a need, and the best mode of implementation. If the application is successful, the patent is granted and the inventor or company thereafter owns the rights to the invention. If another company wishes to use the invention for commercial purpose, they must enter into a license agreement with the patent holder. This side note is included only to make a small point: a patent is a publication. By patenting, the individual or company is not only retaining ownership of the invention but is also making it public through publication of the patent. Thus, patents meet the must-publish criterion for research.

4.1.2 Citations, references, impact Imagine the World Wide Web without hyperlinks. Web pages would live in isolation, without connections between them. Hyperlinks provide the essential pathways that connect web pages to other web pages, thus providing structure and cohesion to a topic or theme. Similarly, it is hard to imagine the world’s body of published research without citations and references. Citations, like hyperlinks, connect research papers to other research papers. Through citations, a body of research takes shape. The insights and lessons of early research inform and guide later research. The citation itself is just an abbreviated tag that appears in the body of a paper, for example, “… as noted in earlier research (Smith and Jones, 2003)” or “… as confirmed by Green et al. [5].” These two examples are formatted differently and follow the requirements of the conference or journal. The citation is expanded into a full bibliographic entry in the reference list at the end of the paper. Formatting of citations and references is discussed in Chapter 8. Citations serve many purposes, including supporting intellectual honesty. By citing previous work, researchers acknowledge that their ideas continue, extend, or refine those in earlier research. Citations are also important to back up assertions that are otherwise questionable, for example, “the number of tablet computer users worldwide now exceeds two billion [9].” In the Results section of a research paper, citations are used to compare the current results with those from earlier research, for example, “the mean time to formulate a search query was about 15 percent less than the time reported by Smith and Jones [5].” Figure 4.1 provides a schematic of a collection of research papers. Citations are shown as arrows. It incorporates a timeline, so all arrows point to the left, to earlier


FIGURE 4.1 A collection of research papers with citations to earlier papers.

papers. One of the papers seems to have quite a few citations to it. The number of citations to a research paper is a measure of the paper’s impact. If many researchers cite a single paper, there is a good chance the work described in the cited paper is both of high quality and significant to the field. This point is often echoed in academic circles: “The only objective and transparent metric that is highly correlated with the quality of a paper is the number of citations.”3 Interestingly enough, citation counts are only recently easily available. Before services like Google Scholar emerged, citation counts were difficult to obtain. Since citation counts are available for individual papers, they are also easy to compile for individual researchers. Thus, impact can be assessed for researchers as well as for papers. The most accepted single measure of the impact of a researcher’s publication record is the H-index. If a researcher’s publications are ordered by the number of citations to each paper, the H-index is the point where the rank equals the number of citations. In other words, a researcher with H-index = n has n publications each with n or more citations. Physicist J. Hirsch first proposed the H-index in 2005 (Hirsch, 2005). H-index quantifies in a single number both research productivity (number of publications) and overall impact of a body of work (number of citations). Some of the strengths and weaknesses of the H-index, as a measure of impact, are elaborated elsewhere (MacKenzie, 2009a).

3 Dianne Murray, General Editor, Interacting with Computers. Posted to chi-announcements@acm.org on Oct 8, 2008.
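As an aside, the H-index is simple to compute from a list of per-paper citation counts. The short Python sketch below is not from this chapter; the function name and the citation counts are invented for illustration only.

def h_index(citations):
    # Sort citation counts, most-cited paper first
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top "rank" papers each have at least "rank" citations
        else:
            break
    return h

# Hypothetical citation counts for one researcher's papers
print(h_index([48, 33, 20, 9, 7, 7, 3, 1]))  # prints 6

A researcher with the hypothetical record above would have H-index = 6: six papers with six or more citations each.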


4.1.3 Research must be reproducible Research that cannot be replicated is useless. Achieving an expected standard of reproducibility, or repeatability, is therefore crucial. This is one reason for advancing a standardized methodology: it enforces a process for conducting and writing about the research that ensures sufficient detail is included to allow the results to be replicated. If skilled researchers care to test the claims, they will find sufficient guidance in the methodology to reproduce, or replicate, the original research. This is an essential characteristic of research. Many great advances in science and research pertain to methodology. A significant contribution by Louis Pasteur (1822–1895), for example, was his use of a consistent methodology for his research in microbiology (Day and Gastel, 2006, pp. 8–9). Pasteur’s experimental findings on germs and diseases were, at the time, controversial. As Pasteur realized, the best way to fend off skepticism was to empower critics—other scientists—to see for themselves. Thus, he adopted a methodology that included a standardized and meticulous description of the materials and procedure. This allowed his experiments and findings to be replicated. A researcher questioning a result could redo the experiment and therefore verify or refute the result. This was a crucial advance in science. Today, reviewers of manuscripts submitted for publication are often asked to critique the work on this very point: “Is the work replicable?” “No” spells certain rejection. One of the most cited papers in publishing history is a method paper. Lowry et al.’s, 1951 paper “Protein Measurement With the Folin Phenol Reagent” has garnered in excess of 200,000 citations (Lowry, Rosenbrough, Farr, and Randall, 1951).4 The paper describes a method for measuring proteins in fluids. In style, the paper reads much like a recipe. The method is easy to read, easy to follow, and, importantly, easy to reproduce.

4.1.4 Research versus engineering versus design There are many ways to distinguish research from engineering and design. Researchers often work closely with engineers and designers, but the skills and contributions each brings are different. Engineers and designers are in the business of building things. They create products that strive to bring together the best in form (design emphasis) and function (engineering emphasis). One can imagine that there is certain tension, even a trade-off, between form and function. Finding the right balance is key. However, sometimes the balance tips one way or the other. When this occurs, the result is a product or a feature that achieves one (form or function) at the expense of the other. An example is shown in Figure 4.2a. The image shows part of a notebook computer, manufactured by a well-known computer company. By most accounts, it is a typical notebook computer. The image shows part of the keyboard and the built-in pointing device, a touchpad. The touchpad design (or is it engineering?) is interesting. It is seamlessly embedded in the system chassis. 4

See http://scholar.google.com.


FIGURE 4.2 Form trumping function: (a) Notebook computer. (b) Duct tape provides tactile feedback indicating the edge of the touchpad.

The look is elegant—smooth, shiny, metallic. But something is wrong. Because the mounting is seamless and smooth, tactile feedback at the sides of the touchpad is missing. While positioning a cursor, the user has no sense of when his or her finger reaches the edge of the touchpad, except by observing that the cursor ceases to move. This is an example of form trumping function. One user’s solution is shown in Figure 4.2b. Duct tape added on each side of the touchpad provides the all-important tactile feedback.5 Engineers and designers work in the world of products. The focus is on designing complete systems or products. Research is different. Research tends to be narrowly focused. Small ideas are conceived of, prototyped, tested, then advanced or discarded. New ideas build on previous ideas and, sooner or later, good ideas are refined into the building blocks—the materials and processes—that find their way into products. But research questions are generally small in scope. Research tends to be incremental, not monumental. 5

For an amusing example of function trumping form, visit Google Images using “Rube Goldberg simple alarm clock.”


FIGURE 4.3 Timeline for research, engineering, and design.

Engineers and designers also work with prototypes, but the prototype is used to assess alternatives at a relatively late stage: as part of product development. A researcher's prototype is an early mock-up of an idea, and is unlikely to directly appear in a product. Yet the idea of using prototypes to inform or assess is remarkably similar, whether for research or for product development. The following characterization by Tim Brown (CEO of design firm IDEO) is directed at designers, but is well aligned with the use of prototypes for research:

    Prototypes should command only as much time, effort, and investment as are needed to generate useful feedback and evolve an idea. The more "finished" a prototype seems, the less likely its creators will be to pay attention to and profit from feedback. The goal of prototyping isn't to finish. It is to learn about the strengths and weaknesses of the idea and to identify new directions that further prototypes might take (Brown, 2008, p. 3).

One facet of research that differentiates it from engineering and design is the timeline. Research precedes engineering and design. Furthermore, the march forward for research is at a slower pace, without the shackles of deadlines. Figure 4.3 shows the timeline for research, engineering, and design. Products are the stuff of deadlines. Designers and engineers work within the corporate world, developing products that sell, and hopefully sell well. The raw materials for engineers and designers are materials and processes that already exist (dashed line in Figure 4.3) or emerge through research. The computer mouse is a good example. It is a hugely successful product that, in many ways, defines a generation of computing, post 1981, when the Xerox Star was introduced. But in the 1960s the mouse was just an idea. As a prototype it worked well as an input controller to maneuver a tracking symbol on a graphics display. Engelbart’s invention (English et al., 1967) took nearly 20 years to be engineered and designed into a successful product. Similar stories are heard today. Apple Computer Inc., long known as a leader in innovation, is always building a better mousetrap. An example is the iPhone, introduced in June, 2007. And, evidently, the world has beaten a path to Apple’s door.6 Notably, “with the iPhone, Apple successfully brought together decades 6

The entire quotation is “Build a better mousetrap and the world will beat a path to your door” and is attributed to American essayist Ralph Waldo Emerson (1803–1882).


of research" (Selker, 2008). Many of the raw materials of this successful product came by way of low-level research, undertaken well before Apple's engineers and designers set forth on their successful journey. Among the iPhone's interaction novelties is a two-finger pinch gesture for zooming in and out. New? Perhaps, but Apple's engineers and designers no doubt were guided or inspired by research that came before them. For example, multi-touch gestures date back to at least the 1980s (Buxton, Hill, and Rowley, 1985; Hauptmann, 1989). What about changing the aspect ratio of the display when the device is tilted? New? Perhaps not. Tilt, as an interaction technique for user interfaces, dates back to the 1990s (B. Harrison et al., 1998; Hinckley et al., 2000; Rekimoto, 1996). These are just two examples of research ideas that, taken alone, are small scale.

While engineers and designers strive to build better systems or products, in the broadest sense, researchers provide the raw materials and processes engineers and designers work with: stronger steel for bridges, a better mouse for pointing, a better algorithm for a search engine, a more natural touch interface for mobile phones.

4.2 What is empirical research? By prefixing research with empirical, some powerful new ideas are added. According to one definition, empirical means originating in or based on observation or experience. Simple enough. Another definition holds that empirical means relying on experience or observation alone, often without due regard for system and theory. This is interesting. These words suggest researchers should be guided by direct observations and experiences about phenomena, without prejudice to, or even consideration of, existing theories. This powerful idea is a guiding principle in science—not to be blinded by preconceptions. Here’s an example. Prior to the 15th century, there was a prevailing system or theory that celestial bodies revolved around the earth. The Polish scientist Nicolas Copernicus (1473–1543) found evidence to the contrary. His work was empirical. It was based on observation without bias toward, influence by, or due regard to, existing theory. He observed, he collected data, he looked for patterns and relationships in the data, and he found evidence within the data that cut across contemporary thinking. His empirical evidence led to one of the great achievements in modern science—a heliocentric cosmology that placed the sun, rather than the earth, at the center of the solar system. Now that’s a nice discovery (see the third definition of research at the beginning of this chapter). In HCI and other fields of research, discoveries are usually more modest. By another definition, empirical means capable of being verified or disproved by observation or experiment. These are strong words. An HCI research initiative is framed by hypotheses—assertions about the merits of an interface or an interaction technique. The assertions must be sufficiently clear and narrow to enable verification or disproval by gathering and testing evidence. This means using language in an assertion that speaks directly to empirical, observable, quantifiable aspects of the interaction. I will expand on this later in this chapter in the discussion on research questions.


4.3 Research methods

There are three common approaches, or methods, for conducting research in HCI and other disciplines in the natural and social sciences: the observational method, the experimental method, and the correlational method. All three are empirical as they are based on observation or experience. But there are differences and these follow from the objectives of the research and from the expertise and style of the researcher. Let's examine each method.

4.3.1 Observational method Observation is the starting point for this method. In conducting empirical research in HCI, it is essential to observe humans interacting with computers or computerembedded technology of some sort. The observational method encompasses a collection of common techniques used in HCI research. These include interviews, field investigations, contextual inquiries, case studies, field studies, focus groups, think aloud protocols, storytelling, walkthroughs, cultural probes, and so on. The approach tends to be qualitative rather than quantitative. As a result, observational methods achieve relevance while sacrificing precision (Sheskin, 2011, p. 76). Behaviors are studied by directly observing phenomena in a natural setting, as opposed to crafting constrained behaviors in an artificial laboratory setting. Real world phenomena are high in relevance, but lack the precision available in controlled laboratory experiments. Observational methods are generally concerned with discovering and explaining the reasons underlying human behavior. In HCI, this is the why or how of the interaction, as opposed to the what, where, or when. The methods focus on human thought, feeling, attitude, emotion, passion, sensation, reflection, expression, sentiment, opinion, mood, outlook, manner, style, approach, strategy, and so on. These human qualities can be studied through observational methods, but they are difficult to measure. The observations are more likely to involve note-taking, photographs, videos, or audio recordings rather than measurement. Measurements, if gathered, tend to use categorical data or simple counts of phenomena. Put another way, observational methods tend to examine and record the quality of interaction rather than quantifiable human performance.

4.3.2 Experimental method With the experimental method (also called the scientific method), knowledge is acquired through controlled experiments conducted in laboratory settings. Acquiring knowledge may imply gathering new knowledge, but it may also mean studying existing knowledge for the purpose of verifying, refuting, correcting, integrating, or extending. In the relevance-precision dichotomy, it is clear where controlled experiments lie. Since the tasks are artificial and occur in a controlled laboratory setting, relevance is diminished. However, the control inherent in the


methodology brings precision, since extraneous factors—the diversity and chaos of the real world—are reduced or eliminated. A controlled experiment requires at least two variables: a manipulated variable and a response variable. In HCI, the manipulated variable is typically a property of an interface or interaction technique that is presented to participants in different configurations. Manipulating the variable simply refers to systematically exposing participants to different configurations of the interface or interaction technique. To qualify as a controlled experiment, at least two configurations are required. Thus, comparison is germane to the experimental method. This point deserves further elaboration. In HCI, we often hear of a system or design undergoing a “usability evaluation” or “user testing.” Although these terms often have different meanings in different contexts, such evaluations or tests generally do not follow the experimental method. The reason is simple: there is no manipulated variable. This is mentioned only to distinguish a usability evaluation from a user study. Undertaking a user study typically implies conducting a controlled experiment where different configurations of a system are tested and compared. A “usability evaluation,” on the other hand, usually involves assessing a single user interface for strengths and weaknesses. The evaluation might qualify as research (“collecting information about a particular subject”), but it is not experimental research. I will return to this point shortly. A manipulated variable is also called an independent variable or factor. A response variable is a property of human behavior that is observable, quantifiable, and measurable. The most common response variable is time, often called task completion time or some variation thereof. Given a task, how long do participants take to do the task under each of the configurations tested? There are, of course, a multitude of other behaviors that qualify as response variables. Which ones are used depend on the characteristics of the interface or interaction technique studied in the research. A response variable is also called a dependent variable. Independent variables and dependent variables are explored in greater detail in Chapter 5. HCI experiments involve humans, so the methodology employed is borrowed from experimental psychology, a field with a long history of research involving humans. In a sense, HCI is the beneficiary of this more mature field. The circumstances manipulated in a psychology experiment are often quite different from those manipulated in an HCI experiment, however. HCI is narrowly focused on the interaction between humans and computing technology, while experimental psychology covers a much broader range of the human experience. It is naïve to think we can simply choose to focus on the experimental method and ignore qualities of interaction that are outside the scope of the experimental procedure. A full and proper user study—an experiment with human participants— involves more than just measuring and analyzing human performance. We engage observational methods by soliciting comments, thoughts, and opinions from participants. Even though a task may be performed quickly and with little or no error, if participants experience fatigue, frustration, discomfort, or another quality of interaction, we want to know about it. These qualities of interaction may not appear in the numbers, but they cannot be ignored.


One final point about the experimental method deserves mention. A controlled experiment, if designed and conducted properly, often allows a powerful form of conclusion to be drawn from the data and analyses. The relationship between the independent variable and the dependent variable is one of cause and effect; that is, the manipulations in the interface or interaction techniques are said to have caused the observed differences in the response variable. This point is elaborated in greater detail shortly. Cause-and-effect conclusions are not possible in research using the observational method or the correlational method.

4.3.3 Correlational method The correlational method involves looking for relationships between variables. For example, a researcher might be interested in knowing if users’ privacy settings in a social networking application are related to their personality, IQ, level of education, employment status, age, gender, income, and so on. Data are collected on each item (privacy settings, personality, etc.) and then relationships are examined. For example, it might be apparent in the data that users with certain personality traits tend to use more stringent privacy settings than users with other personality traits. The correlational method is characterized by quantification since the magnitude of variables must be ascertained (e.g., age, income, number of privacy settings). For nominal-scale variables, categories are established (e.g., personality type, gender). The data may be collected through a variety of methods, such as observation, interviews, on-line surveys, questionnaires, or measurement. Correlational methods often accompany experimental methods, if questionnaires are included in the experimental procedure. Do the measurements on response variables suggest relationships by gender, by age, by level of experience, and so on? Correlational methods provide a balance between relevance and precision. Since the data were not collected in a controlled setting, precision is sacrificed. However, data collected using informal techniques, such as interviews, bring relevance—a connection to real-life experiences. Finally, the data obtained using correlational methods are circumstantial, not causal. I will return to this point shortly. This book is primarily directed at the experimental method for HCI research. However, it is clear in the discussions above that the experimental method will often include observational methods and correlational methods.
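Before moving on, a concrete illustration of the correlational approach may help. Assuming the scipy library is available, the Python sketch below computes a Pearson correlation between two ratio-scale variables; the variable names and values are hypothetical, not data from any study.

from scipy.stats import pearsonr

# Hypothetical data for ten respondents
experience = [1, 2, 3, 5, 6, 8, 10, 12, 15, 20]    # years of computer experience
privacy_settings = [2, 3, 3, 4, 5, 5, 6, 7, 7, 9]  # number of privacy settings enabled

r, p = pearsonr(experience, privacy_settings)
print(f"r = {r:.2f}, p = {p:.4f}")

# A large correlation would suggest a relationship, but not cause and effect,
# because the data were not gathered under controlled conditions.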

4.4 Observe and measure

Let's return to the foundation of empirical research: observation.

4.4.1 Observation

The starting point for empirical research in HCI is to observe humans interacting with computers. But how are observations made? There are two possibilities. Either


another human is the observer or an apparatus is the observer. A human observer is the experimenter or investigator, not the human interacting with the computer. Observation is the precursor to measurement, and if the investigator is the observer, then measurements are collected manually. This could involve using a log sheet or notebook to jot down the number of events of interest observed. Events of interest might include the number of times the user clicked a button or moved his or her hand from the keyboard to the mouse. It might involve observing users in a public space and counting those who are using mobile phones in a certain way, for example, while walking, while driving, or while paying for groceries at a checkout counter. The observations may be broken down by gender or some other attribute of interest. Manual observation could also involve timing by hand the duration of activities, such as the time to type a phrase of text or the time to enter a search query. One can imagine the difficulty in manually gathering measurements as just described, not to mention the inaccuracy in the measurements. Nevertheless, manual timing is useful for preliminary testing, sometimes called pilot testing. More often in empirical research, the task of observing is delegated to the apparatus—the computer. Of course, this is a challenge in some situations. As an example, if the interaction is with a digital sports watch or automated teller machine (ATM), it is not possible to embed data collection software in the apparatus. Even if the apparatus is a conventional desktop computer, some behaviors of interest are difficult to detect. For example, consider measuring the number of times the user’s attention switches from the display to the keyboard while doing a task. The computer is not capable of detecting this behavior. In this case, perhaps an eye tracking apparatus or camera could be used, but that adds complexity to the experimental apparatus. Another example is clutching with a mouse—lifting and repositioning the device. The data transmitted from a mouse to a host computer do not include information on clutching, so a conventional host system is not capable of observing and recording this behavior. Again, some additional apparatus or sensing technology may be devised, but this complicates the apparatus. Or a human observer can be used. So depending on the behaviors of interest, some ingenuity might be required to build an apparatus and collect the appropriate measurements. If the apparatus includes custom software implementing an interface or interaction technique, then it is usually straightforward to record events such as key presses, mouse movement, selections, finger touches, or finger swipes and the associated timestamps. These data are stored in a file for follow-up analyses.
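When the apparatus is custom software, delegating observation to the computer often amounts to appending an event name and a timestamp to a file whenever something of interest occurs. The Python sketch below is illustrative only; the file name and event names are hypothetical.

import time

LOG_FILE = "session01.csv"  # hypothetical output file

def log_event(event, detail=""):
    # Append one timestamped event to the log file
    timestamp = time.time()  # seconds since the epoch
    with open(LOG_FILE, "a") as f:
        f.write(f"{timestamp:.3f},{event},{detail}\n")

# The kinds of events an experiment might record
log_event("key_press", "h")
log_event("mouse_click", "button=left")
log_event("task_complete", "trial=1")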

4.4.2 Measurement scales

Observation alone is of limited value. Consider observations about rain and flowers. In some locales, there is ample rain but very few flowers in April. This is followed by less rain and a full-blown field of flowers in May. The observations may inspire anecdote (April showers bring May flowers), but a serious examination of patterns for rain and flowers requires measurement. In this case, an observer located in a garden would observe, measure, and record the amount of rain and the number of flowers in bloom. The measurements might be recorded each day during April and May, perhaps by several observers in several gardens. The measurements are collected, together with the means, tallied by month and analyzed for "significant differences" (see Chapter 6). With measurement, anecdotes turn to empirical evidence. The observer is now in a position to quantify the amount of rain and the number of flowers in bloom, separately for April and May.

The added value of measurement is essential for science. In the words of engineer and physicist Lord Kelvin (1824–1907), after whom the Kelvin scale of temperature is named, "[Without measurement] your knowledge of it is of a meager and unsatisfactory kind."7

As elaborated in many textbooks on statistics, there are four scales of measurement: nominal, ordinal, interval, and ratio. Organizing this discussion by these four scales will help. Figure 4.4 shows the scales along a continuum with nominal scale measurements as the least sophisticated and ratio-scale measurements as the most sophisticated. This follows from the types of computations possible with each measurement, as elaborated below. The nature, limitations, and abilities of each scale determine the sort of information and analyses possible in a research setting. Each is briefly defined below.

FIGURE 4.4 Scales of measurement, ordered along a continuum from crude to sophisticated: nominal, ordinal, interval, and ratio. Nominal measurements are considered simple, while ratio measurements are sophisticated.

7 The exact and full quote, according to several online sources, is "When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge of it is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced it to the stage of science."

4.4.3 Nominal

A measurement on the nominal scale involves arbitrarily assigning a code to an attribute or a category. The measurement is so arbitrary that the code needn't be a number (although it could be). Examples are automobile license plate numbers, codes for postal zones, job classifications, military ranks, etc. Clearly, mathematical manipulations on nominal data are meaningless. It is nonsense, for example, to

compute the mean of several license plate numbers. Nominal data identify mutually exclusive categories. Membership or exclusivity is meaningful, but little else. The only relationship that holds is equivalence, which exists between entities in the same class. Nominal data are also called categorical data.

If we are interested in knowing whether males and females differ in their use of mobile phones, we might begin our investigation by observing people and assigning each a code of "M" for male, "F" for female. Here, the attribute is gender and the code is M or F. If we are interested in handedness, we might observe the writing habits of users and assign codes of "LH" for left-handers and "RH" for right-handers. If we are interested in scrolling strategies, we might observe users interacting with a GUI application and categorize them according to their scrolling methods, for example as "MW" for mouse wheel, "CD" for clicking and dragging the scrollbar, or "KB" for keyboard.

Nominal data are often used with frequencies or counts—the number of occurrences of each attribute. In this case, our research is likely concerned with the difference in the counts between categories: "Are males or females more likely to …?", "Do left handers or right handers have more difficulty with …?", or "Are Mac or PC users more inclined to …?" Bear in mind that while the attribute is categorical, the count is a ratio-scale measurement (discussed shortly).

Here is an example of nominal scale attributes using real data. Attendees of an HCI research course were dispatched to several locations on a university campus. Their task was to observe, categorize, and count students walking between classes. Each student was categorized by gender (male, female) and by whether he or she was using a mobile phone (not using, using). The results are shown in Figure 4.5. A total of 1,527 students were observed. The split by gender was roughly equal (51.1% male, 48.9% female). By mobile phone usage, 13.1 percent of the students (200) were observed using their mobile phone while walking. The research question in Figure 4.5 is as follows: are males or females more likely to use a mobile phone as they walk about a university campus? I will demonstrate how to answer this question in Chapter 6 on Hypothesis Testing.

Gender     Mobile Phone Usage        Total    %
           Not Using    Using
Male       683          98           781      51.1%
Female     644          102          746      48.9%
Total      1327         200          1527
%          86.9%        13.1%

FIGURE 4.5 Two examples of nominal scale data: gender (male, female) and mobile phone usage (not using, using).
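For readers curious about the follow-up analysis, the counts in Figure 4.5 form a 2 x 2 contingency table. The Python sketch below organizes the counts and runs a chi-square test; the test itself is covered properly in Chapter 6, and the use of the scipy library here is simply an assumption for illustration.

from scipy.stats import chi2_contingency

# Counts from Figure 4.5 (rows: male, female; columns: not using, using)
observed = [[683, 98],
            [644, 102]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")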


How many email messages do you receive each day?
1. None (I don't use email)
2. 1-5 per day
3. 6-25 per day
4. 26-100 per day
5. More than 100 per day

FIGURE 4.6 Example of a questionnaire item soliciting an ordinal response.

4.4.4 Ordinal data Ordinal scale measurements provide an order or ranking to an attribute. The attribute can be any characteristic or circumstance of interest. For example, users might be asked to try three global positioning systems (GPS) for a period of time and then rank the systems by preference: first choice, second choice, third choice. Or users could be asked to consider properties of a mobile phone such as price, features, coolappeal, and usability, and then order the features by personal importance. One user might choose usability (first), cool-appeal (second), price (third), and then features (fourth). The main limitation of ordinal data is that the interval is not intrinsically equal between successive points on the scale. In the example just cited, there is no innate sense of how much more important usability is over cool-appeal or whether the difference is greater or less than that between, for example, cool-appeal and price. If we are interested in studying users’ e-mail habits, we might use a questionnaire to collect data. Figure 4.6 gives an example of a questionnaire item soliciting ordinal data. There are five rankings according to the number of e-mail messages received per day. It is a matter of choice whether to solicit data in this manner or, in the alternative, to ask for an estimate of the number of e-mail messages received per day. It will depend on how the data are used and analyzed. Ordinal data are slightly more sophisticated than nominal data since comparisons of greater than or less than are possible. However, it is not valid to compute the mean of ordinal data.

4.4.5 Interval data

Moving up in sophistication, interval data have equal distances between adjacent values. However, there is no absolute zero. The classic example of interval data is temperature measured on the Fahrenheit or Celsius scale. Unlike ordinal data, it is meaningful to compute the mean of interval data, for example, the mean mid-day temperature during the month of July. Ratios of interval data are not meaningful, however. For example, one cannot say that 20°C is twice as warm as 10°C. In HCI, interval data are commonly used in questionnaires where a response on a linear scale is solicited. An example is a Likert Scale (see Figure 4.7), where verbal responses are given a numeric code. In the example, verbal responses are


Please indicate your level of agreement with the following statements.

                                            Strongly   Mildly              Mildly   Strongly
                                            disagree   disagree   Neutral  agree    agree

It is safe to talk on a mobile phone
while driving.                                  1          2          3        4        5

It is safe to read a text message on a
mobile phone while driving.                     1          2          3        4        5

It is safe to compose a text message on
a mobile phone while driving.                   1          2          3        4        5

FIGURE 4.7 A set of questionnaire items organized in a Likert Scale. The responses are examples of interval scale data.

symmetric about a neutral, central value with the gradations between responses more or less equal. It is this last quality—equal gradations between responses—that validates calculating the mean of the responses across multiple respondents. There is some disagreement among researchers on the assumption of equal gradations between the items in Figure 4.7. Do respondents perceive the difference between, say, 1 and 2 (strongly disagree and mildly disagree) the same as the difference between, say, 2 and 3 (mildly disagree and neutral)? Attaching verbal tags to numbers is likely to bring qualitative and highly personal interpretations to the responses. There is evidence that respondents perceive items at the extremes of the scale as farther apart than items in the center (Kaptein, Nass, and Markopoulos, 2010). Nevertheless, the graduation between responses is much more similar here than between the five ordinal responses in Figure 4.6. One remedy for non-equal gradations in Likert-scale response items is simply to instruct respondents to interpret the items as equally spaced. Examples of Likert Scale questionnaire items in HCI research papers are as follows: Bickmore and Picard, 2004; Dautenhahn et al., 2006; Garau et al., 2003; Guy, Ur, Ronen, Perer, and Jacovi, 2011; Wobbrock, Chau, and Myers, 2007.
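As a small illustration of the point about means, the Python sketch below averages the coded responses to a single Likert-scale item; the responses are invented, not data from any study.

# Hypothetical responses to one item (1 = strongly disagree ... 5 = strongly agree)
responses = [2, 3, 3, 4, 2, 5, 3, 4]

mean_response = sum(responses) / len(responses)
print(f"mean response = {mean_response:.2f}")  # prints 3.25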

4.4.6 Ratio data

Ratio-scale measurements are the most sophisticated of the four scales of measurement. Ratio data have an absolute zero and support a myriad of calculations to


summarize, compare, and test the data. Ratio data can be added, subtracted, multiplied, divided; means, standard deviations, and variances can be computed. In HCI, the most common ratio-scale measurement is time—the time to complete a task. But generally, all physical measurements are also ratio-scale, such as the distance or velocity of a cursor as it moves across a display, the force applied by a finger on a touchscreen, and so on. Many social variables are also ratio-scale, such as a user’s age or years of computer experience. Another common ratio-scale measurement is count (noted above). Often in HCI research, we count the number of occurrences of certain human activities, such as the number of button clicks, the number of corrective button clicks, the number of characters entered, the number of incorrect characters entered, the number of times an option is selected, the number of gaze shifts, the number of hand movements between the mouse and keyboard, the number of task retries, the number of words in a search query, etc. Although we tend to give time special attention, it too is a count—the number of seconds or minutes elapsed as an activity takes place. These are all ratio-scale measurements. The expressive nature of a count is improved through normalization; that is, expressing the value as a count per something. So for example, knowing that a 10-word phrase was entered in 30 seconds is less revealing than knowing that the rate of entry was 10 / 0.5 = 20 words per minute (wpm). The main benefit of normalizing counts is to improve comparisons. It is easy to compare 20 wpm for one method with 23 wpm for another method—the latter method is faster. It is much harder to compare 10 words entered in 30 seconds for one method with 14 words entered in 47 seconds for another method. As another example, let’s say two errors were committed while entering a 50-character phrase of text. Reporting the occurrence of two errors reveals very little, unless we also know the length of the phrase. Even so, comparisons with results from another study are difficult. (What if the other study used phrases of different lengths?) However, if the result is reported as a 2 / 50 = 4% error rate, there is an immediate sense of the meaning, magnitude, and relevance of the human performance measured, and as convention has it, the other study likely reported error rates in much the same way. So where possible, normalize counts to make the measurements more meaningful and to facilitate comparisons. An example in the literature is an experiment comparing five different text entry methods (Magerkurth and Stenzel, 2003). For speed, results were reported in “words per minute” (that’s fine); however, for accuracy, results were reported as the number of errors committed. Novice participants, for example, committed 24 errors while using multi-tap (Magerkurth and Stenzel, 2003, Table 2). While this number is useful for comparing results within the experiment, it provides no insight as to how the results compare with those in related research. The results would be more enlightening if normalized for the amount of text entered and reported as an “error rate (%),” computed as the number of character errors divided by the total number of characters entered times 100.
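The arithmetic behind these normalized measures is easy to capture in code. The Python sketch below is illustrative only, using the numbers from the examples above; the function names are invented.

def words_per_minute(words, seconds):
    # Normalize an entry count to words per minute
    return words / (seconds / 60.0)

def error_rate_percent(errors, characters):
    # Normalize an error count to a percentage of characters entered
    return errors / characters * 100.0

print(words_per_minute(10, 30))   # 10 words in 30 s -> 20.0 wpm
print(error_rate_percent(2, 50))  # 2 errors in a 50-character phrase -> 4.0%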


4.5 Research questions

In HCI, we conduct experimental research to answer (and raise!) questions about a new or existing user interface or interaction technique. Often the questions pertain to the relationship between two variables, where one variable is a circumstance or condition that is manipulated (an interface property) and the other is an observed and measured behavioral response (task performance). The notion of posing or answering questions seems simple enough, but this is tricky because of the human element. Unlike an algorithm operating on a data set, where the time to search, sort, or whatever is the same with each try, people exhibit variability in their actions. This is true both from person to person and for a single person repeating a task. The result is always different! This variability affects the confidence with which we can answer research questions. To gauge the confidence of our answers, we use statistical techniques, as presented in Chapter 6, Hypothesis Testing.

Research questions emerge from an inquisitive process. The researcher has an idea and wishes to see if it has merit. Initial thoughts are fluid and informal:

● Is it viable?
● Is it as good as or better than current practice?
● What are its strengths and weaknesses?
● Which of several alternatives is best?
● What are the human performance limits and capabilities?
● Does it work well for novices, for experts?
● How much practice is required to become proficient?

These questions are unquestionably relevant, since they capture a researcher's thinking at the early stages of a research project. However, the questions above suffer a serious deficiency: They are not testable. The goal, then, is to move forward from the loose and informal questions above to questions more suitable for empirical and experimental enquiry. I'll use an example to show how this is done.

Perhaps a researcher is interested in text entry on touchscreen phones. Texting is something people do a lot. The researcher is experienced with the Qwerty soft keyboard on touchscreen phones, but finds it error prone and slow. Having thought about the problem for a while, an idea emerges for a new technique for entering text. Perhaps it's a good idea. Perhaps it's really good, better than the basic Qwerty soft keyboard (QSK). Being motivated to do research in HCI, the researcher builds a prototype of the entry technique and fiddles with the implementation until it works fine. The researcher decides to undertake some experimental research to evaluate the idea. What are the research questions? Perhaps the following capture the researcher's thinking:

● Is the new technique any good?
● Is the new technique better than QSK?


● Is the new technique faster than QSK?
● Is the new technique faster than QSK after a bit of practice?
● Is the measured entry speed (in words per minute) higher for the new technique than for QSK after one hour of use?

From top to bottom, the questions are progressively narrower and more focused. Expressions like “any good” or “better than,” although well intentioned, are problematic for research. Remember observation and measurement? How does one measure “better than”? Farther down the list, the questions address qualities that are more easily observed and measured. Furthermore, since they are expressed across alternative designs, comparisons are possible. The last question speaks very specifically to entry speed measured in words per minute, to a comparison between two methods, and to a criterion for practice. This is a testable research question.

4.6 Internal validity and external validity

At this juncture we are in a position to consider two important properties of experimental research: internal validity and external validity. I'll use the research questions above to frame the discussion. Two of the questions appear in the plot in Figure 4.8. The x-axis is labeled Breadth of Question or, alternatively, External Validity. The y-axis is labeled Accuracy of Answer or, alternatively, Internal Validity. The question

Is the new technique better than QSK?

is positioned as high in breadth (that's good!) yet answerable with low accuracy (that's bad!). As already noted, this question is not testable in an empirical sense. Attempts to answer it directly are fraught with problems, because we lack a methodology to observe and measure "better than" (even though finding better interfaces is the final goal).

FIGURE 4.8 Graphical comparison of Internal Validity and External Validity. (The x-axis plots Breadth of Question, i.e., External Validity, from low to high; the y-axis plots Accuracy of Answer, i.e., Internal Validity, from low to high.)


The other, more detailed question

Is the measured entry speed (in words per minute) higher with the new technique than with QSK after one hour of use?

is positioned as low in breadth (that's bad!) yet answerable with high accuracy (that's good!). The question is testable, which means we can craft a methodology to answer it through observation and measurement. Unfortunately, the narrow scope of the question brings different problems. Focusing on entry speed is fine, but what about other aspects of the interaction? What about accuracy, effort, comfort, cognitive load, user satisfaction, practical use of the technique, and so on? The question excludes consideration of these, hence the low breadth rating.

The alternative labels for the axes in Figure 4.8 are internal validity and external validity. In fact, the figure was designed to set up discussion on these important terms in experimental research.

Internal validity (definition) is the extent to which an effect observed is due to the test conditions. For the example, an effect is simply the difference in entry speed between the new technique and QSK. If we conduct an experiment to measure and compare the entry speed for the two techniques, we want confidence that the difference observed was actually due to inherent differences between the techniques. Internal validity captures this confidence. Perhaps the difference was due to something else, such as variability in the responses of the participants in the study. Humans differ. Some people are predisposed to be meticulous, while others are carefree, even reckless. Furthermore, human behavior—individually or between people—can change from one moment to the next, for no obvious reason. Were some participants tested early in the day, others late in the day? Were there any distractions, interruptions, or other environmental changes during testing? Suffice it to say that any source of variation beyond that due to the inherent properties of the test conditions tends to compromise internal validity. High internal validity means the effect observed really exists.

External validity (definition) is the extent to which experimental results are generalizable to other people and other situations. Generalizable clearly speaks to breadth in Figure 4.8. To the extent the research pursues broadly framed questions, the results tend to be broadly applicable. But there is more. Research results that apply to "other people" imply that the participants involved were representative of a larger intended population. If the experiment used 18- to 25-year-old computer literate college students, the results might generalize to middle-aged computer literate professionals. But they might not generalize to middle-aged people without computer experience. And they likely would not apply to the elderly, to children, or to users with certain disabilities. In experimental research, random sampling is important for generalizability; that is, the participants selected for testing were drawn at random from the desired population.

Generalizable to "other situations" means the experimental environment and procedures were representative of real world situations where the interface or technique will be used.


FIGURE 4.9 There is tension between internal validity and external validity. Improving one comes at the expense of the other. (Sketch courtesy of Bartosz Bajer)

If the research studied the usability of a GPS system for taxi drivers or delivery personnel and the experiment was conducted in a quiet, secluded research lab, there may be a problem with external validity. Perhaps a different experimental environment should be considered. Research on text entry where participants enter predetermined text phrases with no punctuation symbols, no uppercase characters, and without any ability to correct mistakes may also have a problem with external validity. Again, a different experimental procedure should be considered.

The scenarios above are overly dogmatic. Experiment design is an exercise in compromise. Although we speak in the strictest terms about high internal validity and high external validity, in practice one is achieved at the expense of the other, as characterized in Figure 4.9. To appreciate the tension between internal and external validity, two additional examples are presented.

The first pertains to the experimental environment. Consider an experiment that compares two remote pointing devices for presentation systems. To improve external validity, the experimental environment mimics expected usage. Participants are tested in a large room with a large presentation-size display; they stand and are positioned a few meters from the display. The other participants are engaged to act as an audience, sitting around tables in the room during testing. There is no doubt this environment improves external validity. But what about internal validity? Some participants may be distracted or intimidated by the audience. Others might have a tendency to show off, impress, or act out. Such behaviors introduce sources of variation outside the realm of the devices under test, and thereby compromise internal validity. So our effort to improve external validity through environmental considerations may negatively impact internal validity.

A second example pertains to the experimental procedure. Consider an experiment comparing two methods of text entry. In an attempt to improve external validity, participants are instructed to enter whatever text they think of. The text may include punctuation symbols and uppercase and lowercase characters, and participants can edit the text and correct errors as they go. Again, external validity is improved since this is what people normally do when entering text. However, internal validity is compromised because behaviors are introduced that are not directly related to the text entry techniques—behaviors such as pondering (What should I enter next?) and fiddling with commands (How do I move the cursor back and make a correction? How is overtype mode invoked?).


Furthermore, since participants generate the text, errors are difficult to record since there is no "source text" with which to compare the entered text. So here again we see the compromise. The desire to improve external validity through procedural considerations may negatively impact internal validity.

Unfortunately, there is no universal remedy for the tension between internal and external validity. At the very least, one must acknowledge the limitations. Formulating conclusions that are broader than what the results suggest is sure to raise the ire of reviewers. We can strive for the best of both worlds with a simple approach, however. Posing multiple narrow (testable) questions that cover the range of outcomes influencing the broader (untestable) questions will increase both internal and external validity. For example, a technique that is fast, accurate, easy to learn, easy to remember, and considered comfortable and enjoyable by users is generally better. Usually there is a positive correlation between the testable and untestable questions; i.e., participants generally find one UI better than another if it is faster and more accurate, takes fewer steps, is more enjoyable, is more comfortable, and so on.

Before moving on, it is worth mentioning ecological validity, a term closely related to external validity. The main distinction is in how the terms are used. Ecological validity refers to the methodology (using materials, tasks, and situations typical of the real world), whereas external validity refers to the outcome (obtaining results that generalize to a broad range of people and situations).

4.7 Comparative evaluations

Evaluating new ideas for user interfaces or interaction techniques is central to research in human-computer interaction. However, evaluations in HCI sometimes focus on a single idea or interface. The idea is conceived, designed, implemented, and evaluated—but not compared. The research component of such an evaluation is questionable. Or, to the extent the exercise is labeled research, it is more aligned with the second definition of research noted earlier: "collecting information about a particular subject." From a research perspective, our third definition is more appealing, since it includes the ideas of experimentation, discovery, and developing theories of interaction.

Certainly, more meaningful and insightful results are obtained if a comparative evaluation is performed. In other words, a new user interface or interaction technique is designed and implemented and then compared with one or more alternative designs to determine which is faster, more accurate, less confusing, more preferred by users, etc. The alternatives may be variations in the new design, an established design (a baseline condition), or some combination of the two. In fact, the testable research questions above are crafted as comparisons (e.g., "Is Method A faster than Method B for …?"), and for good reason. A controlled experiment must include at least one independent variable and the independent variable must have at least two levels or test conditions.


FIGURE 4.10 Including a baseline condition serves as a check on the methodology and facilitates the comparison of results between user studies.

Comparison, then, is inherent in research following the experimental method discussed earlier. The design of HCI experiments is elaborated further in Chapter 5.

The idea of including an established design as a baseline condition is particularly appealing. There are two benefits. First, the baseline condition serves as a check on the methodology. Baseline conditions are well traveled in the research literature, so results in a new experiment are expected to align with previous results. Second, the baseline condition allows results to be compared with other studies. The general idea is shown in Figure 4.10. The results from two hypothetical user studies are shown. Both user studies are comparative evaluations and both include condition A as a baseline. Provided the methodology was more or less the same, the performance results in the two studies should be the same or similar for the baseline condition. This serves not only as a check on the methodology but also facilitates comparisons between the two user studies. A quick look at the charts suggests that condition C out-performs condition B. This is an interesting observation because condition C was evaluated in one study, condition B in another.

Consider the idea cited earlier of comparing two remote pointing devices for presentation systems. Such a study would benefit by including a conventional mouse as a baseline condition.8 If the results for the mouse are consistent with those found in other studies, then the methodology was probably okay, and the results for the remote pointing devices are likely valid. Furthermore, conclusions can often be expressed in terms of the known baseline condition, for example, "Device A was found to be about 8 percent slower than a conventional mouse."

The value in conducting a comparative study was examined by Tohidi et al. (2006), who tested the hypothesis that a comparative evaluation yields more insight than a one-off evaluation. In their study, participants were assigned to groups and were asked to manually perform simple tasks with climate control interfaces (i.e., thermostats).

8. The example cited earlier on remote pointing devices included a conventional mouse as a baseline condition (MacKenzie and Jusoh, 2001).


There were three different interfaces tested. Some of the participants interacted with just one interface, while others did the same tasks with all three interfaces. The participants interacting with all three interfaces consistently found more problems and were more critical of the interfaces. They were also less prone to inflate their subjective ratings. While this experiment was fully qualitative—human performance was not measured or quantified—the message is the same: a comparative evaluation yields more valuable and insightful results than a single-interface evaluation.
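Expressing results relative to a baseline condition, as in the "about 8 percent slower than a conventional mouse" example above, is a simple calculation. A minimal sketch with hypothetical mean task-completion times (not data from any cited study):

```python
# Minimal sketch: expressing results relative to a baseline condition.
# Hypothetical mean task-completion times in seconds.
mean_time = {"mouse (baseline)": 5.0, "device A": 5.4, "device B": 5.9}

baseline = mean_time["mouse (baseline)"]
for condition, t in mean_time.items():
    pct = (t - baseline) / baseline * 100.0
    print(f"{condition}: {t:.1f} s ({pct:+.0f}% vs. baseline)")
```

Because the baseline is well traveled in the literature, a reader can relate "device A is about 8% slower than the mouse" to results from other studies.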

4.8 Relationships: circumstantial and causal

I noted above that looking for and explaining interesting relationships is part of what we do in HCI research. Often a controlled experiment is designed and conducted specifically for this purpose, and if done properly a particular type of conclusion is possible. We can often say that the condition manipulated in the experiment caused the changes in the human responses that were observed and measured. This is a cause-and-effect relationship, or simply a causal relationship. In HCI, the variable manipulated is often a nominal-scale attribute of an interface, such as device, entry method, feedback modality, selection technique, menu depth, button layout, and so on. The variable measured is typically a ratio-scale human behavior, such as task completion time, error rate, or the number of button clicks, scrolling events, gaze shifts, etc.

Finding a causal relationship in an HCI experiment yields a powerful conclusion. If the human response measured is vital in HCI, such as the time it takes to do a common task, then knowing that a condition tested in the experiment reduces this time is a valuable outcome. If the condition is an implementation of a novel idea and it was compared with current practice, there may indeed be reason to celebrate. Not only has a causal relationship been found, but the new idea improves on existing practice. This is the sort of outcome that adds valuable knowledge to the discipline; it moves the state of the art forward.9 This is what HCI research is all about!

Finding a relationship does not necessarily mean a causal relationship exists. Many relationships are circumstantial. They exist, and they can be observed, measured, and quantified. But they are not causal, and any attempt to express the relationship as such is wrong. The classic example is the relationship between smoking and cancer. Suppose a research study tracks the habits and health of a large number of people over many years. This is an example of the correlational method of research mentioned earlier. In the end, a relationship is found between smoking and cancer: cancer is more prevalent in the people who smoked. Is it correct to conclude from the study that smoking causes cancer? No. The relationship observed is circumstantial, not causal.

9. Reporting a non-significant outcome is also important, particularly if there is reason to believe a test condition might improve an interface or interaction technique. Reporting a non-significant outcome means that, at the very least, other researchers needn't pursue the idea further.


Consider this: when the data are examined more closely, it is discovered that the tendency to develop cancer is also related to other variables in the data set. It seems the people who developed cancer also tended to drink more alcohol, eat more fatty foods, sleep less, listen to rock music, and so on. Perhaps it was the increased consumption of alcohol that caused the cancer, or the consumption of fatty foods, or something else. The relationship is circumstantial, not causal.

This is not to say that circumstantial relationships are not useful. Looking for and finding a circumstantial relationship is often the first step in further research, in part because it is relatively easy to collect data and look for circumstantial relationships.

Causal relationships emerge from controlled experiments. Looking for a causal relationship requires a study where, among other things, participants are selected randomly from a population and are randomly assigned to test conditions. A random assignment ensures that each group of participants is the same or similar in all respects except for the conditions under which each group is tested. Thus, the differences that emerge are more likely due to (caused by) the test conditions than to environmental or other circumstances. Sometimes participants are balanced into groups where the participants in each group are screened so that the groups are equal in terms of other relevant attributes. For example, an experiment testing two input controllers for games could randomly assign participants to groups or balance the groups to ensure the range of gaming experience is approximately equal.

Here is an HCI example similar to the smoking versus cancer example: A researcher is interested in comparing multi-tap and predictive input (T9) for text entry on a mobile phone. The researcher ventures into the world and approaches mobile phone users, asking for five minutes of their time. Many agree. They answer a few questions about experience and usage habits, including their preferred method of entering text messages. Fifteen multi-tap users and 15 T9 users are found. The users are asked to enter a prescribed phrase of text while they are timed. Back in the lab, the data are analyzed. Evidently, the T9 users were faster, entering at a rate of 18 words per minute, compared to 12 words per minute for the multi-tap users. That's 50 percent faster for the T9 users! What is the conclusion? There is a relationship between method of entry and text entry speed; however, the relationship is circumstantial, not causal. It is reasonable to report what was done and what was found, but it is wrong to venture beyond what the methodology gives. Concluding from this simple study that T9 is faster than multi-tap would be wrong.

Upon inspecting the data more closely, it is discovered that the T9 users tended to be more tech-savvy: they reported considerably more experience using mobile phones, and also reported sending considerably more text messages per day than the multi-tap users who, by and large, said they didn't like sending text messages and did so very infrequently.10 So the difference observed may be due to prior experience and usage habits, rather than to inherent differences in the text entry methods. If there is a genuine interest in determining if one text entry method is faster than another, a controlled experiment is required. This is the topic of the next chapter.

10. Although it is more difficult to determine, perhaps technically savvy users were more willing to participate in the study. Perhaps the users who declined to participate were predominantly multi-tap users.


One final point deserves mention. Cause-and-effect conclusions are not possible in certain types of controlled experiments. If the variable manipulated is a naturally occurring attribute of participants, then cause-and-effect conclusions are unreliable. Examples of naturally occurring attributes include gender (female, male), personality (extrovert, introvert), handedness (left, right), first language (e.g., English, French, Spanish), political viewpoint (left, right), and so on. These attributes are legitimate independent variables, but they cannot be manipulated, which is to say, they cannot be assigned to participants. In such cases, a cause-and-effect conclusion is not valid because it is not possible to avoid confounding variables (defined in Chapter 5). Being a male, being an extrovert, being left-handed, and so on always brings forth other attributes that systematically vary across levels of the independent variable. Cause-and-effect conclusions are unreliable in these cases because it is not possible to know whether the experimental effect was due to the independent variable or to the confounding variable.
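As a minimal sketch of the random assignment and balance check described above (hypothetical participant IDs and a made-up gaming-experience attribute; an illustration, not a prescribed procedure):

```python
# Minimal sketch: random assignment of participants to two test conditions,
# with a balance check on a relevant attribute. All values are hypothetical.
import random

# participant ID -> years of gaming experience (made-up attribute)
participants = {f"P{i:02d}": random.randint(0, 20) for i in range(1, 17)}

ids = list(participants)
random.shuffle(ids)                      # random assignment
group_a, group_b = ids[:8], ids[8:]

def mean_experience(group):
    return sum(participants[p] for p in group) / len(group)

print("Group A mean experience:", mean_experience(group_a))
print("Group B mean experience:", mean_experience(group_b))
# If the group means differ substantially, a balanced assignment could be used
# instead (e.g., sort by experience and alternate assignment between groups).
```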

4.9 Research topics

Most HCI research is not about designing products. It's not even about designing applications for products. In fact, it's not even about design or products. Research in HCI, like in most fields, tends to nip away at the edges. The march forward tends to be incremental. The truth is, most new research ideas tend to build on existing ideas and do so in modest ways. A small improvement to this, a little change to that. When big changes do arise, they usually involve bringing to market, through engineering and design, ideas that already exist in the research literature.

Examples are the finger flick and two-finger gestures used on touchscreen phones. Most users likely encountered these for the first time with the Apple iPhone. The gestures seem like bold new advances in interaction, but, of course, they are not. The flick gesture dates at least to the 1960s. Flicks are clearly seen in use with a light pen in the videos of Sutherland's Sketchpad, viewable on YouTube. They are used to terminate a drawing command. Two-finger gestures date at least to the 1970s. Figure 4.11 shows Herot and Weinzapfel's (1978) two-finger gesture used to rotate a virtual knob on a touch-sensitive display. As reported, the knob can be rotated to within 5 degrees of a target position. So what might seem like a bold new advance is often a matter of good engineering and design, using ideas that already exist.

Finding a research topic is often the most challenging step for graduate students in HCI (and other fields). The expression "ABD" for "all but dissertation" is a sad reminder of this predicament. Graduate students sometimes find themselves in a position of having finished all degree requirements (e.g., coursework, a teaching practicum) without nailing down the big topic for dissertation research. Students might be surprised to learn that seasoned researchers in universities and industry also struggle for that next big idea. Akin to writer's block, the harder one tries, the less likely is the idea to appear.


FIGURE 4.11 A two-finger gesture on a touch-sensitive display is used to rotate a virtual knob. (Adapted from Herot and Weinzapfel, 1978)

I will present four tips to overcome "researcher's block" later in this section. First, I present a few observations on ideas and how and where they arise.

4.9.1 Ideas

In the halcyon days after World War II, there was an American television show, a situation comedy, or sitcom, called The Many Loves of Dobie Gillis (1959–1963). Much like Seinfeld many years later, the show was largely about, well, nothing. Dobie's leisurely life mostly focused on getting rich or on endearing a beautiful woman to his heart. Each episode began with an idea, a scheme. The opening scene often placed Dobie on a park bench beside The Thinker, the bronze and marble statue by French sculptor Auguste Rodin (1840–1917). (See Figure 4.12.) After some pensive moments by the statue, Dobie's idea, his scheme, would come to him.

It would be nice if research ideas in HCI were similarly available and with such assurance as were Dobie's ideas. That they are not is no cause for concern, however. Dobie's plans usually failed miserably, so we might question his approach to formulating his plans. Is it possible that The Thinker, in his pose, is more likely to inspire writer's block than the idea so desperately sought? The answer may be yes, but there is little science here. We are dealing with human thought, inspiration, creativity, and a milieu of other human qualities that are poorly understood, at best.

If working hard to find a good idea doesn't work, perhaps a better approach is to relax and just get on with one's day. This seems to have worked for the ancient Greek scholar Archimedes (287–212 BC), who is said to have effortlessly come upon a brilliant idea as a solution to a problem. As a scientist, Archimedes was called upon to determine if King Hiero's crown was pure gold or if it was compromised with a lesser alloy. One solution was to melt the crown, separating the constituent parts. This would destroy the crown—not a good idea. Archimedes' idea was simple, and he is said to have discovered it while taking a bath. Yes, taking a bath, rather than sitting for hours in The Thinker's pose. He realized—in an instant—that the volume of water displaced as he entered the bathtub must equal the volume of his body.


FIGURE 4.12 Rodin’s The Thinker often appeared in the opening scenes of the American sitcom The Many Loves of Dobie Gillis.

Immersing the crown in water would similarly yield the crown's volume, and this, combined with the crown's weight, would reveal the crown's density. If the density of the crown equaled the known density of gold, the King's crown was pure gold—problem solved. According to the legend, Archimedes was so elated at his moment of revelation that he jumped from his bath and ran nude about the streets of Syracuse shouting "Eureka!" ("I found it!").

While legends make good stories, we are not likely to be as fortunate as Archimedes in finding a good idea for an HCI research topic. Inspiration is not always the result of a single moment of revelation. It is often gradual, with sources unknown or without a conscious and recognizable connection to the problem. Recall Vannevar Bush's memex, described in the opening chapter of this book. Memex was a concept. It was never built, even though Bush described the interaction with memex in considerable detail. We know memex today as hypertext and the World Wide Web. But where and how did Bush get his idea?

The starting point is having a problem to solve. The problem of interest to Bush was coping with ever-expanding volumes of information. Scientists like Bush needed a convenient way to access this information. But how? It seems Bush's inspiration for memex came from… Let's pause for a moment, lest we infer Bush was engaged in a structured approach to problem solving. It is not likely that Bush went to work one morning intent on solving the problem of information access. More than likely, the idea came without deliberate effort. It may have come flittingly, in an instant, or gradually, over days, weeks, or months. Who knows? What is known, however, is that the idea did not arise from nothing. Ideas come from the human experience. This is why in HCI we often read about things like "knowledge in the head and knowledge in the world" (Norman, 1988, ch. 3) or metaphor and analogy (Carroll and Thomas, 1982).


FIGURE 4.13 Pie menus in HCI: (a) The inspiration? (b) HCI example. (Adapted from G. Kurtenbach, 1993)

The context for inspiration is the human experience. So what was the source of Bush's inspiration for memex? The answer is in Bush's article, and also in Chapter 1.

Are there other examples relevant to HCI? Sure. Twitter co-founder Jack Dorsey is said to have come up with the idea for the popular micro-blogging site while sitting on a children's slide in a park eating Mexican food.11 What about pie menus in graphical user interfaces? Pie menus, as an alternative to linear menus, were first proposed by Don Hopkins at the University of Maryland in 1988 (cited in Callahan et al., 1988). We might wonder about the source of Hopkins' inspiration (see Figure 4.13). See also student exercises 4-2 and 4-3 at the end of this chapter.

4.9.2 Finding a topic

It is no small feat to find an interesting research topic. In the following paragraphs, four tips are offered on finding a topic suitable for research. As with the earlier discussion on the cost and frequency of errors (see Figure 3.46), there is little science to offer here. The ideas follow from personal experience and from working with students and other researchers in HCI.

4.9.3 Tip #1: Think small!

At a conference recently, I had an interesting conversation with a student. He was a graduate student in HCI. "Have you found a topic for your research?" I asked. "Not really," he said. He had a topic, but only in a broad sense. Seems his supervisor had funding for a large research project related to aviation. The topic, in a general sense, was to develop an improved user interface for an air traffic control system. He was stuck. Where to begin? Did I have any ideas for him? Well, actually, no I didn't. Who wouldn't be stuck? The task of developing a UI for an air traffic control system is huge. Furthermore, the project mostly involves engineering and design.

11. New York Times, Oct 30, 2010, p. BU1.


Where is the research in designing an improved system of any sort? What are the research questions? What are the experimental variables? Unfortunately, graduate students are often saddled with similar big problems because a supervisor's funding source requires it. The rest of our discussion focused on narrowing the problem—in a big way. Not to some definable sub-system, but to a small aspect of the interface or interaction. The smaller, the better.

The point above is to think small. On finding that big idea, the advice is… forget it. Once you shed that innate desire to find something really significant and important, it's amazing what will follow. If you have a small idea, something that seems a little useful, it's probably worth pursuing as a research project. Pursue it and the next thing you know, three or four related interaction improvements come to mind. Soon enough, there's a dissertation topic in the works. So don't hesitate to think small.

4.9.4 Tip #2: Replicate!

An effective way to get started on research is to replicate an existing experiment from the HCI literature. This seems odd. Where is the research in simply replicating what has already been done? Of course, there is none. But there is a trick.

Having taught HCI courses many times over many years, I know that students frequently get stuck finding a topic for the course's research project. Students frequently approach me for suggestions. If I have an idea that seems relevant to the student's interests, I'll suggest it. Quite often (usually!) I don't have any particular idea. If nothing comes to mind, I take another approach. The student is advised just to study the HCI literature—research papers from the CHI proceedings, for example—and find some experimental research on a topic of interest. Then just replicate the experiment. Is that okay? I am asked. Sure, no problem.

The trick is in the path to replicating. Replicating a research experiment requires a lot of work. The process of studying a research paper and precisely determining what was done, then implementing it, testing it, debugging it, doing an experiment around it, and so on will empower the student—the researcher—with a deep understanding of the issues, much deeper than simply reading the paper. This moves the line forward. The stage is set. Quite often, a new idea, a new twist, emerges.

But it is important not to require something new. The pressure in that may backfire. Something new may emerge, but this might not happen until late in the process, or after the experiment is finished. So it is important to avoid a requirement for novelty. This is difficult, because it is germane to the human condition to strive for something new. Self-doubt may bring the process to a standstill. So keep the expectations low. A small tweak here, a little change there. Good enough. No pressure. Just replicate. You may be surprised with the outcome.

4.9.5 Tip #3: Know the literature!

It might seem obvious, but the process of reviewing research papers on a topic of interest is an excellent way to develop ideas for research projects.


The starting point is identifying the topic in a general sense. If one finds gaming of interest, then gaming is the topic. If one finds social networking of interest, then that's the topic. From there the task is to search out and aggressively study and analyze all published research on the topic. If there are too many publications, then narrow the topic. What, in particular, is the interest in gaming or social networking? Continue the search. Use Google Scholar, the ACM Digital Library, or whatever resource is conveniently available.

Download all the papers, store them, organize them, study them, make notes, then open a spreadsheet file and start tabulating features from the papers. In the rows, identify the papers. In the columns, tabulate aspects of the interface or interaction technique, conditions tested, results obtained, and so on. Organize the table in whatever manner seems reasonable. The process is chaotic at first. Where to begin? What are the issues? The task is daunting, at the very least, because of the divergence in reporting methods. But that's the point. The gain is in the process—bringing shape and structure to the chaos. The table will grow as more papers are found and analyzed.

There are examples of such tables in published papers, albeit in a condensed summary form. Figure 4.14 shows an example from a research paper on text entry using small keyboards. The table amounts to a mini literature review. Although the table is neat and tidy, don't be fooled. It emerged from a difficult and chaotic process of reviewing a collection of papers and finding common and relevant issues. The collection of notes in the right-hand column is evidence of the difficulty. This column is like a disclaimer, pointing out issues that complicate comparisons of the data in the other columns. Are there research topics lurking within Figure 4.14? Probably. But the point is the process, not the product. Building such a table shapes the research area into relevant categories of inquiry. Similar tables are found in other research papers (e.g., Figure 11 and Figure 12 in MacKenzie, 1992; Table 3 and Table 4 in Soukoreff and MacKenzie, 2004). See also student exercise 4-4 at the end of this chapter.
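One lightweight way to start such a table is a spreadsheet or CSV file with papers in rows and aspects of interest in columns. A minimal sketch in Python; the papers and columns below are placeholders, not real studies:

```python
# Minimal sketch: tabulating papers (rows) against aspects of interest (columns).
# The entries below are placeholders, not real studies.
import csv

columns = ["Paper", "Interface or technique", "Conditions tested", "Participants", "Main result"]
rows = [
    ["Author A, 2020", "tilt-based scrolling", "tilt vs. touch", "12", "tilt 15% slower"],
    ["Author B, 2021", "two-thumb typing", "split vs. standard layout", "18", "no significant difference"],
]

with open("literature_table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    writer.writerows(rows)
```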

4.9.6 Tip #4: Think inside the box!

The common expression "think outside the box" is a challenge to all. The idea is to dispense with accepted beliefs and assumptions (in the box) and to think in a new way that assumes nothing and challenges everything. However, there is a problem with the challenge. Contemporary, tech-savvy people, clever as they are, often believe they in fact do think outside the box, and that it is everyone else who is confined to life in the box. With this view, the challenge is lost before starting. If there is anything useful in tip #4, it begins with an unsavory precept: You are inside the box!

All is not lost, however. Thinking inside the box, then, is thinking about and challenging one's own experiences—the experiences inside the box. The idea is simple. Just get on with your day, but at every juncture, every interaction, think and question. What happened? Why did it happen? Is there an alternative?

Study (1st author) | Number of Keys (a) | Direct/Indirect | Scanning | Participants | Speed (wpm) (b) | Notes
Bellman [2] | 5 | Indirect | No | 11 | 11 | 4 cursor keys + SELECT key. Error rates not reported. No error correction method.
Dunlop [4] | 4 | Direct | No | 12 | 8.90 | 4 letter keys + SPACE key. Error rates reported as "very low."
Dunlop [5] | 4 | Direct | No | 20 | 12 | 4 letter keys + 1 key for SPACE/NEXT. Error rates not reported. No error correction method.
Tanaka-Ishii [25] | 3 | Direct | No | 8 | 12+ | 4 letter keys + 4 keys for editing and selecting. 5 hours training. Error rates not reported. Errors corrected using CLEAR key.
Gong [7] | 3 | Direct | No | 32 | 8.01 | 3 letter keys + two additional keys. Error rate = 2.1%. Errors corrected using DELETE key.
MacKenzie [16] | 3 | Indirect | No | 10 | 9.61 | 2 cursor keys + SELECT key. Error rate = 2.2%. No error correction method.
Baljko [1] | 2 | Indirect | Yes | 12 | 3.08 | 1 SELECT key + BACKSPACE key. 43 virtual keys. RC scanning. Same phrase entered 4 times. Error rate = 18.5%. Scanning interval = 750 ms.
Simpson [24] | 1 | Indirect | Yes | 4 | 4.48 | 1 SELECT key. 26 virtual keys. RC scanning. Excluded trials with selection errors or missed selections. No error correction. Scanning interval = 525 ms at end of study.
Koester [10] | 1 | Indirect | Yes | 3 | 7.2 | 1 SELECT key. 33 virtual keys. RC scanning with word prediction. Dictionary size not given. Virtual BACKSPACE key. 10 blocks of trials. Error rates not reported. Included trials with selection errors or missed selections. Fastest participant: 8.4 wpm.

(a) For "direct" entry, the value is the number of letter keys. For "indirect" entry, the value is the total number of keys.
(b) The entry speed cited is the highest of the values reported in each source, taken from the last block if multiple blocks.

FIGURE 4.14 Table showing papers (rows) and relevant conditions or results (columns) from research papers on text entry using small keyboards. (From MacKenzie, 2009b, Table 1; consult for full details on studies cited)


FIGURE 4.15 Elevator control panel. The button label is more prominent than the button.

Play the role of both a participant (this is unavoidable) and an observer. Observe others, of course, but more importantly observe yourself. You are in the box, but have a look, study, and reconsider.

Here's an example, which on the surface seems trivial (but see tip #1). Recently, while at work at York University, I was walking to my class on Human-Computer Interaction. Being a bit late, I was in a hurry. The class was in a nearby building on the third floor and I was carrying some equipment. I entered the elevator and pushed the button—the wrong button. Apparently, for each floor the control panel has both a button label and a button. (See Figure 4.15.) I pushed the button label instead of the button. A second later I pushed the button, and my journey continued. End of story.

Of course, there is more. Why did I push the wrong button? Yes, I was in a hurry, but that's not the full reason. With a white number on a black background, the floor is identified more prominently by the button label than by the button. And the button label is round, like a button. On the button, the number is recessed in the metal and is barely visible. The error was minor, only a slip (right intention, wrong action; see Norman, 1988, ch. 5). Is there a research topic in this? Perhaps. Perhaps not. But experiencing, observing, and thinking about one's interactions with technology can generate ideas and promote a humbling yet questioning frame of thinking—thinking that moves forward into research topics. The truth is, I have numerous moments like this every day (and so do you!). Most amount to nothing, but the small foibles in interacting with technology are intriguing and worth thinking about.

In this chapter, we have examined the scientific foundations for research in human-computer interaction. With this, the next challenge is in designing and conducting experiments using human participants (users) to evaluate new ideas for user interfaces and interaction techniques. We explore these topics in Chapter 5.

STUDENT EXERCISES

4-1. Examine some published papers in HCI and find examples where results were reported as a raw count (e.g., number of errors) rather than as a count per something (e.g., percent errors). Find three examples and write a brief report (or prepare a brief presentation) detailing how the results were reported and the weakness or limitation in the method. Propose a better way to report the same results. Use charts or graphs where appropriate.

4-2. What, in Vannevar Bush's "human experience," formed the inspiration for memex? (If needed, review Bush's essay "As We May Think," or see the discussion in Chapter 1.) What are the similarities between his inspiration and memex?

4-3. A fisheye lens or fisheye view is a tool or concept in HCI whereby high-value information is presented in greater detail than low-value information. Furnas first introduced the idea in 1986 (Furnas, 1986). Although the motivation was to improve the visualization of large data sets, such as programs or databases, Furnas' idea came from something altogether different. What was Furnas' inspiration for fisheye views? Write a brief report describing the analogy offered by Furnas. Include in your report three examples of fisheye lenses, as described and implemented in subsequent research, noting in particular the background motivation.

4-4. Here are some research themes: 3D gaming, mobile phone use while driving, privacy in social networking, location-aware user interfaces, tactile feedback in pointing and selecting, multi-touch tabletop interaction. Choose one of these topics (or another) and build a table similar to that in Figure 4.14. Narrow the topic, if necessary (e.g., mobile phone texting while driving), and find at least five relevant research papers to include in the table. Organize the table identifying the papers in the rows and methods, relevant themes, and findings in the columns. Write a brief report about the table. Include citations and references to the selected papers.

4-5. In Chapter 3, we used a 2D plot to illustrate the trade-off between the frequency of errors (x-axis) and the cost of errors (y-axis) (see Figure 3.46). The plot was just a sketch, since the analysis was informal. In this chapter, we discussed another trade-off, that between form and function. The




Unexpected expectations: Public reaction to the Facebook emotional contagion study

new media & society, 2020, Vol. 22(6), 1076–1094. © The Author(s) 2019. https://doi.org/10.1177/1461444819876944. journals.sagepub.com/home/nms

Blake Hallinan, Jed R. Brubaker, and Casey Fiesler

University of Colorado Boulder, USA

Abstract

How to ethically conduct online platform-based research remains an unsettled issue and the source of continued controversy. The Facebook emotional contagion study, in which researchers altered Facebook News Feeds to determine whether exposure to emotional content influences a user's mood, has been one focal point of these discussions. The intense negative reaction by the media and public came as a surprise to those involved—but what prompted this reaction? We approach the Facebook study as a mediated controversy that reveals disconnects between how scholars, technologists, and the public understand platform-based research. We examine the controversy from the bottom up, analyzing public reactions expressed in comments on news articles. Our analysis reveals fundamental disagreements about what Facebook is and what a user's relationship to it should be. We argue that these divergent responses emphasize the contextual nature of technology and research ethics, and conclude with a relational and contextual approach to ethical decision-making.

Keywords

Controversy analysis, Facebook, nonuse, platform studies, privacy, research ethics

Corresponding author: Blake Hallinan, Department of Communication and Journalism, Hebrew University of Jerusalem, Jerusalem, Israel. Email: [email protected]



Introduction

The publication of "Experimental Evidence of Massive-Scale Emotional Contagion" (Kramer et al., 2014) in the Proceedings of the National Academy of Sciences (PNAS) on 2 June 2014 set the Internet ablaze. Reactions on Twitter expressed shock and outrage that Facebook was "LITERALLY playing with users' emotions" (Altmetric, n.d.). News reports echoed and amplified public sentiment, with headlines such as: "How Facebook's news feed controls what you see and how you feel" (Steadman, 2014), "Facebook totally screwed with a bunch of people in the name of science" (Frizell, 2014), and "So you are shocked Facebook did #psyops on people?" (Silberg, 2014). In the midst of the controversy, The Guardian conducted a reader poll where 61% of respondents reported that they were surprised to learn about the study, 84% had lost trust in the social network, and 66% were considering closing their account (Fishwick, 2014). One of the researchers involved in the study received hundreds of concerned emails from members of the public following the media attention (Hancock, 2019). As a result of the negative publicity and public reaction, both Facebook and the article's lead author issued apologies (D'Onfro, 2014; Hiltzik, 2014) and PNAS issued a statement of editorial concern (Verma, 2014). The various apologies revealed that the negative backlash to the study came as a surprise to the researchers, the journal, and Facebook. Though the technological architecture of Facebook has long shaped possibilities for expression and social interaction, the discussion surrounding the Facebook emotional contagion (FEC) study highlighted the implications of the technological architecture for the general public and raised ethical questions about conducting research on online platforms.

But what did the study, described as "amazing scifi reading" (Altmetric, n.d.), actually entail? Conducted as a collaboration between Facebook and academic researchers, the FEC study sought to both replicate laboratory experiments and longitudinal studies on the transference of emotions, or "emotional contagion" (Fowler and Christakis, 2008; Hatfield et al., 1993; Rosenquist et al., 2011), and test the claim from prior research that repeated exposure to positive content on Facebook was making its users unhappy due to negative social comparisons (Turkle, 2011). To this end, the researchers designed and conducted an experiment on nearly 700,000 English-speaking Facebook users in which they modified users' News Feeds, the algorithmically sorted feature that organizes and displays content generated from a user's list of friends, according to the results of automated sentiment analysis (Pennebaker et al., 2007). One group saw a higher concentration of positive content, one group saw a higher concentration of negative content, and one group saw less emotional content of any variety. By comparing the sentiment and frequency of user posts before and after the experiment, researchers found that users exposed to higher concentrations of emotional content were slightly more likely to feature similar emotional content in their own Facebook posts for up to 3 days after exposure, and users exposed to less emotional content showed a slight decrease in engagement with the site, posting less frequently and with fewer words (Kramer et al., 2014).
In short, the study offered evidence of some emotional contagion on Facebook and challenged the idea that exposure to positive content was making people sad, based on an assumption that the word choice of Facebook posts offers a reliable indicator of a person's emotional state.

In the wake of controversies such as the FEC study and the Cambridge Analytica scandal of 2018, there has been a pronounced interest in the ethics surrounding social media research (Brown et al., 2016; Stark, 2018; Vitak et al., 2016).


While issues of privacy and data use have received the most attention, the FEC study points to another important and unresolved issue—how to ethically conduct online platform-based research. The controversy that followed the publication of the FEC study provides a unique opportunity to examine responses to social computing research from members of the general public, including those who might have negative attitudes toward research or toward Facebook (e.g. Facebook nonusers).

To study public reaction, we collected thousands of comments left on news articles about the FEC study. Our primary goal was to develop a deep understanding of perceptions of and attitudes toward the controversy, and by extension, research ethics for social computing platforms generally. As a result, our analysis was driven by a set of exploratory research questions: what were the patterns of public responses? What issues and considerations were most important to commenters? Simplistically, why were people so upset about this study, and what can we learn from that?

Public reactions have the potential to be an important resource for bottom-up approaches to ethical decision-making and the research ethics community generally (Nebeker et al., 2017)—especially given the prominence of normative, top-down approaches to ethical issues. However, this work faces an inherent challenge: those most negatively impacted by research and those with the most negative attitudes toward research are least likely to have their voices heard within research (Fiesler and Proferes, 2018). Studies that are interested in understanding how the public perceives and feels about research ethics typically involve deception (Hudson and Bruckman, 2004) or face an inherent selection bias toward those willing to participate in a research study (Fiesler and Proferes, 2018; Schechter and Bravo-Lillo, 2014; Williams et al., 2017). How can we take into account other relevant voices, including those that are uninterested or unwilling to participate in research? One solution is to borrow from controversy analysis (Marres, 2015; Marres and Moats, 2015) and studies of mediated public reactions (Fiesler and Hallinan, 2018; Vines et al., 2013), which is the strategy we employ in our examination of comments on news articles.

Theoretical foundations

Our analysis of public reaction to the FEC study brings together two related research traditions: (1) controversy analysis from science and technology studies and (2) expectancy violation theory (EVT) from communication. Together, these traditions provide a framework for understanding the significance of public reaction to technology controversies.

Controversy analysis establishes the value of using mediated controversies to study contested issues alongside the role of contemporary media and communication technologies (Marres, 2015; Marres and Moats, 2015), drawing attention to beliefs and values that might otherwise be overlooked or taken for granted. For example, an analysis of the Facebook Trending Topics controversy showed that news reports on the practices of the human curation team acted as a proxy for discussion about larger shifts in the news media environment (Carlson, 2018). While public expectations for Facebook typically go unstated, catalysts such as the publication of the FEC study can bring these underlying views into the foreground and reveal tensions and vulnerabilities at work in the social integration of technologies (Goodnight, 2005). In other words, mediated controversies can reveal larger tensions within the cultural positioning of technology (Satchell and Dourish, 2009).


EVT holds that individuals have expectations about the communicative behavior of others and the violation of those expectations causes people to assess their knowledge of and relationship to others (Burgoon and Le Poire, 1993). Variables that influence expectations include characteristics of the communicator, the relationship, and the context (Burgoon and Le Poire, 1993; Griffin et al., 2011). While the theory was developed in the context of interpersonal, face-to-face interactions, more recent work has extended the theory to computer-mediated contexts—for example, norms of interactions on Facebook (Bevan et al., 2014; McLaughlin and Vitak, 2012). Design choices and features of social media platforms also shape the possibilities and expectations for interaction. Previous research has examined expectations for particular features, including the Facebook Like button (Scissors et al., 2016), algorithmic curation (Bucher, 2017; Eslami et al., 2015; Rader and Gray, 2015), and design changes (Eslami et al., 2016). Together, this work demonstrates that user expectations shape assessments about the experience of social media and the desirability of particular features and practices.

Where EVT research points to the gap between knowing that expectations have been violated and knowing what those expectations are (Shklovski et al., 2014), controversy analysis prompts consideration of what large, underlying factors may be at work behind the scenes. The analysis that follows demonstrates how an understanding of expectations about platforms can contribute to ethical decision-making for researchers.

Methods To examine public reaction, we collected and analyzed public comments on news articles about the FEC study. Analyzing the content of online news comments offers a time- and resource-efficient way to study public reactions (Henrich and Holmes, 2013). Previous research has used public comments to study public views on ethical and political issues related to the use of medical technologies (Chandler et al., 2017), climate change (De Kraker et al., 2014), online privacy (Fiesler and Hallinan, 2018), and even human-computer interaction (HCI) research (Vines et al., 2013). While the framing of news articles can impact comments, the FEC study was fundamentally a mediated controversy: people learned about the experiment through the publication of the research and subsequent news coverage. Therefore, it is neither possible nor desirable to separate public reaction from media coverage, since engagement with the media becomes the central site for people to analyze and understand the controversy. As participant-driven responses, comments help reveal issues of public importance (Chandler et al., 2017; Henrich and Holmes, 2013), which is particularly important for ethics research. Comments also capture reactions and sense-making practices as they unfold and provide access to the perspectives of people who may not have social media accounts or do not use social media frequently, potentially surfacing more critical or antagonistic perspectives than user-centric social media research (Satchell and Dourish, 2009). Finally, studying public comments helps address a known limitation of ethics research: participant response bias (Fiesler and Proferes, 2018). Where surveys, interviews, and lab studies on research ethics are limited to the perspectives of those who are

willing to participate in research, news comments are not subject to the same limitations. News comments provide a broader sample of online groups. Although news comments do introduce new biases—namely, people with Internet access willing to comment on news articles—they provide access to the reasoning behind different opinions. In addition, news comments are particularly impactful opinions, with previous research showing that public comments shape the views of other readers (De Kraker et al., 2014). This influence, combined with the potential to access nonusers and people uninterested in participating in research studies, makes the analysis of news comments a valuable complement to other ways of studying public reaction. However, there are ethical considerations with respect to the collection and analysis of public data. While this is a common practice in social computing research, there are disagreements within the research community about the ethics of, for example, whether to include quotes verbatim and how—if at all—to attribute authorship of quotes (Vitak et al., 2016). Although comments are publicly available information, a best practice for making ethical decisions about the use of public data is to consider the specific context and the expectations of the people involved (Fiesler and Proferes, 2018; Nissenbaum, 2004). Arguably, comments on news sites are more “public” than some other forms of social data—that is, data from social networking sites—because comments are addressed to an audience of strangers rather than a known community of friends or followers. Commenting on a news article also indicates an interest in making one’s viewpoint known, and in the FEC study, commenters were weighing in on research ethics and the practices of social media platforms, which aligns with the context and motivations of this article. After weighing potential risks to those whose content was part of our analysis, we have decided to include quotes verbatim, without identification, which is consistent with other thematic analyses of news comments (Chandler et al., 2017; Fiesler and Hallinan, 2018; Giles et al., 2015; Glenn et al., 2012; Vines et al., 2013), and also to choose illustrative quotes that are not easily discoverable through a simple web search and that do not reveal any personal or sensitive information.

Data collection In order to construct a dataset of public comments, we identified a set of articles starting with law professor and privacy advocate James Grimmelmann’s (2014) collection of Internet coverage about the FEC study. Because Grimmelmann’s article set included personal blog posts as well as journalist reporting, we narrowed the set into articles from news outlets that contained at least one comment, which resulted in 12 articles from that collection. Given that the earliest article on the list was published on 30 June 2014, nearly a month after the initial publication of the FEC study, we supplemented the collection with eight additional articles published prior to that date, identified using a keyword search (“Facebook + Research”) on LexisNexis and Google News for pieces published between 1 June and 30 June 2014. Our criteria for inclusion were that the article was (1) primarily about the FEC study; (2) written in English; and (3) included at least one comment; this supplemental, systematic method of adding additional articles also ensured that we included a broader set of news sources than may have been included by Grimmelmann. Our final dataset included comments from 20 articles from the

following news sources: The Atlantic (3), Slate (1), Forbes (3), The New York Times (3), The Guardian (2), Wired (1), Wall Street Journal (3), The Washington Post (1), Financial Times (1), The Telegraph (1), and The Chronicle of Higher Education (1). Although this was not a criterion for inclusion, all the articles were published by organizations based in the United States and the United Kingdom. Our search uncovered a few articles published in English in other countries, but none included comments. Therefore, in addition to the limitations with news comments as a data source generally, this data may be biased toward Western voices or toward news outlets with subject matter or ideological leanings that could have influenced the decision to cover this story and our results should be interpreted with this in mind. We manually collected all available comments on the articles, including top-level comments and replies. Our final dataset consisted of 2790 total comments from 20 unique articles. The number of comments on an article ranged from 2 to 850 (M = 140; SD = 215.13; median = 42).
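The descriptive statistics reported above can be recomputed directly from per-article comment counts. Below is a minimal sketch assuming a hypothetical list of counts (the study's actual per-article counts are not published here); only the number of articles, the range, and the reported summary statistics come from the text.

# Minimal sketch with hypothetical per-article comment counts; the real
# dataset had 20 articles, 2790 comments total, and counts ranging 2 to 850.
import statistics

comment_counts = [850, 620, 310, 198, 150, 120, 95, 80, 64, 55,
                  48, 42, 40, 35, 30, 25, 12, 8, 6, 2]  # placeholder values

print("articles:", len(comment_counts))
print("total comments:", sum(comment_counts))
print("mean:", round(statistics.mean(comment_counts), 2))
print("sd:", round(statistics.stdev(comment_counts), 2))
print("median:", statistics.median(comment_counts))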

Data analysis Driven by our exploratory research questions, we performed a thematic analysis (Clarke and Braun, 2006) of the data. As one of the most common approaches for studying online comments (Chandler et al., 2017; Giles et al., 2015; Holton et al., 2014; Silva, 2015; Vines et al., 2013), thematic analysis excels at revealing patterned responses in the data, especially when the analysis is concerned with meaning or explanation (Clarke and Braun, 2006). We began our analysis with the question: “What bothered people about the study?” We open coded individual comments and then developed themes inductively following the recursive steps outlined by Clarke and Braun (2006). Two of the authors met periodically to share and reconcile differences in coding, create memos, and to derive the themes discussed in the following section.
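As an illustration of how coded comments might be organized during this process, the sketch below tallies hypothetical open codes across comments; the codes and comment IDs are invented for illustration, and such a tally only supports, rather than replaces, the interpretive work of developing themes.

# Minimal sketch (hypothetical codes and IDs): tallying open codes applied
# to individual comments to see which candidate themes recur in the data.
from collections import Counter

coded_comments = {
    "c001": ["lab rat metaphor", "lack of consent"],
    "c002": ["a/b testing is routine", "nothing new"],
    "c003": ["blame the users", "nonuser pride"],
    "c004": ["lack of consent", "secret experiments"],
}

code_frequencies = Counter(
    code for codes in coded_comments.values() for code in codes
)
for code, count in code_frequencies.most_common():
    print(f"{code}: {count}")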

Findings Although comment sections are notoriously antagonistic spaces, distinct patterns emerged from the thematic analysis of our data. Here, we focus on four major themes that represent public reactions, which we have labeled “Living in a lab,” “Manipulation anxieties,” “Wake up, sheeple,” and “No big deal.” Across these themes, we find divergent and contradictory understandings of Facebook as a platform, along with repeated surprise that these understandings are not universally shared. As it turns out, the researchers behind the FEC study, PNAS, and Facebook were not the only ones surprised by the reaction to the study. Some members of the public were also surprised by the expectations of their peers; in other words, there appears to be no “common” sense when it comes to social media research.

Living in a lab The publication of the FEC study came as a surprise to some commenters who did not know that Facebook conducted experiments or collaborated with academic researchers. Their reactions were less about the specifics of testing emotional contagion and more

about the revelation of experimentation as a general practice. In other words, the announcement of any experiment would violate the implicit understanding of Facebook as a place for people to connect with friends and family: Dear Mr. Zuckerburg, Last I checked, we did not decide to jump in a petri dish to be utilized at your disposal . . . We connect with our loved ones.1

As the reference to a petri dish suggests, the concern is with the idea of research—or “secret experiments”—taking place on online platforms. Furthermore, the concern with experimentation often conflates very different models of research, including academic research, collaborative research between academics and corporations, and applied commercial research. The tendency to conflate all forms of platform-based research into a single category is facilitated by a lack of awareness about research practices—indeed, previous research has found, for example, that nearly two-thirds of Twitter users did not know that academic researchers use public social media data (Fiesler and Proferes, 2018). The temporal dynamics of online experiments further complicate the understanding of research on Facebook. Lab-based experiments conventionally have an obvious start and endpoint, making it clear when someone is (and is not) participating in research. With platform-based experiments, participants often have no knowledge of their own participation. In the case of the FEC study, Facebook users did not know about the experiment until it appeared in the media. Even then, people had no way of determining whether their own News Feed had been affected, despite their expressed interest—indeed, the question comes up repeatedly in our data, and one of the authors of the study received many emails with this question (Goel, 2014; Hancock, 2019). The uncertainty over participation and the lag in awareness created a sense of secrecy around research and prompted commenters to question what other kinds of experiments might be happening: This was two years ago? Who knows what they’re doing now.

Commenters overwhelmingly characterized scientific research as negative and exploitative. Some compared the contagion study with other controversial experiments such as the Stanford prison experiment (Recuber, 2016) and Nazi medical experimentation. Others invoked the language of biomedical experiments, comparing the treatment of Facebook users with that of animal test subjects—“lab rats” or “guinea pigs”—framing scientific research as inherently dehumanizing and without benefit to the experimental subject: At least lab rats get paid in chow. How does Facebook compensate its users to be sitting ducks for algorithms?

Even among comments defending the legitimacy of scientific research, there was little attention to any benefits, actual or potential, of the FEC study, which indicates a disconnect between the researchers’ concern with the potential negative emotional consequences of social media (Kramer et al., 2014) and the concerns expressed in public comments. The scientific value of the research and its contributions to improving user experience are not so much debated as dismissed outright; instead, comments typically

frame the value of the study as serving the interests of researchers disconnected from “real world” concerns or as a proof-of-concept for the emotional exploitation of Facebook users. Where institutional decisions concerning research ethics are typically made by weighing harm and benefit, judgments from the public rarely expressed consideration for the benefit side of the equation. These comments about “living in a lab” support the idea that some members of the public cared about the lack of transparency and consent, as well as the power dynamics between researchers and those being researched. However, concerns about experimentation were not isolated to the FEC study and instead point to discomfort with the idea of any experimentation on Facebook. Such concerns were compounded by a lack of understanding for how the research could be in service to the interests of Facebook users. As one commenter explained, the experiment demonstrated that Facebook “will pervert its stated objective of facilitating communication.” Without trust in the value of the research for Facebook users, the negative and exploitative associations of scientific research proliferated.

Manipulation anxieties For other commenters, the FEC study was upsetting because of what the research suggested about Facebook’s powers of manipulation. While the term “manipulation” appears only once in the original publication and nowhere in the press release, it is repeated constantly in news headlines, articles, and public comments. The surprise and outrage over the “manipulation” of the News Feed suggest that many people did not realize that the News Feed selects and displays content in a particular order, or these people had assumed that content was selected according to a fair and objective standard. For example, one comment argued that Facebook “is supposed to be a neutral arbiter for its services.” The lack of familiarity with how the News Feed works aligns with prior research (Eslami et al., 2015) and helps explain why the manipulation aspect produced such intensely negative reactions: the experiment revealed not only a single time-and-population-limited instance of manipulation, but also that manipulation is endemic to the operation of the News Feed. In other words, commenters were upset both about the experimental manipulation and about the existence of any News Feed algorithm. Fairness is a commonly stated reason for anxiety around manipulation, tracking to findings of focus-group research on social media data mining concerns (Kennedy et al., 2015). While some commenters considered any form of manipulation to be a self-evident violation of ethics, others were worried about the specific context of manipulation on Facebook. These folks were worried that changes to the News Feed could cause them to miss out on important posts, such as an announcement of good news or a call for help: If you were one of the friends of the almost 700,000 users, but a piece of [your] news .  .  . didn’t get posted . . . and this messed with your relationship to the other user? More people than just the research subject were manipulated.

From this perspective, manipulating the order of the News Feed simultaneously manipulates relationships between people that extend beyond those directly involved in

the experiment. The concern over missing important content aligns with lab-based research on user reactions to revelations about the existence and operation of the News Feed algorithm (Eslami et al., 2015). However, many commenters took the concern with manipulation to more extreme ends. Our data include considerable speculation about the future implications of this research. The extrapolations were guided by examples from dystopic fiction, such as 1984 and Brave New World, and also by fears concerning politics, control, and conspiracies: Lets see if The Algorithm can retrospectively identify the users who got the downer feeds, and when. Also those who got the happy feeds. Then there is even more useful data to be had, by medical professionals: compare the data injections against the use of health services, hospitalizations, etc. for the downers cohort and against manic spending sprees for the uppers recipients. After that’s completed, the guinea pigs can be informed of what was done to them, unless, of course, yet another health-related use can be found for the data.

Some commenters justified their far-reaching, grim extrapolations by pointing to the general lack of transparency surrounding Facebook’s practices. The public’s surprise acts as evidence of a lack of transparency, even as Facebook does disclose some information about their use of data in official policies, public-facing research, and statements to the press. The adequacy of such disclosures is outside the focus of this article, though just because information is technically available does not mean it is effectively so, as evidenced by the extensive research showing that people do not read platform policies (Martin, 2016a). As these patterns of response make clear, public perceptions of transparency do not necessarily align with company practices (Fiesler and Hallinan, 2018; Martin, 2015). Other commenters justified their dark speculations by pointing to the subject of manipulation: emotions. For these commenters, emotional manipulation is a touchstone that enables the manipulation of what someone thinks, believes, and does—even who they are. The personal significance of emotions ups the stakes significantly, such that the experiment is understood as manipulating not only user experience, but also the very identity of the user. This kind of power is seen as having drastic political consequences that can sway elections or create, in the words of one commenter, a “herd of docile consumers”: Don’t be fooled, manipulating a mood is the ability to manipulate a mind. Political outcomes, commerce, and civil unrest are just a short list of things that can be controlled.

There are also concerns about the relationship between emotional manipulation and mental health. Participants in the experiment received different treatments: the News Feeds of one group prioritized negative content, which some commenters interpreted as Facebook intentionally making people sad. This group received significantly more attention from comments than the group exposed to a higher concentration of positive content or the group exposed to less emotional content overall. An ethical response survey conducted shortly after the controversy broke (Schechter and Bravo-Lillo, 2014) also found greater support for a version of the study that only added more positive content to News Feeds. The addition of negative content is seen as a particularly harmful form of

manipulation, a view compounded by concerns that the sample population could have included vulnerable populations: Faecesbook [sic] is evil. What if one (or more) of their users (or victims) had been depressed and on the edge of suicide? Murdered for Zuckerbergs greater profits?

The extreme stakes of manipulation—from total political control to mass suicide—may seem out of place given the relatively minor treatment (tweaking the order in which content appears in the News Feed according to sentiment analysis of word choice) and the small effect size of the study’s findings (Kramer et al., 2014). Indeed, the findings of the study could only be significant at the scale of a massive platform like Facebook with billions of users. However, the concerns expressed in public reaction posit a much more dramatic scale of effects and make it apparent that many people do not have an accurate frame of reference to interpret these kinds of harms—or benefits.

Wake up, sheeple Not all commenters expressed surprise about the FEC study. The theme “Wake up, sheeple” brings together comments that interpret the FEC study as a confirmation of pre-existing negative views of Facebook. These comments take a position of being profoundly unsurprised, seeing the experiment as a confirmation of the way they already understand and relate to Facebook. Similar to the “Manipulation anxieties” theme, these comments paint a negative, even dystopic picture of Facebook—but these comments also lack any sense of surprise. Experimentation and manipulation appear to be ordinary and expected behavior when considered alongside accounts of Facebook’s past bad behavior, negative perceptions of social media and Silicon Valley generally, or sharp critiques of the larger economic order. The comments tend to argue that other people need to “wise up” and either accept that this is the way the world works or opt out of using social media entirely, an attitude that has surfaced in prior work examining public reactions to privacy controversies (Fiesler and Hallinan, 2018): The minute anyone signs up for membership to ANY group, you know that you are going to be manipulated. Ever hear the word SHEEPLE?

This antagonistic stance allows commenters to affirm their own positions, knowledge, and decisions. It also discredits the reactions of others, treating all aspects of the controversy as things that Facebook users should already expect. In doing this, the commenters shift accountability away from the company or the researchers and toward individual Facebook users: Anyone who doesn’t realise that anything you put “out there” on Facebook (or any other social media site) is like shouting it through a bullhorn should have their internet competency licence revoked. We can’t blame all stupidity on some or other conspiracy . . .

It is notable that many of the people whose comments fell into this theme also identified as nonusers of Facebook. Some commenters framed their nonuse status as a value

judgment against those who use social media. Other commenters argued that people should follow their example and decide to leave social media. These comments reframed the Facebook user base, arguing that users are actually the product that is sold to advertisers, the “real users” of Facebook. In these explanations, commenters frequently shame others for not having the same expectations they do: Facebook is akin to an open corral baited with fake food; the herd gathers instinctively, but receives no nourishment . . . Get wise, people.

What exactly should people wise up about? Our data point to the behavior and practices of Facebook, of Silicon Valley, and of any service that is “free.” Rather than focusing on the study itself, the thrust of the indictment is that other people failed to recognize an obvious situation. However, even with this framing, there are substantial differences in opinion about what is considered obvious and how people should respond. Some call for the wholesale rejection of social media and testify to their own ability to get by without it. Others call for the adoption of a nihilistic attitude: this is the way the world works and all you can do is resign yourself to the facts. Despite disagreement over the solution, these commenters agree that the attitudes and actions of anyone who is outraged are the crux of the problem, not the experiment itself or the practices of the platform.

No big deal Finally, even among the outrage, some commenters indicated that they had no issues with the FEC study—not necessarily because they judged it ethical, but rather because it was not just unsurprising but also unremarkable. It aligned with their expectations, whether for Facebook, advertising-supported media, or corporations generally: The only thing that surprises me about this study is that anyone is surprised. Purveyors of information routinely attempt to manipulate their audiences and always have . . .

Similar to the “Wake up, sheeple” theme, these comments take the experiment as confirmation of their expectations and understanding of the platform. However, in contrast, these comments assess the situation as unproblematic and, if any action is required, it is the need for education about what Facebook is and how it works. The views of some comments in this theme most strongly align with those of the researchers and with Facebook itself. Many commenters shared the view that there had been miscommunication or misunderstanding; as a result, comments explain different aspects of the situation, including the prevalence of A/B testing, industry research, and the general operation of the News Feed. Unlike those who were alarmed because of their ignorance of News Feed algorithms, these commenters formed expectations based, in part, on their understanding of those algorithms: A/B testing (i.e. basically what happened here) when software companies change content or algorithms for a subset of users happens *all the time*. It’s standard industry practice.

Other commenters argue that emotional manipulation is not a concern because individuals have the ability to resist manipulation, whether through a skeptical disposition, education, or willpower. As one commenter put it, users are “masters of their own minds” and cannot be so easily swayed by a website. For others, Facebook’s actions are typical of any corporation; a company is entitled to pursue its own policies and interests and if people do not like the practices of a company, they can simply choose not to use its services. This echoes the control model of privacy and supports a market approach to regulation (Martin, 2016b): They can do whatever they want with their platform. Period. Build your own if you want to set the rules.

Other commenters point out that this is nothing new, referencing other forms of manipulation or persuasion, from advertising and marketing, to political speech, to everyday interactions. Where commenters expressing “manipulation anxieties” also considered the broader contexts for manipulation, the difference here is the normalization of manipulation as mundane rather than a dystopic version of a possible future: So what’s new? The raison d’être for all media, even before the printing press, was to influence our emotions, secretly or otherwise.

Both this theme and “Wake up, sheeple” argue that the controversy surrounding the FEC study stems from a lack of public understanding of how social media works and propose communication solutions—albeit with radically different understandings and solutions. From avoiding all social media to knowing oneself, from embracing nihilism to education about technology, the recommendations are divergent and contradictory. The problem of communication, then, is about more than strategies and tactics and is instead based on a more fundamental disagreement about what the platform is and what people should expect from it.

Discussion We began this research, in part, with the hope that analyzing public responses would tell us what people found to be objectionable about the FEC study, and thus what the public perceived as “unethical” in platform-based research. Our findings provide some answers to this question, including issues of transparency, manipulation, and the potential for future harm. However, just as the people involved in the research and publication of the FEC study were surprised by the public reaction to the study, our analysis reveals that members of the public were also surprised by the values and expectations of their peers. While the use of Facebook and other online platforms is widespread and frequent, a common part of people’s daily experience, the differences expressed in the comments of news articles about the FEC study highlight the lack of consensus around what these platforms are, how they should operate, and the role of platform-based research. In other words, the norms surrounding online platforms are neither unified nor settled. As a result, there is no single answer to what bothered people about this research, which means there is no single answer to what needs to be “fixed.”

While our findings do not support a one-size-fits-all solution to ensuring ethical research and avoiding controversy, our findings do support the importance of thinking about platform-based research holistically—that is, considering the relationship between academic research, collaborative research between academic and corporate researchers, and industry research both basic and applied. Although the FEC study was the product of a collaboration between academic and Facebook researchers, commenters rarely engaged with the specificity of the setup regardless of their position on the research. For example, comments in the “Living in a lab” theme tended to group all research together into a nefarious and dehumanizing category exemplified by animal testing and Nazi medical experimentation, while comments in the “No big deal” theme tended to argue for the normalcy and importance of research for improving commercial products. Certainly, neither account accurately describes the context or conduct of the FEC study. At the same time, the conflation of very different kinds of research cautions researchers against assuming that the public understands what platform-based research involves or why it might matter—questions that should, in turn, guide the design and communication of research. Just as there is no shared understanding of platform-based research, so too is there no shared understanding of the platforms themselves. Ethical considerations for platform-based research often begin from the terms set by the platforms themselves—for example, the desirability of algorithmically sorting content in the News Feed. Despite formal agreement to a platform’s terms of service, these terms are not necessarily known or accepted by all users, to say nothing of the broader public that includes users and nonusers alike. The assumption of a shared understanding of Facebook’s News Feed algorithm and practices of research and experimentation made the negative reactions to the study genuinely unexpected to those involved. Certainly, the FEC study is not an isolated instance of researchers “badly reading the room” when it comes to expectations about social media. The public’s relationship—or rather, relationships—to platforms shape their assessment of research conducted on and about platforms. Facebook has repeatedly struggled in comparison to other platforms and tech companies in terms of public trust (Newton, 2017). The pre-existing lack of trust in the platform helps explain some of the more extreme accounts of harm in the reaction to the FEC study, which in turn further exacerbated issues of trust, as The Guardian poll conducted in the wake of the controversy found (Fishwick, 2014). Complementing other calls for contextually sensitive ethical decision-making (Fiesler and Proferes, 2018; Jackman and Kanerva, 2016), we suggest a relational approach to the ethics of platform research that highlights what our data suggests is a particularly important context that researchers should be considering: the public’s relationship to online platforms. This approach takes inspiration from work on relational ethics (Ellis, 2007), developed to guide interpersonal interactions for qualitative research. However, interactions on social media are not only interpersonal, but also involve human-machine communication, or interactions with technologies that reproduce aspects of human intelligence (Guzman and Lewis, 2019).
The News Feed and other forms of recommendation offer prominent examples of this technology on Facebook, selecting and organizing content in order to show people the “stories that matter most” (News Feed, n.d.). As a result, the platform functions as a kind of third party to social media research, and a particularly important party because the relationship between the platform and the research subjects

precedes and endures beyond the boundaries of any given study. Just as an ethnographer works to maintain good relationships with members of a community so that future researchers can obtain access to that community, so too should social media researchers consider ways of maintaining or improving relationships with their research populations. Such considerations may seem out of place for research practices that do not involve direct interpersonal interactions between researchers and research subjects—with the FEC study, for example, the research subjects had no way of knowing that they were part of an experiment. However, our findings illustrate that even experimental setups without interpersonal interactions can be perceived in very personal ways. These negative reactions can have a corrosive effect on trust for both platforms and research. How can we work to preserve a positive relationship instead? For researchers, the first step in a relational approach to ethics involves understanding the public’s expectations for platforms. Although EVT was initially developed in the context of interpersonal communication, it also offers a theoretical framework for ethical considerations of online platform-based research. Relying on formal practices such as institutional review or compliance with terms of service is unlikely to address user norms and expectations because social media users are often unaware of research taking place on online platforms (Fiesler and Proferes, 2018), rarely read terms of service (Galbraith, 2017), and interpret the meaning of formal policy documents according to pre-existing expectations (Martin, 2015). Given the limitations of these formal practices, researchers can develop better understandings of user expectations from empirical ethics research (Fiesler and Proferes, 2018; Kennedy et al., 2015; Schechter and Bravo-Lillo, 2014) and from the emerging literature on folk theories of platforms (Devito et al., 2018; Eslami et al., 2016). The analysis of news comments presented here contributes to this project and demonstrates the complementary value of this methodology as a way to bring in different voices and study the relationship between expectations and arguments. The importance of diverse relationships to platforms suggests another strategy for ethical decision-making: researchers should broaden our understanding of ethical stakeholders to include nonusers. As our data illustrate, even people who do not use Facebook have expectations for the platform and are invested enough in these expectations to react publicly when they are violated. Nonusers are also stakeholders, both because they consider themselves to be and because as social media platforms grow in terms of features and users, the influence of platforms includes broad societal effects (Baumer et al., 2015; Satchell and Dourish, 2009). Controversy analysis provides a way to surface beliefs and values that might otherwise be overlooked or taken for granted, even as these beliefs and values are central to the ways that people evaluate the actions of platforms—including research that takes place on platforms. Furthermore, the willingness of nonusers to make their interests and concerns public means that these perspectives fold back upon the platform’s user-base, shaping their expectations and concerns in turn. 
As a result, incorporating the expectations of nonusers into ethical decision-making can help anticipate controversies, push researchers to consider the public benefit of their research, and cultivate more beneficial ways of relating to platforms. While we argue for the importance of considering a broader range of ethical stakeholders, we recognize that this is a challenging task. Just as previous research has argued for the importance of understanding user expectations in ethical decision-making (Fiesler

and Proferes, 2018; Martin, 2015, 2016b; Schechter and Bravo-Lillo, 2014), our findings suggest that it is not feasible (or desirable) to identify a set of basic expectations common to all users. The views expressed in the “Wake up, sheeple” theme overwhelmingly begin from the premise that social media research is inherently unethical and that it either should be avoided entirely or that its use requires resignation to an unethical system. It is difficult to imagine a meaningful baseline set of expectations that include this perspective alongside the expectations of those who endorsed the experiment and see a clear social value in Facebook. However, a better understanding of the different relationships people have to platforms offers an opportunity to develop approaches that account for the needs and expectations of different relationships. Instead of simply telling people what their expectations should be, or inferring expectations from official policies (Gelinas et al., 2017), there is value in empirically studying expectations. In addition to formalized responses such as official policies that govern platform conduct, we should consider initiatives designed to cultivate informal norms and expectations. Compared to other forms of regulation such as legislation or formal policies, norms offer greater flexibility to adapt to particular contexts and technological developments. Expectation violation can have substantial ramifications on the public perception of research and—potentially—support for future research. Controversies can also drive change, such as the development and implementation of industry review of research at Facebook (Jackman and Kanerva, 2016). The case of the FEC study offers some insight into what was poorly received and why. We can clearly say, for instance, that an approach to platform-based research based on implicit consent for research via terms of service is unpopular among the commenting public. A study that places specific opt-in requirements on its participants, even if the study design is kept hidden, may be received more positively and resolve some of the more prominent concerns, including the cloud of secrecy around research, not knowing if and when one has been the subject of an experiment, and the inclusion of vulnerable populations. Even an opt-out option could address some of these concerns, as it would allow people with specific objections to research to be removed from it without requiring them to stop using the platform entirely. Fundamentally, a relational approach to ethical decision-making for platform-based research begins with an understanding of public expectations for platforms and uses that understanding to inform the design and communication of research. Authors’ note Blake Hallinan is now affiliated to Hebrew University.

Acknowledgements The authors would like to thank Michael Zimmer, Katie Shilton, Caitlin Reynolds, the anonymous reviewers, and members of the Identity Lab and the Internet Rules lab for all of their help cultivating this paper.

Funding The author(s) disclosed receipt of the following financial support for the research, authorship, and/ or publication of this article: This work was funded by NSF award IIS-1704369 as part of the PERVADE (Pervasive Data Ethics for Computational Research) project.

ORCID iD Blake Hallinan https://orcid.org/0000-0002-4696-8290

Note 1. Quotes are presented unaltered, as they originally appeared in the data.

References Altmetric (n.d.) Overview of attention for article published in Proceedings of the National Academy of Sciences of the United States. Altmetric. Available at: https://www.altmetric. com/details/2397894 (accessed 15 April 2018). Baumer EPS, Ames MG, Burrell J, et al. (2015) Why study technology non-use? First Monday 20(11). Available at: http://dx.doi.org/10.5210/fm.v20i11.6310 Bevan JL, Ang PC and Fearns JB (2014) Being unfriended on Facebook: an application of expectancy violation theory. Computers in Human Behavior 33: 171–178. Brown B, Weilenman A, McMillan D, et al. (2016) Five provocations for ethical HCI research. In: Proceedings of the SIGCHI conference on human factors in computing systems, Montréal, QC, Canada, 22–27 April. San Jose, CA: ACM Press. Bucher T (2017) The algorithmic imaginary: exploring the ordinary affects of Facebook algorithms. Information Communication and Society 20(1): 30–44. Burgoon JK and Le Poire BA (1993) Effects of communication expectancies, actual communication, and expectancy disconfirmation on evaluations of communicators and their communication behavior. Human Communication Research 20(1): 67–96. Carlson M (2018) Facebook in the news: social media, journalism, and public responsibility following the 2016 Trending Topics controversy. Digital Journalism 6(1): 4–20. Chandler JA, Sun JA and Racine E (2017) Online public reactions to fMRI communication with patients with disorders of consciousness: quality of life, end-of-life decision making, and concerns with misdiagnosis. AJOB Empirical Bioethics 8(1): 40–51. Clarke V and Braun V (2006) Using thematic analysis in psychology. Qualitative Research in Psychology 3(2): 77–101. De Kraker J, Kuys S, Cörvers RJM, et al. (2014) Internet public opinion on climate change: a world views analysis of online reader comments. International Journal of Climate Change Strategies and Management 6(1): 19–33. Devito MA, Birnholtz J, Hancock JT, et al. (2018) How people form folk theories of social media feeds and what it means for how we study self-presentation. In: CHI ’18: Proceedings of the 36th annual ACM conference on human factors in computing systems, Montreal, QC, Canada, 21–26 April. D’Onfro J (2014) Facebook researcher responds to backlash against “creepy” mood manipulation. Business Insider, 29 June. Available at: https://www.businessinsider.com.au/adam-kramerfacebook-mood-manipulation-2014-6 Ellis C (2007) Telling secrets, revealing lives: relational ethics in research with intimate others. Qualitative Inquiry 13(1): 3–29. Eslami M, Rickman A, Vaccaro K, et al. (2015) “I always assumed that I wasn’t really that close to [her]”: reasoning about invisible algorithms in news feeds. In: CHI ’15: Proceedings of the 33rd annual ACM conference on human factors in computing systems, Seoul, Republic of Korea, 18–23 April, pp. 153–162. New York: ACM Press. Eslami M, Karahalios K, Sandvig C, et al. (2016) First I “like” it, then I hide it: folk theories of social feeds. In: CHI ’16: Proceedings of the 2016 CHI conference on human factors in computing systems, San Jose, CA, 7–12 May, pp. 2371–2382. New York: ACM Press.

Fiesler C and Hallinan B (2018) We are the product”: public reactions to online data sharing and privacy controversies in the media. In: Proceedings of the 2018 CHI conference on human factors in computing systems (CHI ’18), Montreal, QC, Canada, 21–26 April. Fiesler C and Proferes N (2018) “Participant” perceptions of Twitter research ethics. Social Media + Society 4(1): 1–14. Fishwick C (2014) Facebook’s secret mood experiment: have you lost trust in the social network? The Guardian, 30 June. Available at: https://www.theguardian.com/technology/poll/2014/ jun/30/facebook-secret-mood-experiment-social-network Fowler JH and Christakis NA (2008) Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the Framingham heart study. BMJ 337: a2338. Frizell S (2014) Facebook totally screwed with a bunch of people in the name of science. TIME, 28 June. Available at: http://time.com/2936729/facebook-emotions-study/ Galbraith KL (2017) Terms and conditions may apply (but have little to do with ethics). American Journal of Bioethics 17(3): 21–22. Gelinas L, Pierce R, Winkler S, et al. (2017) Using social media as a research recruitment tool: ethical issues and recommendations. American Journal of Bioethics 17(3): 3–14. Giles EL, Holmes M, McColl E, et al. (2015) Acceptability of financial incentives for breastfeeding: thematic analysis of readers’ comments to UK online news reports. BMC Pregnancy and Childbirth 15(1): 116. Glenn NM, Champion CC and Spence JC (2012) Qualitative content analysis of online news media coverage of weight loss surgery and related reader comments. Clinical Obesity 2(5–6): 125–131. Goel V (2014) As data overflows online, researchers grapple with ethics. The New York Times, 12 August. Available at: https://www.nytimes.com/2014/08/13/technology/the-boon-ofonline-data-puts-social-science-in-a-quandary.html Goodnight GT (2005) Science and technology controversy: a rationale for inquiry. Argumentation and Advocacy 42(1): 26–29. Griffin E, Ledbetter A and Sparks G (2011) Expectancy violations theory. In: Griffin EA (ed.) A First Look at Communication Theory. 9th ed. New York: McGraw-Hill, pp. 84–92. Grimmelmann J (2014) The Facebook emotional manipulation study: sources. The Laboratorium, 30 June. Available at: http://laboratorium.net/archive/2014/06/30/the_facebook_emotional_ manipulation_study_source Guzman AL and Lewis SC (2019) Artificial intelligence and communication: a human–machine communication research agenda. New Media & Society 22(1): 70–86. Hancock JT (2019) The ethics of digital research. In: Welles BF and Gonzalez-Bailon S (eds) The Oxford Handbook of Networked Communication. Oxford: Oxford University Press. Available at: https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780190460518.001.0001/ oxfordhb-9780190460518-e-25. Hatfield E, Cacioppo JTJ and Rapson RL (1993) Emotional contagion. Current Directions in Psychological Science 2(3): 96–99. Henrich N and Holmes B (2013) Web news readers’ comments: towards developing a methodology for using on-line comments in social inquiry. Journal of Media and Communication Studies 5(1): 1–4. Hiltzik M (2014) Facebook on its mood manipulation study: another non-apology apology. Los Angeles Times, 2 July. Available at: http://www.latimes.com/business/hiltzik/la-fi-mh-facebook-apology-20140702-column.html Holton A, Lee N and Coleman R (2014) Commenting on health: a framing analysis of user comments in response to health articles online. Journal of Health Communication 19(7): 825–837.

Hudson JM and Bruckman A (2004) ‘Go Away’: Participant Objections and the ethics of Chatroom Research. The Information Society: An International Journal 20(2): 127–139. Jackman M and Kanerva L (2016) Evolving the IRB: building robust review for industry research. Washington and Lee Law Review 72(3): 442–457. Kennedy H, Elgesem D and Miguel C (2015) On fairness: user perspectives on social media data mining. Convergence: The International Journal of Research into New Media Technologies 23: 270–288. Kramer ADI, Guillory JE and Hancock JT (2014) Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences 111(24): 8788–8790. McLaughlin C and Vitak J (2012) Norm evolution and violation on Facebook. New Media & Society 14(2): 299–315. Marres N (2015) Why map issues? On controversy analysis as a digital method. Science, Technology, & Human Values 40(5): 655–686. Marres N and Moats D (2015) Mapping controversies with social media: the case for symmetry. Social Media and Society 1(2): 1–17. Martin K (2015) Privacy notices as Tabula Rasa. Journal of Public Policy & Marketing 34(2): 210–227. Martin K (2016a) Formal versus informal privacy contracts: comparing the impact of privacy notices and norms on consumer trust online. The Journal of Legal Studies 45(Suppl. 2): S191–S215. Martin K (2016b) Understanding privacy online: development of a social contract approach to privacy. Journal of Business Ethics 137(3): 551–569. Nebeker C, Harlow J, Espinoza Giacinto R, et al. (2017) Ethical and regulatory challenges of research using pervasive sensing and other emerging technologies: IRB perspectives. AJOB Empirical Bioethics 8(4): 266–276. News Feed (n.d.) Facebook for media. Available at: https://www.facebook.com/facebookmedia/ solutions/news-feed (accessed 9 December 2018). Newton C (2017) America doesn’t trust Facebook. The Verge, 27 October. Available at: https:// web.archive.org/web/20190402070156/https://www.theverge.com/2017/10/27/16552620/ facebook-trust-survey-usage-popularity-fake-news Nissenbaum H (2004) Privacy as contextual integrity. Washington Law Review 79(119): 1119– 1158. Pennebaker JW, Chung CK, Ireland M, et al. (2007) The Development and Psychometric Properties of LIWC2007 (LIWC2007 Manual). Austin, TX: LIWC. Rader E and Gray R (2015) Understanding user beliefs about algorithmic curation in the Facebook news feed. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems (CHI’15), pp. 173–182. Available at: https://dl.acm.org/citation. cfm?id=2702174 Recuber T (2016) From obedience to contagion: discourses of power in Milgram, Zimbardo, and the Facebook experiment. Research Ethics 12(1): 44–54. Rosenquist JN, Fowler JH and Christakis NA (2011) Social network determinants of depression. Molecular Psychiatry 16(3): 273–281. Satchell C and Dourish P (2009) Beyond the user: use and non-use in HCI. In: Proceedings of the 21st annual conference of the Australian computer-human interaction special interest (OZCHI ’09), Melbourne, VIC, Australia, 23–27 November, pp. 9–16. New York: ACM Press. Schechter S and Bravo-Lillo C (2014) Using ethical-response surveys to identify sources of disapproval and concern with Facebook’s emotional contagion experiment and other controversial studies. Available at: https://www.microsoft.com/en-us/research/wp-content/ uploads/2016/02/Ethical-Response20Survey202014-10-30.pdf

Scissors L, Burke M and Wengrovitz S (2016) What’s in a like? Attitudes and behaviors around receiving likes on Facebook. In: Proceedings of the 19th ACM conference on computer-supported cooperative work & social computing (CSCW’16), San Francisco, CA, 27 February–2 March, pp. 1499–1508. New York: ACM Press. Shklovski I, Mainwaring SD, Skúladóttir HH, et al. (2014) Leakiness and creepiness in app space. In: Proceedings of the 32nd annual ACM conference on human factors in computing systems (CHI’14), pp. 2347–2356. Available at: https://dl.acm.org/citation.cfm?id=2557421 Silberg AW (2014) So you are shocked Facebook did #psyops on people? Huffpost, 29 August. Available at: https://www.huffpost.com/entry/so-you-are-shocked-facebo_b_5542094. Silva MT (2015) What do users have to say about online news comments? Readers’ accounts and expectations of public debate and online moderation: a case study. Participations: Journal of Audience & Reception Studies 12(2): 32–44. Stark L (2018) Algorithmic psychometrics and the scalable subject. Social Studies of Science 48(2): 204–231. Steadman I (2014) How Facebook’s news feed controls what you see and how you feel. New Statesman, 30 June. Available at: https://www.newstatesman.com/future-proof/2014/06/ how-facebooks-news-feed-controls-what-you-see-and-how-you-feel Turkle S (2011) Alone Together: Why We Expect More from Technology and Less from Each Other. New York: Basic Books. Verma IM (2014) Editorial expression of concern: experimental evidence of massive scale emotional contagion through social networks. Proceedings of the National Academy of Sciences 111(29): 10779–10779. Vines J, Thieme A, Comber R, et al. (2013) HCI in the press: online public reactions to mass media portrayals of HCI research. In: Proceedings of the SIGCHI conference on human factors in computing systems (CHI ’13), pp. 1873–1882. Available at: https://dl.acm.org/citation. cfm?id=2466247 Vitak J, Shilton K and Ashktorab Z (2016) Beyond the Belmont principles: ethical challenges, practices, and beliefs in the online data research community. In: Proceedings of the 19th ACM conference on computer-supported cooperative work & social computing (CSCW’16), pp. 939–951. Available at: https://dl.acm.org/citation.cfm?id=2820078 Williams M, Burnap P, Sloan L. Jessop C, et al. (2017) Users’ views of ethics in social media research: Informed consent, anonymity, and harm. In: Woodfield K (ed) The Ethics of Online Research (Advances in Research Ethics and Integrity, Vol. 2). Emerald Publishing Limited, pp. 27–52.

Author biographies Blake Hallinan is a postdoctoral fellow in the Department of Communication and Journalism at Hebrew University of Jerusalem. From the Lazarsfeld-Stanton Program Analyzer to the Like button, Blake researches the cultural politics of the transformation of feelings and emotions into information. Jed R Brubaker is an assistant professor in the Department of Information Science at the University of Colorado Boulder. His research focuses on how identity is designed, represented, and experienced through technology. Casey Fiesler is an assistant professor in the Department of Information Science at the University of Colorado Boulder. She studies social computing and governance, including the ethical and legal implications of researching and designing technology.

Survey Research in HCI Hendrik Müller, Aaron Sedley, and Elizabeth Ferrall-Nunge

Short Description of the Method A survey is a method of gathering information by asking questions to a subset of people, the results of which can be generalized to the wider target population. There are many different types of surveys, many ways to sample a population, and many ways to collect data from that population. Traditionally, surveys have been administered via mail, telephone, or in person. The Internet has become a popular mode for surveys due to the low cost of gathering data, ease and speed of survey administration, and its broadening reach across a variety of populations worldwide. Surveys in human–computer interaction (HCI) research can be useful to:
• Gather information about people’s habits, interaction with technology, or behavior
• Get demographic or psychographic information to characterize a population
• Get feedback on people’s experiences with a product, service, or application
• Collect people’s attitudes and perceptions toward an application in the context of usage
• Understand people’s intents and motivations for using an application
• Quantitatively measure task success with specific parts of an application
• Capture people’s awareness of certain systems, services, theories, or features
• Compare people’s attitudes, experiences, etc. over time and across dimensions

H. Müller (*) Google Australia Pty Ltd., Level 5, 48 Pirrama Road, Pyrmont, NSW 2009, Australia. e-mail: [email protected]
A. Sedley Google, Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA. e-mail: [email protected]
E. Ferrall-Nunge Twitter, Inc., 1355 Market Street, Suite 900, San Francisco, CA 94103, USA. e-mail: [email protected]
J.S. Olson and W.A. Kellogg (eds.), Ways of Knowing in HCI, DOI 10.1007/978-1-4939-0378-8_10, © Springer Science+Business Media New York 2014

While powerful for specific needs, surveys do not allow for observation of the respondents’ context or follow-up questions. When conducting research into precise behaviors, underlying motivations, and the usability of systems, then other research methods may be more appropriate or needed as a complement. This chapter reviews the history of surveys and appropriate uses of surveys and focuses on the best practices in survey design and execution.
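To make the sampling step described above concrete, the sketch below draws a simple random sample from a sampling frame so that results can be generalized to the target population; the frame, identifiers, and sample size are assumptions for illustration only, not part of this chapter.

# Minimal sketch (illustrative assumptions): simple random sampling
# from a hypothetical frame of user IDs.
import random

sampling_frame = [f"user_{i:05d}" for i in range(10_000)]  # hypothetical frame
sample_size = 400

random.seed(42)  # fixed seed so the draw is reproducible
sample = random.sample(sampling_frame, sample_size)

print("frame size:", len(sampling_frame))
print("sample size:", len(sample))
print("first five sampled IDs:", sample[:5])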

History, Intellectual Tradition, Evolution Since ancient times, societies have measured their populations via censuses for food planning, land distribution, taxation, and military conscription. Beginning in the nineteenth century, political polling was introduced in the USA to project election results and to measure citizens’ sentiment on a range of public policy issues. At the emergence of contemporary psychology, Francis Galton pioneered the use of questionnaires to investigate the nature vs. nurture debate and differences between humans, the latter of which evolved into the field of psychometrics (Clauser, 2007). More recently, surveys have been used in HCI research to help answer a variety of questions related to people’s attitudes, behaviors, and experiences with technology. Though nineteenth-century political polls amplified public interest in surveys, it was not until the twentieth century that meaningful progress was made on survey-sampling methods and data representativeness. Following two incorrect predictions of the US presidential victors by major polls (Literary Digest for Landon in 1936 and Gallup for Dewey in 1948), sampling methods were assailed for misrepresenting the US electorate. Scrutiny of these polling failures; persuasive academic work by statisticians such as Kiaer, Bowley, and Neyman; and extensive experimentation by the US Census Bureau led to the acceptance of random sampling as the gold standard for surveys (Converse, 1987). Roughly in parallel, social psychologists aimed to minimize questionnaire biases and optimize data collection. For example, in the 1920s and 1930s, Louis Thurstone and Rensis Likert demonstrated reliable methods for measuring attitudes (Edwards & Kenney, 1946); Likert’s scaling approach is still widely used by survey practitioners. Stanley Payne’s 1951 classic “The Art of Asking Questions” was an early study of question wording. Subsequent academics scrutinized every aspect of survey design. Tourangeau (1984) articulated the four cognitive steps to survey responses, noting that people have to comprehend what is asked, retrieve the appropriate information, judge that information according to the question, and map the judgement onto the provided responses. Krosnick & Fabrigar (1997) studied many components of questionnaire design, such as scale length, text labels, and “no opinion” responses. Groves (1989) identified four types of survey-related error: coverage, sampling, measurement, and non-response. As online surveys grew in popularity, Couper (2008) and others studied bias from the visual design of Internet questionnaires. The use of surveys for HCI research certainly predates the Internet, with efforts to understand users’ experiences with computer hardware and software. In 1983, researchers at Carnegie Mellon University conducted an experiment comparing


Fig. 1 Summary of the key stages in survey history (timeline: ancient censuses for food planning and land distribution; 1824 straw polls for the US Presidential election; 1876 Galton conducts the first psychology survey; 1920s–30s social psychologists propose scaling methods to reliably measure attitudes (Likert); 1936 and 1948 US Presidential polls incorrectly predict victories for Landon and Dewey; statisticians present arguments in favor of random sampling; 1951 Payne publishes “The Art of Asking Questions”; 1980s standardized usability questionnaires such as SUS, SUMI, and QUIS are developed; 1983 Carnegie Mellon University conducts the first survey with a computer interface; 1984 Tourangeau identifies four cognitive steps for question response; 1989 Groves publishes a taxonomy of survey errors; 1990s Krosnick publishes on questionnaire design validity and reliability; 1994 Georgia Tech starts annual online surveys to study Internet usage; 2008 Couper publishes a compilation on web survey visual design)

In 1983, researchers at Carnegie Mellon University conducted an experiment comparing computer-collected survey responses with those from a printed questionnaire, finding that the digital survey yielded less socially desirable responses and longer open-ended answers than the printed questionnaire (Kiesler & Sproull, 1986). With the popularization of graphical user interfaces in the 1980s, surveys joined other methods for usability research. Several standardized questionnaires were developed to assess usability (e.g., SUS, QUIS, SUMI, summarized later in this chapter). Surveys are a direct means of measuring satisfaction; along with efficiency and effectiveness, satisfaction is a pillar of the ISO 9241, part 11, definition of usability (Abran et al., 2003). User happiness is also fundamental to Google’s HEART framework for user-centric measurement of Web applications (Rodden, Hutchinson, & Fu, 2010). In 1994, the Georgia Institute of Technology started annual online surveys to understand Internet usage and users and to explore Web-based survey research (Pitkow & Recker, 1994). As the Internet era progressed, online applications widely adopted surveys to measure users’ satisfaction, unaddressed needs, and problems experienced, in addition to profiling their users. A summary of key stages in survey history is shown in Fig. 1.

What Questions the Method Can Answer

When used appropriately, surveys can help inform application and user research strategies and provide insights into users’ attitudes, experiences, intents, demographics, and psychographic characteristics. However, surveys are not the most appropriate method for many other HCI research goals; ethnographic interviews, log data analysis, card sorts, usability studies, and other methods may be more appropriate. In some cases, surveys can be combined with other research methods to holistically inform HCI development. This section explains when surveys are appropriate, when to avoid them, and how survey research can complement other research methods.


When Surveys Are Appropriate

Overall, surveys are appropriate when the goal is to represent an entire population, to measure differences between groups of people, or to identify changes over time in people’s attitudes and experiences. Below are examples of how survey data can be used in HCI research.

Attitudes. Surveys can accurately measure and reliably represent attitudes and perceptions of a population. While qualitative studies can also gather attitudinal data, surveys provide statistically reliable metrics, allowing researchers to benchmark attitudes toward an application or an experience, to track changes in attitudes over time, and to tie self-reported attitudes to actual behavior (e.g., via log data). For example, surveys can be used to measure customer satisfaction with online banking immediately following the banking experience.

Intent. Surveys can collect people’s reasons for using an application at a specific time, allowing researchers to gauge the frequency of different objectives. Unlike other methods, surveys can be deployed while a person is actually using an application (i.e., an online intercept survey), minimizing the risk of imperfect recall on the respondent’s part. Note that the specific details and context of one’s intent may not be fully captured in a survey alone. For example, “Why did you visit this website?” could be answered in a survey, but qualitative research may be more appropriate for determining how well someone understood specific application elements and what users’ underlying motivations are in the context of their daily lives.

Task success. Similar to measuring intent, while HCI researchers can qualitatively observe task success through a lab or a field study, a survey can be used to reliably quantify levels of success. For example, respondents can be instructed to perform a certain task, enter the results of the task, and report on their experiences while performing it.

User experience feedback. Collecting open-ended feedback about a user’s experience can be used to understand the user’s interaction with technology or to inform system requirements and improvements. For example, by understanding the relative frequency of key product frustrations and benefits, project stakeholders can make informed decisions and trade-offs when allocating resources.

User characteristics. Surveys can be used to understand a system’s users and to better serve their needs. Researchers can collect users’ demographic information, technographic details such as system savviness or overall tech savviness, and psychographic variables such as openness to change and privacy orientation. Such data enable researchers to discover natural segments of users who may have different needs, motivations, attitudes, perceptions, and overall user experiences.

Interactions with technology. Surveys can be used to understand more broadly how people interact with technology and how technology influences social interactions with others by asking people to self-report on social, psychological, and demographic


variables while capturing their behaviors. Through the use of surveys, HCI researchers can glean insights into the effects technology has on the general population.

Awareness. Surveys can also help in understanding people’s awareness of existing technologies or specific application features. Such data can, for example, help researchers determine whether low usage of an application is a result of poor awareness or of other factors, such as usability issues. By quantifying how aware or unaware people are, researchers can decide whether efforts (e.g., marketing campaigns) are needed to increase overall awareness and thus use.

Comparisons. Surveys can be used to compare users’ attitudes, perceptions, and experiences across user segments, time, geographies, and competing applications and between experimental and control versions. Such data enable researchers to explore whether user needs and experiences vary across geographies, to assess an application’s strengths and weaknesses relative to competing technologies, and to evaluate potential application improvements while aiding decisions between a variety of proposed designs.

When to Avoid Using a Survey

Because surveys are inexpensive and easy to deploy compared to other methods, many people choose survey research even when it is inappropriate for their needs. Such surveys can produce invalid or unreliable data, leading to an inaccurate understanding of a population and to poor user experiences. Below are some HCI research needs that are better addressed with other methods.

Precise behaviors. While respondents can be asked to self-report their behaviors, gathering this information from log data, if available, will always be more accurate. This is particularly true when trying to understand precise user behaviors and flows, as users will struggle to recall their exact sequence of clicks or the specific pages they visited. For behaviors not captured in log data, a diary study, observational study, or experience sampling may gather more accurate results than a survey.

Underlying motivations. People often do not understand or are unable to explain why they take certain actions or prefer one thing over another. Someone may be able to report their intent in a survey but may not be aware of their subconscious motivations for specific actions. Exploratory research methods such as ethnography or contextual inquiry may be more appropriate than directly asking about underlying motivations in a survey.

Usability evaluations. Surveys are inappropriate for testing specific usability tasks and the understanding of tools and application elements. As mentioned above, surveys can measure task success, but they cannot explain why people fail to use a particular application, why they do not understand some aspect of a product, or which missteps caused a task failure. Furthermore, a user may still be able to complete a given task even though he or she encountered several points of confusion along the way, none of which would be uncovered through a survey. Task-based observational research and interview methods, such as usability studies, are better suited for such research goals.


Fig. 2 Employing survey research either before or after research using other methods (survey research can establish whether findings from a small-sample qualitative method are just anecdotal or representative, while a small-sample qualitative method can explain the reasons for trends or distributions found through survey research)

Using Surveys with Other Methods

Survey research may be especially beneficial when used in conjunction with other research methods (see Fig. 2). Surveys can follow previous qualitative studies to help quantify specific observations; for many surveys, up-front qualitative research may even be required to inform their content if no previous research exists. On the other hand, surveys can also be used to initially identify high-level insights that are then followed by in-depth research through more qualitative (meaning smaller sample) methods. For example, if a usability study uncovers a specific problem, a survey can quantify the frequency of that problem across the population. Or a survey can be used first to identify the range of frustrations or goals, followed by qualitative interviews and observational research to gain deeper insights into self-reported behaviors and sources of frustration. Researchers may interview survey respondents to clarify responses (e.g., Yew, Shamma, & Churchill, 2011), interview another pool of participants from the same population for comparison (e.g., Froelich et al., 2012), or interview both survey respondents and new participants (e.g., Archambault & Grudin, 2012).

Surveys can also be used in conjunction with A/B experiments to aid comparative evaluations. For example, when researching two different versions of an application, the same survey can be used to assess both. By doing this, differences in variables such as satisfaction and self-reported task success can be measured and analyzed in parallel with behavioral differences observed in log data. Log data may show that one experimental version drives more traffic or engagement, while the survey may show that users were less satisfied or unable to complete a task. Moreover, log data can further validate insights from a previously conducted survey; for example, a social recommendation study by Chen, Geyer, Dugan, Muller, and Guy (2009) tested the quality of recommendations first in a survey and then through logging in a large field deployment. Psychophysiological data may be another objective accompaniment to survey data. For example, game researchers have combined surveys with data such as facial muscle and electrodermal activity (Nacke, Grimshaw, & Lindley, 2010) or attention and meditation as measured with EEG sensors (Schild, LaViola, & Masuch, 2012).


How to Do It: What Constitutes Good Work

This section breaks down survey research into the following six stages:

1. Research goals and constructs
2. Population and sampling
3. Questionnaire design and biases
4. Review and survey pretesting
5. Implementation and launch
6. Data analysis and reporting

Research Goals and Constructs

Before writing survey questions, researchers should first think about what they intend to measure, what kind of data needs to be collected, and how the data will be used to meet the research goals. Once survey-appropriate research goals have been identified, they should be matched to constructs, i.e., unidimensional attributes that cannot be directly observed, and each construct should then be converted into one or multiple survey questions. Constructs can be identified from prior primary research or literature reviews. Asking multiple questions about the same construct and analyzing the responses, e.g., through factor analysis, may help the researcher ensure the construct’s validity (a minimal scoring sketch follows the checklist below).

An example illustrates the process of converting constructs into questions. An overarching research goal may be to understand users’ happiness with an online application, such as Google Search, a widely used Web search engine. Since happiness with an application is often multidimensional, it is important to separate it into measurable pieces, i.e., its constructs. Prior research might indicate that constructs such as “overall satisfaction,” “perceived speed,” and “perceived utility” contribute to users’ happiness with that application. When all the constructs have been identified, survey questions can be designed to measure each one. To validate each construct, it is important to evaluate its unique relationship with the higher-level goal, using correlation, regression, factor analysis, or other methods. Furthermore, a technique called cognitive pretesting can be used to determine whether respondents interpret the constructs as intended by the researcher (see the pretesting section for more details).

Once research goals and constructs are defined, there are several other considerations that help determine whether a survey is the most appropriate method and how to proceed:

• Do the survey constructs focus on results that will directly address research goals and inform stakeholders’ decision making rather than providing merely informative data? An excess of “nice-to-know” questions increases survey length and the likelihood that respondents will not complete the questionnaire, diminishing the effectiveness of the survey results.


Fig. 3 The relationship between population, sampling frame, sample, and respondents (nested sets: the population contains the sampling frame, which contains the sample, which contains the respondents)

• Will the results be used for longitudinal comparisons or for one-time decisions? For longitudinal comparisons, researchers must plan on multiple survey deployments without exhausting available respondents.
• What is the number of responses needed to provide the appropriate level of precision for the insights needed? By calculating the number of responses needed (as described in detail in the following section), the researcher will ensure that key metrics and comparisons are statistically reliable. Once the target number is determined, researchers can then determine how many people to invite.
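To make the construct-validation step above concrete, the following is a minimal sketch (not from the chapter; the item set and data are invented) that computes Cronbach’s alpha, a common internal-consistency estimate, over three hypothetical 7-point items intended to measure the same “overall satisfaction” construct:

    import numpy as np

    def cronbach_alpha(items):
        """items: rows = respondents, columns = questions measuring one construct."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Hypothetical 7-point responses to three "overall satisfaction" items
    ratings = [[6, 7, 6], [5, 5, 6], [2, 3, 2], [7, 6, 7], [4, 4, 5]]
    print(round(cronbach_alpha(ratings), 2))   # -> 0.96; values around 0.7 or higher
                                               # suggest the items tap a single construct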

Population and Sampling

Key to effective survey research is determining whom and how many people to survey. To do this, researchers must first define the survey’s population, i.e., the set of individuals who meet certain criteria and to whom researchers wish to generalize their results. Reaching everyone in the population (i.e., a census) is typically impossible and unnecessary. Instead, researchers approximate the true population by creating a sampling frame, i.e., the set of people whom the researcher is able to contact for the survey. The perfect sampling frame is identical to the population, but often a survey’s sampling frame covers only a portion of the population. The people from the sampling frame who are invited to take the survey are the sample, but only those who answer are respondents. Figure 3 illustrates these different groups.

For example, a survey may be deployed to understand the satisfaction of a product’s or an application’s users. In this case, the population includes everyone who uses the application, and the sampling frame consists of the users who are actually reachable. The sampling frame may exclude those who have abandoned the application, anonymous users, and users who have not opted in to being contacted for research. Though the sampling frame may exclude many users, it could still include far more people than are needed to collect a statistically valid number of responses. However, if the sampling frame systematically excludes certain types of people (e.g., very dissatisfied or disengaged users), the survey will suffer from coverage error, and its responses will misrepresent the population.


Probability Versus Non-probability Sampling

Sampling a population can be accomplished through probability- and non-probability-based methods. Probability or random sampling is considered the gold standard because every person in the sampling frame has an equal, nonzero chance of being chosen for the sample; essentially, the sample is selected completely at random. This minimizes sampling bias, also known as selection bias, by randomly drawing the sample from individuals in the sampling frame and by inviting everyone in the sample in the same way. Examples of probability sampling methods include random digit telephone dialing, address-based mail surveys utilizing the US Postal Service Delivery Sequence File (DSF), and the use of a panel recruited through random sampling whose members have agreed in advance to receive surveys. For Internet surveys in particular, methods allowing for random sampling include intercept surveys for those who use a particular product (e.g., pop-up surveys or in-product links), list-based samples (e.g., for e-mail invitations), and pre-recruited probability-based panels (see Couper, 2000, for a thorough review). Another way to ensure probability sampling is to use a preexisting sampling frame, i.e., a list of candidates previously assembled using probability sampling methods. For example, Shklovski, Kraut, and Cummings’ (2008) study of the effect of residential moves on communication with friends was drawn from a publicly available, highly relevant sampling frame, the National Change of Address (NCOA) database. Another approach is to analyze selected subsets of data from an existing representative survey like the General Social Survey (e.g., Wright & Randall, 2012).

While probability sampling is ideal, it is often impossible to reach and randomly select from the entire target population, especially when targeting small populations (e.g., users of a specialized enterprise product or experts in a particular field) or investigating sensitive or rare behavior. In these situations, researchers may use non-probability sampling methods such as volunteer opt-in panels, unrestricted self-selected surveys (e.g., links on blogs and social networks), snowball recruiting (i.e., asking for friends of friends), and convenience samples (i.e., targeting people readily available, such as mall shoppers) (Couper, 2000). However, non-probability methods are prone to high sampling bias and hence reduce representativeness compared to random sampling. One way representativeness can be assessed is by comparing key characteristics of the target population with those of the actual sample (for more details, refer to the analysis section).

Many academic surveys use convenience samples drawn from an existing pool of the university’s psychology students. Although not representative of most Americans, this type of sample is appropriate for investigating technology behavior among young people, such as sexting (Drouin & Landgraff, 2012; Weisskirch & Delevi, 2011), instant messaging (Anandarajan, Zaman, Dai, & Arinze, 2010; Junco & Cotten, 2011; Zaman et al., 2010), and mobile phone use (Auter, 2007; Harrison, 2011; Turner, Love, & Howell, 2008). Convenience samples have also been used to identify special populations.
For example, because patient confidentiality makes it difficult to identify HIV and tuberculosis patients through official lists of names, one study about the viability of using cell phones and text messages in HIV and tuberculosis education handed out surveys to potential respondents in health clinic

waiting rooms (Person, Blain, Jiang, Rasmussen, & Stout, 2011). Similarly, a study of Down’s syndrome patients’ use of computers invited participation through special interest listservs (Feng, Lazar, Kumin, & Ozok, 2010).
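As a minimal illustration of the probability sampling described above, the sketch below draws a simple random sample from a sampling frame; the frame contents, the user IDs, and the target of 384 completes are hypothetical:

    import random

    # Hypothetical sampling frame: every user who is reachable for the survey
    sampling_frame = [f"user_{i}" for i in range(50_000)]

    rng = random.Random(2014)                    # fixed seed only to keep the example repeatable
    sample = rng.sample(sampling_frame, k=384)   # every frame member has an equal chance

    print(len(sample), sample[:3])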

Determining the Appropriate Sample Size

No matter which sampling method is used, it is important to carefully determine the target sample size for the survey, i.e., the number of survey responses needed. If the sample size is too small, findings from the survey cannot be accurately generalized to the population and may fail to detect generalizable differences between groups. If the sample is larger than necessary, too many individuals are burdened with taking the survey, analysis time for the researcher may increase, or the sampling frame is used up too quickly. Hence, calculating the optimal sample size is crucial for every survey.

First, the researcher needs to determine approximately how many people make up the population being studied. Second, as the survey does not measure the entire population, the required level of precision must be chosen, which consists of the margin of error and the confidence level. The margin of error expresses the amount of sampling error in the survey, i.e., the range of uncertainty around an estimate of a population measure, assuming normally distributed data. For example, if 60 % of the sample claims to use a tablet computer, a 5 % margin of error would mean that 55–65 % of the population actually use tablet computers. Commonly used margins of error are 5 and 3 %, but depending on the goals of the survey, anywhere between 1 and 10 % may be appropriate. Using a margin of error higher than 10 % is not recommended, unless a low level of precision can meet the survey’s goals. The confidence level indicates how likely it is that the reported metric falls within the margin of error if the study were repeated. A 95 % confidence level, for example, means that 95 % of the time, observations from repeated sampling will fall within the interval defined by the margin of error. Commonly used confidence levels are 99, 95, and 90 %; using less than 90 % is not recommended.

There are various formulas for calculating the target sample size. Figure 4, based on Krejcie and Morgan’s formula (1970), shows the appropriate sample size, given the population size as well as the chosen margin of error and confidence level. Note that the table is based on a population proportion of 50 % for the response of interest, the most cautious estimation (i.e., when the proportion is higher or lower than 50 %, the required sample size declines to achieve the same margin of error). For example, for a population larger than 100,000, a sample size of 384 is required to achieve a confidence level of 95 % and a margin of error of 5 %. Note that for population sizes over about 20,000, the required sample size does not significantly increase. Researchers may simply set the sample size to 500 to estimate a single population parameter, which yields a margin of error of about ±4.4 % at a 95 % confidence level for large populations.

After having determined the target sample size, the researcher needs to work backwards to estimate the number of people to actually invite to the survey, taking into account the estimated size of each subgroup and the expected response rate.

Size of          Confidence level 90%            Confidence level 95%            Confidence level 99%
population       (margin of error)               (margin of error)               (margin of error)
                 10%   5%    3%    1%            10%   5%     3%     1%          10%   5%     3%      1%
10                 9    10    10    10             9    10     10     10           9    10     10      10
100               41    73    88    99            49    80     92     99          63    87     95      99
1,000             63   213   429   871            88   278    516    906         142   399    648     943
10,000            67   263   699  4,035           95   370    964  4,899         163   622  1,556   6,239
100,000           68   270   746  6,335           96   383  1,056  8,762         166   659  1,810  14,227
1,000,000         68   270   751  6,718           96   384  1,066  9,512         166   663  1,840  16,317
100,000,000       68   271   752  6,763           96   384  1,067  9,594         166   663  1,843  16,560

Fig. 4 Sample size as a function of population size and accuracy (confidence level and margin of error)

If a subgroup’s incidence is very small, the total number of invitations must be increased to ensure the desired sample size for that subgroup. The response rate of a survey is the percentage of those who completed the survey out of all those invited (for more details, see the later sections on monitoring survey paradata and maximizing response rates). If a similar survey has been conducted before, its response rate is a good reference point; if there is no prior response rate information, the survey can first be sent to a small number of people to measure the response rate, which is then used to determine the total number of required invitations. For example, assuming a 30 % response rate, a 50 % incidence rate for the group of interest, and the need for 384 complete responses from that group, 2,560 people should be invited to the survey. At this point, the calculation may show that the required sample is actually larger than the sampling frame; in that case, the researcher may need to consider more qualitative methods as an alternative.
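These calculations can be sketched in a few lines. The snippet below follows the Krejcie and Morgan (1970)-style formula behind Fig. 4 (with the conservative 50 % population proportion) and then scales the target up by the expected response and incidence rates; it reproduces the 384 and 2,560 figures used above. The function names are illustrative only.

    import math

    Z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}   # z-scores for common confidence levels

    def sample_size(population, confidence=0.95, margin=0.05, p=0.5):
        """Completed responses needed to estimate a proportion at the given precision."""
        chi = Z[confidence] ** 2
        return round(chi * population * p * (1 - p) /
                     (margin ** 2 * (population - 1) + chi * p * (1 - p)))

    def invitations(completes_needed, response_rate, incidence_rate=1.0):
        """People to invite, given the expected response and subgroup incidence rates."""
        return math.ceil(completes_needed / (response_rate * incidence_rate))

    n = sample_size(population=1_000_000)                              # -> 384 (cf. Fig. 4)
    print(n, invitations(n, response_rate=0.30, incidence_rate=0.50))  # -> 384 2560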

Mode and Methods of Survey Invitation

To reach respondents, there are four basic survey modes: mail or written surveys, phone surveys, face-to-face or in-person surveys, and Internet surveys. Survey modes may also be used in combination. The survey mode needs to be chosen carefully, as each mode has its own advantages and disadvantages, such as differences in typical response rates, introduced biases (Groves, 1989), required resources and costs, the audience that can be reached, and respondents’ level of anonymity. Today, many HCI-related surveys are Internet based, as the benefits often outweigh the disadvantages. Internet surveys have the following major advantages:

• Easy access to large geographic regions (including international reach)
• Simplicity of creating a survey by leveraging easily accessible commercial tools


• Cost savings during survey invitation (e.g., no paper and postage, simple implementation, insignificant cost increase for large sample sizes) and analysis (e.g., returned data is already in electronic format)
• Short fielding periods, as the data is collected immediately
• Lower bias due to respondent anonymity, as surveys are self-administered with no interviewer present
• Ability to customize the questionnaire to specific respondent groups using skip logic (i.e., asking respondents a different set of questions based on the answer to a previous question)

Internet surveys also have several disadvantages. The most discussed downside is the introduction of coverage error, i.e., a potential mismatch between the target population and the sampling frame (Couper, 2000; Groves, 1989). For example, online surveys fail to reach people without Internet or e-mail access. Furthermore, those invited to Internet surveys may be less motivated to respond or to provide accurate data because such surveys are less personal and can be ignored more easily. This survey mode also relies on the respondents’ ability to use a computer and may only provide the researcher with minimal information about the survey respondents. (See chapter on “Crowdsourcing in HCI Research.”)
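To illustrate the skip logic mentioned in the advantages above, here is a minimal sketch that is not tied to any particular survey platform; the questions and routing are invented for illustration:

    # Hypothetical two-question survey with one branch.
    questions = {
        "owns_smartphone": {
            "text": "Do you own a smartphone?",
            "options": ["Yes", "No"],
            "next": {"Yes": "satisfaction", "No": None},   # "No" ends the survey
        },
        "satisfaction": {
            "text": "Overall, how satisfied or dissatisfied are you with your smartphone?",
            "options": ["Extremely dissatisfied", "Neither satisfied nor dissatisfied",
                        "Extremely satisfied"],
            "next": {},                                    # last question
        },
    }

    def next_question(current_id, answer):
        """Return the id of the next question to show, or None to end the survey."""
        return questions[current_id]["next"].get(answer)

    print(next_question("owns_smartphone", "Yes"))   # -> satisfaction
    print(next_question("owns_smartphone", "No"))    # -> None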

Questionnaire Design and Biases

Upon establishing the constructs to be measured and the appropriate sampling method, the first iteration of the survey questionnaire can be designed. It is important to carefully think through the design of each survey question (first acknowledged by Payne, 1951), as it is fairly easy to introduce biases that can have a substantial impact on the reliability and validity of the data collected. Poor questionnaire design may introduce measurement error, defined as the deviation of the respondents’ answers from their true values on the measure. According to Couper (2000), measurement error in self-administered surveys can arise from the respondent (e.g., lack of motivation, comprehension problems, deliberate distortion) or from the instrument (e.g., poor wording or design, technical flaws). In most surveys, there is only one opportunity to deploy, and unlike qualitative research, no clarification or probing is possible. For these reasons, it is crucial that the questions accurately measure the constructs of interest. This section covers the different types of survey questions, common questionnaire biases, question types to avoid, the reuse of established questionnaires, and visual survey design considerations.

Types of Survey Questions

There are two categories of survey questions: open- and closed-ended questions. Open-ended questions (Fig. 5) ask survey respondents to write in their own answers, whereas closed-ended questions (Fig. 6) provide a set of predefined answers to choose from.


What, if anything, do you find frustrating about your smartphone?

Fig. 5 Example of a typical open-ended question

Overall, how satisfied or dissatisfied are you with your smartphone?
Extremely dissatisfied / Very dissatisfied / Slightly dissatisfied / Neither satisfied nor dissatisfied / Slightly satisfied / Very satisfied / Extremely satisfied

Fig. 6 Example of a typical closed-ended question, a bipolar rating question in particular

Open-ended questions are appropriate when:

• The universe of possible answers is unknown, e.g., “What is your favorite smartphone application?”. However, once the universe of possible answers is identified, it may be appropriate to create a closed-ended version of the same question.
• There are so many options in the full list of possible answers that they cannot be easily displayed, e.g., “Which applications have you used on your smartphone in the last week?”.
• Measuring quantities with natural metrics (i.e., a construct with an inherent unit of measurement, such as age, length, or frequency) when the information, such as time, frequency, or length, cannot be accessed from log data, e.g., “How many times do you use your tablet in a typical week?” (using a text field that is restricted to numeric input, the answers to which can later be bucketed flexibly).
• Measuring qualitative aspects of a user’s experience, e.g., “What do you find most frustrating about using your smartphone?”.

Closed-ended questions are appropriate when:

• The universe of possible answers is known and small enough to be easily provided, e.g., “Which operating system do you use on your smartphone?” (with answer options including “Android” and “iOS”).
• Rating a single object on a dimension, e.g., “Overall, how satisfied or dissatisfied are you with your smartphone?” (on a 7-point scale from “Extremely dissatisfied” to “Extremely satisfied”).
• Measuring quantities without natural metrics, such as importance, certainty, or degree, e.g., “How important is it to have your smartphone within reach 24 h a day?” (on a 5-point scale from “Not at all important” to “Extremely important”).


What is the highest level of education you have completed?
Less than High School
High School
Some College
2-year College Degree (Associates)
4-year College Degree (BA, BS)
Master’s Degree
Doctoral Degree
Professional Degree (MD, JD)

Fig. 7 Example of a single-choice question

Which of the following apps do you use daily on your smartphone? Select all that apply.
Gmail
Maps
Calendar
Facebook
Hangouts
Drive

Fig. 8 Example of a multiple-choice question

Types of Closed-Ended Survey Questions

There are four basic types of closed-ended questions: single-choice, multiple-choice, ranking, and rating questions.

1. Single-choice questions work best when only one answer is possible for each respondent in the real world (Fig. 7).
2. Multiple-choice questions are appropriate when more than one answer may apply to the respondent. Frequently, multiple-choice questions are accompanied by “select all that apply” help text. The maximum number of selections may also be specified to force users to prioritize or express preferences among the answer options (Fig. 8).
3. Ranking questions are best when respondents must prioritize their choices given a real-world situation (Fig. 9).
4. Rating questions are appropriate when the respondent must judge an object on a continuum. To optimize reliability and minimize bias, scale points need to be fully labeled instead of using numbers (Groves et al., 2004), and each scale point should be of equal width to avoid bias toward visually bigger response options (Tourangeau, Couper, & Conrad, 2004).


Rank the following smartphone manufacturers in order of your preference:
Add a number to each row, 1 being the least preferred, 5 being the most preferred.
Apple
HTC
Samsung
Motorola
Nokia

Fig. 9 Example of a ranking question

How important is it to you to make phone calls from your smartphone?
Not at all important / Slightly important / Moderately important / Very important / Extremely important

Fig. 10 Example of a rating question, for a unipolar construct in particular

Rating questions should use either a unipolar or a bipolar scale, depending on the construct being measured (Krosnick & Fabrigar, 1997; Schaeffer & Presser, 2003).

Unipolar constructs range from zero to an extreme amount and do not have a natural midpoint. They are best measured with a 5-point rating scale (Krosnick & Fabrigar, 1997), which optimizes reliability while minimizing respondent burden, and with the following scale labels, which have been shown to be semantically equidistant from each other (Rohrmann, 2003): “Not at all …,” “Slightly …,” “Moderately …,” “Very …,” and “Extremely ….” Such constructs include importance (see Fig. 10), interest, usefulness, and relative frequency.

Bipolar constructs range from an extreme negative to an extreme positive with a natural midpoint. Unlike unipolar constructs, they are best measured with a 7-point rating scale to maximize reliability and data differentiation (Krosnick & Fabrigar, 1997). Bipolar constructs may use the following scale labels: “Extremely …,” “Moderately …,” “Slightly …,” “Neither … nor …,” “Slightly …,” “Moderately …,” and “Extremely ….” Such constructs include satisfaction (see Fig. 6, from dissatisfied to satisfied), perceived speed (from slow to fast), ease of use (from difficult to easy), and visual appeal (from unappealing to appealing).

When using a rating scale, the inclusion of a midpoint should be considered. While some may argue that including a midpoint provides an easy target for respondents who shortcut answering questions, others argue that the exclusion of a
midpoint forces people who truly are in the middle to choose an option that does not reflect their actual opinion. O’Muircheartaigh, Krosnick, and Helic (2001) found that having a midpoint on a rating scale increases reliability, has no effect on validity, and does not result in lower data quality. Additionally, people who look for shortcuts (“shortcutters”) are not more likely to select the midpoint when it is present. Omitting the midpoint, on the other hand, increases the amount of random measurement error, causing those who actually feel neutral to make a random choice on either side of the scale. These findings suggest that a midpoint should be included when using a rating scale.

Questionnaire Biases

After writing the first survey draft, it is crucial to check the phrasing of each question for wording that may bias the responses. The following section covers five common questionnaire biases: satisficing, acquiescence bias, social desirability, response order bias, and question order bias.

Satisficing

Satisficing occurs when respondents use a suboptimal amount of cognitive effort to answer questions; instead of answering optimally, satisficers will typically pick what they consider to be the first acceptable response alternative (Krosnick, 1991; Simon, 1956). Satisficers compromise one or more of the following four cognitive steps for survey response, as identified by Tourangeau (1984):

1. Comprehension of the question, instructions, and answer options
2. Retrieval of specific memories to aid with answering the question
3. Judgement of the retrieved information and its applicability to the question
4. Mapping of the judgement onto the answer options

Satisficers shortcut this process by exerting less cognitive effort or by skipping one or more steps entirely: they use less effort to understand the question, to thoroughly search their memories, to carefully integrate all retrieved information, or to accurately pick the proper response choice (i.e., they pick the next acceptable choice). Satisficing can take weak and strong forms (Krosnick, 1999). Weak satisficers make an attempt to answer correctly yet are less than thorough, while strong satisficers may not search their memory for relevant information at all and may simply select answers at random in order to complete the survey quickly. In other words, weak satisficers carelessly process all four cognitive steps, while strong satisficers typically skip the retrieval and judgement steps.

Respondents are more likely to satisfice when (Krosnick, 1991):

• Cognitive ability to answer is low.
• Motivation to answer is low.
• Question difficulty is high at one of the four stages, resulting in cognitive exertion.


To minimize satisficing, the following may be considered:

• Complex questions that require an inordinate amount of cognitive exertion should be avoided.
• Answer options such as “no opinion,” “don’t know,” “not applicable,” or “unsure” should be avoided, since respondents with actual opinions will be tempted to select such options (Krosnick, 2002; Schaeffer & Presser, 2003). Instead, respondents should first be asked whether they have thought about the proposed question or issue enough to have an opinion; those who haven’t should be screened out.
• Using the same rating scale in a series of back-to-back questions should be avoided, as potential satisficers may pick the same scale point for all answer options. This is known as straight-lining or item non-differentiation (Herzog & Bachman, 1981; Krosnick & Alwin, 1987, 1988).
• Long questionnaires should be avoided, since respondents will be less likely to optimally answer questions as they become increasingly fatigued and unmotivated (Cannell & Kahn, 1968; Herzog & Bachman, 1981).
• Respondent motivation can be increased by explaining the importance of the survey topic and that their responses are critical to the researcher (Krosnick, 1991).
• Respondents may be asked to justify their answer to a question that may exhibit satisficing.
• Trap questions (e.g., “Enter the number 5 in the following text box:”) can identify satisficers and fraudulent survey respondents.
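Straight-lining can also be flagged after the data come in. The following minimal sketch (with invented respondent IDs and ratings) marks respondents who selected the identical scale point for every item in a block of back-to-back rating questions:

    # Each row: one respondent's answers to a block of rating questions (1-5 scale).
    grid_answers = {
        "r001": [4, 4, 4, 4, 4, 4],   # likely straight-liner
        "r002": [5, 3, 4, 2, 5, 4],
        "r003": [1, 1, 2, 1, 1, 2],
    }

    def is_straight_liner(answers):
        """True if the respondent picked the same scale point for every item."""
        return len(set(answers)) == 1

    flagged = [rid for rid, answers in grid_answers.items() if is_straight_liner(answers)]
    print(flagged)   # -> ['r001']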

Acquiescence Bias

When presented with agree/disagree, yes/no, or true/false statements, some respondents are more likely to concur with the statement independent of its substance. This tendency is known as acquiescence bias (Smith, 1967). Respondents are more likely to acquiesce when:

• Cognitive ability is low (Krosnick, Narayan, & Smith, 1996) or motivation is low.
• Question difficulty is high (Stone, Gage, & Leavitt, 1957).
• Personality tendencies skew toward agreeableness (Costa & McCrae, 1988; Goldberg, 1990; Saris, Revilla, Krosnick, & Shaeffer, 2010).
• Social conventions suggest that a “yes” response is most polite (Saris et al., 2010).
• The respondent satisfices and only thinks of reasons why the statement is true, rather than expending cognitive effort to consider reasons for disagreement (Krosnick, 1991).
• Respondents with lower self-perceived status assume that the survey administrator agrees with the posed statement, resulting in deferential agreement bias (Saris et al., 2010).


To minimize acquiescence bias, the following may be considered:

• Avoid questions with agree/disagree, yes/no, true/false, or similar answer options (Krosnick & Presser, 2010).
• Where possible, ask construct-specific questions (i.e., questions that ask about the underlying construct in a neutral, non-leading way) instead of agreement statements (Saris et al., 2010).
• Use reverse-keyed constructs; i.e., the same construct is asked positively and negatively in the same survey. The raw scores of both responses are then combined to correct for acquiescence bias.
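As a minimal sketch of the reverse-keying remedy (the item wordings, the 7-point scale, and the averaging rule are illustrative assumptions, not a prescribed scoring procedure):

    # Hypothetical reverse-keyed pair on a 1-7 agreement scale:
    #   item_pos: "Finding information on this site is easy."      (agreement = positive)
    #   item_neg: "Finding information on this site is difficult." (agreement = negative)
    SCALE_MIN, SCALE_MAX = 1, 7

    def combined_score(item_pos, item_neg):
        """Reverse the negatively keyed item, then average the two raw scores."""
        reversed_neg = (SCALE_MAX + SCALE_MIN) - item_neg
        return (item_pos + reversed_neg) / 2

    print(combined_score(item_pos=6, item_neg=2))   # -> 6.0 (consistent respondent)
    print(combined_score(item_pos=7, item_neg=6))   # -> 4.5 (an "agree to both" pattern is
                                                    #         pulled back toward the midpoint)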

Social Desirability

Social desirability bias occurs when respondents answer questions in a manner they feel will be positively perceived by others (Goffman, 1959; Schlenker & Weigold, 1989). Favorable actions may be overreported, and unfavorable actions or views may be underreported. Topics that are especially prone to social desirability bias include voting behavior, religious beliefs, sexual activity, patriotism, bigotry, intellectual capabilities, illegal acts, acts of violence, and charitable acts. Respondents are inclined to provide socially desirable answers when:

• Their behavior or views go against the social norm (Holbrook & Krosnick, 2010).
• They are asked to provide information on sensitive topics, making them feel uncomfortable or embarrassed about expressing their actual views (Holbrook & Krosnick, 2010).
• They perceive a threat of disclosure or consequences to answering truthfully (Tourangeau, Rips, & Rasinski, 2000).
• Their true identity (e.g., name, address, phone number) is captured in the survey (Paulhus, 1984).
• The data is directly collected by another person (e.g., in-person or phone surveys).

To minimize social desirability bias, respondents should be allowed to answer anonymously, or the survey should be self-administered (Holbrook & Krosnick, 2010; Tourangeau & Smith, 1996; Tourangeau & Yan, 2007).

Response Order Bias

Response order bias is the tendency to select items toward the beginning (i.e., a primacy effect) or the end (i.e., a recency effect) of an answer list or scale (Chan, 1991; Krosnick & Alwin, 1987; Payne, 1971). Respondents unconsciously interpret the ordering of listed answer options and assume that items near each other are related, that top or left items come “first,” and that middle answers in a scale without a natural order represent the typical value (Tourangeau et al., 2004). Primacy and recency effects are strongest when the list of answer options is long (Schuman & Presser, 1981) or when it cannot be viewed as a whole (Couper et al., 2004).
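One of the remedies listed below, randomizing the order of unrelated answer options across respondents, can be sketched as follows (the option list reuses Fig. 8; the seeding scheme and respondent IDs are illustrative assumptions):

    import random

    options = ["Gmail", "Maps", "Calendar", "Facebook", "Hangouts", "Drive"]

    def options_for(respondent_id, options):
        """Per-respondent random order that stays stable if the page is reloaded."""
        rng = random.Random(respondent_id)   # deterministic seed per respondent
        shuffled = list(options)             # leave the shared list untouched
        rng.shuffle(shuffled)
        return shuffled

    print(options_for("r001", options))
    print(options_for("r002", options))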


To minimize response order effects, the following may be considered:

• Unrelated answer options should be randomly ordered across respondents (Krosnick & Presser, 2010).
• Rating scales should be ordered from negative to positive, with the most negative item first.
• The order of ordinal scales should be reversed randomly between respondents, and the raw scores of both scale versions should be averaged using the same value for each scale label. That way, response order effects cancel each other out across respondents (e.g., Villar & Krosnick, 2011), unfortunately at the cost of increased variability.

Question Order Bias

Order effects also apply to the order of the questions in a survey. Each question has the potential to bias each subsequent question by priming respondents (Kinder & Iyengar, 1987; Landon, 1971). The following guidelines may be considered:

• Questions should be ordered from broad to more specific (i.e., a funnel approach) to ensure that the survey follows conversational conventions.
• Early questions should be easy to answer and directly related to the survey topic, to help build rapport and engage respondents (Dillman, 1978).
• Non-critical, complex, and sensitive questions should be included toward the end of the survey to avoid early drop-off and to ensure the collection of critical data.
• Related questions should be grouped to reduce context switching, so that respondents can more easily and quickly access related information from memory.
• The questionnaire should be divided into multiple pages with distinct, labeled sections for easier cognitive processing.

Other Types of Questions to Avoid

Beyond the five common questionnaire biases mentioned above, there are additional question types that can result in unreliable and invalid survey data. These include broad, leading, double-barreled, recall, prediction, hypothetical, and prioritization questions.

Broad questions lack focus and include items that are not clearly defined or that can be interpreted in multiple ways. For example, “Describe the way you use your tablet computer” is too broad, as there are many aspects to using a tablet, such as the purpose, the applications being used, and the locations of use. Instead of relying on the respondent to decide which aspects to report, the research goal as well as the core construct(s) should be determined beforehand and asked about in a focused manner. A more focused set of questions for the example above could be “Which apps did you use on your tablet computer over the last week?” and “Describe the locations in which you used your tablet computer last week.”


Leading questions manipulate respondents into giving a certain answer by providing biasing content or suggesting information the researcher is looking to have confirmed. For example: “This application was recently ranked as number one in customer satisfaction. How satisfied are you with your experience today?”. Questions that ask the respondent to agree or disagree with a given statement, such as “Do you agree or disagree with the following statement: I use my smartphone more often than my tablet computer,” can also lead the respondent toward a certain answer and may additionally result in acquiescence bias (as discussed above). To minimize the effects of leading questions, questions should be asked in a fully neutral way, without examples or additional information that may bias respondents toward a particular response.

Double-barreled questions ask about multiple items while only allowing for a single response, resulting in less reliable and valid data. Such questions can usually be detected by the presence of the word “and.” For example, when asked “How satisfied or dissatisfied are you with your smartphone and tablet computer?”, a respondent with differing attitudes toward the two devices will be forced to pick an attitude that reflects either just one device or the average across both devices. Questions with multiple items should be broken down into one question per construct or item.

Recall questions require the respondent to remember past attitudes and behaviors, leading to recall bias (Krosnick & Presser, 2010) and inaccurate recollections. When a respondent is asked “How many times did you use an Internet search engine over the past 6 months?”, they will try to rationalize a plausible number, because recalling a precise count is difficult or impossible. Similarly, asking questions that compare past attitudes to current attitudes, as in “Do you prefer the previous or the current version of the interface?”, may result in skewed data due to the difficulty of remembering past attitudes. Instead, questions should focus on the present, as in “How satisfied or dissatisfied are you with your smartphone today?”, or use a recent time frame, for example, “In the past hour, how many times did you use an Internet search engine?”. If the research goal is to compare attitudes or behaviors across different product versions or over time, the researcher should field separate surveys for each product version or time period and make the comparison themselves.

Prediction questions ask survey respondents to anticipate future behavior or attitudes, resulting in biased and inaccurate responses. Such questions include “Over the next month, how frequently will you use an Internet search engine?”. Even more cognitively burdensome are hypothetical questions, i.e., questions asking the respondent to imagine a certain situation in the future and then to predict their attitude or behavior in that situation, for example, “Would you purchase more groceries if the store played your favorite music?” or “How much would you like this Website if it used blue instead of red for its color scheme?”. Other frequently used hypothetical questions ask the respondent to prioritize a future feature set, as in “Which of the following features would make you more satisfied with this product?”.
Even though the respondent may have a clear answer to this question, their response does not predict actual future usage of or satisfaction with the product if that feature was added. Such questions should be entirely excluded from surveys.


Leveraging Established Questionnaires

An alternative to constructing a brand new questionnaire is utilizing questionnaires developed by others. These usually benefit from prior validation and allow researchers to compare results with other studies that used the same questionnaire. When selecting an existing questionnaire, one should consider the particular research goals and study needs and adapt the questionnaire as appropriate. Below are commonly used HCI-related questionnaire instruments. Note that as survey research methodology has advanced significantly over time, each questionnaire should be assessed for potential sources of measurement error, such as the biases and the to-be-avoided question types mentioned previously.

• NASA Task Load Index (NASA TLX). Originally developed for aircraft cockpits, this questionnaire allows researchers to subjectively assess the workload of operators working with human–machine systems. It measures mental demand, physical demand, temporal demand, performance, effort, and frustration (Hart & Staveland, 1988).
• Questionnaire for User Interface Satisfaction (QUIS). This questionnaire assesses one’s overall reaction to a system, including its software, screen, terminology, system information, and learnability (Chin, Diehl, & Norman, 1988).
• Software Usability Measurement Inventory (SUMI). This questionnaire measures perceived software quality covering dimensions such as efficiency, affect, helpfulness, control, and learnability, which are then summarized into a single satisfaction score (Kirakowski & Corbett, 1993).
• Computer System Usability Questionnaire (CSUQ). This questionnaire, developed by IBM, measures user satisfaction with system usability (Lewis, 1995).
• System Usability Scale (SUS). As one of the most frequently used scales in user experience research, SUS measures attitudes regarding the effectiveness, efficiency, and satisfaction with a system using ten questions, yielding a single score (Brooke, 1996).
• Visual Aesthetics of Website Inventory (VisAWI). This survey measures the perceived visual aesthetics of a Website on the four subscales of simplicity, diversity, colorfulness, and craftsmanship (Moshagen & Thielsch, 2010).
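As an example of working with an established instrument, SUS is conventionally scored by rescaling its ten 1–5 items (odd-numbered items are positively worded, even-numbered items negatively worded) and multiplying the sum by 2.5 to yield a 0–100 score (Brooke, 1996). A minimal sketch, with an invented set of responses:

    def sus_score(responses):
        """responses: ten ratings from 1 to 5, in SUS item order (items 1-10)."""
        assert len(responses) == 10
        total = 0
        for item, rating in enumerate(responses, start=1):
            total += (rating - 1) if item % 2 == 1 else (5 - rating)   # even items are reversed
        return total * 2.5                                             # rescale to 0-100

    print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 3]))   # -> 77.5 for this hypothetical respondent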

Visual Survey Design Considerations

Researchers should also take into account their survey’s visual design, since specific choices, including the use of images, spacing, and progress bars, may unintentionally bias respondents. This section summarizes such visual design aspects; for more details, refer to Couper (2008).

While objective images (e.g., product screenshots) can help clarify questions, context-shaping images can influence a respondent’s mindset. For example, when asking respondents to rate their level of health, presenting an image of someone in a hospital bed has a framing effect that results in higher health ratings compared to presenting an image of someone jogging (Couper, Conrad, & Tourangeau, 2007).


The visual treatment of response options also matters. When asking closed-ended questions, uneven spacing between horizontal scale options results in a higher selection rate for the scale points with greater spacing; evenly spaced scale options are recommended (Tourangeau, Couper, & Conrad, 2004). Drop-down lists, compared to radio buttons, have been shown to be harder and slower to use and to result in more accidental selections (Couper, 2011). Lastly, larger text fields increase the amount of text entered (Couper, 2011) but may intimidate respondents, potentially causing higher break-off (i.e., drop-out) rates.

Survey questions can be presented one per page, multiple per page, or all on one page. Research into pagination effects on completion rates is inconclusive (Couper, 2011). However, questions appearing on the same page may have higher correlations with each other, a sign of measurement bias (Peytchev, Couper, McCabe, & Crawford, 2006). In practice, most Internet surveys with skip logic use multiple pages, whereas very short questionnaires are often presented on a single page.

While progress bars are generally preferred by respondents and are helpful for short surveys, their use in long surveys or surveys with skip logic can be misleading and intimidating. Progress between pages in long surveys may be small, resulting in increased break-off rates (Callegaro, Villar, & Yang, 2011). On the other hand, progress bars are likely to increase completion rates for short surveys, where substantial progress is shown between pages.

Review and Survey Pretesting

At this point in the survey life cycle, it is appropriate to have potential respondents take and evaluate the survey in order to identify any remaining points of confusion. For example, the phrase “mobile device” may be assumed by the researcher to include mobile phones, tablets, and in-car devices, while survey respondents may interpret it to mean mobile phones only. Or, when asking about the communication tools used by the respondent, the provided list of answer choices may not actually include all the options needed to properly answer the question. Two established evaluation methods used to improve survey quality are cognitive pretesting and field testing, i.e., launching the survey to a subset of the actual sample; both are described in the remainder of this section. By evaluating surveys early on, the researcher can identify disconnects between their own assumptions and how respondents will read, interpret, and answer the questions.

Cognitive Pretesting

To conduct a cognitive pretest, a small set of potential respondents is invited to participate in an in-person interview where they are asked to take the survey while using the think-aloud protocol (similar to a usability study). A cognitive pretest assesses question interpretation, construct validity, and comprehension of survey


terminology, and it calls attention to missing answer options or entire questions (Bolton & Bronkhorst, 1995; Collins, 2003; Drennan, 2003; Presser et al., 2004). Note, however, that due to the testing environment, a cognitive pretest does not allow the researcher to understand contextual influences that may result in break-off or in not filling out the survey in the first place.

As part of a pretest, participants are asked the following for each question:

1. “Read the entire question and describe it in your own words.”
2. “Select or write an answer while explaining your thought process.”
3. “Describe any confusing terminology or missing answer choices.”

During the interview, the researcher should observe participant reactions; identify misinterpretations of terms, questions, answer choices, or scale items; and gain insight into how respondents process the questions and come up with their answers. The researcher then needs to analyze the collected information to improve problematic areas before fielding the final questionnaire. A questionnaire may go through several rounds of iteration before reaching the desired quality.

Field Testing

Piloting the survey with a small subset of the sample will provide insights that cognitive pretests alone cannot (Collins, 2003; Presser et al., 2004). Through field testing, the researcher can assess the success of the sampling approach, look for common break-off points and long completion times, and examine answers to open-ended questions. High break-off rates and long completion times may point to flaws in the survey design (see the following section), while unusual answers may suggest a disconnect between a question’s intention and respondents’ interpretation. To yield additional insights from the field test, a question can be added at the end of each page or at the end of the entire survey where respondents can provide explicit feedback on any points of confusion. Similar to cognitive pretests, field testing may lead to several rounds of questionnaire improvement as well as to changes in the sampling method. Finally, once all concerns are addressed, the survey is ready to be fielded to the entire sample.

Implementation and Launch When all questions are finalized, the survey is ready to be fielded based on the chosen sampling method. Respondents may be invited through e-mails to specifically named persons (e.g., respondents chosen from a panel), intercept pop-up dialogs while using a product or a site, or links placed directly in an application (see the sampling section for more details; Couper, 2000).

There are many platforms and tools that can be used to implement Internet surveys, such as ConfirmIt, Google Forms, Kinesis, LimeSurvey, SurveyGizmo, SurveyMonkey, UserZoom, Wufoo, and Zoomerang, to name just a few. When deciding on the appropriate platform, functionality, cost, and ease of use should be taken into consideration. The questionnaire may require a survey tool that supports functionality such as branching and conditionals, the ability to pass URL parameters, multiple languages, and a range of question types. Additionally, the researcher may want to customize the visual style of the survey or set up an automatic reporting dashboard, both of which may only be available on more sophisticated platforms.

Piping Behavioral Data into Surveys Some platforms support the ability to combine survey responses with other log data, which is referred to as piping. Self-reported behaviors, such as frequency of use, feature usage, tenure, and platform usage, are less valid and reliable than the same metrics generated from log data. By merging survey responses with behavioral data, the researcher can more accurately understand the relationship between respondent characteristics and their behaviors or attitudes. For example, the researcher may find that certain user types or higher levels of usage correlate with higher reported satisfaction. Behavioral data can either be passed to the results database as a parameter in the survey invitation link or combined later via a unique identifier for each respondent.
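To make the identifier-based approach concrete, the following is a minimal sketch assuming a Python environment with pandas and purely hypothetical file and column names (respondent_id, satisfaction, sessions_last_30d, tenure_days); it illustrates joining responses to log data on a unique identifier rather than prescribing a particular implementation.

import pandas as pd

# Hypothetical exports: survey responses and behavioral logs, both keyed by a
# unique respondent identifier (all names here are illustrative only).
responses = pd.read_csv("survey_responses.csv")  # respondent_id, satisfaction, ...
logs = pd.read_csv("usage_logs.csv")             # respondent_id, sessions_last_30d, tenure_days

# Left-join so that every survey response is kept, even without matching log data.
merged = responses.merge(logs, on="respondent_id", how="left")

# Example: relate logged usage intensity to reported satisfaction.
print(merged[["sessions_last_30d", "tenure_days", "satisfaction"]].corr())

# Alternatively, behavioral values can be passed along in the invitation link
# itself, e.g. https://example.com/survey?rid=12345&tenure=90 (hypothetical URL).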

Monitoring Survey Paradata With the survey’s launch, researchers should monitor the initial responses as well as survey paradata to identify potential mistakes in the survey design. Survey paradata is data collected about the survey response process, such as the devices from which the survey was accessed, time to survey completion, and various response-related rates. By monitoring such metrics, the survey researcher can quickly apply improvements before the entire sample has responded to the survey. The American Association for Public Opinion Research specified a set of definitions for commonly used paradata metrics (AAPOR, 2011), listed below; a brief computational sketch follows the list.
• Click-through rate: Of those invited, how many opened the survey.
• Completion rate: Of those who opened the survey, how many finished the survey.
• Response rate: Of those invited, how many finished the survey.
• Break-off rate: Of those who started, how many dropped off on each page.
• Completion time: The time it took respondents to finish the entire survey.
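As a rough illustration of how these rates might be computed while a survey is in the field, the sketch below uses invented counts and a simple page-entry tally; the numbers and variable names are hypothetical and do not come from the chapter.

# Hypothetical counts collected from the survey platform's logs.
invited = 5000
opened = 1800                                    # clicked the invitation and reached page 1
completed = 1100                                 # submitted the final page
entered_page = {1: 1800, 2: 1500, 3: 1300, 4: 1150, 5: 1100}

click_through_rate = opened / invited            # of those invited, how many opened
completion_rate = completed / opened             # of those who opened, how many finished
response_rate = completed / invited              # of those invited, how many finished

# Break-off rate per page: share of those who started that dropped on each page.
break_off_rate = {
    page: (entered - entered_page.get(page + 1, completed)) / opened
    for page, entered in entered_page.items()
}

print(f"click-through {click_through_rate:.1%}, completion {completion_rate:.1%}, "
      f"response {response_rate:.1%}")
print(break_off_rate)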

Response rates are dependent on a variety of factors, the combination of which makes it difficult to specify an acceptable response rate in HCI survey research. A meta-analysis of 31 e-mail surveys from 1986 to 2000 showed that average response rates for e-mail surveys typically fall between 30 and 40 %, with follow-up reminders significantly increasing response rates (Sheehan, 2001). Another review of 69 e-mail surveys showed that response rates averaged around 40 % (Cook, Heath, & Thompson, 2000). When inviting respondents through Internet intercept surveys (e.g., pop-up surveys or in-product links), response rates may be 15 % or lower (Couper, 2000). Meta-analyses of mailed surveys showed that their response rates are 40–50 % (Kerlinger, 1986) or 55 % (Baruch, 1999). In experimental comparisons to mailed surveys, response rates to Internet e-mail surveys were about 10 % lower (Kaplowitz, Hadlock, & Levine, 2004; Manfreda et al., 2008). Such meta reviews also showed that overall response rates have been declining over several decades (Baruch, 1999; Baruch & Holtom, 2008; Sheehan, 2001); however, this decline seems to have stagnated around 1995 (Baruch & Holtom, 2008).

Maximizing Response Rates In order to gather enough responses to represent the target population with the desired level of precision, response rates should be maximized. Several factors affect response rates, including the respondents’ interest in the subject matter, the perceived impact of responding to the survey, questionnaire length and difficulty, the presence and nature of incentives, and researchers’ efforts to encourage response (Fan & Yan, 2010). Based on experimentation with invitation processes for mail surveys, Dillman (1978) developed the “Total Design Method” to optimize response rates. This method, consistently achieving response rates averaging 70 % or better, consists of a timed sequence of four mailings: the initial request with the survey on week one, a reminder postcard on week two, a replacement survey to non-respondents on week four, and a second replacement survey to non-respondents by certified mail on week seven. Dillman incorporates social exchange theory into the Total Design Method by personalizing the invitation letters, using official stationery to increase trust in the survey’s sponsorship, explaining the usefulness of the survey research and the importance of responding, assuring the confidentiality of respondents’ data, and beginning the questionnaire with items directly related to the topic of the survey (1991). Recognizing the need to cover Internet and mixed-mode surveys, Dillman extended his prior work with the “Tailored Design Method.” With this update, he emphasized customizing processes and designs to fit each survey’s topic, population, and sponsorship (2007).
Another component of optimizing response rates is getting as many complete responses as possible from those who start the survey. According to Peytchev (2009), causes of break-off may fall into the following three categories:
• Respondent factors (survey topic salience and cognitive ability)
• Survey design factors (length, progress indicators, and incentives)
• Question design factors (fatigue and intimidation from open-ended questions and lengthy grid questions)
The questionnaire design principles mentioned previously may help minimize break-off, such as making surveys as short as possible, having a minimum of required questions, using skip logic, and including progress bars for short surveys.

Providing an incentive to encourage survey responses may be advantageous in certain cases. Monetary incentives tend to increase response rates more than nonmonetary incentives (Singer, 2002). In particular, non-contingent incentives, which are offered to all people in the sample, generally outperform contingent incentives, given only upon completion of the survey (Church, 1993). This is true even when a non-contingent incentive is considerably smaller than a contingent incentive. One strategy to maximize the benefit of incentives is to offer a small non-contingent award to all invitees, followed by a larger contingent award to initial non-respondents (Lavrakas, 2011). An alternate form of contingent incentive is a lottery, where a drawing is held among respondents for a small number of monetary awards or other prizes. However, the efficacy of such lotteries is unclear (Stevenson, Dykema, Cyffka, Klein, & Goldrick-Rab, 2012). Although incentives will typically increase response rates, it is much less certain whether they increase the representativeness of the results. Incentives are likely most valuable when facing a small population or sampling frame, and high response rates are required for sufficiently precise measurements. Another case where incentives may help is when some groups in the sample have low interest in the survey topic (Singer, 2002). Furthermore, when there is a cost to contact each potential respondent, as with door-to-door interviewing, incentives will decrease costs by lowering the number of people that need to be contacted.

Data Analysis and Reporting Once all the necessary survey responses have been collected, it is time to start making sense of the data by:
1. Preparing and exploring the data
2. Thoroughly analyzing the data
3. Synthesizing insights for the target audience of this research

Data Preparation and Cleaning Cleaning and preparing survey data before conducting a thorough analysis are essential to identify low-quality responses that may otherwise skew the results. When taking a pass through the data, survey researchers should look for signs of poor-quality responses (a brief code sketch of such checks follows the lists below). Such survey data can either be left as is, removed, or presented separately from trusted data. If the researcher decides to remove poor data, they must cautiously decide whether to remove data on the respondent level (i.e., listwise deletion), at an individual question level (i.e., pairwise deletion), or only beyond a certain point in the survey where respondents’ data quality declined. The following are signals that survey researchers should look out for at the survey response level:
• Duplicate responses. In a self-administered survey, a respondent might be able to fill out the survey more than once. If possible, respondent information such as name, e-mail address, or any other unique identifier should be used to remove duplicate responses.
• Speeders. Respondents that complete the survey faster than possible, speeders, may have carelessly read and answered the questions, resulting in arbitrary responses. The researcher should examine the distribution of response times and remove any respondents that are suspiciously fast.
• Straight-liners and other questionable patterns. Respondents that always, or almost always, pick the same answer option across survey questions are referred to as straight-liners. Grid-style questions are particularly prone to respondent straight-lining (e.g., by always picking the first answer option when asked to rate a series of objects). Respondents may also try to hide the fact that they are randomly choosing responses by answering in a fixed pattern (e.g., by alternating between the first and second answer options across questions). If a respondent straight-lines through the entire survey, the researcher may decide to remove the respondent’s data entirely. If a respondent starts straight-lining at a certain point, the researcher may keep data up until that point.
• Missing data and break-offs. Some respondents may finish a survey but skip several questions. Others may start the survey but break off at some point. Both result in missing data. It should first be determined whether those who did not respond to certain questions are different from those who did. A non-response study should be conducted to assess the amount of non-response bias for each survey question. If those who did not answer certain questions are not meaningfully different from those who did, the researcher can consider leaving the data as is; however, if there is a difference, the researcher may choose to impute plausible values based on similar respondents’ answers (De Leeuw, Hox, & Huisman, 2003).
Furthermore, the following signals may need to be assessed at a question-by-question level:
• Low inter-item reliability. When multiple questions are used to measure a single construct, respondents’ answers to these questions should be associated with each other. Respondents that give inconsistent or unreliable responses (e.g., selecting “very fast” and “very slow” for separate questions assessing the construct of speed) may not have carefully read the set of questions and should be considered for removal.
• Outliers. Answers that significantly deviate from the majority of responses are considered outliers and should be examined. For questions with numeric values, some consider outliers as the top and bottom 2 % of responses, while others calculate outliers as anything outside of two or three standard deviations from the mean. Survey researchers should determine how much of a difference keeping or removing the outliers has on variables’ averages. If the impact is significant, the researcher may either remove such responses entirely or replace them with a value that equals two or three standard deviations from the mean. Another way to describe the central tendency while minimizing the effect of outliers is to use the median, rather than the mean.
• Inadequate open-ended responses. Due to the amount of effort required, open-ended questions may lead to low-quality responses. Obvious garbage and irrelevant answers, such as “asdf,” should be removed, and other answers from the same respondent should be examined to determine whether all their survey responses warrant removal.
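A minimal sketch of such screening checks, assuming pandas and hypothetical column names (email, duration_seconds, hours_per_week, and a block of grid items whose names start with Q); the thresholds are arbitrary illustrations of the judgment calls described above, not recommended values.

import pandas as pd

df = pd.read_csv("raw_responses.csv")            # hypothetical export
grid_cols = [c for c in df.columns if c.startswith("Q")]

# Duplicate responses: keep only the first submission per unique identifier.
df = df.drop_duplicates(subset="email", keep="first")

# Speeders: flag respondents far below a plausible completion time
# (here, the fastest 5 % -- one possible heuristic among many).
df["is_speeder"] = df["duration_seconds"] < df["duration_seconds"].quantile(0.05)

# Straight-liners: no variation at all across a grid of related items.
df["is_straightliner"] = df[grid_cols].nunique(axis=1) == 1

# Outliers on a numeric question: more than three standard deviations from the mean.
z = (df["hours_per_week"] - df["hours_per_week"].mean()) / df["hours_per_week"].std()
df["is_outlier"] = z.abs() > 3

# Review flagged cases before deciding on listwise or pairwise removal.
flagged = df[df["is_speeder"] | df["is_straightliner"] | df["is_outlier"]]
print(f"{len(flagged)} of {len(df)} responses flagged for review")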

Analysis of Closed-Ended Responses To get an overview of what the survey data shows, descriptive statistics are fundamental. By looking at measures such as the frequency distribution, central tendency (e.g., mean or median), and data dispersion (e.g., standard deviation), emerging patterns can be uncovered. The frequency distribution shows the proportion of responses for each answer option. The central tendency measures the “central” position of a frequency distribution and is calculated using the mean, median, and mode. Dispersion examines the data spread around the central position through calculations such as standard deviation, variance, range, and interquartile range. While descriptive statistics only describe the existing data set, inferential statistics can be used to draw inferences from the sample to the overall population in question. Inferential statistics consists of two areas: estimation statistics and hypothesis testing. Estimation statistics involves using the survey’s sample in order to approximate the population’s value. Either the margin of error or the confidence interval of the sample’s data needs to be determined for such estimation. To calculate the margin of error for an answer option’s proportion, only the sample size, the proportion, and a selected confidence level are needed. However, to determine the confidence interval for a mean, the standard error of the mean is additionally required. A confidence interval thus represents the estimated range of a population’s mean at a certain confidence level. Hypothesis testing assesses whether observed differences between groups (e.g., in means or proportions) are statistically significant, using methods such as the t-test, ANOVA, or Chi-square test (a brief code sketch of these estimation and testing steps follows the list below). The appropriate test is determined by the research question, type of prediction by the researcher, and type of variable (i.e., nominal, ordinal, interval, or ratio). An experienced quantitative researcher or statistician should be involved. Inferential statistics can also be applied to identify connections among variables:
• Bivariate correlations are widely used to assess linear relationships between variables. For example, correlations can indicate which product dimensions (e.g., ease of use, speed, features) are most strongly associated with users’ overall satisfaction.
• Linear regression analysis indicates the proportion of variance in a continuous dependent variable that is explained by one or more independent variables and the amount of change explained by each unit of an independent variable.
• Logistic regression predicts the change in probability of getting a particular value in a binary variable, given a unit change in one or more independent variables.
• Decision trees assess the probabilities of reaching specific outcomes, considering relationships between variables.
• Factor analysis identifies groups of covariates and can be useful to reduce a large number of variables into a smaller set.
• Cluster analysis looks for related groups of respondents and is often used by market researchers to identify and categorize segments within a population.
There are many packages available to assist with survey analysis. Software such as Microsoft Excel, and even certain survey platforms such as SurveyMonkey or Google Forms, can be used for basic descriptive statistics and charts. More advanced packages such as SPSS, R, SAS, or Matlab can be used for complex modeling, calculations, and charting. Note that data cleaning often needs to be a precursor to conducting analysis using such tools.
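As an illustration of the estimation and hypothesis-testing steps described above, here is a minimal sketch using Python with NumPy and SciPy; the numbers are invented, and the two-sample t-test shown is just one of the possible tests mentioned in the text.

import math
import numpy as np
from scipy import stats

# Margin of error for a proportion at a 95 % confidence level,
# given the sample size and the observed proportion.
n, p = 400, 0.62
z = stats.norm.ppf(0.975)                         # roughly 1.96
margin_of_error = z * math.sqrt(p * (1 - p) / n)  # about +/- 4.8 percentage points here

# Confidence interval for a mean, using the standard error of the mean.
ratings = np.array([5, 4, 4, 3, 5, 2, 4, 5, 3, 4])
ci_low, ci_high = stats.t.interval(0.95, len(ratings) - 1,
                                   loc=ratings.mean(), scale=stats.sem(ratings))

# Hypothesis test comparing two groups' mean ratings (e.g., by platform).
group_a = np.array([4, 5, 3, 4, 5, 4])
group_b = np.array([3, 3, 4, 2, 3, 4])
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(margin_of_error, (ci_low, ci_high), p_value)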

Analysis of Open-Ended Comments In addition to analyzing closed-ended responses, the review of open-ended comments contributes a more holistic understanding of the phenomena being studied. Analyzing a large set of open-ended comments may seem like a daunting task at first; however, if done correctly, it reveals important insights that cannot otherwise be extracted from closed-ended responses. The analysis of open-ended survey responses can be derived from the method of grounded theory (Böhm, 2004; Glaser & Strauss, 1967) (see chapter on “Grounded Theory Methods”). An interpretive method, referred to as coding (Saldaña, 2009), is used to organize and transform qualitative data from open-ended questions to enable further quantitative analysis (e.g., preparing a frequency distribution of the codes or comparing the responses across groups). The core of such qualitative analysis is to assign one or several codes to each comment; each code consists of a word or a short phrase summarizing the essence of the response with regard to the objective of that survey question (e.g., described frustrations, behavior, sentiment, or user type). Available codes are chosen from a coding scheme, which may already be established by the community or from previous research or may need to be created by the researchers themselves. In most cases, as questions are customized to each individual survey, the researcher needs to establish the coding system using a deductive or an inductive approach. When employing a deductive approach, the researcher defines the full list of possible codes in a top-down fashion; i.e., all codes are defined before reviewing the qualitative data and assigning those codes to comments. On the other hand, when using an inductive approach to coding, the codes are generated and constantly revised in a bottom-up approach; i.e., the data is coded according to categories identified by reading and re-reading responses to the open-ended question. Bottom-up, inductive coding is recommended, as it has the benefit of capturing categories the researcher may not have thought of before reading the actual comments; however, it requires more coordination if multiple coders are involved. (See “Grounded Theory Method” chapter for an analogous discussion.) To measure the reliability of both the developed coding system and the coding of the comments, either the same coder should partially repeat the coding or a second coder should be involved. Intra-rater reliability describes the degree of agreement when the data set is reanalyzed by the same researcher. Inter-rater reliability (Armstrong, Gosling, Weinman, & Marteau, 1997; Gwet, 2001) determines the agreement level of the coding results from at least two independent researchers (using correlations or Cohen’s kappa). If there is low agreement, the coding needs to be reviewed to identify the pattern behind the disagreement, coder training needs to be adjusted, or changes to codes need to be agreed upon to achieve consistent categorization. If the data set to be coded is too large and coding needs to be split up between researchers, inter-rater consistency can be measured by comparing results from coding an overlapping set of comments, by comparing the coding to a pre-established standard, or by including another researcher to review overlapping codes from the main coders. After having analyzed all comments, the researcher may prepare descriptive statistics such as a frequency distribution of codes, conduct inferential statistical tests, summarize key themes, prepare necessary charts, and highlight specifics through the use of representative quotes. To compare results across groups, inferential analysis methods can be used as described above for closed-ended data (e.g., t-tests, ANOVA, or Chi-square).
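For example, inter-rater agreement between two coders can be checked in a few lines of Python; this sketch assumes scikit-learn is available, uses invented codes, and applies only to the simple case of one code per comment.

from collections import Counter
from sklearn.metrics import cohen_kappa_score

# Codes assigned to the same ten comments by two independent coders (hypothetical labels).
coder_1 = ["speed", "speed", "ui", "bugs", "ui", "speed", "other", "bugs", "ui", "bugs"]
coder_2 = ["speed", "ui",    "ui", "bugs", "ui", "speed", "other", "bugs", "ui", "other"]

kappa = cohen_kappa_score(coder_1, coder_2)
print(f"Cohen's kappa: {kappa:.2f}")              # values near 1 indicate strong agreement

# A simple frequency distribution of one coder's codes.
print(Counter(coder_1))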

Assessing Representativeness A key criterion in any survey’s quality is the degree to which the results accurately represent the target population. If a survey’s sampling frame fully covers the population and the sample is randomly drawn from the sampling frame, a response rate of 100 % would ensure that the results are representative at a level of precision based on the sample size. If, however, a survey has less than a 100 % response rate, those not responding might have provided a different answer distribution than those who did respond. An example is a survey intended to measure attitudes and behaviors regarding a technology that became available recently. Since people who are early adopters of new technologies are usually very passionate about providing their thoughts and feedback, surveying users of this technology product would overestimate responses from early adopters (as compared to more occasional users) and the incidence of favorable attitudes toward that product. Thus, even a modest level of non-response can greatly affect the degree of non-response bias. With response rates to major longitudinal surveys having decreased over time, much effort has been devoted to understanding non-response and its impact on data quality as well as methods of adjusting results to mitigate non-response error. Traditional survey assumptions held that maximizing response rates minimized non-response bias (Groves, 2006). Therefore, the results of Groves’ 2006 meta-analysis were both surprising and seminal, finding no meaningful correlation between response rates and non-response error across mail, telephone, and face-to-face surveys.
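One common family of such adjustments is weighting, for example post-stratification, in which respondents are weighted so that the sample's composition matches known population proportions. The sketch below is a simplified, hypothetical illustration (invented user types, shares, and satisfaction values) rather than a full treatment of weighting, raking, or propensity adjustment.

import pandas as pd

# Known (assumed) population shares and a sample in which early adopters are over-represented.
population_share = {"early_adopter": 0.30, "occasional": 0.70}
df = pd.DataFrame({
    "user_type": ["early_adopter"] * 60 + ["occasional"] * 40,
    "satisfied": [1] * 54 + [0] * 6 + [1] * 24 + [0] * 16,
})

sample_share = df["user_type"].value_counts(normalize=True)
df["weight"] = df["user_type"].map(lambda t: population_share[t] / sample_share[t])

unweighted = df["satisfied"].mean()
weighted = (df["satisfied"] * df["weight"]).sum() / df["weight"].sum()
print(f"unweighted satisfaction {unweighted:.2f} vs. weighted {weighted:.2f}")  # 0.78 vs. 0.69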

Reporting Survey Findings Once the question-by-question analysis is completed, the researcher needs to synthesize findings across all questions to address the goals of the survey. Larger themes may be identified, and the initially defined research questions are answered, which are in turn translated into recommendations and broader HCI implications as appropriate. All calculations used for the data analysis should be reported with the necessary statistical rigor (e.g., sample sizes, p-values, margins of error, and confidence levels). Furthermore, it is important to list the survey’s paradata and include response and break-off rates (see section on monitoring survey paradata). Similar to other empirical research, it is important not only to report the results of the survey but also to describe the original research goals and the survey methodology used. A detailed description of the survey methodology will explain the population being studied, sampling method, survey mode, survey invitation, fielding process, and response paradata. It should also include screenshots of the actual survey questions and explain techniques used to evaluate data quality. Furthermore, it is often necessary to include a discussion on how the respondents compare to the overall population. Lastly, any potential sources of survey bias, such as sampling biases or non-response bias, should be outlined.

Exercises 1. What are the differences between a survey and a questionnaire, both in concept and design? 2. In your own research area, create a survey and test it with five classmates. How long do you think it will take a classmate to fill it out? How long did it take them? Acknowledgements We would like to thank our employers Google, Inc. and Twitter, Inc. for making it possible for us to work on this chapter. There are many that contributed to this effort, and we would like to call out the most significant ones: Carolyn Wei for identifying published papers that used survey methodology for their work, Sandra Lozano for her insights on analysis, Mario Callegaro for inspiration, Ed Chi and Robin Jeffries for reviewing several drafts of this document, and Professors Jon Krosnick from Stanford University and Mick Couper from the University of Michigan for laying the foundation of our survey knowledge and connecting us to the broader survey research community.

References Overview Books Couper, M. (2008). Designing effective Web surveys. Cambridge, UK: Cambridge University Press. Fowler, F. J., Jr. (1995). Improving survey questions: Design and evaluation (Vol. 38). Thousand Oaks, CA: Sage. Incorporated. Groves, R. M. (1989). Survey errors and survey costs. Hoboken, NJ: Wiley. Groves, R. M. (2004). Survey errors and survey costs (Vol. 536). Hoboken, NJ: Wiley-Interscience. Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2004). Survey methodology. Hoboken, NJ: Wiley. Marsden, P. V., & Wright, J. (Eds.). (2010). Handbook of survey research (2nd ed.). Bingley, UK: Emerald Publishing Group Limited.

Sampling Methods Aquilino, W. S. (1994). Interview mode effects in surveys of drug and alcohol use: A field experiment. Public Opinion Quarterly., 58(2), 210–240. Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York, NY: Wiley. Couper, M. P. (2000). Web surveys: A review of issues and approaches. Public Opinion Quarterly, 64, 464–494. Kish, L. (1965). Survey sampling. New York, NY: Wiley. Krejcie, R. V., & Morgan, D. W. (1970). Determining sample size for research activities. Educational and Psychological Measurement, 30, 607–610. Lohr, S. L. (1999). Sampling: Design and analysis. Pacific Grove, CA: Duxbury Press.

Questionnaire Design Bradburn, N. M., Sudman, S., & Wansink, B. (2004). Asking questions: The definitive guide to questionnaire design – for market research, political polls, and social and health questionnaires. San Francisco, CA: Jossey-Bass. Revised. Cannell, C. F., & Kahn, R. L. (1968). Interviewing. The Handbook of Social Psychology, 2, 526–595. Chan, J. C. (1991). Response-order effects in Likert-type scales. Educational and Psychological Measurement, 51(3), 531–540. Costa, P. T., & McCrae, R. R. (1988). From catalog to classification: Murray’s needs and the fivefactor model. Journal of Personality and Social Psychology, 55(2), 258. Couper, M. P., Tourangeau, R., Conrad, F. G., & Crawford, S. D. (2004). What they see is what we get response options for web surveys. Social Science Computer Review, 22(1), 111–127. Edwards, A. L., & Kenney, K. C. (1946). A comparison of the Thurstone and Likert techniques of attitudes scale construction. Journal of Applied Psychology, 30, 72–83. Goffman, E. (1959). The presentation of self in everyday life, 1–17. Garden City, NY Goldberg, L. R. (1990). An alternative description of personality: The big-five factor structure. Journal of Personality and Social Psychology, 59(6), 1216.

Herzog, A. R., & Bachman, J. G. (1981). Effects of questionnaire length on response quality. Public Opinion Quarterly, 45(4), 549–559. Holbrook, A. L., & Krosnick, J. A. (2010). Social desirability bias in voter turnout reports tests using the item count technique. Public Opinion Quarterly, 74(1), 37–67. Kinder, D. R., & Iyengar, S. (1987). News That Matters: Television and American Opinion. Chicago: University of Chicago Press. Krosnick, J. A. (1991). Response strategies for coping with the cognitive demands of attitude measures in surveys. Applied Cognitive Psychology, 5, 213–236. Krosnick, J. A. (1999). Survey research. Annual review of psychology, 50(1), 537–567. Krosnick, J. A. (2002). The causes of no-opinion responses to attitude measures in surveys: They are rarely what they appear to be. In R. Groves, D. Dillman, J. Eltinge, & R. Little (Eds.), Survey non-response (pp. 87–100). New York: Wiley. Krosnick, J. A., & Alwin, D. F. (1987). Satisficing: A strategy for dealing with the demands of survey questions. Columbus, OH: Ohio State University. Krosnick, J. A., & Alwin, D. F. (1988). A test of the form-resistant correlation hypothesis ratings, rankings, and the measurement of values. Public Opinion Quarterly, 52(4), 526–538. Krosnick, J. A., & Fabrigar, L. A. (1997). Designing rating scales for effective measurement in surveys. In L. Lyberg et al. (Eds.), Survey measurement and process quality (pp. 141–164). New York: Wiley. Krosnick, J. A., Narayan, S., & Smith, W. R. (1996). Satisficing in surveys: Initial evidence. New Directions for Evaluation, 1996(70), 29–44. Krosnick, J. A., & Presser, S. (2010). Question and questionnaire design. In P. V. Marsden & J. D. Wright (Eds.), Handbook of survey research (pp. 263–314). Bingley, UK: Emerald Group Publishing Limited. Landon, E. L. (1971). Order bias, the ideal rating, and the semantic differential. Journal of Marketing Research, 8(3), 375–378. O’Muircheartaigh, C. A., Krosnick, J. A., & Helic, A. (2001). Middle alternatives, acquiescence, and the quality of questionnaire data. In B. Irving (Ed.), Harris Graduate School of Public Policy Studies. Chicago, IL: University of Chicago. Paulhus, D. L. (1984). Two-component models of socially desirable responding. Journal of Personality and Social Psychology, 46(3), 598. Payne, S. L. (1951). The art of asking questions. Princeton, NJ: Princeton University Press. Payne, J. D. (1971). The effects of reversing the order of verbal rating scales in a postal survey. Journal of the Marketing Research Society, 14, 30–44. Rohrmann, B. (2003). Verbal qualifiers for rating scales: Sociolinguistic considerations and psychometric data. Project Report. Australia: University of Melbourne Saris, W. E., Revilla, M., Krosnick, J. A., & Shaeffer, E. M. (2010). Comparing questions with agree/disagree response options to questions with construct-specific response options. Survey Research Methods, 4(1), 61–79. Schaeffer, N. C., & Presser, S. (2003). The science of asking questions. Annual Review of Sociology, 29, 65–88. Schlenker, B. R., & Weigold, M. F. (1989). Goals and the self-identification process: Constructing desired identities. In L. Pervin (Ed.), Goal concepts in personality and social psychology (pp. 243–290). Hillsdale, NJ: Erlbaum. Schuman, H., & Presser, S. (1981). Questions and answers in attitude surveys. New York: Academic Press. Simon, H. A. (1956). Rational choice and the structure of the environment. Psychological Review, 63(2), 129–138. Smith, D. H. (1967). 
Correcting for social desirability response sets in opinion-attitude survey research. Public Opinion Quarterly, 31, 87–94. Stone, G. C., Gage, N. L., & Leavitt, G. S. (1957). Two kinds of accuracy in predicting another’s responses. The Journal of Social Psychology, 45(2), 245–254.

Tourangeau, R. (1984). Cognitive science and survey methods. Cognitive aspects of survey methodology: Building a bridge between disciplines (pp. 73–100). Washington, DC: National Academy Press. Tourangeau, R., Couper, M. P., & Conrad, F. (2004). Spacing, position, and order: Interpretive heuristics for visual features of survey questions. Public Opinion Quarterly, 68(3), 368–393. Tourangeau, R., Rips, L. J., & Rasinski, K. (2000). The psychology of survey response. Cambridge, UK: Cambridge University Press. Tourangeau, R., & Smith, T. W. (1996). Asking sensitive questions the impact of data collection mode, question format, and question context. Public Opinion Quarterly, 60(2), 275–304. Tourangeau, R., & Yan, T. (2007). Sensitive questions in surveys. Psychological Bulletin, 133(5), 859. Villar, A., & Krosnick, J. A. (2011). Global warming vs. climate change, taxes vs. prices: Does word choice matter? Climatic change, 105(1), 1–12.

Visual Survey Design Callegaro, M., Villar, A., & Yang, Y. (2011). A meta-analysis of experiments manipulating progress indicators in Web surveys. Annual Meeting of the American Association for Public Opinion Research, Phoenix Couper, M. (2011). Web survey methodology: Interface design, sampling and statistical inference. Presentation at EUSTAT-The Basque Statistics Institute, Vitoria-Gasteiz Couper, M. P., Conrad, F. G., & Tourangeau, R. (2007). Visual context effects in Web surveys. Public Opinion Quarterly, 71(4), 623–634. Peytchev, A., Couper, M. P., McCabe, S. E., & Crawford, S. D. (2006). Web survey design paging versus scrolling. Public Opinion Quarterly, 70(4), 596–607. Yan, T., Conrad, F. G., Tourangeau, R., & Couper, M. P. (2011). Should I stay or should I go: The effects of progress feedback, promised task duration, and length of questionnaire on completing Web surveys. International Journal of Public Opinion Research, 23(2), 131–147.

Established Questionnaire Instruments Brooke, J. (1996). SUS-A quick and dirty usability scale. Usability Evaluation in Industry, 189, 194. Chin, J. P., Diehl, V. A., & Norman, K. L. (1988, May). Development of an instrument measuring user satisfaction of the human-computer interface. In Proceedings of the SIGCHI Conference on Human factors in computing systems (pp. 213–218). New York, NY: ACM Hart, S. G., & Staveland, L. E. (1988). Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Human Mental Workload, 1, 139–183. Kirakowski, J., & Corbett, M. (1993). SUMI: The software usability measurement inventory. British Journal of Educational Technology, 24(3), 210–212. Lewis, J. R. (1995). IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use. International Journal of Human‐Computer Interaction, 7(1), 57–78. Moshagen, M., & Thielsch, M. T. (2010). Facets of visual aesthetics. International Journal of Human-Computer Studies, 68(10), 689–709.

Questionnaire Evaluation Bolton, R. N., & Bronkhorst, T. M. (1995). Questionnaire pretesting: Computer assisted coding of concurrent protocols. In N. Schwarz & S. Sudman (Eds.), Answering questions (pp. 37–64). San Francisco: Jossey-Bass. Collins, D. (2003). Pretesting survey instruments: An overview of cognitive methods. Quality of Life Research an International Journal of Quality of Life Aspects of Treatment Care and Rehabilitation, 12(3), 229–238. Drennan, J. (2003). Cognitive interviewing: Verbal data in the design and pretesting of questionnaires. Journal of Advanced Nursing, 42(1), 57–63. Presser, S., Couper, M. P., Lessler, J. T., Martin, E., Martin, J., Rothgeb, J. M., et al. (2004). Methods for testing and evaluating survey questions. Public Opinion Quarterly, 68(1), 109–130.

Survey Response Rates and Non-response American Association for Public Opinion Research, AAPOR. (2011). Standard definitions: Final dispositions of case codes and outcome rates for surveys. (7th ed). http://aapor.org/Content/ NavigationMenu/AboutAAPOR/StandardsampEthics/StandardDefinitions/Standard Definitions2011.pdf Baruch, Y. (1999). Response rates in academic studies: A comparative analysis. Human Relations, 52, 421–434. Baruch, Y., & Holtom, B. C. (2008). Survey response rate levels and trends in organizational research. Human Relations, 61(8), 1139–1160. Church, A. H. (1993). Estimating the effect of incentives on mail survey response rates: A metaanalysis. Public Opinion Quarterly, 57, 62–79. Cook, C., Heath, F., & Thompson, R. L. (2000). A meta-analysis of response rates in Web- or Internet-based surveys. Educational and Psychological Measurement, 60(6), 821–836. Dillman, D. A. (1978). Mail and telephone surveys: The total design method. New York: Wiley. Dillman, D. A. (1991). The design and administration of mail surveys. Annual Review of Sociology, 17, 225–249. Dillman, D. A. (2007). Mail and Internet surveys: The tailored design method (2nd ed.). Hoboken, NJ: Wiley. Fan, W., & Yan, Z. (2010). Factors affecting response rates of the web survey: A systematic review. Computers in Human Behavior, 26(2), 132–139. Groves, R. M. (2006). Non-response rates and non-response bias in household surveys. Public Opinion Quarterly, 70, 646–75. Groves, R. M., Presser, S., & Dipko, S. (2004). The role of topic interest in survey participation decisions. Public Opinion Quarterly, 68(1), 2–31. Kaplowitz, M. D., Hadlock, T. D., & Levine, R. (2004). A comparison of web and mail survey response rates. Public Opinion Quarterly, 68(1), 94–101. Kerlinger, F. N. (1986). Foundations of behavioral research (3rd ed.). New York: Holt, Rinehart & Winston. Kiesler, S., & Sproull, L. S. (1986). Response effects in the electronic survey. Public Opinion Quarterly, 50, 402–413. Lavrakas, P. J. (2011). The use of incentives in survey research. 66th Annual Conference of the American Association for Public Opinion Research Lin, I., & Schaeffer, N. C. (1995). Using survey participants to estimate the impact of nonparticipation. Public Opinion Quarterly, 59(2), 236–258.

Lu, H., & Gelman, A. (2003). A method for estimating design-based sampling variances for surveys with weighting, poststratification, and raking. Journal of Official Statistics, 19(2), 133–152. Manfreda, K. L., Bosnjak, M., Berzelak, J., Haas, I., Vehovar, V., & Berzelak, N. (2008). Web surveys versus other survey modes: A meta-analysis comparing response rates. Journal of the Market Research Society, 50(1), 79. Olson, K. (2006). Survey participation, non-response bias, measurement error bias, and total bias. Public Opinion Quarterly, 70(5), 737–758. Peytchev, A. (2009). Survey breakoff. Public Opinion Quarterly, 73(1), 74–97. Schonlau, M., Van Soest, A., Kapteyn, A., & Couper, M. (2009). Selection bias in web surveys and the use of propensity scores. Sociological Methods & Research, 37(3), 291–318. Sheehan, K. B. (2001). E-mail survey response rates: A review. Journal of Computer Mediated Communication, 6(2), 1–16. Singer, E. (2002). The use of incentives to reduce non-response in household surveys. In R. Groves, D. Dillman, J. Eltinge, & R. Little (Eds.), Survey non-response (pp. 87–100). New York: Wiley. 163–177. Stevenson, J., Dykema, J., Cyffka, C., Klein, L., & Goldrick-Rab, S. (2012). What are the odds? Lotteries versus cash incentives. Response rates, cost and data quality for a Web survey of low-income former and current college students. 67th Annual Conference of the American Association for Public Opinion Research

Survey Analysis Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The place of inter-rater reliability in qualitative research: An empirical study. Sociology, 31(3), 597–606. Böhm, A. (2004). Theoretical coding: Text analysis in grounded theory. In A companion to qualitative research, London: SAGE. pp. 270–275. De Leeuw, E. D., Hox, J. J., & Huisman, M. (2003). Prevention and treatment of item nonresponse. Journal of Official Statistics, 19(2), 153–176. Glaser, B. G., & Strauss, A. L. (1967). The discovery of grounded theory: Strategies for qualitative research. Hawthorne, NY: Aldine de Gruyter. Gwet, K. L. (2001). Handbook of inter-rater reliability. Gaithersburg, MD: Advanced Analytics, LLC. Heeringa, S. G., West, B. T., & Berglund, P. A. (2010). Applied survey data analysis. Boca Raton, FL: Chapman & Hall/CRC. Lee, E. S., Forthofer, R. N., & Lorimor, R. J. (1989). Analyzing complex survey data. Newbury Park, CA: Sage. Saldaña, J. (2009). The coding manual for qualitative researchers. Thousand Oaks, CA: Sage Publications Limited.

Other References Abran, A., Khelifi, A., Suryn, W., & Seffah, A. (2003). Usability meanings and interpretations in ISO standards. Software Quality Journal, 11(4), 325–338. Anandarajan, M., Zaman, M., Dai, Q., & Arinze, B. (2010). Generation Y adoption of instant messaging: An examination of the impact of social usefulness and media richness on use richness. IEEE Transactions on Professional Communication, 53(2), 132–143.

Archambault, A., & Grudin, J. (2012). A longitudinal study of facebook, linkedin, & twitter use. In Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems (CHI '12) (pp. 2741–2750). New York: ACM Auter, P. J. (2007). Portable social groups: Willingness to communicate, interpersonal communication gratifications, and cell phone use among young adults. International Journal of Mobile Communications, 5(2), 139–156. Calfee, J. E., & Ringold, D. J. (1994). The 70 % majority: Enduring consumer beliefs about advertising. Journal of Public Policy & Marketing, 13(2). Chen, J., Geyer, W., Dugan, C., Muller, M., & Guy, I. (2009). Make new friends, but keep the old: Recommending people on social networking sites. In Proceedings of the 27th International Conference on Human Factors in Computing Systems (CHI '09), (pp. 201–210). New York: ACM Clauser, B. E. (2007). The life and labors of Francis Galton: A review of four recent books about the father of behavioral statistics. Journal of Educational and Behavioral Statistics, 32(4), 440–444. Converse, J. (1987). Survey research in the United States: Roots and emergence 1890–1960. Berkeley, CA: University of California Press. Drouin, M., & Landgraff, C. (2012). Texting, sexting, and attachment in college students’ romantic relationships. Computers in Human Behavior, 28, 444–449. Feng, J., Lazar, J., Kumin, L., & Ozok, A. (2010). Computer usage by children with down syndrome: Challenges and future research. ACM Transactions on Accessible Computing, 2(3), 35–41. Froelich, J., Findlater, L., Ostergren, M., Ramanathan, S., Peterson, J., Wragg, I., et al. (2012). The design and evaluation of prototype eco-feedback displays for fixture-level water usage data. In Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems (CHI '12) (pp. 2367–2376). New York: ACM Harrison, M. A. (2011). College students’ prevalence and perceptions of text messaging while driving. Accident Analysis and Prevention, 43, 1516–1520. Junco, R., & Cotten, S. R. (2011). Perceived academic effects of instant messaging use. Computers & Education, 56, 370–378. Katosh, J. P., & Traugott, M. W. (1981). The consequences of validated and self-reported voting measures. Public Opinion Quarterly, 45(4), 519–535. Nacke, L. E., Grimshaw, M. N., & Lindley, C. A. (2010). More than a feeling: Measurement of sonic user experience and psychophysiology in a first-person shooter game. Interacting with Computers, 22(5), 336–343. Obermiller, C., & Spangenberg, E. R. (1998). Development of a scale to measure consumer skepticism toward advertising. Journal of Consumer Psychology, 7(2), 159–186. Obermiller, C., & Spangenberg, E. R. (2000). On the origin and distinctiveness of skepticism toward advertising. Marketing Letters, 11, 311–322. Person, A. K., Blain, M. L. M., Jiang, H., Rasmussen, P. W., & Stout, J. E. (2011). Text messaging for enhancement of testing and treatment for tuberculosis, human immunodeficiency virus, and syphilis: A survey of attitudes toward cellular phones and healthcare. Telemedicine Journal and e-Health, 17(3), 189–195. Pitkow, J. E., & Recker, M. (1994). Results from the first World-Wide web user survey. Computer Networks and ISDN Systems, 27(2), 243–254. Rodden, R., Hutchinson, H., & Fu, X. (2010). Measuring the user experience on a large scale: Usercentered metrics for web applications. In Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI '10) (pp. 
2395–2398) ACM, New York, NY, USA Schild, J., LaViola, J., & Masuch, M. (2012). Understanding user experience in stereoscopic 3D games. In Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems (CHI '12) (pp. 89–98). New York: ACM Shklovski, I., Kraut, R., & Cummings, J. (2008). Keeping in touch by technology: Maintaining friendships after a residential move. In Proceedings of the 26th Annual SIGCHI Conference on Human Factors in Computing Systems (CHI '08) (pp. 807–816). New York: ACM

Turner, M., Love, S., & Howell, M. (2008). Understanding emotions experienced when using a mobile phone in public: The social usability of mobile (cellular) telephones. Telematics and Informatics, 25, 201–215. Weisskirch, R. S., & Delevi, R. (2011). “Sexting” and adult romantic attachment. Computers in Human Behavior, 27, 1697–1701. Wright, P. J., & Randall, A. K. (2012). Internet pornography exposure and risky sexual behavior among adult males in the United States. Computers in Human Behavior, 28, 1410–1416. Yew, J., Shamma, D. A., & Churchill, E. F. (2011). Knowing funny: Genre perception and categorization in social video sharing. In Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems (CHI '11) (pp. 297–306). New York: ACM Zaman, M., Rajan, M. A., & Dai, Q. (2010). Experiencing flow with instant messaging and its facilitating role on creative behaviors. Computers in Human Behavior, 26, 1009–1018.

Brainstorm, Chainstorm, Cheatstorm, Tweetstorm: New Ideation Strategies for Distributed HCI Design

Haakon Faste, Nir Rachmel, Russell Essary, and Evan Sheehan
HCI Institute, Carnegie Mellon
[email protected]

ABSTRACT

In this paper we describe the results of a design-driven study of collaborative ideation. Based on preliminary findings that identified a novel digital ideation paradigm we refer to as chainstorming, or online communication brainstorming, two exploratory studies were performed. First, we developed and tested a distributed method of ideation we call cheatstorming, in which previously generated brainstorm ideas are delivered to targeted local contexts in response to a prompt. We then performed a more rigorous case study to examine the cheatstorming method and consider its possible implementation in the context of a distributed online ideation tool. Based on observations from these studies, we conclude with the somewhat provocative suggestion that ideation need not require the generation of new ideas. Rather, we present a model of ideation suggesting that its value has less to do with the generation of novel ideas than the cultural influence exerted by unconventional ideas on the ideating team. Thus brainstorming is more than the pooling of “invented” ideas; it involves the sharing and interpretation of concepts in unintended and (ideally) unanticipated ways.

Author Keywords

Ideation; brainstorming; chainstorming; cheatstorming; tweetstormer

ACM Classification Keywords

H.5.2. User Interfaces: Theory and Methods; H.5.3. Group and Organization Interfaces: Collaborative computing

General Terms

Design; Experimentation.

INTRODUCTION

The ability to generate new ideas as part of a creative design process is essential to research and practice in human-computer interaction. The question of how best to generate ideas is not entirely clear, however. Not only are countless design and research methodologies commonly employed by HCI teams, but their ideation effectiveness also depends on numerous interdependent and variable factors, including the scope and objectives of the project in
question, the expertise and variety of the people involved, the strength and familiarity of their social relationships— not to mention their degree of familiarity with previous ideation and research activities—and cultural and personal factors including a person’s workplace norms and values, personal motivations and desires, confidence, degree of social collaboration, esteem, and so on. In this paper we describe the results of a design-driven study conducted with the aim of improving collaborative ideation on HCI projects using distributed software tools. Specifically we focused our research on how digital tools might be used to enhance the practice of group ideation among members of asynchronously distributed collaborative teams. A range of different ideation techniques are used in design and HCI. In this paper, we begin with a discussion of the relative benefits and drawbacks of one such ideation method, specifically brainstorming, as described by Osborn [33] and evaluated by Isaksen [21], among others. We then describe a design research process that explored the creation of distributed brainstorming alternatives. Two exploratory studies were performed. First, we developed and tested an ideation method we refer to as cheatstorming. Using this technique, previously generated brainstorm ideas are delivered to targeted local contexts without the need for imaginative ideation. We then performed a second study of the cheatstorming method to better understand its implications and improve its efficiency. Based on observations from these studies, we conclude with the observation that ideation need not be limited to the generation of new ideas. From this perspective, the value of group ideation activities such as brainstorming has less to do with the creation of novel ideas than its cultural influence on the ideating team. Ideation, in short, is the radical redistribution of ideas to “unconventionalize” a given context. Brainstorming Effectiveness as an Ideation Technique

The term brainstorming is best identified today with Osborn’s book on creativity titled Applied Imagination, first published in 1953 [33]. Osborn, who worked as an advertising executive in the 1940s and 50s, wrote a detailed examination of the creative problem solving process, and introduced brainstorming as one part of this process. Rich with current examples from that time, the book attempted to systematically define a method for deliberate creative group ideation from a very practical standpoint. Osborn divided the process into three main phases [33, p. 86]:

(1) Fact-finding: Problem-definition and preparation; gathering and analyzing the relevant data.
(2) Idea-finding: Idea-production and idea-development; thinking of tentative ideas and possible leads and then selecting and combining them.
(3) Solution finding: Evaluation and adoption; verifying the offered solutions, and deciding on and implementing a final selected set.
In great detail, Osborn explains suggested practices for performing each of these stages, focusing in particular on the Idea-finding phase. He claimed that Idea-finding is “the part of problem-solving that is most likely to be neglected” by groups [33, p. 111], and offered four guidelines that should be carefully followed in order to conduct a brainstorming session effectively and yield the best results:
1. Criticism is ruled out: Adverse judgment of ideas must be withheld until later.
2. “Free-wheeling” is welcomed: the wilder the idea the better; it is easier to tame down than to think up.
3. Quantity is wanted: The greater the number of ideas, the more the likelihood of useful ideas.
4. Combination and improvement are sought: In addition to contributing ideas of their own, participants should suggest how ideas of others can be turned into better ideas; or how two or more ideas can be joined into still another idea. [33]
Since then, many have built on these rules as brainstorming has become an increasingly popular method for idea generation in business and academic contexts. For example, Osborn’s rules have been adapted to be more playful and memorable for educational purposes (e.g. “Gleefully suspend judgment,” “Leapfrog off the ideas of others” [14]), and additional rules such as “be visual,” “stay focused on the topic,” and “one conversation at a time” have been added to better guide brainstorming sessions in the context of corporate design consulting [23]. Despite its widespread adoption in collaborative innovation environments in industry, the effectiveness of brainstorming has been a hot topic of debate in the academic community since its first introduction. The first criticism was sparked by a 1958 paper published by a group from Yale University (Taylor, Berry and Block) that compared the performance of randomly assigned brainstorming groups with that of randomly assigned individuals whose work was later pooled [42]. Numerous subsequent studies (e.g. [2, 35, 8, 26]) have built on this work to critique the effectiveness of brainstorming in groups relative to individuals working independently, arguing—among other things—that fewer good ideas are generated for each hour of individual effort expended. It is important to note, however, that the Taylor, Berry and Block study [42] did not actually test the effectiveness of the rules of brainstorming, since the same rules were applied to

both experimental conditions (individual and group). To be fair, Osborn recognized the necessity and advantages of working in groups for many reasons beyond the sheer quantity of ideas produced, especially when solving problems [33, p. 139]. In fact, the guidelines he suggested were specifically targeted at addressing the common inhibitory factors of group ideation. Rules such as “defer judgment” and “go wild” aimed not at individual productivity but improved social dynamics and sharing of ideas between members of a team. He also made the point to address a common misconception, stating that “group brainstorming is recommended solely as a supplement to individual ideation.” [33, pp. 141-142]. Still, studies critiquing the effectiveness of brainstorming on the grounds that it is inefficient were widespread through the late 1990s. More recently, the debate on productivity and collaboration has transferred to the domain of computer-mediated ideation (discussed below). Limitations of Brainstorming

Three major explanations have been offered to account for lower purported productivity in brainstorming groups relative to ideating alone: production blocking, evaluation apprehension and free riding [8]. We discuss each briefly in turn. Production blocking

Since only one person speaks at a time in a group setting, others are inhibited from expressing their ideas while another team-member is speaking, potentially slowing their ability to generate new ideas of their own. It is not the lack of speaking time in total that causes the alleged inhibition, as many times the flow of ideas ends before the end of a brainstorm session. Rather, it has been claimed that some participants’ ideas are suppressed or forgotten later in the process, as they may seem less relevant or less original than others being expressed. Furthermore, being in a situation where participants must passively listen to others’ ideas may distract and interrupt their thought processes and ability to record their own ideas. Examples of studies looking into this hypothesis can be found in [3, 24, 8, 16]. Evaluation apprehension

Creativity by definition is an unconventional act, and being creative therefore involves taking personal risks [13]. Even though one of the most important rules for successful brainstorming is to “defer judgment,” the fear of being criticized for having original ideas is often pervasive. Numerous authors have studied this phenomenon of “evaluation apprehension.” Maginn and Harris [29], for example, performed an experiment in which a brainstorming group was told that there were expert evaluators watching them through a one-way mirror. No major difference in brainstorming performance was observed between this condition and a control condition in which participants were not informed that they were being observed. In another study [5], groups of brainstorm participants were informed that some members of the group were “undercover” experts on the topic at hand. In this case, productivity loss was observed in groups that had been informed of their presence relative to a control group that had not been told.

Free riding

It may be the case that a brainstorming participant’s motivation to work decreases if they do not perceive that they will be recognized for their participation. Since brainstorming is a group activity in which all the generated ideas are ultimately grouped together, it is often the case that the generated results are not attributed to their specific contributor. Indeed, lower identifiability of ideas may reduce participants’ motivation to contribute, compared to an individual task where they know that their contribution will be recognized. Furthermore, many studies have shown that there is a lower perceived effectiveness of the individual in a group setting [8]. Structuring Ideation: Three Approaches Defined

In most of the aforementioned studies, proponents of brainstorming as an ideation technique tend to be its practitioners in the business and design communities (such as Osborn himself), while its detractors tend to be researchers interested in studying creative techniques but divorced from the nuances of its deeply embedded and culturally contextual practice [21]. Yet because the act of brainstorming incorporates numerous independent and complicated social variables—not least the makeup and experience of the team, the project objectives, the rules employed, and highly contextual success criteria—its effectiveness is difficult to study and empirically discern. Indeed, given that different ideation workplaces are likely to have differing communication patterns and communication needs depending on their cultural makeup and personnel, we find measuring the output of group ideation as a replacement for individual work to be an unsatisfactory approach. More compelling is the question of how intrinsic social and collaborative factors influence group ideation results by introducing “strangeness.” Perhaps this reflects our team’s ideological bent as design practitioners, but in today’s world, problem solving often requires experts from different fields, and new ideas are frequently sparked from novel combinations of existing concepts or the introduction of an existing concept to an unfamiliar context of use [27, 41]. Many authors have addressed the role of social factors in ideation. In this work, we ask how social factors and their resulting effects can be leveraged to develop more effective methods of group ideation online. Research has shown that social factors provide fresh sources of unexpected ideas that can help to reframe the design challenge, with design tools such as extreme characters and interaction labeling proposed as ways of dialing in the necessary “strangeness” for ideation to occur [9, 17]. Other classic ideation techniques include the use of ‘random input’ [6] and ‘oblique strategies’ [11] to generate fresh associations; by drawing on unexpected prompts and unrelated ideas to un-stick conventional thinking, such ‘trigger concepts’ bring fresh associations to the context of ideation, stimulating other associations “Like pebbles dropping in a pond.” [43] Drawing on these sources, we ask how brainstorming could be improved as a collaborative ideation technique through alternative methods of random input. In general, we classify three common social configurations of idea generation behavior: (1) face-to-face brainstorming

in groups; (2) individual (or "nominal") idea generation sessions; and (3) computer-mediated ideation. We discuss the unique traits of each of these approaches in turn.

Face-to-face Brainstorming Groups

The classic brainstorming session is done in face-to-face groups during a fixed period of time, usually between 15 and 45 minutes [33, p. 178], and is facilitated by a trained brainstorming expert who enforces the rules of brainstorming on the group. Participation is simultaneous and spontaneous: all participants can see each other's ideas and are encouraged to build upon them. The ideas are recorded as they are suggested. At the end of a brainstorming session, Kelley et al. [23] suggest that participants vote on their favorite ideas as a way of generating closure and group consensus about which ideas are most compelling for future work. As for optimal group size, in his original writings on brainstorming Osborn suggested group sizes of up to 12 as effective [33, p. 159], but there is no agreement in more recent literature as to optimal group size (e.g. [16, 36, 4]), partly because it is difficult to define "optimal" in the context of real-world practice.

Nominal Idea Generation Sessions

Nominal idea generation is done individually. The defining element of this method is that participants are not influenced by the variety of social factors at play in a traditional brainstorming group: they cannot build on other participants' ideas because they are not exposed to them, they are less influenced by perceived criticism of their ideas in real time (although they may be reluctant to share them afterwards), they may be highly motivated to perform their work in the anticipation that their efforts will eventually be rewarded, and so on. Extensive research has compared classic and nominal brainstorming, as described above. In general, nominal brainstorming appears to offer some benefits in terms of both the quality and quantity of ideas [20, 30, 10, 28], due to the psychological effects described by Diehl & Stroebe [8].

Computer Mediated Ideation

Advances in digital technology have enabled a variety of computer-mediated ideation techniques. Within this category, the term "electronic brainstorming" refers to any kind of brainstorming mediated by computers (e.g. [40, 7, 1]). One issue in attempting to define electronic brainstorming is that any online activity in which people enter information into cloud-based systems can be considered the contribution of "ideas" to a digital pool. For our purposes, we therefore consider an electronic brainstorm to be only that subset of software-mediated interactions in which users are asked specifically to generate creative responses to a question or prompt. This differs slightly (with regard to intent) from forums in which people are asked to contribute "best practices" or "suggestions" based on prior knowledge simply as an act of knowledge transfer (e.g., suggestion portals wherein users can recommend local restaurants or hotels). It also differs from critique feeds and forums, such as post-blog comment streams debating the

relative merits of an advanced position and/or themed around a topic of debate—although such kinds of activities are certainly related to electronic brainstorming and can be useful tools for the evaluation of brainstorming results as well as later phases in the ideation process. The various possible ideation approaches described above (group brainstorming, nominal idea generation, and computer-mediated ideation) are not mutually exclusive, and can be combined and mixed to make the most of each method. A brainstorming session could be performed in two parts, for example, the first in the nominal style followed by a face-to-face method to evaluate and combine ideas across participants. Electronic brainstorming can also support both nominal and group methods, or implement a diverse array of combinations between them. Indeed, it is precisely because of the flexibility of electronic methods to distribute various aspects of the brainstorming task across asynchronous distributed teams that we performed the studies described in the following section. Group ideation is an integral part of HCI research practice, and an area where the implementation of improved software interactions could greatly enhance how ideation happens in research laboratories, design firms and product companies alike.

METHODOLOGY AND DESIGN RESEARCH

Our investigation began with the simple premise that collaborative ideation could be enhanced through the use of distributed online tools, and that design-driven approaches could be used to explore and investigate the potential of this space. Our design team consisted of four members with diverse backgrounds including design consulting, software engineering, anthropology, and management. We held regular meetings over the course of several months to conduct freeform exploratory design research. Sessions were held once or twice weekly for 1-3 hours per session. The setting was a design studio in the HCI Institute at Carnegie Mellon University. This section describes our design research process, consisting of the following phases: (1) opportunity finding; (2) electronic brainstorming; (3) concept selection and refinement; and (4) experimentation and discussion.

Opportunity finding

We began with a vision for an online space to browse and share ideas where they could be tagged, filtered, and contextualized in the cloud. This vision was founded on two beliefs: that creators are everywhere, and that they are driven by creative ideas for which they seek open outlets. Although a clear plan for how to develop such a system was not yet evident, we first created a series of exploratory concept sketches to help envision possible outcomes and establish goals. We then analyzed aspects of our concept drawings and generated a set of Post-It notes chronicling our complete list of observations and desires. Next, we arranged these notes on a 2x2 matrix to help group them into clusters and synthesize common themes. Because our aim with this stage was to work on a meaningful project that was enjoyable and inspirational to the team, the axes of

this matrix, each ranging from low to high, were "Fun Impact" and "Social Impact." Seven areas of opportunity emerged from this exercise: (1) Reveal hidden (personal) meanings through metaphorical leaps of imagination; (2) Facilitate the discovery of thinking patterns; (3) Track creative influence to motivate participation; (4) Associate and juxtapose unexpected ideas; (5) Help people find ideas that are important to them; (6) Invent and embody "creative movements"; and (7) Spark and inspire interest and freedom.

Electronic Brainstorming

Given our interest in exploring the possibilities of electronic brainstorming, we decided to experiment with distributed ideation online. Using the identified opportunity areas as jumping-off points for generative design, we restated each of the seven opportunity statements described above as a "How could we..." question (e.g. "How could we facilitate the discovery of thinking patterns?"). Each question was placed at the top of a separate new Google Docs file. We then invited some 30+ interdisciplinary undergraduate and graduate students in the HCI Institute to contribute to these seven files. All of these students had prior experience with group brainstorming, and each was instructed to contribute at least five ideas in response to one or more of the brainstorm questions. We performed this activity over the course of a four-day weekend, with the stated goal of achieving at least 50 ideas in response to each question. On the fourth day, five of the seven questions had more than 50 ideas; for the remaining two questions the research team made a concerted effort to generate the remaining necessary ideas. In total, 350 distinct opportunity concepts were generated. Next, seven of the most involved members of the laboratory team were asked to "vote" on their favorite ideas in each file by adding a brightly colored symbol next to the item number. In this way, a selected group of 35 "favorite" ideas was agreed upon from across all seven questions.

Concept Selection and Refinement

Favorite ideas were printed out on paper, cut into strips, and placed on an Impact/Achievability matrix [15]. We then gave each of these ideas a more concise name by applying colorful Post-it notes on top of them and drawing broad categories around them with a colorful marker. The main outcome of this phase was two key concepts, both located in the "easy" and "high-impact" quadrant. The first was a group of ideas we labeled "idea factories." Of these, one idea was particularly compelling—the concept of an idea "broken telephone" game. We refer to this concept in general as "chainstorming." The second was a category of ideas we identified as "creative judgment tasks," involving quickly voting on pre-existing ideas, much as we had done at the end of our electronic brainstorming sessions. We refer to this concept in general as "cheatstorming," as described in studies 1 and 2 below. Finally, while not discussed here in detail, we are currently building a working prototype system, called Tweetstormer, that combines chainstorming with cheatstorming, also described below. To clarify, the

relationship between brainstorming, chainstorming, cheatstorming, and Tweetstormer is shown in figure 1.

Figure 1. A taxonomy of interrelated ideation techniques (spanning offline and online methods):
Ideation: the generation and elaboration of ideas.
Brainstorming: the generation and elaboration of (usually) language-based ideas, following Osborn-like rules.
Chainstorming: brainstorming performed by passing ideas (and rules) along communication chains.
Cheatstorming: brainstorming without the "generation" component.
Electronic brainstorming: the generation and elaboration of (usually) text-based ideas, mediated by computers.
Tweetstormer: a digital chainstorming application that leverages cheatstorming to study it.

Experimentation: Cheatstorming (Study 1)

Our main work in this paper explores the cheatstorming concept. The basic premise of this paradigm is as follows: imagine a brainstorm has been performed, resulting in 50 ideas. Participants vote on their favorite ideas, and some of them are selected for implementation. Now another brainstorm is performed on a different topic, resulting in 50 more ideas and additional voting. In time, many hundreds of brainstorm questions are asked, and thousands of ideas are generated and saved. Some have been implemented, and others have not. At this point, a wealth of valuable brainstorming has already occurred. The cheatstorming paradigm proposes that no new ideas are necessary for further ideation to occur. Given a new prompt question and a set of 50 random previous ideas to draw from, cheatstorming simply bypasses the concept generation phase altogether and jumps directly to voting on which ideas to advance. To test this concept we performed a simple pilot experiment. First, each member of our team generated 3-5 “totally random” brainstorm questions on Post-It notes, not in response to any particular question or stated need (e.g. “What is the easiest way to make the most people happy cheaply?”). Next, a set of

Figure 2: Sample results from our first cheatstorming trial.

60+ solution concepts was generated equally at random (e.g. "Magnetic cellphones", "Non-linear presentation tool", "Magic annoying elf that re-arranges your clothing," etc.). Finally, one of the previously generated brainstorm questions was selected at random and paired with 10 of the concept Post-Its, also chosen at random. From these 10 ideas, the four concepts that most closely resonated as solutions to the given question were selected as "winners." We repeated this process four times with four different questions; one of the sample solution pairings is shown in figure 2. We were both surprised and delighted by the results of this method. Not only did we have little difficulty identifying the ideas that best resonated with the questions being asked, but the resulting set of ideas was remarkably unexpected and fresh. Most exciting, the process was fast, fun, and required little effort, and the solutions revealed unexpected combinatory patterns and juxtapositions. In the first example shown in figure 2, for instance, the question asks "How could we illuminate large cities for less money to reduce nocturnal crime?" Surprisingly, three of the selected solution concepts are screen-based ideas that all emit light. Not only was this an unanticipated means of illumination, it was also one that could provide other forms of safety from nocturnal crime—via an interactive "call for help" kiosk or informative map, for example. Furthermore, the fourth idea in this set, "airbag for walking," suggests that perhaps solutions for reducing nocturnal crime could be built directly into a user's clothing. Combined with the other cheatstormed ideas, this in turn sparks a train of thought that perhaps clothing should be illuminated, or—alternatively—that the city's streets should be padded. Finally, each of the other cheatstormed questions resulted in an equally compelling set of results. In response to the question "How could we reduce global warming effectively in the next five minutes?," for example, "biodegradable vehicles" and "micro-financing" were among the selected concepts. While neither of these ideas alone may enable global warming to be reduced in the next five minutes, combined they indicate a potential direction for immediate action (i.e., green-vehicular crowdfunding).
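To make the mechanics of a cheatstorming round concrete, the following Python sketch is our own illustration, not tooling used in the pilot: it assumes a hypothetical pool of previously generated ideas, samples a handful of candidates for a new prompt, and leaves the actual selection of winners to human judgment, here represented by a placeholder selection function.

import random

def cheatstorm_round(prompt, idea_pool, num_candidates=10, num_winners=4, select_fn=None):
    """One cheatstorming round: sample prior ideas for a prompt, then pick winners.

    No new ideas are generated. The round pairs an ideation prompt with a random
    sample of previously recorded ideas, and the team selects the ones that best
    resonate with the prompt. `select_fn` stands in for the human voting step.
    """
    candidates = random.sample(idea_pool, k=min(num_candidates, len(idea_pool)))
    # Default "vote" simply keeps the first candidates; in practice this is the
    # team's judgment, which is the heart of the method.
    select_fn = select_fn or (lambda ideas: ideas[:num_winners])
    return select_fn(candidates)[:num_winners], candidates

# Hypothetical usage with a tiny idea pool (the pilot used 60+ random concepts):
pool = ["Magnetic cellphones", "Non-linear presentation tool", "Airbag for walking",
        "Interactive city kiosk", "Biodegradable vehicles", "Micro-financing"]
winners, shown = cheatstorm_round(
    "How could we illuminate large cities for less money to reduce nocturnal crime?",
    pool, num_candidates=4, num_winners=2)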

Experimentation: Cheatstorming (Study 2)

There are many variables in the way that cheatstorming could be performed that we were curious to explore, such as how different types of "idea input" would affect cheatstorming results. We also wanted to compare cheatstorming results with results from a traditional brainstorming session. To this end, our next study leveraged the results of five previously completed brainstorming sessions from other, unrelated projects as input. We chose data from prior brainstorming sessions that had been well documented with clear questions and solutions, and which had generated more than 50 ideas apiece. These ideas had also been voted upon in the previous iteration, enabling us to track the success or failure of previously successful ideas in the new cheatstorming context. Finally, it was important to us that the brainstorming sessions had been performed by different groups of participants spanning a diverse set of HCI topics, to

ensure that we had a wide variety of ideas in our pool to draw from overall, and so that unanticipated biases based on the authorship of ideas were reduced. The prompts from the five selected sets of data were as follows: (1) "How could we summarize text-based information to make browsing it intuitive, useful, magical and fun?", from a project on digital mind mapping; (2) "How could we sculpt and craft using digital tools?", from a project on tangible computing; (3) "How could we encourage self-actualization and the experience of new experimental dynamics?", from a project on augmented reality; (4) "How could we support the successful publication of confident high quality writing?", from a project on narrative fiction; and (5) "How could we rigorously craft and curate the design of aesthetically pleasing narrative products and services?", also from the narrative fiction project.

Our study design involved four experimental conditions drawing on brainstorming results from the above-mentioned sets of data. All of the previously generated raw ideas from each set of data were printed on cards in a unique color, one color per set (Figure 3). These raw-idea cards were used as input data for each of our study conditions. In addition, those idea cards that had originally been selected within each set as the "winners" for that set were clearly marked with an asterisk; this allowed us to trace which previously successful ideas prevailed through the cheatstorming process.

Figure 3: Input data for the cheatstorming study.

The study conditions were designed to be structurally equivalent. In each case, 50 raw "input" ideas would be pared down to 10 "winning" ideas in response to the ideation prompt. We used the same ideation prompt across all conditions: question 5 ("How could we rigorously craft and curate the design of aesthetically pleasing narrative products and services?"). The experimental conditions, illustrated in figure 4, were as follows:

Condition A (brainstorming baseline). Previously selected brainstorming results from set 5 (those with asterisks) were chosen automatically as de facto winners.

Condition B (overlapping diverse input). 17 ideas were selected at random from each of sets 2, 3, and 4, making a total of 51 ideas; one idea was then removed at random, resulting in 50 ideas. Cheatstorming then commenced using question 5 as the ideation prompt. Because set 4 was drawn from the same project as set 5, cheatstorm results were anticipated to be most similar to condition A.

Condition C (unrelated diverse input). The same diverse input structure was used as in condition B, except that input ideas were drawn from sets 1, 2, and 3. These ideas were not intentionally related to set 5 in any way.

Condition D (unrelated narrow input). This session used a single unrelated set of ideas as input, drawn from set 1.

Figure 4. Experimental conditions for cheatstorming study 2.

Cheatstorming proceeded by laying out all 50 input ideas for a given condition below the ideation prompt, then working through each of them one by one as a team, attempting to find ideas that would match the brainstorming prompt (figure 5). Ideas that didn't seem related were put aside. Remaining ideas were grouped into 10 "winning" clusters, such that each cluster created a meaningful concept relevant to the prompt. Each cluster was then given a more concise and meaningful title to aptly depict the newly synthesized idea (figure 6).
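For clarity, the following Python sketch (our own illustration, not the authors' procedure as implemented) shows how the input pools for conditions B, C, and D could be assembled from hypothetical idea sets, following the counts described above: 17 ideas sampled from each of three sets, trimmed at random to 50.

import random

def assemble_diverse_input(idea_sets, per_set=17, target=50):
    """Build a mixed input pool as in conditions B and C: sample `per_set` ideas
    from each source set, then trim the pool at random down to `target` ideas."""
    pool = []
    for ideas in idea_sets:
        pool.extend(random.sample(ideas, k=per_set))
    while len(pool) > target:
        pool.remove(random.choice(pool))  # drop surplus ideas at random
    return pool

# Hypothetical idea sets (in the study each set held 50+ real brainstormed ideas).
sets = {i: [f"set{i}-idea{j}" for j in range(1, 51)] for i in range(1, 6)}

condition_b = assemble_diverse_input([sets[2], sets[3], sets[4]])  # overlapping diverse
condition_c = assemble_diverse_input([sets[1], sets[2], sets[3]])  # unrelated diverse
condition_d = random.sample(sets[1], k=50)                         # unrelated narrow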

DISCUSSION

As described in detail by Isaksen [21], evaluating the effectiveness of group ideation outcomes is fraught with methodological and practical problems. These include the necessity of identifying and isolating the different factors in the ideation tool or process likely to influence its effectiveness, being aware of the level of training (if any) the

facilitator had gone through to run the session, determining the group's experience with creative ideation in general (and their orientation to the task at hand in particular), the preparation and presentation of the task in such a way that it promotes ideation, the effectiveness of the ideation method in highly contextual real-world practice, and the criteria employed to evaluate the outcomes. Given these challenges, we believe it is difficult if not impossible to generalize the effectiveness of a specific culturally embedded creative activity without first recognizing the serious practical limitations of attempting to do so. For this reason, the approach taken in this study was design-oriented, in line with Fallman's characterization of design-oriented HCI as giving form to previously non-existent artifacts to uncover new knowledge that could not be arrived at otherwise [12]. We attempted to replicate a controlled methodology as precisely as possible across the four rounds of our cheatstorming study, each time varying only the set of ideas input into the selection process. Each time, 50 previously defined ideas were reduced to 10 "winning" favorites by the same team of researchers, and each time our 10 favorite ideas were unique. Given the creative and intentionally unpredictable nature of ideation, we believe that even with all of these variables held constant (i.e. same team, same brainstorming rules, same prompt question, etc.) we would likely have generated different ideation results had we attempted to repeat the study again. That said, some noteworthy qualitative observations can be made, and we now reflect on the qualitative differences in both the application of the process and the outcomes it produced across the conditions of study 2.

Findings: Process

Figure 5: The 50 candidate cheatstorm ideas in condition C.

Figure 6: “Winning” concepts from condition B.

Cheatstorming proved to be a fast and enjoyable means of creative ideation. Especially when cheatstorming ideas that came from different and diverse input sets, we found that the method works well as a mechanism for introducing novel concepts across creative cultures, a process akin to "technology brokering" among brainstorming teams whose ideas cross-pollinate [41]. Indeed, the greatest challenge and thrill of the cheatstorming method is being faced with the task of combining what often seem to be nonsensical results from previous brainstorming sessions—in that they contain remarkably little context by which to understand them—with ideation prompts that are likely to be equally without adequate context (especially should cheatstorming be widely deployed in a distributed setting). The natural reaction of the cheatstormer—indeed, their only real option—is to force an inventive connection between idea and prompt. In this regard we posit that, within reason, the more tightly constrained the input data given to the cheatstormer, the more effective they will be at identifying such juxtapositions. We note, for example, that our first study involved the reduction of 10 input ideas to four "winners," which could be accomplished very quickly because the cheatstormer had no alternative but to pick something that worked. Study 2, with its larger set of inputs and greater creative freedom, introduced a more overwhelming quantity of possible connections and, consequently, felt more tedious and less productive.

Based on this observation, we believe that setting time-oriented constraints might help to improve the cheatstorming experience. While the mandatory rigor of matching 10 winning concepts per question in study 2 was useful, it also left additional room for idea comparison and judgment, leading to more nuanced but ultimately (we feel) less inspired ideas. Adding a time limit or other kinds of creative constraint might encourage spontaneous connections and force weaker ideas to be eliminated more quickly on a visceral basis. Comparing the process across experimental conditions, it seemed both easier and more immediately intuitive to group together ideas that came from the same original source. Looking at the results of our final synthesis, however, we notice a distinctly integrated mixing of source material in the creation of our final generated concepts. This also highlights one of the possible biases that became evident as a result of our process. In retrospect we wished that we had not color-coded the input ideas, as it introduced a perceptible value judgment into the study. Indeed, the simple awareness that such a bias may have existed is likely to have resulted in an intentional or unintentional effort on our part to use equal numbers of ideas of each color. As a result, it is difficult to say whether the approximately even survival rate of ideas across input pools within each condition resulted from this bias. Another source of bias was ideas that repeated themselves in subsequent iterations. This influence was twofold: first, since the prompt remained the same for all four cheatstorming iterations, arriving at similar ideas with each round quickly became redundant. Furthermore, because each cheatstorm used a random mix of input material, about a third of the ideas from previous conditions re-appeared with each subsequent effort. In this regard, we believe that ideas that have been "used" by participants should be removed from the input pool in successive rounds. Anticipating digital systems that scale these methods up to large crowds of users, we recommend tracking previously viewed ideas to prevent them from appearing again. Not only did ideas seem "less interesting" on the second occasion, they also became harder to associate with new outcomes and meanings. Indeed, if creativity systems are to be tasked with delivering unconventional content to users, it is essential that the content not be familiar.
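One way to implement the de-duplication recommended above would be to track which ideas each participant (or session) has already seen and exclude them from future samples. The Python sketch below is a hypothetical illustration of that bookkeeping, not part of any system described in this paper.

import random

class IdeaServer:
    """Serves cheatstorming input while filtering out previously viewed ideas."""

    def __init__(self, idea_pool):
        self.idea_pool = list(idea_pool)
        self.seen = {}  # participant id -> set of ideas already shown

    def next_batch(self, participant, batch_size=10):
        viewed = self.seen.setdefault(participant, set())
        fresh = [idea for idea in self.idea_pool if idea not in viewed]
        batch = random.sample(fresh, k=min(batch_size, len(fresh)))
        viewed.update(batch)  # never show these ideas to this participant again
        return batch

# Hypothetical usage:
server = IdeaServer([f"idea-{i}" for i in range(100)])
first = server.next_batch("alice")
second = server.next_batch("alice")
assert not set(first) & set(second)  # no repeats across rounds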

Findings: Results

In addition to the cheatstorming team's qualitative reflections on process, we consulted with an independent judge who had worked on the narrative fiction project to which the ideation prompt had originally belonged. Together we evaluated the top 10 "winning" idea clusters from each of the four conditions, to see if cheatstorming results would be applicable for potential real-world use on her project. Relative to the baseline brainstorm condition (condition A), the most noticeable quality of the winning cheatstormed ideas was that all of them were dramatically technological in nature. This is not surprising, given that the narrative fiction

project was the least technologically oriented of the prompts (the other three questions having been drawn from projects on augmented reality, tangible computing, and digital mind mapping, respectively). Furthermore, the degree to which ideas felt unhelpful to the project was directly proportional to their degree of strain. In condition A, the baseline condition, the ideas felt the most immediately useful and applicable to the project because they did not all have such a technology focus. We should also note, however, that our judge had originally been involved in selecting the baseline winners, but not the cheatstorming results, introducing a likely source of bias. Condition B, the overlapping diverse input group, produced the most palatable set of the remaining ideas. It seemed to introduce fresh new ideas that were grounded in something familiar. Condition C, the unrelated diverse input group, was described as "the most random." Ideas in this set—with names such as "tempo of experience control" and "real-time story-world generation"—were exciting but felt out of touch with project goals. Condition D, the unrelated narrow input group, produced the most technologically immersive ideas. Ideas such as "magic story wand" and "crowdsourced tangible narrative sculpting" were described as "nice to pursue if I had a team of designers and developers, but that would change the focus of what the project is really about." In summary, all of the resulting ideas were related to the ideation prompt, but clearly reflected the spirit of the brainstorm from which they originated. This is not surprising, but it does indicate that a mix of somewhat related (but also diverse and different) ideas could have a positive impact in broadening the scope and breadth of a project's ideation.

CONCLUSIONS AND FUTURE WORK: CHAINSTORMING, TWEETSTORMING, AND CHEATSTORMING AT SCALE

This work has investigated distributed ideation from a design-driven perspective by designing and building prototypes of possible ideation mechanics and reflecting on the qualities of the outcomes. Our aim with this approach is to improve the design of HCI tools that facilitate efficient and effective group ideation. Reflecting on our findings, we realize that we have revealed a model for group ideation with four distinct stages of progressive activity. Each stage carries with it a set of differing requirements and resulting behaviors, and we expect that the criteria leading to effective ideation outcomes at each stage will be different. These stages are: (1) prompting, the stage during which the ideation facilitator presents a challenge to the group that will drive ideation; (2) sharing, the stage in which participants suggest and communicate ideas within the context of the medium that frames the activity (i.e., orally, and/or using a whiteboard, sticky-notes, database system, and so on); (3) selecting, the phase during which participants vote and/or otherwise determine their favorite ideas; and (4) committing, the stage at which a final criterion is set to evaluate and prioritize ideas, ultimately determining which ones the team moves forward with and (ideally) develops.

This framing is in contrast to previous ideation models (e.g. Jones' "divergence, transformation, convergence" model [22], Nijstad et al.'s dual pathway ideation model [32], etc.) in that, while it recognizes the cognitive distribution of ideation across social structures, it does not view creative behavior as a "generative" activity. Instead, ideas are simply transferred (or "shared") between people, and the act of sharing is the source of the ideation: it involves the expression and interpretation of possible conceptual meanings. Even in traditional brainstorming sessions, we propose, it is this communicative interplay between one person's conception of an idea and another's (mis)interpretation that results in the so-called "generation" of ideas. Cheatstorming demonstrates that ideas need not be created by the team for ideation to occur—they simply need to be interpreted as possibilities resulting from a collision of shared meanings. The only requirement for a successful ideation outcome is that the ideas introduced in the sharing stage are unconventional to the ideating individual, team, or culture [24] (i.e. "strange" [18]), and that they be interpreted as relevant (or not) to the ideation prompt. We have introduced the concepts of cheatstorming as ideation without the "idea generation" component, and chainstorming more generally as a paradigm of communicative ideation (figure 1). Rather than conceiving of creativity as a spontaneous act of personal imagination, chainstorming is intrinsically social by nature. It is inspired by the "broken telephone" (or "Chinese whispers") social group game, in which one person (Alice) secretly tells a story to another person (Bob), such that none of the other people present can hear it. In turn, Bob tells the story as he remembers it to a third person (Carol), and so on, until everyone in a continuous chain back to Alice has been reached. The last person to hear the story shares what he or she remembers with the entire group, and that version is compared with the original story. In chainstorming, much like this game, each participant is asked to build on the story of the previous participant in the chain. The first person in the chain generates the prompt question and one or two ideas that respond to the question before sending it off to a network of friends. Each subsequent person sees the prompt question, along with a subset of the ideas from the previous participant, and uses these ideas to build on and generate new ideas. Using this method, which introduces a degree of randomness at each stage and which can also be controlled by the design of the communication and its rules, we propose that collective creativity can be embedded in social networks through simple interactions that reduce cognitive effort. Indeed, similar approaches have been developed in recent related work, promising the development of evolutionary creativity algorithms wherein humans pick the "fittest" ideas to produce emergent solutions to potentially complex tasks [45]. In chainstorming, where a random subset of each participant's previous ideas could be selected and passed along with each interaction, the continued juxtaposition and "constructive strain" [18] from potentially unrelated or even contradictory ideas could consistently spark unexpected new socially generated concepts.
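As a concrete illustration of the chainstorming protocol just described, the short Python sketch below (our own hypothetical model, not the authors' implementation) simulates a single chain: each participant receives the prompt plus a random subset of the previous link's ideas and contributes ideas that build on them. The respond callables stand in for real participants.

import random

def chainstorm(prompt, seed_ideas, participants, pass_along=2):
    """Run one chainstorming chain.

    Each participant sees the prompt and a random subset of the previous link's
    ideas, then contributes new ideas that build on them. `participants` is a
    list of callables: (prompt, ideas_seen) -> list of new ideas.
    """
    all_ideas = list(seed_ideas)
    previous = list(seed_ideas)
    for respond in participants:
        seen = random.sample(previous, k=min(pass_along, len(previous)))
        new_ideas = respond(prompt, seen)
        all_ideas.extend(new_ideas)
        previous = new_ideas or seen  # the next link builds on the freshest ideas
    return all_ideas

# Hypothetical participants who simply riff on whatever they are shown:
def make_participant(name):
    return lambda prompt, seen: [f"{name}'s twist on '{idea}'" for idea in seen]

ideas = chainstorm("How could we reduce nocturnal crime?",
                   ["illuminated clothing", "interactive kiosk"],
                   [make_participant(n) for n in ("Bob", "Carol", "Dave")])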

Indeed, it is the unique ability of cheatstorming to "dial in strangeness," as explored in our study, that makes it such a compelling example of the future of ideation online. In this respect, cheatstorming is far more nuanced than existing methods of random input, such as future workshops [31], inspiration card workshops [19], or other similar methods for lateral thinking, in that it enables operational changes to the ideation methodology and content directly, and thus can facilitate targeted and highly contextual "leaps" from an original set of ideas to a much wider framing of the problem domain. Clearly the success of chainstorming as a paradigm depends largely on the details of its implementation since, as noted in our discussion of brainstorming best practices, several factors will greatly influence the most effective outcomes. Much like offline group brainstorming, effective chainstorming is likely to depend heavily on the social constitution of the chain, the level of training (if any) that participants receive, the group's experience and orientation with respect to the task at hand, and the criteria employed to evaluate its outcomes. Moreover, ideation of this nature introduces additional factors that will need to be addressed—especially the potential lack of context accompanying the prompt communication, which (and how many) prior concepts accompany the message as it is passed from user to user, how their selection is determined, and how to handle redundant concepts, dead ends, cross-posting, parallel chains, and so on. Indeed, these are complicated issues that underlie all social messaging and communication networks. In order to investigate these questions of ideation more deeply, and to identify best practices for chainstorming networks, we have begun the design and development of a new social media platform for ideation—Tweetstormer—which will leverage Twitter messages as the transactional medium of the chainstorming system. Using this platform, members of the online community will be able to post and respond to tweeted prompt questions to virally distribute the chainstorm. Not only will this enable Twitter users to ideate anytime, anywhere, from their computer or mobile device, but we also plan to implement a custom website that allows users to see other users' questions, reply to them selectively, browse other users' replies to prompts, and vote on their favorite ideas to select them. Our hope is that ideation via this and other similarly inspired platforms will enable a more nuanced empirical study of the chainstorming paradigm and how best to integrate it into the social fabric of online innovation.

REFERENCES

1. Barki, H., & Pinsonneault, A. (2001). Small Group Brainstorming and Idea Quality: Is Electronic Brainstorming the Most Effective Approach? Small Group Research, 32, 158.
2. Bayless, O. L. (1967). An alternative for problem solving discussion. Journal of Communication, 17, 188-197.
3. Bouchard, T. J., & Hare, M. (1970). Size, performance, and potential in brainstorming groups. Applied Psychology, 54, 51-55.
4. Bouchard, T. J., Barsaloux, J., & Drauden, G. (1974). Brainstorming procedure, group size, and sex as determinants of the problem-solving effectiveness of groups and individuals. Applied Psychology, 59(2), 135-138.
5. Collaros, P. A., & Anderson, L. R. (1969). Effect of perceived expertness upon creativity of members of brainstorming groups. Applied Psychology, 53(2), 159-163.
6. de Bono, E. (1970). Lateral Thinking: Creativity Step By Step. Harper Perennial.
7. DeRosa, D. M., Smith, C. L., & Hantula, D. A. (2007). The Medium Matters: Mining the long-promised merit of group interaction in creative idea generation tasks in a meta-analysis of the electronic group brainstorming literature. Computers in Human Behavior, 23, 1549-1581.
8. Diehl, M., & Stroebe, W. (1987). Productivity Loss in Brainstorming Groups: Toward the Solution of a Riddle. Personality & Social Psychology, 53(3), 497-509.
9. Djajadiningrat, J. P., Gaver, W. W., & Frens, J. W. (2000). Interaction relabelling and extreme characters: Methods for exploring aesthetic interactions. Proc. DIS 2000, 66-71.
10. Dunnette, M. D., Campbell, J., & Jaastad, K. (1963). The effect of group participation on brainstorming effectiveness for two industrial samples. Applied Psychology, 47(1), 30-37.
11. Eno, B. (1978). Oblique Strategies. Opal, London.
12. Fallman, D. (2003). Design-Oriented Human-Computer Interaction. Proc. CHI, 225-232.
13. Faste, R. (1993). An Improved Model for Understanding Creativity and Convention. In Cary A. Fisher (ed.), ASME Resource Guide to Innovation in Engineering Design. American Society of Mechanical Engineers.
14. Faste, R. (1995). A Visual Essay on Invention and Innovation. Design Management Journal, 6(2).
15. Faste, H., & Bergamasco, M. (2009). A Strategic Map for High-Impact Virtual Experience Design. Proc. SPIE, 7238.
16. Gallupe, R., Bastianutti, L. M., & Cooper, W. H. (1991). Unblocking brainstorms. Journal of Applied Psychology, 76(1), 137-142.
17. Graham, C., Rouncefield, M., Gibbs, M., Vetere, F., & Cheverst, C. (2007). How Probes Work. Proc. OzCHI, 29.
18. Gordon, W. J. J. (1971). The Metaphorical Way of Learning and Knowing. Porpoise Books, p. 20.
19. Halskov, K., & Dalsgård, P. (2006). Inspiration Card Workshops. Proc. Designing Interactive Systems, 2-11.
20. Hegedus, D. M. (1986). Task Effectiveness and Interaction Process of a Modified Nominal Group Technique in Solving an Evaluation Problem. Journal of Management, 12(4), 545-560.
21. Isaksen, S. G. (1998). A Review of Brainstorming Research: Six Critical Issues for Inquiry. Technical report, Creative Problem Solving Group, Buffalo, NY.
22. Jones, J. C. (1970). Design Methods. John Wiley & Sons.
23. Kelley, T., Littman, J., & Peters, T. (2001). The Art of Innovation: Lessons in Creativity from IDEO, America's Leading Design Firm. Crown Business.
24. Koestler, A. (1964). The Act of Creation. Dell, NY.
25. Lamm, H., & Trommsdorff, G. (1973). Group versus individual performance on tasks requiring ideational proficiency (brainstorming): A review. European Journal of Social Psychology, 3, 361-388.
26. Larry, T., & Paulus, P. (1995). Social Comparison and Goal Setting in Brainstorming Groups. Journal of Applied Social Psychology, 25(18), 1579-1596.
27. Lehrer, J. (2012). Groupthink: The brainstorming myth. The New Yorker, January 30.
28. Madsen, D. B., & Finger, J. R. Jr. (1978). Comparison of a written feedback procedure, group brainstorming, and individual brainstorming. Applied Psychology, 63(1), 120-123.
29. Maginn, B. K., & Harris, R. J. (1980). Effects of anticipated evaluation on individual brainstorming performance. Journal of Applied Psychology, 65(2), 219-225.
30. Mullen, B., & Johnson, C. (1991). Productivity Loss in Brainstorming Groups: A Meta-Analytic Integration. Basic and Applied Social Psychology, 12(1), 3-23.
31. Muller, M. J., White, E. A., & Wildman, D. M. (1993). Taxonomy of PD practices: A brief practitioner's guide. Communications of the ACM, 36(6), 26-28.
32. Nijstad, B. A., De Dreu, C. K. W., Rietzschel, E. F., & Baas, M. (2010). The dual pathway to creativity model: Creative ideation as a function of flexibility and persistence. European Review of Social Psychology, 21, 34-77.
33. Osborn, A. F. (1963). Applied Imagination: Principles and procedures of creative thinking (3rd edition). Scribner.
34. Paulus, P. (2000). Groups, Teams and Creativity: The Creative Potential of Idea-Generating Groups. Applied Psychology: An International Review, 49(2), 237-262.
35. Price, K. (1985). Problem Solving Strategies: A Comparison by Problem-Solving Phases. Group and Organization Studies, 10(3), 278-299.
36. Renzulli, J. S., Owen, S. V., & Callahan, C. M. (1974). Fluency, flexibility, and originality as a function of group size. Journal of Creative Behavior, 8(2), 107-113.
37. Searle, J. R. (1983). Intentionality: An Essay in the Philosophy of Mind. Cambridge University Press.
38. Shah, H. H., & Vargas-Hernandez, N. (2002). Metrics for Measuring Ideation Effectiveness. Design Studies, 24(2).
39. Stein, M. I. (1975). Stimulating Creativity: Group Procedures (Volume 2). Academic Press, NY.
40. Stenmark, D. (2001). The Mindpool Hybrid: Theorising a New Angle on EBS and Suggestion Systems. Proc. Hawaii International Conference on Systems Science.
41. Sutton, R., & Hargadon, A. (1996). Brainstorming Groups in Context: Effectiveness in a Product Design Firm. Administrative Science Quarterly, 41(4), 685-718.
42. Taylor, D. W., Berry, P. C., & Block, C. H. (1958). Does group participation when using brainstorming facilitate or inhibit creative thinking? Administrative Science Quarterly, 3(1), 23-47.
43. von Oech, R. (1986). A Kick in the Seat of the Pants. Harper.
44. Watson, W., Michaelsen, L. K., & Sharp, W. (1991). Member competence, group interaction, and group decision making: A longitudinal study. Applied Psychology, 76, 803-809.
45. Yu, L., & Nickerson, J. (2011). Cooks or Cobblers? Crowd Creativity through Combination. Proc. CHI, 1393-1402.


Chapter 6 The process of interaction design

6.1 Introduction
6.2 What is interaction design about?
6.2.1 Four basic activities of interaction design
6.2.2 Three key characteristics of the interaction design process
6.3 Some practical issues
6.3.1 Who are the users?
6.3.2 What do we mean by "needs"?
6.3.3 How do you generate alternative designs?
6.3.4 How do you choose among alternative designs?
6.4 Lifecycle models: showing how the activities are related
6.4.1 A simple lifecycle model for interaction design
6.4.2 Lifecycle models in software engineering
6.4.3 Lifecycle models in HCI

6.1 Introduction

Design is a practical and creative activity, the ultimate intent of which is to develop a product that helps its users achieve their goals. In previous chapters, we looked at different kinds of interactive products, issues you need to take into account when doing interaction design and some of the theoretical basis for the field. This chapter is the first of four that will explore how we can design and build interactive products. Chapter 1 defined interaction design as being concerned with "designing interactive products to support people in their everyday and working lives." But how do you go about doing this? Developing a product must begin with gaining some understanding of what is required of it, but where do these requirements come from? Whom do you ask about them? Underlying good interaction design is the philosophy of user-centered design, i.e., involving users throughout development, but who are the users? Will they know what they want or need even if we can find them to ask? For an innovative product, users are unlikely to be able to envision what is possible, so where do these ideas come from? In this chapter, we raise and answer these kinds of questions and discuss the four basic activities and key characteristics of the interaction design process that


were introduced in Chapter 1. We also introduce a lifecycle model of interaction design that captures these activities and characteristics. The main aims of this chapter are to:
• Consider what 'doing' interaction design involves.
• Ask and provide answers for some important questions about the interaction design process.
• Introduce the idea of a lifecycle model to represent a set of activities and how they are related.
• Describe some lifecycle models from software engineering and HCI and discuss how they relate to the process of interaction design.
• Present a lifecycle model of interaction design.

6.2 What is interaction design about?

There are many fields of design, for example graphic design, architectural design, industrial and software design. Each discipline has its own interpretation of "designing." We are not going to debate these different interpretations here, as we are focussing on interaction design, but a general definition of "design" is informative in beginning to understand what it's about. The definition of design from the Oxford English Dictionary captures the essence of design very well: "(design is) a plan or scheme conceived in the mind and intended for subsequent execution." The act of designing therefore involves the development of such a plan or scheme. For the plan or scheme to have a hope of ultimate execution, it has to be informed with knowledge about its use and the target domain, together with practical constraints such as materials, cost, and feasibility. For example, if we conceived of a plan for building multi-level roads in order to overcome traffic congestion, before the plan could be executed we would have to consider drivers' attitudes to using such a construction, the viability of the structure, engineering constraints affecting its feasibility, and cost concerns. In interaction design, we investigate the artifact's use and target domain by taking a user-centered approach to development. This means that users' concerns direct the development rather than technical concerns. Design is also about trade-offs, about balancing conflicting requirements. If we take the roads plan again, there may be very strong environmental arguments for stacking roads higher (less countryside would be destroyed), but these must be balanced against engineering and financial limitations that make the proposition less attractive. Getting the balance right requires experience, but it also requires the development and evaluation of alternative solutions. Generating alternatives is a key principle in most design disciplines, and one that should be encouraged in interaction design. As Marc Rettig suggested: "To get a good idea, get lots of ideas" (Rettig, 1994). However, this is not necessarily easy, and unlike many design disciplines, interaction designers are not generally trained to generate alternative designs. However, the ability to brainstorm and contribute alternative ideas can be learned, and techniques from other design disciplines can be successfully used in interaction


design. For example, Danis and Boies (2000) found that using techniques from graphic design that encouraged the generation of alternative designs stimulated innovative interactive systems design. See also the interview with Gillian Crampton Smith at the end of this chapter for her views on how other aspects of traditional design can help produce good interaction design. Although possible, it is unlikely that just one person will be involved in developing and using a system and therefore the plan must be communicated. This requires it to be captured and expressed in some suitable form that allows review, revision, and improvement. There are many ways of doing this, one of the simplest being to produce a series of sketches. Other common approaches are to write a description in natural language, to draw a series of diagrams, and to build prototypes. A combination of these techniques is likely to be the most effective. When users are involved, capturing and expressing a design in a suitable format is especially important since they are unlikely to understand jargon or specialist notations. In fact, a form that users can interact with is most effective, and building prototypes of one form or another (see Chapter 8) is an extremely powerful approach. So interaction design involves developing a plan which is informed by the product’s intended use, target domain, and relevant practical considerations. Alternative designs need to be generated, captured, and evaluated by users. For the evaluation to be successful, the design must be expressed in a form suitable for users to interact with.

ACTIVITY 6.1

Imagine that you want to design an electronic calendar or diary for yourself. You might use this system to plan your time, record meetings and appointments, mark down people's birthdays, and so on, basically the kinds of things you might do with a paper-based calendar. Draw a sketch of the system outlining its functionality and its general look and feel. Spend about five minutes on this. Having produced an outline, now spend five minutes reflecting on how you went about tackling this activity. What did you do first? Did you have any particular artifacts or experience to base your design upon? What process did you go through?

Comment

The sketch I produced is shown in Figure 6.1. As you can see, I was quite heavily influenced by the paper-based books I currently use! I had in mind that this calendar should allow me to record meetings and appointments, so I need a section representing the days and months. But I also need a section to take notes. I am a prolific note-taker, and so for me this was a key requirement. Then I began to wonder about how I could best use hyperlinks. I certainly want to keep addresses and telephone numbers in my calendar, so maybe there could be a link between, say, someone’s name in the calendar and their entry in my address book that will give me their contact details when I need them? But I still want the ability to be able to turn page by page, for when I’m scanning or thinking about how to organize my time. A search facility would be useful too. The first thing that came into my head when I started doing this was my own paper-based book where I keep appointments, maps, telephone numbers, and other small notes. I also thought about my notebook and how convenient it would be to have the two combined. Then I sat and sketched different ideas about how it might look (although I’m not very good at sketching). The sketch in Figure 6.1 is the version I’m happiest with. Note that my sketch



has a strong resemblance to a paper-based book, yet I've also tried to incorporate electronic capabilities. Maybe once I have evaluated this design and ensured that the tasks I want to perform are supported, then I will be more receptive to changing the look away from a paper-based "look and feel."

Figure 6.1 An outline sketch of an electronic calendar.

The exact steps taken to produce a product will vary from designer to designer, from product to product, and from organization to organization. In this activity, you may have started by thinking about what you'd like such a system to do for you, or you may have been thinking about an existing paper calendar. You may have mixed together features of different systems or other record-keeping support. Having got or arrived at an idea of what you wanted, maybe you then imagined what it might look like, either through sketching with paper and pencil or in your mind.

6.2.1 Four basic activities of interaction design

Four basic activities for interaction design were introduced in Chapter 1, some of which you will have engaged in when doing Activity 6.1. These are: identifying needs and establishing requirements, developing alternative designs that meet those requirements, building interactive versions so that they can be communicated and assessed, and evaluating them, i.e., measuring their acceptability. They are fairly generic activities and can be found in other design disciplines too. For example, in architectural design (RIBA, 1988) basic requirements are established in a work stage called "inception", alternative design options are considered in a "feasibility" stage and "the brief" is developed through outline proposals and scheme design.


During this time, prototypes may be built or perspectives may be drawn to give clients a better indication of the design being developed. Detail design specifies all components, and working drawings are produced. Finally, the job arrives on site and building commences. We will be expanding on each of the basic activities of interaction design in the next two chapters. Here we give only a brief introduction to each.

Identifying needs and establishing requirements

In order to design something to support people, we must know who our target users are and what kind of support an interactive product could usefully provide. These needs form the basis of the product's requirements and underpin subsequent design and development. This activity is fundamental to a user-centered approach, and is very important in interaction design; it is discussed further in Chapter 7.

Developing alternative designs

This is the core activity of designing: actually suggesting ideas for meeting the requirements. This activity can be broken up into two sub-activities: conceptual design and physical design. Conceptual design involves producing the conceptual model for the product; a conceptual model describes what the product should do, how it should behave, and what it should look like. Physical design considers the detail of the product including the colors, sounds, and images to use, menu design, and icon design. Alternatives are considered at every point. You met some of the ideas for conceptual design in Chapter 2; we go into more detail about conceptual and physical design in Chapter 8.

Building interactive versions of the designs

Interaction design involves designing interactive products. The most sensible way for users to evaluate such designs, then, is to interact with them. This requires an interactive version of the designs to be built, but that does not mean that a software version is required. There are different techniques for achieving "interaction," not all of which require a working piece of software. For example, paper-based prototypes are very quick and cheap to build and are very effective for identifying problems in the early stages of design, and through role-playing users can get a real sense of what it will be like to interact with the product. This aspect is also covered in Chapter 8.

Evaluating designs

Evaluation is the process of determining the usability and acceptability of the product or design, measured in terms of a variety of criteria including the number of errors users make using it, how appealing it is, how well it matches the requirements, and so on. Interaction design requires a high level of user involvement throughout development, and this enhances the chances of an acceptable product being delivered. In most design situations you will find a number of activities concerned with


quality assurance and testing to make sure that the final product is “fit-for-purpose.” Evaluation does not replace these activities, but complements and enhances them. We devote Chapters 10 through 14 to the important subject of evaluation. The activities of developing alternative designs, building interactive versions of the design, and evaluation are intertwined: alternatives are evaluated through the interactive versions of the designs and the results are fed back into further design. This iteration is one of the key characteristics of the interaction design process, which we introduced in Chapter 1.

6.2.2 Three key characteristics of the interaction design process There are three characteristics that we believe should form a key part of the interaction design process. These are: a user focus, specific usability criteria, and iteration. The need to focus on users has been emphasized throughout this book, so you will not be surprised to see that it forms a central plank of our view on the interaction design process. While a process cannot, in itself, guarantee that a development will involve users, it can encourage focus on such issues and provide opportunities for evaluation and user feedback. Specific usability and user experience goals should be identified, clearly documented, and agreed upon at the beginning of the project. They help designers to choose between different alternative designs and to check on progress as the product is developed. Iteration allows designs to be refined based on feedback. As users and designers engage with the domain and start to discuss requirements, needs, hopes and aspirations, then different insights into what is needed, what will help, and what is feasible will emerge. This leads to a need for iteration, for the activities to inform each other and to be repeated. However good the designers are and however clear the users may think their vision is of the required artifact, it will be necessary to revise ideas in light of feedback, several times. This is particularly true if you are trying to innovate. Innovation rarely emerges whole and ready to go. It takes time, evolution, trial and error, and a great deal of patience. Iteration is inevitable because designers never get the solution right the first time (Gould and Lewis, 1985). We shall return to these issues and expand upon them in Chapter 9.

6.3 Some practical issues Before we consider how the activities and key characteristics of interaction design can be pulled together into a coherent process, we want to consider some questions highlighted by the discussion so far. These questions must be answered if we are going to be able to "do" interaction design in practice. These are:

• Who are the users?
• What do we mean by needs?
• How do you generate alternative designs?
• How do you choose among alternatives?


6.3.1 Who are the users? In Chapter 1, we said that an overarching objective of interaction design is to optimize the interactions people have with computer-based products, and that this requires us to support needs, match wants, and extend capabilities. We also stated above that the activity of identifying these needs and establishing requirements was fundamental to interaction design. However, we can’t hope to get very far with this intent until we know who the users are and what they want to achieve. As a starting point, therefore, we need to know who we consult to find out the users’ requirements and needs. Identifying the users may seem like a straightforward question, but in fact there are many interpretations of “user.” The most obvious definition is those people who interact directly with the product to achieve a task. Most people would agree with this definition; however, there are others who can also be thought of as users. For example, Holtzblatt and Jones (1993) include in their definition of “users” those who manage direct users, those who receive products from the system, those who test the system, those who make the purchasing decision, and those who use competitive products. Eason (1987) identifies three categories of user: primary, secondary and tertiary. Primary users are those likely to be frequent hands-on users of the system; secondary users are occasional users or those who use the system through an intermediary; and tertiary users are those affected by the introduction of the system or who will influence its purchase. The trouble is that there is a surprisingly wide collection of people who all have a stake in the development of a successful product. These people are called stakeholders. Stakeholders are “people or organizations who will be affected by the system and who have a direct or indirect influence on the system requirements” (Kotonya and Sommerville, 1998). Dix et al. (1993) make an observation that is very pertinent to a user-centered view of development, that “It will frequently be the case that the formal ‘client’ who orders the system falls very low on the list of those affected. Be very wary of changes which take power, influence or control from some stakeholders without returning something tangible in its place.” Generally speaking, the group of stakeholders for a particular product is going to be larger than the group of people you’d normally think of as users, although it will of course include users. Based on the definition above, we can see that the group of stakeholders includes the development team itself as well as its managers, the direct users and their managers, recipients of the product’s output, people who may lose their jobs because of the introduction of the new product, and so on. For example, consider again the calendar system in Activity 6.1. According to the description we gave you, the user group for the system has just one member: you. However, the stakeholders for the system would also include people you make appointments with, people whose birthdays you remember, and even companies that produce paper-based calendars, since the introduction of an electronic calendar may increase competition and force them to operate differently.


This last point may seem a little exaggerated for just one system, but if you think of others also migrating to an electronic version, and abandoning their paper calendars, then you can see how the companies may be affected by the introduction of the system. The net of stakeholders is really quite wide! We do not suggest that you need to involve all of the stakeholders in your user-centered approach, but it is important to be aware of the wider impact of any product you are developing. Identifying the stakeholders for your project means that you can make an informed decision about who should be involved and to what degree.

ACTIVITY 6.2 Who do you think are the stakeholders for the check-out system of a large supermarket?

Comment

First, there are the check-out operators. These are the people who sit in front of the machine and pass the customers’ purchases over the bar code reader, receive payment, hand over receipts, etc. Their stake in the success and usability of the system is fairly clear and direct. Then you have the customers, who want the system to work properly so that they are charged the right amount for the goods, receive the correct receipt, are served quickly and efficiently. Also, the customers want the check-out operators to be satisfied and happy in their work so that they don’t have to deal with a grumpy assistant. Outside of this group, you then have supermarket managers and supermarket owners, who also want the assistants to be happy and efficient and the customers to be satisfied and not complaining. They also don’t want to lose money because the system can’t handle the payments correctly. Other people who will be affected by the success of the system include other supermarket employees such as warehouse staff, supermarket suppliers, supermarket owners’ families, and local shop owners whose business would be affected by the success or failure of the system. We wouldn’t suggest that you should ask the local shop owner about requirements for the supermarket check-out system. However, you might want to talk to warehouse staff, especially if the system links in with stock control or other functions.
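One lightweight way to act on this kind of stakeholder analysis is simply to write it down in a form the team can query. The short Python sketch below is our own illustration (not from Eason or from any real check-out project): it records each stakeholder group for the supermarket check-out system against Eason's categories, together with the team's decision about whom to consult; all entries and decisions are invented for the example.

```python
# A minimal sketch (illustrative only) of recording stakeholders for the
# supermarket check-out system using Eason's (1987) categories of user.
from dataclasses import dataclass
from enum import Enum

class UserCategory(Enum):
    PRIMARY = "frequent hands-on user"
    SECONDARY = "occasional user, or uses the system through an intermediary"
    TERTIARY = "affected by the system's introduction, or influences its purchase"

@dataclass
class Stakeholder:
    name: str
    category: UserCategory
    consult_for_requirements: bool  # the informed decision discussed above

checkout_stakeholders = [
    Stakeholder("check-out operators", UserCategory.PRIMARY, True),
    Stakeholder("customers", UserCategory.SECONDARY, True),
    Stakeholder("warehouse staff", UserCategory.SECONDARY, True),
    Stakeholder("supermarket managers and owners", UserCategory.TERTIARY, True),
    Stakeholder("local shop owners", UserCategory.TERTIARY, False),
]

for s in checkout_stakeholders:
    print(f"{s.name}: {s.category.name.lower()} user ({s.category.value})")
```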

6.3.2 What do we mean by “needs”? If you had asked someone in the street in the late 1990s what she ‘needed’, I doubt that the answer would have included interactive television, or a jacket which was wired for communication, or a smart fridge. If you presented the same person with these possibilities and asked whether she would buy them if they were available, then the answer would have been different. When we talk about identifying needs, therefore, it’s not simply a question of asking people, “What do you need?” and then supplying it, because people don’t necessarily know what is possible (see Suzanne Robertson’s interview at the end of Chapter 7 for “un-dreamed-of” requirements). Instead, we have to approach it by understanding the characteristics and capabilities of the users, what they are trying to achieve, how they achieve it currently, and whether they would achieve their goals more effectively if they were supported differently. There are many dimensions along which a user’s capabilities and characteristics may vary, and that will have an impact on the product’s design. You have met


some of these in Chapter 3. For example, a person’s physical characteristics may affect the design: size of hands may affect the size and positioning of input buttons, and motor abilities may affect the suitability of certain input and output devices; height is relevant in designing a physical kiosk, for example; and strength in designing a child’s toy—a toy should not require too much strength to operate, but may require strength greater than expected for the target age group to change batteries or perform other operations suitable only for an adult. Cultural diversity and experience may affect the terminology the intended user group is used to, or how nervous about technology a set of users may be. If a product is a new invention, then it can be difficult to identify the users and representative tasks for them; e.g., before microwave ovens were invented, there were no users to consult about requirements and there were no representative tasks to identify. Those developing the oven had to imagine who might want to use such an oven and what they might want to do with it. It may be tempting for designers simply to design what they would like, but their ideas would not necessarily coincide with those of the target user group. It is imperative that representative users from the real target group be consulted. For example, a company called Netpliance was developing a new “Internet appliance,” i.e., a product that would seamlessly integrate all the services necessary for the user to achieve a specific task on the Internet (Isensee et al., 2000). They took a user-centered approach and employed focus group studies and surveys to understand their customers’ needs. The marketing department led these efforts, but developers observed the focus groups to learn more about their intended user group. Isensee et al. (p. 60) observe that “It is always tempting for developers to create products they would want to use or similar to what they have done before. However, in the Internet appliance space, it was essential to develop for a new audience that desires a simpler product than the computer industry has previously provided.” In these circumstances, a good indication of future behavior is current or past behavior. So it is always useful to start by understanding similar behavior that is already established. Apart from anything else, introducing something new into people’s lives, especially a new “everyday” item such as a microwave oven, requires a culture change in the target user population, and it takes a long time to effect a culture change. For example, before cell phones were so widely available there were no users and no representative tasks available for study, per se. But there were standard telephones and so understanding the tasks people perform with, and in connection with, standard telephones was a useful place to start. Apart from making a telephone call, users also look up people’s numbers, take messages for others not currently available, and find out the number of the last person to ring them. These kinds of behavior have been translated into memories for the telephone, answering machines, and messaging services for mobiles. In order to maximize the benefit of e-commerce sites, traders have found that referring back to customers’ non-electronic habits and behaviors can be a good basis for enhancing e-commerce activity (CHI panel, 2000; Lee et al., 2000).


6.3.3 How do you generate alternative designs? A common human tendency is to stick with something that we know works. We probably recognize that a better solution may exist out there somewhere, but it’s very easy to accept this one because we know it works—it’s “good enough.” Settling for a solution that is good enough is not, in itself, necessarily “bad,” but it may be undesirable because good alternatives may never be considered, and considering alternative solutions is a crucial step in the process of design. But where do these alternative ideas come from? One answer to this question is that they come from the individual designer’s flair and creativity. While it is certainly true that some people are able to produce wonderfully inspired designs while others struggle to come up with any ideas at all, very little in this world is completely new. Normally, innovations arise through cross-fertilization of ideas from different applications, the evolution of an existing product through use and observation, or straightforward copying of other, similar products. For example, if you think of something commonly believed to be an “invention,” such as the steam engine, this was in fact inspired by the observation that the steam from a kettle boiling on the stove lifted the lid. Clearly there was an amount of creativity and engineering involved in making the jump from a boiling kettle to a steam engine, but the kettle provided the inspiration to translate experience gained in one context into a set of principles that could be applied in another. As an example of evolution, consider the word processor. The capabilities of suites of office software have gradually increased from the time they first appeared. Initially, a word processor was just an electronic version of a typewriter, but gradually other capabilities, including the spell-checker, thesaurus, style sheets, graphical capabilities, etc., were added.


So although creativity and invention are often wrapped in mystique, we do understand something of the process and of how creativity can be enhanced or inspired. We know, for instance, that browsing a collection of designs will inspire designers to consider alternative perspectives, and hence alternative solutions. The field of case-based reasoning (Maher and Pu, 1997) emerged from the observation that designers solve new problems by drawing on knowledge gained from solving previous similar problems. As Schank (1982; p. 22) puts it, “An expert is someone who gets reminded of just the right prior experience to help him in processing his current experiences.” And while those experiences may be the designer’s own, they can equally well be others’. A more pragmatic answer to this question, then, is that alternatives come from looking at other, similar designs, and the process of inspiration and creativity can be enhanced by prompting a designer’s own experience and by looking at others’ ideas and solutions. Deliberately seeking out suitable sources of inspiration is a valuable step in any design process. These sources may be very close to the intended new product, such as competitors’ products, or they may be earlier versions of similar systems, or something completely different.
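The retrieval step at the heart of case-based reasoning can be made concrete in a few lines of code. The sketch below is ours rather than Maher and Pu's: it tags a handful of invented past design cases with features and returns the case that shares the most features with a new problem, which is the kind of "reminding" Schank describes.

```python
# A minimal sketch (ours, for illustration) of case-based retrieval: find the
# past design case that shares the most features with the new design problem.
case_library = {
    "supermarket check-out system": {"public", "novice users", "payment", "high throughput"},
    "museum information kiosk": {"public", "novice users", "touch screen", "browsing"},
    "airline booking website": {"remote use", "payment", "forms", "browsing"},
}

def most_similar_case(new_problem_features):
    # Score each stored case by how many features it shares with the new problem.
    return max(case_library,
               key=lambda case: len(case_library[case] & new_problem_features))

print(most_similar_case({"public", "touch screen", "browsing"}))
# Prints "museum information kiosk", the closest prior experience in this tiny library.
```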

ACTIVITY 6.3 Consider again the calendar system introduced at the beginning of the chapter. Reflecting on the process again, what do you think inspired your outline design? See if you can identify any elements within it that you believe are truly innovative.

Comment

For my design, I haven’t seen an electronic calendar, although I have seen plenty of other software-based systems. My main sources of inspiration were my current paper-based books. Some of the things you might have been thinking of include your existing paper-based calendar, and other pieces of software you commonly use and find helpful or easy to use in some way. Maybe you already have access to an electronic calendar, which will have given you some ideas, too. However, there are probably other aspects that make the design somehow unique to you and may be innovative to a greater or lesser degree.

All this having been said, under some circumstances the scope to consider alternative designs may be limited. Design is a process of balancing constraints and constantly trading off one set of requirements with another, and the constraints may be such that there are very few viable alternatives available. As another example, if you are designing a software system to run under the Windows operating system, then elements of the design will be prescribed because you must conform to the Windows “look and feel,” and to other constraints intended to make Windows programs consistent for the user. We shall return to style guides and standards in Chapter 8. If you are producing an upgrade to an existing system, then you may face other constraints, such as wanting to keep the familiar elements of it and retain the same “look and feel.” However, this is not necessarily a rigid rule. Kent Sullivan reports that when designing the Windows 95 operating system to replace the Windows 3.1 and Windows for Workgroups 3.11 operating systems, they initially focused too much on consistency with the earlier versions (Sullivan, 1996).


BOX 6.1 A Box Full of Ideas The innovative product design company IDEO was introduced in Chapter 1. It has been involved in the development of many artifacts including the first commercial computer mouse and the PalmPilot V. Underlying some of their creative flair is a collection of weird and wonderful engineering housed in a large flatbed filing cabinet called the TechBox (see Figure 6.2). The TechBox holds around 200 gizmos and interesting materials, divided into categories: "Amazing Materials," "Cool Mechanisms," "Interesting Manufacturing Processes," "Electronic Technologies," and "Thermal and Optical." Each item has been placed in the box because it represents a neat idea or a new process. Staff at IDEO take along a selection of items from the TechBox to brainstorming meetings. The items may be chosen because they provide useful visual props or possible solutions to a particular issue, or simply to provide some light relief. Each item is clearly labeled with its name and category, but further information can be found by accessing the TechBox's online catalog. Each item has its own page detailing what the item is, why it's interesting, where it came from, and who has used it or knows more about it. For example, the page in Figure 6.3 relates to a metal injection-molding technique. Other items in the box include an example of metal-coated wood, and materials with and without holes that stretch, bend, and change shape or color at different temperatures. Each TechBox has its own curator who is responsible for maintaining and cataloging the items and for promoting its use within the office. Anyone can submit a new item for consideration and

Figure 6.2 The TechBox at IDEO.


Figure 6.3 The web page for the metal injection molding.

as items become commonplace, they are removed from the TechBox to make way for the next generation of fascinating contraptions. How are these things used? Well, here is one example from Patrick Hall at the London IDEO office (see Figure 6.4): IDEO was asked to review the design of a mass-produced hand-held medical product that was deemed to be too big.

As well as brainstorming and other conventional idea-generation methods, I was able to immediately pick out items which I knew about from having used the TechBox in the past: Deep Draw; Fibre-Optic magnifier; Metal Injection molding; Flexy Battery. Further browsing and searching using the keywords search engine highlighted in-mold assembly and light-intensifying film.


Figure 6.4 Items from the TechBox used in the design of a medical product: (a) Deep Draw, a metal-forming process to generate close-ended cylindrical parts; (b) Metal Injection Molding, a molding and sintering process to produce complex metal parts in high numbers; (c) Flexy Battery, a lithium polymer cell from Varta that is very thin (intended for Smart Cards) and can be formed into cylindrical shapes.

The associated web pages for these items enabled me to learn more about these items immediately and indicated who to talk to in IDEO to find out more, and the details of vendors to approach.

The project ended at the feasibility phase, with the client pursuing the technologies I had suggested. Only the fiberoptic magnifier proved (immediately) not to be worth pursuing (because of cost).


DILEMMA Copying for Inspiration: Is It Legal? Designers draw on their experience of design when approaching a new project. This includes the use of previous designs that they know work, both designs they have created themselves and those that others have created. Others’ creations often spark inspiration that also leads to new ideas and innovation. This is well known and understood. However, the expression of an idea is protected by copyright, and people who infringe that copyright can be taken to court and prosecuted. Note that copyright covers the expression of an idea and not the idea itself. This means, for example, that while there are numerous word processors all with similar functionality, this does not represent an infringement of copyright as the idea has been expressed in different ways, and it’s the expression that’s been copyrighted. Copyright is free and is automatically invested in the author of something, e.g., the writer of a book or a programmer who develops a program, unless he signs the copyright over to someone else. Authors writing for academic journals often are asked to sign over their copyright to the publisher of the journal. Various limitations and special conditions can apply, but basically, the copyright is no longer theirs. People who produce something through their employment, such as programs or products, may have in their employment contract a statement saying that

the copyright relating to anything produced in the course of that employment is automatically assigned to the employer and does not remain with the employee. On the other hand, patenting is an alternative to copyright that does protect the idea rather than the expression. There are various forms of patenting, each of which is designed to allow the inventor the chance to capitalize on an idea. It is unusual for software to be patented, since it is a long, slow, and expensive process, although there is a recent trend towards patenting business processes. For example, Amazon, the on-line bookstore, has patented its “one-click” purchasing process, which allows regular users simply to choose a book and buy it with one mouse click (US Patent No. 5960411, September 29, 1999). This is possible because the system stores its customers’ details and “recognizes” them when they access the site again. So the dilemma comes in knowing when it’s OK to use someone else’s work as a source of inspiration and when you are infringing copyright or patent law. The issues around this question are complex and detailed, and well beyond the scope of this book, but more information and examples of law cases that have been brought successfully and unsuccessfully can be found in Bainbridge (1999).

6.3.4 How do you choose among alternative designs? Choosing among alternatives is about making design decisions: Will the device use keyboard entry or a touch screen? Will the device provide an automatic memory function or not? These decisions will be informed by the information gathered about users and their tasks, and by the technical feasibility of an idea. Broadly speaking, though, the decisions fall into two categories: those that are about externally visible and measurable features, and those that are about characteristics internal to the system that cannot be observed or measured without dissecting it. For example, externally visible and measurable factors for a building design include the ease of access to the building, the amount of natural light in rooms, the width of corridors, and the number of power outlets. In a photocopier, externally visible and measurable factors include the physical size of the machine, the speed and quality of copying, the different sizes of paper it can use, and so on. Underlying each of these factors are other considerations that cannot be observed or studied without dissecting the building or the machine. For example, the number of


power outlets will be dependent on how the wiring within the building is designed and the capacity of the main power supply; the choice of materials used in a photocopier may depend on its friction rating and how much it deforms under certain conditions. In an interactive product there are similar factors that are externally visible and measurable and those that are hidden from the users’ view. For example, exactly why the response time for a query to a database (or a web page) is, say, 4 seconds will almost certainly depend on technical decisions made when the database was constructed, but from the users’ viewpoint the important observation is the fact that it does take 4 seconds to respond. In interaction design, the way in which the users interact with the product is considered the driving force behind the design and so we concentrate on the externally visible and measurable behavior. Detailed internal workings are important only to the extent that they affect the external behavior. This does not mean that design decisions concerning a system’s internal behavior are any less important: however, the tasks that the user will perform should influence design decisions no less than technical issues. So, one answer to the question posed above is that we choose between alternative designs by letting users and stakeholders interact with them and by discussing their experiences, preferences and suggestions for improvement. This is fundamental to a user-centered approach to development. This in turn means that the designs must be available in a form that can be reasonably evaluated with users, not in technical jargon or notation that seems impenetrable to them. One form traditionally used for communicating a design is documentation, e.g., a description of how something will work or a diagram showing its components. The trouble is that a static description cannot capture the dynamics of behavior, and for an interaction device we need to communicate to the users what it will be like to actually operate it. In many design disciplines, prototyping is used to overcome potential client misunderstandings and to test the technical feasibility of a suggested design and its production. Prototyping involves producing a limited version of the product with the purpose of answering specific questions about the design’s feasibility or appropriateness. Prototypes give a better impression of the user experience than simple descriptions can ever do, and there are different kinds of prototyping that are suitable for different stages of development and for eliciting different kinds of information. One experience illustrating the benefits of prototyping is described in Box 6.2. So one important aspect of choosing among alternatives is that prototypes should be built and evaluated by users. We’ll revisit the issue of prototyping in Chapter 8. Another basis on which to choose between alternatives is “quality,” but this requires a clear understanding of what “quality” means. People’s views of what is a quality product vary, and we don’t always write it down. Whenever we use anything we have some notion of the level of quality we are expecting, wanting, or needing. Whether this level of quality is expressed formally or informally does not matter. The point is that it exists and we use it consciously or subconsciously to evaluate alternative items. For example, if you have to wait too long to download


BOX 6.2 The Value of Prototyping I learned the value of a prototype through a very effective role-playing exercise. I was on a course designed to introduce new graduates to different possible careers in industry. One of the themes was production and manufacturing, and the aim of one group exercise was to produce a notebook. Each group was told that it had 30 minutes to deliver 10 books to the person in charge. We were given various pieces of paper, scissors, sticky tape, staples, etc., and told to organize ourselves as best we could. So my group set to work organizing ourselves into a production line, with one of us cutting up the paper, another stapling the pages together, another sealing the binding with the sticky tape, and so on. One person was even in charge of quality assurance. It took us less than 10 minutes to produce the 10 books, and we rushed off with our delivery. When we showed the person in

charge, he replied, “That’s not what I wanted, I need it bigger than that.” Of course, the size of the notebook wasn’t specified in the description of the task, so we found out how big he wanted it, got some more materials, and scooted back to produce 10 more books. Again, we set up our production line and produced 10 books to the correct size. On delivery we were again told that it was not what was required: he wanted the binding to work the other way around. This time we got as many of the requirements as we could and went back, developed one book, and took that back for further feedback and refinement before producing the 10 required. If we had used prototyping as a way of exploring our ideas and checking requirements in the first place, we could have saved so much effort and resource!

a web page, then you are likely to give up and try a different site—you are applying a certain measure of quality associated with the time taken to download the web page. If one cell phone makes it easy to perform a critical function while another involves several complicated key sequences, then you are likely to buy the former rather than the latter. You are applying a quality criterion concerned with efficiency. Now, if you are the only user of a product, then you don’t necessarily have to express your definition of “quality” since you don’t have to communicate it to anyone else. However, as we have seen, most projects involve many different stakeholder groups, and you will find that each of them has a different definition of quality and different acceptable limits for it. For example, although all stakeholders may agree on targets such as “response time will be fast” or “the menu structure will be easy to use,” exactly what each of them means by this is likely to vary. Disputes are inevitable when, later in development, it transpires that “fast” to one set of stakeholders meant “under a second,” while to another it meant “between 2 and 3 seconds.” Capturing these different views in clear unambiguous language early in development takes you halfway to producing a product that will be regarded as “good” by all your stakeholders. It helps to clarify expectations, provides a benchmark against which products of the development process can be measured, and gives you a basis on which to choose among alternatives. The process of writing down formal, verifiable–and hence measurable–usability criteria is a key characteristic of an approach to interaction design called usability engineering that has emerged over many years and with various proponents (Whiteside


et al., 1988; Nielsen, 1993). Usability engineering involves specifying quantifiable measures of product performance, documenting them in a usability specification, and assessing the product against them. One way in which this approach is used is to make changes to subsequent versions of a system based on feedback from carefully documented results of usability tests for the earlier version. We shall return to this idea later when we discuss evaluation.
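To make the idea of a usability specification concrete, here is a minimal Python sketch of what documenting quantifiable criteria might look like. It is our illustration, not the notation used by Whiteside et al. or Nielsen, and the attributes, measures, and numbers are invented; the point is that once "fast" is recorded as a worst acceptable level and a target level, different stakeholders can no longer mean different things by it.

```python
# A minimal, illustrative usability specification: each criterion names what is
# measured, a worst acceptable level, and a target level, so the product can be
# assessed against explicit, verifiable numbers.
from dataclasses import dataclass

@dataclass
class UsabilityCriterion:
    attribute: str
    measure: str              # what is observed, and in what units
    worst_acceptable: float
    target: float

    def assess(self, observed: float) -> str:
        # For these measures, lower observed values are better (times, error counts).
        if observed <= self.target:
            return "meets target"
        if observed <= self.worst_acceptable:
            return "acceptable"
        return "fails"

specification = [
    UsabilityCriterion("efficiency", "response time to a search query (seconds)", 3.0, 1.0),
    UsabilityCriterion("learnability", "time for a novice to enter an appointment (minutes)", 5.0, 2.0),
]

observations = {"response time to a search query (seconds)": 2.2,
                "time for a novice to enter an appointment (minutes)": 6.5}

for criterion in specification:
    print(f"{criterion.attribute}: {criterion.assess(observations[criterion.measure])}")
```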

ACTIVITY 6.4 Consider the calendar system that you designed in Activity 6.1. Suggest some usability criteria that you could use to determine the calendar's quality. You will find it helpful to think in terms of the usability goals introduced in Chapter 1: effectiveness, efficiency, safety, utility, learnability, and memorability. Be as specific as possible. Check your criteria by considering exactly what you would measure and how you would measure its performance. Having done that, try to do the same thing for the user experience goals introduced in Chapter 1; these relate to whether a system is satisfying, enjoyable, motivating, rewarding, and so on.

Comment

Finding measurable characteristics for some of these is not easy. Here are some suggestions, but you may have found others. Note that the criteria must be measurable and very specific.

• Effectiveness: Identifying measurable criteria for this goal is particularly difficult since it is a combination of the other goals. For example, does the system support you in keeping appointments, taking notes, and so on? In other words, is the calendar used?
• Efficiency: Assuming that there is a search facility in the calendar, what is the response time for finding a specific day or a specific appointment?
• Safety: How often does data get lost or does the user press the wrong button? This may be measured, for example, as the number of times this happens per hour of use.
• Utility: How many functions offered by the calendar are used every day, how many every week, and how many every month? How many tasks are difficult to complete in a reasonable time because functionality is missing or the calendar doesn't support the right subtasks?
• Learnability: How long does it take for a novice user to be able to do a series of set tasks, e.g., make an entry into the calendar for the current date, delete an entry from the current date, and edit an entry for the following day?
• Memorability: If the calendar isn't used for a week, how many functions can you remember how to perform? How long does it take you to remember how to perform your most frequent task?

Finding measurable characteristics for the user experience criteria is even harder, though. How do you measure satisfaction, fun, motivation, or aesthetics? What is entertaining to one person may be boring to another; these kinds of criteria are subjective, and so cannot be measured objectively.
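Several of the criteria suggested above can be captured directly during a user test. The following sketch, again ours and using invented task and session values, shows the kind of lightweight harness that could record a set-task time for learnability and an errors-per-hour figure for safety.

```python
# A minimal, illustrative measurement harness for two of the criteria above.
import time

def timed_task(task):
    """Run one set task and return how long it took, in seconds."""
    start = time.perf_counter()
    task()
    return time.perf_counter() - start

def errors_per_hour(error_count, session_seconds):
    """Convert an observed error count into errors per hour of use."""
    return error_count / (session_seconds / 3600)

def make_entry_task():
    # Stand-in for a participant making an entry for the current date.
    time.sleep(0.1)

print(f"learnability: set task took {timed_task(make_entry_task):.2f} s")
print(f"safety: {errors_per_hour(error_count=3, session_seconds=1800):.1f} errors per hour")
```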

6.4 Lifecycle models: showing how the activities are related Understanding what activities are involved in interaction design is the first step to being able to do it, but it is also important to consider how the activities are related


to one another so that the full development process can be seen. The term lifecycle model 1 is used to represent a model that captures a set of activities and how they are related. Sophisticated models also incorporate a description of when and how to move from one activity to the next and a description of the deliverables for each activity. The reason such models are popular is that they allow developers, and particularly managers, to get an overall view of the development effort so that progress can be tracked, deliverables specified, resources allocated, targets set, and so on. Existing models have varying levels of sophistication and complexity. For projects involving only a few experienced developers, a simple process would probably be adequate. However, for larger systems involving tens or hundreds of developers with hundreds or thousands of users, a simple process just isn’t enough to provide the management structure and discipline necessary to engineer a usable product. So something is needed that will provide more formality and more discipline. Note that this does not mean that innovation is lost or that creativity is stifled. It just means that a structured process is used to provide a more stable framework for creativity. However simple or complex it appears, any lifecycle model is a simplified version of reality. It is intended as an abstraction and, as with any good abstraction, only the amount of detail required for the task at hand should be included. Any organization wishing to put a lifecycle model into practice will need to add detail specific to its particular circumstances and culture. For example, Microsoft wanted to maintain a small-team culture while also making possible the development of very large pieces of software. To this end, they have evolved a process that has been called “synch and stabilize,” as described in Box 6.3. In the next subsection, we introduce our view of what a lifecycle model for interaction design might look like that incorporates the four activities and the three key characteristics of the interaction design process discussed above. This will form the basis of our discussion in Chapters 7 and 8. Depending on the kind of system being developed, it may not be possible or appropriate to follow this model for every element of the system, and it is certainly true that more detail would be required to put the lifecycle into practice in a real project. Many other lifecycle models have been developed in fields related to interaction design, such as software engineering and HCI, and our model is evolved from these ideas. To put our interaction design model into context we include here a description of five lifecycle models, three from software engineering and two from HCI, and consider how they relate to it.

1. Sommerville (2001) uses the term process model to mean what we mean by lifecycle model, and refers to the waterfall model as the software lifecycle. Pressman (1992) talks about paradigms. In HCI the term "lifecycle model" is used more widely. For this reason, and because others use "process model" to represent something that is more detailed than a lifecycle model (e.g., Comer, 1997), we have chosen to use lifecycle model.


BOX 6.3 How Microsoft Builds Software (Cusumano and Selby, 1997) Microsoft is one of the largest software companies in the world and builds some very complex software; for example, Windows 95 contains more than 11 million lines of code and required more than 200 programmers. Over a two-and-a-half-year period from the beginning of 1993, two researchers, Michael Cusumano and Richard Selby, were given access to Microsoft project documents and key personnel for study and interview. Their aim was to build up an understanding of how Microsoft produces software. Rather than adopt the structured software engineering practices others have followed, Microsoft's strategy has been to cultivate entrepreneurial flexibility throughout its software teams. In essence, it has tried to scale up the culture of a loosely-structured, small software team. "The objective is to get many small teams (three to eight developers each) or individual programmers to work together as a single relatively large team in order to build large products relatively quickly while still allowing

individual programmers and teams freedom to evolve their designs and operate nearly autonomously" (p. 54). In order to maintain consistency and to ensure that products are eventually shipped, the teams synchronize their activities daily and periodically stabilize the whole product. Cusumano and Selby have therefore labeled Microsoft's unique process "synch and stabilize." Figure 6.5 shows an overview of this process, which is divided into three phases: the planning phase, the development phase, and the stabilization phase. The planning phase begins with a vision statement that defines the goals of the new product and the user activities to be supported by the product. (Microsoft uses a method called activity-based planning to identify and prioritize the features to be built; we return to this in Chapter 9.) The program managers together with the developers then write a functional specification in enough detail to describe features and to develop schedules and allocate staff. The feature list in this document will change by about 30% during the course of development, so the list is not fixed at this time.

In the next phase, the development phase, the feature list is divided into three or four parts, each with its own small development team, and the schedule is divided into sequential subprojects, each with its own deadline (milestone). The teams work in parallel on a set of features and synchronize their work by putting together their code and finding errors on a daily and weekly basis. This is necessary because many programmers may be working on the same code at once. For example, during the peak development of Excel 3.0, 34 developers were actively changing the same source code on a daily basis. At the end of a subproject, i.e., on reaching a milestone, all errors are found and fixed, thus stabilizing the product, before moving on to the next subproject and eventually to the final milestone, which represents the release date. Figure 6.6 shows an overview of the milestone structure for a project with three subprojects. This synch-and-stabilize approach has been used to develop Excel, Office, Publisher, Windows 95, Windows NT, Word, and Works, among others.

Figure 6.5 Overview of the synch and stabilize development approach: a planning phase (vision statement, specification document, and schedule and feature team formation, with each feature team containing approximately one program manager, 3–8 developers, and 3–8 testers working 1:1 with developers); a development phase (feature development in three or four sequential subprojects, each delivering about a third of the features and ending in a milestone release); and a stabilization phase (comprehensive internal and external "beta" testing, final debugging and code stabilization, and preparation of the "golden master" release).

Figure 6.6 Milestones in the synch and stabilize approach (each taking two to four months): every milestone includes development, usability lab work, private release testing, daily builds, feature debugging and integration, code stabilization, and buffer time, with the final milestone ending in feature complete, code complete, a zero-bug release, and release to manufacturing.


6.4.1 A simple lifecycle model for interaction design We see the activities of interaction design as being related as shown in Figure 6.7. This model incorporates iteration and encourages a user focus. While the outputs from each activity are not specified in the model, you will see in Chapter 7 that our description of establishing requirements includes the need to identify specific usability criteria. The model is not intended to be prescriptive; that is, we are not suggesting that this is how all interactive products are or should be developed. It is based on our observations of interaction design and on information we have gleaned in the research for this book. It has its roots in the software engineering and HCI lifecycle models described below, and it represents what we believe is practiced in the field. Most projects start with identifying needs and requirements. The project may have arisen because of some evaluation that has been done, but the lifecycle of the new (or modified) product can be thought of as starting at this point. From this activity, some alternative designs are generated in an attempt to meet the needs and requirements that have been identified. Then interactive versions of the designs are developed and evaluated. Based on the feedback from the evaluations, the team may need to return to identifying needs or refining requirements, or it may go straight into redesigning. It may be that more than one alternative design follows this iterative cycle in parallel with others, or it may be that one alternative at a time is considered. Implicit in this cycle is that the final product will emerge in an evolutionary fashion from a rough initial idea through to the finished product. Exactly how this evolution happens may vary from project to project, and we return to this issue in Chapter 8. The only factor limiting the number of times through the cycle is the resources available, but whatever the number is, development ends with an evaluation activity that ensures the final product meets the prescribed usability criteria.

Figure 6.7 A simple interaction design model: identify needs/establish requirements, (re)design, build an interactive version, and evaluate, cycling between these activities until evaluation confirms the final product.
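One way to appreciate the shape of this model is to express its iteration in code. The sketch below is only our illustration of Figure 6.7: the four activities are stand-in functions, and a fixed cycle budget stands in for the resources that limit the number of times through the cycle; it is not a prescription for how real projects are run.

```python
# A minimal, illustrative rendering of the model in Figure 6.7. The four
# activities are stand-in functions; a fixed cycle budget stands in for the
# resources that limit how many times the cycle can be repeated.
import random

def establish_requirements(feedback=None):
    return {"usability criterion": "search responds within 1 second"}

def redesign(requirements, previous_design):
    version = (previous_design or {}).get("version", 0) + 1
    return {"requirements": requirements, "version": version}

def build_interactive_version(design):
    return {"prototype of version": design["version"]}

def evaluate(prototype, requirements):
    # Stand-in for evaluating with users against the usability criteria.
    return random.random() > 0.5

def interaction_design_cycle(max_cycles=5):
    requirements = establish_requirements()     # identify needs/establish requirements
    design, prototype = None, None
    for cycle in range(1, max_cycles + 1):
        design = redesign(requirements, design)              # (re)design
        prototype = build_interactive_version(design)        # build an interactive version
        if evaluate(prototype, requirements):                # evaluate
            print(f"usability criteria met after {cycle} cycle(s)")
            return prototype                                 # final product
        requirements = establish_requirements(feedback=prototype)  # feedback may refine requirements
    print(f"resources exhausted after {max_cycles} cycles")
    return prototype

interaction_design_cycle()
```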


6.4.2 Lifecycle models in software engineering Software engineering has spawned many lifecycle models, including the waterfall, the spiral, and rapid applications development (RAD). Before the waterfall was first proposed in 1970, there was no generally agreed approach to software development, but over the years since then, many models have been devised, reflecting in part the wide variety of approaches that can be taken to developing software. We choose to include these specific lifecycle models for two reasons: First, because they are representative of the models used in industry and they have all proved to be successful, and second, because they show how the emphasis in software development has gradually changed to include a more iterative, user-centered view. The waterfall lifecycle model The waterfall lifecycle was the first model generally known in software engineering and forms the basis of many lifecycles in use today. This is basically a linear model in which each step must be completed before the next step can be started (see Figure 6.8). For example, requirements analysis has to be completed before

Figure 6.8 The waterfall lifecycle model of software development: requirements analysis, design, code, test, and maintenance, with only limited feedback between adjacent steps.


design can begin. The names given to these steps vary, as does the precise definition of each one, but basically, the lifecycle starts with some requirements analysis, moves into design, then coding, then implementation, testing, and finally maintenance. One of the main flaws with this approach is that requirements change over time, as businesses and the environment in which they operate change rapidly. This means that it does not make sense to freeze requirements for months, or maybe years, while the design and implementation are completed. Some feedback to earlier stages was acknowledged as desirable and indeed practical soon after this lifecycle became widely used (Figure 6.8 does show some limited feedback between phases), but the idea of iteration was not embedded in the waterfall's philosophy. Some level of iteration is now incorporated in most versions of the waterfall, and review sessions among developers are commonplace. However, the opportunity to review and evaluate with users was not built into this model.

The spiral lifecycle model For many years the waterfall formed the basis of most software developments, but in 1988 Barry Boehm suggested the spiral model of software development (Boehm, 1988; see Figure 6.9). Two features of the spiral model are immediately clear from Figure 6.9: risk analysis and prototyping. The spiral model incorporates them in an iterative framework that allows ideas and progress to be repeatedly checked and evaluated. Each iteration around the spiral may be based on a different lifecycle model and may have different activities. In the spiral's case, it was not the need for user involvement that inspired the introduction of iteration but the need to identify and control risks. In Boehm's approach, development is driven by plans and specifications that focus on the risks involved in developing the system, rather than by the intended functionality, as was the case with the waterfall. Unlike the waterfall, the spiral explicitly encourages alternatives to be considered, and steps in which problems or potential problems are encountered to be re-addressed. The spiral idea has been used by others for interactive devices (see Box 6.4). A more recent version of the spiral, called the WinWin spiral model (Boehm et al., 1998), explicitly incorporates the identification of key stakeholders and their respective "win" conditions, i.e., what will be regarded as a satisfactory outcome for each stakeholder group. A period of stakeholder negotiation to ensure a "win-win" result is included.

Rapid Applications Development (RAD) During the 1990s the drive to focus upon users became stronger and resulted in a number of new approaches to development. The Rapid Applications Development (RAD) approach (Millington and Stapleton, 1995) attempts to take a user-centered view and to minimize the risk caused by requirements changing during the


Figure 6.9 The spiral lifecycle model of software development: repeated cycles (with cumulative cost growing as progress is made through the steps) of determining objectives, alternatives, and constraints; evaluating alternatives and identifying and resolving risks through risk analysis, prototypes, simulations, models, and benchmarks; developing and verifying the next-level product, from concept of operation and requirements through design, code, unit test, integration and test, and acceptance test to implementation; and planning the next phases.

course of the project. The ideas behind RAD began to emerge in the early 1990s, also in response to the inappropriate nature of the linear lifecycle models based on the waterfall. Two key features of a RAD project are:

• Time-limited cycles of approximately six months, at the end of which a system or partial system must be delivered. This is called time-boxing. In effect, this breaks down a large project into many smaller projects that can deliver products incrementally, and enhances flexibility in terms of the development techniques used and the maintainability of the final system.


• JAD (Joint Application Development) workshops in which users and developers come together to thrash out the requirements of the system (Wood and Silver, 1995). These are intensive requirements-gathering sessions in which difficult issues are faced and decisions are made. Representatives from each identified stakeholder group should be involved in each workshop so that all the relevant views can be heard.

A basic RAD lifecycle has five phases (see Figure 6.10): project set-up, JAD workshops, iterative design and build, engineer and test final prototype, and implementation review. The popularity of RAD has led to the emergence of an industry-standard RAD-based method called DSDM (Dynamic Systems Development Method). This was developed by a non-profit-making DSDM consortium made up of a group of companies that recognized the need for some standardization in the field. The first of the nine principles stated as underlying DSDM is that "active user involvement is imperative." The DSDM lifecycle is more complicated than the one we've shown here. It involves five phases: feasibility study, business study, functional model iteration, design and build iteration, and implementation. This is only a generic process and must be tailored for a particular organization.

ACTIVITY 6.5 How closely do you think the RAD lifecycle model relates to the interaction design model described in Section 6.4.1?

Comment

RAD and DSDM explicitly incorporate user involvement, evaluation and iteration. User involvement, however, appears to be limited to the JAD workshop, and iteration appears to be limited to the design and build phase. The philosophy underlying the interaction design model is present, but the flexibility appears not to be. Our interaction design process would be appropriately used within the design and build stage.

Figure 6.10 A basic RAD lifecycle model of software development: project initiation, JAD workshops, iterative design and build, evaluate final system, and implementation review.


BOX 6.4 A Product Design Process for Internet Appliances Netpliance, which has moved into the market of providing Internet appliances, i.e., one-stop products that allow a user to achieve a specific Internet-based task, has adopted a user-centered approach to development based on RAD (Isensee et al., 2000). They attribute their ability to develop systems from concept to delivery in seven months to this strong iterative approach: the architecture was revised and iterated over several days; the code was developed with weekly feedback sessions from users; components were typically revised four times, but some went through 12 cycles. Their simple spiral model is shown in Figure 6.11. The target audience for this appliance, called the i-opener, was people who did not use or own a PC and who may have been uncomfortable around computers. The designers were therefore looking to design something that would be as far away from the "traditional" PC model as possible in terms of both hardware and software. In designing the software, they abandoned the desktop metaphor of the Windows operating system and concentrated on an interface that provided good support for the user's task. For the hardware design, they needed to get away from the image of a large heavy box with lots of wires and plugs, any one of which may be faulty and cause the user problems.

Figure 6.11 Netpliance’s spiral development cycle (iterating through analysis, planning, design, and implementation).

The device provides three functions: sending and receiving email, categorical content, and web accessibility. That is it. There are no additional features, no complicated menus and options. The device is streamlined to perform these tasks and no more. This choice of functions was based on user studies and testing that served to identify the most frequently used functions, i.e., those that most appropriately supported the users. An example screen showing the news channel for i-opener is shown in Figure 6.12.

Identifying requirements for a new device is difficult. There is no direct experience of using a similar product, and so it is difficult to know what will be used, what will be needed, what will be frustrating, and what will be ignored. The Netpliance team started to gather information for their device by focusing on existing data about PC users: demographics, usability studies, areas of dissatisfaction, etc. They employed marketing research, focus groups, and user surveys to identify the key features of the appliance, and concentrated on delivering these fundamentals well.

The team was multidisciplinary and included hardware engineers, user interface designers, marketing specialists, test specialists, industrial designers, and visual designers. Users were involved throughout development and the whole team took an active part in the design. The interface was designed first, to meet user requirements, and then the hardware and software were developed to fit the interface. In all of this, the emphasis was on a lean development process with a minimum of documentation, early prototyping, and frequent iterations for each component. For example, the design of the hardware proceeded from sketches through pictures to physical prototypes that the users could touch, pick up, move around, and so on. To complement prototyping, the team also used usage scenarios, which are basically descriptions of the appliance’s use to achieve a task. These helped developers to understand how the product could be used from a user’s perspective. We will return to similar techniques in Chapter 7.

Implementation was achieved through rapid cycles of implement and test. Small usability tests were conducted throughout implementation to find and fix usability problems.


Figure 6.12 The news channel as part of the categorical content.

Developers and their families or friends were encouraged to use the appliance so that designers could enjoy the same experience as the users (called “eating your own dogfood”!). For these field tests, the product was instrumented so that the team could monitor how often each function was used. This data helped to prioritize the development of features as the product release deadline approached.
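As a rough illustration of this kind of field-test instrumentation, a sketch of per-function usage counting is shown below; the function names and events are hypothetical and the logging approach is assumed, not taken from Netpliance.

# A minimal sketch of field-test instrumentation: count how often each
# appliance function is invoked, then rank functions by use.
from collections import Counter

usage_counts = Counter()

def record_use(function_name):
    # Called whenever a user invokes a function during a field test.
    usage_counts[function_name] += 1

def usage_report():
    # Functions ordered from most to least used, to help prioritize features.
    return usage_counts.most_common()

# Hypothetical field-test events:
for event in ["email", "news_channel", "email", "web", "email"]:
    record_use(event)

print(usage_report())  # [('email', 3), ('news_channel', 1), ('web', 1)]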

6.4.3 Lifecycle models in HCI

Another of the traditions from which interaction design has emerged is the field of HCI (human–computer interaction). Fewer lifecycle models have arisen from this field than from software engineering and, as you would expect, they have a stronger tradition of user focus. We describe two of these here. The first one, the Star, was derived from empirical work on understanding how designers tackled HCI design problems. This represents a very flexible process with evaluation at its core. In contrast, the second one, the usability engineering lifecycle, shows a more structured approach and hails from the usability engineering tradition.

The Star Lifecycle Model

About the same time that those involved in software engineering were looking for alternatives to the waterfall lifecycle, so too were people involved in HCI looking for alternative ways to support the design of interfaces.


Figure 6.13 The Star lifecycle model (activities: task analysis/functional analysis, requirements/specification, conceptual design/formal design representation, prototyping, and implementation, with evaluation at the center).

In 1989, the Star lifecycle model was proposed by Hartson and Hix (1989) (see Figure 6.13). This emerged from some empirical work they did looking at how interface designers went about their work. They identified two different modes of activity: analytic mode and synthetic mode. The former is characterized by such notions as top-down, organizing, judicial, and formal, working from the systems view towards the user’s view; the latter is characterized by such notions as bottom-up, free-thinking, creative, and ad hoc, working from the user’s view towards the systems view. Interface designers move from one mode to another when designing. A similar behavior has been observed in software designers (Guindon, 1990).

Unlike the lifecycle models introduced above, the Star lifecycle does not specify any ordering of activities. In fact, the activities are highly interconnected: you can move from any activity to any other, provided you first go through the evaluation activity. This reflects the findings of the empirical studies. Evaluation is central to this model, and whenever an activity is completed, its result(s) must be evaluated. So a project may start with requirements gathering, or it may start with evaluating an existing situation, or by analyzing existing tasks, and so on.

ACTIVITY 6.6
The Star lifecycle model has not been used widely and successfully for large projects in industry. Consider the benefits of lifecycle models introduced above and suggest why this may be.

Comment

One reason may be that the Star lifecycle model is extremely flexible. This may be how designers work in practice, but as we commented above, lifecycle models are popular because “they allow developers, and particularly managers, to get an overall view of the development effort so that progress can be tracked, deliverables specified, resources allocated, targets set, and so on.” With a model as flexible as the Star lifecycle, it is difficult to control these issues without substantially changing the model itself.
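To make the Star model’s central constraint concrete (any activity may follow any other, but only by passing through evaluation), a small state-machine sketch follows. It is an illustration of the rule only, not part of Hartson and Hix’s model; the activity names are taken from Figure 6.13.

# A minimal sketch of the Star lifecycle rule: a transition between two
# activities is allowed only if one of them is evaluation.
ACTIVITIES = {
    "task/functional analysis",
    "requirements/specification",
    "conceptual/formal design",
    "prototyping",
    "implementation",
    "evaluation",
}

class StarProcess:
    def __init__(self, start):
        assert start in ACTIVITIES
        self.current = start

    def move_to(self, activity):
        # Enforce the rule: pass through evaluation between any two activities.
        assert activity in ACTIVITIES
        if "evaluation" not in (self.current, activity):
            raise ValueError("go through evaluation before starting a new activity")
        self.current = activity

process = StarProcess("task/functional analysis")
process.move_to("evaluation")   # evaluate the analysis results
process.move_to("prototyping")  # allowed: we are coming out of evaluation
# process.move_to("implementation")  # would raise: evaluation was skipped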

The Usability Engineering Lifecycle

The Usability Engineering Lifecycle was proposed by Deborah Mayhew in 1999 (Mayhew, 1999). Many people have written about usability engineering, and, as Mayhew herself says, “I did not invent the concept of a Usability Engineering Lifecycle. Nor did I invent any of the Usability Engineering tasks included in the lifecycle . . . .”


Figure 6.14 The Usability Engineering Lifecycle. (The figure shows three phases. Requirements Analysis covers user profile, task analysis, platform capabilities/constraints, general design principles, and usability goals, captured in a style guide, with a link to the OOSE requirements model. Design/Testing/Development proceeds through three levels: work reengineering and conceptual model design with mockups and iterative evaluation; screen design standards with prototyping and iterative evaluation; and detailed user interface design with iterative evaluation and unit/system testing, linked to the OOSE analysis, design, implementation, and test models. Decision points ask whether major flaws have been eliminated, usability goals have been met, and all functionality has been addressed. Installation covers user feedback until all issues are resolved, with enhancements for complex applications and a shorter path for simple applications such as websites.)

However, what her lifecycle does provide is a holistic view of usability engineering and a detailed description of how to perform usability tasks, and it specifies how usability tasks can be integrated into traditional software development lifecycles. It is therefore particularly helpful for those with little or no expertise in usability to see how the tasks may be performed alongside more traditional software engineering activities. For example, Mayhew has linked the stages with a general development approach (rapid prototyping) and a specific method (object-oriented software engineering (OOSE); Jacobson et al., 1992) that have arisen from software engineering.

The lifecycle itself has essentially three tasks: requirements analysis, design/testing/development, and installation, with the middle stage being the largest and involving many subtasks (see Figure 6.14). Note the production of a set of usability goals in the first task. Mayhew suggests that these goals be captured in a style guide that is then used throughout the project to help ensure that the usability goals are adhered to.

This lifecycle follows a similar thread to our interaction design model but includes considerably more detail. It includes stages of identifying requirements, designing, evaluating, and building prototypes. It also explicitly includes the style guide as a mechanism for capturing and disseminating the usability goals of the project. Recognizing that some projects will not require the level of structure presented in the full lifecycle, Mayhew suggests that some substeps can be skipped if they are unnecessarily complex for the system being developed.
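The decision points in Figure 6.14 (“Eliminated major flaws?”, “Met usability goals?”) amount to design-evaluate loops at each level, exited only when the usability goals recorded in the style guide are met. The sketch below is a simplified reading of that control flow, not Mayhew’s notation; the function parameters are placeholders.

# A rough sketch of the iterate-until-goals-met structure of the
# design/testing/development phase (simplified reading, not Mayhew's notation).
def run_design_level(design, evaluate, met_usability_goals, max_iterations=10):
    # Repeat design and evaluation at one level until the usability goals
    # captured in the style guide are judged to be met (or we give up).
    for _ in range(max_iterations):
        candidate = design()
        results = evaluate(candidate)
        if met_usability_goals(results):
            return candidate
    raise RuntimeError("usability goals not met; revisit requirements analysis")

# The three levels are worked through in order: conceptual model design,
# screen design standards, then detailed user interface design.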

ACTIVITY 6.7
Study the usability engineering lifecycle and identify how this model differs from our interaction design model described in Section 6.4.1, in terms of the iterations it supports.

Comment

One of the main differences between Mayhew’s model and ours is that in the former the iteration between design and evaluation is contained within the second phase. Iteration between the design/test/development phase and the requirements analysis phase occurs only after the conceptual model and the detailed designs have been developed, prototyped, and evaluated one at a time.


Our version models a return to the activity of identifying needs and establishing requirements after evaluating any element of the design.

Assignment

Nowadays, timepieces (such as clocks, wristwatches, etc.) have a variety of functions. They not only tell the time and date but they can speak to you, remind you when it’s time to do something, and provide a light in the dark, among other things. Mostly, however, the interface for these devices shows the time in one of two basic ways: as a digital number such as 23:40 or through an analog display with two or three hands—one to represent the hour, one for the minutes, and one for the seconds.

In this assignment, we want you to design an innovative timepiece for your own use. This could be in the form of a wristwatch, a mantelpiece clock, an electronic clock, or any other kind of clock you fancy. Your goal is to be inventive and exploratory. We have broken this assignment down into the following steps to make it clearer:

(a) Think about the interactive product you are designing: what do you want it to do for you? Find 3–5 potential users and ask them what they would want. Write a list of requirements for the clock, together with some usability criteria based on the definition of usability used in Chapter 1.

(b) Look around for similar devices and seek out other sources of inspiration that you might find helpful. Make a note of any findings that are interesting, useful, or insightful.

(c) Sketch out some initial designs for the clock. Try to develop at least two distinct alternatives that both meet your set of requirements.

(d) Evaluate the two designs, using your usability criteria and by role-playing an interaction with your sketches. Involve potential users in the evaluation, if possible. Does it do what you want? Is the time or other information being displayed always clear?

(e) Design is iterative, so you may want to return to earlier elements of the process before you choose one of your alternatives. Once you have a design with which you are satisfied, you can send it to us and we shall post a representative sample of those we receive to our website. Details of how to format your submission are available from our website.

Summary

In this chapter, we have looked at the process of interaction design, i.e., what activities are required in order to design an interactive product, and how lifecycle models show the relationships between these activities. A simple interaction design model consisting of four activities was introduced and issues surrounding the identification of users, generating alternative designs, and evaluating designs were discussed. Some lifecycle models from software engineering and HCI were introduced.

Key points

• The interaction design process consists of four basic activities: identifying needs and establishing requirements, developing alternative designs that meet those requirements, building interactive versions of the designs so that they can be communicated and assessed, and evaluating them.




• Key characteristics of the interaction design process are explicit incorporation of user involvement, iteration, and specific usability criteria.

• Before you can begin to establish requirements, you must understand who the users are and what their goals are in using the device.

• Looking at others’ designs provides useful inspiration and encourages designers to consider alternative design solutions, which is key to effective design.

• Usability criteria, technical feasibility, and users’ feedback on prototypes can all be used to choose among alternatives.

• Prototyping is a useful technique for facilitating user feedback on designs at all stages.

• Lifecycle models show how development activities relate to one another.

• The interaction design process is complementary to lifecycle models from other fields.

Further reading

RUDISILL, M., LEWIS, C., POLSON, P. B., AND MCKAY, T. D. (1995) (eds.) Human-Computer Interface Design: Success Stories, Emerging Methods, Real-World Context. San Francisco: Morgan Kaufmann. This collection of papers describes the application of different approaches to interface design. Included here is an account of the Xerox Star development, some advice on how to choose among methods, and some practical examples of real-world developments.

BERGMAN, ERIC (2000) (ed.) Information Appliances and Beyond. San Francisco: Morgan Kaufmann. This book is an edited collection of papers which report on the experience of designing and building a variety of ‘information appliances’, i.e., purpose-built computer-based products which perform a specific task, for example, the Palm Pilot, mobile telephones, a vehicle navigation system, and interactive toys for children.

MAYHEW, DEBORAH J. (1999) The Usability Engineering Lifecycle. San Francisco: Morgan Kaufmann. This is a very practical book about product user interface design. It explains how to perform usability tasks throughout development and provides useful examples along the way to illustrate the techniques. It links in with two software-development-based methods: rapid prototyping and object-oriented software engineering.

SOMMERVILLE, IAN (2001) Software Engineering (6th edition). Harlow, UK: Addison-Wesley. If you are interested in pursuing the software engineering aspects of the lifecycle models section, then this book provides a useful overview of the main models and their purpose.

NIELSEN, JAKOB (1993) Usability Engineering. San Francisco: Morgan Kaufmann. This is a seminal book on usability engineering. If you want to find out more about the philosophy, intent, history, or pragmatics of usability engineering, then this is a good place to start.


INTERVIEW


with Gillian Crampton Smith

Gillian Crampton Smith is Director of the Interaction Design Institute Ivrea near Milan, Italy.

Prior to this, she was at the Royal College of Art where she started and directed the Computer Related Design Department, developing a program to enable artist-designers to develop and apply their traditional skills and knowledge to the design of all kinds of interactive products and systems.

GC: I believe that things should work but they should also delight. In the past, when it was really difficult to make things work, that was what people concentrated on. But now it’s much easier to make software and much easier to make hardware. We’ve got a load of technologies but they’re still often not designed for people—and they’re certainly not very enjoyable to use. If we think about other things in our life, our clothes, our furniture, the things we eat with, we choose what we use because they have a meaning beyond their practical use. Good design is partly about working really well, but it’s also about what something looks like, what it reminds us of, what it refers to in our broader cultural environment. It’s this side that interactive systems haven’t really addressed yet. They’re only just beginning to become part of culture. They are not just a tool for professionals any more, but an environment in which we live.

HS: How do you think we can improve things?

GC: The parallel with architecture is quite an interesting one. In architecture, a great deal of time and expense is put into the initial design; I don’t think very much money or time is put into the initial design of software. If you think of the big software engineering companies, how many people work on the design side rather than on the implementation side?

HS: When you say design do you mean conceptual design, or task design, or something else?

GC: I mean all phases of design. Firstly there’s research—finding out about people. This is not necessarily limited to finding out what they want, because if we’re designing new things, they are probably things people don’t even know they could have.

At the Royal College of Art we tried to work with users, but to be inspired by them, and not constrained by what they know is possible. The second stage is thinking, “What should this thing we are designing do?” You could call that conceptual design. Then a third stage is thinking how do you represent it, how do you give it form? And then the fourth stage is actually crafting the interface—exactly what color is this pixel? Is this type the right size, or do you need a size bigger? How much can you get on a screen?—all those things about the details.

One of the problems companies have is that the feedback they get is, “I wish it did x.” Software looks as if it’s designed, not with a basic model of how it works that is then expressed on the interface, but as a load of different functions that are strung together. The desktop interface, although it has great advantages, encourages the idea that you have a menu and you can just add a few more bits when people want more things. In today’s word processors, for instance, there isn’t a clear conceptual model about how it works, or an underlying theory people can use to reason about why it is not working in the way they expect.

HS: So in trying to put more effort into the design aspect of things, do you think we need different people in the team?

GC: Yes. People in the software field tend to think that designers are people who know how to give the product form, which of course is one of the things they do. But a graphic designer, for instance, is somebody who also thinks at a more strategic level, “What is the message that these people want to get over and to whom?” and then, “What is the best way to give form to a message like that?” The part you see is the beautiful design, the lovely poster or record sleeve, or elegant book, but behind that is a lot of thinking about how to communicate ideas via a particular medium.

HS: If you’ve got people from different disciplines, have you experienced difficulties in communication?

GC: Absolutely. I think that people from different disciplines have different values, so different results and different approaches are valued. People have different temperaments, too, that have led them to the different fields in the first place, and they’ve been trained in different ways.


In my view the big difference between the way engineers are trained and the way designers are trained is that engineers are trained to focus in on a solution from the beginning whereas designers are trained to focus out to begin with and then focus in. They focus out and try lots of different alternatives, and they pick some and try them out to see how they go. Then they refine down. This is very hard for both the engineers and the designers because the designers are thinking the engineers are trying to hone in much too quickly and the engineers can’t bear the designers faffing about. They are trained to get their results in a completely different way.

HS: Is your idea to make each more tolerant of the other?

GC: Yes, my idea is not to try to make renaissance people, as I don’t think it’s feasible. Very few people can do everything well. I think the ideal team is made up of people who are really confident and good at what they do and open-minded enough to realize there are very different approaches. There’s the scientific approach, the engineering approach, the design approach. All three are different and that’s their value—you don’t want everybody to be the same. The best combination is where you have engineers who understand design and designers who understand engineering. It’s important that people know their limitations too. If you realize that you need an ergonomist, then you go and find one and you hire them to consult for you. So you need to know what you don’t know as well as what you do.

HS: What other aspects of traditional design do you think help with interaction design?

GC: I think the ability to visualize things. It allows people to make quick prototypes or models or sketches so that a group of people can talk about something concrete. I think that’s invaluable in the process. I think also making things that people like is just one of the things that good designers have a feel for.

HS: Do you mean aesthetically like or like in its whole sense?

GC: In its whole sense. Obviously there’s the aesthetic of what something looks like or feels like but there’s also the aesthetic of how it works as well.


You can talk about an elegant way of doing something as well as an elegant look.

HS: Another trait I’ve seen in designers is being protective of their design.

GC: I think that is both a vice and a virtue. In order to keep a design coherent you need to keep a grip on the whole and to push it through as a whole. Otherwise it can happen that people try to make this a bit smaller and cut bits out of that, and so on, and before you know where you are the coherence of the design is lost. It is quite difficult for a team to hold a coherent vision of a design. If you think of other design fields, like film-making, for instance, there is one director and everybody accepts that it’s the director’s vision. One of the things that’s wrong with products like Microsoft Word, for instance, is that there’s no coherent idea in it that makes you think, “Oh yes, I understand how this fits with that.”

Design is always a balance between things that work well and things that look good, and the ideal design satisfies everything, but in most designs you have to make trade-offs. If you’re making a game it’s more important that people enjoy it and that it looks good than to worry if some of it’s a bit difficult. If you’re making a fighter cockpit then the most important thing is that pilots don’t fall out of the sky, and so this informs the trade-offs you make. The question is, who decides how to decide the criteria for the trade-offs that inevitably need to be made. This is not a matter of engineering: it’s a matter of values—cultural, emotional, aesthetic.

HS: I know this is a controversial issue for some designers. Do you think users should be part of the design team?

GC: No, I don’t. I think it’s an abdication of responsibility. Users should definitely be involved as a source of inspiration, suggesting ideas, evaluating proposals—saying, “Yes, we think this would be great” or “No, we think this is an appalling idea.” But in the end, if designers aren’t better than the general public at designing things, what are they doing as designers?


Handbook of Human-Computer Interaction
Second, completely revised edition
M. Helander, T.K. Landauer, P. Prabhu (eds.)
© 1997 Elsevier Science B.V. All rights reserved.

Chapter 16

What do Prototypes Prototype?

Stephanie Houde and Charles Hill
Apple Computer, Inc., Cupertino, California, USA

16.1 Introduction
16.2 The Problem with Prototypes
16.2.1 What is a Prototype?
16.2.2 Current Terminology
16.3 A Model of What Prototypes Prototype
16.3.1 Definitions
16.3.2 The Model
16.3.3 Three Prototypes of One System
16.4 Further Examples
16.4.1 Role Prototypes
16.4.2 Look and Feel Prototypes
16.4.3 Implementation Prototypes
16.4.4 Integration Prototypes
16.5 Summary
16.6 Acknowledgments
16.7 Prototype Credits
16.8 References

16.1 Introduction

Prototypes are widely recognized to be a core means of exploring and expressing designs for interactive computer artifacts. It is common practice to build prototypes in order to represent different states of an evolving design and to explore options. However, since interactive systems are complex, it may be difficult or impossible to create prototypes of a whole design in the formative stages of a project. Choosing the right kind of more focused prototype to build is an art in itself, and communicating its limited purposes to its various audiences is a critical aspect of its use. The ways that we talk, and even think, about prototypes can get in the way of their effective use. Current terminology for describing prototypes centers on attributes of prototypes themselves, such as what tool was used to create them, and how refined-looking or -behaving they are. Such terms can be distracting. Tools can be used in many different ways, and detail is not a sure indicator of completion. We propose a change in the language used to talk about prototypes, to focus more attention on fundamental questions about the interactive system being designed: What role will the artifact play in a user's life? How should it look and feel? How should it be implemented? The goal of this chapter is to establish a model that describes any prototype in terms of the artifact being designed, rather than the prototype's incidental attributes. By focusing on the purpose of the prototype--that is, on what it prototypes--we can make better decisions about the kinds of prototypes to build. With a clear purpose for each prototype, we can better use prototypes to think and communicate about design.

In the first section we describe some current difficulties in communicating about prototypes: the complexity of interactive systems; issues of multi-disciplinary teamwork; and the audiences of prototypes. Next, we introduce the model and illustrate it with some initial examples of prototypes from real projects. In the following section we present several more examples to illustrate some further issues. We conclude the chapter with a summary of the main implications of the model for prototyping practice.

16.2 The Problem with Prototypes

Interactive computer systems are complex. Any artifact can have a rich variety of software, hardware, auditory, visual, and interactive features. For example, a personal digital assistant such as the Apple Newton has an operating system, a hard case with various ports, a graphical user interface and audio feedback. Users experience the combined effect of such interrelated features; and the task of designing--and prototyping--the user experience is therefore complex. Every aspect of the system must be designed (or inherited from a previous system), and many features need to be evaluated in combination with others.

Prototypes provide the means for examining design problems and evaluating solutions. Selecting the focus of a prototype is the art of identifying the most important open design questions. If the artifact is to provide new functionality for users--and thus play a new role in their lives--the most important questions may concern exactly what that role should be and what features are needed to support it. If the role is well understood, but the goal of the artifact is to present its functionality in a novel way, then prototyping must focus on how the artifact will look and feel. If the artifact's functionality is to be based on a new technique, questions of how to implement the design may be the focus of prototyping efforts.

Once a prototype has been created, there are several distinct audiences that designers discuss prototypes with. These are: the intended users of the artifact being designed; their design teams; and the supporting organizations that they work within (Erickson, 1995). Designers evaluate their options with their own team by critiquing prototypes of alternate design directions. They show prototypes to users to get feedback on evolving designs. They show prototypes to their supporting organizations (such as project managers, business clients, or professors) to indicate progress and direction.

It is difficult for designers to communicate clearly about prototypes to such a broad audience. It is challenging to build prototypes which produce feedback from users on the most important design questions. Even communication among designers requires effort due to differing perspectives in a multi-disciplinary design team. Limited understanding of design practice on the part of supporting organizations makes it hard for designers to explain their prototypes to them. Finally, prototypes are not self-explanatory: looks can be deceiving. Clarifying what aspects of a prototype correspond to the eventual artifact--and what don't--is a key part of successful prototyping.

16.2.1 What is a Prototype?

Designing interactive systems demands collaboration between designers of many different disciplines (Kim, 1990). For example, a project might require the skills of a programmer, an interaction designer, an industrial designer, and a project manager. Even the term "prototype" is likely to be ambiguous on such a team. Everyone has a different expectation of what a prototype is. Industrial designers call a molded foam model a prototype. Interaction designers refer to a simulation of on-screen appearance and behavior as a prototype.

Programmers call a test program a prototype. A user studies expert may call a storyboard which shows a scenario of something being used a prototype.

The organization supporting a design project may have an overly narrow expectation of what a prototype is. Schrage (1996) has shown that organizations develop their own "prototyping cultures" which may cause them to consider only certain kinds of prototypes to be valid. In some organizations, only prototypes which act as proof that an artifact can be produced are respected. In others, only highly detailed representations of look and feel are well understood.

Is a brick a prototype? The answer depends on how it is used. If it is used to represent the weight and scale of some future artifact, then it certainly is: it prototypes the weight and scale of the artifact. This example shows that prototypes are not necessarily self-explanatory. What is significant is not what media or tools are used to create them, but how they are used by a designer to explore or demonstrate some aspect of the future artifact.

16.2.2 Current Terminology

Current ways of talking about prototypes tend to focus on attributes of the prototype itself, such as which tool was used to create it (as in "C", "Director™", and "paper" prototypes); and on how finished-looking or -behaving a prototype is (as in "high-fidelity" and "low-fidelity" prototypes). Such characterizations can be misleading because the capabilities and possible uses of tools are often misunderstood and the significance of the level of finish is often unclear, particularly to non-designers.

Tools can be used in many different ways. Sometimes tools which have high-level scripting languages (like HyperCard™), rather than full programming languages (like C), are thought to be unsuitable for producing user-testable prototypes. However, Ehn and Kyng (1991) have shown that even prototypes made of cardboard are very useful for user testing. In the authors' experience, no one tool supports iterative design work in all of the important areas of investigation. To design well, designers must be willing to use different tools for different prototyping tasks, and to team up with other people with complementary skills.

Finished-looking (or -behaving) prototypes are often thought to indicate that the design they represent is near completion. Although this may sometimes be the case, a finished-looking prototype might be made early in the design process (e.g., a 3D concept model for use in market research), and a rough one might be made later on (e.g., to emphasize overall structure rather than visual details in a user test).

Two related terms are used in this context: "resolution" and "fidelity". We interpret resolution to mean "amount of detail", and fidelity to mean "closeness to the eventual design". It is important to recognize that the degree of visual and behavioral refinement of a prototype does not necessarily correspond to the solidity of the design, or to a particular stage in the process.

16.3 A Model of What Prototypes Prototype

16.3.1 Definitions

Before proceeding, we define some important terms. We define artifact as the interactive system being designed. An artifact may be a commercially released product or any end-result of a design activity such as a concept system developed for research purposes. We define prototype as any representation of a design idea, regardless of medium. This includes a pre-existing object when used to answer a design question. We define designer as anyone who creates a prototype in order to design, regardless of job title.

16.3.2 The Model

The model shown in Figure 1 represents a three-dimensional space which corresponds to important aspects of the design of an interactive artifact. We define the dimensions of the model as role; look and feel; and implementation. Each dimension corresponds to a class of questions which are salient to the design of any interactive system. "Role" refers to questions about the function that an artifact serves in a user's life--the way in which it is useful to them. "Look and feel" denotes questions about the concrete sensory experience of using an artifact--what the user looks at, feels, and hears while using it. "Implementation" refers to questions about the techniques and components through which an artifact performs its function--the "nuts and bolts" of how it actually works. The triangle is drawn askew to emphasize that no one dimension is inherently more important than any other.

Goal of the Model: Given a design problem (of any scope or size), designers can use the model to separate design issues into three classes of questions which frequently demand different approaches to prototyping. Implementation usually requires a working system to be built; look and feel requires the concrete user experience to be simulated or actually created; role requires the context of the artifact's use to be established. Being explicit about what design questions must be answered is therefore an essential aid to deciding what kind of prototype to build. The model helps visualize the focus of exploration.

Figure 1. A model of what prototypes prototype (a triangle whose corners are labeled Role, Look and feel, and Implementation).

Markers: A prototype may explore questions or design options in one, two or all three dimensions of the model. In this chapter, several prototypes from real design projects are presented as examples. Their relationship to the model is represented by a marker on the triangle. This is a simple way to put the purpose of any prototype in context for the designer and their audiences. It gives a global sense of what the prototype is intended to explore; and equally important, what it does not explore. It may be noted that the triangle is a relative and subjective representation. A location toward one corner of the triangle implies simply that in the designer's own judgment, more attention is given to the class of questions represented by that corner than to the other two.
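The marker idea can be captured with a tiny data structure: a prototype's position is simply the relative attention the designer judges it gives to each of the three classes of questions. The sketch below is an illustration of the model only, not code from the chapter; the prototype name and weights are hypothetical.

# A minimal sketch of a prototype "marker": relative attention given to
# role, look and feel, and implementation questions.
from dataclasses import dataclass

@dataclass
class PrototypeMarker:
    name: str
    role: float
    look_and_feel: float
    implementation: float

    def dominant_dimension(self):
        # Which corner of the triangle the marker sits nearest to.
        weights = {
            "role": self.role,
            "look and feel": self.look_and_feel,
            "implementation": self.implementation,
        }
        return max(weights, key=weights.get)

# A hypothetical role-leaning prototype (assumed weights, for illustration only).
storyboard = PrototypeMarker("paper storyboard", role=0.8,
                             look_and_feel=0.2, implementation=0.0)
print(storyboard.dominant_dimension())  # "role"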

16.3.3 Three Prototypes of One System

The model is best explained further through an example from a real project. The three prototypes shown in Examples 1-3 were created during the early stages of development of a 3D space-planning application (Houde, 1992). The goal of the project was to design an example of a 3D application which would be accessible to a broad range of non-technical users. As such it was designed to work on a personal computer with an ordinary mouse. Many prototypes were created by different members of the multi-disciplinary design team during the project.


Figure 2. Relationship of three prototypes (Examples 1-3) to the model.

Example 1. Role prototype for 3D space-planning application [E1: Houde 1990].

The prototype shown in Example 1 was built to show how a user might select furniture from an on-line catalog and try it out in an approximation of their own room. It is an interactive slide show which the designer operated by clicking on key areas of the rough user interface. The idea that virtual space-planning would be a helpful task for non-technical users came from user studies. The purpose of the prototype was to quickly convey the proposed role of the artifact to the design team and members of the supporting organization. Since the purpose of the prototype was primarily to explore and visualize an example of the role of the future artifact, its marker appears very near the role corner of the model in Figure 2. It is placed a little toward the look and feel corner because it also explored user interface elements in a very initial form.

One of the challenges of the project was to define an easy-to-use direct manipulation user interface for moving 3D objects with an ordinary 2D mouse cursor. User testing with a foam-core model showed that the most important manipulations of a space-planning task were sliding, lifting, and turning furniture objects.


Example 2. Look and feel prototype for 3D space-planning application [E2: Houde 1990].

Example 2 shows a picture of a prototype which was made to test a user interface featuring this constrained set of manipulations. Clicking once on the chair caused its bounding box to appear. This "handle box" offered hand-shaped controls for lifting and turning the box and chair object (as if the chair was frozen inside the box). Clicking and dragging anywhere on the box allowed the unit to slide on a 3D floor. The prototype was built using Macromedia Director (a high-level animation and scripting tool). It was made to work only with the chair data shown: a set of images pre-drawn for many angles of rotation.

The purpose of the Example 2 prototype was to get feedback from users as quickly as possible as to whether the look and feel of the handle-box user interface was promising. Users of the prototype were given tasks which encouraged them to move the chair around a virtual room. Some exploration of role was supported by the fact that the object manipulated was a chair, and space-planning tasks were given during the test. Although the prototype was interactive, the programming that made it so did not seriously explore how a final artifact with this interface might be implemented. It was only done in service of the look and feel test. Since the designer primarily explored the look and feel of the user interface, this prototype's marker is placed very near the look and feel corner of the model in Figure 2.

A technical challenge of the project was figuring out how to render 3D graphics quickly enough on equipment that end-users might have. At the time, it was not clear how much real-time 3D interaction could be achieved on the Apple Macintosh™ II fx computer--the fastest Macintosh then available.


Example 3. Implementation prototypes for 3D space-planning application [E3: Chen 1990].

Example 3 shows a prototype which was built primarily to explore rendering capability and performance. This was a working prototype in which multiple 3D objects could be manipulated as in Example 2, and the view of the room could be changed to any perspective. Example 3 was made in a programming environment that best supported the display of true 3D perspectives during manipulation. It was used by the design team to determine what complexity of 3D scenes was reasonable to design for. The user interface elements shown on the left side of the screen were made by the programmer to give himself controls for demonstrating the system: they were not made to explore the look and feel of the future artifact. Thus the primary purpose of the prototype was to explore how the artifact might be implemented. The marker for this example is placed near the implementation corner (Figure 2).

One might assume that the role prototype (Example 1) was developed first, then the look and feel prototype (Example 2), and finally the implementation prototype (Example 3): that is, in order of increasing detail and production difficulty. In fact, these three prototypes were developed almost in parallel. They were built by different design team members during the early stages of the project. No single prototype could have represented the design of the future artifact at that time. The evolving design was too fuzzy--existing mainly as a shared concept in the minds of the designers. There were also too many open and interdependent questions in every design dimension: role, look and feel, implementation. Making separate prototypes enabled specific design questions to be addressed with as much clarity as possible. The solutions found became inputs to an integrated design. Answers to the rendering capability questions addressed by Example 3 informed the design of the role that the artifact could play (guiding how many furniture objects of what complexity could be shown).


Figure 3. Four principal categories of prototypes on the model.

It also provided guiding constraints for the direct manipulation user interface (determining how much detail the handle forms could have). Similarly, issues of role addressed by Example 1 informed the implementation problem by constraining it: only a constrained set of manipulations was needed for a space-planning application. It also simplified the direct manipulation user interface by limiting the necessary actions, and therefore controls, which needed to be provided. It was more efficient to wait on the results of independent investigations in the key areas of role, look and feel and implementation than to try to build a monolithic prototype that integrated all features from the start.

After sufficient investigation in separate prototypes, the prototype in Example 3 began to evolve into an integrated prototype which could be described by a position at the center of our model. A version of the user interface developed in Example 2 was implemented in the prototype in Example 3. Results of other prototypes were also integrated. This enabled a more complete user test of features and user interface to take place.

This set of three prototypes from the same project shows how a design problem can be simultaneously approached from multiple points of view. Design questions of role, look and feel, and implementation were explored concurrently by the team with the three separate prototypes. The purpose of the model is to make it easier to develop and subsequently communicate about this kind of prototyping strategy.

16.4 Further Examples

In this section we present twelve more examples of prototypes taken from real projects, and discuss them in terms of the model. Examples are divided into four categories which correspond to the four main regions of the model, as indicated in Figure 3. The first three categories correspond to prototypes with a strong bias toward one of the three corners: role, look and feel, and implementation prototypes, respectively. Integration prototypes occupy the middle of the model: they explore a balance of questions in all three dimensions.

Figure 4. Relationship of role prototypes (Examples 4-7) to the model.

Example 4. Storyboard for a portable notebook computer [E4: Vertelney 1990].

16.4.1 Role Prototypes

Role prototypes are those which are built primarily to investigate questions of what an artifact could do for a user. They describe the functionality that a user might benefit from, with little attention to how the artifact would look and feel, or how it could be made to actually work. Designers find such prototypes useful to show their design teams what the target role of the artifact might be; to communicate that role to their supporting organization; and to evaluate the role in user studies.

A Portable Notebook Computer: The paper storyboard shown in Example 4 was an early prototype of a portable notebook computer for students which would accept both pen and finger input. The scenario shows a student making notes, annotating a paper, and marking pages for later review in a computer notebook. The designer presented the storyboard to her design team to focus discussion on the issues of what functionality the notebook should provide and how it might be controlled through pen and finger interaction. In terms of the model, this prototype primarily explored the role of the notebook by presenting a rough task scenario for it. A secondary consideration was a rough approximation of the user interface. Its marker, shown in Figure 4, is therefore positioned near the role corner of the model and a little toward look and feel.

Storyboards like this one are considered to be effective design tools by many designers because they help focus design discussion on the role of an artifact very early on. However, giving them status as prototypes is not common because the medium is paper and thus seems very far from the medium of an interactive computer system. We consider this storyboard to be a prototype because it makes a concrete representation of a design idea and serves the purpose of asking and answering design questions. Of course, if the designer needed to evaluate a user's reaction to seeing the notebook or to using the pen-and-finger interaction, it would be necessary to build a prototype which supported direct interaction. However, it might be wasteful to do so before considering design options in the faster, lighter-weight medium of pencil and paper.

An Operating System User Interface: Example 5 shows a screen view of a prototype that was used to explore the design of a new operating system. The prototype was an interactive story: it could only be executed through a single, ordered sequence of interactions. Clicking with a cursor on the mailbox picture opened a mail window; then clicking on the voice tool brought up a picture of some sound tools; and so on. To demonstrate the prototype, the designer sat in front of a computer and play-acted the role of a user opening her mail, replying to it, and so forth. The prototype was used in design team discussions and also demonstrated to project managers to explain the current design direction. According to the model, this prototype primarily explored the role that certain features of the operating system could play in a user's daily tasks.


Example 5. Interactive story for an operating system interface [E5: Vertelney and Wong 1990].

It was also used to outline very roughly how its features would be portrayed and how a user would interact with it. As in the previous example, the system's implementation was not explored. Its marker is shown in Figure 4.

To make the prototype, user interface elements were hand-drawn and scanned in. Transitions between steps in the scenario were made interactive in Macromedia Director. This kind of portrayal of on-screen interface elements as rough and hand-drawn was used in order to focus design discussion on the overall features of a design rather than on specific details of look and feel or implementation (Wong, 1992). Ironically, while the design team understood the meaning of the hand-drawn graphics, other members of the organization became enamored with the sketchy style to the extent that they considered using it in the final artifact. This result was entirely at odds with the original reasons for making a rough-looking prototype. This example shows how the effectiveness of some kinds of prototypes may be limited to a specific kind of audience.

The Knowledge Navigator: Example 6 shows a scene from Apple Computer's Knowledge Navigator™ video. The videotape tells a day-in-the-life story of a professor using a futuristic notebook computer (Dubberly and Mitch, 1987). An intelligent agent named "Phil" acts as his virtual personal assistant, finding information related to a lecture, reminding him of his mother's birthday, and connecting him with other professors via video-link. The professor interacts with Phil by talking, and Phil apparently recognizes everything said as well as a human assistant would. Based on the model, the Knowledge Navigator is identified primarily as a prototype which describes the role that the notebook would play in such a user's life.


Example 6. Knowledge Navigator™ vision video for a future notebook computer [E6: Dubberly and Mitch 1987].

The story is told in great detail, and it is clear that many decisions were made about what to emphasize in the role. The video also shows specific details of appearance, interaction, and performance. However, they were not intended by the designers to be prototypes of look and feel. They were merely place-holders for the actual design work which would be necessary to make the product really work. Thus its marker goes directly on the role corner (Figure 4).

Thanks to the video's special effects, the scenario of the professor interacting with the notebook and his assistant looks like a demonstration of a real product. Why did Apple make a highly produced prototype when the previous examples show that a rapid paper storyboard or a sketchy interactive prototype were sufficient for designing a role and telling a usage story? The answer lies in the kind of audience. The tape was shown publicly and to Apple employees as a vision of the future of computing. Thus the audience of the Knowledge Navigator was very broad--including almost anyone in the world. Each of the two previous role design prototypes was shown to an audience which was well informed about the design project. A rough hand-drawn prototype would not have made the idea seem real to the broad audience the video addressed: high resolution was necessary to help people concretely visualize the design. Again, while team members learn to interpret abstract kinds of prototypes accurately, less expert audiences cannot normally be expected to understand such approximate representations.

The Integrated Communicator: Example 7 shows an appearance model of an Integrated Communicator created for customer research into alternate presentations of new technology (ID Magazine 1995). It was one of three presentations of possible mechanical configurations and interaction designs, each built to the same high finish and accompanied by a video describing on-screen interactions. In the study, the value of each presentation was evaluated relative to the others, as perceived by study subjects during one-on-one interviews. The prototype was used to help subjects imagine such a product in the store and in their homes or offices, and thus to evaluate whether they would purchase such a product, how much they would expect it to cost, what features they would expect, etc. The prototype primarily addresses the role of the product, by presenting carefully designed cues which imply a telephone-like role and look-and-feel. Figure 4 shows its marker near the role corner of the model.

As with the Knowledge Navigator, the very high-resolution look and feel was a means of making the design as concrete as possible to a broad audience. In this case, however, it also enabled a basic interaction design strategy to be worked out and demonstrated. The prototype did not address implementation. The key feature of this kind of prototype is that it is a concrete and direct representation, as visually finished as actual consumer products. These attributes encourage an uncoached person to directly relate the design to their own environment, and to the products they own or see in stores. High quality appearance models are costly to build. There are two common reasons for investing in one: to get a visceral response by making the design seem "real" to any audience (design team, organization, and potential users); and to verify the intended look and feel of the artifact before committing to production tooling. An interesting side-effect of this prototype was that its directness made it a powerful prop for promoting the project within the organization.



Example 7. Appearance model for the integrated communicator [E7: Udagawa 1995].

Figure 5. Relationships of the look and feel prototypes (Examples 8-10) to the model.

16.4.2 Look and Feel Prototypes

Look and feel prototypes are built primarily to explore and demonstrate options for the concrete experience of an artifact. They simulate what it would be like to look at and interact with, without necessarily investigating the role it would play in the user's life or how it would be made to work. Designers make such prototypes to visualize different look and feel possibilities for themselves and their design teams. They ask users to interact with them to see how the look and feel could be improved. They also use them to give members of their supporting organization a concrete sense of what the future artifact will be like.

A Fashion Design Workspace: The prototype shown in Example 8 was developed to support research into collaboration tools for fashion designers (Hill et al., 1993; Scaife et al., 1994). A twenty-minute animation, it presented the concept design for a system for monitoring garment design work. It illustrated in considerable detail the translation of a proven paper-based procedure into a computer-based system with a visually rich, direct manipulation user interface. The prototype's main purposes were to confirm to the design team that an engaging and effective look and feel could be designed for this application, and to convince managers of the possibilities of the project. It was presented to users purely for informal discussion.

This is an example of a look and feel prototype. The virtue of the prototype was that it enabled a novel user interface design to be developed without having first to implement complex underlying technologies. While the role was inherited from existing fashion design practice, the prototype also demonstrated new options offered by the new computer-based approach. Thus, Figure 5 shows its marker in the look and feel area of the model.

One issue with prototypes like this one is that inexperienced audiences tend to believe them to be more functional than they are just by virtue of being shown on a computer screen. When this prototype was shown, the designers found they needed to take great care to explain that the design was not implemented.

Houde and Hill

Example 8. Animation of the look and feel of a fashion design workspace [E8: Hill 1992].

375

Example 10. Pizza-box prototype of an architect's computer [El 0: Apple Design Project, 1992].

role that the toy would play. Neither seriously addressed implementation. The designers of these very efficient prototypes wanted to know how a child would respond to a toy that appeared to speak and move of its own free will. They managed to convincingly simulate novel and difficult-to-implement technologies such as speech and automotion, for minimal cost and using readily available components. By using a "man behind the curtain" (or "Wizard of Oz") technique, the designers were able to present the prototypes directly to children and to directly evaluate their effect. Example 9. Look and feel simulation prototypes for a child's toy [E9: Bellman et al, 19931.

functional than they are just by virtue of being shown on a computer screen. When this prototype was shown, the designers found they needed to take great care to explain that the design was not implemented. A Learning Toy: The "GloBall" project was a concept for a children's toy: a ball that would interact with children who played with it. Two prototypes from the project are shown, disassembled, in Example 9. The design team wanted the ball to speak back to kids when they spoke to it, and to roll towards or away from them in reaction to their movements. The two prototypes were built to simulate these functions separately. The ball on the left had a walkie-talkie which was concealed in use. A hidden operator spoke into a linked walkie-talkie to simulate the bali's speech while a young child played with it. Similarly, the ball on the right had a radiocontrolled car which was concealed in use. A hidden operator remotely controlled the car, thus causing the ball to roll around in response to the child' s actions. As indicated by the marker in Figure 5, both prototypes were used to explore the toy's look and feel from a child's viewpoint, and to a lesser extent to evaluate the

An Architect's Computer: This example concerned the design of a portable computer for architects who need to gather a lot of information during visits to building sites. One of the first questions the designers explored was what form would be appropriate for their users. Without much ado they weighted the pizza box shown in Example 10 to the expected weight of the computer, and gave it to an architect to carry on a site visit. They watched how he carried the box, what else he carried with him, and what tasks he needed to do during the visit. They saw that the rectilinear form and weight were too awkward, given the other materials he carried with him, and this simple insight led them to consider a softer form. As shown by its marker, this is an example of a rough look and feel prototype (Figure 5). Role was also explored in a minor way by seeing the context that the artifact would be used in. The pizza box was a very efficient prototype. Spending virtually no time building it or considering options, the students got useful feedback on a basic design questionmwhat physical form would be best for the user. From what they learned in their simple field test, they knew immediately that they should try to think beyond standard rectilinear notebook computer forms. They began to consider many different options

376

Figure 6. Relationships of implementation prototypes (Examples 11 and 12) to the model

including designing the computer to feel more like a soft shoulder bag.

16.4.3 Implementation Prototypes

Some prototypes are built primarily to answer technical questions about how a future artifact might actually be made to work. They are used to discover methods by which adequate specifications for the final artifact can be achieved, without having to define its look and feel or the role it will play for a user. (Some specifications may be unstated, and may include externally imposed constraints, such as the need to reuse existing components or production machinery.) Designers make implementation prototypes as experiments for themselves and the design team, to demonstrate to their organization the technical feasibility of the artifact, and to get feedback from users on performance issues.

A Digital Movie Editor: Some years ago it was not clear how much interactivity could be added to digital movies playing on personal computers. Example 11 shows a picture of a prototype that was built to investigate solutions to this technical challenge. It was an application, written in the C programming language to run on an Apple Macintosh computer. It offered a variety of movie data-processing functionality such as controlling various attributes of movie play. The main goal of the prototype was to allow marking of points in a movie to which scripts (which added interactivity) would be attached. As indicated by the marker in Figure 6, this was primarily a carefully planned implementation prototype. Many options were evaluated about the best way to implement its functions. The role that the functions would play was less well defined. The visible look and feel of the prototype was largely incidental: it was created by the designer almost purely to demonstrate the available functionality, and was not intended to be used by others.
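
The original C source of this prototype is not reproduced in the chapter. As a purely illustrative sketch of the kind of mechanism being explored (the class and function names below are invented for illustration and are not taken from Degen's code), the marking facility can be reduced to a table that associates time points in a movie with scripts to run when playback passes them:

#include <functional>
#include <iostream>
#include <map>

// Hypothetical sketch: attach "scripts" (modeled here as arbitrary callables)
// to marked time points in a movie, keyed by time in milliseconds.
class MarkedMovie {
public:
    void markPoint(int timeMs, std::function<void()> script) {
        marks_[timeMs] = std::move(script);
    }

    // Called from the playback loop: runs every script attached between the
    // previous position (exclusive) and the current position (inclusive).
    void advanceTo(int previousMs, int currentMs) {
        auto first = marks_.upper_bound(previousMs);
        auto last = marks_.upper_bound(currentMs);
        for (auto it = first; it != last; ++it) it->second();
    }

private:
    std::map<int, std::function<void()>> marks_;  // time point -> attached script
};

int main() {
    MarkedMovie movie;
    movie.markPoint(2000, [] { std::cout << "show caption\n"; });
    movie.markPoint(5000, [] { std::cout << "pause and wait for a click\n"; });
    movie.advanceTo(0, 3000);     // playback passes the 2000 ms mark
    movie.advanceTo(3000, 6000);  // playback passes the 5000 ms mark
}

In the real prototype such a table would of course be driven by the movie playback engine and a scripting language rather than by a stand-alone test program.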


Example 11. Working prototype of a digital movie editor [E11: Degen, 1994].

This prototype received varying responses when demonstrated to a group of designers who were not members of the movie editor design team. When the audience understood that an implementation design was being demonstrated, discussion was focused productively. At other times it became focused on problems with the user interface, such as the multiple cascading menus, which were hard to control and visually confusing. In these cases, discussion was less productive: the incidental user interface got in the way of the intentional implementation. The project leader shared some reflections after this somewhat frustrating experience. He said that part of his goal in pursuing a working prototype alone was to move the project through an organization that respected this kind of prototype more than "smoke and mirrors" prototypes, ones which only simulate functionality. He added that one problem might have been that the user interface was neither good enough nor bad enough to avoid misunderstandings. The edit list, which allowed points to be marked in movies, was a viable look and feel design, while the cascading menus were not. For the audience that the prototype was shown to, it might have been more effective to stress the fact that look and feel were not the focus of the prototype; and perhaps, time permitting, to have complemented this prototype with a separate look and feel prototype that explained their intentions in that dimension.

A Fluid Dynamics Simulation System: Example 12 shows a small part of the C++ program listing for a system for simulating gas flows and combustion in car engines, part of an engineering research project (Hill, 1993). One goal of this prototype was to demonstrate the feasibility of object-oriented programming using the C++ language in place of procedural programs written in the older FORTRAN language. Object-oriented programming can in theory lead to increased software reuse, better reliability and easier maintenance. Since an engine simulation may take a week to run on the fastest available computers and is extremely memory-intensive, it was important to show that the new approach did not incur excessive performance or memory overheads. The program listing shown was the implementation of the operation to copy one list of numbers to another. When tested, it was shown to be faster than the existing FORTRAN implementation. The prototype was built primarily for the design team's own use, and eventually used to create a deployable system. The marker in Figure 6 indicates that this prototype primarily explored implementation.

Other kinds of implementation prototypes include demonstrations of new algorithms (e.g., a graphical rendering technique or a new search technology), and trial conversions of existing programs to run in new environments (e.g., converting a program written in the C language to the Java language). Implementation prototypes can be hard to build, and since they actually work, it is common for them to find their way directly into the final system. Two problems arise from this dynamic: firstly, programs developed mainly to demonstrate feasibility may turn out in the long term to be difficult to maintain and develop; and secondly, their temporary user interfaces may never be properly redesigned before the final system is released. For these reasons it is often desirable to treat even implementation prototypes as disposable, and to migrate successful implementation designs to a new integrated prototype as the project progresses.

Example 12. C++ program sample from a fluid dynamics simulation system [E12: Hill, 1993].
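
The listing itself is not reproduced here. Written in present-day C++ rather than the dialect of 1993, the operation it implemented, copying one list of numbers to another without per-element overhead, might look roughly like the following sketch (the class and member names are ours, not Hill's):

#include <cassert>
#include <cstddef>
#include <cstring>

// Illustrative sketch of a numeric list whose copy operation is a single
// bulk transfer, so that the C++ abstraction adds little overhead over a
// hand-written FORTRAN loop.
class NumberList {
public:
    explicit NumberList(std::size_t n) : size_(n), data_(new double[n]()) {}
    ~NumberList() { delete[] data_; }
    NumberList(const NumberList&) = delete;

    NumberList& operator=(const NumberList& other) {
        assert(size_ == other.size_);
        std::memcpy(data_, other.data_, size_ * sizeof(double));  // bulk copy
        return *this;
    }

    double& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    double* data_;
};

int main() {
    NumberList a(1000), b(1000);
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = double(i);
    b = a;  // the copy operation that would be timed against the FORTRAN version
}

The value of such a prototype lies less in the code than in the measurement it makes possible: timing kernels like this one against the existing FORTRAN routines on realistic problem sizes.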

Figure 7. Relationships of integration prototypes (Examples 13-15) to the model.

16.4.4 Integration Prototypes

Integration prototypes are built to represent the complete user experience of an artifact. Such prototypes bring together the artifact's intended design in terms of role, look and feel, and implementation. Integrated prototypes help designers to balance and resolve constraints arising in different design dimensions; to verify that the design is complete and coherent; and to find synergy in the design of the integration itself. In some cases the integration design may become the unique innovation or feature of the final artifact. Since the user's experience of an artifact ultimately combines all three dimensions of the model, integration prototypes are most able to accurately simulate the final artifact. Since they may need to be as complex as the final artifact, they are the most difficult and time-consuming kinds of prototypes to build. Designers make integration prototypes to understand the design as a whole, to show their organizations a close approximation to the final artifact, and to get feedback from users about the overall design.

The Sound Browser: The "SoundBrowser" prototype shown in Example 13 was built as part of a larger project which investigated uses of audio for personal computer users (Degen et al., 1992). The prototype was built in C to run on a Macintosh. It allowed a user to browse digital audio data recorded on a special personal tape recorder equipped with buttons for marking points in the audio. The picture shows the SoundBrowser's visual representation of the audio data, showing the markers below the sound display. A variety of functions were provided for reviewing sound, such as high-speed playback and playback of marked segments of audio. This prototype earns a position right in the center of the model, as shown in Figure 7. All three dimensions of the model were explored and represented in the prototype. The role of the artifact was well thought out, being driven initially by observations of what users currently do to mark and play back audio, and then by iteratively designed scenarios of how it might be done more efficiently if electronic marking and viewing functions were offered. The look and feel of the prototype went through many visual design iterations.



Example 13. Integrated prototype of a sound browser [E13: Degen, 1993].

The implementation was redesigned several times to meet the performance needs of the desired high-speed playback function. When the SoundBrowser was near completion it was prepared for a user test. One of the features which the design team intended to evaluate was the visual representation of the sound in the main window. They wanted to show users several alternatives to understand their preferences. The programmer who built the SoundBrowser had developed most of the alternatives. In order to refine these and explore others, two other team members copied screen-shots from the tool into a pixel-painting application, where they experimented with modifications. This was a quick way to try out different visual options, in temporary isolation from other aspects of the artifact. It was far easier to do this in a visual design tool than by programming in C. When finished, the new options were programmed into the integrated prototype. This example shows the value of using different tools for different kinds of design exploration, and how even at the end of a project simple, low-fidelity prototypes might be built to solve specific problems.
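
The alternative displays themselves are not shown here. As a hedged illustration of the kind of computation that commonly underlies such a display (it is not taken from the SoundBrowser's code), a waveform overview can be built by reducing the samples under each horizontal pixel to a minimum and maximum value:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Illustrative sketch: reduce audio samples to one (min, max) pair per
// horizontal pixel, a common basis for drawing a waveform overview.
std::vector<std::pair<float, float>> peakEnvelope(
    const std::vector<float>& samples, std::size_t pixels) {
    std::vector<std::pair<float, float>> envelope(pixels, {0.0f, 0.0f});
    if (samples.empty() || pixels == 0) return envelope;
    const std::size_t perPixel = std::max<std::size_t>(1, samples.size() / pixels);
    for (std::size_t p = 0; p < pixels; ++p) {
        const std::size_t first = p * perPixel;
        if (first >= samples.size()) break;
        const std::size_t last = std::min(samples.size(), first + perPixel);
        auto bounds = std::minmax_element(samples.begin() + first, samples.begin() + last);
        envelope[p] = {*bounds.first, *bounds.second};
    }
    return envelope;
}

int main() {
    std::vector<float> samples(44100);                      // one second of fake audio
    for (std::size_t i = 0; i < samples.size(); ++i)
        samples[i] = static_cast<float>(i % 100) / 100.0f;  // simple ramp signal
    const auto envelope = peakEnvelope(samples, 400);       // a 400-pixel-wide display
    std::cout << envelope.size() << " columns computed\n";
}

How such an envelope is then rendered (filled, outlined, scaled logarithmically, and so on) is exactly the kind of visual alternative that is cheaper to explore in a paint program than in C.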

The Pile Metaphor: The prototype shown in Example 14 was made as part of the development of the "pile" metaphor, a user interface element for casual organization of information (Mander et al., 1992; Rose et al., 1993). It represented the integration of designs developed in several other prototypes which independently explored the look and feel of piles, "content-aware" information retrieval, and the role that piles could play as a part of an operating system. In the pile metaphor, each electronic document was represented by a small icon or "proxy", several of which were stacked to form a pile. The contents of the pile could be quickly reviewed by moving the arrow cursor over it. While the cursor was over a particular document, the "viewing cone" to the right displayed a short text summary of the document. This prototype was shown to designers, project managers, and software developers as a proof of concept of the novel technology. The implementation design in this prototype might have been achieved with virtually no user interface: just text input and output. However, since the prototype was to be shown to a broad audience, an integrated style of prototype was chosen, both to communicate the implementation point and to verify that the piles representation was practically feasible. It helped greatly that the artifact's role and look and feel could be directly inherited from previous prototypes. Figure 7 shows its marker on the model.

Example 14. Integration prototype of the "Pile" metaphor for information retrieval [E14: Rose, 1993].
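
As a purely hypothetical sketch of the browsing behaviour just described (the names and structure are invented here, not taken from Rose's implementation), a pile can be modeled as a stack of document proxies in which the proxy under the cursor selects the summary shown in the viewing cone:

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical sketch of the "pile" browsing behaviour: moving the cursor
// over the stacked proxies selects the document whose summary is displayed.
struct Document {
    std::string title;
    std::string summary;  // short, content-derived text for the viewing cone
};

class Pile {
public:
    void add(Document d) { docs_.push_back(std::move(d)); }

    // Map a cursor height within the pile (0.0 = bottom, 1.0 = top) to the
    // proxy at that height and return its summary for display.
    const std::string& summaryAt(double cursorHeight) const {
        static const std::string empty;
        if (docs_.empty()) return empty;
        if (cursorHeight < 0.0) cursorHeight = 0.0;
        auto index = static_cast<std::size_t>(cursorHeight * docs_.size());
        if (index >= docs_.size()) index = docs_.size() - 1;
        return docs_[index].summary;
    }

private:
    std::vector<Document> docs_;  // bottom of the pile first
};

int main() {
    Pile pile;
    pile.add({"budget.txt", "Draft budget for the spring collection"});
    pile.add({"memo.txt", "Meeting notes on fabric suppliers"});
    std::cout << pile.summaryAt(0.9) << "\n";  // summary of the top proxy
}

A text-only version, with no stacked icons or viewing cone, would have demonstrated the content-aware retrieval just as well; choosing the integrated form instead is precisely the trade-off described above.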

A Garment History Browser: The prototype in Example 15 was a working system which enabled users to enter and retrieve snippets of information about garment designs via a visually rich user interface (Hill et al., 1993; Scaife et al., 1994). The picture shows the query tool which was designed to engage fashion designers and provide memorable visual cues. The prototype was designed for testing in three corporations with a limited set of users' actual data, and presented to users in interviews. It was briefly demonstrated, then users were asked to try queries and enter remarks about design issues they were currently aware of. This prototype was the end-result of a progression from an initial focus on role (represented by verbal usage scenarios), followed by rough look and feel prototypes and an initial implementation. Along the way various ideas were explored, refined or rejected. The working tool, built in Allegiant SuperCard™, required two months' intensive work by two designers. In retrospect the designers had mixed feelings about it. It was highly motivating to users to be able to manipulate real user data through a novel user interface, and much was learned about the design. However, the designers also felt that they had had to invest a large amount of time in making the prototype, yet had only been able to support a very narrow role compared to the breadth shown in the animation of Example 8. Many broader design questions remained unanswered.


Example 15. Integrated prototype of a garment history browser [E15: Hill and Kamlish, 1992].

16.5 Summary

In this chapter, we have proposed a change in the language used by designers to think and talk about prototypes of interactive artifacts. Much current terminology centers on attributes of prototypes themselves: the tools used to create them, or how refined-looking or -behaving they are. Yet tools can be used in many different ways, and resolution can be misleading. We have proposed a shift in attention to focus on questions about the design of the artifact itself: What role will it play in a user's life? How should it look and feel? How should it be implemented? The model that we have introduced can be used by designers to divide any design problem into these three classes of questions, each of which may benefit from a different approach to prototyping. We have described a variety of prototypes from real projects, and have shown how the model can be used to communicate about their purposes. Several practical suggestions for designers have been raised by the examples:

• Define "prototype" broadly. Efficient prototypes produce answers to their designers' most important questions in the least amount of time. Sometimes very simple representations make highly effective prototypes: e.g., the pizza-box prototype of an architect's computer [Example 10] and the storyboard notebook [Example 1]. We define a prototype as any representation of a design idea, regardless of medium; and designers as the people who create them, regardless of their job titles.

• Build multiple prototypes. Since interactive artifacts can be very complex, it may be impossible to create an integrated prototype in the formative stages of a project, as in the 3D space-planning example [Examples 1, 2, and 3]. Choosing the right focused prototypes to build is an art in itself. Be prepared to throw some prototypes away, and to use different tools for different kinds of prototypes.

• Know your audience. The necessary resolution and fidelity of a prototype may depend most on the nature of its audience. A rough role prototype such as the interactive storyboard [Example 4] may work well for a design team but not for members of the supporting organization. Broader audiences may require higher-resolution representations. Some organizations expect to see certain kinds of prototypes: implementation designs are often expected in engineering departments, while look-and-feel and role prototypes may rule in a visual design environment.

• Know your prototype; prepare your audience. Be clear about what design questions are being explored with a given prototype, and what are not. Communicating the specific purposes of a prototype to its audience is a critical aspect of its use. It is up to the designer to prepare an audience for viewing a prototype. Prototypes themselves do not necessarily communicate their purpose. It is especially important to clarify what is and what is not addressed by a prototype when presenting it to any audience beyond the immediate design team.

By focusing on the purpose of the prototype (that is, on what it prototypes), we can make better decisions about the kinds of prototypes to build. With a clear purpose for each prototype, we can better use prototypes to think and communicate about design.

Figure 8. Relationships of all examples to the model.
1. 3D space-planning (role)
2. 3D space-planning (look and feel)
3. 3D space-planning (implementation)
4. Storyboard for portable notebook computer
5. Interactive story, operating system user interface
6. Vision video, notebook computer
7. Appearance model, integrated communicator
8. Animation, fashion design workspace
9. Look and feel simulation, child's toy
10. Pizza-box, architect's computer
11. Working prototype, digital movie editor
12. C++ program listing, fluid dynamics simulation
13. Integrated prototype, sound browser
14. Integrated prototype, pile metaphor
15. Integrated prototype, garment history browser

16.6 Acknowledgments

Special thanks are due to Thomas Erickson for guidance with this chapter, and to our many colleagues whose prototypes we have cited, for their comments on early drafts. We would also like to acknowledge S. Joy Mountford, whose leadership of the Human Interface Group at Apple created an atmosphere in which creative prototyping could flourish. Finally, thanks to James Spohrer, Lori Leahy, Dan Russell, and Donald Norman at Apple Research Labs for supporting us in writing this chapter.

16.7 Prototype Credits

We credit here the principal designer and design team of each example prototype shown.

[E1] Stephanie Houde, [E2] Stephanie Houde, [E3] Michael Chen (1990). © Apple Computer, Inc. Project team: Penny Bauersfeld, Michael Chen, Lewis Knapp (project leader), Laurie Vertelney and Stephanie Houde.

[E4] Laurie Vertelney (1990). © Apple Computer, Inc. Project team: Michael Chen, Thomas Erickson, Frank Leahy, Laurie Vertelney (project leader).

[E5] Laurie Vertelney and Yin Yin Wong (1990). © Apple Computer, Inc. Project team: Richard Mander, Gitta Salomon (project leader), Ian Small, Laurie Vertelney, Yin Yin Wong.

[E6] Dubberly, H. and Mitch, D. (1987). © Apple Computer, Inc. The Knowledge Navigator (videotape).

[E7] Masamichi Udagawa (1995). © Apple Computer, Inc. Project team: Charles Hill, Heiko Sacher, Nancy Silver, Masamichi Udagawa.

[E8] Charles Hill (1992). © Royal College of Art, London. Design team: Gillian Crampton Smith, Eleanor Curtis, Charles Hill, Stephen Kamlish (all of the RCA), Mike Scaife (Sussex University, UK), and Philip Joe (IDEO, London).

[E9] Tom Bellman, Byron Long, Abba Lustgarten (1993). University of Toronto, 1993 Apple Design Project. © Apple Computer, Inc.

[E10] 1992 Apple Design Project. © Apple Computer, Inc.

[E11] Leo Degen (1994). © Apple Computer, Inc. Project team: Leo Degen, Stephanie Houde, Michael Mills (team leader), David Vronay.

[E12] Charles Hill (1993). Doctoral thesis project, Imperial College of Science, Technology and Medicine, London, UK. Project team: Charles Hill, Henry Weller.

[E13] Leo Degen (1993). © Apple Computer, Inc. Project team: Leo Degen, Richard Mander, Gitta Salomon (team leader), Yin Yin Wong.

[E14] Daniel Rose (1993). © Apple Computer, Inc. Project team: Penny Bauersfeld, Leo Degen, Stephanie Houde, Richard Mander, Ian Small, Gitta Salomon (team leader), Yin Yin Wong.

[E15] Charles Hill and Stephen Kamlish (1992). © Royal College of Art, London. Design team: Gillian Crampton Smith, Eleanor Curtis, Charles Hill, Stephen Kamlish (all of the RCA), and Mike Scaife (Sussex University, UK).

16.8 References

Degen, L., Mander, R., and Salomon, G. (1992). Working with Audio: Integrating Personal Tape Recorders and Desktop Computers. Human Factors in Computing Systems: CHI'92 Conference Proceedings. New York: ACM, pp. 413-418.

Dubberly, H. and Mitch, D. (1987). The Knowledge Navigator. Apple Computer, Inc. videotape.

Ehn, P. and Kyng, M. (1991). Cardboard Computers: Mocking-it-up or Hands-on the Future. In Design at Work: Cooperative Design of Computer Systems (ed. Greenbaum, J. and Kyng, M.). Hillsdale, NJ: Lawrence Erlbaum, pp. 169-195.

Erickson, T. (1995). Notes on Design Practice: Stories and Prototypes as Catalysts for Communication. In Envisioning Technology: The Scenario as a Framework for the System Development Life Cycle (ed. Carroll, J.). Addison-Wesley.

Hill, C. (1993). Software Design for Interactive Engineering Simulation. Doctoral Thesis. Imperial College of Science, Technology and Medicine, University of London.

Hill, C., Crampton Smith, G., Curtis, E., and Kamlish, S. (1993). Designing a Visual Database for Fashion Designers. Human Factors in Computing Systems: INTERCHI'93 Adjunct Proceedings. New York: ACM, pp. 49-50.

Houde, S. (1992). Iterative Design of an Interface for Easy 3-D Direct Manipulation. Human Factors in Computing Systems: CHI'92 Conference Proceedings. New York: ACM, pp. 135-142.

I.D. Magazine (1995). Apple's Shared Conceptual Model. The International Design Magazine: 41st Annual Design Review, July-August 1995, USA, pp. 206-207.

Kim, S. (1990). Interdisciplinary Collaboration. In The Art of Human Computer Interface Design (ed. B. Laurel). Reading, MA: Addison-Wesley, pp. 31-44.

Mander, R., Salomon, G., and Wong, Y.Y. (1992). A 'Pile' Metaphor for Supporting Casual Organization of Information. Human Factors in Computing Systems: CHI'92 Conference Proceedings. New York: ACM, pp. 627-634.

Rose, D.E., Mander, R., Oren, T., Ponceleón, D.B., Salomon, G., and Wong, Y. (1993). Content Awareness in a File System Interface: Implementing the 'Pile' Metaphor for Organizing Information. Research and Development in Information Retrieval: SIGIR Conference Proceedings. Pittsburgh, PA: ACM, pp. 260-269.

Scaife, M., Curtis, E., and Hill, C. (1994). Interdisciplinary Collaboration: a Case Study of Software Development for Fashion Designers. Interacting with Computers, Vol. 6, No. 4, pp. 395-410.

Schrage, M. (1996). Cultures of Prototyping. In Bringing Design to Software (ed. T. Winograd). USA: ACM Press, pp. 191-205.

Wong, Y.Y. (1992). Rough and Ready Prototypes: Lessons from Graphic Design. Human Factors in Computing Systems: CHI'92 Conference, Posters and Short Talks. New York: ACM, pp. 83-84.


Chapter 52. Prototyping Tools and Techniques

Michel Beaudouin-Lafon, Université Paris-Sud, [email protected]
Wendy E. Mackay, INRIA, [email protected]

1. Introduction

"A good design is better than you think" (Rex Heftman, cited by Raskin, 2000).

Design is about making choices. In many fields that require creativity and engineering skill, such as architecture or automobile design, prototypes both inform the design process and help designers select the best solution. This chapter describes tools and techniques for using prototypes to design interactive systems. The goal is to illustrate how they can help designers generate and share new ideas, get feedback from users or customers, choose among design alternatives, and articulate reasons for their final choices. We begin with our definition of a prototype and then discuss prototypes as design artifacts, introducing four dimensions for analyzing them. We then discuss the role of prototyping within the design process, in particular the concept of a design space and how it is expanded and contracted by generating and selecting design ideas. The next three sections describe specific prototyping approaches: rapid prototyping, both off-line and on-line, for the early stages of design; iterative prototyping, which uses on-line development tools; and evolutionary prototyping, which must be based on a sound software architecture.

What is a prototype? We define a prototype as a concrete representation of part or all of an interactive system. A prototype is a tangible artifact, not an abstract description that requires interpretation. Designers, as well as managers, developers, customers and end-users, can use these artifacts to envision and reflect upon the final system.

Note that prototypes may be defined differently in other fields. For example, an architectural prototype is a scaled-down model of the final building. This is not possible for interactive system prototypes: the designer may limit the amount of information the prototype can handle, but the actual interface must be presented at full scale. Thus, a prototype interface to a database may handle only a small pseudo-database but must still present a full-size display and interaction techniques. Full-scale, one-of-a-kind models, such as a hand-made dress sample, are another type of prototype. These usually require an additional design phase in order to mass-produce the final design. Some interactive system prototypes begin as one-of-a-kind models which are then distributed widely (since the cost of duplicating software is so low). However, most successful software prototypes evolve into the final product and then continue to evolve as new versions of the software are released.



Hardware and software engineers often create prototypes to study the feasibility of a technical process. They conduct systematic, scientific evaluations with respect to pre-defined benchmarks and, by systematically varying parameters, fine-tune the system. Designers in creative fields, such as typography or graphic design, create prototypes to express ideas and reflect on them. This approach is intuitive, oriented more to discovery and generation of new ideas than to evaluation of existing ideas. Human-Computer Interaction is a multi-disciplinary field which combines elements of science, engineering and design (Mackay and Fayard, 1997; Dykstra-Erikson et al., 2001). Prototyping is primarily a design activity, although we use software engineering to ensure that software prototypes evolve into technically sound working systems and we use scientific methods to study the effectiveness of particular designs.

2. Prototypes as design artifacts

We can look at prototypes as both concrete artifacts in their own right or as important components of the design process. When viewed as artifacts, successful prototypes have several characteristics: they support creativity, helping the developer to capture and generate ideas, facilitate the exploration of a design space and uncover relevant information about users and their work practices. They encourage communication, helping designers, engineers, managers, software developers, customers and users to discuss options and interact with each other. They also permit early evaluation, since they can be tested in various ways, including traditional usability studies and informal user feedback, throughout the design process.

We can analyze prototypes and prototyping techniques along four dimensions:

• Representation describes the form of the prototype, e.g., sets of paper sketches or computer simulations;
• Precision describes the level of detail at which the prototype is to be evaluated, e.g., informal and rough or highly polished;
• Interactivity describes the extent to which the user can actually interact with the prototype, e.g., watch-only or fully interactive; and
• Evolution describes the expected life-cycle of the prototype, e.g., throwaway or iterative.

2.1 Representation

Prototypes serve different purposes and thus take different forms. A series of quick sketches on paper can be considered a prototype; so can a detailed computer simulation. Both are useful; both help the designer in different ways. We distinguish between two basic forms of representation: off-line and on-line.

Off-line prototypes (also called paper prototypes) do not require a computer. They include paper sketches, illustrated story-boards, cardboard mock-ups and videos. The most salient characteristic of off-line prototypes (of interactive systems) is that they are created quickly, usually in the early stages of design, and they are usually thrown away when they have served their purpose.

On-line prototypes (also called software prototypes) run on a computer. They include computer animations, interactive video presentations, programs written with scripting languages, and applications developed with interface builders. The cost of producing on-line prototypes is usually higher, and may require skilled programmers to implement advanced interaction and/or visualization techniques or to meet tight performance constraints. Software prototypes are usually more effective in the later stages of design, when the basic design strategy has been decided.

In our experience, programmers often argue in favor of software prototypes even at the earliest stages of design. Because they are already familiar with a programming language, these programmers believe it will be faster and more useful to write code than to "waste time" creating paper prototypes. In twenty years of prototyping, in both research and industrial settings, we have yet to find a situation in which this is true.

First, off-line prototypes are very inexpensive and quick. This permits a very rapid iteration cycle and helps prevent the designer from becoming overly attached to the first possible solution. Off-line prototypes make it easier to explore the design space (see section 3.1), examining a variety of design alternatives and choosing the most effective solution. On-line prototypes introduce an intermediary between the idea and the implementation, slowing down the design cycle.

Second, off-line prototypes are less likely to constrain how the designer thinks. Every programming language or development environment imposes constraints on the interface, limiting creativity and restricting the number of ideas considered. If a particular tool makes it easy to create scroll-bars and pull-down menus and difficult to create a zoomable interface, the designer is likely to limit the interface accordingly. Considering a wider range of alternatives, even if the developer ends up using a standard set of interface widgets, usually results in a more creative design.

Finally, and perhaps most importantly, off-line prototypes can be created by a wide range of people, not just programmers. Thus all types of designers, technical or otherwise, as well as users, managers and other interested parties, can all contribute on an equal basis. Unlike programming software, modifying a storyboard or cardboard mock-up requires no particular skill. Collaborating on paper prototypes not only increases participation in the design process, but also improves communication among team members and increases the likelihood that the final design solution will be well accepted.

Although we believe strongly in off-line prototypes, they are not a panacea. In some situations, they are insufficient to fully evaluate a particular design idea. For example, interfaces requiring rapid feedback to users or complex, dynamic visualizations usually require software prototypes. However, particularly when using video and Wizard-of-Oz techniques, off-line prototypes can be used to create very sophisticated representations of the system. Prototyping is an iterative process and all prototypes provide information about some aspects while ignoring others. The designer must consider the purpose of the prototype (Houde and Hill, 1997) at each stage of the design process and choose the representation that is best suited to the current design question.

2.2 Precision

Prototypes are explicit representations that help designers, engineers and users reason about the system being built. By their nature, prototypes require details. A verbal description such as "the user opens the file" or "the system displays the results" provides no information about what the user actually does. Prototypes force designers to show the interaction: just how does the user open the file and what are the specific results that appear on the screen?




Precision refers to the relevance of details with respect to the purpose of the prototype.¹ For example, when sketching a dialog box, the designer specifies its size, the positions of each field and the titles of each label. However, not all these details are relevant to the goal of the prototype: it may be necessary to show where the labels are, but too early to choose the text. The designer can convey this by writing nonsense words or drawing squiggles, which shows the need for labels without specifying their actual content.

Although it may seem contradictory, a detailed representation need not be precise. This is an important characteristic of prototypes: those parts of the prototype that are not precise are those open for future discussion or for exploration of the design space. Yet they need to be incarnated in some form so the prototype can be evaluated and iterated. The level of precision usually increases as successive prototypes are developed and more and more details are set. The forms of the prototypes reflect their level of precision: sketches tend not to be precise, whereas computer simulations are usually very precise. Graphic designers often prefer using hand sketches for early prototypes because the drawing style can directly reflect what is precise and what is not: the wiggly shape of an object or a squiggle that represents a label are directly perceived as imprecise. This is more difficult to achieve with an on-line drawing tool or a user-interface builder. The form of the prototype must be adapted to the desired level of precision.

Precision defines the tension between what the prototype states (relevant details) and what the prototype leaves open (irrelevant details). What the prototype states is subject to evaluation; what the prototype leaves open is subject to more discussion and design space exploration.

2.3 Interactivity

An important characteristic of HCI systems is that they are interactive: users both respond to them and act upon them. Unfortunately, designing effective interaction is difficult: many interactive systems (including many web sites) have a good "look" but a poor "feel". HCI designers can draw from a long tradition in visual design for the former, but have relatively little experience with how interactive software systems should be used: personal computers have only been commonplace for about a decade. Another problem is that the quality of interaction is tightly linked to the end users and a deep understanding of their work practices: a word processor designed for a professional typographer requires a different interaction design than one designed for secretaries, even though ostensibly they serve similar purposes. Designers must take the context of use into account when designing the details of the interaction.

A critical role for an interactive system prototype is to illustrate how the user will interact with the system. While this may seem more natural with on-line prototypes, in fact it is often easier to explore different interaction strategies with off-line prototypes. Note that interactivity and precision are orthogonal dimensions. One can create an imprecise prototype that is highly interactive, such as a series of paper screen images in which one person acts as the user and the other plays the system.

¹ Note that the terms low-fidelity and high-fidelity prototypes are often used in the literature. We prefer the term precision because it refers to the content of the prototype itself, not its relationship to the final, as-yet-undefined system.



Or, one may create a very precise but non-interactive prototype, such as a detailed animation that shows feedback from a specific action by a user.

Prototypes can support interaction in various ways. For off-line prototypes, one person (often with help from others) plays the role of the interactive system, presenting information and responding to the actions of another person playing the role of the user. For on-line prototypes, parts of the software are implemented, while others are "played" by a person. (This approach, called the Wizard of Oz after the character in the 1939 movie of the same name, is explained in section 4.1.) The key is that the prototype feels interactive to the user.

Prototypes can support different levels of interaction. Fixed prototypes, such as video clips or pre-computed animations, are non-interactive: the user cannot interact, or pretend to interact, with them. Fixed prototypes are often used to illustrate or test scenarios (see chapter 53). Fixed-path prototypes support limited interaction. The extreme case is a fixed prototype in which each step is triggered by a pre-specified user action. For example, the person controlling the prototype might present the user with a screen containing a menu. When the user points to the desired item, she presents the corresponding screen showing a dialog box. When the user points to the word "OK", she presents the screen that shows the effect of the command. Even though the position of the click is irrelevant (it is used as a trigger), the person in the role of the user can get a feel for the interaction. Of course, this type of prototype can be much more sophisticated, with multiple options at each step. Fixed-path prototypes are very effective with scenarios and can also be used for horizontal and task-based prototypes (see section 3.1). Open prototypes support large sets of interactions. Such prototypes work like the real system, with some limitations. They usually only cover part of the system (see vertical prototypes, section 3.1), and often have limited error-handling or reduced performance relative to that of the final system.

Prototypes may thus illustrate or test different levels of interactivity. Fixed prototypes simply illustrate what the interaction might look like. Fixed-path prototypes provide designers and users with the experience of what the interaction might be like, but only in pre-specified situations. Open prototypes allow designers to test a wide range of examples of how users will interact with the system.
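
For instance, a minimal on-line fixed-path prototype (a sketch of the idea only, not code from any particular project) can be nothing more than a scripted sequence of pre-authored screens in which any user action serves as the trigger for the next step:

#include <iostream>
#include <string>
#include <vector>

// Minimal sketch of a fixed-path prototype: the screens are pre-authored and
// the user's input is ignored except as a trigger to advance to the next one.
int main() {
    const std::vector<std::string> screens = {
        "Screen 1: main menu (the user 'chooses' the Print command)",
        "Screen 2: print dialog box (the user 'clicks' OK)",
        "Screen 3: confirmation that the document was printed",
    };
    for (const auto& screen : screens) {
        std::cout << screen << "\n(press Enter to continue)\n";
        std::string trigger;
        std::getline(std::cin, trigger);  // content of the input is irrelevant
    }
}

A paper version of the same prototype replaces the program with a person turning over the screens, which is usually faster to produce and just as informative at this stage.
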
2.4 Evolution

Prototypes have different life spans: rapid prototypes are created for a specific purpose and then thrown away; iterative prototypes evolve, either to work out some details (increasing their precision) or to explore various alternatives; and evolutionary prototypes are designed to become part of the final system.

Rapid prototypes are especially important in the early stages of design. They must be inexpensive and easy to produce, since the goal is to quickly explore a wide variety of possible types of interaction and then throw them away. Note that rapid prototypes may be off-line or on-line. Creating precise software prototypes, even if they must be re-implemented in the final version of the system, is important for detecting and fixing interaction problems. Section 4 presents specific prototyping techniques, both off-line and on-line.

Iterative prototypes are developed as a reflection of a design in progress, with the explicit goal of evolving through several design iterations. Designing prototypes that support evolution is sometimes difficult. There is a tension between evolving toward the final solution and exploring an unexpected design direction, which may be adopted or thrown away completely. Each iteration should inform some aspect of the design. Some iterations explore different variations of the same theme. Others may systematically increase precision, working out the finer details of the interaction. Section 5 describes tools and techniques for creating iterative prototypes.

Figure 1: Evolutionary prototypes of the Apple Lisa: July 1979 (left), October 1980 (right) (Perkins et al., 1997).

Evolutionary prototypes are a special case of iterative prototypes in which the prototype evolves into part or all of the final system (Figure 1). Obviously this only applies to software prototypes. Extreme Programming (Beck, 2000) advocates this approach, tightly coupling design and implementation and building the system through constant evolution of its components. Evolutionary prototypes require more planning and practice than the approaches above because the prototypes are both representations of the final system and the final system itself, making it more difficult to explore alternative designs.

We advocate a combined approach, beginning with rapid prototypes and then using iterative or evolutionary prototypes according to the needs of the project. Section 6 describes how to create evolutionary prototypes, by building upon software architectures specifically designed to support interactive systems.

3. Prototypes and the design process

In the previous section, we looked at prototypes as artifacts, i.e. the results of a design process. Prototypes can also be seen as artifacts for design, i.e. as an integral part of the design process. Prototyping helps designers think: prototypes are the tools they use to solve design problems. In this section we focus on prototyping as a process and its relationship to the overall design process.

User-centered design

The field of Human-Computer Interaction is both user-centered (Norman & Draper, 1986) and iterative. User-centered design places the user at the center of the design process, from the initial analysis of user requirements (see chapters 48-50 in this volume) to testing and evaluation (see chapters 56-59 in this volume). Prototypes support this goal by allowing users to see and experience the final system long before it is built. Designers can identify functional requirements, usability problems and performance issues early and improve the design accordingly.




Iterative design involves multiple design-implement-test loops², enabling the designer to generate different ideas and successively improve upon them. Prototypes support this goal by allowing designers to evaluate concrete representations of design ideas and select the best.

Prototypes reveal the strengths as well as the weaknesses of a design. Unlike pure ideas, abstract models or other representations, they can be contextualized to help understand how the real system would be used in a real setting. Because prototypes are concrete and detailed, designers can explore different real-world scenarios and users can evaluate them with respect to their current needs. Prototypes can be compared directly with other, existing systems, and designers can learn about the context of use and the work practices of the end users. Prototypes can help designers (re)analyze the user's needs during the design process, not abstractly as with traditional requirements analysis, but in the context of the system being built.

Participatory design

Participatory (also called Cooperative) Design is a form of user-centered design that actively involves the user in all phases of the design process (see Greenbaum and Kyng, 1991, and chapter 54 in this volume). Users are not simply consulted at the beginning and called in to evaluate the system at the end; they are treated as partners throughout. This early and active involvement of users helps designers avoid unpromising design paths and develop a deeper understanding of the actual design problem. Obtaining user feedback at each phase of the process also changes the nature of the final evaluation, which is used to fine-tune the interface rather than discover major usability problems.

A common misconception about participatory design is that designers are expected to abdicate their responsibilities as designers, leaving the design to the end user. In fact, the goal is for designers and users to work together, each contributing their strengths to clarify the design problem as well as explore design solutions. Designers must understand what users can and cannot contribute. Usually, users are best at understanding the context in which the system will be used and subtle aspects of the problems that must be solved. Innovative ideas can come from both users and designers, but the designer is responsible for considering a wide range of options that might not be known to the user and balancing the trade-offs among them.

Because prototypes are shared, concrete artifacts, they serve as an effective medium for communication within the design team. We have found that collaborating on prototype design is an effective way to involve users in participatory design. Prototypes help users articulate their needs and reflect on the efficacy of design solutions proposed by designers.

3.1 Exploring the design space

Design is not a natural science: the goal is not to describe and understand existing phenomena but to create something new. Designers do, of course, benefit from scientific research findings and they may use scientific methods to evaluate interactive systems. But designers also require specific techniques for generating new ideas and balancing complex sets of trade-offs, to help them develop and refine design ideas.

² Software engineers refer to this as the Spiral model (Boehm, 1988).




Designers from fields such as architecture and graphic design have developed the concept of a design space, which constrains design possibilities along some dimensions, while leaving others open for creative exploration. Ideas for the design space come from many sources: existing systems, other designs, other designers, external inspiration and accidents that prompt new ideas. Designers are responsible for creating a design space specific to a particular design problem. They explore this design space, expanding and contracting it as they add and eliminate ideas. The process is iterative: more cyclic than reductionist. That is, the designer does not begin with a rough idea and successively add more precise details until the final solution is reached. Instead, she begins with a design problem, which imposes a set of constraints, and generates a set of ideas to form the initial design space. She then explores this design space, preferably with the user, and selects a particular design direction to pursue. This closes off part of the design space, but opens up new dimensions that can be explored. The designer generates additional ideas along these dimensions, explores the expanded design space, and then makes new design choices. Design principles (e.g., Beaudouin-Lafon and Mackay, 2000) help this process by guiding it both in the exploration and choice phases. The process continues, in a cyclic expansion and contraction of the design space, until a satisfying solution is reached.

All designers work with constraints: not just limited budgets and programming resources, but also design constraints. These are not necessarily bad: one cannot be creative along all dimensions at once. However, some constraints are unnecessary, derived from poor framing of the original design problem. If we consider a design space as a set of ideas and a set of constraints, the designer has two options. She can modify ideas within the specified constraints or modify the constraints to enable new sets of ideas. Unlike traditional engineering, which treats the design problem as a given, designers are encouraged to challenge, and if necessary, change the initial design problem. If she reaches an impasse, the designer can either generate new ideas or redefine the problem (and thus change the constraints). Some of the most effective design solutions derive from a more careful understanding and reframing of the design brief.

Note that all members of the design team, including users, may contribute ideas to the design space and help select design directions from within it. However, it is essential that these two activities are kept separate. Expanding the design space requires creativity and openness to new ideas. During this phase, everyone should avoid criticizing ideas and concentrate on generating as many as possible. Clever ideas, half-finished ideas, silly ideas, impractical ideas: all contribute to the richness of the design space and improve the quality of the final solution. In contrast, contracting the design space requires critical evaluation of ideas. During this phase, everyone should consider the constraints and weigh the trade-offs. Each major design decision must eliminate part of the design space: rejecting ideas is necessary in order to experiment with and refine others and make progress in the design process. Choosing a particular design direction should spark new sets of ideas, and those new ideas are likely to pose new design problems.
In summary, exploring a design space is the process of moving back and forth between creativity and choice. Prototypes aid designers in both aspects of working with a design space: generating concrete representations of new ideas and clarifying specific design directions. The next two sections describe techniques that have proven most useful in our own prototyping work, both for research and product development.

Expanding the design space: Generating ideas

The most well-known idea generation technique is brainstorming, introduced by Osborn (1957).




His goal was to create synergy within the members of a group: ideas suggested by one participant would spark ideas in other participants. Subsequent studies (Collaros and Anderson, 1969; Diehl and Stroebe, 1987) challenged the effectiveness of group brainstorming, finding that aggregates of individuals could produce the same number of ideas as groups. They found certain effects, such as production blocking, free-riding and evaluation apprehension, were sufficient to outweigh the benefits of synergy in brainstorming groups. Since then, many researchers have explored different strategies for addressing these limitations. For our purposes, the quantity of ideas is not the only important measure: the relationships among members of the group are also important. As de Vreede et al. (2000) point out, one should also consider elaboration of ideas, as group members react to each other's ideas. We have found that brainstorming, including a variety of variants, is an important group-building exercise for participatory design. Designers may, of course, brainstorm ideas by themselves. But brainstorming in a group is more enjoyable and, if it is a recurring part of the design process, plays an important role in helping group members share and develop ideas together.

The simplest form of brainstorming involves a small group of people. The goal is to generate as many ideas as possible on a pre-specified topic: quantity, not quality, is important. Brainstorming sessions have two phases: the first for generating ideas and the second for reflecting upon them. The initial phase should last no more than an hour. One person should moderate the session, keeping time, ensuring that everyone participates and preventing people from critiquing each other's ideas. Discussion should be limited to clarifying the meaning of a particular idea. A second person records every idea, usually on a flipchart or transparency on an overhead projector. After a short break, participants are asked to reread all the ideas and each person marks their three favorite ideas.

One variation is designed to ensure that everyone contributes, not just those who are verbally dominant. Participants write their ideas on individual cards or post-it notes for a pre-specified period of time. The moderator then reads each idea aloud. Authors are encouraged to elaborate (but not justify) their ideas, which are then posted on a whiteboard or flipchart. Group members may continue to generate new ideas, inspired by the others they hear.

We use a variant of brainstorming that involves prototypes, called video brainstorming (Mackay, 2000): participants not only write or draw their ideas, they act them out in front of a video camera (Fig. 2). The goal is the same as in other brainstorming exercises, i.e. to create as many new ideas as possible, without critiquing them. The use of video, combined with paper or cardboard mock-ups, encourages participants to actively experience the details of the interaction and to understand each idea from the perspective of the user. Each video brainstorming idea takes 2-5 minutes to generate and capture, allowing participants to simulate a wide variety of ideas very quickly. The resulting video clips provide illustrations of each idea that are easier to understand (and remember) than hand-written notes. (We find that raw notes from brainstorming sessions are not very useful after a few weeks because the participants no longer remember the context in which the ideas were created.)




Figure 2: Video Brainstorming: One person moves the transparency, projected onto the wall, in response to the actions of the user, who explores how he might interact with an on-line animated character. Each interaction idea is recorded and videotaped.

Video brainstorming requires thinking more deeply about each idea. It is easier to stay abstract when describing an interaction in words or even with a sketch, but acting out the interaction in front of the camera forces the author of the idea (and the other participants) to seriously consider how a user would interact with the idea. It also encourages designers and users to think about new ideas in the context in which they will be used. Video clips from a video brainstorming session, even though rough, are much easier for the design team, including developers, to interpret than ideas from a standard brainstorming session.

We generally run a standard brainstorming session, either oral or with cards, prior to a video brainstorming session, to maximize the number of ideas to be explored. Participants then take their favorite ideas from the previous session and develop them further as video brainstorms. Each person is asked to "direct" at least two ideas, incorporating the hands or voices of other members of the group. We find that, unlike standard brainstorming, video brainstorming encourages even the quietest team members to participate.

Contracting the design space: Selecting alternatives

After expanding the design space by creating new ideas, designers must stop and reflect on the choices available to them. After exploring the design space, designers must evaluate their options and make concrete design decisions: choosing some ideas, specifically rejecting others, and leaving other aspects of the design open to further idea generation activities. Rejecting good, potentially effective ideas is difficult, but necessary to make progress. Prototypes often make it easier to evaluate design ideas from the user's perspective. They provide concrete representations that can be compared. Many of the evaluation techniques described elsewhere in this handbook can be applied to prototypes, to help focus the design space.

The simplest situation is when the designer must choose among several discrete, independent options. Running a simple experiment, using techniques borrowed from Psychology (see chapter 56), allows the designer to compare how users respond to each of the alternatives. The designer builds a prototype, with either fully-implemented or simulated versions of each option. The next step is to construct tasks or activities that are typical of how the system would be used, and ask people from the user population to try each of the options under controlled conditions. It is important to keep everything the same, except for the options being tested.

Designers should base their evaluations on both quantitative measures, such as speed or error rate, and qualitative measures, such as the user's subjective impressions of each option. Ideally, of course, one design alternative will be clearly faster, prone to fewer errors and preferred by the majority of users. More often, the results are ambiguous, and the designer must take other factors into account when making the design choice. (Interestingly, running small experiments often highlights other design problems and may help the designer reformulate the design problem or change the design space.)

The more difficult (and common) situation is when the designer faces a complex, interacting set of design alternatives, in which each design decision affects a number of others. Designers can use heuristic evaluation techniques, which rely on our understanding of human cognition, memory and sensory perception (see chapters 1-6). They can also evaluate their designs with respect to ergonomic criteria (see chapter 51) or design principles (Beaudouin-Lafon and Mackay, 2000). See chapters 56-60 for a more thorough discussion of testing and evaluation methods.

Another strategy is to create one or more scenarios (see chapter 53) that illustrate how the combined set of features will be used in a realistic setting. The scenario must identify who is involved, where the activities take place, and what the user does over a specified period of time. Good scenarios involve more than a string of independent tasks; they should incorporate real-world activities, including common or repeated tasks, successful activities, and breakdowns and errors, with both typical and unusual events. The designer then creates a prototype that simulates or implements the aspects of the system necessary to illustrate each set of design alternatives. Such prototypes can be tested by asking users to "walk through" the same scenario several times, once for each design alternative. As with experiments and usability studies, designers can record both quantitative and qualitative data, depending on the level of the prototypes being tested.

The previous section described an idea-generation technique called video brainstorming, which allows designers to generate a variety of ideas about how to interact with the future system. We call the corresponding technique for focusing in on a design video prototyping. Video prototyping can incorporate any of the rapid-prototyping techniques (off-line or on-line) described in section 4.1. Video prototypes are quick to build, force designers to consider the details of how users will react to the design in the context in which it will be used, and provide an inexpensive method of comparing complex sets of design decisions. See section 4.1 for more information on how to develop scenarios, storyboard them and then videotape them.

To an outsider, video brainstorming and video prototyping techniques look very similar: both involve small design groups working together, creating rapid prototypes and interacting with them in front of a video camera. Both result in video illustrations that make abstract ideas concrete and help team members communicate with each other. The critical difference is that video brainstorming expands the design space, by creating a number of unconnected collections of individual ideas, whereas video prototyping contracts the design space, by showing how a specific collection of design choices works together.
3.2 Prototyping strategies

Designers must decide what role prototypes should play with respect to the final system and in which order to create different aspects of the prototype. This section presents four strategies: horizontal, vertical, task-oriented and scenario-based, which focus on different design concerns. These strategies can use any of the prototyping techniques covered in sections 4, 5 and 6.

Horizontal prototypes

The purpose of a horizontal prototype is to develop one entire layer of the design at the same time. This type of prototyping is most common with large software development teams, where designers with different skill sets address different layers of the software architecture. Horizontal prototypes of the user interface are useful for getting an overall picture of the system from the user's perspective and for addressing issues such as consistency (similar functions are accessible through similar user commands), coverage (all required functions are supported) and redundancy (whether the same function is accessible through different user commands).

User interface horizontal prototypes can begin with rapid prototypes and progress through to working code. Software prototypes can be built with an interface builder (see section 5.1) without creating any of the underlying functionality, making it possible to test how the user will interact with the user interface without worrying about how the rest of the architecture works. However, some level of scaffolding or simulation of the rest of the application is often necessary, otherwise the prototype cannot be evaluated properly. As a consequence, software horizontal prototypes tend to be evolutionary, i.e. they are progressively transformed into the final system.

Vertical prototypes

The purpose of a vertical prototype is to ensure that the designer can implement the full, working system, from the user interface layer down to the underlying system layer. Vertical prototypes are often built to assess the feasibility of a feature described in a horizontal, task-oriented or scenario-based prototype. For example, when we developed the notion of magnetic guidelines in the CPN2000 system to facilitate the alignment of graphical objects (Beaudouin-Lafon and Mackay, 2000), we implemented a vertical prototype to test not only the interaction technique but also the layout algorithm and the performance. We knew that we could include the particular interaction technique only if we could implement a sufficiently fast response.

Vertical prototypes are generally high-precision software prototypes, because their goal is to validate an idea at the system level. They are often thrown away because they are generally created early in the project, before the overall architecture has been decided, and they focus on only one design question. For example, a vertical prototype of a spelling checker for a text editor does not require text editing functions to be implemented and tested. However, the final version will need to be integrated into the rest of the system, which may involve considerable architectural or interface changes.

Task-oriented prototypes

Many user interface designers begin with a task analysis (see chapter 48) to identify the individual tasks that the user must accomplish with the system. Each task requires a corresponding set of functionality from the system. Task-based prototypes are organized as a series of tasks, which allows both designers and users to test each task independently, systematically working through the entire system. Task-oriented prototypes include only the functions necessary to implement the specified set of tasks. They combine the breadth of horizontal prototypes, to cover the functions required by those tasks, with the depth of vertical prototypes, enabling detailed analysis of how the tasks can be supported. Depending on the
goal of the prototype, both off-line and on-line representations can be used for task-oriented prototypes.

Scenario-based prototypes

Scenario-based prototypes are similar to task-oriented ones, except that they do not stress individual, independent tasks, but rather follow a more realistic scenario of how the system would be used in a real-world setting. Scenarios are stories that describe a sequence of events and how the user reacts (see chapter 53). A good scenario includes both common and unusual situations, and should explore patterns of activity over time. Bødker (1995) has developed a checklist to ensure that no important issues have been left out.

We find it useful to begin with use scenarios based on observations of or interviews with real users. Ideally, some of those users should participate in the creation of the specific scenarios, and other users should critique them based on how realistic they are. Use scenarios are then turned into design scenarios, in which the same situations are described but with the functionality of the new system. Design scenarios are used, among other things, to create scenario-based video prototypes or software prototypes. As with task-based prototypes, the developer needs to write only the software necessary to illustrate the components of the design scenario. The goal is to create a situation in which the user can experience what the system would be like in a realistic situation, even if it addresses only a subset of the planned functionality.

Section 4 describes a variety of rapid prototyping techniques that can be used in any of these four prototyping strategies. We begin with off-line rapid prototyping techniques, followed by on-line prototyping techniques.

4. Rapid prototypes

The goal of rapid prototyping is to develop prototypes very quickly, in a fraction of the time it would take to develop a working system. By shortening the prototype-evaluation cycle, the design team can evaluate more alternatives and iterate the design several times, improving the likelihood of finding a solution that successfully meets the user's needs.

How rapid is rapid depends on the context of the particular project and the stage in the design process. Early prototypes, e.g. sketches, can be created in a few minutes. Later in the design cycle, a prototype produced in less than a week may still be considered "rapid" if the final system is expected to take months or years to build. Precision, interactivity and evolution all affect the time it takes to create a prototype. Not surprisingly, a precise and interactive prototype takes more time to build than an imprecise or fixed one.

The techniques presented in this section are organized from most rapid to least rapid, according to the representation dimension introduced in section 2. Off-line techniques are generally more rapid than on-line ones. However, creating successive iterations of an on-line prototype may end up being faster than creating new off-line prototypes.

4.1 Off-line rapid prototyping techniques

Off-line prototyping techniques range from simple to very elaborate. Because they do not involve software, they are usually considered a tool for thinking through the design issues, to be thrown away when they are no longer needed. This
section describes simple paper and pencil sketches, three-dimensional mock-ups, wizard-of-oz simulations and video prototypes.

Paper & pencil

The fastest form of prototyping involves paper, transparencies and post-it notes to represent aspects of an interactive system (for an example, see Muller, 1991). By playing the roles of both the user and the system, designers can get a quick idea of a wide variety of different layout and interaction alternatives, in a very short period of time.

Designers can create a variety of low-cost "special effects". For example, a tiny triangle drawn at the end of a long strip cut from an overhead transparency makes a handy mouse pointer, which can be moved by a colleague in response to the user's actions. Post-it notes™, with prepared lists, can provide "pop-up menus". An overhead projector pointed at a whiteboard makes it easy to project transparencies (hand-drawn or pre-printed, overlaid on each other as necessary) to create an interactive display on the wall. The user can interact by pointing (Fig. 3) or drawing on the whiteboard. One or more people can watch the user and move the transparencies in response to her actions. Everyone in the room gets an immediate impression of how the eventual interface might look and feel.

Figure 3: Hand-drawn transparencies can be projected onto a wall, creating an interface a user can respond to. Note that most paper prototypes begin with quick sketches on paper, then progress to more carefully-drawn screen images made with a computer (Fig. 4). In the early stages, the goal is to generate a wide range of ideas and expand the design space, not determine the final solution. Paper and pencil prototypes are an excellent starting point for horizontal, task-based and scenario-based prototyping strategies.

Figure 4: Several people work together to simulate interacting with this paper prototype. One person moves a transparency with a mouse pointer while another moves the diagram accordingly.

Mock-ups

Architects use mock-ups or scaled prototypes to provide three-dimensional illustrations of future buildings. Mock-ups are also useful for interactive system designers, helping them move beyond two-dimensional images drawn on paper or transparencies (see Bødker et al., 1988). Generally made of cardboard, foamcore or other found materials, mock-ups are physical prototypes of the new system. Fig. 5 shows a mock-up of the interface to a new hand-held device. The mock-up provides a deeper understanding of how the interaction will work in real-world situations than is possible with sets of screen images.

Figure 5: Mock-up of a hand-held display with carrying handle. Mock-ups allow the designer to concentrate on the physical design of the device, such as the position of buttons or the screen. The designer can also create several mock-ups and compare input or output options, such as buttons vs. trackballs. Designers and users should run through different scenarios, identifying potential problems with the interface or generating ideas for new functionality. Mock-ups can also help the designer envision how an interactive system will be incorporated into a physical space (Fig. 6).

Figure 6: Scaled mock-up of an air traffic control table, connected to a wall display.

Wizard of Oz

Sometimes it is useful to give users the impression that they are working with a real system, even before it exists. Kelley (1993) dubbed this technique the Wizard of Oz, based on the scene in the 1939 movie of the same name. The heroine, Dorothy, and her companions ask the mysterious Wizard of Oz for help. When they enter the room, they see an enormous green human head, breathing smoke and speaking with a deep, impressive voice. When they return later, they again see the Wizard. This time, Dorothy's small dog pulls back a curtain, revealing a frail old man pulling levers and making the mechanical Wizard of Oz speak. They realize that the impressive being before them is not a wizard at all, but simply an interactive illusion created by the old man.

The software version of the Wizard of Oz operates on the same principle. A user sits at a terminal and interacts with a program. Hidden elsewhere, the software designer (the wizard) watches what the user does and, by responding in different ways, creates the illusion of a working software program. In some cases, the user is unaware that a person, rather than a computer, is operating the system.

The Wizard-of-Oz technique lets users interact with partially-functional computer systems. Whenever they encounter something that has not been implemented (or there is a bug), a human developer who is watching the interaction overrides the prototype system and plays the role destined to eventually be played by the computer. A combination of video and software can work well, depending upon what needs to be simulated. The Wizard of Oz was initially used to develop natural language interfaces (e.g. Chapanis, 1982; Wixon, Whiteside, Good and Jones, 1993). Since then, the technique has been used in a wide variety of situations, particularly those in which rapid responses from users are not critical. Wizard of Oz simulations may consist of paper prototypes, fully-implemented systems and everything in between.

Video prototyping

Video prototypes (Mackay, 1988) use video to illustrate how users will interact with the new system. As explained in section 3.1, they differ from video brainstorming in that the goal is to refine a single design, not generate new ideas.

Video prototypes may build on paper & pencil prototypes and cardboard mockups and can also use existing software and images of real-world settings. We begin our video prototyping exercises by reviewing relevant data about users and their work practices, and then review ideas we video brainstormed. The next step is to create a use scenario, describing the user at work. Once the scenario is described in words, the designer develops a storyboard. Similar to a comic book, the storyboard shows a sequence of rough sketches of each action or event, with accompanying actions and/or dialog (or subtitles), with related annotations that explain what is happening in the scene or the type of shot (Fig. 7). A paragraph of text in a scenario corresponds to about a page of a storyboard.

Figure 7: Storyboard. This storyboard is based on observations of real Coloured Petri Net users in a small company and illustrates how the CPN developer modifies a particular element of a net, the "Simple Protocol". Storyboards help designers refine their ideas, generate 'what if' scenarios for different approaches to a story, and communicate with the other people who are involved in creating the production. Storyboards may be informal "sketches" of ideas, with only partial information. Others follow a pre-defined format and are used to direct the production and editing of a video prototype. Designers should jot down notes on storyboards as they think through the details of the interaction. Storyboards can be used like comic books to communicate with other members of the design team. Designers and users can discuss the proposed system and alternative ideas for interacting with it (figure 8). Simple videos of each successive frame, with a voice over to explain what happens, can also be effective. However, we usually use storyboards to help us shoot video prototypes, which illustrate how a new system will look to a user in a real-world setting. We find that placing the elements of a storyboard on separate cards and arranging them (Mackay and Pagani, 1994) helps the designer experiment with different linear sequences and insert or delete video clips. However, the process of creating a video prototype, based on the storyboard, provides an even deeper understanding of the design.

Figure 8: Video Prototyping: The CPN design team reviews their observations of CPN developers and then discusses several design alternatives. They work out a scenario and storyboard it, then shoot a video prototype that reflects their design.

The storyboard guides the shooting of the video. We often use a technique called "editing-in-the-camera" (see Mackay, 2000), which allows us to create the video
directly, without editing later. We use title cards, as in a silent movie, to separate the clips and to make it easier to shoot. A narrator explains each event and several people may be necessary to illustrate the interaction. Team members enjoy playing with special effects, such as "time-lapse photography". For example, we can record a user pressing a button, stop the camera, add a new dialog box, and then restart the camera, to create the illusion of immediate system feedback.

Video is not simply a way to capture events in the real world or to capture design ideas, but can be a tool for sketching and visualizing interactions. We use a second live video camera as a Wizard-of-Oz tool. The wizard should have access to a set of prototyping materials representing screen objects. Other team members stand by, ready to help move objects as needed. The live camera is pointed at the wizard's work area, with either a paper prototype or a partially-working software simulation. The resulting image is projected onto a screen or monitor in front of the user. One or more people should be situated so that they can observe the actions of the user and manipulate the projected video image accordingly. This is most effective if the wizard is well prepared for a variety of events and can present semi-automated information. The user interacts with the objects on the screen as the wizard moves the relevant materials in direct response to each user action. The other camera records the interaction between the user and the simulated software system on the screen or monitor, to create either a video brainstorm (for a quick idea) or a fully-storyboarded video prototype.

Figure 9: Complex wizard-of-oz simulation, with projected image from a live video camera and transparencies projected from an overhead projector.

Fig. 9 shows a Wizard-of-Oz simulation with a live video camera, video projector, whiteboard, overhead projector and transparencies. The setup allows two people to experience how they would communicate via a new interactive communication system. One video camera films the blond woman, who can see and talk to the brunette. Her image is projected live onto the left side of the wall. An overhead projector displays hand-drawn transparencies, manipulated by two other people, in response to gestures made by the brunette. The entire interaction is videotaped by a second video camera.

Combining wizard-of-oz and video is a particularly powerful prototyping technique because it gives the person playing the user a real sense of what it might actually feel like to interact with the proposed tool, long before it has been implemented. Seeing a video clip of someone else interacting with a simulated tool is more effective than simply hearing about it; but interacting with it directly is
more powerful still. Video prototyping may act as a form of specification for developers, enabling them to build the precise interface, both visually and interactively, created by the design team.

4.2 On-line rapid prototyping techniques

The goal of on-line rapid prototyping is to create higher-precision prototypes than can be achieved with off-line techniques. Such prototypes may prove useful to better communicate ideas to clients, managers, developers and end users. They are also useful for the design team to fine-tune the details of a layout or an interaction. They may exhibit problems in the design that were not apparent in less precise prototypes. Finally, they may be used early in the design process for low-precision prototypes that would be difficult to create off-line, such as when very dynamic interactions or visualizations are needed.

The techniques presented in this section are sorted by interactivity. We start with non-interactive simulations, i.e. animations, followed by interactive simulations that provide fixed or multiple-path interactions. We finish with scripting languages, which support open interactions.

Non-interactive simulations

A non-interactive simulation is a computer-generated animation that represents what a person would see of the system if he or she were watching over the user's shoulder. Non-interactive simulations are usually created when off-line prototypes, including video, fail to capture a particular aspect of the interaction and it is important to have a quick prototype to evaluate the idea. It is usually best to start by creating a storyboard to describe the animation, especially if the developer of the prototype is not a member of the design team.

One of the most widely-used tools for non-interactive simulations is Macromedia Director™. The designer defines graphical objects called sprites, and defines paths along which to animate them. The succession of events, such as when sprites appear and disappear, is determined with a time-line. Sprites are usually created with drawing tools, such as Adobe Illustrator or Deneba Canvas, painting tools, such as Adobe Photoshop, or even scanned images. Director is a very powerful tool; experienced developers can create sophisticated interactive simulations. However, non-interactive simulations are much faster to create. Other similar tools exist on the market, including Abvent Katabounga, Adobe AfterEffects and Macromedia Flash (Fig. 10).

Figure 10: A non-interactive simulation of a desktop interface created with Macromedia Flash. The time-line (top) displays the active sprites while the main window (bottom) shows the animation. (O. Beaudoux, with permission) Figure 11 shows a set of animation movies created by Dave Curbow to explore the notion of accountability in computer systems (Dourish, 1997). These prototypes explore new ways to inform the user of the progress of a file copy operation. They were created with Macromind Director by combining custommade sprites with sprites extracted from snapshots of the Macintosh Finder. The simulation features cursor motion, icons being dragged, windows opening and closing, etc. The result is a realistic prototype that shows how the interface looks and behaves, that was created in just a few hours. Note that the simulation also features text annotations to explain each step, which helps document the prototype.

Figure 11: Frames from an animated simulation created with Macromind Director (D. Curbow, with permission) Non-interactive animations can be created with any tool that generates images. For example, many Web designers use Adobe Photoshop to create simulations of their web sites. Photoshop images are composed of various layers that overlap like transparencies. The visibility and relative position of each layer can be controlled independently. Designers can quickly add or delete visual elements, simply by changing the characteristics of the relevant layer. This permits quick comparisons of alternative designs and helps visualize multiple pages that share a common layout or banner. Skilled Photoshop users find this approach much faster than most web authoring tools. We used this technique in the CPN2000 project (Mackay et al., 2000) to prototype the use of transparency. After several prototyping sessions with transparencies and overhead projectors, we moved to the computer to understand the differences between the physical transparencies and the transparent effect as it would be rendered on a computer screen. We later developed an interactive prototype with OpenGL, which required an order of magnitude more time to implement than the Photoshop mock-up.
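As a minimal, self-contained sketch of a non-interactive simulation built with a scripting toolkit rather than an animation tool, the following Tcl/Tk script (Tcl/Tk is introduced under Scripting languages below) moves a "sprite" along a fixed path under the control of a simple time-line. The shapes, coordinates, timing and captions are invented for illustration.

    # Sketch of a non-interactive simulation: a "sprite" (here a plain
    # rectangle standing in for a file icon) is moved along a predefined
    # path, frame by frame, with no user interaction.
    package require Tk

    canvas .c -width 400 -height 200 -background white
    pack .c

    set sprite  [.c create rectangle 20 80 60 120 -fill steelblue]
    set caption [.c create text 200 180 -text "Copying the file..."]

    # A fixed list of (dx, dy) steps plays the role of the time-line.
    set path {{10 0} {10 0} {10 2} {10 4} {10 6} {10 8} {10 10} {10 10}}

    proc play {steps} {
        global sprite caption
        if {[llength $steps] == 0} {
            .c itemconfigure $caption -text "Copy complete."
            return
        }
        lassign [lindex $steps 0] dx dy
        .c move $sprite $dx $dy
        # Schedule the next frame 50 ms later.
        after 50 [list play [lrange $steps 1 end]]
    }

    play $path

As with the Director examples, the result is only watched, not interacted with; editing the list of steps is the equivalent of editing the time-line.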

Interactive simulations

Designers can also use tools such as Adobe Photoshop to create Wizard-of-Oz simulations. For example, the effect of dragging an icon with the mouse can be obtained by placing the icon of a file in one layer and the icon of the cursor in another layer, and by moving either or both layers. The visibility of layers, as well as other attributes, can also create more complex effects. As with Wizard-of-Oz and other paper prototyping techniques, the behavior of the interface is generated by the person operating the Photoshop interface.

More specialized tools, such as Hypercard and Macromedia Director, can be used to create simulations that the user can directly interact with. Hypercard (Goodman, 1987) is one of the most successful early prototyping tools. It is an authoring environment based on a stack metaphor: a stack contains a set of cards that share a background, including fields and buttons. Each card can also have its own unique contents, including fields and buttons (Fig. 12). Stacks, cards, fields and buttons react to user events, e.g. clicking a button, as well as system events, e.g. when a new card is displayed or about to disappear (Fig. 13). These reactions are programmed with a scripting language called Hypertalk. For example, the following script is assigned to a button and switches to the next card in the stack whenever the button is clicked. If this button is included in the stack background, the user will be able to browse through the entire stack:

    on mouseUp
      go to next card
    end mouseUp
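For readers without access to Hypercard, a rough analogue of this card-switching behavior can be sketched with Tcl/Tk (described under Scripting languages below). Each Tk frame simply stands in for a card, and the frame names and card contents here are invented.

    # Each "card" is a frame; a button shared by all cards plays the role of
    # a background button whose script goes to the next card.
    package require Tk

    set cards {.card1 .card2 .card3}
    set current 0

    foreach path $cards title {Welcome {Search form} Results} {
        frame $path -width 300 -height 200
        pack propagate $path 0
        label $path.title -text $title
        pack  $path.title -pady 20
    }

    button .next -text "Next card" -command {
        pack forget [lindex $cards $current]
        set current [expr {($current + 1) % [llength $cards]}]
        pack [lindex $cards $current] -before .next
    }
    pack [lindex $cards 0]
    pack .next -pady 10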

Figure 12: A Hypercard card (right) is the combination of a background (left) and the card's content (middle). (Apple Computer, permission requested)

Figure 13: The hierarchy of objects in Hypercard determines the order (from left to right) in which a handler is looked up for an event (Apple Computer, permission requested) Interfaces can be prototyped quickly with this approach, by drawing different states in successive cards and using buttons to switch from one card to the next. Multiple-path interactions can be programmed by using several buttons on each
card. More open interactions require more advanced use of the scripting language, but are fairly easy to master with a little practice.

Director uses a different metaphor, attaching behaviors to sprites and to frames of the animation. For example, a button can be defined by attaching a behavior to the sprite representing that button. When the sprite is clicked, the animation jumps to a different sequence. This is usually coupled with a behavior attached to the frame containing the button that loops the animation on the same frame. As a result, nothing happens until the user clicks the button, at which point the animation skips to a sequence where, for example, a dialog box opens. The same technique can be used to make the OK and Cancel buttons of the dialog box interactive. Typically, the Cancel button would skip to the original frame while the OK button would skip to a third sequence. Director comes with a large library of behaviors to describe such interactions, so that prototypes can be created completely interactively. New behaviors can also be defined with a scripting language called Lingo. Many educational and cultural CD-ROMs are created exclusively with Director. They often feature original visual displays and interaction techniques that would be almost impossible to create with the traditional user interface development tools described in section 5. Designers should consider tools like Hypercard and Director as user interface builders or user interface development environments. In some situations, they can even be used for evolutionary prototypes (see section 6).

Scripting languages

Scripting languages are the most advanced rapid prototyping tools. As with the interactive-simulation tools described above, the distinction between rapid prototyping tools and development tools is not always clear. Scripting languages make it possible to develop throw-away prototypes very quickly (in a few hours to a few days); such prototypes may or may not be carried over into the final system, for performance or other technical reasons.

A scripting language is a programming language that is both lightweight and easy to learn. Most scripting languages are interpreted or semi-compiled, i.e. the user does not need to go through a compile-link-run cycle each time the script (program) is changed. Scripting languages also have drawbacks: they are not strongly typed, and non-fatal errors are ignored unless explicitly trapped by the programmer. Scripting languages are often used to write small applications for specific purposes and can serve as glue between pre-existing applications or software components.

Tcl (Ousterhout, 1993) was inspired by the syntax of the Unix shell; it makes it very easy to interface existing applications by turning the application programming interface (API) into a set of commands that can be called directly from a Tcl script. Tcl is particularly well suited to developing user interface prototypes (or small to medium-size applications) because of its Tk user interface toolkit. Tk features all the traditional interactive objects (called "widgets") of a UI toolkit: buttons, menus, scrollbars, lists, dialog boxes, etc. Creating a widget typically requires only one line of code. For example:

    button .dialogbox.ok -text OK -command {destroy .dialogbox}

This command creates a button, called ".dialogbox.ok", whose label is "OK". It deletes its parent window ".dialogbox" when the button is pressed. A traditional programming language and toolkit would take 5-20 lines of code to create the same button.
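To put this one-line example in context, a complete, if minimal, dialog box might be written as follows; the window title, message text and layout are invented for illustration.

    # A minimal dialog box around the one-line button shown above.
    package require Tk

    toplevel .dialogbox
    wm title .dialogbox "Print"
    label  .dialogbox.msg -text "Send the document to the printer?"
    button .dialogbox.ok -text OK -command {destroy .dialogbox}
    pack   .dialogbox.msg .dialogbox.ok -padx 10 -pady 5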

Tcl also has two advanced, heavily-parameterized widgets: the text widget and the canvas widget. The text widget can be used to prototype text-based interfaces. Any character in the text can react to user input through the use of tags. For example, it is possible to turn a string of characters into a hypertext link. In Beaudouin-Lafon (2000), the text widget was used to prototype a new method for finding and replacing text. When entering the search string, all occurrences of the string are highlighted in the text (Fig. 14). Once a replace string has been entered, clicking an occurrence replaces it (the highlighting changes from yellow to red). Clicking a replaced occurrence returns it to its original value. This example also uses the canvas widget to create a custom scrollbar that displays the positions and status of the occurrences.
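The following is a much reduced sketch of the highlighting part of such a prototype, using the Tk text widget and tags; the widget names, colors and sample text are invented, and the actual prototype described in Beaudouin-Lafon (2000) is considerably more elaborate.

    # Highlight every occurrence of a search string by attaching a tag to
    # each match; the tag carries both presentation and behavior.
    package require Tk

    text .t -width 50 -height 8
    pack .t
    .t insert end "the quick brown fox jumps over the lazy dog near the barn"

    .t tag configure found -background yellow
    .t tag bind found <Button-1> {puts "replace the occurrence under the mouse"}

    proc highlight {pattern} {
        .t tag remove found 1.0 end
        set index 1.0
        while {[set index [.t search -nocase $pattern $index end]] ne ""} {
            .t tag add found $index "$index + [string length $pattern] chars"
            set index "$index + [string length $pattern] chars"
        }
    }

    highlight "the"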

Figure 14: using the Tk text and canvas widgets to prototype a novel search and replace interaction technique (Beaudouin-Lafon, 2000). The Tk canvas widget is a drawing surface that can contain arbitrary objects: lines, rectangles, ovals, polygons, text strings, and widgets. Tags allow behaviors (i.e. scripts) that are called when the user acts on these objects. For example, an object that can be dragged will be assigned a tag with three behaviors: button-press, mouse-move and button-up. Because of the flexibility of the canvas, advanced visualization and interaction techniques can be implemented more quickly and easily than with other tools. For example, Fig. 15 shows a prototype exploring new ideas to manage overlapping windows on the screen (Beaudouin-Lafon, 2001). Windows can be stacked and slightly rotated so that it is easier to recognize them, and they can be folded so it is possible to see what is underneath without having to move the window. Even though the prototype is not perfect (for example, folding a window that contains text is not properly supported), it was instrumental in identifying a number of problems with the interaction techniques and finding appropriate solutions through iterative design.
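A minimal version of the drag behavior described above might be sketched as follows; the shapes, tags and names are invented for illustration.

    # A canvas item tagged "draggable" is given the three behaviors described
    # above: button-press, mouse-move and button-up.
    package require Tk

    canvas .c -width 400 -height 300 -background white
    pack .c
    .c create rectangle 50 50 150 120   -fill lightgrey -tags draggable
    .c create oval      200 100 300 200 -fill lightblue -tags draggable

    .c bind draggable <ButtonPress-1> {
        set dragX %x
        set dragY %y
    }
    .c bind draggable <B1-Motion> {
        # "current" designates the item under the mouse pointer.
        .c move current [expr {%x - $dragX}] [expr {%y - $dragY}]
        set dragX %x
        set dragY %y
    }
    .c bind draggable <ButtonRelease-1> {
        puts "dropped at %x,%y"
    }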

Figure 15: using the Tk canvas widget to prototype a novel window manager (Beaudouin-Lafon, 2001).

Tcl and Tk can also be used with other programming languages. For example, Pad++ (Bederson & Meyer, 1998) is implemented as an extension to Tcl/Tk: the zoomable interface is implemented in C for performance, and is accessible from Tk as a new widget. This makes it easy to prototype interfaces that use zooming. It is also a way to develop evolutionary prototypes: a first prototype is implemented completely in Tcl, then parts of it are re-implemented in a compiled language to improve performance. Ultimately, the complete system may be implemented in another language, although it is more likely that some parts will remain in Tcl.

Software prototypes can also be used in conjunction with hardware prototypes. Figure 16 shows an example of a hardware prototype that captures hand-written text from a paper flight strip (using a combination of a graphics tablet and a custom-designed system for detecting the position of the paper strip holder). We used Tcl/Tk, in conjunction with C++, to present information on a RADAR screen (tied to an existing air traffic control simulator) and to provide feedback on a touch-sensitive display next to the paper flight strips (Caméléon, Mackay et al., 1998). The user can write in the ordinary way on the paper flight strip, and the system interprets the gestures according to the location of the writing on the strip. For example, a change in flight level is automatically sent to another controller for confirmation and a physical tap on the strip's ID lights up the corresponding aircraft on the RADAR screen.

Fig. 16: Caméléon's augmented stripboard (left) is a working hardware prototype that identifies and captures hand-writing from paper flight strips. Members of the design team test the system (right), which combines both hardware and software prototypes into a single interactive simulation.

5. Iterative prototypes

Prototypes may also be developed with traditional software development tools. In particular, high-precision prototypes usually require a level of performance that cannot be achieved with the rapid on-line prototyping techniques described above. Similarly, evolutionary prototypes intended to evolve into the final product require more traditional software development tools. Finally, even shipped products are not "final", since subsequent releases can be viewed as initial designs for prototyping the next release.

Development tools for interactive systems have been in use for over twenty years and are constantly being refined. Several studies have shown that the part of the development cost of an application spent on the user interface is 50%-80% of the total cost of the project (Myers & Rosson, 1992). The goal of development tools is to shift this balance by reducing production and maintenance costs. Another goal of development tools is to anticipate the evolution of the system over successive releases and support iterative design.

Interactive systems are inherently more powerful than non-interactive ones (see Wegner, 1997, for a theoretical argument). They do not match the traditional, purely algorithmic, type of programming: an interactive system must handle user input and generate output at almost any time, whereas an algorithmic system reads input at the beginning, processes it, and displays results at the end. In addition, interactive systems must process input and output at rates that are compatible with the human perception-action loop, i.e. in time frames of 20ms to 200ms. In practice, interactive systems are both reactive and real-time systems, two active areas in computer science research.

The need to develop interactive systems more efficiently has led to two interrelated streams of work. The first involves the creation of software tools, from low-level user interface libraries and toolkits to high-level user interface development environments (UIDEs). The second addresses software architectures for interactive systems: how system functions are mapped onto software modules. The rest of this section presents the most salient contributions of these two streams of work.

5.1 Software tools

Since the advent of graphical user interfaces in the eighties, a large number of tools have been developed to help with the creation of interactive software, most aimed at visual interfaces. This section presents a collection of tools, from low-level, i.e. requiring a lot of programming, to high-level. The lowest-level tools are graphical libraries, which provide hardware independence for painting pixels on a screen and handling user input, and window systems, which provide an abstraction (the window) to structure the screen into several "virtual terminals". User interface toolkits structure an interface as a tree of interactive objects called widgets, while user interface builders provide an interactive application to create and edit those widget trees. Application frameworks build on toolkits and UI builders to facilitate the creation of typical functions such as cut/copy/paste, undo, help, and interfaces based on editing multiple documents in separate windows. Model-based tools semi-automatically derive an interface from a specification of the domain objects and functions to be supported. Finally, user interface development environments, or UIDEs, provide an integrated collection of tools for the development of interactive software.

Before we describe each of these categories in more detail, it is important to understand how they can be used for prototyping. It is not always best to use the highest-level available tool for prototyping. High-level tools are most valuable in the long term because they make it easier to maintain the system, port it to various platforms or localize it to different languages. These issues are irrelevant for vertical and throw-away prototypes, so a high-level tool may prove less effective than a lower-level one.

The main disadvantage of higher-level tools is that they constrain or stereotype the types of interfaces they can implement. User interface toolkits usually contain a limited set of "widgets" and it is expensive to create new ones. If the design must incorporate new interaction techniques, such as bimanual interaction (Kurtenbach et al., 1997) or zoomable interfaces (Bederson & Hollan, 1994), a user interface toolkit will hinder rather than help prototype development.
Similarly, application frameworks assume a stereotyped application with a menu bar, several toolbars, a set of windows holding documents, etc. Such a framework would be inappropriate for developing a game or a multimedia educational CD-ROM that requires a fluid, dynamic and original user interface.

Finally, developers need to truly master these tools, especially when prototyping in support of a design team. Success depends on the programmer's ability to quickly change the details as well as the overall structure of the prototype. A developer will be more productive when using a familiar tool than if forced to use a more powerful but unknown tool.

Graphical libraries and Window systems

Graphical libraries underlie all the other tools presented in this section. Their main purpose is to provide the developer with a hardware-independent, and sometimes cross-platform, application programming interface (API) for drawing on the screen. They can be separated into two categories: direct drawing and scene-graph based. Direct drawing libraries provide functions to draw shapes on the screen, given their geometry and their graphical attributes. This means that every time something is to be changed on the display, the programmer has to either redraw the whole screen or figure out exactly which parts have changed. Xlib on Unix systems, Quickdraw on MacOS, Win32 GDI on Windows and OpenGL (Woo et al., 1997) on all three platforms are all direct drawing libraries. They offer the best compromise between performance and flexibility, but are difficult to program. Scene-graph based libraries explicitly represent the contents of the display by a structure called a scene graph. It can be a simple list (called a display list), a tree (as used by many user interface toolkits – see the next subsection), or a directed acyclic graph (DAG). Rather than painting on the screen, the developer creates and updates the scene graph, and the library is responsible for updating the screen to reflect the scene graph. Scene graphs are mostly used for 3D graphics, e.g. OpenInventor (Strass, 1993), but in recent years they have been used for 2D as well (Bederson et al., 2000; Beaudouin-Lafon & Lassen, 2000). With the advent of hardware-accelerated graphics cards, scene-graph based libraries can offer outstanding performance while easing the task of the developer.

Window systems provide an abstraction to allow multiple client applications to share the same screen. Applications create windows and draw into them. From the application perspective, windows are independent and behave as separate screens. All graphical libraries include or interface with a window system. Window systems also offer a user interface to manipulate windows (move, resize, close, change stacking order, etc.), called the window manager. The window manager may be a separate application (as in X-Windows), it may be built into the window system (as in Windows), or it may be controlled by each application (as in MacOS). Each solution offers a different trade-off between flexibility and programming cost.

Graphical libraries include or are complemented by an input subsystem. The input subsystem is event driven: each time the user interacts with an input device, an event recording the interaction is added to an input event queue. The input subsystem API lets the programmer query the input queue and remove events from it. This technique is much more flexible than polling the input devices repeatedly or waiting until an input device is activated. In order to ensure that input events are handled in a timely fashion, the application has to execute an event loop that retrieves the first event in the queue and handles it as fast as possible. Every time an event sits in the queue, there is a delay between the user action and the system reaction.
As a consequence, the event loop sits at the heart of almost every interactive system. Window systems complement the input subsystem by routing events to the appropriate client application based on its focus. The focus may be specified explicitly for a device (e.g. the keyboard) or implicitly through the cursor position
(the event goes to the window under the cursor). Scene-graph based libraries usually provide a picking service to identify which objects in the scene graph are under or in the vicinity of the cursor.

Although graphical libraries and window systems are fairly low-level, they must often be used when prototyping novel interaction and/or visualization techniques. Usually, these prototypes are developed when performance is key to the success of a design. For example, a zoomable interface that cannot provide continuous zooming at interactive frame rates is unlikely to be usable. The goal of the prototype is then to measure performance in order to validate the feasibility of the design.

User interface toolkits

User interface toolkits are probably the most widely used tools nowadays for implementing applications. All three major platforms (Unix/Linux, MacOS and Windows) come with at least one standard UI toolkit. The main abstraction provided by a UI toolkit is the widget. A widget is a software object that has three facets that closely match the MVC model: a presentation, a behavior and an application interface.

The presentation defines the graphical aspect of the widget. Usually, the presentation can be controlled by the application, but also externally. For example, under X-Windows, it is possible to change the appearance of widgets in any application by editing a text file specifying the colors, sizes and labels of buttons, menu entries, etc. The overall presentation of an interface is created by assembling widgets into a tree. Widgets such as buttons are the leaves of the tree. Composite widgets constitute the nodes of the tree: a composite widget contains other widgets and controls their arrangement. For example, menu widgets in a menu bar are stacked horizontally, while command widgets in a menu are stacked vertically. Widgets in a dialog box are laid out at fixed positions, or relative to each other so that the layout may be recomputed when the window is resized. Such constraint-based layout saves time because the interface does not need to be re-laid out completely when a widget is added or when its size changes as a result of, e.g., changing its label.

The behavior of a widget defines the interaction methods it supports: a button can be pressed, a scrollbar can be scrolled, a text field can be edited. The behavior also includes the various possible states of a widget. For example, most widgets can be active or inactive, some can be highlighted, etc. The behavior of a widget is usually hardwired and defines its class (menu, button, list, etc.). However, it is sometimes parameterized, e.g. a list widget may be set to support single or multiple selection.

One limitation of widgets is that their behavior is limited to the widget itself. Interaction techniques that involve multiple widgets, such as drag-and-drop, cannot be supported by the widgets' behavior alone and require separate support in the UI toolkit. Some interaction techniques, such as toolglasses or magic lenses (Bier et al., 1993), break the widget model with respect to both presentation and behavior and cannot be supported by traditional toolkits. In general, prototyping new interaction techniques requires either implementing them within new widget classes, which is not always possible, or not using a toolkit at all. Implementing a new widget class is typically more complicated than implementing the new technique outside the toolkit, e.g. with a graphical library, and is rarely justified for prototyping.
Many toolkits provide a “blank” widget (Canvas in Tk, Drawing Area in Motif, JFrame in Java Swing) that can be used by the application to implement its own presentation and behavior. This is usually a good alternative to implementing a new widget class, even for production code.
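As a concrete illustration of a widget tree, the following Tcl/Tk sketch assembles a few standard widgets into composite frames, with layout constraints that are recomputed when the window is resized; the widget names, labels and the callback are invented.

    # A small widget tree: the root window contains a form frame and a button
    # row. Grid and pack constraints recompute the layout on resize.
    package require Tk

    wm title . "Search dialog"

    frame .form
    label .form.lbl   -text "Search for:"
    entry .form.field
    grid  .form.lbl   -row 0 -column 0 -padx 4 -pady 4 -sticky w
    grid  .form.field -row 0 -column 1 -padx 4 -pady 4 -sticky ew
    grid  columnconfigure .form 1 -weight 1

    frame  .buttons
    button .buttons.ok     -text OK     -command {puts "search for: [.form.field get]"}
    button .buttons.cancel -text Cancel -command {destroy .}
    pack   .buttons.cancel .buttons.ok -side right -padx 4 -pady 4

    pack .form    -fill x
    pack .buttons -fill x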

The application interface of a widget defines how it communicates the results of the user interactions to the rest of the application. Three main techniques exist. The first and most common one is called callback functions, or callbacks for short: when the widget is created, the application registers the names of one or more functions with it. When the widget is activated by the user, it calls the registered functions (Fig. 17). The problem with this approach is that the logic of the application is split among many callback functions (Myers, 1991).


Fig. 17: Callback functions

The second approach is called active variables and consists of associating a widget with a variable of the application program (Fig. 18). A controller ensures that when the widget state changes, the variable is updated with a new value and, conversely, when the value of the variable changes, the widget state reflects the new value. This allows the application to change the state of the interface without accessing the widgets directly, therefore decoupling the functional core from the presentation. In addition, the same active variable can be used with multiple widgets, providing an easy way to support multiple views. Finally, it is easier to change the mapping between widgets and active variables than it is to change the assignment of callbacks, because active variables are more declarative and callbacks more procedural. Active variables work only for widgets that represent data, e.g. a list or a text field, but not for buttons or menus. Therefore they complement, rather than replace, callbacks. Few user interface toolkits implement active variables; Tcl/Tk (Ousterhout, 1994) is a notable exception.
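Since Tcl/Tk is cited here as supporting active variables, a minimal sketch may help; the variable and widget names are invented.

    # The variable "level" is shared by a scale and a label: moving the scale
    # updates the label, and setting the variable from code updates both
    # widgets. A trace lets the functional core observe changes.
    package require Tk

    set level 5

    scale .s -from 0 -to 10 -orient horizontal -variable level
    label .l -textvariable level
    pack  .s .l -padx 10 -pady 5

    proc levelChanged {args} {
        puts "level is now $::level"
    }
    trace add variable level write levelChanged

    # Changing the variable programmatically also updates the interface.
    after 2000 {set level 8}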


Fig. 18: Active variables

The third approach for the application interface is based on listeners. Rather than registering a callback function with the widget, the application registers a listener object (Fig. 19). When the widget is activated, it sends a message to its listener describing the change in state. Typically, the listener of a widget would be its Model (using the MVC terminology) or Abstraction (using the PAC terminology). The first advantage of this approach, therefore, is that it better matches the most common architecture models. It is also more faithful to the object-oriented approach that underlies most user interface toolkits. The second advantage is that it reduces the "spaghetti of callbacks" described above: by attaching a single listener to several widgets, the code is more centralized. A number of recent toolkits are based on the listener model, including Java Swing (Eckstein et al., 1998).


Fig. 19: Listener objects

User interface toolkits have been an active area of research over the past 15 years. InterViews (Linton et al., 1989) has inspired many modern toolkits and user interface builders. A number of toolkits have also been developed for specific applications such as groupware (Roseman and Greenberg, 1996, 1999) or visualization (Schroeder et al., 1997).

Creating an application or a prototype with a UI toolkit requires a solid knowledge of the toolkit and experience with programming interactive applications. In order to control the complexity of the inter-relations between independent pieces of code (creation of widgets, callbacks, global variables, etc.), it is important to use well-known design patterns. Otherwise the code quickly becomes unmanageable and, in the case of a prototype, unsuitable for design space exploration. Two categories of tools have been designed to ease the task of developers: user interface builders and application frameworks.

User-interface builders

A user interface builder allows the developer of an interactive system to create the presentation of the user interface, i.e. the tree of widgets, interactively with a graphical editor. The editor features a palette of widgets that the user can use to "draw" the interface in the same way as a graphical editor is used to create diagrams with lines, circles and rectangles. The presentation attributes of each widget can be edited interactively, as can the overall layout. This saves a lot of time that would otherwise be spent writing and fine-tuning rather dull code that creates widgets and specifies their attributes. It also makes it extremely easy to explore and test design alternatives.

User interface builders focus on the presentation of the interface. They also offer some facilities to describe the behavior of the interface and to test the interaction. Some systems allow the interactive specification of common behaviors, such as a menu command opening a dialog box, a button closing a dialog box, or a scrollbar controlling a list or text. The user interface builder can then be switched to a "test" mode where widgets are not passive objects but work for real. This may be enough to test prototypes for simple applications, even though there is no functional core or application data.

In order to create an actual application, the part of the interface generated by the UI builder must be assembled with the missing parts, i.e. the functional core, the application interface code that could not be described from within the builder, and the run-time module of the generator. Most generators save the interface into a file that can be loaded at run-time by the generator's run-time module (Fig. 20). With this method, the application needs to be re-generated only when the functional core changes, not when the user interface changes. This makes it easy to test alternative designs or to iteratively create the interface: each time a new version of the interface is created, it can be readily tested by re-running the application.
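With an interpreted toolkit, the architecture of Fig. 20 can be sketched very simply: the interface lives in a separate description file that the running application loads. In the following Tcl/Tk sketch the "generated" file is written by hand, and all file and procedure names are invented.

    # The functional core stays in the main script; the interface description
    # is kept in a separate file and loaded at run-time, so it can be edited
    # and re-tested without touching the core.
    package require Tk

    # --- functional core ---
    proc doSearch {query} { puts "searching for: $query" }

    # --- stand-in for the interface file a builder would generate ---
    set f [open interface.tcl w]
    puts $f {
        entry  .query
        button .go -text "Search" -command {doSearch [.query get]}
        pack   .query .go -padx 4 -pady 4
    }
    close $f

    # --- load the interface description at run-time ---
    source interface.tcl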


Figure 20: Iterative user interface builder.

In order to make it even easier to modify the interface and test the effects with the real functional core, the interface editor can be built into the target application (Fig. 21). Changes to the interface can then be made from within the application and tested without re-running it. This situation occurs most often with interface builders based on an interpreted language (e.g. Tcl/Tk, Visual Basic).


Figure 21: Interactive user interface builder.

In either case, a final application can be created by compiling the interface generated by the user interface builder into actual code, linked with the functional core and a minimal run-time module. In this situation, the interface is not loaded from a file but directly created by the compiled code (Fig. 22). This is faster and eliminates the need for a separate interface description file.

Figure 22: generation of the final application. The interface compiler turns the interface description into compiled code, which the link editor combines with the functional core and the generator run-time to produce the interactive application.

User interface builders are widely used to develop prototypes as well as final applications. They are easy to use, they make it easy to change the look of the interface, and they hide a lot of the complexity of creating user interfaces with UI toolkits. However, despite their name, they do not cover the whole user interface, only the presentation. Therefore they still require a significant amount of programming, namely some part of the behavior and all of the application interface. Systems such as NeXT's Interface Builder (NeXT, 1991) ease this task by supporting part of the specification of the application objects and their links with the user interface. Still, user interface builders require knowledge of the underlying toolkit and an understanding of their limits, especially when prototyping novel visualization and interaction techniques.


5.2 Software environments

Application frameworks

Application frameworks address a different problem than user interface builders and are actually complementary to them. Many applications have a standard form where windows represent documents that can be edited with menu commands and tools from palettes; each document may be saved into a disk file; and standard functions such as copy/paste, undo and help are supported. Implementing such stereotyped applications with a UI toolkit or UI builder requires replicating a significant amount of code to implement the general logic of the application and the basics of the standard functions. Application frameworks address this issue by providing a shell that the developer fills with the functional core and the actual presentation of the non-standard parts of the interface.

Most frameworks have been inspired by MacApp, a framework developed in the eighties for creating Macintosh applications (Apple Computer, 1996). Typical base classes of MacApp include Document, View, Command and Application. MacApp supports multiple document windows, multiple views of a document, cut/copy/paste, undo, saving documents to files, scripting, and more.

With the advent of object-oriented technology, most application frameworks are implemented as collections of classes. Some classes provide services such as help or drag-and-drop and are used as client classes. Many classes are meant to be derived in order to add the application functionality through inheritance rather than by changing the actual code of the framework (a minimal sketch of this subclassing style is given at the end of this subsection). This makes it easy to support successive versions of the framework and limits the risks of breaking existing code.

Some frameworks are more specialized than MacApp. For example, Unidraw (Vlissides and Linton, 1990) is a framework for creating graphical editors in domains such as technical and artistic drawing, music composition, or circuit design. By addressing a smaller set of applications, such a framework can provide more support and significantly reduce implementation time.

Mastering an application framework takes time. It requires knowledge of the underlying toolkit and the design patterns used in the framework, and a good understanding of the design philosophy of the framework. A framework is useful because it provides a number of functions "for free", but at the same time it constrains the design space that can be explored. Frameworks can prove effective for prototyping if their limits are well understood by the design team.
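The following sketch illustrates the subclassing style described above: the framework supplies the generic logic and the developer derives classes to fill in application behavior. The class names echo MacApp's vocabulary, but the code is a hypothetical Java skeleton written for this illustration, not MacApp's actual API.

// Illustrative Java skeleton of the framework style described above: the
// framework owns the generic control flow (saving, bookkeeping) and the
// developer adds functionality by deriving classes, not by editing the
// framework. Class names echo MacApp's vocabulary but are hypothetical.
abstract class Document {
    private boolean modified;

    // Framework-provided service: saving delegates to a hook supplied
    // by the application-specific subclass.
    public final void save(java.io.File file) throws java.io.IOException {
        try (java.io.Writer out = new java.io.FileWriter(file)) {
            writeContents(out);   // hook implemented by the application
            modified = false;
        }
    }
    public void markModified() { modified = true; }
    public boolean isModified() { return modified; }

    // Hook that a concrete application overrides.
    protected abstract void writeContents(java.io.Writer out) throws java.io.IOException;
}

// The "filling" supplied by the application developer through inheritance.
class TextDocument extends Document {
    private final StringBuilder text = new StringBuilder();
    public void append(String s) { text.append(s); markModified(); }
    @Override protected void writeContents(java.io.Writer out) throws java.io.IOException {
        out.write(text.toString());
    }
}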


Model-based tools

User interface builders and application frameworks approach the development of interactive applications from the presentation side: first the presentation is built, then the behavior, i.e. the interaction, is added, and finally the interface is connected to the functional core. Model-based tools take the opposite approach, starting with the functional core and domain objects and working their way towards the user interface and the presentation (Szekely et al., 1992, 1993). The motivation for this approach is that the raison d'être of a user interface is the application data and functions that will be accessed by the user. Therefore it is important to start with the domain objects and related functions and derive the interface from them. The goal of these tools is to provide a semi-automatic generation of the user interface from high-level specifications, including specification of the domain objects and functions, specification of user tasks, and specification of presentation and interaction styles.

Despite significant efforts, the model-based approach is still in the realm of research: no commercial tool exists yet. By attempting to define an interface declaratively, model-based tools rely on a knowledge base of user interface design, used by the generation tools that transform the specifications into an actual interface. In other words, they attempt to do what designers do when they iteratively and painstakingly create an interactive system. This approach can probably work for well-defined problems with well-known solutions, i.e. families of interfaces that address similar problems. For example, it may be the case that interfaces for Management Information Systems (MIS) could be created with model-based tools, because these interfaces are fairly similar and well understood.

In their current form, model-based tools may be useful to create early horizontal or task-based prototypes. In particular, they can be used to generate a "default" interface that can serve as a starting point for iterative design. Future systems may be more flexible and therefore usable for other types of prototypes.

User interface development environments

Like model-based tools, user interface development environments (UIDE) attempt to support the development of the whole interactive system. The approach is more pragmatic than the model-based approach, however: it consists of assembling a number of tools into an environment where different aspects of an interactive system can be specified and generated separately.

Fig. 23: The Garnet toolkit and tools (Myers et al., 1990). The tools (the C32 spreadsheet, the Jade dialog box creation system and the Lapidary interface builder) sit on top of the Garnet toolkit (widget set, interactors, Opal graphics system), which is built on a constraint system, the KR object system and the X11 Window System, all implemented in Common Lisp on top of the operating system.

Garnet (Fig. 23) and its successor Amulet (Myers et al., 1997) provide a comprehensive set of tools, including a traditional user interface builder, a semi-automatic tool for generating dialog boxes, a user interface builder based on a demonstration approach, etc. One particular tool, Silk, is aimed explicitly at prototyping user interfaces.


Fig. 24: A sketch created with Silk (top left) and its automatic transformation into a Motif user interface (top right). A storyboard (bottom) used to test sequences of interactions, here a button that rotates an object. (J. Landay, with permission)

Silk (Landay and Myers, 2001) is a tool aimed at the early stages of design, when interfaces are sketched rather than prototyped in software. Using Silk, a user can sketch a user interface directly on the screen (Fig. 24). Using gesture recognition, Silk interprets the marks as widgets, annotations, etc. Even in its sketched form, the user interface is functional: buttons can be pressed, tools can be selected in a toolbar, and so on. The sketch can also be turned into an actual interface, e.g. using the Motif toolkit. Finally, storyboards can be created to describe and test sequences of interactions. Silk therefore combines some aspects of off-line and on-line prototyping techniques, trying to get the best of both worlds. This illustrates a current trend in research where on-line tools attempt to support not only the development of the final system, but the whole design process.

6. Evolutionary Prototypes

Evolutionary prototypes are a special case of iterative prototypes that are intended to evolve into the final system. Methodologies such as Extreme Programming (Beck, 2000) consist mostly of developing evolutionary prototypes. Since prototypes are rarely robust or complete, it is often impractical and sometimes dangerous to evolve them into the final system. Designers must think carefully about the underlying software architecture of the prototype, and developers should use well-documented design patterns to implement them.

6.1 Software architectures

The definition of the software architecture is traditionally done after the functional specification is written, but before coding starts.


The designers decide on the structure of the application and how functions will be implemented by software modules. The software architecture is the assignment of functions to modules. Ideally, each function should be implemented by a single module and modules should have minimal dependencies among them. Poor architectures increase development costs (coding, testing and integration), lower maintainability, and reduce performance. An architecture designed to support prototyping and evolution is crucial to ensure that design alternatives can be tested with maximum flexibility and at a reasonable cost.

Seeheim and Arch

The first generic architecture for interactive systems was devised at a workshop in Seeheim (Germany) in 1985 and is known as the Seeheim model (Pfaff, 1985). It separates the interactive application into a user interface and a functional core (then called "application", because the user interface was seen as adding a "coat of paint" on top of an existing application). The user interface is made of three modules: the presentation, the dialogue controller, and the application interface (Fig. 25). The presentation deals with capturing the user's input at a low level (often called the lexical level, by comparison with the lexical, syntactic and semantic levels of a compiler). The presentation is also responsible for generating output to the user, usually as a visual display. The dialogue controller assembles the user input into commands (a.k.a. the syntactic level), provides some immediate feedback for the action being carried out, such as an elastic rubber-band line, and detects errors. Finally, the application interface interprets the commands into calls to the functional core (a.k.a. the semantic level). It also interprets the results of these calls and turns them into output to be presented to the user.

Figure 25: Seeheim model (Pfaff, 1985). The presentation, dialogue controller and application interface make up the user interface, which communicates with the functional core.

All architecture models for interactive systems are based on the Seeheim model. They all recognize that there is a part of the system devoted to capturing user actions and presenting output (the presentation) and another part devoted to the functional core (the computational part of the application). In between are one or more modules that transform user actions into functional calls, and application data (including results from functional calls) into user output.

A modern version of the Seeheim model is the Arch model (The UIMS Workshop Tool Developers, 1992). The Arch model is made of five components (Fig. 26). The interface toolkit component is a pre-existing library that provides low-level services such as buttons, menus, etc. The presentation component provides a level of abstraction over the user interface toolkit. Typically, it implements interaction and visualization techniques that are not already supported by the interface toolkit. It may also provide platform independence by supporting different toolkits. The functional core component implements the functionality of the system; in some cases it already exists and cannot be changed. The domain adapter component provides additional services to the dialogue component that are not in the functional core. For example, if the functional core is a Unix-like file system and the user interface is an iconic interface similar to the Macintosh Finder, the domain adapter may provide the dialogue controller with a notification service so the presentation can be updated whenever a file is changed. Finally, the dialogue component is the keystone of the arch; it handles the translation between the user interface world and the domain world.


Figure 26: The Arch model (The UIMS Workshop Tool Developers, 1992). The dialogue component sits at the top of the arch, between the presentation and interaction toolkit components on one side and the domain adapter and domain-specific components on the other.

The Arch model is also known as the Slinky model because the relative sizes of the components may vary across applications as well as during the life of the software. For example, the presentation component may be almost empty if the interface toolkit provides all the necessary services, and be later expanded to support specific interaction or visualization techniques, or multiple platforms. Similarly, early prototypes may have a large domain adapter simulating the functional core of the final system, or interfacing to an early version of the functional core; the domain adapter may shrink to almost nothing when the final system is put together.

The separation that Seeheim, Arch and most other architecture models make between user interface and functional core is a good, pragmatic approach, but it may cause problems in some cases. A typical problem is a performance penalty when the interface components (left leg) have to query the domain components (right leg) during an interaction such as drag-and-drop. For example, when dragging the icon of a file over the desktop, the icons of folders and applications that can receive the file should highlight. Determining which icons to highlight is a semantic operation that depends on file types and other information and must therefore be carried out by the functional core or domain adapter. If drag-and-drop is implemented in the user interface toolkit, this means that each time the cursor goes over a new icon, up to four modules have to be traversed, once by the query and once by the reply, to find out whether or not to highlight the icon. This is both complicated and slow. A solution to this problem, called semantic delegation, involves shifting some functions, such as matching files for drag-and-drop, from the domain leg of the architecture into the dialogue or presentation component. This may solve the efficiency problem, but at the cost of added complexity, especially when maintaining or evolving the system, because it creates dependencies between modules that should otherwise be independent.

MVC and PAC

Architecture models such as Seeheim and Arch are abstract models and are thus rather imprecise. They deal with categories of modules such as presentation or dialogue, when in an actual architecture several modules will deal with presentation and several others with dialogue. The Model-View-Controller or MVC model (Krasner and Pope, 1988) is much more concrete. MVC was created for the implementation of the Smalltalk-80 environment (Goldberg and Robson, 1983) and is implemented as a set of Smalltalk classes. MVC describes the interface of an application as a collection of triplets of objects, each containing a model, a view and a controller. A Model represents information that needs to be represented and interacted with; it is controlled by application objects. A View displays the information in a model in a certain way.


A Controller interprets user input on the view and transforms it into changes in the model. When a model changes, it notifies its view so the display can be updated. Views and controllers are tightly coupled and sometimes implemented as a single object. A model is abstract when it has no view and no controller; it is non-interactive if it has a view but no controller. The MVC triplets are usually composed into a tree, e.g. an abstract model represents the whole interface; it has several components that are themselves models, such as the menu bar, the document windows, etc., all the way down to individual interface elements such as buttons and scrollbars. MVC supports multiple views fairly easily: the views share a single model; when a controller modifies the model, all the views are notified and update their presentation.

The Presentation-Abstraction-Control model, or PAC (Coutaz, 1987), is close to MVC. Like MVC, an architecture based on PAC is made of a set of objects, called PAC agents, organized in a tree. A PAC agent has three facets: the Presentation takes care of capturing user input and generating output; the Abstraction holds the application data, like a Model in MVC; and the Control facet manages the communication between the abstraction and presentation facets of the agent, and with sub-agents and super-agents in the tree. Like MVC, multiple views are easily supported. Unlike MVC, PAC is an abstract model, i.e. there is no reference implementation. A variant of MVC, called MVP (Model-View-Presenter), is very close to PAC and is used in ObjectArts' Dolphin Smalltalk. Other architecture models have been created for specific purposes such as groupware (Dewan, 1999) or graphical applications (Fekete and Beaudouin-Lafon, 1996).

6.2 Design patterns

Architecture models such as Arch or PAC only address the overall design of interactive software. PAC is more fine-grained than Arch, and MVC is more concrete since it is based on an implementation. Still, a user interface developer has to address many issues in order to turn an architecture into a working system. Design patterns have emerged in recent years as a way to capture effective solutions to recurrent software design problems. In their book, Gamma et al. (1995) present 23 patterns. It is interesting to note that many of these patterns come from interactive software, and most of them can be applied to the design of interactive systems. It is beyond the scope of this chapter to describe these patterns in detail. However, it is worth noting that most patterns for interactive systems are behavioral patterns, i.e. patterns that describe how to implement the control structure of the system.

Indeed, there is a battle for control in interactive software. In traditional, algorithmic software, the algorithm is in control and decides when to read input and write output. In interactive software, the user interface needs to be in control because user input should drive the system's reactions. Unfortunately, more often than not, the functional core also needs to be in control. This is especially common when creating user interfaces for legacy applications. In the Seeheim and Arch models, it is often believed that control is located in the dialogue controller, when in fact these architecture models do not explicitly address the issue of control. In MVC, the three basic classes Model, View and Controller implement a sophisticated protocol to ensure that user input is taken into account in a timely manner and that changes to a model are properly reflected in the view (or views).
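The notification protocol just described is essentially the Observer pattern. The sketch below is a minimal Java rendering of one MVC triplet, written for illustration; it is not the Smalltalk-80 implementation.

import java.util.ArrayList;
import java.util.List;

// Minimal MVC triplet in Java, illustrating the notification protocol
// described above (essentially the Observer pattern). A sketch only.
class CounterModel {
    private int value;
    private final List<Runnable> observers = new ArrayList<>();
    void addObserver(Runnable r) { observers.add(r); }
    void increment() {
        value++;
        observers.forEach(Runnable::run);  // a model change notifies its views
    }
    int getValue() { return value; }
}

class CounterView {
    void render(CounterModel model) {
        System.out.println("count = " + model.getValue());
    }
}

class CounterController {
    private final CounterModel model;
    CounterController(CounterModel model) { this.model = model; }
    void onClick() { model.increment(); }  // user input becomes a model change
}

class MvcDemo {
    public static void main(String[] args) {
        CounterModel model = new CounterModel();
        CounterView view = new CounterView();
        model.addObserver(() -> view.render(model));  // several views could subscribe
        new CounterController(model).onClick();       // prints: count = 1
    }
}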
Some authors actually describe MVC as a design pattern, not an architecture. In fact it is both: the inner workings of the three basic classes are a pattern, but the decomposition of the application into a set of MVC triplets is an architectural issue.


It is now widely accepted that interactive software is event-driven, i.e. the execution is driven by the user's actions, leading to control that is localized in the user interface components. Design patterns such as Command, Chain of Responsibility, Mediator, and Observer (Gamma et al., 1995) are especially useful to implement the transformation of low-level user events into higher-level commands, to find out which object in the architecture responds to a command, and to propagate the changes produced by a command from internal objects of the functional core to user interface objects. Using design patterns to implement an interactive system not only saves time, it also makes the system more open to changes and easier to maintain. Therefore software prototypes should be implemented by experienced developers who know their pattern language and who understand the need for flexibility and evolution.

7. Summary

Prototyping is an essential component of interactive system design. Prototypes may take many forms, from rough sketches to detailed working prototypes. They provide concrete representations of design ideas and give designers, users, developers and managers an early glimpse into how the new system will look and feel. Prototypes increase creativity, allow early evaluation of design ideas, help designers think through and solve design problems, and support communication within multi-disciplinary design teams.

Prototypes, because they are concrete and not abstract, provide a rich medium for exploring a design space. They suggest alternate design paths and reveal important details about particular design decisions. They force designers to be creative and to articulate their design decisions. Prototypes embody design ideas and encourage designers to confront their differences of opinion. The precise aspects of a prototype offer specific design solutions: designers can then decide to generate and compare alternatives. The imprecise or incomplete aspects of a prototype highlight the areas that must be refined or require additional ideas.

We begin by defining prototypes and then discuss them as design artifacts. We introduce four dimensions by which they can be analyzed: representation, precision, interactivity and evolution. We then discuss the role of prototyping within the design process and explain the concept of creating, exploring and modifying a design space. We briefly describe techniques for generating new ideas, to expand the design space, and techniques for choosing among design alternatives, to contract the design space. We describe a variety of rapid prototyping techniques for exploring ideas quickly and inexpensively in the early stages of design, including off-line techniques (from paper-and-pencil to video) and on-line techniques (from fixed to interactive simulations). We then describe iterative prototyping techniques for working out the details of the on-line interaction, including software development tools and software environments. We conclude with evolutionary prototyping techniques, which are designed to evolve into the final software system, including a discussion of the underlying software architectures, design patterns and extreme programming.

This chapter has focused mostly on graphical user interfaces (GUIs) that run on traditional workstations. Such applications are dominant today, but this is changing as new devices are being introduced, from cell-phones and PDAs to wall-size displays.


New interaction styles are emerging, such as augmented reality, mixed reality and ubiquitous computing. Designing new interactive devices and the interactive software that runs on them is becoming ever more challenging: whether aimed at a wide audience or a small number of specialists, the hardware and software systems must be adapted to their contexts of use. The methods, tools and techniques presented in this chapter can easily be applied to these new applications.

We view design as an active process of working with a design space, expanding it by generating new ideas and contracting it as design choices are made. Prototypes are flexible tools that help designers envision this design space, reflect upon it, and test their design decisions. Prototypes are diverse and can fit within any part of the design process, from the earliest ideas to the final details of the design. Perhaps most important, prototypes provide one of the most effective means for designers to communicate with each other, as well as with users, developers and managers, throughout the design process.

8. References

Apple Computer (1996). Programmer's Guide to MacApp.

Beaudouin-Lafon, M. (2000). Instrumental Interaction: An Interaction Model for Designing Post-WIMP User Interfaces. Proceedings ACM Human Factors in Computing Systems, CHI'2000, pp.446-453, ACM Press.

Beaudouin-Lafon, M. (2001). Novel Interaction Techniques for Overlapping Windows. Proceedings of ACM Symposium on User Interface Software and Technology, UIST 2001, CHI Letters 3(2), ACM Press. In press.

Beaudouin-Lafon, M. and Lassen, M. (2000). The Architecture and Implementation of a Post-WIMP Graphical Application. Proceedings of ACM Symposium on User Interface Software and Technology, UIST 2000, CHI Letters 2(2):191-190, ACM Press.

Beaudouin-Lafon, M. and Mackay, W. (2000). Reification, Polymorphism and Reuse: Three Principles for Designing Visual Interfaces. Proceedings of the Conference on Advanced Visual Interfaces, AVI 2000, Palermo, Italy, May 2000, pp.102-109.

Beck, K. (2000). Extreme Programming Explained. New York: Addison-Wesley.

Bederson, B. and Hollan, J. (1994). Pad++: A Zooming Graphical Interface for Exploring Alternate Interface Physics. Proceedings of ACM Symposium on User Interface Software and Technology, UIST'94, pp.17-26, ACM Press.

Bederson, B. and Meyer, J. (1998). Implementing a Zooming Interface: Experience Building Pad++. Software Practice and Experience, 28(10):1101-1135.

Bederson, B.B., Meyer, J. and Good, L. (2000). Jazz: An Extensible Zoomable User Interface Graphics ToolKit in Java. Proceedings of ACM Symposium on User Interface Software and Technology, UIST 2000, CHI Letters 2(2):171-180, ACM Press.

Bier, E., Stone, M., Pier, K., Buxton, W. and De Rose, T. (1993). Toolglass and Magic Lenses: the See-Through Interface. Proceedings ACM SIGGRAPH, pp.73-80, ACM Press.


Boehm, B. (1988). A Spiral Model of Software Development and Enhancement. IEEE Computer, 21(5):61-72.

Bødker, S., Ehn, P., Knudsen, J., Kyng, M. and Madsen, K. (1988). Computer support for cooperative design. In Proceedings of the CSCW'88 ACM Conference on Computer-Supported Cooperative Work. Portland, OR: ACM Press, pp.377-393.

Chapanis, A. (1982). Man/Computer Research at Johns Hopkins. In Kasschau, Lachman and Laughery (eds), Information Technology and Psychology: Prospects for the Future, Third Houston Symposium. New York: Praeger Publishers.

Collaros, P.A. and Anderson, L.R. (1969). Effect of perceived expertness upon creativity of members of brainstorming groups. Journal of Applied Psychology, 53, 159-163.

Coutaz, J. (1987). PAC, an Object Oriented Model for Dialog Design. In Bullinger, H.-J. and Shackel, B. (eds), Proceedings of INTERACT'87, pp.431-436, Elsevier Science Publishers.

Dewan, P. (1999). Architectures for Collaborative Applications. In Beaudouin-Lafon, M. (ed.), Computer-Supported Co-operative Work, Trends in Software Series, Wiley, pp.169-193.

Dykstra-Erickson, E., Mackay, W.E. and Arnowitz, J. (March 2001). Trialogue on Design of. ACM Interactions, pp.109-117.

Dourish, P. (1997). Accounting for System Behaviour: Representation, Reflection and Resourceful Action. In Kyng and Mathiassen (eds), Computers and Design in Context. Cambridge: MIT Press, pp.145-170.

Eckstein, R., Loy, M. and Wood, D. (1998). Java Swing. Cambridge, MA: O'Reilly.

Fekete, J-D. and Beaudouin-Lafon, M. (1996). Using the Multi-layer Model for Building Interactive Graphical Applications. Proceedings of ACM Symposium on User Interface Software and Technology, UIST'96, ACM Press, pp.109-118.

Gamma, E., Helm, R., Johnson, R. and Vlissides, J. (1995). Design Patterns: Elements of Reusable Object-Oriented Software. Reading, MA: Addison-Wesley.

Gibson, J.J. (1979). The Ecological Approach to Visual Perception. Boston: Houghton Mifflin.

Goldberg, A. and Robson, D. (1983). Smalltalk-80: The Language and its Implementation. Reading, MA: Addison-Wesley.

Goodman, D. (1987). The Complete HyperCard Handbook. New York: Bantam Books.

Greenbaum, J. and Kyng, M. (eds) (1991). Design at Work: Cooperative Design of Computer Systems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Houde, S. and Hill, C. (1997). What do Prototypes Prototype? In Handbook of Human-Computer Interaction (2nd, completely revised edition). North-Holland, pp.367-381.


Kelley, J.F. (1983). An empirical methodology for writing user-friendly natural language computer applications. In Proceedings of the CHI'83 Conference on Human Factors in Computing Systems. Boston, Massachusetts.

Krasner, E.G. and Pope, S.T. (1988). A Cookbook for Using the Model-View-Controller User Interface Paradigm in Smalltalk-80. Journal of Object-Oriented Programming, August/September 1988, pp.27-49.

Kurtenbach, G., Fitzmaurice, G., Baudel, T. and Buxton, W. (1997). The Design of a GUI Paradigm based on Tablets, Two-hands, and Transparency. Proceedings of ACM Human Factors in Computing Systems, CHI'97, pp.35-42, ACM Press.

Landay, J. and Myers, B.A. (2001). Sketching Interfaces: Toward More Human Interface Design. IEEE Computer, 34(3):56-64.

Linton, M.A., Vlissides, J.M. and Calder, P.R. (1989). Composing user interfaces with InterViews. IEEE Computer, 22(2):8-22.

Mackay, W.E. (1988). Video Prototyping: A technique for developing hypermedia systems. Demonstration in Proceedings of CHI'88, Conference on Human Factors in Computing, Washington, D.C.

Mackay, W.E. and Pagani, D. (1994). Video Mosaic: Laying out time in a physical space. Proceedings of ACM Multimedia '94. San Francisco, CA: ACM, pp.165-172.

Mackay, W.E. and Fayard, A-L. (1997). HCI, Natural Science and Design: A Framework for Triangulation Across Disciplines. Proceedings of ACM DIS '97, Designing Interactive Systems. Amsterdam, the Netherlands: ACM/SIGCHI, pp.223-234.

Mackay, W.E. (2000). Video Techniques for Participatory Design: Observation, Brainstorming & Prototyping. Tutorial Notes, CHI 2000, Human Factors in Computing Systems. The Hague, the Netherlands. (148 pages) URL: www.lri.fr/~mackay/publications

Mackay, W., Ratzer, A. and Janecek, P. (2000). Video Artifacts for Design: Bridging the Gap between Abstraction and Detail. Proceedings of ACM Conference on Designing Interactive Systems, DIS 2000, pp.72-82, ACM Press.

Myers, B.A., Giuse, D.A., Dannenberg, R.B., Vander Zanden, B., Kosbie, D.S., Pervin, E., Mickish, A. and Marchal, P. (1990). Garnet: Comprehensive Support for Graphical, Highly-Interactive User Interfaces. IEEE Computer, 23(11):71-85.

Myers, B.A. (1991). Separating application code from toolkits: Eliminating the spaghetti of call-backs. Proceedings of ACM SIGGRAPH Symposium on User Interface Software and Technology, UIST'91, pp.211-220.

Myers, B.A. and Rosson, M.B. (1992). Survey on user interface programming. In ACM Conference on Human Factors in Computing Systems, CHI'92, pp.195-202, ACM Press.

Myers, B.A., McDaniel, R.G., Miller, R.C., Ferrency, A.S., Faulring, A., Kyle, B.D., Mickish, A., Klimovitski, A. and Doane, P. (1997). The Amulet environment. IEEE Transactions on Software Engineering, 23(6):347-365.


NeXT Corporation (1991). NeXT Interface Builder Reference Manual. Redwood City, California.

Norman, D.A. and Draper, S.W. (eds) (1986). User Centered System Design. Hillsdale, NJ: Lawrence Erlbaum Associates.

Osborn, A. (1957). Applied Imagination: Principles and Procedures of Creative Thinking (rev. ed.). New York: Scribner's.

Ousterhout, J.K. (1994). Tcl and the Tk Toolkit. Reading, MA: Addison-Wesley.

Perkins, R., Keller, D.S. and Ludolph, F. (1997). Inventing the Lisa User Interface. ACM Interactions, 4(1):40-53.

Pfaff, G.P. and ten Hagen, P.J.W. (eds) (1985). User Interface Management Systems. Berlin: Springer.

Raskin, J. (2000). The Humane Interface. New York: Addison-Wesley.

Roseman, M. and Greenberg, S. (1999). Groupware Toolkits for Synchronous Work. In Beaudouin-Lafon, M. (ed.), Computer-Supported Co-operative Work, Trends in Software Series, Wiley, pp.135-168.

Roseman, M. and Greenberg, S. (1996). Building real-time groupware with GroupKit, a groupware toolkit. ACM Transactions on Computer-Human Interaction, 3(1):66-106.

Schroeder, W., Martin, K. and Lorensen, B. (1997). The Visualization Toolkit. Prentice Hall.

Strass, P. (1993). IRIS Inventor, a 3D Graphics Toolkit. Proceedings of the ACM Conference on Object-Oriented Programming, Systems, Languages and Applications, OOPSLA '93, pp.192-200.

Szekely, P., Luo, P. and Neches, R. (1992). Facilitating the Exploration of Interface Design Alternatives: The HUMANOID. Proceedings of ACM Conference on Human Factors in Computing Systems, CHI'92, pp.507-515.

Szekely, P., Luo, P. and Neches, R. (1993). Beyond Interface Builders: Model-based Interface Tools. Proceedings of ACM/IFIP Conference on Human Factors in Computing Systems, INTERCHI'93, pp.383-390.

The UIMS Workshop Tool Developers (1992). A Metamodel for the Runtime Architecture of an Interactive System. SIGCHI Bulletin, 24(1):32-37.

Vlissides, J.M. and Linton, M.A. (1990). Unidraw: a framework for building domain-specific graphical editors. ACM Transactions on Information Systems, 8(3):237-268.

Wegner, P. (1997). Why Interaction is More Powerful Than Algorithms. Communications of the ACM, 40(5):80-91.

Woo, M., Neider, J. and Davis, T. (1997). OpenGL Programming Guide. Reading, MA: Addison-Wesley.


Human-Computer Interaction: An Empirical Research Perspective

I. Scott MacKenzie

CHAPTER 5
Designing HCI Experiments

Learning how to design and conduct an experiment with human participants is a skill required of all researchers in human-computer interaction. In this chapter I describe the core details of designing and conducting HCI experiments.

One way to think about experiment design is through a signal and noise metaphor. In the metaphor, we divide our observations and measurements into two components: signal and noise. (See Figure 5.1.) The source shows a time series. A slight upward trend is apparent; however, the variability or noise in the source makes this difficult to detect. If we separate the source into components for signal and noise, the trend in the signal is clear.

In HCI experiments, the signal is related to a variable of interest, such as input device, feedback mode, or an interaction technique under investigation. The noise is everything else—the random influences. These include environmental circumstances such as temperature, lighting, background noise, a wobbly chair, or glare on the computer screen. The people or participants in the experiment are also a source of noise or variability. Some participants may be having a good day, while others are having a bad day. Some people are predisposed to behave in a certain manner; others behave differently.

The process of designing an experiment is one of enhancing the signal while reducing the noise. This is done by carefully considering the setup of the experiment in terms of the variables manipulated and measured, the variables controlled, the procedures, the tasks, and so on. Collectively, these properties of an experiment establish the methodology for the research.
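The metaphor can be illustrated with a few lines of code. The simulation below is our own illustration, not taken from the chapter's figure: a weak upward trend is buried in random noise, and averaging repeated observations at each point recovers it.

import java.util.Random;

// Illustration of the signal-and-noise metaphor (assumed values, not data
// from the text): a weak upward trend (the signal) is buried in random
// variation (the noise); averaging repeated observations recovers the trend.
class SignalNoiseDemo {
    public static void main(String[] args) {
        Random rnd = new Random(1);
        int points = 10, repetitions = 50;
        for (int i = 0; i < points; i++) {
            double signal = 0.1 * i;                         // slight upward trend
            double single = signal + rnd.nextGaussian();     // one noisy observation
            double sum = 0;
            for (int r = 0; r < repetitions; r++) {
                sum += signal + rnd.nextGaussian();          // repeated noisy observations
            }
            System.out.printf("x=%d  one observation=%6.2f  mean of %d observations=%6.2f%n",
                    i, single, repetitions, sum / repetitions);
        }
    }
}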

5.1 What methodology?

The term method or methodology refers to the way an experiment is designed and carried out. This involves deciding on the people (participants), the hardware and software (materials or apparatus), the tasks, the order of tasks, the procedure for briefing and preparing the participants, the variables, the data collected and analyzed, and so on. Having a sound methodology is critical. On this point, Allen Newell did not hesitate: "Science is method. Everything else is commentary."1

1 This quote from Allen Newell was cited and elaborated on by Stuart Card in an invited talk at the ACM's SIGCHI conference in Austin, Texas (May 10, 2012).


FIGURE 5.1 Signal-to-noise conceptualization of experiment design.

These are strong words. Why did Newell apply such forceful yet narrow language to a topic as broad as science? The reason is that Newell, and others, understand that methodology is the bedrock of science. If the methodology is weak or flawed, there is no science forthcoming. What remains is little else than commentary. In the preceding chapter, I advocated the use of a standardized methodology to strengthen experimental research. The flip side is that an ad hoc, or made-up, methodology weakens research. There is little sense in contriving a methodology simply because it seems like a good way to test or demonstrate an idea.

So what is the appropriate methodology for research in human-computer interaction? The discussions that follow pertain only to experimental research, and in particular to factorial experiments, where participants are exposed to levels of factors (test conditions) while their behavior (human performance) is observed and measured. By and large, the methodology is plucked from one of HCI's parent disciplines: experimental psychology.

Just as the Association for Computing Machinery (ACM) is the dominant organization overseeing computer science and related special interests such as HCI, the American Psychological Association (APA) is the dominant organization overseeing experimental psychology. Its Publication Manual of the American Psychological Association, first published in 1929, is a valuable resource for researchers undertaking experimental research involving human participants (APA, 2010). The manual, now in its sixth edition, is used by over 1,000 journals across many disciplines (Belia, Fidler, Williams, and Cumming, 2005). These include HCI journals. The APA guidelines are recommended by journals such as the ACM's Transactions on Computer-Human Interaction (TOCHI) (ACM, 2012) and Taylor and Francis's Human-Computer Interaction (Taylor and Francis, 2012). The APA Publication Manual is about more than publishing style; the manual lays out many methodological issues, such as naming and referring to independent and dependent variables, recruiting participants, reporting the results of statistical tests, and so on. The important link between research and publication is also reflected in the title.

Another resource is psychologist David Martin's Doing Psychology Experiments, now in its sixth edition (D. W. Martin, 2004). Martin's approach is refreshing and entertaining, more cookbook-like than academic. All the core details are there, with examples that teach and amuse.


The proceedings of the ACM SIGCHI's annual conference (CHI) are also an excellent resource. CHI papers are easily viewed and downloaded from the ACM Digital Library. Of course, many research papers in the CHI proceedings do not present experimental research. And that's fine. HCI is multidisciplinary. The research methods brought to bear on human interaction with technology are equally diverse. However, of those papers that do present a user study—an experiment with human participants—there are, unfortunately, many where the methodology is ad hoc. The additional burden of weaving one's way through an unfamiliar methodology while simultaneously trying to understand a new and potentially interesting idea makes studying these papers difficult.

But there are many CHI papers that stick to the standard methodology for experiments with human participants. It is relatively easy to spot examples. If the paper has a section called Method or Methodology and the first sub-section within is called Participants, there is a good chance the paper and the research it describes follow the standards for experimental research as laid out in this chapter. Examples from the CHI proceedings include the following: Aula, Khan, and Guan, 2010; Chin and Fu, 2010; Chin, Fu, and Kannampallil, 2009; Duggan and Payne, 2008; Gajos, Wobbrock, and Weld, 2008; Kammerer, Nairn, Pirolli, and Chi, 2009; Majaranta, Ahola, and Špakov, 2009; Räihä and Špakov, 2009; Sjölie et al., 2010; Sun, Zhang, Wiedenbeck, and Chintakovid, 2006; Tohidi et al., 2006; Wobbrock et al., 2009.

5.2 Ethics approval

One crucial step that precedes the design of every HCI experiment is ethics approval. Since HCI research involves humans, "researchers must respect the safety, welfare, and dignity of human participants in their research and treat them equally and fairly."2 The approval process is governed by the institution or funding agency overseeing the research. At this author's institution, research projects must be approved by the Human Participant Review Committee (HRPC). Other committee names commonly used are the Institutional Review Board (IRB), Ethics Review Committee (ERC), and so on. Typically, the review committee serves to ensure a number of ethical guidelines are acknowledged and adhered to. These include the right of the participant to be informed of the following:

● The nature of the research (hypotheses, goals and objectives, etc.)
● The research methodology (e.g., medical procedures, questionnaires, participant observation, etc.)
● Any risks or benefits
● The right not to participate, not to answer any questions, and/or to terminate participation at any time without prejudice (e.g., without academic penalty, withdrawal of remuneration, etc.)
● The right to anonymity and confidentiality

2 www.yorku.ca/research/support/ethics/humans.html

Details will vary according to local guidelines. Special attention is usually given for vulnerable participants, such as pregnant women, children, or the elderly. The basis for approving the research, where human participants are involved, is in achieving a balance between the risks to participants and the benefits to society.

5.3 Experiment design

Experiment design is the process of bringing together all the pieces necessary to test hypotheses on a user interface or interaction technique. It involves deciding on and defining which variables to use, what tasks and procedure to use, how many participants to use and how to solicit them, and so on.

One of the most difficult steps in designing an HCI experiment is just getting started. Ideas about a novel interface or interaction technique take shape well before thoughts of doing an experiment. There may even be an existing prototype that implements a research idea. Perhaps there is no prototype—yet. Regardless, there is an idea about an interface or interaction technique, and it seems new and interesting. Doing an experiment to test the idea seems like a good idea, but it is difficult transitioning from the creative and exciting work of developing a novel idea to the somewhat mechanical and mundane work of doing an experiment.

Here is a question that will focus the mind more than any other in getting off and running with an HCI experiment: what are the experimental variables? This seems like an odd place to begin. After all, experimental variables are in a distant world from the creative effort invested thus far. Well, not really. Thinking about experimental variables is an excellent exercise. Here's why. The process forces us to transition from well-intentioned, broad yet untestable questions (e.g., Is my idea any good?) to narrower yet testable questions (e.g., Can a task be performed more quickly with my new interface than with an existing interface?). If necessary, review the discussion in the preceding chapter on research questions and internal and external validity. Thinking about experimental variables forces us to craft narrow and testable questions.

The two most important experimental variables are independent variables and dependent variables. In fact, these variables are found within the example question in the preceding paragraph. Expressions like "more quickly" or "fewer steps" capture the essence of dependent variables: human behaviors that are measured. The expression "with my new interface than with an existing interface" captures the essence of an independent variable: an interface that is compared with an alternative interface. In fact, a testable research question inherently expresses the relationship between an independent variable and a dependent variable. Let's examine these two variables in more detail.


5.4 Independent variables

An independent variable is a circumstance or characteristic that is manipulated or systematically controlled to elicit a change in a human response while the user is interacting with a computer. An independent variable is also called a factor. Experiments designed with independent variables are often called factorial experiments. The variable is manipulated across multiple (at least two) levels of the circumstance or characteristic. The variable is "independent" because it is independent of participant behavior, which means there is nothing a participant can do to influence an independent variable. The variable manipulated is typically a nominal-scale attribute, often related to a property of an interface.

Review any HCI research paper that presents a factorial experiment and examples of independent variables are easily found. They are anything that might affect users' proficiency in using a computer system. Examples include device (with levels mouse, trackball, and stylus) (MacKenzie, Sellen, and Buxton, 1991), feedback modality (with levels auditory, visual, and tactile) (Akamatsu et al., 1995), display size (with levels large and small) (Dillon, Richardson, and Mcknight, 1990), display type (with levels CRT and LCD) (MacKenzie and Riddersma, 1994), cross-display technique (with levels stitching, mouse ether, and ether + halo) (Nacenta, Regan, Mandry, and Gutwin, 2008), transfer function (with levels constant gain and pointer acceleration) (Casiez, Vogel, Pan, and Chaillou, 2007), tree visualization (with levels traditional, list, and multi-column) (Song, Kim, Lee, and Seo, 2010), and navigation technique (with levels standard pan and zoom versus PolyZoom) (Javed, Ghani, and Elmqvist, 2012). These variables are easy to manipulate because they are attributes of the apparatus (i.e., the computer or software). The idea of "manipulating" simply refers to systematically giving one interface, then another, to participants as part of the experimental procedure.

However, an independent variable can be many things besides an attribute of a computer system. It can be a characteristic of humans, such as age (Chin and Fu, 2010; Chin et al., 2009), gender (male, female) (Sun et al., 2006; Zanbaka, Goolkasian, and Hodges, 2006), handedness (left handed, right handed) (Kabbash et al., 1993; Peters and Ivanoff, 1999), expertise in assessing web pages (expert, novice) (Brajnik, Yesilada, and Harper, 2011), body position (standing, sitting, walking), preferred operating system (Windows, Mac OS, Linux), first language (e.g., English, French, Chinese), political viewpoint (left, right), religious viewpoint, highest level of education, income, height, weight, hair color, shoe size, and so on. It is not clear that these human characteristics necessarily relate to HCI, but who knows. Note that human characteristics such as gender or first language are naturally occurring attributes. Although such attributes are legitimate independent variables, they cannot be "manipulated" in the same way as an attribute of an interface. An independent variable can also be an environmental circumstance, such as background noise (quiet, noisy), room lighting (sun, incandescent, fluorescent), vibration level (calm, in a car, in a train), and so on.

Here are two tips to consider.


First, when formulating an independent variable, express it both in terms of the circumstance or characteristic itself as well as the levels of the circumstance or characteristic chosen for testing. (The levels of an independent variable are often called test conditions.) So we might have an independent variable called interaction stance with levels sitting, standing, and walking. This might seem like an odd point; however, in reading HCI research papers, it is surprising how often an independent variable is not explicitly named. For example, if an experiment seeks to determine whether a certain PDA task is performed better using audio versus tactile feedback, it is important to separately name both the independent variable (e.g., feedback modality) and the levels (audio, tactile).

The second tip is related to the first: once the name of the independent variable and the names of the levels are decided, stick with these terms consistently throughout a paper. These terms hold special meaning within the experiment and any deviation in form is potentially confusing to the reader. Switching to terms like interaction position (cf. interaction stance), upright (cf. standing), sound (cf. audio), or vibration (cf. tactile) is potentially confusing. Is this a minor, nit-picky point? No. At times, it is a struggle to follow the discussions in a research paper. The fault often lies in the write-up, not in one's ability to follow or understand. The onus is on the researcher writing up the results of his or her work to deliver the rationale, methodology, results, discussion, and conclusions in the clearest way possible. Writing in a straightforward, consistent, and concise voice cannot be overemphasized. Further tips on writing for clarity are elaborated in Chapter 8.

Although it is reasonable to design and conduct an HCI experiment with a single independent variable, experiments often have more than one independent variable. Since considerable work is invested in designing and executing an experiment, there is a tendency to pack in as many independent variables as possible, so that more research questions are posed and, presumably, answered. However, including too many variables may compromise the entire experiment. With every additional independent variable, more effects exist between the variables. Figure 5.2 illustrates.

Independent     Effects                                           Total
variables       Main    2-way    3-way    4-way    5-way
    1             1                                                 1
    2             2        1                                        3
    3             3        3        1                               7
    4             4        6        3        1                     14
    5             5       10        6        3        1            25

FIGURE 5.2 The number of effects (main and interaction) increases as the number of independent variables increases.

A design with a single independent variable includes a main effect, but no interaction effects. A design with two independent variables includes two main effects and one interaction effect, for a total of three effects. The interaction effect is a two-way interaction, since it is between two independent variables. For example, an experiment with independent variables Device and Task includes main effects for Device and Task as well as a Device × Task interaction effect. As a reminder, the effect is on the dependent variable. The interpretation of interaction effects is discussed in Chapter 6 on Hypothesis Testing.

Once a third independent variable is introduced, the situation worsens: there are seven effects! With four and five independent variables, there are 14 and 25 effects, respectively. Too many variables! It is difficult to find meaningful interpretations for all the effects when there are so many. Furthermore, variability in the human responses is added with each independent variable, so all may be lost if too many variables are included. Interaction effects that are three-way or higher are extremely difficult to interpret and are best avoided. A good design, then, is one that limits the number of independent variables to one or two, three at most.3

3 We should add that additional independent variables are sometimes added simply to ensure the procedure covers a representative range of behaviors. For example, a Fitts' law experiment primarily interested in device and task might also include movement distance and target size as independent variables—the latter two included to ensure the task encompasses a typical range of target selection conditions (MacKenzie et al., 1991).
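In the experiment software itself, it is convenient to represent each independent variable and its levels explicitly, for example as enumerations, and to generate the test conditions by crossing the levels. The sketch below is a hypothetical Java fragment using two of the factors named earlier in this section.

// Hypothetical sketch: independent variables (factors) and their levels
// represented as enums, following two of the examples in this section.
enum Device { MOUSE, TRACKBALL, STYLUS }
enum FeedbackModality { AUDITORY, VISUAL, TACTILE }

class TestConditions {
    public static void main(String[] args) {
        // Crossing the levels of the two factors yields the 3 x 3 = 9 test conditions.
        for (Device d : Device.values()) {
            for (FeedbackModality f : FeedbackModality.values()) {
                System.out.println(d + " x " + f);
            }
        }
    }
}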

5.5 Dependent variables

A dependent variable is a measured human behavior. In HCI the most common dependent variables relate to speed and accuracy, with speed often reported in its reciprocal form, time—task completion time. Accuracy is often reported as the percentage of trials or other actions performed correctly or incorrectly. In the latter case, accuracy is called errors or error rate. The dependent in dependent variable refers to the variable being dependent on the human. The measurements depend on what the participant does. If the dependent variable is, for example, task completion time, then clearly the measurements are highly dependent on the participant's behavior.

Besides speed and accuracy, a myriad of other dependent variables are used in HCI experiments. Others include preparation time, action time, throughput, gaze shifts, mouse-to-keyboard hand transitions, presses of backspace, target re-entries, retries, key actions, wobduls, etc. The possibilities are limitless. Now, if you are wondering about "wobduls," then you're probably following the discussion. So what is a wobdul? Well, nothing, really. It's just a made-up word. It is mentioned only to highlight something important about dependent variables: any observable, measurable aspect of human behavior is a potential dependent variable. Provided the behavior has the ability to differentiate performance between two test conditions in a way that might shed light on the strengths or weaknesses of one condition over another, then it is a legitimate dependent variable. So when it comes to dependent variables, it is acceptable to "roll your own." Of course, it is essential to clearly define all dependent variables to ensure the research can be replicated.

An example of a novel dependent variable is "negative facial expressions" defined by Duh et al. (2008) in a comparative evaluation of three mobile phones used for gaming.

Participants were videotaped playing games on different mobile phones. A post-test analysis of the videotape was performed to count negative facial expressions such as frowns, confusion, frustration, and head shakes. The counts were entered in an analysis of variance to determine whether participants had different degrees of difficulty with any of the interfaces.

Another example is "read text events." In pilot testing a system using an eye tracker for text entry (eye typing), it was observed that users frequently shifted their point of gaze from the on-screen keyboard to the typed text to monitor their progress (Majaranta et al., 2006). Furthermore, there was a sense that this behavior was particularly prominent for one of the test conditions. Thus RTE (read text events) was defined and used as a dependent variable. The same research also used "re-focus events" (RFE) as a dependent variable. RFE was defined as the number of times a participant refocuses on a key to select it.

Unless one is investigating mobile phone gaming or eye typing, it is unlikely negative facial expressions, read text events, or refocus events are used as dependent variables. They are mentioned only to emphasize the merit in defining, measuring, and analyzing any human behavior that might expose differences in the interfaces or interaction techniques under investigation.

As with independent variables, it is often helpful to name the variable separately from its units. For example, in a text entry experiment there is likely a dependent variable called text entry speed with units "words per minute." Experiments on computer pointing devices often use a Fitts' law paradigm for testing. There is typically a dependent variable named throughput with units "bits per second." The most common dependent variable is task completion time with units "seconds" or "milliseconds." If the measurement is a simple count of events, there is no unit per se.

When contriving a dependent variable, it is important to consider how the measurements are gathered and the data collected, organized, and stored. The most efficient method is to design the experimental software to gather the measurements based on time stamps, key presses, or other interactions detectable through software events. The data should be organized and stored in a manner that facilitates follow-up analyses. Figure 5.3 shows an example for a text entry experiment. There are two data files. The first contains timestamps and key presses, while the second summarizes the entry of a complete phrase, one line per phrase.

The data files in Figure 5.3 were created by the software that implements the user interface or interaction technique. Pilot testing is crucial. Often, pilot testing is considered a rough test of the user interface—with modifications added to get the interaction right. And that's true. But pilot testing is also important to ensure the data collected are correct and available in an appropriate format for follow-on analyses. So pilot test the experiment software and perform preliminary analyses on the data collected. A spreadsheet application is often sufficient for this.

To facilitate follow-up analyses, the data should also include codes to identify the participants and test conditions. Typically, this information is contained in additional columns in the data or in the filenames.


(a) my bike has a flat tire my bike has a flat tire 16 3 891 2 1797 3 m 3656 2 4188 1 4672 2 y 5750 3 5938 3 [Space] 6813 3 6984 2 7219 0 8656 3 b

(b) min_keystrokes,keystrokes,presented_characters,transcribed_characters, ... 55, 59, 23, 23, 29.45, 0, 9.37, 0.0, 2.5652173913043477, 93.22033898305085 61, 65, 26, 26, 30.28, 0, 10.3, 0.0, 2.5, 93.84615384615384 85, 85, 33, 33, 48.59, 0, 8.15, 0.0, 2.5757575757575757, 100.0 67, 71, 28, 28, 33.92, 0, 9.91, 0.0, 2.5357142857142856, 94.36619718309859 61, 70, 24, 24, 39.44, 0, 7.3, 0.0, 2.9166666666666665, 87.14285714285714

FIGURE 5.3 Example data files from a text entry experiment: (a) The summary data one (sd1) file contains timestamps and keystroke data. (b) The summary data two (sd2) file contains one line for each phrase of entry.

To facilitate follow-up analyses, the data should also include codes to identify the participants and test conditions. Typically, this information is contained in additional columns in the data or in the filenames. For example, the filename for the data in Figure 5.3a is TextInputHuffman-P01-D99-B06-S01.sd1 and identifies the experiment (TextInputHuffman), the participant (P01), the device (D99), the block (B06), and the session (S01). The suffix is “sd1” for “summary data one.” Note that the sd2 file in Figure 5.3b is comma-delimited to facilitate importing and contains a header line identifying the data in each column below.

If the experiment is conducted using a commercial product, it is often impossible to collect data through custom experimental software. Participants are observed externally, rather than through software. In such cases, data collection is problematic and requires a creative approach. Methods include manual timing by the experimenter, using a log sheet and pencil to record events, or taking photos or screen snaps of the interaction as entry proceeds. A photo is useful, for example, if results are visible on the display at the end of a trial. Videotaping is another option, but follow-up analyses of video data are time consuming. Companies such as Noldus (www.noldus.com) offer complete systems for videotaping interaction and performing post hoc timeline analyses.
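Where custom experimental software is possible, the logging itself is straightforward. The following is a minimal Python sketch, illustrative only: the filename pattern is borrowed from the example above, but the class, its methods, and the simplified sd2 fields are hypothetical and contain fewer columns than the actual files in Figure 5.3.

import time

class TrialLogger:
    """Logs raw keystroke events (sd1) and per-phrase summaries (sd2)."""

    def __init__(self, experiment, participant, device, block, session):
        # Encode participant and condition codes in the filenames,
        # e.g., TextInputHuffman-P01-D99-B06-S01.sd1
        base = f"{experiment}-{participant}-{device}-{block}-{session}"
        self.sd1 = open(base + ".sd1", "w")
        self.sd2 = open(base + ".sd2", "w")
        self.sd2.write("presented,transcribed,time_s,keystrokes\n")  # header line
        self.t0 = None
        self.keystrokes = []

    def key_press(self, key):
        # Timestamp (ms) relative to the first keystroke of the phrase
        now = time.time()
        if self.t0 is None:
            self.t0 = now
        ms = int((now - self.t0) * 1000)
        self.keystrokes.append(key)
        self.sd1.write(f"{ms}\t{key}\n")

    def end_phrase(self, presented, transcribed):
        # One summary line per phrase (cf. the sd2 file in Figure 5.3b)
        elapsed = time.time() - self.t0
        self.sd2.write(f"{presented},{transcribed},{elapsed:.2f},{len(self.keystrokes)}\n")
        self.t0, self.keystrokes = None, []

    def close(self):
        self.sd1.close()
        self.sd2.close()

Whatever form the logging takes, the earlier point stands: pilot test the software and confirm that the files can be parsed and summarized before collecting real data.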

5.6 Other variables

Besides independent and dependent variables, there are three other variables: control, random, and confounding. These receive considerably less attention and are rarely mentioned in research papers. Nevertheless, understanding each is important for experimental research.


5.6.1 Control variables

There are many circumstances or factors that (a) might influence a dependent variable but (b) are not under investigation. These need to be accommodated in some manner. One way is to control them—to treat them as control variables. Examples include room lighting, room temperature, background noise, display size, mouse shape, mouse cursor speed, keyboard angle, chair height, and so on. Mostly, researchers don’t think about these conditions. But they exist and they might influence a dependent variable. Controlling them means that they are fixed at a nominal setting during the experiment so they don’t interfere. But they might interfere if set at an extreme value. If the background noise level is very high or if the room is too cold, these factors might influence the outcome. Allowing such circumstances to exist at a fixed nominal value is typical in experimental research. The circumstances are treated as control variables.

Sometimes it is desirable to control characteristics of the participants. The type of interface or the objectives of the research might necessitate testing participants with certain attributes, for example, right-handed participants, participants with 20/20 vision, or participants with certain experience. Having many control variables reduces the variability in the measured behaviors but yields results that are less generalizable.

5.6.2 Random variables

Instead of controlling all circumstances or factors, some might be allowed to vary randomly. Such circumstances are random variables. There is a cost, since more variability is introduced in the measures, but there is a benefit, since results are more generalizable. Typically, random variables pertain to characteristics of the participants, including biometrics (e.g., height, weight, hand size, grip strength), social disposition (e.g., conscientious, relaxed, nervous), or even genetics (e.g., gender, IQ). Generally, these characteristics are allowed to vary at random.

Before proceeding, it is worth summarizing the trade-off noted above for control and random variables. The comparison is best presented when juxtaposed with the experimental properties of internal validity and external validity, as discussed in the preceding chapter. Figure 5.4 shows the trade-off.

5.6.3 Confounding variables

Any circumstance or condition that changes systematically with an independent variable is a confounding variable. Unlike control or random variables, confounding variables are usually problematic in experimental research. Is the effect observed due to the independent variable or to the confounding variable? Researchers must attune to the possible presence of a confounding variable and eliminate it, adjust for it, or consider it in some way. Otherwise, the effects observed may be incorrectly interpreted.


Random variable
  Advantage: Improves external validity by using a variety of situations and people.
  Disadvantage: Compromises internal validity by introducing additional variability in the measured behaviors.

Control variable
  Advantage: Improves internal validity since variability due to a controlled circumstance is eliminated.
  Disadvantage: Compromises external validity by limiting responses to specific situations and people.

FIGURE 5.4 Relationship between random and control variables and internal and external validity.

As an example, consider an experiment seeking to determine if there is an effect of camera distance on human performance using an eye tracker for computer control. In the experiment, camera distance—the independent variable—has two levels, near and far. For the near condition, a small camera (A) is mounted on a bracket attached to the user’s eye glasses. For the far condition, an expensive eye tracking system is used with the camera (B) positioned above the system’s display. Here, camera is a confounding variable since it varies systematically across the levels of the independent variable: camera A for the near condition and camera B for the far condition. If the experiment shows a significant effect of camera distance on human performance, there is the possibility that the effect has nothing to do with camera distance. Perhaps the effect is simply the result of using one camera for the near condition and a different camera for the far condition. The confound is avoided by using the same camera (and same system) in both the near and far conditions.

Another possibility is simply to rename the independent variable. The new name could be “setup,” with levels “near setup” and “far setup.” The new labels acknowledge that the independent variable encompasses multiple facets of the interface, in this case, camera distance, camera, and system. The distinction is important if for no other reason than to ensure the conclusions speak accurately to the different setups, rather than to camera distance alone.

Confounding variables are sometimes found in Fitts’ law experiments. Most Fitts’ law experiments use a target selection task with movement amplitude (A) and target width (W) as independent variables. Fitts’ original experiment is a typical example. He used a stylus-tapping task with four levels each for movement amplitude (A = 2, 4, 8, and 16 inches) and target width (W = 0.25, 0.5, 1.0, and 2.0 inches) (Fitts, 1954; see also section 7.7.7, Fitts’ law). Fitts went beyond simple target selection, however. He argued by analogy with information theory and electronic communications that A and W are like signal and noise, respectively, and that each task carries information in bits. He proposed an index of difficulty (ID) as a measure in bits of the information content of a task: ID = log2(2A/W). Although the majority of Fitts’ law experiments treat A and W as independent variables, sometimes A and ID are treated as independent variables (e.g., Gan and Hoffmann, 1988). Consider the example in Figure 5.5a. There are two independent variables: A, with levels 16, 32, 64, and 128 pixels, and ID, with levels 1, 2, 3, and 4 bits, yielding 4 × 4 = 16 test conditions (the cells marked with an asterisk in the figure).


(a) Test conditions (*) by ID and A:

ID (bits)      A = 16   A = 32   A = 64   A = 128
    1             *        *        *        *
    2             *        *        *        *
    3             *        *        *        *
    4             *        *        *        *

(b) Target width W (pixels) required in each cell:

ID (bits)      A = 16   A = 32   A = 64   A = 128
    1            16       32       64       128
    2             8       16       32        64
    3             4        8       16        32
    4             2        4        8        16

(c) The same 16 conditions shown by A and W:

W (pixels)     A = 16   A = 32   A = 64   A = 128
    2             *
    4             *        *
    8             *        *        *
   16             *        *        *        *
   32                      *        *        *
   64                               *        *
  128                                        *

FIGURE 5.5 Fitts’ law experiment with A and ID as independent variables: (a) Asterisks mark the test conditions. (b) Numbers show the target width (W) required for each condition, revealing a confounding variable. (c) Same design showing the test conditions by A and W.

To achieve the necessary ID for each A, target width must vary. The top-left cell, for example, requires W = 2A/2^ID = (2 × 16)/2^1 = 16 pixels. The target width for each condition is added in Figure 5.5b. Do you see the confound? As ID increases, W decreases. Target width (W) is a confounding variable. If the experiment reveals a significant effect of ID, is the effect due to ID or to W? (It can be argued that a traditional Fitts’ law design, using A and W as independent variables, is similarly flawed because it contains ID as a confounding variable. However, this is a weak argument: A and W are primitive characteristics of the task, whereas ID is a contrived variable, calculated from A and W.) To further illustrate, Figure 5.5c shows the same design, but reveals the conditions by movement amplitude (A) and target width (W).

As another example, consider an experiment with “interaction technique” as an independent variable with three levels, A, B, and C. Assume, further, that there were 12 participants and all were tested on A, then B, then C. Clearly, performance might improve due to practice. Practice, in this case, is a confounding variable because it changes systematically with interaction technique. Participants had just a little practice for A, a bit more for B, still more for C. If performance was best for C, it would be nice to conclude that C is better than A or B. However, perhaps performance was better simply because participants benefited from practice on A and B prior to testing on C. One way to accommodate practice as a confounding variable is to counterbalance the order of presenting the test conditions to participants (see section 5.11).

Here is another example: Two search engine interfaces are compared, Google versus “new.” If all participants have prior experience with Google but no experience with the new interface, then prior experience is a confounding variable. This might be unavoidable, as it is difficult to find participants without experience with the Google search engine. As long as the effect of prior experience is noted and acknowledged, this is workable. Still, the possibility remains that an observed effect is due to the confound, not to the test conditions.

A similar confound may occur in text entry experiments where, for example, a new keyboard layout is compared with a Qwerty layout. A fair comparison would require participants having the same level of experience with both layouts. But of course it is difficult to find participants unfamiliar with the Qwerty layout. Thus the Qwerty layout is certain to have an advantage, at least initially. In such cases, it is worth considering a longitudinal design, where the layouts are compared over a prolonged period to see if the new keyboard layout has the potential to overcome Qwerty with practice.
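Returning to the Fitts’ law design in Figure 5.5, the confound is easy to expose numerically. The following is a minimal Python sketch, illustrative only; the levels are those shown in the figure, and the code is not from any study described here.

amplitudes = [16, 32, 64, 128]   # A (pixels)
ids = [1, 2, 3, 4]               # ID (bits)

for task_id in ids:
    # From ID = log2(2A/W), the required width is W = 2A / 2^ID
    widths = [2 * a / 2 ** task_id for a in amplitudes]
    print(f"ID = {task_id} bits: W = {widths}")

# As ID increases from 1 to 4 bits, W falls from 16...128 to 2...16 pixels,
# so any effect attributed to ID could equally be an effect of W.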

5.7 Task and procedure

Let’s revisit the definition of an independent variable: “a circumstance or characteristic that is manipulated or systematically controlled to elicit a change in a human response while the user is interacting with a computer.” Emphasis is added to “elicit a change in a human response.” When participants are given a test condition, they are asked to do a task while their performance is measured. Later, they are given a different test condition—another level of the independent variable—and asked to do the task again. Clearly, the choice of task is important.

There are two objectives in designing a good task: represent and discriminate. A good task is representative of the activities people do with the interface. A task that is similar to actual or expected usage will improve the external validity of the research, that is, the ability to generalize results to other people and other situations. A good task is also one that can discriminate the test conditions. Obviously, there is something in the interaction that differentiates the test conditions, otherwise there is no research to conduct. A good task must attune to the points of differentiation in order to elicit behavioral responses that expose benefits or problems among the test conditions. This should surface as a difference in the measured responses across the test conditions. A difference might occur if the interfaces or interaction techniques are sufficiently distinct in the way the task is performed.

Often, the choice of a task is self-evident. If the research idea is a graphical method for inserting functions in a spreadsheet, a good task is inserting functions into a spreadsheet—using the graphical method versus the traditional typing method. If the research idea is an auditory feedback technique while programming


a GPS device, a good task is programming a destination in a GPS device—aided with auditory feedback versus visual feedback.

Making a task representative of actual usage will improve external validity, but there is a downside. The more representative the task, the more the task is likely to include behaviors not directly related to the interface or interaction method under test. Such behaviors are likely to compromise the ability of the task to discriminate among the test conditions. There is nothing sinister in this. It is simply a reflection of the complex way humans go about their business while using computers. When we enter text, we also think about what to enter. We might pause, think, enter something, think again, change our minds, delete something, enter some more, and so on. This is actual usage. If the research goal is to evaluate a new text entry method, a task that mimics actual usage is fraught with problems. Actual usage includes secondary tasks—lots of them. If the task involves, for example, measuring text entry speed in words per minute, the measurement is seriously compromised if tasks unrelated to the entry method are present.

While using a task that is representative of actual usage may improve external validity, the downside is a decrease in internal validity. Recall that high internal validity means the effects observed (i.e., the differences in means on a dependent variable) are due to the test conditions. The additional sources of variation introduced by secondary tasks reduce the likelihood that the differences observed are actually due to, or caused by, the test conditions. The differences may simply be artifacts of the secondary tasks. Furthermore, the additional variation may bring forth a non-significant statistical result. This is unfortunate if indeed there are inherent differences between the test conditions—differences that should have produced a statistically significant outcome.

The best task is one that is natural yet focuses on the core aspects of the interaction: the points of differentiation between the test conditions. Points of similarity, while true to actual usage, introduce variability. Consider two different text entry techniques being compared in an experimental evaluation. If the techniques include the same method of capitalization, then capitalization does not serve to discriminate the techniques and can be excluded from the experimental task. Including capitalization will improve external validity but will also compromise internal validity due to the added variability.

The tasks considered above are mostly performance-based or skill-based. Sometimes an independent variable necessitates using a knowledge-based task. For example, if the research is comparing two search methods, a reasonable task is to locate an item of information in a database or on the Internet (e.g., “Find the date of birth of Albert Einstein.”). Performance is still measured; however, the participant acquires knowledge of the task goal and, therefore, is precluded from further exposure to the same task. This is a problem if the independent variable is assigned within-subjects (discussed below). When the participant is tested with the other search method, the task must be changed (e.g., “Find the date of birth of William Shakespeare.”). This is tricky, since the new task must be more or less the same (so the search methods can be compared), but also different enough so that the participant does not benefit from exposure to the earlier, similar task.


The experimental procedure includes the task but also the instructions, demonstration, or practice given to the participants. The procedure encompasses everything that the participant did or was exposed to. If a questionnaire was administered before or after testing, it is also part of the experimental procedure and deserves due consideration and explanation in the write-up of the experiment.

5.8 Participants

Researchers often assume that their results apply to people who were not tested. Applying results to people other than those who were tested is possible; however, two conditions are required. First, the people actually tested must be members of the same population of people to whom results are assumed to hold. For example, results are unlikely to apply to children if the participants in the experiment were drawn exclusively from the local university campus. Second, a sufficient number of participants must be tested. This requirement has more to do with statistical testing than with the similarity of participants to the population. (Participants drawn from a population are, by definition, similar to the population, since they collectively define the population.) Within any population, or any sample drawn from a population, variability is present. When performance data are gathered on participants, the variability in the measurements affects the likelihood of obtaining statistically significant results. Increasing the number of participants (large n) increases the likelihood of achieving statistically significant results.

In view of the point above, we might ask: How many participants should be used in an experiment? Although the answer might seem peculiar, it goes something like this: use the same number of participants as in similar research (D. W. Martin, 2004, p. 234). Using more participants seems like a good idea, but there is a downside. If there truly is an inherent difference in two conditions, then it is always possible to achieve statistical significance—if enough participants are used. Sometimes the inherent difference is slight, and therein lies the problem. To explain, here is a research question to consider: Is there a speed difference between left-handed and right-handed users in performing point-select tasks using a mouse? There may be a slight difference, but it likely would surface only if a very large number of left- and right-handed participants were tested. Use enough participants and statistically significant results will appear. But the difference may be small and of no practical value. Therein lies the problem of using a large number of participants: statistically significant results for a difference of no practical significance.

The converse is also problematic. If not enough participants are used, statistical significance may fail to appear. There might be a substantial experimental effect, but the variance combined with a small sample size (not enough participants) might prevent statistical significance from appearing.

It is possible to compute the power of statistical tests and thereby determine the number of participants required. The analysis may be done a priori—before an

experiment is conducted. In practice, a priori power analysis is rarely done because it hinges on knowing the variance in a sample before the data are collected. It also requires the researcher to know the size of the experimental effect (the difference in the means on the dependent variable) that is deemed relevant; usually, the researcher simply wants to know if there is a statistically significant difference without committing in advance to a particular difference being of practical significance. The recommendation, again, is to study published research. If an experiment similar to that contemplated reported statistically significant results with 12 participants, then 12 participants is a good choice.

In HCI, we often hear of researchers doing usability evaluation or usability testing. These exercises often seek to assess a prototype system with users to determine problems with the interface. Such evaluations are typically not organized as factorial experiments, so the question of how many participants to use is not relevant in a statistical sense. In usability evaluations, it is known that a small number of participants is sufficient to expose a high percentage of the problems in an interface. There is evidence that about five participants (often usability experts) are sufficient to expose about 80 percent of the usability problems (Lewis, 1994; Nielsen, 1994).

It is worth reflecting on the term participants. When referring specifically to the experiment, use the term participants (e.g., “all participants exhibited a high error rate”). According to the APA guidelines, the term subjects is also acceptable (APA, 2010, p. 73). General comments on the topic or conclusions drawn may use other terms (e.g., “these results suggest that users are less likely to…”).

When recruiting participants, it is important to consider how the participants are selected. Are they solicited by word of mouth, through an e-mail list, using a notice posted on a wall, or through some other means? Ideally, participants are drawn at random from a population. In practice, this is rarely done, in part because of the need to obtain participants that are close by and available. More typically, participants are solicited from a convenient pool of individuals (e.g., members in the workplace, children at a school, or students from the local university campus). Strictly speaking, convenience sampling compromises the external validity of the research, since the true population is somewhat narrower than the desired population.

To help identify the population, participants are typically given a brief questionnaire (discussed shortly) at the beginning or end of the experiment to gather demographic data, such as age and gender. Other information relevant to the research is gathered, such as daily computer usage or experience with certain applications, devices, or products. HCI experiments often require participants with specific skills, so a filtering process may be used to ensure only appropriate participants are used. For example, an experiment investigating a new gaming input device might want a participant pool with specific skills, such as a minimum of 15 hours per week playing computer games. Or perhaps participants without gaming experience are desired. Whatever the case, the selection criteria should be clear and should be stated in the write-up of the methodology, in a section labeled “Participants.”
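Returning to the earlier point about statistical power: for readers who do want an a priori estimate, the calculation itself is straightforward with standard statistical software. The following is a minimal sketch, assuming Python with the statsmodels package and a guessed standardized effect size, which is precisely the quantity that is rarely known in advance.

from statsmodels.stats.power import TTestIndPower

# Hypothetical two-group (between-subjects) comparison: assume a medium
# standardized effect size (Cohen's d = 0.5), alpha = .05, desired power = .80
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))   # roughly 64 participants per group

The size of the answer illustrates the point made above: small effects demand many participants, and whether such effects matter in practice is a separate question.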


Depending on the agency or institution overseeing the research, participants are usually required to sign a consent form prior to testing. The goal is to ensure participants know that their participation is voluntary, that they will incur no physical or psychological harm, that they can withdraw at any time, and that their privacy, anonymity, and confidentiality will be protected.

5.9 Questionnaire design

Questionnaires are a part of most HCI experiments. They have two purposes. One is to gather information on demographics (age, gender, etc.) and experience with related technology. Another is to solicit participants’ opinions on the devices or interaction tasks with which they are tested.

Questionnaires are the primary instrument for survey research, a form of research seeking to solicit a large number of people for their opinions and behaviors on a subject such as politics, spending habits, or use of technology. Such questionnaires are often lengthy, spanning several pages. Questionnaires administered in HCI experiments are usually more modest, taking just a few minutes to complete.

Questions may be posed in several ways, depending on the nature of the information sought and how it is to be used. Let’s look at a few examples. Closed-ended questions are convenient, since they constrain a participant’s response to a small set of options. The following are examples of closed-ended questions:

Do you use a GPS device while driving?   ( ) yes   ( ) no

Which browser do you use?   ( ) Mozilla Firefox   ( ) Google Chrome   ( ) Microsoft IE   ( ) Other ( __________________ )

The question above includes an open-ended category, “Other.” Of course, the entire question could be open-ended, as shown here:

Which browser do you use? _________________

Closed-ended questions simplify follow-on analyses, since it is straightforward to tally counts of responses. It is usually important to know the gender and age of participants, since this helps identify the population. Age can be solicited as an open-ended ratio-scale response, as seen here:

Please indicate your age: ________


Collected in this manner, the mean and standard deviation are easily calculated. Ratio-scale responses are also useful in looking for relationships in data. For example, if the same questionnaire also included a ratio-scale item on the number of text messages sent per day, then it is possible to determine if the responses are correlated (e.g., Is the number of text messages sent per day related to age?).

Age can also be solicited as an ordinal response, as in this example:

Please indicate your age:   ( ) < 20   ( ) 20-29   ( ) 30-39   ( ) 40-49   ( ) 50-59   ( ) 60+

In this case, the counts in each category are tabulated. Such data are particularly useful if there is a large number of respondents. However, ordinal data are inherently lower quality than ratio-scale data, since it is not possible to compute the mean or standard deviation.

Questionnaires are also used at the end of an experiment to obtain participants’ opinions and feelings about the interfaces or interaction techniques. Items are often formatted using a Likert scale (see Figure 4.7) to facilitate summarizing and analyzing the responses. One example is the NASA-TLX (task load index), which assesses perceived workload on six subscales: mental demand, physical demand, temporal demand, performance, effort, and frustration (Hart and Staveland, 1988). A questionnaire item on frustration may be presented as follows:

Frustration: I felt a high level of insecurity, discouragement, irritation, stress, or annoyance.

1 (Strongly disagree)   2   3   4 (Neutral)   5   6   7 (Strongly agree)

The ISO 9241-9 standard for non-keyboard input devices includes a questionnaire with 12 items to assess the comfort and fatigue experienced by participants (ISO 2000). The items are similar to those in the NASA-TLX but are generally directed to interaction with devices such as mice, joysticks, or eye trackers. The items may be tailored according to the device under test. For example, an evaluation of an eye tracker for computer control might include a questionnaire with the following response choices (see also Zhang and MacKenzie, 2007):

Eye fatigue:

1 (Very high)   2   3   4   5   6   7 (Very low)

Note that the preferred response is 7, whereas the preferred response in the NASA-TLX example is 1. In the event the mean is computed over several response items, it is important that the items are consistently constructed.
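One common way to achieve consistency is to reverse-code items whose scales run in the opposite direction before averaging. A minimal sketch follows, using hypothetical responses on a 7-point scale; it is an illustration, not a prescribed procedure.

# Hypothetical responses. For the NASA-TLX style item above, 1 is the
# preferred (low workload) response; for the eye fatigue item, 7 is preferred.
# Reverse-code the latter so both items run in the same direction.
frustration = 3                          # 1 = best, 7 = worst
eye_fatigue = 6                          # 7 = best, 1 = worst

eye_fatigue_reversed = 8 - eye_fatigue   # maps 7 -> 1 and 1 -> 7

mean_rating = (frustration + eye_fatigue_reversed) / 2
print(mean_rating)                       # 2.5, with lower scores better on both items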


5.10 Within-subjects and between-subjects

The administering of test conditions (levels of a factor) is either within-subjects or between-subjects. If each participant is tested on each level, the assignment is within-subjects. Within-subjects is also called repeated measures, because the measurements on each test condition are repeated for each participant. If each participant is tested on only one level, the assignment is between-subjects. For a between-subjects design, a separate group of participants is used for each test condition.

Figure 5.6 provides a simple illustration of the difference between a within-subjects assignment and a between-subjects assignment. The figure assumes a single factor with three levels: A, B, and C. Figure 5.6a shows a within-subjects assignment because each participant is tested on all three levels of the factor (but see section 5.11, Counterbalancing). Figure 5.6b shows a between-subjects assignment, since each participant is tested on only one level of the factor. There are three groups of participants, with two participants in each group.

Clearly, there is a trade-off. For a between-subjects design, each participant is tested on only one level of a factor; therefore, more participants are needed to obtain the same number of observations (Figure 5.6b). For a within-subjects design, each participant is tested on all levels of a factor. Fewer participants are needed; however, more testing is required for each participant (Figure 5.6a). Given this trade-off, it is reasonable to ask: is it better to assign a factor within-subjects or between-subjects? Let’s examine the possibilities.

Sometimes a factor must be between-subjects. For example, if the research is investigating whether males or females are more adept at texting, the experiment probably involves entering text messages on a mobile phone. The independent variable is gender with two levels, male and female. The variable gender is between-subjects. Clearly, there is no choice. A participant cannot be male for half the testing, then female for the other half! Another example is handedness. Research investigating performance differences between left-handed and right-handed users requires a group of left-handed participants and a group of right-handed participants. Handedness, then, is a between-subjects factor. There is no choice.

Sometimes a factor must be within-subjects. The most obvious example is practice, since the acquisition of skill occurs within people, not between people. Practice is usually investigated by testing participants over multiple blocks of trials.

(a) Within-subjects               (b) Between-subjects
Participant   Test condition      Participant   Test condition
     1          A  B  C                1              A
     2          A  B  C                2              A
                                       3              B
                                       4              B
                                       5              C
                                       6              C

FIGURE 5.6 Assigning test conditions to participants: (a) Within-subjects. (b) Between-subjects.


For such designs, block is an independent variable, or factor, and there are multiple levels, such as block 1, block 2, block 3, and so on. Clearly, block is within-subjects since each participant is exposed to multiple blocks of testing. There is no choice.

Sometimes there is a choice. An important trade-off was noted above. That is, a within-subjects design requires fewer participants but requires more testing for each participant. There is a significant advantage to using fewer participants, since recruiting, scheduling, briefing, demonstrating, practicing, and so on are easier if there are fewer participants. Another advantage of a within-subjects design is that the variance due to participants’ predispositions will be approximately the same across test conditions. Predisposition, here, refers to any aspect of a participant’s personality, mental condition, or physical condition that might influence performance. In other words, a participant who is predisposed to be meticulous (or sloppy!) is likely to carry their disposition in the same manner across the test conditions. For a between-subjects design, there are more participants and, therefore, more variability due to inherent differences between participants.

Yet another advantage of within-subjects designs is that it is not necessary to balance groups of participants—because there is only one group! Between-subjects designs include a separate group of participants for each test condition. In this case, balancing is needed to ensure the groups are more or less equal in terms of characteristics that might introduce bias in the measurements. Balancing is typically done through random assignment, but may also be done by explicitly placing participants in groups according to reasonable criteria (e.g., ensuring levels of computer experience are similar among the groups).

Because of the three advantages just cited, experiments in HCI tend to favor within-subjects designs over between-subjects designs. However, there is an advantage to a between-subjects design. Between-subjects designs avoid interference between test conditions. Interference, here, refers to conflict that arises when a participant is exposed to one test condition and then switches to another test condition. As an example, consider an experiment that seeks to measure touch-typing speed with two keyboards. The motor skill acquired while learning to touch type with one keyboard is likely to adversely affect touch typing with the other keyboard. Clearly, participants cannot “unlearn” one condition before testing on another condition. A between-subjects design avoids this because each participant is tested on one, and only one, of the keyboards. If the interference is likely to be minimal, or if it can be mitigated with a few warm-up trials, then the benefit of a between-subjects design is diminished and a within-subjects design is the best choice. In fact, the majority of factors that appear in HCI experiments are like this, so levels of factors tend to be assigned within-subjects. I will say more about interference in the next section.

It is worth noting that in many areas of research, within-subjects designs are rarely used. Research testing new drugs, for example, would not use a within-subjects design because of the potential for interference effects. Between-subjects designs are typically used.
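Returning to the point about balancing through random assignment, the mechanics are simple to script. The following is a minimal sketch with hypothetical participant codes and conditions; it is illustrative only.

import random

participants = [f"P{i:02d}" for i in range(1, 13)]   # P01 ... P12
conditions = ["A", "B", "C"]

random.shuffle(participants)
group_size = len(participants) // len(conditions)
groups = {c: participants[i * group_size:(i + 1) * group_size]
          for i, c in enumerate(conditions)}
print(groups)   # e.g., {'A': ['P07', 'P02', 'P11', 'P05'], ...}

Explicit balancing on a characteristic such as computer experience can be layered on top, for example by shuffling within experience strata before splitting.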


For an experiment with two factors it is possible to assign the levels of one factor within-subjects and the levels of the other factor between-subjects. This is a mixed design. Consider the example of an experiment seeking to compare learning of a text entry method between left-handed and right-handed users. The experiment has two factors: Block is within-subjects with perhaps 10 levels (block 1, block 2 … block 10) and handedness is between-subjects with two levels (left, right).

5.11 Order effects, counterbalancing, and Latin squares

When the levels of a factor (test conditions) are assigned within-subjects, participants are tested with one condition, then another condition, and so on. In such cases, interference between the test conditions may result due to the order of testing, as noted above. In most within-subjects designs, it is possible—in fact, likely—that participants’ performance will improve as they progress from one test condition to the next. Thus participants may perform better on the second condition simply because they benefited from practice on the first. They become familiar with the apparatus and procedure, and they are learning to do the task more effectively. Practice, then, is a confounding variable, because the amount of practice increases systematically from one condition to the next. This is referred to as a practice effect or a learning effect. Although less common in HCI experiments, it is also possible that performance will worsen on conditions that follow other conditions. This may follow from mental or physical fatigue—a fatigue effect. In a general sense, then, the phenomenon is an order effect or sequence effect and may surface either as improved performance or degraded performance, depending on the nature of the task, the inherent properties of the test conditions, and the order of testing conditions in a within-subjects design.

If the goal of the experiment is to compare the test conditions to determine which is better (in terms of performance on a dependent variable), then the confounding influence of practice seriously compromises the comparison. The most common method of compensating for an order effect is to divide participants into groups and administer the conditions in a different order for each group. The compensatory ordering of test conditions to offset practice effects is called counterbalancing.

In the simplest case of a factor with two levels, say, A and B, participants are divided into two groups. If there are 12 participants overall, then Group 1 has 6 participants and Group 2 has 6 participants. Group 1 is tested first on condition A, then on condition B. Group 2 is given the test conditions in the reverse order. This is the simplest case of a Latin square. In general, a Latin square is an n × n table filled with n different symbols (e.g., A, B, C, and so on) positioned such that each symbol occurs exactly once in each row and each column. (The name “Latin” refers to the habit of Swiss mathematician Leonhard Euler, 1707–1783, who used Latin symbols in exploring the properties of multiplication tables.) Some examples of Latin square tables are shown in Figure 5.7. Look carefully and the pattern is easily seen.


(a) 2 × 2:        (b) 3 × 3:        (c) 4 × 4:          (d) 5 × 5:
A B               A B C             A B C D             A B C D E
B A               B C A             B C D A             B C D E A
                  C A B             C D A B             C D E A B
                                    D A B C             D E A B C
                                                        E A B C D

FIGURE 5.7 Latin squares: (a) 2 × 2. (b) 3 × 3. (c) 4 × 4. (d) 5 × 5.

(a) 4 × 4:        (b) 6 × 6:
A B D C           A B F C E D
B C A D           B C A D F E
C D B A           C D B E A F
D A C B           D E C F B A
                  E F D A C B
                  F A E B D C

FIGURE 5.8 Balanced Latin squares where each condition precedes and follows other conditions an equal number of times: (a) 4 × 4. (b) 6 × 6.

The first column is in order, starting at A. Entries in the rows are in order, with wrap around. A deficiency in Latin squares of order 3 and higher is that conditions precede and follow other conditions an unequal number of times. In the 4 × 4 Latin square, for example, B follows A three times, but A follows B only once. Thus an A-B sequence effect, if present, is not fully compensated for. A solution to this is a balanced Latin square, which can be constructed for even-order tables. Figure 5.8 shows 4 × 4 and 6 × 6 balanced Latin squares. The pattern is a bit peculiar. The first column is in order, starting at A. The top row has the sequence A, B, n, C, n−1, D, n−2, etc. Entries in the second and subsequent columns are in order, with wrap around.

When designing a within-subjects counterbalanced experiment, the number of levels of the factor must divide equally into the number of participants. If a factor has three levels, then the experiment requires a multiple of 3 participants, for example, 9, 12, or 15 participants. If there are 12 participants, then there are three groups with 4 participants per group. The conditions are assigned to Group 1 in order ABC, to Group 2 in order BCA, and to Group 3 in order CAB (see Figure 5.7b).

Let’s explore this design with a hypothetical example. An experimenter seeks to determine if three editing methods (A, B, C) differ in the time required for common editing tasks. For the evaluation, the following task is used (this is the same as task T1 described by Card, Moran, and Newell in an experiment to validate the keystroke-level model, KLM; Card et al., 1980):

Replace one 5-letter word with another, starting one line away.


               Test condition
Participant      A       B       C
 1             12.98   16.91   12.19
 2             14.84   16.03   14.01
 3             16.74   15.15   15.19
 4             16.59   14.43   11.12
 5             18.37   13.16   10.72
 6             15.17   13.09   12.83
 7             14.68   17.66   15.26
 8             16.01   17.04   11.14
 9             14.83   12.89   14.37
10             14.37   13.98   12.91
11             14.40   19.12   11.59
12             13.70   16.17   14.31
Mean           15.2    15.5    13.0
SD              1.48    2.01    1.63

Group   Mean   SD
  1     14.7   1.84
  2     14.6   2.46
  3     14.4   1.88

FIGURE 5.9 Hypothetical data for an experiment with one within-subjects factor having three levels (A, B, C). Values are the mean task completion time (s) for five repetitions of an editing task.

The following three editing methods are compared (descriptions are approximate):

Method A: arrow keys, backspace, type
Method B: search and replace dialog
Method C: point and double click with the mouse, type

Twelve participants are recruited. To counterbalance for learning effects, participants are divided into three groups with the conditions administered according to a Latin square (see Figure 5.7b). Each participant does the task five times with one editing method, then again with the second editing method, then again with the third. The mean task completion time for each participant using each editing method is tabulated (see Figure 5.9). Overall means and standard deviations are also shown for each editing method and for each group. Note that the left-to-right order of the test conditions in the figure applies only to Group 1. The order for Group 2 was BCA and for Group 3 CAB (see Figure 5.7b).

At 13.0 s, the mouse method (C) was fastest. The arrow-key method (A) was 17.4 percent slower at 15.2 s, while the search-and-replace method (B) was 19.3 percent slower at 15.5 s. (Testing for statistical significance in the differences is discussed in the next chapter.) Evidently, counterbalancing worked, as the group means are very close, within 0.3 s. The tabulated data in Figure 5.9 are not typically provided in a research report. More likely, the results are presented in a chart, similar to that in Figure 5.10.

Although counterbalancing worked in the above hypothetical example, there is a potential problem for the 3 × 3 Latin square. Note in Figure 5.7b that B follows A twice, but A follows B only once. So there is an imbalance. This cannot be avoided in Latin squares with an odd number of conditions. One solution in this case is to counterbalance by using all sequences. The 3 × 3 case is shown in Figure 5.11.


FIGURE 5.10 Task completion time (s) by editing method for the data in Figure 5.9. Error bars show ±1 SD.

A B C
A C B
B C A
B A C
C A B
C B A

FIGURE 5.11 Counterbalancing an odd number of conditions using all (n!) combinations.

There are 3! = 6 combinations. Balancing is complete (e.g., B follows A three times, A follows B three times). MacKenzie and Isokoski (2008) used such an arrangement in an experiment with 18 participants, assigning 3 participants to each order. Yet another way to offset learning effects is to randomize the order of conditions. This is most appropriate where (a) the task is very brief, (b) there are many repetitions of the task, and (c) there are many test conditions. For example, experiments that use point-select tasks often include movement direction, movement distance, or target size as factors (Figure 5.12). The test conditions in Figure 5.12 might appear as factors in an experiment even though the experiment is primarily directed at something else. For example, research comparing the performance of different pointing devices might include device as a factor with, say, three levels (mouse, trackball, stylus). Movement direction, movement distance, and target size might be varied to ensure the tasks cover a typical range of conditions. Treating these conditions as factors ensures they are handled in a systematic manner. To ensure equal treatment, the conditions are chosen at random without replacement. Once all conditions have been used, the process may repeat if multiple blocks of trials are desired.


FIGURE 5.12 Test conditions suitable for random assignment: (a) Movement direction. (b) Movement distance. (c) Target size.
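Before moving on, note that the orderings discussed in this section are easy to generate in software rather than by hand. The following is a minimal sketch following the construction rule described above for Figure 5.8; the helper function is hypothetical and illustrative only.

from itertools import permutations

def balanced_latin_square(n):
    """Rows are presentation orders; the rule works as described for even n."""
    # Top row: A, B, n, C, n-1, D, n-2, ... (0-based indices here)
    top = [0, 1]
    lo, hi = 2, n - 1
    while len(top) < n:
        top.append(hi)
        hi -= 1
        if len(top) < n:
            top.append(lo)
            lo += 1
    # First column in order; each later row adds 1 to every entry, with wraparound
    return [[(c + r) % n for c in top] for r in range(n)]

letters = "ABCDEF"
for row in balanced_latin_square(4):
    print(" ".join(letters[i] for i in row))
# A B D C / B C A D / C D B A / D A C B, as in Figure 5.8a

# For an odd number of conditions, one option is all n! orders (Figure 5.11):
print([" ".join(p) for p in permutations("ABC")])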

5.12 Group effects and asymmetric skill transfer

If the learning effect is the same from condition to condition in a within-subjects design, then the group means on a dependent variable should be approximately equal. (There is likely some difference, but the difference should not be statistically significant.) This was demonstrated above (see Figure 5.9). In other words, the advantage due to practice for a condition tested later in the experiment is offset equally by the disadvantage when the same condition is tested earlier in the experiment. That’s the point of counterbalancing. However, there are occasions where different effects appear for one order (e.g., A→B) compared to another (e.g., B→A). In such cases there may be a group effect—differences across groups in the mean scores on a dependent variable. When this occurs, it is a problem. In essence, counterbalancing did not work. A group effect is typically due to asymmetric skill transfer—differences in the amount of improvement, depending on the order of testing.

We could develop an example of asymmetric skill transfer with hypothetical data, as with the counterbalancing example above; however, there is an example data set in a research report where an asymmetric transfer effect is evident. The example provides a nice visualization of the effect, plus an opportunity to understand why asymmetric skill transfer occurs. So we’ll use that data. The experiment compared two types of scanning keyboards for text entry (Koester and Levine, 1994a). Scanning keyboards use an on-screen virtual keyboard and a single key or switch for input. Rows of keys are highlighted one by one (scanned). When the row bearing the desired letter is highlighted, it is selected. Scanning then enters the row and advances key by key, left to right. When the key bearing the desired letter is highlighted, it is selected and the letter is added to the text message. Scanning keyboards provide a convenient text entry method for many users with a physical disability.


(a) (Images of the letters-only and letters plus word prediction keyboards; not reproduced here.)

(b) Entry speed (cpm) by group, testing half, and keyboard:

Group 1 (LO first, then L + WP)
  First half (trials 1-10), LO:        20.42  22.68  23.41  25.22  26.62  28.82  30.38  31.66  32.11  34.31
  Second half (trials 11-20), L + WP:  27.12  28.39  32.50  32.12  35.94  37.66  39.07  35.64  42.76  41.06

Group 2 (L + WP first, then LO)
  First half (trials 1-10), L + WP:    19.47  19.42  22.05  23.03  24.82  26.53  28.59  26.78  31.09  31.07
  Second half (trials 11-20), LO:      24.97  27.27  29.34  31.45  33.46  33.08  34.30  35.82  36.57  37.43

FIGURE 5.13 Experiment comparing two scanning keyboards: (a) Letters-only keyboard (LO) and letters plus word prediction keyboard (L + WP). (b) Results for entry speed in characters per minute (cpm); the keyboard used in each testing half is indicated for each group.

The experiment compared a letters-only (LO) scanning keyboard with a similar keyboard that added word prediction (L + WP). The keyboards are shown in Figure 5.13a. Six participants entered 20 phrases of text, 10 with one keyboard, followed by 10 with the other. To compensate for learning effects, counterbalancing was used. Participants were divided into two groups. Group 1 entered text with the LO keyboard first, then with the L + WP keyboard. Group 2 used the keyboards in the reverse order.

Although not usually provided in a report, the results were given in a table showing the entry speed in characters per minute (cpm). The data are reproduced in Figure 5.13b (Koester and Levine, 1994a, Table 2). The two columns of values for each group show the sequence of testing, first half then second half, with the keyboard used in each half indicated, thus revealing the counterbalanced order.

There are at least three ways to summarize the data in Figure 5.13b. The overall result showing the difference between the LO and L + WP keyboards is shown in the left-side chart in Figure 5.14. Clearly, there was very little difference between the two keyboards: 30.0 cpm for the LO keyboard versus 30.3 cpm for the L + WP keyboard. The L + WP keyboard was just 1 percent faster. The error bars are large, mostly due to the improvement from trial to trial, as seen in Figure 5.13b.


FIGURE 5.14 Three ways to summarize the results in Figure 5.13b, by keyboard (left), by testing half (center), and by group (right). Error bars show ±1 SD.

FIGURE 5.15 Demonstration of asymmetric skill transfer. The chart uses the data in Figure 5.13b.

The center chart in Figure 5.14 shows another view of the results, comparing the first half and second half of testing. A learning effect is clearly seen. The overall entry speed was 26.4 cpm in the first half of testing (trials 1 to 10) and 33.8 cpm, or 28 percent higher, in the second half of testing (trials 11 to 20). Learning is fully expected, so this result is not surprising.

Now consider the right-side chart in Figure 5.14. Counterbalancing only works if the order effects are the same or similar. This implies that the performance benefit of an LO→L + WP order is the same as the performance benefit of an L + WP→LO order. If so, the group means will be approximately equal. (This was demonstrated earlier in the counterbalancing example; see Figure 5.9.) The right-side chart in Figure 5.14 reveals a different story. The mean for Group 1 was 31.4 cpm. The mean for Group 2 was lower at 28.8 cpm. For some reason, there was an 8 percent performance disadvantage for Group 2. This is an example of asymmetric skill transfer. Figure 5.15 illustrates. The figure reduces the data in Figure 5.13b to four points, one for each quadrant of 10 trials. Asymmetry is clearly seen in the cross-over of


the lines connecting the LO points and L + WP points between the first half and second half of testing. If counterbalancing had worked, the lines in Figure 5.15 would be approximately parallel. They are not parallel because of the asymmetry in the LO→L + WP order versus the L + WP→LO order.

Asymmetric skill transfer is usually explainable by considering the test conditions or the experimental procedure. For this experiment, the effect occurs because of the inherent differences in entering text with the letters-only (LO) keyboard versus entering text with the letters plus word prediction (L + WP) keyboard. In fact, this example provides an excellent opportunity to understand why asymmetric skill transfer sometimes occurs. Here is the explanation. The L + WP keyboard is an enhanced version of the LO keyboard. The basic method of entering letters is the same with both keyboards; however, the L + WP keyboard adds word prediction, allowing words to be entered before all letters in the word are entered. It is very likely that entering text first with the LO keyboard served as excellent practice for the more difficult subsequent task of entering text with the L + WP keyboard. To appreciate this, examine the two points labeled Group 1 in Figure 5.15. Group 1 participants performed better overall because they were tested initially with the easier LO keyboard before moving on to the enhanced L + WP keyboard. Group 2 participants fared less well because they were tested initially on the more difficult L + WP keyboard.

The simplest way to avoid asymmetric skill transfer is to use a between-subjects design. Clearly, if participants are exposed to only one test condition, they cannot experience skill transfer from another test condition. There are other possibilities, such as having participants practice on a condition prior to data collection. The practice trials seek to overcome the benefit of practice in the earlier condition, so that the measured performance accurately reflects the inherent properties of the test condition. It is not clear that this would work in the example, since participants cannot “unlearn.” In the end, the performance difference between the LO and L + WP keyboards remains an outstanding research question. The practice effect (28%) was much greater than the group effect (8%), so it is difficult to say whether word prediction in the L + WP keyboard offers a performance advantage. Clearly, there is a benefit with the L + WP keyboard, because words can be entered before all the letters are entered. However, there is also a cost, since users must attend to the ongoing prediction process, and this slows entry. To determine whether the costs outweigh the benefits in the long run, a longitudinal study is required. This is examined in the next section.
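The group-effect check just described is simple arithmetic, and it is worth automating when analyzing counterbalanced data. The following is a minimal sketch using the four quadrant means computed from the data in Figure 5.13b; the code itself is illustrative only.

# Mean entry speed (cpm) in each quadrant of the counterbalanced design
means = {
    ("Group 1", "first half"):  27.56,   # LO
    ("Group 1", "second half"): 35.23,   # L + WP
    ("Group 2", "first half"):  25.29,   # L + WP
    ("Group 2", "second half"): 32.37,   # LO
}

def avg(*values):
    return sum(values) / len(values)

by_keyboard = {
    "LO":     avg(means[("Group 1", "first half")], means[("Group 2", "second half")]),
    "L + WP": avg(means[("Group 1", "second half")], means[("Group 2", "first half")]),
}
by_half = {
    "first":  avg(means[("Group 1", "first half")], means[("Group 2", "first half")]),
    "second": avg(means[("Group 1", "second half")], means[("Group 2", "second half")]),
}
by_group = {
    "Group 1": avg(means[("Group 1", "first half")], means[("Group 1", "second half")]),
    "Group 2": avg(means[("Group 2", "first half")], means[("Group 2", "second half")]),
}
print(by_keyboard, by_half, by_group, sep="\n")
# Unequal group means (about 31.4 vs. 28.8 cpm) flag an asymmetric transfer effect.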

5.13 Longitudinal studies

The preceding discussion focused on the confounding influence of learning in experiments where an independent variable is assigned within-subjects. Learning effects—more generally, order effects—are problematic and must be accommodated in some way, such as counterbalancing. However, sometimes the research has a particular interest in learning, or the acquisition of skill.


FIGURE 5.16 Example of a longitudinal study. Two text entry methods were tested and compared over 20 sessions of input. Each session involved about 30 minutes of text entry.

In this case, the experimental procedure involves testing users over a prolonged period while their improvement in performance is measured. Instead of eliminating learning, the research seeks to observe it and measure it. An experimental evaluation where participants practice over a prolonged period is called a longitudinal study. In a longitudinal study, “amount of practice” is an independent variable. Participants perform the task over multiple units of testing while their improvement with practice is observed and measured. Each unit of testing is a level of the independent variable. Various names are used for the independent variable, but a typical example is Session, with levels Session 1, Session 2, Session 3, and so on.

An example is an experiment comparing two text entry methods for mobile phones: multi-tap and LetterWise (MacKenzie, Kober, Smith, Jones, and Skepner, 2001). For English text entry, LetterWise requires an average of 44 percent fewer keystrokes than does multi-tap. However, a performance benefit might not appear immediately, since users must learn the technique. Furthermore, learning occurs with both methods, as participants become familiar with the experimental procedure and task. However, it was felt that the reduction in keystrokes with LetterWise would eventually produce higher text entry speeds. To test this, a longitudinal study was conducted, with entry method assigned between-subjects. The results are shown in Figure 5.16. Indeed, the conjectured improvement with practice was observed. Initial entry speeds were about 7.3 wpm for both methods in Session 1. With practice, both methods improved; however, the improvement was greater with LetterWise because of the ability to produce English text with fewer keystrokes on average. By Session 20, text entry speed with LetterWise was 21.0 wpm, about 36 percent higher than the rate of 15.5 wpm for multi-tap.
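Trends like those in Figure 5.16 are typically summarized by fitting the power law of learning, y = a × x^b (discussed below and in Chapter 7). The following is a minimal sketch of such a fit; the session means are made up for illustration and are not the data from the study above.

import numpy as np

sessions = np.arange(1, 21)
# Hypothetical entry speeds (wpm) improving with practice
speeds = np.array([7.3, 9.0, 10.2, 11.1, 11.9, 12.6, 13.2, 13.7, 14.2, 14.6,
                   15.0, 15.4, 15.7, 16.0, 16.3, 16.6, 16.9, 17.1, 17.4, 17.6])

# Fit y = a * x^b by linear regression in log-log coordinates
b, log_a = np.polyfit(np.log(sessions), np.log(speeds), 1)
a = np.exp(log_a)
print(f"y = {a:.2f} * x^{b:.2f}")   # for these values, close to y = 7.3 * x^0.30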


FIGURE 5.17 Crossover point. With practice, human performance with a new interaction technique may eventually exceed human performance using a current technique. (From MacKenzie and Zhang, 1999)

Performance trends in longitudinal studies, as shown in Figure 5.16, are often accompanied by an equation and best-fitting curve demonstrating the power law of learning. Examples are given in Chapter 7, section 7.2.5 (Skill Acquisition).

In many situations, the goal of a longitudinal study is to compare the viability of a new technique against current practice. Here, current practice is any conventional interaction that is quantifiable using a performance measure. Examples include text entry, editing, pointing, selecting, searching, panning, zooming, rotating, drawing, scrolling, menu access, and so on. If users are experienced with a current interaction technique, then relatively poorer initial performance is expected with the new technique. But as learning progresses, the performance trends may eventually cross over, wherein performance with the new technique exceeds that with current practice. This is illustrated in Figure 5.17.

As an example, consider the ubiquitous Qwerty keyboard. Although improved designs have been proposed, users experienced with a Qwerty keyboard are unlikely to demonstrate an immediate improvement in performance with an alternative design. Considerable practice may be required before performance on the new keyboard exceeds that with the Qwerty keyboard. The Dvorak simplified keyboard (DSK), for example, has been demonstrated in longitudinal studies to provide a speed advantage over Qwerty (see Noyes, 1983 for a review). Yet Qwerty remains the dominant layout for computer keyboards.

From a practical standpoint, learning a new technique bears a cost, since performance is initially superior with the current technique. However, after the crossover point is reached, the new technique provides a benefit, since performance is superior compared to current practice. The cost-benefit trade-off is shown in Figure 5.18. Despite the long-term benefits evident in Figure 5.18, new technologies often languish in the margins while established but less-optimal designs continue to dominate the marketplace. Evidently, the benefits are often insufficient to overcome the costs.


FIGURE 5.18 Cost-benefit progression in learning a new interaction technique where there is existing skill with current practice.

With respect to the Qwerty debate, there are two such costs. One is the cost of manufacturing and retooling. Keyboards are electro-mechanical devices, so new designs require ground-up re-engineering and new manufacturing materials and procedures. This is expensive. The other cost lies in overcoming user perceptions and attitudes. By and large, users are change-averse: they are reluctant to give up habits they have acquired and are comfortable with. Simply put, users are "unwilling to change to a new keyboard layout because of the retraining required" (Noyes, 1983, p. 278).

One interesting example is a soft or virtual keyboard, as commonly used on touchscreen phones or personal digital assistants (PDAs). Input is typically with a finger or stylus. Most such keyboards use the Qwerty letter arrangement. However, since the keyboard is created in software, there is no retooling cost associated with an alternative design. Thus there is arguably a better chance for an optimized design to enter the marketplace.

One idea for increasing text entry speed is to rearrange the letters, with common letters clustered near the center of the layout and less common letters pushed to the perimeter. The increase in speed results from the reduction in finger or stylus movement. However, since users are unfamiliar with the optimized letter arrangement, performance is initially poor while they get accustomed to it. If learning the new technique is likely to take several hours or more, then the evaluation requires a longitudinal study, where users are tested over multiple sessions of input. Eventually, the crossover point may appear. This idea is explored further in Chapter 7, section 7.2.5 (Skill Acquisition).
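The movement-reduction argument can be made concrete with a toy calculation that weights the distance between key centres by how often each letter pair occurs. The layouts and digraph frequencies below are hypothetical values chosen only to illustrate the idea; a real layout optimization would use a full letter-pair frequency table and a movement-time model such as Fitts' law.

# A rough sketch of the movement-cost argument for optimized soft keyboards:
# weight the distance between key centres by how often each letter pair occurs.
# The three-pair frequency table and the two layouts are toy values, not a real
# optimization; they only show why clustering common letters reduces travel.
from math import dist

digraph_freq = {("t", "h"): 0.30, ("h", "e"): 0.25, ("e", "q"): 0.02}  # hypothetical

# Key centres in arbitrary units: one layout clusters the common letters.
clustered = {"t": (0, 0), "h": (1, 0), "e": (2, 0), "q": (5, 2)}
scattered = {"t": (0, 0), "h": (4, 2), "e": (1, 3), "q": (2, 0)}

def average_travel(layout, freqs):
    """Frequency-weighted average distance moved between successive letters."""
    total = sum(freqs.values())
    return sum(f * dist(layout[a], layout[b]) for (a, b), f in freqs.items()) / total

print("Clustered layout:", round(average_travel(clustered, digraph_freq), 2))
print("Scattered layout:", round(average_travel(scattered, digraph_freq), 2))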

5.14 Running the experiment

When the experiment is designed, the apparatus built and tested, and the participants recruited and scheduled, testing begins. But wait! Are you sure the time to begin has arrived? It is always useful to have a pilot test (yes, one more pilot test) with one or two participants. This will help smooth out the protocol for briefing and preparing the participants. It will also serve as a check on the amount of time needed for each participant. If the testing is scheduled for one hour, it is important that the testing, together with briefing, practicing, and so on, fits comfortably into one hour. A final tweak to the protocol may be necessary. Better now than to have regrets later.

So the experiment begins. The experimenter greets each participant, introduces the experiment, and usually asks the participant to sign a consent form. Often, a brief questionnaire is administered to gather demographic data and information on the participants' related experience. This should take just a few minutes. The apparatus is then revealed, and the task is explained and demonstrated. Practice trials are allowed, as appropriate.

An important aspect of the experiment is the instructions given to participants. Of course, the instructions depend on the nature of the experiment and the task. For most interaction tasks, the participant is expected to proceed quickly and accurately. These terms, quickly and accurately, are subject to interpretation, as well as to the capabilities of participants. What is reasonably quick for one participant may be unattainable for another. Performing tasks reasonably quickly and with high accuracy, at a rate comfortable to the individual, is usually the goal. Whatever the case, the instructions must be carefully considered and must be given to all participants in the same manner. If a participant asks for clarification, caution must be exercised in elaborating on the instructions. Any additional explanation that might motivate a participant to act differently from other participants is to be avoided.

The experimenter plays a special role as the public face of the experiment. It is important that the experimenter remain neutral. Participants should not feel they are under pressure to produce a specific outcome, and they should not deliberately attempt to perform better on one test condition than on another. Nor should participants sense a particular attitude in the experimenter. An overly attentive experimenter may make the participant nervous; conversely, an experimenter who conveys indifference or disregard may lead the participant to care little about his or her own performance. A neutral manner is preferred.

STUDENT EXERCISES

5-1. It was noted above that independent variables in research papers are sometimes identified without being given a name. Review some experimental research papers in HCI and find three examples of this. For each, propose a name for the independent variable and show how the paper could be improved by properly identifying both the name of the variable and its levels. Examine how the independent variables and their levels (the test conditions) were referred to in the paper, and point out any inconsistencies.


HEURISTIC EVALUATION OF USER INTERFACES

Jakob Nielsen
Technical University of Denmark, Department of Computer Science, DK-2800 Lyngby Copenhagen, Denmark
datJN@NEUVM1.bitnet

Rolf Molich
Baltica A/S, Mail Code B22, Klausdalsbrovej 601, DK-2750 Ballerup, Denmark

CHI '90 Proceedings, April 1990

ABSTRACT

Heuristic evaluation is an informal method of usability analysis where a number of evaluators are presented with an interface design and asked to comment on it. Four experiments showed that individual evaluators were mostly quite bad at doing such heuristic evaluations and that they only found between 20 and 51% of the usability problems in the interfaces they evaluated. On the other hand, we could aggregate the evaluations from several evaluators to a single evaluation and such aggregates do rather well, even when they consist of only three to five people.

KEYWORDS: Usability evaluation, early evaluation, usability engineering, practical methods.

INTRODUCTION

There are basically four ways to evaluate a user interface: Formally by some analysis technique, automatically by a computerized procedure, empirically by experiments with test users, and heuristically by simply looking at the interface and passing judgement according to one's own opinion. Formal analysis models are currently the object of extensive research but they have not reached the stage where they can be generally applied in real software development projects. Automatic evaluation is completely infeasible except for a few very primitive checks. Therefore current practice is to do empirical evaluations if one wants a good and thorough evaluation of a user interface. Unfortunately, in most practical situations, people actually do not conduct empirical evaluations because they lack the time, expertise, inclination, or simply the tradition to do so. For example, Milsted et al. [1989] found that only 6% of Danish companies doing software development projects used the thinking aloud method and that nobody used any other empirical or formal evaluation methods.

In real life, most user interface evaluations are heuristic evaluations but almost nothing is known about this kind of evaluation since it has been seen as inferior by most researchers. We believe, however, that a good strategy for improving usability in most industrial situations is to study those usability methods which are likely to see practical use [Nielsen 1989]. Therefore we have conducted the series of experiments on heuristic evaluation reported in this paper.
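To see why aggregation helps, it is useful to work through a simple independence model (an illustration, not from the paper itself): if each evaluator detects each problem independently with probability p, an aggregate of k evaluators detects it with probability 1 - (1 - p)^k. The sketch below evaluates this for detection rates spanning the 20 to 51% range reported in the abstract; the specific p values are hypothetical.

# Illustrative independence model of aggregating evaluators (not from the paper).
def aggregate_detection_rate(p: float, k: int) -> float:
    """Expected proportion of problems found by at least one of k independent evaluators."""
    return 1.0 - (1.0 - p) ** k

for p in (0.20, 0.35, 0.51):  # individual detection rates (hypothetical)
    summary = ", ".join(f"{k} evaluators: {aggregate_detection_rate(p, k):.0%}" for k in (1, 3, 5))
    print(f"individual rate {p:.0%} -> {summary}")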


HEURISTIC EVALUATION

As mentioned in the introduction, heuristic evaluation is done by looking at an interface and trying to come up with an opinion about what is good and bad about the interface. Ideally people would conduct such evaluations according to certain rules, such as those listed in typical guidelines documents. Current collections of usability guidelines [Smith and Mosier 1986] have on the order of one thousand rules to follow, however, and are therefore seen as intimidating by developers. Most people probably perform heuristic evaluation on the basis of their own intuition and common sense instead. We have tried cutting the complexity of the rule base by two orders of magnitude by relying on a small set of heuristics such as the nine basic usability principles from [Molich and Nielsen 1990] listed in Table 1. Such smaller sets of principles seem more suited as the basis for practical heuristic evaluation. Actually the use of very complete and detailed guidelines as checklists for evaluations might be considered a formalism, especially when they take the form of interface standards.

Simple and natural dialogue
Speak the user's language
Minimize user memory load
Be consistent
Provide feedback
Provide clearly marked exits
Provide shortcuts
Good error messages
Prevent errors

Table 1. Nine usability heuristics (discussed further in [Molich and Nielsen 1990]).

We have developed this specific list of heuristics during several years of experience with teaching and consulting about usability engineering [Nielsen and Molich 1989]. The nine heuristics can be presented in a single lecture and explain a very large proportion of the problems one observes in user interface designs. These nine principles correspond more or less to principles which are generally recognized in the user interface community, and most people might think that they were "obvious" if it was not because the results in the following sections of this paper show that they are difficult to apply in practice. The reader is referred to [Molich and Nielsen 1990] for a more detailed explanation of each of the nine heuristics.

EMPIRICAL TEST OF HEURISTIC EVALUATION

To test the practical applicability of heuristic evaluation, we conducted four experiments where people who were not usability experts analyzed a user interface heuristically. The basic method was the same in all four experiments: The evaluators ("subjects") were given a user interface design and asked to write a report pointing out the usability problems in the interface as precisely as possible. Each report was then scored for the usability problems that were mentioned in it. The scoring was done by matching with a list of usability problems developed by the authors. Actually, our lists of usability problems had to be modified after we had made an initial pass through the reports, since our evaluators in each experiment discovered some problems which we had not originally identified ourselves. This shows that even usability experts are not perfect in doing heuristic evaluations.

Scoring was liberal to the extent that credit was given for the mentioning of a usability problem even if it was not described completely.

Table 2 gives a short summary of the four experiments which are described further in the following.

Table 2. Summary of the four experiments.

Experiment 1: Teledata

Experiment 1 tested the user interface to the Danish videotex system, Teledata. The evaluators were given a set of ten screen dumps from the general search system and from the Scandinavian Airlines (SAS) subsystem. This means that the evaluators did not have access to a "live" system, but in many situations it is realistic to want to conduct a usability evaluation in the specification stage of a software development process where no running system is yet available.

The evaluators were 37 computer science students who were taking a class in user interface design and had had a lecture on our evaluation heuristics before the experiment. The interface contained a total of 52 known usability problems.

Experiment 2: Mantel

For experiment 2 we used a design which was constructed for the purpose of the test. Again the evaluators had access only to a written specification and not to a running system. The system was a design for a small information system which a telephone company would make available to its customers to dial in via their modems to find the name and address of the subscriber having a given telephone number. This system was called "Mantel" as an abbreviation of our hypothetical telephone company, Manhattan Telephone (neither the company nor the system has any relation to any existing company or system). The entire system design consisted of a single screen and a few system messages so that the specification could be contained on a single page. The design document used for this experiment is reprinted as an appendix to [Molich and Nielsen 1990] which also gives a complete list and in-depth explanation of the 30 known usability problems in the Mantel design.

The evaluators were readers of the Danish Computerworld magazine where our design was printed as an exercise in a contest. 77 solutions were mailed in, mostly written by industrial computer professionals. Our main reason for conducting this experiment was to ensure that we had data from real computer professionals and not just from students. We should note that these evaluators did not have the (potential) benefit of having attended our lecture on the usability heuristics.

Experiments 3 and 4: Two Voice Response Systems: "Savings" and "Transport"

Experiments 3 and 4 were conducted to get data from heuristic evaluations of "live" systems (as opposed to the specification-only designs in experiments 1 and 2). Both experiments were done with the same group of 34 computer science students as evaluators. Again, the students were taking a course in user interface design and were given a lecture on our usability heuristics, but there was no overlap between the group of evaluators in these experiments and the group from experiment 1. Both interfaces were "voice response" systems where users would dial up an information system from a touch tone telephone and interact with the system by pushing buttons on the 12-key keypad. The first system was run by a large Savings Union to give their customers information about their account balance, current foreign currency exchange rates, etc. This interface is referred to as the "Savings" design in this article and it contained a total of 48 known usability problems. The second system was used by the municipal public transportation company in Copenhagen to provide commuters with information about bus routes. This interface is referred to as the "Transport" design and had a total of 34 known usability problems.

There were four usability problems which were related to inconsistency across the two voice response systems. Since the two systems are aimed at the same user population in the form of the average citizen and since they are accessed through the same terminal equipment, it would improve their collective usability if they both used the same conventions. Unfortunately there are differences, such as the use of the square key. In the Savings system, it is an end-of-command control character, while it is a command key for the "return to the main menu" command in the Transport system, which does not use an end-of-command key at all. The four shared inconsistency problems have been included in the count of usability problems for both systems.

Since the same evaluators were used for both voice response experiments, we can compare the performance of the individual evaluators. In this comparison, we have excluded the four consistency problems discussed above which are shared among the two systems. A regression analysis of the two sets of evaluations is shown in Figure 1 and indicates a very weak correlation between the performance of the evaluators in the two experiments (R2=0.33, p