Handbook of Employee Selection [2 ed.] 9781138915190, 9781138915497, 9781315690193

Table of contents :
Cover
Title
Copyright
Contents
Preface
About the Editors
Contributors
Part I Foundations of Psychological Measurement and Evaluation Applied to Employee Selection
Chapter 1 Reliability
Chapter 2 Validation Strategies for Primary Studies
Chapter 3 Validity Considerations in the Design and Implementation of Selection Systems
Chapter 4 Situational Specificity, Validity Generalization, and the Future of Psychometric Meta-analysis
Chapter 5 Strategy, Selection, and Sustained Competitive Advantage
Chapter 6 Work Analysis
Part II Implementation and Management of Employee Selection Systems in Work Organizations
Chapter 7 Attracting Job Candidates to Organizations
Chapter 8 Test Administration and the Use of Test Scores
Chapter 9 Managing Sustainable Selection Programs
Chapter 10 The Business Value of Employee Selection
Part III Categories of Individual Difference Constructs for Employee Selection
Chapter 11 Cognitive Ability: Measurement and Validity for Employee Selection
Chapter 12 Physical Performance Tests
Chapter 13 Personality: Its Measurement and Validity for Employee Selection
Chapter 14 Values, Styles, and Motivational Constructs
Chapter 15 Practical Intelligence, Emotional Intelligence, and Social Intelligence
Part IV Decisions in Developing, Selecting, Using, and Evaluating Predictors
Chapter 16 Decisions in the Operational Use of Employee Selection Procedures: Choosing, Evaluating, and Administering Assessment Tools
Chapter 17 The Sum of the Parts: Methods of Combining Assessments for Employment Decisions
Chapter 18 Choosing a Psychological Assessment: Reliability, Validity, and More
Chapter 19 Assessment Feedback
Part V Criterion Constructs in Employee Selection
Chapter 20 The Measurement of Task Performance as Criteria in Selection Research
Chapter 21 Adaptive and Citizenship-Related Behaviors at Work
Chapter 22 New Perspectives on Counterproductive Work Behavior Including Withdrawal
Chapter 23 Defining and Measuring Results of Workplace Behavior
Chapter 24 Employee Work-Related Health, Stress, and Safety
Chapter 25 The Deficiency of Our Criteria: Who Defines Performance, Contribution, and Value?
Part VI Legal and Ethical Issues in Employee Selection
Chapter 26 Ethics of Employee Selection
Chapter 27 Professional Guidelines/Standards
Chapter 28 An Updated Sampler of Legal Principles in Employment Selection
Chapter 29 Updated Perspectives on the International Legal Environment for Selection
Chapter 30 A Consideration of International Differences in the Legal Context of Employment Selection
Part VII Employee Selection in Specific Organizational Contexts
Chapter 31 Selection and Classification in the U.S. Military
Chapter 32 Public Sector Employment
Chapter 33 Selection Methods and Desired Outcomes: Improving Entry- and Mid-level Leadership Performance Through the Use of Assessment Technologies
Chapter 34 Blue-Collar Selection in Private Sector Organizations
Chapter 35 Selection for Service and Sales Jobs
Chapter 36 Selection in Multinational Organizations
Chapter 37 Selection for Team Membership: Complexity, Contingency, and Dynamism Across Multiple Levels
Chapter 38 Selecting Leaders: Executives and High-Potentials
Part VIII Technology and Employee Selection
Chapter 39 Technology and Employee Selection: An Overview
Chapter 40 Advancing O*NET Data, Application, and Uses
Chapter 41 Cybersecurity Issues in Selection
Chapter 42 Modern Psychometric Theory to Support Personnel Assessment and Selection
Chapter 43 Using Big Data to Enhance Staffing: Vast Untapped Resources or Tempting Honeypot?
Chapter 44 The Impact of Emerging Technologies on Selection Models and Research: Mobile Devices and Gamification as Exemplars
Index

“It is great to see the new and state-of-the-art edition of James Farr and Nancy Tippins’s Handbook of Employee Selection. This is a ‘must read’ for anybody involved in selection and assessment at work.”
Sir Cary Cooper, 50th Anniversary Professor of Organizational Psychology and Health, Manchester Business School, University of Manchester, UK, and President of the CIPD

“Jim Farr and Nancy Tippins are the ideal co-editors to provide ‘the’ book on selection/staffing. Jim has spent his academic career studying the science of selection and shaping practice; Nancy has spent her entire career as a scientist who puts into practice the science of staffing. Both individuals are seasoned professionals to whom SIOP, the courts, and practitioners turn to for their expertise in this domain.”
Gary Latham, Secretary of State Chair of Organizational Effectiveness, Rotman School of Management, University of Toronto, Canada

Handbook of Employee Selection

This second edition of the Handbook of Employee Selection has been revised and updated throughout to reflect current thinking on the state of science and practice in employee selection. In this volume, a diverse group of recognized scholars inside and outside the United States balances theory, research, and practice, often taking a global perspective. Divided into eight parts, chapters cover issues associated with measurement, such as validity and reliability, as well as practical concerns around the development of appropriate selection procedures and implementation of selection programs. Several chapters discuss the measurement of various constructs commonly used as predictors, and other chapters confront criterion measures that are used in test validation. Additional sections include chapters that focus on ethical and legal concerns and testing for certain types of jobs (e.g., blue-collar jobs). The second edition features a new section on technology and employee selection. The Handbook of Employee Selection, Second Edition provides an indispensable reference for scholars, researchers, graduate students, and professionals in industrial and organizational psychology, human resource management, and related fields.

James L. Farr is Professor Emeritus of Psychology at Pennsylvania State University, USA. He is the author or editor of more than 85 publications in professional journals and books. He is an elected fellow of SIOP and APA and Past President of SIOP (1996–1997). He was a winner of SIOP's James McKeen Cattell Award for Research Design (1980) and M. Scott Myers Award for Applied Research in the Workplace (1998). In 2001, he was awarded SIOP's Distinguished Service Award.

Nancy T. Tippins is a Principal Consultant at CEB. She is a Fellow of SIOP, Division 5 of APA, APA, and APS and Past President of SIOP (2000–2001). She is currently Secretary of the SIOP Foundation. In 2004, she was the winner of SIOP's Distinguished Service Award, and, in 2013, she won SIOP's Distinguished Professional Contributions Award.

Handbook of Employee Selection Second Edition

Edited by James L. Farr and Nancy T. Tippins

Section Editors: Walter C. Borman, David Chan, Michael D. Coovert, Rick Jacobs, P. Richard Jeanneret, Jerard F. Kehoe, Filip Lievens, S. Morton McPhail, Kevin R. Murphy, Robert E. Ployhart, Elaine D. Pulakos, Douglas H. Reynolds, Ann Marie Ryan, Neal Schmitt, and Benjamin Schneider

Second edition published 2017
by Routledge, 711 Third Avenue, New York, NY 10017
and by Routledge, 2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2017 Taylor & Francis
The right of James L. Farr and Nancy T. Tippins to be identified as the authors of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.
Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
First edition published by Routledge 2010
Library of Congress Cataloging-in-Publication Data
Names: Farr, James L., editor. | Tippins, Nancy Thomas, 1950- editor.
Title: Handbook of employee selection / edited by James L. Farr and Nancy T. Tippins ; Section editors, Walter C. Borman [and others].
Description: Second edition. | New York, NY : Routledge, 2017. | Includes bibliographical references and index.
Identifiers: LCCN 2016039053 | ISBN 9781138915190 (hardback : alk. paper) | ISBN 9781138915497 (pbk. : alk. paper) | ISBN 9781315690193 (ebook)
Subjects: LCSH: Employee selection. | Employee selection—Handbooks, manuals, etc.
Classification: LCC HF5549.5.S38 H36 2018 | DDC 658.3/112—dc23
LC record available at https://lccn.loc.gov/2016039053
ISBN: 978-1-138-91519-0 (hbk)
ISBN: 978-1-138-91549-7 (pbk)
ISBN: 978-1-315-69019-3 (ebk)
Typeset in Garamond MT and Futura BT by Apex CoVantage, LLC

CONTENTS

Preface (James L. Farr and Nancy T. Tippins)  xi
About the Editors  xiii
Contributors  xv

Part I Foundations of Psychological Measurement and Evaluation Applied to Employee Selection (Section Editor: Benjamin Schneider)  1
Chapter 1 Reliability (Dan J. Putka)  3
Chapter 2 Validation Strategies for Primary Studies (Neal W. Schmitt, John D. Arnold, and Levi Nieminen)  34
Chapter 3 Validity Considerations in the Design and Implementation of Selection Systems (Jerard F. Kehoe and Paul R. Sackett)  56
Chapter 4 Situational Specificity, Validity Generalization, and the Future of Psychometric Meta-analysis (James M. LeBreton, Jeremy L. Schoen, and Lawrence R. James)  93
Chapter 5 Strategy, Selection, and Sustained Competitive Advantage (Robert E. Ployhart and Jeff A. Weekley)  115
Chapter 6 Work Analysis (Michael T. Brannick, Kenneth Pearlman, and Juan I. Sanchez)  134

Part II Implementation and Management of Employee Selection Systems in Work Organizations (Section Editors: Jerard F. Kehoe and Robert E. Ployhart)  163
Chapter 7 Attracting Job Candidates to Organizations (Ann Marie Ryan and Tanya Delany)  165
Chapter 8 Test Administration and the Use of Test Scores (Jeff W. Johnson and Frederick L. Oswald)  182
Chapter 9 Managing Sustainable Selection Programs (Jerard F. Kehoe, Stefan T. Mol, and Neil R. Anderson)  205
Chapter 10 The Business Value of Employee Selection (Wayne F. Cascio and John C. Scott)  226

Part III Categories of Individual Difference Constructs for Employee Selection (Section Editors: David Chan and Filip Lievens)  249
Chapter 11 Cognitive Ability: Measurement and Validity for Employee Selection (Deniz S. Ones, Stephan Dilchert, Chockalingam Viswesvaran, and Jesús F. Salgado)  251
Chapter 12 Physical Performance Tests (Deborah L. Gebhardt and Todd A. Baker)  277
Chapter 13 Personality: Its Measurement and Validity for Employee Selection (Leaetta Hough and Stephan Dilchert)  298
Chapter 14 Values, Styles, and Motivational Constructs (David Chan)  326
Chapter 15 Practical Intelligence, Emotional Intelligence, and Social Intelligence (Filip Lievens and David Chan)  342

Part IV Decisions in Developing, Selecting, Using, and Evaluating Predictors (Section Editors: Ann Marie Ryan and Neal Schmitt)  365
Chapter 16 Decisions in the Operational Use of Employee Selection Procedures: Choosing, Evaluating, and Administering Assessment Tools (Nancy T. Tippins, Emily C. Solberg, and Neha Singla)  367
Chapter 17 The Sum of the Parts: Methods of Combining Assessments for Employment Decisions (Juliet R. Aiken and Paul J. Hanges)  388
Chapter 18 Choosing a Psychological Assessment: Reliability, Validity, and More (Michael J. Zickar, Jose M. Cortina, and Nathan T. Carter)  397
Chapter 19 Assessment Feedback (Manuel London and Lynn A. McFarland)  406

Part V Criterion Constructs in Employee Selection (Section Editors: Kevin R. Murphy and Elaine D. Pulakos)  427
Chapter 20 The Measurement of Task Performance as Criteria in Selection Research (Walter C. Borman, Matthew R. Grossman, Rebecca H. Bryant, and Jay Dorio)  429
Chapter 21 Adaptive and Citizenship-Related Behaviors at Work (David W. Dorsey, Jose M. Cortina, Matthew T. Allen, Shonna D. Waters, Jennifer P. Green, and Joseph Luchman)  448
Chapter 22 New Perspectives on Counterproductive Work Behavior Including Withdrawal (Maria Rotundo and Paul E. Spector)  476
Chapter 23 Defining and Measuring Results of Workplace Behavior (Ryan S. O'Leary and Elaine D. Pulakos)  509
Chapter 24 Employee Work-Related Health, Stress, and Safety (Lois E. Tetrick, Pamela L. Perrewé, and Mark Griffin)  530
Chapter 25 The Deficiency of Our Criteria: Who Defines Performance, Contribution, and Value? (Jeanette N. Cleveland, Kevin R. Murphy, and Adrienne Colella)  554

Part VI Legal and Ethical Issues in Employee Selection (Section Editors: P. Richard Jeanneret and S. Morton McPhail)  573
Chapter 26 Ethics of Employee Selection (Joel Lefkowitz and Rodney L. Lowman)  575
Chapter 27 Professional Guidelines/Standards (P. Richard Jeanneret and Sheldon Zedeck)  599
Chapter 28 An Updated Sampler of Legal Principles in Employment Selection (Arthur Gutman, James L. Outtz, and Eric Dunleavy)  631
Chapter 29 Updated Perspectives on the International Legal Environment for Selection (Winny Shen, Paul R. Sackett, Filip Lievens, Eveline Schollaert, Greet Van Hoye, Dirk D. Steiner, Florence Rolland-Sayah, Konstantina Georgiou, Ioannis Nikolaou, Maria Tomprou, Shay Tzafrir, Peter Bamberger, Marilena Bertolino, Marco Mariani, Franco Fraccaroli, Tomoki Sekiguchi, Betty Onyura, Hyuckseung Yang, Janneke K. Oostrom, Paul Englert, Oleksandr S. Chernyshenko, Hennie J. Kriek, Tina Joubert, Jesús F. Salgado, Annika Wilhelmy, Cornelius J. König, Aichia Chuang, and Mark Cook)  659
Chapter 30 A Consideration of International Differences in the Legal Context of Employment Selection (Emilee Tison, Kristen Pryor, Michael Aamodt, and Eric Dunleavy)  678

Part VII Employee Selection in Specific Organizational Contexts (Section Editors: Rick Jacobs and Douglas H. Reynolds)  695
Chapter 31 Selection and Classification in the U.S. Military (Wayne S. Sellman, Teresa L. Russell, and William J. Strickland)  697
Chapter 32 Public Sector Employment (Rick Jacobs and Donna L. Denning)  722
Chapter 33 Selection Methods and Desired Outcomes: Improving Entry- and Mid-level Leadership Performance Through the Use of Assessment Technologies (Scott C. Erker, Charles J. Cosentino, and Kevin B. Tamanini)  738
Chapter 34 Blue-Collar Selection in Private Sector Organizations (Robert P. Michel and Shannon Bonner)  760
Chapter 35 Selection for Service and Sales Jobs (John P. Hausknecht and Angela L. Heavey)  781
Chapter 36 Selection in Multinational Organizations (Paula Caligiuri and Karen B. Paul)  797
Chapter 37 Selection for Team Membership: Complexity, Contingency, and Dynamism Across Multiple Levels (Susan Mohammed and Alexander S. McKay)  812
Chapter 38 Selecting Leaders: Executives and High-Potentials (George C. Thornton III, Stefanie K. Johnson, and Allan H. Church)  833

Part VIII Technology and Employee Selection (Section Editors: Walter C. Borman and Michael D. Coovert)  853
Chapter 39 Technology and Employee Selection: An Overview (Douglas H. Reynolds and David N. Dickter)  855
Chapter 40 Advancing O*NET Data, Application, and Uses (David Rivkin, Christina M. Gregory, Jennifer J. Norton, Denise E. Craven, and Phil M. Lewis)  874
Chapter 41 Cybersecurity Issues in Selection (David W. Dorsey, Jaclyn Martin, David J. Howard, and Michael D. Coovert)  913
Chapter 42 Modern Psychometric Theory to Support Personnel Assessment and Selection (Stephen Stark, Oleksandr S. Chernyshenko, and Fritz Drasgow)  931
Chapter 43 Using Big Data to Enhance Staffing: Vast Untapped Resources or Tempting Honeypot? (Richard N. Landers, Alexis A. Fink, and Andrew B. Collmus)  949
Chapter 44 The Impact of Emerging Technologies on Selection Models and Research: Mobile Devices and Gamification as Exemplars (Winfred Arthur Jr., Dennis Doverspike, Ted B. Kinney, and Matthew O'Connell)  967

Index  987

PREFACE

In the preface to the first edition of this Handbook, we noted that industrial and organizational psychologists have worked for more than 100 years to improve employee selection by identifying new processes for developing and evaluating selection instruments and creating new predictors and criteria for a wide range of jobs and organizations. We also noted that the organizational environments in which selection tools are used have continued to evolve and generally become more complex. Although the first edition of this Handbook could only summarize the important ideas that influenced selection efforts at that time because of the massive body of relevant professional literature, we were extremely proud of it and thrilled by the amount of positive feedback we received from many professional colleagues.

When we were approached by representatives of the publisher about editing a second edition, our collective initial response was "It's too soon." They were persuasive enough that we agreed to consult some of our colleagues who had participated in the creation of the first edition to ascertain if there was support for initiating a second edition. There was, and the publisher was eager to put the project into motion with a contract on the basis of a few, rather vague e-mails about plans for the revision.

Based on what seems to have been a successful initial outcome, we decided on the advice of colleagues and our own analysis to stay the course established by the first edition in large part. Many of the chapters in this edition are updates by the same author teams of the earlier chapters, whereas others are updates by modified author teams. Some chapters and parts of the second edition are substantially changed (especially Part I), and eight chapters are completely new (most related to technology and selection in Part VIII). A few chapters dealt comprehensively with the history of important selection programs, and these have been dropped from the second edition.

Staying the course of the first edition means that this edition of the Handbook was designed to cover the current thinking on not only basic concepts in employee selection but also specific applications of those concepts in various organizational settings. Throughout the book, we have encouraged authors to (a) balance the treatment of scientific (i.e., research findings and theory) and practical concerns related to implementation and operational use and (b) take a global perspective that reflects the concerns of multinational corporations and cross-cultural differences in testing practices and applicant skill.

Our continued hope for this Handbook is that it serves as a reference for the informed reader possessing an advanced degree in industrial and organizational psychology, human resource management, and other related fields, as well as for graduate students in these fields. Because the intended audience for the Handbook is professionals who work in the area of employee selection in academic and professional settings, a conscientious effort has been made to include the latest scientific thought and the best practices in application.

Handbooks of this size are not published without the help and hard work of many people. We are particularly grateful for the contributions of the 136 authors who wrote these 44 chapters. Without their expertise and willingness to share their professional and scientific knowledge, there would be no Handbook.
Similarly, we also attribute the existence and quality of this Handbook to our section editors (Walter C. Borman, David Chan, Michael D. Coovert, Rick Jacobs, P. Richard Jeanneret, Jerard F. Kehoe, Filip Lievens, S. Morton McPhail, Kevin R. Murphy, Robert E. Ployhart, Elaine D. Pulakos, Douglas H. Reynolds, Ann Marie Ryan, Neal Schmitt, and
Benjamin Schneider) of the eight major parts of the Handbook, who not only helped us make decisions about chapter retention and revision but also were critical in identifying new chapter topics and authors. In addition, they each provided highly constructive feedback to chapter authors in their respective parts during the revision process.

Christina Chronister, Editor at Routledge of the Taylor & Francis Group, deserves a special acknowledgement for shepherding the Handbook through the editing and production processes. Julie Toich, our Editorial Assistant at Routledge, handled literally hundreds of administrative tasks with cheerful tolerance of us and our authors and helped ensure that all permissions were obtained, author agreements were collected, quotations were appropriately cited, tables and figures were placed in the appropriate format, etc. Undertakings such as this Handbook require extreme organization. Autumn Spalding, our Project Manager, shepherded the book through the printing process, ensured it was appropriately copyedited, and kept us all on schedule. Betsy Saiani of CEB maintained our records of who had completed what tasks and generally kept us organized.

Innumerable teachers, mentors, colleagues, and friends have taught us much about employee selection and gotten us to this point in our careers and lives. We are appreciative of their support and encouragement and their many direct, and indirect, contributions to this Handbook. Of course, our respective families have made a significant contribution to this book through their encouragement, support, patience, and tolerance. Thank you, Diane and Mac.

James L. Farr, State College, Pennsylvania
Nancy T. Tippins, Greenville, South Carolina


ABOUT THE EDITORS

James L. Farr received his Ph.D. in industrial and organizational psychology from the University of Maryland. Beginning in 1971 until his retirement in 2013, he was a member of the faculty in the Department of Psychology of The Pennsylvania State University, where he is now Professor of Psychology Emeritus. He has also been a visiting scholar at the University of Sheffield (United Kingdom), the University of Western Australia, the Chinese University of Hong Kong, and the University of Giessen (Germany). His primary research interests are performance appraisal and feedback, personnel selection, the older worker, and innovation and creativity in work settings. Dr. Farr is the author or editor of more than 85 publications in professional journals and books, including The Measurement of Work Performance (with Frank Landy; Academic Press, 1983), Innovation and Creativity at Work: Psychological and Organizational Strategies (co-edited with Michael West; John Wiley & Sons, 1990), Personnel Selection and Assessment: Individual and Organizational Perspectives (co-edited with Heinz Schuler and Mike Smith; Lawrence Erlbaum Associates, 1993), and Handbook of Employee Selection (co-edited with Nancy T. Tippins; Routledge, 2010). He was the editor of Human Performance from 2000–2006 and has been a member of the editorial boards of numerous other professional journals, including Journal of Applied Psychology, Organizational Behavior and Human Decision Processes, Journal of Occupational and Organizational Psychology, and Journal of Business and Psychology. Active in a number of professional organizations, Dr. Farr was president of the Society for Industrial and Organizational Psychology (SIOP) in 1996–1997 and has served in a variety of other positions for SIOP. He was an elected member of the Board of Representatives for the American Psychological Association (APA) from 1993–1996 and 2002–2004, representing SIOP. He is an elected fellow of SIOP and APA. A strong believer in the scientist-practitioner model for industrial/organizational (I/O) psychology, Dr. Farr was a winner of SIOP's 1980 James McKeen Cattell Award for Research Design (with Frank Landy and Rick Jacobs) and its 1998 M. Scott Myers Award for Applied Research in the Workplace (with Frank Landy, Edwin Fleishman, and Robert Vance). In 2001 he was the winner of SIOP's Distinguished Service Award.

Nancy T. Tippins is a Principal Consultant at CEB, where she manages teams that develop talent acquisition strategies related to workforce planning, sourcing, acquisition, selection, job analysis and competency identification, performance management, succession planning, manager and executive assessments, employee and leadership development, and expert support in litigation. Prior to her work at CEB, Dr. Tippins worked as an internal consultant in the personnel research functions of Exxon, Bell Atlantic, and GTE, and then worked as an external consultant at Valtera Corporation, which was later acquired by CEB. Active in professional affairs, Dr. Tippins has a longstanding involvement with the Society for Industrial and Organizational Psychology (SIOP), where she served as President from 2000–2001 and is currently the Secretary of the SIOP Foundation. She served on the Ad Hoc Committee on the Revision of the Principles for the Validation and Use of Personnel Selection Procedures (1999) and is currently co-chairing the committee for the current revision of the Principles. She
was one of the U.S. representatives on the ISO 9000 committee to establish international assessment standards. She also served on the most recent Joint Committee to revise the Standards for Educational and Psychological Tests (2014). She has been SIOP's representative to the APA's Council of Representatives and served on the APA's Board of Professional Affairs and the Commission for the Recognition of Specialties and Proficiencies in Professional Psychology.

Dr. Tippins has authored or presented numerous articles on tests and assessments. She coauthored Designing and Implementing Global Selection Systems with Ann Marie Ryan, co-edited the Handbook of Employee Selection with James L. Farr, and co-edited Technology Enhanced Assessments with Seymour Adler. She has served as the Associate Editor for the Scientist-Practitioner Forum of Personnel Psychology. She is currently on the Editorial Boards of Journal of Applied Psychology, Personnel Psychology, Industrial and Organizational Psychology: Perspectives on Science and Practice, Journal of Business and Psychology, and Personnel Assessment and Decisions. She is the current editor of SIOP's Professional Practice Series.

Dr. Tippins received her Ph.D. in Industrial and Organizational Psychology from the Georgia Institute of Technology. She is a fellow of SIOP, the American Psychological Association (APA), Division 5 of APA, and the American Psychological Society (APS), and is involved in several private industry research groups. In 2004, she was the winner of SIOP's Distinguished Service Award, and in 2013, she won SIOP's Distinguished Professional Contributions Award.


CONTRIBUTORS

Michael Aamodt, DCI Consulting Group, Inc., USA
Juliet R. Aiken, University of Maryland, College Park, USA
Matthew T. Allen, U.S. Department of Defense, USA
Neil R. Anderson, Brunel University, London, UK
John D. Arnold, Polaris Assessment Systems, USA
Winfred Arthur Jr., Texas A&M University, USA
Todd A. Baker, Human Resources Research Organization (HumRRO), USA
Peter Bamberger, Tel Aviv University, Israel
Marilena Bertolino, University of Nice, France
Shannon Bonner, 3M, USA
Walter C. Borman, University of South Florida, USA
Michael T. Brannick, University of South Florida, USA
Rebecca H. Bryant, Bank of America, USA
Paula Caligiuri, Northeastern University, USA
John P. Campbell, University of Minnesota, USA
Nathan T. Carter, University of Georgia, USA
Wayne F. Cascio, University of Colorado, Denver, USA
David Chan, Singapore Management University, Republic of Singapore
Oleksandr S. Chernyshenko, Nanyang Technological University, Republic of Singapore
Aichia Chuang, National Taiwan University, Taiwan
Allan H. Church, PepsiCo, USA
Jeanette N. Cleveland, Colorado State University, USA
Adrienne Colella, Tulane University, USA
Andrew B. Collmus, Old Dominion University, USA
Mark Cook, Swansea University, UK
Michael D. Coovert, University of South Florida, USA
Jose M. Cortina, George Mason University, USA
Charles J. Cosentino, Development Dimensions International, USA
Denise E. Craven, Center for O*NET Development, USA
Tanya Delany, IBM Corporation, Italy
Donna L. Denning, Retired, USA
David N. Dickter, Western University of Health Sciences, USA
Stephan Dilchert, Baruch College, USA
Jay Dorio, IBM, USA
David W. Dorsey, Department of Defense, USA
Dennis Doverspike, University of Akron, USA
Fritz Drasgow, University of Illinois at Urbana-Champaign, USA
Eric Dunleavy, DCI Consulting Group, Inc., USA
Paul Englert, Nanyang Technological University, Republic of Singapore
Scott C. Erker, Development Dimensions International, USA
Alexis A. Fink, Intel, USA
Franco Fraccaroli, University of Trento, Italy
Deborah L. Gebhardt, Human Resources Research Organization (HumRRO), USA
Konstantina Georgiou, Athens University of Economics and Business, Greece
Jennifer P. Green, George Mason University, USA
Christina M. Gregory, Center for O*NET Development, USA
Mark Griffin, University of Western Australia, Australia
Matthew R. Grossman, PricewaterhouseCoopers, USA
Arthur Gutman, Florida Institute of Technology, USA
Paul J. Hanges, University of Maryland, College Park, USA
John P. Hausknecht, Cornell University, USA
Angela L. Heavey, James Madison University, USA
Leaetta Hough, Dunnette Group, Ltd., USA
David J. Howard, University of South Florida, USA
Rick Jacobs, Pennsylvania State University, USA
Lawrence R. James, Georgia Institute of Technology, USA
P. Richard Jeanneret, Retired, USA
Jeff W. Johnson, CEB, USA
Stefanie K. Johnson, University of Colorado, Boulder, USA
Tina Joubert, Independent Consultant, South Africa
Jerard F. Kehoe, Selection and Assessment Consulting, USA
Ted Kinney, Select International, Inc., USA
Cornelius J. König, Universität des Saarlandes, Germany
Hennie J. Kriek, Top Talent Solutions Inc., USA
Richard N. Landers, Old Dominion University, USA
James M. LeBreton, Pennsylvania State University, USA
Joel Lefkowitz, Bernard M. Baruch College and The Graduate Center, CUNY, USA
Phil M. Lewis, Center for O*NET Development, USA
Filip Lievens, Ghent University, Belgium
Manuel London, State University of New York at Stony Brook, USA
Rodney L. Lowman, CSPP/Alliant International University, USA
Joseph Luchman, Fors Marsh Group LLC, USA
Marco Mariani, University of Bologna, Italy
Jaclyn Martin, University of South Florida, USA
Lynn A. McFarland, University of South Carolina, USA
Alexander S. McKay, Pennsylvania State University, USA
S. Morton McPhail, Retired, USA
Robert P. Michel, Edison Electric Institute, USA
Susan Mohammed, Pennsylvania State University, USA
Stefan T. Mol, University of Amsterdam, the Netherlands
Kevin R. Murphy, University of Limerick, Ireland
Levi Nieminen, Denison Consulting, LLC, USA
Ioannis Nikolaou, Athens University of Economics & Business, Greece
Jennifer J. Norton, Center for O*NET Development, USA
Matthew O'Connell, Select International, Inc., USA
Ryan S. O'Leary, CEB, USA
Deniz S. Ones, University of Minnesota, USA
Betty Onyura, St. Michael's Hospital, Canada
Janneke K. Oostrom, Vrije Universiteit Amsterdam, the Netherlands
Frederick L. Oswald, Rice University, USA
James L. Outtz, Outtz and Associates, USA
Karen B. Paul, 3M, USA
Kenneth Pearlman, Independent Consultant, USA
Pamela L. Perrewé, Florida State University, USA
Robert E. Ployhart, University of South Carolina, USA
Kristen Pryor, DCI Consulting Group, Inc., USA
Elaine D. Pulakos, CEB, USA
Dan J. Putka, HumRRO, USA
Douglas H. Reynolds, DDI, USA
David Rivkin, Center for O*NET Development, USA
Florence Rolland-Sayah, Pole Emploi, France
Maria Rotundo, University of Toronto, Canada
Teresa L. Russell, Human Resources Research Organization (HumRRO), USA
Ann Marie Ryan, Michigan State University, USA
Paul R. Sackett, University of Minnesota, USA
Jesús F. Salgado, University of Santiago de Compostela, Spain
Juan I. Sanchez, Florida International University, USA
Neal Schmitt, Michigan State University, USA
Benjamin Schneider, Marshall School of Business, University of Southern California and University of Maryland, USA
Jeremy L. Schoen, Georgia Gwinnett College, USA
Eveline Schollaert, University College Ghent, Belgium
John C. Scott, APTMetrics, USA
Tomoki Sekiguchi, Osaka University, Japan
Wayne S. Sellman, Human Resources Research Organization (HumRRO), USA
Winny Shen, University of Waterloo, Canada
Neha Singla, CEB, USA
Emily C. Solberg, CEB, USA
Paul E. Spector, University of South Florida, USA
Stephen Stark, University of South Florida, USA
Dirk D. Steiner, University Nice Sophia Antipolis, France
William J. Strickland, Human Resources Research Organization (HumRRO), USA
Kevin B. Tamanini, Development Dimensions International, USA
Lois E. Tetrick, George Mason University, USA
George C. Thornton III, Colorado State University, USA
Nancy T. Tippins, CEB, USA
Emilee Tison, DCI Consulting Group, Inc., USA
Maria Tomprou, Carnegie Mellon University, USA
Shay Tzafrir, University of Haifa, Israel
Greet Van Hoye, Ghent University, Belgium
Chockalingam Viswesvaran, Florida International University, USA
Shonna D. Waters, National Security Agency, USA
Jeff A. Weekley, University of Texas at Dallas, USA
Annika Wilhelmy, University of Zurich, Switzerland
Hyuckseung Yang, Yonsei University, Republic of Korea
Sheldon Zedeck, University of California at Berkeley, USA
Michael J. Zickar, Bowling Green State University, USA


Part I
FOUNDATIONS OF PSYCHOLOGICAL MEASUREMENT AND EVALUATION APPLIED TO EMPLOYEE SELECTION
BENJAMIN SCHNEIDER, SECTION EDITOR

1
RELIABILITY
DAN J. PUTKA

Reliability and validity are concepts that provide the scientific foundation upon which we construct and evaluate predictor and criterion measures of interest in personnel selection. They offer a common technical language for discussing and evaluating (a) the generalizability of scores resulting from our measures (to a population of like measures), as well as (b) the accuracy of inferences we desire to make based on those scores (e.g., high scores on our predictor measure are associated with high levels of job performance; high scores on our criterion measure are associated with high levels of job performance).1 Furthermore, the literature surrounding these concepts provides a framework for scientifically sound measure development that, a priori, can enable us to increase the likelihood that scores resulting from our measures will be generalizable, and inferences we desire to make based upon them, supported.

Like personnel selection itself, the science and practice surrounding the concepts of reliability and validity continue to evolve. The evolution of reliability has centered on its evaluation and framing of "measurement error," as its operational definition over the past century has remained focused on notions of consistency of scores across replications of a measurement procedure (Haertel, 2006; Spearman, 1904; Thorndike, 1951). The evolution of validity has been more diverse—with changes affecting not only its evaluation but also its very definition, as evidenced by comparing editions of the Standards for Educational and Psychological Testing produced over the past half century by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME) (AERA, APA, & NCME, 2014). Relative to the evolution of reliability, the evolution of validity has been well covered in the personnel selection literature (e.g., Binning & Barrett, 1989; McPhail, 2007; Schmitt & Landy, 1993; Society for Industrial and Organizational Psychology, Inc., 2003) and will continue to be well covered in this Handbook. For this reason, this chapter will be devoted to providing an integrated, modern perspective on reliability.

In reviewing literature in preparation for this chapter, I was struck by the paucity of organizational research literature that has attempted to juxtapose and integrate perspectives on reliability of the last 50 years, with perspectives on reliability from the first half of the 20th century. Indeed, Borsboom (2006) lamented that to this day many treatments of reliability are explicitly framed or implicitly laden with assumptions based on measurement models from the early 1900s. While classical test theory (CTT) certainly has its place in treatments of reliability, framing entire treatments around it serves to "trap" us within the CTT paradigm (Kuhn, 1962). This makes it difficult for students of the field to compare and contrast—on conceptual and empirical grounds—perspectives offered by other measurement theories and approaches to reliability estimation. This state of affairs is highly unfortunate because perspectives on reliability and methods for its estimation have evolved greatly since Gulliksen's codification of CTT in 1950, yet these advances have been slow to disseminate into personnel selection research and practice.


Indeed, my review of the literature reveals what appears to be a widening gap between perspectives of reliability offered in the organizational research literature and those of the broader psychometric community (e.g., Borsboom, 2006; Raykov & Marcoulides, 2011). Couple this trend with (a) the recognized decline in the graduate instruction of statistics and measurement over the past 30 years in psychology departments (Aiken, West, & Millsap, 2008; Merenda, 2007), as well as (b) the growing availability of statistical software and estimation methods since the mid-1980s, and we have a situation where the psychometric knowledge base of new researchers and practitioners can be dated prior to exiting graduate training. Perhaps more disturbing is that the lack of dissemination of modern perspectives on reliability can easily give students of the field the impression that the area of reliability has not had many scientifically or practically useful developments since the early 1950s.

In light of the issues raised above, my aim in the first part of this chapter is to parsimoniously reframe and integrate developments in the reliability literature over the past century that reflect, to the extent of my knowledge, our modern capabilities. In laying out this discussion, I use examples from personnel selection research and practice to relate key points to situations readers may confront in their own work. Given this focus, note that several topics commonly discussed in textbook or chapter-length treatments of reliability are missing from this chapter. For example, topics such as standard errors of measurement, factors affecting the magnitude of reliability coefficients (e.g., sample heterogeneity), and applications of reliability-related data (e.g., corrections for attenuation, measure refinement) receive little or no attention here. The omission of these topics is not meant to downplay their importance to our field; rather, it just reflects the fact that fine treatments of these topics already exist in several places in the literature (e.g., Feldt & Brennan, 1989; Haertel, 2006; Nunnally, 1978). My emphasis is on complementing the existing literature, not repeating it.

In place of these important topics, I focus on integrating and drawing connections among historically disparate perspectives on reliability. As noted below, such integration is essential, because the literature on reliability has become extremely fragmented. For example, although originally introduced as a "liberalization" of CTT more than 40 years ago, generalizability theory is still not well integrated into textbook treatments of reliability in the organizational literature. It tends to be relegated to secondary sections that appear after the primary treatment of reliability (largely based on CTT) is introduced, not mentioned at all, or treated as if it had value in only a limited number of measurement situations faced in research and practice. Although such a statement may appear as a wholesale endorsement of generalizability theory and its associated methodology, it is not. As an example, the educational measurement literature has generally held up generalizability theory as a centerpiece of modern perspectives on reliability, but arguably, this has come at the expense of shortchanging confirmatory factor analytic (CFA)-based perspectives on reliability and how such perspectives relate to and can complement generalizability theory.
Ironically, this lack of integration goes both ways, because CFA-based treatments of reliability rarely, if ever, acknowledge how generalizability theory can enrich the CFA perspective (e.g., DeShon, 1998), but rather link their discussions of reliability to CTT. Essentially, investigators faced with understanding modern perspectives on reliability are faced with a fragmented, complex literature.

OVERVIEW

This chapter's treatment of reliability is organized into three main sections. The first section offers a conceptual, "model-free" definition of measurement error. In essence, starting out with such a model-free definition of error is required to help clarify some confusion that tends to crop up when one begins to frame error from the perspective of a given measurement theory and the assumptions such theories make regarding the substantive nature of error. Next I overlay this conceptual treatment of error with perspectives offered by various measurement models. Measurement models are important because they offer a set of hypotheses regarding the composition of observed scores, which, if supported, can allow us to accurately estimate reliability from a sample of data and apply those estimates to various problems (e.g., corrections for
attenuation, construction of score bands). Lastly, I compare and contrast three traditions that have emerged for estimating reliability: (a) a classical tradition that arose out of work by Spearman (1904) and Brown (1910), (b) a random-effects model tradition that arose out of Fisher's work with analysis of variance (ANOVA), and (c) a CFA tradition that arose out of Joreskog's work on congeneric test models.

RELIABILITY

A specification for error is central to the concept of reliability, regardless of one's theoretical perspective, but to this day the meaning of the term "error" is a source of debate and confusion (Borsboom & Mellenbergh, 2002; Murphy & DeShon, 2000; Schmidt, Viswesvaran, & Ones, 2000). The sources of variance in scores that are designated as sources of error can differ as a function of (a) the inferences or assertions an investigator wishes to make regarding the scores, (b) how an investigator intends to use the scores (e.g., for relative comparison among applicants or absolute comparison of their scores to some set standard), (c) characteristics of the measurement procedure that produced them, and (d) the nature of the construct one is attempting to measure. Consequently, what is called error, even for scores produced by the same measurement procedure, may legitimately reflect different things under different circumstances. As such, there is no such thing as the reliability of scores (just as there is no such thing as the validity of scores), and it is possible for many reliability estimates to be calculated that depend on how error is being defined by an investigator. Just as we qualify statements of validity with statements of "validity for purpose X" or "evidence of validity for supporting inference X," so too must care be taken when discussing reliability with statements such as "scores are reliable with respect to consistency across Y," where Y might refer to items, raters, tasks, or testing occasions, or combinations of them (Putka & Hoffman, 2013, 2015). As we'll see later, different reliability estimates calculated on the same data tell us very different things about the quality of our scores and the degree to which various inferences regarding their consistency are warranted.

A convenient way to start to address these points is to examine how error has come to be operationally defined in the context of estimating reliability. All measurement theories seem to agree that reliability estimation attempts to quantify the expected degree of consistency in scores over replications of a measurement procedure (Brennan, 2001a; Haertel, 2006). Consequently, from the perspective of reliability estimation, error reflects the expected degree of inconsistency between scores produced by a measurement procedure and replications of it. Several elements of these operational definitions warrant further explanation, beginning with the notion of replication. Clarifying these elements will provide an important foundation for the remainder of this chapter.
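To make the point that "the reliability" of scores does not exist concrete, the short simulation below is offered as an illustration; it is not part of the chapter, and all variance components, sample sizes, and names are arbitrary assumptions. It computes two defensible reliability estimates on the same simulated data: coefficient alpha, which treats items as the replicates, and a test-retest correlation, which treats occasions as the replicates.

```python
import numpy as np

# Illustrative simulation only: the variance components below are arbitrary assumptions,
# not values taken from the chapter.
rng = np.random.default_rng(0)
n_persons, n_items = 500, 10

true = rng.normal(0.0, 1.0, n_persons)                 # stable standing on the construct
transient = rng.normal(0.0, 0.6, (n_persons, 2))       # occasion-specific person effects
noise = rng.normal(0.0, 1.0, (n_persons, n_items, 2))  # item-by-occasion residuals

# scores[p, i, o]: person p, item i, occasion o
scores = true[:, None, None] + transient[:, None, :] + noise

def coefficient_alpha(item_scores):
    """Coefficient alpha for a persons x items matrix of scores."""
    k = item_scores.shape[1]
    item_var_sum = item_scores.var(axis=0, ddof=1).sum()
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)

# Error defined as inconsistency across items (occasion 1 only):
alpha = coefficient_alpha(scores[:, :, 0])

# Error defined as inconsistency across occasions (total scores, occasion 1 vs. 2):
retest_r = np.corrcoef(scores[:, :, 0].sum(axis=1), scores[:, :, 1].sum(axis=1))[0, 1]

print(f"consistency across items (alpha):        {alpha:.2f}")
print(f"consistency across occasions (retest r): {retest_r:.2f}")
```

Under these assumed components, the item-based estimate is substantially higher than the occasion-based estimate even though both are computed from the same scores, because the occasion-specific person effects count as true score when items define the replicates but as error when occasions do.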

Replication

From a measurement perspective, replication refers to the repetition or reproduction of a measurement procedure such that the scores produced by each "replicate" are believed to assess the same construct.2 There are many ways of replicating a measurement procedure. Perhaps the most straightforward way would be to administer the same measurement procedure on more than one occasion, which would provide insight into how consistent scores are for a given person across occasions. However, we are frequently interested in more than whether our measurement procedure would produce comparable scores on different occasions. For example, would we achieve consistency over replicates if we had used an alternative, yet similar, set of items to those that comprise our measure? Answering the latter question is a bit more difficult in that we are rarely in a position to replicate an entire measurement procedure (e.g., construct two or more 20-item measures of conscientiousness and compare scores on each). Consequently, in practice, "parts" or "elements" of our measurement procedure (e.g., items) are often viewed as replicates of each other. The observed consistency of scores across these individual elements is then used
to make inferences about the level of consistency we would expect if our entire measurement procedure was replicated; that is, how consistent would we expect scores to be for a given person across alternative sets of items we might use to assess the construct of interest.

The forms of replication described above dominated measurement theory for nearly the first five decades of the 20th century (Cronbach, 1947; Gulliksen, 1950). Modern perspectives on reliability have liberalized the notion of replicates in terms of (a) the forms that they take and (b) how the measurement facets (i.e., items, raters, tasks, occasions) that define them are manifested in a data collection design (i.e., a measurement design). For example, consider a measurement procedure that involves having two raters provide ratings for individuals with regard to their performance on three tasks designed to assess the same construct. In this case, replicates take the form of the six rater-task pairs that comprise the measurement procedure, and as such, are multifaceted (i.e., each replicate is defined in terms of a specific rater and a specific task). Prior to the 1960s, measurement theory primarily focused on replicates that were defined along a single facet (e.g., replicates represented different items, different split-halves of a test, or the same test administered on different occasions).3 Early measurement models were not concerned with replicates that were multifaceted in nature (Brown, 1910; Gulliksen, 1950; Spearman, 1910).

Modern perspectives on reliability also recognize that measurement facets can manifest themselves differently in any given data collection design. For example, (a) the same raters might provide ratings for each ratee; (b) a unique, nonoverlapping set of raters might provide ratings for each ratee; or (c) sets of raters that rate each ratee may vary in their degree of overlap. As noted later, the data collection design underlying one's measurement procedure has important implications for reliability estimation, which, prior to the 1960s, was not integrated into measurement models. It was simply not the focus of early measurement theory (Cronbach & Shavelson, 2004).
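The sketch below is added purely for exposition and is not drawn from the chapter; the ratings are random numbers and all labels (ratee, rater, task, slot) are hypothetical. It lays out the same six rater-task replicates under a fully crossed design and under a design in which raters are nested within ratees, which makes visible why the data collection design matters.

```python
import numpy as np
import pandas as pd

# Illustrative sketch only: ratings are random numbers, and all labels are hypothetical.
rng = np.random.default_rng(1)
n_ratees, n_raters, n_tasks = 5, 2, 3

# Fully crossed design: the same two raters score every ratee on all three tasks,
# so each (rater, task) pair is one multifaceted replicate of the procedure.
crossed = pd.DataFrame(
    rng.normal(50, 10, (n_ratees, n_raters * n_tasks)).round(1),
    index=[f"ratee_{p + 1}" for p in range(n_ratees)],
    columns=[f"rater{r + 1}_task{t + 1}" for r in range(n_raters) for t in range(n_tasks)],
)
print(crossed)

# Each rater's overall mean can be computed across all ratees, so rater leniency is
# separable from ratee standing in this layout.
print(crossed.filter(like="rater1_").values.mean(),
      crossed.filter(like="rater2_").values.mean())

# Nested design: each ratee is scored by a different, nonoverlapping pair of raters.
# Columns can only be labeled by rater "slot," not by rater identity, so a given
# rater's scores exist for one ratee only.
nested = pd.DataFrame(
    rng.normal(50, 10, (n_ratees, n_raters * n_tasks)).round(1),
    index=crossed.index,
    columns=[f"slot{r + 1}_task{t + 1}" for r in range(n_raters) for t in range(n_tasks)],
)
print(nested)
```

In the nested layout, each rater's mean is based on a single ratee, so rater leniency and ratee standing cannot be disentangled the way they can be in the crossed layout; this is one reason the measurement design must be taken into account when deciding what counts as error.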

Expectation

A second key element of the operational definition of reliability offered above is the notion of expectation. The purpose of estimating reliability is not to quantify the level of consistency in scores among the sample of replicates that comprise one's measurement procedure for a given study (e.g., items, raters, tasks, or combinations thereof). Rather, the purpose is to use such information to make inferences regarding (a) the consistency of scores resulting from our measurement procedure as a whole with the population from which replicates comprising our measurement procedure were drawn (e.g., the population of items, raters, tasks, or combinations thereof believed to measure the construct of interest) and (b) the consistency of the said scores for the population of individuals from which our sample of study participants was drawn. Thus, the inference space of interest in reliability estimation is inherently multidimensional. As described in subsequent sections, the utility of measurement theories is that they help us make this inferential leap from sample to population; however, the quality with which estimation approaches derived from these theories do so depends on the properties of scores arising from each replicate, characteristics of the construct one is attempting to measure, and characteristics of the sample of one's study participants.

Consistency and Inconsistency

Lastly, the third key element of the operational definition of reliability is the notion of consistency in scores arising from replicates. Defining reliability in terms of consistency of scores implies that error, from the perspective of reliability, is anything that gives rise to inconsistency in scores.4 Conversely, anything that gives rise to consistency in a set of scores, whether it is the construct we intend to measure or some contaminating source of construct-irrelevant variation that is shared or consistent across replicates, delineates the "true" portion of an observed score from the perspective of reliability. Indeed, this is one reason why investigators are quick to note
that "true score," in the reliability sense of the word, is a bit of a misnomer for the uninitiated—it is not the same as a person's true standing on the construct of interest (Borsboom & Mellenbergh, 2002; Lord & Novick, 1968; Lumsden, 1976). Thus, what may be considered a source of error from the perspective of validity may be considered true score from the perspective of reliability.

Although an appreciation of the distinction between true score from the perspective of reliability and a person's true standing on a construct can be gleaned from the extant literature, there seems to be a bit more debate with regard to the substantive properties of error. The confusion in part stems from a disconnect between the operational definition of error outlined above (i.e., inconsistency in scores across replicates) and hypotheses that measurement theories make regarding the distributional properties of such inconsistencies, which may or may not reflect reality. For example, in the sections above I made no claims with regard to whether inconsistency in scores across replications reflected (a) unexplainable variation that would be pointless to attempt to model, (b) explainable variation that could potentially be meaningfully modeled using exogenous variables as predictors (i.e., measures other than our replicates), or (c) a combination of both of these types of variation. Historically, many treatments of reliability, whether explicitly or implicitly, have equated inconsistency in scores across replicates with "unpredictable" error (e.g., AERA, APA, & NCME, 1999, p. 27). However, nothing in the operational definition of error laid out above necessitates that inconsistencies in scores are unpredictable.

Part of the confusion may lie in the fact that we often conceive of replicates as having been randomly sampled from a broader population(s) or as at least representative of some broader population(s) (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Nunnally, 1978; Tryon, 1957). From a statistical perspective, the effects associated with such replicates on scores would be considered random (Jackson & Brashers, 1994), but this does not necessitate that variation in those effects is unexplainable or beyond meaningful prediction, particularly when raters define the replicates of interest (Cronbach et al., 1972; Murphy & DeShon, 2000). Thus, one should be cautious when framing inconsistency in scores as reflecting random errors of measurement because it is often confused with the notion that such errors are beyond meaningful explanation (Ng, 1974).
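The following toy simulation is an added illustration rather than anything from the chapter, and the covariate, effect size, and sample values are invented. It makes the same point in miniature: rater main effects produce inconsistency across rater replicates and so count as error, yet a sizable share of that variance can be predicted from an observable rater characteristic.

```python
import numpy as np

# Hypothetical example: "experience" and the 0.3 effect size are invented for illustration.
rng = np.random.default_rng(3)
n_raters = 200

experience = rng.uniform(0, 10, n_raters)
leniency = 0.3 * (experience - experience.mean()) + rng.normal(0, 0.5, n_raters)

# From a reliability standpoint these rater effects are "random error" (they create
# inconsistency across rater replicates), but they are far from unexplainable:
slope, intercept = np.polyfit(experience, leniency, 1)
residual = leniency - (intercept + slope * experience)
r2 = 1 - residual.var() / leniency.var()
print(f"proportion of rater-effect variance explained by experience: {r2:.2f}")
```

Nothing here contradicts treating rater effects as statistically random; it simply illustrates that "random" and "unexplainable" are not synonyms.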

Summary

This section offered a model-free perspective on error and how it has come to be operationally defined from the perspective of reliability. I adopted this strategy in part because of the confusion noted above but also to bring balance to existing treatments of reliability in the industrial-organizational (I-O) literature, which explicitly or implicitly tends to frame discussions of reliability from the CTT tradition. The language historically used in treatments of CTT makes it difficult for investigators to recognize that inconsistency in scores is not necessarily beyond meaningful explanation, although we conceive of it as random. Another reason I belabor this point is that beginning with Spearman (1904), a legacy of organizational research emerged that focuses more on making adjustments for error in our measures (e.g., corrections for attenuation), rather than developing methods for modeling and understanding error in our measures, which in part may reflect our tendency to view such error as unexplainable.

ROLE OF MEASUREMENT MODELS

The defining characteristic of a measurement model is that it specifies a statistical relationship between observed scores and unobserved components of those scores. Some of these unobserved components may reflect sources of consistency in scores (across replicates), whereas others may reflect sources of inconsistency. As noted earlier, the utility of measurement models is that they offer a set of hypotheses regarding the composition of observed scores, which, if supported, can allow us to accurately estimate reliability (e.g., reliability coefficients, standard errors of measurement) from a sample of data and apply those estimates to various problems (e.g.,
corrections for attenuation, construction of score bands). To the extent that such hypotheses are not supported, faulty conclusions regarding the reliability of scores may be drawn, inappropriate uses of the reliability information may occur, and knowledge regarding inconsistencies in our scores may be underutilized.

In this section, I compare and contrast measurement models arising from two theories that underlie the modern literature on reliability, namely CTT and generalizability theory (G-theory).5 The measurement models underlying CTT and G-theory actually share some important similarities. For example, both (a) conceive of observed scores as being an additive function of true score and error components and (b) view true score and error components as uncorrelated. Nevertheless, as discussed below (cf. Generalizability Theory), certain characteristics of G-theory models enable them to be meaningfully applied to a much broader swath of measurement procedures that we encounter in personnel selection relative to the CTT models. Rather than being competing models, it is now commonly acknowledged that CTT models are simply a more restrictive, narrower version of the G-theory model, which is why G-theory is generally viewed as a "liberalization" of CTT (AERA, APA, & NCME, 1999; Brennan, 2006; Cronbach, Rajaratnam, & Gleser, 1963). Nevertheless, given its relatively narrow focus, it is convenient for pedagogical purposes to open with a brief discussion of CTT before turning to G-theory.
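For readers who want the contrast in symbols, the display below is added for reference; it follows the standard one-facet, fully crossed, random-effects G-theory model associated with Cronbach et al. (1972) rather than anything specific to this chapter, with persons (p) crossed with raters (r) and k raters averaged per person.

```latex
% One-facet G-theory model for a fully crossed persons (p) x raters (r) design:
X_{pr} = \mu + \nu_p + \nu_r + \nu_{pr,e}
\qquad
\sigma^2_X = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e}

% Generalizability coefficient for relative decisions based on a k-rater average:
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e}/k}
\qquad
% Dependability coefficient for absolute decisions:
\Phi = \frac{\sigma^2_p}{\sigma^2_p + (\sigma^2_r + \sigma^2_{pr,e})/k}

% The CTT model X_{pr} = T_p + E_{pr} is the special case in which \nu_r and \nu_{pr,e}
% are not distinguished: both are folded into a single undifferentiated error term E_{pr}.
```

Which of the two coefficients is appropriate depends on whether scores will be used for relative comparisons among individuals or for comparisons to an absolute standard, the same distinction in score use raised earlier in this chapter.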

Classical Test Theory

Under classical test theory, the observed score (X) for a given person p that is produced by replicate r of a measurement procedure is assumed to be a simple additive function of two parts: the person's true score (T) and an error score (E).

X_{pr} = T_p + E_{pr}    (1.1)

Conceptually, a person's true score equals the expected value of their observed scores across an infinite set of replications of the measurement procedure. Given that such an infinite set of replications is hypothetical, a person's true score is unknowable but, as it turns out, not necessarily unestimatable (see Haertel, 2006, pp. 80–82). As noted earlier, true score represents the source(s) of consistency in scores across replicates (note there is no "r" subscript on the true score component in Equation 1.1)—in CTT it is assumed to be a constant for a given person across replicates. Error, on the other hand, is something that varies from replicate to replicate, and CTT hypothesizes that the mean error across the population of replicates for any given person will be zero.

In addition to these characteristics, if we look across persons, CTT hypothesizes that there will be (a) no correlation between true and error score associated with a given replicate (r(T_p, E_{p1}) = 0), (b) no correlation between error scores from different replicates (r(E_{p1}, E_{p2}) = 0), and (c) no correlation between error scores from a given replicate and true scores from another replicate (r(E_{p1}, T_{p2}) = 0). Although the CTT score models do not necessitate that error scores from a given replicate (or composite of replicates) be uncorrelated with scores from measures of other attributes, the latter is a key assumption underlying the use of reliability coefficients to correct observed correlations for attenuation (Schmidt & Hunter, 1996; Spearman, 1910). Essentially, this last assumption implies that inconsistency in a measurement procedure will be unrelated to any external variables (i.e., variables other than our replicates) and therefore beyond meaningful prediction.

From basic statistics we know that the variance of the sum of two independent variables (such as T and E) will simply equal the sum of their variances; thus, under CTT, observed score variance across persons for a given replicate is simply the sum of true score variance and error variance.

σ²_X = σ²_T + σ²_E    (1.2)

As detailed later, reliability estimation attempts to estimate the ratio of σ²_T over (σ²_T + σ²_E), not for a single replicate but rather for a measurement procedure as a whole, which as noted earlier

is often conceived as consisting of multiple replicates. Thus, reliability coefficients are often interpreted as the proportion of observed score variance attributable to true score variance, or alternatively, the expected correlation between observed scores resulting from our measurement procedure and scores that would be obtained had we based our measure on the full population of replicates of interest (i.e., hypothetical true scores). One of the key defining characteristics of CTT is the perspective it takes on replicates. Recall that earlier I offered a very generic definition for what constitutes a replicate. I described how we often conceive of parts or elements of a measurement procedure as replicates and use them to estimate the reliability of scores produced by our procedure as a whole. As noted later, CTT-based reliability estimation procedures assume that replicates have a certain degree of “parallelism.” For example, for two replicates to be considered strictly (or classically) parallel, they must (a) produce identical true scores for a given individual (i.e., Tp for Replicate A = Tp for Replicate B), (b) have identical mean observed scores, and (c) have identical error variances.6 The commonly used Spearman-Brown prophecy formula is an example of a CTT-based estimation procedure that is based on the assumption that replicates involved in its calculation are strictly parallel (Feldt & Brennan, 1989). It is often not realistic to expect any two replicates to be strictly parallel. For example, items on a test of cognitive ability are rarely of the same difficulty level, and raters judging incumbents’ job performance often differ in their level of leniency/severity. Under such conditions, item means (or rater means) would differ, and thus, such replicates would not be considered strictly parallel. In recognition of this, CTT gradually relaxed its assumptions over the years to accommodate the degrees of parallelism that are more likely to be seen in practice. The work of Lord (1955), Lord and Novick (1968), and Joreskog (1971) lays out several degrees of parallelism, which are briefly reviewed below. Tau-equivalent replicates produce identical true scores for a given individual but may have different error variances (across persons) and as such different observed variances. Essentially tau-equivalent replicates relax these assumptions further, in that they allow true scores produced by any given pair of replicates to differ by a constant (i.e., Tp for Replicate 1 = Tp for Replicate 2 + C, where the constant may differ from pair to pair of replicates). As such, essential tau-equivalence accommodates the situation in which there are mean differences across replicates (e.g., items differ in their difficulty, and raters differ in their leniency/severity). The assumption of essential tau-equivalence underlies several types of coefficients commonly used in reliability estimation, such as coefficient alpha, intraclass correlations, and as discussed in the next section, generalizability coefficients.7 One thing that may not be immediately obvious from the description of essential tau-equivalence offered above is that it does not accommodate the situation in which replicates differ in true score variance (across persons). Joreskog’s (1971) notion of congeneric test forms (or more generally, congeneric replicates) accommodated this possibility.
Specifically, the congeneric model allows true scores produced by a given replicate to be a linear function of true scores from another replicate (i.e., Tp for Replicate 1 = b × Tp for Replicate 2 + C). As illustrated in the later section on reliability estimation, this accommodates the possibility that replicates may be differentially saturated with true score variance or be measured on a different metric. The degrees of parallelism discussed above have implications for estimating reliability; more specifically, they have implications for the accuracy of results produced by reliability estimation methods that we apply to any given set of replicates. As discussed later, we can apply nearly any reliability estimation method derived from the classical tradition to any sample of replicates, regardless of their underlying properties; however, the estimate we get will differ in its accuracy depending on (a) the extent to which the underlying properties of those replicates conform to the assumptions above and (b) characteristics of the construct one is attempting to measure. It is beyond the scope of this chapter, and not its intent, to provide a catalog of coefficients that may be appropriate for estimating the reliability depending on the degree of parallelism among the replicates of interest, because excellent descriptions exist elsewhere in the literature (e.g., Feldt & Brennan, 1989, Table 3, p. 115; Raykov & Marcoulides, 2011). However, in reviewing treatments such as the one offered by Feldt and Brennan (1989), be cognizant that the myriad

coefficients they review (including the commonly used Spearman-Brown prophecy and coefficient alpha) were formulated to deal with scores arising from measurement procedures in which (a) replicates were defined by a single facet (e.g., replicates reflect different items or test parts) and (b) that facet was fully crossed with one’s objects of measurement (e.g., all test takers are administered the same set of items, and all test takers completed the same test on two different occasions). As we will see below, application of classical reliability estimation methods in cases in which replicates are multifaceted (e.g., replicates representing task-rater pairs) or cases in which the design underlying one’s measurement procedure is not fully crossed is problematic (Cronbach & Shavelson, 2004). The treatment of reliability for measurement procedures characterized by multifaceted replicates or involving noncrossed measurement designs leads naturally to the introduction of G-theory.
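Because coefficient alpha and related classical estimators come up repeatedly below, a minimal sketch may help fix ideas. The Python code below is illustrative only (simulated data, hypothetical names): it generates essentially tau-equivalent items for a single-facet, fully crossed persons × items design and computes alpha from the usual item-variance formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 1000, 6
true_score = rng.normal(0, 1, n_persons)          # T_p, constant across items

# Essentially tau-equivalent items: equal true-score weight, but
# item-specific means (difficulties) and error variances.
item_means = rng.uniform(-0.5, 0.5, n_items)
error_sd = rng.uniform(0.6, 1.2, n_items)
errors = rng.normal(0, error_sd, (n_persons, n_items))
X = item_means + true_score[:, None] + errors     # X_pi = mu_i + T_p + E_pi

def coefficient_alpha(scores):
    """Cronbach's alpha from item variances and total-score variance."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

print(round(coefficient_alpha(X), 3))
# Under essential tau-equivalence alpha estimates the reliability of the
# 6-item sum; with congeneric items (unequal loadings) it would tend to
# understate that reliability slightly.
```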

Generalizability Theory G-theory liberalizes CTT in that it has mechanisms within its score models for (a) dealing with single-faceted and multifaceted replicates, (b) simultaneously differentiating and estimating multiple sources of error arising from different measurement facets (e.g., items, raters, occasions, tasks), (c) dealing with scores produced by a wide variety of data collection designs (e.g., crossed, nested, and ill-structured measurement designs), (d) adjusting the composition of true score and error depending on the generalizations one wishes to make regarding the scores, (e) adjusting the composition of true score and error depending on how one intends to use the scores (e.g., for relative comparison among applicants or absolute comparison of their scores to some set standard), and (f) relaxing some of the assumptions put on the distributional properties of true and error components proscribed under CTT. The purpose of this section will be to elaborate these features of G-theory model in a way that is relatively free from G-theory jargon, which has been cited as one reason why the unifying perspective that G-theory offers on reliability has yet to be widely adopted by organizational researchers (DeShon, 2002). Perhaps the most visible way G-theory model liberalizes the CTT model is its ability to handle measurement procedures comprising multifaceted replicates. To illustrate this key difference between the G-theory and CTT models, let us first consider an example in which we have observed scores based on ratings of job applicants’ responses to three interview questions designed to assess interpersonal skill. Say that we had the same three raters interview each applicant and that each rater asked applicants the same three questions (i.e., applicants, raters, and questions are fully crossed). Thus, we have nine scores for each applicant—one for each of our nine “replicates,” which in this case are defined by unique question-rater combinations. Under CTT and G-theory, we might conceive of an applicant’s true score as the expected value of his/her observed score across the population of replicates—in this case it is the population of raters and questions. However, if we were to apply the CTT score model to such replicates, it would break down because it does not account for the fact that some replicates share a rater in common and other replicates share a question in common. As such, the error associated with some replicates will be correlated across applicants, therefore violating one of the key assumptions underlying CTT measurement model (i.e., errors associated with different replicates are uncorrelated). As shown below (cf. Equation 1.4), the G-theory measurement model permits the addition of terms to the model that account for the fact that replicates are multifaceted. The insidious part of this illustration is that the situation above would not prevent us from applying estimation methods derived from CTT to these data (e.g., calculating coefficient alpha on the nine replicates). Rather, perhaps unbeknownst to the investigator, the method would allocate error covariance among replicates that share a rater or question in common to true score variance because they are a source of consistency across at least some of the replicates (Komaroff, 1997; Raykov, 2001a). 
That is, the CTT score model and commonly used coefficients derived from it (e.g., coefficient alpha) are blind to the possibility of multifaceted replicates, which is a direct reflection of the fact that early measurement theory primarily concerned itself with fully crossed, single-faceted measurement designs (Cronbach & Shavelson, 2004).
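To see how this plays out numerically, the following sketch (Python; simulated data with illustrative variance components, not values taken from any real interview) builds scores for the fully crossed applicants × questions × raters design described above, applies the coefficient alpha formula to the nine question-rater columns, and compares the result to the generalizability coefficient implied by the components used to generate the data (cf. Equation 1.6 below).

```python
import numpy as np

rng = np.random.default_rng(7)
n_app, n_q, n_r = 5000, 3, 3

# Illustrative (assumed) variance components for the crossed design
var_p, var_pq, var_pr, var_res = 0.50, 0.30, 0.10, 0.10

p  = rng.normal(0, var_p ** 0.5,  n_app)
pq = rng.normal(0, var_pq ** 0.5, (n_app, n_q))
pr = rng.normal(0, var_pr ** 0.5, (n_app, n_r))
e  = rng.normal(0, var_res ** 0.5, (n_app, n_q, n_r))

# X_pqr = person + person x question + person x rater + residual
X = p[:, None, None] + pq[:, :, None] + pr[:, None, :] + e
cols = X.reshape(n_app, n_q * n_r)        # nine question-rater "replicates"

k = cols.shape[1]
alpha = (k / (k - 1)) * (1 - cols.var(0, ddof=1).sum()
                         / cols.sum(1).var(ddof=1))

# Generalizability coefficient implied by the generating components
rho2 = var_p / (var_p + var_pq / n_q + var_pr / n_r + var_res / (n_q * n_r))

print(f"alpha over 9 question-rater columns: {alpha:.2f}")
print(f"generalizability coefficient:        {rho2:.2f}")
# alpha comes out noticeably higher because person x question and
# person x rater covariance is treated as consistency (true score).
```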

To account for the potential that replicates can be multifaceted, G-theory formulates its measurement model from a random-effects ANOVA perspective. Unlike CTT, which has its roots in the correlational research tradition characteristic of Spearman and Pearson, G-theory has its roots in the experimental research tradition characteristic of Fisher (1925). As such, G-theory is particularly well suited to dealing with replicates that are multifaceted in nature and both crossed and noncrossed measurement designs. It has long been acknowledged that issues of measurement design have been downplayed and overlooked in the correlational research tradition (Cattell, 1966; Cronbach, 1957), and this is clearly evident in reliability estimation approaches born out of CTT (Cronbach & Shavelson, 2004; Feldt & Brennan, 1989). To ease into the G-theory measurement model, I start with a simple example, one in which we assume observed scores are generated by a replicate of a measurement procedure that is defined along only one facet of measurement. The observed score (X) for a given person p that is produced by any given replicate defined by measurement facet “A” (e.g., “A” might reflect items, occasions, raters, tasks, etc.) is assumed to be an additive function: Xpa = b0 + up + ua + upa + epa(1.3) where b0 is the grand mean score across persons and replicates of facet A; up is the main effect of person p and conceptually the expected value of p’s score across the population of replicates of facet A (i.e., the analogue of true score); ua represents the main effect of replicate a and conceptually is the expected value of a’s score across the population of persons; upa represents the p × a interaction effect and conceptually reflects differences in the rank ordering of persons across the population of replicates of facet A; and lastly, epa is the residual error that conceptually is left over in Xpa after accounting for the other score effects.8 As with common random-effects ANOVA assumptions, these score effects are assumed to (a) have population means of zero, (b) be uncorrelated, and (c) have variances of σ2P, σ2A, σ2PA, and σ2Residual, respectively (Jackson & Brashers, 1994). The latter variance components are the focus of estimation efforts in G-theory, and they serve as building blocks of reliability estimates derived by G-theory. Of course, the example above is introduced primarily for pedagogical purposes; the real strength of the random-effects formulation is that the model above is easily extended to measurement procedures with multifaceted replicates (e.g., replicates that reflect question-rater pairs). For example, the observed score (X) for a given person p that is produced by any given replicate defined by measurement facets “A” and “B” (e.g., “A” might reflect questions and “B” might reflect raters) is assumed to be an additive function. Xpab = b0 + up + ua + ub + upa + upb + uab + upab + epab(1.4) A key difference to point out between the models specified in Equations 1.3 and 1.4 is the interpretation of the main effects for individuals. Once again, up is the main effect of person p, but conceptually it is the expected value of p’s score across the population of replicates defined by facets A and B. Thus, although the up term in Equations 1.3 and 1.4 provides an analogue to the true score, the substance of true scores differs depending on the nature of the population(s) of replicates of interest.
Extending this model beyond two facets (e.g., a situation in which replicates are defined as a combination of questions, raters, and occasions) is straightforward and simply involves adding main effect terms for the other facets and associated interaction terms (Brennan, 2001b). One thing that is evident from the illustration of the G-theory model provided above is that, unlike the CTT model, it is scalable; that is, it can expand or contract depending on the degree to which replicates underlying a measurement procedure are faceted. Given its flexibility to expand beyond simply a true and error component, the G-theory model potentially affords investigators with several more components of variance to consider relative to the CTT model. For example, using the interview example presented above, we could potentially decompose variance in interview scores for applicant p on question q as rated by rater r into seven components.9 σ2X = σ2P + σ2Q + σ2R + σ2PQ + σ2PR + σ2QR + σ2PQR, Residual(1.5)
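For readers who want to see where the pieces of Equation 1.5 come from, the sketch below (Python) estimates the seven variance components for a balanced, fully crossed persons × questions × raters design using the standard random-effects ANOVA expected-mean-square equations. The function name, array layout, and simulated values are illustrative; comparable estimates could be obtained with the REML routines in common statistical packages.

```python
import numpy as np

def g_study_pqr(X):
    """ANOVA-based variance component estimates for a balanced,
    fully crossed persons x questions x raters design (Equation 1.5)."""
    n_p, n_q, n_r = X.shape
    grand = X.mean()
    mp, mq, mr = X.mean((1, 2)), X.mean((0, 2)), X.mean((0, 1))   # main-effect means
    mpq, mpr, mqr = X.mean(2), X.mean(1), X.mean(0)               # two-way cell means

    # Sums of squares for each effect
    ss_p = n_q * n_r * ((mp - grand) ** 2).sum()
    ss_q = n_p * n_r * ((mq - grand) ** 2).sum()
    ss_r = n_p * n_q * ((mr - grand) ** 2).sum()
    ss_pq = n_r * ((mpq - mp[:, None] - mq[None, :] + grand) ** 2).sum()
    ss_pr = n_q * ((mpr - mp[:, None] - mr[None, :] + grand) ** 2).sum()
    ss_qr = n_p * ((mqr - mq[:, None] - mr[None, :] + grand) ** 2).sum()
    ss_res = ((X - mpq[:, :, None] - mpr[:, None, :] - mqr[None, :, :]
               + mp[:, None, None] + mq[None, :, None] + mr[None, None, :]
               - grand) ** 2).sum()

    # Mean squares
    ms = {
        "p": ss_p / (n_p - 1), "q": ss_q / (n_q - 1), "r": ss_r / (n_r - 1),
        "pq": ss_pq / ((n_p - 1) * (n_q - 1)),
        "pr": ss_pr / ((n_p - 1) * (n_r - 1)),
        "qr": ss_qr / ((n_q - 1) * (n_r - 1)),
        "res": ss_res / ((n_p - 1) * (n_q - 1) * (n_r - 1)),
    }

    # Solve the expected-mean-square equations for the variance components
    vc = {"pqr,res": ms["res"]}
    vc["pq"] = (ms["pq"] - ms["res"]) / n_r
    vc["pr"] = (ms["pr"] - ms["res"]) / n_q
    vc["qr"] = (ms["qr"] - ms["res"]) / n_p
    vc["p"] = (ms["p"] - ms["pq"] - ms["pr"] + ms["res"]) / (n_q * n_r)
    vc["q"] = (ms["q"] - ms["pq"] - ms["qr"] + ms["res"]) / (n_p * n_r)
    vc["r"] = (ms["r"] - ms["pr"] - ms["qr"] + ms["res"]) / (n_p * n_q)
    return vc

# Illustrative use: 500 applicants x 3 questions x 3 raters with known effects
rng = np.random.default_rng(4)
P, Q, R = 500, 3, 3
X = (rng.normal(0, .7, P)[:, None, None]     # person effects   (var .49)
     + rng.normal(0, .4, Q)[None, :, None]   # question effects (var .16)
     + rng.normal(0, .3, R)[None, None, :]   # rater effects    (var .09)
     + rng.normal(0, .5, (P, Q, R)))         # residual         (var .25)
print({key: round(val, 2) for key, val in g_study_pqr(X).items()})
# Components for facets with only three levels are, of course, quite noisy.
```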

Recall from the earlier discussion of CTT that the basic form reliability coefficients take on is σ2T/(σ2T + σ2E). This raises the question, from the G-theory perspective, what sources of variance comprise σ2T and σ2E? As one might guess from the decomposition above, the G-theory model offers researchers a great deal of flexibility when it comes to specifying what constitutes error variance and true score variance in any given situation. As demonstrated in the following sections, having this flexibility is of great value. As alluded to in the opening paragraph of this section, the sources of variance in scores that are considered to reflect error (and true score for that matter) can differ depending on (a) the generalizations an investigator wishes to make regarding the scores, (b) how an investigator intends to use the scores (e.g., for relative comparison among applicants or absolute comparison of their scores to some set standard), and (c) characteristics of the data collection or measurement design itself, which can limit an investigator’s ability to estimate various components of variance. The idea of having flexibility in specifying what components of observed variance contribute to true score and error is something that is beyond the CTT score model because it only partitions variance into two components. The following sections highlight how the G-theory model offers investigators flexibility for tailoring the composition of σ2T and σ2E to their situation.

Dependency of σ2T and σ2E on Desired Generalizations The decision of what components of variance comprise σ2T and σ2E depends in part on the generalizations the investigator wishes to make based on the scores. To illustrate this, let us take the interview example offered above and say that the investigator was interested in (a) generalizing scores from his or her interview across the population of questions and raters and (b) using the scores to make relative comparisons among applicants who completed the interview. In such a case, variance associated with applicant main effects (σ2P) would comprise σ2T, and variance associated with interactions between applicants and each type of measurement facet (i.e., applicant-question interaction variance, σ2PQ; applicant-rater interaction variance, σ2PR; and applicant-question-rater interaction variance and residual variance, σ2PQR,Residual) would comprise σ2E. The relative contribution of these latter effects to error variance would be scaled according to the number of questions and raters involved in the measurement procedure. As the number of questions increases, the contribution of σ2PQ would go down (i.e., error associated with questions would be averaged away), and as the number of raters increases, the contribution of σ2PR would go down (i.e., error associated with raters would be averaged away). Specifically, the “generalizability” coefficient described above would be

Eρ2 = σ2P / (σ2P + [σ2PQ/nQ + σ2PR/nR + σ2PQR,Residual/(nQnR)]) = σ2T / (σ2T + σ2E)(1.6)

where the term in brackets represents σ2E, nQ is the number of interview questions, nR is the number of raters, and nQnR is the product of the number of questions and raters.10 Note that increasing the number of questions and/or raters will result in decreasing that part of error associated with questions and/or raters, respectively. The idea that G-theory allows for the scaling of these effects as a function of the number of questions and raters sampled is analogous to the role of the Spearman-Brown prophecy in CTT, in which the number of replicates that comprise a measurement procedure directly affects the estimated reliability of scores produced by that procedure (Feldt & Brennan, 1989). The key difference here is that G-theory allows one to differentiate and examine the effect that adjusting the sampling of different types of facets has for reliability (e.g., separately adjusting the number of questions and raters), whereas the Spearman-Brown prophecy does not allow such differentiation to occur. As such, applying the Spearman-Brown prophecy to estimate what the reliability of scores would be if the length of a measure is changed can greatly mislead investigators if the replicates that comprise that measure are multifaceted (Feldt & Brennan, 1989).

To illustrate, let us take the interview example offered above and say σ2P = .50, σ2PQ = .30, σ2PR = .10, and σ2PQR,Residual = .10. Recall our interview comprises three questions and three raters (i.e., nine question-rater pairs serve as replicates). Using Equation 1.6, the estimated reliability of the average rating across questions and raters would be .78 (σ2T = .50, σ2E = .14). Now, if we were to ask what effect “doubling the length of the interview” would have on reliability, and we used the Spearman-Brown prophecy (i.e., 2Eρ2/[1 + Eρ2]) to answer that question, we would achieve an estimate of .88, which is analogous to what we achieve if we replaced nQ, nR, and nQnR in Equation 1.6 with 2nQ, 2nR, and 2nQ2nR. Note that the Spearman-Brown prophecy does not provide the estimated reliability for 18 question-rater pairs (i.e., double the existing number of replicates), but rather an estimated reliability for 36 question-rater pairs (i.e., six questions × six raters). As such, in this case, the Spearman-Brown formula gives us an estimated reliability if the effective length of the interview were quadrupled rather than doubled. Another shortcoming of the Spearman-Brown formula is that it fails to account for the fact that there are multiple ways one can effectively double the length of the interview, each of which may produce a different reliability estimate. For example, we can have two questions and nine raters, which would give us 18 question-rater pairs and result in an average rating reliability of .75 on the basis of Equation 1.6. Alternatively, we can have nine questions and two raters, which would also give us 18 question-rater pairs but result in an average rating reliability of .85 on the basis of Equation 1.6. Essentially, there is no mechanism within the Spearman-Brown formula that accounts for the fact that facets may differentially contribute to error. As this example illustrates, making adjustments to the number of levels sampled for one facet (e.g., questions in this case) may have a much more profound effect on error than making adjustments to the number of levels sampled for other facets (e.g., raters) included in the design. Returning to the discussion of the dependency of σ2T and σ2E on the generalizations one wishes to make regarding their scores, let us now say that a different investigator uses the same interview procedure described above, but instead only wished to generalize scores from the procedure across the population of raters. For example, this might be the case if the investigator feels that the questions get at different parts of the interpersonal skill construct, and as such does not wish to treat inconsistency in scores across questions (for a given applicant) as error. In such a case, variance associated with applicant main effects (σ2P) and a function of applicant-question interaction effects (σ2PQ) would comprise σ2T, and variance associated with applicant-rater interactions (σ2PR) and the applicant-question-rater interaction along with residual error (σ2PQR,Residual) would comprise σ2E (Brennan, 2001b; DeShon, 2002). In this situation, the investigator is essentially examining the consistency of scores across raters on the basis of ratings that have been averaged across the three interview questions—in G-theory this is known as fixing a facet of measurement.11
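The arithmetic behind these figures is easy to verify. The short sketch below (Python; the function name is illustrative) simply re-computes Equation 1.6 for the variance components given above under each of the staffing configurations discussed.

```python
def ghat(var_p, var_pq, var_pr, var_res, n_q, n_r):
    """Generalizability coefficient for the average rating (Equation 1.6)."""
    error = var_pq / n_q + var_pr / n_r + var_res / (n_q * n_r)
    return var_p / (var_p + error)

vc = dict(var_p=.50, var_pq=.30, var_pr=.10, var_res=.10)

print(round(ghat(**vc, n_q=3, n_r=3), 2))  # 0.78  (3 questions x 3 raters)
print(round(ghat(**vc, n_q=6, n_r=6), 2))  # 0.88  (Spearman-Brown "doubling" = 6 x 6)
print(round(ghat(**vc, n_q=2, n_r=9), 2))  # 0.75  (18 pairs: 2 questions x 9 raters)
print(round(ghat(**vc, n_q=9, n_r=2), 2))  # 0.85  (18 pairs: 9 questions x 2 raters)
```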

Dependency of σ2T and σ2E on Intended Use of Scores Slightly varying the example above allows for illustration of the implications of how an investigator intends on using scores for the sources of variance that contribute to σ2T and σ2E. For example, let us say the interview above was conducted to determine if applicants met some minimum level of interpersonal skill. That is, rather than comparing applicants against one another, the interest is in comparing their scores to some standard of interpersonal skill. Also, let us return to the original example in which the investigator was interested in generalizing scores across the population of questions and raters. In this case, variance due to the main effects of questions and raters, as well as their interaction (i.e., σ2Q, σ2R, σ2QR), would contribute σ2E (in addition to sources identified earlier, σ2PQ, σ2PR, σ2PQR,Residual) because they influence the absolute magnitude of the score any given applicant receives. In the example from the previous paragraphs in which we were only interested in using scores to make relative comparisons among applicants, these effects did not contribute to error because they have no bearing on how applicants were rank ordered (i.e., question and rater main effects are constants across applicants for designs in which questions and raters are fully crossed with applicants). The potential for

such effects to contribute to σ2E in crossed designs (as they do in this example) is not addressed by CTT, because it is simply beyond the scope of the CTT model to handle error of that type (Cronbach & Shavelson, 2004).
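A small numeric sketch may make the relative/absolute distinction concrete. The Python code below reuses the person-related components from the earlier example and adds assumed values for the question, rater, and question × rater components (the .20, .15, and .05 figures are illustrative, not taken from the text); the coefficient for absolute decisions, often denoted Φ in the G-theory literature, folds those additional components into error.

```python
# Person-related components from the example above ...
var_p, var_pq, var_pr, var_res = .50, .30, .10, .10
# ... plus assumed values for question, rater, and question x rater effects
var_q, var_r, var_qr = .20, .15, .05

n_q, n_r = 3, 3

rel_error = var_pq / n_q + var_pr / n_r + var_res / (n_q * n_r)
abs_error = rel_error + var_q / n_q + var_r / n_r + var_qr / (n_q * n_r)

print(f"relative (rank-order) coefficient: {var_p / (var_p + rel_error):.2f}")
print(f"absolute (cut-score) coefficient:  {var_p / (var_p + abs_error):.2f}")
# Question and rater main effects lower the absolute coefficient because they
# shift the level of the score an applicant receives, even though they leave
# the rank ordering of applicants untouched in a fully crossed design.
```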

Dependency of σ2T and σ2E on Characteristics of the Measurement Procedure Critics may argue that the interview examples offered above do not reflect the reality of measurement designs faced in applied organizational research and practice. Such critics would be right. Rarely, if ever, are the measurement designs involving ratings that we confront in the applied organizational research and practice fully crossed. When we are fortunate to have two or more raters for each ratee, the orientation of raters to ratees is often what Putka, Le, McCloy, and Diaz (2008) have termed “ill-structured”.12 Specifically, the sets of raters that rate each ratee are neither identical (indicative of a fully crossed design) nor completely unique (indicative of a design in which raters are nested with ratees); rather, each ratee is rated by a set of raters that may vary in their degree of overlap. The implications of the departure of measurement designs from the fully crossed ideal is that it can limit our ability to uniquely estimate the components of variance that underlie observed scores (e.g., those illustrated in Equation 1.5), which in turn limits our flexibility for choosing which components contribute σ2T and σ2E. To illustrate this, let’s consider a few variants on the interview example above. Say that instead of having three raters rate each applicant on each interview question, a different nonoverlapping set of three raters rates each applicant (i.e., raters are nested within applicants). In this case, rater main effect variance (σ2R) and applicant-rater interaction effect variance (σ2PR) would be inseparable, and both will contribute to σ2E regardless of whether the investigator was interested in using the scores simply to rank order applicants or compare applicants’ scores to some fixed standard (McGraw & Wong, 1996; Shrout & Fleiss, 1979). However, often in practice we are not dealt such nested designs—the sets of raters that may rate each ratee tend to vary in their degree of overlap. Although less “clean” than the aforementioned nested design, having some degree of overlap actually gives us an opportunity to uniquely estimate σ2R and σ2PR (as we are able to do in a fully crossed design) (Putka et al., 2008). Nevertheless, as was the case with the nested design, σ2R and σ2PR will contribute to σ2E, because the raters that rate each ratee are not identical, σ2R and σ2PR will affect the rank ordering of ratees’ scores (Schmidt et al., 2000). However, unlike the nested design, the contribution of σ2R to σ2E will be dependent on the amount of overlap between the sets of raters that rate each ratee—a subtlety not widely known but pertinent to many organizational researchers who work with ratings (Putka et al., 2008). Lastly, let’s use the previous interview example one more time to provide a critical insight offered by G-theory—the notion of hidden measurement facets and their implications for interpreting the substantive nature of σ2T and σ2E. In laying out the interview example above, it was implicit that raters conducted interviews on separate occasions. However, a more common situation might be that raters sit on a panel, and as such the three questions are asked of a given applicant on the same occasion. In either case, we have measurement procedures with designs that are “notationally” identical (i.e., applicants × questions × raters); however, the variance components underlying scores produced by these interview procedures have different substantive meanings. 
If each rater conducted a separate interview, variance attributable to the applicant-rater interaction (σ2PR) would also reflect applicant-occasion variance (σ2PO). In other words, σ2PR would not only reflect inconsistencies in raters’ rank ordering of applicants, but also inconsistency in the applicants’ responses across the occasions on which the interviews were conducted. If the applicant participated in the panel interview, raters would be rating the applicants’ responses on the same occasion, and as such variance attributable to the applicant-rater interaction (σ2PR) would be just that, but variance attributable to applicant main effects (σ2P) would also reflect applicantoccasion variance (σ2PO). This stems from the fact that raters are observing a given applicant on the same occasion, and as such occasion of measurement serves as a source of consistency in raters’ ratings that would not be present if raters conducted separate interviews. In both of the examples above, σ2PO is not separable from the other source of variance with which it is confounded. In the case of separate interviews, raters covary with occasions; in the case of panel

interviews, occasions are not replicated for a given applicant. Thus, these examples illustrate how a measurement facet can hide in different ways to influence the substantive meaning of σ2E (in the case of the separate interview) and σ2T (in the case of the panel interviews). The examples above also illustrate an important point—just because we cannot isolate or estimate a source of variance underlying observed scores does not mean those sources of variance are not present and influencing our scores (Brennan, 2001b; DeShon, 1998; Feldt & Brennan, 1989; Schmidt & Hunter, 1996). Indeed, it is interesting to take the concept of hidden facets and use them to frame some common measurement issues in personnel selection. For example, the magnitude of person-rater interaction variance (σ2PR) in job performance ratings has been found to be quite large (e.g., Schmidt et al., 2000; Scullen et al., 2000). However, if raters are viewing the performance of individuals on (a) different occasions, (b) different tasks, and/or (c) different tasks on different occasions, then part of what we typically label person-rater interaction variance may actually also reflect several other sources of variance (e.g., person-occasion interaction variance, person-task interaction variance, and person-task-occasion interaction variance). In other words, the hidden facets of occasion and task might help explain the sizable person-rater interaction effects often found in job performance ratings. In the context of assessment centers, hidden facets might partially explain the common finding of the dominance of exercise effects over dimension effects (Lance, 2008). For example, dimensions within exercises share an occasion of measurement in common (and sometimes share raters as well), whereas dimensions in different exercises do not. As such, all else being equal we would expect scores for dimensions within exercises to be more consistent with each other than with scores for dimensions in different exercises. Thus, what is interpreted as an exercise effect in the context of assessment center ratings may partially be explained by hidden occasion and rater facets of measurement that increase consistency among dimension scores within exercises relative to dimension scores across exercises (e.g., Cronbach, Linn, Brennan, & Haertel, 1997). These examples illustrate the potential utility of framing common measurement issues through the lens of hidden facets illuminated by G-theory.

Summary This section described perspectives on observed scores adopted by two measurement theories that dominate current discussions of reliability. Through a single example, I illustrated the many ways in which G-theory liberalizes not only the score model offered by CTT but also the perspective it offers on reliability. By no means did the discussion fully illustrate how G-theory is applied or how reliability coefficients based on G-theory are calculated. For such details, the reader is referred to other treatments (Brennan, 2001b; DeShon, 2002; Haertel, 2006; Shavelson & Webb, 1991). Nor did this section illustrate how very different conclusions regarding the reliability of scores can be depending on (a) the generalizations an investigator wishes to make regarding those scores, (b) how an investigator intends to use those scores (e.g., for relative comparison among applicants or absolute comparison of their scores to some set standard), or (c) the measurement design the investigator uses to gather data that gives rise to the scores (see Putka & Hoffman [2013, 2015] for concrete illustrations of the consequences of these decisions for the magnitude of reliability estimates for assessment center and job performance ratings, respectively). Nevertheless, given space constraints, this was not my intent. Rather, I tried, in a way that was relatively free of G-theory jargon, to show how G-theory offers a way for framing and dealing with measurement situations that CTT was designed to handle, as well as those that CTT was never really designed to handle—a key reason why G-theory currently underlies modern perspectives on reliability (Cronbach & Shavelson, 2004).

ESTIMATION OF RELIABILITY The previous sections outlined conceptual and model-based perspectives on reliability and measurement error. This section addresses how these concepts and models translate into methods for estimating reliability. Reliability is often summarized in terms of (a) coefficients ranging

from 0 to 1 or (b) standard errors of measurement (SEMs) expressed in a raw score metric. My focus is on the former, partly out of page limits and partly because the latter can typically be calculated from components of the former.13 As noted earlier, under CTT and G-theory the goal of reliability estimation is to estimate the ratio σ2T / (σ2T + σ2E). The following sections discuss methods for estimating this ratio and components of it. My intent here is not to provide a catalog of different types of reliability coefficients, nor is my intent to provide a cookbook on how to estimate reliability in any given situation. Indeed, as should be clear from the previous section, doing so would not be fruitful given that the composition of σ2T and σ2E in any given situation partly reflects the aims of the individual investigator. Rather, I focus on comparing and contrasting different historical traditions on estimating reliability, examining the pros and cons of each, and speaking to their equifinality under certain conditions. The extant literature on reliability estimation is characterized by a multitude of loosely organized coefficients and estimation methods. Historically, the psychometric literature tended to organize discussions of reliability estimation in terms of categories or types of reliability (e.g., test-retest reliability, split-half, parallel-forms, coefficients of equivalence, stability, precision; Cronbach, 1947; Gulliksen, 1950). With the advent of G-theory, psychometricians have slowly gravitated away from categories or types of coefficients that characterized early test theory because “the categories may now be seen as special cases of a more general classification, generalizability coefficients” (AERA et al., 1999, p. 27). As Campbell (1976) noted, the G-theory model “removes the somewhat arbitrary distinctions among coefficients of stability, equivalence, and internal consistency and replaces them with a general continuum of representativeness” (p. 202). Interestingly, this movement toward a unitarian perspective on reliability has temporally coincided with the movement from trinitarian to unitarian perspectives on validity (Brennan, 2006). Ironically, unlike our views on validity, our formal treatments of reliability estimation in organizational research have remained focused on categories or types of reliability coefficients (e.g., Aguinis, Henle, & Ostroff, 2001; Guion, 1998; Le & Putka, 2007; Ployhart, Schneider, & Schmitt, 2006; Schmidt & Hunter, 1996).14 Rather than continuing to bemoan the current state of affairs, I offer an alternative way of framing discussions of estimating reliability that may help bring organizational research, practice, and pedagogy more in line with modern psychometric thought. Before doing so, I offer a quick example to help illustrate the rationale behind the structure offered below. When calculating existing types of reliability coefficients, such as a simple Pearson correlation calculated between two replicates (Brown, 1910; Spearman, 1910), coefficient alpha (Cronbach, 1951), or intraclass correlation (ICC; Shrout & Fleiss, 1979)—with which most investigators are familiar—it is important to remember that these are just sets of mathematical operations that can be applied to any set of replicates of our choosing (e.g., raters, items, tasks, occasions). They will all produce, to varying degrees of quality (depending on the properties of the underlying data and construct being measured), estimates of the ratio σ2T/(σ2T + σ2E).
As noted above, the substantive meaning of σ2T and σ2E will depend in large part on the types of replicates to which the mathematical operations are applied. For example, if we apply them to replicates defined as items, σ2T will reflect consistency across items; if we apply them to replicates defined as occasions, σ2T will reflect consistency across occasions; if we apply them to replicates defined as raters, σ2T will reflect consistency across raters; and so on and so forth.15 Unfortunately, our literature has a tendency to associate certain types of coefficients with certain types of replicates (e.g., coefficient alpha with items, ICCs with raters). This is unfortunate and misleading, because this simply reflects the type of replicate with which these procedures happened to be introduced by earlier authors. Computationally, the procedures are blind to the types of replicates to which they are applied, and many are algebraically identical (Cronbach & Shavelson, 2004; Feldt & Brennan, 1989). For example, alpha is a specific type of ICC, and all ICCs can be framed as generalizability coefficients (Brennan, 2001b; McGraw & Wong, 1996). The following discussion is organized around three traditions for estimating reliability. The classical tradition largely attempts to estimate reliability directly, with little attention toward estimating components of it. More modern traditions (e.g., those based on random-effects models and CFA models) attempt to generate estimates of σ2T and σ2E, or components of them, which gives investigators flexibility to combine components in different ways to calculate reliability estimates appropriate for

their situation and achieve a better understanding of the sources of error (and true score) in their measures.
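The point that these coefficients are just sets of mathematical operations can be checked directly. The sketch below (Python, arbitrary simulated data) computes coefficient alpha from the familiar item-variance formula and then the ANOVA-based formulation 1 − MSresidual/MSpersons (equivalently, an average-measures consistency ICC) from the same persons × replicates layout; the two agree to rounding error, which is one concrete sense in which alpha is a specific type of ICC.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 400, 5
X = rng.normal(0, 1, (n, 1)) + rng.normal(0, 1, (n, k))  # arbitrary person x replicate data

# (1) Coefficient alpha from item variances and total-score variance
alpha = (k / (k - 1)) * (1 - X.var(0, ddof=1).sum() / X.sum(1).var(ddof=1))

# (2) ANOVA-based formulation: 1 - MS_residual / MS_persons
grand = X.mean()
ss_p = k * ((X.mean(1) - grand) ** 2).sum()
ss_i = n * ((X.mean(0) - grand) ** 2).sum()
ss_res = ((X - grand) ** 2).sum() - ss_p - ss_i
ms_p = ss_p / (n - 1)
ms_res = ss_res / ((n - 1) * (k - 1))
icc = 1 - ms_res / ms_p

print(round(alpha, 6), round(icc, 6))  # identical to rounding error
```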

Classical Tradition This classical tradition has its roots in using Pearson correlation between two replicates (e.g., split-halves of a single test, tests administered on two different occasions) to estimate reliability (Brown, 1910; Spearman, 1910). It is based on the premise that the correlation between two strictly parallel replicates (e.g., split-halves of a test, the same test administered on two occasions) equals the proportion of observed score variance attributable to true scores from a single replicate. If applied to split-halves of a test, the Spearman-Brown prophecy formula would then be used to “step-up” the said correlation to arrive at an estimate of reliability for scores produced by the full test. The primary strength of estimation methods based on this tradition is their simplicity and widespread familiarity. Pearson correlations are easy to calculate and widely used in selection research and practice (Schmidt & Hunter, 1996). Early psychometricians realized that the Spearman-Brown approach described above becomes unwieldy in situations dealing with more than two replicates (e.g., a 10-item conscientiousness scale). Specifically, they realized that depending on which split-halves of their test they calculated their correlation on, they would get a different estimate of reliability (Kuder & Richardson, 1937). In light of this difficulty, researchers developed alternative approaches to estimating reliability that were a function of replicate variances (e.g., item variances) and observed score variances (e.g., variance of the full test). These approaches provided a computationally simple solution that could easily accommodate measures involving two or more single-faceted replicates and are reflected in Kuder and Richardson’s (1937) KR-20, Guttman’s (1945) set of lambda coefficients, and Cronbach’s coefficient alpha (Cronbach, 1951).16, 17 Another positive characteristic of these latter approaches relative to the Spearman-Brown prophecy is that they only necessitate replicates be essentially tau-equivalent, as opposed to strictly parallel (Novick & Lewis, 1967), although subsequent research has found that alpha is robust to violations of essential tau-equivalence (Haertel, 2006). Unfortunately, all of the classical estimation approaches described above, from SpearmanBrown through coefficient alpha, are limited in some important ways. As noted earlier, the CTT model on which these coefficients are based was developed for use with measurement procedures involving single-faceted replicates that were fully crossed with one’s objects of measurement (Cronbach & Shavelson, 2004). The simplicity of calculating a Pearson r, the Spearman-Brown prophecy, and alpha belies interpretational and statistical problems that arise if one attempts to apply them to replicates that are (a) not fully crossed with one’s objects of measurement or (b) multifaceted in nature. As Cronbach and Shavelson (2004) noted in their discussion of the possibility of applying alpha to replicates that are not fully crossed, “Mathematically, it is easy enough to substitute scores from a nested sample matrix by simply taking the score listed first for each (person) as belonging in Column 1, but this is not the appropriate analysis” (p. 400). Nevertheless, application of such classical estimation methods, regardless of a procedure’s underlying design, has been common practice in organizational research (Viswesvaran, Schmidt, & Ones, 2005). 
To illustrate the problems that arise when classical estimation methods are applied to measurement designs that are not fully crossed, consider an example in which job incumbents are each rated by two raters on their job performance. Some incumbents may share one or more raters in common, whereas others may share no raters in common. In this case, standard practice is to (a) randomly treat one rater for each ratee as “rater 1” and the other as “rater 2,” (b) assign the ratings of “rater 1” to column 1 and the ratings of “rater 2” to column 2 in a data set, (c) calculate the Pearson correlation between columns to estimate the reliability of a single-rater’s ratings, and then (d) use the Spearman-Brown prophecy on the said correlation to estimate the reliability for the average rating (Viswesvaran et al., 2005). Putka et al. (2008) elaborated on several problems with this common practice, namely (a) the estimates derived from this process can differ depending on the assignment of raters to columns 1 and 2 for each ratee; (b) Pearson

Dan J. Putka r fails to account for the fact that residual errors are nonindependent for ratees who share one or more raters in common, which leads to a downward bias in estimated true score variance (σ2T) (Kenny & Judd, 1986); and (c) the Spearman-Brown prophecy inappropriately scales the contribution of rater main effect variance to error variance (σ2E) as a function of the number of raters per ratee, rather than the amount of overlap between sets of raters that rate each ratee, leading to an overestimate of σ2E (see also Brennan, 2001b, p. 236). In addition to these points, Putka and his colleagues offer a solution for dealing with this type of design that is based on the random-effects model tradition of estimating reliability (discussed later). Second, with regard to the problem of applying classical methods to multifaceted replicates, the task-rater example presented earlier clearly showed the hazards of blindly applying alpha to replicates of such nature. However, it would be a fallacy to suggest that investigators who adopt classical methods would actually apply alpha or other classical methods in such a manner. Indeed, early psychometricians seemed acutely aware of the limitations of the CTT model, and they attempted to deal with the inability of the CTT model to account for multifaceted replicates by calculating different types of coefficients. For example, Cronbach (1947) discussed the coefficient of equivalence and stability (CES), which was calculated by correlating two different forms of a measure completed by the same respondents on two different occasions (i.e., replicates defined by form-occasion combinations). Cronbach later realized that emergence of the G-theory score model in the 1960s eliminated the need to “mix and match” pairs of replicates like this and provided a generalized solution that applied regardless of whether one was dealing with single-faceted or multifaceted replicates and regardless of whether one was dealing with crossed or noncrossed designs (Cronbach & Shavelson, 2004). Although the tradition of using coefficients such as CES to deal with multifaceted replicates has faded in psychometrics, it has continued to characterize organizational research and practice, because we have continued to frame problems of reliability in a way that, for better or worse, resembles the psychometric literature of the 1940s. For example, Schmidt and his colleagues have demonstrated how, in the context of fully crossed designs, one can calibrate different sources of error in scores (e.g., error arising from inconsistencies across items, occasions, raters, etc.) through the addition and subtraction of Pearson correlations applied to different types of replicates (Schmidt et al., 2000). Indeed, for fully crossed designs, Schmidt and others illustrated how one can arrive at estimates for at least some of the variance components estimable based on the random-effects model underlying G-theory (Le, Schmidt, & Putka, 2009; Schmidt, Le, & Ilies, 2003). However, it is important to note that the calibration methods based on the classical coefficients alluded to above will not be able to estimate all components of variance that a given measurement design may support estimating, even if the design is fully crossed. For example, such methods cannot be used to estimate the unique contribution of facet main effects (e.g., rater main effects, question main effects) or interactions among facets (e.g., question-rater effects). 
Lacking this flexibility is unfortunate, particularly if one is interested in (a) comparing scores to standards (e.g., cutoff score) rather than simply making relative comparisons among individuals or (b) simply gaining a more comprehensive understanding of the sources of variance underlying scores. Remember that the CTT score model that gave rise to the classical coefficients discussed above was never designed to account for main effects of measurement facets, largely because they were assumed not to exist (e.g., recall parallel measures have equal means) and because they were not of interest in the problems that Spearman and other early psychometric researchers concerned themselves with (Cronbach & Shavelson, 2004).
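The first of these problems is easy to demonstrate by simulation. The sketch below (Python; the design, pool size, and variance values are all illustrative) builds an ill-structured set of ratings in which each ratee is rated by two raters drawn from a shared pool, then repeats the random assignment of raters to "column 1" and "column 2" many times; each assignment yields a different Pearson correlation and, after the Spearman-Brown step-up, a different "reliability" for the same data.

```python
import numpy as np

rng = np.random.default_rng(11)
n_ratees, n_pool = 300, 12
var_p, var_r, var_res = 0.5, 0.2, 0.3           # illustrative variance components

true = rng.normal(0, var_p ** 0.5, n_ratees)    # ratee "true" performance
lenient = rng.normal(0, var_r ** 0.5, n_pool)   # rater leniency (main effects)

# Ill-structured design: each ratee gets 2 raters drawn from a shared pool,
# so rater sets overlap across ratees but are not identical.
raters = np.array([rng.choice(n_pool, 2, replace=False) for _ in range(n_ratees)])
ratings = (true[:, None] + lenient[raters]
           + rng.normal(0, var_res ** 0.5, (n_ratees, 2)))

estimates = []
for _ in range(500):
    flip = rng.integers(0, 2, n_ratees)             # random "rater 1"/"rater 2" assignment
    col1 = ratings[np.arange(n_ratees), flip]
    col2 = ratings[np.arange(n_ratees), 1 - flip]
    r = np.corrcoef(col1, col2)[0, 1]               # single-rater estimate
    estimates.append(2 * r / (1 + r))               # Spearman-Brown step-up to 2 raters
estimates = np.array(estimates)

print(f"stepped-up estimates range from {estimates.min():.2f} to {estimates.max():.2f}")
# The same data yield a whole distribution of "the" reliability estimate,
# simply because the assignment of raters to columns is arbitrary.
```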

Random-Effects Model Tradition If one has a measurement procedure involving multifaceted replicates, or the design that underlies the procedure is something other than fully crossed, a natural choice for estimating reliability is based on variance components generated by fitting a random-effects model to one’s data (Jackson & Brashers, 1994; Searle et al., 1992). The modern random-effects model has its roots in Fisher’s early work on the ANOVA model and ICCs (Fisher, 1925). Work by Hoyt (1941) and Ebel (1951) provided early examples of using the ANOVA framework for estimating

reliability for single-faceted replicates. Of particular note was Ebel’s (1951) work on ratings in which he dealt with crossed and nested measurement designs. This early work branched in two directions, one that manifested itself in today’s literature on ICCs (e.g., McGraw & Wong, 1996; Shrout & Fleiss, 1979) and the other that developed into G-theory (Cronbach et al., 1972). Although rarely acknowledged in the ICC literature on reliability estimation, G-theory encompasses that literature. ICCs and reliability coefficients produced under G-theory (i.e., G-coefficients) are nothing more than ratios of variance components; for example, σ2T/(σ2T + σ2E). G-theory simply acknowledges that these ICCs can take on many more forms than those discussed by McGraw and Wong (1996) and Shrout and Fleiss (1979), and, per the earlier discussion on G-theory, offers a comprehensive framework for constructing a reliability estimate that is appropriate given one’s situation. As alluded to in the earlier treatment of G-theory, when it was originally developed, the primary approach of estimating variance components that contributed to σ2T and σ2E was the random-effects ANOVA model. This same approach to estimating variance components underlies the modern literature on ICCs (e.g., McGraw & Wong, 1996). Unfortunately, estimating variance components using ANOVA-based procedures can be an arduous process that until recently and without highly specialized software involved numerous manipulations of the sums of squares resulting from ANOVA tables (e.g., Cronbach et al., 1972; Shavelson & Webb, 1991). Relative to calculating coefficients arising from the classical tradition, the difference in simplicity of estimating reliability could be substantial. Indeed, this may be a large reason why G-theory never gained traction among organizational researchers. However, since the 1960s several advances in random-effects models have made estimation of variance components much simpler and resolved many problems associated with ANOVA-based estimators of variance components (DeShon, 1995; Marcoulides, 1990; Searle et al., 1992). Unfortunately, this knowledge has been slow to disseminate into the psychometric and I-O literature, because many still seem to equate G-theory with ANOVA-based variance component estimation procedures that characterized G-theory upon its introduction to the literature. Procedures for the direct estimation of variance components that underlie all reliability coefficients are now widely available in common statistical packages (e.g., SAS, SPSS, R) and allow investigators to estimate variance components with a few clicks of a button. DeShon (2002) and Putka and McCloy (2008) provided clear examples of the ease with which variance components can be estimated within SAS and SPSS. As such, modern methods of variance component estimation are far easier to implement than (a) procedures characteristic of the early G-theory literature and (b) the calibration techniques discussed by Schmidt et al. (2000), which would require an investigator to engage in a series of manipulations with various types of coefficients arising out of the classical tradition. In addition to offering parsimony, modern methods of variance component estimation have another key advantage: they can readily deal with missing data and unbalanced designs characteristic of organizational research (DeShon, 1995; Greguras & Robie, 1998; Marcoulides, 1990; Putka et al., 2008).
In contrast, ANOVA-based variance component estimators characteristic of the early G-theory literature are not well equipped to handle such messy designs. Indeed, when confronted with such designs, advocates of G-theory have often suggested discarding data to achieve a balanced design for purposes of estimating variance components (e.g., Shavelson & Webb, 1991)—with modern methods of variance component estimation, the need for such drastic steps has subsided. The most notable drawback of modern methods of variance component estimation—largely based on full or restricted maximum likelihood—is that they can involve rather substantial memory requirements for large measurement designs (Bell, 1985; Littell, Milliken, Stroup, & Wolfinger, 1996). In some cases, such requirements may outstrip the memory that Windows-based desktop computers can currently allocate to programs for estimating variance components (e.g., SAS and SPSS). The strengths of reliability estimation methods based on the random-effects model tradition relative to the classical tradition are substantial. First, they fully encompass classical methods in that they can be used to estimate reliability for measurement procedures involving singlefaceted replicates that are fully crossed with one’s object of measurement. Second, unlike classical methods, they can easily be used to formulate reliability estimates for measurement procedures involving multifaceted replicates in which the facets are crossed, nested, or any

combination thereof. Third, the random-effects tradition provides investigators with not only coefficients but also the variance components that underlie them. As Cronbach and Shavelson (2004) stated: “Coefficients (reliability) are a crude device that do not bring to the surface many subtleties implied by variance components” (p. 394). Variance components allow researchers to get a much finer appreciation of what comprises error than simply having one omnibus estimate of error. Readers interested in learning more about formulation of reliability estimates via variance components estimated by random-effects models—or more generally, G-theory—are referred to DeShon (2002), Haertel (2006), Putka et al. (2008), and Shavelson and Webb (1991). For a concrete illustration of a wide variety of reliability estimates that can be formulated for assessment center and job performance ratings, see Putka and Hoffman (2013) and (2015), respectively. For a more thorough technical presentation, one should consult Brennan (2001b).
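As a rough illustration of what such direct variance component estimation looks like in practice, the sketch below fits a crossed random-effects model to long-format ratings using the variance-components interface of statsmodels' MixedLM in Python; this is an assumption about available tooling (the SAS and SPSS procedures cited above, or lme4 in R, play the same role), and the data, column names, and effect sizes are entirely illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
n_ratee, n_rater = 200, 10

# Long-format ratings: every ratee x rater combination observed here,
# although REML-based estimation also handles unbalanced/missing designs.
ratee = np.repeat(np.arange(n_ratee), n_rater)
rater = np.tile(np.arange(n_rater), n_ratee)
score = (rng.normal(0, 0.7, n_ratee)[ratee]     # ratee effects (var = .49)
         + rng.normal(0, 0.4, n_rater)[rater]   # rater effects (var = .16)
         + rng.normal(0, 0.5, len(ratee)))      # residual      (var = .25)
df = pd.DataFrame({"score": score, "ratee": ratee, "rater": rater, "g": 1})

# Crossed random effects expressed as variance components within one dummy group
vcf = {"ratee": "0 + C(ratee)", "rater": "0 + C(rater)"}
model = sm.MixedLM.from_formula("score ~ 1", groups="g", vc_formula=vcf,
                                re_formula="0", data=df)
fit = model.fit()                    # REML by default
print(fit.summary())                 # ratee and rater variance components
print("residual variance:", fit.scale)
```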

Confirmatory Factor Analytic Tradition Although G-theory is often espoused as a conceptual centerpiece of modern psychometrics (along with item response theory, or IRT), it is important to separate the conceptual perspective G-theory offers on reliability from the estimation methods (random-effects models) it proscribes. Such a distinction is important because although the conceptual perspective offered by G-theory can serve as a parsimonious way to frame the problem of building a reliability coefficient appropriate for one’s situation (regardless of whether one uses classical methods, random-effects methods, or CFA methods to derive estimates of such coefficients), the random-effects model that undergirds G-theory and classical methods of estimating reliability share a key drawback. Specifically, they offer no clear mechanism for (a) testing or dealing with violations of CTT and G-theory measurement model assumptions and (b) specifying or testing alternative factorial compositions of true score— both of which have fundamental implications for the interpretation of reliability estimates.18 It is in this regard that CFA approaches to estimating reliability are strong (McDonald, 1999). Unlike reliability estimation approaches born out of the classical and random-effects traditions, CFA-based approaches force investigators to be specific about the substantive nature of the latent structure underlying their replicates (indicators, in CFA terms). For example, CFA forces them to face questions such as: • Is the covariance shared among replicates (i.e., true score variance from the perspective of classical and random-effects approaches) accounted for by a single latent true score factor or multiple latent factors? • Do indicators of the latent true score factor(s) load equally on that/those factor(s) (i.e., are they at least essentially tau-equivalent) or is their heterogeneity in factor loadings (i.e., suggesting they are not at least essentially tau-equivalent)? • What proportion of true score variance (as defined in CTT and G-theory) reflects the effects of a single latent factor, as opposed to residual covariances?

Although such questions have implications for reliability estimates arising from the classical and random-effect traditions, neither of these traditions has a built-in mechanism for addressing them. In essence, they ascribe all shared covariance among replicates to a latent entity (e.g., true score), regardless of whether it stems from a single factor or multiple factors. Thus, in some ways CFA can be seen as a way of clarifying the factorial composition of true score variance as conceived by CTT and G-theory measurement models. One may argue that such clarification is more an issue of validity rather than reliability (e.g., Schmidt et al., 2000); however, as discussed in the following paragraphs, the dimensionality of the focal construct of interest has implications for the accuracy of reliability estimates based on the classical and random-effects traditions (Lee & Frisbie, 1999; Rae, 2007; Rogers, Schmitt, & Mullins, 2002). The CFA tradition of reliability estimation arose out of Joreskog’s (1971) work on the notion of congeneric tests discussed earlier. To illustrate, consider a situation in which we administer a 10-item measure of agreeableness to a sample of job applicants. In this case, our replicates are single-faceted and defined in terms of items, and those items are fully crossed with our objects

of measurement—applicants. From the CFA perspective, we might view the replicates as indicators of a latent factor representing true score, and then fit a model to the data such that the variance of the latent factor is set to one, and the factor loadings and unique variances are freely estimated. On the basis of such a model, the estimated reliability of the sum of the k replicates can be obtained via

$$\omega \;=\; \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^{2}}{\left(\sum_{i=1}^{k}\lambda_i\right)^{2}+\sum_{i=1}^{k}\theta_{ii}} \;=\; \frac{\hat{\sigma}_T^{2}}{\hat{\sigma}_T^{2}+\hat{\sigma}_E^{2}}\,, \qquad (1.7)$$

where λi represents the estimated factor loading for the ith of k replicates, and θii represents the estimated unique variance for the ith replicate (McDonald, 1999; Reuterberg & Gustafsson, 1992).19 As with the application of classical and random-effects approaches to reliability, the substantive meaning of σ2T and σ2E based on this formulation will differ depending on the type of replicates to which they are applied (e.g., items, raters, occasions, tasks, etc.). Thus, if applied to replicates defined by items, the estimate of σ2T provided by the squared sum of loadings will reflect consistency among items (attributable to the latent true score factor). If applied to replicates defined by raters, the estimate of σ2T provided by the squared sum of loadings will reflect consistency among raters (again, attributable to the latent true score factor). A key benefit of the CFA approach described above is that it will allow one to impose constraints on parameter estimates (e.g., the λs and θs) that allow one to test various assumptions underlying the CTT and G-theory score models (see Joreskog & Sorbom, 2001, pp. 124–128; Reuterberg & Gustafsson, 1992; Werts, Linn, & Joreskog, 1974). If replicates are strictly parallel (an assumption underlying the Spearman-Brown prophecy; Feldt & Brennan, 1989), then they should have equal factor loadings (i.e., λ1 = λ2 = λk) and equal unique variances (i.e., θ11= θ22 = θkk). If replicates are tau-equivalent or essentially tau-equivalent (an assumption underlying alpha and coefficients based on the random-effects tradition), then they should have equal factor loadings but their unique variances can differ. To the extent that factor loadings vary across replicates (i.e., the replicates are not at least essentially tau-equivalent), most reliability estimates based out of the classical and random-effects tradition (e.g., alpha) will tend to be slightly downward biased (Novick & Lewis, 1967).20 Nevertheless, this common claim is based on the premise that the replicates on which alpha is estimated are experimentally independent—from a CFA perspective this would imply there are no unmodeled sources of covariance among replicates after accounting for the latent true score factor (Komaroff, 1997; Raykov, 2001a; Zimmerman, Zumbo, & LaLonde, 1993). In light of the fact that many constructs of interest to organizational researchers are heterogeneous (e.g., situational judgment) or clearly multidimensional (e.g., job performance), application of the formula shown in Equation 1.7 would be questionable because it implies that a single common factor accounts for the covariance among replicates, which in practice may rarely be true. The observation above brings us to a critical difference between the CFA-based formulation of reliability noted in Equation 1.7 and those based on the classical and random-effects traditions—the former often specifies a single latent factor as the sole source of covariance among replicates, and as such only variance in replicates attributable to that factor is treated as true score variance (for a more general, CFA-based alternative, see Raykov & Shrout, 2002). Recall from the operational definition of true score offered earlier and the perspective on true score offered by the CTT and G-theory score models that true score reflects all sources of consistency across replicates. As Ghiselli (1964) noted, The fact that a single term . . . 
has been used to describe the amount of the trait an individual possesses should not be taken to imply that individual differences in scores on a given test are determined by a single factor. (p. 220)
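Before turning to the implications of this point, a brief numerical illustration of Equation 1.7 may be helpful. The loadings and unique variances below are hypothetical; the sketch computes omega for a set of congeneric replicates and, for comparison, coefficient alpha from the model-implied covariance matrix, which (consistent with the discussion above) is slightly lower than omega when loadings are unequal and there is no residual covariance.

```python
import numpy as np

def omega(loadings, uniquenesses):
    """McDonald's omega for a unit-weighted sum of k congeneric replicates
    (Equation 1.7): squared sum of loadings over total variance."""
    lam_sum_sq = np.sum(loadings) ** 2
    return lam_sum_sq / (lam_sum_sq + np.sum(uniquenesses))

def alpha_from_cov(S):
    """Coefficient alpha computed from a covariance matrix among replicates."""
    k = S.shape[0]
    return (k / (k - 1)) * (1 - np.trace(S) / np.sum(S))

# Hypothetical single-factor solution for 10 items (unequal loadings,
# i.e., congeneric but not essentially tau-equivalent).
lam = np.linspace(0.4, 0.9, 10)
theta = np.full(10, 0.36)

# Model-implied covariance matrix with no residual covariance.
S = np.outer(lam, lam) + np.diag(theta)

print("omega:", round(omega(lam, theta), 3))
print("alpha:", round(alpha_from_cov(S), 3))  # expected to be slightly lower
```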

The implications of this are that whereas the CFA formulation above ignores any covariance among replicates that is left over after extracting a first latent factor, classical coefficients such

as alpha and coefficients derived from the random-effects tradition lump such covariance into the estimate of true score variance (Bost, 1995; Komaroff, 1997; Maxwell, 1968; Raykov, 2001a; Smith & Luecht, 1992; Zimmerman et al., 1993). This characteristic of the CFA approach offered above presents investigators with a dilemma: Should residual covariance observed when adopting such an approach be treated as (a) error variance (σ2E) or (b) a source of true score variance (σ2T)? In estimates of reliability based on the classical tradition, one does not have much of an option. True score variance as estimated under the classical tradition reflects any source of consistency in scores, regardless of whether it stems from a first common factor, or what, in CFA terms, would be viewed as residual covariance or correlated uniquenesses (Komaroff, 1997; Scullen, 1999). Similarly, under the random-effects tradition, true score variance reflects any source of consistency in scores not accounted for by a variance component reflecting one of the facets of measurement (e.g., items, raters, occasions). However, with CFA, researchers have the flexibility to distinguish between true score variance that (a) arises from a common factor hypothesized to reflect a construct of interest and (b) reflects residual covariance among replicates after extracting the first factor (Raykov, 1998, 2001b). Although in theory having this flexibility is valuable because it allows one insight into the substance of true score variance, it also has practical benefits in that it can allow investigators to tailor a reliability coefficient to their situation depending on the nature of the construct they are assessing.

To illustrate this flexibility, I offer three examples that selection researchers and practitioners may encounter. First, let us say one (a) designs a measurement procedure to assess a unidimensional construct, (b) uses a fully crossed measurement design comprising replicates defined by a single facet of measurement (e.g., items) to assess it, (c) fits the single-factor CFA model described above to the resulting data, and (d) finds evidence of residual covariance. Assuming there is no pattern to the residual covariance that would suggest the presence of additional substantively meaningful factors, the investigator would likely desire to treat the residual covariance as a source of error variance (σ2E) rather than a source of true score variance (σ2T).21 Fortunately, such residual covariance can be easily incorporated into Equation 1.7 by replacing the term corresponding to the sum of unique variances with a term that reflects the sum of unique variances and residual covariances or by simply replacing the denominator with observed score variance (Komaroff, 1997; Raykov, 2001a). If one were to calculate alpha on these same data, or fit a simple random-effects model to estimate σ2T, such residual covariance would be reflected in σ2T as opposed to σ2E and thus would produce a reliability estimate that is higher than the modified omega coefficient described here when the sum of the residual covariances is positive (lower when the sum is negative) (Komaroff, 1997). It is important to note that the comparison made here between alpha and modified omega is based on the assumption that the replicates in the analysis are at least essentially tau-equivalent. If the replicates are not at least essentially tau-equivalent, then this would lower the estimate of alpha, thus either partially or completely offsetting any positive bias created by the presence of residual covariance (Raykov, 2001a).

As another example, let us say one (a) designs a measurement procedure to assess a relatively heterogeneous, but ill-specified, construct (e.g., situational judgment), and again (b) uses a fully crossed measurement design comprising replicates defined by a single facet of measurement (e.g., scenarios) to assess it, (c) fits the single-factor CFA model described above to the resulting data, and (d) finds evidence of residual covariance. In this case, the investigator may choose to treat the residual covariance as a source of true score variance (σ2T) rather than error variance (σ2E). Unlike the first example, given the heterogeneous nature of the situational judgment construct, the investigator would not likely expect the covariance among scenarios to be accounted for by a single factor. For example, the investigator may hypothesize that the scenarios comprising the assessment vary in the degree to which various combinations of individual differences (e.g., interpersonal skill, conscientiousness, and general mental ability) are required to successfully resolve them. As such, scores on scenarios may differentially covary depending on the similarity of the individual difference profile required to resolve them. Under such conditions, one would need more than a single factor to account for covariation among replicates, but given the ill-structured nature of the situational judgment construct, the investigator may not find strong evidence for a simple factor structure. As was the case with treating residual covariances as σ2E in the previous

example, Equation 1.7 can easily be modified to treat residual covariances as σ2T by adding a term to the squared sum of loadings that reflects the sum of all residual covariances, specifically

$$\omega' \;=\; \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^{2}+\sum_{i=1}^{k}\sum_{\substack{j=1\\ j\neq i}}^{k}\mathrm{Cov}(e_i,e_j)}{\left(\sum_{i=1}^{k}\lambda_i\right)^{2}+\sum_{i=1}^{k}\sum_{\substack{j=1\\ j\neq i}}^{k}\mathrm{Cov}(e_i,e_j)+\sum_{i=1}^{k}\theta_{ii}} \;=\; \frac{\hat{\sigma}_T^{2}}{\hat{\sigma}_T^{2}+\hat{\sigma}_E^{2}}\,. \qquad (1.8)$$

Note that treating residual covariance as σ2T represents a departure from how the CFA literature on reliability estimation has generally espoused treating such covariance when estimating reliability (e.g., Komaroff, 1997; Raykov, 2001b). Nevertheless, the perspective offered by these authors is largely based on the assumption that the investigator is assessing a unidimensional construct. If one were to calculate alpha on such replicates, or fit a random-effects model to estimate σ2T, the covariance among residuals noted above would contribute to σ2T as opposed to σ2E and as such would produce a coefficient similar in magnitude to what is provided by Equation 1.8 (Komaroff, 1997). Lastly, and as a third example, let us say one (a) designs a measurement procedure to assess a multidimensional construct (e.g., job performance), (b) uses a fully crossed measurement design comprising replicates defined by a single facet of measurement (e.g., items) to assess it, and (c) samples content for the measure in a way that allows one to distinguish between different dimensions of the construct (e.g., samples items corresponding to multiple job performance dimensions). In this situation, one might be interested in estimating the reliability of scores on each dimension of the construct separately, as well as estimating the reliability of a composite score based on the sum of dimension-level scores (e.g., an overall performance score). To achieve a reliability estimate for scores on the overall composite, the single-factor CFA model described would clearly not be appropriate. Rather, a multifactor model may be fitted in which each factor reflects dimensions of the construct being targeted by the measure. Indicators would be allowed to load only on those factors they are designed to reflect, and the reliability and true score variance of the overall composite score would be a function of factor loadings and factor covariances (Kamata, Turhan, & Darandari, 2003; Raykov, 1998; Raykov & Shrout, 2002). Any residual covariance among indicators associated with a given factor could be treated as noted in the earlier examples (i.e., treated as σ2T or σ2E) depending on how the investigator views such covariance in light of the substantive nature of the target construct and particular measurement situation. Such multifactor models could also be used to simultaneously generate separate estimates of reliability of scores for each dimension of the construct (Raykov & Shrout, 2002). Although classical and random-effects traditions do not concern themselves with the factorial composition of true score covariance as the CFA tradition does, estimation methods arising out of the former traditions have developed to deal with reliability estimation for scores produced by measures that clearly reflect the composite of multiple dimensions. Such methods have typically been discussed under the guise of (a) reliability estimation for measures stratified on content (e.g., items comprising the measure were sampled to assess relatively distinct domains such as deductive and inductive reasoning) or, more generally, (b) reliability estimation for composite scores (Cronbach, Schoneman, & McKie, 1965; Feldt & Brennan, 1989). 
Here “composites” do not necessarily refer to a compilation of items thought to reflect the same construct (i.e., replicates), but rather compilations of measures designed to reflect distinct, yet related constructs (e.g., proficiency with regard to different requirements for a trade or profession) or different components of a multidimensional construct (e.g., task and contextual performance). Scores produced by such component measures may differ in their reliability and their observed relation with one another. Sensitivity to such issues is clearly seen in classical formulas for the reliability of composites such as stratified coefficient alpha (Cronbach et al., 1965) and Mosier’s (1943) formula for the reliability of a weighted composite.22 In the case of stratified alpha and Mosier’s coefficient, σ2T for the overall composite score reflects the sum of σ2T for each component of the composite and the sum of covariances between replicates (e.g., items, raters) comprising different components.23 The fact that covariances between replicates from different components of the

composite contribute to true score variance has a very important implication: these estimates will likely produce inflated estimates of reliability in cases in which measures of each component share one or more elements of a facet of measurement (e.g., raters, occasions) in common. For example, consider a situation in which one gathers job performance ratings on two dimensions of performance for each ratee—task performance and contextual performance. Assume that, for any given ratee, the same two raters provided ratings of task performance and contextual performance. In this case, the measures of task and contextual performance share raters in common and as such are “linked” (Brennan, 2001b). Thus, the covariation between task and contextual performance in this example reflects not only covariance between their true scores but also covariance arising from the fact that they share a common set of raters. Were we to apply stratified alpha or Mosier’s formula to estimate the reliability of the composite score produced by summing across the two dimensions (using inter-rater reliability estimates for each component in the aforementioned formulas), covariance attributable to having a common set of raters would contribute to true score variance, thus artificially inflating the reliability estimate (assuming we wish to generalize the measures across raters). Stratified alpha and Mosier’s formula are based on the assumption that errors of measurement associated with components that comprise the composite are uncorrelated; to the extent they are positively correlated—a likely case when components share one or more elements of a facet of measurement in common—the estimates they provide can be substantially inflated (Rae, 2007). Outside of multivariate G-theory, which is not widely used or discussed in the organizational research literature (Brennan, 2001b; Webb & Shavelson, 1981), there appear to be no practical, straightforward analytic solutions to this situation on the basis of classical and random-effects estimation traditions.
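As a minimal sketch of the composite-reliability formulas just discussed, the function below implements Mosier’s (1943) formula for the reliability of a weighted composite (which reduces to the stratified-alpha formula under unit weights; see note 22), using hypothetical component variances, covariances, and reliabilities. Per the caution above, feeding such formulas inter-rater reliabilities for components that share raters can overstate the reliability of the composite.

```python
import numpy as np

def mosier_composite_reliability(S, rel, w=None):
    """Mosier's (1943) reliability of a weighted composite.
    S   : covariance matrix among the component scores
    rel : reliability estimate for each component score
    w   : component weights (unit weights if omitted)"""
    S = np.asarray(S, dtype=float)
    rel = np.asarray(rel, dtype=float)
    k = S.shape[0]
    w = np.ones(k) if w is None else np.asarray(w, dtype=float)
    comp_var = w @ S @ w                                 # observed composite variance
    error_var = np.sum(w**2 * np.diag(S) * (1 - rel))    # weighted component error variance
    return 1 - error_var / comp_var

# Hypothetical two-component example: task and contextual performance scores.
S = np.array([[1.00, 0.45],
              [0.45, 0.80]])
rel = np.array([0.70, 0.65])  # e.g., inter-rater reliability of each component

print(round(mosier_composite_reliability(S, rel), 3))
# With unit weights this matches the stratified-alpha formula for these inputs.
```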

Multifaceted Replicates and Noncrossed Measurement Designs in CFA

In all of the CFA examples offered above, the discussion assumed that the source(s) of extra covariation among replicates beyond the first factor was due to multidimensionality or, more generally, heterogeneity in the construct being measured. However, as the example from the previous paragraph illustrated, such covariation can also arise from the characteristics of one’s measurement design. For example, such extra covariation can arise if the replicates that serve as indicators in a CFA are multifaceted and share a measurement design element (e.g., a rater, an occasion) in common. This brings us to another critical point regarding the CFA-based approach to reliability estimation discussed above. When Joreskog (1971) originally formulated the congeneric test model upon which many CFA-based estimates of reliability are grounded, it was based on a set of replicates defined along a single facet of measurement (e.g., items), and that facet was assumed to be fully crossed with the objects of measurement (e.g., persons). However, as noted above, when replicates are multifaceted, those replicates that share a level of a given facet in common (e.g., replicates that share a common rater or occasion of measurement) will covary above and beyond any substantive factors (e.g., interpersonal skill, job performance) that underlie the replicates (DeShon, 1998). There are numerous ways to account for multifaceted replicates within the CFA framework; however, only recently have they begun to find their way into the literature (e.g., DeShon, 1998; Green, 2003; Le et al., 2009; Marcoulides, 1996; Marsh & Grayson, 1994). Many of the methods being espoused for handling multifaceted replicates in the context of CFA have their roots in the literature on modeling of multitrait-multimethod data (e.g., Kenny & Kashy, 1992; Widaman, 1985). For example, in the context of the interview example offered earlier, we might fit a model that not only includes a latent factor corresponding to the construct of interest (e.g., interpersonal skill) but also specifies latent factors that correspond to different raters or interview questions (e.g., all indicators associated with rater 1 would load on a “Rater 1” factor, all indicators associated with rater 2 would load on a “Rater 2” factor). Alternatively, one might allow the uniquenesses for those indicators that share a rater or question in common to covary and constrain to zero those that do not (e.g., Lee, Dunbar, & Frisbie, 2001; Scullen, 1999). By fitting such models, one can derive estimates of variance components associated with various elements of one’s measurement design (e.g., person-rater effects, person-question effects) that

resemble what is achieved by fitting a random-effects model to the data described earlier (e.g., DeShon, 2002; Le et al., 2009; Marcoulides, 1996; Scullen et al., 2000). As illustrated earlier in the discussion of G-theory, these variance components can then be used to construct reliability coefficients appropriate for one’s situation. Unfortunately, as was the case with using classical reliability coefficients to calibrate various sources of error in scores (e.g., Schmidt et al., 2000), CFA-based approaches to variance component estimation have a few drawbacks. First, they do not lend themselves easily to estimating variance attributable to (a) facet main effects (e.g., rater main effects, question main effects) or (b) interactions among measurement facets (e.g., rater-question interaction effects). Although it is possible to estimate the effects above, this would require calculating covariances among persons (i.e., persons as columns/variables) across facets of measurement of interest (e.g., raters, questions) as opposed to the typical calculation of covariances among question-rater pairs (i.e., question-rater pairs are treated as columns/variables) across objects of measurement (e.g., persons).24 Furthermore, it is not clear how CFA could be leveraged to deal with designs that are more ill-structured in nature (e.g., Putka et al., 2008). For example, recall the earlier example in which we had performance ratings for a sample of incumbents that were rated by multiple raters, and the raters that rated each incumbent varied in their degree of overlap. When confronted with such designs in past applications of CFA, organizational researchers have generally resorted to random assignment of raters to columns for each ratee (e.g., Mount, Judge, Scullen, Sytsma, & Hezlett, 1998; Scullen et al., 2000; Van Iddekinge, Raymark, Eidson, & Attenweiler, 2004). As noted earlier, the drawback of doing this is that it can produce results that (a) vary simply depending on how raters are assigned for each ratee and (b) fail to account for the nonindependence of residuals for incumbents that share a rater in common, which downwardly biases estimates of true score variance (Kenny & Judd, 1986; Putka et al., 2008).

Lastly, for better or worse, the literature on CFA offers myriad ways to parameterize a model to arrive at variance component estimates, each of which has various strengths and weaknesses that are still in the process of being ironed out (e.g., Eid, Lischetzke, Nussbeck, & Trierweiler, 2003; Lance, Lambert, Gewin, Lievens, & Conway, 2004; Marsh & Grayson, 1994). With the random-effects model discussed above, the number of alternative parameterizations (at least as currently implemented in common statistical software such as SAS and SPSS) is quite limited. The difficulty this creates when using CFA to estimate variance components is determining which parameterization is most appropriate in a given situation, because a clear answer has not emerged and will likely ultimately depend on characteristics of the construct being measured and characteristics of one’s measurement situation (Marsh & Grayson, 1994). This is complicated by the fact that the choice of which parameterization is adopted often may be less a matter of substantive considerations and more a reflection of the parameterization that allowed the software fitting the model to converge to an admissible solution (Lance et al., 2004).
Ultimately, such nuances, coupled with the complexity of CFA-based approaches to variance component estimation, may limit the utility of such approaches for reliability estimation in general selection research and practice.

Summary: Comparison and Equifinality of Estimation Traditions

On the basis of the examples, one might ask: Which tradition best serves the needs of personnel selection research and practice? My answer would be that no single tradition currently satisfies all needs. Table 1.1 summarizes characteristics of the reliability estimation traditions discussed above. Beyond their simplicity and familiarity, classical approaches do not appear to have much to offer. Modern random-effects approaches address not only measurement situations that classical approaches were initially designed to handle (e.g., those involving single-faceted replicates and fully crossed designs) but also those situations that classical approaches were not designed to handle (i.e., procedures involving multifaceted replicates and/or noncrossed designs).


TABLE 1.1 Relative Advantages and Disadvantages of Reliability Estimation Traditions

Characteristics of Estimation Tradition | Classical | Random Effects | CFA
Perceived simplicity | Yes | No | No
Widely discussed in organizational literature on reliability estimation | Yes | No | No
Easily implemented with standard statistical software (e.g., SPSS, SAS) | Yes | Yes | No (a)
Direct and simultaneous estimation of variance components underlying σ2T and σ2E | No | Yes | Yes
Straightforward to apply to nested and ill-structured measurement designs confronted in applied organizational research and practice | No | Yes | No
Capacity to isolate and estimate variance attributable to facet main effects and interactions among facets | No | Yes | No (b)
Offers mechanism for testing and dealing with violations of CTT and G-theory measurement model assumptions | No | No (c) | Yes
Offers mechanism for specifying and testing alternative factorial compositions of true score | No | No | Yes

(a) Potential exception is PROC CALIS within SAS.
(b) As noted in text, such effects could be estimated by fitting CFA models to covariances calculated across measurement facets (e.g., question, raters, question-rater pairs) as opposed to objects of measurement (e.g., persons).
(c) SAS and SPSS now offer users the ability to fit “heterogeneous variance” random-effect models, which for some designs can be used to assess various equivalence assumptions underlying the CTT and G-theory measurement models (e.g., Is σ2T for Rater 1 = σ2T for Rater 2?).

Couple this with the ease with which variance components can now be estimated using widely available software (e.g., SPSS, SAS), as well as the consistency of the random-effects model with modern psychometric perspectives on reliability (i.e., G-theory; AERA, APA, & NCME, 2014; Brennan, 2006), and it appears the random-effects tradition has much to offer. Nevertheless, the classical and random-effects traditions suffer from two similar drawbacks in that their estimation procedures offer no clear mechanism for (a) testing or dealing with violations of CTT and G-theory measurement model assumptions on which their formulations of reliability are based and (b) specifying or testing alternative factorial compositions of true score. The latter drawback can make the interpretation of reliability estimates difficult because of ambiguity about what constitutes true score, particularly for measures of heterogeneous constructs. This is where the CFA tradition can offer an advantage; however, this advantage does not come freely—its price is added complexity. For single-faceted replicates that are fully crossed with one’s objects of measurement, CFA methods are straightforward to apply and clear examples exist (e.g., Brown, 2006; McDonald, 1999). For multifaceted replicates, a systematic set of examples has yet to be provided for investigators to capitalize on, which is complicated by the fact that the CFA models can be parameterized in numerous different ways to arrive at a solution (Marsh & Grayson, 1994). This has a tendency to restrict such solutions to psychometrically savvy researchers and practitioners. Moreover, for the ill-structured measurement designs discussed by Putka et al. (2008), which are all too common in selection research involving ratings (e.g., assessment centers, interviews, job performance), it is not clear how the CFA models would overcome the issues raised. Thus, we have a tradeoff between the ease with which modern random-effects models and software can deal with multifaceted measurement designs of any sort and the model fitting and testing capabilities associated with CFA, which can not only check on measurement model assumptions but also refine our specification (and understanding) of true score for measures of heterogeneous constructs.

Although I have treated reliability estimation approaches arising out of classical, random-effects, and CFA traditions separately, it is important to recall how we began this section: all of these traditions can be used to arrive at the same ratio—σ2T/(σ2T + σ2E). The substantive meaning of σ2T and σ2E will depend on the type of replicates examined, the nature of the measurement procedure, and the construct that one is assessing. Nevertheless, all of these traditions can potentially be leveraged to arrive at an estimate of this ratio and/or components of it. How

they arrive at those estimates, the assumptions they make in doing so, and the strengths and weaknesses of the methodologies they use is what differentiates them. In cases in which one has a measurement procedure comprising single-faceted replicates or multifaceted replicates in which facets are fully crossed with one’s objects of measurement and one is interested solely in using scores to make relative comparisons among objects of measurement (e.g., persons), much literature has accumulated indicating that these traditions can produce very similar results, even in the face of moderate violation of common tau-equivalence assumptions (e.g., Le et al., 2009; Reuterberg & Gustafsson, 1992). For example, Brennan (2001b) and Haertel (2006) show how random-effects ANOVA models may be used to estimate variance components and form reliability coefficients that are identical to the types of reliability coefficients from the classical tradition (e.g., alpha, coefficients of equivalence and stability). Marcoulides (1996) demonstrated the equivalence of variance components estimated based on CFA and random-effects ANOVA models fitted to a multifaceted set of job analysis data. Le et al. (2009) illustrated how one can arrive at similar variance component estimates using functions of Pearson correlations, random-effects models, and CFA models. Lastly, Brennan (2001b) and Hocking (1995) demonstrated how it is possible to generate variance component estimates without even invoking the random-effects or CFA models but simply calculating them as functions of observed variances and covariances (in some ways akin to Schmidt et al., 2000). Each of these works illustrates that, under certain conditions, the three traditions discussed can bring investigators to similar conclusions.

However, as illustrated, nuances regarding (a) the generalizations one wishes to make regarding their scores, (b) the intended use of those scores (e.g., relative comparisons among applicants vs. comparisons of their scores to a fixed cutoff), (c) characteristics of one’s measurement procedure itself (e.g., nature of its underlying design), and (d) characteristics of the construct one is attempting to measure (e.g., unidimensional vs. multidimensional, homogeneous vs. heterogeneous) make some of these approaches more attractive than others under different circumstances. Ideally, the well-informed investigator would be in a position to capitalize on the relative strengths of these traditions when formulating reliability and variance component estimates of interest given his/her situation.

Lastly, regardless of what theoretical perspective one adopts on reliability estimation (i.e., whether it is more grounded in CTT or G-theory), I encourage future researchers to think critically and attempt to evaluate whether the data they are attempting to apply those theories to conform to the score models underlying their estimates of reliability. My sense is that within I-O psychology and the organization sciences, we often take it as a given that our data conform to the assumptions implied by the score models underlying reliability estimates, but rarely do we take the time to seriously evaluate such claims.
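In that spirit, the sketch below illustrates one informal, purely descriptive starting point, assuming a simple persons × replicates (replicates-as-columns) data matrix: under strict parallelism, replicate variances and inter-replicate covariances should each be roughly constant, whereas under essential tau-equivalence only the covariances need be. It is a crude screen rather than a substitute for the model-based tests discussed earlier, and the simulated data are purely illustrative.

```python
import numpy as np

def parallelism_screen(X):
    """Descriptive screen of CTT score-model assumptions for replicates in
    the columns of X: under strict parallelism, replicate variances and
    inter-replicate covariances should each be (roughly) constant; under
    essential tau-equivalence, only the covariances need be constant."""
    S = np.cov(X, rowvar=False)
    k = S.shape[0]
    variances = np.diag(S)
    covs = S[~np.eye(k, dtype=bool)]
    return {
        "variance range": (variances.min(), variances.max()),
        "covariance range": (covs.min(), covs.max()),
        "cv of covariances": covs.std() / covs.mean(),  # heterogeneity index
    }

# Illustrative use with simulated persons x replicates scores.
rng = np.random.default_rng(2)
scores = rng.normal(size=(200, 1)) @ np.ones((1, 5)) * 0.7 + rng.normal(scale=0.5, size=(200, 5))
print(parallelism_screen(scores))
```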

CLOSING THOUGHTS ON RELIABILITY

Perspectives on reliability and methods for its estimation have evolved greatly over the last 50 years, but these perspectives and methods have yet to be well integrated (Brennan, 2006). One potential reason for this lack of integration may stem from the historical disconnect between experimental and correlational research traditions (Cronbach, 1957), which continues to manifest itself today, particularly in our approaches to reliability estimation (Cronbach & Shavelson, 2004). Another potential reason for this lack of integration may stem from the recognized decline in the graduate instruction of statistics and measurement over the past 30 years in psychology departments (Aiken et al., 2008; Merenda, 2007). For example, in reviewing results of their study of doctoral training in statistics, measurement, and methodology in PhD psychology programs across North America, Aiken et al. (2008) lament:

We find it deplorable . . . the measurement requirement occupies a median of only 4.5 weeks in the PhD curriculum in psychology. A substantial fraction of programs offered no training in test theory or test construction; only 46% of programs judge that the bulk of their graduates could assess the reliability of their own measures. (p. 43)


Under such conditions, it is nearly impossible for faculty to comprehensively integrate and discuss implications of developments in the areas above in classroom discussions of psychometrics; almost out of necessity, we limit ourselves to basic treatment of age-old perspectives on measurement. Couple this observation with the explosion of new statistical software and the availability of new estimation methods since the mid-1980s, and it creates a situation in which staying psychometrically current can be a challenge for those in academe as well as those in practice. Of course, also complicating these trends is the course of normal science, which leads us to pursue incremental research that refines measurement models and the perspectives on reliability they offer but does not emphasize integration of models and perspectives (Kuhn, 1962). Such a lack of integration among psychometric models and perspectives is unfortunate because integration can serve as a source of parsimony, which is critical when one has limited time to devote to such topics in the course of graduate instruction and in the course of applied research and practice. I hope this treatment has brought some degree of parsimony to what have often been treated as disparate, loosely related topics. Furthermore, I hope it casts developments in the area of reliability in a novel light for selection researchers and practitioners and encourages us to explore and capitalize on modern methods for framing reliability, error, and their underlying components.

NOTES   1. Throughout this chapter I use the term “scores” to generically refer to observed manifestations of a measurement procedure—thus, scores might be ratings, behavioral observations, test scores, etc.   2. As we discuss later, the degree to which replicates are assumed to “assess the same construct” differs across measurement theories. The degree of similarity among replicates has been discussed under the rubric of degrees of part-test similarity (Feldt & Brennan, 1989) and degrees of parallelism (Lord, 1955). At this point, further discussion of this issue is unnecessary, but we will revisit this issue when discussing the role of measurement models in reliability estimation.   3. A key exception here is Cronbach’s (1947) treatment of a coefficient of equivalence and stability.   4. Actually, this is a bit of an overstatement. As alluded to in the opening paragraph, error, in any given situation, will be partly dependent on the generalization(s) the investigator wishes to make regarding scores. In some cases, investigators may choose not to treat a given source of inconsistency in scores as error. For example, this might occur in the context of performance ratings where inconsistencies in job incumbents’ scores across different dimensions of performance may be viewed as acceptable by the investigator (Scullen, Mount, & Goff, 2000). This example illustrates why the investigator is a critical part of defining error in any given situation. We will touch upon this topic again later when we discuss generalizability theory.   5. Readers may question the omission of item response theory (IRT) from the subsequent discussion. Like Brennan (2006), we tend to view IRT models as “scaling” models rather than “measurement” models because they do not have a built-in explicit consideration of measurement error. Furthermore, the focus of applications of IRT is often on estimation/scaling of a latent trait, ability of interest, or calibration of item parameters rather than the isolation and quantification of measurement error (see Brennan, 2006, pp. 6–7). Although I am not downplaying the obvious importance of IRT for psychometrics and personnel selection, I felt it was beyond the scope of this chapter to address IRT while still addressing reliability as I have done herein. For a recent, parsimonious treatment of IRT, see Yen and Fitzpatrick (2006).   6. Given condition (a) and (c) such replicates will also have identical observed score variances.   7. Actually, this statement is a bit of a misnomer, because coefficient alpha and intraclass correlations simply represent specific computational forms of a broader class of coefficients known as generalizability coefficients (Cronbach & Shavelson, 2004).   8. The highest order interaction term and the residual term in G-theory models are confounded because such designs essentially amount to having one observation per cell. Thus, in practice, it is not possible to generate separate estimates of variance in X attributable to these two effects.   9. Two notes here: First, as we will discuss in the following sections, one’s ability to estimate each of these components will be limited by the measurement design underlying one’s measurement procedure. The example here assumes a fully crossed design, which will often not be the case in practice. Second, note that in Equation 1.5 we combine variance components for the applicant-question-rater interaction and residual terms; this reflects the fact that these sources of variance will not be uniquely estimable.


Reliability 10. Although labeled as a “generalizability” coefficient, note that this formula provides an estimate of σ2T over σ2T +σ2E, and as such may be considered an estimate of reliability. 11. Note the idea of fixing a facet of measurement for purposes of estimating σ2T and σ2E in the context of G-theory is different from modeling a factor or covariate as fixed in the context of mixed-effects models (DeShon, 2002; Searle, Casella, & McCulloch, 1992). 12. Another common ratings design faced in practice (particularly with job performance ratings) is one in which ratees are nested with raters (e.g., each group of incumbents is rated by their respective group supervisor). In this case, each ratee has only one rater, and as such there is no way to distinguish between the σ2PR (typically considered a source of error) and σ2P (typically considered true score variance). Thus, estimating inter-rater reliability on the basis of data structured in this manner is not possible. 13. We refer the interested readers to Brennan (1998), Haertel (2006), and Qualls-Payne (1992) for modern treatments of SEMs in the context of CTT and G-theory. One advantage of SEMs over reliability coefficients is that they can be tailored to individuals being measured (e.g., differential amounts of error depending on individuals’ level of true score), whereas reliability coefficients are typically associated with groups of individuals. The latter is often cited as one benefit of IRT-based perspectives on measurement over CTT- and G-theory-based perspectives; however, CTT and G-theory also offer methods for generating individual-level SEMs (Haertel, 2006). 14. My speculation on why this occurred is (a) the perceived complexity and jargon-loaded nature of G-theory (DeShon, 2002), (b) the overarching dominance of the correlational research tradition underlying selection research and practice (Cronbach, 1957; Dunnette, 1966; Guion, 1998), and (c) the steady decline of teaching psychometrics and statistics in graduate programs since the 1970s (Aiken et al., 2008; Merenda, 2007). 15. Given the discussion raised earlier, σ2T in any of these examples may also reflect variance attributable to one or more hidden facets of measurement. 16. On a historical note, Cronbach did not invent coefficient alpha per se—Guttman’s (1945) L3 coefficient and Hoyt’s (1941) coefficient are algebraically identical to alpha and were introduced long before Cronbach’s (1951) landmark article. 17. We should note that a subtle difference between Pearson r-based indices of reliability and those noted here (i.e., KR-20, Gutman’s lambdas, alpha) is that the latter assess the additive relationship between replicates, whereas Pearson r assesses the linear relationship between replicates. Differences in the variances of replicates will reduce alpha and other additive reliability indices, but they will have no effect on Pearson r-based indices because the latter standardizes any variance differences between replicates away (McGraw & Wong, 1996). 18. One potential caveat to this regards the fairly recent ability of SAS and SPSS to fit random-effects models that allow for heterogeneity in variance component estimates (e.g., Littell et al., 1996; SPSS Inc., 2005). Such capabilities might be leveraged to test parallelism assumptions underlying the CTT and G-theory score models. 19. McDonald (1999) refers to this coefficient as “omega” (p. 89). 20. 
This downward bias arises from the fact that most reliability estimates based on these traditions rely on the average covariance among replicates to make inferences regarding the magnitude of true score variance for each replicate. To the extent that replicates are not essentially tau-equivalent, this average covariance will tend to underestimate true score variance for each replicate (a component of true score variance of the composite of replicates), thus leading to a slight underestimation of reliability when all other assumptions are met (e.g., uncorrelated errors among replicates) (Feldt & Brennan, 1989; Raykov, 2001a). 21. Even if there is a pattern to the residual covariances, the investigator might still wish to treat them as contributing to σ2E if they reflect an artifact of the particular measurement situation (e.g., Green & Hershberger, 2000). This raises an important point: The examples offered here are for illustration; they are not prescriptions for future research and practice. Ultimately, the individual investigator decides how to treat residual covariance given the characteristics of the measurement situation he or she faces. 22. Note Mosier’s (1943) formula is equivalent to the formula for stratified coefficient alpha if elements comprising a composite are equally weighted. 23. Note that all else being equal, stratified alpha will tend to be higher (appropriately so) than coefficient alpha applied to the same data if between-component item covariances are lower than withincomponent item covariances—likely a common occurrence in practice for measures of multidimensional constructs (Haertel, 2006; Schmitt, 1996). In both cases the denominator of these coefficients is the same (observed variance); what changes is how true score variance for each component of the composite is estimated (these in turn are part of what contribute to σ2T for the overall composite). For

stratified alpha, σ2T for any given component is a function of the average covariance among items within that component; for alpha, σ2T for any given component is a function of the average covariance among all items, regardless of component. As such, if between-component item covariances are lower than within-component item covariances, σ2T for any given component will be lower if alpha is applied to the data rather than stratified alpha; in turn, the estimate of σ2T for the overall composite produced by alpha will also be lower. 24. Interested readers are referred to Hocking (1995) and Brennan (2001b, pp. 166–168).

REFERENCES Aguinis, H., Henle, C. A., & Ostroff, C. (2001). Measurement in work and organizational psychology. In N. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work, and organizational psychology: Vol. 1: Personnel psychology (pp. 27–50). London, England: Sage. Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Graduate training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno’s (1990) survey of PhD programs in North America. American Psychologist, 63, 32–50. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. Bell, J. F. (1985). Generalizability theory: The software problem. Journal of Educational and Behavioral Statistics, 10, 19–29. Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494. Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71, 425–440. Borsboom, D., & Mellenbergh, G. J. (2002). True scores, latent variables, and constructs: A comment on Schmidt and Hunter. Intelligence, 30, 505–514. Bost, J. E. (1995). The effects of correlated errors on generalizability and dependability coefficients. Applied Psychological Measurement, 19, 191–203. Brennan, R. L. (1998). Raw-score conditional standard errors of measurement in generalizability theory. Applied Psychological Measurement, 22, 307–331. Brennan, R. L. (2001a). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38, 295–317. Brennan, R. L. (2001b). Generalizability theory. New York, NY: Springer-Verlag. Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 1–16). Westport, CT: American Council on Education and Praeger. Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York, NY: Guilford Press. Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322. Campbell, J. P. (1976). Psychometric theory. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 185–222). New York, NY: John Wiley & Sons. Cattell, R. B. (1966). The data box: Its ordering of total resources in terms of possible relational systems. In R. B. Cattell (Ed.), Handbook of multivariate experimental psychology (pp. 67–128). Chicago, IL: Rand McNally. Cronbach, L. J. (1947). Test “reliability”: Its meaning and determination. Psychometrika, 12, 1–16. Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 292–334. Cronbach, L. J. (1957). The two disciplines of scientific psychology. American Psychologist, 12, 671–684. Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York, NY: John Wiley. Cronbach, L. J., Linn, R. L., Brennan, R. 
L., & Haertel, E. H. (1997). Generalizability analysis for performance assessments of student achievement or school effectiveness. Educational and Psychological Measurement, 57, 373–399. Cronbach, L. J., Rajaratnam, N., & Gleser, G. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16(2), 137–163.


Reliability Cronbach, L. J., Schoneman, P., & McKie, D. (1965). Alpha coefficient for stratified-parallel tests. Educational & Psychological Measurement, 25, 291–312. Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and its successor procedures. Educational and Psychological Measurement, 64, 391–418. DeShon, R. P. (1995, May). Restricted maximum likelihood estimation of variance components in generalizability theory: Overcoming balanced design requirements. Paper presented at the 10th annual conference of the Society of Industrial and Organizational Psychology, Orlando, FL. DeShon, R. P. (1998). A cautionary note on measurement error corrections in structural equation models. Psychological Methods, 3, 412–423. DeShon, R. P. (2002). Generalizability theory. In F. Drasgow & N. Schmitt (Eds.), Measuring and analyzing behavior in organizations (pp. 189–220). San Francisco, CA: Jossey-Bass. Dunnette, M. D. (1966). Personnel selection and placement. Oxford, England: Wadsworth. Ebel, R. L. (1951). Estimation of the reliability of ratings. Psychometrika, 16, 407–424. Eid, M., Lischetzke, T., Nussbeck, F. W., & Trierweiler, L. I. (2003). Separating trait effects from traitspecific method effects in multitrait-multimethod models: A multiple-indicator CT-C(M-1) model. Psychological Methods, 8, 38–60. Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York, NY: American Council on Education and Macmillan. Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, Scotland: Oliver and Boyd. Ghiselli, E. E. (1964). Theory of psychological measurement. New York, NY: McGraw-Hill. Green, S. B. (2003). A coefficient alpha for test-retest data. Psychological Methods, 8, 88–101. Green, S. B., & Hershberger, S. L. (2000). Correlated errors in true score models and their effect on coefficient alpha. Structural Equation Modeling, 7, 251–270. Greguras, G. J., & Robie, C. (1998). A new look at within-source interrater reliability of 360 degree feedback ratings. Journal of Applied Psychology, 83, 960–968. Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence Erlbaum. Gulliksen, H. (1950). Theory of mental tests. New York, NY: Wiley. Guttman, L. A. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255–282. Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65–110). Westport, CT: American Council on Education and Praeger. Hocking, R. R. (1995). Variance component estimation in mixed linear models. In L. K. Edwards (Ed.), Applied analysis of variance in behavioral science (pp. 541–571). New York, NY: Marcel Dekker. Hoyt, C. (1941). Test reliability obtained by analysis of variance. Psychometrika, 6, 153–160. Jackson, S., & Brashers, D. E. (1994). Random factors in ANOVA. Thousand Oaks, CA: Sage. Joreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109–133. Joreskog, K. G., & Sorbom, D. (2001). LISREL 8: User’s reference guide. Lincolnwood, IL: Scientific Software International. Kamata, A., Turhan, A., & Darandari, E. (April 2003). Estimating reliability for multidimensional composite scale scores. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL. Kenny, D. A., & Judd, C. M. (1986). Consequences of violating the independence assumption in analysis of variance. 
Psychological Bulletin, 99, 422–431. Kenny, D. A., & Kashy, D. A. (1992). Analysis of the multitrait-multimethod matrix by confirmatory factor analysis. Psychological Bulletin, 112, 165–172. Komaroff, E. (1997). Effect of simultaneous violations of essential tau-equivalence and uncorrelated errors on coefficient alpha. Applied Psychological Measurement, 21, 337–348. Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151–160. Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago: The University of Chicago Press. Lance, C. L. (2008). Why assessment centers don’t work the way they’re supposed to. Industrial and Organizational Psychology, 1, 84–97. Lance, C. L., Lambert, T. A., Gewin, A. G., Lievens, F., & Conway, J. M. (2004). Revised estimates of dimension and exercise variance components in assessment center postexercise dimension ratings. Journal of Applied Psychology, 89, 377–385. Le, H., & Putka, D. J. (2007). Reliability. In S. G. Rogelberg (Ed.), Encyclopedia of industrial and organizational psychology (Vol. 2, pp. 675–678). Thousand Oaks, CA: Sage. Le, H., Schmidt, F. L., & Putka, D. J. (2009). The multifaceted nature of measurement artifacts and its implications for estimating construct-level relationships. Organizational Research Methods, 12, 165–200.


Lee, G., Dunbar, S. B., & Frisbie, D. A. (2001). The relative appropriateness of eight measurement models for analyzing scores from tests comprised of testlets. Educational and Psychological Measurement, 61, 958–975.
Lee, G., & Frisbie, D. A. (1999). Estimating reliability under a generalizability theory model for test scores composed of testlets. Applied Measurement in Education, 12, 237–255.
Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996). SAS system for mixed models. Cary, NC: SAS Institute.
Lord, F. M. (1955). Estimating test reliability. Educational and Psychological Measurement, 15, 325–336.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lumsden, J. (1976). Test theory. Annual Review of Psychology, 27, 251–280.
Marcoulides, G. A. (1990). An alternative method for estimating variance components in generalizability theory. Psychological Reports, 66, 102–109.
Marcoulides, G. A. (1996). Estimating variance components in generalizability theory: The covariance structure analysis approach. Structural Equation Modeling, 3, 290–299.
Marsh, H. W., & Grayson, D. (1994). Longitudinal confirmatory factor analysis: Common, time-specific, item-specific, and residual-error components of variance. Structural Equation Modeling, 1, 116–146.
Maxwell, A. E. (1968). The effect of correlated error on reliability coefficients. Educational and Psychological Measurement, 28, 803–811.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.
McPhail, S. M. (Ed.). (2007). Alternative validation strategies: Developing new and leveraging existing validity evidence. San Francisco, CA: John Wiley and Sons.
Merenda, P. F. (2007). Psychometrics and psychometricians in the 20th and 21st centuries: How it was in the 20th century and how it is now. Perceptual and Motor Skills, 104, 3–20.
Mosier, C. I. (1943). On the reliability of a weighted composite. Psychometrika, 8, 161–168.
Mount, M. K., Judge, T. A., Scullen, S. E., Sytsma, M. R., & Hezlett, S. A. (1998). Trait rater and level effects in 360-degree performance ratings. Personnel Psychology, 51, 557–576.
Murphy, K. R., & DeShon, R. (2000). Interrater correlations do not estimate the reliability of job performance ratings. Personnel Psychology, 53, 873–900.
Ng, K. T. (1974). Spearman’s test score model: A restatement. Educational and Psychological Measurement, 34, 487–498.
Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements. Psychometrika, 32, 1–13.
Nunnally, J. C. (1978). Psychometric theory. New York, NY: McGraw-Hill.
Ployhart, R. E., Schneider, B., & Schmitt, N. (2006). Staffing organizations: Contemporary practice and theory (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Putka, D. J., & Hoffman, B. J. (2013). Clarifying the contribution of assessee, dimension, exercise, and assessor-related effects to reliable and unreliable variance in assessment center ratings. Journal of Applied Psychology, 98, 114–133.
Putka, D. J., & Hoffman, B. J. (2015). The reliability of job performance ratings equals 0.52. In C. E. Lance & R. J. Vandenberg (Eds.), More statistical and methodological myths and urban legends (pp. 247–275). New York, NY: Taylor & Francis.
Putka, D. J., Le, H., McCloy, R. A., & Diaz, T. (2008). Ill-structured measurement designs in organizational research: Implications for estimating interrater reliability. Journal of Applied Psychology, 93, 959–981.
Putka, D. J., & McCloy, R. A. (2008, February). Estimating variance components in SPSS and SAS: An annotated reference guide. Retrieved March 23, 2009, from http://www.humrro.org/djp_archive/Estimating Variance Components in SPSS and SAS.pdf
Qualls-Payne, A. L. (1992). A comparison of score level estimates of the standard error of measurement. Journal of Educational Measurement, 29, 213–225.
Rae, G. (2007). A note on using stratified alpha to estimate the composite reliability of a test composed of interrelated nonhomogeneous items. Psychological Methods, 12, 177–184.
Raykov, T. (1998). Coefficient alpha and composite reliability with interrelated nonhomogeneous items. Applied Psychological Measurement, 22, 375–385.
Raykov, T. (2001a). Bias of coefficient α for fixed congeneric measures with correlated errors. Applied Psychological Measurement, 25, 69–76.
Raykov, T. (2001b). Estimation of congeneric scale reliability using covariance structure analysis with nonlinear constraints. British Journal of Mathematical and Statistical Psychology, 54, 315–323.
Raykov, T., & Marcoulides, G. A. (2011). Introduction to psychometric theory. New York, NY: Routledge.
Raykov, T., & Shrout, P. E. (2002). Reliability of scales with general structure: Point and interval estimation using a structural equation modeling approach. Structural Equation Modeling, 9, 195–212.
Reuterberg, S. E., & Gustafsson, J. E. (1992). Confirmatory factor analysis and reliability: Testing measurement model assumptions. Educational and Psychological Measurement, 52, 795–811.
Rogers, W. M., Schmitt, N., & Mullins, M. E. (2002). Correction for unreliability of multifactor measures: Comparison of alpha and parallel forms of approaches. Organizational Research Methods, 5, 184–199.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199–223.
Schmidt, F. L., Le, H., & Ilies, R. (2003). Beyond alpha: An empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual-differences constructs. Psychological Methods, 8, 206–224.
Schmidt, F. L., Viswesvaran, C., & Ones, D. S. (2000). Reliability is not validity and validity is not reliability. Personnel Psychology, 53, 901–912.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350–353.
Schmitt, N., & Landy, F. J. (1993). The concept of validity. In N. Schmitt, W. C. Borman, & Associates (Eds.), Personnel selection in organizations (pp. 275–309). San Francisco, CA: Jossey-Bass.
Scullen, S. E. (1999). Using confirmatory factor analyses of correlated uniquenesses to estimate method variance in multitrait-multimethod matrices. Organizational Research Methods, 2, 275–292.
Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latent structure of job performance ratings. Journal of Applied Psychology, 85, 956–970.
Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance components. New York, NY: Wiley.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
Smith, P. L., & Luecht, R. M. (1992). Correlated effects in generalizability studies. Applied Psychological Measurement, 36, 229–235.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
SPSS. (2005). Linear mixed-effect modeling in SPSS: An introduction to the mixed procedure (Technical Report LMEMWP-0305). Chicago, IL: Author.
Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement (pp. 560–620). Washington, DC: American Council on Education.
Tryon, R. C. (1957). Reliability and behavior domain validity: Reformulation and historical critique. Psychological Bulletin, 54, 229–249.
Van Iddekinge, C. H., Raymark, P. H., Eidson, C. E., Jr., & Attenweiler, W. J. (2004). What do structured selection interviews really measure? The construct validity of behavior description interviews. Human Performance, 17, 71–93.
Viswesvaran, C., Schmidt, F. L., & Ones, D. S. (2005). Is there a general factor in ratings of job performance? A meta-analytic framework for disentangling substantive and error influences. Journal of Applied Psychology, 90, 108–131.
Webb, N. M., & Shavelson, R. J. (1981). Multivariate generalizability of general educational development ratings. Journal of Educational Measurement, 18, 13–22.
Werts, C. E., Linn, R. L., & Joreskog, K. G. (1974). Intraclass reliability estimates: Testing structural assumptions. Educational and Psychological Measurement, 34, 25–32.
Widaman, K. F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod data. Applied Psychological Measurement, 9, 1–26.
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 111–153). Westport, CT: American Council on Education and Praeger.
Zimmerman, D. W., Zumbo, B. D., & LaLonde, C. (1993). Coefficient alpha as an estimate of test reliability under violation of two assumptions. Educational and Psychological Measurement, 53, 33–49.


2 VALIDATION STRATEGIES FOR PRIMARY STUDIES
NEAL W. SCHMITT, JOHN D. ARNOLD, AND LEVI NIEMINEN

NATURE OF VALIDITY

Most early applications of tests as decision-making tools in the selection of personnel in work organizations involved a validation model in which the scores on tests were correlated with some measure or rating of job performance, such as the studies of salespersons by Scott (1915) and streetcar motormen by Thorndike (1911). This view of validity was reinforced in books by Hull (1928) and Viteles (1932). Subsequent reviews by Ghiselli (1966, 1973) were similarly focused on what was by then known as criterion-related validity. During this time, there was a recognition that tests could and should be based on other logical arguments as well. Standards for Educational and Psychological Tests (American Psychological Association [APA], American Educational Research Association [AERA], and National Council on Measurement in Education [NCME], 1954) identified four aspects of validity evidence: content, predictive, concurrent, and construct. With time, the predictive and concurrent aspects of validity became seen as simply different research designs, the purpose of which was to establish a predictor-criterion relationship; hence, they became known as criterion-related validity. Content and construct validation were seen as alternate methods by which one could validate and defend the use of test scores in decision making.

A much broader view of the nature of validity is accepted today; in general, validity is seen as the degree to which the inferences we draw from a set of test scores about job performance are accurate. The subsequent separation of approaches to validity (content, construct, and criterion-related) produced numerous problems, not the least of which was the notion that there were times when one approach was to be preferred over another or that there were different acceptable standards by which these different aspects of validity were to be judged. Most important, however, was the realization on the part of measurement scholars that all were aspects of construct validity, that is, the theoretical reasonableness of our explanations of job behavior. There was also a realization that the inferences we derive from test scores are central to all validation work. Content validity and the practices usually associated with it were recognized as desirable practices in the development of any test. Careful consideration of the “theory” and hypotheses that underlie our conceptualization of performance, and of how the constructs central to job performance are represented in our tests, is always important and unifying insofar as our validation efforts are concerned. Traditional criterion-related research represents one type of evidence that can be collected to confirm or disconfirm these hypotheses. This “unitarian” approach to validity was strongly argued in the 1985 Standards and has been incorporated in the 1999 and 2014 versions of the Standards (AERA, APA, & NCME, 1999, 2014). In all instances, evidence from multiple studies or sources is desirable.


Different Approaches to the Collection of Data About Validity

Validity, as defined in the most recent version of the Standards (2014), is “the degree to which evidence and theory support the interpretation of test scores for proposed uses of the test” (p. 11). The user must state explicitly what interpretations are to be derived from a set of test scores, including the nature of the construct thought to be measured. The document goes on to describe a variety of evidence that can support such an interpretation.

Content Evidence

Test themes, wording, item format, tasks, and administrative guidelines all constitute the “content” of a test, and a careful logical or empirical analysis of the relationship of this content to the construct measured, as well as expert judgments about the representativeness of the items to the construct measured, supports validity. The evidence that a measure is content valid usually takes the form of an analysis by subject matter experts that describes a linkage between the test content and the content of a job. Perhaps most stringent is the view that a test is content valid if it is a “representative sample of the tasks, behaviors, or knowledge drawn from that domain” (Principles, 1987, p. 19), meaning the job domain. A more liberal approach to content validity is expressed in the 2003 version of the Principles. That is, a test is content valid if there is evidence that the test was designed explicitly as a sample of the “important work behaviors, activities and/or worker KSAOs necessary for performance on the job or in job training” (p. 21). Also, content validity evidence may include “logical or empirical analyses that compare the adequacy of the match between test content and work content, worker requirements, or outcomes of the job” (Principles, 2003, p. 6).

The role of content evidence in the validation process continues to be controversial among professionals in the field, as evidenced by the paper authored by Murphy (2009) and the responses to his paper in the same issue of Industrial and Organizational Psychology. Murphy stated what some in this series of papers thought was old news; namely, that evidence that job content and test content were highly similar was not related to criterion-related validity. Responses reflected a variety of views as to the nature of content validity and the notion that there was no reason the two sets of evidence should be related.

In a typical content validity effort, subject matter experts provide judgments that link knowledge, skills, abilities, and other characteristics (KSAOs) to specific job elements (i.e., a KSAO is required to perform a part of the job adequately) and link KSAOs to test items or subtests (i.e., responses to a test item provide information about the level of a test taker’s KSAO). An effort is then made to assess the commonality between these two lists of KSAOs and to conclude whether that commonality is sufficient to support the inference that people who do well on the test will also do well on the job.

A concern in some instances is the degree to which the results of a content validity study conducted in one context can be used to support inferences about job performance in another situation. Perhaps the most common rationale for such generalization is the notion that the new or local setting is similar to that in which the content validity study was done; that is, the applicant characteristics, the predictor and criterion constructs, and other important aspects of the two situations (the original content validation study and the situation to which results are to be generalized) are judged to be comparable. The arguments that the work components of the two situations are the same must be clear and persuasive.
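The linkage judgments described above lend themselves to simple summary indices. The sketch below uses entirely hypothetical KSAO labels and linkage results to gauge how much of the job-analysis KSAO list a test covers; it is an illustration of the general logic, not a prescribed procedure.

```python
# Minimal sketch with hypothetical SME linkage results: how much of the
# job-analysis KSAO list does the test cover, and what does it measure
# that the job analysis did not call for?

# KSAOs that SMEs linked to at least one important job element
job_ksaos = {"reading comprehension", "mechanical knowledge",
             "attention to detail", "oral communication"}

# KSAOs that SMEs linked to at least one test item or subtest
test_ksaos = {"reading comprehension", "mechanical knowledge", "attention to detail"}

coverage = len(job_ksaos & test_ksaos) / len(job_ksaos)
print(f"Proportion of job KSAOs sampled by the test: {coverage:.0%}")
print("Job KSAOs not covered (possible deficiency):", sorted(job_ksaos - test_ksaos))
print("Test KSAOs not job-linked (possible contamination):", sorted(test_ksaos - job_ksaos))
```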

Response Processes

Validity evidence can also take the form of an examination of the response processes involved in responding to an item. For example, in evaluating the capabilities of an applicant for a mechanical job, we might ask the person to read a set of instructions on how to operate a piece of equipment and then ask the applicant to demonstrate the use of the equipment. Because the equipment is used on the job, the test would seem valid, but suppose we also find that test scores are highly related to examinees’ vocabulary level. We would then want to know if vocabulary is necessary to learn how to use this equipment on the job and, depending on the answer to that question, we may want to revise the test. It is rare, in our experience at least, that the similarity of response processes across tests and criteria is presented as the sole validity support, but such evidence is often inherent in what is more likely to be termed content validity or transportability arguments (i.e., transporting a validity claim from one situation to another).

Internal Structure of the Test

Yet a third piece of evidence might be to collect data regarding the internal structure of a test. We would examine the degree to which different items in a test (or responses to an interview) yield correlated results and whether items designed to measure one construct can be differentiated from items written to assess a different construct. Researchers interested in these questions use item means, standard deviations, and intercorrelations as well as exploratory and confirmatory factor analyses to evaluate hypotheses about the nature of the constructs measured by a test. When these data confirm the hypothesized nature of the constructs measured by the test and those constructs are deemed to underlie worker performance, there is support for the predictive inference (i.e., that test scores predict job performance).
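As a concrete illustration of the item-level statistics mentioned above, the following sketch computes inter-item correlations and coefficient alpha for a small set of hypothetical item responses (the item names and data are invented for the example):

```python
# Minimal sketch with hypothetical item responses: inter-item correlations and
# coefficient alpha for items intended to measure a single construct.
from statistics import variance, correlation  # statistics.correlation requires Python 3.10+

items = {  # five respondents, three items
    "item1": [4, 3, 5, 2, 4],
    "item2": [5, 3, 4, 2, 5],
    "item3": [3, 2, 5, 1, 4],
}

# Items written for the same construct should correlate with one another
names = list(items)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"r({a}, {b}) = {correlation(items[a], items[b]):.2f}")

# Coefficient alpha for the unit-weighted composite of the k items
k = len(items)
totals = [sum(vals) for vals in zip(*items.values())]
alpha = (k / (k - 1)) * (1 - sum(variance(v) for v in items.values()) / variance(totals))
print(f"coefficient alpha = {alpha:.2f}")
```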

Criterion-Related Evidence

Similar to looking at the internal structure of a test, researchers can also examine its external validity by correlating the test results with job performance measures. Validity in the personnel selection area has been almost synonymous with the examination of the relationship between test scores and job performance measures, most often referred to as criterion-related validity. Because there is a large body of primary studies of many job performance-test relationships, one can also examine the extent to which tests of similar constructs are related to job performance and generalize in a way that supports the validity of a new measure or an existing measure in a new context. These are studies of validity generalization, which we will discuss in more depth in the Validity Generalization section. It should be noted that without primary studies of criterion-related validity, there can be no validity generalization studies, and without recent studies of criterion-related validity, we cannot assess newer developments in testing technology or criterion development using meta-analyses. Likewise, transportability and synthetic validity arguments (see the discussion of synthetic validity below) in support of the predictive hypothesis underlying the use of tests are impossible without primary studies of criterion-related validity.

In practice, criterion-related validity studies are often criticized for failing to adequately address validity issues surrounding the criterion measure(s) used. The relative lack of scientific scrutiny focused on criteria, termed the “criterion problem” (Austin & Villanova, 1992), has been a topic of discussion among personnel psychologists for years (Dunnette, 1963; Fiske, 1951; Guion, 1961). Universal to these discussions is the call for more rigorous validation evidence with respect to the criteria that are used. Binning and Barrett (1989) outlined this task, underscoring two interrelated goals for the validation researcher. First, they suggested that the selection of criteria should be rooted in job analysis to the same extent that the selection of predictors traditionally is (i.e., more attention to rigorous “criterion development”). Other considerations relevant to the tasks of criterion development and validation include the use of “hard” or objective criteria versus more proximal behaviors that lead to these outcomes (Thayer, 1992), use of multiple relevant criteria as opposed to a single overall criterion (Campbell, McCloy, Oppler, & Sager, 1993; Dunnette, 1963), and the possibility that criteria are dynamic (i.e., change over time for employees as a function of how long they have been on the job) (Barrett, Caldwell, & Alexander, 1985). Second, researchers should be concerned with demonstrating evidence of construct-related validity for the criterion. Investigators must specify the latent dimensions that underlie the content of their criterion measures. This involves expansion of the nomological network to include inferences that link the criterion measure(s) to constructs in the performance domain (e.g., by demonstrating that criterion measures are neither contaminated nor deficient with respect to their coverage of the intended constructs in the performance domain) and link constructs in the performance domain to job demands that require specific ability or motivational constructs (e.g., by demonstrating through job analysis that constructs in the performance domain are organizationally meaningful).

Campbell and his colleagues (e.g., Campbell, McCloy, Oppler, & Sager, 1993) have repeatedly emphasized the importance of the nature of criteria or performance constructs. These authors make the somewhat obvious, although often overlooked, point that performance should be defined as behavior (“what people actually do and can be observed”); the products of one’s behavior, or what are often called “hard criteria,” are only indirectly the result of one’s behavior and may be influenced by other factors that are not attributable to an individual job incumbent. Further, we may consider relatively short-term or proximal criteria or distal criteria, such as the impact of one’s career on some field of interest. Any specification of a performance or criterion domain must also consider the impact of time (Ackerman, 1989; Henry & Hulin, 1989). In any study of performance, these various factors must be carefully considered when one decides on the nature of the performance constructs, the actual operationalizations of the underlying constructs, and how those measures might or might not be related to measures of other constructs in the domain of interest.

Use of a criterion-related strategy makes a special set of methodological and statistical approaches relevant. Power analysis is a useful framework for interrelating the concepts of statistical significance, effect size, sample size, and reliability (Cohen, 1988) and has design and evaluation implications for the statistical relationships sought in criterion-related studies. For instance, the sample size needed to demonstrate a statistically significant predictor-criterion relationship decreases as the magnitude of the relationship that exists between predictor and criterion (i.e., effect size) increases. Sussman and Robertson (1986), in their assessment of various predictive and concurrent validation designs, found that those strategies that allowed larger sample sizes gained a trivial increment in power. This suggests that, as long as sample sizes can support the use of a criterion-related design, further attention toward increasing N may not reap large benefits. Other factors affecting power include the interrelatedness and number of predictors used, such that the addition of nonredundant predictors increases power (Cascio, Valenzi, & Silbey, 1978). The reliability of the predictors and criteria and the decision criteria used for inferring that a relationship is nonzero (i.e., that the confidence interval around the estimate of effect size does not include zero) also affect power.
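To make the sample-size point concrete, the sketch below approximates the n required to detect a nonzero correlation at conventional alpha and power levels using the Fisher z transformation; the effect sizes shown are illustrative and are not values taken from the chapter.

```python
# Minimal sketch: approximate sample size needed to detect a nonzero
# predictor-criterion correlation, using the Fisher z approximation.
from math import atanh, ceil
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Two-tailed test of H0: rho = 0 against a true correlation of r."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)  # atanh(r) is Fisher's z

for r in (0.10, 0.20, 0.30, 0.50):
    print(f"true r = {r:.2f}: roughly n = {n_for_correlation(r)} needed")
```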
By incorporating power analysis in validation design, researchers can increase the likelihood that relationships relevant to key inferences will be tested with a sample size sufficient to support confidence in the results. However, from a scientific standpoint, the importance of demonstrating that predictor-criterion relationships are statistically significant may be overstated, given that relationships that may not be practically meaningful can reach statistical significance with large enough sample sizes. For instance, a statistically significant relationship in which a test accounts for less than 5%, or some other relatively small portion, of the variance in job performance is not unequivocal support for the test’s use. This is especially evident when there is reason to suggest that other available tests could do a better job predicting performance. Further, rather than relying on statistical significance tests, an argument about the practical utility of the information about test takers’ KSAOs is most relevant and important (see the discussion of utility later in this chapter). Operationally, there are several other important considerations in criterion-related research (e.g., job analyses that support the relevance of predictor and criterion constructs and the quality of the measures of each set of constructs). However, those concerns are addressed repeatedly in textbooks (e.g., Guion, 1998; Ployhart, Schneider, & Schmitt, 2006). In the next section of this chapter, we address a very important concern that is rarely discussed.


Transportability of Validity

Another factor that can affect the extent of the local validation effort that is required is the availability of existing validation research. The Principles describes three related validation strategies that can be used as alternatives to conducting traditional local validation studies or to support the conclusions drawn at the primary study level. First, “transportability” of validity evidence involves applying validity evidence from one selection scenario to another, on the basis that the two contexts are judged to be sufficiently similar. Specifically, the Principles note that researchers should be concerned with assessing similarity in terms of worker characteristics [e.g., the knowledge, skills, and abilities (or KSAs) needed to perform the job in each context], job tasks and content, applicant pool characteristics, or other factors that would limit generalizability across the two contexts (e.g., cultural differences). Assessing similarity in this manner usually requires that researchers conduct a job analysis or rely on existing job analysis materials combined with their own professional expertise and sound judgment, and that they carefully document all procedures used to inform the decision.

Synthetic Validity

Synthetic validity is a process in which validity for a test battery is “synthesized” from evidence of multiple predictor–job component relationships (Peterson, Wise, Arabian, & Hoffman, 2001; Scherbaum, 2005). Job analysis is used to understand the various components that make up a particular job, and then predictor–job component relationships are collected for all available jobs with shared components. Because evidence can be drawn from other jobs besides the focal job, synthetic validity may be a particularly useful strategy for organizations that have too few incumbents performing the focal job to reach adequate sample sizes for a traditional criterion-related study (Scherbaum, 2005). Transportability and synthetic validity are similar notions: in transportability, one is taking the entire results of a validation study to justify use of a test or test battery in a new situation; in synthetic validity, one is taking the results of multiple different studies on different constructs to justify their use in a new situation in which the various constructs are deemed important to successful job performance.

An excellent example and evaluation of a synthetic validation effort is provided by Johnson and Carter (2010). Job analysis data in a large organization provided evidence for 11 job families and 27 job components. Twelve tests were developed to predict performance on these job components. Test scores and performance data on the job components were collected from 1,926 incumbents. A test composite for each job component was created, and a test battery was chosen for each job family based on relevant job components. Synthetic validity coefficients computed on each battery compared favorably with traditional validity coefficients computed within those job families for which adequate sample sizes were available.
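The arithmetic behind combining component-level evidence into a battery-level validity estimate can be illustrated with a standard composite formula. The sketch below uses entirely hypothetical validities and test intercorrelations to show the general logic of a unit-weighted composite; it is not a reconstruction of Johnson and Carter's specific procedure.

```python
# Minimal sketch with hypothetical values: validity of a unit-weighted composite of
# three tests chosen for a job family, given each test's criterion validity and the
# intercorrelations among the tests.
from math import sqrt

test_validities = [0.25, 0.20, 0.15]   # correlations of each test with the job-component criterion
test_intercorrs = [                    # correlation matrix among the three tests
    [1.00, 0.30, 0.20],
    [0.30, 1.00, 0.25],
    [0.20, 0.25, 1.00],
]

# Composite validity = sum of test-criterion correlations divided by the SD of the
# unit-weighted composite of standardized tests (square root of the summed matrix).
composite_validity = sum(test_validities) / sqrt(sum(sum(row) for row in test_intercorrs))
print(f"estimated composite validity = {composite_validity:.2f}")
```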

Validity Generalization

Validity generalization involves using meta-analytic findings to support the conclusion that predictor-criterion validity evidence can be generalized across situations. Like transportability strategies, meta-analytic findings provide researchers with outside evidence to support inferences in a local context. The argument for validity generalization on the basis of meta-analyses is that some selection tests, such as cognitive ability tests (Ones, Viswesvaran, & Dilchert, 2005), are valid across selection contexts. Thus, the implication is that with validity generalization strategies, unlike transportability, in-depth job analyses or qualitative studies of the local organizational context are unnecessary. In support of this assertion, Schmidt, Hunter, and colleagues (for a review, see Schmidt & Hunter, 2003) have argued that between-study variability in validity coefficients can be largely attributed to statistical artifacts, such as range restriction, unreliability, or sampling error. However, caution is warranted to the extent that meta-analyses have identified substantive moderators, or in the presence of strong theory indicating that some variable may moderate the magnitude of validity. Further, with regard to generalization across contexts, inferences drawn from meta-analytic findings are limited to the contexts of those studies included in the meta-analysis (LeBreton et al., Chapter 4 of this volume).

When local criterion-related studies of some relationship are conducted, meta-analytic estimates of the same relationship may be used as Bayesian priors to estimate the degree to which a meta-analytic estimate should be modified by the new local evidence. Schmidt and Hunter (1977) discussed this possibility in their original validity generalization study, but few researchers have recognized or used meta-analytic evidence in this fashion. However, recognition of the utility of such an approach to science appears to be increasing (e.g., Zyphur, Oswald, & Rupp, 2015). At minimum, meta-analytic findings should be referenced in test development and can be used to supplement evidence at the local level, either via theoretical or statistical means (Newman, Jacobs, & Bartram, 2007). The argument for more direct use of validity generalization strategies is dependent on the strength of the meta-analytic findings and in some cases may mean that local validation efforts are unnecessary or even misleading (e.g., due to small sample sizes). Nevertheless, the legal defensibility of the selection procedure may necessitate a local validation study.
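The Bayesian use of meta-analytic evidence mentioned above can be sketched as a simple normal-normal update in Fisher z units. All numbers below are hypothetical, and treating the meta-analytic mean and SD as a normal prior in the z metric is itself an approximation; the sketch is meant only to show the logic of shrinking a small local estimate toward the cumulative evidence.

```python
# Minimal sketch with hypothetical values: updating a meta-analytic prior with a
# local criterion-related validity estimate (normal-normal approximation in Fisher z units).
from math import atanh, tanh

# Prior: meta-analytic mean validity of .30 with a between-study SD of .10 (converted to z)
prior_mean = atanh(0.30)
prior_var = atanh(0.10) ** 2

# Local study: observed validity of .22 based on n = 150 incumbents
n_local = 150
local_mean = atanh(0.22)
local_var = 1.0 / (n_local - 3)   # approximate sampling variance of Fisher z

# Precision-weighted (conjugate normal) posterior
post_var = 1.0 / (1.0 / prior_var + 1.0 / local_var)
post_mean = post_var * (prior_mean / prior_var + local_mean / local_var)

print(f"posterior validity estimate = {tanh(post_mean):.2f}")
print(f"posterior SD (z metric) = {post_var ** 0.5:.3f}")
```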

Evidence Regarding the Consequences of Test Use

Finally, and somewhat controversially among industrial-organizational (I-O) psychologists, the Standards (1999, 2014) also suggest that researchers examine the intended and unintended consequences of using test scores to make decisions. This evidence is referred to as consequential validity (Messick, 1998). The consequences of most concern are the degree to which use of test scores results in disproportionate hiring of one or more subgroups (e.g., groups defined by gender, race, or disability). In addition, some I-O psychologists have noted that the traditional separation of reliability and validity concepts may be inadequate (Lance, Foster, Gentry, & Thoresen, 2004; Lance, Foster, Nemeth, Gentry, & Drollinger, 2007; Murphy & DeShon, 2000). Technology also affords the opportunity to make the traditional one-time criterion-related validity study an ongoing effort in which predictor and criterion data can be accumulated and aggregated across time and organizations.

VALIDATION IN DIFFERENT CONTEXTS

This chapter discusses validation largely within the context of personnel selection. This is the most common application of the various approaches to validation, and it is also the most straightforward example of how validation approaches can be applied. There is, however, a wide range of other contexts in which the validation of measures is desirable. Organizations should, for example, ensure they are using “validated” tools and processes in their performance management systems, in their assessments of training and development outcomes, and in their promotion and succession planning processes. In some instances, there should also be validity evidence for the use of survey measures (e.g., when the use of the measure is predicated on the notion that it measures some well-known construct and a scientific body of research is used to defend the use of the survey measure). Each of these circumstances is associated with its own set of challenges as the researcher designs an appropriate validation study. However, the design of a well-constructed study will by necessity follow the same logic as will be discussed for the personnel selection context. Following this logic, the studies should be structured to include the following three elements:

1. Job analysis. The foundation of validation in employment settings always involves the development of a clear understanding of job and organizational requirements. For example, for promotion purposes these would be the requirements of the target job(s) into which a person might be promoted. For training and development purposes, these would be the meaningful outcomes in terms of on-the-job performance that are the focus of the training/development efforts.
2. Systematic development. As measures are developed, they need to follow an architecture that is firmly grounded in the results of the job analysis. As the development of the measures is planned and as the tools are being constructed, activities need to be focused on ensuring that the measures are carefully targeted to address the intended constructs.
3. Independent verification. Once the measures are developed, they need to be subjected to independent verification that they measure the intended constructs. At times, this can involve statistical studies to determine whether the measures exhibit expected relationships with other independent measures (e.g., does the 360-degree assessment of leadership behavior correlate with an independent interview panel’s judgment of a leader’s behavior?). The independent verification is often derived from structured expert reviews of the measures that are conducted prior to implementation. Regardless of the method, this “independent verification” is a necessary aspect of establishing the validity of a measure.

Strong Versus Weak Inferences About Validity

Given that validation is a process of collecting evidence to support inferences derived from test scores (e.g., that a person will perform effectively on a job), the confidence with which inferences are made is a function of the strength of the evidence collected. Gathering stronger evidence of validity almost always necessitates increased effort, resources, and/or costs (e.g., to gain larger sample sizes or expand the breadth of the criterion measures). Thus, a key decision for researchers designing primary validation studies involves determining how to optimize the strength of the study (assurance that inferences are correct) within the bounds of certain practical limitations and organizational realities.

Situations may vary in terms of the extent to which feasibility drives the researcher’s choice among validation strategies. In some cases, situational limitations may be the primary determinant of the validation strategies available to researchers. For example, in situations in which adequately powered sample sizes cannot be achieved, validation efforts may require use of synthetic validity strategies (Scherbaum, 2005), transporting validity evidence from another context that is judged to be sufficiently similar (Gibson & Caplinger, 2007), generalizing validity across jobs or job families on the basis of meta-analytic findings (McDaniel, 2007; Rothstein, 1992), or relying on evidence and judgments that the content of the selection procedures is sufficiently similar to job tasks and/or the KSAOs required to support their use in decision making. Other factors noted by the Principles that may limit the feasibility of certain validation strategies include unavailability of criterion data, inaccessibility of subject matter experts (SMEs), as might be the case when consulting SMEs would compromise test security, dynamic working conditions such that the target job is changing or does not yet exist, and constraints on time and/or money.

Given the need to balance several competing demands (e.g., issues of feasibility limiting the approach that can be taken versus upholding high standards of professionalism and providing strong evidence to support key inferences), it is essential that researchers understand the various factors that have potentially important implications for the strength of evidence that is required in a given validation scenario. In other words, part of the decision process, with regard to planning and implementing a validation strategy, is a consideration of how strong the evidence in support of key inferences ought to be. The basic assumption here is that different situations warrant different strategies along several dimensions (Sussman & Robertson, 1986), one of which has to do with the strength of evidence needed in support of inferences. This does not mean that some studies may be held to lower standards of quality; rather, all validation studies and selection practices should aspire to the ethical and professional guidelines offered in the Principles, which means using sound methods rooted in scientific evidence and exhibiting high standards of quality. However, the Principles’ guidelines are not formulaic to the exclusion of professional judgment, nor are their applications invariant across circumstances. In the following paragraphs, several factors are identified that have potential implications for the strength of the evidence needed by a local validation effort.

Situational Factors Influencing the Strength of Evidence Needed

Although it is beyond the scope of this chapter to describe in full detail the legal issues surrounding validation research and selection practice (see Chapters 28, 29, and 30, this volume, for further discussions of legal issues), it would be difficult if not impossible to describe applied validation strategy without underscoring the influence of litigation, or the prospect of litigation, in the U.S. It is becoming almost cliché to state that, in circumstances in which there is a relatively high probability of litigation regarding selection practices, validation evidence is likely to function as a central part of defending those practices. Indeed, much validation research is stimulated by litigation, whether post facto or in anticipation of lawsuits. Within this context, researchers make judgments regarding the potential for litigation and adapt their validation strategies accordingly.

Numerous contextual factors contribute to the probability that litigation will occur. A primary example has to do with the type of selection instrument being validated and the potential for adverse impact, or the disproportionate rejection of identifiable subgroups. Tests that have historically resulted in adverse impact, such as tests of cognitive ability (Schmitt, Rogers, Chan, Sheppard, & Jennings, 1997) or physical ability (Arvey, Nutting, & Landon, 1992; Hogan & Quigley, 1986), tend to promote more litigation, and researchers validating these instruments in a local context should anticipate this possibility. Similarly, selection instruments with low face validity (i.e., the test’s job relevance is not easily discerned by test takers) are more likely to engender negative applicant reactions (Shotland, Alliger, & Sales, 1998), and decisions based on such tests may lead to applicant perceptions of unfairness (Cropanzano & Wright, 2003). In their review of the antecedents and consequences of employment discrimination, Goldman, Gutek, Stein, and Lewis (2006) identified employee perceptions of organizational and procedural justice as important antecedents of discrimination lawsuits. In addition to considering characteristics of the selection instrument(s) being validated, lawsuits over selection practice are more frequent in some industries (e.g., government) and job types (Terpstra & Kethley, 2002).

Researchers should also consider the implications and relative seriousness of hiring decisions that result in false-positive or false-negative errors. A false-positive error is made by selecting an unqualified individual whose performance on the job will be low, whereas a false-negative error is made by rejecting a qualified individual whose performance on the job would have been high. Making an error of either type can be costly to the organization. However, the relative impact of such errors can differ by occupation type and organizational context. For example, the negative impact of a false positive in high-risk occupations (e.g., nuclear power plant operator or air-traffic controller) or high-visibility occupations (e.g., Director of the Federal Emergency Management Agency [FEMA]) can be catastrophic, threatening the organization’s very existence (Picano, Williams, & Roland, 2006). Alternatively, for occupations that are associated with less risk, such that failure on the job does not have catastrophic consequences for the organization or larger society, or when organizations use probationary programs or other trial periods, the cost of false-positive errors may be relatively low.
Although validation efforts in both situations would be concerned with selection errors and with demonstrating that use of tests can reduce the occurrence and negative consequences of such errors, there are clearly some situations in which this would be more of a central focus of the validation effort. It is our contention that validating selection systems for high-risk occupations is a special circumstance warranting particularly “watertight” validation strategies in which strong evidence should be sought to support the inferences made. In these circumstances, a test with low validity (e.g., less than r = .10) might be used to make hiring decisions when the outcome of such decisions is critically important to organizational effectiveness, and decision makers would want to use any evidence available to reduce risk even if they are not predicting a large amount of variance. In some circumstances, the cost of false negatives is more salient. For example, strong evidence of a test’s validity may be warranted when an organization needs to fill a position or several positions, but applicants’ test scores are below some acceptable standard, indicating that they are not fit to hire (i.e., predicted on-the-job performance is low or very low). In this case, the organization’s decision to reject an applicant on the basis of his/her test scores would leave a position or several positions within the organization vacant, a costly mistake in the event that false-negative errors are present. Demonstrable evidence to support the test’s validity would be needed to justify such a decision and, in essence, convince the organization that it is better off with a vacant position than with the wrong person in the job. In these instances, one might want evidence of a larger test-criterion relationship (perhaps greater than r = .30) to warrant use of the test and the possible rejection of competent applicants.


The possibility of false negatives becomes a special concern when members of some subgroup(s) are selected less frequently than members of another subgroup. When unequal proportions of various subgroups are selected, the organization must be prepared to show that false negatives do not come primarily from one group as opposed to another. When this is impossible, the legal and social costs can be very high.

Personnel psychologists have long been aware of the fact that the utility of selection systems increases as a function of selectivity, such that selection instruments even modestly related to important outcomes can have large payoffs when there are many applicants from which only a few are to be selected (Brogden, 1951, 1959). On the other hand, as selection ratios become extremely liberal, such that nearly all applicants are accepted, even selection instruments highly related to performance have less positive implications for utility. From a purely utilitarian perspective, it would seem logical that demonstrating test validity is less of an impetus when selection ratios are liberal (because even the best tests will have little effect) and more of an impetus when selection ratios are low.

In licensing examinations, this utility perspective takes a different form because the major purpose of these examinations is to protect the public from “injuries” related to incompetent practice. In this case, the license–no license decision point using test scores is usually set at a point that is judged to indicate “minimal competence.” Depending on the service provided (e.g., hairdresser vs. surgeon), the cost of inappropriately licensing a person could be very different. On the other hand, certification examinations are usually oriented toward the identification of some special expertise in an area (e.g., pediatric dentistry or forensic photography); hence, a decision as to a score that would warrant certification might result in the rejection of larger numbers or proportions of examinees. In the cost-benefit balance for this situation (assuming all examinees are minimally competent), the benefits might accrue mostly to the individual receiving the certification in the form of greater earning power.

Design Considerations When Strong Evidence Is Needed

On the basis of the preceding discussion, situational factors can affect the feasibility and appropriateness of the validation models applied to a given selection context. Moreover, researchers should be particularly attuned to contextual variables that warrant an increased concern for demonstrating the strength of evidence collected and high levels of confidence in the inferences to be made. The validity strategies used reflect consideration of these contextual factors and others. The discussion that follows is focused on identifying a handful of actionable validation strategies to be considered by researchers when particularly strong evidence is needed.

Importance of the Nomological Net

Binning and Barrett (1989) offered a thorough conceptualization of the nomological network implicit in validity models (see Chapters 1 and 3 in this volume). Their model identifies multiple inferential pathways interrelating psychological constructs and their respective operational measures. Some inferential pathways in the model are empirically testable using observed variables (e.g., linkages between operationalized measures of constructs and linkages between constructs and their operationalized measures); others may be theoretically or rationally justified (e.g., construct-to-construct linkages) or tested using latent variable models, although these applications are relatively rare in personnel selection research (see Campbell, McHenry, & Wise, 1990, for an attempt to model job performance). Consistent with the unitarian conceptualization of validity, all validity efforts in a selection context are ultimately concerned with demonstrating that test scores predict future job performance, and each of the various inferential pathways represents sources or types of evidence to support this common inference. Binning and Barrett (1989, p. 482) described how “truncated” validation strategies often concentrate exclusively on demonstrating evidence for a single inferential pathway and as a result provide only partial support for conclusions regarding test validity. A more cogent argument for validity is built upon demonstration of strong evidence for several inferential pathways within the nomological network. For example, in addition to demonstrating a statistical relationship between observed measures from the predictor and performance domains, as is commonly the main objective in criterion-related validity studies, researchers should provide evidence of the psychological constructs underlying job performance (as well as the predictor measures) and demonstrate that the criterion measure adequately samples constructs from the performance domain.

Criterion Concerns

Various concerns regarding the criterion used in validation research are enumerated above in the description of criterion-related evidence of test validity. These concerns are particularly important when there is a need for strong evidence of test validity and tests are evaluated by relating test scores to measures of job performance.

Multiple Inferences in Validation Research

Gathering evidence to support multiple inferences within a theoretically specified nomological network resembles a pattern-matching approach. The advantage of pattern-matching research strategies is that stronger support for a theory can be gained when complex patterns of observed results match those that are theoretically expected (Davis, 1989). Logically, it would be less likely that a complex pattern of results would be observed simply because of chance. For example, when high scores on a test of empathy are expected to moderate the relationship between conscientiousness and the realization of performance goals and this pattern of expected relationships holds, it is unlikely to be due to chance. In addition, when experimental control of potentially confounding variables is not possible, pattern matching can be used to preempt alternative explanations for the observed relationships (i.e., threats to validity; Cook & Campbell, 1979).

A more extensive form of pattern matching involves the use of multiple studies, or research programs, to corroborate evidence of validity. Again, the logic is straightforward: stronger evidence is gained when a constellation of findings all lead to the same conclusion. Sussman and Robertson (1986) suggested that programs of research could be undertaken, “composed of multiple studies each utilizing a different design and aimed at collecting different types of evidence” (p. 467). Extending the rationale of the multitrait-multimethod (MTMM) approach (Campbell & Fiske, 1959), convergent evidence across studies may indeed be stronger if gained through different research designs and methods. Landy’s (1986) assertion that test validation is a form of hypothesis testing, and that judgments of validity are to be based on a “preponderance of evidence” (p. 1191; Guion, as cited in Landy, 1986), provides the context for consideration of research strategies such as quasi-experimental designs (Cook & Campbell, 1979) and program evaluation research (Strickland, 1979). Binning and Barrett (1989) presented a similar rationale by calling for “experimenting organizations” (p. 490) in which local validation research is treated as an ongoing and iterative process. Published research on the use of multiple experiments or methods in a selection-validation context remains sparse to date.

CONCERNS ABOUT THE QUALITY OF THE DATA: CLEANING THE DATA

Once data have been collected, quality control techniques should be applied to ensure that the data are clean before proceeding to statistical analysis. Some basic quality control techniques include double-checking data for entry errors, spot checking for discrepancies between the electronic data and original data forms, inspecting data for out-of-range values and statistical outliers, and visually examining the data using graphical displays (e.g., scatterplots, histograms, stem-and-leaf plots). Special concern is warranted in scenarios in which multiple persons are accessing and entering data or data sets from multiple researchers are to be merged. Although these recommendations may appear trite, they are often overlooked, and the consequences of erroneous data for the results of analyses and their interpretations can be profound.

A study by Maier (1988) illustrated, in stepwise fashion, the effects of data cleaning procedures on validity coefficients. Three stages of data cleaning were conducted, and the effects on correlations between the Armed Services Vocational Aptitude Battery (ASVAB) and subsequent performance on a work sample test for two military jobs (radio repairers and automotive mechanics) were observed. Selection was based on the experimental instrument (the ASVAB), and the work sample criterion tests were administered to incumbents in both occupations after some time had passed. In Phase 1 of the data cleaning process, the sample was made more homogeneous for the radio repairers group by removing the data of some employees who received different or incomplete training before criterion data collection. In comparison to the total sample, the validity coefficient for the remaining, more representative group that had received complete training before criterion collection was decreased (from .28 to .09). The initial estimate had been inflated because the partially trained group had scored low on both the predictor and the criterion. In Phase 2, scores on the criterion measure (i.e., ratings from a single rater on a work sample) were standardized across raters. Significant differences among raters were attributed to different rating standards and not to group differences in ratees, such as experience, rank, or supervisor performance ratings. The raters were noncommissioned officers and did not receive extensive training in the rating task, so differences among raters in judgmental standards were not unexpected. As a result, the validity coefficients for both jobs increased (radio repairers, from .09 to .18; automotive mechanics, from .17 to .24). In Phase 3, validity coefficients were corrected for range restriction, which again resulted in an increase in the observed validity coefficients (radio repairers, from .18 to .49; automotive mechanics, from .24 to .37). Maier noted that the final validity coefficients were within the expected range on the basis of previous studies.

The Maier (1988) study is illustrative of the large effect that data cleaning can have on attaining more accurate estimates of validity coefficients in a predictive design scenario. Several caveats are also evident, so that researchers can ensure that data cleaning procedures conducted on sound professional judgment are not perceived as data “fudging” and/or HARKing (Kerr, 1998). First, the cleaning procedures need to have a theoretical or rational basis. Researchers should document any decision criteria used and the substantive changes that are made; for example, researchers should record the methods used for detecting and dealing with outliers. In addition, a strong case should be built in support of any decisions made. The researcher bears the burden of defending each alteration made to the data. For example, in the Maier study, the decision to standardize criterion data across raters (because raters were relatively untrained and used different rating standards) was supported by empirical evidence that ruled out several alternative explanations for the mean differences observed among raters.
Perhaps the most serious problem is choosing which set of predictor-criterion relationships to report based on post hoc examination of the data. The best approach when applying various corrections to observed data is to report both corrected and uncorrected values of the relevant parameters. Finally, missing data on some variables are common in many applied studies, and a whole science (Little & Rubin, 2002) has evolved around the imputation of missing values (i.e., estimating the value of a missing variable based on the level of other available data). Bayesian analyses applied to the problem of missing data have allowed for estimation of parameters and of the confidence with which attributions about individual difference–performance relationships can be made. These techniques are not commonly used in personnel selection research, and we think they could be usefully applied in many instances.
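The sketch below illustrates two of the practices discussed in this section: simple screening for out-of-range predictor scores, and reporting a validity coefficient both as observed and after a correction for direct range restriction. All data values and the applicant-pool SD are hypothetical.

```python
# Minimal sketch with hypothetical data: screen for out-of-range predictor scores,
# then report the observed validity coefficient alongside a correction for direct
# range restriction on the predictor.
from math import sqrt
from statistics import correlation, stdev  # statistics.correlation requires Python 3.10+

test = [42, 55, 61, 48, 39, 66, 120, 58, 50, 63]       # 120 is out of range for a 0-100 test
perf = [3.1, 2.9, 4.0, 3.6, 2.8, 3.9, 3.7, 3.0, 3.8, 3.3]

# Screening: drop cases with impossible predictor values (and document every such decision)
keep = [i for i, x in enumerate(test) if 0 <= x <= 100]
test_clean = [test[i] for i in keep]
perf_clean = [perf[i] for i in keep]

r_obs = correlation(test_clean, perf_clean)

# Correction for direct range restriction, given an applicant-pool SD supplied from
# outside the incumbent sample (assumed known here).
sd_applicants = 15.0
u = sd_applicants / stdev(test_clean)
r_corrected = r_obs * u / sqrt(1 - r_obs**2 + (r_obs * u) ** 2)

print(f"observed r = {r_obs:.2f}; corrected for range restriction, r = {r_corrected:.2f}")
```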

MODES OF DECISION-MAKING AND THE IMPACT ON UTILITY AND ADVERSE IMPACT

Even if we have good-quality data, it still matters how we use those data in making decisions; the manner in which test scores are used determines whether the test produces aggregate improvements in performance. In this section, we will discuss the impact of various modes of decision making on two outcomes that are of concern in most organizations: overall performance improvement, or utility, and adverse impact on some protected group, defined as unequal proportions of selection across subgroups. Advancing both outcomes often creates a conflict, especially when one uses cognitive ability tests to evaluate members of different racial groups or physical ability tests to evaluate male and female applicants for a position. Measures of some other constructs (e.g., mechanical ability) also produce gender or race effects, but the subgroup differences that are largest and that affect the most people are those associated with cognitive and physical ability constructs.

Top-Down Selection Using Test Scores

If a test has a demonstrable relationship to performance on a job, the optimal utility in terms of expected employee performance will occur when the organization selects the top-scoring persons on the test to fill its positions (Brown & Ghiselli, 1953). Expected performance is a direct linear function of the test score–performance relationship when the top-scoring individuals are selected. However, using tests in this fashion, when it is possible, will mean that members of lower-scoring subgroups will be less likely to be selected (Murphy, 1986). This conflict between maximization of expected organizational productivity and adverse impact is well known and has been quantified for different levels of subgroup mean differences in ability and different selection ratios (Sackett & Wilk, 1994; Sackett, Schmitt, Ellingson, & Kabin, 2001; Schmidt, Mack, & Hunter, 1984). For social, legal, and political reasons, as well as long-term organizational viability in some contexts, the adverse impact of a strict top-down strategy of test use often cannot be tolerated. For these reasons as well as others, researchers and practitioners have often experimented with and implemented other ways of using test scores.
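The trade-off quantified in the studies cited above can be illustrated with a simple normal-distribution calculation. The sketch below assumes two applicant subgroups that differ by one standard deviation on the predictor (a difference of roughly the size often reported for cognitive ability tests) and a top-down cutoff that selects 20% of the higher-scoring group; both numbers are illustrative.

```python
# Minimal sketch: subgroup selection rates and the adverse impact ratio implied by a
# strict top-down cutoff when subgroups differ by d SDs on the predictor.
from statistics import NormalDist

z = NormalDist()
d = 1.0               # standardized subgroup mean difference on the predictor
majority_rate = 0.20  # proportion of the higher-scoring group selected under the cutoff

cutoff = z.inv_cdf(1 - majority_rate)    # cut score in higher-scoring-group z units
minority_rate = 1 - z.cdf(cutoff + d)    # lower-scoring group's mean is d SDs below

print(f"cutoff = {cutoff:.2f} SD above the higher-scoring group's mean")
print(f"lower-scoring group selection rate = {minority_rate:.3f}")
print(f"adverse impact ratio = {minority_rate / majority_rate:.2f} (four-fifths rule: 0.80)")
```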

Banding and Cut Scores

One method of reducing the consequences of subgroup differences in test scores under top-down selection is to form bands of test scores that are not considered different, usually using a statistical criterion known as the standard error of the difference, which is based on the reliability of the test. The rationale for banding in employment selection is that the unreliability inherent in most tests makes the people within a band indistinguishable from each other, just as occurs when grades are assigned to students. Most of us are familiar with a form of banding commonly used in academic situations: scores on tests are usually grouped into grades (e.g., A, B, C, etc., or red, green, and yellow, as is often done in organizational practice) that are reported without specific test score information. So persons with scores of 99 and 93 might both receive an A in a course, just as two students with scores of 88 and 85 would receive a B. Because minority group members tend, on average, to score lower on cognitive ability tests, creating these bands of indistinguishable scores helps increase the chances that minority applicants will fall in a top band and be hired.

There are two ways in which banding can increase minority hiring. One is to make the bands very wide so that a greater number of minority test scorers will be included in the top bands. Of course, a cynic may correctly point out that a test of zero reliability will include everyone in the top band and that this approach supports the use of tests with low reliability. A second way to affect the selection of minority individuals is through the manner in which individuals are chosen within a band. The clearest way to increase the selection of minority individuals is to choose these persons first within each band before proceeding to consider other individuals in the band, but this has proven difficult to legally justify in U.S. courts (Campion et al., 2001). Other approaches to selection within a band include random selection or selection on secondary criteria unrelated to subgroup status, but these procedures typically do not affect minority hiring rates in practically significant ways (Sackett & Roth, 1991). A discussion of various issues and debates regarding the appropriateness of banding is contained in an edited volume by Aguinis (2004).
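A minimal sketch of the banding computation described above follows; the test SD, reliability, scores, and the 1.96 multiplier are hypothetical values or policy choices rather than figures taken from the chapter.

```python
# Minimal sketch with hypothetical values: a fixed band below the top score based on
# the standard error of the difference, within which scores are treated as equivalent.
from math import sqrt

sd_test = 10.0     # test score standard deviation
rxx = 0.85         # test reliability estimate
scores = [96, 94, 93, 90, 88, 87, 84, 80]

sem = sd_test * sqrt(1 - rxx)   # standard error of measurement
sed = sem * sqrt(2)             # standard error of the difference between two scores
band_width = 1.96 * sed         # multiplier reflects the chosen confidence level

top = max(scores)
band = [s for s in scores if s >= top - band_width]
print(f"SEM = {sem:.2f}, SED = {sed:.2f}, band width = {band_width:.2f}")
print("scores treated as indistinguishable from the top score:", band)
```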


Use of Minimum Cut Scores

An extreme departure from top-down selection occurs when an organization sets a minimum cutoff test score such that individuals above that score are selected, whereas those below it are rejected. In essence, there are two bands of test scores: those judged to represent a passable level of competence and those representing a failing level of test performance. Perhaps the most common use of cutoff scores is in licensing and credentialing, in which the effort is usually to identify a level of expertise and knowledge of professional practice below which a licensure candidate is likely to bring harm to clients. In organizational settings, a cutoff is often required when selection of personnel is done sequentially over time rather than from among a large number of candidates at a single point in time. In this case, hire/reject decisions are made about individuals as they apply, and a pass score is essential. Use of a single cutoff score will certainly reduce the potential utility inherent in a valid test because it ignores individual differences in ability above the cutoff; a great deal of evidence (e.g., Coward & Sackett, 1990) shows that test score–job performance relationships are linear throughout the range of test scores. However, setting a minimum cutoff on a cognitive ability test (the type of measure on which we usually see the largest minority-majority differences) and then selecting above that cutoff at random, or on the basis of some other valid procedure that does not display subgroup differences (e.g., a personality test, for which adverse impact is much less likely), may reduce the adverse impact that usually occurs with top-down selection using only a cognitive ability test. Perhaps the biggest problem with the use of cutoff scores is deriving a justifiable cutoff score. Setting a cutoff is always judgmental. Livingston (1980) and Cascio, Alexander, and Barrett (1988), among others, have specified the following as important considerations in setting cutoffs: the qualifications of the experts who set the cutoff, the purpose for which the test is being used, and the various types of decision errors that can be made (i.e., rejecting a qualified person and accepting an unqualified one). One frequently used approach is the so-called Angoff method (Angoff, 1971), in which a representative sample of experts examines each test item and estimates the probability that a minimally competent person (the definition and the experts' shared understanding of "minimally competent" are critical) would answer the item correctly. These probabilities are averaged across experts and summed across items; the result is the cutoff score. A second approach is to set the cutoff by reference to some acceptable level of performance on a criterion variable. In this case, one could end up saying that an individual with a score of 75 on some test has a 10% (or any other percent) chance of achieving success on some job. However, this "benchmarking" of scores against criteria does not resolve the problem, because someone must still make the sometimes equally difficult judgment about what constitutes acceptable performance. Cizek (2001) provided a comprehensive treatment of methods of setting performance standards. The use of cutoff scores to establish minimum qualifications or competency is common in licensing exams.
Credentialing exams may require evidence of a higher level of skill or performance capability in some domain, but they too usually require only a “pass-fail” decision. Validation of these cutoffs almost always relies solely on the judgments of experts in the performance area of interest. In these cases, careful explication of the behaviors required to perform a set of tasks and the level of “acceptable” performance is essential and likely the only possible form of validation.
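As an illustration of the Angoff procedure described above, the sketch below computes a cutoff from a small, entirely hypothetical matrix of expert judgments (the panel size, number of items, and probabilities are our assumptions): the experts' probabilities are averaged for each item, and the averaged values are summed across items to yield the recommended cut score.

```python
import numpy as np

# Hypothetical Angoff judgments: rows = 3 experts, columns = 5 items.
# Each entry is the judged probability that a minimally competent
# candidate would answer the item correctly (illustrative values only).
judgments = np.array([
    [0.80, 0.60, 0.70, 0.50, 0.90],
    [0.75, 0.65, 0.60, 0.55, 0.85],
    [0.85, 0.55, 0.65, 0.45, 0.95],
])

item_means = judgments.mean(axis=0)   # average across experts for each item
cut_score = item_means.sum()          # sum across items = recommended cutoff

print(f"Per-item averaged probabilities: {np.round(item_means, 2)}")
print(f"Recommended Angoff cutoff (out of {judgments.shape[1]}): {cut_score:.2f}")
```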

Using Profiles of Scores

Another possibility, when scores on multiple measures of different constructs are available, is to construct a profile of measured KSAOs and match it to a profile of the KSAOs thought to be required in a job. In this instance, we might measure and quantify the type of job experiences possessed by a job candidate along with scores on various personality tests, oral communication and social skills as measured in an interview, and scores on ability tests. If this profile of scores matches the profile required in the job, then the person would be selected. This contrasts with the traditional approach described in textbooks, in which scores on these measures are combined using a regression model so that the resulting composite is optimally linearly related to job performance. In using profiles, one is interested in patterns of scores rather than an optimally weighted composite. Use of profiles of scores presents various complex measurement and statistical problems of which the user should be aware (Edwards, 2002). Instances in which selection decisions are made in this fashion include individual assessments (Jeanneret & Silzer, 1998), which involve the use of multiple assessment methods and a clinical judgment by the assessor that a person is qualified for some position (Ryan & Sackett, 1987, 1992, 1998). Another venue in which profiles of test scores are considered is assessment centers, in which candidates for positions (usually managerial) are evaluated in various exercises on different constructs and assessors make overall judgments that are then used in decision making. Overall judgments based on these procedures have shown criterion-related validity (see Ryan & Sackett [1998] for a summary of data relevant to individual assessment and Gaugler, Rosenthal, Thornton, and Bentson [1987] or Arthur, Day, McNelly, and Edens [2003] on assessment center validity), but we are aware of no evidence that validates a profile or configural use of scores. Recently, Davison, Davenport, Chang, Vue, and Su (2015) provided a method of estimating the validity of the use of subscores in a profile of scores that may have value in this context. Perhaps the best description of the research results on the use of profiles to make high-stakes decisions is that we know very little. Among the issues that should receive research attention are the following: (a) Is a profile of scores actually used, implicitly or explicitly, in combining information about job applicants, and if so, what is it? (b) What is the validity of such use, and what is its incremental validity over the use of individual components of the profile or linear composites of the scores in the profile? (c) What is the adverse impact on various subgroups when profile judgments are used? And should a person whose scores exceed another person's across the reported dimensions be selected in favor of a person whose scores are near exact replicas of the desired profile?
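To make the contrast between composite and configural score use concrete, the sketch below (a hypothetical illustration; the regression weights, target profile, and candidate scores are our assumptions) ranks two candidates first by a regression-weighted linear composite and then by profile similarity, operationalized here as squared Euclidean distance from the target KSAO profile.

```python
import numpy as np

# Hypothetical standardized scores on four KSAOs for two candidates.
candidates = {
    "A": np.array([1.2, 0.3, 0.9, 0.4]),   # uniformly fairly high scores
    "B": np.array([0.5, 0.5, 0.6, 0.5]),   # closely matches the target profile
}
reg_weights = np.array([0.40, 0.15, 0.30, 0.15])   # assumed regression weights
target_profile = np.array([0.5, 0.5, 0.6, 0.5])    # assumed "required" profile

for name, scores in candidates.items():
    composite = float(reg_weights @ scores)                    # linear composite
    mismatch = float(np.sum((scores - target_profile) ** 2))   # profile distance
    print(f"Candidate {name}: composite = {composite:.2f}, "
          f"squared distance from target = {mismatch:.2f}")
# The linear composite favors A; the profile-matching rule favors B.
```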

Clinical Versus Statistical Judgment

Clinical judgment refers to the use and combination of different types of information to make a decision or recommendation about some person. In psychology, clinical judgment is perhaps most often discussed in the context of diagnoses of clinical patients (Meehl, 1954). These judgments are likely quite similar to those made in the individual assessments often used in the selection of high-level executives, but they also occur when judgments are made about job applicants in employment interviews, assessment centers, and various other instances in which human resource specialists or psychologists make employment decisions. Clinical judgment is often compared with statistical judgment, in which test scores are combined using an arithmetic formula that reflects the desired weighting of each element of information. The weights may be determined rationally by a group of job experts or derived from a regression of a measure of overall job success on the scores from the various dimensions and methods of measurement. Meehl's original (1954) review showed that even the poorest regression-based estimates were at least as accurate as the judgments made by human decision makers. A more recent treatment and review of this literature by Hastie and Dawes (2001) reaffirmed the general conclusion that predictions made by human experts are inferior to those based on a linear regression model. Human experts are required to identify the types of information used in the prediction task, but the predictions themselves are likely best left to some mechanical combination rule if one is interested in maximizing a performance outcome. When the overall clinical judgment is used to make decisions, it should be the focus of the validation effort; however, unless it is clear how the decision maker combines information, it is unclear what constructs are playing a role in those decisions. The fact that these clinical judgments often correlate less strongly with relevant and important external outcomes suggests that at least some of the constructs these decision makers use are not relevant.


In clinical judgment, the presence or absence of adverse impact can result either from combining information that does not display sizable subgroup differences or from bias on the part of the person making the judgment. Psychologists making clinical judgments may mentally adjust scores on the basis of their knowledge of subgroup differences on various measures. Again, we are aware of no studies that address the use or appropriateness of such adjustments.

SCIENTIFIC OR LONG-TERM PERSPECTIVE: LIMITATIONS OF EXISTING PRIMARY VALIDATION STUDIES

There are a great many meta-analyses of the criterion-related validity of various constructs in the prediction of job performance and many thousands of primary studies. Secondary analyses of meta-analyses have also been undertaken (e.g., Schmidt & Hunter, 1998). The studies that provided these data were nearly all conducted more than 30 years ago. Although it is not necessarily the case that the relationships between ability and performance documented in these studies have changed in the last half-century or so, this database has some limitations. In this section, we describe these limitations and make the case that researchers should continue their efforts to evaluate test–performance relationships and improve the quality of the data that are collected.

Concurrent Validation Designs

In criterion-related validation research, concurrent validation studies, in which predictor and criterion data are collected simultaneously from job incumbents, are distinguished from predictive designs. In the latter, predictor data are collected from job applicants before hiring (with hiring presumably based on criteria that are uncorrelated with the predictor data), and criterion data are collected from those hired after some appropriate period of time, when job performance is thought to have stabilized. Defects in the concurrent design (i.e., restriction of range and a different motivational set on the part of incumbents versus applicants) have been described frequently (Barrett, Phillips, & Alexander, 1981). However, some tests are probably more susceptible than others to motivational differences between job incumbents and applicants; many noncognitive measures, for example, might display different validities when research participants are actually being evaluated for employment than when they are responding "for research purposes." To our knowledge, this comparison has not been made frequently, and, when it has been done in meta-analyses, cognitive and noncognitive test validities have not been separated (Schmitt, Gooding, Noe, & Kirsch, 1984). Practical considerations have made the use of concurrent designs much more frequent than that of predictive designs (Schmitt et al., 1984). Meta-analytic data suggest that there are not large differences in the validity coefficients resulting from these two designs. Further, range restriction corrections can be applied to account for the fact that data for lower-scoring persons are absent from concurrent studies, but the information needed to make these corrections is often missing from reports of criterion-related research. Nor can we estimate any effects on test scores that might result from the fact that much more is at stake in a testing situation that may lead to employment than in one conducted for research purposes. Moreover, as Sussman and Robertson (1986) maintained, the manner in which some predictive studies are designed and conducted makes them little different from concurrent studies.
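When the necessary statistics are reported, the usual direct (Thorndike Case II) correction for range restriction can be applied to a concurrent validity estimate. The sketch below is a minimal illustration; the observed validity and the restricted and unrestricted predictor standard deviations are assumed values, not figures from this chapter.

```python
import math

def correct_range_restriction(r_obs: float, sd_restricted: float,
                              sd_unrestricted: float) -> float:
    """Direct (Thorndike Case II) range restriction correction."""
    u = sd_unrestricted / sd_restricted
    return (r_obs * u) / math.sqrt(1.0 - r_obs**2 + (r_obs**2) * (u**2))

# Assumed illustrative values: observed incumbent validity of .25, incumbent
# test SD of 7 versus an applicant-pool SD of 10.
r_corrected = correct_range_restriction(r_obs=0.25, sd_restricted=7.0,
                                         sd_unrestricted=10.0)
print(f"Validity corrected for direct range restriction: {r_corrected:.2f}")
```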

Unidimensional Criterion Versus Multidimensional Perspectives

Over the last two decades, the view that job performance is multidimensional has become much more widely accepted by I-O psychologists (Borman & Motowidlo, 1997; Campbell, Gasser, & Oswald, 1996). Early validation researchers often used a single rating of what is now called task performance as a criterion, or they combined a set of ratings into an overall performance measure. In many cases a measure of training success was used as the criterion. The Project A research showed that performance comprises clearly identifiable dimensions (Campbell et al., 1990), and subsequent research has very often included measures of both contextual (e.g., helping others) and task performance (Motowidlo, 2003). Some researchers also argue that the nature of what constitutes performance has changed because jobs have changed (Ilgen & Pulakos, 1999). In all cases, the underlying performance constructs should be specified as carefully as possible, perhaps particularly so when performance includes contextual dimensions, which, as is true of any developing literature, have at times included everything that does not involve the "core" aspects of a job. Validation studies (and meta-analyses) that incorporate this multidimensional view of performance are very likely to yield information that updates earlier validation results.

Small Sample Sizes

The limitations of small sample sizes in validity research have become painfully obvious with the development of meta-analysis and validity generalization research (Schmidt & Hunter, 1977), as well as with the recognition that the power to reject a null hypothesis of no test score–performance relationship was very low in much early validation work (Schmidt, Hunter, & Urry, 1976). Although methods to correct for the variability in observed validity coefficients are available and routinely used in meta-analytic and validity generalization research, the use of small samples does not inspire confidence in the results of individual studies and can be misleading in the short term, until enough small-sample studies have been conducted and reported to discern generalizable findings. This may not be a problem if we are satisfied that the relationships studied in the past are the only ones in which our field is interested, but it is a problem when we want to evaluate new performance models (e.g., models that distinguish among task, contextual, and other dimensions), new predictor constructs (e.g., some noncognitive constructs or even spatial or perceptual measures), or cross-level or multilevel hypotheses. As stated earlier, meta-analyses are not possible in the absence of primary studies.
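As a rough illustration of the power problem just noted, the sketch below uses the Fisher z approximation to estimate the power of a two-tailed test of the null hypothesis of zero correlation when the population validity is .20 and n = 60, and then finds the sample size needed for .80 power. The population validity, alpha, and target power are assumed values chosen for illustration.

```python
import math
from scipy.stats import norm

def power_for_r(rho: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed test of H0: r = 0 (Fisher z method)."""
    z_rho = 0.5 * math.log((1 + rho) / (1 - rho))   # Fisher transformation
    se = 1.0 / math.sqrt(n - 3)
    z_crit = norm.ppf(1 - alpha / 2)
    return (1 - norm.cdf(z_crit - z_rho / se)) + norm.cdf(-z_crit - z_rho / se)

print(f"Power with n = 60 and rho = .20: {power_for_r(0.20, 60):.2f}")

n = 4
while power_for_r(0.20, n) < 0.80:   # smallest n reaching .80 power
    n += 1
print(f"n required for .80 power at rho = .20: {n}")
```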

Inadequate Data Reporting

Some deficiencies in primary validation studies and their impact are well known. Corrections for range restriction and criterion unreliability (in the mean and variance of validity coefficients) and for the variability due to small sample size are also well known and are routinely applied in validity generalization work. However, most primary studies do not report the information that would allow sample-based corrections for criterion unreliability or range restriction. Schmidt and Hunter (1977), in their original meta-analytic effort, used estimates of the sample sizes of the validity coefficients they aggregated because not even sample size was available in early reports. Consequently, in estimating population validity coefficients, meta-analysts have been forced to use assumed artifact distributions based on the small amount of data that are available. There is some evidence that these assumptions are approximately correct for range restriction corrections (e.g., Alexander, Carson, Alliger, & Cronshaw, 1989; Sackett & Ostgaard, 1994), but the use of such assumed artifact distributions would not be necessary with adequate reporting of primary data. Unfortunately, such information for most of our primary database is lost. In addition, researchers disagree regarding the appropriate operationalization of criterion reliability (Murphy & DeShon, 2000; Schmidt, Viswesvaran, & Ones, 2000).
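The sketch below illustrates why the missing information matters. With assumed artifact values (the criterion reliability and the range restriction ratio used here are illustrative assumptions, not recommended constants), an observed validity coefficient is first disattenuated for criterion unreliability and then corrected for direct range restriction; in practice these values should come from the study itself rather than from assumed artifact distributions.

```python
import math

# Illustrative values only.
r_obs = 0.22        # observed validity coefficient
r_yy = 0.60         # assumed reliability of the criterion (e.g., ratings)
u = 10.0 / 7.0      # assumed unrestricted-to-restricted predictor SD ratio

# Correct for criterion unreliability, then for direct range restriction.
r_dis = r_obs / math.sqrt(r_yy)
r_full = (r_dis * u) / math.sqrt(1 - r_dis**2 + (r_dis**2) * (u**2))

print(f"Observed r = {r_obs:.2f}; corrected estimate = {r_full:.2f}")
```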

HARKing

In psychology generally, and in organizational psychology specifically (Bosco, Aguinis, Field, Pierce, & Dalton, 2016), there has been increasing concern (Kerr, 1998) about a practice known as HARKing (hypothesizing after the results are known). The implicit hypothesis in criterion-related research is that the set of predictor variables considered is related to some relevant employee outcome. A not infrequent practice in this research is to measure a wide range of predictor variables and then report and use only the subset of those predictors that display a statistically (or practically) significant relationship to the outcome variable(s). This is especially problematic when sample sizes are small and there is no cross-validation. The result is likely an overestimate of the validity of the retained predictors, even when formula estimates of cross-validity are used to estimate shrinkage (Schmitt & Ployhart, 1999). A related problem is the file drawer problem, in which studies that reveal nonsignificant correlations are never published or publicly noted. Meta-analysts of criterion-related research can retrieve only those studies that are available, and if only studies producing significant results are available, the end result will be an overestimate of the validity of predictors. The potential for this type of bias in the estimation of validity coefficients has been noted by many in selection research (e.g., McDaniel, Rothstein, & Whetzel, 2006). Methods for estimating the potential for bias in meta-analytic estimates of relationships, and for adjusting those estimates appropriately, are available and should be applied in meta-analytic reports (Rothstein, Sutton, & Borenstein, 2005).
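The shrinkage-formula idea mentioned above can be illustrated with a short sketch; the sample size, number of predictors, and observed R-squared are assumed values. The Wherry formula adjusts the observed squared multiple correlation for capitalization on chance, and a formula attributed to Browne (1975) then estimates the squared cross-validated correlation expected when the sample regression weights are applied in a new sample.

```python
def wherry_adjusted_r2(r2: float, n: int, k: int) -> float:
    """Wherry estimate of the population squared multiple correlation."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def browne_cross_validity_r2(rho2: float, n: int, k: int) -> float:
    """Browne-type estimate of the squared cross-validated correlation."""
    return ((n - k - 3) * rho2**2 + rho2) / ((n - 2 * k - 2) * rho2 + k)

# Assumed illustration: observed R^2 = .25 with k = 6 predictors and n = 80.
r2_obs, n, k = 0.25, 80, 6
rho2 = wherry_adjusted_r2(r2_obs, n, k)
print(f"Wherry-adjusted R^2: {rho2:.3f}")
print(f"Estimated cross-validated R^2: {browne_cross_validity_r2(rho2, n, k):.3f}")
```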

Consideration of Multilevel Issues

As described in the section above on the utility and adverse impact associated with selection procedures, selection researchers have attempted to estimate the organizational outcomes associated with the use of valid tests (Boudreau & Ramstad, 2003). Utility is a linear function of validity, less the cost of recruiting and assessing personnel; when validity is multiplied by the number of people selected and the standard deviation of performance in dollar terms, the resulting utility estimates for most selection instruments are very large (e.g., see Schmidt, Hunter, Outerbridge, & Trattner, 1986). Another body of research has focused on the relationship between organizational human resource practices, such as the use of tests, and measures of organizational success. This organizational-level research has documented the usefulness of various human resource practices, including test use. Terpstra and Rozell (1993) reported early correlational data supporting the conclusion that organizations that used various selection procedures such as interviews, cognitive ability tests, and biodata had higher annual levels of profit, growth in profit, and overall performance; subsequent research has supported this conclusion (Jiang, Lepak, Hu, & Baer, 2012). Various other authors have called for multilevel (individuals, work groups, organizations) or cross-level research on the relationship between KSAOs and organizational differences (Schneider, Smith, & Sipe, 2000). Ployhart and Schmitt (2007) and Schneider et al. (2000) have proposed a series of multilevel questions that include considerations of the relationships between the variance of KSAOs and measures of group and organizational effectiveness. In the context of the attraction-selection-attrition model (Schneider, 1987), there are many issues of a multilevel and longitudinal nature that researchers are only beginning to address and about which we have very little or no data. These questions should be addressed if we are to fully understand the relationships between KSAOs and individual and organizational performance. Chapter 5 in this volume provides additional discussion of these issues and questions.
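The utility logic summarized at the start of this section is commonly expressed with the Brogden-Cronbach-Gleser equation. The sketch below computes it with illustrative, assumed inputs (number hired, expected tenure, dollar-valued SD of performance, operational validity, applicants assessed, and cost per applicant are our values, not figures from this chapter); the mean standardized predictor score of 1.40 is roughly what a .20 selection ratio implies under normality.

```python
# Brogden-Cronbach-Gleser utility estimate with illustrative, assumed inputs.
n_hired = 100           # number of people selected
tenure_years = 3        # expected average tenure of those hired
sd_y = 12_000.0         # SD of job performance in dollar terms
validity = 0.40         # operational validity of the selection procedure
mean_z_hired = 1.40     # mean standardized predictor score of those hired
n_assessed = 500        # applicants assessed
cost_per_applicant = 50.0

gain = n_hired * tenure_years * validity * sd_y * mean_z_hired
cost = n_assessed * cost_per_applicant
print(f"Estimated utility gain: ${gain - cost:,.0f}")
```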

Validation and Long-Term or Scientific Perspective

Given the various limitations of our primary database noted in the previous sections of this chapter, we believe selection researchers should aim to conduct additional large-scale or consortium studies like Project A (Campbell, 1990; Campbell & Knapp, 2001). These studies should include the following characteristics:

1. They should be predictive (i.e., longitudinal, with data collection at multiple points) as well as concurrent, and of sufficient sample size to allow for adequate power in the tests of hypotheses. Large-scale studies in which organizations continue data collection over time on an ever-expanding group of participants should be initiated.


2. Multiple criteria should be collected to allow for evaluation of various KSAO–performance relationships.
3. Data should be collected to allow for artifact corrections such as unreliability in the criteria and range restriction.
4. Unit-level data should be collected to allow for evaluation of multilevel hypotheses. These data should include basic unit characteristics and outcome data.
5. Demographic data should be collected to allow for evaluation of subgroup differences in the level of performance and differences in KSAO–performance relationships across subgroups.
6. Data on constructs thought to be related (and unrelated) to the target constructs of interest should be collected to allow for evaluation of broader construct validity issues.
7. Such large-scale studies should include studies of new tests and testing technologies when these become available to allow for innovation.

Obviously, these studies would require a level of cooperation and planning that is rarely achieved among multiple researchers, much less multiple organizations. However, real advancement in our understanding of individual differences in KSAOs and performance will probably not come from additional small-scale studies, or from meta-analyses of primary studies that address traditional questions with inadequate sample sizes, research designs, and measurement.

CONCLUSIONS

It is certainly true that meta-analyses have provided our discipline with strong evidence that many of the relationships between individual differences and performance are relatively strong and generalizable. However, many situations in which validation is necessary do not lend themselves to validity generalization or the use of meta-analytic databases. As a result, practitioners frequently find themselves in situations where well-designed primary studies are required, and a focus on appropriate designs for these studies is therefore important. Additionally, without primary studies of the relationships between individual differences and performance, there can be no meta-analyses or related applications of validity generalization and transportability. The quality and nature of the original studies that are the source of our meta-analytic database determine to a great extent the currency and quality of the conclusions derived from the meta-analyses, statistical corrections notwithstanding. We argue that the field would be greatly served by large-scale primary studies of the type conducted as part of Project A (see Sackett, 1990, or Campbell & Knapp, 2001). These studies should begin with a clear articulation of the performance and predictor constructs of interest. They should involve the collection of concurrent and predictive data and improve upon the research design and reporting problems that have bedeviled meta-analytic efforts for the past three decades. Demographic data should be collected and reported. All data should be collected across multiple organizational units and organizations (and perhaps globally), and data describing the organizational context should be collected and recorded. We know much more about the complexities of organizational behavior, research design, measurement, and individual differences than we did 80–100 years ago, and this should be reflected in how we collect our data and make them available to other professionals. The end result will be even greater progress in our understanding of the relationship between individual differences and work performance.

REFERENCES

Ackerman, P. L. (1989). Within-task intercorrelations of skilled performance: Implications for predicting individual differences? (A comment on Henry & Hulin, 1987). Journal of Applied Psychology, 74, 360–364. Aguinis, H. (Ed.) (2004). Test score banding in human resource selection: Legal, technical and societal issues. Westport, CT: Praeger. Alexander, R. A., Carson, K. P., Alliger, G. M., & Cronshaw, S. F. (1989). Empirical distributions of range restricted SDx in validity studies. Journal of Applied Psychology, 74, 253–258.


American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. American Psychological Association, American Educational Research Association, and National Council on Measurement in Education. (1954). Technical recommendations for psychological and diagnostic techniques. Psychological Bulletin, 51, 201–238. Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508–600). Washington, DC: American Council on Education. Arthur, W., Jr., Day, E. A., McNelly, T. L., & Edens, P. S. (2003). A meta-analysis of the criterion-related validity of assessment center dimensions. Personnel Psychology, 56, 125–153. Arvey, R. D., Nutting, S. M., & Landon, T. E. (1992). Validation strategies for physical ability testing in police and fire settings. Public Personnel Management, 21, 301–312. Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Applied Psychology, 77, 836–874. Barrett, G. V., Caldwell, M. S., & Alexander, R. A. (1985). The concept of dynamic criteria: A critical reanalysis. Personnel Psychology, 38, 41–56. Barrett, G. V., Phillips, J. S., & Alexander, R. A. (1981). Concurrent and predictive validity designs: A critical reanalysis. Journal of Applied Psychology, 66, 1–6. Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494. Borman, W. C., & Motowidlo, S. J. (1997). Task performance and contextual performance: The meaning for personnel selection research. Human Performance, 10, 99–109. Bosco, F. A., Aguinis, H., Field, J. G., Pierce, C. A., & Dalton, D. R. (2016). HARKing's threat to organizational research: Evidence from primary and meta-analytic sources. Personnel Psychology, 69, 709–750. Boudreau, J. W., & Ramstad, P. M. (2003). Strategic industrial and organizational psychology and the role of utility. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Handbook of psychology (Vol. 12, pp. 193–221). Hoboken, NJ: Wiley. Brogden, H. E. (1951). Increased efficiency of selection resulting from replacement of a single predictor with several differential predictors. Educational and Psychological Measurement, 11, 173–195. Brogden, H. E. (1959). Efficiency of classification as a function of number of jobs, percent rejected, and the validity and intercorrelation of job performance estimates. Educational and Psychological Measurement, 19, 181–190. Brown, C. W., & Ghiselli, E. E. (1953). Percent increase in proficiency resulting from use of selection devices. Journal of Applied Psychology, 37, 341–345. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. Campbell, J. P. (1990). An overview of the army selection and classification project (Project A). Personnel Psychology, 43, 231–240. Campbell, J. P., Gasser, M. B., & Oswald, F. L. (1996). The substantive nature of job performance variability. In K. R.
Murphy (Ed.), Individual differences and behavior in organizations (pp. 258–299). San Francisco, CA: Jossey-Bass. Campbell, J. P., & Knapp, D. J. (Eds.) (2001). Exploring the limits in personnel selection and classification. Mahwah, NJ: Lawrence Erlbaum. Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 35–70). San Francisco, CA: Jossey-Bass. Campbell, J. P., McHenry, J. J., & Wise, L. L. (1990). Modeling job performance in a population of jobs. Personnel Psychology, 43, 313–334. Campion, M. A., Outtz, J. L., Zedeck, S., Schmidt, F. L., Kehoe, J. F., Murphy, K. R., & Guion, R. M. (2001). The controversy over score banding in personnel selection: Answers to 10 key questions. Personnel Psychology, 54, 149–185. Cascio, W. F., Alexander, R. A., & Barrett, G. V. (1988). Setting cut scores: Legal, psychometric, and professional issues and guidelines. Personnel Psychology, 41, 1–24. Cascio, W. F., Valenzi, E. R., & Silbey, V. (1978). Validation and statistical power: Implications for applied research. Journal of Applied Psychology, 63, 589–595. Cizek, G. J. (Ed.) (2001). Setting performance standards. Hillsdale, NJ: Lawrence Erlbaum. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.


Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago, IL: Rand-McNally. Coward, W. M., & Sackett, P. R. (1990). Linearity of ability-performance relationships: A reconfirmation. Journal of Applied Psychology, 75, 297–300. Cropanzano, R., & Wright, T. A. (2003). Procedural justice and organizational staffing: A tale of two paradigms. Human Resource Management Review. Special Issue: Fairness and Human Resources Management, 13, 7–39. Davis, J. E. (1989). Construct validity in measurement: A pattern matching approach. Evaluation and Program Planning. Special Issue: Concept Mapping for Evaluation and Planning, 12, 31–36. Davison, M. L., Davenport, E. C., Jr., Chang, Y.-F., Vue, K., & Su, S. (2015). Criterion-related validity: Assessing the value of subscores. Journal of Educational Measurement, 52, 263–279. Dunnette, M. D. (1963). A note on the criterion. Journal of Applied Psychology, 47, 251–254. Edwards, J. R. (2002). Alternatives to difference scores: Polynomial regression analysis and response surface methodology. In F. Drasgow & N. Schmitt (Eds.), Measuring and analyzing behavior in organizations (pp. 350–400). San Francisco: Jossey-Bass. Fiske, D. W. (1951). Values, theory, and the criterion problem. Personnel Psychology, 4, 93–98. Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., & Bentson, C. (1987). Meta-analyses of assessment center validity. Journal of Applied Psychology, 72, 493–511. Ghiselli, E. E. (1966). The validity of occupational aptitude tests. New York, NY: Wiley. Ghiselli, E. E. (1973). The validity of aptitude tests in personnel selection. Personnel Psychology, 26, 461–478. Gibson, W. M., & Caplinger, J. A. (2007). Transportation of validation results. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence: The professional practice series (pp. 29–81). Hoboken, NJ: Wiley. Goldman, B. M., Gutek, B. A., Stein, J. H., & Lewis, K. (2006). Employment discrimination in organizations: Antecedents and consequences. Journal of Management, 32(6), 786–830. Guion, R. M. (1961). Criterion measurement and personnel judgments. Personnel Psychology, 14, 141–149. Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence Erlbaum. Hastie, R., & Dawes, R. M. (2001). Rational choice in an uncertain world. Thousand Oaks, CA: Sage. Henry, R. A., & Hulin, C. L. (1989). Changing validities: Ability-performance relations and utilities. Journal of Applied Psychology, 74, 365–367. Hogan, J., & Quigley, A. M. (1986). Physical standards for employment and the courts. American Psychologist, 41, 1193–1217. Hull, C. L. (1928). Aptitude testing. Yonkers, NY: World Book. Ilgen, D. R., & Pulakos, E. D. (Eds.) (1999). The changing nature of performance. San Francisco: Jossey-Bass. Jeanneret, P. R., & Silzer, R. (Eds.) (1998). Individual psychological assessment: Predicting behavior in organizational settings. San Francisco, CA: Jossey-Bass. Jiang, K., Lepak, D. P., Hu, J., & Baer, J. C. (2012). How does human resource management influence organizational outcomes? A meta-analytic investigation of mediating mechanisms. Academy of Management Journal, 55, 1264–1294. Johnson, J. W., & Carter, G. W. (2010). Validating synthetic validation: Comparing traditional and synthetic validity coefficients. Personnel Psychology, 63, 755–795. Kerr, N. L.
(1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217. Lance, C. E., Foster, M. R., Gentry, W. A., & Thoresen, J. D. (2004). Assessor cognitive processes in an operational assessment center. Journal of Applied Psychology, 89, 22–35. Lance, C. E., Foster, M. R., Nemeth, Y. M., Gentry, W. A., & Drollinger, S. (2007). Extending the nomological network of assessment center construct validity: Prediction of cross-situationally consistent and specific aspects of assessment center performance. Human Performance, 20, 345–362. Landy, F. J. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41, 1183–1192. Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data. Hoboken, NJ: Wiley. Livingston, S. A. (1980). Comments on criterion-referenced testing. Applied Psychological Measurement, 4, 575–581. Maier, M. H. (1988). On the need for quality control in validation research. Personnel Psychology, 41, 497–502. McDaniel, M. A. (2007). Validity generalization as a test validation approach. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence: The professional practice series (pp. 159–180). Hoboken, NJ: Wiley. McDaniel, M. A., Rothstein, H. R., & Whetzel, D. (2006). Publication bias: A case study of four test vendor manuals. Personnel Psychology, 59, 927–953. Meehl, P. E. (1954). Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis: University of Minnesota Press.


Messick, S. (1998). Test validity: A matter of consequence. Social Indicators Research, 45, 35–44. Motowidlo, S. J. (2003). Job performance. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Handbook of psychology (Vol. 12, pp. 39–53). Hoboken, NJ: Wiley. Murphy, K. R. (1986). When your top choice turns you down: The effects of rejected offers on the utility of selection tests. Psychological Bulletin, 99, 133–138. Murphy, K. R. (2009). Content validation is useful for many things, but validity isn't one of them. Industrial and Organizational Psychology, 2, 453–464. Murphy, K. R., & DeShon, R. P. (2000). Interrater correlations do not estimate the reliability of job performance ratings. Personnel Psychology, 53, 873–900. Newman, D. A., Jacobs, R. R., & Bartram, D. (2007). Choosing the best method for local validity estimation: Relative accuracy of meta-analysis versus a local study versus Bayes-analysis. Journal of Applied Psychology, 92, 1394–1413. Ones, D. S., Viswesvaran, C., & Dilchert, S. (2005). Cognitive ability in selection decisions. In O. Wilhelm (Ed.), Handbook of understanding and measuring intelligence (pp. 431–468). Thousand Oaks, CA: Sage. Peterson, N. G., Wise, L. L., Arabian, J., & Hoffman, R. G. (2001). Synthetic validation and validity generalization: When empirical validation is not possible. In J. P. Campbell & D. J. Knapp (Eds.), Exploring the limits in personnel selection and classification (pp. 411–452). Mahwah, NJ: Lawrence Erlbaum. Picano, J. J., Williams, T. J., & Roland, R. R. (Eds.) (2006). Assessment and selection of high-risk operational personnel. New York, NY: Guilford Press. Ployhart, R. E., & Schmitt, N. (2007). The attraction-selection-attrition model and staffing: Some multilevel implications. In D. B. Smith (Ed.), The people make the place: Exploring dynamic linkages between individuals and organizations (pp. 89–102). Mahwah, NJ: Lawrence Erlbaum. Ployhart, R. E., Schneider, B., & Schmitt, N. (2006). Staffing organizations: Contemporary practice and theory. Mahwah, NJ: Lawrence Erlbaum. Rothstein, H. R. (1992). Meta-analysis and construct validity. Human Performance, 5, 71–80. Rothstein, H. R., Sutton, A. J., & Borenstein, M. (Eds.) (2005). Publication bias in meta-analysis: Prevention, assessment, and adjustment. Chichester, UK: Wiley. Ryan, A. M., & Sackett, P. R. (1987). A survey of individual assessment practices by I/O psychologists. Personnel Psychology, 40, 455–488. Ryan, A. M., & Sackett, P. R. (1992). Relationships between graduate training, professional affiliation, and individual psychological assessment practices for personnel decisions. Personnel Psychology, 45, 363–385. Ryan, A. M., & Sackett, P. R. (1998). Individual assessment: The research base. In R. Jeanneret & R. Silzer (Eds.), Individual psychological assessment: Predicting behavior in organizational settings (pp. 54–87). San Francisco, CA: Jossey-Bass. Sackett, P. R. (Ed.) (1990). Special issue: Project A: The U.S. army selection and classification project. Personnel Psychology, 43, 231–378. Sackett, P. R., & Ostgaard, D. J. (1994). Job-specific applicant pools and national norms for cognitive ability tests: Implications for range restriction corrections in validation research. Journal of Applied Psychology, 79, 680–684. Sackett, P. R., & Roth, L. (1991). A Monte Carlo examination of banding and rank order methods of test score use in personnel selection. Human Performance, 4, 279–295. Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B.
(2001). High-stakes testing in employment, credentialing, and higher education: Prospects in a post-affirmative action world. American Psychologist, 56, 302–318. Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929–954. Scherbaum, C. A. (2005). Synthetic validity: Past, present, and future. Personnel Psychology, 58, 481–515. Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529–540. Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274. Schmidt, F. L., & Hunter, J. E. (2003). History, development, evolution, and impact of validity generalization and meta-analysis methods, 1975–2001. In K. R. Murphy (Ed.), Validity generalization: A critical review. Applied Psychology Series (pp. 31–65). Mahwah, NJ: Lawrence Erlbaum. Schmidt, F. L., Hunter, J. E., Outerbridge, A. N., & Trattner, M. H. (1986). The economic impact of job selection methods on size, productivity, and payroll costs of the federal work force: An empirically based demonstration. Personnel Psychology, 39, 1–30. Schmidt, F. L., Hunter, J. E., & Urry, V. W. (1976). Statistical power in criterion-related validity studies. Journal of Applied Psychology, 61, 473–485.


Schmidt, F. L., Mack, M. J., & Hunter, J. E. (1984). Selection utility in the occupation of U.S. park ranger for three modes of test use. Journal of Applied Psychology, 69, 490–497. Schmidt, F. L., Viswesvaran, C., & Ones, D. S. (2000). Reliability is not validity and validity is not reliability. Personnel Psychology, 53, 901–912. Schmitt, N., Gooding, R., Noe, R. A., & Kirsch, M. (1984). Meta-analyses of validity studies published between 1964 and 1982, and the investigation of study characteristics. Personnel Psychology, 37, 407–422. Schmitt, N., & Ployhart, R. E. (1999). Estimates of cross-validity for stepwise regression and with predictor selection. Journal of Applied Psychology, 84, 50–57. Schmitt, N., Rogers, W., Chan, D., Sheppard, L., & Jennings, D. (1997). Adverse impact and predictive efficiency of various predictor combinations. Journal of Applied Psychology, 82, 717–730. Schneider, B. (1987). The people make the place. Personnel Psychology, 40, 437–454. Schneider, B., Smith, D., & Sipe, W. P. (2000). Personnel selection psychology: Multilevel considerations. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations (pp. 91–120). San Francisco, CA: Jossey-Bass. Scott, W. D. (1915). The scientific selection of salesmen. Advertising and Selling, 5, 5–7. Shotland, A., Alliger, G. M., & Sales, T. (1998). Face validity in the context of personnel selection: A multimedia approach. International Journal of Selection and Assessment, 6, 124–130. Society for Industrial and Organizational Psychology. (1987). Principles for the validation and use of personnel selection procedures. (3rd ed.). College Park, MD: Author. Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures. (4th ed.). Bowling Green, OH: Author. Strickland, W. J. (1979). The relationship between program evaluation research and selection system validation: Application to the assessment center method. Dissertation Abstracts International, 40(1-B), 481–482. Sussman, M., & Robertson, D. U. (1986). The validity of validity: An analysis of validation study designs. Journal of Applied Psychology, 71, 461–468. Terpstra, D. E., & Kethley, R. B. (2002). Organizations' relative degree of exposure to selection discrimination litigation. Public Personnel Management, 31, 277–292. Terpstra, D. E., & Rozell, E. J. (1993). The relationship of staffing practices to organizational level measures of performance. Personnel Psychology, 46, 27–48. Thayer, P. W. (1992). Construct validation: Do we understand our criteria? Human Performance, 5, 97–108. Thorndike, E. L. (1911). Individuality. Boston, MA: Houghton Mifflin. Viteles, M. S. (1932). Industrial psychology. New York, NY: Norton. Zyphur, M. J., Oswald, F. L., & Rupp, D. E. (2015). Rendezvous overdue: Bayes analysis meets organizational research. Journal of Management, 41, 387–389.


3

VALIDITY CONSIDERATIONS IN THE DESIGN AND IMPLEMENTATION OF SELECTION SYSTEMS

JERARD F. KEHOE AND PAUL R. SACKETT

Validity, along with reliability, is a concept that provides the scientific foundation upon which we construct and evaluate predictor and criterion measures of interest in personnel selection. It offers a common technical language for discussing and evaluating the accuracy of inferences we desire to make based on those scores (e.g., high scores on our predictor measure are associated with high levels of job performance; high scores on our criterion measure reflect high levels of actual job performance).1 Furthermore, the literature surrounding validity provides a framework for scientifically sound measure development that, a priori, can increase the likelihood that scores resulting from our measures will be generalizable and that the inferences we desire to make based upon them will be supported. Like personnel selection itself, science and practice surrounding the concept of validity continue to evolve, with changes affecting not only its evaluation but also its very definition, as evidenced by comparing editions of the Standards for Educational and Psychological Testing produced over the past half century by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME) (AERA, APA, & NCME, 2014). The evolution of validity has been well covered in the personnel selection literature (e.g., Binning & Barrett, 1989; McPhail, 2007; Schmitt & Landy, 1993; Society for Industrial and Organizational Psychology, 2003) and will continue to be well covered in this Handbook. This chapter and the two chapters immediately before and after it all speak directly to developments with regard to validity, particularly as it relates to personnel selection. The contribution of this chapter is to develop a more comprehensive understanding of the manner in which the design and implementation of operational selection systems have implications for validity.

OVERVIEW

We begin with a conceptual treatment of validity as it is represented in the personnel selection profession. This treatment attempts to outline a set of distinctions that we view as central to an understanding of validity. Namely, we discuss (a) validity as predictor-criterion relationship versus broader conceptualizations, (b) validity of an inference versus validity of a test, (c) types of validity evidence versus types of validity, (d) validity as an inference about a test score versus validation as a strategy for establishing job relatedness, (e) the predictive inference versus the evidence for it, and (f) validity limited to inferences about individuals versus including broader consequences of test score use. Our belief is that a clear understanding of these foundational issues in validity is essential for effective research and practice in the selection arena. In addition, we believe that this conceptual foundation should guide the treatment we give below to the operational and practical considerations for establishing that a particular selection system is supported by persuasive validity evidence. Following this conceptual treatment of validity, we describe key validity considerations in the design and development of operational selection systems. For each of these considerations, we describe the manner in which it can strengthen or weaken conclusions about the validity of the selection system as implemented for its intended purpose(s).

PART 1: CONCEPT OF VALIDITY

Validity, according to the 2014 Standards for Educational and Psychological Testing, is "the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests" (AERA, APA, & NCME, 2014, p. 11). There is a long history and considerable literature on the subject of validity. With limited space here, it is impossible to do justice to the subject. We attempt to highlight a set of important issues in the ongoing development of thinking about validity, but we direct the interested reader to a set of key resources for a strong foundation on the topic. One key set of references is the set of chapters on the topic of validity in the four editions of Educational Measurement, which is that field's analog to the Handbook of Industrial and Organizational Psychology. Cureton (1951), Cronbach (1971), Messick (1989), and Kane (2006) each offer detailed treatment of the evolving conceptualizations of validity. Another key set focuses specifically on validity in the context of personnel selection. Two prominent articles on validity in the employment context have been published in the American Psychologist by Guion (1974) and Landy (1986). There is also a very influential paper by Binning and Barrett (1989). A third key set is made up of classic highly cited articles in psychology—Cronbach and Meehl's (1955) and Loevinger's (1957) treatises on construct validity. Our focus in this section is entirely conceptual. This chapter does not address operational issues in the design of research studies aimed at obtaining various types of validity evidence except to the extent that local studies might be conducted within the process of design and development. Rather, we attempt to outline a set of issues that we view as central to an understanding of validity.

Validity as Predictor-Criterion Relationship Versus Broader Conceptualizations

In the first half of the 20th century, validity was commonly viewed solely in terms of the strength of predictor-criterion relationships. Cureton's (1951) chapter on validity stated, reasonably, that validity addresses the question of "how well a test does the job it was employed to do" (p. 621). But the "job it was employed to do" was viewed as one of prediction, leading Cureton to state that, "Validity is . . . defined in terms of the correlation between the actual test scores and the 'true' criterion measures" (pp. 622–623). But more questions were being asked of tests than whether they predicted a criterion of interest. These included questions about whether mastery of a domain could be inferred from a set of questions sampling that domain and about whether a test could be put forward as a measure of a specified psychological construct. A landmark event in the intellectual history of the concept of validity was the publication of the first edition of what is now known as the Standards for Educational and Psychological Testing (APA, 1954), in which a committee headed by Lee Cronbach, with Paul Meehl as a key member, put forward the now familiar notions of predictive, concurrent, content, and construct validity. Cronbach and Meehl (1955) elaborated their position on construct validity a year later in their seminal Psychological Bulletin paper. Since then, validity has


been viewed more broadly than predictor-criterion correlations, with the differing validity labels viewed first as types of validity and more recently as different types of validity evidence or as evidence relevant to differing inferences to be drawn from test scores.

Validity of an Inference Versus Validity of a Test

Arguably the single most essential idea regarding validity is that it refers to the degree to which evidence supports the inferences one proposes to draw about the target of assessment (in the I-O world, most commonly an individual; in other settings, a larger aggregate such as a classroom or a school) from their scores on assessment devices. The generic question "Is this a valid test?" is not a useful one; rather, the question is "Can a specified inference about the target of assessment be validly drawn from scores on this device?" Several important notions follow from this position. First, it follows that the inferences to be made must be clearly specified. Multiple inferences are frequently proposed. Consider a technical report stating, "This test representatively samples the established training curriculum for this job. It measures four subdomains of job knowledge, each of which is predictive of subsequent on-the-job task performance." Note that three claims are made here, dealing with sampling, dimensionality, and prediction, respectively. Each claim is linked to one or more inferences about a test taker (i.e., degree of curriculum mastery, differentiation across subdomains, relationships with subsequent performance, and incremental prediction of performance across subdomains). Second, support for each inference is needed to support the multifaceted set of claims made for the test. Each inference may require a different type of evidence. The claim of representative content sampling may be supported by evidence of the form historically referred to as "content validity evidence," namely, systematic documentation of the relationship between test content and job knowledge requirements, typically involving the judgment of subject matter experts. The claim of multidimensionality may be supported by factor-analytic evidence, and evidence in support of this claim is one facet of what has historically been referred to as construct validity evidence (i.e., evidence regarding whether the test measures what it purports to measure). The claim of prediction of subsequent task performance may be supported by what has historically been referred to as "criterion-related validity evidence," namely, evidence of an empirical relationship between test scores and subsequent performance. Note that the above types of evidence are provided as examples; multiple strategies may be selected, alone or in combination, as the basis for support for a given inference. For example, empirical evidence of a test-criterion relationship may be infeasible in a given setting because of sample size limitations, and the investigator may turn to the systematic collection of expert judgment as to the likelihood that performance on various test components is linked to higher subsequent job performance. Third, some proposed inferences receive support as evidence is gathered and evaluated, whereas others do not. In the current example, what might emerge is strong support for the claim of representative sampling and strong support for the claim of prediction of subsequent performance, but evidence of a unidimensional rather than the posited multidimensional structure. In such cases, one should revise the claims made for the test; in this case, dropping the claim that inferences can be drawn about differential standing on subdomains of knowledge.

Types of Validity Evidence Versus Types of Validity

Emerging from the 1954 edition of what is now the Standards for Educational and Psychological Testing was the notion of multiple types of validity. The triumvirate of criterion-related validity, content validity, and construct validity came to dominate writings about validity. At one level, this makes perfect sense. Each of these deals with different key inferences one may wish to draw about a test. First, in some settings, such as many educational applications, the key inference is


one of content sampling. Using tests for purposes such as determining whether a student passes a course, progresses to the next grade, or merits a diploma relies heavily on the adequacy with which a test samples the specified curriculum. Second, in some settings, such as the study of personality, the key inference is one of appropriateness of construct labeling and specification. There is a classic distinction (Loevinger, 1957) between two types of construct validity questions, namely, questions about the existence of a construct (e.g., Can one define a construct labeled "integrity" and differentiate it from other constructs?) and questions about the adequacy of a given measure of a construct (e.g., Can test X be viewed as a measure of integrity?). Third, in some settings, such as the personnel selection setting of primary interest for the current volume, the key inference is one of prediction: Can scores from measures gathered before a selection decision be used to draw inferences about future job behavior? Criterion-related validity evidence is a central mechanism for establishing this inference, though content-related evidence also supports this inference, as discussed below. Over the last several decades, there has been a move from viewing these as types of validity to types of validity evidence. All lines of evidence—content sampling, dimensionality, convergence with other measures, investigations of the processes by which test takers respond to test stimuli, or relations with external criteria—deal with understanding the meaning of test scores and the inferences that can be drawn from them. Because construct validity is the term historically applied to questions of the meaning of test scores, the position emerged that if all forms of validity evidence contributed to understanding the meaning of test scores, then all forms of validity evidence were really construct validity evidence. The 1999 and 2014 editions of the Standards pushed this one step further: If all forms of evidence are construct validity evidence, then "validity" and "construct validity" are indistinguishable. Thus, the Standards refer to "validity" rather than "construct validity" as the umbrella term. This seems useful, because construct validity carries the traditional connotations of referring to specific forms of validity evidence, namely convergence with conceptually related measures and divergence from conceptually unrelated measures. Thus, the current perspective reflected in the 2014 Standards is that validity refers to the evidentiary basis supporting the inferences that a user claims can be drawn from a test score. Many claims are multifaceted, and thus multiple lines of evidence may be needed to support the claims made for a test. A common misunderstanding of this perspective on validity is that the test user's burden has been increased, because the user now needs to provide each of the types of validity evidence. In fact, there is no requirement that all forms of validity evidence be provided; rather, the central notion is, as noted earlier, that evidence needs to be provided for the inferences that one claims can be drawn from test scores. For example, if one's intended inferences make no claims about content sampling, then content-related evidence is not needed.
If the claim is simply that scores on a measure can be used to forecast whether an individual will voluntarily leave the organization within a year of hire, then the only inference that needs to be supported is the predictive one. One may rightly assert that scientific understanding is aided by obtaining other types of evidence than those drawn on to support the predictive inference (i.e., forms of evidence that shed light on the construct(s) underlying test scores), but we view such evidence gathering as desirable but not essential. One’s obligation is simply to provide evidence in support of the inferences one wishes to draw.

Validity as an Inference About a Test Score Versus Validation as a Strategy for Establishing Job Relatedness In employment settings, the most crucial inference to be supported about any measure is whether the measure is job-related. Labeling a measure as job-related means “scores on this measure can be used to draw inferences about an individual’s future job behavior”—we term this the “predictive inference.” In personnel selection settings, our task is to develop a body of evidence to support the predictive inference. The next section of this chapter outlines mechanisms for doing so.

Some potential confusion arises from the failure to differentiate between settings in which types of validity evidence are used to draw inferences about the meaning of test scores and settings in which they are used to draw a predictive inference. For example, content-related validity evidence refers to the adequacy with which the content of a given measure samples a specified content domain. Assume that one is attempting to develop a self-report measure of conscientiousness to reflect a particular theory that specifies that conscientiousness has four equally important subfacets: dependability, achievement striving, dutifulness, and orderliness. Assume that a group of expert judges is given the task of sorting the 40 test items into these four subfacets. A finding that 10 items were rated as reflecting each of the four facets would support the inference of adequate domain sampling and contribute to an inference about score meaning. Note that this inference is independent of the question about the job relatedness of this measure. One could draw on multiple lines of evidence to further develop the case for this measure as an effective way to measure conscientiousness (e.g., convergence with other measures) without ever addressing the question of whether predictive inferences can be drawn from this measure for a given job. When one’s interest is in the predictive hypothesis, various types of validity evidence can be drawn upon to support this inference, as outlined below.
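To make the domain-sampling tally concrete, the short sketch below (added for illustration; the items, judges, and blueprint are hypothetical and not taken from the chapter) counts how expert sorts line up with an intended four-facet blueprint and reports the proportion of placements that match it.

```python
# Illustrative sketch: tallying expert judges' sorts of test items into intended
# subfacets as one piece of content-related (domain sampling) evidence.
# Item labels, judge data, and the blueprint are hypothetical.
from collections import Counter

# judge_sorts[judge][item_id] = facet the judge assigned the item to (hypothetical).
judge_sorts = {
    "judge_1": {"item_01": "dependability", "item_02": "orderliness"},
    "judge_2": {"item_01": "dependability", "item_02": "achievement striving"},
    "judge_3": {"item_01": "dependability", "item_02": "orderliness"},
}

# Intended facet for each item from the test blueprint (hypothetical).
blueprint = {"item_01": "dependability", "item_02": "orderliness"}

def sorting_summary(judge_sorts, blueprint):
    """Modal facet per item and the proportion of judge placements matching the blueprint."""
    modal_facet = {}
    matches, total = 0, 0
    for item in sorted(blueprint):
        assignments = [sorts[item] for sorts in judge_sorts.values() if item in sorts]
        modal_facet[item] = Counter(assignments).most_common(1)[0][0]
        matches += sum(a == blueprint[item] for a in assignments)
        total += len(assignments)
    facet_counts = Counter(modal_facet.values())  # ideally roughly equal counts per facet
    return modal_facet, facet_counts, matches / total

modal, counts, agreement = sorting_summary(judge_sorts, blueprint)
print(modal, dict(counts), round(agreement, 2))
```

A high match rate and roughly equal facet counts would support the domain-sampling inference described above; neither says anything about job relatedness.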

Predictive Inference Versus the Evidence for It As noted above, the key inference in personnel selection settings is a predictive one, namely, the inference that scores on the test or other selection procedure can be used to predict the test takers’ subsequent job behavior. A common error is the equating of the type of inference to be drawn with the type of evidence needed to support the inference. Put more bluntly, the error is to assert that, “If the inference is predictive, then the needed evidence is criterion-related evidence of the predictive type.” Scholars in the I-O area have clearly articulated that there are multiple routes to providing evidence in support of the predictive hypothesis. Figure 3.1 presents this position in visual form. Models of this sort are laid out in Binning and Barrett (1989) and in the 2014 Standards. The upper half of Figure 3.1 shows a measured predictor and a measured criterion. Because both are measured, the relationship between these two can be empirically established. The lower half of Figure 3.1 shows an unmeasured predictor construct domain and an unmeasured criterion construct domain. Of interest is the set of linkages among the four components of this model. The first and most central point is that the goal of validation research in the personnel selection context is to establish a linkage between the predictor measure (Figure 3.1, upper left) and the criterion construct domain (Figure 3.1, lower right). The criterion construct domain is the conceptual specification of the set of work behaviors that one is interested in predicting. This criterion construct domain may be quite formal and elaborate, as in the case of a job-analytically specified set of critical job tasks, or it may be quite simple and intuitive, as in the case of an organization that asserts that it wishes to minimize voluntary turnover within the first year of employment and thus specifies this as the criterion domain of interest. The second central point is that there are three possible mechanisms for linking an observed predictor score and a criterion construct domain. The first is via a sampling strategy. If the predictor measure is a direct sample of the criterion construct domain, then the predictive inference is established based on expert judgment (e.g., obtained via a job analysis process) (Linkage 5 in Figure 3.1). Having an applicant for a symphony orchestra position sight-read unfamiliar music is a direct sample of this important job behavior. Having an applicant for a lifeguard position dive to the bottom of a pool to rescue a simulated drowning victim is a simulation, rather than a direct sample of the criterion construct domain. However, it does rely on domain sampling logic and, like most work sample tests, aims at psychological fidelity in representing critical aspects of the construct domain. The second mechanism for linking an observed predictor and a criterion construct domain is via establishing a pair of linkages, namely (a) the observed predictor–observed criterion link (Linkage 1 in Figure 3.1) and (b) the observed criterion–criterion construct domain link

[Figure 3.1 is a two-by-two diagram: the upper half shows the measured components (Observed Predictor Measure, upper left; Observed Criterion Measure, upper right), and the lower half shows the unmeasured components (Latent Predictor Domain, lower left; Latent Criterion Domain, lower right). Five linkages connect the components:]

• Linkage 1: Observed Predictor Measure – Observed Criterion Measure. Via local validation study; via validity generalization/transportability evidence.
• Linkage 2: Observed Predictor Measure – Latent Predictor Domain. Amassing theory and data to assign construct labels to predictor scores.
• Linkage 3: Latent Predictor Domain – Latent Criterion Domain. Via logic and judgment based on existing theory and the body of relevant empirical evidence.
• Linkage 4: Observed Criterion Measure – Latent Criterion Domain. Via SME judgment (e.g., comparison to job analysis data); via empirical means (e.g., CFA).
• Linkage 5: Observed Predictor Measure – Latent Criterion Domain. Sampling of predictor content from the criterion domain (informed via job analysis).

FIGURE 3.1  Routes to Providing Evidence in Support of the Predictive Inference (Adapted from Binning, J. F., & Barrett, G. V., Journal of Applied Psychology, 74, 478–494, 1989.)

(Linkage 4 in Figure 3.1). The first of these links can be established empirically, as in the case of local criterion-related evidence, or generalized or transported evidence. Critically, such evidence must be paired with evidence that the criterion measure (e.g., ratings of job performance) can be linked to the criterion construct domain (e.g., actual performance behaviors). Such evidence can be judgmental (e.g., comparing criterion measure content to critical elements of the criterion construct domain revealed through job analyses) and empirical (e.g., fitting CFA models to assess whether dimensionality of the observed criterion scores is consistent with the hypothesized dimensionality of the criterion construct domain). It commonly involves showing that the chosen criterion measures do reflect important elements of the criterion construct domain. Observed measures may fail this test, as in the case of a classroom instructor who grades solely on attendance when the criterion construct domain is specified in terms of knowledge acquisition, or in the case of a criterion measure for which variance is largely determined by features of the situation rather than by features under the control of the individuals. The third mechanism also focuses on a pair of linkages, namely (a) linking the observed predictor scores and the predictor construct domain (Linkage 2 in Figure 3.1) and (b) linking the predictor construct domain and the criterion construct domain (Linkage 3 in Figure 3.1). The first linkage involves obtaining data to support interpreting variance in predictor scores as reflecting variance in a specific predictor construct domain. This reflects one form of what has historically been referred to as construct validity evidence, namely, amassing theory and data to support assigning a specified construct label to test scores. For example, if a test purports to measure achievement striving, one might offer a conceptual mapping of test content and one’s specification of the domain of achievement striving, paired with evidence of empirical convergence with other similarly specified measures of the construct. However, showing that the measure does reflect the construct domain is supportive of the predictive inference only if the predictor construct domain can be linked to the criterion construct domain. Such evidence is logical and judgmental, requiring a clear articulation of the basis for asserting that individuals who are higher in the domain of achievement striving will have higher standing on the criterion construct domain than individuals lower in achievement striving. Thus, there are multiple routes to establishing the predictive inference. These are not mutually exclusive; one may provide more than one line of evidence in support of the predictive inference. The type of measure does not dictate the type of evidentiary strategy chosen.
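As one concrete illustration of the first route (Linkage 1 paired with Linkage 4 evidence), the sketch below applies two conventional psychometric corrections, for direct range restriction and for criterion unreliability, to a hypothetical observed validity coefficient from a local study. The formulas are the standard textbook ones rather than anything prescribed in this chapter, the input values are invented, and the choice and ordering of corrections in practice depend on the study design.

```python
# Illustrative sketch of summarizing Linkage 1 evidence from a local criterion-related
# study. Standard corrections are shown; whether and in what order they apply depends
# on how the local study was conducted. All numbers are hypothetical.
from math import sqrt

def correct_for_criterion_unreliability(r_xy, r_yy):
    """Disattenuate the observed validity for unreliability in the criterion only."""
    return r_xy / sqrt(r_yy)

def correct_for_direct_range_restriction(r_xy, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction when selection was made directly on the predictor."""
    u = sd_unrestricted / sd_restricted
    return (r_xy * u) / sqrt(1 + r_xy ** 2 * (u ** 2 - 1))

# Hypothetical local study: observed validity, criterion reliability, and predictor SDs
# in the applicant pool versus the selected (incumbent) sample.
r_observed = 0.25
r_criterion_reliability = 0.60
sd_applicants, sd_incumbents = 10.0, 7.0

r_step1 = correct_for_direct_range_restriction(r_observed, sd_applicants, sd_incumbents)
r_estimate = correct_for_criterion_unreliability(r_step1, r_criterion_reliability)
print(f"observed r = {r_observed:.2f}, corrected estimate = {r_estimate:.2f}")
```

The corrected value estimates the predictor–criterion relationship in the applicant population rather than in the restricted incumbent sample.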

Validity Limited to Inferences About Individuals Versus Including Broader Consequences of Test Score Use In the last two decades, considerable attention has been paid to new views of validity that extend beyond the inferences that can be drawn about individuals to include a consideration of the consequences of test use. The key proponent of this position is Messick (1989). Messick noted that it is commonly asserted that the single most important attribute of a measure is its validity for its intended uses. He noted that at times test use has unintended negative consequences, as in the case in which a teacher abandons many key elements of a curriculum to focus all effort on preparing students to be tested in one subject. Even if inferences about student domain mastery in that subject can be drawn with high accuracy, Messick argued that the negative consequences (i.e., ignoring other subjects) may be so severe as to argue against the use of this test. If validity is the most important attribute of a test, then the only way for negative consequences to have the potential to outweigh validity evidence in a decision about the appropriateness of test use is for consequences of test use to be included as a facet of validity. Thus, he argued for a consideration of traditional aspects of validity (which he labeled “evidential”) and these new aspects of validity (which he labeled “consequential”). These ideas were generally well received in educational circles, and the term “consequential validity” came to be used; in this usage, a measure with unintended negative consequences lacks consequential validity, and such negative consequences are viewed as invalidating test use. These ideas, however, were not well received in the I-O field, and the 2014 Standards rejects this view. Although evidence of negative consequences may influence decisions concerning the use of predictors, such evidence will only be related to inferences about validity if the negative consequences can be directly traced to the measurement properties of the predictor. Using an example that one of us (Sackett) contributed to the SIOP Principles for the Validation and Use of Personnel Selection Procedures (2003), consider an organization that (a) introduces an integrity test to screen applicants, (b) assumes that this selection procedure provides an adequate safeguard against employee theft, and (c) discontinues use of other theft-deterrent methods (e.g., video surveillance). In such an instance, employee theft might actually increase after the integrity test is introduced and other organizational procedures are eliminated. Thus, the intervention may have had an unanticipated negative consequence on the organization. These negative consequences do not threaten the validity of inferences that can be drawn from scores on the integrity test, because the consequences are not a function of the test itself. Given recent encouragement to evaluate selection systems with respect to their influence on organization-level outcomes (e.g., Ployhart & Weekley, Chapter 5 in this volume), we think it is helpful to distinguish between the validity of test scores used to select individuals into organizations and an evaluation of aggregate consequences of the selection system at the organization level. Consider the example of a measure of service orientation used to hire customer service employees for the purpose of improving customer satisfaction.
Our view is that even where there is evidence that customer service employees selected for service orientation improve organization-level results (e.g., higher average district-level customer retention), such organization-level outcomes do not constitute evidence of validity for service orientation scores. To be sure, such evidence of organization-level impact would likely have great importance to the organization and may even be far more important to the organization than evidence showing that employees with higher service orientation produce higher customer satisfaction, but such evidence would not strengthen the validity claim that employees with higher service orientation scores produce more satisfied customers. In the design of this selection system, the I-O psychologist relies on information about the conceptual linkage at the individual level between employee attributes and employee outcomes as the basis for choosing and validating an assessment measuring service orientation. This is a nuanced but important point for this chapter. Our focus in this chapter is on the validity of inferences about selection procedure scores that operate at the level of individual applicants and employees. In effect, our point is that for validity inferences to be meaningful both theoretically and practically, the predictor scores and the intended outcome results (criteria) must be at the same level of analysis. This does not mean that validity claims cannot be made with respect to organization-level purposes. In this example, one could evaluate a claim that an organization-level measure of customer service, such as a district-average rating of customer satisfaction, is a valid predictor of district-level customer retention. We should say that successful validation at the individual level is not the ultimate objective for the I-O professional who is designing and implementing a selection system. The most important objective is that the selection system be useful and lead to the outcomes intended by the organization and the people in the organization. While valid selection procedures may be one of the most important contributions the I-O psychologist can make to organization-level outcomes, validity is not, in general, sufficient to ensure organization-level results. Indeed, the selection professional is very likely to have a primary focus on making design and implementation decisions that maximize the intended outcomes at both individual and organizational levels, confident in the science-based conclusion that validity is an important building block for producing an effective selection system.

Summary of Part 1 In conclusion, we have attempted to develop six major points about validity. These are that (1) we have moved far beyond early conceptualizations of validity as the correlation between test scores and criterion measures; (2) validity is not a characteristic of a test, but rather refers to inferences made from test scores; (3) we have moved from conceptualizing different types of validity to a perspective that there are different types of validity evidence, all of which contribute to an understanding of the meaning of test scores; (4) the key inference to be supported in employment settings is the predictive inference, namely, that inferences about future job behavior can be drawn from test scores; (5) there are multiple routes to gathering evidence to support the predictive inferences; and (6) although evidence about unintended negative consequences (or intended positive consequences) of test use (e.g., negative applicant reactions to the test) may affect a policy decision as to whether or not to use the test, such evidence is not a threat to the predictive inference and does not affect judgments about the validity of the test. Our belief is that a clear understanding of these foundational issues in validity is essential for effective research and practice in the selection arena.

PART 2: VALIDITY CONSIDERATIONS IN THE DESIGN AND IMPLEMENTATION OF SELECTION SYSTEMS The following discussion of validity considerations in the design and implementation of selection systems begins with our perspective about the meaning of validity with regard to selection systems and follows by describing how design and development decisions in each of six major stages of a prototypic design and development process can generate evidence for three types of inferences supporting a conclusion of validity. These stages are (1) specify the intended uses and outcomes of the selection process; (2) describe the work; (3) choose/develop predictor and criterion assessment processes; (4) prescribe the manner in which assessment scores are to be used; (5) prescribe the policies and rules that govern the operation of the selection system; and (6) manage and maintain the selection system over time. These stages are roughly sequential but may overlap considerably. The three types of inferences are about (1) the intended uses and outcomes, (2) the quality of predictor and criterion scores, and (3) the prediction rationale.

The Meaning of Validity in the Context of Selection Systems In the domain of personnel selection, the Standards and Principles distinguish the definition of validity from other types of inferences relating to intended outcomes where those outcomes (consequences) “do not follow directly from test score interpretations” (Standards, p. 21; related comments in Principles on p. 7). However, just as the Standards describes a professional responsibility to provide evidence of validity, it also describes a professional responsibility to support claims about outcomes that do not follow from test score interpretations. We adhere to this distinction in this chapter about validity considerations. The subsections below address the design and implementation considerations relating to evidence of validity and only briefly acknowledge certain key efficacy considerations, for example, relating to cut scores. Note that throughout this discussion we distinguish between uses and outcomes of selection processes. “Uses” refers to the particular human resources (HR) process supported by selection including, for example, external hiring, internal lateral movement, internal progression programs, promotion/demotion decisions, selection into training/development programs, and downsizing. Different HR processes may have somewhat different implications for the design, implementation, and validity evidence of the supporting selection system. In contrast, the language of “outcomes” is used here to refer to the specific, individual-level work behaviors, products, tasks, contextual behaviors, etc. that the organization intends to influence with the selection system. The Standards refers to “proposed uses” in a very broad sense to include both uses and outcomes. We make this distinction because intended uses and intended outcomes can have different consequences for design, implementation, and validation of the selection system.

A Framework for Describing Validity Considerations in the Design and Implementation of Selection Systems We propose a two-dimensional framework to organize information about the manner in which design and implementation decisions influence inferences about the local validity of selection test scores. The two dimensions of this framework are (1) type of inference and (2) stage of the design and implementation process. In this chapter we describe the manner in which evidence for each of three types of inference is gathered (or not) in each of the six stages of design and implementation. Table 3.1 displays this framework.

TABLE 3.1  A Framework for Describing Validity Considerations in the Design and Implementation of Selection Systems

(Rows are the six stages of design and implementation; columns are the three key inferences in the selection validity rationale: Intended Uses and Outcomes, Quality of Predictor and Criterion Scores, and Prediction Rationale.)

Stage 1: Specify Intended Uses and Outcomes
Intended Uses and Outcomes:
• Identify types of selection processes (uses)
• Identify most important intended outcomes
  o Usually it is not feasible to address all desirable outcomes
• Evidence that outcomes are a function of individual differences
• Determine that no artificial barriers block a predictive relationship
• Confirm scope of intended outcomes
• Ensure sufficient authority to specify intentions
Quality of Predictor and Criterion Scores:
• None from this stage of work
Prediction Rationale:
• No direct evidence for any prediction rationale is generated at this stage

Stage 2: Describe the Work
Intended Uses and Outcomes:
• Sufficient scope and authority to describe the meaning and relevance of intended outcomes and associated work behaviors
• Identify important work elements and worker behavior associated with intended outcomes
• Provide expertise required for credible work information
Quality of Predictor and Criterion Scores:
• Provide necessary expertise to provide credible information needed to inform predictor constructs as requirements for successful outcomes/work behaviors
• Distinguish between three types of relevant expertise
  o Work content
  o Importance to work
  o Assessment-related expert judgment
Prediction Rationale:
Requires sufficient expertise to:
• Establish rational/observed link between worker attributes (predictor constructs) and work outcomes/behaviors
• Provide credible information required by synthetic validation procedures

Stage 3: Choose/Develop Predictor and Criterion Assessment Processes
Intended Uses and Outcomes:
• Choice/development directed by information generated in Stages 1 and 2
• Both validity and usefulness influence the choice or development of predictors
Quality of Predictor and Criterion Scores:
• Online, unproctored predictor assessment
• Choice of personality scale scores to suit specifics of local outcomes
• Absence of bias in criterion measures
Prediction Rationale:
• Importance of alignment between predictor and criterion constructs
• Implications of complex, multidimensional outcomes
• Choosing among commercially available predictors
• Incremental contributions to prediction of complex criteria
• Expert judgment required to generalize validity conclusions from previous research

Stage 4: Prescribe Score Usage
Intended Uses and Outcomes:
• No implications for validity evidence relating to intended uses or outcomes
Quality of Predictor and Criterion Scores:
• Using predictor scores to inform selection decision makers’ judgments
Prediction Rationale:
• Creating composite scores from weighted predictor measures
• Predictor scores used in sequence

Stage 5: Prescribe Governing Policies/Rules
Intended Uses and Outcomes:
• No implications because equivalencies, exemptions, and waivers are based on considerations unrelated to any interpretation of test scores
Quality of Predictor and Criterion Scores:
• Test-taking conditions re: retesting, test preparation, and online administration are likely to affect scores and alter the generalizability of past validity research to the local setting
Prediction Rationale:
• Changes to test-taking conditions designed to accommodate disabilities will have a largely unknown impact on the generalizability of previous research about prediction properties to the local setting

Stage 6: Manage and Maintain the Selection System
Intended Uses and Outcomes:
Adapting to dynamic validity factors:
• Monitor the importance of intended uses and outcomes
• Track metrics for achieved outcomes
• Link outcomes to selection results (not validation)
Quality of Predictor and Criterion Scores:
• Test administration and scoring processes may alter the generalizability of previous research about predictor measurement properties to the local setting
• Train selection administration staff re: process standards
• Track predictor score characteristics
• Audit threats to predictor score meaning and quality
Prediction Rationale:
• Maintain the current professional expertise of the personnel selection expert

Three Types of Key Inferences In this framework, we describe three types of key inferences as being critical to claims of validity for selection tests. These are (1) inferences relating to the intended uses and outcomes themselves, (2) inferences related to the psychometric quality of predictor and criterion scores including properties such as reliability, group differences, construct validity, and measurement bias, and (3) inferences related to the predictive relationship between test scores and important outcomes. Intended Uses and Outcomes  The first category of inferences relates to attributes of the intended uses and outcomes associated with the purposes of the selection system. Certainly, the specification of intended uses and outcomes is necessary to construct an appropriate validation process. However, it is also necessary that the designer makes certain inferences about the intended uses and outcomes in order to design valid selection procedures. For example, intended outcomes will be amenable to a selection solution only if the designer can infer that, in the local context, they are a function of stable individual differences to some meaningful degree. Quality of Predictor and Criterion Scores  Inferences about the quality of predictor and, where feasible and needed, criterion scores are central to any evaluation of the validity of the predictor scores. This is a category of diverse inferences, all of which relate to some aspect of the psychometric quality of the predictor and criterion scores, as used, including that (a) the selected predictor and criterion assessment methods measure the intended predictor and criterion constructs; (b) the predictor and criteria scores generated in the local setting are reliable; (c) the local assessment processes themselves do not introduce unique sources of measurement error or bias; (d) the manner in which predictor scores are used does not change the meaning of the scores; and (e) the local measurement and usage conditions are supported by relevant previous research about the psychometric qualities of the predictor measures. This type of inference applies to all measured predictors to which the validity inference applies including resume-based accomplishment/qualification algorithms, interviews, structured skill/ability tests, psychological inventories, work samples, job knowledge and situational judgment tests, structured exercises, and manager/HR staff ratings and judgments. However, as a practical matter, some predictable outcomes of the selection process (possible criteria) are almost always not specified as criteria for a validation effort. This point simply acknowledges that a comprehensive identification and assessment of all desirable selection outcomes is usually not feasible. This is because (a) the list of all desirable outcomes is very long in many cases and some are more important or salient than others, and (b) it is not feasible to assess certain desirable outcomes due to constraints in time, cost, resources, or assessment methodology. For example, organizations are likely to always value the benefits in safety, employee health, and avoided human and dollar costs that are predictably a result of the use of cognitive ability predictors in selection systems, but these desirable outcomes are frequently not salient to the purposes of the selection system and/ or may not be assessable by any feasible process. 
Prediction Rationale  The prediction rationale refers to the evidence and reasoning supporting the claim that scores on the predictor, as used, are predictive of the intended outcomes. This type of inference is central to the claim that selection test scores are valid. The necessary but insufficient foundation for this rationale is that the chosen predictor constructs are conceptually linked to the important intended outcomes (criterion constructs) in a manner that implies predictor constructs will be predictive of criterion constructs. The discussion in the first part of this chapter provides a detailed analysis of the meaning of this category of inferences and distinguishes its meaning from the variety of types of evidence that can be used to inform this inference.

Six Stages of Design and Implementation We organize the design and implementation work into six major stages to help describe the manner in which design and implementation can provide evidence relevant to the three types of validity inferences. Although these stages are in a roughly logical order, they are interdependent and may significantly overlap with one another. These stages can produce different evidence and support different inferences about the validity of the predictor scores in the system. These six stages are (1) specifying the intended uses and outcomes, (2) describing the work, (3) developing (or choosing) the predictor and criterion assessment procedures, (4) determining the manner in which the predictor scores will be used to make the personnel decisions, (5) establishing the policies and rules that will govern the operation of the selection system, and (6) managing and maintaining the selection system. To be sure, these six stages represent a prototypic model of design and implementation work and can vary greatly across circumstances. Nevertheless, while any of these may or may not be a stage of actual design and implementation work, each is associated with a set of considerations critical to any selection system. For example, even in the implausible case where no design or implementation work addressed the manner in which test scores should be used, test scores will be used in practice in some fashion. For this reason, each of the six stages represent considerations that have implications for the validity of every selection system regardless of the amount of attention and expertise, if any, that was invested during development. Throughout this treatment, we presume that the design and implementation work is carried out by professionals with personnel selection expertise. There’s little point in describing the prototypic work of non-experts. For each of these six stages, the subsections below describe the types of validity evidence generated or relied upon in that stage and the types of validity-related inferences supported by that evidence.

Stage 1: Specify Intended Uses and Outcomes The primary role of this stage of activities is to specify the intended uses and outcomes, which provide the direction needed to begin designing the system including the choice of predictor constructs and measures, the purpose and types of work analyses, and the consideration of most appropriate and feasible types of evidence supporting the predictive relationship between the selection procedures and outcomes.

Inferences About Intended Uses and Outcomes Generally, the specification of uses and outcomes does not, itself, generate evidence of validity. Rather, it specifies the intended uses and outcomes that will define the criteria for the predictive inference and suggest the types of predictors likely to influence the target criteria. Nevertheless, the specified uses and outcomes must satisfy at least two requirements to ensure that prediction validity is even possible or relevant. These requirements are that (1) the intended outcomes are a function of stable individual differences to some meaningful degree and that (2) there are no organizational or work-related constraints preventing the desired outcomes. During the activity of specifying the intended uses and outcomes, inferences must be drawn from various sources of evidence confirming that these requirements are satisfied. In addition to these two inferences, the meaningfulness of conclusions about selection validity is influenced by the comprehensiveness with which intended outcomes are specified. A Function of Individual Differences  The specified outcomes must be a function of individual differences among workers. The required inference is not so much that the eventual predictors and criteria will be measured at the level of the individual applicant and employee but, rather, that differences among workers’ outcomes are reliable and are a function of stable individual worker characteristics to some important extent. No Organizational or Work-Related Constraints  The specified outcomes must not be constrained by factors, separate from the influence of individual differences, that prevent worker attributes from affecting them. For example, suppose an organization proposes a selection system for an Account Rep job to increase the accuracy with which Account Reps decide whether contested charges may be removed from a customer’s bill. If the current problem with inaccuracy is attributable to a lack of training about the organization’s unique billing guidelines, then the intended outcome of improved accuracy is not a plausible result of any selection solution. In effect, the proposed outcome of improved accuracy does not allow a meaningful evaluation of the validity of a selection test. These two requirements are, in effect, pre-conditions for the validity and usefulness of a selection strategy. Comprehensiveness of Intended Outcomes  The Standards notes that “each intended interpretation must be validated” (p. 11), but rarely, if ever, in our experience are all real benefits of a selection system explicitly specified as intended. The root issue is that every individual hiring decision plays some role in many positive and/or negative work behaviors and outcomes whether or not each of these real behaviors/outcomes was salient and explicitly intended in the planning of the selection system. If the employer has a broad interest in optimal hiring that selects the best employees, then, in concept, there is a very large number of “intended interpretations.” For example, good employees who are contributors to the organization would be employees who work safely, show up on time every day, are helpful to others, perform their job tasks accurately and consistently at high levels of proficiency, do not steal from the organization or denigrate the organization to others, are supportive of others’ success, progressively develop into positive contributors in larger and more important roles, have low health-related costs, do not leave the organization, enable others to be effective, work well in teams, take leadership roles when needed, suggest positive improvements to work processes, and so on. The point is that every hired employee produces some result on virtually every valued dimension of work behavior relevant to their work context. With rare exceptions (e.g., Project A for Army jobs; Campbell & Knapp, 2001), organizations do not specify all valued work behaviors as intended outcomes in the planning of a selection system. In fact, even when seemingly comprehensive, rigorous job analyses are conducted to empirically identify important job behaviors/results, rarely is the focus broad enough to incorporate the full range of valued behaviors, many of which extend well beyond the job itself (e.g., progression, helping, loyalty, health, and safety). One specific example of an overlooked but valuable outcome of cognitive ability screening was reported by McCormick (2001) in a large (N = 7,764) study of the relationship between employee illness and accident rates, and cognitive ability test scores used in the employment selection process. This study found the correlation between cognitive test scores and illness and accident rates, corrected for the effects of age, to be −.07 and −.09, respectively.
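To see how correlations of this size can nonetheless translate into the large organization-wide dollar figure reported next, a Brogden–Cronbach–Gleser utility calculation can be sketched. All parameter values below are hypothetical illustrations and are not taken from McCormick (2001).

```python
# Illustrative sketch only: a Brogden-Cronbach-Gleser utility estimate showing how a
# small validity coefficient can aggregate into a large organization-wide dollar value.
# Every number below is hypothetical and is NOT taken from McCormick (2001).
from statistics import NormalDist

def mean_std_score_of_selected(selection_ratio):
    """Mean standardized predictor score of those selected under top-down selection."""
    cutoff = NormalDist().inv_cdf(1 - selection_ratio)
    return NormalDist().pdf(cutoff) / selection_ratio

def annual_utility(n_selected, validity, sd_outcome_dollars, selection_ratio, cost_per_applicant=0.0):
    """Brogden-Cronbach-Gleser utility: estimated yearly gain from using the predictor."""
    z_bar = mean_std_score_of_selected(selection_ratio)
    n_applicants = n_selected / selection_ratio
    return n_selected * validity * sd_outcome_dollars * z_bar - n_applicants * cost_per_applicant

# Hypothetical inputs: 5,000 hires per year, validity of .08 against a dollar-valued
# health/safety criterion with an SD of $4,000 per employee-year, selecting the top 60%.
print(round(annual_utility(5000, 0.08, 4000, 0.60, cost_per_applicant=20)))
```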
While these were low correlations in absolute value, the organization-wide dollar benefit of selecting applicants above the 40th percentile on cognitive ability was estimated to be approximately $96 million per year. This study describes important outcomes—health and safety behavior—that are improved by selection based on cognitive ability. While the dollar value may be surprising, it is not surprising theoretically that cognitive ability is predictive of worker health and safety behavior. Yet, in one author’s (Kehoe) experience, rarely are cognitive ability tests used explicitly to influence worker health behavior. Indeed McCormick’s meta-analysis was conducted across diverse selection procedures and many jobs, none of which intended cognitive ability predictors to influence health behavior. (For further information about health and safety outcomes, see Chapter 24 in this volume.) As a very practical matter, it is common for organizations to specify only those intended outcomes that are most salient in the current local circumstances. Common examples of explicitly intended, highly salient outcomes include turnover, customer satisfaction, professionalism, sales results, speed and accuracy, project execution, share value, and revenue, among many others. But any list of the most salient intended outcomes is virtually always an incomplete list of the real outcomes for every worker that are valued by the organization. (For further discussion about the choice of criteria, see Chapter 25 in this volume.) Comprehensiveness in the specification of intended valued behaviors and outcomes is unattainable, as a practical matter. Further, the Standards does not explicitly require comprehensiveness in specifying intended interpretations. The implications are (a) validity evidence with respect to specified interpretations and outcomes is always an incomplete indicator of the relevance of a selection system to valued worker behavior and outcomes, and (b) great care should be taken at this early step in the design and implementation process to explicitly specify all the intended outcomes that the organization will expect of the selection system. Who Specifies the Intended Outcomes?  This specification of intended outcomes is critical to the design of selection systems, but there is often some ambiguity about the acceptability to various stakeholders of the specified list of explicitly intended outcomes. (This ambiguity only applies to intended outcomes; rarely is there ambiguity about the intended process(es) to be supported by a selection system, i.e., hiring, promotion, downsizing, etc.) Three sources often provide input about intended outcomes. One source is some form of standardized job/work analysis designed to identify frequent and important tasks and other work behaviors. Left to its own devices, the science-based profession of personnel selection usually begins here. However, in many cases, organization leaders (e.g., unit director, HR leader, operations manager, etc.) have strong interests in specifying the intended outcomes that are most salient and important to them. Turnover is often specified as an intended outcome in this manner. Also, more strategic outcomes may be specified by organization leaders, such as the importance of current job-specific knowledge in new hires where the organization is either increasing or decreasing its investment in new-hire job training. Indeed, in our experience it is not uncommon for organization leaders to assert that a job-specific proficiency or ability that appears important in a job analysis is, on balance, not as important as other desired outcomes and that the selection system should focus on the desired outcomes specified by the organization leader. For example, a leader of a customer service center with mostly entry-level workers may assert strongly that the most important desired outcome is reduced turnover and make the further assertion that job knowledge and learning ability are much less important. The third source of influence is the selection expert who is designing the selection system. Based on her/his expertise about personnel selection and the organization and job itself, this expert may make strong suggestions about possibly important benefits of selection outcomes that are not salient to organization leaders or identified in a job analysis. The point of this consideration is that validity evidence will be important and useful to the organization only to the extent that the specified intended outcomes are aligned with the organization’s actual interests in the selection system. Ensuring this external validity, if you will, of the specified intended outcomes is critical to the usefulness of the validation effort.

Inferences About the Quality of Predictor and Criterion Scores No inferences about the quality of predictor and/or criterion scores follow from this first stage given that it is limited to the specification of the intended uses and outcomes. This first stage only specifies the intended uses and outcomes that are necessary in Stage 2 to identify the appropriate constructs and potential measures of predictors and criteria.

Inferences About the Prediction Rationale The specification of the intended uses and outcomes cannot lead directly to inferences about a prediction rationale because the particular predictors and criterion measures have not yet been specified. However, the expert designer is able to use the specification of intended outcomes to make initial judgments about the nature of criterion constructs implied by these outcome specifications. These early judgments about likely criterion constructs, in turn, enable the expert to identify potential predictor constructs from the relevant validity research foundation, but no evidence supporting a prediction rationale is gathered at this first stage.

Stage 2: Describe the Work The process of validating selection systems depends to a great extent on the nature of the work into which applicants are selected. As a result, the information about work generated by various methods of work analysis typically provides the foundation that helps determine what many (but not all) criterion assessments should measure and, in turn, what predictor assessments should measure. For the purposes of this chapter, we treat the analysis of work very broadly to include virtually all the various processes and methods used to produce the work information required to specify criterion and predictor constructs and, in certain cases, create criterion and predictor measures. This broad treatment of work analysis includes (a) traditional job analysis methods for documenting important work behaviors, tasks, components, and required KSAOs; (b) more specialized methods for specifying work content required to develop criterion and/or predictor measures such as knowledge content, critical incidents, and situational judgments that may discriminate between good and poor judgment; and (c) methods for identifying workplace behaviors that are valued by the organization but are outside the scope of job-specific performance, such as turnover, job progression, and organization citizenship including counterproductive behavior. As a result, the design and implementation activities that produce descriptions of work often have direct and critical influence on the specification of intended outcomes, on validity-related inferences about the quality of criterion and predictor scores, and on the prediction rationale.

Inferences About Intended Uses and Outcomes In many cases, formal work analyses inform or specify intended outcomes beyond an initial, more general description. In this way these analyses provide information about constructs and/or content of important work outcomes that become bases for evaluating the quality of criterion and predictor measures and a prediction rationale. Even though work analyses can further specify intended outcomes in ways that establish their importance and make them measurable, work analyses cannot replace the authority to establish the intended uses or outcomes of a selection system. Work analyses require job and testing expertise, not organizational authority. Ultimately, the establishment of intended uses and outcomes is a matter of authority, not expertise. The primary role of work analyses with respect to intended outcomes is often to further specify measurable constructs and work behavior content that capture the intention of the organization authority that initially established the intended uses and outcomes at some level of description. Our overall perspective about the scope of work analysis is that it should be determined by the initial description of intended uses and outcomes established prior to the effort to analyze the work, and there should be no artificial methodological limits to the scope of this analysis. For example, if the organization leader prescribes that a selection system should be designed to minimize turnover, among other desired outcomes, then some analysis of the work context and conditions should be conducted, if it hasn’t already, to understand the factors that influence individual decisions to leave the job or the organization. This may seem like a trite point, but the underlying principle here is that job analysis, whatever its form, should be designed to serve the purposes expressed in the intended uses and outcomes. Even in a prototypic scenario in which the selection professional has virtual free rein—within organizational constraints—to establish an optimal selection system, traditional work and worker-oriented job analyses should not be the sole determinant of intended outcomes. Because there is such a wide range of potentially important outcomes beyond the job tasks themselves (e.g., health behavior, progression success, workplace theft, helping behavior, responsible behavior, professional/appropriate demeanor, creative/innovative behavior, safety/accident results, prosocial behavior in work teams, and early vs. late turnover), the design of the selection system should begin with some form of strategic discussion with organization leaders to identify the organization’s most salient needs that are amenable to a selection solution. Ultimately, organizational authority, well informed by selection expert information, should provide the direction needed to identify the intended uses and outcomes that will shape the selection system. The dilemma described above regarding the impracticality of incorporating all possible valued outcomes in a validation effort should be resolved by leaders with organizational authority, not by experts with job knowledge. The appropriate focus of job expertise is to identify work tasks, behaviors, and requirements that represent the intended outcomes and to help specify the content of measures of those tasks, behaviors, and requirements. The adequacy of the job expertise required for these tasks is critical to the overall claim of validity and, in particular, to the quality of predictor and criterion scores and the prediction rationale described below. (Chapter 6 in this volume provides considerably more detail about the various forms of work analysis.)

Inferences About the Quality of Predictor and Criterion Scores Work experts produce information that can serve as construct and/or content evidence used to develop criterion and predictor measures and, in turn, to evaluate certain qualities of those measures. This expertise allows the work analysis output to be credible, which, in turn, provides a basis for claims of validity. The output of non-experts cannot provide the credibility required for any validity rationale. An important consideration is that different types of work information may require different types of expertise. In particular, different types of expertise are required to produce credible judgments about (a) work content, (b) importance to work, and (c) assessments based on work content. Expertise in work content is required for common methods of work analysis designed to identify and describe the content of work tasks, knowledge, ability requirements, and work behaviors that constitute the full scope of work behavior relevant to the intended outcomes. This content expertise is also required to make evaluative judgments about distinctions between high and low levels of performance or work behaviors that lead to positive or negative outcomes as required, for example, in the development of job knowledge tests (JKTs), work sample tests (WSTs), and situational judgment tests (SJTs), as well as interview content that relies on critical incidents of work behavior that distinguish between successful and unsuccessful performance. Expertise regarding the importance of work behavior, job knowledge, abilities, etc. is a different expertise than work content expertise. In many instances of work analysis, it is rightfully assumed that the same experts have both content and importance expertise. For example, where the importance of work tasks for successful performance depends on a deep understanding of the relationships of all work tasks to work performance outcomes, the expertise about importance is likely to be found in the same people who are experts about work content. However, in other work analysis tasks where the importance of tasks or knowledges depends on an understanding of organizational purposes or strategies more than an understanding of work processes, experts in work content may not be experts in work importance. For example, a call center organization may choose to place high importance on average talk time due to small profit margins, whereas call center representatives may perceive from their own experience that listening skills are more important for customer satisfaction. Where judgments of importance are required, care must be taken to ensure that experts in importance are making the judgments. Finally, job experts are often directly involved in the development of assessment procedures including JKTs, SJTs, WSTs, and interviews. Similarly, job experts also participate in judgment processes used to develop critical test scores such as cut scores. In these cases, the judgments often require some level of expertise about behavioral assessments, which usually does not overlap with content or importance expertise. For example, using content experts to develop job knowledge items requires, among other things, that the content experts develop effective distractors that satisfy a number of specific requirements. The most common method by which assessment expertise is embedded in the assessment development processes used by job content experts is the use of standardized instructions and procedures, with training and process oversight provided by assessment experts. A counterexample can be helpful. Certain cut-score-setting methods rely on job content experts to make judgments about the likely test-taking behavior of job incumbents. But job content expertise generally does not provide expertise in incumbents’ test-taking behavior. Perhaps the most common example is the Angoff method (1971), which requires job content experts to judge the likelihood that job incumbents who are performing at a minimally acceptable level will answer items correctly. While Angoff methods precisely describe what judgment is to be made—the likelihood of answering correctly—applications of the method generally do not select job experts who have expertise about test-taking behavior, nor do they provide training or oversight about such behavior. In cases like this, job experts are making judgments that require an expertise they are unlikely to have. Such judgments provide no evidence supporting any validity claim for the test, nor do they provide a credible foundation for claims about the cut scores.
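For readers unfamiliar with the mechanics, the sketch below shows the simple arithmetic that typically turns Angoff judgments into a panel cut score; the ratings are hypothetical, and the sketch does not address the expertise concerns just raised.

```python
# Illustrative sketch of the arithmetic behind an Angoff cut score: each judge estimates,
# for every item, the probability that a minimally acceptable performer answers correctly;
# a judge's cut score is the sum of those probabilities, and the panel cut score is
# commonly the mean across judges. The ratings below are hypothetical.

# ratings[judge] = list of per-item probability judgments (one value per test item).
ratings = {
    "judge_A": [0.90, 0.75, 0.60, 0.50, 0.80],
    "judge_B": [0.85, 0.70, 0.55, 0.45, 0.75],
    "judge_C": [0.95, 0.80, 0.65, 0.55, 0.85],
}

judge_cut_scores = {judge: sum(probs) for judge, probs in ratings.items()}
panel_cut_score = sum(judge_cut_scores.values()) / len(judge_cut_scores)

# The spread across judges is often reviewed as a rough check on panel consistency.
spread = max(judge_cut_scores.values()) - min(judge_cut_scores.values())

print(judge_cut_scores)
print(round(panel_cut_score, 2), round(spread, 2))
```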

Inferences About the Prediction Rationale The work description stage of design and development can provide the building blocks for a variety of prediction rationales undertaken in Stage 3 below that depend in some fashion on information about important job functions. Synthetic validity refers to a family of validation processes that rely on some judgment or index of similarity between the focal job for which a selection procedure is being used and other jobs for which empirical validity information is available from previous validity studies. Gibson and Caplinger (2007) and Hoffman et al. (2007) describe specific variations of synthetic validity evidence relating to transportability validation and job component validation, respectively, and Johnson (2007) provides an overall evaluation of synthetic validation as an acceptable technique for accumulating evidence of validity. In general, the information about the focal job used to judge its similarity to other referent jobs is generated by job experts in a structured process designed to describe the focal job in a manner that is relevant and comparable to the referent jobs. Furthermore, these same job experts or other similar job experts may also make the later judgments about the degree of similarity between jobs. Appropriate expertise is critical for these components of prediction validation methodologies in the same manner that it is critical to conclusions about the quality of predictor and criterion measures. In all of these cases, the expertise provides the credibility of the information used to describe the links between job content and the content of criterion and predictor measures and the content of other referent jobs. An overall observation about Stage 2 is that it provides the first place in the logical process of design and implementation where expert judgment produces information critical to subsequent inferences about the validity of criteria and predictors.
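As a deliberately simplified illustration of the job-component logic only (not the formal transportability or job component procedures described by Gibson and Caplinger, Hoffman et al., or Johnson), the sketch below weights research-based component validities by judged component importance for a focal job; every component, weight, and validity value is hypothetical.

```python
# Highly simplified sketch of job-component-style reasoning: combine component-level
# validity estimates, weighted by the judged importance of each component for the focal
# job. This illustrates the logic only, not the formal synthetic validation procedures
# cited in the text; all values are hypothetical.

# Importance weights for the focal job from a structured job analysis (hypothetical).
component_importance = {"customer_interaction": 0.50, "data_entry": 0.30, "scheduling": 0.20}

# Research-based validity estimates of one predictor for each component (hypothetical).
component_validity = {"customer_interaction": 0.30, "data_entry": 0.20, "scheduling": 0.15}

def weighted_component_validity(importance, validity):
    """Importance-weighted average of component validities for the focal job."""
    total_weight = sum(importance.values())
    return sum(importance[c] * validity[c] for c in importance) / total_weight

print(round(weighted_component_validity(component_importance, component_validity), 3))
```

The credibility of such an estimate rests entirely on the expertise behind the job descriptions and similarity judgments, which is the point made in the paragraph above.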

Stage 3: Choose/Develop Predictor and Criterion Assessment Processes The purpose of this third category of design and implementation work is to specify and choose and/or develop measures of the intended outcomes and selected predictor procedures. Beyond the typical psychometric requirements for the quality of any measure, several important considerations regarding the roles of predictor and outcome measures in the validation process are described here, organized around the three categories of key validity inferences. We acknowledge that the considerations addressed here are only some of the many validity considerations relating to the quality of the predictor and criterion measures. However, the several specific considerations we describe here are among the most important and/or most contemporary.

Inferences Relating to Intended Uses and Outcomes There are strong direct relationships between measurement quality and the intended uses and outcomes addressed in the first step of the design and implementation process. To a great extent, this is a unidirectional relationship in which earlier decisions about intended uses and outcomes and new information generated from an analysis of the target work directly inform choices about predictor and criteria constructs and assessment processes. This direct influence is an important component of the overall validity rationale for the test scores. Nevertheless, validity considerations are not the only factors in choosing among predictor options given the target outcomes. We provide the following subsection to describe the important balance between considerations of validity, the focus of this chapter, and other considerations more closely related to the effectiveness or utility of a selection system. We believe this broader perspective helps to clarify the narrower scope of validity considerations covered in the rest of this chapter. The Roles of Validity and Usefulness in Choosing Predictors  In choosing predictors for a selection system, it is obviously necessary to consider the expected validity for each potential predictor. This requires information about the outcomes (criteria) that each predictor would be intended to predict and about the accumulated research-based evidence for the validity of the predictor’s scores with respect to similar outcomes. This expectation of validity is a minimum requirement for the choice of a predictor, but this consideration only serves to exclude potential predictors that do not satisfy this minimum requirement. In addition, it is critical to consider the expected usefulness of each “minimally qualified” predictor. Evaluating this expected usefulness requires a consideration of the many complex ways in which the organization elicits valued work behavior from its employees. The selection system is just one of several parts of the whole organizational context that shapes employee work behavior. Other parts include training/development, rewards, compensation and recognition, supervisory coaching/direction, job design and supporting resources and processes, elements of organization culture that affect work behavior, recruiting sources that target particular types of applicants, the organization’s reputation and attractiveness in the employment market, work governance systems such as union contracts and work rules, consequences for negative work behavior, work-life balance, inspiring and enabling leadership, and so on. Both small and large organizations can be remarkably adaptable in the ways in which they facilitate work behavior that leads to desired outcomes. Choosing predictors requires the selection professional to consider the most useful contributions a selection system can make in this broader context. In some cases, this might also include a consideration of whether a selection solution could be a more efficient, less costly solution than the current strategy the organization uses to achieve the desired outcome. For example, a selection system might be a less costly strategy for ensuring minimum job knowledge among new employees than an early job training approach. 
In contrast, a cognitive ability test might add little or no value to a selection system for a computer engineering job in a highly regarded, relatively new, and successful high-tech company that attracts resumes from the top computer engineering graduates in the country. It can be instructive, if not humbling, for a selection professional to investigate the ways young, post-startup companies can be successful without adopting maximally valid, professionally designed selection practices. The point of this comment is that maximum validity is not the selection designer's most important objective in choosing among potential predictors. The predictors that add the most value are the ones that complement (or replace) the existing organizational systems that support effective work behavior. Of course, in many cases—perhaps most cases—selection systems solve problems created by the lack of other systems supporting work behavior, or by ineffective or harmful versions of those systems. In general, though, the purpose of validity evidence supporting scores on any particular selection procedure is to ensure that the specific procedure is influencing the outcome(s) as intended. An overall point about these comments and other similar comments in this chapter is that while considerations of validity represent minimum requirements for professionally developed
selection systems, validity does not define the optimality of selection systems. Rather, organizations typically have a range of important interests that are affected by selection, and the optimality of the designed solution in any particular case is the extent to which these interests are well balanced. Validity information helps inform this balancing effort but does not define the acceptability of the various tradeoffs required to find an optimal balance.

Inferences Relating to the Quality of Predictor and Criterion Scores

This section addresses the following quality of measurement considerations: (a) validity considerations for online unproctored predictor assessment, (b) the meaning of personality scale scores across commercially available instruments, and (c) the absence of bias in criterion measures. We acknowledge that these are just three of many possible measurement quality considerations ranging from basic considerations such as reliability and item characteristics to more nuanced validity considerations such as test taker motivation and fidelity to work tasks/activities. We choose the first two considerations because they are contemporary and the professional research foundations are not settled; we choose to include criterion bias because of its sometimes subtle but critical implications for validity.

Online, Unproctored Predictor Assessment  Perhaps the most significant and rapidly emerging new development in predictor assessment is online, unproctored administration. This emerging assessment methodology raises technical, psychometric, ethical, and professional practice issues that may have consequences for validity. Our overall perspective is that professional practice has evolved more rapidly than has the research foundation about the risk to validity associated with the unproctored feature of this methodology. It is widely acknowledged that unproctored administration has become a common practice (Pearlman, 2009), going so far as to make its way into mobile devices. The International Test Commission (ITC, 2006) has established practice guidelines for computer-based and Internet testing, while the more recent Standards (2014; Standard 10.9, p. 166) remands the practice issues for "technology-based administration" to professional judgment with no identification of issues particularly salient to unproctored online testing. SIOP's Principles (2003) does not specifically address unproctored or online testing but does address professional responsibility for test security and test taker identity. In this Handbook, Chapters 16, 39, and 44 address various aspects of this broad issue. The first author's informal survey in 2013 of seven test publishers' practice of online administration of selection tests revealed large differences. Two of these publishers simply placed their paper-and-pencil assessment tools on an online administration platform with no more than one or two available forms of the tests and simply warned users that unproctored administration may corrupt the meaning of the scores. In contrast, two other publishers had developed online versions of certain tests designed specifically for unproctored administration in a manner that was largely consistent with the ITC guidelines. Key features of these tests were that (a) large banks of pre-tested items were available to enable each test taker to receive a randomized set of items with a psychometric rationale for measurement equivalence; (b) item analysis techniques and web patrols were used to proactively investigate indications of cheating; (c) users were encouraged to have a signed agreement with each test taker to adhere to the administrative instructions; and (d) proctored verification testing was recommended for short-list applicants prior to job offers. In short, while the practice of unproctored online testing is now commonplace, test publishers are widely different in the extent to which they support and encourage users to comply with ITC guidelines.
Beyond important ethical considerations (Pearlman, 2009), the implication for users, both for the design of selection systems and for the validity of scores within those systems, is that some publishers may provide no evidence supporting the psychometric properties or the predictive validity of scores generated by the unproctored, online mode of administration. On the other hand, the accumulating research has generally shown that unproctored online administration leads to little, if any, variation in measurement properties (Vecchione, Alessandri, & Barbaranelli, 2012) and negligible score changes (Lievens & Burke, 2011). Similar results have been reported for measurement
invariance across mobile and non-mobile online administration, with Arthur, Doverspike, Munoz, Taylor, and Carr (2014) and Illingworth, Morelli, Scott, and Boyd (2015) showing invariance across modes for both personality and cognitive tests. However, Arthur et al. (2014) reported lower cognitive scores on mobile than non-mobile devices but similar scores on personality assessments. Overall, potentially problematic effects of a lack of proctoring resulting from increased cheating do not appear to change test measurement structure or score levels for cognitive and personality assessments, although mobile devices may yield lower cognitive, but not personality, scores. In short, the evidence gathered to date about lack of proctoring does not show measurement or score effects that would threaten the validity of the unproctored scores. Consistent with that overall pattern of results, Kaminski and Hemingway (2009) and Delany and Pass (2005) reported no loss of validity in unproctored tests. In contrast, Weiner and Morrison (2009) reported mixed results. Our perspective about the current state of research on the measurement and validity consequences of unproctored online assessment is that some publishers of online versions of selection tests now may have large enough databases that they can provide dependable enough measurement results to allow a local user to generalize those measurement characteristics to their local administration. However, while some publishers may have a significant amount of relevant validity data available from client users of unproctored online testing, the volume of such research published in the selection research literature is not sufficient to support broad general conclusions about the validity of unproctored scores.

Personality Scale Scores  Recent trends in personality assessment research are challenging the confidence users can have in generalizations from previous research about the validity of personality scale scores to their local context. Work over the past two decades on item types that are less susceptible to faking (Stark, Chernyshenko, Drasgow, & White, 2012), ideal-point and dominance measurement models (Stark, Chernyshenko, Drasgow, & Williams, 2006), curvilinear relationships between personality scores and work behaviors (Carter et al., 2014; Le, Oh, Robbins, Remus, & Westrick, 2011), the distinctions between observer and self-report measures (Connelly & Ones, 2010; Oh, Wang, & Mount, 2011), the potential for substantive differences between alternative instruments (Davies, Connelly, Ones, & Birkland, 2015), and the stability of personality within persons over time and contexts (Green et al., 2015) have combined to limit the extent to which broad generalizations about personality validity can be made to local settings without considering specific characteristics of the setting and the personality measurement. In addition, we offer our own informal observation from reviewing dozens of commercially available personality inventories that work-specific tailored, composite scales with similar names in different instruments (e.g., team orientation, service orientation, leadership orientation) cannot be confidently assumed to measure the same facets of personality.
These developments all point to the broad theme that, with regard to the generalizability of the existing research on the validity of personality scores, specificity matters far more than it does for the generalizability of cognitive test scores. The implication of this conclusion is that the selection system designer should carefully evaluate several specific considerations in establishing the local validity rationale for personality assessment. These considerations include (a) the specific workplace behaviors/outcomes to be influenced by personality assessment and the context in which these behaviors/outcomes occur, (b) the opportunity to capitalize on the potential incremental value of other-report assessments, (c) the extent to which a curvilinear (ideal point) model of assessment would be more effective, (d) the advantages of some assessments over others with regard to susceptibility to socially desirable responding, and (e) the extent to which each of several alternative assessment tools fits well with the purposes and contexts associated with the use of personality assessment. An important implication of this increased specificity associated with the choice of and among personality assessments is the greater value (compared to general cognitive ability testing) of local criterion-oriented evidence of predictive validity.
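One low-cost first step in evaluating consideration (c) above is to check whether a quadratic term adds predictive value over a linear term in local data. The following is a minimal, illustrative sketch with invented variable names and effect sizes; it is a simple screen for curvilinearity, not the ideal-point measurement models cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Hypothetical personality scale scores (standardized) and a work outcome
# with an invented inverted-U relationship: performance peaks at a moderate level.
trait = rng.normal(size=n)
performance = 0.40 * trait - 0.15 * trait**2 + rng.normal(scale=0.9, size=n)

def r_squared(y, X):
    """R-squared from an ordinary least squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return 1 - ((y - X1 @ beta).var() / y.var())

linear_r2 = r_squared(performance, trait[:, None])
quadratic_r2 = r_squared(performance, np.column_stack([trait, trait**2]))

print("Linear R2   :", round(linear_r2, 3))
print("Quadratic R2:", round(quadratic_r2, 3))  # a clear gain flags curvilinearity
```

A meaningful gain for the quadratic model would be one piece of local evidence favoring a curvilinear scoring or ideal-point interpretation of the scale.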

Absence of Bias in Criterion Measures  A critical concern in the process of specifying and, if needed, measuring intended outcomes is the possibility of bias in these criterion measures. One possible source of bias is the use of in-place administrative measures. Three common constraints in the validation of selection test scores are (a) the pressure to avoid costs, (b) the pressure to design and implement without delay, and (c) access only to small samples. These common constraints may collectively lead to a consideration of in-place, administrative measures of work behaviors as criteria for the purpose of validating selection test scores. Perhaps the most common of these are administrative appraisal ratings. Unfortunately, it is frequently easy to identify other factors that influence appraisal ratings beyond the target intended outcomes. These other factors may include a lack of supervisor training about the ratings, artificial distribution requirements, a lack of detailed information about actual performance, pressure to avoid low ratings that would trigger the requirement for a formal performance improvement program that supervisors might be reluctant to undertake, and other, unrelated purposes for the rating such as their use in making compensation decisions. All of these potential biasing factors are plausible threats to the meaning and fairness of appraisal ratings. For these reasons, in-place operational appraisal ratings are commonly avoided as criterion measures for validity evidence.

Inferences Relating to the Prediction Rationale

This section describes five types of validity considerations relating to the relationships between predictor scores and criterion scores.

Alignment Between Selection Procedures and Intended Outcomes  The Standards asserts that "intended interpretations" of scores must be validated (p. 11). The implication is that it is meaningful to gather evidence of test score validity only with respect to the outcomes that are intended for that test within the design of the selection system. Consider the example of a selection system that includes a test of cognitive ability, among other things, used to hire new service reps in a call center. It is well understood in personnel selection that cognitive ability predicts task proficiency because cognitive ability enables learning of job knowledge, which is required to perform job tasks proficiently (Hunter, 1986). Here we adopt a definition of task performance from Rotundo and Sackett (2002, p. 67), "behaviors that contribute to the production of a good or the provision of a service," which is also used in a recent meta-analysis of the relationship between general mental ability and non-task performance (Gonzalez-Mule, Mount, & Oh, 2014). At the same time, it is well established that cognitive ability is much less predictive and, for certain behaviors, not predictive of non-task behaviors and performance such as organization loyalty, helping behavior, citizenship behavior, and counterproductive behavior. For these reasons, the design of a service rep selection system might include a cognitive ability test to predict service rep task proficiency and some non-cognitive selection procedure(s) (e.g., personality inventory, biodata inventory, or interview assessment of team experience) to predict the desired contextual work behaviors. In this selection system, the rationales for predictive inference align the cognitive test scores with task proficiency and the non-cognitive scores with the non-task, contextual work behavior(s) of interest. The only relevant validity evidence for the cognitive and non-cognitive scores in this selection system is the evidence that is aligned with these intended interpretations (outcomes). Two significant implications for validation follow from this "alignment" principle. First, in the case of the service rep selection system, unambiguous validity evidence is provided by correlations (or other evidence of a predictive relationship) between cognitive predictor scores and targeted task proficiency measures and between non-cognitive predictor scores and measures of the target contextual work behaviors. Correlations involving either of these predictor scores with some measure of overall performance that includes both task proficiency and contextual behavior represent ambiguous evidence of score validity within this selection system. Correlations with such multidimensional criterion measures are measures of impact or effectiveness more than they are evidence of score validity with respect to the interpretation (outcome) intended
for those scores. Correlations with multidimensional criteria that aggregate criterion measures across different types of outcomes provide ambiguous information about the intended meaning or interpretation of these scores, even though they provide very useful information about the efficacy of the selection procedures.

Complex, Multidimensional Outcomes  Few, if any, work behaviors or outcomes are a function solely of the attribute(s) measured by a single selection procedure. While this condition of heterogeneous multidimensionality likely applies in virtually all cases of in-place metrics, it is certainly more severe in some cases than others. For example, the metric of "improved ROI" may be an important outcome for senior leaders, but it is certainly a highly complex, heterogeneously multidimensional outcome for which a selection test of critical thinking skills might have only a very modest influence. On the other hand, a test-based measure of training mastery may be strongly influenced by general mental ability. This point is being made about outcomes for which the heterogeneous multidimensionality is not a source of bias in the measure of the outcome but is an accurate representation of the causal factors influencing the outcome and the measure of the outcome. But this condition influences the evidence of validity based on measures of such outcomes. Accurate, unbiased evidence based on highly heterogeneous multidimensional outcomes will almost certainly reveal relatively low levels of validity even in the case of a highly accurate conceptual/theoretical prediction rationale. Equally accurate prediction rationales for homogeneous, more singular outcomes will likely reveal relatively high levels of validity. The implication is that the evaluation of validity evidence must take into account the complexity of the outcome as well as its measurement characteristics. Where feasible, the most theoretically meaningful validation strategy would be one in which the generality and heterogeneity of the criterion measures matches that of the predictor in question. As a practical matter, however, this is probably rarely, if ever, realized. (Note, predictor constructs and measures have received considerable attention elsewhere in this volume, especially in Chapters 11–15. In an effort to minimize overlap with those chapters, our focus in this section is on certain selected aspects of the choice and measurement of predictors that are especially relevant to evidence for the predictive validity of these scores.)

Choice Among Available Predictors  The choice of predictors and associated assessment methods is critical for the design and implementation of a selection system and the accompanying validation effort. Fortunately, the profession of personnel selection has advanced to a degree that many high-quality predictor tools are commercially available with accompanying documentation of empirical psychometric and prediction evidence. This is especially the case for cognitive ability tests, personality inventories, and interview development tools. Furthermore, as this chapter is being written, online versions of these types of predictor tools have become commonplace. Overall, the implication is that now, more than before, evidence of validity for specific predictors may well include generalizations to the local setting from evidence accumulated by commercial suppliers, especially the larger consulting houses, as well as from published research.
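The alignment point made earlier in this section can be illustrated with a small, purely hypothetical simulation in which each predictor mainly drives the outcome it is intended to predict. The aligned correlations are unambiguous evidence of the intended interpretations, whereas correlations with the aggregated overall criterion are diluted and, as argued above, ambiguous as validity evidence. All coefficients and variable names below are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

cognitive = rng.normal(size=n)   # cognitive ability score
noncog = rng.normal(size=n)      # non-cognitive (e.g., personality/biodata) score

# Invented causal structure: each predictor mainly drives its own intended outcome.
task = 0.50 * cognitive + 0.10 * noncog + rng.normal(scale=0.85, size=n)
contextual = 0.10 * cognitive + 0.40 * noncog + rng.normal(scale=0.90, size=n)
overall = 0.5 * task + 0.5 * contextual   # aggregated, multidimensional criterion

def r(x, y):
    return np.corrcoef(x, y)[0, 1]

print("cognitive -> task       (aligned)  :", round(r(cognitive, task), 2))
print("noncog    -> contextual (aligned)  :", round(r(noncog, contextual), 2))
print("cognitive -> overall    (ambiguous):", round(r(cognitive, overall), 2))
print("noncog    -> overall    (ambiguous):", round(r(noncog, overall), 2))
```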
Incremental Contributions to Overall Criterion Prediction  The selection designer often has an interest in providing an overall evaluation of the validity of a set of predictor scores within a selection system. A frequent strategy used in the selection profession to describe the validity of a set of predictors is to regress a measure of overall performance on those multiple predictors and report the increment in the multiple R2 attributable to each predictor. Schmidt and Hunter (1998) provide a well-known, high-level example of this type of analysis. Although this analysis can have useful heuristic value, it has two significant limitations as a form of validity evidence for the specific predictors. First, this approach relies on the construction of the overall, complex criterion measure that is a weighted composite of all the outcomes that were explicitly intended for each of the predictors. Second, even if the overall measure captures all intended outcomes, this regression analysis produces coefficients (multiple Rs) that are influenced by the relative weighting and interrelatedness of the multiple outcomes in the construction of the overall criterion measure. The net consequence of these two limitations is that the meaning of multiple Rs has more in common with utility analysis than with validity analysis, where utility
analysis focuses on the magnitude of relationships and validity analysis focuses on the meaning of relationships. Multiple R and the increase in R (or R2) do not provide unambiguous evidence of the extent to which scores on each of the predictors are predictive of the outcomes that predictor was designed to predict (meaning). This is not a criticism of this type of regression analysis. Rather, it is a cautionary note that this type of evidence has a different meaning than evidence of the relationship between a predictor and its intended outcomes. For example, the usual result that personality predictors contribute less variance than cognitive predictors to overall performance measures does not necessarily imply that personality scores are less valid predictors of their intended outcomes than cognitive scores are predictive of their intended outcomes. (Of course, we know from separate validity evidence that personality scores generally do correlate less with the work outcomes they are conceptually expected to predict than do cognitive predictors.) One implication of this comment is that the question of incrementalism with respect to selection procedures within a selection system is, at root, a question of value or utility and is not an unambiguous indicator of validity. One can easily imagine validity evidence being used to determine whether one predictor is more or less valid with respect to its intended outcomes than another predictor is of its own intended outcomes. But as soon as the question is about the incremental value of one predictor with respect to another, the question fundamentally hinges on, among other things, the relative value to the organization of the two sets of intended outcomes, which is independent of the question of validity. (A brief numerical sketch of this point appears at the end of this section.)

Generalizing Validity Conclusions from Previous Research Criteria to Criteria in the Local Setting  Given common constraints on (a) the cost and time available to design and implement selection systems, (b) limited local sample sizes for local empirical studies, and (c) the challenges of accurately measuring the intended outcomes, an increasingly common and effective validation effort relies on generalizing conclusions from previous validity research to the local setting. Indeed, it seems likely that at least some part of the validation rationale for every local selection system relies on some generalization from previous research conclusions to the local context. Beyond the ordinary psychometric requirements for criterion measures, we make three points here about conclusions about local criterion measures based on previous research conclusions. First, the constructs captured by local criterion measures will be specific to the local context in virtually all cases. For example, even though "turnover" is a generic label for a common type of criterion measure, the meaning of a local measure of turnover—as a criterion to be predicted—is likely to be highly contextual given the particular factors causing local turnover. Similarly, a properly instructed supervisory rating of local, overall job performance will capture the facets of job performance important in the local job (Campbell, 2015). Second, the inference that conclusions about criteria from previous research apply to local criteria will be based on the conceptual similarity between the constructs underlying previous research criteria and local criteria and will not be based on any type of sampling rationale.
Third, it is highly likely that the inference of conceptual similarity between previous and local criteria will be based on expert judgment rather than on some quantifiable comparison algorithm or on the use of identical measurement procedures. Summary conclusions about the criteria represented in meta-analytic research studies that include several local studies will often be at a different level of description than the local criteria. Conclusions from such cross-study research efforts will typically classify or categorize the studied criteria in an attempt to reach a more general conclusion. As a result, the inference that previous validity conclusions for categories of criteria can be generalized to a local criterion requires the expert judge to evaluate whether the locally specific criterion constructs and measures are similar enough to the constructs and measures captured by research-based categories of criteria. The expertise involved in this judgment should include knowledge about the general principles of inference and measurement as well as knowledge about the substantive meaning of the criterion constructs and measures in the previous research and in the local setting. We make this point here to underscore the importance of expert judgment in reaching a conclusion about test score validity in a local selection system. The role of expert judgment in generating validity evidence is well-established in professional guidance and in practice. The Standards frequently cite and endorse the role of expert judgment as a source of validity evidence. (See, for example,
Standards 11.3 and 11.5 and their accompanying comments). Gibson and Caplinger (2007) and Hoffman, Rashkovsky, and D'Egidio (2007) describe the roles of expert judgment in a variety of structured methods such as job component validation for drawing inferences about local validity from previous validity evidence. We single out the role of expert judgment here because it is likely to take on even greater importance in establishing the local validity evidence where a local criterion study is not feasible.
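Returning to the earlier point about incremental contributions to overall criterion prediction, the following minimal sketch with invented data shows the hierarchical regression typically reported. As argued above, the resulting increment in R2 against an aggregate criterion speaks to utility and relative weighting more than to the validity of either predictor for its own intended outcomes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
cognitive = rng.normal(size=n)
noncog = rng.normal(size=n)
# Invented aggregate criterion combining outcomes intended for both predictors.
overall = 0.45 * cognitive + 0.25 * noncog + rng.normal(scale=0.85, size=n)

def r_squared(y, X):
    """R-squared from an ordinary least squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return 1 - ((y - X1 @ beta).var() / y.var())

r2_step1 = r_squared(overall, cognitive[:, None])
r2_step2 = r_squared(overall, np.column_stack([cognitive, noncog]))

print("R2, cognitive only      :", round(r2_step1, 3))
print("Delta R2, adding noncog :", round(r2_step2 - r2_step1, 3))
```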

Stage 4: Prescribe Score Usage

A critical consideration in the design of a selection system is the manner in which test scores will be used in the process of selecting among the applicants. The irony is that, as critical as this design component is for the effectiveness of the selection system, with a few exceptions it has relatively little consequence for the type of validity evidence to be gathered for the predictors. In this section we first consider three types of score usage that may have implications for the nature of the appropriate validity evidence, and then we briefly describe a systematic approach to the design of selection systems that is consistent with an overall theme in this chapter that validity is a critical building block but does not define the optimality of a selection system. Selection designers have many options available to them regarding the manner in which test scores may be used: (a) scores may be used in a compensatory or non-compensatory fashion; (b) scores may be used in a wide variety of ways to establish selection standards in the form of cut scores and/or score ranges associated with specific decisions; (c) scores may be used to screen applicants in a particular sequence; (d) scores may be used to inform individuals who make the selection decisions with certain guidance accompanying the score information; and (e) scores may be weighted to control their relative influence on selection decisions. Of all the ways scores may be used, only three of these ways have implications for needed validity evidence. The nature of the required validity evidence will be influenced by the choices about (a) weighting predictor scores in some form of compensatory scoring, (b) the manner in which scores are used to inform selection decision makers, and (c) the sequence in which scores are used to affect selection decisions. These are described below in the section Inferences Relating to Quality of Predictor and Criterion Scores and in the section on Inferences Relating to the Prediction Rationale. However, before addressing these three key issues, we first address considerations relating to intended uses and outcomes.

Inferences Relating to Intended Uses and Outcomes

None of the myriad ways of using scores is likely to have implications for validity evidence relating to intended uses (e.g., hiring vs. training admissions) or intended outcomes. This is because the manner of score use has no necessary consequences for the intended uses or outcomes. Most decisions about score use are driven by considerations of operational efficiency or feasibility and do not alter the validity rationale required to support the intended uses and outcomes from the scores. A common example is the choice between use of some form of cut score–based strategy rather than some alternative such as top-down selection. In this case, the common professional practice is to comparatively evaluate these alternative uses by analyzing their implications for cost, efficiency, diversity, risk of adverse impact, and, possibly, other consequences. But this comparative evaluation ordinarily does not assume or estimate different validities for the same selection procedure used in these different ways. (Note, we acknowledge here, in anticipation of points made below, that a persuasive argument can be made that the validity of dichotomized test scores, as would be used in effect by certain cut score strategies, should be estimated separately from the validity of scores used in their original, more continuous scale form, as might be the case with top-down selection. This is an argument that scores should be validated as used to make selection decisions. We will revisit this argument below in the cases of composite scores and scores used to inform hiring manager decisions.)
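As an illustration of the kind of comparative evaluation described above, the following sketch (with entirely hypothetical subgroup sizes and a hypothetical subgroup mean difference) contrasts the adverse impact (selection-rate) ratio implied by top-down selection of the top 20% with that implied by a lower fixed cut score. In practice the same comparison would also weigh cost, efficiency, and other consequences, and, as noted, it would not assume different validities for the two uses.

```python
import numpy as np

rng = np.random.default_rng(3)
n_a, n_b = 4000, 1000          # hypothetical subgroup sizes
d = 1.0                        # hypothetical subgroup mean difference (SD units)

scores_a = rng.normal(loc=0.0, size=n_a)
scores_b = rng.normal(loc=-d, size=n_b)
pooled = np.concatenate([scores_a, scores_b])

def impact_ratio(threshold):
    """Selection-rate (adverse impact) ratio of group B relative to group A."""
    rate_a = (scores_a >= threshold).mean()
    rate_b = (scores_b >= threshold).mean()
    return rate_b / rate_a

top_down_cut = np.quantile(pooled, 0.80)   # top-down fill of the top 20%
fixed_cut = np.quantile(pooled, 0.50)      # a lower, fixed cut score screen

print("Impact ratio, top-down 20% :", round(impact_ratio(top_down_cut), 2))
print("Impact ratio, 50th %ile cut:", round(impact_ratio(fixed_cut), 2))
```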

Inferences Relating to Quality of Predictor and Criterion Scores

Scores Used to Inform Selection Decision Makers' Judgments  A common use of test scores is to present them to selection decision makers in some organized fashion that helps the decision maker integrate the meaning of the score information with other applicant information to form a judgment about the applicant's overall quality. Setting aside the well-established point that human judgment tends to suboptimize the aggregation of valid score information, this use of scores would require new validity evidence supporting the manager's judgment if the constructs captured in that judgment are different in some fashion from the original scores and, therefore, require a different prediction rationale. We recognize that this is a debatable claim, but we argue that in those settings in which the selection designer chooses to use test scores in this fashion, this design decision often rests on a belief (held by either the designer or the decision maker) that the decision maker has additional relevant information that improves on the test scores and makes better selection decisions. In this case, the decision maker's judgment about the applicant represents a new measure of different predictor constructs and, therefore, requires separate evidence of validity beyond the evidence for each contributing test score. We also recognize that this conclusion can be problematic as a practical matter because, often, the only output capturing the manager's judgment is the set of selection decisions. This precludes any validity evidence that depends on differentiation among the selected applicants. In those cases where the decision maker's judgment cannot be captured in an overall rating or ratings of specific applicant attributes, it will probably not be feasible to gather empirical evidence of predictive validity.

Inferences Relating to the Prediction Rationale

Here we address the implications for validity of two types of score usage—composite scores used in compensatory approaches and scores used in sequence in multiple hurdles approaches. In both cases, we describe ways in which these two methods of score use change some feature of the prediction rationale and, as a result, change the type of validity evidence required to support scores used in that manner.

Score Weights to Produce Composite Scores  Compensatory scoring requires that composite scores be arithmetically derived from individual test scores. Composite scores will require additional validity information beyond that required of the individual components where these composite scores represent a new measure relating to intended outcomes differently than the component test scores used to form the composite. This will occur when the component weights used to form the composite are intended to be a measure of the relative importance of the predictor constructs for successful job performance. In this case, the composite includes additional information—the importance weights—beyond the information in the separate components, so the new information is justified based on its job relevance. As a result, validity evidence for this type of composite is supported by some rationale for the job relevance of the weights. Similarly, setting aside the issue of weighting, it is conceivable that a composite score has different meaning than the simple sum of component scores if the attributes represented by the component scores are interactive in such a way that particular profiles of component scores have predictive meaning unique to the particular profiles. For example, considering personality attributes, if applicants who are 1.5 SDs above average on narcissism are predicted to be poor performers regardless of other attributes but applicants who are 1.5 SDs above average on assertiveness are predicted to be poor performers to the extent that they lack other important attributes, then any linear combination of component scores would likely be less predictive of performance than the component scores used individually. (Of course, in this case, a composite score would be a poor choice for this very reason, so the question of whether it would warrant separate validity evidence would be moot.)
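A minimal sketch of the mechanics under discussion: standardized predictor scores combined with importance weights taken from a work analysis. The predictors, score values, and weights below are hypothetical; the validity question raised above concerns the job-relevance rationale for the weights, not the arithmetic itself.

```python
import numpy as np

def weighted_composite(scores, weights):
    """Standardize each predictor column, then apply importance weights."""
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    w = np.asarray(weights, dtype=float)
    return z @ (w / w.sum())

# Hypothetical applicants x predictors: cognitive test, personality scale, interview
raw_scores = np.array([
    [28, 52, 4.0],
    [35, 47, 3.5],
    [31, 58, 4.5],
])

# Hypothetical importance weights derived from a work analysis
print(weighted_composite(raw_scores, weights=[0.5, 0.3, 0.2]))
```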

On the other hand, the predictive rationale underlying a composite of test scores requires no new theoretical or conceptual consideration or new validity evidence where the weights are not based on any job-related consideration and where the component attributes do not have an interactive relationship to the target outcome. It is worth noting that this same rationale may be applied to other ways of using scores such as cut scores. Cut scores are rarely, if ever, based solely on a prediction rationale linking particular scores to a job relevance interpretation such as a minimally acceptable level of performance. Rather, they are often based on a set of considerations relating to cost, manageability, optimized effectiveness, group differences, and the like. For this reason, particular cut scores rarely, if ever, rely on a claim of validity other than the fundamental claim that cut scores are based on valid scores.

Sequential Use of Scores  The implication for validity evidence of the decision to use scores in a sequence is a technical point relating to range restriction. This point applies to quantitative evidence of validity in the form of correlations between test scores and outcomes among those who are selected into that stage. At each stage, a particular test score or composite of test scores is used to make selection decisions about which applicants move to the next stage. Consider an example of a two-stage sequence in which, at Stage 1, 50% of the applicants are screened out based on a cognitive ability test score. At Stage 2, the surviving applicants are given a personality assessment of Openness and 50% of them are screened out based on their Openness score. The surviving applicants are then given job offers. Subsequently, the new hires' relevant work outcomes are measured and correlations are computed between the cognitive scores and the outcomes intended from the cognitive test and between Openness scores and the outcomes intended from the Openness inventory. Because these two correlations were computed only among the new hires, they are artificially range-restricted estimates because the ranges of cognitive scores and Openness scores among the new hires are both less than the ranges of cognitive scores at Stage 1 and Openness scores at Stage 2. Both estimates should be corrected for range restriction but in different ways. The correction of the cognitive validity coefficient should be with respect to the range of cognitive scores at Stage 1, whereas the correction for the Openness coefficient should be with respect to the range of Openness scores at Stage 2. (Note, in this example the Openness scores in Stage 2 may have been indirectly restricted by the selection on cognitive scores in Stage 1. This is because Openness and cognitive ability typically are positively correlated, but this indirect restriction between stages is immaterial to the method used to correct the restricted validity coefficient for Openness at Stage 2. For Openness, the restricted validity coefficient computed among new hires is corrected for the range of Openness scores observed at Stage 2 regardless of the role of indirect restriction due to screening on cognitive scores. However, this Openness validity coefficient corrected for the range restriction among new hires should be interpreted as an estimate of the predictive validity of Openness scores where preselection on cognitive ability has taken place.
This estimate of the validity of Openness scores is not generalizable, without further correction, to a different local setting in which Openness scores are used for selection from an applicant pool that has not been prescreened on cognitive ability.) Other than these three specific uses of test scores, we believe that no other manner of score use affects the type of validity evidence appropriate to the test scores.
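A minimal sketch of the corrections described in the two-stage example above, using the standard univariate (Thorndike Case 2) formula for direct range restriction, r_c = r(S/s) / sqrt(1 + r^2[(S/s)^2 - 1]). All numerical values are hypothetical; the cognitive coefficient is corrected against the Stage 1 applicant SD and the Openness coefficient against the Stage 2 SD, with the new-hire SDs as the restricted values in both cases.

```python
import math

def correct_direct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case 2 correction for direct range restriction on the predictor."""
    u = sd_unrestricted / sd_restricted
    return (r_restricted * u) / math.sqrt(1 + r_restricted**2 * (u**2 - 1))

# Hypothetical values for the two-stage example in the text.
r_cog_hires, sd_cog_stage1, sd_cog_hires = 0.22, 1.00, 0.62
r_open_hires, sd_open_stage2, sd_open_hires = 0.15, 0.95, 0.60

print(round(correct_direct_range_restriction(r_cog_hires, sd_cog_stage1, sd_cog_hires), 2))
print(round(correct_direct_range_restriction(r_open_hires, sd_open_stage2, sd_open_hires), 2))
```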

Stage 5: Prescribe Governing Policies and Rules

Virtually all selection systems are shaped and governed by a set of policies and rules. These typically address many facets of the selection system, ranging from applicants' access to the selection process, management of applicant data, and the permissibility of waivers and exemptions, to testing processes such as applicants' option to retake a test, accommodations in the testing process, and permitted modes of administration. Detailed descriptions of such policies and rules are presented elsewhere (e.g., Kehoe, Brown, & Hoffman, 2012; Roe, 2005; Tippins, 2002, 2012)
and in Chapter 9 of this volume. The focus in this chapter is on those policies and rules that can have implications for the meaning of and evidence of validity for test scores. We briefly consider the validity implications for policies relating to retesting, mode of administration, equivalencies, accommodations for disabilities, test preparation, and exemptions and waivers. Each of these practices, except for exemptions and waivers, can affect test scores. As a result, it is important to consider whether they trigger the need to gather different types of validity evidence. In our analyses of the validity implications of these policy-driven practices, we rely on the distinction between standard and non-standard administrations of tests in the selection process and acknowledge that feasibility and practical impact are major considerations.

Inferences Relating to Intended Uses and Outcomes

The two policies addressed here affect intended outcomes but do not require any additional type of validity evidence. These two policies are about (a) equivalencies and (b) exemptions and waivers. We describe these here to document examples of selection system policies that do not warrant unique validity evidence.

Equivalencies  Some selection systems establish "equivalency" rules or standards by which some other attribute of an applicant may be treated as equivalent to a test result, and the applicant is given the same status that would have been earned from the test result. For example, a personality score for Service Orientation (the referent test) is used as a requirement for several different types of customer service jobs. External applicants and internal employees may apply for these jobs. The organization has a policy that internal applicants who have a supervisor's rating of, say, 3 or higher on a standard organizational competency of "Works Well with Others" will be assigned a "passing" score result on the Service Orientation measure. In this example, applicants have been granted a score status on a selection test they have not taken because some other result—in this case, a performance rating—is interpreted by the organization as having a comparable predictive value for an intended outcome. The question we raise here is whether the set of test scores used to validate the referent test in this example should include "awarded" test scores assigned to certain applicants via the equivalency policy. In our view, no, these awarded scores are an administrative vehicle for giving applicants a qualification status based on other considerations relating to perceived acceptability, efficiency, and fairness as well as a plausible professional judgment of comparability of meaning. The awarded statuses are not intended to be interpreted as having the same meaning as the referent test scores but are intended to be interpreted as having similar enough meaning to warrant giving the applicant the awarded qualification status. In this case, the estimate of the validity of the referent test scores would not be more accurate by including the awarded score results in the local validity study.

Exemptions and Waivers  Equivalency policies describe multiple ways in which a test score result may be awarded, including completing the test. In contrast, exemption and waiver policies describe certain circumstances in which an assigned authority may decide that an applicant is not required to satisfy one or more standard job qualification requirements. For example, consider a selection process for account executives that requires satisfactory performance on a sales assessment work sample test. An organization may choose to exempt from this sales assessment requirement applicants who have been deliberately recruited from a competitor's account executive role. In our experience, exemption and waiver policies are quite common, even if they are quiet or implicit or have a much less relevant rationale than the account executive example in which previous experience was a justification for the exemption. While the liberal use of exemptions and waivers can harm (or help) the effectiveness of the whole selection system, they do not have any implications for the appropriate validity evidence for the exempted selection procedure.
That is, no validity claim about the selection procedure is strengthened by gathering evidence that the exempted applicants are as likely as high-scoring applicants to produce the intended outcome. Indeed, once the exemption authority has been established, it isn’t necessarily the case
that exemption decisions must be based on expected performance. Other personal or organizational considerations may be the bases for exempting certain applicants from a standard selection requirement. In short, there is little to be gained by having the validation rationale for a selection test take into account those instances in which the test requirement is waived.

Inferences Relating to Quality of Predictor and Criterion Scores

Three common and important policies about retesting, test preparation, and mode of administration are known to have direct consequences for test scores but do not depend on any change in the predictive rationale for the target tests. The sections below explore the implications of these policies for unique validity evidence that may be warranted.

Retesting  It is a common practice in selection systems to allow applicants to retake selection tests, guided by organization policies. For example, a typical requirement is that applicants may retake a test only after waiting for a prescribed period of time, which may vary by type of test. Considerable research has investigated the effects of retesting on cognitive and non-cognitive test scores. For cognitively loaded tests, evidence shows that second occasion scores are approximately .25–.50 SDs higher than first occasion scores (e.g., Hausknecht, Halpert, Di Paolo, & Moriarty Gerrard, 2007; Lievens, Buyse, & Sackett, 2005), but findings are mixed regarding criterion validity differences and measurement equivalence between first and second scores (e.g., Lievens, Buyse, & Sackett, 2005; Lievens, Reeve, & Heggestad, 2007; Van Iddekinge, Morgeson, Schleicher, & Campion, 2011; Villado, Randle, & Zimmer, 2016). The very large and persuasive body of empirical research (e.g., Schmidt & Hunter, 1998) showing (a) substantial predictive validity for cognitive tests with respect to job proficiency criteria and (b) low variability in predictive validity across a wide range of jobs and settings is generally regarded as compelling evidence that professionally developed cognitive tests will have substantial predictive validity with respect to proficiency criteria in local settings. As a practical matter, this inference of predictive validity also is relied upon to assure selection designers that retest effects are not important sources of invalidity, even if some studies have shown changes in validity and measurement structure with second test scores.

For (non-cognitive) personality tests, the retesting issue is quite different theoretically, empirically, and in practice. The dominant theoretical consideration is about the susceptibility of self-reported personality scores to be intentionally influenced by impression management, or faking, as it is frequently called. Significant research has been conducted to understand and estimate the effect size of faking as a source of construct invalidity (e.g., Hogan, Barrett, & Hogan, 2007; Hough, Eaton, Dunnette, Kamp, & McCloy, 1990; Ones, Viswesvaran, & Reiss, 1996). The effects of faking are a dominant consideration in research on personality retesting for two reasons. First, a common research paradigm that investigates retest scores is one in which the study participants who have retaken a personality assessment are doing so because they were not hired following their first attempt. Second, unlike cognitive retesting, there is little theoretical rationale that could attribute score changes across short retest intervals to development or growth in the target personality attributes. Rather, the more compelling theoretical rationale to possibly explain personality score changes is that the test takers are motivated by their initial failure to adopt a different model of the personality attributes presumed to be desired by the employer. As a result, the most salient factor in attempting to explain personality test-retest score differences is faking.
Unlike the case for cognitive tests, it appears to be generally accepted that changes in personality scores from first to second occasions reflect faking and random error, both of which introduce invalid variance. Two threads of empirical evidence are important as sources of evidence for the validity of retest personality assessments. One thread (e.g., Hogan, Barrett, & Hogan, 2007; Hough, Eaton, Dunnette, Kamp, & McCloy, 1990; Ones, Viswesvaran, & Reiss, 1996) focuses on differences in predictive validity correlations between first occasion scores and second occasion scores. This thread of research has found that, overall, the predictive validities for first scores and second scores are similar. The second thread focuses on the differences between the distributions of the
first and second occasion scores. In particular, this line of research has investigated the extent to which the selection decisions are different when an applicant pool includes only first occasion scores as compared to applicant pools that include both first occasion scores and retest scores (e.g., Walmsley & Sackett, 2013). In general, this research has shown that the inclusion of higher retest scores, which are not uncommon (Hausknecht, 2010), can significantly change who is hired. This result, if generalizable to local settings, can be taken to mean that the policy allowing retesting for personality assessments may reduce the effectiveness or efficiency of a selection system, even if original and second scores are approximately equally valid predictors, by increasing the percentage of new hires who benefited from a score increase that is construct invalid to some extent.

The practical consequences of the theoretical and empirical state of personality assessment are complex. Design decisions about personality assessments vary considerably, although it is clear from the large number of commercially available tools that personality is a common component of current selection systems. The first author's personal experience indicates that (a) retesting is probably commonplace, (b) corrections for faking and/or retest effects are probably not commonplace, and (c) local job conditions and important outcomes are probably important factors, more so than with cognitive tests, in the decisions about which scales and associated instruments are used. It would be difficult to gather unambiguous validity evidence in support of decisions (a) and (b), so pragmatic considerations relating to applicant perceptions and satisfaction and to process efficiency and cost are likely the most important considerations. Design decisions relating to (c), however, may be informed by the considerable published evidence (from publishers and from professional research efforts) about the specificity of personality scales' predictive validity, other than the evidence for Conscientiousness, which generalizes across a wide range of jobs and work behaviors.

In general, for both cognitive and non-cognitive tests, the degree of uncertainty about test-retest score equivalence and predictive validity is regarded as an acceptable risk in the design of selection systems for two reasons. First, for specific issues like test-retest practices, current research conclusions are not clear enough, except for expected differences in cognitive scores, to generalize research-based conclusions to a particular setting. Second, it is probably not feasible in most cases to conduct local test-retest predictive validity studies.

Test Preparation  The Standards, Standard 8.0, asserts that test takers have the right to, among other things, "adequate information to help them properly prepare for a test." This standard is based on the guiding principle that test takers have a right to be informed and on the underlying belief that proper preparation enables test takers' test performance to more accurately reflect their standing on the tested attribute. The net effect of this professional standard is that selection system designers are unlikely to seek out validity evidence relating to specific test preparation policies, with the exception that such policies should avoid inappropriate preparation practices.
While we are not aware of surveys describing the current state of practice with regard to test preparation, it is likely to be common, especially for high-volume tests for which there is a "market" for test preparation courses and materials. Test preparation ranges from basic information about the test purpose and item format(s) to access to practice versions of similar tests and to more detailed instructions about processes for finding answers to items and opportunities to practice with feedback (Sackett, Burris, & Ryan, 1989). We also anticipate that the increased use of online test administration, especially mobile applications, will lead to a significant reduction in test preparation resources that are available on the same media and platforms as the test. (Note, changes in the frequency with which mobile devices are used may occur rapidly. However, recent large studies of unproctored online test usage reported that only 1%–2% of online test takers used mobile devices (Arthur, Doverspike, Munoz, Taylor, & Carr, 2014; Illingworth, Morelli, Scott, & Boyd, 2015).) Test preparation has much in common with retesting both conceptually and empirically. Indeed, studies of test preparation and practice effects often treat retesting as a form of practice, especially in the case of cognitive tests. Studies of the effects of test preparation on cognitive test scores show overall very similar effects to retesting (Hausknecht, Halpert, Di Paolo, & Moriarty Gerrard, 2007; Kulik, Bangert-Drowns, & Kulik, 1984; Lievens, Buyse, Sackett, & Connelly, 2012; Lievens, Reeve, & Heggestad, 2007). Indeed, retesting in the form of multiple practice
tests is considered a form of test preparation. Nevertheless, studies also show that type and amount of preparation/coaching can affect scores differently (Kulik, Bangert-Drowns, & Kulik, 1984; Powers, 1986). For non-cognitive assessments such as personality inventories, test preparation may consist only of examples of the item types and formats to reduce the novelty of these inventories and clarify the meaning of the instructions. Test preparation for interviews supports an entire cottage industry and is often out of scope for the employer. Rather, various support organizations such as schools, private sector companies, unions, and search firms are far more likely than employers to offer interview preparation programs for job seekers. Similarly, preparation for physical ability testing is often supported by applicant support groups such as schools and unions. In practice, test preparation policies are likely to consider a much wider range of possible practices than are considered with regard to retesting. Perhaps the two most common considerations regarding test preparation are cost and appropriateness. Generally, cost considerations are treated as having little, if any, relevance to issues of validity, even though it is quite likely that more costly test preparation programs such as extensive study materials and access to practice tests may lead to larger score increases than less costly programs such as pre-assessment instructions about test taking and exposure to sample items (Powers, 1986). Appropriateness considerations, on the other hand, are often regarded as being directly related to score validity, especially for skill, knowledge, and ability tests. In these cases, it is generally regarded as inappropriate to provide parallel tests as practice forms and to provide item-specific instructions that teach the knowledge, skill, or ability being tested. Such test preparation strategies that are so "close" to the operational test and items are inappropriate because they presumably lead to artificial, invalid score increases that reflect newly learned test/item-specific information without enhancing the target construct. This is personnel selection's version of education's "teaching to the test" problem.

Mode of Administration  Many of the considerations relating to the consequences of online test administration for test validity were reviewed above. The one point to be made here in this discussion of policy implications for validity is that traditionally the science-oriented practice of selection testing placed great emphasis on a consistent, standardized mode of administration. Indeed, this emphasis is sustained in the current Standards. For example, Standard 6.1 regarding Test Administration reads, in part, "Test administrators should follow carefully the standardized procedures for administration and scoring specified by the developer." It is difficult to imagine how unproctored online test administration can even remotely comply with this Standard. First, there is no test administrator with an implied enforcement role to ensure adherence and consistency. Second, the understood meaning of standardization, that all test takers complete the test under the same, beneficial administration conditions, is conspicuously violated.
In this now frequent context, a new burden falls to test publishers and users to demonstrate that a conspicuous lack of standardization does no harm to test score validity, but it is difficult to conduct research about unstandardized practices with sufficient controls to justify clearly prescribed, generalizable conclusions. The profession is left with the approach recently seen in which online score results are investigated in huge data sets. Arthur, Doverspike, Munoz, Taylor, and Carr (2014) reported analyses of more than 3.7 million applicants who completed online tests. Illingworth, Morelli, Scott, and Boyd (2015) reported analysis of more than 935,000 applicants who completed online tests. This approach invites the user to conclude that whatever the online administration circumstances are in the local setting, they are captured by the mega databases that report negligible score change and measurement invariance across modes of online administration. However, the actual profile of variations in administration represented by the mega “sample” cannot be specified because the information isn’t available. For example, what percentage of test takers in the mega samples attempted to cheat? What percentage were not who they said they were? What percentage were online savvy? What percentage had completed online tests before? What percentage weren’t motivated to perform well? And so on. The sheer size of these mega samples does not ensure that any particular local applicant pool will produce similar results because it is impossible to know what characteristics of the local applicant pool


and their use of online administration options matter with regard to score meaning and predictive validity but may have been completely obscured by such large samples. A consequence of this current status is that a greater burden is placed on the selection system designer to evaluate plausible local threats to the meaning and predictive validity of scores from online administration.

Inferences Relating to the Prediction Rationale

Policies relating to applicants with disabilities are addressed in this section because the most salient feature of these policies is the extent to which the federal regulations that implement the Americans with Disabilities Act (1990) (ADA) require employers to ignore the significant loss of prediction rationale likely caused by such accommodations. It is likely that the large majority of medium to large organizations have established some policy relating to testing applicants with disabilities. These policies may cover a variety of aspects of the selection process, including the manner in which disabilities are disclosed, the organization’s responsibility to consider reasonable accommodations in the selection process, and the bases for deciding what actions to take in response to applicants’ requests. Further, the ADA obligates employers to make individualized decisions about accommodations. The result of all these considerations is that organizations may offer some form of individualized accommodation to one or more aspects of the selection process. This often leads to a set of circumstances in which tests are administered using accommodated processes, and scores are recorded and relied on for selection decision making despite the likelihood that little, if any, information—including validity evidence—is available to support the rationale that the scores predict the desired outcomes. The considerations for validity are unique, with virtually no parallel in employment selection.

Accommodations for Disabilities  Under ADA and the ADA Amendments Act of 2008 (ADAAA), employers have an obligation to consider and provide reasonable accommodations to disabled applicants in the work setting as well as in the selection process. Campbell and Reilly (2000) and Guttman (2012) provide detailed descriptions of employers’ legal obligations and of common and accepted practices for accommodating disabilities in the selection process. In addition, Campbell and Reilly summarize the scant empirical evidence about the effects of disability accommodations on test scores and predictive validity. We do not reiterate those summaries here. Rather, we focus on the central validity issue posed by the legal obligations ADA imposes for disability accommodations. That validity question is whether accommodated test scores are predictive of the disabled person’s performance of “essential job functions,” while eliminating an artificial bias in unaccommodated scores that would lead to underprediction of such performance. For example, does an accommodation for visual impairment that uses large print materials both eliminate an underprediction bias and yield test scores that are predictive of performance of essential functions? What validity evidence can be available to employers to evaluate this validity question? Certainly, it is difficult to justify generalizing previous validity conclusions from standardized administration processes to the local scores from non-standardized accommodated administrations. And local empirical validation studies are almost always infeasible simply because of the low numbers of applicants who disclose the same particular disability and receive the same accommodation. As a result, employers will rarely have the opportunity to rely on meaningful empirical evidence either from previous studies or from local studies.
In virtually all cases, employers must rely on the expert judgment of the test developer (or some expert surrogate for the developer) to evaluate the theoretical and empirical bases for concluding that accommodated scores both eliminate bias and are predictive of essential function performance. It is important to note here that this reliance on expert judgment is not unique to disability accommodations but, in fact, is quite common where persuasive local empirical studies cannot be conducted. Synthetic validity strategies and generalizations from meta-analytic studies rely on the same expert judgment about the persuasiveness of the theoretical and/or empirical rationale that local scores will be unbiased and predict local performance.



Stage 6: Manage and Maintain the Selection System

This section addresses four elements of managing and maintaining the selection system: (a) training selection administration staff for operational knowledge and skills, (b) auditing for compliance with policies and processes and for indicators of threats to effectiveness and validity, (c) adapting the operation of the system to changing needs or circumstances, and (d) maintaining current professional expertise. Kehoe, Mol, and Anderson (Chapter 9 in this volume) provide a more broadly focused summary of managing for sustainability over time. We note that the four elements of these maintenance practices described here can have implications for all three categories of validity inferences. Nevertheless, it is useful to align these four elements with the categories of inference they are most likely to influence.

Inferences Relating to Intended Uses and Outcomes

Adapting  Our primary point with regard to readiness to adapt is that even though selection systems are rooted in stable individual differences that reliably shape important work behaviors, a variety of organizational and professional changes may create the need to change a selection system. A compelling example is the rapid impact of the availability of online assessment tools to provide less expensive and faster selection processes. While our comments regarding adapting are much less prescriptive than for training and auditing, we suggest the following management strategies for recognizing, evaluating, and adapting to organizational and professional changes that point to improvements in an existing selection system:

1. Periodic reviews with unit-level HR leaders can be a very effective strategy for establishing access to information about organization changes that might have implications for selection.
2. To the extent possible, selection leaders should capitalize on the data described in the Auditing section below to become the owners and producers of periodic reports to organization leaders that convey the linkage between work behavior outcomes and selection processes. Treating selection scores as metrics of an organizational process, and linking them to outcome measures, positions the selection scores and the selection leader as credible and valuable sources of information about important outcomes.
3. Understanding validity as a means to an end, which is a major theme of this chapter, is a professional perspective that is likely to create more openness to view validation processes as a large toolkit of methods and procedures, some of which are more suited to current local circumstances than others.

Inferences Relating to Quality of Predictor and Criterion Scores

Training  The training of selection system staff and role players is important to maintain validity because the quality of scores and the appropriateness of selection decisions depend on the successful performance of several peripheral functions, including applicant recruiting, interviewing, administration and scoring of tests, the processes of properly relying on selection scores to help make the intended selection decisions, the processes of creating and maintaining effective applicant management systems, and the management of accurate databases of score results and other applicant information. We describe three recommendations to optimize the benefits of training for the maintenance of valid and effective selection systems:

1. Provide training for all functions that are critical to a well-managed selection process.
2. Develop training processes that require trainees to demonstrate minimally effective skills in order to be certified in the target function. Certify successful trainees.
3. Require that all critical functions are performed only by people who are training-certified.


We recognize that these three requirements may collectively be onerous, and the organization may need to adopt an approach that allows it to gradually achieve this objective, but it is important to acknowledge their role in sustaining selection validity and effectiveness.

Auditing  An auditing function is central to the management and maintenance of selection validity and effectiveness. Perhaps the greatest operational threat to test validity over time is the gradual loss of discipline and adherence to the process requirements for effective and valid selection procedures. Coupled with staff training, an effective auditing function can help maintain disciplined adherence to appropriate processes in two ways. First, auditing signals to the staff and stakeholders that disciplined adherence is critical. Second, auditing provides information about key indicators of process adherence and selection outcomes. Overall, effective auditing should provide at least three types of information about selection systems: (a) periodic evaluation of score properties, (b) continual confirmation of process adherence, and (c) periodic data about the achievement of the intended outcomes.
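As one concrete illustration of the first type of auditing information, the sketch below shows how a periodic evaluation of score properties might be automated. It is only a minimal example; the indicators, function name, and drift tolerances are our own illustrative assumptions rather than practices prescribed in this chapter.

```python
from statistics import mean, stdev

def audit_score_properties(baseline_scores, current_scores, cut_score,
                           mean_shift_tol=0.25, sd_change_tol=0.20):
    """Compare a current applicant cohort's test scores against the
    validation-period baseline and flag drift that warrants review.
    The indicators and tolerances here are illustrative only."""
    b_mean, b_sd = mean(baseline_scores), stdev(baseline_scores)
    c_mean, c_sd = mean(current_scores), stdev(current_scores)

    mean_shift = (c_mean - b_mean) / b_sd   # cohort mean shift in baseline SD units
    sd_change = abs(c_sd / b_sd - 1.0)      # relative change in score variability
    base_pass = sum(s >= cut_score for s in baseline_scores) / len(baseline_scores)
    curr_pass = sum(s >= cut_score for s in current_scores) / len(current_scores)

    return {
        "mean_shift_sd_units": round(mean_shift, 2),
        "sd_relative_change": round(sd_change, 2),
        "pass_rate_baseline": round(base_pass, 2),
        "pass_rate_current": round(curr_pass, 2),
        "flag_for_review": abs(mean_shift) > mean_shift_tol or sd_change > sd_change_tol,
    }
```

A report like this is most useful when it is produced on a fixed schedule and archived, so that gradual drift, and not just abrupt change, becomes visible.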

Inferences Relating to the Prediction Rationale

The effective management of selection systems influences the soundness of prediction rationales primarily through the effort to sustain a high level of expertise in the selection professionals who support the organization. Expertise has two primary roles in maintaining valid and effective selection systems. First, professional expertise is a frequent source of evidence supporting claims of validity by providing expert judgments about job tasks and requirements. Second, professional expertise about the research foundations and professionally developed tools and resources may recognize new solutions to organization priorities and needs.

SUMMARY OF PART 2

Part 2 of this chapter proposes a six-stage process for the design and implementation of selection systems and describes significant considerations in each stage that can have implications for validity inferences about (a) the intended uses and outcomes, (b) the quality of predictor and criterion measures, and (c) the prediction rationale. The six stages are described in a logical order from (1) specifying the intended uses and outcomes to ensure the outcomes are amenable to a selection system based on stable individual differences in work behavior, (2) describing the work in a manner that identifies work content and its importance to inform decisions about locally relevant predictors and criteria and supports inferences (decisions) that conclusions from previous research apply locally, (3) choosing and/or developing predictor and criterion measures based on a clear understanding of the likely causal linkage between test scores and work behaviors and outcomes, (4) prescribing a manner of using predictor scores that capitalizes on the causal linkage while accommodating local constraints, (5) prescribing the policies and rules that govern the selection system to ensure its validity and usefulness across all conditions, and (6) managing and maintaining the selection system to control or adapt to the dynamic factors that can change validity and usefulness. At each stage of work, information is generated and inferences (decisions) are made that strengthen or weaken the claim of predictive validity. The overall claim of selection system validity can be represented as a conclusion based on the aggregation of many diverse sources of empirical and expert evidence accumulated across the design and implementation stages of work. We believe this way of describing validation contributes to our professional understanding of the meaning of validity, the distinction between validity and effectiveness, and validity’s role in the selection professional’s effort to provide useful methods for achieving individual and organizational success.



CONCLUSIONS

This chapter explored our professional understanding of selection validity and examined how the design and implementation of selection systems generates information needed to support, ultimately, the claim of predictive validity. Several key conclusions emerged:

• Evidence supports the validity of test scores when it supports the claim that intended outcomes follow from the use-specific interpretation of test scores. Other evidence about the effectiveness and value of selection systems may be critically important and, possibly, more important, but only evidence relating to the meaning of the test scores supports claims of validity.
• Decisions made throughout the design and implementation process are often inferences made by the selection expert that empirical evidence gathered in previous validity research efforts generalizes to the local setting.
• Factors affecting score validity are dynamic and must be managed with regular auditing processes.
• Expert judgment is a critical source of evidence for the local validity of test scores and, in some cases, may be the primary source.
• Not all decisions about selection tests depend on or produce validity evidence. For example, decisions about the manner in which test scores are used (e.g., cut scores, advisory input, banding) and decisions about governing policies such as exemptions and waivers often do not require validity evidence but, instead, must be supported by evidence that the expected outcomes will be consistent with organization requirements such as speed, cost, efficiency, user satisfaction, and degree of improvement in intended outcomes.

In addition, we explored the distinction between evidence of selection validity and evidence of selection effectiveness. Utility analysis is perhaps the most common evidence of effectiveness at the individual level of analysis. We also applied this distinction to the relationship between selection predictor scores aggregated to an organization level and organization-level outcomes, where any causal linkage is ambiguous, corrupted, or obscured by the effects of other organizational factors. (See Chapter 5 in this volume for a description of the importance of individual-level selection and the manner in which it influences organization-level outcomes.) In this situation, the relationship between organization-level measures of predictor scores and organization-level outcomes can often be the most important, ultimate objective for an organization’s selection system, but this relationship does not have the same meaning as a validity relationship. One reason for addressing this distinction is to be clear that validity is not necessarily the only or even the most important objective for the selection professional. This chapter demonstrates that, while validity is a unitary concept, the types of evidence supporting validity and the variety of design and implementation decisions that influence or are dependent on validity are hardly unitary.
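Because utility analysis is cited here as the most common evidence of effectiveness at the individual level, a brief sketch may help fix ideas. The function below implements the familiar Brogden-Cronbach-Gleser utility model; the parameter values in the usage example are hypothetical and are not drawn from this chapter.

```python
def bcg_utility(n_selected, n_applicants, tenure_years, validity,
                sdy_dollars, mean_std_score_selected, cost_per_applicant):
    """Brogden-Cronbach-Gleser estimate of the dollar gain from using a
    selection procedure rather than selecting at random.

    validity                -- criterion-related validity (r) of the procedure
    sdy_dollars             -- SD of job performance expressed in dollars
    mean_std_score_selected -- average standardized predictor score of hires
    """
    gain = (tenure_years * n_selected * validity
            * sdy_dollars * mean_std_score_selected)
    testing_cost = n_applicants * cost_per_applicant
    return gain - testing_cost

# Hypothetical example: 50 hires from 500 applicants, 2-year tenure,
# r = .30, SDy = $20,000, hires average 1.0 SD above the applicant mean,
# $50 of assessment cost per applicant -> about $575,000.
print(bcg_utility(50, 500, 2, 0.30, 20_000, 1.0, 50))
```

Note that the validity coefficient enters the utility estimate directly, yet the estimate also depends on quantities (the dollar value of performance differences, selectivity, tenure, and testing cost) that have nothing to do with the meaning of test scores, which is one way to see why effectiveness evidence and validity evidence are related but distinct.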

NOTE 1. Throughout this chapter we use the term “scores” to generically refer to observed manifestations of a measurement procedure; thus, scores might be ratings, behavioral observations, test scores, etc.

REFERENCES American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. American Psychological Association (joint committee). (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin, 51, 201–238. Americans with Disabilities Act, 42 U.S.C. § 12101 (1990) et seq. ADA Amendments Act, 42 U.S.C. § 12101 (2008) et seq.


Jerard F. Kehoe and Paul R. Sackett Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508–600). Washington, DC: American Council on Education. Arthur, W., Doverspike, D., Munoz, G. J., Taylor, J. E., & Carr, A. E. (2014). The use of mobile devices in high-stakes, remotely delivered assessments and testing. International Journal of Selection and Assessment, 22, 113–123. Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494. Campbell, J. P. (2015). All general factors are not alike. Industrial and Organizational Psychology, 8(3), 428–434. Campbell, J. P., & Knapp, D. J. (Eds.) (2001). Exploring the limits in personnel selection and classification. Mahwah, NJ: Erlbaum. Campbell, W. J., & Reilly, M. E. (2000). Accommodations for persons with disabilities. In J. F. Kehoe (Ed.), Managing selection in changing organizations (pp. 319–367). San Francisco, CA: Jossey-Bass. Carter, N. T., Dalal, D. K., Boyce, A. S., O’Connell, M. S., Kung, M., & Delgado, K. M. (2014). Uncovering curvilinear relationships between conscientiousness and job performance: How theoretically appropriate measurement makes an empirical difference. Journal of Applied Psychology, 99, 564–586. doi: 10.1037/ a00334688 Connelly, B. S., & Ones, D. S. (2010). Another perspective on personality: Meta-analytic integration of observers’ accuracy and predictive validity. Psychological Bulletin, 136, 1092–1122. http://doi.org/10.1037/ a0021212 Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (pp. 221–237). Washington, DC: American Council on Education. Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–300. Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (pp. 621–694). Washington DC: American Council on Education. Davies, S. E., Connelly, B. L., Ones, D. S., & Birkland, A. S. (2015). The general factor in personality: The “Big One”, a self-evaluative trait, or a methodological gnat that won’t go away? Personality and Individual Differences, 81, 13–22. Delany, T., & Pass, J. (2005). Design and validation of an unproctored cognitive ability tests. Paper presented at the 20th annual conference of the Society for Industrial and Organizational Psychology, Los Angeles, CA. Gibson, W. M., & Caplinger, J. A. (2007). Transportation of validation results. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 29–81). San Francisco, CA: Jossey-Bass. Green, J. P., Bradshaw, P., Kelly, E. D., Zhu, M., Dalal, R. S., & Meyer, R. D. (2015). Personality strength: Operationalization and relationship with within-person performance variation. Paper presented at the 30th annual conference of the Society for Industrial and Organizational Psychology, Philadelphia, PA. Gonzalez-Mule, E., Mount, M. K., & Oh, I.-S. (2014). A meta-analysis of the relationship between general mental ability and nontask performance. Journal of Applied Psychology, 99, 1222–1243. doi: 10.1037/a0037547 Guion, R. M. (1974). Open a new window: Validities and values in psychological measurement. American Psychologist, 29, 287–296. Guttman, A. (2012). Legal constraints on personnel selection decisions. In N. Schmitt (Ed.), The Oxford handbook of personnel assessment and selection (pp. 
686–720). Oxford, UK: The Oxford University Press. Hausknecht, J. P. (2010). Candidate persistence and personality test practice effects: Implications for staffing system management. Personnel Psychology, 63, 299–324. doi: 10.1111/j.1744–6570.2010.01171.x Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty Gerrard, M. O. (2007). Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology, 92, 373–385. doi: 10.1037/0021–9010.92.2.373 Hoffman, C. C., Rashovsky, B., & D’Egidio, E. (2007). Job component validity: Background, current research, and applications. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 82–121). San Francisco, CA: Jossey-Bass. Hogan, J., Barrett, P., & Hogan, R. (2007). Personality measurement, faking and employment selection. Journal of Applied Psychology, 92, 1270–1285. doi: 10.1037/0021–9010.92.5.1270 Hough, L. M., Eaton, N. K., Dunnette, M. D., Kamp, J. D., & McCloy, R. (1990). Criterion-related validities of personality constructs and the effect of response distortion on those validities. Journal of Applied Psychology, 75, 581–595. Hunter, J. E. (1986). Cognitive ability, cognitive aptitudes, job knowledge, and job performance. Journal of Vocational Behavior, 29, 340–362. Illingworth, A. J., Morelli, N. A., Scott, J. C., & Boyd, S. L. (2015). Internet-based, unproctored assessments on mobile and non-mobile devices: Usage, measurement equivalence, and outcomes. Journal of Business and Psychology, 30, 325–343.


Validity Considerations of Selection Systems International Test Commission. (2006). International guidelines on computer-based testing and Internetdelivered testing. International Journal of Testing, 6, 143–172. Johnson, J. W. (2007). Synthetic validity: A technique of use (finally). In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence. (pp. 122–158). San Francisco, CA: Jossey-Bass. Kaminsky, K. A., Hemingway, M. A. (2009). To proctor or not to proctor: Balancing business needs with validity in online assessment. Industrial and Organizational Psychology, 2, 24–26. Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education and Praeger. Kehoe, J. F., Brown, S., & Hoffman, C. (2012). The life cycle of successful selection programs. In N. Schmitt (Ed.), The Oxford handbook of personnel selection and assessment (pp. 903–938). Oxford, UK: The Oxford University Press. Kulik, J. A., Bangert-Drowns, R. L., & Kulik, C. C. (1984). Effectiveness of coaching for aptitude tests. Psychological Bulletin, 95, 179–188. Landy, F. J. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41, 1183–1192. Le, H., Oh, I., Robbins, S. B., Remus, I., Holland, E., & Westrick, P. (2011). Too much of a good thing: Curvilinear relationships between personality traits and job performance. Journal of Applied Psychology, 96, 113–133. doi: 10.1037/a0021016 Lievens, F., & Burke, E. (2011). Dealing with the threats inherent in unproctored internet testing of cognitive ability: Results from a large-scale operational test program. Journal of Occupational and Organizational Psychology, 84, 817–824. Lievens, F., Buyse, T., & Sackett, P. R. (2005). Retest effects in operational selection settings: Development and test of a framework. Personnel Psychology, 58, 981–1007. doi: 10/1111/j.1744–6570.2005.00713.x Lievens, F., Buyse, T., & Sackett, P. R., & Connelly, B. S. (2012). The effects of coaching on situational judgment tests in high-stakes selection. International Journal of Selection and Assessment, 20, 272–282. doi: 10.1111.j.1468–2389.2012.00599.x Lievens, F., Reeve, C. L., & Heggestad, E. D. (2007). An examination of psychometric bias due to retesting on cognitive ability tests in selection settings. Journal of Applied Psychology, 92, 1672–1682. Loevinger, J. (1957). Objective tests as instruments of psychological theory [Monograph No. 9]. Psychological Reports, 3, 635–694. McCormick, D. J. (2001). Lowering employee illness and rates of on-the-job accidents by screening for mental ability. Paper presented at the 16th annual conference of the Society for Industrial and Organizational Psychology, San Diego, CA. McPhail, S. M. (Ed.) (2007). Alternative validation strategies: Developing new and leveraging existing validity evidence. San Francisco, CA: John Wiley and Sons. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: American Council on Education and Macmillan. Oh, I., Wang, G., & Mount, M. K. (2011). Validity of the observer ratings of the five-factor model of personality traits: A meta-analysis. Journal of Applied Psychology, 96, 762–773. doi: 10.1037/a0021832 Ones, D. S., Vishwesvaran, C., & Reiss, A. D. (1996). The role of social desirability in personality testing for personnel decisions: The red herring. Journal of Applied Psychology, 81, 660–691. Pearlman, K. (2009). 
Unproctored internet testing: Practical, legal and ethical concerns. Industrial and Organizational Psychology, 2, 14–19. Powers, D. E. (1986). Relations of test item characteristics to test preparation / test practice effects: A quantitative summary. Psychological Bulletin, 100, 67–77. Roe, R. A. (2005). The design of selection systems: Context, principles, issues. In A. Evers, N. Anderson, & O. Smit (Eds.), Handbook of personnel selection (pp. 73–97). Oxford, England: Blackwell. Rotundo, M., & Sacket, P. R. (2002). The relative importance of task, citizenship, and counterproductive performance to global ratings of job performance: A policy capturing approach. Journal of Applied Psychology, 87, 66–80. doi: 10.1037/0021–9010.87.1.66 Sackett, P. R., Burris, L. R., & Ryan, A. M. (1989). Coaching and practice effects in personnel selection. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (pp. 145–183). Oxford, England: John Wiley & Sons. Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274. Schmitt, N., & Landy, F. J. (1993). The concept of validity. In N. Schmitt, W. C. Borman, & Associates (Eds.), Personnel selection in organizations (pp. 275–309). San Francisco, CA: Jossey-Bass. Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author.


Jerard F. Kehoe and Paul R. Sackett Stark, S., Chernyshenko, O. S., Drasgow, F., & White, L. A. (2012). Adaptive testing with multidimensional pairwise preference items: Improving the efficiency of personality and other noncognitive assessments. Organization Research Methods, 15, 463–487. doi: 10.1177/1094428112444611 Stark, S., Chernyshenko, O. S., Drasgow, F., & Williams, B. A. (2006). Examining assumptions about item responding in personality assessment: Should ideal point methods be considered for scale development and scoring? Journal of Applied Psychology, 91, 25–39. doi: 10.1037/0021–9010.91.1.25 Tippins, N. (2002). Issues in implementing large-scale selection programs. In J. W. Hedge & E. D. Pulakos (Eds.), Implementing organization interventions: Steps, processes, and best practices (pp. 232–269). San Francisco, CA: Jossey-Bass. Tippins, N. (2012). Implementation issues in employee selection testing. In N. Schmitt (Ed.), The Oxford handbook of personnel selection and assessment (pp. 881–902). Oxford, UK: The Oxford University Press. Van Iddekinge, C. H., Morgeson, F. P., Schleicher, D. J., & Campion, M. A. (2011). Can I retake it? Exploring subgroup differences and criterion-related validity in promotion retesting. Journal of Applied Psychology, 96, 941–955. doi: 10.1037/a0023562 Vecchione, M., Alessandri, G., & Barbaranelli, C. (2012). Paper-and-pencil and web-based testing: The measurement invariance of the Big Five tests in applied settings. Assessment, 19, 243–246. doi: 10.1177/1073.191111419091 Villado, A. J., Randle, J. G., & Zimmer, C. U. (2016). The effect of method characteristics on retest score gains and criterion-related validity. Journal of Business and Psychology, 31, 1–16. doi: 10.1007/s.10869–015–9408–7 Walmsley, P. T., & Sackett, P. R. (2013). Factors affecting potential personality retest improvement after initial failure. Human Performance, 26, 390–408. doi: 10.1080/08959285.2013.836196 Weiner, J. A., & Morrison, J. D. (2009). Unproctored online testing: Environmental conditions and validity. Industrial and Organizational Psychology, 2, 27–30.


4 SITUATIONAL SPECIFICITY, VALIDITY GENERALIZATION, AND THE FUTURE OF PSYCHOMETRIC META-ANALYSIS JAMES M. LEBRETON, JEREMY L. SCHOEN, AND LAWRENCE R. JAMES1

Most psychologists would agree that a well-designed employment test should yield evidence of criterion-related validity when tested against a well-measured criterion. If “validity generalization” (VG) were limited to this inference, then there would be no reason for this chapter. Indeed, the authors of this chapter subscribe to this inference, but VG is not limited to this inference. Instead, VG inferences are often extended to suggest that the magnitude of test validities is invariant across situations—that is, situations do not influence the magnitude of criterion-related validity coefficients. This line of thinking is aptly captured in quotes such as the following:

The evidence from these two studies appears to be the last nail required for the coffin of the situational specificity hypothesis. (Schmidt, Hunter, Pearlman, & Rothstein-Hirsch, 1985, p. 758)

The cumulative pattern of findings . . . provides strong support for the hypothesis that there is essentially no situational variance in true validities for classic ability constructs used for selection on similar jobs. (Schmidt et al., 1993, p. 11)

these studies found that, on average, all variance across settings (i.e., companies) was accounted for by artifacts. . . . All these pieces of interlocking evidence point in the same direction: toward the conclusion that, for employment tests of cognitive abilities, the situational specificity hypothesis is false. (Hunter & Schmidt, 2004, pp. 404–405)

Beginning in 1977, Schmidt and Hunter began publishing empirical evidence discrediting the situational specificity hypothesis. Specifically, they demonstrated that much of the variability in validity coefficients across studies was due to random sampling error. (McDaniel, Kepes, & Banks, 2011, p. 497)

There is little question that (psychometrically well-developed) tests of knowledge, skills, abilities (i.e., KSAs), and personality traits generally predict (psychometrically well-developed) measures of organizationally relevant criteria. In this sense, the criterion-related validity evidence for these tests can be said to generalize. Whether the validity for a given type of predictor (e.g., critical intellectual skills) against a given class of criterion (e.g., job performance) is generally invariant across situations (i.e., cross-situationally consistent) is another issue. The cross-situational


consistency hypothesis (i.e., VG) has endured a long history of theoretical and empirical debate, the roots of which can be traced, in part, to the person-situation debate (cf. Buss, 1979; Cronbach & Snow, 1977; Epstein, 1979; Hogan, 2009; Kenrick & Funder, 1988; Mischel & Peake, 1982). The emergence of meta-analysis as a popular method for testing the consistency of predictive validities across a set of separate studies (e.g., situations) accelerated and transformed the debate into one of a more quantitative and methodological nature. Basically, meta-analysis made it possible for organizational researchers to apply increasingly sophisticated quantitative tools to assess the predictive validity of test scores and, more importantly, the consistency of these estimates over studies. Within applied psychology, the most commonly used variant of meta-analysis has been the VG analysis, which more recently has adopted the label of psychometric meta-analysis (PMA; Borenstein, Hedges, Higgins, & Rothstein, 2009; Hunter & Schmidt, 2004). In the typical VG analysis, investigators first provide estimates of the criterion-related validity coefficients obtained (for the same, or similar, predictor-criterion variable pairs) from different study samples. Investigators then examine the cross-situational variability in those criterion-related validity coefficients (cf. Pearlman, Schmidt, & Hunter, 1980; Salgado et al., 2003; Schmidt & Hunter, 1977; Schmidt, Hunter, & Raju, 1988; Schmidt et al., 1993). Unlike the meta-analytic techniques embraced in nearly all other areas of science, the VG technique seeks to estimate variability in validity coefficients after first adjusting (or “correcting”) the observed coefficients for statistical artifacts (e.g., measurement error, range restriction). These corrections are believed to remove irrelevant noise from the system, thus enhancing the comparability of these estimates across different situations (Schmidt & Hunter, 1977; Schmidt et al., 1993).

However, VG procedures are not without their critics (Algera, Jansen, Roe, & Vijn, 1984; James, Demaree, & Mulaik, 1986; James, Demaree, Mulaik, & Ladd, 1992; Kemery, Mossholder, & Roth, 1987), many of whom have questioned whether the findings based on VG procedures may have yielded an inaccurate picture of both the magnitude and consistency of predictor-criterion pairs. Like these critics, we also have concerns with the conclusions reached using VG procedures, and it is in that spirit that this chapter was written.

We have two basic goals for this chapter. First, we discuss the logic and rationale underlying the VG and Situational Specificity (SS) hypotheses and, based on the results from the extant literature, conclude that the SS hypothesis is alive and well in applied psychology. Second, we summarize five key concerns related to VG studies and the PMA procedures upon which they are based.
These concerns include (1) the formulas that are used in VG analyses fail to explicitly (i.e., empirically) incorporate measured situational variables (e.g., authority structure, interpersonal interactions, social climate), despite evidence that such variables often moderate the types of predictor-criterion relationships cited in the VG literature (i.e., Ghiselli, 1959, 1966, 1973; Peters, Fisher, & O’Connor, 1982); (2) the formulas that are the basis of PMA, and thus the basis for all VG analyses, include critical (untested) assumptions, the tenability of which has been called into question (James et al., 1992; Köhler, Cortina, Kurtessis, & Gölz, 2015); (3) many VG studies have relied on dubious estimates of statistical artifacts (e.g., estimates of criterion reliability) when estimating corrected validity coefficients, and these estimates may have resulted in biased inferences about both mean validities and the variance (or lack thereof) around those means (cf. DeShon, 2003; LeBreton, Burgess, Kaiser, Atchley, & James, 2003; LeBreton, Scherer, & James, 2014; Murphy & DeShon, 2000; Putka & Hoffman, 2015; Viswesvaran, Ones, & Schmidt, 1996, 2005); (4) the appropriateness of inferences based on corrected (or partially corrected) correlation coefficients (LeBreton, Scherer, & James, 2014); and finally (5) reliance on meta-analytically derived effect sizes to guide selection decisions, especially given the negative evaluations of VG by U.S. courts (Biddle, 2010; Landy, 2003). Thus, our chapter is structured as follows. First, we provide a brief introduction to the logic and rationale underlying VG. Second, we summarize the evidence suggesting that SS is alive and well in applied psychology. Third, we offer a review and critique of the procedures of PMA that form the basis for VG analyses. Finally, we conclude with general recommendations relevant for employment selection research and practice.



VALIDITY GENERALIZATION VERSUS SITUATIONAL SPECIFICITY

Validity Generalization Validity studies conducted in the mid- to late-20th century offered modest hope for the utility of personality and KSAs as predictors of crucial outcome variables (i.e., job performance) in applied settings. Of particular interest were validity coefficients for cognitive ability tests, which tended to be modest in magnitude and often inconsistent across job types (Ghiselli, 1959, 1966, 1973). As a result of this inconsistency, many psychologists adhered to the basic hypothesis that the criterion-related validity evidence for any given selection test was situationally specific (Murphy, 2000; Schmidt & Hunter, 1998). Stated alternatively, in order to determine the extent to which inferences drawn from test scores were related to outcomes (e.g., job performance; Binning & Barrett, 1989), psychologists must understand the subtle differences or constraints that differed across situations (e.g., specific/unique job requirements identified as part of a job analysis, differential reward structures that might influence performance, culture or climate of the organization, work characteristics, etc.; James et al., 1986, 1992; Murphy, 2000). In addition, the belief that criterion-related validities were situationally specific was consistent with the more general movement toward situational specificity of behavior (including Person by Situation interaction and contingency models of behavior; cf. Endler & Magnusson, 1976; Grote & James, 1991; House & Mitchell, 1974; Kerr, Schriescheim, Murphy, & Stogdill, 1974; Mischel, 1968; Vecchio, 1987; Vroom, 1973; Wright & Mischel, 1987). VG developed out of a desire to try to increase the precision (i.e., accuracy) of validity coefficient estimates for similar or identical predictor-criterion pairs. Like other forms of meta-analysis, VG is based on a sample-size weighted average effect size. Unlike other forms of meta-analysis, VG moves beyond a simple summary/description of effect sizes to draw inferences about the consistency (or lack thereof) in the observed effect sizes. More specifically, whereas a traditional meta-analysis describes/summarizes the overall relationship between a predictor and criterion for a set of samples, VG goes one step further to infer the degree to which additional factors contribute to the consistency of this relationship across samples. Typically, VG analyses are undertaken separately for different job types or job classes (i.e., clerical, mechanical, managerial; Schmidt & Hunter, 1977). A variety of factors may contribute to the inconsistency of criterion-related validity across samples. The factors to be considered are statistical artifacts, such as unreliability of predictor and criterion scores, range restriction in predictor scores, and sampling error (i.e., Hunter & Schmidt, 2004; Schmidt & Hunter, 1977; Schmidt et al., 1988; Schmidt et al., 1993). Thus, a VG analysis may be thought of as the inferential variant of the traditional, descriptive meta-analysis (Murphy, 2000). The primary (but not the only) assumptions underlying a VG analysis include that (a) the true validity for a particular predictor-criterion pair is equal across populations but that (b) statistical artifacts that differ across studies (e.g., predictor and/or criterion reliability, range restriction, and sampling error) distort and restrict the magnitude of the observed validity. 
In an attempt to identify and effectively model the impact of these biasing statistical artifacts, the following structural equation—in which Greek symbols represent population parameters—is generally used in VG analysis:

$$ r_k = \rho_k \, \alpha_k^{1/2} \, \varphi_k^{1/2} \, \xi_k + e_k, \qquad (1) $$

where ρk is the population correlation between the unrestricted true scores for the predictor and criterion in situation k (i.e., the true validity); rk represents the observed validity coefficient—i.e., the correlation between a predictor Xk and a criterion Yk for a random sample of nk individuals from population (i.e., organization, situation) k; αk is the unrestricted population reliability for the criterion in situation k; φk is the unrestricted population reliability for the predictor in situation k; ξk reflects the degree of range restriction in the predictor in situation k; and ek is the sampling error inherent in rk. Once the statistical artifact population estimates are inserted into the equation and ρk is estimated for each k, the next step is to estimate the variance among the ρk, referred to as V(ρ), and


determine whether or not this estimated variance coefficient is small enough to justify generalization of the validity across situations. The estimate for V(ρ) is calculated based on the following estimation equation (see James et al., 1992):

$$ \hat{V}(\rho) = \left[ V(r) - V(\dot{r}) \right] / \Pi, \qquad (2) $$

where V̂(ρ) is the estimate of variance in population (true) validities; V(r) is the between-situation variance in the observed validities; V(ṙ) is the expected between-situation variance in validities associated with statistical artifacts; and Π is an additional correction for mean reliabilities and range restriction across situations. In essence, the amount of variance attributable to statistical artifacts is subtracted from the total observed variance, and the remaining variance, termed “residual variance,” represents the true variance in validities that is unaccounted for (i.e., by statistical artifacts). A primary step of VG is to determine whether or not cross-situational consistency in validities has been achieved. Basically, if the estimate of V(ρ) is approximately equal to 0, then the ρk are deemed to be truly invariant across situations (i.e., generalizable), whereas an estimate of V(ρ) greater than 0 is used as evidence consistent with a potential situational moderator. To this end, two rules have emerged that elaborate on the term “approximately equal” by imposing predetermined, theoretically justified critical values, and V(ρ) must not extend above these values in order for cross-situational consistency to be established. One rule is the “75% Rule,” in which 75% of the total variance in validity estimates (i.e., V(r)) must be accounted for by statistical artifacts to effectively rule out the SS hypothesis, suggesting a construct is a universal and invariant predictor of the criterion of interest. The remaining variance in validity estimates (i.e., 25% of the variance in validity estimates) is attributed to additional (unmeasured) artifacts (i.e., clerical and programming errors; Hermelin & Robertson, 2001; Schmidt & Hunter, 1977). The importance of the 75% rule for informing decisions about VG versus SS was noted by Hunter and Schmidt (2004):

If 75% or more of the variance is due to artifacts, we conclude that all of it is, on the grounds that the remaining 25% is likely to be due to artifacts for which no correction has been made. (p. 401)
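To make the mechanics of Equations 1 and 2 and the 75% Rule concrete, the following is a minimal sketch of a simplified, "bare bones" computation. It uses the standard Hunter-Schmidt sampling error variance formula, treats range restriction as the simple multiplicative factor ξ of Equation 1, corrects only the mean validity rather than the variance, and relies on hypothetical study inputs and artifact values; operational PMA software applies more elaborate corrections than this.

```python
import math

def bare_bones_vg(studies):
    """Sample-size-weighted summary of observed validities plus the variance
    expected from sampling error alone (a simplified, 'bare bones'
    Hunter-Schmidt-style decomposition).

    `studies` is a list of (n, r) pairs: sample size and observed validity.
    """
    total_n = sum(n for n, _ in studies)
    mean_r = sum(n * r for n, r in studies) / total_n                  # weighted mean validity
    var_r = sum(n * (r - mean_r) ** 2 for n, r in studies) / total_n   # observed variance, V(r)
    avg_n = total_n / len(studies)
    var_e = (1 - mean_r ** 2) ** 2 / (avg_n - 1)                       # sampling error variance
    residual = max(var_r - var_e, 0.0)                                 # analogue of V(r) - V(r_dot)
    pct = 100.0 if var_r == 0 else min(100.0, 100.0 * var_e / var_r)
    return {
        "mean_r": mean_r,
        "var_observed": var_r,
        "var_sampling_error": var_e,
        "var_residual": residual,
        "pct_variance_from_artifacts": pct,
        "passes_75_percent_rule": pct >= 75.0,
    }

def correct_mean_validity(mean_r, ryy, rxx, xi):
    """Invert Equation 1 for the mean validity using mean artifact values:
    criterion reliability (ryy), predictor reliability (rxx), and a simple
    multiplicative range restriction factor (xi)."""
    return mean_r / (math.sqrt(ryy) * math.sqrt(rxx) * xi)

# Hypothetical validity coefficients for illustration only.
studies = [(120, 0.05), (85, 0.38), (200, 0.12), (60, 0.45), (150, 0.30)]
summary = bare_bones_vg(studies)
rho_hat = correct_mean_validity(summary["mean_r"], ryy=0.60, rxx=0.85, xi=0.90)
# With these numbers, sampling error accounts for well under 75% of the
# observed variance, so a VG analyst would go looking for moderators.
```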

This rule has been criticized for being insensitive to potential situational moderators (James et al., 1986) and, while considered outdated, it is still used alone or in combination with more advanced techniques (Geyskens, Krishnan, Steenkamp, & Cunha, 2009). Given concerns over the 75% Rule, a second rule based on formal statistical tests of the heterogeneity of V(ρ) has emerged. This rule emphasizes the development of a “credibility interval,” in which the lower bound of the validity distribution is compared to a minimal validity coefficient value (e.g., .00, .01, .10). If the credibility interval does not contain the minimal value, one can say with a certain amount of confidence that the validity of the scores will generalize to other populations. Researchers frequently use 80% and 90% credibility intervals to draw inferences regarding the transportability of a validity coefficient to other situations (the concept of transportability is discussed later).
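The credibility interval rule can be sketched in the same spirit. The helper below assumes that corrected validities are approximately normally distributed across situations and that an estimate of their standard deviation (the square root of the residual, artifact-corrected variance) is available; both the normal approximation and the example values are simplifying assumptions rather than prescriptions from the chapter.

```python
def credibility_interval(mean_rho, sd_rho, level=0.80):
    """Two-sided credibility interval around the mean corrected validity,
    using a normal approximation for the distribution of true validities."""
    z = {0.80: 1.282, 0.90: 1.645, 0.95: 1.960}[level]
    return (mean_rho - z * sd_rho, mean_rho + z * sd_rho)

# Example: mean corrected validity of .25 with an SD of .11.
lower, upper = credibility_interval(0.25, 0.11, level=0.80)
# The lower bound is about .11: the interval excludes zero, which is
# typically read as support for generalizing the validity, even though the
# width of the interval still signals meaningful variability across situations.
```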

Evidence for Validity Generalization and Situational Specificity: A Continuum Perspective Historically, VG and SS were framed as two mutually exclusive outcomes. That is to say, the criterion-related validity evidence of a particular selection test was said to either generalize or be situationally specific. We believe a more fruitful path forward is to recognize that VG and SS may be better conceptualized as forming the anchors of a single generalization–specificity continuum. At one end of the continuum is found the VG hypothesis, which implies that a single (nonzero) population validity is invariant across all situations. The VG hypothesis may be formally


stated as a compound hypothesis: (a) after correcting for statistical artifacts the estimate of |ρ| > 0 and (b) the estimate of V(ρ) = 0. Thus, the VG hypothesis states that there is a single, invariant (or fixed) “true” population correlation between the predictor and the criterion. Any deviations that are observed within a sample from this fixed value may be attributed entirely to measured (e.g., sampling error, measurement error, range restriction) and/or unmeasured artifacts (e.g., clerical errors). Evidence to support the VG hypothesis is furnished by demonstrating that 75% of the variance in local estimates is attributed to various forms of statistical artifacts, with the presumption being that the remaining 25% is attributed to other artifacts that are not quantifiable (e.g., clerical errors). Consequently, 100% of the variance in observed validities may be attributed to noise in the system, and there is nothing unique about situations (and by extension, there are no moderators—situational or otherwise).

At the other end of the continuum, we find the SS hypothesis, which implies that non-trivial variability in test validities is not attributed to measured and unmeasured artifacts. The strong SS hypothesis is agnostic with respect to the estimate of the mean validity (i.e., it could be zero, positive, or negative), but instead is focused solely on the true variability in validities. To understand the true variability in validities necessitates an understanding of the agonists of this variability (i.e., moderator variables, situational or otherwise).

Finally, residing in the middle of the continuum we find what might be labeled a “weak” SS hypothesis (or “weak” VG hypothesis, depending on one’s theoretical proclivities); this hypothesis embraces the notion that the mean validity may likely be different from zero but also predicts significant variability around the mean. This middle-of-the-road hypothesis may also be considered consistent with the concept of transportability (discussed later in the chapter). Thus, to properly understand local validity estimates, one must also understand the critical differences arising across the situations where those local estimates were obtained—differences that are not entirely explained by measured and unmeasured statistical artifacts. These differences may be driven by moderator variables (situational or otherwise). In the context of test validation, one might formally state the weak SS hypothesis as a compound hypothesis: (a) after correcting for statistical artifacts the estimate of |ρ| > 0 and (b) the estimate of V(ρ) > 0. Thus, this hypothesis states that there may be multiple (or variable) “true” population correlations existing between the predictor and the criterion.

In general, research in applied psychology has revealed limited support for the VG hypothesis but greater support for the SS hypotheses. Evidence consistent with the SS hypotheses exists because typical VG analysis rarely frees up all of the between-sample variance in validity coefficient estimates, sometimes freeing up very little variance for certain predictor-criterion pairs and/or job types (i.e., Murphy, 2000; Ones, Viswesvaran, & Schmidt, 2003; Salgado et al., 2003).
For example, several VG analyses performed on tests of cognitive ability have revealed the moderating role of job complexity on correlations between ability/KSAs and performance criteria (i.e., Hunter & Hunter, 1984; Levine, Spector, Menon, Narayanan, & Cannon-Bowers, 1996; Russell, 2001; Salgado et al., 2003; Schmidt et al., 1993). Hunter and Hunter (1984) found that cognitive ability demonstrated a higher validity for predicting job performance and training success for occupations involving greater task complexity (i.e., the validity of cognitive ability tests is not invariant but fluctuates across situations as those situations vary in levels of task complexity). Likewise, Salgado and colleagues (2003) found that the empirical validity for general mental ability varied as a function of job type, with correlations ranging from .12 for police officers to .34 for sales occupations when predicting supervisor ratings of job performance. Relatedly, Schmidt and colleagues (1993) found that validity estimates for various measures of cognitive ability (i.e., general, verbal, quantitative, reasoning, perceptual speed, memory, and spatial and mechanical) varied (at least in part) as a function of job type, where the standard deviation of the validity estimates for reasoning ability predicting job performance was .04 for jobs involving stenography, typing, and filing and .19 for production and stock clerks. Similar support for SS hypotheses has been obtained for personality traits, especially in the case of team-oriented organizations (Barrick & Mount, 1991, 1993; Mount, Barrick, & Stewart, 1998; Stewart, 1996; Stewart & Carson, 1995). For example, Barrick and Mount (1993) found that the validity of key personality traits (conscientiousness and extraversion) as predictors of job performance (as rated by supervisors) varied over managers as a function of managers’


perceived level of autonomy on the job. Validities tended to increase in proportion with the amount of perceived autonomy. Additionally, Mount and colleagues (1998) showed that some Big Five traits were more valid than other Big Five traits, but the dominant trait varied as a function of the degree to which situations demanded social and interpersonal interactions. To illustrate, agreeableness and extraversion had stronger validities for predicting performance for employees working in situations emphasizing team-oriented jobs (e.g., highest mean validities were .24 and .20, respectively) compared to the validities that were observed for employees working in clerical and “cubicle” jobs/situations that emphasized dyadic interactions (i.e., newspaper employees in the circulation department; banking employees in loan operations; telemarketing representatives; highest mean validities were .16 and .16, respectively). In contrast, the opposite was true for conscientiousness. Specifically, dyadic jobs yielded greater validity estimates (e.g., highest mean validity was .29) than did team-oriented jobs (e.g., highest mean validity was .19). Moreover, even when validities are examined for one specific job type (e.g., sales), validities vary for the extraversion–sales effectiveness relationship across organizations, with only 54% of their variance being accounted for by statistical artifacts (Barrick & Mount, 1991; Stewart, 1996).

Along these lines, trait activation theory (Tett & Burnett, 2003) posits that work situations send cues to employees about what personality traits may be relevant for a given situation. Thus, features of a situation may serve as triggers of (or inhibitors for) the expression of personality-based work behaviors. A number of studies have supported the basic tenets of trait activation theory, including its relevance for personality (e.g., agreeableness) as a predictor of outcomes such as innovation and creativity (Hunter & Cushenberry, 2015) and for better understanding the construct validity paradox that has troubled assessment center researchers for many years (Lievens, Chasteen, Day, & Christiansen, 2006; Lievens, Schollaert, & Keen, 2015). Similarly, the strength of a situation (i.e., how much a situation restricts or inhibits behavior; Mischel, 1968) may moderate the magnitude of correlations between individual differences and work-related outcomes. For example, in a meta-analysis of the relationship between trait conscientiousness and job performance, Meyer, Dalal, and Bonaccio (2009) found that this relationship was significantly moderated by the strength of the work situation. For example, this correlation was weaker for jobs nested in very strong situations (.09 for nuclear equipment operation technicians working in a highly regulated work context) and was stronger for jobs nested in weaker situations (.23 for barbers working in a less regulated and more creative environment). More recently, Meyer and colleagues (2014) developed a measure of work-related situational strength and found that it significantly moderated the relationship between contextual work behaviors and the traits of conscientiousness and agreeableness. For example, agreeableness demonstrated a stronger relationship to organizational citizenship behaviors in weaker situations (i.e., where employees had greater discretion over their work activities; see also Meyer, Dalal, & Hermida, 2010, for a more detailed review of the situational strength concept and its relationships).
The above studies (and the concepts of situational strength and trait activation) were meant to be illustrative, not exhaustive. Overall, the theoretical models driving applied psychology are not simple, bivariate models (i.e., X correlates with Y). Rather, these models are inherently complex, multivariate, and regularly invoke mediating and moderating mechanisms. With respect to moderating mechanisms, situations continue to play central roles in our models, and the hypotheses derived and tested therefrom (and, it has long been recognized that test validation is simply a specific form of hypothesis testing; Binning & Barrett, 1989; Landy, 1986). Overall, the results for both cognitive and non-cognitive selection tests indicate that (a) the mean correlation between job-relevant tests of KSAs or personality traits and workrelated outcomes is often non-zero; (b) there is often considerable variance around these non-zero mean validities; and (c) in many instances, this non-trivial variance in validities may be attributed to situational variables that moderate the strength of the validities. This pattern of findings is consistent with the logic underlying the SS hypotheses and implies that idiosyncratic characteristics of the testing situations (e.g., situational strength, trait-activating situations) may be exerting a non-trivial influence on the predictive validity of many employment tests. Further buttressing the arguments of the SS hypothesis is the general finding that most recent meta-analytic reviews test for and report evidence of moderation (Aytug, Rothstein, Zhou, & Kern, 2012; Geyskens et al., 2009). Obviously, not all of these reviews considered


situational moderators, but many did. Finally, consistent with the SS hypotheses is the recommendation by the developers of VG to rely on random effects versus fixed effects models (fixed effects models are consistent with the VG hypothesis and random effects models are consistent with the SS hypotheses). In summary, when viewed as a continuum ranging from strong VG to strong SS, the extant literature indicates that the validities of most employment tests fall toward the middle or the SS end of the generalization–specificity continuum.

PSYCHOMETRIC META-ANALYSES AND THE FUTURE OF VALIDITY GENERALIZATION

Although the adoption of random effects meta-analytic procedures and the continual search for moderators have largely put to rest the VG versus SS debate, a number of important concerns continue to exist regarding the procedures of PMA that underlie all VG analyses (cf. Borenstein et al., 2009; Hunter & Schmidt, 2004). These concerns are of a sufficient magnitude to warrant a review and discussion of how they might impact the future use of PMA/VG procedures (or the interpretation of previous studies relying on PMA/VG procedures).

Concern 1: Failure to (Explicitly) Model Situational Attributes

A serious concern with current PMA/VG methods is that situational variables are never actually included as part of the formal statistical model. Instead, a number of prominent situational attributes have been systematically omitted from meta-analytic summaries that have relied on PMA/VG procedures. Thus, potential moderators of predictor-criterion relationships in meta-analyses have not been formally tested, and the structural models linking predictors to criteria may be considered mis-specified (James, Mulaik, & Brett, 1982). Potential situational moderators include organizational contextual variables such as authority structure, standardization of job tasks and procedures, reward processes, leadership styles, organizational culture, and organizational climate. Consequently, the pervasiveness of situational moderators has likely been underestimated in studies relying on the PMA/VG procedures, and our previous discussion of the selection literature should be interpreted accordingly.

Returning to Equations 1 and 2, it is clear that no substantive situational variables are actually taken into account in a PMA/VG analysis. Only basic statistical properties of the measurement scores are “corrected.” Because situational variables are not included as substantive variables in meta-analytic summaries, it is impossible to ascertain which situational variables might (or might not) influence a particular predictor-criterion relationship. Although more recent applications of PMA/VG have included formal tests for moderators, these moderators have often been methodological in nature (e.g., student vs. field samples, objective vs. subjective criteria) rather than substantive in nature (e.g., interpersonal or dyadic interactions; social climate). The PMA/VG estimation approach is based on residualization (artifact variance is removed before testing for moderation), which is disconcerting to proponents of SS hypotheses, who could reasonably argue that situational variables should be measured and formally tested as moderators before these variables are simply rejected as sources of error. Instead, current applications of PMA/VG largely take moderators into consideration on a post hoc basis, after the estimate of V(ρ) is found to differ from zero (Cortina, 2003).

The residualization approach to PMA analysis offers a problematic test of SS hypotheses. This problem can be broken down into concerns related to statistical power and concerns related to knowledge of quantifiable differences between and within studies that could act as moderating effects. Situational factors may impact validity estimates; however, many PMA studies include a limited number of primary studies (k) and/or primary studies that used small sample sizes (N), and both of these factors have been linked to insufficient power for detecting moderation effects (e.g., Alexander & DeShon, 1994; Cortina, 2003; James, Demaree, Mulaik, & Mumford, 1988; James et al., 1986; Murphy, 2000; Spector & Levine, 1987; Steel & Kammeyer-Mueller, 2002). When the number of studies (or the number of studies used to explore a specific effect) in a PMA is low, the power to detect moderating effects is similarly low, because PMA/VG relies on any existing variation between study effect sizes as the indication of moderation. Thus, adequate power (i.e., sufficiently large sample sizes and number of studies in the PMA/VG study) is requisite for any reasonable test of the SS hypothesis, and insufficient power to detect moderation should preclude an interpretation in favor of VG or against SS.

Most moderators studied in meta-analyses are explored in a post hoc manner, involve methodological (vs. truly substantive situational) moderators, and rely on the extent to which those conducting the PMA wish to code the studies for these methodological moderators. As a consequence, many moderator variables are included because they are conveniently coded, not because they are derived from strong psychological theories. Indeed, many of the moderators explored in selection contexts are likely of little interest to most organizations. As an example, around 90% of U.S. firms employ fewer than 20 employees and 98% employ fewer than 100 individuals (U.S. Census Bureau, 2012), yet many meta-analyses are conducted on studies where the samples are drawn from large companies, college students, or government employees. Important contextual variables, such as spans of control, reward structures, unemployment risk, benefits, flexibility, justice climate, perceived organizational support, and a host of other variables, are likely to vary widely not only between large and small work organizations but also within the subgroup of smaller work organizations. Data from individuals within these contexts are rarely collected and, even if they were, this contextual information may not be reported in individual validation studies; thus, such contextual variables cannot be explored in moderator analyses in PMA. Selection tools may exhibit high levels of consistency when most samples studied in the PMA come from similar contexts (e.g., large organizations or government samples), but that does not allow researchers to assume that the consistent effects uncovered from those substantively similar contexts can be transported to situations that were systematically excluded by the PMA (e.g., smaller organizations).

Similarly, the moderator variable of interest must vary between studies to be explored with PMA/VG. A moderation effect could be reported in every study explored within a PMA; however, unless subgroup means and variances are reported in each of those studies, there is no way to extract the necessary information to explore those subgroup effects with PMA. Additionally, many variables are continuous rather than categorical (such as span of control, unemployment risk, role ambiguity, or leader trust and support). Moderating effects for continuous variables in primary studies are explored after first computing a cross-product term, and there currently exists no way to summarize and test such continuous moderating effects that may be reported within the individual studies included as part of a PMA (DeSimone & Schoen, 2015). Although “moderator variables (interactions) never studied in any individual study can be revealed by meta-analysis” (Hunter & Schmidt, 2004, p. 26), it is also true that meta-analysis does not have a test that is analogous to the moderation tests used in primary studies (DeSimone & Schoen, 2015; Podsakoff, MacKenzie, Ahearne, & Bommer, 1995).
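To see the residualization logic referred to above in compact form, consider the following bare-bones sketch (written in Python). The study validities and sample sizes are invented, and corrections for unreliability and range restriction, as well as any substantive coding of situations, are deliberately omitted, which is precisely the point raised above:

```python
import numpy as np

# Hypothetical primary-study validities (r) and sample sizes (N); all values
# below are invented solely to illustrate the computations.
r = np.array([0.10, 0.35, 0.05, 0.30, 0.45, 0.15])
n = np.array([80, 120, 60, 150, 200, 95])

# Sample-size-weighted mean observed validity.
r_bar = np.sum(n * r) / np.sum(n)

# Sample-size-weighted observed variance of validities across studies.
var_obs = np.sum(n * (r - r_bar) ** 2) / np.sum(n)

# Expected variance attributable to sampling error alone (bare-bones form).
var_err = np.sum(n * (1 - r_bar ** 2) ** 2 / (n - 1)) / np.sum(n)

# "Residual" variance: the only quantity inspected for evidence of moderators,
# and only AFTER sampling-error variance has been subtracted out. No situational
# variable ever enters the model.
var_res = max(var_obs - var_err, 0.0)

print(f"mean r = {r_bar:.3f}")
print(f"observed variance of r = {var_obs:.4f}")
print(f"expected sampling-error variance = {var_err:.4f}")
print(f"residual variance attributed to potential moderators = {var_res:.4f}")
```

Note that situations appear in this computation only as whatever between-study variance survives the subtraction; which situational attributes are responsible, and how they operate, cannot be recovered from these quantities.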
In summary, the advocacy for using PMA/VG as a method for uncovering moderators (see Hunter & Schmidt, 2004, p. 26), coupled with the strong statements regarding “proof” of cross-situational consistency (i.e., no moderators; Hunter & Schmidt, 2004, pp. 404–405), paints a confusing picture. PMA/VG can be used to detect some forms of moderation, but only when sufficient variation exists between studies on the construct of interest. In addition, moderators in PMA/VG studies are often included and tested because they are conveniently coded, not because they represent critically relevant constructs derived from strong psychological theory. Finally, current PMA/VG techniques may only be used when the moderator is categorical; no techniques have been developed that will accommodate continuous moderator variables in a PMA/VG analysis.

Concern 2: Untested Statistical Assumptions

Statistical Artifacts and Validities Must Be Independent of Situational Variables

Several researchers have exposed a critical, implicit assumption underlying the use of PMA/VG procedures—namely, that the effects of situational variables and statistical artifacts on validity coefficients must be independent (i.e., orthogonal) of one another (Burke, Rupinski, Dunlap, & Davison, 1996; James et al., 1986; James et al., 1992; Raju, Burke, Normand, & Langlois, 1991; Thomas & Raju, 2004). However, this assumption has rarely been discussed or formally tested in PMA/VG studies.

Consider the example offered by James and colleagues (1992). They argued that variations among organizations (i.e., situations) in the restrictiveness of organizational climate would likely engender variations in criterion reliability. Restrictiveness of climate encompasses various organizational features that create strong versus weak work situations, including authority structures, standardization of job tasks and procedures, and reward structures (James et al., 1992). A highly restrictive climate (i.e., strict rules, guidelines, steep hierarchical structure, reward system not based on individual merit) would likely contribute to a decreased expression of individual differences among employees on performance because of a tendency toward compliance and conformity. This variance restriction should, in turn, attenuate criterion reliability and any relationship between these variables and job functioning (i.e., true validity) compared to what might be expected in a less restrictive climate (i.e., open communication, fewer restrictions and rules, reward system based on individual merit). If a situational variable such as restrictiveness of climate jointly affects the magnitudes of validities and criterion reliabilities, then the VG model is likely to include a covariance between validities and the reliabilities. Covariation between validities and a statistical artifact such as criterion reliability violates the fundamental assumption that these factors are statistically orthogonal to one another. Such covariation also implies that removing variance in validities associated with variance in criterion reliabilities likely results in removing variance due to true situational factors (e.g., restrictiveness of climate). This is because variation in the situational variable (climate) serves as a common cause for variation in validities and criterion reliabilities; to remove variance due to reliability is to remove variance due to its causes, including the situational variable. It follows that one is likely to increase the chances of (erroneously) rejecting the SS hypothesis by incorrectly attributing variance in validities to statistical artifacts, when in fact that variance is attributable to situational features (climate) that impacted both the validities and the artifacts (e.g., reliabilities).

In response to concerns about interdependencies among validities and statistical artifacts, two alternative models were introduced. One model, proposed by James and colleagues (1992), addressed this assumption of independence directly. Specifically, their model removed the assumption of independence by including covariance terms between validities and each of the statistical artifacts included in the PMA/VG correction equations. The second model, proposed by Raju and colleagues (1991), attempted to circumvent the problem engendered by lack of independence. Their approach corrected for unreliability, attenuation, and sampling error within each individual sample prior to averaging validities (i.e., predictor-criterion correlations) across studies. Therefore, violation of the assumption within studies is no longer an issue, although violation of the assumption across studies remains unresolved (Thomas & Raju, 2004).
Thomas and Raju (2004) tested and compared the accuracy of these two approaches. Although no comparison was made between the results obtained by applying the model presented by James and colleagues (1992) and those obtained from the traditional PMA/VG estimation equations, the method developed by Raju and coauthors (1991) surpassed the traditional VG model in accuracy. Furthermore, the two models (i.e., James et al., 1992, and Raju et al., 1991) demonstrated comparable properties in accurately estimating validity coefficients, with the method proposed by Raju and colleagues providing slightly more stable estimates (i.e., lower variance in estimates across samples). One consequence of the latter finding is that it provides additional support for the SS hypothesis (e.g., the residual variances of these estimates, which can be interpreted as arising from situational influences, tend to increase using these procedures). Of course, neither model identifies which situational variables moderate validities, nor in what way.
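The joint-dependence problem is easy to demonstrate with a small simulation. Everything in the sketch below, including the linear effects of climate on true validity and on criterion reliability and their numeric values, is assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 500  # hypothetical number of organizations (situations)

# Restrictiveness of climate, scaled 0 (weak situation) to 1 (highly restrictive).
climate = rng.uniform(0, 1, k)

# Assumed structural relations (illustrative values only): a more restrictive
# climate lowers both the true validity and the criterion (inter-rater) reliability.
true_validity = 0.50 - 0.25 * climate + rng.normal(0, 0.03, k)
criterion_rel = 0.75 - 0.30 * climate + rng.normal(0, 0.05, k)

# Observed validities are attenuated by criterion unreliability.
observed_validity = true_validity * np.sqrt(criterion_rel)

print("corr(true validity, criterion reliability)     =",
      round(np.corrcoef(true_validity, criterion_rel)[0, 1], 2))
print("corr(observed validity, criterion reliability) =",
      round(np.corrcoef(observed_validity, criterion_rel)[0, 1], 2))
# Both correlations are far from zero: the common cause (climate) violates the
# orthogonality assumption built into the standard PMA/VG correction equations.
```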

Statistical Artifacts Must Be Independent of One Another

In addition to the assumption that situational factors are uncorrelated with effect sizes and statistical artifacts, PMA/VG procedures also invoke the assumption that the statistical artifacts are “independent of each other” (Hunter & Schmidt, 2004, p. 139). Thus, range restriction, predictor reliability, and criterion reliability are all assumed to be statistically orthogonal to one another. The tenability of this assumption was recently challenged by Köhler and colleagues (2015), who conducted two large-scale meta-analytic reviews of different types of reliability coefficients (with data from 518 and 347 studies, respectively). Contrary to the statistical assumptions that form the basis of all PMA/VG procedures, reliability coefficients obtained from primary studies were often not independent of one another.

In Study 1, Köhler and colleagues (2015) summarized the degree of correlation between different types of reliability coefficients based on articles published in the Journal of Applied Psychology and the Academy of Management Journal. They found that the degree of correlation between different types of reliability coefficients ranged in magnitude from small to large. For example, the correlation between predictor reliabilities estimated using coefficient alpha and criterion reliabilities estimated using intra-rater correlations was nearly orthogonal at −0.03. In contrast, the correlation between predictor reliabilities estimated using coefficient alpha and criterion reliabilities estimated using inter-rater correlations was −0.45. It is also important to note that the reported correlations were obtained using a highly range-restricted set of data (i.e., studies from only two of the top academic journals were included, and the various forms of reliability were, on average, quite high). PMA/VG studies are often based on collections of effect sizes sampled from a broader array of sources (e.g., a greater number of journals and/or unpublished studies placed in the “file drawer,” possibly due to low reliabilities). Thus, the correlations between reliability coefficients provided in Study 1 by Köhler and coauthors likely underestimate the magnitude of these associations, and thus underestimate the extent to which the fundamental assumptions underlying PMA/VG have been violated.

In Study 2, Köhler and colleagues conducted a meta-analysis on the relationships between perceived organizational support (POS) and a number of its antecedents and consequences. Their results further confirmed that this fundamental assumption of PMA/VG analyses may be routinely violated (i.e., predictor and criterion reliabilities were correlated across studies). These authors noted that “correlations between reliabilities are quite substantial” (p. 376), with values ranging from −0.80 to +0.34. Consequently, they counseled researchers to take into account the correlation between reliabilities, lest they overcorrect the observed mean effect size (see Köhler et al., 2015, p. 381). Thus, although the assumption that reliabilities are independent has gone largely untested in individual PMA/VG studies, the results from Köhler and coauthors suggest that this assumption may be untenable. Violating this assumption biases not only the estimation of between-study variation, V(ρ), but also the estimate of the true score validity, ρ (Köhler et al., 2015). Estimates of true score validity that are upwardly biased are especially problematic in selection contexts (see our Concern 5 below).
In summary, the statistical foundation for all PMA/VG studies includes a core set of assumptions regarding the independence of study artifacts from one another (e.g., predictor reliability, criterion reliability, predictor range restriction) and the independence of situational variables from both statistical artifacts and observed criterion-related validities. Although attempts have been made to develop better estimating models and procedures (James et al., 1992; Raju et al., 1991), most PMA/VG studies continue to use the estimating equations presented by Schmidt and Hunter (see Aytug et al., 2012) that assume artifacts meet these two critical orthogonality assumptions. However, there is growing awareness in the extant literature that both of these assumptions may frequently be violated, resulting in misleading estimates of the mean true validity and the between-study homogeneity in validities (i.e., violating these assumptions results in biased estimates used to infer VG vs. SS).

Concern 3: Questionable Estimates Used to Correct for Artifacts

Our third concern relates to the specific values that are used to represent the statistical artifacts in “correction” equations. Even if the data meet the necessary statistical assumptions (which appears increasingly unlikely; see Concern 2), the correction equations that form the basis of PMA/VG analyses will only yield accurate (i.e., unbiased) estimates of the population parameters (i.e., the true validity correlation) when accurate (i.e., unbiased) estimates of the statistical artifacts are inserted into the proper correction equations. Unfortunately, there is growing concern that a number of previously published PMA/VG studies may have relied on inaccurate estimates of statistical artifacts.

The most contested application of the PMA/VG procedures has arisen when the criterion variable has been measured using supervisory performance ratings, arguably the most commonly used criterion in test validation and personnel decision making. In a highly cited meta-analytic review, Viswesvaran and colleagues (1996) examined the reliability of both supervisory and peer performance ratings. The authors found that the sample-size-weighted mean estimates of inter-rater reliability were quite low. More specifically, the mean estimates of inter-rater reliability were found to be .52 and .42 for data obtained using supervisor and peer ratings of performance, respectively. These estimates of reliability stand in stark contrast to estimates based on test-retest reliabilities and internal consistency reliabilities. For example, supervisory ratings demonstrated a mean test-retest reliability of .81 and an even stronger mean internal consistency reliability of .86. Temporal stability data were not available for peers, but the peer data mirrored the supervisor data with respect to internal consistency, with an average reliability of .85. Although ratings data appeared to be both internally consistent and reasonably stable, Viswesvaran and coauthors argued that the preferred estimate of reliability was furnished by the inter-rater reliability coefficients. Consequently, with a few notable exceptions (e.g., Meriac, Hoffman, Woehr, & Fleisher, 2008), the inter-rater reliability estimates provided by Viswesvaran and colleagues have become the default values used for artifact distributions to make statistical corrections in PMA/VG studies based on performance ratings.

To be clear, we do not have concerns with the breadth of the study by Viswesvaran and colleagues (1996), as we are confident they conducted a thorough and competent review. As such, we are confident that the mean estimates of inter-rater reliability reported by Viswesvaran and coauthors reflect the values reported in the literature. In addition, we do not have concerns with the use of correction equations (assuming, of course, that the necessary statistical assumptions have been met; see Concern 2). Instead, our concern is more fundamental and may be broken into three subcomponents: (1) the appropriateness of using inter-rater correlations to estimate the reliability of performance ratings, (2) the bias that exists in sample estimates of statistical artifacts, and (3) the questionable use of measures with such poor psychometric properties for the inferential decisions suggested by PMA/VG.

Appropriateness of Inter-rater Correlations to Index Reliability

First, we examine whether the inter-rater correlation is the most appropriate statistic for estimating the reliability of performance ratings. This statistic defines reliability as the correlation between two parallel (in a psychometric sense) raters. There is a growing consensus that the raters who furnish the data (e.g., performance ratings of a target individual provided by three different supervisors) do not meet the stringent requirements necessary to be treated as psychometrically “parallel” measures of performance (Murphy & DeShon, 2000; Putka & Hoffman, 2015; Putka, Hoffman, & Carter, 2014; Putka, Le, McCloy, & Diaz, 2008). For example, Putka and colleagues (2014) noted that ratings obtained from two supervisors may not be truly parallel measures. They grounded their compelling arguments in the extant literature (e.g., Trait Activation Theory; leader-member exchange theory) and described how estimates of criterion reliability based on inter-rater reliabilities may result in overcorrected (or undercorrected) validities. Specifically, they provided a new correction equation that would allow for two raters to provide distinct (yet valid) information about employees’ job performance. They noted that their new equation “reduces to the traditional correction if those unique perspectives completely reflect performance-irrelevant, accidental error” (p. 546). If supervisory or peer ratings are not parallel evaluations of employees’ performance, then the traditional correction equation should not be applied, and we instead direct the interested reader to the work of Putka and coauthors.



Biased Sample Estimates of Statistical Artifacts

Second, even if one assumes that the inter-rater correlations are based on essentially parallel ratings (cf. LeBreton et al., 2003; Schmidt, Viswesvaran, & Ones, 2000), one must still obtain unbiased sample estimates of these inter-rater reliability coefficients. More specifically, if the variability in job performance has been restricted (e.g., due to interventions such as valid and effective recruitment programs, the use of valid selection tools, the use of effective training interventions, attrition stemming from lack of person–organization fit, a restrictive climate/culture, or other potential causal mechanisms), then (like any correlation coefficient) observed estimates of inter-rater reliability will be downwardly biased (cf. Burke, Landis, & Burke, 2014; Huffcutt, Culbertson, & Weyhrauch, 2014; LeBreton et al., 2003; Sackett, Laczo, & Arvey, 2002). Thus, prior to “correcting” correlations for criterion unreliability, it is imperative to obtain an unbiased estimate of inter-rater reliability by correcting the attenuated reliability estimate for range restriction. Indeed, as the developers of PMA/VG procedures lamented, “In the typical validation study, the criterion reliability, as well as the test validity, is available only on the restricted group. Both coefficients should be corrected first for restriction in range. The validity coefficient should then be corrected for attenuation” (Schmidt, Hunter, & Urry, 1976, p. 475). Interestingly, rather than calculate the unrestricted estimates of inter-rater reliability, Viswesvaran and colleagues opted to meta-analytically summarize the restricted (i.e., downwardly biased) estimates of inter-rater reliability. Individuals who are familiar with the correction equations should appreciate how using downwardly biased estimates of inter-rater reliability will yield upwardly biased estimates of criterion-related validity and, obviously, restrict the possibility of finding situational moderators.

Recently, Huffcutt and colleagues (2014) advocated using a multistage artifact correction process. Specifically, they noted that statistical artifacts (e.g., inter-rater reliability estimates of job performance) could themselves be biased by other statistical artifacts (e.g., measurement error, range restriction). They suggested that prior to making corrections for statistical artifacts as part of a PMA/VG study, one should first seek to obtain the most accurate (i.e., unbiased) estimates of those artifacts. Huffcutt and coauthors reanalyzed data summarizing the criterion-related validity of the employment interview and documented how their approach yields less biased (i.e., more accurate) estimates of the true validity. On a similar note, Burke and colleagues (2014) recommended that corrections based on supervisory ratings of job performance should be made using a range of potential estimates of inter-rater reliability; this range of estimates should be selected so as to reflect situations that controlled for extraneous variables likely to attenuate estimates of inter-rater reliability. They offered a point estimate of inter-rater reliability (based on supervisor ratings) of .80 rather than the estimate of .52 that has permeated previous PMA/VG studies (and which has likely yielded inflated/overcorrected estimates of population validities).
We also agree with the observation of Burke and coauthors that multilevel issues are becoming more common in meta-analyses (and thus will require different estimation formulae; see also comments by Sackett, 2014).
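A compact sketch of the reliability portion of this two-step logic is shown below. All numbers are hypothetical: the .52 restricted inter-rater reliability echoes the Viswesvaran et al. (1996) value discussed above, and u = .70 is an assumed degree of range restriction on the criterion, not an empirical estimate. Under these assumptions, the de-restricted reliability lands near the .80 value recommended by Burke and colleagues (2014), and the resulting correction is noticeably smaller than the one based on the raw .52 figure:

```python
import math

# Hypothetical values (illustration only).
r_obs = 0.25            # observed validity in the restricted (incumbent) sample
ryy_restricted = 0.52   # inter-rater reliability observed in the same restricted sample
u = 0.70                # assumed ratio of restricted to unrestricted criterion SD

# Step 1: de-restrict the reliability estimate itself
# (constant-error-variance formula: rho_u = 1 - u^2 * (1 - rho_r)).
ryy_unrestricted = 1 - u**2 * (1 - ryy_restricted)

# Step 2: compare attenuation corrections based on the two reliability values.
# (The separate range-restriction correction of the validity coefficient itself
# is omitted here so as to isolate the reliability artifact.)
corrected_with_derestricted = r_obs / math.sqrt(ryy_unrestricted)
corrected_with_restricted = r_obs / math.sqrt(ryy_restricted)

print(f"de-restricted criterion reliability: {ryy_unrestricted:.2f}")                     # ~0.76
print(f"correction using de-restricted reliability: {corrected_with_derestricted:.3f}")   # ~0.286
print(f"correction using the restricted .52 value:  {corrected_with_restricted:.3f}")     # ~0.347
```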

Suspending Psychometric Standards in PMA/VG Studies

Finally, we are troubled that, with rare exceptions (e.g., Meriac et al., 2008), applied psychologists wishing to measure job performance via ratings have largely ignored traditional standards for measurement and instead embraced psychometrically questionable measures (i.e., supervisory performance ratings). We echo the sentiments presented by LeBreton and colleagues (2014) that applied psychologists should not apply different standards to evaluate the quality of predictor measures versus criterion measures. As these authors noted, many of the leading psychometricians and applied psychologists of the 20th and 21st centuries have emphasized the importance of using measurement systems yielding reliable assessments of the target constructs, and this is true for both predictor and criterion constructs:

[desirable reliabilities] usually fall in the .80s or .90s. (Anastasi, 1968, p. 78)


a test should have a minimum reliability coefficient of at least .94. Some have been more liberal in this regard, allowing a minimum of .90. (Guilford & Fruchter, 1973, p. 91)

Relevancy [is] the first requirement for a criterion. . . . Reliability is the second requirement of a criterion. (Smith, 1976, p. 746)

Most texts in industrial psychology contain lengthy lists of requirements for criteria. . . . [these lists] might be reduced to three requirements: reliability, validity, and practicality. (Landy, 1985, pp. 150–151)

the minimum acceptable level of reliability for psychological measures in the early stages of development is .70 (Nunnally, 1978). Higher levels may be required of measures . . . used in advanced field research and practice. (Nunnally & Bernstein, 1994, p. 839)

A relevant, reliable, and uncontaminated criterion measure(s) must be obtained or developed. (SIOP Principles, 2003, p. 14)

if criteria are to be useful, they must be measureable in a consistent manner. (Ployhart, Schneider, & Schmitt, 2006, p. 167)

In summary, in order to justify using the inter-rater reliability values reported by Viswesvaran and his colleagues, one must avow that (a) inter-rater correlations provided in primary studies meet the statistical assumptions requisite for interpretation as a reliability coefficient and (b) the estimates of inter-rater reliability have not been attenuated by other statistical artifacts. However, even if one were willing to concede that these criteria have been met, we would argue that performance ratings with reliabilities in the .40s and .50s should not be used as the basis for making critical personnel decisions (e.g., which tests are deemed “valid”; which employees should be hired, fired, promoted, rewarded, or punished). Critical decisions that impact the lives of individuals should only be based on reliable measures (and this is true for both predictor and criterion constructs). The absence of such assessments from the tool chest of applied psychology is not sufficient justification for embracing measurement systems where 50% to 60% of the observed variance is error variance; if assessments with such questionable psychometric properties were acceptable, applied psychologists would still be using projective tests to select employees (see Lilienfeld, Wood, & Garb, 2000, for a review of projective tests).

Concern 4: Imprecise Inferences Drawn from “Corrected” Validities

Our fourth concern relates to the potential for applied psychologists/practitioners to arrive at misleading inferences about selection systems from PMA/VG studies. In particular, we urge applied psychologists/practitioners to be wary of using corrected correlations for evaluating the practical benefit of a selection test. Consider the classic validity model presented in Figure 4.1 (used in LeBreton et al. (2014) and adapted from Binning and Barrett (1989)). Within the context of a local validation study, Inference 3 may be conceptualized as the observed correlation between a predictor measure (e.g., the Watson-Glaser critical thinking inventory) and a criterion measure (e.g., ratings of computer programmer job performance furnished by a supervisor). Inferences 2 and 4 may be conceptualized as representing the reliability of the predictor and criterion measures. If certain statistical assumptions are tenable, then reliability estimates may be computed using scores obtained from the predictor and the criterion. Finally, Inference 1 represents the hypothetical/conceptual/theoretical relationship between the latent constructs that are assessed via the predictor test and the criterion measure. In our example, these constructs might be labeled “general intelligence” and “performance.”


[Figure 4.1 is a path diagram: the Manifest Predictor Variable (X) and the Manifest Criterion Variable (Y) are connected by Inference 3; Inferences 2 and 4 link each manifest measure to its corresponding latent construct; and Inference 1 links the Latent Predictor Construct (ξ) to the Performance Domain Representing a Latent Criterion Construct (η).]

FIGURE 4.1  Basic Construct Validation Model

Note: Figure 4.1 originally appeared as Figure 1 in: LeBreton, J. M., Scherer, K. T., & James, L. R. (2014). Validity generalization in a land of suspended judgment. Industrial and Organizational Psychology: Perspectives on Science and Practice, 7, 478–500 (copyrighted by Cambridge University Press). This figure is reprinted with permission.

Within the context of a PMA/VG study, Inference 3 may be conceptualized as the sample-weighted (or unweighted) mean observed correlation taken over samples and (often) based on different assessments within and between samples. For example, Study 1 might use the Watson-Glaser and Supervisory Performance Ratings (furnished by a single supervisor); Study 2 might use the Wonderlic Personnel Test and Supervisory Performance Ratings (perhaps averaged over two supervisors); Study 3 might use Raven’s Progressive Matrices and objective indicators of sales performance; and so on. Inference 3 simply represents the average validity taken over these studies. Depending on the particular approach to PMA/VG, Inferences 2 and 4 may represent (a) the average observed reliabilities of the predictor and criterion measures or (b) an estimate of these reliabilities obtained using extant artifact distributions. Inference 1 again represents the hypothetical/conceptual/theoretical relationship between the latent constructs that are assessed via the different predictor tests and the different criterion measures.

Meaning of Corrected Coefficients

Corrected coefficients are hypothetical estimates of what the relationship between the predictor and criterion might look like if certain assumptions have been met, most notably that “one had access to an infinitely long predictor and an infinitely long criterion (i.e., perfectly reliable measures representing a one-to-one correspondence between the [observed measures] and [the constructs they purport to assess])” (LeBreton et al., 2014, p. 491). We believe it is important that both researchers and practitioners recognize the information that is (and is not) conveyed using corrected versus uncorrected coefficients. To be clear, there is nothing inherently “good or bad” about corrected coefficients. Indeed, corrected coefficients are routinely estimated as part of many applications of structural equation modeling (SEM; James et al., 1982). However, additional assumptions, as noted in Concern 2, are required when making these corrections in a PMA/VG study versus a primary study that implements an SEM analysis.

Further complicating the interpretation of PMA/VG studies has been the estimation of a partially corrected “operational validity.” Operational validity is a term used to denote a coefficient that has been asymmetrically subjected to corrections. Specifically, the correlation (or mean correlation) is corrected for measurement error in the criterion (e.g., performance ratings) but is not corrected for measurement error in the predictor. However, when operational validities are estimated, applied psychologists often interpret these validities as “suggestive of how we should expect a selection test to perform ‘operationally’ or ‘in practice’” (LeBreton et al., 2014, p. 491). Consider statements such as:

[operational validities are appropriate] because in actual test use we must use observed test scores to predict future job performance and cannot use applicants’ (unknown) true scores. (Hunter & Schmidt, 2004, p. 126)

We generally are not per se interested in a measured fallible indicator of performance; we want to know how well we predict the underlying construct reflected by that fallible measure. (Sackett, 2014, p. 502)

But employee selection must be based on observed scores among applicants, thus the relevant relationship is the operational validity of the predictor set for the criterion construct. (Viswesvaran et al., 2014, p. 514)
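In notation, and assuming the standard attenuation corrections apply, the asymmetry can be sketched as follows, where the numerator is the mean observed validity and the terms under the radicals are the mean criterion and predictor reliabilities drawn from the artifact distributions:

$$\hat{\rho}_{\text{operational}} \;=\; \frac{\bar{r}_{xy}}{\sqrt{\bar{r}_{yy}}} \qquad \text{versus} \qquad \hat{\rho}_{\text{fully corrected}} \;=\; \frac{\bar{r}_{xy}}{\sqrt{\bar{r}_{xx}\,\bar{r}_{yy}}}$$

Only the second expression also removes the attenuation due to predictor unreliability; the first treats the criterion as if it were perfectly measured while leaving the predictor “as used.”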

Although we agree with Hunter and Schmidt (2004) that it is inappropriate to use applicants’ (unknown) true scores on a selection test, we find it troubling that these authors (and many in our field) have no qualms about validating that imperfect and flawed test against the applicants’ (unknown) true scores on the criterion (i.e., their perfect and unflawed scores on the latent criterion construct). The argument for making selection decisions using operational validities is based on the presumption that it is unfair to penalize the evaluation of a predictor measure by the measurement error tainting the criterion. However, the criterion-related validity estimate (i.e., the correlation coefficient) represents a joint relationship between a predictor and a criterion. Indeed, evidence supporting the “validity” of inferences from test scores proceeds under the presumption that we have a highly reliable and relevant criterion. If our criterion is irrelevant and/or unreliable, then why bother looking for tests to predict that irrelevant and/or unreliable criterion? Computing an operational validity places a disparate and asymmetrical emphasis on the predictors that make up a selection system. This approach to test validation is inconsistent with extant validation frameworks that have emphasized the importance of accumulating validity evidence for both predictors and criteria. Indeed, if a criterion is 50% random noise, why bother trying to predict it? Relying on operational validities for test validation has enabled applied psychologists to ignore the quality of criteria (hence the tendency to make corrections using criterion reliabilities in the .40s and .50s), which simply further inflames the criterion problem bemoaned for decades in applied psychology (Austin & Villanova, 1992; LeBreton et al., 2014; Wallace, 1965).

Operational validities seek to estimate not the strength of the relationship one might expect to see “in operation” or “in practice” between a predictor and criterion, but rather the relationship between observed predictor scores and the latent (i.e., theoretical/hypothetical/conceptual) criterion construct. Returning to our example above, the operational validity for Study 1 represents the correlation between observed scores on the Watson-Glaser and a “perfect” criterion that was obtained by collecting supervisory performance ratings from an infinite number (or, to “approximate” perfect reliability, at least a very large number) of psychometrically parallel supervisors. There is a very low likelihood that an organization will have access to a large number of parallel supervisors for every single employee (cf. Murphy & DeShon, 2000; Putka & Hoffman, 2015; Putka et al., 2008; see also Concern 3).

Let’s assume in Study 1 that the observed correlation between the Watson-Glaser and supervisors’ performance ratings is .25. If we were to correct this observed correlation for measurement error in only the criterion (e.g., using the .52 estimate recommended by Viswesvaran et al., 1996), the operational validity increases to .35. Does this number really capture the quality of prediction obtained using this selection test to predict this criterion in this particular context? Is it possible that an organization could, in actual practice, expect to see such an impressive validity? It depends. How many organizations typically estimate job performance as a unit-weighted average of performance ratings provided by 65 supervisors for each employee? Why 65 supervisors? Applying the Spearman-Brown prophecy equation using the .52 estimate reported by Viswesvaran and his colleagues, we find that it would take the ratings of 65 psychometrically parallel supervisors to obtain a criterion reliability of .99 (≈ 1.00), the value assumed to be tenable when undertaking the calculation of an operational validity. We conclude that corrected coefficients (especially those that are asymmetrically corrected for only criterion unreliability) convey limited practical value. We are not alone in this judgment:

corrected rs are of little practical value. . . . The prediction of one variable from another and the accompanying error of estimate must necessarily be based on obtained, or fallible, rather than true scores. (McNemar, 1962, p. 153)

when one is faced with making inferences about behavior in the real world, it is not particularly useful to know how predictive a test would be if criterion measures were perfect. (Womer, 1968, p. 65)

correcting for artifacts is not the proper goal of meta-analysis. The purpose of meta-analysis is to teach us what is, not what might be some day in the best of all possible worlds when all of our variables might be perfectly measured. (Rosenthal, 1984, as quoted in DeShon, 2003, p. 386)

“corrections” confuse the essential distinction between what might be and what is. The observance of that distinction is the primary factor separating science from mere supposition. The forgetting of that distinction is the hallmark of validity generalization. (Seymour, 1988, p. 352; italics in original)

Practitioners should be especially wary of validity estimates adjusted for unreliability . . . these adjustments are intended to provide theoretical estimates of the magnitude of a validity coefficient under conditions of perfect measurement. . . . There is nothing operational about “operational validity.” (DeSimone, 2014, p. 530)

In summary, applied psychologists (especially those working in practice) are encouraged to focus their attention on the uncorrected correlations when interpreting the results from primary validity studies (or weighted mean uncorrected correlations when interpreting PMA/VG results). In contrast, psychologists interested in understanding hypothetical/theoretical/conceptual relationships that (might) exist between latent constructs may be better served by examining the corrected coefficients. Of course, this presumes that all necessary statistical assumptions have been met for estimating the corrected coefficients. Like the authors cited in the paragraph above, we see limited “practical” value in operational or corrected validities; instead, the value of corrected coefficients is in estimating what one (might) see in the theoretical/conceptual world where measurement error does not exist (or is very, very small; e.g., the world where each employee is rated by 65 supervisors who provide psychometrically parallel ratings).
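The arithmetic behind the 65-supervisor observation is easy to reproduce. The sketch below combines the attenuation correction with the Spearman-Brown prophecy formula, using the hypothetical .25 observed validity and the .52 inter-rater reliability from the running example:

```python
import math

r_obs = 0.25  # observed validity from the running example (hypothetical)
ryy = 0.52    # inter-rater reliability estimate from Viswesvaran et al. (1996)

# Operational validity: correct for criterion unreliability only.
operational_validity = r_obs / math.sqrt(ryy)

def spearman_brown(r_single: float, k: int) -> float:
    """Reliability of a unit-weighted composite of k parallel raters."""
    return k * r_single / (1 + (k - 1) * r_single)

print(f"operational validity: {operational_validity:.2f}")           # ~0.35
print(f"reliability, 1 supervisor:   {spearman_brown(ryy, 1):.2f}")   # 0.52
print(f"reliability, 5 supervisors:  {spearman_brown(ryy, 5):.2f}")   # ~0.84
print(f"reliability, 65 supervisors: {spearman_brown(ryy, 65):.3f}")  # ~0.99
# Sixty-five parallel raters per employee is the (implausible) measurement
# condition under which the .35 "operational" figure would actually be observed.
```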

Concern 5: Use of PMA Effect Sizes in Applied Contexts

It is one thing to compute a “corrected” correlation as an estimate of the hypothetical relationship between predictor and criterion constructs, but it is quite another thing to use that hypothetical value as an excuse for not undertaking a local validity study. From a legal perspective, HR practitioners may be especially interested in knowing how cases relying on PMA/VG studies have been received by the Supreme Court of the United States (SCOTUS). Said differently, if PMA/VG evidence suggests a test yields a non-zero validity in predicting job performance, can a company safely rely on this information without conducting a local validity study?

First, we would recommend that practitioners who are planning to use effect size estimates obtained from PMA/VG studies consider the magnitude of the reported effect. The courts have questioned the use of selection tools with low validities. Although the courts have been reticent to set a specific minimum cutoff for criterion-related validity coefficients, it does appear that tests with validities below .30 are questioned more heavily than are tests with validities higher than .30 (Biddle, 2010).

In addition, as discussed under Concern 4, practitioners looking to PMA/VG studies should be mindful of distinguishing between uncorrected and corrected validity coefficients. The corrected coefficients (frequently misrepresented as the true value, ρ, rather than the estimate, ρ̂) furnish a theoretical estimate of what the effect size might be if everything in the situation were perfect (e.g., no measurement error, no range restriction). The corrected validity is a hypothetical ideal that does not exist in reality and can be easily contested in the courts (see Seymour, 1988). All things being equal, a better representation of what one might find in a local validity study is the value provided by the sample-weighted uncorrected validity (often represented by r̄). Given that the statistical assumptions underlying artifact corrections appear to be increasingly untenable (see Concern 2), the corrected validity may actually represent an overcorrected (i.e., inflated or upwardly biased) estimate of the true validity. That the corrected validity could be inflated is immediately relevant if one assumes the courts consider tests with validities above .30 as more valid than tests with validities below .30. Corrected validities, depending on the number and type of corrections used, may yield values nearly twice as large as their uncorrected counterparts (especially when assessments of inter-rater reliability are used to correct supervisor ratings of performance; see our Concern 3). Indeed, LeBreton and colleagues (2014) demonstrated that when corrections based on dubious estimates of criterion reliability are simultaneously applied to multiple predictor variables, it is possible to explain nearly 100% of the variance in job performance using only four or five selection tests (see the illustrative sketch below). The implication is that situational variables (e.g., perceptions of climate, culture, justice, fairness, leadership, team cohesion, training interventions) are determined to be irrelevant to job performance.

We are aware of no case heard by the SCOTUS in which PMA/VG studies have helped to win a case. For example, in a number of cases heard by the SCOTUS in the context of discrimination, the court has not looked favorably on PMA/VG evidence or PMA/VG expert testimony (see Biddle, 2010; Landy, 2003; Outtz, 2011). Thus, especially in real-world selection contexts where organizations are faced with the possibility of discrimination lawsuits, relying only on PMA/VG studies appears ill-advised.
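To make the overcorrection concern concrete, the following sketch computes the squared multiple correlation implied by a set of hypothetical, fully corrected validities. The specific numbers (five predictors, corrected validities of .55, corrected intercorrelations of .15) are ours and are chosen only for illustration; they are not the values used by LeBreton and colleagues (2014):

```python
import numpy as np

# Hypothetical "fully corrected" validities for five selection tools and a modest
# corrected intercorrelation among them (illustrative assumptions only).
k = 5
corrected_validities = np.full(k, 0.55)
predictor_intercorrelation = 0.15

R_xx = np.full((k, k), predictor_intercorrelation)
np.fill_diagonal(R_xx, 1.0)

# Squared multiple correlation implied by the corrected correlation matrix.
r_squared = corrected_validities @ np.linalg.inv(R_xx) @ corrected_validities
print(f"implied R^2 from corrected coefficients: {r_squared:.2f}")  # ~0.95
# Taken literally, roughly 95% of the variance in "true" performance is accounted
# for, leaving almost nothing for situational variables -- an implausible
# implication that follows from stacking overcorrected validities.
```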
Outside of concerns regarding discrimination, advocates of PMA/VG today largely argue for the concept of transportability rather than true VG. Regardless of whether evidence of moderators is reported, practitioners trying to interpret and use results from PMA/VG must remain cautious in interpreting the weaker transportability inference. Kemery and colleagues (1987) demonstrated that a 90% credibility interval that does not include 0 (i.e., a result consistent with transportability) could still be compatible with a large proportion of situations (e.g., jobs, work contexts) where the true validity was in fact 0. In short, just because a credibility interval does not include 0, one cannot unconditionally conclude that the selection test/tool in question is transportable to all situations. Based on concerns about the potential impact of discrimination and the possibility of low validities in certain jobs/contexts even when PMA/VG evidence is supportive of transportability, we recommend that practitioners augment any PMA/VG results with results obtained from a local validation study. Of course, local validity studies are not without potential limitations (e.g., sampling error associated with smaller sample sizes). There are, however, ways to combine the results of a local validity study with the results summarized in a meta-analysis (see Biddle, 2010; Brink & Crenshaw, 2011).
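As a concrete illustration of what a credibility interval does and does not convey, consider the sketch below. The mean and standard deviation of true validities are hypothetical, and the normality assumption mirrors the one conventionally invoked when credibility intervals are reported:

```python
import math

def normal_cdf(x: float, mean: float, sd: float) -> float:
    """CDF of a normal distribution (used here for the assumed distribution of true validities)."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# Hypothetical PMA/VG output: estimated mean true validity and the SD of true
# validities across situations (both are themselves fallible estimates).
rho_bar, sd_rho = 0.25, 0.14

# 90% credibility interval under the conventional normality assumption.
z90 = 1.645
lower, upper = rho_bar - z90 * sd_rho, rho_bar + z90 * sd_rho
print(f"90% credibility interval: [{lower:.2f}, {upper:.2f}]")  # excludes zero

# Implied share of situations where the test adds little or nothing.
print(f"P(true validity <= .00): {normal_cdf(0.00, rho_bar, sd_rho):.2f}")
print(f"P(true validity <= .10): {normal_cdf(0.10, rho_bar, sd_rho):.2f}")
# Even with an interval that excludes zero, a nontrivial share of jobs/contexts
# may fall below any practically useful level of validity; and, as Kemery et al.
# (1987) showed, sampling error in rho_bar and sd_rho makes the interval itself
# an uncertain basis for the transportability inference.
```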


In summary, those hoping to draw conclusions from PMA/VG for selection purposes should be cautious. Meta-analysis can be a useful tool for summarizing research, but PMA/VG studies should not be viewed as substitutes for a well-conducted local validity study. Those hoping to use effect size estimates from PMA/VG studies should carefully scrutinize the information reported. For selection purposes, we recommend interpreting and using the uncorrected validities, rather than operational validities or fully corrected validities, as the effect size estimates. We also recommend a careful analysis of the credibility interval: if the credibility interval is wide, even if it does not include 0, there are likely moderating effects operating across the studies included in the analysis. In addition, given the empirical evidence supporting situational specificity, the fact that most PMA/VG studies include tests for moderation (Aytug et al., 2012), and the fact that proponents of PMA/VG now recommend random rather than fixed effects models (Hunter & Schmidt, 2000, 2004), the tide appears to be turning strongly in a direction that further justifies the use of local validation studies. Finally, those looking to apply the results from PMA/VG studies should remember that the overall summary effect size is an average across populations; an effect size reported in a particular moderator analysis might be more representative of (i.e., consistent with) a practitioner’s local situation/context.

CONCLUSIONS

Proponents of VG, and more recently PMA, have raised awareness of a number of important points. Studies with small sample sizes are subject to considerable distortion from sampling error. Measurement error and range restriction do attenuate the magnitude of validities. The corrections used in PMA for measurement unreliability are similar to those used, without controversy, in individual SEM studies. Finally, the quantitative summary of effect sizes, which was not a feature of qualitative reviews, is yet another strength of this method. However, like any statistical tool, PMA/VG is not without its controversies or limitations.

The first purpose of this chapter was to discuss one of these controversies, namely the extent to which criterion-related validities are situationally specific or generalize across situations. On balance, we found considerable empirical evidence consistent with the situational specificity hypotheses. This was not entirely surprising, given that the majority of theoretical models that make up the canon of the social sciences have adopted contingency, systems, or interactional perspectives. The second purpose of this chapter was to catalog a list of concerns and limitations that have emerged with respect to the use of PMA/VG procedures, including:

1. Methodological concerns related to the systematic omission of situational variables from formal tests of VG
2. Statistical concerns related to the untenable nature of the assumptions underlying PMA/VG procedures
3. Methodological and statistical concerns related to the specific point estimates used to estimate “corrected” validities
4. Theoretical concerns related to the drawing of improper inferences using corrected coefficients
5. Practical concerns related to the use of corrected and/or operational validities (i.e., partially or asymmetrically “corrected” validities) as justification for not undertaking a local validation study

In summary, we conclude that PMA/VG procedures are useful statistical tools for elucidating the impact of statistical artifacts on validity estimates. However, we also conclude that the results of PMA/VG studies are incomplete and often misleading because they have failed to confirm that critical statistical assumptions were met or have failed to include relevant situational variables in the actual tests of VG. To date, the overwhelming evidence favors the hypothesis that situational moderators (i.e., variables that affect the magnitude of the validity of various tests in predicting employee performance) are not only possible but also quite likely.



NOTE

1. Portions of this chapter are based, in part, on a chapter co-authored by Larry James and Heather McIntyre, which appeared in the first edition of this Handbook. We are grateful to Heather McIntyre and Leslie James for permission to use the previous chapter as a starting point for this new and updated contribution to the second edition of this Handbook. Lawrence R. James was professor emeritus at Georgia Institute of Technology; this chapter is published posthumously.

REFERENCES Alexander, R. A., & DeShon, R. P. (1994). Effect of error variance heterogeneity on the power of tests for regression slope differences. Psychological Bulletin, 2, 308–314. Algera, J. A., Jansen, P. G. W., Roe, R. A., & Vijn, P. (1984). Validity generalization: Some critical remarks on the Schmidt-Hunter procedure. Journal of Occupational Psychology, 57, 197–210. Anastasi, A. (1968). Psychological testing (3rd ed.). New York, NY: Macmillan. Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Applied Psychology, 77, 836–874. Aytug, Z. G., Rothstein, H. R., Zhou, W., & Kern, M. C. (2012). Revealed or concealed? Transparency of procedures, decisions, and judgment calls in meta-analysis. Organizational Research Methods, 15, 103–133. Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26. Barrick, M. R., & Mount, M. K. (1993). Autonomy as a moderator of the relationships between the big five personality dimensions and job performance. Journal of Applied Psychology, 78, 111–118. Biddle, D. A. (2010). Should employers rely on local validation studies or validity generalization (VG) to support the use of employment tests in Title VII situations? Public Personnel Management, 39, 307–326. Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494. Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. West Sussex, UK: John Wiley & Sons. Brink, K. E., & Crenshaw, J. L. (2011). The affronting of the Uniform Guidelines: From propaganda to discourse. Industrial and Organizational Psychology, 4, 547–553. Burke, M. J., Landis, R. S., & Burke, M. I. (2014). 80 and beyond: Recommendations for disattenuating correlations. Industrial and Organizational Psychology: Perspectives on Science and Practice, 7, 531–535. Burke, M. J., Rupinski, M. T., Dunlap, W. P., & Davison, H. K. (1996). Do situational variability act as substantive causes of relationships between individual difference variables? Two large-scale tests of “common cause” models. Personnel Psychology, 49, 573–598. Buss, A. R. (1979). The trait-situation controversy and the concept of interaction. Personality and Social Psychology Bulletin, 5, 191–195. Cortina, J. M. (2003). Apples and oranges (and pears, oh my!): The search for moderators in meta-analysis. Organizational Research Methods, 6, 415–439. Cronbach, L. J., & Snow, R. E. (1977). Aptitudes and instructional methods: A handbook for research on interactions. New York, NY: Irvington. DeShon, R. P. (2003). A generalizability theory perspective on measurement error corrections in validity generalization. In Kevin R. Murphy (Ed.), Validity generalization: A critical review (pp. 365–402). Mahwah, NJ: Lawrence Erlbaum Associates. DeSimone, J. A. (2014). When it’s incorrect to correct: A brief history and cautionary note. Industrial and Organizational Psychology: Perspectives on Science and Practice, 7, 527–531. DeSimone, J. A., & Schoen, J. L. (2015). Moderation effects not detectable by meta-analytic techniques. Presented at the Society of Industrial and Organizational Psychology Annual Meeting, Philadelphia. Endler, N. S., & Magnusson, D. (1976). Toward an interactional psychology of personality. Psychological Bulletin, 83, 956–974. Epstein, S. (1979). The stability of behavior: I. 
On predicting most of the people much of the time. Journal of Personality and Social Psychology, 37, 1097–1126. Geyskens, I., Krishnan, R., Steenkamp, J-B. E. M., & Cunha, P. V. (2009). A review and evaluation of meta-analysis practices in management research. Journal of Management, 35, 393–419. Ghiselli, E. E. (1959). The generalization of validity. Personnel Psychology, 12, 397–402. Ghiselli, E. E. (1966). The validity of occupational aptitude tests. New York, NY: Wiley. Ghiselli, E. E. (1973). The validity of aptitude tests. Personnel Psychology, 26, 461–477.


James M. LeBreton et al. Grote, G. F., & James, L. R. (1991). Testing behavioral consistency and coherence with the situationresponse measure of achievement motivation. Multivariate Behavioral Research, 26, 655–691. Guilford, J. P., & Fruchter, B. (1973). Fundamental statistics in education and psychology (5th ed.). New York, NY: McGraw-Hill. Hermelin, E., & Robertson, I. T. (2001). A critique and standardization of meta-analytic validity coefficients in personnel selection. Journal of Occupational and Organizational Psychology, 74, 253–277. Hogan, R. (2009). Much ado about nothing: The person-situation debate. Journal of Research in Personality, 43, 249. House, R. J., & Mitchell, T. R. (1974). Path-goal theory of leadership. Journal of Contemporary Business, 3, 1199–1237. Huffcutt, A. I., Culbertson, S. S., & Weyhrauch, W. S. (2014). Multistage artifact correction: An illustration with structured employment interviews. Industrial and Organizational Psychology: Perspectives on Science and Practice, 7, 548–553. Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72–98. Hunter, J. E., & Schmidt, F. L. (2000). Fixed effects vs. random effects meta-analysis models: Implications for cumulative research knowledge. International Journal of Selection and Assessment, 8, 275–292. Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Thousand Oaks, CA: Sage. Hunter, S. T., & Cushenbery, L. (2015). Is being a jerk necessary for originality? Examining the role of disagreeableness in the sharing and utilization of original ideas. Journal of Business and Psychology, 30, 621–639. James, L. R., Demaree, R. G., & Mulaik, S. A. (1986). A note on validity generalization procedures. Journal of Applied Psychology, 71, 440–450. James, L. R., Demaree, R. G., Mulaik, S. A., & Ladd, R. T. (1992). Validity generalization in the context of situational models. Journal of Applied Psychology, 77, 3–14. James, L. R., Demaree, R. G., Mulaik, S. A., & Mumford, M. D. (1988). Validity generalization: A rejoinder to Schmidt, Hunter, and Raju (1988). Journal of Applied Psychology, 73, 673–678. James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis: Assumptions, models, and data. Beverly Hills, CA: Sage. Kemery, E. R., Mossholder, K. W., & Roth, L. (1987). The power of the Schmidt and Hunter additive model of validity generalization. Journal of Applied Psychology, 72, 30–37. Kendrick, D. T., & Funder, D. C. (1988). Profiting from controversy: Lessons from the person-situation debate. American Psychologist, 43, 23–34. Kerr, S., Schriesheim, C. A., Murphy, C. J., & Stogdill, R. M. (1974). Toward a contingency theory of leadership based upon consideration and initiating structure. Organizational Behavior and Human Performance, 12, 62–82. Köhler, T., Cortina, J. M., Kurtessis, J. N., & Gölz, M. (2015). Are we correcting correctly? Interdependence of reliabilities in meta-analysis. Organizational Research Methods, 18, 355–428. Landy, F. J. (1985). Psychology of work behavior (3rd ed.). Homewood, IL: Dorsey Press. Landy F. J. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41, 1183–1192. Landy, F. J. (2003). Validity generalization: Then and now. In Kevin R. Murphy (Ed.), Validity generalization: A critical review (pp. 155–195). Mahwah, NJ: Lawrence Erlbaum Associates. LeBreton, J. M., Burgess, J. R. D., Kaiser, R. 
B., Atchley, E. K., & James, L. R. (2003). The restriction of variance hypothesis and interrater reliability and agreement: Are ratings from multiple sources really dissimilar? Organizational Research Methods, 6, 80–128. LeBreton, J. M., Scherer, K. T., & James, L. R. (2014). Corrections for criterion reliability in validity generalization: A false prophet in a land of suspended judgment. Industrial and Organizational Psychology, 7, 478–500. Levine, E. L., Spector, P. E., Menon, S., Narayanan, L., & Cannon-Bowers, J. (1996). Validity generalization for cognitive, psychomotor, and perceptual tests for craft jobs in the utility industry. Human Performance, 9, 1–22. Lievens, F., Chasteen, C. S., Day, E. A., & Christiansen, N. D. (2006). Large-scale investigation of the role of trait activation theory for understanding assessment center convergent and discriminant validity. Journal of Applied Psychology, 91, 247–258. Lievens, F., Schollaert, E., Keen, G. (2015). The interplay of elicitation and evaluation of trait-expressive behavior: Evidence in assessment center exercises. Journal of Applied Psychology, 100, 1169–1188. Lilienfeld, S. O., Wood, J. M., & Garb, H. M. (2000). The scientific status of projective techniques. Psychological Science in the Public Interest, 1, 27–66.


The Future of Psychometric Meta-Analysis McDaniel, M. A., Kepes, S., Banks, G. C. (2011). The uniform guidelines are a detriment to the field of personnel selection. Industrial and Organizational Psychology: Perspectives on Science and Practice, 4, 494–514. McNemar, Q. (1962). Psychological statistics (3rd ed.). New York, NY: John Wiley and Sons. Meriac, J. P., Hoffman, B. J., Woehr, D. J., & Fleisher, M. S. (2008). Further evidence for the validity of assessment center dimensions: A meta-analysis of the incremental criterion-related validity of dimension ratings. Journal of Applied Psychology, 93, 1042–1052. Meyer, R. D., Dalal, R. S., & Bonaccio, S. (2009). A meta-analytic investigation into situational strength as a moderator of the conscientiousness-performance relationship. Journal of Organizational Behavior, 30, 1077–1102. Meyer, R. D., Dalal, R. S., & Hermida, R. (2010). A review and synthesis of situational strength in the organizational sciences. Journal of Management, 36, 121–140. Meyer, R. D., Dalal, R. S., Jose, I. J., Hermida, R., Chen, T. R., Vega, R. P., Brooks, C. K., & Khare, V. P. (2014). Measuring job-related situational strength and assessing its interactions with personality and voluntary work behavior. Journal of Management. 40, 1010–1041. Mischel, W. (1968). Personality and assessment. New York, NY: John Wiley. Mischel, W., & Peake, P. K. (1982). Beyond déjà vu in the search for cross-situational consistency. Psychological Review, 89, 730–755. Mount, M. K., Barrick, M. R., & Stewart, G. L. (1998). Five-factor model of personality and performance in jobs involving interpersonal interactions. Human Performance, 11, 145–165. Murphy, K. R. (2000). Impact of assessments of validity generalization and situational specificity on the science and practice of personnel selection. International Journal of Selection and Assessment, 8, 194–206. Murphy, K. R., & DeShon, R. P. (2000). Interrater correlations do not estimate the reliability of job performance ratings. Personnel Psychology, 53, 873–900. Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill. Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill. Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (2003). Personality and absenteeism: A meta-analysis of integrity tests. European Journal of Personality, 17, S19–S38. Outtz, J. L. (2011). Abolishing the uniform guidelines: Be careful what you wish for. Industrial and Organizational Psychology, 4, 526–533. Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373–406. Peters, L. H., Fisher, C. D., & O’Connor, E. J. (1982). The moderating effect of situational control of performance variance on the relationships between individual differences and performance. Personnel Psychology, 35, 609–621. Ployhart, R. E., Schneider, B., & Schmitt, N. (2006). Staffing organizations: Contemporary practice and theory (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates. Podsakoff, P. M., MacKenzie, S. B., Ahearne, M., & Bommer, W. H. (1995). Searching for a needle in a haystack: Trying to identify the illusive moderators of leadership behaviors. Journal of Management, 21, 422–470. Putka, D. J., & Hoffman, B. J. (2015). The “reliability” of job performance ratings equals 0.52. In C. E. Lance & R. J. 
Vandenberg (Eds.), More statistical and methodological myths and urban legends (pp. 247–275). New York, NY: Routledge. Putka, D. J., Hoffman, B. J., & Carter, N. T. (2014). Correcting the correction: When individual raters offer distinct but valid perspectives. Industrial and Organizational Psychology: Perspectives on Science and Practice, 7, 543–548. Putka, D. J., Le, H., McCloy, R. A., & Diaz, T. (2008). Ill-structured measurement designs in organizational research: Implications for estimating interrater reliability. Journal of Applied Psychology, 93, 959–981. Raju, N. S., Burke, M. J., Normand, J., & Langlois, G. M. (1991). A new meta-analytic approach. Journal of Applied Psychology, 76, 432–446. Rosenthal, R. (1984). Meta-analysis procedures for social research. Beverly Hills, CA: Sage. Russell, C. J. (2001). A longitudinal study of top-level executive performance. Journal of Applied Psychology, 86, 560–573. Sackett, P. R. (2014). When and why correcting validity coefficients for interrater reliability makes sense. Industrial and Organizational Psychology: Perspectives on Science and Practice, 7, 501–506. Sackett, P. R., Laczo, R. M., & Arvey, R. D. (2002). The effects of range restriction on estimates of criterion interrater reliability: Implications for validation research. Personnel Psychology, 55, 807–825. Salgado, J. F., Anderson, N., Moscoso, S., Bertua, C., de Fruyt, F., & Rolland, J. P. (2003). Criterion validity of general mental ability measures for different occupations in the European community. Journal of Applied Psychology, 88, 1068–1081. Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529–540.

113

James M. LeBreton et al. Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274. Schmidt, F. L., Hunter, J. E., Pearlman, K., & Rothstein-Hirsh, H. (1985). Forty questions about validity generalization and meta-analysis. Personnel Psychology, 38, 697–798. Schmidt, F. L., Hunter, J. E., & Raju, N. S. (1988). Validity generalization and situational specificity: A second look at the 75% rule and Fisher’s z Transformation. Journal of Applied Psychology, 73, 665–672. Schmidt, F. L., Hunter, J. E., & Urry, V. W. (1976). Statistical power in criterion-related validation studies. Journal of Applied Psychology, 61, 473–485. Schmidt, F. L., Law, K., Hunter, J. E., Rothstein, H. R., Pearlman, K., & McDaniel, M. (1993). Refinements in validity generalization methods: Implications for the situational specificity hypothesis. Journal of Applied Psychology, 78, 3–12. Schmidt, F. L, Viswesvaran, C., & Ones, D. S. (2000). Reliability is not validity and validity is not reliability. Personnel Psychology, 53, 901–912. Seymour, R. T. (1988). Why plaintiffs’ counsel challenge tests, and how they can successfully challenge the theory of “validity generalization.” Journal of Vocational Behavior, 33, 331–364. Smith, P. C. (1976). Behavior, results, and organizational effectiveness: The problem of the criteria. In. M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 745–775). Chicago, IL: Rand-McNally College Publishing. Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author. Reprinted with permission. Spector, P. E., & Levine, E. L. (1987). Meta-analysis for integrating study outcomes: A Monte Carlo study of its susceptibility to Type I and Type II errors. Journal of Applied Psychology, 72, 3–9. Steel, P. D., & Kammeyer-Mueller, J. D. (2002). Comparing meta-analytic moderator estimation techniques under realistic conditions. Journal of Applied Psychology, 87, 96–111. Stewart, G. L. (1996). Reward structure as a moderator of the relationship between extraversion and sales performance. Journal of Applied Psychology, 81, 619–627. Stewart, G. L., & Carson, K. P. (1995). Personality dimensions and domains of service performance: A field investigation. Journal of Business and Psychology, 9, 365–378. Tett, R. P., & Burnett, D. D. (2003). A personality trait-based interactionist model of job performance. Journal of Applied Psychology, 88, 500–517. Thomas, A., & Raju, N. S. (2004). An evaluation of James et al.’s (1992) VG estimation procedure when artifacts and true validity are correlated. International Journal of Selection and Assessment, 12, 299–311. U.S. Census Bureau. (2012). Latest SUSB annual data: U.S. & states, totals [Data file]. Retrieved from http:// www.census.gov/econ/susb/ Vecchio, R. P. (1987). Situational leadership theory: An examination of a prescriptive theory. Journal of Applied Psychology, 72, 444–451. Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81, 557–574. Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (2005). Is there a general factor in ratings of job performance? A meta-analytic framework for disentangling substantive and error influences. Journal of Applied Psychology, 90, 108–131. 
Viswesvaran, C., Ones, D. S., Schmidt, F. L., Le, H., & Oh, I-S. (2014). Measurement error obfuscates scientific knowledge: Path to cumulative knowledge requires corrections for unreliability and psychometric meta-analyses. Industrial and Organizational Psychology, 7, 507–518. Vroom, V. H. (1973). A new look at managerial decision making. Organizational Dynamics, 1, 66–80. Wallace, S. R. (1965). Criteria for what? American Psychologist, 20, 411–417. Wright, J. C., & Mischel, W. (1987). A conditional approach to dispositional constructs: The local predictability of social behavior. Journal of Personality and Social Psychology, 53, 1159–1177. Womer, F. B. (1968). Basic concepts in testing. Boston, MA: Houghton Mifflin.


5
STRATEGY, SELECTION, AND SUSTAINED COMPETITIVE ADVANTAGE

ROBERT E. PLOYHART AND JEFF A. WEEKLEY

This chapter is motivated by a simple question: Do professionally developed personnel selection practices offer strategic value to the firm? Most industrial-organizational (I-O) psychologists would answer this question with an enthusiastic “Yes!” The belief that hiring better people will result in better job performance, which in turn will contribute to better-functioning organizations, is imbued early in the education of most I-O psychologists. Utility analyses indicate that selection systems with high validity will generate monetary returns far in excess of the costs associated with selection. How then, despite a century of research demonstrating the consequences of effective selection at the individual level, does the question posed above remain a real concern among practitioners, consultants, and academicians? Consider what it means for human resources (HR) to offer strategic value to a firm. At a high level, an HR practice will add strategic value to the extent it supports execution of the firm’s business strategy. A firm’s strategy represents how it will differentiate itself in a market relative to competitors; the strategy explains how the firm will compete and where it will compete. An HR practice such as selection must support a firm’s strategy and uniquely enable it to compete against other firms. Personnel selection is focused on identifying whether applicants have the necessary knowledge, skills, abilities, or other characteristics (KSAOs) to contribute to effective individual performance on some criterion/criteria. However, demonstration of validity of a selection procedure is—by itself—insufficient for creating sustainable competitive advantage. The requirements for that include demonstration of firm- (or unit-) level consequences for a selection procedure that cannot be easily replicated. The latter point is important because there is growing recognition that HR practices are easily copied and consequently may not form a basis for strategic value (Wright, Dunford, & Snell, 2001). Although it is likely true that using a more valid selection system will improve the quality of a firm’s workforce and ensure performance parity (all else being equal), that by itself does not make selection strategically valuable. Because of the outsourcing of selection practices, many competitors can (and often do) use the same vendor’s selection assessments. As a result, selection in such firms cannot contribute to their sustained competitive advantage, although they may use selection procedures that are predictive of individual job performance. In this chapter, we do not question whether professionally developed selection practices can add value to the firm. We believe that they usually do. However, we doubt whether this will always result in competitive advantage, and we offer some broad guidance about the conditions under which it will and will not occur. We also discuss how a broader perspective on selection can better articulate its value and thus perhaps increase the likelihood of effective selection practices being implemented and


supported by top management. Demonstrating how effective personnel selection practices contribute to firm performance only increases the likelihood of such practices being implemented. In the sections that follow, we first discuss the need for selection to take a broader perspective and show consequences at the business unit level. We then review the dominant strategic HR perspectives on how HR practices and human capital resources are linked to a business unit’s strategy. This is followed by a critical examination of whether personnel selection contributes to such resources and sustained competitive advantage. We then discuss several ways through which selection may contribute to the unit’s ability to execute its strategy. We conclude by considering selection in relation to other HR practices (e.g., training, compensation) for creating competitive advantage.

WHY PERSONNEL SELECTION MUST SHOW BUSINESS-UNIT-LEVEL VALUE

The most basic requirement for an HR practice to provide strategic value is to demonstrate that the practice has noticeable effects on outcomes at the business-unit level. Business units are broadly defined as those organizational entities that meaningfully describe unit functions or structures. Examples include departments, stores, divisions, lines of business, and of course, entire firms. Most key organizational decision makers are held accountable for consequences that exist at these unit levels, and key organizational decisions are driven by the performance of these units. Therefore, business-unit-level consequences are critical for demonstrating that the HR practice contributes to the firm’s success.

Several staffing scholars have argued that recruitment and selection research has failed to sufficiently demonstrate business unit value because it is still focused only on individual-level outcomes (Ployhart, 2006; Saks, 2005; Taylor & Collins, 2000). It is safe to say that most personnel selection research is limited to the individual level. There is an abundance of evidence that best practices in selection will lead to the identification of KSAOs necessary for effective individual performance. The use of sound selection procedures is based on the expectation that such practices contribute to improved organizational effectiveness (Schneider, Smith, & Sipe, 2000). Although selection systems are usually developed by focusing on improving individual job performance, the implicit but rarely tested assumption is that hiring better employees will result in more effective firms. However, reviews of the selection literature indicate that most research has never examined this question directly (Ployhart, 2012). Utility analysis is an estimate of the monetary value of selection, but it does not test the effects directly (Schneider et al., 2000). Among other things, most conceptualizations of utility analysis assume financial returns on staffing are linear, additive, and stable. Such assumptions are questionable. This may be, in part, why managers have found utility estimates to be unbelievable. For example, Latham and Whyte (Latham & Whyte, 1994; Whyte & Latham, 1997) found that utility analysis reduced managerial support for implementing a valid selection procedure, although the economic benefits of doing so were substantial and the logic and merits of utility analysis as a decision-making tool were carefully described. Perhaps one reason managers have not embraced utility analysis, in addition to not appreciating the mathematical proofs behind the Brogden-Cronbach-Gleser and related models, is the extremely high valuation placed on the typical selection intervention. Consequently, I-O psychology continues to struggle with demonstrating the business-unit-level value of selection in a manner that managers find credible.

More recent research within the tradition of strategic human resource management (SHRM) finds that firms using professionally developed selection practices perform better on financial, accounting, or market-based criteria than those that do not (e.g., Huselid, 1995; Jiang, Lepak, Hu, & Baer, 2012). However, one important limitation of most of these studies is that they rely on a manager (or small group of managers) to self-report the nature of the selection practice for the entire firm (see Gerhart, 2005; Wright & Haggerty, 2005).
The often-cited paper by Terpstra and Rozell (1993), which showed that firms using more valid predictors outperformed those that used less valid predictors, even used a self-report measure of firm effectiveness. It is interesting that as a profession we seem willing to place confidence in self-reports of unit-level


performance, when we place no such confidence in self-assessments of one’s own performance! A second, more important limitation of this research is that it asks only generally whether “a valid or systematic selection process is used.” Demonstrating that higher-performing firms are more likely to use systematic selection procedures does not establish causality and is a far cry from showing that units with the greatest levels of talent perform the best. Organizations that perform better financially may also be the ones more likely to invest in HR programs like selection systems. As DeNisi, Hitt, and Jackson (2003, p. 12) questioned: “If hiring ‘better’ people results in higher productivity, how exactly does the selection of individuals translate into improved organizational performance?” This all leads to the rather uneasy conclusion that the existing literature says relatively little about whether selection contributes to business unit effectiveness, much less to the competitive advantage of the firm (Ployhart, 2006). We believe it is critical for selection researchers to show such value for three reasons. First, to the extent selection is perceived as nonstrategic or unrelated to unit performance, many managers and organizational decision makers will continue to use suboptimal selection practices. Second, an inability to demonstrate value may limit the rigorous I-O approach to selection primarily to entry-level hires, where the focus may be on efficiency and cost-effectiveness rather than added value. Third, the profession of I-O psychology may not achieve the degree of reputation and visibility it deserves by failing to demonstrate how one of its core practices, selection, adds value to firms (Ployhart, 2012). We do not mean to overstate our points here because clearly many firms do not suffer from these issues, but it is also apparent that many firms do not use selection practices as I-O psychologists would advocate (Anderson, 2005; Rynes, Brown, & Colbert, 2002). We believe part of the reason stems from the challenge associated with showing selection’s unit-level impact (Ployhart, 2006).
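To make concrete why the valuations produced by the Brogden-Cronbach-Gleser model discussed above can strike managers as implausibly large, consider a minimal worked illustration. All figures are hypothetical assumptions chosen only for arithmetic convenience, not estimates from any study:

\Delta U = N \, T \, r_{xy} \, SD_{y} \, \bar{z}_{x} - N_{\mathrm{app}} \, C
         = (100)(5)(0.30)(\$40{,}000)(1.0) - (500)(\$100) \approx \$5.95 \text{ million}

Here 100 hires with average tenure of 5 years are selected from 500 applicants with a validity of .30, an assumed SD_y of $40,000, a mean standardized predictor score of 1.0 among those hired, and a cost of $100 per applicant. Estimates of this magnitude follow directly from the model’s assumptions of linear, additive, and stable returns; whether such gains actually translate into firm-level value is precisely the question raised in this chapter.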

STRATEGIC HUMAN RESOURCE MANAGEMENT (SHRM)

The field of SHRM has grown rapidly since the late 1980s. Most traditional HR, and especially I-O, research focuses on predicting, explaining, or influencing individual behavior. As noted above, selection in particular has treated individual job performance (and related criteria like turnover) as “the ultimate criterion.” In contrast, the criteria of interest in SHRM scholarship are at the unit level, and most typically that of the firm (Wright & Boswell, 2002). SHRM scholarship tends to focus on between-firm (or business unit) differences in HR practices that help explain between-firm (or business unit) differences in performance. The typical SHRM study involves an examination of how business units that use different HR practices perform differently (Becker & Huselid, 2006). Thus, unlike traditional HR research, which focuses on individual differences in outcomes, SHRM research focuses on unit differences in outcomes. The other part to the story is that SHRM researchers tend to focus on predicting financial, market, or accounting criteria (Gerhart, 2005; Jiang et al., 2012). Hence, SHRM scholarship has attracted a great deal of attention from the business community, precisely because it shows that HR practices can improve the organizational metrics most important to stakeholders.

Broadly speaking, there are three dominant perspectives on how to best use HR practices (Delery & Doty, 1996). The universalistic perspective suggests that use of certain HR practices will always be useful and relevant. Note that such a belief suggests using the appropriate practices will always improve a firm’s effectiveness, irrespective of changes in the economy, the firm’s strategy, or its competition. Colbert (2004) argued:

    Research under this perspective has been useful in identifying discrete HR practices that are universally sensible, but it has not contributed much to HRM in the strategic sense, if we take strategic to mean practices that differentiate the firm in its industry and that lead to sustainable competitive advantage. (p. 344)

Yet this universalistic “best practices” approach is precisely the one taken by many selection scholars (e.g., cognitive ability is always a good predictor of job performance).


The contingency perspective suggests that HR practices will be useful and relevant only when they match with each other and the firm’s strategy. The contingency perspective is more directly linked to adding value for the firm because it recognizes that HR practices must support the firm’s strategy and be internally consistent with other practices. Attempts have been made to link HR strategies to generic business strategies (e.g., Porter’s cost leadership, product differentiation, and focus; Schuler, Galante, & Jackson, 1987). Although the idea that the “appropriate” set of HR practices can be deduced from a general business strategy has obvious appeal, the approach has been widely criticized (Chadwick & Cappelli, 1999).

The configural perspective builds from the contingency approach to further recognize synergies that exist in patterns of practices that fit with particular strategies. This approach takes the most holistic view of HR management and suggests specific HR practices cannot be understood (or their effects decomposed) in isolation from other practices and the firm’s unique strategy. Rather, bundles of HR practices must be used in a specific combination to drive strategic value. These are most often called high-performance work systems and involve combinations of practices that include systematic staffing, training, compensation, and related practices. Firms that use these high-performance work systems outperform those that do not (Huselid, 1995).

Although there is now fairly compelling evidence that use of HR practices and high-performance work systems is related to firm value (Combs, Liu, Hall, & Ketchen, 2006; Jiang et al., 2012; although see Wright, Gardner, Moynihan, & Allen, 2005), it should be recognized that most SHRM research only examines the link between unit HR practices and firm effectiveness. Intervening explanatory processes, such as how the HR practice influences the cognitions, affect, and behavior of individuals, are rarely considered (Becker & Huselid, 2006; Gerhart, 2005; Wright & Haggerty, 2005). These unexamined intervening processes have been referred to as SHRM’s “black box.” It is within this black box that I-O psychology in general, and selection specialists in particular, are uniquely situated to demonstrate value (Ployhart & Hale, 2014). First, however, it is important to understand the dominant theoretical perspectives that are invoked to explain why HR practices contribute to firm performance.

Resource-Based View of the Firm

Wright, Dunford, and Snell (2001) noted the dominant theoretical perspective adopted by SHRM scholars has been the resource-based view (RBV) of the firm, as articulated by Barney (1991). What makes the RBV important among strategy theories is its emphasis on a firm’s internal resources. Internal resources may represent human capital, top management expertise, financial capital, coordination processes, and related factors. Importantly, the RBV argues that there is heterogeneity in firm-level resources that contributes to some firms having a competitive advantage over other firms. Further, the RBV makes clear predictions about the characteristics of resources that have the potential to underlie sustained competitive advantage. First, valuable resources are those that are linked to the firm’s strategy and allow it to perform better than competitors. For example, having highly qualified employees could be a valuable resource if they resulted in firm-level competencies that manifested themselves in firm-level outcomes (i.e., the linkage between collective KSAOs and firm performance). Second, rare resources are more likely to result in an organizational competitive advantage because there is an insufficient quantity in the market. By definition, the most talented people will be rare (think of a normal distribution), so firms that better attract and retain the best talent should benefit directly (they have the rare talent) and indirectly (by keeping it out of the competition). Together, valuable and rare resources create opportunities for a firm to achieve competitive advantage. However, what firms need to be more concerned with is sustainable competitive advantage. A competitive advantage that is not sustainable leads only to conditions of temporary superiority followed typically by parity with other firms. Two additional conditions must be met for a competitive advantage to be sustainable. Inimitable resources are those that competitors cannot readily duplicate without great cost. For example, if one firm retains high-quality talent better than its competitors, then it has created a


resource that is inimitable. Social complexity, time compression diseconomies, and causal ambiguity contribute to inimitability (Barney & Wright, 1998). Social complexity refers to resources that only exist among aggregate collectives of people. For example, in many organizations knowledge is shared informally through social networks, rather than through more formal organizational structures and processes. As such, it is quite difficult to replicate the knowledge and knowledge-sharing process in other organizations. A highly effective climate is likewise socially complex because it exists in the shared perceptions of employees. Southwest Airlines has fended off multiple imitators (remember Continental Lite or the United Shuttle?). Although these imitators could replicate the business model (e.g., point-to-point service, single type of aircraft, minimal in-flight service, etc.), they could not duplicate key elements of the organization’s culture. Time compression diseconomies represent the notion that time is often not something that can be compressed with equal effectiveness (Dierickx & Cool, 1989). For example, firms that have strong brand equity have an advantage that competitors cannot easily copy because it takes a long time to generate brand recognition and brand loyalty. Causal ambiguity describes resources that are linked to effective firm performance, but the specific reasons or paths through which they contribute are not obvious. For example, it may not be apparent which specific HR practices or combinations of practices contribute to building a more effective workforce. Because creation of these resources is not easily understood, it is hard for competitors to copy them with equal effectiveness. In the RBV, the final condition for creating sustainable competitive advantage is that the resources be nonsubstitutable. Nonsubstitutable resources are those that are both necessary and sufficient for effective firm performance. For example, automated teller machines (ATMs) are an effective technological substitution for most bank teller transactions, making bank tellers’ talent a substitutable resource. However, suppose that bank tellers also provide financial advice that adds value and increases the customer base—in this example, bank tellers’ talent would be nonsubstitutable. Thus, only resources that are valuable, rare, inimitable, and nonsubstitutable can create sustained competitive advantage. Selection practices may or may not meet these conditions.

Human Capital and Social Capital Theories

Human capital theory (Becker, 1964) is a broad theory originating in economics (Becker won the 1992 Nobel Prize for his work on human capital theory). Human capital theory argues that there are two main types of human capital. Firm-specific human capital is relevant only to a specific firm or context, whereas generic human capital (e.g., cognitive ability, personality) is relevant across multiple firms or contexts. Within HR contexts, firm-specific human capital is generally considered to be more important for performance because of increased knowledge of processes, operations, products, customers, and coworkers (Strober, 1990). It could be argued that firm-specific knowledge is also more valuable, rare, inimitable, and possibly nonsubstitutable (Barney, 1991). Modern extensions to human capital theory include the knowledge-based view of the firm (Grant, 1996). Note that knowledge in such perspectives is not merely job knowledge, but includes knowledge of the organization’s customers, products, services, structure, processes, culture, and related factors. Thus, human capital represents the KSAOs that are “relevant for achieving economic outcomes” (Ployhart, Nyberg, Reilly, & Maltarich, 2014, p. 376).

Whereas human capital theory usually emphasizes aggregate employee education, experience, or knowledge, social capital theory emphasizes the interpersonal relationships and networks that exist among employees, units, and organizations (Nahapiet & Ghoshal, 1998). Given that modern work is increasingly knowledge- and team-based, social networks are a critical means for sharing and creating knowledge (Oldham, 2003). These social networks can generate effects on unit criteria that are unrecognized at the individual level. For example, Shaw, Duffy, Johnson, and Lockhart (2005) demonstrated that using unit-level turnover rates (percentage of people who quit) as a predictor of unit performance underestimates the true costs of turnover. They showed that the negative effects of turnover were greater when those who left the firm were


more centrally located in the employees’ social networks. Simply put, losing a “more connected” person is more damaging than the departure of someone on the periphery. Thus, social capital represents the business-unit-level aggregate of employee social networks, relationships, and structures (Nahapiet & Ghoshal, 1998).

ALIGNMENT OF SELECTION AND STRATEGY

From the preceding discussion it is apparent that selection practices, by themselves, may not contribute to sustained competitive advantage and hence demonstrate firm-level value. Figure 5.1 is an attempt to illustrate and integrate the various SHRM concepts noted above with personnel selection. This figure is based on a multilevel, strategic model of staffing presented by Ployhart (2006). However, it gives more careful consideration to the types of firm-level resources likely to be important for sustained competitive advantage.

In Figure 5.1, notice that a business unit’s HR practices in general, and selection practices in particular, will have a direct impact on the individual-level KSAOs attracted, selected, and retained (Bowen & Ostroff, 2004; Schneider, 1987). This is as one would expect; use of more valid assessments should better isolate the desired KSAOs (Guion, 1998). It is also as expected that these KSAOs have a relationship with individual job performance (denoted by the dashed lines in Figure 5.1). Here is where the usual selection story ends, with perhaps the additional effort in estimating utility. However, considering our review of the SHRM literature earlier, it is apparent that effective selection practices may or may not support the business unit’s strategy and add value. One of the main reasons for this ambivalence is that selection practices can often be copied and will not, by themselves, form the basis for sustained competitive advantage. For example, use of off-the-shelf assessments should, all else being equal, produce similar levels of validity at different firms. Such practices may lead to short-lived advantage or contribute to maintaining parity with competing units. Selection practices that are more customized to the firm, such as situational judgment tests that reflect the firm’s climate and values, may be more difficult to imitate and thus better contribute to competitive advantage (see LeBreton et al., Chapter 4 in this volume, for additional perspectives).

However, unit-level competencies (i.e., collective human and social capital resources; the bold box in Figure 5.1) form the strongest basis for sustained competitive advantage (Barney & Wright, 1998; Prahalad & Hamel, 1990). Consider the organization selecting front-line employees on the basis of conscientiousness, agreeableness, and emotional stability. By themselves, these individual-level selection KSAOs can be and are screened by many companies. However, when combined with other properly aligned HR practices, a strong culture, and other management practices, selecting on these individual-level KSAOs can lead to the emergence of a unit-level competency such as “aggregate customer service.” Thus, unit-level human and social capital competencies are the source of sustained competitive advantage; they represent intangible assets and add value to the firm but are created through application of HR practices like personnel selection in combination with other social capital issues (like organizational culture and management practices). However, the following paragraphs discuss certain conditions that must be met before unit-level competencies will create sustainable unit-level differences in performance. First, validity at the individual level generalizes across like jobs, as is consistent with most selection research (note there is no moderator of the dashed relationship in Figure 5.1).
However, this does not mean the validity of unit-level competencies (human and social capital) generalizes across contexts. Rather, human capital at the unit level only adds value to the extent that it is consistent with the firm’s strategy (Ployhart et al., 2014). Also known as “external fit,” an organization creating a unit-level competency such as customer service will benefit only to the extent that the competency is critical to its business strategy. Whereas one retailer competing on service would benefit from the creation of a customer service competency, another competing on price would experience less, if any, impact from the same activities. This is an important implication because it means one can be doing a wonderful job of selecting at the individual level, yet adding nothing to the sustained competitive advantage of the firm.
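One way to picture this “external fit” argument is as a moderated relationship in which strategy conditions the payoff from unit-level human capital, consistent with the contingent arrows in Figure 5.1. The sketch below is purely illustrative: the variable names, simulated data, and effect sizes are assumptions for demonstration, not measures or results from any study.

import numpy as np

rng = np.random.default_rng(0)
n_units = 200

# Hypothetical unit-level data: aggregate "service capital" (e.g., mean
# conscientiousness/agreeableness of unit staff) and whether the unit's
# strategy emphasizes service (1) or price (0).
service_capital = rng.normal(0, 1, n_units)
service_strategy = rng.integers(0, 2, n_units)

# Simulate the contingency: capital pays off mainly where the strategy fits.
unit_performance = (0.1 * service_capital
                    + 0.3 * service_strategy
                    + 0.5 * service_capital * service_strategy
                    + rng.normal(0, 1, n_units))

# Moderated regression: performance ~ capital + strategy + capital x strategy
X = np.column_stack([np.ones(n_units), service_capital, service_strategy,
                     service_capital * service_strategy])
betas, *_ = np.linalg.lstsq(X, unit_performance, rcond=None)
print(dict(zip(["intercept", "capital", "strategy", "capital_x_strategy"],
               betas.round(2))))

A nontrivial interaction term is the statistical analogue of the claim made here: the same stock of unit human capital adds value under one strategy but contributes little under another.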


[Figure 5.1 appears here. Its labeled elements are, at the strategic (macro) level: the business unit’s strategy, the business unit’s human resource practices, and the business unit’s mix of strategic resources (unit-level human and social capital that is valuable, rare, inimitable, and nonsubstitutable), leading to human capital advantage and unit-level sustained competitive advantage; and, at the selection (individual) level: individual differences (KSAOs) acquired and retained, and individual-level job performance, with human capital emergence linking the two levels.]

FIGURE 5.1  Conceptual Model Linking Personnel Selection With Business Unit Strategy

Notice that the extent to which unit-level resources are valuable, rare, inimitable, and nonsubstitutable will determine the strength of relationship between unit-level human and social capital and sustained competitive advantage. However, note that the contingent relationships found at the strategic level do not moderate individual-level predictor validity. The dashed line represents the primary relationship examined in personnel selection research. Only the bold box represents potentially strategic resources. Adapted from Ployhart (2006).

Second, it was previously asserted that selection practices alone are insufficient to create the unit-level competencies that are the basis of sustained competitive advantage. Selection practices must be aligned with other HR programs (e.g., training, performance management, and compensation) to create the desired impact. Selection practices that are inconsistently applied, or conflict with other HR practices, will not create the type of human capital emergence necessary for a unit to build a high-quality stock of aggregate human and social capital. For example,


a unit may use exemplary selection practices but be unable to recruit top talent. Or, the unit may attract and select top talent but fail to keep them in sufficient quantities. The unit’s climate, leadership, or compensation may further enhance or lower the capital’s potential. Also known as “internal fit,” it is important to recognize that selection is but one lever available to management, and that the greatest impact comes when all levers are aligned and pointing in a common direction.

Third, selection can have significant strategic value if the targeted KSAOs fit externally (with the business strategy) and internally (with other HR programs) to create unit-level competencies that drive important firm outcomes. While complicated enough, this scenario is likely to have several important boundary conditions. For example, the relationship between unit competencies and unit outcomes is unlikely to be linear, cross-sectionally or over time (see Ployhart, 2004). Figure 5.1 emphasizes a contingency or configural view of selection practice because the value of unit-level human and social capital is dependent on (moderated by) the unit’s strategy. This is in contrast to the personnel selection literature, which in many ways argues for a universalistic approach. Cognitive ability and conscientiousness may predict the individual performance of most jobs (Schmidt & Hunter, 1998). Although true, it does not necessarily follow that selecting on these attributes will result in sustainable competitive advantage. Further, universalistic findings (e.g., validity generalizes across contexts) at the individual level become contextualized at the unit level (e.g., validity is affected by economic environmental factors; Kim & Ployhart, 2014), and hence require a configural perspective. For example, there is likely a need for a critical mass of human and social capital (Dierickx & Cool, 1989). Particularly in highly task-interdependent work, hiring only a few highly talented employees is unlikely to result in unit-level competencies (i.e., ensure sufficient quantities of human and social capital) that will translate into improved business unit performance. Although the “tipping point” defining critical mass likely changes by industry, occupation, organization, and job, there is always a threshold. The relationship between the level of a unit’s talent and its performance is likely to be nonlinear, and determining where the threshold is requires research unlike any that presently exists.

Fourth, looking longitudinally, time is a critical element for modeling the value of human and social capital. It takes time for HR practices to create human and social capital, and more time for the increasing stock to translate into unit-level outcomes (Ployhart & Hale, 2014). Such a temporal perspective is true for all unit-level processes, with higher-level processes (e.g., relations between firm climate and firm performance) requiring a longer time perspective than lower-level processes (e.g., relations between group cohesion and group performance; Kozlowski & Klein, 2000). In short, implementation of a new selection procedure, regardless of its merits, will take time before its effects can be measured at the unit level. This, of course, assumes that the other HR practices constituting the “bundle of high-performance HR practices” are themselves stable.
Where other HR practices are evolving, the effects of a selection procedure may be even more difficult to ascertain (or their effects accelerated). This requisite leap of faith may explain why some organizations frequently change their selection processes. One final caveat is that the predictors of individual job performance may or may not be the best drivers of sustained competitive advantage. Certainly one expects better individual performance to contribute to better unit performance, but this relationship will not necessarily be linear or isomorphic (e.g., group process losses, etc.). Because the criteria are not isomorphic across levels (e.g., adding one additional “high-ability” employee to a casino operator employing hundreds is unlikely to measurably impact that unit’s performance), there is a good possibility that the predictors will not be isomorphic either (Bliese, 2000). This means additional KSAOs may be important drivers of unit effectiveness and sustained competitive advantage (e.g., those contributing to shared collective performance and social capital, such as personality traits linked to interpersonal and teamwork skills; Noe, Colquitt, Simmering, & Alvarez, 2003; Oldham, 2003). In short, characteristics unrelated to individual performance may be important for unit performance (see also Chapter 37, this volume, regarding team member selection). Of course, it must be remembered that a lack of quality individual KSAOs cannot be compensated for by otherwise effective management and a supportive culture. Both quality KSAOs and a supportive environment must exist.


If the above propositions are compared to the typical model for utility analysis, similarities and differences become apparent. In terms of similarities, utility analysis and SHRM scholarship would predict that rarer KSAOs (those that have a lower base rate in the population) will result in higher utility and competitive advantage. However, there are also important differences. Utility analysis assumes that economic value resulting from more valid selection will produce firm-level value, and it uses as its point of comparison either using no selection system or an alternative selection system. SHRM scholarship does not make these assumptions. Validity does not equal value; there are multiple ways that firms may use resources to be competitive; and valuable and rare resources may only contribute to parity or temporary competitive advantage (Ployhart, 2012). Sustained competitive advantage also requires resources that are inimitable and nonsubstitutable. Utility analysis and the contemporary view of selection fall short on these latter points.

These points illustrate that one cannot simply assume that using more valid selection procedures will add firm-level value, improve organizational performance, or create an opportunity for sustained competitive advantage. In fact, sole focus on individual-level selection prohibits such conclusions. Of course, poor use of selection can contribute to a competitive disadvantage by failing to acquire the KSAOs needed to perform job and organizationally prescribed tasks. But consider one of the most important implications of Figure 5.1: an organization can be using highly valid predictors of job performance at the individual level but not necessarily developing the kinds of human capital necessary for the firm’s sustained competitive advantage! This may seem discouraging, but it represents an incredible opportunity for selection scholars and practitioners to better demonstrate their value to organizational decision makers. There exists a bridge to connect micro and macro disciplines of scholarship by linking selection (micro) to strategy (macro), and each discipline has the opportunity to strengthen the other. From a practical perspective, articulating selection’s value, and the manner in which selection works in conjunction with management practices, strategy, and culture, will increase the likelihood of effective practices being implemented and supported. Consider the fact that staffing impacts the firm’s strategy and helps implement it. In terms of impact, Figure 5.1 alludes to the expectation that implementing HR practices aligned with the unit’s strategy will contribute to human and social capital emergence that is valuable, rare, inimitable, and nonsubstitutable. However, through the process of developing and administering the selection system, selection also helps articulate and reinforce the firm’s strategy through all levels of the organization. In the following section, we consider various means through which personnel selection can establish business-unit-level value.

SELECTION’S CONTRIBUTION TO BUSINESS UNIT VALUE

Organizational decision makers will devote the necessary resources to staffing practices because a firm needs talent to survive, but they are more likely to devote discretionary resources to the extent they believe selection practices will reinforce or help implement the firm’s strategy. In the sections that follow, we discuss several opportunities for making this value proposition and review a number of recent empirical studies providing supportive evidence.

Multilevel Selection Shows Business Unit Consequences

Linking selection to a unit’s strategy requires a multilevel focus (Ployhart, 2006). In this section we first review theory about multilevel systems, constructs, and processes. We then consider how SHRM research has evolved to link individuals and organizations. We conclude with a brief review of empirical research linking selection to business unit outcomes. The central issue in multilevel selection is understanding how individual-level KSAOs transform into unit-level human and social capital (KSAO composition; see Figure 5.1). In multilevel


language, emergence refers to ways in which lower-level KSAOs combine to form collective, unit-level constructs (Kozlowski & Klein, 2000) that have frequently been referred to as core competencies. This is the critical link in Figure 5.1 that translates HR practices into strategically valuable intangible assets (i.e., human and social capital). Emergence takes two general forms. Composition forms of emergence represent agreement or similarity among observations. When systematic selection results in homogeneity in KSAOs, it becomes possible to describe the unit in terms of those KSAOs. This is of course the basis of the attraction-selection-attrition model (Schneider, 1987). For example, one firm that attracts, selects, and retains only those with high levels of conscientiousness will be more conscientious than a firm that is unable to develop such talent. Compilation forms of emergence represent dissensus or dissimilarity. For example, academic departments often specifically target applicants who have skills and research interests absent among current faculty, thereby filling a hole in the department’s portfolio of knowledge and teaching expertise.

There has been a fair amount of theory in the last decade focused on developing multilevel staffing practices. Schneider et al. (2000) noted that personnel selection practices must become multilevel and suggested that between-firm differences are driven by the effects of unit-level competencies, which are composed of similarity on individual differences. Ployhart and Schneider (2002) discussed the practical implications of multilevel personnel selection. They noted that a focus on “the job” is insufficient, and they developed a model to show how practitioners could establish multilevel validity. Ployhart and Schneider (2005) then proposed a series of methodological and statistical approaches for establishing validity at multiple levels. Ployhart (2004, 2006) attempted to integrate the SHRM research on human capital with personnel selection scholarship to develop a more comprehensive multilevel selection model that should articulate selection’s strategic value to the firm.

This research on multilevel selection fits within the broader movement of SHRM scholarship to link individual and unit levels. Wright and Boswell (2002) persuasively called for an integration of micro (HR) and macro (strategy) scholarship because each level has implications for the other. More recently, Gerhart (2005) and Wright and Haggerty (2005) noted that an important direction for SHRM scholarship is to move beyond establishing the effectiveness of HR practices to demonstrating how HR practices build aggregate compositions of human and social capital (competencies) linked to unit-level outcomes (interestingly, compilation models are mentioned less frequently). They essentially argued for scholars to focus on how HR practices create human and social capital emergence. Bowen and Ostroff (2004) developed the concept of HR strength, the idea that stronger (more consistent, greater internal fit) HR practices create more cohesive climates than do weaker (less consistent) HR practices. Ployhart and Moliterno (2011) described the cognitive, affective, and behavioral processes that transform individual KSAOs into unit-level human capital resources, thereby explaining why individual-level findings may not directly generalize to the firm level.
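As a minimal sketch of how composition emergence might be examined in practice (the data, column names, and balanced-design shortcut below are hypothetical assumptions, not a prescribed procedure), a common first step is to estimate ICC(1), the proportion of variance in an individual-level KSAO attributable to unit membership (Bliese, 2000), before describing units by their aggregate KSAO levels:

import numpy as np
import pandas as pd

def icc1(df: pd.DataFrame, unit: str, score: str) -> float:
    """One-way ANOVA ICC(1): share of variance in individual scores
    explained by unit membership (see Bliese, 2000)."""
    groups = [g[score].to_numpy() for _, g in df.groupby(unit)]
    k = len(groups)                        # number of units
    n_total = sum(len(g) for g in groups)
    n_bar = n_total / k                    # average unit size
    grand_mean = df[score].mean()
    ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n_total - k)
    return (ms_between - ms_within) / (ms_between + (n_bar - 1) * ms_within)

# Hypothetical employee-level data: unit id and a conscientiousness score.
employees = pd.DataFrame({
    "unit": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "conscientiousness": [4.2, 4.5, 4.4, 3.1, 3.4, 3.0, 3.8, 3.9, 4.1],
})
print(icc1(employees, "unit", "conscientiousness"))            # between-unit signal
print(employees.groupby("unit")["conscientiousness"].mean())   # unit-level "stock"

A nontrivial ICC(1) indicates that units can meaningfully be described by their members’ shared standing on the KSAO, which is the empirical footprint of composition emergence; compilation emergence would instead be indexed by patterns of within-unit dissimilarity.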
So, how might HR practices, and selection in particular, contribute to the emergence of unit-level competencies that form the basis for sustained competitive advantage? Lepak and Snell (1999) suggested different types of employee groups differ in their strategic value and uniqueness to the firm and recommended that firms should vary their HR practices accordingly. For example, firms should use commitment-based HR practices for employees who are both highly valuable and unique (e.g., top managers) but use efficiency-based HR practices for employees who are valuable but not unique (e.g., entry-level employees). Jobs have similarly been differentiated as “A” jobs and “C” jobs, with the former being central to the firm’s ability to execute its strategy and the latter being important but not strategic. Selection is most likely to create unit-level impact when applied to A jobs. As an example, in most banks the loan officer job is an important one. Loan officers are the interface between the bank and customers, and they make decisions about whether the bank should lend money. Incumbents in such positions are often evaluated on sales and service criteria. The key KSAOs for selection in this context would be those related to sales (e.g., persistence, energy, persuasion) and service (agreeableness, conscientiousness). However, much more critical to the organization’s success are the far smaller number of A jobs responsible for setting the organization’s overall credit policies and managing its risk. Although small in number, their impact on the firm is enormous (witness the meltdown in the financial markets from overzealous lending practices). Criteria used to evaluate these jobs might include various ratios of


returns to bad loans. The necessary KSAOs would be substantially different, focusing on personality, quantitative analysis, forecasting, geopolitical trends, and the like. Consistent across all this theory is the expectation that HR practices create a composition and/or compilation of competencies (unit KSAOs) from individual KSAOs. It bears repeating that the strategic value manifests from the unit-level competencies, not individual KSAOs. HR practices produce unit differences and may contribute to sustained competitive advantage through the creation of valuable, rare, inimitable, and nonsubstitutable competencies (human and social capital at the unit level, and collective personality-related characteristics that contribute to the formation of high-performance cultures).

Empirical research supports many of these expectations. Lepak and Snell (2002) found that firms do in fact use different HR practices for different employee groups varying in terms of value and uniqueness. Takeuchi, Lepak, Wang, and Takeuchi (2007) found that aggregate manager perceptions of employee competencies were positively related to self-reports of firm performance. Several studies have found that human capital manifests different forms of composition across units, and hence units can be distinguished in terms of aggregate individual KSAOs (Jordan, Herriot, & Chalmers, 1991; Ployhart, Weekley, & Baughman, 2006; Schaubroeck, Ganster, & Jones, 1998; Schneider, White, & Paul, 1998). More direct evidence is found in studies that link selection and/or human capital resources directly and indirectly to unit- and firm-level performance (Kim & Ployhart, 2014; Oh, Kim, & Van Iddekinge, 2015; Ployhart, Weekley, & Ramsey, 2009; Van Iddekinge, Ferris, Perrewe, Perryman, Blass, & Heetderks, 2009).

More research is needed linking HR practices to specific types of human and social capital emergence. For example, do firms that use more aligned recruiting, selection, and retention practices create more valuable and rare forms of human capital more quickly? Do such firms have greater control over the flow of human and social capital? Do firms that use such practices outperform rivals? SHRM scholarship needs to focus squarely on human and social capital emergence, thereby illuminating the black box between HR practices and unit outcomes. Fortunately, I-O psychologists and selection specialists are the torch that can light the examination of this black box.

Selection and Retention

Turnover nullifies the positive human and social capital enhancements accrued through effective selection. The extant literature suggests that effective selection can help reduce turnover (Barrick & Zimmerman, 2005), but this only scratches the surface of retention issues. For example, recent unit-level research suggests that the economic costs associated with turnover are far greater than individual research might indicate. Unit-level human and social capital losses represent the quantity and quality of talent lost through turnover, the costs of which may be greater than simply the rate of turnover (Glebbeek & Bax, 2004; Kacmar, Andrews, Van Rooy, Steilberg, & Cerone, 2006; Shaw, Gupta, & Delery, 2005). This is an important transition because unit-level turnover captures losses of collective processes (e.g., coordination and communication) that are not apparent in individual-level research, and turnover rates may not fully capture the costs of turnover (Dess & Shaw, 2001; Hausknecht & Holwerda, 2013). For example, service- and knowledge-based firms are highly dependent on social capital (the collective social relationships, interactions, and networks within a unit). Effective unit performance in these settings requires employees to share information about customers, services, products, and best practices. Unit turnover creates holes in social networks that have negative consequences beyond simple turnover rates (Shaw et al., 2005).

The loss of talent through turnover means the investment made in unit human capital may not be recouped, but not all losses are equal. For example, turnover in A jobs should be more costly than turnover in C jobs. That is, the opportunity costs of turnover in strategically critical positions should be higher than the same costs in less mission-critical positions. Similarly, higher-quality unit human capital is expected to produce more value, so turnover among top talent will incur greater opportunity costs than will turnover among lesser talent (Hausknecht & Holwerda, 2013). Turnover should also be more damaging when those who leave have more tacit knowledge (Strober, 1990) or are more centrally located in the unit’s social network (Hausknecht & Holwerda, 2013). From the RBV perspective, units are not equally distributed with


respect to human capital. Turnover may diminish the stock of unit human capital, but turnover among the highest-quality human capital may make that resource less valuable, less rare, and more easily imitated unless replacements are equally qualified (Lepak & Snell, 1999). Because individual KSAOs tend to be normally distributed, by definition higher-quality talent will be more scarce and difficult to replace. Empirically, Shaw et al. (2005) found that unit social capital losses were associated with additional negative effects beyond mere performance losses, indicating the quality of social capital was an important factor. Research similarly finds a loss of shared tacit knowledge (Hitt, Bierman, Shimizu, & Kochhar, 2001) and an erosion of communication and coordination (Sacco & Schmitt, 2005). Selection experts should take ownership of turnover problems. For example, an HR executive of a Fortune 500 company told one of the authors: “I tell my (HR) team that turnover means our hiring process failed.” That is a bold statement (people quit for all kinds of reasons that have nothing to do with their ability to perform the job), but there is some truth to better utilizing staffing as a means to control turnover. That said, adequately solving this problem will require a deeper understanding of turnover than simply identifying the KSAOs necessary for job performance. For example, it will require an understanding of how individuals fit within the culture of the firm and characteristics of the work group, something that is rarely assessed as part of selection practices. Note that doing so would obviously contextualize selection, an idea that has been recently advocated (Cascio & Aguinis, 2008; Ployhart & Schneider, 2012). This research may help link selection to important internal processes that contribute to the firm’s ability to retain the high-quality stock of strategically valuable human and social capital necessary for strategy implementation.
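As a purely hypothetical sketch of why a raw turnover rate can understate these losses (the roster, quality scores, centrality scores, and weighting below are illustrative assumptions, not a validated metric), one could contrast the rate of turnover with a loss index that weights each leaver by performance quality and network centrality:

# Hypothetical unit roster: each employee has a performance-quality score and
# a degree-centrality score from the unit's communication network.
roster = {
    "ana":   {"quality": 0.9, "centrality": 0.9, "left": True},
    "ben":   {"quality": 0.3, "centrality": 0.1, "left": True},
    "chris": {"quality": 0.5, "centrality": 0.4, "left": False},
    "dana":  {"quality": 0.4, "centrality": 0.3, "left": False},
    "eli":   {"quality": 0.4, "centrality": 0.2, "left": False},
}

leavers = [e for e in roster.values() if e["left"]]

# Conventional metric: percentage of people who quit.
turnover_rate = len(leavers) / len(roster)

# Illustrative loss index: weight each departure by quality and centrality,
# scaled against the unit's total stock of (quality x centrality).
total_stock = sum(e["quality"] * e["centrality"] for e in roster.values())
weighted_loss = sum(e["quality"] * e["centrality"] for e in leavers) / total_stock

print(f"turnover rate:         {turnover_rate:.2f}")   # 0.40
print(f"weighted capital loss: {weighted_loss:.2f}")   # ~0.68: the central, high performer dominates

The same 40% turnover rate is consistent with very different capital losses depending on who leaves, which is the unit-level point made by Shaw et al. (2005) and Hausknecht and Holwerda (2013).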

Talent as Assets Versus Costs

Current models of accounting treat HR-related activities as costs (and worse yet, sometimes as liabilities). Accounting is the language of business, so viewing HR as a cost likely contributes to managers’ perceptions of HR not adding strategic value. It is incumbent on HR managers and staffing specialists to convey how human and social capital can become a financial asset. Unfortunately, there is no commonly accepted or widely appreciated way to value human capital resources (Fulmer & Ployhart, 2014). Chapter 10 in this volume discusses this issue in some detail, but we raise a few additional points here. Earlier, when discussing Figure 5.1, we noted some similarities and differences between viewing staffing as a strategic process versus traditional utility analysis. We emphasize that unit-level human and social capital creates sustained competitive advantage. Therefore, it is critical to articulate the economic value of this intangible asset—what Ulrich and Smallwood (2005) called “return on intangibles.” In some of our own research, we have shown how individual KSAOs emerge to create meaningful between-unit differences in human capital resources, which are in turn linked to financial performance (Ployhart et al., 2009). Thus, we showed how this “intangible asset” was in fact tangibly linked to important store financial criteria. We have also demonstrated how selection practices relate to financial performance over time, at both the business unit (Ployhart et al., 2011) and firm (Kim & Ployhart, 2014) levels.

Selection as a Lever for (or Barrier to) Change A firm’s business strategy and personnel selection practices should be fully aligned. When a firm decides to change strategic direction (e.g., by shifting from competing on cost to competing on quality or expanding into completely new markets), it should require changes in the selection process. The need for change in strategy may be signaled by a new CEO, declining market share, a paradigm threatening advance in technology, or a new entrant into the competitive landscape. In such cases, new core competencies may be required. For such competencies to emerge, selection efforts will have to focus on the KSAOs comprising the basic ingredients. When combined with appropriate changes to other HR practices (e.g., performance management, compensation, leadership


Strategy, Selection, and Competitive Advantage and culture change), new competencies may begin to form to support execution of the firm’s new strategy. Additionally, most organizational change efforts involve a substantial change in personnel, at least at the top levels of the firm. By attracting, selecting, and retaining people who share the new vision and have the skills necessary to achieve the vision, a firm can quickly and possibly radically alter its human and social capital composition (Kim & Ployhart, 2014). These are important selling points HR managers can use to secure support and resources for selection procedures. Of course, selection can also serve as a barrier to change when it is divorced from the firm’s strategy and results in the continued acquisition of employees who lack the skills or vision necessary for the company’s survival. Organizations that rely on an internal labor market (i.e., hire at the entry level and promote exclusively from within) are often at a disadvantage when change becomes a necessity. Such firms may find it difficult to alter their skill mix in the short-term without reliance on talent secured from the outside. Conversely, organizations that rely on the external labor market for resources (i.e., buying the talent on the open market as needed) may be more nimble when faced with the need for change (Kim & Ployhart, 2014). An organization attempting to shift from a commodities market to an “upscale market” may find itself in need of people able to envision products worthy of a price premium (as opposed to people who focus primarily on cost control). Growing them internally would be slow, at best. The choice between promoting internally versus hiring externally is one that has important consequences (Bidwell, 2011; see also Groysberg, 2010).

Global Considerations in Selection and Strategy Chapter 36 in this volume discusses various issues relating to the global implementation of selection practices, but here we discuss a few specific to linking selection with strategy. Clearly, organizations operating globally face a unique set of selection issues. Organizations that operate globally often have different functions located in different parts of the world (e.g., design in the United States and manufacturing in China). As a result, different core competencies are required by different parts of the organization located in different parts of the globe. Obviously, different selection criteria may be called for in different parts of the world, and thus the KSAOs matched to the competencies needed to execute respective functions will also differ. Global selection also means that selection methods developed in one part of the world may not generalize to other countries or cultures (e.g., Ryan, McFarland, Baron, & Page, 1999). Even where the same competencies are desired, indicating the same KSAOs are required, different or at least modified methods may be required to identify people who possess them. For example, it is well established that personality measures are affected by response styles (e.g., extreme responding, acquiescence), which in turn vary across cultures. Other selection methods like biodata and situational judgment, which are usually developed and keyed at a local level, may not generalize across cultural boundaries (Lievens, 2006). The obvious point is that globalization greatly compounds the challenges facing selection researchers who are attempting to build the corporate competencies required to support execution of the business strategy and impact key outcomes. The talent necessary to compete in one local economy (e.g., China, India) may contribute little to supporting the organization’s overall strategy. For example, suppose a U.S.-based organization opens a manufacturing facility in China. The organization may choose to employ local managers because they have expertise in Chinese culture and business practices, but they may lack some key competencies necessary for them to understand how their operation adds value to the overall firm. Hence, assuming some autonomy, they may make choices that run counter to the firm’s strategy even though they are successful locally. The critical issue is alignment between the firm’s strategy and talent within and across geographic locations.

Selection Influences Diversity Strategy, selection, and diversity are an important combination for at least two reasons. First, many firms report that increasing, or at least valuing, diversity is one of their corporate goals.


Robert E. Ployhart and Jeff A. Weekley Whether this is done because they think it helps the bottom line, for social responsibility, or simply to enhance public relations, the fact remains that many firms place a premium on attracting diverse talent. In short, diversity may be part of the firm’s strategy. Many firms will only use a systematic selection process to the extent it increases diversity—validity will often take a secondary role, if it takes a role at all. A common scenario is an organization choosing to not implement a locally validated cognitive ability test because of the concomitant adverse impact implications. (Note that in this section, our treatment of diversity is primarily in terms of demographic diversity; psychological diversity is actually a subset of compilation models discussed earlier.) However, it should be remembered that selection is the only way to influence demographic diversity (notwithstanding mergers and acquisitions). Thus, selection is the primary mechanism for enhancing diversity, and arguing for selection in this respect can be a powerful means to articulate its value (Ployhart, 2006). For example, one of the authors worked on developing a selection system for investment bankers. Because the investment banking community is rather small, many of the hires were based on recommendations from current employees. Relying on recommendations from internal employees almost ensures the status quo with respect to demographic diversity. The line director wanted a systematic selection practice because he believed it would result in a more demographically diverse slate of applicants.

Selection Supports Talent Segmentation The strategy determines which jobs are A jobs and which are C jobs. Selection’s role is in recognizing which jobs are most critical, why they are so (e.g., because they directly impact the emergence of core competencies), and in ensuring an adequate supply of talent in those jobs (through attraction, selection, and/or retention). One problem facing selection researchers is that the C jobs are often the high-population jobs most suitable to local validation efforts. The truly critical jobs for an organization may encompass a relatively small population, making criterion-related validation efforts difficult at best. Research on the unit-level impact of alternative validation strategies is clearly needed. Although the strategy determines which are A jobs, performance largely determines who is considered an A player. Selection specialists can play a critical role in a firm’s success to the extent they can match A players with A jobs. Firms that are successful in placing their most talented individuals in the most mission-critical jobs should be more successful than those that take a more casual approach to placement. However, the role is not limited to selecting talent. It also involves deselecting marginal performers from critical jobs. If the job is truly essential to the execution of the firm’s strategy, then rigorous performance management, including the reallocation or separation of weak performers, becomes an important if unpleasant task.

Selection Helps Develop a Critical Mass
The assumption in staffing is that adding higher-quality employees will improve the firm's effectiveness in a linear manner. This is the basis of basic utility analysis. Yet SHRM research provides some theory to suggest that a unit must develop a critical mass of talent for it to be strategic. Unit-level competencies, such as customer service or conscientiousness (e.g., package delivery), are unlikely to emerge where only a few members of the unit possess the appropriate individual-level KSAOs. In essence, Figure 5.1 suggests that hiring only one or a few highly talented applicants is unlikely to produce sustained competitive advantage. Rather, there must be sufficient quantities and quality so that "human and social capital emerges" and hence can influence unit-level outcomes (Ployhart, 2006). If correct, this has two important implications. First, it means that human and social capital has a minimum threshold that must be passed to ensure contribution to sustained competitive advantage. This means that an adequate flow (attraction and retention) of talent becomes paramount. Second, it means that human and social capital might also have a maximum threshold,


or at least a point of diminishing returns. This would indicate a point where there is little relative value to be gained by more effective selection, at least to the extent retention is stable. There is almost no research that speaks to these issues, but such research is necessary because both implications are counter to utility analysis.
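To make the contrast with the linear assumption of utility analysis concrete, the brief sketch below compares a linear talent-outcome relationship with a hypothetical S-shaped emergence curve that has both a minimum threshold and a region of diminishing returns. The logistic form, the midpoint, and the steepness parameter are illustrative assumptions only; they are not estimates drawn from the chapter or the literature.

```python
import numpy as np

# Share of unit members possessing the focal KSAO at a high level (0 to 1).
talent_share = np.linspace(0, 1, 11)

# Linear view implicit in basic utility analysis: every additional high-KSAO
# hire adds the same increment to unit effectiveness.
linear_outcome = talent_share

# Hypothetical emergence view: little payoff until a critical mass is reached
# (midpoint of 0.5 here), then steep gains, then diminishing returns.
def emergence(share, midpoint=0.5, steepness=10.0):
    return 1.0 / (1.0 + np.exp(-steepness * (share - midpoint)))

emergent_outcome = emergence(talent_share)
# Rescale so both curves run from 0 to 1 for comparison.
emergent_outcome = (emergent_outcome - emergent_outcome.min()) / (
    emergent_outcome.max() - emergent_outcome.min())

for s, lin, emg in zip(talent_share, linear_outcome, emergent_outcome):
    print(f"talent share {s:.1f}: linear {lin:.2f}  emergent {emg:.2f}")
```

Under the emergent curve, the marginal value of one additional strong hire depends on where the unit already sits on the curve, which is precisely why both a minimum threshold and a point of diminishing returns run counter to standard utility logic.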

SELECTION’S SYMBIOTIC RELATIONSHIP WITH OTHER HR ACTIVITIES It has been argued that unit-level human capital resources can produce strategic benefits and that they are the result of the careful alignment of selection and other HR programs with the firm’s strategy. Selection is merely one way in which an organization can influence the talent it employs. Clearly, training and development activities can impact individual-level talent and ultimately unit competence (Ployhart & Hale, 2014). Performance management systems similarly can signal desired behaviors and shape the development and utilization of unit competencies. Compensation systems can be designed to reward the acquisition of individual knowledge, skills, and abilities that support unit competency (e.g., pay-for-knowledge systems) and can play an important role in recruitment and retention. The largest impact comes when all elements of the HR system are congruent in terms of effect; all point toward and support the development of the same unit-level competencies linked to the firm’s strategy. HR executives are under increasing pressure to demonstrate the value the function adds to the organization. Boards of directors are increasingly demanding evidence of an adequate return on investment for HR-related expenditures. To meet this challenge, practitioners have begun examining group-level outcomes as they relate to HR activities. For example, in the survey arena, linkage studies are increasingly common as HR executives seek to demonstrate that employee engagement and satisfaction are related to important outcomes. A similar approach is being applied to selection—demonstrating unit-level outcomes of individual-level selection systems (e.g., Kim & Ployhart, 2014; Ployhart et al., 2009; Ployhart et al., 2011; Van Iddekinge et al., 2009).

CONCLUSIONS
In this chapter, we have argued that selection scholarship must be expanded to consider how, when, and why personnel selection practices will contribute to creating business-unit-level value and sustained competitive advantage. We noted that in contrast to expectations, effective selection practices will not always translate into firm-level value. Table 5.1 summarizes the main implications of our chapter.

TABLE 5.1 Summary of Key Implications: Integrating Strategic Human Resources with Personnel Selection

1. Only unit-level human capital and social capital resources can offer strategic value to the firm. Personnel selection and selection on targeted individual-level KSAOs can only contribute to the firm's strategy insofar as they contribute to the emergence of strategically valuable unit-level human and social capital resources.
2. Individual-level criterion-related validity is insufficient evidence to demonstrate the strategic value of selection. It is necessary, but not sufficient. Poor use of selection can create a competitive disadvantage, but using selection procedures that predict individual job performance is no guarantee that personnel selection will contribute to the firm's sustained competitive advantage.
3. Validity at the individual level generalizes across contexts, but the validity of unit-level human capital and social capital does not generalize. The extent to which these unit-level competencies have relationships with unit outcomes depends on the firm's strategy, market, competitors, competitive environment, and related factors. Indeed, if unit-level human and social capital are to add value to the firm's competitive advantage, then the firm would not want these relationships to generalize to other firms!
4. Selection practices that rely more on firm-specific KSAOs may contribute to competitive advantage more strongly than practices that rely on more generic KSAOs. Cognitive ability and personality scores have generalizable validity, but the benefits of such KSAOs may not differentiate a firm's human capital resources as much as KSAOs that are more specific to a firm (e.g., KSAOs based on situational judgment or interviews).
5. Demonstrating the strategic value of personnel selection usually requires a longitudinal focus, because selection (and related HR) practices must be implemented appropriately over time to allow a critical mass of unit-level human and social capital to emerge.
6. A critical mass of aggregate KSAOs is necessary to influence unit-level consequences. This has three important implications. First, the development and sustainability of this critical mass are likely to be dynamic and nonlinear. Second, the relationships between unit-level human and social capital and unit-level outcomes are likely to be dynamic and nonlinear. Third, there is likely a threshold of diminishing returns: a tipping point beyond which adding more talent is unlikely to provide the same value to the firm.
7. Because performance criteria are not isomorphic between the individual and unit levels, different predictors of performance may be present at each level.

As a consequence of the issues summarized in Table 5.1, we have also tried to articulate ways through which selection practices can manifest such value. Much is currently written about how the HR profession needs to be part of the firm’s strategic decision-making team; this is even more true for I-O psychology. Being able to demonstrate selection’s contribution to the firm’s strategy is one way to accomplish this goal. Although the road toward establishing empirical connections between selection practices and business unit sustained competitive advantage will not be easy or quick, we believe it is vital for the future of our profession and I-O psychology’s own strategic direction.

ACKNOWLEDGMENTS
We thank James L. Farr, Nancy T. Tippins, John Campbell, Jerard F. Kehoe, Janice Molloy, and Ben Schneider for providing suggestions on earlier drafts of this chapter.

REFERENCES Anderson, N. R. (2005). Relationships between practice and research in personnel selection: Does the left hand know what the right is doing? In A. Evers, N. R. Anderson, & O. Smit-Voskuyl (Eds.), The Blackwell handbook of personnel selection (pp. 1–24). Oxford, England: Blackwell. Barney, J. B. (1991). Firm resources and sustained competitive advantage. Journal of Management, 17, 99–120. Barney, J. B., & Wright, P. M. (1998). On becoming a strategic partner: The role of human resources in gaining competitive advantage. Human Resource Management, 37, 31–46. Barrick, M. R., & Zimmerman, R. D. (2005). Reducing voluntary, avoidable turnover through selection. Journal of Applied Psychology, 90, 159–166. Becker, B. E., & Huselid, M. A. (2006). Strategic human resource management: Where do we go from here? Journal of Management, 32, 898–925. Becker, G. S. (1964). Human capital: A theoretical and empirical analysis with special reference to education. New York, NY: National Bureau of Economic Research. Bidwell, M. (2011). Paying more to get less: Specific skills, matching, and the effects of external hiring versus internal promotion. Administrative Science Quarterly, 56, 369–407. Bliese, P. D. (2000). Within-group agreement, non-independence, and reliability: Implications for data aggregation and analysis. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 349–381). San Francisco, CA: Jossey-Bass. Bowen, D. E., & Ostroff, C. (2004). Understanding HRM-firm performance linkages: The role of the “strength” of the HRM system. Academy of Management Review, 29, 203–221. Cascio, W. F., & Aguinis, H. (2008). Staffing twenty-first-century organizations. Academy of Management Annals, 2, 133–165. Chadwick, C., & Cappelli, P. (1999). Alternatives to generic strategy typologies in strategic human resource management. In P. M. Wright, L. D. Dyer, J. W. Boudreau, & G. T. Milkovich (Eds.), Research in personnel and human resources management, (Suppl. 4, pp. 1–29). Stamford, CT: JAI Press. Colbert, B. A. (2004). The complex resource-based view: Implications for theory and practice in strategic human resource management. Academy of Management Review, 29, 341–358. Combs, J., Yongmei, L., Hall, A., & Ketchen, D. (2006). How much do high-performance work practices matter? A meta-analysis of their effects on organizational performance. Personnel Psychology, 59, 501–528.


Strategy, Selection, and Competitive Advantage Delery, J. E., & Doty, D. H. (1996). Modes of theorizing in strategic human resource management: Tests of universalistic, contingency, and configurational performance predictions. Academy of Management Journal, 29, 802–835. DeNisi, A. S., Hitt, M. A., & Jackson, S. E. (2003). The knowledge-based approach to sustainable competitive advantage. In S. E. Jackson, M. A. Hitt, & A. S. DeNisi (Eds.), Managing knowledge for sustained competitive advantage (pp. 3–33). San Francisco, CA: Jossey-Bass. Dess, G. G., & Shaw, J. D. (2001). Voluntary turnover, social capital, and organizational performance. Academy of Management Review, 26, 446–456. Dierickx, I., & Cool, K. (1989). Asset stock accumulation and sustainability of competitive advantage. Management Science, 35, 1504–1511. Fulmer, I. S., & Ployhart, R. E. (2014). “Our most important asset:” A multidisciplinary/multilevel review of human capital valuation for research and practice. Journal of Management, 40, 161–192. Gerhart, B. (2005). Human resources and business performance: Findings, unanswered questions, and an alternative approach. Management Revue, 16, 174–185. Glebbeek, A. C., & Bax, E. H. (2004). Is high employee turnover really harmful? An empirical test using company records. Academy of Management Journal, 47, 277–286. Grant, R. M. (1996). Toward a knowledge-based theory of the firm. Strategic Management Journal, 17, 109–122. Groysberg, B. (2010). Chasing stars: The myth of talent and the portability of performance. Princeton, NJ: Princeton University Press. Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decision (2nd ed.). Mahwah, NJ: Erlbaum. Hausknecht, J. P., & Holwerda, J. A. (2013). When does employee turnover matter? Dynamic member configurations, productive capacity, and collective performance. Organization Science, 24(1), 210–225. doi:10.1287/orsc.1110.0720 Hitt, M. A., Bierman, L., Shimizu, K., & Kochhar, R. (2001). Direct and moderating effects of human capital on strategy and performance in professional service firms: A resource-based perspective. Academy of Management Journal, 44, 13–28. Huselid, M. A. (1995). The impact of human resource management practices on turnover, productivity, and corporate financial performance. Academy of Management Journal, 38, 635–672. Jiang, K., Lepak, D. P., Hu, J., & Baer, J. (2012). How does human resource management influence organizational outcomes? A meta-analytic investigation of mediating mechanisms. Academy of Management Journal, 55, 1264–1294. Jordan, M., Herriot, P., & Chalmers, C. (1991). Testing Schneider’s ASA theory. Applied Psychology, 40, 47–53. Kacmar, K. M., Andrews, M. C., Rooy, D. L. V., Steilberg, R. C., & Cerone, S. (2006). Sure everyone can be replaced . . . but at what cost? Turnover as a predictor of unit-level performance. Academy of Management Journal, 49, 133–144. Kim, Y., & Ployhart, R. E. (2014). The effects of staffing and training on firm productivity and profit growth before, during, and after the Great Recession. Monograph. Journal of Applied Psychology, 99, 361–389. Kozlowski, S. W. J., & Klein, K. J. (2000). A multilevel approach to theory and research in organizations: Contextual, temporal, and emergent processes. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions. (pp. 3–90). San Francisco, CA: Jossey-Bass. Latham, G. P., & Whyte, G. (1994). The futility of utility analysis. 
Personnel Psychology, 47, 31–46. Lepak, D. P., & Snell, S. A. (1999). The human resource architecture: Toward a theory of human capital allocation and development. The Academy of Management Review, 24, 34–48. Lepak, D. P., & Snell, S. A. (2002). Examining the human resource architecture: The relationships among human capital, employment, and human resource configurations. Journal of Management, 28, 517–543. Lievens, F. (2006). International situational judgment tests. In J. A. Weekley & R. E. Ployhart (Eds.), Situational judgment tests (pp. 279–300). Mahwah, NJ: Lawrence Erlbaum. Nahapiet, J., & Ghoshal, S. (1998). Social capital, intellectual capital and the organizational advantage. Academy of Management Review, 23, 242–266. Noe, R. A., Colquitt, J. A., Simmering, M. J., & Alvarez, S. A. (2003). Knowledge management: Developing intellectual and social capital. In S. E. Jackson, M. A. Hitt, & A. S. DeNisi (Eds.), Managing knowledge for sustained competitive advantage (pp. 209–242). San Francisco, CA: Jossey-Bass. Oh, I-S., Kim, S., & Van Iddekinge, C. H. (2015). Take it to another level: Do personality-based human capital resources matter to firm performance? Journal of Applied Psychology, 100, 935–947. Oldham, G. R. (2003). Stimulating and supporting creativity in organizations. In S. E. Jackson, M. A. Hitt, & A. S. DeNisi (Eds.), Managing knowledge for sustained competitive advantage (pp. 274–302). San Francisco, CA: Jossey-Bass. Ployhart, R. E. (2004). Organizational staffing: A multilevel review, synthesis, and model. In J. Martocchio (Ed.), Research in personnel and human resource management (Vol. 23, pp. 121–176). Oxford, England: Elsevier.


Robert E. Ployhart and Jeff A. Weekley Ployhart, R. E. (2006). Staffing in the 21st century. Journal of Management, 32, 868–897. Ployhart, R. E. (2012). The psychology of competitive advantage: An adjacent possibility. Industrial and Organizational Psychology: Perspectives on Science and Practice, 5, 62–81. Ployhart, R. E., & Hale, D., Jr. (2014). The fascinating psychological microfoundations of strategy and competitive advantage. Annual Review of Organizational Psychology and Organizational Behavior, 1, 145–172. Ployhart, R. E., & Moliterno, T. P. (2011). Emergence of the human capital resource: A multilevel model. Academy of Management Review, 36, 127–150. Ployhart, R. E., & Schneider, B. (2002). A multilevel perspective on personnel selection: Implications for selection system design, assessment, and construct validation. In F. J. Dansereau & F. Yammarino (Eds.), Research in multi-level issues Volume 1: The many faces of multi-level issues (pp. 95–140). Oxford, England: Elsevier Science. Ployhart, R. E., & Schneider, B. (2005). Multilevel selection and prediction: Theories, methods, and models. In A. Evers, O. Smit-Voskuyl, & N. Anderson (Eds.), Handbook of personnel selection (pp. 495–516). Oxford, England: Blackwell. Ployhart, R. E., & Schneider, B. (2012). The organizational context of personnel selection. In N. Schmitt (Ed.), The Oxford handbook of assessment and selection (pp. 48–67). Oxford: Oxford University Press. Ployhart, R. E., Nyberg, A. J., Reilly, G., & Maltrich, M. (2014). Human capital is dead; long live human capital resources! Journal of Management, 40, 371–398. Ployhart, R. E., Van Iddekinge, C. H., & MacKenzie, W. I., Jr. (2011). Acquiring and developing human capital in service contexts: The interconnectedness of human capital resources. Academy of Management Journal, 54, 353–368. Ployhart, R. E., Weekley, J. A., & Baughman, K. (2006). The structure and function of human capital emergence: A multilevel examination of the ASA model. Academy of Management Journal, 49, 661–677. Ployhart, R. E., Weekley, J. A., & Ramsey, J. (2009). The consequences of human resource stocks and flows: A longitudinal examination of unit service orientation and unit effectiveness. Academy of Management Journal, 52, 996–1015. Prahalad, C. K., & Hamel, G. (1990). The core competence of the corporation. Harvard Business Review, 68, 79–91. Ryan, A. M., McFarland, L. A., Baron, H., & Page, R. (1999). An international look at selection practices: Nation and culture as explanations for variability in practice. Personnel Psychology, 52, 359–391. Rynes, S. L., Brown, K. G., & Colbert, A. E. (2002). Seven misconceptions about human resource practices: Research findings versus practitioner beliefs. Academy of Management Executive, 16, 92–103. Sacco, J. M., & Schmitt, N. (2005). A dynamic multilevel model of demographic diversity and misfit effects. Journal of Applied Psychology, 90, 203–231. Saks, A. M. (2005). The impracticality of recruitment research. In A. Evers, O. Smit-Voskuyl, & N. Anderson (Eds.), Handbook of personnel selection (pp. 47–72). Oxford, England: Blackwell. Schaubroeck, J., Ganster, D. C., & Jones, J. R. (1998). Organization and occupation influences in the attraction-selection-attrition process. Journal of Applied Psychology, 83, 869–891. Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274. Schneider, B. (1987). 
The people make the place. Personnel Psychology, 40, 437–454. Schneider, B., Smith, D. B., & Sipe, W. P. (2000). Personnel selection psychology: Multilevel considerations. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions. (pp. 91–120). San Francisco, CA: Jossey-Bass. Schneider, B., White, S. S., & Paul, M. C. (1998). Linking service climate to customer perceptions of service quality: Test of a causal model. Journal of Applied Psychology, 83, 150–163. Schuler, R. S., Galante, S. P., & Jackson, S. E. (1987). Matching effective HR practices with competitive strategy. Personnel, 64, 18–27. Shaw, J. D., Duffy, M. K., Johnson, J. L., & Lockhart, D. E. (2005). Turnover, social-capital losses, and performance. Academy of Management Journal, 48, 594–606. Shaw, J. D., Gupta, N., & Delery, J. E. (2005). Alternative conceptualizations of the relationship between voluntary turnover and organizational performance. Academy of Management Journal, 48, 50–68. Strober, M. H. (1990). Human capital theory: Implications for HR managers. Industrial Relations, 23, 214–239. Takeuchi, R., Lepak, D. P., Heli, W., & Takeuchi, K. (2007). An empirical examination of the mechanisms mediating between high-performance work systems and the performance of Japanese organizations. Journal of Applied Psychology, 92, 1069–1083. Taylor, M. S., & Collins, C. J. (2000). Organizational recruitment: Enhancing the intersection of theory and practice. In C. L. Cooper & E. A. Locke (Eds.), Industrial and organizational psychology: Linking theory and practice (pp. 304–334). Oxford, England: Blackwell.


Strategy, Selection, and Competitive Advantage Terpstra, D. E., & Rozell, E. J. (1993). The relationship of staffing practices to organizational level measures of performance. Personnel Psychology, 46, 27–48. Ulrich, D., & Smallwood, N. (2005). HR’s new ROI: Return on intangibles. Human Resource Management, 44, 137–142. Van Iddekinge, C. H., Ferris, G. R., Perrewe, P. L., Perryman, A. Z., Blass, F. R., & Heetderks, T. D. (2009). Effects of selection and training on unit-level performance over time: A latent growth modeling approach. Journal of Applied Psychology, 94, 829–843. Whyte, G., & Latham, G. P. (1997). The futility of utility analysis revisited: When even an expert fails. Personnel Psychology, 50, 601–610. Wright, P. M., & Boswell, W. R. (2002). Desegregating HRM: A review and synthesis of micro and macro HR research. Journal of Management, 28, 247–276. Wright, P. M., Dunford, B. D., & Snell, S. A. (2001). Human resources and the resource-based view of the firm. Journal of Management, 27, 701–721. Wright, P. M., Gardner, T. M., Moynihan, L. M., & Allen, M. R. (2005). The relationship between HR practices and firm performance: Examining causal order. Personnel Psychology, 58, 409–446. Wright, P. M., & Haggerty, J. J. (2005). Missing variables in theories of strategic human resource management: Time, cause, and individuals. Management Revue, 16, 164–173.


6
WORK ANALYSIS
MICHAEL T. BRANNICK, KENNETH PEARLMAN, AND JUAN I. SANCHEZ

The purposes of this chapter are to describe and summarize the current state of the art with respect to work analysis as it applies to employee or personnel selection and to suggest expansions of such applications in light of emerging and anticipated changes in the world of work. We use the term “work analysis” broadly to refer to any systematic process for gathering, documenting, and analyzing information about (a) the content of the work performed by people in organizations (e.g., tasks, responsibilities, or work outputs), (b) the worker attributes related to its performance (often referred to as knowledge, skills, abilities, and other personal characteristics, or KSAOs), or (c) the context in which work is performed (including physical and psychological conditions in the immediate work environment and the broader organizational and external environment). Other terms, such as “job analysis,” “occupational analysis,” and “job specification” are often used, sometimes interchangeably and with somewhat varying definitions in different contexts, to refer to one or more of these activities. Our use of “work analysis” reflects our preference for a broader term that does not connote a focus on any particular aspect or unit of analysis in the study of work. This chapter is organized into major sections, including “Traditional Selection-Related Applications of Work Analysis” examines the primary applications of selection-oriented work analysis. “A Review of Major Work Analysis Methods and Approaches” provides a review and analysis of major historical work analysis methods that have been used to support personnel selection. “Key Work Analysis Practice Issues” is devoted to several key issues that arise in the practical application of work analysis to personnel selection. “Frontiers of Work Analysis: Emerging Trends and Future Challenges” discusses several emerging trends, issues, and challenges that we see as critical to the continuing and future relevance and utility of selection-oriented work analysis; we also consider applying work analysis techniques that have yet to be used for selection. “Synopsis and Conclusions” summarizes and draws some general conclusions on the basis of the material presented in the main sections.

TRADITIONAL SELECTION-RELATED APPLICATIONS OF WORK ANALYSIS
Work analysis is seldom an end in itself but is almost always a tool in the service of some application, a means to an end. We view it as axiomatic that options or alternatives regarding specific work analysis methods and practices cannot be meaningfully considered without also specifying their context or application, because this drives every facet of work analysis. Organizational goals and strategy should drive selection system goals and strategy, which in turn should drive work analysis strategy, which then serves as the basis for the many specific decisions involved in designing a particular work analysis system, program, or project.


Broadly speaking, the purpose of work analysis for personnel selection applications is to ensure that selection systems are work- or job-related, and hence valid, and thereby have value or utility for the organization, as well as being legally defensible. Within this context, four general categories of work analysis application can be distinguished, as follows.

Work Analysis for Predictor Development There are two phases to this application. First is the use of work analysis to make inferences about the person requirements of work, that is, to determine what worker attributes or KSAOs (perhaps also including at what level of proficiency) are needed to carry out the work. The second phase involves linking appropriate measures to the KSAOs generated in phase one, such as tests of particular abilities or skills.

Work Analysis for Criterion Development Work analysis provides the information needed to understand the content (work activities, behaviors, or outcomes) and context (both the broader organizational and more specific work setting) of work performance, and in so doing it provides the basis for developing work performance measures or standards. Such measures or standards can, in turn, serve as criteria for formally evaluating individual employee selection tools or an overall selection system; for example, in the context of a criterion-related validation study. These criterion measures often take the form of specific dimensions of work activity, along with associated rating scales, but can also take the form of more objectively observable indices (production or error rates) or work sample measures.

Work Analysis for Domain Sampling This refers to the application of work analysis to the development of content-related evidence in support of a selection procedure’s validity (more commonly referred to as “content validity”). For such applications, work analysis is used to define a work domain or a job’s content domain in terms of the important tasks, activities, responsibilities, or work behaviors performed and their associated worker requirements, or KSAOs. Measures of the work content, or a selected subset of associated KSAOs, are then developed and judgmentally linked back to the content domain by subject matter experts (SMEs). Essentially, an argument is made that the test content samples the job domain in a representative way. Validity is a function of the strength of these (measure-content domain) linkages, which in turn are a function of the quality of the original work analysis, the quality of the experts, and the rigor of the linkage process.
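One common way to document such linkages is a matrix of SME relevance judgments in which each assessment component is rated against each important element of the content domain, with the ratings weighted by the importance of those elements. The sketch below is a generic, hypothetical illustration of that bookkeeping; the rating scale, the importance weights, and the adequacy cutoff are invented for the example and are not prescribed standards.

```python
import numpy as np

# Hypothetical SME linkage ratings: rows = assessment components,
# columns = important work behaviors from the work analysis,
# values = mean SME relevance rating on a 0 (not relevant) to 3 (essential) scale.
components = ["situational judgment test", "work sample", "job knowledge test"]
behaviors = ["handle customer complaints", "schedule crews", "complete safety checks"]
ratings = np.array([
    [2.8, 1.5, 0.7],
    [1.2, 2.9, 2.4],
    [0.4, 1.1, 2.7],
])

# Importance weights for each work behavior, also taken from the work analysis
# (here expressed as proportions of total importance).
importance = np.array([0.5, 0.3, 0.2])

# Importance-weighted linkage strength for each assessment component.
linkage_strength = ratings @ importance
for comp, strength in zip(components, linkage_strength):
    flag = "adequate" if strength >= 1.5 else "weak"  # 1.5 cutoff is illustrative only
    print(f"{comp}: weighted linkage = {strength:.2f} ({flag})")
```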

Work Analysis for Validity Evidence Extension This refers to the application of work analysis to methods for inferring a selection procedure’s validity for a given job or work setting without actually conducting a validation study in the new setting. Three distinguishable approaches for justifying such inferences are commonly recognized: (a) validity generalization or meta-analysis, (b) synthetic or job-component validity, and (c) validity transportability (McPhail, 2007). Validity generalization, a form of meta-analysis applied specifically to the personnel selection context (Pearlman, Schmidt, & Hunter, 1980), involves a series of statistical analyses performed on a set of criterion-related correlation coefficients accumulated from archival sources to determine (a) a reliable estimate of association for the predictor-criterion (or job) combination represented in the set and (b) the degree to which this estimate varies by location (see Chapter 4). Synthetic validity encompasses several methods that


Michael T. Brannick et al. involve the use of structured work analysis questionnaires that can be scored in terms of work dimensions or job components, each of which has pre-established relationships (generally via prior criterion-related validation work or expert estimation) with one or more predictor constructs or measures (e.g., cognitive abilities). When a new job is analyzed with this questionnaire and its component dimensions are scored, predictor measure validity for the new job can be inferred or computed from these known predictor-criterion relations (McPhail, 2007). Validity transportability refers to use of a specific selection procedure in a new situation based on results of a validation study conducted elsewhere. Work analysis is fundamental to establishing the strength of the inference of job-relatedness in all three approaches to validity evidence extension, albeit in different ways. For validity generalization and validity transportability, the key issue is establishing similarity between a job (or job group) for which validity evidence has been obtained in one setting and the target job to which one wishes to generalize or “transport” that evidence. Consequently, the key question is, “How similar is similar enough?” Research suggests that a relatively molar or high-level work analysis (e.g., sufficient to classify the target job into a broadly defined job family) may be sufficient for some applications, because even relatively large task differences between jobs do not moderate the validity of many types of ability tests (Schmidt, Hunter, & Pearlman, 1981); however, data are sparse for tests of more specific knowledge, skills, and other characteristics. By its nature, synthetic validity requires a fairly detailed work analysis; however, the precise form of the work analysis is dictated by the characteristics of the focal work analysis questionnaire on which the predictor–job dimension relationships were originally established. Validity transportability applications legally require the target job to consist of “substantially the same major work behaviors” as that (or those) on which the focal validation work was conducted (Equal Employment Opportunity Commission, 1978), implying the need for at least a somewhat detailed level of work analysis. Such issues, as well as various approaches to job similarity evaluation, are considered in depth elsewhere (Harvey, 1986; Pearlman, 1980; Sackett, 2003).
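The bare-bones computation behind the first of these approaches can be sketched briefly. The example below uses invented study results and only the simplest "bare-bones" formulas from the psychometric meta-analysis tradition (a sample-size-weighted mean validity, the observed variance across studies, and the variance expected from sampling error alone); it omits the corrections for range restriction and criterion unreliability that a full validity generalization analysis would apply, so it should be read as a rough illustration rather than a recommended procedure.

```python
import numpy as np

# Hypothetical archival studies: observed validity (r) and sample size (N).
r = np.array([0.22, 0.31, 0.18, 0.27, 0.35])
n = np.array([120, 85, 200, 150, 60])

# Sample-size-weighted mean validity.
r_bar = np.sum(n * r) / np.sum(n)

# Observed (weighted) variance of validities across studies.
var_obs = np.sum(n * (r - r_bar) ** 2) / np.sum(n)

# Variance expected from sampling error alone (bare-bones approximation).
var_error = (1 - r_bar ** 2) ** 2 / (np.mean(n) - 1)

# Residual variance: variability not explained by sampling error.
var_residual = max(var_obs - var_error, 0.0)

print(f"weighted mean validity: {r_bar:.3f}")
print(f"observed variance: {var_obs:.4f}, sampling-error variance: {var_error:.4f}")
print(f"residual variance: {var_residual:.4f} "
      f"({100 * min(var_error / var_obs, 1):.0f}% of variance attributable to sampling error)")
```

If most of the observed variance is attributable to sampling error (and to the artifacts a full analysis corrects for), the case for generalizing the pooled validity estimate to a similar target job is strengthened; the similarity judgment itself, as noted above, still rests on work analysis.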

A REVIEW OF MAJOR WORK ANALYSIS METHODS AND APPROACHES

Work Analysis Information Framework Work analysis methods can be broadly differentiated in terms of the process they use to compile, analyze, and present work analytic information and the content of such information. Work analysis processes can in turn be broadly differentiated in terms of whether they are primarily qualitative or quantitative. Qualitative approaches build up work or job information, often from scratch (e.g., on the basis of observing or interviewing job incumbents to determine specific tasks performed) and generally one job at a time, yielding detailed, narrative information that is customized to individual jobs or specific work within an organization. Quantitative approaches are generally based on the use of structured work analysis questionnaires or surveys consisting of pre-established lists of different types of work or job descriptors (i.e., work characteristics or units of analysis, such as work behaviors, worker functions, or KSAOs). These usually include rating scales that permit subject matter experts (SMEs; incumbents, supervisors, or job analysts) to quantify their judgments about individual descriptors along dimensions of interest (e.g., performance frequency, importance, level of complexity, and consequences of error). Work analysis content refers to the types of work descriptors used and the level of analysis or detail represented by these descriptors. McCormick (1979) has usefully distinguished among three broad descriptor categories, including (in slightly adapted form) (1) work-oriented content descriptors, in which the descriptive frame of reference is the work to be done, including the purpose, steps, tools and materials, required resources, and conditions under which work is accomplished (e.g., tasks, activities, duties, responsibilities, working conditions, and work outputs); (2) worker-oriented content descriptors, in which the descriptive frame of reference is what workers do to carry out work (e.g., worker functions, processes, or behaviors); and (3) attribute requirement descriptors, in which the descriptive frame of reference is the attributes needed by a worker to



TABLE 6.1 Work Analysis Information Framework
(Work descriptor category by level of description/analysis)

Work-oriented content
  Broad: Major duties; Major responsibilities
  Moderate: Task clusters or work activities; Work functions or processes; Material, equipment, tool, and machine categories
  Specific: Tasks or work steps; Work outputs; Specific materials, equipment, tools, and machines; Work-content-based performance standards; Specific work environment features and working conditions

Worker-oriented content
  Broad: Position/job/occupational titles
  Moderate: Generalized work behaviors; Worker functions or processes
  Specific: Worker-behavior-based performance standards; Behavioral indicators

Attribute requirements
  Broad: Personality traits, values, and interests; Aptitudes and abilities
  Moderate: Generic or cross-functional skills
  Specific: Specialized/technical skills; Specialized/technical knowledge

do the specified work (e.g., skills, knowledge, abilities, and temperaments or dispositions). Distinctions among these categories can at times be blurry because of some natural overlap and the unavoidable imprecision of language (e.g., the second and third categories are sometimes considered as a single “worker-oriented” or “attribute” category); nonetheless, they have proven to be conceptually useful. Work descriptors can be further differentiated in terms of the level of analysis or description reflected by a particular category of descriptor. Work-oriented content descriptors can range from narrow and specific (such as tasks performed) to broad and general (such as major duties or responsibilities), as can worker-oriented content descriptors (ranging from specific behavioral performance indicators to broad position or job titles). Similarly, attribute requirement descriptors can be represented by very narrowly defined characteristics (specialized skills or knowledge) or very broadly defined human attributes (such as aptitudes and abilities or personality traits). Table 6.1 illustrates how type of content and level of detail of work analysis play out in terms of specific descriptor types. It provides examples of descriptors representing each work descriptor category within each of three levels of analysis. Note that the qualitative-quantitative work analysis process distinction is not represented here but is represented in our later expansion of this table and associated text discussion of specific work analysis methodologies.

Review of Specific Work Analysis Methods, Systems, and Approaches Because of space limits, our review considers “tried and true” methods only briefly. There are many sources of information on such methods (Brannick, Cadle, & Levine, 2012; Brannick, Levine, & Morgeson, 2007; Gael, 1988; Gatewood, Feild, & Barrick, 2008; Peterson, Mumford, Borman, Jeanneret, & Fleishman, 1999; Wilson, Bennett, & Gibson, 2012). Additionally, several authors provide a historical account of the development of work analysis (Mitchell & Driskell, 1996; Primoff & Fine, 1988; Wilson, 2007). Thus, rather than providing details of each technique, we provide a brief description of several of them, along with a summary table and a set of more specific references for the interested reader.



TABLE 6.2 Work Analysis Information Framework
(Work descriptor category by level of description/analysis)

Work-oriented content
  Broad: Job Diagnostic Survey; PPRF Work Styles; PIC Checklist
  Moderate: Minnesota Job Description Questionnaire
  Specific: TI/CODAP (a)

Worker-oriented content
  Broad: Dictionary of Occupational Titles classification structure
  Moderate: Position Analysis Questionnaire
  Specific: Cognitive Task Analysis (a); Critical Incident Technique (a)

Attribute requirements
  Broad: Fleishman Ability Requirements Scales; Holland Interest Taxonomy
  Moderate: SCANS; Work Keys
  Specific: Job Element Method (a); CIP-2000 Knowledge Taxonomy

Hybrid (multidescriptor and/or multilevel)
  Competency Modeling (a); Functional Job Analysis (a); MOSAIC (a); O*NET (a); SHL Universal Competency Framework; Strategic Job Modeling (a)

(a) Method generates some or all of its information using qualitative processes (see earlier text discussion).

Table 6.2 illustrates how work analysis methods correspond to our framework; the methods contained in the table are intended to show the diversity of what is possible rather than to prescribe methods to the reader. Table 6.2 contains a row that Table 6.1 does not in order to account for hybrid methods that cut across contents. Some tabled items represent specific and well-defined instruments, methods, or programs, whereas others represent more general systems or approaches. The process aspect of these methods, discussed earlier, is represented in Table 6.2 by footnotes following methods that, partly or wholly, use qualitative data generation techniques; all other (nonfootnoted) methods use entirely quantitative methods.

Work-Oriented Content At the broad end of the spectrum, the Job Diagnostic Survey (JDS; Hackman & Oldham, 1976) considers broad characteristics of jobs, such as the degree to which the job is autonomous. At the narrow end of the spectrum, Task Inventory/Comprehensive Occupational Data Analysis Programs (TI/CODAP) refers to a collection of computer programs and applications that analyze quantitative data collected from standardized task inventories. Following initial implementation in the U.S. Air Force in 1967, it was eventually adopted as the primary work analysis tool for all branches of the military and has expanded into widespread use in academia, business, industry, and federal, state, and local governments (Christal & Wiessmuller, 1988). An example of a moderately detailed method is the Minnesota Job Description Questionnaire (Dawis, 1991; Doering, Rhodes, & Kaspin, 1988; Tinsley & Weiss, 1971), which, like the JDS, contains a scale for autonomy but also contains related but more fine-grained scales such as independence, creativity, and achievement.

Worker-Oriented Content The Dictionary of Occupational Titles (DOT) occupational classification structure that came to fruition in the DOT’s third edition (U.S. Department of Labor, 1965a) represents the synthesis of two classic work analytic methods—the Labor Department’s analyst-based methodology


Work Analysis (Droege, 1988) and the Data-People-Things scales of Functional Job Analysis (Fine & Cronshaw, 1999). The DOT contains descriptors that would allow us to place it in multiple spots in the table, but it is placed in its current position because of extensive worker requirements information (education, aptitudes, knowledge, interests, temperaments, and physical demands) (U.S. Department of Labor, 1965b). The Position Analysis Questionnaire (PAQ) is a structured questionnaire that describes jobs in terms of 27 standardized worker-oriented dimensions that are neither highly specific nor extremely broad and are common across nearly all jobs (McCormick, Jeanneret, & Mecham, 1972). It thus lends itself well to quantitative cross-job comparisons, job family development, and validity evidence extension applications; in particular, synthetic validity and validity transportability. The Critical Incident Technique (Flanagan, 1954) involves SMEs recalling actual incidents of notably good and poor performance (observed behavior) in a target job, and analysts subsequently sorting and grouping these incidents by theme to develop job-specific selection tools (as well as performance measures, training programs, and other applications). Cognitive task analysis (Schraagen, Chipman, & Shalin, 2000) is a collection of techniques aimed at understanding how experts represent and process information during task performance (e.g., troubleshooting). Because they are closely coupled with specific tasks, cognitive task analyses tend to be job specific.
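The kind of quantitative cross-job comparison such standardized dimension scores permit can be illustrated with a short sketch. The dimension labels and scores below are invented (they are not PAQ dimensions or norms), and the two similarity indices shown, a profile correlation and a Euclidean distance, are only two of many possible approaches to job similarity evaluation.

```python
import numpy as np

# Hypothetical standardized dimension scores (e.g., z-scores) for two jobs
# on a small set of worker-oriented dimensions.
dimensions = ["information input", "mental processes", "work output",
              "relationships with others", "job context"]
job_a = np.array([0.9, 1.2, -0.3, 0.4, -0.8])
job_b = np.array([0.7, 1.0, -0.1, 0.6, -0.5])

# Profile correlation: similarity in the *shape* of the two profiles.
shape_similarity = np.corrcoef(job_a, job_b)[0, 1]

# Euclidean distance: dissimilarity in both shape and elevation.
distance = np.linalg.norm(job_a - job_b)

print(f"profile correlation: {shape_similarity:.2f}")
print(f"euclidean distance: {distance:.2f}")
# Whether such values are "similar enough" to justify transporting validity
# evidence is a judgment call; no universal cutoff exists.
```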

Attribute Requirements The Fleishman Ability Requirement Scales (ARS) methodology is an outgrowth of several lines of programmatic research involving task characteristics and human ability taxonomies begun in the 1960s (Fleishman & Quaintance, 1984). This led to the development of a highly construct-valid set of 52 cognitive, physical, psychomotor, and sensory abilities, along with associated rating scales based on empirically calibrated task or behavioral anchors used to evaluate the ability requirements of tasks, broader job components, or entire jobs. Another broad set of individual differences intended to apply to all jobs is the Holland interest taxonomy, which sorts occupations into broad categories based on the kinds of interests that people share within the occupations (Holland, 1973, 1997). A taxonomy particularly well suited to managerial work is the SHL Universal Great 8 Competencies and sub-competencies (Bartram, 2005), which focuses on managerial functions such as analyzing and interpreting, creating and conceptualizing, and organizing and executing. Two approaches were designed to analyze jobs by personality requirements: the Personality-Related Position Requirements Form (PPRF; Raymark, Schmit, & Guion, 1997), based on the Big Five personality theory, and the Performance Improvement Characteristics (PIC) Checklist (J. Hogan, Davies, & Hogan, 2007), which aligns with the Hogan Personality Inventory (R. Hogan & Hogan, 1992). Moderately detailed approaches to identifying attribute requirements include the Secretary’s Commission on Achieving Necessary Skills (SCANS, 1992) and Work Keys (McLarty & Vansickle, 1997). Finer-grained approaches include the Job Element Method (Primoff, 1975) and the CIP-2000 knowledge taxonomy, which is embodied in the U.S. Department of Education’s current Classification of Instructional Programs (U.S. Department of Education, 2002).

Hybrid Systems Several work analysis approaches aim to be useful for more than one purpose. Hence, they incorporate both work-oriented and worker-oriented descriptors. Functional Job Analysis (Fine & Wiley, 1971) generates carefully structured qualitative information about what a worker does (tasks) and quantitative information about how a task is performed in terms of the cognitive, interpersonal, and physical functions of a worker, as measured by hierarchically organized rating scales for data (information or ideas), people (coworkers, customers), and things (machines,


Michael T. Brannick et al. equipment), as well as by additional rating scales for “worker instructions” (degree of work discretion) and general educational development, including reasoning, mathematical, and language demands (Fine & Cronshaw, 1999). The U.S. Office of Personnel Management’s Multipurpose Occupational Systems Analysis Inventory-Close-Ended (MOSAIC) system (Rodriguez, Patel, Bright, Gregory, & Gowing, 2002), which has become the federal government’s primary work analysis system, is a multipurpose, automated, survey-based work analysis approach used to simultaneously collect information (from incumbents and supervisors) on many jobs within a broad occupational area. The Occupational Information Network (O*NETTM), treated in depth elsewhere in this volume (see Chapter 40), was developed by the U.S. Department of Labor in the mid-1990s (Peterson et al., 1999). O*NET was premised on the need for a comprehensive, theoretically based, and empirically validated common language that represented a hierarchical, taxonomic approach to work description and would therefore be capable of describing the characteristics of work and workers at multiple levels of analysis. Its centerpiece is the O*NET Content Model, which serves as the blueprint and integrating framework for the various descriptive components of the system. The content model encompasses six major domains of job descriptors representing some 20 individual job descriptor categories or taxonomies that reflect more than 270 work- and worker-oriented descriptors on which data are collected from trained analysts and job incumbents by means of structured questionnaires and surveys. Strategic Job Modeling (SJM) is a term coined by Schippman (1999) to describe an approach to work analysis that can serve as a basis for integrated HR systems. Its core is a conceptual framework outlining the key descriptive elements on the work and person sides of the performance equation. The approach consists of a series of steps, suggested guidelines, procedures, and work aids for obtaining information on these descriptors within the context of a given SJM project. Competency modeling (Schippman et al., 2000) is a form of work analysis that attempts to define constructs that apply across jobs and that distinguish superior performers; it attempts to connect the values of the organization to the behaviors of its employees. We describe this method in more detail in a later section. Here we merely note that competency modeling describes a host of activities that vary in rigor and managerial intent.

Work Analysis Methods Review: Some Practical Implications and Broader Perspectives The preceding review indicates that practitioners have many choices when considering what method will best support a particular personnel selection application. There are several possible ways to narrow these choices. One is to view the specific selection-related applications discussed previously in terms of our work analysis information framework, as illustrated in Table 6.3. Broadly speaking, predictor development applications are likely to be best served by attribute requirement descriptors at any level of analysis, because selection tools are most commonly designed as measures of work-related KSAOs. Criterion development applications are best served either by work-oriented content descriptors at any level of analysis or worker-oriented content descriptors at a relatively specific level of analysis, because such information provides the most useful basis for developing relevant work performance measures. Validity evidence development (content validity) applications are best served by work-oriented content or attribute requirement descriptor information at a specific or possibly moderate level of analysis in which the necessary linkages between specific work content and attribute measures can be most readily made and documented. Existing research suggests that validity evidence extension applications are likely to be best served by worker-oriented content or attribute requirement descriptors at moderate or broad analytic levels for validity generalization and work- or worker-oriented content descriptors at a moderate level of analysis for synthetic validity, whereas legal considerations suggest that worker-oriented content descriptors at specific or moderate levels of analysis are most appropriate for validity transportability.



TABLE 6.3 Descriptor Appropriateness for Selection-Related Work Analysis Applications
(Work descriptor category by level of analysis)

Work-oriented content
  Broad: CD
  Moderate: CD, VD, VE-SV
  Specific: CD, VD

Worker-oriented content
  Broad: VE-VG
  Moderate: VE-VG, VE-SV, VE-VT
  Specific: CD, VE-VT

Attribute requirements
  Broad: PD, VE-VG
  Moderate: PD, VD, VE-VG
  Specific: PD, VD

Note. CD = criterion development applications; PD = predictor development applications; VD = domain sampling (content validity) applications; VE = validity evidence extension applications, encompassing VE-VG (validity generalization), VE-SV (synthetic validity), and VE-VT (validity transportability).

Reflecting more broadly on this review, it appears that work analysis is at a crossroads—one rooted in the fact that whereas work in many economic sectors has been changing a lot, work analysis methodology has been changing only a little. The general phenomenon of the changing nature of work, workers, and the workplace resulting from broader economic, demographic, and technological changes has been extensively described and documented for at least the last 20 years (Coates, Jarratt, & Mahaffie, 1990; Johnston & Packer, 1987), as have its specific ramifications for many organizational and HR practices, including personnel selection and work analysis (Offermann & Gowing, 1993; Pearlman & Barney, 2000; Sanchez, 1994). Rather than slowing down or stabilizing, the pace of such changes, if anything, appears to be accelerating (Landy, 2007). However, for work analysis, after periods of substantial innovation in the mid-20th century, with a few exceptions the last several decades have been largely ones of methodological refinements, variations, and new combinations of tried-and-true methods. Like all organizational practices, work analysis must continue to adapt and evolve to maintain its relevance and utility. Although the traditional concept of a job may not be “dead,” as some have argued (Bridges, 1994), its changed settings and dynamics have created new and formidable challenges for traditional work analysis assumptions and practices. Among those who have speculated about this (Cunningham, 2000; Fogli & Whitney, 1998; Levine & Sanchez, 2007; Pearlman & Barney, 2000), there has been some consensus that such challenges imply the need for work analysis methods with a greater ability to capture such things as (a) strategic and future work requirements that are based on a more macro, top-down (i.e., organization-level) than a micro, bottom-up (i.e., individual- and job-level) orientation; (b) broader and multiple work roles and work processes rather than individual jobs and work content; (c) broader sets of worker attributes (e.g., personality, attitudes, and values) relevant to contextual performance (Borman & Motowidlo, 1993), team and organizational performance outcomes, and task and individual performance outcomes; and (d) important elements of the broader work and organizational environment, as well as to incorporate interdisciplinary perspectives and methodological innovations that would facilitate and enhance such pursuits. (We elaborate on many of these points later in this chapter.)

KEY WORK ANALYSIS PRACTICE ISSUES

Data Collection Issues in Work Analysis

Work Analysis Data Sources

The choice of data sources, like all methodological decisions in work analysis, should be driven by the specific purposes and goals of the analysis. Among the more commonly used sources


of work analysis data (incumbents, supervisors, higher-level managers, job analysts, training specialists, or other technical experts), job incumbents are by far the most frequently used. However, there is an important distinction to be made between a job and its incumbents. Jobs are social and organizational constructions, abstractions based on specific sets of responsibilities required of one or more job incumbents, such that each incumbent of the same job is charged with performing the same set of responsibilities. Hence, jobs are actually highly dynamic (sets of responsibilities change over time, and not all incumbents carry them out in the same way) and relativistic—a single task for one incumbent (“making sandwiches” for a short-order cook in a diner) may constitute an entire job for another (“sandwich-maker” in a specialized gourmet deli). However, most traditional methods of work analysis implicitly assume the existence of an absolute, or reasonably stable, job as at least a “convenient truth.” It is therefore not surprising that large numbers of observers of this “job reality” often have been enlisted in the work analysis process so as to mitigate the bias and idiosyncrasies of individual observers by combining and averaging observations of the same “object” (their job) from multiple vantage points.

Under the assumption that those closest to the behavioral realities of the job are its most objective sources, incumbent ratings of work-analytic units such as job tasks and KSAOs are often preferred to the ratings of nonincumbents (e.g., trained analysts, supervisors, and psychologists) because of their higher “face validity” and acceptability among the end users of such data. In other words, there is a widespread assumption in work analysis that job incumbency is necessary and even sufficient to ensure valid ratings. However, there is no convincing body of evidence backing such a belief. That is, selection procedures based on incumbent ratings have not been found to be more valid or effective than those based on nonincumbent ratings (Sanchez, 2000). Moreover, several logical arguments can be made regarding the potential disadvantages of using incumbents (greater susceptibility to various social influence and impression management biases) and the potential advantages of using nonincumbents (greater objectivity) as sources of work analysis ratings under different circumstances (Morgeson & Campion, 1997).

The argument for expanding the range of data sources beyond incumbents is further strengthened by several characteristics of many contemporary work settings, such as the need for workers to span functional boundaries and an emphasis on teamwork and customer service. This in turn suggests that internal and external customers, suppliers and vendors, and other colleagues and points of coordination along a product or service delivery chain could add value as sources of information about a given job, work function, or work process in a more broadly conceived, 360-degree approach to work analysis.
However, when nonincumbents are used to provide work analysis ratings, it is important that they have sufficient opportunity to gain first-hand familiarity with the focal job (or other unit of analysis involved), such as through job observation or interviews, rather than relying solely on written material (lists of tasks, duties, and responsibilities), which is unlikely to provide the insight necessary for well-informed judgments (e.g., when estimating KSAO requirements).

Work Analysis Data Types and Level of Analysis

Although many content approaches were described in Table 6.2, the analysis of highly cognitive and highly interpersonally oriented work content remains a challenge, because these domains involve processes that are not easily observed (but we discuss cognitive task analysis later in this chapter). The personality requirements of work also have not been well represented in traditional work analysis methods, although there are signs of progress in this area, as described earlier. Finding ways to represent and capture the dynamic nature of work has been a longstanding problem in work analysis. Methods from other disciplines are available for describing dynamic work processes, such as the flow, interaction, and strategic impact of work processes across functions and time. One such method is work process mapping (Brannick et al., 2007), in which


relationships of tasks and work roles to one another and to specific work goals are displayed in flowchart form. Several innovative approaches along these lines have been detailed by Barney (2000), such as impact mapping and “strategic modeling scales,” analytic methods for linking work tasks and worker attributes to an organization’s broad strategic goals. Such techniques could be helpful supplements to more traditional work description but have not as yet found their way into mainstream work analysis practice.

Work Analysis Data Collection Methods

An overarching issue affecting any data collection process involving human sources is the potential for distortion of job information, with or without conscious intent (Morgeson & Campion, 2000). For example, incumbents may be motivated for various reasons to present a positive image of their jobs and are thus prone to inflating their ratings of task or KSAO importance. The reverse is also possible, as in the case of information sources such as supervisors or second-level managers underrating the importance of their subordinates’ job responsibilities so as to elevate the importance of their own roles.

We recommend the use of more than one data collection methodology whenever possible (e.g., interviews followed by a structured questionnaire), making it possible to check for convergence between the information gathered through different methods. This may not be as impractical as it might initially seem, because the development of a structured work analysis survey generally requires the collection of qualitative data via interviews or job observation as input into the survey development process. When sources agree on the nature of the job, all is well. When sources disagree, some detective work may be needed to understand why. When quantitative information is gathered from different sources (e.g., incumbents and supervisors), some means of combining or blending the information may be necessary. For example, if either incumbents or supervisors deem a skill to be important, then it should be included when screening applicants. However, we are unaware of research that provides guidance in such matters.
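To make the disjunctive rule just mentioned concrete, the following minimal sketch flags a KSAO for screening when either incumbents or supervisors, on average, rate it as important. The ratings, the 3.5 cutoff, and the rule itself are hypothetical illustrations; as noted above, research offering guidance on how best to combine sources is lacking.

```python
# Hypothetical mean importance ratings (1 = not important, 5 = critical).
incumbent_means = {"customer empathy": 4.4, "report writing": 2.1, "filing paperwork": 2.0}
supervisor_means = {"customer empathy": 3.2, "report writing": 4.0, "filing paperwork": 2.5}

THRESHOLD = 3.5  # illustrative cutoff for treating a KSAO as "important"

def retain_for_screening(incumbents, supervisors, threshold=THRESHOLD):
    """Disjunctive rule: keep a KSAO if either source rates it at or above the threshold."""
    return [ksao for ksao in incumbents
            if incumbents[ksao] >= threshold or supervisors.get(ksao, 0) >= threshold]

print(retain_for_screening(incumbent_means, supervisor_means))
# -> ['customer empathy', 'report writing']; 'filing paperwork' falls below both sources' cutoffs
```

An averaging or conjunctive ("both sources") rule could be substituted in the same function; the point is simply that whatever combination rule is used should be made explicit and documented.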

Inferential Leaps and Linkages in Work Analysis

Four types of inferential leaps come into play when work analysis is applied to employee selection: (1) the translation of work content information into worker attribute information, (2) the translation of work content information into work performance or criterion measures, (3) the translation of worker attributes into actual selection instruments, and (4) the inferential leap between selection instruments and performance measures (Gatewood et al., 2008). Each of these leaps is illustrated in the context of domain sampling (i.e., content validity) applications of work analysis. For example, the development of professional or occupational certification or licensure tests typically begins with a detailed analysis of a job or occupation in terms of its tasks, duties, and responsibilities. Subsequently, this information is used to infer the important knowledge components (and their relative weights) of the occupation (inference 1) and may at times be used to develop one or more performance or criterion measures (inference 2) for use in future criterion-related validation studies conducted to augment the content validity evidence. The knowledge requirements are then translated into a test plan intended to measure all of the required knowledge areas in the appropriate proportions (inference 3). The test’s content validity may be established through several means, including appropriate documentation of all the steps and judgments just described, the use of one or more statistical indices (Lindell & Brandt, 1999) available for evaluating the strength of the test-performance relationship (inference 4) on the basis of item-level job-relatedness judgments of SMEs, and eventual examination of the empirical relationship between test and job performance (also inference 4).

Content validity applications notwithstanding, inference (1)—the use of job or work information to determine the worker attributes needed to effectively carry out a job’s activities and


perform its required behaviors—has received the most attention in the general application of work analysis to predictor development. The basic idea is to analyze both the job and the person into components (tasks for the job, abilities for the person) such that lawful patterns emerge, confirming hypothesized job requirements. In some applications, such as in the O*NET system, the importance of various attribute requirements is estimated directly by incumbents with the aid of task- and behavior-anchored rating scales. Other researchers have proposed an explicit set of “linkage” ratings (Goldstein, Zedeck, & Schneider, 1993) for determining KSAO importance on the basis of the strength of the judged relationship between individual tasks and individual KSAOs. Research has shown that SMEs are indeed capable of reliably estimating linkage ratings, although analysts’ judgments may be somewhat more reliable than those of incumbents (Baranowski & Anderson, 2005). Still another stream of research has explored the covariance between task and KSAO ratings, even suggesting the possibility of empirical derivation of KSAOs from task ratings (Arvey, Salas, & Gialluca, 1992; Goffin & Woycheshin, 2006; Sanchez & Fraser, 1994). Fleishman and colleagues showed how abilities can be mapped (through judgment) onto empirically derived dimensions of task content (Fleishman & Quaintance, 1984), whereas McCormick’s PAQ research (McCormick et al., 1972) uncovered empirical relationships between job elements and worker attributes via factor analysis. More recent research along similar lines using the initial O*NET database also found meaningful linkages among a wide range of work and worker characteristics, including abilities, generalized work activities, work styles, knowledge, occupational values, skills, and working conditions such as a requirement to work during holidays (Hanson, Borman, Kubisiak, & Sager, 1999).

In summary, there is strong evidence of meaningful covariation between ratings of different work-oriented, worker-oriented, and attribute-oriented descriptors, suggesting that SMEs’ inferences about attribute requirements are well grounded in their judgments of work activities or job tasks. Such findings provide reasonably strong underpinnings for the various methods and techniques that have been developed and used to make such linkages in practice.
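To make the linkage-rating logic concrete, the sketch below shows one common aggregation: each KSAO's overall importance is computed as a task-importance-weighted average of SME task-to-KSAO linkage ratings. The tasks, KSAOs, scale points, and the specific weighting rule are assumptions chosen for illustration; they are not taken from the studies cited above.

```python
import numpy as np

tasks = ["take customer orders", "resolve complaints", "restock shelves"]
ksaos = ["oral communication", "stress tolerance", "physical stamina"]

# SME task importance ratings (1-5) and task-by-KSAO linkage ratings
# (0 = KSAO not needed for the task, 3 = essential to the task).
task_importance = np.array([4.5, 4.0, 2.5])
linkage = np.array([
    [3, 1, 0],   # take customer orders
    [3, 3, 0],   # resolve complaints
    [0, 1, 3],   # restock shelves
])

# Weighted average: KSAO importance = sum(task importance * linkage) / sum(task importance)
ksao_importance = task_importance @ linkage / task_importance.sum()

for name, value in zip(ksaos, ksao_importance):
    print(f"{name}: {value:.2f}")
```

A product rule, a maximum rule, or empirical regression of KSAO ratings on task ratings could be substituted; the documentation value lies in making the task-to-attribute inference explicit and auditable.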

Evaluating Work Analysis Quality

Discussion of how best to evaluate the accuracy, quality, and meaning of work analysis data has been going on for a long time, but it has been given more recent impetus by various empirical studies (Wilson, 1997) and conceptual/theoretical work (Morgeson & Campion, 1997, 2000) illustrating how difficult it may be to collect accurate and valid job or work information—or even to agree on what this means. For example, is accuracy best indexed by inter-rater agreement, convergence among different data sources, or convergence between work analysis output and some external standard or benchmark? Despite such ambiguity, what ultimately matters is the effectiveness or utility of such data for making sound personnel decisions. Within the context of personnel selection, the inferences at issue range from those that are immediately supported by work analysis data (such as inferring worker attributes from task importance ratings) to those that are more distally supported by such data (such as inferring the validity of work-analysis-based selection instruments by examining their correlations with performance measures). Although we recognize that such consequences provide only partial information about the impact of work analysis on decision making, we nonetheless believe there is considerable value in thinking of work analysis in terms of its more broadly conceived consequential validity, because this provides a framework for demonstrating and documenting to organizational leaders and stakeholders its pivotal role in linking personnel selection (as well as many other HR practices) to critical aspects of work performance (as revealed through work analysis). This value is further enhanced to the degree that such practices can be shown to produce significant “returns on investment” (Becker, Huselid, & Ulrich, 2001), for example, in the form of individual- or organization-level performance improvements.



FRONTIERS OF WORK ANALYSIS: EMERGING TRENDS AND FUTURE CHALLENGES

Work Analysis in Support of Selection for High-Performance Workplaces

The concept of high-performance (or high-involvement) organizations (HPOs; Lawler, Mohrman, & Benson, 2001) is a direct outgrowth of the sweeping work and workplace changes to which we alluded earlier that have been occurring over the last 20 years or so. It refers to organizations that have incorporated into their strategy, culture, and structure various elements believed to maximize the performance of their people so that the organization's performance and its ability to compete effectively in the global economy are also maximized. These include such workplace practices as (a) worker empowerment, participation, and autonomy; (b) the use of self-managed and cross-functional teams; (c) commitment to superior product and service quality; (d) flat organizational structures; (e) the use of contingent workers; (f) flexible or enriched design of work that is defined by roles, processes, output requirements, and distal criteria (customer satisfaction, contribution to organization values), rather than by (or in addition to) rigidly prescribed task- or job-specific requirements; (g) rigorous selection and performance management processes; and (h) various worker- and family-friendly HR policies that reward employee development and continuous learning and support work-life balance.

A growing body of evidence (Cascio & Young, 2005; Gibson, Porath, Benson, & Lawler, 2007) shows that such workplace practices can contribute to important organization-level outcomes (e.g., financial performance, productivity, and customer satisfaction; Boselie, Dietz, & Boon, 2005; Staw & Epstein, 2000). Particularly relevant for our discussion is evidence within this larger body of research that links such HPO-oriented workplace practices and outcomes with the individual worker attributes and behaviors needed to effect them (Boselie et al., 2005; Guest, Conway, & Dewe, 2004; Guthrie, 2001; Huselid, 1995; MacDuffie, 1995; Spreitzer, 1995). For example, formerly individual-contributor scientists who have been reorganized into cross-functional teams with engineering and marketing staff to improve a product delivery cycle may need social and communication skills in addition to research skills (a sort of work context “main effect”). Work that has been redesigned to create greater worker autonomy may improve motivation, and hence performance, among individuals with high growth need strength but not in others, which is a work context “interaction effect” (Hackman & Oldham, 1976). It is not enough for employees at Disneyland (“the happiest place on earth”) to simply work; they must “whistle while they work”—literally for seven particular work roles and figuratively for all others—so as to contribute to one of that setting’s most critical outputs (cheerfulness).

In other words, different organizational strategies—and how they are reflected in an organization’s culture and structure—imply potential needs for additional or different (or different configurations or weightings of) worker KSAOs, the measurement of which could enhance existing selection systems. Moreover, HPO-associated strategies have, in many cases, increased the organizational value of various non-job- and nontask-specific performance criteria, such as contextual performance; employee satisfaction, commitment, engagement, and retention; and avoidance of employee withdrawal and counterproductive behaviors.
Most of the relevant literature considers typical employment relations (Cappelli & Keller, 2013). However, many organizations have significant numbers of workers in nontraditional roles, including contract workers, temporary workers, telecommuters, and subcontractors working shoulder-to-shoulder with regular employees. Workers on contract must provide deliverables, but management has no control over the production process for such workers. For telecommuters, management has authority to direct the worker’s means of production, but the actual direction may be nominal. Management’s inability to specify actual worker behavior creates a problem for traditional job analysis methods. Such developments imply the need for work analysis methods that incorporate measures of a greater variety of work context factors—particularly those associated with or driven by an organization’s vision, mission, strategy, structure, culture, and values—than are addressed in conventional methods, in which such measures, if present at all (e.g., as they are in FJA, PAQ, and O*NET), tend


to be limited to those associated with specific jobs and work settings (work schedule, working conditions, environmental hazards) rather than the broader organizational and external context (business strategy, competitive environment, market conditions). Absent such measures, we may fail to detect the need for potentially important or useful worker attributes and potentially critical selection procedure validation criteria. For example, it is plausible that areas where noncognitive attributes (such as personality traits, attitudes, and values) might have their greatest predictive value remain largely unexplored because of our historical focus on more conventional (job- and performance-oriented) criteria that ignore the broader organizational context of work. We would go so far as to argue that the definition, measurement, and mapping of the work environment—in effect, creating a “common language” of work context—at multiple levels of analysis is the next major frontier in work analysis. This is not a small challenge. It is a problem of long standing in psychology as a whole (Frederiksen, 1972) and continues to be one of acute importance in I-O psychology today (Johns, 2006). However, models for explaining and understanding context are beginning to emerge (Tett & Burnett, 2003). Context variables such as organizational climate and culture may have implications for selecting people with compatible personal characteristics (see Chapter 5 in this volume).

WORK ANALYSIS FOR HIGH-PERFORMANCE INDIVIDUALS (“STARS”)

A series of articles by Aguinis and colleagues (Aguinis & O’Boyle, 2014; Aguinis, O’Boyle, Gonzalez-Mulé, & Joo, 2014; O’Boyle & Aguinis, 2012) has argued that performance outputs (e.g., number of articles published, goals scored, Emmy nominations, dollar amounts of sales) show very skewed distributions. In such distributions, a few individuals show outstanding performance and may be labeled “stars.” They also argue that organizational programs designed to impact the average employee are likely to be misguided because “most performance outcomes are attributable to a small group of elite performers” (O’Boyle & Aguinis, 2012). However, Beck, Beatty, and Sackett (2014) described several features of the way in which such distributions are collected (e.g., whether the time period of performance is the same for all of the individuals in the distribution) that may affect the shape of the distribution. Others have noted that outputs and performance are not synonymous, and that stars might be conceptualized instead as expert performers (Campbell & Wiernik, 2015). Aguinis et al. (2016) also investigated a number of other characteristics (cumulative advantage, or “rich get richer” factors) that may influence the tails of the performance distribution. Regardless of the degree to which their arguments about the importance of the majority of workers are ultimately confirmed, attention to recruiting, hiring, and maintaining performance stars has a clear potential to benefit many organizations.

Aguinis and O’Boyle (2014) stated that “a focus on results rather than behaviors is most appropriate when (a) workers are skilled in the needed behaviors, (b) behaviors and results are obviously related and (c) there are many ways to do the job right” (p. 316). From the standpoint of job analysis, the possibility that there is no one best way to accomplish the job is an issue. Hierarchical task analysis (Annett, 2003) describes a set of steps designed to accomplish a task. If there is more than one way to do something, then the most efficient way will be chosen. If the problem is complex, then a series of approaches might be employed, but there will be rules about what to try under which circumstances. In some instances it may not be clear what the best method for task accomplishment is. For example, as surgery evolves, new tools and techniques become available, and because these have not been used previously, it is not obvious how best to use them in an operation. However, as experience accrues, evidence becomes available about their advantages and disadvantages. Where performance data are available, it becomes possible to investigate whether some uses are better than others, and some studies have linked job-analytic data to job behaviors (Morrison, 1994; Sanchez & Levine, 2012; Sanchez, Prager, Wilson, & Viswesvaran, 1998).

A domain in which performance data are routinely collected and scrutinized is sales, where part of performance is determined by features external to the employee, such as geographic location (Blickle, Wendel, & Ferris, 2010), part by previous, relatively distal employee behaviors


(behaviors related to repeat customers, referrals, etc.), and part by the current or proximal behaviors of the employee (Jaramillo & Grisaffe, 2009). Despite the best efforts of organizations (selection, training, compensation, and management), sales data typically show a heavy-tailed distribution such that top performers may exceed average performers by a factor of two or more. In other words, there is large variability in performance and there are identifiable performance stars. Part of the difference in outcomes may be essentially due to luck. For example, the Jaramillo and Grisaffe (2009) study reported correlations in the .30s (.29 to .43) for sales data across four quarters. We would expect that if the same individuals were consistently in the tails of the distribution (and often outliers), we would see a high correlation over time periods. However, correlations may be higher over longer periods because performance may be more reliable when aggregated over longer periods. There are data from the study of stockbroker performance suggesting that time spent on different tasks (time allocation strategies) may distinguish superior performers (Borman, Dorsey, & Ackerman, 2006); the same study also showed evidence of differential relations between time spent and performance for more and less experienced stockbrokers. A study of computer salespeople also showed a relation between time spent and sales data (Kerber & Campbell, 1987). However, other studies (of nonsales jobs) have shown no relationship between time spent and performance (Conley & Sackett, 1987; Wexley & Silverman, 1978). A study of auto salespeople showed an interaction between motivation (the achievement striving facet of Type A) and time management (planning at the beginning of the day) in the prediction of sales performance (Barling, Kelloway, & Cheung, 1996). It seems likely that autonomy is necessary but not sufficient to produce differences in the relations between time spent on tasks and performance outcomes. Also, as Borman, Dorsey, and Ackerman (2006) noted: “Performance is probably in large part a function of how stockbrokers carry out activities over and above what activities they allocate time to and emphasize” (pp. 774–775).

We suggest two possible avenues of research that might advance our understanding of selecting future stars. First, rather than conducting a garden-variety task analysis and then comparing the responses of those with better and poorer performance outcomes, we might start with the better and poorer performing groups (analogous to the novice vs. expert distinction in cognitive task analysis) and systematically explore the differences in what they do, how they do it, and what personal qualities distinguish them. To the best of our knowledge, a cognitive task analysis has not yet been conducted on sales jobs. Second, as we have emphasized throughout the chapter, we suggest better partnering of psychologists with professionals from other areas (e.g., operations management specialists, economists) to create better models that predict performance outcomes. Sales is an area in which the job behaviors and the value of the outcomes are closely connected because performance is closely tied to dollars (unlike the performance of the janitors and accountants employed by the same organization).
Factors that influence sales, such as geographic location, can be explicitly incorporated into a statistical model along with individual difference variables such as personality, motivation, or emotional intelligence. It is plausible that early in the stockbroker’s career, more time must be allocated to prospecting for clients, whereas in later career stages, more time must be allocated to maintaining relations with existing clients. In short, by incorporating both business variables and individual differences in the same model, it should be possible to do a better job of predicting performance outcomes than is currently the case.
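A minimal sketch of the kind of combined model suggested here appears below. The data are simulated (territory effects, a conscientiousness score, and a skewed disturbance standing in for luck and unmeasured factors), and ordinary least squares on log sales is only one of many reasonable modeling choices; nothing here reproduces the analyses in the studies cited above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated predictors: a business variable (sales territory) and an
# individual difference variable (conscientiousness, standardized).
territory = rng.integers(0, 3, size=n)          # 0 = rural, 1 = suburban, 2 = urban
conscientiousness = rng.normal(0, 1, size=n)

# Simulated annual sales: territory and personality effects plus noise on the
# log scale, which yields a right-skewed (heavy-tailed) raw sales distribution.
territory_effect = np.array([0.0, 0.3, 0.6])[territory]
log_sales = 11 + territory_effect + 0.25 * conscientiousness + rng.normal(0, 0.5, size=n)
sales = np.exp(log_sales)

# OLS on log(sales) with dummy-coded territory plus conscientiousness.
X = np.column_stack([
    np.ones(n),
    (territory == 1).astype(float),
    (territory == 2).astype(float),
    conscientiousness,
])
coef, *_ = np.linalg.lstsq(X, np.log(sales), rcond=None)
print("intercept, suburban, urban, conscientiousness:", np.round(coef, 2))

# The raw sales distribution is heavy-tailed: the top 10% of sellers account
# for a disproportionate share of total sales.
top_share = np.sort(sales)[-n // 10:].sum() / sales.sum()
print(f"share of sales from top 10% of sellers: {top_share:.0%}")
```

The same structure accommodates interactions (e.g., career stage by time allocation) or multilevel extensions; the point is simply that business context and individual differences can be estimated jointly rather than studied in isolation.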

COGNITIVE TASK ANALYSIS

Definition and Recent History

The rise of cognitive psychology promoted unobservable phenomena (e.g., the decision-making process) as objects worthy of study, and applications of cognitive psychology to task analysis appear in the literature in the 1980s and early 1990s (Cooke, 1994; Lesgold et al., 1988), although the adoption by I-O psychologists appears to be spotty (Berryman, 1993). Cognitive task analysis


(CTA) can be described as a “set of methods for identifying cognitive skills, or mental demands needed to perform a task proficiently” (Militello & Hutton, 1998), and it typically emphasizes the distinction between novices and experts in solving problems and in task proficiency (Clark & Estes, 1996). Cognitive task analysis can supplement a garden-variety task analysis by focusing on mental operations that are not directly observable (Cannon-Bowers, Bowers, Stout, & Ricci, 2013; Clark & Estes, 1996). Thus, particularly for cognitive tasks, CTA can help bridge the gap between what gets done and how it gets done.

During the 1960s, engineers began to focus on automation, where tasks are allocated to machines, and the time and motion study techniques developed by the engineering pioneers (Gilbreth, 1911) were perceived to be inadequate to the purpose (Annett, 2003). Engineers developed hierarchical task analysis, which decomposes a task into subtasks of whatever level of detail is needed. Each subtask is composed of four parts: (1) the subtask goal, (2) the input conditions, (3) the action or operation to achieve the goal, and (4) feedback about goal attainment (Annett, 2003). This is similar to garden-variety task analysis except that it explicitly includes the input conditions and task completion feedback. In the 1970s, multiple strands of research resulted in human factors approaches called cognitive task analysis and cognitive work analysis (Roth, 2008). Hierarchical task analysis can be supplemented by cognitive task analysis (Phipps, Meakin, & Beatty, 2011) and by cognitive work analysis (Salmon, Jenkins, Stanton, & Walker, 2010), both of which consider cognitive processes that cannot be observed directly. Cognitive task analysis and cognitive work analysis are currently used in the design of equipment and computer interfaces, as well as in training and task allocation to teams (Ashoori & Burns, 2013). Although cognitive task analysis has not been applied to selection, we cover it here because it is a lively area of research that might stimulate developments in more traditional areas of work analysis.
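As a concrete illustration of the four-part subtask structure described above, the sketch below represents one hierarchical task analysis record as a small data structure. The example task, field names, and nesting convention are ours, chosen purely for illustration; they are not drawn from Annett (2003) or any specific published analysis.

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    """One node in a hierarchical-task-analysis-style decomposition."""
    goal: str                 # (1) the subtask goal
    input_conditions: str     # (2) conditions that trigger or permit the action
    operation: str            # (3) the action taken to achieve the goal
    feedback: str             # (4) how goal attainment is recognized
    subtasks: list["Subtask"] = field(default_factory=list)  # further decomposition if needed

# Illustrative (hypothetical) decomposition for a routine maintenance task.
replace_filter = Subtask(
    goal="Replace air filter",
    input_conditions="Pressure-drop gauge reads above threshold",
    operation="Shut down unit, swap filter cartridge, restart unit",
    feedback="Gauge returns to normal operating range",
    subtasks=[
        Subtask(
            goal="Shut down unit safely",
            input_conditions="Unit is running",
            operation="Follow lockout/tagout procedure",
            feedback="Unit powered off and tagged",
        ),
    ],
)
print(replace_filter.goal, "->", [s.goal for s in replace_filter.subtasks])
```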

Common Applications

Performance Assessments

Because CTA has been used to develop training, it has also been used to develop performance assessments, which are necessary to evaluate learning (and of course may be relevant to selection). Examples of performance assessments based on CTA include an outline of a rubric for evaluating biology lab reports (Feldon, Timmerman, Stowe, & Showman, 2010) and a description of a think-aloud test of problem solving during two kinds of simulated surgery (Pugh & DaRosa, 2013). Two related examples illustrate the process of developing a test using CTA, one for computer networks (Williamson et al., 2004) and the other for dental hygienists (Mislevy, Steinberg, Breyer, Almond, & Johnson, 1999). In the medical literature, there is a cross-walk of judgments about the levels of expertise and expected performance at different training levels (e.g., medical school vs. certain years in residence; Khan & Ramachandran, 2012). Guidance on how to use CTA to promote good measures for a task includes eliciting the cues needed for each step, describing typical trainee errors for each step, and specifying observable behavior that allows a judge to determine whether a step is accomplished properly (Cannon-Bowers et al., 2013). Unlike hierarchical task analysis, which includes every task plus any required subtasks, CTA typically focuses on a subset of tasks or contexts that distinguish between novices and experts. Cannon-Bowers et al. (2013) describe the results of two cognitive task analyses, one for cricothyroidotomy (emergency airway puncture) and one for hemorrhage (bleeding) control. For each task, they list the major step (e.g., insert endotracheal tube), the cues used to perform it (e.g., tactile and kinesthetic cues), the typical errors (e.g., excessive force), the observable trainee behaviors used to infer competence (e.g., orientation of the instrument), and the decision-making demands of the step (whether the tube is placed properly). Klein et al. (2015) described a task analysis of dealing with civilians for police and military officers during conflict situations. They were able to link the decision-making strategies to outcomes for the incidents and to develop a list of potential antecedents that might explain


differences in the quality of interactions with civilians, including family background, rejection of negative experiences, prior work experience, and drive for excellence (G. Klein et al., 2015). CTA can provide insight about cognitive processes that would not likely be assessed by multiple-choice tests, in which items typically contain all of the required information and a single best answer. For example, in the dental test (Mislevy et al., 1999), participants need to notice that unusual deterioration has happened to a patient’s teeth during the previous six months and then investigate appropriately to determine the cause. That is, the examinee must notice the connection between current data and something seen earlier in the examination and investigate by asking the examiner for additional information, analogous to what should happen in the real situation. Such an approach is essentially a structured means of developing a work sample or content-valid test that is explicitly linked to degree of expertise and thus should be helpful in picking the best applicant.

Training

The most common human resources application of CTA appears to be training (Ryder & Redding, 1993). For example, CTA can provide information for developing training content, for building simulators that present the appropriate stimuli, and for constructing stimuli to assess proficiency before and after training (Cannon-Bowers et al., 2013; Tjiam et al., 2012). Applications include a very detailed task analysis of interventional radiology (Johnson et al., 2006), a licensure test for dental hygiene (Mislevy, Steinberg, Breyer, Almond, & Johnson, 1999), and training for tracheostomy (Sullivan et al., 2007), central venous catheterization (Velmahos et al., 2004), and nephrostomy (Tjiam et al., 2012). There are applications of CTA to disciplines other than medicine, including the teaching of biology (Feldon et al., 2010), understanding the resilience of emergency response teams (Gomes, Borges, Huber, & Carvalho, 2014), and differences in police proficiency in handling civilian encounters (G. Klein et al., 2015).

Methods of CTA

Methods of cognitive task analysis have been adapted from the laboratories of cognitive scientists, where they were developed for many purposes. A cognitive task analysis involves the following steps: (a) selection of the participants (experts and possibly novices), (b) knowledge elicitation, and (c) analysis and representation of the knowledge (Craig et al., 2012). There are more than 100 distinct methods of knowledge elicitation (Cooke, 1994; Yates & Feldon, 2011). CTA methods, however, can be categorized as follows: (a) observations and interviews, (b) process tracing, (c) conceptual techniques, and (d) formal models (Cooke, 1994; Wei & Salvendy, 2004). Formal models are mathematical representations of cognitive processes and are not often used in applied settings. The methods most commonly applied to the workplace involve structured interviews (observations and interviews) and think-aloud protocols (process tracing) while performing a task or solving a problem.

Interviews

The Critical Decision Method (G. Klein, Calderwood, & MacGregor, 1989; Hoffman, Crandall, & Shadbolt, 1998) provides the analyst with a series of questions for the expert. The expert reflects on an instance in which an important decision was made and then describes the context of the decision, the cues that were or could have been influential in making the decision, and the strategies that could be brought to bear on the problem (recall our description of the decision about minimally invasive versus open surgery; also note the similarity to Flanagan’s Critical


Incident Technique). A similar method is called PARI, for precursor, action, result, and interpretation (Hall, Gott, & Pokorny, 1995). The PARI method focuses on mundane tasks as they are typically performed. A related technique is the Knowledge Audit (Craig et al., 2012; Militello & Hutton, 1998), which comprises a series of questions based on the literature concerning differences between novices and experts. For example, the participants may be asked if, during the course of completing a task, they have noticed things that others did not, developed a more efficient way of doing the task, improvised something, and/or noticed something anomalous (Craig et al., 2012).

Protocols/Process Tracing

A commonly used method of CTA is to have experts “think aloud” while solving a problem or performing a task. There are several techniques for doing so. Experts may be asked to imagine doing the task while verbalizing their thinking, or to actually complete the task, perhaps on a simulator (Cannon-Bowers et al., 2013). For some tasks, such as surgery, the expert may be video recorded during the operation and then provide the verbal description later while watching the video (Johnson et al., 2006). Another variant has pairs of experts work together, where one expert poses a problem to the other, who then solves it while thinking aloud (Ryder & Redding, 1993). The verbal protocols are often recorded, transcribed, and analyzed for content.

Issues

CTA is laborious and time-consuming, and thus it is expensive. Experts often provide different information, and thus it has been suggested that multiple experts be included in a CTA for any given task (Chao & Salvendy, 1994; Sullivan, Yates, Inaba, Lam, & Clark, 2014). At present, there is little comparative information about the reliability and validity of different methods of CTA, and there is little empirical guidance about what methods are best for specific purposes (Yates & Feldon, 2011). There have been few cost-benefit analyses of CTA, but savings in training time (Clark & Estes, 1996) and better performance results (Tofel-Grehl & Feldon, 2013) suggest that the time and expense of CTA may be a wise investment so long as the resulting program can be applied to training sufficient numbers of people. Similar arguments might support the application of CTA to selection. Research is warranted on the conditions under which it is worthwhile to invest in CTA (e.g., whether the job entails time pressure and severe consequences of error). CTA appears to run counter to the business trend emphasizing broad, shallow job descriptions.

Application of CTA to Selection

Several authors have mentioned the potential application of CTA to personnel testing and selection (Gordon, Coovert, & Elliott, 2012; Rothkopf, 1986; Wei & Salvendy, 2003). To the best of our knowledge, there are no published examples of CTA being used to develop a selection test.

Work Samples

Based on the CTA literature for training, the most obvious tests for selection would be work samples and specially constructed simulations (Cannon-Bowers et al., 2013). The CTA could emphasize cognitive processes such as planning, noticing connections and anomalies, gathering information, and judgment and decision making. Such exams have the potential to tap cognitive processes that are not usually measured deliberately during selection. How this might be accomplished was illustrated by the dental hygienist case exercises mentioned earlier


(Mislevy et al., 1999). A drawback is that such exams are very labor-intensive to develop and to administer. The examiner must have the content knowledge necessary to respond to the examinee in ways that are appropriate for the scenario and accurate for the facts of the case. At some point, computers will be capable of replacing the human examiners in such interactive exams, but this is not currently feasible. However, situational judgment tests (SJTs) are essentially low-fidelity simulations in which the situation is described narratively on paper or by video (Lievens, Buyse, & Sackett, 2005; Lievens & Motowidlo, 2016). It might be possible to obtain at least some of the same information through SJTs as through higher-fidelity work samples. As Lievens and Motowidlo (2016) argued, the SJT typically lacks much theoretical grounding. CTA might be a way to support the interpretation of SJT scores as a measure of expertise in a particular area.

Reaction Time (Automaticity)

A second approach to testing for selection might involve measures of examinee reaction time to stimulus materials that are representative of the domain of interest. One way in which this might be done concerns the decomposition of reaction time into components representing fundamental processes such as perception, recognition, and solution selection. Cognitive scientists have developed formal models of decision making that might be fit to individual data to estimate examinee standing on individual difference variables (Rothkopf, 1986; Wei & Salvendy, 2004). Another way in which reaction time might be used is the speeded production of solutions to problems. Experts are able to produce workable solutions to problems very quickly (G. A. Klein, 1998). For example, a chess grandmaster can solve many chess problems within 10 seconds, whereas a weak club player cannot (Campitelli & Gobet, 2004). A test of this sort would be based on a series of problems with known solutions, each of which is presented only for a short while, and then scored for quality of response.
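One simple way the speeded-production idea could be operationalized is sketched below: each problem is exposed for a fixed window, and only responses produced within that window earn credit for their judged quality. The data, time limit, and scoring rule are hypothetical and meant only to illustrate the general approach, not any published test.

```python
# Hypothetical responses: (problem id, solution quality 0-1 judged by experts,
# response latency in seconds). Only answers within the exposure window count.
responses = [
    ("p1", 0.9, 6.2),
    ("p2", 0.4, 9.1),
    ("p3", 0.8, 12.7),   # too slow: no credit
    ("p4", 0.0, 5.0),    # fast but wrong
]

TIME_LIMIT = 10.0  # seconds each problem is exposed (illustrative)

def speeded_expertise_score(responses, time_limit=TIME_LIMIT):
    """Mean solution quality, counting late responses as zero credit."""
    credited = [quality if latency <= time_limit else 0.0
                for _, quality, latency in responses]
    return sum(credited) / len(credited)

# Prints the mean credited quality for these illustrative data.
print(f"expertise score: {speeded_expertise_score(responses):.2f}")
```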

Limits of Expertise

One reason that CTA is more closely associated with training than with selection is the cognitive psychologists’ apparent assumption that nearly everyone can learn nearly any task to a desired level of proficiency given sufficient practice (Clark & Estes, 1996). “There seems to be general agreement in cognitive psychology that most human beings are capable of acquiring declarative knowledge, production knowledge or both about any task” (p. 406). However, others have noted an apparent boundary condition, which is the consistency of the stimulus and response required for the task (Ryder & Redding, 1993). The slow, effortful problem-solving approach may be needed if the job causes the worker to encounter situations sufficiently novel that the known solutions may not apply. Jobs such as surgery may foster the development of “fractionated expertise” (Kahneman & Klein, 2009), where experts can automate only part of the required skills (Craig et al., 2012). The implication is that tests based on CTA may need to disentangle learned, job-specific influences from general ability influences on the performance assessed by work samples.

Looking Forward

It seems clear that CTA can provide information that would be useful in personnel selection. However, we are unaware of any such applications. One additional step that seems required for such applications is the explicit linkage of cognitive processes to the required job knowledge, skill, or other characteristics.



Strategic Work Analysis and Competency Modeling

Traditional methods of work analysis appear largely rooted in the industrial-age workplace. Such methods (including much of cognitive task analysis) appear designed to support staffing Model 1—the traditional practice of matching people to individual jobs (Snow & Snell, 1993). This bottom-up orientation largely ignores three major top-down elements—organizational strategy, organizational structure, and organizational culture—that reflect an organization’s vision and mission, drive much of its daily functioning, and can (in some instances, profoundly) affect the choice, configuration, and relative importance of KSAOs constituting an organization’s selection system, independent of the nature of the work performed by individuals across the organization (Williams & Dobson, 1997). Therefore, this latter top-down perspective is highly relevant to the alternative, nontraditional staffing models being adopted by many contemporary organizations that, for example, view staffing as a tool in strategy implementation (Snow and Snell’s Model 2, applicable to organizations with clear strategies and known competitors) or strategy formation (Model 3, applicable to organizations that need to develop or change strategies quickly).

Viewing staffing as strategy invites work analysis methods that incorporate various types of organization-level analyses—market and demographic trends, competitive environment, emerging technology, business and strategic plans, organizational culture and style—as is routinely done in work analysis to support training system development and has been similarly recommended for selection-oriented work analysis (Goldstein, 1997). This would provide the critical context to facilitate more specific work analytic efforts, thereby also facilitating the direct generation of worker KSAOs related to broader organizational criteria, strategies, and goals. Such an approach could in turn provide a framework from which other, more broadly conceived, selection-related applications of work analysis might be explored and capitalized on. For example, one such application could be the provision (via ads, realistic job previews, or company websites) of customized information to applicants about various aspects of work and worker requirements (e.g., context factors, career ladders and lattices based on job inter-relationships, and skill or knowledge transferability) that are potentially related to applicant attraction and subsequent organizational commitment. Another example is collecting work analysis data on contextual and other factors relevant to selection for nontraditional or nonperformance criteria, such as successful post-hire assimilation and socialization, or different levels of employee “fit” (Higgs, Papper, & Carr, 2000). Yet another example is using work analysis questionnaires more in the mode of an organizational survey (i.e., as an ongoing or regularly recurring intervention) rather than exclusively as a “one-and-done” tool for work profiling; this could provide a measure of work content and work context stability/volatility and offer insights into the nature of such content or context changes and their potential implications for selection-related worker attributes.

All of this suggests a potentially useful reconceptualization of work analysis as organizational strategy—that is, as a strategic tool—and hence characterized by a strong organization development (OD) component (Higgs et al., 2000; Schippmann, 1999).
It also suggests the need to bridge a historical and professional disconnect between those who have tended to view jobs and work from a traditional, “micro” perspective (e.g., personnel and industrial psychologists, training and education professionals, cognitive psychologists, occupational analysts, industrial engineers, and human factors specialists) and those who have tended to look at work from a more “macro” perspective (e.g., labor economists; sociologists; business and management consultants; demographers; ethnologists; and clinical, social, and organizational psychologists). In our view, the work analysis enterprise would be better served by an integration of these perspectives, facilitated by much more interdisciplinary work among such professionals than historically has been the case, as some have called for (Barney, 2000; Cunningham, 2000). Such a reframing and associated changes in work analysis practice—and practitioners—underlie what we believe to be a broader and potentially more useful concept of “strategic work analysis” (SWA) as a systematic effort to identify or define current or anticipated work or worker requirements that are strategically aligned with an organization’s mission and goals. This would subsume some other related terms and practices in current use, such as future-oriented job analysis, strategic job analysis (which is


sometimes used as a synonym for the prior term—and sometimes not), strategic job (or work) modeling, and competency modeling, which, given its substantial impact on contemporary work analysis discussion and practice, we now consider in greater depth.

Competency modeling (CM) is a form of work analysis whose use has become widespread since about the mid-1980s. There appear to be almost as many definitions of “competency” and “competency modeling” as there are users of them (Schippmann et al., 2000). Most existing definitions describe a complex and multifaceted concept, such as that offered by Spencer, McClelland, and Spencer (1994). They define competency as a combination of motives, traits, self-concepts, attitudes, values, content knowledge, or cognitive behavior skills and as any individual characteristic that can be reliably measured or counted and that can be shown to differentiate superior from average performers. The difficulty with such definitions is that they lump together in a single construct attributes representing vastly different domains, characteristics, and levels of analysis, which limits the construct’s value both conceptually (Clouseau-like, it means everything, therefore it means nothing) and practically (as a useful or measurable descriptor or unit of analysis in work analysis).

The typical output of a CM project is a set of worker attributes (competencies) believed to contribute to an organization’s broad strategy and goals, culture, or values. As such, these attributes are considered applicable across the entire organization, or within large units or functional areas, and thereby able to serve as a common framework underlying the various components of an integrated HR system, such as training and development, performance management, compensation, and selection/promotion. Each competency is given a name or label and is usually accompanied by a set of behavioral indicators (BIs) that exemplify desirable behavioral manifestations of the competency and thereby serve as the basis for measuring individuals’ standing on the competency. Multiple sets of BIs are often developed to address a given competency’s manifestation across different job families (sales, engineering), across functional specialties within a job family (account executive, technical consultant, customer service representative), or across occupational levels within a single job. For example, a “systems thinking” competency (defined as “making calculated decisions that take into account impact on other activities, units, and individuals”) might have different BIs for sales managers (“evaluates the impact on others before changing work processes”) than for account team leaders (“helps staff understand how their function relates to the overall organization”), as appropriate for these different roles.

We believe there is a huge chasm between the CM ideal envisioned by its proponents and the actual practices that, under the rubric of CM, produce the type of output described above (Lievens, Sanchez, & De Corte, 2004; Morgeson, Delaney-Klinger, Mayfield, Ferrara, & Campion, 2004). In our experience, the majority of such practices fall into one of two categories.
The first category involves entirely traditional work analysis, of one form or another, which leads to the development of sets or taxonomies of well-defined, work-related (but not strategy-, culture-, or values-related) person attributes (KSAOs) and associated metrics (behavioral or numerical rating scales, tests, or other instrumentation) that meet accepted professional standards for such work. Such activities are labeled as CM, and the KSAOs are called competencies, to satisfy explicit or implicit requirements of organizations or particular leaders. The second category purports to derive attributes related to organizational strategy, culture, and values but entails the use of poorly conceived, incomplete, or otherwise inadequate procedures (e.g., convenience samples engaged in unstructured group discussions conducted without reference to individual or organizational work performance) that lead to the development of idiosyncratic, ill-defined (or undefined) “folk constructs”—ad hoc or “armchair” concepts or labels devised without reference to existing research or theory—that are often little more than a wish list of desired worker attributes or purported organizational values along with brainstormed (and typically unvetted and unvalidated) examples of good performance for each identified competency.

The above discussion is not meant to impugn the CM ideal of its proponents, but rather to highlight the disconnect we perceive between this ideal and most contemporary CM practice, which is either fairly rigorous but not explicitly “strategic” (the first category described above) or ostensibly strategic but not very rigorous, and hence ultimately unsuccessful in its strategic intent


(the second category described above). We would, in fact, classify this ideal as a third (although still mostly unrealized) category of CM practice—one that combines the laudable goals of (a) linking organizational strategy (and other organization-level variables and outcomes) to desired individual employee attributes (McLagan, 1988) and (b) utilizing rigorous development methodologies of both conventional work analysis and other disciplines to ensure the validity of these linkages, much as proposed by Schippmann (1999) and Barney (2000). For example, the “traditional” Critical Incident Technique can be readily adapted to the generation of genuinely strategically driven competencies and associated BIs, requiring only a change of frame of reference for incident generation from specific jobs to various organization-level variables, combined with the use of SMEs appropriate for this frame of reference. The unique aspect of this category of CM practice is its explicit strategic organizational focus, without reference to the work performed in any particular jobs. This is why we believe it is most appropriately regarded as simply one particular form of the more broadly conceived SWA concept we proposed above, and why (along with all of the conceptual and definitional ambiguities noted above) it has been argued (Pearlman, 1997) that the terms “competency” and “competency modeling” be abandoned altogether. The major need going forward, as we see it, is for creative thought and research addressing such potential adaptations (such as the beginning efforts of Lievens et al., 2004, and Lievens & Sanchez, 2007), as well as the development of new data collection methods and approaches, to support all varieties of SWA.

Quest for a “Common Language” and the Challenge of Large-Scale, Multipurpose Work Analysis Systems

The concept seems simple. Develop comprehensive sets of standardized work- and worker-oriented descriptors representing multiple levels of analysis and then determine their interrelationships within a single analytic system that could thereby be used to derive the work content and worker requirements of any job. Such a system, especially when fully automated (as is easily accomplished nowadays), could serve as the basis for a powerful HR data and information system (or “human asset management” system, in today’s jargon), underpinning and integrating numerous HR functions, such as selection and staffing (especially validity evidence extension applications, because it is ideally suited for cross-job comparison), training and career development, performance management, and workforce planning. At a broader level it could provide the means for tracking trends and changes in work content and occupational structure across the economy; for assessing and addressing national “skill gaps,” skill transferability, and occupational portability issues; and for studying selection and talent allocation issues at a national level. From a scientific standpoint, it would constitute a critical research tool for advancing theory development regarding work performance and the structure of work or occupations—in effect, moving us closer to a “unified theory of work” (Vaughn & Bennett, 2002).

This notion of “a complete, universally applicable information system for human resources allocation” (Peterson & Bownas, 1982, p. 49) based on taxonomic information about work, work environments, and human attributes—a “common language” of people and jobs—has long been viewed as something of a “holy grail,” enticing work analysis researchers and practitioners for the better part of 80 years. Such a system, when fully realized, would have underpinnings of structure (meaning logical inter-relationships among both descriptor categories and specific elements within those categories) and standardization (meaning common definitions, rules, and metrics) that thereby promote common understanding and usage of system elements among all users and stakeholders. This was the driving vision behind the U.S. Labor Department’s Occupational Research Program of the 1930s and its development of the DOT, and it was reflected to varying degrees in such later systems as FJA, the PAQ, ARS, MOSAIC, and O*NET. In our view, no single system has as yet been able to fully realize this vision, although O*NET probably comes the closest in terms of its scope and analytic capabilities.
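To make the “common language” idea more tangible, the sketch below encodes a tiny, hypothetical descriptor framework: standardized descriptor taxonomies with hierarchical levels, plus job profiles described only in terms of those shared elements (with optional job-specific tasks). The names and structure are ours for illustration; this is not the O*NET data model or any other published system.

```python
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    """A standardized element of the common language, nested to allow multiple levels."""
    code: str
    label: str
    children: list["Descriptor"] = field(default_factory=list)

@dataclass
class JobProfile:
    """A job described via shared descriptors plus optional job-specific tasks."""
    title: str
    ratings: dict[str, float]          # descriptor code -> importance (1-5)
    specific_tasks: list[str] = field(default_factory=list)

# A fragment of a hypothetical worker-attribute taxonomy.
communication = Descriptor("A.1", "Communication",
                           [Descriptor("A.1.1", "Oral expression"),
                            Descriptor("A.1.2", "Oral comprehension")])

cashier = JobProfile("Retail cashier", {"A.1.1": 3.5, "A.1.2": 4.0},
                     ["Process returns using store point-of-sale system"])
analyst = JobProfile("Budget analyst", {"A.1.1": 3.0, "A.1.2": 3.5})

def shared_descriptors(job_a: JobProfile, job_b: JobProfile) -> set[str]:
    """Cross-job comparison becomes trivial once jobs share a descriptor vocabulary."""
    return set(job_a.ratings) & set(job_b.ratings)

print(shared_descriptors(cashier, analyst))  # -> {'A.1.1', 'A.1.2'}
```

Even this toy version illustrates the design decisions discussed below: which descriptor domains to include, how deep the hierarchy should go, and whether job-specific information is allowed alongside the standardized elements.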

Despite the simplicity and elegance of the concept, the practical realization of such a system is enormously complex. This led Higgs et al. (2000) to conclude that “most systems like this . . . have held great conceptual promise but . . . have eventually died of their own administrative weight and expense” (p. 108). Many complex choices and decisions must be made in the conception, design, and implementation of such a system, depending on the applications or objectives at issue, such as (a) descriptor coverage—how many and which work- and worker-oriented attribute domains will be included in the system—and the associated question of whether the common framework will be operationalized as a single set of a relatively limited number of descriptor elements representing a single level of description, or as multiple descriptor sets or taxonomies representing multiple attribute domains and levels of description; (b) descriptor level of analysis (the breadth or narrowness of descriptor definition, as well as whether to allow multiple levels of analysis via the use of hierarchical descriptor element taxonomies); (c) whether descriptor coverage will apply (or will be designed so as to allow or promote application) to work, to workers, or to both; (d) whether individual jobs will be described exclusively in terms of descriptor sets that are used across all jobs in the system or will also include some types of job-specific information (such as tasks or tools/technology); (e) the characteristics of the metrics or scales by which descriptors will be quantified; (f) the policy and deployment questions of how much and which parts (descriptors) of a common framework will be required for use by all organizational units and which parts, if any, can be user-specific, which speaks to the critical issue of gaining the support and cooperation of multiple users and stakeholders, without which the system is unlikely to succeed; (g) devising efficient and effective procedures for ongoing data collection; and (h) devising procedures for maintaining and updating the system’s data structure, which involves numerous technical and practical challenges (e.g., the dilemma of changing or incorporating new data elements to respond to changed needs or realities while maintaining comparability and continuity with the prior data structure).

Despite this wide range of options, decisions, and challenges, we believe that the vision of such a system continues to be both worthy and viable, if approached in manageable steps, segments, or prototypes, on the basis of sound professional judgment, and undertaken with broad and high-level organizational support.
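To make the flavor of such a descriptor-based architecture more concrete, the sketch below shows one minimal way the core data structures might be represented: a standardized descriptor taxonomy (with optional hierarchical parent codes), job profiles expressed as ratings on those common descriptors plus job-specific tasks, and a single cross-job similarity index of the kind that validity-extension and job classification applications rely on. It is purely illustrative; the class names, invented descriptor codes, 1–5 rating scale, and cosine-based similarity metric are our assumptions and do not describe O*NET or any other existing system.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Descriptor:
    """One element of a standardized descriptor taxonomy (codes here are invented)."""
    code: str          # e.g., "AB-01" (hypothetical code, not an O*NET identifier)
    label: str         # human-readable name
    domain: str        # attribute domain: "ability", "skill", "work activity", ...
    parent: str = ""   # code of a broader parent element, supporting hierarchical levels


@dataclass
class JobProfile:
    """A job described in terms of common descriptors plus optional job-specific tasks."""
    title: str
    ratings: dict[str, float]     # descriptor code -> importance rating (1-5 scale assumed)
    specific_tasks: list[str] = field(default_factory=list)


def cross_job_similarity(a: JobProfile, b: JobProfile) -> float:
    """Cosine similarity computed over the descriptors the two profiles share."""
    shared = set(a.ratings) & set(b.ratings)
    if not shared:
        return 0.0
    dot = sum(a.ratings[c] * b.ratings[c] for c in shared)
    norm_a = sqrt(sum(a.ratings[c] ** 2 for c in shared))
    norm_b = sqrt(sum(b.ratings[c] ** 2 for c in shared))
    return dot / (norm_a * norm_b)


# Toy illustration: a three-element taxonomy and two clerical jobs rated on it.
taxonomy = {d.code: d for d in [
    Descriptor("AB-01", "Oral comprehension", "ability"),
    Descriptor("WA-07", "Interacting with computers", "work activity"),
    Descriptor("SK-12", "Organizing and planning", "skill"),
]}
clerk = JobProfile("Records clerk", {"AB-01": 3.0, "WA-07": 4.5, "SK-12": 2.0})
assistant = JobProfile("Administrative assistant", {"AB-01": 4.0, "WA-07": 4.0, "SK-12": 3.0})
print(round(cross_job_similarity(clerk, assistant), 2))  # one cross-job comparability index
```

Even this toy version makes visible several of the design decisions listed above: which domains the taxonomy covers (a, c), whether job-specific tasks are retained alongside common descriptors (d), and what metric the ratings use (e).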

SYNOPSIS AND CONCLUSIONS
Work analysis seems to have garnered a reputation as one of the less interesting and challenging areas of I-O psychology and HR practice. It has been noted, in a masterstroke of understatement, that “job and occupational analysis is not a glamorous or high visibility area on which to build a personal career or secure tenure” (Mitchell & Driskill, 1996, p. 129). One possible explanation may lie in the fact that, as we noted at the chapter’s outset, work analysis is rarely a destination; it is almost always a road—a way to get from here to there. In the rare instances in which it is a destination (most commonly, in the conduct of work analysis as documentation for an actual or potential lawsuit), it is not an eagerly anticipated one. We hope that this brief “walk down the road” of work analysis in the context of personnel selection serves to change such perceptions.

We believe that, in order to meet the types of challenges described in the previous section, work analysis needs to be reconceptualized more broadly as a strategic, multistep, multifaceted, and interdisciplinary effort that is at least as much a top-down process (i.e., one based on analysis and understanding of macro-organizational strategy and context factors) as a bottom-up process (i.e., one based on analysis of what workers actually do). This implies the need to rethink the conventional boundaries of work analysis—what it consists of, who does it (and with what qualifications and organizational roles), and how it gets done. Such rethinking would promote a transformation of the work analysis enterprise from one of merely gathering information to one of generating insight, meaning, and knowledge about work. This would in turn contribute to theory and practice. We believe that even modest strides in these directions would yield significant returns in terms of improving the efficiency and effectiveness of the (broadly conceived) employee selection life cycle. Although such a shift in orientation may not immediately change the work analysis enterprise from a road to a destination (nor necessarily should it), it will at least make the journey more interesting and productive.

REFERENCES
Aguinis, H., & O’Boyle, E. (2014). Star performers in twenty-first century organizations. Personnel Psychology, 67, 313–350. doi: 10.1111/peps.12054
Aguinis, H., O’Boyle, E., Gonzales-Mule, E., & Joo, H. (2016). Cumulative advantage: Conductors and insulators of heavy-tailed productivity distributions and productivity stars. Personnel Psychology, 69, 3–66.
Annett, J. (2003). Hierarchical task analysis. In E. Hollnagel (Ed.), Handbook of cognitive task design (pp. 17–36). Mahwah, NJ: Erlbaum.
Arvey, R. D., Salas, E., & Gialluca, K. A. (1992). Using task inventories to forecast skills and abilities. Human Performance, 5, 171–190.
Ashoori, M., & Burns, C. (2013). Team cognitive work analysis: Structure and control tasks. Journal of Cognitive Engineering and Decision Making, 7(2), 123–140.
Baranowski, L. E., & Anderson, L. E. (2005). Examining rating source variation in work behavior to KSA linkages. Personnel Psychology, 58, 1041–1054.
Barling, J., Kelloway, E. K., & Cheung, D. (1996). Time management and achievement striving interact to predict car sales performance. Journal of Applied Psychology, 81(6), 821–826.
Barney, M. (2000). Interdisciplinary contributions to strategic work modeling. Ergometrika, 1, 24–37.
Bartram, D. (2005). The great eight competencies: A criterion-centric approach to validation. Journal of Applied Psychology, 90(6), 1185–1203. doi: 10.1037/0021-9010.90.6.1185
Beck, J. W., Beatty, A. A., & Sackett, P. R. (2014). On the distribution of job performance: The role of measurement characteristics in observed departures from normality. Personnel Psychology, 67, 531–566.
Becker, B. E., Huselid, M. A., & Ulrich, D. (2001). The HR scorecard: Linking people, strategy, and performance. Boston, MA: Harvard Business School Press.
Berryman, S. E. (1993). Learning in the workplace. In L. Darling-Hammond (Ed.), Review of research in education (Vol. 19, pp. 343–401). Washington, DC: American Educational Research Association.
Blickle, G., Wendel, S., & Ferris, G. R. (2010). Political skill as a moderator of personality-job performance relationships in socioanalytic theory: Test of getting ahead motive in automobile sales. Journal of Vocational Behavior, 76, 326–335.
Borman, W. C., Dorsey, D., & Ackerman, L. (2006). Time-spent responses as time allocation strategies: Relations with sales performance in a stockbroker sample. Personnel Psychology, 45(4), 763–777. doi: 10.1111/j.1744-6570.1992.tb00967.x
Borman, W. C., & Motowidlo, S. J. (1993). Expanding the criterion domain to include elements of contextual performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 71–98). San Francisco, CA: Jossey-Bass.
Boselie, P., Dietz, G., & Boon, C. (2005). Commonalities and contradictions in HRM and performance research. Human Resource Management Journal, 3, 67–94.
Brannick, M. T., Cadle, A., & Levine, E. L. (2012). Job analysis for KSAOs, predictor measures, and performance outcomes. In N. Schmitt (Ed.), The Oxford handbook of personnel assessment and selection (pp. 119–146). New York, NY: Oxford University Press.
Brannick, M. T., Levine, E. L., & Morgeson, F. P. (2007). Job and work analysis: Methods, research and applications for human resource management (2nd ed.). Los Angeles, CA: Sage.
Bridges, W. (1994). Job shift: How to prosper in a world without jobs. Reading, MA: Addison-Wesley.
Campbell, J. P., & Wiernik, B. M. (2015). The modeling and assessment of work performance. Annual Review of Organizational Psychology and Organizational Behavior, 2, 47–74.
Campitelli, G., & Gobet, F. (2004). Adaptive expert decision making: Skilled chess players search more and deeper. Journal of the International Computer Games Association, 27, 209–216.
Cannon-Bowers, J., Bowers, C., Stout, R., & Ricci, K. (2013). Using cognitive task analysis to develop simulation-based training for medical tasks. Military Medicine, 178(10), 15–21.
Cappelli, P., & Keller, J. (2013). Classifying work in the new economy. Academy of Management Review, 38(4), 575–596.
Cascio, W. F., & Young, C. E. (2005). Work-family balance: Does the market reward firms that respect it? In D. F. Halpern & S. E. Murphy (Eds.), From work-family balance to work-family interaction: Changing the metaphor (pp. 49–63). Mahwah, NJ: Erlbaum.
Chao, C. J., & Salvendy, G. (1994). Percentage of procedural knowledge acquired as a function of the number of experts from whom knowledge is required for diagnosis, debugging and interpretation tasks. International Journal of Human-Computer Interaction, 6, 221–233.
Christal, R. E., & Weissmuller, J. J. (1988). Job-task inventory analysis. In S. Gael (Ed.), The job analysis handbook for business, industry and government (Vol. 2, pp. 1036–1050). New York, NY: Wiley.
Clark, R. E., & Estes, F. (1996). Cognitive task analysis for training. International Journal of Educational Research, 25(5), 403–417.
Coates, J. F., Jarratt, J., & Mahaffie, J. B. (1990). Future work: Seven critical forces reshaping work and the work force in North America. San Francisco, CA: Jossey-Bass.
Conley, P. R., & Sackett, P. R. (1987). Effects of using high- versus low-performing job incumbents as sources of job analysis information. Journal of Applied Psychology, 72, 434–437.
Cooke, N. J. (1994). Varieties of knowledge elicitation techniques. International Journal of Human-Computer Studies, 41, 801–849.
Craig, C., Klein, M. I., Griswold, J., Gaitonde, K., McGill, T., & Halldorsson, A. (2012). Using cognitive task analysis to identify critical decisions in the laparoscopic environment. Human Factors, 54(6), 1025–1039.
Cunningham, J. W. (2000). Introduction to a new journal. Ergometrika, 1, 1–23.
Dawis, R. V. (1991). Vocational interests, values, and preferences. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed., pp. 833–872). Palo Alto, CA: Consulting Psychologists Press.
Doering, M., Rhodes, S. R., & Kaspin, J. (1988). Factor structure comparison of occupational needs and reinforcers. Journal of Vocational Behavior, 32, 127–138.
Droege, R. C. (1988). Department of Labor job analysis methodology. In S. Gael (Ed.), The job analysis handbook for business, industry, and government (Vol. II, pp. 993–1018). New York, NY: Wiley.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43(166), 38295–38309.
Feldon, D. F., Timmerman, B. C., Stowe, K. A., & Showman, R. (2010). Translating expertise into effective instruction: The impacts of cognitive task analysis (CTA) on lab report quality and student retention in the biological sciences. Journal of Research in Science Teaching, 47(10), 1165–1185.
Fine, S. A., & Cronshaw, S. F. (1999). Functional job analysis: A foundation for human resources management. Mahwah, NJ: Erlbaum.
Fine, S. A., & Wiley, W. W. (1971). An introduction to functional job analysis: A scaling of selected tasks. Kalamazoo, MI: Upjohn Institute.
Flanagan, J. C. (1954). The critical incident technique. Psychological Bulletin, 51, 327–358.
Fleishman, E. A., & Quaintance, M. K. (1984). Taxonomies of human performance. Orlando, FL: Academic Press.
Fogli, L., & Whitney, K. (1998). Assessing and changing managers for new organizational roles. In R. Jeanneret & R. Silzer (Eds.), Individual psychological assessment (pp. 285–329). San Francisco, CA: Jossey-Bass.
Frederiksen, N. (1972). Toward a taxonomy of situations. American Psychologist, 27, 114–123.
Gael, S. (1988). The job analysis handbook for business, industry, and government. New York, NY: Wiley.
Gatewood, R. D., Feild, H. S., & Barrick, M. (2008). Human resource selection (6th ed.). Mason, OH: Thomson/South-Western.
Gibson, C. B., Porath, C. L., Benson, G. S., & Lawler, E. E., III. (2007). What results when firms implement practices: The differential relationship between specific practices, firm financial performance, customer service, and quality. Journal of Applied Psychology, 92, 1467–1480.
Gilbreth, F. B. (1911). Motion study. Princeton, NJ: Van Nostrand.
Goffin, R. D., & Woycheshin, D. E. (2006). An empirical method of determining employee competencies/KSAOs from task-based job analysis. Military Psychology, 18, 121–130.
Goldstein, I. L. (1997). Interrelationships between the foundations for selection and training systems. In N. Anderson & P. Herriot (Eds.), International handbook of selection and assessment (pp. 529–541). New York, NY: Wiley.
Goldstein, I. L., Zedeck, S., & Schneider, B. (1993). An exploration of the job analysis-content validity process. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 3–34). San Francisco, CA: Jossey-Bass.
Gomes, J. O., Borges, M. R. S., Humber, G. J., & Carvalho, P. V. R. (2014). Analysis of the resilience of team performance during a nuclear emergency response exercise. Applied Ergonomics, 45, 780–788.
Gordon, T. G., Coovert, M. D., & Elliott, L. R. (2012). Integrating cognitive task analysis and verbal protocol analysis: A typology for describing jobs. In M. A. Wilson, W. Bennett, S. G. Gibson, & G. Michaels (Eds.), The handbook of work analysis: Methods, systems, applications and science of work measurement in organizations (pp. 625–640). New York, NY: Routledge Academic.
Guest, D., Conway, N., & Dewe, P. (2004). Using sequential tree analysis to search for “bundles” of HR practices. Human Resource Management Journal, 14, 79–96.
Guthrie, J. (2001). High-involvement work practices, turnover, and productivity: Evidence from New Zealand. Academy of Management Journal, 44, 180–192.
Hackman, J. R., & Oldham, G. R. (1976). Motivation through the design of work: Test of a theory. Organizational Behavior and Human Performance, 16, 250–279.
Hall, E. P., Gott, S. P., & Pokorny, R. A. (1995). A procedural guide to cognitive task analysis: The PARI methodology (Rep. No. AL/HR-TR-1995–0108). Brooks Air Force Base, TX: U.S. Air Force.
Hanson, M. A., Borman, W. C., Kubisiak, U. C., & Sager, C. E. (1999). An occupational information system for the 21st century: The development of O*NET. Washington, DC: American Psychological Association.
Harvey, R. J. (1986). Quantitative approaches to job classification: A review and critique. Personnel Psychology, 39, 267–289.
Higgs, A. C., Papper, E. M., & Carr, L. S. (2000). Integrating selection with other organizational processes and systems. In J. F. Kehoe (Ed.), Managing selection in changing organizations: Human resource strategies (pp. 73–122). San Francisco, CA: Jossey-Bass.
Hoffman, R. R., Crandall, B., & Shadbolt, N. (1998). Use of the critical decision method to elicit expert knowledge: A case study in the methodology of cognitive task analysis. Human Factors, 40, 254–276.
Hogan, J., Davies, S., & Hogan, R. (2007). Generalizing personality-based validity evidence. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 181–229). San Francisco, CA: Jossey-Bass.
Hogan, R., & Hogan, J. (1992). Hogan Personality Inventory manual. Tulsa, OK: Hogan Assessment Systems.
Holland, J. L. (1973). Making vocational choices: A theory of careers. Englewood Cliffs, NJ: Prentice-Hall.
Holland, J. L. (1997). Making vocational choices: A theory of vocational personalities and work environments (3rd ed.). Odessa, FL: PAR.
Huselid, M. A. (1995). The impact of human resource management practices on turnover, productivity, and corporate performance. Academy of Management Journal, 38, 635–672.
Jaramillo, F., & Grisaffe, D. B. (2009). Does customer orientation impact objective sales performance? Insights from a longitudinal model in direct selling. Journal of Personal Selling & Sales Management, 29(2), 167–178. doi: 10.2753/PSS0885-3134290205
Johns, G. (2006). The essential impact of context on organizational behavior. Academy of Management Review, 31, 386–408.
Johnson, S., Healey, A., Evans, J., Murphy, M., Crawshaw, M., & Gould, D. (2006). Physical and cognitive task analysis in interventional radiology. Clinical Radiology, 61, 97–103.
Johnston, W. B., & Packer, A. E. (1987). Workforce 2000. Indianapolis, IN: Hudson Institute.
Kahneman, D., & Klein, G. (2009). Conditions for intuitive expertise: A failure to disagree. American Psychologist, 64, 515–526.
Kerber, K. W., & Campbell, J. P. (1987). Correlates of objective performance among computer salespeople: Tenure, work activities, and turnover. Journal of Personal Selling & Sales Management, 7, 39–50.
Khan, K., & Ramachandran, S. (2012). Conceptual framework for performance assessment: Competency, competence and performance in the context of assessments in healthcare—Deciphering the terminology. Medical Teacher, 34, 920–928.
Klein, G. A. (1998). Sources of power: How people make decisions. Cambridge, MA: MIT Press.
Klein, G., Calderwood, R., & MacGregor, D. (1989). Critical decision method for eliciting knowledge. IEEE Transactions on Systems, Man, and Cybernetics, 19(3), 462–472.
Klein, G., Klein, H. A., Lande, B., Borders, J., & Whitacre, J. C. (2015). Police and military as good strangers. Journal of Occupational and Organizational Psychology, 88, 231–250.
Landy, F. J. (2007). The validation of personnel decisions in the twenty-first century: Back to the future. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 409–426). San Francisco, CA: Jossey-Bass.
Lawler, E. E., III., Mohrman, S. A., & Benson, G. (2001). Organizing for high performance: Employee involvement, TQM, reengineering, and knowledge management in the Fortune 1000. San Francisco, CA: Jossey-Bass.
Lesgold, A., Rubinson, H., Feltovich, P., Glaser, R., Klopfer, D., & Wang, Y. (1988). Expertise in complex skill: Diagnosing X-ray pictures. In M. T. H. Chi, R. Glaser, & M. J. Farr (Eds.), The nature of expertise (pp. 311–342). Hillsdale, NJ: Erlbaum.
Levine, E. L., & Sanchez, J. I. (2007). Evaluating work analysis in the 21st century. Ergometrika, 4, 1–11.
Lievens, F., Buyse, T., & Sackett, P. R. (2005). The operational validity of a video-based situational judgment test for medical college admissions: Illustrating the importance of matching predictor and criterion construct domains. Journal of Applied Psychology, 90(3), 442–452.
Lievens, F., & Motowidlo, S. J. (2016). Situational judgment tests: From measures of situational judgment to measures of general domain knowledge. Industrial and Organizational Psychology: Perspectives on Science and Practice, 9, 3–22.
Lievens, F., Sanchez, J. I., & De Corte, W. (2004). Easing the inferential leap in competency modeling: The effects of task-related information and subject matter expertise. Personnel Psychology, 57, 881–904.
Lindell, M. K., & Brandt, C. J. (1999). Assessing interrater agreement on the job relevance of a test: A comparison of the CVI, T, rWG(J), and r*WG(J) indexes. Journal of Applied Psychology, 84, 640–647.
MacDuffie, J. P. (1995). Human resource bundles and manufacturing performance: Organizational logic and flexible production systems in the world auto industry. Industrial and Labor Relations Review, 48, 197–221.
McCormick, E. J. (1979). Job analysis: Methods and applications. New York, NY: AMACOM.
McCormick, E. J., Jeanneret, P. R., & Mecham, R. M. (1972). A study of job characteristics and job dimensions as based on the Position Analysis Questionnaire (PAQ). Journal of Applied Psychology, 56, 347–368.
McLagan, P. A. (1988). Flexible job models: A productivity strategy for the information age. In J. P. Campbell & R. J. Campbell (Eds.), Productivity in organizations (pp. 369–387). San Francisco, CA: Jossey-Bass.
McLarty, J. R., & Vansickle, T. R. (1997). Assessing employability skills: The Work Keys super-system. In H. F. O’Neil, Jr. (Ed.), Workforce readiness: Competencies and assessment (pp. 293–325). Mahwah, NJ: Erlbaum.
McPhail, S. M. (2007). Alternative validation strategies: Developing new and leveraging existing validity evidence. San Francisco, CA: Jossey-Bass.
Militello, L. G., & Hutton, R. B. (1998). Applied cognitive task analysis (ACTA): A practitioner’s toolkit for understanding cognitive task demands. Ergonomics, 41, 1618–1641.
Mislevy, R. J., Steinberg, L. S., Breyer, F. J., Almond, R. G., & Johnson, L. (1999). A cognitive task analysis with implications for designing simulation-based performance assessment. Computers in Human Behavior, 15, 335–374.
Mitchell, J. L., & Driskill, W. E. (1996). Military job analysis: A historical perspective. Military Psychology, 8, 119–142.
Morgeson, F. P., & Campion, M. A. (1997). Social and cognitive sources of potential inaccuracy in job analysis. Journal of Applied Psychology, 82, 627–655.
Morgeson, F. P., & Campion, M. A. (2000). Accuracy in job analysis: Toward an inference-based model. Journal of Organizational Behavior, 21, 819–827.
Morgeson, F. P., Delaney-Klinger, K., Mayfield, M. S., Ferrara, P., & Campion, M. A. (2004). Self-presentation processes in job analysis: A field experiment investigating inflation in abilities, tasks, and competencies. Journal of Applied Psychology, 89, 674–686.
Morrison, E. W. (1994). Role definitions and organizational citizenship behavior: The importance of the employee’s perspective. Academy of Management Journal, 37, 1543–1567.
O’Boyle, E., & Aguinis, H. (2012). The best and the rest: Revisiting the norm of normality of individual performance. Personnel Psychology, 65, 79–119.
Offermann, L. R., & Gowing, M. K. (1993). Personnel selection in the future: The impact of changing demographics and the nature of work. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 385–417). San Francisco, CA: Jossey-Bass.
Pearlman, K. (1980). Job families: A review and discussion of their implications for personnel selection. Psychological Bulletin, 87, 1–28.
Pearlman, K. (April 1997). Competencies: Issues in their application. In R. C. Page (Ed.), Competency models: What are they and do they work? Practitioner forum conducted at the meeting of the Society for Industrial and Organizational Psychology, St. Louis, MO.
Pearlman, K., & Barney, M. F. (2000). Selection for a changing workplace. In J. F. Kehoe (Ed.), Managing selection in changing organizations: Human resource strategies (pp. 3–72). San Francisco, CA: Jossey-Bass.
Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373–406.
Peterson, N. G., & Bownas, D. A. (1982). Skill, task structure, and performance acquisition. In M. D. Dunnette & E. A. Fleishman (Eds.), Human performance and productivity: Vol. 1. Human capability assessment (pp. 49–105). Hillsdale, NJ: Erlbaum.
Peterson, N. G., Mumford, M. D., Borman, W. C., Jeanneret, P. R., & Fleishman, E. A. (1999). An occupational information system for the 21st century: The development of O*NET. Washington, DC: American Psychological Association.
Phipps, D. L., Meakin, G. H., & Beatty, P. C. W. (2011). Extending hierarchical task analysis to identify cognitive demands and information design requirements. Applied Ergonomics, 42, 741–748.
Primoff, E. S. (1975). How to prepare and conduct job-element examinations (U.S. Civil Service Commission Technical Study 75–1). Washington, DC: U.S. Government Printing Office.
Primoff, E. S., & Fine, S. A. (1988). A history of job analysis. In S. Gael (Ed.), The job analysis handbook for business, industry, and government (Vol. I, pp. 14–29). New York, NY: Wiley.
Pugh, C., & DaRosa, D. A. (2013). Use of cognitive task analysis to guide the development of performance-based assessments for intraoperative decision making. Military Medicine, 178(10), 22–27. doi: 10.7205/MILMED-D-13-00207
Raymark, P. H., Schmit, M. J., & Guion, R. M. (1997). Identifying potentially useful personality constructs for employee selection. Personnel Psychology, 50, 723–736.
Rodriguez, D., Patel, R., Bright, A., Gregory, D., & Gowing, M. K. (2002). Developing competency models to promote integrated human-resource practices. Human Resource Management, 41, 309–324.
Roth, E. (2008). Uncovering the requirements of cognitive work. Human Factors, 50(3), 475–480.
Rothkopf, E. Z. (1986). Cognitive science application to human resources problems. In T. G. Sticht, F. R. Chang, S. Wood, & B. A. Hutson (Eds.), Advances in reading/language research (pp. 283–289). Greenwich, CT: JAI Press.
Ryder, J. M., & Redding, R. E. (1993). Integrating cognitive task analysis into instructional systems development. Educational Technology Research and Development, 41(2), 75–96.
Sackett, P. R. (2003). Exploring strategies for clustering military occupations. In A. K. Wigdor & B. F. Green (Eds.), Performance assessment for the workplace (Vol. II, pp. 305–332). Washington, DC: National Academy Press.
Salmon, P., Jenkins, D., Stanton, N., & Walker, G. (2010). Hierarchical task analysis vs. cognitive work analysis: Comparison of theory, methodology and contribution to system design. Theoretical Issues in Ergonomics Science, 11(6), 504–531.
Sanchez, J. I. (1994). From documentation to innovation: Reshaping job analysis to meet emerging business needs. Human Resource Management Review, 4, 51–74.
Sanchez, J. I. (2000). Adapting work analysis to a fast-paced and electronic business world. International Journal of Selection and Assessment, 8, 204–212.
Sanchez, J. I., & Fraser, S. L. (1994). An empirical procedure to identify job duty-skill linkages in managerial jobs: A case example. Journal of Business and Psychology, 8, 309–326.
Sanchez, J. I., & Levine, E. L. (2012). The rise and fall of job analysis and the future of work analysis. Annual Review of Psychology, 63, 397–425. doi: 10.1146/annurev-psych-120710-100401
Sanchez, J. I., Prager, I., Wilson, A., & Viswesvaran, C. (1998). Understanding within-job title variance in job-analytic ratings. Journal of Business and Psychology, 12, 407–420.
Schippman, J. S. (1999). Strategic job modeling: Working at the core of integrated human resources. Mahwah, NJ: Erlbaum.
Schippman, J. S., Ash, R. A., Battista, M., Carr, L., Eyde, L. D., & Hesketh, B. (2000). The practice of competency modeling. Personnel Psychology, 53, 703–740.
Schmidt, F. L., Hunter, J. E., & Pearlman, K. (1981). Task differences as moderators of aptitude test validity in selection: A red herring. Journal of Applied Psychology, 66, 166–185.
Schraagen, J. M., Chipman, S. F., & Shalin, V. L. (2000). Cognitive task analysis. Mahwah, NJ: Erlbaum.
Secretary’s Commission on Achieving Necessary Skills. (April 1992). Learning a living: A blueprint for high performance: A SCANS report for America 2000. Washington, DC: U.S. Department of Labor.
Snow, C. C., & Snell, S. A. (1993). Staffing as strategy. In N. Schmitt, W. C. Borman, & Associates (Eds.), Personnel selection in organizations (pp. 448–478). San Francisco, CA: Jossey-Bass.
Spencer, L. M., McClelland, D. C., & Spencer, S. (1994). Competency assessment methods: History and state of the art. Boston, MA: Hay-McBer Research Press.
Spreitzer, G. M. (1995). Psychological empowerment in the workplace: Construct definition, measurement, and validation. Academy of Management Journal, 38, 1442–1465.
Staw, B. M., & Epstein, L. (2000). What bandwagons bring: Effects of popular management techniques on corporate performance, reputation, and CEO pay. Administrative Science Quarterly, 45, 523–559.
Sullivan, M. E., Brown, C. V. R., Peyre, S. E., Salim, A., Martin, M., Towfigh, S., & Grunwald, T. (2007). The use of cognitive task analysis to improve the learning of percutaneous tracheostomy placement. American Journal of Surgery, 193, 96–99.
Sullivan, M. E., Yates, K. A., Inaba, K., Lam, L., & Clark, R. E. (2014). The use of cognitive task analysis to reveal the instructional limitations of experts in the teaching of procedural skills. Academic Medicine, 89, 811–816.
Tett, R. P., & Burnett, D. D. (2003). A personality trait-based interactionist model of job performance. Journal of Applied Psychology, 88, 500–517.
Tinsley, H. E. A., & Weiss, D. J. (1971). A multitrait-multimethod comparison of job reinforcer ratings of supervisors and supervisees. Journal of Vocational Behavior, 1, 287–299.
Tjiam, I. M., Schout, B. M. A., Hendrikx, J. M., Scherpbier, A. J. J. M., Witjes, J. A., & Van Merrienboer, J. J. G. (2012). Designing simulator-based training: An approach integrating cognitive task analysis and four-component instructional design. Medical Teacher, 34, e698–e707.
Tofel-Grehl, C., & Feldon, D. F. (2013). Cognitive task analysis-based training: A meta-analysis of studies. Journal of Cognitive Engineering and Decision Making, 7(3), 293–304.
U.S. Department of Education, National Center for Education Statistics. (2002). Classification of instructional programs: 2000 edition. Washington, DC: Author.
U.S. Department of Labor. (1965a). Dictionary of occupational titles (3rd ed., Vol. 1). Washington, DC: U.S. Government Printing Office.
U.S. Department of Labor. (1965b). Dictionary of occupational titles (3rd ed., Vol. 2). Washington, DC: U.S. Government Printing Office.
Vaughn, D. S., & Bennett, W., Jr. (November 2002). Toward a unified theory of work. Organizational simulations and policy analyses (AFRL-HE-AZ-TP-2002–0014). Mesa, AZ: Air Force Research Laboratory.
Velmahos, G. C., Toutouzas, K. G., Sillin, L. F., Chan, L., Clark, R. E., Theodoru, D., & Maupin, F. (2004). Cognitive task analysis for teaching technical skills in an inanimate surgical skills laboratory. American Journal of Surgery, 187, 114–119.
Wei, J., & Salvendy, G. (2003). The utilization of the Purdue Cognitive Job Analysis methodology. Human Factors and Ergonomics in Manufacturing, 13(1), 59–84.
Wei, J., & Salvendy, G. (2004). The cognitive task analysis methods for job and task design: Review and reappraisal. Behaviour & Information Technology, 23(4), 273–299.
Wexley, K. N., & Silverman, S. B. (1978). An examination of differences between managerial effectiveness and response patterns on a structured job analysis questionnaire. Journal of Applied Psychology, 63, 646–649.
Williams, A. P. O., & Dobson, P. (1997). Personnel selection and corporate strategy. In N. Anderson & P. Herriot (Eds.), International handbook of selection and assessment (pp. 219–245). New York, NY: Wiley.
Williamson, D. M., Bauer, M., Steinberg, L. S., Mislevy, R. J., Behrens, J. T., & DeMark, S. F. (2004). Design rationale for a complex performance assessment. International Journal of Testing, 4(4), 303–332.
Wilson, M. A. (1997). The validity of task coverage ratings by incumbents and supervisors. Journal of Business and Psychology, 12, 85–95.
Wilson, M. A. (2007). A history of job analysis. In L. Koppes (Ed.), Historical perspectives in industrial and organizational psychology (pp. 219–241). Mahwah, NJ: Erlbaum.
Wilson, M. A., Bennett, W., & Gibson, S. G. (2012). The handbook of work analysis: Methods, systems, applications and science of work measurement in organizations. New York, NY: Routledge/Taylor & Francis.
Yates, K. A., & Feldon, D. F. (2011). Advancing the practice of cognitive task analysis: A call for taxonomic research. Theoretical Issues in Ergonomics Science, 12(6), 472–495.

Part II
IMPLEMENTATION AND MANAGEMENT OF EMPLOYEE SELECTION SYSTEMS IN WORK ORGANIZATIONS
JERARD F. KEHOE AND ROBERT E. PLOYHART, SECTION EDITORS

7
ATTRACTING JOB CANDIDATES TO ORGANIZATIONS
ANN MARIE RYAN AND TANYA DELANY

Recruiting is more complex today than it has ever been. Technology makes it possible to find skilled, cost-effective talent in all corners of the world, enabling globally integrated workforces. However, to be successful, corporations need recruiting models that accommodate growth markets and mature markets, entry and experienced professionals, and a wider array of jobs and career paths. Corporations must also develop successful recruiting strategies to secure hot skills or market value skills. Recruiting models must leverage global best practices while complying with local legislation and managing local cultures. Recruiting must also move candidates through the hiring process more quickly than ever while managing greater volumes of applicants than in the past.

The ability to attract individuals to work at organizations is a topic of perennial research interest. Major reviews of the research on recruitment appear periodically (e.g., Barber, 1998; Breaugh & Starke, 2000; Ployhart, 2006; Rynes, 1991; Rynes & Cable, 2003; Rynes, Heneman, & Schwab, 1980; Taylor & Collins, 2000), including a recent handbook (Yu & Cable, 2014). Given our space constraints, in this chapter, we look at current questions regarding applicant attraction arising from recent workplace trends as a means of framing a practical research agenda for the future. Specifically, we address what is known and what we need to know about applicant attraction in light of globalization, advances in technology in recruitment, and organizational efforts toward more strategic talent management. We have organized our review from a more traditional recruitment process perspective into the three stages of reaching potential applicants, maintaining applicant interest, and securing offer acceptance (Barber, 1998). Our focus is more specifically on research and practice advances over the last five years, since the first edition of this volume. Because considerably more research and practice advances have focused on the first stage of reaching applicants, we devote much of our space to that stage.

REACHING POTENTIAL APPLICANTS
Traditionally, human resource (HR) efforts at recruitment have placed a heavy emphasis on how to create awareness of opportunities among desired potential applicants. Today’s recruiting model focuses on recruiters being marketers: the goal is to attract qualified candidates to an employer brand and convert them into applicants. Most of the research on generating interest in job openings relates to (a) who provides information (i.e., recruitment sources), (b) what information is provided (e.g., how much specificity, how much realism, creating brand equity), and (c) how to best provide information to catch attention (i.e., advertising and websites).

From Whom/Where Do Candidates Obtain Information?
Although it is a long-held belief that the quantity and quality of the applicant pool are affected by the source of recruitment information and the nature of the information, research on recruitment source effects on applicant pool quality often yields unclear results (Zottoli & Wanous, 2000). Because job seekers often obtain information from multiple sources (Vecchio, 1995), and the same source can be used by job seekers in different ways (Rynes & Cable, 2003), pinpointing specific source effects may be challenging. More importantly, consideration of how source often is confounded with content (specifically, content realism and valence; Barber & Roehling, 1993) is needed. In general, credible and closer ties have greater influence, particularly with regard to how negative information is considered (Keeling, McGoldrick, & Sadhu, 2013). Referrals are generally believed to yield higher-quality applicants and offer acceptances; however, work by Pieper (2015) suggests that referral hires from high-performing employees perform better but have higher turnover than those from low performers.

The ability to understand source effects is changing as companies are using Big Data to assist in understanding sourcing strategies (Walker, 2012). For example, Xerox cut attrition rates at call centers by 20% by using Big Data tools (Walker, 2012). IBM analyzes sourcing channels in terms of offer acceptance, candidate onboarding evaluations, as well as first-year performance and employee engagement. Gartner Research predicts that Big Data in recruiting will be a $232 billion industry by 2016.

The most common sources are changing. Organizations are leveraging current employee and company alumni networks to spread vacancy information and to tap potential talent (Caers & Castelyns, 2011). According to LinkedIn, social professional networks are the fastest-growing source of quality hires. Reportedly, 73% of 18- to 34-year-olds found their last job through a social network (Medved, 2014), and 21% of candidates say they found their best job through a social network (Jobvite, 2014). However, social sources (e.g., employer review sites such as GlassDoor, LinkedIn, company Facebook groups, industry-specific job seeker sites, blogs, etc.) also have become an easily available resource for candidates to learn about a company (Chauhan, Buckley, & Harvey, 2013). Because candidates have greater access to information, they build their own understanding of a company, not just based on the information the company publishes. Thus, organizations are focused on developing “social recruitment” strategies with a consideration of the dynamic nature of social media content, which is not entirely in the organization’s control.

The availability of information has also changed things from the recruiter’s perspective. The proliferation of available information about candidates has made it possible for recruiters to match a person’s professional and personal fit more closely to the company’s opening and corporate culture, respectively. People analytics tools and techniques (i.e., Big Data) allow firms to develop a much more complete profile of a candidate—far beyond a brief introduction letter and resume. While the research on social site recruitment is still emerging, there are some takeaways.
For example, while third-party websites (e.g., Monster.com, Careerbuilder.com) can generate many low-quality applicants, they also provide about as many high-quality applicants as do organizational websites (Talmage, 2012). Another example is research on the effective use of online social networks, which shows that recruiters who have secured a central network position as a connector and who brand themselves well (in addition to employer branding) are most successful in attracting quality candidates (Ollington, Gibb, & Harcourt, 2013).

Although the overall conclusion of research is that source can play an important role in applicants’ initial attraction, there is now a much greater awareness of the variety of sources a candidate can draw from, the fluidity of information from those sources, and the relative influence an organization has over the information from these sources. However, while there has been a flood of articles in the last five years about the potential of social recruitment, critical evaluations of the effectiveness of different strategies are still sorely needed.

At a practical level, Van Hoye (2014) suggests that organizations must collect information on “what is being said about them, by whom, to whom, and through which media” (p. 264). He also suggests that organizations attempt indirect influences on word-of-mouth communications by rewarding current employees for positive referrals, using credible testimonials of current employees, and making sure that when recruiters do actively participate in social channels, they are transparent about their role and affiliation. Perhaps the greatest shift is that companies now understand that all employees are recruitment ambassadors, and they must work to ensure that all are prepared to engage with potential candidates, not just designated recruiters.
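The source-effectiveness analytics described earlier in this section (e.g., comparing channels on offer acceptance and first-year outcomes) can be illustrated with a small, hypothetical computation. The records, channel names, field layout, and figures below are invented for illustration only and do not reproduce any organization’s actual data or methodology.

```python
from collections import defaultdict

# Hypothetical applicant records: (source, offer_made, offer_accepted, first_year_rating)
applicants = [
    ("employee_referral", True, True, 4.2),
    ("employee_referral", True, False, None),
    ("social_network",    True, True, 3.8),
    ("job_board",         False, False, None),
    ("job_board",         True, True, 3.1),
    ("company_website",   True, True, 3.9),
]


def channel_metrics(records):
    """Aggregate per-source funnel and quality metrics (illustrative assumptions only)."""
    stats = defaultdict(lambda: {"applicants": 0, "offers": 0, "accepts": 0, "ratings": []})
    for source, offered, accepted, rating in records:
        s = stats[source]
        s["applicants"] += 1
        s["offers"] += offered      # booleans count as 0/1
        s["accepts"] += accepted
        if rating is not None:
            s["ratings"].append(rating)
    report = {}
    for source, s in stats.items():
        report[source] = {
            "acceptance_rate": s["accepts"] / s["offers"] if s["offers"] else 0.0,
            "yield_per_applicant": s["accepts"] / s["applicants"],
            "avg_first_year_rating": sum(s["ratings"]) / len(s["ratings"]) if s["ratings"] else None,
        }
    return report


for source, metrics in channel_metrics(applicants).items():
    print(source, metrics)
```

In practice, of course, such comparisons would need far larger samples and controls for confounds between source and content noted above; the sketch only shows the basic bookkeeping.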

What Information Are Candidates Obtaining?
Cable and Turban (2001) described an applicant’s knowledge of the company as having three dimensions: familiarity (awareness), reputation (global affective impression, brand), and image (attributes associated with the organization). They argued that these, in conjunction with a job seeker’s values and needs, will determine attraction. Thus, researchers have explored which specific organizational attributes are perceived most favorably. For example, achievement, concern for others, honesty, and fairness are seen as the most salient work values, and their effects on applicant behavior have been established (Judge & Bretz, 1992; Ravlin & Meglino, 1987), as has the value of portraying an organizational culture as supportive (Catanzaro, Moore, & Marshall, 2010). Individuals are attracted to an organization if they feel that it invokes prestige and impresses others and/or allows them to express their values (Highhouse, Thornbury, & Little, 2007). More recently, research has focused on how portraying specific values (e.g., social and environmental responsibility; Gully, Phillips, Castellano, Han, & Kim, 2013; Tsai, Joe, Lin, & Wang, 2014; Zhang & Gowan, 2012) or organizational policies (e.g., mandatory and binding arbitration [Bernardin, Richey, & Castro, 2011], work-family policies, employee development policies [Casper, Wayne, & Manegold, 2013], and diversity policies [Avery et al., 2013]) in websites and advertisements might affect attraction.

Values emerge as important in recent surveys globally. PricewaterhouseCoopers’ 2011 Millennial study of more than 4,000 Millennials in 75 countries found that just over half of this population reported being attracted to employers because of their corporate social responsibility position, with 56% being willing to leave an employer that did not have the values they expected. The report also found that 44% of those questioned said competitive wages made an employer more attractive, the second highest proportion for any factor given. The biggest draw for Millennials, however, was the opportunity for career progression—52% said that they felt this made an employer an attractive prospect. These results were replicated in IBM’s 2015 survey of more than 9,000 potential candidates in more than 30 countries, in which respondents reported that the top three factors determining an organization’s attractiveness were (1) meaningful and impactful jobs, (2) innovative products and services, and (3) opportunities. In sum, research has converged on showing that there are universally favored attributes as well as specific value matching underlying how recruitment content affects attraction.

What about information valence? Studies suggest that the presentation of negative or realistic information will have differential effects on different categories of job seekers (e.g., those with less prior job experience [Meglino, DeNisi, & Ravlin, 1993]; higher-quality applicants [Bretz & Judge, 1998; Thorsteinson, Palmer, Wulff, & Anderson, 2004]). Further, Highhouse, Stanton, and Reeve (2004) found that negative information about prospective employers is discounted more than positive information. Also, lack of information (e.g., about pay) can lead to negative inferences and lower attractiveness perceptions (Yuce & Highhouse, 1997; however, see also Highhouse & Hause, 1995 and Maurer, Howe, & Lee, 1992).
Reeve, Highhouse, and Brooks (2006) also showed that one negative piece of information can affect the accumulated effects of multiple moderately positive pieces of information (i.e., the relative balance of positive or negative information is not as important as the intensity of one’s affective response to a piece of negative information). The overall conclusion of this line of research is that although providing realistic information (and negative information) may adversely affect the attraction of some desired applicants, its overall effect on applicant pool quality and quantity may depend on the specifics of the job, industry, labor market, job seekers, and nature of the information.

How can organizations affect their image? As Yang and Yu (2014) demonstrated, recruitment messages should include both need fulfillment and value expression elements to maximize attractiveness. Further, De Cooman and Pepermans (2012) showed that nonprofit ads often presented more extrinsic value information than did profit-sector ads, highlighting that if certain information is assumed or obvious to job seekers (e.g., that nonprofits have value-driven missions and provide opportunities for value expression), it may help attraction to highlight less anticipated and more differentiating information.

One of the most important conclusions from the burgeoning body of research on organizational image is the importance of congruity. Baum, Schafer, and Kabst (2015) showed that advertisements that were incongruent with an individual’s already established image of an organization lowered perceptions of credibility and attraction. Although the importance of “alignment” of message across recruitment platforms is generally acknowledged, it is important to recognize the role of pre-existing corporate images and how they affect perceptions of recruitment activities. For example, British Petroleum (BP) devoted considerable effort to recapturing its place as a premier employer brand after the oil leak in the Gulf of Mexico through reports and videos reinforcing its commitment to its employees and to the environment (O’Meara & Petzall, 2013).

How Should Information Be Presented?
In general, technology, and in particular the Internet, has facilitated the capabilities of recruiting functions to reach more potential applicants in less time and for less money; that is, technology exponentially enhances the efficiency of recruiting (Lievens & Harris, 2003). Dineen and Allen (2014) provide a nice summary of how the Internet has shifted the recruitment paradigm by (a) changing the richness of information, especially early in recruitment processes, (b) increasing customization of information, (c) changing from pushing information to job seekers to candidates pulling information, and (d) decentralizing the recruitment function in organizations. Technology can also facilitate the identification of particular talent pools (e.g., communities and other subscriber groups and sites as sources), the tailoring of materials to particular target groups (e.g., different web content depending on answers to a set of questions regarding interests/values), and the inclusion of more information than traditional advertisements (as noted above).

Technological advances do not appear to alter the conclusions of prior research regarding what influences attraction but do afford organizations greater and more unique opportunities to provide more information in much more efficient and effective ways. Organizations have also noted the downsides of using technology in the recruiting process, such as making it easier for applicants to apply to positions regardless of qualifications, creating a greater pool that recruiters must sift through. Another example is provided by Rieucau (2015) in a study of supermarkets in France and the UK, where she noted that proximity to a store was important for early opening hours, yet online applications might lead to more applicants with poorer fit advancing further in a screening process than more local forms of advertisement.

What do we know specifically about information presentation? Cable and Yu (2007) proposed that media richness (multiple cues, language variety, feedback, and personal focus) and media credibility (expertise and trustworthiness) are particularly influential in the formation of applicant beliefs regarding organizations and jobs. Cober, Brown, and Levy (2004) noted that the interaction of form, content, and function is essential (i.e., good content has to be presented in an interactive, navigable, and pleasing way). Key findings in this line of research are that website content and usability play important roles in attraction (Braddy, Thompson, Wuensch, & Grossnickle, 2003; Cober, Brown, Levy, Cober, & Keeping, 2003), but website aesthetics are also fairly important (Cober et al., 2004; Dineen, Ling, Ash, & DelVecchio, 2007; Zusman & Landis, 2002). More recently, Allen, Biggane, Pitts, Otondo, and Van Scotter (2013) found that individuals do pay more attention to text than to graphic images, and that early in the search process the focus is on information about the number and type of job openings, organizational information, and geographic location. They also found that design, and in particular ease of use and the ability to create a more personal presence, were important in addition to content, although content was more important than design.
Similarly, Williamson, King, Lepak, and Sarma (2010) showed that for employers with less positive or weak reputations, the amount of information about the company and job opportunities was important to attraction but the vividness of the website was not; however, for firms with good reputations, vividness and amount of information acted as substitutes, and either could lead to similar levels of attraction, but being low in both led to more negative reactions.

There is growing use of technology to generate applicant interest through new mechanisms: virtual worlds (e.g., Second Life), online job fairs (Flinders, 2007), webinars (Mullich, 2004), gaming and online competitions (e.g., L’Oreal’s business planning contest), and quick-hitting fleeting image ads (e.g., Goldman Sachs ads on Snapchat; Moon & Mzezewa, 2015). There have been several studies on the use of virtual worlds (Badger, Kaminsky, & Behrend, 2014; Howardson & Behrend, 2014) that suggest some caution in their implementation in recruiting, as individuals may not engage fully if they expect the technology to be difficult to use, and individuals tend to acquire less accurate perceptions of person-organization (PO) fit due to the cognitive load in the media-rich environment. As with any form of technological innovation, ensuring that all users gain familiarity (e.g., practice and instructions) and that the technology still meets the goals (e.g., gaining accurate perspectives of fit) is important.

One key question posed by technological advances is whether information should be customized and to what extent. In the past, considering individual differences in reactions to recruitment materials and selection processes was seen as impractical because developing different content and processes for different types of prospective applicants was resource-prohibitive. Technology allows for a much greater capability for differences in what applicants are provided, and thus there is renewed interest in customization. For example, technology is enabling potential candidates to receive job alerts based on their profiles. When searching for jobs on Amazon.jobs, once a job is selected, the user is immediately provided a list of similar jobs, making it easy to find additional openings of potential interest. Kraichy and Chapman (2014) note that one can customize fit information (e.g., ask questions and give feedback on fit), configure information to preferences (e.g., put preferred information first), or tailor the message style and content. Several studies have shown that providing self-screening information (e.g., assessments of fit with the position in terms of abilities, values, and needs) is seen as particularly valuable by applicants and directly affects variables such as information recall and site viewing time (e.g., Dineen, Ash, & Noe, 2002; Dineen et al., 2007). Considering that research on organizational image and match to applicant values shows the importance of fit, and that Uggerslev, Fasina, and Kraichy’s (2012) meta-analysis showed that fit is the largest predictor of applicant attraction, customization to tailor information to applicants and to target compatible individuals would be an effective use of recruitment resources at this early stage. The following box provides a list of company sites that have an interactive tool focused on helping candidates assess fit while providing a unique and differentiated experience.

Interactive Fit Assessments
Accenture: http://careers.accenture.com/us-en/your-future/HGCYBgame/Pages/default.aspx
Campbell’s: http://careers.campbellsoupcompany.com/Career-Fit-Tool
ESPN: http://espncareers.com/career-areas
Goldman Sachs: http://www.goldmansachs.com/careers/why-goldman-sachs/explore-goldman-sachs-careers-quiz/
Home Depot: http://careers.homedepot.com/find-your-fit/
IKEA: http://www.ikea.com/ms/en_US/rooms_ideas/fitquiz09/
L’Oreal: http://www.reveal-thegame.com/usa/
RBC: http://www.rbc.com/careers/findyourfit.html
Save a Lot: http://save-a-lot.com/careers/workinghere/jobmatcher
U.S. Army: http://www.goarmy.com/careers-and-jobs/help-choosing-a-career-job/by-skills-and-interests.html
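Conceptually, the self-screening feedback such tools provide can be thought of as a congruence score between what a candidate reports valuing and the profile the organization claims for itself. The sketch below is a toy illustration of that idea only; the value dimensions, the 1-5 scales, and the distance-based scoring rule are our assumptions and are not how any of the sites listed above actually compute fit.

```python
# Toy person-organization fit score: agreement between a candidate's rated work values
# and a hypothetical employer's self-described profile on the same dimensions (1-5 scales assumed).

ORG_PROFILE = {
    "career_progression": 5,
    "social_responsibility": 4,
    "competitive_pay": 3,
    "innovation": 5,
}


def fit_score(candidate_values: dict) -> float:
    """Return 0-100, where 100 means identical ratings on every shared dimension."""
    shared = [k for k in ORG_PROFILE if k in candidate_values]
    if not shared:
        return 0.0
    # Mean absolute gap on a 1-5 scale, rescaled so smaller gaps yield higher fit.
    max_gap = 4.0
    gaps = [abs(ORG_PROFILE[k] - candidate_values[k]) for k in shared]
    return round(100 * (1 - (sum(gaps) / len(gaps)) / max_gap), 1)


candidate = {"career_progression": 5, "social_responsibility": 2,
             "competitive_pay": 4, "innovation": 4}
print(fit_score(candidate))  # prints 75.0 for this hypothetical candidate
```

Real tools are typically richer (weighting dimensions, branching questions, qualitative feedback), but the underlying logic of comparing candidate and organization profiles on common dimensions is the same idea the fit research above describes.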

In summary, considerable research advances have been made related to attracting applicants. In particular, technology has changed how individuals are sourced, where they are sourced from, and what information is made available to them. The challenge for practitioners and researchers is to understand how to best balance providing information in meaningful and engaging ways without overwhelming applicants or having key recruitment messages lost in an increasingly noisy applicant marketplace.

MAINTAINING INTEREST
There are several research topics concentrated primarily on keeping applicants in the pipeline once they have applied. In the past several years, a number of studies have examined how the factors that matter for recruitment change after candidates express an initial interest. In their meta-analysis of predictors of attraction, Uggerslev et al. (2012) found that perceived fit is the strongest predictor of attraction across stages. With regard to maintaining interest, organizational characteristics and recruitment process characteristics are weighed more heavily and recruiter behaviors weighed less heavily later in the process. Also, perceived alternatives were not a strong predictor early in the process but did become a significant negative predictor later. In an experimental simulation of a multi-stage recruitment process, Saks and Uggerslev (2010) showed that information did have significant effects (positive and negative) at stages subsequent to when it was received, suggesting that some forms of negative information can be “made up for” with positive experiences subsequently (e.g., a personable, informative recruiter after delayed communications) but some might not. In a study of temporal decision context, von Walter, Wentzel, and Tomczak (2012) found that fit was more influential for distant-future decisions, while pay was more relevant for near-future decisions, and concluded that differences in time perspective may affect how job seekers weight factors (i.e., a decision to apply still leaves a job decision in the distant future, whereas job choice is in the near future). Walker et al. (2013) focused specifically on the maintenance phase of recruitment and showed that the treatment applicants received continued to serve as a signal and affect attraction over time. Finally, Griepentrog, Harold, Holtz, Klimoski, and Marsh (2012) showed the importance of organizational identification as a predictor of applicant withdrawal over a three-month period. Their work suggests that organizational socialization begins from the start of recruiting, not just at the time of offer, and can affect applicant reactions and behaviors during the maintenance stage. Overall, these research studies highlight the importance of understanding that what affects attraction might not necessarily be what affects maintaining interest.

Two topics of specific research focus with regard to maintaining interest have been recruiters and site visits. Not surprisingly, applicants prefer and react more positively to recruiters who treat them well and are informative (see Breaugh & Starke, 2000 or Rynes & Cable, 2003 for reviews of this research). McKay and Avery (2006) suggested that both encounter demographics (e.g., the vertical integration of minorities in the organization and in the community) and the quality of the interaction between groups, not just recruiter demographics, will affect applicant perceptions of organizational diversity climate and subsequent job acceptance intentions. These researchers also noted that there is likely significant within-group variance among minority job seekers in reaction to these factors, depending on applicant racioethnic identity, social dominance orientation, and other group orientation (McKay & Avery, 2006). In our view, advice to organizations on recruiters remains pithy: treat applicants nicely, make sure recruiters know the jobs for which they are recruiting, and train recruiters.
It seems that researchers should focus more on the micro level of interactions between recruiters and applicants to better inform training as to what causes affective shifts among applicants. Site visits affect eventual choice (Rynes, Bretz, & Gerhart, 1991; Turban, Campion, & Eyring, 1995), but Breaugh and Starke (2000) noted that despite awareness of this, little research has actually focused on the details of site visits to guide HR practitioners in what truly makes a difference. One study that does tackle this was by Slaughter, Cable, and Turban (2014), who found that when recruits had little confidence in their initial views, they were much more likely to be affected by the site visit than those who already held strong image perceptions (positive or negative), whose views are less likely to change.


In terms of maintaining interest, advances in technology can enable greater, continued contact with applicants (e.g., e-mail updates on status, job alerts, blogs, candidate communities). However, organizations must be diligent in understanding their applicant pools’ preferences and manage technology-enabled communication accordingly. Just as technology can help maintain interest, questions have arisen as to its potential negative effects on retaining applicants. For some, including executives and passive applicants who are not willing to jump through hoops, technology can be viewed as cold and inhuman. Some popular technological innovations may raise privacy concerns if not appropriately managed. For example, the practice of tagging and tracking visitors to corporate career websites and then deploying company ads in other web locations they visit to maintain a company’s presence in the job seeker’s mind has increased (Ruiz, 2008). Ensuring that personal data on the individual are not captured and maintained, that local privacy legislation is not violated, and that pervasive advertising does not turn off applicants is important.

One area that seems underresearched with regard to maintaining interest is that of self-regulatory processes. Recently, Stevens and Seo (2014) summarized the research on job search and emotions and noted the findings regarding motivational regulatory processes and search persistence. This growing body of research focuses solely on the job seeker perspective; we can envision useful research that applies knowledge of self-regulatory processes to understanding how emotions affect willingness to maintain interest in a specific organization, how emotions affect reactions to certain recruitment activities and timelines, and how regulatory processes affect the offer negotiation process.

In summary, recruitment research taking a longitudinal perspective is increasing, and with that increase come new insights regarding how to maintain applicant interest beyond initial attraction. We would anticipate that a consideration of dynamic processes underlying applicant attitudes and behavior changes across recruitment stages will enhance our understanding even further.

ACCEPTING OFFERS

The ratio of job acceptances to offers is considered an important indicator of recruiting success for many organizations (i.e., do the ones we want, want us?). The factors mentioned earlier as affecting attraction are sometimes not as critical to an acceptance: Individuals gather more information, eliminate options on those factors, and change their criteria as they proceed through the job search process. Prior reviews have noted that weak methodologies (e.g., examining intent to accept at early stages of the process) have clouded findings on what actually affects decisions to accept offers. However, several general conclusions have emerged.
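To make the acceptance-to-offer indicator concrete, the following is a minimal sketch that computes it alongside stage-to-stage yield ratios for a recruiting funnel. The counts, stage names, and helper function are hypothetical illustrations, not figures from any study reviewed in this chapter.

```python
# Hypothetical recruiting-funnel counts; the numbers are illustrative only.
funnel = {
    "applicants": 1200,
    "interviewed": 300,
    "offers_extended": 60,
    "offers_accepted": 42,
}

def yield_ratio(later_count: int, earlier_count: int) -> float:
    """Proportion of candidates who survive from one funnel stage to the next."""
    return later_count / earlier_count if earlier_count else float("nan")

# The acceptance-to-offer ratio: "Do the ones we want, want us?"
acceptance_rate = yield_ratio(funnel["offers_accepted"], funnel["offers_extended"])
print(f"Offer acceptance rate: {acceptance_rate:.2f}")  # 0.70 with these counts

# Stage-to-stage yields help locate where desirable candidates drop out.
stages = list(funnel)
for earlier, later in zip(stages, stages[1:]):
    print(f"{earlier} -> {later}: {yield_ratio(funnel[later], funnel[earlier]):.2f}")
```

Tracking such ratios over time, or by source and applicant segment, is one simple way an organization can monitor whether the desirable applicants it attracts are also the ones accepting its offers.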

What Influences?

Organization characteristics are stronger predictors of acceptance intentions than recruiter characteristics, perceptions of the hiring process, or other variables (Chapman et al., 2005). However, we know that applicants make decisions in stages, first screening out incompatible options and then choosing from among surviving options (Beach, 1993), but only a few researchers use a design that accommodates this staged processing. Studies exploring “what is most important” to offer acceptance have been criticized for not creating choice tasks that reflect the informational and motivational context of a job applicant who is considering an actual offer. Hence, job choice often is not well predicted because of issues associated with not considering time of measurement (i.e., range restriction on key variables, applicant motivation and information levels).

Two studies took a longitudinal perspective on what predicts acceptance of job offers. Swider, Zimmerman, and Barrick (2015) focused on differentiation-consolidation theory and applicant fit perceptions over time. They found that job seekers did initially differentiate their person-organization (PO) fit with varied organizations from the start of the recruitment process, that differentiation increased even further over time, and that initial differentiation did predict job choice. They note organizations may want to do as much as possible to differentiate themselves from competitors and produce positive fit perceptions early on (or produce negative fit perceptions with competitors), and to take steps throughout the process to increase PO fit perceptions (or reduce the likelihood of fit perceptions decreasing). In another recent study, Harold, Holtz, Griepentrog, Brewer, and Marsh (2015) showed that final decisions on offer acceptance were predicted by PO fit but also by perceptions of justice, providing support for suggestions on using job-relevant procedures, providing opportunities to demonstrate skills, granting the ability to appeal, treating applicants with respect, allowing for two-way communication, and providing timely and honest communications. In summary, fit perceptions and candidate treatment are clearly a big component of “what predicts” offer acceptance, and those perceptions continue to be malleable throughout the recruitment process.

When Influences?

Timeliness of offer is important (Rynes et al., 1991). Pressures by hiring managers to speed up the recruitment process are not without empirical backing, because one can lose desirable individuals with delays. However, Rynes and Barber (1990) noted that although offers in hand are generally favored over uncertain possibilities from other organizations, this likely varies with quality of applicant, as competitive individuals can afford to wait longer but also may be “snatched up” sooner.

Who Influences?

Although the role of social influencers (e.g., family and friends) in job choice has long been suggested as important (Kilduff, 1990), it is relatively underresearched. One exception would be the U.S. military’s long-time focus on influencers of enlistment decisions (Legree et al., 2000) through the Youth Attitude Tracking Study. In practice, the role of influencers is recognized in various employee referral programs as well as in recruitment activities. For example, to obtain a competitive advantage in attracting applicants, a call center in India conducts “Family Days,” which provide members of a potential applicant’s family with an opportunity to learn about the company. The U.S. military developed advertisements specifically targeted at the hopes and concerns of parents regarding military careers (Neal, 2005). Although social influence is important, Kulkarni and Nithyanand (2013), in a study of graduating seniors at an elite business school in India, showed that most individuals do not see themselves as being influenced greatly by their parents and peers, but report that other job seekers are. Their study suggests that parents have more influence on job choice with regard to financial issues (e.g., salaries, the need to pay off loans) and that peers have more influence in terms of social comparisons (i.e., what everyone else sees as prestigious or glamorous). Kulkarni and Nithyanand make specific suggestions for organizations to engage with job seekers early on to influence organizational image relative to parent and peer influences, to invite candidates to bring guests to open houses or onsite visits, and to use individuals from the same source (e.g., same campus, region) as brand ambassadors. In recent years, the emergence of social sites that feature reviews and commentary from both current and former employees is creating more transparency, giving candidates additional information, beyond the traditional family and friends’ network, to influence their view of a company’s image.

One study that specifically looked at the role of the recruiter in “closing the deal” examined National Collegiate Athletic Association (NCAA) football recruiting. Treadway et al. (2014) found that recruiters who were politically skilled could increase the quantity and quality of those with signed offers when the head coach had strong performance, but those low in political skill could not capitalize on good organizational performance to effectively recruit. One important implication they note is that when high-performing organizations are competing for top talent, recruiter political skill may be a critical determinant of offer acceptance.


In summary, like research on maintaining interest, research on accepting offers has advanced to include more longitudinal studies and more theoretical work on how these decisions are made. A recognition of a broader set of influences (e.g., recruiter political skills, family and friend roles) will enhance our understanding even further.

RECRUITING GLOBALLY

It is critical for an employer to recognize that in a global market, there may be large differences in familiarity, reputation, and image across regions; hence, recruitment activities may need to vary to create the type and level of employer knowledge desired (i.e., a familiar organization with a positive reputation in one locale may have to engage in more and different recruitment activities in another locale). Tapping the same caliber of talent in different locations may require different sourcing strategies (e.g., considering technology access, translating materials). Growth markets often require dynamic strategies, whereas mature markets may draw upon more traditional approaches to sourcing applicants. Job market variability across nations likely will affect the number of alternatives that applicants have, how willing and able individuals are to tolerate delays, and hence, self-selection rates. In summary, global branding and advertising require synthesizing a desired global organizational image with awareness of local customs, needs, laws, labor markets, and language.

One overarching concern with regard to globalization and recruitment is that we lack cross-cultural research on the information processing strategies and needs of job applicants. For example, does the same information serve the same role as input into employer knowledge across cultures? Is negative or missing information considered similarly across cultures? Does the influence of recruitment activity on employer knowledge vary by culture? Do certain cultures attend more to certain information sources or types? For example, referral programs are particularly important outside of the United States. Froese, Vo, and Garrett (2010), in a study of Japanese and U.S. companies recruiting in Vietnam, noted that views of a company’s country of origin and its people have influence beyond employer brand. Further, we lack understanding of the generalizability of recruitment efforts for different job levels in different cultures and markets. For example, sources that are effective in the recruitment of blue-collar workers but not managers in one country may play different or even opposite roles in another country. One could consider how a cultural lens might affect what is seen as impressive or what values are seen as socially approved (see Miller & Guo, 2014). Indeed, we would posit that although factors affecting what is considered prestigious (e.g., organization rankings, high pay) might be similar across cultures, there may be some differences in what is associated with respectability (e.g., what organizations are seen as good and honorable). As an example, Garcia, Posthuma, and Quinones (2010) examined how statements that benefits exceed legal requirements affect attraction in Mexico. Although this is an often-used signal by organizations in that specific market, it matters for global recruiting more broadly because legally mandated fringe benefits vary greatly across countries. Another example would be to consider media richness and media credibility effects on attraction as moderated by culture. What constitutes a warm recruiter treating individuals fairly may not be the same in different regions, because cultures differ in beliefs regarding the appropriateness of assertive behavior in interviews (Vance & Paik, 2006).
Note, however, that studies have not shown any strong, consistent pattern of relations between type of selection tool and applicant reactions that would point to particular cultural values as key to those reactions (see Anderson & Witvliet, 2008, for a review).

Another concern is the need for cultural adaptation in recruitment. The literature on culture and marketing (Hermeking, 2005) has established the need to consider cultural receptivity as well as socioeconomic conditions in designing advertising campaigns (Karande, Almurshidee, & Al-Olayan, 2006); hence, it is no surprise that organizations recognize that recruitment activities and materials may need to vary by country. For example, Baack and Singh (2007) demonstrated that individuals prefer websites that are culturally adapted (i.e., changing content and presentation to fit cultural values of a target country). At first blush, this appears to fly in the face of “global branding,” in which the goal is to project one website and “one image.” However, we contend that organizations can work to project a global brand image while also making appropriate cultural adaptations. For example, an organization that wants to project an image of caring for the customer can provide culturally appropriate examples of showing good customer service in each location or show photographs of individuals of different nationalities depending on location (Brandel, 2006).

Cultural influences on the relative role of various factors in job offer acceptance also need to be examined. We would anticipate, on the basis of our own practical experience and suggestions in the literature, that factors such as pay, opportunities, and signing bonuses would play stronger roles in emerging markets than in the mature markets where research traditionally has been based. Lucrative job offers are tickets to upward mobility, and so salary plays a bigger role in job interest in those locations. Potential recruits often switch jobs frequently in efforts to obtain salary increases. Further, because compensation packages and hours worked vary widely throughout the world, global recruiting requires adjustments to offers to make them attractive in different locations (Brandel, 2006). We have already noted that the role of social influencers will likely vary with the role accorded to family and friends in a given culture. Hence, although comparative empirical research on offer acceptance and country is not available, existing evidence strongly suggests some differential weighting of factors by culture and economic conditions.

Finally, Connerley (2014) notes that greater attention should be paid to the competencies recruiters need for operating in a global environment, such as having a global mindset and cultural agility. Globally integrated software can also help recruiters. For example, IBM Kenexa’s BrassRing applicant tracking system has built-in capabilities to meet reporting requirements in multiple geographies and supports different languages (IBM, 2014).

CONSIDERING STRATEGY

At a basic level, organizations have long been interested in recruitment strategy, but as Ployhart and Kim (2014) note, research on strategic issues is limited, with few studies connecting the individual-level variables typically focused on in recruiting research with organizational performance and competitive advantage. Recently, Phillips and Gully (2015) introduced a model of strategic recruitment to provide a lens for considering individual, team, and organizational levels in conjunction with approaches to recruitment. We can envision ways in which the topic of generating applicant interest might be approached with a consideration of the organization’s strategy for talent management. For example, which talent pools is the organization most concerned about retaining and, therefore, for which applicant pools should concerns about applicant quality predominate? Which talent pools has the organization targeted for greater internal investment in development, and how might that affect recruitment targets? Although it is easy to see how such questions can serve in a practical way to ensure that organizational resources for recruitment are directed in keeping with strategy, it also may be important to consider an organization’s strategy in determining what it considers recruitment effectiveness (i.e., for a given job type, one organization’s strategy might suggest that generating strong interest among high talent is critical, whereas another’s might suggest that applicant pool quality does not have to be as strong).

Adopting more of a recruiting versus a screening orientation is also a strategic decision. Indeed, Dineen and Williamson (2012) showed that firms with a screening orientation reported having higher-quality pools and that a screening orientation is more likely to be adopted when a large labor supply for a job exists and a firm is perceived more positively by applicants. Strategic ad placement to attract specific talent pools has often been discussed in the context of recruiting a more diverse applicant pool (see Avery & McKay, 2006, for a review) but can be considered more broadly as a talent management practice of customization. Similarly, whereas targeted recruitment messages to attract minority applicants, targeted campuses in recruitment, and targeted job fairs have all been discussed in terms of attraction of minority applicants (Avery & McKay, 2006), one can envision other targeted messages and targeted sourcing depending on other talent pool targets. For example, technology has enabled location- or context-based advertising on the web in innovative ways (e.g., placing job ads next to specific types of just-released content on websites). Organizations can send customized text messages to potential applicants at certain universities or belonging to certain professional organizations to advertise openings or events. Obviously, although customization and provision of feedback can be more costly to initiate, the long-run cost savings of creating a more targeted applicant pool are apparent.

Although research on how organizational policies and benefits affect recruitment outcomes is not new (e.g., job security rights, Roehling & Winters, 2000; salary negotiations, Porter, Conlon, & Barber, 2004; work-family balance initiatives, Nord, Fox, Phoenix, & Viano, 2002; affirmative action policies, Harrison et al., 2006), strategic talent management suggests a direct tie between promoting specific policies and benefits to targeted applicant groups and success in recruiting those groups. For example, companies interested in attracting women into occupations where they are underrepresented (e.g., engineering) may emphasize work-life initiatives (e.g., Casper et al., 2013). One other take on targeted recruitment and fit comes from research on recruiting and small firms. Greer, Carr, and Hipp (2016) noted that small firms can gain advantages by emphasizing their uniqueness, flexible environments, lowered formality, and greater autonomy, and should be able to attract candidates who fit those preferred working environments with that strategy. They note that while being unique in message content, small firms are likely to imitate larger ones in recruiting practices, and this imitation is related to more successful firm performance.

Of particular importance in strategic talent management is uncovering when and why differential reactions to recruitment activities occur for those with higher potential and/or greater alternative opportunities. The general evidence is that high-quality applicants react more critically to negative information (Bretz & Judge, 1998; Connerley & Rynes, 1997) and to recruiting delays (Rynes et al., 1991). High-quality applicants may differ from low-quality applicants in reactions to specific aspects of the process that are as yet uninvestigated, such as the amount of high-touch recruitment practices used or the vertical integration of women and minorities. Cable and Turban (2001, p. 157) stated that “There are not recruitment ‘best practices’ across firms” because firms need to assess what employer knowledge their target market holds before developing a recruitment strategy. Approaching recruitment from an integrated talent management perspective suggests a strong shift away from examining what works best across applicants to what works best for specific targets of interest in specific contexts. This is hardly a new suggestion; it has been made by Rynes in all of her reviews (Rynes, 1991; Rynes & Cable, 2003; Rynes et al., 1980), but the current zeitgeist with regard to strategic talent management may increase the likelihood of the research shift.

CONCLUSIONS

In this chapter, we have summarized conclusions in recent recruitment research. The box at the end of the chapter provides a list of the key practice implications based on this research. This focus may leave the reader wondering about the pros and cons of more contextualized approaches to researching recruitment that are industry-, job-level-, geography-, or target-applicant-group-specific. Our conclusion is that contextual factors must be evaluated, but they are not necessarily going to change recruitment theory and models. For example, although we provided numerous examples in this chapter in which culture might make a difference, we also provided numerous examples in which it does not appear to greatly influence research findings and/or practice. Advancements in recruitment research and practice will come from better articulation of when one ought to be attending to these contextual factors and when they can be ignored or addressed with only minor modifications in approach.

Another important conclusion is the changing role of incumbent employees in recruiting. Increasingly, organizations are recognizing that all employees are marketers or brand ambassadors, and an effective recruitment strategy engages all employees. Researchers, however, have not attended to this trend adequately and thus have not provided much clarification about how best to execute this strategy. Further, research has not fully attended to the evolution of the recruiter role in terms of skill obsolescence. LinkedIn (2015) states that modern recruiters need to be data nerds (to use numbers and data to help them make better decisions), researchers (to research candidate pools, employment and skill trends, and the competition), and technologists (to leverage recruiting innovations); these roles are very different from those of a traditional HR recruiter. Talent acquisition companies now offer training for recruiters on the soft skills required for interacting with candidates (e.g., building relationships, negotiation, selling). These same companies are also building data insights capabilities into their tools, allowing recruiters to make better use of data in their decisions.

We have eschewed a traditional review of the recruitment literature for a more focused look at how some of the key conclusions of research should be interpreted in light of the important trends of increasing globalization, increasing use of technology in recruiting, and increasing attention to strategic talent management. Our review leads us to conclude that although organizations are certainly using new practices and adopting new strategies in response to these trends, the research base lags practice in these areas. The following box provides a list of unanswered research questions. Increasing our attention to recruitment processes with these trends in mind should yield more theory-driven practices than those adopted today, while at the same time better informing our understanding of what influences attraction to organizations.

UNANSWERED RESEARCH QUESTIONS

1. Are there differences in recruitment source effects across countries because of differences in economies and cultures?
2. Does the role of various attributes of information (e.g., valence, specificity) in attraction vary by culture and economic conditions?
3. Does the use of technology in recruitment only influence the “how” and “amount” of information delivery to potential applicants, or does it also alter “what” attracts applicants?
4. How does job seeker web search behavior influence how to best attract applicants?
5. Are the effects of media richness, media credibility, and specific recruitment activities moderated by culture? When and how should recruitment materials and activities be culturally adapted?
6. How does vertical integration by nationality affect attraction on a global basis?
7. What are the most effective, innovative uses of technology for generating interest among high-quality applicants?
8. How does technology’s role in attraction differ for different job levels and applicant pools?
9. How do organizational recruitment activities vary according to strategic talent management objectives?
10. What is a cost-effective level of customization in recruitment?

PRACTICE IMPLICATIONS

Reaching Potential Applicants

• Companies today should have some working knowledge of what sources of information are being viewed by applicants and their “social score” in terms of how they appear in online forums.
• Devote time and personnel to managing your employer brand on social media.
• Balance the “sell” with the job and company “realities” by creating a realistic job preview guide inclusive of competitive advantage talking points balanced with the realities of working in the company. Be transparent and prepared. Assume your candidates have researched your company through their social networks.
• “More information is better.” Review job postings to add position- and organization-relevant information. If appropriate, vacancy and/or time scarcity should be emphasized.
• Evaluate an organization’s image and reputation, contrast what is found with what is desired, and then develop planned activities to affect images held by job seekers.
• Consider content, usability, and aesthetics in conjunction with one another.
• Messages should be customized where appropriate.
• Applicants should be provided with self-screening opportunities.
• Make efforts to dissuade low-quality applicants in effective and non-offensive ways (e.g., Ikea fit tool).

Maintaining Interest

• Recruiters need to be carefully selected and trained. Remember, in the world of social media, all your employees are recruiters. Prepare them accordingly.
• Site visit features (i.e., how time is spent) need to be investigated in terms of which ones relate to actual offer acceptance.
• Consider how job markets affect the ability to maintain interest; mechanisms for maintaining interest need to be implemented in hot and/or valued skill markets.

Accepting Offers

• Rather than passively accepting self-selection rates, organizations should investigate reasons for self-selection to uncover whether any causes are problematic or are leading to a loss of desirable applicants. Desirable self-selection should be facilitated through tools that enable assessment of job and organization fit.
• Evaluation of offer acceptance influencers needs to be made with data gathered later in the process, as it may be different from what influences attraction.

Recruiting Globally

• Evaluate sourcing strategies in different markets and culturally adapt recruiting materials as needed.
• Recognize cultural and market influences on the relative importance of factors in offer acceptance. Anticipating how market changes will affect recruitment efforts and organizational image can facilitate the effectiveness of future efforts.

Considering Strategy

• Identify pivotal talent pools and use strategic ad placement, targeted messages, and targeted sourcing for those pools.
• A quality applicant tracking system (ATS) to monitor, evaluate, and evolve recruiting efforts is key to success.
• Narrow recruitment efforts through targeting. Focus on the early identification of talent and on generating interest in those pools.
• Evaluate how attraction strategies such as inducements can affect internal equity.
• Prepare your recruiters to use data correctly to influence and streamline their sourcing efforts.

REFERENCES

Allen, D. G., Biggane, J. E., Pitts, M., Otondo, R., & VanScotter, J. (2013). Reactions to recruitment web sites: Visual and verbal attention, attraction, and intentions to pursue employment. Journal of Business and Psychology, 28, 263–285.
Anderson, N., & Witvliet, C. (2008). Fairness reactions to personnel selection methods: An international comparison between the Netherlands, the United States, France, Spain, Portugal and Singapore. International Journal of Selection and Assessment, 16, 1–13.
Avery, D. R., & McKay, P. F. (2006). Target practice: An organizational impression management approach to attracting minority and female job applicants. Personnel Psychology, 59, 157–187.


Ann Marie Ryan and Tanya Delany Avery, D. R., Volpone, S. D., Stewart, R. W., Luksyte, A., Hernandez, M., McKay, P. F., & Hebl, M. R. (2013). Examining the draw of diversity: How diversity climate perceptions affect job-pursuit intentions. Human Resource Management, 52, 175–194. Baack, D. W., & Singh, N. (2007). Culture and web communications. Journal of Business Research, 60, 181–188. Badger, J. M., Kaminsky, S. E., & Behrend, T. S. (2014). Media richness and information acquisition in internet recruitment. Journal of Managerial Psychology, 29, 866–883. Barber, A. E. (1998). Recruiting employees: Individual and organizational perspectives. Thousands Oaks, CA: Sage. Barber, A. E., & Roehling, M. V. (1993). Job postings and the decision to interview: A verbal protocol analysis. Journal of Applied Psychology, 78, 845–856. Baum, M., Schafer, M., & Kabst, R. (2015). Modeling the impact of advertisement-image congruity on applicant attraction. Human Resource Management, 55, 7–24. Beach, L. R. (1993). Broadening the definition of decision making: The role of prechoice screening of options. Psychological Science, 4, 215–220. Bernardin, H. J., Richey, B. E., & Castro, S. L. (2011). Mandatory and binding arbitration: Effects on employee attitudes and recruiting results. Human Resource Management, 50, 175–200. Braddy, P. W., Thompson, L. F., Wuensch, K. L., & Grossnickle, W. F. (2003). Internet recruiting: The effects of web page design features. Social Science Computer Review, 21, 374–385. Brandel, M. (2006). Fishing in the global talent pool. Computerworld, 40, 33–35. Breaugh, J., & Starke, M. (2000). Research on employee recruiting: So many studies, so many remaining questions. Journal of Management, 26, 405–434. Bretz, R. D., & Judge, T. A. (1998). Realistic job previews: A test of the adverse self-selection hypothesis. Journal of Applied Psychology, 83, 330–337. Cable, D. M., & Turban, D. B. (2001). Establishing the dimensions, sources and value of job seekers’ employer knowledge during recruitment. Research in Personnel and Human Resources Management, 20, 115–163. Cable, D. M., & Yu, K. Y. T. (2007). How selection and recruitment practices develop the beliefs used to assess fit. In C. Ostroff & T. A. Judge (Eds.), Perspectives on organizational fit (pp. 155–182). New York, NY: Lawrence Erlbaum. Caers, R., & Castelyns, V. (2011). LinkedIn and Facebook in Belgium: The influences and biases of social network sites in recruitment and selection procedures. Social Science Computer Review, 29, 427–448. Casper, W. J., Wayne, J. H., & Manegold, J. G. (2013). Who will we recruit? Targeting deep- and surface-level diversity with human resource policy advertising. Human Resource Management, 52, 311–332. Catanzaro, D., Moore, H., & Marshall, T. R. (2010). The impact of organizational culture on attraction and recruitment of job applicants. Journal of Business and Psychology, 25, 649–662. Chapman, D. S., Uggerslev, K. L., Carroll, S. A., Plasentin, K. A., & Jones, D. A. (2005). Applicant attraction into organizations and job choice: A meta-analytic review of the correlates of recruiting outcomes. Journal of Applied Psychology, 90, 928–944. Chauhan, R. S., Buckley, M. R., & Harvey, M. G. (2013). Facebook and personnel selection: What’s the big deal? Organizational Dynamics, 42, 126–134. Cober, R. T., Brown, D. J., & Levy, P. E. (2004). Form, content, and function: An evaluative methodology for corporate employment Web sites. Human Resource Management, 43, 201–218. Cober, R. T., Brown, D. J., Levy, P. 
E., Cober, A. B., & Keeping, L. M. (2003). Organizational web sites: Web site content and style as determinants of organizational attraction. International Journal of Selection and Assessment, 11, 158–169. Connerley, M. L. (2014). Recruiter effects and recruitment outcomes. In K. Y. T. Yu & D. M. Cable (Ed.), The Oxford handbook of recruitment (pp. 21–34.). Oxford: Oxford University Press. Connerley, M. L., & Rynes, S. L. (1997). The influence of recruiter characteristics and organizational recruitment support on perceived recruiter effectiveness: Views from applicants and recruiters. Human Relations, 50, 1563–1586. DeCooman, R., & Pepermans, R. (2012). Portraying fitting values in job advertisements. Personnel Review, 41, 216–232. Dineen, B. R., & Allen, D. G. (2014). Internet recruiting 2.0: Shifting paradigms. In K. Y. T. Yu & D. M. Cable (Ed.), The Oxford handbook of recruitment (pp. 382–401). Oxford: Oxford University Press. Dineen, B. R., Ash, S. R., & Noe, R. A. (2002). A web of applicant attraction: Person-organization fit in the context of web-based recruitment. Journal of Applied Psychology, 87, 723–734. Dineen, B. R., Ling, J., Ash, S. R., & DelVecchio, D. (2007). Aesthetic properties and message customization: Navigating the dark side of web recruitment. Journal of Applied Psychology, 92, 356–372. Dineen, B. R., & Williamson, I. O. (2012). Screening-oriented recruitment messages: Antecedents and relationships with applicant pool quality. Human Resource Management, 51, 343–360. Flinders, K. (October 23 2007). Harnessing generation Y. Computer Weekly, 142, pNA.


Attracting Job Candidates to Organizations Froese, F. J., Vo, A., & Garrett, T. C. (2010). Organizational attractiveness of foreign-based companies: A country of origin perspective. International Journal of Selection and Assessment, 18, 271–281. Garcia, M. F., Posthuma, R. A., & Quinones, M. (2010). How benefit information and demographics influence employee recruiting in Mexico. Journal of Business Psychology, 25, 523–531. Greer, C. R., Carr, J. C., & Hipp, L. (2016). Strategic staffing and small-firm performance. Human Resource Management, 55, 741–764. Griepentrog, B. K., Harold, C. M., Holtz, B. C., Klimoski, R. J., & Marsh, S. M. (2012). Integrating social identity and the theory of planned behavior: Predicting withdrawal from an organizational recruitment process. Personnel Psychology, 65, 723–753. Gully, S. M., Phillips, J. M., Castellano, W. G., Han, K., & Kim, A. (2013). A mediated moderation model of recruiting socially and environmentally responsible job applicants. Personnel Psychology, 66, 935–973. Harold, C. M., Holtz, B. C., Griepentrog, B. K., Brewer, L. M., & Marsh, S. M. (2015). Investigating the effects of applicant justice perceptions on job offer acceptance. Personnel Psychology, 69, 1–29. Harrison, D. A., Kravitz, D. A., Mayer, D. M., Leslie, L. M., & Lev-Arey, D. (2006). Understanding attitudes toward affirmative action programs in employment: Summary and meta-analysis of 35 years of research. Journal of Applied Psychology, 91, 1013–1036. Hermeking, M. (2005). Culture and internet consumption: Contributions from cross-cultural marketing and advertising research. Journal of Computer-Mediated Communication, 11, 192–216. Highhouse, S., & Hause, E. L. (1995). Missing information in selection: An application of the EinhornHogarth ambiguity model. Journal of Applied Psychology, 80, 86–93. Highhouse, S., Stanton, J. M., & Reeve, C. L. (2004). Examining reactions to employer information using a simulated web-based job fair. Journal of Career Assessment, 12, 85–96. Highhouse, S., Thornbury, E. E., & Little, I. S. (2007). Social-identity functions of attraction to organizations. Organizational Behavior and Human Decision Processes, 103, 134–146. Howardson, G. N., & Behrend, T. S. (2014). Using the internet to recruit employees: Comparing the effects of usability expectations and objective technological characteristics on internet recruitment outcomes. Computers in Human Behavior, 31, 334–342. IBM. (2014). IBM Kenexa BrassRing on Cloud Datasheet. Public.dhe.ibm.com. Jobvite. (2014). Jobvite job seeker nation study. Web.jobvite.com. Judge, T. A., & Bretz, R. D. (1992). Effects of work values on job choice decisions. Journal of Applied Psychology, 77, 261–271. Karande, K., Almurshidee, K. A., & Al-Olayan, F. (2006). Advertising standardization in culturally similar markets: Can we standardise all components? International Journal of Advertising, 25, 489–512. Keeling, K. A., McGoldrick, P. J., & Sadhu, H. (2013). Staff Word-of-Mouth (SWOM) and retail employee recruitment. Journal of Retailing, 89, 88–104. Kilduff, M. (1990). The interpersonal structure of decision making: A social comparison approach to organizational choice. Organizational Behavior and Human Decision Processes, 47, 270–288. Kraichy, D., & Chapman, D. S. (2014). Tailoring web-based recruiting messages: Individual differences in the persuasiveness of affective and cognitive messages. Journal of Business and Psychology, 29, 253–268. Kulkarni, M., & Nithyanand, S. (2013). Social influence and job choice decisions. 
Employee Relations, 35, 139–156. Legree, P. J., Gade, P. A., Martin, D. E., Fischl, M. A., Wilson, M. J., Nieva, V. F., McCloy, R., & Laurence, J. (2000). Military enlistment and family dynamics: Youth and parental perspectives. Military Psychology, 12, 31–49. Lievens, F., & Harris, M. M. (2003). Research on internet recruiting and testing: Current status and future directions. In C. Cooper & I. Robertson (Eds.), International review of industrial and organizational psychology (Vol. 18, pp. 131–165). Chichester, England: Wiley. LinkedIn. (2015). LinkedIn modern recruiter’s guide. Retrieved from www.business.linkedin.com Maurer, S. D., Howe, V., & Lee, T. W. (1992). Organizational recruiting as marketing management: An interdisciplinary study of engineering graduates. Personnel Psychology, 45, 807–833. McKay, P. F., & Avery, D. R. (2006). What has race got to do with it? Unraveling the role of racioethnicity in job seekers’ reactions to site visits. Personnel Psychology, 59, 395–429. Medved, J. P. (2014). Top 15 recruiting stats for 2014. Retrieved from blog.capterra.com. Meglino, B. M., DeNisi, A. S., & Ravlin, E. C. (1993). Effects of previous job exposure and subsequent job status on the functioning of a realistic job preview. Personnel Psychology, 46(4), 803–822. Miller, J. K., & Guo, G. C. (2014). Recruitment: International cross-cultural perspectives. In K. Y. T. Yu & D. M. Cable (Ed.), The Oxford handbook of recruitment (pp. 402–422). Oxford: Oxford University Press. Moon, A., & Mzezewa, T. (2015). Goldman Sachs taps Snapchat for recruiting millennials. Retrieved from http:// ca.reuters.com


Ann Marie Ryan and Tanya Delany Mullich, J. (2004). Finding the schools that yield the best job-applicant ROI. Workforce Management, 83, 67–68. Neal, T. M. (August 22 2005). Military’s recruiting troubles extend to affluent war supporters. Washington Post. Nord, W. R., Fox, S., Phoenix, A., & Viano, K. (2002). Real-world reactions to work-life balance programs: Lessons for effective implementation. Organizational Dynamics, 30, 223–238. Ollington, N., Gibb, J., & Harcourt, M. (2013). Online social networks: An emergent recruiter tool for attracting and screening. Personnel Review, 42, 248–255. O’Meara, B., & Petzall, S. (2013). The handbook of strategic recruitment and selection: A systems approach. Bingley, UK: Emerald Publishing. Phillips, J. M., & Gully, S. M. (2015). Multilevel and strategic recruiting: Where have we been, where can we go from here? Journal of Management, 41, 1416–1445. Pieper, J. R. (2015). Uncovering the nuances of referral hiring: How refer characteristics affect referral hires’ performance and likelihood of voluntary turnover. Personnel Psychology, http://dx.doi.org/10.1111/ peps.12097. Ployhart, R. E. (2006). Staffing in the 21st century: New challenges and strategic opportunities. Journal of Management, 32, 868–897. Ployhart, R. E., & Kim, Y. (2014). Strategic recruiting. In K. Y. T. Yu & D. M. Cable (Ed.), The Oxford handbook of recruitment (pp. 5–20). Oxford: Oxford University Press. Porter, C. O. L. H., Conlon, D. E., & Barber, A. E. (2004). The dynamics of salary negotiations: Effects on applicant’s justice perceptions and recruitment decisions. International Journal of Conflict Management, 15(3), 273–303. PriceWaterhouseCoopers. (2011). Millennials at work: Reshaping the workforce. Retrieved from pwc.com Ravlin, E. C., & Meglino, B. M. (1987). Effect of values on perception and decision making: A study of alternative work values measures. Journal of Applied Psychology, 72, 666–673. Reeve, C. L., Highhouse, S., & Brooks, M. E. (2006). A closer look at reactions to realistic recruitment messages. International Journal of Selection and Assessment, 14, 1–15. Rieucau, G. (2015). Getting a low-paid job in French and UK supermarkets: From walk-in to online application? Employee Relations, 37(1), 141–156. Roehling, M. V., & Winters, D. (2000). Job security rights: The effects of specific policies and practices on the evaluation of employers. Employee Rights and Responsibilities Journal, 12, 25–38. Ruiz, G. (2008). PeopleFilter Amplifies ability to track job applicants. Workforce management online. Retrieved February 2, 2008, from http://www.workforce.com Rynes, S. L. (1991). Recruitment, job choice, and post-hire consequences: A call for new research directions. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 399–444). Palo Alto, CA: Consulting Psychologists Press. Rynes, S. L., & Barber, A. E. (1990). Applicant attraction strategies: An organizational perspective. Academy of Management Review, 15, 286–310. Rynes, S. L., Bretz, R. D., & Gerhart, B. (1991). The importance of recruitment in job choice: A different way of looking. Personnel Psychology, 44, 487–521. Rynes, S. L., & Cable, D. M. (2003). Recruitment research in the twenty-first century. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Handbook of psychology: Volume 12 industrial-organizational psychology (pp. 55–76). Hoboken, NJ: John Wiley & Sons. Rynes, S. L., Heneman, H. G., & Schwab, D. P. (1980). Individual reactions to organizational recruiting: A review. 
Personnel Psychology, 33, 529–542. Saks, A. M., & Uggerslev, K. L. (2010). Sequential and combined effects of recruitment information on applicant reactions. Journal of Business Psychology, 25, 351–365. Slaughter, J. E., Cable, D. M., & Turban, D. B. (2014). Changing job seekers’ image perceptions during recruitment visits: The moderating role of belief confidence. Journal of Applied Psychology, 99(6), 1146– 1158. http://dx.doi.org/10.1037/a0037482 Stevens, C. K., & Seo, M-G. (2014). Job search and emotions. In K. Y. T. Yu & D. M. Cable (Ed.), The Oxford handbook of recruitment (pp. 126–138). Oxford: Oxford University Press. Swider, B. W., Zimmerman, R. D., & Barrick, M. R. (2015). Searching for the right fit: Development of applicant person-organization fit perceptions during the recruitment process. Journal of Applied Psychology, 100, 880–893. Talmage, C. (2012). Applicant quality: Exploring the differences between organizational and third-party websites. Social Science Computer Review, 30, 240–247. Taylor, M. S., & Collins, C. J. (2000). Organizational recruitment: Enhancing the intersection of theory and practice. In C. L. Cooper & E. A. Locke (Eds.), Industrial and organizational psychology: Linking theory and practice (pp. 304–334). Oxford, England: Basil Blackwell.


Attracting Job Candidates to Organizations Thorsteinson, T. J., Palmer, E. M., Wulff, C., & Anderson, A. (2004). Too good to be true? Using realism to enhance applicant attraction. Journal of Business and Psychology, 19, 125–137. Treadway, D. C., Adams, G., Hanes, T. J., Perrewé, P. L., Magnusen, M. J., & Ferris, G. R. (2014). The roles of recruiter political skill and performance resource leveraging in NCAA football recruitment effectiveness. Journal of Management, 40(6), 1607–1626. http://dx.doi.org/10.1177/0149206312441836 Tsai, Y., Joe, S, Lin, C., & Wang, R. (2014). Modeling job pursuit intention: Moderating mechanisms of socio-environmental consciousness. Journal of Business Ethics, 125, 287–298. Turban, D. B., Campion, J. E., & Eyring, A. R. (1995). Factors related to job acceptance decisions of college graduates. Journal of Vocational Behavior, 47, 193–213. Uggerslev, K. L., Fassina, N. E., & Kraichy, D. (2012). Recruiting through the stages: A meta-analytic test of predictors of applicant attraction at different stages of the recruiting process. Personnel Psychology, 65, 597–660. Van Hoye, G. (2014). Word of mouth as a recruitment source: An integrative model. In K. Y. T. Yu & D. M. Cable (Ed.), The Oxford handbook of recruitment (pp. 251–268). Oxford: Oxford University Press. Vance, C. M., & Paik, Y. (2006). Managing a global workforce: Challenges and opportunities in international human resource management. London, England: M. E. Sharpe. Vecchio, R. P. (1995). The impact of referral sources on employee attitudes: Evidence from a national sample. Journal of Management, 21, 953–965. vonWalter, B., Wentzel, D., & Tomczak, T. (2012). The effect of applicant-employee fit and temporal construal on employer attraction and pursuit intentions. Journal of Occupational and Organizational Psychology, 85, 116–135. Walker, J. (September 20, 2012). Meet the new boss: Big data. Wall Street Journal. Retrieved from: http:// ezproxy.msu.edu.proxy1.cl.msu.edu/login?url=http://search.proquest.com.proxy1.cl.msu.edu/ docview/1041132347?accountid=12598 Williamson, I. O., King, J. E., Lepak, D., & Sarma, A. (2010). Firm reputation, recruitment websites, and attracting applicants. Human Resource Management, 49, 669–687. Yang, K., & Yu, T. (2014). Person-organization fit effects on organizational attraction: A test of an expectations-based model. Organizational Behavior and Human Decision Processes, 124, 75–94. Yu, K. Y. T., & Cable, D. M. (Ed.) (2014). The Oxford handbook of recruitment. Oxford: Oxford University Press. Yuce, P., & Highhouse, S. (1997). Effects of attribute set size and pay ambiguity on reactions to ‘help wanted’ advertisements. Journal of Organizational Behavior, 19, 337–352. Zhang, L., & Gowan, M. A. (2012). Corporate social responsibility, applicants’ individual traits, and organizational attraction: A person-organization fit perspective. Journal of Business Psychology, 27, 345–362. Zottoli, M. A., & Wanous, J. P. (2000). Recruitment source research: Current status and future directions. Human Resource Management Review, 10, 353–382. Zusman, R. R., & Landis, R. S. (2002). Applicant preferences for web-based versus traditional job postings. Computers in Human Behavior, 18, 285–296.


8
TEST ADMINISTRATION AND THE USE OF TEST SCORES

JEFF W. JOHNSON AND FREDERICK L. OSWALD

USE OF TEST SCORES

Before you were born, you may have already taken a test (a prenatal test). You likely have been tested repeatedly since then, with the results of many of those tests having meaningful consequences in your life. This chapter focuses on the use of tests and test scores relevant to employment selection settings.

Personnel selection is only part of a system of practices that, together, contribute toward meeting a variety of organizational goals (e.g., improved job performance, more effective teamwork, improved learning outcomes, higher motivation, reduced turnover). In other words, an organizational problem with any complexity to it is usually not purely a “selection problem,” meaning that it is most effectively evaluated and addressed through an integrated approach to practice that involves a broad set of professional, strategic, and technical skills (Huselid, Jackson, & Schuler, 1997). Just as selection systems do not exist in isolation, neither do selection test scores. Decisions about the type of test scores to collect might be influenced by a number of considerations, such as the type of training that will or will not be provided to job applicants once they are selected; the knowledge, skills, abilities, and other characteristics (KSAOs) of applicants who are the focus of the organization’s recruiting strategies; or the time and budget available for test administration. Because of the broad context of selection systems and testing, selection researchers and practitioners can meaningfully improve their skills and their work by remaining connected with the literature in training, motivation, leadership, teamwork, technology, and other relevant substantive areas outside of their usual niche.

DECISIONS TO MAKE BEFORE COLLECTING TEST SCORES

Although a chapter on the use of test scores implies the scores have already been collected, most of the decisions about how test scores will be used must be made at the outset of the testing program. This section presents several issues that must be addressed prior to data collection when determining how test scores in selection will be used.

First, for what purpose will the test scores be used? We focus on selection issues in this chapter, but testing is useful for several organizational purposes. General and common uses of test scores in organizations are (a) to select job applicants for employment, (b) to identify developmental or training needs, (c) to award licensure or certification of professional knowledge and training, and (d) to promote current employees or determine who will be put on a fast track for later promotion.

Second, what are the major goals of the selection program? Examples of goals are (a) increasing the level of task-specific or contextual job performance among employees; (b) maximizing the number of employees who meet a minimum level of skill proficiency; (c) improving employee commitment and retention; (d) increasing the diversity of skill sets in the organization; and (e) minimizing counterproductive behaviors such as theft, absenteeism, unsafe work behavior, or drug abuse. Taxonomies of job performance have helped sharpen the models, definitions, and metrics that underlie the organizational objectives of selection, but they also have opened our eyes to the fact that multiple objectives are usually of interest, making the operationalization of selection goals complicated. The likely inherent tradeoffs between satisfying certain selection goals mean that some sort of decision must be made about the relative importance of each goal to the organization. If major organizational goals are overlooked in designing a personnel selection system, then decisions made during the selection process could contradict or compete with decisions made with respect to other practices and policies of the organization.

Third, what characteristics do you want to measure in selection? Measures of characteristics that are important for the current job, for future job requirements, or for person-organization fit are often the foundation for a selection program. When the test or test battery measures characteristics that are closely related to job performance, such as technical knowledge and skill, the predictive accuracy of selection tests is usually most favorable, as is the face validity (acceptability) of the tests in the eyes of job applicants. Job knowledge tests and work samples might be more costly, or require more company-specific tailoring, than measures of more indirect determinants of job performance, such as ability and personality. Furthermore, it may not be possible or appropriate to test for specific job knowledge and skill in some situations, such as in entry-level positions or positions in which applicants are expected to receive training to remedy any knowledge and skill deficiencies.

Fourth, what is the volume of testing necessary in the selection context? Tests can be used in a wide variety of ways, such as (a) selecting a single individual from a small number of applicants into a specific position (such as CEO); (b) selecting large volumes of applicants into entry-level positions, such as in the fast-food service industry; (c) small businesses having to select from limited pools of applicants who happen to be applying to multiple jobs simultaneously (Scullen & Meyer, 2012); or (d) classifying a large pool of individuals into a wide range of jobs, as is done in the military. Testing in small versus large organizations can influence the nature of a selection process in fundamental ways. For example, large organizations tend to have the resources that would allow them to customize test content and scoring, and their own local data might be extensive enough to allow for stable psychometric and validity analyses, as well as the use of statistical methods for weighting and combining scores in relatively complex ways.
By contrast, smaller organizations may be limited to buying off-the-shelf tests that are scored and interpreted by outside vendors using a relevant but broader set of norms, and support for the use of those tests may require relying more heavily on multiple sources of external information (e.g., job analysis, transporting validity from other situations, and meta-analyses; see McPhail, 2007).

Finally, the mode of administration can influence how test scores are used. Will the test be administered via paper and pencil, computer, role play, work sample, or interactive voice response (IVR)? Will the test be proctored or unproctored, administered one-on-one or in groups? The shift to Internet or computer-based testing opens the door to a much wider range of test formats (e.g., video, interactive, adaptive) and new constructs that can be measured more reliably (e.g., interpersonal skills, ability to speak a foreign language). Tests using these innovative formats must hold up to the same high psychometric standards as their more traditional counterparts, with comparability being a concern when multiple formats of a measure are used in a selection setting (e.g., web vs. paper-and-pencil tests; tests translated into multiple languages).

The decisions we have briefly covered influence later decisions in the testing process. The remainder of this chapter discusses (a) collection of test scores, (b) computation of test scores, and (c) selection decisions on the basis of test scores.
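The comparability concern raised above can be examined empirically before a new format goes live. The following is a minimal sketch, assuming the same unidimensional test has been administered to independent web and paper-and-pencil groups and that item-level responses are available; it compares coefficient alpha in each group and the standardized mean difference between total scores. The simulated data, group sizes, and function names are illustrative assumptions, not a procedure prescribed in this chapter.

```python
import numpy as np

def coefficient_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an examinee-by-item matrix of scored responses."""
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Standardized mean difference between two groups of total scores."""
    pooled_sd = np.sqrt(((len(x) - 1) * x.var(ddof=1) + (len(y) - 1) * y.var(ddof=1))
                        / (len(x) + len(y) - 2))
    return (x.mean() - y.mean()) / pooled_sd

def simulate_responses(n: int, k: int, rng, mean_shift: float = 0.0) -> np.ndarray:
    """Stand-in for real item-level data: a latent trait drives 0/1 item responses."""
    theta = rng.normal(loc=mean_shift, size=(n, 1))
    return (theta + rng.normal(scale=1.2, size=(n, k)) > 0).astype(int)

rng = np.random.default_rng(42)
web = simulate_responses(n=400, k=30, rng=rng)                      # unproctored web group
paper = simulate_responses(n=250, k=30, rng=rng, mean_shift=0.1)    # proctored paper group

print("alpha, web:  ", round(coefficient_alpha(web), 2))
print("alpha, paper:", round(coefficient_alpha(paper), 2))
print("d (web minus paper):", round(cohens_d(web.sum(axis=1), paper.sum(axis=1)), 2))
```

Large mean differences or markedly different reliabilities across modes would argue for equating work or a more formal measurement invariance analysis before scores from the two formats are treated as interchangeable.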



COLLECTION OF TEST SCORES

Several decisions pertaining to the collection of test scores can influence data quality as well as applicant satisfaction with and legal defensibility of the process. In this section, we discuss issues associated with test security, mode of administration, testing time, and retesting.

Maintaining Test Security

In high-stakes testing settings, organizations and test vendors have a keen interest in test security for two primary reasons. First, test owners want to protect their proprietary rights to the content, format, and unique aspects of the administration and scoring of the test items. Second, test users want to maintain the test’s fairness and validity by preventing the spread of test-specific information that would allow for cheating (e.g., individuals posting information about a test or the testing procedure on the Internet or passing such information on to their friends; applicants taking pictures of test questions or memorizing them so they unfairly benefit upon a retest). Organizations also have an ethical responsibility to communicate and maintain the privacy and security of individuals’ test scores in high-stakes testing situations, and the data are likely to be of higher quality as a result (e.g., informing applicants about the confidentiality of item responses and test scores may reduce test taker anxiety).

Different circumstances make security breaches more likely. For example, paper-and-pencil administration requires hard copies of tests that are easier to steal (computer-administered tests are more difficult to appropriate if proper safeguards are in place). A larger number of examinees means greater likelihood that unscrupulous examinees will be among those tested, greater demand for obtaining test information among applicants, and larger testing sessions that are more difficult to proctor. Using older and/or commercial tests provides more opportunity for the test to be compromised, especially given a limited number of test forms or a small item pool. Finally, unproctored Internet tests taken off-site are now commonplace, which means relying more extensively on the good faith of the examinee that test content will not be taken and shared with others.

Alternate Forms

A common and effective strategy for enhancing test security is creating alternate forms of the test, which protects item content while maintaining comparable scores across forms. Given that a test reflects a representative sample of content from the construct domain of interest, it should be possible to develop alternate measures of the same construct that exhibit similar psychometric properties in terms of reliability and validity. High correlations between scores across test forms provide evidence that scores reflect an applicant’s standing on an underlying construct rather than an applicant’s familiarity with content unique to a particular test form. The following subsections review (a) creating alternate forms, (b) equating alternate forms, and (c) developing dynamically administered tests.

Creating Alternate Forms  Creating alternate forms for some constructs may be a relatively simple task. For example, when creating alternate forms that test for the ability to multiply two 2-digit numbers, substituting different numbers for the original numbers will suffice. When constructs are defined with greater conceptual breadth, however, it is very important that the constructs to be tested are well defined and theoretically driven. Carroll’s (1993) hierarchical taxonomy of human abilities would serve as a good reference in the cognitive domain, and the Big Five has proven useful in the personality domain for generating test content (e.g., the International Personality Item Pool, or IPIP, at www.ipip.ori.org; Goldberg et al., 2006). Alternate test forms should be developed so that psychometric characteristics across forms have similarly
high reliability and patterns of criterion-related validity, which can be accomplished by sampling items representatively from a well-defined construct domain. A good approach to creating alternate forms of an ability test is similar to the approach for developing measures in general (DeVellis, 2016). Given that a pool of items will be winnowed down based on the quality of the item content coupled with psychometric characteristics, a rule of thumb is to write about three times as many items as will reliably measure the construct on one form. After developing items, reviewing content, making appropriate revisions, and administering the test in a very small sample of test takers, test developers then collect data from a larger sample for all items in a pilot test (say, N = 300) and assign items with similar content and psychometric properties (e.g., proportion correct, corrected item-total r) to alternate forms. Experience tells us that about one-third of the items will drop out because of inadequate psychometric characteristics. Next, ensure that the test forms have similar internal consistency reliabilities and that the correlation between forms is high—at least r = .90, after correcting for unreliability using alphas from each form (which should be relatively high for unidimensional constructs). For speeded tests (e.g., some ability tests) and for tests with extremely heterogeneous content (e.g., SJTs), alpha reliability and item-total correlations are generally not appropriate reliability measures (Ployhart & MacKenzie, 2011). Other approaches, such as alternate-form and test-retest reliability, should be used in these cases (Catano, Brochu, & Lamerson, 2012). Finally, when possible, determine whether criterion-related validities are similar across forms when predicting outcomes. Items can be moved across alternate forms to improve the comparability of the forms in terms of reliability, validity, adverse impact, and other factors.

Creating alternate forms for job knowledge tests can follow the same procedure, but it is usually more difficult because of the specificity of knowledge items (e.g., there may not be an alternative item when an examinee needs to know the location of the emergency switch at a nuclear power plant). Also, test developers often lack familiarity with the content area, particularly when job knowledge is highly technical. A good strategy is to have subject matter experts (SMEs; e.g., trainers, supervisors, incumbents) write test items and to instruct them to write an “item buddy” for each item they write. The item buddy would be similar to the original item in terms of content and difficulty, but different enough that knowing the answer to one does not easily give away the answer to the other.

Developing alternate forms for situational judgment tests (SJTs) is a challenge because SJT content can be very wide-ranging in terms of the situations and constructs assessed. Lievens and Sackett (2007) explored three methods for creating alternate forms for SJTs: (1) assigning items randomly to forms, (2) creating forms with similar situations, and (3) creating forms with similar situations and similar item responses. The latter two methods showed higher test-retest correlations, indicating that random assignment of SJT items may not be a sound approach to developing alternate forms.
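To make the pilot-testing checks described above concrete, the following sketch (Python; the simulated data, sample size, and function names are ours and purely illustrative) computes coefficient alpha for each form, the correlation between form total scores, and that correlation corrected for unreliability, flagging whether the r = .90 rule of thumb is met.

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an examinee-by-item score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def form_comparability(form_a: np.ndarray, form_b: np.ndarray) -> dict:
    """Summarize the alternate-form checks described above: per-form alpha,
    the raw between-form correlation, and the correlation corrected for
    unreliability in each form."""
    alpha_a, alpha_b = cronbach_alpha(form_a), cronbach_alpha(form_b)
    r_ab = np.corrcoef(form_a.sum(axis=1), form_b.sum(axis=1))[0, 1]
    r_corrected = r_ab / np.sqrt(alpha_a * alpha_b)
    return {"alpha_a": alpha_a, "alpha_b": alpha_b, "r_ab": r_ab,
            "r_corrected": r_corrected, "meets_rule_of_thumb": r_corrected >= .90}

# Hypothetical pilot data: 300 examinees answering both 20-item forms (0/1 scoring).
rng = np.random.default_rng(0)
theta = rng.normal(size=(300, 1))
form_a = (rng.normal(size=(300, 20)) < theta).astype(float)
form_b = (rng.normal(size=(300, 20)) < theta).astype(float)
print(form_comparability(form_a, form_b))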
Oswald, Friede, Schmitt, Kim, and Ramsay (2005) developed multiple parallel forms of an SJT, where each form sampled content across 12 broad dimensions of college student performance (e.g., continuous learning, leadership, ethics). The authors winnowed 10,000 computer-generated forms down to 144 tests with scores having similar means and standard deviations (SDs), high estimated alpha reliability, high estimated validity, and low item overlap. Thus, all test forms were as similar as could be accomplished feasibly in terms of the desired practical qualities of the SJT.

In the personality domain, it is possible to create alternate forms using the same domain-sampling procedure as for ability tests, because there are many potential items to measure personality constructs. Alternate forms may not be necessary, however, because the most desirable answer in a personality test is usually not difficult to determine (Viswesvaran & Ones, 1999). Therefore, there is little to gain in terms of preventing cheating by creating alternate forms of a personality test. A common practice in this case is simply to randomize the presentation order of personality items when creating alternate forms. In general, the importance of having alternate forms corresponds to the extent to which a single correct answer can be determined for the test questions. For example, biodata tests ask the candidate about experiences and attitudes that are often linked to performance through empirical keying. If a candidate is responding honestly to a biodata or personality test, there is no single “correct” answer because the candidate is just providing a self-description. Thus, alternate
forms are less important for personality and biodata tests; more important for SJTs, interviews, and work simulations; and most important for knowledge or ability tests. A recent trend is the development of many forms (e.g., 10 instead of 2) in an attempt to minimize cheating and keep a testing program operating if one of the test forms is compromised. Oswald et al. (2005) extended a method proposed by Gibson and Weiner (1998) for creating many test forms on the basis of the statistics from a pool of items that may not have appeared on the same test form or have been given to the same sample. This method of generating parallel forms potentially minimizes the exposure of any single test item; in fact, item exposure, item testing time, item-level validity, and other item characteristics can be built into the mathematical constraints that drive the procedure for generating appropriate alternate test forms (see the linear programming models in van der Linden, 2005). A more common strategy for creating parallel forms is to create three or four unique alternate forms, then create additional forms by changing the item order and/or by taking some items from each of the original forms. Note, however, that mixing items from two or more unique forms to create a new form means that some parts of the unique forms are compromised if the new form were to be stolen.

Equating Alternate Forms  When alternate forms are used, it is necessary to equate test scores across forms so that a given score on one form is psychometrically equivalent, as nearly as possible, to the same score on the other form. Although equating can introduce some additional error into the selection process as opposed to using the same measure across all applicants, careful attention to the process of test development and the establishment of equivalent forms reduces such error. When samples are randomly equivalent, and common anchor items across test forms have similar parameters across samples, several different equating methods based on item response theory (IRT) yield similarly good results (Kim, Choi, Lee, & Um, 2008). In cases in which sample sizes are smaller (less than N = 500 per item) or IRT assumptions are untenable, it is necessary to rely on other equating methods. Two other kinds of equating methods are linear equating and equipercentile equating. In linear equating, individual scores across two tests are considered to be equated if they correspond to the same number of standard deviation units from the mean (i.e., the same z scores). Because linear equating is entirely analytical and does not require data at each point in the test score range, it offers the advantages of mapping scores from one form to the other throughout the entire score range and of requiring smaller sample sizes than IRT methods. A disadvantage of linear equating is that it requires the assumption that any differences in the shapes of the raw-score distributions for each form are trivial. In equipercentile equating, scores on two tests are considered to be equated if they correspond to the same percentile rank (for details, see Livingston, 2004, pp. 17–23). A problem with equipercentile equating is that it requires very large sample sizes to precisely equate the entire range of scores on each test. In this method, large errors of estimation are likely in score ranges where data are scant or erratic, so there must be many observations for each possible score on each form (Petersen, Kolen, & Hoover, 1989).
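As a rough illustration of the two classical methods just described, and not of any operational equating system, the sketch below (Python, with simulated scores from randomly equivalent groups) maps Form X scores onto the Form Y scale by matching z scores (linear equating) and by matching percentile ranks (unsmoothed equipercentile equating); the instability of the latter where data are sparse is what motivates the smoothing methods discussed next.

import numpy as np

def linear_equate(x_scores, form_x, form_y):
    """Map Form X scores onto the Form Y scale by matching z scores
    (linear equating), using observed means and SDs from each form."""
    zx = (np.asarray(x_scores, dtype=float) - form_x.mean()) / form_x.std(ddof=1)
    return form_y.mean() + zx * form_y.std(ddof=1)

def equipercentile_equate(x_scores, form_x, form_y):
    """Map Form X scores onto Form Y by matching percentile ranks
    (unsmoothed equipercentile equating); needs dense data to be stable."""
    pct = np.array([np.mean(form_x <= x) for x in np.atleast_1d(x_scores)])
    return np.quantile(form_y, pct)

# Hypothetical total scores from two randomly equivalent pilot groups.
rng = np.random.default_rng(1)
form_x = rng.normal(25, 5, 1000).round()
form_y = rng.normal(27, 6, 1000).round()
print(linear_equate([20, 25, 30], form_x, form_y))
print(equipercentile_equate([20, 25, 30], form_x, form_y))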
Methods are available that smooth the empirical distribution of the data, allowing for more reasonable equipercentile equating in ranges of the scale with less data, assuming the smoothed distribution is the correct one underlying the data (Dorans, Pommerich, & Holland, 2007). If the only score that needs to be equated is a cut score and only a pass-fail decision is communicated to applicants, then we recommend equating the cut score on the basis of equipercentile equating because that will lead to the same pass rate within each form. Whether simpler methods are as useful as IRT-based approaches is an empirical question, but given that very consistent empirical and practical outcomes between IRT and classical test theory have often been identified (Fan, 1998; MacDonald & Paunonen, 2002), we suspect that different methods may often yield similar results. Dynamically Administered Tests  Another way of enhancing test security without having to develop and equate multiple forms of the test is by creating a large item pool and selecting items from that pool in real time as the candidate is completing the test, such as through “linear on the fly” (LOFT) or adaptive testing. Using a LOFT procedure, items are selected dynamically from a large item pool such that different, but equivalent, tests are randomly administered, thus
providing a customized assessment for each applicant. As a result, it is difficult to compromise test security by copying the items and disseminating them to others. Computer adaptive tests (CATs) also enhance test security because they provide different items to test takers depending on their previous responses, essentially creating a unique form for each applicant. For example, an applicant who answers an ability test item incorrectly will be given an easier item next, whereas an individual who answers that item correctly will be given a more difficult item next. In an adaptive personality test, applicants are presented with item pairs, with each item representing a different level of the target trait. An applicant who chooses the statement at a lower trait level would be given a subsequent item pair that is lower on the trait continuum in an attempt to refine trait-level estimation for that applicant. Even for unidimensional measures, developing ability CATs requires very large sample sizes (at least N = 500–1,000) and very large item pools. If this investment can be made, however, the advantage of CAT is that fewer items need to be administered to each individual, reducing test fatigue and testing time while maintaining high reliability across levels of the construct to be measured. Although adaptive testing based on large item pools improves overall test security, it is certainly possible for a savvy test taker trying to see many items to purposely answer certain items incorrectly to ensure that additional items are presented. Other potential disadvantages of CATs are that test takers are not allowed to review their previous answers to correct them, and the test scores resulting from a CAT may be more sensitive to initial responses than to subsequent ones (Chang & Ying, 2008). Investment in CAT development in the private sector is still relatively new, so the conditions under which the cost-benefit of CAT is superior to that of traditional test formats largely remain to be seen.
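The core logic of adaptive administration can be sketched in a few lines. The toy example below (Python) uses a hypothetical item bank and a crude staircase update in place of the IRT-based item selection and scoring an operational CAT would use; it is meant only to show how each response steers which item is presented next.

import math
import random

def toy_adaptive_test(item_bank, answer_item, n_items=10, start=0.0, step=1.0):
    """Grossly simplified adaptive loop: pick the unused item whose difficulty
    is closest to the current ability estimate, then move the estimate up or
    down depending on whether the answer was correct."""
    remaining = dict(item_bank)          # item_id -> difficulty
    theta = start
    for i in range(n_items):
        item_id = min(remaining, key=lambda k: abs(remaining[k] - theta))
        difficulty = remaining.pop(item_id)
        if answer_item(item_id, difficulty):
            theta += step / (i + 1)
        else:
            theta -= step / (i + 1)
    return theta

# Hypothetical 200-item bank and a simulated examinee with true ability 0.8.
bank = {f"item{i}": random.gauss(0, 1) for i in range(200)}
def simulated_examinee(item_id, difficulty, true_theta=0.8):
    # Rasch-like probability of a correct response for the simulated examinee.
    return random.random() < 1 / (1 + math.exp(-(true_theta - difficulty)))
print(round(toy_adaptive_test(bank, simulated_examinee), 2))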

Mode of Administration

Test administration mode generally refers to (a) paper-and-pencil versus computer administration, and (b) proctored versus unproctored administration. In general, we recommend using computer administration and computer scoring to enhance test security where possible. Although we have noted how computer-based testing does not resolve all security issues, the elimination of paper forms that are more easily stolen is a clear advantage. To maintain the security advantage of computer-administered tests, however, strict physical and information technology (IT) security measures must be established and enforced on a continuous basis to protect against unauthorized access to the testing software. Reputable vendors that specialize in online testing will have extensive security procedures for protecting their intellectual property. We do not recommend that organizations conduct large-scale testing using their own computer systems, unless extensive and up-to-date security measures are in place, and the entire testing system has been thoroughly tested by experienced testing and IT professionals for this purpose. The International Test Commission (ITC; 2006) has published a set of guidelines for best practice in delivering computer-based testing, especially through the Internet. The guidelines address four primary issues: (1) ensuring that the hardware and software technology at both the server and client side are appropriate for the use of the test; (2) ensuring the quality of test materials and the testing process; (3) controlling the way tests are delivered and who is completing the tests; and (4) ensuring test security, data protection, privacy, and confidentiality. The guidelines are designed to advise test developers, publishers, and users.

One aspect of computer-based testing specifically addressed by the ITC guidelines is unproctored Internet testing (UIT). UIT is increasingly common in selection practice, and it presents unique problems for maintaining test security (see Tippins, 2009). Several things can be done to help minimize test exposure and motivation to cheat in this situation (e.g., Bartram, 2009; Burke, 2009; ITC, 2006; Tippins et al., 2006). First, the system should allow applicants to take the test only once; returning applicants are not allowed to retake the test without the knowledge and approval of the hiring organization. Second, an applicant tracking system should be used that collects relevant identification information and links it to all selection data collected on the applicant. Third, applicants should be encouraged to be honest in the information they provide to ensure a good fit to the organization; they also should be warned about the consequences of
cheating under UIT and how their identity and their answers may be verified. Fourth, each test administration should be unique, through randomizing the item order or selecting items adaptively from large item banks. Finally, UIT might be used for initial testing or to screen out candidates who are highly unlikely to be suitable for the job (Nye, Do, Drasgow, & Fine, 2008). To verify scores obtained in the unproctored environment, proctored adaptive tests (Makransky & Glas, 2011) or proctored subtests of the larger unproctored test (Segall, 2001) can still keep testing time short. To date there is little evidence of cheating or of inconsistency between unproctored scores and proctored follow-up test scores (Kantrowitz & Dainis, 2014). That said, ethical issues should be considered when a follow-up test is used for score confirmation (Bartram, 2009; ITC, 2006). For instance, failure to confirm an unproctored test score is not definitive proof of cheating; it may merely suggest some necessary additional follow-up questioning by hiring managers. Ultimately, organizations using UIT must accept that the test items are in the public domain and will be exposed to anyone who really wants access, with potential implications for compromising the reliability of test scores and thus the integrity of a selection system. Burke (2009) presented some precautions that can be adopted to determine the extent to which test security has been breached. These precautions include searching the Internet to find sites that inappropriately offer access to test content. When these sites are found, most of them can be taken down by informing the site operator or Internet provider of activity that goes against their stated policies. Another approach to determining the extent to which test content may be compromised is to apply data forensic algorithms that search for evidence of aberrant scores or scores that are highly unlikely under honest test-taking conditions (e.g., fast response times associated with high correct answer rates; matches in correct and incorrect answer profiles, suggesting test taker collusion; long response latencies associated with few correct answers, suggesting item harvesting). Ironically, the concern about cheating and test security associated with UIT has led to the development of technologies and procedures that could make UIT more secure than traditional proctored assessment (Bartram, 2009).
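A very simple data-forensic screen along these lines might look like the following (Python sketch; the column names, thresholds, and flags are illustrative assumptions, and an operational screen would rely on validated statistical criteria and collusion indices rather than fixed cutoffs).

import pandas as pd

def flag_aberrant_sessions(log: pd.DataFrame,
                           fast_seconds: float = 8.0, high_accuracy: float = 0.90,
                           slow_seconds: float = 60.0, low_accuracy: float = 0.30) -> pd.DataFrame:
    """Flag two of the aberrance patterns mentioned above: unusually fast
    sessions with very high accuracy (possible item pre-knowledge) and
    unusually slow sessions with very low accuracy (possible item harvesting)."""
    per_session = log.groupby("session_id").agg(
        median_rt=("response_seconds", "median"),
        accuracy=("correct", "mean"))
    per_session["possible_preknowledge"] = (
        (per_session["median_rt"] < fast_seconds) & (per_session["accuracy"] > high_accuracy))
    per_session["possible_harvesting"] = (
        (per_session["median_rt"] > slow_seconds) & (per_session["accuracy"] < low_accuracy))
    return per_session

# Hypothetical response log with one row per item response.
log = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2, 2],
    "response_seconds": [5, 6, 4, 70, 90, 80],
    "correct": [1, 1, 1, 0, 0, 1],
})
print(flag_aberrant_sessions(log))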

Testing Time

The amount of time available for testing is often a practical constraint that influences decisions about the types of tests to be administered, the mode of administration, the number of tests included in a test battery, the number of items in a test, and the number of stages in the selection system. For example, if financial or other considerations place a strict time limit of one hour on test administration, that limits the number of different constructs that can be assessed, the types of constructs that can be assessed (e.g., reading comprehension takes longer than perceptual speed and accuracy), and the testing method (e.g., SJTs generally take longer than biodata inventories). Our general advice when presented with such constraints is to go for depth over breadth. Because testing time (and the patience of test takers) is finite, organizations should focus on effectively measuring a relatively small set of key constructs relevant for selection purposes, fully recognizing that not all constructs of potential relevance can be measured. Job analysis should be the essential guide for determining which selection-relevant constructs or characteristics are most important and feasible to measure. Another factor to consider when determining what to include in a test battery is potential adverse impact. When selection tests measure constructs with large subgroup differences, it can be very difficult to neutralize those differences in a test battery by (a) replacing such a test with another measure of the same construct with equivalent reliability and validity but lower subgroup mean differences (because such measures are very hard to locate or develop) and/or (b) including measures of other constructs that have lower or no subgroup differences (because they do not have as strong a mathematical effect as one may think, and because this adds breadth and therefore more testing time). Nonetheless, all attempts should be made to balance the goals of validity, reliability, and fairness of the selection battery. One way to reduce the size of a test battery is to remove tests that are highly correlated with other tests in the battery and do not provide incremental validity. Such a reduced test battery can often maintain its worth in terms of reliability and validity for its intended purposes (Donnellan,
Oswald, Baird, & Lucas, 2006; Stanton, Sinar, Balzer, & Smith, 2002). As mentioned, many organizations cut down on in-house testing time by using unproctored web administration of predictors (Tippins et al., 2006), inviting those meeting a minimum score to the proctored testing session, where more tests are administered. If the unproctored test scores can be linked to the proctored test scores via the applicant name or identification number, then the number of constructs assessed can be increased without increasing proctored testing time.

Power tests are defined as those tests for which speed is not relevant to the measurement of the construct, such as for many cognitive ability and achievement tests. Thus, test takers should be given adequate time to complete the entire test. Because unlimited time is not available for test administration, however, a rule of thumb for allocating a minimum amount of testing time to power tests is the time it takes for 90% of the examinees to complete 90% of the items. Unlike power tests, speeded tests require test takers to perform quickly on simpler tasks within the time allotted. Speeded tests must be long enough and the time limit short enough that virtually no one is able to finish. The test is then scored as the number of correct responses minus a correction for guessing (Cronbach, 1990).

Issues surrounding equity in testing could meaningfully impact decisions on the amount of testing time allocated. Consider job applicants who are nonnative speakers of the language in which the test is written. If it is safe to assume that language skills are not relevant with respect to the construct being assessed by the test, then one should consider providing additional testing time to nonnative speakers so that the influence of unfamiliarity with the language on test scores is greatly diminished or removed. Alternatively, such tests could be redesigned so that written language or other factors irrelevant to the construct are minimized, such as administering a video-based form of a test instead of a traditional written form (Chan & Schmitt, 1997; Weekley & Jones, 1997). Consideration of individuals covered by the Americans with Disabilities Act (ADA; 1990) is an organizational and legal imperative. If the test is not speeded, applicants with disabilities preventing them from completing the test in the normally allotted time should be given extra time to complete the test. Other reasonable accommodations would include providing larger-font test materials and providing assistance with the answer sheet. See Campbell and Reilly (2000) for an excellent discussion of ADA accommodations in testing.
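Two of the scoring rules just mentioned are easy to state in code. The sketch below (Python, with simulated pilot data) implements the 90%-of-examinees/90%-of-items rule of thumb for setting a power-test time limit and the classical correction-for-guessing formula for speeded tests.

import numpy as np

def power_test_time_limit(completion_minutes: np.ndarray, n_items: int) -> float:
    """Rule-of-thumb minimum time limit for a power test: the time by which
    90% of pilot examinees had completed 90% of the items.
    completion_minutes[i, j] = elapsed minutes when examinee i finished item j."""
    k = int(np.ceil(0.9 * n_items))
    time_to_k_items = np.sort(completion_minutes, axis=1)[:, k - 1]
    return float(np.percentile(time_to_k_items, 90))

def corrected_for_guessing(n_right: int, n_wrong: int, n_options: int) -> float:
    """Classical correction for guessing: R - W / (k - 1),
    where k is the number of response options per item."""
    return n_right - n_wrong / (n_options - 1)

# Simulated pilot timings: 200 examinees working through 40 items in order.
rng = np.random.default_rng(2)
times = np.cumsum(rng.exponential(0.5, size=(200, 40)), axis=1)
print(round(power_test_time_limit(times, n_items=40), 1))
print(corrected_for_guessing(n_right=40, n_wrong=12, n_options=4))  # 36.0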

Alternate Test Formats

The need to create alternate test formats may arise when seeking to test different subgroups of individuals in a comparable manner. Not only might disabled job applicants require a reasonable accommodation, but testing applicants from different countries might also require that the test be translated accurately into several languages. Although in some cases it seems reasonable to assume that differences in test format are merely cosmetic and have no bearing on construct measurement (e.g., a static personality inventory administered on computer vs. administered on paper), research suggests that the psychometric characteristics of tests may differ across formats (Meade, Michels, & Lautenschlager, 2007). Therefore, in most cases it is necessary to test statistically whether format differences lead to score differences that are irrelevant to the constructs being measured. Numerous studies have found empirical support for the measurement invariance/equivalence of psychometric properties across formats (e.g., for web vs. paper-and-pencil format equivalence, see De Beuckelaer & Lievens, 2009; Naus, Philipp, & Samsi, 2009; Ployhart, Weekley, & Holtz, 2003). Ideally, scores from different test formats would exhibit strict measurement invariance (e.g., similar patterns and magnitudes of factor loadings, similar error variance estimates; Vandenberg & Lance, 2000), but much more often, the data support either scalar invariance (equal loadings and intercepts) or metric invariance (equal loadings only). Furthermore, when some items do not psychometrically function in the same way across subgroups, then partial invariance is said to exist. Items contributing to partial invariance may lead one either to delete or revise them, or to allow them to have unique estimates across subgroups. Recent simulation and empirical work suggests that partial invariance may only have a slight impact on selection
outcomes in many practical situations (Millsap & Kwok, 2004; Stark, Chernyshenko, & Drasgow, 2004; for a thorough review of tests of measurement invariance, see Schmitt & Kuljanin, 2008). Large sample sizes are desirable in conducting appropriate tests for measurement invariance (e.g., N = 400 for supporting metric invariance; Meade & Bauer, 2007); otherwise, it may be necessary to rely on a strong rational basis for the equivalence of test formats (e.g., providing evidence that the formats of the different tests do not influence the construct of interest and should be unrelated to differences in the samples tested). Note that even when measures are found to be psychometrically nonequivalent across formats, this may not stop an organization from proceeding with both formats of a test (e.g., when moving toward an Internet-only format, but using paper forms while in transition), though it should be done with the understanding that nonequivalence prevents the direct comparability of the measures.
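For readers who prefer a model-based statement of the invariance levels just discussed, a standard multigroup factor-model formulation (our notation, not the chapter's) is

x_{ijg} = \tau_{jg} + \lambda_{jg}\,\xi_{ig} + \varepsilon_{ijg},

where x_{ijg} is person i's response to item j under format (group) g. Configural invariance requires only the same pattern of loadings across groups; metric invariance adds the constraint \lambda_{jg} = \lambda_j; scalar invariance additionally constrains the intercepts, \tau_{jg} = \tau_j; and strict invariance also equates the error variances, \mathrm{Var}(\varepsilon_{ijg}) = \theta_j. Partial invariance relaxes one or more of these equality constraints for a subset of items.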

Retesting

According to professional guidelines, allowing candidates to retest at a later date is a best practice (Society for Industrial and Organizational Psychology, 2003; U.S. Department of Labor, 1999). Because any single assessment can be influenced by various types of systematic measurement error unrelated to the construct being measured (e.g., illness, extreme anxiety, not meeting the testing prerequisites), it is reasonable to offer the opportunity to retest when it is appropriate and feasible. Retesting is typically not a relevant concern for small-scale testing programs. If a small company is testing to fill a specific position, then the opportunity to retest cannot be offered once the position has been filled. Candidates should be given the chance to retest in a timely manner if job opportunities are continuously available, especially for internal candidates.

On the other hand, how are the psychometric qualities of the test affected by allowing candidates to take it two, three, or four times? Is the purpose of the test defeated if a candidate can retake the test as many times as it takes to finally pass? Score increases from retesting could be partially due to regression to the mean, because low scorers are more likely to retest, and their scores would tend to increase upon retest by chance alone. Score increases could also be due to practice effects on the test-specific content (e.g., memorizing the correct answers) or to picking up on effective test-taking strategies (e.g., learning that picking the longest multiple choice answer is beneficial when in doubt). These undesirable effects need to be considered, if not prevented by disallowing retests. On the other hand, retest scores can also reflect desirable effects. Some test takers could be less anxious about the test when they retake it, which could lead to score increases that better reflect their true level of knowledge even if their underlying knowledge upon retest remains the same. Test takers may also consider the types of questions they missed when they first tested and concentrate subsequent study in those areas so that the retest reflects true increases in the construct being measured.

Some recent studies have examined the effect of retesting on scores for different types of tests. In a study of admission exams for medical students, Lievens, Buyse, and Sackett (2005) found standardized mean gains in test scores (after correcting for test-retest reliability) of 0.46 for a cognitive ability test, 0.30 for a knowledge test, and 0.40 for an SJT. Retesting was conducted using alternate forms, suggesting that increases were due to increases in the respective constructs being measured and not due to simply memorizing item-specific content. Raymond, Neustel, and Anderson (2007) found standardized mean gains of 0.79 and 0.48 for two different medical certification exams. In both cases, these gains were nearly the same whether the identical test or a parallel test was used. This latter finding was contrary to a recent meta-analysis of retesting effects on cognitive ability tests that found an adjusted overall effect size of 0.46 for identical forms and 0.24 for alternate forms (Hausknecht, Halpert, Di Paolo, & Moriarty Gerrard, 2007). Across all studies (combining samples using identical and alternate forms), the Hausknecht et al. (2007) meta-analysis found an effect size of 0.26 (based on adjusted SD units) between
Time 1 and Time 2 and an effect size of 0.20 between Time 2 and Time 3. The portion of these effects that was due to regression to the mean was estimated to be less than 10%. Test coaching, defined as instruction aimed at improving test scores (through learning either test-related skills or test-taking strategies), generally had a large effect. The effect size for individuals who had some form of test coaching was 0.70, as opposed to 0.24 for individuals who did not have any coaching. Another moderator of gains upon retesting was the type of cognitive ability assessed, with tests of quantitative and analytical ability showing larger mean gains (0.30 and 0.32, respectively) than tests of verbal ability (0.19).

Do scores on a retest tend to reflect the individual’s standing on a construct better than scores on the initial test? Lievens et al. (2005) examined this question by using a within-person analysis to compare the validity coefficients for original scores and retest scores for those examinees who did not pass the medical school admissions exam, elected to retest with an alternate form, and subsequently were admitted to medical school. When predicting GPA with the knowledge test, Lievens et al. hypothesized and found significantly higher validity coefficients for retest scores (r corrected for range restriction = .37) than for initial scores (corrected r = .23, N = 556). Also as hypothesized, there were no statistically significant differences in validity coefficients for the cognitive ability test or for the SJT.

Note that the time intervals varied in studies examining retesting. There is little research to inform the decision about how long a candidate should have to wait before being allowed to retest, although the length of the time intervals used in practice generally does not appear to influence test score gains when alternate forms are used. Hausknecht et al. (2007) noted that score gains tended to decrease as the time interval increased, but only for identical forms. Raymond et al. (2007) found no effect of time delay on score increases regardless of whether identical or alternate forms were used. Using identical forms, Burke (1997) found different retest gains across components of a cognitive and psychomotor ability selection battery, but the retesting time interval (ranging from 1 to 5 years) did not moderate these gains by any significant amount, suggesting that the retest effect tends to be stronger than any time effect. The appropriate time interval for retesting in employment settings depends on factors such as the availability of alternate forms and the type of test, but also administrative concerns such as the cost of testing and the need to fill positions. A minimum of 6 months between administrations has been a common rule of thumb for large-scale ability or knowledge testing programs, and 30–60 days is more typical for certain types of skills tests (e.g., typing, software proficiency).

Although retesting effects are usually of interest primarily for cognitive ability or knowledge tests, the effect of retesting on personality test scores has also been examined. Landers, Sackett, and Tuzinski (2011) observed that the use of extreme responses (all 1s and 5s) increased for internal applicants who chose to retake a personality test after initial failure. This type of faking increased over time, suggesting that applicants were being coached on how to respond to maximize scores.
An interactive warning indicating that an extreme response pattern was not consistent with paying careful attention to each item reduced the incidence of extreme response faking. This study suggests that retaking a personality test increases the risk of faking to increase scores, but steps can be taken to mitigate this faking. Finally, in addition to these retesting findings just summarized, there is also a need for research on the effect of retesting on adverse impact. Specifically, are expected score gains upon retesting equivalent for different gender or racial/ethnic subgroups? Are candidates within different subgroups equally likely to retest when given the opportunity? Questions such as these must be answered before it is possible to speculate on the impact of retesting on adverse impact ratios found in selection practice.

COMPUTATION OF TEST SCORES

After test data have been collected, scores must be computed and transformed into a single score or multiple scores that will be used as the basis for making the selection decision. In this section, we discuss creating predictor composites and reporting test scores.

Creating Predictor Composites

It is common for organizations to use more than one predictor for decision making and to combine the scores on each predictor into a single composite score. For example, standardized scores on a reading comprehension test, a conscientiousness test, and an SJT may be added together to create a single score that describes a job applicant better than each test considered in isolation. This is a compensatory approach, in which high scores on one predictor can compensate for low scores on another predictor. This is contrasted with a noncompensatory or multiple-hurdles approach, in which selected applicants must meet a minimum passing score on each predictor. Issues associated with a multiple-hurdles approach are discussed later in this chapter. Two key decisions must be made when compiling a predictor battery and computing a composite score: (1) Which predictors should be included in the battery? (2) How should each predictor be weighted in computing the composite score?
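A compensatory composite of the kind just described is straightforward to compute. Before turning to those two decisions, the sketch below (Python, with hypothetical applicant data and unit weights) standardizes each predictor and sums the weighted z scores.

import pandas as pd

def compensatory_composite(scores: pd.DataFrame, weights: dict) -> pd.Series:
    """Standardize each predictor and form a weighted sum, so a high score
    on one predictor can offset a low score on another (compensatory model)."""
    z = (scores - scores.mean()) / scores.std(ddof=1)
    w = pd.Series(weights).reindex(z.columns)
    return z.mul(w, axis=1).sum(axis=1)

# Hypothetical applicant data for the three predictors mentioned above.
applicants = pd.DataFrame({
    "reading": [28, 35, 22, 31],
    "conscientiousness": [3.8, 2.9, 4.4, 3.1],
    "sjt": [61, 55, 70, 48],
})
print(compensatory_composite(applicants, {"reading": 1, "conscientiousness": 1, "sjt": 1}))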

Choosing Predictors

A common problem faced by personnel selection researchers and practitioners is choosing a set of predictors from a larger set of potential predictors for the purpose of creating a predictor battery. In large organizations, an experimental predictor battery may have been assembled for a validation study, and the choice of which predictors to include is based on the psychometric characteristics of the predictors as measured in that study, and possibly validity information where available. In smaller organizations, the choice may be based on examining published norms or meta-analysis results for a wide range of job-relevant predictors. Usually, there are additional practical constraints, such as test availability, cost of the tests, and required testing time.

When assembling a predictor battery, there is often a tradeoff between maximizing criterion-related validity and minimizing adverse impact against protected groups (e.g., racial/ethnic minorities or females). Creating a composite of several valid predictors is a common strategy for reducing the degree to which a selection procedure produces group differences (Campbell, 1996; Sackett & Ellingson, 1997). The problem is that some of the most valid predictors of performance are cognitive in nature, but those predictors tend to have the largest potential for adverse impact. Therefore, adding a cognitive predictor that increases the validity of the composite will often have the simultaneous effect of increasing adverse impact (Sackett & Ellingson, 1997). Compounding the problem is the fact that adding a predictor with little adverse impact to a predictor with large adverse impact typically does not reduce the adverse impact of the composite to the extent that would generally be expected (Potosky, Bobko, & Roth, 2005; Sackett & Ellingson, 1997). Sackett and Ellingson (1997) gave an example of two uncorrelated predictors. One predictor had a standardized mean subgroup difference (d) of 1.00 and the other had a d of 0.00. Most researchers would expect that the two predictors would offset each other, so the d of an equally weighted composite of the two predictors would be 0.50. In fact, the d of this composite would be 0.71 (the square root of 0.50), and one would have to add two more uncorrelated predictors (three predictors uncorrelated with each other and with a cognitive ability predictor) to achieve a d value of 0.50. Potosky et al. (2005) further demonstrated the difficulty in reducing adverse impact with a predictor composite by pointing out that the potential for adverse impact in many predictors has been underestimated because d has been computed in range-restricted samples of job incumbents rather than in the full range of job applicant samples. On the other hand, a meta-analysis of cognitive ability and race/ethnic differences found that, although overall Black-White d is considered to be around 1.0, the d values appear to range roughly between 0.60 for high-complexity jobs and 0.85 for low-complexity jobs, even after accounting for range restriction within jobs (Roth, Bevier, Bobko, Switzer, & Tyler, 2001; Roth, Switzer, Van Iddekinge, & Oh, 2011). The mathematical presentation of Sackett and Ellingson (1997) and the meta-analysis of Potosky et al. (2005) both demonstrated that reducing adverse impact by adding predictors
to a composite is not as easy as it seems at first glance. The takeaway message is that reducing adverse impact is not a simple matter of adding a noncognitive predictor or two to a predictor composite that includes a measure of cognitive ability, nor of creating a test battery that overweights measured job characteristics relative to their actual importance. Researchers and practitioners trying to balance validity against potential adverse impact when creating predictor composites should explore a wider variety of alternative predictors and weighting schemes rather than mechanically relying on offsetting the adverse impact of one predictor with another.
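The arithmetic behind the Sackett and Ellingson (1997) example can be stated compactly. Under the simplifying assumptions of their paper, the subgroup difference on an equally weighted composite of k standardized predictors with individual differences d_i and average predictor intercorrelation \bar{\rho} is

d_{\text{composite}} = \frac{\sum_{i=1}^{k} d_i}{\sqrt{k + k(k-1)\,\bar{\rho}}},

so two uncorrelated predictors with d = 1.00 and d = 0.00 yield d_{\text{composite}} = 1/\sqrt{2} \approx 0.71, and the cognitive predictor must be joined by three mutually uncorrelated predictors with d = 0 before the composite difference falls to 1/\sqrt{4} = 0.50.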

Weighting Predictors

When multiple predictors are used, a decision must be made on how much weight to apply to each predictor when computing a composite score. Two types of weights are considered here: (a) statistical weights and (b) rational weights. Statistical weights are data-driven, whereas the researcher specifies rational weights, perhaps with input from SMEs. The most common statistical weights are derived using multiple regression because, in a given sample, regression weights maximize the prediction of the criterion (in the sense of minimizing the sum of squared errors). However, regression weights have numerous limitations that often make alternative weighting schemes more desirable. First, criterion scores are not always available, such as when a content-oriented validation strategy is used. Second, regression weights focus entirely on prediction, so they cannot take into account adverse impact or other practical considerations. Third, regression weights can be difficult to interpret and explain to stakeholders, especially when predictors are correlated. Finally, the question of primary interest is how well regression weights predict in other independent samples (e.g., samples of future job applicants), not in the specific sample in which the weights were derived. Sampling error variance and/or correlated predictors make regression weights unstable and thus prone to inaccuracy in other samples compared with unit weights (i.e., a simple sum of standardized predictors), especially when sample sizes are relatively small (less than about 180; Schmidt, 1971) and predictor intercorrelations are high (i.e., multicollinearity; Green, 1977).

De Corte, Lievens, and Sackett (2007) presented a procedure for weighting predictors in such a way that the tradeoff between selection quality (e.g., validity, average criterion score of those selected) and adverse impact is Pareto-optimized. Pareto optimization means that mean subgroup differences are minimized for a given level of validity (or similarly, that validity is maximized for a given level of mean subgroup differences). The procedure applies optimization methods from the field of operations research and involves nonlinear programming. The authors offer a computer program for application of the procedure. Unfortunately, when the criterion is task performance or other criteria requiring cognitive ability, then the Pareto-optimal composites usually require down-weighting a cognitive ability predictor considerably to reduce mean subgroup differences on the predictor composite by a practically significant amount. This compromises validity significantly in most cases, and thus, the low weights might not be a reasonable reflection of the actual importance of cognitive ability on the job.

Given the limitations of empirical weighting schemes, rational weights are frequently applied. Examples include job analysis ratings of importance (Goldstein, Zedeck, & Schneider, 1993) and expert judgments of predictor importance (Janz, Hellervik, & Gilmore, 1986). Johnson and Carter (2010) found that weighting each predictor by the number of performance dimensions to which it is relevant yielded higher validities than did applying unit weights. Given the multidimensional nature of job performance, placing greater weight on predictors that are likely to influence a wider range of outcomes is attractive from a conceptual standpoint as well as from a predictive standpoint.
There are also other considerations that may influence the weighting scheme, for better or for worse, such as equally weighting cognitive and noncognitive portions of the predictor battery or weighting to please stakeholders (e.g., the CEO thinks the interview should be given more weight than the cognitive test). To the extent that adverse impact is not triggered as a function of weighting, the resulting composite scores
from alternative weights become less concerning from a legal standpoint, so long as the weights are applied similarly across all members of the applicant pool. An important point to keep in mind when considering alternative weighting schemes is that whatever weights are directly applied to a set of predictors—called nominal weights—may not have the desired effect on the final composite scores. Oswald, Putka, and Ock (2015) provide several examples demonstrating how predictor variables are not actually weighted as intended unless they are completely uncorrelated (also see Brannick & Darling, 1991). This is because applying a nominal weight to one predictor also applies an implicit weight to all other correlated predictors being used. As a result, the effective weight applied to a given variable is not the nominal weight, but instead reflects a combination of the nominal weight and the implicit weights resulting from that variable’s correlation with each of the other variables in the composite (see Guion, 2011, p. 275 ff.). Because many components of a selection battery are likely to be positively correlated to some extent, the composite scores created by weighting predictors will not actually reflect the intended weighting scheme. Statistical methods exist for translating nominal weights into effective weights, and although there is no single correct method for doing so (Brannick & Darling, 1991; Oswald et al., 2015), the attempt at translation allows one to understand better whether each predictor contributes to the composite in the manner that was originally intended.

Before spending time arriving at a predictor weighting scheme and worrying about the extent to which our nominal weights correspond to our effective weights, we should first ask to what extent differential weighting of predictors influences the overall composite score at all. Both Koopman (1988) and Ree, Carretta, and Earles (1998) demonstrated that very different sets of weights can lead to highly correlated composite scores—often above .95. Bobko, Roth, and Buster (2007) reviewed the literature on the usefulness of unit weights and concluded that unit weights are highly appropriate under many circumstances, including when adopting a content-oriented validation strategy. Unit weights are the easiest to calculate, the easiest to explain, and the most generalizable to different situations. When applying weights, unless sample sizes are large, the number of predictors is small, and predictor intercorrelations are low, it is probably best to use unit weights to keep the weighting of predictors simple (Bobko et al., 2007; Cohen, 1990).

Going beyond choosing and weighting predictors, three review articles take a broader focus on the frequent tradeoff between validity and reducing adverse impact (Kravitz, 2008; Ployhart & Holtz, 2008; Sackett, Schmitt, Ellingson, & Kabin, 2001). These authors examined a wide variety of organizational strategies, indicating which strategies appear to have promise (e.g., administering tests in a video format to reduce an unnecessary language burden), which strategies have not been as fruitful (e.g., modifying tests based on differential item functioning statistics), and which longer-term strategies lack empirical evidence but may be promising (e.g., engaging the organization in broad community-based efforts to increase visibility and attract more qualified minority applicants).
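To see how quickly nominal weights drift from their intended effect, the sketch below (Python; the predictor names, weights, and correlations are hypothetical, and this covariance-based definition is only one of the several possible translations noted above) converts nominal weights into effective weights for a small battery.

import numpy as np

def effective_weights(nominal_w: np.ndarray, corr: np.ndarray) -> np.ndarray:
    """One way (of several; see the caveat above) to express effective weights:
    each predictor's covariance with the weighted composite, i.e., the nominal
    weight plus the implicit weight carried by correlations with the other
    predictors, rescaled to sum to 1 for comparability."""
    contribution = corr @ nominal_w   # for predictor i: w_i + sum over j != i of r_ij * w_j
    return contribution / contribution.sum()

# Hypothetical battery: cognitive test, SJT, structured interview.
nominal = np.array([0.5, 0.25, 0.25])
corr = np.array([[1.0, 0.4, 0.3],
                 [0.4, 1.0, 0.5],
                 [0.3, 0.5, 1.0]])
print(effective_weights(nominal, corr).round(3))  # roughly .38, .32, .30 rather than .50, .25, .25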

Reporting Test Scores

After job applicants have been tested, it is customary to communicate to them how well they performed. There are no standards for how much information must be provided, nor for the format in which to provide it, so score reporting runs the gamut from a simple pass/fail notification to a detailed report of the number correct on each test in the battery and how the information was combined to create an overall score. We are unaware of any research on applicant reactions to how test scores are reported, apart from reactions to how scoring led to making the selection decision (e.g., top-down selection versus test-score banding; Truxillo & Bauer, 1999, 2000). The type of score information provided should depend on the purpose of the assessment and the nature of the applicant. For example, if an assessment center is used to evaluate internal candidates for promotion, it is probably beneficial to the organization and the candidates to provide extensive developmental feedback on how the candidate did on each exercise. This feedback could then lead to targeted training interventions and, ultimately, performance improvement among employees (see Chapter 19, this volume, for details on providing assessment feedback to applicants). On the other hand, if a standardized test is used to screen out a large number of
external applicants who do not meet a minimum cut score, then no more than a pass/fail decision likely needs to be communicated. If a pool of potential candidates is identified that exceeds a cut score, and the highest-scoring individuals in that pool will be given the first opportunity for job openings, then some information about where the individual stands among other candidates (e.g., a ranking or percentile score) might be appropriate to communicate some idea of how likely a job offer is in the near future.

Raw test scores are often transformed to make them more interpretable to applicants. This is especially important when the selection instrument contains tests that do not have right and wrong answers (e.g., personality tests). To avoid confusion and negative impressions, we recommend not reporting test scores with negative values, which may occur when computing raw scores from items that have negatively scored response options (e.g., many biodata items) or when computing z-scores, where scores below the mean are negative. A common transformation for score reporting is to report T-scores, which standardize scores to have a mean of 50 and standard deviation of 10. This makes scores look like percentiles, because most scores range from 10 to 90, and the scores provide information about how the applicant did on each test but do not explicitly state the number correct. Another linear transformation is to convert the total score to a 100-point scale, and when a cut score is used, the cut score can be set at a value such as 70. This reporting method is easy to understand, because scores are similar to grades in school, and provides applicants with a reasonable idea of where they stand in terms of passing and failing and relative to the maximum score. A downside of both T-scores and this latter conversion is that both may incorrectly imply that the score represents the percentage of items answered correctly. A third alternative is to place scores into categories for the purposes of communication to management or other constituencies (e.g., “excellent,” “good,” “borderline,” “poor”). These coarser categories are sometimes used for selection, although coarsening scores will reduce the validity of a selection measure, as discussed in the next section.
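The two linear transformations just described are simple to implement. The sketch below (Python) computes T scores and one possible version of the 100-point conversion in which the cut score maps to 70 and the maximum observed score maps to 100; the anchoring choices are illustrative assumptions rather than a standard.

import numpy as np

def to_t_scores(raw: np.ndarray) -> np.ndarray:
    """T scores: standardize, then rescale to mean 50 and SD 10."""
    z = (raw - raw.mean()) / raw.std(ddof=1)
    return 50 + 10 * z

def to_100_point_scale(raw: np.ndarray, cut: float, passing_value: float = 70.0) -> np.ndarray:
    """Linear rescaling that places the cut score at 70 and the maximum
    observed score at 100 (one of several possible anchorings)."""
    return passing_value + (raw - cut) * (100 - passing_value) / (raw.max() - cut)

raw = np.array([12, 18, 22, 25, 30])
print(to_t_scores(raw).round(1))
print(to_100_point_scale(raw, cut=20).round(1))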

MAKING SELECTION DECISIONS

Once final test scores or composite scores are computed, they must be translated into a final selection decision. There are many ways to arrive at the decision to select, reject, or move to the next phase of the selection process. In this section, we discuss top-down selection, setting cut scores, banding, multiple hurdles, selection to fit a profile, and context-based selection.

Methods of Selection

Given personnel selection data that demonstrate linear prediction of a meaningful criterion, top-down selection using a linear composite of the standardized scores is statistically the best method for maximizing expected criterion performance, and thus the utility implied by criterion-related validity coefficients, when applied to data outside of the sample. This selection method assumes there is no useful curvilinear predictive relationship to be considered in the selection process, yet there is recent literature suggesting some amount of curvilinear prediction exists in the personality arena (Carter, Dalal, Boyce, O’Connell, Kung, & Delgado, 2014; Converse & Oswald, 2014; Le et al., 2011), but perhaps not in the ability arena (Coward & Sackett, 1990; Cullen, Hardison, & Sackett, 2004). The top-down approach also does not account for meaningful nonrandom attrition, such as when the top-ranked talent is more likely to turn down the offer and take a job elsewhere (Murphy, 1986). Any alternative to top-down selection usually means compromising the validity and utility of the test at least to some extent, and sometimes to a great extent (Schmidt, 1991; Schmidt, Mack, & Hunter, 1984). As we have seen, however, maximizing validity often raises the potential for adverse impact effects against protected racial/ethnic subgroups whenever selection tests have a cognitive ability component. Thus, many larger organizations seek out alternatives to strict top-down selection, trading off validity to some extent in hopes of a resulting increase in diversity.

Four major selection alternatives are prominent in the selection literature and practice. The first alternative is setting a cut score, above which applicants are selected and below which applicants are rejected. Everyone who passes the test is then considered qualified, but the number of job openings is often smaller than the number of qualified applicants. In this case, job offers may be made in a top-down fashion, but then the cut score almost becomes irrelevant. Alternatively, other considerations may come into play once the cut score is passed, such as job experience or other skill sets. In these cases, selection is operating more like a multiple-hurdle selection system.

In fact, setting a cut score for a test as part of a multiple-hurdle selection system is the second major alternative to top-down selection. Those scoring above the cut score move to the next stage of the selection process, which may be another test, or something as simple as an unstructured interview or reference check. Given a large enough sample size to ensure stable and generalizable results, it is possible to establish a multiple-hurdle selection system such that selection cutoffs and the order of the predictors reduce adverse impact ratios, while retaining the highest possible levels of mean predicted performance (De Corte, Lievens, & Sackett, 2006; Sackett & Roth, 1996). Multiple hurdles offer the advantage of reducing the overall cost of the selection procedure, because not all applicants complete each component. This allows the more expensive or time-intensive tests (e.g., simulations, assessment centers) to be administered to smaller numbers of applicants at the end of the process. However, the multiple-hurdle approach critically depends on the assumption that low scores at an early stage of selection should not be compensated for by high scores at a later stage, because they cannot be for those applicants who do not pass a hurdle. A disadvantage of multiple hurdles is that the reliability of the entire selection system is lower compared with the formation of predictor composites, because the reliability of the entire system is the product of the reliabilities of the measures that establish each hurdle (Haladyna & Hess, 1999).

The third alternative is the use of test-score banding procedures. Banding is a broad term that encompasses any selection procedure that groups test scores together and considers them to be equivalent. For example, standard error of the difference (SED) banding considers scores to be equivalent unless they are significantly different from each other (Cascio, Outtz, Zedeck, & Goldstein, 1991). The problem with this type of banding is that less reliable measures lead to wider bands and the conclusion that scores are functionally equivalent, meaning that the process generally goes against the principles of good measurement (Schmidt, 1991). Using bands can also make scores less valid, similar to the practice of dichotomizing predictor scores (Cohen, 1983). Several empirical papers explore the tradeoff between maximizing validity or mean predicted performance and minimizing adverse impact as a function of test-score banding (e.g., Campion et al., 2001; Sackett & Roth, 1991; Schmitt & Oswald, 2004). As one would expect, the tradeoff tends to be larger when the bands are larger, when the selection ratio is smaller, and when the standardized mean difference between groups is larger.
These rules are not set in stone, however, because results also depend on the statistical banding method used and how the size of the band aligns with the cutoff point for selection in a particular data set. Despite its good intentions, there is surprisingly little evidence that banding has much of a practical effect in reducing adverse impact (Barrett & Lueke, 2004). One exception would be top-down selection of protected group members within bands (Sackett & Roth, 1991), but this is not a viable strategy because the Civil Rights Act of 1991 explicitly prohibits selection on the basis of protected class status without a consent decree, and random selection within bands may actually increase adverse impact (Barrett & Lueke, 2004). The fourth alternative to top-down selection is to place candidate scores into categories representing probability of success (e.g., green, yellow, red) and reporting those categories to the hiring manager. The purpose of the test scores is to inform the judgment of the selection decision maker, along with other relevant data such as resumes and interviews. This approach is similar to banding, without the statistical algorithms used to determine where the bands are set. The advantage of this approach is that it provides the decision maker with broader latitude for selecting the best candidate without being overly influenced by differences in test scores that may be viewed as very small from a practical standpoint. This may be a disadvantage as well, however, because a more informal approach to selection provides more opportunities for idiosyncratic biases to influence decisions, removing many of the advantages of standardized testing.
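Returning to the SED banding procedure described above, the band-width calculation makes the reliability critique concrete. The sketch below (Python, hypothetical values) shows that a less reliable test yields a wider band and therefore treats a wider range of scores as equivalent.

import math

def sed_band_width(sd: float, reliability: float, z: float = 1.96) -> float:
    """Width of a standard-error-of-difference band: scores within
    z * SED of the top score are treated as statistically indistinguishable.
    SEM = SD * sqrt(1 - r_xx); SED = SEM * sqrt(2)."""
    sem = sd * math.sqrt(1 - reliability)
    return z * sem * math.sqrt(2)

# Hypothetical test with SD = 10: a drop in reliability widens the band.
print(round(sed_band_width(sd=10, reliability=0.90), 1))  # about 8.8 points
print(round(sed_band_width(sd=10, reliability=0.70), 1))  # about 15.2 points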


Although there may be many situations in which the use of these alternatives to top-down selection would make sense for the organization, we do not recommend adopting them solely for the purpose of reducing adverse impact. When cognitive ability measures are incorporated into a selection battery, large tradeoffs between adverse impact and validity are often impossible to avoid (Sackett & Ellingson, 1997). Although nothing in today's U.S. legal system bars an organization from sacrificing validity to reduce adverse impact, doing so fails to take full advantage of what selection research on validity has to offer in terms of improving the quality of talent that is hired at an aggregate level (Pyburn, Ployhart, & Kravitz, 2008). Furthermore, we would agree that the courts could perceive a procedure that deliberately decreases the validity of the predictors (cognitive and/or noncognitive) as a deliberate decrease in the job relevance of the selection system. Addressing adverse impact concerns in a proactive manner should go well beyond simple predictor selection, weighting, or banding approaches, such as actively recruiting minorities, fostering a climate for diversity, and engaging with the diverse communities that the organization serves.
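Because this discussion turns on how score use affects adverse impact, a small illustrative sketch of the usual selection-rate comparison may be helpful. The applicant groups, scores, and cut scores below are invented, and the four-fifths rule shown is only a common enforcement guideline, not a definition of adverse impact.

```python
def selection_rate(scores, cut):
    """Proportion of a group scoring at or above the cut score."""
    return sum(s >= cut for s in scores) / len(scores)

def adverse_impact_ratio(focal_scores, reference_scores, cut):
    """Ratio of focal-group to reference-group selection rates.

    Ratios below 0.80 are commonly flagged under the four-fifths rule.
    """
    return selection_rate(focal_scores, cut) / selection_rate(reference_scores, cut)

# Hypothetical standardized test scores for two applicant groups.
reference_group = [0.6, 0.2, 1.1, -0.3, 0.9, 0.4, -0.1, 0.7]
focal_group = [0.1, -0.4, 0.5, -0.7, 0.3, -0.2, 0.8, -0.9]

for cut in (0.0, 0.5):
    print(cut, round(adverse_impact_ratio(focal_group, reference_group, cut), 2))
```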

Setting Cut Scores When it is necessary in a selection system to set one or more cut scores, there are numerous methods from which to choose (see Kehoe & Olson, 2005, and Mueller, Norris, & Oppler, 2007, for extensive reviews). In general, these methods can be distinguished by their use of either judgmental methods or empirical methods. The most common judgmental method is the Angoff (1971) method, in which expert judges (subject matter experts; SMEs) estimate the probability that a minimally qualified candidate will answer each test item correctly. These estimates are summed across items to calculate the expected test score of a minimally qualified candidate. A legitimate criticism of the Angoff method is that judges generally find it difficult to estimate these probabilities, often tending to overestimate them, which leads to higher cut scores than those determined by other methods. The cut score is often adjusted, such as by lowering it one or two standard errors of measurement of the test. Despite this general problem in estimation, Angoff-like methods have the particular advantage of being well received by the courts (Kehoe & Olson, 2005).

Empirical methods for establishing cut scores are generally based on the relationship between test performance and criterion performance, so a criterion-related validation study is required to use these methods. In the regression technique, the minimum criterion score associated with successful job performance is determined, and linear regression is used to find the test score (or test composite score) corresponding to that predicted criterion score. Forward regression regresses criterion scores on test scores (as is done in selection), and reverse regression regresses test scores on criterion scores (to predict the test or composite score cutoff). These methods usually produce different cut scores, so both methods could be used to produce a range of cut scores, and expert judgment could be used to set the appropriate cut score within that range (Mueller et al., 2007).

To illustrate the practical effects of selection, expectancy charts usefully depict the relationship between test performance and criterion performance and, optionally, can be used to help set a cut score. Expectancy charts graphically display either the expected criterion performance scores within given ranges of predictor scores or the percentage of those selected who are expected to meet the standard for success on the job, given a set of alternative cut scores. The advantages of using expectancy charts to set cut scores are (a) they are easy for decision makers to understand, (b) the cut score is based on expected criterion performance, and (c) the courts have shown support for these types of methods (Kehoe & Olson, 2005).

Methods for setting cut scores deserve a great deal more attention because cut scores have increasingly been subject to legal challenge. Whenever test users decide to implement cut scores, they should put as much effort into setting them as they invest in establishing the reliability and validity of the test itself. Test users should carefully consider the need for a cut score, because a top-down selection strategy is often a more desirable alternative. Providing a legal and professional defense for a top-down strategy may be easier than defending how and where a cut score was established, because there is often a great deal of room for interpretation in the legal and professional literature on cut scores (e.g., what constitutes minimum qualifications).


Test users must be careful to have a transparent, justifiable, and consistent rationale based on sound professional judgment for what is done at each step of the cut-score-setting process. Because cut scores are likely to remain in use in many selection and related contexts (e.g., licensure and certification), future research should continue to investigate the major substantive and methodological factors involved in the justification and setting of cut scores.
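As a hedged illustration of the judgmental approach described above, the sketch below averages hypothetical Angoff probability estimates across judges, sums them across items, and then applies the common adjustment of lowering the result by one standard error of measurement. All of the ratings, the test standard deviation, and the reliability are invented for illustration.

```python
import math

# Hypothetical Angoff ratings: each row is one SME's probability estimates that
# a minimally qualified candidate would answer each of five items correctly.
ratings = [
    [0.70, 0.55, 0.80, 0.60, 0.65],
    [0.75, 0.60, 0.85, 0.55, 0.70],
    [0.65, 0.50, 0.75, 0.60, 0.60],
]

n_items = len(ratings[0])
# Mean rating per item across judges, summed over items, equals the expected
# raw score of a minimally qualified candidate.
item_means = [sum(judge[i] for judge in ratings) / len(ratings) for i in range(n_items)]
angoff_cut = sum(item_means)

# Common adjustment: lower the cut by one standard error of measurement (SEM).
sd, reliability = 1.2, 0.85   # hypothetical raw-score SD and reliability
sem = sd * math.sqrt(1.0 - reliability)
adjusted_cut = angoff_cut - sem

print(round(angoff_cut, 2), round(adjusted_cut, 2))
```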

Selection to Fit a Profile The selection methods we have reviewed thus far are based on the straightforward linear relationship between predictors and criteria, but some have advocated selection on a more sophisticated basis, such as how well an individual fits a given profile (e.g., McCulloch & Turban, 2007). There are many types of fit (e.g., person-job, person-organization, person-group, person-supervisor; Kristof-Brown, Zimmerman, & Johnson, 2005), but person-organization (P-O) fit seems to be commonly advocated for selection. P-O fit is typically conceptualized as congruence between individual and organizational values or culture and is strongly related to organizational attitudes, such as job satisfaction, organizational trust, and commitment (Kristof-Brown et al., 2005). Conceptually, P-O fit is thought to predict important organizational criteria such as job performance and turnover in ways that traditional selection measures do not. Empirical support for this is found in a meta-analysis of P-O fit by Arthur, Bell, Villado, and Doverspike (2006), who found corrected mean validities of .15 for predicting job performance and .24 for predicting turnover. Indications were that work attitudes partially mediated the relationships between P-O fit and these criteria, so selecting on P-O fit may operate more through job satisfaction than through job performance (see Schmitt, Oswald, Friede, Imus, & Merritt, 2008). This meta-analysis recommended not using P-O fit to make selection decisions in the absence of a local validation study and instead using fit measures post-selection as developmental tools, such as exploring fit (and changes in fit) when working with employees who may develop performance issues or who are withdrawing and may be considering leaving the organization because of some type of misfit.

A major issue with selection for any type of fit is calculating a fit score. Common techniques are difference scores and correlations between the person's profile and the profile of the organization, where smaller differences between corresponding profile scores and larger correlations across the profile scores are both thought to imply greater fit. Difference scores and their variants (e.g., Euclidean distance) suffer from several methodological problems, including not knowing how much each component of the difference scores contributes to validity, and the compound attenuating effects of measurement error variance on the reliability of the difference scores in the profiles (Edwards, 1994). Arthur et al. (2006) found stronger criterion-related validities when fit was calculated via correlations than via difference scores. Correlations between person and organization profiles are also problematic, however, in that they reflect similarity in profile shape but not the absolute differences between person and organization scores. Also, the relationship between fit correlations and outcomes might be driven by specific variables within the profile: scores on certain variables, rather than the pattern of scores, may predict performance. Edwards (1994) demonstrated that polynomial regression is the appropriate analysis method when evaluating the relationship between fit and a criterion, thus overcoming several methodological problems inherent in difference scores. Unfortunately, polynomial regression is a data analysis method and not a method for assigning scores to individuals.
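To make the contrast between these fit indices concrete, the sketch below computes both a Euclidean distance and a profile correlation for two hypothetical candidates against a hypothetical organization profile. The profiles are invented, and the example is meant only to show how the two indices can disagree, not to endorse either over the polynomial regression approach just noted.

```python
import math

def euclidean_distance(person, org):
    """Profile dissimilarity: smaller values imply closer fit."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(person, org)))

def profile_correlation(person, org):
    """Profile-shape similarity: larger values imply closer fit,
    but absolute level differences are ignored."""
    n = len(person)
    mp, mo = sum(person) / n, sum(org) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(person, org))
    sp = math.sqrt(sum((p - mp) ** 2 for p in person))
    so = math.sqrt(sum((o - mo) ** 2 for o in org))
    return cov / (sp * so)

# Hypothetical value profiles rated on a 1-9 scale across five work descriptors.
org_profile = [7, 5, 8, 3, 6]
person_a = [6, 4, 7, 2, 5]   # same shape as the organization, uniformly lower
person_b = [7, 6, 7, 4, 6]   # similar level, slightly different shape

for name, person in (("A", person_a), ("B", person_b)):
    print(name,
          round(euclidean_distance(person, org_profile), 2),
          round(profile_correlation(person, org_profile), 2))
```

In this toy example the correlation index favors candidate A (a perfect shape match at a uniformly lower level), whereas the distance index favors candidate B, which is exactly the ambiguity described above.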
On the basis of current research, we recommend that P-O fit be used for selection only when the goal is to minimize turnover and only when a local validation study is possible. A procedure similar to that of McCulloch and Turban (2007) holds promise and should be legally defensible. These authors had call center managers describe the characteristics of a call center by way of a Q-sort that had them place 54 work descriptors into a normal distribution that was defined along a 9-point scale. Managers came to a consensus solution that defined the call center profile. Call center representatives then sorted the same descriptors in terms of how much


they valued each characteristic. The P-O fit score was the correlation between the individual profile and the call center profile. This score correlated .36 with employee retention and was uncorrelated with job performance, which again is aligned with the general idea of P-O fit being correlated more with attitudinal constructs than with performance.

Context-Based Selection Related to selection to fit a profile is the idea of taking context into account when making selection decisions. Johns (2006) defined context as any aspect of the situation that could affect the occurrence of behavior in an organization or the relationship between variables (e.g., test scores and the criterion). Context could be a cause of organizational behavior (i.e., a main effect), could interact with other variables to influence behavior (i.e., a moderator), or could influence predictor or criterion scores in other ways (e.g., organizational culture may influence how a personality inventory is interpreted in an indirect or multilevel manner). By measuring relevant context variables, the prediction of performance can sometimes be improved by considering the unique aspects of the organization and/or job. Attending to context could, therefore, influence the use or weighting of test scores in making selection decisions. As an illustration of the influence of context on variable relationships, Tett, Jackson, Rothstein, and Reddon (1994, 1999) showed that personality scales may be positively correlated with a performance dimension in some situations, yet have negative correlations in other situations. For example, a person high in agreeableness may do well in an organization that has a team-based, cooperative culture, but that same person may have difficulty in an organization with a culture that is highly competitive. This suggests that the first organization should select applicants who score high on agreeableness, but the second organization would be better off selecting applicants with a measure of competitiveness or achievement orientation.

Research on context-based selection is in its infancy because of the large sample sizes necessary and measurement issues associated with identifying moderator variables in personnel selection research. Johns (2006) presents a thorough review of issues associated with studying context. For example, context often has a cross-level effect (e.g., organizational strategy influences the evaluation of individual behavior), whereas selection research typically focuses on measuring variables at the individual level (e.g., test scores, performance ratings). When measuring variables at a higher level than the individual, there must be enough data at the higher level to adequately evaluate its impact on variables at the lower level. This is very difficult if one is evaluating the impact of an organizational-level variable, because data must be collected from multiple organizations, and there must be enough variability across organizations to properly evaluate the effect. The typical local validation study cannot meet this requirement, although a consortium study or meta-analysis may allow for tests of organization-level moderators. Team- or role-level context variables, however, would likely show reliable variance within an organization.

One way to test the effects of organization-level moderators without conducting a multilevel analysis is to evaluate the impact of raters' perceptions of organization-level variables (e.g., organizational strategy, business priorities) on their perceptions of individual job performance. Of course, if raters within an organization agree on the organization's standing on these variables, the limitation of not having enough variance to test for moderation still exists.
As an example of this type of analysis, Johnson (2016) used data collected across multiple organizations as part of the CEB Leadership Study (LoVerde & Schmidt, 2016) to demonstrate the moderating effect of several organizational context variables. For example, when managers had stronger perceptions that the organization’s future growth would come through innovation, the relationship between a personality measure of network leadership potential and manager ratings of network leadership performance was stronger than when innovation was not as much of a priority (i.e., there was a moderator effect). There is a risk in studying context effects in isolation, because there are always many different contexts operating simultaneously and possibly interacting. It is difficult or impractical to consider multiple contextual variables simultaneously. Similarly, it is impossible to identify, much less measure, all contextual variables that could be relevant. Nevertheless, it is still better to consider


a small number of contextual variables than to ignore context altogether, as is often done. Considering context in selection research has the potential to (a) improve prediction beyond what is typically seen in criterion-related validation studies; (b) identify candidates who are in alignment with the organization's culture, strategy, and priorities; and (c) match candidates to specific roles they are most ready and equipped to perform.
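As a minimal sketch of how a single contextual moderator might be examined, the code below simulates data in which the test-criterion slope depends on a binary context variable and then tests the interaction term with ordinary least squares. The variable names, effect sizes, and single-level design are hypothetical simplifications of the cross-level issues discussed above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

test_score = rng.normal(size=n)            # e.g., a personality scale
context = rng.binomial(1, 0.5, size=n)     # e.g., innovation-focused unit (1) or not (0)

# Simulate a criterion whose test-criterion slope is stronger when context == 1.
performance = 0.10 * test_score + 0.30 * context * test_score + rng.normal(size=n)

X = sm.add_constant(np.column_stack([test_score, context, test_score * context]))
model = sm.OLS(performance, X).fit()

print(model.params)        # last coefficient estimates the interaction (moderator) effect
print(model.pvalues[-1])   # significance test of the context-by-test interaction
```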

CONCLUSION Personnel selection makes a critical contribution to the system of organizational policies and practices to which it is related. Key selection questions are worth asking and addressing repeatedly as organizational researchers and practitioners, because both the questions and the answers adapt to fit the organizational setting and the current state of the art. In this chapter, we highlighted questions that are essential to address (a) at the outset of the testing program, (b) with regard to collection of test scores, (c) when computing test scores, and (d) when making selection decisions. If these questions are not addressed mindfully, they will likely be addressed by default. Test users should be aware of the organizational implications of each decision to ensure that the testing program is consistent with other organizational goals, and they should be aware of the current legal context and implications of each decision made to avoid potential litigation problems. When used properly by experienced professionals in the context of other organizational practices, selection test scores have proven to have a very positive influence on individuals and organizations alike.

REFERENCES
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508–600). Washington, DC: American Council on Education.
Arthur, W., Bell, S. T., Villado, A. J., & Doverspike, D. (2006). The use of person-organization fit in employment decision making: An assessment of its criterion-related validity. Journal of Applied Psychology, 91, 786–801.
Barrett, G. V., & Lueke, S. B. (2004). Legal and practical implications of banding for personnel selection. In H. Aguinis (Ed.), Test-score banding in human resource selection: Technical, legal, and societal issues (pp. 71–111). Westport, CT: Praeger.
Bartram, D. (2009). The International Test Commission guidelines on computer-based and Internet-delivered testing. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 11–13.
Bobko, P., Roth, P. L., & Buster, M. A. (2007). The usefulness of unit weights in creating composite scores: A literature review, application to content validity, and meta-analysis. Organizational Research Methods, 10, 689–709.
Brannick, M. T., & Darling, R. W. (1991). Specifying importance weights consistent with a covariance structure. Organizational Behavior and Human Decision Processes, 50, 395–410.
Burke, E. F. (1997). A short note on the persistence of retest effects on aptitude scores. Journal of Occupational and Organizational Psychology, 70, 295–301.
Burke, E. F. (2009). Preserving the integrity of online testing. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 35–38.
Campbell, J. P. (1996). Group differences and personnel decisions: Validity, fairness, and affirmative action. Journal of Vocational Behavior, 49, 122–158.
Campbell, W. J., & Reilly, M. E. (2000). Accommodations for persons with disabilities. In J. F. Kehoe (Ed.), Managing selection in changing organizations (pp. 319–370). San Francisco, CA: Jossey-Bass.
Campion, M. A., Outtz, J. L., Zedeck, S., Schmidt, F. L., Kehoe, J. F., Murphy, K. R., & Guion, R. M. (2001). The controversy over score banding in personnel selection: Answers to 10 key questions. Personnel Psychology, 54, 149–185.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York, NY: Cambridge University Press.
Carter, N. T., Dalal, D. K., Boyce, A. S., O’Connell, M. S., Kung, M-C., & Delgado, K. (2014). Uncovering curvilinear relationships between conscientiousness and job performance: How theoretically appropriate measurement makes an empirical difference. Journal of Applied Psychology, 99, 564–586.


Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1991). Statistical implications of six methods of test score use in personnel selection. Human Performance, 4, 233–264.
Catano, V. M., Brochu, A., & Lamerson, C. D. (2012). Assessing the reliability of situational judgment tests used in high-stakes situations. International Journal of Selection and Assessment, 20, 333–346.
Chan, D., & Schmitt, N. (1997). Video versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82, 143–159.
Chang, H. H., & Ying, Z. (2008). To weight or not to weight? Balancing influence of initial items on adaptive testing. Psychometrika, 73, 441–450.
Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement, 7, 249–254.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Converse, P. D., & Oswald, F. L. (2014). Thinking ahead: Assuming linear versus nonlinear personality-criterion relationships in personnel selection. Human Performance, 27, 61–79.
Coward, W. M., & Sackett, P. R. (1990). Linearity of ability-performance relationships: A reconfirmation. Journal of Applied Psychology, 75, 297–300.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York, NY: HarperCollins.
Cullen, M. J., Hardison, C. M., & Sackett, P. R. (2004). Using SAT-grade and ability-job performance relationships to test predictions derived from stereotype threat theory. Journal of Applied Psychology, 89, 220–230.
De Beuckelaer, A., & Lievens, F. (2009). Measurement equivalence of paper-and-pencil and Internet organisational surveys: A large scale examination in 16 countries. Applied Psychology: An International Review, 58, 336–361.
De Corte, W., Lievens, F., & Sackett, P. R. (2006). Predicting adverse impact and mean criterion performance in multistage selection. Journal of Applied Psychology, 91, 523–537.
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393.
DeVellis, R. F. (2016). Scale development: Theory and applications (4th ed.). Newbury Park, CA: Sage.
Donnellan, M. B., Oswald, F. L., Baird, B. M., & Lucas, R. E. (2006). The Mini-IPIP scales: Tiny-yet-effective measures of the Big Five factors of personality. Psychological Assessment, 18, 192–203.
Dorans, N. J., Pommerich, M., & Holland, P. W. (2007). Potential solutions to practical equating issues. New York, NY: Springer.
Edwards, J. R. (1994). The study of congruence in organizational behavior research: Critique and a proposed alternative. Organizational Behavior and Human Decision Processes, 58, 51–100.
Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58, 357–381.
Gibson, W. M., & Weiner, J. A. (1998). Generating random parallel test forms using CTT in a computer-based environment. Journal of Educational Measurement, 35, 297–310.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84–96.
Goldstein, I. L., Zedeck, S., & Schneider, B. (1993). An exploration of the job analysis-content validity process. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 3–34). San Francisco, CA: Jossey-Bass.
Green, B. F. (1977). Parameter sensitivity in multivariate methods. Multivariate Behavioral Research, 12, 264–288.
Guion, R. M. (2011). Assessment, measurement, and prediction for personnel decisions. New York, NY: Routledge.
Haladyna, T., & Hess, R. (1999). An evaluation of conjunctive and compensatory standard-setting strategies for test decisions. Educational Assessment, 6, 129–153.
Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty Gerrard, M. O. (2007). Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology, 92, 373–385.
Huselid, M. A., Jackson, S. E., & Schuler, R. S. (1997). Technical and strategic human resource management effectiveness as determinants of firm performance. Academy of Management Journal, 40, 171–188.
International Test Commission. (2006). International guidelines on computer-based and Internet-delivered testing. International Journal of Testing, 6, 143–172.
Janz, T., Hellervik, L., & Gilmore, D. (1986). Behavior description interviewing: New, accurate, cost effective. Boston, MA: Allyn and Bacon.
Johns, G. (2006). The essential impact of context on organizational behavior. Academy of Management Review, 31, 386–408.
Johnson, J. W. (April 2016). Enhancing our understanding of the network leadership construct. In M. A. LoVerde (Chair), Overview and selected findings from a multi-organizational, multi-level leadership study. Symposium conducted at the 31st Annual Conference of the Society for Industrial and Organizational Psychology, Anaheim, CA.


Johnson, J. W., & Carter, G. W. (2010). Validating synthetic validation: Comparing traditional and synthetic validity coefficients. Personnel Psychology, 63, 755–795.
Kantrowitz, T. M., & Dainis, A. M. (2014). How secure are unproctored pre-employment tests? Analysis of inconsistent test scores. Journal of Business and Psychology, 29, 605–616.
Kehoe, J. F., & Olson, A. (2005). Cut scores and employment discrimination litigation. In F. J. Landy (Ed.), Employment discrimination litigation: Behavioral, quantitative, and legal perspectives (pp. 410–449). San Francisco, CA: Jossey-Bass.
Kim, D-I., Choi, S. W., Lee, G., & Um, K. R. (2008). A comparison of the common-item and random-groups equating designs using empirical data. International Journal of Selection and Assessment, 16, 83–92.
Koopman, R. F. (1988). On the sensitivity of a composite to its weights. Psychometrika, 53, 547–552.
Kravitz, D. A. (2008). The validity-diversity dilemma: Beyond selection—The role of affirmative action. Personnel Psychology, 61, 173–193.
Kristof-Brown, A. L., Zimmerman, R. D., & Johnson, E. C. (2005). Consequences of individuals’ fit at work: A meta-analysis of person-job, person-organization, person-group, and person-supervisor fit. Personnel Psychology, 58, 281–342.
Landers, R. N., Sackett, P. R., & Tuzinski, K. A. (2011). Retesting after initial failure, coaching rumors, and warnings against faking in online personality measures for selection. Journal of Applied Psychology, 96, 202–210.
Le, H., Oh, I-S., Robbins, S. B., Ilies, R., Holland, E., & Westrick, P. (2011). Too much of a good thing: Curvilinear relationships between personality traits and job performance. Journal of Applied Psychology, 96, 113–133.
Lievens, F., Buyse, T., & Sackett, P. R. (2005). Retest effects in operational selection settings: Development and test of a framework. Personnel Psychology, 58, 981–1007.
Lievens, F., & Sackett, P. R. (2007). Situational judgment tests in high-stakes settings: Issues and strategies with generating alternate forms. Journal of Applied Psychology, 92, 1043–1055.
Livingston, S. A. (2004). Equating test scores (without IRT). Princeton, NJ: Educational Testing Service.
LoVerde, M. A., & Schmidt, C. (April 2016). Overview of the CEB Leadership Study (CLS). In M. A. LoVerde (Chair), Overview and selected findings from a multi-organizational, multi-level leadership study. Symposium conducted at the 31st Annual Conference of the Society for Industrial and Organizational Psychology, Anaheim, CA.
MacDonald, P., & Paunonen, S. V. (2002). A Monte Carlo comparison of item and person characteristics based on item response theory versus classical test theory. Educational and Psychological Measurement, 62, 921–943.
Makransky, G., & Glas, C. A. W. (2011). Unproctored internet testing verification: Using adaptive confirmation testing. Organizational Research Methods, 14, 608–630.
McCulloch, M. C., & Turban, D. B. (2007). Using person-organization fit to select employees for high-turnover jobs. International Journal of Selection and Assessment, 15, 63–71.
McPhail, S. M. (2007). Alternative validation strategies: Developing new and leveraging existing validity evidence. San Francisco, CA: Jossey-Bass.
Meade, A. W., & Bauer, D. J. (2007). Power and precision in confirmatory factor analytic tests of measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 14, 611–635.
Meade, A. W., Michels, L. C., & Lautenschlager, G. J. (2007). Are internet and paper-and-pencil personality tests truly comparable? An experimental design measurement invariance study. Organizational Research Methods, 10, 322–345.
Millsap, R. E., & Kwok, O. (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9, 93–115.
Mueller, L., Norris, D., & Oppler, S. (2007). Implementation based on alternate validation procedures: Ranking, cut scores, banding, and compensatory models. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 349–405). San Francisco, CA: Jossey-Bass.
Murphy, K. R. (1986). When your top choice turns you down: Effect of rejected offers on the utility of selection tests. Psychological Bulletin, 99, 133–138.
Naus, M. J., Philipp, L. M., & Samsi, M. (2009). From paper to pixels: A comparison of paper and computer formats in psychological assessment. Computers in Human Behavior, 25, 1–7.
Nye, C. D., Do, B.-R., Drasgow, F., & Fine, S. (2008). Two-step testing in employment selection: Is score inflation a problem? International Journal of Selection and Assessment, 16, 112–120.
Oswald, F. L., Friede, A. J., Schmitt, N., Kim, B. H., & Ramsay, L. J. (2005). Extending a practical method for developing alternate test forms using independent sets of items. Organizational Research Methods, 8, 149–164.


Oswald, F. L., Putka, D. J., & Ock, J. (2015). Weight a minute, what you see in a weighted composite is probably not what you get! In C. E. Lance & R. J. Vandenberg (Eds.), More statistical myths and urban legends (pp. 187–205). New York, NY: Taylor & Francis.
Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221–263). New York, NY: American Council on Education and Macmillan.
Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153–172.
Ployhart, R. E., & MacKenzie, J. I., Jr. (2011). Situational judgment tests: A critical review and agenda for the future. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology (Vol. 2, pp. 237–252). Washington, DC: American Psychological Association.
Ployhart, R. E., Weekley, J. A., & Holtz, B. C. (2003). Web-based and paper-and-pencil testing of applicants in a proctored setting: Are personality, biodata, and situational judgment tests comparable? Personnel Psychology, 56, 733–752.
Potosky, D., Bobko, P., & Roth, P. L. (2005). Forming composites of cognitive ability and alternative measures to predict job performance and reduce adverse impact: Corrected estimates and realistic expectations. International Journal of Selection and Assessment, 13, 304–315.
Pyburn, K. M., Jr., Ployhart, R. E., & Kravitz, D. (2008). The validity-diversity dilemma: Overview and legal context. Personnel Psychology, 61, 143–151.
Raymond, M. R., Neustel, S., & Anderson, D. (2007). Retest effects on identical and parallel forms in certification and licensure testing. Personnel Psychology, 60, 367–396.
Ree, M., Carretta, T., & Earles, J. (1998). In top-down decisions, weighting variables does not matter: A consequence of Wilks’ theorem. Organizational Research Methods, 1, 407–420.
Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., & Tyler, P. (2001). Ethnic group differences in cognitive ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297–330.
Roth, P. L., Switzer, F. S., III, Van Iddekinge, C. H., & Oh, I-S. (2011). Toward better meta-analytic matrices: How input values can affect research conclusions in human resource management simulations. Personnel Psychology, 64, 899–935.
Sackett, P. R., & Ellingson, J. E. (1997). The effect of forming multi-predictor composites on group differences and adverse impact. Personnel Psychology, 50, 707–721.
Sackett, P. R., & Roth, L. (1991). A Monte Carlo examination of banding and rank order methods of test score use in personnel selection. Human Performance, 4, 279–295.
Sackett, P. R., & Roth, L. (1996). Multi-stage selection strategies: A Monte Carlo investigation of effects on performance and minority hiring. Personnel Psychology, 49, 549–572.
Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. (2001). High-stakes testing in employment, credentialing, and higher education: Prospects in a post-affirmative-action world. American Psychologist, 56, 302–318.
Schmidt, F. L. (1971). The relative efficiency of relative and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 31, 699–714.
Schmidt, F. L. (1991). Why all banding procedures in personnel selection are logically flawed. Human Performance, 4, 265–277.
Schmidt, F. L., Mack, M. J., & Hunter, J. E. (1984). Selection utility in the occupation of U.S. park ranger for three modes of test use. Journal of Applied Psychology, 69, 490–497.
Schmitt, N., & Kuljanin, G. (2008). Measurement invariance: Review of practice and implications. Human Resource Management Review, 18, 210–222.
Schmitt, N., & Oswald, F. L. (2004). Statistical weights of ability and diversity in selection decisions based on various methods of test-score use. In H. Aguinis (Ed.), Test-score banding in human resource selection: Technical, legal, and societal issues (pp. 113–131). Westport, CT: Praeger.
Schmitt, N., Oswald, F. L., Friede, A., Imus, A., & Merritt, S. (2008). Perceived fit with an academic environment: Attitudinal and behavioral outcomes. Journal of Vocational Behavior, 72, 317–335.
Scullen, S. E., & Meyer, B. (2012). More applicants or more applications per applicant? A big question when pools are small. Journal of Management, 14, 1675–1699.
Segall, D. O. (April 2001). Detecting test compromise in high-stakes computerized adaptive testing: A verification testing approach. Paper presented at the annual meeting of the National Council on Measurement in Education, Seattle, WA.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author.
Stanton, J. M., Sinar, E. F., Balzer, W. K., & Smith, P. C. (2002). Issues and strategies for reducing the length of self-report scales. Personnel Psychology, 55, 167–194.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item functioning and differential test functioning on selection decisions: When are statistically significant effects practically important? Journal of Applied Psychology, 89, 497–508.


Tett, R. P., Jackson, D. N., Rothstein, M., & Reddon, J. R. (1994). Meta-analysis of personality-job performance relations: A reply to Ones, Mount, Barrick, and Hunter (1994). Personnel Psychology, 47, 157–172.
Tett, R. P., Jackson, D. N., Rothstein, M., & Reddon, J. R. (1999). Meta-analysis of bi-directional relations in personality-job performance research. Human Performance, 12, 1–29.
Tippins, N. T. (2009). Where is the unproctored Internet testing train headed now? Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 69–76.
Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., & Shepherd, W. (2006). Unproctored Internet testing in employment settings. Personnel Psychology, 59, 189–225.
Truxillo, D. M., & Bauer, T. N. (1999). Applicant reactions to test score banding in entry-level and promotional contexts. Journal of Applied Psychology, 84, 322–339.
Truxillo, D. M., & Bauer, T. N. (2000). The roles of gender and affirmative action in reactions to test score use methods. Journal of Applied Social Psychology, 30, 1812–1828.
U.S. Department of Labor. (1999). Testing and assessment: An employer’s guide to good practices. Washington, DC: U.S. Department of Labor, Employment and Training Administration.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices and recommendations for organizational research. Organizational Research Methods, 3, 4–70.
van der Linden, W. J. (2005). Linear models for optimal test design. New York, NY: Springer.
Viswesvaran, C., & Ones, D. S. (1999). Meta-analyses of fakability estimates: Implications for personality measurement. Educational and Psychological Measurement, 59, 197–210.
Weekley, J. A., & Jones, C. (1997). Video-based situational testing. Personnel Psychology, 50, 25–49.


9 MANAGING SUSTAINABLE SELECTION PROGRAMS JERARD F. KEHOE, STEFAN T. MOL, AND NEIL R. ANDERSON

The objective of this chapter is to describe the major features of selection programs that contribute to their sustainable success. This chapter focuses on four primary drivers of sustainability: (a) the organizational purposes for selection, (b) HR strategy, (c) governance, and (d) process management. The chapter will not include the psychometric technology of selection practices that affect the value of selection decisions, as this content is treated elsewhere in this volume (e.g., Aiken & Hanges, Chapter 17, this volume; Putka, Chapter 1, this volume). Further, the section on process management addresses only the role of process metrics. Other, more detailed treatments of selection process management are available elsewhere, especially Tippins (2002, 2012). This chapter is the result of collaboration between psychologists with U.S.- and European-centric professional experience. The intent is not so much to ensure comprehensive coverage of cultural or national differences between sustainable selection programs as it is to better ensure that this chapter is relevant to modestly diverse cultural and national perspectives and contexts.

Several recent chapters (Kehoe, Brown, & Hoffman, 2012; Roe, 2005; Tippins, 2002; Tippins, Solberg, & Singla, Chapter 16, this volume; and Tippins, 2012) and one article (Klehe, 2004) have addressed the design and implementation of selection programs. This chapter's focus on the organizational context for selection programs complements these earlier works. Roe (2005) focused primarily on the procedural elements of the selection process. In contrast, Tippins (2002, 2012) and Tippins et al. (Chapter 16, this volume) focused more on the necessary elements of a fully functioning selection program such as the management of test materials, test administration processes, test preparation strategies, and test use rules. Kehoe et al. (2012) focused primarily on management practices for selection programs. Finally, Klehe (2004) focused on the institutional pressures that may help or hinder the adoption of selection procedures that are recommended by academics.

It should also be noted that this chapter complements Chapter 5, this volume, which also addresses the organizational context for selection. In contrast to Chapter 5, this chapter treats the organizational context as an independent variable, if you will, that influences the features of selection programs necessary to be sustainable. In Chapter 5, Ployhart and Weekley focus on the organization as the dependent variable by considering the impact of selection as a human resources management (HRM) strategy on the organization.

ORGANIZATION CONTEXT FOR SELECTION The central point of this chapter is that the four layers of organization context and structure directly influence the sustainability of selection programs. At the most general level, organization


FIGURE 9.1  The Four-Part Model of Organizational Factors Influencing the Sustainability of Selection Programs

purposes create the environment in which the most fundamental decisions about sourcing employees are made. Second, the HR strategy is likely to provide essential direction in establishing the goals and objectives of a selection program and the manner in which it is integrated with other HR programs and business processes. Third, governance establishes the authorities, accountabilities, boundary conditions, and roles that enable the selection program to function effectively and efficiently within the context of other HR processes. Finally, selection process management is the most specific form of structure within which selection programs operate. The elements of process management are highly specific to the functioning of a selection program. They can be common across all units and jobs or they can vary across units and jobs. Figure 9.1 provides a visual depiction of these sustainability considerations as well as the specific points underlying each one that are addressed in this chapter. Before describing the four layers of organizational context that affect sustainability, we offer our perspective about the meaning of selection system sustainability.

DEFINING SELECTION SYSTEM SUSTAINABILITY This chapter applies an institutional perspective (DiMaggio & Powell, 1991; Scott, 1995) to the definition of selection system sustainability by means of our focus on organization purpose, HR strategy, governance, and process management. Thus, rather than defining selection system sustainability in terms of economically rational decision making, which is epitomized in much of the academic literature pertaining to personnel selection, selection system sustainability is defined here in terms of a normative rationality that is contingent upon factors at the individual level (e.g., managers' norms, habits, and unconscious conformity to organizational traditions), the organizational level (e.g., corporate culture, shared belief systems, and political processes), and the societal level (e.g., legislation and professional standards) (Oliver, 1997). In our view, a selection system is sustainable to the extent that its purpose, strategy, governance, and management are consistent with these touchstones. A vital implication of our perspective is that although an organization may have designed a sophisticated selection procedure that displays high validity, reliability, and fairness to begin with, paying insufficient attention to sustainability issues will inevitably result in disuse or, more subtly, a gradual (or even rapid) decline in the psychometric performance of the system over time.


ORGANIZATION PURPOSES Traditionally, the efficacy of personnel selection has been evaluated primarily in terms of the fit between the selected person and their immediate job role (i.e., person-job fit; Chatman, 1991; Ostroff & Rothausen, 1996). As organizations have become more delayered, flexible, and team-based in their structures, the imperative to evaluate personnel selection from additional, more superordinate levels of fit has gained momentum among researchers and personnel practitioners who are active in employee selection. Principally this has meant that issues of person-team fit (P-T fit) and person-organization fit (P-O fit) have been added to the selection agenda over recent years, and the need to consider any selection decision from all three levels of analysis—person-job fit (P-J fit), P-T fit, and P-O fit—has been increasingly recognized (e.g., Ployhart & Schneider, 2005). In effect, this has resulted in the criterion space under consideration in selection being substantially extended to include other levels of fit. Yet this expansion of the criterion space is relatively recent, and the research base upon which I-O psychologists can make grounded recommendations to organizations to best manage multilevel selection systems remains underdeveloped. To illustrate this point, two quotes will suffice.

The critical challenge is to expand our conceptual horizon beyond the level of person-job fit and to incorporate multiple and interactive levels of analysis into selection decision-making. (Herriot & Anderson, 1997, p. 26)

Reflecting on nearly a century of personnel selection research, it is quite troubling to us that we have no solid answers . . . and approaches to answering the questions that remain outside of traditional personnel selection research. We may be able to show how hiring better people contributes to better individual job performance, but we have hardly examined whether this contributes to better unit-level performance. (Ployhart & Schneider, 2005, p. 496)

This relative paucity of research into selecting for P-T and P-O fit, compared with the mass of studies into aspects of P-J fit, naturally leads to problems in making sound recommendations for the management of selection systems at different levels of analysis (e.g., Anderson, Lievens, van Dam, & Ryan, 2004; Ployhart, 2007; Schneider, Smith, & Sipe, 2000). Despite this discrepancy in research coverage, the area of multilevel selection has recently become far more active: several authors internationally have contributed theoretical models (e.g., Ployhart & Schneider, 2002; Ployhart, 2004; Stevens & Campion, 1994), empirical studies have been published (e.g., LePine, Hollenbeck, Ilgen, & Hedlund, 1997; Morgeson, Reider, & Campion, 2005), and even validated measures of P-T fit have appeared in publication (e.g., Burch & Anderson, 2004; Mumford, Van Iddekinge, Morgeson, & Campion, 2008). In short, the field remains more speculative than settled, with the focus to date on generating theoretical models and conceptual think-piece papers rather than on publishing robust empirical studies of multilevel selection effects.

Despite these shortcomings, several important implications for the management of selection systems can be gleaned from the recent literature. In perhaps the most detailed and directly relevant contribution to the validation of multilevel selection decisions, Ployhart and Schneider (2005) proposed a 10-stage model for the conduct of any such validation study. The stages are summarized as follows:

 1. Articulate theory: Operationalize hypotheses of within- and across-level relationships between predictor constructs.
 2. Articulate relationships between theory and measurement issues, especially with regard to data aggregation.
 3. Articulate predictors: Define predictor methods and specify their predominant level/levels of analysis.
 4. Specify within-level relationships: Operationalize direction and strength of knowledge, skills, abilities, and other characteristics (KSAO)-criterion relationships within level (i.e., P-J, P-T, and P-O).
 5. Specify cross-level relationships: Operationalize contextual effects, cross-level effects, and multiple-level effects.
 6. Sample units: Sample a sufficient number of units to test for within- and cross-level relationships.


 7. Use appropriate measures for each level of analysis.
 8. Test aggregation inferences: Test for unit-level variance and reliability of aggregation procedures.
 9. Analyze data using appropriate procedures.
10. Interpret results giving consideration to within- and cross-level findings.

This procedure for validation of multilevel selection procedures is comprehensive, but it is apparent that only the most statistically versed HR practitioners, supported by an I-O psychologist able to undertake the relevant analyses, could complete such a procedure. Rather, it is far more conceivable in practice that selectors will approach such decisions in a notably ad hoc manner, will give weights to different within- and cross-level variables on the basis of some notional “rules of thumb” known only to themselves, and will be prone to a gamut of errors brought on by information overload, imperfect information processing, and satisficing in their decision-making strategies. Indeed, this is what we would expect given the vast literature now accumulated on interviewer and assessor decision making under conditions of information overload. Yet Ployhart and Schneider’s (2005) model for validation, and thus sustainability management, is highly valuable in pointing up the complexities of the task facing any practitioner or researcher.

Other authors have highlighted other issues of concern, including the likelihood that maximizing fit at one level of analysis can lead to declines in fit at other levels of analysis. For instance, Anderson et al. (2004) proposed three types of effects in cross-level selection decision making: (1) complementary, (2) neutral, and (3) contradictory fit. That is, KSAOs being sought by an organization at the level of P-J fit can be complementary to P-T and P-O fit, neutral in their overlaps, or, more problematically, contradictory in their effects. For example, high extraversion needed for P-J fit can be complementary for team-level issues of fit, whereas high rule independence needed for innovation potential in a research and development (R&D) scientist may militate against P-O climate fit in an organization that possesses a strong climate in which conformity is valued (e.g., Potocnik, Anderson, & Latorre, 2015).

The subdomain of multilevel fit in selection promises to generate novel but far more complex models of decision making to support organizational efforts to optimize P-J, P-T, and P-O fit. However, this field remains at an embryonic stage of development, with mostly theoretical and model-building contributions published to date. Applied research in field study settings is badly needed to begin to extend and clarify our understanding of these complexities and how best to advise organizations to deal with the many issues, challenges, and controversies thrown up by multilevel fit in employee selection.
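As one hedged illustration of what Stage 8 of this model can involve in practice, the sketch below computes ICC(1) and ICC(2) from simulated individual-level scores nested within units, a common first check before unit-level aggregation is defended. The data, group sizes, and variance components are invented and are not drawn from any study cited here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_per_unit = 40, 10

# Simulate individual-level scores as a true unit effect plus individual noise.
unit_effects = rng.normal(scale=0.5, size=n_units)
scores = unit_effects[:, None] + rng.normal(scale=1.0, size=(n_units, n_per_unit))

# One-way ANOVA components computed by hand.
grand_mean = scores.mean()
unit_means = scores.mean(axis=1)
ms_between = n_per_unit * ((unit_means - grand_mean) ** 2).sum() / (n_units - 1)
ms_within = ((scores - unit_means[:, None]) ** 2).sum() / (n_units * (n_per_unit - 1))

icc1 = (ms_between - ms_within) / (ms_between + (n_per_unit - 1) * ms_within)
icc2 = (ms_between - ms_within) / ms_between   # reliability of the unit means

print(round(icc1, 2), round(icc2, 2))
```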

HR STRATEGY HR strategy can vary significantly across organizations. For example, very small or highly entrepreneurial organizations may have no formalized HR strategy, whereas large organizations are likely to have an HR strategy that is professionally developed and more or less integrated into the business strategy. Our experience with medium and large organizations points to the importance of five key HR strategies in determining characteristics of successful and sustainable selection programs. The first, and perhaps most important, is the organization’s employee skill strategy. The skill strategy often defines how the organization balances the internal development of employee skills (building) with the external acquisition of employee skills (buying). The second HR strategy is more relevant for larger organizations. How are the interests of units and the organization as a whole managed? This is a critical consideration, particularly for regulated HR processes such as employee selection. These first two questions are general and have implications for many HR programs. The third strategy is specific to the HR function responsible for the development and validation of selection procedures and processes. Is this function positioned within the HR strategy as an expert role that has no ownership of any elements of the HR strategy, as might be found in a Center of Excellence (COE) or in an internal consultant role? Or, is this function positioned as an owner of the HR strategy for personnel


selection? The fourth strategy is about the relationship between the responsibility for designing and developing selection procedures and processes and the responsibility for delivering or managing employment and staffing functions that implement the selection procedures. Finally, we acknowledge the importance of HR technology. This issue is not addressed here but is treated in detail in Chapter 39, this volume.

Employee Skill Strategy The organization’s approach to employee training and development has a significant impact on the selection program. Generally, to the extent that the organization emphasizes training and development as the source of employee skills, either (or both) of two things may be true of the selection program. One possibility is that the focus of the selection procedure places more emphasis on less “developable” attributes such as general mental ability and general dispositional attributes such as conscientiousness, leadership, motivation, and integrity. This shift is likely to be accompanied by a reduced emphasis on the assessment of job-specific skills such as job knowledge, work simulations, and high-fidelity situational judgment tests (SJTs). A more sophisticated version of this shift occurs when the selection procedures are tailored to prerequisites of the specific training and development objectives. For example, the job analysis effort preceding the development of the selection measures may identify the knowledge, skill, and ability prerequisites of the organization’s training and development content. In turn, selection procedures may be designed to target those prerequisites. Where the HR strategy focuses on “buying” rather than “building” skills, the selection program is frequently a major mechanism by which this HR strategy is implemented. In this case, the selection program is likely to emphasize the knowledge, skills, abilities, and other characteristics necessary to perform the job at some adequate level of proficiency with minimal additional training and development. Of course, this feature of an organization’s HR strategy is usually dynamic and depends on the particular job family, as well as frequently changing budgets and business plans, and the particular personnel decision(s) (e.g., hiring, promoting, moving) for which the selection program is being used. Certainly, knowledge-based jobs requiring advanced degrees (e.g., high-tech research positions) are virtually always supported by a “buy” strategy even in the same organization that may adopt a “build” strategy for other jobs (e.g., a customer service job requiring knowledge of specific product features). Similarly, internal progression programs that define the bases for promotion within an organization may constitute a build strategy by relying on specific proficiencies demonstrated in feeder jobs. At the same time, entry into the feeder jobs may reflect a buy strategy. This complexity also extends to two more recent HR strategies—workforce management and the use of contract workers. First, an increasing emphasis on workforce management requires that information about employees’ current skills be used to make selection decisions about moving employees to other jobs. In this situation, selection programs may need to focus on two considerations: the relevance of current skills to future work and the relevance of current performance to future performance in new jobs. In this scenario, the distinction between skills and performance can be important for a workforce management selection program. This distinction hinges on the assumption that skills and performance are assessed differently. 
In our experience, workforce management strategies that focus on the movement of employees between jobs vary in the extent to which they rely on records of past and present performance, and assessments of future-oriented skills, despite the axiom that past behavior is the best predictor of future behavior (Guion, 1998; Nickolau, Anderson, & Salgado, 2012). Where the movement under consideration is between two similar jobs, the selection emphasis is often on recent performance in the current job. Recent performance is usually assessed by referring to administrative records of recent job performance such as appraisal ratings, salary and bonus awards, progressive work assignments, and the like. This approach is particularly evident within the realm of expatriate management, in which selection decisions are typically based on the expatriate’s technical


expertise and domestic track record as opposed to language skills, international adaptability, and other selection context predictors (Bonache, Brewster, & Suutari, 2001; Cerdin & Brewster, 2014; Harris & Brewster, 1999; Mendenhall & Oddou, 1985). This may in part be due to the fact that the expatriate position for which the candidate is being sought is highly similar to the domestic position this candidate is vacating. In contrast, where the movement is between dissimilar jobs, the selection focus is more likely to be on skills that are assessed independently of administrative assessments of performance. Such skill assessments may include ad hoc supervisors' ratings of defined skill levels, skill tests, and ad hoc interviews.

A second emerging HR strategy can be even more problematic for selection programs. Many organizations contract with external organizations to provide workers who perform work in the client organization. If the client organization does not require contract employees to complete its own selection process (e.g., to avoid co-employment liabilities), then it almost certainly faces a future dilemma. The dilemma arises when, as is often the case, the client organization eventually wants to hire a contract employee who has performed successfully. In this case, there is the very real and predictable likelihood that some significant percentage of successful contract employees will fail to satisfy the client organization's selection criteria despite their demonstrated job success, especially if the contract employee was performing precisely the same job for which they are applying to be hired. This conflict between job success and selection failure can cause serious harm to the credibility and defensibility of the selection program, although it may be entirely consistent with the level of validity and the de facto selection rate. To avoid this conflict, owners of sustainable selection programs will pursue a strategy that either requires all contract employees to satisfy the selection criteria prior to being assigned to work in the client organization or establishes some form of selection policy (see the Selection Policy section) that allows recent success in the same job to be a surrogate for satisfying that job's selection criteria. This latter approach relies on credible and accurate job performance measures and may create an additional legal risk for the existing selection criteria if this alternative way of entering the job leads to less adverse impact than produced by the standard selection system.

This prospect of having two ways of satisfying selection standards for a job may also manifest where a vacancy may be filled either by external applicants or by incumbent employees as part of internal progression programs. For example, the movement of employees from entry-level technical positions to higher-level technical positions may be governed by a progression program that specifies requirements for progression from one level to the next. In such programs, progression requirements are selection criteria, and the employee-applicant often has some degree of control over the process of satisfying such requirements. Internal progression requirements often consist of various standards, including demonstrated skills, training certifications, and/or current job performance.
In contrast, external hiring into the same job may consist of a different profile of selection criteria such as educational degrees, years of experience, interviews, and qualification test results. It is not uncommon for internal progression requirements to be different from external hiring criteria for the same position simply because more local information is known about incumbent employees than about external applicants. Where such differences occur, it is crucial to give careful consideration to the equivalence of the two paths to the target position. In many cases, it is very difficult, if not impossible, to define equivalence psychometrically. There may be few, if any, people who have scores on both sets of selection criteria. The selection criteria in the two paths may be qualitatively different. For example, internal progression may rely heavily on administrative documentation of local workplace behavior such as performance and training achievement, whereas external hiring is likely to rely on indicators such as degrees, test scores, and interview results. One possible empirical definition of equivalence is that job hires coming from the two paths tend to perform equally well; that is, they have the same expected performance level. Other definitions of equivalence may be rational, rather than empirical. One rational definition is that receiving managers agree that the two sets of standards are equivalent. However established, it is important that the organization establishes the equivalence of the two paths for the two sets of selection criteria to be simultaneously sustained.
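To make the empirical definition of equivalence concrete, the following is a minimal illustrative sketch (ours, not the chapter's) of how an analyst might check whether hires entering the same job through the internal-progression and external-hiring paths show the same expected performance; the data frame, column names, and ratings are hypothetical.

```python
# Illustrative sketch (not from the chapter): probing the empirical definition of
# equivalence by comparing expected performance of hires by entry path.
# The data frame and column names are hypothetical.
import pandas as pd
from scipy import stats

hires = pd.DataFrame({
    "path": ["internal"] * 5 + ["external"] * 5,     # entry path into the target job
    "performance": [3.8, 4.1, 3.6, 4.0, 3.9,         # e.g., first-year appraisal ratings
                    3.7, 4.2, 3.5, 3.9, 4.0],
})

internal = hires.loc[hires["path"] == "internal", "performance"]
external = hires.loc[hires["path"] == "external", "performance"]

# Welch's t-test: does expected performance differ by entry path?
t, p = stats.ttest_ind(internal, external, equal_var=False)
print(f"Mean (internal) = {internal.mean():.2f}, mean (external) = {external.mean():.2f}")
print(f"Welch t = {t:.2f}, p = {p:.3f}")
```

In practice such a comparison would require far larger samples and attention to differences in the jobs and conditions into which the two groups are hired; the sketch only shows the form the evidence might take.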


Managing Corporate and Unit Interests One of the most significant HR strategy factors for the success of selection programs in medium to large organizations is the manner in which the sometimes conflicting corporate and unit interests are managed. To be sure, whatever balance might be achieved between these two interests, it is likely to be dynamic and will change with business conditions. In our experience, three dimensions capture the majority of these issues: funding source, approval roles, and the myriad facets of location differences.

Funding The manner in which funding for selection programs derives from corporate budgets and/ or unit budgets has a large impact on the organizational pressures acting on the selection program. Where funding is mostly or entirely from corporate budgets and is, as such, relatively distant from the means by which units fund the corporation, it is likely that corporate interests in defensibility, fit with HR strategy, and perceived fairness and equivalence across units will be more salient in the design and management of selection programs. Where unit-based funding is substantial or, often, even contingent on unit-level satisfaction with selection programs, the pressures for unit-specific design and management are likely to be much greater. Our view is that the latter condition is more difficult to manage for selection program managers because it can create pressures that are more likely to conflict with the professional values of consistency, validity across broad job families, and job focus. In general, corporate interests tend to have a convergent influence supportive of a single, whole, integrative selection program, whereas unit interests tend to have a divergent influence that leads to differentiated and multiple selection practices across units. Divergence of interests is more likely to create fundamental conflicts with the professional and legal standards for selection programs, especially where different units have similar jobs.

Approval Roles Two types of approvals are covered here: (1) the approval to implement or change a selection program and (2) the approval to waive or exempt individuals from the requirements of a selection program. Where these two approval roles reside in a corporate organization, the interests of the corporation are likely to be more influential than if either or both approval roles reside in the affected units. In many ways, the impact of approval roles is the same as the impact of funding source. The organizational entity that funds and approves has more influence. However, we have seen combinations of funding and approval roles that have surprising and complex effects on selection programs. Indeed, selection programs may be most sustainable where funding is corporate but approval is local (the reverse combination of local funding with corporate approval is unlikely to occur in our experience except in organizations with highly centralized authorities). The impact of approval roles on sustainability is that, at its core, the authority to approve the implementation of, changes to, or exceptions to a selection program is tantamount to approval authority over the content of the selection program. It may be difficult to reach agreement to organizationally separate funding and approval roles, but, when separated, they create a form of checks and balances that may sustain a selection program across a wider range of circumstances than if both were housed in the same level of the organization. Corporate funding and local approval, even if they are often in tension with one another, give both organizational levels a significant operational stake in, and influence over, the selection program that is commensurate with their necessary interests in the program. The value we place on balancing these interests is rooted in the perspective that the effectiveness of a selection program (of great local interest) and its defensibility or compliance (of great corporate interest) are both critical considerations and both require attention to be optimized.


An alternative perspective we have observed in some organizations is that a selection program’s defensibility can be difficult to assess and is assured only by persistent and rigorous attention, whereas the effectiveness of a selection program can be satisfied more easily by the involvement of professional-level selection expertise. In effect, this perspective holds that effectiveness can be attained by the expertise of the designer but that defensibility requires continuous attention to and alignment among all processes that constitute a selection system. This latter perspective is less likely to seek a balance between effectiveness and defensibility and is more likely to place great weight on defensibility.

Location Differences: Expatriate Selection There is, perhaps, no better manifestation of the potential for location differences to impact selection strategy than expatriate selection. Conflict between corporate and unit interests is likely to be particularly salient in multinational companies (MNCs), in which personnel decision making is further complicated by determining whether expatriates—who can be either parent country nationals (PCNs), third country nationals (TCNs), or host country nationals (HCNs)— should be employed. Welch (1994), in her framework of determinants of international HR management approaches and activities, has conceived MNC personnel selection to be contingent upon (a) contextual variables relating to the particular host country environment (i.e., the legal system and cultural distance), (b) firm-specific variables (e.g., stage in internationalization, type of industry), and (c) situation variables (staff availability, location of assignment, need for control, and locus of decision). Dowling and Welch (2004) further added (d) the particular approach to staffing (ethnocentric, polycentric, regiocentric, or geocentric) that the MNC embraces to this list of antecedents of MNC selection practices. Within the ethnocentric approach, strategic decisions pertaining to selection are made at headquarters, and subsidiaries, which are managed mostly by PCNs, have little or no autonomy in decision making. The polycentric approach is characterized by more decision-making autonomy on the part of subsidiary organizations, which are usually also managed by HCNs. Within the geocentric approach, applicants are drawn from an international pool of executives, and PCNs, HCNs, and TCNs may be selected into any job in any country depending on their ability (Colakoglu, Tarique, & Caligiuri, 2009). Finally, the regiocentric approach is similar to the geocentric approach but different in that decision making is deferred to regional headquarters. Although it is beyond the scope of this chapter to consider MNC staffing in detail (see Chapter 36, this volume, for further discussion), the point being made here is that the particular organizational environment created by these antecedents may compromise selection system sustainability. For instance, MNCs with a geocentric staffing policy may be forced to revise their selection systems in light of host country legal regulations and immigration policies enforced to promote the hiring of HCNs. Similarly, MNCs that seek to exert control over their overseas subsidiary operations through ethnocentric staffing policies may find that the HCN employees within the subsidiary perceive they are being unfairly treated in comparison to expatriate PCNs. Finally, MNCs favoring a geocentric staffing policy may find this selection system unsustainable because of the huge costs involved in the training and relocation of its HCN, PCN, and TCN staff. In addition to the above considerations, Harzing (2001) has provided evidence that the likelihood of finding a PCN in a top management position in foreign subsidiaries is contingent on such diverse antecedents as host country political risk and education level, subsidiary age and performance, and industry. In addition to the above issues, the expatriate selection system sustainability may be further complicated because of the fact that expatriates are incumbents in a myriad of different occupations and countries. 
The term expatriate may thus be legitimately used to describe a French banker in Hong Kong and an American geologist working for an oil company in Saudi Arabia. Any standardization vis-à-vis expatriate selection decision making is therefore likely to imply the comparison of apples and oranges. This being the case, Mol (2007) has called for an abandonment of research into expatriate selection as such. A multinational bank might be better off selecting expatriate bankers on the basis of the selection system in place for the selection


of domestic bankers rather than trying to develop a selection system that is tailored specifically to expatriate bankers in Hong Kong. Alternatively, a resolution to the issue of selecting against a heterogeneous criterion space may be found in the notion of synthetic validity (Scherbaum, 2005; Steel, Huffcutt, & Kammeyer-Mueller, 2006).

Role of Professional Selection Expertise A third HR strategy consideration is the organizational role of the expert selection professional(s) who designs, develops, and validates selection programs. There can be several dimensions to the scope of this expert role (e.g., inside or outside, broad or narrow, and large or small). We view the expert-owner dimension as the one having the most significant strategic impact on the sustainability of selection programs. This dimension refers to the extent to which the selection support role, which is virtually always scaffolded by professional expertise in some fashion, is accompanied by strategy ownership responsibilities. These strategic ownership responsibilities might include any of the following: (a) ownership of the budget for selection design, development, and validation work; (b) ownership of approval authorities; (c) ownership of selection data governance and systems; (d) authority over use of assessment results; (e) ownership of compliance responsibilities beyond validation, such as monitoring, reporting, and responding to enforcement agencies; (f) ownership of employment delivery functions that manage employment and selection processes; and (g) ownership of the agenda for the development, adaptation, and maintenance of existing and new selection programs. The fundamental issue is the extent to which the organization’s strategic direction for selection programs is owned by the experts who also design and develop those programs. Of course, there can be many combinations of specific roles relating to this issue, and these can be placed along a continuum from expert-only at one end to expert-owner at the other end. Here we describe the ways the expert-only and the expert-owner ends of the spectrum can be manifest.

Expert-Only Strategy In the expert-only strategy, the selection professionals who design, develop, and validate selection procedures do not own the strategic direction of the organization’s selection programs. Although they may create selection procedures, they do not determine which organizational needs will be addressed by selection solutions; they do not determine what selection strategies will be applied across units; they do not have authority over tradeoffs between cost and value of selection procedures; and so on. This expert-only strategy can manifest in various ways. A recent organizational strategy is to house selection experts in HR organizations sometimes called Centers of Expertise/Excellence (COEs). These COEs are positioned as technical resources to the business, which may be accessed by business units as needed—often in the non-expert judgment of the business units—to develop HR solutions to business problems. Similarly, selection experts who are described as internal consultants often serve in roles very similar to COEs. COEs are almost certainly an indication of an expert-only approach. Another clear sign of an expert-only approach is the situation in which selection experts are funded only on a project-specific basis. This can be the case whether selection experts are located in corporate or unit organizations. A third sign of an expert-only approach is that the selection experts do not report to an HR department. Being housed outside of HR almost always means that selection budget funding is closely tied to specific projects rather than an overall strategic purpose for selection. A variation of the COE approach is one where the selection design role is outsourced or contracted to an external organization to provide design services back to the client organization. In this case, the role and scope of the selection designer’s work is specified by a services contract. The selection expert’s strategic influence or authority can be significantly reduced where this contract is overseen and approved by non-selection experts.


The expert-only approach is likely to have several typical consequences for selection programs. First, it will be difficult to develop a long-term plan for the gradual restructuring or introduction of a comprehensive selection strategy. Second, virtually all authority over the administration of the selection program and over initial decisions on standards and policies is likely to reside in the funding organization or, possibly, in the organization responsible for the administration of the selection procedures. Third, the development of selection programs that apply in some consistent fashion across different organizational units will be difficult. The scope of selection design work is more likely to have a local focus on particular jobs within particular units. Fourth, local business leaders may provide stronger support to locally focused selection programs than corporately focused programs if they see the programs as more directly tailored to their specific needs.

Expert-Owner Strategy Selection experts who also own selection strategy identify strategic directions by analyzing organizational needs both within and across units to identify selection solutions that have the greatest long-term benefits. A critical strategic activity for selection owners is long-term planning. Strategic planning can take many forms but almost always includes collaborative planning with HR leaders across units of the organization. Such planning would typically focus on common interests across units as well as unique interests of specific units. As mentioned earlier, particular challenges may be faced in this regard by expert-owners in MNCs in which subsidiary local idiosyncrasies (such as the host country legal context and the local labor market) may prevent the establishment of a selection strategy that cuts across the various units of the organization. Here, MNCs will clearly need to be sensitive to local needs rather than merely attempting to impose a standardized procedure upon multiple units (cf. Harzing, 2001). Indeed, we would argue that this tension between standardization versus country specificity will require active management. Strategy ownership can manifest in various ways. Owners are more likely to have responsibility for neighboring functions such as employment policy, employee research, training of employment administrative staff, and regulatory compliance that depend on or influence the selection strategy. Strategy owners may have stronger and longer relationships than expert-only managers with corporate and unit HR leaders because their roles are more comparable and interrelated. At the same time, expert-owners may not have strong relationships with unit business leaders where funding is less likely to be directly tied to business unit sources. This can be an important consideration for selection strategy owners. Certainly, there is considerable value in well-developed relationships with HR leaders and with the business leaders they support. Typically, these relationships are not managed independently. One approach is for the expert-owner’s primary unit relationship to be with the unit’s HR leader who, in turn, guides the expert-owner’s relationship with unit business leaders. Strategy ownership has other implications for selection programs. They are more likely to be both coherently integrated across units of the organization and supported by a common set of policies and practices. Strategic selection programs are more likely to be integrated with other HR strategies/programs such as training and development, workforce management, compliance support functions, and organization-wide HR information systems. The selection development function is more likely to have policy authority regarding business managers’ roles in selection decision making, even if managers’ roles vary from unit to unit. One of the most tangible possible indicators of strategy ownership would be that selection developers would have created an approved selection strategy document used to communicate and plan with units and other HR functions and to develop selection budgets. Overall, strategy ownership can be used to build in several features of selection programs that promote sustainability. A cautionary note is that strategic ownership tends to align itself with corporate interests that are longer term and cross-unit. It is critical that the strategic role not inadvertently lead to weaker relationships with local units where the actual selection decisions are made.


Alignment of Selection Development and Selection Delivery A key dynamic for sustainable selection programs is the relationship between the science-based, professional selection development function and the operational, transaction management function that administers, scores, and uses selection procedures to make selection decisions. In many organizational arrangements, the development of selection procedures is an organizationally separate function from the delivery of those same procedures. Even if the development and delivery functions have a collaborative working relationship, their budgets may be developed and managed separately. The primary issue is that these two HR functions are likely to have somewhat different priorities and success criteria. In our experience, the priorities and success criteria for selection developers tend to center on issues of validity such as job relevance, assessment content, impact on business needs, and legal defensibility. Their science-based education, professional standards, and organizational expectations point them in these directions, especially where selection developers’ budgets do not pay for the employment operations. In contrast, employment operations that deliver selection procedures are often faced with very different expectations and measures of success. Performance typically is measured in units of cycle time, cost per hire, average time to fill a vacancy, and hiring manager satisfaction. Because selection delivery is viewed most often as transaction management, its success is often measured in terms of transaction characteristics. Delivery functions may even have service agreements with units that specify target values for speed and cost metrics. This is now typical in the case of outsourced employment delivery services. Frequently there is a natural tension between the quality of selection programs and the speed and cost of delivering them. Worsening employment market conditions may drive per-hire speed down and cost up. Changes in business conditions may alter the urgency with which vacant positions must be filled. Managers’ satisfaction with new hires may drop due to changing job requirements or conditions. Any number of variable factors such as these can create circumstances in which there is pressure to rebalance the existing combination of quality, speed, and cost. This is a dynamic tension, and how that tension is managed can have a significant impact on the sustainability of selection programs. The first step toward effectively managing this tension is to determine who “owns” the conflicting interests. In most cases, the answer is that the employing unit is the ultimate owner of the interests in both selection quality and selection process management. Of course, other stakeholders such as corporate HR leaders, employment lawyers, and compliance leaders may have an interest as well. The second step is to determine how the owner’s interests are assessed. What is the current cost per hire and how and why has it changed? What are the current turnover rates, what employee behaviors do managers value, what are the new-hire failure rates, what are the sales success rates, and so on? Frequently, by virtue of their focus on process management, delivery functions have established performance metrics that continuously track speed and cost metrics and factors that cause them to change. 
In sharp contrast, developers of selection programs often do not track quality indicators such as turnover, performance levels, and success factors on a continuous basis. One reason is that employee quality and behavior data are often difficult to obtain, and developers typically spend the effort to gather them only in the context of ad hoc validation studies. Another, perhaps more fundamental, reason is that the quality of selection programs is not viewed in most cases as highly variable across short (months) or even moderate (few years) time intervals. A third, more subtle reason may be that selection developers are generally conservative about the “validator’s risk” (M. Tenopyr, personal communication, August 27, 1988). The validator’s risk is the gamble selection developers take with local validation studies that any particular study may not support a conclusion of validity. In countries where the regulation of employment selection procedures hinges on validation evidence, selection developers view validation evidence in a legal context in addition to the organization context. The validator’s risk combined with the legal context often results in developers being conservative about the conduct of local validity studies. Especially for existing operational selection procedures, it is unusual for developers to have a continuous validation process in place. Once an initial validity rationale supports the implementation of a selection procedure, ongoing local validation efforts


Jerard F. Kehoe et al. represent, to some extent, ongoing legal risk. A major exception to this is when the developer is a large consulting house with a selection process or procedure implemented in many client organizations. In this case, the developer may have virtually continuous validation efforts underway within and across client organizations. This ongoing, large-scale validation strategy tends to minimize the validator’s risk, maximize defensibility, respond to local client desires for impact measures, and provide external marketing documentation. The net result of these factors is that delivery functions are more likely than development functions to have recent and continuous assessments of process metrics of interest to the owner. Independent of any other considerations, the ready availability of speed and cost metrics compared to quality metrics can cause speed and cost metrics to be given more weight in the process of rebalancing quality with speed and cost. Given the availability of information about quality, speed, and cost, the third step is to determine the decision process(es) by which the current pressure to rebalance interests is resolved. One efficient approach to these decisions is to distinguish between two types of decision process. Type 1 is an administrative process designed to handle routine minor or temporary fluctuations without directly involving the ultimate owner of the competing interests. Policies and practices can be established with the unit leader’s concurrence to resolve such pressures. For example, temporary business conditions that increase the urgency with which a vacant position must be filled might be routinely addressed by an administrative approval process for authorizing a temporary change in the selection standards. The key features of this first stage are that it is an established process the unit has endorsed and that the developer and/or deliverer manage the process on behalf of the unit’s interests. Type 2 is reserved for situations in which the pressure to rebalance is greater in scope, more important, less routine, and/or has longer-term implications. The key difference from Type 1 is that, for Type 2, the unit owner is directly involved in the rebalancing decision. In Type 2, the roles of the developer and deliverer are to provide information and recommendations to the business owner/decision maker about the competing factors and to describe the methods and implications of changes to those factors as well as constraints on what is possible. The underlying principle of this approach is that, above some threshold of importance, the accountability for balancing competing interests of quality, speed, and cost lies with the ultimate owner of the selection outcomes. One of the greatest risks to a selection program’s sustainability is the disengagement of the ultimate organization owner from key decisions that impact the value and speed/cost of the selection program for the owner. An important secondary benefit of an owner-based decision process for rebalancing competing interests is that it occasionally re-engages the owner with the accumulated information and decisions that give direction to selection programs and that ultimately impact the owner’s success.

SELECTION SYSTEM GOVERNANCE Some amount of governance is inevitable for any organization process that affects the outcomes of the people in the organization. At a minimum, governance of selection processes serves to promote efficiency, fairness, accountability, and compliance with legal regulations and corporate mandates. Beyond that, governance can enable more strategic objectives such as promoting employee effectiveness and contributions to the organization, facilitating the integration of multiple related processes and organizational units, and sustaining an effective organization culture. Governance of selection processes can be narrow or broad. Narrow governance often focuses on legal/regulatory compliance and may take the form of oversight by an employment attorney or HR policies defining and limiting the use of assessment results. Broader governance can address a much wider range of issues such as the fit between selection practices and an organization’s culture, rules relating to the use of test scores (Tippins, 2002), the role of local managers and HR staff in supporting or participating in selection processes, metrics for managing selection, and corporate and local authority over the selection processes. In general, two layers of governance are common: guiding principles and policy requirements. Guiding principles inform various decisions about the purpose, development, and use of


selection programs. They provide overarching direction that helps align key decisions and actions. Policies dictate the behavior of people and processes. They can be more or less specific but usually provide explicit rules. Both are critical in creating and sustaining selection programs.

Guiding Principles Guiding principles often express the implications of an organization’s cultural values for selection programs. They may not be selection-specific in organizations that have defined and communicated explicit cultural values that are general in nature (e.g., integrity, respect for others, customer focus, teamwork, safety), which may be seen as sufficient to provide overall guidance to all HR practices, including selection programs. Also, selection programs often have a considerable amount of process-specific governance in the form of policies, systems requirements, and well-defined practices, given the virtually universal fairness and legal contexts. Even if guiding principles have a strong influence on the development of such policies and practices, once those policies and practices are implemented, behavior in the selection processes may be constrained to the point that guiding principles may have little operational value. The following list briefly describes examples of guiding principles we have observed in large organizations: 1. People are accountable for selection decisions: This principle fixes the accountability for selection decisions on the people who make hiring decisions, rather than on rules or algorithms that might be built into decision support tools. An implication of this principle is that selection programs should be designed to inform and guide decision makers, not replace them. 2. Choose the best: This principle establishes a deliberately high standard for who is selected. 3. Equal access/opportunity: Many organizations will espouse a guiding value relating to some shared meaning of fairness in the selection process. In cultures that place high value on performance-based results, this principle is unlikely to refer to equal outcomes and is more likely to refer to some other equity principle such as equal access or equal opportunity. 4. Compliance with government regulations: An operating principle endorsing compliance with prevailing laws may seem unnecessary given that legal obligations stand on their own as requirements for selection programs. Nevertheless, organizations that choose to endorse such a principle may do so to set an important tone for all participants in its selection programs. Communicating about the importance of compliance can have a chilling effect on risky behavior. 5. Selection processes are not surrogates for poor performance management: This principle addresses the appropriateness of possible uses of assessment results. Our experience has been that, occasionally, the ready availability of skill/ability/knowledge scores from selection processes leads managers to consider ways in which such scores could be used to facilitate other personnel decisions. This principle would discourage the use of selection-based skill/ability/knowledge scores as surrogates for corrupted performance evaluations. 6. Selection assessments benefit the individual as well as the organization: Organizations that embrace an explicit commitment to act in the interests of employees and, even, external applicants may endorse some form of principle that selection assessment results should benefit the people who are assessed. This can be a strong principle that leads to assessment feedback, assessment preparation information, and assessment-based development feedback that might not be provided otherwise.

In summary, guiding principles are intended to provide values-based guidance to the development and ongoing administration of selection programs as well as to the individuals who make selection decisions. Also, where organizations may require more flexibility in the way in which selection processes are used, a reduced emphasis on constraining policies and a greater emphasis on guiding principles may facilitate the needed flexibility.

Selection Policy In contrast to guiding principles, selection policies are prescriptive. They define authority and roles, establish rules and requirements, and set limits and boundary conditions. Because policies


Jerard F. Kehoe et al. have a direct and explicit impact on the behavior of virtually all participants in the selection process, they often are seen as the “face” of the selection program. They are some of the most tangible aspects of a selection program. Selection policies also are perhaps the best indication of the ownership of the selection program. Because policy establishes authority, policy ownership is the clearest indicator of selection program ownership. It is likely that selection programs are among the most policy-driven of all HR programs and practices. There are three primary causes. First, in many countries, national and/or local laws regulate employment selection. Second, employment selection is a high-stakes process. Employment decisions have important consequences for people on both sides of the decision. People care a lot about employment decisions, and their interests may sometimes conflict. Policies are often used to balance these sometimes conflicting interests. Third, employment selection is about a scarce, but valuable, resource. A qualified person selected into one job by one unit is no longer available to be selected by other units for other jobs. Many organizations have found that policy is required to govern managers’ access to candidates and candidates’ access to jobs. A starting point for this discussion of selection policy is that it is based on at least two layers of authority. One layer is the meta-authority to establish the policy; the other layer is the operational authority(ies) established by the policy. For example, a policy issued by the meta-authority, say, for example, the Senior Vice President of Human Resources, may grant business unit leaders the operational authority to implement and change selection processes within their organizations. It is important for successful programs that it be clear where the meta-authority lies, that is, who the policy maker is who may grant operational authorities to others.

A Taxonomy of Selection Policy Perhaps the best way to describe selection policies is to provide a broad taxonomy with instructive examples in the major cells. A reasonably representative taxonomy of policy content is one that organizes selection policies into four interrelated categories: (1) selection data and results, (2) uses of selection results, (3) access to selection processes, and (4) legal compliance.

Policy About Selection Data and Results This category of policy governs the creation, storage, and access to the formal information used to make selection decisions. The information ranges from resume information (e.g., degrees, previous work history, and demographic information), user-posted online information, to formal assessment scores and results (e.g., score results from tests, interviews, inventories, and past accomplishments). Policies govern who may create the data, the rules by which the data are generated, the place and method by which the data are stored, and access to the data once stored. Policies about who may create or generate the data usually take the form of role definitions and training requirements for the people who administer and score tests, interviews, and other formal assessments as well as the people and/or system processes that search resumes and codify information into more manageable formats. An increasingly important subset of policy regarding selection data and results governs access to these data. The question is, who may have access to a candidate’s selection data? In our experience, this policy consideration varies greatly across different types and sizes of organizations. In many small organizations, formal selection data might be found in HR managers’ files or in local hiring managers’ files where access is governed informally only by local policies, if any, regarding access to managers’ files. In some large organizations where privacy concerns and the legal salience of such data are important, explicit policies may specifically restrict access to selection data on a need-to-know basis. In many organizations, access to selection data is treated with the same protections as are provided for compensation data but with somewhat lesser protections than are provided for employee’s medical records.


Managing Sustainable Selection Programs Data access and use can be a source of great complication and liability for multinational organizations. Many countries and professional associations have differing requirements governing psychologists’/organizations’ use of employee/applicant data. For example, the UK’s Data Protection Act requires that test takers have access to their own test results; the Dutch Association of Psychologists code of ethics requires applicants’ formal consent for the testing organization to provide assessment results; a significant consideration in the U.S. is the extent to which, and the circumstances under which, privacy and confidentiality protections under the Health Insurance Portability and Accountability Act (HIPAA; 1996) apply to selection data. Of course, employers and applicants have an obligation to comply with country-specific laws/ regulations. A complexity here is where multinational companies are recruiting and selecting among applicants who originate from different countries, and where the selection procedure is based in another country. Here, organizations are well-advised to take on-board specialized legal advice in order to ensure compliance with various employment law requirements as they differ among the various countries involved. More detailed information about international legal considerations is provided in this Handbook in Shen et al. (Chapter 29, this volume) and Tison et al. (Chapter 30, this volume). Notwithstanding legal, regulatory, and professional requirements, some organizations may choose to establish a selection data ownership policy that explicitly establishes ownership of selection data. Unless compelled by law/regulation, it is unlikely that an organization would regard the applicant as the “owner” of her selection data for various reasons. However, the organization may establish access rules that protect the interests of applicants to be assured that their selection assessment results are used appropriately and consistent with the information provided to the applicants.

Use of Selection Data The broadest category of selection policy addresses policies relating to the use of selection data and results. These policies cover a broad range of topics, including initial approval to implement the selection process, decisions about specific standards or qualification requirements, and questions of alternative ways of satisfying selection standards. Within this category of selection data uses, a major subcategory consists of the authority(ies) for the decisions that establish the standards for selection decisions. The standards are the rules by which the selection results may be used to inform, influence, or dictate selection decisions. For example, cut scores that determine whether a candidate is qualified or not are standards. Strong selection programs formalize these standards so that they may be authorized and implemented. Typically, the authority to authorize standards is the same as the authority to waive a standard in a particular case or exempt a candidate from having to meet a standard. However, additional policies may be established to provide for a more administratively feasible process of evaluating and authorizing ad hoc waivers and exemptions. If high-ranking managers or executives own implementation approval authority, it may be administratively helpful not to involve these time-pressured executives in all ad hoc requests for waivers or exemptions. In this case, policies may be established that authorize representatives of the executive to evaluate and decide routine waiver or exemption requests. The policies may even provide guidelines to be considered by the authorizer. In contrast to policies authorizing ad hoc waiver and exemption decisions, routine exemptions are usually handled as part of the full set of rules governing the selection program. Routine exemptions refer to known, anticipated conditions under which a candidate is not required to satisfy an ordinary selection requirement. Three types of standard exemptions are common. First, so-called grandfathering exemptions refer to conditions in which a candidate for a particular job has already performed that same job at some satisfactory level of proficiency for a period of time. Grandfathering rules would exempt such candidates if they satisfy the specific conditions laid out by the rules. The most common example of grandfathering applies to incumbents in a job when new or revised selection standards are applied to that job. A second type of standard exemption relies on an equivalency between two different sets of selection standards. For example, a work simulation assessing telephone-based customer


Jerard F. Kehoe et al. handling skills in a sales context may be regarded as assessing the same skills, perhaps at a higher level, as a similar work simulation designed in a service context. An equivalency policy means that candidates who satisfy a particular standard on one selection procedure are treated as having satisfied a particular standard on another selection procedure. The third common example of a standard exemption relies less on an equivalency rationale than on a substitution rationale. For example, a candidate who has demonstrated certain work experience, training, degrees, or other education/training experience may be exempt from having to meet a test standard designed to predict those very accomplishments. In effect, the candidate has accomplished the criterion result the test was designed to predict. Selection programs are less prone to incremental erosion of confidence and credibility to the extent that systematic rationales for exemptions can be anticipated and accounted for in the original application rules and taken out of the hands of local, ad hoc decision makers. A final example is provided of a policy designed to establish authority for an ad hoc decision about the use of selection standards. This example is different from the examples above, which rely on some form of equivalence or substitution rationale. In effect, those rationales are all grounded in the construct-level relevance of one set of standards to another set of standards. In contrast, this example is grounded in what might be called a pragmatic business necessity rationale. The typical situation is one in which there is a regular, “normal” set of selection standards for a particular job. For the sake of this example, assume this normal set of standards is typically satisfied by 20% of the applicants. In all likelihood, this set of standards was chosen, in part, because the selection ratio yielded by these standards enabled the hiring organization to meet its normal hiring needs at an acceptable level of quality, cost, and speed, but business circumstances are always changing. From time to time, the hiring organization may have an urgent need to substantially increase its hiring rate. For example, in The Netherlands mandatory military service was lifted in the 1990s, resulting in thousands of unfilled vacancies. In this case, there can be a compelling rationale based on business necessity to temporarily or permanently reduce the selection standards to achieve the increased hire rate necessary to meet the business need. A policy can be developed to address this special case that allows standards to be temporarily lowered and may even specify certain conditions that must be satisfied to ensure the need is substantial. At root, this authority, like the others described above, owns the responsibility to evaluate the ad hoc tradeoff between the benefits of a faster, easier, less onerous, and possibly fairer-seeming selection process with the potential loss in expected performance among those selected. Regardless of how the policy assigns authority, it is important for these exemption processes to rely on input from the affected business managers about the impact of the tradeoff on their business.
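As a rough illustration of the tradeoff just described, the sketch below (ours, not the chapter's) uses the standard bivariate-normal selection model, in which the expected standardized performance of those selected top-down equals validity times the ordinate at the cut score divided by the selection ratio (the Naylor-Shine relationship), to show what temporarily relaxing the selection ratio from 20% to 40% might cost in expected performance. The validity coefficient and ratios are hypothetical.

```python
# Illustrative sketch (not from the chapter): expected performance loss when a
# selection standard is temporarily lowered. Assumes bivariate-normal predictor and
# criterion and strict top-down selection; validity and selection ratios are hypothetical.
from scipy.stats import norm

def expected_selected_performance(selection_ratio: float, validity: float) -> float:
    """Mean standardized criterion score among selectees under top-down selection."""
    z_cut = norm.ppf(1 - selection_ratio)            # cut score yielding the selection ratio
    return validity * norm.pdf(z_cut) / selection_ratio

validity = 0.40                                       # hypothetical operational validity
for sr in (0.20, 0.40):                               # normal standards vs. temporarily lowered standards
    print(f"Selection ratio {sr:.0%}: expected performance of hires "
          f"= {expected_selected_performance(sr, validity):+.2f} SD")
```

Estimates of this kind are one way to give the approving authority a concrete picture of what is being traded away when standards are relaxed to meet an urgent hiring need.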

Access to the Selection Process A third category of policy considerations addresses candidates’ access to the selection process. Where selection processes are in place, they serve as one of the gateways to desired jobs. Candidates who do not have access to the selection process are effectively excluded from the sought jobs. A typical selection program will have rules or practices defining how candidates have access to the selection process. These might be as simple as scheduling requirements or as complex as having to satisfy a series of prescreening steps, each requiring time and effort. Some of the most common policy considerations for managing access include retest requirements, the ability to complete the assessment processes, physical accessibility, basic qualifications, restrictions placed on incumbents regarding frequency of internal movement, where and when the assessment processes may be completed, what organization resources (e.g., proctors and appropriate space) are required to administer assessment processes, and the number of available vacancies needing to be filled. There are often competing interests with respect to applicants’ access to selection processes. Policies that restrict access often have the direct or indirect effect of increasing the yield rate among the applicants who do have access under those policies. For example, typical retest policies limit applicants’ opportunities to retake selection tests they have previously “failed.” Given


Managing Sustainable Selection Programs that retest policies, by their nature, limit the access of people who do relatively poorly on tests, they are likely to increase the overall yield rate of the selection process. Also, an independent effect of retesting on cognitive tests is that the inherent practice effect of the previous attempt generally increases scores by approximately one-fourth of a standard deviation (Hausknecht, Halpert, Di Paolo, & Moriarty Gerrard; 2007), thus increasing the pass rate among the applicants who retake tests. However, evidence from Lievens, Reeve, and Heggestad (2007) indicated that this score increase introduces measurement and predictive bias that harm criterion validity. Similarly, policies that exclude candidates who do not meet basic qualifications such as education and work experience are, in many cases, more likely to block lower-qualified applicants, thus increasing the overall yield rate. These types of access policies that increase the yield rate will, all else being the same, reduce cost per hire and, possibly, reduce the cycle times for employment processes. On the other hand, policies that facilitate the access of larger numbers of applicants better ensure that enough qualified candidates are available at any point in time. Also, they accommodate the desires of candidates who seek jobs in the organization, thus potentially improving the candidates’ goodwill toward the organization. Also, increased access may reduce recruiting expenses, all else being equal.
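The practice-effect finding cited above can be translated into pass-rate terms with a simple sketch; the cut score is an illustrative assumption, and scores are assumed to be approximately normally distributed.

```python
# Illustrative sketch (not from the chapter): how a retest practice effect of roughly
# 0.25 SD (Hausknecht et al., 2007) can raise the pass rate at a fixed cut score.
# The cut score is hypothetical; scores are assumed standardized and roughly normal.
from scipy.stats import norm

cut = 1.0                       # hypothetical cut score (in SD units)
practice_effect = 0.25          # approximate mean score gain on retest, in SD units

pass_rate_first = 1 - norm.cdf(cut)                      # about 16% pass on the first attempt
pass_rate_retest = 1 - norm.cdf(cut - practice_effect)   # about 23% pass on retest

print(f"First-attempt pass rate: {pass_rate_first:.3f}")
print(f"Retest pass rate:        {pass_rate_retest:.3f}")
```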

Legal Compliance Certain selection policies are directed primarily at the organization’s legal compliance responsibilities. The policies in this category are those that establish authority for monitoring selection results for evidence of prohibited discrimination or use of results, for owning modifications to selection procedures to improve compliance, for the decisions about whether any modifications to selection procedures should be made, for protecting applicants’ private information, and for responding to enforcement agencies’ requests for compliance information. This category of policies also relates to the question of the “official” database of selection results for applicants and employees. Selection data are often formally and informally located in various files, both paper and electronic. Certain selection data, such as hiring manager interview ratings and protocols, are often kept in local HR files or even in hiring manager files. In contrast, other selection data, such as formal assessment results, demographic data, and resume information, are often maintained in corporate or unit HR information system databases. Compliance support policy should specify what the “official” selection database is, how selection data get into that database, how they are organized there, and who is responsible for putting them there. An additional consideration regarding compliance policy is that the compliance issues associated specifically with selection processes are often part of a larger employment and recruiting context. Enforcement agencies may be as interested in recruiting and sourcing methods, resume searching and screening, and an organization’s internal staffing system procedures as they are in detail about selection procedures, data, and decisions. This broader context of compliance issues often involves other roles and organizations beyond the development, validation, and maintenance of selection programs. In this situation of multiple organizations having a role in employment compliance, selection policy is best integrated with compliance policies of multiple organizations. However this integration takes place, it is advantageous to have a clearly established role with overarching authority over responses to enforcement agencies.

Authority and Accountability Alignment Principle A final perspective about selection policy is that the sustainability of a selection program relies on policies that align authority with accountability. As noted above, policies often specify who and where the authority is for making decisions about selection programs. One specific example is the policy that determines who authorizes the selection standards for a particular selection procedure. Suppose a new selection procedure is designed to make hiring decisions for a call center where account representatives resolve problems that customers have about their orders,


Jerard F. Kehoe et al. bills, and payments. The selection procedures consist of a work sample exercise to assess customer handling skills and a cognitive ability test to assess information learning and processing skills. In this example, a policy question is, “Who should have the authority to approve the standards by which these selection procedures are used to make selection decisions?” The standards can take many forms, including pass/fail cut scores, score bands, and methods of combining the work simulation and cognitive test results. The choice of standards will impact the cost and speed of the hiring process, the performance of the new hires, and the degree of impact on protected groups, if any. In determining who should have the authority to approve the final set of standards, the question that should be asked is, “Who has accountability for the outcomes that will be affected by the approved standards?” Commonly, the business leader over the call center operation is likely to have ultimate accountability for the performance of the account representatives. In some organizations, that same business leader might also have ultimate accountability for the supporting employment process and its compliance with prevailing regulations. In this situation, a very strong case can be made that the business leader who is accountable for all of the most important consequences of the selection decisions should have the authority to approve selection standards. This policy would then, presumably, define the role of the designer of the selection system, the manager of the employment process, and the compliance manager as expert resources to the business leader’s decision about the standards. This situation is an example of high alignment between authority and accountability. The point of this subsection is that selection policies contribute to selection system sustainability in various ways, but that a paramount requirement of selection policies is that the authority granted by a policy should be aligned with the accountability for the consequences of the decisions made under the policy. One implication of this alignment principle is that the selection program designer may not have the authority over all selection-relevant policy decisions. In particular, the authority to approve the selection standards that drive key business results is most aligned with the role that “owns” the same business results.
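For readers who want a concrete picture of the "methods of combining the work simulation and cognitive test results" mentioned in the call-center example above, the sketch below contrasts a multiple-hurdle rule with a compensatory composite. The cut scores, weights, and score scales are illustrative assumptions, not recommendations.

```python
# Illustrative sketch (not from the chapter): two common ways of combining the
# call-center example's work sample and cognitive test results into a standard.
# Cut scores, weights, and score scales are hypothetical.

def multiple_hurdle(work_sample: float, cognitive: float) -> bool:
    """Qualify only if each assessment meets its own cut score."""
    return work_sample >= 3.0 and cognitive >= 50.0

def compensatory(work_sample: float, cognitive: float) -> bool:
    """Qualify on a weighted composite; strength on one assessment can offset the other."""
    composite = 0.6 * work_sample / 5.0 + 0.4 * cognitive / 100.0   # rescale both to 0-1
    return composite >= 0.60

candidate = {"work_sample": 4.2, "cognitive": 46.0}
print("Multiple hurdle:", multiple_hurdle(**candidate))   # False: fails the cognitive cut
print("Compensatory:   ", compensatory(**candidate))      # True: strong work sample offsets
```

Which rule is approved, and at what cut points, is exactly the kind of standards decision whose authority the alignment principle assigns to the role accountable for the resulting business outcomes.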

SELECTION PROCESS MANAGEMENT This chapter has considered several layers of sustainability factors, ranging from organization-level considerations of fit and HR strategy to operating principles and policies. This sequence has progressed from general to specific: organization purpose and HR strategy provide general direction for selection programs, whereas operating principles and policies specify increasingly specific characteristics of sustainable selection programs. Process specifications and process management are at the most specific end of this continuum. Process is the layer at which the most specific and detailed characteristics of a selection program are defined and managed. It is not the purpose of this chapter to consider all of the possible variations of selection process detail. That variation is as wide as the differences between organizations. Rather, this chapter addresses one component of process specification and management that is becoming an increasingly significant factor in the management of selection programs: the role and application of process metrics.

Process Metrics It is our observation that the growing emphasis in HR management on HR process benchmarking, best practices, plug-in systems, and cross-HR process integration is reaching into the management of selection programs. Clearly, this impetus is coming from trends in the HR management profession and not from any such trends in the personnel selection profession. For selection practitioners, the focus of this trend is significantly different from the selection profession’s historically research-oriented focus on validation, tools, constructs, and predicted outcomes. This change emphasizes processes and metrics as the mechanisms for managing HR work. We will briefly discuss here the impact this trend is having on the management of


Managing Sustainable Selection Programs selection programs and will offer suggestions about strategies for sustaining selection programs in this changing context. The distinction between the transaction management work of employment process management and the “knowledge management” work of the development and validation of selection programs is important. Like other HR-oriented “knowledge” work (e.g., compensation and labor relations), the development and validation of selection programs has historically been managed as an expertise, not a process. In general, the performance standards for these types of work have been imprecise and general. Typically, the evaluation of a selection developer’s “expert” performance in the development of new selection procedures does not rely on quantified metrics describing the development process. Increasingly, the focus on process management has invited the “customers” of employment processes—hiring managers and business leaders—to require metrics of the employment process as the means by which they evaluate the quality of those services. Common employment process metrics include (a) cycle time measures such as time from requisition to hire and time between employment process events; (b) flow rates through each step in the employment process (e.g., the rate at which people who schedule an employment office interview actually show up, complete the interview, and move on to the next event); and (c) various cost measures such as cost per hire, cost per candidate, or cost per event such as cost per assessment test or per interview. Clearly, these process-oriented metrics are affected by the selection procedures and standards produced by the selection developer, which may be seen as a root cause of satisfactory or unsatisfactory process metrics. Beyond these most typical metrics, additional metrics may be included to capture information about the quality of the selected employees. The two most frequent examples of quality-of-hire metrics are early survival rates (e.g., 3-, 6-, and 12-month survival) and hiring manager (i.e., customer) ratings of early overall satisfaction with the new hires. However, a fundamental problem is that the process options available to employment process managers may have little effect on quality-of-hire metrics. Indeed, options such as enhanced job preview processes and more targeted recruiting practices, which may offer some improvement in quality-of-hire metrics, may do so at a higher cost. We suggest an approach here that may be helpful for selection program managers faced with this challenge that employment process metrics are creating new pressure on the sustainability of selection procedures. Essentially, this approach is to reframe the potential value of process metrics, not in terms of research value but in terms of business decision value, and change or supplement the information available to business leaders to help them continuously monitor the benefit of selection procedures and accompanying usage standards. The research perspective tends to view a selection program as a relatively fixed, unchanging manifestation of the basic, stable requirements of job success. The business process perspective views selection programs as organizational processes in the context of real-time business conditions that can change rapidly. These different perspectives have led to very different approaches to the evaluation of selection procedures and employment processes. 
Validation has been regarded as an episodic, occasional event, needed only every several years to confirm that the causal model has not changed (MacIver, Anderson, Costa, & Evers, 2014). Process metrics, by contrast, are gathered continually, enabling process managers to optimize processes as needed. Business managers are not trying to confirm scientific conclusions; they are trying to make business decisions with uncertain data to optimize important outcomes. Our own view of these divergent perspectives is that, although selection developers cannot surrender the importance they attach to validation, they would be wise to become more open to the prescientific value of continuously gathered data about worker behavior, such as the quality-of-hire data gathered by employment process managers. For many reasons, these types of data do not have the information value of worker behavior data gathered in research settings, but they do have value for building a more complete understanding of the possible situational dynamics that affect worker behavior and a deeper understanding of the relationship between worker behavior and the business outcomes that are most important to work managers.
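To make the contrast concrete, the following minimal sketch (in Python) computes a few of the process metrics named above from basic requisition records. The record fields and values are hypothetical illustrations, not a reporting standard this chapter prescribes.

from datetime import date

# Hypothetical requisition records; field names and values are illustrative only.
requisitions = [
    {"opened": date(2016, 1, 4), "hired": date(2016, 2, 1),
     "interviewed": True, "cost": 2400, "survived_90_days": True},
    {"opened": date(2016, 1, 11), "hired": date(2016, 3, 7),
     "interviewed": True, "cost": 3100, "survived_90_days": False},
    {"opened": date(2016, 2, 2), "hired": None,  # unfilled requisition
     "interviewed": False, "cost": 900, "survived_90_days": None},
]

hires = [r for r in requisitions if r["hired"] is not None]

# (a) Cycle time: average days from requisition to hire.
cycle_time = sum((r["hired"] - r["opened"]).days for r in hires) / len(hires)

# (b) Flow rate: proportion of requisitions that reached the interview stage.
interview_rate = sum(r["interviewed"] for r in requisitions) / len(requisitions)

# (c) Cost per hire, plus a simple quality-of-hire indicator (90-day survival).
cost_per_hire = sum(r["cost"] for r in requisitions) / len(hires)
survival_rate = sum(r["survived_90_days"] for r in hires) / len(hires)

print(f"Cycle time: {cycle_time:.1f} days; interview rate: {interview_rate:.0%}")
print(f"Cost per hire: ${cost_per_hire:,.0f}; 90-day survival: {survival_rate:.0%}")

Nothing in this sketch requires a research design; that is precisely why such metrics appeal to process managers even though they lack the inferential value of validation data.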



CONCLUSIONS

This chapter describes the organizational considerations that directly influence the sustainability of selection programs. The four overarching categories of these organizational considerations are organization purpose, HR strategy, governance, and process management. Beyond the technical considerations of validity, utility, bias, and fairness, we make the case that these organizational considerations are critical in designing and implementing a selection program. To the extent that purpose, strategy, governance, and process are deliberately incorporated into the design of the selection program, the success of that program is better ensured. Inattention to these organizational considerations can undermine the sustainability of a selection program despite its validity.

We also note here that much of this chapter has been written from experience more than research. The sustainability of selection programs warrants more research attention than has been given in the past. Psychometric concerns are critical, but any organization that neglects sustainability does so at its own peril and likely will find, in due course, that the psychometric integrity of its selection procedures is inevitably compromised.

REFERENCES

Anderson, N., Lievens, F., van Dam, K., & Ryan, A. M. (2004). Future perspectives on employee selection: Key directions for future research and practice. Applied Psychology: An International Review, 53, 487–501.
Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494.
Bonache, J., Brewster, C., & Suutari, V. (2001). Expatriation: A developing research agenda. Thunderbird International Business Review, 43, 3–20.
Burch, G. S. J., & Anderson, N. (2004). Measuring person-team fit: Development and validation of the team selection inventory. Journal of Managerial Psychology, 19, 406–426.
Cerdin, J. L., & Brewster, C. (2014). Talent management and expatriation: Bridging two streams of research and practice. Journal of World Business, 49(2), 245–252.
Chatman, J. A. (1991). Matching people and organizations: Selection and socialization in public accounting firms. Administrative Science Quarterly, 36, 459–484.
Colakoglu, S., Tarique, I., & Caligiuri, P. (2009). Towards a conceptual framework for the relationship between subsidiary staffing strategy and subsidiary performance. The International Journal of Human Resource Management, 20(6), 1291–1308.
DiMaggio, P. J., & Powell, W. W. (1991). Introduction. In W. W. Powell & P. J. DiMaggio (Eds.), The new institutionalism in organizational analysis (pp. 1–38). Chicago: University of Chicago Press.
Dowling, P. J., & Welch, D. E. (2004). International human resource management: Managing people in a multinational context (4th ed.). London, England: Thomson Learning.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence Erlbaum Associates.
Harris, H., & Brewster, C. (1999). The coffee-machine system: How international selection really works. International Journal of Human Resource Management, 10, 488–500.
Harzing, A. W. (2001). Who's in charge? An empirical study of executive staffing practices in foreign subsidiaries. Human Resource Management, 40(2), 139–158.
Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty Gerrard, M. O. (2007). Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology, 92, 373–385.
Health Insurance Portability and Accountability Act of 1996. Public Law No. 104–191, 110 Stat. 1936.
Herriot, P., & Anderson, N. (1997). Selecting for change: How will personnel and selection psychology survive? In N. Anderson & P. Herriot (Eds.), International handbook of selection and assessment (pp. 1–38). London, England: Wiley.
Kehoe, J. F., Brown, S., & Hoffman, C. C. (2012). The life cycle of successful selection programs. In N. Schmitt (Ed.), The Oxford handbook of personnel assessment and selection (pp. 903–938). Oxford, England: Oxford University Press.
Klehe, U. C. (2004). Choosing how to choose: Institutional pressures affecting the adoption of personnel selection procedures. International Journal of Selection and Assessment, 12, 327–342.
LePine, J. A., Hollenbeck, J. R., Ilgen, D. R., & Hedlund, J. (1997). Effects of individual differences on the performance of hierarchical decision-making teams: Much more than g. Journal of Applied Psychology, 82, 803–811.
Lievens, F., Reeve, C. L., & Heggestad, E. D. (2007). An examination of psychometric bias due to retesting on cognitive ability tests in selection settings. Journal of Applied Psychology, 92, 1672–1682.
MacIver, R., Anderson, N. R., Costa, A. C., & Evers, A. (2014). Validity of interpretation: A user validity perspective beyond the test score. International Journal of Selection and Assessment, 22, 149–164.
Mendenhall, M., & Oddou, G. (1985). The dimensions of expatriate acculturation: A review. The Academy of Management Review, 10, 39–47.
Mol, S. T. (2007). Crossing borders with personnel selection: From expatriates to multicultural teams. Unpublished dissertation, Rotterdam, the Netherlands: Erasmus University.
Morgeson, F. P., Reider, M. H., & Campion, M. A. (2005). Selecting individuals in team settings: The importance of social skills, personality characteristics, and teamwork knowledge. Personnel Psychology, 58, 583–611.
Mumford, T. V., Van Iddekinge, C. H., Morgeson, F. P., & Campion, M. A. (2008). The team role test: Development and validation of a team role knowledge situational judgment test. Journal of Applied Psychology, 93, 250–267.
Nikolaou, I., Anderson, N. R., & Salgado, J. F. (2012). Advances in selection and assessment in Europe. International Journal of Selection and Assessment, 20, 381–384.
Oliver, C. (1997). Sustainable competitive advantage: Combining institutional and resource-based views. Strategic Management Journal, 18, 697–713.
Ostroff, C., & Rothausen, T. J. (1996). Selection and job matching. In D. Lewin, D. J. B. Mitchell, & M. A. Zaidi (Eds.), Human resource management handbook (pp. 3–52). Greenwich, CT: JAI Press.
Ployhart, R. E. (2004). Organizational staffing: A multilevel review, synthesis, and model. In G. R. Ferris & J. Martocchio (Eds.), Research in personnel and human resource management (Vol. 23, pp. 121–176). Oxford, England: Elsevier.
Ployhart, R. E., & Schneider, B. (2002). A multi-level perspective on personnel selection research and practice: Implications for selection system design, assessment, and construct validation. In F. J. Yammarino & F. Dansereau (Eds.), The many faces of multi-level issues: Research in multi-level issues (Vol. 1, pp. 95–140). Oxford, England: Elsevier.
Ployhart, R. E., & Schneider, B. (2005). Multilevel selection and prediction: Theories, methods, and models. In A. Evers, N. Anderson, & O. Voskuijl (Eds.), The Blackwell handbook of personnel selection (pp. 495–516). Oxford, England: Blackwell.
Potocnik, K., Anderson, N. R., & Latorre, F. (2015). Selecting for innovation: Methods of assessment and the criterion problem. In I. Nikolaou & J. K. Oostrom (Eds.), Employee recruitment, selection, and assessment: Contemporary issues for theory and practice (pp. 209–227). London: Psychology Press and Routledge Academic.
Roe, R. A. (2005). The design of selection systems—Context, principles, issues. In A. Evers, N. Anderson, & O. Voskuijl (Eds.), The Blackwell handbook of personnel selection (pp. 73–97). Oxford, England: Blackwell.
Scherbaum, C. A. (2005). Synthetic validity: Past, present, and future. Personnel Psychology, 58(2), 481–515.
Schneider, B., Smith, D. B., & Sipe, W. P. (2000). Personnel selection psychology: Multilevel considerations. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 91–120). San Francisco, CA: Jossey-Bass.
Scott, W. R. (1995). Institutions and organizations. Thousand Oaks, CA: Sage.
Steel, P. D. G., Huffcutt, A. I., & Kammeyer-Mueller, J. (2006). From the work one knows the worker: A systematic review of the challenges, solutions, and steps to creating synthetic validity. International Journal of Selection and Assessment, 14(1), 16–36.
Stevens, M. J., & Campion, M. A. (1994). The knowledge, skill and ability requirements for teamwork: Implications for human resource management. Journal of Management, 20, 503–530.
Tippins, N. T. (2002). Issues in implementing large-scale selection programs. In J. W. Hedge & E. D. Pulakos (Eds.), Implementing organization interventions: Steps, processes, and best practices (pp. 232–269). San Francisco, CA: Jossey-Bass.
Tippins, N. T. (2012). Implementation issues in employee selection testing. In N. Schmitt (Ed.), The Oxford handbook of personnel assessment and selection (pp. 881–902). Oxford, England: Oxford University Press.
Welch, D. (1994). Determinants of international human resource management approaches and activities: A suggested framework. Journal of Management Studies, 31, 139–164.


10
THE BUSINESS VALUE OF EMPLOYEE SELECTION
WAYNE F. CASCIO AND JOHN C. SCOTT

Hiring good people is hard. Hiring great people is brutally hard. And yet nothing matters more in winning than in getting the right people on the field. All the clever strategies and advanced technologies in the world are nowhere near as effective without great people to put them to work.
Jack Welch, Winning (2005, p. 81)

Industrial and organizational (I-O) psychologists have played major roles in developing selection (or staffing) tools and implementing selection programs at every level for organizations of every size and in every industry, domestic and multinational. This chapter focuses on evaluating, monitoring, and managing the business value of employee selection. We begin by offering some general comments about the traditional model of employee selection, or staffing, its focus, and its components, with particular emphasis on selection as a dynamic organizational process, the expectations of multiple stakeholders, and the need to link employee-selection goals to business imperatives. Following that discussion, we present a decision-based framework that illustrates the logic of employee selection, with particular emphasis on assessing the outcomes of selection efforts. Such outcomes may be expressed in qualitative or quantitative terms, and we illustrate both. A key focus of the chapter is to use evaluation strategically to drive effective selection programs. We then consider what managers know about employee selection, the different perspectives of I-O psychologists and managers, and what both groups should know about the value of employee selection, including technology-enhanced assessments and multimedia, immersive simulations. We conclude with a set of recommendations for managing and monitoring the business value of employee selection, including tradeoffs among managerial concerns for “better, faster, cheaper, with less adverse impact.”

TRADITIONAL MODEL OF EMPLOYEE SELECTION

I-O psychologists have developed a general approach to employee selection that has evolved over many decades (Cascio & Aguinis, 2011). Essentially, it consists of defining the work to be done, identifying individual-level characteristics that are hypothesized to predict performance with respect to the work to be done, and developing measurement instruments to assess the relative standing of job applicants on each of the individual-level characteristics (Binning & Barrett, 1989). Then applicants are rank-ordered based on their relative standing, and those with the best scores are selected for the job.


Over time, various instruments to predict future job performance have appeared, and such instruments are quite diverse, as noted in a recent review (Cascio & Aguinis, 2008a). In general, they assess information collected directly from job applicants or indirectly from other sources (e.g., past employers). Some types of measures are typically used at the beginning stages of the selection process as prescreening devices. A set of measures that is consistent with the current staffing model (Schmidt & Hunter, 1998) may include biographical data collected using application blanks (Roberts, 2011; Schmitt et al., 2007), integrity tests (Berry, Sackett, & Wiemann, 2007; Rotundo & Spector, 2017), and drug testing (Haar & Spell, 2007; Weber, 2015). Those job applicants who successfully complete the initial screening stage may be required to pass a background check (Zumbrun, 2015) and, if they do, they may be given paper-and-pencil or computer-administered tests that assess their general mental abilities (GMAs) and personality traits (Hough & Dilchert, in press; Morgeson et al., 2007; Ones, Dilchert, Viswesvaran, & Salgado, 2017), followed by an interview (Chapman & Zweig, 2005; Huffcutt & Culbertson, 2011). Finally, for managerial and other high-level jobs, there may be an additional stage, including a work-sample test (Callinan & Robertson, 2000; Roth, Bobko, & McFarland, 2005) or an assessment center (Arthur & Day, 2011; Thornton, Johnson, & Church, 2017), in which applicants must demonstrate specific knowledge and skills by performing a limited number of job-related tasks in a controlled environment. With respect to work-sample tests, we do not mean to imply that they are appropriate only for high-level jobs. They can also be extremely valid predictors of performance in other jobs, such as craft jobs (electricians, plumbers, mechanics) and customer service jobs (Cascio & Aguinis, 2011).

In the past, organizations have benefited from this traditional approach of "pick good people to get good performance," but the changing workplace is dramatically redefining personnel selection (Cascio, 2016; Economist Intelligence Unit, 2014). The next section describes some of the challenges to the business value of that approach.
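Before turning to those challenges, here is a minimal sketch of the sequential, multi-hurdle character of the traditional sequence just described. The stage names echo the instruments above, but the pass rates and applicant counts are invented for illustration and are not estimates drawn from the cited studies.

# Hypothetical multi-hurdle screening sequence; pass rates are illustrative only.
stages = [
    ("Biodata, integrity, and drug-test prescreen", 0.60),
    ("Background check", 0.90),
    ("GMA and personality testing", 0.50),
    ("Structured interview", 0.40),
    ("Work sample or assessment center", 0.50),
]

applicants = 1000
for stage_name, pass_rate in stages:
    survivors = round(applicants * pass_rate)
    print(f"{stage_name}: {applicants} -> {survivors}")
    applicants = survivors

print(f"Candidates remaining for offers: {applicants}")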

CHALLENGES TO THE BUSINESS VALUE OF THE TRADITIONAL APPROACH

The following list describes seven such challenges:

1. Past behavior may not always predict future behavior (behavioral consistency), particularly if the new job differs in the types of personal characteristics necessary for successful performance. Past behavior that is relevant to future performance may predict that performance effectively.
2. Selection decisions about people and jobs are not independent events in an organization. Indeed, the broader business value of selection is often linked to other human resources (HR) processes, such as training, promotion, special assignments, staff reductions, career development, and succession planning.
3. Hiring managers do not always hire the best scorers. Validated selection techniques are rarely the only source of information for selection decision making.
4. Jobs are changing faster than we can do validation studies.
5. Assessing the business value of selection is complex, because different constituents—managers, applicants, HR professionals, and those who implement selection systems—value different outcomes.
6. The social context and social psychological processes of selection decisions are often ignored in the traditional approach. Interpersonal processes in group decision making are extremely important to the implementation of selection systems. For example, a particular decision maker's position power, influence, and interpersonal attraction to another person may be important to understand in selecting employees.
7. Utility calculations that estimate economic returns on investments for valid selection techniques are not widely accepted or understood by business managers. Managers often do not believe the magnitude of the estimated returns because of their size and also because of the use of complex formulas with too many estimates and assumptions. To many, the dollar returns associated with improved performance are not "tangible," and certainly less so than the dollars in one's departmental budget. All of this suggests that few organizations, if any, view the costs related to selection as investments; rather, they consider them as expenses. Beyond that, validity coefficients of equal size, say, .35, are not necessarily equally valuable to decision makers if they reference different criteria. A sales manager, for example, may or may not view a validity of .35 for predicting organizational citizenship behaviors as equal in value to a validity of .35 for predicting the dollar volume of sales.1

The specific criteria used to establish the business value of a selection program will vary by organization. At a minimum, however, the effectiveness of any selection program can be judged by how well it (a) aligns with business strategy, (b) adapts to dynamic workforce requirements, (c) integrates with other talent-management systems, (d) meets the expectations of multiple constituents (e.g., leadership, hiring managers, HR, candidates), (e) conforms to operational requirements (e.g., validity, efficiency), and (f) contributes to valued organizational outcomes (e.g., productivity, sales, service, quality, revenue).

DYNAMIC, CONTEMPORARY APPROACH TO SELECTION AS AN ORGANIZATIONAL PROCESS

This section presents a broader framework for understanding and valuing selection as an organizational process. Rather than considering selection as an independent event whose sole purpose is to identify the best people to perform specific jobs, we propose a broader, macro approach to the business "value added" of the selection process. This contemporary approach integrates the traditional approach as a "good start" for designing selection systems (the use of validated selection tools), but certainly not "the end." We begin our discussion by examining four contemporary drivers that frame selection as a dynamic organizational process:

1. Dynamic change and change management
2. Expectations of multiple organizational stakeholders
3. Selection beyond hiring and HR management
4. The importance of social context and interpersonal processes in selection decisions

Dynamic Change and Change Management

Selection procedures need to be flexible and adaptable to changing organizations. Significant future human capital challenges will be recruiting, staffing, and retention (Economist Intelligence Unit, 2014; Groysberg & Connolly, 2015). Previously we discussed the speed of organizational change. Numerous authors have cited shifting demographics, rapid changes in technology, higher expectations from customers, increased competition and globalization, and more intense pressure from shareholders as the drivers of change with the greatest impact on people and jobs (Cascio & Aguinis, 2008a; Schatsky & Schwartz, 2015). The message is clear: the traditional, static model of selection needs to be "reinvented" or "reengineered" to select people to perform changing jobs in changing organizations. No job or career today is "safe and secure." The value of selection for an organization is predicated on how people perform in the context of changing organizations. Several authors have presented models of job-person, team-person, and organization-person assessments (Anderson, Lievens, van Dam, & Ryan, 2004; Cascio & Aguinis, 2008b; Pearlman & Barney, 2000).

Organizations have merged, acquired, downsized, and reorganized to become more flexible, adaptable, efficient, and high performing. The impact of change on talent acquisition is twofold: (a) new jobs are created and old jobs are redefined, enriched, or eliminated; and (b) people are recruited, selected, developed, or eliminated. Pearlman and Barney (2000) noted some significant outcomes of these changes for selection processes:

• Increased use of performance competencies (variables related to overall organizational fit, as well as personality characteristics consistent with the organization's vision) (Brannick, Pearlman, & Sanchez, 2017; Schippmann, 2010; Schippmann et al., 2000; Weber & Dwoskin, 2014)
• The value placed on intellectual capital and learning organizations
• The value of speed, process improvement, and customer service

They offered a contemporary model of work performance with two distinguishable components: (a) task performance of a specific job and (b) contextual performance—performance related to organizational and social performance activities. Contextual performance includes three levels of analysis: external, organizational, and the immediate work or job context. Table 10.1 describes their model of worker-attribute categories needed to predict success beyond job performance per se.

TABLE 10.1 Definitions and Examples of Work-Performance Model Worker-Attribute Categories

Aptitude and abilities
  Definition: Capacity to perform particular classes or categories of mental and physical functions
  Examples: Cognitive, spatial/perceptual, psychomotor, sensory, and physical abilities

Workplace basic skills (a)
  Definition: Fundamental developed abilities that are required to at least some degree in virtually all jobs
  Examples: Reading, writing, and arithmetic or computational skills

Cross-functional skills
  Definition: Various types of developed generic skills that are related to the performance of broad categories of work activity and that tend to occur across relatively wide ranges of jobs
  Examples: Oral communication, problem analysis, interpersonal skills, negotiating, information gathering, organizing, planning, and teamwork skills

Occupation-specific skills
  Definition: Developed ability to perform work activities that occur across relatively narrow ranges of jobs or are defined in relatively job- or activity-specific terms
  Examples: Ability to read blueprints, to repair electrical appliances, to operate a milling machine, to operate a forklift, to do word processing

Occupation-specific knowledge
  Definition: Understanding or familiarity with the facts, principles, processes, methods, or techniques related to a particular subject area, discipline, trade, science, or art; includes language proficiency
  Examples: Knowledge of financial planning and analysis, fire-protection systems, computer graphics, data communication networks, patent law, Spanish, COBOL, spreadsheet software

Personal qualities (also known as personality traits, temperaments, or dispositions)
  Definition: An individual's characteristic, habitual, or typical manner of thinking, feeling, behaving, or responding with respect to self and others, situations, or events
  Examples: Adaptability, empathy, conscientiousness, self-esteem, autonomy, sociability, service orientation, emotional stability, integrity, honesty

Values
  Definition: Goals, beliefs, or ideals an individual holds as important and that function as the standards or criteria by which he or she evaluates things
  Examples: Empowerment, cooperation, achievement, initiative, work ethic

Interests
  Definition: An individual's characteristic work-related preferences or likes and dislikes regarding specific (or classes of) work activities
  Examples: Realistic, investigative, artistic, social, enterprising, and conventional

(a) Workplace basic skills are differentiated from aptitudes and abilities because of their significant knowledge and learning components.

Source: From Pearlman, K., & Barney, M., Selection for a changing workplace, in J. Kehoe (Ed.), Managing selection in changing organizations: Human resource strategies, Jossey-Bass, San Francisco, CA, 2000. Used with permission.

The key challenge in predicting performance at any level is that our current selection methods have demonstrated limited usefulness, despite 80 years of staffing research. Limitations of the current approach include the following (Cascio & Aguinis, 2008a): a near exclusive focus at the level of the individual, the assumption of behavioral consistency, a focus on thin slices of behavior and behavior that may not be representative of actual performance on a job, selection systems that produce high levels of adverse impact, overestimation of expected economic payoffs from the use of valid selection procedures, and limited applicability of the traditional model when applied to executives and expatriates.

Although many existing selection methods perform well, we believe that if selection strategies are to be more useful in response to rapidly changing organizations, then we need to broaden our perspectives of the relevant criterion space and criterion constructs, from a focus on predicting task performance per se, to in situ performance (Cascio & Aguinis, 2008a). In situ performance reflects the broad range of effects—situational, contextual, strategic, and environmental—that may affect individual, team, or organizational performance. Such specification provides a richer, fuller, context-embedded description of the criterion space that we wish to predict. As just one example, the construct of adaptability (Pulakos, Arad, Donovan, & Plamondon, 2000) may well be a key predictor of success in a rapidly changing organization.

A related consideration for utility researchers is the meaning of the parameter t in the general utility equation (see Cascio & Boudreau, 2011a, 2011b). Traditionally, that parameter represents the average tenure of individuals in a given job. Perhaps a more nuanced view is to define t as the length of time that the constructs measured by the current selection system remain relevant. The faster that jobs and organizations change, the lower the value of t.
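For readers who want the algebra behind this point, one standard textbook statement of the general utility equation (the Brogden-Cronbach-Gleser model, as presented in sources such as Cascio and Boudreau's work on utility analysis) is the following; it is offered here as a common reference form rather than as this chapter's own derivation:

\Delta U = N_s \, T \, r_{xy} \, SD_y \, \bar{Z}_x - N \, C

Here \Delta U is the gain in dollar-valued performance, N_s is the number of candidates selected, T is the average tenure of those selected (the parameter t discussed above), r_{xy} is the validity coefficient, SD_y is the standard deviation of job performance in dollars, \bar{Z}_x is the average standardized predictor score of those selected, N is the number of applicants assessed, and C is the assessment cost per applicant. Redefining T as the period during which the measured constructs remain relevant leaves the equation intact but typically shrinks the estimated gain.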

Expectations of Multiple Organizational Stakeholders

Since a selection program affects the business in so many ways, some customers and stakeholders will have unique expectations of its value and will generally want to have input into its ultimate direction. Therefore, these stakeholder groups should be carefully identified and solicited for input as part of the planning and implementation effort (Jayne & Rauschenberger, 2000; Scott, Rogelberg, & Mattson, 2010). These various constituents of employee selection have similar, but sometimes competing, values and expectations. Balancing these competing needs is critical to implementing successful selection systems. The following sections discuss some typical categories of stakeholders whose input and perspective should be considered.

Executive Team

This group's primary focus is strategic and financial; as such, its members will want assurances that the selection program will deliver a high-performing workforce that can drive revenues, shareholder value, growth, competitive advantage, and long-term sustainability. The selection program will need to align with long-term business strategy and also address more immediate issues such as retention, diversity, employee engagement, and the creation of a robust talent pipeline.

Line Managers

This group is responsible for implementing business strategy, and therefore will heavily rely on practices that can support their talent management accountabilities, particularly the acquisition of top talent. Line managers value benchmarking evidence about the "best" selection systems, administrative efficiency (cycle time to fill a position), process metrics (costs and results), and process reliability to meet the needs of different organizational units (Jayne & Rauschenberger, 2000). These individuals should therefore play a critical stakeholder role in the evaluation and implementation of a selection program.



Selection-Program Managers

HR typically manages an organization's selection program. HR managers are particularly concerned that selection systems are aligned with diversity and affirmative action goals, meet stakeholder needs, integrate with other talent management systems, and run as efficiently as possible.

Job Candidates

Candidates value administrative efficiency, company reputation, relationship of pre-hire assessments to the job, fairness of the process, and quality of the information received about the job.

Solicitation of input from the stakeholder groups listed above should occur both during the planning phase and on an ongoing basis following the implementation of the selection program. As such, it is critical to clarify the business challenges and key strategic priorities that the selection program is attempting to address. It will be particularly critical to gather input early on from the executive stakeholders to ensure that the employee-selection goals link to the organization's business imperatives. Once an organization's priorities have been established, a plan can be developed for how the selection program can best be leveraged to advance the business strategy. We will present specific strategies for establishing the goals, and action steps to link the selection program to these goals, later in the chapter, in the section entitled Strategic Use of Evaluation to Drive Selection-Program Effectiveness.

Selection: Beyond Hiring and HR Management

Employee selection is more complex than hiring a qualified employee to perform a particular job. Higgs, Papper, and Carr (2000) emphasized the important point that selection processes and techniques are often keys to the effective execution of other HR processes. Table 10.2, adapted from Higgs et al. (2000), describes how other HR processes depend on selection. As stated earlier, selection is a key component in the overall life cycle of an individual's employment with an organization. That life cycle includes changing jobs and changing people.

TABLE 10.2 HR Processes That Depend Upon Selection

Hiring: Multiple-stage process using various techniques and types of information for mutual selection decision by organization and candidate
Training: Selection for participation in particular training programs
Performance management: Selection for effective performance in a given assignment or role
Promotion: Selection for limited promotional opportunities or for job families or job levels with limited population sizes
Special assignments: Selection for assignments to task forces, committees, special projects
Career development: Selection for development processes, programs, or mentors
Succession planning: Selection for inclusion in replacement-planning or succession-planning databases or management-planning sessions

Source: Adapted from Higgs, A. C., Papper, E. M., & Carr, L. S., Integrating selection with other organizational processes and systems, in J. Kehoe (Ed.), Managing selection in changing organizations: Human resource strategies, Jossey-Bass, San Francisco, CA, 2000. Used with permission.



Importance of Social Context and Interpersonal Processes in Selection Decisions

Several authors (Cascio & Aguinis, 2008a, 2008b; Ramsay & Scholarios, 1999) have challenged the traditional I-O psychology selection process for being too micro in its orientation and for failing to integrate the social context and interpersonal processes into selection decisions. Beyond the individual differences of job applicants, these authors argue (and we agree) that the cognitive processes of key decision makers, organizational characteristics, strategic goals, group processes, and contextual factors constrain and shape a manager's actual staffing decisions. In contrast, the traditional psychometric paradigm of selection necessarily assumes that (a) effective (and ineffective) performance in most jobs can be reduced to relatively stable, observable behaviors and static job demands; (b) intra- and inter-individual differences in human capacities (knowledge, skills, abilities, and other characteristics, or KSAOs) account for most differences in job performance; and, consequently, that (c) effective staffing decisions depend largely on the efficient processing of information about job-related human capacities.

In practice, selection decisions are made based on the social context as well as individual differences. Boudreau, Sturman, and Judge (1994) and others, including Skarlicki, Latham, and Whyte (1996), Latham and Whyte (1994), and Whyte and Latham (1997), have raised serious concerns about the ways that hiring managers actually use selection information in decision making, specifically, about how they use "rational" selection data (e.g., test scores). There may be only a weak link between rational selection information and actual selection decisions. Therefore, managers may actually ignore valid information in their decisions to adopt particular information-gathering procedures, being more receptive to other, "unscientific" sources of persuasion (Ramsay & Scholarios, 1999). In fact, important social-psychological phenomena operate in selection decisions, including interpersonal attraction, interviewer biases in processing information, the power and influence of managers/executives to shape the perceptions of others, and the inclusion of non-job-specific behaviors (e.g., organizational citizenship and pro-social behaviors) as important selection criteria for hiring managers (Anderson et al., 2004; Dorsey, Cortina, & Luchman, 2017; Motowidlo, 2003; Organ, Podsakoff, & Podsakoff, 2011).

In light of these changes, and the contemporary view of selection as a dynamic organizational process, it is important that we articulate the rationale for evaluating the business value of employee selection. The next section considers that topic in greater detail.

RATIONALE FOR EVALUATING THE BUSINESS VALUE OF EMPLOYEE SELECTION

As Rynes, Giluk, and Brown (2007) have noted, management is not truly a profession like medicine, education, or law. There is no requirement that managers be exposed to scientific knowledge about management, that they pass examinations to become licensed to practice, or that they pursue continuing education to be allowed to maintain their practice. Although they might not be familiar with statistical terminology and methodology, the language of science, managers tend to be very smart people who grasp ideas quickly and process information critically and analytically. To many of them, employee selection is a cost, not an investment, and, as with any other area of business, they want to minimize their costs. This is the origin of the mindset and desire of managers for selection methods that are "better, faster, cheaper, with less adverse impact."

As we shall demonstrate, many, if not most, assessments of the outcomes of employee-selection efforts are expressed in statistical terms, at least in the scientific literature. Because extremely few managers read such literature, including academic publications (Rynes, Colbert, & Brown, 2002), they are simply unaware of much of this potentially valuable information. Managers and academics exist in different "thought worlds" (Cascio, 2007); therefore, an ongoing challenge is to educate managers about the business value of selection efforts and to enable them to see those efforts as investments that will generate a stream of benefits over time. We hasten to add that the term "business value" does not imply that all outcomes must be expressed exclusively in monetary or quantitative terms. Indeed, as we shall demonstrate, much of the business value of selection may be expressed in qualitative terms (e.g., improvements in customer service, team dynamics, or innovations).

Assessing the Outcomes of Employee Selection

In theory, there are multiple strategies for assessing the outcomes of employee selection. In general, they comprise two broad categories: quantitative (or statistical) and qualitative (or behavioral). Four common statistical approaches to evaluation are validity coefficients, effect sizes, utility analyses, and expectancy charts. Of these, validity coefficients and effect sizes are currently most popular. Validity coefficients are typically expressed in terms of Pearson product-moment correlation coefficients that summarize the overall degree of linear relationship between two sets of scores: those on the predictor in question (e.g., a test of GMA) and a criterion (some measure of job performance). Chapters 2, 3, and 4 in this volume address the concept of validity, the validation process, and validation strategies in considerable detail, so we need not repeat that information here.

Using the methods of meta-analysis (Le, Oh, Shaffer, & Schmidt, 2007; Schmidt & Hunter, 2003, 2014) to express cumulative results across validity studies that have used the same predictor over time and situations, researchers typically have expressed their results in statistical (i.e., correlational) terms. For example, summarizing the results of 85 years of research findings in employee selection, Schmidt and Hunter (1998) reported that the top ten predictors of subsequent job performance are GMA tests (meta-correlation of .51), work-sample tests (.54), integrity tests (.41), conscientiousness tests (.31), structured employment interviews (.51), unstructured employment interviews (.38), job-knowledge tests (.48), job-tryout procedures (.44), peer ratings (.49), and ratings of training and experience (.45).

Some validity studies express outcomes in terms of effect sizes. An effect size expresses the degree to which a phenomenon is present in a population of interest, or, alternatively, the degree to which a null hypothesis is false (Cohen, 1988). The null hypothesis ("no difference") always means that the effect size is zero, as when two tests are compared to determine which one is the better predictor of some criterion of job performance. Regardless of which statistic is used to compare the results of the tests (e.g., Pearson product-moment correlation, t, z, or F), each has its own effect-size index. The only requirement for an effect-size index is that it be a pure (dimensionless) number, one not dependent on the units of the measurement scales (Cohen, 1988). Examples include the population correlation coefficient or the difference between two means expressed in units of standard deviation. Many studies in the behavioral sciences express outcomes in terms of effect sizes (e.g., see Murphy, Myors, & Wolach, 2014).

Many operating executives may be unfamiliar with validity coefficients and effect sizes. Even when they are, they may view these indexes as too abstract from which to draw implications about the effects of employee-selection efforts on their businesses. In such situations, utility analyses and expectancy charts may be valuable, for they express the outcomes of selection in monetary terms or in terms of the likelihood of success on a job, given a particular level of performance on a selection procedure. We consider each of these approaches in the following sections.
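Before turning to those approaches, a brief numerical illustration of the two statistics just described may be useful. The sketch below, using entirely hypothetical predictor and criterion scores, computes a Pearson validity coefficient and a standardized mean difference (Cohen's d) between higher- and lower-scoring groups; it illustrates the indexes only and is not part of any validation procedure described in this volume.

from math import sqrt
import statistics

# Hypothetical predictor (test) scores and criterion (performance) ratings.
predictor = [62, 55, 71, 48, 80, 66, 59, 73, 52, 68]
criterion = [3.4, 3.0, 4.1, 2.6, 4.5, 3.8, 3.1, 4.0, 2.9, 3.7]

def pearson_r(x, y):
    """Pearson product-moment correlation (the usual validity coefficient)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson_r(predictor, criterion)

# Effect size: Cohen's d comparing criterion scores of the top and bottom halves
# of the predictor distribution (a mean difference expressed in SD units).
paired = sorted(zip(predictor, criterion))
bottom = [c for _, c in paired[:5]]
top = [c for _, c in paired[5:]]
pooled_sd = sqrt((statistics.variance(top) + statistics.variance(bottom)) / 2)
d = (statistics.mean(top) - statistics.mean(bottom)) / pooled_sd

print(f"Validity coefficient r = {r:.2f}")
print(f"Effect size d = {d:.2f}")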

Utility Analyses

The utility of a selection device is the degree to which its use improves the quality of the individuals selected beyond what would have occurred had that device not been used (Taylor & Russell, 1939). Because the technical details of utility analysis have been addressed elsewhere (Boudreau & Ramstad, 2003; Boudreau, 1991; Cabrera & Raju, 2001; Cascio & Boudreau, 2011a, 2011b), we focus here only on the logic of utility analysis as illustrated in Figure 10.1. At its core, utility analysis considers three important parameters: quantity, quality, and cost.

FIGURE 10.1 The Logic of Utility Analysis. The figure pairs talent flows (potential labor pool, labor pool, applicant pool, candidates for further evaluation, offer candidates, new hires, productive employees) with the staffing processes that move candidates between them (building and planning, recruiting, screening, selecting, offering and closing, on-boarding). (From Cascio, W., and Boudreau, J., Investing in People: Financial Impact of Human Resource Initiatives, 2nd ed., Pearson, New York. ©2011. Reprinted by permission of Pearson Education, Inc., New York, New York.)

The top row of Figure 10.1 refers to the characteristics of candidates for employment as they flow through the various stages of the staffing process. At each stage, the candidate pool can be thought of in terms of the quantity of candidates, the average and dispersion of the quality of the candidates, and the cost of employing the candidates. For example, the "applicant pool" might have a quantity of 100 candidates, with an average quality value of $75,000 per year and a variability in quality value that ranges from a low of $50,000 to a high of $125,000. This group of candidates might have an anticipated cost (salary, benefits, training, and so on) of 70% of their value. After screening and selection, the "offer candidates" might have a quantity of 50 who receive offers, with an average quality value of $90,000 per year, ranging from a low of $50,000 to a high of $115,000. Candidates who receive offers might require employment costs of 80% of their value, because they are highly qualified and sought-after individuals. Eventually, the organization ends up with a group of "new hires" (or promoted candidates in the case of internal staffing), who can also be characterized by quantity, quality, and cost.

Similarly, the bottom row of Figure 10.1 reflects the staffing processes that create the sequential filtering of candidates. Each of these processes can be thought of in terms of the quantity of programs and practices used, the quality of the programs and practices, as reflected in their ability to improve the value of the pool of individuals that survives, and the cost of the programs and practices in each process. For example, as we have seen, the quality of selection procedures is often expressed in terms of their validity, or accuracy in forecasting future job performance. Validity may be increased by including a greater quantity of assessments (e.g., a battery of selection procedures), each of which focuses on an aspect of KSAOs that has been demonstrated to be important to successful performance on a job. Higher levels of validity imply higher levels of future job performance when the same number of candidates is selected or promoted, thereby improving the overall payoff to the organization. As a result, those candidates who are predicted to perform poorly never get hired or promoted in the first place. Decision makers naturally focus on the cost of selection procedures because costs are so vividly depicted by standard accounting systems, but the cost of errors in selecting, hiring, or promoting the wrong person is often much more important.

Utility analysis has achieved limited success in translating the value of valid selection procedures into terms that managers and organizational decision makers understand (Cascio & Boudreau, 2011b). Unfortunately, in many cases such analyses lack credibility because of complex formulas and dollar-based return-on-investment analyses that seem "too good to be true" (Ashe, 1990; Cascio, 1993; Schmitt & Borman, 1993). Indeed, one may logically ask, if the return on investment associated with such programs is so high, then why don't all companies invest substantial amounts of resources in them? The answer is that the actual returns are likely to be considerably lower than the estimated returns, because researchers have tended to make simplifying assumptions with regard to variables like economic factors that affect payoffs and to omit others that add to an already complex mix of factors.


Economic factors include the effects of taxes, discounting, and variable costs. Other relevant factors are employee flows into and out of the workforce, probationary periods (the difference in performance between the pool of employees hired initially and those who survive a probationary period), the use of multiple selection devices, and rejected job offers. One study used computer simulation of 10,000 scenarios, each of which comprised various values of these five factors (Sturman, 2000). Utility estimates were then computed using the five adjustments applied independently. The median effect of the total set of adjustments was −91% (i.e., the adjusted values were, on average, 91% lower than the unadjusted values), with a minimum effect of −71% and negative estimates 16% of the time. Although most utility estimates for the simulated scenarios remained positive, the five modifications had sizable and noteworthy practical effects. These results suggest that although valid selection procedures may often lead to positive payoffs for the organization, actual payoffs depend significantly on organizational and situational factors that affect the quantity, quality, and cost of the selection effort.
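To illustrate the kind of calculation at issue, the sketch below computes an unadjusted Brogden-Cronbach-Gleser utility estimate (the equation shown earlier in the chapter) and then applies a single flat adjustment factor. Every input value is hypothetical, and the flat factor is only a crude stand-in for the economic and employee-flow corrections just described; it is not Sturman's (2000) simulation model.

# Hypothetical inputs to the Brogden-Cronbach-Gleser utility equation.
n_selected = 50          # number of people hired
tenure_years = 3.0       # T: expected tenure of those hired
validity = 0.40          # r_xy: operational validity of the selection procedure
sd_y = 20000.0           # SD_y: dollar-valued standard deviation of performance
mean_z_selected = 1.75   # average standardized predictor score of those hired
                         # (approximate value for top-10% selection under normality)
n_applicants = 500       # number of applicants assessed
cost_per_applicant = 150.0

unadjusted = (n_selected * tenure_years * validity * sd_y * mean_z_selected
              - n_applicants * cost_per_applicant)

# Flat illustrative adjustment; echoes the -91% median adjustment reported in the
# text (Sturman, 2000) but does not model taxes, discounting, flows, etc. directly.
adjustment_factor = 0.09
adjusted = unadjusted * adjustment_factor

print(f"Unadjusted utility estimate: ${unadjusted:,.0f}")
print(f"Adjusted (illustrative) estimate: ${adjusted:,.0f}")

The point of the sketch is the size of the gap between the two printed figures, which is the credibility problem managers react to.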

Expectancy Charts and Performance Differences Between High and Low Scorers

Expectancy charts allow managers to see graphically the likelihood that, for example, each quintile of scorers on an assessment procedure will perform successfully on a job. More formally, organizational or institutional expectancy charts depict the likelihood of successful criterion performance to be expected from any given level of predictor scores. Individual expectancy charts depict the likelihood of successful criterion performance to be expected by an individual score at any given level on an assessment procedure. Figure 10.2 shows these two types of expectancy charts.

FIGURE 10.2 (a) Organizational Expectancy Chart: percentage of successful employees expected when selecting the top 20% (82%), top 40% (71%), top 60% (65%), top 80% (60%), or all scorers (53%). (b) Individual Expectancy Chart: likelihood of success for score ranges of 1852 and above (82%), 1827–1851 (61%), 1804–1826 (53%), 1779–1803 (43%), and 1778 and below (26%).


The organizational expectancy chart provides an answer to the question, "Given a selection ratio of .20, .40, .60, etc., what proportion of successful employees can be expected if the future is like the past?" Such an approach is useful in attempting to set cutoff scores for future hiring programs. In similar fashion, the individual expectancy chart illustrates the likelihood of successful criterion performance for an individual whose score falls within a specified range on the predictor distribution. Computational procedures for developing empirical expectancies are straightforward, and theoretical expectancy charts are also available (Lawshe & Balma, 1966). In fact, when the correlation coefficient is used to summarize the overall degree of predictor-criterion relationship, expectancy charts are a useful way of illustrating the effect of the validity coefficient on future hiring decisions. In situations in which tests have only modest validities for predicting job performance, test-score differences that appear large will correspond to modest scores on the expectancy distribution, reflecting the modest predictability of job performance from test scores (Hartigan & Wigdor, 1989).

Another way to demonstrate the business value of selection—in this case, the value of a testing program—is to compare performance differences among individuals who score at the top and bottom of the test-score distribution on job-related criteria. For example, managers can learn that a bank teller who scored in the top 80% on a test will serve 1,791 customers and refer 22 new customers in one month, compared to the bottom 20% of test scorers, who will serve only 945 customers and refer only 10 new customers (People Focus, 1998). Table 10.3 shows performance differences in job-related criteria across three companies in a Food Marketing Institute study of supermarket cashiers (Food Marketing Institute, 1985).

TABLE 10.3 Job-Performance Differences Among Supermarket Cashiers Who Score at the Top and Bottom 50% of a Test Battery (values are group averages on each criterion)

Company A
  Amount over or under: Top 50% = 1.53; Bottom 50% = 2.18
  Items per minute: Top 50% = 19.15; Bottom 50% = 17.43
  Rings per minute: Top 50% = 18.18; Bottom 50% = 17.33
  Number of voids: Top 50% = 7.17; Bottom 50% = 9.08

Company B
  Amount over or under: Top 50% = 1.55; Bottom 50% = 2.37
  Items per minute: Top 50% = 21.47; Bottom 50% = 17.67
  Rings per minute: Top 50% = 18.29; Bottom 50% = 16.01
  Number of voids: Top 50% = 6.84; Bottom 50% = 10.99

Company C
  Amount over or under: Top 50% = 1.47; Bottom 50% = 1.94
  Items per minute: Top 50% = 21.60; Bottom 50% = 18.63
  Rings per minute: Top 50% = 15.27; Bottom 50% = 15.92
  Number of voids: Top 50% = 5.83; Bottom 50% = 5.73

In our opinion, expectancy charts, together with illustrations of performance differences between high- and low-scoring individuals on an assessment procedure, provide credible, tangible evidence of the business value of selection.
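As a brief illustration of how straightforward the computations can be, the sketch below builds an empirical organizational expectancy table of the kind shown in panel (a) of Figure 10.2 from past hiring records. The scores and success indicators are invented for illustration and are not data from the studies cited above.

# Hypothetical past hires: (test score, 1 = judged successful on the job, 0 = not).
past_hires = [(1890, 1), (1862, 1), (1858, 1), (1846, 1), (1841, 0),
              (1833, 1), (1820, 1), (1815, 0), (1810, 0), (1801, 1),
              (1798, 1), (1792, 0), (1784, 0), (1770, 0), (1765, 0)]

ranked = sorted(past_hires, key=lambda pair: pair[0], reverse=True)

# Expected success rate if only the top 20%, 40%, ..., or all scorers had been hired.
for fraction in (0.20, 0.40, 0.60, 0.80, 1.00):
    top_group = ranked[: max(1, round(len(ranked) * fraction))]
    success_rate = sum(outcome for _, outcome in top_group) / len(top_group)
    label = "All" if fraction == 1.00 else f"Select top {fraction:.0%}"
    print(f"{label}: {success_rate:.0%} successful")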

Qualitative (Behavioral) Approaches to Assessing the Outcomes of Employee-Selection Programs

Qualitative outcomes can help enrich our understanding of the actual operation of selection programs, including their efficiency and effectiveness. Qualitative outcomes can also contribute to the nomological network of evidence that supports the construct validity of selection instruments. That network relates observable characteristics to other observables, observables to theoretical constructs, or one theoretical construct to another theoretical construct (Cronbach & Meehl, 1955). Information relevant either to the construct itself or to the theory surrounding the construct can be gathered from a wide variety of sources.

Here is a practical example (Fisher, 2005). In 2002, J.D. Power's customer-satisfaction surveys ranked T-Mobile dead last in its industry, trailing Verizon, Cingular, Nextel, and Sprint. The first step toward improvement was to bring together T-Mobile's HR people and its marketing managers to sit down and talk. The idea was to change the company's hiring practices in an effort to improve the quality of customer service representatives who would be willing and able to follow through on the promises that marketing representatives made to customers. Although this might sound like common sense, in practice the customer-contact people did not report to anyone in marketing or have any contact with them. Nor did anyone in HR, which meant that HR was not able to understand the needs of managers in customer service, who, in turn, needed people in place who could deliver on the marketers' message.

As a result of the in-depth discussions among representatives from customer service, HR, and marketing, T-Mobile instituted a new set of hiring criteria that emphasized traits like empathy and quick thinking. After all, customers want their problems resolved fast, in one phone call, and in a courteous manner. In addition, T-Mobile made sure that all employees knew exactly how they would be evaluated. By ensuring that HR and marketing were in sync, the company found that its employee-incentive plans also worked well, because hiring, performance management, and rewards all were linked to a common message and a common theme.

The broad-based effort paid off. By 2005, attrition and absenteeism each dropped 50% relative to 2002, while productivity tripled. As for T-Mobile's formerly exasperated customers, J.D. Power ranked T-Mobile number one in customer service for two years running. This example illustrates nicely how qualitative outcomes can help enrich our understanding of the actual operation of selection programs, including their efficiency and effectiveness. That approach certainly helped T-Mobile.

Strategic Use of Evaluation to Drive Selection-Program Effectiveness

To realize and communicate the business value of a selection program fully and effectively, it is critical to tie the program's solutions to valued organizational outcomes and to build metrics that "speak the language" of the stakeholders and decision makers. We need to establish a stream of evidence that implies a causal link between our program and desired organizational outcomes. This requires that we understand fully how an employee selection program within our organization can be leveraged to accomplish key business objectives and strategies. Our goal should be to position employee selection as a strategic tool for driving business success and achieving competitive advantage. While the specific criteria used to establish the business value of a selection program will vary by organization (based on multiple-stakeholder perspectives), a framework does exist for capturing these criteria and justifying stakeholder investment (Davidson & Martineau, 2007; Davidson, 2010).


Drawing on the fields of program evaluation (Edwards, Scott, & Raju, 2003; Phillips, 1997) and balanced-scorecard methodology (Becker, Huselid, & Ulrich, 2001), it is possible to demonstrate a selection program's usefulness to all stakeholder groups and ensure that it is appropriately tied to valuable organizational outcomes and business imperatives (Scott et al., 2010). This strategic use of program evaluation ensures that a selection program is useful not only to those implementing and using the program but also to those responsible for driving overall business strategy (Davidson, 2010). The evaluation covers four key perspectives:

1. Strategic—How well does the selection program align with the organization's business strategy and key priorities?
2. Operational—How accurate, reliable, and efficient is the selection program in acquiring top talent? How well integrated is it with other talent-management systems?
3. Customer—To what extent are customer expectations met regarding the design, deliverables, and success criteria of the selection program?
4. Financial—To what extent does the selection program contribute to valued organizational outcomes (e.g., industry competitive advantage, robust talent pipeline) and bottom-line profitability?

When designing a selection program, it is critical to work with key stakeholders so that each of these perspectives is taken into account. We have found that one of the most effective ways to accomplish this and proceed through the design of a selection program is to develop a logic map. A logic map provides a structured way to establish the business case, articulate stakeholder goals, detail the steps in the process, and make course corrections along the way. An example of a logic map is shown in Table 10.4 for a leadership-selection program.

In this case, we can see that one of the organization's long-term goals is to establish a pipeline of leaders who can broaden the business's product portfolio across diverse geographies. The short- and medium-term outcomes serve as milestones that lead to the achievement of the long-term outcomes. For instance, in order to establish this pipeline of leaders, it is first necessary to assess a targeted group of individuals from across the enterprise (medium-term outcome). In order for these medium-term outcomes to occur, it is necessary to ensure that the leadership-competency model is aligned with business strategy and that the requirements associated with the key roles are documented (short-term outcome). The program elements identify how these outcomes will be achieved and by whom. For example, it is necessary to involve subject matter experts from senior leadership to define the role requirements and from Talent Acquisition to validate the selection tools. The assumed inputs in the first column identify the conditions that are necessary for the success of the program (Davidson & Martineau, 2007).

A logic map like the one presented here (Table 10.4) will clarify for each stakeholder group how its objectives will be met, along with the relevant metrics against which the success of a selection program can be judged. This approach helps overcome barriers caused by statistical language by providing evidence of business value in the vernacular of the key stakeholders. By addressing the priorities of each stakeholder group—using their own metrics—it becomes a straightforward matter to justify a selection program's investment.

TABLE 10.4 Logic Model for Leadership Assessment and Development Program

Assumed Inputs
• Vision: Strategically select & manage talent at enterprise level
• Support: Executive sponsorship for leadership assessment and development program
• Talent Pools: Robust and diverse global candidate pools
• Resources: Approved budget and resources to create evidence-based leadership assessment program; cadre of senior leaders ready to act
• Tools & Technology: Leadership-selection philosophy, framework, tools, and programs well established

Program Elements: Activities
• Conduct job analysis to define competencies and leadership role requirements
• Validate strategic competency models
• Develop and validate assessment tools
• Map existing talent into Acceleration Pools based on validated assessment program
• Establish strategic, operational, customer, and financial metrics

Program Elements: Participation
• Key stakeholder groups, Talent Analytics, and Finance functions
• Business sector leaders and HR generalists to execute mapping
• Senior leadership to serve as subject matter experts
• Talent Acquisition leadership and staff to serve as project liaison
• Industry thought leaders to frame long-term vision for leadership skills

Outcomes: Short Term
• Competency model and assessment tools aligned with business imperatives
• All leadership roles documented against competency, experience, and contextual requirements
• Assessment and development processes implemented; linked to development and succession plans
• Metrics and tracking system implemented to accurately assess bench strength and other “leading indicators”
• Philosophy and strategy around identification and retention of talent broadly communicated

Outcomes: Medium Term
• Assessment of targeted group of talent across each business sector
• Differentiated Acceleration Pools filled
• Development plans created and shared with leaders of all high-potential managers
• Truly differentiated multiyear development for C-Suite talent

Outcomes: Long Term
• Pipeline of leaders in place to grow broader product portfolio across diverse geographies
• Outperform competitors in selection, development, and retention of high-potential leaders
• Strong leadership reputation and outstanding financial performance
• Company established as World’s #1 Talent Management Company

Source: Adapted from Scott, J. C., Rogelberg, S. G., & Mattson, B. W. (2010). Measuring and managing the talent management function. In R. Silzer & B. Dowell (Eds.), Strategy-driven talent management. Alexandria, VA: Jossey-Bass/Pfeiffer. Used with permission.

WHAT MANAGERS KNOW (AND DO NOT KNOW) ABOUT EMPLOYEE SELECTION

Here are six well-established findings in the field of I-O psychology regarding employee selection. A study of nearly 1,000 HR vice presidents, directors, and managers found that more than 50% of them actively disagreed with or did not know about the findings (Rynes, Colbert, & Brown, 2002).

1. Intelligence predicts job performance better than conscientiousness (Schmidt & Hunter, 1998).
2. Screening for intelligence results in higher job performance than screening for values or values fit (Meglino & Ravlin, 1998; Schmidt & Hunter, 1998).
3. Being very intelligent is not a disadvantage for performing well on a low-skilled job (Hunter, 1986; Schmidt & Hunter, 1998).
4. Personality inventories vary considerably in terms of how well they predict applicants’ job performance (Barrick & Mount, 1991; Gardner & Martinko, 1996).
5. Integrity tests successfully predict whether someone will steal, be absent, or otherwise take advantage of employers, although individuals can “fake good” on them (Ones, Viswesvaran, & Reiss, 1996; Ones, Viswesvaran, & Schmidt, 1993).
6. Integrity tests do not have adverse impact on racial minorities (Ones & Viswesvaran, 1998).

Needless to say, the Rynes et al. (2002) findings are disturbing, for they indicate that HR vice presidents, directors, and managers live in a very different world from that of I-O psychologists. To bridge that gap, the Society for Human Resource Management (SHRM) Foundation commissions reviews of the professional literature in key HR areas (e.g., performance management, employee selection, retention, reward strategies, employee engagement, and commitment) by knowledgeable professionals, and has them “translate” the results of published research into practical guidelines. An academic and a practitioner review each draft of the report to ensure that it is well organized and jargon-free, the findings are presented clearly, and the implications of the findings for professional practice are highlighted. The name of this initiative is “Effective Practice Guidelines,” and each report may be downloaded in PDF form from http://www.shrm.org/about/foundation/Pages/default.aspx.

We know that senior-level managers are extremely aware of the importance of hiring the right people for their organizations. Groysberg and Connolly (2015), for example, reported that the top three concerns of CEOs of firms both large and small, in order, were (1) talent management, (2) operating in a global marketplace, and (3) regulation/legislation. With respect to talent management, CEOs identified three major issues: (1) finding the right talent (especially during periods of change or growth), (2) developing high-potential employees (particularly with respect to using mobility to enable those employees to acquire the breadth of expertise and experience required of senior executives), and (3) developing talent pipelines to meet changing business demands. These results suggest that, whether an organization is purely domestic or international in its scope of operations, CEOs recognize the critical importance of employee selection (“finding the right people”) to the achievement of their strategic objectives. There is therefore a pressing need and a ripe opportunity for I-O psychologists to have a major impact on organizations by demonstrating the business value of employee selection. Executives need this information to make informed decisions about selection tools and processes, and they have never been more receptive to it than now, in light of the key human capital challenges they are facing.

Benchmarking Current Practices

Scott and Lezotte (2012) recently highlighted how assessment practices have evolved over the past decade due to rapid advances in technology and the ability to leverage the internet’s explosive growth (see also Chapters 39–44 in this Handbook). The authors cited several key features of technology-enhanced assessment tools that create significant advantages over more traditional measurement practices, including (a) increased efficiency in administration, data warehousing, and analytics; (b) enhanced access to a more global and diverse candidate pool; (c) expanded construct coverage, with an ability to measure an almost limitless array of attributes using more true-to-life item types; (d) optimized ability to deploy more advanced measurement theories and applications, leading to increased accuracy, precision, and shorter testing time; and (e) bottom-line impact and demonstrated value in achieving key organizational goals. The authors also noted that there is an unprecedented level of assessment activity across all organizational levels, as companies seek to leverage assessment-technology solutions to upgrade their workforces and drive key talent initiatives. While the search for the latest technological application has, at times, presented some challenges to good testing practice, advancements in measurement theory, revised professional standards, and the application of core measurement principles have served as the beacon for this evolution.

A recent online survey of talent-assessment trends included more than 1,400 global human resource professionals. It focused on, among other things, the nature of assessment use in organizations and how technology has been incorporated into recruitment and selection (Kantrowitz, 2014). The survey found that assessments are fairly common across all job levels, ranging from 55% use for first-line supervisors to 72% use for middle managers. Assessments are used most for external hiring (76%), internal hiring (65%), and leadership development (56%). The survey also revealed an increasing focus on assessments as part of succession planning and talent analytics. The types of tests that organizations use for pre-hire applications are shown in Table 10.5. The table shows that the most frequently used tests are skills/knowledge tests, followed by personality and cognitive-ability tests.

On the question of assessment-delivery modes, the Kantrowitz (2014) survey found that online assessment is the most prevalent (81%), followed by paper-and-pencil assessment (37%), and computer-based testing with offline scoring (35%). Mobile assessment was reported at only 4% usage. The author indicates that the use of paper-and-pencil or computer-based testing with offline scoring is more common in emerging economies.

This survey also addressed the extent to which social media are used to establish job fit for candidates. Although 54% of the respondents value their use as a recruiting tool, fewer (40%, up from 29% in 2013) view them as useful for establishing candidate fit. Only 20% of the respondents have confidence in the quality of these data, and roughly 25% have policies in place governing their use.

Another recent benchmarking study was conducted on assessment practices for high potentials and leaders, drawing upon a group of 100 large, multinational organizations—most of which were ranked among Fortune Magazine’s Top Companies for Leaders (Church & Rotolo, 2013). This study revealed that 70% of the respondents use assessments, and of that group, 90% assess their senior executives and 75% focus on high potentials. In reviewing these data, the authors conclude that organizations appear to be structuring their talent initiatives by identifying individual leadership potential at lower levels, while applying more selectivity and precision at the highest layers in their company, where leadership mistakes can have serious consequences. Church, Rotolo, Ginther, and Levine (2015) conducted a follow-up study on 80 top leadership companies and found that talent management leaders from two-thirds of these companies perceived assessments as having a moderate (5–9% improvement) to significant (10–20% improvement) impact on the business performance of high-potential and senior-executive participants.

TABLE 10.5 Pre-Hire Assessment Use

Assessment Type                                     2014 Use    2014 Rank
Skills/knowledge tests                              73%         1
Personality tests                                   62%         2
Cognitive ability/general problem-solving tests     59%         3
Job-fit tests                                       47%         4
Specific ability tests                              47%         4
Situational judgment                                43%         6
Assessment centers                                  41%         7
Job-specific solutions                              39%         8
Biodata (life history information)                  37%         9
Culture-fit tests                                   33%         10
Job simulations                                     32%         11
Interest assessments                                23%         12

Source: Tracy M. Kantrowitz, Ph.D., 2014 Global Assessment Trends Report (2014), p. 28. © 2014 CEB. All rights reserved.

LOOKING TO THE FUTURE: THE NEED FOR IMPROVED SELECTION PROCESSES

Flaws in the Traditional Selection Model

Earlier in the chapter we discussed the fact that the traditional I-O psychology selection process generally fails to take into account the social context and interpersonal processes as part of staffing decisions, and that all too frequently hiring managers override valid selection tools in favor of potentially non-job-related factors. While this sort of behavior can result in poor hiring decisions and successful legal challenges, it is important to reflect on why this behavior occurs. As previously discussed, one key reason for this could be that our selection processes are simply too narrow in scope and don’t adequately measure the multidimensional facets of work that exist in most organizations. Hiring-manager stakeholders may simply be reacting to the fact that they are not being provided with the full set of data necessary to make informed staffing decisions.

Outtz (2010) contends that these sorts of limitations in our traditional selection model lead to flawed selection decisions that negatively impact our ability to advance organizational goals and to treat candidates fairly. Even though a selection tool may be validated, if it does not measure all or even most of the important facets of job performance, it can result in imperfect decisions, illusory benefits, adverse impact, and legal challenges. As previously emphasized in this chapter, there is a pressing need to expand the relevant criterion space addressed by our selection programs and to target our predictions on in situ performance (Cascio & Aguinis, 2008a).

Practical limits are always placed on the number of assessment tools that can be implemented as part of any selection program. Organizations want to control costs and minimize administration time to manage the candidate experience. It is not uncommon for organizations to demand that the full assessment-test battery take no more than 20–30 minutes to administer, particularly in high-volume hiring situations. This drive for expediency can, and often does, lead to the implementation of a limited number of measures that may not capture the full range of attributes needed for success in the targeted roles. Selection of the highest scorers on a predictor battery that is designed for expediency and isn’t necessarily measuring the most important facets of performance has little probability of producing the desired outcomes. Outtz (2010) emphasized that research over the years showing cognitive tests to be the best predictor of performance across all jobs can be misleading, since the best predictors for a particular job can only be gleaned through an understanding of the full range of attributes required for job success, including knowledge of the relative importance of those attributes.

To ensure that our selection programs are measuring the most important and relevant facets of job performance, it is essential that they be based on a comprehensive job analysis that identifies performance domains reflective of the 21st-century workplace. When conducting a job analysis in the context of rapidly changing organizations, we must look beyond the responsibilities and competencies of a particular role, and also account for situational and contextual factors that impact individual, team, and organizational performance.
This means taking into account the overall business environment (e.g., global, economic, competitive, and market challenges anticipated to impact this type of company in the future); the organizational structure (how work gets done); culture (social and demographic environment); and the organization’s strategic objectives (Cascio & Aguinis, 2008a). The incorporation of these critical factors into the job analysis allows us to expand our assessment tools and more accurately predict the full range of job performance against the backdrop of a dynamic work environment.

Once a job analysis is conducted, a selection blueprint can be created that links assessment tools to each of the targeted attributes. This allows us to prioritize the relevant attributes that should be assessed, an especially critical concern when practical constraints limit the number of selection tools at our disposal (see Chapter 6 in this volume).

Moving forward, technology-enhanced assessments can be leveraged to address the practical constraints around testing time and better meet stakeholder needs for greater construct coverage. For example, computer adaptive testing (CAT) is becoming more widely available within large testing programs (Gibby, Ispas, McCloy, & Biga, 2009). Using this approach, candidates are presented with only a limited number of items that are needed to determine proficiency or standing on the targeted attribute. The items presented to applicants are tailored to each applicant’s “ability” based on responses to previous items. This allows for a greater number of attributes to be assessed within a limited time period. That being said, regardless of the approach taken in deploying the assessment tools, a job analysis remains a fundamental requirement to ensure that the right attributes are used to select the right candidates for the right roles.
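
As a concrete illustration of the adaptive logic described above, the sketch below implements a bare-bones CAT loop under simplifying assumptions: a two-parameter logistic (2PL) item response model with randomly generated item parameters, maximum-information item selection, and an expected a posteriori (EAP) ability estimate with a fixed maximum length and a precision-based stopping rule. It is a generic illustration, not the algorithm of any operational testing program.

```python
# Minimal CAT sketch: 2PL model, maximum-information item selection, EAP scoring.
# All item parameters are randomly generated for illustration only.
import math
import random

random.seed(0)
ITEMS = [{"a": random.uniform(0.8, 2.0), "b": random.uniform(-2, 2)} for _ in range(50)]
GRID = [g / 10 for g in range(-40, 41)]                 # theta grid from -4.0 to +4.0
PRIOR = [math.exp(-0.5 * t * t) for t in GRID]          # standard normal prior (unnormalized)

def p_correct(theta, item):
    return 1.0 / (1.0 + math.exp(-item["a"] * (theta - item["b"])))

def information(theta, item):
    p = p_correct(theta, item)
    return item["a"] ** 2 * p * (1.0 - p)

def eap(responses):
    """Posterior mean and SD of theta given (item, score) pairs."""
    post = []
    for t, pr in zip(GRID, PRIOR):
        like = pr
        for item, score in responses:
            p = p_correct(t, item)
            like *= p if score == 1 else (1.0 - p)
        post.append(like)
    total = sum(post)
    mean = sum(t * w for t, w in zip(GRID, post)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, post)) / total
    return mean, math.sqrt(var)

def administer(answer_fn, max_items=12, se_stop=0.35):
    theta, responses, remaining = 0.0, [], list(ITEMS)
    while remaining and len(responses) < max_items:
        item = max(remaining, key=lambda i: information(theta, i))  # most informative item
        remaining.remove(item)
        responses.append((item, answer_fn(item)))
        theta, se = eap(responses)
        if se < se_stop:        # stop once the ability estimate is precise enough
            break
    return theta, len(responses)

# Example: simulate a candidate whose true ability is theta = 1.0
true_theta = 1.0
simulated = lambda item: 1 if random.random() < p_correct(true_theta, item) else 0
estimate, n_used = administer(simulated)
print(f"Estimated theta = {estimate:.2f} after {n_used} items")
```

Operational CAT programs add safeguards this sketch omits, such as calibrated item banks, content balancing, and item-exposure control.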

Leveraging the Value of Multimedia Immersive Simulations

Advances in technology, measurement theory, and cognitive science have provided many new opportunities for innovative test design and deployment (Reynolds & Rupp, 2010). As a result, organizations are able to leverage multimedia technology to assess a more comprehensive range of candidate attributes with greater speed and precision, using more true-to-life item formats (see Chapters 39–44 in this Handbook). While many large-scale, high-volume selection programs still rely on multiple-choice assessments for the sake of expediency, positions that require the measurement of more complex attributes (e.g., leadership roles)—and have lower-volume hiring requirements—can take advantage of innovative formats that more closely simulate the work environment and evoke a demonstration of higher-order skills (Scott & Lezotte, 2012). Organizations engaged in leadership selection and development programs have realized tremendous time, cost, and resource savings as a result of new technologies and innovative approaches designed to select, develop, and retain high-potential leaders.

Multimedia assessments can be developed as theatrical, first-person stories that immerse candidates in the fictional world of an organization—complete with organizational and situational backstory, robust and compelling narratives, and strong story resolutions. Most web-based assessment systems support virtual environments that place the candidate in realistic job scenarios. Multimedia technology blends film or animation with other stimuli that are presented through e-mails, voicemails, annual reports, analyst research reports, marketing/sales presentations, and any number of other business and role-related materials. Candidates must absorb and act on this information in order to make decisions and take actions. The immersive quality of the simulation helps the participants engage, and it creates a sense of urgency and psychological involvement in the assessment. The story (or dramatic narrative) is designed to drive the simulation, help ensure an engaging simulation experience, and leave candidates with a sense of accomplishment and resolution. This approach predicts future in situ performance and also elicits the candidate’s best performance as he or she is drawn into the storyline’s sense of reality and challenge.

A number of organizations have found that immersive simulations have been quite effective at breaking through what Church and Rotolo (2013) term the “assessment glass ceiling,” where top organizational leaders, for a variety of reasons (e.g., skepticism, perceived loss of control), resist taking formal assessments. Immersive simulations not only capture the attention of senior leaders but also hold it long enough to elicit meaningful information about their capabilities. The assessment becomes a seamless component of a dynamic, engaging, and job-relevant narrative.

The benefits of multimedia assessments are also particularly impactful for the deployment of large-scale and multinational selection programs. With today’s advanced server technologies, and the advent of cloud computing, multimedia assessments can be administered simultaneously around the globe to hundreds of thousands of candidates to measure an almost limitless array of attributes, in any language, for any position, with almost instantaneous results.
Multimedia technology also allows organizations to expand the range, depth, and fidelity of assessments, which results in greater measurement precision and an ability to assess a fuller range of job-performance criteria. As organizations increasingly understand the value of assessment and development for driving sustainable business success, new technologies and immersive simulations will serve as the foundation for their effectiveness.

CONCLUSIONS: HOW SHOULD WE VALUE THE SUCCESS OF SELECTION IN ORGANIZATIONS?

Earlier in this chapter we argued that the traditional approach of developing valid selection methods to predict individual job performance is only “the start” of selection as an organizational process. To move the field forward, we believe that it is important to adopt a broader perspective of successful selection processes. Those processes should be assessed in terms of criteria that include empirical validity, face validity, selection ratios, marketability of the selection effort, demonstration of the business value of selection using quantitative as well as qualitative metrics, the effectiveness of the management of selection processes, candidate reactions to those processes, overall expense, and the timeliness with which selection decisions can be made.

Recommendations and Future Directions

I-O psychologists need to do more as professionals than simply develop selection systems characterized by sound psychometric qualities. Our role cannot be limited to that of technicians, because our responsibility does not end after developing a valid selection system. It does little good to say, “I developed a valid selection system, but the organization misused it.” We need to be better scientists/practitioners in integrating selection systems into organizations. Beyond traditional technical psychometric competencies, we need to provide facilitation and process-consulting skills within the business context. As selection-system developers, we need to implement change-management techniques (e.g., overcoming resistance to change) and practices (e.g., involvement and participation) when implementing these systems. We need to extend our role as scientists/practitioners to achieve successful implementation of selection systems within the context of specific organizational characteristics, social contexts, and interpersonal processes.

The Society for Industrial and Organizational Psychology (SIOP) could provide education and skill development (e.g., workshops for which the objectives are to develop business acumen as well as learning skills and techniques to facilitate the implementation of selection systems in organizational contexts). Graduate training and internships could require students to demonstrate competencies in facilitation and process-consulting skills related to implementing selection systems. Beyond that, graduate students in I-O psychology need a deeper understanding of how businesses work. They need not earn MBA degrees, but they should, at the very least, understand fundamental concepts in disciplines such as strategic management, marketing, macro- and micro-economics, accounting, and corporate finance.

Consider a real-world example that is based on the direct involvement of one of the authors. The organization in question is a major fashion retailer that has used a recruitment/selection system for more than 10 years to hire college graduates from the most prestigious universities for a management-development program. The continued success of this program can be attributed to some fundamental development and implementation practices, including the following:

1. Involvement of managers, executives, decision makers, and HR in selection-technique development and implementation. This includes job analysis, simulations, and interviews.
2. Updating job and competency requirements and selection techniques to meet the requirements of changing management jobs.

College-graduate candidates are screened on college campuses, with interviews targeted to management competencies and organizational success factors. Candidates who pass the on-campus screen are invited to a one-day assessment at the corporate headquarters. This one-day assessment includes (a) learning financial analysis skills, (b) participating in a group-based leadership exercise to improve retail-store effectiveness, and (c) two panel-group interviews. All assessors are trained to evaluate candidate performance using behavioral benchmarks and standards. Independent and consensus ratings standards and guidelines are required. An assessor conference is held after the one-day assessment. Selection-technique data, including ratings and behavioral observations, are reported to the assessors, who include managers, executives,

incumbents, and HR staff. Guidelines are provided to establish bands of scores and to make successful decisions. Observations of why this selection system has been successful in predicting job success and in becoming integrated into the culture of the business include the following:

1. The original development and updates to the selection system have involved multiple organizational participants, including executives, managers, job incumbents, and representatives from recruitment, staffing, and training.
2. Hiring decisions are made in a one-day session with all key decision makers involved.
3. Selection techniques are developed in the business context and are updated at least every three years. Interviews contain behavior-description questions and situational questions. The leaderless-group-competition exercise requires candidates to visit company and competitors’ stores and to read consumer information regarding trends and the latest company strategies for business development.
4. Assessors self-monitor and also monitor each other to evaluate candidates using behaviors/benchmarks related to competencies and success factors.

In our opinion, the key to successful implementation of a selection system is to involve decision makers and stakeholders. Development of selection techniques is therefore a necessary, but not a sufficient, condition for their successful acceptance and use by decision makers. Implementation is an ongoing challenge.

NOTE

1. We would like to thank Jerard F. Kehoe for suggesting these last two points.

REFERENCES

Anderson, N., Lievens, F., van Dam, K., & Ryan, A. M. (2004). Future perspectives on employee selection: Key directions for future research and practice. Applied Psychology: An International Review, 53, 487–501. Arthur, W., Jr., & Day, E. A. (2011). Assessment centers. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology (Vol. 2, pp. 205–236). Washington, DC: American Psychological Association. Ashe, R. L., Jr. (April 1990). The legality and defensibility of assessment centers and in-basket exercises. Paper presented at the meeting of the Society for Industrial and Organizational Psychology, Miami Beach, FL. Barrick, M. R., & Mount, M. K. (1991). The big five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26. Becker, B. E., Huselid, M. A., & Ulrich, D. (2001). The HR scorecard: Linking people, strategy and performance. Boston, MA: Harvard Business School Press. Berry, C. M., Sackett, P. R., & Wiemann, S. (2007). A review of recent developments in integrity test research. Personnel Psychology, 60, 271–301. Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494. Boudreau, J. W. (1991). Utility analysis for decisions in human resource management. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed., Vol. 2, pp. 621–745). London, England: SAGE Publications. Boudreau, J. W., & Ramstad, P. M. (2003). Strategic industrial and organizational psychology and the role of utility analysis models. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Vol. Eds.), Handbook of psychology, Vol. 12, Industrial and organizational psychology (pp. 193–221). Hoboken, NJ: Wiley. Boudreau, J. W., Sturman, M. C., & Judge, T. A. (1994). Utility analysis: What are the black boxes, and do they affect decisions? In N. Anderson & P. Herriot (Eds.), Assessment and selection in organizations: Methods and practice for recruitment and appraisal (pp. 77–96). New York, NY: John Wiley. Brannick, M., Pearlman, K., & Sanchez, J. (2017). Work analysis. In J. Farr & N. T. Tippins (Eds.), Handbook of employee selection (Revised ed.). New York, NY: Routledge. Cabrera, E. F., & Raju, N. S. (2001). Utility analysis: Current trends and future directions. International Journal of Selection and Assessment, 9, 92–102. Callinan, M., & Robertson, I. T. (2000). Work sample testing. International Journal of Selection and Assessment, 8(4), 248–260.

Cascio, W. F. (1993). Assessing the utility of selection decisions: Theoretical and practical considerations. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 310–340). San Francisco, CA: Jossey-Bass. Cascio, W. F. (2007). Evidence-based management and the marketplace for ideas. Academy of Management Journal, 50, 1009–1012. Cascio, W. F. (2016). Managing human resources: Productivity, quality of work life, profits (10th ed.). New York, NY: McGraw-Hill. Cascio, W. F., & Aguinis, H. (2008a). Staffing 21st-century organizations. Academy of Management Annals, 2, 133–165. Cascio, W. F., & Aguinis, H. (2008b). Research in I/O psychology from 1963–2007: Changes, choices, and trends. Journal of Applied Psychology, 93, 1062–1081. Cascio, W. F., & Aguinis, H. (2011). Applied psychology in human resource management (7th ed.). Upper Saddle River, NJ: Pearson Prentice Hall. Cascio, W. F., & Boudreau, J. W. (2011a). Investing in people: Financial impact of human resource initiatives (2nd ed.). Upper Saddle River, NJ: Pearson. Cascio, W. F., & Boudreau, J. W. (2011b). Utility of selection systems: Supply-chain analysis applied to staffing decisions. In S. Zedeck (Ed.), Handbook of I/O psychology (Vol. 2, pp. 421–444). Washington, DC: American Psychological Association. Chapman, D. S., & Zweig, D. I. (2005). Developing a nomological network for interview structure: Antecedents and consequences of the structured selection interview. Personnel Psychology, 58, 673–702. Church, A. H., & Rotolo, C. T. (2013). How are top companies assessing their high-potential and senior executives? A talent management benchmark study. Consulting Psychology Journal: Practice and Research, 65(3), 199–223. Church, A. H., Rotolo, C. T., Ginther, N. M., & Levine, R. (March 2015). How are top companies designing and managing their high-potential programs? A follow-up talent management benchmark study. Consulting Psychology Journal: Practice and Research, 67(1), 17–47. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates. Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. Davidson, E. J. (2010). Strategic evaluation of the workplace assessment program. In J. C. Scott & D. H. Reynolds (Eds.), The handbook of workplace assessment: Evidence-based practices for selecting and developing organizational talent (pp. 729–756). San Francisco, CA: Jossey-Bass/Pfeiffer. Davidson, E. J., & Martineau, J. W. (2007). Strategic uses of evaluation. In K. M. Hannum, J. W. Martineau & C. Reinelt (Eds.), The handbook of leadership development evaluation (pp. 433–463). San Francisco, CA: Jossey-Bass. Dorsey, D., Cortina, J. M., & Luchman, J. (2017). Adaptive and citizenship-related behaviors at work. In J. Farr & N. T. Tippins (Eds.), Handbook of employee selection (Revised ed.). New York, NY: Routledge. Economist Intelligence Unit. (February 2014). What’s next: Future global trends affecting your organization. Evolution of work and the worker. Alexandria, VA: Society for Human Resource Management Foundation. Edwards, J. E., Scott, J. C., & Raju, N. S. (Eds.) (2003). The human resources program-evaluation handbook. Newbury Park, CA: Sage. Fisher, A. (November 28 2005). For happier customers, call HR. Fortune, p. 272. Food Marketing Institute. (1985). Cashier test battery administrator’s manual. Washington, DC: Author. Gibby, R. E., Ispas, D., McCloy, R. A., & Biga, A. (2009). Moving beyond the challenges to make unproctored Internet testing a reality. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 64–68.
Gardner, W. L., & Martinko, M. J. (1996). Using the Myers-Briggs type indicator to study managers: A literature review and research agenda. Journal of Management, 22, 45–83. Groysberg, B., & Connolly, K. (March 2015). The three things CEOs worry about the most. Harvard Business Review. Downloaded from https://hbr.org/2015/03/the-3-things-ceos-worry-about-the-most. Haar, J. M., & Spell, C. S. (2007). Factors affecting employer adoption of drug testing in New Zealand. Asia Pacific Journal of Human Resources, 45, 200–217. Hartigan, J. A., & Wigdor, A. K. (Eds.) (1989). Fairness in employment testing: Validity generalization, minority issues, and the general aptitude test battery. Washington, DC: National Academy Press. Higgs, A. C., Papper, E. M., & Carr, L. S. (2000). Integrating selection with other organizational processes and systems. In J. Kehoe (Ed.), Managing selection in changing organizations: Human resource strategies (p. 94). San Francisco, CA: Jossey-Bass. Hough, L. M., & Dilchert, S. (2017). Personality: Its measurement and validity for employee selection. In J. Farr & N. T. Tippins (Eds.), Handbook of employee selection (Revised ed.). New York, NY: Routledge.

Huffcutt, A. I., & Culbertson, S. S. (2011). Interviews. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology (Vol. 2, pp. 185–204). Washington, DC: American Psychological Association. Hunter, J. E. (1986). Cognitive ability, cognitive aptitudes, job knowledge, and job performance. Journal of Vocational Behavior, 29, 340–362. Jayne, M. E. A., & Rauschenberger, J. M. (2000). Demonstrating the value of selection in organizations. In J. Kehoe (Ed.), Managing selection in changing organizations: Human resource strategies (pp. 123–157). San Francisco, CA: Jossey-Bass. Kantrowitz, T. M. (2014). 2014 global assessment trends report. Retrieved September 20, 2015 from http://www.cebglobal.com/shl/images/uploads/GATR-042014-UKeng.pdf Latham, G. P., & Whyte, G. (1994). The futility of utility analysis. Personnel Psychology, 47, 31–46. Lawshe, C. H., & Balma, M. J. (1966). Principles of personnel testing (2nd ed.). New York, NY: McGraw-Hill. Le, H., Oh, I., Shaffer, J., & Schmidt, F. L. (2007). Implications of methodological advances for the practice of personnel selection: How practitioners benefit from meta-analysis. Academy of Management Perspectives, 21, 6–15. Meglino, B. G., & Ravlin, E. C. (1998). Individual values in organizations: Concepts, controversies, and research. Journal of Management, 24, 351–389. Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck, J. R., Murphy, K., & Schmitt, N. (2007). Reconsidering the use of personality tests in personnel selection contexts. Personnel Psychology, 60, 683–729. Motowidlo, S. J. (2003). Job performance. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Comprehensive handbook of psychology: Industrial and organizational psychology (Vol. 12, pp. 39–53). New York, NY: Wiley. Murphy, K. R., Myors, B., & Wolach, A. (2014). Statistical power analysis (4th ed.). New York, NY: Routledge. Ones, D. S., Dilchert, S., Viswesvaran, C., & Salgado, J. F. (2017). Cognitive abilities. In J. Farr & N. T. Tippins (Eds.), Handbook of employee selection (Revised ed.). New York, NY: Routledge. Ones, D. S., & Viswesvaran, C. (1998). Gender, age, and race differences on overt integrity tests: Results across four large-scale job applicant data sets. Journal of Applied Psychology, 83, 35–42. Ones, D. S., Viswesvaran, C., & Reiss, A. D. (1996). Role of social desirability in personality testing for personnel selection: The red herring. Journal of Applied Psychology, 81, 660–679. Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: Findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology, 78, 679–703. Organ, D. W., Podsakoff, P. M., & Podsakoff, N. P. (2011). Expanding the criterion domain to include organizational citizenship behavior: Implications for employee selection. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology (Vol. 2, pp. 281–324). Washington, DC: American Psychological Association. Outtz, J. L. (2010). Addressing the flaws in our assessment decisions. In J. C. Scott & D. H. Reynolds (Eds.), The handbook of workplace assessment: Evidence-based practices for selecting and developing organizational talent (pp. 711–727). San Francisco, CA: Jossey-Bass/Pfeiffer. Pearlman, K., & Barney, M. (2000). Selection for a changing workplace. In J. Kehoe (Ed.), Managing selection in changing organizations: Human resource strategies (pp. 3–72).
San Francisco, CA: Jossey-Bass. People Focus. (1998). Bank of America new hire assessment predictive validation report. Pleasant Hill, CA: Author. Phillips, J. J. (1997). Handbook of training evaluation (3rd ed.). Houston, TX: Gulf. Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000). Adaptability in the workplace: Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85, 612–624. Ramsay, H., & Scholarios, D. (1999). Selective decisions: Challenging orthodox analyses of the hiring process. International Journal of Management Reviews, 1, 63–89. Reynolds, D. H., & Rupp, D. E. (2010). Advances in technology-facilitated assessment. In J. C. Scott & D. H. Reynolds (Eds.), Handbook of workplace assessment: Selecting and developing organizational talent (pp. 609–641). San Francisco, CA: Jossey-Bass. Roberts, B. (February 2011). Close-up on screening. HR Magazine, 56(2), 23–29. Roth, P. L., Bobko, P., & McFarland, L. A. (2005). A meta-analysis of work sample test validity: Updating and integrating some classic literature. Personnel Psychology, 58, 1009–1037. Rotundo, M., & Spector, P. E. (2017). Counterproductive work behavior and withdrawal. In J. Farr & N. T. Tippins (Eds.), Handbook of employee selection (Revised ed.). New York, NY: Routledge. Rynes, S. L., Colbert, A. E., & Brown, K. G. (2002). HR professionals’ beliefs about effective human resource practices: Correspondence between research and practice. Human Resource Management, 41, 149–174. Rynes, S. L., Giluk, T. L., & Brown, K. G. (2007). The very separate worlds of academic and practitioner periodicals in human resource management: Implications for evidence-based management. Academy of Management Journal, 50, 987–1008.

Schatsky, D., & Schwartz, J. (2015). Global human capital trends 2015: Leading in the new world of work. Deloitte University Press. Available at http://dupress.php.islsandbox.com/periodical/trends/human-capital-trends-2015/page/2/ Schippmann, J. S. (2010). Competencies, job analysis, and the next generation of modeling. In J. C. Scott & D. H. Reynolds (Eds.), The handbook of workplace assessment: Evidence-based practices for selecting and developing organizational talent (pp. 197–231). San Francisco, CA: Jossey-Bass/Pfeiffer. Schippmann, J. S., Ash, R. A., Battista, M., Carr, L., Eyde, L. D., Hesketh, B., et al. (2000). The practice of competency modeling. Personnel Psychology, 53, 703–740. Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274. Schmidt, F. L., & Hunter, J. E. (2003). History, development, evolution, and impact of validity generalization and meta-analysis methods, 1975–2002. In K. R. Murphy (Ed.), Validity generalization: A critical review (pp. 31–66). Hillsdale, NJ: Lawrence Erlbaum. Schmidt, F. L., & Hunter, J. E. (2014). Methods of meta-analysis: Correcting error and bias in research findings (3rd ed.). Thousand Oaks, CA: Sage. Schmitt, N., & Borman, W. C. (Eds.) (1993). Personnel selection in organizations. San Francisco, CA: Jossey-Bass. Schmitt, N., Oswald, F. L., Kim, B. H., Imus, A., Merritt, S., Friede, A., & Shivpuri, S. (2007). The use of background and ability profiles to predict college student outcomes. Journal of Applied Psychology, 92, 165–179. Scott, J. C., & Lezotte, D. V. (2012). Web-based assessments. In N. Schmitt (Ed.), The Oxford handbook of personnel assessment and selection (pp. 485–516). New York, NY: Oxford University Press. Scott, J. C., Rogelberg, S. G., & Mattson, B. W. (2010). Measuring and managing the talent management function. In R. Silzer & B. Dowell (Eds.), Strategy-driven talent management (pp. 503–547). Alexandria, VA: Jossey-Bass/Pfeiffer. Skarlicki, D. P., Latham, G. P., & Whyte, G. (1996). Utility analysis: Its evolution and tenuous role in human resource management decision making. Canadian Journal of Administrative Sciences, 13, 13–21. Sturman, M. C. (2000). Implications of utility analysis adjustments for estimates of human resource intervention value. Journal of Management, 26, 281–299. Taylor, H. C., & Russell, J. T. (1939). The relationship of validity coefficients to the practical effectiveness of tests in selection. Journal of Applied Psychology, 23, 565–578. Thornton, G. C., III, Johnson, S. K., & Church, A. (2017). Executive selection: Assessing leadership and high potential. In J. Farr & N. T. Tippins (Eds.), Handbook of employee selection (Revised ed.). New York, NY: Routledge. Weber, L. (June 3 2015). Drug use is on the rise among workers in U.S. The Wall Street Journal, pp. B1, B7. Weber, L., & Dwoskin, E. (September 30 2014). As personality tests multiply, employers are split. The Wall Street Journal, pp. A1, A2. Welch, J. (2005). Winning. New York, NY: HarperBusiness. Whyte, G., & Latham, G. (1997). The futility of utility analysis revisited: When even an expert fails. Personnel Psychology, 50, 601–610. Zumbrun, J. (June 19 2015). Behind lingering job listings. The Wall Street Journal, p. A3.

Part III

CATEGORIES OF INDIVIDUAL DIFFERENCE CONSTRUCTS FOR EMPLOYEE SELECTION

DAVID CHAN AND FILIP LIEVENS, SECTION EDITORS

11
COGNITIVE ABILITY
Measurement and Validity for Employee Selection

DENIZ S. ONES, STEPHAN DILCHERT, CHOCKALINGAM VISWESVARAN, AND JESÚS F. SALGADO

Cognitive ability (or “intelligence”) affects individuals’ lives in countless ways, and it influences the work lives of employees perhaps to a greater extent than any other individual differences trait. Cognitive ability determines whether an employee will be able to acquire the required job knowledge and perform assigned tasks. It is the strongest predictor of learning and acquisition of job knowledge as well as overall job performance. It is remarkably relevant regardless of the occupation one holds (Ones, Dilchert, & Viswesvaran, 2012). It even predicts extrinsic career success (i.e., earnings and promotions). As such, it is an exceedingly important trait to include in employee selection systems. In this chapter, we provide an overview of cognitive ability’s key role in staffing organizations and offer evidence-based practice recommendations. We first present a brief synopsis of the history, current usage, and acceptance of cognitive ability tests in employee selection. Second, we highlight the theoretical underpinnings and structure of cognitive ability as a construct. Third, we discuss developments in its measurement. Fourth, we present an overview of the criterion-related validity of cognitive ability tests in predicting valued work behaviors and outcomes, including non-task-performance criteria that have been increasingly investigated in recent years. Fifth, we discuss the issue of group differences in cognitive ability test scores both within the United States and internationally. We conclude by discussing future research and challenges facing organizations that intend to use cognitive ability tests in making employee selection decisions.

HISTORY, CURRENT USAGE, AND ACCEPTABILITY OF COGNITIVE ABILITY MEASURES IN EMPLOYEE SELECTION

It has been more than 110 years since the publication of Spearman’s influential (1904) article “ ‘General Intelligence,’ Objectively Determined and Measured.”1 Early in the 20th century, researchers began to study the usefulness of cognitive ability measures for predicting learning and performance in educational settings. For personnel decision making, standardized, objective cognitive ability tests first saw large-scale use in military settings. Group tests of intelligence were developed prior to World War I and used extensively during both World Wars. European

and U.S. armed forces continued to utilize cognitive ability tests for selection and placement, and many business organizations followed suit. However, during the 1940s, ’50s, and ’60s, research revealed much variability, particularly with regard to the supposed usefulness of such measures to predict job performance. It seemed that the specific jobs under investigation, specific organizational settings, the particular ability measures used, and many unidentified (and unidentifiable) factors all contributed to the variability of observed results (e.g., Hull, 1928). Moreover, validation results differed even when jobs, organizations, and measures were held constant. Industrial psychologists came to believe that subtle, undetectable differences in situations were responsible for differences observed in the predictive value of the test studied. By the 1960s, this belief in situational specificity dominated the scientific literature and was well entrenched among practitioners (see Chapter 4, this volume).

The breakthrough came in the 1970s. Frank Schmidt and Jack Hunter demonstrated (Schmidt & Hunter, 1977) that most differences observed across studies of cognitive ability were due to sampling error (sample sizes for validation studies in the 1960s displayed a median of 68; see Lent, Aurbach, & Levin, 1971), differences in level of restriction of range in samples (typically employees in concurrent studies who had already been selected into an organization), and differences across studies in the unreliability of criterion measurement (typically supervisory ratings of job performance). These statistical and measurement artifacts were responsible for the differences in results observed across most previous validation studies. The invention of meta-analysis (known as “validity generalization” in the employee selection literature), and the consistent findings from meta-analytic studies discredited the theory of situational specificity and paved the way to systematic investigations of predictor validity, also reaching beyond the domain of cognitive abilities.2
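
The sketch below illustrates, with invented study values, the kinds of artifact corrections at the heart of this approach. The formulas are the standard corrections for direct range restriction, criterion unreliability, and sampling-error variance; the specific artifact values used (a range-restriction ratio of .70 and a criterion reliability of .52) are illustrative assumptions, not figures taken from the studies cited in this chapter.

```python
# Bare-bones illustration of psychometric meta-analysis ("validity generalization")
# artifact corrections. Study values are invented for illustration only.
import math

def correct_range_restriction(r, u):
    """u = restricted SD / unrestricted SD of the predictor (Thorndike Case II)."""
    return (r / u) / math.sqrt(1 - r**2 + (r**2) / (u**2))

def correct_criterion_unreliability(r, ryy):
    """Disattenuate for unreliability in the criterion (e.g., supervisor ratings)."""
    return r / math.sqrt(ryy)

# Hypothetical local validation studies: (observed r, sample size)
studies = [(0.18, 68), (0.31, 54), (0.05, 40), (0.27, 120), (0.12, 75)]

n_total = sum(n for _, n in studies)
r_bar = sum(r * n for r, n in studies) / n_total                     # N-weighted mean r
var_obs = sum(n * (r - r_bar) ** 2 for r, n in studies) / n_total    # observed variance of r
var_sampling = sum(n * ((1 - r_bar**2) ** 2 / (n - 1)) for _, n in studies) / n_total
var_residual = max(var_obs - var_sampling, 0.0)

print(f"Mean observed r = {r_bar:.2f}")
print(f"Observed variance = {var_obs:.4f}, expected from sampling error alone = {var_sampling:.4f}")
print(f"Residual variance after removing sampling error = {var_residual:.4f}")

# Correcting the mean validity for typical artifacts (illustrative artifact values):
rho = correct_criterion_unreliability(correct_range_restriction(r_bar, u=0.70), ryy=0.52)
print(f"Mean validity corrected for range restriction and criterion unreliability = {rho:.2f}")
```

A full Hunter-Schmidt analysis applies the corrections in a prescribed order and also accounts for variability in the artifacts themselves; the point of the sketch is simply that a handful of small-sample studies can show large spread in observed validities even when the underlying relationship is essentially constant.
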
Today, cognitive ability measures are used in educational admissions and civilian personnel staffing, but how widespread is the use of cognitive ability measures in organizational settings in general, as well as vis-à-vis other tools available for personnel decision making? At the turn of the 21st century, the most comprehensive survey was conducted by Ryan, McFarland, Baron, and Page (1999). Ryan and colleagues surveyed 959 organizations from 20 countries by randomly sampling 300 large organizations (with more than 1,000 employees) in each country. The focus of their study was the examination of national and cultural influences on many selection system features. The pervasiveness of cognitive ability tests was also surveyed. Across the 18 countries for which data were reported by Ryan and colleagues, on average, cognitive ability tests were used between 21% and 50% of the time in employee selection. Organizations in the Netherlands, Belgium, Portugal, Spain, South Africa, New Zealand, and the United Kingdom reported above average use, whereas organizations in Germany, Hong Kong, and Italy reported especially low levels of cognitive test use. Within each country, of 14 selection methods presented to respondents (cognitive ability tests, physical ability tests, foreign language tests, work samples, personality tests, integrity tests, interest inventories, simulation exercises, situational judgment tests [SJTs], video-based tests, projective techniques, drug tests, medical screens, and graphology), cognitive ability tests were ranked in the top three most frequently utilized methods in 15 of 18 countries. It is of value to note that some of the methods listed, such as SJTs or simulations, can be used to measure a variety of constructs, and thus data on their use are not necessarily directly comparable to that of construct-specific ones such as standardized tests of cognitive ability and personality. However, it appears that if objective tests are utilized at all in personnel staffing decisions, cognitive ability measures are included with frequency.

Several other, often region-specific, surveys have documented similar prevalence rates as well as other interesting trends. Salgado and Anderson’s (2002) summary of such studies revealed that cognitive ability test use seems more common for selection into graduate and managerial-level positions compared to low-complexity jobs. More recent surveys (e.g., Chartered Institute of Personnel Development, 2007; Taylor, Keelty, & McDonnell, 2002) also seem to suggest an increase in cognitive ability test use, at least in some countries. In addition, cognitive ability tests appear to be used more frequently by larger organizations compared with smaller ones (Salgado, in press). Unfortunately, even when considering these more recent surveys, the available data on the extensiveness of cognitive ability test use come from countries that are not entirely representative of the world’s cultural regions. Data from Eastern Europe (e.g., Ukraine, Russia) and southern Asia (e.g., India, Pakistan) are meager; systematic, large-scale surveys from Latin America, the Middle East, and Africa are also lacking. Research studies on other issues relating to cognitive ability tests (e.g., validity, group differences) are increasingly being published by authors in these regions, which might be interpreted as an indicator that their use in practice is also increasing (see, for example, Barros, Kausel, Cuadra, & Diaz, 2014; Kriek & Dowdeswell, 2009; Thadeu & Ferreira, 2013). Countries from these world regions offer a unique opportunity for industrial-organizational (I-O) psychologists to assess cultural variability in the extensiveness of use of, as well as reactions to, cognitive ability tests.

Prevalence data provide an index of organizational acceptance of selection tools. Another perspective on this issue can be gained by examining applicants’ acceptance. Applicant reactions to selection tests vary by test type (Kluger & Rothstein, 1993). There are now many international (including some comparative) studies of applicant reactions to selection tests (see Bertolino & Steiner, 2007; Moscoso, 2006; Nikolaou & Judge, 2007; Ryan et al., 2009). An early meta-analysis by Hausknecht, Day, and Thomas (2004) of 10 studies on selection tool perceptions showed that the mean favorability ratings for cognitive ability tests were lower than those for interviews, work samples, resumes, and references, but higher than those for (in descending order) personality tests, biodata, personal contacts, honesty tests, and graphology. However, this meta-analysis included respondents in both laboratory research and field settings. Caution is warranted in drawing conclusions about reactions of actual job applicants to cognitive ability tests on the basis of these results: participants in the studies contributing to the Hausknecht et al. meta-analysis were not necessarily applying for jobs, were not in selection settings, and did not experience each of the tools they were rating. A more recent meta-analysis by Anderson, Salgado, and Hülsheger (2010), which included studies from all countries listed above (and summarized data for job applicants as well as some “student surrogate” samples), found that cognitive ability tests are among the selection procedures rated most favorably by job applicants. Those authors conclude that the high favorability of such tests, despite the drawback of being perceived as relatively impersonal, was due to perceptions of standardized tests being scientifically valid, respectful of applicants’ privacy, and providing them with an opportunity to perform. Recent research indicates there is cross-national similarity in organizational acceptance and use of cognitive ability tests for employee selection (Ryan et al., under review). Furthermore, applicants view cognitive ability tests relatively favorably, and again similarly so across several countries where data are available (e.g., see Ryan et al., 2009 for a 21-country investigation). Figure 11.1 presents meta-analytic data on applicant reactions to cognitive ability tests in comparison to one of the most favorably rated selection methods (interviews), the least favorably ranked method for each justice dimension, as well as mean applicant reactions across all methods investigated. In sum, cognitive ability tests are perceived more favorably than average on all relevant dimensions except “interpersonal warmth,” and even exceed favorability ratings of interviews in terms of scientific evidence and respect for applicants’ privacy.

FIGURE 11.1 Applicant Reactions to Cognitive Ability Tests and Other Selection Methods by Dimension

Based on meta-analytic results from Anderson et al. (2010). Values represent sample-size-weighted means for each method after all ratings were transformed to a 7-point scale; higher values indicate more favorable reactions. Graphology was excluded from consideration as the “least favorable” method due to lack of validity evidence and relative infrequency of use compared to all other methods compared. For the five dimensions, “personal contacts” was consistently the least favorably ranked method; for the dimensions “widely used” and “respectful of privacy,” the least favorable method was honesty tests.

Scholars have rightfully pointed out that applicant reactions to personnel selection procedures are largely determined by their perceived fairness and their perceived predictive validity (Chan & Schmitt, 2004). However, it has also been shown that cognitive ability is a common antecedent not only of performance on standardized tests but also of perceived test fairness and test-taking motivation (Chan, Schmitt, Jennings, Clause, & Delbridge, 1998; Reeve & Lam, 2007). This is also true for self-assessed performance. Applicants’ “guesses” of how well they performed on cognitive tests have been demonstrated to relate to perceptions of predictive validity and job relatedness in several cultures (Ryan et al., 2009). Applicant perceptions of fairness are likely to present a challenge for cognitive ability tests as long as any traditionally disadvantaged group (broadly defined) scores systematically lower on a given predictor battery or applicants perceive a systematic bias hindering their test performance on such tests. Of course, this issue has been much discussed in relation to race and ethnic group mean score differences, particularly in the U.S. context. However, with changes in test technology and the increase of online testing, as well as new and innovative item formats to measure various cognitive abilities (especially inductive reasoning), similar concerns might occur with regard to other groups, such as older job applicants (see below for a discussion of oft-neglected age differences and potential for adverse impact). Tackling this issue will be a major task for our profession in the years to come if organizations and society as a whole are to benefit from the use of the most reliable and valid assessments available for hiring and placements. Vocal opponents of high-stakes testing have promulgated myths about the ability of intelligence tests to predict valued outcomes, as well as their fairness (see Schmidt et al., 2007, for examples). These myths, although often entirely unsupported by empirical evidence or even common logic, are difficult to dispel and, if allowed to inform organizational decisions on selection tool use, pose a threat to organizations’ and ultimately societies’ economic welfare (Sackett, Borneman, & Connelly, 2008; Schmidt & Hunter, 1981).

DEFINITIONS AND THEORETICAL UNDERPINNINGS

The core of intelligence as a psychological construct has long been conceptualized as reasoning ability and a form of mental adaptability (Stern, 1911). Despite the central place this construct takes in determining individual behavior, it took almost a century for a broad scientific consensus to emerge on its definition. A group of 52 experts that included luminaries of psychological science defined intelligence as “a very general mental capacity that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience” (Gottfredson, 1997, p. 13). This group of scholars, drawn from various psychological disciplines (including I-O psychology), goes on to state that intelligence “is not merely book learning, a narrow academic skill, or test-taking smarts. Rather, it reflects a broader and deeper capability for comprehending our surroundings—‘catching on,’


‘making sense’ of things, or ‘figuring out’ what to do” (p. 13). In the words of William Stern, one of the forefathers of modern-day research on cognitive ability, intelligence is the “general mental adaptability to new problems and conditions of life” (1911, p. 3). The importance of such a broad definition (in contrast to folk concepts such as “book smarts”) cannot be overstated. Conceptually, intelligence, in humans and other species, indicates the complexity and efficiency of cognitive functioning. Here, complexity refers to the “sophistication of the intellectual repertoire” (Lubinski, 2004, p. 98), whereas the efficiency aspect refers to the effectiveness of information-processing skills. Both aspects are critical to performance in all domains of life (in interpersonal interactions, at home, school, or work), and their impact on individual differences in problem-solving ability can be observed in individuals of all ages. The realization that such information-processing skills “can be applied to virtually any kind of content in any context” (Gottfredson, 2004b, p. 23) is of relevance to scientists and practitioners alike. The application of this principle to organizational contexts forms the conceptual basis of Campbell’s (1990) fundamental statement that “general mental ability is a substantively significant determinant of individual differences in job performance for any job that includes information-processing tasks” (p. 56). It is difficult to imagine any job that does not include information processing of some form.

Cognitive ability is an integral part of models of job performance because of its relation to knowledge and skill acquisition. General mental ability predicts job performance because it is a causal determinant of acquisition of job knowledge (McCloy, Campbell, & Cudeck, 1994; Schmidt, Hunter, & Outerbridge, 1986). The more cognitively demanding the knowledge to be acquired and the more complex the task to be performed, the greater is the relationship between cognitive ability and performance (Hunter & Hunter, 1984).

STRUCTURE OF COGNITIVE ABILITY Although there are numerous specific cognitive abilities, they all share a common construct core: general mental ability, popularly called intelligence, or g (for general intelligence factor) in many scientific writings. As Gottfredson (2002) so aptly noted, the multitude of ways to measure g attest to its generality. Although measures of intelligence may look different, employ different item types (e.g., verbal, figural, numerical, cognitive/neuropsychological tasks), and use different formats (e.g., individually administered tasks, paper-and-pencil tests, computerized batteries, and even game-like mobile applications), this does not mean they assess entirely distinct constructs. The structure of cognitive abilities has been examined extensively since Spearman’s (1904) distinction of g and s (specific abilities). A century of research has yielded hundreds of data sets in which individuals took multiple cognitive ability measures. Carroll (1993) compiled, analyzed, and summarized the correlation matrices resulting from more than 460 such data sets. The result was his popular three-stratum model (see McGrew, 2009; Schneider & McGrew, 2012; for fuller descriptions of the Cattell-Horn-Carroll [CHC] model of intelligence). Cognitive abilities are hierarchically organized. At the apex is g, or an ability that is general. At the second stratum are group factors or broad abilities, including fluid reasoning, memory (short-term as well as longterm storage and retrieval), visual processing, processing speed (including perceptual speed), but also previously acquired knowledge (including quantitative ability, comprehension, reading and writing, and domain-specific knowledge). At the lowest level of the hierarchy are specific factors or narrow abilities such as induction for fluid abilities, ideational fluency for long-term memory, or lexical knowledge for comprehension. Individuals of similar intelligence (i.e., at the same trait level of general mental ability) differ in their standing on specific abilities because of differential “investment” of their cognitive capacity (guided by other personal characteristics as well as idiosyncratic developmental and educational experiences) in these narrow cognitive domains. The distinction between fluid and crystallized intelligence (termed gf and gc, respectively; see Cattell, 1971) provides a still-popular conceptual model but has also been shown to distinguish between g and lower-level abilities, rather than ability factors at the same level of the taxonomical hierarchy. Fluid and crystallized intelligence tend to correlate around .70, and some scholars argue that gf is indistinguishable from g (Gustafsson, 2002; other scholars go as far as classifying


certain fluid ability tests, such as the Raven’s matrices, as direct measures of g). The most significant content domains that surface in most ability models are verbal/linguistic, quantitative/numerical, and spatial/mechanical.

There are several other popular models of cognitive ability structure, both competing as well as converging (e.g., Vernon’s, Cattell and Horn’s, Holzinger’s, Johnson & Bouchard’s, as well as the Berlin model). Most of them are hierarchical in nature but differ in terms of both the number of strata and the nature of the primary and secondary factors they postulate. An overview and illustration of these models is provided in Salgado (in press). Even though researchers continue to clarify and refine the structure of intelligence in individual studies (e.g., see Johnson & Bouchard, 2005; Johnson, te Nijenhuis, & Bouchard, 2007, 2008), the Cattell-Horn-Carroll (CHC) model (McGrew, 2009) still dominates the thinking about the structure of intelligence today.

What is undebated is that when various cognitive ability tests reflecting the entire range of intelligence from the general population are administered to test takers, a large proportion of variance can be attributed to a general factor. Lubinski (2004) found that in such cases, about 50% of the common variance is due to g, whereas 8–10% of the remaining common variance is attributable to verbal, quantitative, and spatial abilities. It has been found that relationships among cognitive ability scales are weaker at higher levels of the ability spectrum (e.g., see Detterman & Daniel, 1989; Kane, Oakland, & Brand, 2006), implying a smaller amount of common variance due to g. Theoretically, one implication could be that there may be more room among high-ability individuals for specific abilities to yield incremental validities over tests of general mental ability. However, investigations of incremental validity in highly complex jobs have so far yielded mixed results. For example, Olea and Ree (1994) reported that specific abilities contributed little beyond g to the prediction of job performance among pilots and navigators, whereas Ree and Carretta (1996) concluded that some specific abilities had the potential to add incremental value at least for prediction of military pilot performance. If it were consistently found that, in high-ability samples, the common variance among individual ability tests accounted for by g was smaller than in samples of broad talent, then it is plausible that specific abilities could add incremental value over general mental ability for such groups. For the prediction of training performance, there is some initial evidence to support this hypothesis in primary samples of apprentices in low- compared with medium-complexity jobs (Ziegler, Dietl, Danay, Vogel, & Bühner, 2011). However, both the Germanic context of these data (e.g., relatively high educational standards) and the reliance on suboptimal regression analyses in this research point to the need for replication in other countries, using more appropriate statistical approaches (Wiernik, Wilmot, & Kostal, 2015). Moreover, direct tests among job applicants, especially high-ability samples, and for the prediction of job performance are still called for. However, such investigations would have to sort out potentially complex range restriction effects in these samples.
When broad job categories and applicants of a wide range of talent are studied, analyses directed at incremental validities of specific ability measures over g have yielded disappointing results: Specific abilities do not provide substantial incremental validity over g. Nonetheless, in some meta-analyses, specific abilities have been shown to be similarly valid for the prediction of some criteria (see below). In addition, there may also be non-validity-related reasons for practitioners to include specific ability measures in their selection systems, such as the consequences of anticipated group differences or applicant reactions. Some survey results on the use of specific versus general mental ability tests seem to reflect such considerations, as specific ability tests see significant use in pre-hire assessments, albeit not at the same level as general mental ability tests (Kantrowitz, 2014).
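Incremental validity analyses of the kind discussed above are typically framed as the gain in squared multiple correlation when a specific ability is added to a model that already contains g. The following Python sketch illustrates that comparison on simulated data; the variable names, effect sizes, and the simple OLS helper are illustrative assumptions rather than a reanalysis of any study cited here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Simulate a latent g, a specific ability that loads on g, and a criterion.
g = rng.normal(size=n)
spatial = 0.7 * g + 0.7 * rng.normal(size=n)          # specific ability with unique variance
performance = 0.5 * g + 0.1 * spatial + rng.normal(size=n)

def r_squared(X, y):
    """R^2 from an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_g = r_squared(g.reshape(-1, 1), performance)
r2_g_plus = r_squared(np.column_stack([g, spatial]), performance)
print(f"R^2 (g only)      = {r2_g:.3f}")
print(f"R^2 (g + spatial) = {r2_g_plus:.3f}")
print(f"Incremental dR^2  = {r2_g_plus - r2_g:.3f}")
```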

MEASUREMENT

The list of cognitive ability measures available to scientists, individual practitioners, and organizations runs in the hundreds and includes everything from simple, homegrown measures to tests of wide circulation supported by many decades of empirical research evidence. A discussion of the merits of individual measures cannot be provided in this chapter. However, a brief


discussion of commonly used methods, as well as current trends in cognitive ability assessment, is warranted. Traditional, standardized tests are the most widespread method for measuring all types of cognitive abilities. Their popularity is not because of a lack of alternative methods, but primarily because of their excellent reliability and ease of administration and scoring. Although validity (including predictive validity) is the property of the inferences made about a psychological construct (e.g., the abilities measured by a test, not the test itself), the reliability of the assessment methods provides a ceiling for validities that can be obtained in applied settings. From this point of view, standardized tests provide the best solution for organizations looking to assess cognitive ability in a reliable, standardized, and objective manner. The use of standardized tests in employee selection and academic settings is not without controversy. Unfortunately, criticism levied against cognitive ability tests, like that directed at other standardized testing, often falls victim to “content-format confusion” (Chan & Schmitt, 1997, 2004; Ryan & Greguras, 1998) and failure to distinguish the nature of the test response (Lievens, De Corte, & Westerveld, 2015).

In addition to standardized tests, many other methods can be used to assess cognitive ability constructs, and a careful investigation of these methods and how they are typically used can inform decisions on whether they are suitable for a given purpose and setting. Interviews, assessment centers (ACs), and situational judgment tests (SJTs) are all methods that assess cognitive ability to varying degrees—sometimes by design, sometimes by accident. Early meta-analyses estimated the overlap between interviews and cognitive ability at approximately ρ = .40 (Huffcutt, Roth, & McDaniel, 1996). A more recent meta-analysis reported a mean range-restriction-corrected correlation of .27 (N = 11,317, k = 40; Berry, Sackett, & Landers, 2007). A re-analysis by Roth and Huffcutt (2013) showed that interviews conducted specifically in employment settings (versus for academic admissions) are more saturated with cognitive ability variance, in line with earlier findings (ρ = .41, N = 840, k = 5). However, the analysis by Berry and colleagues provides some intriguing moderator results, including higher interview–ability test correlations when interview validity is high and job complexity is low. Interviews with greater cognitive content can be expected to yield higher criterion-related validities. Also, for low-complexity jobs, interviews may function as more of a cognitive screen than for higher-complexity jobs.

Relationships between cognitive ability and overall AC ratings have also been examined. A meta-analysis by Collins et al. (2003) reported that cognitive ability test scores correlated .43 with overall AC ratings (N = 5,419, k = 34). AC dimensions may have a differential cognitive load. In a large-scale study, Dilchert and Ones (2009) reported that the highest correlations were found for the AC dimension problem solving (r = .32, N = 4,856), providing further evidence for the fact that cognitive ability measures capture real-world problem-solving abilities, including those displayed in business simulations (cf. Gottfredson, 1997).
In an integrative meta-analysis of the AC literature, Meriac, Hoffman, and Woehr (2014) reported similar findings, with the AC dimensions problem solving, communication, and organizing/planning all displaying mean unreliability-corrected correlations of .29 with GMA. Meriac and colleagues estimated that the general factor that spans AC dimensions (see Kuncel & Sackett, 2013) is correlated .26 with GMA (the maximum correlation with any of the Big Five personality dimensions was .14 with Extraversion). Assessment center exercises are similarly related to GMA. Hoffman, Monahan, Lance, and Sutton’s (2015) meta-analysis reported unreliability-corrected correlations of .30 for in-baskets, but relations in the range of .13 to .22 for leaderless group discussions, role plays, case analyses, and oral presentations.

Increasingly popular SJTs are also correlated with cognitive ability; however, the magnitude of the correlation depends on the instructions given to participants. Knowledge instructions (e.g., “what should one do,” “rate the best/worst option”) in completing SJTs produce an observed correlation of .32 (N = 24,656, k = 69), whereas SJTs with behavioral tendency instructions (e.g., “what would you do”) correlate .17 (N = 6,203, k = 26) with cognitive ability (McDaniel, Hartman, Whetzel, & Grubb, 2007). Thus, if job applicants complete SJTs, especially under knowledge instructions, such assessments rank job applicants on cognitive ability to a certain degree. However, Christian, Edwards, and Bradley’s (2010) meta-analysis of SJT validity by construct domain indicates that job knowledge and skills-focused SJTs predict job performance at lower levels than those established for traditional cognitive ability tests. More


construct-focused research on SJTs is warranted, however, as total sample sizes for these analyses were very small, and SJTs specifically designed to assess general mental ability were not included (likely because few such measures exist).

Many assessment methods increasingly rely on formats other than the traditional paper-and-pencil form, a trend that is also reflected in ability measurement. Earlier research used meta-analysis to establish the equivalence of computerized and paper-and-pencil versions of cognitive ability tests (Mead & Drasgow, 1993). Recent trends in web-based assessment and test content delivery build on the fact that tests of many individual difference predictors (not only cognitive ability) have been shown to be equivalent between paper-and-pencil and computerized versions. However, the real challenge arises not from a change in test format but from a change in administration mode. Web-based, unproctored cognitive ability assessment is aimed at streamlining the application process for applicants and organizations. Critics argue that this approach requires strong confidence in the honesty of test takers (who, at least in selection contexts, presumably have a strong incentive to cheat). Some organizations, confronted by the real-world challenges of having to assess hundreds of thousands of applicants every year, are already using computerized adaptive testing and constantly updated test materials to conduct unproctored web-based testing (Gibby, 2008). Although there is considerable range in estimates of the magnitude of applicant cheating on such assessments (Arthur, Glaze, Villado, & Taylor, 2010; Hense, Golden, & Burnett, 2009; Lievens & Burke, 2011), they are lower than initially assumed, and not high enough to justify forgoing the significant efficiencies realized by unproctored testing (Tippins, 2015). Commercial test publishers and assessment providers have also developed several strategies to address issues of cheating and test security, ranging from regular monitoring for item piracy and systematic, proctored retesting of test takers (Burke, 2008) to remote, video-based proctoring and biometric test taker identification (Foster, 2008), as well as algorithmic identity monitoring via means such as keystroke analysis and facial, voice, and even palm/knuckle recognition. We are certain that for large-scale assessments, such trends will soon become the everyday reality, dictated by demands for more streamlined assessment procedures from applicants and organizations alike (see also Chapter 39, this volume).

The challenges posed by remote, unproctored cognitive ability assessment will need to be addressed by a more intense collaboration of scientists and practitioners, as well as by drawing on expertise from outside of the psychological domain. The challenges are worth tackling, because the utility gains of expanding testing programs to larger numbers of test takers earlier in the hiring process, as well as improvements in fairness gained from reaching additional and unique applicant populations, are likely to outweigh the costs (see, for example, the simulations provided by Landers & Sackett, 2012). Organizations and providers that invest in the appropriate know-how and technology are increasingly at the forefront of major developments in cognitive ability measurement.
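Computerized adaptive testing, mentioned above in the context of large-scale unproctored programs, rests on a simple core loop: administer the unused item that is most informative at the current ability estimate, score the response, and update the estimate. The Python sketch below illustrates that loop under a two-parameter logistic IRT model; the item bank, the grid-based maximum-likelihood update, and the fixed test length are illustrative assumptions, not any publisher's operational algorithm.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 2PL item bank: discrimination a, difficulty b.
a = rng.uniform(0.8, 2.0, size=200)
b = rng.normal(0.0, 1.0, size=200)

def p_correct(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of an item at ability theta."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1 - p)

grid = np.linspace(-4, 4, 161)       # crude grid for a likelihood-based ability estimate
log_lik = np.zeros_like(grid)
true_theta = 0.8                     # simulated test taker
administered = []
theta_hat = 0.0

for _ in range(20):                  # fixed-length 20-item adaptive test
    info = item_information(theta_hat, a, b)
    info[administered] = -np.inf     # never reuse an item
    j = int(np.argmax(info))         # maximum-information item selection
    administered.append(j)
    x = rng.random() < p_correct(true_theta, a[j], b[j])   # simulated response
    p_grid = p_correct(grid, a[j], b[j])
    log_lik += np.log(p_grid if x else 1 - p_grid)
    theta_hat = grid[np.argmax(log_lik)]                   # update ability estimate

print(f"True theta: {true_theta:+.2f}, estimated theta: {theta_hat:+.2f}")
```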

CRITERION-RELATED VALIDITY EVIDENCE

The job relatedness and usefulness of standardized cognitive ability tests in employee selection have been documented in dozens of quantitative reviews in the form of publications and technical reports, incorporating more than 1,300 meta-analyses summarizing results from more than 22,000 primary studies. The total sample size of job applicants and employees providing data for these validation studies is well in excess of 5 million individuals (Ones, 2004; Ones & Dilchert, 2004). The question of whether cognitive ability tests are useful predictors of performance in occupational settings has been definitively answered: yes, they are excellent predictors of training performance and job performance. In fact, no other predictor construct in employee selection produces validities as high, as consistently, as does cognitive ability. In addition, no assessment method has so far achieved as reliable an assessment of cognitive ability as standardized tests, making such tests the ideal choice for predicting performance in organizational settings. Meta-analyses of cognitive ability test validities have been tabulated and summarized by Ones, Viswesvaran, and Dilchert (2005), Dilchert (in press), and Salgado (in press). In this section we


provide an overview of conclusions from these quantitative reviews. Readers interested in the specific meta-analyses supporting each conclusion are encouraged to review these chapters.

Cognitive Ability Tests Predict Learning, Acquisition of Job Knowledge, and Job Training Performance with Outstanding Validity (Operational Validities in the .50 to .70 Range)

Validities for training criteria generalize across jobs, organizations, and settings. Meta-analyses provide voluminous evidence of high validity for training success in military and civilian organizations. Operational validities (correlations corrected for attenuation due to unreliability in criterion measures and range restriction, where applicable) are highest for general mental ability and specific quantitative and verbal abilities, and somewhat lower for memory (although still highly useful with a sample-size-weighted operational validity of .46). Validities are moderated by job complexity. The greater the complexity of jobs being studied, the higher the validity of cognitive ability tests in predicting training performance (Hunter & Hunter, 1984; Salgado, Anderson, Moscoso, Bertua, de Fruyt, & Rolland, 2003; Ziegler et al., 2011). Superior validities of cognitive ability tests for learning are in line with findings that cognitive ability is the strongest determinant of knowledge acquisition, in this case acquisition of job knowledge (Schmidt et al., 1986). The more complex jobs are, the more complex and vast the knowledge to be acquired. Brighter individuals learn more quickly, learn more, and can acquire more complex knowledge with ease.
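Operational validities as defined above are observed correlations corrected for unreliability in the criterion and, where applicable, for range restriction. A minimal Python sketch of these two standard psychometric corrections follows; the input values are illustrative assumptions, and the order and exact form of corrections (e.g., direct versus indirect range restriction) vary across the meta-analyses summarized in this chapter.

```python
import math

def correct_criterion_unreliability(r_obs, ryy):
    """Disattenuate an observed validity for unreliability in the criterion only."""
    return r_obs / math.sqrt(ryy)

def correct_direct_range_restriction(r_restricted, u):
    """Thorndike Case II correction for direct range restriction.

    u = SD of the predictor in the restricted (incumbent) group divided by
        its SD in the unrestricted (applicant) group.
    """
    U = 1.0 / u
    return (U * r_restricted) / math.sqrt(1 + (U**2 - 1) * r_restricted**2)

r_obs = 0.25   # observed validity in an incumbent sample (illustrative)
ryy = 0.52     # criterion (supervisory rating) reliability, an often-used value
u = 0.65       # illustrative range restriction ratio

r_c = correct_criterion_unreliability(r_obs, ryy)
rho = correct_direct_range_restriction(r_c, u)
print(f"observed r = {r_obs:.2f} -> criterion-corrected = {r_c:.2f} -> operational validity = {rho:.2f}")
```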

Cognitive Ability Tests Predict Overall Job Performance with High Validity (Operational Validities in the .35 to .55 Range)

Table 11.1 summarizes the potential moderators of cognitive ability test validity in employment settings, indicating those supported and those rejected on the basis of meta-analyses, as well

TABLE 11.1  Hypothesized Moderators of Cognitive Ability Test Validities

Yes: Confirmed Moderators
  Job complexity
  Criterion predicted (training performance, job performance, leadership, turnover, etc.)
  Cognitive ability construct assessed (GMA, verbal ability, memory, etc.)

No: Rejected Moderators
  Situational variables
  Organizational setting
  Race (African Americans^a, Hispanics^a)
  Sex^b
  National setting and culture (except for some countries)
  Military/civilian setting
  Validation design (concurrent/predictive)
  Length of time on the job (up to 5 years)
  Method of criterion measurement (e.g., ratings, production quantity, work samples)

?: Moderating Effect Unknown
  Time of study (historical age)
  Age
  Race (Asian Americans, Native Americans)

a Meta-analytic evidence for race and comparisons between Caucasians and African Americans as well as Hispanic/Latino Americans in civilian employment suggests operational validities for Whites may be .02 to .04 correlational points higher; differences in military settings were found to be somewhat higher.
b Meta-analytic evidence suggests operational validities for men may be negligibly higher.


Deniz S. Ones et al. as those awaiting investigation. Validities for overall job performance criteria generalize across jobs, organizations, and settings. Support for these key conclusions comes from meta-analyses of studies using narrow job groupings (e.g., mechanical repair workers, first-line supervisors, health technicians, computer programmers, lawyers, retail sales personnel, firefighters), broad job groupings (e.g., clerical jobs, law enforcement, maintenance trades), and heterogeneous job groupings (e.g., by job complexity). Individual large sample studies (e.g., Project A) also point to the same conclusions. Operational validities are highest for general mental ability and quantitative abilities and somewhat lower for memory (although still useful with a sample-size-weighted operational validity of .39 across 12 different meta-analyses; Ones & Dilchert, 2004). The method of performance measurement employed (objective vs. subjective) does not lead to different conclusions about the usefulness of cognitive ability tests, and different indices of performance (rankings, ratings, etc.) produce similar operational validities. Job complexity also moderates the validities of cognitive ability tests for predicting job performance. Higher validities are found for jobs of higher complexity. Although content validation has gained popularity in recent years, validity generalization studies have clearly demonstrated that matching specific cognitive abilities to aspects of task performance deemed important in a given job is not necessary. Basing selection systems on one or two specific abilities based on content validity evidence can be expected to result in lower levels of learning, less new job knowledge acquisition, and poorer adaptation to changing work environments (Ones, 2016). In sum, it is remarkable that even when moderators have been reported for cognitive ability test validity, they do not result in validities reversing direction or shrinking to negligible levels in magnitude. Useful levels of validity are found even for the more specific cognitive abilities and for lowest levels of job complexity. A multitude of additional variables have been tested as potential moderators of validity in the meta-analyses reviewed in Ones et al. (2005), and most can be dismissed based on empirical evidence. These include organizational setting, method of criterion measurement (ratings, rankings, etc.; Nathan & Alexander, 1988), sex (see below), validation study design (concurrent vs. predictive; Barrett, Phillips, & Alexander, 1981), and length of time on the job (experience up to five years; Schmidt, Hunter, Outerbridge, & Goff, 1988). In sum, cognitive ability test validity does not vary substantially and systematically across organizational settings or for most subgroups that have been examined. Concurrent validities approximate predictive validities, and cognitive ability tests show no declines in validity as workers gain experience. There are, nonetheless, still some potential moderators waiting to be tested in large-scale, representative studies, or more systematically or thoroughly investigated using meta-analytic approaches. Studies of Asians and Native Americans as well as older adults are notably absent from the I-O psychology literature (see below for more details). 
Although we know a great deal about the validity of cognitive ability tests for predicting training, task, and overall job performance criteria, knowledge of how cognitive ability relates to other aspects of work behavior (e.g., organizational citizenship) has been limited. Initial intriguing findings were reported by Alonso, Viswesvaran, and Sanchez (2008), who found that cognitive ability correlated more highly with contextual performance than did personality factors. In a recent meta-analysis of 43 studies, Gonzalez-Mule, Mount, and Oh (2014) reported a mean true-score correlation of .24 between cognitive ability and supervisor-rated organizational citizenship behaviors (the validity was corrected for indirect range restriction on the predictor measure and unreliability in both the predictor and criterion measures). Investigations of cognitive ability validities for counterproductive work behaviors (CWB) have also been reported. A large predictive validity study relating a cognitive ability measure to counterproductive behaviors indicated that intelligent individuals avoid engaging in organizational and interpersonal deviance on the job (Dilchert, Ones, Davis, & Rostow, 2007). A meta-analysis of 16 studies with non-self-report CWB criteria pointed out the need to better understand the CWB criterion domain (Gonzalez-Mule et al., 2014). More research into correlates of cognitive ability outside of the traditional job performance domain (e.g., adaptive performance, employee green behaviors) would be welcome.

So far, there have been no large-scale investigations of cognitive ability test validity across time (i.e., has validity for cognitive ability tests in general changed over time?). Labor force changes paired with progressive changes in the way work is done in many fields (more complex


Cognitive Ability processes, greater use of technology) call for renewed inquiries. Of course, given the current reward system in academia and the overemphasis on theory that favors small-scale studies of new effects over large-scale replications (cf. Campbell & Wilmot, in press), new academically based research on cognitive ability test validities is unlikely (Salgado, in press). One hypothesis that we would like to offer is that cognitive ability tests today have greater validity than half a century ago. As job roles and tasks for jobs in most sectors change over time to include more complex tools (e.g., computers), processes (e.g., virtual teamwork), and requirements (e.g., multiple languages), the power of general mental ability as the basic learning skill in predicting performance may increase substantially, especially in mid- to high-complexity jobs. Another change that has put increasing demand on individuals and organizations over the last two decades or so is internationalization. We already know that organizations, small and large, compete for customers in a global economy. However, in a time when mobility—real and virtual—is greater than ever in humanity’s history, organizations now also compete internationally for their labor force or are faced with mobile labor forces. Validity of cognitive ability tests has been studied in international contexts, most extensively in Europe (Hülsheger, Maier, & Stumpp, 2007; Salgado, Anderson, Moscoso, Bertua, & de Fruyt, 2003; Salgado, Anderson, Moscoso, Bertua, de Fruyt, & Roland, 2003), but also in Asia (Takahasi & Nimura, 1994; Nimura, Imashiro, & Naito, 2000; Oh, 2010; Lee, 2005). Table 11.2 summarizes the results of these international meta-analyses for job performance. Findings are mostly parallel to those from the United States: Cognitive ability tests show substantial validity, and higher validities are found for higher-complexity jobs. Moreover, highest validities are found for general mental ability rather than specific abilities (Salgado, Anderson, Moscoso, Bertua, & de Fruyt, 2003). The only international exception to the pattern of very strong validities seems to exist in Japan. One explanation that has been put forth for this finding is that job performance evaluations in this context often emphasize non-task-performance aspects, such as citizenship behaviors, more strongly than in other cultures (Nimura et al., 2000; Salgado, in press). However, another

TABLE 11.2  Validity of Cognitive Ability Tests for Predicting Job Performance in International Contexts

                                      k      N        r      ρ
Across European countries
  High-complexity jobs                14     1,604    .23    .64
  Medium-complexity jobs              43     4,744    .27    .53
  Low-complexity jobs                 12     864      .25    .51
Analyses by country
  Belgium and the Netherlands         15     1,075    .24    .63
  France                              26     1,445    .48    .64
  Germany                             8      746      —      .53
  Japan                               126    26,095   .20    .33
  South Korea                         8      1,098    —      .57
  Spain                               11     1,182    .35    .64
  United Kingdom                      68     7,725    .26    .56

N = total number of subjects; k = number of studies summarized in meta-analysis; r = sample size weighted mean observed correlation; ρ = operational validity, corrected only for sampling error and attenuation due to unreliability in the criterion.
Sources: Data across European countries summarized from Salgado, J. F., Anderson, N., Moscoso, S., Bertua, C., de Fruyt, F., & Rolland, J. P., Journal of Applied Psychology, 88, 1068–1081, 2003. Data for individual European countries except Germany summarized from Salgado, J. F., & Anderson, N., European Journal of Work and Organizational Psychology, 12, 1–17, 2003. Data for Germany summarized from Hülsheger, U. R., Maier, G. W., & Stumpp, T., International Journal of Selection and Assessment, 15, 3–18, 2008. Data for Japan based on three different meta-analyses, synthesized by Salgado (in press); data for South Korea based on two meta-analyses also synthesized by Salgado.
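The r and ρ columns in Table 11.2 come from psychometric meta-analyses. As a rough sketch of the bare-bones step that underlies such summaries—a sample-size-weighted mean correlation and an estimate of how much observed variance is attributable to sampling error—the following Python snippet uses hypothetical primary-study inputs, not the data behind Table 11.2:

```python
# Hypothetical primary-study correlations and sample sizes.
studies = [(0.21, 150), (0.30, 95), (0.24, 310), (0.18, 220), (0.33, 60)]

n_total = sum(n for _, n in studies)
r_bar = sum(r * n for r, n in studies) / n_total                   # sample-size-weighted mean r
var_obs = sum(n * (r - r_bar) ** 2 for r, n in studies) / n_total  # weighted observed variance

# Expected sampling-error variance (Hunter & Schmidt bare-bones approximation).
n_avg = n_total / len(studies)
var_error = (1 - r_bar**2) ** 2 / (n_avg - 1)

var_residual = max(var_obs - var_error, 0.0)
print(f"mean r = {r_bar:.3f}, observed var = {var_obs:.4f}, "
      f"sampling-error var = {var_error:.4f}, residual var = {var_residual:.4f}")
```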


Deniz S. Ones et al. explanation might be that nearly half of the data summarized in the two available Japanese meta-analyses were collected with one specific ability test. Although the criterion-related validity of cognitive ability tests for predicting job performance is among the most established findings in applied psychology, there is room for additional research in specific cultural regions. Several practical issues are noteworthy. First, it is often argued that educational requirements serve as proxies for cognitive ability. This argument suggests that using a cognitive ability test would not be necessary if a screening based on educational credentials were in place. There are two flaws in this line of reasoning. Educational qualifications of applicants to the same jobs are very similar, or at least more homogenous than those of the population at large. Conversely, even among those who hold advanced degrees (e.g., doctoral, medical, and law degrees), there is still substantial variability in cognitive ability (Sackett & Ostgaard, 1994; Wonderlic Inc., 2002), indicating room for utility to be derived from cognitive testing. Findings of highest predictive validities for the most complex jobs (e.g., lawyers, physicians) underscore the usefulness of cognitive tests even when there is homogeneity in the educational level and credentials of job applicants. If, for some reason, individuals of mixed educational level were to apply for the same job, a cognitive ability test is a more precise, valid, and efficient selection tool to use (Berry, Gruys, & Sackett, 2006). It is also often suggested that beyond a certain required level of ability, cognitive capacity does not contribute to performance. Such arguments essentially suggest a nonlinear relationship between cognitive ability and performance: Cognitive ability predicts up to a certain point on the trait continuum, but validity drops off beyond that. The data that have been brought to bear on this question seem to tell the opposite story (Arneson, 2007; Coward & Sackett, 1990; Ghiselli & Kahneman, 1962; Hawk, 1970; Tiffin & Vincent, 1960). An issue commonly encountered in investigations of nonlinearity, however, is the lack of sensitivity of such investigations at the parts of the trait continuum that actually matter (in this case, high ability levels). This is due to both the sensitivity of the ability measures employed as well as the number of (extremely) high-ability individuals in the respective data sets. However, large-scale investigations that address these issues now exist. Arneson, Sackett, and Beatty (2011) provided a particularly strong illustration for the case of ability tests used in academic admissions decisions. Not only did these authors find no evidence for the “good-enough” hypothesis on ability– performance relationships, but they also showed that at high ends of the ability spectrum, the relationship with performance was typically stronger. For noneducational achievement criteria, Wai, Lubinski, and Benbow (2005) had previously shown that validity remains high even among the most extremely talented individuals. Altogether, this evidence suggests that if any nonlinear ability–performance relationships exist, they are more likely to be characterized by an exponential curve rather than an asymptotic relationship.
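Tests of the “good-enough” hypothesis discussed above usually ask whether a nonlinear (e.g., quadratic) ability term adds anything beyond the linear ability–performance relationship. A minimal Python sketch on simulated data follows; the true relationship is generated as linear here, so the quadratic term should add essentially nothing, and all values are illustrative rather than drawn from the studies cited above.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000
ability = rng.normal(size=n)
performance = 0.5 * ability + rng.normal(size=n)   # purely linear ability-performance link

def fit_r2(X, y):
    """R^2 from an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_linear = fit_r2(ability.reshape(-1, 1), performance)
r2_quadratic = fit_r2(np.column_stack([ability, ability**2]), performance)
print(f"R^2 linear = {r2_linear:.4f}, R^2 with quadratic term = {r2_quadratic:.4f}, "
      f"gain = {r2_quadratic - r2_linear:.4f}")    # near zero when the relationship is linear
```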

GROUP DIFFERENCES ON COGNITIVE ABILITY MEASURES

One of the greatest points of concern in using cognitive ability measures in the United States is the potential for adverse impact. In this section, we review mean group differences on cognitive ability measures and discuss their implications for adverse impact. We also review findings regarding predictive fairness and discuss group differences in international contexts.

Group Difference in Central Tendency and Dispersion

In the United States, Title VII of the Civil Rights Act prohibits employment discrimination on the basis of race, color, religion, sex, and national origin, and the Age Discrimination in Employment Act addresses age discrimination. Historically disadvantaged racial and ethnic groups in the United States are African Americans (Blacks), Hispanic/Latino Americans, Asian Americans, and Native Americans/Pacific Islanders (see Chapter 29, this volume, for an international perspective). Women and older adults have also been historically disadvantaged in many


contexts. If selection decisions result in selection ratios for subgroups of protected classes that are less than 80% of those for the better-performing group, presence of adverse impact is concluded, and the burden of proof shifts to the employer to establish the job relatedness of the selection tools utilized, typically using criterion-related validity evidence (see also Chapter 28, this volume). However, it is important to remember that adverse impact (or lack thereof) in the employee selection process is the result of a selection system and not only a single test. That is, adverse impact is the end result of the magnitude of group differences, selection ratios, use of different selection tools in combination, and the manner in which scores are combined and utilized. Sackett and Roth (1996) used a series of Monte Carlo simulations to investigate the effects of multistage selection strategies on minority hiring. The important features of selection systems that contributed to the level of minority hiring included subgroup differences on the predictors, intercorrelations among the predictors in the selection system, the overall selection ratio, and the selection strategy used (i.e., top-down, hurdle, etc.).

For cognitive ability tests, group differences have been examined in dozens of primary studies and have been meta-analytically summarized. In these studies, the measure of group differences is typically Cohen’s d, which expresses the differences between the means of two groups in terms of standard deviation units. In meta-analyzing these effect sizes, d values from individual studies are pooled and averaged to obtain an overall effect size that reflects the magnitude of group differences in the population at large. Corrections for unreliability in cognitive ability measures are typically not applied, because selection decisions are based on observed scores.3 In general, d values of .80 or greater are considered large effects, those around .50 are moderate, and those below .20 are small (Cohen, 1977). (From a theoretical perspective, d values under .20 are often trivial; however, under extreme conditions, such as when the majority group selection ratio is under 1%, even small differences in the .10 to .20 range can lead to violation of the four-fifths rule and thus constitute adverse impact.)

In the I-O psychology literature, it is widely believed and reported that sex differences in cognitive ability are nonexistent (e.g., see Table 1 in Ployhart & Holtz, 2008). Table 11.3 offers a more precise and detailed view in summarizing sex differences on cognitive variables. Differences in verbal and mathematical abilities are negligible. Women score moderately higher than men on one particular verbal ability marker—speech production. Largest sex differences are found on visual-spatial measures such as mental rotation and spatial perception (meta-analytic d values in favor of men are in the .40 to .70 range) as well as figural reasoning and technical aptitude (Irwing & Lynn, 2005; Lynn & Irwing, 2004; Schmidt, 2011). Thus, given selection ratios of 50% or lower for men, cognitive tests with visual-spatial items or technical aptitude questions (e.g., mechanical comprehension, electronics information) can result in adverse impact against women.
However, when general mental ability scores (extracted from a battery of different tests) are considered, sex differences have been shown to be either negligible (Colom, Juan-Espinosa, Abad, & García, 2000; Deary, Irwing, Der, & Bates, 2007) or favoring females to a small degree (Keith, Reynolds, Patel, & Ridley, 2008). Thus, organizations concerned with gender diversity would be better off including general mental ability tests over tests of those specific abilities in their assessment systems. Underrepresentation of women in science, technology, engineering, and math (STEM) fields has spurred inquiries into whether men’s and women’s variances are comparable on cognitive tests. If men are more variable than women, groups of individuals selected may reflect greater proportions of men at the high end of the ability distribution, even if there are no mean subgroup differences. Hyde’s (2014) summary of the literature indicates male-to-female variance ratios in the 1.03–1.16 range for verbal abilities, 1.05–1.20 range for mathematics abilities, and 1.27 for spatial ability. The 27% greater variability among men on spatial abilities may partly explain lower female high achievement and accomplishments in STEM fields (Lubinski, 2010). More so than sex differences in cognitive ability, race and ethnic group differences have consumed attention in employee selection research and practice, especially in the North American context (Hough, Oswald, & Ployhart, 2001). Table 11.4 summarizes race and ethnic group differences on cognitive ability based on the largest meta-analysis of the employment literature (Roth, Bevier, Bobko, Switzer, & Tyler, 2001). On average, Blacks score 1.00 and Hispanics .83
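The interplay of the four-fifths rule, subgroup mean differences, and selection ratios described earlier in this section can be made concrete with a small calculation that assumes normally distributed scores in both groups and a single top-down cutoff—a simplification in the spirit of, but not identical to, the Sackett and Roth (1996) simulations. All values in the Python sketch below are illustrative.

```python
from statistics import NormalDist

def selection_ratios(d, majority_sr):
    """Minority selection ratio and impact ratio for two normal groups under one cutoff.

    d = standardized mean difference (majority minus minority, in pooled SD units).
    majority_sr = proportion of the majority group selected under top-down selection.
    """
    norm = NormalDist()
    cutoff = norm.inv_cdf(1 - majority_sr)     # cutoff expressed on the majority group's scale
    minority_sr = 1 - norm.cdf(cutoff + d)     # minority mean sits d SD units lower
    return minority_sr, minority_sr / majority_sr

for d in (0.2, 0.5, 0.8, 1.0):
    minority_sr, ai_ratio = selection_ratios(d, majority_sr=0.30)
    flag = "adverse impact flagged" if ai_ratio < 0.80 else "passes 4/5 rule"
    print(f"d = {d:.1f}: minority SR = {minority_sr:.3f}, impact ratio = {ai_ratio:.2f} ({flag})")
```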


TABLE 11.3  Meta-Analyses of Sex Differences on Cognitive Ability Measures

Cognitive Variable                         Study                              k      d
Vocabulary                                 Hyde & Linn (1988)                 40     −.02
Reading comprehension                      Hyde & Linn (1988)                 18     −.03
Speech production                          Hyde & Linn (1988)                 12     −.33
Mathematics computation                    Hyde, Fennema, & Lamon (1990)      45     −.14
Mathematics concepts                       Hyde, Fennema, & Lamon (1990)      41     −.03
Mathematics problem solving                Hyde, Fennema, & Lamon (1990)      48     .08
Spatial perception                         Linn & Petersen (1985)             62     .44
Spatial perception                         Voyer, Voyer, & Bryden (1995)      92     .44
Mental rotation                            Linn & Petersen (1985)             29     .73
Mental rotation                            Voyer, Voyer, & Bryden (1995)      78     .56
Mental rotation                            Maeda & Yoon (2012)                70     .57
Mental rotation—untimed tests              Voyer (2011)                       23     .51
Mental rotation—short time limits          Voyer (2011)                       7      1.03
Mental rotation—long time limits           Voyer (2011)                       6      .85
Spatial visualization                      Linn & Petersen (1985)             81     .13
Spatial visualization                      Voyer, Voyer, & Bryden (1995)      116    .19
Figural reasoning (matrices)               Lynn & Irwing (2004)               10     .30
Figural reasoning (matrices—standard)      Irwing & Lynn (2005)               10     .10
Figural reasoning (matrices—advanced)      Irwing & Lynn (2005)               11     .20

k = number of studies summarized in meta-analysis; d = standardized group mean score difference. Positive effect sizes indicate males scoring higher on average.

TABLE 11.4  Race and Ethnic Group Mean Score Differences in General Mental Ability Among Job Applicants

Group Comparison    Setting                           Job Complexity              N          k     d
White–Black         Industrial                        Across complexity levels    375,307    11    1.00
                    Industrial (within-job studies)   Low                         125,654    64    .86
                    Industrial (within-job studies)   Moderate                    31,990     18    .72
                    Industrial (within-job studies)   High                        4,884      2     .63
                    Military                          Across complexity levels    245,036    1     1.46
White–Hispanic      Industrial                        Across complexity levels    313,635    14    .83
                    Military                          Across complexity levels    221,233    1     .85

k = number of studies summarized in meta-analysis; d = standardized group mean score difference; N = total sample size. Positive effect sizes indicate Whites scoring higher on average.
Source: Data from Tables 2, 4, and 7 of Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., & Tyler, P., Personnel Psychology, 54, 297–330, 2001.

standard deviation units lower than Whites on general mental ability (GMA) measures used in employee selection. Group differences on measures used in military settings are somewhat larger, especially for the White–Black comparison. One explanation for this finding could be the greater heterogeneity among military job applicants. Cognitive ability differences between Black and White applicants to high-complexity jobs are smaller than among applicants to


lower-complexity jobs, most likely because of severe self-selection as well as higher minimum requirements with regard to educational credentials. Among applicants to medium- and low-complexity jobs, White–Black and White–Hispanic differences in cognitive ability tests are large and almost certain to result in adverse impact if cognitive ability were the only predictor used in employee selection. This finding is at the root of the validity-diversity dilemma that most U.S. organizations face today (Kehoe, 2008; Kravitz, 2008; Ployhart & Holtz, 2008; Potosky, Bobko, & Roth, 2008; Sackett, De Corte, & Lievens, 2008). The situation is slightly better among applicants to high-complexity jobs, in which group mean-score differences in cognitive ability are only moderate (d = .63), and thus carry slightly less severe implications for adverse impact.

Data on Asian American–White and Native American–White cognitive ability differences among job applicants are scant. Ability profiles and subgroup differences for Asian Americans and Native Americans remain mostly uninvestigated, especially when the job applied to is held constant (i.e., within-job examinations). The broader psychological literature indicates slightly higher scores among Asian Americans compared with Whites (Gottfredson, 1997), but again, systematic data on job applicants are scarce. The response categories used for demographic data collection in psychological research often subsume individuals from very heterogeneous race and ethnic backgrounds in a single category, which complicates comparisons, especially White–Asian comparisons. (As an illustration: the most recent U.S. census lists 14 national and ethnic categories of interest as well as one residual “other” category; 7 of the 15 categories represent Asian groups, and in total, 57 multi-race combinations could possibly be endorsed; “The Asian Population: 2010,” U.S. Census Bureau, 2012.) The educational literature reports substantially lower scores among Native Americans when compared with Whites (Humphreys, 1988). We were able to locate only one study that compared Native North Americans and Whites in a job context (Vanderpool & Catano, 2008). In this study, individuals from Canadian Aboriginal Peoples scored much lower on verbal ability tests than on nonverbal tests.

It is important to stress that although race and ethnicity are protected categories in the United States, the characteristics that define disadvantaged groups elsewhere are diverse (cf. Myors et al., 2008). Furthermore, constructs such as race and ethnicity are also often confounded with national origin or immigrant/refugee status. Cognitive ability test scores of disadvantaged groups around the globe remain largely unstudied in employee selection settings, although exceptions can be found in a handful of countries where race, ethnicity, or immigrant status have been examined (e.g., Australia, Israel, New Zealand, South Africa, Taiwan, Turkey, and the Netherlands). This research appears to point to consistently lower scores of disadvantaged groups (e.g., Aborigines in Australia, Canada, New Zealand, and Taiwan; Blacks in South Africa; immigrants in the Netherlands and Sweden; Sackett & Shen, 2008; Salgado, in press). Ongoing mass refugee movements, resulting in more than 60 million refugees worldwide (United Nations High Commissioner for Refugees, December 2015), will offer I-O psychologists and societies around the world both opportunities and challenges.
If cognitively oriented assessments are utilized in assessing and placing refugees into jobs, potential subgroup differences must be attended to. Large-scale international comparisons using the same cognitive ability test in employment settings are rare, if not nonexistent. The first two authors of this chapter were involved in content development for a computer adaptive figural reasoning test for use in employee selection around the globe (Dilchert & Ones, 2007). The data from hundreds of thousands of applicants allow us an extraordinary global look at group differences on the same nonverbal reasoning measure. Among the nearly 60,000 U.S.-based applicants who completed the tests in the first few months after implementation, group differences were in the expected direction, with all minority groups except Asian Americans scoring lower than Whites. However, the group differences were smaller than those typically observed. When analyzed on the country level, the 10 countries in which job applicants scored highest on average were Southeast Asian (4) and European (6). An analysis of the data by cultural clusters revealed Confucian Asia and southern and northern Europe scoring higher than other cultural regions. Regardless of the underlying mechanisms, observed differences between applicants from different cultural regions applying to the same organization present organizations with both a challenge and an opportunity to actively shape their workforces on the basis of diversity and talent goals.


Deniz S. Ones et al. One area of group differences that has received little attention in the adverse impact literature is that of cognitive ability differences between younger and older adults. One large-scale examination of cognitive abilities across the working life span (Avolio & Waldman, 1994) offers some insights into the magnitudes of age-related declines in cognitive ability. Avolio and Waldman (1994) reported mean scores of 25,140 White, Black, and Hispanic job applicants who had taken the General Aptitude Test Battery (GATB) of the U.S. Employment Service and broke these scores down by age groups. When computing d values based on these data, one notices that age group differences in cognitive ability (both general and specific) start as early as age 35, but they become particularly notable for the 45–55 and 55–65 age groups (differences to the 20–34 year comparison groups range from approximately .80 to 1.5 standard deviation units). Verbal ability, however, shows the smallest declines across older age groups. Investigations of age differences among job applicant samples are still rare in the scholarly literature. One recent exception is the work by Klein, Dilchert, Ones, and Dages (2015), who reported age-differences among job applicants to managerial and executive positions, and did so separately for different ability tests. Their results, which also generalized in two representative, longitudinal U.S. general population samples, showed that certain crystallized verbal abilities actually increased over individuals’ working lives, but that declines in general mental ability, as well as (most drastically) inductive reasoning, are notable as early as the early forties. Table 11.5 summarizes these findings. When cognitive ability measures are used in employee selection, younger applicants generally stand to get selected at greater rates than older applicants. The disparity in selection ratios can be particularly severe if applicant pools include individuals from the entire age spectrum of adults. However, we now know that the choice of ability test matters. Group differences (and thus the threat of adverse impact) are less severe on general mental ability tests/scores compared to fluid ability or inductive reasoning. Moreover, some crystallized verbal abilities might even present an advantage for older adults. But it is important to recall that crystallized verbal ability differences are largest in race/ethnic group comparisons (see above). Organizations that are concerned with diversity should closely examine both the demographic makeup of their applicant pools and the new knowledge acquisition, adaptability, and verbal ability requirements of specific jobs to strategically address age and race/ethnic diversity through proper test choice. Evidence from the individual differences literature suggests that rates of cognitive decline are slower for those who have higher initial baseline ability (Deary, MacLennan, & Starr, 1998), higher levels of education (Deary et al., 1998; Rhodes, 2004), and those employed in complex or enriched jobs that presumably use their cognitive abilities to a greater extent (Schooler, Mulatu, & Oates, 1999). In the future, the aging workforces of most industrialized countries will certainly necessitate greater attention to the consequences of cognitive ability test use for workforce diversity with regard to age.

Validity Differences and Predictive Bias

Thus far, we have discussed only group mean score differences on cognitive ability tests. Another salient issue is that of differential validity and differential prediction. Differential validity refers to differences in criterion-related validity coefficients of various subgroups. It indicates the degree to which the pre-employment test similarly/differentially relates to a given criterion. While differential validity compares the magnitudes of criterion-related validities between groups of interest, differential prediction simultaneously compares slopes and intercepts of regression lines for such groups. A healthy body of literature in employee selection has led to the conclusion that there is no predictive bias against Blacks in the United States (Rotundo & Sackett, 1999; Schmidt, 1988). Hunter, Schmidt, and Hunter (1979) and Schmidt, Pearlman, and Hunter (1980) have quantitatively summarized dozens of validation studies using the GATB with Blacks and Hispanics, respectively. Hunter et al.’s (1979) analysis demonstrated that, on average, validities for Whites were .01 correlational points higher than those for Blacks in predicting objective performance criteria and .04 correlational points higher for predicting subjective ratings of job performance (k = 866 non-independent validity pairs).
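Differential prediction, as defined above, is usually examined with a moderated (Cleary-type) regression: the criterion is regressed on test scores, group membership, and their interaction, with the group term indexing intercept differences and the interaction term indexing slope differences. Below is a minimal Python sketch on simulated data in which the same regression line holds for both groups; all names and values are illustrative, not a reanalysis of the studies cited here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
group = rng.integers(0, 2, size=n)              # 0 = reference group, 1 = focal group
test = rng.normal(size=n) - 0.5 * group         # simulated mean score difference between groups
performance = 0.5 * test + rng.normal(size=n)   # identical test-performance line for both groups

# Cleary model: performance on test, group, and test x group.
X = np.column_stack([np.ones(n), test, group, test * group])
beta, *_ = np.linalg.lstsq(X, performance, rcond=None)
intercept, slope, group_effect, interaction = beta
print(f"common slope = {slope:.3f}")
print(f"intercept difference (group term)   = {group_effect:+.3f}")  # near 0 here: no intercept bias
print(f"slope difference (interaction term) = {interaction:+.3f}")   # near 0 here: no slope bias
```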


TABLE 11.5  Age Differences in Cognitive Ability

Cognitive Variable      Age Group Comparison    N        d

U.S. job applicants who completed the General Aptitude Test Battery (a)
Reference group: 20–34 years (N = 13,746)
GMA                     35–44                   4,305    0.33
                        45–54                   2,825    0.55
                        55–65                   1,161    0.80
Verbal ability          35–44                   4,305    0.26
                        45–54                   2,825    0.35
                        55–65                   1,161    0.49
Numerical ability       35–44                   4,305    0.36
                        45–54                   2,825    0.59
                        55–65                   1,161    0.71
Spatial ability         35–44                   4,305    0.30
                        45–54                   2,825    0.55
                        55–65                   1,161    0.88
Form perception         35–44                   4,305    0.56
                        45–54                   2,825    1.04
                        55–65                   1,161    1.53
Clerical ability        35–44                   4,305    0.40
                        45–54                   2,825    0.69
                        55–65                   1,161    0.90

Job applicants to executive positions who completed multiple cognitive ability tests (b)
Reference group: 20–34 years (N = 662)
GMA                     35–44                   1,167    0.12
                        45–54                   1,098    0.23
                        55–64                   371      0.32
Verbal ability          35–44                   1,167    −0.36
                        45–54                   1,098    −0.49
                        55–64                   371      −0.76
Figural reasoning       35–44                   1,167    0.09
                        45–54                   1,098    0.20
                        55–64                   371      0.36
Inductive reasoning     35–44                   1,167    0.51
                        45–54                   1,098    0.80
                        55–64                   371      1.03

N = sample size; d = standardized group mean-score difference. Positive effect sizes indicate younger individuals scoring higher on average; the reference group for computation of d values was 20–34 years of age for all effect sizes.
(a) Based on data presented in Table 3 of Avolio, B. J., & Waldman, D. A., Psychology and Aging, 9, 430–442, 1994. Means and standard deviations for the reference group were obtained by sample-size weighting means and pooling standard deviations for 20- to 24-year-old and 25- to 34-year-old age groups across race and ethnic groups.
(b) Based on data presented in Table 2 of Klein, R. M., Dilchert, S., Ones, D. S., & Dages, K. D., Journal of Applied Psychology, 5, 1497–1510, 2015. Group sample sizes were combined across subgroups to enable age-group comparisons consistent with those computed based on Avolio & Waldman data. d values were computed by sample-size weighting effects across the subgroups that were combined for each row.
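The notes to Table 11.5 describe combining subgroups by sample-size-weighting means and pooling standard deviations before computing d. A minimal Python sketch of those computations follows, using hypothetical summary statistics rather than the Avolio and Waldman (1994) or Klein et al. (2015) values:

```python
import math

def combine_means(m1, n1, m2, n2):
    """Sample-size-weighted mean, e.g., for merging two adjacent age subgroups."""
    return (m1 * n1 + m2 * n2) / (n1 + n2)

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def cohens_d(m_ref, sd_ref, n_ref, m_cmp, sd_cmp, n_cmp):
    """Standardized mean difference; positive when the reference group scores higher."""
    return (m_ref - m_cmp) / pooled_sd(sd_ref, n_ref, sd_cmp, n_cmp)

# Hypothetical summary statistics for a merged reference group (20-34) and an older group.
m_ref = combine_means(103.0, 2_400, 101.5, 1_900)   # merge two younger subgroups first
sd_ref, n_ref = 14.8, 4_300
m_old, sd_old, n_old = 95.5, 15.2, 1_160

print(f"d (reference vs. older group) = {cohens_d(m_ref, sd_ref, n_ref, m_old, sd_old, n_old):.2f}")
```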


In the 2010 edition of this Handbook, we stressed the need for an update of the existing literature on differential validity. Berry, Clark, and McClure (2011) presented an updated meta-analysis, incorporating previous data, reporting that in employment settings observed validities for Whites and Blacks differed by .03 (k = 143 independent studies incorporating data from 20,399 Whites and 10,350 Blacks). Observed validity differences reported in 93 military studies were drastic: White validities were double Black validities. Roth et al. (2014) suggested differential range restriction as a factor that clouds observed validity comparisons. In response, Berry, Cullen, and Meyer (2014) produced a new set of meta-analytic validity estimates corrected for differential range restriction. In civilian settings, range restriction values (u values) were .89 for Whites and .85 for Blacks. Differences in range-restriction-corrected validity remained at .03 correlational points (lower for Blacks). In military studies, much of the differential validity was concluded to be due to differential range restriction, as the range-restriction-corrected validity difference between Whites and Blacks shrank to .07 correlational points (White validity higher). Using up-to-date meta-analytic operational validity estimates, Berry and Zhao (2015) concluded that there is “strong evidence that cognitive ability tests generally overpredict job performance of African Americans” (p. 162).

Research on Hispanic Americans is meager. Across 1,128 pairs of validity coefficients from 19 studies, Schmidt et al. (1980) showed White validities on average to be .02 correlational points higher than Hispanic validities. Berry et al. (2014) reported that across 35 studies in civilian employment settings, appropriately range-restriction-corrected validity differences were .02 (Whites higher). Differential validity analyses were available only in educational settings, indicating .02 operational validity points higher for Whites. Differential validity of cognitive ability tests in organizational settings has not been reported for Asian and Native Americans in the peer-reviewed literature.

Rothstein and McDaniel (1992) reported an examination of differential validity by sex for cognitive ability tests. Using 59 pairs of male and female correlations (N = 5,517 and 9,428, respectively), they found observed validities to be on average .03 correlational points higher for women (validities corrected for range restriction and unreliability in the criteria were .05 correlational points higher). The higher validity for women was more marked in lower-complexity jobs and female-dominated occupations. In male-dominated occupations, the validity was higher for predicting performance among men. For the prediction of academic success, the reverse (but weak) pattern was established in a recent meta-analysis (Fischer, Schult, & Hell, 2013). We were unable to locate differential validity investigations for older versus younger adults. Future research should examine differential validity for hitherto unexamined groups in employment settings (Asians, Native Americans, older adults). Some exceptions notwithstanding (e.g., the Netherlands and South Africa), studies from other parts of the world (as well as those for other minority groups) are sparse and need to be conducted as well. We would like to stress that continued research on differential validity would be valuable.
Labor force participation and occupational distributions of women, Blacks, Hispanics, and a multitude of racial, ethnic, and religious groups are much different today than even 10–20 years ago. Changes in the nature of many jobs (e.g., greater complexity, greater technological demands) as well as changes in the social milieu in many organizations (e.g., emergence of workforce diversity as a core value, mass immigrations across the globe) may manifest themselves in cognitive ability–criteria relations. Research must also examine whether differential validity is found for criteria other than overall job performance. In our opinion, studies on organizational citizenship behaviors, task performance, and leadership criteria may constitute priorities. The only study that examined Black–White differential validity of a cognitive ability test for predicting nontraditional performance criteria investigated incidents of detected counterproductive behaviors (interpersonal and those targeted at the organization) and found no evidence of differential validity (Dilchert et al., 2007). However, replications of these results, as well as investigations among other minority groups, are certainly warranted.



FUTURE CHALLENGES FOR RESEARCH AND PRACTICE

In this chapter, we have identified specific areas in need of additional research attention as well as some challenges for the use of cognitive ability tests in applied settings. The high validity of cognitive measures makes them attractive for use in employee selection. Their ability to enhance productivity and offer substantial economic utility to organizations is indisputable. However, many applied psychologists are concerned, and understandably so, that various groups (e.g., Blacks, Hispanics/Latinos, other disadvantaged ethnic groups, and older applicants) on average score lower than majority-group applicants, often resulting in differential selection ratios across groups. Our literature is filled with suggestions on ways to reduce the likelihood of adverse impact. Thoughtful description and evaluation of the various proposed alternatives are not possible in this short chapter but are available in several papers (e.g., Campbell, 1996; De Corte, Lievens, & Sackett, 2007, 2010; Hough et al., 2001; Hunter & Schmidt, 1982; Ployhart & Holtz, 2008; Potosky, Bobko, & Roth, 2005; Sackett, Schmitt, Ellingson, & Kabin, 2001; Schmitt, Rogers, Chan, Sheppard, & Jennings, 1997). Frankly, we believe that structural and procedural proposals to reduce adverse impact are stopgap measures that are not sufficient for dealing with the profound group differences observed in occupational settings. Although the exact definition of what constitutes protected classes may differ, societies around the world are now facing similar issues—a fact corroborated by the adoption of antidiscrimination Directives 2000/43/EC and 2000/78/EC in the European Community (country-specific laws passed as a result still differ considerably with regard to prohibited grounds and protected classes; whereas some countries expanded the Directives by including color, national origin, or language, others are less broad in their protection). It will require the collective wisdom of scientists across disciplines to evaluate whether some group differences on individual differences traits can be reduced, and if so, how. In the meantime, I-O psychologists need to face the challenges that these group differences pose in applied settings. To this end, the responsibility is equally distributed among (a) scientists, who need to address the areas of concern summarized above; (b) test publishers, who need to continuously collect and make available data regarding group differences and predictive fairness of their tests; and (c) individual practitioners, who need to educate themselves on past and current research as well as its implications for their specific purpose.

Another, somewhat easier challenge is that of enhancing the acceptability of cognitive ability measures among applicants to high-complexity jobs. As Lubinski (2004) pointed out, cognitive ability can be assessed in all shapes and forms: “Variegated conglomerations of information and problem-solving content, not necessarily tied to an educational program, which may involve fresh as well as old learning (acquired in or out of school), may be used to assess general intelligence” (p. 98). However, it is our opinion that when new formats and approaches are used to address issues of applicant reactions and face validity, intellectual honesty still mandates an acknowledgment of the construct being measured.
The proliferation of “new” abilities and claims that such abilities are independent of traditional intelligence are insincere and harmful to the professional reputation of our field. “Gamification” of selection processes and tools might result in higher applicant engagement, but thorough assessment development and professional principles of measurement should not be abandoned in this pursuit. We have also observed that sometimes preconceived notions of cognitive test acceptability can cloud our judgment. Our work with nonverbal figural reasoning tests, arguably an item type that on the surface does not appear extraordinarily related to most real-world tasks, yielded some surprising findings. Data show that such items, especially when compared to those with verbal content, are received very positively by applicants. Although contextualization is certainly a viable method of achieving face validity, items need not always be contextualized to invoke positive applicant reactions.

EPILOGUE

This chapter aimed to offer a broad and forthright overview of cognitive ability tests and their use in employee selection. Other excellent overviews of the topic may be found in Drasgow (2003, especially with regard to structural issues); Ree, Carretta, and Steindl (2001, especially with regard to broader life correlates); Ones, Viswesvaran, and Dilchert (2004, 2005, especially with regard to validity for learning criteria); and Ones et al. (2012), Dilchert (in press), and Salgado (in press), especially with regard to criterion-related validity in organizational settings. Debates and exchanges over the use of cognitive ability tests in selection settings can also be found in special issues of Human Performance (Viswesvaran & Ones, 2002), Human Resource Management (Vol. 25[1]), and Industrial and Organizational Psychology (Vol. 12[5]).

Cognitive ability is the capacity to learn, solve problems, and adapt to environments. Abstract thinking and logical reasoning determine success in various life domains by allowing us not only to rely on skills acquired through past experience but also to react to novel situations through knowledge and insights acquired in mental simulations. Cognitive ability continues to be the single best determinant of work performance. We believe that the benefits associated with cognitive ability test use in employee selection far outweigh potential concerns. More importantly, the changing nature of work in most developed economies (increasing job complexity, fewer traditional employment relationships, increasing job switching) means that cognitive ability should be an increasingly important human capital variable. Advances in technology, such as the ubiquitous availability of mobile devices and increasing Internet access even in remote regions, provide immense opportunities to improve both the science and practice of employee assessment using cognitive ability tests. Although little has fundamentally changed in the last 30 years of cognitive ability assessment, the near future promises exciting developments.

NOTES
1. For the 100th anniversary of that article’s publication, the Journal of Personality and Social Psychology published a special section attesting to the impact of cognitive ability on a multitude of life domains (Deary, Whiteman, Starr, Whalley, & Fox, 2004; Gottfredson, 2004a; Kuncel, Hezlett, & Ones, 2004; Lubinski, 2004; Plomin & Spinath, 2004; Schmidt & Hunter, 2004).
2. In fact, meta-analysis has changed the nature of epistemological inquiry in all sciences. Thirty years after its inception, 40,000 peer-reviewed publications have used or discussed meta-analytic methods, garnering hundreds of thousands of citations (Christiansen, Seltzer, Beatty, & Ones, 2009).
3. However, corrections for attenuation due to range restriction and unreliability in the criterion are advisable when comparing results across studies differing in their levels of range restriction and unreliability.

REFERENCES Alonso, A., Viswesvaran, C., & Sanchez, J. I. (2008). The mediating effects of task and contextual performance. In J. Deller (Ed.), Research contributions to personality at work (pp. 3–17). Munich, Germany: Rainer Hampp. Arneson, J. J. (2007). An examination of the linearity of ability—Performance relationships among high scoring applicants. Unpublished doctoral dissertation, Minneapolis, MN: University of Minnesota. Arthur, W., Jr., Glaze, R. M., Villado, A. J., & Taylor, J. E. (2010). The magnitude and extent of cheating and response distortion effects on unproctored Internet-based tests of cognitive ability and personality. International Journal of Selection and Assessment, 18(1), 1–16. http://dx.doi.org/10.1111/j.1468–2389.2010.00476.x Avolio, B. J., & Waldman, D. A. (1994). Variations in cognitive, perceptual, and psychomotor abilities across the working life span: Examining the effects of race, sex, experience, education, and occupational type. Psychology and Aging, 9, 430–442. Barrett, G. V., Phillips, J. S., & Alexander, R. A. (1981). Concurrent and predictive validity designs: A critical reanalysis. Journal of Applied Psychology, 66, 1–6. Barros, E., Kausel, E. E., Cuadra, F., & Díaz, D. A. (2014). Using general mental ability and personality traits to predict job performance in three Chilean organizations. International Journal of Selection and Assessment, 22, 432–438. Berry, C. M., Gruys, M. L., & Sackett, P. R. (2006). Educational attainment as a proxy for cognitive ability in selection: Effects on levels of cognitive ability and adverse impact. Journal of Applied Psychology, 91, 696–705. Berry, C. M., Sackett, P. R., & Landers, R. N. (2007). Revisiting interview-cognitive ability relationships: Attending to specific range restriction mechanisms in meta-analysis. Personnel Psychology, 60, 837–874.


Cognitive Ability Berry, C. M., & Zhao, P. (2015). Addressing criticisms of existing predictive bias research: Cognitive ability test scores still overpredict African Americans’ job performance. Journal of Applied Psychology, 100(1), 162–179. http://dx.doi.org/10.1037/a0037615 Bertolino, M., & Steiner, D. D. (2007). Fairness reactions to selection methods: An Italian study. International Journal of Selection and Assessment, 15, 197–205. Burke, E. (April 2008). Remarks. In N. Tippins (Chair), Internet testing: Current issues, research, solutions, guidelines, and concerns. Symposium conducted at the annual conference of the Society for Industrial and Organizational Psychology, San Francisco, CA. Campbell, J. P. (1990). The role of theory in industrial and organizational psychology. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (Vol. 1, pp. 39–73). Palo Alto, CA: Consulting Psychologists Press. Campbell, J. P. (1996). Group differences and personnel decisions: Validity, fairness, and affirmative action. Journal of Vocational Behavior, 49, 122–158. Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York, NY: Cambridge University Press. Cattell, R. B. (1971). Abilities: Their structure, growth, and action. Oxford, England: Houghton Mifflin. Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82, 143–159. Chan, D., & Schmitt, N. (2004). An agenda for future research on applicant reactions to selection procedures: A construct-oriented approach. International Journal of Selection and Assessment, 12, 9–23. Chan, D., Schmitt, N., Jennings, D., Clause, C., & Delbridge, K. (1998). Applicant perceptions of test fairness: Integrating justice and self-serving bias perspectives. International Journal of Selection and Assessment, 6, 232–239. Chartered Institute of Personnel and Development. (2007). 2007 recruitment, retention, and turnover survey. London, UK: Chartered Institute of Personnel and Development. Christiansen, F., Seltzer, B. K., Beatty, A., & Ones, D. S. (April 2009). Thirty years of meta-analysis: Assessing its impact on the sciences. Poster presented at the annual conference of the Society for Industrial and Organizational Psychology, New Orleans, Louisiana. Cohen, J. (1977). Statistical power analysis for the behavioral sciences. San Diego, CA: Academic Press. Collins, J. M., Schmidt, F. L., Sanchez-Ku, M., Thomas, L., McDaniel, M. A., & Le, H. (2003). Can basic individual differences shed light on the construct meaning of assessment center evaluations? International Journal of Selection and Assessment, 11, 17–29. Coward, W., & Sackett, P. R. (1990). Linearity of ability-performance relationships: A reconfirmation. Journal of Applied Psychology, 75, 297–300. Deary, I. J., MacLennan, W. J., & Starr, J. M. (1998). Is age kinder to the initially more able?: Differential ageing of a verbal ability in the healthy old people in Edinburgh study. Intelligence, 26, 357–375. Deary, I. J., Whiteman, M. C., Starr, J. M., Whalley, L. J., & Fox, H. C. (2004). The impact of childhood intelligence on later life: Following up the Scottish mental surveys of 1932 and 1947. Journal of Personality and Social Psychology, 86, 130–147. De Corte, W., Lievens, F., & Sackett, P. R. (2007). 
Combining predictors to achieve optimal trade-offs between selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393. De Corte, W., Sackett, P. R., & Lievens, F. (2010). Selecting predictor subsets: Considering validity and adverse impact. International Journal of Selection and Assessment, 18(3), 260–270. http://dx.doi. org/10.1111/j.1468–2389.2010.00509.x Detterman, D. K., & Daniel, M. H. (1989). Correlations of mental tests with each other and with cognitive variables are highest for low IQ groups. Intelligence, 13, 349–359. Dilchert, S. (In press). Cognitive ability in occupational settings. In N. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran, International handbook of industrial, work, and organizational psychology. Thousand Oaks, CA: Sage. Dilchert, S., & Ones, D. S. (April 2007). Influence of figural reasoning item characteristics on group meanscore differences. In A. S. Boyce & R. E. Gibby (Chairs.), Global cognitive ability testing: Psychometric issues and applicant reactions. Symposium conducted at the annual conference of the Society for Industrial and Organizational Psychology, New York, NY. Dilchert, S., & Ones, D. S. (2009). Assessment center dimensions: Individual differences correlates and meta-analytic incremental validity. International Journal of Selection and Assessment, 17, 254–270. Dilchert, S., Ones, D. S., Davis, R. D., & Rostow, C. D. (2007). Cognitive ability predicts objectively measured counterproductive work behaviors. Journal of Applied Psychology, 92, 616–627. Drasgow, F. (2003). Intelligence and the workplace. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Handbook of psychology: Industrial and organizational psychology (Vol. 12, pp. 107–130). Hoboken, NJ: John Wiley & Sons.


Deniz S. Ones et al. Foster, D. (April 2008). Remarks. In N. Tippins (Chair.), Internet testing: Current issues, research, solutions, guidelines, and concerns. Symposium conducted at the annual conference of the Society for Industrial and Organizational Psychology, San Francisco, CA. Ghiselli, E. E., & Kahneman, D. (1962). Validity and non-linear heteroscedastic models. Personnel Psychology, 15, 1–11. Gibby, R. E. (April 2008). Online and unsupervised adaptive cognitive ability testing: Lessons learned. In R. E. Gibby & R. A. McCloy (Chairs.), Benefits and challenges of online and unsupervised adaptive testing. Symposium conducted at the annual conference of the Society for Industrial and Organizational Psychology, San Francisco, CA. Gibby, R. E., & Boyce, A. S. (April 2007). Global cognitive ability testing: Psychometric issues and applicant reactions. Symposium conducted at the annual conference of the Society for Industrial and Organizational Psychology, New York, NY. Gottfredson, L. S. (1997). Mainstream science on intelligence: An editorial with 52 signatories, history and bibliography. Intelligence, 24, 13–23. Gottfredson, L. S. (2002). Where and why g matters: Not a mystery. Human Performance, 15, 25–46. Gottfredson, L. S. (2004a). Intelligence: Is it the epidemiologists’ elusive “fundamental cause” of social class inequalities in health? Journal of Personality and Social Psychology, 86, 174–199. Gottfredson, L. S. (2004b). Life, death, and intelligence. Journal of Cognitive Education and Psychology, 4, 23–46. Gonzalez-Mulé, E., Mount, M. K., & Oh, I-S. (2014). A meta-analysis of the relationship between general mental ability and nontask performance. Journal of Applied Psychology, 99(6), 1222–1243. http://dx.doi. org/10.1037/a0037547 Gustafsson, J-E. (2002). Measurement from a hierarchical point of view. In H. I. Braun, D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 73–95). Mahwah, NJ: Lawrence Erlbaum. Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection procedures: An updated model and meta-analysis. Personnel Psychology, 57, 639–683. Hawk, J. (1970). Linearity of criterion-GATB aptitude relationships. Measurement and Evaluation in Guidance, 2, 249–251. Hense, R., Golden, J. H., & Burnett, J. (2009). Making the case for unproctored internet testing: Do the rewards outweigh the risks? Industrial and Organizational Psychology, 2, 20–23. Hoffman, B. J., Monahan, E., Lance, C. E., & Sutton, A. (2015). A meta-analysis of the content, construct, and criterion-related validity of assessment center exercises. Journal of Applied Psychology, 100, 1143–1168. Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration of adverse impact in personnel selection procedures: Issues, evidence and lessons learned. International Journal of Selection and Assessment, 9, 152–194. Hull, C. L. (1928). Aptitude testing. Yonkers-on-Hudson, NY: World Book. Hülsheger, U. R., Maier, G. W., & Stumpp, T. (2007). Validity of general mental ability for the prediction of job performance and training success in Germany: A meta-analysis. International Journal of Selection and Assessment, 15, 3–18. Humphreys, L. G. (1988). Trends in levels of academic achievement of Blacks and other minorities. Intelligence, 12, 231–260. Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72–98. Hunter, J. 
E., & Schmidt, F. L. (1982). Ability tests: Economic benefits versus the issue of test fairness. Industrial Relations, 21, 122–158. Hunter, J. E., Schmidt, F. L., & Hunter, R. (1979). Differential validity of employment tests by race: A comprehensive review and analysis. Psychological Bulletin, 86, 721–735. Hyde, J. S. (2005). The gender similarities hypothesis. American Psychologist, 60, 581–592. Hyde, J. S. (2014). Gender similarities and differences. Annual Review of Psychology, 65, 373–398. Hyde, J. S., Fennema, E., & Lamon, S. J. (1990). Gender differences in mathematics performance: A meta-analysis. Psychological Bulletin, 107, 139–155. Hyde, J. S., & Linn, M. C. (1988). Gender differences in verbal ability: A meta-analysis. Psychological Bulletin, 104, 53–69. Johnson, W., & Bouchard, T. J. (2005). The structure of human intelligence: It is verbal, perceptual, and image rotation (VPR), not fluid and crystallized. Intelligence, 33, 393–416. Johnson, W., te Nijenhuis, J., & Bouchard, T. J. (2007). Replication of the hierarchical visual-perceptual-image rotation model in de Wolff and Buiten’s (1963) battery of 46 tests of mental ability. Intelligence, 35, 69–81. Johnson, W., te Nijenhuis, J., & Bouchard, T. J. (2008). Still just 1 g: Consistent results from five test batteries. Intelligence, 36, 81–95.


Cognitive Ability Kane, H. D., Oakland, T. D., & Brand, C. R. (2006). Differentiation at higher levels of cognitive ability: Evidence from the United States. Journal of Genetic Psychology, 167, 327–341. Kehoe, J. F. (2008). Commentary on Pareto-optimality as a rationale for adverse impact reduction: What would organizations do? International Journal of Selection and Assessment, 16, 195–200. Kluger, A. N., & Rothstein, H. R. (1993). The influence of selection test type on applicant reactions to employment testing. Journal of Business and Psychology, 8, 3–25. Krantowitz, T. M. (2014). 2014 Global assessment trends report. Surrey, UK: CEB-SHL. Kravitz, D. A. (2008). The diversity-validity dilemma: Beyond selection—The role of affirmative action. Personnel Psychology, 61, 173–193. Kriek, H., & Dowdeswell, K. (2010). Adverse impact in South Africa. In J. L. Outz (Ed.), Adverse impact: Implications for organizational staffing and high stakes selection (pp. 375–399). New York. Routledge. Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2004). Academic performance, career potential, creativity, and job performance: Can one construct predict them all? Journal of Personality and Social Psychology, 86, 148–161. Kuncel, N. R., & Sackett. P. R. (2013). Resolving the assessment center construct validity problem (as we know it). Journal of Applied Psychology, 99, 38–47. Lee, S. (2005). Cross-cultural validity of personality for predicting job performance of Korean engineers. Unpublished doctoral dissertation, Columbus: Ohio State University. Lent, R. H., Aurbach, H. A., & Levin, L. S. (1971). Predictors, criteria, and significant results. Personnel Psychology, 24, 519–533. Lievens, F., & Burke, E. (2011). Dealing with the threats inherent in unproctored Internet testing of cognitive ability: Results from a large-scale operational test program. Journal of Occupational and Organizational Psychology, 84(4), 817–824. http://dx.doi.org/10.1348/096317910X522672 Lievens, F., De Corte, W., & Westerveld, L. (2015). Understanding the building blocks of selection procedures: Effects of response fidelity on performance and validity. Journal of Management, 41, 1604–1627. Linn, M. C., & Petersen, A. C. (1985). Emergence and characterization of sex-differences in spatial ability: A meta-analysis. Child Development, 56, 1479–1498. Loehlin, J. C. (2000). Group differences in intelligence. In R. J. Sternberg (Ed.), Handbook of intelligence (pp. 176–193). New York, NY: Cambridge University Press. Lubinski, D. (2004). Introduction to the special section on cognitive abilities: 100 years after Spearman’s (1904) “ ‘General Intelligence’, Objectively Determined and Measured”. Journal of Personality and Social Psychology, 86, 96–111. Lubinski, D. (2010). Spatial ability and STEM: A sleeping giant for talent identification and development. Personality and Individual Differences, 49(4), 344–351. http://dx.doi.org/10.1016/j.paid.2010.03.022 Lynn, R., & Irwing, P. (2004). Sex differences on the progressive matrices: A meta-analysis. Intelligence, 32, 481–498. McCloy, R. A., Campbell, J. P., & Cudeck, R. (1994). A confirmatory test of a model of performance determinants. Journal of Applied Psychology, 79, 493–505. McDaniel, M. A., Hartman, N. S., Whetzel, D. L., & Grubb, W. (2007). Situational judgment tests, response instructions, and validity: A meta-analysis. Personnel Psychology, 60, 63–91. McGrew, K. (2009). 
Editorial: CHC theory and the human cognitive abilities project: Standing on the shoulders of the giants of psychometric intelligence research. Intelligence, 37, 1–10. Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114, 449–458. Meriac, J. P., Hoffman, B. J., & Woehr. D. J. (2014). A conceptual and empirical review of the structure of assessment center dimensions. Journal of Management, 40, 1269–1296. Muller, J., & Schepers, J. (2003). The predictive validity of the selection battery used for junior leader training within the South African National Defense Force. South African Journal of Industrial Psychology, 29, 87–98. Myors, B., Lievens, F., Schollaert, E., Van Hoye, G., Cronshaw, S. F., Mladinic, A., et al. (2008). International perspectives on the legal environment for selection. Industrial and Organizational Psychology, 1, 206–246. Nathan, B. R., & Alexander, R. A. (1988). A comparison of criteria for test validation: A meta-analytic investigation. Personnel Psychology, 41, 517–535. Nikolaou, I., & Judge, T. A. (2007). Fairness reactions to personnel selection techniques in Greece: The role of core self-evaluations. International Journal of Selection and Assessment, 15, 206–219. Nimura, H., Ishashiro, S., & Naito, J. (2000). A meta-analysis and validity generalization study of a personnel test and a general cognitive test for measuring managerial aptitude. Japanese Journal of Administrative Science, 13, 159–167. Olea, M. M., & Ree, M. J. (1994). Predicting pilot and navigator criteria: Not much more than g. Journal of Applied Psychology, 79, 845–851.


Deniz S. Ones et al. Ones, D. S. (October 2004). Validity of cognitive ability tests in selection: Quantitative summaries of meta-analyses. Cattell Award address given at the annual conference of the Society for Multivariate Experimental Psychology, Naples, FL. Ones, D. S. (April 2016). “g” versus specific abilities. In L. Lin (Chair.), IGNITE session: I/O hot topics debate—dual or duel? Symposium session at the annual conference of the Society for Industrial and Organizational Psychology, Anaheim, CA. Ones, D. S., & Dilchert, S. (2004). Practical versus general intelligence in predicting success in work and educational settings: A first-order and a second-order meta-analysis. Paper presented at the University of Amsterdam, Amsterdam, The Netherlands. Ones, D. S., Dilchert, S., & Viswesvaran, C. (2012). Cognitive abilities. In N. Schmitt (Ed.), Oxford handbook of personnel assessment and selection (pp. 179–224). New York, NY: Oxford University Press. Ones, D. S., Viswesvaran, C., & Dilchert, S. (2004). Cognitive ability in selection decisions. In O. Wilhelm & R. W. Engle (Eds.), Handbook of understanding and measuring intelligence (pp. 431–468). Thousand Oaks, CA: Sage. Ones, D. S., Viswesvaran, C., & Dilchert, S. (2005). Cognitive ability in personnel selection decisions. In A. Evers, O. Voskuijl, & N. Anderson (Eds.), Handbook of selection (pp. 143–173). Oxford, England: Blackwell. Owens, W. (1966). Age and mental abilities: A second adult follow-up. Journal of Educational Psychology, 57, 311–325. Plomin, R., & Spinath, F. M. (2004). Intelligence: Genetics, genes, and genomics. Journal of Personality and Social Psychology, 86, 112–129. Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and sex subgroup differences and adverse impact in selection. Personnel Psychology, 61(1), 153–172.http:// dx.doi.org/10.1111/j.1744–6570.2008.00109.x Potosky, D., Bobko, P., & Roth, P. L. (2005). Forming composites of cognitive ability and alternative measures to predict job performance and reduce adverse impact: Corrected estimates and realistic expectations. International Journal of Selection and Assessment, 13, 304–315. Potosky, D., Bobko, P., & Roth, P. L. (2008). Some comments on Pareto thinking, test validity, and adverse impact: When “and” is optimal and “or” is a trade-off. International Journal of Selection and Assessment, 16, 201–205. Ree, M. J., & Carretta, T. R. (1996). Central role of g in military pilot selection. International Journal of Aviation Psychology, 6, 111–123. Ree, M. J., Carretta, T. R., & Steindl, J. R. (2001). Cognitive ability. In N. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work and organizational psychology: Vol. 1: Personnel psychology (pp. 219–232). Thousand Oaks, CA: Sage. Reeve, C. L., & Lam, H. (2007). Consideration of g as a common antecedent for cognitive ability test performance, test motivation, and perceived fairness. Intelligence, 35, 347–358. Reeves, T. J., & Bennett, C. E. (2004). We the people: Asians in the United States—Census 2000 special report (CENSR-17). Washington, DC: U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau. Rhodes, M. G. (2004). Age-related differences in performance on the Wisconsin card sorting test: A meta-analytic review. Psychology and Aging, 19, 482–494. Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., & Tyler, P. (2001). 
Ethnic group differences in cognitive ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297–330. Roth, P. L., Le, H., Oh, I-S., Van Iddekinge, C. H., Buster, M. A., Robbins, S. B., & Campion, M. A. (2014). Differential validity for cognitive ability tests in employment and educational settings: Not much more than range restriction? Journal of Applied Psychology, 99(1), 1–20. http://dx.doi.org/10.1037/a0034377 Rothstein, H. R., & McDaniel, M. A. (1992). Differential validity by sex in employment settings. Journal of Business and Psychology, 7, 45–62. Rotundo, M., & Sackett, P. R. (1999). Effect of rater race on conclusions regarding differential prediction in cognitive ability tests. Journal of Applied Psychology, 84, 815–822. Ryan, A. M., Boyce, A. S., Ghumman, S., Jundt, D., Schmidt, G., & Gibby, R. (2009). Going global: Cultural values and perceptions of selection procedures. Applied Psychology: International Review, 58, 520–556. Ryan, A. M., & Greguras, G. J. (1998). Life is not multiple choice: Reactions to the alternatives. In M. D. Hakel (Ed.), Beyond multiple choice: Evaluating alternatives to traditional testing for selection (pp. 183–202). Mahwah, NJ: Lawrence Erlbaum. Ryan, A. M., McFarland, L., Baron, H., & Page, R. (1999). An international look at selection practices: Nation and culture as explanations for variability in practice. Personnel Psychology, 52, 359–391. Ryan, A. M., Reeder, M., Golubovich, J., Grand J., Inceoglu, I., Bartram, D., Derous, E., Nikolaou, I., & Yao, X. (under review). Culture and testing practices: Is the world flat?


Cognitive Ability Sackett, P. R., Borneman, M. J., & Connelly, B. S. (2008). High stakes testing in higher education and employment: Appraising the evidence for validity and fairness. American Psychologist, 63, 215–227. Sackett, P. R., De Corte, W., & Lievens, F. (2008). Pareto-optimal predictor composite formation: A complementary approach to alleviating the selection quality/adverse impact dilemma. International Journal of Selection and Assessment, 16, 206–209. Sackett, P. R., & Ostgaard, D. J. (1994). Job-specific applicant pools and national norms for cognitive ability tests: Implications for range restriction corrections in validation research. Journal of Applied Psychology, 79, 680–684. Sackett, P. R., & Roth, L. (1996). Multi-stage selection strategies: A Monte Carlo investigation of effects on performance and minority hiring. Personnel Psychology, 49, 549–572. Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B. (2001). High-stakes testing in employment, credentialing, and higher education: Prospects in a post-affirmative-action world. American Psychologist, 56, 302–318. Sackett, P. R., & Shen, W. (April 2008). International perspectives on the legal environment for selection. Symposium conducted at the annual conference of the Society for Industrial and Organizational Psychology, San Francisco, CA. Salgado, J. F. (In press). Using ability tests in selection. In H. Goldstein, E. Pulakos, J. Passmore, & C. Semedo (Eds.), The Wiley Blackwell handbook of the psychology of recruitment, selection, and retention. Hoboken, NJ: John Wiley & Sons. Salgado, J. F., & Anderson, N. (2002). Cognitive and GMA testing in the European Community: Issues and Evidence. Human Performance, 15, 75–96. Salgado, J. F., & Anderson, N. (2003). Validity generalization of GMA tests across countries in the European Community. European Journal of Work and Organizational Psychology, 12, 1–17. Salgado, J. F., Anderson, N., Moscoso, S., Bertua, C., & de Fruyt, F. (2003). International validity generalization of GMA and cognitive abilities: A European community meta-analysis. Personnel Psychology, 56, 605. Salgado, J. F., Anderson, N., Moscoso, S., Bertua, C., de Fruyt, F., & Rolland, J. P. (2003). A meta-analytic study of general mental ability validity for different occupations in the European Community. Journal of Applied Psychology, 88, 1068–1081. Salgado, J. F., Viswesvaran, C., & Ones, D. S. (2001). Predictors used for personnel selection: An overview of constructs, methods and techniques. In N. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work and organizational psychology: Vol. 1: Personnel psychology (pp. 165–199). London, England: Sage. Schaie, K. W. (1994). The course of adult intellectual development. American Psychologist, 49, 304–313. Schmidt, F. L. (1988). The problem of group differences in ability test scores in employment selection. Journal of Vocational Behavior, 33, 272–292. Schmidt, F. L. (2011). A theory of sex differences in technical aptitude and some supporting evidence. Perspectives on Psychological Science, 6, 560–573. Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529–540. Schmidt, F. L., & Hunter, J. E. (1981). New research findings in personnel selection: Myths meet realities in the 1980s. Management, 23, 23–27. Schmidt, F. L., & Hunter, J. (2004). 
General mental ability in the world of work: Occupational attainment and job performance. Journal of Personality and Social Psychology, 86, 162–173. Schmidt, F. L., Hunter, J. E., & Outerbridge, A. N. (1986). Impact of job experience and ability on job knowledge, work sample performance, and supervisory ratings of job performance. Journal of Applied Psychology, 71, 432–439. Schmidt, F. L., Hunter, J. E., Outerbridge, A. N., & Goff, S. (1988). Joint relation of experience and ability with job performance: Test of three hypotheses. Journal of Applied Psychology, 73, 46–57. Schmidt, F. L., Le, H., Oh, I-S., & Shaffer, J. (2007). General mental ability, job performance, and red herrings: Responses to Osterman, Hauser, and Schmitt. Academy of Management Perspectives, 21, 64–76. Schmidt, F. L., Pearlman, K., & Hunter, J. E. (1980). The validity and fairness of employment and educational tests for Hispanic Americans: A review and analysis. Personnel Psychology, 33, 705–724. Schmitt, N., Rogers, W., Chan, D., Sheppard, L., & Jennings, D. (1997). Adverse impact and predictive efficiency of various predictor combinations. Journal of Applied Psychology, 82, 719–730. Schneider, W. J., & McGrew, K. (2012). The Cattell-Horn-Carroll model of intelligence. In D. Flanagan & P. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (3rd ed., pp. 99–144). New York, NY: Guilford. Schooler, C., Mulatu, M. S., & Oates, G. (1999). The continuing effects of substantively complex work on the intellectual functioning of older workers. Psychology and Aging, 14, 483–506. Spearman, C. (1904). “General intelligence,” objectively determined and measured. American Journal of Psychology, 15, 201–293.


Deniz S. Ones et al. Stern, W. (1911). Die differentielle Psychologie in ihren methodischen Grundlagen [Differential psychology and its methodological foundations]. Leipzig, Germany: J. A. Barth. Taylor, P., Keelty, Y., & McDonnell, B. (2002). Evolving personnel selection practices in New Zealand organisations and recruitment firms. New Zealand Journal of Psychology, 31, 8–18. Thadeu, S. H., & Ferreira, M. C. (2013). The validity of psychological assessment in a selection process in the area of public safety. Revista Iberoamericana de Diagnóstico y Evaluación, 117–145. Tiffin, J., & Vincent, N. L. (1960). Comparison of empirical and theoretical expectancies. Personnel Psychology, 13, 59–64. Tippins, N. (2015). Technology and assessment in selection. Annual Review of Organizational Psychology and Organizational Behavior, 2, 551–582. United Nations High Commissioner for Refugees. (December 2015). UNHCR mid-year trends 2015. Retrieved from http://www.unhcr.org/statistics/unhcrstats/56701b969/mid-year-trends-june-2015. html U.S. Census Bureau. (2010). The Asian population: 2010. Issued March 2012. Vanderpool, M., & Catano, V. M. (2008). Comparing the performance of Native North Americans and predominantly White military recruits on verbal and nonverbal measures of cognitive ability. International Journal of Selection and Assessment, 16, 239–248. Verhaeghen, P. (2003). Aging and vocabulary score: A meta-analysis. Psychology & Aging, 18, 332–339. Viswesvaran, C., & Ones, D. S. (2002). Special issue: Role of general mental ability in industrial, work, and organizational psychology. Human Performance, 15. Voyer, D., Voyer, S., & Bryden, M. (1995). Magnitude of sex differences in spatial abilities: A meta-analysis and consideration of critical variables. Psychological Bulletin, 117, 250–270. Wai, J., Lubinski, D., & Benbow, C. P. (2005). Creativity and occupational accomplishments among intellectually precocious youths: An age 13 to age 33 longitudinal study. Journal of Educational Psychology, 97, 484–492. Wiernik, B. M., Wilmot, M. P., & Kostal, J. W. (2015). How data analysis can dominate interpretations of dominant general factors. Industrial and Organizational Psychology, 8, 438–445. Wonderlic Inc. (2002). Wonderlic personnel test and scholastic level exam user’s manual. Libertyville, IL: Author.


12 PHYSICAL PERFORMANCE TESTS
DEBORAH L. GEBHARDT AND TODD A. BAKER

The Bureau of Labor Statistics (2011) reported that 28% of the workforce in the United States performs physically demanding jobs that involve construction, machinery installation and repair, public safety, and other professions. In many instances, physically demanding jobs are the highest-paying jobs in a geographic location. The importance of ensuring access to these jobs for all applicants was underscored by the Department of Defense (DoD) decision in 2015 to open all military occupational specialties (MOS) to both men and women (Carter, 2015). Assessment of physical performance has a historical base in exercise science, medicine, psychology, and the military, and encompasses the academic areas of physiology, biomechanics, industrial engineering, applied psychology, and medicine. These multidisciplinary roots led to the use of physical testing in occupational settings. Although our society has become more computer-driven, there are many arduous jobs in the public, private, and military sectors. The warehouse, manufacturing, longshore, telecommunications, railroad, airline, electric, and natural gas industries contain many arduous jobs (Gebhardt & Baker, in press). Organizations use physical performance tests for applicant selection, retention of incumbents, and evaluation of physical fitness levels. Although physical testing is common in the selection setting, some organizations evaluate incumbent personnel at specified intervals (e.g., annually) to determine whether they can perform the physical aspects of the job. A few public safety agencies require annual physical qualification (e.g., Nuclear Regulatory Commission, state police agencies), with employment consequences ranging from remedial training, denial of promotion and bonus payments, and job suspension to job loss (Gebhardt & Baker, 2006).

JOB ANALYSIS FOR ARDUOUS JOBS

As with all assessments used to make employment decisions, physical tests must be supported by a detailed job analysis. In the physical domain, it is important to consider all underlying parameters (e.g., environment, protective equipment) that affect job performance. It would be unrealistic to consider police officer job tasks without including the weight of the equipment worn (e.g., bulletproof vest, weapon, ammunition, radio, handcuffs). As with all job analyses, identification of essential tasks and abilities is critical to defining job requirements. The job analysis—whether physical, cognitive, or psychomotor—involves three steps: job observations and interviews, identification of essential tasks, and identification of ergonomic and environmental conditions.



Site Visits and Essential Task Identification

Job site visits involve interviews and observation of incumbents performing job tasks. During the observations, researchers identify job tasks and gather data related to postures, motions, and ergonomic parameters associated with the tasks. Interviews with incumbents and supervisors provide lists of tasks and details related to task performance (e.g., equipment weights). When the intent of the job analysis is to develop and validate physical performance tests, the task statements should be specific enough to allow identification of the frequency of specific types of physical activities (e.g., lifting different types and weights of objects).

Following these initial steps, incumbents and supervisors use rating scales such as frequency of performance and importance to the job to identify essential job tasks. Other scales, such as physical effort, help provide an initial overview of the physical demands of job tasks (Fleishman, Gebhardt, & Hogan, 1986); tasks with mean physical effort ratings at or above a specified value (e.g., 4 on a 7-point scale) involve moderate or higher physical demand. The expected-to-perform scale, used for public safety jobs, identifies rarely performed tasks that are nonetheless critical to successful job performance (e.g., discharging firearms, carrying victims from burning structures). Past research found that job incumbents and supervisors provide reliable task ratings (Hogan, 1991a). However, incumbents provide better frequency and time-spent ratings in most instances because they perform the tasks. If supervisors are used, first-line supervisors with previous job experience are most appropriate.

Researchers have used numerous decision rules or algorithms (combinations of frequency, importance, and time-spent ratings) to identify essential tasks from task ratings. There is no single algorithm associated with physically oriented job analysis; selection of the most appropriate algorithm depends upon the nature and mission of the job. For instance, jobs in which most tasks are performed frequently (e.g., assembly line) may require a larger weighting for frequency than importance. Jobs that include highly important tasks that are infrequently performed (e.g., discharging a firearm) may use an algorithm in which a separate importance or frequency mean cutoff identifies essential tasks. Thus, a combination of algorithms may be needed to define the essential physical tasks.
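To make the preceding discussion concrete, the sketch below applies one hypothetical decision rule of the kind described. The cutoff values, rating scales, and task statements are illustrative assumptions rather than recommendations; in practice the rule would be tailored to the job's nature and mission.

```python
def flag_essential_tasks(tasks, importance_cut=3.5, frequency_cut=2.5, effort_cut=4.0):
    """Flag essential tasks using one hypothetical decision rule:
    (a) the task is both important and frequently performed, or
    (b) it is rarely performed but rated as highly important (e.g., discharging a firearm), or
    (c) its mean physical effort rating is at or above 4 on a 7-point scale."""
    essential = []
    for task in tasks:
        frequent_and_important = (task["importance"] >= importance_cut
                                  and task["frequency"] >= frequency_cut)
        critical_but_rare = task["importance"] >= 5.5
        physically_demanding = task["effort"] >= effort_cut
        if frequent_and_important or critical_but_rare or physically_demanding:
            essential.append(task["name"])
    return essential

# Hypothetical mean ratings from incumbents (7-point scales)
ratings = [
    {"name": "Lift 50-lb cartons to shoulder height", "importance": 4.2, "frequency": 3.9, "effort": 4.7},
    {"name": "Complete daily activity log",           "importance": 3.1, "frequency": 4.6, "effort": 1.3},
    {"name": "Carry victim from burning structure",   "importance": 6.4, "frequency": 1.1, "effort": 6.2},
]
print(flag_essential_tasks(ratings))
```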

Ergonomic/Biomechanical/Physiological Analysis

Ergonomic, physiological, and biomechanical data, used separately or in combination, provide direct measures to quantify physical job demands. The methodologies range from simple measures, such as the distance a worker walks, to sophisticated measures involving oxygen consumption, mathematical modeling, and use of archival engineering data. Simple ergonomic measures such as weights of objects, distances objects are carried, and heights lifted to and from are appropriate for most jobs. To measure the force required to move objects, researchers use a load-cell device that records force production.

To quantify actual job task demand, researchers use basic physiological and biomechanical data-gathering methodologies. The type of data gathered depends upon the essential tasks and physical demands of the job. The data can be gathered using a variety of equipment (e.g., heart rate monitor, accelerometer, oxygen/gas sensor, mathematical modeling). Heart rate monitors can provide a basic estimate of the physiological workload for jobs requiring task performance at medium to high intensities for extended timeframes (e.g., order filler, firefighter). The monitor captures the individual’s heart rate during task performance, and the researchers calculate the heart rate response and the percentage of maximum heart rate at which the individual was working. For example, if a 30-year-old male with an age-predicted maximum heart rate of 190 beats per minute (bpm; 220 − age = 220 − 30 = 190 bpm) is working at an average heart rate of 142.5 bpm, then he is working at 75% of his maximum (142.5/190 = 0.75). The American College of Sports Medicine (ACSM) classified the intensity of physical activity in terms of the percentage of maximum heart rate (Pescatello, Arena, Riebe, & Thompson, 2014). Table 12.1 lists the ACSM intensities, which range from very light to maximum (Pescatello et al., 2014). Gebhardt,
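The percentage-of-maximum calculation in the worked example above can be expressed compactly; the sketch below assumes the common age-predicted maximum of 220 − age used in the text.

```python
def percent_max_heart_rate(age, avg_hr):
    """Percentage of age-predicted maximum heart rate, using max HR = 220 - age."""
    return avg_hr / (220 - age) * 100

# The worked example from the text: a 30-year-old averaging 142.5 bpm
print(round(percent_max_heart_rate(30, 142.5)))  # 75
```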



TABLE 12.1 ACSM’s Categories of Physical Activity Intensity (columns: Intensity; Percentage of Maximum Heart Rate; categories range from very light to maximum)

… 75%) AC exercises are specifically developed for an organization. Customization is costly. If customization elicits behavior indicative of traits that are particularly valued in a given context, though, the cost is likely worthwhile.

VALIDITY OF PERSONALITY CONSTRUCTS AND FACTORS THAT AFFECT THEIR USEFULNESS

Substantial evidence documents the utility of personality variables for predicting important organizational criteria. Yet there are those who sharply criticize the utility of personality variables for employee selection on the grounds of purportedly low validities. For an exchange on this issue, see Morgeson et al. (2007a, 2007b) and Murphy and Dzieweczynski (2005) for one side of the argument, and Barrick and Mount (2005), R. Hogan (2005a, 2005b), Hough and Oswald (2005), Ones et al. (2005, 2007), and Tett and Christiansen (2007) for the other side. In addition, we refer the reader to meta-analyses and reviews of the literature such as Barrick, Mount, and Judge (2001); Dudley et al. (2006); J. Hogan and Holland (2003); J. Hogan and Ones (1997); Hough and Furnham (2003); Hough and Johnson (2013); Hough and Ones (2001); Hough and Oswald (2008); Ones et al. (2007); Ones, Viswesvaran, and Schmidt (1993); Roberts, Kuncel, Shiner, Caspi, and Goldberg (2007); and Rothstein and Goffin (2006). These summaries indicate that personality constructs predict many important criteria, including major life outcomes. The list of criteria that are well predicted by personality variables includes, among others, the following (a brief sketch of the corrections that distinguish the rtrue and roperational values cited below follows the list):

• Overall job performance: conscientiousness, rtrue = .23 (Barrick & Mount, 1991) and roperational = .20 (Hurtz & Donovan, 2000); integrity tests, roperational = .41 (Ones et al., 1993); core self-evaluations, rtrue = .36 (Chang, Ferris, Johnson, Rosen, & Tan, 2012)


• Organizational citizenship behaviors: conscientiousness, rtrue = .22; agreeableness, rtrue = .17; openness, rtrue = .17; emotional stability, rtrue = .15; extraversion, rtrue = .11 [somewhat different patterns for organizational, interpersonal, and change-oriented OCB are also reported] (Chiaburu, Oh, Berry, Li, & Gardner, 2011); core self-evaluations, rtrue = .22 (Chang et al., 2012); positive affect, rtrue = .23; negative affect, rtrue = −.10 (Kaplan, Bradley, Luchman, & Haynes, 2009)
• Counterproductive work behavior: conscientiousness, roperational = −.26 (Salgado, 2002) and roperational = −.31 (Berry, Ones, & Sackett, 2007); dependability, rtrue = −.34 (Dudley et al., 2006); emotional stability, roperational = −.23, and agreeableness, roperational = −.38 (Berry et al., 2007); personality-based integrity tests, roperational = −.32, and overt integrity tests, roperational = .55 (Ones et al., 1993); core self-evaluations, rtrue = −.19 (Chang et al., 2012); negative affectivity, rtrue = .30 (Kaplan et al., 2009). For counterproductive work behaviors rated by others: conscientiousness, roperational = −.19, and agreeableness, roperational = −.22 (Berry, Carpenter, & Barratt, 2012)
• Task performance: core self-evaluations, rtrue = .19 (Chang et al., 2012); positive affectivity, rtrue = .20; negative affectivity, rtrue = −.09 (Kaplan et al., 2009)
• Adaptive performance at work: emotional stability, roperational = .16, but .20 for managers; ambition, roperational = .14, but .26 for managers (Huang, Ryan, Zabel, & Palmer, 2014)
• Managerial effectiveness: dominance, roperational = .27; energy level, roperational = .20; achievement orientation, roperational = .17 (Hough, Ones, & Viswesvaran, 1998); conscientiousness, rtrue = .22 (Barrick & Mount, 1991)
• Entrepreneurial performance: conscientiousness, rtrue = .19; openness, rtrue = .21; emotional stability, rtrue = .18 (Zhao, Seibert, & Lumpkin, 2010)
• Customer service: customer service scales, roperational = .34 (Ones & Viswesvaran, 2008)
• Unsafe behavior: conscientiousness, rtrue = −.25; agreeableness, rtrue = −.26; emotional stability, rtrue = −.13; extraversion, rtrue = .10 (Beus, Dhanani, & McCord, 2015)
• Job satisfaction: emotional stability, rtrue = .29; conscientiousness, rtrue = .26; extraversion, rtrue = .25; agreeableness, rtrue = .17 (Judge et al., 2002); core self-evaluations, rtrue = .44 (Chang et al., 2012); positive affectivity, rtrue = .33; negative affectivity, rtrue = −.37 (Thoreson, Kaplan, Barsky, Warren, & de Chermont, 2003)
• Job commitment: core self-evaluations, rtrue = .32, −.18, and −.27 for affective commitment, continuance commitment, and turnover intentions, respectively (Chang et al., 2012)
• Intrinsic motivation: core self-evaluations, rtrue = .37 (Chang et al., 2012)
• Goal setting: for goal-setting motivation, emotional stability and agreeableness, rtrue = .29; conscientiousness, rtrue = .29; openness, rtrue = .18; extraversion, rtrue = .15 (Judge & Ilies, 2002); for goal commitment, core self-evaluations, rtrue = .44 (Chang et al., 2012)
• Life satisfaction: core self-evaluations, rtrue = .57 (Chang et al., 2012); emotional stability, rtrue = −.45; extraversion, rtrue = −.35; conscientiousness, rtrue = .27; agreeableness, rtrue = .19 (Steel, Schmidt, & Shultz, 2008)
• Divorce: conscientiousness, robserved = −.13; emotional stability, robserved = −.17; agreeableness, robserved = −.18 (Roberts et al., 2007)
• Mortality: conscientiousness, robserved = −.09; extraversion/positive emotion, robserved = −.07; emotional stability, robserved = −.05; agreeableness/lack of hostility, robserved = −.04 (each greater than the effects of socioeconomic status and IQ; Roberts et al., 2007)
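As referenced above, here is a minimal sketch of the corrections that typically distinguish the two kinds of estimates in this list: roperational values are usually corrected for unreliability in the criterion (and often for range restriction), whereas rtrue values are additionally corrected for unreliability in the predictor. The observed correlation and reliabilities below are hypothetical.

```python
import math

def operational_validity(r_obs, ryy):
    """Observed validity corrected for measurement error in the criterion only."""
    return r_obs / math.sqrt(ryy)

def true_score_correlation(r_obs, rxx, ryy):
    """Observed validity corrected for measurement error in both predictor and criterion."""
    return r_obs / math.sqrt(rxx * ryy)

# Hypothetical values: observed r = .15, predictor reliability .80, criterion reliability .52
print(round(operational_validity(0.15, ryy=0.52), 2))              # roughly .21
print(round(true_score_correlation(0.15, rxx=0.80, ryy=0.52), 2))  # roughly .23
```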

Ones et al. (2005, 2007), as well as Hough and Ones (2001) and Hough and Johnson (2013), have summarized the meta-analytic evidence for compound personality scales in predicting work-related criteria and shown that these scales have high validity for predicting the specific criteria they were developed for, as well as for overall job performance. We readily acknowledge that it is not necessarily Big Five factors that predict valued outcomes. Indeed, we argue that (a) more specific criteria are predicted by more narrow personality traits; (b) complex criteria are predicted by theoretically appropriately matched predictors; and (c) for some of the criteria listed above, the highest predictive validities are not necessarily obtained at the factor level. We do not want to underestimate the importance of the FFM. It has provided a structure for us to think about personality variables. Prior to its acceptance, personality and I-O psychology had little from which to generalize: the myriad personality measures and variables numbered in the hundreds, and there were different names for the same or similar constructs, or the same name for different constructs. We are not advocating a return to the “good old daze” (Hough, 1997). We applaud the interest and evidence coming from studies that examine facet-level variables of the FFM.


We urge more such research, especially research based on empirically derived, generalizable, facet-level personality taxonomies.

An area of increased research attention in examining personality in work contexts is maladaptive traits and measures, as well as their overlap with measures and taxonomies of adaptive personality. Often, as Dilchert, Ones, and Krueger (2014) point out, “personality constructs range between maladaptive positive and negative extremes, with the middle normal range representing typical (i.e., ‘normal’) traits” (p. 98). Maladaptive personality measures have item content that is tilted toward higher negative valence. Examples include so-called dark side measures (e.g., the Hogan Development Survey [HDS], R. Hogan & Hogan, 2009), measures of the “Dark Triad” of narcissism, Machiavellianism, and psychopathy (Paulhus & Williams, 2002), as well as—at the extreme end—measures of psychopathology (e.g., the MMPI, Ben-Porath & Tellegen, 2008; Butcher, Graham, Ben-Porath, Tellegen, & Dahlstrom, 2001; the Personality Inventory for DSM-5 [PID-5], Krueger, Derringer, Markon, Watson, & Skodol, 2012). Typically, such measures assess more extreme ranges of personality variables, often describable in terms of Big Five factors or their facets, but most often compounds. For example, narcissism measures capture variance from low agreeableness and high extraversion (Moore & Ones, 2016).

The I-O psychology literature on maladaptive traits is meager. A few recent meta-analyses have summarized the criterion-related validities of the Dark Triad and the dark side traits assessed by the HDS. Machiavellianism and narcissism predict counterproductive work behavior well (rtrue = .25 and .43, respectively; O’Boyle, Forsyth, Banks, & McDaniel, 2012). It appears that the entitlement/exploitative facet of narcissism is responsible for its predictive utility for CWB (Grijalva & Newman, 2015). For managers, a small meta-analysis (ks from 4 to 12) based on the Hogan Development Survey examined the validity of dark side traits for managerial performance (Gaddis & Foster, 2015). Managers who were leisurely (“indifferent to other people’s requests”), skeptical (cynical and distrustful), excitable (volatile and inconsistent), and cautious (resistant to change) performed worse (operational validities ranged from −.11 to −.20). Fine-grained analyses indicated that colorful (dramatic), bold (overconfident), imaginative, mischievous (taking risks, testing limits), and skeptical managers are rated as untrustworthy by their supervisors (unreliability-corrected validities ranged from −.10 to −.29).

We urge more such research on the workplace consequences of maladaptive traits and on the workforce-relevant nomological nets of their measures. Especially needed are studies examining how these traits and measures relate to a broader spectrum of criteria that are consequential in organizations, such as negotiation tactics, conflict resolution, benefiting from HR interventions, political behavior, coaching, mentoring, and derailment, among others. Also important, maladaptive behavior may not carry a personality construct label that is obviously maladaptive. For example, Chan (2006) has demonstrated that the construct of proactive personality, which has almost always been taken as adaptive, can be maladaptive when high proactive personality scores are accompanied by low situational judgment effectiveness.
Assessments of maladaptive traits, as well as understanding how positive traits can be negative in the presence (or lack) of other skills and abilities, can be useful in predicting interpersonal behavior as well as counterproductivity in organizations.

Incremental Validity

Personality variables can increment criterion-related validity in at least two ways: in combination with other relevant personality variables, and in combination with other individual difference variables such as measures of cognitive ability. Personality variables generally have low correlations with cognitive ability measures and do increment validity when used jointly with them (Bartram, 2005; McHenry, Hough, Toquam, Hanson, & Ashworth, 1990; White et al., 2008). When used in combination with other measures, such as the interview, biodata, and situational judgment tests, personality variables also increment validity (DeGroot & Kluemper, 2007; McManus & Kelly, 1999).
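A minimal sketch of how such incremental validity is typically quantified: the multiple correlation from two predictors is compared with the validity of the stronger predictor alone. All correlations below are hypothetical placeholders, not estimates from the studies cited above.

```python
import math

def multiple_r_two_predictors(r1y, r2y, r12):
    """Multiple correlation of a criterion with two predictors, computed from
    their zero-order correlations."""
    r_squared = (r1y ** 2 + r2y ** 2 - 2 * r1y * r2y * r12) / (1 - r12 ** 2)
    return math.sqrt(r_squared)

# Hypothetical values: cognitive ability-performance r = .50, personality-performance r = .20,
# and a low predictor intercorrelation of .05
r_cognitive, r_personality, r_intercorrelation = 0.50, 0.20, 0.05
r_combined = multiple_r_two_predictors(r_cognitive, r_personality, r_intercorrelation)
print(round(r_combined, 3), round(r_combined - r_cognitive, 3))  # multiple R and increment over ability alone
```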


Variables That Moderate Validity of Personality Constructs
Many variables affect the magnitude of the criterion-related validity that is obtained in primary and meta-analytic studies. Important factors are the type of criterion, the criterion measurement method, the relevance of the predictor for the criterion, the personality measurement method (see above), the research setting (experimental/laboratory vs. real-life selection), the research design (concurrent vs. predictive/longitudinal), item transparency (subtle vs. obvious), and rater perspective (e.g., self vs. other). The more theoretically relevant the predictor is to the criterion, the higher the validity. The Hough and Furnham (2003) and Hough and Johnson (2013) summaries of meta-analyses organized by predictor and criterion construct provide excellent examples of how predictor-criterion relevance affects the relationship between the two. In addition, validities are typically higher in concurrent validation studies than in longitudinal validity studies (see Lievens, Ones, & Dilchert, 2009, for exceptions). Validities are also higher in “weak” situations, in which people have more autonomy and control, than in “strong” situations, in which people have few options. Trait activation theory (Tett & Burnett, 2003; Tett & Guterman, 2000) provides an integrated framework for understanding how the situation can explain variability in the magnitude of the relationships between personality and behavior and performance.

Nature of Predictor-Criterion Relationships
As with all employee selection measures (whether standardized tests, interviews, simulations, or ACs), the utility of personality measures depends on the nature of the predictor-criterion relationship. In making top-down hiring decisions, linearity of the relationship between predictor and criterion scores is typically assumed. Pearson correlations, which are most commonly used to estimate operational validities, also assume linearity, the same assumption that is critical to traditional utility approaches. Two plausible scenarios of nonlinearity between personality traits and criterion variables would matter for employee selection, especially with rigorous top-down selection. The first is an asymptotic relationship, in which the criterion levels off after a certain level of predictor scores (e.g., beyond a certain point, all conscientious individuals keep their workplace similarly tidy, and differences in orderliness do not translate into performance differences). The second is a U-shaped (or inverted U-shaped) function, in which the direction of the relationship actually reverses beyond a certain level of predictor scores. In the case of an asymptotic relationship between personality and criterion scores, there is still potential utility in using personality as part of a selection process. Many organizations using minimum standards or defined cutoff scores do so because of the implicit assumption that predictor scores do not matter beyond a certain cutoff. If, however, personality-performance relationships are described by an inverted U-shaped function, the detrimental effect on the overall utility of a selection system could be significant. In cases where predictor-criterion relationships reverse direction, top-down hiring could result in the acceptance of applicants who display high predictor scores but actually perform worse than some lower-scoring candidates. A handful of studies have investigated curvilinearity of personality–job performance relationships, most focusing on conscientiousness. Classically scaled conscientiousness scale scores are linearly related to overall job performance (LaHuis, Martin, & Avis, 2005; Robie & Ryan, 1999; Walmsley, 2013; Whetzel, McDaniel, Yost, & Kim, 2010), task performance (Carter et al., 2014; Le et al., 2011), organizational citizenship behaviors and counterproductive work behaviors (Carter et al., 2014; Le et al., 2011), as well as GPA and training performance (Cucina & Vasilopoulos, 2005; Vasilopoulos, Cucina, & Hunter, 2007), although slight decrements to performance or its facets were reported for classically scored, ad hoc measures of conscientiousness (Carter et al., 2014; LaHuis et al., 2005; Le et al., 2011). Overall, for most commercially available personality inventories, curvilinearity appears not to present an impediment to predicting performance constructs (for examples, see Walmsley, 2013, for the Hogan Personality Inventory; Whetzel et al., 2010, for the Occupational Personality Questionnaire; and Robie & Ryan, 1999, for the Personal Characteristics Inventory and NEO). The conclusions from the largest study on the topic (N > 11,000) are noteworthy:

Any expected declines in performance at high ends of the predictor range were very small on average, and would be highly unlikely to produce scenarios in which those passing a realistic cut score would be expected to underperform those screened out due [to] a curvilinear effect. . . . Even with slight curvilinear trends for several of the scales examined, the results suggest that curvilinearity is highly unlikely to present problems for typical uses of personality test scores in employment settings. (Walmsley, 2013, pp. ii–iii)
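For readers who want to see the basic logic, the usual probe for curvilinearity is to test whether a squared predictor term adds explained variance beyond the linear term. The sketch below is a generic, hypothetical illustration of that step; it does not reproduce the models or data of Walmsley (2013) or the other studies cited here.

```python
# Generic illustration with simulated data: does adding a quadratic term
# improve on a purely linear personality-performance model?
import numpy as np

rng = np.random.default_rng(1)
n = 1000
z = rng.normal(size=n)  # standardized personality scores (hypothetical)
performance = 0.30 * z - 0.05 * z**2 + rng.normal(size=n)  # built-in slight inverted U

def fit(y, predictors):
    """Return OLS coefficients and R^2 for y regressed on the predictors plus an intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return beta, 1 - residuals.var() / y.var()

_, r2_linear = fit(performance, [z])
beta, r2_quadratic = fit(performance, [z, z**2])
print(f"Quadratic weight: {beta[2]:.3f}")
print(f"Delta R^2 from the squared term: {r2_quadratic - r2_linear:.4f}")
# A reliably negative quadratic weight signals an inverted-U trend; whether it
# matters in practice depends on where realistic cut scores fall.
```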

In general, nonlinear relationships occur when the independent or dependent variable’s measures are non-normally distributed. Most personality scales used in employee selection (rather than screening) are normally distributed and thus present little concern. Nonlinearity may be more of an issue when measures are used to assess extreme ranges of the trait continuum. Benson and Campbell (2007) reported nonlinear relationships between composites of dark-side personality traits and leadership as assessed by AC dimensions and supervisory ratings. Grijalva and Newman (2015) found a mean incremental validity of .06 for nonlinearity using the Bold scale of the HDS dark-side personality measure in predicting leadership effectiveness across six samples (the authors interpreted scores on this scale to indicate narcissism). Thus, personality scales constructed to assess maladaptive ranges of personality constructs can have inverted U-shaped, nonlinear relations with performance criteria. An interesting illustration of such nonlinearity was provided by Carter et al. (2014). In their Study 1, they scored a select set of conscientiousness items using the generalized graded unfolding item response theory model (GGUM; Roberts, Donoghue, & Laughlin, 2000). For task performance and citizenship behaviors, inverted U-shaped relationships were found, whereas for CWB, a U-shaped relationship was found. In a second sample, another set of selected conscientiousness items, when scored using the GGUM, yielded especially notable nonlinear effects for scores more than one standard deviation below the mean on conscientiousness (i.e., larger decrements to task performance and citizenship behavior, and larger increases in CWB) compared with scores within one standard deviation (above or below) of the mean. Classically constructed measures of maladaptive or abnormal personality, designed to detect infrequently occurring psychopathological characteristics, should be expected to have greater predictive value at extreme score ranges. However, most of these measures are not suitable for pre-offer employee selection and are typically employed for screening out extreme cases after a conditional job offer has been made (Dilchert et al., 2014). Finally, although most research examining personality-criterion relationships has highlighted the predictor construct and related measurement issues, nonnormality in criterion measures can also result in nonlinearity. Future research in this area should carefully distinguish between personality constructs and their measures. Examinations in diverse samples of occupations and with a broader set of criterion measures can help determine whether nonnormality and nonlinearity appreciably impact the usefulness of personality measures used in employee selection. These examinations should proceed in a theory-driven manner, taking into account the distinction between test method and test content (Chan & Schmitt, 1997), the nature of the test response (Lievens, De Corte, & Westerveld, 2015), and the conceptual nature of the predictor and criterion constructs (Chan, 2005; Sackett & Lievens, 2008).
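To make the measurement point concrete: under a conventional dominance model, the probability of endorsing an item keeps rising with the trait, whereas under an ideal-point (unfolding) model such as the GGUM, endorsement peaks near the item’s location and declines on both sides. The sketch below contrasts the two response processes with simple, hypothetical functions; it is not the GGUM estimation used by Carter et al. (2014).

```python
# Hypothetical response functions contrasting a dominance model with an
# ideal-point (unfolding) model; parameters are illustrative only.
import numpy as np

def dominance_prob(theta, difficulty=0.0, discrimination=1.5):
    """2PL-style dominance model: endorsement rises monotonically with the trait."""
    return 1 / (1 + np.exp(-discrimination * (theta - difficulty)))

def ideal_point_prob(theta, location=0.0, spread=1.0):
    """Simple squared-distance unfolding model: endorsement peaks at the item location."""
    return np.exp(-(((theta - location) / spread) ** 2))

theta = np.linspace(-3, 3, 7)
print("theta      :", np.round(theta, 1))
print("dominance  :", np.round(dominance_prob(theta, difficulty=0.5), 2))
print("ideal point:", np.round(ideal_point_prob(theta, location=0.5), 2))
# Under the ideal-point model, very high scorers can disagree with a moderately
# worded statement, which is one route by which item-level scoring choices can
# reveal (or mask) curvilinear relations with criteria.
```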

Adverse Impact
Group mean score differences on measures used in employee selection are one of the major factors determining adverse impact against protected groups, in addition to the selection ratio and score variability. Hough et al. (2001) summarized studies that examined mean score differences between Whites and various ethnic minorities, between men and women, and between older and younger people on personality traits, cognitive ability, and physical abilities. They found essentially no differences between Whites and ethnic minorities for most personality variables. They also examined mean score differences between groups at the facet level of the Big Five, with some unexpected findings: For some facets, mean-score differences differed from those of their respective Big Five factor (e.g., a Black–White difference of d = −.10 on global extraversion but a reversal, i.e., d = .12, on surgency/dominance, a facet of extraversion). Another meta-analysis of race and ethnic group differences on personality measures also showed modest differences between Whites and ethnic minority groups on facets of the Big Five (Foldes et al., 2008) and again established that differential patterns may exist for Big Five factors and facets (e.g., a Black–White difference of −.12 on global emotional stability measures but a reversal, i.e., .17, on self-esteem, a facet of emotional stability). Table 8 of Foldes et al. also provides a summary of scenarios, based on majority/minority group selection ratios, under which these observed group differences are unlikely to result in adverse impact. These two summaries highlight the usefulness of personality variables in reducing adverse impact in personnel selection systems as well as the importance of focusing on facet-level measurement.
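The connection between a subgroup difference d, the selection ratio, and adverse impact can be made concrete with a back-of-the-envelope calculation. The sketch below assumes normally distributed scores, a single common cut score, and hypothetical values of d and the majority selection ratio; it is not a reanalysis of Hough et al. (2001) or Foldes et al. (2008).

```python
# Back-of-the-envelope adverse impact calculation under normality and a single
# cut score; the d values and selection ratio below are hypothetical.
from statistics import NormalDist

def adverse_impact_ratio(d, majority_selection_ratio):
    """AI ratio = minority pass rate / majority pass rate for one common cut score."""
    cut = NormalDist().inv_cdf(1 - majority_selection_ratio)  # cut score in majority z units
    minority_pass_rate = 1 - NormalDist().cdf(cut + d)        # minority mean is d SDs lower
    return minority_pass_rate / majority_selection_ratio

for d in (0.10, 0.20, 1.00):
    ratio = adverse_impact_ratio(d, majority_selection_ratio=0.30)
    status = "meets" if ratio >= 0.80 else "falls below"
    print(f"d = {d:.2f}: AI ratio = {ratio:.2f} ({status} the four-fifths benchmark)")
```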

CONCLUSIONS
We now have a better understanding of personality and its role in determining work behavior and performance. Although the FFM has provided an important framework to organize our research and systematically cumulate evidence, understanding personality and personality-criterion relationships requires more than five trait variables, including both broader and narrower variables. Current research examining the taxonomic structure at the facet level of the FFM will benefit science and practice as generally accepted models emerge. Such models allow us to move beyond inventory-specific investigations of limited generalizability to cumulating results across studies and settings, thus enabling systematic investigations of moderator variables. Such models also enhance our theory building and theory testing. As our knowledge of personality-criterion relationships grows for different hierarchical levels of predictor and criterion variables, we learn how to combine predictor variables into criterion-appropriate composites that will enhance the prediction of valued outcomes in applied settings.

The prospects of better understanding the determinants of work behavior and performance are exciting. Already, primary studies, meta-analyses, and second-order meta-analyses provide ample evidence that traditional self-report questionnaires of personality are among the most powerful predictors of behavior in work settings. New developments in assessment and scoring methods show promise for further improvements in measurement and prediction. Although initial optimism regarding alternate response formats (e.g., fully ipsative forced-choice scales) proved unjustified, other innovations (e.g., ideal point response methods and adaptive testing based on IRT) are promising ways to address concerns about traditional self-reports of personality on Likert-type scales. Moreover, I-O psychologists have several other assessment tools at their disposal to measure personality (e.g., biodata, interviews, other-reports, SJTs, and ACs). In addition to improving measurement using self-report personality measures, we encourage researchers to thoroughly investigate the value of standardized other-reports in relation to occupational criteria. The few studies that have investigated their criterion-related validity suggest that other-reports may be even more valid for certain criteria than are self-report measures of personality. Other-reports can reliably capture personality variance that improves construct coverage and thus have the potential to increment criterion-related validity. More evidence for the validity of other-reports must be established, and moderator variables (such as rating source) must be more systematically investigated, before organizations will be persuaded to implement such measures more fully in employee selection.

Personality variables add significant explanatory and predictive power beyond other variables (e.g., educational credentials, cognitive ability, work experience) often assessed during employment decision making. With better understanding of the structure of personality and criterion variables and better measurement of both, personality will be more fully recognized for its very important role in affecting work behavior and performance.


NOTES
1. The results of differential observer agreement for personality traits reviewed above can provide helpful information about which traits are best assessed with traditional employment interviews, at least with regard to issues of reliability.
2. Although some have criticized the use of these intercorrelations on the basis that they are purportedly “unrealistically low” (Morgeson et al., 2007a, p. 1035), the meta-analyses are based on data from thousands of people. Other researchers have also used these estimates to compute construct overlap between personality measures and other individual difference variables to estimate incremental validity (e.g., Judge, Heller, & Mount, 2002; Judge & Ilies, 2002; McDaniel et al., 2007).

REFERENCES Alliger, G. M., Lilienfeld, S. O., & Mitchell, K. E. (1996). The susceptibility of overt and covert integrity tests to coaching and faking. Psychological Science, 7, 32–39. Arthur, W., Day, E. A., McNelly, T. L., & Edens, P. S. (2003). A meta-analysis of the criterion-related validity of assessment center dimensions. Personnel Psychology, 56, 125–154. Ashton, M. C., Lee, K., & de Vries, R. E. (2014). The HEXACO honesty-humility, agreeableness, and emotionality factors: A review of research and theory. Personality and Social Psychology Review, 18, 139–152. Back, M. D., Stopfer, J. M., Vazire, S., Gaddis, S., Schmukle, S. C., Egloff, B., & Gosling, S. D. (2010). Facebook profiles reflect actual personality, not self-idealization. Psychological Science, 21, 372–374. Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26. Barrick, M. R., & Mount, M. K. (2005). Yes, personality matters: Moving on to more important matters. Human Performance, 18, 359–372. Barrick, M. R., Mount, M. K., & Judge, T. A. (2001). Personality and performance at the beginning of the new millennium: What do we know and where do we go next? International Journal of Selection and Assessment, 9, 9–30. Bartram, D. (2005). The Great Eight competencies: A criterion-centric approach to validation. Journal of Applied Psychology, 90, 1185–1203. Ben-Porath, Y. S., & Tellegen, A. (2008). MMPI-2-RF: Manual for administration, scoring and interpretation. Minneapolis, MN: University of Minnesota Press. Benson, M. J., & Campbell, J. P. (2007). To be, or not to be, linear: An expanded representation of personality and its relationship to leadership performance. International Journal of Selection and Assessment, 15, 232–249. Berry, C. M., Carpenter, N. C., & Barratt, C. L. (2012). Do other-reports of counterproductive work behavior provide an incremental contribution over self-reports? A meta-analytic comparison. Journal of Applied Psychology, 97, 613–636. Berry, C. M., Ones, D. S., & Sackett, P. R. (2007). Interpersonal deviance, organizational deviance, and their common correlates: A review and meta-analysis. Journal of Applied Psychology, 92, 410–424. Beus, J. M., Dhanani, L. Y., & McCord, M. A. (2015). A meta-analysis of personality and workplace safety: Addressing unanswered questions. Journal of Applied Psychology, 100, 481–498. Blackman, M. C., & Funder, D. C. (2002). Effective interview practices for accurately assessing counterproductive traits. International Journal of Selection and Assessment, 10, 109–116. Block, J. (1995). A contrarian view of the five-factor approach to personality description. Psychological Bulletin, 117, 187–215. Brown, A., & Maydeu-Olivares, A. (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71, 460–502. Brown, A., & Maydeu-Olivares, A. (2013). How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychological Methods, 18, 36–52. Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., & Dahlstrom, W. G. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2). Manual for administration, scoring, and interpretation (Revised ed.). Minneapolis, MN: University of Minnesota Press. Call, M. L., Nyberg, A. J., Ployhart, R. E., & Weekley, J. (2015). The dynamic nature of collective turnover and unit performance: The impact of time, quality, and replacements. Academy of Management Journal, 58, 1208–1232. Carter, N. 
T., Dalal, D. K., Boyce, A. S., O’Connell, M. S., Kung, M-C., & Delgado, K. M. (2014). Uncovering curvilinear relationships between conscientiousness and job performance: How theoretically appropriate measurement makes an empirical difference. Journal of Applied Psychology, 99, 564–586.


Leaetta Hough and Stephan Dilchert Chan, D. (2000). Understanding adaptation to changes in the work environment: Integrating individual difference and learning perspectives. Research in Personnel and Human Resources Management, 18, 1–42. Chan, D. (2005). Current directions in personnel selection. Current Directions in Psychological Science, 14, 220–223. Chan, D. (2006). Interactive effects of situational judgment effectiveness and proactive personality on work perceptions and work outcomes. Journal of Applied Psychology, 91, 475–481. Chan, D. (2009). So why ask me? Are self-report data really that bad? In C. E. Lance & R. J. Vandenberg (Eds.), Statistical and methodological myths and urban legends: Received doctrine, verity, and fable in the organizational and social sciences (pp. 308–336). New York: Routledge. Chan, D. (2014). Individual adaptability to changes at work: New directions in research. New York, NY: Routledge. Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82, 143–159. Chan, D., & Schmitt, N. (2002). Situational judgment and job performance. Human Performance, 15, 233–254. Chan, D., & Schmitt, N. (2005). Situational judgment tests. In A. Evers, O. Voskuijl, & N. Anderson (Eds.), Handbook of selection (pp. 219–242). Oxford, England: Blackwell. Chang, C. H., Ferris, D. L., Johnson, R. E., Rosen, C. C., & Tan, J. A. (2012). Core self-evaluations: A review and evaluation of the literature. Journal of Management, 38, 81–128. Chernyshenko, O. S., Stark, S., Drasgow, F., & Roberts, B. W. (2007). Constructing personality scales under the assumptions of an ideal point response process: Toward increasing the flexibility of personality measures. Psychological Assessment, 19, 88–106. Chiaburu, D. S., Oh, I-S., Berry, C. M., Li, N., & Gardner, R. G. (2011). The five-factor model of personality traits and organizational citizenship behaviors: A meta-analysis. Journal of Applied Psychology, 96, 1140–1166. Christian, M. S., Edwards, B. D., & Bradley, J. C. (2010). Situational judgment tests: Constructs assessed and a meta-analysis of their criterion-related validities. Personnel Psychology, 63, 83–117. Christiansen, N. D., Burns, G. N., & Montgomery, G. E. (2005). Reconsidering forced-choice item formats for applicant personality assessment. Human Performance, 18, 267–307. Connelly, B. S., Davies, S. E., Ones, D. S., & Birkland, A. (2008). Conscientiousness: Investigation of its facet structure through meta-analytic factor analysis. International Journal of Psychology, 43, 553–553. Connelly, B. S., & Hülsheger, U. R. (2012). A narrower scope or a clearer lens for personality? Examining sources of observers’ advantages over self-reports for predicting performance. Journal of Personality, 80, 603–631. Connelly, B. S., & Ones, D. S. (2010). An other perspective on personality: Meta-analytic integration of observers’ accuracy and predictive validity. Psychological Bulletin, 136, 1092–1122. Connelly, B. S., Ones, D. S., Davies, S. E., & Birkland, A. (2014). Opening up openness: A theoretical sort following critical incidents methodology and meta-analytic investigation of the trait family measures. Journal of Personality Assessment, 96, 17–28. Connelly, B. S., Davies, S. E., Ones, D. S., & Birkland, A. (2008). Conscientiousness: Investigation of its facet structure through meta-analytic factor analysis. 
International Journal of Psychology, 43, 553–553. Connolly, J. J., Kavanagh, E. J., & Viswesvaran, C. (2007). The convergent validity between self and observer ratings of personality: A meta-analytic review. International Journal of Selection and Assessment, 15, 110–117. Cucina, J. M., & Vasilopoulos, N. L. (2005). Nonlinear personality-performance relationships and the spurious moderating effect of traitedness. Journal of Personality, 73, 227–260. DeGroot, T., & Kluemper, D. (2007). Evidence of predictive and incremental validity of personality factors, vocal attractiveness and the situational interview. International Journal of Selection and Assessment, 15, 30–39. DeYoung, C. G., Quilty, L. C., & Peterson, J. B. (2007). Between facets and domains: 10 aspects of the Big Five. Journal of Personality and Social Psychology, 93, 880–896. Digman, J. M. (1997). Higher-order factors of the Big Five. Journal of Personality and Social Psychology, 73, 1246–1256. Dilchert, S., & Ones, D. S. (2011). Application of preventive strategies. In M. Ziegler, C. MacCann, & R. D. Roberts (Eds.), New perspectives on faking in personality assessments (pp. 177–200). New York, NY: Oxford University Press. Dilchert, S., Ones, D. S., & Krueger, R. F. (2014). Maladaptive personality constructs, measures, and work behaviors: Scientific background and employment practice recommendations. Industrial and Organizational Psychology, 7, 98–110. Dilchert, S., Ones, D. S., Van Rooy, D. L., & Viswesvaran, C. (2006). Big Five factors of personality. In J. H. Greenhaus & G. A. Callanan (Eds.), Encyclopedia of career development (pp. 36–42). Thousand Oaks, CA: Sage.


Personality Dilchert, S., Ones, D. S., Viswesvaran, C., & Deller, J. (2006). Response distortion in personality measurement: Born to deceive, yet capable of providing valid self-assessments? Psychology Science, 48, 209–225. Drasgow, F., Chernyshenko, O. S., & Stark, S. (2010a). 75 years after Likert: Thurstone was right! Industrial and Organizational Psychology, 3, 465–476. Drasgow, F., Chernyshenko, O. S., & Stark, S. (2010b). Tailored Adapative Personality Assessment System (TAPAS). Urbana, IL: Authors. Drasgow, F., Stark, S., Chernyshenko, O. S., Nye, C. D., Hulin, C. L., & White, L. A. (2012). Development of the Tailored Adaptive Personality Assessment System (TAPAS) to support Army selection and classification decisions (Tech. Rep. No. 1311). Arlington, VA: U.S. Army Research Institute for the Behavioral and Social Sciences. Dudley, N. M., Orvis, K. A., Lebiecki, J. E., & Cortina, J. M. (2006). A meta-analytic investigation of conscientiousness in the prediction of job performance: Examining the intercorrelations and the incremental validity of narrow traits. Journal of Applied Psychology, 91, 40–57. Engel, D., Woolley, A. W., Jing, L. X., Chabris, C. F., & Malone, T. W. (2014). Reading the mind in the eyes or reading between the lines? Theory of mind predicts collective intelligence equally well online and face-to-face. PLOS One, 9, 1–16. Foldes, H. J., Duehr, E. E., & Ones, D. S. (2008). Group differences in personality: Meta-analyses comparing five U.S. racial groups. Personnel Psychology, 61, 579–616. Funder, D. C. (1995). On the accuracy of personality judgment: A realistic approach. Psychological Review, 102, 652–670. Gaddis, B. H., & Foster, J. L. (2015). Meta-analysis of dark side personality characteristics and critical work behaviors among leaders across the globe: Findings and implications for leadership development and executive coaching. Applied Psychology, 64, 25–54. Goldberg, L. R. (1993). The structure of phenotypic personality traits. American Psychologist, 48, 26–34. Griffeth, R. W., Hom, P. W., & Gaertner, S. (2000). A meta-analysis of antecedents and correlates of employee turnover: Update, moderator tests, and research implications for the next millennium. Journal of Management, 26, 463–488. Grijalva, E., & Newman, D. A. (2015). Narcissism and counterproductive work behavior (cwb): Metaanalysis and consideration of collectivist culture, Big Five personality, and narcissism’s facet structure. Applied Psychology, 64, 93–126. Harold, C. M., McFarland, L. A., & Weekley, J. A. (2006). The validity of verifiable and non-verifiable biodata items: An examination across applicants and incumbents. International Journal of Selection and Assessment, 14, 336–346. Heggestad, E. D., Morrison, M., Reeve, C. L., & McCloy, R. A. (2006). Forced-choice assessments of personality for selection: Evaluating issues of normative assessment and faking resistance. Journal of Applied Psychology, 91, 9–24. Hicks, L. E. (1970). Some properties of ipsative, normative, and forced-choice normative measures. Psychological Bulletin, 74, 167–184. Hogan, J., & Holland, B. (2003). Using theory to evaluate personality and job-performance relations: A socioanalytic perspective. Journal of Applied Psychology, 88, 100–112. Hogan, J., & Ones, D. S. (1997). Conscientiousness and integrity at work. In R. Hogan & J. A. Johnson (Eds.), Handbook of personality psychology (pp. 849–870). San Diego, CA: Academic Press. Hogan, J., & Roberts, B. W. (1996). Issues and non-issues in the fidelity-bandwidth trade-off. 
Journal of Organizational Behavior, 17, 627–637. Hogan, R. T. (2005a). Comments. Human Performance, 18, 405–407. Hogan, R. T. (2005b). In defense of personality measurement: New wine for old whiners. Human Performance, 18, 331–341. Hogan, R. T., & Hogan, J. (2009). Hogan development survey manual. Tulsa, OK: Hogan Assessment Systems. Holland, J. L. (1976). Vocational preferences. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 521–570). New York, NY: Rand-McNally. Hough, L. M. (1992). The “Big Five” personality variables—construct confusion: Description versus prediction. Human Performance, 5, 139–155. Hough, L. M. (1997). The millennium for personality psychology: New horizons or good old daze. Applied Psychology: An International Review, 47, 233–261. Hough, L. M. (1998). Personality at work: Issues and evidence. In M. D. Hakel (Ed.), Beyond multiple choice: Evaluating alternatives to traditional testing for selection (pp. 131–166). Mahwah, NJ: Lawrence Erlbaum. Hough, L. M. (2010). Assessment of background and life experience: Past as prologue. In J. C. Scott & D. H. Reynolds (Eds.), Handbook of workplace assessment: Selecting and developing organizational talent (pp. 109–139). Hoboken, NJ: Wiley & Sons. Hough, L. M., & Connelly, B. S. (2012). Personality measurement and use in industrial-organizational psychology. In K. F. Geisinger (Editor-in-Chief), APA handbook on testing and assessment and N. Kuncel (Vol.


Leaetta Hough and Stephan Dilchert 1 Ed.), Test theory and testing and assessment in industrial and organizational psychology (Vol. 1, pp. 501–531). Washington, DC: American Psychological Association. Hough, L. M., & Dilchert, S. (2010). Personality: Its measurement and validity for personnel selection. In J. Farr & N. Tippins (Eds.), Handbook of employee selection (pp. 299–319). New York, NY: Routledge—Taylor & Francis Group. Hough, L. M., Eaton, N. K., Dunnette, M. D., Kamp, J. D., & McCloy, R. A. (1990). Criterion-related validities of personality constructs and the effect of response distortion on those validities [Monograph]. Journal of Applied Psychology, 75, 581–595. Hough, L. M., & Furnham, A. (2003). Use of personality variables in work settings. In W. C. Borman, D. R. Ilgen & R. J. Klimoski (Eds.), Handbook of psychology. Vol. 12: Industrial and organizational psychology (pp. 131–169). Hoboken, NJ: John Wiley & Sons. Hough, L. M., & Johnson, J. W. (2013). Use and importance of personality variables in work settings. In I. B. Weiner (Ed.-in-Chief) & N. Schmitt & S. Highhouse (Vol. Eds.), Handbook of psychology: Vol. 12: Industrial and organizational psychology (pp. 211–243). New York, NY: Wiley. Hough, L. M., & Ones, D. S. (2001). The structure, measurement, validity, and use of personality variables in industrial, work, and organizational psychology. In N. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work, and organizational psychology. Vol. 1: Personnel psychology (pp. 233–277). London, England: Sage. Hough, L. M., Ones, D. S., & Viswesvaran, C. (April 1998). Personality correlates of managerial performance constructs. Poster session presented at the annual conference of the Society for Industrial and Organizational Psychology, Dallas, TX. Hough, L. M., & Oswald, F. L. (2000). Personnel selection: Looking toward the future—Remembering the past. Annual Review of Psychology, 51, 631–664. Hough, L. M., & Oswald, F. L. (2005). They’re right, well . . . mostly right: Research evidence and an agenda to rescue personality testing from 1960s insights. Human Performance, 18, 373–387. Hough, L. M., & Oswald, F. L. (2008). Personality testing and I-O psychology: Reflections, progress, and prospects. Industrial and Organizational Psychology, 1, 272–290. Hough, L. M., Oswald, F. L., & Ock, J. (2015). Beyond the Big Five—New directions for personality research and practice. In F. P. Morgeson (Ed.), Annual review of organizational psychology and organizational behavior (pp. 183–209). Palo Alto, CA: Annual Reviews. Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration of adverse impact in personnel selection procedures: Issues, evidence and lessons learned. International Journal of Selection and Assessment, 9, 152–194. Hough, L. M., & Paullin, C. (1994). Construct-oriented scale construction: The rational approach. In G. S. Stokes & M. D. Mumford (Eds.), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 109–145). Palo Alto, CA: CPP Books. Hough, L. M., & Schneider, R. J. (1996). Personality traits, taxonomies, and applications in organizations. In K. R. Murphy (Ed.), Individual differences and behavior in organizations (pp. 31–88). San Francisco, CA: Jossey-Bass. Houston, J. S., Borman, W. C., Farmer, W., & Bearden, R. M. (2005). 
Development of the Enlisted Computer Adaptive Personality Scales (ENCAPS), Renamed Navy Computer Adaptive Personality Scales (NCAPS) (Institute Report #503). Minneapolis, MN: Personnel Decisions Research Institutes. Huang, J. L., Ryan, A. M., Zabel, K. L., & Palmer, A. (2014). Personality and adaptive performance at work: A meta-analytic investigation. Journal of Applied Psychology, 99, 162–179. Huffcutt, A. I., & Arthur, W. (1994). Hunter and Hunter (1984) revisited: Interview validity for entry-level jobs. Journal of Applied Psychology, 79, 184–190. Huffcutt, A. I., Conway, J. M., Roth, P. L., & Stone, N. J. (2001). Identification and meta-analytic assessment of psychological constructs measured in employment interviews. Journal of Applied Psychology, 86, 897–913. Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The Big Five revisited. Journal of Applied Psychology, 85, 869–879. Jackson, D. N., Wroblewski, V. R., & Ashton, M. C. (2000). The impact of faking on employment tests: Does forced choice offer a solution? Human Performance, 13, 371–388. Judge, T. A., Heller, D., & Mount, M. K. (2002). Five-factor model of personality and job satisfaction: A meta-analysis. Journal of Applied Psychology, 87, 530–541. Judge, T. A., & Ilies, R. (2002). Relationship of personality to performance motivation: A meta-analytic review. Journal of Applied Psychology, 87, 797–807. Judge, T. A., Rodell, J. B., Klinger, R. L., Simon, L. S., & Crawford, E. R. (2013). Hierarchical representations of the Five-Factor Model of personality in predicting job performance: Integrating three organizing frameworks with two theoretical perspectives. Journal of Applied Psychology, 98, 875–925.


Personality Kaplan, S., Bradley, J. C., Luchman, J. N., & Haynes, D. (2009). On the role of positive and negative affectivity in job performance: A meta-analytic investigation. Journal of Applied Psychology, 94, 162–176. Kilcullen, R. N., White, L. A., Mumford, M. D., & Mack, H. (1995). Assessing the construct validity of rational biodata scales. Military Psychology, 7, 17–28. Knapp, D. J., Owens, K. S., & Allen, M. T. (Eds.) (2011). Validating future force performance measures (Army Class): In-unit performance longitudinal validation (Tech. Rep. No. Fr-10–38). Arlington, VA: U.S. Army Research Institute for the Behavioral and Social Sciences. Krause, D. E., & Thornton, G. C. (2009). A cross-cultural look at assessment center practices: Survey results from Western Europe and North American. Applied Psychology: An International Review, 58, 557–585. Krueger, R. F., Derringer, J., Markon, K. E., Watson, D., & Skodol, A. E. (2012). Initial construction of a maladaptive personality trait model and inventory for DSM-5. Psychological Medicine, 42, 1879–1890. LaHuis, D. M., Martin, N. R., & Avis, J. M. (2005). Investigating nonlinear conscientiousness—Job performance relations for clerical employees. Human Performance, 18, 199–212. Le, H., Oh, I-S., Robbins, S. B., Ilies, R., Holland, E., & Westrick, P. (2011). Too much of a good thing: Curvilinear relationships between personality traits and job performance. Journal of Applied Psychology, 96, 113–133. Lievens, F., De Corte, W., & Westerveld, L. (2015). Understanding the building blocks of selection procedures: Effects of response fidelity on performance and validity. Journal of Management, 41, 1604–1627. Lievens, F., Ones, D. S., & Dilchert, S. (2009). Personality scale validities increase throughout medical school. Journal of Applied Psychology, 94, 1514–1535. Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22, 55. Macey, W. H., LoVerde, M. A., & Bartram, D. (2016). Evidence for a replicable leadership typology. In W. H. Macey (Chair), Current perspectives on person-centered leadership research. Symposium conducted at the 31st Annual Convention of the Society for Industrial and Organizational Psychology, Anaheim. Manley, G. G., Benavidez, J., & Dunn, K. (2007). Development of a personality biodata measure to predict ethical decision making. Journal of Managerial Psychology, 22, 664–682. Martin, B. A., Bowen, C. C., & Hunt, S. T. (2002). How effective are people at faking on personality questionnaires? Personality and Individual Differences, 32, 247–256. McCloy, R. A., Heggestad, E. D., & Reeve, C. L. (2005). A silk purse from the sow’s ear: Retrieving normative information from multidimensional forced-choice items. Organizational Research Methods, 8, 222–248. McDaniel, M. A., Hartman, N. S., Whetzel, D. L., & Grubb, W. (2007). Situational judgment tests, response instructions, and validity: A meta-analysis. Personnel Psychology, 60, 63–91. McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79, 599–616. McFarland, L. A., & Ryan, A. M. (2000). Variance in faking across noncognitive measures. Journal of Applied Psychology, 85, 812–821. McGrath, R. E., Mitchell, M., Kim, B., & Hough, L. M. (2010). The validity of response bias indicators. Psychological Bulletin, 136, 450–470. McHenry, J. J., Hough, L. M., Toquam, J. L., Hanson, M. A., & Ashworth, S. (1990). 
Project A validity results: The relationship between predictor and criterion domains. Personnel Psychology, 43, 335–354. McManus, M. A., & Kelly, M. L. (1999). Personality measures and biodata: Evidence regarding their incremental predictive value in the life insurance industry. Personnel Psychology, 52, 137–148. Moore, M., & Ones, D. S. (April 2016). Convergent and discriminant validity of dark tetrad measures. Poster presented at the annual conference of the Society for Industrial and Organizational Psychology, Anaheim, CA. Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck, J. R., Murphy, K., & Schmitt, N. (2007a). Are we getting fooled again? Coming to terms with limitations in the use of personality tests for personnel selection. Personnel Psychology, 60, 1029–1049. Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck, J. R., Murphy, K., & Schmitt, N. (2007b). Reconsidering the use of personality tests in personnel selection contexts. Personnel Psychology, 60, 683–729. Moscoso, S. (2000). Selection interview: A review of validity evidence, adverse impact and applicant reactions. International Journal of Selection and Assessment, 8, 237–247. Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640–647. Murphy, K. R., & Dzieweczynski, J. L. (2005). Why don’t measures of broad dimensions of personality perform better as predictors of job performance? Human Performance, 18, 343–357. National Research Council. (2015). Measuring human capabilities: An agenda for basic research on the assessment of individual and group performance potential for military accession. Committee on Measuring Human Capabilities: Performance Potential of Individuals and Collectives, Board of Behavioral, Cognitive, and Sensory


Leaetta Hough and Stephan Dilchert Sciences. Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press. O’Boyle, E. H., Forsyth, D. R., Banks, G. R., & McDaniel, M. A. (2012). A meta-analysis of the Dark Triad and work behavior: A social exchange perspective. Journal of Applied Psychology, 97, 557–579. Oh, I-S., Wang, G., Mount, M. K. (2011). Validity of observer ratings of the five-factor model of personality traits: A meta-analysis. Journal of Applied Psychology, 96, 762–773. Ones, D. S. (1993). The construct validity of integrity tests. Unpublished doctoral dissertation, University of Iowa, Iowa City, IA. Ones, D. S., & Dilchert, S. (February 2008). Recent assessment center research: Dimensions, exercises, group differences, and incremental validity. Paper presented at the annual Assessment Centre Study Group conference, Stellenbosch, South Africa. Ones, D. S., Dilchert, S., Viswesvaran, C., & Judge, T. A. (2007). In support of personality assessment in organizational settings. Personnel Psychology, 60, 995–1027. Ones, D. S., & Viswesvaran, C. (1996). Bandwidth-fidelity dilemma in personality measurement for personnel selection. Journal of Organizational Behavior, 17, 609–626. Ones, D. S., & Viswesvaran, C. (2001). Personality at work: Criterion-focused occupational personality scales used in personnel selection. In B. W. Roberts & R. Hogan (Eds.), Personality psychology in the workplace (pp. 63–92). Washington, DC: American Psychological Association. Ones, D. S., & Viswesvaran, C. (2008). Customer service scales: Criterion-related, construct, and incremental validity evidence. In J. Deller (Ed.), Research contributions to personality at work (pp. 19–46). Mering, Germany: Hampp. Ones, D. S., Viswesvaran, C., & Dilchert, S. (2005). Personality at work: Raising awareness and correcting misconceptions. Human Performance, 18, 389–404. Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: Findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology, 78, 679–703. Oswald, F. L. (2010). Practical recommendations for trait-level estimation in the Navy Computer Adaptive Personality Scales (NCAPS). Millington, TN: Navy Personnel Research Studies, and Technology. Oswald, F. L., & Hough, L. M. (2008). Personality testing and I-O psychology: A productive exchange and some future directions. Industrial and Organizational Psychology, 1, 323–332. Oswald, F. L., & Hough, L. M. (2011). Personality and its assessment in organizations: Theoretical and empirical developments. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology: Vol. 2. Selecting and developing members for the organization (pp. 153–184). Washington, DC: American Psychological Association. Oswald, F. L., Hough, L. M., & Ock, J. (2013). Theoretical and empirical structures of personality: Implications for measurement, modeling and prediction. In N. D. Christiansen & R. P. Tett (Eds.), Handbook of personality at work (pp. 11–29). New York, NY: Routledge/Taylor & Francis Group. Oswald, F. L., & Schell, K. L. (2010). Developing and scaling personality measures: Thurstone was right— but so far, Likert was not wrong. Industrial and Organizational Psychology, 3, 481–484. Oswald, F. L., Shaw, A., & Farmer, W. L. (2015). Comparing simple scoring with IRT scoring of personality measures: The Navy computer adaptive personality scales. 
Applied Psychological Measurement, 39, 144–154. Oviedo-Garcia, M. (2007). Internal validation of a biodata extraversion scale for salespeople. Social Behavior and Personality, 35, 675–692. Paulhus, D. L., & Williams, K. M. (2002). The Dark Triad of personality: Narcissism, Machiavellianism, and psychopathy. Journal of Research in Personality, 36, 556–563. Pew Research Center. (2009). America’s changing workforce: Recession turns a graying office grayer. Washington, DC: Pew Research Center. Retrieved from http://www.pewsocialtrends.org/files/2010/10/americas-changingworkforce.pdf Ployhart, R. E., & Ehrhart, M. G. (2003). Be careful what you ask for: Effects of response instructions on the construct validity and reliability of situational judgment tests. International Journal of Selection and Assessment, 11, 1–16. Ployhart, R. E., Weekley, J. A., & Baughman, K. (2006). The structure and function of human capital emergence: A multilevel examination of the attraction-selection-attrition model. Academy of Management Journal, 49, 661–677. Ployhart, R. E., Weekley, J. A., & Ramsey, J. (2009). The consequences of human resource stocks and flows: A longitudinal examination of unit service orientation and unit effectiveness. Academy of Management Journal, 52, 996–1015. Poropat, A. E. (2009). A meta-analysis of the Five-Factor Model of personality and academic performance. Psychological Bulletin, 135, 322–338.


Personality Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000). Adaptability in the workplace: Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 83, 612–624. Pulakos, E. D., Schmitt, N., Dorsey, D. W., Arad, S., Hedge, J. W., & Borman, W. C. (2002). Predicting adaptive performance: Further tests of a model of adaptability. Human Performance, 15, 299–323. Reilly, R. R., & Chao, G. R. (1982). Validity and fairness of some alternative employee selection procedures. Personnel Psychology, 35, 1–62. Roberts, B. W., Chernyshenko, O. S., Stark, S. E., & Goldberg, L. R. (2005). The structure of conscientiousness: An empirical investigation based on seven major personality questionnaires. Personnel Psychology, 58, 103–139. Roberts, B. W., Kuncel, N. R., Shiner, R., Caspi, A., & Goldberg, L. R. (2007). The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2, 313–345. Roberts, J. S., Donoghue, J. R., & Laughlin, J. E. (2000). A general item response theory model for unfolding unidimensional polytomous responses. Applied Psychological Measurement, 24, 3–32. Robie, C., & Ryan, A. M. (1999). Effects of nonlinearity and heteroscedasticity on the validity of conscientiousness in predicting overall job performance. International Journal of Selection and Assessment, 7, 157–169. Rose, A., Timm, H., Pogson, C., Gonzales, J., Appel, E., & Kolb, N. (2010). Developing a cybervetting strategy for law enforcement (Special Report). Defense Personnel Security Research Center and International Association of Chiefs of Police. Roth, P. L., Van Iddekinge, C. H., Huffcutt, A. I., Eidson, C. E., Jr., & Schmit, M. J. (2005). Personality saturation in structured interviews. International Journal of Selection and Assessment, 4, 261–273. Rothstein, M. G., & Goffin, R. D. (2000). The assessment of personality constructs in industrial-organizational psychology. In R. D. Goffin & E. Helmes (Eds.), Problems and solutions in human assessment: Honoring Douglas N. Jackson at seventy (pp. 215–248). New York, NY: Kluwer Academic/Plenum. Rothstein, M. G., & Goffin, R. D. (2006). The use of personality measures in personnel selection: What does current research support? Human Resource Management Review, 16, 155–180. Ryan, A. M., McFarland, L., Baron, H., & Page, R. (1999). An international look at selection practices: Nation and culture as explanations for variability in practice. Personnel Psychology, 52, 359–391. Sackett, P. R., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 419–450. Salgado, J. F. (2002). The Big Five personality dimensions and counterproductive behaviors. International Journal of Selection and Assessment, 10, 117–125. Salgado, J. F., Anderson, N., & Tauriz, G. (2014). The validity of ipsative and quasi-ipsative forced-choice personality inventories for different occupational groups: A comprehensive meta-analysis. Journal of Occupational and Organizational Psychology, 87, 1–38. Salgado, J. F., & Táuriz, G. (2014). The Five-Factor Model, forced-choice personality inventories and performance: A comprehensive meta-analysis of academic and occupational validity studies. European Journal of Work and Organizational Psychology, 23, 3–30. Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199–223. 
Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. (1984). Meta-analyses of validity studies published between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology, 37, 407–422. Schmitt, N., & Kunce, C. (2002). The effects of required elaboration of answers to biodata questions. Personnel Psychology, 55, 569–587. Schmitt, N., Oswald, F. L., Kim, B. H., Gillespie, M. A., Ramsay, L. J., & Yoo, T-Y. (2003). Impact of elaboration on socially desirable responding and the validity of biodata measures. Journal of Applied Psychology, 88, 979–988. Schneider, B., & Bartram, D. (February 2015). Aggregate personality and organizational performance. Presentation at annual meeting of Summit Group, Livingston, TX. Schneider, R. J., & Hough, L. M. (1995). Personality and industrial/organizational psychology. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (pp. 75–129). Chichester, England: Wiley. Schneider, R. J., Hough, L. M., & Dunnette, M. D. (1996). Broadsided by broad traits: How to sink science in five dimensions or less. Journal of Organizational Behavior, 17, 639–655. Scholz, G., & Schuler, H. (1993). Das nomologische Netzwerk des Assessment Centers: Eine Metaanalyse. [The nomological network of the assessment center: A meta-analysis]. Zeitschrift für Arbeits- und Organisationspsychologie, 37, 73–85. Sisco, H., & Reilly, R. R. (2007a). Development and validation of a biodata inventory as an alternative method to measurement of the Five Factor Model of personality. Social Science Journal, 44, 383–389. Sisco, H., & Reilly, R. R. (2007b). Five Factor Biodata Inventory: Resistance to faking. Psychological Reports, 101, 3–17.


Leaetta Hough and Stephan Dilchert Stanush, P. L. (1997). Factors that influence the susceptibility of self-report inventories to distortion: A meta-analytic investigation. Doctoral dissertation, College Station, TX: Texas A&M University. Stark, S., & Chernyshenko, O. S. (2007). Adaptive testing with the multi-unidimensional pairwise preference model. In D. J. Weiss (Ed.), Proceedings of the 2007 GMAC Conference on Computerized Adaptive Testing. Stark, S., Chernyshenko, O. S., & Drasgow, F. (2005). An IRT approach to constructing and scoring pair-wise preference items involving stimuli on different dimensions: The multi-unidimensional pairwise-preference model. Applied Psychological Measurement, 29, 184–203. Stark, S., Chernyshenko, O. S., Drasgow, F., Nye, C. D., White, L. A., Heffner, T., & Farmer, W. L. (2014). From ABLE to TAPAS: A new generation of personality tests to support military selection and classification decisions. Military Psychology, 26, 153–164. Stark, S., Chernyshenko, O. S., Drasgow, F., & Williams, B. A. (2006). Examining assumptions about item responding in personality assessment: Should ideal point methods be considered for scale development and scoring? Journal of Applied Psychology, 91, 25–39. Steel, P., Schmidt, J., & Shultz, J. (2008). Refining the relationship between personality and subjective well-being. Psychological Bulletin, 134, 138–161. Stokes, G. S., & Cooper, L. A. (2001). Content/construct approaches in life history form development for selection. International Journal of Selection and Assessment, 9, 138–151. Stokes, G. S., & Cooper, L. A. (2004). Biodata. In J. C. Thomas (Ed.), Comprehensive handbook of psychological assessment (Vol. 4: Industrial and organizational assessment, pp. 243–268). Hoboken, NJ: John Wiley & Sons. Stokes, G. S., Hogan, J. B., & Snell, A. F. (1993). Comparability of incumbent and applicant samples for the development of biodata keys: The influence of social desirability. Personnel Psychology, 46, 739–762. Tenopyr, M. L. (1994). Big Five, structural modeling, and item response theory. In G. S. Stokes, M. D. Mumford & W. A. Owens (Eds.), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 519–533). Palo Alto, CA: Consulting Psychologists Press. Tett, R. P., & Burnett, D. D. (2003). A personality trait-based interactionist model of job performance. Journal of Applied Psychology, 88, 500–517. Tett, R. P., & Christiansen, N. D. (2007). Personality tests at the crossroads: A response to Morgeson, Campion, Dipboye, Hollenbeck, Murphy, and Schmitt (2007). Personnel Psychology, 60, 967–993. Tett, R. P., & Guterman, H. A. (2000). Situation trait relevance, trait expression, and cross-situational consistency: Testing a principle of trait activation. Journal of Research in Personality, 34, 397–423. Tett, R. P., Jackson, D. N., Rothstein, M., & Reddon, J. R. (1999). Meta-analysis of bidirectional relations in personality-job performance research. Human Performance, 12, 1–29. Thoresen, C. J., Kaplan, S. A., Barsky, A. P., Warren, C. R., & de Chermont, K. (2003). The affective underpinnings of job perceptions and attitudes: A meta-analytic review and integration. Psychological Bulletin, 129, 914–945. Thornton, G. C., & Rupp, D. E. (2003). Simulations and assessment centers. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Handbook of psychology: Industrial and organizational psychology (Vol. 12, pp. 319–344). New York, NY: Wiley & Sons. Thurstone, L. L. (1928). 
Attitudes can be measured. American Journal of Sociology, 33, 529–554. Tupes, E. C., & Christal, R. E. (1961/1992). Recurrent personality factors based on trait ratings. Journal of Personality, 60, 225–251. Twenge, J. M., Campbell, S. M., Hoffman, B. J., & Lance, C. E. (2010). Generational differences in work values: Leisure and extrinsic values increasing, social and intrinsic values decreasing. Journal of Management, 36, 1117–1142. Underhill, C. M. (2006). Investigation of item-pair presentation and construct validity of the Navy Computer Adaptive Personality Scales (NCAPS). Millington, TN: Navy Personnel Research, Studies, Technology Division, Bureau of Naval Personnel. U.S. Census Bureau (June 13 2016). Website; www.census.gov/en.html. Van Iddekinge, C. H., Raymark, P. H., Eidson, C. E., Jr., & Attenweiler, W. J. (2004). What do structured selection interviews really measure? The construct validity of behavior description interviews. Human Performance, 17, 71–93. Van Iddekinge, C. H., Raymark, P. H., & Roth, P. L. (2005). Assessing personality with a structured employment interview: Construct-related validity and susceptibility to response inflation. Journal of Applied Psychology, 90, 536–552. Vasilopoulos, N. L., Cucina, J. M., & Hunter, A. E. (2007). Personality and training proficiency: Issues of bandwidth-fidelity and curvilinearity. Journal of Occupational and Organizational Psychology, 80, 109–131. Viswesvaran, C., & Ones, D. S. (1999). Meta-analyses of fakability estimates: Implications for personality measurement. Educational and Psychological Measurement, 59, 197–210.


Personality Viswesvaran, C., & Ones, D. S. (2000). Measurement error in “Big Five factors” personality assessment: Reliability generalization across studies and measures. Educational and Psychological Measurement, 60, 224–235. Viswesvaran, C., & Ones, D. S. (2016). Integrity tests: A review of alternate conceptualizations and some measurement and practical issues. In U. Kumar (Ed.), The Wiley handbook of personality assessment (pp. 58–75). West Sussex, UK: Wiley & Sons. Walmsley, P. T. (2013). Investigating the presence of nonlinear personality-job performance relationships. Doctoral dissertation, University of Minnesota. Minneapolis, MN. Whetzel, D. L., McDaniel, M. A., Yost, A. P., & Kim, N. (2010). Linearity of personality-performance relationships: A large-scale examination. International Journal of Selection and Assessment, 18, 310–320. White, L. A., Young, M. C., Hunter, A. E., & Rumsey, M. G. (2008). Lessons learned in transitioning personality measures from research to operational settings. Industrial and Organizational Psychology, 1, 291–295. White, L. A., Young, M. C., & Rumsey, M. G. (2001). ABLE implementation issues and related research. In J. P. Campbell & D. J. Knapp (Eds.), Exploring the limits in personnel selection and classification (pp. 525–558). Mahwah, NJ: Lawrence Erlbaum. Wiernik, B. M., Dilchert, S., & Ones, D. S. (2016). Creative interests and personality: Scientific versus artistic creativity. Zeitschrift für Arbeits- und Organisationspsychologie, 60, 65–78. Wiesner, W. H., & Cronshaw, S. F. (1988). A meta-analytic investigation of the impact of interview format and degree of structure on the validity of the employment interview. Journal of Occupational Psychology, 61, 275–290. Wiggins, J. S., & Trapnell, P. D. (1997). Personality structure: The return of the Big Five. In R. Hogan, J. A. Johnson, & S. R. Briggs (Eds.), Handbook of personality psychology. (pp. 737–765). San Diego, CA: Academic Press. Wolters, H., Heffner, T., & Sams, M. (October 2015). Overview and introduction of ARI’s non-cognitive selection and assignment research: Enlisted personnel. Paper presented at Briefing for Expert Classification Panel. Belvoir, VA: U.S. Army Research Institute for the Behavioral and Social Sciences. Woodley, M. A., & Bell, E. (2011). Is collective intelligence (mostly) the General Factor of Personality? A comment on Woolley, Chabris, Pentland, Hashmi and Malone (2010). Intelligence, 39, 79–81. Woolley, A. W., Aggarwal, I., & Malone, T. W. (2015). Collective intelligence and group performance. Current Directions in Psychological Science, 24, 420–424. Zhao, H., Seibert, S. E., & Lumpkin, G. T. (2010). The relationship of personality to entrepreneurial intentions and performance: A meta-analytic review. Journal of Management, 36, 381–404.

325

14
VALUES, STYLES, AND MOTIVATIONAL CONSTRUCTS
DAVID CHAN

For several decades now, cognitive ability and personality traits have been the two major types of predictors examined in employee selection research. Construct-oriented studies have focused on the structure and taxonomy of cognitive ability (see Chapter 11, this volume) and personality traits (see Chapter 13, this volume), as well as the validity evidence for these two types of constructs. In contrast, selection researchers have paid little attention to other types of individual difference predictors such as those in the domains of values, cognitive styles, and motivational constructs. To the extent that these individual differences are distinct from cognitive ability and personality constructs, and to the extent that they predict work-relevant attitudes, perceptions, and behaviors, there is a need in selection research to direct more attention to these "nontraditional" predictor constructs. The purpose of this chapter is to provide an overview of the major values, cognitive styles, and motivational constructs that are likely to be relevant in employee selection research. In the following sections, I discuss each of these three construct domains with three objectives: (a) to explain the basic conceptualizations of the focal constructs and their potential value in employee selection research and practice, (b) to illustrate the variety of constructs and present the theory and research associated with their structure and validity, and (c) to discuss the current concerns and emerging issues in the conceptualization and measurement of these constructs. I end the chapter with a discussion of practical considerations in the use of these constructs in employee selection and a proposed strategic agenda for future research directions.

VALUES

The interest in psychological research on the concept of values may be traced back to the publication of Rokeach's (1973) influential book The Nature of Human Values and the Rokeach Value Survey, which he developed to measure the various value constructs described in his book. Subsequent researchers who examined the structure of values or criterion-related validities of values have tended to rely on Rokeach's conceptual definition of values, which refers to the individual's "enduring belief that a specific mode of conduct or end-state of existence is personally or socially preferable to an opposite or converse mode of conduct or end-state of existence" (Rokeach, 1973, p. 5). Although researchers have defined values in different ways, there appears to be a consensus from their conceptual definitions that values are the individual's stable beliefs that serve as general standards by which he or she evaluates specific things, including people, behaviors, activities, and issues. These standards of evaluation are also considered abstract goals, which are important guiding principles in life for the individual. There is also agreement that values are more general than attitudes in that the latter are more referent-specific. Values are also differentiated from interests in that the former are scaled on relative importance, whereas the latter are scaled on relative liking.

Why Study Values?

The rationale for studying values lies primarily in their criterion-related validity. Because values are assumed to occupy a central position in the individual's network of cognitive beliefs and attitudes, we expect values to be associated with, and hence predictive of, criteria such as specific beliefs, attitudes, perceptions, and behaviors. Indeed, much of the early interest in empirical studies of values was generated by Rokeach's (1973) seminal research showing that rankings of the importance of values were predictive of a wide variety of attitudes and behaviors. Subsequent to Rokeach's work, the criterion-related validities of values were quite consistently demonstrated over the years for diverse criteria, including attitudinal and behavioral outcomes (e.g., Kaikati & Torelli, 2010; Ravlin & Meglino, 1987). For example, Ravlin and Meglino (1987) showed that achievement, concern for others, fairness, and honesty were major values that predicted various perceptions and decisions at the workplace.

A second reason for studying values is that value congruence, or similarity versus dissimilarity of values, is expected to lead to important outcomes. For example, studies have found that value congruence between managers and their organizations predicted the managers' success and intention to remain in the organization (Posner, Kouzes, & Schmidt, 1985), and value congruence between subordinates and supervisors predicted subordinates' ratings of supervisors' competence and success (Weiss, 1978). However, the inferences from many value congruence studies tend to be less conclusive given the difficulty of interpretation associated with methodological problems in these studies (Cable & Edwards, 2009).

Structure of Values

Following his conceptual definition of values, Rokeach (1973) made two useful distinctions in the structure of values. The first distinction is between instrumental values and terminal values. Instrumental values are about modes of conduct, and they refer to the subjective desirability of actions or conduct, such as being honest, obedient, or courageous, which are presumed to be means that lead to certain desirable outcomes. Terminal values are about end-states of existence, and they refer to the subjective desirability of life outcomes such as equality or a peaceful world. The second distinction is between values about the well-being of the self and values about the well-being of others. On the basis of these two distinctions, Rokeach produced a useful taxonomy of four major types of values by factorially crossing the two independent distinctions. Instrumental values that are self-oriented are called competence values (e.g., being ambitious, independent), whereas instrumental values that are other-focused are called moral values (e.g., being altruistic, forgiving). Terminal values that are self-oriented are called personal values (e.g., a materially comfortable life, a well-respected person), whereas terminal values that are other-oriented are called social values (e.g., a peaceful world, a society with little or no inequality).

Schwartz (1992) argued that the conceptual distinction between instrumental and terminal values, although intuitively attractive, may not be necessary and may in fact create confusion because in many cases the same value may be construed as both a means and an end. For example, pleasure may be construed as a terminal value, but it may also serve as an instrumental value in promoting other terminal values such as happiness. Similarly, an instrumental value, such as being honest, could also be seen as a terminal value to be promoted by other instrumental values, such as being courageous. Dissatisfied with the typology of values provided by Rokeach (1973), Schwartz (1992) proposed a new framework or structure of values that he believed to have universal content that can be applied across cultures. Schwartz presented respondents with items representing specific values and asked them to rate the importance of each value to their lives. On the basis of these importance ratings from large and diverse samples of respondents, Schwartz organized the large variety of individuals' specific values into 10 value types (e.g., power, achievement, hedonism, self-direction). Schwartz further proposed that the 10 values may be organized at a higher level into two bipolar value dimensions—namely, openness to change versus conservation and self-enhancement versus self-transcendence. However, research using Schwartz's framework has focused almost exclusively on the 10 value types, probably because of the generic (and hence less useful) nature of the two bipolar value dimensions. In addition to the 10 value types at the individual level, Schwartz proposed seven value dimensions at the cultural level to allow for cross-cultural comparisons in value research. Examples of these cultural-level value dimensions are prosocial (active protection or enhancement of the welfare of others), restrictive conformity (restraint of actions likely to harm others or violate norms), and security (safety, harmony, and stability of the society or groups with whom one identifies). A major contribution of Schwartz's framework is that, in addition to providing a categorization of values at the individual level, it offers a conceptual guide for us to understand and compare cultures in terms of value dimensions. There is considerable empirical evidence that the framework, including the 10 value types and seven culture dimensions, can be used on a global basis to identify and understand the content and structure of values across diverse cultures (e.g., Schwartz & Sagiv, 1995). To date, Schwartz's framework represents the most comprehensive typology of values at the individual and culture levels of analysis, and there is also a relatively large research literature on the results of the Schwartz Value Survey administered in diverse cultures.

Another large-scale value survey project is the well-known World Values Survey, which was developed from the original European Value Survey. The first World Values Survey, conducted in 1981, contained only 22 countries, with 14 of them outside of Europe. The second wave, which contained 42 countries, was conducted 10 years later. Subsequent waves, containing increasing numbers of countries, were conducted at approximately five-year intervals. Results of the World Values Survey are available at www.worldvaluessurvey.com. One of the most well-known interpretations of the results of the World Values Survey is that the many values across countries may be factor-analytically summarized into two global dimensions of cultural variation labeled "traditional versus secular-rational" and "survival versus self-expression." Given the large-scale results on diverse cultures available for the World Values Survey and the Schwartz Value Survey, the utility of these two value frameworks is likely to continue for many years.
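To make the dimension-reduction idea behind these summaries concrete, the following is a minimal sketch using simulated importance ratings, not the actual Schwartz or World Values Survey data or analyses; the sample sizes, loadings, and number of dimensions are illustrative assumptions only.

```python
# A minimal sketch (simulated data): summarizing ratings on many specific value items
# into a small number of broad dimensions with principal components analysis.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_respondents, n_value_items = 500, 20

# Simulate two latent bipolar dimensions and item ratings that load on them.
latent = rng.normal(size=(n_respondents, 2))
loadings = rng.normal(scale=0.8, size=(2, n_value_items))
ratings = latent @ loadings + rng.normal(scale=1.0, size=(n_respondents, n_value_items))

pca = PCA(n_components=2)
scores = pca.fit_transform(ratings)      # respondents' positions on the two broad dimensions
print(pca.explained_variance_ratio_)     # proportion of rating variance each dimension summarizes
```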

Current Concerns and Emerging Issues

The scientific defensibility and practical usefulness of values for employee selection are dependent on the extent to which values are adequately conceptualized and measured. The following list highlights some of the current concerns and emerging issues associated with conceptualization and measurement in the study of values:

1. An adequate structure of values clarifying taxonomy and typology issues (i.e., number, level, and type of values) is fundamental for the study of values to contribute to the science and practice of employee selection. In employee selection, the criterion constructs of interest are primarily work-relevant attitudes and behaviors. The structure of values is important because it provides the conceptual organizing principles to relate these work-relevant attitudes and behaviors to value constructs. Although we now have several conceptual frameworks that provide researchers with a working structure of values, it remains unclear what degree of comprehensiveness and level of specificity we would require of a values structure for the purpose of employee selection research and practice. A structure is nonparsimonious and impractical if it specifies a large variety of specific values organized into many different types, domains, and levels of conceptualization. On the other hand, a structure with a few generic values is likely to lead to studies with misspecified models because of omitted value variables. There has been a proliferation of value measures that are rarely reconciled with earlier measures in the literature, and this makes comparison of studies problematic and accumulation of knowledge difficult. An adequate structure of values is needed to guide researchers and provide more precise and theory-driven operationalizations of value constructs.


2. Even when multiple values are examined in a single study, researchers tend to study each value in isolation as opposed to studying the effects of an individual's actual profile of values. Note that this goes beyond studying joint effects of multiple values at the aggregate level of analysis (e.g., incremental validity of conformity over tradition, interaction effect of power and achievement), which could examine only a small number of values. The study of interindividual differences in intraindividual profiles of values is important because it is unlikely that an individual's attitude or behavior is determined by a single value in isolation. Intraindividual analyses also directly address the issue of intraindividual value conflicts, which should be most relevant in work situations involving moral dilemmas. The study of interindividual differences in intraindividual profiles of values and intraindividual changes in values over time involves difficult measurement and data analysis issues, but recent methodological advances provide useful tools for conceptualizing and assessing these differences (see Chan, 1998a, 2002); a brief illustrative sketch of profile comparison follows this list.

3. The study of individual values and cultural values raises important levels-of-analysis issues that need to be addressed. For example, does a value construct change in meaning when it is composed from the individual level to the cultural level of analysis? The functional relationships between the same value construct across different (individual vs. cultural) levels of analysis need to be carefully specified in a composition model. Failing to adequately address these multilevel issues could lead to critical conceptual, measurement, analysis, and inferential errors (see Chan, 1998b, 2005a).

4. Two increasingly important areas in employee selection are recruiting teams (see Chapter 37, this volume) and selection of employees for expatriate assignments (see Chapter 36, this volume). In these two areas, as well as the ongoing area of interest relating to person-organization fit, the critical issue in the study of values concerns value congruence. Advancement in these areas of employee selection research and practice is dependent on advancements in person-environment fit research, particularly on issues relating to the different conceptualizations and measurements of fit (e.g., objective fit vs. subjective fit; supplementary fit vs. complementary fit). For example, value fit is almost always conceptualized as supplementary fit defined in terms of similarity of values between the person and the environment. Are there situations in which value fit is better conceptualized as complementary fit defined in terms of the environment meeting certain value needs/demands of the person? In other words, value congruence may not always mean or imply value similarity. A different conceptualization of the type of value congruence or fit could open up new and useful areas for research and practice in employee selection. These issues of value congruence apply not only to person-environment fit but also to person-person fit.

5. Given the research dependence on large-scale international surveys, which are largely western in origin, applications of research findings on individual- and cultural-level values to nonwestern cultures will need to pay careful attention to validity concerns associated with methodological issues in cross-cultural measurement.
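As a minimal illustration of the profile perspective raised in point 2, the sketch below compares two individuals' value profiles using within-person centering and a profile correlation. The value labels, ratings, and the particular similarity index are illustrative assumptions, not a prescribed method from the literature.

```python
# A minimal sketch (simulated data, hypothetical value labels): comparing intraindividual
# value profiles rather than examining single values in isolation.
import numpy as np

values = ["achievement", "benevolence", "conformity", "power", "security"]
person_a = np.array([6.0, 3.0, 2.0, 6.5, 4.0])   # importance ratings on a 1-7 scale
person_b = np.array([5.5, 2.5, 1.5, 6.0, 3.5])

def ipsatize(profile):
    """Center a profile within person to focus on its shape (relative importance)."""
    return profile - profile.mean()

def profile_similarity(p, q):
    """Correlation between two ipsatized profiles, one common index of profile shape similarity."""
    return float(np.corrcoef(ipsatize(p), ipsatize(q))[0, 1])

print(profile_similarity(person_a, person_b))    # near 1.0: similar relative ordering of values
```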

COGNITIVE STYLES

Cognitive styles refer to the characteristic mode, typical method, habitual patterns, or preferred ways of processing information that are consistent over time and across many areas of activity. So, we can speak of cognitive styles in terms of thinking styles, problem-solving styles, learning styles, and so forth. As noted by Sternberg and Grigorenko (1997), cognitive styles should be distinguished from strategies. Strategies refer to the operations that individuals use or follow to minimize errors in problem solving and decision making. The use of strategies involves the conscious choice of alternative operations, whereas cognitive styles typically function without the individual's awareness. In addition, strategies are used in task- or context-specific situations, whereas cognitive styles refer to more stable characteristic modes of information processing that the individual uses consistently across a large variety of task situations or contexts.

Cognitive styles refer to a set of preferences or habits and hence should be distinguished from cognitive abilities. We can construe cognitive abilities as the "can do" aspect of cognition and cognitive styles as the "tend to do" aspect of cognition. Because styles are not abilities, they should not be inherently better or worse in an absolute or context-free sense. Instead, cognitive styles may differ in their goodness of fit to different environments or situations, and the degree of fit could lead to differing degrees of positive or negative consequences. Cognitive styles are also distinct from personality traits. Although personality traits such as conscientiousness and extraversion also refer to individual differences in stable characteristic modes of behavior, they tend to be construed as generic behavioral tendencies or predispositions, whereas cognitive styles refer to typical modes of information processing.

Why Study Styles?

Conceptually, the rationale for studying cognitive styles is fairly obvious because an individual's habitual or preferred ways of processing information would affect the individual's perception, learning, and performance. Hence, employee selection researchers and practitioners should be interested in cognitive styles as potential predictors for various work-relevant criterion outcomes. Given the centrality of information processing in learning and skill acquisition, cognitive styles should also be of great interest in training research. Because cognitive styles affect information processing and are distinguished from cognitive abilities and personality traits, they provide another potential source of predictor constructs for employee selection. In addition, it may be useful to relate cognitive styles to the maximum-typical performance distinction in employee selection. Cognitive abilities are most relevant to maximum performance, and personality traits are most relevant to typical performance. Cognitive styles refer to the "tend to do" aspect of cognition and therefore provide a potential bridge between cognitive ability and personality traits for investigating how these two traditional types of predictors may interface.

Varieties of Styles

The idea of cognitive styles as an interface between cognitive ability and personality was very popular in the 1950s and 1960s, and numerous types and measures of cognitive styles were developed during this period. However, not all of the purported cognitive style constructs are in fact assessing cognitive styles. For example, Witkin and colleagues introduced the style construct called field independence to refer to the degree to which individuals are dependent on or independent of the structure of the surrounding visual field when perceiving objects. The Rod and Frame Test (Witkin, Dyke, Faterson, Goodenough, & Karp, 1962) and the Embedded Figures Test (Witkin, Oltman, Raskin, & Karp, 1971) are the two most widely used measures of field independence. In these measures, the individual's task is to locate a true vertical (in the Rod and Frame Test) or an object/figure (in the Embedded Figures Test), which can be accomplished only by ignoring the surrounding visual field. The problem with the purported style construct of field independence is that it most likely represents a cognitive ability as opposed to a cognitive style. The way the construct is conceptualized and measured clearly involves objectively right and wrong answers, and it assesses the ability to objectively obtain the right answer. Contrary to the conceptualization of a cognitive style construct as not inherently adaptive or maladaptive, high field independence appears to be inherently more adaptive than low field independence. It is difficult to think of situations in which field dependence is better than field independence. Rather than a preferred way of processing information (i.e., a style), high field independence reflects a specific type of information-processing ability.

Whereas some measures of cognitive styles are in fact assessing cognitive abilities, others are probably assessing personality traits or multidimensional constructs that are composites of styles and personality traits. For example, Myers built on Jung's (1923) theory of psychological types and developed the Myers-Briggs Type Indicator (MBTI; Myers & McCaulley, 1985) as a cognitive style measure consisting of four factors, each containing two categories (i.e., thinking vs. feeling, extraversion vs. introversion, intuition vs. sensing, judgment vs. perception), that are combined to form 16 possible types of individuals. Although the MBTI is widely used in business and education settings, it has numerous validity problems (e.g., Druckman & Bjork, 1991). Moreover, conceptually and empirically, each of the 16 types in the MBTI is clearly a composite of personality traits (extraversion-introversion) and other individual difference constructs that may be cognitive styles (e.g., intuition-sensing) or the degree to which personal values versus impersonal logic are used as the basis for making judgments and decisions (thinking-feeling).


There are legitimate cognitive style constructs. For example, several researchers introduced (differently labeled) constructs that all refer to the degree to which individuals see things as similar or different. These include constructs such as categorizing behavior (Gardner, 1953), conceptual differentiation (Gardner & Schoen, 1962), and compartmentalization (Messick & Kogan, 1963). These constructs refer to the tendency to separate ideas or objects into discrete categories. Clearly, any two ideas or objects are similar in some ways and different in other ways. Depending on the problem or situation, the similarities (or differences) may be task-relevant or task-irrelevant. Hence, consistent with the conceptualization of a cognitive style construct, the tendency to see things as similar or different is not inherently adaptive or maladaptive—the adaptive value of any given level on the construct is dependent on its fit with the problem situation. Other examples of legitimate cognitive style constructs are the preference for abstract versus concrete information (Harvey, Hunt, & Schroder, 1961), the adaption versus innovation cognitive style (Kirton, 1976), and the tolerance for contradiction cognitive style (Chan, 2004).

In two different studies, Chan demonstrated that a cognitive style is not inherently adaptive or maladaptive and that it may interact disordinally with the style demands of the work context (Chan, 1996) or practical intelligence (Chan, 2004) to produce positive or negative consequences. Using Kirton's (1976) conceptualization of the adaption versus innovation approach to problem solving, Chan (1996) showed that the degree of mismatch between the individual's problem-solving style and the style demands of the work context predicted actual turnover over and above the prediction provided by job performance. In Chan (2004), construct validity evidence for the cognitive style construct of tolerance for contradiction was provided in terms of convergent and discriminant validity with an established set of external constructs. Using a sample different from the validation sample, Chan (2004) then showed that tolerance for contradiction positively predicted job performance among individuals with high practical intelligence but negatively predicted job performance among those with low practical intelligence.
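The kind of disordinal interaction reported in Chan (2004) can be illustrated with a simple moderated regression on simulated data. This is a minimal sketch, not the original analysis; the variable names, effect sizes, and data are hypothetical.

```python
# A minimal sketch (simulated data): a disordinal style x practical-intelligence
# interaction tested with moderated regression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
tolerance = rng.normal(size=n)      # tolerance-for-contradiction style (standardized)
practical_iq = rng.normal(size=n)   # practical intelligence (standardized)
# Disordinal pattern built into the simulation: the style helps when practical
# intelligence is high and hurts when it is low.
performance = 0.3 * practical_iq + 0.4 * tolerance * practical_iq \
              + rng.normal(scale=1.0, size=n)

df = pd.DataFrame({"perf": performance, "tol": tolerance, "piq": practical_iq})
model = smf.ols("perf ~ tol * piq", data=df).fit()
print(model.params)                 # the tol:piq term carries the interaction

# Simple-slope check: slope of the style at low vs. high practical intelligence.
for level in (-1, 1):
    slope = model.params["tol"] + model.params["tol:piq"] * level
    print(f"slope of tolerance at piq={level:+d}: {slope:.2f}")   # opposite signs => disordinal
```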

Current Concerns and Emerging Issues

Similar to the study of values, basic conceptualization and measurement issues need to be adequately addressed for the study of cognitive styles to contribute to the science and practice of employee selection. The following are some major concerns and emerging issues:

1. Unlike the structure of values, there are no widely used or commonly accepted frameworks/taxonomies of cognitive styles. Sternberg and Grigorenko (1997) classified some of the styles available in the literature into three broad categories: cognition-centered, personality-centered, and activity-centered. However, this classification is not very useful for various reasons. First, only a few examples are given in each category. Second, several of the constructs are very closely related conceptually and may even be identical. For example, it is unclear if cognitive complexity, compartmentalization, conceptual differentiation, and conceptual integration are four distinct styles or if some of these are simply different labels for the same construct. Third, the cognition-centered category includes some cognitive styles that are clearly cognitive abilities and others that more closely fit the conceptualization of cognitive styles. Fourth, the only two examples [the MBTI and Gregorc's (1985) Energic Model] given in the personality-centered category are models or typologies in which individuals are classified into composite types simply obtained from a combination of several factors that appear to include cognitive styles, personality traits, and other types of individual difference constructs. Fifth, the activity-centered category, which consists of learning and teaching styles, is simply a description of the learning or teaching contexts in which various types of cognitive styles, personality traits, and motivational constructs may be applicable. An adequate taxonomy or typology of cognitive styles is needed to organize the extant style constructs and measures; reduce the proliferation of different construct labels that in fact represent the same construct; provide meaningful comparisons of results across studies; and aid the meta-analysis of cognitive styles.

2. Although cognitive styles are conceptually distinct from cognitive abilities and personality traits, the literature on cognitive styles contains numerous conceptualizations and measures of styles that are highly related to or even indistinguishable from cognitive abilities or personality traits. On the other hand, there are examples of cognitive styles with empirical evidence suggesting that they are distinct from cognitive ability and personality traits (e.g., Chan's [2004] tolerance for contradiction style; Harvey et al.'s [1961] abstract-concrete preference; Kirton's [1976] adaption-innovation style). When studying a cognitive style in the context of employee selection, it is important to provide clear theoretical arguments and empirical evidence for the cognitive style vis-à-vis the traditional predictor space containing cognitive ability and personality traits (an adequate taxonomy of cognitive styles will provide a useful conceptual basis). When carefully studied, cognitive styles could provide important contributions in terms of incremental validity or interaction effects involving other individual difference constructs or situational variables (e.g., Chan, 1996, 2004).

3. Given the basic definition that cognitive styles are not inherently adaptive or maladaptive, it is important to validate new cognitive style constructs by identifying and showing, in theory-driven ways, the boundary conditions under which the cognitive style is adaptive and those under which it is maladaptive.

4. Cognitive style constructs are often conceptualized, and probably correctly so, as continuous variables. However, many studies measure and analyze cognitive styles as categorical variables in which individuals are classified into discrete types. This is not merely an issue of loss of statistical power to detect an effect due to artificial categorization of a continuous variable. It concerns a mismatch in theory, measurement, and analysis that is likely to lead to erroneous substantive inferences. For example, dichotomizing the abstract-concrete style continuum into the abstract type or the concrete type (hence ignoring the degree of abstraction) makes it impossible to conceptualize and empirically test the hypothesis that degree of abstraction is curvilinearly related to a criterion variable of interest, such as task performance (a brief illustrative sketch follows this list).

MOTIVATIONAL CONSTRUCTS

Motivation is often defined in terms of three features: it directs (i.e., it is goal oriented), it energizes (i.e., activation and activity), and it sustains (i.e., persistence of effort). Clearly, motivation is necessary for accomplishing many tasks. Many researchers would agree with the conceptualization of job performance as a function of ability and motivation (e.g., Campbell & Pritchard, 1976). Yet, in terms of the non-ability predictor construct space, the past three decades of employee selection research have largely focused on personality traits rather than motivational constructs, such as trait goal orientations and need for achievement. Some personality traits (e.g., conscientiousness) are more easily construed as motivational constructs than others (e.g., extraversion and neuroticism). Given that personality may overlap with motivation, and even if we assume that personality is a subset of motivation (and I suspect not many of us would make this assumption), a large part of the motivational construct space still is not captured by personality traits.

Although motivational constructs may be captured in selection methods such as interviews, accomplishment records, biodata measures, and situational judgment tests, we must not confound these methods with constructs (see Chan & Schmitt, 2005). These selection methods may be used to assess a wide range of constructs including cognitive ability, personality traits, and motivational constructs. Employee selection has focused much on cognitive ability and personality constructs but has paid relatively little explicit attention to motivational constructs, although some motivational constructs may in fact be assessed together with ability and personality in the variety of selection methods used. The purpose of this section is to highlight the fact that many established motivational constructs are available in the literature, and they deserve more attention from employee selection researchers than they currently receive.

Why Study Motivational Constructs?

Research on motivational constructs is easily justified by the assumption that motivation is necessary for job performance and by the fact that the motivational construct space may overlap with, but is certainly not exhausted by, personality constructs. In addition, values and cognitive styles, as defined and illustrated in this chapter, do not appear to possess all three features of motivation. Specifically, most value and cognitive style constructs do not have to be goal-directed, activation- or activity-oriented, and effortful. Motivation should be critical in learning and skill acquisition and therefore should predict work-relevant outcomes associated with newcomer adaptation and training. Motivation is also central to the conceptual definition of typical performance, in which the basis is the "will do" aspect of performance. Finally, motivation is clearly the central conceptual feature in work-relevant criterion outcomes such as organizational commitment, withdrawal behaviors, and turnover. In short, the study of motivational constructs is important because some of these constructs are likely to provide incremental prediction of important work-relevant criteria beyond the prediction provided by cognitive ability, personality traits, values, and cognitive style constructs.

Examples of Motivational Constructs

Instead of attempting a review of the numerous motivational constructs in the literature, which is beyond the scope of this chapter, this section will briefly describe three types of motivational constructs: trait goal orientations, achievement motivations, and interests. The three types are clearly nonexhaustive—the purpose is to illustrate how the study of motivational constructs may contribute to employee selection in various ways.

Trait Goal Orientations

The motivational construct of trait goal orientation originated with Dweck (1986), who proposed a theory of motivation positing that individuals exhibit different response patterns according to stable differences in their goal orientations. Two types of goals are distinguished—learning goals and performance goals. Individuals who are high in learning goal orientation are motivated to learn something new or increase their competence in a domain. They exhibit a "mastery-oriented" response pattern characterized by seeking challenging tasks, treating their performance errors as useful feedback, and persisting to arrive at solutions in the face of repeated failures and difficult task conditions. Individuals who are high in performance goal orientation are motivated to seek favorable or avoid unfavorable evaluations of their performance or competence. They tend to attribute performance errors and failures to low competence and hence avoid challenges or difficult situations that are "error-prone."

The bulk of the research on goal orientation is found in the educational literature. In the 1990s, several researchers noted that goal orientation is potentially useful in organizational research, including studies on the design and implementation of training programs, performance appraisal systems, cultural diversity efforts, and task performance in general (e.g., Farr, Hofmann, & Ringenbach, 1993; Pieterse, Van Knippenberg, & Van Dierendonck, 2013). Consequently, there has been strong interest in applying goal orientation in several areas within the employee selection and organizational behavior domains (e.g., Van de Walle, Brown, Cron, & Slocum, 1999). Given the potential value of goal orientation in organizational contexts, it is likely that the interest in trait goal orientations will continue.

Fundamental issues of construct validation need to be better addressed to guide substantive studies of goal orientation in organizational settings. The work of Dweck and colleagues appears to treat goal orientation as a single bipolar continuum with learning goal orientation at one end and performance goal orientation at the other. However, subsequent researchers have argued that learning goal and performance goal orientation are distinct factors. Button, Mathieu, and Zajac (1996) reviewed the conceptualizations of goal orientation and argued for an uncorrelated two-factor model in which learning goal and performance goal orientations are distinct and independent. Although there is agreement on the conceptualization of learning goal orientation (LGO), previous research has not distinguished or paid sufficient attention to two important, distinct, and relatively independent dimensions of performance goal orientation. As noted by Van de Walle (1997), goal orientations can be conceptualized as a three-factor model because performance goal orientation can be construed (and assessed) in terms of either an avoid performance goal orientation (APGO) or a prove performance goal orientation (PPGO). Individuals high on APGO strive to avoid unfavorable judgments about their ability. Given this conceptualization, APGO individuals are less likely to be high on LGO because they tend to perceive error-prone and difficult situations as threats that expose them to negative evaluation rather than as learning opportunities for increasing job performance. Individuals high on PPGO strive to gain favorable judgments by demonstrating their ability and competence to others through their performance. Unlike APGO, which is conceptualized as negatively associated with LGO, PPGO is conceptually independent of LGO.

Previous research has produced mixed findings on the association between LGO and performance goal orientation, with studies reporting zero, positive, and negative correlations (see Button et al., 1996). The failure to separate performance goal orientation into its two relatively independent dimensions (APGO vs. PPGO) may be one reason for these apparently mixed findings. Conceptually, we would expect LGO to be negatively and substantially related to APGO but unrelated to PPGO. Given this differential pattern of associations across the two performance goal orientations, the "mixed findings" may not be surprising, because the magnitude and direction of the correlation between LGO and performance goal orientation would depend on the relative extent to which the performance goal orientation measure was loaded with APGO and PPGO. Because previous performance goal orientation items were not designed to assess two independent dimensions, some of the items are likely to be bi- or multidimensional rather than pure markers of APGO or PPGO.
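The dependence of the observed LGO-performance goal orientation correlation on scale content can be illustrated with a small simulation. This is a minimal sketch under assumed population correlations (LGO negatively related to APGO, unrelated to PPGO); the specific values are illustrative, not estimates from any study.

```python
# A minimal sketch (simulated data, assumed population correlations): the observed
# correlation between LGO and a "performance goal orientation" scale depends on how
# heavily the scale is loaded with APGO versus PPGO content.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
# Assumed latent correlations among LGO, APGO, and PPGO.
cov = np.array([[ 1.0, -0.4, 0.0],
                [-0.4,  1.0, 0.2],
                [ 0.0,  0.2, 1.0]])
lgo, apgo, ppgo = rng.multivariate_normal(np.zeros(3), cov, size=n).T

for w_apgo in (1.0, 0.5, 0.0):                        # proportion of APGO content in the scale
    pgo_scale = w_apgo * apgo + (1 - w_apgo) * ppgo   # composite "performance goal" score
    r = np.corrcoef(lgo, pgo_scale)[0, 1]
    print(f"APGO weight {w_apgo:.1f}: r(LGO, PGO scale) = {r:+.2f}")
# Output ranges from clearly negative to essentially zero, mirroring the "mixed findings".
```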

Achievement Motivations

The most well-known construct of achievement motivation is McClelland's (1961) Need for Achievement. Individuals with a high need for achievement have a strong desire for significant accomplishments. They tend to be approach-oriented, and they work harder and expend substantial effort in striving to achieve success. In addition, they tend to be moderate risk takers and select tasks with an intermediate level of difficulty so that they have more than a 50% chance of achieving success (McClelland, 1985). According to McClelland, individuals high in need for achievement have a greater need to achieve success and, conversely, to avoid failure. That is, high need for achievement individuals also tend to have a high fear of failure and therefore tend to be avoidance-oriented when it comes to tasks with high risks of failure.

McClelland's conceptualization of need for achievement dominated the study of motivational constructs from the 1960s to the 1980s. Since the 1980s, the concept of need for achievement has evolved in important ways with regard to how the concept of achievement is construed. A major advancement came from researchers in cross-cultural social psychology. These researchers distinguish between the individualistic notion of achievement, which is based on an independent view of the self as originally conceived by McClelland, and a different notion of achievement that is based on an interdependent view of the self and is more characteristic of individuals from collectivistic cultures (e.g., East Asian) in which group harmony, interconnectedness, and social relationships are emphasized (e.g., Markus & Kitayama, 1991). Cultural models of self and need for achievement provide important conceptual bases for addressing challenging cross-cultural issues of construct equivalence, measurement invariance of responses to measures, and comparisons of criterion-related validity involving achievement motivational constructs and achievement-related criterion contexts. Advances in these areas will directly contribute to employee selection research on issues related to staffing cross-cultural teams and expatriate assignment. Another major advancement in the construal of need for achievement is the distinction of different achievement domains in terms of the type of goal striving. Trait goal orientation, as described above, is essentially a multidimensional view of need for achievement according to the type of goals that one is striving to achieve.

Interests

Interest measures have been used more frequently in vocational guidance situations than in employee selection, but the goal of selection, most broadly, is to find a person who has the characteristics that best fit the requirements or offerings of the job, organization, or occupation. Interests in certain types of work or careers certainly could be one such characteristic, and they are therefore relevant to employee selection. The primary reason for considering and measuring interests in employee selection lies in the assumption that a person will be happiest and most productive when he or she is working in a job or occupation in which he or she is interested (Schmitt & Chan, 1998). Dawis (1991) summarized research indicating that interest and personality measures are only weakly correlated. Although there are no reviews examining the correlations between interests and values or cognitive styles, the notion of interests is conceptually distinct from values and cognitive styles. Interests are scaled in terms of liking, whereas values are scaled in terms of importance and cognitive styles are scaled in terms of preference in information processing. Interests may be construed as primarily motivational constructs insofar as interests tend to have the three motivational features—namely, goal orientation, activation and activity, and effort.

Holland (1985) focused on the similarity between an individual's interests and the degree to which an environment provides for engagement in activities of interest to the individual. According to Holland's (1985) framework, which is the most well-known taxonomy of interests, individuals and environments can be characterized along six major dimensions: social, enterprising, conventional, realistic, investigative, and artistic. For example, high scorers on the realistic dimension are usually interested in dealing with concrete things and relatively structured tasks, and the realistic category includes such occupations as engineer, farmer, and carpenter. Individuals who score high on the social dimension are interested in working with and helping others, and these individuals are attracted to such occupations as teacher, social worker, flight attendant, and mental health worker. According to Holland, the interest patterns are organized in a fashion described by a hexagon. Interest areas next to an area of primary interest are also likely to be of interest to an individual, whereas interests opposite a primary area on the hexagon are unlikely to be of much interest. Holland's structure of interests, measured by the Strong Vocational Interest Blank, has received considerable corroborative support (see Tracey & Rounds, 1993). Holland's framework for the structure and understanding of interests dominates the field of counseling and vocational guidance. The framework has potential for employee selection (for a review, see Van Iddekinge, Putka, & Campbell, 2011), although its direct use is surprisingly limited. However, some of Holland's interest dimensions are probably captured in biodata measures.
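The hexagonal ordering described above can be made concrete with a crude circular-distance index. This is an illustrative sketch only; it is not one of Holland's formal congruence indices, and the scoring rule is an assumption introduced here for demonstration.

```python
# A minimal sketch (illustrative scoring only): circular distance on the RIASEC hexagon
# as a crude person-occupation congruence index (adjacent types score high, opposites low).
RIASEC = ["Realistic", "Investigative", "Artistic", "Social", "Enterprising", "Conventional"]

def hexagon_distance(type_a, type_b):
    """Number of steps around the hexagon between two interest types (0 = same, 3 = opposite)."""
    i, j = RIASEC.index(type_a), RIASEC.index(type_b)
    d = abs(i - j)
    return min(d, 6 - d)

def congruence(person_type, occupation_type):
    """Higher scores indicate closer (more congruent) types on the hexagon (range 0-3)."""
    return 3 - hexagon_distance(person_type, occupation_type)

print(congruence("Social", "Enterprising"))   # adjacent types -> 2
print(congruence("Social", "Realistic"))      # opposite types -> 0
```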

Current Concerns and Emerging Issues

The following are some areas of concern; addressing these issues would contribute to the study of motivational constructs in employee selection research:

1. With the emergence of new motivational constructs, basic construct validation efforts are necessary. Specifically, clarifying the dimensionality of a motivational construct is critical because it affects our theorizing and directs our hypothesis formulation and our interpretation of findings regarding the motivational construct. Consider the research on trait goal orientations. If a three-factor model is correct, then future meta-analytic studies have to take into account the type of performance goal orientation being assessed when coding each primary study. The research on the dimensionality of trait goal orientations also highlights the importance of explicating the role of goals in a motivational construct, including the content and structure of goals and the goal-striving process.

2. An important issue in the conceptualization and hence measurement of motivational constructs concerns the level of specificity. Although the appropriate level of specificity of a motivational construct is likely to depend on the particular research question or practical use, we need to ensure conceptual clarity as we move up or down the ladder of specificity. For example, when a motivational construct is conceptualized at a very general level, it is likely to be multidimensional and made up of multiple constructs that may be motivational or nonmotivational. This is best illustrated in the study of interests. Although the concept of interest has the elements of motivational constructs, the interest dimensions in Holland's structure are descriptive categories of individuals or environments rather than unitary individual difference motivational constructs. In fact, each interest dimension probably reflects multiple personality traits and cognitive styles, in addition to motivational constructs. For example, the artistic dimension describes a category of individuals who are likely to also score high on personality traits such as openness to experience and cognitive style constructs such as preference for abstraction. In addition, individuals' knowledge and skills (e.g., artistic "talent"), as well as their education, opportunities, and experiences, are likely to shape their interests. In short, interests are probably better understood as composite descriptions of individuals or environments reflecting not only motivational constructs but also a variety of knowledge, skills, abilities, and other characteristics (KSAOs), such as personality traits and cognitive styles.

3. Motivation is a process. Hence, to understand how motivational constructs affect behaviors, we may require conceptualizations of motivational constructs that are more dynamic than the static conceptualizations that are typical of personality traits, values, and cognitive styles. To begin, studies on motivational constructs need to relate individual differences in motivation to the larger literature on motivation, particularly the literature on theoretical models of work motivation (for a review, see Mitchell & Daniels, 2003). A theoretical model of work motivation specifies the motivational processes or mechanisms by which motivated individuals select specific goals and pursue them through allocating effort, monitoring progress, and responding to obstacles and feedback. In each of the established models in the work motivation literature, what is the role of individual differences in motivational constructs? Specifically, where in the work motivation model do we locate the motivational construct(s)? A theory-driven framework for including motivational constructs in employee selection would require us to specify the appropriate direct effects and interaction effects linking motivational constructs and the focal variables in the particular work motivation model.

4. As illustrated in the above discussion of cultural models of need for achievement, studies on motivational constructs need to be sensitive to cultural differences in the conceptual definition of the motivational construct. Even if construct equivalence exists across cultures, culture effects may operate in other ways. For example, it is possible that culture may moderate the relationship between a motivational construct and a criterion variable. Consider the motivational construct of APGO, which has almost always been construed, and empirically demonstrated, to be negatively associated with job performance in western samples. It may be possible that in cultures (or task settings) in which there is low tolerance for performance errors and high emphasis on speed and accuracy, individuals high on APGO may not necessarily be rated as poorer performers than those low on APGO, and they may even be rated as better performers.

PRACTICAL CONSIDERATIONS AND FUTURE RESEARCH CHALLENGES

In this chapter, I have discussed the basic conceptualizations of values, cognitive styles, and motivational constructs. Using various specific examples of each of these types of constructs as illustrations, I have raised several concerns with regard to fundamental conceptualization and measurement issues that need to be addressed as we incorporate these constructs into employee selection. There are some commonalities in the critical issues associated with the study of each of the three types of constructs that will impact employee selection. In this final section of the chapter, I will discuss several practical considerations in the use of these constructs in employee selection and propose a strategic agenda for future research directions.

Practical Considerations in Employee Selection

Four types of practical considerations in the use of values, cognitive styles, and motivational constructs in employee selection are discussed here: legal and social issues, subgroup differences, cultural differences, and problems with self-report data.

1. Legal and social issues. We need to consider the legal and social constraints when recommending the use of individual difference measures of values, cognitive styles, or motivations for the purpose of making employee selection decisions. Virtually all of the legal and social issues involving the use of cognitive ability and personality tests (see Part VI of this volume) are applicable to the use of values, cognitive styles, and motivational measures, although the importance of each issue is dependent on the specific measure and situation of use. Examples of these issues include the legal determination of job relevance, which may or may not overlap with psychometric validity; allegations of discriminatory hiring practices; affirmative action and equal employment opportunities; applicant reactions; and the distinction between psychometric test bias and nonpsychometric fairness perceptions (for a review, see Schmitt & Chan, 1998). In practice, legal and social issues are often closely related, as evident in the issue of adverse impact. In addition, the extent to which it is appropriate or acceptable (whether legally or socially) to assess a construct for employee selection decisions may be tied to the selection procedures used and the extent to which the construct is explicitly assessed. For example, values may be assessed in some biodata items, and interviewers' assessments of applicants' values are probably captured, although mostly not in an explicit manner, in interview scores. The measurement of values as a component of biodata or interview scores may not attract as much legal or social attention as the use of an inventory designed specifically to measure values. The last two decades of employee selection research have focused much attention on applicant reactions, including their importance and the various ways to engender favorable reactions. When adequately developed, measures of values, cognitive styles, and motivational constructs can lead to positive applicant reactions (see Chan & Schmitt, 2004; Schmitt & Chan, 1997).

2. Subgroup differences. A practical problem faced by many organizations is the use of selection tests (particularly cognitive ability tests) that are valid predictors of job performance for majority and minority applicants but that show large subgroup differences in mean test scores favoring the majority subgroup. This situation leads to a conflict between the organization's need to use a valid test and its goal of hiring a diverse workforce for legal and social reasons. In general, measures of values, cognitive styles, and motivational constructs are probably more similar to personality tests than to cognitive ability tests in that there is no evidence of substantial subgroup differences between majority and minority applicants. However, this may not be true of some specific measures even if the measures do not assess cognitive ability constructs. For example, it has been argued, and there is some empirical evidence showing, that Black Americans, as compared to White Americans, tend to perform better on a test that is loaded with socially interactive and visual information than on a test loaded with written and verbal information (Chan & Schmitt, 1997). Hence, using a cognitive style measure to assess the preference for processing visual versus verbal information is likely to result in a Black–White subgroup difference in test scores, leading to adverse impact problems in employee selection. But in general, adding measures of values, cognitive styles, and motivational constructs to cognitive ability tests is likely to reduce subgroup differences in the composite test scores and hence adverse impact (a numerical sketch follows this list). Including these nonability measures also increases criterion-related validity to the extent that the criterion space is expanded from the narrow focus on ability-based maximum and technical job performance to nonability-based typical and contextual job performance.

3. Cultural differences. Issues of possible cultural differences in test validity (in terms of content, criterion-related, and construct validity evidence) need to be considered whenever we use a selection measure in a culture different from the culture in which the measure was developed and validated. Discussions of methodological issues in cross-cultural measurement, such as response sets and measurement invariance, are readily available in the literature and will not be repeated here (for a recent review, see Chan, 2008a). Note, however, that cultural differences may affect the conceptualization and measurement of constructs in substantive ways that go beyond the technical issues of cross-cultural measurement. This is particularly relevant to values and motivational constructs given that cultures may differ qualitatively in their conceptualizations of certain values (e.g., freedom, happiness) and motivations (e.g., need for achievement).

4. Problems with self-report data. In the assessment of values, cognitive styles, and motivational constructs, the large majority of the measures used are in self-report format. Similar to the use of personality inventories, issues related to the validity problems of self-report data are relevant when self-report measures of these three types of constructs are used in employee selection, especially given the high stakes involved in actual employee selection contexts. Some values (e.g., honesty) and motivational constructs (e.g., LGO), given the evaluative nature of their content, may be particularly susceptible to socially desirable responding. In general, cognitive styles are probably less likely than values and motivational constructs to suffer from socially desirable responding given the nonevaluative nature of cognitive style items. Finally, although self-report data problems do occur in the measurement of values, cognitive styles, and motivational constructs, many of the purported problems are often overstated (see Chan, 2008b).
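The composite-score logic in item 2 can be illustrated numerically. The sketch below uses one common approximation for the standardized subgroup difference (d) of a unit-weighted composite of standardized predictors, under simplifying assumptions (equal weights, unit variances, and equal predictor intercorrelations across groups); the specific d values and correlation are illustrative, not estimates from any study.

```python
# A minimal sketch (illustrative numbers): adding a low-d measure to a high-d cognitive
# ability test shrinks the standardized subgroup difference of the composite score.
import math

def composite_d(ds, mean_intercorrelation):
    """Approximate d of a unit-weighted sum of standardized predictors."""
    k = len(ds)
    return sum(ds) / math.sqrt(k + k * (k - 1) * mean_intercorrelation)

cognitive_only = composite_d([1.0], 0.0)            # a large subgroup difference on ability alone
with_style_measure = composite_d([1.0, 0.1], 0.2)   # add a weakly correlated, low-d nonability measure
print(round(cognitive_only, 2), round(with_style_measure, 2))   # 1.0 vs. roughly 0.71
```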


Strategic Agenda for Future Research Directions On the basis of the previous discussions on values, cognitive styles, and motivational constructs, I propose the following strategic agenda for future research directions: 1. Dimensionality. Construct validation efforts that specify and test the dimensionality of a construct are fundamental when examining a value, cognitive style, or motivational construct. Specifically, it is important to determine if the construct of interest under study is a single, “pure” factor or a composite construct consisting of multiple factors. Composite constructs are particularly difficult to deal with. First, we will need to identify the number and nature of the various factors. Second, we will need to establish the different contributions of the various factors to the composite construct. Third, failing to accurately identify the number, nature, and weights of the factors making up the composite construct will result in substantive inferential errors, or at least confusion, about values, cognitive styles, or motivational constructs. A purportedly motivational construct may in fact be a composite label reflecting not only multiple motivational constructs but also various nonmotivational constructs such as knowledge, skills, abilities, personality traits, values, and cognitive styles. Conceptual clarity of the nature of composite constructs is critical to advance the theory and application of the constructs. One example is the composite construct of core self-evaluation (CSE) proposed by Judge, Locke, and Durham (1997) to refer to individual differences in the fundamental appraisals that people make about their own self-worth, competence, and capabilities. CSE is construed as a higher-order construct composed of four constructs: emotional stability, self-esteem, generalized self-efficacy, and locus of control. These four constructs are established individual difference traits in the personality, motivation, and thinking style domains. With increasing research and applied interest in CSE, researchers have called for more efforts to examine the theoretical foundations and construct validity of the construct (e.g., Chang, Ferris, Johnson, Rosen, & Tan, 2012). My view is that the fundamental conceptual issue for CSE is about dimensionality and the nature of the inter-relations among the four traits. Specifically, is CSE best conceptualized as an underlying common factor variance construct that saturates each of the four traits (as suggested by Judge and his colleagues) or as a composite construct indicated by a summation of the four traits? In psychometric terms, the former implies a reflective factor model of CSE, whereas the latter implies a formative model of CSE. Some of the unresolved debates and apparent inconsistencies in the literature about CSE may be due to the failure to distinguish these two representations of the relations linking CSE to the four traits. 2. Level of specificity. Closely related to the issue of dimensionality and composite constructs is the issue of level of specificity of a construct. Depending on the particular research question, researchers need to ensure that the level of specificity of the value, cognitive style, or motivational construct is appropriate, and this requires clear conceptual definitions of the constructs and appropriate matching between predictor and criterion constructs. Broader constructs (e.g., individualistic vs. 
collectivistic values, need for achievement) may be more useful for obtaining better prediction of a general criterion (e.g., organizational commitment, overall job performance) in a parsimonious and generalizable manner. More narrowly defined constructs may be more useful for increasing understanding of the criterion space and the predictor-criterion relationships, including possible mediating mechanisms (e.g., linking specific trait goal orientations to specific dimensions of job performance). The issue here is not about any inherently optimal level of specificity of the construct. Any general statement on the relative value of broad versus narrowly defined constructs is unlikely to be useful, because it is the clarity of conceptual definition of constructs and appropriate matching between the predictor and criterion spaces that will lead to higher validities and better explanations.

3. Adaptive value. Studies on the three types of constructs, particularly with respect to their use in employee selection, need to explicate and test the adaptive value of the construct. As noted in this chapter, cognitive styles are not inherently adaptive or maladaptive in an absolute and context-free sense. Consider a measure that was designed to assess a cognitive style. Let us suppose high scorers on this measure perform better in tasks across many domains, and it is very difficult or impossible to conceive of two different situations in which the high scorers are adaptive in one and maladaptive in the other. In this scenario, the measure is likely to be assessing cognitive abilities or some other construct that is inherently adaptive rather than a cognitive style. On the other hand, motivational constructs are inherently adaptive in nature in that they should correlate positively rather than negatively with a criterion in which higher scores represent higher adaptive value. It is difficult to think of generalizable situations in which higher-motivated individuals, as compared to lower-motivated individuals, will experience less positive or more negative consequences. Whether or not a value construct is adaptive or maladaptive is dependent on the nature of the construct. Values with higher evaluative content such as honesty and fairness are likely to be adaptive in many situations, whereas those with lower evaluative content such as individualism–collectivism may be adaptive or maladaptive depending on the nature of the situational demands. In addressing the adaptive value of a predictor construct, it is important to examine possible nonlinear relationships linking the predictor construct and the criterion construct. For example, in certain work situations, collectivism (a value construct) or need for achievement (a motivational construct) may be related to a job performance construct by an inverted-U function rather than a linear association. The specific functional form of the predictor-criterion relationship has clear implications for employee selection. If the function is an inverted U, then individuals with moderate scores on the value or motivational construct are more likely than those with low or high scores to be better performers on the job.

4. Person-environment fit. Another challenging and important future research direction is to study individual differences in values, cognitive styles, and motivational constructs in the context of person-environment fit. For example, Chan (1996) showed that a misfit between the individual's cognitive style and the commensurate style demands of the work environment predicted actual turnover beyond the predictability provided by job performance. Such findings have potential practical implications for employee selection for various work environments. Similar fit studies could be conducted for various values, cognitive styles, and motivational constructs by carefully mapping the individual difference construct to the environmental construct. One promising area is to study the effects of fit between trait goal orientations and the goal orientation demands of the work environment. Clearly, studies of person-environment fit will require construct-oriented approaches that explicate dimensionality and predictor-criterion relationships (Chan, 2005b). Fit is generally construed as adaptive, whereas misfit is construed as maladaptive. However, advancements in fit research are likely to occur if we can show when and how fit may have negative effects, as well as when and how misfit may have positive effects. Examples of possible negative effects of fit include homogeneity of individuals in an environment leading to groupthink and cognitive style fit between individuals and the culture leading to failure to consider alternatives. Examples of possible positive effects of misfit include diversity of individuals leading to new ideas and value misfit leading to whistle-blowing.

5. Interconstruct relationships. Any individual difference construct cannot be considered in isolation. Future research should examine interconstruct relationships within and across the three types of constructs. There are at least two ways to examine interconstruct relationships. The first way is to examine the incremental validity of one construct over another in predicting a criterion. For example, Payne, Youngcourt, and Beaubien (2007) examined trait goal orientations and found that these motivational constructs predicted job performance above and beyond the prediction provided by cognitive ability and personality.
It is practically important to examine if a value, cognitive style, or motivational construct offers any incremental validity in the prediction of job performance or other work-relevant criteria over the predictability provided by the traditional predictor constructs, such as cognitive ability and personality traits. The second way is to examine trait-trait interaction effects on work-relevant criteria. For example, Chan (2004) found a disordinal interaction effect between a cognitive style construct (tolerance for contradiction) and practical intelligence such that the cognitive style positively predicted job performance among individuals high on practical intelligence but negatively predicted job performance among those low on practical intelligence. Studies on trait-trait interactions are important because they clarify and validate the nature of the individual difference constructs and identify the boundary conditions for their criterion-related validities and adaptive effects.
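To illustrate the reflective versus formative distinction raised under the dimensionality point, the hedged Python sketch below computes two CSE scores from the same four simulated trait measures: a common-factor score (reflective view) and a unit-weighted composite (formative view). The loadings and sample are invented for illustration; the point is only that the two representations imply different scoring models and need not be interchangeable.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(7)
n = 1000

# Hypothetical data: four CSE indicator traits sharing a common core plus unique variance
common = rng.normal(size=n)
loadings = np.array([0.7, 0.7, 0.6, 0.4])          # invented loadings for illustration
unique = rng.normal(size=(n, 4))
traits = common[:, None] * loadings + unique * np.sqrt(1 - loadings**2)
traits = (traits - traits.mean(axis=0)) / traits.std(axis=0)

# Reflective view: CSE as the common factor saturating the four traits
cse_reflective = FactorAnalysis(n_components=1, random_state=0).fit_transform(traits).ravel()

# Formative view: CSE as a unit-weighted composite formed from the four traits
cse_formative = traits.mean(axis=1)

r = abs(np.corrcoef(cse_reflective, cse_formative)[0, 1])
print(f"Correlation between factor-score and composite CSE: {r:.2f}")
# High but not 1.0: the composite retains each trait's specific variance, whereas the
# factor score weights indicators by their common-factor saturation.
```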

CONCLUSION

Given the modest amount of criterion variance typically accounted for in employee selection, it is understandable that researchers and practitioners seek to expand the predictor construct space by going beyond cognitive abilities and personality traits to include values, cognitive styles, and motivational constructs. I have provided an overview of the nature of these three types of constructs, their potential usefulness, issues relating to conceptualization and measurement, practical considerations to take into account in the use of these constructs for employee selection, and a strategic agenda for future research directions. It is hoped that this chapter will provide an effective springboard for fruitful construct-oriented research on values, cognitive styles, and motivational constructs.


REFERENCES

Button, S. B., Mathieu, J. E., & Zajac, D. M. (1996). Goal orientation in organizational research: A conceptual and empirical foundation. Organizational Behavior and Human Decision Processes, 67, 26–48.
Cable, D. M., & Edwards, J. R. (2009). The value of value congruence. Journal of Applied Psychology, 94, 654–677.
Campbell, J. P., & Pritchard, R. D. (1976). Motivation theory in industrial and organizational psychology. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 63–130). Chicago, IL: Rand McNally.
Chan, D. (1996). Cognitive misfit of problem-solving style at work: A facet of person-organization fit. Organizational Behavior and Human Decision Processes, 68, 194–207.
Chan, D. (1998a). The conceptualization and analysis of change over time: An integrative approach incorporating Longitudinal Means and Covariance Structures Analysis (LMACS) and Multiple Indicator Latent Growth Modeling (MLGM). Organizational Research Methods, 1, 421–483.
Chan, D. (1998b). Functional relations among constructs in the same content domain at different levels of analysis: A typology of composition models. Journal of Applied Psychology, 83, 234–246.
Chan, D. (2002). Longitudinal modeling. In S. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology (pp. 412–430). Malden, MA: Blackwell.
Chan, D. (2004). Individual differences in tolerance for contradiction. Human Performance, 17, 297–325.
Chan, D. (2005a). Multilevel research. In F. T. L. Leong & J. T. Austin (Eds.), The psychology research handbook (2nd ed., pp. 401–418). Thousand Oaks, CA: Sage.
Chan, D. (2005b). Current directions in employee selection. Current Directions in Psychological Science, 14, 220–223.
Chan, D. (2008a). Methodological issues in international Human Resource Management. In M. M. Harris (Ed.), Handbook of research in international human resources management (pp. 53–76). Mahwah, NJ: Lawrence Erlbaum.
Chan, D. (2008b). So why ask me?—Are self-report data really that bad? In C. E. Lance & R. J. Vandenberg (Eds.), Statistical and methodological myths and urban legends: Received doctrine, verity, and fable in the organizational and social sciences (pp. 309–336). Hillsdale, NJ: Lawrence Erlbaum.
Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82, 143–159.
Chan, D., & Schmitt, N. (2004). An agenda for future research on applicant reactions to selection procedures: A construct-oriented approach. International Journal of Selection and Assessment, 12, 9–23.
Chan, D., & Schmitt, N. (2005). Situational judgment tests. In A. Evers, O. Smit-Voskuijl, & N. Anderson (Eds.), Handbook of personnel selection (pp. 219–242). Oxford, England: Blackwell.
Chang, C. H., Ferris, L. D., Johnson, R. E., Rosen, C. C., & Tan, J. A. (2012). Core self-evaluations: A review and evaluation of the literature. Journal of Management, 38, 81–128.
Dawis, R. V. (1991). Vocational interests, values, and preferences. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (Vol. 2, pp. 833–872). Palo Alto, CA: Consulting Psychologists Press.
Druckman, D., & Bjork, R. A. (1991). In the mind's eye. Washington, DC: National Academy Press.
Dweck, C. S. (1986). Motivational processes affecting learning. American Psychologist, 41, 1040–1048.
Farr, J. L., Hofmann, D. A., & Ringenbach, K. L. (1993). Goal orientation and action control theory: Implications for industrial and organizational psychology. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (Vol. 8, pp. 191–232). New York, NY: Wiley.
Gardner, R. W. (1953). Cognitive style in categorizing behavior. Perceptual and Motor Skills, 22, 214–233.
Gardner, R. W., & Schoen, R. A. (1962). Differentiation and abstraction in concept formation. Psychological Monographs, 76, 1–21.
Harvey, O. J., Hunt, D. E., & Schroder, H. M. (1961). Conceptual systems and personality organization. New York, NY: Wiley.
Holland, J. L. (1985). Making vocational choices: A theory of careers (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.
Judge, T. A., Locke, E. A., & Durham, C. C. (1997). The dispositional causes of job satisfaction: A core evaluations approach. Research in Organizational Behavior, 19, 151–188.
Jung, C. (1923). Psychological types. New York, NY: Harcourt Brace.
Kaikati, A., & Torelli, C. (2010). When do personal values predict helping behaviors? It's all in the mindset. Advances in Consumer Research, 37, 157–160.
Kirton, M. J. (1976). Adaptors and innovators: A description and measure. Journal of Applied Psychology, 61, 622–629.
Markus, H., & Kitayama, S. (1991). Culture and self: Implications for cognition, emotion, and motivation. Psychological Review, 98, 224–253.
McClelland, D. C. (1961). The achieving society. Princeton, NJ: Van Nostrand.
McClelland, D. C. (1985). Human motivation. New York, NY: Scott-Freeman.
Messick, S., & Kogan, N. (1963). Differentiation and compartmentalization in object-sorting measures of categorizing style. Perceptual and Motor Skills, 16, 47–51.
Mitchell, T. R., & Daniels, D. (2003). Motivation. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Handbook of psychology (Vol. 12, pp. 225–254). New York, NY: Wiley.
Myers, I. B., & McCaulley, M. H. (1985). Manual: A guide to the development and use of the Myers-Briggs Type Indicator. Palo Alto, CA: Consulting Psychologists Press.
Payne, S. C., Youngcourt, S. S., & Beaubien, J. M. (2007). A meta-analytic examination of the goal orientation nomological net. Journal of Applied Psychology, 92, 128–150.
Pieterse, A. N., Van Knippenberg, D., & Van Dierendonck, D. (2013). Cultural diversity and team performance: The role of team member goal orientation. Academy of Management Journal, 56, 782–804.
Posner, B. Z., Kouzes, J. M., & Schmidt, W. H. (1985). Shared values make a difference: An empirical test of corporate culture. Human Resource Management, 24, 293–309.
Ravlin, E. C., & Meglino, B. M. (1987). Effects of values on perception and decision making: A study of alternative work values measures. Journal of Applied Psychology, 72, 666–673.
Rokeach, M. (1973). The nature of human values. New York, NY: Free Press.
Schwartz, S. H. (1992). Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. In M. P. Zanna (Ed.), Advances in experimental social psychology (Vol. 25, pp. 1–65). San Diego, CA: Academic Press.
Schwartz, S. H., & Sagiv, L. (1995). Identifying culture-specifics in the content and structure of values. Journal of Cross-Cultural Psychology, 26, 92–116.
Schmitt, N., & Chan, D. (1998). Employee selection: A theoretical approach. Thousand Oaks, CA: Sage.
Sternberg, R. J., & Grigorenko, E. L. (1997). Are cognitive styles still in style? American Psychologist, 52, 700–712.
Tracey, T. J., & Rounds, J. B. (1993). Evaluating Holland's and Gati's vocational interest models: A structural meta-analysis. Psychological Bulletin, 113, 229–246.
Van de Walle, D. (1997). Development and validation of a work domain goal orientation instrument. Educational and Psychological Measurement, 8, 995–1015.
Van de Walle, D., Brown, S. P., Cron, W. L., & Slocum, J. W., Jr. (1999). The influence of goal orientation and self-regulation tactics on sales performance: A longitudinal field test. Journal of Applied Psychology, 84, 249–259.
Van Iddekinge, C. H., Putka, D. J., & Campbell, J. C. (2011). Reconsidering vocational interests for personnel selection: The validity of an interest-based selection test in relation to job knowledge, job performance, and continuance intentions. Journal of Applied Psychology, 96, 13–33.
Weiss, H. M. (1978). Social learning of work values in organizations. Journal of Applied Psychology, 63, 711–718.
Witkin, H. A., Dyke, R. B., Faterson, H. F., Goodenough, D. R., & Karp, S. A. (1962). Psychological differentiation. New York, NY: Wiley.
Witkin, H. A., Oltman, P. K., Raskin, E., & Karp, S. A. (1971). Embedded figures test, children's embedded figures test, group embedded figures test: Manual. Palo Alto, CA: Consulting Psychologists Press.


15
PRACTICAL INTELLIGENCE, EMOTIONAL INTELLIGENCE, AND SOCIAL INTELLIGENCE
FILIP LIEVENS AND DAVID CHAN

Over the years, practical intelligence, social intelligence, and especially emotional intelligence have received substantial attention in both the academic and practitioner literatures. At the same time, however, these individual difference "constructs" have also fueled controversies and criticisms, including criticisms of their applications to employee selection. Without doubt, their definition, dimensionality, and operationalization (measurement) have been questioned much more than those of the more traditional or established constructs (i.e., cognitive ability, personality) in this section of the Handbook. This chapter has two main objectives. The first objective is to review and clarify the conceptualization and measurement of these three constructs (or categories of constructs). In doing so, we aim to identify commonalities and differences among the three constructs. The second objective is to advance research on practical, social, and emotional intelligence. We aim to achieve both objectives by placing the three intelligence constructs in an integrative conceptual framework that relates them to traditional individual difference constructs and critical criterion constructs. We end by proposing directions for future research.

DEFINITIONS AND CONCEPTUALIZATIONS

In this section, we review how practical, emotional, and social intelligence have been conceptualized and the research that attempted to empirically test these conceptualizations.

Practical Intelligence

Sternberg and colleagues introduced the construct of practical intelligence in the mid- to late 1980s (Sternberg, 1988; Wagner & Sternberg, 1985). As a common thread running through the various definitions of practical intelligence, it is generally considered to refer to the ability of an individual to deal with the problems and situations of everyday life (Bowman, Markham, & Roberts, 2001). In lay terms, it can be characterized as "intuition" or "common sense," and it is often referred to as "street smart" to contrast with "book smart," which is used to characterize traditional analytical or academic intelligence. A central element in practical intelligence is tacit knowledge. Sternberg, Wagner, Williams, and Horvath (1995) defined tacit knowledge as "action-orientated knowledge, acquired without direct help from others, that allows individuals to achieve goals they personally value" (p. 916). This definition encompasses the key characteristics of tacit knowledge (see Hedlund et al., 2003). First, tacit knowledge is difficult to articulate because it is not formalized in explicit procedures and rules. Second, tacit knowledge is typically situationally specific procedural knowledge, telling people how to act in various situations. Third, individuals acquire tacit knowledge on the basis of their own everyday experience related to a specific domain. Thus, tacit knowledge is not formally taught; it is experience-based. Fourth, tacit knowledge is practical as it enables individuals to obtain the goals that they value in life. These characteristics exemplify the claim that practical intelligence and tacit knowledge are constructs that are conceptually distinct from academic intelligence, technical job knowledge, or personality. Research by Sternberg and colleagues as well as by others has found some support for, or at least produced findings consistent with, some of these claims. First, tacit knowledge seems to increase with experience. For example, business managers received higher tacit knowledge scores than business graduate students, who in turn outperformed undergraduate students, although sample sizes in these groups were often small (Wagner, 1987). Second, scores on tacit knowledge inventories showed low correlations (below .20) with measures of fluid and crystallized intelligence (Legree, Heffner, Psotka, Martin, & Medsker, 2003; Tan & Libby, 1997). Finally, Bowman et al. (2001) reviewed research on tacit knowledge in organizational, educational, and military settings and concluded that the assessment of tacit knowledge has certain promise for predicting performance in these real-world environments, although the level of prediction does not reach the values obtained with g (see also Van Rooy, Dilchert, Viswesvaran, & Ones, 2006). Although most of these results were obtained in educational, military, sales, or business contexts, Baum, Bird, and Singh (2011) also found evidence for the role of practical intelligence in the context of entrepreneurship: Practical intelligence interacted with growth goals to predict venture growth across four years. Bowman et al. (2001) leveled various criticisms with respect to the construct of practical intelligence. From a conceptual point of view, questions have been raised about whether practical intelligence (tacit knowledge) exists at all as a single construct that is different from other types of intelligence, job knowledge, and personality (see also Gottfredson, 2003; McDaniel & Whetzel, 2005). In particular, McDaniel and Whetzel (2005) put various claims related to practical intelligence (tacit knowledge) to the test. To this end, they used research related to situational judgment tests (SJTs), a measurement method that is closely related to tacit knowledge inventories. Consistent with research by Sternberg and colleagues, McDaniel and Whetzel concluded that such tests predict job performance and have incremental validity over more common selection procedures. However, they argued that there was no support for the other claims. Specifically, they cited studies showing that SJTs of practical intelligence were factorially complex and could not be represented by a general factor in factor analytic studies.
They also reviewed research showing that these test scores were significantly related to scores on established constructs such as g, conscientiousness, emotional stability, and agreeableness. Later in this chapter, we argue that such criticisms are both right and wrong—they are right that practical intelligence is not a unitary construct, but they are wrong to conclude that the factorially complex results and significant correlations with established constructs imply that practical intelligence is not a distinct and valid construct.
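One simple way to probe the claim that a measure is factorially complex rather than saturated by a general factor is to inspect the eigenvalues of its item intercorrelation matrix. The Python sketch below does this for simulated data in which hypothetical SJT items are driven by three situational factors; it is only an illustration of the kind of evidence at issue, not a reanalysis of the studies cited by McDaniel and Whetzel (2005).

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_items = 800, 12

# Hypothetical SJT items driven by three distinct situational factors (4 items each)
factors = rng.normal(size=(n, 3))
assignment = np.repeat([0, 1, 2], n_items // 3)
items = 0.6 * factors[:, assignment] + 0.8 * rng.normal(size=(n, n_items))

eigvals = np.linalg.eigvalsh(np.corrcoef(items, rowvar=False))[::-1]
print("Proportion of variance for the first four eigenvalues:",
      np.round(eigvals[:4] / n_items, 2))
# Several comparable eigenvalues, rather than one dominant first eigenvalue, point to a
# factorially complex measure rather than a single general factor.
```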

Emotional Intelligence

Since the mid-1990s, emotional intelligence (EI) has probably been the psychological construct that has received the greatest attention in both popular and academic literatures. Historically, a distinction is made between two conceptualizations of emotional intelligence, namely an ability EI model and a trait EI model (e.g., Matthews, Zeidner, & Roberts, 2007). The first model conceptualizes EI as an ability akin to cognitive ability and measures it via performance-based tests. In this paradigm, EI is viewed as another legitimate type of intelligence. Hence, this model is also referred to as emotional cognitive ability or information processing emotional intelligence. Emotional intelligence is then defined as "the ability to monitor one's own and others' emotions, to discriminate among them, and to use the information to guide one's thinking and actions" (Salovey & Mayer, 1990, p. 189). This definition shows that the higher-order construct of emotional intelligence is broken down into four branches. The first branch—emotional identification, perception, and expression—deals with the ability to accurately perceive emotions in others' verbal and nonverbal behavior. Emotional facilitation of thought is the second branch, referring to the ability to use emotions to assist thinking and problem solving. Third, emotional understanding denotes the ability to analyze feelings, discriminate among emotions, and think about their outcomes. Finally, emotional management deals with abilities related to maintaining or changing emotions. We refer to Côté (2014) for an excellent and detailed overview of the different abilities under each branch. The second model, the trait EI model, views EI as akin to personality and assesses it via self-report. In this model, emotional intelligence is defined as "an array of non-cognitive capabilities, competencies, and skills that influence one's ability to succeed in coping with environmental demands and pressures" (Bar-On, 1997, p. 16). As the name suggests, this model uses a broad definition of emotional intelligence. Abilities such as emotion perception are typically combined with non-cognitive competencies, skills, and personality traits. For example, one of the most popular mixed models (Bar-On, 1997) measures five broad factors and fifteen facets: (1) Intrapersonal (Self-Regard, Emotional Self Awareness, Assertiveness, Independence, and Self-Actualization), (2) Interpersonal (Empathy, Social Responsibility, Interpersonal Relationship), (3) Stress Management (Stress Tolerance and Impulse Control), (4) Adaptability (Reality Testing, Flexibility, and Problem Solving), and (5) General Mood (Optimism and Happiness). In the Goleman (1995) model, a similar expanded definition of emotional intelligence is used, referring to emotional intelligence as a set of learned competencies. Emotional intelligence competence is then defined as "an ability to recognize, understand, and use emotional information about oneself or others that leads to or causes effective or superior performance" (Boyatzis & Sala, 2004, p. 149). A distinction is further made among five main competency clusters (with various subcompetencies): self-awareness, self-regulation, motivation, empathy, and social skills. Given the trait-like nature of the mixed model, some researchers have suggested using terms such as "trait emotional intelligence," "emotional self-efficacy" (Petrides & Furnham, 2003), or "emotional self-confidence" (Roberts, Schulze, Zeidner, & Matthews, 2005). Meta-analytic research (Van Rooy, Viswesvaran, & Pluta, 2005) demonstrated that these two models are not measuring the same constructs. Measures based on the two models correlated only .14 with one another. In addition, these two models had different correlates. Emotional intelligence measures based on the mixed model overlapped considerably with personality trait scores but not with cognitive ability. Conversely, EI measures developed according to an EI ability model correlated more with cognitive ability and less with personality.
Other research has clarified that ability model measures correlate especially with verbal (crystallized) ability, with correlations typically between .30 and .40 (Mayer, Roberts, & Barsade, 2008). Hence, some have posited that the term "emotional intelligence" should be replaced by the term "emotional knowledge" (Zeidner, Matthews, & Roberts, 2004). Besides the construct-related validity of emotional intelligence, its criterion-related validity has also been scrutinized. To this end, Côté (2014) reviewed three meta-analyses: Joseph and Newman (2010) found an uncorrected correlation of .16 between emotional intelligence (ability model) and job performance, whereas in O'Boyle, Humphrey, Pollack, Hawver, and Story (2011) this uncorrected correlation was .21. Finally, an earlier meta-analysis by Van Rooy and Viswesvaran (2004) that used both EI models revealed a correlation of .17 for predicting performance in a variety of settings (e.g., employment, academic). Conceptual and methodological problems are associated especially with the mixed model of emotional intelligence (Mayer et al., 2008). First, the ambiguous (all-encompassing) definition and the very broad content of the mixed model have been criticized (e.g., Landy, 2005; Locke, 2005; Matthews, Roberts, & Zeidner, 2004). For example, Landy (2005) succinctly noted: "the construct [of emotional intelligence] and the operational definitions of the construct (i.e., the actual measurement instruments) are moving targets" (p. 419). Similarly, Locke (2005) posited that "the concept of EI has now become so broad and the components so variegated that no one concept could possibly encompass or integrate all of them, no matter what the concept was called; it is no longer even an intelligible concept" (p. 426). Another criticism relates to redundancy of the mixed model with Big Five personality traits. For instance, De Raad (2005) explored to what extent emotional intelligence (mixed model) can be expressed in terms of personality traits. To this end, he gathered a total of 437 items from EI inventories. Sixty-six percent of the EI descriptors could be classified in a well-known Big Five framework (the Abridged Big Five Dimensional Circumplex). The lion's share of the terms was categorized under Agreeableness and Emotional Stability. The main reason for items not being classifiable was that they were ambiguous, as they were often related to several Big Five factors. In other studies, the multiple correlation between Big Five scores and scores on mixed model EI measures ranged between .75 and .79 (Brackett & Mayer, 2003; Grubb & McDaniel, 2007). Other studies, however, found incremental validity of the mixed model over and above personality (Law, Wong, & Song, 2004; Tett, Fox, & Wang, 2005). Nonetheless, in the scientific community, there have been calls to give up the mixed model (despite its popularity in practice), to focus solely on the ability model (Daus & Ashkanasy, 2005), or at least not to refer to the mixed model as emotional intelligence (Cherniss, 2010). In recent years, two meta-analyses have further clarified various aspects of this debate. First, Joseph and Newman (2010) examined the validity of emotional intelligence as conceptualized only in the ability model. They found support for a sequential relationship among emotional intelligence facets (emotion perception, understanding, and regulation) and job performance, with personality and cognitive ability as antecedents of these emotional intelligence processes. Second, Joseph, Jin, Newman, and O'Boyle (2015) examined the validity of emotional intelligence as conceptualized in the mixed model. Although Joseph et al. found a mean corrected correlation of .29 between mixed emotional intelligence and supervisor-rated job performance, this relationship became .00 after controlling for already-established constructs such as ability EI, self-efficacy, personality, and cognitive ability. Taken together, these two meta-analyses demonstrate that further progress on emotional intelligence is to be made via more refined conceptualizations and measurement of the ability EI model. That said, the ability model is not without limitations either. For example, a large-scale examination of many emotional intelligence, cognitive intelligence, and personality measures showed that emotion perception (as represented by measures of perception of emotions in faces and pictures) was the only branch of the four branches of the ability model that could not be classified under established measures (Davies, Stankov, & Roberts, 1998). But even the emotion perception construct has drawbacks, as the construct does not seem to generalize across different measures (Gohm, 2004). That is, existing emotion perception measures correlate only weakly with one another.
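The logic of the Joseph et al. (2015) finding (a sizable zero-order correlation that shrinks toward zero once established constructs are controlled) can be illustrated with a small hierarchical regression on simulated data. Variable names and effect sizes in the sketch below are invented; it shows the form of the incremental-validity analysis, not the meta-analytic results themselves.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000

# Simulated established predictors and a criterion driven by them (all names hypothetical)
cognitive = rng.normal(size=n)
conscientiousness = rng.normal(size=n)
emotional_stability = rng.normal(size=n)
performance = (0.4 * cognitive + 0.3 * conscientiousness
               + 0.2 * emotional_stability + rng.normal(size=n))

# A "mixed EI" score constructed largely as a blend of the same established traits
mixed_ei = (0.5 * conscientiousness + 0.5 * emotional_stability
            + 0.3 * rng.normal(size=n))

r_zero_order = np.corrcoef(mixed_ei, performance)[0, 1]

established = np.column_stack([cognitive, conscientiousness, emotional_stability])
base = sm.OLS(performance, sm.add_constant(established)).fit()
full = sm.OLS(performance, sm.add_constant(np.column_stack([established, mixed_ei]))).fit()

print(f"Zero-order r(mixed EI, performance):         {r_zero_order:.2f}")
print(f"Incremental R^2 over established constructs: {full.rsquared - base.rsquared:.3f}")
```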
In comparing the findings from the ability and the trait models, a major methodological problem is the method-construct confound that arises because the ability model is typically measured using performance-based tests, whereas the trait model is typically measured using self-reports. In order to advance research on the comparison of ability and trait models of emotional intelligence (and also on the comparison of these models when applied to practical intelligence or social intelligence), rigorous designs that allow us to clearly isolate construct and method variances are needed (Chan & Schmitt, 2005).

Social Intelligence

Of the three intelligence constructs, social intelligence has the longest history. The idea goes back to Thorndike (1920), who defined social intelligence as "the ability to understand and manage men and women, boys and girls—to act wisely in human relations" (p. 228). As noted by Landy (2005), Thorndike did not build a theory of social intelligence, but he used the notion of social intelligence only to clarify that intelligence could manifest itself in different facets (e.g., abstract, mechanical, social). Social intelligence has a checkered history. Early studies tried to distinguish social intelligence from academic intelligence (e.g., Hoepener & O'Sullivan, 1968; Keating, 1978), but these research efforts were unsuccessful. The problem was that measures of social intelligence did not correlate highly among themselves and that academic intelligence and social intelligence formed one factor. Methodologically, it was troublesome that both intelligences were measured with the same method (paper-and-pencil measures). The early research led to the conclusion that the "putative domain of social intelligence lacks empirical coherency, at least as it is represented by the measures used here" (Keating, 1978, p. 221). Two advancements led to more optimism. The first was the distinction between cognitive social intelligence (e.g., social perception or the ability to understand or decode verbal and nonverbal behaviors of other persons) and behavioral social intelligence (effectiveness in social situations). Using this multidimensional definition of social intelligence and multiple measures (self, teacher, and peer ratings), Ford and Tisak (1983) were able to distinguish social intelligence from academic intelligence. In addition, social intelligence predicted social behavior better than academic intelligence (see also Marlowe, 1986). The second advancement was the use of multitrait-multimethod designs (and confirmatory factor analysis) to obtain separate and unconfounded estimates of trait and method variance (Jones & Day, 1997; Wong, Day, Maxwell, & Meara, 1995). These more sophisticated multitrait-multimethod designs have provided further evidence for the multidimensionality of social intelligence and for its discriminability vis-à-vis academic intelligence. For example, the aforementioned distinction made between cognitive social intelligence and behavioral social intelligence has been confirmed (e.g., Wong et al., 1995). Similarly, a distinction is often made between fluid and crystallized social intelligence. The fluid form of social intelligence refers to social-cognitive flexibility (the ability to flexibly apply social knowledge in novel situations) or social inference. Conversely, a term such as social knowledge (knowledge of social etiquette, procedural and declarative social knowledge about social events) denotes the more crystallized component of social intelligence (Jones & Day, 1997). Despite these common findings, the dimensions, definitions, and measures of social intelligence still vary considerably across studies. Along these lines, Weis and Süss (2005) provided an excellent overview of the different facets of social intelligence that have been examined. This might form the basis for adopting a more uniform terminology in the description of social intelligence subdimensions. Interest in social intelligence has also enjoyed a renaissance under the general term of social effectiveness constructs. According to Ferris, Perrewé, and Douglas (2002), social effectiveness is a "broad, higher-order, umbrella term, which groups a number of moderately related, yet conceptually distinctive, manifestations of social understanding and competence" (p. 50).
Examples are social competence, self-monitoring, emotional intelligence, social skill, social deftness, practical intelligence, etc. The value of social skills has been especially scrutinized. Similar to social intelligence, social skills are posited to have a cognitive component (interpersonal perceptiveness) and a behavioral component (behavioral flexibility; Riggio, 1986; Schneider, Ackerman, & Kanfer, 1996). Another interesting framework of social skills was proposed by Klein, DeRouin, and Salas (2006). They distinguished among 10 social skills, which they more parsimoniously grouped under two meta social skills (communication and relationship building). A key difference between social skills and personality traits is that the former are learned (i.e., an ability), whereas the latter are relatively stable. Research has found that they are only moderately (.20) correlated (Ferris, Witt, & Hochwarter, 2001), but both constructs are also related in that social skills enable personality traits to show their effects (Ferris et al., 2001; Hogan & Shelton, 1998). Research has confirmed that social skills moderate the effects of personality traits (conscientiousness) on job performance (Witt & Ferris, 2003). Social skills were also found to have direct effects on managerial job performance, although personality and cognitive ability were not controlled for in most studies (Semadar, Robins, & Ferris, 2006).


Conclusions

Our review of practical, social, and emotional intelligence highlights that these three constructs share remarkable similarities. Specifically, we see at least three parallels. First, the origins and rationale behind each of the constructs can be summarized as "going beyond g." Cognitively oriented measures of ability and achievement have been traditionally used in employment and educational contexts. However, at the same time there has always been substantial interest in exploring possible supplemental ("alternative") predictors for broadening the constructs measured and reducing possible adverse impact. Supplementing cognitive with alternative predictors is seen as a mechanism for accomplishing this goal (Sackett, Schmitt, Ellingson, & Kabin, 2001). Whereas social intelligence is the oldest construct, practical intelligence came into fashion at the end of the 1980s. Since Goleman's (1995) book, emotional intelligence has been the newest fad. Each time, the construct was introduced as a panacea for the problem of an exclusive reliance on g. We agree that there is a need to go beyond g and identify new and non-g constructs, but a new construct has little scientific explanatory and utility value if it is defined solely by negation (i.e., as non-g). Hence, good construct-related validity evidence for the three constructs is needed. The current state of research indicates to us that such efforts have been undertaken for social and emotional intelligence (ability model). Still, more rigorous construct validation studies are needed. Second, the conceptualizations of these three constructs have salient parallels. Each of these three constructs has various definitions, is multidimensional, and is subject to debate about its different dimensions. Third, for each of these constructs, investigations of incremental validity over and above more established constructs, such as cognitive ability and personality, have been the focus of debate and research. So, are there conceptual differences among the three constructs? According to Landy (2005), emotional intelligence as a so-called new construct has simply replaced the older notion of social intelligence. Similarly, Bowman et al. (2001) posited that "it is not certain to what extent tacit knowledge, social, and EQ measures are structurally independent" (p. 148). Although our review shows that these three constructs overlap, it is possible to make at least some subtle distinctions. On the one hand, emotional intelligence might be somewhat narrower than social intelligence because it focuses on emotional problems embedded in social problems (Mayer & Salovey, 1993). That is probably why Salovey and Mayer (1990) defined emotional intelligence as a subset of social intelligence (p. 189). Conversely, one might also posit that emotional intelligence is broader than social intelligence because internal regulatory processes/emotions are also taken into account, which is not the case in social intelligence. Despite these differences, some authors have grouped social and emotional intelligence under the umbrella term of social and emotional effectiveness constructs (Heggestad & Morrison, 2008; Schlegel, Grandjean, & Scherer, 2013). Clearly, practical intelligence, with its emphasis on real-world problems, is more distinct from the other two constructs, as it makes no reference to interpersonal skills (Austin & Saklofske, 2005).
Domain specificity is another feature of tacit knowledge that contrasts with the more generic nature of social and emotional intelligence. In any case, these conceptual distinctions are open to investigation because few studies have explicitly examined the three constructs together (Weis & Süss, 2005).

MEASUREMENT APPROACHES

In the previous section, we showed that the conceptual debates around practical, social, and emotional intelligence share many parallels. The same can be said about their measurement because the similarities in how practical intelligence, social intelligence, and emotional intelligence are measured are striking. Generally, at least six measurement approaches might be distinguished: (1) self-reports, (2) other-reports, (3) interviews, (4) tests, (5) situational judgment tests, and (6) assessment center exercises. The following sections discuss each of these approaches, including their advantages and disadvantages. Some examples of instruments are also given, and these are summarized in Table 15.1 (see Côté, 2014, for a more comprehensive list of measures).


TABLE 15.1 Overview of Methods (Including Some Examples) for Measuring Practical, Emotional, and Social Intelligence

Self-reports
  Ability emotional intelligence model: WLEIS; SREIT; MEIA; SUEIT
  Trait emotional intelligence model: EQ-I; ECI; TMMS; TEIQue; AES
  Practical intelligence: Self-reports of people's behavior in everyday situations
  Social intelligence: Social skills inventories; TSIQue

Other-reports
  Ability emotional intelligence model: Same as self-reports; Workgroup Emotional Intelligence Profile
  Trait emotional intelligence model: Same as self-reports
  Practical intelligence: Other-reports of people's behavior in everyday situations
  Social intelligence: Same as self-reports

Performance-based tests
  Ability emotional intelligence model: MSCEIT; DANVA2; PONS; JACBART; EARS; VOCAL-I; MSFDE; MERT
  Trait emotional intelligence model: No known examples
  Practical intelligence: Basic Skills Tests
  Social intelligence: LEAS; IPT-15; Four/Six-Factor Tests of Social Intelligence; MTSI

Interviews
  Ability emotional intelligence model: Interview rating on components of the four-branch model of Mayer, Salovey, and Caruso
  Trait emotional intelligence model: Interview rating on mixed model emotional intelligence competencies (interpersonal sensitivity, stress tolerance, etc.)
  Practical intelligence: Interview rating on people's reported behavior in everyday situations
  Social intelligence: Interview rating on applied social skills

Situational judgment tests (SJTs)
  Ability emotional intelligence model: STEU; STEM; TEMINT; MEMA
  Trait emotional intelligence model: SJTs that aim to measure mixed model emotional intelligence competencies
  Practical intelligence: Tacit Knowledge Inventories
  Social intelligence: George Washington Social Intelligence Test (Judgment in Social Situations)

Assessment centers (ACs)
  Ability emotional intelligence model: AC rating on components of the four-branch model of Mayer, Salovey, and Caruso
  Trait emotional intelligence model: AC rating on mixed model emotional intelligence competencies
  Practical intelligence: Case Situational Problems
  Social intelligence: AC rating on applied social skills

Note. Abbreviations are explained in the text.

Self-Reports

The self-report approach presents respondents with descriptive statements and asks them to use a rating scale to indicate the extent to which they agree or disagree with the respective statements. An important advantage of self-report measures is that they can be administered inexpensively and quickly to large groups of respondents. Examples of the self-report approach are many. In fact, most examples of self-report EI measures are based on the mixed model approach to emotional intelligence. Examples are the Emotional Competence Inventory (ECI; Sala, 2002), the Trait Meta-Mood Scale (TMMS; Salovey, Mayer, Goldman, Turvey, & Palfai, 1995), the EQ-I (Bar-On, 1997), and the Trait Emotional Intelligence Questionnaire (TEIQue; Petrides & Furnham, 2003). Other EI measures are based on the four-branch model (or its predecessors) (Salovey & Mayer, 1990) but use a self-report methodology (instead of performance-based tests) for measuring it. Some researchers have categorized these measures as a third stream within the emotional intelligence domain (apart from the ability and mixed models; e.g., Daus & Ashkanasy, 2005; O'Boyle et al., 2011). Examples are the Wong Law Emotional Intelligence Scale (WLEIS; Law et al., 2004; Wong & Law, 2002), the Multidimensional Emotional Intelligence Assessment (MEIA; Tett et al., 2005), the Swinburne University Emotional Intelligence Test (SUEIT; Palmer & Stough, 2001), and the Schutte Self-Report Emotional Intelligence Test (SREIT; Schutte et al., 1998). We refer to Pérez, Petrides, and Furnham (2005) for a comprehensive list of trait EI measures. There also exist self-report inventories of social intelligence/social skills (e.g., Ferris et al., 2001; Riggio, 1986; Schneider et al., 1996). We are not aware of self-report instruments (excluding SJTs as self-report measures) that assess tacit knowledge. In the personality domain, there is a longstanding history of using self-report measures and an equally long debate over their use. The debate and issues concerning the use of self-report measures in personality research (see Connelly & Ones, 2010) are generalizable to the use of self-report measures in assessing social and emotional intelligence. A detailed review of the pros and cons of self-report measures is beyond the scope of this chapter. Suffice it to say that self-report data are by no means perfect, and they are in principle susceptible to various validity problems, such as lack of self-insight and/or faking (e.g., Christiansen, Janovics, & Siers, 2010; Lievens, Klehe, & Libbrecht, 2011; Tett et al., 2012) and inflation of correlations due to common method variance. However, it is noteworthy that the severity of many of the purported problems of self-report data may be overstated.

Other-Reports

Other-reports (or informant reports) have also been used for measuring emotional and social intelligence. One reason is that knowledgeable others might provide less lenient and more reliable measurement. Another reason is that multidimensional constructs such as emotional and social intelligence inherently have an important interpersonal component. Hence, it makes sense that in other-reports the same emotional and social intelligence scales as listed above are used, with others (e.g., peers, colleagues, teachers, parents, friends) now rating the focal person on descriptive statements. For example, the ECI of Goleman can also be completed by peers or supervisors. There also exist EI measures that were specifically developed for use in team settings. For instance, Jordan, Ashkanasy, Hartel, and Hooper (2002) developed a specific work group EI measure, namely the Workgroup Emotional Intelligence Profile. Although there exists a large amount of research supporting the use of peer ratings in the personality domain (e.g., Borkenau & Liebler, 1993; Funder, 1987; Kenny, 1991), research with other-based EI measures is slowly catching up. Van der Zee, Thijs, and Schakel (2002) confirmed that peer ratings of emotional intelligence were more reliable. However, they also found that these peer ratings suffered from leniency. Law et al. (2004) reported that peer-reports of a trait-based EI measure had substantial incremental validity over self-reports of emotional intelligence and personality. So, it seems beneficial to use peers for mixed model EI measures. So far, Elfenbein, Barsade, and Eisenkraft (2015) have conducted the largest examination of peer-reports in the context of emotional intelligence. Interestingly, their data came from self- and other-reports in workplace settings. They found evidence of inter-rater agreement among others' ratings of the focal person and of self–other agreement. Three other key findings were that (1) others could distinguish relatively well among the different emotional intelligence branches; (2) the other ratings predicted interdependent task performance, even after controlling for likability; and (3) these predictions were more accurate than those based on self-rated or ability emotional intelligence measures.

Performance-Based Tests

Whereas both self-reports and peer-reports are assumed to be measures of typical performance, performance-based tests are posited to measure maximal performance. The rationale behind these tests parallels the one behind cognitive ability tests, as these tests present people with social or emotion-based problem-solving items. For example, in popular tests of emotion perception, individuals are presented with faces, voices, or pictures and are then asked to describe the associated emotions. Historically, performance-based tests have been used for measuring social intelligence. An often-cited example is O'Sullivan, Guilford, and deMille's (1965) tests of Social Intelligence (see Landy, 2006, for other older examples). A more modern example is the Levels of Emotional Awareness Scale (LEAS; Lane, Quinlan, Schwartz, Walker, & Zeitlin, 1990), although this test has also been used as a measure of emotional intelligence (e.g., Barchard, 2003). Similarly, the Interpersonal Perception Task-15 (IPT-15; Costanzo & Archer, 1993) is a performance-based measure that presents videotapes to participants. These tests have enjoyed a renaissance in the context of the ability model of emotional intelligence, with the Mayer-Salovey-Caruso Emotional Intelligence Test (MSCEIT) as the best-known example. Other well-known examples are the Japanese and Caucasian Brief Affect Recognition Test (JACBART; Matsumoto et al., 2000), the Diagnostic Analysis of Nonverbal Accuracy (DANVA2; Nowicki, 2004), the Profile of Nonverbal Sensitivity (PONS; Rosenthal, Hall, DiMatteo, Rogers, & Archer, 1979), the Emotional Accuracy Scale (EARS; Mayer & Geher, 1996), the Montreal Set of Facial Displays of Emotion (MSFDE; Beaupré, Cheung, & Hess, 2000), and the Index of Vocal Emotion Recognition (Vocal-I; Scherer, Banse, & Wallbott, 2001). As noted by Spector and Johnson (2006), there is a difference between knowledge about emotions and the actual skill: knowing how to regulate one's emotions in the face of problems does not mean that one will actually do so in a real context. With regard to practical intelligence, this problem has been circumvented by using basic skills tests (Diehl, Willis, & Schaie, 1995). These tests measure, among other things, the ability to perform daily tasks such as cooking or using a bus schedule. Scoring constitutes another problem of performance-based tests. In contrast to cognitive ability tests, EI tests using the ability model, for instance, do not have objectively correct answers (with the exception of emotion perception tests constructed through digitally morphing faces).
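Because ability-EI items often lack an objectively correct answer, consensus-based scoring is one commonly described workaround: an examinee's answer is credited in proportion to how often a norm sample endorsed it. The Python sketch below shows the mechanics with simulated data; the norm-group proportions and item structure are hypothetical and are not taken from any specific test mentioned above.

```python
import numpy as np

rng = np.random.default_rng(11)
n_norm, n_items, n_options = 1000, 5, 4

# Hypothetical norm sample: the option each person chose for each emotion-perception item
norm_choices = rng.choice(n_options, size=(n_norm, n_items), p=[0.55, 0.25, 0.15, 0.05])

# Consensus key: proportion of the norm sample endorsing each option, per item
consensus = np.stack([
    np.bincount(norm_choices[:, i], minlength=n_options) / n_norm
    for i in range(n_items)
])

def consensus_score(responses):
    """Credit each answer by the proportion of the norm group that chose the same option."""
    return float(np.mean([consensus[i, opt] for i, opt in enumerate(responses)]))

examinee = [0, 0, 2, 1, 0]          # one examinee's option choices (hypothetical)
print(f"Consensus-scored total: {consensus_score(examinee):.3f}")
```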

Interviews

Interviews constitute another possible method for measuring practical, social, and emotional intelligence. In the past, social skills (social intelligence) in particular have frequently been measured in interviews. This is demonstrated by the meta-analysis of Huffcutt, Conway, Roth, and Stone (2001), who reviewed the types of constructs most frequently targeted by interviews in 47 studies. Specifically, social skills were measured in 27.8% of the interviews. Moreover, applied skills were twice as frequently rated in high-structure interviews (behavior description interviews and situational interviews) as compared to low-structure interviews (34.1% vs. 17.7%). Essentially, interviews are measurement methods that can be used to assess a wide variety of constructs. On the basis of multiple job-related questions, interviewees are asked to describe behavior that is relevant for constructs deemed important. Therefore, interviews could also be used for measuring practical intelligence (Fox & Spector, 2000) and emotional intelligence (mixed model; Schmit, 2006). Schmit notes how interview questions can ask interviewees to describe situations in which they had to recognize the emotions of others and how they dealt with those situations. Yet, in interviews, samples of behavior can be observed only for specific dimensions (e.g., interpersonal skills or oral communication skills; Van Iddekinge, Raymark, & Roth, 2005). For other dimensions, candidates report past behavior (in behavior description interviews) or intended behavior (in situational interviews).

Situational Judgment Tests

SJTs might be another approach for measuring practical, social, and emotional intelligence (Chan, 2000, 2006; O'Sullivan, 2007; Schulze, Wilhelm, & Kyllonen, 2007). SJTs are measurement methods that present respondents with job-related situations and sets of alternate courses of action to these situations. Per situation, respondents either select the best and worst options or rank/rate each of the alternative actions in terms of their effectiveness. Meta-analytic research in employment settings documented the predictive and incremental validity of SJTs in predicting job performance over and above cognitive ability scores and personality ratings (Chan & Schmitt, 2002; McDaniel et al., 2001; McDaniel, Hartman, Whetzel, & Grubb, 2007). As respondents have to respond to realistic (written and especially video-based) scenarios, SJTs might constitute a more contextualized (ecologically valid) way of measuring practical, social, and emotional intelligence. This judgment in a realistic context contrasts with the decontextualized nature of standardized tests. Technological advancements make it possible to develop interactive SJTs (also known as branched SJTs) that present different video fragments contingent upon responses to earlier video fragments. This allows the SJT to simulate the dynamics of interaction. Similar to EI tests (ability model), multiple-choice SJTs are scored using algorithms based on experts (excellent employees) or scored empirically based on the responses of large pilot samples. Over the years, SJTs have been developed for measuring each of the three constructs. First, as noted by McDaniel, Morgeson, Finnegan, Campion, and Braverman (2001), the first SJTs were social intelligence tests, namely the Judgment in Social Situations subtest of the George Washington Social Intelligence Test. Second, instruments very similar to SJTs are used under the label "tacit knowledge tests" for measuring practical intelligence (Sternberg et al., 1995). Examples are the Tacit-Knowledge Inventory for Managers and the Tacit-Knowledge Inventory for Military Leaders. Third, research has explored the use of SJTs for measuring two branches of Mayer and Salovey's EI model. Specifically, MacCann and Roberts (2008) developed the Situational Test of Emotional Understanding (STEU) and the Situational Test of Emotion Management (STEM). Whereas these prior SJTs relied on a paper-and-pencil format, some EI branches (e.g., emotion management) might be better measured via multimedia items (see Lievens & Sackett, 2006, for similar arguments about assessing interpersonal skills). Recently, MacCann, Lievens, Libbrecht, and Roberts (2016) developed a multimedia SJT for reliably and validly measuring emotion management (the MEMA). As compared to the MSCEIT's written emotional management test, they showed that scores on the MEMA tapped into not only cognitive ability but also emotion perception. In the future, virtual and avatar-based environments might also be designed for measuring emotional intelligence facets.
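The empirical scoring route mentioned above (keying SJT options against the responses of a large pilot sample) can be sketched as follows in Python. The data, option weights, and the particular keying rule (mean criterion score of pilot respondents endorsing each option) are illustrative assumptions rather than any specific operational procedure; in practice such keys are also cross-validated on a holdout sample.

```python
import numpy as np

rng = np.random.default_rng(5)
n_pilot, n_items, n_options = 500, 8, 4

# Hypothetical pilot sample: chosen option per SJT item plus a job performance criterion
choices = rng.choice(n_options, size=(n_pilot, n_items))
performance = rng.normal(size=n_pilot)

# Empirical key: weight each option by the mean criterion score of the pilot respondents
# who endorsed it (one simple keying rule among several used in practice).
key = np.zeros((n_items, n_options))
for i in range(n_items):
    for o in range(n_options):
        endorsed = choices[:, i] == o
        key[i, o] = performance[endorsed].mean() if endorsed.any() else 0.0

def empirically_keyed_score(response_vector):
    return float(sum(key[i, opt] for i, opt in enumerate(response_vector)))

applicant = rng.choice(n_options, size=n_items)
print(f"Empirically keyed SJT score: {empirically_keyed_score(applicant):.2f}")
# In operational use, the key would be derived on one sample and cross-validated on another.
```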

Assessment Center Exercises Whereas SJTs are low-fidelity simulations that require candidates to pick the “correct” answer from a limited set of predetermined response options instead of asking them to actually show how they would handle a specific situation, a final possible approach for measuring practical, social, and interpersonal intelligence consists of putting people in a simulated situation, observing their actual behavior, and then making inferences about their standing on the construct of interest. Performance (or authentic) assessment is often used as a general term for describing this strategy. In industrial and organizational psychology, this contextualized approach focusing on actual behavior is exemplified by assessment centers (ACs). In ACs, several job-related simulations (e.g., role-play, interview simulation, in-basket, group discussion) aim to elicit behavior relevant to the constructs under investigation. The assumption is that individuals’ responses to these simulations reflect the responses that they would exhibit in the real world. Multiple trained assessors observe and rate the candidates on these constructs. According to Gowing (2001), the roots of the measurement of social, practical, and emotional intelligence can be traced to this AC approach. Although these constructs are not explicitly measured in AC exercises, they correspond well to the typically competencies targeted by AC exercises. In particular, some AC competencies, such as flexibility, awareness for others, interpersonal skills, flexibility, stress tolerance, and communication, have clear resemblances with practical, emotional, and social intelligence. The context sensitivity of what constitutes good performance in AC exercises and the ease with which situations may temporally unfold or


change through injecting novel demands as the exercise progresses are features of the AC that make it a useful method for measuring the adaptability competencies associated with practical, emotional, and social intelligence (Chan, 2000). Several researchers have explicitly related the measurement of these AC dimensions to the measurement of one or more of the three intelligence constructs. Specifically, Spector and Johnson (2006) presented various examples of how AC exercises might be adapted for measuring emotional intelligence. For example, in a role-play a participant might be asked to deal with an irate customer or to comfort an upset colleague. Assessors might then rate the assessees on broad-based competencies or on more detailed verbal/nonverbal behaviors. Another example is Stricker and Rock's (1990) Interpersonal Competency Inventory (ICI), wherein participants have to respond orally to videotaped scenes (for a more recent example with webcam-captured performances, see Lievens, De Corte, & Westerveld, 2015). Similarly, Sternberg and colleagues have argued that typical AC exercises are very useful for assessing practical intelligence. For example, Hedlund, Wilt, Nebel, Ashford, and Sternberg (2006) developed so-called case scenario problems as a skill-based measure of practical intelligence. These case scenario problems consist of a fictitious business case, wherein participants are given information such as the history of the organization, their role, memos, e-mails, and financial tables. Individuals have to use their practical intelligence (practical problem-solving skills) to solve these contextualized, ill-defined problems. Clearly, this methodology is similar to the case analysis and in-basket formats that have been used for decades in ACs. Although the emphasis on simulations and actual behavior results in good AC validities (Arthur, Day, McNelly, & Edens, 2003) and little adverse impact (Terpstra, Mohamed, & Kethley, 1999), scores are often situation specific (Lance, Lambert, Gewin, Lievens, & Conway, 2004). That is, ratings of the same competency do not converge well across exercises. In addition, there is little distinction among dimensions within a specific exercise because within-exercise dimension ratings are highly correlated. Although these findings were traditionally interpreted as evidence of poor convergent and discriminant validity for AC ratings, this interpretation has changed: as reviewed by Lance (2008), the situation specificity of AC ratings is now regarded as reflecting true cross-situational variability in candidates' behavior across exercises.

Combinations Although we discussed the measurement approaches in separate sections, it is also possible to adopt combinations of them. For instance, MacCann, Wang, Matthews, and Roberts (2010) used an SJT for assessing emotion management not only in a self-report format but also in an other-report format. That is, they also asked a significant other what the focal person would do in a given situation. The correlation between self- and other-report SJT scores was low (.19). Although the other-report SJT scores predicted the criteria as well as the typical self-report SJT scores did, the construct validity of the two measures differed. That is, SJT scores based on other-reports had lower means, higher correlations with Extraversion, lower correlations with Agreeableness, and lower correlations with g.

Conclusions Our review of measurement approaches suggests parallels in how the three constructs are measured. Although it is often thought that the three constructs are primarily measured with self-reports and performance tests, this section highlighted that a wide array of other options is available. Specifically, interviews, peer-reports, and instruments with somewhat more fidelity, such as SJTs and AC exercises, are viable measurement approaches. Future research should further explore the differences and communalities among these alternative measurement methods.



CONCEPTUAL FRAMEWORK FOR EXAMINING PRACTICAL, EMOTIONAL, AND SOCIAL INTELLIGENCE In Figure 15.1, we present a conceptual framework that we adapted from Chan and Schmitt (2005) to organize the discussion and guide future research on the validity of practical, emotional, and social intelligence. Following Chan and Schmitt, the framework construes all three types of intelligence as competencies that are multidimensional constructs, each of which is a partial mediator of the predictive or causal effect of unidimensional KSAOs on job performance or other job-relevant criteria. In addition, our framework construes the three types of intelligences as distinct but related competencies with both common and unique construct space, as depicted by the three overlapping circles representing practical, emotional, and social intelligence. The framework in Figure 15.1 shows that both proponents and opponents of each of these three constructs are right and wrong in different ways. Specifically, the opponents typically focus on the KSAOs and correctly argue that practical, emotional, and social intelligences are not factorially pure (unitary) KSAOs, but they incorrectly dismiss the validities and value of these intelligence constructs. Conversely, the proponents typically focus on the multidimensional competencies and correctly argue that practical, emotional, and social intelligences are proximal (and hence sometimes better) predictors of performance and other criteria, but they incorrectly ignore the important role of KSAOs in determining the nature of these intelligence constructs. Our framework is consistent with and may reconcile several findings and the debate over the value of the three types of intelligence. For example, each of the three intelligence constructs is inherently multidimensional in the sense that it is conceptualized as a multidimensional competency resulting from a combination of several different individual difference constructs. The relationships linking each type of intelligence and the various individual difference constructs explain the consistent findings from factor analytic studies that the intelligence measure is factorially complex and the data from the measure do not produce good fit with a single-factor model. These relationships also explain the significant and sometimes substantial correlations between the intelligence measure and the established measures of traditional KSAOs, such as cognitive ability and personality traits. In addition, these relationships provide the conceptual

[Figure 15.1 shows three overlapping circles representing practical intelligence, emotional intelligence, and social intelligence, positioned between a column of individual difference constructs (KSAOs: cognitive abilities, personality traits, cognitive styles, values, motivational constructs, experience, and job knowledge) and a column of job performance and other criterion constructs (task performance, contextual performance, adaptive performance, withdrawal behaviors, and counterproductive behaviors).]

FIGURE 15.1  Conceptual Framework for Examining Practical, Emotional, and Social Intelligence (adapted from Chan & Schmitt, 2005).


bases for examining ability models, trait models, and mixed models of emotional (as well as practical or social) intelligence. The findings on the substantial zero-order validities and incremental validities of practical intelligence in predicting job performance over the prediction provided by cognitive ability and personality traits (e.g., Chan & Schmitt, 2002) are consistent with the proximal status of practical intelligence competencies (relative to the distal status of KSAOs) in the prediction of job performance. Similarly, the proximal status of emotional and social intelligence also explains the findings from studies that showed zero-order and incremental validities of these intelligence measures in the prediction of job performance and other criteria (see meta-analytic reviews of these studies). Interestingly, Figure 15.1 may also explain why SJTs and ACs, which are multidimensional measures, do better than factorially pure measures of single unitary constructs (e.g., cognitive ability, personality) in predicting job-relevant performance criteria, which are often multidimensional in nature. That is, much of what SJTs and ACs are assessing may well be multidimensional competencies that are similar, if not identical, to practical, emotional, and social intelligence. We believe the conceptual framework in Figure 15.1 is consistent with existing findings and reconciles much of the debate on the validity of practical, emotional, and social intelligence, but more direct empirical support for the framework is certainly needed. We reiterate the call in Chan and Schmitt (2005) that to obtain more direct evidence for a framework that construes the intelligence competencies as multidimensional mediators in the relationship between KSAOs and job performance (and other criteria), we would need to specify and test hypothesized and alternative structural equation models (based on primary data from a single study or an accumulation of results from past studies using meta-analyses) linking KSAOs, intelligence competencies, and job performance or other criterion outcomes. Future research could derive theory-driven specific models from the general framework depicted in Figure 15.1 to empirically examine the validity of one or more of the three intelligence constructs. Such models would facilitate the interpretation of the correlations between the intelligence construct and more established individual difference KSAOs, as well as the zero-order and incremental validities of the intelligence construct in predicting different criterion outcomes. The recent meta-analysis of Joseph et al. (2015) constitutes a good example of applying a similar framework for illuminating the construct saturation and validity of mixed-model EI measures. In the following section, we suggest various strategies for formulating theory-driven testable models that are likely to advance research in ways that make conceptual and practical contributions to the study of these constructs.
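As a minimal, hedged illustration of the kind of model testing called for above, the following Python sketch simulates hypothetical data and estimates a simple partial mediation chain (a distal KSAO, a proximal intelligence competency, and a criterion) with ordinary least squares regressions. A genuine test of the framework would require latent-variable structural equation models spanning multiple KSAOs, competencies, and criteria; the variable names and effect sizes here are assumptions made only for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

# Simulated data: a distal KSAO (cognitive ability), a proximal competency
# (emotional intelligence), and a criterion (job performance).
cognitive_ability = rng.normal(size=n)
emotional_intelligence = 0.5 * cognitive_ability + rng.normal(scale=0.8, size=n)
job_performance = 0.3 * cognitive_ability + 0.4 * emotional_intelligence + rng.normal(size=n)

def ols(y, *predictors):
    """Fit an OLS regression of y on the given predictors plus an intercept."""
    X = sm.add_constant(np.column_stack(predictors))
    return sm.OLS(y, X).fit()

a = ols(emotional_intelligence, cognitive_ability).params[1]       # KSAO -> competency
model_y = ols(job_performance, cognitive_ability, emotional_intelligence)
c_prime, b = model_y.params[1], model_y.params[2]                  # direct effect and competency -> criterion

print(f"indirect (mediated) effect a*b = {a * b:.2f}, direct effect c' = {c_prime:.2f}")
```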

STRATEGIES FOR FUTURE RESEARCH We suggest the following strategies for future research on practical, social, and emotional intelligence: (1) developing better measures, (2) matching predictor and criterion, (3) disentangling methods and constructs, (4) going beyond bivariate relationships, (5) using longitudinal validation designs, and (6) adopting a multilevel perspective.

Developing Better Measures When reviewing the domain of emotional intelligence, Miners, Côté, and Lievens (2017) counted more than 50 different measures used in a single year for ostensibly assessing emotional intelligence. In addition, research typically shows that convergent validity among the scores on these different measures is hard to establish because the scores do not substantially correlate with each other. In line with Miners et al. (2017), we therefore call on researchers to pay much more attention to the underlying theoretical processes that intervene between the construct of emotional intelligence and responses to the items that together constitute an EI measure. This admonition is derived from a seminal paper by Borsboom, Mellenbergh, and Van Heerden (2004), which posited that in order to assess the validity of measures, it is pivotal to relate variation in the construct to variation in item responses as a precursor to the traditional content-related, construct-related, and criterion-related validation process.


To stimulate further research, Miners et al. (2017) outlined three specific strategies that researchers can adopt. They also exemplified how researchers can put these strategies into action in the context of the emotion perception branch. However, these strategies should also be applicable to other branches and to new EI abilities (see Côté & Hideg, 2011). Although this call for better EI measurement is longstanding (e.g., Riggio, 2010; Ybarra, Kross, & Sanchez-Burks, 2014), we highlight it again here as a key area for future research.

Matching Between Predictor and Criterion An important development in personnel selection research is the movement away from general discussions of predictors as "valid" to consideration of "valid for what?" This development of more nuanced questions about predictor-criterion relationships was spurred by the taxonomic work of Campbell, McCloy, Oppler, and Sager (1993), which differentiated job performance into multiple distinct dimensions. Since then, selection researchers have significantly expanded the notion of job performance to include distinct performance dimensions, such as those listed in the criterion space of the framework in Figure 15.1. The expansion of the definition of performance and recognition of the multidimensional nature of performance led to streams of research demonstrating that different predictor constructs and selection tests will offer optimal predictive validity depending on the performance dimension(s) of interest (Chan, 2005a). For example, research has shown that task performance is better predicted by cognitive ability tests, whereas contextual performance is better predicted by personality tests (McHenry, Hough, Toquam, Hanson, & Ashworth, 1990). The key message here is that one needs to carefully attend to the constructs underlying both predictors and criterion dimensions in developing hypotheses about predictor-criterion relationships. Research on practical, social, and emotional intelligence has only begun linking these constructs to relevant criterion variables (Cherniss, 2010; Landy, 2005). These three constructs are often proposed to predict almost everything. This is probably best exemplified by studies investigating the validity of emotional intelligence for predicting academic performance (e.g., Amelang & Steinmayr, 2006; Barchard, 2003; Jaeger, 2003; Newsome, Day, & Catano, 2000; Parker, Hogan, Eastabrook, Oke, & Wood, 2006). There is little theoretical basis or conceptual match between emotional intelligence and grade point average (GPA). Clearly, emotional intelligence will have at best moderate value for predicting an omnibus, cognitively loaded criterion such as GPA. Hence, we need studies that carefully match the three intelligence constructs and their subdimensions to relevant criteria. For example, Libbrecht, Lievens, Carette, and Côté (2014) found that emotional intelligence predicted grades in courses that require interpersonal skills but not overall GPA. Importantly, at the meta-analytic level there is now also support for the predictor-criterion matching logic: Joseph and Newman (2010) found that the validity of EI measures for predicting job performance was higher in jobs high on emotional labor than in jobs low on emotional labor. Referring to Figure 15.1, we could apply this conceptual matching between predictor and criterion to foster our understanding of the link between the three intelligence constructs and the different dimensions of job performance. For instance, task performance might be predicted by ability-based emotional intelligence, whereas contextual performance might be predicted by trait-based emotional intelligence. As another example, practical intelligence might predict adaptive performance better than it predicts routine task performance.

Disentangling Methods and Constructs In the field of I-O psychology, there is increased recognition that methods should be distinguished from constructs in the comparative evaluation of predictors (Arthur & Villado, 2008; Arthur et al., 2003; Bobko, Roth, & Potosky, 1999; Chan & Schmitt, 1997, 2005; Lievens, Harris,


Van Keer, & Bisqueret, 2003). Constructs refer to the substantive conceptual variables (e.g., conscientiousness, cognitive ability, finger dexterity, field dependence-independence, reaction time, visual attention, emotional intelligence) that the measures were designed to assess. Conversely, methods refer to the tests, techniques, or procedures (e.g., paper-and-pencil tests, computer-administered tests, video-based tests, interviews, ACs, self-reports, peer-reports) used to assess the intended constructs. This distinction between constructs and methods is especially crucial for multidimensional predictors (Bobko et al., 1999). Conceptual and methodological issues of variance partitioning associated with the construct-method distinction, and their applications to constructs such as practical intelligence, are discussed in Chan and Schmitt (2005). Given the multidimensional nature of practical, social, and emotional intelligence, clarity of the method-construct distinction is critical. As shown in Table 15.1, practical, social, and emotional intelligence might be measured in multiple ways. As noted previously, social intelligence research has adopted such a multitrait-multimethod design and cleared up some of the confusion around this construct. For example, social intelligence constructs (e.g., social understanding, memory, and knowledge) were operationalized in a multitrait-multimethod design applying verbal, pictorial, and video-based performance measures. A similar strategy could be followed for clarifying some of the confusion related to emotional intelligence. So far, research has mainly compared self-reports of ability-based emotional intelligence or mixed-model emotional intelligence to personality inventories. However, many more strategies are possible. One possibility is to operationalize a specific branch of the EI ability model via different measurement approaches (Wilhelm, 2005). For example, the emotion understanding branch of the ability model might be measured via the MSCEIT and an SJT. Similarly, the emotion perception branch might be measured via faces, pictures, movies, voices, and so forth. As another example, people might complete an ability EI test, provide self-reports of their emotional intelligence, and be rated by trained assessors on emotional intelligence (or conceptually similar competencies such as interpersonal sensitivity) in AC exercises. Such research designs (see also Landy, 2006) focus on convergent validity and enable us to answer key questions such as: How well do these different methods converge in assessing emotional intelligence? How much variance is accounted for by method factors, and how much by substantive construct factors? It is important to distinguish among methods and constructs because comparative evaluations of predictors might be meaningful only when one either (a) holds the method constant and varies the construct or (b) holds the construct constant and varies the method. This is another reason why it is crucial to operationalize EI constructs via multiple methods. Moreover, doing so shifts the attention from measures to constructs (Matthews et al., 2004). Similarly, the need for diversity in measurement also applies to the criterion side (see also Figure 15.1), because most studies on trait emotional intelligence are prone to common method variance (both predictors and criteria are measured with the same method, namely self-reports).
We need studies that link the three intelligence constructs to objective measures of the various performance constructs.
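As an illustration of the multitrait-multimethod logic under stated assumptions (three constructs, two methods, simulated scores, arbitrary noise levels), the sketch below pulls out the correlations such a design would inspect: monotrait-heteromethod correlations as evidence of convergence across methods, and heterotrait-monomethod correlations as a signal of shared method variance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Simulated latent scores on three constructs (EI, SI, PI) measured by two
# methods (self-report and SJT), each adding its own measurement noise.
traits = rng.normal(size=(n, 3))
self_report = traits + rng.normal(scale=0.7, size=(n, 3))
sjt = traits + rng.normal(scale=0.7, size=(n, 3))

data = pd.DataFrame(
    np.hstack([self_report, sjt]),
    columns=["EI_self", "SI_self", "PI_self", "EI_sjt", "SI_sjt", "PI_sjt"],
)
mtmm = data.corr()

# Convergent validities: same construct, different methods (monotrait-heteromethod).
convergent = {c: round(mtmm.loc[f"{c}_self", f"{c}_sjt"], 2) for c in ("EI", "SI", "PI")}
# Shared method variance: different constructs, same method (heterotrait-monomethod).
method_overlap = {
    "self": round(mtmm.loc["EI_self", "SI_self"], 2),
    "sjt": round(mtmm.loc["EI_sjt", "SI_sjt"], 2),
}
print(convergent, method_overlap)
```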

Going Beyond Bivariate Relationships In the broader field of personnel selection, researchers have gone beyond documenting simple bivariate relationships between individual difference predictor and job performance criterion to examine mediator and moderator relationships. Identifying mediators in the predictor-criterion relationship increases our understanding of the prediction and helps in the search for alternative predictors or design of interventions that influence individuals’ scores on the criteria (by understanding what might affect the mediator). Similarly, research could attempt to explicate the precise affective, cognitive, motivational, and behavioral mechanisms that mediate the effects of practical, emotional, or social intelligence on the criterion, and directly measure and test these hypothesized mediation mechanisms. For example, cognitions and motivations (expectancy and instrumentality beliefs) or more subtle mediators (likeability) may mediate the intelligence effects on criteria such as job satisfaction and performance. For instance, to clarify the relationship between emotional intelligence and GPA, MacCann, Fogarty, Zeidner, and Roberts (2011)


found that coping strategies mediated the effect of emotional intelligence on academic achievement. That is, students who were higher on emotional intelligence used more effective strategies for coping with school-based stressors, so their achievement was less impeded by stress. When an intelligence construct interacts with another predictor (e.g., a personality trait) to affect the criterion, the interaction effect is mathematically equivalent whether we select the intelligence or the other predictor as the moderator. Conceptually, however, which predictor is selected as the moderator reflects different research questions. Identifying moderators that affect the magnitude and even the nature of the relationship between the intelligence and criterion constructs is important, as the moderator effect clarifies the range and boundary conditions of the predictive validity of the intelligence construct. There has been increasing research examining moderator effects in the predictive validity of personality traits (e.g., Barrick, Parks, & Mount, 2005). In the domain of practical, emotional, and social intelligence, similar research on moderator effects has been conducted. For instance, Côté and Miners (2006) found that emotional intelligence was linked to task performance and organizational citizenship behavior (OCB) toward the organization only for people low on cognitive ability. Another example is Ferris et al. (2001), who reported that the relationship between social intelligence and job performance was stronger among workers high in cognitive ability than among those low in cognitive ability. On the other hand, when the intelligence construct is the moderator affecting the relationship between another predictor and the criterion, the importance of the intelligence construct is demonstrated not in terms of its bivariate predictive validity for the criterion but in terms of its role in determining the range and boundary conditions of the bivariate predictive validity of another predictor. Several studies have demonstrated important moderator roles of practical, emotional, and social intelligence constructs. For example, Witt and Ferris (2003) found that the conscientiousness-performance relationship is moderated by social intelligence in that high levels of conscientiousness together with poor social intelligence lead to lower performance. Chan (2006) found that proactive personality predicts work perceptions (procedural justice perception, perceived supervisor support, social integration) and work outcomes (job satisfaction, affective organizational commitment, job performance) positively among individuals with high practical intelligence (construed in terms of situational judgment effectiveness) but negatively among those with low practical intelligence. These disordinal interaction effects show that high levels of proactive personality may be either adaptive or maladaptive depending on the individual's level of practical intelligence, and they caution against direct interpretations of bivariate associations between proactive personality and work-relevant criteria. To encourage researchers to go beyond bivariate relationships, Côté (2014) presents various strategies that could be followed.
In short, fruitful future research could go beyond bivariate relationships to examine the mediators that link the intelligence construct to the criterion construct, the moderators that affect the nature of the intelligence-criterion relationship, and the role of the intelligence construct as a moderator affecting the nature of a predictor-criterion relationship.
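The moderator logic can be shown in a few lines of Python. The sketch below uses invented variable names and effect sizes to mimic a disordinal interaction of the kind reported by Chan (2006); its only point is that the moderation is carried by the product term, whose coefficient is the same regardless of which variable is framed as the moderator.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400

# Invented data mimicking a disordinal interaction: proactive personality helps
# when practical intelligence is high and hurts when it is low.
practical_intelligence = rng.normal(size=n)
proactive_personality = rng.normal(size=n)
job_satisfaction = (0.2 * practical_intelligence
                    + 0.35 * proactive_personality * practical_intelligence
                    + rng.normal(size=n))

X = sm.add_constant(np.column_stack([
    proactive_personality,
    practical_intelligence,
    proactive_personality * practical_intelligence,  # product term carries the moderation
]))
fit = sm.OLS(job_satisfaction, X).fit()

# The product-term coefficient is identical whichever variable is called the
# "moderator"; the labeling is a conceptual, not a mathematical, choice.
print(fit.params.round(2))
```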

Using Longitudinal Validation Designs The time spans over which criteria are gathered for validation studies often reflect practical considerations. In predictive studies, the time period selected for the criterion rarely exceeds a year or two. Validation studies of practical intelligence, social intelligence, or emotional intelligence are no exception. As such, criterion-related validities reported for these three constructs may or may not accurately estimate the long-term validities associated with these constructs. That is, early performance may not reflect typical performance over an individual’s tenure in an organizational or educational context, and if so, early validation efforts would provide misleading results. In the personnel selection domain, research has shown that predictors of job performance might differ across job stages. Along these lines, the transitional job stage where there is a need to learn new things is typically contrasted to the more routine maintenance job stage (Murphy, 1989). For instance, Thoresen, Bradley, Bliese, and Thoresen (2004) found that Openness was


related to performance and performance trends in the transition stage but not to performance at the maintenance stage. We believe that future studies on practical, social, and emotional intelligence should also adopt a longitudinal design where possible. As with personality, the validity of these intelligence constructs for predicting job performance might well differ over the long run. For example, the transitional job stage typically involves more adaptive demands than the routine maintenance job stage, so practical intelligence might predict job performance more strongly in the transitional stage than in the maintenance stage. A construct-oriented approach to the study of practical, emotional, and social intelligence that locates the constructs in the framework presented in Figure 15.1 would provide the conceptual basis to hypothesize, test, and interpret performance changes over time. Using appropriate longitudinal designs and change assessment techniques allows us to draw practical implications for key issues such as changes in test validities, changes in mean performance, changes in the rank order of individuals' performance, and changes in the dimensionality (i.e., the number and nature of dimensions) of performance (Chan, 1998a, 2005a).
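As a small, purely illustrative sketch of the stage-specific validity argument, the snippet below simulates data in which a predictor correlates more strongly with performance in the transitional stage than in the maintenance stage. The effect sizes are assumptions, and a real longitudinal study would use the growth-modeling and change-assessment techniques cited above rather than two cross-sectional correlations.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 350

# Simulated scenario: practical intelligence relates more strongly to performance
# during the transitional job stage than during the maintenance job stage.
practical_intelligence = rng.normal(size=n)
perf_transitional = 0.45 * practical_intelligence + rng.normal(size=n)
perf_maintenance = 0.15 * practical_intelligence + rng.normal(size=n)

r_transitional = np.corrcoef(practical_intelligence, perf_transitional)[0, 1]
r_maintenance = np.corrcoef(practical_intelligence, perf_maintenance)[0, 1]
print(f"validity, transitional stage: {r_transitional:.2f}; maintenance stage: {r_maintenance:.2f}")
```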

Adopting a Multilevel Perspective In many contexts, personnel selection researchers have started to move beyond the individual level to consider variables at higher levels of analysis (e.g., group, organization). In the conceptual framework presented in Figure 15.1, the three intelligence constructs, as well as all of the other constructs in the individual difference and criterion spaces, could be conceptualized, measured, and analyzed at multiple levels of analysis (e.g., individual, group, organization). So far, research on practical, emotional, and social intelligence has not adopted a multilevel approach. With the increasing reliance on teams to accomplish work in organizations, the relevant job performance criteria are often at a higher level of analysis (e.g., team, organization) than the individual level (for an example in the field of personality, see Oh, Kim, & Van Iddekinge, 2015). When each of the three intelligence constructs is examined as a predictor in the multilevel context of staffing teams or organizations and related to job performance at the individual, team, and organizational levels, we need appropriate composition models (Chan, 1998b) that explicate the functional relationships linking the same intelligence construct across levels of analysis, so that we have a clear conceptual understanding of what is meant by, say, team social intelligence and how to measure and analyze social intelligence at the team level. The multidimensional nature of the practical, emotional, and social intelligence constructs poses challenges to multilevel research because of the increased difficulty in formulating and testing appropriate composition models for these intelligence constructs. Multilevel constructs and data bring with them complex conceptual, measurement, and data analysis issues, and discussion of these issues is beyond the scope of this chapter (for reviews, see Chan, 1998b, 2005b). Our basic point is that a multilevel approach is a strategy for future research on practical, emotional, and social intelligence that is not just desirable but probably necessary, given the inherently multilevel nature of the criteria of interest (e.g., team performance) that are emerging in personnel selection research.
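One small, commonly used piece of the composition-model step can be sketched as follows, assuming simulated individual emotional intelligence scores nested in equally sized teams: individual scores are aggregated to team means, and an ANOVA-based ICC(1) indexes how much variance lies between teams. The index is illustrative only; the choice of composition model is a conceptual decision (Chan, 1998b) that no single statistic can settle.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
teams = np.repeat(np.arange(40), 6)                   # 40 teams of 6 members each
team_effect = rng.normal(scale=0.5, size=40)[teams]   # shared team-level component
ei = team_effect + rng.normal(scale=1.0, size=teams.size)
df = pd.DataFrame({"team": teams, "ei": ei})

# ANOVA-based ICC(1): the proportion of variance in individual scores that lies
# between teams, often inspected before aggregating scores to the team level.
k = df.groupby("team").size().mean()                       # (equal) team size
msb = df.groupby("team")["ei"].mean().var(ddof=1) * k      # between-team mean square
msw = df.groupby("team")["ei"].var(ddof=1).mean()          # within-team mean square
icc1 = (msb - msw) / (msb + (k - 1) * msw)

team_ei = df.groupby("team")["ei"].mean()                  # direct-consensus style aggregation
print(f"ICC(1) = {icc1:.2f}; first team means: {team_ei.head(3).round(2).tolist()}")
```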

EPILOGUE We have, under the constraint of a relatively short chapter length, critically reviewed the vast literature on practical, emotional, and social intelligence constructs. We have proposed a conceptual framework, adapted from Chan and Schmitt (2005), that provides a way to organize the conceptualizations of the intelligence constructs and their relationships with other individual difference and criterion constructs. We believe that this framework also reconciles some, if not most, of the findings and debates in the literature on the intelligence constructs. Finally, by


explicating several strategies for future research, we hope that more scientifically rigorous studies will be conducted on practical, emotional, and social intelligence to provide practitioners in personnel selection and other HR functions with a stronger evidence base for the use of these intelligence constructs and measures.

NOTE 1. Given space constraints we do not discuss physiological and neural measures (e.g., Raz, Dan, Arad, & Zysberg, 2013).

REFERENCES Amelang, M., & Steinmayr, R. (2006). Is there a validity increment for tests of emotional intelligence in explaining the variance of performance criteria? Intelligence, 34, 459–468. Arthur, W., Day, E. A., McNelly, T. L., & Edens, P. S. (2003). A meta-analysis of the criterion-related validity of assessment center dimensions. Personnel Psychology, 56, 125–154. Arthur, W., & Villado, A. (2008). The importance of distinguishing between constructs and methods when comparing predictors in personnel selection research and practice. Journal of Applied Psychology, 93, 435–442. Austin, E. J., & Saklofske, D. H. (2005). Far too many intelligences? On the communalities and differences between social, practical, and emotional intelligences. In R. Schulze & R. D. Roberts (Eds.), Emotional intelligence—an international handbook (pp. 107–128). Cambridge: Hogrefe & Huber Publishing. Barchard, K. A. (2003). Does emotional intelligence assist in the prediction of academic success? Educational and Psychological Measurement, 63, 840–858. Bar-On, R. (1997). Bar-On emotional quotient inventory: A measure of emotional intelligence: Toronto, ON: Multi-Health Systems Inc. Barrick, M. R., Parks, L., & Mount, M. K. (2005). Self-monitoring as a moderator of the relationships between personality traits and performance. Personnel Psychology, 58, 745–767. Baum, J., Bird, B., & Singh, S. (2011). The practical intelligence of entrepreneurs: Antecedents and a link with new venture growth. Personnel Psychology, 64, 397–425. Beaupré, M. G., Cheung, N., & Hess, U. (2000). The Montreal set of facial displays of emotion [Slides]. (Available from Ursula Hess, Department of Psychology, University of Quebec at Montreal, P. O. Box 8888, Station “Centre-ville,” Montreal, Quebec H3C 3P8.). Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications of a meta-analytic matrix incorporating cognitive ability, alternative predictors, and job performance. Personnel Psychology, 52, 561–589. Borkenau, P., & Liebler, A. (1993). Convergence of stranger ratings of personality and intelligence with self-ratings, partner ratings, and measures intelligence. Journal of Personality and Social Psychology, 65, 546–553. Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071. Bowman, D. B., Markham, P. M., & Roberts, R. D. (2001). Expanding the frontier of human cognitive abilities: So much more than (plain) g! Learning and Individual Differences, 13, 127–158. Boyatzis, R. E., & Sala, F. (2004). Assessing emotional intelligence competencies. In G. Geher (Ed.), The measurement of emotional intelligence: Common ground and controversy (pp. 147–180). Hauppauge, NY: Nova Science. Brackett, M. A., & Mayer, J. D. (2003). Convergent, discriminant, and incremental validity of competing measures of emotional intelligence. Personality and Social Psychology Bulletin, 29, 1147–1158. Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 35–70). San Francisco: Jossey-Bass. Chan, D. (1998a). The conceptualization of change over time: An integrative approach incorporating Longitudinal Means and Covariance Structures analysis (LMACS) and Multiple Indicator Latent Growth Modeling (MLGM). Organizational Research Methods, 1, 421–483. Chan, D. (1998b). 
Functional relations among constructs in the same content domain at different levels of analysis: A typology of composition models. Journal of Applied Psychology, 83, 234–246. Chan, D. (2000). Understanding adaptation to changes in the work environment: Integrating individual difference and learning perspectives. Research in Personnel and Human Resources Management, 18, 1–42.


Filip Lievens and David Chan Chan, D. (2005a). Current directions in personnel selection. Current Directions in Psychological Science, 14, 220–223. Chan, D. (2005b). Multilevel research. In F. T. L. Leong & J. T. Austin (Eds.), The psychology research handbook (2nd ed., pp. 401–418). Thousand Oaks, CA: Sage. Chan, D. (2006). Interactive effects of situational judgment effectiveness and proactive personality on work perceptions and work outcomes. Journal of Applied Psychology, 91, 475–481. Chan, D. (2009). So why ask me? Are self report data really that bad? In C. E. Lance & R. J. Vandenberg (Eds.), Statistical and methodological myths and urban legends: Doctrine, verity and fable in the organizational and social sciences (pp. 309–335). New York, NY: Routledge. Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82, 143–159. Chan, D., & Schmitt, N. (2002). Situational judgment and job performance. Human Performance, 15, 233–254. Chan, D., & Schmitt, N. (2005). Situational judgment tests. In A. Evers, O. Smit-Voskuijl, & N. Anderson (Eds.), Handbook of personnel selection (pp. 219–242). Oxford, UK: Blackwell Publishers, Inc. Cherniss, C. (2010). Emotional intelligence: Toward clarification of a concept. Industrial and Organizational Psychology: Perspectives on Science and Practice, 3, 110–126. Christiansen, N. D., Janovics, J. E., & Siers, B. P. (2010). Emotional intelligence in selection contexts: Measurement method, criterion-related validity, and vulnerability to response distortion. International Journal of Selection and Assessment, 18, 87–101. Connelly, B. S., & Ones, D. S. (2010). An other perspective on personality: Meta-analytic integration of observers’ accuracy and predictive validity. Psychological Bulletin, 136, 1092–1122. Costanzo, M., & Archer, D. (1993). The interpersonal perception task-15 [Videotape]. Berkeley: University of California Extension Media Center. Côté, S. (2014). Emotional intelligence in organizations. Annual Review of Organizational Psychology and Organizational Behavior, 1, 459–488. Côté, S., & Hideg, I. (2011). The ability to influence others via emotion displays: A new dimension of emotional intelligence. Organizational Psychology Review, 1, 53–71. Côté, S., & Miners, C. (2006). Emotional intelligence, cognitive intelligence, and job performance. Administrative Science Quarterly, 51, 1–28. Daus, C. S., & Ashkanasy, N. M. (2005). The case for the ability-based model of emotional intelligence in organizational behavior. Journal of Organizational Behavior, 26, 453–466. Davies, M., Stankov, L., & Roberts, R. D. (1998). Emotional intelligence: In search of an elusive construct. Journal of Personality and Social Psychology, 75, 989–1015. De Raad, B. (2005). The trait-coverage of emotional intelligence. Personality and Individual Differences, 38, 673–687. Diehl, M., Willis, S. L., & Schaie, K. W. (1995). Everyday problem-solving in older adults—observational assessment and cognitive correlates. Psychology and Aging, 10, 478–491. Elfenbein, H. A., Barsade, S. G., & Eisenkraft, N. (2015). The social perception of emotional abilities: Expanding what we know about observer ratings of emotional intelligence. Emotion, 15, 17–34. Ferris, G. R., Perrewé, P. M., & Douglas, C. (2002). Social effectiveness in organizations: Construct validity and research directions. 
Journal of Leadership & Organizational Studies, 9, 49–63. Ferris, G. R., Witt, L. A., & Hochwarter, W. A. (2001). Interaction of social skill and general mental ability on job performance and salary. Journal of Applied Psychology, 86, 1075–1082. Ford, M. E., & Tisak, M. S. (1983). A further search for social intelligence. Journal of Educational Psychology, 75, 196–206. Fox, S., & Spector, P. E. (2000). Relations of emotional intelligence, practical intelligence, general intelligence, and trait affectivity with interview outcomes: It’s not all just “G”. Journal of Organizational Behavior, 21, 203–220. Funder, D. C. (1987). Errors and mistakes: Evaluating the accuracy of social judgment. Psychological Bulletin, 101, 75–90. Gohm, C. L. (2004). Moving forward with emotional intelligence. Psychological Inquiry, 15, 222–227. Goleman, D. (1995). Emotional intelligence: Why it can matter more than IQ. New York, NY: Bantam. Gottfredson, L. S. (2003). Dissecting practical intelligence theory: Its claims and evidence. Intelligence, 31, 343–397. Gowing, M. K. (2001). Measures of individual emotional competencies. In C. Cherniss & D. Goleman (Eds.), The emotionally intelligent workplace (pp. 83–131). San Francisco: Jossey-Bass. Grubb, W. L., & McDaniel, M. A. (2007). The fakability of Bar-On’s Emotional Quotient Inventory Short Form: Catch me if you can. Human Performance, 20, 43–59.


Practical, Emotional, and Social Intelligence Hedlund, J., Forsythe, G. B., Horvath, J. A., Williams, W. M., Snook, S., & Sternberg, R. J. (2003). Identifying and assessing tacit knowledge: Understanding the practical intelligence of military leaders. Leadership Quarterly, 14, 117–140. Hedlund, J., Wilt, J. M., Nebel, K. L., Ashford, S. J., & Sternberg, R. J. (2006). Assessing practical intelligence in business school admissions: A supplement to the graduate management admissions test. Learning and Individual Differences, 16, 101–127. Heggestad, E. D., & Morrison, M. J. (2009). An inductive exploration of the social effectiveness construct space. Journal of Personality, 76, 839–874. Hoepener, R., & O’Sullivan, M. (1968). Social intelligence and IQ. Educational and Psychological Measurement, 28, 339–344. Hogan, R., & Shelton, D. (1998). A socioanalytic perspective on job performance. Human Performance, 11, 129–144. Huffcutt, A. I., Conway, J. M., Roth, P. L., & Stone, N. J. (2001). Identification and meta-analytic assessment of psychological constructs measured in employment interviews. Journal of Applied Psychology, 86, 897–913. Jaeger, A. J. (2003). Job competencies and the curriculum: An inquiry into emotional intelligence in graduate professional education. Research in Higher Education, 44, 615–639. Jones, K., & Day, J. D. (1997). Discrimination of two aspects of cognitive-social intelligence from academic intelligence. Journal of Educational Psychology, 89, 486–497. Jordan, P. J., Ashkanasy, N. M., Hartel, C. E., & Hooper, G. S. (2002). Workgroup emotional intelligence: Scale development and relationship to team process effectiveness and goal focus. Human Resource Management Review, 12, 195–214. Joseph, D. L., Jin, J., Newman, D. A., & O’Boyle, E. H. (2015).Why does self-reported emotional intelligence predict job performance? A meta-analytic investigation of mixed EI. Journal of Applied Psychology, 100, 298–342. Joseph, D. L., & Newman, D. A. (2010). Emotional intelligence: An integrative meta-analysis and cascading model. Journal of Applied Psychology, 95, 54–78. Keating, D. P. (1978). Search for social intelligence. Journal of Educational Psychology, 70, 218–223. Kenny, D. A. (1991). A general-model of consensus and accuracy in interpersonal perception. Psychological Review, 98, 155–163. Klein, C., DeRouin, R. E., & Salas, E. (2006). Uncovering workplace interpersonal skills: A review, framework, and research agenda. In G. P. Hodgkinson & J. K. Ford (Eds.), International review of industrial and organizational psychology (Vol. 21, pp. 80–126). New York, NY: Wiley & Sons, Ltd. Lance, C. E. (2008). Why assessment centers do not work the way they are supposed to. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 84–97. Lance, C. E., Lambert, T. A., Gewin, A. G., Lievens, F., & Conway, J. M. (2004). Revised estimates of dimension and exercise variance components in assessment center postexercise dimension ratings. Journal of Applied Psychology, 89, 377–385. Landy, F. J. (2005). Some historical and scientific issues related to research on emotional intelligence. Journal of Organizational Behavior, 26, 411–424. Landy, F. J. (2006). The long, frustrating, and fruitless search for social intelligence: A cautionary tale. In K. R. Murphy (Ed.), A critique of emotional intelligence: What are the problems and how can they be fixed? (pp. 81–123). Mahwah, NJ: Lawrence Erlbaum Associates, Publishers. Lane, R. D., Quinlan, D. M., Schwartz, G. E., Walker, P. 
A., & Zeitlin, S. B. (1990). The levels of emotional awareness scale—a cognitive-developmental measure of emotion. Journal of Personality Assessment, 55, 124–134. Law, K. S., Wong, C. S., & Song, L. J. (2004). The construct and criterion validity of emotional intelligence and its potential utility for management studies. Journal of Applied Psychology, 89, 483–496. Legree, P. J., Heffner, T. S., Psotka, J., Martin, D. E., & Medsker, G. J. (2003). Traffic crash involvement: Experiential driving knowledge and stressful contextual antecedents. Journal of Applied Psychology, 88, 15–26. Libbrecht, N., Lievens, F., Carette, B., & Côté, S. (2014). Emotional intelligence predicts success in medical school. Emotion, 14, 64–73. Lievens, F., De Corte, W., & Westerveld, L. (2015). Understanding the building blocks of selection procedures: Effects of response fidelity on performance and validity. Journal of Management, 41, 1604–1627. Lievens, F., Harris, M. M., Van Keer, E., & Bisqueret, C. (2003). Predicting cross-cultural training performance: The validity of personality, cognitive ability, and dimensions measured by an assessment center and a behavior description interview. Journal of Applied Psychology, 88, 476–489. Lievens, F., Klehe, U. C., & Libbrecht, N. (2011). Applicant versus employee scores on self-report emotional intelligence measures. Journal of Personnel Psychology, 10, 89–95.


Filip Lievens and David Chan Lievens, F., & Sackett, P. R. (2006). Video-based versus written situational judgment tests: A comparison in terms of predictive validity. Journal of Applied Psychology, 91, 1181–1188. Locke, E. A. (2005). Why emotional intelligence is an invalid concept. Journal of Organizational Behavior, 26, 425–431. MacCann, C., Fogarty, G. J., Zeidner, M., & Roberts, R. D. (2011). Coping mediates the relationship between emotional intelligence (EI) and academic achievement. Contemporary Educational Psychology, 36, 60–70. MacCann, C., Lievens, F., Libbrecht, N., & Roberts, R. D. (2016). Differences between multimedia and text-based assessments of emotion management: An exploration with the Multimedia Emotion Management Assessment (MEMA). Cognition and Emotion, 30, 1317–1331. MacCann, C., & Roberts, R. D. (2008). New paradigms for assessing emotional intelligence: Theory and data. Emotion, 8, 540–551. MacCann, C., Wang, L., Matthews, G., & Roberts, R. (2010). Emotional intelligence and the eye of the beholder: Comparing self- and parent-rated situational judgments in adolescents. Journal of Research in Personality, 44, 673–676. Marlowe, H. A. (1986). Social intelligence—evidence for multidimensionality and construct independence. Journal of Educational Psychology, 78, 52–58. Matsumoto, D., LeRoux, J., Wilson-Cohn, C., Raroque, J., Kooken, K., Ekman, P. et al. (2000). A new test to measure emotion recognition ability: Matsumoto and Ekman’s Japanese and Caucasian Brief Affect Recognition Test (JACBART). Journal of Nonverbal Behavior, 24, 179–209. Matthews, G., Roberts, R. D., & Zeidner, M. (2004). Seven myths about emotional intelligence. Psychological Inquiry, 15, 179–196. Matthews, G., Zeidner, M., & Roberts, R. R. (2007). Emotional intelligence: Consensus, controversies, and questions. In G. Matthews, M. Zeidner, & R. R. Roberts (Eds.), The science of emotional intelligence—knowns and unknowns (pp. 3–46). New York, NY: Oxford University Press. Mayer, J. D., & Geher, G. (1996). Emotional intelligence and the identification of emotion. Intelligence, 22, 89–113. Mayer, J. D., Roberts, R. D., & Barsade, S. G. (2008). Human abilities: Emotional intelligence. Annual Review of Psychology, 59, 507–536. Mayer, J. D., & Salovey, P. (1993). The intelligence of emotional intelligence. Intelligence, 17, 433–442. McDaniel, M. A., Hartman, N. S., Whetzel, D. L., & Grubb, W. L. (2007). Situational judgment tests, response instructions, and validity: A meta-analysis. Personnel Psychology, 60, 63–91. McDaniel, M. A., Morgeson, F. P., Finnegan, E. B., Campion, M. A., & Braverman, E. P. (2001). Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86, 730–740. McDaniel, M. A., & Whetzel, D. L. (2005). Situational judgment test research: Informing the debate on practical intelligence theory. Intelligence, 33, 515–525. McHenry, J. J., Hough, L. M., Toquam, J. L., Hanson, M. A., & Ashworth, S. (1990). Project a validity results—the relationship between predictor and criterion domains. Personnel Psychology, 43, 335–354. Miners, C., Côte, S., & Lievens, F. (2017). Assessing the validity of emotional intelligence measures. Emotion Review. Murphy, K. R. (1989). Is the relationship between cognitive ability and job performance stable over time? Human Performance, 2, 183–200. Newsome, S., Day, A. L., & Catano, V. M. (2000). Assessing the predictive validity of emotional intelligence. 
Personality and Individual Differences, 29, 1005–1016. Nowicki, S. (2004). A manual for the Diagnostic Analysis of Nonverbal Accuracy tests (DANVA). Atlanta, GA: Department of Psychology, Emory University. O’Boyle, E. H., Humphrey, R. H., Pollack, J. M., Hawver, T. H., & Story, P. A. (2011). The relation between emotional intelligence and job performance: A meta-analysis. Journal of Organizational Behavior, 32, 788–818. Oh, I. S., Kim, S., & Van Iddekinge, C. H. (2015). Take it to another level: Do personality-based human capital resources matter to firm performance? Journal of Applied Psychology, 100, 935–947. O’Sullivan, M. (2007). Trolling for trout, trawling for tuna. In G. Matthews, M. Zeidner, & R. R. Roberts (Eds.), The science of emotional intelligence—knowns and unknowns (pp. 258–287). Oxford, NY: Oxford University Press. O’Sullivan, M., Guilford, J. P., & deMille, R. (1965). The measurement of social intelligence (Psychological Laboratory Report No. 34). Los Angeles: University of Southern California. Palmer, B., & Stough, C. (2001). The measurement of emotional intelligence. Australian Journal of Psychology, 53, 85–85. Parker, J. D., Hogan, M. J., Eastabrook, J. M., Oke, A., & Wood, L. M. (2006). Emotional intelligence and student retention: Predicting the successful transition from high school to university. Personality and Individual Differences, 41, 1329–1336.


Practical, Emotional, and Social Intelligence Pérez, J. C., Petrides, K. V., & Furnham, A. (2005). Measuring trait emotional intelligence. In R. Schulze & R. D. Roberts (Eds.), Emotional intelligence—an international handbook (pp. 181–201). Cambridge: Hogrefe & Huber Publishing. Petrides, K. V., & Furnham, A. (2003). Trait emotional intelligence: Behavioural validation in two studies of emotion recognition and reactivity to mood induction. European Journal of Personality, 17, 39–57. Raz, S., Dan, O., Arad, H., & Zysberg, L. (2013). Behavioral and neural correlates of emotional intelligence: An Event-Related Potentials (ERP) study. Brain Research, 1526, 44–53. Riggio, R. E. (1986). Assessment of basis social skills. Journal of Personality and Social Psychology, 51, 649–660. Riggio, R. E. (2010). Before emotional intelligence: Research on nonverbal, emotional, and social competences. Industrial and Organizational Psychology: Perspectives on Science and Practice, 3, 178–182. Roberts, R. D., Schulze, R., O’Brien, K., MacCann, C., Reid, J., & Maul, A. (2006). Exploring the validity of the Mayer-Salovey-Caruso Emotional Intelligence Test (MSCEIT) with established emotions measures. Emotion, 6, 663–669. Roberts, R. D., Schulze, R., Zeidner, M., & Matthews, G. (2005). Understanding, measuring, and applying emotional intelligence. In R. Schulze & R. D. Roberts (Eds.), Emotional intelligence—an international handbook (pp. 311–336). Cambridge: Hogrefe & Huber Publishing. Rosenthal, R., Hall, J. A., DiMatteo, M. R., Rogers, P. L., & Archer, D. (1979). Sensitivity to nonverbal communication: The PONS test. Baltimore, MD: Johns Hopkins University Press. Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B. (2001). High-stakes testing in employment, credentialing, and higher education—Prospects in a post-affirmative-action world. American Psychologist, 56, 302–318. Sala, F. (2002). Emotional competence inventory: Technical manual. Philadelphia, PA: McClelland Center For Research, HayGroup. Salovey, P., & Mayer, J. D. (1990). Emotional intelligence. Imagination, Cognition, and Personality, 9, 185–211. Salovey, P., Mayer, J., Goldman, S., Turvey, C., & Palfai, T. (1995). Emotional attention, clarity and repair: Exploring emotional intelligence using the Trait Meta-Mood Scale. In J. W. Pennebaker (Ed.), Emotion, disclosure, and health (pp. 125–154). Washington, DC: American Psychological Association. Scherer, K. R., Banse, R., & Wallbott, H. G. (2001). Emotion inferences from vocal expression correlate across languages and cultures. Journal of Cross-Cultural Psychology, 32, 76–92. Schlegel, K., Grandjean, D., & Scherer, K. R. (2013). Constructs of social and emotional effectiveness: Different labels, same content? Journal of Research Personality, 47, 249–253. Schmit, M. J. (2006). EI in the business world. In K. R. Murphy (Ed.), A critique of emotional intelligence: What are the problems and how can they be fixed? Mahwah, NJ: Lawrence Erlbaum Associates, Publishers. Schneider, R. J., Ackerman, P. L., & Kanfer, R. (1996). To “act wisely in human relations”: Exploring the dimensions of social competence. Personality and Individual Differences, 21, 469–481. Schulze, R., Wilhem, O., & Kyllonen, P. C. (2007). Approaches to the assessment of emotional intelligence. In G. Matthews, M. Zeidner, & R. R. Roberts (Eds.), The science of emotional intelligence—knowns and unknowns (pp. 199–229). New York, NY: Oxford University Press. Schutte, N. S., Malouff, J. M., Hall, L. E., Haggerty, D. J., Cooper, J. 
T., Golden, C. J., & Dornheim, L. (1998). Development and validation of a measure of emotional intelligence. Personality and Individual Differences, 25, 167–177. Semadar, A., Robins, G., & Ferris, G. R. (2006). Comparing the validity of multiple social effectiveness constructs in the prediction of managerial job performance. Journal of Organizational Behavior, 27, 443–461. Spector, P. E., & Johnson, H. M. (2006). Improving the definition, measurement, and application of emotional intelligence. In K. R. Murphy (Ed.), A critique of emotional intelligence: what are the problems and how can they be fixed? (pp. 325–344). Mahwah, NJ: Lawrence Erlbaum Associates, Publishers. Sternberg, R. J. (1988). The triarchic mind: A new theory of human intelligence. New York, NY: Penguin Books. Sternberg, R. J., Wagner, R. K., Williams, W. M., & Horvath, J. A. (1995). Testing common-sense. American Psychologist, 50, 912–927. Stricker, L. J., & Rock, D. A. (1990). Interpersonal competence, social intelligence, and general ability. Personality and Individual Differences, 11, 833–839. Tan, H. T., & Libby, R. (1997). Tacit managerial versus technical knowledge as determinants of audit expertise in the field. Journal of Accounting Research, 35, 97–113. Terpstra, D. E., Mohamed, A. A., & Kethley, R. B. (1999). An analysis of federal court cases involving nine selection devices. International Journal of Selection and Assessment, 7, 26–34. Tett, R. P., Fox, K. E., & Wang, A. (2005). Development and validation of a self-report measure of emotional intelligence as a multidimensional trait domain. Personality and Social Psychology Bulletin, 31, 859–888. Tett, R. P., Freund, K. A., Christiansen, N. D., Fox, K. E., & Coaster, J. (2012). Faking on self-report emotional intelligence and personality tests: Effects of faking opportunity, cognitive ability, and job type. Personality and Individual Differences, 52, 195–201.


Filip Lievens and David Chan Thoresen, C. J., Bradley, J. C., Bliese, P. D., & Thoresen, J. D. (2004). The Big Five personality traits and individual job performance growth trajectories in maintenance and transitional job stages. Journal of Applied Psychology, 89, 835–853. Thorndike, E. L. (1920). Intelligence and its uses. Harper’s Magazine, 140, 227–235. Van der Zee, K., Thijs, M., & Schakel, L. (2002). The relationship of emotional intelligence with academic intelligence and the Big Five. European Journal of Personality, 16, 103–125. Van Iddekinge, C. H., Raymark, P. H., & Roth, P. L. (2005). Assessing personality with a structured employment interview: Construct-related validity and susceptibility to response inflation. Journal of Applied Psychology, 90, 536–552. Van Rooy, D. L., Dilchert, S., Viswesvaran, C., & Ones, D. S. (2006). Multiplying intelligences: are general, emotional, and practical intelligences equal? In K. R. Murphy (Ed.), A critique of emotional intelligence: What are the problems and how can they be fixed? (pp. 235–262). Mahwah, NJ: Lawrence Erlbaum Associates, Publishers. Van Rooy, D. L., & Viswesvaran, C. (2004). Emotional intelligence: A meta-analytic investigation of predictive validity and nomological net. Journal of Vocational Behavior, 65, 71–95. Van Rooy, D. L., Viswesvaran, C., & Pluta, P. (2005). An evaluation of construct validity: What is this thing called emotional intelligence? Human Performance, 18, 445–462. Wagner, R. K. (1987). Tacit knowledge in everyday intelligent behavior. Journal of Personality and Social Psychology, 52, 1236–1247. Wagner, R. K., & Sternberg, R. J. (1985). Practical intelligence in real-world pursuits—the role of tacit knowledge. Journal of Personality and Social Psychology, 49, 436–458. Weis, S., & Süss, H M. (2005). Social intelligence—a review and critical discussion of measurement concepts. In R. Schulze & R. D. Roberts (Eds.), Emotional intelligence—an international handbook (pp. 203–230). Cambridge: Hogrefe & Huber Publishing. Wilhelm, O. (2005). Measures of emotional intelligence: Practice and standards. In R. Schulze & R. D. Roberts (Eds.), Emotional intelligence—an international handbook (pp. 131–154). Cambridge: Hogrefe & Huber Publishing. Witt, L. A., & Ferris, G. R. (2003). Social skill as moderator of the conscientiousness-performance relationship: Convergent results across four studies. Journal of Applied Psychology, 88, 809–820. Wong, C. M., Day, J. D., Maxwell, S. E., & Meara, N. M. (1995). A multitrait-multimethod study of academic and social intelligence in college-students. Journal of Educational Psychology, 87, 117–133. Wong, C. S., & Law, K. S. (2002). The effects of leader and follower emotional intelligence on performance and attitude: An exploratory study. Leadership Quarterly, 13, 243–274. Ybarra, O., Kross, E., & Sanchez-Burks, J. (2014). The “big idea” that is yet to be: Toward a more motivated, contextual, and dynamic model of emotional intelligence. Academy of Management Perspectives, 28, 93–107. Zeidner, M., Matthews, G., & Roberts, R. D. (2004). Emotional intelligence in the workplace: A critical review. Applied Psychology-an International Review, 53, 371–399.

364

Part IV
DECISIONS IN DEVELOPING, SELECTING, USING, AND EVALUATING PREDICTORS
ANN MARIE RYAN AND NEAL SCHMITT, SECTION EDITORS

16
DECISIONS IN THE OPERATIONAL USE OF EMPLOYEE SELECTION PROCEDURES
Choosing, Evaluating, and Administering Assessment Tools
NANCY T. TIPPINS, EMILY C. SOLBERG, AND NEHA SINGLA

INTRODUCTION Most organizations have strategic goals that determine the kind of selection program they need. An organization that achieves its competitive advantage by delivering goods to consumers with a high level of service may focus on how its selection program can identify the best service workers. In contrast, another organization that pursues a goal of low-margin, high-volume sales of goods may be less concerned about a high level of service skills among employees and instead value a low-cost program for identifying employees with minimal skills to do the job efficiently. These overall objectives for the selection program, in turn, determine the goals for a specific assessment tool or test. An organization whose selection program goals relate to high levels of job performance is likely to set goals for the tests it uses related to validity and reliability and may attend less to their costs. There can be many selection program goals. Some examples include enhancing employee productivity, minimizing error, reducing accidents, complying with Equal Employment Opportunity (EEO) regulations, minimizing staffing costs, and supporting the employment brand. Once the organization’s strategic goals are understood and the selection program goals defined, the organization can begin the process of determining the characteristics of the assessment tools constituting a selection program that meets those goals. Just as there are many goals for a selection program, there can also be a number of test goals that typically fall into two broad classes: test characteristics and administrative goals. Test characteristics that often influence the choice of instruments include the validity and reliability that are typically found for a type of test used in the applicant population of interest and the estimated utility. Some organizations consider the appropriateness and feasibility of different validation strategies for a test in the context of their organization. Many organizations pay a great deal of attention to the potential group differences in test scores and adverse impact, and they search for alternative selection procedures that might have equal or greater validity and less adverse impact than another instrument. Some carefully review the past history of the test or type of test in litigation and attempt to avoid assessment tools that will be difficult to defend. Often, applicant reactions are also an important concern. Administrative goals relate to issues around test administration and scoring. Organizations must consider how to measure the important job relevant constructs given their staffing


Nancy T. Tippins et al. environments, financial and personnel resources, and time constraints. For many organizations, significant concerns about costs, including the costs of personnel with the necessary skills to administer and score the test in every location necessary, the costs of equipment needed for test administration, scoring, and data storage, the costs of test purchase, the costs of test development and maintenance, and the costs of validation, arise. In some environments, the availability of personnel and equipment cannot be assumed. In other environments, time concerns are very important. Organizations must consider the time requirements for development and validation as well as the amount of time necessary for test administration and the length of time between test completion and the availability of results. A few test goals blend the two categories. The feasibility of the use of a test with particular characteristics within a specific staffing context is particularly important for some employers. For example, some organizations want to use tests that can be administered in an unproctored environment. Other employers, particularly those in public services such as police officers and firefighters, use tests for only one round of hiring to lessen the amount of information sharing that occurs over multiple administrations. In addition, organizations must decide how many of the critical knowledge, skills, abilities, and other personal characteristics (KSAOs) should be measured to adequately cover the domain of job performance and assure an acceptable level of accuracy in prediction, as well as how many constructs it can afford to measure. Other test goals that blend the two categories include the question of how test scores will be used. Typically, the employer must decide what form of test score to use (e.g., raw scores, scaled scores, percentiles), the combination and weighting of test scores, and the type of guidance to provide to hiring managers (e.g., cutoff scores, bands and expectancy tables, if used). Sometimes, these test goals conflict. Choices made with the objective of identifying the best applicants may not be the same as those made to minimize costs. Furthermore, different constituencies in the same organization may have different goals. While the department receiving new employees may want tests that result in the best prediction of future performance, the staffing organization may want to minimize administration costs, and the legal team may want to avoid challenges to the selection process. Occasionally, the source of budget may determine which group’s point of view prevails. For example, if the department needing the workers is funding test development and validation and the Human Resource department bears the cost of administration, costs for development may have different limitations than ongoing administrative costs. Optimization of all goals is often not possible, and organizations must usually make some trade-offs. For example, an employer that wants the most accurate predictor using the lowest cost instrument that takes the least amount of time is unlikely to achieve all three goals and will need to find the right compromises for the organization. Many decisions about selection programs have ramifications for other decisions, and it is important to note that none of the decisions to be made should be considered independently from the others. 
For example, a requirement to use a single, very short test in order to minimize the time test takers spend on assessment will have an effect on the validity and reliability of the selection program. Similarly, a desire for positive applicant reactions could preclude a lengthy testing process composed of abstract measures of problem solving and suggest a choice of instruments that are more face valid. In such cases, the hiring organization must decide which goal is more important, because both cannot be maximized. Often, decisions that have already been made will need to be revisited as new decisions are made and additional criteria are considered. In addition, there is no defined sequence to the decision-making process. Each employer tackles the problem of determining what test to use in its own way. Although most testing experts would recommend determining the organization’s strategic goals first, then the selection program’s goals, and then the goals for specific tests as the most efficient process, many experienced professionals have reversed the order and deduced the strategic and selection program goals through discussion of the test goals. Research is needed on how goal choices are made and what goal hierarchies exist in relation to selection programs. Figure 16.1 displays some examples of how organizational goals, selection program goals, and test goals can be related to each other.

The remainder of the chapter reviews basic decisions about employment tests and discusses the considerations that influence those decisions. Specifically, it reviews five sets of decisions that must be made when developing and implementing a selection system: (1) What constructs should be measured in the selection system? (2) How should the chosen constructs be measured? (3) How should the validity of assessments be evaluated? (4) How should the test be administered? (5) How should the resulting scores be used?

Organizational Goals → Selection Program Goals → Test Goals
Improving public image → Supporting the employer brand → Selecting/developing a testing program that portrays the organization’s unique brand
Differentiating from competitors through employees → Hiring top-performing employees → Selecting/developing tests with the highest validity
Minimizing legal risks → Minimizing subgroup differences → Selecting/developing tests with small to no subgroup differences
Minimizing costs and increasing margins → Minimizing staffing costs → Selecting tests that are least expensive to implement and administer

FIGURE 16.1  Organizational Goals, Selection Program Goals, and Test Goals

WHAT CONSTRUCTS SHOULD BE MEASURED? One of the initial decisions to be made when developing and selecting assessment tools concerns the constructs that should be measured in the selection program and by individual tests constituting the selection program (see Chapters 11–15 in this volume for more information on the measurement of specific constructs). The test user must determine both which constructs to measure and how many to measure. Additionally, the test user must consider the advantages and disadvantages of measuring a single KSAO or broader subset of the entire content domain.

Importance and Needed at Entry According to legal and professional guidelines, such as the Uniform Guidelines on Employee Selection Procedures (Uniform Guidelines; Equal Employment Opportunity Commission, 1978), the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014), and the Principles for the Validation and Use of Personnel Selection Procedures (SIOP Principles; SIOP, 2003), tests should measure KSAOs that are job relevant and necessary at entry to the job. Often, job relevancy is operationalized as those KSAOs that are identified as important and required to perform important tasks by subject matter experts who are typically also asked to rate the extent to which a KSAO is needed at entry into the job. In choosing or developing appropriate tests, the test user should generally avoid a test that measures an appropriate construct but requires one or more nonrelevant KSAOs to complete it. If a measure of manual dexterity requires the applicant to read detailed instructions and reading at the level of the test instructions is not a job requirement, then the manual dexterity test is not likely to be an accurate predictor of performance in the job, because such a test will confound the candidate’s manual dexterity with his/her reading proficiency. Steps must be taken to communicate the test directions in another form if reading is not also required for the job. When the applicant has a disability and meets the requirements for protection under disability laws in the U.S. (e.g., the Americans with Disabilities Act [ADA] and the ADA Amendments Act


Nancy T. Tippins et al. of 2008), the need for skills that are not directly relevant to the job in test administration in the testing situation can be particularly important. For example, if an applicant for a job requiring manual dexterity skills and minimal reading skills beyond the initial training period has a visual disability that makes reading printed instructions impossible but is not so severe as to limit the manipulations performed on the job, the employer should find an alternative method for evaluating manual dexterity that does not require reading. Such issues highlight the importance of making a clear distinction between the KSAOs to be measured and the KSAOs required to take the test when deciding the type of test to use as well as considering appropriate accommodations for candidates with qualified disabilities. It is important to note that not all employment tests are designed to predict job performance. Frequently, employers want to know how likely an applicant is to turnover or be absent or to exhibit organizational citizenship behaviors or counterproductive work behaviors. (See Chapters 20–24, in this volume for a discussion of criterion constructs in employee selection.) In such cases, a test may not be directly related to a KSAO required to perform an important job task but is nevertheless a predictor of an outcome that is important to the organization. Regardless of the criterion, it is incumbent upon the test user to demonstrate the relationship between the test and the criterion of interest.

Feasibility of Measuring the Construct Some important constructs can be notoriously difficult to measure for a variety of reasons, such as lack of a clear definition, psychological and/or statistical multidimensionality of the construct, and subjectivity of scoring (Shute & Wang, 2016). For example, highly predictive measures of an individual’s creativity are difficult to find or develop. As a result, a key factor in deciding which constructs to measure will be the extent to which the constructs can be assessed validly and reliably. In addition, there are usually organizational constraints (e.g., budget, staffing context) that limit the feasibility of assessing some constructs. For example, if an employer has no employment offices and only administers computer-based tests in an unproctored environment, then a direct measure of oral communication skills or physical abilities would not be possible. Similarly, if an employer plans to test a large volume of candidates, then a test designed to assess each candidate’s physical strength may not be a feasible option unless some screening to narrow the applicant pool is done first. Additional details regarding organizational constraints are also covered in the Administrative Concerns section later in this chapter.

Number of KSAOs to Measure A job analysis often results in many more KSAOs that are important and required at entry for a job than are feasible to measure (see Chapter 6 in this volume for more information regarding work analysis), but there is no clear guidance regarding the degree to which the job content domain should be covered in the selection program. While all would argue that the KSAOs measured in an employment test must be important, few would suggest that all important and needed-at-entry KSAOs should be measured. The SIOP Principles (2003) indicate that measurement of all the important KSAOs is not necessary: “Not every element of the work domain needs to be assessed. Rather, a sample of the work behaviors, activities, and worker KSAOs can provide a good estimate of the predicted work performance” (p. 24). In contrast, Goldstein, Zedeck, and Schneider (1993) suggested, as a guide, measuring KSAOs that are linked to at least 50% of the tasks (in a content-oriented validity study) and viewed measuring only 10–20% of a job as problematic. However, they also acknowledged that measuring KSAOs linked to 50% of the tasks might not be possible in some cases.

In the U.S., when a selection practice is challenged under Title VII of the Civil Rights Act of 1964 as amended in 1991, the user of a test that is supported on the basis of evidence from a content-oriented validation study may need to defend how much of the job content area is measured by the test. In such litigation, court opinions have varied on what constitutes sufficient job content representation. Thus, in practice, it is not clear how many KSAOs should be measured or how much of the job content domain should be measured.

Perhaps the most important considerations in determining the number of KSAOs to measure are the criticality of each KSAO relative to job performance and the extent to which one KSAO may compensate for another. When two or more KSAOs are critical for job performance, all may need to be measured. For example, in jobs that are physically demanding and require cognitive skills, incumbents need both sets of skills. A cable splicer may need the cognitive skills associated with splicing cables as well as the physical abilities associated with ascending and descending utility poles. One cannot do the job if he/she cannot climb the pole or cannot splice a cable correctly. In some jobs, one skill compensates for another. A customer service job may require both interpersonal skills and problem-solving skills; in some cases, a lower level of problem-solving skill may be compensated for by higher interpersonal skills and vice versa, although a minimum level of each may be required.

In many situations, the number of KSAOs is expanded to measure the broader job because multiple criteria are valued. For example, employers that are concerned about job performance and prosocial behaviors may measure KSAOs related to both criteria. In the U.S., where litigation concerns prevail, another rationale for increasing the number of KSAOs measured is rooted in the hope of minimizing subgroup differences, which open the door to legal challenges. For example, an employer might add a reading test to a math test (assuming reading is a job-relevant KSAO) if the historical mean group differences on the tests indicate that women do better on reading and men do better on math, even when the reading test does little to improve the level of prediction.

Another way to determine the number of tests that should be used is to evaluate the incremental validity of each test when criterion-oriented validity data are available. However, it is worth noting that test users often find little quantitative support for multiple predictors in a criterion-related validation study beyond the first few. When a content-oriented validation strategy has been used, incremental validity data are not available, and the test user can only rely on data regarding the KSAOs that are important, needed at entry, and linked to one or more critical tasks. As noted in the SIOP Principles, “The sufficiency of the match between (the) selection procedure and work domain is a matter of professional judgment based on evidence collected in the validation effort” (SIOP Principles, 2003, p. 25).

As a final cautionary note, practitioners often argue that a test measuring a single, important KSAO can be demonstrated to be job-related and a business necessity by virtue of the results of the job analysis and a criterion-oriented validity study. Employers that strive to minimize costs may limit the number of KSAOs measured and focus only on those tests that have the greatest payoff in terms of prediction.
However, when large mean subgroup differences exist for the selected tests, this approach can be considered risky as regulatory agencies and courts may question the decision to use a test that measures a single, albeit important KSAO, or only a few important KSAOs, based on the rationale that even strong criterion-oriented validity coefficients do not explain a great deal of the variance in performance.
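Where criterion-oriented data are available, the incremental validity evaluation described above is straightforward to compute. The sketch below is a minimal illustration using simulated data; the variable names (a cognitive test and a personality scale) and the effect sizes are hypothetical and are not taken from this chapter.

```python
# Minimal sketch of an incremental validity check: compare the variance in a
# criterion explained by one predictor alone with the variance explained when
# a second predictor is added. All data below are simulated for illustration.
import numpy as np

def r_squared(predictors, criterion):
    """R-squared from an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(criterion)), predictors])  # add intercept
    beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)        # OLS estimates
    residuals = criterion - X @ beta
    return 1 - residuals.var() / criterion.var()

rng = np.random.default_rng(0)
n = 300
cognitive = rng.normal(size=n)       # hypothetical predictor 1 (cognitive test)
personality = rng.normal(size=n)     # hypothetical predictor 2 (personality scale)
performance = 0.5 * cognitive + 0.2 * personality + rng.normal(size=n)

r2_first = r_squared(cognitive.reshape(-1, 1), performance)
r2_both = r_squared(np.column_stack([cognitive, personality]), performance)
print(f"R-squared, predictor 1 only: {r2_first:.3f}")
print(f"R-squared, both predictors : {r2_both:.3f}")
print(f"Incremental R-squared      : {r2_both - r2_first:.3f}")
```

In practice, the size of the increment would be weighed against the cost of the additional test and its effect on subgroup differences, not judged on variance explained alone.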

Relationship Between the Goals of the Organization and the Number of KSAOs Measured As noted above, all constructs measured in an employment test must be important and required at entry; however, organizations have differing views on the number of KSAOs to measure, and these views are related to their goals for the selection program (see Chapter 10 in this volume for more information regarding employee selection and organizational strategy). Many organizations attempt to balance their need for accurate evaluation of candidates’ skills with cost-effective staffing procedures and legal compliance. Organizations that are focused primarily on the cost-effectiveness of their selection programs will pay a great deal of attention to the number of constructs measured and their utility, choosing to use only those that contribute substantially to the prediction of performance (or other criteria), and they manage costs related to test development, validation, and administration partly by using fewer measures. On the other hand, organizations that are more concerned about the legal defensibility of a selection process may be more likely to include tests that provide broader coverage of the domain of critical KSAOs.

Breadth of KSAOs In addition to determining the number of KSAOs to measure, the organization must also define the breadth of the critical KSAOs to be measured. Some test users will choose to measure a narrow, homogeneous construct (e.g., addition), whereas others will measure a broader combination of constructs (e.g., math, including addition, subtraction, multiplication, division, fractions, decimals). A test that measures a unidimensional variable has questions that require similar thought processes and result in similar types of answers. Tests that evaluate a multidimensional construct may involve several different processes and contain questions that elicit different kinds of answers. Different types of items (e.g., math word problems and word analogies) can be found in the same, multidimensional test that measures “mental ability.” Some multidimensional tests such as a problem-solving test may measure a single construct that has multiple components. For example, a test user might employ a business case to determine how well job candidates solve problems that require the collection of data from multiple sources and the analysis of quantitative and qualitative data.

HOW SHOULD THE CONSTRUCTS BE MEASURED? Once the constructs to be measured are identified, the test user must determine the best way to measure them. There are multiple ways to measure most content areas. For example, job knowledge might be assessed through a multiple-choice test of cognitive abilities, work samples and simulations, or interviews. Each measurement approach has its advantages and disadvantages that are relative to the population for which the test is being used. A test format that is acceptable for selecting applicants into an entry-level position in a fast-food restaurant may not be acceptable to executives seeking promotion in their own company. Some of the criteria that should be considered when determining their measurement options are discussed in the following sections.

Timing One of the primary factors that organizations have to consider when choosing a selection process is the time that will be needed to implement it. Once a need for a new selection process is uncovered, many organizations are impatient to implement the new process. Some organizations lack an existing selection process and need to rapidly develop and deploy one to meet the staffing requirements associated with their strategic direction. Others have detected some problem with their existing program and are anxious to replace the current selection process, which is flawed. Only a few organizations seem to take a continuous improvement approach to employee selection and develop and validate a new selection procedure when the existing one is working well. The immediate need for a selection process may guide an organization toward off-the-shelf tests and/or tests that can be validated quickly. Rather than creating its own test, an employer may eschew the development process and choose an off-the-shelf test that is ready for a validation study. Some employers will gravitate to a test and rely on a publisher’s generalizability study or choose a test that can be validated using a content-oriented validity strategy relatively quickly.

For example, an organization may choose an off-the-shelf test for which a large-scale meta-analysis has been conducted and implement the test based on evidence from other validity studies while planning a local criterion-related validation study based on applicant data to be collected in the coming months. Another may choose a work sample test that can be validated using a content-oriented strategy to avoid the time and costs of a local criterion-oriented validation study.

Group Differences in Test Score Means and Adverse Impact in Selection Decisions Many organizations embrace a diverse workforce because it contributes to the achievement of their strategic objectives, and in the U.S., they want to avoid unnecessarily eliminating members of protected groups. Thus, these organizations consider the available evidence of differences in score means among subgroups of interest when choosing a test. Although mean score differences are not the same as adverse impact calculations, they are often related. Adverse impact may be assessed in several ways, ranging from four-fifths ratios to statistical tests of significant differences between pass rates. Regardless of how adverse impact is assessed, organizations must decide whether to avoid, reduce, or eliminate it through their choice of tests, decisions on cutoff scores, or some other approach such as alternate recruitment strategies. For tests of some constructs, group mean differences cannot be easily eliminated, and the organization must prioritize its goals and decide if the construct should be measured at all.
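As a simple illustration of the calculations referred to above, the following sketch computes group selection rates, the four-fifths ratio, and a two-proportion z-test of the difference in pass rates; the applicant counts are hypothetical.

```python
# Illustrative adverse impact check on hypothetical applicant counts:
# the four-fifths (80%) ratio and a two-proportion z-test on pass rates.
from math import sqrt
from statistics import NormalDist

def adverse_impact(pass_a, total_a, pass_b, total_b):
    """Group A = focal group of interest, Group B = comparison group."""
    rate_a, rate_b = pass_a / total_a, pass_b / total_b
    impact_ratio = rate_a / rate_b                      # compare to the 0.80 rule of thumb
    pooled = (pass_a + pass_b) / (total_a + total_b)    # pooled pass rate for the z-test
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return impact_ratio, z, p_two_sided

ratio, z, p = adverse_impact(pass_a=40, total_a=100, pass_b=60, total_b=100)
print(f"Selection rate ratio: {ratio:.2f} (below 0.80 suggests adverse impact)")
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

Note that the four-fifths rule of thumb and a significance test can disagree, particularly in very small or very large samples, which is one reason practitioners typically examine both.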

Consideration of Alternatives In the U.S., employers are required to search for alternative selection procedures that have equal or greater validity and less adverse impact: Where two or more selection procedures are available which serve the user’s legitimate interest in efficient and trustworthy workmanship, and which are substantially equally valid for a given purpose, the user should use the procedure which has been demonstrated to have the lesser adverse impact. (Uniform Guidelines, Section 3B)

Although the Uniform Guidelines do not clearly outline the steps that should be taken to comply with the alternatives provision, practitioners often consider a variety of options, such as different measures of the same construct, measures of different constructs, different ways of combining measures, different methods of making selection decisions (e.g., setting pass/fail cutoff scores, banding scores, using top-down selection), and different cutoff scores. Practical issues make the comparison of alternatives difficult. A comparison of the validity and adverse impact of two alternatives often assumes the availability of criterion-oriented validity data and adverse impact data for both instruments. However, such data are not always available. For example, validity data may have been collected for different jobs or with different criteria (e.g., job performance, organizational citizenship behaviors). Or, adverse impact statistics may have been collected on an applicant population in one situation and on the incumbent sample that participated in a criterion-related validity study in another. Moreover, contentoriented validity strategies do not yield validity coefficients, and many content-oriented validity studies do not produce estimates of adverse impact unless the tests have been administered in a pilot study. Question 51 of the Uniform Guidelines indicates that the strength of validity evidence is determined by the proportion of critical job behaviors and the extent to which the test resembles work behaviors. An additional complication is the lack of operational specificity in the Guidelines about what qualifies as “substantially equally valid” or “lesser impact.” Thus, even if two tests have comparable validity coefficients and adverse impact data available, there is no commonly accepted method or standard for determining how much of a difference in validity coefficients or adverse impact findings is sufficient to warrant the use of one test over the other; instead, the decision must be based on the test user’s professional judgment. These

difficulties with the process of identifying alternatives with equal or greater validity and lesser impact may explain why, in practice, some test users do not make thorough reviews of alternatives, as pointed out by Guion (1998).

Reviews of a wide array of alternative selection procedures may be limited in some circumstances. In practice, some organizations choose a consulting firm to develop and validate a selection program, knowing that the tests and alternatives considered will be limited to the firm’s own proprietary tests. Theoretically, at least, the organization may have considered a broader set of tests in the process of selecting a consulting firm, but this process of evaluating the pros and cons of different test publishers is rarely documented. If a human resources professional or operating manager is choosing the tests without the assistance of a qualified testing professional, he or she may not be aware of the need to consider alternatives and may lack a sufficient understanding of the concepts of validity and adverse impact to make nuanced judgments about different tests (Murphy & Davidshofer, 1988). Furthermore, decision makers lacking a testing background may be easily swayed by test advertisements that broadly claim the “validity” of a test, promise no adverse impact, indicate approval by a governmental regulator or professional association, or extol the ease of administration. Test users who fail to understand that validity relates to the inferences made from test scores for a particular job and is not a characteristic of a test, or who are not familiar with different ways of calculating adverse impact or the effect of different applicant populations on adverse impact, often have difficulty evaluating alternative procedures.

Consideration of Past Legal and Administrative Challenges Another factor that is often considered in the U.S. is a test’s history of legal and administrative challenges. Although theoretical information and research findings supporting a test that evaluates a required KSAO, as well as the administrative requirements of the test, are primary drivers of test choice, the outcome of previous litigation or grievance and arbitration procedures involving the test user’s own organization or others’ can inform the test user about the potential liabilities associated with a particular test (see Chapters 26–30, in this volume, for a more detailed account of legal issues related to employee selection). Several characteristics of tests seem to increase the likelihood of some form of legal scrutiny. Tests that generally produce large group differences (e.g., multiple-choice measures of cognitive ability) are more often challenged than are those with smaller group differences (all other factors being equal). Certain ways in which a test is used also tend to attract more attention. For example, high-volume selection programs seem to be challenged more frequently than those tests used for smaller numbers of applicants. A test used for selection into an entry-level position in a large retail firm appears more likely to be challenged than one used for the selection of executives in a small professional services organization. Tests used for promotions or upgrades and transfers in a unionized setting are often the catalyst for a grievance. Some test practices also appear to be lightning rods for challenges. Test users often avoid selecting a personality test that uses a lie scale or a social desirability scale for promotions because of the implications of dishonesty on the part of the candidate and the test takers’ negative reaction to these scales. Indeed, many organizations with a represented labor force will avoid any selection tool that does not have universally right and wrong answers. Additionally, some test formats are challenged less frequently than others. For example, structured or unstructured interviews are less frequently reviewed than multiple-choice tests in general, possibly because of the ubiquity of the use of interviews. Similarly, high-fidelity work samples that are obviously similar to the job for which they are used seem to be challenged less frequently than more abstract tests of basic skills. The type of test used in combination with the feasibility of an appropriate validity study may also have legal implications for the test user. When a criterion-oriented validity study is not feasible because of the size of the available sample, test users may gravitate to a test that can be appropriately validated using a content-oriented strategy under the Uniform Guidelines. Fewer questions may be raised about a work sample test that is validated using a content-oriented strategy than a personality inventory or a more abstract measure of cognitive ability that is validated

using a content-oriented strategy. Thus, when a criterion-oriented validation strategy is not possible, some organizations will consider only those assessment tools for which a content-oriented strategy will provide compelling evidence of their relevance and validity. Nevertheless, some organizations believe a criterion-related study produces better evidence of validity in terms of legal defensibility and choose instruments for which there is a history of successful criterion-related studies.

Although past challenges may influence test choice, test users often do not have access to detailed information about these challenges in a timely manner. The decisions in many court cases can be a long time coming, and grievance information is not often shared outside of a particular organization or labor organization. By the time the test user has had an opportunity to access publicly available information about testing litigation, the test in question may be out of date or there may be new and better options.

Administrative Concerns For many organizations, a major concern is whether or not the test can be administered appropriately in their staffing environment (see Chapter 8, in this volume, for additional information on administrative concerns associated with employee testing). Frequently, an organization’s staffing process and resources will dictate the choice of a particular test. For example, a multiple-choice test of cognitive ability with a fixed number of items that must be proctored is not practical in an organization that processes applications via the Internet and lacks the facilities and staff available to administer tests in proctored settings. Similarly, a lengthy assessment process that requires one-on-one assessors for scoring is not practical for high-volume jobs that have substantial amounts of turnover. Some common administrative concerns are discussed below. Administrative Personnel  The test user must consider whether the personnel required to administer and score the test are available and affordable. Some tests require an administrator who facilitates the test administration by reading instructions, distributing and collecting the testing materials, timing the test, scoring the test by comparing answers to a scoring template, etc. In other situations, the administrator’s role may be more complicated and require more complex skills and training, such as setting up equipment, serving as a role player, or making judgments about a candidate’s performance. For example, a work sample test measuring knowledge and skill in welding metal parts may require that a certified welding expert score the sample. When such an expert is not available, a work sample may not be feasible. Cost of administration personnel is another related factor that is critical. Even if personnel with the prerequisite skills are available to administer tests, an organization may find the cost of using many of these employees prohibitive or at least greater than the return on the investment warrants. For example, an organization using a structured interview as a first screen for a high-volume, high-turnover position may well find the cost of the interviewer exceeds the value of the interview as a screening tool. Some organizations make the mistake of failing to consider all of the associated costs of test administration. In addition to time spent administering and scoring tests, many tests require administrative personnel to be trained, retrained, calibrated, and monitored. For example, many structured interview programs include extensive training to ensure that all interviewers, regardless of location, are administering the interview properly and using the behavioral anchors in the same manner for scoring. Some of these companies also offer “refresher” training to reinforce the standards for the interview. Some organizations bear the additional cost of monitoring test administrators and scorers to promote strict adherence to administration rules, testing policies and procedures, data recording requirements, etc. In the past 20 years, many organizations have abandoned proctored testing in favor of unproctored Internet testing (UIT). Although most testing professionals who have adopted UIT agree that one of the primary advantages is the reduction in administration costs (e.g., Tippins, Beatty, Drasgow, & Pearlman, 2006), most also agree that administration costs are not eliminated entirely, and significant costs related to IT personnel remain. Thus, the costs of


Nancy T. Tippins et al. developing or buying a computer-based test administration system, maintaining it, and monitoring it must be factored into the testing decision. Although many organizations purchase these services from an outside vendor, these costs are embedded into the cost of the test. Testing Facilities and Equipment  In addition to adequate personnel to administer selection tests, equipment and facilities appropriate for the tests are also necessary. Some tests rely on paper copies of test materials and pencils; others are computer-based or require telephones or video equipment. Today, many computer-based tests require an Internet connection. Although most in the U.S. assume the availability of reliable electric power and an Internet connection, that assumption may not be true in all parts of the world, and the test user may need to take extraordinary efforts to find appropriate facilities. Mobile testing can also present challenges related to Internet connections. Work samples can involve almost any equipment that might be used on the job, ranging from simple machinery to complex electronics. One of the more expensive forms of testing in terms of the facilities required is an assessment center that requires space for participants to work independently, assessors to privately score exercises and/or conduct interviews, as well as additional space for group exercises with other participants at the center. When administration costs are a factor in test choice, the test user must consider not only the initial cost of the facilities and equipment but also the cost for ongoing maintenance and upgrades. For example, an assessment center facility has to be acquired and maintained through the life of the assessment process. Video-based assessments containing pictures of individuals can become quickly outdated because of clothing or hairstyles. Technology-enabled assessments must be implemented with the necessary equipment and facilities, but many programs will be upgraded, requiring additional, newer, or more powerful equipment and the places to put it. In addition, user interfaces may need to be overhauled to achieve a “modern” look even when the existing interface is functional. Physical abilities testing can also require expensive facilities or equipment. High-fidelity physical abilities tests may include equipment such as that found on the job (e.g., ladders, utility poles, stretchers, stair chairs), which must be maintained. Measures of physical capability may include abstract kinds of physical ability tests that use equipment such as tensiometers, which can be expensive to buy and maintain because they must be recalibrated regularly to ensure accurate ratings. Although cost may be the overriding concern for many organizations when considering testing facilities, the mobility of facilities and equipment can also be important in certain situations. When large numbers of applicants must be tested in multiple places, movement of bulky equipment (e.g., computers or equipment for work samples) may not be feasible, or if feasible, it may not be cost effective to purchase redundant equipment or move it around. Similarly, facilities for complex physical abilities tests or assessment centers may be expensive and time-consuming to replicate in multiple locations. Proctored and Unproctored Internet Testing  An issue that has continued to gain more attention from test users is the question of whether to use proctored or unproctored Internet testing. 
The advantages and disadvantages of UIT are well-documented (Tippins et al., 2006). In a nutshell, advantages usually include lower administration costs, standardized instructions, faster time to testing, and broader applicant pools. Disadvantages frequently cited include nonstandardized test environments that have the potential to affect test scores, cheating, including the inability to identify the test taker, and threats to the security of test materials. In choosing to use a proctored or unproctored test, the test user must weigh the pros and cons to determine what works best for the organization. Time Requirements for Test Administration  The time to administer a test has two important implications for staffing programs: (1) the cost of administration and (2) the impact on applicant flow. The time spent testing is related to the costs of the personnel required for administration and the use of facilities and equipment and can be driven by several factors. In proctored testing or testing that requires an assessor or interviewer to participate in the test administration, the longer the test, the greater the personnel cost. A biodata form that takes 30 minutes to complete will cost less in terms of administration time than will a one-hour structured interview


Operational Use of Employee Selection Procedures or a three-hour business case. Tests that have high reading demands can increase the administration time and the costs. For example, situational judgment inventories (SJIs) are often more time-intensive than other forms of tests because of the reading load. There are often trade-offs with respect to test time. Both high- and low-fidelity simulations often convey a great deal of information about the job and organization. Although the testing time may be somewhat longer than for more abstract tests, there are additional benefits in the form of information about the job, work environment, and company associated with the additional time. Another concern about lengthy tests is their impact on applicant flow. A common perception is that applicants have a low tolerance for lengthy evaluations, especially when they are administered in unproctored Internet settings, although recent literature indicates this may not be a correct assumption (Speer, King, & Grossenbacher, 2016). Recruiters often point to the abandonment rate on UITs; however, there may be multiple reasons why a candidate chooses to stop testing. Some applicants may be exploring jobs and not have a sincere interest in the one for which the test is required; some may begin the test and realize the job has requirements that do not match their skills; and some may be distracted and leave the test. In addition, lengthy applications confound the applicant’s perception of the amount of time spent testing. Abandonment may be the result of the test in addition to the application process and not the test alone. Even when the applicant is asked to take a test at an employer’s site, the amount of time spent testing can be a deterrent to maintaining the applicant’s interest in employment. Employed applicants may be particularly reluctant to invest significant amounts of time in face-to-face testing in another firm. The number of tests administered can also pose a challenge to keeping an applicant engaged. When the selection program is based on multiple hurdles and the applicant is asked to return multiple times to take tests in the sequence, the problem of keeping the applicant engaged is exacerbated.

Consequences of Poor Hiring Decisions Another source of costs related to selection programs comes from hiring an applicant without the necessary skills. When the consequences of hiring someone who does not have the necessary skills are severe (e.g., an error caused by an employee without the requisite skills leads to injury, death, or widespread property damage, or the cost of training a replacement for inadequate employees is high), accurate prediction of future job performance is critical. Thus, test users often look for a selection procedure that covers more of the KSAOs required for the job or measures them in a more reliable and valid manner. For example, if extensive on-the-job training is necessary, the cost of training a replacement for an unsuccessful employee could justify the cost of a more elaborate hiring system. Similarly, a more comprehensive selection system might be chosen when hiring individuals for highly critical positions with no or minimal margin for error, such as flying commercial airplanes or operating heavy machinery. In contrast, an organization might choose other, less extensive selection instruments when the repercussions of an error are relatively minor or when the cost of training a replacement employee is minimal.

Organization Reactions Many experienced testing professionals who have been in a position to review and select the appropriate assessment tools to implement within their organizations have learned that the organizations for which they work often have strong likes and dislikes for various types of tests. As noted above, many of these preferences are related to the goals of the organization. Some of the preferences that are expressed by members of an organization collectively are related to the image they want to project to applicants. For example, some organizations promote the idea that anyone can perform any job with some hard work and a little coaching. Thus, tests that measure relatively immutable traits (e.g., personality tests) or are based on past experiences (e.g., biodata) and that evaluate skills that generally cannot be developed are not acceptable. Instead,


Nancy T. Tippins et al. tests measuring skills and abilities (e.g., skills tests, achievement tests) or knowledge that can be acquired are preferred. Some organizations espouse the idea that selection should be based only on skills related to performance and that measures of how that work is accomplished are not appropriate. In such situations, the organization might use tests that only relate to performance outcomes and not to the way the outcomes are achieved. For example, a sole job knowledge test might be used instead of a job knowledge test in combination with a measure of interpersonal skills. Some organizations promote the idea of selecting the best by hiring individuals who have proven their skills by graduating from top schools. Consequently, asking individuals to demonstrate their mental prowess through measures of cognitive ability is anathema. Many organizations have a decided preference for face valid tests in hopes of avoiding the challenge of explaining the relevance of a less face valid test. So, instead of using a personality inventory to gauge interpersonal skills, an organization might use a customer service simulation. While these organizations are likely to select work samples and simulations, other organizations want to assure test takers of the company’s objectivity in selection and choose only instruments that involve no human judgment. These organizations might avoid the simulations that must be evaluated by an individual and rely instead on objectively scored multiple-choice tests.

Applicant Reactions Sackett and Lievens (2008, p. 439) characterized the lack of evidence for a relationship between applicant reactions and individual or organizational outcomes as “the Achilles heel of this field.” Nonetheless, many believe that applicants’ reactions to the testing experience can have important implications for organizations, such as influencing the applicants’ intention to remain in the selection and hiring process and accept job offers, affecting their attitude if hired, increasing the possibility of legal action if the selection process is deemed inappropriate, and increasing the likelihood of sharing their negative experience with the organization to others (Bauer, McCarthy, Anderson, Truxillo, & Salgado, 2012). Applicant reactions have become increasingly important to organizations in recent years, with many organizations focusing their efforts on creating and promoting an employer brand. As part of this effort, employers are looking for tests that are shorter, more modern looking, or more entertaining (e.g., simulations, games). Test users must be aware, however, that shorter tests often have lower validity than their longer counterparts, and more entertaining tests often cost a great deal more to develop or implement and may not be any more valid than less entertaining tests. Several research studies have investigated the role of various factors, such as test type, administration format, procedural characteristics, and personal variables, on applicant reactions to the testing event (see Gilliland, 1993, for a theoretical model of applicant reactions). In general, research indicates that tests are perceived more positively when the relationship between the content of the test and the duties of the job is clear to the applicants. However, as is the case with other test selection criteria, an applicant’s reaction is not the only factor in deciding which test to use. If a job requires cognitive ability, the finding that cognitive ability tests are not perceived as favorably by applicants as interviews and work samples may be irrelevant. More research is needed before concluding that applicants’ reactions to selection procedures actually predict applicant behaviors (e.g., withdrawal from the selection process, job acceptance, job performance). Nonetheless, applicants should be treated fairly and consistently because of legal, moral, and ethical constraints on the organization. Moreover, applicant reactions to the testing procedures are likely to result at least in more positive perceptions of the organization.

HOW SHOULD ASSESSMENT TOOLS BE EVALUATED? There are several ways to evaluate the validity of assessment tools and a number of factors that influence the choice of validation strategy, which are described in the following sections. In addition, professional guidelines such as the Standards and the Principles suggest that an accumulation of evidence of validity strengthens the support for the inferences made from a test score.


Information on the Validity of the Test Validity is a critical factor when selecting assessment tools (Chapters 2 through 4 of this volume contain more detailed discussions of validity.) When choosing test instruments, testing professionals often review past validity research to help identify which selection instruments will be useful to measure certain constructs. Data from past research can provide information regarding the validity of a particular test type in predicting various outcomes and the incremental validity of using various assessment types in conjunction with other forms of tests (see Schmidt & Hunter, 1998). Although innovation in testing processes can be helpful, it is often unwise to use a test for which the extant evidence provides little or no support for the kind of inference to be made (e.g., using a typing test to measure conscientiousness). A review of the validity evidence for a particular test use is sometimes overlooked in practice, particularly when individuals who do not have training in industrial-organizational (I-O) psychology choose the tests for the experimental battery (Rynes, Colbert, & Brown, 2002). There are times when untrained test users attempt to review validity evidence from publishers or the I-O literature, but they fail to understand technical issues well enough to make sound judgments. For example, a test user may believe a test is valid for a particular use because a study of the test indicates a seemingly large correlation between predictor and criterion, not grasping the importance of significance testing or effect sizes. Or, the test user may not understand the importance of cross-validation when items are selected based on their correlations with the criterion measure. In addition to the test user lacking the knowledge and skill necessary to understand the concept of validity and review related technical materials, the test user may fail to conduct a review of the literature because he/she does not know where to get information about the validity of a test or there is no information. A few test users may even discount the value of such information, instead relying on idiosyncratic beliefs about the constructs and tools that predict job performance and other outcomes of interest (e.g., turnover, absenteeism, integrity).
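To make the point about significance testing and effect sizes concrete, the brief sketch below computes an observed validity coefficient and a Fisher-z confidence interval; the test scores and performance ratings are hypothetical values chosen only for illustration.

```python
# Illustrative check of an observed validity coefficient: the Pearson
# correlation between test scores and a criterion, with a 95% confidence
# interval based on the Fisher z transformation. Scores are hypothetical.
from math import sqrt, log, exp
from statistics import NormalDist, mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fisher_ci(r, n, conf=0.95):
    z = 0.5 * log((1 + r) / (1 - r))                       # Fisher z transform
    se = 1 / sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + conf / 2)
    back = lambda v: (exp(2 * v) - 1) / (exp(2 * v) + 1)   # back-transform to r
    return back(z - crit * se), back(z + crit * se)

test_scores = [12, 15, 9, 20, 17, 11, 14, 18, 10, 16, 13, 19]
ratings     = [3.4, 3.1, 2.9, 4.2, 3.6, 3.3, 2.8, 4.5, 3.0, 3.2, 3.9, 4.0]
r = pearson_r(test_scores, ratings)
lo, hi = fisher_ci(r, n=len(test_scores))
print(f"Observed validity r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With a small sample, even a sizable observed correlation carries a wide confidence interval, which is exactly the nuance an untrained reviewer of validity evidence can miss.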

Appropriateness and Feasibility of Validation Strategies Several validation strategies can be used to demonstrate the validity of inferences made from tests (e.g., content-oriented strategies, criterion-related strategies, validity generalization techniques). However, the feasibility and appropriateness of different validation strategies vary based on a number of factors, including those outlined as follows. Type of Test—Although content-oriented studies and criterion-related validity studies can be conducted for any test that produces a score, different validation strategies are often used for different types of tests. For example, evidence of validity for a structured interview often comes from a content-oriented study, and evidence for a numerical reasoning test frequently comes from a criterion-related study. There are several likely reasons for this choice. First, the relationship between the constructs measured by the test and the critical KSAOs is probably easier for subject matter experts (SMEs) to evaluate when the test constructs are more similar to the KSAOs. For example, SMEs may be more likely to see the relationship between a work sample test that measures electronic repair and a critical KSAO such as knowledge of electronics than the relationship between a number series test and knowledge of electronics. Second, instruments like structured interviews and work samples are often developed for use with a smaller number of applicants than more abstract measures that are often used for screening purposes. Small sample sizes make criterion-related validity studies technically infeasible. One of the primary determinants of appropriate validation in the U.S. is the perception of which validation strategy is legally defensible. The Uniform Guidelines state that “A selection procedure can be supported by a content validity strategy to the extent that it is a representative sample of the content of the job.” The Guidelines also reject content-oriented validation strategies for measures of “traits or constructs, such as intelligence, aptitude, personality, commonsense, judgment, leadership, and spatial ability” (Section 14.C.1). Thus, some tests may be technically validated using a content-oriented strategy only with some concern for legal defensibility if the


Nancy T. Tippins et al. test is challenged. Although not consistent with professional guidelines (e.g., Standards for Educational and Psychological Testing, Principles for the Validation and Use of Employee Selection Procedures), some believe that criterion-related validity is the “gold standard” for successful legal defense. Some organizations consider factors such as those outlined above when choosing a type of validation effort, but other organizations ignore the process of validation altogether. These organizations put themselves at a disadvantage not only in terms of missing the benefit of identifying the most predictive hiring tools and collecting evidence of the effectiveness of the selection program but also with respect to opening themselves up to potentially costly litigation in the event that their hiring practices are challenged and found wanting. Size of Incumbent Population—Organizations developing tests for jobs that have few incumbents and low hiring volume are unlikely to be able to execute a criterion-oriented validation study because these studies require relatively large sample sizes to obtain the sufficient power for statistical analyses. In such a circumstance, the test user sometimes resorts to contentoriented validation. In other situations, alternatives such as a validity generalization strategy, including a transportability study or a meta-analysis of validity studies involving relevant measures and criteria, are employed to establish evidence of validity. Additionally, organizations with small incumbent populations may use a synthetic or job component validity approach in which validity inferences are based on the synthesis of the relationships between scores on a test and measures of performance on a component of the job. New Jobs—New jobs can pose special problems for test validation. A concurrent criterionoriented validation study is clearly not feasible due to the lack of incumbents and supervisors available to complete test and performance ratings. A traditional content-oriented validation approach may not be possible either due to the lack of SMEs available to provide input about a job that does not exist. Occasionally, another source of expertise about the job is used to provide task and KSAO ratings for the new job and to establish linkages between the tasks and the KSAOs, and the KSAOs and the proposed tests. For example, information can be gathered from those who designed the job about the work tasks, processes, and equipment as well as the impetus for the newly created job, proposed minimum qualifications, jobs from which current employees will be promoted, proposed training, and similar jobs from external sources (e.g., O*NETTM, the I-O literature). Test Security—Another factor that might limit the type of validation process selected is the level of test security required. Some validation strategies (e.g., concurrent criterion-oriented) require that internal employees complete the tests experimentally. When the need for test security is high, the involvement of organization personnel in the test design process or validation effort may raise questions about the security and confidentiality of the test content. Existence of Robust Database of Validation Studies—When a sufficient database of validation studies is available to the user, validation based on generalization strategies instead of criterion-related or content-oriented validity approaches may be an option for the test user. 
In some cases, this database will come from a test publisher that maintains records of validation studies conducted using the firm’s tests as predictors. This type of database may provide sufficient evidence for the test user to reasonably believe a test is likely to be valid in the local setting so that the test can be used on an interim basis until the test user’s company can gather additional local validity evidence to support the test use. In other cases, the validation data may come from a database of validation studies internal to the organization that will facilitate validity transportation.
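The sample-size constraint that drives many of these alternatives can be made concrete with a quick power calculation. The sketch below, a minimal illustration rather than anything prescribed in this chapter, assumes a two-tailed .05 test and .80 power and uses the Fisher z approximation; the example validity values are invented.

```python
from math import atanh, ceil
from scipy.stats import norm

def n_required(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a population validity of r
    with a two-tailed significance test, using the Fisher z approximation."""
    z_alpha = norm.ppf(1 - alpha / 2)   # e.g., 1.96 for alpha = .05
    z_power = norm.ppf(power)           # e.g., 0.84 for power = .80
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)

for r in (0.20, 0.30, 0.40):
    print(f"expected validity {r:.2f}: about {n_required(r)} incumbents needed")
```

With these assumptions, even a moderate expected validity implies a larger incumbent sample than many jobs can supply, which is why the transportability, synthetic validity, and meta-analytic alternatives described above are often the only feasible options.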

Cost of Test Development and Validation Studies and the Utility of the Selection Program

In virtually every organization, costs are a consideration when developing and validating selection tools. These processes can be expensive when an outside consulting firm is used to develop and validate a test; however, even when the test development and validation work is conducted in-house, the validation effort can be expensive because qualified professionals still cost the organization money.

Regardless of who performs the test development and validation work, internal personnel must perform many tasks, such as coordinating study participants, ensuring appropriate communications, and collecting background data on populations and other archival data relevant to the study. Job incumbents, supervisors, and other SMEs may take time away from the job to participate in various components of the project (e.g., answer job analysis surveys, make linkage ratings, complete experimental tests, or provide criterion data). There can also be costs associated with the equipment and supplies required to construct the test (e.g., work samples) and conduct validation efforts (e.g., laptops for employee testing). As a result of these many costs, organizations often seek ways to minimize their expenditures and take steps to reduce the cost of test development (e.g., use off-the-shelf tests) and the validation effort (e.g., rely on validity generalization strategies such as the transportation of validity).

The source of funding for these efforts can become an important factor. For example, in some organizations, test development and validation expenses are paid from a limited, centralized human resources budget, whereas test administration costs may come from richer, decentralized operational budgets. In other organizations, the opposite is true. In the first case, an organization might be motivated to select tools that are less costly to develop (e.g., commercially available tests, interviews) or less costly to validate (e.g., those that can be justified through a transportability study or a content-oriented validity strategy). Because budgets are usually managed on a yearly basis, organizations may use off-the-shelf tests even when the ongoing licensing fees are more costly overall than the development of proprietary tests that have fewer recurring costs. Alternatively, the organization that has a larger budget for test development and validation may develop and validate a custom test tailored to its industry, core values, or culture rather than buy a commercially available test with ongoing licensing fees. The volume of test use may also be related to cost considerations. Under circumstances in which the volume of test use will be extremely high, the cost of ongoing test licensing may so greatly exceed the initial upfront costs of developing a proprietary test that test development becomes more economically viable.

Another factor that can influence the organization's approach to test development and validation is its perception of the test's value. When an organization uses company-specific equipment or processes, or has a unique culture that is not reflected in off-the-shelf tests, a proprietary test tailored to the needs of a specific business may be needed. Similarly, if the organization has confidence in the value of its selection program and believes the selection process offers a competitive advantage, then the business may seek to develop a test that is specific to it. Occasionally, organizations simply want to avoid the repercussions of other companies' poor testing practices. For example, if a competitor in the same geographic market has poor testing practices, an organization may seek a different off-the-shelf test or develop its own unique test. When the value of a business's services is derived from something other than its employees (e.g., natural resources), a test shared with other similar companies may be sufficient for its needs.
In a few situations (e.g., utility companies), where one organization dominates a geographic area and applicants come primarily from regional pools, tests shared with other organizations from different geographic areas tend to have little effect on the organization’s competitive advantage. As a final note, it can be argued that the ultimate measure of a test’s value to the organization is its utility, which takes into account not only its costs but also its benefits. Although testing professionals often struggle to identify and estimate all costs and the value of all benefits, both tangible and intangible, they should consider the costs to develop, validate, and administer a test relative to its benefits.

HOW SHOULD TESTS BE ADMINISTERED?

There are multiple ways to administer a test. Currently, one of the most discussed questions about test administration is whether or not the test should be proctored. However, other dimensions of test administration also affect the selection of tests. Several of the more common questions about test administration that affect the choice of test are discussed as follows.



Proctored or Unproctored Testing

The essential question about unproctored testing is whether the advantages of administering tests on candidates' own equipment at times and places of their convenience justify the risk of cheating in any form, which can decrease the validity of the test (see Chapters 39 and 41, in this volume). Ideally, the user considers the types of items used in the test as well as the consequences of bad hires when deciding whether to test in unproctored environments. Although there are few ways to cheat on unproctored self-description inventories that cannot also be used in proctored settings, tests with clear right and wrong answers are vulnerable to a number of dishonest behaviors that can increase test scores in ways that do not reflect the test taker's ability. When the consequences of failure to perform are significant, many employers will avoid unproctored testing and opt for monitoring test takers during administration. When the staffing context requires unproctored testing, test users should consider carefully both the type of test to administer and the implications of the test taker's opportunity to cheat, and choose tests accordingly.

Speeded Test or Power Test

Another frequent consideration in test administration is the use of time limits. Although some constructs (e.g., measures of perceptual speed and accuracy) require speeded tests, others do not. In such cases, the test user must decide what, if any, time limit to place on the testing time. Test users often impose a time limit for administrative reasons. In proctored settings, the time limit allows for efficient scheduling. In unproctored settings, a time limit may inhibit some forms of cheating. There are few rules about how to set a time limit on a test, but several factors should be considered. In the U.S., where accommodations for individuals with disabilities can be an important element of test administration, time limits are often generous to reduce the need for adjustments in administration times. For example, a user might set a time limit that allows 90% of test takers to complete 90% of the items. Firms concerned with test taker reactions may set generous time limits to avoid negative reactions when the test is difficult for most candidates to finish. For some tests, like business case assessments, time limits are set to standardize the exercise and allow the organization to learn what candidates can do in a set amount of time.
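As a rough illustration of the 90%-complete-90% guideline just mentioned, the sketch below derives a limit from hypothetical pilot timing data; the 300 simulated examinees and the lognormal timing distribution are assumptions made for the example, not values from this chapter.

```python
import numpy as np

# Hypothetical pilot data: minutes each pilot examinee needed to finish
# 90% of the items under untimed conditions.
rng = np.random.default_rng(1)
minutes_to_finish_90pct = rng.lognormal(mean=np.log(25), sigma=0.25, size=300)

# Set the limit so roughly 90% of examinees could complete 90% of the items.
time_limit = np.percentile(minutes_to_finish_90pct, 90)
print(f"Suggested time limit: {time_limit:.0f} minutes")
```

In practice the limit would, of course, be based on actual pilot or tryout timing data rather than simulated values.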

Group or Individual Administration

Many tests can be administered either individually or in a group setting, and the choice of which to use may depend entirely on the staffing model the employer uses. However, some tests (e.g., many physical abilities tests, structured interviews, and work samples) require individual administration and scoring. When resources are insufficient to allow for this, alternative forms of testing must be found.

Test Preparation

Many employers provide test preparation materials that explain what is being measured, how the test is scored, what can be done to prepare for the test, what the rules regarding testing are, etc. Other employers offer practice tests that familiarize test takers with the test and provide them with some idea of how their practice scores compare to the test standard for the job to which they are applying. At times, the practice test feedback is accompanied by developmental suggestions intended to improve the skill being measured. For example, employers who use physical abilities tests may offer a practice test, feedback, and developmental suggestions on improving upper and lower body strength, flexibility, etc.

The intent of many of these preparation efforts is to inform test takers of what to expect so that their scores more closely reflect their ability rather than their comfort with or savviness for taking tests. In choosing the type of test to use, the test user must consider whether or not to offer test preparation materials and how to provide access to all candidates. Because test preparation materials and practice tests represent another source of costs and may have implications for applicant reactions and adverse impact, the amount of test preparation required and the guidance needed by test takers can be factors in the choice of test. The test user must also decide what kind of guidance to provide for tests that measure attributes that are difficult to develop (e.g., personality tests).

HOW SHOULD SCORES BE USED?

After tests have been identified or developed and validated, the test user must consider how to calculate, report, and use the resulting test scores. (Chapters 8 and 18 in this volume contain additional information regarding the use of test scores.) Considerations related to these decisions include the form of the test score used (e.g., raw score, percentile score, score bands), the method for combining test scores (e.g., compensatory, multiple hurdle), and the operational use of test scores (e.g., top-down selection, banding). Additionally, decisions need to be made regarding what type of feedback (if any) to provide to test takers.

Calculation and Form of Reported Test Score

A variety of methods can be used to calculate test scores (e.g., points are given for a single correct answer, points are given differentially for each possible response to a question, a different number of points is given for answers to different questions depending on the difficulty of the question, points are subtracted for guessing). Additionally, the final test score can be presented in a variety of forms (e.g., raw score, percent score, percentile score compared to a norm group or to the current group of test takers, standardized score). Various factors should be considered when determining what kind of score to report, including the type of test (e.g., power versus speeded), the construct measured (e.g., cognitive ability versus personality), the number of competencies measured, the availability of appropriate normative groups, the ability of the test score recipient to understand the score, the purpose of the test score, and the reliability and validity of the test score.

Different test types require different score formats. A score indicating the number or the percentage of items the test taker answered correctly may be effective when communicating the extent to which an individual possesses a body of knowledge. In contrast, a number-correct or percent-correct score on a personality inventory would be difficult to interpret because there are no technically right or wrong answers; instead, there are responses that describe the test taker's standing on a construct to varying degrees. Similarly, a percent-correct score would be appropriate on a power test but would be less useful on a speeded test. A standardized score or percentile score might be helpful when information about an applicant's standing relative to other test takers is needed, but less useful when the question posed is how much of some ability or skill a person possesses. In such a case, the number correct or the percent correct might be more useful. If there is no relevant normative group, then a percentile score or standard score based on a sample of individuals in an irrelevant group is not informative.

The ability of the test score recipient to understand various types of scores can also influence the decision of how to calculate and present scores. For example, test takers and hiring managers may have difficulty interpreting some forms of test scores (e.g., norm-referenced percentile scores with multiple norm groups), whereas testing professionals may prefer more complex forms of the score that convey more information about the individual.


The purpose for which the test is given can also influence the type of score to be provided. If the test is used for selection, all the test taker and hiring manager may need to know is whether or not the test taker has met the qualifying standard. If, however, there is a developmental component to the test, more detailed information may be warranted. For example, an employee seeking promotion may need to know how far from the test standard his/her score is, or what his/her score on each scale is, so that he/she can focus his/her developmental activities. Finally, the reliability and validity of a test or scale should also be considered when determining how to present test results. It may be more appropriate to present more general feedback regarding overall test performance (e.g., pass/fail) than to provide individual scale scores that lack sufficient reliability.
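The score forms discussed above are different expressions of the same raw information. The sketch below, which assumes a hypothetical 50-item knowledge test and a simulated norm group (both invented for illustration), shows how one raw score translates into a percent-correct score, a percentile relative to the norm group, and a standardized (z) score.

```python
import numpy as np

n_items = 50
raw_score = 38                                   # items answered correctly
# Hypothetical norm group of 1,000 prior test takers.
norm_group = np.random.default_rng(7).binomial(n_items, 0.6, size=1000)

percent_correct = 100 * raw_score / n_items
percentile = 100 * np.mean(norm_group < raw_score)   # % of norms scoring lower
z_score = (raw_score - norm_group.mean()) / norm_group.std(ddof=1)

print(f"raw: {raw_score}, percent correct: {percent_correct:.0f}%, "
      f"percentile: {percentile:.0f}, z: {z_score:.2f}")
```

A percentile or z score is only as meaningful as the norm group behind it, which is the point made above about irrelevant normative samples.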

Combining Scores Across Tests

When multiple tests are used in a selection process, a decision needs to be made regarding how to use the multiple test scores to reach a selection decision. One option is to weight and combine the separate test scores in a compensatory fashion. Another option is to use a multiple-hurdle model in which a cutoff score is applied to each test and applicants must score above each cutoff score to be qualified on the overall assessment. Yet another alternative is a mixed model in which a minimum level of performance is required on certain tests, the scores are also combined into a single score, and a cutoff score is applied to the overall score as well. The method used to combine scores should take into account the requirements of the job as well as available data that may support the decision. For example, a selection procedure for a technical sales job that requires technical skills and sales skills may involve a multiple-hurdle approach when job analysis data indicate that high levels of technical skills do not compensate for low levels of sales skills or vice versa. In another job that requires lower levels of technical skills along with sales skills, combining these test scores in a way that allows high levels of persuasiveness to compensate for lower technical skills may be more appropriate. In still another job in which technical skills constitute 80% of the job and sales 20%, a compensatory model that weights scores on tests measuring technical skills more heavily than tests measuring sales skills (e.g., 80/20) may be appropriate.
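The sketch below illustrates the mixed model described above for the hypothetical technical sales job: an 80/20 compensatory weighting combined with a floor on each test. The applicant scores, weights, and floors are all invented for the example.

```python
# Hypothetical standardized (z) scores on a technical test and a sales test.
applicants = {
    "A": {"technical": 1.2, "sales": -0.4},
    "B": {"technical": 0.3, "sales": 0.9},
    "C": {"technical": -1.5, "sales": 1.8},
}

weights  = {"technical": 0.8, "sales": 0.2}     # mirrors the 80/20 job composition
minimums = {"technical": -1.0, "sales": -1.0}   # mixed model: a floor on each test

for name, scores in applicants.items():
    meets_floors = all(scores[k] >= minimums[k] for k in minimums)
    composite = sum(weights[k] * scores[k] for k in weights)
    print(f"{name}: composite = {composite:+.2f}, meets floors = {meets_floors}")
```

Dropping the minimums reduces the sketch to a purely compensatory model; dropping the composite and keeping only the floors reduces it to a pure multiple-hurdle model.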

Use of Test Scores

Test scores can be used in many ways for hiring decisions (or for progression to the next step in the hiring process). Test scores are often distributed to individuals making hiring decisions as one source of job-relevant information that they use according to their own understanding of the meaning of the test score and the job requirements. Test users can also be provided with an expectancy table and accompanying guidance regarding how the test score should be used. For example, a candidate falling into the top score range may be hired without any other education or experience credentials, whereas another candidate with a score in the lower range may be hired only if he or she has certain levels and kinds of relevant education or experience. Alternatively, strict cutoff scores (for individual tests or a battery) can be established, and decision makers are given only pass/fail information without the opportunity to deviate from the company-wide rule. A common variation on a single cutoff score is score bands, which theoretically take into account the unreliability of individual test scores and treat all scores within a band as though they predict the same level of performance. Finally, some organizations use top-down selection by hiring the individuals with the highest scores first.

The best method for using test scores depends on a variety of factors, such as the goals of the organization, the frequency of hiring, and the qualification level of the applicant pool. Top-down selection, for example, can work well when testing occurs infrequently and the employees are drawn from a single pool of qualified applicants; however, it may be less appropriate when testing occurs frequently because the candidate pool changes daily and the top candidate may differ in qualification level from one day to the next. When an organization strives to hire the best of the applicant pool, top-down hiring can help ensure the organization achieves its goals.


When an organization uses a test with large group mean differences and desires a diverse workforce, top-down hiring can present a barrier to achieving the diversity goal. Additionally, top-down hiring with a test with large group mean differences can exacerbate the level of adverse impact and lead to legal challenges. In addition, where the cutoff score is set may have legal implications related to the extent of adverse impact. Despite the desire of some organizations to upgrade their workforces, cutoff scores on tests with large group mean differences that reflect skills exceeding the minimum required to perform the job can be difficult to defend (see Lanning v. SEPTA). Organizations may need to collect evidence on the minimally acceptable level of performance to justify a cutoff score.

Another important consideration involves the requirements of the job. In situations where a high level of skill is required on the job, an organization may need to set a floor on test scores to ensure minimum skill levels in all new hires. Top-down hiring could be appropriate if there is a wide range of skill in the applicant population but may result in the employment of unqualified individuals if there are few highly skilled individuals in the applicant pool. Occasionally, organizations will set a minimum score while using top-down hiring to identify qualified candidates for a job.

As noted earlier, decisions regarding many of the factors described in this chapter influence decisions on other factors. For example, consider an organization that chooses to use a multiple-hurdle approach for selection that includes a numerical reasoning test, a reading test, and a situational judgment inventory. On the basis of data from a concurrent criterion-oriented validity study, the organization sets a cutoff score on the numerical reasoning test such that 95% of candidates who pass the numerical reasoning test also pass the reading test and 80% also pass the situational judgment inventory. In this scenario, there is virtually no value in retaining the reading test and little value in using the situational judgment inventory. Thus, the decision regarding the cutoff score for the numerical reasoning test essentially alters the decision of which constructs to measure and which tests to use. These types of interactions should therefore be considered when making decisions regarding all of the factors described above, and the test user should be prepared to revisit these decisions repeatedly.

While it can be challenging to identify all of the goals for the testing program and the individual tests and to prioritize them, it is important to use tests in ways that meet the organization's goals. Few job aids exist to help the user determine how to use a test other than an understanding of the organization's goals and knowledge of the impact various decisions have. Recently, some organizations have turned to Pareto optimization methods when considering multiple goals in order to maximize the levels of the goals achieved.
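One common way to form the score bands mentioned above is to base the band width on the standard error of the difference between two scores. The sketch below computes such a band for a hypothetical test; the standard deviation, reliability, and top score are invented for illustration, and this is only one of several banding approaches in the literature.

```python
from math import sqrt

def band_width(sd, reliability, confidence_z=1.96):
    """Width of a fixed score band based on the standard error of the
    difference (SED) between two observed scores."""
    sed = sd * sqrt(2 * (1 - reliability))
    return confidence_z * sed

sd, rxx, top_score = 10.0, 0.85, 92        # hypothetical SD, reliability, top score
width = band_width(sd, rxx)
print(f"Band width = {width:.1f}; scores from {top_score - width:.1f} to {top_score} "
      "are treated as equivalent")
```

Under this interpretation, candidates whose scores fall within the band are treated as interchangeable, which is one way of acknowledging measurement error when scores are used for ranking.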

Feedback

When deciding what kind of feedback to offer, if any, most organizations consider a wide array of factors, including the size of the applicant pool, the type of candidate (e.g., internal or external), the expectations of the candidate, the resources of the organization, the employment brand the organization wishes to project, the level of the position (e.g., entry level, executive), and the type of test(s) administered. When organizations test a large number of individuals from outside the organization for an entry-level role that traditionally has high turnover, they frequently provide candidates with basic pass/fail information regarding whether or not they successfully progressed to the next stage in the hiring process. At the other extreme, an internal candidate applying for a higher-level position who completes tests may expect more detailed feedback (e.g., a percentile score for each test/scale) to guide his/her development. The specificity of feedback is particularly important when the internal candidate is not promoted into the new role and is expected to develop in the deficient areas to prepare for the role in the future. Another consideration related to feedback is the type of test completed. It is more appropriate to provide feedback on constructs that can be improved with effort (e.g., knowledge areas) than on more stable attributes (e.g., personality). Additionally, when feedback is provided on tests measuring constructs that can be developed by the individual (e.g., knowledge tests), developmental suggestions are often given in addition to detailed test performance information.


Some employers recognize the role of feedback in shaping test takers' feelings about the company. They strive to provide accurate and constructive feedback in a sensitive manner in hopes of reducing the likelihood of a challenge to the selection program or decreasing negative comments about the organization's staffing process.

CONCLUSIONS

This chapter has reviewed many of the issues test users consider in selecting or developing a test for validation or for operational use on an interim basis. The issues, framed around five questions, are many, and hard-and-fast answers are few. As noted earlier in the chapter, none of these factors can be evaluated without consideration of the others. For example, the feasibility of test development and validation and their costs are significant factors in the choice of tests. An organization with few incumbents in a job for which tests are being considered may not be able to supply enough job incumbents to complete tests for a concurrent study, or perhaps even enough SMEs for a content-oriented study. Even enterprises with many incumbents may not be able to relieve employees from their job duties for the time needed to assist with test development and validation while maintaining smooth operations.

In addition, the answer to one question may need to be revisited depending on the answers to the other questions. An organization that decides to measure only problem-solving ability because it was the most important KSAO for a particular job and then decides to use a work sample test may find that the work sample measures a broader array of KSAOs than just problem-solving ability. Conversely, an organization that decides to measure all of its important KSAOs may find that the number of tests required is so large that testing requires three days and consequently is unaffordable to the organization and intolerable to applicants. Or, the organization may find that after the first few tests the remaining tests add little incremental validity.

A particularly difficult, overarching concern is how to arrive at one decision in the face of many competing demands on the organization. Optimization of all factors is challenging, if not impossible, in most cases. For example, increasing validity while minimizing adverse impact and meeting organizational constraints of time and cost associated with validation and administration remains a balancing act rather than a series of discrete decisions. Minimally, it is imperative that those who are tasked with identifying or developing successful selection systems are familiar with the many decision points in the process. The test user responsible for designing selection systems must consider these issues and their ramifications, weigh the tradeoffs, and make fully informed final decisions.

In many organizations, these questions and their answers must be revisited regularly. Many things about an organization can change quickly: business needs and strategies change, the staffing context changes, the applicant pool changes, and so on. What exists today may not exist tomorrow. Thus, the skilled test user will continually evaluate each selection program to ensure it meets as many needs of the organization as possible.

REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Bauer, T. N., McCarthy, J., Anderson, N., Truxillo, D. M., & Salgado, J. F. (2012). What we know about applicant reactions on attitudes and behavior: Research summary and best practices. International Affairs Committee of the Society for Industrial and Organizational Psychology. Society for Industrial and Organizational Psychology, Inc.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43, 38290–38315.
Gilliland, S. W. (1993). The perceived fairness of selection systems: An organizational justice perspective. Academy of Management Review, 18, 694–734.


Goldstein, I. L., Zedeck, S., & Schneider, B. (1993). An exploration of the job analysis-content validity process. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 3–32). San Francisco, CA: Jossey-Bass.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence Erlbaum.
Murphy, K. R., & Davidshofer, C. O. (1988). Psychological testing: Principles and applications. Englewood Cliffs, NJ: Prentice-Hall.
Rynes, S. L., Colbert, A. E., & Brown, K. G. (2002). Human resource professionals' beliefs about effective human resource practices: Correspondence between research and practice. Human Resource Management, 41, 149–174.
Sackett, P. R., & Lievens, F. (2008). Personnel selection. In S. T. Fiske, A. E. Kazdin, & D. L. Schacter (Eds.), Annual review of psychology (pp. 419–450). Palo Alto, CA: Annual Reviews.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274.
Shute, V. J., & Wang, L. (2016). Assessing and supporting hard-to-measure constructs. In A. A. Rupp & J. P. Leighton (Eds.), The handbook of cognition and assessment: Frameworks, methodologies, and applications (pp. 535–562). Hoboken, NJ: Wiley.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author.
Speer, A. B., King, B. S., & Grossenbacher, M. (2016). Applicant reactions as a function of test length: Is there reason to fret over using longer tests? Journal of Personnel Psychology, 15, 15–24.
Tippins, N. T., Beatty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., & Shepherd, W. (2006). Unproctored Internet testing in employment settings. Personnel Psychology, 59, 189–225.


17
THE SUM OF THE PARTS
Methods of Combining Assessments for Employment Decisions

JULIET R. AIKEN AND PAUL J. HANGES

Organizations commonly use multiple assessments (e.g., a combination of cognitive tests, personality tests, situational judgment tests, interviews, etc.) to make hiring decisions. While using more than one assessment to make employment decisions can provide organizations with a more holistic view of each candidate, deciding how to combine these assessments can have profound consequences for who gets hired. Most employers use one of three approaches to combining assessments: (1) combining multiple assessments subjectively (clinical assessment), (2) developing assessment composites empirically, or (3) using assessments in sequence—i.e., a multiple-hurdle approach. In this chapter, we will define each of these techniques, review when it tends to be used, discuss its strengths and weaknesses, and provide recommendations for its use. In addition to discussing these three methods, we will also discuss how they can be used in combination. But let us begin at the beginning—we will next review each of the three methods of score combination, beginning with clinical assessment.

CLINICAL ASSESSMENT

In clinical assessment, an employer subjectively combines multiple assessments to develop an overall impression of the candidate. Clinical assessments are often based on intuition. Their use may be prompted by the assumption that quantitative approaches to combining assessments fail to capture the complexity of each applicant's qualifications and potential. Despite repeated empirical evidence that clinical assessment tends to have fairly low validity in a number of contexts (Morris, Daisley, Wheeler, & Boyer, 2014), especially compared to statistical decision-making approaches (Ægisdóttir et al., 2006; Grove, Zald, Lebow, Snitz, & Nelson, 2000; Kuncel, Klieger, Connelly, & Ones, 2013), it remains a popular technique for selection and assessment in many workplaces (Highhouse, 2008). Employers use clinical assessment in a variety of business activities, including leadership or management coaching, individual development assessment, and forming subjective summary scores from applicants' interviews or assessment centers. Clinical assessment is especially popular for selecting candidates into executive positions (Thornton, Hollenbeck, & Johnson, 2010), as job requirements for these positions are seen as specific to that particular opening or organization (Hollenbeck, 2009).


The goal of clinical assessment is to summarize and combine multiple pieces of information about each candidate before making an employment decision; employers using clinical assessment do so to form a holistic judgment of each candidate. There are three phases in clinical assessment: (1) collecting information, (2) evaluating information, and (3) developing a report and recommendation for an individual job candidate (Weiner, 2003). Employers vary in the processes they follow to accomplish each of these steps. For example, information-collecting techniques may lack structure or may be highly standardized. Likewise, information integration varies substantially, as clinical assessment is very subjective. Consequently, who the assessors are and what training they have received profoundly influences the assessment process (Morris et al., 2014). Unsurprisingly, the subjectivity of the integration process can make it especially difficult to tell how clinical assessment is conducted in practice. Clinical assessment is also, due to its subjectivity, both flexible and fairly simple to implement, provided that the assessor is not overwhelmed with information on each candidate.

However, the drawbacks of clinical assessment far outweigh its strengths. Because clinical assessment is unstandardized, it is typically applied inconsistently both over time and over candidates. Likewise, without any attempt at standardization or objectivity, assessor biases can strongly influence who is hired. Furthermore, clinical assessment particularly lacks validity when assessing candidates for nonmanagerial positions (Morris et al., 2014). Finally, it can be challenging to form a clinical assessment when the assessor has too many different pieces of information to consider at once.

In general, we—like many other scholars—do not recommend that organizations use clinical assessment for selection purposes, even in low-frequency hiring contexts, such as when dealing with small candidate pools, circumstances where few offers are extended, or one-time hiring decisions. While clinical assessment provides useful individualized feedback and can help identify areas for individual growth, it is too subjective for organizations using it to consistently hire or promote the most qualified applicants. Despite this word of caution, if an organization chooses to use clinical assessment, we have a number of recommendations to improve its usage. Specifically, we suggest that organizations (a) use a cognitive ability test in their clinical assessment, (b) standardize the evaluation process, including documenting strengths and weaknesses for each applicant, (c) take structured notes during the evaluation, (d) provide training for evaluators to reduce personal biases, (e) use the same rater or panel of raters to assess all applicants, and (f) only use clinical assessment when determining whom to hire for a high-level managerial position (e.g., executive management).

First, if a clinical assessment approach is taken, organizations should use a cognitive ability test as part of their battery. In their meta-analysis on clinical assessment, Morris et al. (2014) found that clinical assessments were more valid when they included a cognitive ability test. Second, it is critical that organizations standardize their evaluation process as much as possible. This may at first seem counterintuitive; after all, isn't the point of a clinical assessment to make gut decisions, rather than decisions guided by a set rule?
However, some structure and standardization can enable decision makers relying on clinical assessment to fight their unseen biases. Our primary recommendation for standardizing the process is to make written reports of each candidate's strengths and weaknesses. These reports should require the evaluator to justify the conclusions they reach, thus helping evaluators think more deeply and thoroughly about their decisions. In addition to creating reports on each applicant, we recommend that employers using clinical assessment also take structured notes during the evaluation or interview to ensure that the evaluator is considering all aspects of the person's strengths and weaknesses in real time. These notes can be used as the framework of the reports. Furthermore, clinical assessments have higher validity when used to hire candidates into managerial positions (Morris et al., 2014), and particularly into executive positions (Thornton et al., 2010). Thus, if an employer must use clinical assessment, we recommend that it be used only for selecting into high-level managerial roles. Finally, the use of multiple evaluators should be carefully considered when conducting clinical assessment. Specifically, using multiple assessors does not improve validity, except when the same assessors are used across all candidates (Morris et al., 2014).


Therefore, if multiple evaluators are going to assess candidates, we recommend that organizations use the same evaluators to assess all candidates. We further recommend that organizations employing a clinical assessment approach train their evaluators for consistency. In particular, we recommend frame-of-reference training (Bernardin & Buckley, 1981). Frame-of-reference training involves educating assessors on what attributes or behaviors are desired, providing them with opportunities to practice evaluating, and giving them feedback on their accuracy (Pulakos, 1986). Frame-of-reference training has been shown to reduce the influence of personal biases (Woehr & Huffcutt, 1994) and increase rater consistency (Schleicher, Day, Mayes, & Riggio, 2002).

In summary, although the subjectivity of clinical assessment makes it a less valid method for combining multiple assessments when making employment decisions, employers continue to use it because it is intuitive, because they are concerned that it would not be practical to develop an elaborate process for positions with few or rare hires, or because they are skeptical of the ability of hard data to truly identify who will be a good fit for their organization's needs. Two other methods for candidate selection do not suffer from these setbacks, or at least not to the same degree. We turn next to one of these methods: the compensatory approach to weighing multiple pieces of information.

COMPENSATORY METHODS

Compensatory methods of weighing multiple assessment criteria involve mathematically weighting each piece of information about a candidate (e.g., each assessment score) to determine an overall qualification score for that candidate. These methods are considered "compensatory" because low scores on one assessment can be counterbalanced by high scores on another assessment. Methods of weighting the criteria include unit weighting, regression weighting, factor analysis (e.g., Kanfer, Wolf, Kantrowitz, & Ackerman, 2010), and relative importance analysis, among others.

Typically, compensatory methods are used to select employees into jobs for which weaknesses in one area can be compensated by strengths in another area. For example, let's assume two candidates take five assessments that are each scored from 1 to 5. The first candidate scores 1, 3, 4, 5, 5. The second candidate scores 4, 3, 3, 4, 4. Assuming no minimum score is required on any given assessment, which is a typical assumption in compensatory selection procedures, these two candidates would have an equivalent sum score: the first candidate's strengths on the latter two assessments offset his or her weakness on the first assessment. Both would be equally qualified, assuming equal weights were put on each assessment. However, if we know that anyone who scores below a 3 on the first assessment would not be qualified for the job, then only the second candidate would qualify. Thus, when minimum scores are required, compensatory methods may not be ideal. However, compensatory methods would be well suited for selection in a context where there is no minimum required score on any given assessment.

While compensatory methods are widely thought of as more valid methods for combining predictors than clinical assessment, there are several challenges associated with implementing them effectively. First is the obvious issue of how to weight different predictors. Multiple options, including regression weighting, rational weighting, unit weighting, and even Pareto-optimal weighting, exist. Each of these methods differs not only in how exact weights are calculated but also in the rationale for its use. We provide a brief overview of some of the most common methods of weighting predictors next.

Regression weighting involves regressing assessment scores onto the criterion (or criterion composite) and then using the resultant regression weights to determine how much weight applicant scores on each assessment receive in the composite. In contrast, unit weighting involves assigning each predictive assessment a weight of 1; organizations using unit weighting simply average together standardized scores on each assessment to create a composite. Organizations using rational weighting derive weights for each assessment from a job analysis; assessments are therefore weighted according to the importance the job analysis establishes for each. Finally, factor analyses of measures can be used to determine weights for each assessment within different domain composites (e.g., "ability"; Kanfer et al., 2010).
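The two-candidate example above can be expressed directly in code. This is just the arithmetic from the preceding paragraph, with an optional minimum-score screen added to show why purely compensatory scoring can admit a candidate who falls below a required floor.

```python
candidate_1 = [1, 3, 4, 5, 5]
candidate_2 = [4, 3, 3, 4, 4]

def compensatory_score(scores, weights=None):
    """Weighted sum of assessment scores; unit weights by default."""
    weights = weights or [1] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))

print(compensatory_score(candidate_1), compensatory_score(candidate_2))  # 18 and 18

# With a minimum of 3 required on the first assessment, only candidate 2 qualifies.
minimum_on_first = 3
for name, scores in [("candidate 1", candidate_1), ("candidate 2", candidate_2)]:
    print(name, "meets the floor:", scores[0] >= minimum_on_first)
```

Run as written, both candidates earn a sum of 18, but only the second clears the floor on the first assessment.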


Each of these techniques has different strengths and weaknesses. Regression weighting, for example, is limited in that it assumes there is a linear relationship between KSAOs/competencies and job performance. Furthermore, the rationale behind each weighting practice varies significantly. For example, unit weighting and regression weighting are very different approaches. Specifically, unit weighting focuses on content validity—does the composite reflect all of the desired job components? In contrast, regression weighting focuses on criterion-based validity—does the composite accurately capture the most predictive regression equation? The rationale behind rational weighting also contrasts with the rationale underlying regression weighting. Specifically, weights derived from rational weighting are imbued with the values of the organization and decision makers, and weights derived from this process would at best implicitly account for predictor and criterion intercorrelations (Hough, Oswald, & Ployhart, 2001). In contrast, regression-derived weights account for intercorrelation explicitly, and the composite is optimal in the mathematical sense rather than directly reflecting organizational values (Hough et al., 2001).

Further complicating the question of how to form compensatory composites is the question of whether and how much the predictors are interrelated. Specifically, the weights recommended by different methods of forming composites diverge depending on the extent to which predictors are orthogonal (Hough et al., 2001). That is, because rational weights do not explicitly take intercorrelation into account, they are likely to differ most dramatically from statistically derived weights when predictors are highly correlated. Indeed, when predictors are highly correlated, ordinary least squares (OLS) regression may not be ideal for determining weights. Instead, one statistical technique for minimizing the challenges presented by intercorrelated predictors is relative importance analysis or dominance analysis. Either relative importance or dominance analysis allows organizations to determine statistical weights (rather than rational weights) for each predictor in the context of the selection model while simultaneously minimizing the effects of suppression and multicollinearity on statistically derived weights (Hough et al., 2001).

Another issue that arises when using compensatory methods of combining predictors is what the organization should use as the criterion (Hattrup & Rock, 2002). Specifically, should the criterion also be a weighted composite? Research on this question reveals that using weighted criterion composites rather than a single criterion assessment can boost validity and help reduce adverse impact. Unit weighting for criteria appears to result in higher synthetic validity coefficients (Johnson & Carter, 2010). Additionally, weighting job components using multiple regression to create a criterion composite and then weighting predictors based on the criterion composite boosts validity while also reducing adverse impact (Hattrup, Rock, & Scalia, 1997). Notably, the greatest gains in validity occur when the weights used to form a predictor composite correspond to the values placed on each component in the criterion composite (Hattrup & Rock, 2002).
Furthermore, adverse impact is reduced if criteria that correlate with cognitive ability (e.g., task performance) are given lower weights than criteria that correlate less with cognitive ability (e.g., contextual performance) in a criterion composite (Hattrup & Rock, 2002). Of course, as in many areas of scientific inquiry, matching predictor and criterion complexity improves the validity of selection outcomes. Specifically, more complex criteria are predicted better by complex predictor composites that match the criteria on relevance and bandwidth (Hough & Ones, 2001). There is one final downside to the use of regression to determine predictor or criterion weights that warrants attention. Specifically, when regression weights are used to determine the composite, these weights may be sample and time dependent. Counteracting this concern requires organizations to collect more data, which may be easier said than done; the influence of sample fluctuations lessens with very large sample sizes. As should be obvious from our discussion thus far, organizations need to make several decisions when developing composite predictors. With so many weighting strategies available, how does an organization know which to use? Unfortunately, most research on weighting strategies does not give an unambiguous and consistent answer to this question. While some research suggests that unit weights are appropriate for combining predictors (Bobko, Roth, & Buster, 2007), other research reports gains in validity coefficient when weighting using other criteria, such as the number of job components or relative weights analysis (Johnson & Carter, 2010).
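To make the contrast between regression weights and unit weights concrete, the sketch below computes both from a hypothetical predictor intercorrelation matrix and validity vector; the three predictors and all of the correlation values are invented for illustration, not taken from the studies cited above.

```python
import numpy as np

# Hypothetical inputs: cognitive test, conscientiousness measure, structured interview.
Rxx = np.array([[1.00, 0.00, 0.30],
                [0.00, 1.00, 0.20],
                [0.30, 0.20, 1.00]])      # predictor intercorrelations
rxy = np.array([0.50, 0.22, 0.45])        # criterion-related validities

beta = np.linalg.solve(Rxx, rxy)          # standardized regression weights
R_regression = np.sqrt(rxy @ beta)        # validity of the regression-weighted composite

ones = np.ones(3)
R_unit = (ones @ rxy) / np.sqrt(ones @ Rxx @ ones)   # validity of a unit-weighted composite

print("regression weights:", np.round(beta, 2))
print(f"composite validity: regression = {R_regression:.2f}, unit = {R_unit:.2f}")
```

With these particular numbers the regression-weighted composite is only modestly more valid than the unit-weighted one, which mirrors the mixed findings on weighting strategies noted above.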


Recently, however, De Corte and colleagues have developed a promising method for compensatory selection that enables organizations to make maximally informed decisions based on Pareto-optimal predictor weights (De Corte, Lievens, & Sackett, 2007; De Corte, Lievens, & Sackett, 2008; De Corte, Sackett, & Lievens, 2010). These weights provide Pareto-optimal tradeoffs between validity and adverse impact (De Corte et al., 2007, 2008, 2010). Specifically, weights are considered Pareto-optimal when the level of one outcome (e.g., adverse impact) cannot be improved without losing ground on the other outcome (e.g., decision quality).

More than one outcome is desired in many selection decisions. Ideally, organizations want to develop assessments that maximize validity with regard to relevant outcomes (e.g., performance, organizational citizenship behaviors, etc.) while minimizing adverse impact. However, as we have noted, these outcomes are typically in conflict. Consequently, the procedure for determining Pareto-optimal weights does not result in a single recommended set of weights for each predictor. Instead, the procedure provides a range of possible weights that organizations choose among to reach the desired levels of, and tradeoff between, adverse impact and validity. The method De Corte et al. (2007, 2008, 2010) propose allows organizations to answer concretely how much of an improvement they can make in one of these areas within a given constraint (e.g., 1%, 5%) or penalty imposed in the other area. Additionally, this method provides organizations with information on the worst possible tradeoffs and the relative importance of adverse impact as an outcome for each Pareto-optimal solution maximizing validity (De Corte et al., 2007). Finally, each Pareto-optimal solution maximizes a combined adverse impact-decision quality goal.

Finding Pareto-optimal solutions is a multistep process. The solutions produced by this procedure first seek to maximize one outcome and then the other. For example, you might first specify that you would like a mean standardized performance level of 0.75 among the pool of candidates who are selected. Multiple combinations of predictors and predictor weights would yield this desired level; that is, multiple combinations maximize your first outcome, validity. Then, you would need to consider which of these combinations would maximize your second outcome (i.e., minimize adverse impact). The combination that maximizes the second outcome (e.g., adverse impact) at each specified level of the first outcome (e.g., performance) is Pareto-optimal (De Corte et al., 2007). To conduct the proposed analyses, organizations need to specify (a) the selection rate, (b) the representation of minority and majority candidates in the applicant pool, (c) the effect size (subgroup difference) of the available predictors, (d) the validity of the available predictors, and (e) the intercorrelations of the available predictors (De Corte et al., 2007). Ideally, these estimates are readily available from past or current validation studies or meta-analyses (De Corte et al., 2007). Fortunately, however, the proposed procedure is fairly robust to uncertainty when precise estimates are not available (De Corte et al., 2007).
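The sketch below is not De Corte et al.'s program; it is a simplified, brute-force illustration of the same tradeoff for two predictors, using invented values for the inputs just listed (validities, subgroup differences, intercorrelation, selection ratio, and minority representation) and assuming normally distributed composites with equal within-group variances.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

r_xy = np.array([0.50, 0.25])     # hypothetical validities: cognitive, noncognitive
d    = np.array([1.00, 0.20])     # hypothetical standardized subgroup differences
Rxx  = np.array([[1.0, 0.2], [0.2, 1.0]])   # predictor intercorrelation
selection_ratio, p_minority = 0.30, 0.30

def composite_stats(w):
    sd_c = np.sqrt(w @ Rxx @ w)                   # SD of the weighted composite
    return (w @ r_xy) / sd_c, (w @ d) / sd_c      # composite validity, composite d

def adverse_impact_ratio(d_c):
    # Majority composite ~ N(0, 1), minority ~ N(-d_c, 1); find the cutoff that
    # yields the overall selection ratio, then compare subgroup selection rates.
    f = lambda c: ((1 - p_minority) * norm.sf(c)
                   + p_minority * norm.sf(c + d_c) - selection_ratio)
    cut = brentq(f, -6, 6)
    return norm.sf(cut + d_c) / norm.sf(cut)

points = []
for w1 in np.linspace(0, 1, 101):
    validity, d_c = composite_stats(np.array([w1, 1 - w1]))
    points.append((w1, validity, adverse_impact_ratio(d_c)))

# Keep the Pareto-optimal weightings: no other weighting beats them on both criteria.
pareto = [p for p in points
          if not any(q[1] > p[1] and q[2] > p[2] for q in points)]
for w1, val, air in pareto[::10]:
    print(f"w_cognitive = {w1:.2f}  validity = {val:.3f}  AI ratio = {air:.2f}")
```

Each printed weighting is one point on the validity-adverse impact frontier; the actual De Corte et al. program handles more predictors, additional constraints, and the further options described below.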
In addition to requiring the specified information, this procedure assumes that predictor and criterion scores have a joint multivariate normal distribution with the same variance-covariance matrix and different means in the minority and majority populations (De Corte et al., 2007). A detailed explanation of how this procedure works is available in De Corte et al. (2007). Researchers and organizations who want to use this procedure can access a Windows-compatible computer program designed to run it, as well as instructions on how to use this program, at http://users.ugent.be/_wdecorte/software.html. Program users have several control options, including (a) operationalizing the selection quality objective as the validity of the composite, the average criterion score of selected applicants, or the utility of the selection; (b) determining the number of tradeoff points computed; (c) specifying whether the selection decision is probationary; (d) constraining predictor weights; and (e) specifying upper/lower boundaries or fixing the proportion of hired employees (De Corte et al., 2007).

Although some challenges and caveats are associated with creating valid predictor composites, there is one overarching strength to using these methods that has already surfaced in our previous discussion: the reduction of adverse impact. The preponderance of research shows that most predictors present an adverse impact-validity tradeoff (Pyburn, Ployhart, & Kravitz, 2008).


That is, predictors that tend to have high predictive validity also often have high adverse impact, whereas predictors that tend to have lower adverse impact also often have lower validity. In other words, no one predictor is perfect. Compensatory methods of selection allow organizations to address the imperfections of any given selection instrument by combining predictors with different strengths. A particularly popular compensatory technique used to combat the adverse impact-validity tradeoff is combining noncognitive predictors with cognitive predictors (De Soete, Lievens, & Druart, 2012). In doing so, organizations seek to offset the higher adverse impact typically associated with cognitive predictors with the lower adverse impact associated with noncognitive predictors. Thus, compensatory methods are commonly used to combine personality tests with cognitive tests. However, while it is possible for these combinations to reduce subgroup differences when the assessments are uncorrelated, they are unlikely to reduce those differences as much as expected and may even exacerbate differences among groups when the assessments are moderately correlated (Sackett & Ellingson, 1997; Sackett, Schmitt, Ellingson, & Kabin, 2001; Schmitt, Rogers, Chan, Sheppard, & Jennings, 1997).

In summary, organizations often form statistical composites of assessments to obtain an overall estimate of a candidate's qualifications. These approaches are considered compensatory, since high scores on one assessment can balance out lower scores on another assessment. Compensatory selection procedures are often used in an attempt to balance the goals of low adverse impact and high decision quality. While compensatory approaches have a number of strengths, they also raise challenging questions. For example, organizations need to consider not only how to form the predictor composites (e.g., rational, unit, regression, or Pareto-optimal weighting) but also whether and how to form a criterion composite. We suggest that organizations calculate Pareto-optimal weights when determining predictor composites in order to obtain the preferred balance between adverse impact and predicted performance. Next, we discuss a noncompensatory technique: the multiple-hurdle approach to selection.

MULTIPLE HURDLE

Finally, employers who want to use more than one assessment to determine which applicants are most qualified may turn to a multiple-hurdle approach. In multiple-hurdle selection approaches, employers sequence assessments rather than combining assessment scores to determine whom to hire. For example, rather than weighting a cognitive ability test and a personality test together to determine one overall qualification score, a multiple-hurdle procedure might require applicants to first pass the cognitive ability test and then pass the personality assessment. Thus, multiple-hurdle approaches are considered noncompensatory. In order to succeed in a multiple-hurdle testing environment, applicants must pass every assessment. In contrast, as we have discussed previously, in order to succeed in a compensatory testing environment, high scores on one assessment can balance out lower scores on another assessment.

In addition to their noncompensatory nature, multiple-hurdle approaches to selection rest on other underlying assumptions. One particularly salient assumption is that the relationship between knowledge, skills, and abilities and job performance is nonlinear. The compensatory approach is linear—simply weight each assessment and map a line of predicted performance using the weighted composite. In contrast, the multiple-hurdle approach assumes that assessments cannot be so easily combined to linearly predict performance. A second, related, key assumption—assuming a classic approach to multiple-hurdle selection—is that there are only two groups of applicants: acceptable and not acceptable. In other words, everyone who passes the set cutoff score for each assessment instrument is considered equally desirable.

Multiple-hurdle approaches are used frequently in practice. They are often used when a large number of people apply for a given job. By using a multiple-hurdle approach in these situations, employers are able to save money by administering successive assessments to an ever-decreasing number of applicants. Multiple-hurdle approaches are also used when highly involved, resource-intensive assessments are part of the process. Specifically, if an employer is using an assessment center or other highly involved assessment, sequencing the hurdles appropriately may save both time and money, as illustrated in the sketch that follows.
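A minimal sketch of the cost logic, with invented applicant volume, per-applicant costs, and stage pass rates: ordering the cheap screen first means the expensive interview is administered only to the small group that survives the earlier hurdles.

```python
applicants = 5000

# Hypothetical per-applicant costs and pass rates, cheapest assessment first.
stages = [
    ("online cognitive screen",    5.00, 0.50),
    ("situational judgment test", 15.00, 0.60),
    ("structured interview",     150.00, 0.40),
]

remaining, total_cost = applicants, 0.0
for name, cost_per_person, pass_rate in stages:
    total_cost += remaining * cost_per_person
    passed = round(remaining * pass_rate)
    print(f"{name}: {remaining} tested, {passed} advance")
    remaining = passed

print(f"Total testing cost: ${total_cost:,.0f} for {remaining} finalists")
```

With these made-up numbers, the three-stage sequence costs roughly a third of what administering all three assessments to every applicant would cost.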


There are a number of strengths to the multiple-hurdle approach to selection. One strength is that it is easy to weed out large numbers of applicants early in the process. As mentioned earlier, doing so allows organizations to reserve more expensive predictors for the most promising applicants. Additionally, using a multiple-hurdle approach strategically may help organizations reduce their adverse impact (De Corte, Lievens, & Sackett, 2006; Sackett & Roth, 1996).

While multiple-hurdle approaches to selection are appealing for reducing expenditures and adverse impact, their use also has several drawbacks. One challenge is where and how to set each cutoff. As discussed, applicants above the cutoff are considered interchangeable. Therefore, setting a cutoff takes the multiple-hurdle approach away from an emphasis on criterion-related validity (in which a linear model is assumed and higher scores always indicate better predicted performance). Instead, the multiple-hurdle approach to selection adopts a content validity perspective, where having an adequate amount of each knowledge, skill, or ability is more important than having more of each KSAO. Also, because applicants are treated interchangeably past the cutoff, it is possible that employers do not get the top performers if the underlying relationship between the construct measured by the assessment and performance is linear. See Chapter 8 (this volume) for additional discussion of setting cutoff scores within the multiple-hurdles framework.

We have several recommendations for employers who wish to use a multiple-hurdle approach to selection. First, to reduce adverse impact, prevailing wisdom has held that tests with more adverse impact should be administered later in the hurdle process, and tighter selection should occur in the first hurdle (Sackett & Roth, 1996). In other words, popular belief indicates that the first stage of a selection process should employ low-adverse-impact tests, and comparatively few applicants should make it beyond this stage. The second stage, applied to this smaller group, should then employ a higher-adverse-impact assessment, and proportionally fewer applicants should be weeded out by this assessment. By being selective in the first hurdle—which has less adverse impact—and then applying the higher-adverse-impact assessments, organizations can reduce adverse impact with limited loss in predictive validity. However, a simulation suggests that this wisdom may not hold in all scenarios. Specifically, if predictors have roughly the same validity but differ in adverse impact, then high-impact assessments should precede lower-impact assessments, and selectivity should be equal or less severe in the first stage (De Corte et al., 2006). That is, when predictors differ in adverse impact but not in validity, adverse impact is reduced by taking the exact opposite approach to that suggested by Sackett and Roth (1996). According to this recommendation, there are instances when organizations would want to use their high-adverse-impact tests in the first hurdle and allow proportionally slightly more applicants through. Then, organizations would use lower-adverse-impact assessments in the later hurdle and allow proportionately slightly fewer applicants through.

In summary, a multiple-hurdle approach to selection allows organizations to sequence assessments in order to minimize adverse impact and cost.
This approach takes a noncompensatory view of selection: applicants must clear the cutoff on every assessment to pass the selection process. Moreover, the theory behind multiple-hurdle approaches to selection is nonlinear; above the cutoff score, all applicants are considered equally qualified.
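
To give a sense of how the ordering question can be examined, the sketch below simulates two correlated predictors that differ in subgroup mean differences, applies them as top-down hurdles in each of the two possible orders, and reports the resulting adverse impact ratio (minority selection rate divided by majority selection rate). All parameter values are assumed for illustration; this is a hedged, simplified sketch, not a reproduction of the Sackett and Roth (1996) or De Corte et al. (2006) simulations.

```python
import numpy as np

rng = np.random.default_rng(42)
n_majority, n_minority = 8000, 2000
d_high, d_low = 1.0, 0.3        # assumed subgroup mean differences (in SD units)
r = 0.5                         # assumed correlation between the two predictors
cov = np.array([[1.0, r], [r, 1.0]])

def simulate_group(n, shifted):
    # Predictor 1 carries the larger mean difference; predictor 2 the smaller one.
    means = np.array([-d_high, -d_low]) if shifted else np.zeros(2)
    return rng.multivariate_normal(means, cov, size=n)

scores = np.vstack([simulate_group(n_majority, False), simulate_group(n_minority, True)])
is_minority = np.array([False] * n_majority + [True] * n_minority)

def two_stage_ai(first, second, keep1=0.5, keep2=0.4):
    """Top-down hurdles: keep the top share keep1 on the first predictor,
    then the top share keep2 of the survivors on the second predictor."""
    cut1 = np.quantile(scores[:, first], 1 - keep1)
    survive = scores[:, first] >= cut1
    cut2 = np.quantile(scores[survive, second], 1 - keep2)
    hired = survive & (scores[:, second] >= cut2)
    return hired[is_minority].mean() / hired[~is_minority].mean()

print("Higher-impact predictor first, AI ratio:", round(two_stage_ai(first=0, second=1), 3))
print("Lower-impact predictor first, AI ratio: ", round(two_stage_ai(first=1, second=0), 3))
```

Varying the assumed mean differences, predictor correlation, and stage-wise selectivity shows how sensitive the adverse impact ratio is to these design choices, which is the practical point of the cited simulation studies.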

USING MULTIPLE COMBINATION METHODS

We have discussed each of these techniques separately, almost as if organizations never combine these methods when making decisions about applicants. In reality, these techniques can be combined to help identify the best overall candidate. For example, let us assume that an organization uses three different assessments (e.g., a cognitive ability test, a physical skill assessment, and an interview) to evaluate its job applicants. Pass scores could be identified separately for each assessment tool, and the tools could then be used in a multiple-hurdle fashion. However, what happens when multiple applicants score above all three pass scores? Although the remaining applicants all have the requisite level of each latent skill or ability, the employer must still determine whom to hire. The compensatory approach can be used


to combine scores across all three predictors. This way, strengths in one area can compensate for moderate weaknesses in another. However, true deficiencies in one area can never be compensated for by strengths in another, because individuals whose scores fall below a pass score have already been eliminated from the pool of potential hires by the multiple-hurdle technique. Thus, in applied settings, it is likely that the various techniques are used together to provide the optimal and most appropriate decisions regarding applicants.
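
A rough sketch of this combined strategy appears below: hurdles first screen out anyone below a pass score, and a compensatory composite then ranks the survivors. The names, scores, pass scores, and weights are hypothetical; in practice, scores on different scales would typically be standardized before being weighted.

```python
# Hypothetical sketch: multiple hurdles to screen, then a compensatory
# composite to rank the applicants who clear every hurdle.

applicants = {
    "Garcia": {"cognitive": 78, "physical": 85, "interview": 4.2},
    "Smith":  {"cognitive": 92, "physical": 55, "interview": 4.8},  # fails the physical hurdle
    "Osei":   {"cognitive": 81, "physical": 90, "interview": 3.9},
    "Lee":    {"cognitive": 70, "physical": 72, "interview": 4.6},
}

pass_scores = {"cognitive": 70, "physical": 60, "interview": 3.5}  # assumed cutoffs
weights = {"cognitive": 0.5, "physical": 0.2, "interview": 0.3}    # assumed weights

# Stage 1: noncompensatory screen -- every pass score must be met.
survivors = {
    name: scores for name, scores in applicants.items()
    if all(scores[k] >= pass_scores[k] for k in pass_scores)
}

# Stage 2: compensatory composite used only to rank the survivors.
ranked = sorted(
    survivors,
    key=lambda name: sum(weights[k] * survivors[name][k] for k in weights),
    reverse=True,
)
print("Hiring order among survivors:", ranked)
```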

CONCLUSION

In this chapter, we discussed various techniques that have been developed to enable organizations to combine information from multiple assessments to form a holistic impression of their job applicants. Specifically, we discussed qualitative methods (aka clinical assessment) and several quantitative methods (e.g., compensatory, multiple-hurdle, Pareto-optimal). The strengths and weaknesses of these techniques, along with recommendations from the literature regarding best practices when using these methods, were noted.

REFERENCES

Ægisdóttir, S., White, M. J., Spengler, P. M., Maugherman, A. S., Anderson, L. A., Cook, R. S., Nichols, C. N., Lampropoulos, G. K., Walker, B. S., Cohen, G., & Rush, J. D. (2006). The meta-analysis of clinical judgment project: Fifty-six years of accumulated research on clinical versus statistical prediction. The Counseling Psychologist, 34, 341–382.
Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rating training. Academy of Management Review, 6, 205–212.
Bobko, P., Roth, P. L., & Buster, M. A. (2007). The usefulness of unit weights in creating composite scores. Organizational Research Methods, 10, 689–709.
De Corte, W., Lievens, F., & Sackett, P. R. (2006). Predicting adverse impact and mean criterion performance in multistage selection. Journal of Applied Psychology, 91, 523–537.
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393.
De Corte, W., Lievens, F., & Sackett, P. R. (2008). Validity and adverse impact potential of predictor composite formation. International Journal of Selection and Assessment, 16, 183–194.
De Corte, W., Sackett, P., & Lievens, F. (2010). Selecting predictor subsets: Considering validity and adverse impact. International Journal of Selection and Assessment, 18, 260–270.
De Soete, B., Lievens, F., & Druart, C. (2012). An update on the diversity-validity dilemma in personnel selection: A review. Psychological Topics, 21, 399–424.
Grove, W. M., Zald, D. H., Lebow, B. S., Snitz, B. E., & Nelson, C. (2000). Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 12, 19–30.
Hattrup, K., & Rock, J. (2002). A comparison of predictor-based and criterion-based methods for weighting predictors to reduce adverse impact. Applied H.R.M. Research, 7, 22–38.
Hattrup, K., Rock, J., & Scalia, C. (1997). The effects of varying conceptualizations of job performance on adverse impact, minority hiring, and predicted performance. Journal of Applied Psychology, 82, 656–664.
Highhouse, S. (2008). Stubborn reliance on intuition and subjectivity in employee selection. Industrial and Organizational Psychology, 1, 333–342.
Hollenbeck, G. P. (2009). Executive selection—What's right . . . and what's wrong. Industrial and Organizational Psychology, 2, 130–143.
Hough, L. M., & Ones, D. S. (2001). The structure, measurement, validity, and use of personality variables in industrial, work, and organizational psychology. In N. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work and organizational psychology (Vol. 1, pp. 233–377). London: Sage.
Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration of adverse impact in personnel selection procedures: Issues, evidence, and lessons learned. International Journal of Selection and Assessment, 9, 152–194.
Johnson, J. W., & Carter, G. W. (2010). Validating synthetic validation: Comparing traditional and synthetic validity coefficients. Personnel Psychology, 63, 755–795.
Kanfer, R., Wolf, M. B., Kantrowitz, T. M., & Ackerman, P. L. (2010). Ability and trait complex predictors of academic and job performance: A person-situation approach. Applied Psychology, 59, 40–69.


Kuncel, N. R., Klieger, D. M., Connelly, B. S., & Ones, D. S. (2013). Mechanical versus clinical data combination in selection and admissions decisions: A meta-analysis. Journal of Applied Psychology, 98, 1060–1072.
Morris, S. B., Daisley, R. L., Wheeler, M., & Boyer, P. (2014). A meta-analysis of the relationship between individual assessments and job performance. Journal of Applied Psychology, 100, 5–20.
Pulakos, E. D. (1986). The development of training programs to increase accuracy with different rating tasks. Organizational Behavior and Human Decision Processes, 38, 76–91.
Pyburn, K. M., Jr., Ployhart, R. E., & Kravitz, D. A. (2008). The diversity-validity dilemma: Overview and legal context. Personnel Psychology, 61, 143–151.
Sackett, P. R., & Ellingson, J. E. (1997). The effects of forming multi-predictor composites on group differences and adverse impact. Personnel Psychology, 50, 707–722.
Sackett, P. R., & Roth, L. (1996). Multi-stage selection strategies: A Monte Carlo investigation of effects on performance and minority hiring. Personnel Psychology, 49, 549–572.
Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B. (2001). High-stakes testing in employment, credentialing, and higher education. American Psychologist, 56, 302–318.
Schleicher, D. J., Day, D. V., Mayes, B. T., & Riggio, R. E. (2002). A new frame for frame-of-reference training: Enhancing the construct validity of assessment centers. Journal of Applied Psychology, 87, 735–746.
Schmitt, N., Rogers, W., Chan, D., Sheppard, L., & Jennings, D. (1997). Adverse impact and predictive efficiency of various predictor combinations. Journal of Applied Psychology, 82, 719–730.
Thornton, G. C., III, Hollenbeck, G. P., & Johnson, S. K. (2010). Selecting leaders: Executives and high potentials. In J. L. Farr & N. T. Tippins (Eds.), Handbook of employee selection (pp. 823–840). New York, NY: Routledge/Taylor & Francis Group.
Weiner, I. B. (2003). The assessment process. In J. R. Graham & J. A. Naglieri (Eds.), Handbook of psychology: Vol. 10. Assessment psychology (pp. 3–25). Hoboken, NJ: Wiley.
Woehr, D. J., & Huffcutt, A. I. (1994). Rater training for performance appraisal: A quantitative review. Journal of Occupational and Organizational Psychology, 67, 189–205.


18
CHOOSING A PSYCHOLOGICAL ASSESSMENT
Reliability, Validity, and More

MICHAEL J. ZICKAR, JOSE M. CORTINA, AND NATHAN T. CARTER

Consumers of psychological tests have a large number of tests to choose from these days and often have little factual information that can be used to pick a particular test. Googling "tests for hiring employees" returns roughly 48,200,000 hits; "personality tests," 10,300,000 hits; "integrity tests," 7,220,000 hits; and "tests to hire salespeople," 693,000 hits. Clicking on some of these sites, we found claims such as "Never hire a bad salesperson again"; "Our sales assessment validity is backed by brain research. No other sales assessment is"; and "You can start testing your job candidates today—it's that quick and easy!" These quick-and-easy, foolproof solutions might seem attractive to employers who need a hiring solution but have little expertise with which to choose among tests and vendors. Fortunately, industrial-organizational psychologists have accumulated more than 100 years of research and practice that can help people choose tests appropriate for a particular job. In this chapter, we review some of the key concepts underlying the science of testing, particularly reliability and validity. Then we discuss how employers can use these concepts, as well as relevant information that should be provided by any reputable test developer (but often is not!), to choose the test best suited to their particular needs.

RELIABILITY

Even though reliability theory is one of the first topics covered in graduate measurement courses, it is one of the most misunderstood. Most students learn about reliability in the context of classical test theory and are deceived by the simple formula X = T + E, in which an observed score is mysteriously parsed into a true score, T, and error, E. Students who delve a little deeper into reliability theory realize that there is little "true" about the true score, and that what they think is error often is not. What novice researchers often miss is that the source of error identified in a particular measure is dictated by the type of reliability coefficient calculated. In this section, we focus on three common types of error that are often present in psychological measures: error associated with different items, error associated with different raters, and error due to momentary, time-limited phenomena. As a test consumer, you will want to pay keen attention to the level of reliability reported as well as the types of coefficients presented. Also, as we will discuss, the level of reliability needed will be dictated partially by how you plan to use the test.
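
For readers who want the formal statement behind these ideas, the standard classical test theory decomposition and the definition of reliability can be written as follows (textbook results, assuming errors are uncorrelated with true scores):

```latex
X = T + E, \qquad \operatorname{Cov}(T, E) = 0
\;\;\Rightarrow\;\;
\sigma^2_X = \sigma^2_T + \sigma^2_E,
\qquad
\rho_{XX'} \;=\; \frac{\sigma^2_T}{\sigma^2_X} \;=\; \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}.
```

Which sources of variance end up counted in the error term depends on how reliability is estimated, which is the point developed in the subsections that follow.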



Error Due to Items

When one of the authors [Zickar] took the GRE Psychology Subject exam, there was an item that asked something like "What was the name of the first computerized learning system?" He got that item correct, not because he knew a lot about psychology, but because he had been an undergraduate student at the University of Illinois, where nearly every freshman had to use the computerized system PLATO to learn chemistry, mathematics, or economics. In a sense, Zickar got one extra item correct because the unique content of one item was biased in his favor. Students from other universities across the country and world were not so lucky. Internal consistency measures of reliability, such as the popular coefficient alpha, are largely a function of inter-item covariances. As items relate more strongly with each other, holding all else equal, internal consistency reliability increases. Tests that have a large percentage of items dominated by unique variance will be more susceptible to error due to individual items and, therefore, have lower internal consistency reliability. In addition, all else being equal, scales with few items are more susceptible to the unique influence of individual items. For example, if the GRE Psychology test had only three items and one of them was related to the PLATO learning system, Zickar's score would have been greatly inflated. As it was, the small increase that he got by having "inside information" on that one item probably made little difference on his overall test score, given the large number of items on the subject test. Although it might be tempting to eliminate error due to the uniqueness of individual items by administering a scale consisting of items that ask the same question in slightly different ways, this approach runs the risk of compromising measure sufficiency. Research has shown that, although asking the same question in slightly different ways may produce a high internal consistency index, the resulting narrowness of the scale may reduce validity (see Roznowski & Hanisch, 1990). A better way to minimize the error associated with unique item content is to increase the number of items while making sure that individual items do not share construct-irrelevant components (i.e., are not contaminated). As a test consumer, if measurement precision is of key importance, make sure to avoid tests that report high reliabilities but are extremely short.
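
As an illustration of how internal consistency reflects inter-item covariances, the sketch below computes coefficient alpha from a small matrix of item responses (rows are respondents, columns are items). The data are made up for illustration.

```python
import numpy as np

def coefficient_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point responses from six people to four items.
responses = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])

print(f"alpha = {coefficient_alpha(responses):.2f}")
```

Because the formula depends on the ratio of summed item variances to total-score variance, items that covary weakly with the rest of the scale (for example, an item tapping idiosyncratic content such as the PLATO example) pull alpha down, and very short scales leave alpha especially sensitive to any one item.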

Error Due to Raters

The classic Japanese movie Rashomon is a good way to understand the nature of rater error. In that movie, several observers witness the same crime, yet their retellings of what they observed differ vastly. When observing behavior or coding written responses, observers interpret information differently. Some raters are more lenient; others are more stringent. Some interviewers might give preference to blondes, whereas others may unconsciously give high ratings to people who wear blue ties. Differences in rater behavior can sometimes be reduced by providing training, though given the different ways in which individuals view the world, these differences are unlikely to be completely eliminated. Most tests that will be considered for selection will not have this source of error, given that most pre-employment tests rely on objectively scored items that require no individual rater to make a judgment. Tests that involve projective items, as well as work samples and standardized interviews, however, require individual raters to interpret test behaviors, thus potentially introducing this type of error. When raters are involved in judging job-related variables, research has shown that this type of error can be significant. For example, Woehr, Sheehan, and Bennett (2005) found that unique, idiosyncratic source-specific factors were responsible for two-thirds of the variance in performance ratings. Employment interview researchers have also demonstrated that the inter-rater reliability of interviews is typically fairly low (see Conway, Jako, & Goodman, 1995). There are many ways to reduce the amount of error related to raters. If at all possible, it is important to standardize the nature of the information that different raters observe. In addition, providing frame-of-reference training (e.g., Conway et al., 1995) that attempts to establish common standards of comparison might help improve inter-rater reliability. Computerized scoring


algorithms are used by large-scale testing companies to interpret and score written essays in the GRE and other certification tests, thereby eliminating the possibility of rater unreliability. If you cannot reduce error by standardizing information, the best way to reduce it is to increase the number of raters, thereby reducing error through aggregation in the same way that increasing the number of items reduces error due to unique item content. Averaging across a large number of raters cancels out the positive and negative errors associated with individual raters. In terms of choosing tests that require raters (e.g., projective tests, standardized oral interviews), make sure that you find out how many raters are needed to ensure reasonable reliability. See Greguras and Robie (1998) for procedures on how to determine the appropriate number of raters. For some tests, the number of raters needed to achieve acceptable reliability may be prohibitively costly.
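
One rough way to gauge how many raters are needed, assuming an estimate of single-rater reliability is available, is the Spearman-Brown prophecy formula for the reliability of an average of k parallel raters. Greguras and Robie (1998) use a fuller generalizability-theory treatment; the snippet below is a simplified sketch with an assumed single-rater reliability of .40.

```python
def reliability_of_k_raters(single_rater_reliability: float, k: int) -> float:
    """Spearman-Brown prophecy: reliability of the mean of k parallel raters."""
    r = single_rater_reliability
    return (k * r) / (1 + (k - 1) * r)

# How many raters does it take to push an averaged rating above .70,
# if a single rater yields a reliability of about .40 (an assumed value)?
for k in range(1, 7):
    print(f"{k} rater(s): {reliability_of_k_raters(0.40, k):.2f}")
```

Under these assumptions, about four raters are needed before the reliability of the averaged rating exceeds .70, which illustrates why rater-based assessments can become costly.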

Error Due to Momentary Time-Limited Factors

There are many reasons that scores on tests may vary from one administration to another. Unusual things can happen during testing sessions. For example, in an entrance testing session, one of our students witnessed another student vomiting (perhaps because of nervousness) in the vicinity of other students. It is possible that the students who were near the projectile vomiter would score lower on that particular administration than they would on administrations at other times. Although that is a bizarre, rare event, many time-limited errors can be traced to test administrators, the testing environment, or temporary issues related to the test taker. Test administrators can give too much time or not enough time. They can be unnecessarily harsh and intimidating, thus increasing test anxiety, or they can be so welcoming and pleasant that test takers do much better than normal. Administrators can give erroneous instructions, mishandle timing devices, or inadvertently give away correct answers to difficult items. With respect to the testing environment, the heating or air conditioning system can fail. A picture in the testing room of a previous school principal might remind a test taker of a mean uncle who used to taunt him about how he would be a failure his whole life, prompting that student to do poorly; or that student may be given a comfortable chair that fits him just right. In an unproctored Internet testing environment, test takers can choose where they take the test, further adding to standardization problems (see Tippins et al., 2006). Test takers can also have unique things happen to them on one testing occasion that might not happen on another. They can be hungover or sick with the flu. They could have just been dumped by a fiancée. They may have had an especially good night's sleep or an especially poor one. Regardless of the source of these momentary, time-limited effects, such events would be unlikely to recur if the test taker took the test at a different time. Events that are predictable and expected to occur every time a respondent takes a test would not be considered error, even if they are distinct from the construct the test is measuring. For example, test anxiety would not be considered error in the context of test-retest reliability if the test taker experienced the same level of anxiety each time s/he took a math test, even though test anxiety is clearly a different construct from mathematics ability. Although it would be impossible to eliminate all sources of time-limited error, it is possible to minimize the effects of error due to administration and environment by standardizing instructions and environments for test takers. Measures of reliability sensitive to time-limited factors, such as test-retest reliability, rest on the assumption that all score differences across two separate testing administrations are due to momentary, time-limited errors. Of course, differences in scores across two administrations can be due not only to time-limited errors such as those mentioned here but also to true change in the underlying construct. For example, dramatic changes in scores on a vocabulary test administered six months apart may be due to true growth in vocabulary rather than momentary, time-limited errors (see Hausknecht, Halpert, Di Paolo, & Moriarty Gerrard, 2007). As a test user, you have a lot of control over minimizing this source of error.
Standardizing test administration for all examinees is an important step to ensure that error under your control is minimized. Some test


administrators have found that providing video-based test instructions and introductions is helpful so that everyone receives precisely the same instructions. Also, making sure applicants are isolated from outside distractions is especially important. Make sure to follow the administration instructions given in test manuals. One can never eliminate this type of error, but standardizing the testing experience for all test takers helps minimize it. In unproctored Internet testing, instructions should direct test takers to ensure that the testing environment is free from distractions and that no outside help is used to answer items.

Conclusions on Sources of Error

Error can contaminate our measures and compromise measurement. This can wreak havoc in work contexts such as top-down selection, where small differences in test scores might have significant consequences. Test developers are often unaware of the amount of error that their tests are likely to generate because they have used a single operationalization of reliability, generally internal consistency, which is sensitive to one source of error but ignores others. One way to estimate the effects of multiple sources of error is through the application of Generalizability Theory (GT), an ANOVA-based approach that can be used to determine the magnitude of various sources of error simultaneously. GT approaches to error estimation are used less frequently than traditional approaches to reliability because they require more extensive data collections (especially compared to internal consistency analyses). For readers interested in GT, we refer them to Chapter 1 in this Handbook (Putka), as well as to Shavelson and Webb (1991). Test developers rarely report GT coefficients, however, so as a test user you will likely be forced to rely on individual reliability coefficients such as test-retest and internal consistency coefficients.
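
To give a flavor of the GT approach, the sketch below estimates variance components for a simple persons-by-raters crossed design with one observation per cell, using the standard random-effects expected mean squares. The ratings are hypothetical, and a real G study would typically involve more facets and dedicated software.

```python
import numpy as np

# Hypothetical ratings: six persons (rows) rated by three raters (columns).
ratings = np.array([
    [3.0, 4.0, 3.5],
    [2.0, 2.5, 2.0],
    [4.5, 5.0, 4.0],
    [3.5, 3.0, 3.5],
    [1.5, 2.0, 2.5],
    [4.0, 4.5, 5.0],
])
n_p, n_r = ratings.shape
grand = ratings.mean()
person_means = ratings.mean(axis=1)
rater_means = ratings.mean(axis=0)

# Mean squares for a persons x raters crossed design (one observation per cell).
ms_person = n_r * ((person_means - grand) ** 2).sum() / (n_p - 1)
ms_rater = n_p * ((rater_means - grand) ** 2).sum() / (n_r - 1)
residual = ratings - person_means[:, None] - rater_means[None, :] + grand
ms_resid = (residual ** 2).sum() / ((n_p - 1) * (n_r - 1))

# Variance components from the expected mean squares.
var_person = max((ms_person - ms_resid) / n_r, 0.0)
var_rater = max((ms_rater - ms_resid) / n_p, 0.0)
var_resid = ms_resid

# Generalizability coefficient for the mean of n_r raters (relative decisions).
g_coef = var_person / (var_person + var_resid / n_r)
print(f"person = {var_person:.3f}, rater = {var_rater:.3f}, "
      f"residual = {var_resid:.3f}, G = {g_coef:.3f}")
```

The appeal of GT is visible even in this toy example: person, rater, and residual variance are separated in a single analysis, whereas coefficient alpha or a test-retest correlation each fold a different subset of these sources into a single "error" term.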

How to Use Reliability Information in Choosing a Test

First, all reputable testing firms should be able to provide reliability information about the tests they are selling. Most tests report only a single reliability coefficient, typically coefficient alpha, which is sensitive to error due to items but ignores error due to time or raters. That type of reliability coefficient may be useful for certain purposes, though less useful for others. Any test that requires subjective scoring by a rater (e.g., structured interviews and projective tests) should report consistency across raters. In general, you will want to see multiple types of reliability presented. In addition to the types of reliability reported, the desired level of reliability may differ depending on the way the test is used. Remember that reliability is related to the uncertainty of a test score, which is often best quantified by the standard error of measurement (SEM). Tests that play an important part in determining whether somebody is hired or promoted need higher levels of reliability than tests that are given little weight or used as a rough screening device early in the process. Therefore, if you are using a single test to hire your next CEO, that test should have extremely high reliability (and validity!), but if you were using a battery of tests to screen out the bottom 20% of candidates for an entry-level position, lower levels of reliability might be tolerated. In addition, tests with less significant consequences, such as tests used for staff development, can have lower levels of reliability. Finally, somewhat lower reliability might be tolerated when scores are averaged across a group of individuals. For example, if you are using a cognitive ability test to determine whether applicants from one region score higher than applicants from another region, lower levels of reliability can be tolerated, given that errors across individuals tend to cancel out in the aggregate. As noted here, the target level of reliability depends on the particular use of a test; therefore, it is difficult to give a single value of reliability that a test must meet. One generalization that is safe to make, though, is that any test publisher who does not make appropriate reliability information available should be avoided!
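
As a concrete illustration of how reliability translates into score uncertainty, the SEM can be computed as the score-scale standard deviation times the square root of one minus the reliability. The standard deviation and observed score below are assumed values chosen for illustration.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

sd = 10.0  # assumed standard deviation of the test's score scale
for rel in (0.70, 0.80, 0.90, 0.95):
    sem = standard_error_of_measurement(sd, rel)
    # Approximate 95% band around an observed score of 100 (illustrative only).
    print(f"reliability {rel:.2f}: SEM = {sem:.1f}, 95% band ~ 100 +/- {1.96 * sem:.1f}")
```

Even at a reliability of .90, the band around a single score spans several points on this scale, which is why small score differences should not drive high-stakes, top-down decisions when reliability is modest.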



VALIDITY

Our review of validity focuses on sufficiency and contamination, two concepts that are deemed critical for demonstrating evidence of content validity. Several types of evidence can be collected to support test validation, including evidence from criterion-related validity, content-oriented validity, and construct validity. We believe that all forms of validity are related to one another and that the concepts of sufficiency and contamination, although most often raised in discussions of content validity, are relevant to all forms of validity (see Landy, 1986). For example, the SIOP Principles (SIOP, 2003) discuss contamination in the context of content validity, criterion-related validity, and item bias. We believe that understanding these fundamental issues related to test validity is important for test consumers who want to make better choices about which tests to use.

Sufficiency

In discussions of validity, it is often asked whether the test in question covers all of the ground that it should. For example, measures of job performance have been expanded to accommodate dimensions that have been added to models of job performance (e.g., adaptive performance; Pulakos, Arad, Donovan, & Plamondon, 2000), and measures of intelligence have been expanded to accommodate dimensions that have been added to models of intelligence (e.g., practical intelligence; Sternberg, Wagner, Williams, & Horvath, 1995). The criticism to which these expansions responded was that prior selection measures (both predictors and criteria) often failed to capture the entire domain of the construct being measured (i.e., they were insufficient). Consider the example of adaptive performance. Pulakos et al. (2000) argued that employees often engaged in work behaviors that contributed to organizational effectiveness but were not recognized by existing models and measures of job performance. Specifically, they suggested that categories of work behavior such as Handling Crises and Cultural Adaptability were crucial to effectiveness in some organizations but were conspicuously absent from existing measures of job performance. One might conclude from this that existing measures were insufficient. It would be more appropriate, however, to say that existing models of performance were insufficient, and that the measures merely reflected the inferior models on which they were based. If we assume that a measure is unidimensional, then insufficiency can only indicate factorial complexity at the model level. It seems more parsimonious, then, to stipulate that sufficiency is a property of conceptual models rather than of measures. Once a model has been deemed to cover the full breadth of its domain (e.g., a performance model that consists of technical performance, contextual/citizenship performance, adaptive performance, interpersonal performance, etc.), then unidimensional scales measuring each factor can be developed. Reliability then reflects the proportion of true score variance, and validity represents lack of contamination (i.e., the introduction of construct-irrelevant variance into a measure). This position may seem at odds with the Standards for Educational and Psychological Testing. In the section on content-related evidence, it is stated that

    construct underrepresentation . . . may give an unfair advantage or disadvantage to one or more subgroups. Careful review of the construct and test content domain by a diverse panel of experts may point to potential sources of irrelevant difficulty (or easiness) that require further investigation. (AERA et al., 1999, p. 12)

There are several observations to be made about this passage. The first is that sufficiency is inextricably intertwined with content-related validity evidence. Evidence of insufficiency comes from a comparison of test content to the “content domain.” Omissions suggest insufficiency. Second, the solution that is offered in the passage has to do with contamination rather than sufficiency. This may have been incidental, but it may also have been due to an inability to refer to insufficiency without also referring to deficiencies in the definitions of the construct of interest and of the domain of items that apply to it. Third, this passage is representative of the


Standards as a whole in that nowhere in the Standards are issues of sufficiency raised without reference to content-oriented approaches to validity. Although the term "sufficiency" does not appear in the index of the Standards or in any of the relevant standards (e.g., 1.6, 1.7, 3.2, 3.6, 14.8, 14.11), issues related to sufficiency appear in every section that deals with content-related evidence. Issues relating to contamination, on the other hand, appear in every section that deals with evidentiary bases of inferences. Content-related validity provides an appropriate framework for determining the extent of insufficiency. Our position is that content-related evidence has the potential to expose insufficiency only if the construct is poorly specified. If the construct is well specified, then insufficiency is not possible in the absence of egregious oversight. Therefore, we recommend that, to ensure sufficiency, researchers spend additional effort explaining the conceptual foundations of their measures. In our experience, many scale development efforts jump straight into writing items, with little attention paid to a careful explication of the construct that those items are supposedly measuring. Engaging in more "up-front" thinking about the target construct will help ensure sufficiency. In addition, it is useful to think of sufficiency in terms of a battery of tests. If one particular test is insufficient in capturing the range of constructs needed to perform a particular job well, then other tests can be used to supplement that single measure. For test users, it is very important to compare the critical KSAOs derived from a professionally conducted job analysis with the content of the test items under consideration. In judging whether a test is reasonably sufficient, the quality of the job analysis is crucial. Many test publishers will help you conduct a job analysis and then use those results to identify tests that represent the constructs identified in the job analysis.

Contamination

As noted in the introduction, measurement contamination implies that a particular measure is influenced by unwanted sources of variance that are distinct from the construct of interest. Confirmatory factor analytic (CFA) frameworks are helpful for understanding the complex, multidimensional nature of contamination because they isolate different sources of variance. As will be noted throughout this section, concern for contamination is motivated not only by the psychometric goal of creating a "pure" measure but also by a desire to minimize sources of irrelevant variance that covary with membership in demographic subgroups that are accorded special protection under U.S. employment law. Therefore, all I-O psychologists should be concerned with the contamination of their instruments. Given the complexity of the analyses that can be used to quantify contamination, we devote more space to this topic than to reliability and sufficiency. Although these sources of irrelevant variance could arise from method effects, response styles, or irrelevant constructs, within a selection context the largest concern centers on irrelevant variance that is associated with membership in legally protected classes. U.S. employment law prohibits making employment decisions on the basis of group membership in terms of race, color, religion, gender, nationality (Civil Rights Act of 1964), age (Age Discrimination in Employment Act of 1967), and disability (Americans with Disabilities Act of 1990), whether or not this is the employer's intent. In this sense, the use of test scores that vary on the basis of race or another protected characteristic can create adverse impact in the legal sense, increasing an employer's chances of involvement in litigation (Williamson, Campion, Malos, Roehling, & Campion, 1997). Aside from legal concerns, ignoring measurement differences among subpopulations can negatively affect decisions based on organizational research (Drasgow, 1984, 1987) and diversification efforts (Offerman & Gowing, 1993), and can cause negative applicant reactions to the assessment (Gilliland & Steiner, 1999). Thus, it is imperative for researchers in organizations to examine whether the adequacy of an assessment method is similar across groups that may be legally, practically, or theoretically important. In addition to legal and practical concerns, the consideration of potential differences across subpopulations has scientific value. For example,


hypotheses about cultural differences can be tested by examining the ways in which people from different cultures respond differently to certain items.

How to Use Validity Evidence to Help Choose a Test

The first point to remember is that the validity of the test you are using depends on the particular purpose for which it is being used. Test publishers who claim that their test is valid without specifying the context should not be taken seriously. A knowledge test that has been shown to predict success for actuarial scientists will likely not be valid for predicting whether a comedian will generate consistent applause and attendance. A reputable test publisher should be able to provide validation evidence from previous studies to help a test user decide whether a particular test is likely to be valid for a particular use. In earlier times, I-O psychologists were concerned about situational specificity, the idea that validities might vary significantly across situations (with an ambiguous understanding of which situational factors mattered); with the popularization of meta-analysis and validity generalization, these concerns have lessened. Strong meta-analytic research has shown that cognitive ability tests have validity for nearly all occupations, though the validity is higher for more complex jobs (e.g., Schmidt & Hunter, 1998). In addition, evidence shows that personality traits such as conscientiousness have validity across a wide variety of occupations (e.g., Barrick & Mount, 1991). It is not enough, however, to rely on a general statement of validity generalization such as "Our test of conscientiousness is valid because meta-analyses have shown such tests to be valid across a wide range of occupations." First, just because a test is asserted to measure a particular construct does not mean that it actually does. Validity by assertion is not a technique recognized by I-O psychologists or respected in courts of law! A reputable test publisher will have correlated its test with other tests that measure similar constructs, demonstrating convergent validity. Second, it is important to assess whether the range of occupations for which the test has been used is similar to the ones for which you will be using it. Finally, it is important to assess the similarity of the situations in which the test has been used. Has it been validated only for personal development and self-insight, or has it been validated for high-stakes decision making? In terms of adverse impact and measurement invariance, reputable test developers should make mean differences by sex and race, as well as differential validity statistics, available for review. These statistics allow an organization to determine whether a particular test might affect the diversity of its hiring decisions. In some cases, organizations may still choose a test with adverse impact because it has the highest validity among the available alternatives.
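
As a sketch of the kinds of statistics a test user might request or compute, the snippet below calculates the standardized subgroup mean difference (Cohen's d) on a predictor and compares criterion-related validities estimated separately within each group. The data are simulated with assumed values; real evaluations would use the publisher's or the organization's own samples.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical predictor (x) and criterion (y) scores for two groups.
n1, n2 = 300, 120
x1 = rng.normal(0.0, 1.0, n1)               # reference group predictor scores
x2 = rng.normal(-0.5, 1.0, n2)              # focal group, assumed lower mean
y1 = 0.5 * x1 + rng.normal(0.0, 0.87, n1)   # criterion, validity around .50 by construction
y2 = 0.5 * x2 + rng.normal(0.0, 0.87, n2)

# Standardized mean difference on the predictor (pooled SD).
pooled_sd = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2))
d = (x1.mean() - x2.mean()) / pooled_sd

# Criterion-related validity (predictor-criterion correlation) within each group.
r1 = np.corrcoef(x1, y1)[0, 1]
r2 = np.corrcoef(x2, y2)[0, 1]

print(f"subgroup d = {d:.2f}, validity (group 1) = {r1:.2f}, validity (group 2) = {r2:.2f}")
```

A sizable d signals potential adverse impact if the test is used with top-down selection, while a large gap between the within-group validities would raise questions of differential validity that deserve closer scrutiny.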

OTHER CONSIDERATIONS FOR CHOOSING A TEST

Although reliability and validity should be the foundation for decisions about whether to choose a particular test or not, we realize that consumers of tests care about many other factors. In this section, we briefly review some more practical issues, such as test security, efficiency, access to norms, and delivery.

Test Security

Test consumers want to make sure that the scores an assessment yields are representative of the KSAOs the test taker actually possesses. Without test security, it may be difficult to know whether a person's score truly represents that individual's standing on the construct. Candidates might have other people take the test for them, may have access to the test beforehand, or may be able to obtain exam content from ancillary sources. Test security may be a


primary consideration for some test users and of minor importance to others. For tests that are used as the primary basis for making an important decision, test security may be a prime concern. For tests that are relatively low stakes, such as initial screening tests or tests used for developmental purposes, test security may not be a concern at all. For those who need high test security, there are several options. One of the best solutions may be to use computerized adaptive testing (CAT), which relies on item response theory (IRT) to select the items most appropriate for each individual. Well-designed CATs are high in security because the test each test taker receives differs from that of other test takers. Creating a well-designed CAT, however, is a serious endeavor that requires a large pool of pre-calibrated items. Developing a CAT may be beyond the capabilities of most organizations, though it may be possible for smaller organizations to use the same CAT as others or to use a test from a publisher that has the resources and client base to develop an effective CAT. If a company is unwilling to invest in a CAT, one simple solution that increases security, though not as much as a CAT, is to randomize the order of items throughout a test. This can make it more difficult for respondents to remember a string of answers, though the challenge is still not insurmountable. Beyond modifying the order of items, the general advice to keep materials as secure as possible seems warranted.
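
To illustrate the logic of a CAT at a very high level, the toy sketch below uses a Rasch (one-parameter) model with made-up item difficulties: each step administers the unused item expected to be most informative at the current ability estimate and then re-estimates ability with a simple grid-based estimate that assumes a standard normal prior. This is a teaching sketch, not production code; operational CATs add exposure control, content balancing, and far larger item pools.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy item bank: Rasch (1PL) difficulties, assumed to be pre-calibrated.
difficulties = np.array([-2.0, -1.2, -0.5, 0.0, 0.4, 1.0, 1.6, 2.3])
true_theta = 0.8                      # simulated examinee's ability (unknown to the CAT)
theta_grid = np.linspace(-4, 4, 161)  # grid used for ability estimation

def p_correct(theta, b):
    """Rasch model probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

administered, responses = [], []
theta_hat = 0.0                       # start at an average ability estimate

for step in range(5):                 # administer five items adaptively
    # Select the unused item with maximum Fisher information at theta_hat: p * (1 - p).
    p = p_correct(theta_hat, difficulties)
    info = p * (1 - p)
    info[administered] = -np.inf
    item = int(np.argmax(info))
    administered.append(item)

    # Simulate the examinee's response from the (unknown) true ability.
    responses.append(bool(rng.random() < p_correct(true_theta, difficulties[item])))

    # Re-estimate ability over the grid, assuming a standard normal prior.
    log_post = -0.5 * theta_grid ** 2
    for it, resp in zip(administered, responses):
        p_grid = p_correct(theta_grid, difficulties[it])
        log_post += np.log(p_grid) if resp else np.log(1 - p_grid)
    theta_hat = float(theta_grid[np.argmax(log_post)])
    print(f"step {step + 1}: item {item} (b = {difficulties[item]:+.1f}), theta_hat = {theta_hat:+.2f}")
```

Because every examinee sees a different, individually tailored sequence of items, exposure of any single item is reduced, which is the security advantage described above; the same targeting of informative items is also what makes CATs efficient, a point relevant to the next section.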

Efficiency

Another major consideration in choosing a test is efficiency. How much time does the test take to complete? Unfortunately, there tends to be a tradeoff between efficiency and reliability. Increasing the number of items in an assessment (assuming they are good items) increases the measurement precision of the test, although it also increases testing time. Clearly, in some situations testing time is at a premium. For example, a company may wish to bring an applicant in and have him/her complete some psychological tests while also providing a recruiting tour of the corporate facilities. In addition, for some jobs, applicants may not be willing to complete a test if it takes too long. Target stores, for example, have computerized kiosks where people can complete the assessment before or after shopping. Clearly, if the assessment took two hours, many good applicants would give up and carry on with their other activities. CAT is a great solution to the tradeoff between efficiency and measurement precision because a good CAT avoids administering items that provide little information about a particular test taker. If you are a mathematical genius, it is a waste of time to ask you basic algebra items; conversely, if you are less adept at mathematics, asking you to solve two equations in two unknowns is futile. Without CAT, test users need to determine how long an assessment takes for most applicants and whether shorter and longer forms of the test are available to allow some flexibility. In our field, there seems to be a trend toward making all scales as short as possible. The danger of administering three-item personality tests is that their reliability tends to be so low as to preclude making decisions about individuals based on those test scores.

Norms

Another consideration for choosing a particular test may be the availability of relevant norms. It might be extremely useful to compare individual scores to norms within a particular industry or country or across the general population. In fact, some tests are more useful because the organization responsible for the tests has collected norms across a variety of organizations, demographic groups, cultures, and industries. These norms can be extremely useful in interpreting individual scores, especially if you have a small number of people who will be completing your test. Of course, with norms it is important to determine whether they are relevant to your population. Knowing that your eighth grader is at the second percentile of all individuals who have completed the GRE Psychology Subject test is not a good indication of her/his potential for success in a psychology doctoral program.



CONCLUSIONS

The outlandish claims made by test promoters often make it difficult to determine whether a particular test will work as intended. Fortunately, the science of test development and evaluation has a long history and can be used to see through some of the claims used to advertise tests. We hope that this chapter's review of some of the fundamentals of reliability and validity provides a useful background for test users to make informed decisions.

REFERENCES

Age Discrimination in Employment Act of 1967, 29 U.S.C. § 621 (1967).
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Americans with Disabilities Act of 1990, 42 U.S.C. § 12101 (1990).
Barrick, M. R., & Mount, M. K. (1991). The big five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26.
Civil Rights Act of 1964, 42 U.S.C. § 253 (1964).
Conway, J. M., Jako, R. A., & Goodman, D. F. (1995). A meta-analysis of interrater and internal consistency reliability of selection interviews. Journal of Applied Psychology, 80, 565–579.
Drasgow, F. (1984). Scrutinizing psychological tests: Measurement equivalence and equivalent relations with external variables are central issues. Psychological Bulletin, 95, 134–135.
Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests. Journal of Applied Psychology, 72, 19–29.
Gilliland, S. W., & Steiner, D. D. (1999). Applicant reactions. In R. W. Eder & M. M. Harris (Eds.), The employment interview handbook (pp. 69–82). London: Sage Publications.
Greguras, G. J., & Robie, C. (1998). A new look at within-source interrater reliability of 360-degree feedback ratings. Journal of Applied Psychology, 83, 960–968.
Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty Gerrard, M. O. (2007). Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology, 92(2), 373.
Landy, F. J. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41, 1183–1192.
Offerman, L. R., & Gowing, M. K. (1993). Personnel selection in the future: The impact of changing demographics and the nature of work. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 385–417). San Francisco: Jossey-Bass.
Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000). Adaptability in the workplace: Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85, 612–624.
Roznowski, M., & Hanisch, K. A. (1990). Building systematic heterogeneity into work attitudes and behavior measures. Journal of Vocational Behavior, 36, 361–375.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Thousand Oaks, CA: Sage.
Society for Industrial-Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Society for Industrial-Organizational Psychology.
Sternberg, R. J., Wagner, R. K., Williams, W. M., & Horvath, J. A. (1995). Testing common sense. American Psychologist, 50, 912–927.
Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., & Shepherd, W. (2006). Unproctored Internet testing in employment settings. Personnel Psychology, 59(1), 189–225.
Williamson, L. G., Campion, J. E., Malos, S. B., Roehling, M. V., & Campion, M. A. (1997). Employment interview on trial: Linking interview structure with litigation outcomes. Journal of Applied Psychology, 82, 900–912.
Woehr, D. J., Sheehan, M. K., & Bennett, W., Jr. (2005). Assessing measurement equivalence across rating sources: A multitrait-multirater approach. Journal of Applied Psychology, 90, 592–600.


19
ASSESSMENT FEEDBACK

MANUEL LONDON AND LYNN A. MCFARLAND

This chapter explores the role of feedback in the assessment process. The content of assessment feedback can range anywhere from a pass/fail statement to a detailed, competency-based report delivered in person and followed up with suggestions for development. Unlike feedback in other contexts, such as performance appraisal, selection feedback is not generally about how applicants can improve their performance in the future. Indeed, many tools are used in selection to measure areas where we do not believe people can easily improve, such as personality characteristics and general cognitive ability. Although the primary purpose of feedback may be to explain the selection decision, feedback may influence applicants' perceptions of fairness, their self-image, and their reactions to the organization. This, in turn, may affect applicants' behavior, such as whether or not to accept a job or to recommend the organization to another prospective employee. Also, although not necessarily its intended purpose, selection feedback may be useful in guiding applicants' development, whether they apply for a position again, retake a similar test for the same job, weigh a job offer, or need or want further training to enhance their job performance. Furthermore, organizations and human resource (HR) professionals who are responsible for selection may view giving feedback as a professional and ethical obligation, and in some cases it may be required by law. Assessment feedback also can affect the organization's reputation, for instance, as a respectful, development-oriented employer. Here we consider the benefits and drawbacks of providing feedback from the standpoint of the organization and the candidate. We review the literature on test givers' obligations to provide feedback, candidates' reactions to feedback, and the potential costs and benefits of feedback to recipients and the organization. Finally, we consider implications for practice and areas for research to better understand the role of feedback from individual and organizational perspectives.

SOME CASE EXAMPLES

Consider some examples of deciding whether or not to provide post-selection feedback to candidates. The first example involves an online assessment center (AC); the second, an objective preemployment test; and the third, an individual assessment for executive selection.

Assessment Center Feedback

A large, national management consulting firm is hiring 15 recent MBA graduates for entry-level support positions. The jobs require analytic skills, client relations, high work standards, ability to work well in teams, and motivation. Organizational psychologists on the HR staff develop a


selection method that includes a battery of psychological tests, a measure of cognitive ability, a biodata form, two business exercises to produce work samples (simulations that ask candidates to set priorities, write e-mails, and respond to phone calls), and a test that asks how the candidate would handle certain situations. The assessment is conducted online, and participants can log in from remote locations. The assessment process provides a wealth of information that helps the firm decide whom to hire and, for those hired, what development and job experiences they need to improve their chances for success. The information would also be valuable to candidates who are not selected, helping them guide their own early career decisions. The HR staff considered the following: the online assessment makes it easy to give feedback immediately after testing is completed. This is in stark contrast to the past, when such an assessment would have been conducted in person with small groups of participants, making specific feedback more time-consuming and costly to deliver. Now a decision not to deliver feedback is harder to justify on economic or practical grounds. So, the HR staff wondered, should they give all candidates feedback immediately after the assessment? Should they wait until a selection decision is made, inform the candidates, and then invite them to return for feedback? Does this risk divulging proprietary information about the testing process and its validity? Does this put the firm at risk if candidates question the selection process's fairness or accuracy? Should only those who are hired be offered feedback? Should the firm require that those who are hired receive feedback and use the information to establish a development plan? Should a report be prepared for each candidate and sent to a new hire's immediate supervisor to review with the candidate?

Test Feedback

A restaurant is hiring food service personnel. It uses a biodata form, an integrity test to assess self-reports of honesty, and an interview. Hundreds of people take the test each year, administered by the managers of the company's restaurants across the country. What type of feedback should the restaurant provide to applicants who are selected and those who are not, other than to inform them of the decision?

Individual Assessment for Executive Selection

A multinational consumer products company is hiring a marketing vice president. The company's HR department works with the CEO to hire an executive search firm to help identify candidates. A personnel psychologist working for the search firm meets with the CEO and others in the organization to determine the job demands and expectations and to formulate a set of desired characteristics for the successful candidate, including knowledge, experience, motivation, and interpersonal skills. The psychologist also develops a screening technique to identify candidates for further analysis and ultimately formulates an individual assessment consisting of a battery of personality tests along with background and situational interview questions for the top candidates. Three candidates make it to the final stage and agree to complete the selection tests. The psychologist writes a detailed report about each candidate for the hiring CEO to review before interviewing the candidates, talking to references who know the candidates well, and making a final decision. This process raises several questions about feedback: Should the reports be available to the candidates? Who should deliver the reports? Might the results be used to support the candidates' development? Should the organization simply hand the report to the candidates? Or should candidates receive no feedback beyond learning that they were not offered the job? The questions about feedback in these examples deal with how feedback fits within the selection process. Should applicants be told more than whether or not they were selected? If they were not chosen, should they receive more specific information that might help them in the future? More generally, how does feedback fit within the assessment process? We can begin to answer these questions by turning to professional testing standards for guidance.



FEEDBACK AND PROFESSIONAL STANDARDS

Although feedback is a neglected aspect of testing, professional standards for test development and administration provide some guidance about whether to provide test feedback to applicants and how specific that feedback should be. The American Psychological Association (APA)'s Ethical Principles of Psychologists specifies that applicants have the right to a full explanation of the nature and purpose of an assessment technique in language they can understand, unless the candidate has explicitly waived that right, and that psychologists establish a procedure for ensuring the adequacy of the explanation (APA, 2002). The APA's The Rights and Responsibilities of Test Takers: Guidelines for Testing Professionals specifies that applicants receive a written or oral explanation of their test results within a reasonable amount of time after testing and in commonly understood terms (APA, 1998). This is also embedded in the Standards for Educational and Psychological Testing (APA, 2014). The document emphasizes that the "rights and responsibilities" are neither legally based nor inalienable rights, but they represent good professional practice. Pope (1992) summarized psychologists' responsibilities in providing psychological test feedback to the organization and to job candidates. He viewed feedback as a process that includes clarifying the tasks and roles of the test giver and taker, ensuring that the test taker has given informed consent (or refusal) before taking the test, framing the feedback, acknowledging its fallibility, guarding against misuse and misinterpretation of the information, guarding the test results, and assessing and understanding important reactions. He emphasized that applicants and the organization have a right to understand the purpose and use of the assessment, the procedures involved, and the feedback they can expect and from whom. The rights of test takers were controversial when they were first created because they seemed to ignore some of the realities of large-scale employment testing. For instance, the right expressed in the APA Guidelines for Testing Professionals for applicants to "have their test administered and their results interpreted by appropriately trained individuals who follow professional codes of ethics" (pt. 6.0) may not be possible to honor when test administration personnel have no code of ethics or when feedback is delivered electronically to many people. These guidelines also need to be contrasted with practice. In our experience, with the exception of some civil service agencies that may be required to provide feedback, most organizations within the United States do not follow the guidelines when it comes to selection tests. If feedback is given, it is usually at the "dimension" level, for instance, explaining to candidates that they did well in one area, say math, but had trouble with another area, such as mechanical comprehension. However, this is not the case everywhere. Organizations outside of the United States may be required to give more detailed feedback. For instance, the European Union (EU) Civil Service is required by law to provide more specific information on standing or scores.
The European Personnel Selection Office specifies that applicants have a "right of access":

    admission test applicants systematically obtain a list of the reference numbers/letters of the answers they gave together with a list of the reference numbers/letters of the correct answers; for the assessment center stage each applicant receives a competency passport which gives feedback on the marking of the general and job-specific competencies that were assessed. (Bear, 2011, p. 7)

The EU personnel office offers initial self-screening devices that provide immediate feedback on each test question and final feedback, such as the EU’s self-assessment, which may inform the successful test taker: “Your responses suggest that you have the right perception of the reality of the EU working environment” (Camilleri, 2014).

While feedback may vary across contexts and countries, the type and amount of feedback given may also vary by selection device. For example, feedback tends to be integral to the operation of ACs used for selection. The Guidelines and Ethical Considerations for Assessment Center Operations were developed and endorsed by practitioners to delineate the key components and activities of an AC, including feedback (International Task Force on Assessment Center Guidelines, 2000). ACs combine a host of qualitative and quantitative data about candidates and are used for selection, promotion, and development for general management positions and for specific functions and various industries, such as manufacturing, banking, and sales (Spychalski, Quiñones, Gaugler, & Pohley, 1997). Line managers are often trained as assessors and may also be charged with giving feedback to candidates. The Guidelines defines AC feedback as “information comparing actual performance to a standard or desired level of performance” (International Task Force on Assessment Center Guidelines, 2000, p. 10), indicating that applicants must receive information to help them understand their results. The organization using the AC should establish and publish a policy statement about the use of the data (e.g., who will receive reports, restrictions on access to information, planned uses for research and program evaluation, and feedback procedures). Assessor training should include “thorough knowledge and understanding of feedback procedures” as well as a “demonstrated ability to give accurate oral and written feedback, when the assessor’s role is to give feedback” (p. 10). Furthermore, guidelines for use of the data state the following:

1. Assessees should receive feedback on their AC performance and should be informed of any recommendations made. Assessees who are members of the organization have a right to read any written reports concerning their own performance and recommendations that are prepared and made available to management.
2. Applicants to an organization should be provided with, at a minimum, what the final recommendation is and, if possible and if requested by the applicant, the reason for the recommendation.
3. For reasons of test security, AC exercises and assessor reports on performance in particular exercises are exempted from disclosure, but the rationale and validity data concerning ratings of dimensions and the resulting recommendations should be made available upon request of the individual.
4. The organization should inform the assessee what records and data are being collected, maintained, used, and disseminated.
5. If the organization decides to use assessment results for purposes other than those originally announced and that can impact the assessee, the assessee must be informed and consent obtained. (p. 9)

The various guidelines reviewed above recognize selection feedback as valuable to the applicant, and indeed couch it in ethical terms—that applicants deserve to know the meaning of the information collected about them and how the information was used to make decisions about them. There is a growing body of research on applicants’ reactions to feedback that suggests the potential value of feedback to the applicant and the organization. We examine this literature next.

APPLICANTS’ REACTIONS TO FEEDBACK

Understanding how applicants react to feedback can help HR managers design feedback that will be beneficial to applicants and the organization. If administered appropriately, feedback may help applicants make decisions about whether to accept an offer, prepare them to re-apply, or make it more likely that they will recommend the organization to other qualified applicants. Also, feedback may reinforce or enhance applicants’ self-image, increase their self-awareness (accurately recognizing strengths and weaknesses), direct their development goals, and, at the least, not do harm by damaging an individual’s self-image.

Ensuring that applicants respond favorably to test feedback begins with test development. Much research has examined which features of selection processes are most likely to result in positive applicant reactions toward the process and toward the test feedback. Work has also been devoted to examining the best ways to deliver feedback to ensure positive reactions. Before we describe this research in more detail, however, we must describe this research area more generally and how this type of research has typically been conducted.

Much of the research on applicant reactions to feedback uses the justice framework to understand reactions to testing and feedback. Distributive justice refers to perceptions of the fairness (equity) of the outcomes of a selection process, whereas procedural justice refers to perceptions of the fairness of the process itself (Folger & Greenberg, 1985; Gilliland, 1993; Thibaut & Walker, 1975). Applicants who receive negative feedback tend to rate the test as less fair than those who receive positive feedback (Lounsbury, Bobrow, & Jensen, 1989; Schleicher, Venkataramani, Morgeson, & Campion, 2006; Smither, Reilly, Millsap, Pearlman, & Stoffey, 1993). Procedural justice may mediate the effects of favorability on outcomes such that the more fair the process is perceived to be, the more positive responses will be after the outcome is known, regardless of the outcome. Furthermore, several important organizational outcomes may be related to procedural justice perceptions of selection processes (Bauer et al., 2001). These include the attractiveness of the organization to applicants, applicant intentions toward the organization (e.g., recommending the company to others), and deciding to accept the organization’s job offer.

However, there is still some debate regarding whether the effects of reactions extend beyond applicants’ feelings and beliefs to actual behavior. In their Annual Review article, Sackett and Lievens (2008) concluded, based on the meta-analysis of Hausknecht, Day, and Thomas (2004), that there was little evidence for a relationship between applicant reactions to the selection process and actual behavioral outcomes. However, other research indicates some positive, although small, effects. Truxillo, Steiner, and Gilliland (2004) reviewed the literature on the effects of selection fairness on organizational outcomes. They found that feelings of unfair treatment affect both “soft” outcomes, such as satisfaction with the selection process, and “hard” outcomes, such as applicant withdrawal. Furthermore, McCarthy, Van Iddekinge, Lievens, Kung, Sinar, and Campion (2013) found that candidate reactions to tests were related to test scores and indirectly affected later job performance, providing evidence of the potential long-term effects of applicant reactions.

Research generally finds that selection processes that are designed to be fair and considerate to applicants lead to better reactions toward the results of the process. Ryan and Ployhart (2000) examined the literature on applicant perceptions of selection procedures, including the consequences of selection results and feedback. They found various factors that may influence reactions to the selection procedure, such as test type, HR policy, and the behavior of HR personnel, as well as selection results. They suggested that selection results and feedback operate within the larger context of conditions that affect applicant reactions. Ryan and Ployhart (2000) proposed that these conditions include personal characteristics (e.g., experience), job characteristics (e.g., attractiveness), procedural characteristics (reasonableness of explanation; job relatedness of the selection methods), and organizational context (e.g., selection ratio/competition). These conditions affect applicants’ perceptions about the selection process and outcome, depending on applicants’ expectations, the perceived desirability of the job and organization, available alternative jobs, and social support. Based on Ryan and Ployhart’s (2000) review, organizations will want to be sure that applicants are clear about the purpose of the test before they even take it, and they should treat all applicants in a friendly and respectful manner, including during the process of giving feedback.
Consistent with this, Maertz, Bauer, Mosley, Posthuma, and Campion (2004) found that procedural justice perceptions predicted organizational attractiveness and intentions toward the organization prior to receiving pass-fail feedback, and this effect remained after feedback was given, although it was not as strong.

Two things should be noted about the applicant reactions research we have described. First, the effect of reactions on outcomes tends to be small. However, one must remember that even small effects can have meaningful (and potentially disastrous) consequences when we consider thousands of applicants within one organization or across organizations. Just one lawsuit can cost an organization millions of dollars (and tarnish a reputation), so applicant reactions can have important and potentially large consequences when considered on this scale. Second, the research described has not examined reactions to feedback but reactions to selection processes in general. In most of the studies described above, the applicant was unaware of how he or she actually performed in the process.

Beyond designing selection processes to be perceived positively, one must also consider precisely how feedback should be delivered. Applicants’ reactions to feedback need to be understood in relation to their perceptions and feelings about the context and process before, during, and after testing and feedback of results. The effects of assessment feedback may depend on conditions, such as applicants perceiving that there was strong competition or applicants already having excellent jobs, thereby providing an attribution beyond their own ability (Ryan & Ployhart, 2000).
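The studies above typically relate outcome favorability, justice perceptions, and organizational outcomes through mediation models. As a minimal, hypothetical sketch of that analytic logic (not code from any study cited here), the following simulates applicant data and estimates the indirect effect of pass/fail favorability on organizational attraction through procedural justice perceptions; the variable names, effect sizes, and sample are invented for illustration.

"""
Illustrative sketch (not from the chapter): how a simple mediation analysis of
applicant-reactions data might be set up. The variables and simulated data are
hypothetical; real studies use validated scales and dedicated software.
"""
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Simulated applicant data:
# favorability: 1 = passed the test, 0 = failed (outcome favorability)
favorability = rng.integers(0, 2, size=n)
# procedural_justice: perceived fairness of the process, partly driven by favorability
procedural_justice = 0.5 * favorability + rng.normal(0, 1, size=n)
# attraction: attraction to the organization, driven mainly by justice perceptions
attraction = 0.6 * procedural_justice + 0.2 * favorability + rng.normal(0, 1, size=n)

def ols_slope(y, x_main, covariates=None):
    """Return the ordinary least squares slope for x_main, optionally controlling for a covariate."""
    cols = [np.ones_like(x_main, dtype=float), x_main.astype(float)]
    if covariates is not None:
        cols.append(covariates.astype(float))
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]  # coefficient on x_main

# Path a: favorability -> procedural justice
a = ols_slope(procedural_justice, favorability)
# Path b: procedural justice -> attraction, controlling for favorability
b = ols_slope(attraction, procedural_justice, covariates=favorability)

indirect_effect = a * b
print(f"Estimated indirect (mediated) effect: {indirect_effect:.3f}")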

Recognizing that the selection process involves evaluation, Ryan and Ployhart (2000) commented that reactions to tests are not merely reactions to the process but reactions to being evaluated. Research supports this line of thinking. Job candidates’ reactions to a test, what Schuler (1993) called “social validity,” depend, in part, on the feedback they receive about their test performance. Thus, how test results are presented will influence how candidates interpret the meaning of the results and what they decide to do as a result, if anything (Marcus, 2003; Pope, 1992).

Candidates’ reactions to a test and the testing process may change after receiving feedback about whether they passed or failed the test. Bauer, Maertz, Dolen, and Campion (1998) assessed applicants’ reactions before testing, after testing, and again after feedback about whether they passed or failed. They found that applicants who passed the test evaluated the organization and testing process more favorably than did those who failed. Wiechmann and Ryan (2003) found that feedback about the selection decision had a significant effect on reactions to the test, consistent with a self-serving bias. Candidates who were selected preferred to think that the test was fair and valid because this helped them maintain their self-image. Those who were not selected expressed a negative perception of the test because this diverted blame for the outcome away from their own ability to behave effectively in the situation. In a laboratory study, applicants who failed to meet hiring standards had a more negative view of the organization than those who had passed (Kluger & Rothstein, 1993). In a study of actual applicants, those who performed more poorly rated the entire selection process more negatively than did those who performed well (Macan, Avedon, Paese, & Smith, 1994). Thus, those who receive negative feedback on test performance may automatically blame the test to protect their self-perceptions. However, if a test is perceived to be procedurally fair before receiving feedback, but applicants receive feedback that they did not perform well on the test, then this may cause applicants to feel incompetent and thereby lower their expectations of being able to perform on such tests in the future (Ployhart, Ryan, & Bennett, 1999). Bauer et al. (1998) found that feedback about test performance was more important than procedural justice perceptions as a determinant of organizational outcomes.

In a laboratory study, Gilliland (1994) found that student applicants who were selected on the basis of their test results reported greater fairness in the selection process than did the rejected students, especially when they had high expectations of being hired. Applicants who were rejected were more likely to recommend application to others when they were offered an explanation for the use of the selection procedure. Feedback influenced the factors that were important in perceiving that the process was fair (Van Vianen et al., 2004). Explaining the selection decision to applicants has consistently been found to increase their perceptions of a fair selection process (Gilliland et al., 2001; Ployhart, Ryan, & Bennett, 1999; Truxillo, Bodner, Bertolino, Bauer, & Younce, 2009). Other research found that selection processes and results are likely to be perceived as more fair when applicants are given information about the process before being evaluated (Truxillo, Bauer, Campion, & Paronto, 2002).
Individual differences may explain some variance in how feedback is received. Schinkel, van Dierendonck, van Vianen, and Ryan (2011) found that applicants with an optimistic attributional style (perceiving the cause of a positive event as internal and stable and the cause of a negative event as external and unstable) reported higher well-being after being rejected than those with a less optimistic style, particularly when feedback about test performance was unspecific. Specific performance feedback negatively affected rejected applicants’ well-being. Receiving specific feedback seemed to lower the possibility of externally attributing a negative outcome, and this negatively affected feelings of well-being after being rejected, particularly if the individual was used to making external attributions for unfavorable outcomes. The implication is that feedback should be sufficiently specific to avoid inaccurate external attributions. However, for individuals who are rejected, being less specific can reduce negative feelings. This raises the dilemma of protecting the applicant’s self-image or the organization’s reputation as fair in hiring. Presumably the best policy is to explain the fairness of the process and provide specific feedback that may cause the applicant to feel bad, at least temporarily, but could lead to learning as well as maintaining the organization’s reputation.

In summary, passing or failing a test contributes strongly to applicants’ subsequent reactions (Bauer et al., 1998; Marcus, 2003; Ryan & Ployhart, 2014; Thorsteinson & Ryan, 1997): the more positive the results, the more favorable the reaction. Research suggests that applicants may prefer detailed comments, as opposed to just knowing a test score (Lipnevich & Smith, 2009). Conveying personal and procedural information sensitively contributes to more positive perceptions of the test and organization, whether the applicant was hired or not. Understanding the procedures may limit the negative effects of failing a test on self-perceptions. Feedback reactions are influenced by the context of the process (e.g., information about competition for the position; Ryan & Ployhart, 2000). Praise may help applicants feel better about their performance but may not necessarily affect learning (Lipnevich & Smith, 2009).

Feedback and Self-Image

Generally, people do not perceive themselves accurately, or at least not as others see them (London, 2003). Feedback helps them form an accurate perception of their performance and benefit from the selection process regardless of whether they are selected. Feedback supports candidates’ self-learning by clarifying what good performance is, encouraging dialogue between testers and test takers, and suggesting ways that test takers can improve their performance in the future (Nicol & Macfarlane-Dick, 2006). Job candidates are likely to perceive feedback as accurate when the feedback is clear and objective, not based on ambiguous or subjective data (e.g., one assessor’s or interviewer’s opinions).

People tend to evaluate themselves positively to maintain or increase their self-image (Harris & Schaubroeck, 1988). They interpret feedback through their own lens; for instance, potentially attributing poor test results to factors outside of their control, such as an unfair testing procedure (Ployhart & Harold, 2004). In contrast, failing leads to lower self-image and self-perceptions of incompetence (Anderson & Goltsi, 2006; Fletcher, 1991; Maertz et al., 2005; McFarland & Ross, 1982; Ployhart & Harold, 2004; Ployhart, Ryan, & Bennett, 1999). Nicol and Macfarlane-Dick (2006) suggested that when feedback is not given, candidates are likely to evaluate their own performance during the testing process and create their own feedback in line with their self-image. For instance, if they do not receive specific information about the results of a test and they were not chosen, they will rationalize that their rejection was not because of their test performance but because of other factors beyond their control. This allows them to maintain their self-image, but it may also create erroneous impressions that could be damaging to the organization—for instance, that the organization has unfair hiring or promotion practices.

When the candidate is offered the job, the candidate is likely to attribute the cause to his or her good test performance, but not necessarily. The candidate with low self-esteem may erroneously conclude that the positive outcome was due to luck or other factors beyond his or her ability. In addition, the successful candidate could benefit from test results that suggest ways to improve his or her skills and knowledge to be even more valuable to the organization and increase his or her chances of job and career success. Hence, feedback should be provided to influence applicants’ perceptions about the testing process and organization.

Social identity theory suggests how applicants’ social identities interact with their perceptions of selection experiences to predict their withdrawal from the process. Herriot (2004) argued that applicants are likely to withdraw from the selection process when there is incongruence between their current perceptions of the organization’s identity and their own self-identities that are salient during specific elements of the selection process. Social identities are individuals’ beliefs about their membership in social categories (e.g., their gender, ethnicity, occupation, family, and religion), in contrast to their personal identities, which are beliefs about their own characteristics (e.g., their strengths and weaknesses). Social identities are associated with a range of beliefs, values, and norms of behavior, and may incorporate prototypes or beliefs about the typical or ideal member of a category.
Organizational identities are subsets of social identities. Applicants develop perceptions of the organization’s identity as they participate in the selection process and receive feedback. The effects of the degree of congruence on leaving the process may be moderated by applicants’ perceptions of the probability of obtaining another job. People who believe there are plenty of other opportunities will have a lower threshold for incongruence. Those who believe that job opportunities are scarce are more likely to tolerate a higher level of incongruence.

Schinkel, van Dierendonck, and Anderson (2004) studied the role of feedback in minimizing the psychological effect of a negative selection decision on job applicants. Student subjects completed two tests and then received a rejection message. Half received just the rejection message, and half received the rejection message and bogus performance feedback (the percentile, or how they did relative to others). Core self-evaluations and affective well-being of the rejected students receiving performance feedback significantly decreased from before to after the testing and feedback compared with those of students in the rejection message–alone condition. Core self-evaluations actually increased for those who were rejected but were not given performance feedback, particularly if they saw the procedure as unfair. Procedural fairness (candidates’ perceptions that they had a chance to demonstrate their performance and that the test was related to the job function) interacted with feedback to affect core self-evaluation; distributive fairness (the perception that the selection decision was correct) interacted with feedback to affect well-being. The authors suggested an attribution-theoretic explanation for these results. The students may have shown a self-serving bias after receiving negative outcomes, attributing the negative outcome to external causes (e.g., unfair procedures that were not under their control), thereby maintaining their self-perceptions. Consistent with DeNisi and Kluger’s (2000) concept that feedback entails a comparison, this comparison made reducing the negative state following rejection even more important.

Maertz, Bauer, Mosley, Posthuma, and Campion (2005) measured applicants’ self-efficacy for cognitive ability testing before and immediately after the test and again after pass/fail feedback. The applicants were applying for a position at a utility company. Self-efficacy for the test prior to actually taking it was higher for men, applicants who had prior successful experiences with cognitive ability tests, and those who perceived the test to be valid and fair. Not surprisingly, self-efficacy for the test increased for those who passed it and decreased for those who failed. However, failing had a greater negative impact on subsequent self-efficacy for the test for women and Whites and a lower negative effect for those who had previously been hired based on ability tests. Gilliland (1994) found that perceptions of job-relatedness were negatively related to test-taking self-efficacy for applicants who failed but positively related to test-taking self-efficacy for those who passed. An implication of these findings is that those who fail the test and have significant decreases in self-efficacy because of it might tell others that the test was particularly difficult and discourage other potential applicants.

Maertz et al. (2005) suggested that organizations consider attribution-related or other interventions to bolster self-efficacy or to increase perceptions of fairness and test validity. This does not mean giving applicants an excuse to attribute negative results to factors beyond their control so they will feel better about the outcome. Rather, feedback and explanations could help applicants make accurate and useful attributions, for instance, to improve their skills or knowledge and/or to understand why the selection method suggests that they would not do well in the position.
Also, test administrators should follow procedural justice rules and emphasize in pretest preparation sessions the proven validity of the test for predicting job performance. These interventions may also enhance applicants’ attraction to the organization. Derous, Born, and de Witte (2004) found that applicants valued and expected feedback. They argued that although applicants should not be deprived of their right to performance scores, perhaps performance measures after selection should not be provided or should be provided in ways that protect applicants’ self-image; for instance, reminding them of the low selection ratio or that the position is not right for everyone and that knowing now is better than being unhappy later.

Anderson and Goltsi (2006) formulated the construct of negative psychological effects (NPEs) of selection and assessment methods upon applicants. They defined NPEs as follows:

Declines in applicant psychological well-being, general mental health, or core self-esteem that are inveterate, measurable, and statistically demonstrable, and that occur as a result of exposure to rejection decisions, violations of applicant rights, or unprofessional feedback given to applicants by recruiters, at any stage during organizational selection or promotion assessment procedures. (p. 237)

They also defined positive psychological effects (PPEs) as increases in applicant psychological well-being, general mental health, or core self-esteem that result from acceptance decisions, perceived respect for applicant rights, or complimentary feedback. Previous research had found that applicants participating in an assessment center experienced negative psychological effects (Fletcher, 1991). Anderson and Goltsi (2006) suggested that NPEs may be present for several weeks or months after receiving a negative selection decision. They noted that much of the research on applicant reactions to selection methods has focused on applicants’ immediate reactions to, and preference perceptions of, different predictors. Their study investigated the long-term effects and outcomes of applicants’ exposure to specific selection methods on candidate decision making, attitudes toward the organization, and psychological health and well-being. For instance, measures included self-esteem, mental health, positive and negative affect, and career exploration behavior. One hundred seven applicants participating in an assessment center completed measures just before participating in the center, immediately afterward but before they were told the outcome, and six months after the assessment. All applicants received detailed feedback regardless of whether they were accepted or not. Rejected applicants did not differ significantly from accepted applicants on the indices of NPEs. Accepted applicants rated feedback dimensions more favorably than did rejected applicants. Rejected applicants seemed to attribute the negative decision to a lack of accuracy in the assessment process. The authors thought that one reason why NPEs did not emerge for unsuccessful candidates may have been that the selection ratio in this organization was so competitive that it moderated applicants’ negative feelings from rejection. An implication is that providing detailed feedback to unsuccessful candidates may be dysfunctional. Internal applicants who are rejected remain in the organization, and NPEs could affect their job performance. This could also apply to successful applicants who received inappropriately negative feedback or felt that they were not treated fairly in some way.

In summary, unfavorable feedback may have a negative effect on applicants’ self-image. Organizations should investigate the costs of potential NPEs on reduced performance and at least consider the possible long-term negative consequences of rejection. Organizations should make all the external reasons for a negative hiring decision salient (e.g., low selection ratio, high quality of other applicants, fit issues), so that negative feedback does not damage a person’s self-image.

The research on self-image is consistent with findings within the applicant reactions literature. These two literatures lead to clear advice regarding how to design selection processes to maximize positive reactions and outcomes for both the organization and test takers:

• Provide information about the fairness of the testing and decision-making process.
• Ensure the selection process is designed in such a way that applicants feel they have an opportunity to show their strengths and perform at their best.
• Focus on implications of the results for behaviors, knowledge, or skills that can be changed or learned, not personal characteristics, such as personality or cognitive ability, that threaten self-image.
• Tie feedback to job requirements and developmental opportunities.
• Precede feedback with information about the selection test to explain its job relevance and fairness.
• Accompany feedback with behavior-based standards in relation to the job knowledge, abilities, and experiences required for success on the job.
• Explain why the test was fair to avoid applicants creating their own judgments about fairness.
• Convey personal and procedural information sensitively to generate more positive perceptions of the test and organization, whether the applicant was hired or not.
• Explain the reasons for test methods and results to promote accurate attributions and judgments of test fairness and validity.
• Protect applicants’ self-image (e.g., remind them of the difficulty of the test, the tough competition, and the value of knowing now that the job is not right for them).
• Recognize that machine-delivered feedback may require more explanation than face-to-face feedback to convey its meaning and intention.

BENEFITS AND COSTS OF FEEDBACK

People generally do not react positively to feedback. They are naturally apprehensive about being evaluated and are concerned about what others think of them (London, 2003). However, feedback can direct behavior by helping people set and recalibrate goals and determine what they need to do to achieve those goals. Feedback can be motivating by giving people a sense of what they have accomplished and a feeling of reward for their achievements. Feedback from selection tests can inform candidates about whether their ambitions were realistic and what they need to do to increase their preparedness in the future. As such, feedback can contribute to development and career planning. The dilemma for the organization is deciding the level of specificity for feedback that will maximize the benefits and minimize the costs to the organization and the individual. Here we consider the costs and benefits of feedback from the perspectives of the organization and the individual candidate.

Organization’s Perspective

Potential benefits of feedback for the organization include the following:

• Feedback informs the candidates’ decision making. Candidates who are selected will have a better understanding of why and of the value the organization believes they will bring to their positions.
• Knowing that candidates will receive feedback requires the organization to maintain focus on the skills and knowledge needed by the successful candidate(s). These characteristics may reflect skills needed to do the job today and/or needed for future career advancement in the organization. Maintaining this focus is especially important when decisions are based on qualitative information, such as interviews and supervisory opinions.
• Feedback is a way to maximize value from assessment dollars. Considerable time and money may be spent evaluating candidates. This information can be useful not only for making the decision but also for guiding candidates’ future development.
• Feedback may guard against illegal discrimination and defend against claims of such in that feedback recipients will know and understand the reason for the selection process and decision, see its job relevance, and recognize that it was not arbitrary or based on something other than bona fide job requirements.

Regarding potential costs of feedback, the organization needs to consider the following:

• The organization incurs the cost of delivering feedback. This may include the cost of personnel, such as a coach, who meets with the candidates after the selection process is concluded to explain the decision and provide the feedback.
• Feedback may create difficulties in maintaining the reliability and validity of the assessment methods. For instance, if candidates know the test content, they may communicate this to others, giving future candidates an unfair advantage, or causing them to respond in certain ways to create impressions they feel the organization wants, thereby limiting the accuracy of the information.
• Test security is costly, and feedback that goes beyond test outcomes may make security difficult. Having alternate forms of the selection process may eliminate this worry but adds a further cost to create and validate these forms.
• The organization may be obliged to go beyond feedback to include advice for development. Such advice needs to be given in a way that does not lead the candidates to have expectations about future opportunities; for instance, implying that if they follow the advice, they will be promoted.
• Guarding candidates’ confidentiality is also a cost. Candidates should be told about who has access to the assessment results, how the information will be used by the organization, how long it will be retained, and how identifying information will be secured.
• Giving feedback imposes obligations to follow up, especially with internal employees. The employees may want career advice in the future. Such feedback and career counseling can be linked to a career development function in the organization.
• The issue of longevity of assessment results needs to be examined by the organization. Feedback indicates the value of the information for development, implying that the organization recognizes that people grow and develop over time. However, reassessment for future career opportunities is costly. The organization will want to study changes in assessment performance over time when varying degrees of development have occurred; for instance, differences in performance between candidates who have received feedback and those who have not, and differences in performance between candidates who subsequently participated in various developmental experiences compared to those who did not. Such research is an additional but worthwhile cost in increasing the value of the assessment.

Candidate’s Perspective

Potential benefits of feedback for the candidate include the following:

• Increase in self-awareness. In general, the benefit of feedback is to increase self-awareness of strengths and weaknesses, identify competencies and experiences needed to increase competitiveness for future positions, and learn areas needing development to be more effective once on the job.
• Feedback may help candidates who are offered positions understand the nature of the work and its expectations. Details about their assessment results will show them the characteristics that are valued by the organization for the job and/or for their careers. Such detailed feedback, if provided, would contain information about strengths and weaknesses and suggest areas for improvement even though they have been offered a job.
• The way candidates are treated, the information they are given about themselves and the selection process, and their conclusions about the fairness and validity of the process will help them evaluate the organization. The feedback may suggest that the organization cares about and is aware of their abilities and will support their continuous learning. For individuals who were rejected, the information explains why, supports their belief that the decision was made fairly and on the basis of relevant information, and helps them recognize their strengths and weaknesses so they can focus their future job search and development.
• Assessment feedback should include not only information about results but also ideas for developing weaknesses and ways to build on one’s strengths. The feedback recipients should understand the differences between performance areas that can be developed and those that are difficult to develop. For performance areas that are difficult to develop, candidates might value suggestions about how to avoid behaviors and responsibilities that require ability in these areas.

Potential costs of feedback for the individual candidate include the following:

• Possible decline in self-confidence. Perhaps the major cost of feedback from the individual’s perspective, particularly for those who are not offered positions, is losing face and declining self-confidence. Generally, less specific information will be less threatening but of less value.
• Another potential cost is processing the information. This takes time, energy, and motivation—sometimes more than the individual cares to give. Perhaps the applicant was not highly motivated to apply for the position to begin with. Or perhaps the individual knew from the start that this was a long shot. This person does not need to hear more than that he or she was not offered the job. However, perhaps more information may redirect the candidate’s time and attention toward more fruitful opportunities in the future.
• The results may lead the candidate to make a disappointing career decision. Candidates who get positive results may be flattered by the job offer and not pay attention to details of the feedback—or hear just about their strengths and ignore or give less attention to any weaknesses. Also, they may not pay sufficient attention to the nature of the job, instead focusing on their own competence. As a result, they may accept a position that has requirements (e.g., travel) that they really did not want. Conversely, candidates who are rejected may focus on the rejection decision and not be able to process feedback about their personal characteristics mindfully.

Differences Between Internal and External Candidates

The costs and benefits may differ depending on whether the candidates are internal or external to the organization. If internal, the organization wants to maintain the loyalty and motivation of the employee and enhance the individual’s career development with the organization. Moreover, communicating reasons for selecting certain candidates and bypassing others lets the candidates and other stakeholders (internal and external) know what is important to the organization. Internal candidates who were not selected will expect a rationale and may appreciate and take advantage of advice to direct their development and improve their future career development opportunities within the organization. External candidates may not need as much information. Still, as described above, their reactions to the fairness and thoroughness of the selection process and the validity of the decision may affect their impressions of the organization and their subsequent relationships with the organization.

Feedback Opportunities for Different Assessment Methods

Assessment methods vary in the nature of the data they collect and in the difficulty of interpreting feedback results and applying the information for development. They differ in their quantitative and qualitative results, the face validity of the methods, and their usefulness for improving performance and future selection prospects. Consider the following methods:

• Interviews. Interview results are difficult to quantify. Situational judgment interviews (SJIs) may provide sample “normative” responses for comparison with individual candidates’ answers.
• Cognitive and personality tests. These produce objective results that can be explained by subject area, percentiles, and norms within the company and with national samples (see the sketch following this list).
• Integrity tests. Honest feedback on integrity tests could generate strongly negative reactions. Test administrators may need to justify the use of integrity tests for candidates or, more simply, say in a tactful way that the results did not conform to the pattern the organization wanted, without divulging the nature of the test.
• Test batteries. Test batteries vary in their measures, some including personality and ability tests. They produce complex results and may require professional input to integrate and summarize. Electronic, prewritten feedback could be developed, but if not accompanied by face-to-face explanation and coaching, it may not be sufficient for applicants to gain a useful and accurate understanding of the results.
• Biodata. Past behavior is an excellent predictor of future behavior. Biodata results can be justified in terms of the experience needed. This method may be valuable in guiding failed candidates to better preparatory experiences.
• AC measures of management and leadership skills. ACs produce complex results and allow in-depth feedback, focusing candidates’ attention on weaknesses that can be corrected and strengths that suggest alternative career directions. These can be delivered to the candidate online immediately or soon after an online AC.
• Multisource (360-degree) feedback survey results. This method of collecting performance ratings from subordinates, supervisors, peers, and/or customers as well as self-ratings is often used alone for development. When used for decision making, the results can be related to other performance data and/or to changes in performance ratings from different sources over time. An executive coach may assist in providing the feedback and encouraging the recipient to interpret it accurately.
• Supervisor nominations and evaluations. When used as the sole input for a placement or promotion decision, candidates may perceive nominations and performance evaluations as unfair. This may increase their political behavior and impression management in hopes of winning favor for future positive decisions.
• Performance in a management/leadership development program. Participation in a management development program may indicate performance capabilities in areas in which the candidates may not have had a chance to demonstrate their abilities on the job. The participants would likely perceive the use of such results for making decisions about them as unfair unless this purpose was evident before they participated in the program. Also, the experiential situations in such a program may be viewed as artificial and not a good representation of actual job performance.
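To illustrate the norm-referenced, dimension-level feedback mentioned for cognitive and personality tests, the following is a minimal, hypothetical sketch. The dimension names, norm values, cutoffs, and wording are invented for illustration; an operational report would rest on the organization’s own norms, validated dimensions, and professional review of the feedback text.

"""
Hypothetical sketch of dimension-level, norm-referenced feedback.
Dimension names, norms, and wording are illustrative placeholders only.
"""
from statistics import NormalDist

# Assumed norm table: mean and standard deviation for each scored dimension
NORMS = {
    "numerical reasoning": (50.0, 10.0),
    "mechanical comprehension": (50.0, 10.0),
    "written communication": (50.0, 10.0),
}

def dimension_feedback(raw_scores: dict) -> list:
    """Turn raw dimension scores into norm-referenced feedback statements."""
    messages = []
    for dimension, raw in raw_scores.items():
        mean, sd = NORMS[dimension]
        percentile = NormalDist(mean, sd).cdf(raw) * 100
        if percentile >= 70:
            band = "a relative strength"
        elif percentile >= 30:
            band = "in the typical range for applicants"
        else:
            band = "an area you may want to develop further"
        messages.append(
            f"{dimension.capitalize()}: around percentile {percentile:.0f} "
            f"relative to the applicant norm group; this was {band}."
        )
    return messages

# Example candidate with hypothetical raw scores
for line in dimension_feedback({"numerical reasoning": 62,
                                "mechanical comprehension": 41,
                                "written communication": 55}):
    print(line)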

Computerized Assessments and Feedback

Online assessments can be cost effective and flexible. There is a considerable literature on the use of technological applications for selection, including online assessments with simulations that vary in fidelity (e.g., high-fidelity assessments with simulations and work samples that require decisions or actions compared to low-fidelity assessments that are hypothetical situational judgment tests) (Lievens & De Soete, 2012; Tippins, 2015). These assessments can be customized to assess various abilities and behaviors. For instance, they can present realistic scenarios to candidates via full-motion video and ask candidates how they would respond to the different situations (Drasgow, Olson, Keenan, Moberg, & Mead, 1993; Wiechmann & Ryan, 2003). Technology-based selection methods that are new and unfamiliar (e.g., avatar-based situational judgment tests) are especially likely to produce negative reactions on the part of candidates who are rejected (Anderson, 2003; Bruk-Lee, Drew, & Hawkes, 2013; Oostrom, Born, & van der Molen, 2013). However, test takers’ post-test, post-feedback reactions to the tests are not affected by mode of administration (computer vs. paper-and-pencil test; Wiechmann & Ryan, 2003).

Just as assessments can be computerized, so can the feedback. This may be less threatening than receiving feedback from an individual, but it also presents an opportunity to avoid paying attention to the results, which is more difficult with in-person feedback. Computerized feedback can be given at different points in time during the assessment or at the end, along with information about alternative response possibilities. Although these are simulations, they are realistic, standardized, and easy to administer. Of course, computerized feedback can be combined with in-person feedback to help the candidate use the information.

There is a growing body of research comparing receiving feedback via computer to receiving feedback face-to-face. For instance, Watts (2007) noted that computer-mediated communication makes delivering evaluative feedback immediate and detailed. She compared the effects of feedback via e-mail with voicemail from the perspective of the sender and receiver in a study of evening MBA students delivering feedback that they generated themselves in relation to fellow students’ participation in a group project. E-mail produced fewer social context cues than voicemail, so e-mail increased the negative content of feedback, filtering out the affect of the sender and receiver. E-mail senders viewed the negative feedback they gave as more negative than the receivers viewed the feedback. This was not the case for voicemail senders. However, voicemail senders, but not e-mail senders, were less comfortable than receivers with the negative feedback. Media conditions did not influence feedback effectiveness (e.g., the perception that the feedback motivated the receiver to work harder next time).

Mishra (2006) addressed whether people respond to computer feedback about their abilities in the same way they respond to human feedback. Students participated in a laboratory study in which they were assigned randomly to one of four experimental conditions: scored test versus unscored test crossed with two levels of computer-generated feedback: (a) praise for success on an easy task and no blame for failure on a difficult task or (b) no praise for success on an easy task and blame for failure on a difficult task. Participants who received computer-generated feedback seemed unwilling to commit to the same level of “deep psychological processing” about the intention of the feedback as other studies found with face-to-face feedback (e.g., Meyer, Mittag, & Engler, 1986). This was contrary to the position that people learn to respond to computers as social actors with whom they interact in real time, using natural language and fulfilling traditional social roles. The students in Mishra’s study seemed to disregard the context within which the feedback was offered. They saw praise from the computer as being positive, regardless of whether or not they thought that their ability level was known or whether the task was easy or difficult. If people respond more mindlessly to computer feedback about their test scores, this could thwart the goals of the feedback; for instance, to motivate higher achievement next time or to take the difficulty of the task into account when receiving praise.
However, research is needed to determine whether providing feedback recipients with a social script suggesting that the computer is capable of sophisticated inferences would change these reactions; for instance, a script conveying that the computer “respects” the subject’s intelligence because the computer has a record of information about the subject and takes that into account in providing feedback. Ferdig and Mishra (2004) explored technology as a social actor, finding that people exhibited emotional responses (e.g., anger and spite) when they felt that the computer had treated them unfairly in an ultimatum bargaining game.
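As a concrete, hypothetical illustration of the item-level computerized feedback described above (feedback given during the assessment, with information about alternative response possibilities), the following sketch is invented for this purpose; the scenario, response options, scoring key, and wording are not from any operational system, which would draw on a validated item bank.

"""
Hypothetical sketch of item-level feedback for a low-fidelity situational
judgment item. Scenario, options, key, and wording are illustrative only.
"""
ITEM = {
    "scenario": "A long-standing customer complains that an order arrived late.",
    "options": {
        "A": "Apologize, explain what happened, and offer to track the replacement personally.",
        "B": "Explain that shipping delays are outside your control.",
        "C": "Forward the complaint to your supervisor without responding.",
    },
    "keyed_best": "A",  # the response the (hypothetical) scoring key treats as most effective
}

def item_feedback(response: str, item: dict = ITEM) -> str:
    """Return immediate feedback comparing the response to the keyed alternative."""
    best = item["keyed_best"]
    if response == best:
        return ("Your choice matches the response our key treats as most effective: "
                + item["options"][best])
    return ("A more effective alternative in this situation would have been: "
            + item["options"][best])

# Example: the candidate chose option B on this item
print(item_feedback("B"))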

Feedback, Test Preparation, and Coaching

Test feedback provides opportunities for career development. More specifically, it focuses the candidates’ attention on ways to increase opportunities and to avoid errors or failures in areas that were critical to the selection process. Feedback’s main value is correcting errors. Providing test feedback suggests ways that the feedback recipients can increase their knowledge and avoid similar errors in the future. Feedback that is directed at the retrieval and application of specific knowledge stimulates recipients to correct errors when they are mindfully focused on correcting or avoiding these errors in similar testing situations (Bangert-Drowns, Kulik, Kulik, & Morgan, 1991). However, another possibility to consider in future research is that feedback may have a deleterious effect if it focuses candidates’ attention on areas that will not be as important in future situations. For instance, what was important for a given position in one organization may be different from what is needed for another position in a different organization, even if the positions are similar. Also, feedback may focus attention on areas that detract from current performance. For instance, the candidates may concentrate on improving weaknesses or maximizing strengths that were important for a promotion but are not as important in their current job. They may behave as if they were promoted or were working in a different job and ignore current job requirements (anticipatory socialization).

FEEDBACK TO UNEMPLOYED INDIVIDUALS

Unemployed individuals may be especially sensitive to job search outcomes, including feedback on selection tests or the absence of feedback. The job search literature on unemployed individuals indicates the negative effects of disappointing job search results on mental health (Kanfer, Wanberg, & Kantrowitz, 2001; Wanberg, Basbug, Van Hooft, & Samtani, 2012). Personality characteristics, such as a positive self-concept, perceived control, and emotional stability, affect continued job search despite rejections along the way (Wanberg, Glomb, Song, & Sorenson, 2005). Few studies have examined the effects of feedback, or the lack thereof, on unemployed individuals, although researchers recognize that outcomes beyond time to re-employment need to be studied (Wanberg, Kanfer, Hamman, & Zhang, 2015). A qualitative study quoted unemployed individuals’ desire for feedback (Wanberg, Basbug, Van Hooft, & Samtani, 2012, p. 900):

“I just wish there was more feedback available so that you could grow constructively and, you know, optimize your next time.”

“I don’t think you ever get the true reason you were rejected. So like I said, it’s the lack of information, the lack of feedback that frustrates me, and that happens daily.”

“The most help that I need is to know why things didn’t go the way I wanted them to. And even if somebody says no, I can handle that; that’s not a problem; I don’t mind that a bit; as long as you tell me why.”

IMPLICATIONS FOR PRACTICE

Practitioners need to make fundamental decisions about giving feedback, recognizing that feedback may affect applicants’ reactions to the testing situation, the organization, and their self-image. Also, in divulging test results, practitioners worry about guarding the security of the test and minimizing costs relative to the gains from feedback. Specifically, practitioners need to determine if and when feedback should be given, by whom, and by what medium (e.g., face-to-face, letter, e-mail). They also need to consider how much detail should be given and the form of score reporting (e.g., raw score, percentiles, standard scores). Other questions involve the relevant comparison groups (e.g., other candidates, other people with similar ability and background) and what resources should be provided to assist applicants in interpreting the feedback and using the information for their development. Organizations may be more likely to invest in hiring an assessor or coach to convey and discuss results with applicants when the assessment is for current employees vying for another position in the organization.

Overall, HR practitioners need to foster applicants’ perceptions that the selection process and outcome are fair, guard against erroneous attributions about the organization, and protect applicants’ self-image. In addition, practitioners need to guard the security of the test and minimize cost as they make feedback a positive part of the selection process. Also, candidates are not accountable for using the feedback; they simply need to be made aware that it is available. The organization then needs to provide a setting that is conducive to delivering feedback, including the format for the feedback. The decision about format, setting, and feedback specificity depends on the test developer’s and administrator’s conclusions about their ethical obligation to provide feedback and to do so in a manner that, at the very least, does not harm the candidates and hopefully benefits them. The dilemma is how to maximize the benefits and minimize the costs to the organization and the recipients. This is likely to be a balance between candor and confidentiality. It may also require customizing the feedback to suit the candidates. Some may welcome feedback; others may avoid it. Those who want more detail can be given the information. Precautions should be taken to guard the assessment information to protect its usefulness to the organization (e.g., do not hand the test and scores to applicants) as well as to deliver the results in a sensitive way that takes into account the recipient’s ability to comprehend the information. This may require hiring and training an assessor or coach to convey and discuss the results with the applicant.

The organization must also determine its obligation to follow up after the feedback is delivered. Follow-up questions can benefit the individual by ensuring that harm was not done and possibly providing further coaching or career guidance. Follow-up can benefit the organization by asking candidates for their perceptions of the fairness of the selection process and their feelings about the organization. Organizations that routinely provide assessment feedback to internal candidates are likely to foster a continuous learning culture that includes accepting and understanding performance feedback and seeking areas for performance improvement and career development. Feedback recipients learn to evaluate the feedback results for themselves and share them with others, perhaps their coworkers, as a way of validating the results and seeking ways to apply the information for their continued professional growth. Clear communication about the assessment method and how the results are used is important. Professionals responsible for deciding what assessment results are fed back to applicants, and how, need to consider not only the cost of divulging information about the test and results from the standpoint of the organization but also the individual’s ability to understand and benefit from the information. Organizations should track changes in performance over time at the individual and organizational levels. Also, organizations can collect data to show the added value of selection feedback and its joint effects with other interventions, such as coaching, training, career paths, and online developmental resources.

Returning to the three cases that we introduced at the outset of this chapter, here is how the organizations answered the questions about whether to provide feedback, how much, and in what form.

AC Feedback

The consulting firm that used an AC to help select recent MBA graduates for entry-level positions decided to inform the candidates that feedback would be given one week after the online assessment. Although some test results would be available to the firm immediately, on the basis of computer scoring, the exercises would provide qualitative results that the firm wanted to examine and integrate with all of the information about the candidates. Observers who reviewed the transactions would record some feedback on each exercise. The feedback would be available to candidates who passed the assessment and those who did not, although the nature of the feedback and the tenor of the conversation would differ. Those who passed and were offered jobs were told about developmental experiences that would help them use the feedback for development in relation to the nature of the work they would be doing and the organization’s performance expectations—essentially, a realistic job preview. Consistent with the results reviewed above, applicants who were not offered positions were given the information in a way that did not damage their self-image (e.g., suggesting that this job may not have been right for them) and that pointed to behaviors they could develop. The firm asked the candidates not to share the details of the selection process with others, although they might want to reveal the results to others who could be helpful in their job search (e.g., their academic advisor or a friend or career mentor).

The feedback included information about how the data would be retained and protected. For selected candidates who accepted a position, this included making the assessment results available to the individual’s new manager and the HR director, who would track the individual’s career and developmental assignments. For candidates who failed the assessment, the data would be retained for a year under lock and key in case the individual wanted to apply for another position with the firm.

Test Feedback

The restaurant hiring food service personnel using several evaluation sources (a biodata form, an integrity test, and an interview) provided a written report for individuals who were offered jobs. Those who were not hired were given information about the nature of the assessment to help them realize that the selection methods were fair and related to the position. They were also given a summary statement of results written in relation to job requirements (e.g., “This job requires a person to keep records accurately.”) without providing the individual’s actual results. Pains were taken not to divulge the nature or purpose of specific aspects of the selection process.

Individual Assessment for Executive Selection The consumer products company hiring a marketing vice president asked the personnel psychologist who created and administered the assessment process to prepare separate feedback reports for the company and the candidates. All of the candidates had the option of requesting a feedback report, which would be delivered in person by the psychologist. The feedback reports would be available only after the decision was made and a candidate accepted the job offer. The successful candidate would receive a more detailed, career-oriented report that would be the start of an ongoing developmental coaching experience with the psychologist or another external coach, at the discretion of the new vice president.

Note that these cases suggest that rejected applicants receive less detailed feedback than those who are accepted. This may be the norm, and it serves to protect applicants' self-image, their perception of the selection method's fairness and validity, and their positive feelings about the organization, as well as to limit the resources the organization devotes to those who will not be employed. However, standards of best practice, as we described at the outset of this chapter, specify the importance of clear feedback and indicate that giving feedback to applicants should be incorporated into the design of the selection method. Human resource professionals need to do this in a way that takes into account applicants' affective reactions and the potential value of the results for their career development.

IMPLICATIONS FOR RESEARCH
More research is needed on applicant reactions measured after applicants receive feedback, comparing those who received negative feedback with those who received positive feedback. Applied research can evaluate the effects of feedback along with the effects of other aspects of the selection process, such as explaining the process to demonstrate its fairness and relevance to the position and providing realistic job previews before and/or after the assessment to guide candidates' decisions. Such applied research can be part of the design of a selection process. The effects of feedback on the continued reliability and validity of an assessment process should also be determined. These data can be used for ongoing program improvement.

Potential group differences in reactions to feedback should also be explored. Studying job performance feedback, Roberts and Nolen-Hoeksema (1994) reported that women were more likely than men to report lower self-esteem and intention to change behavior after they received negative feedback. Similarly, Johnson and Helgeson (2002) found that women's self-esteem declined significantly after receiving negative feedback and increased slightly after receiving positive feedback, whereas men's self-esteem was not affected by the favorability of feedback. Generally, women are more self-aware and more sensitive to feedback than are men (Fletcher, 1999; London & Wohlers, 1991). Men tend to have inflated self-assessments, which may explain why they discount evaluative feedback more than women do (Cleveland, Lim, & Murphy, 2007; Vecchio & Anderson, 2009); because of this, women may be more likely to process feedback mindfully, react to it, and change behavior as a result. Furthermore, research suggests that women may react differently to feedback depending on whether the job in question is traditionally male-dominated or female-dominated (London, Downey, Romero-Canyas, Rattan, & Tyson, 2012). Research should examine how specific versus more generic feedback affects the reactions of males and females depending on the type of job for which one applies.

Basic research should explore the effects of anticipated and actual feedback on candidates' perceptions, test performance, and career development and decisions. This should include studying candidates' reactions to feedback as an impression-making opportunity. The effects of feedback on later job applications (for both the employed and unemployed), participation in development, and assessment performance should be studied. Other research should examine the extent to which the expectation of receiving feedback influences assessment performance. Generally, we need to study how people process positive and negative assessment feedback and the effects of that feedback on their career decisions.

Other areas for investigation include understanding how assessment feedback interacts with candidates' demographic characteristics (age, gender, minority status, cultural background, career stage), organizational characteristics (size, growth history, reputation for treating employees), and the nature of the assessment data (qualitative or quantitative, detailed or general, and accompanied by coaching and the availability of developmental resources such as training programs for internal candidates).

CONCLUSIONS
Feedback of assessment results is a complex process involving issues of information security and professional accountability as well as development. Professional standards suggest that developers of selection tests and other assessment methods are obligated to provide some feedback, if only to explain the selection method and the rationale for the outcome. Feedback also affects candidates' reactions to the selection process, which in turn may affect their decisions about the organization and their development goals. Feedback can benefit the candidates and the organization, but precautions must be taken to guard the confidentiality of the information and protect the self-image of the feedback recipient. The organization must determine the level of feedback specificity that is in the best interests of the organization and the candidates. Internal and external candidates may be treated differently, with more detail and developmental support provided to internal candidates, and optional feedback and help interpreting the results offered to external candidates.

To be constructive, feedback can focus on the implications of the results for behaviors, not on personal characteristics that threaten self-image. Moreover, feedback can be tied to job requirements and developmental opportunities. Feedback can be preceded, when possible, by information about the selection test to explain its job relevance and fairness. Furthermore, feedback can be accompanied by behavior-based standards, not merely comparisons to others but standards in relation to the job knowledge, abilities, and experiences that are required for the job and that predict success on the job. Feedback should not be given without adequate support for using the information and without ensuring that the candidate was not harmed by the information. Although assessment feedback has potential costs to the candidate and the organization, the benefits of assessment results can be maximized by recognizing their value for both selection and development.

REFERENCES
American Psychological Association. (1998). The rights and responsibilities of test takers: Guidelines and expectations. Test Taker Rights and Responsibilities Working Group of the Joint Committee on Testing Practices. Washington, DC: Author. http://www.apa.org/science/ttrr.html
American Psychological Association. (2002). Ethical principles of psychologists and code of conduct. Washington, DC: Author.
American Psychological Association. (2014). Standards for educational and psychological testing. Washington, DC: Author.
Anderson, N. (2003). Applicant and recruiter reactions to new technology in selection: A critical review and agenda for future research. International Journal of Selection and Assessment, 11, 121–136.
Anderson, N., & Goltsi, V. (2006). Negative psychological effects of selection methods: Construct formulation and an empirical investigation into an assessment center. International Journal of Selection and Assessment, 14, 236–255.
Bangert-Drowns, R. L., Kulik, C. L. C., Kulik, J. A., & Morgan, M. T. (1991). The instructional effect of feedback in test-like events. Review of Educational Research, 61, 213–238.
Bauer, T. N., Maertz, C. P., Jr., Dolen, M. R., & Campion, M. A. (1998). Longitudinal assessment of applicant reactions to employment testing and test outcome feedback. Journal of Applied Psychology, 83, 892–903.
Bauer, T. N., Truxillo, D. M., Sanchez, R. J., Craig, J. M., Ferrara, P., & Campion, M. A. (2001). Applicant reactions to selection: Development of the selection procedural justice scale. Personnel Psychology, 54, 387–418.
Bearfield, N. D. (2011). Explanatory Note, EU Careers, Doc. 309901. Brussels: European Personnel Selection Office. ec.europa.eu/dpo-register/download?metaId=1447044 (accessed January 31, 2016).
Bruk-Lee, V., Drew, E. N., & Hawkes, B. (2013). Candidate reactions to simulations and media-rich assessments in personnel selection. In M. Fetzer & K. Tuzinski (Eds.), Simulations for personnel selection (pp. 43–60). New York, NY: Springer.
Camilleri, C. (2014). EPSO—Overview of the new self-selection and self-assessment selection module in open competitions. Brussels: European Commission. http://www.testpublishers.org/assets/2014_sst_sat1.pdf
Cleveland, J. N., Lim, A. S., & Murphy, K. R. (2007). Feedback phobia: Why employees do not want to give or receive performance feedback. In J. Langan-Fox, C. L. Cooper, & R. J. Klimoski (Eds.), Research companion to the dysfunctional workplace: Management challenges and symptoms (pp. 168–186). Northampton, MA: Edward Elgar. doi: 10.4337/9781847207081.00018
DeNisi, A. S., & Kluger, A. N. (2000). Feedback effectiveness: Can 360-degree appraisals be improved? Academy of Management Executive, 14, 129–139.
Derous, E., Born, M. P., & DeWitte, K. (2004). How applicants want and expect to be treated: Applicants' selection treatment beliefs and the development of the social process questionnaire on selection. International Journal of Selection and Assessment, 12, 99–119.
Drasgow, F., Olson, J. B., Keenan, P. A., Moberg, P., & Mead, A. D. (1993). Computerized assessment. Research in Personnel and Human Resources Management, 11, 163–206.
Ferdig, R. E., & Mishra, P. (2004). Emotional responses to computers: Experiences in unfairness, anger, and spite. Journal of Educational Multimedia and Hypermedia, 13, 143–161.
Fletcher, C. (1991). Candidates' reactions to assessment centres and their outcomes: A longitudinal study. Journal of Occupational Psychology, 64, 117–127.
Fletcher, C. (1999). The implications of research on gender differences on self-assessments and 360 degree appraisal. Human Resource Management Journal, 9, 39–46.
Folger, R., & Greenberg, J. (1985). Procedural justice: An interpretive analysis of personnel systems. Research in Personnel and Human Resources Management, 3, 141–183.
Gilliland, S. W. (1993). The perceived fairness of selection systems: An organizational justice perspective. Academy of Management Review, 18, 694–734.
Gilliland, S. W. (1994). Effects of procedural and distributive justice on reactions to a selection system. Journal of Applied Psychology, 79, 691–701.
Gilliland, S. W., Groth, M., Baker, R. C., Dew, A. F., Polly, L. M., & Langdon, J. C. (2001). Improving applicants' reactions to rejection letters: An application of fairness theory. Personnel Psychology, 54, 669–703.
Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self-manager, self-peer, and peer-manager ratings. Personnel Psychology, 41, 43–62.
Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection procedures: An updated model and meta-analysis. Personnel Psychology, 57, 639–683.
Herriot, P. (2004). Social identities and applicant reactions. International Journal of Selection and Assessment, 12, 75–83.
International Task Force on Assessment Center Guidelines. (2000). Guidelines and ethical considerations for assessment center operations. Bridgeville, PA: Development Dimensions International.
Johnson, M., & Helgeson, V. S. (2002). Sex differences in response to evaluative feedback: A field study. Psychology of Women Quarterly, 26, 242–251.
Kanfer, R., Wanberg, C. R., & Kantrowitz, R. M. (2001). Job search and employment: A personality-motivational analysis and meta-analytic review. Journal of Applied Psychology, 86, 837–855. doi: 10.1037//0021-9010.86.5.837
Kluger, A. N., & Rothstein, H. R. (1993). The influence of selection test type on applicant reactions to employment testing. Journal of Business and Psychology, 8, 3–25.
Lipnevich, A. A., & Smith, J. K. (2009). "I really need feedback to learn": Students' perspectives on the effectiveness of the differential feedback messages. Educational Assessment, Evaluation and Accountability, 21, 347–367. doi: 10.1007/s11092-009-9082-2
London, B., Downey, G., Romero-Canyas, R., Rattan, A., & Tyson, D. (2012). Gender-based rejection sensitivity and academic self-silencing in women. Journal of Personality & Social Psychology, 102, 961–979. http://dx.doi.org/10.1037/a0026615
London, M. (2003). Job feedback: Giving, seeking, and using feedback for performance improvement (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
London, M., & Wohlers, A. J. (1991). Agreement between subordinate and self-ratings in upward feedback. Personnel Psychology, 44, 375–390.
Lounsbury, J. W., Bobrow, W., & Jensen, J. B. (1989). Attitudes toward employment testing: Scale development, correlates, and "known-group" validation. Professional Psychology: Research and Practice, 20(5), 340.
Macan, T. H., Avedon, M. J., Paese, M., & Smith, D. E. (1994). The effects of applicants' reactions to cognitive ability tests and an assessment center. Personnel Psychology, 47, 715–738.
Maertz, C. P., Jr., Bauer, T. N., Mosley, D. C., Jr., Posthuma, R. A., & Campion, M. A. (2004). Do procedural justice perceptions in a selection testing context predict applicant attraction and intention toward the organization? Journal of Applied Social Psychology, 34, 125–145.
Maertz, C. P., Jr., Bauer, T. N., Mosley, D. C., Jr., Posthuma, R. A., & Campion, M. A. (2005). Predictors of self-efficacy for cognitive ability employment testing. Journal of Business Research, 58, 160–167.
Marcus, B. (2003). Attitudes towards personnel selection methods: A partial replication and extension in a German sample. Applied Psychology: An International Review, 52, 515–532.
McCarthy, J. M., Van Iddekinge, C. H., Lievens, F., Kung, M. C., Sinar, E. F., & Campion, M. A. (2013). Do candidate reactions relate to job performance or affect criterion-related validity? A multistudy investigation of relations among reactions, selection test scores, and job performance. Journal of Applied Psychology, 98, 701–719.
McFarland, C., & Ross, M. (1982). Impact of causal attributions on affective reactions to success and failure. Journal of Personality and Social Psychology, 43, 937–946.
Meyer, W. U., Mittag, W., & Engler, U. (1986). Some effects of praise and blame on perceived ability and affect. Social Cognition, 4, 293–308.
Mishra, P. (2006). Affective feedback from computers and its effect on perceived ability and affect: A test of the computers as social actor hypothesis. Journal of Educational Multimedia and Hypermedia, 15, 107–131.
Nicol, D. J., & Macfarlane-Dick, D. (2006). Formative assessment and self-regulated learning: A model and seven principles of good feedback practice. Studies in Higher Education, 31, 199–218.
Oostrom, J. K., Born, M. P., & van der Molen, H. T. (2013). Webcam tests in personnel selection. In D. J. Derks & A. B. Bakker (Eds.), The psychology of digital media at work (pp. 166–180). New York, NY: Psychology Press.
Ployhart, R. E., & Harold, C. M. (2004). The Applicant Attribution-Reaction Theory (AART): An integrative theory of applicant attributional processing. International Journal of Selection and Assessment, 12, 84–98.
Ployhart, R. E., Ryan, A. M., & Bennett, M. (1999). Explanations for selection decisions: Applicants' reactions to informational and sensitivity features of explanations. Journal of Applied Psychology, 84, 87–106.
Pope, K. S. (1992). Responsibilities in providing psychological test feedback to clients. Psychological Assessment: Clinical & Forensic, 4, 268–271.
Roberts, R. A., & Nolen-Hoeksema, S. (1994). Gender comparisons in responsiveness to others' evaluations in achievement settings. Psychology of Women Quarterly, 18, 221–240.
Ryan, A. M., & Ployhart, R. E. (2000). Applicants' perceptions of selection procedures and decisions: A critical review and agenda for the future. Journal of Management, 26, 565–606.
Ryan, A. M., & Ployhart, R. E. (2014). A century of selection. Annual Review of Psychology, 65, 693–717.
Sackett, P. R., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 419–450.
Schinkel, S., Van Dierendonck, D., & Anderson, N. (2004). The impact of selection encounters on applicants: An experimental study into feedback effects after a negative selection decision. International Journal of Selection and Assessment, 12, 197–205.
Schinkel, S., van Dierendonck, D., Van Vianen, A., & Ryan, A. M. (2011). Applicant reactions to rejection: Feedback, fairness, and attributional style effects. Journal of Personnel Psychology, 10, 146–156. doi: 10.1027/1866-5888/a000047
Schleicher, D. J., Venkataramani, V., Morgeson, F. P., & Campion, M. A. (2006). So you didn't get the job . . . now what do you think? Examining opportunity-to-perform fairness perceptions. Personnel Psychology, 59, 559–590.
Schuler, H. (1993). Social validity of selection situations: A concept and some empirical results. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel selection and assessment: Individual and organizational perspectives (pp. 11–26). Hillsdale, NJ: Lawrence Erlbaum.
Smither, J. W., Reilly, R. R., Millsap, R. E., Pearlman, K., & Stoffey, R. W. (1993). Applicant reactions to selection procedures. Personnel Psychology, 46, 49–76.
Spychalski, A. C., Quiñones, M. A., Gaugler, B. B., & Pohley, K. (1997). A survey of assessment center practices in organizations in the United States. Personnel Psychology, 50, 71–90.
Thibaut, J., & Walker, L. (1975). Procedural justice: A psychological analysis. Hillsdale, NJ: Lawrence Erlbaum.
Thorsteinson, T. J., & Ryan, A. M. (1997). The effect of selection ratio on perceptions of the fairness of a selection test battery. International Journal of Selection and Assessment, 5, 159–168.
Tippins, N. T. (2015). Technology and assessment in selection. Annual Review of Organizational Psychology and Organizational Behavior, 2(1), 551–582. doi: 10.1146/annurev-orgpsych-031413-091317
Truxillo, D. M., Bauer, T. N., Campion, M. A., & Paronto, M. E. (2002). Selection fairness information and applicant reactions: A longitudinal field study. Journal of Applied Psychology, 87, 1020–1031.
Truxillo, D. M., Bodner, T., Bertolino, M., Bauer, T. N., & Younce, C. (2009). Effects of explanations on applicant reactions: A meta-analytic review. International Journal of Selection and Assessment, 17, 346–361.
Truxillo, D. M., Steiner, D. D., & Gilliland, S. (2004). The importance of organizational justice in personnel selection: Defining when selection fairness really matters. International Journal of Selection and Assessment, 12, 39–53.
Van Vianen, A. E. M., Taris, R., Scholten, E., & Schinkel, S. (2004). Perceived fairness in personnel selection: Determinants and outcomes in different stages of the assessment procedure. International Journal of Selection and Assessment, 12, 149–159.
Vecchio, R. P., & Anderson, R. J. (2009). Agreement in self-other ratings of leader effectiveness: The role of demographics and personality. International Journal of Selection and Assessment, 17, 165–179. doi: 10.1111/j.1468-2389.2009.00460.x
Wanberg, C., Basbug, G., Van Hooft, E. A. J., & Samtani, A. (2012). Navigating the black hole: Explicating layers of job search context and adaptational responses. Personnel Psychology, 65, 887–926. doi: 10.1111/peps.12005
Wanberg, C. R., Glomb, T. M., Song, Z., & Sorenson, S. (2005). Job-search persistence during unemployment: A 10-wave longitudinal study. Journal of Applied Psychology, 90, 411–430. doi: 10.1037/0021-9010.90.3.411
Wanberg, C. R., Kanfer, R., Hamann, D. J., & Zhang, Z. (2015). Age and reemployment success after job loss: An integrative model and meta-analysis. Psychological Bulletin. Advance online publication. http://dx.doi.org/10.1037/bul0000019
Watts, S. A. (2007). Evaluative feedback perspectives on media effects. Journal of Computer-Mediated Communication, 12, 384–411.
Wiechmann, D., & Ryan, A. M. (2003). Reactions to computerized testing in selection contexts. International Journal of Selection and Assessment, 11, 215–229.

Part V
CRITERION CONSTRUCTS IN EMPLOYEE SELECTION
KEVIN R. MURPHY AND ELAINE D. PULAKOS, SECTION EDITORS

20
THE MEASUREMENT OF TASK PERFORMANCE AS CRITERIA IN SELECTION RESEARCH
WALTER C. BORMAN, MATTHEW R. GROSSMAN, REBECCA H. BRYANT, AND JAY DORIO

This chapter is about measuring task performance (i.e., the technical proficiency part of job performance) in personnel selection research. When the validity of selection tests and procedures is evaluated, the accuracy of the resulting validity estimates depends in turn on the accuracy of criterion performance measurement. Accordingly, there is considerable motivation in selection research to obtain reliable and accurate criterion scores for the job incumbents participating in the research. Our chapter covers the task performance criterion "space." Chapter 21 in this volume describes citizenship performance criteria. Specific topics covered in this chapter are (a) relatively objective criterion measures, such as work samples, job knowledge tests, and production rates; (b) subjective measures (i.e., ratings of performance), including different rating formats and rater training strategies; (c) the dimensionality of job performance; and (d) validity estimates against task performance for several predictor constructs (e.g., ability, personality).

OBJECTIVE CRITERIA
At first glance, we might assume that objective criterion measures should be preferred over subjective ratings of performance. However, "objective" may be at least in part a misnomer in that judgment often enters into the use of objective criteria. Also, objective measures are notoriously deficient as criteria because they usually tap into only a small part of the total criterion space. Contamination can also be a serious problem with objective measures. For example, factors beyond the control of the job incumbent can influence objective criterion measures. Nonetheless, when they are relevant to important performance requirements and are reasonably reliable and uncontaminated (or when corrections can be made to reduce contamination), objective measures can be useful for measuring some criterion dimensions. In other words, the deficiency issue may not be a problem with objective criteria if we measure the task performance part of the job well and use other criterion measures to evaluate performance in other aspects of the job.

Production Rates For jobs that have observable, countable products that result from individual performance (e.g., military recruiters or patrol officers who are assigned traffic enforcement duties), a production rate criterion is a compelling bottom-line index of performance. However, as often noted (e.g., Borman, 1991; Guion, 1965), considerable care must be taken in gathering and interpreting production data. For example, work-related dependencies on other employees or on equipment for determining production rates may create bias in these rates. Also, production standards and quota systems (e.g., in call center jobs) can create problems for criterion measurement. Instability of production rates is another potential problem. Rothe’s (1978) extensive research on production workers showed that week-to-week production rates are only moderately reliable. Correlations between successive weeks’ production average .75 with incentives, and .53 with no incentives (Rothe, 1978). Longer periods for data collection may be necessary to ensure stable criterion production rates. Most importantly, researchers attempting to derive production criteria should pay special attention to possible contaminating influences, whereby employees have unequal opportunities to produce at the same rate.
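As a rough illustration of the stability concern, the sketch below computes the correlation between adjacent weeks' production and uses the Spearman-Brown formula to project how reliable an aggregate over several weeks would be. The weekly counts here are hypothetical; the .75 and .53 values cited in the text are Rothe's reported averages, not outputs of this sketch.

import numpy as np

# Hypothetical weekly production counts for eight workers over two weeks.
week1 = np.array([52, 47, 60, 39, 55, 44, 58, 50])
week2 = np.array([49, 45, 63, 41, 50, 47, 55, 52])

# Week-to-week reliability: the correlation between adjacent weeks.
r_week = np.corrcoef(week1, week2)[0, 1]

# Spearman-Brown projection: reliability of an aggregate of k parallel weeks.
def spearman_brown(r_single, k):
    return (k * r_single) / (1 + (k - 1) * r_single)

print(f"single-week reliability: {r_week:.2f}")
for k in (2, 4, 8):
    print(f"projected reliability of a {k}-week aggregate: {spearman_brown(r_week, k):.2f}")

This is the logic behind the suggestion that longer data collection periods may be needed to obtain stable production criteria; it assumes, of course, that the weeks function as parallel measures and that contaminating influences are roughly constant across weeks.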

Sales Initially, sales jobs may seem ideally suited for the use of objective criteria as performance measures. Number of sales per unit time, or some similar index of bottom-line sales volume, appears compelling as a global, overall performance measure. However, upon closer inspection, significant criterion contamination issues are evident for objective sales criteria. First, summary sales volume measures are a function of individual skill and effort as well as environmental factors beyond the control of the salesperson. In the Campbell, Dunnette, Lawler, and Weick (1970) behavior-performance-effectiveness model, behavior is a simple description of what an employee might do on the target job (as in a task statement), with no evaluative component. Performance is behavior with an evaluative component added, as in a single critical incident (Flanagan, 1954). Finally, effectiveness includes an outcome that results from the performance. Thus, in the context of the Campbell et al. model, objective sales volume is an effectiveness measure. Where environmental influences are substantial and unequal in their effect on sales, criterion measurement will be contaminated.

One way to remove contamination is to adjust sales data for factors such as market potential (e.g., Kanfer & Borman, 1987). A practical strategy for making these adjustments is to create norms for stores, sales territories, or whatever organizational unit provides the appropriate comparison. Then criterion scores for each salesperson can be compared to scores for other salespersons with roughly the same selling-related environment and thus similar opportunities to produce sales. Unfortunately, an inherent problem with this approach has to do with the norming process. For example, if large sales territories with many salespersons are used to accomplish the norming, there may be meaningful differences within territories with respect to opportunity to perform. If smaller territories are used, then the norms tend to be unstable because the mean sales performance comparison indices are based on too few salespersons. Thus, how one does the adjusting may be as important as whether or not to adjust.

Another "objective" sales criterion that is often used is percent of quota, which presumably controls for environmental factors. Of course, the accuracy of this measure depends on how accurately one defines the environmental factors when setting the quota.
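As a simple illustration of the norming strategy just described, the sketch below standardizes each salesperson's volume within his or her territory, so that criterion scores reflect standing relative to others with roughly the same opportunity to sell. The data, column names, and territory labels are hypothetical.

import pandas as pd

# Hypothetical sales records: one row per salesperson.
sales = pd.DataFrame({
    "salesperson": ["A", "B", "C", "D", "E", "F"],
    "territory":   ["East", "East", "East", "West", "West", "West"],
    "volume":      [120, 150, 90, 60, 75, 45],
})

# Norm within territory: subtract the territory mean and divide by the
# territory standard deviation, so scores are comparable across territories.
grouped = sales.groupby("territory")["volume"]
sales["adjusted_criterion"] = (
    (sales["volume"] - grouped.transform("mean")) / grouped.transform("std")
)

print(sales)

As the discussion above notes, the credibility of any such adjustment hinges on whether the norming unit really equates opportunity; very small territories make the territory means and standard deviations unstable.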

Work Samples Work sample or performance tests are sometimes developed to provide criteria for selection research. Some argue that work sample tests have the highest fidelity for measuring criterion performance. In a sense, the argument is compelling: What could be fairer than to assess employees' performance on a job by having them actually perform some of the most important tasks associated with it? Yet evaluation of work samples as criteria is not quite so simple, and their use involves several issues.

One issue in work sample scoring is whether to evaluate products or process in work sample performance. In general, tasks associated with products (e.g., troubleshooting a problem with a radio) can be oriented toward either product or process; tasks with no resulting products (e.g., interviewing a job candidate) must be scored according to process considerations. An advantage of scoring products over process is that assessment is typically more objective. However, if the procedures taken to arrive at the product are also important, process assessment is clearly necessary.

Other issues are relevant to scoring work samples. Difficult-to-score process steps are to be avoided. For example, checking and inspecting steps are difficult, if not impossible, to observe. Ill-defined steps and complex steps on which an employee can do well on one part but poorly on another should also be avoided.

Still another issue with scoring work samples concerns the relative merits of pass/fail marks versus performance-level ratings on task steps. Guion (1978) argued for task step performance ratings (e.g., on a 1 = low to 7 = high scale) because they provide more information. Indeed, many steps seem amenable to a continuous performance scale, where such judgments as "more skillful," "faster," and "more efficient" may have meaning for evaluating performance. For certain very simple task steps, pass/fail may suffice, but it will usually be desirable to develop continuous performance scales for use in work sample testing.

A major issue with work sample tests is that researchers may treat them as ultimate criteria; that is, these tests are sometimes considered the criterion of choice for accurately assessing performance in certain jobs, especially those that require complex motor skills. Work samples should not be thought of in this light. First, they are clearly maximum performance rather than typical performance measures. As such, they tap the "can-do" more than the "will-do," performance-over-time aspects of effectiveness. Yet will-do, longer-term performance is certainly important for assessing effectiveness in jobs. Accordingly, these measures are deficient when used exclusively in measuring performance. In sum, inherent shortcomings of work samples for measuring some aspects of performance, as well as practical limitations such as time and equipment constraints, argue against relying on such tests to provide a comprehensive index of overall performance.

Job Knowledge Tests Another category of criterion measures is the job knowledge test. Once the target tasks are identified, items can be prepared, typically in a multiple-choice format, although other kinds of items, such as the essay type, are of course possible. Just as in writing any other multiple-choice items, care should be taken to ensure that the item stems and response alternatives are clearly stated and that distractor responses are definitely wrong but plausible. An issue in job knowledge test development is when the paper-and-pencil knowledge test medium is appropriate for evaluating job performance. When a task is procedural, requiring primarily knowledge about the steps to complete it rather than complex motor skills for performing each step, a job knowledge format seems clearly to be as appropriate as a work sample format. Tasks requiring certain skills and operations are probably not amenable to job knowledge testing. Such tasks include (a) those that require finely tuned acts of physical coordination (e.g., a police marksmanship task), (b) those that require quick reaction (e.g., typing a letter under time pressure), and (c) those that require complex time-sharing psychomotor performance (e.g., aircraft cockpit simulator tasks).

SUBJECTIVE CRITERIA
Subjective criteria will typically be synonymous with performance ratings. The notion of supervisors or peers providing numerical scores for employees on job-relevant performance areas is an interesting idea. Ideally, it provides well-informed observers with a means of quantifying their perceptions of individuals' job performance. This is preferable to verbal descriptions of performance because individuals can now be compared in a reasonably straightforward way. The notion can be viewed as analogous to developing structured job analysis questionnaires to take the place of verbal job descriptions for purposes of comparing jobs (McCormick, 1976). In each case, quantification of perceptions clears the way for scientific study of an area that could not previously be studied in this manner.

The emphasis in this section will be on ratings gathered for research only as criteria for selection research applications. Although ratings can be generated for purposes of salary administration, promotion decisions, or employee feedback and development, ratings generated for those purposes are not very relevant to personnel selection research. We should add here that recent research and commentaries clearly suggest that performance appraisal systems have very little impact on individual or organizational effectiveness (e.g., Pulakos, 2004; Pulakos & Mueller-Hanson, 2015; Pulakos & O'Leary, 2011), further supporting the point that operational performance appraisal ratings should not be used as criteria for test validation research.

For-research-only performance ratings are the most often used criterion measure in I-O psychology. Landy and Farr (1980) refer to several surveys intended to assess how frequently ratings are used as criterion measures in research reports. The percentages reach 75% and higher, suggesting that considerable attention should be paid to this criterion measurement method in the interest of making such measurement as accurate as possible. Issues in using ratings as performance criteria include (a) the design of the rating form to be used and (b) the type of training to provide to raters.

Rating Formats Behaviorally Anchored Rating Scales Smith and Kendall (1963) extended the notion of critical incidents (Flanagan, 1954) by designing a rating format they referred to as behavioral expectation scales, now generally labeled behaviorally anchored rating scales (BARS). Smith and Kendall reasoned that different effectiveness levels on job performance rating scales might be anchored using behavioral examples of incumbent performance. Accordingly, they developed performance rating dimensions, with scaled behavioral examples anchoring the appropriate effectiveness levels on the dimensions. Essentially, the rater’s task is to compare observed job behaviors of the ratee with the behavioral anchors on the scale to assign a rating on that dimension. This was seen as preferable to evaluating a ratee without guidance regarding the effectiveness levels of different scale points. The BARS idea is more than a format; it is a system, or even philosophy (Bernardin & Smith, 1981). For example, ideally raters should record examples of employee work behavior in preparation for assigning performance ratings. Another positive feature of BARS is that users of the system typically participate in scale development, enhancing the credibility of the format. Further, from a domain sampling perspective, BARS development steps provide an excellent methodology to aid in identifying all important dimensions for a job.

Behavior Summary Scales In response to a difficulty some raters have had with BARS, that of matching observed ratee performance to the often very specific, low-base-rate behaviors serving as anchors on the scale, Borman (1979) developed the behavior summary scales (BSS) format. With this format, behavioral incidents are first generated targeting a wide range of levels of effectiveness on each dimension, as with BARS. Second, the incidents are retranslated according to dimension membership and level of effectiveness, also as is done with BARS. Finally, the content of all incidents reliably retranslated into the high-, mid-, and low-effectiveness levels, respectively, is summarized, resulting in the summary scale anchors. These summary anchors sometimes represent four or five effectiveness levels, but the main point is that, rather than following the BARS practice of using individual incidents as scale anchors, BSS use summary anchors that capture the behavioral content of several individual anchors at each level of effectiveness for each dimension.

Regarding raters' use of the BSS, the most important potential advantage is that the behavioral construct underlying each aspect of job performance is made more evident to the rater. Raters do not need to infer the dimensionality from a series of highly specific incidents. The inferential step is accomplished in scale development, in which the behavioral essence of several specific incidents is distilled into each behavior summary statement. Accordingly, this approach should increase the probability that raters can match observed ratee behavior directly with scaled behavior. That is, by increasing the scope of behavior representing various performance levels on a scale, chances are greater that one of the anchors will accurately describe a ratee's performance on that dimension. This argument makes good conceptual sense, but in the one format comparison study pitting BARS against a BSS format, there were no consistent differences between these format types with respect to psychometric error or accuracy (Borman, 1979). Thus, the seeming conceptual advantage of BSS may not make much difference in the actual use of the scale.

Behavior Observation Scales Latham and Wexley (1981) developed the behavior observation scales (BOS) format with favorably worded behavioral statements that the rater responds to by indicating how frequently the ratee exhibits each of these behaviors. Latham and Wexley (1981) provided a list of advantages of BOS, including (a) BOS are developed from a systematic job analysis; (b) the content of the explicit behavioral items provides an excellent listing of the job’s performance requirements in concrete behavioral terms; and (c) item analysis and factor analytic procedures can be more readily applied to BOS ratings than to BARS or BSS data. To these should be added that BOS items appear to cut down on the complexity of inferences necessary to make a rating, although a study by Murphy, Martin, and Garcia (1982) casts some doubt on this point.

Computerized Adaptive Rating Scales Each of these behavior-based rating format ideas had appealing features. However, the following question arose: Does format make a difference relative to rating errors or the reliability and validity of the ratings generated by raters using the different formats? Not all of the relevant format comparison studies have been conducted, but the studies that have been completed generally show small differences between formats in terms of level of rater errors, reliability, validity, or accuracy. For example, early reviews of format comparison studies (Landy & Farr, 1983; Schwab, Heneman, & DeCotiis, 1975) concluded that the psychometric properties of the BARS format are probably not much better than those of graphic rating scales (GRS, or scales with numerical rather than behavioral anchors). Borman (1979) found only small differences in halo, inter-rater reliability, and validity for BARS, the BSS, and a graphic rating format. Landy and Farr (1980) went so far as to estimate that the variance accounted for in psychometric quality by rating format was as little as 4%. In fact, they called for a "moratorium" on rating format research, citing the largely negative results. For the next 20 years, Landy and Farr's suggestion was followed for the most part (Farr & Levy, 2007), but it still seems compelling to explore rating format ideas that might result in more reliable and valid judgments about work performance. Small adjustments made to present formats are unlikely to result in higher reliabilities and validities; however, it still seems important to experiment with formats that are fundamentally different from those currently used, in the hope of developing a format more in alignment with raters' cognitive processes or one that somehow calibrates raters' perceptions to help them make more precise judgments about observed performance.

One possible idea in the direction of a different rating measurement method started with consideration of Thurstone's (1927) law of comparative judgment in the context of the performance rating process. Thurstone developed a method for scaling stimuli on the basis of paired-comparison judgments. Arguably, his approach places stimuli on an interval scale. In the context of rating job performance, Borman, Buck, Hanson, Motowidlo, Stark, and Drasgow (2001) reasoned that pairs of behavioral statements might be presented to the rater with instructions to pick the statement that is more descriptive of the ratee. If interval-scale judgments of ratee performance levels can be achieved with this method, the paired-comparison judgments may provide ratings that are more precise than those generated by other rating formats that use a linear numerical scale, which arguably provide only ordinal-level measurement.

Another idea that might make the paired-comparison format even more effective is to apply an item response theory (IRT) adaptive testing orientation to the method. For example, the rater could be presented with a series of behavioral statement pairs such that responses to each successive pair provide a more precise estimate of ratee performance. Accordingly, our notion for computerized adaptive rating scales (CARS) was to develop a paired-comparison rating task that used adaptive testing principles to help raters estimate a ratee's performance level through an iterative paired-comparison rating process. The idea was to initially present two behavioral statements associated with a dimension—one reflecting somewhat below average performance and the other reflecting somewhat above average performance. Depending on which statement the rater indicated was more descriptive of the ratee, the rating algorithm, developed subsequently by Stark and Drasgow (2002), selected two additional behavioral statements—one with a scaled effectiveness level somewhat above the effectiveness value of the statement picked first as the more descriptive, and the other with a scaled effectiveness level somewhat below the effectiveness value of that initially chosen statement. The rater's selection of the more descriptive statement for the second paired comparison then revised the initial estimated ratee effectiveness level, and, as before, the algorithm selected two more statements with effectiveness values bracketing the revised estimated performance level. Thus, analogous to adaptive testing, a ratee's "true" effectiveness level was to be estimated in an IRT sense by this iterative paired-comparison rating task, which presents in sequence item pairs that maximize the amount of information about performance derived from each choice of an item.

In a laboratory study to evaluate selected psychometric properties of CARS compared to two other formats, videotapes of six office workers were prepared, depicting prescripted levels of performance on three dimensions, and subjects rated these vignettes using the CARS format and one of two competing formats (graphic or behaviorally anchored rating scales). Results showed 23–37% lower standard errors of measurement for the CARS format. In addition, validity was significantly higher for the CARS format (d = .18).
Accordingly, in this laboratory study, CARS showed promising results (Borman et al., 2001). More recently, we developed a CARS system for the Canadian Forces and were able to conduct a field test comparing the standard error of measurement of supervisor and peer ratings on the CARS and behaviorally anchored rating scales (Borman, Kubisiak, & Grossman, 2013). Results showed that the CARS ratings had, on average, a 19.5% lower standard error than the BARS ratings, again indicating greater precision for the CARS ratings (i.e., more reliable differentiation between ratees).

One last point about formats: although different format ideas may not make very large differences in psychometric properties (with the possible exception of CARS), well-articulated performance standards for communicating expectations and providing feedback in operational performance management systems can be quite useful. Thus, the behavior-based formats especially can serve this important purpose in organizations. We should also point out that the more complex formats (i.e., the behavior-based scales) are generally more expensive to develop and thus need more benefits to justify their use.
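Returning to the CARS mechanism described above, the sketch below illustrates the iterative bracketing idea in simplified form. It is not the Stark and Drasgow (2002) algorithm: the statement pool, scaled effectiveness values, step size, and averaging rule are all hypothetical, and an operational system would use IRT-based item selection and scoring rather than this heuristic.

# Hypothetical pool of scaled behavioral statements for one dimension:
# (statement text, scaled effectiveness value on a 1-7 continuum).
POOL = [
    ("Misses routine deadlines without notice", 1.5),
    ("Completes assignments with frequent reminders", 2.5),
    ("Completes routine assignments on time", 3.5),
    ("Completes assignments on time with few errors", 4.5),
    ("Anticipates problems and finishes work early", 5.5),
    ("Sets the standard others follow for quality and timeliness", 6.5),
]

def nearest(pool, target):
    """Return the statement whose scaled value is closest to the target level."""
    return min(pool, key=lambda s: abs(s[1] - target))

def cars_rating(choose, n_pairs=3, start=4.0, step=1.0):
    """Iteratively bracket the ratee's estimated effectiveness level.

    `choose` stands in for the rater: given two statements, it returns the
    one that is more descriptive of the ratee.
    """
    estimate = start
    for _ in range(n_pairs):
        low = nearest(POOL, estimate - step)
        high = nearest(POOL, estimate + step)
        picked = choose(low, high)
        # Move the estimate toward the scaled value of the chosen statement
        # and tighten the bracket for the next pair.
        estimate = (estimate + picked[1]) / 2.0
        step /= 2.0
    return estimate

# Example: a rater who always finds the higher statement more descriptive.
print(cars_rating(lambda a, b: a if a[1] > b[1] else b))

As the chapter describes, the operational algorithm likewise selects statement pairs whose scaled effectiveness values bracket the current estimate, but it scores the ratee in an IRT framework rather than by the simple averaging used here.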

Rater Training Rater training provides a promising approach to improving the quality of performance ratings. Two general kinds of training programs have emerged to help raters generate more error-free and accurate ratings (Smith, 1986; Woehr & Huffcutt, 1994). Rater error training seeks simply to alert raters to certain psychometric or perceptual errors such as leniency/severity, halo, restriction-in-range, and similar-to-me effects. Training often takes the form of a brief lecture on or demonstration of each error and training to avoid such errors when making performance ratings. Frame-of-reference training (Bernardin & Pence, 1980) attempts to convey to raters that performance is multidimensional and to thoroughly familiarize them with the actual content of each performance dimension. Regarding familiarization, examples of different levels of performance on individual dimensions are typically reviewed with raters, along with the “correct” or actual performance levels the examples reflect (e.g., Pulakos, 1984). Practice and feedback for trainees typically rating videotaped performances are important components of this type of training. Researchers have conducted studies comparing the psychometric properties and accuracy of ratings made by raters trained using one of the approaches just discussed and ratings generated by untrained raters. Results suggest the following conclusions: (a) error training is usually successful in reducing the target psychometric error (Pulakos, 1984); (b) error training does not improve the quality of ratings when inter-rater reliability or accuracy is used as a criterion (e.g., Borman, 1979); and (c) frame-of-reference training increases rating accuracy (Noonan & Sulsky, 2001; Woehr & Huffcutt, 1994). A useful observation was offered by Bernardin and Pence (1980): Rater error training is successful in reducing the target psychometric response set or error (e.g., halo), but essentially new response sets are imposed on raters (e.g., to eliminate halo, spread out your ratings across dimensions), resulting in no change in accuracy or a reduction in it. Similarly, Borman (1979) suggested that to direct persons to adjust their rating distributions in some manner is relatively easy for training to accomplish; it is much more difficult to train raters to be more accurate. Frame-of-reference training appears to be the best bet to attain this worthwhile goal.

DIMENSIONALITY OF JOB PERFORMANCE
Almost no one doubts that job performance is a multidimensional construct (Campbell, 1990b; Ghiselli, 1956). To identify these multiple categories of performance for a job, I-O psychologists will typically use task analysis, from which clusters of tasks may be derived to define the performance dimensions (McCormick, Jeanneret, & Mecham, 1972), or critical incidents analysis (Flanagan, 1954), which can also result in a set of performance dimensions for a job. With these approaches to identifying performance categories, the dimension sets are likely to be different across target jobs. At one level, this is how it should be. Jobs are often different. However, there is considerable motivation to identify a set of dimensions that represents the performance requirements in common across jobs. Over the past 25 years or so, at least six attempts have been made to develop such dimension sets. In this section of the chapter, we review the methodologies used in these efforts and present the dimension sets. Then, we review similarities and differences among the dimension sets and summarize all of the dimension content into a six-category taxonomy.

Campbell, McCloy, Oppler, and Sager (1993) Campbell, McCloy, Oppler, and Sager (1993) posited eight latent performance categories that summarize the performance requirements of all jobs. The notion was that not every job has as performance requirements all eight of the dimensions, but that for any single job, a subset of these factors (or all eight) are sufficient for describing its performance requirements. Several of these constructs emerged in factor analyses of the performance measures administered in the Project A research (a large-scale selection and classification study conducted in the U.S. Army; Campbell, 1990a) across the many jobs studied in that program. As examples, for first-tour soldiers in 19 different jobs, technical proficiency, personal discipline, and effort were consistently represented in factor analyses of performance criterion measures. For second-tour noncommissioned officer jobs, a leadership factor was added to the mix of factors emerging. Accordingly, the Campbell et al. (1993) taxonomy has been largely confirmed for several categories, using data that are highly appropriate for testing its generality across a wide variety of supervisory and nonsupervisory jobs.

Thus, the Campbell et al. (1993) dimension system includes these eight dimensions: (1) job-specific technical proficiency, (2) non-job-specific technical proficiency, (3) written and oral communication, (4) demonstrating effort, (5) maintaining personal discipline, (6) facilitating peer and team performance, (7) supervision/leadership, and (8) management/administration. Parts or all of Dimensions 1, 2, and 4–8 were confirmed in Project A, using multiple methods to measure job performance, including hands-on performance tests; job knowledge tests; supervisor and peer ratings on multiple dimensions of performance; administrative measures such as disciplinary cases, awards and commendations, etc.; and a supervisory situational judgment test (SJT) criterion measure. Importantly, all of these criterion measures were developed and then administered to supervisory and nonsupervisory people. The wide variety of criterion measures used to evaluate job performance constructs ensured the criterion space was comprehensively reflected in the coverage of the performance domain. Accordingly, the system seems quite generalizable across different types of jobs and across supervisory and nonsupervisory jobs.

Borman and Brush (1993) In a second approach, focusing on managerial performance, Borman and Brush (1993) inductively derived an 18-dimension taxonomy of performance categories from existing dimension sets taken from empirical studies of managerial performance. In this project, several existing sets of managerial job performance dimensions were gathered from published and unpublished empirical studies, resulting in a total of 187 independent dimensions. These dimensions were then sorted into categories by 25 subject matter experts (SMEs) on the basis of the similarity of the content of each dimension. These 25 sorting solutions were summarized into a single 187-by-187 correlation matrix using a method described by Rosenberg and Sedlak (1972). Finally, the matrix was factor analyzed, resulting in a set of 18 managerial job performance dimensions.

These dimensions were further grouped into four "mega-dimensions." The first mega-dimension, interpersonal skills and communication, consists of those dimensions involving communication skills, maintaining good interpersonal relationships at work, representing the organization to others, and selling/influencing behaviors. Second, the leadership and supervision mega-dimension includes those dimensions related to guiding, directing, motivating, training, coaching, developing, and coordinating subordinates, as well as providing feedback as needed. Third, the technical activities and "mechanics of management" mega-dimension involves dimensions pertaining to the technical proficiency required for a job, but also those related to managerial tasks such as planning, organizing, decision making, staffing, monitoring, and delegating. Finally, the last mega-dimension, conscientiousness and dependability, consists of a somewhat more heterogeneous set of dimensions: useful personal behavior and persistence, handling stress, and organizational commitment. This system, at the 18-dimension level, is at a relatively high level of specificity, especially because it covers only management jobs.
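The sorting-based derivation described above can be illustrated with a small sketch. Each SME's sort is represented as a mapping from dimension labels to category labels, and the proportion of SMEs who place two dimensions in the same category serves as a simple similarity index that can then be factor analyzed. This is a simplified stand-in for the Rosenberg and Sedlak (1972) procedure; the dimensions, categories, and SME data below are hypothetical.

import numpy as np

# Hypothetical sorts: each dict maps a managerial performance dimension
# to the category an SME placed it in.
sorts = [
    {"planning": "admin", "delegating": "admin", "coaching": "people", "selling": "people"},
    {"planning": "admin", "delegating": "people", "coaching": "people", "selling": "people"},
    {"planning": "admin", "delegating": "admin", "coaching": "people", "selling": "comm"},
]

dimensions = sorted(sorts[0])
n = len(dimensions)
similarity = np.zeros((n, n))

# Proportion of SMEs who placed each pair of dimensions in the same category.
for sort in sorts:
    for i, di in enumerate(dimensions):
        for j, dj in enumerate(dimensions):
            if sort[di] == sort[dj]:
                similarity[i, j] += 1
similarity /= len(sorts)

print(dimensions)
print(similarity)
# The resulting similarity matrix can then be submitted to factor analysis
# (e.g., with sklearn.decomposition.FactorAnalysis or the factor_analyzer
# package) to recover higher-order groupings of the dimensions.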

Viswesvaran (1993) Viswesvaran (1993) built upon the lexical hypothesis of Galton (Goldberg, 1993) to develop a taxonomy of general job performance. In this investigation, Viswesvaran compiled and sorted measures of job performance from the literature into summary categories, resulting in 25 conceptually distinct dimensions. Next, correlations between each pair of these 25 dimensions were obtained from studies utilizing these dimensions. These correlations were used in a meta-analysis to determine the true score correlations between the dimensions. Finally, factor analysis was used to analyze these true score correlations and to derive a set of 10 job performance categories. The dimensions identified in this investigation were intended to summarize overall job performance. Dimensions identified by Viswesvaran (1993) included interpersonal competence, administrative competence, quality, productivity, effort, job knowledge, leadership, compliance/acceptance of authority, communications competence, and an overall job performance dimension.
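The true score correlations referred to above rest on the standard psychometric correction for attenuation. As a minimal sketch (the notation is ours, and whether Viswesvaran applied exactly this form or a fuller psychometric meta-analytic correction is not detailed here), the estimated true score correlation between dimensions i and j is

\[
\hat{\rho}_{ij} = \frac{\bar{r}_{ij}}{\sqrt{\bar{r}_{ii}\,\bar{r}_{jj}}},
\]

where \(\bar{r}_{ij}\) is the mean observed correlation between the two dimension measures across studies and \(\bar{r}_{ii}\) and \(\bar{r}_{jj}\) are their mean reliabilities.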

Borman, Ackerman, and Kubisiak (1994) Borman, Ackerman, and Kubisiak (1994) incorporated elements of personal construct theory in developing a 12-dimension taxonomy of performance dimensions arguably relevant to all nonmanagement jobs in the U.S. economy. Briefly, personal construct theory posits that, on the basis of their experiences over time, individuals develop categories or dimensions that they use to interpret and make judgments about events or objects, especially other people. Personal construct theorists believe that these categories represent the natural way that people think about their world, again, especially regarding other people (e.g., Adams-Webber, 1979). The Repertory Grid protocol has provided a method for individuals to generate their personal constructs by contrasting different role persons (e.g., mother, best friend). In this application, we were asking supervisor participants to generate their personal constructs related to job performance, what have been referred to as personal work constructs, or “folk theories” of performance (Borman, 1987). In particular, 81 supervisors representing many different types of jobs and industries (e.g., sales, manufacturing, service sector) generated the names of several effective workers they had worked with and several relatively ineffective workers. The supervisor sample was instructed to select certain pairs of effective and ineffective employees and generate a performance dimension that differentiated the two employees. Sample members prepared a dimension label and a definition of the dimension. The supervisors generated a total of 176 reasonably nonredundant dimensions and definitions, and similar to the Borman and Brush (1993) research, 12 I-O psychologists sorted these dimensions into categories according to similarity in the performance areas represented, and a 176-by-176 correlation matrix was generated reflecting the relationship between each pair of dimensions. A factor analysis of this matrix revealed a highly interpretable 12-factor solution. The resulting dimension set might be organized hierarchically, similar to Borman and Brush. First, a grouping of interpersonal and communication dimensions was evident, consisting of the dimensions of communication and cooperation. Next, technical activities related to the job were represented in the dimensions of job knowledge, task proficiency, productivity, and judgment and problem solving. Finally, useful personal behavior and skills included the dimensions of dependability, integrity and professionalism, initiative, adaptability, organization, and safety.

Hunt (1996) Hunt's intention was to develop a dimension set that reflected important behavioral dimensions of the performance requirements for entry-level jobs. Using a critical incidents approach, Hunt (1996) derived an eight-dimension taxonomy of generic work behaviors focusing on non-job-specific aspects of performance. In this investigation, Hunt used factor analysis to empirically derive dimensions of generic work behaviors from supervisor ratings of employee behaviors. However, contrary to the typical approach in which a single job family or a single organization is used, Hunt obtained supervisory ratings for nearly 19,000 employees in 52 different job settings across 36 different companies. Because of the nature of these data (i.e., each data set included a slightly different combination of the behaviors assessed), multiple factor analyses of the dimension structure could be conducted. First, a sample of data sets was subjected to factor analysis, resulting in several initial dimension structures. Similarities across these initial taxonomies were then cross-validated through the use of additional factor analyses (using hold-out data sets) and SME ratings of agreement.

These analyses resulted in an eight-dimension taxonomy of generic work behavior, including task and citizenship behaviors. Specifically, the dimension structure consisted of two higher-order dimensions and eight second-order dimensions. The higher-order dimension of required performance behaviors included those behaviors required of an employee for continued employment, including the second-order dimensions of attendance, off-task behavior (i.e., effort expended toward non-job-related activities while at work; e.g., goofing off), and employee deviance (a combination of unruliness, theft, and drug misuse). The second higher-order dimension, organizational citizenship behaviors, comprised the second-order dimensions of schedule flexibility and work ethic (a combination of industriousness and thoroughness). Although Hunt identified a ninth specific dimension, adherence to confrontational rules, this factor was posited to be primarily relevant to cash register work, so Hunt omitted it from his model of generic work behavior.

In addition to the dimensions generated from the supervisory ratings, Hunt also identified four dimensions of generic work behavior through a review of the literature: teamwork, problem solving, safety, and personal appearance.

Peterson, Mumford, Borman, Jeanneret, and Fleishman (1999) O*NET’s generalized work activities (GWAs; Borman, Jeanneret, Kubisiak, & Hanson, 1996) provide a broad-level overview of job behaviors that are applicable to a wide range of jobs. The GWA framework contains 42 lower-order dimensions that have been summarized into four “highest” order dimensions. Information Input describes those GWAs that focus on how and where information is acquired as part of the job, including looking for, receiving, identifying, and evaluating job-related information. Mental Processes summarizes those GWAs that involve information and data processing, as well as reasoning and decision making. Work Output describes physical activities that get performed on the job, including manual activities and complex and technical activities requiring coordinated movements. Finally, Interacting with Others summarizes GWAs that involve interactions with others or supervisory functions, including communicating, interacting, coordinating, developing, managing, advising, and administering. Between the 4 and 42 levels of GWA dimensions, there is a nine-dimension system that we focus on here. This system is supported by factor analytic studies of the Position Analysis Questionnaire (PAQ) and other job analysis instruments for nonsupervisory jobs (McCormick, 1976), and the Borman and Brush (1993) behavioral dimensions (in turn derived in part from other factor analyses involving management or supervisory jobs). The nine-dimension system includes (1) looking for and receiving job-related information, (2) identifying and evaluating job-related information, (3) information/data processing, (4) reasoning/decision making, (5) performing physical and manual work activities, (6) performing complex/technical activities, (7) communicating/interacting, (8) coordinating/developing/managing/advising others, and (9) administering (see Chapter 40, this volume, for more on O*NET).

Integrating the Job Performance Dimension Taxonomies Clearly, important similarities exist across the dimensional taxonomies discussed that allow an integration of the dimension systems, but they also point to “outlier” dimensions in some of the taxonomies that are worth noting. Table 20.1 presents a crosswalk of the six dimensional systems, indicating the commonalities and differences across the systems. Then, we provide a summary column, reflecting the common content where it is evident. Also, the rows of Table 20.1 are ordered such that the first row represents the most commonality across systems, the second row has the next most commonality, and so on.


TABLE 20.1 Summary of Six Performance Taxonomies

Communicating and interacting
• Campbell, McCloy, Oppler, & Sager (1993): Communication; Facilitating peer and team performance
• Borman & Brush (1993): Communicating effectively and keeping others informed; Maintaining good working relationships; Selling/influencing; Representing the organization to customers and the public
• Viswesvaran (1993): Communication competence; Interpersonal competence
• Borman, Ackerman, & Kubisiak (1994): Communication; Cooperation
• Hunt (1996): Teamwork
• Peterson, Mumford, Borman, Jeanneret, & Fleishman (1999; O*NET): Communicating and interacting

Productivity and proficiency
• Campbell et al. (1993): Job-specific technical proficiency; Non-job-specific technical proficiency
• Borman & Brush (1993): Technical proficiency
• Viswesvaran (1993): Productivity; Job knowledge; Quality; Effort
• Borman, Ackerman, & Kubisiak (1994): Effort and productivity; Job knowledge; Task proficiency
• Hunt (1996): Thoroughness
• Peterson et al. (1999): Performing complex and technical activities; Performing physical and manual work activities

Conscientiousness and dependability
• Campbell et al. (1993): Demonstrating effort; Maintaining personal discipline
• Borman & Brush (1993): Persisting to reach goals; Handling crises and stress; Organizational commitment
• Viswesvaran (1993): Compliance/acceptance of authority
• Borman, Ackerman, & Kubisiak (1994): Initiative; Adaptability; Safety; Dependability; Integrity and professionalism
• Hunt (1996): Industriousness; Adherence to confrontational rules; Safety; Schedule flexibility

Problem solving
• Borman & Brush (1993): Decision making/problem solving
• Borman, Ackerman, & Kubisiak (1994): Judgment and problem solving
• Hunt (1996): Problem solving
• Peterson et al. (1999): Reasoning and decision making

Organizing and planning
• Borman & Brush (1993): Administration and paperwork; Planning and organizing; Monitoring and controlling resources; Staffing
• Viswesvaran (1993): Administrative competence
• Borman, Ackerman, & Kubisiak (1994): Organization
• Peterson et al. (1999): Administering

Leadership and supervision
• Campbell et al. (1993): Supervision/leadership; Management/administration
• Borman & Brush (1993): Coordinating subordinates and other resources to get the job done; Guiding, directing, and motivating subordinates and providing feedback; Training, coaching, and developing subordinates; Delegating
• Viswesvaran (1993): Leadership
• Peterson et al. (1999): Coordinating, developing, managing, and advising

Information processing
• Borman & Brush (1993): Collecting and interpreting data
• Peterson et al. (1999): Information and data processing; Identifying and evaluating job-relevant information; Looking for and receiving job-related information

Counterproductive work behaviors
• Hunt (1996): Off-task behavior; Unruliness; Attendance; Drug misuse; Theft

All six dimension sets have content involving communicating and interacting with others, although Hunt (1996) has only a teamwork dimension, so communicating is not explicitly represented in his framework. In Viswesvaran (1993), Borman and Brush (1993), and Borman et al. (1994), communicating and the interpersonal component are represented separately; in Peterson et al. (1999), at the nine-dimension level, the two constructs are combined in a single dimension.

Productivity and proficiency are likewise reflected in all six dimension sets, although the configuration of performance dimension content for this construct is somewhat different across the dimension sets. For example, Viswesvaran (1993) has four of his nine dimensions related to this construct (productivity, job knowledge, quality, and effort), Borman et al. (1994) have three of their 12 dimensions (effort and productivity, job knowledge, and task proficiency) in this category, and Peterson et al. (1999) divide the construct into performing complex/technical activities and physical/manual activities.

The third summary construct, conscientiousness and dependability, is more heterogeneous, but five of the six dimension sets are represented in some fashion. The content varies from Hunt’s (1996) industriousness and adherence to rules to Borman and Brush’s (1993) persisting to reach goals and handling crises, Borman et al.’s (1994) initiative, adaptability, and safety, Campbell et al.’s (1993) personal discipline and effort (this effort dimension is defined less like productivity and more like a personal quality compared with the other two effort dimensions), and Viswesvaran’s (1993) compliance dimension.

Problem solving draws on content from four of the six dimension sets. Three of these four (Borman & Brush, 1993; Borman et al., 1994; Peterson et al., 1999) include elements of decision making in addition to problem solving; Hunt’s (1996) system defines problem solving more narrowly.

The fifth construct, organizing and planning, also has representation by four of the dimension sets. Because this construct can be seen as in part management-oriented, it is not surprising that Borman and Brush’s (1993) managerial taxonomy has several dimensions in this category (i.e., administration and paperwork, planning and organizing, monitoring and controlling resources, and staffing). Viswesvaran (1993), Borman et al. (1994), and Peterson et al. (1999) have a single administering or organizing dimension. Finally, Campbell et al.’s (1993) management/administration dimension is broader than organizing and planning but does contain elements relevant to this construct.

The sixth summary construct is leadership and supervision and is also represented in four of the dimension sets. Again, as might be expected, the Borman and Brush (1993) managerial taxonomy has multiple dimensions in this category (coordinating subordinates; guiding, directing, and motivating subordinates; training, coaching, and developing subordinates; and a delegating dimension). Campbell et al. (1993) have two leadership-related dimensions (supervision/leadership and at least part of management/administration). We should note that the Hunt (1996) and Borman et al. (1994) taxonomies were intended for entry-level and nonmanagement jobs, respectively, and thus would not be expected to contain supervisory or managerial dimensions.
A seventh construct, information processing, had representation from only two systems: information and data processing, identifying and evaluating job-relevant information, and looking for and receiving job-related information from Peterson et al. (1999) and collecting and interpreting data from Borman and Brush (1993). In addition, Hunt’s (1996) dimension set had several dimensions that could be classified as counterproductive work behaviors. These included off-task behavior, unruliness, drug misuse, and theft.

Because these last two categories were relatively idiosyncratic, represented in only one or two of the dimension sets reviewed, we propose that the six summary construct system might be used as a target criterion taxonomy in personnel selection research. Thus, we argue here that in the future, it might be preferable to consider a more multidimensional criterion space rather than just overall job performance in this quest toward systematically studying links between individual predictor and criterion variables in a selection context. All of the six dimension sets reviewed have important strengths. What we are advocating, following Campbell et al. (1993) and Campbell, Gasser, and Oswald (1996), is that the field move toward some performance taxonomy that can be used in personnel selection research to more systematically study empirical links between individual differences and individual performance constructs, as represented by a job performance taxonomy. A science of personnel selection could benefit greatly from research using a common set of performance constructs to map individual difference (e.g., abilities, personality, vocational interests) and job performance relations (Borman, Hanson, & Hedge, 1997; Campbell et al., 1996). This approach gets us beyond studying individual differences–overall job performance correlations.

PREDICTORS OF TASK PERFORMANCE DIMENSIONS A major problem with studying predictor-task performance relationships is that almost all predictor-performance correlation data use overall performance as the criterion data. Fortunately, there is an important exception. Project A, also known as the U.S. Army’s Selection and Classification Project, was a large-scale test validation effort conducted by the Army Research Institute (ARI) and three research firms (Campbell, 1990a; Campbell & Zook, 1990; see also Campbell & Knapp; Chapter 40 in this volume). The seven-year effort included data from thousands of participants across a wide range of military occupational specialties (MOS). In addition to its large sample size, Project A measured multiple dimensions of performance, including task performance. Specifically, criteria were carefully developed based on a literature review, the critical incidents technique, and a clear explication of the task domain. Performance was measured using multiple indices, including hands-on job sample tests, multiple-choice knowledge tests, and supervisor/peer ratings of performance on Behavior Summary Scales. Army-wide and MOS-specific scales were developed, and administrative/archival records were also examined. On the basis of exploratory and confirmatory factor analytic results, five dimensions of performance were specified: (1) core technical proficiency, (2) general soldiering proficiency, (3) effort and leadership, (4) personal discipline, and (5) physical fitness and military bearing. The first two factors—core technical proficiency and general soldiering proficiency— clearly represent task performance constructs; substantial loadings were evident for hands-on performance tests, the job knowledge tests, and supervisor/peer ratings on some of the technical performance dimensions. Thus, we believe these Project A criterion data are ideal for representing a relatively pure measure of task performance. Additionally, data were collected for each participant on the following five major predictor constructs: general cognitive ability, spatial ability, perceptual/ psychomotor ability, personality, and vocational interests. Accordingly, the wide array of predictor data in Project A provide relatively comprehensive coverage of predictor-task performance links. The remainder of the chapter will summarize the correlations obtained in the concurrent validation study part of Project A (Campbell & Zook, 1990). Mean validities are based on data from 4,039 incumbents in nine diverse MOS, including infantryman, cannon crewmember, armor crewman, single channel radio operator, light wheel vehicle mechanic, motor transport operator, administrative specialist, medical specialist, and military police. Validity estimates were corrected for range restriction and criterion unreliability. Despite the exemplary aspects of Project A, one might question the generalizability of the results, given that an Army sample was used. Thus, for each type of predictor, relevant research conducted with other samples is also discussed.

TABLE 20.2 Validities Against Task Performance by Type of Predictor

General Cognitive Ability: .63–.65
Spatial Ability: .56–.63
Perceptual/Psychomotor Ability: .53–.57
Personality: .25
Vocational Interests: .34–.35

Note: Corrected for unreliability of criterion and restriction of range. N = 4,039.
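For readers who want to see how such corrections operate, the sketch below applies the standard Thorndike Case II formula for direct range restriction followed by disattenuation for criterion unreliability; the input values are hypothetical, and Project A's actual correction procedures may have differed in their details.

import math

def correct_range_restriction(r, u):
    # Thorndike Case II correction for direct range restriction;
    # u = restricted SD / unrestricted SD of the predictor.
    return (r / u) / math.sqrt(1 + r**2 * (1 / u**2 - 1))

def correct_criterion_unreliability(r, ryy):
    # Disattenuate for unreliability in the criterion only; predictor
    # unreliability is conventionally left uncorrected in validation work.
    return r / math.sqrt(ryy)

# Hypothetical observed validity, degree of restriction, criterion reliability.
r_obs, u, ryy = 0.35, 0.65, 0.70
r_rr = correct_range_restriction(r_obs, u)            # ~0.50
r_full = correct_criterion_unreliability(r_rr, ryy)   # ~0.60
print(round(r_rr, 2), round(r_full, 2))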



General Cognitive Ability A large body of literature has examined the link between general cognitive ability and job performance, and findings indicate that it is one of the most robust predictors of performance (Ree & Earles, 1992; Schmidt & Hunter, 1998; see also Chapter 11, this volume). In Project A, general cognitive ability was measured with the Armed Services Vocational Aptitude Battery (ASVAB), in which nine subtests combined to form four composite scores: technical, quantitative, verbal, and speed. Similar to other research in the selection field, the Project A relationships between general cognitive ability and task performance dimensions were strong, with a mean validity of .63 between cognitive ability and core technical proficiency, and .65 between cognitive ability and general soldiering proficiency. A substantial body of research has also examined the relationship between cognitive ability and overall job performance using nonmilitary samples. Literally thousands of studies have investigated this research question, finding strong correlations between general cognitive ability and job performance across various jobs, companies, and criteria (e.g., Hunter, 1986; Hunter & Schmidt, 1996; Schmidt & Hunter, 1981). Although studies conducted on civilian populations report high validity coefficients between cognitive ability and job performance, these coefficients are not as high as those reported in Project A. For example, Hunter and Hunter (1984) summarized the results of 515 validation studies conducted by the U.S. Department of Labor, with more than 32,000 employees in 512 diverse civilian jobs (Hunter, 1980). On the basis of this large-scale meta-analysis, Hunter and Hunter reported validities of .40, .51, and .58 between general cognitive ability and job proficiency for low-, medium-, and high-complexity jobs, respectively. The .51 estimate for medium-complexity jobs was cited in Schmidt and Hunter’s (1998) seminal article and is frequently referenced in the literature as a point estimate for the relationship between cognitive ability and job performance. More recently, researchers have conducted meta-analyses on the basis of studies conducted in different countries (e.g., the United Kingdom and Germany), reporting relationships of similar magnitude (Bertua, Anderson, & Salgado, 2005; Hülsheger, Maier, & Stumpp, 2007). A likely reason for the higher mean validities presented here is that the criterion was task performance rather than overall performance.

Spatial and Perceptual/Psychomotor Ability In addition to general cognitive ability, Project A examined the relationship between spatial and perceptual/psychomotor ability and task performance. Spatial ability was measured with the Spatial Test Battery, comprising six paper-and-pencil tests. The six tests—assembling objects, object rotation, mazes, orientation, map, and figural reasoning—were combined to form an overall composite score. Perceptual/psychomotor ability was assessed in a computerized battery of 20 tests, which formed six composite scores: (1) psychomotor, (2) complex perceptual speed, (3) complex perceptual accuracy, (4) number speed and accuracy, (5) simple reaction speed, and (6) simple reaction accuracy. Sample tests include target identification, cannon shoot, and target tracking. Although lower than with general cognitive ability, the relationships between spatial and perceptual/psychomotor ability and task performance were high. The correlations with core technical proficiency and general soldiering proficiency were .56 and .63 for spatial ability and .53 and .57 for perceptual/psychomotor ability. These mean validities are substantially higher than those reported in other studies, in which overall performance was the criterion. In addition, several meta-analytic studies have examined these relationships, focusing on such specific industries as aviation (Hunter & Burke, 1994; Martinussen, 1996) and craft jobs in the utility field (Levine, Spector, Menon, Narayanan, & Cannon-Bowers, 1996). Across 68 studies on pilot selection, Hunter and Burke reported mean validities of .19 for spatial ability, .32 for gross dexterity, .10 for fine dexterity, and .20 for perceptual speed, correcting for sampling error only. Similarly, Martinussen reported a mean relationship of .20 between psychomotor/information processing and pilot performance. Martinussen’s meta-analysis was based on 50 studies conducted in 11 countries. Finally, Levine et al. conducted a meta-analysis of 80 studies that sampled craft jobs in the utility industry across six job families. The weighted average of correlation coefficients was .20 between spatial/psychomotor ability and overall performance. Thus, as with cognitive ability, spatial and perceptual/psychomotor ability correlate considerably higher with task performance than with overall performance.

Personality Over the past two decades, research on the utility of personality in the selection field has received a great deal of attention (see Chapter 13, this volume). Although early estimates of the relationship between personality and performance were quite low, more recent results have been somewhat more optimistic. Personality researchers generally credit the advent of a well-accepted taxonomy (i.e., the Big Five) and the increased use of validity generalization techniques (e.g., Barrick & Mount, 1991) for the recent positive findings. Although Project A did not utilize the Five-Factor Model in the measurement of personality, it did find moderate correlations between personality and task performance. Using the Assessment of Background and Life Experiences (ABLE), soldiers completed 11 scales (emotional stability, self-esteem, cooperativeness, conscientiousness, nondelinquency, traditional values, work orientation, internal control, energy level, dominance, physical condition). Seven of the scales combined to form four composite scores: adjustment, dependability, achievement orientation, and physical condition. Overall, the mean validities for personality were .25 for both dimensions of task performance (core technical proficiency and general soldiering proficiency). This is similar to the relationship of .27 reported by Barrick, Mount, and Judge (2001) between conscientiousness and overall job performance in their meta-analysis of 15 earlier meta-analyses. Hurtz and Donovan (2000) also conducted a meta-analysis. They partitioned the criterion domain into three dimensions: task performance, job dedication, and interpersonal facilitation. Their findings indicate the following relationships between the Big Five and task performance: .15 for conscientiousness, .13 for emotional stability, -.01 for openness to experience, .07 for agreeableness, and .06 for extraversion. The multiple R was .19. Finally, a couple of meta-analyses have examined personality-task performance relations at the Big Five facet level. Woo, Chernyshenko, Stark, and Conz (2014) investigated facets of Openness to Experience and found that, of the eight facets studied, six had higher correlations with task performance than did the overall Openness construct. However, the difference was not great (means = .10 versus .07), and neither of these relationships was very high. Similarly, Dudley, Orvis, Lebiecki, and Cortina (2006) examined four facets of Conscientiousness and found somewhat higher relationships between the facets and task performance compared to Global Conscientiousness and performance (means = .22 and .16), but here the difference was largely due to a single facet, Dependability (r = .46). In sum, personality-task performance relationships are generally low to modest in magnitude.
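The multiple R reported by Hurtz and Donovan (2000) can be related to the individual Big Five validities through the standard formula R² = r′R⁻¹r, where r holds the validities and R the predictor intercorrelations; the intercorrelation matrix below is hypothetical (a uniform .18) and is used only to show the computation, not to reproduce their analysis.

import numpy as np

# Big Five validities against task performance reported by Hurtz and Donovan
# (2000): conscientiousness, emotional stability, openness, agreeableness,
# extraversion.
r_xy = np.array([0.15, 0.13, -0.01, 0.07, 0.06])

# Hypothetical Big Five intercorrelations (illustrative only).
R_xx = np.full((5, 5), 0.18)
np.fill_diagonal(R_xx, 1.0)

# Squared multiple correlation of the optimally weighted composite.
R_squared = r_xy @ np.linalg.solve(R_xx, r_xy)
print(round(float(np.sqrt(R_squared)), 2))  # ~.20 here; Hurtz & Donovan report .19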

Vocational Interests Another set of predictors investigated in Project A was vocational interests. On the basis of Holland’s (1966) Basic Interests Constructs, as well as six different organizational climate scales, Project A researchers developed the Army Vocational Interest Career Examination (AVOICE). Twenty-two scales make up AVOICE, forming six composite scores: skilled technical, structural/machines, combat-related, audiovisual arts, food service, and protective services. Across the six composites, vocational interests related to core technical proficiency (r = .35) and general soldiering proficiency (r = .34). Conversely, Schmidt and Hunter (1998), citing Holland (1986), commented that there is generally no relationship between interests and job performance. Although they considered this a somewhat surprising finding, they hypothesized that interests may affect one’s choice of jobs, but once the job is selected, interests do not affect performance.


More recently, Morris (2003) conducted a meta-analysis of 93 studies, reporting a mean corrected correlation of .29 between vocational interests and job performance. Interestingly, larger effect sizes were observed when studies used task performance as the criterion. The reason for this finding is unclear, but it does mirror the Project A results. Specifically, the correlations between vocational interests and performance were higher for task performance dimensions (.34 to .35) compared with the other three performance dimensions (.12 to .24). Finally, Nye, Su, Rounds, and Drasgow (2012), in a meta-analysis of 60 studies and 568 correlations, found a mean correlation (uncorrected) of .20 between vocational interests and overall job performance.

In sum, Project A research supports quite strong relationships between general cognitive ability, spatial ability, and perceptual/psychomotor ability and task performance. Importantly, these correlations are higher with task performance than when the criterion is overall performance, the criterion almost always used in meta-analyses of these predictors’ validities against job performance. The explanation we offered for this finding is the consistent trend in the literature of combining task and other dimensions of performance into one overall factor, thus reducing the validities of these predictors. Relations for personality and vocational interests with task performance are more modest but still far from trivial (mid-.20s for personality and .20 to mid-.30s for vocational interests).

REFERENCES

Adams-Webber, J. (1979). Intersubject agreement concerning relationships between the positive and negative poles of constructs in repertory grid tests. British Journal of Medical Psychology, 52, 197–199.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26.
Barrick, M. R., Mount, M. K., & Judge, T. A. (2001). Personality and performance at the beginning of the new millennium: What do we know and where do we go next? International Journal of Selection and Assessment, 9, 9–30.
Bernardin, H. J., & Pence, E. C. (1980). Effects of rater training: Creating new response sets and decreasing accuracy. Journal of Applied Psychology, 65, 60–66.
Bernardin, H. J., & Smith, P. C. (1981). A clarification of some issues regarding the development and use of Behaviorally Anchored Rating Scales (BARS). Journal of Applied Psychology, 66, 458–463.
Bertua, C., Anderson, N., & Salgado, J. F. (2005). The predictive validity of cognitive ability tests: A UK meta-analysis. Journal of Occupational and Organizational Psychology, 78, 387–409.
Borman, W. C. (1979). Format and training effects on rating accuracy and rater errors. Journal of Applied Psychology, 64, 410–421.
Borman, W. C. (1987). Personal constructs, performance schemata, and “folk theories” of subordinate effectiveness: Explorations in an Army officer sample. Organizational Behavior and Human Decision Processes, 40, 307–322.
Borman, W. C. (1991). Job behavior, performance, and effectiveness. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (Vol. 2, pp. 271–326). Palo Alto, CA: Consulting Psychologists Press.
Borman, W. C., Ackerman, L. D., & Kubisiak, U. C. (1994). Development of a performance rating program in support of Department of Labor test validation research. Unpublished manuscript.
Borman, W. C., & Brush, D. H. (1993). More progress toward a taxonomy of managerial performance requirements. Human Performance, 6, 1–21.
Borman, W. C., Buck, D. E., Hanson, M. A., Motowidlo, S. J., Stark, S., & Drasgow, F. (2001). An examination of the comparative reliability, validity, and accuracy of performance ratings made using computerized adaptive rating scales. Journal of Applied Psychology, 86, 965–973.
Borman, W. C., Hanson, M. A., & Hedge, J. W. (1997). Personnel selection. In J. T. Spence, J. M. Darley, & D. J. Foss (Eds.), Annual review of psychology (Vol. 48, pp. 299–337). Palo Alto, CA: Consulting Psychologists Press.
Borman, W. C., Kubisiak, U. C., & Grossman, M. R. (2013). Development and field test of the Canadian Forces Computer Adaptive Rating Scales (Institute Report No. 796). Tampa, FL: Personnel Decisions Research Institutes, Inc.
Borman, W. C., Jeanneret, P. R., Kubisiak, U. C., & Hanson, M. A. (1996). Generalized work activities: Evidence for the reliability and validity of the measures. In N. G. Peterson, M. D. Mumford, W. C. Borman, P. R. Jeanneret, & E. A. Fleishman (Eds.), O*NET final technical report. Salt Lake City: Utah Department of Employment Security.
Campbell, J. P. (1990a). An overview of the Army Selection and Classification Project (Project A). Personnel Psychology, 43, 231–239.
Campbell, J. P. (1990b). The role of theory in industrial and organizational psychology (Vol. 1, 2nd ed.). Palo Alto, CA: Consulting Psychologists Press.
Campbell, J. P., Dunnette, M. D., Lawler, E. E., & Weick, K. E. (1970). Managerial behavior, performance, and effectiveness. New York, NY: McGraw-Hill.
Campbell, J. P., Gasser, M. B., & Oswald, F. L. (1996). The substantive nature of performance variability. In K. R. Murphy (Ed.), Individual differences and behavior in organizations (pp. 255–299). San Francisco: Jossey-Bass.
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 35–70). San Francisco: Jossey-Bass.
Campbell, J. P., & Zook, L. M. (Eds.) (1990). Improving the selection, classification, and utilization of Army enlisted personnel: Final report on Project A (ARI Research Report 1597). Alexandria, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Dudley, N. M., Orvis, K. A., Lebiecki, J. E., & Cortina, J. M. (2006). A meta-analytic investigation of conscientiousness in the prediction of job performance: Examining intercorrelations and the incremental validity of narrow traits. Journal of Applied Psychology, 91, 40–57.
Farr, J. L., & Levy, P. E. (2007). Performance appraisal. In L. L. Koppes (Ed.), Historical perspectives in industrial and organizational psychology (pp. 311–327). Mahwah, NJ: Lawrence Erlbaum.
Flanagan, J. C. (1954). The critical incident technique. Psychological Bulletin, 51, 327–358.
Ghiselli, E. E. (1956). Dimensional problems of criteria. Journal of Applied Psychology, 40, 1–4.
Goldberg, L. R. (1993). The structure of phenotypic personality traits. American Psychologist, 48, 26–34.
Guion, R. M. (1965). Personnel testing. New York, NY: McGraw-Hill.
Guion, R. M. (1978). Principles of work sample testing: III: Construction and evaluation of work sample tests. Alexandria, VA: U.S. Army Research Institute.
Holland, J. (1986). New directions for interest testing. In B. S. Plake & J. C. Witt (Eds.), The future of testing (pp. 245–267). Hillsdale, NJ: Lawrence Erlbaum.
Holland, J. L. (1966). The psychology of vocational choice: A theory of personality types and model environments. Oxford, England: Blaisdell.
Hülsheger, U. R., Maier, G. W., & Stumpp, T. (2007). Validity of general mental ability for the prediction of job performance and training success in Germany: A meta-analysis. International Journal of Selection and Assessment, 15, 3–18.
Hunt, S. T. (1996). Generic work behavior: An investigation into the dimensions of entry-level, hourly job performance. Personnel Psychology, 49, 51–83.
Hunter, D. R., & Burke, E. F. (1994). Predicting aircraft pilot training success: A meta-analysis of published research. The International Journal of Aviation Psychology, 4, 297–313.
Hunter, J. E. (1980). Validity generalization for 12,000 jobs: An application of synthetic validity and validity generalization to the General Aptitude Test Battery (GATB). Washington, DC: U.S. Department of Labor, Employment Service.
Hunter, J. E. (1986). Cognitive ability, cognitive aptitudes, job knowledge, and job performance. Journal of Vocational Behavior, 29, 340–362.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72–98.
Hunter, J. E., & Schmidt, F. L. (1996). Intelligence and job performance: Economic and social implications. Psychology, Public Policy, and Law, 2, 447–472.
Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The Big Five revisited. Journal of Applied Psychology, 85, 869–879.
Kanfer, R., & Borman, W. C. (1987). Predicting salesperson performance: A review of the literature (Research Note 87–13). Alexandria, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87, 72–107.
Landy, F. J., & Farr, J. L. (1983). The measurement of work performance: Methods, theory and applications. New York, NY: Academic Press.
Latham, G. P., & Wexley, K. N. (1981). Increasing productivity through performance appraisal. Reading, MA: Addison-Wesley.
Levine, E. L., Spector, P. E., Menon, S., Narayanan, L., & Cannon-Bowers, J. (1996). Validity generalization for cognitive, psychomotor, and perceptual tests for craft jobs in the utility industry. Human Performance, 9, 1–22.
Martinussen, M. (1996). Psychological measures as predictors of pilot performance: A meta-analysis. The International Journal of Aviation Psychology, 6, 1–20.
McCormick, E. J. (1976). Job and task analysis. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 651–696). Chicago: Rand McNally.
McCormick, E. J., Jeanneret, P. R., & Mecham, R. C. (1972). A study of job characteristics and job dimensions as based on the Position Analysis Questionnaire (PAQ). Journal of Applied Psychology, 56, 347–368.
Morris, M. A. (2003). A meta-analytic investigation of vocational interest-based job fit, and its relationship to job satisfaction, performance, and turnover. Unpublished doctoral dissertation, University of Houston.
Murphy, K. R., Martin, C., & Garcia, M. (1982). Do behavioral observation scales measure observation? Journal of Applied Psychology, 67, 562–567.
Noonan, L., & Sulsky, L. M. (2001). Examination of frame-of-reference and behavioral observation training on alternative training effectiveness criteria in a Canadian military sample. Human Performance, 14, 3–26.
Nye, C. D., Su, R., Rounds, J., & Drasgow, F. (2012). Vocational interests and performance: A quantitative summary of over 60 years of research. Perspectives on Psychological Science, 7(4), 384–403.
Olea, M. M., & Ree, M. J. (1994). Predicting pilot and navigator criteria: Not much more than g. Journal of Applied Psychology, 79, 845–851.
Peterson, N. G., Mumford, M. D., Borman, W. C., Jeanneret, P. R., & Fleishman, E. A. (Eds.) (1999). The occupational information network (O*NET). Washington, DC: American Psychological Association.
Pulakos, E. D. (1984). A comparison of rater training programs: Error training and accuracy training. Journal of Applied Psychology, 69, 581–588.
Pulakos, E. D. (2004). Performance management: Society for Human Resource Management. Washington, DC: SHRM Foundation.
Pulakos, E. D., & Mueller-Henson, R. (2015). A path to performance management reform. Alexandria, VA: Association for Talent Development Press.
Ree, M. J., & Earles, J. A. (1992). Intelligence is the best predictor of job performance. Current Directions in Psychological Science, 1, 86–89.
Rosenberg, S., & Sedlak, A. (1972). Structural representations of perceived personality trait relationships. In A. K. Romney, R. N. Shepard, & S. B. Nerlove (Eds.), Multidimensional scaling (pp. 134–162). New York, NY: Seminar.
Rothe, H. F. (1978). Output rates among industrial employees. Journal of Applied Psychology, 63, 40–46.
Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research findings. American Psychologist, 36, 1128–1137.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274.
Schwab, D. P., Heneman, H. G., & DeCotiis, T. (1975). Behaviorally anchored rating scales: A review of the literature. Personnel Psychology, 28, 549–562.
Smith, D. E. (1986). Training programs for performance appraisal: A review. Academy of Management Review, 11, 22–40.
Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149–155.
Stark, S., & Drasgow, F. (2002). An EM approach to parameter estimation for the Zinnes and Griggs paired comparison IRT model. Applied Psychological Measurement, 26, 208–227.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 273–286.
Viswesvaran, C. (1993). Modeling job performance: Is there a general factor? Unpublished doctoral dissertation, University of Iowa.
Woehr, D. J., & Huffcutt, A. I. (1994). Rater training for performance appraisal: A quantitative review. Journal of Occupational and Organizational Psychology, 67, 189–205.
Woo, S. E., Chernyshenko, O. S., Stark, S. E., & Conz, G. (2014). Validity of six openness facets in predicting work behaviors: A meta-analysis. Journal of Personality Assessment, 96(1), 76–78.

21
ADAPTIVE AND CITIZENSHIP-RELATED BEHAVIORS AT WORK
DAVID W. DORSEY, JOSE M. CORTINA, MATTHEW T. ALLEN, SHONNA D. WATERS, JENNIFER P. GREEN, AND JOSEPH LUCHMAN

CONCEPTUALIZATION Macro-level trends such as globalization, technology, demographic shifts, and alternative work structures have led researchers and practitioners to challenge traditional definitions of individual work performance (Ilgen & Pulakos, 1999). Two major shifts in these definitions involve performing in interdependent work contexts and performing in uncertain work contexts (Griffin, Neal, & Parker, 2007). In this chapter, we explore such expanded definitions of work performance by considering what we know (and what we do not know) about adaptive and organizational citizenship-related behaviors and how this knowledge might be used to inform selection. Implicit in our effort to highlight adaptive and citizenship behavior is the assumption that such behaviors are in some ways distinct from traditional task performance. Although we argue in various ways throughout this chapter that this is true, we acknowledge that the boundaries among such performance components are fuzzy. It has been argued that neither adaptive nor citizenship performance is mutually exclusive of task performance, and some conceptual and empirical overlap should be expected (Griffin et al., 2007; Johnson, 2003; Schmitt, Cortina, Ingerick, & Wiechmann, 2003). Moreover, it has been demonstrated that differences in specific job requirements can drive the relative importance (and profile) of various performance components (Pulakos, Arad, Donovan, & Plamondon, 2000). For the purposes of a selection volume, it is sufficient to observe that one of the reasons for distinguishing adaptive performance and citizenship performance from task performance is that they have different determinants.

ADAPTIVE BEHAVIOR DEFINED Recent work in the area of performance adaptation (e.g., Baard, Rench, & Kozlowski, 2014) has sought to summarize and clarify the extant literature on the basis of various theoretical approaches, namely: a performance construct, an individual difference construct, a change in performance, and a process. We agree that additional clarity and specificity regarding the relevant approach can lead to greater clarity in interpreting the literature. Because this volume focuses on selection, multiple approaches are relevant for the current review.


Adaptability refers to an individual difference or a predictor (either as a separate construct or as a compound trait) that has different relationships with facets of adaptive performance. This view typically includes an individual’s ability, skill, disposition, willingness, and/or motivation to respond to change (e.g., Ployhart & Bliese, 2006). Adaptation refers to a process that includes recognizing demands of a situation, identifying implications, and taking needed actions, or more broadly, the process of achieving fit between new demands and individual behaviors (e.g., Chan, 2000). The theoretical perspective that views adaptation as a process is relevant to identifying mediators that may be important in predicting adaptive performance. Adaptive performance is the criterion of interest—it is a change in response to an altered situation (cf. Dorsey, Cortina, & Luchman, 2010) or the behavioral outcome of the adaptation process (Schmitt & Chan, 2014). Like other definitions of performance, adaptive performance must be considered in relation to the goals of the organization to be able to determine the relevance and effectiveness of the response (Campbell, 2012). We subsume the final theoretical perspective, adaptation as a change in performance, identified by Baard et al. (2014), within our treatment of the measurement of adaptive performance. We also at times use the term adaptive transfer to refer to behavior in a training setting, which we view as a context-specific instantiation of adaptive performance.

As noted by Baard et al. (2014), each of these perspectives is relevant at multiple organizational levels (e.g., individual, team, unit, organization). Again, due to our focus on selection, we will generally display a bias toward the individual level of analysis. However, we believe multilevel issues are critical to investigate, and we recognize that team or higher-level effects are often not simply aggregations of individual-level results (e.g., Stajkovic et al., 2009). Minimal advances have been made in the definition of adaptive performance in recent years. Taxonomic work done by Pulakos, Arad, Donovan, and Plamondon (2000) remains the most rigorous and comprehensive study of adaptive performance. The eight dimensions of adaptive performance identified by Pulakos et al. include:

1. Handling emergencies or crisis situations
2. Learning work tasks, technologies, and procedures
3. Handling work stress
4. Demonstrating personal adaptability
5. Displaying cultural adaptability
6. Solving problems creatively
7. Dealing effectively with unpredictable or changing work situations
8. Demonstrating physically oriented adaptability

This definition includes both reactive and proactive responses to change (e.g., Huang et al., 2014; Shoss, Witt, & Vera, 2012) and has mental, interpersonal, and physically oriented dimensions (White et al., 2005). Further treatment of the relationship between proactive and adaptive performance is outside the scope of this chapter but is reviewed in detail elsewhere (cf. Zhu, Frese, & Li, 2014). As is true for general models of job performance (e.g., Campbell, 2012), differences in specific job requirements drive the relative importance of various performance components (Pulakos et al., 2000). Controversy remains regarding the factor structure and viability of alternate frameworks; however, little empirical research has been done in recent years to further evaluate the Pulakos et al. (2000) model. Some authors continue to argue that adaptive performance is not truly distinct from other types of performance (e.g., Campbell, 2012; Johnson, 2001; Ployhart & Bliese, 2006); however, there is agreement that performance requirements change and that the ability and proficiency of individuals to anticipate and meet those changes varies. Pulakos and colleagues (Pulakos, Dorsey, & White, 2006) noted that adaptive performance is not independent of task and contextual performance and that it may or may not be needed to perform those duties. Regardless, given the increasingly dynamic nature of work and the need to better understand and predict effective responses to change, separate and specific attention to adaptive performance seems warranted to ensure good criterion measurement.



CITIZENSHIP DEFINED Citizenship performance traces its conceptual lineage back to Barnard (1938), Katz (1964), and more recently, Organ and colleagues, who first coined the term “organizational citizenship behavior” or OCB (Bateman & Organ, 1983; Smith, Organ, & Near, 1983). Since its introduction, more than 30 potential variants of OCB have arisen (Podsakoff, MacKenzie, Paine, & Bachrach, 2000), including a host of umbrella terms (e.g., “contextual performance”). Although most studies of citizenship refer to citizenship behavior, we prefer the term citizenship performance because it emphasizes the notion that there is an aspect of quality and that some citizenship behaviors are more successful than others. The notion of quality in citizenship is necessary for the recognition of the importance of knowledge and skill in the prediction of citizenship. Consistent with Borman and Motowidlo (1993), we define citizenship performance as activities that support the broader environment in which an organization’s technical core must function. Citizenship performance has many subdimensions, and there have been varied attempts to identify them (e.g., Borman & Motowidlo, 1993; Smith et al., 1983; Williams & Anderson, 1991). In this chapter, we use the most detailed of these models—that of Borman et al. (2001a). We made this choice, recognizing that there is considerable overlap among some of the

TABLE 21.1 Model Facets of Citizenship Behavior

Personal Support
Helping: Helping others by offering suggestions about their work, showing them how to accomplish difficult tasks, teaching them useful knowledge or skills, directly performing some of their tasks, and providing emotional support for personal problems
Cooperating: Cooperating with others by accepting their suggestions, following their lead, putting team objectives over personal interests, and informing others of events or requirements that are likely to affect them
Courtesy: Showing consideration, courtesy, and tact in relations with others
Motivating: Motivating others by applauding their achievements and successes, cheering them on in times of adversity, showing confidence in their ability to succeed, and helping them overcome setbacks

Organizational Support
Representing: Representing one’s organization favorably to outsiders by defending it when others criticize, promoting its achievements and positive attributes, and expressing own satisfaction with organization
Loyalty: Showing loyalty by staying with one’s organization despite temporary hardships, tolerating occasional difficulties, handling adversity cheerfully and without complaining, and publicly endorsing and supporting the organization’s mission and objectives
Compliance: Complying with organizational rules and procedures, encouraging others to comply with organizational rules and procedures, and suggesting procedural, administrative, or organizational improvements

Conscientious Initiative
Self-development: Developing own knowledge and skills by taking courses on own time, volunteering for training and development opportunities offered within the organization, and trying to learn new knowledge and skills on the job from others or through new job assignments
Initiative: Taking the initiative to do all that is necessary to accomplish team or organizational objectives even if not typically a part of own duties, correcting nonstandard conditions whenever encountered, and finding additional work to perform when own duties are completed
Persistence: Persisting with extra effort despite difficult conditions and setbacks, accomplishing goals that are more difficult and challenging than normal, completing work on time despite unusually short deadlines, and performing at a level of excellence that is significantly beyond normal expectations

Source: Adapted from Borman, W. C. Buck et al., An examination of the comparative reliability, validity, and accuracy of performance ratings made using computerized adaptive rating scales, Journal of Applied Psychology, 86, 965–973, 2001.


subdimensions in this model. The Borman et al. (2001b) model contained three subdimensions of citizenship performance, each of which can be further broken down into facets. Table 21.1 contains detailed definitions of the facets of the model. Clearly, there is overlap among these facets. Indeed, a meta-analysis by Hoffman, Blair, Meriac, and Woehr (2007) suggested that the citizenship domain is best characterized as a single higher-order factor. On the other hand, these dimensions seem to be conceptually distinct (see Table 21.1). For example, Machiavellianism often involves being courteous without being helpful or cooperative. These facets can be further distinguished by the fact that they have different consequences. Most important for our purposes is the fact that they have different individual and situational determinants. It remains to be seen if the covariance between subdimensions suggested by Hoffman et al. (2007) results from halo or common method effects or from the true nature of citizenship dimensions as reflections of a higher order (see also LePine, Erez, & Johnson, 2002).

Before discussing determinants and consequences, a discussion of the nature of citizenship is in order. Most empirical studies of citizenship, and of performance generally, assign a single score to each participant in the same way that studies of cognitive ability assign a single ability score. The implication is that a single value can represent the standing of a person on the stable construct, citizenship. Ilies, Scott, and Judge (2006) used event sampling to generate daily reports of job attitudes and citizenship and found that much of the variance in citizenship was within-person. As will be explained later, these authors found that this within-person variance was explained by other within-person factors. For the moment, we merely wish to point out that the Ilies et al. (2006) findings may cast doubt on the practice of using a single value to represent citizenship and on the conceptualization that this practice implies.

Predictors of performance often do not have simple cause-effect relationships with performance. Rather, some predictors can place “boundary conditions” on their relationships with citizenship or can exercise influence through other variables. These two conditions are known as “moderation” (boundary condition) and “mediation” (acting through). We discuss potential mediators and moderators in later sections.
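A brief, hedged illustration of how the within-person share of variance in daily citizenship reports is typically estimated in event-sampling designs such as Ilies et al.’s: fit an intercept-only multilevel model and compute the intraclass correlation. The data file and column names below are hypothetical.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical daily-diary data: one row per person per day, with a daily
# citizenship (OCB) rating and a person identifier.
df = pd.read_csv("daily_ocb.csv")  # hypothetical file

# Intercept-only mixed model: the random-intercept variance is between-person
# variance; the residual variance is within-person variance.
fit = smf.mixedlm("ocb ~ 1", data=df, groups=df["person_id"]).fit()
between_var = float(fit.cov_re.iloc[0, 0])
within_var = float(fit.scale)
icc = between_var / (between_var + within_var)
print(f"between-person share (ICC) = {icc:.2f}; within-person share = {1 - icc:.2f}")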

INDIVIDUAL DIFFERENCE PREDICTORS There are many potential predictors of adaptability and citizenship. Rather than attempt a comprehensive review, we offer those predictors that have been most prominent in the recent literature. Table 21.2 summarizes the predictors on which we focus.

Distal Individual Differences—Adaptability For the purposes of discussing antecedents of adaptive performance, we distinguish between proximal and distal predictors. We conceptualize distal predictors as abilities and other characteristics (e.g., personality traits) inherent to the individual, which tend to be less closely related to outcomes of interest and are generally appropriate for selection contexts. Importantly, distal predictors are not direct causes of performance but may predispose individuals to behaviors that increase performance. Proximal predictors by contrast are characteristics that tend to be more trainable or influenced by the environment, such as knowledge and skills, and thus may or may not be appropriate for selection purposes depending on the specific position. For a more detailed treatment of the relationship between various antecedents and adaptive performance, see Jundt, Shoss, and Huang (2015).

Cognitive Ability As with meta-analytic estimates of job and training performance (Schmidt & Hunter, 1998), most studies examining the link between general cognitive ability and adaptive performance


TABLE 21.2 Predictors of Adaptability and Citizenship
Each entry lists the predictor, a brief explanation, and the primary source(s).

Adaptability: Distal predictors
• Cognitive ability: High-g people have more cognitive resources to devote to adaptive situations, allowing them to determine appropriate courses of action. (Pulakos et al., 2002)
• Conscientiousness/Resiliency: High-conscientiousness people are prepared to struggle through uncertainty to achieve. They are more likely to engage in task pursuit. On the other hand, they may be less willing to revise their approach. (LePine et al., 2000; Pulakos et al., 2002; Stewart & Nandkeolyar, 2006)
• Openness to experience: High-openness people are better prepared to revise initial approaches on the basis of experience. On the other hand, they may be less inclined to pursue formal objectives in the first place. (LePine et al., 2000; Pulakos et al., 2002; Stewart & Nandkeolyar, 2006)
• Age: The role of age is unclear. Although physical adaptability may be more difficult for older workers, their experience may help them to appreciate the need for other kinds of adaptability. (DeArmond et al., 2006)

Citizenship: Distal predictors
• Conscientiousness: High-conscientiousness people are more likely to recognize the need for and follow through on citizenship behaviors. “Duty” is positively related to “taking charge,” whereas achievement striving is negatively related. (Borman et al., 2001b; Moon et al., 2008; Organ & Ryan, 1995)
• Prosocial personality: Prosocial people are more empathic and more interested in engaging in activities that benefit others or the organization. (Borman et al., 2001b)
• Collectivism: People high in collectivism feel shared fate with their groups and are more concerned with group outcomes. (Jackson et al., 2006)
• Narcissism: Causes one to engage in less citizenship and to inflate self-ratings of citizenship. (Judge et al., 2006)
• Motives: Prosocial value motives and empathy predict OCB-I, whereas organizational concern motives and reciprocation wariness predict OCB-O. (Kamdar et al., 2006; Korsgaard et al., 2010; Rioux & Penner, 2001)
• Concern over future consequences (CFC): People high in CFC are less likely to engage in OCB if term of employment is unclear. (Joireman et al., 2006)
• Social networks: Relational ties, direct and indirect, lead to OCB. The structure and degree of ties is also relevant. (Bowler & Brass, 2006; Venkatramani & Dalal, 2007)

Adaptability: Proximal predictors
• Regulatory processes: Motivational components (e.g., goal choice and goal striving) and other processes (e.g., strategy selection) transmit the effects of abilities and traits. (Chen et al., 2005; Mitchell & Daniels, 2003; Ployhart & Bliese, 2006)
• Situational knowledge: Because adaptability is situational, practical intelligence in the form of situational knowledge should increase it. (Schmitt & Chan, 2006)

Citizenship: Proximal predictors
• Attitudes: Satisfaction, commitment, gratitude, engagement, and other attitudes explain between-person and within-person variance in OCB. (Ilies et al., 2006; Organ & Ryan, 1995; Rich et al., 2010; Spence et al., 2014)
• Affect: PA, NA, mood, and other affective variables are related to OCB. (Dalal et al., 2009; Glomb et al., 2011)
• Knowledge and skill: Various person and organization-related forms of knowledge and skill influence the success of OCB attempts. (Dudley & Cortina, 2008; Motowidlo et al., 1997; Munyon et al., 2015)
• Leadership: Leader guidance, mentoring, abuse, and LMX have been linked to OCB-I, although job characteristics may mediate these relationships. (Dineen et al., 2006; Eby et al., 2015; Ilies et al., 2007; Martin et al., 2015; Piccolo & Colquitt, 2006)
• Social exchange: Climate for feedback and complementarity lead to positive social exchange, which engenders OCB. Social exchange theory also links leadership to OCB. (Glomb & Welsh, 2005; Rosen et al., 2006; Kirkman et al., 2009)
• Informational privacy: Privacy of personal information leads to empowerment, which increases OCB. (Alge et al., 2006)
• Corporate Social Responsibility (CSR): Although individual differences may moderate, perceptions of CSR are positively related to OCB. (Rupp et al., 2013)

have found a positive, statistically significant relationship. Conceptually, cognitive ability should be related to adaptive performance, particularly cognitively oriented dimensions such as learning new tasks or technologies. With some exceptions (notably Allworth & Hesketh, 1999), field studies relying on performance ratings have generally found small-magnitude effects (less than .20 uncorrected; e.g., Bartram, 2005; Pulakos et al., 2002), whereas lab studies relying on objective performance measures have generally found larger effects (e.g., Bell & Kozlowski, 2008; LePine, 2005; LePine, Colquitt, & Erez, 2000). Furthermore, lab studies examining adaptive transfer of training suggest that cognitive ability is predictive of mediating variables, such as declarative knowledge and practice behaviors, ultimately related to adaptive performance (e.g., Hughes et al., 2013). Research into narrow dimensions of cognitive ability is rare (Jundt et al., 2015), despite evidence that these can be more predictive than measures of general cognitive ability (Lang, Kersting, Hülsheger, & Lang, 2010). One exception is Allworth and Hesketh (1999), who found strong correlations between three measures of cognitive ability (Abstract Reasoning, Clerical Speed and Accuracy, and Numerical Reasoning) and ratings of adaptive performance, with Abstract Reasoning being the most predictive. Other narrow dimensions of cognitive ability, such as divergent thinking, may be closely linked with certain facets of adaptive performance, such as solving problems creatively (Hunter, Cushenbery, & Friedrich, 2012). Recent research suggests that the relationship between general cognitive ability and adaptive performance may be more complex than originally thought. Lang and Bliese (2009) found that individuals with higher cognitive ability showed larger declines in performance after a change (transition adaptation) and no advantage in their rate of regaining previous performance levels (reacquisition adaptation). Beier and Oswald (2012) suggest that this may be evidence that cognitive ability can be a hindrance in certain situations, though stressing the need for more research.
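The transition and reacquisition effects examined by Lang and Bliese are usually estimated with a discontinuous (piecewise) growth model. A minimal sketch under assumed variable names is shown below; the data file, column names, and random-effects structure are hypothetical and simpler than the published analyses.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format trial data with a task change partway through:
#   time      = trial number (0, 1, 2, ...)
#   change    = 0 before the task change, 1 after (the immediate drop captures
#               transition adaptation)
#   post_time = trials elapsed since the change, 0 beforehand (the post-change
#               relearning slope captures reacquisition adaptation)
df = pd.read_csv("adaptive_task_trials.csv")  # hypothetical file

model = smf.mixedlm(
    "performance ~ time + change + post_time",
    data=df,
    groups=df["person_id"],
    re_formula="~time",  # random intercepts and pre-change slopes by person
)
print(model.fit().summary())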

Personality Research into the relationship between global dimensions of personality (e.g., the Big Five) and adaptive performance has yielded inconsistent results. For example, LePine and colleagues (2000) asked undergraduates to complete a complex simulation task and measured their performance before and after a task change. The authors found that conscientiousness was negatively related to post-change (i.e., adaptive) performance, whereas openness to experience was positively related to post-change performance. Stewart and Nandkeolyar (2006) examined how well sales associates adapt to environmental changes in the form of referrals. They found that conscientiousness was positively related to taking advantage of these environmental factors, whereas openness to experience was negatively related. How can two studies both nominally studying the personality-adaptive performance link come to completely different conclusions? Researchers seeking to disentangle the antecedents of adaptive performance appear to be coalescing around two conclusions. First, examining personality at the facet level is more meaningful than examining personality at the dimension level when predicting adaptive performance. Second, the context of adaptive performance matters—certain jobs and tasks require different types of adaptive performance, driving the predictive relationships (Jundt et al., 2015; Pulakos et al., 2000; Pulakos et al., 2006). Exploring personality at the facet level allows better alignment with relevant dimensions of adaptive performance, adding nuance that helps to resolve discrepancies in earlier research. With respect to conscientiousness and openness to experience— the two personality dimensions most frequently associated with adaptive performance—the research becomes clearer at the facet level. For example, in the LePine et al. (2000) study referenced above, the negative relationship between conscientiousness and adaptive performance appears to be driven by facets associated with dependability and not by facets associated with achievement striving. This is in line with the broader literature. While the extant literature has found mixed results for global conscientiousness as a predictor of adaptive performance, researchers have consistently found facets such as “achievement striving” to be positively related to adaptive performance, while
"cautious/dutifulness" facets are negatively related or unrelated to adaptive performance (Griffin & Hesketh, 2005; Pulakos et al., 2002).

With respect to openness to experience, despite its appeal as a construct related to creativity and open-mindedness, most studies have found little evidence of a significant positive relationship with adaptive performance (Allworth & Hesketh, 1999; Griffin & Hesketh, 2005; Huang, Ryan, Zabel, & Palmer, 2014; Pulakos et al., 2002; Stokes et al., 2010; Woo, Chernyshenko, Stark, & Conz, 2014). Although there are exceptions (e.g., LePine et al., 2000; Shoss, Witt, & Vera, 2012), where there are positive relationships, they tend to be small in magnitude and found for particular facets related to creativity (e.g., "ingenuity," Woo et al., 2014; see also Huang et al., 2014) or contextualized to adaptive situations (e.g., "openness to change"; Griffin, Neal, & Parker, 2007). More evidence of a positive correlation between adaptive performance and openness to experience might be found at the facet level on both the predictor (e.g., creativity facets) and the criterion (e.g., solving problems creatively) (Hunter et al., 2012).

Other dimensions of the Big Five—emotional stability/neuroticism, extraversion, and agreeableness—have been examined less frequently in relation to adaptive performance. As with conscientiousness, the results examining the relationship between emotional stability and adaptive performance have been mixed, with some studies showing a small, positive, statistically significant relationship (e.g., Bartram, 2005; Huang et al., 2014; Pulakos et al., 2002) and others showing no relationship (e.g., Allworth & Hesketh, 1999; Shoss et al., 2012). Researchers have generally found no relationship between adaptive performance and global extraversion (e.g., Shoss et al., 2012, but see Blickle et al., 2011 for an exception). However, a study by Huang and colleagues (2014) sheds additional light on the relationships of emotional stability and extraversion with adaptive performance by examining the personality constructs at the facet level and distinguishing between proactive and reactive forms of adaptive performance. The authors found that "ambition" and "adjustment" (facets of extraversion and emotional stability, respectively) were positively correlated with adaptive performance, whereas sociability (another facet of extraversion) was unrelated. Furthermore, the magnitude of the relationships changed depending on the form of adaptive performance. Finally, to our knowledge, agreeableness has been included in only a few studies examining the personality-adaptive performance relationship, with results suggesting a nonsignificant relationship (e.g., Allworth & Hesketh, 1999; Shoss et al., 2012). However, as Ployhart and Turner (2014) suggest, dimensions such as agreeableness may be critical for team or organizational adaptability, and thus should not be discounted in future research.

Motivations, Interests, and Previous Experience Individual trait motivation is often assessed using the concept of goal orientation first proposed by Dweck (1986; Elliott & Dweck, 1988). Although there are different frameworks (e.g., Button, Mathieu, & Zajac, 1996; VandeWalle, 1997), most studies that have examined the linkage between goal orientation and adaptive performance distinguish between learning/mastery and performance orientations. The role of goal orientation in learning outcomes has been well-established (Colquitt, LePine, & Noe, 2000) but has rarely been studied in relation to adaptive performance in a field setting. One study by Goad and Jaramillo (2014) found that mastery orientation was strongly related to "adaptive selling" techniques (performance orientation was not related). Marques-Quinteiro and Curral (2012) also found mastery orientation to be related to both self-reported proactive and adaptive performance in a corporate setting. Goal orientation has been much more frequently examined in the skill acquisition literature, where the positive effects of mastery orientation on learning are well-established, often as a distal predictor mediated by self-efficacy or other proximal variables (e.g., Kozlowski et al., 2001; see Jundt et al., 2015). These results suggest that goal orientation holds promise for predicting adaptive performance in a selection context.

Researchers have also examined the role of interests and experience in predicting adaptive performance, although less frequently. Pulakos and colleagues (2002) developed a measure of interests in adaptive activities and experience with adaptive situations. They found both measures to be positively related to supervisor ratings of adaptive performance, though only the experience measure predicted incrementally beyond cognitive ability and personality. This is consistent with Allworth and Hesketh (1999), who found a nonlinear relationship between an "experience with change" measure and adaptive performance that was incrementally predictive beyond cognitive ability. As summarized by Baard, Rench, and Kozlowski (2014), these studies, as well as the model of individual adaptability proposed by Ployhart and Bliese (2006), represent adaptability as an individual difference construct that requires additional theoretical explication and empirical study.
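
Incremental prediction claims of this kind are typically evaluated with hierarchical regression: the gain in R² from adding the new predictor after ability and personality indexes its incremental validity. The sketch below uses simulated data and hypothetical variable names; it illustrates the logic only and is not a reanalysis of any study cited here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration only; variable names are hypothetical.
rng = np.random.default_rng(1)
n = 500
cog = rng.normal(size=n)                      # cognitive ability
consc = rng.normal(size=n)                    # conscientiousness
change_exp = 0.3 * cog + rng.normal(size=n)   # experience with change
adapt = 0.25 * cog + 0.10 * consc + 0.20 * change_exp + rng.normal(size=n)
df = pd.DataFrame(dict(cog=cog, consc=consc, change_exp=change_exp, adapt=adapt))

# Step 1: ability and personality; Step 2: add experience with change.
step1 = smf.ols("adapt ~ cog + consc", data=df).fit()
step2 = smf.ols("adapt ~ cog + consc + change_exp", data=df).fit()
print(f"R2 step 1 = {step1.rsquared:.3f}")
print(f"R2 step 2 = {step2.rsquared:.3f}")
print(f"incremental validity (delta R2) = {step2.rsquared - step1.rsquared:.3f}")
```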

Other Individual Differences Pulakos and colleagues (2006) propose that individual differences can impact adaptive performance in two ways: (a) through stress mitigation and (b) through effective coping strategies. They go on to propose a number of individual difference constructs that may be linked to various aspects of adaptive performance. A few of the individual differences proposed that have received less attention in the research literature include practical intelligence, social intelligence, and originality. As outlined in Lievens and Chan (2010), practical, social, and emotional intelligence hold promise for predicting key criteria when theoretically aligned to the predictor construct. Pulakos et al.’s (2006) analysis suggests that practical intelligence is conceptually linked to the “handling emergencies/crises” and “solving problems creatively” dimensions of adaptive performance, whereas social intelligence is linked to “demonstrating interpersonal adaptability” and “displaying cultural adaptability.” Similarly, Oliver and Lievens (2014) propose that ability-based emotional intelligence will predict interpersonal adaptability, which will in turn predict adaptive interpersonal performance. While one study (Mumford et al., 1993) provides some support for originality by demonstrating that those concerned with creative achievement are more likely to perform well on novel tasks, more research is needed to link other dimensions that are predictive of creative potential (e.g., divergent thinking, creative processing skills) to adaptive performance (Hunter et al., 2012). Finally, although not appropriate for selection purposes, a couple of studies have examined adaptive performance in relation to demographic variables. In one small-sample study, Pattie and Parks (2011) found that U.S. ethnic minorities are better able than their Caucasian counterparts to adapt to expatriate experiences. Finally, Niessen, Swarowsky, and Leiz (2010) found age to be negatively related to performance and perceived organizational fit after an organizational change, suggesting a negative correlation between age and adaptive performance.

Distal Individual Differences—Citizenship Previous research on distal citizenship performance predictors has focused almost entirely on motivational and dispositional variables. Organ and Ryan (1995) reviewed the research on dispositional predictors of OCB and found that conscientiousness was the strongest predictor. Borman et al. (2001b) found that conscientiousness and two dimensions of prosocial personality, other-oriented empathy and helpfulness, were the strongest predictors of citizenship performance. Jackson, Colquitt, Wesson, and Zapata-Phlean (2006) decomposed the collectivism domain into five subdimensions, finding that the elements of “reliance” (feeling of shared fate with group) and “concern” (feelings of concern about outcomes for group members) related most strongly to within-group citizenship behavior in teams. In a similar vein, Moon, Kamdar, Meyer, and Takeuchi (2008) sought to resolve inconsistent findings related to conscientiousness and “taking charge” citizenship or voluntary behaviors that are intended to affect functional organizational change. They decomposed the conscientiousness trait into “duty” and “achievement striving” and found that “duty” had a positive relationship with taking charge and “achievement striving” had a negative relationship. Moon et al. attributed this to the fact that duty is “other-oriented,” or centered on the benefit of others, whereas achievement striving is not.


Judge, LePine, and Rich (2006) found that narcissism, or delusional beliefs about one's influence and personal specialness, is related to inflated self-ratings of citizenship performance and to inhibited ratings of these same behaviors by others. Taken collectively, this research suggests that having a strong individual orientation can inhibit citizenship performance, particularly those dimensions of citizenship that are other-oriented.

Research has also investigated whether personal motives relate to citizenship performance (Rioux & Penner, 2001). Rioux and Penner (2001) found that prosocial values motives were most strongly associated with citizenship directed toward individuals, whereas organizational concern motives were most strongly related to citizenship directed toward the organization. Korsgaard, Meglino, Lester, and Jeong (2010) examined the interaction of motives and individual differences (other orientation) that predict OCB in situations in which OCB is unrewarded. Kamdar, McAllister, and Turban (2006) found that empathy (defined as empathetic concern and perspective taking) was a better predictor of individual-level citizenship behaviors, whereas reciprocation wariness was a better predictor of citizenship toward the organization. These latter findings reflect an effort to understand citizenship through motives related to social exchange theory (SET; Blau, 1964; Cropanzano & Mitchell, 2005). In SET, an individual's motives are thought to be contingent upon relational obligations and duties formed through exchanges of resources with other individuals or groups. SET posits that two exchanging parties are likely to build high-quality relationships characterized by trust and commitment to one another. From this perspective, citizenship is a form of resource exchange with an organization or other individual. Accordingly, Kamdar et al. suggested that empathetic individuals tend to define their job role in terms of citizenship and are willing to exchange their citizenship with coworkers and the organization alike. In contrast, workers who are more concerned about reciprocation from their exchange partner will only exchange with the more powerful and resource-rich organization, where there is a higher likelihood of reward. Similar reasoning can be used to explain the finding by Anand, Vidyarthi, Hoffman, and Sauer (2010) that those who negotiate nonstandard work arrangements (e.g., flexible work schedules) are more likely to engage in citizenship, and the finding that supervisor mentoring improves OCB-I but not OCB-O (Eby, Butts, Hoffman, & Sauer, 2015). Also from a social exchange perspective, Joireman, Kamdar, Daniels, and Duell (2006) found that "concern over future consequences" (CFC) predicts citizenship. Because the benefits of engaging in citizenship behavior are longer term, individuals who are not concerned with their own benefits in the long term are more likely to engage in citizenship behavior irrespective of their expected length of tenure with an employer. Conversely, individuals who have higher CFC will withhold citizenship in cases where their expected tenure is short because of the lack of perceived benefits. The importance of this finding for selection is that screening for citizenship will be ineffective for high-CFC applicants if the term of employment is unclear or known to be short.

Bowler and Brass (2006) found that individuals were more likely to both give and receive OCB from those with whom they had a relational tie or were connected through a friend in a social network. Workers in powerful structural positions were also more likely to receive OCB from others. These findings were bolstered by a later study by Venkataramani and Dalal (2007), who found that having a direct or indirect (third-party) positive affective tie led to increased instances of interpersonal helping. Taken together, these studies suggest that friendship, direct and indirect, can increase the likelihood of giving and receiving helping behavior from coworkers. In fostering a genial workplace, an organization may then reap the benefits of citizenship performance. More recently, self-regulation has been used to explain citizenship relationships. In an experience sampling study of workday breaks, Hunter and Wu (2015) found that preferred breaks that occurred earlier in the work shift improved citizenship through resource recovery. Trougakos, Beal, Cheng, Hideg, and Zweig (2015) also tested a resource depletion model of daily OCB. Bolino, Hsiung, Harvey, and LePine (2015) drew upon some of the same notions to explain citizenship fatigue. These studies point toward the notion that situational characteristics that cause employees to expend regulatory resources reduce citizenship, while situational characteristics or interventions that conserve or replenish resources increase citizenship.


Immediate/Proximal Determinants—Adaptability Consistent with Campbell, McCloy, Oppler, and Sager (1993) and process models of performance (e.g., Johnson, 2008; Van Iddekinge, Ferris, & Heffner, 2009), research into proximal predictors of adaptive performance tends to focus on motivational constructs, such as self-efficacy, and on job knowledge. Much of the adaptive performance model-building work comes out of lab studies conducted in the skill acquisition literature rather than from field studies. However, one motivational construct, self-efficacy, has been studied in both settings with some regularity, although the construct is defined differently across studies. Griffin et al. (2007) examined role breadth self-efficacy, which they found to be related to individual adaptive and proactive performance dimensions, with a particularly strong correlation with proactive performance. Pulakos et al. (2002) examined self-efficacy for adaptive performance and found it to be related to adaptive performance, but not incrementally over cognitive ability and personality predictors. Various laboratory studies have found self-efficacy to be related to adaptive transfer (Bell & Kozlowski, 2008), though its effect is often mediated by self-regulatory activities (Chen et al., 2005; see Jundt et al., 2015 and Chen & Firth, 2014 for more complete reviews). The other proximal predictor frequently covered in the adaptive performance literature is job-specific knowledge and skills. In skill acquisition studies, researchers have consistently found declarative knowledge of the training topic to be a significant predictor of adaptive transfer (Bell & Kozlowski, 2008; Hughes et al., 2013; Kozlowski et al., 2001). Other knowledge and skill domains, such as general knowledge (Hunter et al., 2012), role knowledge and skill (Chen et al., 2005), and political skill (Blickle et al., 2011), have also been identified as predictive of adaptive performance dimensions. More research is needed to fully uncover the range of knowledge and skill structures that may be predictive of adaptive performance in organizations.
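
The mediated pattern described above (e.g., self-efficacy operating on adaptive performance through self-regulatory activity) is usually examined by estimating an indirect effect. A minimal sketch with simulated data; the variable names and effect sizes are hypothetical, not drawn from any study cited here:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data; names and coefficients are illustrative only.
rng = np.random.default_rng(2)
n = 400
efficacy = rng.normal(size=n)                          # self-efficacy
self_reg = 0.5 * efficacy + rng.normal(size=n)         # self-regulatory activity (mediator)
adapt_perf = 0.4 * self_reg + 0.1 * efficacy + rng.normal(size=n)
df = pd.DataFrame(dict(efficacy=efficacy, self_reg=self_reg, adapt_perf=adapt_perf))

m_mediator = smf.ols("self_reg ~ efficacy", data=df).fit()               # a path
m_outcome = smf.ols("adapt_perf ~ self_reg + efficacy", data=df).fit()   # b and c' paths
a, b = m_mediator.params["efficacy"], m_outcome.params["self_reg"]
print(f"indirect effect (a*b) = {a * b:.3f}")
print(f"direct effect (c')    = {m_outcome.params['efficacy']:.3f}")
# In practice the indirect effect is usually tested with bootstrapped confidence intervals.
```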

Immediate/Proximal Determinants—Citizenship Although stable individual differences are clearly important for the prediction of citizenship performance, the better part of its variance remains unexplained. One likely reason for the modest correlations between stable characteristics and citizenship is that their influence is transmitted by more proximal variables. Three categories of predictors that hold great promise are attitudes, knowledge, and skills. In their seminal meta-analysis, Organ and Ryan (1995) found that, with the exception of conscientiousness, attitudinal variables were stronger predictors than dispositional variables of OCB, with mean uncorrected correlations that were 5–15 points higher. More recently, Rich, LePine, and Crawford (2010) found that job engagement transmitted the effects of value congruence, perceived organizational support, and core self-evaluation on OCB. Several studies have examined within-person OCB effects. Ilies et al. (2006) found that within-person variance in citizenship could be explained by within-person variance in positive and negative affect and job satisfaction. Spence, Brown, Keeping, and Lian (2014) found that state gratitude explained within-person variance in citizenship beyond state positive affect. Dalal, Lam, Weiss, Welch, and Hulin (2009) found that positive affect at the previous time point did not explain variance in OCB at the next time point when concurrent positive affect was included as a predictor.

Until recently, researchers had only hinted at knowledge and skills that might predict citizenship performance. For example, Motowidlo et al. (1997) provided general examples of citizenship performance knowledge and skills. In addition, much of the empirical work has asserted a particular knowledge and/or skill as relevant for effective performance in a specific domain (e.g., Bergman, Donovan, & Drasgow, 2001, interpersonal skill predicting leadership; Morgeson, Reider, & Campion, 2005, interpersonal and teamwork skills predicting team performance; Motowidlo, Brownless, & Schmit, 1998, and Schmit, Motowidlo, Degroot, Cross, & Kiker, 1996, customer service knowledge and skill predicting customer service performance). Other research has worked backward to determine knowledge and skills (e.g., Bess, 2001). For example, a situational judgment test presents a participant with an interpersonal situation that may be encountered on the job for which they are applying, and then he or she must select the most effective response. Subject matter experts (SMEs) then decide which items measure overall job knowledge relevant to citizenship performance dimensions. Thus, no particular citizenship performance knowledge and skill is identified.

Dudley and Cortina (2008) developed a conceptual model linking specific knowledge and skill variables to the Personal Support facets of citizenship performance. Among the most prominent knowledge-based predictors were strategy richness, emotional knowledge, knowledge of influence tactics, and organizational norm knowledge. Among the most prominent skill-based predictors were emotion support skills, emotion management skills, facework skills, behavioral flexibility, social perceptiveness, and perspective-taking skills. We refer the reader to Dudley and Cortina (2008) for the bases of the linkages. To our knowledge, only one study has examined specific knowledge in the prediction of citizenship performance. Bettencourt, Gwinner, and Meuter (2001) found that two customer-knowledge antecedents explained unique variance in service-oriented OCBs after controlling for attitudinal and personality variables.

Regarding skill, there has been a good deal of research linking political skill to OCB. For example, Jawahar, Meurs, Ferris, and Hochwarter (2008) and Liu, Ferris, Zinko, Perrewé, Weitz, and Xu (2007) found that political skill was positively related to OCB because those with political skill are better able to understand what is important to others and are therefore able to select more effective helping behaviors. Politically skilled individuals are also able to make their OCBs more salient to others. Munyon et al. (2015) found an uncorrected correlation between political skill and OCB of .33. Indirect evidence has been found for the effect of interpersonal skills on the fostering of helping in creative tasks. Porath and Erez (2007) found that incidents of rudeness were related to decreased levels of helping toward the rude person and toward unrelated others. This suggests that rudeness has wide-ranging effects in interdependent contexts, such as team performance, where helping can contribute substantially to performance.

Leader characteristics and practices have been found to be predictive of citizenship behavior consistent with social exchange. For instance, leader guidance to followers on appropriate types of behavior (i.e., citizenship), when bolstered by behavioral integrity (word-behavior consistency in the leader), leads to follower OCB (Dineen, Lewicki, & Tomlinson, 2006). Abusive supervision has also been linked to lower perceptions of organizational justice, which act as a mediator to predict levels of citizenship (Aryee, Chen, Sun, & Debrah, 2007). A good deal of work has linked positive leader-member exchange (LMX) to citizenship. Ilies, Nahrgang, and Morgeson (2007) found that LMX was related to OCB-I. Several papers have identified mediators of this relationship. Hu and Liden (2013) found self-efficacy to act as a mediator, and in a meta-analysis, Martin, Guillaume, Thomas, Lee, and Epitropaki (2015) found that trust, motivation, empowerment, satisfaction, and leadership trust transmitted the effects of LMX onto OCB. Transformational leadership has also been linked to citizenship behaviors.
However, a study found that Hackman and Oldham’s (1980) job characteristics theory constructs mediated the transformational leadership-citizenship relation (Piccolo & Colquitt, 2006). Specifically, transformational leadership was found to influence the “meaning” ascribed by employees to elements of their jobs (such as their perceptions of skill variety and identity). This in turn led to intrinsic motivation, which mediated the relation between job characteristics constructs and citizenship behavior. Others suggest that the transformational leadership-citizenship relation is mediated by LMX (Wang, Law, Hackett, Wang, & Chen, 2005). In this case, transformational leadership is a cue to a follower that the leader is a trustworthy exchange partner. This leads to greater investment by the follower in the relationship and then stronger positive LMX perceptions. Kirkman, Chen, Farh, Chen, and Lowe (2009) used SET to argue that followers’ fairness perceptions mediate the relationship between transformational leadership and OCB. Although not examined from a SET perspective, Rosen, Levy, and Hall’s (2006) study can be also understood from a social exchange point of view. Rosen et al. found that fostering a climate where feedback is provided reduces perceptions of organizational politics and increases employee morale. High feedback environments suggest that the outcomes enjoyed by individuals are determined in a
procedurally fair manner and are not based on political skill but on contributions to organizational success, which lead to citizenship. This suggests that resources devoted to selecting good citizens may not be well spent unless the environment is conducive to citizenship. Additionally, complementarity in control-related personality characteristics (the leader having a more controlling personality and subordinates having less controlling personalities) was posited to lead to positive social exchanges (Glomb & Welsh, 2005). The complementarity hypothesis was not supported. Rather, there was a main effect for subordinate control, suggesting that individuals need a sense of control over their work (consistent with some evidence outlined above) to exhibit OCB. In closing, we should mention that most research conducted on predictors of citizenship and adaptive performance has relied on job incumbents as research participants. To increase confidence that the above predictors are in fact the causes of the observed effects on performance, predictive validity studies using applicant populations need to be conducted.
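
The SJT-based approach described earlier in this section is straightforward to operationalize: each response option carries an SME-derived effectiveness value, and an applicant's score is the sum or mean of the values of the options chosen. The sketch below uses a hypothetical two-item key and hypothetical option values; it illustrates the scoring logic only, not any published instrument.

```python
# Hypothetical SJT scoring key: SME-rated effectiveness of each response option (0-1).
sjt_key = {
    "item_1": {"A": 1.0, "B": 0.25, "C": 0.0, "D": 0.5},
    "item_2": {"A": 0.0, "B": 1.0, "C": 0.5, "D": 0.25},
}

def score_sjt(responses, key):
    """Mean SME-keyed effectiveness of the options an applicant selected."""
    values = [key[item][choice] for item, choice in responses.items()]
    return sum(values) / len(values)

applicant_responses = {"item_1": "A", "item_2": "C"}
print(score_sjt(applicant_responses, sjt_key))  # 0.75
```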

MEASUREMENT

Measuring Adaptive Performance In considering methods for measuring individual adaptive performance, we must first return to our definition of adaptive performance as an effective behavioral response to an altered state. This definition suggests that as work situations change, individuals can be evaluated on their effectiveness in (a) recognizing and preparing for the change and (b) maintaining situational awareness and performing during the altered state (Penney, David, & Witt, 2011). Another important aspect of this definition is that adaptive performance is not independent of task and contextual performance (Pulakos et al., 2006). In the performance of technical duties, situations, such as emergencies or new requirements, may emerge that require adaptive performance. The central implication of this definition is that the best measures of adaptive performance will be contextualized to the job or jobs of interest. With this definition as the backdrop, there are two central considerations in developing an adaptive performance measure: (a) the types of change likely to be encountered on the job and (b) the method of measurement. To the first consideration, there are numerous frameworks to draw from, although the eight dimensions first introduced by Pulakos and colleagues (2000) provide an empirically supported starting point (Baard et al., 2014). Bartram’s (2005) “Adapting and Coping” competency provides a similar, though less fully operationalized framework (see Appendix, pp. 1202–1203). Baard and colleagues (2014) offer task complexity as a conceptual underpinning for change type (see Figure 2, p. 90). Other authors have suggested a higher-order dimensional structure to Pulakos and colleagues’ (2000) eight factors, such as proactive versus reactive and cognitive versus noncognitive forms of adaptive performance (Allworth & Hesketh, 1999; Huang et al., 2014; White et al., 2005). With respect to the second consideration, method of measurement, we distinguish among three approaches: maximal, typical, and transfer (see Baard et al., 2014 for a more complete treatment of this topic). Many dimensions of adaptive performance, such as handling emergencies or crisis situations, are for most jobs rare events, suggesting that maximal performance methods may be more appropriate than typical performance methods. However, given the dynamic environment of the modern workplace, typical performance measurement methods may be appropriate in many settings. Maximal performance methods include high-fidelity simulations, such as assessment centers and training simulations, and low-fidelity simulations, such as situational judgment tests (SJTs). Certain jobs, such as pilots (Crognale & Krebs, 2011), firefighters (Joung, Hesketh, & Neal, 2006), and special forces officers (Raybourn, Deagle, Mendini, & Heneghan, 2005), routinely employ high-fidelity simulations that involve situations requiring adaptive behaviors to evaluate performance. However, adaptive performance dimensions could also be assessed with low-fidelity methods. For example, “solving problems creatively” could be assessed by asking participants to generate responses to novel problems and having experts rate the quality of those responses (e.g., Mumford, Baughman, Threlfall, Uhlman, & Costanza, 1993). Other studies have also used
SJT-like measurement methods to assess adaptability, which could also be applied to measuring adaptive performance (e.g., Oswald, Schmitt, Kim, Ramsay, & Gillespie, 2004).

Typical performance measurement methods usually involve behavioral ratings, most often by supervisors or peers. Multiple studies have developed and employed supervisory rating methods, often relying on the Pulakos et al. (2000) framework as a starting point. Other rating methods that involve more sources, such as 360-degree methods, have not seen significant use in the literature, but researchers should consider these and other alternative methods of measuring typical adaptive performance. Experience sampling may represent another viable measurement alternative (Baard et al., 2014). Stokes, Schneider, and Lyons (2010) found the correlation between a subjective and an objective measure of adaptive performance to be .43, suggesting that maximal and typical performance methods of measurement are likely to be positively correlated but nonoverlapping.

The final approach is the adaptive transfer method. Although similar to maximal methods, this approach is distinct because it grew out of the skill acquisition literature rather than the performance measurement literature (Baard et al., 2014). In this paradigm, adaptive performance is measured by examining how individuals perform immediately after changes are made while performing complex tasks. "Change" could be a content change, such as increased speed or accuracy, or a context change, such as a change from one physical context to another (Barnett & Ceci, 2002). Lang and Bliese (2009) distinguish between two types of altered-state performance: (a) transition adaptation, or the extent to which performance does not decrease after a change, and (b) reacquisition adaptation, or the speed with which previous levels of performance are regained after a change. Their research suggests that psychological variables, such as cognitive ability, may differentially predict these two types of adaptation.
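
The transition/reacquisition distinction is typically estimated with a discontinuous growth model: time is split into a pre-change slope, a dummy variable capturing the immediate post-change drop (transition adaptation), and a post-change slope capturing how quickly performance is regained (reacquisition adaptation). The sketch below fits such a model to simulated trial-level data; the variable names, sample sizes, and coefficients are illustrative assumptions, not values from Lang and Bliese (2009).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated repeated-measures data: 100 people, 12 trials, task change after trial 6.
rng = np.random.default_rng(3)
rows = []
for person in range(100):
    skill = rng.normal()
    for trial in range(12):
        post = int(trial >= 6)             # 1 after the task change
        post_time = max(0, trial - 6)      # trials elapsed since the change
        perf = (0.3 * trial                # pre-change learning slope
                - 1.5 * post               # transition adaptation: drop at the change
                + 0.4 * post_time          # reacquisition adaptation: added recovery slope
                + skill + rng.normal(scale=0.5))
        rows.append(dict(person=person, trial=trial, post=post,
                         post_time=post_time, perf=perf))
df = pd.DataFrame(rows)

# Random-intercept discontinuous growth model.
result = smf.mixedlm("perf ~ trial + post + post_time", data=df,
                     groups=df["person"]).fit()
print(result.params[["trial", "post", "post_time"]])
```

A large negative coefficient for the post-change dummy indicates poorer transition adaptation, while a larger positive coefficient for the post-change time term indicates faster reacquisition.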

Measuring Citizenship-Related Variables In measuring citizenship performance, there are four major issues to consider. First, although citizenship is often measured with global indices, few jobs require equal amounts of all dimensions. Thus, global measures are contaminated for almost all settings. Organizations would do well to identify the citizenship dimensions in which they are particularly interested. Second, measures of citizenship vary in the degree to which they reflect activity rather than performance. Self-report measures of citizenship invariably emphasize activity (e.g., “I often help my coworkers when they have personal problems”) or attempt (e.g., “I often try to help my coworkers when they have personal problems”) rather than emphasizing the degree to which activities and attempts are successful. Measures from external sources such as supervisors are much more likely to reflect success. For this reason, external evaluation of citizenship is even more important than it is for other factors. Another way of characterizing this difference is to say that it is likely that a different model is tested if a self-report measure of citizenship is used rather than an external measure. This is not to say that supervisors cannot be asked about citizenship activity as opposed to citizenship performance. Rather, whereas an external source might be expected to produce an unbiased evaluation of citizenship quality, it is not reasonable to expect an unbiased evaluation of quality from the subject him/herself. Even if one focuses only on quantity, it is not always obvious what is being quantified. For example, Podsakoff, Maynes, Whiting, and Podsakoff (2015) found that item referent (group vs. individual) influenced the measurement of voice such that individual referents led raters to focus on frequency of voice behaviors, whereas group referents led them to focus on proportion of group members who exhibit voice. To address the third issue, we return to the Ilies et al. (2006) study. These authors showed that there was tremendous within-person variability in citizenship such that the level of citizenship varied considerably from one day to the next. This finding might suggest either that citizenship is not a stable construct or that the extent to which a person has opportunities to display citizenship behaviors varies within and across workdays. More research is needed to clarify these issues. Episodic measurement of citizenship affords two types of citizenship measurement. In addition to treating episodic measurements separately, they can also be averaged. The values based on
the individual episodes can be used to test within-person models such as those tested in Ilies et al. (2006), Judge et al. (2006), and Dalal et al. (2009). The averages of these episodic measurements are conceptually similar to the single-shot measures that are typical of citizenship research and can be used to test individual difference models. Fourth, in describing the model of citizenship on which we would rely in this chapter, we mentioned that a good deal of conceptual overlap exists among the various facets of citizenship. This overlap creates serious measurement problems. For example, in writing items to measure the Personal Support facets helping and cooperation, Dudley and Cortina (2008) found that despite careful item distinctions, a group of SMEs had difficulty in categorizing items accurately. This is consistent with the Hoffman et al. (2007) finding that raters typically have difficulty in discriminating among subdimensions without the aid of "frame-of-reference" training. It should also be mentioned that, just as there are unique difficulties in the measurement of citizenship, there are unique difficulties in the measurement of the knowledge and skills that predict citizenship. We mentioned in the previous paragraph that dimensions of citizenship are difficult to measure, in part because they are difficult to distinguish from the knowledge and skills that predict them. This problem cuts both ways in that knowledge and skills are difficult to measure because they are difficult to distinguish from the dimension that they are intended to predict.
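
The two uses of episodic measurement described above rest on decomposing daily ratings into a stable between-person component (the person average, comparable to a single-shot measure) and day-to-day within-person deviations. A minimal sketch with simulated daily data; the variable names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Simulated daily OCB ratings: 50 employees x 10 workdays (values are arbitrary).
rng = np.random.default_rng(4)
ids = np.repeat(np.arange(50), 10)
person_mean = np.repeat(rng.normal(3.5, 0.5, size=50), 10)   # stable between-person level
daily_ocb = person_mean + rng.normal(0, 0.7, size=500)       # plus daily fluctuation
df = pd.DataFrame({"employee": ids, "ocb": daily_ocb})

# Between-person component: each person's average across episodes.
df["ocb_between"] = df.groupby("employee")["ocb"].transform("mean")
# Within-person component: daily deviation from the person's own average.
df["ocb_within"] = df["ocb"] - df["ocb_between"]

between_var = df.groupby("employee")["ocb"].mean().var(ddof=1)
within_var = df["ocb_within"].var(ddof=1)
print(f"between-person variance ~ {between_var:.2f}, within-person variance ~ {within_var:.2f}")
```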

MODERATORS AND MEDIATORS—ADAPTABILITY As noted elsewhere, adaptive performance is not independent of task and contextual performance. One implication is that general models of performance are likely to apply to adaptive performance and may be useful in providing a conceptual framework for identifying relevant mediators and moderators of adaptive performance. For example, Campbell’s model (1993) of job performance specifies that differences in performance are a result of the interactive effects of declarative knowledge, procedural knowledge, and motivation. Differences in declarative and procedural knowledge are related to much of what is captured in the individual difference approach to adaptability (e.g., ability, personality, skill, interest, experience). Differences in motivation can result from goal orientation (e.g., Elliott & Dweck, 1988), self-efficacy, situational appraisal, self-regulation, and so on (see Chan & Firth, 2014 for a thorough description of the motivational underpinnings of adaptive performance). These distinctions map onto the “can do” and “will do” components emphasized elsewhere (e.g., Schmitt & Chan, 2014). Baard et al. (2014) stressed the need to specify the performance requirements that changed and described how those changes required adaptation and of what kind. That call is very much in line with the Pulakos et al. (2000) model of adaptive performance, which essentially defines categories of situations and responses. Dorsey, Cortina, and Luchman (2010) noted the importance of the characteristics of the task domain, such as complexity, in moderating the relationship between determinants and adaptive performance. Kozlowski and colleagues (e.g., Bell & Kozlowski, 2002, 2008; Kozlowski et al., 2001) have used Wood’s (1986) taxonomy to describe the nature of the complexity within the task domain. Griffin, Neal, and Parker (2007) highlight the role of the degree of uncertainty and interdependence, whereas others have emphasized interdependence (cf., Reis, 2008). These task or situational features are likely to be particularly important in examining cross-level effects. For example, the degree of task interdependence may affect the importance of team structure (e.g., coordination, communication and assisting processes) when team adaptive performance is of interest (e.g., Hollenbeck et al., 2011). Frameworks such as McGrath’s (1984) Group Task Circumplex may serve as a useful organizing structure in thinking through task characteristics that may affect observed relationships. Tett and Burnett (2003) divide situational demands into task demands, social demands, and organizational demands. All three are likely to moderate the relationship between determinants and adaptive performance. Situational or environmental characteristics can also have an impact. For example, compensation systems can either reward or suppress proactive and adaptive responses. Organizational norms, cultures, and climates may be implicated in the amount of adaptive performance
observed at the individual or team level (e.g., Ployhart & Turner, 2014). In a training context, interventions such as manipulations of state goal orientation and error management training can enhance adaptive transfer (Bell & Kozlowski, 2008; Jundt et al., 2015; Kozlowski et al., 2001). These results suggest that the mechanisms contributing to team and organizational culture and climate (e.g., processes, leadership) should be aligned to facilitate adaptive performance.
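
Claims that task or situational features such as complexity or interdependence moderate determinant-performance relationships are ordinarily tested by adding product terms to a regression model. The sketch below uses simulated data and hypothetical variable names; it shows the form of the test, not results from any study cited above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data in which the ability-performance slope grows with task complexity.
rng = np.random.default_rng(5)
n = 600
ability = rng.normal(size=n)
complexity = rng.normal(size=n)
adapt_perf = (0.2 * ability + 0.1 * complexity
              + 0.3 * ability * complexity + rng.normal(size=n))
df = pd.DataFrame(dict(ability=ability, complexity=complexity, adapt_perf=adapt_perf))

# The ability:complexity coefficient estimates the moderation (interaction) effect.
model = smf.ols("adapt_perf ~ ability * complexity", data=df).fit()
print(model.params)
```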

MODERATORS AND MEDIATORS—CITIZENSHIP Various environmental variables might act as moderators in the prediction of citizenship. Much can be learned from research on other criteria. Barrick and Mount (1993) found that autonomy moderated the degree to which job performance was predicted by conscientiousness, extraversion, and agreeableness. Because citizenship is more susceptible than job performance to individual choices, it seems likely that autonomy would moderate the relationship between most predictors and citizenship. Colbert, Mount, Harter, Witt, and Barrick (2004) found that perceptions of the developmental environment and perceived organizational support moderated the relationship between personality and workplace deviance. This is consistent with the Ilies et al. (2006) finding that attitudes and personality interact to predict citizenship. Although citizenship can be predicted by factors such as personality traits (e.g., conscientiousness) and job attitudes (e.g., job satisfaction), research is beginning to show that focusing only on factors predictive of task performance may result in a decreased ability to explain citizenship. Citizenship is highly interpersonal in nature and, as was outlined above, is beginning to be understood from an SET perspective. Accounting for personality and SET on citizenship performance, Kamdar and Van Dyne (2007) used conscientiousness, agreeableness, LMX, and TMX (team-member exchange) to predict citizenship. Consistent with prior research, both personality traits predicted citizenship toward supervisors and coworkers. However, LMX and TMX were also able to predict citizenship above and beyond personality (for supervisors and coworkers, respectively). Furthermore, agreeableness was found to moderate the relationship between quality of LMX and citizenship such that individuals with high levels of agreeableness do not need high-quality exchanges to engage in citizenship behavior. In sum, Kamdar and Van Dyne’s (2007) findings suggested that when we fail to account for nontraditional predictors such as exchange relationship quality, our ability to predict citizenship is diminished, and agreeableness appears to have a more consistent relationship with citizenship than it really does (in reality, it appears to change depending on exchange relationship quality). The organizational justice literature has begun to explore the role of moderators in the relationship between justice and citizenship. For instance, Kamdar et al. (2006) found that job role mediates the relationship between procedural justice and citizenship. Thus, individuals who define their job role as involving citizenship will engage in these behaviors irrespective of whether they experience procedural justice at work or not. Those who do not will essentially “withhold” citizenship when not treated fairly. Procedural justice has also been found to be more strongly related to “taking charge” citizenship when perceived discretion over the demonstration of these behaviors is low (i.e., they are role prescribed; McAllister, Kamdar, Morrison, & Turban, 2007). Interestingly, this same study found that altruistic/interpersonal citizenship had a stronger relationship with high perceived discretion of citizenship when procedural justice was low (McAllister et al., 2007). This finding is consistent with research by Halbesleben and Bowler (2007), in which interpersonal citizenship was found to be used as a social support coping mechanism when conditions at work are stressful. 
Findings related to procedural justice climate (PJC) are consistent with some of the research outlined above. For instance, Yang, Mossholder, and Peng (2007) found that average group levels of “power distance,” or the extent to which individuals defer to decisions of powerful individuals and accept power imbalances, moderates the relationship between PJC and citizenship. Groups with high average levels of power distance will not “withhold” citizenship toward the organization when faced with unfair treatment at the group level because they do not feel that arbitrary decisions made by leaders justify such a reaction. Other instances of multilevel research suggest that attitude targets moderate relationships across levels. Thus, group-level
justice perceptions are more strongly related to "higher-level" recipients of citizenship (e.g., Liao & Rupp, 2005; Redman & Snape, 2005). Other leader characteristics and practices have been found to moderate relations between personality and contextual variables and citizenship behavior. For instance, charismatic leadership has been found to interact with feelings of follower belongingness such that charisma is less important in cases in which follower belongingness is high (Den Hartog, De Hoogh, & Keegan, 2007). The effect of leader influence tactics on subordinates' citizenship performance has also been found to be contingent upon the quality of the relationship between leader and follower (i.e., LMX). For instance, inspirational techniques are negatively related to citizenship for followers with poor-quality LMX because these appeals reinforce value incongruence between the leader and the follower (Sparrowe, Soetjipto, & Kraimer, 2006). However, those higher in LMX were encouraged to engage in more citizenship by the use of exchange appeals in which resources are exchanged between leader and subordinate. This was likely construed as "just another exchange" among many already positive exchanges between the leader and follower (Sparrowe et al., 2006). As a whole, the justice and SET-related research above suggests that when other individual differences are not driving an employee to engage in citizenship, being treated well by the organization can compensate. Thus, high levels of certain individual differences bound justice and SET's prediction of citizenship.

In addition to SET as an explanation of citizenship behavior, researchers are beginning to recognize the role of self-enhancement as a motive for citizenship (e.g., Bowler & Brass, 2006; Yun, Takeuchi, & Liu, 2007). Research has shown that in cases where an employee's role is ambiguous, employees will engage in more citizenship performance toward the organization to make up for their inability to determine which behaviors are valued (Yun et al., 2007). This relationship only holds for employees who are perceived as having high levels of affective organizational commitment; otherwise, their citizenship motives are transparent and recognized as being self-interested (Yun et al., 2007). Of particular interest here is a study by Grant and Mayer (2009) showing that prosocial and impression management motives interact to predict citizenship such that impression management is less predictive for those high in prosocial motives. Other studies have found a similar role played by time management (Rapp, Bachrach, & Rapp, 2013) and job ambivalence (Ziegler, Schlett, Casel, & Diehl, 2012). One study has suggested that commitment may be less predictive than the configurations of differing types of commitment (Sinclair, Tucker, Cullen, & Wright, 2005). The purpose of this study was to tease apart how different profiles or levels of affective and continuance commitment within persons predicted citizenship performance between persons. "Devoted" (high affective and continuance commitment) employees were found to have consistently higher citizenship than other profiles, and "free agents" (moderate continuance, low affective) were found to have consistently low citizenship. Recent work has also uncovered curvilinear citizenship relationships. Rubin, Dierdorff, and Bachrach (2013) found that the relationship between citizenship and task performance weakens as citizenship increases.
MacKenzie, Podsakoff, and Podsakoff (2011) found an inverted U relationship between group-level OCBs and task performance. Both studies identified moderators of these curvilinear relationships as well, such as autonomy and accountability. Finally, we return once again to Ilies et al. (2006) and Dalal et al. (2009). These authors first demonstrated that there is meaningful within-person variance in citizenship. Using an event sampling approach, these authors showed that a sizable percentage of the total variance in citizenship was within-person variance. These studies then showed that within-person citizenship variance had a good deal of overlap with within-person variance in job attitudes such as job satisfaction and positive affect. Finally, Ilies et al. found that stable individual difference variables such as agreeableness moderated this within-person relationship. By treating citizenship as a within-person variable, Ilies et al. and Dalal et al. offer a different perspective on the topic and point to a need to better understand the fundamental stability and drivers of stability/instability of this construct and its measures. In short, some task and organizational variables suppress adaptability or citizenship, change their components, muffle the influence of determinants, or transmit the effects of those determinants. If one is to maximize the utility of an adaptability-based or citizenship-based selection system, then these variables and this impact must be recognized.
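
Curvilinear findings such as the inverted U reported by MacKenzie et al. (2011) are detected by adding a squared term to the model; a negative quadratic coefficient indicates that the payoff of additional citizenship eventually flattens and reverses. The sketch below uses simulated data (not the original study data) and hypothetical variable names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated inverted-U: task performance peaks at moderate citizenship levels.
rng = np.random.default_rng(6)
n = 400
ocb = rng.uniform(0, 4, size=n)
task_perf = 1.2 * ocb - 0.3 * ocb**2 + rng.normal(scale=0.5, size=n)
df = pd.DataFrame(dict(ocb=ocb, task_perf=task_perf))

model = smf.ols("task_perf ~ ocb + I(ocb ** 2)", data=df).fit()
print(model.params)  # a negative I(ocb ** 2) coefficient signals the inverted U
peak = -model.params["ocb"] / (2 * model.params["I(ocb ** 2)"])
print(f"estimated optimum around ocb = {peak:.2f}")
```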


IMPACT ON ORGANIZATIONAL OUTCOMES

Impact of Adaptability How can selecting individuals who are more likely to engage in adaptive behavior and adaptive performance impact organizational bottom-line results? One answer to this question is to posit that having more adaptive individuals makes for more adaptive organizations. This line of thinking views organizational adaptability as an emergent phenomenon driven by the adaptive capabilities of organizational members (Kozlowski, Watola, Nowakowski, Kim, & Botero, 2008). Still, one can ask what such adaptability looks like at the level of organizational outcomes. Reviewing the existing literature on all of the ways in which organizations adapt to become more effective is well beyond the scope of this chapter [see Ployhart and Turner (2014) for a more complete treatment of internal and external firm adaptation]. Here, we propose (based on educated speculation) three ways in which individual-level adaptive performance might aggregate to affect or link to organizational outcomes; namely, (a) managing change, (b) increasing organizational learning, and (c) maintaining customer focus. The first of these items suggests that organizations with higher levels of individual adaptive capacity might manage change better. As suggested earlier, modern organizations merge, grow, shrink, or expand (often globally), thus requiring adaptation on the part of their members. If members are better able to tolerate, manage, and leverage such changes, organizations are likely to be more effective. The research literature supports the contention that variables such as openness to change serve as moderators of important organizational outcomes (e.g., satisfaction, turnover; Wanberg & Banas, 2000). In addition, constant change arising from technologies, globalization, restructuring, and the like requires organizational members at various levels of aggregation (individuals, teams/units, entire organizations) to learn new skills, tasks, and technologies. Thus, the popular notion of a "learning organization" may depend largely on the adaptive capacity of its constituent members (Redding, 1997). Lastly, as markets, environments, and missions change, organizations and their members must refocus on what customers want, value, and need. Thus, we highlight maintaining a focus on customers as a final potential organizational outcome related to adaptive performance. As individual performers seek to adaptively sense and respond to customer demands, organizational effectiveness is likely enhanced.

Impact of Citizenship Citizenship performance does indeed contribute to organizational effectiveness (e.g., George & Bettenhausen, 1990; Karambayya, 1990, as cited in Podsakoff et al., 2000; Koys, 2001; Podsakoff, Ahearne, & MacKenzie, 1997; Podsakoff & MacKenzie, 1994; Podsakoff et al., 2000). This is especially true in cases in which work tasks are interdependent (Bachrach, Powell, Collins, & Richey, 2006b). Other research has confirmed that fostering citizenship leads to positive organizational outcomes. Specifically, service-related citizenship partially mediated the relationship between social exchange–informed HR practices and organizational productivity and turnover (Sun, Ayree, & Law, 2007). Li, Zhao, Walter, and Zhang (2015) found that an “extra miler” in a team (i.e., a team member exhibiting citizenship behaviors) improves team functioning to the degree that the person occupies a central position in the network. Payne and Webber (2006) found that service-oriented citizenship (i.e., toward customers) is related to customer attitudes such as satisfaction, loyalty intentions, relationship tenure, positive word-of-mouth, and reduced complaining. Similar to Sun et al. (2007), these results suggested that positive social exchanges predict citizenship. Research has demonstrated that citizenship performance is important in supervisory ratings of performance (Borman et al., 1995; Conway, 1996; MacKenzie, Podsakoff, & Fetter, 1993; Rotundo and Sackett, 2002; Werner, 1994). Group task interdependence has been found to moderate the effects of OCB on performance appraisals (Bachrach, Powell, Bendoly, & Richey, 2006a).


Although supervisors may typically ignore or deemphasize the centrality of OCB to overall job performance, when tasks are highly interdependent, the need for cooperation and helping is more difficult to disregard. Thus, with higher levels of interdependence, the influence of citizenship on performance appraisal is more pronounced. However, other research suggests that the influence of citizenship on performance appraisal is affected by social comparison. If an employee's workgroup exhibits high average levels of citizenship, then any given employee's citizenship appears comparatively lower and tends to have a weaker association with appraisal outcomes than it would in a workgroup with lower average levels of citizenship (Bommer, Dierdorff, & Rubin, 2007). Whiting, Podsakoff, and Pierce (2008) decomposed the citizenship domain into "helping" or altruistic citizenship, "voice" (similar to "taking charge"), and "loyalty" to gauge their independent effects on performance appraisal outcomes. This study found that independent of task performance, all three citizenship dimensions predicted appraisals, with loyalty having the strongest association. Interestingly, a three-way interaction was found such that when helping is low and task performance is high, voice loses its association with positive appraisals. Because voice is more confrontational than the other forms of citizenship, when employees are not contributing in other areas of their job, their challenges to organizational routine are undervalued. Ferrin, Dirks, and Shah (2006) found that OCB-Is, or interpersonal citizenship behaviors, were predictive of ratings of interpersonal trust, especially in cases where the social networks of the rater and ratee were similar. This is because interpersonal citizenship behaviors are explicitly altruistic in nature and are therefore taken to indicate the trustworthiness of the person engaging in them.

TOO MUCH OF A GOOD THING?

Too Much Adaptability? In some performance environments, the study of excessive, unpredictable, and/or ineffective changes in response to perceptions of altered situations may prove useful. For example, in military settings, it may be important to research the nature of shifts between following standard operating procedures (e.g., doctrine) and engaging in nonroutine acts of adaptive performance. It is possible that an overemphasis on adaptability can lead to individuals, teams, or organizations that are out of control or fail to institutionalize effective practices. Although we are not aware of research that directly addresses the boundary conditions under which adaptive performance becomes too adaptive, it should be one consideration in studying adaptive performance. In addition, there is little extant evidence regarding subgroup differences in adaptive performance. Future research should address this gap.

Too Much Citizenship? Research has shown that there are smaller subgroup differences on citizenship than on technical performance (Hattrup, Rock, & Scalia, 1997; Murphy & Shiarella, 1997; Ployhart & Holtz, 2008). This would suggest that a criterion space comprising a larger percentage of citizenship should yield smaller differences among demographic subgroups. The story is not so simple, however. Heilman and Chen (2005) found that there are gender differences in expectations regarding the ratio of technical to citizenship performance, such that women are expected to engage in more and better citizenship than are men. Specifically, they found that the positive effects of engaging in altruistic OCB were observed in the performance appraisals of men but not of women. Conversely, women were penalized by raters for not engaging in citizenship. The authors attributed this to gender differences in role expectations. When women do not engage in altruistic citizenship, they are seen as failing to fulfill their roles. When men do engage in citizenship, they are rewarded because their behavior is seen as "extra-role." This creates several problems for selection. First, it may skew validation results by introducing criterion inflation or deflation depending on the incumbent's gender. Second, it creates a need for different weighting schemes for selection tests depending on gender, a need that cannot legally be met.

Citizenship has also been linked to employee race. Jones and Schaubroeck (2004) found that relative to White employees, non-White employees tended to have lower self- and supervisor-reported OCB. However, this relationship was partially mediated by job satisfaction, negative affectivity, and coworker social support. Furthermore, citizenship has been found to be related to employee age. In a meta-analysis by Ng and Feldman (2008), age was found to have several nonzero correlations with self- and supervisor-rated dimensions of citizenship. Citizenship behavior has also been linked to increased work-family conflict and stress/strain, especially when an employee's individual initiative is high. If individuals strive not only to do their jobs well but also to be good organizational citizens, then they are likely to experience role conflict with their family life, an effect that is especially strong for working women (Bolino & Turnley, 2005). Finally, as was mentioned earlier, the findings of Rubin et al. (2013) and MacKenzie et al. (2011) suggest that opportunity costs are associated with citizenship. A lack of citizenship is bad for the group and the organization. On the other hand, citizenship can come at the cost of task performance, and there appears to be a happy medium such that either too much or too little citizenship can be detrimental.
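The weighting issue can be made concrete with a small arithmetic sketch. Assuming hypothetical standardized subgroup differences (d) on a technical criterion and a citizenship criterion, and a hypothetical correlation between the two criteria, the code below shows how the subgroup difference on a weighted composite criterion shrinks as citizenship receives more weight. The values are illustrative placeholders, not estimates from the studies cited above.

import math

def composite_d(d_task, d_citizenship, w_task, w_citizenship, r):
    """Standardized subgroup difference on a weighted sum of two standardized criteria.

    Assumes both criteria are standardized (unit variance) with correlation r and
    that the within-group covariance structure is the same in both groups.
    """
    numerator = w_task * d_task + w_citizenship * d_citizenship
    sd_composite = math.sqrt(w_task**2 + w_citizenship**2 + 2 * w_task * w_citizenship * r)
    return numerator / sd_composite

# Hypothetical subgroup differences: larger on the technical criterion than on citizenship
d_task, d_cit, r = 0.60, 0.20, 0.30
for w_cit in (0.0, 0.25, 0.50, 0.75, 1.0):
    w_task = 1.0 - w_cit
    print(f"citizenship weight {w_cit:.2f}: composite d = {composite_d(d_task, d_cit, w_task, w_cit, r):.2f}")

With these hypothetical inputs, the composite d falls steadily as the citizenship weight rises, which is the pattern the subgroup-difference findings above would lead one to expect.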

CONCLUSIONS

The purpose of this chapter was to review recent research on two performance dimensions that represent departures from traditional job performance models: adaptive performance and citizenship performance. For each of these dimensions, we began by offering definitions that clarify their nature and their distinctiveness. We then reviewed research on distal and proximal predictors of these dimensions and discussed variables that might moderate the relationships between these dimensions and other variables. Finally, we discussed the consequences of these dimensions.

A wide array of research has been conducted on predictors of adaptive and citizenship performance. For both criteria, further research at the facet level may be fruitful to help disentangle weak or mixed findings and to better understand potential tradeoffs across criteria. For example, findings regarding the relationship between conscientiousness and adaptive performance have been mixed; facet-level relationships may be driving these mixed results, such that achievement striving is positively related to adaptive performance, whereas duty is negatively related. On the other hand, researchers have found duty to be positively related to "taking charge" citizenship and achievement striving to be negatively related. With regard to mediators, situational knowledge, self-efficacy, and regulatory processes mediate relationships with adaptive performance, whereas attitudes and affect mediate relationships with citizenship performance. Moreover, specific performance requirements and situational characteristics moderate individual difference–performance relationships. Important performance requirements include complexity, uncertainty, interdependence, and autonomy. Situational characteristics include compensation, norms, culture, climate, exchange relationship quality, and leader characteristics.

This chapter should make clear that adaptive and citizenship performance are important and complex. One reason they are important is that they relate to variables that are of great interest to organizations, such as performance appraisal, group effectiveness, change management, and stressors. Another reason is that they are distinct from other performance dimensions, both conceptually and nomologically. If, as seems to be the case, these dimensions are underrepresented in performance appraisals, then the weighting of dimensions in those systems is suboptimal. In addition to the obvious consequences for organizational productivity, this would also result in discrimination against protected groups to the degree that these dimensions show smaller subgroup differences than do the dimensions that are commonly included in appraisal instruments. If these dimensions are important, then more work must be done to determine how important they are (i.e., optimal weighting) and how their relative importance varies with situational or organizational characteristics. More research must also be done to identify the determinants of these dimensions. Because many of these determinants are malleable (e.g., skills, leader behaviors), research must also evaluate interventions designed to increase adaptive performance and citizenship through the improvement of these determinants. One important issue to note is that uncertainty in work roles is sometimes the subject of labor–management conflict. Organizations may need to balance incorporating adaptive performance into performance conceptualizations against union concerns regarding uncertainty in job roles.

In closing, we would point out that there is no reason to stop here. If performance is to be understood, then adaptive performance and citizenship must receive specific attention. However, there are bound to be other dimensions of performance that should also be added to the mix. If it can be demonstrated that a new dimension is conceptually distinct from existing dimensions, has important consequences, and has a set of determinants that differ from those of other dimensions, then that new dimension should be added to our models of performance.

REFERENCES Alge, B. J., Ballinger, G. A., Tangirala, S., & Oakley, J. L. (2006). Information privacy in organizations: Empowering creative and extrarole performance. Journal of Applied Psychology, 91, 221–232. Allworth, E., & Hesketh, B. (1999). Construct-oriented biodata: Capturing change-related and contextually relevant future performance. International Journal of Selection and Assessment, 7, 97–111. doi: 10.1111/1468–2389.00110 Anand, S., Vidyarthi, P. R., Liden, R. C., & Rousseau, D. M. (2010). Good citizens in poor-quality relationships: Idiosyncratic deals as a substitute for relationship quality. Academy of Management Journal, 53(5), 970–988. Aryee, S., Chen, Z. X., Sun, L-Y., & Debrah, Y. A. (2007). Antecedents and outcomes of abusive supervision: Test of a trickle-down model. Journal of Applied Psychology, 92, 191–201. Baard, S. K., Rench, T., Kozlowski, S. (2014). Performance adaptation: A theoretical integration and review. Journal of Management, 40, 48–99. doi: 10.1177/0149206313488210 Bachrach, D. G., Powell, B. C., Bendoly, E., & Richey, R. G. (2006a). Organizational citizenship behavior and performance evaluations: Exploring the impact of task interdependence. Journal of Applied Psychology, 91, 193–201. Bachrach, D. G., Powell, B. C., Collins, B. J., & Richey, R. G. (2006b). Effects of task interdependence on the relationship between helping behavior and group performance. Journal of Applied Psychology, 91, 1396–1405. Barnard, C. I. (1938). The functions of the executive. Cambridge, MA: Harvard University Press. Barnett, S. M., & Ceci, S. J. (2002). When and where do we apply what we learn? A taxonomy for far transfer. Psychological Bulletin, 128, 612–637. doi: 10.1037//0033–2909.128.4.612 Barrick, M. R., & Mount, M. K. (1993). Autonomy as a Moderator of the Relationships Between the Big Five Personality Dimensions and Job Performance. Journal of Applied Psychology, 78(1), 111–118. Bartram, D. (2005). The great eight competencies: A criterion-centric approach to validation. Journal of Applied Psychology, 90(6), 1185–1203. doi: 10.1037/0021–9010.90.6.1185 Bateman, T. S., & Organ, D. W. (1983). Job satisfaction and the good soldier: The relationship between affect and employee “citizenship”. Academy of Management Journal, 26, 587–595. Beier, M. E., & Oswald, F. L. (2012). Is cognitive ability a liability? A critique and future research agenda on skilled performance. Journal of Experimental Psychology: Applied, 18, 331–345. doi: 10.1037/a0030869 Bell, B. S., & Kozlowski, S. W. (2002). Adaptive guidance: Enhancing self-regulation, knowledge, and performance in technology-based training. Personnel Psychology, 55(2), 267–306. doi: 10.1111/j.1744– 6570.2002.tb00111 Bell, B. S., & Kozlowski, S. W. (2008). Active learning: Effects of core training design elements on self-regulatory processes, learning, and adaptability. Journal of Applied Psychology, 93(2), 296–316. doi: 10.1037/0021–9010.93.2.296


Adaptive and Citizenship-Related Behaviors Bergman, M. E., Donovan, M. A., & Drasgow, F. (April 2001). A framework for assessing contextual performance. Paper presented at the sixteenth annual conference of the Society of Industrial and Organizational Psychology, San Diego, CA. Bess, T. L. (2001). Exploring the dimensionality of situational judgment: Task and contextual knowledge. Unpublished thesis, Virginia Polytechnic Institute and State University. Bettencourt, L. A., Gwinner, K. P., & Meuter, M. L. (2001). A comparison of attitude, personality, and knowledge predictors of service-oriented organizational citizenship behaviors. Journal of Applied Psychology, 86, 29–41. Blau, P. M. (1964). Exchange and power in social life. New York, NY: Wiley. Blickle, G., Kramer, J., Schneider, P. B., Meurs, J. A., Ferris, G. R., Mierke, J., Witzki, A. H., & Momm, T. D. (2011). Role of political skill in job performance prediction beyond general mental ability and personality in cross-sectional and predictive studies. Journal of Applied Social Psychology, 41, 488–514. doi: 10.1111/j.1559–1816.2010.00723.x Bolino, M. C., Hsiung, H. H., Harvey, J., & LePine, J. A. (2015). “Well, I’m tired of tryin’!” Organizational citizenship behavior and citizenship fatigue. Journal of Applied Psychology, 100, 56–74. Bolino, M. C., & Turnley, W. H. (2005). The personal costs of citizenship behavior: The relationship between individual initiative and role overload, job stress, and work-family conflict. Journal of Applied Psychology, 90, 740–748. Bommer, W. H., Dierdorff, E. C., & Rubin, R. S. (2007). Does prevalence mitigate relevance? The moderating effect of group-level OCB on employee performance. Academy of Management Journal, 50, 1481–1494. Borman, W. C., Buck, D. E., Hanson, M. A., Motowidlo, S. J., Stark, S., & Drasgow, F. (2001a). An examination of the comparative reliability, validity, and accuracy of performance ratings made using computerized adaptive rating scales. Journal of Applied Psychology, 86, 965–973. Borman, W. C., & Motowidlo, S. J. (1993). Expanding the criterion domain to include elements of contextual performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 71–98). San Francisco, CA: Jossey-Bass. Borman, W. C., Penner, L. A., Allen, T. D., & Motowidlo, S. J. (2001b). Personality predictors of citizenship performance. International Journal of Selection and Assessment, 9, 52–69. Borman, W. C., White, L. A., & Dorsey, D. W. (1995). Effects of rate task performance and interpersonal factors on supervisor and peer performance ratings. Journal of Applied Psychology, 80, 168–177. Bowler, W. M., & Brass, D. J. (2006). Relational correlates of interpersonal citizenship behavior: A social network perspective. Journal of Applied Psychology, 91, 70–82. Button, S. B., Mathieu, J. E., & Zajac, D. M. (1996). Goal orientation in organizational behavior research: A conceptual and empirical foundation. Organizational Behavior and Human Decision Processes, 67, 26–48. doi: 10.1006/obhd.1996.0063 Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt, W. C. Borman, & Associates (Eds.), Personnel selection in organizations (pp. 35–70). San Francisco: Jossey-Bass. Campbell, J. P. (2012). Behavior, performance, and effectiveness in the 21st century. In S. Kozlowski (Ed.), The Oxford handbook of organizational psychology (pp. 159–195). New York, NY: Oxford University Press. Chan, D. (2000). 
Conceptual and empirical gaps in research on individual adaptation at work. International Review of Industrial and Organizational Psychology, 15, 143–164. Chen, G., & Firth, B. M. (2014). The motivational underpinnings of adaptability. In D. Chan (Ed.), Individual adaptability to changes at work: New directions in research (pp. 18–35). New York, NY: Routledge. Chen, G., Thomas, B. A., & Wallace, J. C. (2005). A multilevel examination of the relationships among training outcomes, mediating regulatory processes, and adaptive performance. Journal of Applied Psychology, 90, 827–841. doi:10.1037/00219010.90.5.827 Colbert, A. E., Mount, M. K., Harter, J. K., Witt, L. A., & Barrick, M. R. (2004). Interactive effects of personality and perceptions of the work situation on workplace deviance. Journal of Applied Psychology, 89, 599–609. Colquitt, J. A., LePine, J. A., & Noe, R. A. (2000). Toward an integrative theory of training motivation: A meta-analytic path analysis of 20 years of research. Journal of Applied Psychology, 85(5), 678–707. doi: 10.1037/0021–9010.85.5.678 Conway, J. M. (1996). Additional construct validity evidence for the task-contextual performance distinction. Human Performance, 9, 309–329. Crognale, M. A., & Krebs, W. K. (2011). Performance of helicopter pilots during inadvertent flight into instrument meteorological conditions. The International Journal of Aviation Psychology, 21(3), 235–253. doi: 10.1080/10508414.2011.582443 Cropanzano, R., & Mitchell, M. S. (2005). Social exchange theory: An interdisciplinary review. Journal of Management, 31, 874–900.


David W. Dorsey et al. Dalal, R., Lam, H., Weiss, H. M., Welch, E. R., & Hulin, C. L. (2009). A within person approach to work behavior and performance: Concurrent and lagged citizenship-counterproductivity associations, and dynamic relationships with affect and overall job performance. Academy of Management Journal, 52, 1051–1066. DeArmond, S., Tye, M., Chen, P. Y., Krauss, A., Rogers, D. A., & Sintek, E. (2006). Age and gender stereotypes: New challenges in a changing workplace and workforce. Journal of Applied Social Psychology, 36, 2184–2214. Den Hartog, D. N., De Hoogh, A. H. B., & Keegan, A. E. (2007). The interactive effects of belongingness and charisma on helping and compliance. Journal of Applied Psychology, 92, 1131–1139. Dineen, B. R., Lewicki, R. J., & Tomlinson, E. C. (2006). Supervisory guidance and behavioral integrity: Relationships with employee citizenship and deviant behavior. Journal of Applied Psychology, 91, 622–635. Dorsey, D., Cortina, J., & Luchman, J. (2010). Adaptive and citizenship-related behaviors at work. In J. Farr & N. Tippins (Eds.), Handbook of employee selection (pp. 463–487). New York, NY: Routledge / Taylor & Francis. Dudley, N. M., & Cortina, J. M. (2008). Knowledge and skills that facilitate the personal support dimension of citizenship. Journal of Applied Psychology, 93(6), 1249–1270. Dweck, C. S. (1986). Motivational processes affecting learning. American Psychologist, 41(10), 1040. Eby, L. T., Butts, M. M., Hoffman, B. J., & Sauer, J. B. (2015). Cross-lagged relations between mentoring received from supervisors and employee OCBs: Disentangling causal direction and identifying boundary conditions. Journal of Applied Psychology, 100, 1275–1285. Elliott, E. S., & Dweck, C. S. (1988). Goals: An approach to motivation and achievement. Journal of Personality and Social Psychology, 54(1), 5–12. doi: 10.1037/0022–3514.54.1.5 Entin, E. E., Diedrich, F. J., & Rubineau, B. (2003, October). Adaptive communication patterns in different organizational structures. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 47, No. 3, pp. 405–409). SAGE Publications. Ferrin, D. L., Dirks, K. T., & Shah, P. P. (2006). Direct and indirect effects of third-party relationships on interpersonal trust. Journal of Applied Psychology, 91, 870–883. George, J. M., & Bettenhausen, K. (1990). Understanding prosocial behaviour, sales performance, and turnover: A group-level analysis in a service context. Journal of Applied Psychology, 75(6), 698–709. Glomb, T. M., Bhave, D. P., Miner, A. G., & Wall, M. (2011). Doing good, feeling good: Examining the role of organizational citizenship behaviors in changing mood. Personnel Psychology, 64, 191–223. Glomb, T. M., & Welsh, E. T. (2005). Can opposites attract? Personality heterogeneity in supervisorsubordinate dyads as a predictor of subordinate outcomes. Journal of Applied Psychology, 90, 749–757. Grant, A. M., & Mayer, D. M. (2009). Good soldiers and good actors: Prosocial and impression management motives as interactive predictors of affiliative citizenship behaviors. Journal of Applied Psychology, 94, 900–912. Griffin, B., & Hesketh, B. (2004). Why openness to experience is not a good predictor of job performance. International Journal of Selection and Assessment, 12(3), 243–251. Griffin, B., & Hesketh, B. (2005). Are conscientious workers adaptable? Australian Journal of Management, 30, 245–259. doi: 10.1177/031289620503000204 Griffin, M. A., Neal, A., & Parker, S. K. (2007). 
A new model of work role performance: Positive behavior in uncertain and interdependent contexts. Academy of Management Journal, 50, 327–347. Goad, E. A., & Jaramillo, F. (2014). The good, the bad and the effective: A meta-analytic examination. Journal of Personal Selling & Sales Management, 34(4), 285–301. doi: 10.1080/08853134.2014.899471 Hackman, J. R., & Oldham, G. R. (1980). Work redesign. Reading, MA: Addison-Wesley. Halbesleben, J. R. B., & Bowler, W. M. (2007). Emotional exhaustion and job performance: The mediating role of motivation. Journal of Applied Psychology, 92, 93–106. Hattrup, K., Rock, J., & Scalia, C. (1997). The effects of varying conceptualizations of job performance on adverse impact, minority hiring, and predicted performance. Journal of Applied Psychology, 82(5), 656. Heilman, M. E., & Chen, J. J. (2005). Same behavior, different consequences: Reactions to men’s and women’s altruistic citizenship behavior. Journal of Applied Psychology, 90, 431–441. Hoffman, B. J., Blair, C. A., Meriac, J. P., & Woehr, D. J. (2007). Expanding the criterion domain? A quantitative review of the OCB literature. Journal of Applied Psychology, 92, 555–566. Hollenbeck, J. R., Ellis, A. P., Humphrey, S. E., Garza, A. S., & Ilgen, D. R. (2011). Asymmetry in structural adaptation: The differential impact of centralizing versus decentralizing team decision-making structures. Organizational Behavior and Human Decision Processes, 114(1), 64–74. Huang, J. L., Ryan, A. M., Zabel, K. L., & Palmer, A. (2014). Personality and adaptive performance at work: A meta-analytic investigation. Journal of Applied Psychology, 99, 162–179. doi: 10.1037/a0034285


Adaptive and Citizenship-Related Behaviors Hughes, M. G., Day, E. A., Wang, X., Schuelke, M. J., Arsenault, M. L., Harkrider, L. N., & Cooper, O. D. (2013). Learner-controlled practice difficulty in the training of a complex task: Cognitive and motivational mechanisms. Journal of Applied Psychology, 98(1), 80–98. doi: 10.1037/a0029821 Hunter, E. M., & Wu, C. (2015). Give me a better break: Choosing workday break activities to maximize resource recovery. Journal of Applied Psychology. Advance online publication. http://dx.doi.org/10.1037/ apl0000045 Hunter, S. T., Cushenbery, L., & Friedrich, T. (2012). Hiring an innovative workforce: A necessary yet uniquely challenging endeavor. Human Resource Management Review, 22(4), 303–322. doi: 10.1016/j. hrmr.2012.01.001 Ilgen, D. R., & Pulakos, E. D. (1999). Employee performance in today’s organizations. In D. R. Ilgen & E. D. Pulakos (Eds.), The changing nature of performance: Implications for staffing, motivation, and development (pp. 1–20). San Francisco, CA: Jossey-Bass. Ilies, R., Nahrgang, J. D., & Morgeson, F. P. (2007). Leader-member exchange and citizenship behaviors: A meta-analysis. Journal of Applied Psychology, 92, 269–277. Ilies, R., Scott, B. A., & Judge, T. A. (2006). The interactive effects of personal traits and experienced states on intraindividual patterns of citizenship behavior. Academy of Management Journal, 49, 561–575. Jackson, C. L., Colquitt, J. A., Wesson, M. J., & Zapata-Phelan, C. P. (2006). Psychological collectivism: A measurement validation and linkage to group member performance. Journal of Applied Psychology, 91, 884–899. Jawahar, I. M., Meurs, J. A., Ferris, G. R., & Hochwarter, W. A. (2008). Self-efficacy and political skill as comparative predictors of task and contextual performance: A two-study constructive replication. Human Performance, 21, 138–157. Johnson, J. W. (2001). The relative importance of task and contextual performance dimensions to supervisor judgments of overall performance. Journal of Applied Psychology, 86, 984–996. Johnson, J. W. (2003). Toward a better understanding of the relationship between personality and individual job performance. In M. R. Barrick & A. M. Ryan (Eds.), Personality and work: Reconsidering the role of personality in organizations (pp. 83–120). San Francisco, CA: Jossey-Bass. Johnson, J. W. (2008). Process models of personality and work behavior. Industrial and Organizational Psychology, 1(3), 303–307. doi: 10.1111/j.1754–9434.2008.00052.x Joireman, J., Kamdar, D., Daniels, D., & Duell, B. (2006). Good citizens to the end? It depends: Empathy and concern with future consequences moderate the impact of a short-term time horizon on organizational citizenship behaviors. Journal of Applied Psychology, 91, 1307–1320. Jones, J. R., & Schaubroeck, J. (2004). Mediators of the relationship between race and organizational citizenship behavior. Journal of Managerial Issues, 505–527. Joung, W., Hesketh, B., & Neal, A. (2006). Using “war stories” to train for adaptive performance: Is it better to learn from error or success? Applied Psychology, 55, 282–302. doi: 10.1111/j.1464–0597.2006.00244.x Judge, T. A., LePine, J. A., & Rich, B. L. (2006). Loving yourself abundantly: Relationship of the narcissistic personality to self- and other perceptions of workplace deviance, leadership, and task and contextual performance. Journal of Applied Psychology, 91, 762–776. Jundt, D. K., Shoss, M. K., & Huang, J. L. (2015). Individual adaptive performance in organizations: A review. 
Journal of Organizational Behavior, 36, S53–S71. doi: 10.1002/job.1955 Kamdar, D., McAllister, D. J., & Turban, D. B. (2006). “All in a day’s work”: How follower individual differences and justice perceptions predict OCB role definitions and behavior. Journal of Applied Psychology, 91, 841–855. Kamdar, D., & Van Dyne, L. (2007). The joint effects of personality and workplace social exchange relationships in predicting task performance and citizenship performance. Journal of Applied Psychology, 92, 1286–1298. Karambayya, R. (1990). Contexts for organizational citizenship behavior: Do high performing and satisfying units have better ‘citizens’. York University working paper. See Podsakoff, MacKenzie, Paine, & Bachrach (2000). Katz, D. (1964). Motivational basis of organizational behavior. Behavioral Science, 9, 131–146. Kirkman, B. L., Chen, G., Farh, J. L., Chen, Z. X., & Lowe, K. B. (2009). Individual power distance orientation and follower reactions to transformational leaders: A cross-level, cross-cultural examination. Academy of Management Journal, 52, 744–764. Korsgaard, M. A., Meglino, B. M., Lester, S. W., & Jeong, S. S. (2010). Paying you back or paying me forward: Understanding rewarded and unrewarded organizational citizenship behavior. Journal of Applied Psychology, 95, 277–290. Koys, D. J. (2001). The effects of employee satisfaction, organizational citizenship behavior, and turnover on organizational effectiveness: A unit-level, longitudinal study. Personnel Psychology, 54, 101–114.


David W. Dorsey et al. Kozlowski, S. W. J., Gully, S. M., Brown, K. G., Salas, E., Smith, E. M., & Nason, E. R. (2001). Effects of training goals and goal orientation traits on multi-dimensional training outcomes and performance adaptability. Organizational Behavior and Human Decision Processes, 85, 1–31. Kozlowski, S. W. J., Watola, D., Nowakowski, J. M., Kim, B., & Botero, I. (2008). Developing adaptive teams: A theory of dynamic team leadership. In E. Salas, G. F. Goodwin, & C. S. Burke (Eds.), Team effectiveness in complex organizations: Cross-disciplinary perspectives and approaches (pp. 113–155). Mahwah, NJ: Lawrence Erlbaum. Lang, J. W. B., & Bliese, P. D. (2009). General mental ability and two types of adaptation to unforeseen change: Applying discontinuous growth models to the task-change paradigm. Journal of Applied Psychology, 94, 411–428. doi: 10.1037/a0013803 Lang, J. W. B., Kersting, M., Hulsheger, U. R., & Lang, J. (2010). General mental ability, narrower cognitive abilities, and job performance: The perspective of the nested-factors model of cognitive abilities. Personnel Psychology, 63(3), 595–640. doi: 10.1111/j.1744–6570.2010.01182.x LePine, J. A. (2005). Adaptation of teams in response to unforeseen change: Effects of goal difficulty and team composition in terms of cognitive ability and goal orientation. Journal of Applied Psychology, 90(6), 1153–1167. doi: 10.1037/0021–9010.90.6.1153 LePine, J. A., Colquitt, J. A., & Erez, A. (2000). Adaptability to changing task contexts: Effects of general cognitive ability, conscientiousness, and openness to experience. Personnel Psychology, 53, 563–593. LePine, J. A., Erez, A., & Johnson, D. E. (2002). The nature and dimensionality of organizational citizenship behavior: A critical review and meta-analysis. Journal of Applied Psychology, 87, 52–65. Li, N., Zhao, H. H., Walter, S. L., Zhang, X. A., & Yu, J. (2015). Achieving more with less: Extra milers’ behavioral influences in teams. The Journal of applied psychology, 100(4), 1025–1039. Liao, H., & Rupp, D. E. (2005). The impact of justice climate and justice orientation on work outcomes: A cross-level multifoci framework. Journal of Applied Psychology, 90, 242–256. Lievens, F., & Chan, D. (2010). Practical intelligence, emotional intelligence, and social intelligence. In J. Farr & N. Tippins (Eds.), Handbook of employee selection (pp. 339–360). New York, NY: Routledge / Taylor & Francis. Liu, Y., Ferris, G. R., Zinko, R., Perrewé, P. L., Weitz, B., & Xu, J. (2007). Dispositional antecedents and outcomes of political skill in organizations: A four-study investigation with convergence. Journal of Vocational Behavior, 71, 146–165. MacKenzie, S. B., Podsakoff, P. M., & Fetter, R. (1993). The impact of organizational citizenship behavior on evaluations of salesperson performance. Journal of Marketing, 57, 70–80. MacKenzie, S. B., Podsakoff, P. M., & Podsakoff, N. P. (2011). Challenge-oriented organizational citizenship behaviors and organizational effectiveness: Do challenge-oriented behaviors really have an impact on the organization’s bottom line? Personnel Psychology, 64, 559–592. Marques-Quinteiro, P., & Curral, L. A. (2012). Goal orientation and work role performance: Predicting adaptive and proactive work role performance through self-leadership strategies. The Journal of Psychology, 146(6), 559–577. doi: 10.1080/00223980.2012.656157 Martin, R., Guillaume, Y., Thomas, G., Lee, A., & Epitropaki, O. (2015). Leader–member exchange (LMX) and performance: A meta-analytic review. 
Personnel Psychology. Advance online publication. doi: 10.1111/ peps.12100 McAllister, D. J., Kamdar, D., Morrison, E. W., & Turban, D. B. (2007). Disentangling role perceptions: How perceived role breadth, discretion, instrumentality, and efficacy relate to helping and taking charge. Journal of Applied Psychology, 92, 1200–1211. McGrath, J. E. (1984). Groups: Interaction and performance (Vol. 14). Englewood Cliffs, NJ: Prentice-Hall. Mitchell, T. R., & Daniels, E. (2003). Motivation. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Comprehensive handbook of psychology: Vol. 12: Industrial and organizational psychology (pp. 225–254). New York, NY: Wiley. Moon, H., Kamdar, D., Mayer, D. M., & Takeuchi, R. (2008). Me or we? The role of personality and justice as other-centered antecedents to innovative citizenship behaviors within organizations. Journal of Applied Psychology, 93, 84–94. Morgeson, F. P., Reider, M. H., & Campion, M. A. (2005). Selecting individuals in team settings: The importance of social skills, personality characteristics, and teamwork knowledge. Personnel Psychology, 58, 583–611. Motowidlo, S. J., Borman, W. C., & Schmitt, M. J. (1997). A theory of individual differences in task and contextual performance. Human Performance, 10, 71–83. Motowidlo, S. J., Brownlee, A. L., & Schmit, M. J. (2008). Effects of personality characteristics on knowledge, skill, and performance in servicing retail customers. International Journal of Selection and Assessment, 16(3), 272–280.


Adaptive and Citizenship-Related Behaviors Mumford, M. D., Baughman, W. A., Threlfall, K. V., Uhlman, C. E., & Costanza, D. P. (1993). Personality, adaptability, and performance: Performance on well-defined problem solving tasks. Human Performance, 6(3), 241–285. doi: 10.1207/s15327043hup0603_4 Munyon, T. P., Summers, J. K., Thompson, K. M., & Ferris, G. R. (2015). Political skill and work outcomes: A theoretical extension, meta-analytic investigation, and agenda for the future. Personnel Psychology, 68, 143–184. Murphy, K. R., & Shiarella, A. H. (1997). Implications of the multidimensional nature of job performance for the validity of selection tests: Multivariate frameworks for studying test validity. Personnel Psychology, 50(4), 823–854. Ng, T. W., & Feldman, D. C. (2008). The relationship of age to ten dimensions of job performance. Journal of Applied Psychology, 93(2), 392. Ng, T. W., & Feldman, D. C. (2009). How broadly does education contribute to job performance? Personnel Psychology, 62(1), 89–134. Niessen, C., Swarowsky, C., & Leiz, M. (2010). Age and adaptation to changes in the workplace. Journal of Managerial Psychology, 25(4), 356–383. doi: 10.1108/02683941011035287 Oliver, T., & Lievens, P. (2014). Conceptualizing and assessing interpersonal adaptability: Towards a functional framework. In D. Chan (Ed.), Individual adaptability to changes at work: New directions in research (pp. 52–72). New York, NY: Routledge. Organ, D. W., & Ryan, K. (1995). A meta-analytic review of attitudinal and dispositional predictors of organizational citizenship behavior. Personnel Psychology, 48, 775–802. Oswald, F. L., Schmitt, N., Kim, B. H., Ramsay, L. J., & Gillespie, M. A. (2004). Developing a biodata measure and situational judgment inventory as predictors of college student performance. Journal of Applied Psychology, 89(2), 187–207. doi: 10.1037/0021–9010.89.2.187 Pattie, M., & Parks, L. (2011). Adjustment, turnover, and performance: The deployment of minority expatriates. The International Journal of Human Resource Management, 22(10), 2262–2280. doi: 10.1080/09585192.2011.580195 Payne, S. C., & Webber, S. S. (2006). Effects of service provider attitudes and employment status on citizenship behaviors and customers’ attitudes and loyalty behavior. Journal of Applied Psychology, 91, 365–378. Penney, L. M., David, E., & Witt, L. A. (2011). A review of personality and performance: Identifying boundaries, contingencies, and future research directions. Human Resource Management Review, 21, 297– 310. doi: 10.1016/j.hrmr.2010.10.005 Piccolo, R. F., & Colquitt, J. A. (2006). Transformational leadership and job behaviors: The mediating role of core job characteristics. Academy of Management Journal, 49, 327–340. Ployhart, R. E., & Turner, S. F. (2014). Organizational adaptability. In D. Chan (Ed.), Individual adaptability to changes at work: New directions in research (pp. 73–91). New York, NY: Routledge/Taylor & Francis. Ployhart, R. E., & Bliese, P. D. (2006). Individual adaptability (I-ADAPT) theory: Conceptualizing the antecedents, consequences, and measurement of individual differences in adaptability. In S. Burke, L. Pierce, & E. Salas (Eds.), Understanding adaptability: A prerequisite for effective performance within complex environments (pp. 3–39). New York, NY: Elsevier. Ployhart, R. E., & Holtz, B. C. (2008). The diversity–validity dilemma: Strategies for reducing racioethnic and sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153–172. Podsakoff, P. 
M., Ahearne, M., & MacKenzie, S. B. (1997). Organizational citizenship behavior and the quantity and quality of work group performance. Journal of Applied Psychology, 82, 262–270. Podsakoff, P. M., & MacKenzie, S. B. (1994). Organizational citizenship behaviors and sales unit effectiveness. Journal of marketing research, 31(3), 351-363. Podsakoff, P. M., MacKenzie, S. B., Paine, J. B., & Bachrach, D. G. (2000). Organizational citizenship behaviors: A critical review of the theoretical and empirical literature and suggestions for future research. Journal of Management, 26, 513–563. Podsakoff, N. P., Maynes, T. D., Whiting, S. W., & Podsakoff, P. M. (2015). One (rating) from many (observations): Factors affecting the individual assessment of voice behavior in groups. Journal of Applied Psychology, 100, 1189–1202. Porath, C. L., & Erez, A. (2007). Does rudeness really matter? The effects of rudeness on task performance and helpfulness. Academy of Management Journal, 50, 1181–1197. Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000). Adaptability in the work place: Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85, 612–624. Pulakos, E. D., Dorsey, D. W., & White, S. S. (2006). Adaptability in the workplace: Selecting an adaptive workforce. In C. S. Burke, L. G. Pierce, & E. Salas (Eds.), Understanding adaptability: A prerequisite for effective performance within complex environments (Advances in Human Performance and Cognitive Engineering Research, Vol. 6, pp. 41–71). Oxford, UK: Elsevier.


David W. Dorsey et al. Pulakos, E. D., Schmitt, N., Dorsey, D. W., Hedge, J. W., & Borman, W. C. (2002). Predicting adaptive performance: Further tests of a model of adaptability. Human Performance, 15, 299–323. Rapp, A. A., Bachrach, D. G., & Rapp, T. L. (2013). The influence of time management skill on the curvilinear relationship between organizational citizenship behavior and task performance. Journal of Applied Psychology, 98, 668–677. Raybourn, E. M., Deagle, E., Mendini, K., & Heneghan, J. (2005). Adaptive thinking and leadership simulation game training for Special Forces officers (I/ITSEC 2005). Proceedings, Interservice/Industry Training, Simulation and Education Conference, November 28–December 1, Orlando, FL. Redding, J. (1997). Hardwiring the learning organization. Training & Development, 15, 61–67. Redman, T., & Snape, E. (2005). Exchange ideology and member-union relationships: An evaluation of moderation effects. Journal of Applied Psychology, 90, 765–773. Reis, H. T. (2008). Reinvigorating the concept of situation in social psychology. Personality and Social Psychology Review, 12(4), 311–329. Rich, B. L., Lepine, J. A., & Crawford, E. R. (2010). Job engagement: Antecedents and effects on job performance. Academy of Management Journal, 53, 617–635. Rioux, S. M., & Penner, L. A. (2001). The causes of organizational citizenship behavior: A motivational analysis. Journal of Applied Psychology, 86, 1306–1314. Rosen, C. C., Levy, P. E., & Hall, R. J. (2006). Placing perceptions of politics in the context of the feedback environment, employee attitudes, and job performance. Journal of Applied Psychology, 91, 211–220. Rotundo, M., & Sackett, P. R. (2002). The relative importance of task, citizenship, and counterproductive performance to global ratings of job performance: A policy-capturing approach. Journal of Applied Psychology, 87, 66–80. Rubin, R. S., Dierdorff, E. C., & Bachrach, D. G. (2013). Boundaries of citizenship behavior: Curvilinearity and context in the citizenship and task performance relationship. Personnel Psychology, 66, 377–406. Rupp, D. E., Shao, R., Thornton, M. A., & Skarlicki, D. P. (2013). Applicants’ and employees’ reactions to corporate social responsibility: The moderating effects of first-party justice perceptions and moral identity. Personnel Psychology, 66, 895–933. Schmidt, F. L., & Hunter, J. E., (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262–274. doi: 10.1037/0033–2909.124.2.262 Schmit, M. J., Motowidlo, S. J., Degroot, T., Cross, T., & Kiker, D. S. (April 1996). Explaining the relationship between personality and job performance. Paper presented at the annual meeting of the Society for Industrial and Organizational Psychology, San Diego, CA. Schmitt, N., & Chan, D. (2006). Situational judgment tests: Method or construct? In J. Weekley & R. E. Ployhart (Eds.), Situational judgment tests (pp. 135–156). Mahwah, NJ: Lawrence Erlbaum. Schmitt, N., & Chan, D. (2014). Adapting to rapid changes at work: Measures and research. In D. Chan (Ed.), Individual adaptability to changes at work. New directions in research, 3–17. New York: Routledge Schmitt, N., Cortina, J. M., Ingerick, M. J., & Wiechmann, D. (2003). Personnel selection and employee performance. Handbook of psychology: Industrial and organizational psychology (Vol. 12, pp. 77–105). New York, NY: John Wiley & Sons, Inc. Shoss, M. K., Witt, L. A., & Vera, D. 
(2012). When does adaptive performance lead to higher task performance? Journal of Organizational Behavior, 33(7), 910–924. doi: 10.1002/job.780 Sinclair, R. R., Tucker, J. S., Cullen, J. C., & Wright, C. (2005). Performance differences among four organizational commitment profiles. Journal of Applied Psychology, 90, 1280–1287. Smith, C. A., Organ, D. W., & Near, J. P. (1983). Organizational citizenship behavior: Its nature and antecedents. Journal of Applied Psychology, 68, 653–663. Sparrowe, R. T., Soetjipto, B. W., & Kraimer, M. L. (2006). Do leaders’ influence tactics that relate to members’ helping behavior? It depends on the quality of the relationship. Academy of Management Journal, 49, 1194–1208. Spence, J. R., Brown, D. J., Keeping, L. M., & Lian, H. (2014). Helpful today, but not tomorrow? Feeling grateful as a predictor of daily organizational citizenship behaviors. Personnel Psychology, 67, 705–738. Stajkovic, A. D., Lee, D., & Nyberg, A. J. (2009). Collective efficacy, group potency, and group performance: Meta-analyses of their relationships, and test of a mediation model. Journal of Applied Psychology, 94(3), 814–828. doi: 10.1037/a0015659 Stewart, G. L., & Nandkeolyar, A. (2006). Adaptability and intraindividual variation in sales outcomes: Exploring the interactive effects of personality and environmental opportunity. Personnel Psychology, 59, 307–332. Stokes, C. K., Schneider, T. R., & Lyons, J. B. (2010). Adaptive performance: A criterion problem. Team Performance Management, 16, 212–230. doi: 10.1108/13527591011053278


Adaptive and Citizenship-Related Behaviors Sun, L-Y., Aryee, S., & Law, K. S. (2007). High-performance human resource practices, citizenship behavior, and organizational performance: A relational perspective. Academy of Management Journal, 50, 558–577. Tett, R. P., & Burnett, D. D. (2003). A personality trait-based interactionist model of job performance. Journal of Applied Psychology, 88(3), 500–517. doi: 10.1037/0021–9010.88.3.500 Trougakos, J. P., Beal, D. J., Cheng, B. H., Hideg, I., & Zweig, D. (2015). Too drained to help: A resource depletion perspective on daily interpersonal citizenship behaviors. Journal of Applied Psychology, 100, 227–236. Vande Walle, D. (1997). Development and validation of a work domain goal orientation instrument. Educational and Psychological Measurement, 57(6), 995–1015. doi: 10.1177/0013164497057006009 Van Iddekinge, C. H., Ferris, G. R., & Heffner, T. S. (2009). Test of a multistage model of distal and proximal a­ntecedents of leader performance. Personnel Psychology, 62(3), 463–495. doi: 10.1111/ j.1744–6570.2009.01145.x Venkataramani, V., & Dalal, R. S. (2007). Who helps and harms whom? Relational antecedents of interpersonal helping and harming in organizations. Journal of Applied Psychology, 92, 952–966. Wanberg, C. R., & Banas, J. (2000). Predictors and outcomes of openness to changes in a reorganizing workplace. Journal of Applied Psychology, 85, 132–142. Wang, H., Law, K. S., Hackett, R. D., Wang, D., & Chen, Z. X. (2005). Leader-member exchange as a mediator of the relationship between transformational leadership and followers’ performance and organizational citizenship behavior. Academy of Management Journal, 48, 420–432. Werner, J. M. (1994). Dimensions that make a difference: Examining the impact of in-role and extra-role behaviors on supervisory ratings. Journal of Applied Psychology, 79, 98–107. White, S. S., Mueller-Hanson, R. A., Dorsey, D. W., Pulakos, E. D., Wisecarver, M. M., Deagle, E. A., & Mendini, K. G. (2005). Developing adaptive proficiency in Special Forces officers. (ARI Research Report 1831). Arlington, VA: U.S. Army Research Institute for the Behavioral and Social Sciences. Whiting, S. W., Podsakoff, P. M., & Pierce, J. R. (2008). Effects of task performance, helping, voice, and organizational loyalty on performance appraisal ratings. Journal of Applied Psychology, 93, 125–139. Williams, L. J., & Anderson, S. E. (1991). Job satisfaction and organizational commitment as predictors of organizational citizenship and in-role behaviors. Journal of Management, 17, 601–617. Woo, S. E., Chernyshenko, L. S., Stark, S. E., & Conz, G. (2014). Validity of six openness facets in predicting work behaviors: A meta-analysis. Journal of Personality Assessment, 96, 76–86. doi: 10.1080/00223891.2013.806329 Wood, R. E. (1986). Task complexity: Definition of the construct. Organizational Behavior and Human Decision Processes, 37(1), 60–82. doi: 10.1016/0749–5978(86)90044–0 Yang, J., Mossholder, K. W., & Peng, T. K. (2007). Procedural justice climate and group power distance: An examination of cross-level interaction effects. Journal of Applied Psychology, 92, 681–692. Yun, S., Takeuchi, R., & Liu, W. (2007). Employee self-enhancement motives and job performance behaviors: Investigating the moderating effects of employee role ambiguity and managerial perceptions of employee commitment. Journal of Applied Psychology, 92, 745–756. Zhu, J., Frese, M., & Li, W. D. (2014). Proactivity and adaptability. In D. 
Chan (Ed.), Individual adaptability to changes at work: New directions in research (pp. 36–51). New York, NY: Routledge. Ziegler, R., Schlett, C., Casel, K., & Diehl, M. (2012). The role of job satisfaction, job ambivalence, and emotions at work in predicting organizational citizenship behavior. Journal of Personnel Psychology, 11, 176–190.


22

NEW PERSPECTIVES ON COUNTERPRODUCTIVE WORK BEHAVIOR INCLUDING WITHDRAWAL

MARIA ROTUNDO AND PAUL E. SPECTOR

Counterproductive work behavior (CWB) subsumes a broad range of behaviors by employees that harm organizations and/or people in organizations. For this edition of the Handbook, we take a comprehensive look at the literature on CWB since 2007, when we completed the literature review for the chapter in the first edition. There has been an explosion of interest in the topic, and a previously disjointed field has matured and coalesced. In reviewing the literature on CWB and related topics, we began with a PsycInfo search for the term "counterproductive work behavior" and related terms, including "workplace deviance" and "workplace aggression." We found that more than two-thirds of the sources have been published since 2007, showing an accelerating interest in the topic. Clearly, CWB has become one of the most popular topics among organizational researchers. Our goal is to provide an overview of the post-2007 research literature on CWB and on withdrawal, most notably absence, lateness, and turnover (and turnover intentions). We will begin with a discussion of the nature and assessment of CWB. We will then discuss potential environmental and individual antecedents of CWB and potential consequences of engaging in CWB, including emerging issues in the use of social media as it relates to CWB and withdrawal. Finally, we will conclude with a discussion of implications for employee selection.

NATURE OF CWB

The term CWB arose from two contemporaneous perspectives. Sackett and DeVore (2001) considered CWB from the perspective of organizations, defining it as behavior that runs counter to the legitimate interests of organizations. It is an aspect of broadly construed job performance that can be divided into task performance, organizational citizenship behavior (OCB), and CWB (Rotundo & Sackett, 2002). Fox, Spector, and Miles (2001) took research on human aggression as their basis and defined CWB as behavior intended to harm organizations or organization members. The term deviance has its basis in sociology and criminology (Hollinger & Clark, 1982) and is defined as harmful behavior that violates organization norms and rules. It was introduced into the organizational literature by Robinson and Bennett (1995).

The list of CWB behaviors ranges from minor acts to serious and even criminal activities that can be directed at individuals or organizations. Included is unauthorized withdrawal, such as calling in sick when not ill or purposely and unnecessarily coming in late. In this chapter, we discuss withdrawal separately because much of it is not CWB: it is not intended to harm the organization, nor does it cause harm. Although CWB, deviance, and related terms have distinct conceptualizations, research on these topics has focused on an overlapping set of behaviors that are compiled into indices that are sometimes global and sometimes specific subcategories (Langton, Piquero, & Hollinger, 2006). Our PsycInfo literature review identified 151 papers on the phenomenon. Approximately 91% of them used the terms deviance (49 papers) or CWB (88 papers), often interchangeably. In fact, some authors referred to the phenomenon as CWB but then chose a measure of deviance to operationalize it. Some papers focused on specific forms of CWB, using terms like cyber incivility (Lim & Teo, 2009), cyber loafing (Kim, del Carmen Triana, Chung, & Oh, 2016), theft (Christian & Ellis, 2011), and time banditry (Martin, 2010).

Robinson and Bennett (1995), using the term deviance, classified a list of disparate behaviors along the dimensions of target (interpersonal versus organizational) and severity (minor versus serious). This resulted in four forms of deviance, one for each combination of target and severity. Two forms targeted the organization and two targeted people. The organizational forms reflected Hollinger and Clark's (1982) categories of Production Deviance (behaviors that affect organizational productivity, such as leaving early or intentionally working slowly) and Property Deviance (behaviors directed at property, such as sabotaging equipment and theft). The interpersonal forms were Political Deviance (behaviors that reflect office politics, such as showing favoritism and gossiping) and Personal Aggression (verbal abuse and stealing from coworkers). Subsequent research has ignored the severity dimension and has collapsed the four categories into the interpersonal (CWB-I) and the organizational (CWB-O) forms, although not all of the original behaviors are represented. A finer-grained five-category scheme by Spector et al. (2006) included Abuse (CWB-I), Production Deviance (behaviors that harm productivity, other than withdrawal), Sabotage (destruction of property), Theft (stealing material objects), and Withdrawal (working fewer hours than required). Note that the content of Robinson and Bennett's production deviance is broader than Spector et al.'s because it also includes withdrawal. An even finer-grained 11-dimension classification was provided by Sackett and DeVore (2001) and Gruys and Sackett (2003). This scheme is more inclusive than earlier classifications, expanding some categories and including additional behaviors. The categories are Theft, Destruction of Property, Misuse of Information, Misuse of Time and Resources, Unsafe Behavior, Poor Attendance, Poor Quality Work, Alcohol Use, Drug Use, Inappropriate Verbal Actions, and Inappropriate Physical Actions.

As part of our review, we compiled how CWB was operationalized in empirical studies. Assessing CWB-I and CWB-O in the same study was the most frequently used approach (49 studies), followed by assessing CWB with a single global score (28 studies). A smaller number of studies assessed either CWB-I (12 studies) or CWB-O (12 studies) but not both. Five studies used the Spector et al. (2006) five-category breakdown.
Only a couple of studies assessed the 11 Sackett and DeVore (2001) or Gruys and Sackett (2003) categories in the same study. This is unfortunate as there are unique and understudied behaviors in this broader CWB scheme.
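To show how the classification schemes above relate to one another in practice, the sketch below encodes a few example behaviors in a simple data structure that maps each Spector et al. (2006) category onto the broader CWB-I/CWB-O split. The example items and the mapping details are illustrative simplifications, not the authors' official coding scheme.

# Illustrative mapping of a few CWB behaviors onto Spector et al.'s (2006)
# five categories and the broader interpersonal/organizational (CWB-I/CWB-O) split.
CWB_TAXONOMY = {
    "Abuse":               {"target": "CWB-I", "examples": ["verbal abuse of a coworker", "spreading rumors"]},
    "Production Deviance": {"target": "CWB-O", "examples": ["intentionally working slowly", "doing work incorrectly on purpose"]},
    "Sabotage":            {"target": "CWB-O", "examples": ["damaging equipment"]},
    "Theft":               {"target": "CWB-O", "examples": ["stealing supplies"]},
    "Withdrawal":          {"target": "CWB-O", "examples": ["leaving early without permission", "calling in sick when not ill"]},
}

def classify(behavior):
    """Return (category, CWB-I/CWB-O target) for a behavior listed in the taxonomy."""
    for category, info in CWB_TAXONOMY.items():
        if behavior in info["examples"]:
            return category, info["target"]
    return None, None

print(classify("calling in sick when not ill"))   # ('Withdrawal', 'CWB-O')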

ASSESSMENT OF CWB

Most studies of CWB use behavior checklists in which people indicate how often individuals engage in each behavior. Individuals report on their own behavior or, in some cases, on others' behavior. Studies using nonsurvey sources of data on CWB, such as archival data or objective measures, are rare. Studies of individual withdrawal behaviors, particularly absence and turnover, generally use records. For example, in their meta-analysis linking absence, lateness, and turnover, Berry, Lelchook, and Clark (2012) found that out of 38 studies, only 5 for absence and 6 for lateness used self-reports of the behavior. Furthermore, counter to concerns about correlation inflation due to common method variance (Podsakoff, MacKenzie, Lee, & Podsakoff, 2003), the self-report measures yielded smaller, not larger, correlations among the three forms of withdrawal. At least for absence, although self-reports yielded lower mean levels, their convergence with records was quite high, suggesting they can be a reasonable measure when the purpose is determining relationships with other variables (Johns & Miraglia, 2015). For more details on how CWB and withdrawal behaviors are assessed, see Rotundo and Spector (2010).

Our review found that the most popular CWB measure (used in 55 studies) is the Bennett and Robinson (2000) workplace deviance scale. This 19-item measure has subscales to assess both interpersonal and organizational deviance. The next most frequently used (22 studies) was the Counterproductive Work Behavior Checklist (CWB-C; Spector et al., 2006). This scale provides scores for overall CWB (45 items), CWB-I and CWB-O (44 items), or the five dimensions noted earlier (33 items). The remaining papers used a wide variety of measures, some of which were ad hoc and some of which targeted specific forms of CWB, such as cyber incivility or theft. There is a great deal of overlap in the content of these measures, so it is not surprising that, to date, results have been consistent across specific CWB instruments.

Studies vary in the extent to which their focus is on global CWB or on understanding more specific subcategories. Such distinctions are important because correlations with criteria can vary across subscales (Spector et al., 2006). Thus, one must be cautious in assuming that if a global measure of CWB is related to another variable, all of the behaviors that constitute that global measure are also related. That said, the various dimensions of CWB tend to be intercorrelated. For example, in their meta-analysis, Marcus, Taylor, Hastings, Sturm, and Weigelt (2016) showed that the mean correlations among the five Spector et al. (2006) subscales ranged from .50 to .60. Furthermore, they provided evidence that global CWB can best be considered a higher-order factor that can be broken into more specific lower-order factors. In what follows, we will focus our attention primarily on higher-order global CWB findings.

Considering the widespread use of self-reports to assess CWB, a reasonable concern is the extent to which such reports can be trusted. Berry, Carpenter, and Barratt (2012) addressed this question using meta-analysis to compare self-reports with other-reports. They found that there was convergence (significant correlations) of self-reported with other-source CWB, with larger correlations for CWB-I than for CWB-O. This makes sense because, by its nature, CWB-I is more public: it is difficult to mistreat another person without that person's knowledge. The raters likely would have witnessed, or even been the target of, the interpersonal CWB being rated. CWB-O, on the other hand, can be more hidden. If an employee steals from the company, it is unlikely that coworkers or the supervisor have knowledge of the behavior. Furthermore, Berry et al. found a similar pattern of relationships between CWB and other variables across report sources. Although many would assume that self-reports would yield higher correlations with other variables due to common method variance, there was no consistent pattern of self-reports having higher correlations.
For example, the correlation of interactional justice with CWB was stronger for other-reports (r = −.45) than for self-reports (r = −.30). Berry et al. concluded, based on several types of evidence, that self-reports are a reasonable approach to the assessment of CWB.
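To make the comparison of such correlations concrete, the sketch below contrasts two correlation estimates with a Fisher r-to-z test. The correlations (−.45 and −.30) come from the example above, but the sample sizes are hypothetical placeholders and the assumption of independent samples is a simplification, so the result is illustrative only.

import math

def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher z transforms
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of the difference
    z = (z1 - z2) / se                               # test statistic (approx. standard normal)
    p = math.erfc(abs(z) / math.sqrt(2))             # two-tailed p-value
    return z, p

# Other-report vs. self-report correlations of interactional justice with CWB;
# the sample sizes below are hypothetical, not taken from the meta-analysis.
z, p = compare_independent_correlations(r1=-0.45, n1=800, r2=-0.30, n2=800)
print(f"z = {z:.2f}, two-tailed p = {p:.4f}")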

POTENTIAL ANTECEDENTS OF CWB

The study of CWB has been guided by a basic environment–person–outcome framework in which environmental conditions are perceived, perceptions lead to feelings (emotional/attitudinal reactions), and feelings lead to behavior. Figure 22.1, based on Spector and Fox (2005), provides a framework for considering potential antecedents of CWB. It suggests that environmental conditions (e.g., high workloads) lead to negative feelings (e.g., burnout or job dissatisfaction), which in turn lead to CWB. Individual differences likely play two roles: as a direct antecedent of feelings and CWB, and as a moderator of environmental effects. The literature has focused more on the direct effects on feelings and CWB, with less attention given to moderating effects.


[Figure 22.1 shows boxes labeled Environment, Attitudes & Emotion, Individual Differences, and CWB/Withdrawal.]

FIGURE 22.1 General Framework Depicting Various Antecedents of CWB and Withdrawal
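As a rough illustration of how this framework translates into an analysis, the sketch below simulates data under the environment-to-feelings-to-CWB chain and estimates the two regressions implied by a simple mediation logic. All variable names and effect sizes are hypothetical and chosen only to mirror the structure in Figure 22.1; this is not an analysis reported in the chapter.

import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical variables mirroring Figure 22.1
workload = rng.normal(size=n)                                          # environmental condition
trait_na = rng.normal(size=n)                                          # individual difference (negative affectivity)
neg_feelings = 0.5 * workload + 0.4 * trait_na + rng.normal(size=n)    # attitudes/emotion
cwb = 0.6 * neg_feelings + 0.2 * trait_na + rng.normal(size=n)         # CWB/withdrawal

def ols(y, *predictors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones_like(y)] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Step 1: environment and trait -> feelings
print("feelings ~ workload + trait_na:", ols(neg_feelings, workload, trait_na).round(2))
# Step 2: feelings -> CWB, controlling for the environment and the trait
print("cwb ~ feelings + workload + trait_na:", ols(cwb, neg_feelings, workload, trait_na).round(2))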

INDIVIDUAL DIFFERENCES

Individual difference characteristics can play an important role in the selection process, and their utility extends to the prediction of CWB and withdrawal. Thus, scholars have had an ongoing interest in identifying characteristics that help screen out CWB-prone individuals, resulting in a sizeable body of research. Traditionally, this research was dominated by integrity tests and the Five-Factor Model (FFM), which showed support for integrity tests and some of the FFM factors. More recent research has focused on a deeper understanding of the interrelationships among what appear to be the more dominant of these factors in the prediction of CWB and withdrawal. Prior research also considered the role of cognitive ability and of demographic or background characteristics such as gender, age, education, and organizational tenure. Given the mixed findings reported in that research, attention shifted to summarizing the effect sizes via quantitative reviews. This research and some of the related meta-analytic work are summarized below and in Tables 22.1 and 22.2.

Integrity Tests

At the time of the first edition, we reported mean observed coefficients for CWB that ranged from 0.22 to 0.39 for personality-based and overt integrity tests, respectively (Ones et al., 1993) (see Table 22.1). The former tests assess personality traits that explain variance in CWB, whereas the latter assess attitudes toward CWB. Mean observed correlations of 0.06 (overt integrity tests) and 0.25 (personality-based tests) were reported for a lack of absence (Ones et al., 2003) (see Table 22.2). More recently, Van Iddekinge, Roth, Raymark, and Odle-Dusseau (2012a) conducted a meta-analysis on a smaller subset of studies and reported sample-size-weighted correlations with CWB of 0.23 and 0.30 for personality-based and overt integrity tests, respectively. They also reported sample-size-weighted correlations with turnover of 0.07 and 0.06 for personality-based and overt integrity tests, respectively (Van Iddekinge et al., 2012a). This meta-analysis generated important dialogue reminding us to exert caution when comparing meta-analytic effect sizes across different studies (Harris et al., 2012; Ones, Viswesvaran, & Schmidt, 2012; Sackett & Schmitt, 2012; Van Iddekinge, Roth, Raymark, & Odle-Dusseau, 2012b). Several reviews over the years support the validity of integrity tests for CWB and, to a lesser extent, withdrawal.
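The sample-size-weighted correlations cited above are simple weighted averages of study-level correlations. A minimal sketch of that computation follows; the study correlations and sample sizes in it are made-up placeholders, not values from Van Iddekinge et al. (2012a).

def sample_size_weighted_r(correlations, sample_sizes):
    """Bare-bones meta-analytic estimate: mean correlation weighted by sample size."""
    weighted_sum = sum(n * r for r, n in zip(correlations, sample_sizes))
    return weighted_sum / sum(sample_sizes)

# Hypothetical primary studies of an overt integrity test predicting CWB
rs = [0.35, 0.28, 0.22, 0.31]
ns = [150, 420, 90, 260]
print(round(sample_size_weighted_r(rs, ns), 3))   # 0.294 with these placeholder inputs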

TABLE 22.1  Summary of Effect Sizes Reported in Meta-Analyses: CWB Correlates

[The table lists, for each variable, mean correlations with overall CWB, CWB-I, and CWB-O, the meta-analytic source, and a comment on how CWB was measured. The categories of correlates covered include: INTEGRITY TESTS (overt and personality-based tests; Ones, Viswesvaran, & Schmidt, 1993; Van Iddekinge, Roth, Raymark, & Odle-Dusseau, 2012a); PERSONALITY (conscientiousness, agreeableness, emotional stability, extraversion, and openness; Berry, Carpenter, & Barratt, 2012; Berry, Ones, & Sackett, 2007; Dalal, 2005; Salgado, 2002; as well as Machiavellianism, narcissism, psychopathy, and conditional reasoning tests of aggression; Berry, Sackett, & Tobares, 2010; James et al., 2005; O'Boyle, Forsyth, Banks, & McDaniel, 2012); COGNITIVE ABILITY (general mental ability; Gonzalez-Mulé, Mount, & Oh, 2014); DEMOGRAPHIC AND BACKGROUND (gender, age, education, and organizational tenure; Berry, Carpenter, & Barratt, 2012; Berry, Ones, & Sackett, 2007; Hershcovis et al., 2007; Ng & Feldman, 2008, 2009a, 2010); ATTITUDES (job satisfaction and organizational commitment; Berry, Carpenter, & Barratt, 2012; Dalal, 2005; Hershcovis et al., 2007); EMOTIONS (trait and state negative and positive affect and discrete emotions such as anger, hostility, envy, frustration, anxiety, sadness, guilt/shame, joy, pride, contentment, affection, and attentiveness; Dalal, 2005; Hershcovis et al., 2007; Kaplan, Bradley, Luchman, & Haynes, 2009; Shockley, Ispas, Rossi, & Levine, 2012); STRESSORS (organizational constraints and interpersonal conflict; Berry, Carpenter, & Barratt, 2012; Hershcovis et al., 2007); JUSTICE (distributive, procedural, interactional, interpersonal, informational, and organizational justice; Berry, Carpenter, & Barratt, 2012; Berry, Ones, & Sackett, 2007; Cohen-Charash & Spector, 2001; Colquitt et al., 2013; Dalal, 2005; Hershcovis et al., 2007); RELATIONSHIP WITH SUPERVISOR (supervisor aggression; Hershcovis & Barling, 2010); and CONSEQUENCES (co-worker and outsider aggression; Hershcovis & Barling, 2010).]

Note. All coefficients reported in the table are mean sample-size weighted correlations unless otherwise indicated in the Comment column.

TABLE 22.2  Summary of Effect Sizes Reported in Meta-Analyses: Withdrawal Correlates

[The table lists, for each variable, mean correlations with turnover (a), tardiness (b), intent to turnover (c), withdrawal (d), and absence, along with the meta-analytic source and a comment on how the outcome was measured. The categories of correlates covered include: INTEGRITY TESTS (overt, personality-based, and all tests combined; Ones, Viswesvaran, & Schmidt, 2003; Van Iddekinge, Roth, Raymark, & Odle-Dusseau, 2012a); PERSONALITY (conscientiousness, agreeableness, emotional stability, openness, and extraversion; Salgado, 2002; Zimmerman, 2008); job performance and prior performance (Swider & Zimmerman, 2014; Zimmerman & Darnold, 2009); DEMOGRAPHIC AND BACKGROUND (age, education, and organizational tenure; Ng & Feldman, 2008, 2009a, 2010); EMOTIONS (negative and positive affect; Kaplan, Bradley, Luchman, & Haynes, 2009; Zimmerman, 2008); ORGANIZATIONAL COMMITMENT (affective, normative, and continuance commitment; Meyer, Stanley, Herscovitch, & Topolnytsky, 2002); STRESSORS (challenge and hindrance stressors and workload; Bowling, Alarcon, Bragg, & Hartman, 2015; Podsakoff, LePine, & LePine, 2007); JUSTICE (distributive and procedural justice; Colquitt, Conlon, Wesson, Porter, & Ng, 2001); PSYCHOLOGICAL CONTRACT BREACH (overall, transactional, and relational breach; Zhao, Wayne, Glibkowski, & Bravo, 2007); RELATIONSHIP WITH SUPERVISOR (leader-member exchange and supervisor aggression; Banks, Batchelor, Seers, O'Boyle, Pollack, & Gower, 2014; Hershcovis & Barling, 2010); HRM PRACTICES (telecommuting and organizational wellness programs; Gajendran & Harrison, 2007; Parks & Steelman, 2008); PERSON AND ENVIRONMENT (person-organization, person-job, person-group, and person-supervisor fit, and on-the-job and off-the-job embeddedness; Jiang, Liu, McKay, Lee, & Mitchell, 2012; Kristof-Brown, Zimmerman, & Johnson, 2005; Oh, Guay, Kim, Harold, Lee, Heo, & Shin, 2014); and CONSEQUENCES (co-worker and outsider aggression and bullying; Hershcovis & Barling, 2010; Nielsen & Einarsen, 2012).]

Note. All coefficients reported in the table are mean sample-size weighted correlations unless otherwise indicated in the Comment column. a Turnover; b tardiness; c intent to turnover; d withdrawal.

Personality: Five-Factor Model Earlier research on the role of personality in predicting CWB and withdrawal focused for the most part on the Five-Factor Model (FFM) and the independent effect of each of the five traits (i.e., conscientiousness, agreeableness, emotional stability, openness, and extraversion), resulting in five separate meta-analyses (Berry, Carpenter, & Barratt, 2012; Berry, Ones, & Sackett, 2007; Dalal, 2005; Salgado, 2002; Zimmerman, 2008). Most of these reviews were based on self-reports (Berry, et al., 2007) or combined self- and other-reports of CWB or withdrawal (Dalal, 2005; Salgado, 2002; Zimmerman, 2008). As noted earlier in this chapter, the meta-analysis conducted by Berry et al. (2012) focused on reports of CWB that were provided by supervisors or coworkers, noting some of the disadvantages associated with self-reports (e.g., Barclay & Aquino, 2011; Fox, Spector, Goh, & Bruursema, 2007; Stewart, Bing, Davison, Woehr, & McIntyre, 2009). Tables 22.1 and 22.2 summarize the coefficients reported in these meta-analyses. Based on these reviews, conscientiousness, agreeableness, and emotional stability are more consistently related to CWB and turnover. Recently, studies have moved away from studying whether the FFM and CWB are related and more toward improvements in the measurement of these traits or toward a more nuanced understanding of the three dominant traits. For example, one study reported that the negative relationship between conscientiousness and CWB holds when personality is measured in adolescence and CWB in adulthood (Le, Donnellan, Spilman, Garcia, & Conger, 2014). Other research showed that personality assessed by acquaintances (i.e., conscientiousness, agreeableness, and emotional stability) had incremental validity in predicting CWB over self-ratings (Kluemper, McLarty, & Bing, 2015). Similar findings have been reported for the predictive validity of personality ratings provided by others when the criterion was job performance or academic achievement (Connelly & Ones, 2010). Thus, future studies of the personality–CWB relationship may consider employing other-ratings of the target's personality instead of self-ratings. However, who the other-rater is matters. Not only is the frequency of the interaction between the target and the other-rater important for improving personality rating accuracy, but interpersonal intimacy between the two also matters (Connelly & Ones, 2010). Scholars have studied the interaction among the FFM traits in predicting CWB since these interrelationships can provide further insight into the behavioral manifestation of personality. As an example, two individuals who are both low in emotional stability (i.e., high in neuroticism) may demonstrate different levels of CWB due to their standing on one or more of the other traits. Indeed, a study reported that conscientiousness moderated the positive relationship between CWB and neuroticism (Bowling, Burns, Stewart, & Gruys, 2011). Conscientiousness played a greater role in tempering CWB at high levels of neuroticism (low emotional stability) compared to low levels of neuroticism. In contrast, another study examined emotional stability as a moderator of the conscientiousness–CWB relationship (Penney, Hunter, & Perry, 2011). The level of emotional stability mattered more when conscientiousness was high than when it was low.
Individuals were less inclined to engage in CWB when they were high in both conscientiousness and emotional stability compared to when they were high in conscientiousness and low in emotional stability. Emotional stability demonstrated a similar effect in the moderation of the agreeableness–CWB relationship (Penney, et al., 2011). Future researchers may seek to clarify the pattern of the interactions among these three personality traits, as there seems to be potential to further explicate the FFM-CWB relationship. Research has also sought to better understand the FFM-CWB relationship by studying the explanatory power of the narrow facets underlying each of the five traits. Hastings and O'Neill (2009) reported that the facets of Excitement Seeking (Extraversion), Cooperation (Agreeableness), Dutifulness (Conscientiousness), Anger (Neuroticism), and Emotionality (Openness) had the largest relationships with CWB. In fact, when CWB was regressed on these five narrow facets, the r-square was 0.33 compared to 0.35 when CWB was regressed on the broad five traits, suggesting the potential importance of facet-level relationships. Together, these individual studies lend support to the findings from the quantitative reviews showing that conscientiousness, agreeableness, and emotional stability are important traits in the prediction of CWB. However, the findings also suggest that our understanding of the FFM-CWB relationship can be improved by using other-ratings to assess personality and by looking deeper into the relationship among the three dominant traits and their respective facets.
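Moderation findings of the kind just described are typically tested with a regression model that adds the product of the two (usually mean-centered) traits to their main effects. For example, with emotional stability (ES) treated as a moderator of the conscientiousness (C)-CWB relationship, the model takes the form

CWB = b_0 + b_1 C + b_2 ES + b_3 (C \times ES) + e,

where a significant b_3 indicates that the slope relating conscientiousness to CWB depends on the level of emotional stability; simple slopes are then probed at high and low values of the moderator.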

Individual Personality Traits Research that studies personality traits other than the FFM or integrity has considered the role of the dark triad in predicting CWB (O'Boyle, Forsyth, Banks, & McDaniel, 2012; Wu & LeBreton, 2011). Typically, these traits include Machiavellianism, narcissism, and psychopathy. The rationale for focusing on these traits is that they are considered maladaptive and thus should be relevant for explaining unique variance in the CWB criterion above the FFM. Although these traits have been shown to be related to some of the FFM traits, the findings are not always consistent, and there appears to be variance in these aberrant traits that is unrelated to the FFM (Wu & LeBreton, 2011). As for their role in explaining CWB, the research findings have also been mixed (e.g., Dahling, Whitaker, & Levy, 2009; Judge, LePine, & Rich, 2006; Penney & Spector, 2002). However, a recent meta-analysis by O'Boyle et al. (2012) reported positive and significant relationships with CWB for narcissism and Machiavellianism, whereas the relationship for psychopathy was not significant (see Table 22.1). Future research may want to tease out the unique value of these aberrant personality traits in explaining CWB above the FFM, which may necessitate facet-level analyses of all of the FFM and dark triad traits. Research has also considered honesty–humility, risk taking, and implicit measures of personality in relation to CWB. For example, O'Neill and Hastings (2011) found that risk taking and integrity explained incremental variance in CWB over conscientiousness, agreeableness, and extraversion; Oh, Lee, Ashton, and de Vries (2011) reported that honesty–humility and extraversion interacted to explain deviance (i.e., high extraversion was related to more deviance than low extraversion for individuals who were low in honesty–humility). Lastly, conditional reasoning tests of aggression (implicit tests that assess personality indirectly) have been studied as alternatives for predicting CWB among job applicants. Two meta-analyses reported positive and significant relationships between scores on conditional reasoning tests and CWB (see Table 22.1; Berry, Sackett, & Tobares, 2010; James et al., 2005).
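Claims of incremental prediction in this literature rest on the familiar hierarchical regression quantity

\Delta R^2 = R^2_{\text{FFM plus aberrant traits}} - R^2_{\text{FFM only}},

that is, the gain in CWB variance explained when the dark triad (or risk taking, honesty–humility, and similar traits) is added to a model that already contains the FFM predictors; the facet-level analyses suggested above would apply the same logic with facets entered alongside, or in place of, the broad traits.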

Cognitive Ability and Education Cognitive ability is said to be the best predictor of job performance, not to mention important for success in many work and life outcomes (e.g., Gottfredson, 1997). Although the evidence is vast for many criterion outcomes of interest, less research exists for the predictive role of cognitive ability in explaining CWB, and the research that does exist is mixed (e.g., Dilchert, Ones, Davis, & Rostow, 2007; Marcus & Schuler, 2004). It has been argued that individuals who are high in cognitive ability have a better capacity to reason and anticipate the consequences of their actions, which should thus inhibit negative work behaviors such as CWB (Gonzalez-Mulé, Mount, & Oh, 2014). However, a recent meta-analysis reported nonsignificant negative relationships between cognitive ability and both overall CWB and CWB-I, and a significant negative relationship with CWB-O (see Table 22.1; Gonzalez-Mulé et al., 2014). Educational attainment has been associated with more favorable work outcomes (e.g., Elman & O'Rand, 2004; Torche, 2011). Education level can also serve as a proxy for knowledge, skills, cognitive ability, and some personality traits such as achievement orientation (e.g., Berry, Gruys, & Sackett, 2006; Poropat, 2009). These factors explain in part why higher educational attainment is expected to yield better job performance and why it is used in selection systems to screen potential employees. It is less clear whether the beneficial outcomes of educational attainment extend to domains of job performance beyond task performance. Some scholars have reasoned that the education system imparts upon students values such as discipline, respect, honesty, and concern for others, which would suggest less engagement in CWB and fewer withdrawal behaviors such as absence, lateness, and involuntary turnover (Ng & Feldman, 2009a). However, a meta-analysis reported that education level was unrelated to CWB and tardiness but negatively related to absence (see Tables 22.1 and 22.2; Ng & Feldman, 2009a). Thus, as with cognitive ability, educational attainment does not seem to play a strong or consistent role in explaining CWB.

Demographic and Background Variables Gender Research has reported that young males have a greater propensity toward serious forms of crime (e.g., Steffensmeier & Allan, 1996). However, the gender gap appears to be narrowing, and the degree of the difference depends on which types of data are examined (e.g., arrest records, self-report, victimization data; Kruttschnitt, 2013). Meta-analyses of gender differences in aggression would seem to support an aggressive male stereotype (Card, Stucky, Sawalani, & Little, 2008), although much of the gender literature is with children, and the differences between genders are often quite small. Results of meta-analyses with CWB are somewhat mixed, with men having somewhat higher levels, especially for CWB-I, but comparisons are not always statistically significant (see Table 22.1; Berry, Ones, & Sackett, 2007; Hershcovis, et al., 2007; Hershcovis & Barling, 2010). Berry, Carpenter, and Barratt (2012) compared self-reports of CWB with other-reports, finding larger gender differences with self-reports. Although it is possible that males engage in more CWB than do females, an alternative explanation for observed gender relationships is reporting bias differences between men and women. As discussed by Spector and Zhou (2014), aggressive behavior is more socially acceptable for men, as it is more consistent with male than female gender norms. Given that aggressive behavior is acceptable, and in some circles desirable, for men, while frowned upon for women, it is likely that males have less reluctance than females to report their CWB. For this reason, it seems plausible that women underreport their CWB so their behavior appears more socially acceptable. The fact that self-reported CWB yields larger gender differences than other-reported CWB supports this possibility. In addition to mean differences in CWB, two studies have shown that gender acts as a moderator of relationships of CWB with other variables. Both studies showed gender-moderating effects, whereby men are more reactive than women to external (stressors) and internal (personality and attitude) conditions associated with CWB (Bowling & Burns, 2015; Spector & Zhou, 2014). Both papers found that gender moderated the relationship of stressors to CWB, with gender differences only at high levels of stressors. Thus, men are no more likely to engage in, or at least report engaging in, CWB under relaxed conditions. They only reported higher levels of CWB when stressors were reported as high. Bowling and Burns found a similar pattern with job attitudes, for example, only finding gender differences in CWB among dissatisfied employees. Spector and Zhou found a similar pattern with personality traits that relate to CWB, for example, agreeableness and trait anger. Agreeable men and women and those low in trait anger showed little difference in their CWB reports. However, men who were disagreeable or high in trait anger reported higher levels of CWB than did their female counterparts.

Age An aging workforce and the corresponding stereotypes associated with older workers have prompted research on the individual difference characteristics, attitudes, and work outcomes associated with older employees (e.g., Hedge, Borman, & Lammlein, 2006). Some outcomes that have long been studied are the productivity and job performance of older workers or age differences in various forms of withdrawal behavior such as absenteeism or turnover, with research findings being mixed as to whether or not age differences exist (e.g., Ng & Feldman, 2009c). Other than studies comparing rates of crime or other forms of societal deviance among different age groups (e.g., Sampson & Laub, 1992; Steffensmeier, Allan, Harer, & Streifel, 1989), there has been little research in which the primary focus was on age differences in CWB. However, age is a demographic sample characteristic, so many studies report an age-CWB (or age-withdrawal) correlation coefficient, which can then be analyzed in quantitative reviews or meta-analyses. These meta-analyses reported a small but significant negative relationship between age and overall CWB, and a significant negative relationship with tardiness and absence, suggesting that older employees may engage in less CWB and withdrawal (see Tables 22.1 and 22.2; Berry, Carpenter, & Barratt, 2012; Berry, Ones, & Sackett, 2007; Ng & Feldman, 2008). The review by Ng and Feldman re-analyzed the age-CWB relationship within three separate age categories (i.e., less than 25 years old, 25–39 years old, and 40 years and older). They reported larger negative coefficients in the latter two age categories (−0.12 and −0.17, respectively) compared to the first category, which included younger workers, in which the relationship was close to zero.

Organizational Tenure Researchers have considered whether organizational tenure is related to CWB and withdrawal. The rationale is that the longer the tenure of an employee, the more likely he or she is to become embedded in the job and the organization, and that both parties (i.e., organization and employee) have deemed the relationship to be a good fit and worth maintaining, resulting in higher job performance and lower negative behavior (Jiang, Liu, McKay, Lee, & Mitchell, 2012; Ng & Feldman, 2010). Poor performers or those with whom there is a mismatch of organizational values or a poor fit have been managed out or have voluntarily exited (Jiang, et al., 2012; Ng & Feldman, 2010). Indeed, meta-analyses have shown that employee job performance has a negative and significant relationship with turnover intention, absenteeism, and withdrawal (see Table 22.2; Swider & Zimmerman, 2014; Zimmerman & Darnold, 2009). However, organizational tenure typically correlates positively with age, which raises a concern that any relationship that may arise is confounded by age. Indeed, meta-analyses reported that organizational tenure was significantly and negatively correlated with CWB when the latter was assessed by the supervisor or obtained from organizational records (see Table 22.1; Berry, Carpenter, & Barratt, 2012; Berry, Ones, & Sackett, 2007; Ng & Feldman, 2010). Subsequent moderator analyses showed that after controlling for age, the size of the negative organizational tenure–CWB coefficient decreased to nonsignificance when CWB was obtained from organizational records but remained negative and significant for self-ratings of CWB (Ng & Feldman, 2010). Although organizational tenure was also significantly and negatively related to objective measures of absence (see Table 22.2; Berry, Carpenter, & Barratt, 2012; Berry, Ones, & Sackett, 2007; Ng & Feldman, 2010), the negative relationship became positive and was significant in the analyses that controlled for age (Ng & Feldman, 2010). Thus, although there is some support for a negative relationship with CWB and a positive relationship with absence, the relationships are small.
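Although the exact procedures differ across these reviews, the logic of controlling for age can be illustrated with the first-order partial correlation,

r_{TC \cdot A} = \frac{r_{TC} - r_{TA} r_{CA}}{\sqrt{(1 - r_{TA}^2)(1 - r_{CA}^2)}},

where T denotes organizational tenure, C denotes CWB (or absence), and A denotes age; the relationship that remains after the linear association of age with both variables is removed is what the age-controlled analyses described above attempt to estimate.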

Other Characteristics There has been some research on the role of national culture as it relates to withdrawal and CWB. Addae, Johns, and Boies (2013) reported differences among countries as to how acceptable absence is perceived to be and on the extent to which employees should be held accountable for their absenteeism. Rotundo and Xie (2008) reported some differences between Canadian and Chinese managers on the extent to which they valued task performance and CWB but less difference in which behaviors constituted CWB. These findings suggest that attention should be placed on aligning expectations and norms surrounding appropriate behaviors, especially in diverse workplaces.

ATTITUDES AND EMOTIONS Typically, the job attitudes that are most studied in organizational behavior are job satisfaction and organizational commitment. Research, including four meta-analyses, has reported that job satisfaction and organizational commitment have a negative and significant relationship with CWB and that organizational commitment has a negative and significant relationship with some forms of withdrawal (see Tables 22.1 and 22.2; Berry, Carpenter, & Barratt, 2012; Dalal, 2005; Hershcovis, et al., 2007; Meyer, Stanley, Herscovitch, & Topolnytsky, 2002). Thus, organizational attitudes continue to be important for explaining CWB and withdrawal. Research has also sought to understand the relative contribution of individual differences compared to organizational attitudes (O'Brien & Allen, 2008). Emotions have been widely studied since it is believed that they mediate the relationship between several antecedent variables and acts of CWB and withdrawal (e.g., Spector & Fox, 2005). The findings from six separate meta-analyses are summarized in Tables 22.1 and 22.2, which show significant positive relationships between various negative emotions and CWB or withdrawal and significant negative relationships between positive emotions and CWB or withdrawal (Berry, Carpenter, & Barratt, 2012; Dalal, 2005; Hershcovis et al., 2007; Kaplan, Bradley, Luchman, & Haynes, 2009; Shockley, Ispas, Rossi, & Levine, 2012; Zimmerman, 2008). Research has typically focused on trait or dispositional affect. However, Shockley et al. (2012) sought to differentiate between dispositional affect and state affect in a meta-analysis that also considered discrete emotions (see Table 22.1). They reported stronger relationships between negative affect or negative discrete emotions and CWB compared to positive affect or positive discrete emotions and CWB. Of the negative discrete emotions, anger (trait), hostility (trait), envy (trait and state), and frustration had the largest relationships, whereas attentiveness (trait and state) had the largest relationship among the positive emotions. Other research has considered emotion regulation strategies (Kluemper, DeGroot, & Choi, 2013; Scott & Barnes, 2011) or the interplay between personality and emotion (Côté, DeCelles, McCarthy, Van Kleef, & Hideg, 2011). For example, research has shown that the ability to manage emotions has a negative relationship with CWB even after controlling for general mental ability and personality (Kluemper et al., 2013). Other researchers reported a positive relationship between surface acting and withdrawal and a negative relationship between deep acting and withdrawal (Scott & Barnes, 2011).

ENVIRONMENT Selection doesn’t occur in a vacuum, so it is important to understand the context in which CWB occurs. Environmental conditions that might serve as antecedents to CWB are those where selection is likely to have the largest impact. Jobs or work conditions that are particularly stressful, for example, might benefit most by selecting individuals who are least likely to respond to precipitating conditions with CWB. We review those environmental conditions that have been most linked to CWB.

Stressors The negative health and work outcomes associated with stress have been established (e.g., Crawford, LePine, & Rich, 2010; Ganster & Rosen, 2013; Gilboa, Shirom, Fried, & Cooper, 2008). Tables 22.1 and 22.2 summarize the results of meta-analyses that reported on the stressor-CWB/withdrawal relationship. The literature often distinguishes between challenge and hindrance stressors, as the outcomes associated with these stressors are not always the same. Challenge stressors include those work demands that test individuals' capabilities and that provide them with the opportunity to advance their skills and knowledge (Cavanaugh, Boswell, Roehling, & Boudreau, 2000; Podsakoff, LePine, & LePine, 2007). Examples of these demands are high workload, time pressure, a broad job scope, or more responsibility at work. Meta-analyses have shown them to be positively linked to employee engagement, burnout, and turnover intention (Crawford, et al., 2010), yet unrelated to turnover and withdrawal behavior (see Table 22.2; Podsakoff, LePine, & LePine, 2007). However, a meta-analysis solely on workload, which is one of the most studied job demands, reported that it is associated with poorer psychological and physical well-being, lower job attitudes, and higher turnover intentions and absence (see Table 22.2; Bowling, Alarcon, Bragg, & Hartman, 2015). Thus, challenge stressors such as workload might increase motivation, but this can come at the cost of increased physical and psychological strain (e.g., burnout), which can lead to withdrawal. In fact, meta-analyses on the consequences of burnout, work strain, and psychological and physical illness support their association with increased turnover and absence (Darr & Johns, 2008; Swider & Zimmerman, 2010). In contrast, hindrance stressors include work demands that present a threat or an obstacle and that impede the achievement of goals (e.g., Podsakoff, et al., 2007). Examples are role conflict (including work-family conflict), role ambiguity, organizational politics, organizational constraints, or job insecurity. These stressors have been linked to negative outcomes such as decreased job performance (Gilboa, et al., 2008), decreased engagement (Crawford, et al., 2010), increased burnout (Crawford, et al., 2010), increased CWB (Ferguson, Carlson, Hunter, & Whitten, 2012; Semmer, Tschan, Meier, Facchin, & Jacobshagen, 2010), increased absence (ten Brummelhuis, ter Hoeven, de Jong, & Peper, 2013), and increased turnover and withdrawal behavior (see Table 22.2; Fugate, Prussia, & Kinicki, 2012; Podsakoff, et al., 2007). Meta-analyses on specific hindrance stressors as they relate to CWB showed that organizational constraints and interpersonal conflict are positively and significantly related to CWB (see Table 22.1; Berry, Carpenter, & Barratt, 2012; Hershcovis et al., 2007). The constraints-CWB relationship appears to hold when CWB is assessed at a later point in time (Meier & Spector, 2013) and when studied in other countries (e.g., Bayram, Gursakal, & Bilgel, 2009). Thus, there is consistent support for the negative outcomes associated with these types of stressors. Job resources, on the other hand, are functional job characteristics that can facilitate the achievement of work goals and help employees cope with job demands. Examples include job control, autonomy, participative decision making, task variety, task feedback, work-role fit, and organizational support. Job resources have been linked to increased engagement, decreased burnout (Crawford, et al., 2010), less indiscipline (Tucker et al., 2009), and less absence (Giardini & Kabst, 2008; Hystad, Eid, & Brevik, 2011; Soane et al., 2013). Thus, these research findings would suggest that job resources can play an important role in reducing these negative work outcomes. Some preliminary evidence suggests that the extent to which stressors result in CWB or withdrawal depends on individual differences. For example, the negative effects of stressors on CWB have been found to be more pronounced for individuals who are low in conscientiousness, agreeableness, or emotional stability (Bowling & Eschleman, 2010; Zhou, Meier, & Spector, 2014).
Less resilient individuals were more absent when they experienced high job demands combined with high job control compared to more resilient individuals (Hystad, et al., 2011). Research shows that stress relates to CWB through negative affect (Yang & Diefendorff, 2009). These findings lend further support to the important role that individual differences can play under potentially negative environmental conditions in managing workplace behaviors.
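The mediational claim that stressors relate to CWB through negative affect is conventionally expressed as an indirect effect: if a is the path from the stressor to negative affect and b is the path from negative affect to CWB controlling for the stressor, then in a linear model the total effect decomposes as

c = c' + a \times b,

where c' is the direct effect of the stressor on CWB and the product a × b quantifies the portion of the stressor-CWB relationship transmitted through negative affect.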

Relationship with Supervisor An important factor in the environment at work is the relationship an individual has with his or her supervisor or leader. This relationship can be a positive and constructive one characterized by high-quality leader-member exchange, or a social stressor, as with abusive supervision. Research continues to show the positive outcomes associated with good-quality leader-member exchange, including lower turnover intentions (see Table 22.2; Banks et al., 2014). It also supports the negative consequences of abusive supervision (Avey, Wu, & Holley, 2015; Lian, Ferris, & Brown, 2012; Liu, Kwan, Wu, & Wu, 2010; Thau, Bennett, Mitchell, & Marrs, 2009) or supervisor aggression. The consequences can include higher CWB toward the supervisor (Tepper et al., 2009), other individuals, or the organization (see Tables 22.1 and 22.2; Hershcovis & Barling, 2010; Mawritz, Dust, & Resick, 2014), higher turnover intentions, more coworker sick leave, and more customer service sabotage (Kao, Cheng, Kuo, & Huang, 2014). Abusive supervision has been linked to greater frustration (Avey et al., 2015), increased hostility (Mayer, Thau, Workman, Van Dijke, & De Cremer, 2012), and lower self-esteem (Farh & Chen, 2014), which in turn are related to higher CWB. However, individuals who are embedded in their jobs do not react with such high frustration and reported engaging in less CWB (Avey et al., 2015). The degree to which abusive supervision increased CWB also differed depending on employee characteristics and the environment. That is, supervisor abuse was more strongly related to reports of CWB for employees who score low on moral identity (Greenbaum, Mawritz, Mayer, & Priesemuth, 2013), score low on self-control (Lian, Ferris, Morrison, & Brown, 2014), report high perceptions of distributive injustice (Thau & Mitchell, 2010), or report a high intent to quit (Lian et al., 2014). Thus, the negative work outcomes associated with stress extend to social stressors like abusive supervision, even though the extent of this influence can depend further on individual differences and other characteristics of the work environment.

Organizational Justice Perceptions of organizational justice are related to less CWB and withdrawal. This relationship holds for all forms of justice, whether distributive, procedural, interpersonal, interactional, or informational justice. The results of seven meta-analyses in which organizational justice–CWB/ withdrawal coefficients were analyzed are summarized in Tables 22.1 and 22.2 (Berry, Carpenter, & Barratt, 2012; Berry, Ones, & Sackett, 2007; Cohen-Charash & Spector, 2001; Colquitt, Conlon, Wesson, Porter, & Ng, 2001; Colquitt et al., 2013; Dalal, 2005; Hershcovis et al., 2007). Recent research has focused on identifying the mechanisms that explain the justice-CWB/withdrawal link, and findings suggest that this link can operate through negative affect (e.g., Colquitt et al., 2013), the desire for revenge (Hoffmann, 2008; Jones, 2009), perceived organizational support (El Akremi, Vandenberghe, & Camerman, 2010), leader-member exchange (El Akremi et al., 2010), or self-esteem (Ferris, Spence, Brown, & Heller, 2012). Research has also shown that the strength of the justice-CWB/withdrawal link can depend on employee values of justice (Holtz & Harold, 2013), occupational rank (Cronin & Smith, 2011), or social identity (Enns & Rotundo, 2012). Perceptions of injustice can arise from a breach in the psychological contract with an employer. Several studies have reported a positive relationship between psychological contract breach and various forms of CWB and withdrawal (Chao, Cheun, & Wu, 2011; Chiu & Peng, 2008; Jensen, Opland, & Ryan, 2010; Zagenczyk, Restubog, Kiewitz, Kiazad, & Tang, 2014; Zhao, Wayne, Glibkowski, & Bravo, 2007). This relationship has been found to be stronger for relational breach compared to transactional breach (i.e., for withdrawal; see Table 22.2; Zhao, et al., 2007), for individuals who score higher on hostile attribution style (i.e., individuals who attribute negative outcomes to something external, stable, and controllable; Chiu & Peng, 2008), and when employees attribute the breach to something that the organization initiated (e.g., reneging; Chao et al., 2011). Others have reported that employees’ desire to seek revenge after a breach mediated the positive breach–CWB relationship (Bordia, Restubog, & Tang, 2008). Overall, these findings lend further support to the importance of managing perceptions of justice at work and to the interaction of the person and the environment.

Human Resource Management Practices Organizations implement employment practices in part based on the expectation that they improve work outcomes for various stakeholders. Examples of some of these practices or work arrangements are high-performing human resource practices, diversity management programs, telecommuting arrangements, organizational wellness programs, electronic performance monitoring, or initiatives to control absence. Research has shown that some of these practices are related to lower levels of withdrawal (see Table 22.2; Armstrong et al., 2010; Gajendran & Harrison, 2007; Kehoe & Wright, 2013; Parks & Steelman, 2008; Peretz & Fried, 2012). In the few instances when CWB is considered as an outcome, there appears to be no relationship (see Table 22.1; Bhave, 2014). Furthermore, research has shown that the programs that organizations implement to manage absenteeism can be related to the types of absence that emerge (e.g., absence due to illness or medical reasons, low motivation; Hopkins, 2014; Johnson, Holley, Morgeson, LaBonar, & Stetzer, 2014) and may even relate to pressure on employees to work when ill, a phenomenon recently coined presenteeism (e.g., Baker-McClearn, Greasley, Dale, & Griffith, 2010; Johns, 2010). Thus, although there is some support for the role that employment practices can play in managing absenteeism, it is important to consider the reasons for absence when devising a strategy to combat it.

Climate The climate of an organization or workgroup has to do with the context in which employee behavior is enacted. Organizational climate is defined as shared perceptions regarding what is rewarded and supported in the organization (Zohar & Luria, 2004) and therefore provides cues as to what is appropriate and inappropriate behavior. It concerns employee perceptions of management practices regarding a set of behaviors or what is emphasized by supervisors (Schneider & Bowen, 1985). Although it is important to distinguish individual perceptions of climate (psychological climate) from shared perceptions (organizational climate), relationships of both forms of climate with other variables are often quite similar (e.g., Beus, Payne, Bergman, & Arthur, 2010; Law, Dollard, Tuckey, & Dormann, 2011). Several climate variables have been studied in relation to CWB. Most of the research has focused on CWB-I from the perspective of targets rather than actors, looking at various forms of mistreatment, both physical and psychological. Such studies do not always separate mistreatment by organizational insiders, which would be CWB-I, from similar mistreatment by outsiders, such as clients or patients, which would not be defined as CWB since it is not performed by employees. Nevertheless, this literature has been consistent in suggesting that climate plays a role in CWB and that certain climates might encourage, whereas others discourage, such behaviors. Climates have been studied that relate specifically to mistreatment and violence in the workplace by insiders and outsiders. Violence prevention climate concerns practices by management that focus on minimizing both the physical violence and psychological abuse experienced by employees (Kessler, Spector, Chang, & Parr, 2008). Incivility climate concerns the control of uncivil behaviors in the workplace (Ottinot, 2011). Bullying climate, as the name implies, is concerned with the control of bullying behavior at work (Hutchinson, Jackson, Wilkes, & Vickers, 2008). These various forms of what has been termed mistreatment climate were shown in a meta-analysis to relate to workplace incivility, psychological abuse, and physical violence (Yang, Caughlin, Gazica, Truxillo, & Spector, 2014). An even broader climate is psychosocial safety climate, which is concerned not just with mistreatment but also with the protection of workers from all sorts of conditions that would adversely affect their psychological health and safety (Dollard & Bakker, 2010). Included would be CWB-I ranging from mild incivility to serious bullying. In a multilevel study of both psychological and organizational psychosocial safety climate, Law et al. (2011) found that climate at each level related to employees being bullied and harassed at work. Taken together, these lines of research on climate suggest a clear link between context and CWB-I. A climate that has been shown to be more directly related to CWB performed specifically by organizational insiders is ethical climate. This form of climate is concerned with what is considered ethically acceptable behavior in an organization, in other words, what actions are considered right or wrong from a moral perspective (Victor & Cullen, 1988), for example, is it okay to lie to or cheat a customer? In another multilevel study with ethical climate measured at the organizational level, Chen, Chen, and Liu (2013) investigated the interaction of climate with personality. They found that the relationship of negative affectivity to overall CWB was moderated by ethical climate, such that a climate encouraging ethical behavior inhibited CWB in individuals who were high in negative affectivity. Research has also emphasized the role of climate, especially as it relates to withdrawal behavior. More specifically, a lenient organizational lateness climate has been shown to relate to higher lateness frequency compared to a stricter climate (Elicker, Foust, O'Malley, & Levy, 2008). Research also reported a positive relationship between manager absence and employee absence (Duff, Podolsky, Biron, & Chan, 2015; Nielsen, 2008) and that absence increased more in the presence of permissive absence norms (Biron & Bamberger, 2012).

PERSON AND ENVIRONMENT Scholars continue to study the outcomes of fit between individuals, based on various personal characteristics, and the environment where they work (e.g., Biron & DeReuver, 2013; Kristof-Brown, Zimmerman, & Johnson, 2005; Liao, Chuang, & Joshi, 2008; Maynard & Parfyonova, 2013; Oh et al., 2014). This research is based on the premise that individuals whose knowledge, skills, abilities, interests, and values match those required of the job, organization, supervisor, or group are more likely to succeed on the job and to stay. Two meta-analyses have shown that these various forms of fit are related to lower turnover intentions and lower withdrawal (see Table 22.2; Kristof-Brown et al., 2005; Oh et al., 2014). In 2001, scholars introduced a new construct they labeled job embeddedness, which reaches beyond an individual's fit with the job and work environment to also include ties among colleagues, family, and friends at work and in the community and the costs or sacrifices associated with leaving (Mitchell, Holtom, Lee, Sablynski, & Erez, 2001). These scholars and others since have proposed that individuals who are more embedded in their jobs, organizations, and the community are less likely to seek employment elsewhere and are more likely to perform effectively on the job (Kiazad, Holtom, Hom, & Newman, 2015; Mitchell et al., 2001). A meta-analysis found support for some of these relationships. On-the-job embeddedness was negatively and significantly related to turnover intentions and to actual turnover, whereas off-the-job embeddedness was negatively related to turnover intentions (see Table 22.2). These relationships held even after controlling for job attitudes and job alternatives (Jiang et al., 2012). Furthermore, research has shown that employees who are employed in jobs that match their interests, whose job environment is compatible, or who are more embedded engage in less CWB (Iliescu, Ispas, Sulea, & Ilie, 2015; Ng & Feldman, 2009b). Thus, achieving greater fit for employees at work and beyond the workplace through relationships in the community can be useful for managing negative discretionary behaviors and withdrawal.

CONSEQUENCES Our chapter in the first edition of this book reported that limited research had considered the costs of CWB or withdrawal, possibly because it is assumed that the costs are high. More recently, research has shed light on some of the consequences experienced by the various stakeholders. For example, Simpson (2013) summarized some of the costs of white-collar crime to be punishment of leaders, a decrease in firm value, and status loss of the firm. As for the consequences of withdrawal, a meta-analysis reported that turnover is more negatively related to customer service and quality/safety and less related to firm performance (Hancock, Allen, Bosco, McDaniel, & Pierce, 2013). At the employee level, researchers have reported that employee CWB was related to less mentoring received (Lapierre, Bonaccio, & Allen, 2009), and coworker aggression was related to higher turnover intentions, absence, and higher CWB (see Tables 22.1 and 22.2; Hershcovis & Barling, 2010; Nielsen & Einarsen, 2012).

At the same time, recent research has noted some of the positive consequences of CWB. For example, Reynolds, Shoss, and Jundt (2015) present a model that delineates favorable and unfavorable outcomes of organizational citizenship behavior and CWB for individuals, peers, and the organization. They noted as favorable outcomes instigating change by drawing attention to problematic situations, increasing efficiency/effectiveness, restoring relationship balance among employees, and improving employee performance. Other research found that CWB and withdrawal moderated the relationship between organizational justice and emotional exhaustion (Krischer, Penney, & Hunter, 2010). That is, when employees reported lower perceptions of distributive or procedural justice, they also reported higher levels of emotional exhaustion. However, the strength of this relationship decreased for those employees who also reported withdrawal behavior or production deviance (Krischer et al., 2010). Research also showed that employees reported higher self-evaluation when they worked with a coworker who was perceived to be a rule breaker (Markova & Folger, 2012), and although employees reported experiencing guilt when they were informed that they had engaged in CWB, they compensated by performing organizational citizenship behavior (Ilies, Peng, Savani, & Dimotakis, 2013). Although no one is endorsing engaging in CWB in response to a negative work environment, this type of research is useful for sorting out how employees react to it and deal with it. Research has also considered the consequences of CWB and withdrawal by studying how others react to this behavior. For example, Patton (2011) reported that judgments of responsibility for the act matter. That is, when an employee is judged to be responsible for his or her absence, others will experience more anger and intent to punish and less sympathy and intent to help. Furthermore, research found that the status of the perpetrator and the perceiver, and the perceived reasons for the CWB or withdrawal, influenced people's reactions to it. Bowles and Gelfand (2010) found that punishment for negative behavior was more likely by high-status evaluators than by low-status evaluators when the target was low status. Race and sex were the variables used to represent status. In a scenario study, Luksyte, Waite, Avery, and Roy (2013) reported that lateness resulted in fewer advancement opportunities for Black employees compared to White employees. Together these findings suggest that potential biases may be at play when CWB or withdrawal are evaluated.

EMERGING ISSUES WITH SOCIAL NETWORKING WEBSITES The focus of this chapter has been on types of CWB and withdrawal that typically occur in the workplace. Given the strong presence of social media and that its use extends beyond the workplace, future research may seek to understand how to motivate the effective and appropriate use of social networking websites by employers, employees, or other stakeholders. Several social networking websites arose for the purpose of helping individuals share information and connect with their friends. However, they have made their way into the working lives of individuals, raising concerns that certain behaviors and activities on social networking websites can be inappropriate (e.g., Black, Stone, & Johnson, 2015; Chauhan, Buckley, & Harvey, 2013; Davis, 2012; Dreher, 2014; Jain et al., 2014; Lucero, Allen, & Elzweig, 2013; Miller, 2013; Pate, 2012; Roberts & Sambrook, 2014; Roth, Bobko, Van Iddekinge, & Thatcher, 2016). It would appear that some of these behaviors even satisfy the broad parameters associated with counterproductive work behavior that were summarized earlier in this chapter (e.g., counter to the interests of the organization or intend to harm the organization or its members). Behavior may come into question even though it is engaged in outside of work hours or away from the workplace. For example, public online posts in which coworkers are attacked or abused, the organization is disparaged, or confidential organizational information is made public may be considered counterproductive and may result in disciplinary action or termination (e.g., Chauhan et al., 2013; Lucero et al., 2013; Mainiero & Jones, 2013; Miller, 2013; Roberts & Sambrook, 2014). Even behaviors that do not involve directly the organization or other members and that are engaged in outside of work hours (e.g., posting provocative photos online) have attracted the attention of recruiters (e.g., Chauhan et al., 2013; Davis, 2012; Lucero et al., 2013; Pate, 2012).

Employers and employees are urged to exert caution when engaging in online activities or when reacting to them (e.g., Brice, Fifer, & Naron, 2012; Davis, 2012; Lucero et al., 2013; Miller, 2013). The circumstances surrounding each case often appear to be relevant when disciplinary consequences are considered and evaluated (e.g., Brice et al., 2012; Davis, 2012; Lucero et al., 2013; Mainiero & Jones, 2013). Consequently, some guidelines have been developed for use when evaluating cases that arise (e.g., Brice et al., 2012; Davis, 2012). Organizations are urged to provide employees with training on social media use and to create social media policies that specify behavior that is deemed to be inappropriate (e.g., Black et al., 2015; Davis, 2012; Dreher, 2014; Lucero et al., 2013; Mainiero & Jones, 2013; Miller, 2013; Pate, 2012). Some employers have resorted to scanning online profiles of applicants during the selection process or to requesting passwords to social networking websites. However, these practices have been met with some caution (e.g., Black et al., 2015; Chauhan et al., 2013; Pate, 2012; Roberts & Sambrook, 2014; Roth et al., 2013). Part of employers' motivation to scan online profiles is to protect against negligent hiring (e.g., Black et al., 2015; Chauhan et al., 2013; Levashina & Campion, 2009; Lucero et al., 2013; Pate, 2012). This concern is heightened for high-risk jobs deemed sensitive in which individuals are in contact with customers, the public, the elderly, or children, such as jobs in education, medical professions, or security, among other jobs (e.g., Connerley, Arvey, & Bernardy, 2001; Levashina & Campion, 2009). Although there is research and legal precedent on the need for employers to conduct background checks for these high-risk occupations, checks that are even mandated for some (e.g., Connerley et al., 2001; Levashina & Campion, 2009; Pate, 2012), the evidence is less clear as to whether it is necessary for employers to scan profiles on social networking websites (e.g., Levashina & Campion, 2009; Pate, 2012). Furthermore, concerns have been raised that scanning social media profiles and relying on them for selection may have the unintended consequence of adverse impact or invisible discrimination (e.g., Black et al., 2015; Chauhan et al., 2013; Pate, 2012; Roth et al., 2013). That is, profiles contain information about applicants' demographic characteristics and non-job-related information (e.g., age, religion, sexual orientation, disability status, marital status, political affiliation) that is otherwise not typically available at the time of hire. Scanning profiles also assumes that individuals use social media, which may vary disproportionately by ethnicity, age, or other characteristics (e.g., Pate, 2012; Roth et al., 2013). Research on the antecedents of social media use is limited, and some has focused on the role of personality (e.g., Karl, Peluchette, & Schlaegel, 2010; Kluemper, Rosen, & Mossholder, 2012; Newness, Steinert, & Viswesvaran, 2012). For example, in a survey of college students, Newness et al. (2012) found that individuals who scored higher on conscientiousness, agreeableness, openness, emotional stability, honesty–integrity, or emotional intelligence posted less inappropriate content on Facebook.
Chou, Hammond, and Johnson (2013) found that the frequency with which individuals updated their Facebook profiles was positively related to a form of withdrawal, whereas the time spent with friends offline was negatively related to the same form of withdrawal. McFarland and Ployhart (2015) draw attention to the unique context that social media represents and propose a contextual framework to motivate future research and practice. Given the rise in social media use and its spillover into work, it may be prudent for researchers, practitioners, and the courts to study this behavior further and to continue to revise and update guidelines surrounding its use (e.g., Black et al., 2015; Dreher, 2014; Mainiero & Jones, 2013; Newness et al., 2012; Roth et al., 2013).
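As a purely illustrative aside, the adverse impact concern raised above is often examined with the four-fifths (80%) rule, which compares selection rates across applicant groups. The sketch below uses invented applicant counts (not data from any study cited here) simply to show the computation.

```python
# Minimal sketch of the four-fifths (80%) rule for adverse impact,
# using entirely hypothetical counts of applicants screened via
# social media profiles. Not based on any cited study.

def selection_rate(hired: int, applicants: int) -> float:
    """Proportion of applicants in a group who were selected."""
    return hired / applicants

# Hypothetical applicant pools and hires after a profile screen.
groups = {
    "group_a": {"applicants": 200, "hired": 60},
    "group_b": {"applicants": 150, "hired": 27},
}

rates = {g: selection_rate(v["hired"], v["applicants"]) for g, v in groups.items()}
highest = max(rates.values())

for group, rate in rates.items():
    impact_ratio = rate / highest
    flag = "below 0.80 threshold" if impact_ratio < 0.80 else "ok"
    print(f"{group}: selection rate={rate:.2f}, impact ratio={impact_ratio:.2f} ({flag})")
```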

IMPLICATIONS FOR EMPLOYEE SELECTION

Employee selection tests are validated against criterion constructs of interest, many of which are components of employee job performance. Thus, the quality and relevance of performance criteria have important implications for the validity of selection tests. Counterproductive work behavior and withdrawal are domains of employee job performance that have gained widespread attention over the years, and there is great interest in selecting employees who are less likely to engage in these behaviors. Consequently, scholars have developed reliable measures that have seen widespread use, primarily in research studies. Withdrawal measures used as criteria for selection usually rely on records rather than individual reports. CWB measures are generally checklists of behaviors completed by the employee or others. Although these measures have shown evidence for validity, it is not clear to what extent they may be subject to reporting bias, whether by the self or others. The first edition of this chapter reviewed research on employee theft and the efforts that organizations expend to reduce it both pre- and post-hire (Langton & Hollinger, 2005). This research suggested that strategies aimed at selecting out high-risk employees were more effective than most post-hire efforts at reducing shrinkage rates. This strategy remains a useful one today. Research findings support the role that integrity tests, conscientiousness, agreeableness, and emotional stability can play in predicting employee CWB, which supports their use in employee selection, especially because they also predict other dimensions of job performance. Preliminary research also suggests that these personality traits can interact with environmental triggers, such as stressors or injustice, to reduce the negative outcomes associated with these triggers, providing further support for their use in selection. Having said this, we could benefit from additional research on the interplay among these individual difference factors, including facet-level comparisons, to see whether there is any efficiency to be gained in selection by focusing on specific facets over broad traits. Such research should also examine aberrant personality traits such as Machiavellianism and narcissism to determine whether they play an incremental role in explaining CWB and withdrawal beyond integrity tests and the FFM traits mentioned above. Going beyond selection, research also shows that once an employee is hired, characteristics of the work environment can motivate CWB or withdrawal behaviors. As noted, such factors can interact with characteristics of people, having a bigger impact on some individuals than on others. As reported earlier, gender has been shown to moderate the relationship between job stressors and CWB (Bowling & Burns, 2015; Spector & Zhou, 2014). Obviously, no one would use gender as a selection factor, but such research underscores the notion that individuals vary in their responses to the work environment; one should not assume that because some employees seem unaffected by a workplace practice, that practice will have no impact on CWB or withdrawal across the board. The environmental triggers that can motivate CWB or withdrawal should not be ignored. Hindrance and social stressors, a negative or unethical work climate, and perceptions of organizational injustice, including breach of the psychological contract, are consistently related to higher levels of CWB and certain forms of employee withdrawal. In contrast, job resources, job embeddedness and fit, job satisfaction, organizational commitment, and strong relationships with leaders show the opposite pattern. Less support has been found for the role of demographic characteristics, with the exception of some support for a relationship between age and withdrawal. Such workplace conditions should be carefully considered, as one should not assume that selection alone will be sufficient for dealing with issues of CWB and withdrawal.
An integrated approach that considers selection in the context of a work environment that supports employee effectiveness and well-being will likely have a positive impact on reducing counterproductive and increasing productive work behavior.
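The person-by-situation interactions described above are typically tested with moderated regression, in which a product term captures whether a trait buffers the stressor–CWB relationship. The sketch below simulates hypothetical data for this purpose; the variable names, effect sizes, and sample size are assumptions chosen for illustration, not estimates from the studies cited.

```python
# Minimal sketch of a moderated regression testing whether a personality
# trait buffers the job stressor -> CWB relationship. All data are
# simulated for illustration only; coefficients are not empirical estimates.
import numpy as np

rng = np.random.default_rng(0)
n = 500

stressors = rng.normal(size=n)            # standardized job stressors
conscientiousness = rng.normal(size=n)    # standardized trait score

# Simulated CWB: stressors raise CWB, and the effect is weaker for
# highly conscientious employees (negative interaction term).
cwb = (0.40 * stressors
       - 0.30 * conscientiousness
       - 0.20 * stressors * conscientiousness
       + rng.normal(scale=1.0, size=n))

# Design matrix with intercept, main effects, and product term.
X = np.column_stack([np.ones(n), stressors, conscientiousness,
                     stressors * conscientiousness])
beta, *_ = np.linalg.lstsq(X, cwb, rcond=None)

labels = ["intercept", "stressors", "conscientiousness", "interaction"]
for name, b in zip(labels, beta):
    print(f"{name:>17}: {b:+.2f}")
```

A significant negative interaction of this kind is what the buffering findings summarized above would imply.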

ACKNOWLEDGMENT

Preparation of this chapter was supported in part by a grant to the first author from the Social Sciences and Humanities Research Council of Canada.

REFERENCES

Addae, H. M., Johns, G., & Boies, K. (2013). The legitimacy of absenteeism from work: A nine nation exploratory study. Cross Cultural Management, 20(3), 402–428. doi: 10.1108/ccm.05.2012.0040 Armstrong, C., Flood, P. C., Guthrie, J. P., Liu, W., MacCurtain, S., & Mkamwa, T. (2010). The impact of diversity and equality management on firm performance: Beyond high performance work systems. Human Resource Management, 49(6), 977–998. doi: 10.1002/hrm.20391


Maria Rotundo and Paul E. Spector Avey, J. B., Wu, K., & Holley, E. (2015). The influence of abusive supervision and job embeddedness on citizenship and deviance. Journal of Business Ethics, 129(3), 721–731. doi: 10.1007/s10551–014–2192-x Baker-McClearn, D., Greasley, K., Dale, J., & Griffith, F. (2010). Absence management and presenteeism: The pressures on employees to attend work and the impact of attendance on performance. Human Resource Management Journal, 20(3), 311–328. doi: 10.1111/j.1748–8583.2009.00118.x Banks, G. C., Batchelor, J. H., Seers, A., O’Boyle, E. H., Pollack, J. M., & Gower, K. (2014). What does team-member exchange bring to the party? A meta-analytic review of team and leader social exchange. Journal of Organizational Behavior, 35(2), 273–295. doi: 10.1002/job.1885 Barclay, L. J., & Aquino, K. (2011). Workplace aggression and violence. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology, Vol 3: Maintaining, expanding, and contracting the organization (pp. 615–640). Washington, DC: American Psychological Association. Bayram, N., Gursakal, N., & Bilgel, N. (2009). Counterproductive work behavior among white-collar employees: A study from Turkey. International Journal of Selection and Assessment, 17(2), 180–188. doi: 10.1111/j.1468–2389.2009.00461.x Bennett, R. J., & Robinson, S. L. (2000). Development of a measure of workplace deviance. Journal of Applied Psychology, 85(3), 349–360. doi: 10.1037/0021–9010.85.3.349 10900810 Berry, C. M., Carpenter, N. C., & Barratt, C. L. (2012). Do other-reports of counterproductive work behavior provide an incremental contribution over self-reports? A meta-analytic comparison. Journal of Applied Psychology, 97(3), 613–636. doi: 10.1037/a0026739 22201245 Berry, C. M., Gruys, M. L., & Sackett, P. R. (2006). Educational attainment as a proxy for cognitive ability in selection: Effects on levels of cognitive ability and adverse impact. Journal of Applied Psychology, 91(3), 696–705. doi: 10.1037/0021–9010.91.3.696 16737365 Berry, C. M., Lelchook, A. M., & Clark, M. A. (2012). A meta-analysis of the interrelationships between employee lateness, absenteeism, and turnover: Implications for models of withdrawal behavior. Journal of Organizational Behavior, 33(5), 678–699. doi: 10.1002/job.778 Berry, C. M., Ones, D. S., & Sackett, P. R. (2007). Interpersonal deviance, organizational deviance, and their common correlates: A review and meta-analysis. Journal of Applied Psychology, 92(2), 410–424. doi: 10.1037/0021–9010.92.2.410 17371088 Berry, C. M., Sackett, P. R., & Tobares, V. (2010). A meta-analysis of conditional reasoning tests of aggression. Personnel Psychology, 63(2), 361–384. Beus, J. M., Payne, S. C., Bergman, M. E., & Arthur, W. (2010). Safety climate and injuries: An examination of theoretical and empirical relationships. Journal of Applied Psychology, 95(4), 713–727. doi: 10.1037/ a0019164 Bhave, D. P. (2014). The invisible eye? Electronic performance monitoring and employee job performance. Personnel Psychology, 67(3), 605–635. doi: 10.1111/peps.12046 Biron, M., & Bamberger, P. (2012). Aversive workplace conditions and absenteeism: Taking referent group norms and supervisor support into account. Journal of Applied Psychology, 97(4), 901–912. doi: 10.1037/ a0027437 Biron, M.,  & De Reuver, R. (2013). Restoring balance? Status inconsistency, absenteeism, and HRM practices. European Journal of Work and Organizational Psychology, 22(6), 683–696. doi: 10.1080/1359432x.2012.694165 Black, S. L., Stone, D. 
L., & Johnson, A. F. (2015). Use of social networking websites on applicants’ privacy. Employee Responsibilities and Rights Journal, 27(2), 115–159. doi: 10.1007/s10672–014–9245–2 Bordia, P., Restubog, S. L. D., & Tang, R. L. (2008). When employees strike back: Investigating mediating mechanisms between psychological contract breach and workplace deviance. Journal of Applied Psychology, 93(5), 1104–1117. doi: 10.1037/0021–9010.93.5.1104 Bowles, H. R., & Gelfand, M. (2010). Status and the evaluation of workplace deviance. Psychological Science, 21(1), 49–54. doi: 10.1177/0956797609356509 Bowling, N. A., Alarcon, G. M., Bragg, C. B., & Hartman, M. J. (2015). A meta-analytic examination of the potential correlates and consequences of workload. Work and Stress, 29(2), 95–113. doi: 10.1080/02678373.2015.1033037 Bowling, N. A., & Burns, G. N. (2015). Sex as a moderator of the relationships between predictor variables and counterproductive work behavior. Journal of Business and Psychology, 30(1), 193–205. doi: 10.1007/ s10869–013–9342–5 Bowling, N. A., Burns, G. N., Stewart, S. M., & Gruys, M. L. (2011). Conscientiousness and agreeableness as moderators of the relationship between neuroticism and counterproductive work behaviors: A constructive replication. International Journal of Selection and Assessment, 19(3), 320–330. doi: 10.1111/j.1468–2389.2011.00561.x Bowling, N. A., & Eschleman, K. J. (2010). Employee personality as a moderator of the relationships between work stressors and counterproductive work behavior. Journal of Occupational Health Psychology, 15(1), 91–103. doi: 10.1037/a0017326


Perspectives on Counterproductive Work Behavior Brice, R., Fifer, S., & Naron, G. (2012). Social media in the workplace: The NLRB speaks. Intellectual Property & Technology Law Journal, 24(10), 13–17. Card, N. A., Stucky, B. D., Sawalani, G. M., & Little, T. D. (2008). Direct and indirect aggression during childhood and adolescence: A meta-analytic review of gender differences, intercorrelations, and relations to maladjustment. Child Development, 79(5), 1185–1229. doi: 10.1111/j.1467–8624.2008.01184.x Cavanaugh, M. A., Boswell, W. R., Roehling, M. V., & Boudreau, J. W. (2000). An empirical examination of self-reported work stress among U.S. managers. Journal of Applied Psychology, 85(1), 65–74. doi: 10.1037/0021–9010.85.1.65 Chao, J. M. C., Cheung, F. Y. L., & Wu, A. M. S. (2011). Psychological contract breach and counterproductive workplace behaviors: Testing moderating effect of attribution style and power distance. International Journal of Human Resource Management, 22(4), 763–777. doi: 10.1080/09585192.2011.555122 Chauhan, R. S., Buckley, M. R., & Harvey, M. G. (2013). Facebook and personnel selection: What’s the big deal? Organizational Dynamics, 42(2), 126–134. doi: 10.1016/j.orgdyn.2013.03.006 Chen, C-C., Chen, M. Y-C., & Liu, Y-C. (2013). Negative affectivity and workplace deviance: The moderating role of ethical climate. International Journal of Human Resource Management, 24(15), 2894–2910. doi: 10.1080/09585192.2012.753550 Chiu, S-F., & Peng, J-C. (2008). The relationship between psychological contract breach and employee deviance: The moderating role of hostile attributional style. Journal of Vocational Behavior, 73(3), 426–433. doi: 10.1016/j.jvb.2008.08.006 Chou, H-T. G., Hammond, R. J., & Johnson, R. (2013). How Facebook might reveal users’ attitudes toward work and relationships with coworkers. Cyberpsychology, Behavior, and Social Networking, 16(2), 136–139. doi: 10.1089/cyber.2012.0321 23276260 Christian, M. S., & Ellis, A. P. J. (2011). Examining the effects of sleep deprivation on workplace deviance: A self-regulatory perspective. Academy of Management Journal, 54(5), 913–934. doi: 10.5465/amj.2010.0179 Cohen-Charash, Y., & Spector, P. E. (2001). The role of justice in organizations: A meta-analysis. Organizational Behavior and Human Decision Processes, 86(2), 278–321. doi: 10.1006/obhd.2001.2958 Colquitt, J. A., Conlon, D. E., Wesson, M. J., Porter, C. O. L. H., & Ng, K. Y. (2001). Justice at the millennium: A meta-analytic review of 25 years of organizational justice research. Journal of Applied Psychology, 86(3), 425–445. doi: 10.1037//0021–9010.86.3.425 Colquitt, J. A., Scott, B. A., Rodell, J. B., Long, D. M., Zapata, C. P., Conlon, D. E., & Wesson, M. J. (2013). Justice at the millennium, a decade later: A meta-analytic test of social exchange and affect-based perspectives. Journal of Applied Psychology, 98(2), 199–236. doi: 10.1037/a0031757 Connelly, B. S., & Ones, D. S. (2010). An other perspective on personality: Meta-analytic integration of observers’ accuracy and predictive validity. Psychological Bulletin, 136(6), 1092–1122. doi: 10.1037/ a0021212 21038940 Connerley, M. L., Arvey, R. D., & Bernardy, C. J. (2001). Criminal background checks for prospective and current employees: Current practices among municipal agencies. Public Personnel Management, 30(2), 173–183. Côté, S., DeCelles, K. A., McCarthy, J. M., Van Kleef, G. A., & Hideg, I. (2011). 
The Jekyll and Hyde of emotional intelligence: Emotion-regulation knowledge facilitates both prosocial and interpersonally deviant behavior. Psychological Science, 22(8), 1073–1080. doi: 10.1177/0956797611416251 Crawford, E. R., LePine, J. A., & Rich, B. L. (2010). Linking job demands and resources to employee engagement and burnout: A theoretical extension and meta-analytic test. Journal of Applied Psychology, 95(5), 834–848. doi: 10.1037/a0019364 Cronin, T., & Smith, H. (2011). Protest, exit, or deviance: Adjunct university faculty reactions to occupational rank-based mistreatment. Journal of Applied Social Psychology, 41(10), 2352–2373. doi: 10.1111/j.1559–1816.2011.00826.x Dahling, J. J., Whitaker, B. G., & Levy, P. E. (2009). The development and validation of a new Machiavellianism scale. Journal of Management, 35(2), 219–257. doi: 10.1177/0149206308318618 Dalal, R. S. (2005). A meta-analysis of the relationship between organizational citizenship behavior and counterproductive work behavior. Journal of Applied Psychology, 90(6), 1241–1255. doi: 10.1037/0021–9010.90.6.1241 Darr, W., & Johns, G. (2008). Work strain, health, and absenteeism: A meta-analysis. Journal of Occupational Health Psychology, 13(4), 293–318. doi: 10.1037/a0012639 Davis, S. D. (2012). Social media activity & the workplace: Updating the status of social media. Ohio Northern University Law Review, 39(1), 359–386. Retrieved from http://law.onu.edu/sites/default/files/ Davis_0.pdf Dilchert, S., Ones, D. S., Davis, R. D., & Rostow, C. D. (2007). Cognitive ability predicts objectively measured counterproductive work behaviors. Journal of Applied Psychology, 92(3), 616–627. doi: 10.1037/0021–9010.92.3.616


Maria Rotundo and Paul E. Spector Dollard, M. F., & Bakker, A. B. (2010). Psychosocial safety climate as a precursor to conducive work environments, psychological health problems, and employee engagement. Journal of Occupational and Organizational Psychology, 83(3), 579–599. doi: 10.1348/096317909x470690 Dreher, S. (2014). Social media and the world of work: A strategic approach to employees’ participation in social media. Corporate Communications: An International Journal, 19(4), 344–356. doi: 10.1108/ CCIJ-10–2013–0087 Duff, A. J., Podolsky, M., Biron, M., & Chan, C. C. A. (2015). The interactive effect of team and manager absence on employee absence: A multilevel field study. Journal of Occupational and Organizational Psychology, 88(1), 61–79. doi: 10.1111/joop.12078 El Akremi, A., Vandenberghe, C., & Camerman, J. (2010). The role of justice and social exchange relationships in workplace deviance: Test of a mediated model. Human Relations, 63(11), 1687–1717. doi: 10.1177/0018726710364163 Elicker, J. D., Foust, M. S., O’Malley, A. L., & Levy, P. E. (2008). Employee lateness behavior: The role of lateness climate and individual lateness attitude. Human Performance, 21(4), 427–441. doi: 10.1080/08959280802347254 Elman, C., & O’Rand, A. M. (2004). The race is to the swift: Socioeconomic origins, adult education, and wage attainment. American Journal of Sociology, 110(1), 123–160. doi: 10.1086/386273 Enns, J. R., & Rotundo, M. (2012). When competition turns ugly: Collective injustice, workgroup identification, and counterproductive work behavior. Human Performance, 25(1), 26–51. doi: 10.1080/08959285.2011.631646 Farh, C. I. C., & Chen, Z-J. (2014). Beyond the individual victim: Multilevel consequences of abusive supervision in teams. Journal of Applied Psychology, 99(6), 1074–1095. doi: 10.1037/a0037636 Ferguson, M., Carlson, D., Hunter, E. M., & Whitten, D. (2012). A two-study examination of work-family conflict, production deviance and gender. Journal of Vocational Behavior, 81(2), 245–258. doi: 10.1016/j. jvb.2012.07.004 Ferris, D. L., Spence, J. R., Brown, D. J., & Heller, D. (2012). Interpersonal injustice and workplace deviance: The role of esteem threat. Journal of Management, 38(6), 1788–1811. doi: 10.1177/0149206310372259 Fox, S., Spector, P. E., Goh, A., & Bruursema, K. (2007). Does your coworker know what you’re doing? Convergence of self- and peer-reports of counterproductive work behavior. International Journal of Stress Management, 14(1), 41–60. doi: 10.1037/1072–5245.14.1.41 Fox, S., Spector, P. E., & Miles, D. (2001). Counterproductive Work Behavior (CWB) in response to job stressors and organizational justice: Some mediator and moderator tests for autonomy and emotions. Journal of Vocational Behavior, 59(3), 291–309. doi: 10.1006/jvbe.2001.1803 Fugate, M., Prussia, G. E., & Kinicki, A. J. (2012). Managing employee withdrawal during organizational change: The role of threat appraisal. Journal of Management, 38(3), 890–914. doi: 10.1177/0149206309352881 Gajendran, R. S., & Harrison, D. A. (2007). The good, the bad, and the unknown about telecommuting: Meta-analysis of psychological mediators and individual consequences. Journal of Applied Psychology, 92(6), 1524–1541. doi: 10.1037/0021–9010.92.6.1524 Ganster, D. C., & Rosen, C. C. (2013). Work stress and employee health: A multidisciplinary review. Journal of Management, 39(5), 1085–1122. doi: 10.1177/0149206313475815 Giardini, A., & Kabst, R. (2008). Effects of work-family human resource practices: A longitudinal perspective. 
The International Journal of Human Resource Management, 19(11), 2079–2094. doi: 10.1080/09585190802404312 Gilboa, S., Shirom, A., Fried, Y., & Cooper, C. (2008). A meta-analysis of work demand stressors and job performance: Examining main and moderating effects. Personnel Psychology, 61(2), 227–271. doi: 10.1111/j.1744–6570.2008.00113.x Gonzalez-Mulé, E., Mount, M. K., & Oh, I-S. (2014). A meta-analysis of the relationship between general mental ability and nontask performance. Journal of Applied Psychology, 99(6), 1222–1243. doi: 10.1037/ a0037547 Gottfredson, L. S. (1997). Mainstream science on intelligence: An editorial with 52 signatories, history and bibliography. Intelligence, 24(1), 13–23. doi: 10.1016/S0160–2896%2897%2990011–8 Greenbaum, R. L., Mawritz, M. B., Mayer, D. M., & Priesemuth, M. (2013). To act out, to withdraw, or to constructively resist? Employee reactions to supervisor abuse of customers and the moderating role of employee moral identity. Human Relations, 66(7), 925–950. doi: 10.1177/0018726713482992 Gruys, M. L., & Sackett, P. R. (2003). Investigating the dimensionality of counterproductive work behavior. International Journal of Selection and Assessment, 11, 30–42. Hancock, J. I., Allen, D. G., Bosco, F. A., McDaniel, K. R., & Pierce, C. A. (2013). Meta-analytic review of employee turnover as a predictor of firm performance. Journal of Management, 39(3), 573–603. doi: 10.1177/0149206311424943


Perspectives on Counterproductive Work Behavior Harris, W. G., Jones, J. W., Klion, R., Arnold, D. W., Camara, W., & Cunningham, M. R. (2012). Test publishers’ perspective on “An updated meta-analysis”: Comment on Van Iddekinge, Roth, Raymark, and Odle-Dusseau (2012). Journal of Applied Psychology, 97(3), 531–536. doi: 10.1037/a0024767 22582727 Hastings, S. E., & O’Neill, T. A. (2009). Predicting workplace deviance using broad versus narrow personality variables. Personality and Individual Differences, 47(4), 289–293. doi: 10.1016/j.paid.2009.03.015 Hedge, J. W., Borman, W. C., & Lammlein, S. E. (2006). The aging workforce: Realities, myths, and implications for organizations. Washington, DC: American Psychological Association; US. doi. 10.1037/11325–000 Hershcovis, M. S., & Barling, J. (2010). Towards a multi-foci approach to workplace aggression: A meta-analytic review of outcomes from different perpetrators. Journal of Organizational Behavior, 31(1), 24–44. doi: 10.1002/job.621 Hershcovis, M. S., Turner, N., Barling, J., Arnold, K. A., Dupre, K. E., Inness, M., . . . Sivanathan, N. (2007). Predicting workplace aggression: A meta-analysis. Journal of Applied Psychology, 92(1), 228–238. doi: 10.1037/0021–9010.92.1.228 Hoffmann, E. A. (2008). “Revenge” and “Rescue”: Workplace deviance in the taxicab industry. Sociological Inquiry, 78(3), 270–289. doi: 10.1111/j.1475–682X.2008.00240.x Hollinger, R. C., & Clark, J. P. (1982). Formal and informal social controls of employee deviance. The Sociological Quarterly, 23(3), 333–343. doi: 10.1111/j.1533–8525.1982.tb01016.x Holtz, B. C., & Harold, C. M. (2013). Interpersonal justice and deviance: The moderating effects of interpersonal justice values and justice orientation. Journal of Management, 39(2), 339–365. doi: 10.1177/0149206310390049 Hopkins, B. (2014). Explaining variations in absence rates: Temporary and agency workers in the food manufacturing sector. Human Resource Management Journal, 24(2), 227–240. doi: 10.1111/j.1748–8583.2012.00206.x Hutchinson, M., Jackson, D., Wilkes, L., & Vickers, M. H. (2008). A new model of bullying in the nursing workplace: Organizational characteristics as critical antecedents. Advances in Nursing Science Violence, Injury, and Human Safety April/June, 31(2), E60–E71. doi: 10.1097/01.ANS.0000319572.37373.0c Hystad, S. W., Eid, J., & Brevik, J. I. (2011). Effects of psychological hardiness, job demands, and job control on sickness absence: A prospective study. Journal of Occupational Health Psychology, 16(3), 265–278. doi: 10.1037/a0022904 Ilies, R., Peng, A. C., Savani, K., & Dimotakis, N. (2013). Guilty and helpful: An emotion-based reparatory model of voluntary work behavior. Journal of Applied Psychology, 98(6), 1051–1059. doi: 10.1037/ a0034162 Iliescu, D., Ispas, D., Sulea, C., & Ilie, A. (2015). Vocational fit and counterproductive work behaviors: A self-regulation perspective. Journal of Applied Psychology, 100(1), 21–39. doi: 10.1037/a0036652 Jain, A., Petty, E. M., Jaber, R. M., Tackett, S., Purkiss, J., Fitzgerald, J., & White, C. (2014). What is appropriate to post on social media? Ratings from students, faculty members and the public. Medical Education, 48(2), 157–169. doi: 10.1111/medu.12282 James, L. R., McIntyre, M. D., Glisson, C. A., Green, P. D., Patton, T. W., LeBreton, J. M., . . . Williams, L. J. (2005). A conditional reasoning measure for aggression. Organizational Research Methods, 8(1), 69–99. doi: 10.1177/1094428104272182 Jensen, J. M., Opland, R. A., & Ryan, A. M. (2010). 
Psychological contracts and counterproductive work behaviors: Employee responses to transactional and relational breach. Journal of Business and Psychology, 25(4), 555–568. doi: 10.1007/s10869–009–9148–7 Jiang, K., Liu, D., McKay, P. F., Lee, T. W., & Mitchell, T. R. (2012). When and how is job embeddedness predictive of turnover? A meta-analytic investigation. Journal of Applied Psychology, 97(5), 1077–1096. doi: 10.1037/a0028610 22663557 Johns, G. (2010). Presenteeism in the workplace: A review and research agenda. Journal of Organizational Behavior, 31(4), 519–542. doi: 10.1002/job630 Johns, G., & Miraglia, M. (2015). The reliability, validity, and accuracy of self-reported absenteeism from work: A meta-analysis. Journal of Occupational Health Psychology, 20(1), 1–14. doi: 10.1037/a0037754 25181281 Johnson, M. D., Holley, E. C., Morgeson, F. P., LaBonar, D., & Stetzer, A. (2014). Outcomes of absence control initiatives: A quasi-experimental investigation into the effects of policy and perceptions. Journal of Management, 40(4), 1075–1097. doi: 10.1177/0149206311423822 Jones, D. A. (2009). Getting even with one’s supervisor and one’s organization: Relationships among types of injustice, desires for revenge, and counterproductive work behaviors. Journal of Organizational Behavior, 30(4), 525–542. doi: 10.1002/job.563 Judge, T. A., LePine, J. A., & Rich, B. L. (2006). Loving yourself abundantly: Relationship of the narcissistic personality to self- and other perceptions of workplace deviance, leadership, and task and contextual performance. Journal of Applied Psychology, 91(4), 762–776. doi: 10.1037/0021–9010.91.4.762


Maria Rotundo and Paul E. Spector Kao, F-H., Cheng, B-S., Kuo, C-C., & Huang, M-P. (2014). Stressors, withdrawal, and sabotage in frontline employees: The moderating effects of caring and service climates. Journal of Occupational and Organizational Psychology, 87(4), 755–780. doi: 10.1111/joop.12073 Kaplan, S., Bradley, J. C., Luchman, J. N., & Haynes, D. (2009). On the role of positive and negative affectivity in job performance: A meta-analytic investigation. Journal of Applied Psychology, 94(1), 162–176. doi: 10.1037/a0013115 Karl, K., Peluchette, J., & Schlaegel, C. (2010). Who’s posting Facebook faux pas? A cross-cultural examination of personality differences. International Journal of Selection and Assessment, 18(2), 174–186. doi: 10.1111/j.1468–2389.2010.00499 Kehoe, R. R., & Wright, P. M. (2013). The impact of high-performance human resource practices on employees’ attitudes and behaviors. Journal of Management, 39(2), 366–391. doi: 10.1177/0149206310365901 Kessler, S. R., Spector, P. E., Chang, C-H., & Parr, A. D. (2008). Organizational violence and aggression: Development of the three-factor violence climate survey. Work and Stress, 22(2), 108–124. doi: 10.1080/02678370802187926 Kiazad, K., Holtom, B. C., Hom, P. W., & Newman, A. (2015). Job embeddedness: A multifoci theoretical extension. Journal of Applied Psychology, 100(3), 641–659. doi: 10.1037/a0038919 25774569 Kim, K., del Carmen Triana, M., Chung, K., & Oh, N. (2016). When do employees cyberloaf ? An interactionist perspective examining personality, justice, and empowerment. Human Resource Management, 55, 1041–1058. Kluemper, D. H., DeGroot, T., & Choi, S. (2013). Emotion management ability: Predicting task performance, citizenship, and deviance. Journal of Management, 39(4), 878–905. doi: 10.1177/0149206311407326 Kluemper, D. H., McLarty, B. D., & Bing, M. N. (2015). Acquaintance ratings of the big five personality traits: Incremental validity beyond and interactive effects with self-reports in the prediction of workplace deviance. Journal of Applied Psychology, 100(1), 237–248. doi: 10.1037/a0037810 Kluemper, D. H., Rosen, P. A., & Mossholder, K. W. (2012). Social networking websites, personality ratings, and the organizational context: More than meets the eye? Journal of Applied Social Psychology, 42(5), 1143–1172. doi: 10.1111/j.1559–1816.2011.00881.x Krischer, M. M., Penney, L. M., & Hunter, E. M. (2010). Can counterproductive work behaviors be productive? CWB as emotion-focused coping. Journal of Occupational Health Psychology, 15(2), 154–166. doi: 10.1037/a0018349 Kristof-Brown, A. L., Zimmerman, R. D., & Johnson, E. C. (2005). Consequences of individuals’ fit at work: A meta-analysis of person-job, person-organization, person-group, and person-supervisor fit. Personnel Psychology, 58(2), 281–342. doi: 10.1111/j.1744–6570.2005.00672.x Kruttschnitt, C. (2013). Gender and crime. Annual Review of Sociology, 39, 291–308. doi: 10.1146/ annurev-soc-071312–145605 Langton, L., & Hollinger, R. C. (2005). Correlates of crime losses in the retail industry. Security Journal, 18(3), 27–44. doi: 10.1057/palgrave.sj.8340202 Langton, L., Piquero, N. L., & Hollinger, R. C. (2006). An empirical test of the relationship between employee theft and low self-control. Deviant Behavior, 27(5), 537–565. doi: 10.1080/01639620600781548 Lapierre, L. M., Bonaccio, S., & Allen, T. D. (2009). The separate, relative, and joint effects of employee job performance domains on supervisors’ willingness to mentor. 
Journal of Vocational Behavior, 74(2), 135–144. doi: 10.1016/j.jvb.2009.01.005 Law, R., Dollard, M. F., Tuckey, M. R., & Dormann, C. (2011). Psychosocial safety climate as a lead indicator of workplace bullying and harassment, job resources, psychological health and employee engagement. Accident Analysis and Prevention, 43(5), 1782–1793. doi: 10.1016/j.aap.2011.04.010 Le, K., Donnellan, M. B., Spilman, S. K., Garcia, O. P., & Conger, R. (2014). Workers behaving badly: Associations between adolescent reports of the Big Five and counterproductive work behaviors in adulthood. Personality and Individual Differences, 61–62, 7–12. doi: 10.1016/j.paid.2013.12.016 Levashina, J., & Campion, M. A. (2009). Expected practices in background checking: Review of the human resource management literature. Employee Responsibilities and Rights Journal, 21(3), 231–249. doi: 10.1007/ s10672–009–9111–9 Lian, H., Ferris, D. L., & Brown, D. J. (2012). Does taking the good with the bad make things worse? How abusive supervision and leader-member exchange interact to impact need satisfaction and organizational deviance. Organizational Behavior and Human Decision Processes, 117(1), 41–52. doi: 10.1016/j.obhdp.2011.10.003 Lian, H., Ferris, D. L., Morrison, R., & Brown, D. J. (2014). Blame it on the supervisor or the subordinate? Reciprocal relations between abusive supervision and organizational deviance. Journal of Applied Psychology, 99(4), 651–664. doi: 10.1037/a0035498 Liao, H., Chuang, A., & Joshi, A. (2008). Perceived deep-level dissimilarity: Personality antecedents and impact on overall job attitude, helping, work withdrawal, and turnover. Organizational Behavior and Human Decision Processes, 106(2), 106–124. doi: 10.1016/j.obhdp.2008.01.002


Perspectives on Counterproductive Work Behavior Lim, V. K. G., & Teo, T. S. H. (2009). Mind your E-manners: Impact of cyber incivility on employees’ work attitude and behavior. Information & Management, 46(8), 419–425. doi: 10.1016/j.im.2009.06.006 Liu, J., Kwan, H. K., Wu, L. Z., & Wu, W-K. (2010). Abusive supervision and subordinate supervisordirected deviance: The moderating role of traditional values and the mediating role of revenge cognitions. Journal of Occupational and Organizational Psychology, 83(4), 835–856. doi: 10.1348/096317909x485216 Lucero, M. A., Allen, R. E., & Elzweig, B. (2013). Managing employee social networking: Evolving views from the National Labor Relations Board. Employee Responsibilities and Rights Journal, 25(3), 143–158. doi: 10.1007/s10672–012–9211–9 Luksyte, A., Waite, E., Avery, D. R., & Roy, R. (2013). Held to a different standard: Racial differences in the impact of lateness on advancement opportunity. Journal of Occupational and Organizational Psychology, 86(2), 142–165. doi: 10.1111/joop.12010 Mainiero, L. A., & Jones, K. J. (2013). Sexual harassment versus workplace romance: Social media spillover and textual harassment in the workplace. Academy of Management Perspectives, 27(3), 187–203. doi: 10.5465/amp.2012.0031 Marcus, B., & Schuler, H. (2004). Antecedents of counterproductive behavior at work: A general perspective. Journal of Applied Psychology, 89(4), 647–660. doi: 10.1037/0021–9010.89.4.647 Marcus, B., Taylor, O. A., Hastings, S. E., Sturm, A., & Weigelt, O. (2016). The structure of counterproductive work behavior: A review, a structural meta-analysis, and a primary study. Journal of Management, 42(1), 203–233. doi: 10.1177/0149206313503019 Markova, G., & Folger, R. (2012). Every cloud has a silver lining: Positive effects of deviant coworkers. Journal of Social Psychology, 152(5), 586–612. doi: 10.1080/00224545.2012.671201 Martin, L. E. (2010). Time banditry: Validation of a measure of counterproductive work behavior. Dissertation Abstracts International: Section B: The Sciences and Engineering, 71(3-B), 2084. Retrieved from http://ovidsp.ovid. com/ovidweb.cgi?T=JS&CSC=Y&NEWS=N&PAGE=fulltext&D=psyc7&AN=2010–99180–088 Mawritz, M. B., Dust, S. B., & Resick, C. J. (2014). Hostile climate, abusive supervision, and employee coping: Does conscientiousness matter? Journal of Applied Psychology, 99(4), 737–747. doi: 10.1037/ a0035863 Mayer, D. M., Thau, S., Workman, K. M., Van Dijke, M., & De Cremer, D. (2012). Leader mistreatment, employee hostility, and deviant behaviors: Integrating self-uncertainty and thwarted needs perspectives on deviance. Organizational Behavior and Human Decision Processes, 117(1), 24–40. doi: 10.1016/j. obhdp.2011.07.003 Maynard, D. C., & Parfyonova, N. M. (2013). Perceived overqualification and withdrawal behaviours: Examining the roles of job attitudes and work values. Journal of Occupational and Organizational Psychology, 86(3), 435–455. doi: 10.1111/joop.12006 McFarland, L. A., & Ployhart, R. E. (2015). Social media: A contextual framework to guide research and practice. Journal of Applied Psychology, 100(6), 1653–1677. doi: 10.1037/a0039244 Meier, L. L., & Spector, P. E. (2013). Reciprocal effects of work stressors and counterproductive work behavior: A five-wave longitudinal study. Journal of Applied Psychology, 98(3), 529–539. doi: 10.1037/ a0031732 Meyer, J. P., Stanley, D. J., Herscovitch, L., & Topolnytsky, L. (2002). 
Affective, continuance, and normative commitment to the organization: A meta-analysis of antecedents, correlates, and consequences. Journal of Vocational Behavior, 61(1), 20–52. doi: 10.1006/jvbe.2001.1842 Miller, M. B. (2013). Avatars and social media: Employment law risks and challenges in the virtual world. FDCC Quarterly, 63(4), 279–294. Retrieved from http://search.ebscohost.com/login. aspx?direct=true&db=lpb&AN=95414233&site=ehost-live Mitchell, T. R., Holtom, B. C., Lee, T. W., Sablynski, C. J., & Erez, M. (2001). Why people stay: Using job embeddedness to predict voluntary turnover. Academy of Management Journal, 44(6), 1102–1121. doi: 10.2307/3069391 Newness, K., Steinert, J., & Viswesvaran, C. (2012). Effects of personality on social network disclosure: Do emotionally intelligent individuals post inappropriate content? Psychological Topics, 21(3), 473–486. Ng, T. W. H., & Feldman, D. C. (2008). The relationship of age to ten dimensions of job performance. Journal of Applied Psychology, 93(2), 392–423. doi: 10.1037/0021–9010.93.2.392 Ng, T. W. H., & Feldman, D. C. (2009a). How broadly does education contribute to job performance? Personnel Psychology, 62(1), 89–134. doi: 10.1111/j.1744–6570.2008.01130.x Ng, T. W. H., & Feldman, D. C. (2009b). Occupational embeddedness and job performance. Journal of Organizational Behavior, 30(7), 863–891. doi: 10.1002/job.580 Ng, T. W. H., & Feldman, D. C. (2009c). Re-examining the relationship between age and voluntary turnover. Journal of Vocational Behavior, 74(3), 283–294. doi: 10.1016/j.jvb.2009.01.004 Ng, T. W. H., & Feldman, D. C. (2010). Organizational tenure and job performance. Journal of Management, 36(5), 1220–1250. doi: 10.1177/0149206309359809


Maria Rotundo and Paul E. Spector Nielsen, A-K. L. (2008). Determinants of absenteeism in public organizations: A unit-level analysis of work absence in a large Danish municipality. International Journal of Human Resource Management, 19(7), 1330–1348. doi: 10.1080/09585190802110158 Nielsen, M. B., & Einarsen, S. (2012). Outcomes of exposure to workplace bullying: A meta-analytic review. Work and Stress, 26(4), 309–332. doi: 10.1080/02678373.2012.734709 O’Boyle, E. H., Jr., Forsyth, D. R., Banks, G. C., & McDaniel, M. A. (2012). A meta-analysis of the Dark Triad and work behavior: A social exchange perspective. Journal of Applied Psychology, 97(3), 557–579. doi: 10.1037/a0025679 22023075 O’Brien, K. E., & Allen, T. D. (2008). The relative importance of correlates of organizational citizenship behavior and counterproductive work behavior using multiple sources of data. Human Performance, 21(1), 62–88. doi: 10.1080/08959280701522189 Oh, I-S., Guay, R. P., Kim, K., Harold, C-M., Lee, J-H., Heo, C-G., & Shin, K-H. (2014). Fit happens globally: A meta-analytic comparison of the relationships of person-environment fit dimensions with work attitudes and performance across East Asia, Europe, and North America. Personnel Psychology, 67(1), 99–152. doi: 10.1111/peps.12026 Oh, I-S., Lee, K., Ashton, M. C., & de Vries, R. E. (2011). Are dishonest extraverts more harmful than dishonest introverts? The interaction effects of honesty-humility and extraversion in predicting workplace deviance. Applied Psychology, 60(3), 496–516. doi: 10.1111/j.1464–0597.2011.00445.x O’Neill, T. A., & Hastings, S. E. (2011). Explaining workplace deviance behavior with more than just the “Big Five”. Personality and Individual Differences, 50(2), 268–273. doi: 10.1016/j.paid.2010.10.001 Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: Findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology, 78(4), 679–703. doi: 10.1037/0021–9010.78.4.679 Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (2003). Personality and absenteeism: A meta-analysis of integrity tests. European Journal of Personality, 17, S19–S38. doi: 10.1002/per.487 Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (2012). Integrity tests predict counterproductive work behaviors and job performance well: Comment on Van Iddekinge, Roth, Raymark, and Odle-Dusseau (2012). Journal of Applied Psychology, 97(3), 537–542. doi: 10.1037/a0024825 22582728 Ottinot, R. C. (2011). A multi-level study investigating the impact of workplace civility climate on incivility and employee well-being. Dissertation Abstracts International: Section B: The Sciences and Engineering, 72(2-B), IV–VII, 1–88. Parks, K. M., & Steelman, L. A. (2008). Organizational wellness programs: A meta-analysis. Journal of Occupational Health Psychology, 13(1), 58–68. doi: 10.1037/1076–8998.13.1.58 Pate, R. L. (2012). Invisible discrimination: Employers, social media sites and passwords in the U.S. International Journal of Discrimination and the Law, 12(3), 133–146. doi: 10.1177/1358229112470300 Patton, E. (2011). The devil is in the details: Judgments of responsibility and absenteeism from work. Journal of Occupational and Organizational Psychology, 84(4), 759–779. doi: 10.1348/096317910x521510 Penney, L. M., Hunter, E. M., & Perry, S. J. (2011). Personality and counterproductive work behaviour: Using conservation of resources theory to narrow the profile of deviant employees. 
Journal of Occupational and Organizational Psychology, 84(1), 58–77. doi: 10.1111/j.2044–8325.2010.02007.x Penney, L. M., & Spector, P. E. (2002). Narcissism and counterproductive work behavior: Do bigger egos mean bigger problems? International Journal of Selection and Assessment, 10(1–2), 126–134. doi: 10.1111/1468–2389.00199 Peretz, H., & Fried, Y. (2012). National cultures, performance appraisal practices, and organizational absenteeism and turnover: A study across 21 countries. Journal of Applied Psychology, 97(2), 448–459. doi: 10.1037/a0026011 Podsakoff, N. P., LePine, J. A., & LePine, M. A. (2007). Differential challenge stressor-hindrance stressor relationships with job attitudes, turnover intentions, turnover, and withdrawal behavior: A meta-analysis. Journal of Applied Psychology, 92(2), 438–454. doi: 10.1037/0021–9010.92.2.438 Podsakoff, P. M., MacKenzie, S. B., Lee, J. Y., & Podsakoff, N. P. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88(5), 879–903. doi: 10.1037/0021–9010.88.5.879 14516251 Poropat, A. E. (2009). A meta-analysis of the five-factor model of personality and academic performance. Psychological Bulletin, 135(2), 322–338. doi: 10.1037/a0014996 19254083 Reynolds, C. A., Shoss, M. K., & Jundt, D. K. (2015). In the eye of the beholder: A multi-stakeholder perspective of organizational citizenship and counterproductive work behaviors. Human Resource Management Review, 25(1), 80–93. doi: 10.1016/j.hrmr.2014.06.002 Roberts, G., & Sambrook, S. (2014). Social networking and HRD. Human Resource Development International, 17(5), 577–587. doi: 10.1080/13678868.2014.969504


Perspectives on Counterproductive Work Behavior Robinson, S. L., & Bennett, R. J. (1995). A typology of deviant workplace behaviors: A multidimensional scaling study. Academy of Management Journal, 38(2), 555–572. doi: 10.2307/256693 Roth, P. L., Bobko, P., Van Iddekinge, C. H., & Thatcher, J. B. (2016). Social media in employee-selection-related decisions: A research agenda for uncharted territory. Journal of Management, 42, 269–298. Rotundo, M., & Sackett, P. R. (2002). The relative importance of task, citizenship, and counterproductive performance to global ratings of job performance: A policy-capturing approach. Journal of Applied Psychology, 87(1), 66–80. doi: 10.1037/0021–9010.87.1.66 11916217 Rotundo, M., & Spector, P. E. (2010). Counterproductive work behavior and withdrawal. In J. L. Farr & N. T. Tippins (Eds.), Handbook of employee selection (pp. 489–511). New York, NY: Routledge/Taylor & Francis Group. Rotundo, M., & Xie, J. L. (2008). Understanding the domain of counterproductive work behaviour in China. International Journal of Human Resource Management, 19(5), 856–877. doi: 10.1080/09585190801991400 Sackett, P. R., & DeVore, C. J. (2001). Counterproductive behaviors at work. In N. Anderson, D. Ones, H. Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work and organizational psychology, Vol. 1: Personnel psychology (pp. 145–164). Thousand Oaks, CA: Sage Publications Ltd. Sackett, P. R., & Schmitt, N. (2012). On reconciling conflicting meta-analytic findings regarding integrity test validity. Journal of Applied Psychology, 97(3), 550–556. doi: 10.1037/a0028167 22582730 Salgado, J. F. (2002). The Big Five personality dimensions and counterproductive behaviors. International Journal of Selection and Assessment, 10(1–2), 117–125. doi: 10.1111/1468–2389.00198 Sampson, R. J., & Laub, J. H. (1992). Crime and deviance in the life course. Annual Review of Sociology, 63–84. doi: 10.1146/annurev.so.18.080192.000431 Schneider, B., & Bowen, D. E. (1985). Employee and customer perceptions of service in banks: Replication and extension. Journal of Applied Psychology, 70(3), 423–433. doi: 10.1037//0021–9010.70.3.423 Scott, B. A., & Barnes, C. M. (2011). A multilevel field investigation of emotional labor, affect, work withdrawal, and gender. Academy of Management Journal, 54(1), 116–136. Semmer, N. K., Tschan, F., Meier, L. L., Facchin, S., & Jacobshagen, N. (2010). Illegitimate tasks and counterproductive work behavior. Applied Psychology, 59(1), 70–96. doi: 10.1111/j.1464–0597.2009.00416.x Shockley, K. M., Ispas, D., Rossi, M. E., & Levine, E. L. (2012). A meta-analytic investigation of the relationship between state affect, discrete emotions, and job performance. Human Performance, 25(5), 377– 411. doi: 10.1080/08959285.2012.721832 Simpson, S. S. (2013). White-collar crime: A review of recent developments and promising directions for future research. Annual Review of Sociology, 39, 309–331. doi: 10.1146/annurev-soc-071811–145546 Soane, E., Shantz, A., Alfes, K., Truss, C., Rees, C., & Gatenby, M. (2013). The association of meaningfulness, well-being, and engagement with absenteeism: A moderated mediation model. Human Resource Management, 52(3), 441–456. doi: 10.1002/hrm.21534 Spector, P. E., & Fox, S. (2005). The stressor-emotion model of counterproductive work behavior. In S. Fox & P. E. Spector (Eds.), Counterproductive work behavior: Investigations of actors and targets (pp. 151–174). Washington, DC: American Psychological Association; US. Spector, P. E., Fox, S., Penney, L. 
M., Bruursema, K., Goh, A., & Kessler, S. (2006). The dimensionality of counterproductivity: Are all counterproductive behaviors created equal? Journal of Vocational Behavior, 68(3), 446–460. doi: 10.1016/j.jvb.2005.10.005 Spector, P. E., & Zhou, Z. E. (2014). The moderating role of gender in relationships of stressors and personality with counterproductive work behavior. Journal of Business and Psychology, 29(4), 669–681. doi: 10.1007/s10869–013–9307–8 Steffensmeier, D., & Allan, E. (1996). Gender and crime: Toward a gendered theory of female offending. Annual Review of Sociology, 22, 459–487. Steffensmeier, D. J., Allan, E. A., Harer, M. D., & Streifel, C. (1989). Age and the distribution of crime. American Journal of Sociology, 94(4), 803–831. doi: 10.1086/229069 Stewart, S. M., Bing, M. N., Davison, H. K., Woehr, D. J., & McIntyre, M. D. (2009). In the eyes of the beholder: A non-self-report measure of workplace deviance. Journal of Applied Psychology, 94(1), 207–215. doi: 10.1037/a0012605 Swider, B. W., & Zimmerman, R. D. (2010). Born to burnout: A meta-analytic path model of personality, job burnout, and work outcomes. Journal of Vocational Behavior, 76(3), 487–506. doi: 10.1016/j.jvb.2010.01.003 Swider, B. W., & Zimmerman, R. D. (2014). Prior and future withdrawal and performance: A meta-analysis of their relations in panel studies. Journal of Vocational Behavior, 84(3), 225–236. doi: 10.1016/j. jvb.2014.01.004 ten Brummelhuis, L. L., ter Hoeven, C. L., de Jong, M. D. T., & Peper, B. (2013). Exploring the linkage between the home domain and absence from work: Health, motivation, or both? Journal of Organizational Behavior, 34(3), 273–290. doi: 10.1002/job.1789


Maria Rotundo and Paul E. Spector Tepper, B. J., Carr, J. C., Breaux, D. M., Geider, S., Hu, C., & Hua, W. (2009). Abusive supervision, intentions to quit, and employees’ workplace deviance: A power/dependence analysis. Organizational Behavior and Human Decision Processes, 109(2), 156–167. doi: 10.1016/j.obhdp.2009.03.004 Thau, S., Bennett, R. J., Mitchell, M. S., & Marrs, M. B. (2009). How management style moderates the relationship between abusive supervision and workplace deviance: An uncertainty management theory perspective. Organizational Behavior and Human Decision Processes, 108(1), 79–92. doi: 10.1016/j. obhdp.2008.06.003 Thau, S., & Mitchell, M. S. (2010). Self-gain or self-regulation impairment? Tests of competing explanations of the supervisor abuse and employee deviance relationship through perceptions of distributive justice. Journal of Applied Psychology, 95(6), 1009–1031. doi: 10.1037/a0020540 Torche, F. (2011). Is a college degree still the great equalizer? Intergenerational mobility across levels of schooling in the United States. American Journal of Sociology, 117(3), 763–807. doi: 10.1086/661904 Tucker, J. S., Sinclair, R. R., Mohr, C. D., Adler, A. B., Thomas, J. L., & Salvi, A. D. (2009). Stress and counterproductive work behavior: Multiple relationships between demands, control, and soldier indiscipline over time. Journal of Occupational Health Psychology, 14(3), 257–271. doi: 10.1037/a0014951 Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H. N. (2012a). The criterion-related validity of integrity tests: An updated meta-analysis. Journal of Applied Psychology, 97(3), 499–530. doi: 10.1037/a0021196 21319880 Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H. N. (2012b). The critical role of the research question, inclusion criteria, and transparency in meta-analyses of integrity test research: A reply to Harris et al. (2012) and Ones, Viswesvaran, and Schmidt (2012). Journal of Applied Psychology, 97(3), 543–549. doi: 10.1037/a0026551 22582729 Victor, B., & Cullen, J. B. (1988). The organizational bases of ethical work climates. Administrative Science Quarterly, 33(1), 101–125. doi: 10.2307/2392857 Wu, J.,  & LeBreton, J. M. (2011). Reconsidering the dispositional basis of counterproductive work behavior: The role of aberrant personality. Personnel Psychology, 64(3), 593–626. doi: 10.1111/j.1744–6570.2011.01220.x Yang, J., & Diefendorff, J. M. (2009). The relations of daily counterproductive workplace behavior with emotions, situational antecedents, and personality moderators: A diary study in Hong Kong. Personnel Psychology, 62(2), 259–295. doi: 10.1111/j.1744–6570.2009.01138.x Yang, L-Q., Caughlin, D. E., Gazica, M. W., Truxillo, D. M., & Spector, P. E. (2014). Workplace mistreatment climate and potential employee and organizational outcomes: A meta-analytic review from the target’s perspective. Journal of Occupational Health Psychology, 19(3), 315–335. doi: 10.1037/a0036905 Zagenczyk, T. J., Restubog, S. L. D., Kiewitz, C., Kiazad, K., & Tang, R. L. (2014). Psychological contracts as a mediator between Machiavellianism and employee citizenship and deviant behaviors. Journal of Management, 40(4), 1098–1122. doi: 10.1177/0149206311415420 Zhao, H., Wayne, S. J., Glibkowski, B. C., & Bravo, J. (2007). The impact of psychological contract breach on work-related outcomes: A meta-analysis. Personnel Psychology, 60(3), 647–680. doi: 10.1111/j.1744–6570.2007.00087.x Zhou, Z. E., Meier, L. L., & Spector, P. E. (2014). 
The role of personality and job stressors in predicting counterproductive work behavior: A three-way interaction. International Journal of Selection and Assessment, 22(3), 286–296. doi: 10.1111/ijsa.12077 Zimmerman, R. D. (2008). Understanding the impact of personality traits on individuals’ turnover decisions: A meta-analytic path model. Personnel Psychology, 61(2), 309–348. doi: 10.1111/j.1744–6570.2008.00115.x Zimmerman, R. D., & Darnold, T. C. (2009). The impact of job performance on employee turnover intentions and the voluntary turnover process: A meta-analysis and path model. Personnel Review, 38(1–2), 142–158. doi: 10.1108/00483480910931316 Zohar, D., & Luria, G. (2004). Climate as a social-cognitive construction of supervisory safety practices: Scripts as proxy of behavior patterns. Journal of Applied Psychology, 89(2), 322–333. doi: 10.1037/0021–9010.89.2.322


23 DEFINING AND MEASURING RESULTS OF WORKPLACE BEHAVIOR

RYAN S. O’LEARY AND ELAINE D. PULAKOS

The previous chapters in this section focused on the measurement of task performance, constructive personal behavior (citizenship and adaptability), and counterproductive behavior and how these fit in the context of conducting selection research. Each of these represents a conceptually distinct content area within the performance domain, and all consist of reasonably well-defined constructs that have been reliably and validly measured in the past and successfully used as criteria in validation research, albeit some more so than others. Alternatively, this chapter focuses on measuring results—the actual end products, outcomes, or deliverables individuals or teams produce on a job. Unlike other criterion constructs, discussions of a “results” construct are relatively rare in the industrial and organizational (I-O) psychology literature. Likewise, results measures have not been as well defined and researched as other types of performance measures (e.g., task, citizenship, adaptive, etc.). Thus, we know less about their reliability, validity, accuracy, and fairness compared with other, more commonly used performance measures. We also know less about how to develop effective results measures that will possess adequate psychometric properties and validity. Given that we already have several conceptually distinct, well-defined, and psychometrically sound performance measures that appear to comprehensively cover the criterion domain, one might reasonably question why we should bother adding results measures to the mix. The answer is that many organizations today are focusing on defining work in terms of the results employees and teams are expected to achieve, and likewise, they are evaluating and rewarding staff on the extent to which they have delivered tangible outcomes that are important to the organization’s success. Additionally, as employees are becoming more empowered and jobs more autonomous, it is increasingly important to hold employees accountable for achieving measurable results. Thus, if a situation arises in which we must conduct validation research using criterion measures that are available, chances are that we will increasingly encounter measures of results. In addition, operational performance measures are sometimes used as predictors in making promotion decisions. Here, again, such predictors are increasingly likely to include measures of results. Many of the performance measures used in validation research have focused on measuring work behavior, which is important to ensure job relevance. Behavioral measures have also been used extensively in the past as a basis for performance management. These measures focus on how employees get the job done; for example, how they contribute to a team, communicate, plan and organize work, and so forth. Irrespective of how productive employees may be, we are all familiar with the problems and disruptions they can cause if they are difficult to work with, unhelpful, or exhibit maladaptive behavior. Thus, evaluating workplace behavior is important. We are all also familiar with employees who are extremely helpful, communicate well, and are
nice to everyone, yet never seem to get anything done. This is why considering the results that an employee achieves is also an important part of overall performance measurement, and as mentioned, it is one that organizations are increasingly emphasizing. Although a choice can be made to assess results or behavior, it may be important to include both types of measures when comprehensive performance measurement is the goal (Landy & Trumbo, 1980; Pulakos, 2008), as would be the case when the measures are used as criteria for validation research or as predictors in selection or promotion processes (e.g., the use of accomplishment records for leadership selection). Because earlier chapters have discussed behavioral performance measurement in detail, our focus here is on how to obtain useful and meaningful measures of individual and team results. However, because relatively little research has been directed to measuring results, there is not an extensive literature to draw on that speaks directly to the quality and utility of results measures or how they relate to other, more commonly used predictors and criteria. Accordingly, we draw on related research to propose methods for developing results measures that should maximize their reliability, validity, and fairness. We begin by reviewing the debate that has surrounded measuring workplace behavior versus results and discuss why the measurement of results has become increasingly popular today. We then propose methods for developing results measures for individuals and teams and the associated challenges. We review the concept of cascading goals and provide guidelines for developing individual and team objectives, which are thought to be an important precursor to achieving organizationally relevant results. We then discuss evaluation methods that should facilitate accurate and fair measurement of the results employees achieve, using a combination of objective and subjective measures. Finally, we discuss individual difference constructs that are likely to predict performance results.
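As one illustration of combining the two types of measures discussed above, behavior ratings and results metrics can be standardized and weighted into a single performance composite. The sketch below is a minimal example with hypothetical scores and equal weights; it is not a recommended scoring scheme.

```python
# Minimal sketch of combining behavioral ratings and results metrics into
# a single performance composite via standardization and weighting.
# Weights and example scores are hypothetical.
import statistics

def zscores(values):
    """Standardize a list of scores to mean 0 and SD 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

behavior_ratings = [3.2, 4.5, 4.0, 2.8, 3.9]   # e.g., supervisor ratings (1-5 scale)
results_metrics = [105, 140, 120, 90, 150]     # e.g., units delivered

w_behavior, w_results = 0.5, 0.5               # assumed equal weights

composite = [w_behavior * b + w_results * r
             for b, r in zip(zscores(behavior_ratings), zscores(results_metrics))]

for i, score in enumerate(composite, start=1):
    print(f"employee {i}: composite z = {score:+.2f}")
```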

MEASURING INDIVIDUAL WORKPLACE BEHAVIOR VERSUS RESULTS

There have been longstanding differences of opinion about what aspects of employee performance should be measured: behavior, results, or both (see Bernardin, Hagan, Kane, & Villanova, 1998; Feldman, 1992; Latham, 1986; Murphy & Cleveland, 1991; Olian & Rynes, 1991). The measurement of each offers unique advantages and corresponding disadvantages. In this section, we briefly discuss these as well as the reasons for the increasingly popular trend of measuring employee performance in terms of results.

Many I-O psychologists have argued against measuring results, advocating instead for a focus on behavior. They argue that there are too many measurement problems associated with results-based criteria that undermine their usefulness (Dunnette, 1966; Guion, 1965). First, there are some jobs for which results measures are nonexistent (e.g., artistic and creative jobs, many research and development jobs), making it impossible for job performance to be evaluated in these terms. Second, the assessment of results is problematic because it can be impacted by factors outside of an employee’s direct control or be the result of team efforts. Indeed, it is likely that many of the nontrivial results that an individual achieves are at least somewhat a function of factors outside of his or her complete control. Consequently, the measurement of important results may inherently suffer from some amount of criterion contamination (Borman, 1991). Finally, an exclusive focus on results can yield deficient performance measurement because consideration is not given to how employees achieve their results. Although workers can achieve impressive results, overall performance is not effective if employees have a “results-at-any-cost” mentality and achieve outcomes in ways that are detrimental to others or the organization (Cardy, 1998).

To address these issues, job performance has typically been evaluated by measuring work behaviors via the use of subjective rating scales. One important advantage of using subjective ratings is that all of a job’s performance requirements can be described on a set of rating scales, thereby mitigating the deficiency problems that often plague results-based measurement (Borman, 1987). Also, by focusing on behaviors that lead to effective performance, criterion contamination resulting from situational factors outside of the employee’s control can begin to be reduced.


Although there are clearly challenges inherent in measuring results, the evaluation of behavior is not without issues of its own. First and foremost, the common practice of using subjective ratings to assess behavioral performance (see Chapter 20, this volume) yields measures with notoriously attenuated variance. This is particularly true when these ratings are collected for operational purposes (e.g., pay, promotion), circumstances in which a large proportion of employees are rated at the highest levels of the rating scale (Pulakos, 2004). This lack of discrimination among employees renders the measures virtually useless for validation research or for use as selection measures (with the notable exception being the identification of a handful of nonperformers). Although for-research-only ratings collected in validation studies tend to be more variable, lack of discrimination is a chronic problem with subjective ratings, undermining their reliability, validity, and utility. Second, advocates of results-based measurement argue that a focus exclusively on behavior misses what is most important, namely whether or not an employee actually delivered important bottom-line results. Although an employee can engage in highly effective behaviors, they are of little value if they do not result in organization-relevant outcomes (Bernardin et al., 1998). To that end, it has been suggested that behaviors should be measured only if they can be linked to outcomes that drive organizational success. In addition, research has shown that employees perform more effectively when they have specific goals and expectations so that they know what they are accountable for delivering (e.g., Locke, Shaw, Saari, & Latham, 1981). Defining and measuring the results each employee is expected to achieve and aligning those to organizational performance helps everyone work toward a common set of important goals.

Despite the difficulties associated with measuring results (e.g., criterion contamination and deficiency), there has been an increasingly popular trend over the last decade for organizations to adopt a results focus in the measurement of job performance. This is largely because business leaders and organizational consultants have become convinced that an exclusive focus on behaviors is remiss in not sufficiently emphasizing the importance of delivering meaningful results that are critical to organizational success. This orientation has likely been driven by intensified pressure from stockholders and increasingly formidable national and international competition. It is noteworthy that two other chapters in this volume share the perspective that results criteria are important indices of selection-related value. Chapter 10, this volume, makes this point in discussing the business value of selection as it relates to system- and organizational-level outcomes, whereas Chapter 5, in this volume, discusses this in relation to multilevel issues. Even public sector and not-for-profit organizations that have not traditionally driven toward results have adopted this focus to demonstrate their value. In the late 1990s, the Internal Revenue Service (IRS), the Federal Aviation Administration (FAA), and the Government Accountability Office (GAO) all initiated pay-for-performance systems, which focused on measuring and rewarding results. More recently, the U.S. Departments of Defense (DoD) and Homeland Security (DHS) have developed similar programs.
This results focus has become so pervasive that the U.S. Office of Personnel Management (OPM) codified procedures that require federal government agencies to develop performance management systems for executives that link their performance to results-oriented goals and to explicitly evaluate results.

MEASURING TEAM BEHAVIOR VERSUS RESULTS

Over the past two decades, organizations have steadily moved from individualized work in functional structures to team-based work systems (Kozlowski & Ilgen, 2006). In fact, team-based work is becoming a dominant organizational strategy for achieving important outcomes (Salas, Burke, & Fowlkes, 2006; Wildman, Bedwell, Salas, & Smith-Jentsch, 2011). The increase in the use of teams is often based on the assumption that they will lead to increases in productivity and efficiency because of characteristics that can be built into teams (e.g., skill diversity, ability for rapid response) that enable them to respond to emerging organizational challenges. As a result, team performance is becoming increasingly important to ensuring organizational success, even more so than individual performance. However, while a considerable amount of research has focused on building effective teams, far less work has been devoted to the measurement of team-based performance.


As with the measurement of individual performance, team-based performance can be measured through behaviors, results, or both (see Cannon-Bowers & Salas, 1997; McIntyre & Tedrow, 2004; Salas, Stagl, Burke, & Godwin, 2007). In team-based performance measurement, behavior is often defined in terms of process. Process measures assess the manner in which the work is completed or the mechanisms a team uses to accomplish its tasks, capturing behaviors such as communication, coordination, monitoring, conflict resolution, and back-up behavior (Marks, Mathieu, & Zaccaro, 2001; Salas, Sims, & Burke, 2005). Process measures are distinct from, although related to, results measures, which assess the quantity or quality of the outcomes of the work produced as a result of the team processes.

Many have advocated for a focus on process. Similar to the arguments made in relation to individual performance measures, experts have argued that team result measures are deficient. A team can produce a quality product but exhibit such poor teamwork and process that in the long run team performance may suffer, the team may burn itself out, or there will be a lack of willingness among team members to work together in the future (Hackman & Oldham, 1980; McIntyre & Tedrow, 2004). In fact, some have argued that one of the most important outcomes or results associated with team performance is team viability, the team’s desire to remain together during and after a performance event (Hackman, 1987). Additionally, results measures are not diagnostic in that they do not identify the underlying causes of outcomes, whereas process measures capture the behavioral mechanisms of performance (Cannon-Bowers & Salas, 1997).

As is the case when measuring individual performance, advocates of results-based measurement argue that a focus exclusively on process misses what is most important, namely whether or not the team actually delivered important results. A team can exhibit excellent teamwork skills and process but still deliver an inferior product or outcome. It is important to assess both process and results for comprehensive measurement of team performance (Wildman et al., 2011). Ultimately, the team’s success must be based in part on the results of their performance because outcomes are what organizations must predict and manage. Most organizations define team effectiveness with results measures such as quality of output and quantity of work (Cannon-Bowers & Bowers, 2011).

With results-oriented performance measurement increasingly emerging as a significant trend in the measurement of individual and team-based performance, the remainder of this chapter is devoted to methods for defining and evaluating results in a manner that will yield the highest quality measures possible. One important caveat to point out is that results have been fairly narrowly defined in the past to include only those outcomes that could be evaluated using highly objective criteria, such as dollar volume of sales. More recent operationalizations of results continue to emphasize objective measurement, but there has also been recognition that it may not be possible to translate every important aspect of a result into a bottom-line, objective metric. This has opened the door for the use of some subjective (i.e., judgmental) measures along with objective measures in assessing the quality of results.

DEFINING INDIVIDUAL PERFORMANCE OBJECTIVES

Measuring results relies on identifying performance objectives that state the outcomes an employee is expected to achieve in sufficient, measurable detail such that it is clear whether or not the objectives have been met. An important goal in many organizations today is ensuring that employees focus on achieving results that contribute to important organizational goals. For example, if improved teaming with strategic partners is a key organizational goal, the objectives set for employees should hold them accountable for seeking out and formalizing these relationships. The value of developing and linking goals at different levels has been written about extensively in the management by objectives (MBO) literature (Rodgers & Hunter, 1991). Linking organizational goals to individual goals not only helps focus employees’ attention on the most important things to achieve but also shows how their achievements support the organization’s mission. Additionally, by showing how work performed across the organization is related, it is more likely that everyone will be working in alignment to support the organization’s strategic direction and critical priorities (Hillgren & Cheatham, 2000; Schneier, Shaw, & Beatty, 1991).


To ensure alignment of goals across levels, organizations frequently implement the concept of cascading goals, in which the organization’s strategic goals are cascaded down from level to level until they ultimately reach individual employees. In such a system, each employee is accountable for accomplishing specific objectives that are related to higher-level goals, thus providing obvious and transparent connections between what an employee does on his or her job and the organization’s key goals (Banks & May, 1999; Hillgren & Cheatham, 2000). Figure 23.1 presents an example of linking four levels of organizational goals. As the connecting symbols in the figure indicate, not every goal applies to all levels. For example, only two of the five organizational goals apply to the Administrative Division. Likewise, only two of the Administrative Division’s goals apply to the Accounting and Finance Department. Finally, in this example, the person’s individual performance objectives support only one of the department’s goals. It is extremely unlikely that an individual’s performance objectives will relate to every goal at every level in the organization. What is shown in the example is much more typical, in which an individual’s objectives will support only a subset of higher-level goals.

Although the value of developing and linking individual and organizational objectives makes a great deal of sense in theory, practical implementations of this strategy have revealed some significant challenges that make the process much easier said than done. First, it is absolutely critical for organizations to set goals and objectives in a thoughtful and realistic way to ensure that mistakes made in setting the highest-level goals do not cascade down throughout the entire organization. For this reason, goals set by the top leadership of the organization need the most critical scrutiny and opportunities for correction or revision, things that do not always occur to the degree they should. Assuming the highest-level goals are well thought through and realistic, one of the challenges in cascading goals is that it is sometimes difficult to see direct relationships between high-level goals and what an individual does on the job. This is why organizational goals need to be translated or cascaded into increasingly refined goals at the division, department, and individual levels. The process of developing cascading goals usually requires several meetings in which organizational leaders first develop division goals that align with the organizational goals. Then, mid-level managers develop unit goals that align with division goals. Then, managers develop group goals that align with unit goals, and so on until the organizational goals are cascaded down to individual employees. The process of cascading goals thoughtfully and meaningfully is quite time-consuming and challenging.

FIGURE 23.1  Example of Cascaded Goals
(The original figure depicts goals cascading from the Organization’s Strategic Goals to Administrative Division Goals, to Accounting and Finance Department Goals, and finally to a Finance Professional’s Individual Objectives, with connecting symbols showing which higher-level goals each lower-level goal supports.)
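To make the cascading structure in Figure 23.1 concrete, the sketch below represents goals at each level as simple linked records and traces an individual objective up to the strategic goal it ultimately supports. It is only an illustration; the goal names and the single-parent linkage are hypothetical simplifications, since in practice an objective may support several higher-level goals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Goal:
    """A goal at any level: organization, division, department, or individual."""
    name: str
    level: str
    supports: Optional["Goal"] = None  # the higher-level goal this goal feeds (None at the top)

    def alignment_path(self) -> List[str]:
        """Trace the upward chain from this goal to the organization's strategic goal."""
        path, current = [f"{self.level}: {self.name}"], self
        while current.supports is not None:
            current = current.supports
            path.append(f"{current.level}: {current.name}")
        return path


# Hypothetical goals mirroring the four levels in Figure 23.1
org = Goal("Reduce overall operating costs by 5%", "organization")
division = Goal("Streamline administrative processing", "division", supports=org)
department = Goal("Automate routine accounting transactions", "department", supports=division)
objective = Goal("Implement self-service expense reporting by Q3", "individual", supports=department)

print("\n".join(objective.alignment_path()))
```

Running the sketch prints the individual objective followed by the department, division, and organizational goals it rolls up to, which is essentially the transparency that cascading goals are intended to provide.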


On average, organizations spend fewer than 10 hours per year on performance management activities for each employee (Bretz, Milkovich, & Read, 1992). However, the process of cascading goals requires considerably more time. In fact, it is not uncommon for organizations that are initiating cascading goals to take until the end of the second quarter of the operating year to complete the process. This poses difficulties, because half of the rating period may have passed before individual goals and expectations are set for employees, leaving little time for goal attainment. However, as organizations gain more experience with cascading goals, efficiencies are realized. The bottom line is that the implementation of cascading goals requires time, effort, and considerable hand-holding, at least initially, to ensure that the process is done well.

Once goals have been successfully cascaded down to the level just above the individual, there are two ways individual goals can be linked to these higher-level goals:

1. Start with performance objectives and work upward to link them to higher-level goals.
2. Start with higher-level goals that are relevant to an employee’s job and work downward to develop individual performance objectives.

The decision to link upward or downward is a personal preference. Some find it easier to start with something concrete from their job and work upward to a less-tangible concept. Others find it easier to start with a higher-level goal and develop something they can do on their job that relates to that goal. Figure 23.2 shows an example of how department goals could be cascaded down to individual objectives for a human resources (HR) professional. Note that the individual objectives are related to only one of the department goals. Objectives can be related to more than one goal at the next higher level, but as mentioned previously, it is unlikely that goals at one level will relate to all of the goals at the next level.

Human Resources Department Goals
• Develop a process to tie employee performance to rewards by the end of the year.
• Train all managers on the new performance management system by December 31.
• Reduce the average cost of processing retirement claims by 5% by the end of the year.
• Conduct a competency gap analysis for critical occupations by the end of the third quarter.
↓
Individual Objectives for Human Resources Professional
• Design and obtain leadership approval for a new employee incentive system by March.
• Implement an employee incentive system for all full-time employees with payouts to employees beginning in January.

FIGURE 23.2  Example of Individual Goals Cascaded From Departmental Goals

Several guidelines should be followed when developing individual performance objectives. Many of these are a direct outgrowth of the well-established principles found in the goal-setting literature (Locke & Latham, 1990). Following these guidelines will help to ensure that the objectives are clear, employees know what is expected, and they are motivated to achieve success.

• Objectives must be specific. Objectives must be clearly defined, identifying the end results employees are expected to achieve. Ambiguity is reduced by specifying the outcomes, products, or services in terms of quality, quantity, and timeliness expectations. Although research has continually found that well-defined objectives are associated with higher levels of performance, reviews of results-based performance measurement systems have shown that objectives are frequently not sufficiently defined or well written to clearly communicate what is expected of the employee.
• Objectives must be measurable. To the extent possible, objectives should be defined in terms of measurable outcomes relating to quality, quantity, and timeliness standards so that both managers and employees know when and whether they have been achieved. However, to comprehensively measure what is most important, it may be necessary to go beyond objective measures and allow for some subjective judgment (e.g., quality is sometimes difficult to operationalize in terms of concrete metrics). Later in the chapter, we discuss evaluation of objectives in detail and provide examples of objective and subjective criteria that can be used to measure results.
• Objectives must be difficult but achievable. The goal-setting literature has consistently shown that difficult but attainable objectives lead to more effective performance than moderately difficult goals (Locke & Latham, 1990). Goals that are perceived as challenging, but realistic, have been found to have the strongest impact on motivating employees to perform. Related to this idea is that the objective must be sufficiently within an employee’s control to achieve and not overly depend on outside factors.
• Objectives must be job relevant. Objectives should have a direct and obvious link to the employee’s job and important organizational success factors. As discussed, the use of cascading goals and objectives helps ensure that individual and organizational goals are aligned. In the section “Challenges associated with developing individual objectives and mitigation strategies,” we discuss how to use job analytic information as a basis for developing objectives, thus helping to ensure their content validity.
• Ideally, no more than three to five objectives should be set. Performance objectives should reflect significant products or outcomes that employees are expected to deliver. The recommendation to limit objectives to three to five is based on the fact that most employees will be unlikely to achieve more than this number of significant and important results in a year’s time. Consequently, establishing more than this number of objectives could be overwhelming and only serve to demotivate employees. Although it is usually possible to set subgoals for major objectives, and employees should do this to guide their own performance, it is not recommended that the objectives recorded in an employee’s performance plan contain this level of detail. Recording many narrowly defined subgoals in one’s formal performance plan can make it impractically time-consuming to maintain. This is because changes to formal performance plans often require review and approval from others (e.g., supervisors, second-line managers, and HR). In most circumstances, it will not make sense to include very detailed subgoals that may change regularly as the work evolves and require ongoing formal revision of the plan.
• Employees must be committed to the objectives. A key aspect of commitment that we have already discussed is that employees must feel that they can reach their objectives, or they will be demotivated to try. The best way to facilitate employees accepting their objectives is to make them an active part of the objective-setting process and work with them to arrive at objectives that are challenging yet achievable. Once managers and employees have come to agreement on the employee’s objectives in principle, asking employees to prepare the written description of their objectives helps enhance their ownership of them.
• Managers must show their commitment to the objectives. It is important for managers to show their support by providing guidance and resources as well as removing obstacles to goal achievement. The literature clearly shows that management commitment is critical to successful achievement of objectives (Rodgers, Hunter, & Rogers, 1993).

Realistically, meeting all of these requirements is difficult, if not impossible. Moreover, for most of these requirements there will be some variability in the level to which each can be met in a given context. The more these requirements can be met, the more effective the objectives will be. There is often an overemphasis on ensuring performance objectives adhere to “SMART” criteria (specific, measurable, attainable, realistic, time-bound), often at the expense of being meaningful and driving performance increases. As a result, organizations often spend a significant amount of time and money on training employees and managers to develop SMART goals without realizing any improvement in performance. What is of fundamental importance is ensuring that employees and managers work together to set ongoing expectations as work evolves and to monitor progress toward those expectations.
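Several of the guidelines above lend themselves to simple structural checks, for example, whether each objective has a metric, a target date, and a link to a higher-level goal, and whether the plan contains no more than three to five objectives. The sketch below illustrates such a screen using hypothetical field names; it can flag surface-level gaps, but, consistent with the caution about SMART box-checking, it is no substitute for managers and employees working out meaningful expectations together.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Objective:
    description: str
    metric: Optional[str] = None       # e.g., "95% accuracy" or "10% cost reduction"
    deadline: Optional[str] = None     # e.g., "June 30"
    linked_goal: Optional[str] = None  # higher-level goal the objective supports


def review_plan(objectives: List[Objective]) -> List[str]:
    """Flag surface-level gaps against the guidelines discussed above."""
    issues = []
    if not 3 <= len(objectives) <= 5:
        issues.append(f"Plan contains {len(objectives)} objectives; three to five are recommended.")
    for obj in objectives:
        if obj.metric is None:
            issues.append(f"'{obj.description}' lacks a quality, quantity, timeliness, or financial metric.")
        if obj.deadline is None:
            issues.append(f"'{obj.description}' lacks a target date.")
        if obj.linked_goal is None:
            issues.append(f"'{obj.description}' is not linked to a higher-level goal.")
    return issues


draft_plan = [Objective("Provide effective customer service")]  # vague, unmeasured, and unlinked
for issue in review_plan(draft_plan):
    print(issue)
```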

DEFINING TEAM-BASED PERFORMANCE OBJECTIVES

Extensive research has supported the tenets of goal-setting theory at the individual level. Only recently has this research addressed goal setting in teams. While the emerging research has found some differences between goal setting with individuals and teams, the cumulative literature suggests that there is a significant relationship between team-based goal setting and performance (Kramer, Thayer, & Salas, 2013).


Team-based goals provide direction, create motivation to enact strategies for goal attainment, and energize team members to work hard and persist, which in turn impacts performance. Many of the underlying principles of goal-setting theory have been found to generalize to teams (see Kleingeld, Van Mierlo, & Arends, 2011; Kramer et al., 2013; Latham & Locke, 2007). For example, Wegge and Haslam (2005) found that specific and challenging team goals led to better performance than “do your best” goals. A meta-analysis by O’Leary-Kelly, Martocchio, and Frank (1994) found that specific and difficult goals led to a one standard deviation improvement in team performance when compared to “do your best” goals. Finally, DeShon, Kozlowski, Schmidt, Milner, and Wiechmann (2004) found that team goals, goal commitment, and efficiency interact to determine the level of performance improvement expected from goal setting. The collective research findings suggest that many of the recommendations for developing performance objectives made above can be extrapolated from individuals to teams.

CHALLENGES ASSOCIATED WITH DEVELOPING INDIVIDUAL OBJECTIVES AND MITIGATION STRATEGIES

Although it may be intuitively appealing to develop individual employee objectives that link to organizational goals, there are several challenges associated with developing fair and effective objectives that result in reliable and valid performance measurement. In this section, we discuss seven major challenges inherent in identifying and setting objectives, along with recommendations for mitigating these.

Challenge 1: Training Managers and Staff to Write Effective Objectives

Managers and employees are not typically accustomed to developing objectives and therefore find it challenging to identify and clearly define them. One reason is that organizational members seem to naturally think in terms of the work behaviors that employees perform on the job rather than in tangible, well-defined outcomes. This may be because the materials they tend to review (e.g., job descriptions or vacancy announcements) typically contain work behaviors or job tasks. Identifying performance objectives requires going beyond tasks and defining the specific products, services, or outcomes that result from work activities. Training is necessary to help managers and employees understand what performance objectives are and how to write them in a clear and unambiguous manner. However, even after attending training, the quality of the objectives produced by different managers and employees varies greatly. It is especially helpful in the initial implementation process for individuals who know how to write effective objectives (e.g., trained HR staff or higher-level managers) to review the objectives for each employee and provide feedback on their appropriateness, clarity, and fairness. One advantage of a higher-level review is that it enables the objectives developed for similarly situated employees to be assessed for comparability and revised, if necessary. The process of receiving feedback from higher-level reviews further trains managers and employees how to write more effective objectives.

Challenge 2: Ensuring Objectives Are Job Relevant

In more routine, standard, and predictable jobs, it is often possible to predefine a set of objectives that apply uniformly to all employees at a given level using standard job analytic procedures. This not only saves time that would otherwise be spent by each manager and employee on developing individual objectives, but it also ensures that all employees in the same job are held accountable for delivering the same results. Standardized objectives are not only the most fair for employees, but they also allow straightforward comparisons to be made among employees in terms of the results they delivered.

In more unique jobs and situations, it may be impossible to predefine objectives that apply across positions, jobs, or organizations. Although a group of employees may occupy a given job, the specific results each individual is expected to achieve may vary depending on the nature of his or her assignments. For example, some organizational consultants may have production or sales results, others in essentially the same job may be responsible for developing and implementing systems, others may have specific levels of customer satisfaction outcomes they are expected to meet, and still others may have employee development or team-leadership goals. To the extent that people holding similar jobs have different goals and objectives, evaluating and comparing their performance in a fair and standardized manner becomes increasingly challenging. Under these circumstances, we recommend developing individual objectives that further define critical tasks from a comprehensive job analysis. This helps ensure that a common base of job-relevant information is used to develop objectives. Objectives derived in this manner will contain more specific information than the tasks or work behavior statements, such as what specific project, customer, product, etc. the employee is responsible for and what specific quality, quantity, and timeliness criteria will be measured.

Two examples of how objectives can be developed by further specifying validated work behaviors appear in Table 23.1. The first task is to evaluate and monitor the quality of product information supplied to potential customers. A reasonable objective for this task would be to monitor the specific channels that are used to provide information to customers and evaluate the accuracy and timeliness of the information supplied according to measurable criteria. The second task is to design, administer, analyze, and interpret surveys. Specifying what type of survey a given employee is responsible for and the timeline required for its completion allows this work task to be transformed into an individual objective.

TABLE 23.1 Transforming Work Tasks into Performance Objectives

Work Task: Evaluate and monitor the quality of information provided to potential customers
Transformed into Performance Objectives:
• Monitor calls to company call center and provide feedback to staff as necessary to ensure 95% accuracy of product information provided.
• Monitor responses to e-mail inquiries to ensure that responses are made within 24 hours and that accuracy of information provided is at least 95%.

Work Task: Design, administer, analyze, and evaluate surveys
Transformed into Performance Objective:
• Develop items, select vendor, and administer survey by January; analyze data, conduct focus groups to further understand survey results, and write report with clear, actionable, and feasible recommendations that requires no grammatical editing and minimal substantive editing by July.

Challenge 3: Helping Managers Develop Comparable and Fair Objectives for Employees

A problem that occurs when different managers set objectives for employees who occupy the same job is that natural inconsistencies among them can result in objectives that are too easy, unattainable, or unsystematic across employees (Jamieson, 1973; Strauss, 1972). This often results in employees in the same job being evaluated against objectives that vary significantly in their difficulty and complexity. For example, assume one employee’s objective is to perform a simple information-cataloguing project, whereas another employee in the same job and level is given the objective of managing the design and implementation of a complex information management system.


If the value of these different objectives is not established and there is no mechanism in place to review objectives for fairness and consistency across employees, both of these employees could be considered performing equally well if they both achieved their stated objectives. Yet, the employee who managed the design and implementation of the information management system would have undertaken a much more difficult and complex assignment and contributed substantially more. Thus, evaluating employee results cannot merely take into account whether each individual simply met or did not meet the established objectives. This would not only undermine the accuracy of performance measurement but could also rightly be viewed as unfair, with a consequential negative impact on employee acceptance of the measurement process (e.g., Dipboye & de Pontbriand, 1981; Greenberg, 1986).

We recommend several strategies to mitigate this problem. First, the training provided to managers and employees needs to focus on teaching them how to develop objectives that are of similar difficulty and complexity for individuals in the same or similar jobs. This process is similar to frame-of-reference training, in which review and discussion of example objectives helps calibrate trainees to apply similar standards. As a supplement to training, especially in the early stages of implementation, having managers meet to review the objectives for staff in the same job helps ensure that similarly difficult and complex objectives are set for similarly situated employees. Such meetings also reinforce development of a common frame of reference among managers for setting objectives. A third recommendation to facilitate the quality and consistency of individual objectives is to retain them in a searchable database organized by job and level. These can be used again verbatim or refined and edited over time to develop future objectives. Finally, even if an objective appears appropriate for the job and level and is comparable to those for similarly situated others, a project, program, or goal will sometimes turn out to be much more or less difficult than anticipated. For this reason, we feel it is important to evaluate employees not only on the extent to which they achieved or exceeded their stated results but also on the difficulty and complexity of what they delivered relative to what would be expected for their job. Although this involves a subjective judgment, it provides a fairer and more accurate assessment of the employee’s performance overall and a systematic basis for making meaningful comparisons among employees who may have achieved different types of results.

Challenge 4: Ensuring Objectives Are Within an Employee’s Control

When developing individual objectives, care must be taken to ensure that they are largely within the employee’s control and not overly dependent on things he or she cannot control. Differences in the results achieved may not be a function of differences in individual motivation, effort, or ability, but instead, differences in the opportunities available to different employees. For example, one employee may have more modern equipment than another and thus be able to produce a higher volume of product, irrespective of how hard either individual works. In a similar classic example, one employee may have a sales territory in Wyoming and another in New York City. On the basis of volume and proximity of potential customers, the individual in New York City should have more opportunity to make sales than the one in Wyoming. Clearly, circumstances beyond an employee’s control can have a significant impact on the results achieved (Kane, 1986) and an employee’s motivation.

Challenge 5: Handling Objectives That Are Partially Attributable to Others

A related challenge in setting objectives occurs when outcomes cannot easily be associated with a specific person’s effort, because the work involves significant interdependencies or is team focused. For example, in the design and production of a new automobile, the quality of the product is dependent on the design engineering group and the production group (Cascio, 1998). When the work requires significant interdependencies, objectives should be set at the level where the key work products are produced.


If jobs are tightly intertwined, it may not be practical or even appropriate to set individual objectives. In such circumstances, individual objectives should be abandoned and replaced with objectives set at the higher group or team level (Lawler, 1994). Ployhart and Weekley (Chapter 5, this volume) similarly make the point that task and result interdependencies may make individual-level performance and results impossible to measure well, if at all, and that only aggregated performance/results may be measurable in any reasonable way.

Challenge 6: Setting Objectives in Fluid Situations

Setting specific objectives in advance may be extremely difficult for some jobs (Cascio, 1998; Levinson, 2005). Jobs that best lend themselves to setting objectives have relatively static performance requirements and definable productivity metrics, both of which are uncommon in many of today’s jobs. As the economy continues to transform from a manufacturing focus to a knowledge and service focus, jobs are increasingly becoming more fluid and unpredictable, which makes setting objectives more difficult (Pulakos, Hanson, & O’Leary, 2007). For jobs that are fluid and unpredictable, or in situations where unforeseen circumstances regularly interfere with attaining objectives, it may be necessary to alter or completely revise an employee’s objectives during the rating period. Managers and employees need to be prepared to make changes to the objectives as the situation or priorities change. Obviously, to the extent that a situation is chronically volatile, requirements for constant changes to the formal performance plan may prove to be impractically time-consuming. An alternative strategy for jobs that are in flux is to set shorter-term objectives that are more predictable. Feedback can be given during the rating period as employees meet key milestones. In fact, given the fluid nature of many work environments and jobs, some experts have argued against setting longer-term objectives and instead recommend setting shorter-term goals as the work evolves.

Challenge 7: Ensuring Objectives Focus on Important Aspects of Performance

Measuring important aspects of performance is necessary to obtain valid and useful measures. Consider the job of an electrician. Although the number of projects completed within budget may be a useful indicator of performance effectiveness, the ability to complete projects within budget is only one aspect of the job. There are other, more important contributors to overall performance that should be assessed, such as whether the work is competently performed according to code. Ensuring that nontrivial aspects of performance are measured relies on two things. The first is that careful consideration be given to what types of performance measures are most critical to assessing effectiveness (e.g., quality, quantity, timeliness) and appropriately incorporating these factors into performance measurement. The second is understanding that although some people advocate using only quantifiable measures (e.g., average call time, sales volume) to evaluate objectives, results-based performance measurement does not require consideration of only bottom-line, objective measures. Instead, the results of some objectives may need to be judged subjectively (e.g., to evaluate the quality of work produced). Including evaluation of objective metrics and subjective factors, where appropriate, will help mitigate the problem of only focusing on those results that can be easily measured rather than on those that represent the most important aspects of performance.

CHALLENGES ASSOCIATED WITH DEVELOPING TEAM-BASED OBJECTIVES

It may be tempting to think of team-based objective setting as simply an extension of individual objective setting. However, there are unique challenges associated with establishing team-based objectives that are worth addressing.


In this section, we discuss four unique challenges inherent in identifying and setting objectives for teams.

Challenge 1: Accounting for Interdependence

One challenge is task interdependence, the extent to which team members must rely on one another in order to complete a task or produce an outcome. The results of team-based jobs are a function of the coordination and seamless performance of the group, not simply the sum of the individual team member contributions. Interdependence adds a layer of complexity that is not often found in individual-based work. Processes such as communication and coordination are not required when tasks are performed individually. Performance objectives need to accommodate the level of interdependence within a team. The more tasks are interdependent, the more important team goal commitment becomes for ensuring goal accomplishment (Aubé & Rousseau, 2005).

Challenge 2: Establishing Objectives at Multiple Levels

Objectives in teams must be set at multiple levels (Salas et al., 2004; Wildman et al., 2011). Teams are composed of individuals working toward a common goal, and objectives need to be set and performance measured at the individual and team levels. Objectives set at the individual level focus on the specific products or outcomes the individual achieves in relation to the team’s results. Objectives set at the team level focus on the results achieved by the collective team.

Assigning objectives at the individual level encourages individual productivity and is consistent with traditional approaches to performance management, which hold individuals accountable for goals they are directly responsible for achieving. Individual objectives are important because the use of team-based objectives alone may not accurately represent the contributions of all of the team members (Cannon-Bowers & Bowers, 2011). For example, a team may have one or two members with subpar performance who get acceptable ratings if individual objectives are not considered (McIntyre & Tedrow, 2004). However, it may be difficult to attribute specific contributions to team outputs to particular individuals, and individualized goals may remove the focus from the team or lead to counterproductive competition among team members.

Individual-level objectives are not sufficient, however. Team-level objectives are required to measure outcomes of team-based tasks and processes. In addition, to combine work efforts effectively, team members must have a shared understanding of what they are trying to achieve. What’s more, individuals may achieve their personal objectives in a manner that prevents team goal attainment. Finally, team objectives are more likely than individual objectives to align with organizational goals.

In team-based objective setting, a balanced approach is needed (Cannon-Bowers & Salas, 1997). Individuals should have objectives for their own performance as well as an objective(s) for the entire team (Kramer et al., 2013). These goals are often cascaded, with individual goals contributing to team goals and team goals contributing to higher-level organizational goals. However, very limited research has specifically examined the setting of objectives at multiple levels.

Challenge 3: Avoiding Goal Conflict

One major difficulty encountered when setting objectives at the individual and team levels is the potential for goal conflict (Latham & Locke, 2007). Individual objectives may be set to motivate members to achieve their own goals, but these may interfere with cooperation and team performance. Only when an individual’s objectives are compatible with the team’s objectives will performance be enhanced (Seijts & Latham, 2000).


Additionally, at the individual level, when individuals view goal attainment as competitive and perceive that others’ attainment of their own goals may prevent them from personal goal achievement, they may obstruct others (e.g., withhold information) (Stanne, Johnson, & Johnson, 1999). Cooperation is only likely to occur if individuals see their goals and the goals of others and the team as correlated, such that attainment of one leads to the attainment of the other. When developing goals at multiple levels, it is important to understand and strive for goal interdependence. This is often done through the development of group-centric individual objectives set by the individual to maximize team performance (Kramer et al., 2013).

Challenge 4: Accounting for the Uniqueness of the Team

In response to operational and organizational challenges, a wide variety of team types have emerged. Teams range from small to large, temporary to permanent, co-located to distributed to virtual, and self-managed to hierarchically led. Not surprisingly, research suggests that not all teams are equal. Those developing objectives must understand and take into account differences in team purpose, composition, structure, and management before measurement approaches can be developed (Salas et al., 2004).

By way of example, establishing common goals within virtual teams (characterized by members working in different locations and communicating through a variety of methods) can be difficult. When compared to in-person teams, working in virtual teams can increase anonymity and social loafing, lead to feelings that individual work is not important or will be overlooked, and decrease trust among team members (Kramer et al., 2013). This can lead to a lack of goal commitment and team cooperation (Hertel, Konradt, & Orlikowski, 2004). In these contexts, participative goal setting is beneficial because it allows for ownership of the objective and a shared understanding of each team member’s responsibilities. Alternatively, for high-performing teams working in difficult, stressful, and complex environments (e.g., medical teams, flight crews) where hierarchical management structures dominate, difficult and specific goals may lead to more risk taking, which may be detrimental. In these contexts, a less concrete goal with “room for interpretation” may be beneficial (Kramer et al., 2013). Additionally, in such environments, team members are more likely to turn to their leader for guidance, so leader-set goals have more value.

MEASURING RESULTS OF PERFORMANCE OBJECTIVES

Once objectives have been established, employee performance related to those objectives must be evaluated. Four types of measures are commonly used for this purpose: timeliness, quality, quantity, and financial metrics.

Timeliness refers to the timeframe in which the work was performed. Examples of timeliness metrics include responding to customer complaints within 24 hours and providing statistical reports on a quarterly basis that summarize progress toward affirmative action goals.

Quality refers to the effectiveness of the result. Examples of quality metrics include improving the layout for navigating a website to make it more user-friendly as indicated by a 10% improvement in user survey satisfaction results, independently creating a report containing relevant and concise information on program operations that required no revisions, and developing an online training program in which trainees successfully learned 85% of the materials. Although it is useful to develop quantifiable metrics of quality where it is possible to do so, quality assessments will sometimes require subjective judgments (e.g., how relevant and concise the information contained in a report actually was). Providing predefined rating criteria to guide subjective judgments helps ensure that employees are fairly evaluated against uniform standards.

Quantity refers to how much work is performed. Examples of quantity metrics include responding to 95% of requests, providing computer training to 90% of employees, and conducting two onsite reviews each month to assess compliance with regulations.



TABLE 23.2 Example Performance Objectives

Well-Defined Objectives
• By June 30, develop a plan that allows for 90% of general inquiries to the company website to be responded to within 72 hours.
• By the end of the operating year, implement a self-service benefits system that reduces processing costs by 10%.
• By June 30, draft and submit to the Human Resources Vice President a plan and timeline that is accepted without revision for expanding telework options to at least 70% of full-time employees.
• Reduce average cost of processing travel reimbursements by 5% by end of year.

Poorly Defined Performance Objectives
• Provide effective customer service.
• Coordinate with the Legal Department to revise the company’s HR policy.
• Promote volunteering in the local community.
• Reduce operating costs of company fleet program.

Finally, financial metrics relate to the efficient use of funds, revenues, profits, or savings. Examples of financial metrics include budgeting operations to achieve a 10% cost savings compared to last year and convincing customers to increase expenditures for service by 15% more than last year.

Although there are four primary ways to measure results, the different types of measures can be used together, which usually improves the clarity of expectations. For example:

• Processed 99% of candidate job applications within one week of receiving them (quantity and timeliness)
• Developed an online training course that taught 90% of employees how to use automated transactional systems and reduced training costs by $500 per employee (quantity, quality, and financial metrics)

Table 23.2 presents examples of well-defined objectives that specify timeliness, quality, quantity, and/or financial metrics and examples of poorly defined objectives that fail to specify measurable criteria. As can be seen by reviewing the first set of objectives (i.e., well-defined) in the table, articulating expected results in terms of the four types of measures is likely to increase understanding and agreement about whether or not the objectives were achieved. Alternatively, the second set of objectives (poorly defined) is vague and nonspecific, which could easily lead to differing opinions about the extent to which they were met.

These four measures are rarely independent. For example, in order to meet timeliness metrics, it may be necessary to sacrifice quality. In cases where there may be tradeoffs, the organization and management must determine how to most appropriately balance the competing objectives. The optimal balance will depend on a number of factors, including organizational goals and the changing operating environment.
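The sketch below illustrates one way the four measure types could be attached to a single objective and checked together; the objective, targets, and actual values are hypothetical and simply echo the training-course example above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class MetricType(Enum):
    TIMELINESS = "timeliness"
    QUALITY = "quality"
    QUANTITY = "quantity"
    FINANCIAL = "financial"


@dataclass
class Metric:
    metric_type: MetricType
    description: str
    target: float
    actual: float
    higher_is_better: bool = True  # False for targets such as cost per employee

    def met(self) -> bool:
        return self.actual >= self.target if self.higher_is_better else self.actual <= self.target


# Hypothetical metrics for "Develop an online training course" (see the bulleted example above)
metrics: List[Metric] = [
    Metric(MetricType.QUANTITY, "% of employees trained", target=90, actual=93),
    Metric(MetricType.QUALITY, "% of material learned (course test scores)", target=85, actual=82),
    Metric(MetricType.FINANCIAL, "training cost per employee ($)", target=500, actual=430,
           higher_is_better=False),
]

for m in metrics:
    status = "met" if m.met() else "not met"
    print(f"{m.metric_type.value:<10} {m.description:<45} {status}")
print("All measures met:", all(m.met() for m in metrics))
```

Because the quality target is missed while the others are met, the example also shows why the tradeoff question raised above (e.g., timeliness or cost pressure eroding quality) has to be settled by the organization rather than by the arithmetic.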

CHALLENGES ASSOCIATED WITH MEASURING RESULTS AND MITIGATION STRATEGIES

Because our focus in this chapter is on measures that will be used as criteria in validation studies or as predictors for making selection decisions, reliability, validity, accuracy, and fairness of measurement are essential, as we have discussed. In the previous section, we described four types of measures that are most commonly used to evaluate results. Although we feel that these are useful and should be incorporated into measuring results, they have some inherent limitations that are important to address. To appreciate these limitations fully, it is important to understand the two primary factors that have driven a results focus in organizations.


That is, organizational leaders want to do the following:

• Drive achievement of important results from all employees or teams that contribute to the organization’s success.
• Reward employees or teams on the basis of their performance, which requires accurate performance measurement. Architects of pay-for-performance systems felt this could be best achieved by defining results in terms of concrete, objective measures, thus mitigating the chronic inflation that characterizes subjective ratings.

With this as background, we now discuss three challenges inherent in measuring results and recommendations for addressing them.

Challenge 1: Ensuring the Measures Selected Are the Important Ones

Managers must decide which measures are most important for assessing employee or team performance on each objective. They are encouraged to quantify these measures so there is no disagreement about the extent to which an objective has been met. On the surface, selecting the most appropriate measures may seem easy and straightforward, but consider the following questions:

• Did the employee who produced the most pieces also produce the highest quality pieces?
• Did the website redesign that was completed on time and within budget actually improve usability?
• Was the driver who made the most deliveries speeding and endangering others?

The reality is that even when measuring performance on objectives seems straightforward, it is important to consider the consequences of the measures selected because employees (and teams) will drive to those measures. For example, quantity measures are usually easier to define than quality measures. However, if only quantity metrics are used, employees will focus on production, possibly to the detriment of quality. It is also important not to fall prey to measuring peripheral aspects of an objective that may be easy to measure but are unimportant. For example, meeting a deadline is easy to measure, but improving customer service may be what is important. Researchers and practitioners have long argued against using convenience criteria because they are often unrelated to the most critical aspects of job performance (e.g., Smith, 1976). Despite the limitations associated with use of subjective criteria, inclusion of some subjective judgment in the measurement of results increases the likelihood that the most important aspects of performance will be measured. However, we also recommend that uniform standards be provided to guide raters in making these judgments fairly and systematically across employees. Also, incorporating standardized criteria on which ratings are made provides a mechanism for making direct comparisons among employees who may have delivered different types of results. Shown in Table 23.3 are example criteria with a 5-point rating scale that could be used to evaluate the quality of different individually delivered results.

TABLE 23.3 Performance Standards for Evaluating Quality of Results
(rated on a 5-point scale ranging from Low to Exceptional)

Low: The product, service, or other deliverable had significant problems, did not meet minimum quality standards, and fell well short of expectations. There were many or very significant errors or mistakes, and substantial revision or reworking was needed.

High: The product, service, or other deliverable possessed high quality and fully met expectations. There were only minor errors or mistakes that were easily corrected and inconsequential.

Exceptional: The product, service, or other deliverable possessed flawless and impeccable quality that met the highest possible standards and surpassed expectations. There were no errors or mistakes and no revision or reworking was needed.



Challenge 2: Measuring a Reasonable and Sustainable Number of Criteria

Although many different types of measures can be used to evaluate results, there is the very practical issue of which and how many of these can be reliably and accurately measured without creating systems and processes that are so burdensome that they die under their own weight. Developing and collecting meaningful performance measures in organizations can have significant resource implications and, thus, careful consideration must be given to the number and types of metrics that will be collected. To implement and maintain an effective and sustainable results-based process over time, any measures that require implementation of special or additional processes or systems for collection should be judiciously selected.

Challenge 3: Ensuring Useful and High-Quality Evaluation Information

One of the most challenging problems in measuring results occurs when employees individually or together have delivered a myriad of different results, and it is difficult to differentiate among them in terms of their overall contribution to the organization (Graves, 1986). For example, how should a cost-savings result be evaluated and rewarded as compared to a leadership result? Given that some results have more impact than others, it would not be fair or accurate to assume that all employees who achieved their objectives were performing with equal effectiveness. Related to this, some employees consistently deliver results above the expectations for their job level, whereas others consistently deliver below their level. Thus, although it is useful to know whether or not a set of objectives was met, this does not always provide useful information for discriminating between the most and least effective performers for validation research or operational selection/promotion decisions.

An effective strategy that has been used in public and private sector organizations to address these issues is, again, to introduce scaled criteria or standards that enable evaluation of the relative contribution and level of difficulty associated with different results. By using such standards as a part of the results evaluation process, managers are able to more accurately and reliably measure the contribution and value of the different results delivered. The use of individual performance objectives without scaled evaluation criteria to assess their relative contribution can result in a system that fails to differentiate among employees who are contributing more or less and provides no basis for differentially rewarding them (Muczyk, 1979). Examples of standards to evaluate three different aspects of a result (e.g., extent to which objective was met, level of result achieved, and contribution of result) appear in Tables 23.4, 23.5, and 23.6, respectively. It is important to note that ratings on these criteria can easily be combined into a composite results measure, the psychometric properties of which can be readily assessed (see the sketch following Table 23.6).

TABLE 23.4 Extent to Which Objective Was Met

Not Met (1): Several of the quality, quantity, timeliness, or financial measures established for this objective were not met.

Met (3): All of the quality, quantity, timeliness, and financial measures established for this objective were met.

Exceeded (5): The quality, quantity, timeliness, or financial measures established for this objective were significantly exceeded.


TABLE 23.5 Level of Results Achieved

Job level description: At this level, work is moderately complex and difficult such that critical analysis, integration of multiple sources of information, and analyzing pros and cons of multiple solutions are required. Work is performed with minimal supervision and guidance.

Did Not Meet (1): The result achieved fell far below the difficulty and complexity of work expected for this job level.

Met (3): The result achieved was consistent with the difficulty and complexity of work expected for this job level.

Exceeded (5): The result achieved far exceeded the difficulty and complexity of work expected for this job level.

TABLE 23.6 Contribution of Results

Low (1): The efficiency or effectiveness of operations remained the same or was only minimally improved. The quality of products or services remained the same or was only minimally improved.

Moderate (3): The efficiency or effectiveness of operations was improved, consistent with what was expected. Product or service quality showed expected improvements.

High (5): The efficiency and effectiveness of operations were improved tremendously, far surpassing expectations. The quality of products or services was improved tremendously.
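As a brief, hypothetical illustration of the composite results measure described above, the sketch below combines ratings on the three scaled criteria (Tables 23.4 through 23.6) into a unit-weighted composite and estimates its internal consistency. The ratings, the sample size, and the choice of unit weighting are illustrative assumptions for this sketch, not data or recommendations from the chapter.

import numpy as np

# Hypothetical ratings: rows are employees, columns are the three 1-5 scales
# (extent objective was met, level of result achieved, contribution of result).
ratings = np.array([
    [3, 4, 3],
    [5, 4, 5],
    [2, 3, 2],
    [4, 4, 5],
    [3, 3, 4],
], dtype=float)

# Unit-weighted composite: the mean of the three ratings for each employee.
composite = ratings.mean(axis=1)

def cronbach_alpha(items: np.ndarray) -> float:
    # Internal-consistency estimate across the scaled criteria.
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print("Composite results scores:", composite)
print("Cronbach's alpha:", round(cronbach_alpha(ratings), 2))

In practice, differential weights, agreement among multiple raters, and correlations with other criterion measures could be examined in the same way once such ratings are actually collected.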

INDIVIDUAL DIFFERENCE PREDICTORS OF RESULTS As Chapter 20, this volume, discusses, job performance criteria in selection research are often conceptually ambiguous, which makes specifying relationships between predictors and criterion measures difficult. In the case of results measures, the problem is compounded by the fact that the results achieved across different jobs may not reflect conceptually homogeneous content or constructs to the extent that other performance measures do. For example, considerable research evidence supports the existence of two major and conceptually distinct performance constructs, task and citizenship performance, each of which has been found to account for significant variance in overall job performance and to be associated with different antecedents (Borman, White, & Dorsey, 1995; Motowidlo & Van Scotter, 1994; Podsakoff, MacKenzie, Paine, & Bachrach, 2000). Because results measures reflect major outcomes or deliverables that relate to higher-level goals, they likely capture variance that is predominantly overlapping with task performance. However, depending on their nature, some results may be more reflective of citizenship performance, whereas others may be a combination of both. Because research has not been conducted on the underlying dimensionality of results measures, and because there is conceptual ambiguity about what underlies achieving results, we can only speculate about what constructs may be most useful for predicting this aspect of job performance. Two constructs that have been shown to consistently predict performance across jobs also seem highly relevant for predicting results. First, cognitive ability has been found to be one of the strongest predictors of job performance in general (Hunter, 1980; Schmidt & Hunter, 1998). Additionally, the few studies that have used results as criteria suggest


that cognitive ability is likely to be a strong predictor of results, especially to the extent that the results measures share variance with task performance measures. It also seems reasonable that conscientiousness, one of the Big Five personality constructs (Barrick & Mount, 1991; Chapter 13, this volume), would be associated with an overall predisposition to achieve results. The two major components of conscientiousness are achievement motivation and dependability. Achievement motivation, which refers to one’s desire to achieve results and master tasks beyond others’ expectations, may be particularly relevant to predicting results. Although the Big Five are rarely broken down into their component parts, Hough (1992) and Hough and Dilchert (Chapter 13, this volume) have argued for and shown potential advantages of examining lower-level personality constructs in the prediction of job performance. Because of the direct conceptual similarity between achievement motivation and achieving results, this may be a circumstance in which examining the validity of the component personality constructs could prove fruitful.
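To make the facet-level suggestion concrete, the sketch below shows one way a researcher might test whether an achievement motivation facet adds incremental validity over broad conscientiousness in predicting a results criterion, using hierarchical regression and the change in R squared. All data are simulated and all variable names are hypothetical; the effect sizes imply nothing about the actual validities discussed in this chapter.

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated, standardized predictor and criterion scores with arbitrary effects.
conscientiousness = rng.normal(size=n)
achievement = 0.6 * conscientiousness + rng.normal(scale=0.8, size=n)
results_criterion = 0.2 * conscientiousness + 0.3 * achievement + rng.normal(size=n)

def r_squared(predictors: np.ndarray, criterion: np.ndarray) -> float:
    # R^2 from an ordinary least squares fit with an intercept term.
    X = np.column_stack([np.ones(len(criterion)), predictors])
    beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
    residuals = criterion - X @ beta
    return 1 - residuals.var() / criterion.var()

r2_broad = r_squared(conscientiousness.reshape(-1, 1), results_criterion)
r2_with_facet = r_squared(
    np.column_stack([conscientiousness, achievement]), results_criterion)

print(f"R^2, broad trait only: {r2_broad:.3f}")
print(f"Delta R^2 for the achievement motivation facet: {r2_with_facet - r2_broad:.3f}")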

CONCLUSIONS The development of individual and team performance objectives, linked to key organizational goals and priorities, has been hypothesized to drive important results. Given the pervasive use of results measures in today’s organizations, future research should investigate the relationships between these performance measures and more commonly used predictors and criteria. Many practitioners and organizational leaders certainly believe that unique variance is accounted for in measuring results versus other types of performance measures. Because this belief has led to implementation of complex and time-consuming results-based systems, it is important to know if the added effort associated with these systems is, in fact, producing different or better information than other, less demanding performance measurement approaches. Research should also be conducted to evaluate the psychometric properties of results measures to assess whether or not they possess sufficient reliability, validity, and fairness to be used in validation research and for making selection decisions. Research is also needed to investigate the underlying dimensionality of results measures as well as predictors of them. Throughout this chapter, we drew from the literature to propose methods for identifying objectives and evaluating results that should maximize the likelihood of obtaining measures with adequate measurement properties, validity, and utility. However, data need to be collected using these methods to evaluate their efficacy. Competent development of fair, job-relevant, and useful objectives is difficult, resource-intensive, and time-consuming, requiring considerable training and effort on the part of managers, employees, and HR staff. If organizational members are not committed to developing effective objectives, doing this consistently for all employees and devoting the time that is needed to yield high-quality measures, we recommend that results measures not be collected or included in performance measurement processes. This is because poorly developed objectives will neither motivate employees nor provide useful criterion measures for validation research or operational selection decisions. However, if organizational members are willing to devote the time, energy, and resources necessary to overcome the inherent challenges involved in developing objectives and monitoring their effectiveness and completion, results-based measures may hold considerable promise. Research and practice have certainly suggested that defining and measuring results can have a profoundly positive effect on individual and organizational performance (Locke & Latham, 1990; Rodgers & Hunter, 1991).

REFERENCES
Aubé, C., & Rousseau, V. (2005). Team goal commitment and team effectiveness: The role of task interdependence and support behaviors. Group Dynamics: Theory, Research, and Practice, 9, 189–204.
Banks, C. G., & May, K. E. (1999). Performance management: The real glue in organizations. In A. I. Kraut & A. K. Korman (Eds.), Evolving practices in human resource management (pp. 118–145). San Francisco, CA: Jossey-Bass.


Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26.
Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rater training. Academy of Management Review, 6, 205–213.
Bernardin, H. J., Hagan, C. M., Kane, J. S., & Villanova, P. (1998). Effective performance management: A focus on precision, customers, and situational constraints. In J. W. Smither (Ed.), Performance appraisal: State of the art in practice (pp. 3–48). San Francisco, CA: Jossey-Bass.
Borman, W. C. (1987). Behavior-based rating scales. In R. A. Berk (Ed.), Performance assessment: Methods and application (pp. 100–120). Baltimore, MD: Johns Hopkins University Press.
Borman, W. C. (1991). Job behavior, performance, and effectiveness. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (Vol. 2, pp. 271–326). Palo Alto, CA: Consulting Psychologists Press.
Borman, W. C., White, L. A., & Dorsey, D. W. (1995). Effects of ratee task performance and interpersonal factors on supervisor and peer performance ratings. Journal of Applied Psychology, 80, 168–177.
Bretz, R. D., Milkovich, G. T., & Read, W. (1992). The current state of performance appraisal research and practice: Concerns, directions, and implications. Journal of Management, 18, 321–352.
Cannon-Bowers, J. A., & Bowers, C. (2011). Team development and functioning. In S. Zedeck (Ed.), Handbook of industrial/organizational psychology (Vol. 1, pp. 597–650). Washington, DC: American Psychological Association.
Cannon-Bowers, J. A., & Salas, E. (1997). A framework for developing team performance measures in training. In M. T. Brannick, E. Salas, & C. Prince (Eds.), Team performance assessment and measurement: Theory, methods, and application (pp. 45–62). Mahwah, NJ: Erlbaum.
Cardy, R. L. (1998). Performance appraisal in a quality context: A new look at old problems. In J. W. Smither (Ed.), Performance appraisal: State of the art in practice (pp. 132–162). San Francisco, CA: Jossey-Bass.
Cascio, W. F. (1998). Applied psychology in human resource management. Upper Saddle River, NJ: Prentice Hall.
DeShon, R. P., Kozlowski, S. W. J., Schmidt, A. M., Milner, K. R., & Wiechmann, D. (2004). A multiple-goal, multilevel model of feedback effects on the regulation of individual and team performance. Journal of Applied Psychology, 89, 1035–1056.
Dipboye, R. L., & de Pontbriand, R. (1981). Correlates of employee reactions to performance appraisals and appraisal systems. Journal of Applied Psychology, 66, 248–251.
Dunnette, M. D. (1966). Personnel selection and placement. Belmont, CA: Wadsworth.
Feldman, D. (1992). The case for non-analytic performance appraisal. Human Resource Management Review, 2, 9–35.
Graves, J. P. (1986). Let’s put appraisal back in performance appraisal: Part 1. Personnel Journal, 61, 844–849.
Greenberg, J. (1986). Determinants of perceived fairness of performance evaluations. Journal of Applied Psychology, 71, 340–342.
Guion, R. M. (1965). Personnel testing. New York, NY: McGraw-Hill.
Hackman, J. R. (1987). The design of work teams. In J. Lorsch (Ed.), Handbook of organizational behavior (pp. 315–342). Englewood Cliffs, NJ: Prentice-Hall.
Hackman, J. R., & Oldham, G. R. (1980). Work redesign. Reading, MA: Addison-Wesley.
Hertel, G., Konradt, U., & Orlikowski, B. (2004). Managing distance by interdependence: Goal setting, task interdependence, and team-based rewards in virtual teams. European Journal of Work and Organizational Psychology, 13, 1–28.
Hillgren, J. S., & Cheatham, D. W. (2000). Understanding performance measures: An approach to linking rewards to the achievement of organizational objectives. Scottsdale, AZ: WorldatWork.
Hough, L. M. (1992). The “Big Five” personality variables—Construct confusion: Description versus prediction. Human Performance, 5, 139–155.
Hunter, J. E. (1980). Validity generalization for 12,000 jobs: An application of synthetic validity and validity generalization to the General Aptitude Test Battery (GATB). Washington, DC: U.S. Department of Labor, Employment Service.
Jamieson, B. D. (1973). Behavioral problems with management by objective. Academy of Management Review, 16, 496–505.
Kane, J. S. (1986). Performance distribution assessment. In R. A. Berk (Ed.), Performance assessment: Methods and applications (pp. 237–274). Baltimore, MD: Johns Hopkins University Press.
Kleingeld, A., Van Mierlo, H., & Arends, L. (2011). The effects of goal setting on group performance: A meta-analysis. Journal of Applied Psychology, 96(6), 1289–1304.
Kozlowski, S. W. J., & Ilgen, D. R. (2006). Enhancing the effectiveness of work groups and teams. Psychological Science in the Public Interest, 7(3), 77–124.
Kramer, W. S., Thayer, A. L., & Salas, E. (2013). Goal setting in teams. In E. A. Locke & G. P. Latham (Eds.), New developments in goal setting and task performance (pp. 287–310). New York, NY: Routledge.


Landy, F. J., & Trumbo, D. A. (1980). The psychology of work behavior. Homewood, IL: Dorsey Press.
Latham, G. P. (1986). Job performance and appraisal. In C. L. Cooper & I. Robertson (Eds.), International review of industrial and organizational psychology (pp. 117–155). New York, NY: Wiley.
Latham, G. P., & Locke, E. A. (2007). New developments in and directions for goal-setting research. European Psychologist, 12(4), 290–300.
Lawler, E. E. (1994). Performance management: The next generation. Compensation and Benefits Review, 26, 16–20.
Levinson, H. (2005). Management by whose objectives? In Harvard Business Review on Appraising Employee Performance (pp. 1–28). Boston, MA: Harvard Business School Publishing.
Locke, E. A., & Latham, G. P. (1990). A theory of goal setting and task performance. Englewood Cliffs, NJ: Prentice-Hall.
Locke, E. A., Shaw, K. N., Saari, L. M., & Latham, G. P. (1981). Goal setting and task performance, 1969–1980. Psychological Bulletin, 90, 125–152.
Marks, M. A., Mathieu, J. E., & Zaccaro, S. J. (2001). A temporally based framework and taxonomy of team process. Academy of Management Review, 26, 356–376.
McIntyre, R. W., & Tedrow, L. (2004). A theory-based approach to team performance assessment. In J. C. Thomas (Ed.), Comprehensive handbook of psychological assessment (Vol. 4, pp. 443–452). Hoboken, NJ: Wiley.
Motowidlo, S. J., & Van Scotter, J. R. (1994). Evidence that task performance should be distinguished from contextual performance. Journal of Applied Psychology, 79, 475–480.
Muczyk, J. P. (1979). Dynamics and hazards of MBO application. Personnel Administrator, 24, 51–61.
Murphy, K. R., & Cleveland, J. N. (1991). Performance appraisal: An organizational perspective. Needham Heights, MA: Allyn & Bacon.
O’Leary-Kelly, A. M., Martocchio, J. T., & Frink, D. D. (1994). A review of the influence of group goals on performance. Academy of Management Journal, 37, 1285–1301.
Olian, J. D., & Rynes, S. L. (1991). Making total quality work: Aligning organizational processes, performance measures, and stakeholders. Human Resource Management, 30, 303–330.
Podsakoff, P. M., MacKenzie, S. B., Paine, J. B., & Bachrach, D. G. (2000). Organizational citizenship behaviors: A critical review of the theoretical and empirical literature and suggestions for future research. Journal of Management, 26, 513–563.
Pulakos, E. D. (2004). Performance management: A roadmap for developing, implementing, and evaluating performance management systems. Alexandria, VA: Society for Human Resource Management.
Pulakos, E. D. (2008). Performance management: How you can achieve important business results. Oxford, England: Blackwell.
Pulakos, E. D., Hanson, R. A., & O’Leary, R. S. (2007). Performance management in the United States. In A. Varma, P. Budhwar, & A. DeNisi (Eds.), Global performance management (pp. 97–114). London, England: Routledge.
Rodgers, R., & Hunter, J. E. (1991). Impact of management by objectives on organizational productivity. Journal of Applied Psychology, 76, 322–336.
Rodgers, R., Hunter, J. E., & Rogers, D. L. (1993). Influence of top management commitment on management process success. Journal of Applied Psychology, 78, 151–155.
Salas, E., Burke, C. S., & Fowlkes, J. E. (2006). Measuring team performance “in the wild”: Challenges and tips. In W. Bennett, C. E. Lance, & D. J. Woehr (Eds.), Performance measurement: Current perspectives and future challenges (pp. 245–272). New York, NY: Psychology Press.
Salas, E., Burke, C. S., Fowlkes, J. E., & Priest, H. A. (2004). On measuring teamwork skills. In J. C. Thomas (Ed.), Comprehensive handbook of psychological assessment (Vol. 4, pp. 427–442). Hoboken, NJ: Wiley.
Salas, E., Sims, D. E., & Burke, C. S. (2005). Is there a “Big Five” in teamwork? Small Group Research, 36, 555–599.
Salas, E., Stagl, K. C., Burke, C. S., & Goodwin, G. F. (2007). Fostering team effectiveness in organizations: Towards an integrative theoretical framework of team performance. In J. W. Shuart, W. Spalding, & J. Poland (Eds.), Nebraska symposium on motivation (pp. 185–243). Lincoln: University of Nebraska Press.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274.
Schneier, C. E., Shaw, D. G., & Beatty, R. W. (1991). Performance measurement and management: A tool for strategy execution. Human Resource Management, 30, 279–301.
Seijts, G. H., & Latham, G. P. (2000). The effects of goal setting and group size on performance in a social dilemma. Canadian Journal of Behavioural Science, 32, 104–116.
Smith, P. C. (1976). Behaviors, results, and organizational effectiveness: The problem of criteria. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 239–287). Chicago, IL: Rand McNally.


Stanne, M. B., Johnson, D. W., & Johnson, R. T. (1999). Does competition enhance or inhibit motor performance? A meta-analysis. Psychological Bulletin, 125, 133–154.
Strauss, G. (1972). Management by objectives: A critical review. Training and Development Journal, 26, 10–15.
Wegge, J., & Haslam, S. A. (2005). Improving work motivation and performance in brainstorming groups: Effects of three group goal setting strategies. European Journal of Work and Organizational Psychology, 14, 400–430.
Wildman, J. L., Bedwell, W. L., Salas, E., & Smith-Jentsch, K. A. (2011). Performance measurement at work: A multilevel perspective. In S. Zedeck (Ed.), Handbook of industrial/organizational psychology (Vol. 1, pp. 303–341). Washington, DC: American Psychological Association.


24 EMPLOYEE WORK-RELATED HEALTH, STRESS, AND SAFETY LOIS E. TETRICK, PAMELA L. PERREWÉ, AND MARK GRIFFIN

Organizations are increasingly concerned with the health and safety of their employees. Several factors are contributing to this concern. First, the legal environment in many countries stipulates that employers are responsible for the safety and health of their employees, at least while at work. For example, the U.S. Occupational Safety and Health Act of 1970 mandates that employers provide a safe and healthy work environment for employees, and the European Agency for Safety and Health at Work has issued several Council Directives that protect the health and safety of workers throughout the European Union. It also has been argued that there are growing concerns and expectations by the general public over health protection from communicable diseases and noncommunicable environmental hazards in the work environment (Nicoll & Murray, 2002). Second, the cost of healthcare continues to climb. Oziransky, Yach, Tsao, Luterek, and Stevens (2015) found that 70% of human resources professionals and 60% of chief financial officers reported that healthcare costs were a major financial concern for their organizations. In the United States, health insurance coverage of employees is a direct expense to employers and continues to increase. Although many companies have shifted more of the healthcare costs to employees (Kaiser Network, 2006), there are still potential savings to organizations from having healthy employees because health insurance premiums in the United States are often based on claims from job-related illnesses and injuries. In the European Union and other countries around the world, the cost of healthcare is more of a social, public health responsibility than an organizational responsibility, based in part on differences in funding of healthcare systems. This social value of the health and safety of the workforce appears to be emerging in the United States (National Academies of Sciences, Engineering and Medicine, 2015). Third, employee health is related to productivity and organizational effectiveness. Recent research on the association of health risks and on-the-job productivity estimated that the annual cost of lost productivity in one organization was between $1,392 and $2,592 per employee, based on self-reported health risk factors (Burton et al., 2005). In another study of chronic health conditions, Collins et al. (2005) found that the cost associated with lost productivity from chronic health conditions exceeded the combined costs of absenteeism and medical treatment. Therefore, the concern over employees’ health and safety is not limited to healthcare costs but also includes loss of productivity, and this concern is increasingly a global issue (World Health Organization, 2008). The purpose of this chapter is to examine potential mechanisms that organizations may use to maintain and promote healthy employees. These mechanisms might be the selection of “healthy” workers, modifications to the work environment to reduce work stressors and increase



FIGURE 24.1 Conceptual Overview of Chapter. Selection, the work environment, and training contribute to employee safety, health, and well-being, which in turn contributes to organizational effectiveness (performance, lower healthcare costs, fewer accidents, and less turnover and absenteeism).

safety, and employee training. Each mechanism has its pros and cons with respect to maintaining healthy employees as well as a safe and healthy work environment. We examine the importance of having healthy workers and the role that workplace stress and safety play in organizational effectiveness. Figure 24.1 illustrates the organization of the chapter.

HEALTHY WORKERS In this section, we first examine the rising costs of healthcare. Next, we discuss how organizations can develop and possibly obtain healthy workers through wellness programs and selection.

Healthcare Costs Healthcare spending is rising faster than incomes in most developed countries, with the United States spending more per capita on healthcare than other countries. The United States spent $8,745 per capita in 2012, which was 42% higher than Norway, the Organisation for Economic Co-operation and Development (OECD) country with the next highest per capita spending on health (Peterson-Kaiser Health Tracker System, 2016). International comparisons of individual and family health spending are difficult given differences in healthcare systems. A Kaiser Daily Health Report (2008) indicated that medical care for the typical insured family of four in the United States was $13,382 in 2006—an increase of 9.6% from 2005, with employers paying 62% or $8,362 per family in 2006. According to another report by the Kaiser Family Foundation (2008), health insurance premiums had a cumulative growth of 78% between 2001 and 2007, with much of this increase in health insurance cost being borne by employers. Goetzel et al. (2014) reported that U.S. employers spent $16,351 per employee on average for health insurance premiums in 2013. Although the numbers differ based


on industry, size of organization, and exact definitions, it appears that healthcare costs and health insurance premiums for individuals and organizations will continue to increase. Analyses of healthcare expenditures find that healthcare costs are often related to modifiable lifestyle behaviors. Anderson et al. (2000) found that modifiable health risks were associated with 25% of the total healthcare expenditures among a sample of 46,026 employees. Uncontrolled stress, smoking, and obesity were the three costliest risk factors based on healthcare expenditures in this study. In a larger study of more than 300,000 employees from six companies, Goetzel, Hawkins, Ozminkowski, and Wang (2003) found, based on 1999 data, that physical health problems cost a total of $3,524 per eligible employee for medical care, absenteeism, and short-term disability program use, whereas mental health problems cost only $179 on average. With increased attention to mental health parity, however, it is doubtful that the difference between physical and mental health expenditures and productivity would remain as large. It also was noted that expenditures associated with multiple conditions were attributed to the most acute condition, which might also account for the considerably larger costs associated with physical health problems. On the basis of an earlier study, Goetzel et al. (1998) reported that employees with depression had 70% higher healthcare expenditures than those individuals who were not depressed. In addition, they found that individuals with uncontrolled stress had 46% greater medical costs than those who were not stressed, and the third costliest risk factor was high blood glucose. Employees with high blood glucose had 35% greater medical expenses than those with normal blood glucose. Other costly risk factors were obesity, tobacco use, high blood pressure, and poor exercise habits. Somewhat surprisingly, excessive alcohol consumption was not associated with increased medical costs, although Goetzel et al. suggested this might be reflective of individuals with drinking problems tending to avoid the healthcare system. These studies, as well as others, highlight the effects of modifiable health risk factors on overall healthcare costs. It is not surprising, then, that organizations have become increasingly interested in employees’ health and lifestyle factors that are associated with health. Organizations have incorporated certain lifestyle factors into their selection processes (e.g., hiring only nonsmokers) and have implemented wellness programs that focus on developing and maintaining healthy lifestyles through exercise, nutrition, smoking cessation, and stress management programs as attempts to enhance the health of their workforce.

Organizational Wellness Programs Rothstein (1983) suggested that many organizations initiated organizational wellness programs beginning in the 1970s. Organizational wellness programs typically have focused on the modifiable health risk factors associated with lifestyle, such as being overweight, lack of physical activity, poor diet, smoking, and alcohol use. These programs often include educational and training components; financial incentives or disincentives; disease management programs; health risk assessments; health screenings; and special programs for medical management such as flu shots, health fairs, on-site fitness facilities, and fitness center discounts (Shurtz, 2005). Organizational wellness programs seek to increase employee health, productivity, and morale while decreasing absenteeism and reducing healthcare expenditures (Goetzel et al., 2014). A recent RAND Health report (Mattke et al., 2013) found that over half of U.S. employers offer organizational wellness programs, with 72% of those programs offering a combination of screening and interventions. That said, less than half of employees actually complete the screenings offered, and only 20% or fewer of employees identified for interventions participate in the intervention (Mattke et al., 2013). Therefore, consideration of factors to engage employees in available organizational wellness programs is of increasing importance. As might be expected, these programs vary considerably, ranging from those focusing on only a single risk factor, such as lack of physical fitness, to those with multiple components and prevention programs, which Parks and Steelman (2008) referred to as “comprehensive programs.” Given


the differences across programs, evaluating the effectiveness of these programs or of specific components within the programs remains a challenge, although an important one if organizational wellness programs are to accomplish their goals. The effectiveness and return on investment of the various organizational wellness programs are still open for debate. Mattke et al. (2013) reported that employers were confident that their wellness programs reduced medical costs, absenteeism, and health-related productivity losses, although only about half of the organizations participating in the study had actually evaluated their programs, and only 2% indicated that there were actual savings from their programs. A meta-analysis of organizational wellness programs (Parks & Steelman, 2008) found that participation in a wellness program was related to decreased absenteeism and improved job satisfaction, supporting the effectiveness of wellness programs, although direct measures of health or return on investment were not included. Baicker, Cutler, and Song’s (2010) meta-analysis found that the return on investment was 3:1 for direct medical costs and also 3:1 for absenteeism; however, other studies have not always found evidence for either improved health or significant savings. Baxter, Sanderson, Venn, Blizzard, and Palmer (2014) examined several characteristics of the studies in their systematic review, including how return on investment was computed and the quality of the study. They found an overall return on investment of 1.38, although interestingly the higher-quality studies tended to show lower returns than the lower-quality studies. This prompted O’Donnell (2015) to conclude that whether organizational wellness programs produce a worthwhile return on investment “depends” on a number of factors. The evidence relative to health indicators appears to be more consistent. In a quasi-experimental design study, Mills, Kessler, Cooper, and Sullivan (2007) found that participation in a multicomponent organizational wellness program reduced several self-reported health risks at a 12-month follow-up assessment relative to a comparison group, along with a decrease of 4.3 annualized absenteeism days and an increase in productivity of 7.9% over the comparison group. The convergence of evidence suggests that wellness programs can increase productivity and morale as well as reduce absences and healthcare costs in a cost-efficient manner, although cost savings may depend on a number of factors, including characteristics of the cost-benefit/cost-effectiveness study. That being said, wellness programs are not without some associated downsides. One challenge has traditionally been getting those with the most risk to participate in the programs. For example, many organizational fitness programs have not engaged those individuals who most need to increase their activity levels. One approach to increase participation has been the use of incentives, but these programs may actually create other concerns because incentives/rewards may be found to be discriminatory under the Health Insurance Portability and Accountability Act (HIPAA), the Americans with Disabilities Act (ADA), or state insurance laws (Simon, Bruno, Grossman, & Stamm, 2006; Simon, Traw, McGeoch, & Bruno, 2007).
Wellness programs need to be designed and implemented such that they are compliant with employment law (Kendall & Ventura, 2005; Shurtz, 2005; Simon et al., 2006). HIPAA bars healthcare plans from discriminating against individuals because of medical conditions and health status. Whether a particular wellness plan is considered a healthcare plan and subject to HIPAA depends on several factors, including what components are included in the program (Simon et al., 2006, 2007). In addition to whether a specific wellness program is considered a health plan, the use of incentives needs to be considered such that they are not construed as discriminatory toward individuals under HIPAA; in other words, that the incentive does not depend on the health status of the individual, and rewards must be available to all similarly situated individuals or at least provide a reasonable alternative standard for attaining the reward. Simon et al. (2006) and Kendall and Ventura (2005) suggested that although a particular wellness program may not be discriminatory on the basis of HIPAA, it may still be counter to the ADA and/or state insurance plans. For example: If an employer’s wellness program offers lower health plan premiums to employees who complete a healthy lifestyles course, the ADA may require the employer to ensure that a blind or deaf employee can access


the classes and course materials. In addition, if a disability prevents an employee from earning a wellness incentive, the ADA may require the employer to work with the employee to develop alternative incentives that are within the employee’s ability. (Simon et al., 2006, p. 57)

Therefore, organizational wellness programs need to be carefully designed to avoid discrimination against individuals on the basis of health status and disability.

Selecting Healthy Workers Organizational wellness programs are one mechanism for enhancing the health of workers. An alternative mechanism for enhancing the health of an organization’s workforce might be the selection of “healthy workers.” As mentioned above regarding wellness programs, selection based on health status and disability may run counter to the ADA (Rothstein, 1983); also see Gebhardt and Baker’s Chapter 12 in this volume on physical performance tests. Under the ADA, it would be unlawful to base a selection decision on a disability, defined as a condition that substantially limits one or more major life activities, when the individual can perform the essential functions of the job. For example, obesity is not considered a disability under the ADA unless the cause of the obesity is a physiological disorder/impairment or the obesity substantially limits a major life activity, which might be the case with morbidly obese individuals. Therefore, if an individual is morbidly obese and can perform the essential functions of the job, denial of employment may be deemed discriminatory under the ADA. In addition, other employment laws may apply. For example, in Michigan it is illegal to discriminate based on weight. Interestingly, smoking does appear to be one health risk factor that does not have any protections under employment law. Increasingly, employers are not only restricting smoking while at work, but they are also not hiring individuals who are smokers (Smerd, 2007). It may be possible to build a case that selection based on at least certain risk factors is a business necessity. As indicated above, the cost of healthcare insurance, absenteeism, and lower levels of productivity associated with many health risk factors and ill-health conditions might be regarded as business necessity, which is one justification that has been used in implementing smoking bans at work as well as outside of work. Regardless of legal issues, the decision to select employees on the basis of health or risk factors has several complications. First, the fundamental principle for selection is that selection criteria should be job-related and consistent with business necessity. Second, selection criteria are generally considered to be relatively stable characteristics of an individual that predict how well an individual will be able to perform a job, such as job-relevant knowledge, skills, and abilities. Use of health and health risk factors may move away from relatively stable selection factors, especially if modifiable health risk factors such as smoking, weight, and lack of physical fitness are being considered as selection criteria. The selection system would then be dealing with a dynamic predictor and a dynamic criterion. Therefore, one would expect the predictive validities to be lower than when the predictors and/or criteria are relatively stable. Further, as Rothstein (1983) indicated, the use of health data as predictors requires that the measurement of these predictors have sufficient sensitivity (i.e., the measure correctly identifies people who have the condition being assessed) and specificity (i.e., the measure correctly identifies people who do not have the condition being assessed). Sensitivity and specificity are a concern for traditional selection factors such as cognitive ability tests as well, but they may be more of a concern for health and health risk factors that are modifiable.
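The sensitivity and specificity ideas above can be stated simply as proportions of correct classifications. The short sketch below is purely illustrative; the counts are invented and do not come from any study cited in this chapter.

def sensitivity(true_positives: int, false_negatives: int) -> float:
    # Proportion of people who have the condition that the measure correctly flags.
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives: int, false_positives: int) -> float:
    # Proportion of people without the condition that the measure correctly clears.
    return true_negatives / (true_negatives + false_positives)

# Hypothetical screening of 1,000 applicants, 100 of whom have the condition.
print("Sensitivity:", sensitivity(true_positives=80, false_negatives=20))   # 0.80
print("Specificity:", specificity(true_negatives=855, false_positives=45))  # 0.95

Low values on either index limit the usefulness of a health measure as a predictor, which is the concern Rothstein (1983) raised.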
In addition to the legal issues, there are several measurement and validity issues of using health indicators for selection purposes. Hackl, Halla, Hummer, and Pruckner (2015), for example, raised the issue of the validity of general health screenings in predicting actual health status. In a comprehensive study covering a 10-year period conducted in the general adult population of Austria, general health screenings did not predict subsequent health status, and Hackl et al. concluded that general health screenings were not a viable approach for developing and maintaining


the health of the workforce. Lesser and Puhl (2014) suggest another drawback of using health indicators for selection purposes. In their study of the effects of incentives in organizational wellness programs, they noted that many health indicators have multiple causes and may not reliably indicate an individual’s health, especially over the long run. For example, they point out that excessive weight does not necessarily indicate that an individual is unhealthy, nor does being a normal weight indicate that an individual is healthy. Also, a given health indicator such as weight may reflect underlying factors an individual cannot control, such as genetic conditions, which could create legal issues if the indicator were used for selection purposes. For additional considerations in using health indicators for selection purposes, readers are referred to Gebhardt and Baker’s Chapter 12 in this volume, as many health indicators are assessments of physical abilities. Another consideration in focusing on selection as a mechanism to improve the health of a specific organization is that as a strategy it does not recognize the relation of employees’ health to the health of the community in which the organization is located. There is a growing recognition that the health of the population of a community is reflected in the health of the applicant pool, arguing for organizations to be engaged in the development of the human capital in their communities (National Academies of Sciences, Engineering and Medicine, 2015). Selection as a strategy does not directly recognize this link between an organization’s health and the health of the community population. Relatively few organizational wellness programs include family members in the activities and thus miss an opportunity to improve the health of the community (Oziransky et al., 2015). Certainly, selection would not typically include family members. Inclusion of family members in organizational wellness programs might be a good recruiting tool, making explicit the organization’s values relative to the well-being of employees and families, a shared value of health and safety, and engagement in the community. Given the legal implications of using health risk factors for selection and the potentially changing levels of many health risk factors before and after hiring, the case for using selection to create and maintain a healthy workforce seems weak. There is some evidence that organizational wellness programs can be useful for creating and maintaining a healthy workforce, and they may serve as a recruiting strategy. Future research may determine which health risk factors in interaction with which elements of the work environment and components of organizational wellness programs are most effective and which may be appropriate for use in selection.

WORK STRESS Considering the findings of Anderson et al. (2000) and Goetzel et al. (2014) that psychosocial risk factors, especially depression and stress, are prevalent in organizations and account for significant proportions of disabilities, absences, and healthcare expenditures, this section will focus on stress in the workplace. Job stress arises from a disruption to employees’ cognitive-emotional-environmental system by some external environmental demand in the work environment (Lazarus & Folkman, 1984). Various reviews of the extensive stress literature have generally concluded that prolonged exposure to certain job demands can have debilitating consequences for employees (Tetrick, 2002). Specifically, experienced stress can have adverse effects on individuals’ mental health, physical health, and organization-related outcomes (Ganster & Rosen, 2013), which can be very costly for organizations (Perrewé et al., 2005). It has been estimated that stress costs organizations billions of dollars annually in disability claims, absenteeism, and lost productivity (e.g., Ryan & Watson, 2004). More specifically, the World Health Organization estimates that stress costs American businesses $300 billion per year (Martin, 2012). In this section, we examine organizational-, job-, interpersonal-, and personal-level predictors of experienced work stress. At the organizational level, we focus on organizational resources and climate, work hours, and various work schedule arrangements, including a discussion of the pros and cons of using realistic job previews as a recruiting tool. At the job level, we examine role ambiguity and conflict, job demands, personal control at work, and adaptive performance. At the interpersonal level, we discuss a lack of social support, abusive supervision,


organizational politics, and political skill. Further, at the personal level, we look at several personality types and individual-level demographic predictors that include age and gender. Finally, we review some recent research on the interface between the work domain and the non-work domain.

Organizational-Level Stressors Organizations differ in the amount of resources that can be distributed among employees as well as in their general culture or climate. Working in a resource-poor environment with poor working conditions and few opportunities for pay raises and advancement can be stressful to employees. There are no easy answers to combating a dysfunctional work climate, but employees will need to adopt coping strategies. Coping strategies have been defined and operationalized in a variety of ways, but perhaps the most well-known theory of stress and coping comes from Lazarus and Folkman (1984), who distinguished between problem-solving coping and emotion-focused coping. Problem-solving coping is an attempt to get rid of the actual stressor. In the situation described, leaving the organization is one viable way to get rid of the stressful situation (e.g., taking another job to escape a dysfunctional organizational climate). Emotion-focused coping refers to more cognitive ways of coping when removing the stressor is not possible. For example, cognitive escapist coping refers to coping patterns that suggest an avoidance mode (e.g., trying not to think about work and blocking out others in the organization), whereas cognitive reappraisal refers to employees reevaluating their situation and focusing on the positive aspects of the job. Unfortunately, some organizations are simply “bad,” and employees will need to either leave the organization, cope in different ways to succeed in such an environment, or attempt to change the work environment. The next sections examine several organizational contexts. The widely held assumption that long work hours inevitably lead to negative health and quality-of-life outcomes is highly questionable. Barnett (2006) argued that long work hours appear to be a weak predictor of outcomes because the absolute number of work hours fails to take into account the distribution of those hours. Arguably, the distribution of work hours has greater implications for outcomes than does the number of work hours per se. Over the past two decades, the stereotypical workweek and work schedules have begun to vanish. Typical or standard work is often assumed to be working during the day on the basis of a Monday through Friday schedule. Interestingly, most of us do not fit the assumed typical workweek. In fact, Fenwick and Tausig (2001) found that less than one-third of the workforce in the United States and Canada is employed in jobs that fit the Monday through Friday, full-time day cycle. In recent years, the presence of contingent workers and flexible work schedules has grown because of an increasingly competitive market and the availability of new information technology. For many organizations, employing workers on a more temporary basis provides a way to maximize flexibility and minimize costs, especially when faced with seasonal work demands. Furthermore, flexibility in work schedules has been seen as a way to not only help organizations to remain competitive but also to offer employees more control over their own work schedules. Even full-time employment can be flexible, such as shift work.
Full-time shift work might involve working 35–40 hours during the week, but the work may be performed at night or early mornings, such as the “graveyard shifts.” This can benefit the organization by allowing services or production to be on a continual basis, but this can also help the employees by allowing flexibility in their work schedules so that they best meet their own needs. For example, dual-career couples with small children may like the idea of working different shifts because this might aid in their ability to care for their children. Flexible time schedules, job sharing (e.g., two employees working part-time but they share one job), temporary employment, home-based work, and teleworking (e.g., working from home, hotels, or other remote work sites) have all become more popular in recent years (Barling et al., 2002). However, the concerns are not if these types of arrangements aid in flexibility for the organization and the employee, but rather if these arrangements are reducing experienced stress for employees and are consistent with a healthy workforce. The following section examines the consequences of alternative work arrangements on the well-being of employees.


Research on the psychological well-being of part-time workers versus full-time workers has not demonstrated significant differences in terms of employee job attitudes or well-being (Barling & Gallagher, 1996). What appears to be the most important factor differentiating full-time from part-time workers regarding their well-being is whether working part-time is voluntary. Voluntary part-time employment actually has been shown to be beneficial in terms of job satisfaction and general well-being if the part-time employee has a sense of control over work scheduling (Krausz, Sagie, & Biderman, 2000). Unfortunately, not all “voluntary” part-time employment can be assumed to be positive. For example, many part-time workers are only working part-time because of a preexisting health concern or disability (Mykletun & Mykletun, 1999). Whether this constitutes true voluntary part-time work is debatable. Additional research examining the extent to which part-time employment is perceived by the employee to be truly voluntary is still needed before definite claims can be made regarding the role of part- versus full-time employment in health and well-being. Health concerns may become even more pronounced when coupled with rotating shifts (Jamal & Baba, 1997). Shift work, especially night work, has been found to be a risk factor for cardiovascular disease (Boggild & Knutsson, 1999). Parkes (2003) found that, in general, dayworkers reported more favorable perceptions of their work environment than did shift workers. However, she also found that differences in the work environment (i.e., onshore versus offshore) between dayworkers and shift workers were a moderator in these relationships. She argued that an organizational setting in which work routines are similar for dayworkers and shift workers, and in which all resources are available to both groups, might reduce the negative perceptions associated with shift work. Several factors may explain the relationship between shift work and health concerns, including the employee’s ability to adjust to differing schedules and the supportive nature of the employee’s family. Additional research that can separate out these effects is needed before we can make a clear statement about the effects of full-time versus part-time work and shift work on employee stress and health. One factor that does appear to be important in promoting health in employees is whether the work arrangements are voluntary. In a review of the research on work arrangements, Barling and colleagues examined several important arrangements, including temporary workers, job sharing, shift work, full-time versus part-time work, and seasonal and migrant employment (Barling et al., 2002). They concluded that psychological well-being depends less on the nature of the work arrangement and more on whether the arrangement was voluntary or not. Being able to choose or have some control over work arrangements is a very important factor in the ability to handle job stressors and the health and well-being of employees. Given that organizational work hours and schedules have the potential to be stressful to many workers, perhaps recruiting individuals who are comfortable with less traditional schedules might help ensure a long and effective employment relationship. One way to recruit workers who have an understanding of the employment environment is through realistic job previews (RJPs).
The basic argument is that job applicants will be better able to make informed decisions about whether or not to pursue a job opportunity if they have a clear idea about the job and job environment. RJPs give applicants a sense of realism for positive and negative aspects of the job and job environment that (a) might reduce the number of applicants who remain interested in the job but (b) increase the retention of those workers who are hired (Wanous, 1980). However, some empirical research (i.e., Bretz & Judge, 1998) suggests that RJPs may have too many opportunity costs for the organization because the highest-quality applicants may be less willing to pursue jobs for which negative information has been presented. Clearly, organizations need to be honest about the actual job; however, emphasizing the negative aspects of the job may hurt recruiting, especially with high-quality applicants.

Job-Level Stressors Job and role stressors such as role conflict, role ambiguity, and role overload (Kahn, Wolfe, Quinn, Snoek, & Rosenthal, 1964) have long been known to contribute to the stress experience. Role conflict occurs when employees’ expectations are incongruent with those expressed by


their role senders. Communication by those in authority of work expectations that are largely incompatible with those understood and internalized by employees may increase the likelihood of burnout. Role ambiguity is related to the amount of predictability in the work environment; thus, experienced stress may be more prevalent in situations in which employees are uncertain about their work goals and the means available for accomplishing them. Role overload, qualitative and quantitative, can contribute to the experience of burnout. Workers may experience qualitative overload if they feel deficient in the basic skills necessary for effective task completion. Quantitative overload is characterized by the belief that one’s work cannot be completed in the time allotted. Employees may experience stress if they perceive that they cannot successfully complete their work because of lack of skill, lack of time, or both. Vandenberg and colleagues have argued that occupational stress research has consistently demonstrated the deleterious effects of role stressors on occupational strain outcomes such as burnout, dissatisfaction, decreased performance, and psychophysiological responses such as increased heart rate and blood pressure (Vandenberg, Park, DeJoy, Wilson, & Griffin-Blake, 2002). Perhaps one of the best-known and longest-standing conceptualizations of job stress is Karasek’s (1979) Demands-Control model. Karasek suggested that heavy job demands, coupled with a lack of control, are associated with strain and job dissatisfaction. This is because control provides individuals with the confidence and efficacy to perceive and interpret their task environment in nonthreatening ways, thereby neutralizing the potentially dysfunctional effects of job demands (Theorell, 2004). Job demands refer to the physical, psychological, organizational, or social aspects of the job that require sustained physical and/or psychological effort and are therefore associated with physical and/or psychological costs. Stress research examining job demands and resources (e.g., control) has found that job demands can result in resource loss because they initiate a job strain process (Bakker & Demerouti, 2007). Accordingly, research would suggest that organizations should balance the job demands placed on the individual employee with the discretion permitted to the worker in order for the employee to cope with the heightened expectations of these demands. Implicitly, demands can increase with little or no increase in psychological strain as long as adequate levels of job control are maintained (Mauno, Kinnunen, & Ruokolainen, 2007). Of course, the employee must have the personal (e.g., resilience) and/or organizational (e.g., autonomy) resources to cope with the job demands. Furthermore, if the job demands are perceived to be unreasonably high, this is another employee health concern because sustained demands may actually reduce employees’ perceptions of control at work (Bakker & Demerouti, 2007).
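As a purely illustrative sketch of how the Demands-Control idea described above is often examined, the code below fits a moderated regression in which job control buffers the relationship between demands and strain. The data are simulated, the variable names are hypothetical, and the coefficients imply nothing about any study cited here.

import numpy as np

rng = np.random.default_rng(1)
n = 300

demands = rng.normal(size=n)
control = rng.normal(size=n)
# Simulated strain: demands raise strain, control lowers it, and the negative
# interaction term represents the buffering effect the model predicts.
strain = 0.5 * demands - 0.2 * control - 0.3 * demands * control + rng.normal(size=n)

# Ordinary least squares with an interaction term.
X = np.column_stack([np.ones(n), demands, control, demands * control])
coefficients, *_ = np.linalg.lstsq(X, strain, rcond=None)

for label, value in zip(["intercept", "demands", "control", "demands x control"], coefficients):
    print(f"{label:>18}: {value: .2f}")

A negative and reliable interaction coefficient would be consistent with the buffering interpretation, whereas a null interaction would be more consistent with additive effects of demands and control.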
Employee adaptability encompasses a wide variety of behaviors including handling emergencies or crisis situations, handling work stress, dealing with uncertainty, and learning new technologies and procedures (Pulakos et al., 2000). The question is how can managers select adaptable workers or train workers to be adaptable? Given the various types of adaptable behaviors, it might not be possible to select or train workers to be adaptable on all aspects of their performance; however, we may be able to offer some general guidelines for selection and training. First, research has shown some evidence that certain personalities might be more (or less) adaptive. For example, LePine, Colquitt, and Erez (2000) examined the effects of conscientiousness on decision making before and after unforeseen changes in a task context and found that individuals who are higher in conscientiousness do not adapt quickly to change. Of course, a plethora of research demonstrates that conscientious employees perform at very high levels (e.g., Barrick & Mount, 1991; Wallace & Vodanovich, 2003). Thus, managers may want to consider how the work environment (e.g., dynamic and constantly changing) might affect employees’ health differentially. Employees may be performing well in the short-term, but this performance may come at a cost to employees’ health and well-being. Much more research is needed on personality profiles (i.e., examining several personality dimensions in conjunction with one another) before selecting employees based on personality is warranted. This will be discussed in more detail in a later section.


Second, managers may want to consider prior experience in adaptability as a selection criterion. Research has long demonstrated that one of the best predictors of future performance is past performance (e.g., Wernimont & Campbell, 1968). Biodata instruments that emphasize prior experiences with crises and emergencies may prove to be an effective means of selection (cf. Pulakos et al., 2000). Furthermore, exposing employees in training to the kinds of unpredictable situations they might encounter in the work setting may prepare them to be more adaptive and creative. Finally, organizations that customize resource-based interventions to their specific employees and to the resources relevant to those employees’ work contexts are likely to bolster employees’ resources (Baumeister & Alghamdi, 2015).

Interpersonal Relationships

Employees require specific job resources (e.g., social support) to successfully navigate stressful work environments while maintaining psychological health. To date, social support has attracted the most extensive investigation in the interpersonal domain, and findings consistently support the idea that a lack of support from coworkers and supervisors is highly correlated with increases in occupational stress and burnout (Maslach, Schaufeli, & Leiter, 2001). Work environments that fail to support emotional exchange and instrumental assistance may exacerbate strain by isolating employees from each other and discouraging socially supportive interactions. Workplaces characterized by conflict, frustration, and hostility may have the same effect. Beyond a general lack of social support, we focus on two additional types of interpersonal stressors commonly found in the workplace: abusive supervision and perceptions of politics. We also examine how political skill can help alleviate some of the negative effects of interpersonal stressors.

Abusive supervision is one of the most detrimental interpersonal stressors found in the workplace. Abusive supervision reflects subordinates' perceptions of negative and hostile verbal and nonverbal leader behaviors (Tepper, 2007). Such behaviors include public criticism, yelling, rudeness, bullying, coercion, and blaming subordinates for mistakes they did not make (Burton & Hoobler, 2006). Research indicates that abused subordinates are less satisfied with their jobs, less committed to their organizations, and more likely to display turnover intentions than are nonabused subordinates (Schat, Desmarais, & Kelloway, 2006). Employees consider abusive supervision a source of stress and injustice in the workplace that has the potential to influence their attitudes, psychological distress, and physical well-being (Tepper, 2007). If employees believe their behaviors have no bearing on the accrual of desired outcomes, then their sense of volition is weakened (Greenberger & Strasser, 1986), and many researchers believe that perceptions of control are more powerful predictors of functioning than actual control (Burger, 1989). This distinction is critical because individuals' perceived control influences their behaviors and emotions, regardless of the actual control conditions contributing to these perceptions. Work environment factors such as regulated administration, available help, and feedback influence perceived control. Not surprisingly, research suggests that supervisors may engage in abuse to protect their own sense of power and control over work situations (Tepper, Duffy, Henle, & Lambert, 2006), thus limiting that of their employees. Abusive supervisory behavior has been linked to greater experienced stress (Tepper, 2007) and to reduced psychological and physical well-being among employees (Grandey et al., 2007).

Another well-researched interpersonal-level stressor is organizational politics. Organizations have long been considered political arenas, and the study of organizational politics has been a popular topic for many years. Mintzberg (1983) defined politics as individual or group behavior that is typically disruptive, illegitimate, and not approved of by formal authority, accepted ideology, or certified expertise.
Organizations, indeed, can be viewed as political arenas, where informal negotiation and bargaining, deal-making, favor-doing, quid-pro-quo interactions, and coalition and alliance building characterize the way things really get done. Environmental
circumstances, such as perceptions of organizational politics, can be thought of as work demands, which are potential sources of stress because they threaten or cause a depletion of the resources individuals possess. Over the past three decades, research examining the relationship between organizational politics and job stress has flourished, with empirical research demonstrating that politics and the ability to manage politics (e.g., political skill) have direct as well as moderating effects on stress-related outcomes, including job anxiety and tension, helplessness, victimization, burnout, depression, and diminished control over personal outcomes (e.g., Chang, Rosen, & Levy, 2009; Perrewé et al., 2004). Thus, research indicates that workplace politics are a significant concern and source of stress for many workers.

What we know less about are the characteristics that enable one to exercise influence in ways that lead to success in political work environments. Some have referred to such qualities as interpersonal style, "savvy," "street smarts," and "political skill" (e.g., Reardon, 2000). Research has demonstrated that different forms of personal control (e.g., interpersonal social skill or political skill) can mitigate the negative effects of job stressors. Political skill is the ability to effectively understand others at work and to use such knowledge to influence others to act in ways that enhance one's personal and/or organizational goals (Ferris et al., 2007). Politically skilled individuals are socially astute and keenly aware of the need to deal differently with different situations and people. Therefore, they have the capacity to adjust their behavior to different and changing situational demands (i.e., self-monitoring) in a sincere and trustworthy manner. It has been suggested that political skill generates an increased sense of self-confidence and personal security because politically skilled individuals experience a greater degree of interpersonal control, or control over activities that take place in social interactions at work (Perrewé et al., 2005). Furthermore, greater self-confidence and control lead individuals to interpret workplace stressors in different ways, resulting in significantly less strain and anxiety at work (Kanter, 2004). Consistent with this argument, Perrewé et al. (2004) found that political skill neutralized the negative effects of role conflict on psychological anxiety, somatic complaints, and physiological strain (i.e., heart rate, systolic and diastolic blood pressure), and political skill moderated the role overload–strain relationship in a similar manner (Perrewé et al., 2005). More recently, Rosen and Ganster (2014) concluded that political skill can help mitigate the negative effects of workplace stressors (e.g., organizational politics), such that stressors have less of a negative impact on employees' psychological as well as physiological health outcomes.

The important message is that personal control, such as believing one has the political skill to successfully navigate his or her work environment, appears to play a fairly significant role in buffering the negative effects of work stressors. On the other hand, a lack of personal control may exacerbate the stressor–strain relationship, or it may even be perceived as a stressor itself. Given the importance of political skill, we recommend that organizations consider it an important attribute in employee selection and training.
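Several of the findings above (control in the Demands-Control model, political skill in the studies by Perrewé and colleagues) describe buffering, which is typically tested as a moderated regression. The display below is a generic sketch in our own notation, offered only to illustrate the form such tests take, not to reproduce any cited analysis:

\[
\text{Strain}_i \;=\; \beta_0 \;+\; \beta_1\,\text{Stressor}_i \;+\; \beta_2\,\text{Resource}_i \;+\; \beta_3\,(\text{Stressor}_i \times \text{Resource}_i) \;+\; \varepsilon_i
\]

Here, Resource stands in for job control or political skill. A buffering (neutralizing) effect is indicated when \(\beta_3\) is negative, so that the stressor–strain slope, \(\beta_1 + \beta_3\,\text{Resource}_i\), flattens as the resource increases.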
Political skill is an individual characteristic that can be learned and developed (Ferris, Davidson, & Perrewé, 2005). Today's training, more than ever before, needs to be compelling, realistic, practical, relevant, and lasting. In addition, training should encourage risk taking and facilitate improved awareness and behavioral flexibility. Assessment centers that include simulations and drama-based training may be viable options for political skill development (see Ferris et al., 2005, for a more in-depth discussion of developing political skill). Drama-based training is a training model that includes lifelike simulations in which participants practice managing complex human interactions in a safe and controlled learning environment (St. George, Schwager, & Canavan, 2000), and, as such, it provides a useful vehicle for shaping and developing various social skills. Furthermore, assigning individuals to work with skilled mentors is another important way to develop influence skills. Individuals can observe professionals in real work situations as they exercise influence in meetings with subordinates and peers. Language, facial expressions, body posture, and gestures convey messages to observers about how influence is best exercised. The key is to ensure that individuals are assigned to talented and understanding mentors who engage in many social influence interactions, and that they are given ample opportunities to discuss the interactions they encounter.

Personal Characteristics

Although various aspects of the external environment play a critical role in the experience of stress and burnout, specific personal characteristics may lead some individuals to be more likely than others to experience strain in the same environment. The evidence on individual differences, such as personality differences, suggests that certain individuals are more prone to strain than others. The Five-Factor Model, or "Big Five" model of personality, has been examined extensively in the organizational context over the past decade. Although some disagreement exists over the appropriate names for the five factors, most would agree that the Five-Factor Model consists of five broad dimensions of personality: extraversion, neuroticism, conscientiousness, agreeableness, and openness to experience. Research using this typology indicates that individuals who are high in neuroticism are more likely to experience stress and burnout (Zellars & Perrewé, 2001). Furthermore, those higher in extraversion, agreeableness, and openness to experience are less likely to experience stress (Zellars, Perrewé, & Hochwarter, 2000). In a review of the role of personality in organizations, Perrewé and Spector (2002) discussed how Type A behavior pattern and negative affectivity have been shown to have positive associations with experienced stress and negative associations with health and well-being. In addition, individuals with a high internal locus of control experience less strain than individuals with a high external locus of control.

Individuals high in conscientiousness are described as efficient, diligent, thorough, hardworking, persevering, and ambitious (e.g., McCrae & John, 1992). Conscientiousness has been related to a number of positive work outcomes, such as organizational citizenship behaviors (Borman & Penner, 2001), job performance (Barrick & Mount, 1991), workplace safety performance (Wallace & Vodanovich, 2003), and intrinsic and extrinsic career success (Judge, Higgins, Thoresen, & Barrick, 1999). However, more research is needed before selecting employees on the basis of personality, whether for their ability to perform well or to cope with stressful work environments, is warranted. For example, conscientiousness had a negative relationship with decision quality after an unanticipated change, which suggests that conscientious people do not adapt quickly to change (LePine et al., 2000). Thus, we do not recommend selecting (or not selecting) employees on the basis of one personality dimension (e.g., conscientiousness) alone.

Perhaps future research should examine a more holistic approach to personality by looking at several combinations of personality dimensions (i.e., personality profiles) and their relationships with important outcomes, such as coping with stressful situations and job performance. For example, research has found that conscientiousness, when coupled with positive affectivity (i.e., the dispositional tendency to experience positive emotions across situations and time), resulted in the lowest levels of reported job tension (Zellars, Perrewé, Hochwarter, & Anderson, 2006). Furthermore, the individual difference variables of perceived control, optimistic orientation, and self-esteem are highly correlated and, together, form a hardy or "resilient personality" (Major, Richards, Cooper, Cozzarelli, & Zubek, 1998) that can help workers adapt to change and cope with work stressors.
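To illustrate what a profile-based (configural) approach might look like in practice, the sketch below compares a hypothetical outcome across combinations of two traits using simulated data. All variable names, cut points, and effect sizes are invented for illustration; none are taken from Zellars et al. (2006) or any other study cited here.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 4000

# Hypothetical standardized trait scores (fabricated data for illustration).
conscientiousness = rng.normal(size=n)
positive_affect = rng.normal(size=n)

# Simulated job tension with invented main effects and a configural (interaction)
# term, so that tension is lowest when both traits are high.
job_tension = (
    -0.10 * conscientiousness
    - 0.20 * positive_affect
    - 0.15 * conscientiousness * positive_affect
    + rng.normal(size=n)
)

# Define profile cells by median splits (a deliberately simple, illustrative rule).
high_c = conscientiousness > np.median(conscientiousness)
high_pa = positive_affect > np.median(positive_affect)

profiles = {
    "high C / high PA": high_c & high_pa,
    "high C / low PA": high_c & ~high_pa,
    "low C / high PA": ~high_c & high_pa,
    "low C / low PA": ~high_c & ~high_pa,
}

# Comparing the outcome across trait configurations is what distinguishes a
# profile-based analysis from examining each dimension in isolation.
for label, mask in profiles.items():
    print(f"{label}: mean job tension = {job_tension[mask].mean():+.3f}")
```

The point of the sketch is simply that the cell defined by a combination of traits can show a pattern that neither trait reveals on its own, which is why profile research is recommended before personality-based selection for strain resistance is attempted.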
Although a comprehensive examination of personality is beyond the scope of this chapter, personality clearly has the potential to be a powerful selection tool. However, additional research is critical before confident predictions can be made about workers' ability to handle stressors and to perform adaptively. In addition to personality characteristics, simple demographic differences have been shown to have an association with experienced stress. We focus on two demographic characteristics that have been found to have some relation with occupational stress: age and gender. Research has demonstrated that younger employees consistently report a higher level of burnout (Maslach et al., 2001). Some researchers suggest that older employees experience lower levels of burnout because they have shifted their own expectations to fit reality on the basis of their personal experiences (Cordes & Dougherty, 1993). These findings suggest that older, more experienced employees are better able to handle the demands of stressful work environments. Alternatively, the findings regarding older workers may reflect that they have passed some critical threshold of burnout that would trigger turnover; that is, they may handle stressful environments by altering their perceptions and reducing their expectations of what is possible
in terms of career accomplishment or satisfaction. High expectations and unmet expectations can contribute to increased levels of burnout (Cordes & Dougherty, 1993). Younger employees tend to be more idealistic and thus may react more intensely when their overly optimistic career expectations are shattered. On the other hand, younger workers have been shown to respond more positively to some workplace stressors, specifically perceptions of organizational politics. Results across three studies demonstrated that increases in politics perceptions were associated with decreased job performance for older employees, whereas younger employees achieved higher performance scores when perceptions of politics were high (Treadway et al., 2005).

In regard to gender, stress can result from feelings of discrimination in a male-dominated work environment (Sullivan & Mainiero, 2007). Much of the literature suggests that working women are more likely to experience stress on the basis of being female (see Powell & Graves, 2003). Being employed in a male-dominated work environment can itself be a source of stress, because the norms, values, and expectations of the male-dominated culture are distinctly different (Maier, 1999). Furthermore, women in male-dominated environments are more likely to face certain stressors such as sexual harassment and discrimination (Nelson & Burke, 2000). Gender does not appear to be a strong predictor of preferred work hours (Jacobs & Gerson, 2004); family circumstances are more important than gender in predicting preferred work hours (Barnett, 2006). Specifically, women and men with young children want more time away from work than do other groups. Although women with young children cut back on their time at paid work more than men do, they do so to a smaller extent than in previous generations. Jacobs and Gerson (2004) found little support for the popular belief that married women with young children are the primary group wishing to work less, stating that "About half of married men and women across a range of family situations express such a desire" (p. 73).

Selecting employees on the basis of personality, gender, or age is not recommended. What is encouraged is setting realistic expectations for career advancement, allowing flexibility and control in work schedules, and providing training opportunities for all employees. It is important for managers to understand the entire person (not just the employee) and to help employees achieve a balance in their work and non-work lives.

Work and Non-work Interface

The interface between work and non-work is one of the most critical challenges organizations and individuals face today. Research on the work-family, or work and non-work, interface typically focuses on the antecedents and consequences of work-family conflict. In a meta-analysis of more than 60 studies, Byron (2005) identified the most common work and family domain antecedents of work-family conflict. Work domain antecedents included job involvement, hours spent at work, work support, schedule flexibility, and job stress. Family domain antecedents included family/non-work involvement, hours spent in non-work, family support, family stress, family conflict, number of children, age of the youngest child, marital status, and spousal employment.

Research has demonstrated that job stress, family stress, and work-family conflict are all related to each other bi-directionally, which underlines the reciprocal nature of family interference with work and work interference with family (Frone, 2003). This emphasizes the importance of better understanding how stressful events at home (or at work), such as psychological bullying, affect attitudes and behavior at work (or at home). Based on decades of research, Frone (2003) argued that both work interfering with family and family interfering with work were positively related to individuals' anxiety, negative moods, and substance abuse disorders. Work-family conflict has been linked to both physical and psychological outcomes such as depression, physical health complaints, and hypertension (Frone, 2003), as well as to lower job satisfaction, greater turnover intentions, lower perceived career success, lower career satisfaction, and lower family satisfaction (Eby, Casper, Lockwood, Bordeaux, & Brinley, 2005). When the experienced stress from work affects the non-work domain (and vice versa), this has been termed 'spillover'. When
the experienced stress from work for one person affects the experienced stress of another person, this has been termed 'crossover'. Spillover theory describes a process by which feelings, attitudes, and behaviors spill over from one role to another for the same individual (Piotrkowski, 1979), and it has been used to describe the transference of moods, skills, values, and behaviors from one role to another (Carlson, Kacmar, Wayne, & Grzywacz, 2006). Stress spillover, a form of stress contagion, occurs when stress experienced in one domain of life results in stress in another domain for the same individual (Edwards & Rothbard, 2000). Previous research on the work-family interface has shown that work stressors (e.g., abusive supervision) are linked to work interference with family (Carlson, Ferguson, Perrewé, & Whitten, 2011). Carlson et al. (2011) found that abusive supervision had detrimental effects not only on the subordinates at home (i.e., spillover) but also on their partners (i.e., crossover). Thus, when an employee experiences stress from work, this stress may spill over into his or her family life, affecting both the employee and the employee's partner. Furthermore, recent research has demonstrated that partner aggression at home affected employee job withdrawal as well as performance (LeBlanc, Barling, & Turner, 2014). As stress researchers have long acknowledged (and continue to find), the work domain is not independent of the other domains in employees' lives. The work-life interface continues to be an important area of inquiry.

Summary of Work Stress

In this section, we examined organizational-, job-, interpersonal-, and personal-level predictors of experienced work stress. Furthermore, we examined the interface between work and non-work, including spillover and crossover processes. Although we examined several personal characteristics that have been associated with higher levels of experienced stress, the selection of "strain-resistant" individuals into organizations is not necessarily recommended. Just as environmental conditions can affect employees, employees can adapt to and change their environments to make the work situation less stressful. Given the complexity and reciprocal effects of individuals and their environments, we do not have enough empirical findings to be confident that certain individuals are strain-resistant in all situations. Furthermore, efforts to recruit and select strain-resistant individuals do little to help existing employees. For most organizations, strain prevention programs, such as the wellness programs discussed earlier, may be useful. Such programs can be used to teach individuals how to identify stressors and modify their coping strategies. Specific strategies include setting goals that provide more realistic expectations of work, teaching better time management, facilitating social support, using simulation and drama-based training, and developing social networking skills through mentoring.

OCCUPATIONAL SAFETY

Accidents and injuries at work are costly for individuals and organizations, and avoiding severe accidents is an essential priority for all organizations. Therefore, it is not surprising that considerable attention is paid to factors that might influence whether an individual is involved in an accident. Sanders and McCormick (1987) identified selection as one of the key strategies used to reduce human error, in addition to training and job design. Despite the popularity of selection systems to manage safety, there is limited evidence for the effectiveness of selection for improving organizational safety, particularly when it is implemented in isolation from other interventions. Guastello (1993) conducted a meta-analytic review of accident prevention programs and found that personnel selection programs had a relatively weak relationship with accident rates. Although individual selection practices were the most common type of accident reduction program used, they produced the weakest outcomes of the 10 types of intervention examined. Sanders and McCormick (1987) considered work design to be a more effective approach to improving safety compared with
selection because it requires less ongoing maintenance and support. Moreover, they argued it is easier to limit the occurrence of human errors by making them impossible, difficult, or inconsequential through work design rather than relying on changing and managing individuals.

FIGURE 24.2 Proximal Safety Behaviors and Consequences [figure elements: system factors (e.g., training, risk); selection processes; individual motivational and cognitive predictors (e.g., personality, ability); proximal behavior and cognitions (e.g., rule following, lapses); consequences (e.g., accidents)]

There is substantial agreement among researchers that safety needs to be considered from a systemic perspective that includes factors at the individual, micro-organizational, and macro-organizational level (Hofmann, Jacobs, & Landy, 1995). Vredenburgh (2002) found that selection practices were effective as part of a broader proactive strategy of recruiting and training safety-conscious individuals. However, focusing solely on individual causes is not sufficient for understanding safety at work. Some researchers suggest that selection practices designed to improve safety will be less important than training and design interventions, which have a more systemic and wide-ranging impact on safety (Lawton & Parker, 1998). Figure 24.2 depicts how selection processes can be situated within a larger systemic framework that includes other systems (e.g., training) while focusing on individual differences and behavior. The goal of selection is to identify individual characteristics that might influence individual behavior and cognition, which, in turn, might influence consequences such as accidents in an organization. The figure also shows that systemic factors might shape any aspect of the causal chain from individual differences to consequences. In a safety context, these systemic factors have been conceptualized in terms such as "latent failures." They represent the impact of managerial and organizational processes on safety outcomes. With the above considerations in mind, our review will focus on the role of selection within a broader context. We begin by looking more closely at the meaning of safety as a criterion construct.
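Before turning to safety criteria, it may help to make the causal chain in Figure 24.2 concrete. In statistical terms it is a mediation model: individual characteristics predict proximal safety behavior, which in turn predicts accident outcomes. The following is a minimal simulation sketch of that logic on fabricated data; the variable names and path values are hypothetical and are not taken from any study cited in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical standardized trait score (fabricated data for illustration),
# e.g., an individual difference assessed at selection.
trait = rng.normal(size=n)
# Proximal safety behavior partly driven by the trait (path a).
safety_behavior = 0.30 * trait + rng.normal(size=n)
# Continuous accident criterion driven (negatively) by safety behavior (path b).
accidents = -0.25 * safety_behavior + rng.normal(size=n)

def ols(y, *predictors):
    """Ordinary least squares; returns [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones_like(y), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

a = ols(safety_behavior, trait)[1]                    # trait -> behavior
b = ols(accidents, trait, safety_behavior)[2]         # behavior -> accidents, trait controlled
c_total = ols(accidents, trait)[1]                    # total effect of trait on accidents
c_direct = ols(accidents, trait, safety_behavior)[1]  # direct effect, behavior controlled

print(f"indirect (a*b) = {a * b:.3f}, total = {c_total:.3f}, direct = {c_direct:.3f}")
# Because these data were generated with full mediation, the direct effect
# should be close to zero once safety behavior is controlled.
```

Full mediation would leave the direct path near zero once behavior is controlled, whereas partial mediation leaves a nonzero direct path; this distinction reappears in the meta-analytic findings discussed later in the section on predicting safety behaviors.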

Safety Criteria and Safety Systems

As with other topics in this chapter, the criterion of interest is complex and conceptually problematic. Safe working can be viewed as the presence of safe behaviors (e.g., following correct procedures) or the absence of unsafe ones (e.g., avoiding errors). In addition, the criterion of safe behavior is often not clearly distinguished from its consequences, such as personal injury. For example, much of the research investigating individual differences and safety focuses on the prediction of accidents reported by the organization. However, accidents might be determined by a range of situational factors beyond the proximal behavior of the individual. We emphasize the distinction between accidents and more proximal behaviors in Figure 24.2. These proximal
behaviors include individual actions such as slips that might lead to accident and injury, as well as positive behaviors such as using safety equipment that might reduce accidents and injury. Next, we review the literature predicting accidents at work and then consider the prediction of individual behavior more proximally associated with accidents.

Predicting Workplace Accidents

It is estimated that at least 80% of accidents are the result of some kind of human error (Hale & Glendon, 1987). Accidents have been the main focus of much safety research, including that related to selection. However, accidents are both a broad and a limited criterion for selection. Accidents are a broad criterion because they range from minor falls to events that result in death. Accidents are a limited criterion because they do not include important events such as narrowly missing a serious injury. Accidents are also constrained as a criterion because of problems in recording and reporting these events, as discussed later in this section. Despite these concerns with accidents and injury as criteria for evaluating selection practices, they remain the most commonly used measures of safety outcomes. Therefore, we first review evidence for selection methods and measures that reduce accident outcomes. Many reviews of safety also include road accidents and injuries as part of their measurement; however, we exclude studies of road safety unless they specifically incorporate the work context.

There is a long history of research seeking to identify factors that predict whether individuals will experience a work accident. The most often studied, and perhaps least successful, line of research has been the search for an "accident-prone" individual. Despite the popularity of this idea, there is little consistency in the findings from research. Glendon and McKenna (1995) concluded that it is impossible to define a stable profile that identifies an accident-prone individual. Overall, there is little evidence that an accident-prone personality can be identified that distinguishes employees who have accidents from those who do not (Lawton & Parker, 1998).

Beyond the search for a general personality type that is prone to accidents, there is growing evidence linking more specific dispositions and behavioral orientations to safety outcomes. Research into specific traits is producing a more complex picture of the way individual differences relate to safety, and a wide range of personality dimensions has been investigated as potential antecedents of accidents and injury. Dimensions of the Big Five categorization of personality traits have received the most attention. Clarke and Robertson's (2005) meta-analysis of studies involving workplace and motor vehicle accidents showed that low conscientiousness (nine studies) and low agreeableness (seven studies) were associated with more individual accidents in the workplace studies. In an updated meta-analysis, Clarke and Robertson (2008) found that low agreeableness was the personality trait most consistently linked to higher workplace accidents. The other four personality dimensions were also linked to accidents, although the effects were more variable and showed stronger evidence of moderation by unmeasured variables. More recently, a meta-analysis by Beus, Dhanani, and McCord (2015) also showed significant links between the Big Five dimensions and accidents, with only extraversion having a 95% confidence interval that included zero. The results also showed that the effect sizes for personality dimensions were smaller than those for situational measures of safety climate. This study extended previous meta-analyses by including safety behaviors and facets of the Big Five dimensions.
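For readers less familiar with how such meta-analytic statements are produced, the display below is a simplified, generic sketch of sample-size-weighted averaging in the spirit of psychometric meta-analysis; it is our illustration rather than the exact estimation procedure used by Beus et al. (2015):

\[
\bar{r} \;=\; \frac{\sum_{i=1}^{k} N_i\, r_i}{\sum_{i=1}^{k} N_i},
\qquad
SE_{\bar{r}} \;\approx\; \frac{SD_r}{\sqrt{k}},
\qquad
\text{95\% CI:}\;\; \bar{r} \,\pm\, 1.96\, SE_{\bar{r}}
\]

where \(r_i\) and \(N_i\) are the correlation and sample size from study \(i\), \(k\) is the number of studies, and \(SD_r\) is the observed standard deviation of the correlations. A confidence interval that includes zero, as reported for extraversion, simply means the average effect cannot be distinguished from zero at that confidence level.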
Similarly, Christian, Bradley, Wallace, Burke, and Spears (2009) investigated the meta-analytic link between personality and accidents as part of a broader model linking proximal and distal antecedents to safety outcomes via safety behavior. We discuss implications of these findings in the next section on predicting safety behavior. Trait affectivity has also been considered as a broad personality dimension that might predict accidents and injury. Iverson and Erwin (1997) found that trait positive affectivity and trait negative affectivity were related to accidents one year later, after controlling for a range of job conditions. Although they did not control for stability in these characteristics, the design was stronger than many in this area. They suggested that extraversion factors such as overconfidence and intolerance were associated with risk taking, and that neuroticism factors such as anxiety and indecision were associated with task distractibility. Frone (1998) found negative affectivity but not
rebelliousness and impulsivity to be related to accidents. However, this relationship disappeared after taking account of physical hazards and workload. Outside of the Big Five and trait affectivity, Guastello (1993) found that selection predictors associated with maladjustment showed a positive relationship with accident reduction. Two types of maladjustment were considered: personal maladjustment was based on measures such as distractibility and tension, and social maladjustment was based on measures such as safety locus of control. Studies of impulsivity, alcohol use, and drug use showed no significant relationships with accidents. Liao, Arvey, Butler, and Nutting (2001) found that the psychopathic deviate and conversion hysteria scales of the Minnesota Multiphasic Personality Inventory (MMPI) were associated with a higher frequency of injuries in a prospective study of firefighters. They also found social introversion to be associated with injury rates (see Liao et al., 2001, p. 231, for a review of MMPI types). The conversion hysteria scale was originally based on patients who exhibited some sensory or motor disorder, whereas those scoring high on the psychopathic deviate scale were more likely to act on impulse or ignore rules. Finally, locus of control has been identified in some studies as being related to accidents; however, research in this area is inconsistent and inconclusive (see Lawton & Parker, 1998).

A range of demographic factors and job types has been linked to accidents. Adolescents represent the age group with the highest risk for nonfatal accident and injury (Frone, 1998). A concurrent study of adolescents found work injuries were associated with gender, negative affectivity, job tenure, and exposure to physical hazards, excessive workloads, job boredom, poor physical health, and on-the-job substance abuse (Frone, 1998). Liao et al. (2001) found that female firefighters experienced more injuries than males, although the reason for this difference was uncertain. Studies of general mental ability have been contradictory (see Hansen, 1989, for a review). Physical abilities such as strength and flexibility can be valid predictors of performance in hazardous work environments and so might be used to predict safety outcomes (Hogan & Lesser, 1996). Readers are referred to Gebhardt and Baker's Chapter 12 in this volume for more discussion of the use of physical abilities tests in selection systems for arduous jobs.

Predicting Safety Behaviors

Beyond accidents as a criterion, it is important to consider how selection procedures might predict the specific safety behaviors that precede accidents and near misses, or that increase the potential for accidents to occur. Recent meta-analyses have combined studies of behavior and accidents to investigate mediational models of accident causation consistent with Figure 24.2. The meta-analysis by Beus et al. (2015) found that safety behaviors partially mediated the link between personality and work accidents. These authors had hypothesized full mediation, and they speculated that partial mediation was found because personality measures also include behavioral descriptors that might not be captured in the measures of safety behavior. Christian et al. (2009) found that safety behavior fully mediated the link between conscientiousness and accidents.

To review the role of safety behavior in more detail, we build on a distinction between safety compliance and safety participation that has been developed in the general area of occupational health and safety (Griffin & Neal, 2000). Most meta-analytic studies have combined different aspects of safety performance into a single performance measure. However, individual studies suggest important distinctions between these aspects of behavior and have elaborated further distinctions that are important in a variety of safety contexts (Curcuruto, Conchie, Mariani, & Violante, 2015). Safety compliance refers to behaviors such as using correct safety procedures and equipment and complying with safety regulations. These behaviors contribute to safety outcomes associated with an employee's core task activities. Safety participation refers to behaviors such as participating in safety meetings, communicating safety issues to others, and suggesting ideas for improving safety in the organization. These behaviors support the broader organizational context of safety rather than the safety of the individual or the specific task.

Cognitive and motivational antecedents influence both safety compliance and safety participation. Cognitive processes include the knowledge to carry out tasks, understanding the consequences of actions, and attending to important events (Christian et al., 2009). Motivational
processes describe the willingness to engage in a specific behavior. It is possible that cognitive processes are more important for safety compliance, whereas motivational processes are more important for safety participation (Motowidlo, Borman, & Schmit, 1997). However, there is little empirical evidence for this proposition at this stage, and both cognitive and motivational predictors should be considered for safety compliance and safety participation. Next, we review some of the cognitive and motivational predictors that might be useful when selecting for safety compliance and safety participation.

Safety Compliance

Behaviors associated with safety compliance include vigilance, perseverance, and accurately following procedures. Tests for vigilance can provide information about the extent to which individuals are able to maintain attention. For example, in critical medical contexts, the ability to maintain vigilant scanning might be an important element of safety behavior (Subramanian, Kumar, & Yauger, 1994). Mindfulness has received a great deal of attention in the workplace and has been proposed to support safe working behaviors. Zhang, Ding, Li, and Wu (2013) found employees with higher levels of trait mindfulness were more likely to show higher levels of awareness and attention and to perform work more safely, particularly when tasks were more complex. Persevering with safety compliance requires maintenance of effort over time. Conscientiousness is a predictor of effort that has been validated for general job performance and linked to accident outcomes (Barrick & Mount, 1991; Clarke & Robertson, 2005). Conscientiousness should play an important role in sustaining safety compliance. From a different perspective, distractibility or neuroticism can reduce an individual's ability to maintain consistent effort over time (Hansen, 1989). Avoiding errors and mistakes is important for safety compliance. Errors of execution and action (e.g., slips, lapses, trips) and procedural mistakes are more likely to arise from attention failures. Several cognitive processes have been linked to attention failures and the situational awareness required for scanning and responding to the work environment (Carretta, Perry, & Ree, 1996). Cognitive failure (Simpson, Wadsworth, Moss, & Smith, 2005) and safety consciousness (Westaby & Lee, 2003) describe the way individuals pay attention to the safety requirements of the work environment, and selection activities can assess the degree to which individuals are able to demonstrate these capacities. On the other hand, knowledge-based mistakes occur when an individual lacks the appropriate information to perform correctly. For these types of mistakes, safety knowledge is likely to be a more important predictor (Hofmann et al., 1995). Finally, it is important to consider deliberate noncompliance with safety requirements. Integrity tests have shown validity for predicting organizationally counterproductive behaviors such as rule-breaking and dishonesty (Casillas, Robbins, McKinniss, Postlethwaite, & Oh, 2009).

Safety Participation

The behaviors that make up safety participation have received less attention than safety compliance behaviors in terms of personnel selection. Participation supports the overall safety context and includes behaviors such as encouraging the safety of others, contributing to safety initiatives, and proactively supporting change in safety practices. By definition, these behaviors often go beyond individual core task requirements and may be discretionary in some jobs. Standard job analysis practices that focus on individual task performance are therefore less likely to articulate the behaviors that are important for safety participation. Selection for these behaviors requires consideration of the broader context and its constraints. For example, Yuan, Li, and Lin (2014) found that dispositional core self-evaluation, a personality trait involving a sense of efficacy and control (Judge, Locke, & Durham, 1997), moderated the negative impact of work stress on safety behavior. The ability to communicate
safety concerns with others and encourage safety compliance of team members might be critical where teams work in high-risk environments. Validity evidence from personality testing suggests that extraversion can predict performance in jobs requiring social interaction (Barrick & Mount, 1991). To date, research has focused on contextual factors that motivate these behaviors, such as leadership (Barling, Loughlin, & Kelloway, 2002) and job conditions (Probst & Brubaker, 2001). Organizations that can articulate the nature of safety participation in their specific context will be better able to identify potential individual predictors of these activities.

Summary of Safety

In summary, our review suggests that selection can play a part in a safer work environment, but its role is complex. Many of the attributes required are trainable or are strongly influenced by the organizational environment. Methodological limitations, such as the use of concurrent designs, reduce the ability of many safety studies to inform selection systems. However, an equally important concern is the degree to which theory is used to explain the behaviors that constitute the criterion domain of work safety. There is now a growing body of theory and evidence linking individual differences in factors such as agreeableness to safety outcomes such as accidents, via specific safety-related behaviors. There is also good evidence about the way organizational factors, such as training and safety climate, might modify these links. Further theoretical development about the way individual differences contribute to safety outcomes within different organizational systems will enhance the role that can be played by selection procedures.

CONCLUSIONS

In the three sections of this chapter, we have examined correlates of employee health, work stress, and safety. On the basis of the literature, there appear to be consistent findings that workplace factors can enhance the health and safety of employees. Also, some relatively stable individual characteristics have been found to be related to stress, resilience, safety compliance, and safety participation. Unfortunately, the empirical literature has not generally considered workplace factors and individual characteristics jointly to evaluate potential interactions between person characteristics and environmental factors or the relative contribution of each in predicting health, stress, and safety. Many of the theoretical perspectives on occupational safety and health, including stress, have not specifically taken an interactional perspective and tend to focus on either situational factors or personal characteristics. Although there is support for the effects of some relatively stable individual characteristics that might be useful for selection purposes in creating and maintaining a healthy workforce and a healthy work environment, the current empirical evidence is not strong, and there are potential legal ramifications in using some of these characteristics in selection systems. It is possible that, in certain contexts, selection based on individual characteristics may have utility. However, the literature as a whole currently appears to favor workplace interventions as more effective than selection.

REFERENCES

Anderson, D. R., Whitmer, R. W., Goetzel, R. Z., Ozminkowski, R. J., Dunn, R. L., Wasserman, J., & Serxner, S. (2000). The relationship between modifiable health risks and group-level health care expenditures. American Journal of Health Promotion, 15, 45–52. Baicker, K., Cutler, D., & Song, Z. (2010). Workplace wellness programs can generate savings. Health Affairs, 29, 304–311. Bakker, A. B., & Demerouti, E. (2007). The job demands-resources model: State of the art. Journal of Managerial Psychology, 22(3), 309–328.
Barling, J., & Gallagher, D. G. (1996). Part-time employment. International Review of Industrial and Organizational Psychology, 11, 243–278. Barling, J., Loughlin, C., & Kelloway, E. K. (2002). Development and test of a model linking safety-specific transformational leadership and occupational safety. Journal of Applied Psychology, 87, 488–496. Barnett, R. C. (2006). Relationship of the number and distribution of work hours to health and quality-of-life outcomes. In P. L. Perrewé & D. C. Ganster (Eds.), Employee health, coping and methodologies: Research in occupational stress and well being (Vol. 5, pp. 99–138). Oxford, England: JAI Press/Elsevier Science. Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26. Baumeister, R. F., & Alghamdi, N. (2015). Resource-based interventions in the workplace: Integration, commentary, and recommendations. Journal of Occupational and Organizational Psychology, 88, 623–629. Baxter, S., Sanderson, K., Venn, A. J., Blizzard, L., & Palmer, A. J. (2014). The relationship between return on investment and quality of study methodology in workplace health promotion programs. American Journal of Health Promotion, 28, 347–363. Beus, J. M., Dhanani, L. Y., & McCord, M. A. (2015). A meta-analysis of personality and workplace safety: Addressing unanswered questions. Journal of Applied Psychology, 100(2), 481–498. Boggild, H., & Knutsson, A. (1999). Shift work, risk factors, and cardiovascular disease. Scandinavian Journal of Work, Environment & Health, 25, 85–99. Borman, W. C., & Penner, L. A. (2001). Citizenship performance: Its nature, antecedents, and motives. In B. W. Roberts & R. Hogan (Eds.), Personality in the workplace (pp. 45–61). Washington, DC: American Psychological Association. Bretz, R. D., Jr., & Judge, T. A. (1998). Realistic job previews: A test of the adverse self-selection hypothesis. Journal of Applied Psychology, 83(2), 330–337. http://dx.doi.org/10.1037/0021-9010.83.2.330 Burger, J. M. (1989). Negative reactions to increases in perceived personal control. Journal of Personality and Social Psychology, 56(2), 246–256. http://dx.doi.org/10.1037/0022-3514.56.2.246 Burton, J., & Hoobler, J. (2006). Subordinate self-esteem and abusive supervision. Journal of Managerial Issues, 18, 340–355. Burton, W. N., Chen, C., Conti, D., Schultz, A., Pransky, G., & Edington, D. (2005). The association of health risks with on-the-job productivity. Journal of Occupational and Environmental Medicine, 47, 769–777. Bureau of Labor Statistics. (2009). Health insurance costs for civilians. Retrieved from http://data.bls.gov/cgi-bin/surveymost Byron, K. (2005). A meta-analytic review of work-family conflict and its antecedents. Journal of Vocational Behavior, 67, 169–198. Carlson, D., Ferguson, M., Perrewé, P. L., & Whitten, D. (2011). The fallout of abusive supervision through work-family conflict: An examination of job incumbents and their partners. Personnel Psychology, 64, 937–961. Carlson, D. S., Kacmar, K. M., Wayne, J. H., & Grzywacz, J. G. (2006). Measuring the positive side of the work-family interface: Development and validation of a work-family enrichment scale. Journal of Vocational Behavior, 68, 131–164. Carretta, T. R., Perry, D. C., & Ree, M. J. (1996). Prediction of situational awareness in F-15 pilots. The International Journal of Aviation Psychology, 6, 21–41.
Casillas, A., Robbins, S., McKinniss, T., Postlethwaite, B., & Oh, I. S. (2009). Using narrow facets of an integrity test to predict safety: A test validation study. International Journal of Selection and Assessment, 17(1), 119–125. Christian, M. S., Bradley, J. C., Wallace, J. C., Burke, M. J., & Spears, J. (2009). Workplace safety: A meta-analysis of the roles of person and situation factors. Journal of Applied Psychology, 94(5), 1103–1127. Chang, C-H., Rosen, C. C., & Levy, P. E. (2009). The relationship between perceptions of organizational politics and employee attitudes, strain, and behavior: A meta-analytic examination. Academy of Management Journal, 52, 779–801. Clarke, S., & Robertson, I. T. (2005). A meta-analytic review of the Big Five personality factors and accident involvement in occupational and non-occupational settings. Journal of Occupational and Organizational Psychology, 78, 355–376. Clarke, S., & Robertson, I. T. (2008). An examination of the role of personality in work accidents using meta-analysis. Applied Psychology: An International Review, 57(1), 94–108. Collins, J. J., Baase, C. M., Sharda, C. E., Ozminkowski, R. J., Nicholson, S., Billotti, G. M., Turpin, R. S., Olson, M., & Berger, J. L. (2005). The assessment of chronic health conditions on work performance, absence, and total economic impact for employers. Journal of Occupational and Environmental Medicine, 47, 547–557. Cordes, C. L., & Dougherty, T. W. (1993). A review and integration of research on job burnout. Academy of Management Review, 18, 621–656.
Curcuruto, M., Conchie, S. M., Mariani, M., & Violante, F. (2015). The role of prosocial and proactive safety behaviors in predicting safety performance. Safety Science, 80, 317–323. Eby, L. T., Casper, W. J., Lockwood, A., Bordeaux, C., & Brinley, A. (2005). Work and family research in IO/OB: Content analysis and review of the literature (1980–2002). Journal of Vocational Behavior, 66, 124–197. Edwards, J. R., & Rothbard, N. P. (2000). Mechanisms linking work and family: Clarifying the relationship between work and family constructs. Academy of Management Review, 25, 178–199. Fenwick, R., & Tausig, M. (2001). Scheduling stress: Family and health outcomes of shift work and schedule control. The American Behavioral Scientist, 44, 1179–1198. Ferris, G. R., Davidson, S. L., & Perrewé, P. L. (2005). Political skill at work. Mountain View, CA: Davies-Black/CPP. Ferris, G. R., Treadway, D. C., Perrewé, P. L., Brouer, R. L., Douglas, C., & Lux, S. (2007). Political skill in organizations. Journal of Management, 33, 290–320. Frone, M. R. (1998). Predictors of work injuries among employed adolescents. Journal of Applied Psychology, 83, 565–576. Frone, M. R. (2003). Work-family balance. In J. C. Quick & L. E. Tetrick (Eds.), Handbook of occupational health psychology (pp. 143–162). Washington, DC: American Psychological Association. Ganster, D. C., & Rosen, C. C. (2013). Work stress and employee health: A multidisciplinary review. Journal of Management, 39(5), 1085–1122. Glendon, I., & McKenna, E. (1995). Human safety and risk management. London: Chapman and Hall. Goetzel, R. Z., Anderson, D. R., Whitmer, R. W., Ozminkowski, R. J., Dunn, R. L., & Wasserman, J. (1998). The relationship between modifiable health risks and health care expenditures: An analysis of the multi-employer HERO health risk and cost database. Journal of Occupational and Environmental Medicine, 40, 843–854. Goetzel, R. Z., Hawkins, K., Ozminkowski, R. J., & Wang, S. (2003). The health and productivity cost burden of the "top 10" physical and mental health conditions affecting six large U.S. employers in 1999. Journal of Occupational and Environmental Medicine, 45, 5–14. Goetzel, R. Z., Henke, R. M., Tabrizi, M., Pelletier, K. R., Loeppke, R., Ballard, D. W., . . . Metz, R. D. (2014). Do workplace health promotion (wellness) programs work? Journal of Occupational and Environmental Medicine, 56(9), 927–934. Grandey, A. A., Kern, J., & Frone, M. (2007). Verbal abuse from outsiders versus insiders: Comparing frequency, impact on emotional exhaustion, and the role of emotional labor. Journal of Occupational Health Psychology, 12, 63–79. Greenberger, D. B., & Strasser, S. (1986). Development and application of a model of personal control in organizations. The Academy of Management Review, 11, 164–177. Griffin, M. A., & Neal, A. (2000). Perceptions of safety at work: A framework for linking safety climate to safety performance, knowledge, and motivation. Journal of Occupational Health Psychology, 5(3), 34–58. Guastello, S. J. (1993). Do we really know how well our occupational accident prevention programs work? Safety Science, 16, 445–463. Hackl, F., Halla, M., Hummer, M., & Pruckner, G. J. (2015). The effectiveness of health screening. Health Economics, 24, 913–935. Hale, A. R., & Glendon, A. I. (1987). Individual behaviour in the control of danger. New York, NY: Elsevier. Hansen, C. P. (1989). A causal model of the relationship among accidents, biodata, personality, and cognitive factors.
Journal of Applied Psychology, 74, 81–90. Hofmann, D. A., Jacobs, R., & Landy, F. (1995). High reliability process industries: Individual, micro, and macro organizational influences on safety performance. Journal of Safety Research, 26(3), 131–149. Hogan, J., & Lesser, M. (1996). Selection of personnel for hazardous performance. In J. E. Driskell & E. Salas (Eds.), Series in applied psychology. Stress and human performance (pp. 195–222). Hillsdale, NJ: Lawrence Erlbaum Associates. Ilgen, D. R., & Pulakos, E. D. (1999). Employee performance in today’s organizations. In D. R. Ilgen & E. D. Pulakos (Eds.), The changing nature of work performance: Implications for staffing, motivation, and development (pp. 1–20). San Francisco, CA: Jossey-Bass. Iverson, R., & Erwin, P. (1997). Predicting occupational injury: The role of affectivity. Journal of Occupational and Organizational Psychology, 70, 113–128. Jacobs, J. A., & Gerson, K. (2004). The time divide: Work, family and gender inequality. Cambridge, MA: Harvard University Press. Jamal, M., & Baba, V. V. (1997). Shift work, burnout and well-being: A study of Canadian nurses. International Journal of Stress Management, 4, 197–204. Judge, T. A., Higgins, C. A., Thoresen, C. J., & Barrick, M. R. (1999). The Big Five personality traits, general mental ability, and career success across the life span. Personnel Psychology, 52, 621–652.
Judge, T. A., Locke, E. A., & Durham, C. C. (1997). The dispositional causes of job satisfaction: A core-evaluations approach. Research in Organizational Behavior, 19, 151–188. Kahn, R. L., Wolfe, D. M., Quinn, R. P., Snoek, J. D., & Rosenthal, R. A. (1964). Organizational stress: Studies in role conflict and ambiguity. New York, NY: Wiley. Kaiser Daily Health Report. (January 2008). U.S. health care spending reaches $2.1T in 2006, increasing 6.7%. Retrieved from http://www.kaiserhealthnews.org/Daily-Reports/2008/January/08/dr00049709.aspx?referrer=search Kaiser Family Foundation. (2008). Employer health insurance costs and worker compensation. Retrieved from http://www.kff.org/insurance/snapshot/chcm030808oth.cfm Kaiser Network. (2006). Companies shift more medical costs to workers. Retrieved from http://www.kaisernetwork.org/daily_reports/rep_index.cfm Kanter, R. M. (2004). Confidence. New York, NY: Crown Business. Karasek, R. A. (1979). Job demands, job decision latitude, and mental strain: Implications for job redesign. Administrative Science Quarterly, 24, 285–308. Kendall, J., & Ventura, P. L. (2005). A stumbling block on the road to wellness: The ADA disability-related inquiry and medical examination rules and employer wellness incentive programs. Benefits Law Journal, 18, 57–76. Krausz, M., Sagie, A., & Biderman, Y. (2000). Actual and preferred work schedules and scheduling control as determinants of job-related attitudes. Journal of Vocational Behavior, 56, 1–11. Lawton, R., & Parker, D. (1998). Individual differences in accident liability: A review and integrative approach. Human Factors: The Journal of the Human Factors and Ergonomics Society, 40, 655–671. Lazarus, R. S., & Folkman, S. (1984). Stress, appraisal, and coping. New York, NY: Springer. LeBlanc, M. M., Barling, J., & Turner, N. (2014). Intimate partner aggression and women's work outcomes. Journal of Occupational Health Psychology, 19(4), 399–412. LePine, J. A., Colquitt, J. A., & Erez, A. (2000). Adaptability to changing task contexts: Effects of general cognitive ability, conscientiousness, and openness to experience. Personnel Psychology, 53, 563–593. Lesser, L. I., & Puhl, R. M. (2014). Alternatives to monetary incentives for employee weight loss. American Journal of Preventive Medicine, 46, 429–431. Liao, H., Arvey, R. D., Butler, R. J., & Nutting, S. M. (2001). Correlates of work injury frequency and duration among firefighters. Journal of Occupational Health Psychology, 6(3), 229–242. Maier, M. (1999). On the gendered substructure of organization: Dimensions and dilemmas of corporate masculinity. In G. N. Powell (Ed.), Handbook of gender and work (pp. 69–93). Thousand Oaks, CA: Sage. Major, B., Richards, C., Cooper, M. L., Cozzarelli, C., & Zubek, J. (1998). Personal resilience, cognitive appraisals, and coping: An integrative model of adjustment to abortion. Journal of Personality and Social Psychology, 74, 735–752. Martin, J. (2012). Stress at work is bunk for business. Forbeswoman. Retrieved from http://www.forbes.com/sites/work-in-progress/2012/08/02/stress-at-work-is-bunk-for-business/2/ Maslach, C., Schaufeli, W. B., & Leiter, M. P. (2001). Job burnout. Annual Review of Psychology, 52, 397–422. Mattke, S., Liu, H., Caloyeras, C., Huang, Y., Van Busum, K. R., Khodyakov, D., & Shier, V. (2013). Workplace wellness programs study. RAND Health. Retrieved from http://www.dol.gov/ebsa/pdf/workplacewellnessstudyfinal.pdf Mauno, S., Kinnunen, U., & Ruokolainen, M. (2007).
Job demands and resources as antecedents of work engagement: A longitudinal study. Journal of Vocational Behavior, 70(1), 149–171. http://dx.doi. org/10.1016/j.jvb.2006.09.002 McCrae, R. R., & John, O. P. (1992). An introduction to the Five-Factor model and its applications. Journal of Personality, 60, 175–215. Mills, P. R., Kessler, R. C., Cooper, J., & Sullivan, S. (2007). Impact of a health promotion program on employee health risks and work productivity. American Journal of Health Promotion, 22, 45–53. Mintzberg, H. (1983). Power in and around organizations. Englewood Cliffs, NJ: Prentice-Hall. Motowidlo, S. J., Borman, W. C., & Schmit, M. J. (1997). A theory of individual differences in task and contextual performance. Human Performance, 10(2), 71–83. Mykletun, R. J., & Mykletun, A. (1999). Comprehensive schoolteachers at risk of early exit from work. Experimental Aging Research, 25, 359–365. National Academies of Sciences, Engineering, and Medicine. (2015). Applying a health lens to business practices, policies, and investments: Workshop summary. Washington, DC: The National Academies Press. Nelson, D. L., & Burke, R. J. (2000). Women executives: Health, stress and success. Academy of Management Executive, 14, 107–121. Nicoll, A., & Murray, V. (2002). Health protection—A strategy and a national agency. Public Health, 116, 129–137. O’Donnell, M. P. (2015). What is the ROI for workplace health promotion? It really does depend, and that’s the point. American Journal of Health Promotion, 29, v–viii. doi: 10.4278/ajhp.29.3.v

551

Oziransky, V., Yach, D., Tsao, T., Luterek, A., & Stevens, D. (2015, July). Beyond the four walls: Why community is critical to workforce health. Retrieved from http://thevitalityinstitute.org/projects/community-health (accessed January 7, 2016).
Parkes, K. R. (2003). Shiftwork and environment as interactive predictors of work perceptions. Journal of Occupational Health Psychology, 8, 266–281.
Parks, K. M., & Steelman, L. A. (2008). Organizational wellness programs: A meta-analysis. Journal of Occupational Health Psychology, 13, 58–68.
Perrewé, P. L., & Spector, P. E. (2002). Personality research in the organizational sciences. In G. R. Ferris & J. J. Martocchio (Eds.), Research in personnel and human resources management (Vol. 21, pp. 1–85). Oxford, England: JAI Press/Elsevier Science.
Perrewé, P. L., Zellars, K. L., Ferris, G. R., Rossi, A. M., Kacmar, C. J., & Ralston, D. A. (2004). Neutralizing job stressors: Political skill as an antidote to the dysfunctional consequences of role conflict stressors. Academy of Management Journal, 47, 141–152.
Perrewé, P. L., Zellars, K. L., Rossi, A. M., Ferris, G. R., Kacmar, C. J., Liu, Y., Zinko, R., & Hochwarter, W. A. (2005). Political skill: An antidote in the role overload–strain relationship. Journal of Occupational Health Psychology, 10, 239–250.
Peterson-Kaiser Health System Tracker. Measuring the performance of the U.S. health system. Retrieved from http://www.healthsystemtracker.org/chart-collection/how-does-health-spending-in-the-u-s-compare-to-other-countries/ (accessed January 9, 2016).
Piotrkowski, C. S. (1979). Work and the family system: A naturalistic study of the working-class and lower-middle-class families. New York, NY: Free Press.
Powell, G. N., & Graves, L. M. (2003). Women and men in management. Thousand Oaks, CA: Sage.
Probst, T. M., & Brubaker, T. L. (2001). The effects of job insecurity on employee safety outcomes: Cross-sectional and longitudinal explorations. Journal of Occupational Health Psychology, 6(2), 139–159.
Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000). Adaptability in the workplace: Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85, 612–624.
Reardon, K. K. (2000). The secret handshake: Mastering the politics of the business inner circle. New York, NY: Doubleday.
Rosen, C. C., & Ganster, D. C. (2014). Workplace politics and well-being: An allostatic load perspective. In A. M. Rossi, J. A. Meurs, & P. L. Perrewé (Eds.), Improving employee health and well-being (pp. 3–24). Charlotte, NC: Information Age Publishing.
Rothstein, M. A. (1983). Occupational safety and health law. St. Paul, MN: West Group.
Ryan, D., & Watson, R. (2004). A healthier future. Occupational Health, 56(7), 20–21.
Sanders, M. S., & McCormick, E. J. (1987). Human factors in engineering and design (6th ed.). New York, NY: McGraw-Hill.
Schat, A. C. H., Desmarais, S., & Kelloway, E. K. (2006). Exposure to workplace aggression from multiple sources: Validation of a measure and test of a model. Unpublished manuscript, McMaster University, Hamilton, Canada.
Shurtz, R. D. (2005). Reining in health care costs with wellness programs: Frequently overlooked legal issues. Benefits Law Journal, 18, 31–60.
Simon, T. M., Bruno, F., Grossman, N., & Stamm, C. (2006). Designing compliant wellness programs: HIPAA, ADA, and state insurance laws. Benefits Law Journal, 19, 46–59.
Simon, T. M., Traw, K., McGeoch, B., & Bruno, F. (2007). How the final HIPAA nondiscrimination regulations affect wellness programs. Benefits Law Journal, 20, 40–44.
Simpson, S. A., Wadsworth, E. J. K., Moss, S. C., & Smith, A. P. (2005). Minor injuries, cognitive failures and accidents at work: Incidence and associated features. Occupational Medicine, 55, 99–108.
Smerd, J. (2007, October 22). Smoker? Can’t work here, more firms say. Workforce Management. Retrieved June 12, 2008, from http://www.lexisnexis.com/us/lnacademic/frame.do?tokenKey=rsh-20.312685.587091969
St. George, J., Schwager, S., & Canavan, F. (2000, Autumn). A guide to drama-based training. National Productivity Review, 15–19.
Subramanian, R., Kumar, K., & Yauger, C. (1994). The scanning of task environments in hospitals: An empirical study. Journal of Applied Business Research, 10, 104–115.
Sullivan, S. E., & Mainiero, L. (2007). Women’s kaleidoscope careers: A new framework for examining women’s stress across the lifespan. In P. L. Perrewé & D. C. Ganster (Eds.), Exploring the work and nonwork interface: Research in occupational stress and well being (Vol. 6, pp. 205–238). Oxford, England: JAI Press/Elsevier Science.
Tepper, B. J. (2007). Abusive supervision in formal organizations: Review, synthesis and research agenda. Journal of Management, 33, 261–289.
Tepper, B. J., Duffy, M. K., Henle, C. A., & Lambert, L. S. (2006). Procedural injustice, victim precipitation, and abusive supervision. Personnel Psychology, 59(1), 101–123. http://dx.doi.org/10.1111/j.1744-6570.2006.00725.x
Tetrick, L. E. (2002). Understanding individual health, organizational health, and the linkage between the two from both a positive health and an ill health perspective. In P. L. Perrewé & D. C. Ganster (Eds.), Historical and current perspectives on stress and health: Research in occupational stress and well being (Vol. 2, pp. 117–141). Oxford, England: JAI Press/Elsevier Science.
Theorell, T. (2004). Democracy at work and its relationship to health. In P. L. Perrewé & D. C. Ganster (Eds.), Research in occupational stress and well being (Vol. 3, pp. 323–357). Oxford, England: JAI Press/Elsevier Science.
Treadway, D. C., Ferris, G. R., Hochwarter, W. A., Perrewé, P. L., Witt, L. A., & Goodman, J. M. (2005). The role of age in the perceptions of politics—job performance relationship: A three-study constructive replication. Journal of Applied Psychology, 90, 872–881.
Vandenberg, R. J., Park, K., DeJoy, D. M., Wilson, M. G., & Griffin-Blake, C. S. (2002). The healthy work organization model: Expanding the view of individual health and well being in the workplace. In P. L. Perrewé & D. C. Ganster (Eds.), Historical and current perspectives on stress and health: Research in occupational stress and well being (Vol. 2, pp. 57–115). Oxford, England: JAI Press/Elsevier Science.
Vredenburgh, A. G. (2002). Organizational safety: Which management practices are most effective in reducing employee injury rates? Journal of Safety Research, 33(2), 259–276.
Wallace, J. C., & Vodanovich, S. J. (2003). Workplace safety performance: Conscientiousness, cognitive failure, and their interaction. Journal of Occupational Health Psychology, 8, 316–327.
Wanous, J. P. (1980). Organizational entry: Recruitment, selection, and socialization of newcomers. Reading, MA: Addison-Wesley.
Wernimont, P. F., & Campbell, J. P. (1968). Signs, samples, and criteria. Journal of Applied Psychology, 52(5), 372–376. http://dx.doi.org/10.1037/h0026244
Westaby, J. D., & Lee, B. C. (2003). Antecedents of injury among youth in agricultural settings: A longitudinal examination of safety consciousness, dangerous risk taking, and safety knowledge. Journal of Safety Research, 34(3), 227–240. http://dx.doi.org/10.1016/S0022-4375(03)00030-6
World Health Organization. (2008). Workplace health promotion. Retrieved from http://www.who.int/occupational_health/topics/workplace/en
Yuan, Z., Li, Y., & Lin, J. (2014). Linking challenge and hindrance stress to safety performance: The moderating effect of core self-evaluation. Personality and Individual Differences, 68, 154–159.
Zellars, K. L., & Perrewé, P. L. (2001). Affective personality and the content of emotional social support: Coping in organizations. Journal of Applied Psychology, 86, 459–467.
Zellars, K. L., Perrewé, P. L., & Hochwarter, W. A. (2000). Burnout in healthcare: The role of the five factors of personality. Journal of Applied Social Psychology, 30, 1570–1598.
Zellars, K. L., Perrewé, P. L., Hochwarter, W. A., & Anderson, K. S. (2006). The interactive effects of positive affect and conscientiousness on strain. Journal of Occupational Health Psychology, 11, 281–289.
Zhang, J., Ding, W., Li, Y., & Wu, C. (2013). Task complexity matters: The influence of trait mindfulness on task and safety performance of nuclear power plant operators. Personality and Individual Differences, 55(4), 433–439.

25
THE DEFICIENCY OF OUR CRITERIA
Who Defines Performance, Contribution, and Value?
JEANETTE N. CLEVELAND, KEVIN R. MURPHY, AND ADRIENNE COLELLA

Work psychologists have had a longstanding interest in the criterion problem and have been particularly concerned with determining how to measure job performance and success at work. Many notable industrial and organizational (I-O) psychologists have urged researchers to develop theories of employee performance (e.g., Campbell, 1990). Within the last two decades, we have made progress in the articulation and measurement of required tasks and behaviors at work (Campbell, 1990; Chapter 20, this volume) and the identification of contingent, discretionary behaviors that are important for person-team success (e.g., Borman & Motowidlo, 1993; Chapter 21, this volume). However, the approach to the criterion problem followed by most researchers continues to be generally narrow and reinforces the status quo in terms of what is defined as success or successful work behaviors in organizations. This chapter is different from many of the others in this volume in the sense that we will make an argument that the criteria we typically use to validate everything from selection tests to organizational interventions (e.g., training programs, executive succession programs) run the risk of being deficient because they represent the concerns of only a narrow set of stakeholders and because they ignore a wide range of behaviors and outcomes in organizations. For example, one way to think about the validation of selection instruments is that a test or assessment is a good one if it helps us to identify applicants who are likely to become successful employees and contribute to the effectiveness of the organization. There are many ways of defining what one means by a successful employee and even an effective organization. This chapter is devoted to exploring the possibility of expanding the boundaries of our current definitions of success and effectiveness. Unlike many of the other chapters in this volume, we are not in a position to determine precisely how some of these changes might influence conclusions reached about the validity of tests, assessments, or other predictors, because research on validation against the broader set of criteria envisioned here is still in its infancy. Nevertheless, we see considerable value in stepping back to ask what it means for an employee to be successful and how our thinking about interventions in organizations might change with a different set of criteria. In particular, we will discuss the implications of incorporating two concepts related to employee health and welfare (physical/mental health and promotion of organizational health and sustainability) into the definition of a successful employee.


DEFINING SUCCESS

The most widely used definitions of what represents success in organizations at the individual level (e.g., job performance) and the organizational level (e.g., organizational effectiveness) have not changed fundamentally over the years. There have been advances in understanding particular aspects of performance and success (e.g., contextual performance), but there have not yet been substantial changes in the way we think about the criteria that are used to evaluate personnel selection, training, or other interventions in organizations. Our thinking has not changed, but the context in which work occurs certainly has (Cascio & Aguinis, 2008). The boundaries between work and nonwork, and between local, national, and global or international spheres, have steadily eroded, and these domains increasingly overlap. The world of work is becoming increasingly complex and intrusive (e.g., it is common for employees to take their work with them when they go home or on vacations), and the definition of success in the workplace is constantly evolving. This implies the need for an increasingly broad view of the criterion domain. Several previous chapters in this volume (e.g., Chapters 20–24) provide excellent reviews and discussions about specific aspects or facets of workplace behavior and performance domains. Each of these performance aspects is important to the effectiveness and health of employees and organizations within the 21st-century workplace. In the current chapter, we argue that both traditional and emerging facets of the performance domain must be incorporated as part of the foundation for an integrated and long-term-focused human resources (HR) system, and, importantly, that issues concerning the larger context, especially the work-nonwork interface, must be incorporated into our criterion models to more fully capture the increasing complexities of the workplace and the diversity of the workforce. Finally, as Ployhart and Weekley so eloquently articulate in Chapter 5, this volume, using a multilevel lens, we may be better able to link individual-level HR systems to more macro indices of organizational productivity and sustainability.

In this chapter, we focus on defining what it means to be a successful employee. We will argue that the definition of what constitutes success or effective performance is a critically important one, and the question of who makes this decision and why is potentially even more important. That is, a number of choices need to be made in deciding what represents effective performance (e.g., is someone who routinely works a 60-hour week showing more dedication than someone who routinely works a 40-hour week, or is the first worker simply less efficient? Is someone who completes all of his or her tasks but who makes it difficult for others to perform their tasks because of a lack of willingness to help out or to behave civilly a better performer than a less efficient but more considerate coworker?), and these choices reflect the preferences and values of some decision makers, while the preferences and values of other potential decision makers may be shut out (Murphy, 2009). A range of perspectives might be considered in defining our criteria, but these have rarely been examined in a systematic way. Changes in the workplace and the workforce are making these choices increasingly complex and increasingly important.

Work and Workforce in the 21st Century: Outmoded Assumptions and Bases for Change

The design of work and the definition of success in the workplace continue to be built around the assumption that most or all employees will treat the workplace as their primary focus, an assumption that often works only if employees have at least one adult at home in the role of “caregiver” (Smolensky & Gootman, 2003). That is, our models for defining successful job performance and a successful career (Murphy, 1998) begin with the assumptions that (a) each worker can and should devote a great deal of time, attention, and loyalty to the organization; (b) there will be someone at home to take care of the other needs; (c) the demands of the work and nonwork sides of life are distinct and nonoverlapping; and (d) the costs associated with work interfering with nonwork can be ignored (or at least are not the concern of the organization)
whenever the organization places demands on its members. The way psychologists and managers have defined and measured success in general, and work performance in particular (i.e., an emphasis on task performance, devotion to the organization, progression toward higher levels of the organization), makes a good deal of sense if you start with a homogeneous (e.g., male, White, traditional nuclear family structure) and local (e.g., U.S. workers) workforce, but it is not necessarily sensible in the current environment. Given the changing nature of the workforce both within the United States and globally, it is now time to think more broadly about the conceptualization of our criteria within I-O psychology.

Job performance is not the same as success. We need to clearly distinguish between job performance and success, a broader construct that might be assessed and defined across multiple levels of analysis and might be defined differently depending on whether the focus is on the short or the long term. Furthermore, both constructs need to be considered in relation to their costs. That is, the headlong pursuit of performance in the workplace might have several costs to the organization (e.g., short-term focus) and to the community (e.g., work-family conflict); different definitions of success in organizations might push employees to engage in a range of behaviors that have personal and societal costs (e.g., workaholism).

Why should we examine how success is measured in organizations? We argue that (a) success is a much broader and more encompassing construct with content that spills over from work to nonwork domains; and (b) success and performance must be understood within a multilevel context, recognizing that for some organizational problems and decisions, we can focus on understanding performance at a given level but that what occurs at one level may not reverberate at other levels in a similar way.

I-O psychology has made significant progress in specific facets of criterion theory and measurement, as shown by the in-depth review chapters in this volume (see Chapters 20–24). In the following section, the concepts of the ultimate or conceptual criterion and actual criteria (and the associated notions of criterion relevance, contamination, and deficiency) are used to describe how I-O psychologists have contributed to the understanding of one of the most important psychological outcomes—performance success. Briefly, we review the development of task performance theory, contextual performance, adaptive performance, and counterproductive work behaviors. Using the notion of criterion deficiency, we identify where our current conceptualizations of success are likely to be narrow, outmoded, and deficient.

Criterion Problem in I-O Psychology

The legacy of 60 years of scientific research on criteria between 1917 and 1976 was the identification of the “criterion problem” (e.g., Austin & Villanova, 1992). The term denotes the difficulty involved in the conceptualization and measurement of performance constructs, particularly when performance measures are multidimensional and used for different purposes.

Definition and Assumptions of Criterion Problem

Bingham (1926) was perhaps the first to use the word criterion in one of the two ways that it is frequently used today, as “something which may be used as a measuring stick for gauging a worker’s relative success or failure” (p. 1). In the organizational sciences, the most widely used criteria are often measures of job performance, and this is certainly one way of assessing success or failure. However, the construct “job performance” is both multidimensional and complex (see Chapters 20–24 in this volume), and the choice of dimensions to represent or define performance depends on how broadly or narrowly one interprets the meaning of success (i.e., conceptual criterion; Nagle, 1953). More generally, an employee’s success or failure is probably defined in wider terms than his or her job performance. Consider, for example, the employee who performs his or her tasks well, receives raises and promotions, but who also makes life
miserable for coworkers and who abandons responsibilities in his or her family or community, and who eventually burns out and quits work. You could argue that this person is not a success. Success is not a construct that exists a priori; different time frames might be chosen to define and measure success, and different components of this multifaceted construct might get more or less emphasis.

Traditionally, discussions of the criterion problem have started with the assumption that the conceptual or ultimate criterion of success is reasonably well defined and that the major problem involves the shift from conceptualizing or defining success to its actual measurement. When this shift is made, a gap is likely to develop between the “ideal” conceptualization of success and its practical or actual measurement. The relationship between conceptual and practical measurement of success is depicted using two general notions: the conceptual criterion and actual criteria. The term “conceptual,” “theoretical,” or “ultimate criterion” (Thorndike, 1949) describes the full domain of everything that ultimately defines success (Cascio, 2000). Because the ultimate criterion is strictly conceptual, it cannot be measured or directly observed. It embodies the notion of “true,” “total,” “long-term,” or “ultimate worth” to the employing organization (Cascio, 2000). Implicit in this model is the often unexamined assumption that we all know and agree about the conceptual definition of success (i.e., the idea that the ultimate criterion is obvious and noncontroversial). Yet, key performance stakeholders (e.g., employees, organizations, families, society, and the environment) do not necessarily know or agree on the conceptual definition and content of the ultimate criterion. In short, an ultimate criterion is important because the relevance or linkage of any operational or measurable criterion is better understood if the conceptual stage is clearly and thoroughly documented (Astin, 1964). I-O psychology can and does measure some facets of success very well, but these may reflect a small, narrow proportion of the ultimate criterion.

As Chapters 20–24 suggest, the criterion domain has a number of distinct facets, such as task performance, organizational citizenship, counterproductive behavior, and even physical and mental health (e.g., a working style that allows you to accomplish many tasks but that causes you to burn out in a short time and suffer long-term health consequences that prevent you from working might not be an effective one). There are multiple stakeholders whose interests, preferences, and values might influence decisions about which dimensions of the criterion domain should get the most emphasis (Murphy, 2010). These stakeholders include management, the employees being evaluated, coworkers, members of the community, families, and possibly political and social groups. For example, suppose decisions need to be made about the relative importance of task performance and organizational citizenship. Managers and supervisors might be most concerned with task performance, because a number of tasks need to be performed well to advance the goals and objectives of their particular unit. Coworkers may be somewhat more concerned with organizational citizenship, since they are the ones who benefit most from their coworkers’ citizenship behaviors.
There is a large amount of research dealing with the roles of various stakeholders in determining the actions of organizations (Agle, Mitchell, & Sonnenfeld, 1999; Rowley & Moldoveanu, 2003; Starkey & Madan, 2001) and with the question of how managers and organizations balance diverse criteria, such as tradeoffs or potential conflicts between efficiency, profitability, and social responsibility in making and evaluating decisions (Clarkson, 1995; Harris & Freeman, 2008; Griffin & Mahon, 1997; Margolis & Walsh, 2001; Walsh, Weber, & Margolis, 2003). In this chapter, we apply concepts from this literature to describe how considering multiple perspectives on the definition of success or effectiveness might change our definition of “the criterion.”

What Is Success as Defined by I-O Psychologists?

Within the last 20–30 years, the question of what performance and success on the job actually mean has received considerable attention (see, for example, Chapters 20–24, this volume), but
we believe that there is considerable room for further expansion of the criterion domain. However, this trend toward differentiating the various aspects of the criterion domain runs counter to the Classic Model of performance, which has dominated thinking in applied research. That model holds that performance reflects one general factor that will account for most of the variation among different measures. Therefore, the objective with performance measures is to develop the best possible measure of the general factor. Throughout most of the history of I-O psychology, the adequacy and relevance of this “ultimate criterion” have rarely been discussed and debated. In recent decades, Campbell and others (Borman & Motowidlo, 1993; Cleveland, 2005; Johnson, 2003) have suggested that the notion of an ultimate criterion or single general performance factor is not the best representation of the performance construct. However, the ultimate-actual criterion distinction, as shown in Figure 25.1, is still a useful heuristic for understanding the nature of the criterion problem.

FIGURE 25.1  Criterion Components: Relationships Among Criterion Relevance, Contamination, and Deficiency
[Figure depicts the overlap between the ultimate or conceptual criterion (“success”) and the actual criterion: the overlapping region is labeled criterion relevance, the unmeasured portion of the ultimate criterion is labeled criterion deficiency, and the remainder of the actual criterion is labeled criterion contamination. Traditional content of the ultimate criterion: task efficiency, written communication, oral communication. Expanded content of the ultimate criterion: task efficiency, written communication, oral communication, work-nonwork conflict, spousal stress, family health and well-being.]
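For readers who find it helpful, the decomposition in Figure 25.1 can be restated as simple set relations between a hypothetical ultimate criterion domain and the facets an organization actually measures. This is only a sketch: the facet names below, including the contaminant "supervisor likability," are illustrative placeholders rather than constructs taken from the chapter.

```python
# Toy restatement of Figure 25.1: relevance, deficiency, and contamination as
# set relations. All facet names are hypothetical examples.
ultimate = {"task efficiency", "written communication", "oral communication",
            "work-nonwork conflict", "spousal stress", "family health and well-being"}
actual = {"task efficiency", "written communication", "oral communication",
          "supervisor likability"}    # last facet is a hypothetical contaminant

relevance = actual & ultimate         # measured and part of the conceptual domain
deficiency = ultimate - actual        # conceptually relevant but left unmeasured
contamination = actual - ultimate     # measured but outside the conceptual domain

print(sorted(deficiency))             # the facets an expanded criterion would add
```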

Task Performance

Campbell et al.’s (1993) model of performance focuses on required worker behaviors in a given job and attempts to delineate the dimensions of job performance. This model guides researchers and managers in assessing and preventing criterion contamination and deficiency by articulating the eight most important aspects of job performance. The eight factors (i.e., job-task proficiency, non-job-specific task proficiency, written and oral communication proficiency, demonstrating effort, maintaining personal discipline, facilitating peer and team performance, supervision/leadership, and management/administration) are assumed to be the highest-order factors that are sufficient to describe the latent hierarchy among all jobs. That is, the construct of performance cannot be meaningfully understood by combining these factors into a smaller subset or one general factor. Although the content of the factors may vary slightly across jobs, the focus of each is on the observable and behavioral things that people do that are under their control. There is a good deal of value to articulating what task performance actually means; Chapter 20 considers these issues in detail. However, the specific performance components articulated by Campbell et al. (1993) and others address work success from what is arguably a very narrow perspective. In particular, this model defines performance in the workplace as a set of behaviors that is independent from behavior associated with our nonwork lives, or at least assumes that nonwork factors are not relevant for defining success at work. From this perspective, the flow back and forth between the work and nonwork spheres of life is at best a form of criterion contamination.

Contextual Performance (Organizational Citizenship Behavior)

Within the last two decades, several researchers have noted that job performance involves more than task performance (Borman & Motowidlo, 1993; Organ, 1988). For example, Borman and Motowidlo (1993) proposed a model of performance with two components at the highest level: task performance, as we have already discussed, and contextual performance. Smith, Organ, and Near (1983) labeled a similar construct organizational citizenship behavior (OCB). Contextual performance consists of behaviors that support the broader environment in which the required tasks or technical core must operate (Borman & Motowidlo, 1993). Rather than being behaviors that are relevant for only a particular job (e.g., making electrical repairs is relevant to the job of electrician but not to the job of physician), these behaviors are important in and relevant for all jobs.1 Contextual performance includes citizenship behaviors such as volunteering for tasks that are not formally part of the job, demonstrating effort, helping and cooperating with others, following organizational rules, and supporting organizational objectives (Borman & Motowidlo, 1993). A number of these behaviors would fall under a subset of components identified by Campbell et al. (1993). Borman et al.
(2001) found that the structure of citizenship behaviors could be described using three categories: (1) personal support (behaviors benefiting individuals in the organization, including helping, motivating, cooperating with, and showing consideration), (2) organizational support (behaviors benefiting the organization, including representing the organization favorably, showing loyalty, and complying with organizational rules and procedures), and (3) conscientious initiative (behaviors benefiting the job or task, including persisting with extra effort to complete tasks, taking initiative, and engaging in self-development activities; Borman et al., 2001; Johnson, 2003). The articulation of contextual performance challenges traditional definitions of individual work performance that focused almost exclusively on task performance (Ilgen & Pulakos, 1999). Furthermore, as discussed in greater detail in Chapter 21 of this volume, the identification of contextual performance/citizenship behavior reflects an initial shift toward broadening work performance criteria to include performing in interdependent and uncertain work contexts (Neal & Hesketh, 2002). For example, contextual performance includes both behaviors that directly support the work environment and the work of others (e.g., helping, cooperating, demonstrating courtesy) and support for the organization (e.g., demonstrating loyalty to the organization), and behaviors that are not part of a standard job description but that are crucial to the smooth functioning of workgroups and organizations (Chapter 21 describes in detail the components of contextual performance). Also note that certain contextual performance
behaviors can impact nonwork outcomes. For example, supporting a coworker who has elder care issues at home may lead to better and less stressed performance at both work and home. There is evidence that the interpretation and evaluation of organizational citizenship behaviors depend on the gender of the people who exhibit these behaviors. In many jobs, organizational citizenship behaviors, particularly those that involve caring for others, providing support, and even basic civility, are expected of women (Heilman & Chen, 2005). As a result, men who exhibit these behaviors get “credit” for doing so, whereas women who fail to exhibit them are sanctioned.

Adaptive Performance

A third component of job performance, adaptive performance, is distinct from task and contextual performance (Hesketh & Neal, 1999). Adaptive performance is the proficiency with which a person alters his or her behavior to meet the demands of the environment, an event, or a new situation (Pulakos, Arad, Donovan, & Plamondon, 2000), or an effective change in response to an altered situation (White et al., 2005). Although some dimensions of adaptive performance overlap with task or contextual performance, the dimension of addressing uncertain and unpredictable work situations may be distinct from task and citizenship performance (Johnson, 2003). Related to adaptive performance, recent conceptualizations of successful aging define that construct as successfully adjusting to developmental change (Baltes & Baltes, 1990) or as competently adapting or adjusting (Abraham & Hansson, 1995; Featherman, 1992; Hansson, DeKoekkoek, Neece, & Patterson, 1997). As Chapter 21 of this volume notes, adaptive behaviors are likely to reflect a combination of individual skills and abilities (some people are better at recognizing the need to change and making changes in their behavior than others) and environmental factors (some jobs and work environments require more adaptation than others).

Organizational Deviance Behaviors

Finally, organizationally deviant behaviors that have negative value for organizational effectiveness have been proposed as a fourth distinct component of job performance (Sackett & Wanek, 1996). This component is also known as counterproductive work behavior, and an excellent discussion of it is presented in Chapter 22, this volume. Organizationally deviant behavior is defined as voluntary behavior that violates organizational norms and also threatens the viability and well-being of the organization and/or its members (Robinson & Bennett, 1995). Currently, there is little consensus regarding the dimensionality of counterproductivity. For example, some researchers have identified property damage, substance abuse, and violence on the job as facets of counterproductivity (Sackett & Wanek, 1996); withdrawal behaviors such as tardiness, absenteeism, and turnover, or even social loafing or withholding effort, are included in some definitions of this aspect of job performance (Kidwell & Bennett, 1993). The definition of “deviance” implies that there is clear agreement about what is normative, but this might not always be the case. For example, withdrawal (e.g., disengagement at work, voluntary lateness and absenteeism) can be considered a type of deviance, in the sense that employees may choose to fail to live up to their end of the implicit contract (Rotundo & Spector, 2010).
However, some organizations expect high levels of involvement from employees, with 80-hour workweeks, constant availability, and an expectation that work and the good of the organization come before other priorities. In this environment, someone who desires to give family some priority, who works hard but limits him or herself to a 40-hour week, or who sometimes puts higher priority on family or other nonwork domains than on work might be considered “deviant.” The alternate interpretation is that the expectations in the workplace are excessive and unhealthy, and that the organization and its culture are deviant.

Criterion Deficiency: What Have We Ignored?

Modern organizations are becoming more and more concerned with the notion of sustainability rather than focusing solely on immediate profit (Senge, Smith, Kruschwitz, Laur, & Schley, 2008). The term is usually used in conjunction with the sustaining of natural resources and
processes, but sustainability can also generalize to the management of HR. Traditional criterion measures focus on aspects of short-term performance, ignoring the influence that behavior has on other stakeholders and its consequences over the long term. This is analogous to focusing solely on short-term profit. We need to be aware of how current measures of success impact the future ability of employees to remain with the organization and to continue to perform in a manner that is beneficial to the organization, themselves, and society. Considering the sustainability of HR requires taking a longer-term perspective than is usually the case.

Furthermore, given current trends in criterion measurement and the typical failure to consider multiple stakeholders, our criteria of “success” continue to be deficient in at least two ways. First, we need to expand the notion of criteria to include aspects of individual functioning outside of the work context. Employee health and well-being, stress, marital quality, and parental performance are all potential aspects of an emerging performance domain within the larger context of our lives and are inextricably linked with work organizations (Cleveland, 2005). Behavior at work affects behavior away from work and vice versa, and a truly comprehensive definition of effectiveness and success in an organization is likely to include facets (e.g., health and well-being) that have traditionally not been thought of as part of the performance domain. Second, the content of our criteria should continue to be broadened to explicitly recognize the multilevel implications of the construct (Cleveland, 2005; DeNisi, 2000; Murphy & Cleveland, 1995). We need to more explicitly link conceptions of individual performance and success to definitions of effectiveness and sustainability at the group, organizational, and societal levels. The same behaviors and outcomes that contribute to success as traditionally defined at the individual level (e.g., high level of competitiveness, high level of involvement in work) might sow the seeds of failure at other levels (e.g., by building destructive conflict within organizations, by contributing to the failure of community institutions that compete with the workplace for employees’ time and effort). These themes are echoed in Cascio and Aguinis (2008) and in Chapter 9, this volume. Recognition of the multilevel nature of performance is important for several reasons (DeNisi, 2000). Notably, it provides one way that we can examine how our definitions and measures of success at one level are linked with potential costs associated with or incurred at another level. Broadening the definition of the criterion space to include extra-work functioning and multilevel effects leads us to consider domains that have been given little emphasis by traditional definitions of individual performance and organizational success. Two such domains are individual well-being and organizational health.

Health and Well-Being

At the individual level, health is not simply the absence of ill health (e.g., Jahoda, 1958). Within Western societies, the concept of mental health also includes aspiring to learn, being reasonably independent, and possessing confidence (Karasek & Theorell, 1990). Chapter 24 of this volume examines work-related health and stress as aspects of the criterion domain, focusing mainly on how health and health-related costs might influence the effectiveness of organizations. This is a valid and important concern, but it is useful to take a broader perspective and think about how work influences health. Even if the costs of work-related stress, accidents, or health declines are not a direct drag on the financial performance of an organization, they are an important part of determining whether particular patterns of work behavior are “successful,” especially when those patterns imperil the well-being of the worker or the well-being of his or her family or community.

Individual Health

Drawing on Warr’s framework (1987, 1994a), we can describe variations in mental health in terms of different relative emphases on ill health and good health. Mental or psychological health can be described using six dimensions: subjective or affective well-being, positive self-regard, competence, aspiration, autonomy, and integrated functioning (Warr, 2005). Well-being is the most commonly investigated facet of mental health and, according to Warr (1987), it includes two orthogonal dimensions: pleasure (feeling bad to feeling good) and level of arousal (low to high). He identified three assessable aspects of well-being that can be viewed in terms of their
location on these dimensions: specifically, (1) the horizontal axis of pleasure or displeasure, which is measured in terms of satisfaction or happiness; (2) an axis from anxiety (high arousal, low pleasure) to comfort (low arousal, high pleasure); and (3) an axis from depression (low arousal, low pleasure) to enthusiasm (high arousal, high pleasure). Indicators of well-being that emphasize the detection of ill health rather than good health assess anxiety, depression, burnout, psychological distress, and physiological or psychosomatic symptoms. On the other hand, indicators of well-being that emphasize positive mental health assess high arousal–high pleasure states, such as enthusiasm. Job satisfaction is considered to be either an indicator of ill health (e.g., job dissatisfaction) or positive health (e.g., job satisfaction). Either way, it is thought to be a relatively passive form of mental health because, although it assesses the degree of pleasure/displeasure about a job, it does not assess arousal (Parker, Turner, & Griffin, 2003; Warr, 1997). In addition to affective well-being, Warr (1987) identified five other components of mental health: competence (e.g., effective coping), aspiration (e.g., goal directedness), autonomy/independence (e.g., proactivity), positive self-regard (e.g., high self-esteem), and integrated functioning (i.e., states involving balance, harmony, and inner relatedness). These are important components of mental health in their own right because (a) they are potentially more enduring than affective well-being; and (b) competence, aspiration, and autonomy/independence represent more active states and behaviors than most measures of well-being that reflect passive contentment (e.g., job satisfaction).

How does individual well-being fit as part of a definition of effectiveness or success? First, there is considerable evidence that workplace stresses are an important source of physical and mental health problems. Warr (1987, 1999) has developed a framework that identifies key features of an environment that have been shown to be related to mental health. The 10 features are described in Table 25.1 in positive terms, but low values are viewed as stressful (Warr, 2005). This table suggests that the design of jobs (e.g., variety, opportunities for skill use), workplaces (e.g., physical security), reward systems (e.g., availability of money), leadership training and development systems (e.g., supportive supervision), and personnel recruitment and selection systems (e.g., valued social position) could all have mental health implications for the workforce. For example, given the shrinking, aging, and increasingly diverse global workforce, organizations need to rethink the primary objectives of recruitment and selection systems. Organizations may increasingly face the situation of having more job vacancies than qualified individuals to fill them. Selection systems may need to be retooled to reflect more “recruitment selection.” Selection tests or measures may need not only to assess how well applicants can perform across various work contexts over a period of time but also to convey to the applicants what range of situations they are likely to encounter and what resources the organization can provide to sustain

TABLE 25.1  Job Characteristics Related to Mental Health

Opportunity for personal control
Opportunity for skill use
Externally generated goals
Variety
Environmental clarity
Availability of money
Physical security
Supportive supervision
Opportunity for interpersonal contact
Valued social position

Note: Adapted from Warr, P. (2005). Work, well-being and mental health. In J. Barling, E. K. Kelloway, & M. R. Frone (Eds.), Handbook of work stress (pp. 547–574). Thousand Oaks, CA: Sage.

their performance and work-life health. It can certainly be argued that individuals who perform well in an environment that has adverse effects on their physical or mental health should not necessarily be described as successful. Second, health effects today are likely to have performance effects tomorrow. That is, an employee whose health is impaired by the workplace will probably make a smaller contribution to the organization, the family, and the community over the long run than one whose employment is a source of well-being. Thus, employee well-being and health are important components of sustaining an organization’s human capital. Indeed, successful organizations already are aware of the link between health and well-being, performance, and sustainability, as documented in Chapter 24, this volume. For example, IBM’s corporate policy on employee well-being has led the organization to develop a myriad of programs to ensure employee health, well-being, and family balance. Furthermore, the company ties these programs to criteria such as work-related injury and lost workdays.

Organizational Health

A healthy organization is one that is competitive within the marketplace and also has low rates of injury, illness, and disability (Hofmann & Tetrick, 2003). Individual health outcomes are distinguished from organizational-level outcomes, but both are likely to be related to individual behavior at work. Together, individual and organizational effectiveness constitute the health of an organization (Parker, Turner, & Griffin, 2003). That is, a healthy organization is one that accomplishes the business-related goals that define traditional financial success and the human goals of advancing the health and welfare of the organization’s members. It is possible to move this discussion one step further and define a healthy organization in terms of three dimensions: (a) being competitive within the marketplace; (b) maintaining low rates of injury, illness, and disability (a lack of negative outcomes); and (c) promoting the long-term sustainability and well-being of its constituents (e.g., work that increases the success of constituents in terms of competence, aspiration, autonomy, and balance).

Integrating Health and Well-Being into the Criterion Space

One reason why it is useful to distinguish between performance and success is that a narrow focus on performance forces one to search for similarly narrow reasons for including factors such as health in the criterion domain. It is certainly possible to do so; unhealthy individuals and unhealthy organizations are not likely to maintain any notable level of performance over the long run. On the other hand, a focus on success does not require one to locate some performance-related pretext for including health as part of the ultimate criterion. Rather, the promotion of individual and organizational health is likely to be a valued outcome in and of itself (i.e., valued by at least some stakeholders) and does not require justification in terms of some other set of criteria (e.g., profitability). We argue that employees, their families, and their communities all have a vested interest in workplaces that promote physical and mental health, and all have a vested interest in minimizing a range of negative outcomes (e.g., spillover of work-related conflicts) that might be associated with unhealthy organizations.

Multilevel Issues in Defining Performance and Success

Performance and success occur at the individual, group, and organizational levels (Campbell, Dunnette, Lawler, & Weick, 1970; DeNisi, 2000); they can also be defined within the larger context (level) of society and environment. Performance and success are not only defined at many levels of analysis, but they can also be defined in terms of multiple units of time. Perhaps the most serious deficiency in many definitions of individual performance and success is the lack of awareness or concern with the relationship between choices in defining the domain at one level (e.g., Is “face time” an important part of performance and success?) and effects felt at other levels of the system (e.g., If “face time” at work is viewed as important, time spent in family or community activities is likely to decline). According to DeNisi (2000), when we acknowledge that performance is a multilevel phenomenon, several important implications follow:

1. We assess and develop individual employee performance with the intent of ultimately affecting the performance of the team or the whole organization.
2. Individuals and teams perform in ways that allow the organization to achieve outcomes referred to as “organizational performance.”
3. Performance at higher levels of analysis is more than just the simple sum of performance at lower levels; that is, it is not always sufficient to change individual performance to change team or organization performance (see DeNisi & Smith, 2014, for a detailed review).
4. Variables at higher levels of analysis (e.g., organizational structure or climate) can serve as constraints on (or facilitators of) the performance of individuals and teams. Therefore, we must understand the organizational context in order to fully understand the performance of individuals or teams.

In particular, thinking about performance and success from a multilevel perspective might help us to understand how and why the ultimate criterion should be expanded. For example, we traditionally construct and validate personnel selection systems as if the only objective of those systems were to predict future performance at the individual level (e.g., virtually all validation studies use measures of individual job performance as the criterion of choice). Yet it is clear that the goals of a personnel selection system are not solely to predict future performance; the goals are to help the organization make better strategic decisions, be profitable, and sustain productivity.2 Consistent with the message conveyed in Chapter 5 of this volume, it is critical that the criteria are linked with unit or organizational strategy. Therefore, our criteria may include unit-, organizational-, and societal-level assessments, as well as individual-level performance assessments, to be most consistent with a firm’s strategy. One plausible reason that a validated selection system does not translate into better unit performance may be the narrowly defined criteria used (Murphy & Shiarella, 1997). There are usually real and legitimate differences in different stakeholders’ definitions of “better decisions.” For example, an organization may use a class of tests in personnel selection that results in predicted increases in individual performance but also results in adverse impact, in conflict between supervisors and subordinates, and in negative images of the organization. This might not be thought of as a success, even if the validity coefficients are all large and positive (Murphy, 2010). Therefore, the logic that Ployhart and Weekley develop in Chapter 5, this volume, to link individual-level selection tests to organizational business strategy should also be applied to the re-examination and development of the criterion domain. That is, relevant macro work context and nonwork factors should be included within the articulation and domain of success. Cascio and Aguinis (2008) make similar recommendations using the emerging construct they label “in situ performance,” which refers to the situational, contextual, strategic, and environmental effects that may influence individual, team, or organizational performance. By integrating or specifying these effects, we develop a “richer, fuller, context-embedded description of the criterion space that we wish to predict” (Cascio & Aguinis, 2008, p. 146). With the changing nature of work and the workforce, such criterion evolution can more fully capture how work is done in the 21st century (Cascio & Aguinis, 2008).

I-O psychologists devote a great deal of time and effort to helping organizations make high-stakes decisions about people (e.g., whom to hire, where to place them, and what behaviors to reward and sanction). A multilevel perspective suggests that these decisions can and probably should be evaluated in terms of their effects on individuals, work groups, organizations, and families and communities, and that short- and long-term perspectives should be considered. To be sure, many difficult issues have to be addressed to put such a program of criterion development in place. Whose perspectives should be considered and how much weight should be given to each stakeholder in defining individual or organizational success?
How should conflicts between stakeholders be addressed (e.g., it might benefit organizations but harm the communities that support them if many employees put in 80-hour weeks)? There are no simple answers to these questions, but I-O psychologists do have experience dealing with the multilevel issues in several other domains, and we may be able to draw from this research and this experience to gain insights into developing more inclusive definitions of what it means to be a success in the workplace. In particular, there is much to be learned from research on work-family conflict.
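The weighting question can be made concrete with a small numerical sketch. The snippet below is written in the spirit of composite-criterion analyses such as Murphy and Shiarella (1997), but it is only an illustration: the predictor-criterion correlations, the criterion intercorrelations, and the stakeholder weight sets are hypothetical values chosen for the example, not estimates from any study.

```python
# Illustrative only: how the validity of one selection test against a weighted
# criterion composite shifts as stakeholders weight the facets differently.
# All correlations and weights below are hypothetical placeholder values.
import numpy as np

r_xy = np.array([0.50, 0.25, 0.05])   # test's assumed correlations with task
                                      # performance, citizenship, and well-being
R_yy = np.array([[1.00, 0.40, 0.10],  # assumed intercorrelations among the
                 [0.40, 1.00, 0.20],  # three criterion facets
                 [0.10, 0.20, 1.00]])

def composite_validity(weights):
    """Correlation of the test with the weighted sum of criterion facets."""
    w = np.asarray(weights, dtype=float)
    return (w @ r_xy) / np.sqrt(w @ R_yy @ w)

print(round(composite_validity([0.8, 0.2, 0.0]), 2))  # management-centric weights: ~0.50
print(round(composite_validity([0.4, 0.3, 0.3]), 2))  # weights that add well-being: ~0.41
```

Under these assumed values, the same test looks quite valid when only task performance and citizenship count, and noticeably less so once employee well-being enters the composite; the choice and weighting of criterion facets, not just the test, determine what the validity coefficient means.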


Work-Family Conflict in Relation to Organizational Health and Sustainability Research on work-family conflict provides an example of the implications of thinking about performance and success from the perspectives of multiple stakeholders. Given I-O psychologists’ interest in the work context, the work side of the work-family interface has been more focal in I-O research (Major & Cleveland, 2005). Research in work-family conflict has typically emphasized the experiences of managers and professionals, as opposed to other types of workers (e.g., laborers), and has typically focused on the individual employee and his or her performance at work. Although some I-O studies have examined outcomes for employed couples (e.g., Hammer, Allen, & Grigsby, 1997), these are few and far between, and research that includes or acknowledges children is sparse indeed. Nevertheless, the field of work-family conflict can be viewed as one of the most successful examples of multilevel, multiperspective thinking, particularly if we recast some of the traditional areas of work-family conflict research in a slightly different light. I-O psychologists have been particularly interested in the effects of work-family conflict on employee job-related attitudes. They have usually not thought of work-family conflict as a measure of success (or lack thereof), but rather as a criterion contaminant. However, it is reasonable to argue that work-family conflict should be part of the definition of success, particularly when we define success at the organizational level. That is, an organization that frequently places demands on employees that interfere with their ability to function well as spouses, parents, caregivers, etc., should be considered as less successful than similar organizations that find a way to minimize their encroachment on the family roles of their employees. The decision not to include work-family balance in the scorecard used to evaluate organizations may make sense from the perspective of some stakeholders (e.g., investors, or executives with stay-at-home spouses), but it is not likely to be in the interest of families, children, and perhaps even the larger society that provides the customers, infrastructure, employees, and support that is necessary for the organization’s survival. Although I-O psychologists often ignore work-family balance as a criterion of success, some organizations, such as IBM, do not. IBM provides a robust program to support Dependent Care Spending (http://www-01.ibm.com/employment/us/ benefits/s25.shtml). Why should organizations care about work-family conflict? First, work-family conflict has been linked to organizational commitment, turnover intentions (e.g., Lyness & Thompson, 1997; Netemeyer, Boles, & McMurrian, 1996), turnover (Greenhaus, Collins, Singh, & Parasuraman, 1997), and stress and health (Frone, 2000; Frone, Russell, & Cooper, 1997). Second, some studies have found a negative relationship between work-family conflict and job performance (Aryee, 1992; Frone et al., 1997), particularly when performance is defined as task performance. By revealing links to outcomes that traditionally matter to business (e.g., turnover), this research illustrates that attending to work-family concerns is not simply a “moral imperative” or the “right thing” to do, but it also makes good business sense. That is, a reasonable case can be made that work-family conflict is harmful to an organization’s bottom line, especially over the long term. 
A multilevel perspective suggests that it is not necessary (although it is likely to be desirable) to focus on the links between work-family conflict and the bottom line to justify including this conflict as a facet of success. Rather, there are important stakeholders (e.g., employees, their families, their communities) who have a legitimate stake in wanting to minimize work-family conflict, regardless of whether or not it affects the bottom line of the organization. This multilevel perspective is particularly important because it has been consistently found that work-to-family conflict is more likely to occur than family-to-work conflict (Eagle, Miles, & Icenogle, 1997; Gutek, Searle, & Klepa, 1991; Netemeyer et al., 1996). Organizational demands on the time and energy of employees appear to be more compelling than those of the family because of the economic contribution of work to the well-being of the family (Gutek et al., 1991). Employees are often afraid to be away from the workplace, and “presenteeism” takes its toll (Lewis & Cooper, 1999; Simpson, 1998). Workers are spending more time in the workplace in response to job insecurity, workplace demands, perceived career needs, and financial pressure. That is, the most compelling finding in the domain of work-family conflict is not that family
Jeanette N. Cleveland et al. interferes with work but that work interferes with family. If we, as I-O psychologists, focus only on outcomes that directly affect the employer’s interests (particularly employers’ short-term interests), we are likely to dismiss the most important aspect of work-family conflict (i.e., the way work can adversely affect families) as outside of the boundaries of the criterion domain. If we consider the interests of employees, their families, and their communities as a legitimate part of the definition of the ultimate criterion space, then we are less likely to dismiss this important set of findings as being largely irrelevant, or at least as being someone else’s problem. Women and men in the United States increased their annual working hours by an average of 233 and 100 hours, respectively, between 1976 and 1993 (Bureau of Labor Statistics, 1997). In 1999 to 2014, the average workweek for employed persons aged 25–64 with children was 44.5 hours (http://www.bls.gov/tus/charts/). Many employees work longer hours, and dualearner couples may work unusual hours or shifts. In the United States and the United Kingdom, workers feel they need to put in substantial “face time” to demonstrate their commitment (Bailyn, 1993), and many in low-wage occupations work more than one job. Despite the increasing time and effort devoted to work, employees are feeling increasing levels of job insecurity (Burchell, Felstead, & Green, 1997; Reynolds, 1997). From the perspective of multilevel systems, this increasing focus on face time, long hours, and increased insecurity is arguably evidence that organizations are increasingly unhealthy and, therefore, increasingly unsuccessful. Similarly, we can think of research on workers’ experiences with family-friendly work policies (e.g., parental leave, flextime) differently if we broaden our definitions of performance, effectiveness, and success. For example, family-friendly policies are of limited value without a secure job, and there is evidence that many qualified employees decline opportunities to participate in these programs (Lewis et al., 1998). One way of evaluating the success of an organization would be to pay attention to the uptake rates for policies such as these. If employees report stress and dissatisfaction as a result of work-family conflict but are unwilling or unable to take advantage of workplace policies designed to reduce these stresses, this can be considered evidence that the organization is failing its stakeholders, regardless of what the balance sheet says. Although studied far less frequently than work-related outcomes, psychological research has not completely neglected outcomes in the family domain (Major & Cleveland, 2007). Numerous empirical studies demonstrate a negative relationship between work-family conflict and life satisfaction (e.g., Adams, King, & King, 1996; Netemeyer et al., 1996); the results of two metaanalyses (Allen, Herst, Bruck, & Sutton, 2000; Kossek & Ozeki, 1998) reinforce this conclusion. The results are similar for work-family conflict and marital functioning and/or satisfaction (e.g., Duxbury, Higgins, & Thomas, 1996; Netemeyer et al., 1996) and family satisfaction (e.g., Parasuraman, Purohit, Godshalk, & Beutell, 1996). Yet again, this research often taps only the perceptions of the employed worker and does not collect information from spouses or children. 
Children are virtually absent from I-O research on the work-family interface (Major & Cleveland, 2007), and when they are included, it is typically as demographic control variables (i.e., number of children, age of youngest child) in studies of an employed parent’s family demands (see Rothausen, 1999, for a review). With few exceptions (e.g., Barling, Dupre, & Hepburn, 1998), children’s outcomes are seldom considered in I-O work-family research. Moreover, I-O research lacks a rich treatment of how children and other family variables influence employee behavior (cf. Eby, Casper, Lockwood, Bordeaux, & Brinley, 2005) or, importantly, how workplace characteristics and the employment/parental behaviors of both working parents influence the well-being and work attitudes of their children. Furthermore, current measures of success are deficient and lack consideration of children’s well-being. If we think about the family as one of the important sets of stakeholders in defining what we mean by success, we will be more likely to consider the reciprocal effects of work and family in deciding whether our HR systems (e.g., personnel selection) are indeed leading to better decisions. In the traditional model of success, in which the ultimate criterion is entirely focused on what is good (often in the short term) for the organization, including measures of work-family conflict in evaluations of careers, organizations, etc., would probably be dismissed as criterion contamination. If we recognize that the worlds of work and nonwork are inextricably intertwined, we are likely to reach a very different conclusion; that is, that the failure to include variables such as work-family conflict in our definitions of success has led to conceptions of the ultimate criterion that are themselves deficient.

“Closing In” on Criterion Deficiency: One Approach to Bridging HR Systems with Business Unit Strategy

Scholars in management and applied psychology have often worked from the assumption that work could and should be analyzed and understood as a separate domain from our nonwork lives. This probably made a good deal of sense for workplaces in the late 19th and early 20th centuries (a formative period for work organizations and for I-O psychology), when White males were the predominant members of the workforce, with unpaid wives at home tending to children and nonwork needs. This characterization is increasingly inaccurate for workers in the 21st century and for their families. Families are more diverse in structure, and it is more likely that all adult family members are paid employees working outside of the home. With the changing demographic composition of the workforce and working families, and the changing demands and technology within organizations, the way success is defined and measured must undergo transformation as well.

We argue that this transformation in evaluation at work needs to reflect the following. First, the domain of success must encompass a more inclusive set of content, including individual employee well-being, marital and family well-being, and traditional indicators of task and citizenship behaviors. Second, the domain of success must reflect multiple levels of analysis, including individual employees, couples, families, teams, work units, organizational productivity, and community quality. Furthermore, the multiple levels of analysis may involve varying units of time, from the short term (up to about one year) to the long term (spanning decades). For example, children often leave home at 18 years of age, and the balance between work and nonwork that is best for the employee, the child, the spouse, the organization, and the community might constantly shift during those 18 years. Some employees might attempt to maximize their career advancement before starting a family, whereas others might reenter a career after child-rearing is completed. The definition of the employee behaviors that are most desirable will probably vary over employees, over time, and over stakeholders. Third, the stakeholders who have a legitimate interest in defining what behaviors should occur in the workplace are not found only at work (e.g., employees, coworkers, customers). Our definition of stakeholders must include nonworking and working spouses/partners and children. Finally, our nonwork lives should not be viewed as contaminants of job performance or success but rather as part of the ultimate criterion of success, and therefore very relevant and appropriate to assess.

Implications for Selection

We do not suggest that organizations measure employee marital satisfaction or fire employees when they divorce or have problematic children, nor that they use health status as a selection criterion (see Chapter 24, this volume, for a discussion of work-related health, stress, and safety). Rather, just as many organizations collect and monitor various safety criteria at the organizational level (e.g., accident rates), an organization can monitor at an aggregate level the work and nonwork health of the organization. To ensure employee privacy, information about nonwork issues such as marital health and family relationships can be collected and analyzed at group or organizational levels rather than linked to individual employees (a simple illustration of such aggregate reporting appears at the end of this section). However, information on work performance, in terms of task and citizenship behaviors, can be collected at both individual and aggregated levels. Furthermore, it is important that organizations tap not only the perceptions of individual employees, coworkers, supervisors, and so forth, but also the perceptions of employees’ partners/spouses and children. Just as 360-degree performance feedback programs have gained some popularity in management circles (Bracken, Timmreck, & Church, 2001), organizations should also receive feedback from nonwork sources (Shellenbarger, 2002). Using a type of family 360 may provide useful feedback to employees.

Adopting a broader, more heterogeneous conceptualization of worker success would have important implications for the way we evaluate the validity and adequacy of our criteria and for the conclusions we reach about the validity and value of many of the systems psychologists develop for organizations (Murphy & Shiarella, 1997). A broader concept of success may have considerable appeal for employees and their families, could be thought of as a competitive advantage for organizations (i.e., organizations that think more broadly about defining success may be better positioned to recruit and retain particular employees), and could enhance the sustainability of the organization. Perhaps one basis for worker dissatisfaction with performance appraisal is that what employees value as success is not reflected in the organization’s evaluation process. Taking a broader perspective may also provide the organization with a strategic advantage in the public’s eye. In addition, organizations would gain essential insight into potential HR challenges facing working families, which can provide the basis for innovative and effective interventions. Not only would I-O psychologists and managers have more measures with which to tap success, but they would also have more sources of performance information. Finally, using a multilevel orientation to tap multisource information, we can plausibly begin to (a) link our HR systems with business strategy (as discussed in Chapter 5, this volume) and (b) develop selection tools that predict in situ performance and more fully reflect individual success and well-being as well as organizational sustainability.
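To make the aggregate-reporting idea above concrete, the brief sketch below shows one way group-level summaries of nonwork indicators might be produced while suppressing units too small to preserve anonymity. The survey fields, scale values, and minimum cell size are hypothetical choices for illustration only, not a procedure prescribed in this chapter.

```python
import pandas as pd

# Hypothetical survey extract: one row per respondent, identified only by work unit
responses = pd.DataFrame({
    "work_unit": ["A", "A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C"],
    "wf_conflict": [3.2, 4.1, 2.8, 3.9, 3.5, 4.8, 4.6, 2.1, 2.4, 2.0, 2.6, 2.2],  # 1-5 scale
    "family_satisfaction": [3.8, 3.0, 4.2, 3.1, 3.6, 2.2, 2.5, 4.5, 4.3, 4.6, 4.1, 4.4],
})

MIN_CELL = 5  # report nothing for units with fewer respondents than this

summary = (
    responses.groupby("work_unit")
    .agg(n=("wf_conflict", "size"),
         mean_wf_conflict=("wf_conflict", "mean"),
         mean_family_satisfaction=("family_satisfaction", "mean"))
)

# Suppress small cells so individual responses cannot be inferred
small = summary["n"] < MIN_CELL
summary.loc[small, ["mean_wf_conflict", "mean_family_satisfaction"]] = float("nan")
print(summary)
```

In this sketch, unit B’s means would be suppressed because only two employees responded; only unit-level averages, never individual scores, would be reported back to management.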

CONCLUSIONS

The way we go about predicting and understanding success in organizations (and designing personnel selection systems that will maximize success) depends largely on how we define success. Researchers and practitioners increasingly question the adequacy of traditional indicators of success, such as job performance, promotions, salary, job title, and organizational level. These are all important and relevant, but success almost certainly should be defined more broadly and comprehensively. As the permeability of the boundaries between work and nonwork domains increases in the 21st century, our definition of what it means to the organization, the individual, and the broader society to be a success or a failure in the workplace is likely to change.

We have argued in this chapter that criteria such as marital and family well-being are of legitimate concern to responsible organizations and are part of the ultimate criterion. The wealth of evidence shows that employees place family as their number-one priority (Lewis & Cooper, 1999) and that work demands regularly interfere with employees’ ability to meet family demands; to a lesser degree, there is also evidence that family demands interfere with their ability to carry out work demands (cf. Greenhaus & Parasuraman, 1999). Business strategies that emphasize long-term sustainability and concern with the construct of in situ performance (Cascio & Aguinis, 2008) will necessarily be concerned with determining how work and nonwork domains affect one another and with how nonwork criteria such as family well-being are likely to influence the viability of organizations. The literature includes constant calls for aligning HR practices with business strategy (Chapter 5, this volume) to promote the long-term benefit of organizations, and it is likely that understanding the effects of work and organizational demands on the quality of nonwork life will be an important factor in building and sustaining healthy organizations.

Our current criteria for success (and theories of performance) are arguably deficient because we ignore the facets and structures of work that affect nonwork areas of our lives. For example, suppose that a new performance management system led employees to book more hours at work but also led to increased stress at home. It might be reasonable to ask whether the organization should consider its new system a success or a failure. It may not be easy to determine the best balance between the positive and negative effects of this system, but it seems reasonable to at least ask how interventions that seem beneficial at one level of analysis might have negative effects at other levels. Our current narrow focus on what is good for the organization may lead us to miss the effects of what happens in the workplace on any number of domains other than work. What happens at work does not always stay at work; the workplace affects our nonwork lives, and our nonwork lives affect the workplace. It is important to more fully appreciate the reciprocal relationships between work and nonwork and to recognize the larger developmental and cultural context in which work behaviors unfold. Including nonwork factors in our evaluations of careers, jobs, and organizations is not a source of criterion contamination. Rather, failure to consider these factors in defining success should be thought of as a source of criterion deficiency. There are many challenges in determining what to measure, how to measure it, and how to use that information, but the case seems clear—we need to take a broader (and richer) approach to defining performance and success for individuals and organizations.

NOTES
1. Although these behaviors are likely to be relevant for all jobs, establishing the job-relatedness of citizenship behaviors might not be easy, given current models for validating criteria, and in contexts where litigation seems particularly likely, there are good arguments for relying more heavily on task performance as a criterion.
2. Note, however, that our current legal environment requires valid individual-level measures when those measures are used to make high-stakes decisions about individuals that may have differential impact across demographic groups.

REFERENCES
Abraham, J. D., & Hansson, R. O. (1995). Successful aging at work: An applied study of selection, optimization, and compensation through impression management. Journal of Gerontology, 50, 94–103.
Adams, G. A., King, L. A., & King, D. W. (1996). Relationships of job and family involvement, family social support, and work-family conflict with job and life satisfaction. Journal of Applied Psychology, 81, 411–420.
Agle, B. R., Mitchell, R. K., & Sonnenfeld, J. A. (1999). Who matters to CEOs? An investigation of stakeholder attributes and salience, corporate performance and CEO values. Academy of Management Journal, 42, 507–525.
Allen, T. D., Herst, D. E., Bruck, C. S., & Sutton, M. (2000). Consequences associated with work-to-family conflict: A review and agenda for future research. Journal of Occupational Health Psychology, 5, 278–308.
Aryee, S. (1992). Antecedents and outcomes of work-family conflict among married professional women: Evidence from Singapore. Human Relations, 45, 813–837.
Astin, A. W. (1964). Criterion-centered research. Educational and Psychological Measurement, 24, 807–822.
Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Applied Psychology, 77, 836–874.
Bailyn, L. (1993). Breaking the mold: Women, men and time in the new corporate world. New York, NY: Free Press.
Baltes, P. B., & Baltes, M. M. (1990). Psychological perspectives on successful aging: The model of selective optimization with compensation. In P. B. Baltes & M. M. Baltes (Eds.), Successful aging: Perspectives from the behavioral sciences (pp. 1–33). Cambridge, England: Cambridge University Press.
Barling, J., Dupre, K. E., & Hepburn, C. G. (1998). Effects of parents’ job insecurity on children’s work beliefs and attitudes. Journal of Applied Psychology, 83, 112–118.
Bingham, W. V. (1926). Measures of occupational success. Harvard Business Review, 5, 1–10.
Borman, W., & Motowidlo, S. (1993). Expanding the criterion domain to include elements of contextual performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 71–98). San Francisco, CA: Jossey-Bass.
Borman, W. C., Penner, L. A., Allen, T. D., & Motowidlo, S. J. (2001). Personality predictors of citizenship performance. International Journal of Selection and Assessment, 9, 52–69.
Bracken, D., Timmreck, C., & Church, A. (2001). The handbook of multisource feedback: The comprehensive resource for designing and implementing MSF processes. San Francisco, CA: Jossey-Bass.
Burchell, B., Felstead, A., & Green, F. (1997, September). The age of the worried workers: Extent, pattern and determinants of insecurity in Britain over the last decade. Paper presented at the 12th Annual Employment Research Unit Conference, Cardiff, Wales.
Bureau of Labor Statistics. (1997). Workers are on the job more hours over the course of a year (Issues in Labor Statistics, Summary 97–3). Washington, DC: U.S. Department of Labor.

Campbell, J. P. (1990). Modeling the performance prediction problem in industrial and organizational psychology. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed., Vol. 1, pp. 687–732). Palo Alto, CA: Consulting Psychologists Press.
Campbell, J. P., Dunnette, M. D., Lawler, E. E., & Weick, K. (1970). Managerial behavior, performance and effectiveness. New York, NY: McGraw-Hill.
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 35–70). San Francisco, CA: Jossey-Bass.
Cascio, W. F. (2000). Managing human resources: Productivity, quality of work life and profits. New York, NY: McGraw-Hill.
Cascio, W., & Aguinis, H. (2008). Staffing twenty-first-century organizations. In J. P. Walsh & A. P. Brief (Eds.), Academy of management annals (pp. 133–165). Mahwah, NJ: Lawrence Erlbaum.
Clarkson, M. B. E. (1995). A stakeholder framework for analyzing and evaluating corporate social performance. Academy of Management Review, 20, 92–117.
Cleveland, J. N. (2005). What is success? Who defines it? Perspectives on the criterion problem as it relates to work and family. In E. E. Kossek & S. J. Lambert (Eds.), Work and life integration: Organizational, cultural and individual perspectives (pp. 319–346). Mahwah, NJ: Lawrence Erlbaum.
DeNisi, A. S. (2000). Performance appraisal and performance management: A multilevel analysis. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 121–156). San Francisco, CA: Jossey-Bass.
DeNisi, A. S., & Smith, C. E. (2014). Performance appraisal, performance management, and firm-level performance: A review, a proposed model, and new directions for future research. Academy of Management Annals, 8, 127–179.
Duxbury, L. E., Higgins, C. A., & Thomas, D. R. (1996). Work and family environments and the adoption of computer-supported supplemental work-at-home. Journal of Vocational Behavior, 49, 1–23.
Eagle, B. W., Miles, E. W., & Icenogle, M. L. (1997). Interrole conflicts and the permeability of work and family domains: Are there gender differences? Journal of Vocational Behavior, 50, 168–184.
Eby, L. T., Casper, W. J., Lockwood, A., Bordeaux, C., & Brinley, A. (2005). A twenty-year retrospective on work and family research in IO/OB: A content analysis and review of the literature [Monograph]. Journal of Vocational Behavior, 66, 124–197.
Featherman, D. L. (1992). Development of reserves for adaptation to old age: Personal and societal agendas. In E. Cutler, D. W. Gregg, & M. P. Lawton (Eds.), Aging, money, and life satisfaction: Aspects of financial gerontology (pp. 135–168). New York, NY: Springer.
Frone, M. R. (2000). Work-family conflict and employee psychiatric disorders: The National Comorbidity Survey. Journal of Applied Psychology, 85, 888–895.
Frone, M. R., Russell, M., & Cooper, M. L. (1997). Relation of work-family conflict to health outcomes: A four-year longitudinal study of employed parents. Journal of Occupational & Organizational Psychology, 70, 325–335.
Greenhaus, J. H., Collins, K. M., Singh, R., & Parasuraman, S. (1997). Work and family influences on departure from public accounting. Journal of Vocational Behavior, 50, 249–270.
Greenhaus, J. H., & Parasuraman, S. (1999). Research on work, family and gender: Current status and future directions. In G. N. Powell (Ed.), Handbook of gender and work (pp. 391–412). Thousand Oaks, CA: Sage.
Griffin, J. J., & Mahon, J. F. (1997). The corporate social performance and corporate financial performance debate: Twenty-five years of incomparable research. Business and Society, 36, 5–31.
Gutek, B. A., Searle, S., & Klepa, L. (1991). Rational versus gender role explanations for work-family conflict. Journal of Applied Psychology, 76, 560–568.
Hammer, L. B., Allen, E., & Grigsby, T. D. (1997). Work-family conflict in dual-earner couples: Within-individual and crossover effects of work and family. Journal of Vocational Behavior, 50, 185–203.
Hansson, R. O., DeKoekkoek, P. D., Neece, W. M., & Patterson, D. W. (1997). Successful aging at work: Annual Review, 1992–1996: The older worker and transitions to retirement. Journal of Vocational Behavior, 51, 202–233.
Harris, J. D., & Freeman, R. E. (2008). The impossibility of the separation thesis. Business Ethics Quarterly, 18, 541–548.
Heilman, M. E., & Chen, J. J. (2005). Same behavior, different consequences: Reactions to men’s and women’s altruistic citizenship behavior. Journal of Applied Psychology, 90, 431–441.
Hesketh, B., & Neal, A. (1999). Technology and performance. In D. R. Ilgen & E. D. Pulakos (Eds.), The changing nature of performance: Implications for staffing, motivation, and development (pp. 21–55). San Francisco, CA: Jossey-Bass.
Hofmann, D. A., & Tetrick, L. E. (2003). The etiology of the concept of health: Implications for “organizing” individual and organizational health. In D. A. Hofmann & L. E. Tetrick (Eds.), Health and safety in organizations: A multilevel perspective (pp. 1–26). San Francisco, CA: Jossey-Bass.

Ilgen, D. R., & Pulakos, E. D. (1999). Employee performance in today’s organizations. In D. R. Ilgen & E. D. Pulakos (Eds.), The changing nature of performance: Implications for staffing, motivation, and development (pp. 21–55). San Francisco, CA: Jossey-Bass.
Jahoda, M. (1958). Current concepts of positive mental health. New York, NY: Basic Books.
Johnson, J. W. (2003). Toward a better understanding of the relationship between personality and individual job performance. In M. R. Barrick & A. M. Ryan (Eds.), Personality and work: Reconsidering the role of personality in organizations (pp. 83–120). San Francisco, CA: Jossey-Bass.
Karasek, R. A., & Theorell, T. (1990). Healthy work: Stress, productivity, and the reconstruction of working life. New York, NY: Basic Books.
Kidwell, R. E., Jr., & Bennett, N. (1993). Employee propensity to withhold effort: A conceptual model to intersect three avenues of research. Academy of Management Review, 18, 429–456.
Kossek, E. E., & Ozeki, C. (1998). Work-family conflict, policies, and the job-life satisfaction relationship: A review and directions for organizational behavior-human resources research. Journal of Applied Psychology, 83, 139–149.
Lewis, S., & Cooper, C. L. (1999). The work-family research agenda in changing contexts. Journal of Occupational Health Psychology, 4, 382–393.
Lewis, S., Smithson, J., Brannen, J., Das Dores Guerreiro, M., Kugelberg, C., Nilsen, A., & O’Connor, P. (1998). Futures on hold: Young Europeans talk about combining work and family. London, England: Work-Life Research Centre.
Lyness, K. S., & Thompson, D. E. (1997). Above the glass ceiling? A comparison of matched samples of female and male executives. Journal of Applied Psychology, 82, 359–375.
Major, D. A., & Cleveland, J. N. (2005). Psychological perspectives on the work-family interface. In S. Bianchi, L. Casper, & R. King (Eds.), Work, family, health and well-being (pp. 169–186). Mahwah, NJ: Lawrence Erlbaum.
Major, D. A., & Cleveland, J. N. (2007). Reducing work-family conflict through the application of industrial/organizational psychology. International Review of Industrial and Organizational Psychology, 22, 111–140.
Margolis, J. D., & Walsh, J. P. (2001). People and profits? The search for a link between a company’s social and financial performance. Mahwah, NJ: Lawrence Erlbaum.
Murphy, K. R. (1998). In search of success: Everyone’s criterion problem. SIOP Presidential Address at the Annual Conference for Industrial and Organizational Psychology, Dallas, TX.
Murphy, K. R. (2009). Validity, validation and values. The Academy of Management Annals, 3, 421–461.
Murphy, K. R. (2010). How a broader definition of the criterion domain changes our thinking about adverse impact. In J. Outtz (Ed.), Adverse impact (pp. 137–160). San Francisco, CA: Jossey-Bass.
Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organizational and goal-oriented perspectives. Newbury Park, CA: Sage.
Murphy, K. R., & Shiarella, A. (1997). Implications of the multidimensional nature of job performance for the validity of selection tests: Multivariate frameworks for studying test validity. Personnel Psychology, 50, 823–854.
Nagle, B. F. (1953). Criterion development. Personnel Psychology, 6, 271–289.
Neal, A., & Hesketh, B. (2002). Productivity in organizations. In N. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work and organizational psychology, Vol. 2: Organizational psychology (pp. 7–24). Thousand Oaks, CA: Sage.
Netemeyer, R. G., Boles, J. S., & McMurrian, R. (1996). Development and validation of work-family conflict and family-work conflict scales. Journal of Applied Psychology, 81, 400–410.
Organ, D. W. (1988). Organizational citizenship behavior. Lexington, MA: D. C. Heath.
Parasuraman, S., Purohit, Y. S., Godshalk, V. M., & Beutell, N. J. (1996). Work and family variables, entrepreneurial career success and psychological well-being. Journal of Vocational Behavior, 48, 275–300.
Parker, S. K., Turner, N., & Griffin, M. A. (2003). Designing healthy work. In D. A. Hofmann & L. E. Tetrick (Eds.), Health and safety in organizations: A multilevel perspective (pp. 91–130). San Francisco, CA: Jossey-Bass.
Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000). Adaptability in the workplace: Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85, 612–624.
Reynolds, J. R. (1997). The effects of industrial employment conditions on job related distress. Journal of Health and Social Behavior, 38, 105–116.
Robinson, S. L., & Bennett, R. J. (1995). A typology of deviant workplace behaviors: A multidimensional scaling study. Academy of Management Journal, 38, 555–572.
Rothausen, T. J. (1999). “Family” in organizational research: A review and comparison of definitions and measures. Journal of Organizational Behavior, 20, 817–836.
Rotundo, M., & Spector, P. E. (2010). Counterproductive work behavior and withdrawal. In J. Farr & N. Tippins (Eds.), Handbook of employee selection (pp. 489–512). New York, NY: Routledge.

Rowley, T. J., & Moldoveanu, M. (2003). When will stakeholder groups act? An interest- and identity-based model of stakeholder group mobilization. Academy of Management Review, 28, 204–219.
Sackett, P. R., & Wanek, J. E. (1996). New developments in the use of measures of honesty, integrity, conscientiousness, dependability, trustworthiness and reliability for personnel selection. Personnel Psychology, 49, 787–830.
Senge, P. M., Smith, B., Kruschwitz, N., Laur, J., & Schley, S. (2008). The necessary revolution: How individuals and organizations are working to create a sustainable world. New York, NY: Random House.
Shellenbarger, S. (2002). Executive dad asks family for a 360 review. Wall Street Journal. Retrieved June 12, 2002, from http://www.careerjournal.com
Simpson, R. (1998). Organisational restructuring and presenteeism: The impact of long hours on the working lives of managers in the UK. Management Research News, 21, 19.
Smith, C. A., Organ, D. W., & Near, J. P. (1983). Organizational citizenship behavior: Its nature and antecedents. Journal of Applied Psychology, 68, 655–663.
Smolensky, E., & Gootman, J. A. (2003). Working families and growing kids: Caring for children and adolescents. Washington, DC: The National Academies Press.
Starkey, K., & Madan, P. (2001). Bridging the relevance gap: Aligning stakeholders in the future of management research. British Journal of Management, 12, S3–S26.
Thorndike, R. L. (1949). Personnel selection: Test and measurement techniques. New York, NY: Wiley.
Walsh, J. P., Weber, K., & Margolis, J. D. (2003). Social issues and management: Our lost cause found. Journal of Management, 29, 859–881.
Warr, P. (2005). Work, well-being and mental health. In J. Barling, E. K. Kelloway, & M. R. Frone (Eds.), Handbook of work stress (pp. 547–574). Thousand Oaks, CA: Sage.
Warr, P. B. (1987). Work, unemployment and mental health. Oxford, England: Clarendon Press.
Warr, P. B. (1994a). A conceptual framework for the study of work and mental health. Work and Stress, 8, 84–97.
Warr, P. B. (1997). Age, work and mental health. In K. W. Schaie & C. Schooler (Eds.), The impact of work on older adults (pp. 252–296). New York, NY: Springer.
Warr, P. B. (1999). Well-being in the workplace. In D. Kahneman, E. Diener, & N. Schwarz (Eds.), Well-being: The foundations of hedonic psychology (pp. 392–412). New York, NY: Russell Sage Foundation.
White, S. S., Mueller-Hanson, R. A., Dorsey, D. W., Pulakos, E. D., Wisecarver, M. M., Deagle, E. A., III, & Mendini, K. G. (2005). Developing adaptive proficiency in Special Forces Officers (Research Report 1831). Arlington, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.


Part VI
LEGAL AND ETHICAL ISSUES IN EMPLOYEE SELECTION
P. RICHARD JEANNERET AND S. MORTON MCPHAIL, SECTION EDITORS

26
ETHICS OF EMPLOYEE SELECTION
JOEL LEFKOWITZ AND RODNEY L. LOWMAN

Each of the chapters in this Handbook focuses on determinants of how the organizational human resource (HR) practice of employee selection can be done well. That is, the contents are aimed at providing the guidance needed to develop selection and promotion procedures that are accurate, valid, and useful for organizations. In this chapter we suggest another standard. In addition to doing selection well, we add a concern for doing it right. Hence, added to the technical and procedural knowledge and empirical criteria that guide employee selection, this chapter emphasizes the normative or moral standards associated with notions of the good, right, fair, or just. We suggest that doing selection well (i.e., technical competence) is inextricably bound up with doing it right. This approach also opens to reflection the implicit values and moral justification underlying the practice itself, in addition to considering the manner in which its constituent activities are implemented. In other words, the ethics of employee selection are as relevant as the ethics in employee selection.

SOME META-ISSUES

The Inextricable Mix of Competence, Ethics, Judgment, and Values

In the selection enterprise, industrial-organizational (I-O) psychologists work at the intersection of no fewer than four domains that are conceptually distinct but that have ambiguous, uncertain, and probably overlapping boundaries. We make decisions that reflect simultaneously varying aspects and degrees of (a) technical competence, (b) ethical considerations, and (c) differences in professional judgment. Moreover, interpretations and conclusions regarding the substantive matters at hand also reflect (d) the individual I-O psychologist’s views regarding such questions as “Whose interests matter?”, “Who is to benefit?”, or “What is the right thing to do?”, as well as other personal beliefs, attitudes, assumptions, and social values. For example, the choices and decisions made to estimate population parameters from single-sample or mean validity coefficients involve the generally unrecognized melding of technical, normative, and values issues in which it may be difficult to disentangle one’s professional judgment from one’s personal preferences. One sometimes encounters parameter estimates that are based on national rather than local labor pool measures of predictor variability in the applicant population or that use low archival estimates of the reliability of supervisor criterion ratings when actual reliability data may be accessible. And perhaps most dramatically, the economic utility of selection tests based on the prediction of individual-level subjective criteria (supervisor ratings) may be extrapolated to estimates of organizational-level financial performance in the absence of data justifying such causal inferences, particularly at that level of analysis.1
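To see why such choices matter numerically, consider the sketch below. It applies the standard Thorndike Case II correction for range restriction and the standard correction for criterion unreliability to a single hypothetical observed validity coefficient; the specific values (the observed correlation, the SD ratios, and the reliability estimates) are invented for illustration and are not drawn from any study discussed in this chapter. The point is only that more “generous” analytic choices can markedly inflate the resulting estimate.

```python
import math

def correct_for_range_restriction(r: float, sd_ratio: float) -> float:
    """Thorndike Case II correction; sd_ratio = applicant-pool SD / incumbent-sample SD."""
    return (r * sd_ratio) / math.sqrt(1 + r**2 * (sd_ratio**2 - 1))

def correct_for_criterion_unreliability(r: float, r_yy: float) -> float:
    """Correction for attenuation due to an unreliable criterion (e.g., supervisor ratings)."""
    return r / math.sqrt(r_yy)

observed_r = 0.25  # hypothetical observed validity in the validation sample

# Choice A: local labor-pool variability and locally measured criterion reliability
choice_a = correct_for_criterion_unreliability(
    correct_for_range_restriction(observed_r, sd_ratio=1.2), r_yy=0.80)

# Choice B: national-norm variability and a low archival reliability estimate
choice_b = correct_for_criterion_unreliability(
    correct_for_range_restriction(observed_r, sd_ratio=1.8), r_yy=0.52)

print(f"Estimated operational validity under choice A: {choice_a:.2f}")  # ~0.33
print(f"Estimated operational validity under choice B: {choice_b:.2f}")  # ~0.58
```

The order in which such corrections are applied, and the appropriateness of each artifact estimate, are themselves judgment calls of exactly the kind described above.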

The point of the illustration is not to denigrate the attempt to better understand the validity and utility of selection systems but to point out the underlying nature of the estimation procedure and our critique. They both inextricably entail decisions reflecting not only technical knowledge and experience but also ethical considerations of appropriateness and professional judgment. Moreover, all of those actions are shaped in the context of motives that reflect personal, societal, and/or professional interests and values. Is it coincidental that the effect of each of the choices and common practices mentioned in the previous paragraph is to maximally increase the numeric value of estimated validity and utility? As noted below, one of the customary “gut checks” for those who consciously wrestle with ethical dilemmas is to look with suspicion on one’s tendency to opt for a solution that just happens to be self-serving. Those who have given some thought to the matter have identified the values structure of I-O psychology as representing primarily managerial or corporate interests; historically, even, at times, to the extent of having an anti-labor bias (Baritz, 1960; Katzell & Austin, 1992; Lefkowitz, 1990, 2003, 2004, 2005, 2008; Lowman, 2006; Zickar, 2001).

The viewpoint that informs this chapter differs in that we are greatly influenced by three values positions that are at variance with such a singular perspective. We will have more to say about them later, but we highlight them briefly here so that the reader may be clear about our values position and how it may agree with or differ from the reader’s own. First is the universalist principle in moral philosophy that suggests that no one’s interests warrant a priori preference over anyone else’s, although there may be factual reasons in specific instances that justify granting such preference (Rachels & Rachels, 2015; Singer, 2011). Second, and commensurate with universalism, is the normative version of the prominent business ethics model of multiple-stakeholder management (Freeman, 1984; Freeman & Phillips, 2002), which asserts that it is right and just that powerful organizations that have enormous impact on society should recognize the legitimacy of the interests of those affected by them. (The instrumental version of the model holds that an organization’s actual success is dependent on how well it manages its relationships with all of its key stakeholder groups.) Third, and complementing the first two, is the so-called professional ideal (Kimball, 1992) or professional model (Hall, 1975), which asserts that the power and authority granted by society to a profession, such as I-O psychology, entail reciprocal responsibilities of the profession extending beyond its direct clients to the society at large. With respect to the practice of employee selection, there are at least 10 discernible groups of people who have a stake, directly or indirectly, in the process and/or its outcomes.
They include (1) qualified job candidates who are recognized as such by the selection system and thus hired; (2) qualified candidates who are misidentified by the system and rejected; (3) unqualified candidates who are correctly identified and so not hired; (4) unqualified candidates who are misidentified and hired; (5) coworkers of the successful candidates, and other employees, whose own work is in some way impacted by them; (6) their direct supervisors, whose own success may be dependent on the work performance of the new hires; (7) higher-level supervisors and managers of superordinate work units whose success also may be contingent on the performance of the newcomers; (8) the owners or shareholders of the company, whose investments depend on the overall performance of the organization; (9) the company’s clients or customers, who purchase the goods or services produced by it; and (10) the local community from which the job applicants are drawn, which may be affected in various ways by the company’s actions and success. The nature of their interests or “stake” in the selection system differs for many of them, as does the extent of its impact on them, but they all potentially have some legitimate claim to have their interests considered.

An Underappreciated Constituency: The Participants in Validation Research and Selection Programs

One ethically relevant matter underlies research with human participants in the biological, social, and behavioral sciences and is often overlooked, including in selection contexts. It is that such research, with a few exceptions, is generally not aimed at directly benefiting the people who participate in it as subjects (Lefkowitz, 2007a, 2007b). This statement is not to deny that research participants may ultimately benefit from the application of research findings (contingent on positive study outcomes) through the development of new drug treatments, more effective teaching strategies, more rewarding and satisfying jobs, or by not being placed in an ill-fitting job. But most of the applied research conducted by I-O psychologists is driven by the intentions of senior organizational policy makers in the service of organizational objectives or by the theoretical interests, curiosity, or ambitions of the researcher. For example, the development and validation of employee selection methods is generally not aimed explicitly at benefiting members of the validation sample(s). Whether the participants are current employees or job applicants, the results are applied to subsequent groups of job candidates to improve organizational effectiveness. Admittedly, however, those participants who are hired may benefit indirectly by having more competent coworkers in the future.

Because research participants are often, in this sense, used by us for testing and validation research in which they may have no personal interest in the outcome, we are ethically duty-bound to consider seriously issues such as the voluntary nature of their participation in the study, the extent of potential subjects’ obligation to participate, obtaining their participation through use of various inducements or implicit coercion, providing informed consent, examinees’ rights to access their own test data, and providing feedback. And when a testing program becomes operational, additional matters arise, including whether to provide an opportunity for retesting rejected applicants and the confidentiality and use of assessment data from incumbents. In the United States, researchers’ obligations in some of these areas are incorporated in regulations promulgated by the Federal Office for Human Research Protection (OHRP) of the U.S. Department of Health and Human Services (OHRP, 1991).2 However, the circumstances under which I-O psychologists conduct employee selection testing are generally recognized to provide us with somewhat greater ethical latitude. For example, (a) informed consent for testing is ordinarily not required of educational or employment applicants because they are deemed to have implicitly given consent by virtue of having applied (American Educational Research Association, American Psychological Association, National Council on Measurement in Education Standards for Educational and Psychological Testing, 2014 [hereafter, Test Standards, 2014], Standard 8.4; American Psychological Association, Ethical Principles of Psychologists and Code of Conduct (with 2010 Amendments), 2010 [hereafter, APA Code], Ethical Standard 9.03[a]); (b) job applicants may acceptably be asked to waive (or be assumed to have waived) access to their test results and so might not receive any feedback (Test Standards, 2014, Standards 8.9 and 11.6; APA Code 2002, Ethical Standard 9.10); and (c) providing an opportunity for job candidates to be retested is not obligatory according to the Standards (2014, Standard 12.10), although the Uniform Guidelines on Employee Selection Procedures indicate that a reasonable opportunity for retesting and reconsideration should be provided.
On the other hand, some ethical requirements are viewed as virtually universal, even in the operational employment setting, such as safeguarding to the extent feasible the confidentiality of test data and protecting against their misuse and informing employees or applicants beforehand of any limitations on confidentiality (APA Code, Ethical Standards 1.01, 3.11, 4.01, 4.02, 4.05, 9.04). Moreover, we argue there are some very good reasons why we ought not always avail ourselves of all the legitimate ethical exceptions that have been ceded to the practice of I-O psychology and should instead behave as if the more stringent ethical research environment pertained. As one of us has noted previously:

A corollary of that advantage, or right, we enjoy as a consequence of employees’ obligations [to cooperate with legitimate, non-threatening research] is the duty to see that their obligation is not abused or experienced as coercive. There is, obviously, an inherent conflict between the principle that all research participation should be explicitly voluntary and the existence of a relatively open-ended implicit obligation of workers to participate in legitimate organizational research. Notwithstanding the implied obligation, adherence to the moral principle of respect for persons requires that we treat research participation as genuinely voluntary and volitional to avoid even the semblance of coercion. (Lefkowitz, 2003, p. 336)

To the moral arguments regarding respect for the worth and dignity of all people, we add some pragmatic considerations. To function effectively in an organization, I-O psychologists depend on the goodwill and cooperation of organization members, which in turn depend on the psychologist’s reputation and the general reputation of the profession. Treating people cavalierly by taking their cooperation for granted is likely to produce adverse effects on future interactions, including those initiated by other I-O researchers and practitioners in that organization. In other words, it is in our own self-interest and in the interests of the profession to always treat those with whom we work honestly and with deference. A good principle to follow stems from the suggestion of the social psychologist Robert Rosenthal (1994), which is that we ought to think of our potential participants as a “granting agency” to which we must apply for necessary resources to implement our proposed investigations. Obviously, most of the selection work we do could not be accomplished without the input and cooperation of applicants, employees, and/or other subject matter experts (SMEs).

The Universalist, Multiple-Stakeholder, Professional Perspective

An implicit attribute on which all normative moral theories can be arrayed is the extent to which they are egoistic or universalist in nature. This meta-issue pertains to whose interests should be considered in understanding what is the good or right thing to do—only one’s own, or also those of others (typically, all those affected by the actions contemplated)? For Aristotle, the ultimate aim of human behavior (i.e., the ultimate good) is one’s own happiness. (In the Greek it is eudaimonia—generally thought to connote personal fulfillment, actualization, or “flourishing,” as well as simply feeling happy.) For Aristotle, happiness results from acting in accord with all of the human virtues, even the altruistic ones such as beneficence and sympathy. So for him there was no contradiction between self-interest and a broader-based, more altruistic conception of morality. Needless to say, contemporary ethical debacles in the world of business and elsewhere have displayed the ugly side of an unqualified pursuit of self-interest. Modern philosophers such as Rachels and Rachels (2015) have outlined two arguments that seem to repudiate unrestricted ethical egoism as a basis for moral theory. First, if one accepts that a major objective of the ethical enterprise is to provide moral guidance that reduces conflict and enhances cooperation among members of a society, it is clear that the unqualified pursuit of self-interest is counterproductive of these aims (Samuelson, 1993). Second, unrestricted egoism can be classified as one of a family of moral perspectives that makes a priori distinctions between people and justifies treating them differently on the basis of those putative differences (as with racism, sexism, anti-Semitism). In this instance, the distinction is simply between oneself and everyone else. But “[w]e should treat people in the same way unless there is a good reason not to” (Rachels & Rachels, 2015, p. 79, emphasis in original). (In this context, the process of negatively stereotyping a minority group can be understood as an attempt to manufacture such “differences” as justifications warranting prejudicial treatment.) Singer (1995), in complementary fashion, observes that “Self-interested acts must be shown to be compatible with more broadly based ethical principles if they are to be ethically defensible, for the notion of ethics carries with it the idea of something bigger than the individual” (p. 10). (And, indeed, in the next section we turn to a discussion of those bigger ideas.) In other words, in the universalist tradition the interests and rights of all those affected by an action are to be considered equal with respect to moral judgments regarding the action, unless good and reasonable arguments to the contrary can be made.

The best-known reflection of moral universalism in the field of business ethics and the social responsibility of business institutions is the normative version of the multiple-stakeholder perspective (Freeman, 1984; Freeman & Phillips, 2002). Instrumental stakeholder theory is merely descriptive. Normative stakeholder models are prescriptive and stem from recognition of the enormous power, size, and widespread societal impact of corporations. From those observations it is concluded that they have an obligation to take into account the interests of the many constituencies that are impacted by their actions and with whom they may be thought of as having implicit social contracts.
I-O psychologists are probably ahead of our colleagues in other subfields of psychology, who apparently are only now anticipating the likelihood of finding themselves “increasingly drawn into situations where a multitude of social and political interests apply across hierarchies of individuals to whom we owe various degrees of professional duties” (Koocher, 2007, p. 381).

All of this suggests that in evaluating a selection system we ought to consider not only its effectiveness but also, from an ethical perspective, its impact on all of those affected. We always have an instrumental concern for the extent of productivity improvement our client company or employer can expect from cohorts of job applicants hired on the basis of validated predictor measures. That is, in the language of selection classification, we anticipate significantly increasing the proportion of those selected who are successful on the job (“true positives”) in relation to those hired who are unsuccessful (“false positives”), but we should also be concerned about the proportion of incorrectly rejected applicants who have been denied employment (“false negatives”) because of the imperfect validity of those predictors. In other words, enhancing the interests of the organization adversely and arguably unfairly impacts a substantial proportion of the applicant population. In addition, customary attempts to further increase productivity improvement by means of more restrictive hiring (decreasing the selection ratio) can generally be expected to exacerbate the harm by increasing the proportion of false negatives among those rejected—to a greater extent than the decrease in the proportion of false positives (a simple numerical sketch of this tradeoff appears at the end of this section). The structure of this situation is that of a classic ethical dilemma—actions that benefit some directly hurt innocent others—yet to our knowledge it has never been addressed seriously in the literature of I-O psychology. We surmise that the reason is that I-O psychology, at least in the context of employee selection, tends to view the organization as the primary (or only) relevant stakeholder. The interests of applicants, especially rejected applicants who are not and will not be members of the organization, are generally not considered. But that is an inappropriately narrow view for professionals to hold. Professions are characterized by attributes that distinguish them from other occupations (Haber, 1991), and among the more salient of those attributes is a sense of responsibility and obligation that extends beyond the paying client to segments of the broader society. This is generally thought to constitute a quid pro quo for the considerable amount of power, influence, and respect afforded by society to professions and their members. This broader perspective has been referred to as a “true professional ideal” (Kimball, 1992, p. 303) or “the professional model” (Hall, 1975, p. 72).

In sum, the universalist tradition in moral philosophy; the multiple-stakeholder approach from the study of management, business, and society; and the professional model from the sociological study of occupations all coalesce around the notion that ethical evaluations of our selection programs require that their impact on all of those affected by them be considered. This could mean assuring that rejected candidates are afforded an opportunity for retesting and perhaps even, when circumstances and budgets allow, utilizing relatively low cutoff scores and relying on probationary employment as a selection device.
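The Monte Carlo sketch below illustrates the classification tradeoff described above. The validity coefficient, base rate, and selection ratios are assumed values chosen only for illustration, and the proportions are expressed relative to the entire applicant pool.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000           # simulated applicants
validity = 0.30       # assumed predictor-criterion correlation
base_rate = 0.50      # assumed share of applicants who would succeed if hired

# Bivariate-normal predictor (x) and criterion (y) scores
x = rng.standard_normal(n)
y = validity * x + np.sqrt(1 - validity**2) * rng.standard_normal(n)
would_succeed = y > np.quantile(y, 1 - base_rate)

for selection_ratio in (0.50, 0.20):
    hired = x > np.quantile(x, 1 - selection_ratio)
    false_pos = np.mean(hired & ~would_succeed)    # hired but unsuccessful
    false_neg = np.mean(~hired & would_succeed)    # rejected but would have succeeded
    print(f"SR = {selection_ratio:.2f}: "
          f"false positives = {false_pos:.3f}, false negatives = {false_neg:.3f}")
```

With these assumed values, tightening the selection ratio from .50 to .20 cuts the false-positive share of the pool by roughly .13 but raises the false-negative share by roughly .17, which is the pattern of concern in the paragraph above.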

ETHICAL PRINCIPLES AND DILEMMAS

How does one know when he or she is faced with an ethical dilemma, as opposed to a mere technical, procedural, administrative, or professional problem? (They are not mutually exclusive. Ethical dilemmas may occur in any of those realms.) The study of moral thought has yielded three succinct criteria by which to answer the question (Wittmer, 2001). The problem will involve (a) the expression of fundamental moral or ethical principles like those discussed as follows and articulated in formal ethical codes such as that of the APA (2010), and the individual will be faced with (b) having to make a decision that (c) has significant impact on others.

Ethical Principles

Although space constraints preclude delving into the origins of the ethical principles presented here, it should be noted that they emerge from a long history of moral philosophy and more recent work in moral psychology (see Lefkowitz, 2003, for a review). They are reflected in various normative ethical theories that are generally either deontological or consequentialist in nature.

Deontological theories are concerned with right and wrong per se; they hold that the rightness or wrongness of an action is intrinsic to the nature of the act on the basis of whether it violates a moral principle. Deontologists are concerned with principled expressions of rights, duties, responsibilities, virtue, fairness, and justice. Some deontological principles are expressed positively in terms of affirmative duties (e.g., treat job applicants with respect; protect the confidentiality of test data), but many are expressed negatively in terms of actions that are disallowed versus what it is permissible to do (e.g., do not exaggerate or make “hyperclaims” to organization decision makers about the likely benefits of your proposed selection system). If one of us fails to protect the confidentiality of employee test data, the deontologist will view that person as having wronged those individuals even if there is no evidence of their actually having been harmed. Note, however, that harm may have been caused to the reputation of the profession. Consequentialists, on the other hand, define right and wrong in terms of the good and bad (or benefits and harms) that will result from an action. The relative morality of alternative actions is directly reflected in the net amounts of goodness that can be expected to result from each. The option that leads to the greatest amount of net good or the least amount of net harm (considering all those to be impacted) is the morally imperative choice, not merely a permissible one, as with deontological approaches. Neither system of thought is free from legitimate criticism by proponents of the other perspective, and some situations seem more amenable to analysis by one rather than the other of the two strategies, so that the prudent professional should be familiar with both approaches to ethical analysis and decision making.

Respect

People have the right to be treated with dignity and respect and allowed to exercise their rights to privacy, confidentiality, freedom, autonomy, and self-expression. These rights are universalizable (i.e., applicable as much to anyone else as to oneself) and bounded by reciprocal obligations. For example, our right to pursue our research objectives should not supersede an employee’s right to not participate in the study. With regard to the principle of respect, psychologists are obligated in particular to be aware of and to eliminate the effects of prejudice and bias related to “age, gender, gender identity, race, ethnicity, culture, national origin, religion, sexual orientation, disability, language, and socioeconomic status” (APA Code, 2010, Principle E).

Fairness and Justice

The notion of justice can be among the more nebulous ethical principles to conceptualize and implement, but it is important to distinguish notions of justice in moral philosophy, political theory, or economics (Wolff, 2005), which usually refer to normative societal distributions of goods, from “organizational justice” as it is usually studied in I-O psychology, which more typically focuses on perceptions of procedural fairness (Cuguer-Escofet & Fortin, 2014; Lefkowitz, 2009). For example, justice can be defined deontologically as each person having a fair balance of rights and duties, or in consequentialist fashion as each receiving a fair proportion of the available benefits and burdens associated with membership in a social system (e.g., organization or nation). However, alternative criteria of fairness represent social and moral values positions and so are influenced greatly by macro-level political and economic systems (e.g., the marked preference in the American free-enterprise system for the distributive justice criterion of equity, or “merit,” in comparison to our expectation of equality of treatment in the legal system).

Caring: Beneficence The origins of this principle are in consequentialist theory and the “ethics of care” in moral psychology. It is reflected in the traditional service ideal of the professions: “Psychologists are committed to increasing scientific and professional knowledge of behavior . . . and to the use of such knowledge to improve the condition of individuals, organizations, and society” (APA Code, 2010, p. 3) and “providers of I-O psychological services are guided primarily by the principle of promoting human welfare” (APA, 1981, p. 668). Although this principle is generally interpreted within the context of the universalist meta-principle that the interests of all those concerned are to be considered equal, it is also generally recognized that most of us justifiably care more for some people than others and/or there may be some to whom we owe a special obligation or duty (e.g., family, friends, neighbors, colleagues, clients, employer). Therefore, it is usually not viewed as unethical per se to act on those special concerns and obligations. However, there may be occasions when such actions slide impermissibly far down the slippery slope of favoritism, prejudice, bias, or nepotism.

Caring: Nonmaleficence The obligation not to cause unjustifiable harm is generally thought to apply equally to all others, even strangers. It is especially pertinent with regard to those who are in potentially vulnerable positions (e.g., employees, students, job candidates, research participants). The primacy of nonmaleficence is indicated in the APA Code (2010): When conflicts occur among psychologists’ obligations or concerns, they attempt to resolve these conflicts in a responsible fashion that avoids or minimizes harm . . . and [they] guard against personal, financial, social, organizational, or political factors that might lead to misuse of their influence. (p. 3)

Moral Character Many ethical treatises and professional codes of conduct include discussions of personal attributes having to do with the character of the person potentially faced with a moral dilemma rather than on the process of his or her ethical decision making. For example, the APA’s (2010) ethical code is explicit about the importance of fidelity and responsibility (to those with whom we work, to our communities, and to society) and of integrity (accuracy, honesty, truthfulness, and promise-keeping) (Principles B and C, respectively, p. 3). In recent years, technological advances (e.g., web-based job applications and employment testing), changes in the nature and conditions of work (e.g., home-based work, increased use of teams), and other dramatic changes such as the globalization of organizations have impacted the way in which ethical problems in employee selection (and other professional domains) are manifested. Notwithstanding those changes in the manifest circumstances of contemporary work life, the importance of the five sets of fundamental moral principles noted above is reflected in the following observation: The paradigmatic forms taken by those [ethical] problems, the character traits and motives needed to recognize them as such, the ethical reasoning used to address them, as well as the substance of the ethical principles on which such reasoning is based are all essentially unaffected and still pertain. (Lefkowitz, 2006, p. 245)

We turn now to a consideration of those paradigmatic forms.

Forms of Ethical Dilemmas Hoffman’s (1988) theory of moral development included three ideal types of moral dilemma from which the internalized sense of morality develops. Lefkowitz (2003, 2006) elaborated and extended those to four and later to five (Lefkowitz, 2007b, 2012) forms or paradigms of ethical challenges that seem to represent a comprehensive taxonomy (with the understanding that there may be combinations of two or more of them).

Paradigm I. Preventing Harm: Possessing Foreknowledge of Someone to Be Harmed or Wronged HR managers and organizational consultants frequently are privy to impending company policy decisions or personnel actions that may entail some harms or wrongdoing. For example, a manager may intend to promote someone generally known to be less qualified than other candidates. A senior HR administrator may be intent on implementing the latest selection test fad that you know is likely to have adverse impact on minority applicants and has virtually no credible validity evidence. Failing to act to prevent an impending harm or wrong may sometimes be motivated primarily by a sense of organizational loyalty rather than by self-serving objectives, but revealing, challenging, or resisting a contemplated action by a superior might also entail some personal risk, hence exacerbating the dilemma. Suppose you are an internal consultant in the development section of the HR department of a large corporation and you are beginning to train a group of high-level managers to serve as assessors for a planned assessment center. When leading an evaluation discussion following a mock exercise engaged in by actors playing hypothetical promotion candidates, a senior executive—the apparent informal leader of the group—makes a demeaning sexist remark about the one female candidate being discussed, and all of the other managers laugh appreciatively. Responding appropriately and ethically may require an abundance of courage and social skill.

Paradigm II. Temptation: Contemplating a Self-Serving Action That Would Be Unjust, Deceitful, or Potentially Cause Harm to Another Recent notorious examples of this sort of unethical action in the corporate world are well known. Other examples may be less extreme instances of acquiescing to inappropriate peer (or superior) expectations to “get along.” Of particular relevance to organizational life are instances in which one’s potential unethical actions serve the explicit or implicit policies, directives, or aims of the organization, rather than one’s own personal interests. Even so, given the prevalence of employees’ psychological identification with the organization, formal performance-based reward systems, and the informal recognition to be gained by accomplishing company goals and objectives, such (mis)behavior nevertheless might also be readily construed as self-serving.

Paradigm III. Role Conflict: Having Competing Obligations or Responsibilities to Two or More Persons or Other Entities Such That Fulfilling One Might Mean Risking the Other(s) This type of dilemma is almost inevitable, given the universalist, multiple-stakeholder, professional perspective that acknowledges responsibility to several (perhaps conflicting) constituencies. Role conflict is especially salient for employees who are in internal boundary-spanning positions with responsibilities to multiple departments. It is also pertinent for those who operate at the external boundaries of the organization, such as salespersons and purchasing agents who may have considerable loyalty to longstanding customers, clients, or suppliers, as well as to their employer; or professionals who acknowledge their responsibilities to society and the common good as well as to the organization. Consultants who are afforded the opportunity to work with multiple (i.e., competing) firms in the same industry should also be familiar with this form of potential dilemma. Relevant matters to be considered include (a) the consultant’s general representations regarding knowledge gained
from working with previous clients/competitors; (b) each party’s understanding of the expectations of client #1 with respect to the consultant’s prospective work with competitors; (c) what useful information, whether proprietary or not, was garnered from working with client #1 that might be useful in working with client #2 and by extension improve their competitive position; (d) client #2’s expectations regarding accessibility of the consultant’s cumulative knowledge of the policies of other firms in the industry; etc. For example, suppose a portion of the criterion-related selection test validation project paid for by client #1 consisted of the time-consuming development of a complex criterion measure based on an empirically derived composite of job performance indicators. Such sophisticated job knowledge, if shared, could be a persuasive part of the consultant’s “sales pitch” for conducting a selection study for client #2. Is that appropriate? These matters are all best discussed with client #1 before beginning that project.

Paradigm IV. Values Conflict: Facing Conflicting and Approximately Equally Important Personal Values So That Expressing One Entails Denying the Other(s) At a somewhat macro-level, this is the battlefield on which conflicts play out between the objectives of shareholder value and corporate financial performance (CFP) on one side versus the putative societal obligations of business reflected in the corporation’s “social performance” (CSP; Lefkowitz, 2007c). At the level of specific HR systems such as selection, it gets reflected in the attempt to balance ostensibly competing objectives such as increasing economic utility and decreasing adverse impact on minorities (De Corte, Lievens, & Sackett, 2007). It is on a note of optimism that we point out that the most recent accumulations of evidence suggest that CFP and CSP may be entirely compatible or even complementary (Guenster, Derwall, Bauer, & Koedijk, 2005; Orlitzky, Schmidt, & Rynes, 2003).

Paradigm V. Pressure to Violate Ethical Principles3 Organizations value productivity, efficiency, speed, and profitability, which often get expressed in corresponding pressures for goal attainment on managers who may be the superiors or clients of an I-O psychologist. And it is possible that those pressures are directed to the I-O psychologist and conflict with professional ethical standards. For example, perhaps a senior executive decides that assessment data, which had been collected in the context that—and with assurances—it was to be used confidentially only for developmental-coaching purposes, would be very useful for performance management and appraisal. Standard 1.03 of the APA Ethics Code (2010 revision) indicates that if organizational demands conflict with the code, psychologists clarify the nature of the conflict, make known their commitment to the Ethics Code, and take reasonable steps to resolve the conflict consistent with the general Principles and Ethical Standards of the Ethics Code. Under no circumstances may this standard be used to justify or defend violating human rights. (APA Ethics Code, 2010, p. 4)

None of the five paradigms explicitly mentions matters of technical competence. Is competence an ethical issue? In fact, the APA Code (2010) contains six enforceable standards in the section headed “Competence”; for example, “Psychologists’ work is based upon established scientific and professional knowledge of the discipline” (Standard 2.04) and “Psychologists undertake ongoing efforts to develop and maintain their competence” (Standard 2.03). Suppose an I-O psychologist conducts a well-done validation study for a client and reports a statistically significant validity coefficient for a set of predictors but fails to report several other nonsignificant coefficients with respect to other relevant criteria investigated. If the psychologist is ignorant of the statistical prohibition against exaggerating validation findings by capitalizing on
chance relationships, he/she is not competent. Now, what if the psychologist was not ignorant of these psychometric matters but had struggled with temptation and, ultimately and with some misgivings, omitted the negative findings out of concern for disappointing the client? Or worse still, in a third variant, suppose he/she freely and deceitfully chose to distort the nature of the findings to justify having inappropriately guaranteed favorable results in advance? Ethical analyses invariably include a focus on the “bottom line” of an action taken or actions contemplated (i.e., the consequences of the act(s) on all those affected). Each of these three scenarios represents an ethical transgression because of the potential harm to be caused to the client and job applicants by using an ineffective selection system and ultimate harm to the reputation of the profession when the ineffectiveness becomes apparent. However, the motives of the psychologist are rather different in each scenario, and in ethical analyses it is also true that motives matter. So, in what way do they matter? In each of the scenarios the psychologist is portrayed as increasingly venal: from merely inexcusably ignorant and failing to live up to one’s professional obligations, to defensively self-protective and disrespectful of a client’s rights, to premeditatedly self-serving and deceitful. Two observations are warranted. First, these different motives—implying different moral characters—make little or no difference in terms of the consequences of the incomplete data reporting. That is why “mere” incompetence is an ethical matter. Second, it is likely that the reader feels somewhat differently about each of our three hypothetical transgressors—perhaps feels that their degree of venality is related directly to their degree of culpability. That may, depending on circumstances, appropriately lead to differences in what we view as the suitable degree of penalty or opprobrium for each of our three transgressors for the “same” offense.
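The statistical point about capitalizing on chance can be made concrete with a brief illustration (ours, not drawn from the APA Code or from the validation literature cited above); it assumes independent significance tests at a conventional alpha of .05, which is a simplification because multiple criteria are usually intercorrelated.

```python
# Illustrative only: familywise chance of a "significant" validity coefficient
# when several criteria are examined but only the favorable result is reported.
# Assumes k independent significance tests at alpha = .05 (an approximation;
# real criterion measures are usually correlated, which changes the exact figure).

alpha = 0.05

for k in (1, 3, 5, 10):
    p_at_least_one = 1 - (1 - alpha) ** k
    print(f"{k:2d} criteria examined -> "
          f"P(at least one spuriously 'significant' r) = {p_at_least_one:.2f}")

# Typical output: 0.05, 0.14, 0.23, 0.40 -- reporting only the one significant
# coefficient therefore overstates the evidence for validity.
```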

ROLE OF ETHICAL CODES IN PROFESSIONAL PRACTICE: HISTORICAL AND CURRENT PERSPECTIVES Whether licensed or not, professional psychologists are expected to follow the ethics code of the APA (APA Code, 2010), although enforcement mechanisms pertain only to APA members. A brief history of professional ethics in psychology reveals the absence of a code for the first 50 years of the APA (Pope & Vetter, 1992), its initial empirical start based on critical incidents, and its evolution over more than 60 years. During the last 15 years, greater attention has been paid to I-O issues so that the current code even applies to selection work. Professional codes of conduct typically derive from the practice of a profession; behaviors that arouse concerns about appropriate and inappropriate behavior work their way into a code of conduct over time. They also often inductively work themselves backward to a philosophical basis rather than starting that way. For example, consider the famous Hippocratic Oath for medicine, one translation (Edelstein, 1967) of which, thought to date back to the fifth century bc, is as follows: I swear . . . that I will fulfill according to my ability and judgment this oath and this covenant: I will apply dietetic measures for the benefit of the sick according to my ability and judgment; I will keep them from harm and injustice. I will neither give a deadly drug to anybody who asked for it, nor will I make a suggestion to this effect. I will not use the knife . . . but will withdraw in favor of such men as are engaged in this work. Whatever houses I may visit, I will come for the benefit of the sick, remaining free of all intentional injustice, of all mischief. . .. What I may see or hear in the course of the treatment or even outside of the treatment in regard to the life of men, which on no account one must spread abroad, I will keep to myself, holding such things shameful to be spoken about. (p. 6)

Note that the Hippocratic Oath does not emphasize moral principles underlying the ethical admonitions, despite having been created in a golden era of moral philosophy. It imposed on those taking the oath specific obligations to behave in certain ways and not to behave in
other ways. Some of its tenets are readily interpretable in terms of modern professional ethical standards, but others would be irrelevant or considered inappropriate in today’s world. Another aspect of the oath that is relevant to contemporary I-O psychologists is, despite its pragmatic orientation, its explicit recognition of the broader professional and moral context within which the specific obligations are taken on. The doctor is not only to be technically competent (providing good dietary recommendations, not performing surgery), but is to be pure, prevent harm and injustice, and protect confidentiality. Some professions favor a narrow and explicit approach to ethical practice—i.e., if it is not explicitly prohibited (e.g., by an ethics code), then it is not unethical. (An extension of this approach is the view that any action that is not clearly illegal is morally permissible.) Others see the need for a broader, more proactive approach in which moral principles and the values to which they give rise deserve expression, even when no specific ethical “violation” is identified. In its enforceable Standards, the APA Code (2010) is an example of the former pragmatic approach; in its General Principles, it exemplifies the latter. It bears reminding that for more than the first half century of its existence, the APA had no ethics code at all. This case of apparent arrested development reflects the historical growth of a field that for the early part of its existence was not as concerned with the practice of psychology as with the establishment of the field as a science. Only with the burgeoning growth of clinical psychological practice around the time of World War II did the need for a formal code of ethics for psychology become more intensely apparent. The initial code of ethics for psychologists emerged from an empirical rather than an a priori theoretical or philosophical base (cf. Pope & Vetter, 1992). Members of the association were polled about incidents they had encountered that raised ethical issues, and from those data, an initial code of ethics was written. The field of I-O psychology is relatively new as an applied area of training and practice (cf. Lowman, Kantor, & Perloff, 2006). As a result, until the 2002 revision, the code included little to suggest that it was written with applied practice in I-O psychology in mind. A partial exception was the area of testing, a domain that is included in many types of applied practice and so has long been a focus of the various editions of the code. However, more attention is paid in the code to issues associated with test construction and with applications in individual assessment contexts than explicitly to the mass testing often associated with employee selection. Nevertheless, the 2002 code did take modest steps to address how the ethics principles and standards applied to I-O and organizational consulting psychology. The code added several references to consulting and psychological work in organizations and includes “organizational clients” in most categories of service. For example, ethics Standard 3.11 explicitly concerns psychological services delivered to or through organizations. It clarifies the issues involved in working with individuals versus organizations and the responsibilities of psychologists to the individual organization members with whom they work when they are not themselves the defined client.
APA Standard 3.11 Psychological Services Delivered To or Through Organizations (a) Psychologists delivering services to or through organizations provide information beforehand to clients and when appropriate those directly affected by the services about (1) the nature and objectives of the services, (2) the intended recipients, (3) which of the individuals are clients, (4) the relationship the psychologist will have with each person and the organization, (5) the probable uses of services provided and information obtained, (6) who will have access to the information, and (7) limits of confidentiality. As soon as feasible, they provide information about the results and conclusions of such services to appropriate persons. (b) If psychologists will be precluded by law or by organizational roles from providing such information to particular individuals or groups, they so inform those individuals or groups at the outset of the service. (APA Code, 2010)

The entire ninth standard, which is on assessment, has direct applicability to most employee selection psychology. It encompasses individual assessments and those done in the context of groups such as with applicant selection, and it indicates that consent may ethically be implied.

APA Standard 9.03 Informed Consent in Assessments (a) Psychologists obtain informed consent for assessments, evaluations, or diagnostic services, as described in Standard 3.10, Informed Consent, except when . . . (2) informed consent is implied because testing is conducted as a routine educational, institutional, or organizational activity (e.g., when participants voluntarily agree to assessment when applying for a job); or (3) one purpose of the testing is to evaluate decisional capacity. (APA Code, 2010)

This section of the code also identifies the requirements for test construction, issues related to outdated assessments, and issues related to feedback on the results of tests. It also deals with release of information about tests, obligations of psychologists concerning test security, and situations involving obsolete tests. However, it can be argued that the code says very little per se about the common situation in which psychologists who administer large testing programs in industry or government work for nonpsychologists, and the decisions about testing programs are made by persons with little psychological training. But two sections of the code do explicitly cover such situations as envisioned by the fifth paradigm or form of ethical dilemma noted above (external pressure to violate ethical norms), Standards 1.02 and 1.03. These were amended by the APA in 2010 following the exposure of the role of psychologists in U.S. government “enhanced interrogation techniques” with captives (see Hoffman et al., 2015). APA Standard 1.02 Conflicts Between Ethics and Law, Regulations, or Other Governing Legal Authority If psychologists’ ethical responsibilities conflict with law, regulations, or other governing legal authority, psychologists clarify the nature of the conflict, make known their commitment to the Ethics Code and take reasonable steps to resolve the conflict consistent with the General Principles and Ethical Standards of the Ethics Code. Under no circumstances may this standard be used to justify or defend violating human rights. (APA Code, 2010) APA Standard 1.03 Conflicts Between Ethics and Organizational Demands If the demands of an organization with which psychologists are affiliated or for whom they are working are in conflict with this Ethics Code, psychologists clarify the nature of the conflict, make known their commitment to the Ethics Code, and take reasonable steps to resolve the conflict consistent with the General Principles and Ethical Standards of the Ethics Code. Under no circumstances may this standard be used to justify or defend violating human rights. (APA Code, 2010)

There is also the following ethics standard, which imposes an ethical obligation to take appropriate action in response to misuse of one’s work: APA Standard 1.01 Misuse of Psychologists’ Work If psychologists learn of misuse or misrepresentation of their work, they take reasonable steps to correct or minimize the misuse or misrepresentation. (APA Code, 2010)

SOME SPECIFIC ISSUES AND SOURCES OF ETHICAL PROBLEMS4 In this section we present several specific illustrative ethical issues in the practice of employee selection and indicate the sections of the APA Code that provide some guidance.

Generic Ethical Issues in Selection The Basics: Issues of Validity As the reader is likely to be well aware, the overwhelmingly most important matter in selection— from technical, professional, and ethical perspectives—is the appropriate justification of the personnel actions taken; that is, the validity of the measures on which those decisions to hire or promote people (or to decline to do so) are based. Validity is inherently an ethical issue because it reflects the relative accuracy of selection decisions by which some are selected/hired and some are rejected; the absence of validity can result in serious harm to both applicants and employers. (Many of the ethical issues that seem associated with particular selection procedures represent manifestations of this generic issue.) That validity is a fundamental ethical requirement is suggested, among other APA Standards, by the following: Standard 2.04 Bases for Scientific and Professional Judgments Psychologists’ work is based upon established scientific and professional knowledge of the discipline. (APA Code, 2010)

Professional Competence As noted earlier, competence in the conduct of one’s profession is an important ethical issue for many reasons (cf. Ethical Standards 2.01(b), 2.06, APA Code, 2010). The issue of competence of course overlaps that of validity, but it also requires that psychologists base their practice on mastery of the relevant technical knowledge base associated with their area of psychology, as indicated by the following standard: Standard 2.01 Boundaries of Competence (a) Psychologists provide services, teach and conduct research with populations and in areas only within the boundaries of their competence, based on their education, training, supervised experience, consultation, study, or professional experience. (APA Code, 2010)

Test Security Psychologists are mandated by Ethical Standard 9.11 to maintain test security, namely: Standard 9.11 Maintaining Test Security The term test materials refers to manuals, instruments, protocols, and test questions or stimuli and does not include test data as defined in Standard 9.04, Release of Test Data. Psychologists make reasonable efforts to maintain the integrity and security of test materials and other assessment techniques consistent with law and contractual obligations, and in a manner that permits adherence to this Ethics Code. (APA Code, 2010)

In the case of a psychologist administering intelligence tests in the context of a private practice or school system, the issues of maintaining test security may be straightforward. However, in today’s employee selection context, tests may be administered to hundreds of thousands of applicants, tests may be administered electronically with no oversight of the test-taking circumstances, and a team of psychologists and nonpsychologists may help create and validate a test with little direct control by a psychologist of the security of the process. Although the psychologist’s ethical mandate to protect test security is conceptually clear, the practical realities
of corporate and government testing contexts are often far more complicated and difficult than the code may have contemplated. In such complex situations, our best advice is to proceed cautiously and seek out advice from knowledgeable colleagues.

Multiple Responsibilities: Who Is the Client? As noted earlier, an important hallmark of a true profession is the recognition by its practitioners of responsibilities that extend beyond the paying client. In the case of employee selection in organizations, those responsibilities extend in two directions: within the organization to individual job applicants and promotional candidates and beyond the organization to the community that depends on the continued success of the organization and that is impacted by its actions.

Some Issues Relating to Particular Selection Methods Individual-Level Assessments There are many ethical issues associated particularly with selection or evaluation at the individual level (see Jeanneret, 1998, for a review). Although standards of practice are well defined at the level of applying individual tests to clinical practice (e.g., assessing parents and children in the context of fitness-for-parenting in divorce proceedings), the literature on individual assessments in selection contexts is far less developed. Issues of validity for individual instruments (such as the selection interview, Fletcher, 1992) and, particularly, the proper metric for combining across domains of testing (such as in the domains of occupational interests, abilities, and personality characteristics, cf. Lowman, 1991) suggest that there is much work still to be done for valid conclusions to be drawn reliably. The use of multiple types of psychological assessment data and the translation of such data into predictions that have psychological validity entail at least three very significant issues: (1) whether all of the data can be quantified and, if so, the relative weights to be given each of the sources of information in arriving at a composite evaluation or prediction; (2) if not, how to meaningfully integrate qualitative and quantitative information about the candidates; and (3) what role the specific organizational context should play in any recommendations based on the assessments (e.g., factoring in what is known about the supervisor of the targeted position and the culture and expectations of the organization). Additional ethical issues particularly relevant to the process of individual assessment include maintaining confidentiality, recognizing that the assessee and the client organization are both clients of the assessor, the qualifications and proper training of those administering and interpreting examinations, assuring that the client organization understands the limitations of the assessment process, and providing adequate feedback to the candidates (Jeanneret, 1998; Prien, Schippmann, & Prien, 2003).
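For readers who want a concrete picture of issue (1) above, the following minimal sketch shows one conventional way of forming a weighted composite from quantified assessment sources; the sources, norms, and weights are invented for illustration and are not a recommended configuration.

```python
# Hypothetical illustration of issue (1): combining quantified assessment
# sources into a single composite by standardizing each source and applying
# explicit weights.  Scores, norms, and weights are invented for the example.

sources = {
    # source name: (raw score, norm-group mean, norm-group SD, weight)
    "cognitive_ability_test": (28.0, 24.0, 5.0, 0.50),
    "conscientiousness_scale": (3.9, 3.5, 0.6, 0.30),
    "structured_interview": (4.2, 3.8, 0.7, 0.20),
}

composite = 0.0
for name, (raw, mean, sd, weight) in sources.items():
    z = (raw - mean) / sd                     # put each source on a common z metric
    composite += weight * z
    print(f"{name:25s} z = {z:+.2f}, weight = {weight:.2f}")

print(f"Weighted composite = {composite:+.2f}")
# How the weights are justified (judgmentally vs. empirically) and whether
# qualitative information can be folded in at all are the ethical questions
# raised here; the arithmetic itself is the easy part.
```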

Assessment Centers Assessment centers (ACs) seem to have attributes that attract ethical challenges. Their notable early and well-publicized success has led to a faddish proliferation beyond the resources of those who are actually trained and skilled in their development, implementation, and administration. (Refer to Chapter 38, this volume.) For example, Caldwell, Thornton, and Gruys (2003) have itemized 10 “classic errors” in this area of practice, most of which have ethical implications (poor planning, shoddy exercise development, no pretesting of exercises, using unqualified assessors, etc.) Colloquially, experienced AC practitioners also have observed problems such as promotion by consultants of the utility of their AC entirely on the basis of the general research literature, which has little to do with the consultants’ specific proposed procedures; use of unprofessional assessors who receive virtually no training in behavioral observation, rating,
and evaluation and who may be unfamiliar with the particular exercises used; subsequent pressure on AC staff or consultants to use the data from an earlier developmental AC for personnel actions (e.g., retention decisions during a reduction-in-force, sometimes exacerbated by the age of the data); widely disseminating individual assessment data in the organization for various unintended purposes; and using generic exercises that have little or no demonstrable relationship to the target jobs. In addition, various administrative gaffes have been noted, such as failing to maintain the security of measures for days 2 and 3 of a multi-day AC; allowing nonassessors (e.g., senior managers) to influence assessment evaluations; and failing to provide appropriate feedback to candidates or providing inappropriate feedback, such as implying organizational actions to be taken (“you clearly have the talent to be promoted soon”), etc. All of these matters (and others, such as “rights of the participant”) are discussed carefully in Guidelines and Ethical Considerations for Assessment Center Operations (International Task Force on Assessment Center Guidelines, 2015).

Computer- and Web-Based Testing5 The administrative and financial efficiencies of computerized and web-based job application and selection testing procedures are often considerable. It is not surprising that the practice is growing. However, as is often the case with new technologies or procedures, the incidence of usage has probably outstripped careful consideration of potential problems in implementation (see Joint Task Force, 2013; Lowman, 2013b). We see three broad sets of problems to be considered. The first is largely pragmatic and has to do with administrative and technical problems associated with a computerized delivery system (e.g., provision of an adequate number of computer consoles). The second set of problems has more professional and ethical overtones, having to do with the equivalence of test results obtained by traditional means of testing with the results of the same tests administered via computer (Potosky & Bobko, 2004). That is, to what extent is empirical validity evidence from traditional test administrations to be taken as wholly applicable to webbased administration? It is at least possible, if not likely, that degrees of equivalence will vary as a function of the domain tested (e.g., cognitive ability vs. personality attributes), type of test (e.g., timed vs. untimed), response format (e.g., short-answer vs. open-ended), examinee attributes (e.g., facility with computers, degree of self-efficacy, etc.), and other factors. Psychometricians may be sanguine that the degree of correlation between paper-and-pencil and computer-delivered test administrations is high enough to conclude that the same measurement objectives are being met (r ~ .6–.8). From a fairness perspective, however, the same pass/fail cut score on the two forms of administration will include and exclude some different examinees, as may rank-ordered selection. The third set of problems is associated with web-based assessment, independent of the equivalence issue. These include concern for test security, the possibility of cheating when the testing is unproctored, differential access to the Internet for different groups of potential applicants, etc. (cf. Tippins et al., 2006). Other ethical issues raised by online testing methods include the unintended consequences of delivery of tests, especially in global/international contexts (see Lowman, 2013a, 2013b). Suppose a test administered via the web contains a speeded component. In the case of slow Internet speeds, particularly those available in developing countries, the test stimuli are not equal, and those in poorer settings may be unfairly tested compared to those in more developed countries. Concerning the use of unproctored Internet testing, psychologists must contend with many ethical issues. Even the recently issued revised Test Standards (2014) seem not to contend with the proliferation of online unproctored testing. For example, the Test Standards state: Professionals who oversee testing and assessment should be thoroughly versed in proper test administration procedures. They are responsible for ensuring that all persons who administer and score tests have received the appropriate education and training needed to perform their assigned tasks. Test administrators should administer tests in the manner that the test manuals indicate and should adhere to ethical and professional standards. . . . 
If tests are administered by computer or other technological devices or online, the professional is responsible for determining if the purpose of the assessment and the capabilities of the test taker require the presence of a proctor or support staff (e.g., to assist with the use of the computer equipment or software). (Test Standards, 2014, p. 153)

It is difficult to imagine that the unsupervised mass use of unproctored testing for pre-employment screening would meet any of these standards. Both the APA Ethics Code (2010) and the Test Standards (2014) identify the need for professional oversight of all such testing. Psychologists setting up a testing program in which it is likely that cheating will occur, such that the results of the testing are compromised (see, e.g., Standard 7.9 of the Test Standards, 2014), have an ethical and professional obligation to make known their objections to those responsible for making the selection decision (APA Ethics Code Standard 1.03). Protecting the content of the test as intellectual property (see APA Test Standard 1.09 Test Security; APA Ethics Code, 2010) is also a factor to be considered in such testing. In an era when cell phones can instantly capture the content of a test, such exposure potentially lessens the validity of the test not just for the individual test takers but also for its future use. Ethically, if a high-stakes test, despite the recommendations of the psychologist, is to be used in unproctored test-taking situations, consideration should be given to administering it in a way that maximizes the likelihood that people will not cheat and that minimizes erosion of the test’s integrity (e.g., by showing test stimuli for relatively short time periods and not allowing test takers to return to earlier items).
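The equivalence concern raised above can be illustrated with a small simulation (ours, not the Test Standards’): even when scores from two administration modes correlate near the upper end of the r ≈ .6–.8 range, applying the same cut score to each mode classifies an appreciable share of examinees differently. The correlation and cut point below are assumptions chosen for illustration.

```python
# Simulation of the mode-equivalence fairness point: with the same cut score,
# how often would paper-and-pencil and computer administrations of a test
# classify the same examinee differently?  r = .70 and a 50th-percentile cut
# are assumptions chosen for illustration only.
import numpy as np

rng = np.random.default_rng(42)
n, r, cut = 100_000, 0.70, 0.0          # standardized scores, cut at the mean

cov = [[1.0, r], [r, 1.0]]
paper, computer = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

pass_paper = paper >= cut
pass_computer = computer >= cut
disagree = np.mean(pass_paper != pass_computer)

print(f"Target r = {r}, observed r = {np.corrcoef(paper, computer)[0, 1]:.2f}")
print(f"Examinees classified differently by the two modes: {disagree:.1%}")
# With r = .70 and a median cut, roughly a quarter of examinees flip from
# pass to fail (or vice versa) depending solely on the administration mode.
```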

The Use of “Big Data” The recent growing use of so-called Big Data (BD; see Chapter 43, this volume, for additional discussion of BD) in organizations has been characterized as a “management revolution” (McAfee & Brynjolfsson, 2012), allowing us to manage more precisely than ever before . . . make better predictions and smarter decisions . . . target more-effective interventions, and . . . in areas that so far have been dominated by gut and intuition rather than by data and rigor. (p. 62)

It is not possible, in these pages, to evaluate the veracity of those claims; suffice it to say that in our opinion they represent perhaps as much wish fulfillment as fact, and they gloss over some limitations. To begin with, it should be noted that there is often disagreement and/or uncertainty concerning the definition of BD. Based on recent deliberations (Jin et al., 2015), there seems to be general agreement on the following attributes: (a) volume—i.e., huge amounts of data; (b) variety and complexity—e.g., intentionally collected, voluntarily offered information from known participants as well as passively collected, involuntarily obtained data from anonymous contributors, perhaps requiring linkage of multiple data sets, and multiple files across different systems; (c) velocity—exceedingly fast, even real-time, recording of the data; (d) data analytics—use of very sophisticated statistical techniques and graphical presentation methods to accommodate the enormous data sets; (e) accordingly, the preeminence in the area of data scientists more likely to be trained in computer science, information technology (IT), artificial intelligence (AI), economics, or marketing than in psychology; and (f) prediction—a focus largely, if not entirely, on extracting accurate algorithms or patterns from the vast data set(s). At this point in time, I-O psychologists ought to be wary of the following issues regarding BD (as pointed out by our colleagues): 1. Many BD studies involve searching for patterns or relationships in existing data that were collected prior to any definition of the problem or specification of hypotheses (or, in the case of employee selection, theoretically relevant predictors) (Such, Tippins, & Corbet, 2015). The data are not necessarily, therefore, the “best” or even particularly good for the purpose intended (Guzzo, et al., 2015). 2. Similarly, the overriding aim of BD studies is prediction accuracy, not theory, explanation, or causal understanding (Dekas, Wette, Rivera, & Dubey, 2015). Hence, the results may be conducive to
developing applications—e.g., predicting employee turnover, but not have the information to reduce it (Such, Tippins, & Corbet, 2015). 3. A number of issues relate to the training, experience, and values of the data scientists who may be in charge of a BD project. For example, they may not be sensitive to the relevant ethical issues in research with human participants (respect, privacy and confidentiality, informed consent, risk reduction, debriefing) or to the importance of considering employees’ reactions to HR projects (Guzzo et al., 2015; McCune et al., 2015; Meade, Sinar, Bokhari, & Villanes, 2015). They may not be equipped to understand and interpret properly the patterns uncovered in the data (King et al., 2015; Such, Tippins, & Corbet, 2015). As noted by McAfee and Brynjolfsson (2012), speaking generally—not in the context of I-O psychology or employee selection: “when it comes to knowing which problems to tackle . . . domain expertise remains critical” (p. 66). Finally, data scientists may not be (sufficiently) familiar with the importance of Title VII adverse impact issues and the ensuing requirement for demonstrating the manifest “job relatedness” of predictors used for employee selection (a simple numerical sketch of the screening rule commonly used to flag adverse impact follows this list).
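Because the last point concerns Title VII adverse impact, the following minimal sketch shows the four-fifths (80%) rule of thumb from the Uniform Guidelines (1978) that is commonly used as a first screen; the applicant and hire counts are hypothetical.

```python
# Four-fifths (80%) rule of thumb from the Uniform Guidelines (1978):
# a selection rate for any group that is less than 80% of the rate for the
# group with the highest rate is generally regarded as evidence of adverse
# impact.  The applicant/hire counts below are invented for illustration.

groups = {
    # group: (applicants, hired)
    "Group A": (200, 60),
    "Group B": (150, 30),
}

rates = {g: hired / applicants for g, (applicants, hired) in groups.items()}
highest = max(rates.values())

for group, rate in rates.items():
    ratio = rate / highest
    flag = "potential adverse impact" if ratio < 0.80 else "no flag under 4/5 rule"
    print(f"{group}: selection rate = {rate:.2f}, "
          f"impact ratio = {ratio:.2f} -> {flag}")
```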

BD might be thought of conveniently as just another among many substantial changes in the nature of work and employment that have occurred in recent years (Burke & Cooper, 2006). As noted earlier, conclusions drawn a decade ago concerning the ethical implications of those changes probably pertain to BD as well. The technological advances certainly have impacted the ways in which moral problems are manifested, but the moral nature of those problems, the personality attributes needed to recognize them as such, the ethical reasoning used to address them, as well as the ethical principles on which such reasoning is based all still pertain (Lefkowitz, 2006).

Some Issues Relating to Situational or Contextual Organizational Issues Selection in the Context of a Unionized Organization There seem to us to be at least four important matters to be considered: 1. The potential difficulties one might encounter in this regard in implementing a selection study are likely to be as much or more influenced by the history of union/management relations in the organization as by the attributes of the proposed program, so that one should become familiar with that history. 2. The parameters of the project may be set or limited by terms of the collective bargaining agreement, so one also needs to be knowledgeable about that. 3. Even if it were legal (by virtue of recognized management prerogative in the union contract) to implement a selection program for union members without obtaining prior union agreement, the prudent I-O psychologist (i.e., one who would like the project to succeed) would be well advised to proceed as if such approval were necessary—and not to proceed until some acceptable arrangement was achieved, preferably even actively involving the union in the project from its outset. 4. Because the topic of unions tends to be a volatile one on which people hold strong opinions, and because most I-O psychologists tend to view themselves as representatives of management, it is advisable to consider the extent to which one’s personal values in that regard might conflict with a more universalistic set of professional values, including respect for all relevant stakeholders, including union members.

Ethical Issues Regarding Setting Cut Scores6 The purpose of a cut (or passing) score is to segment examinees into two groups: one deemed “unacceptable,” hence rejected for employment or promotion, and another thought to be “acceptable,” hence hired/promoted or deemed eligible for further screening. Consequently, the primary ethical issue is inextricably bound up with the technical psychometric issues having to do with the accuracy of those classification decisions. The generally preferred method of setting a cut score on a predictor is to do so empirically by predicting a specified minimum criterion score, which first needs its own justification (Green, 1996). Such predictor cutoff scores are affected by issues of criterion relevance, extent of predictor validity, measurement error around the predictor score, and the error of estimate associated with that prediction. Those sources of variability are often
not considered. Alternatively, when criterion-related validation is not technically feasible, cut scores sometimes are determined nonempirically by one of several subjective rating or judgmental methods using subject matter experts’ (SMEs’) knowledge about the content domain of the examination (Mills & Melican, 1987). In the absence of any criterion information, the issue of classification errors (particularly “false rejects”) is exacerbated by virtue of there being no way to assess their extent, and generally no attempt is made to assess the accuracy of the classifications. The I-O psychologist may experience a dilemma when, perhaps for reasons of cost, criterion-related validation is not done although it is feasible. The resulting ignorance of classification (in)accuracy was potentially avoidable.
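As a concrete illustration of the empirical approach described above (and of the error of estimate that is often ignored), the following sketch inverts a simple predictor-criterion regression to locate the predictor score whose predicted criterion value equals a specified minimum; all of the numbers are hypothetical.

```python
# Sketch of an empirically anchored cut score: find the predictor score whose
# predicted criterion value equals a required minimum, via the simple linear
# regression of criterion (y) on predictor (x).  All values are hypothetical.
import math

r_xy = 0.45                   # criterion-related validity
x_mean, x_sd = 50.0, 10.0     # predictor norms
y_mean, y_sd = 3.0, 0.8       # criterion norms (e.g., rated job performance)
y_min = 2.8                   # minimum acceptable criterion level (needs its own justification)

slope = r_xy * y_sd / x_sd                 # b = r * (sy / sx)
intercept = y_mean - slope * x_mean
x_cut = (y_min - intercept) / slope        # predictor score predicting y_min

see = y_sd * math.sqrt(1 - r_xy ** 2)      # standard error of estimate around y-hat

print(f"Regression: y-hat = {intercept:.2f} + {slope:.3f} * x")
print(f"Predictor cut score for y_min = {y_min}: x_cut = {x_cut:.1f}")
print(f"Standard error of estimate = {see:.2f} criterion units")
# The sizeable SEE (relative to y_sd) is one of the sources of variability the
# chapter notes are often ignored when a single cut point is fixed.
```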

Ethical Issues Regarding Retesting To acknowledge that even the most valid selection system entails classification errors, particularly applicants who are rejected incorrectly, suggests (by virtue of the ethical principles of fairness and nonmaleficence) that unsuccessful candidates should be allowed the opportunity for reexamination, if it is feasible. However, it is known that people do tend to improve their performance on ability tests when retested—on average, by about one-quarter standard deviation for cognitive abilities (Hausknecht, Halpert, Di Paolo, & Moriarty Gerrard, 2007). Consequently, to allow those who request it to be retested raises an additional ethical issue with respect to unfairness to those unsuccessful candidates who have not requested retesting and, under some circumstances, even to those who passed initially (e.g., when all of those who pass are to be rank-ordered). Nevertheless, a reasonably satisfactory solution seems attainable. First, the practice effect is reduced if an alternate form of the examination is used for retesting. Second, although the effect can be enhanced if accompanied by coaching, that is not likely to be the case in employment testing; and third, it declines with the length of time before retesting. Therefore, if it is financially feasible to develop alternative forms and to provide the administrative resources for retesting, and if the opportunity to request retesting is known and available to all candidates, then retesting seems both practicable and fair. To reduce the possible practice effects, most organizations that adopt the policy generally specify a minimum waiting period.
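A back-of-the-envelope calculation illustrates why the quarter-standard-deviation practice effect matters for fairness; the location of the cut score below is an assumption chosen for illustration.

```python
# Back-of-the-envelope effect of the ~0.25 SD average retest gain reported for
# cognitive ability tests: how much does it raise the pass rate at a fixed cut?
# The cut score location (top ~16% pass) is an assumption for illustration.
from statistics import NormalDist

z_cut = 1.0                 # cut set one SD above the applicant mean
gain = 0.25                 # average practice effect, in SD units

norm = NormalDist()
p_first = 1 - norm.cdf(z_cut)           # pass probability on first attempt
p_retest = 1 - norm.cdf(z_cut - gain)   # same applicant pool, shifted by the gain

print(f"Pass rate at first testing : {p_first:.1%}")
print(f"Pass rate on retest        : {p_retest:.1%}")
# A roughly 7-point jump in pass rate for retested candidates is the fairness
# issue raised for those who do not (or cannot) request a retest.
```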

Organizational Pressures for the Misuse of Test Data Pressures on psychologists to use data in ways that are inconsistent with their ethical standards can be considerable, particularly for psychologists who are employed in industry and government. Such pressures arise from various sources, including the reality that most of those involved in leadership roles in such settings are not psychologists and may not understand the ethical constraints on psychologists or the complexities of standards associated with employee selection research. For example, the adequacy of sample sizes for establishing reliable validity coefficients may seem like an academic concern to an impatient manager who is eager to get on with implementation. Psychologists have an ethical obligation in such circumstances to “take reasonable steps to resolve the conflict consistent with the General Principles and Ethical Standards of the Ethics Code” (Standard 1.03, revised).

Serving as an Expert in Litigation This area is highly problematic and fraught with complex and difficult ethical issues. While I-O psychologists may address a number of topics as litigation experts, perhaps the single most common one involves selection issues. The breadth of considerations raised for I-O professionals who provide such services is too great and far-reaching to be addressed in this chapter. At a minimum, I-O psychologists should note the inherent role conflict of the enterprise, wherein one is putatively an objective professional expert but paid by and often emotionally involved with one side in an often highly contentious environment with a high-stakes outcome. One may also be put in the
position of “defending” one’s own work from what one perceives to be unfair criticism. At the very least, psychologists must be aware of the rules that govern the legal setting regarding expert testimony; they must clarify their independence from the party that retains them to testify; and they must be vigilant that their own behavior and that of others does not foist an ethical dilemma upon them. Given the complexities of the concerns and the level of the stakes, specific thought must be given to the ethical issues that may arise in advance of undertaking such assignments.

CONCLUSION It seems most appropriate, if not necessary, to conclude by focusing on application and solution, that is, what to do. On the one hand, general ethical principles and written sources such as APA’s Code of Conduct and the Society for Industrial and Organizational Psychology (SIOP)’s casebook (Lowman, 2006) are readily available but may not explicitly include one’s particular problem(s). On the other hand, specific potential ethical issues—even within a limited domain such as employee selection—are innumerable, not entirely predictable, and so cannot all be itemized a priori. The best we can hope to do, aside from noting some particularly common examples, as we have done, is to present a general scheme emphasizing prevention (cf. Pryzwansky & Wendt, 1999), that is, highlighting the importance of trying to anticipate and prevent problems before they arise. Largely on the basis of the work of Canter, Bennett, Jones, and Nagy (1994), as well as Pryor (1989), we offer the following six-step plan.

1.  Be Familiar with Applicable Ethical Codes and Professional Standards Ethical guidelines are available from the APA (2010), the Canadian Psychological Association (2000), the Academy of Management (2005), the Society for Human Resource Management (2014), the International Task Force on Assessment Center Guidelines (2015), and other relevant organizations. Gaining familiarity with them can help one to avoid blundering into ethical indiscretions because of sheer ignorance, which is important because “lack of awareness or misunderstanding of an ethical standard is not itself a defense to a charge of unethical conduct” (APA Code, 2010, p. 2). Indispensable sources of professional information include the Test Standards (2014), SIOP Principles (2003), and knowledge concerning how tests are often misused (Moreland, Eyde, Robertson, Primoff, & Most, 1995).

2.  Be Familiar with Relevant Federal, State, and Local Laws and Regulations These pertain to rules regarding conducting research with human participants (OHRP, 1991), one’s particular (U.S.) state laws regulating the licensing of psychologists, and federal and state laws governing employment practices such as the Civil Rights Acts of 1964 and 1991, the Americans with Disabilities Act of 1990, the Age Discrimination in Employment Act of 1967, the Equal Pay Act of 1963, and the Uniform Guidelines on Employee Selection Procedures (Equal Employment Opportunity Commission, Civil Service Commission, U.S. Department of Labor, and U.S. Department of Justice, 1978). (Refer to Chapters 27–30, this volume, for in-depth treatments of these professional and legal standards.)

3.  Know the Rules and Regulations of the Organization in Which You Work and/or Those of Your Client Having that knowledge serves at least two purposes. First, it helps assure competent and appropriate professional practice by incorporating and meeting organizational expectations regarding procedures and outcomes. The second pertains to the possible conflict between organizational practices or objectives and our professional ethical and/or legal standards (e.g., some I-O psychologists have been directed to use confidential research or test data for purposes not originally intended or consented to; some have been told that the organization will not provide test feedback to employees who were candidates for promotion). As quoted earlier, but worth the reminder, the more “I-O friendly” revision of the APA ethical principles includes enforceable Standard 1.03, which requires I-O psychologists to clarify the nature of the conflict, make known their commitment to the Ethics Code, and take reasonable steps to resolve the conflict consistent with the General Principles and Ethical Standards of the Ethics Code. Under no circumstances may this standard be used to justify or defend violating human rights. (APA Code, 2010 revision)

4.  Participate Regularly in Continuing Education in Ethics and in Professional/Technical Issues Affecting Competence This admonition is obviously not entirely necessary for you, the reader. Attending courses, workshops, and professional conference presentations and seminars; subscribing to journals; and reading books that focus on ethical, professional, and technical matters are some of the means of keeping abreast of new technical developments and honing one’s ethical sensitivities and decision-making skills. Conferring with colleagues is often indispensable—most especially when one is in the throes of an uncomfortable ethical dilemma or sees the potential for one developing. In addition to our national association, SIOP, the regularly published newsletters of several local organizations of applied psychologists have proven to be consistently reliable sources of information.7

5.  Maintain a Mindset of Ethical Watchfulness and Identify Potential Ethical Problems To a considerable degree, the purpose of this entire chapter is to promote one’s ability to do just this. If we have been at all successful, it will have been by increasing the salience and the reader’s knowledge of ethical principles; the way in which those moral issues are enmeshed with matters of personal values, professional judgment, and technical competence; the typical forms or structures of ethical dilemmas; the role to be played by formal ethical guidelines; and some particular and somewhat predictable ethical problems associated with particular selection practices. Hopefully, this will help to avoid ethically ambiguous situations or to clarify them early on. We believe that such moral sensitivities (Rest, 1994) are learned attributes and can be enhanced with practice. All in all, we hope to have contributed to I-O psychologists’ “staying ethically fit” (Jeanneret, 1998), which leads to the last item.

6.  Learn Some Method(s) for Analyzing Ethical Situations and Making Ethical Decisions in Complex Social Situations Space does not permit delving into this process in this chapter. Fortunately, however, others have done so. Several ethical decision-making models and procedures have been reviewed by Wittmer (2001) and by Pryzwansky and Wendt (1999). We have (unsurprisingly) found helpful one decision-making model that was synthesized with I-O psychology in mind (Lefkowitz, 2003, in press), even though such models have been criticized with some justification as being simplistic (Ladenson, in Gellermann, Frankel, & Ladenson, 1990, p. 90); that is, as not matching the complexities of many ethical dilemmas. However, their value may lie in the psychologist becoming accustomed to the general process of ethical reasoning they promote, rather than in adhering to specific decision-making steps.

NOTES
1. A recent meta-analysis of the between-organizations effects of “high-performance work practices” (HPWPs) on organizational-level performance uncovered just 15 studies that investigated the impact of selection (Combs, Liu, Hall, & Ketchen, 2006). They yielded a mean validity coefficient of only .11 (.14 corrected for measurement error). Moreover, most such studies have used “postdictive” designs in which a claim that the HPWP has had a causal influence on the organizational outcomes is not warranted (Wright, Gardner, Moynihan, & Allen, 2005). Perhaps more importantly, we are not aware of any within-organization studies documenting the effect of selection systems on overall financial performance of the firm.
2. The regulations pertain to all research with human participants whether supported by government funds or not. And research is defined as “a systematic investigation, including testing and evaluation, designed to develop or contribute to generalizable knowledge” (§46.102[d]). Special note should be taken that “contribut[ing] to generalizable knowledge” is often operationalized as seeking to publish or otherwise make public the findings of the study (such as at a professional conference). The regulations are available at http://www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.html.
3. Although this might be construed as a contextual factor or antecedent of any of the other four types, we emphasize its importance by listing it separately.
4. The authors are grateful for input on this section from Robert Hogan, Joel Moses, George C. Thornton III, John G. Veres III, and Michael J. Zickar.
5. See Chapter 39, this volume, concerning technology and employee selection for a more comprehensive presentation.
6. See Chapter 17, this volume, on the use of test scores for a more thorough treatment of this issue.
7. Information may be obtained from the following websites: http://www.siop.org, http://www.metroapppsych.com, http://www.ptcmw.org, and http://www.ptc-sc.org.

REFERENCES

Academy of Management. (2005). Ethics code. Briarcliff Manor, NY: Author. Retrieved from: http://aom.org/About-AOM/Code-of-Ethics.aspx
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Psychological Association. (1981). Specialty guidelines for the delivery of services by I/O psychologists. American Psychologist, 36, 664–669.
American Psychological Association. (2010). Ethical principles of psychologists and code of conduct (with the 2010 amendments). Washington, DC: Author. Retrieved from http://www.apa.org/ethics/code/principles.pdf
Baritz, L. (1960). The servants of power: A history of social science in American industry. Westport, CT: Greenwood.
Burke, R. J., & Cooper, C. L. (Eds.) (2006). The new world of work and organizations. Human Resource Management Review, 16(2), 83–280.
Caldwell, C., Thornton, G. C., III, & Gruys, M. L. (2003). Ten classic assessment center errors: Challenges to selection validity. Public Personnel Management, 32, 73–88.
Canadian Psychological Association. (2000). Canadian code of ethics for psychologists (3rd ed.). Ottawa, ON: Author.
Canter, M. B., Bennett, B. E., Jones, S. E., & Nagy, T. F. (1994). Ethics for psychologists: A commentary on the APA ethics code. Washington, DC: American Psychological Association.
Combs, J., Liu, Y., Hall, A., & Ketchen, D. (2006). How much do high performance work practices matter? A meta-analysis of their effects on organizational performance. Personnel Psychology, 59(3), 501–528.
Cooper, T. L. (Ed.) (2001). Handbook of administrative ethics. New York, NY: Marcel Dekker.
Cugueró-Escofet, N., & Fortin, M. (2014). One justice or two? A model of reconciliation of normative justice theories and empirical research on organizational justice. Journal of Business Ethics, 124, 435–451.
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393.
Dekas, K., Welle, B., Rivera, M. T., & Dubey, A. (April 23–25, 2015). 101 things about big data you're afraid to ask. Panel presented at the 30th Annual Conference of the Society for Industrial-Organizational Psychology, Philadelphia, PA.
Edelstein, L. (1967). The Hippocratic oath: Text, translation and interpretation. In O. Temkin & C. L. Temkin (Eds.), Ludwig Edelstein. Ancient medicine: Selected papers of Ludwig Edelstein (pp. 3–64). Baltimore, MD: Johns Hopkins University Press.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43(166), 38290–38315.
Fletcher, C. (1992). Ethical issues in the selection interview. Journal of Business Ethics, 11, 361–367.
Freeman, R. E. (1984). Strategic management: A stakeholder approach. Boston, MA: Pitman.
Freeman, R. E., & Phillips, R. A. (2002). Stakeholder theory: A libertarian defense. Business Ethics Quarterly, 12, 331–349.
Gellermann, W., Frankel, M. S., & Ladenson, R. F. (1990). Values and ethics in organization and human systems development: Responding to dilemmas in professional life. San Francisco, CA: Jossey-Bass.
Green, B. F. (1996). Setting performance standards: Content, goals, and individual differences. The second annual William H. Angoff Memorial Lecture. Princeton, NJ: Educational Testing Service.
Guenster, N., Derwall, J., Bauer, R., & Koedijk, K. (July 2005). The economic value of corporate eco-efficiency. Paper presented at the Academy of Management Conference, Honolulu, HI.
Guzzo, R. A., Fink, A., King, E., Tonidandel, S., & Landis, R. (2015). Big data recommendations for industrial-organizational psychology. Industrial & Organizational Psychology: Perspectives on Science and Practice, 8(4), 491–508.
Haber, S. (1991). The quest for authority and honor in the American professions, 1750–1900. Chicago, IL: University of Chicago Press.
Hall, R. T. (1975). Occupations and the social structure (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Hausknecht, J. R., Halpert, J. A., DiPaolo, N. T., & Moriarty Gerrard, M. O. (2007). Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology, 92, 373–385.
Hoffman, D. H., Carter, D. J., Viglucci Lopez, C. R., Benzmiller, H. L., Guo, A. X., Latifi, S. Y., & Craig, D. C. (2015). Report to the special committee of the board of directors of the American Psychological Association: Independent review relating to APA ethics guidelines, national security interrogations, and torture. Chicago, IL: Sidley Austin, LLP. Retrieved from: http://www.apa.org/independent-review/revised-report.pdf
Hoffman, M. L. (1988). Moral development. In M. H. Bornstein & M. E. Lamb (Eds.), Developmental psychology: An advanced textbook (pp. 497–548). Hillsdale, NJ: Lawrence Erlbaum Associates.
International Task Force on Assessment Center Guidelines. (2015). Guidelines and ethical considerations for assessment center operations. Journal of Management, 41(4), 1244–1273. doi: 10.1177/0149206314567780
Jeanneret, P. R. (1998). Ethical, legal, and professional issues for individual assessment. In R. Jeanneret & R. Silzer (Eds.), Individual psychological assessment: Predicting behavior in organizational settings (pp. 88–131). San Francisco, CA: Jossey-Bass.
Jin, J., Kalinoski, Z. T., Cullen, J. C., Harrell, M. M., Lee, W. C., Lisk, T. C., & Napper, C. (April 23–25, 2015). Thrive in big data: Change in I-O's mindset and toolset. Panel presented at the 30th Annual Conference of the Society for Industrial-Organizational Psychology, Philadelphia, PA.
Joint Task Force for the Development of Telepsychology Guidelines for Psychologists. (2013). Guidelines for the practice of telepsychology. American Psychologist, 68, 791–800.
Katzell, R. A., & Austin, J. T. (1992). From then to now: The development of industrial-organizational psychology in the United States. Journal of Applied Psychology, 77, 803–835.
Kimball, B. A. (1992). The "true professional ideal" in America. Cambridge, MA: Blackwell.
King, E. B., Tonidandel, S., Sinar, E. F., Stanton, J. M., & Oswald, F. (April 23–25, 2015). Understanding big data: Emerging approaches to data interpretation. Panel presented at the 30th Annual Conference of the Society for Industrial-Organizational Psychology, Philadelphia, PA.
Koocher, G. P. (2007). APA presidential address: Twenty-first century ethical challenges for psychology. American Psychologist, 62, 375–384.
Lefkowitz, J. (1990). The scientist-practitioner model is not enough. The Industrial-Organizational Psychologist, 28(1), 47–52.
Lefkowitz, J. (2003). Ethics and values in industrial-organizational psychology. Mahwah, NJ: Lawrence Erlbaum.
Lefkowitz, J. (2004). Contemporary cases of corporate corruption: Any relevance for I-O psychology? The Industrial-Organizational Psychologist, 42(2), 21–29.
Lefkowitz, J. (2005). The values of industrial-organizational psychology: Who are we? The Industrial-Organizational Psychologist, 43(2), 13–20.
Lefkowitz, J. (2006). The constancy of ethics amidst the changing world of work. Human Resource Management Review, 16(2), 245–268. doi: 10.1016/j.hrmr.2006.03.007
Lefkowitz, J. (2007a). Ethics in industrial-organizational psychology research. In S. Rogelberg (Ed.), The encyclopedia of industrial and organizational psychology (Vol. 1, pp. 218–222). Thousand Oaks, CA: Sage.
Lefkowitz, J. (2007b). Ethics in industrial-organizational psychology practice. In S. Rogelberg (Ed.), The encyclopedia of industrial and organizational psychology (Vol. 1, pp. 215–218). Thousand Oaks, CA: Sage.
Lefkowitz, J. (2007c). Corporate social responsibility. In S. Rogelberg (Ed.), The encyclopedia of industrial and organizational psychology (Vol. 1, pp. 114–118). Thousand Oaks, CA: Sage.
Lefkowitz, J. (2008). In order to prosper the field of organizational psychology should . . . expand its values to match the quality of its ethics (Special issue). Journal of Organizational Behavior, 29, 439–453.
Lefkowitz, J. (2009). Promoting employee justice: It's even worse than that. Industrial and Organizational Psychology, 2, 221–225.
Lefkowitz, J. (2012). Ethics in industrial-organizational psychology. In S. Knapp, M. L. VandeCreek, M. Gottlieb, & M. Handelsman (Eds.), APA handbook of ethics in psychology (Vol. 2, pp. 149–167). Washington, DC: American Psychological Association.
Lefkowitz, J. (in press). Ethics and values in industrial-organizational psychology (2nd ed.). New York: Routledge.
Lowman, R. L. (1991). The clinical practice of career assessment: Interests, abilities, and personality. Washington, DC: American Psychological Association.
Lowman, R. L. (Ed.) (2006). The ethical practice of psychology in organizations (2nd ed.). Washington, DC: American Psychological Association/Society for Industrial-Organizational Psychology.
Lowman, R. L. (Ed.) (2013a). Internationalizing multiculturalism: Expanding professional competencies in a globalized world. Washington, DC: American Psychological Association.
Lowman, R. L. (2013b, April 3). Ethical implications of future testing techniques and personnel selection paradigms. Presentation at the Board on Behavioral, Cognitive, and Sensory Sciences, Division of Behavioral and Social Sciences and Education, National Research Council Workshop, New Directions in Assessing Individuals and Groups: Measuring human capabilities: Performance potential of individuals and collectives. Washington, DC: National Research Council.
Lowman, R. L., Kantor, J., & Perloff, R. (2006). History of I-O psychology educational programs in the United States. In L. L. Koppes (Ed.), Historical perspectives in industrial and organizational psychology (pp. 111–137). Mahwah, NJ: Lawrence Erlbaum.
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review, 90(10), 60–68.
McCune, E. A., Dekas, K., Anderson, A., Kasimatis Singleton, M., MacNiven, S., & Sinnett, S. A. (April 23–25, 2015). Guidelines for ethical research in the age of big data. Panel presented at the 30th Annual Conference of the Society for Industrial-Organizational Psychology, Philadelphia, PA.
Meade, A. W., Sinar, E. F., Bokhari, E., & Villanes, A. (April 23–25, 2015). Theme track: Big data advances from computer science and statistics. Panel presented at the 30th Annual Conference of the Society for Industrial-Organizational Psychology, Philadelphia, PA.
Mills, C. N., & Melican, G. J. (1987). A preliminary investigation of three compromise methods for establishing cutoff scores (Report No. RR-87-14). Princeton, NJ: Educational Testing Service.
Moreland, K. L., Eyde, L. D., Robertson, G. J., Primoff, E. S., & Most, R. B. (1995). Assessment of test user qualifications: A research-based measurement procedure. American Psychologist, 50, 14–23.
Office for Human Research Protections, Department of Health and Human Services. (June 18, 1991). Protection of human subjects. Code of Federal Regulations, Title 45, Public Welfare.
Orlitzky, M., Schmidt, F. L., & Rynes, S. L. (2003). Corporate social and financial performance: A meta-analysis. Organization Studies, 24, 403–441.
Pope, K. S., & Vetter, V. A. (1992). Ethical dilemmas encountered by members of the American Psychological Association: A national survey. American Psychologist, 47, 397–411.
Potosky, D., & Bobko, P. (2004). Selection testing via the Internet: Practical considerations and exploratory empirical findings. Personnel Psychology, 57, 1003–1034.
Prien, E. P., Schippmann, J. S., & Prien, K. O. (2003). Individual assessment as practiced in industry and consulting. Mahwah, NJ: Lawrence Erlbaum.
Pryor, R. G. L. (1989). Conflicting responsibilities: A case study of an ethical dilemma for psychologists working in organisations. Australian Psychologist, 24, 293–305.
Pryzwansky, W. B., & Wendt, R. N. (1999). Professional and ethical issues in psychology: Foundations of practice. New York, NY: W. W. Norton & Co.
Rachels, J., & Rachels, S. (2015). The elements of moral philosophy (8th ed.). New York, NY: McGraw-Hill.
Rest, J. R. (1994). Background: Theory and research. In J. R. Rest & D. Narvaez (Eds.), Moral development in the professions (pp. 1–26). Hillsdale, NJ: Lawrence Erlbaum.
Rosenthal, R. (1994). Science and ethics in conducting, analyzing and reporting psychological research. Psychological Science, 5, 127–134.
Samuelson, P. A. (1993). Altruism as a problem involving group versus individual selection in economics and biology. American Economic Review, 83, 143–148.
Singer, P. (2005). Practical ethics (2nd ed.). New York, NY: Cambridge University Press.
Singer, P. (2011). Practical ethics (3rd ed.). New York, NY: Cambridge University Press.
Society for Human Resource Management. (2014). SHRM code of ethical and professional standards in human resource management. Alexandria, VA: Author. Retrieved from: http://www.shrm.org/about/pages/code-of-ethics.aspx
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author.
Such, M. J., Tippins, N. T., & Corbet, C. E. (April 23–25, 2015). Using big data for employment decisions. Panel presented at the 30th Annual Conference of the Society for Industrial-Organizational Psychology, Philadelphia, PA.
Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., & Shepherd, W. (2006). Unproctored Internet testing in employment settings. Personnel Psychology, 59, 189–225.
Wittmer, D. P. (2001). Ethical decision-making. In T. L. Cooper (Ed.), Handbook of administrative ethics (2nd ed., pp. 481–507). New York, NY: Marcel Dekker.
Wolff, J. (2005). Economic justice. In H. LaFollette (Ed.), The Oxford handbook of practical ethics (pp. 433–458). New York, NY: Oxford University Press.
Wright, P. M., Gardner, T. M., Moynihan, L. M., & Allen, M. R. (2005). The relationship between HR practices and firm performance: Examining causal order. Personnel Psychology, 58, 409–446.
Zickar, M. J. (2001). Using personality inventories to identify thugs and agitators: Applied psychology's contribution to the war against labor. Journal of Vocational Behavior, 59, 149–164.

27
PROFESSIONAL GUIDELINES/STANDARDS
P. RICHARD JEANNERET AND SHELDON ZEDECK

INTRODUCTION1

Three primary sources of authoritative information and guidance that can be relied upon in the development, validation, and implementation of an employment selection procedure are the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014; Standards), the Principles for the Validation and Use of Personnel Selection Procedures (Society for Industrial and Organizational Psychology, 2003; Principles), and the Uniform Guidelines on Employee Selection Procedures (Equal Employment Opportunity Commission, Civil Service Commission, Department of Justice, & Department of Labor, 1978; Uniform Guidelines). The term selection procedure in this instance should be interpreted broadly to include any process or information used in personnel decision making. Selection procedures would include (but not be limited to) all forms and types of tests (e.g., cognitive, personality, work samples, and assessment centers), interviews, job performance appraisals, and measures of potential. These procedures may be administered, scored, and interpreted as paper-and-pencil or computer-based instruments and/or by individuals internal or external to the organization. This broad view is consistent with the interpretations expressed by the authoritative sources. The term "test" is often used in one of the sources. For the purposes of this chapter, a test is synonymous with a selection procedure.

A number of other guidelines, standards, and legal requirements exist both in the United States and in other countries around the world. Relevant standards and guidelines include (but are not limited to) the following:
• U.S. Department of Labor guide regarding testing and assessment (International Standards Organisation, 2011)
• International Standards Organisation standards for assessment delivery (ISO-10667-2, 2011)
• New guidelines for assessment center operations (International Taskforce on Assessment Center Operations, 2015)
• European Federation of Psychologists' Associations model for description and evaluation of tests (EFPA, 2013)
• Guidelines for test use and adaptation from the International Test Commission (2001, 2005)

Additionally, many countries have statutes, rules, and regulations governing employment practices that may explicitly include testing or incorporate assessment procedures under broader requirements governing all employment practices. Because of the length of their histories and breadth of applicability, this chapter will focus on the primary sources noted in the introduction. However, interested readers, especially those practicing in international settings, would be well advised to review the additional resources that may apply in their specific situations.


Purpose and Chapter Flow

The central focus of this chapter is to describe the history and substance of each of the three primary sources, compare and contrast their technical content, and provide some guidance as to how they might be particularly useful to those directly associated with employment selection procedures. Each of the three sources will be discussed separately, in chronological order defined by the date of initial publication. The discussion of each will begin with its purpose and a brief history, followed by a description of the content relevant to employment decision making. After each document has been described, the three sources will be compared, inconsistencies will be noted, and suggestions will be offered for how those inconsistencies might be resolved. Finally, suggestions are made as to what additions or changes would be appropriate given the current state of relevant research.

Application to Employment Selection Only

The Standards in particular, and the Principles to a lesser extent, have potential relevance to settings outside of employment selection. Such venues include forensic, academic, counseling, program evaluation, and publishing settings that involve psychological instruments and measurements. This chapter does not address these applications; the focus is strictly on organizational settings and employment-related selection decisions.

Importance of the Authorities

For the most part, the authorities are retrospective rather than prospective. By necessity they must rely on the state of knowledge in the fields of measurement and applied psychology. The reality, of course, is that knowledge changes as research in the field develops more information about the strategies and psychometrics of employment selection procedures. Therefore, the authoritative sources become outdated and either include guidance that is no longer relevant or omit guidance that is important in current times. Nevertheless, there are several reasons why the three authoritative sources are valuable resources that can be relied upon by individuals associated with employment selection:

1. The study of employment-related psychometrics has been taking place for about 100 years. Accordingly, there is a body of knowledge that is stable, well researched, and directly relevant to understanding the measurement properties of employment-based selection procedures. Much of this knowledge, with varying degrees of specificity, is embedded in all three authorities with little, if any, contradiction. Consequently, the authoritative sources are able to provide accurate information about the state of the science, at least at the time they were written, which can support the proper development and use of an employment selection procedure.
2. The three documents describe and discuss several specific concepts and terms associated with the psychometric qualities of a selection procedure. Although not intended as teaching documents per se, they do frequently summarize bodies of research that are otherwise buried in textbooks and research journal articles.
3. The current editions of the Standards and the Principles have undergone extensive professional peer review. Although the initial preparations of the documents were accomplished by committees of experts in the field (the Standards jointly by three psychological, educational, and measurement organizations and the Principles by a committee of Society for Industrial and Organizational Psychology (SIOP) members), both documents were open for comment by the membership of the American Psychological Association (APA) and, especially in the case of the Principles, the document was subject to review by the entire membership of SIOP, a division of APA. The Standards and the Principles were adopted as policy by APA and hence have formal professional status. Accordingly, there were much greater levels of scrutiny and approval of the scientific content of the Standards and Principles than typically occurs for a textbook or journal article.
4. The Uniform Guidelines was authored by the Equal Employment Opportunity Commission (EEOC), the Civil Service Commission (CSC), the Department of Labor (DoL), and the Department of Justice (DoJ). The preparation of the Uniform Guidelines also relied upon input from individuals with expertise in psychological measurement, but others (e.g., attorneys) were influential in creating the document as well. Given this complement of authors, it is understandable that there was less psychometric content and greater emphasis on the documentation of validity evidence that would be satisfactory in a judicial proceeding. Interestingly, when the Uniform Guidelines was under development and when the U.S. House of Representatives was holding hearings on revisions to the Uniform Guidelines, the APA (Division 14) submitted information that was, for the most part, not incorporated into the final document. Subsequently, in 1985, an APA representative gave congressional testimony that psychologists disagreed with four technical issues as these topics were addressed in the Uniform Guidelines: (a) validity generalization, (b) utility analysis, (c) differential prediction, and (d) validity requirements and their documentation. Similarly, SIOP believed the Uniform Guidelines was incorrect with respect to requiring fairness studies, the definition of construct validity, and how validity generalization and utility analyses were considered (Camara, 1996). Nevertheless, the EEOC and the Office of Federal Contract Compliance Programs (OFCCP) currently rely on the Uniform Guidelines to determine whether or not a selection procedure is discriminatory.
5. For those who are involved in the judicial process (particularly judges and lawyers), the authoritative sources are reference sources in addition to case law and other judicial writings. The three sources have been relied upon by experts in the fields of personnel, industrial, organizational, and measurement psychology when formulating opinions about selection procedures. In such instances, the authoritative sources have become benchmarks that help define sound professional practice in the employment setting. Unfortunately, the apparent use of the three sources is rather limited, as indicated by the judicial interviews in Chapter 15 of Landy (2005).

STANDARDS FOR EDUCATIONAL AND PSYCHOLOGICAL TESTING

Brief History The Standards has a history dating back more than 60 years. The first edition was titled Technical Recommendations for Psychological Tests and Diagnostic Techniques and was authored by a committee of APA members and published in 1954. A similar publication was prepared by a committee comprising members from the American Educational Research Association (AERA) and the National Council on Measurement Used in Education (NCMUE). The document was titled Technical Recommendations for Achievement Tests and was published in 1955 by the National Education Association. In 1966 the two separate documents were revised and combined into a single document, the Standards for Educational and Psychological Tests and Manuals, authored by a committee representing the APA, AERA, and the National Council on Measurement in Education (NCME). These three organizations have continued to jointly publish revisions. In a revision completed by a subsequent joint committee in 1974, the document title was changed to Standards for Educational and Psychological Tests. The 1966 document delineated about 160 standards, and this number was increased to more than 225 standards in 1974. However, the number of standards declined to about 180 in 1985 after a revision and publication of the Standards for Educational and Psychological Testing (Standards). This title has remained with the subsequent 1999 revision. In 1991, the APA began an initiative to revise the 1985 Standards. In 1993, a joint AERA, APA, and NCME committee was formed, and after six years of effort the final document was published. It incorporates 264 standards and was adopted as APA policy. The Standards is intended to be prescriptive but does not have any associated enforcement mechanisms. More so than with past versions, the 1999 Standards devoted considerable attention to fairness; testing individuals with disabilities; scales, norms, and score comparability; reliability; and the responsibilities of test users. After six years of revision and review, the latest version of the Standards (2014) presents an up-to-date wealth of psychometric information and places expanded emphasis on three topics: fairness, new and emerging technology, and holding users (and especially those associated with high-stakes testing) accountable for proper test use. The 2014 edition contains 45 pages of new material that was not included in the 1999 edition of the Standards.


Application

The 2014 Standards is applicable to the entire domain of educational and psychological measurement. Because the Standards provides a comprehensive wealth of information on psychological measurement, it is not possible to adequately discuss all of the content in this chapter. Accordingly, this review will focus on those components of the Standards that are most applicable to psychometric issues in employment selection.

Purpose of the Standards “The purpose of the Standards is to provide criteria for the development and evaluation of tests and testing practices and to provide guidelines for assessing the validity of interpretations of test scores for the intended test uses” (p. 1). It is further emphasized that the evaluation of a test or its application should rely heavily on professional judgment and that the Standards provides a set of references or benchmarks to support the evaluation process. Finally, the Standards is not intended to respond to public policy questions that are raised about testing; however, the psychometric information embedded in the Standards may be very useful to informing those involved in debates and decisions regarding testing from a public policy perspective. This relevance exists because the initial version of the Standards (1954) preceded and was, in part, foundational to the Uniform Guidelines and the Principles.

Validity Defined

A key term that will appear throughout this chapter is "validity" or one of its derivatives (e.g., validation process). The Standards has established the most current thinking regarding validity and provides a definition that should receive broad acceptance by all professionals concerned with the psychometrics of selection procedures. According to the Standards (p. 11):

Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing and evaluating tests. The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed score interpretation. It is the interpretation of test scores required by the proposed uses that are evaluated, not the test itself. When test scores are used or interpreted in more than one way (e.g., both to describe a test taker's current level of the attribute being measured and to make a prediction about a future outcome), each intended interpretation must be validated.

Validity is a unitary concept and can be considered an argument based on scientific evidence that supports the intended interpretation of a selection procedure score (Binning & Barrett, 1989; Cronbach & Meehl, 1955; McDonald, 1999; Messick, 1980, 1989; Wainer & Braun, 1988). There are 25 specific standards regarding validity incorporated into the 2014 document. Generally, if a test does not have evidence for its validity for a particular purpose, it also will not have utility. Utility is an estimate of the gain in productivity or other practical value that might be achieved by use of a selection procedure. Several measures are used to estimate utility, including increases in job proficiency, reduced accidents, reduction in turnover, training success, etc. (Cascio & Boudreau, 2011; Cronbach & Gleser, 1965; Hunter & Hunter, 1984; Naylor & Shine, 1965; Schmidt, Hunter, McKenzie, & Muldrow, 1979). Consequently, it would be an unusual situation for an organization to want to use a test that lacked validity and (therefore) utility. Furthermore, if a test lacks validity, it is possible that unintended consequences may result from its use. Thus, reliance on test scores that are not valid will not yield results intended by the selection process and may yield outcomes that are detrimental to the organization.
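One widely used way to quantify the practical gain described above is the linear utility model associated with the citations in this paragraph (e.g., Cronbach & Gleser, 1965; Schmidt, Hunter, McKenzie, & Muldrow, 1979). The Python sketch below is a minimal illustration under top-down selection; the function name and all numeric inputs are hypothetical examples of ours, not values drawn from the Standards.

```python
from statistics import NormalDist

def selection_utility(n_selected, tenure_years, validity, sd_y,
                      selection_ratio, cost_per_applicant=0.0):
    """Estimated dollar gain from using a valid predictor with top-down selection."""
    nd = NormalDist()
    z_cut = nd.inv_cdf(1.0 - selection_ratio)          # predictor cutoff (z units)
    mean_z_selected = nd.pdf(z_cut) / selection_ratio  # average predictor z of hires
    n_applicants = n_selected / selection_ratio
    gain = n_selected * tenure_years * validity * sd_y * mean_z_selected
    return gain - n_applicants * cost_per_applicant

# Illustrative numbers only: 20 hires, 3-year average tenure, validity .30,
# SD of performance in dollars of $12,000, selection ratio .25, $50 per applicant.
print(round(selection_utility(20, 3, 0.30, 12_000, 0.25, 50)))
```

Even rough inputs of this kind can show why a procedure with modest validity may still have considerable practical value when many candidates are assessed and performance differences are costly.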


Application to Selection Decision Making

Of the three authoritative sources, the Standards offers the greatest level of detail regarding psychometric properties and use of selection procedures. However, not all standards are necessarily equally important in a given situation, and no attempt is made to categorize some standards as "primary" and others as "secondary," as occurred in earlier versions. An entire chapter that incorporates 20 standards is focused on fairness; another chapter devotes 12 standards to the rights and responsibilities of test takers; and still another chapter is focused on individual psychological assessment (18 standards), in which tests are categorized into six groups: cognitive and neuropsychological tests; problem behavior measures; family and couples tests; social and adaptive behavior tests; personality measures; and vocational tests. The Principles and the Uniform Guidelines are at times less precise at this level of description, particularly with respect to testing and assessment in the employment domain.

Cautions Offered by the Standards

The Standards (p. 7) sets forth five cautions that are intended to prevent misinterpretations:

1. Evaluation of a selection procedure is not just a matter of checking off (or not) one standard after another to determine compliance. Rather, the evaluation process must consider (a) professional judgment, (b) satisfaction of the intent of a relevant standard, (c) alternate selection procedures that are readily available, (d) feasibility of complying with the standard given past experience and research knowledge, and (e) applicable laws and regulations. (Note: this edition is the first time such a basis for acceptability has been expressed in the Standards.)
2. The Standards offers guidance to the expert in a legal proceeding, but professional judgment determines the relevance of a standard to the situation.
3. Blanket statements about conformance with the Standards should not be made without supporting evidence; more generally, care should be exercised in any assertions about compliance with the Standards.
4. Research is ongoing and knowledge in the field will continue to change. Accordingly, the Standards will be revised over time, and the use of older Standards may be a disservice to test users and takers.
5. The Standards is not intended to mandate use of specific methodologies. The use of a "generally accepted equivalent" is always understood with regard to any method provided in the Standards.

Sources of Validity Evidence

There are multiple ways in which validity evidence might be assembled for a selection procedure, and no one method is necessarily superior to another. Rather, the validation strategy should be consistent with the nature and intended use of the selection procedure. The Standards (pp. 13–21) describes five validation strategies or sources of validity evidence:
• Content
• Response processes
• Internal structure
• Relations to other variables
• Consequences of testing

Comprehensive information on validation evidence and strategies may be found in Chapters 2 and 3 of this Handbook.

Convergent and Discriminant Validity

When selection procedure scores and other measures of the same or similar constructs are correlated, convergent validity evidence is demonstrated. When selection procedure scores are not correlated with other measures of purportedly different constructs, there is evidence of discriminant validity (Campbell & Fiske, 1959; McDonald, 1999). Although both types of evidence are valuable in evaluating tests, convergent validity has been the more frequently studied. For example, in the typical criterion-related validity study, the relationship between a cognitive selection procedure and a measure of job performance is purportedly concerned with the same or very similar constructs (i.e., convergent validity). However, if a selection procedure comprised a cognitive measure and a test of interpersonal skills, and there were two job performance indices (decision making and teamwork), a lack of relationship (or low relationship) between the cognitive measure and teamwork (or a low correlation between the interpersonal skills test and decision making) would provide discriminant evidence.
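To make the pattern in the preceding example concrete, the short Python sketch below (with invented variable names and hypothetical score vectors) simply computes the four predictor-criterion correlations; relatively high same-construct correlations and relatively low different-construct correlations would constitute convergent and discriminant evidence, respectively.

```python
import numpy as np

def convergent_discriminant_pattern(cognitive, interpersonal, decision, teamwork):
    """Return the four predictor-criterion correlations for the example in the text:
    a cognitive test, an interpersonal-skills test, and ratings of decision making
    and teamwork, one element per incumbent."""
    def r(a, b):
        return float(np.corrcoef(a, b)[0, 1])
    return {
        "cognitive~decision (convergent)": r(cognitive, decision),
        "interpersonal~teamwork (convergent)": r(interpersonal, teamwork),
        "cognitive~teamwork (discriminant)": r(cognitive, teamwork),
        "interpersonal~decision (discriminant)": r(interpersonal, decision),
    }
```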

Validity Generalization

An issue that arose early in research on selection measures was whether or not validity evidence obtained in one situation can be generalized to a new situation without further study of the validity of that procedure in the new setting. When criterion-related validity evidence has been accumulated for a selection procedure, meta-analysis has provided a useful statistical method for studying this generalization question. There are numerous methodological and statistical issues associated with meta-analytic studies, and these matters are too lengthy to be addressed here. The interested reader is referred to Cooper (2010) or Hunter and Schmidt (2004).
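As a minimal illustration of the meta-analytic logic, and assuming the simple "bare-bones" case in the spirit of Hunter and Schmidt (2004) with no corrections for unreliability or range restriction, the following sketch computes a sample-size-weighted mean validity and removes the expected sampling-error variance; the function name and the example coefficients are hypothetical.

```python
def bare_bones_meta(validities, sample_sizes):
    """Sample-size-weighted mean validity and the between-study variance left
    after removing expected sampling-error variance (a bare-bones analysis)."""
    n_total = sum(sample_sizes)
    r_bar = sum(n * r for r, n in zip(validities, sample_sizes)) / n_total
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(validities, sample_sizes)) / n_total
    var_err = sum(n * (1 - r_bar ** 2) ** 2 / (n - 1) for n in sample_sizes) / n_total
    return r_bar, max(var_obs - var_err, 0.0)

# Hypothetical local studies: observed validity coefficients and their sample sizes.
print(bare_bones_meta([0.20, 0.35, 0.28, 0.15], [60, 150, 90, 45]))
```

A small residual variance after removing sampling error is the kind of result typically offered in support of generalizing validity to a comparable new setting.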

Integrating Validity Evidence

A comprehensive and sound validity argument is made by assembling the available evidence indicating that interpretations of scores from a well-developed selection procedure can accurately predict the criterion of interest. Although the various sources of validity evidence discussed above are directly relevant, there are many other valuable information sources, including information obtained from prior research; reliability indices; information on scoring, scaling, norming, and equating data; standard setting (e.g., cut scores); and fairness information. All of these information sources, when available, contribute to the final validity argument and decision regarding the use of a selection procedure (Barrett, Phillips, & Alexander, 1981; Bemis, 1968).

Validity Standards

There are 25 specific standards presented in the validity chapter of the Standards. Although all 25 standards are important, certain themes are particularly relevant in the context of employment selection. A brief summary of these themes follows:
• The rationale and intended interpretation of selection procedure scores should be stated at the outset of a validity study. When new interpretations or intended uses are contemplated, they should be supported by new validity evidence.
• Descriptions of individuals participating in validation studies should be as detailed as is practical. If subject matter experts (SMEs) are used, their qualifications and the procedures they followed in developing validation evidence should be documented.
• When criterion-related validity studies are completed, information about the quality and relevance of the criterion should be reported.
• When several variables are used to predict a criterion, multiple regression should be used to evaluate increments in the predictive accuracy achieved by each variable. Results from the analyses of multiple variables should be verified by cross-validation whenever feasible.
• If statistical adjustments (e.g., the correction of correlations for restriction in range) are made, the unadjusted and adjusted correlations and the procedures followed in making the adjustments should be documented (one common correction is sketched after this list).
• If meta-analyses are relied upon as criterion-related validity evidence, the comparability between the meta-analytic variables (predictors and criteria) and the specific situation of interest should be determined to support the applicability of the meta-analytic findings to the local setting. All assumptions and clearly described procedures for conducting the meta-analytic study should be reported.
• If effect size indices are used to make inferences beyond the validation sample, indicators of the amount of uncertainty regarding those indices (e.g., confidence intervals, standard errors, or significance tests) should be reported.
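As referenced in the bullet on statistical adjustments above, one common adjustment is the correction for direct range restriction on the predictor (often labeled Thorndike's Case II). The sketch below is illustrative only; the function name and the example values are assumptions of ours rather than anything prescribed by the Standards.

```python
import math

def correct_direct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate the validity in the unrestricted applicant population from the
    validity observed in a restricted (selected) sample, given the predictor
    standard deviations in each group (Thorndike Case II)."""
    u = sd_unrestricted / sd_restricted
    return (r_restricted * u) / math.sqrt(1.0 + r_restricted ** 2 * (u ** 2 - 1.0))

# Illustrative values: observed r = .25 among incumbents, predictor SD of 10
# in the applicant pool versus 6 among those hired.
print(round(correct_direct_range_restriction(0.25, 10.0, 6.0), 3))
```

Consistent with the standard summarized above, both the unadjusted (.25) and adjusted values, and the procedure used, would be reported.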

Reliability and Measurement Errors

Part I, Chapter 2 of the Standards describes reliability and errors of measurement and sets forth 20 standards related to the topic. The chapter is concerned with understanding the degree to which a selection procedure score is free from error. To the extent that a score is unreliable, it is due to errors of measurement that are usually assumed to be unpredictable and random in occurrence. There are two sources of error: (1) within individuals subject to the selection procedure and (2) conditions external to the individuals, such as the testing environment or mistakes in scoring the selection procedure. Reliability is an index indicating the degree to which selection procedure scores are measured consistently across one or more sources of error such as time, test forms, or administrative settings. Reliability has an impact on validity in that to the extent the selection procedure is not reliable it will be more difficult to make accurate predictions from the selection procedure scores. Excellent treatments of reliability may be found in McDonald (1999), Nunnally and Bernstein (1994), Pedhazur and Schmelkin (1991), Putka and Sackett (2010), Traub (1994), and Chapter 1 of this Handbook.

The reliability chapter of the Standards develops many of the basic concepts embedded in psychometric theory. It is important to note that no single index of reliability measures all of the variables that influence the accuracy of measurement. The two major theoretical positions regarding the meaning of reliability are classical reliability theory and generalizability theory. What is important is that the method used to determine reliability be appropriate to the data and setting at hand and that all procedures be clearly reported. Furthermore, various reliability indices (e.g., test-retest, internal consistency) are not equivalent and should not be interpreted as being interchangeable; accordingly, one should not state that the "reliability of test X is . . .", but rather should state "the test-retest reliability of test X is . . .". Finally, the reliability of selection procedure scoring by examiners does not imply high candidate consistency in responding to one item versus the next item that is embedded in a selection procedure. In other words, just because the scoring of a test is reliable does not mean that the test itself is reliable.
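The link between reliability and validity noted above can be illustrated with the classical correction for attenuation. The following minimal Python sketch assumes the classical true-score model; the function names and the example reliabilities are ours, not values from the Standards.

```python
import math

def attenuate(r_true, rxx, ryy):
    """Expected observed predictor-criterion correlation when the predictor has
    reliability rxx and the criterion has reliability ryy."""
    return r_true * math.sqrt(rxx * ryy)

def disattenuate(r_observed, rxx, ryy):
    """Classical correction for attenuation: the true-score correlation implied
    by an observed correlation and the two reliabilities."""
    return r_observed / math.sqrt(rxx * ryy)

# Illustration: a true relationship of .50 measured with rxx = .80 and ryy = .60
# would be expected to appear as roughly .35 in observed scores.
print(round(attenuate(0.50, 0.80, 0.60), 2))
```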

Standards for Employment and Credentialing Tests

Chapter 11 of the Standards describes testing used for employment, licensure, and certification. In the employment setting, tests are most frequently used for selection, placement, and promotion. Sixteen standards are set forth in Chapter 11. They address the collection and interpretation of validity evidence, the use of selection procedure scores, and the importance of reliability information regarding selection procedure scores. The chapter's introduction emphasizes that the contents of many other chapters in the Standards also are relevant to employment testing. One point of emphasis in Chapter 11 is the influence of context on the use of a selection procedure. Ten contextual features are identified, which by their labels are self-explanatory (see Standards, pp. 170–171):
• Internal versus external candidate pool
• Trained versus untrained candidates
• Short-term versus long-term focus
• Screening in versus screening out
• Mechanical versus judgmental decision making (when interpreting test scores)
• Ongoing versus one-time use of a test
• Fixed applicant pool versus continuous flow
• Small versus large sample size
• Application to a new job
• Size of applicant pool relative to the number of job openings (selection ratio)

The Standards indicates that the validation process in employment settings is usually grounded in two sources of validity evidence: relations to other variables and content. One or both types of evidence can be used to evaluate how well a selection procedure (predictor) predicts or is directly linked to a relevant outcome (criterion). Furthermore, the Standards describes limited situations (e.g., small sample sizes, a new job without incumbents) in which validity evidence might be established on the basis of generalized evidence, including transporting validity on the basis of job analysis or statistically summarizing validation studies of similar jobs (e.g., meta-analysis). Importantly, the Standards asserts that there is no methodological preference or single correct method of establishing validity; rather, the selection situation and professional judgment should determine what source(s) of evidence are appropriate.

Evaluating Validity Evidence Perfect prediction does not occur, and the evaluation of validity evidence is often completed on a comparative basis (e.g., how an observed validity coefficient compares to coefficients reported in the literature for the same or similar constructs). Consideration may be given to available and valid alternative selection procedures, utility, concerns about applicant reactions, statutory or regulatory requirements, fairness, strategies to achieve workforce diversity, and organizational values. Any or all of these types of considerations could influence the final conclusions drawn about the validity evidence as well as the implementation of the selection procedure.

Professional and Occupational Credentialing In Chapter 11, the Standards also address the specific instance of credentialing or licensing procedures that are intended to confirm that individuals (e.g., medical doctors or nuclear power plant operators) possess relevant knowledge or skills to the degree that they can safely and/or effectively perform certain important occupational activities. Credentialing or licensing procedures are intended to be strict to provide the public as well as governmental and regulatory agencies with sound information regarding the capabilities of practitioners. The procedures are designed to have a gate-keeping role and often include written examinations as well as other specific qualifications (e.g., education or supervised experience). Content validity evidence is usually obtained to support the use of the credentialing procedures, because criterion information is generally not available. Establishing a passing score is a critical component of the validation process and is usually determined by SMEs, although empirical methods exist if the relevant data are available. Arbitrary passing scores, such as 70% correct, typically are not useful. They are unlikely to have any relevance to the underlying test psychometrics, and they may not define a level of credentialing procedure success equivalent to acceptable job performance. Thus, they provide no assurance of protection from harm to the public or of fairness to test takers. Finally, issues regarding fairness and accessibility are important and must be evaluated as to test scoring and accommodation, while also considering critical job functions and public interest.
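One familiar judgmental standard-setting procedure consistent with the SME-based approach just described is the Angoff method, in which SMEs judge the probability that a minimally qualified candidate would answer each item correctly. The Python sketch below is a minimal illustration with hypothetical ratings; it is not drawn from the Standards themselves.

```python
def angoff_cut_score(sme_ratings):
    """Angoff-style cut score. `sme_ratings[j][i]` is SME j's judged probability
    that a minimally qualified candidate answers item i correctly; the cut score
    is the sum across items of the mean judged probability."""
    n_items = len(sme_ratings[0])
    item_means = [sum(judge[i] for judge in sme_ratings) / len(sme_ratings)
                  for i in range(n_items)]
    return sum(item_means)

# Three hypothetical SMEs rating a five-item examination.
ratings = [
    [0.90, 0.70, 0.60, 0.80, 0.50],
    [0.85, 0.65, 0.55, 0.75, 0.60],
    [0.95, 0.75, 0.50, 0.70, 0.55],
]
print(round(angoff_cut_score(ratings), 2))  # expected score of a minimally qualified candidate
```

Unlike an arbitrary rule such as 70% correct, a judgment-based standard of this kind is at least anchored to the knowledge or skill level deemed necessary for acceptable performance.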


Review of the Standards in Chapter 11 The first four standards are general in nature and apply to both workplace testing and credentialing; the next eight standards apply to workplace testing; the last four standards apply to credentialing. A brief discussion of these standards follows: • The objective of the employment selection procedure should be set forth, and an indication of how well that objective has been met should be determined. • Decisions regarding the conduct of validation studies should take into consideration prior relevant research, technical feasibility, and the conditions that could influence prior and contemplated validation efforts. • When used, the fidelity of the criterion (which could be important work behaviors, work output, or job-relevant training) should be documented. • Inference about the content validity of a selection procedure for use in a new situation requires that critical job content factors be substantially the same (e.g., as determined by a job analysis), and that the reading level of the test material not exceed that appropriate for the new job. In addition, the original meaning of the test materials should not be substantially changed in the new situation. (Standards, p. 181)

• When multiple sources of information are available to decision makers regarding an employment process, the use of each informational component should be supported by validity evidence. Furthermore, the role played by each component as it is integrated into a final decision preferably should be explained. In credentialing situations, the rules and procedures followed when combining scores from multiple information sources should be made available to candidates. • Cut scores for credentialing tests should be determined on the basis of the skill or knowledge level necessary for acceptable job performance and not on the basis of the number or proportion of candidates passing.

Fairness

Fairness is addressed in Chapter 3 of the Standards, where it is described as a fundamental validity issue for all types of measurement, including that of workplace testing. While there is no single technical meaning for fairness, a fair test may be described as one that minimizes the construct-irrelevant variance associated with individual characteristics and testing contexts that otherwise would compromise the validity of scores for some test takers. Fairness must be addressed during both test development and use for individuals from specific subgroups. These subgroups are identified by various characteristics, including disabilities, race, ethnicity, gender, age, culture, language, and socioeconomic status. The Standards asserts that measurement bias is the central threat to fairness, but consideration is also placed on accessibility and universal test design. With these concerns in mind, there are four general aspects of fairness:
1. Equitable treatment of all test takers during the entire testing process
2. Lack of measurement bias
3. Full access to the construct being assessed (e.g., an individual with impaired vision might not be able to read a standard version of a personality test)
4. Validity of individual test score interpretations for their intended use

General threats to fair and valid interpretations of test scores include test content that produces construct-irrelevant variance, test context, test item responses, and opportunity to learn the content and skills measured by the test. Proper test design and adaptations help minimize these threats.


In employment testing, the issue of fairness is typically addressed by statistically examining test results for evidence of bias. It is not simply a matter of whether or not test score averages differ by subgroup but whether or not there are differences in test score predictions by subgroup. Under the most widely used model for analyzing test fairness (Bartlett, Bobko, & Mosier, 1978; Cleary, 1968), if the predictions are equivalent (i.e., no difference in the slopes or intercepts), then there is no bias. It should be noted that a number of concerns have been raised about fairness analyses using moderated regression models, especially with respect to the availability of adequate power in the analyses to detect bias should it actually exist (Aguinis & Stone-Romero, 1997). Another statistical perspective is that of differential item functioning (DIF). In this instance, if there is bias, candidates of equal ability differ in their responses to a specific item according to their group membership. Unfortunately, the underlying reason for DIF, when it has been observed, has not been apparent; one group often performs better than another on some items for no explainable reason associated with item content. Use of sensitivity review panels that comprise individuals representative of the subgroups of interest has been one mechanism intended to prevent item content being relevant for one group but not another. Members of such review panels are expected to flag items that will be potentially unfair to a subgroup. However, there is not much research evidence indicating that sensitivity review panels find a great deal to alter in test item content for well-constructed tests.
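A minimal sketch of the moderated regression test implied by the Cleary model discussed above appears below. It assumes a pandas DataFrame with hypothetical columns named 'criterion', 'test', and 'group', uses the statsmodels formula interface, and is illustrative rather than a complete fairness analysis; the power concerns noted above still apply.

```python
import pandas as pd
import statsmodels.formula.api as smf

def cleary_bias_check(df: pd.DataFrame):
    """Regress the criterion on the test score, subgroup, and their interaction.
    A significant test:group interaction suggests slope differences; a significant
    group effect without an interaction suggests intercept differences; neither
    being significant is consistent with equivalent predictions across groups."""
    model = smf.ols("criterion ~ test * C(group)", data=df).fit()
    return model

# Hypothetical use:
# results = cleary_bias_check(applicant_df)
# print(results.summary())
```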

Selection Procedure Development and Administration

Chapters 4–7 in the Standards are concerned with the development, implementation, and documentation of selection procedures. The discussions are quite technical in nature and will not be reviewed in this chapter. However, a couple of topics of particular relevance to employment selection will be mentioned:
• A cut score is used to partition candidates into two groups: one passing or successful and the other not passing or not successful. There is no single or best method for setting a cut score. Furthermore, because selection procedures are not perfect, there will always be errors: some candidates will pass who do not truly have adequate skills (false positives) and some will fail when in fact they do have adequate skills (false negatives). Changing a cut score to correct for one concern will usually increase the occurrence of the other. Thus, professional judgment always must play a significant role when setting a cut score (the trade-off is illustrated in the sketch after this list).
• Normative data should be described in terms of demographics, sampling procedures, descriptive statistics, and the precision of the norms.
• The psychometric characteristics of different forms of the same test should be documented, and the rationale for any claim of equivalency in using test scores from different test forms must be reported.
• If the test developer permits different conditions of administration from one test taker or group to another, then a rationale for permitting the different conditions and any requirements for permitting the different conditions should be documented.
• Standardization in the administration procedures is extremely important, and all instructions and procedures must be carefully followed.
• The use of computers and the Internet for test administration and scoring results in special cautions. Training may be required to reduce construct-irrelevant variance; explanations and practice may be needed to manage test-specific details such as the test's interface; and managing the testing environment to avoid light reflections on the computer screen that interfere with display legibility may be necessary.
• Technology and the Internet have made it possible to administer tests in which the administration conditions may not be strictly controlled or monitored. Those who allow lack of standardization are responsible for providing evidence that lack of standardization will not affect test-taker performance or the quality and comparability of scores produced.
• Selection procedures and results (including individual scores) should be treated as confidential and kept in a secure manner.
• Documentation for a selection procedure typically includes information about intended purpose; prior research evidence; the development process; technical information regarding validity, reliability, fairness, score interpretation, scaling, or norming, if relevant; administration; and appropriate uses of the results (e.g., pass/fail).
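The following sketch, with an invented function name and made-up candidate data, tallies the two error types discussed in the first bullet above; sweeping the cut score upward generally trades false positives for false negatives.

```python
def classification_errors(scores, would_succeed, cut_score):
    """Count selection errors at a given cut score, where would_succeed[i]
    indicates whether candidate i would in fact perform acceptably if selected."""
    fp = sum(1 for s, ok in zip(scores, would_succeed) if s >= cut_score and not ok)
    fn = sum(1 for s, ok in zip(scores, would_succeed) if s < cut_score and ok)
    return {"false_positives": fp, "false_negatives": fn}

# Hypothetical candidates: predictor scores and eventual success indicators.
scores = [52, 61, 70, 74, 80, 85, 90, 95]
success = [False, False, True, False, True, True, True, True]
for cut in (60, 75, 90):
    print(cut, classification_errors(scores, success, cut))
```

In this toy example, raising the cut score from 60 to 90 eliminates the false positives but triples the false negatives, which is the trade-off that professional judgment must weigh.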


Rights and Responsibilities Two chapters in the Standards (Chapters 8–9) discuss test user and test taker rights and responsibilities. The standards set forth in these chapters are concerned with policy and administrative issues. Generally, these matters become more relevant in specialized circumstances (e.g., an applicant with a verified disability who needs an accommodation for a selection procedure). Professional judgment is typically required because of the individualized nature of the conditions.

Summary The 2014 Standards reflects the state of the science and much of the most current professional knowledge available regarding psychological testing. As in the past, the Standards no doubt will be revised in the future. Nevertheless, the Standards is extremely informative about current professional thinking and scientific research regarding requirements associated with the development and application of a selection procedure in employment settings. The document has been published to promote the professionally sound and ethical use of selection procedures and to provide a set of standards that can be the basis for developing and implementing a new selection procedure, or for evaluating the quality of an existing selection procedure and practice.

PRINCIPLES FOR THE VALIDATION AND USE OF PERSONNEL SELECTION PROCEDURES

Brief History

The first edition of the Principles was published in 1975 in response to the growing concern about the need for professional standards for validation research. Furthermore, because early versions of what became the Uniform Guidelines were being prepared by various governmental organizations, Division 14 representatives wanted to set forth the perspective of industrial and organizational (I-O) psychology, particularly with regard to validation studies. The second edition was published five years later and, for the first and only time, cited specific references regarding equal employment opportunity and associated litigation. Because of continuing changes in employment case law, subsequent editions have not attempted to stay current with them. Furthermore, it has not been the purpose of the Principles to interpret these cases in terms of the science of I-O psychology.

In 1987 the third edition of the Principles was published by SIOP. This edition consisted of 36 pages of text and 64 citations to published research to support the various principles contained in the document. An appended glossary defined 76 terms used in the Principles. The fourth edition of the Principles was published by SIOP and adopted as policy by the APA in 2003. This edition consists of 45 pages of text and an appended glossary of 126 terms. There are 65 research literature citations that support the scientific findings and professional practices that underlie the principles for conducting validation research and using selection procedures in the employment setting. The increase in glossary terms reflects some of the more recent scientific findings and thinking related to such topics as generalized evidence of validity, work analysis, internal structure validity evidence, models of reliability and fairness, and test development and implementation.

Purpose of the Principles

The Principles establishes ideals and sets forth expectations for the validation process and the professional administration of selection procedures. The document also can inform those responsible for authorizing the implementation of a validation study and/or selection procedure. The Principles does not attempt to interpret federal, state, or local statutes, regulations, or case law related to matters of employment discrimination. However, the Principles is intended to inform decision making in employment administration and litigation and offers technical and professional guidance that can help others (e.g., human resource professionals, judges, and lawyers) understand and reach conclusions about the validation and use of employment selection processes.

Principles Versus the Standards The Principles was revised in 2003 with the full understanding that the document would be consistent with the then-extant Standards, especially with regard to the psychometric topics of validity, reliability, and bias. Both documents are grounded in research and express a consensus of professional opinion regarding knowledge and practice in personnel selection. However, there are also some important differences between the two documents. First, unlike the Standards, the Principles does not enumerate a list of specific principles in the same manner as the Standards sets forth 240 standards. Consequently, the Principles is more aspirational and facilitative in content, whereas the Standards is more directive in nature. That said, the Standards states that it is not a set of legal requirements nor a substitute for legal advice (p. 1). Second, the Standards is much broader than the Principles with respect to psychological measurement. For example, although many of the concepts expressed in the Principles could be relevant to the field of educational testing, the Standards directly addresses the topic. The same is true for such topics as testing in program evaluation and public policy. Third, the Standards is more concerned with the rights and responsibilities of test takers, whereas the Principles focuses more on the responsibilities of selection procedure developers and users. This focus reflects the fact that the Principles places most of the responsibility for proper selection processes on the employer rather than the candidate, whereas the Standards considers a much wider group of test takers to include students, patients, counselees, and applicants. Finally, the Principles provides more guidance on how to plan a validation effort and collect validity evidence within the context of an employment setting. Consequently, there is more discussion of such topics as (a) feasibility of a validation study; (b) strategies for collecting information about the work and work requirements, as well as about job applicants or incumbents and their capabilities; (c) analyzing data, including such topics as multiple-hurdles versus compensatory models, cutoff scores, rank orders, and banding; and (d) information to be included in an administrative guide for selection procedure users.

Application to Litigation The Principles offers relevant information and guidance regarding personnel selection procedures that might be the subject of litigation. Although the document is not written in absolute terms, it provides a wealth of information that defines best practices in the validation and implementation processes required to use selection procedures properly. When examining the qualities of a validation study or the implementation of a selection procedure, a decision maker in litigation proceedings might find that one or more expectations set forth in the Principles were not met and ask why. Absent sound and logical explanations, the unexplained issues could be strong indicators that the procedures being scrutinized were not established in accord with accepted professional expectations.

Analysis of Work Given that the Principles is focused on selection procedures in the employment setting, there is a particular emphasis on the analysis of work. Such an analysis establishes the foundation for collecting validity evidence. First, information from the analysis of work defines relevant worker requirements and determines the KSAOs needed by a worker to perform successfully in a work setting. Second, the work analysis defines the criterion measures that, when appropriate for the validation strategy being used, indicate when employees have successfully accomplished relevant work objectives and organizational goals. Historically, the analysis of work was labeled “job analysis,” and that term is still frequently used. The Principles expanded the term to “analysis of work” to give clear recognition to the realization that the concept of a traditional job is changing. Furthermore, the “analysis” should incorporate the collection of data about the workers, the organization, and the work environment, as well as the specific job or some future job if that is relevant to the study. As implied by the various permutations that might be considered, no one preferred method or universal approach is appropriate for completing an analysis of work. The Principles encourages the development of a strategy and a sampling plan to guide an analysis of work. Furthermore, the analysis should be conducted at a level of detail consistent with the intended use and availability of the work information. Any method used and outcomes obtained should be well documented in a written report.

Validation The Principles adopts the same definition of validity as given in the Standards. Validity is a unitary concept, and different sources of evidence can contribute to the degree to which there is scientific support for the interpretation of selection procedure scores for their proposed purpose. If a selection procedure is found to yield valid interpretations, then it can be said to be job-related. The Principles recognizes the five sources of evidence discussed in the Standards. However, the Principles places more emphasis on the two sources of evidence most frequently relied upon when studying validity in the employment context—criterion-related and content validity.

Criterion-Related Validity Evidence The Principles emphasizes several issues related to obtaining criterion-related validity evidence:

• Feasibility: Is it technically feasible to conduct the study in terms of measures, sample sizes, and other factors that might unduly influence the outcomes?
• Design: Is a concurrent or predictive design most appropriate?
• Criterion: Is the criterion relevant, sufficient, uncontaminated, and reliable?
• Construct equivalence: Is the predictor measuring the same construct underlying the criterion?
• Predictor: Is the selection procedure theoretically sound, uncontaminated, and reliable?
• Participants: Is the sample of individuals in the study representative of the applicants and/or incumbents, and will it support the generalization of results?
• Analyses: Are the analytical methods to be used appropriate for the data collected?
• Strength of relationships: What effect size and statistical significance or confidence intervals were hypothesized and observed?
• Adjustments: What adjustments are necessary to correct observed validity relationships to avoid underestimating the predictor-criterion relationship? It may be appropriate to adjust for restriction in range and unreliability in the criterion (a brief computational sketch follows this list).
• Combining predictors/criteria: How are predictor and/or criteria scores weighted if combined?
• Cross-validation: Should the estimates of validity be cross-validated to avoid capitalization on chance? Typically, when regression analyses are used and the sample is small, adjustments should be made using a shrinkage formula or a cross-validation design.
• Interpretation: Are the results observed consistent with theory and past research findings?
• Administrative procedures: Are adequate guidelines established for administering and scoring the selection procedure that will maintain the integrity of the validity evidence?
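The adjustments noted above rest on standard psychometric correction formulas: correction for attenuation due to criterion unreliability and the Thorndike Case II correction for direct range restriction on the predictor. The sketch below illustrates both with hypothetical values; the order of corrections and the choice of reliability estimate involve judgment calls that are discussed in the psychometric literature, so this is only a minimal illustration, not a prescription.

```python
import math

def correct_for_criterion_unreliability(r_xy: float, r_yy: float) -> float:
    """Correct an observed validity coefficient for unreliability in the criterion only."""
    return r_xy / math.sqrt(r_yy)

def correct_for_range_restriction(r_restricted: float, sd_unrestricted: float,
                                  sd_restricted: float) -> float:
    """Thorndike Case II correction for direct range restriction on the predictor."""
    u = sd_unrestricted / sd_restricted
    return (r_restricted * u) / math.sqrt(1 - r_restricted**2 + (r_restricted**2) * u**2)

# Hypothetical study values: observed r = .25, criterion reliability = .60,
# applicant-pool SD = 10, incumbent (restricted) SD = 7.
r_obs = 0.25
r_rr = correct_for_range_restriction(r_obs, sd_unrestricted=10.0, sd_restricted=7.0)
r_full = correct_for_criterion_unreliability(r_rr, r_yy=0.60)
print(f"observed r = {r_obs:.2f}; corrected for range restriction = {r_rr:.2f}; "
      f"also corrected for criterion unreliability = {r_full:.2f}")
```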


Content Validity Evidence The Principles also emphasizes several issues related to obtaining content validity evidence:

• Feasibility: Are there job determinant conditions (e.g., is the work stable or constantly changing?), worker-related variables (e.g., are past experiences relevant for the current work?), or contextual matters (e.g., are the work conditions extremely different from the testing environment?) that might influence the outcome of the validity study? If so, are they sufficiently controlled so as to not contaminate the study?
• Design: Has an adequate sample of important work behaviors and/or worker KSAOs been obtained and analyzed?
• Content domain: Has the work content domain been accurately and thoroughly defined and linked to the selection procedure?
• Selection procedure: Does the selection procedure adequately represent the work content domain? The fidelity of this relationship is the basis for the validity inference.
• Sampling: Is there a sound rationale for the sampling of the work content domain?
• Specificity: Has the level of specificity necessary in the work analysis and selection procedure been described in advance?
• Administrative procedures: Are adequate guidelines established for administering and scoring the selection procedure that will maintain the integrity of the validity evidence?

The Principles also recognizes internal structure validity evidence. The Principles points out that evidence based on the structure of a selection procedure is not sufficient alone to establish the validity of the procedure for predicting future work performance or other work-related behaviors (e.g., attendance, turnover). However, consideration of the internal structure can be very helpful during the design of a selection procedure.

Generalizing Validity Evidence The Principles provides considerably more detail regarding the generalization of validity evidence in comparison to the Standards. There are at least three strategies for generalizing evidence, known as transportability, job component validity, and meta-analysis. The Standards indicates these strategies are especially relevant when a job is new, sample sizes are small, or if research data are available to conduct meta-analyses.

Transportability This strategy refers to relying on existing validity evidence to support the use of a selection procedure in a very similar but new situation. The important consideration underlying the transport argument is work/job comparability in terms of content and requirements. Also, similarity in work context and candidate groups may be relevant to documenting the transport argument (Gibson & Caplinger, 2007).

Synthetic/Job Component Validity This type of generalization relies on the demonstrated validity of selection procedure scores for one or more domains or components of work. The work domains or components may occur within a job or across different jobs. If a sound relationship between a selection procedure and a work component has been established for one or more jobs, then the validity of the procedure can be generalized to another job that has a comparable component. As in the transportability argument, the comparability of work content on the basis of comprehensive information is essential to the synthetic/job component validity process (Hoffman, Rashkovsky, & D’Egidio, 2007; Johnson, 2007).


Meta-analysis The information on meta-analysis in the Standards and Principles is very similar. In the Principles, meta-analysis is acknowledged as a statistical technique that serves as the foundation for validity generalization. Both documents point out that meta-analytic findings may be useful, but not sufficient, to reach a conclusion about the use of a selection procedure in a specific situation. Rather, a local validation study may be more appropriate. Both sources also emphasize that professional judgment is necessary to evaluate the quality of the meta-analytic findings and their relevance to the specific situation of interest. The general conclusion in the Principles is that meta-analytic findings for cognitive tests indicate that much of the difference in validity coefficients found from one study to the next can be attributed to statistical artifacts and sampling error (Callendar & Osburn, 1981; Hartigan & Wigdor, 1989; Hunter & Hunter, 1984). Similar, though not conclusive, evidence has accumulated for noncognitive measures (Barrick & Mount, 1991; Barrick, Mount, & Judge, 2001; Hurtz & Donovan, 2000), but the strength of validity may be less for noncognitive tests (Hogan, Davies, & Hogan, 2007; Morgeson et al., 2007). The Principles discusses the appropriateness of the technique and its interpretation in specific situations. In general, reliance on meta-analytic results is most appropriate when the studies contributing to the meta-analysis focus on well-defined constructs. In such instances, the meta-analytic findings reflect the degree to which the measures of the constructs are measuring the same construct. In contrast, when the studies in the meta-analysis focus on methods (e.g., the interview) instead of constructs, several interpretational difficulties arise. Because interviews may measure different constructs, it is difficult to generalize about the interview as a general method unless the features of the interview method “are clearly understood, if the content of the procedures and meaning of the scores are relevant for the intended purpose, and if generalization is limited to other applications of the method that include those features” (Principles, p. 30). Generalizing from a meta-analysis of “the” interview method to a new interview method measuring different constructs or to a new interview that addresses a new situation is problematic when constructs do not serve as the foundation of the analysis.
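To make the "statistical artifacts and sampling error" argument concrete, the following is a minimal "bare-bones" sketch in the spirit of psychometric meta-analysis: a sample-size-weighted mean validity, the observed variance of the coefficients, and the variance expected from sampling error alone. The study coefficients and sample sizes are hypothetical, and no corrections for other artifacts (e.g., unreliability, range restriction) are applied here.

```python
import numpy as np

# Hypothetical primary studies: observed validity coefficients and sample sizes.
r = np.array([0.18, 0.31, 0.22, 0.40, 0.15, 0.27])
n = np.array([120,  85,   210,  60,   150,  95])

# Sample-size-weighted mean correlation and observed variance of correlations.
r_bar = np.sum(n * r) / np.sum(n)
var_obs = np.sum(n * (r - r_bar) ** 2) / np.sum(n)

# Variance expected from sampling error alone (bare-bones approximation).
var_err = (1 - r_bar ** 2) ** 2 / (n.mean() - 1)

pct_artifact = min(var_err / var_obs, 1.0) * 100
print(f"weighted mean r = {r_bar:.3f}")
print(f"observed variance = {var_obs:.5f}, sampling-error variance = {var_err:.5f}")
print(f"about {pct_artifact:.0f} percent of the observed variance is attributable to sampling error")
```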

Fairness and Bias As presented in the Standards, the topics of fairness and bias are also prominent in the Principles. The Principles endorses the definitions and positions taken by the Standards.

Predictive Bias An alternative term to “predictive bias” is differential prediction. Regardless of the terminology, the key is that bias occurs if consistent, nonzero errors of prediction are made for individuals in a particular subgroup that are greater than those for another subgroup. Multiple regression techniques are typically used to assess predictive bias, which is indicated if slope and/or intercept differences are observed in the model. Research on cognitive ability measures has typically supported the conclusion that there is no predictive bias for African American or Hispanic groups relative to Whites, and when predictive differences are observed, they usually indicate overprediction of the performance of the minority group. It is also important to understand that there can be mean score differences on a selection procedure for minority versus majority subgroups that do not result from predictive bias.
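The moderated regression approach mentioned above can be sketched as follows: regress the criterion on the predictor, a subgroup indicator, and their interaction; a significant subgroup term suggests an intercept difference and a significant interaction suggests a slope difference. The data below are simulated so that no bias exists, and the variable names (test, group, criterion) are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated illustration: one predictor, one criterion, two subgroups.
n = 400
group = rng.integers(0, 2, n)                  # 0 = reference group, 1 = focal group
test = rng.normal(50, 10, n)
criterion = 0.05 * test + rng.normal(0, 1, n)  # same regression line for both groups

df = pd.DataFrame({"criterion": criterion, "test": test, "group": group})

# Moderated multiple regression: the group term tests intercept differences,
# the test:group interaction tests slope differences.
model = smf.ols("criterion ~ test + group + test:group", data=df).fit()
print(model.params)
print(model.pvalues[["group", "test:group"]])
```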

Measurement Bias This form of bias is associated with one or more irrelevant sources of variance contaminating a predictor or criterion measure. Unlike predictive bias, measurement bias has no well-established assessment approach; differential item functioning (DIF) and item sensitivity analyses are suggested as options in the Principles, but considerable caution about the value of such analyses is also expressed. As noted by Sackett, Schmitt, Ellingson, and Kabin (2001), the research results indicate that item effects are often very small, and there is no consistent pattern of items that favor one group of individuals relative to another group. Additionally, the rubric of item sensitivity is very broad and includes concerns about item acceptability and perception, even if no measurement bias has resulted.

Operational Considerations Almost half of the Principles is devoted to operational considerations. The issues discussed are related to initiating and designing validation efforts; analysis of work; selecting predictors, a validation strategy, and criterion measures; data collection and analyses; implementation; recommendations and reports (technical and administrative); and other circumstances that may influence the validation effort (e.g., organizational changes; candidates with disabilities; and responsibilities of selection procedure developers, researchers, and users). A few topics in the operational considerations section of the Principles deserve particular attention in the development and implementation of an employment selection procedure; they are discussed in the following subsections.

Combining Selection Procedures If selection procedure scores are combined in some manner, the validity of the inferences derived from the composite is of great importance. In other words, it is not sufficient to simply report the validity index for each procedure as a stand-alone predictor; rather, a validity index should be reported for the combined selection procedure score that is used for decision making.

Multiple-Hurdle Versus Compensatory Models A multiple-hurdle model involves making decisions in a sequence (e.g., applicants who pass one selection procedure move on for further consideration in a following procedure, and those who pass the second procedure move on to a third selection procedure, etc.). In contrast, a compensatory model involves all applicants completing all selection procedures, and their final hiring result is based on a weighted combination of their scores on the components of the procedure. The Principles provides no definitive guidance as to which model is more appropriate; rather, each situation must be evaluated on its own merits. Combining scores into a compensatory sum may affect the overall reliability and validity of the process. When multiple predictors (with different reliabilities and validities) are combined into a single weighted composite score, the result is a single-stage selection decision. How each predictor is weighted will influence the psychometric characteristics of the compensatory selection procedure score, and the final reliability/validity indices may be lower than if the predictors were used in their individual capacities in a multi-staged selection process (Sackett & Roth, 1996).
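The contrast between the two models can be illustrated with a small simulation. The sketch below uses two hypothetical standardized predictors, an arbitrary weighting scheme, and an arbitrary hurdle cutoff; it simply shows that the two models can identify noticeably different sets of hires, which is why the choice between them deserves situation-specific evaluation.

```python
import numpy as np

rng = np.random.default_rng(7)
n_applicants = 1000

# Two standardized predictor scores per applicant (hypothetical).
test_a = rng.normal(size=n_applicants)   # e.g., a cognitive measure
test_b = rng.normal(size=n_applicants)   # e.g., a structured interview

# Compensatory model: a single weighted composite, selected top-down.
weights = np.array([0.6, 0.4])           # illustrative weights only
composite = weights[0] * test_a + weights[1] * test_b
n_to_hire = 100
compensatory_hires = np.argsort(composite)[::-1][:n_to_hire]

# Multiple-hurdle model: pass a cutoff on test A first, then rank survivors on test B.
hurdle_cutoff = 0.0                      # hypothetical pass point on test A
survivors = np.where(test_a >= hurdle_cutoff)[0]
hurdle_hires = survivors[np.argsort(test_b[survivors])[::-1][:n_to_hire]]

overlap = len(np.intersect1d(compensatory_hires, hurdle_hires))
print(f"compensatory and multiple-hurdle models agree on {overlap} of {n_to_hire} hires")
```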

Cutoff Scores Versus Rank Order The Principles concludes that a cutoff score may be set as high or low as needed relative to the requirements of the using organization given that a selection procedure demonstrates linearity or monotonicity across the range of predictions (i.e., it is valid). For cognitive predictors, the linear relationship is typically found using a criterion-related validity model and is assumed with a content validity process. Under these circumstances, using a rank-order (top-down) process will maximize expected performance on the criterion. Whether this same premise holds true for noncognitive measures has not been determined. In a rank-order model, the score of the last person selected becomes the lower bound cutoff score. A cutoff score set otherwise usually defines the score on the selection procedure below which applicants are rejected. Professional judgments that consider KSAOs required, expectancy of success versus failure, the cost-benefit ratio, consequences of failure, the number of openings, the selection ratio, and organizational diversity objectives are important to setting a cutoff score. In the case of organizational diversity objectives, using lower cutoff scores could result in higher proportions of minority candidates passing some valid initial hurdle, with the expectation that subsequent hurdles might have less adverse impact. In such instances, cutoff scores may be set even lower with the realization that there will be a corresponding reduction in job performance and selection procedure utility, but that the tradeoffs regarding hiring a diverse workforce may be sufficient to overcome such reductions.

Utility Gains in productivity, reductions in undesirable outcomes (e.g., accidents, absenteeism), or comparisons among alternate selection procedures can be estimated by utility computations. Typically, several assumptions must be made with considerable uncertainty to satisfy the computational requirements of the utility models. Thus, caution should be observed in relying upon such utility estimates.
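One of the most commonly cited utility models is the Brogden-Cronbach-Gleser formulation, which estimates the dollar gain from valid selection as a function of the number hired, average tenure, validity, the dollar-valued standard deviation of job performance, and the selection ratio, less assessment costs. The sketch below uses entirely hypothetical inputs and assumes top-down selection from a normal applicant distribution; it is exactly the kind of estimate the Principles cautions should be interpreted carefully.

```python
from scipy.stats import norm

def bcg_utility(n_hired, tenure_years, validity, sd_y, selection_ratio, cost_per_applicant):
    """Brogden-Cronbach-Gleser utility estimate for one hiring cycle (a rough sketch)."""
    # Mean standardized predictor score of those hired under top-down selection.
    z_cut = norm.ppf(1 - selection_ratio)
    mean_z_selected = norm.pdf(z_cut) / selection_ratio
    n_applicants = n_hired / selection_ratio
    gain = n_hired * tenure_years * validity * sd_y * mean_z_selected
    costs = n_applicants * cost_per_applicant
    return gain - costs

# Hypothetical inputs: 50 hires, 3-year average tenure, validity = .30,
# SD of job performance in dollars = $12,000, selection ratio = .20, $50 per applicant.
print(f"estimated utility gain = ${bcg_utility(50, 3, 0.30, 12000, 0.20, 50):,.0f}")
```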

Bands A band exists when a range of selection procedure scores is established that considers all candidates within the range to be effectively equivalent. Banding may lower expected criterion outcomes and selection utility when compared to top-down selection, but these consequences may be balanced by increased administrative ease and the possibility of increased workforce diversity.
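One common way of constructing a band (though not the only one, and not one prescribed by the Principles) is to treat scores as equivalent when they fall within a multiple of the standard error of the difference between two scores. The sketch below uses a hypothetical test standard deviation, reliability, and top score.

```python
import math

def band_width(sd: float, reliability: float, z: float = 1.96) -> float:
    """Band width based on the standard error of the difference between two observed scores."""
    sem = sd * math.sqrt(1 - reliability)   # standard error of measurement
    return z * sem * math.sqrt(2)           # standard error of the difference, scaled by z

# Hypothetical test: SD = 10, reliability = .85; band anchored at the top observed score.
top_score = 92.0
width = band_width(sd=10.0, reliability=0.85)
print(f"band width = {width:.1f} points; scores from {top_score - width:.1f} "
      f"to {top_score:.1f} would be treated as equivalent")
```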

Technical Validation Report Requirements Every validation study should be documented with a technical report that contains sufficient information to allow an independent researcher to replicate the study. Such a report should present all findings, conclusions, and recommendations. In particular, the technical report should give information regarding the research sample and the statistical analyses conducted, as well as recommendations on implementation and the interpretation of the selection procedure scores.

Summary The Principles offers a comprehensive resource for use by decision makers when developing and implementing employment selection procedures. Because the Principles is focused specifically on employment settings, there is frequently more guidance offered on matters that arise in the development and use of selection procedures than will be found in the Standards. Nevertheless, the two documents are very compatible and not at all contradictory. The Principles has undergone substantial professional peer review and represents the official policy of SIOP and the APA. Currently, the Principles is being revised, with an expected delivery in 2017.


UNIFORM GUIDELINES ON EMPLOYEE SELECTION PROCEDURES

Brief History When the U.S. Congress passed the Equal Employment Opportunity (EEO) Act of 1972, it created the Equal Opportunity Coordinating Council, which comprised the Directors/Secretaries of the Equal Employment Opportunity Commission (EEOC), the Civil Service Commission (CSC), the Civil Rights Commission (CRC), the Department of Justice (DoJ), and the Department of Labor (DoL). The Council was given the mandate to develop and implement policies, practices, and agreements that would be consistent across the agencies responsible for enforcing EEO legislation. Building on earlier guidelines promulgated by the EEOC and the Office of Federal Contract Compliance Programs (OFCCP), in 1977 the Council began developing the Uniform Guidelines document, which was adopted on August 25, 1978, by the EEOC, the CSC, the DoJ, and the DoL’s OFCCP, with an effective date of September 25, 1978. On March 2, 1979, the EEOC, Office of Personnel Management (OPM), DoJ, DoL, and Department of Treasury published the Questions and Answers (the Q&As) to clarify and provide a common interpretation of the Uniform Guidelines on Employee Selection Procedures. The change in agencies adopting the Q&As was because OPM and, to some degree, the Office of Revenue Sharing of the Treasury Department had succeeded the CSC. Although some psychologists participated in the development of the Uniform Guidelines, there was not consensus from the professional associations (e.g., SIOP, APA) that the document reflected the state of the scientific knowledge regarding the validation and use of employee selection procedures. Ad hoc committees of psychologists from SIOP and APA reviewed draft versions of the Uniform Guidelines and offered considerable input, but most of the suggestions were not incorporated (Camara, 1996). When Congress considered revising the Uniform Guidelines in 1985, the APA offered testimony that the document was deficient with respect to differential prediction, validity generalization, utility analysis, and validity requirements and documentation. SIOP concurred with the APA’s concerns and further argued that the Uniform Guidelines was in error in defining construct validity and in determining the acceptable types of validity evidence. Congress declined to revise the Uniform Guidelines at that time, though subsequently additional Q&As were adopted regarding Internet testing.

Purpose The Uniform Guidelines is intended to do the following: Incorporate a single set of principles which are designed to assist employers, labor organizations, employment agencies, and licensing and certification boards to comply with requirements of Federal Law prohibiting employment practices which discriminate on grounds of race, color, religion, sex, and national origin. They are designed to provide a framework for determining the proper use of tests and other selection procedures. These guidelines do not require a user to conduct validity studies of selection procedures where no adverse impact results. However, all users are encouraged to use selection procedures which are valid, especially users operating under merit principles. (Section 1.B; 29 C.F.R. 1607) The Q&As was prepared “to interpret and clarify, but not to modify, the provisions of the Uniform Guidelines” (Introduction, Federal Register, 43(166), 11996–12009, March 1979). All subsequent references in this chapter to the Uniform Guidelines should be considered to include the Q&As.

Application and Limitations The Uniform Guidelines applies to Title VII of the Civil Rights Act of 1964 and Executive Order 11246 (establishing the OFCCP) regarding race, color, religion, sex, and national origin. They do not apply to the Age Discrimination in Employment Act (ADEA) of 1967, nor to sections 501, 503, and 504 of the Rehabilitation Act of 1973, which prohibit discrimination on the basis of disability. Because the Americans with Disabilities Act (ADA) was not enacted until 1990, the Uniform Guidelines was not able to address this legislation and the protection it affords people with disabilities (though courts have applied the Uniform Guidelines in cases arising under other statutes, such as the ADEA and the Civil Rights Act of 1991). Generally, the Uniform Guidelines applies to most public and private-sector employers.

Selection Procedures/Employment Decisions In general, the Uniform Guidelines defines selection procedures (Equal Employment Opportunity Commission, 1979) and employment decisions in a manner similar to the Standards and the Principles. Thus, processes related to hiring, promotion, retention, and certification are covered. These processes would include tests, assessment centers, interview protocols, scored applications, physical ability measures, work samples, and performance evaluations. Furthermore, the Uniform Guidelines applies to any intermediate process (e.g., having to complete a certification program to be eligible for a promotion) that leads to a covered employment decision. Two practices are exempt or are not considered selection procedures: recruitment (excluded to protect the affirmative recruitment of minorities and women) and bona fide seniority systems.

Discrimination/Adverse Impact The Uniform Guidelines explicitly defines discrimination and introduces the term “adverse impact.” In essence, discrimination occurs when a selection procedure results in unjustifiable adverse impact. Adverse impact occurs when the selection rate for a protected group is less than four-fifths (80%) of the rate for the group with the highest rate (typically the nonprotected group). To illustrate, if the passing rate for the majority group is 60% and the passing rate for a protected group is 40%, then the ratio 40/60 yields 67%, which is less than 80%, and the Uniform Guidelines says that the enforcement agencies will view that as evidence of adverse impact. If, on the other hand, the passing rate of the protected group were 50%, the ratio becomes 50/60, yielding 83% and resulting in no adverse impact. This “rule of thumb” is not intended as a legal definition, and for good reason, because it is problematic from a couple of perspectives. First, it is highly influenced by sample size. For example, if there are 50 male and 50 female applicants and 20 open positions, the only way a selection process will not violate the 80% rule is to hire at least 9 females (9/50 = 18%) and no more than 11 males (11/50 = 22%), a difference of only 2 hires; the resulting ratio of 18/22 = 82% just clears the 80% threshold. Note that if the samples of males and females were each 500, then the same percentages of 22% and 18% hired would yield 110 males and 90 females hired; this difference of 20 would still not be considered adverse impact. Second, and perhaps most important, the 80% rule of thumb is not a statistical test; it is simply a ratio. The null hypothesis is not stated, and there is no estimate of the likelihood that any observed difference is due to chance. Accordingly, an alternative to the 80% rule is a statistical test of significant differences. Such hypothesis testing is accomplished using binomial or hypergeometric probability models. Typically, the .05 level of statistical significance under a two-tailed test (e.g., 1.96 standard deviation units) is considered the threshold of significance, both in the scientific literature and the courts (Hazelwood School District v. United States, 1977). Although the 80% value has no standing in the scientific literature, the .05 level of significance is well accepted in social sciences research as indicating statistical significance, but this test also has its practical limitations because statistical significance is itself a function of sample size. A difference of 5 percentage points between two groups would be statistically significant if the total sample were in the thousands but would not be statistically significant if the total sample were in the double digits (e.g., 30). Although the Uniform Guidelines recognizes the problems inherent in the rule of thumb in Section 3D, where it states that statistical significance is affected by “small numbers,” it does not provide guidance as to the favored strategy: the 80% rule or a statistical test of differences. Some practitioners have suggested that both analyses should be standard practice (Colosimo, 2010).
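Because both analyses have been suggested as standard practice, the sketch below computes the four-fifths impact ratio alongside a simple two-proportion z-test for the hypothetical 50/50 applicant example above. It is an illustration only; litigation practice may call for exact (e.g., Fisher or hypergeometric) tests rather than the normal approximation used here.

```python
from math import sqrt
from scipy.stats import norm

def adverse_impact_check(hired_focal, applicants_focal, hired_ref, applicants_ref):
    """Four-fifths impact ratio plus a two-proportion z-test (two-tailed)."""
    rate_focal = hired_focal / applicants_focal
    rate_ref = hired_ref / applicants_ref
    impact_ratio = rate_focal / rate_ref

    pooled = (hired_focal + hired_ref) / (applicants_focal + applicants_ref)
    se = sqrt(pooled * (1 - pooled) * (1 / applicants_focal + 1 / applicants_ref))
    z = (rate_focal - rate_ref) / se
    p_two_tailed = 2 * norm.sf(abs(z))
    return impact_ratio, z, p_two_tailed

# Hypothetical example from the text: 9 of 50 women and 11 of 50 men hired.
ratio, z, p = adverse_impact_check(9, 50, 11, 50)
print(f"impact ratio = {ratio:.2f} (values >= 0.80 pass the rule of thumb)")
print(f"two-proportion z = {z:.2f}, two-tailed p = {p:.2f}")
```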

Fairness This concept is introduced in the discussion of criterion-related validity (see Sec. 7.B [3] and Sec. 14.B [8]). The Uniform Guidelines requires that a fairness investigation of a selection procedure be conducted, if technically feasible, before applying validity evidence from one situation to a new situation. Furthermore, if adverse impact is observed and data from a criterion-related validation study are available, the user is expected to conduct a fairness analysis. Unfairness occurs when lower minority scores on a selection procedure are not reflected in lower scores on the criterion or index of job performance. As noted above, the Standards and Principles consider this a matter of predictive bias, which is found when consistent nonzero errors of prediction occur for a protected subgroup but not for other subgroups that are disproportionately selected. Moderated multiple regression is the most frequently used statistical method for examining predictive bias, which occurs if there are slope and/or intercept differences between subgroups. As previously mentioned, there is no consistent research evidence supporting predictive bias on cognitive tests for African Americans or Hispanics relative to Whites. Also, research directed at race differences on noncognitive tests suggests few and generally small differences (Cascio, Jacobs, & Silva, 2010; Hough & Oswald, 2008; Ployhart & Holtz, 2008; Ryan & Powers, 2012; Schmitt & Quinn, 2010; Schmitt, Keeney, Oswald, Pleskac, Billington, Sinha, & Zorzie, 2009).

Cutoff Scores Cutoff scores are discussed first in the Uniform Guidelines as part of the general standards for validity studies (Sec. 5. H.) and then in the Technical standards section (Sec. 14. B. [6]). According to the Uniform Guidelines, “Where cutoff scores are used, they should normally be set so as to be reasonable and consistent with normal expectations of acceptable proficiency within the work force” (Sec. 5. H.). This definition seems to imply the need for professional judgment in setting a cutoff score, and such a stance is consistent with the Standards and the Principles.

Bottom Line Another concept introduced by the Uniform Guidelines when trying to assess adverse impact or discrimination is the bottom-line approach. If there are multiple components to a selection procedure, then the final decision point is evaluated for adverse impact. According to the Uniform Guidelines, only if the adverse impact occurs at the bottom line must the individual components of a selection procedure be evaluated. However, this concept was struck down by the U.S. Supreme Court in Connecticut v. Teal (1982). Currently, it is typical for all components of a selection procedure to be evaluated in terms of adverse impact and validity if they can be examined individually (Hazelwood School District v. United States, 1977).

Alternative Selection Procedure The Uniform Guidelines introduced the concept that if two or more selection procedures are available that serve the user’s interest and have substantially equal validity “for a given purpose,” then the procedure demonstrating the lesser amount of adverse impact should be used. Although conceptually the alternative selection procedure is understandable, it is difficult to contend with in practice. There is no clear definition of “substantially equal validity.” Although there may be alternatives, it is not necessarily easy to discern which of them might have lesser adverse impact in a given situation. The degree of adverse impact observed is very specific to the numbers and qualifications of applicants at a particular point in time; furthermore, it is not clear what constitutes “lesser adverse impact.” Finally, many selection procedures are available “which serve the user’s legitimate interest in efficient and trustworthy workmanship” but still may not be feasible alternatives (see 3.B.). Examples of concerns affecting feasibility include faking or response distortion on personality and biodata inventories, costs of development and implementation, and the ability to assess very large numbers of applicants at the same time. It is important to consider carefully the purpose served by the selection procedure. It is one thing to substitute a less impactful mechanical aptitude test for one that adversely underselects women, but substituting a reading comprehension test (no matter how valid) for a mechanical aptitude test may not be appropriate depending on the job tasks and requirements. Also of note is the general application of the “alternative selection procedure” section of the Uniform Guidelines, Section 3B. Whereas most of the attention in the literature and litigation has focused on alternative procedures, the Uniform Guidelines also considers “an investigation of . . . suitable alternative methods of using the selection procedure which have as little adverse impact as possible.” Thus, a particular procedure in a given situation might be used on a pass/fail basis instead of for top-down selection.

Job-Relatedness/Business Necessity An employment selection procedure that has adverse impact may be justified in two ways: (a) showing that the procedure is job-related and (b) showing that the procedure is justified by business necessity. Job-relatedness is demonstrated by the validation process. Business necessity is demonstrated when a selection procedure is necessary for the safe and efficient operation of the business entity. Relevant statutes and regulations often define the business necessity argument (e.g., legislation regarding public safety job requirements), but other times information from the analysis of work will demonstrate the business necessity of a selection procedure.

Validity The Uniform Guidelines sets forth what the enforcement agencies consider acceptable types of validity studies and identifies three types: criterion-related, content, and construct. The document notes that new validation strategies “will be evaluated as they become accepted by the psychological profession” (see 5.A.). The Uniform Guidelines also states that the validation provisions “are intended to be consistent with generally accepted professional standards . . . such as those described in the Standards for Educational and Psychological Tests . . . and standard textbooks and journals in the field of personnel selection” (see 5.C). Of course, the Standards being referred to were published in 1974, and three major revisions were published in 1985, 1999, and 2014. The Uniform Guidelines makes no specific reference to the Principles, although the first edition was published in 1975. Consequently, it is easy to understand how the treatment of validity by the Uniform Guidelines is not particularly consistent with the state of the scientific knowledge as set forth in the current editions of the Standards and the Principles. When introducing validity, the Uniform Guidelines offers several warnings or conditions:

• Do not select on the basis of knowledge, skills, and abilities (KSAs) that can be learned on the job during orientation.
• The degree of adverse impact should influence how a selection procedure is implemented, and evidence sufficient to justify a pass/fail strategy may be insufficient for rank order.
• A selection procedure can be designed for higher-level jobs if most employees can be expected to progress to those jobs in about five years.
• An employer can use a selection procedure if there is substantial validity evidence from other applications and if the employer has in progress, if technically feasible, a validity study that will be completed in a reasonable period of time, but reliance on such research, should it not demonstrate validity, will not protect an employer from enforcement actions.
• Validity studies should be reviewed for currency, particularly if alternative procedures with equal validity but less adverse impact may be available.
• There are no substitutes for validity evidence and no assumptions of validity based on general representation, promotional material, testimony, and the like.
• Employment agencies are subject to the guidelines in the same manner as employers.

Criterion-Related Validity The Uniform Guidelines’ position on criterion-related validity is very consistent with the information set forth in the Standards and Principles. Job analysis is important for decisions regarding grouping jobs together and selecting and developing criterion measures. An overall measure of job performance may be used as a criterion if justified by the job analysis; however, the Principles and Standards emphasize the need for construct equivalence for predictor and criterion measures. Typically, there are criteria with a greater degree of construct specificity developed from work analysis than from “overall performance.” Success in training also can be used as a criterion. Concurrent and predictive designs are recognized, and emphasis is placed on the representativeness of the sample of individuals participating in the validity study, regardless of its design. Criterion-related validity evidence should be examined using acceptable statistical procedures, and the Uniform Guidelines establishes the .05 level of statistical significance as the threshold for concluding that there is a relationship between a predictor and a criterion. Usually, the relationship is expressed as a correlation coefficient, which must be assessed in the particular situation: “There are no minimum correlation coefficients applicable to all employment situations” (see 14.B. [6]). Additionally, care must be taken to not overstate validity findings.
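The .05 threshold referenced above is commonly evaluated with a t-test of the observed validity coefficient against zero. The sketch below, using a hypothetical coefficient and sample size, shows the standard computation; it does not address the Uniform Guidelines' separate caution against overstating validity findings.

```python
from math import sqrt
from scipy.stats import t

def correlation_p_value(r: float, n: int) -> float:
    """Two-tailed p-value for testing a Pearson validity coefficient against zero."""
    t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return 2 * t.sf(abs(t_stat), df=n - 2)

# Hypothetical concurrent study: r = .28 with 75 incumbents.
r_obs, n_obs = 0.28, 75
p = correlation_p_value(r_obs, n_obs)
print(f"r = {r_obs:.2f}, n = {n_obs}, two-tailed p = {p:.4f} "
      f"({'meets' if p < .05 else 'does not meet'} the .05 threshold)")
```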

Content Validity The technical standards for content validity studies begin by focusing on the appropriateness of such a study. A selection procedure must be a representative sample of the job content or purport to measure KSAs that are required for successful job performance. Selection procedures based on inferences about mental abilities or that purport to measure traits such as intelligence, common sense, or leadership cannot be supported only on the basis of content validity. Solid job analysis information that is representative of the jobs (and, when necessary, operationally defined) is critical to a content validity argument. The Uniform Guidelines provides for the ranking of candidates assessed by a content-valid selection procedure, given that the procedure is measuring one or more capabilities that differentiate among levels of job performance. This is generally compatible with the guidance offered by the Principles, although the Q&As to the Uniform Guidelines gives more examples as to when it is, or is not, appropriate to use rank ordering.

Construct Validity This form of validity is defined in Section 14.D (1) of the Uniform Guidelines as “a series of research studies, which include criterion-related and which may include content validity studies.” In Section 14.D (1) and (3), it is stated that a “construct” is the intermediary between the selection procedure on the one hand and job performance on the other. A job analysis is required, and one or more constructs that are expected to influence successful performance of important work behaviors should be identified and defined. To accomplish a construct validity study, it should be empirically demonstrated “that the selection procedure is validly related to the construct and that the construct is validly related to the performance of critical or important work behaviors” (14.D [3]). (This is the definition that drew the objections of the APA and SIOP.) In turn, a selection procedure is developed that will measure the constructs of interest. In a somewhat discouraging note for researchers, the Uniform Guidelines states that “The user should be aware that the effort to obtain sufficient empirical support for construct validity is both an extensive and arduous effort involving a series of research studies” (Uniform Guidelines, Section 14.D [1]).

Documentation Required The Uniform Guidelines sets forth many documentation requirements for a validity study, and many of these requirements are labeled “essential.” Generally speaking, the information expected as part of the documentation effort is very consistent with the material presented in each of the various sections of the Uniform Guidelines.

Utility One term—“utility”—does not have a definition in the Uniform Guidelines, but it could have many interpretations. Though it is not defined, it is found in the sections dealing with the uses and applications of a selection procedure that has been evaluated by a criterion-related validity study. Specifically, when documenting the methods considered for using a procedure, it “should include the rationale for choosing the method of operational use, and the evidence of validity and utility of the procedure as it is to be used (essential)” (see 15.B. [10]). Identical sentences appear in the uses and applications sections for content and construct validity. Furthermore, in Section 5.G. the Uniform Guidelines states: If a user decides to use a selection procedure on a ranking basis, and that method of use has a greater adverse impact than use of an appropriate pass/fail basis . . ., the user should have sufficient evidence of validity and utility to support the use on a ranking basis.

COMPARISONS AMONG THE THREE AUTHORITIES

Given different authorships, different purposes, and different dates of adoption, it is useful to make comparisons among the three authorities to identify areas of agreement and disagreement. Such information might be particularly valuable to a user who is deciding about relying on one or more of the authorities or who has relied on one of the authorities and not realized what one or two of the other authorities had to say on the topic of interest. The common themes across the three authorities are matters of validation and psychometric measurement. To facilitate this discussion, Table 27.1 has been prepared to compare the three authorities on several concepts or terms and their respective definitions or explanations. Before discussing any of the specifics, it is quickly obvious that there are many terms without definitions or explanations under the Uniform Guidelines column. There are, no doubt, several reasons for this situation, and two possible explanations may be offered:

• The Uniform Guidelines is some 35 years older than the Standards and 25 years older than the Principles. The latter two documents have undergone two revisions each since the Uniform Guidelines was published, but the Uniform Guidelines has never been revised or brought up to date, except for inclusion of additional Q&As.
• The Uniform Guidelines was written to guide the enforcement of civil rights legislation. The Standards and Principles were written to guide research and professional practice and to inform decision making in applicable areas of employment selection.


TABLE 27.1 Validation and Psychometric Terminology Comparison

Validity (unitary concept)
  Standards 2014: The degree to which accumulated evidence and theory support a specific interpretation of test scores for a given use of a test
  Principles 2003: The degree to which accumulated evidence and theory support specific interpretations of scores from a selection procedure entailed by the proposed uses of that selection procedure
  Uniform Guidelines 1978: Not defined

Sources of validity evidence

(a) Relations to other variables/criterion-related
  Standards 2014: The relationship of test scores to variables external to the test such as measures of some criteria that the test is expected to predict
  Principles 2003: The statistical relationship between scores on a predictor and scores on a criterion measure
  Uniform Guidelines 1978: Empirical data showing that the selection procedure is predictive of or significantly correlated with important elements of work behavior

(b) Content
  Standards 2014: The linkage between a predictor and one or more aspects of a criterion construct domain
  Principles 2003: The extent to which content of a selection procedure is a representative sample of work-related personal characteristics, work performance, or other work activities or outcomes
  Uniform Guidelines 1978: Data showing that the content of a selection procedure is representative of important aspects of performance on the job

(c) Internal structure
  Standards 2014: The extent to which the relationships between test items conform to the construct that is the foundation for test score interpretation
  Principles 2003: The degree to which psychometric and statistical relationships among items, scales, or other components within a selection procedure are consistent with the intended meanings of scores on the selection procedure
  Uniform Guidelines 1978: Not defined

(d) Response process
  Standards 2014: The study of the cognitive account of some behavior, such as making a selection procedure item response
  Principles 2003: The study of the cognitive account of some behavior, such as making a selection procedure item response
  Uniform Guidelines 1978: Not defined

(e) Consequences of testing
  Standards 2014: Whether or not the specific benefits expected from the use of a selection procedure are being realized
  Principles 2003: Evidence that consequences of selection procedure use are consistent with the intended meaning or interpretation of the selection procedure
  Uniform Guidelines 1978: Not defined, but possibly referenced in the term “utility”

Construct validity
  Standards 2014: An indication that a predictor measure represents a predictor construct domain combined with evidence of the linkage between the predictor construct domain and the criterion construct domain. The term “construct validity” is no longer used to define a “type” of validity.
  Principles 2003: Evidence that scores on two or more selection procedures are highly related and consistent with the underlying construct; can provide convergent evidence in support of the proposed interpretation of test scores as representing a candidate’s standing on the construct of interest.
  Uniform Guidelines 1978: Data showing that the selection procedure measures the degree to which candidates have identifiable characteristics that have been determined to be important for successful job performance

Convergent validity
  Standards 2014: Evidence based on the relationship between test scores and other measures of the same constructs
  Principles 2003: Evidence of a relationship between measures intended to represent the same construct
  Uniform Guidelines 1978: Not defined

Discriminant validity
  Standards 2014: Evidence indicating whether two tests interpreted as measures of different constructs are sufficiently independent that they do measure two different constructs
  Principles 2003: Evidence of a lack of a relationship between measures intended to represent different constructs
  Uniform Guidelines 1978: Not defined

Validity generalization
  Standards 2014: Applying validity evidence obtained in one or more situations to other similar situations on the basis of methods such as transportability by job analysis or meta-analysis
  Principles 2003: Evidence of validity that generalizes to setting(s) other than the setting(s) in which the original validation evidence was documented. Generalized evidence is accumulated through such strategies as transportability, synthetic/job component validity, and meta-analysis.
  Uniform Guidelines 1978: Not defined

Transport of validity
  Standards 2014: Directly transporting validity evidence from another setting in a situation where sound evidence (e.g., careful job analysis) indicates that the local job is highly comparable to the job for which the validity data are being imported
  Principles 2003: A strategy for generalizing evidence of validity in which demonstration of important similarities between different work settings is used to infer that validation evidence for a selection procedure accumulated in one work setting generalizes to another work setting
  Uniform Guidelines 1978: Using evidence from another study when the job incumbents from both situations perform substantially the same major work behaviors as shown by appropriate job analyses; the study should also include an evaluation of test fairness for each race, sex, and ethnic group that constitutes a significant factor in the labor market for the job(s) in question within the labor force of the organization desiring to rely on the transported evidence

Synthetic/job component validity
  Standards 2014: Not defined
  Principles 2003: Generalized evidence of validity based on previous demonstration of the validity of inferences from scores on the selection procedure or battery with respect to one or more domains of work (job components)
  Uniform Guidelines 1978: Not defined

Meta-analysis
  Standards 2014: A statistical method of research in which the results from several independent, comparable studies are combined to determine the size of an overall effect on the degree of relationship between two variables
  Principles 2003: A statistical method of research in which results from several independent studies of comparable phenomena are combined to estimate a parameter or the degree of relationship between variables
  Uniform Guidelines 1978: Not defined

Reliability
  Standards 2014: The degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable; the degree to which scores are free of errors of measurement for a given group
  Principles 2003: The degree to which scores for a group of assessees are consistent over one or more potential sources of error (e.g., time, raters, items, conditions of measurement, etc.) in the application of a measurement procedure
  Uniform Guidelines 1978: The term is not defined, but the reliability of selection procedures, particularly those used in a content validity study, should be of concern to the user

Fairness/unfairness
  Standards 2014: A test that is fair minimizes the construct-irrelevant variance associated with individual characteristics and testing contexts that otherwise would compromise the validity of scores for some individuals. There is no single technical meaning; in the employment setting, fairness can be defined as an absence of bias and that all persons are treated equally in the testing process.
  Principles 2003: There are multiple perspectives on fairness. There is agreement that issues of equitable treatment, predictive bias, and scrutiny for possible bias when subgroup differences are observed are important concerns in personnel selection; however, there is not agreement that the term “fairness” can be uniquely defined in terms of any of these issues.
  Uniform Guidelines 1978: When members of one race, sex, or ethnic group characteristically obtain lower scores on a selection procedure than members of another group and the differences in scores are not reflected in differences in a measure of job performance

Predictive bias
  Standards 2014: The systematic under- or overprediction of criterion performance for people belonging to groups differentiated by characteristics not relevant to criterion performance
  Principles 2003: The systematic under- or overprediction of criterion performance for people belonging to groups differentiated by characteristics not relevant to criterion performance
  Uniform Guidelines 1978: Not defined, but see “Fairness/unfairness” above

Cut score/cutoff score
  Standards 2014: A specific point on a score scale such that scores at or above that point are interpreted or acted upon differently from scores below that point
  Principles 2003: A score at or above which applicants are selected for further consideration in the selection procedure. The cutoff score may be established on the basis of several considerations (e.g., labor market, organizational constraints, normative information). Cutoff scores are not necessarily criterion referenced, and different organizations may establish different cutoff scores on the same selection procedure on the basis of their needs.
  Uniform Guidelines 1978: Cutoff scores should normally be set so as to be reasonable and consistent with normal expectations of acceptable proficiency within the workforce

Note: The above definitions or explanations are taken verbatim from the glossaries or definition section of the authoritative sources whenever possible. Otherwise, the definitions were extracted from document text on the subject.

Hence, the latter two documents have more of a scientific focus and rely heavily on the current research literature; the Uniform Guidelines was intended to be consistent with generally accepted professional standards set forth in the 1974 version of the Standards but was not necessarily research-based at the time of its preparation.

Standards Versus Principles There are no areas of disagreement between the Standards and the Principles. In some areas the Standards offers more information and guidance than the Principles. Examples include (a) discussions of validity evidence based on response processes, internal structure, and the consequences of testing; (b) discussions of reliability and errors of measurement; (c) the test development and revision process; (d) scales, norms, and score comparability; and (e) the rights and responsibilities of test takers. A few topics are more broadly considered in the Principles than is true for the Standards. Examples include (a) the concept of the analysis of work (to incorporate the work context and organizational setting) rather than job analysis; (b) clarifying that the generalization of validity evidence can be accomplished by several methods, including transportability and synthetic/job component validity, as well as being supported by meta-analysis; and (c) certain operational considerations associated with the conduct of a validation study in organizational settings (e.g., communications, organizational needs and constraints, quality control and security, implementation models, and utility).

Validity (Unitary Concept) The Standards and the Principles view validity as a unitary concept, whereas the Uniform Guidelines partitions validity into three types: criterion-related, content, and construct. This partitioning of validity was the thinking 40 years ago, but it is clearly out of date now.

Sources of Validity Evidence (a) Relations to other variables/criterion-related: The Uniform Guidelines’ focus on work behavior as a criterion excludes potential studies of the relationships between a selection procedure of interest and other tests hypothesized to measure the same or different constructs (i.e., other external variables).

(b) Content: All three authorities agree that content validity is dependent on a sound determination that the selection procedure is a representative sample of work-related behavior. The analysis of work (or the job) is fundamental to establishing the predictor-criterion linkage. The Uniform Guidelines confines job requirements to a study of KSAs; the Standards and Principles provide for the study of KSAOs and would include “O” variables in a selection procedure subject to a content validity study. The Uniform Guidelines precludes use of a content strategy to study the validity of traits or constructs such as spatial ability, common sense, judgment, or leadership. Although it is important to describe the relevant work behavior or KSAO at a level of specificity so there is no misunderstanding about what is being measured, it is unnecessary and unwise to reject content validity evidence simply because it is concerned with linking an ability or personal characteristic (e.g., leadership) to the domain of job performance. Many constructs can be defined in terms of specific work behaviors although they have broad labels. Furthermore, there are many situations in which content validity may be the only option. If leadership capabilities are critical to job performance, validity evidence beyond a content validity study may be infeasible. There may not be adequate numbers of candidates or incumbents to conduct a criterion-related study, and there may not be sufficient and reliable criteria available. Consequently, a content validity study may be the only viable approach to evaluating the validity of a construct of interest.

(c) Internal structure/response processes/consequences of testing: These three lines of evidence for a validity argument were not developed at the time the Uniform Guidelines was written and hence are not discussed in it.

Construct Validity The Uniform Guidelines treats construct validity as a separate type of validity. In the Standards and Principles, all selection procedure scores or outcomes are viewed as measures of some construct. Consequently, any evaluation of validity is a “construct validity” study.

Convergent and Discriminant Validity Although these terms and their implications were well-established at the time the Uniform Guidelines was prepared, there was no discussion about the value of these types of evidence in the document.

Validity Generalization The concept was known at the time the Uniform Guidelines was prepared but was not specifically used in the document. Many have interpreted Section 7.B of the Uniform Guidelines as providing for validity generalization arguments. The provisions of that section are described under transport of validity evidence in Table 27.1.

Transport of Validity The three authoritative sources agree that a work or job analysis is necessary to support the transport of validity. However, the Uniform Guidelines goes further and requires that there be an existing criterion-related validity study and a fairness study of the selection procedure for relevant protected subgroups. Yet there is no guidance as to the acceptability of transporting the validity of a selection procedure that has some demonstrated unfairness. Furthermore, as noted previously, in many situations, sample sizes may preclude adequate fairness analyses (Aguinis & Stone-Romero, 1997).

Synthetic/Job Component Validity This validity generalization strategy has been known for more than 40 years but has not received much attention in validation research conducted outside of the employment arena. Neither the Standards nor the Uniform Guidelines has defined this strategy of validity generalization.

Meta-Analysis In 1978 the authors of the Uniform Guidelines did not have knowledge of the research findings that have emerged subsequently from meta-analytic research. This, unfortunately, is another void, and a significant amount of research is available today that might not be considered to be within the scope of validation strategies acceptable under the Uniform Guidelines.

Reliability The term reliability is not defined in the Uniform Guidelines as it is in the other two authoritative sources, but it is considered to be important for selection procedures that have been supported with a content validity strategy. The Standards and Principles emphasize that the reliability of any measurement be considered whenever it is technically feasible to do so.

Fairness/Unfairness and Bias The Standards and the Principles consider fairness to be a very broad concept with many facets. Alternatively, the two sources consider bias to be a very specific term concerned with under- or overprediction of subgroup performance. This interpretation is basically the one that the Uniform Guidelines gives to the term unfairness while relying on the 1974 version of the Standards.
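To illustrate what under- or overprediction means operationally, the following minimal sketch (simulated data; one common approach in the spirit of the Cleary regression model and the moderated multiple regression strategy, e.g., Bartlett, Bobko, & Mosier, 1978) fits a single regression with group and interaction terms and inspects whether a common regression line would misestimate one subgroup's criterion scores. All variable names and values are hypothetical.

```python
# Minimal sketch (simulated data): the regression view of predictive bias.
# Bias in this sense exists when a common regression line systematically
# under- or overpredicts criterion scores for one subgroup.
import numpy as np

rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, n)          # 0 = reference group, 1 = focal group (hypothetical)
test = rng.normal(0, 1, n)             # standardized predictor scores
# Simulated criterion with the same slope and intercept in both groups,
# i.e., no predictive bias is built into these data.
perf = 0.5 * test + rng.normal(0, 1, n)

# Design matrix for the moderated regression: intercept, test, group, test x group
X = np.column_stack([np.ones(n), test, group, test * group])
b, *_ = np.linalg.lstsq(X, perf, rcond=None)

print("intercept difference (group term):   %.3f" % b[2])
print("slope difference (interaction term): %.3f" % b[3])
# Nontrivial group or interaction coefficients (evaluated with the usual
# significance tests in practice) would signal intercept or slope differences,
# that is, systematic under- or overprediction for one subgroup.
```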

Cut Score/Cutoff Score The Standards and Principles give more attention to developing in detail many of the issues underlying the setting of cutoff scores than does the Uniform Guidelines. However, there does not seem to be any significant disagreement across the three documents as to how a cutoff score will function and the intent for a cutoff score to screen out those who will not achieve acceptable levels of job performance.

Summary There are some levels of consistency or agreement across the three authoritative sources but also consequential areas of disagreement. It is very likely that the advances in selection procedure research and scholarly thinking regarding validity that have occurred over the last 35 years account for these differences. Although the Uniform Guidelines is the document that seems most deficient in terms of knowledge of the field, it is also the first document of the three in terms of its adoption. On that basis, its deficiencies can be excused by being out of date; however, as noted earlier in this chapter, the authors of the Uniform Guidelines allowed for other procedures and issues to arise and envisioned their potential inclusion in the framework laid out by the document. Sections 5.A and 5.C acknowledge, respectively, that “New strategies for showing the validity of selection procedures will be evaluated as they become accepted by the psychological profession” and that “The provisions of these guidelines . . . are intended to be consistent with generally accepted professional standards . . . and standard textbooks and journals in the field of personnel selection.” These clauses can be interpreted to suggest that deference should be given to the Principles and Standards where they disagree with the Uniform Guidelines. Despite these forward-looking provisions, no substantive changes have ever been made in the Uniform Guidelines, even though case law has changed various provisions and interpretations (e.g., bottom-line analyses). Arguably, this state of affairs reflects the significant interaction between the Uniform Guidelines and the case law as it has developed since its adoption. Indeed, the Supreme Court (1971) indicated that the EEOC’s earlier Guidelines (predecessor to the Uniform Guidelines) was to be given “great deference” by the courts. Changes to the Uniform Guidelines will likely be controversial and difficult, if possible at all. Nevertheless, some time in the near future it will be important for the Uniform Guidelines to be revised to reflect the current state of the science. Until that time, the decision maker involved in employment selection should look to the Standards and Principles for guidance on many issues that either are now incorrect or are not addressed in the Uniform Guidelines.

FINAL THOUGHTS

Science Versus Litigation Versus Technical Authorities/Guidelines It is recognized that there are some significant inconsistencies at this time between the technical information provided by the Standards and Principles, on the one hand, and the Uniform Guidelines, on the other hand, and that these differences can be extremely important in the event of litigation regarding a selection procedure. However, these differences can be resolved. Unfortunately, until a revision to the Uniform Guidelines is forthcoming, when more than one authority is introduced in litigation, each offered in support of only one side of an argument, resolution of the differences that appear in print will need to be part of the judicial decision-making process. In this regard, it is incumbent upon those who rely on any of the authoritative sources during the course of litigation to be clear about the relevance and currency of the source(s) that are providing guidance to their opinions.

Conclusions In closing, we want to note several broad, as well as some specific, issues. We will start with the broader issues. First, what deference should be given to the Uniform Guidelines, Principles, and Standards in guiding psychologists as they make decisions in employment settings? We ask this question given that the three documents are in many ways static, whereas the field is dynamic. That is, research is constantly being conducted that provides new knowledge and/or influences how we interpret behavioral phenomena. For example, it is a commonly accepted fact that the validity of cognitive ability tests generalizes across situations and jobs (Hunter & Hunter, 1984). Yet, this was not always the "accepted" fact; in the 1960s, validity was described as "situation specific" (Ghiselli, 1966). If there had been three sets of sources promulgated by various agencies in the 1960s, they most likely would have advocated for "situation specificity," and the accepted practice would have been to validate tests in every situation for every job. The point of this example is that perhaps the current sources—Uniform Guidelines, Principles, and Standards—should not be viewed as authoritative regarding knowledge, but rather as primers for "how to conduct research" and what factors to consider when determining the validation of a test. Reliance on the sources for new research findings may hamper the field. The documents are not "living" and thus cannot account for changes due to new research. However, the practitioner or researcher can rely on the sources with regard to how to establish the validity of a test and what information is needed as part of the research. Acceptance of the above premise brings us to the second broad issue. Given that the sources are relied upon in litigation, whether introduced directly in testimony in court cases or as authority references when trying to explain to judges and lawyers what research we conduct and how we conduct it, the question becomes "How sound are the sources as authoritative documents in court proceedings?" One potential set of criteria is the "Daubert thresholds" (Daubert v. Merrell Dow Pharmaceuticals, Inc., 1993), which set forth rules for determining what constitutes admissible expert testimony or scientific evidence:
1. Testing—the technique can be or has been tested through the collection of data with an accepted methodology
2. Has a known or potential error rate
3. Has been subjected to peer review and publication
4. Has gained general acceptance in a relevant scientific community

Another concern is the need to consider that the global economy is changing the way in which humans work and with whom they work. Accordingly, future sources should address cultural issues and the changing nature of work. Some examples of these issues include:
1. The need to consider assessment of individuals with diverse linguistic backgrounds as well as the need to accommodate test takers whose first language is not English.
2. The need to consider electronic, Internet, and web-based technology and the fact that the next generation of workers will likely not have been exposed to the same methods of training, operating, and performing at work as the current generation. Advanced technology may provide greater opportunity to capture actual samples or simulations of job behaviors than are garnered in paper-and-pencil multiple-choice formats.
3. The need to identify criteria that are focused on shorter-term gains than those used in the past (e.g., tenure in the position for at least one year). A global pace of competition implies that businesses will need to reduce losses (such as incorrect "hires") and increase gains (such as faster training times) much more quickly than was common in the past.
4. The need to recognize that current tests explain, at most, approximately 25% of the variance in job performance as we measure it today (see the brief worked example following this list). Although it is appropriate to concern ourselves with searching for additional predictors, we need to consider ways in which to broaden the criterion space and how to combine the criteria in such a fashion as to provide a "comprehensive" picture of the worker. That is, although we can predict to a reasonable degree (15–25% of the variance) how well entering college students may perform as represented by the criterion of final grade point average, we need to examine other factors that measure success in college and how these additional factors can be combined to represent success in the "college experience."
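The variance figures cited above are simply squared validity (correlation) coefficients. As a brief worked example:

```latex
% Variance accounted for is the square of the validity coefficient:
%   r = .50  =>  r^2 = .25  (about 25% of criterion variance)
%   r = .40  =>  r^2 = .16  (roughly the middle of the 15-25% range noted above)
\[
  r_{xy} = .50 \;\Rightarrow\; r_{xy}^{2} = .25, \qquad
  r_{xy} = .40 \;\Rightarrow\; r_{xy}^{2} = .16
\]
```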

Authoritative sources that incorporate principles, guidelines, and standards have a valuable role to play in the science of employment selection; however, the limitations inherent to such sources must be openly recognized, and to the degree there is disagreement or conflicts among the sources, they should be revealed before they attain a stature that creates a disservice to employees, employers, and I-O psychology professionals.

NOTE 1. This chapter with modifications and considerable updating is based on “Professional and Technical Authorities and Guidelines” by P. Richard Jeanneret, which is found in Landy, F. J. (2005). Employment discrimination litigation: Behavioral, quantitative and legal perspectives. San Francisco, CA: John Wiley & Sons.

REFERENCES Aguinis, H., & Stone-Romero, E. (1997). Methodological artifacts in moderated multiple regression and their effects on statistical power. Journal of Applied Psychology, 82, 192–206. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Education Research Association. Barrett, G. V., Phillips, J. S., & Alexander, R. A. (1981). Concurrent and predictive validity designs: A critical reanalysis. Journal of Applied Psychology, 66, 1–6. Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26. Barrick, M. R., Mount, M. K., & Judge, T. A. (2001). The FFM personality dimensions and jobs: Metaanalysis of meta-analysis. International Journal of Selection and Assessment, 9, 9–30. Bartlett, C. J., Bobko, P., & Mosier, S. (1978). Testing for fairness with a moderated multiple regression strategy: An alternative to differential analysis. Personnel Psychology, 31, 233–245. Bemis, S. E. (1968). Occupational validity of the general aptitude test battery. Journal of Applied Psychology, 52, 240–249. Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential evidential bases. Journal of Applied Psychology, 74, 478–494. Callender, J. C., & Osburn, H. G. (1981). Testing the constancy of validity with computer-generated sampling distributions of the multiplicative model variance method estimate: Results for petroleum industry validation research. Journal of Applied Psychology, 66, 274–281. Camara, W. J. (1996). Fairness and public policy in employment testing: Influences from a professional association. In R. S. Barrett (Ed.), Fair employment strategies in human resource management. Westport, CT: Quorum Books. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validity by the multitrait-multimethod motive. Psychological Bulletin, 56, 81–105. Cascio, W. F., & Boudreau, J. (2011). Investing in people: The Financial impact of human resource initiatives (2nd ed.). Upper Saddle River, NJ: Pearson Education. Cascio, W., Jacobs, R., & Silva, J. (2010). Validity, utility, and adverse impact: Practical implications from 30 years of data. In J. L. Outtz (Ed.), Adverse impact: Implications for organizational staffing and high stakes selection (pp. 271–288). New York, NY: Routledge, Taylor & Francis Group.

Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 5, 115–124. Colosimo, J. (2010). A primer on adverse impact analysis. White Paper, DCI Consulting Group, Inc. Downloaded from http://dciconsult.com/whitepapers/AIPrimer.pdf, August 16, 2015. Connecticut v. Teal, 457 U.S. 440 (1982). Cooper, H. (2010). Research synthesis and meta-analysis: A step-by-step approach (4th ed.). Thousand Oaks, CA: Sage. Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Urbana, IL: University of Illinois. Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993). Equal Employment Opportunity Commission, Civil Service Commission, Department of Justice, & Department of Labor. (1978). Uniform guidelines on employee selection procedures. 29 CFR, 1607. Equal Employment Opportunity Commission, Office of Personnel Management, Department of Justice, Department of Labor, & Department of Treasury. (March 2, 1979). Questions and answers to clarify and provide a common interpretation of the Uniform Guidelines on Employee Selection Procedures. 44FR, No. 43. European Federation of Psychologists' Associations. (2013). EFPA Review Model for the Description and Evaluation of Psychological and Educational Tests, Version 4.2.6. Ghiselli, E. E. (1966). The validity of occupational aptitude tests. New York, NY: John Wiley. Gibson, W. M., & Caplinger, J. A. (2007). Transportation of validation results. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 29–81). San Francisco, CA: John Wiley and Sons. Hartigan, J. A., & Wigdor, A. K. (Eds.) (1989). Fairness in employment testing. Washington, DC: National Academy Press. Hazelwood School District v. United States, 433 U.S. 299.31 n. 17 (1977). Hoffman, C. C., Rashkovsky, B., & D'Egidio, E. (2007). Job component validity: Background, current research, and applications. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 82–121). San Francisco, CA: John Wiley and Sons. Hogan, J., Davies, S., & Hogan, R. (2007). Generalizing personality-based validity evidence. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 181–229). San Francisco, CA: John Wiley and Sons. Hough, L. M., & Oswald, F. L. (2008). Personality testing and industrial-organizational psychology: Reflections, progress, and prospects. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 272–290. Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72–88. Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Newbury Park, CA: Sage. Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The Big Five revisited. Journal of Applied Psychology, 85, 869–879. International Standards Organisation. (2011). Assessment service delivery—Procedures and methods to assess people in work and organisational settings. ISO-10667–2. International Taskforce on Assessment Center Operations. (2015).
Guidelines and ethical considerations for assessment center operations. Journal of Management, 41, 1244–1273. International Test Commission. (2001). International guidelines for test use. International Journal of Testing, 1(2), 93–114. International Test Commission. (2005). International Guidelines on Test Adaptation. Retrieved from www. intestcom.org Johnson, J. W. (2007). Synthetic validity: A technique of use (finally). In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 122–158). San Francisco, CA: John Wiley and Sons. Landy, F. J. (Ed.). (2005). Employment discrimination litigation: Behavioral, quantitative, and legal perspectives. San Francisco, CA: Jossey-Bass. McDonald, R. P. (1999). Test theory: Unified treatment. Mahwah, NJ: Lawrence Erlbaum. Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012–1027. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York, NY: Macmillan. Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck, J. R., Murphy, K., & Schmitt, N. (2007). Reconsidering the use of personality tests in personnel selection contexts. Personnel Psychology, 60, 683–729.

Naylor, J. C., & Shine, L. C. (1965). A table for determining the increase in mean criterion scores obtained by using a selection device. Journal of Industrial Psychology, 3, 33–42. Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill. Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum. Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153–172. Putka, D. J., & Sackett, P. R. (2010). Reliability and validity. In J. L. Farr & N. T. Tippins (Eds.), Handbook of employee selection (pp. 9–49). New York, NY: Routledge. Ryan, A. M., & Powers, C. L. (2012). Workplace diversity. In N. Schmitt (Ed.), Oxford handbook of personnel assessment and selection (pp. 814–831). London: Oxford University Press. Sackett, P. R., & Roth, L. (1996). Multi-stage selection strategies: A Monte Carlo investigation of effects on performance and minority hiring. Personnel Psychology, 49, 549–572. Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B. (2001). High-stakes testing in employment, credentialing, and higher education. American Psychologist, 56, 302–318. Schmidt, F. L., Hunter, J. E., McKenzie, R. C., & Muldrow, T. W. (1979). Impact of valid selection procedures on work-force productivity. Journal of Applied Psychology, 64, 609–626. Schmitt, N., Keeney, J., Oswald, F. L., Pleskac, T. J., Billington, A. Q., Sinha, R., & Zorzie, M. (2009). Prediction of 4-year college student performance using cognitive and noncognitive predictors and the impact on demographic status of admitted students. Journal of Applied Psychology, 94, 1479–1497. Schmitt, N., & Quinn, A. (2010). Reductions in measured subgroup differences: What is possible. In J. L. Outtz (Ed.), Adverse impact: Implications for organizational staffing and high stakes selection (pp. 425–451). New York: Routledge, Taylor & Francis Group. Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author. Traub, R. E. (1994). Reliability for the social sciences: Theory and applications. Thousand Oaks, CA: Sage. Wainer, H., & Braun, H. I. (Eds.) (1988). Test validity. Hillsdale, NJ: Lawrence Erlbaum.


28 AN UPDATED SAMPLER OF LEGAL PRINCIPLES IN EMPLOYMENT SELECTION ARTHUR GUTMAN, JAMES L. OUTTZ, AND ERIC DUNLEAVY

INTRODUCTION We are very pleased that we have been asked to contribute an updated chapter on legal principles for this volume.1 As we noted in the original chapter, many principles of employee selection have been illuminated through decades of equal employment opportunity legislation, regulation, and related case law. Granted, judges write opinions, but these judges frequently depend heavily on industrial-organizational (I-O) psychologists in framing these opinions and are generous in their citations to our work and our testimony. Once again, we see a clear role for employment litigation knowledge related to personnel selection in this volume. In the context of this role, we have been asked to review, refresh, and expand upon the “principles” that were established in the original chapter. We focused those principles on selection practices that are commonly at issue in employment discrimination cases. We intended the “principles” to represent general themes that practitioners might keep in mind when attempting to minimize legal risk when using such practices and identified exemplar court cases where we could.2 We use the term “principle” to refer to a basic and generally accepted combination of science and practice as they relate to specific legal matters. This use is obviously different from “the” Principles promulgated by the Society for Industrial and Organizational Psychology (SIOP; 2003), which were being revised for a fifth edition at the time this book chapter was written. The SIOP Principles are intended to be a scientific document and not a legal one, whereas the principles we denote in this chapter are focused on the equal employment opportunity (EEO) legal context. Neither are we intending to address the “best” practice from a theoretical or empirical perspective. Instead, we are simply articulating what might be seen as contemporary “defensible” practice from the legal perspective. Nevertheless, the principles we present are general rather than specific and are generally well documented via a variety of sources. Like other chapter authors in this Handbook, we are constrained by allotted pages. Therefore, we have been selective in formulating principles and providing underlying case law. As a result, we do not consider these principles to be exhaustive, or possibly even the most important, from a psychometric or theoretical sense. Although we would consider these principles more rather than less important than others that might have been chosen, we acknowledge that there may be other equally important principles that we might have chosen to articulate: so many principles, so few pages. The principles we have chosen to present generally deal with either a general practice (e.g., work analysis, adverse impact), a statute (e.g., ADA, CRA, ADEA), a protected group (e.g., gender, race, age, disability), or some combination of those factors. The employment selection

practices we have chosen to address represent "terms and conditions" of employment as defined in key federal laws. Terms and conditions are broadly defined to include recruitment, hiring, training, promotion, benefits, and termination. The term "law" is broadly defined as including statutes, executive orders, and constitutional amendments. Broader coverage of EEO law is provided by Gutman, Koppes, and Vodanovich (2010), Landy (2005), and Outtz (2010). As a group, these EEO laws proscribe discrimination in the workplace on the basis of race or color, religion, sex, national origin, age, and/or disability. In some ways, Title VII of the Civil Rights Act of 1964 is the most critical of the statutes. It includes proscriptions relating to the largest number of protected classes (race, color, religion, sex, and national origin) and the full gamut of terms and conditions of work. Title VII also provides a critical model for later statutes, including the Age Discrimination in Employment Act of 1967 (ADEA) and the Americans with Disabilities Act of 1990 (ADA) as amended by the ADA Amendments Act (2008), which address the full gamut of terms and conditions for age and disability, respectively. The ADA updates the Vocational Rehabilitation Act of 1973 (Rehab-73). Section 503 of this Act was updated in March of 2014 (as well as the Vietnam Era Veterans' Readjustment Assistance Act of 1974), and this change is covered in Principle 14. Finally, the Equal Pay Act of 1963 (EPA) is a more surgical statute that proscribes only wage discrimination on the basis of sex, which is also included in Title VII as part of terms and conditions. Among other relevant laws, Executive Order 11246 (EO 11246) contains rules for contractors doing business with the federal government. These contractors must agree to provide (a) equal employment opportunity for all groups and (b) affirmative action for minorities and women if there are discrepancies between their representation in the employer's workforce and qualified workers in the labor pool. It is critical to note that the primary requirement for affirmative action in EO 11246 is recruitment and outreach, and there is no requirement (in fact, it is illegal under Title VII and other laws) to make actual selection decisions on the basis of race or gender. The key constitutional amendments governing employment decisions are the Fifth and Fourteenth Amendments, which apply to federal and state employers, respectively, and which have been used extensively in so-called reverse discrimination lawsuits on the basis of race or sex, providing overlapping coverage with Title VII. Additional overlapping coverage with Title VII is represented in the Thirteenth Amendment, which applies to the full range of terms and conditions of employment as in Title VII, with the exception of adverse impact challenges. Critically, although each of these amendments preceded Title VII, no one ever thought to use them in workplace discrimination cases until after Title VII, not only for overlapping coverage but also for gaps in coverage in Title VII that do not apply to constitutional claims, such as the minimum number of employees required in Title VII (N = 15); no such minimum workforce size exists for constitutional claims. Some final but important points to note are that each of the statutes, except for Rehab-73, requires that claims of discrimination be processed by the Equal Employment Opportunity Commission (EEOC).
In comparison, constitutional claims can go directly to federal district court. Additionally, the Office of Federal Contract Compliance Programs (OFCCP) of the Department of Labor (DoL) regulates EO 11246 and Rehab-73 in the nonfederal sector and typically uses an audit system for investigation. EEOC regulates EO 11246 in the federal sector, and the EEOC and Merit Systems Protection Board (MSPB) share responsibilities for federal claims under Title VII and the ADEA. Because this entire volume is dedicated to the concept of “selection,” some of our principles may appear to stray from that mark. We construe selection broadly to encompass many personnel actions. These include, but are not limited to, applicant screening, hiring, promotion, selection to training experiences that are gateways to promotion, and layoffs (or deselection). We believe that selection should be broadly and not narrowly construed because in each case a personnel decision is being made that implies a direct or indirect movement (or lack thereof) of an applicant or a current employee. Finally, in the context of a selection or promotion decision, we will occasionally comment on a co-relevant issue such as compensation or training. It is not that we intend to turn the focus to compensation, which has been an interesting EEO focus3 under the Obama administration, but rather that compensation decisions are often made at the same time as selection or promotion decisions and deserve comment.

PRINCIPLES AND EXEMPLAR CASE LAW

1. Companies using Internet recruitment should understand the definition of "Applicant" and be prepared to defend qualifications listed in the job description as well as procedures used to screen applicants. After publication of the Uniform Guidelines on Employee Selection Procedures (UGESP) in 1978, the EEOC answered a series of follow-up questions in 1979 (44 FR 11998). Among them, Q15 defined the term "Applicant" as follows: The precise definition of the term "applicant" depends upon the user's recruitment and selection procedures. The concept of an applicant is that of a person who has indicated an interest in being considered for hiring, promotion, or other employment opportunities. (UGESP, Q15)

This definition was rendered obsolete by subsequent developments in Internet recruitment, because it became easy to indicate interest in many jobs at the same time. Therefore, the EEOC amended Q15 on March 4, 2004 (FR Doc 04–4090) to require more than an expression of interest. Additionally, the OFCCP issued separate guidance on March 29, 2004 (41 CFR Part 60–1). The OFCCP focuses on broad recordkeeping for EO 11246, whereas the UGESP focuses on adverse impact under Title VII. Readers should note that SIOP’s Professional Practice Committee provided a detailed evaluation of the EEOC and OFCCP documents (Reynolds, 2004). The EEOC answered five new questions. Two of the answers clarify that Internet methodologies are covered by laws such as Title VII and by the UGESP, and two others clarify that Internet search criteria are subject to adverse impact rules and that selection procedures administered online are subject to the UGESP. The last and most critical of the five answers defines the term “Applicant” using the following three prongs: 1. The employer has acted to fill a particular position. 2. The individual has followed the employer’s standard procedures for submitting applications. 3. The individual has indicated an interest in the particular position.

For prong 1, the EEOC cites a company seeking two hires from among 200 recruits in a database. If 100 recruits in the database respond to employer inquiries about the position and 25 are interviewed, then all 100 responders are applicants and all 100 nonresponders are not applicants. For prong 2, the EEOC cites two examples: (a) if employers require completion of an “online profile,” only those completing the profile are applicants; and (b) if employers e-mail job seekers requesting applications, only those who respond are applicants. The prong 3 answer clarifies that individuals are not applicants if they (a) only post a resume, (b) express interest in several potential jobs, or (c) follow prong 2 for a job other than those the employer acts on (prong 1). OFCCP acknowledged the differences in recruiting practices from 1979 to present day in explaining the need for an updated applicant rule. In 2005, OFCCP issued regulations creating the Internet Applicant Recordkeeping Rule (“Internet Applicant Rule”), at the same time acknowledging that the UGESP’s definition was overly broad.4 The OFCCP established four criteria to define an “Internet Applicant”: 1. The individual submits an expression of interest in employment through the Internet or related electronic data technologies; 2. The contractor considers the individual for employment in a particular position; 3. The individual’s expression of interest indicates the individual possesses the basic qualifications for the position; and 4. The individual at no point in the contractor’s selection process prior to receiving an offer of employment from the contractor, removes himself or herself from further consideration or otherwise indicates that he or she is no longer interested in the position.5

The addition of basic qualifications is a key distinction between the EEOC and OFCCP definitions. Under the Internet Applicant Rule, basic qualifications must be (1) non-comparative, (2) objective, and (3) relevant to performance of the particular position. Another key distinction is that OFCCP allows for the application of neutral "data management techniques" as part of the determination of whether someone was considered for employment. Clearly, differing definitions of applicant could meaningfully affect who is included in and excluded from analyses, which in turn could influence the results of adverse impact analyses.
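To make that last point concrete, the following minimal sketch (hypothetical candidate records and field names; a rough proxy for the two definitions, not a restatement of either agency's rule) shows how the choice of applicant definition changes who enters the analysis and, therefore, the selection rates that feed an adverse impact calculation.

```python
# Illustrative sketch only: how different "applicant" definitions can change
# who is counted in an adverse impact analysis. All records are hypothetical.
candidates = [
    # applied_to_posting: followed the standard procedure for a specific posting
    # meets_basic_quals:  possesses the advertised basic qualifications
    # withdrew:           removed self from consideration before an offer
    {"id": 1, "applied_to_posting": True,  "meets_basic_quals": True,  "withdrew": False, "hired": True},
    {"id": 2, "applied_to_posting": True,  "meets_basic_quals": False, "withdrew": False, "hired": False},
    {"id": 3, "applied_to_posting": False, "meets_basic_quals": True,  "withdrew": False, "hired": False},  # resume only
    {"id": 4, "applied_to_posting": True,  "meets_basic_quals": True,  "withdrew": True,  "hired": False},
]

def eeoc_applicant(c):
    # Rough proxy for the EEOC's three prongs.
    return c["applied_to_posting"]

def ofccp_internet_applicant(c):
    # Rough proxy for the OFCCP's four criteria: adds basic qualifications and
    # drops people who took themselves out of consideration before an offer.
    return c["applied_to_posting"] and c["meets_basic_quals"] and not c["withdrew"]

for label, rule in [("EEOC", eeoc_applicant), ("OFCCP", ofccp_internet_applicant)]:
    pool = [c for c in candidates if rule(c)]
    rate = sum(c["hired"] for c in pool) / len(pool)
    print(f"{label}: {len(pool)} applicants, selection rate = {rate:.2f}")
```

In this toy example the same hiring decisions yield very different selection rates under the two definitions, which is precisely why the definition matters for subsequent impact analyses.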

Exemplar Case Law Related to Principle 1 Although there is as yet no case law citing the new definition of “Applicant,” or pitting EEOC and OFCCP definitions against each other, there is a relevant case decided under prior rules (Parker v. University of Pennsylvania, 2004). Parker, a White male, filed a Title VII reversediscrimination claim for failure to consider his application for various jobs at the university. Parker submitted his resume online. The university’s website advised individuals to submit their resumes into the database and, if they so choose, to apply for specific job postings. Parker received a letter stating that, if appropriate, he would receive a letter within 30 days notifying him of an interview, but that “otherwise this will be your only communication from us.” Summary judgment was granted to the university because Parker never expressed interest in a specific posted job. Parker also claimed adverse impact, but it was dismissed because he could provide no basis for assuming he was harmed. It should be noted that as a standard university policy, recruiters routinely searched the database and forwarded resumes to hiring officers of individuals who expressed interest in a posted job and who met the minimum qualifications (MQs) for that job. Interestingly, the district court judge ruled that Parker satisfied the four-prong prima facie test for disparate treatment from McDonnell Douglas v. Green (1973) in that he (a) was a protected class member, (b) applied for a position for which he was qualified, (c) was not hired, and (d) positions remained open for which Parker was qualified. However, the judge also ruled that the university articulated a legitimate nondiscriminatory reason for not hiring him and that Parker could not prove that the articulation was a pretext for discrimination. Accordingly: Defendant claims to have failed to consider Parker’s application because of its resume reviewing procedures: The reason Penn did not hire Parker is because he had not applied for any specific position, and it did not conduct any search of the resumes of non-applicants for positions being filled.

There are several things to note about this ruling. Although under prior rules, Parker would have been considered an applicant because he had submitted a general application, on the basis of the new rules Parker should not be considered an “Applicant” and therefore should lose at the prima facie level. Nevertheless, even nonapplicants may file valid adverse impact claims if they are “chilled” by MQs such as education requirements that disproportionately exclude minorities (Albemarle Paper Co. v. Moody, 1975; Griggs v. Duke Power, 1971) or physical requirements such as height and weight that disproportionately exclude females (Dothard v. Rawlinson, 1977).

2. Adverse impact may be measured using a variety of approaches. Both statistical significance tests and measures of practical significance may be useful approaches to measuring adverse impact. The measurement of adverse impact is a topic that continues to cause controversy.6 Some strategies for measuring adverse impact were codified in federal regulations (U.S. Equal Employment Opportunity Commission, 1978), whereas others were imported from foreign case law or can be found in regulatory compliance manuals (Cohen & Dunleavy, 2010). All of these methods have been criticized in the scholarly literature (e.g., Roth, Bobko, & Switzer, 2006).

Statistical significance tests are one approach to adverse impact measurement, and they typically evaluate the null hypothesis that the applicant population contains no differences in hiring rates among subgroups. Statistical significance testing has become a preferred method of evaluating adverse impact both by the courts (Esson & Hauenstein, 2006) and by federal agencies such as the OFCCP (Cohen & Dunleavy, 2009, 2010). These tests can be either directional or nondirectional depending on context and hypotheses of interest. The decision between these approaches can be controversial, because the choice of a one-tailed test can imply that we are interested in identifying disparities against only one group. Adverse impact statistics are often evaluated using two-tailed significance tests (e.g., OFCCP, 1993), particularly when analyses are proactive. When a specific claim of discrimination is made based on persuasive historical or anecdotal context, a one-tailed test may be reasonable. A variety of statistical tests that vary slightly in their underlying assumptions are available to assess adverse impact. These include the Z-test for the difference between two proportions, the Chi-square test of association, Fisher's Exact Test, and Lancaster's Mid-P (LMP) test.7 The statistical significance paradigm's appeal is intuitive in that it assesses whether results could be due to chance and provides a yes-or-no answer. However, it is not the only approach available, and in fact the question it answers is fairly narrow. The limitations of stand-alone significance tests have recently been noted in the I-O psychology literature (e.g., Dunleavy & Gutman, 2011; Jacobs, Murphy, & Silva, 2013; McDaniel, Kepes, & Banks, 2011; Murphy & Jacobs, 2012) and in practice-focused technical guidance (e.g., Cohen, Aamodt, & Dunleavy, 2010). This notion was recently echoed in a statement from the American Statistical Association (2016), which noted that "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold" and "A p-value, or statistical significance, does not measure the size of an effect or the importance of a result." In the social scientific realm, assessing the magnitude of variable relations is well accepted. Meaningful relationships between variables are usually more important than trivial nonzero relationships, regardless of significance test results. Yet the challenge is distinguishing what is meaningful from what is trivial. Assessment of practical significance involves both an effect size measure and a standard for determining what is of sufficient magnitude to create concern. A recent technical advisory committee (TAC) on adverse impact analyses (Cohen et al., 2010) noted that practical significance is an important consideration in addition to statistical significance.8 The TAC noted that it makes particular sense to evaluate practical significance after a disparity has been identified as statistically significant, or unlikely due to chance. There are many measures of practical significance, including difference measures like the absolute difference in rates and the h statistic difference transform, ratio measures like the adverse impact ratio (often evaluated via the four-fifths rule) and the odds ratio, and association measures like the phi coefficient and phi squared, the latter a proxy for traditional variance-accounted-for measures.
The courts have also considered measures of class size as a practical significance consideration (e.g., shortfall, shortfall relative to the number of employment decisions made), as well as flip-flop rules that assess whether a small set of changes in the data affect conclusions (see Morris & Dunleavy, 2015, for more details).
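To make several of these measures concrete, the following minimal sketch (hypothetical counts; standard library only; an illustration rather than a compliance tool) computes subgroup selection rates, the impact ratio evaluated against the four-fifths threshold, a two-proportion Z-test based on the normal approximation, and a simple shortfall figure.

```python
# Minimal sketch of common adverse impact measures, using hypothetical counts.
from math import sqrt, erfc

def adverse_impact_summary(sel_focal, n_focal, sel_ref, n_ref):
    p1, p2 = sel_focal / n_focal, sel_ref / n_ref      # subgroup selection rates
    impact_ratio = p1 / p2                             # four-fifths rule compares this to 0.80
    # Two-proportion Z-test using the pooled selection rate
    p_pool = (sel_focal + sel_ref) / (n_focal + n_ref)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_focal + 1 / n_ref))
    z = (p1 - p2) / se
    p_two_tailed = erfc(abs(z) / sqrt(2))              # two-tailed p from the normal approximation
    shortfall = p2 * n_focal - sel_focal               # selections "missing" at the reference group's rate
    return p1, p2, impact_ratio, z, p_two_tailed, shortfall

# Hypothetical example: 30 of 100 focal-group applicants and 50 of 100
# reference-group applicants are selected.
p1, p2, ir, z, p, short = adverse_impact_summary(30, 100, 50, 100)
print(f"rates: {p1:.2f} vs {p2:.2f}; impact ratio = {ir:.2f} (four-fifths threshold = 0.80)")
print(f"z = {z:.2f}, two-tailed p = {p:.4f}; shortfall = {short:.1f} selections")
```

In this hypothetical, the disparity is both statistically significant (p well below .05, or more than "two or three standard deviations") and practically meaningful under the four-fifths rule, illustrating why the two types of evidence are usually examined together.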

Exemplar Case Law Related to Principle 2 Statistical significance tests first appeared in Castaneda v. Partida (1977), which was a jury selection case. Later that year, statistical significance testing was applied in the Supreme Court ruling in Hazelwood School Dist. v. United States (1977), where a "two or three standard deviations" standard was established.9 A p-value of .05 (roughly corresponding to two standard errors) means that, if the selection procedure were neutral, a difference in selection rates at least as large as the one observed would be expected to occur by chance only about 5% of the time. As noted above, significance tests are found in compliance manuals and are generally given the most deference in case law.

The four-fifths rule, which is a standard for evaluating the impact ratio, is described in the UGESP as follows:

The four-fifths rule has been endorsed by case law. As an example, the 2nd Circuit ruled in Waisome v. Port Authority (1991) that a four-fifths rule violation was required even in situations where a disparity was statistically significant: We believe Judge Duffy correctly held there was not a sufficiently substantial disparity in the rates at which black and white candidates passed the written examination. Plainly, evidence that the pass rate of black candidates was more than four-fifths that of white candidates is highly persuasive proof that there was not a significant disparity.

Numerous courts have evaluated practical significance using differences in selection rates. For example, in Frazier v. Garrison I.S.D. (1993), a 4.5% difference in selection rates was deemed trivial when most applicants were accepted. In Moore v. Southwestern Bell Telephone Co. (1979), the court held that "employment examinations having a 7.1 percentage point differential between black and white test takers do not, as a matter of law, make a prima facie case of disparate impact."10 Several adaptations, or "flip-flop rules," are seen in case law. These rules evaluate the analytic consequences of making slight alterations to the underlying data used for analyses. For example, in Contreras v. City of Los Angeles (1981), the court evaluated the number of additional protected-group applicants who would need to be selected to eliminate the statistical significance of the disparity. Similar approaches were used in U.S. v. Commonwealth of Virginia (1978) and in the Waisome case described above. In these cases, if "one or two" additional passes from the "victim" group changed the statistical results, then the disparity was deemed trivial. More recently, courts have disagreed on how to measure adverse impact. In Stagi v. National RR Passenger Corp. (2012) and Smith v. City of Boston (2015), judges generally deferred to statistical significance tests over practical significance measures in concluding that adverse impact was meaningful. In Apsley v. Boeing (2013), an appeals court focused more on adjusted shortfall metrics than on statistical significance results and concluded that a disparity was trivial. In Lopez v. Lawrence (2014), a judge seemed to consider both significance tests and the four-fifths rule in concluding that a disparity was meaningful.
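A minimal sketch of the kind of sensitivity check these rulings describe appears below (hypothetical counts; one way such a check might be coded, not the courts' actual method): it asks how many additional focal-group selections would have to be "flipped" before the disparity stops being statistically significant at the conventional two-tailed .05 level.

```python
# Hypothetical "flip-flop" sensitivity check in the spirit of Contreras and Waisome.
from math import sqrt, erfc

def two_tailed_p(sel_focal, n_focal, sel_ref, n_ref):
    # Two-proportion Z-test (normal approximation), returning a two-tailed p-value.
    p1, p2 = sel_focal / n_focal, sel_ref / n_ref
    p_pool = (sel_focal + sel_ref) / (n_focal + n_ref)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_focal + 1 / n_ref))
    return erfc(abs(p1 - p2) / se / sqrt(2))

sel_focal, n_focal, sel_ref, n_ref = 30, 100, 50, 100   # hypothetical counts
added = 0
# Keep hypothetically converting one more focal-group rejection into a selection
# until the disparity is no longer significant (or no rejections remain).
while added < n_focal - sel_focal and two_tailed_p(sel_focal + added, n_focal, sel_ref, n_ref) < .05:
    added += 1
print(f"additional focal-group selections needed to lose significance: {added}")
```

A result of one or two would suggest a fragile disparity of the kind those courts discounted; a larger number suggests the disparity is not an artifact of a handful of individual decisions.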

3. The search for alternatives to procedures that might result in or have resulted in lesser adverse impact is typically split between defendants and plaintiffs. At the outset of a selection project, employers should typically consider alternatives that may meet business needs, be job-related, and minimize adverse impact. After a demonstration of job-relatedness, the plaintiffs have the burden of demonstrating that there is an alternative procedure that would result in the same or similar levels of job-relatedness but show lesser adverse impact. In years past, for both of these stages (initial research for the employer and post-job-relatedness demonstration for the plaintiff), the search for alternatives may have been considered an onerous and overly burdensome task. However, the I-O literature has expanded significantly since the mid- to late 1990s with regard to decision-making tools and strategies for assessing the tradeoffs between validity and adverse impact.11 As an example, De Corte, Lievens, and Sackett (2007) investigated the possibility of combining predictors to achieve optimal tradeoffs between selection quality and adverse impact. This article was a follow-up to earlier research in this area. De Corte (1999) and Sackett and Ellingson (1997) investigated the effects of forming predictor composites on adverse impact. Other researchers have used meta-analytic techniques to forecast the likely outcomes of combining high and low adverse impact predictors in a selection procedure (Bobko, Roth, & Potosky, 1999; Schmitt, Rogers, Chan, Sheppard, & Jennings, 1997). Thus, employers and their consultants (in-house or external) may want to become familiar with this literature to meet their burden of demonstrating a reasonable search for alternatives at the outset of a selection project. After the fact, once adverse impact has been shown and defendants have adequately made the job-related argument, it is the plaintiffs' burden to demonstrate a feasible alternative to that job-related process that would have lesser adverse impact while retaining job-relatedness. One of the interesting aspects of this requirement is that plaintiffs may be able to meet their burden by demonstrating that there is an alternative method of using the selection procedure (e.g., an alternative method of weighting predictor components) that would have less adverse impact without affecting validity (Section 5[G] of the UGESP). The evidence of the validity and utility of a selection procedure should support the method the user chooses for operational use of the procedure, if that method of use has a greater adverse impact than another method of use. This principle was established in Albemarle Paper Co. v. Moody (1975). The Supreme Court ruled that if job-relatedness is proven, the plaintiff may prove pretext by showing that "other tests or selection devices, without a similarly undesirable racial effect, would also serve the employer's legitimate interest in 'efficient and trustworthy workmanship'." This principle was subsequently written into the UGESP and later codified in CRA-91.
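The following simplified sketch (hypothetical validities, subgroup differences, and predictor intercorrelation; formulas assume standardized predictors and equal subgroup covariance matrices, in the spirit of the composite research cited above rather than a reproduction of any particular study) shows how shifting weight between two predictors trades composite validity against the composite subgroup difference.

```python
# Simplified validity vs. adverse impact tradeoff for a two-predictor composite.
# All input values are hypothetical.
import numpy as np

r_y  = np.array([0.50, 0.25])   # hypothetical criterion validities (e.g., cognitive, noncognitive)
d    = np.array([1.00, 0.20])   # hypothetical standardized subgroup differences on each predictor
r_12 = 0.20                     # hypothetical predictor intercorrelation
R    = np.array([[1.0, r_12], [r_12, 1.0]])

for w1 in (1.0, 0.7, 0.5, 0.3, 0.0):
    w = np.array([w1, 1.0 - w1])
    sd_c = np.sqrt(w @ R @ w)            # standard deviation of the weighted composite
    validity_c = (w @ r_y) / sd_c        # composite correlation with the criterion
    d_c = (w @ d) / sd_c                 # approximate composite subgroup difference
    print(f"weight on predictor 1 = {w1:.1f}: validity = {validity_c:.2f}, subgroup d = {d_c:.2f}")
```

Under these illustrative inputs, moving weight away from the high-impact predictor reduces the composite subgroup difference considerably while reducing validity only modestly, which is the basic tradeoff the cited literature explores far more rigorously.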

Exemplar Case Law Related to Principle 3 Over the years, there have been unsuccessful attempts to prove job-related alternatives with lesser adverse impact (e.g., Bridgeport Guardians, Inc. v. City of Bridgeport, 1991). However, there were three 2006 district court rulings in which the principle was generally supported. These rulings are discussed in detail by Outtz (2007) and included Ricci v. Destefano (2006), Bradley v. City of Lynn (2006), and Johnson v. City of Memphis (2006). The Ricci ruling was overturned by the Supreme Court (Ricci v. Destefano, 2009), and the Johnson ruling was overturned by the 6th Circuit Court (Johnson v. City of Memphis, 2014). On the other hand, Bradley has not been overturned. In addition, one of the plaintiffs in Ricci (Michael Briscoe) sued, claiming that a 60% to 40% split between a written and oral exam produced adverse impact on minority candidates (Briscoe v. New Haven, 2011). We summarize these cases as follows. The one case that has not been appealed is Bradley v. City of Lynn (2006), in which a cognitive test was the sole basis for selecting entry-level firefighters. Adverse impact was demonstrated, and the judge ruled there was insufficient evidence of job-relatedness. The judge also ruled that there were likely several equally valid alternatives with lesser impact, most notably the combination of cognitive and physical abilities (e.g., Brunet v. City of Columbus, 1995), as well as personality (work style) and biodata devices, which when combined with cognitive tests may produce less adverse impact than cognitive tests alone. The district court judge ruled: “while none of these approaches alone provides the silver bullet, these other noncognitive tests operate to reduce the disparate impact of the written cognitive examination.” In Johnson v. City of Memphis (2006), the judge ruled that an examination used for promotions to police sergeant was job-related, but ruled for the plaintiffs because of the existence of valid alternatives with lesser impact. Critical to this case was that the city had previously used a valid promotion test in 1996 and deviated from its prior procedure in the challenged test. The district court judge ruled: “It is of considerable significance that the City had achieved a successful promotional program in 1996 and yet failed to build upon that success.” However, the ruling was reversed on appeal, where the 6th Circuit accepted the defendant’s argument that the “protracted nature of simulation testing and the number of moving parts reinforced the City’s

concerns about testing security." The 6th Circuit also accepted the defendant's argument that the video component took too long to administer and was too costly. In Ricci v. Destefano (2006), the New Haven Civil Service Board (CSB) refused to certify promotional tests for lieutenant and captain positions because the Board had a "good faith" belief it would lose on adverse impact and offered an alternative (i.e., assessment centers) it believed was more valid with less adverse impact. However, the Supreme Court favored a higher standard—that the CSB needed a "strong basis in evidence" for believing it would lose on adverse impact. In Briscoe v. City of New Haven (2010), Briscoe was the top scorer among 77 applicants for lieutenant on the oral exam, but he ranked 24th overall because of his poor performance on the written test. He challenged the 60–40 weighting favoring written tests on the basis that it was arbitrary and that a 70–30 weighting favoring the oral exam was as valid and would have less adverse impact on minorities. The district court judge rejected the claim based on the defense argument that the Board had a strong basis for believing it would lose on disparate treatment (an argument they never previously made). However, the Circuit Court overturned the ruling, rejecting this logic, and remanded the case for trial. This trial would have been interesting, but we will never know the outcome because Briscoe settled for $285,000 and a transfer to a director position. So where does that leave the alternatives argument? The Bradley ruling still holds (so far), and the Briscoe settlement leaves room for the composite approaches advocated by De Corte (1999) and Sackett and Ellingson (1997). However, the reversal of the Johnson ruling is critical because, at least so far, it was the only ruling favoring alternatives in a case where the defendant had already won on proof of job-relatedness. So, we leave you with the same caveat as in the last edition—stay abreast of future decisions in this arena.

4. Work analysis should typically precede any selection process. This is one point about which there is little disagreement. The SIOP Principles, the APA Standards, and the UGESP all clearly indicate that a demonstration of the validity of a selection device usually begins with a thorough analysis of the job if the criterion of interest is job performance. Nevertheless, if the criterion of interest is turnover, absenteeism, or other inherently important work outcomes, a full work analysis may not be necessary. Work analysis generally refers to a detailed examination of a job, which may include the determination of what is done, how it is done, the context in which it is done (including the strategic importance of that job and its essential functions to organizational success), as well as the knowledge, skills, abilities, and other characteristics (KSAOs) required to perform the job successfully. The most common “O” characteristic is some measure of personality. Work analysis is also potentially important in demonstrating criterion-related validity because it establishes the importance or relevance of the criterion measure when some type of performance rating tool is used. Finally, work analysis is critical for demonstrating content validity because it provides a comparator for test content and establishes fidelity of the test content to the job content. Proper work analysis is a critical requirement to meet professional standards for content-oriented validation. Work analysis should establish the linkage between important functions/characteristics of the job and KSAOs, and it should form the basis for linking a selection procedure to those KSAOs (or duties if the procedure is a work sample or simulation), thus linking the selection device to the job. The importance of work analysis to validation, particularly content validation, is clearly established in the UGESP, which provide the following guidance regarding the importance of work analysis: Validity studies should be based on review of information about the job. Any validity study should be based upon a review of information about the job for which the selection procedure is to be used. The review should include a work analysis. Section 14 (C2) is also instructive. (UGESP, Section 14(A))

Work analysis is also essential for content validity: There should be a work analysis which includes an analysis of the important work behavior(s) required for successful performance and their relative importance and, if the behavior results in work product(s), an analysis of the work product(s). Any work analysis should focus on the work behavior(s) and the tasks associated with them. If work behavior(s) are not observable, the work analysis should identify and analyze those aspects of the behavior(s) that can be observed and the observed work products. The work behavior(s) selected for measurement should be critical work behavior(s) and/or important work behavior(s) constituting most of the job. (Section 1607.14(C)(2))

Another point to note is that work analysis is often critical to the definition of essential job functions in the Americans with Disabilities Act (ADA), as employers who exclude individuals for failure to perform nonessential job functions are in violation of this statute. A final consideration is the frequency with which a work analysis should be updated. Although there is no specific guidance from regulatory agencies, many practitioners refresh work analyses every five to eight years regardless of significant changes to the job to ensure currency. Our view is that, all else being equal, five to eight years is no worse than any other rule of thumb, but the probative issue is whether the job has changed in meaningful ways.
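To make the linkage logic described in this principle concrete, the following minimal sketch (hypothetical tasks, KSAOs, and ratings; one of many possible weighting schemes, not a method prescribed by any of the authorities discussed here) combines task importance ratings from a work analysis with task-KSAO linkage judgments to suggest the relative emphasis each KSAO might receive in a selection procedure or test plan.

```python
# Hypothetical sketch: deriving suggested KSAO emphasis from work analysis data.
# All names and numbers are invented for illustration.
task_importance = {"investigate incidents": 5, "write reports": 4, "interact with public": 3}

# SME linkage ratings (0 = not needed, 3 = essential) of each KSAO to each task
linkage = {
    "reasoning":     {"investigate incidents": 3, "write reports": 2, "interact with public": 1},
    "written skill": {"investigate incidents": 1, "write reports": 3, "interact with public": 1},
    "interpersonal": {"investigate incidents": 1, "write reports": 0, "interact with public": 3},
}

# Weight each KSAO by the importance of the tasks it supports, then normalize.
raw = {k: sum(task_importance[t] * v for t, v in links.items()) for k, links in linkage.items()}
total = sum(raw.values())
for ksao, score in sorted(raw.items(), key=lambda kv: -kv[1]):
    print(f"{ksao:14s} suggested weight = {score / total:.2f}")
```

The documentation trail such an analysis produces (tasks, importance ratings, linkages, and the resulting test plan) is exactly the kind of evidence the content validity cases discussed below turn on.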

Exemplar Case Law Related to Principle 4

The role of work analysis in criterion validity was established in Griggs v. Duke Power (1971) and Albemarle Paper Co. v. Moody (1975). Duke Power attempted to defend cognitive tests on the grounds that they were “professionally developed,” yet it conducted no validity study. On the basis of the 1966 EEOC Guidelines, the Supreme Court ruled that a professionally developed test must be one that: [F]airly measures the knowledge or skills required by the particular job or class of jobs which the applicant seeks, or which fairly affords the employer a chance to measure the applicant’s ability to perform a particular job or class of jobs. (Griggs v. Duke Power, 1971, footnote 9)

The Albemarle Paper Company attempted to defend cognitive tests with a deficient criterion validity study conducted one month prior to trial. The most critical deficiency was the absence of work analysis. In Moody, the 4th Circuit ruled: In developing criteria of job performance by which to ascertain the validity of its tests, Albemarle failed to engage in any work analysis. Instead, test results were compared with possibly subjective ratings of supervisors who were given a vague standard by which to judge job performance. Other courts have expressed skepticism about the value of such ill-defined supervisor appraisals. (Moody v. Albemarle, 1973)

The Supreme Court affirmed the 4th Circuit ruling, and the Griggs and Albemarle rulings formed the basis of the 1978 UGESP. The role of work analysis in content validity was established in Guardians v. Civil Service (1980) and has since been followed by every circuit that has ruled on content validity. Relevant cases for content validity are discussed under Principle 5. On the basis of Guardians and other cases, it is axiomatic that content validity cannot be established absent work analysis. Finally, the importance of work analysis under the ADA is illustrated in PGA v. Martin (2001). Casey Martin, a professional golfer who had a degenerative nerve disease and could not walk a golf course without pain, requested use of a golf cart during PGA tournaments as a reasonable accommodation. The PGA refused on grounds that walking the course is fundamental to PGA events. However, the Supreme Court ruled that walking is not “an essential attribute of the game.” Similar examples include Borkowski v. Valley Central (1995) (whether
the requirement to control students is an essential function for a librarian requesting a teacher’s aide for that purpose) and Stone v. City of Mt. Vernon (1997) (whether a paraplegic former firefighter requesting a desk job must also fight fires). The point here is that essential functions must stand up to work analysis; otherwise, employers could easily define themselves out of the reach of the ADA without actual evidence.

5. Validation evidence for standardized tests and similar procedures can come in many forms and is not limited to one particular approach. In general, the greater the number of converging sources of validation evidence, the better. In the period immediately following passage of the Civil Rights Act of 1964, criterion-related validity was generally considered the gold standard for standardized tests, mainly because the 1970 EEOC Guidelines required that employment tests be “predictive of or significantly correlated with important elements of work behavior.” Subsequently, the 1978 UGESP added content and construct-related validation to the arsenal of tools for demonstrating job-relatedness, thus finally cementing the “Trinitarian” view of validation, which had been introduced in the psychometric literature in the 1950s. However, the UGESP contain the following warning on the “appropriateness of content validity studies”: A selection procedure based on inferences about mental processes cannot be supported solely or primarily on the basis of content validity. Thus, a content strategy is not appropriate for demonstrating the validity of selection procedures that purport to measure traits or constructs such as intelligence, aptitude, personality, common sense, judgment, leadership, and spatial ability. (UGESP, 1978, Sec. 1607.14(C)(1))

For a short time, this passage perpetuated the myth that criterion-related validity is superior to content-related validity. By the mid-1980s, there was a growing consensus that discussions of “acceptable” models for validation were inappropriate because virtually any validation evidence, regardless of how it is gathered, is possible evidence of job-relatedness and should be judged on its merits rather than by its name. This “Unitarian” view is now widely accepted by I-O psychologists and the courts. Furthermore, in recent decades, there have been several new and well-accepted techniques within and beyond the “Trinity,” including the following as described by the SIOP Principles (2003):
• Validity transport: the use of a specific selection procedure in a new situation based on results of a validation research study conducted elsewhere12
• Synthetic validity: justification of the use of a selection procedure based upon the demonstrated validity of inferences from scores on the selection procedure with respect to one or more domains of work (job components)
• Meta-analysis: the accumulation of findings from a number of validity studies to determine the best estimates of the predictor-criterion relationship for the kinds of work domains studied (a bare-bones numerical sketch follows this list)
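As a purely illustrative aside, the following sketch shows the bare-bones arithmetic behind the meta-analytic approach just listed: a sample-size-weighted mean validity coefficient and a comparison of the observed variance with the variance expected from sampling error alone. The validity coefficients and sample sizes are invented, and the sketch omits the corrections for unreliability and range restriction that a full psychometric meta-analysis would apply.

```python
# Bare-bones meta-analytic estimate of a predictor-criterion relationship.
# Validity coefficients (r) and sample sizes (n) below are invented examples;
# corrections for criterion unreliability and range restriction are omitted.

studies = [(0.25, 120), (0.31, 85), (0.18, 200), (0.40, 60), (0.27, 150)]

total_n = sum(n for _, n in studies)
mean_r = sum(r * n for r, n in studies) / total_n                # N-weighted mean validity

observed_var = sum(n * (r - mean_r) ** 2 for r, n in studies) / total_n
sampling_error_var = ((1 - mean_r ** 2) ** 2) / (total_n / len(studies) - 1)

print(f"Weighted mean validity: {mean_r:.3f}")
print(f"Observed variance: {observed_var:.4f}")
print(f"Expected sampling-error variance: {sampling_error_var:.4f}")
print(f"Share of variance attributable to sampling error: "
      f"{min(1.0, sampling_error_var / observed_var):.0%}")
```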

Therefore, when evaluating job-relatedness of standardized tests and similar procedures, it is often valuable to collect evidence from as many different sources or validation designs as feasible, without considering one particular design to be the best or only design. In general, the more evidence that is collected, the greater will be one’s confidence in asserting job-relatedness.

Exemplar Case Law Related to Principle 5

The 1966 EEOC Guidelines were supported by the Supreme Court in Griggs v. Duke Power (1971) and Albemarle Paper Co. v. Moody (1975). However, after the UGESP were adopted in 1978, courts almost immediately and essentially overruled them on this point by supporting content validity for selection procedures involving inferences about mental processes. The landmark case supporting content validity was the 2nd Circuit ruling in Guardians v. Civil Service (1980), which outlined five steps for content validity:

1. Suitable work analysis
2. Reasonable competence in test construction
3. Test content related to job content
4. Test content representative of job content
5. Scoring systems selecting applicants who are likely to be better job performers

In Guardians v. Civil Service (1980), although the defendants could not support a cut score or rank ordering on the basis of weakness in steps 2 and 3 (see Principle 8 below), they were able to use a content validation strategy for demonstrating job-relatedness. In Gillespie v. State of Wisconsin (1985), the 7th Circuit ruled that “neither the Uniform Guidelines nor the psychological literature express a blanket preference for criterion-related validity,” and in Police Officers v. City of Columbus (1990), the 6th Circuit, citing the 1987 SIOP Principles, ruled that it is critical that “selection instruments measure a substantial and important part of the job reliably, and provide adequate discrimination in the score ranges involved.” Content validity was then supported in many subsequent cases, including by the 2nd Circuit in Gulino v. New York State Education Department (2006). Since the prior edition of the Handbook, there have been several additional rulings addressing the sufficiency of content evidence, including (1) a more recent ruling in Gulino v. New York State Education Department (2012), in which the content validity of a teacher licensing examination was struck down for failure to meet all five Guardians criteria, and (2) Smith v. City of Boston (2015), in which the content validity of a police promotion exam to sergeant was struck down because of failures in steps 4 (test content representative of job content) and 5 (reliable scoring system) from Guardians. However, in Lopez v. City of Lawrence (2014), the content validity of police promotion exams to sergeant was upheld, even though the judge accepted the opinion of one of the experts in that case that the exams were “minimally valid.” On the other hand, attempts to support job-relatedness on the basis of meta-analysis alone have often failed (EEOC v. Atlas Paper, 1989; Lewis v. Chicago, 2005). Nevertheless, meta-analysis has been credited as supplemental to local validity studies (Adams v. City of Chicago, 1996; Williams v. Ford Motor Co., 1999). Also, transportability and corrections for statistical artifacts were supported in Bernard v. Gulf Oil Corp. (1989), in which criterion validity was found for two of five jobs, and the 5th Circuit found “sufficient similarity in the skills required” for all five jobs. The Court also ruled that “the adjusted figures . . . are better estimates of validity” and that uncorrected coefficients “underestimate” the validity of the tests. In summary, although any one of the “Trinity” approaches alone is likely sufficient for establishing job-relatedness, a notion consistent with the UGESP, the body of case law as a whole suggests that a “Unitarian” approach leveraging multiple sources of validation may have greater acceptance in court and that new methods of supporting validity are at least acceptable supplements to traditional methods. A final point to note is that each of us has been involved in content validity cases, some still ongoing as this principle was being written, and it is our opinion that step 2 in Guardians v. Civil Service (1980) (reasonable competence in test construction) often requires greater expertise for tests involving simulations than for multiple-choice tests. Therefore, our advice to those entities choosing to use simulations is to consider enlisting outside expertise from among those with a proven track record in designing such exams.

6. In a selection context, criterion information, particularly in the form of performance ratings, is as important in validation as is predictor information. For more than 50 years, I-O psychology has recognized the importance of the criterion in a human resource (HR) system. In the context of validation, criterion information also represents the second “half” of the criterion-related validation model. Criterion information often appears in the form of supervisor ratings of performance. As is the case with predictors, the credibility of criterion information should be established through work analysis information, supported via psychometric research (e.g., demonstrating reliability or factor structure), or both when possible.

Performance evaluations often play a significant role in non-entry selection decisions such as promotions, training assignments, job changes, and reductions in force. To the extent that criterion information becomes predictor information (e.g., performance ratings are at least a partial foundation for a personnel decision such as promotion or downsizing), that criterion information may be considered part of the selection process and should be analyzed in a way that permits the inference that this performance information was job-related, psychometrically credible, and fair. Examinations of criterion information often involve not only substance issues but also process issues, such as the right of an employee to “appeal” information, as well as the extent to which those providing criterion information are knowledgeable and competent judges of the employee’s performance.

Exemplar Case Law Related to Principle 6

Performance appraisal was a key feature in two landmark Supreme Court rulings on adverse impact: Albemarle Paper Co. v. Moody (1975), a hiring case, and Watson v. Fort Worth Bank (1988), a promotion case. It was also a key feature in Meacham v. Knolls (2006). There were multiple deficiencies in the criterion validity study conducted by the defendant in Albemarle. Chief among them was the failure to establish a reliable and valid criterion against which test scores were compared. In the words of the Supreme Court, “The study compared test scores with subjective supervisorial rankings.” Although UGESP allow the use of supervisorial rankings in test validation, they quite plainly contemplate that the rankings will be elicited with far more care than was demonstrated in Albemarle. Albemarle’s supervisors were asked to rank employees by a “standard” that was extremely vague and fatally open to divergent interpretations. Each “job grouping” contained several different jobs, and the supervisors were asked, in each grouping, to “determine which ones [employees] they felt, irrespective of the job that they were actually doing, but in their respective jobs, did a better job than the person they were rating against.” In Watson, the main challenge was to subjective ratings of job performance; Clara Watson also challenged subjective ratings of interview performance and past experience. The central issue was whether subjective procedures can be causes of adverse impact, and the Supreme Court ruled 8–0 that they can. In Meacham, 30 of 31 employees laid off in an involuntary reduction in force (IRIF) were over age 40. The layoffs were based entirely on performance appraisal ratings. In the words of the 2nd Circuit, the key RIF criteria were “subjective assessments of criticality and flexibility” of employee skills (Meacham v. Knolls, 2006). In summary, irrespective of whether the issue is hiring, promotion, termination, or the criterion in a criterion validity study, the Albemarle, Watson, and Meacham rulings carry inherent warnings that performance appraisals should be based on work analysis and that the methodology used to appraise worker performance should meet acceptable psychometric standards.

7. The evidence required to demonstrate job-relatedness in the legal context typically differs for biographical factors (e.g., educational requirements) and physical factors (e.g., height and weight) than it does for standardized tests. Principle 5 discusses approaches to demonstrating job-relatedness for standardized tests, but there are sources of adverse impact other than standardized tests, and these can serve as minimum qualifications (MQs) for selection, particularly in hiring. As discussed in Principle 5, demonstration of job-relatedness for standardized tests is exacting. In comparison, case law reveals that legal proof of job-relatedness for biographical MQs is often less exacting and legal proof of job-relatedness for physical factors is often more exacting than legal proof of job-relatedness for standardized tests. In fact, Gutman et al. (2010) refer to defending biographical MQs as “adverse impact light” and defending physical factors as “adverse impact heavy.” In this scheme, then, defending standardized tests falls somewhere in between (call it “adverse impact moderate”).

Exemplar Case Law Related to Principle 7

The term “adverse impact light” implies it is generally easier to prove job-relatedness with biographical variables, particularly if public safety is threatened. The two most extreme examples of the light defense are Hyland v. Fukada (1978) and NYC v. Beazer (1979). In Fukada, the 9th Circuit accepted articulated safety concerns as grounds for excluding a felon previously convicted of armed robbery from a security guard position. In Beazer (1979), the Supreme Court upheld exclusion of methadone users from the position of transit authority police officer because it is obvious that drug addiction threatens the “legitimate employment goals of safety and efficiency.” Similarly, in Davis v. Dallas (1985), the 5th Circuit accepted an articulated reason for excluding recent drug users from police work—that it shows a disregard for the law. In some cases involving biographical variables, stronger evidence has been required, but none rising to the level of the defense the UGESP prescribe for standardized tests. For example, in Spurlock v. United Airlines (1972), a Black applicant was excluded from flight officer training because (1) he had only 204 hours of flight time (500 were required) and (2) he had only two years of college (a four-year degree was required). The defense for the flight time variable showed a significant negative correlation between flight hours and training failures. However, on the degree requirement, an expert testified that a four-year degree was necessary to “cope” with rigorous classroom training requirements. Other examples of the lighter defense include United States v. Buffalo (1978), where a high school diploma requirement for police officers was upheld based on federal commission reports in the 1960s that “a high school education is a bare minimum requirement for successful performance of the policeman’s responsibilities.” Also, in Davis v. Dallas (1985), the 5th Circuit upheld a requirement of 45 hours of college credit with C or better grades for police officers based on the task force reports cited in United States v. Buffalo (1978), as well as a “poor driving” exclusion based on research indicating that past driving habits predict future driving habits. The adverse impact “heavy” designation implies a greater defense burden for physical requirements as compared to standardized tests. The best illustration of this defense is Dothard v. Rawlinson (1977), where the Supreme Court rejected a height and weight requirement that adversely impacted women applying for prison guard positions in an all-male maximum security prison. After failing in defense of this requirement, the State of Alabama argued successfully that it was necessary to exclude all women from the job because 20% of the prison population consisted of sex offenders, and having women in this situation presented an extra threat to prison safety. Thus, it was easier to defend exclusion of all women than it was the exclusion of all people who did not meet minimal height and weight requirements. Stated differently, exclusion based on pure physical requirements, in effect, rose to the level of the Bona Fide Occupational Qualification (BFOQ) defense.13 In other cases, employers have sometimes succeeded in defending physical requirements, as in Boyd v. Ozark Air Lines (1977), where the airline proved that shorter pilots could not safely operate all cockpit instruments. However, the more typical result in such cases is illustrated in Horace v. Pontiac (1980), where the police department asserted that being tall is necessary for police officers to fend off and gain the respect of criminals, to which the 6th Circuit responded that there were superior and more direct methods of assessing such capabilities. In Boyd, by contrast, the physical requirement was tied to a critical job attribute. Additionally, consider Bradley v. Pizzaco of Nebraska, Inc. (1993), where the charge was adverse impact against Blacks that occurred because of a “no beards” policy. In its defense, Domino’s Pizza offered survey data indicating customer preference for clean-shaven counter and delivery staff. However, the 8th Circuit struck down this defense, ruling “the existence of a beard on the face of a delivery man does not affect in any manner Domino’s ability to make or deliver pizzas to their customers.” In comparison, a “no beards” policy was upheld in Fitzpatrick v. Atlanta (1993), because the City of Atlanta proved it is essential for firefighters to be beardless for facial safety equipment to function properly; a beard would pose a danger to the firefighter, to other firefighters, and to civilians whose lives might be in danger. In summary, especially when public safety is a primary concern, courts have a history of accepting less rigorous defenses for biographical variables than they have required for
standardized tests. On the other hand, the defense for pure physical characteristics boils down to proof that the requisite characteristic is necessary for the business entity to survive. This standard imposes a heavier defense burden than for standardized tests because evidence of (for example) content validity of a test does not imply that a business entity is terminally threatened if the test is not used.

8. Cut scores should be based on a rational foundation that may or may not include empirical analyses. There are several types of cut scores. The first is a nominal cut score. This is a value often established as an arbitrary pass/fail score in a multiple-hurdle system. It designates the threshold for continuing to the next stage in the process. The second is known as the effective cut score. This is the score below which no one is hired or appointed. It is not predetermined but simply identified after the fact. Such scores are most often seen in strict rank-order appointments where candidates are appointed from the top scorer down until all positions are filled. The score of the individual filling the last opening is the effective cut score. The third type of cut score is the critical cut score. This value has been chosen to represent the score below which an applicant is thought to fall below a minimal standard for effective job performance. Often, a critical cut score is tied to the safety or well-being of the candidate or the public/customer base via some type of criterion-related analysis or subject matter expert judgment. Nominal cut scores are often set by combining practical and theoretical issues. A practical consideration might be the cost of testing. Assume there are 1,000 applicants for 20 openings and there will be a multistage assessment process. The employer might want to limit the cost of testing by eliminating many individuals at the first stage of the process. In this case, the cut score can be set by looking at the selection ratio and deriving some estimate of acceptable performance expectations for candidates. There is no need to “set” an effective cut score; it is simply the score of the last person hired or appointed in a rank-ordered list of candidates. With respect to a critical cut score, there are several options available, but all require some consideration of a criterion level that distinguishes between competent and incompetent or minimally qualified and less-than-minimally qualified. Subject matter experts can estimate requisite predictor performance and associated criterion performance. Such estimates, of course, would benefit greatly from a comprehensive and accurate work analysis. Incumbent populations can be used to identify predictor scores associated with minimally acceptable performance. If predictor and criterion data are available, analyses could be conducted to identify a critical cut score. Regardless of which techniques are used to set the critical cut score, there should be a rational foundation for its choice or an empirical one based on a work analysis. Although it is axiomatic that a single score (i.e., a cut score) cannot be “validated,” it is possible to produce evidence that provides some confidence that a critical cut score is related to an anticipated real-world outcome.
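To make these distinctions concrete, the sketch below is a minimal, hypothetical illustration of the three types of cut scores. The applicant scores, the fraction advanced, the number of openings, and the subject matter expert (SME) estimates are all invented, and the critical cut score is operationalized with only one of the several judgment-based options mentioned above.

```python
# Minimal, hypothetical illustration of the three cut-score types discussed above.
# All numbers (scores, openings, SME judgments) are invented for the example.

def nominal_cut(scores, keep_fraction):
    """Nominal cut: a pass/fail threshold chosen for practical reasons,
    e.g., advancing only the top fraction of applicants to the next hurdle."""
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[k - 1]          # lowest score still advanced

def effective_cut(scores, openings):
    """Effective cut: not set in advance; it is simply the score of the
    last person appointed when selecting in strict rank order."""
    ranked = sorted(scores, reverse=True)
    return ranked[min(openings, len(ranked)) - 1]

def critical_cut(sme_estimates):
    """Critical cut: a judgment about minimally acceptable performance,
    operationalized here as the average of SME estimates of the score
    a minimally qualified candidate would earn."""
    return sum(sme_estimates) / len(sme_estimates)

applicant_scores = [88, 92, 75, 61, 83, 95, 70, 66, 79, 90]   # invented
print(nominal_cut(applicant_scores, keep_fraction=0.5))        # advance top half
print(effective_cut(applicant_scores, openings=3))             # last rank-order hire
print(critical_cut([70, 72, 68, 75]))                          # SME-based minimum
```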

Exemplar Case Law Related to Principle 8

Until the 3rd Circuit’s ruling in Lanning v. SEPTA (1999), all courts relied on the UGESP, which state: “Where cut-off scores are used, they should normally be set so as to be reasonable and consistent with normal expectations of acceptable proficiency within the work force” (Section 1607.5(H)). For example, in Guardians v. Civil Service (1980) and Gillespie v. State of Wisconsin (1985), two cases discussed earlier under Principle 5, the 2nd and 7th Circuits ruled that: An employer may establish a justifiable reason for a cut-off score by, for example, using a professional estimate of the requisite ability levels, or, at the very least by analyzing the test results to locate a logical break-point in the distribution of scores.

Both cases employed content validity strategies. However, the defendants lost on the cutoff score issue in Guardians v. Civil Service (1980) but won in Gillespie. In Guardians, the defendants made selections in strict rank order and defined the cutoff score as the score of the last applicant selected (an effective cut score). The Guardians Court ruled: If it had been shown that the exam measures ability with sufficient differentiating power to justify rank-ordering, it would have been valid to set the cutoff score at the point where rank-ordering filled the City’s needs. . . . But the City can make no such claim, since it never established a valid basis for rank-ordering.

However, in Gillespie, the 7th Circuit ruled in favor of the defendant on the basis of two factors: (a) establishment of inter-rater reliability (of grades) and (b) the cutoff was selected to permit interviewing “as many minority candidates as possible while at the same time assuring that the candidates possessed the minimum skills necessary to perform” the job. On rank ordering, the 7th Circuit cited verbatim from Q62 of the “Questions and Answers” of the UGESP: Use of a selection procedure on a ranking basis may be supported by content validity if there is evidence from work analysis or other empirical data that what is measured by the selection procedure is associated with differences in levels of job performance.

Therefore, until the Lanning ruling, there was little dispute among the circuit courts on either cutoff scores or rank ordering. In Lanning, the 3rd Circuit interpreted the terms “job-related” and “consistent with business necessity” from the Civil Rights Act of 1991 as implying separate standards and held that the “business necessity” part implied proof that the cutoff score “measures the minimum qualifications necessary for successful performance of the job in question.” This interpretation was explicitly rejected in Bew v. City of Chicago (2001), where the 7th Circuit ruled: “Griggs does not distinguish business necessity and job relatedness as two separate standards.” All other circuits have continued to follow precedents from Guardians and Gillespie, with the exception of a district court ruling in Isabel v. City of Memphis (2005), which was affirmed by the 6th Circuit on appeal. However, there was no need for either court to cite Lanning because the city lost on other grounds. For example, the city’s expert, who designed the test, admitted in open court that the job knowledge component did not represent the full job domain, and he testified that the cutoff point chosen was “totally inappropriate,” a “logical absurdity,” and “ludicrous.” In summary, in our opinion, the Guardians ruling, along with the UGESP guidance cited above, remains the most important basis for establishing cutoff scores, particularly in content validity studies.

9. An optimal balance between job-relatedness and reduction of adverse impact should be struck when possible. The key point here is the definition of optimal. Several researchers have demonstrated that optimum validity depends upon the manner in which job performance is defined or the specific aspect(s) of performance that are most important to an employer (Hattrup, Rock, & Scalia, 1997; Murphy & Shiarella, 1997). Murphy and Shiarella (1997) proposed that weighting of predictors and criteria provides a better understanding of the relationship between selection and job performance. They documented that the validity of a predictor composite can vary substantially depending upon the weights given to predictors and criterion measures; the 95% confidence interval for the validity coefficients across various weightings ranged from as low as .20 to as high as .78. Another challenging aspect of balancing job-relatedness and adverse impact is the fact that job-relatedness can be defined via different levels of analysis. When individual productivity and task performance are the focus, cognitive ability tests (instruments that typically have high
adverse impact) typically result in the highest validity for a single predictor. However, if overall organizational effectiveness is the objective, factors such as legal defensibility, strategic positioning within the marketplace, employee performance, workforce diversity, and corporate social responsibility may all be considered. The bottom line is that the organization’s mission and values dictate what constitutes acceptable performance and, ultimately, the best methods of achieving that performance. This argument was made quite forcefully in the University of Michigan admission cases Grutter v. Bollinger (2003) and Gratz v. Bollinger (2003). In Grutter, the law school took the position that its objective was to admit a first-year class that collectively advanced the law school’s overall mission as opposed to simply admitting each student with the highest probability of achieving a given law school grade point average (GPA). Whether one agrees with the school’s stated mission or not, once formulated, that mission basically defines job-relatedness. It also drives the types of strategies that are most likely to achieve an optimum balance between job-relatedness and reduction of adverse impact.
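To illustrate the weighting point made by Murphy and Shiarella (1997) discussed above, the sketch below uses invented correlations (not their data) to show how the validity of the same two-predictor composite shifts when the criterion is defined as task performance, as citizenship behavior, or as an equally weighted blend of the two.

```python
# Illustration (with invented correlations, not Murphy & Shiarella's data) of how
# the validity of a predictor composite depends on predictor and criterion weights.
import numpy as np

Rxx = np.array([[1.0, 0.3],       # predictor intercorrelations
                [0.3, 1.0]])      # e.g., cognitive ability, structured interview
Ryy = np.array([[1.0, 0.4],       # criterion intercorrelations
                [0.4, 1.0]])      # e.g., task performance, citizenship behavior
Rxy = np.array([[0.50, 0.15],     # predictor-criterion correlations (rows = predictors)
                [0.25, 0.30]])

def composite_validity(wx, wy):
    """Correlation between a weighted predictor composite and a weighted criterion composite."""
    wx, wy = np.asarray(wx, float), np.asarray(wy, float)
    cov = wx @ Rxy @ wy
    return cov / np.sqrt((wx @ Rxx @ wx) * (wy @ Ryy @ wy))

# Same predictors, different definitions of "performance":
print(composite_validity([0.5, 0.5], [1.0, 0.0]))  # criterion = task performance only
print(composite_validity([0.5, 0.5], [0.0, 1.0]))  # criterion = citizenship only
print(composite_validity([0.5, 0.5], [0.5, 0.5]))  # equally weighted criterion
```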

Exemplar Case Law Related to Principle 9

Two methods of reducing adverse impact have found favor among the courts: (a) eliminating test battery components that are most likely to produce adverse impact and (b) using other factors (e.g., diversity) in the selection process that are more likely to benefit minorities. Hayden v. County of Nassau (1999) illustrates the first method, and Grutter v. Bollinger (2003) illustrates the second. After losing in prior litigation, and facing a consent decree, Nassau County, New York, was motivated to develop a hiring exam for police officers that reduced or eliminated adverse impact. A 25-component test battery was initially administered to 25,000 candidates. The goal of reducing (but not eliminating) adverse impact was accomplished by eliminating scores on 16 of the 25 components. Another configuration with even less adverse impact (and even fewer components) was rejected because it had lower validity. This process was challenged by 68 unsuccessful nonminority candidates who would have benefited if all 25 components had been maintained. The 2nd Circuit ruled in favor of Nassau County, holding that “the intent to remedy the disparate impact of the prior exams is not equivalent to an intent to discriminate against non-minority applicants.” In Grutter, the Supreme Court ruled (under the Fourteenth Amendment) that (a) diversity is a compelling government interest and (b) the method used by the law school was narrowly tailored to that interest. At the same time, the Supreme Court struck down the Michigan undergraduate diversity plan for not being narrowly tailored (Gratz v. Bollinger, 2003). The Grutter ruling was based on Justice Powell’s opinion in Regents of University of California v. Bakke (1978). Between Bakke and Grutter, several courts upheld preference for minorities based on diversity in police forces, including Detroit Police Association v. Young (1979) and Talbert v. City of Richmond (1981). After Grutter, the 7th Circuit upheld preference for Black police officers in a promotion process in Petit v. City of Chicago (2003), noting: It seems to us that there is an even more compelling need for diversity in a large metropolitan police force charged with protecting a racially and ethnically divided major American city like Chicago. Under the Grutter standards, we hold, the city of Chicago has set out a compelling operational need for a diverse police department.

The 7th Circuit then upheld out-of-rank promotions of minority applicants on grounds that the practice was narrowly tailored. It should be noted that the key ingredient in a defendant’s successful diversity argument is often proving to the satisfaction of the courts that diversity is an important job-related factor that furthers the mission of the organization. Diversity for diversity’s sake, therefore, will typically not work. For example, in Lomack v. City of Newark (2006), a newly elected mayor transferred firefighters to and from various posts so that all 108 fire stations were racially diverse. The mayor felt there were “educational and sociological” benefits for such a “rainbow,” but the
3rd Circuit saw it as “outright racial balancing.” Similar rulings were rendered in Biondo v. City of Chicago (2004) and Rudin v. Lincoln Land Community College (2005). Two recent cases are also worth considering. First, Parents v. Seattle School District (2007) involved two school districts, each with plans to fight de facto segregation. One plan (Seattle) featured three tiebreakers for admission into any of 10 high schools, one of which was race-based; the other plan (Jefferson County) involved clusters of schools with targets of 15% minimum and 50% maximum per cluster. Both plans were struck down. However, Justice Kennedy ruled for a majority of five that race is still a compelling state interest. Although he also ruled the plans were not narrowly tailored, he offered solutions he felt were race-neutral and narrowly tailored. Second, just before this chapter was submitted, the Supreme Court ruled that the University of Texas’s admissions policy that considered diversity was legal, thus affirming the Grutter ruling. The case is Fisher v. University of Texas at Austin (first decided in 2013 and resolved in 2016), and the issues in the matter are complex.14 Basically, the University of Texas found that it lacked a “critical mass” of minority students and, in part in response to the Grutter ruling, developed a plan that permitted racial preference for a small percentage of students.

10. Many of the same practices that define responsible selection define responsible downsizing efforts. Downsizing (also known as reductions-in-force, or RIFs) represents a special instance of selection. Individuals are typically selected to “stay” in an organization (or conversely to “leave” an organization). It might be thought of as “deselection.” It represents the mirror image of the selection scenario. However, there are some unique aspects to downsizing. In the context of litigation, the employer typically must be prepared to show that the RIF was not a pretext for simply eliminating members of a protected group (e.g., female, minority, and/or older employees). Thus, the RIF may be tied to a larger business plan that documents the need for the reduction, which may also support why certain jobs, departments, or divisions have been targeted for the force reduction. As is the case in many selection/promotion scenarios, the employer should typically identify the particular knowledge, skills, abilities, or other personal characteristics that are central to the decision about whom to lay off. These variables might include abilities and skills needed for future vitality of the organization, critical experience bases, customer contacts, past performance, and future output responsibilities. Many of these variables may be illuminated by a current or future-oriented work analysis. In addition, selection procedures used to evaluate current employees (e.g., performance ratings, skill ratings, knowledge ratings) should typically conform to accepted professional and scientific standards for the use of such rating devices. Unlike selection scenarios, it is often difficult to formally “validate” a downsizing process because there is no obvious criterion for success beyond simple organizational survival. Nevertheless, just as in the case of selection, it is important to develop a theory of the downsizing process and its desired results. This theory would typically include some statement of requisite KSAOs for going forward, as well as the organizational need for downsizing.

Exemplar Case Law Related to Principle 10

Most RIF cases are age-based and involve disparate treatment charges. However, some are race-based (e.g., Jackson v. FedEx, 2008). Furthermore, after the Supreme Court’s ruling in Smith v. City of Jackson (2005), which clarified that adverse impact is a valid ADEA claim, we can expect more age-based adverse impact cases even in RIFs (e.g., Meacham v. Knolls, 2006). In a meta-analysis of 115 district court cases involving disparate treatment based on age, Wingate, Thornton, McIntyre, and Frame (2003) found that 73% of the rulings were summary judgments for defendants (SJDs). Factors associated with SJD were use of (a) performance appraisal, (b) organizational review, (c) employee assessment and selection methods, and (d) a concrete layoff plan. As a general principle, employers establishing and following sound, concrete layoff policies will likely prevail.

Cases with favorable rulings for plaintiffs reveal possible key mistakes made by employers. The most obvious mistake is weakness in the layoff plan. For example, in Zuniga v. Boeing Company (2005), the defendant had a concrete layoff plan that relied heavily on performance evaluations. However, the plaintiff defeated SJD by proving that the evaluations he received during the layoff process were inconsistent with performance evaluations he received shortly before the plan was established. Employers have also made mistakes in reassignment after the RIF. For example, in Berndt v. Kaiser Aluminum & Chemical Sales, Inc. (1986), Berndt could not compete for other jobs for which he was arguably more qualified than the younger employees who were afforded this opportunity. Similarly, in Zaccagnini v. Chas. Levy Circulating Co. (2003), older truck drivers were not considered for newly available jobs that younger drivers received. There are also cases where reassignment is part of the RIF plan, but the definition of “similarly situated” older versus younger employees is narrowly construed. For example, in Ercegovich v. Goodyear Tire and Rubber Co. (1998), three HR positions were eliminated, and the oldest employee was not reassigned. The defendant argued that the three employees were not similarly situated because they performed different job functions, but the 6th Circuit ruled for the plaintiff because the three eliminated jobs involved common knowledge, skills, and abilities. Plaintiffs have also successfully used direct evidence of stray remarks by supervisors. For example, in Starceski v. Westinghouse Elec. Corp. (1995), a supervisor admitted he stated that “it was actually a fact that older engineers . . . were going to be let go” (see also Madel v. FCI Marketing, 1997). Finally, employers must strictly follow eight explicit requirements in the Older Workers Benefit Protection Act of 1990 if they offer enhanced benefits to older employees who accept early retirement in exchange for waiving the right to sue. These requirements are that (a) the waiver document is clearly written and easily understood; (b) the waiver document cites the ADEA; (c) it affects only rights arising prior to the effective date of the waiver; (d) it offers enhanced benefits; (e) it advises employees of their right to seek counsel; (f) it provides 21 days for individuals and 45 days for groups to make a decision; (g) it is revocable within 7 days of signing; and (h) it provides extensive information about who is affected if a group is involved. In Oubre v. Entergy (1998), the Supreme Court made it clear that it will strictly construe these eight requirements.

11. Employers should typically take proactive measures to prevent, detect, and correct EEO violations involving procedural unfairness to workers or violations inferred from analysis of workforce data. A potentially effective model for preventing, detecting, and correcting EEO violations involving unfair treatment of employees is provided in exemplar form by the EEOC’s approach to sexual harassment (SH). In EEOC Policy Guidance 915.050 (June 1999; http://www.eeoc.gov/policy/docs/currentissues.html), the EEOC distinguished between minimum requirements and best practices. The minimum requirements are as follows:
• A clear explanation of prohibited conduct;
• Assurance that employees who make complaints of harassment or provide information related to such complaints will be protected against retaliation;
• A clearly described complaint process that provides accessible avenues of complaint;
• Assurance that the employer will protect the confidentiality of harassment complaints to the extent possible;
• A complaint process that provides a prompt, thorough, and impartial investigation;
• Assurance that the employer will take immediate and appropriate corrective action when it determines that harassment has occurred.

An aggressive response requires additional actions, including training employees to understand all employer policies, a dedicated “EEO Officer” to handle complaints, and an employee handbook that summarizes the rights and privileges for all employees.

Although Policy Guidance 915.050 was expressly targeted at SH, it applies equally well to a broad range of other employment actions. For example, employees may feel that their performance is being improperly appraised or that they are not receiving training or certification opportunities necessary for advancement. Workforce data may include confidential survey information obtained by a trained EEO Officer or statistical data relating to performance of different groups of employees on selection tests, composition of the workforce in relation to the appropriate labor pool, or the funneling of different groups into different jobs (e.g., females offered jobs as cashiers vs. males offered jobs as assistant managers). Properly analyzed data could lead to the detection of a potential adverse impact or pattern-or-practice violation, thus enabling employers to take action before expensive and disruptive litigation while simultaneously increasing employee perceptions of fairness (see, for example, McPhail, which details issues related to “self-critical analyses” that could be used against employers).15 Finally, employers need a good anti-retaliation policy. EEOC statistics reveal that retaliation complaints are increasing, although EEO claims in general have stabilized and even decreased (Zink & Gutman, 2005). Adding to this caution, the Supreme Court recently lightened the burden on plaintiffs to prove retaliation, making it important for employers to educate their HR professionals and managers on what to do when individuals complain about a workplace policy or file formal charges. It is not unusual for plaintiff employees to complain about routine selection decisions, such as lost training opportunities that might presage promotions, or failure to promote per se. It is critically important for employers to prevent reprisals against those who do complain.
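As a hypothetical illustration of the kind of workforce-data monitoring described above, the sketch below checks whether new hires appear to be funneled into different jobs by sex, using invented counts and a simple chi-square computation; in practice, any such self-critical analysis should be handled with the caveats noted in the text (e.g., the possibility that it could be used against the employer).

```python
# Hypothetical illustration of the workforce-data check described above:
# are female and male hires being "funneled" into different jobs?
# The counts below are invented for the example.

table = {                      # hires by sex and job offered
    ("female", "cashier"): 80, ("female", "assistant manager"): 10,
    ("male",   "cashier"): 45, ("male",   "assistant manager"): 35,
}

sexes = ["female", "male"]
jobs = ["cashier", "assistant manager"]
total = sum(table.values())
row = {s: sum(table[(s, j)] for j in jobs) for s in sexes}
col = {j: sum(table[(s, j)] for s in sexes) for j in jobs}

# Pearson chi-square statistic for the 2 x 2 table (1 degree of freedom);
# values well above ~3.84 suggest the sex-by-job association is unlikely to be chance.
chi_sq = sum(
    (table[(s, j)] - row[s] * col[j] / total) ** 2 / (row[s] * col[j] / total)
    for s in sexes for j in jobs
)
print(f"chi-square = {chi_sq:.2f}")
```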

Exemplar Case Law Related to Principle 11

EEOC Policy Guidance 915.050 interprets the Supreme Court’s 1998 rulings in Burlington Industries, Inc. v. Ellerth (1998) and Faragher v. City of Boca Raton (1998). The Court ruled in both cases that even when “no tangible employment action is taken” the employer has vicarious liability for supervisors, but it may affirmatively defend itself by proving with evidence that (a) it exercised “reasonable care to prevent and correct promptly, sexually harassing behavior” and (b) “the . . . employee unreasonably failed to take advantage of any preventive or corrective opportunities provided by the employer to avoid harm otherwise.” Ellerth applies to private entities and Faragher to public entities. Both rulings clarify that (a) there is no defense for quid pro quo SH (strict liability); (b) there is an affirmative (see above) defense for hostile harassment by supervisors; and (c) employers must know or have a basis for knowing that SH occurred among coworkers or they will be considered guilty of reckless disregard. Examples of employer policies that succeeded in affirmative defenses are found in Coates v. Sundor Brands, Inc. (1998) and Shaw v. AutoZone (1999), and examples of failed affirmative defenses are found in Baty v. Willamette Industries, Inc. (1999), Dees v. Johnson Controls (1999), and Gentry v. Export Packing (2001). There are two major Supreme Court rulings on retaliation. In Robinson v. Shell Oil (1997), the Court unanimously ruled that retaliation applies to the actions of a former employer who wrote a negative letter of reference for a previously fired employee. Subsequently, the EEOC issued policy guidance (915.003) in May 1998 outlining three steps for proving retaliation: (1) opposing an employer policy (opposition) or filing a legal claim (participation), (2) suffering an adverse action, and (3) causally connecting opposition or participation to the adverse action. At the same time, the EEOC defined “adverse action” as any action reasonably likely to deter charging parties (or others) from engaging in a protected activity. More recently, in Burlington N. & SFR Co. v. White (2006), the Supreme Court endorsed the EEOC’s definition of “adverse action” over two earlier but heavier standards. One of those heavier standards required “adverse action” to include ultimate employment consequences such as hiring, discharge, promotion, or compensation. The other required proof of interference with terms and conditions of employment. By endorsing the lighter EEOC standard, the Supreme Court made it possible for otherwise legal employer actions
to constitute retaliation. For example, in Moore v. Philadelphia (2006), White police officers opposed harassment against fellow Black police officers, and in Hare v. Potter (2007), a female employee cited 10 incidents in which her life was made miserable after she filed an EEOC complaint. These actions were deemed insufficient to prove racial and sexual harassment, respectively, but were deemed sufficient to prove retaliation against the complainants, even though they were not the original plaintiffs. Finally, the risks employers face when altering selection processes to avoid adverse impact are illustrated in Ricci v. DeStefano (2008) and Hayden v. County of Nassau (1999). In Ricci, where exams were discarded, there was a weak basis for the New Haven Civil Service Board (CSB) to believe it would lose an adverse impact challenge from minority applicants. However, in Hayden, where Nassau County (New York) eliminated portions of the test to reduce the adverse impact, there was a much stronger basis for that fear because the county was under a consent decree to create a valid test with the least amount of adverse impact.

12. Adverse impact is a valid claim in both Title VII and the ADEA. However, there are critical differences in each of the three phases in how the Title VII and ADEA scenarios are tried in court. There are three phases in an EEO trial. Phase 1 consists of prima facie evidence of a “substantial difference” in selection rates between two protected groups, Phase 2 is a defense against the prima facie claim, and Phase 3 is proof that the defense in Phase 2 is a pretext for discrimination. In the Title VII scenario, most Phase 1 claims involve statistical evidence that a test or other selection procedure produces “substantial differences”16 in selection rates between two groups (e.g., Blacks scoring lower than Whites on IQ tests in Griggs v. Duke Power, 1971). However, it is also possible to show adverse impact with minimum qualifications (MQs) that are chilling factors (e.g., having a high school degree; also in Griggs v. Duke Power). Either way, if adverse impact is shown, the defendant must prove in Phase 2 that what caused the adverse impact is job related and consistent with business necessity, forcing the plaintiff to prove in Phase 3 that alternative selection procedures are equally valid and produce less or no adverse impact in comparison to the test or selection procedure challenged. As noted above, each of the three phases in the ADEA differs from the Title VII prescription in part or in whole. The Phase 1 difference is partial; differences between older and younger workers due to a selection procedure (e.g., reduction in force, or RIF) are treated in the same way as racial differences based on a selection procedure. What is different is that in Title VII a correlation between (for example) an MQ and sex (e.g., minimum height/weight criteria that exclude more women than men) is sufficient to make the prima facie case, whereas a correlation between an MQ and age (e.g., higher percentage wage increases for individuals with fewer years of service) is not sufficient for the prima facie case in the ADEA because decisions correlated with age are not necessarily motivated by age. Second, as noted in Principles 5 and 7, there are three different types of Phase 2 defenses in Title VII (i.e., for standardized tests, biographical factors, and physical factors), each of which is designed to show that a cause of adverse impact is job related and consistent with business necessity. The ADEA uses the statutory Reasonable Factors Other Than Age (RFOA) defense, which is entirely different from the Title VII defenses. Additionally, the reasonable factor(s) must be proven with evidence, and not with a simple articulation as in most disparate treatment cases. Third, the proof of pretext in Phase 3 is entirely different in the ADEA compared to Title VII. As noted in Principle 3, the Title VII pretext phase requires proof of equally valid alternatives that result in less or no adverse impact. The parallel argument (i.e., a “more reasonable” factor other than age) is not available in the ADEA. Rather, the requirement here is much the same as in typical disparate treatment cases—that the factors offered in Phase 2 are not the true factors but rather a cover-up for discrimination.
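As a hypothetical illustration of the Phase 1 evidence described above, the sketch below compares the selection rates of two groups and computes an impact ratio screened against the four-fifths (80%) rule of thumb from the UGESP; the pass/fail counts are invented, and in practice courts may also weigh tests of statistical significance rather than the four-fifths rule alone.

```python
# Hypothetical Phase 1 screen: compare selection rates of two groups.
# Pass/fail counts are invented; the four-fifths (80%) comparison comes from
# UGESP Section 4D, but courts may also weigh statistical significance tests.

def selection_rate(passed, applied):
    return passed / applied

group_a = selection_rate(passed=48, applied=80)    # e.g., majority-group applicants
group_b = selection_rate(passed=21, applied=70)    # e.g., minority-group applicants

impact_ratio = min(group_a, group_b) / max(group_a, group_b)
print(f"Selection rates: {group_a:.2f} vs. {group_b:.2f}")
print(f"Impact ratio: {impact_ratio:.2f} "
      f"({'below' if impact_ratio < 0.8 else 'at or above'} the four-fifths benchmark)")
```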

Exemplar Case Law Related to Principle 12

In the 1980s, courts treated adverse impact in age cases under Title VII rules. For example, cost-cutting defenses in Geller v. Markham (1980) (hiring at the lowest of six steps) and Leftwich v. Harris Stowe State College (1983) (termination of tenured faculty) failed because neither was deemed job related in accordance with then-existing DOL regulations (subsequently adopted by the EEOC). Under current rules, these factors are correlated with age but not necessarily motivated by age. The motivation requirement was introduced in Hazen v. Biggins (1993), where a 62-year-old was terminated shortly before eligibility for pension vesting, a clear-cut ERISA violation. However, the lower courts also found age-based disparate treatment because age and years of service are correlated. The Supreme Court reversed on disparate treatment, ruling unanimously that employer decisions may be motivated by “factors other than age . . . even if the motivating factor is correlated with age.” Additionally, three justices opined that it is “improper to carry over disparate impact analysis from Title VII to the ADEA.” After Hazen, three circuit courts continued to entertain age-based adverse impact claims, but seven circuit courts found adverse impact inapplicable in the ADEA as a matter of law. Then, in Smith v. City of Jackson (2005), police officers and dispatchers with less than five years of experience received higher percentage compensation increases. The lower courts ruled that adverse impact was unavailable in the ADEA, but the Supreme Court ruled that Hazen does not preclude such claims. However, the Supreme Court reaffirmed the Hazen principle that factors merely correlated with age (e.g., years of service) do not qualify, and it held that, where adverse impact is shown, defendants may use the statutory RFOA in lieu of proving job-relatedness. The plaintiffs ultimately lost the prima facie case (failure to show adverse impact), and the City had a valid RFOA (the need to compete with neighboring municipalities in filling lower-level positions by increasing the compensation for those positions, even though the entry-level positions were often filled by younger applicants). Although it solidified adverse impact as a valid ADEA claim, the Smith (2005) ruling was not clear on whether the RFOA required an articulation or actual proof. This ambiguity is illustrated in Meacham v. Knolls Atomic Power Lab (KAPL), which was reviewed by the 2nd Circuit both before (Meacham I, 2004) and after (Meacham II, 2006) the Smith ruling. In this case, 30 of 31 employees laid off in an RIF were over age 40. Using pre-Smith rules, the court found in Meacham I that there were alternatives with less adverse impact. However, in Meacham II, after the Smith ruling, the court found that KAPL had articulated two nondiscriminatory reasons—whether employees were “flexible” and “retrainable” for alternative assignments. Recognizing, perhaps, the confusion created by the Smith ruling, the Supreme Court ruled in Meacham v. KAPL (2008) that the RFOA defense is an affirmative defense and requires evidence, not merely articulation. As a postscript, it is not clear what actual proof KAPL needed to provide. When the case was ultimately returned to the district court (Meacham v. KAPL, 2009), the court ruled that the defendants had waived the right to make the RFOA proof.
Nevertheless, the moral of the story is to take a conservative route in conducting an RIF, such that specific measurable selection factors and evidence of their reliability are essential parts of the layoff plan.

13. All disability-related decisions should be made on a case-by-case basis, including (a) determining whether an applicant or employee is disabled, (b) determining whether accommodations are needed for performing essential job functions, and/or (c) designing assessments of KSAOs deemed necessary to perform essential job functions. Under the ADA of 1990 (and the prior Rehabilitation Act of 1973), there is no such thing as “disability as a matter of law.” The general requirements for being disabled under the law include (a) a physical or mental impairment that (b) interferes with a major life function. In addition to prongs (a) and (b), the individual must (c) be able to perform all essential job functions with
or without accommodations. These requirements must be demonstrated regardless of whether the impairment is current or past, or whether the employer mistakenly believes an individual is disabled. As a result, disabilities fall within an interval between prongs (b) and (c) in which individuals with minor or temporary impairments cannot demonstrate interference with major life functions, and individuals with extremely severe impairments may not be able to perform all essential job functions, even with accommodations. The first step in determining whether a disabled person requires accommodations is determining whether there is a nexus between a physical or mental impairment and essential job functions. A person with one leg is clearly disabled under the ADA, but it is unlikely that accommodations beyond access to the workplace are required if the job is computer programmer. If there is a nexus, the employer should next meet with the applicant or employee and together explore potential accommodations to overcome the barrier implied by the disability. If no such accommodations are possible, the individual, unfortunately, faces an insurmountable yet legal barrier to employment. If assessment is involved in the selection process, it is important to accommodate applicants with special needs. For example, if a paper-and-pencil or computer-presented format is used and the construct of interest is “good judgment,” it may be important to allow applicants with limited vision to have the test questions read to them.17 More generally, it is good practice to incorporate KSAOs needed to perform essential job functions in the testing process. For example, if a job or critical tasks can be performed without undue concern for the passage of time, it may be inappropriate to use demanding time limits for a test to be taken by an individual who claims a learning disability related to speed of reading or processing, as this could alter the underlying measurements or distributions of the test.18 One other point is worth noting: Until 2014, soliciting disability status from applicants before a job offer was prohibited by the ADA. In 2014, Section 503 of Rehab-73 was updated in various ways, including a requirement that federal contractors meeting the 50-employee/$50,000 contract thresholds solicit disability status information from applicants pre-offer and post-offer. As noted by Pryor, Dunleavy, and Cohen (2014), there was immediate concern about whether the pre-offer solicitation of disability status required by the new regulations violated the ADA. Similar regulations were proposed in 1996, and at that time the EEOC provided a letter stating that pre-offer solicitation would be a violation under general ADA provisions. However, in 2014, the EEOC provided a letter stating that when federal contractors are required to solicit this information to comply with a federal regulation, they are not violating the ADA/ADAAA. It remains to be seen whether this will provide a legal safe haven for contractors if they are challenged. Regardless, the availability of applicant disability self-identification may provide selection researchers with fresh opportunities to conduct research that was impossible before these new regulations. There are other interesting aspects of the updated regulations. For example, employee disability status must be solicited every five years.
Contractors are still required to “periodically” review mental and physical job qualifications and personnel processes, although not annually. Additional new requirements include written documentation of outreach and recruitment efforts as well as a utilization analysis with a goal of 7% employment of individuals with disabilities. Contractors must assess whether they have a gap between the 7% goal and actual employment in each affirmative action job group. If there is a gap, they must strive to eliminate the gap with focused outreach and recruitment efforts, not a quota. As such, not meeting the goal does not necessarily mean violating the regulations. At the time this chapter was written, little enforcement data were available related to the new regulations.

Exemplar Case Law Related to Principle 13

The individual approach to defining disability was affirmed in three 1999 Supreme Court rulings: Sutton v. United Air Lines (1999), Murphy v. UPS (1999), and Albertsons v. Kirkingburg (1999). For example, in Kirkingburg, the Supreme Court ruled:

This is not to suggest that monocular individuals have an onerous burden in trying to show that they are disabled. . . . We simply hold that the Act requires monocular individuals . . . to prove a disability by offering evidence that the extent of the limitation in terms of their own experience, as in loss of depth perception and visual field, is substantial.

In other words, Kirkingburg could have proven he was disabled within the meaning of the law, but he did not, having assumed that amblyopia is a disability as a matter of law. In the other two cases, the Supreme Court ruled that impairments must be evaluated in their mitigated state (eyeglasses for visual impairments in Sutton and high blood pressure medication for hypertension in Murphy). In an extension of the ruling in Kirkingburg, the Murphy Court ruled:

Murphy could have claimed he was substantially limited in spite of the medication. Instead, like Kirkingburg, [he] falsely assumed that his impairment was a disability as a matter of law.

Taking advantage of this “advice,” plaintiffs subsequently proved disability despite medication, as, for example, in EEOC v. J. H. Routh Packing (2001) (seizures only partially controlled with epilepsy medication) and Lawson v. CSX (2001) (debilitating side effects of insulin medication for diabetics). It should be noted that in the ADA Amendments Act of 2008, Congress changed the rules for mitigating measures (except for eyeglasses), thus reversing these 1999 rulings on mitigation. This change does not, however, alter the principle of assessing impairment on a case-by-case basis.

Examples of insurmountable barriers include Southeastern Community College v. Davis (1979) (a deaf woman excluded from nursing school), Treadwell v. Alexander (1983) (a heart patient who could not perform all-day foot patrols excluded from a park ranger job), and Miller v. Illinois (1996) (a blind person who could perform some but not all essential functions of a corrections officer job). However, it is critical that the job functions in question are essential, as, for example, in Stone v. City of Mt. Vernon (1997) (a paraplegic former firefighter was refused a desk job on the grounds that he could not fight fires in emergencies).

When accommodations are possible, plaintiffs have a duty to inform potential employers, and both parties have a duty to “flexibly interact” to seek accommodations. Examples of failure to notify include Hedberg v. Indiana Bell (1995) (notification of fatigue syndrome after termination) and Taylor v. Principal Financial (1996) (notification of bipolar disorder after a poor performance evaluation). Examples of employee failures to flexibly interact include Beck v. University of Wisconsin Bd. of Regents (1996) (the employee refused a request for medical records to identify accommodations) and Grenier v. Cyanamid Plastics (1995) (the employee refused a request for psychiatric information). Examples of employer failures to flexibly interact include Bultemeyer v. Fort Wayne (1996) (the employer ignored a request by a psychiatrist for reassignment), Feliberty v. Kemper Corp. (1996) (the employer falsely assumed that a medical doctor could design his own accommodations), Whiteback v. Vital Signs (1997) (the employer ignored a request for a motorized cart because “it wouldn’t look right”), and Dalton v. Subaru-Isuzu (1998) (the employer ignored a request for step stools and guard rails without discussion).

Mistakes in assessment include Stutts v. Freeman (1983), in which a dyslexic applicant failed the General Aptitude Test Battery for a job (heavy truck operation) that did not require reading skills. On the other hand, in Fink v. New York City (1995), the city was not liable when accommodations for blind applicants (readers and interpreters) did not result in passing scores on a civil service exam.

CONCLUSIONS

The term “law” in the selection context generally has two separate but related meanings. There is statutory law embodied in the Civil Rights Acts, ADA, ADEA, and similar federal statutes that we have discussed above. There is also “case law,” which is embodied in the opinions of various levels of the federal judiciary (trial, appeals, and Supreme Courts). The latter interprets the former from the legal perspective. Selection practitioners must be aware of both aspects of “the law.” They must be aware of the statutory requirements as well as how judges have interpreted these requirements. Statutory statements of the law seldom recognize the specific contributions of I-O psychology to selection practices. Judges and other EEO stakeholders, on the other hand, often cite the testimony of I-O psychologists and the standards by which selection practice is evaluated (e.g., UGESP, SIOP Principles). In this chapter, we have attempted to bring together practice, statutory law, and case law as a way of educating practitioners. Other chapters provide more detailed descriptions of practices; we provide a legal context for many of those practices. Context matters, and toward that end we remind I-O practitioners that (a) professional judgment cannot be removed from the equation and (b) talking to the appropriate legal counsel may be a useful approach to getting ahead of these issues.

NOTES

1. Our colleague, co-author, and friend Jim Outtz passed away shortly before this chapter was completed. The field of I-O psychology lost a leader, scholar, and model scientist-practitioner that day, and we are grateful for having had the opportunity to work with and learn from Jim. We will miss our friend, as will the rest of the field.
2. We note that the legal context is often very complex, as are evidentiary standards related to technical matters. We are not intending to give legal advice in this chapter and recommend that readers consult legal counsel if they are dealing with any of the issues discussed in this chapter.
3. Interested readers are referred to Sady, Aamodt, and Cohen (2015) for a review of recent pay equity issues and enforcement.
4. Obligation to Solicit Race and Gender Data for Agency Enforcement Purposes, 70 Fed. Reg. 58946 (Oct. 7, 2005) (codified at 41 C.F.R. pt. 60–1). This final rule went into effect on February 6, 2006.
5. 41 C.F.R. § 60–1.3.
6. For a more detailed view of controversies in adverse impact measurement, please refer to Dunleavy, Morris, and Howard (2015).
7. Again, for more detail on these statistical tests, please refer to Dunleavy, Morris, and Howard (2015).
8. Practical significance is a well-established notion in the social scientific community. For example, in the most recent Publication Manual of the American Psychological Association (2010), a failure to report effect sizes (as practical significance measures) is considered a defect in the reporting of research: “No approach to probability value directly reflects the magnitude of an effect or the strength of a relation. For the reader to fully understand the importance of your findings, it is almost always necessary to include some index of effect size or strength of relation in your Results section.”
9. As discussed by Murphy and Jacobs (2012), the use of the term “standard deviation” is somewhat misleading, because this phrase refers to a descriptive measure of the variability of a distribution. The more appropriate term here is “standard error,” which describes the variability in a test statistic due to random fluctuation across samples.
10. An OFCCP Statistical Standards Group (1979) endorsed the absolute difference as a practical significance measure.
11. We note the complexity of the topic. From an I-O psychology perspective, consideration of alternatives could relate to different measurement methods of the same construct, measuring alternative constructs, measuring additional constructs, or implementation characteristics such as weighting and score use (e.g., cut score, banding, rank order, data point, compensatory versus multiple hurdle approaches, etc.). What is or is not reasonable from a legal perspective is a different question. We also note the complexity of defining what is equally valid, which is a complicated enough notion within the same validation strategy (e.g., comparing two criterion studies), never mind across different validation strategies (e.g., a content and a criterion study).
12. Note that transportability is a concept that originated in the Uniform Guidelines, but was framed as a substrategy within the criterion validation framework and limited to the transport of only criterion research.
13. In the BFOQ defense, it must be proven that it is reasonably necessary to exclude all members of a class for the business to succeed. In effect, proof of adverse impact based on height and weight requires a similar proof: that it is necessary to exclude all or most people based on such criteria.
14. Those readers interested in a comprehensive discussion of those facts are directed to the DCI Consulting website (www.dciconsult.com), where Gutman has written the following three blogs:
(1) http://dciconsult.com/supreme-court-to-review-fisher-v-university-of-texas-another-test-ofgrutter-v-bollinger-2003/
(2) http://dciconsult.com/supreme-court-punts-in-long-awaited-ruling-in-fisher-v-university-oftexas/
(3) http://dciconsult.com/5th-circuit-declines-en-banc-review-of-fisher-v-university-of-texas/
15. We note that the issue of “self-critical analyses” is complex and that in some instances such results could be used against an employer. We suggest that readers consult their legal counsel and read McPhail (2005).
16. The U.S. Supreme Court has never defined what constitutes a “substantial difference,” but lower courts have used the statistical significance of chi-square and/or Fisher exact tests.
17. We note that whether this change is (1) an accommodation, where the construct being measured is still the construct of interest, or (2) a modification, changing the construct being measured, is often a complex question to answer.
18. We note the complexity of such decisions, particularly when considering whether scores from an accommodated administration can be reasonably compared with scores from an unaccommodated administration. This example is further complicated by the potential legal consequences of flagging accommodated scores.
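To make the distinction drawn in Note 9 and the tests named in Note 16 concrete, the following short sketch computes a z statistic (the difference in selection rates divided by its standard error) and a Fisher exact test from a 2 × 2 applicant-flow table. It is an illustration only: the counts, variable names, and use of scipy are assumptions of this sketch, not data or code from the chapter.

# Hypothetical sketch (not from the chapter): the "standard error" z statistic
# described in Note 9 and the Fisher exact test mentioned in Note 16, computed
# from an assumed 2 x 2 applicant-flow table.
from math import sqrt
from scipy.stats import fisher_exact

# Assumed counts: (hired, not hired) for two applicant groups
group1_hired, group1_rejected = 60, 40
group2_hired, group2_rejected = 20, 30

n1 = group1_hired + group1_rejected
n2 = group2_hired + group2_rejected
rate1 = group1_hired / n1   # 0.60
rate2 = group2_hired / n2   # 0.40

# z = (difference in selection rates) / (standard error of that difference);
# the standard error reflects sampling variability of the difference across
# samples, not the spread of a score distribution (Note 9).
pooled_rate = (group1_hired + group2_hired) / (n1 + n2)
standard_error = sqrt(pooled_rate * (1 - pooled_rate) * (1 / n1 + 1 / n2))
z = (rate1 - rate2) / standard_error   # values beyond roughly 2 are the conventional benchmark

# Fisher exact test on the same 2 x 2 table (Note 16)
_, p_value = fisher_exact([[group1_hired, group1_rejected],
                           [group2_hired, group2_rejected]])

print(f"z = {z:.2f}, Fisher exact p = {p_value:.3f}")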

REFERENCES

American Psychological Association (2009). Publication manual of the American Psychological Association (6th ed.). Washington, DC: American Psychological Association.
American Statistical Association. (2016). P-values under question. Psychological Science Agenda. Retrieved March 2016, from http://www.apa.org/
Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications of a meta-analytic matrix incorporating cognitive ability, alternative predictors, and job performance. Personnel Psychology, 52, 561–590.
Cohen, D. B., Aamodt, M. G., & Dunleavy, E. M. (2010). Technical advisory committee report on best practices in adverse impact analyses. Washington, DC: Center for Corporate Equality.
Cohen, D., & Dunleavy, E. M. (2009). A review of OFCCP enforcement statistics: A call for transparency in OFCCP reporting. Washington, DC: Center for Corporate Equality.
Cohen, D., & Dunleavy, E. M. (2010). A review of OFCCP enforcement statistics for fiscal year 2008. Published independently by the Center for Corporate Equality at www.cceq.org
De Corte, W. (1999). Weighing job performance predictors to both maximize the quality of the selected workforce and control the level of adverse impact. Journal of Applied Psychology, 84, 695–702.
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393.
Dunleavy, E. M., & Gutman, A. (2011). An update on the statistical versus practical significance debate: A review of Stagi v. Amtrak (2010). The Industrial-Organizational Psychologist, 48, 121–129.
Dunleavy, E. M., Morris, S., & Howard, L. (2015). Measuring adverse impact in selection decisions. In K. Sady & C. Hanvey (Eds.), HR practitioner’s guide to legal issues in organizations: Research methods for practical problems (pp. 1–27). New York, NY: Springer.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Justice, & Department of Labor. (1978). Uniform guidelines on employee selection procedures. 29 CFR 1607.
Esson, P. L., & Hauenstein, N. M. (2006). Exploring the use of the four-fifths rule and significance tests in adverse impact court case rulings. Paper presented at the 21st annual conference of the Society for Industrial and Organizational Psychology, Dallas, TX.
Gutman, A., Koppes, L. L., & Vodanovich, S. J. (2010). EEO law and personnel practices (3rd ed.). New York: Routledge.
Hattrup, K., Rock, J., & Scalia, C. (1997). The effects of varying conceptualizations of job performance on adverse impact, minority hiring, and predicted performance. Journal of Applied Psychology, 82, 656–664.
Jacobs, R., Murphy, K., & Silva, J. (2013). Unintended consequences of EEO enforcement policies: Being big is worse than being bad. Journal of Business and Psychology, 28(4), 467–471.
Landy, F. J. (Ed.) (2005). Phases of employment litigation. In Employment discrimination litigation: Behavioral, quantitative, and legal perspectives (pp. 3–19). San Francisco, CA: Jossey-Bass.
McDaniel, M. A., Kepes, S., & Banks, G. C. (2011). Encouraging debate on the uniform guidelines and the disparate impact theory of discrimination. Industrial and Organizational Psychology, 4(4), 566–570.
McPhail, S. M. (2005). Auditing selection processes: Application of a risk assessment model. The Psychologist-Manager Journal, 8(2), 205–221.
Morris, S., Dunleavy, E. M., & Howard, E. (2015). Measuring adverse impact in employee selection decisions. In C. Hanvey & K. Sady (Eds.), Practitioner’s guide to legal issues in organizations (pp. 1–27). New York, NY: Springer.
Murphy, K. R., & Jacobs, R. R. (2012). Using effect size measures to reform the determination of adverse impact in equal employment litigation. Psychology, Public Policy, and Law, 18(3), 477–499.
Murphy, K. R., & Shiarella, A. H. (1997). Implications of the multidimensional nature of job performance for the validity of selection tests. Personnel Psychology, 50, 823–854.
Outtz, J. L. (2007). Less adverse alternatives: Making progress and avoiding red herrings. The Industrial-Organizational Psychologist, 45(2), 23–27.
Outtz, J. L. (Ed.) (2010). Adverse impact: Implications for organizational staffing and high stakes selection. New York: Routledge.
Pryor, K., Dunleavy, E. M., & Cohen, D. (2014). Funny you should mention it: New disability EEO/AA regulations finalized for federal contractors. Industrial and Organizational Psychology, 7, 220–224. doi: 10.1111/iops.12135
Reynolds, D. H. (2004). EEOC and OFCCP guidance on defining a job applicant in the Internet age: SIOP’s response. The Industrial-Organizational Psychologist, 42(2), 127–138.
Roth, P. L., Bobko, P., & Switzer, F. S., III. (2006). Modeling the behavior of the 4/5ths rule for determining adverse impact: Reasons for caution. Journal of Applied Psychology, 91(3), 507–522.
Sackett, P. R., & Ellingson, J. E. (1997). The effects of forming multi-predictor composites on group differences and adverse impact. Personnel Psychology, 50, 707–722.
Sady, K., Aamodt, M. G., & Cohen, D. (2015). Compensation equity: Who, what, when, where, why, and how. In Practitioner’s guide to legal issues in organizations (pp. 249–282). New York, NY: Springer.
Schmitt, N., Rogers, W., Chan, D., Sheppard, L., & Jennings, D. (1997). Adverse impact and predictive efficiency of various predictor combinations. Journal of Applied Psychology, 82, 719–730.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author.
Wingate, P. H., Thornton, G. C., III., McIntyre, K. S., & Frame, J. H. (2003). Organizational downsizing and age discrimination litigation: The influence of personnel practices and statistical evidence on litigation outcomes. Law and Human Behavior, 27, 87–108.
Zink, D. L., & Gutman, A. (2005). Statistical trends in private sector employment discrimination suits. In F. J. Landy (Ed.), Employment discrimination litigation: Behavioral, quantitative, and legal perspectives (pp. 101–131). San Francisco, CA: Jossey-Bass.

Cases Cited

Adams v. City of Chicago, 469 F.3d 609 (7th Cir. 2006).
Albemarle Paper Co. v. Moody, 422 U.S. 405 (1975).
Albertsons v. Kirkingburg, 527 U.S. 555 (1999).
Apsley v. The Boeing Co., No. 05–1368 (D. Kan. Dec. 9, 2013).
Baty v. Willamette Industries, Inc., 172 F.3d 1232 (10th Cir. 1999).
Beck v. University of Wisconsin Bd. of Regents, 75 F.3d 1130 (7th Cir. 1996).
Bernard v. Gulf Oil Corp., 890 F.2d 735 (5th Cir. 1989).
Berndt v. Kaiser Aluminum & Chemical Sales, Inc., 789 F.2d 253 (3d Cir. 1986).
Bew v. City of Chicago, 252 F.3d 891 (7th Cir. 2001).
Biondo v. City of Chicago, Ill., 382 F.3d 680 (7th Cir. 2004).
Borkowski v. Valley Cent. School Dist., 63 F.3d 131 (2d Cir. 1995).
Boyd v. Ozark Air Lines, Inc., 568 F.2d 50 (8th Cir. 1977).
Bradley v. City of Lynn, 443 F. Supp. 2d 145 (D. Mass. 2006).
Bradley v. Pizzaco of Nebraska, Inc., 7 F.3d 795 (8th Cir. 1993).
Bridgeport Guardians, Inc. v. City of Bridgeport, 933 F.2d 1140 (2d Cir. 1991).
Briscoe v. City of New Haven, No. 3:09-cv-1642 (CSH) (D. Conn. July 12, 2010).
Briscoe v. City of New Haven, 654 F.3d 200 (2d Cir. 2011).
Brunet v. City of Columbus, 58 F.3d 251 (6th Cir. 1995).
Bultemeyer v. Fort Wayne School, 100 F.3d 1281 (7th Cir. 1996).
Burlington Industries, Inc. v. Ellerth, 524 U.S. 742 (1998).
Burlington N. & S.F.R. Co. v. White, 548 U.S. 53 (2006).
Castaneda v. Partida, 430 U.S. 482 (1977).
Coates v. Sundor Brands, Inc., 164 F.3d 1361 (11th Cir. 1998).
Contreras v. City of Los Angeles, 656 F.2d 1267 (9th Cir. 1981).
Dalton v. Subaru-Isuzu Automotive, Inc., 141 F.3d 667 (7th Cir. 1998).
Davis v. City of Dallas, 777 F.2d 205 (5th Cir. 1985).
Dees v. Johnson Controls World Services, Inc., 168 F.3d 417 (11th Cir. 1999).
Detroit Police Officers Ass’n v. Young, 608 F.2d 671 (6th Cir. 1979).
Dothard v. Rawlinson, 433 U.S. 321 (1977).
EEOC v. Atlas Paper Box Co., 868 F.2d 1487 (6th Cir. 1989).
EEOC v. J. H. Routh Packing Co., 246 F.3d 850 (6th Cir. 2001).
Ercegovich v. Goodyear Tire & Rubber Co., 154 F.3d 344 (6th Cir. 1998).
Faragher v. City of Boca Raton, 524 U.S. 775 (1998).
Feliberty v. Kemper Corp., 98 F.3d 274 (7th Cir. 1996).
Fisher v. University of Texas at Austin, 133 S. Ct. 2411 (2013).
Fitzpatrick v. City of Atlanta, 2 F.3d 1112 (11th Cir. 1993).
Frazier v. Garrison ISD, 980 F.2d 1514 (5th Cir. 1993).
Geller v. Markham, 635 F.2d 1027 (2d Cir. 1980).
Gentry v. Export Packaging Co., 238 F.3d 842 (7th Cir. 2001).
Gillespie v. State of Wis., 771 F.2d 1035 (7th Cir. 1985).
Gratz v. Bollinger, 539 U.S. 244 (2003).
Grenier v. Cyanamid Plastics, Inc., 70 F.3d 667 (1st Cir. 1995).
Griggs v. Duke Power Co., 401 U.S. 424 (1971).
Grutter v. Bollinger, 539 U.S. 306 (2003).
Guardians Ass’n of New York City v. Civil Serv., 630 F.2d 79 (2d Cir. 1980).
Gulino v. Bd. of Educ. of City School Dist. of N.Y., 907 F. Supp. 2d 492 (S.D.N.Y. 2012).
Gulino v. New York State Educ. Dept., 460 F.3d 361 (2d Cir. 2006).
Hare v. Potter, 549 F. Supp. 2d 688 (E.D. Pa. 2007).
Hayden v. County of Nassau, 180 F.3d 42 (2d Cir. 1999).
Hazelwood School Dist. v. United States, 433 U.S. 299 (1977).
Hazen Paper Co. v. Biggins, 507 U.S. 604 (1993).
Hedberg v. Indiana Bell Telephone Co., 47 F.3d 928 (7th Cir. 1995).
Horace v. City of Pontiac, 624 F.2d 765 (6th Cir. 1980).
Hyland v. Fukuda, 580 F.2d 977 (9th Cir. 1978).
Isabel v. City of Memphis, 404 F.3d 404 (6th Cir. 2005).
Jackson v. FedEx Corporate Services Inc., 518 F.3d 388 (6th Cir. 2008).
Johnson v. City of Memphis, No. 06–2052 Ma/P (W.D. Tenn. 2006).
Johnson v. City of Memphis, 770 F.3d 464 (6th Cir. 2014).
Lanning v. SEPTA, 181 F.3d 478 (3rd Cir. 1999).
Lawson v. CSX Transp., 245 F.3d 916 (7th Cir. 2001).
Leftwich v. Harris-Stowe State College, 702 F.2d 686 (8th Cir. 1983).
Lewis v. Chicago, 299 F. Supp. 1357 (N.D. Ill. 2005).
Lomack v. City of Newark, 463 F.3d 303 (3d Cir. 2006).
Lopez v. City of Lawrence, No. 07–11693-GAO (D. Mass. 2014).
Madel v. FCI Marketing, 116 F.3d 1247, 1251 (8th Cir. 1997).
Meacham v. Knolls Atomic Power Laboratory, 461 F.3d 134 (2d Cir. 2006).
Meacham v. Knolls Atomic Power Laboratory, 627 F. Supp. 2d 72 (N.D.N.Y. 2009).
McDonnell Douglas Corp. v. Green, 411 U.S. 792 (1973).
Miller v. Illinois, 107 F.3d 483 (7th Cir. 1996).
Moody v. Albemarle Paper Co., 474 F.2d 134 (4th Cir. 1973).
Moore v. City of Philadelphia, 461 F.3d 331 (3d Cir. 2006).
Moore v. Southwestern Bell Telephone Co., 593 F.2d 607 (5th Cir. 1979).
Murphy v. United Parcel Service, 527 U.S. 516 (1999).
New York City Transit Authority v. Beazer, 440 U.S. 568 (1979).
Oubre v. Entergy Operations, Inc., 522 U.S. 422 (1998).
Parents Inv. in Comm. Sch. v. Seattle School, 551 U.S. 701 (2007).
Parker v. University of Pennsylvania, 2004 U.S. Dist. LEXIS 17423 (2004).
Petit v. City of Chicago, 352 F.3d 1111 (7th Cir. 2003).
PGA Tour, Inc. v. Martin, 532 U.S. 661 (2001).
Police Officers v. City of Columbus, 916 F.2d 1092 (6th Cir. 1990).
Regents of University of California v. Bakke, 438 U.S. 265 (1978).
Ricci v. DeStefano, 554 F. Supp. 2d 142 (D. Conn. 2006).
Ricci v. DeStefano, 530 F.3d 87 (2d Cir. 2008).
Ricci v. DeStefano, 129 S. Ct. 2658 (2009).
Robinson v. Shell Oil Co., 519 U.S. 337 (1997).
Rudin v. Lincoln Land Community College, 420 F.3d 712 (7th Cir. 2005).
Shaw v. AutoZone, Inc., 180 F.3d 806 (7th Cir. 1999).
Smith v. City of Boston, No. 12–10291-WGY (D. Mass. Nov. 16, 2015).
Smith v. City of Jackson, 544 U.S. 228 (2005).
Southeastern Community College v. Davis, 442 U.S. 397 (1979).
Spurlock v. United Airlines, Inc., 475 F.2d 216 (10th Cir. 1972).
Stagi v. National RR Passenger Corp., 880 F. Supp. 2d 564 (E.D. Pa. 2012).
Starceski v. Westinghouse Elec. Corp., 54 F.3d 1089 (3d Cir. 1995).
Stone v. City of Mount Vernon, 118 F.3d 92 (2d Cir. 1997).
Stutts v. Freeman, 694 F.2d 666 (11th Cir. 1983).
Sutton v. United Air Lines, 527 U.S. 471 (1999).
Talbert v. City of Richmond, 648 F.2d 925 (4th Cir. 1981).
Taylor v. Principal Financial Group, 93 F.3d 155 (5th Cir. 1996).
Treadwell v. Alexander, 707 F.2d 473 (11th Cir. 1983).
United States v. City of Buffalo, 457 F. Supp. 612 (W.D.N.Y. 1978).
United States v. Commonwealth of Virginia, 454 F. Supp. 1077 (E.D. Va. 1978).
Waisome v. Port Auth. of New York & New Jersey, 758 F. Supp. 171 (S.D.N.Y. 1991).
Watson v. Fort Worth Bank & Trust, 487 U.S. 977 (1988).
Whiteback v. Vital Signs, 116 F.3d 588 (D.C. Cir. 1997).
Williams v. Ford Motor Co., 187 F.3d 533 (6th Cir. 1999).
Zaccagnini v. Chas. Levy Circulating Co., 338 F.3d 672 (7th Cir. 2003).
Zuniga v. Boeing Company, No. 02-CV-807-TCK-SAJ (N.D. Okla. Jan. 25, 2005).


29

UPDATED PERSPECTIVES ON THE INTERNATIONAL LEGAL ENVIRONMENT FOR SELECTION

WINNY SHEN, PAUL R. SACKETT, FILIP LIEVENS, EVELINE SCHOLLAERT, GREET VAN HOYE, DIRK D. STEINER, FLORENCE ROLLAND-SAYAH, KONSTANTINA GEORGIOU, IOANNIS NIKOLAOU, MARIA TOMPROU, SHAY TZAFRIR, PETER BAMBERGER, MARILENA BERTOLINO, MARCO MARIANI, FRANCO FRACCAROLI, TOMOKI SEKIGUCHI, BETTY ONYURA, HYUCKSEUNG YANG, JANNEKE K. OOSTROM, PAUL ENGLERT, OLEKSANDR S. CHERNYSHENKO, HENNIE J. KRIEK, TINA JOUBERT, JESÚS F. SALGADO, ANNIKA WILHELMY, CORNELIUS J. KÖNIG, AICHIA CHUANG, AND MARK COOK1

In the United States, the legal context plays a major role in how industrial-organizational (I-O) psychologists approach selection system development. The set of protected groups, the approaches to making a prima facie case of discrimination (e.g., differential treatment vs. adverse impact), the key court cases influencing selection, and the prohibitions against preferential treatment (e.g., the 1991 ban on score adjustment or within-group norming) are well known. Selection texts (e.g., Guion, 1998) and human resource management texts (e.g., Cascio & Aguinis, 2008) give prominent treatment to the legal context. In recent years, there has been a growing internationalization of I-O psychology such that psychologists from all over the world work with clients in other countries and contribute to our journals and to our conferences. Test publishers and consulting firms establish offices globally. As this internationalization continues, it becomes increasingly useful to take a broader look at the legal environment for selection, examining similarities and differences in various countries. For example, consider a U.S. firm with operations in several other countries. Although U.S. fair employment law applies only to those overseas employees who are U.S. citizens or foreign nationals employed in the U.S. by a U.S.-based firm, the employment by U.S. firms of host country nationals or third-country nationals is subject to the legal environment of the host country.

DATA COLLECTION METHODOLOGY

To compare and contrast the legal environment for selection in various countries, the senior author prepared a set of questions about the legal environment for selection, prepared model answers describing the legal environment in the United States, and contacted psychologists in various countries, asking them to prepare a document describing the legal environment in their countries. The goal was to obtain a range of perspectives by sampling about 20 countries. Thus, this chapter is by no means a complete catalog of the legal environment around the world. Researchers and practitioners who are experts on the topic of selection participated from the following 22 countries in the original chapter, and updated information was obtained for 17 of these countries (denoted by asterisks) for this revision: Australia, Belgium*, Canada*, Chile, France*, Germany, Greece*, India, Israel*, Italy*, Japan*, Kenya*, Korea*, the Netherlands*, New Zealand*, South Africa*, Spain*, Switzerland*, Taiwan*, Turkey, the United Kingdom*, and the United States*. As the list indicates, the countries covered broadly sample the world. Because of space constraints, the results for each country were summarized and organized by issue rather than by country to create this chapter. For more context on the legal, social, cultural, and political environment of the countries surveyed, see Myors et al. (2008). Contributing authors from each country responded to several questions, nine of which are addressed in turn in this chapter.

Question 1: Are There Racial/Ethnic/Religious Subgroups Such That Some Are Viewed as “Advantaged” and Others as “Disadvantaged”?

The disadvantaged groups identified by country differ on several dimensions. First, the basis for disadvantaged status varies: (a) native/aboriginal people in a setting where colonizers became the majority group (e.g., Native Americans in the United States; Māori in New Zealand; First Nations Peoples, Métis, and Inuit in Canada), (b) recent immigrants (e.g., people from the Middle East moving to many European countries), (c) racial/ethnic groups either native to or with long histories in the country (e.g., African Americans in the United States; Black, Coloured, and Indian individuals in South Africa; less populous ethnic tribes in Kenya), (d) religious groups (e.g., India), and (e) language groups (e.g., Francophones in Canada; Rhaeto-Romanic speakers in Switzerland). Second, the size of the minority population varies, from a very small percentage of the population in some countries to the South African extreme of a previously disadvantaged Black majority. Overall, there is considerable variability from country to country in what constitutes a disadvantaged group. Furthermore, we note that the status and prevalence of various groups are constantly evolving. As an example, the ongoing refugee crisis (i.e., with migrants coming primarily from Syria, Afghanistan, and Iraq), which has particularly affected European countries, may lead to long-term changes in the demographic composition of these and other nations depending upon where these migrants ultimately settle. We refer interested readers to the first edition of this Handbook chapter (i.e., Sackett et al., 2010) for additional details regarding specific disadvantaged groups for each country.

Question 2: What Is the General Picture Regarding Women in the Workplace (e.g., Historical Trends Regarding Employment for Women, Current Data on Percentage of Women in the Workforce, and Current Status Regarding Occupational Segregation, Such as Gender Representation in Various Job Classes and at Various Organizational Levels)?

Among the countries surveyed, women make up a substantial portion of the workforce (ranging from approximately 30–70%). Strides have been made such that women are increasingly involved in the workforce across all countries surveyed, as evidenced by women’s generally high rates of workforce participation (ranging from 38–69%). These differences are undoubtedly at least partially due to the multitude of differences among countries, including those in history, culture and values, economic conditions, and political conditions. It is interesting to note that in no instance is the female participation rate higher than the male participation rate; this may partially reflect the traditional division of labor between men and women. Furthermore, although women are less likely than their male counterparts to participate in the workforce, it appears that there tend to be no or only small differences in the unemployment rates of men and women (usually within 1 or 2 percentage points). Exceptions to this general trend include Greece, Kenya, and Switzerland, where women are still substantially more likely than male workers to be unemployed, and Taiwan, where the male unemployment rate has been higher than the female unemployment rate since 1996 (likely due to a shift from a manufacturing-based to a more service-based economy). Among all nations surveyed, there is still a substantial gender disparity in pay (with women’s earnings ranging from approximately 66–88% of men’s). Although it is unclear whether these estimates take into account factors such as differences in occupations, full- versus part-time work, and educational attainment, other research has shown that even after taking some of these factors into account, women still earn less than their male counterparts (though the gap generally decreases; e.g., U.S. General Accounting Office, 2003). Furthermore, there continues to be occupational segregation to some extent in all countries surveyed, and women are still more likely to join the workforce as part-time workers in many countries (e.g., Belgium, France, Germany, Israel, Japan, Switzerland, the Netherlands, and the United Kingdom). Generally, women are more likely than their male counterparts to be found in clerical or secretarial, retail or sales, healthcare, education, public services, or small-scale agricultural farming occupations. The occupations in which women are most heavily concentrated also tend to be in the lower income segment. Finally, women remain underrepresented in business and management positions as well as technical and scientific, professional, and high-level government positions (e.g., judges and cabinet members), particularly at more senior levels. In the interest of space, we do not present specific statistics for each country, particularly as this information may change and become out of date relatively quickly. However, interested readers can refer to the first edition of this Handbook chapter (i.e., Sackett et al., 2010) for prior estimates regarding women’s status in the workplace in each country.

Question 3: Is There Research Documenting Mean Differences Between Groups on Individual Difference Measures Relevant to Job Performance?

Mean differences on ability and personality measures are commonly examined in the United States, with enough data for large-scale meta-analytic summaries (e.g., Foldes, Duehr, & Ones, 2008; Roth, Bevier, Bobko, Switzer, & Tyler, 2001). Mean differences on tests of developed cognitive abilities of roughly 1.00 standard deviation (SD) between Whites and African Americans and roughly 0.67 SD between Whites and Hispanics have been consistently reported (Roth et al., 2001). This abundance of data stands in marked contrast to the pattern of findings in the countries examined here. In fact, for most countries, the authors reported finding either no research or research with samples so small that they generally refrained from drawing conclusions. Although limited, for a few countries, research on group differences on measures of cognitive ability is available. Generally, the research to date shows that the advantaged group typically scores higher on tests of cognitive ability than the aboriginal group (e.g., aboriginal groups in Australia, Canada, New Zealand, and Taiwan). The available data also suggest that advantaged groups often score higher than recent immigrants on cognitive ability tests (i.e., Dutch vs. Turkish/Moroccan and Surinamese/Antillean immigrants in the Netherlands, and Belgians vs. Moroccan/Turkish immigrants in Belgium), though these differences may be driven, in part, by language, as group differences generally decreased when comparing the advantaged group to second- versus first-generation immigrants. In South Africa, mean score differences on cognitive tests between Black and White groups are normally larger than those found in U.S. studies, with Whites obtaining higher mean scores. In Israel, mean score differences between Jews and Arabs on college admissions tests favor the Jewish majority. Please see the first edition of this Handbook chapter (i.e., Sackett et al., 2010) for additional details regarding these studies. Data on personality measures are even more limited than for cognitive ability, with authors reporting personality data from only two countries: studies of Black-White differences in South Africa generally showing small differences (Joubert & Venter, 2013; Kriek, 2006), and several studies of Dutch-immigrant differences in the Netherlands showing much larger differences (De Soete, Lievens, Oostrom, & Westerveld, 2013; te Nijenhuis, van der Flier, & van Leeuwen, 1997, 2003; van Leest, 1997). Research examining gender differences in selection constructs and tools was also scarce in most countries, and research investigating group differences in job performance was virtually nonexistent outside of the U.S.

Overall, several findings of interest emerge. First, it is clear that gathering data and reporting mean differences by group is far more common in the United States than in virtually all of the other countries contributing to this report. This outcome is likely the result of the legal scrutiny to which tests are subjected in the United States. The Uniform Guidelines on Employee Selection Procedures (U.S. Equal Employment Opportunity Commission, 1978) use adverse impact computations as the basis for a prima facie case of discrimination, and thus, adverse impact resulting from test use is routinely examined, with mean differences between groups and the method of test use (e.g., a high or a low cutoff) functioning as key determinants of adverse impact. Second, although data tend to be sparser elsewhere than in the United States, group differences have been studied and observed in various settings involving different types of disadvantaged groups. Third, as in the United States, there is interest not only in whether there are group differences but also in understanding the basis for these differences. Language, culture, and differences in educational access and attainment are seen as key concerns in understanding differences in test scores across groups.

In the United States, disparate impact is the basis for a prima facie case of discrimination. The implicit assumption is that various groups are expected to obtain similar mean scores absent bias in the measure. Reports from European country authors suggest that many European countries target certain immigrant groups to meet specific labor shortages. Thus, immigrants might have higher or lower abilities, depending on whether a country tried to attract highly skilled people (e.g., recent immigrants into Switzerland from Northern and Western Europe) or tried to attract people with low skills (e.g., Turkish immigrants to Germany). In other words, even if one has a general expectation of no group differences at the population level, a finding of differences between locals and immigrants would be expected given this targeted immigration.
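The standardized mean difference (d) metric in which the group differences above are expressed (e.g., a difference of roughly 1.00 SD on a cognitive ability test) can be illustrated with a brief sketch. The scores below are hypothetical and the snippet is not drawn from the chapter or any cited study.

# Hypothetical illustration of a standardized mean difference (d):
# the difference between two group means divided by the pooled within-group SD.
import statistics

group_a_scores = [102, 98, 110, 95, 105, 99, 108, 101]   # hypothetical scores, group A
group_b_scores = [88, 95, 92, 85, 97, 90, 93, 89]        # hypothetical scores, group B

mean_a, mean_b = statistics.mean(group_a_scores), statistics.mean(group_b_scores)
var_a, var_b = statistics.variance(group_a_scores), statistics.variance(group_b_scores)
n_a, n_b = len(group_a_scores), len(group_b_scores)

# Pooled within-group standard deviation
pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5

d = (mean_a - mean_b) / pooled_sd   # positive values indicate group A scores higher
print(f"standardized mean difference d = {d:.2f}")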

Question 4: Are There Laws Prohibiting Discrimination Against Specific Groups and/or Mandating Fair Treatment of Such Groups? Which Groups Are Protected? Which Employers Are Covered? Which Employment Practices Are Covered (e.g., Selection, Promotion, Dismissal)?

Table 29.1 presents summary information addressing the above questions for each country. Several findings emerge. First, there is some basis for legal protection of members of specified groups in all countries. The bases for these protections vary widely. In many cases, the national constitution provides general, or at times specific, protections. This may be seen as analogous to the Fifth and Fourteenth Amendments to the U.S. Constitution, which respectively state that “no person shall . . . be deprived of life, liberty, or property without due process of law” and that no state shall “deny to any person within its jurisdiction the equal protection of the laws.” However, in virtually all cases there are also specific laws defining specified protected classes, specifying which employment practices are covered and which employers are required to comply. The intent here is to identify the major contemporary federal laws and government decrees, and as such this is not a complete record of all historical employment regulations. Additionally, several states and cities have additional statutes offering protection to groups beyond those covered by national law. Second, the protections offered are generally quite sweeping in terms of the types of employers covered and the range of employment practices included. In most cases all employers are covered. Some laws are restricted to government employees, and in some cases, coverage is restricted to larger employers, with the coverage threshold varying quite widely across statutes (e.g., 6 employees in Israel [though the equal pay law applies to all organizations], 15 in the U.S., 100 in Taiwan, and 300 in Korea). It is also typical for a broad range of employment practices to be included.


Federal government departments, crown corporations, and other federally regulated agencies and organizations; note that the majority of employers, including private employers, are regulated under provincial rather than federal law in Canada. All employers

Belgian Constitution of 1994 Article 10, 11, 191 Law Equality of Men-Women of 1978 Anti-Racism Law of 2003 Antidiscrimination Law of 2007

Canadian Human Rights Code of 1985 Section 15 of the Charter of Rights and Freedoms (1982) Federal Employment Equity Act (2004) Federal Contractors Program Pay equity legislation (federal and some provinces)

Constitution, Chapter 3 (Rights and Duties), article 19 N° 16 (Freedom of Work and its protection) and Work Code, Article 2° (2002)

French Constitution of 1958 International convention of the United Nations (1965) ratified in 1971 International convention of the International Labor Organization (1958) ratified in 1981 “The law concerning the fight against racism” of 1972 “The law concerning worker’s liberties in organizations” of 1982 Treaty of Amsterdam of 1997 L. 122–45 from Labor Law 225–1 and 225–2 from the Penal Code

Belgium

Canada

Chile

France All employers

All employers

All employers; EOWW of 1999 refers to organizations of 100+

The Crimes Act 1914 Racial Discrimination Act 1975 Sex Discrimination 1984 Human Rights and Equal Opportunity Commission Act 1986 Disability Discrimination Act 1992 Workplace Relations Act 1996 Equal Opportunity for Women in the Workplace Act 1999 Age Discrimination Act 2004

Australia

Employers Covered

Law

Country

International Laws and Practices

TABLE 29.1

(Continued)

Many employment practices including selection, access to training, pay, layoffs, transfers, and job classification

The Constitution establishes the general nondiscrimination principle on the basis of race, color, sex, age, marital status, union membership status, religion, political opinions, nationality, and national or social origin. In March 2008, a new law went into effect (law # 20,087). This new law defines discrimination as any action that is against the equal opportunity for all workers. A new regulation will specify the practices that are covered by the law.

Most employment practices including selection, performance appraisal, termination, and compensation

Most employment practices including selection and appointment, promotions, employment opportunities, labor conditions, dismissal, and wages

All stages of the employment relationship including, but not limited to, recruitment, selection, termination, training, and promotion

Employment Practices Covered

Law

Allgemeines Gleichbehandlungsgesetz: General Equal Opportunity Law

Greek Law 3304 of 2005, equal treatment Greek Law 3896 of 2010, on equal treatment between people in the labor market

Indian Constitution Article 15. Prohibition of discrimination on grounds of religion, race, caste, sex, or place of birth Article 16. Equality of opportunity in matters of public employment Article 39 Article 46 Article 335

Basic Law on Human Dignity and Liberty Basic Law on the Freedom of Occupation Women’s Equal Rights Law of 1951 Equal Pay Law of 1996 Equal Employment Opportunity of 1988

Italian Constitution of 1948 Article 3 Legislative decree 216 of 2003 Legislative decree 198 of 2006 (Equal Pay)

Labour Standards Law of 1947 Law on Securing Equal Opportunity and Treatment between Men and Women in Employment of 1972 Law for Employment Promotion, etc. of the Disabled of 1960 Law Concerning Stabilization of Employment of Older Persons of 1971

Country

Germany

Greece

India

Israel

Italy

Japan

TABLE 29.1  (Continued)

All employers

All employers

All employers All employers 6+

Government entities, public sector organizations, and organizations receiving government funding

All employers

All employers, except tendency organizations (e.g., religious organizations)

Employers Covered

Wages, working hours, other working conditions Recruitment and hiring, assignment, promotion, demotion, training, fringe benefits, change in job type and employment status, encouragement of retirement, mandatory retirement age, dismissal and renewal of employment contract Recruitment and hiring Mandatory retirement, secure stable employment, and re-employment

Recruitment, selection, promotion, employment agencies, outplacement procedures, training, working conditions

Compensation, staffing, conditions of employment, promotion, training and development, dismissal, severance pay, retirement benefits

Selection; previously promotion

Conditions for access to employment, to self-employment, or to occupation, including selection criteria and recruitment conditions; promotion; access to all types and to all levels of vocational guidance, vocational training, advanced vocational training and retraining, including practical work experience,employment and working conditions; dismissals, pay, membership, and involvement in an organization of workers or employers, or any organization whose members carry on a particular profession, including the benefits provided for by such organizations; social protection, including social insurance and sanitary relief; social provisions; education; and access to disposal and to provision of benefits, which are provided to the public, including housing

All stages of the employment relationship including placing a job ad, hiring and selection, definition of payment, performance appraisal and promotion, job-related training and job counseling, corporate health services, design of working conditions, social services, and dismissal

Employment Practices Covered

Kenyan Constitution Chapter 5, Section 82 HIV and AIDS Prevention and Control Act 14 The Persons with Disabilities Act 14 of 2003 The Employment Act of 2007 Cap 226.

National Human Rights Commission Act of 2001 Act on Equal Employment and Support for Work-Family Reconciliation (formerly the Equal Employment Act of 1987) The Act of Employment Promotion and Vocational Rehabilitation for the Disabled of 1990 The Aged Employment Promotion Act of 1991 The Basic Employment Policy Act of 1993

Constitution, Article 1 of 2003 General Law Equal Treatment of 1994

Human Rights Act of 1993

Constitution of the Republic of South Africa of 1996 Labour Relations Act, Act 66, of 1995 Employment Equity Act, No. 55, of 1998 as amended July 2014

Spanish Constitution, Article 14 of 1978 Law of Worker’s Statute of 1980, 2005, Article 4.2 y 17 Organic Law for Effective Equality between Women and Men of 2007, Article 1, 3, 4, 5, 6 Law of Basic Statute of Public Employee of 2005, Article 14.i

Bundesverfassung of 1999 (Swiss Federal Constitution) Bundesgesetz ueber die Beseitigung von Benachteiligungen von Menschen mit Behinderungen of 2002 (Federal Law for the Equal Treatment of People with Disabilities) Bundesgesetz ueber die Gleichstellung von Mann and Frau of 1995 (Federal Law for the Equal Treatment of Men and Women) Schweizerisches Zivilgesetzbuch of 1907 (Swiss Civil Code) Bundesgesetz betreffend die Erganzung des Schweizerischen Zivilgesetzbuches—Obligationenrecht of 1912 (Swiss Code of Obligations)

Kenya

Korea

The Netherlands

New Zealand

South Africa

Spain

Switzerland

Includes pre-, during, and postemployment practices (i.e., recruitment, sexual harassment, earnings, promotions, etc.) Protection of employee personality and personal data throughout all stages of the employment process

All employers

All employers

(Continued)

Includes pre- (particularly), during, and postemployment practices

Recruitment, selection, promotion, compensation, training, temporal employment companies, employment agencies, dismissal, labor agreements, collective bargaining, education before and during employment, health programs, and working conditions

Includes, but is not limited to, recruitment procedures, advertising, selection criteria, appointment and appointment process, job classification and grading, remuneration, employment benefits, terms and conditions of employment, job assignments, working environment and facilities, training and development, performance evaluation systems, promotion, transfer, demotion, disciplinary measure other than dismissal, and dismissal

Refusal of employment, less favorable employment, conditions of work, superannuation, fringe benefits, training, promotion, transfer, termination, retirement, and resignation; the act also covers job advertisements. (The act also covers other areas of public life outside of employment, such as access to public spaces and education.)

Recruitment, selection, employment agencies, dismissal, labor agreements, education before and during employment, promotion, and working conditions

Recruitment, hiring, training, placement, promotion, compensation, loans, mandatory retirement age, retirement, and dismissal Recruitment, selection, compensation, education, training, job placement, promotions, setting a mandatory retirement age, retirement, and dismissal Hiring, promotion, transfer, education, and training Recruitment, hiring, and dismissal. Recruitment and hiring.

All employment practices The law covers a wide range of employment decisions including recruitment, training, promotion, termination and or allocation of terms and conditions of employment

Public employers

All employers

All employers except the National Defense Force, National Intelligence Agency, and South African Secret Service

All employers (exceptions are permitted where genuine occupational characteristics (GOQ) require a particular gender, age, or other prohibited characteristics. For example, a position in the National Security service requires New Zealand citizenship).

All employers (except religious, philosophical, or political organizations)

All employers All employers (employers of 500+ workers for affirmative action clause) Employers with 50+ workers government employees Employers with 300+ employers Not specified

All employers (exceptions for the Employment Act include the Armed Forces, Kenya police, National Youth Services, and employer dependents where the dependents are the only employees in a family undertaking)

United States

United Kingdom

Turkey

Equal Pay Act 1963 Genetic Information Nondiscrimination Act of 2008

Americans with Disabilities Act 1990 and Rehabilitation Act 1973

Age Discrimination in Employment Act 1967

Civil Rights Act of 1964, Title VII (amended 1972, 1991)

All public employers and private employers with 15+ employees Private employers with 20+ employees, state and local governments ADA covers private employers, state and local governments; Rehabilitation Act covers federal government; Virtually all employers Virtually all employers All public employers and private employers with 15+ employees

All employers, trade unions, professional bodies, and employment agencies All ages, young and old

Sex Discrimination Act of 1975

Employment Equality (Age) Regulations 2006 Equal Pay Act of 1970 Disability Discrimination Act 1995 European Community Directives Equality Act of 2010

Public employers All employers, trade unions, professional bodies, and employment agencies

Performance appraisal, pay, promotion, and termination practices are implicitly covered; selection is not covered because the law only covers private sector employees who are already employed

All employers (except sea transportation, air transport, agricultural and forestry with less than 50 employees, home services, internships, professional athletes, rehabilitation workers, businesses with less than 3 workers, handmade art, jobs done at home, journalists) All employers

Prohibits discrimination against women in pay Prohibits use of genetic info in employment decisions

Prohibits discrimination against individuals with disabilities in employment decisions

Range of employment decisions including hiring, compensation, terms, conditions, and employment privileges Prohibits discrimination against individuals 40+

Selection Generally all employment practices: selection, promotion, termination, pay, performance appraisal, access to training, and treatment

Generally all employment practices, including selection, promotion, termination, pay, performance appraisal, access to training, and treatment

Article 70 specifically covers selection for public institutions; other practices are implicitly covered including pay, promotion, and dismissal in other articles

Prime Minister’s office circular of 2004 Race Relations Act of 1976

UN’s Convention on the Elimination of All Sorts of Discrimination Against Women Article 11

Republic of Turkey Constitution of 1982 Article 10 Article 49 Article 50 Article 70 Labor Law, Article 5 of 2003

Staffing Recruitment, selection, promotion, job allocation, performance evaluation, promotion, training, compensation, benefits, retirement, and dismissal Staffing for the jobs of contract employee; stationed police; mechanic, driver, janitor, cleaner; fee administrator; nontechnical workers not requiring the qualifications of civil servants Staffing, occupational guidance assessment, occupational training, employment services, occupation redesign, and compensation and retirement

Employment Practices Covered

All employers

All levels of government, public schools and state-owned businesses (except for those located outside of Penghu, Jinmen and Lianjiang County) All employers

Equal Employment Opportunity for Aborigines Act of 2001

People with Disabilities Rights Protection Act of 1970

All employers All employers

Article 5 of the Employment Services Act of 1992 Gender Equality in Employment Law of 2002

Taiwan

Employers Covered

Law

Country

TABLE 29.1  (Continued)

TABLE 29.2  Most Common Protected Classes
(Country-by-class grid indicating, for each of the 22 surveyed countries, whether race, sex, national/ethnic origin, color, age, religion, disability, political opinion, sexual orientation, and marital/family status are protected.)

For example, employee selection is specifically included in all countries except Chile, which has the least developed set of employment rights regulations examined here (though discrimination based on protected class status is prohibited in Chile, it is unclear which employment practices are covered). Third, there is both convergence and divergence in the classes that receive protection in each country. Table 29.2 identifies the most common protected classes and indicates whether those classes are covered in each of the contributing countries. The classes covered in U.S. Civil Rights law emerge as commonly covered across countries: race, color, religion, gender, national origin, age, and disability status. Three categories not protected by federal statute in the United States are protected in most countries: political opinion, sexual orientation, and marital/family status. Several protected classes are covered in only a few countries or are unique to particular countries; Table 29.3 identifies these less commonly protected classes. Examples include language, appearance, union membership, socioeconomic status, genetic information, and irrelevant or pardoned criminal record.

TABLE 29.3  Other Protected Classes by Country

Australia: Breastfeeding, family or career responsibilities, irrelevant criminal record, physical features, potential pregnancy, trade union or employer association activity, pregnancy, and transgender status
Belgium: Union membership, membership of other organizations, current or future health, wealth, physical or genetic characteristics, social status, and any other personal characteristic
Canada: A conviction for which a pardon has been granted or a record suspended
Chile: Union membership status
France: Moral principles or beliefs, genetic characteristics, union activities or activities in a “mutuelle” (i.e., private supplementary insurance), physical appearance, family name, health, and place of residence
Germany: Philosophy of life (i.e., moral principles/beliefs)
India: Scheduled castes, scheduled tribes, and other backward classes
Israel: Military service
Italy: Personal and social conditions and language
Japan: Social status
Kenya: Tribe, local connection, and HIV/AIDS status
Korea: Social status, region of birth, appearance, criminal record after punishment has been served, academic background, medical history, pregnancy, and physical conditions (e.g., appearance, height, weight)
The Netherlands: Philosophy of life (i.e., moral principles/beliefs), chronic disease, full-time/part-time work, and type of contract
New Zealand: Ethical belief (i.e., not having a religious belief), employment status
South Africa: HIV status, conscience, belief, culture, birth, pregnancy, and language
Spain: Social condition and membership to a labor union
Switzerland: Socioeconomic status, way of life, and language
Taiwan: Thought, provincial origin, appearance, facial features, union membership, status, and language
Turkey: Philosophical belief (i.e., moral principles/beliefs), sect, and language
United Kingdom: Persons who have undergone gender reassignment or intend to, pregnancy and maternity
United States: Pregnancy and genetic information


Question 5: What Is Required as Prima Facie Evidence of Discrimination? What Is Required to Refute a Claim of Discrimination? In most countries, direct (e.g., differential treatment) and indirect (e.g., disparate impact) prima facie evidence of discrimination are acknowledged. In India, disparate impact is necessary but not sufficient to prove a case of discrimination; underrepresentation must be shown to be due to historical, social, or religious discrimination toward a particular group. Only two countries require evidence of the intent to discriminate, Taiwan and Turkey, thus ruling out a disparate impact theory of discrimination. However, although disparate impact evidence is admissible in most countries, highly specific evidentiary rules used in the United States (e.g., the four-fifths rule and tests of the statistical significance of the difference between passing rates for various groups) are generally not in use (Canada is an exception, because cases using the four-fifths rule in the United States have been used to make a case for a similar standard). Commentators note that in most countries there are few or no cases involving disparate treatment challenges to predictors commonly used by psychologists, and thus, there is not the extensive case law that has developed in the United States. Recall that the four-fifths rule in the United States derives from guidelines issued by enforcement agencies, and the use of significance testing derives from case law; neither the concept of disparate impact nor the mechanisms for identifying its presence are contained in a statute. Absent a history of challenges resulting in case law, it is not surprising to see the lack of specificity as to evidentiary standards. A similar lack of specificity applies to the question of what is required to refute a claim of discrimination. Table 29.4 summarizes information across countries. In general, there is some version of the shifting burden of proof model in countries where disparate impact evidence is permissible. After a prima facie showing, the burden to justify the use of the employment practice shifts to the employer in all countries except Switzerland, where the burden of showing that the practice is not job-related is only partially reduced or remains with the plaintiff. There is a general notion that the employer should present evidence to support the job-relatedness of the employment practice in question, but rarely is the required form of such evidence specified (e.g., use of validity evidence to establish job-relatedness).
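As a concrete illustration of the two U.S. evidentiary devices mentioned above, the sketch below computes an impact (selection-rate) ratio evaluated against the four-fifths rule and a pooled two-proportion z-test of the difference between passing rates. The applicant counts and group labels are hypothetical, and the code is an illustrative sketch rather than a legally prescribed procedure.

from math import sqrt

def adverse_impact_check(pass_focal, n_focal, pass_ref, n_ref):
    """Compare selection rates for a focal group and a reference group (hypothetical data)."""
    rate_focal = pass_focal / n_focal        # selection rate, focal group
    rate_ref = pass_ref / n_ref              # selection rate, reference group
    impact_ratio = rate_focal / rate_ref     # the four-fifths rule flags ratios below 0.80

    # Pooled two-proportion z-test of the difference between passing rates
    pooled = (pass_focal + pass_ref) / (n_focal + n_ref)
    se = sqrt(pooled * (1 - pooled) * (1 / n_focal + 1 / n_ref))
    z = (rate_focal - rate_ref) / se

    return {
        "impact_ratio": round(impact_ratio, 3),
        "below_four_fifths": impact_ratio < 0.80,
        "z": round(z, 2),                    # |z| > 1.96 is the usual two-tailed .05 criterion
    }

# Hypothetical applicant flow: 30 of 100 focal-group applicants and 50 of 100
# reference-group applicants pass the selection procedure.
print(adverse_impact_check(30, 100, 50, 100))

Run on these invented counts, the function reports an impact ratio of 0.6 (below the 0.80 threshold) and a z-statistic of about -2.9, illustrating how the two criteria can point to the same conclusion without being the same test.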

Question 6: What Are the Consequences of Violation of the Laws? Table 29.4 also summarizes possible consequences of violation in each participating country. There is considerable variation in the array of possible remedies. As a point of reference, note that in the United States the focus is on compensatory or “make-whole” remedies, with punitive damages reserved for instances of intentional discrimination. Similarly, make-whole remedies are part of the landscape in all countries for which information could be obtained. Several countries also provide fines and punitive damages (e.g., Switzerland and Turkey), and several include imprisonment as a possible consequence (e.g., Belgium, France, and Greece).

Question 7: Are Particular Selection Methods Limited or Banned as a Result of Legislation or Court Rulings? There are relatively few restrictions on specific selection methods. As a point of reference, U.S. law regulates the use of the polygraph, prohibiting its use for most private employers; several other countries restrict polygraph use as well (e.g., Germany, Israel, and Turkey). The only selection method specifically mentioned in U.S. law is the reference in the Tower amendment to Title VII of the Civil Rights Act of 1964 (U.S. Code, 1964) to the permissibility of professionally developed ability tests, provided that such tests are not designed, intended, or used to discriminate. Additional instances reported of restrictions on specific selection methods in participating


TABLE 29.4  Evidence Needed to Refute a Discrimination Claim, Consequences of Violation, and Permissibility of Preferential Treatment by Country

Australia
Evidence to refute a claim: Inherent requirements of the job, existence of special measures to eliminate discrimination, occupational requirements, actions required by law, employment within small organizations, consistent beliefs (e.g., religious organizations or educational institutes). The statutes make no reference to the psychological concept of validity, nor has it arisen in case law.
Consequences of violation: Injunction to stop the act, award of damages, order to the organization to redress the situation, variation or cancellation of a contract or agreement that violates the law.
Preferential treatment: Within-group norming is not banned and is used by some psychological testers as a means of complying with legislation (Myors, 2003). Targets may be used in some EEO plans, but explicit quotas are avoided.

Belgium
Evidence to refute a claim: Statistical data or practical tests can be used as evidence.
Consequences of violation: Mediation or binding judgment from a civil court; imprisonment and/or fines.
Preferential treatment: Preferential treatment is permitted to remedy historical discrimination against a group. Quotas are required for board of director positions in public organizations and private organizations listed on the stock market, governmental positions of middle management level or higher, and scientific institutions, such that one-third of these positions must be held by women. Both sexes must be equally represented in election lists of political parties. Some organizations also utilize target numbers.

Canada
Evidence to refute a claim: The employer must demonstrate that the employment policy, practice, or procedure that is challenged is a bona fide occupational requirement. Tribunals and courts are quite liberal in the evidence that they will accept from employers in defense of their employment practices. Empirical and statistical evidence generated by I-O psychologists (e.g., local validation studies) may be useful in defending employment practices, but courts and tribunals often lack the sophistication to make full use of such detailed and complex technical information.
Consequences of violation: Fines, payment for lost wages, reinstatement, and ordering of special programs.
Preferential treatment: Preferential treatment is permitted (mainly in the public sector).

Chile
Evidence to refute a claim: Unclear, unless for sexual harassment or unionization suits; empirical evidence not required.
Consequences of violation: Unknown. Currently, sexual harassment suits may result in monetary compensation and up to three years' imprisonment.
Preferential treatment: The government has enacted an informal quota for women in minister positions; however, this has not crossed over into the private sector.

France
Evidence to refute a claim: Vague. The employer should present any information showing the decision is legitimate, nondiscriminatory, and based on objective information.
Consequences of violation: Three years' imprisonment and/or a fine for conviction in a criminal court. A discriminatory act may be annulled in a civil court and possibly result in financial compensation.
Preferential treatment: Considerable discussion about preferential treatment; politically, it is seen as undesirable. However, there are settings where it is used. When parties present lists of candidates for regional and senatorial elections, they are required to have an equal number of men and women (and, for some elections, an equal number of men and women must be elected). There are quotas in one setting: at least 6% of the workforce needs to be handicapped for organizations with more than 20 employees.

Germany
Evidence to refute a claim: Needs to be based on job requirements.
Consequences of violation: The employee has the right to refuse to work while on payroll and to sue the employer for damages.
Preferential treatment: No formalization, but public authorities are to give preference to women and handicapped persons.

Greece
Evidence to refute a claim: The employer must show that there has been no breach of the principle of equal treatment.
Consequences of violation: An employer who infringes the laws about equal treatment on the grounds of racial or ethnic origin, religion or belief, disability, age, or sex may be punished by imprisonment of 6 months up to 3 years with a penalty of 1,000 up to 5,000 euros, at the discretion of the judge.
Preferential treatment: Preferential treatment to prevent or compensate for disadvantages linked to any of the protected classes.

India
Preferential treatment: Preferential treatment in the form of a relaxation of qualifying scores for protected groups in external recruitment is permitted; however, a common standard is required for promotion. Not all members of protected groups are equally eligible; eligibility also depends on social/economic status. Government positions also use quotas.

Israel
Evidence to refute a claim: Evidence of test reliability and validity, which can be based on validity generalization. In addition, the National Labor Court recently ruled that employers seeking to prove their innocence will be subject to less severe tests of selection validity to the extent that they are accused of discriminating against internal as opposed to external candidates, the logic being that employers typically have far greater information upon which to base a selection decision when choosing among internal candidates.
Consequences of violation: Small fines; hiring, reinstatement, or career advancement of the plaintiff; payment of back wages.
Preferential treatment: Preferential treatment is required by public organizations and state-owned enterprises for women and minorities; 50% of board members of state-owned enterprises must be women. Preferential treatment is permitted in the private sector.

Italy
Evidence to refute a claim: Validity evidence not requested. Evidence to refute a claim is currently unclear.
Consequences of violation: Unknown; most claims are resolved by sending the employer and employee to “regional equal opportunity counseling.”
Preferential treatment: Preferential treatment permitted for women.

Japan
Evidence to refute a claim: The general guideline by the Ministry of Health, Labour and Welfare states that selection should be based solely on applicant aptitudes and abilities, and human rights should be respected during the selection process. To refute a claim of discrimination, the employer must show that the selection procedure is consistent with the guideline. Empirical validity evidence is not necessarily required, nor is evidence from an on-site study or other settings. Investigation is carried out on a case-by-case basis.
Consequences of violation: In the event that an employer is in violation of the law, the Ministry of Health, Labour and Welfare will issue recommendations. If the employer has not complied with the recommendations, the Ministry may make a public announcement of the violation. An employer who has failed to make a requested report or has made a false report is liable to a civil fine of not more than 200,000 yen (approximately 2,400 USD). For employment of individuals with disabilities, public employment security offices may order employers who do not meet quotas to create a plan for hiring individuals with disabilities; an employer who fails to make the plan is liable to a civil fine of not more than 200,000 yen (approximately 2,400 USD), and the Ministry may also publicly announce employers who do not comply with the law.
Preferential treatment: Preferential treatment of women is not required, but softer forms of preference for women are permitted and supported by the state as long as they are intended to improve circumstances that impede the securing of equal opportunity and treatment between men and women in employment. Quotas are required for physically disabled workers.

Kenya
Evidence to refute a claim: Burden of proof rests with the employer. Evidence required is vague, but generally the employer must show that decisions were based on applicant aptitudes and abilities. Empirical validity evidence not required.
Consequences of violation: The Employment Act is vague regarding penalties for organizations that violate the laws prohibiting discrimination. These cases would be referred to the industrial court for adjudication of punitive or remunerative damages. In the case of employer-employee relationships, aggrieved parties can lodge complaints (regarding any violations of the Employment Act) with labour officers or complaints/suits with the Industrial Court. Section 88 of the Employment Act limits liability to fines not exceeding 50,000 Kenya shillings (US$475), imprisonment for terms not exceeding three months, or both, unless otherwise specified.
Preferential treatment: Preferential treatment is permitted and encouraged. The Employment Act of 2007 notes expressly that taking affirmative action measures consistent with “promoting equality or eliminating discrimination in the workplace” should not be considered discrimination.

Korea
Evidence to refute a claim: Show job-relatedness, but the specific method is unclear.
Consequences of violation: The National Human Rights Commission will make a binding conciliation resolution. Fines may be imposed.
Preferential treatment: Quotas required for the disabled. Preferential treatment for the aged and “semi-aged” for priority occupations.

The Netherlands
Evidence to refute a claim: Generally no validity evidence is requested because the validity of common psychological tests, such as tests for cognitive abilities, personality inventories, and assessment center exercises, is taken for granted. Most claims concern direct discrimination or treatment discrimination (Commissie Gelijke Behandeling, 2006). Exceptions are clear-cut cases of indirect discrimination in which inappropriate job requirements were set.
Consequences of violation: Nonbinding judgment by the Commission of Equal Treatment and possible referral to a civil court.
Preferential treatment: Preferential treatment is permitted for women, ethnic minorities, and persons with a disability or chronic illness (only in the case of equal qualification, and the use of preferential treatment must be mentioned in the job description).

New Zealand
Evidence to refute a claim: Genuine Occupational Qualifications (GOQ): sex (e.g., physiological requirements, considerations as to decency or privacy, single-sex establishments, provision of welfare services); race (e.g., necessary for dramatic performance, cultural authenticity, work in ethnic restaurants, provision of welfare services).
Consequences of violation: Apology, payment or compensation, assurance that the discriminatory act will not be repeated, or referral to a Human Rights Tribunal for further judgment (e.g., a declaration that the defendant has committed a breach, an order to undertake a training or other program, compensatory damages, or “any other relief the Tribunal thinks fit”).
Preferential treatment: Preferential treatment is currently being explored. It appears to be permitted (and may soon be applied to the Māori population, given the recent formulation of Treaty principles stating that the Crown has a duty to actively protect Māori interests and to redress past injustices).

South Africa
Evidence to refute a claim: Qualitative and empirical data can be brought to bear to support validity.
Consequences of violation: Fines or possible cancellation of government contracts.
Preferential treatment: Preferential treatment is permitted and applied. Racial quotas are legal and practiced by many large employers. The practical implication is that in the South African context it is legal to use race norming, or within-group top-down selection strategies, to address the affirmative action needs of organizations.

Spain
Evidence to refute a claim: Recent laws may lead to a greater focus on empirical evidence; until now, the validity of tests was taken for granted.
Consequences of violation: Compensation; rejection of the decision and subsequent application of the court decision; repetition of the selection process with new procedures.
Preferential treatment: Preferential treatment for women in some cases.

Switzerland
Evidence to refute a claim: Empirical evidence not generally presented or required.
Consequences of violation: Courts can award damages, including payment of owed earnings and payment of compensation and satisfaction.
Preferential treatment: Preference is permitted but not required.

Taiwan
Evidence to refute a claim: Provide evidence of job-relatedness.
Consequences of violation: Fines may be imposed.
Preferential treatment: Quotas required for aborigine peoples and individuals with disabilities (quotas differ for different organizations, areas of the country, and positions).

Turkey
Evidence to refute a claim: Unclear, because few cases make it to court.
Consequences of violation: Reinstatement, back pay, and/or monetary damages.
Preferential treatment: Preferential treatment is not permitted.

United Kingdom
Evidence to refute a claim: Show that the requirement is justified. The employer can show that it took all “reasonable” steps to prevent discrimination. No impact cases involving tests have reached the stage of a court decision, so there is as yet no requirement of validity evidence.
Consequences of violation: The court has discretion. Compensation to the plaintiff. Formal investigation by governing bodies that can recommend changes in procedures.
Preferential treatment: Employers may give preferential treatment to members of underrepresented groups so long as they are equally well qualified.

United States
Evidence to refute a claim: Evidence that the challenged practice is job-related for the position in question and consistent with business necessity (largely through validity studies).
Consequences of violation: Upon a finding of discrimination, a judge can specify “make whole” remedies, such as back pay, hiring, or reinstatement. There are no punitive damages absent a finding of intentional discrimination.
Preferential treatment: The 1991 amendments to Title VII of the Civil Rights Act prohibit preferential treatment, specifically in the form of adjusting scores or using separate norms for minority group members. Preferential treatment is permitted after a finding of discrimination as part of a judicially ordered remedy.

countries include a prohibition against comprehensive personality assessment in Switzerland and a restriction on the use of certain Minnesota Multiphasic Personality Inventory (MMPI) and California Psychological Inventory (CPI) items in Spain. In Israel, recent Labor Court rulings have made the use of graphology for selection risky and potentially problematic for employers, though its use is still technically legal. The most strikingly different approach to regulating selection practices is found in South Africa. Rather than the common approach of a presumptive right of an employer to use a particular method absent a successful challenge by a plaintiff, South African law puts the burden immediately on the employer. According to the Employment Equity Act of 1998 (Government Gazette, 1999), psychological testing and other similar assessments are prohibited unless the test is proven to be scientifically valid and reliable, can be applied fairly to all employees, and is not biased against any employee or group. The Society for Industrial and Organizational Psychology in South Africa (SIOPSA) published “Guidelines for the Validation and Use of Assessment Procedures for the Workplace” in 2005 to provide guidelines for practitioners in the field of I-O psychology to ensure that their assessment instruments and practices comply with the scientific requirements and international best practices (SIOPSA, 2005). These guidelines were largely based on the American SIOP Principles. Given more recent amendments to the act (as amended in July 2014), employers are now also required to register instruments that measure psychological constructs with the Health Professions Council of South Africa before they may be used in the employment setting. Similarly, in the Netherlands, the Dutch Committee on Tests and Testing (COTAN), which is a committee of the Dutch Association of Psychologists (Nederlands Instituut van Psychologen), audits the quality of psychological tests that are available for use in the Netherlands (Evers, Sijtsma, Lucassen, & Meijer, 2010). Currently, the COTAN has evaluated more than 750 tests, including intelligence tests, personality assessments, and occupational tests. Starting this year, COTAN will also evaluate tests on their evidence of fairness as one of their criteria, suggesting that investigations of differential prediction and group differences may become more commonplace for assessments used in the Netherlands. However, note that employers are legally allowed to use tests that have been rated as insufficient by the COTAN, though it appears that ratings by COTAN are beginning to carry substantial weight with employers, particularly government and financial institutions.

Question 8: What Is the Legal Status of Preferential Treatment of Members of Minority Groups (e.g., Quotas or Softer Forms of Preference)? To set the stage, note that the term “affirmative action” is used in various contexts, only some of which involve preferential treatment for protected groups. Some forms of affirmative action involve outreach efforts to publicize openings and to encourage applications from members of protected groups without preferential treatment given once an individual is in the applicant pool. Approaches involving preferential treatment fall into two main classes: (a) those that set differing standards for protected and nonprotected groups without setting aside a specified number or proportion of openings for members of protected groups (e.g., different cutoff scores, within-group norming) and (b) quota approaches that set aside a fixed number or proportion of openings for members of protected groups. Table 29.4 summarizes the status of preferential treatment in the participating countries. Preferential treatment is a domain in which the United States emerges as a clear outlier. Preferential treatment in terms of differing score cutoffs or separate norming of tests within group is prohibited by the U.S. Civil Rights Act of 1991 (U.S. Code, 1991), and the use of quotas is restricted to very limited settings, such as a court-ordered remedy following a finding of discrimination. In contrast, preferential treatment in some form is typically allowed, at least for some groups, in almost all other countries surveyed. Several commentators noted that applying lower standards to protected groups (e.g., different cutoffs or within-group norming) is used for selection but not for promotion decisions (e.g., Australia, South Africa, and India). The status of quotas also


varies substantially across contexts, from prohibited (Australia), to permitted and widely used (South Africa), to used in government sectors (backward classes in India and women in Chile), to required for certain groups (e.g., aborigines in Taiwan, individuals with disabilities in France, Japan, Kenya, Korea, and Taiwan). Since our original Handbook chapter was published, several European countries have adopted the use of quotas to increase the number of women in high-level government positions, including among elected public officials (e.g., Belgium and France).
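As a concrete illustration of the first class of preferential approaches described above (differing standards without set-asides), the sketch below contrasts ordinary top-down selection on raw scores with within-group norming, in which applicants are re-ranked by percentile within their own group. The applicant names, group labels, scores, and hiring target are all hypothetical; this is an illustrative sketch of the general technique, not a procedure taken from the chapter.

# Hypothetical applicants: (name, group, raw_score).
applicants = [
    ("A1", "group_1", 88), ("A2", "group_1", 84), ("A3", "group_1", 79),
    ("B1", "group_2", 82), ("B2", "group_2", 75), ("B3", "group_2", 70),
]
n_to_hire = 3

def within_group_percentiles(pool):
    """Re-express each raw score as a percentile rank within the applicant's own group."""
    normed = []
    for name, group, score in pool:
        group_scores = sorted(s for _, g, s in pool if g == group)
        rank = group_scores.index(score) + 1        # 1 = lowest score in the group
        normed.append((name, group, rank / len(group_scores)))
    return normed

# Common standard: rank all applicants on raw scores and take the top n.
top_down = sorted(applicants, key=lambda a: a[2], reverse=True)[:n_to_hire]

# Within-group norming: rank applicants on within-group percentiles instead.
normed = sorted(within_group_percentiles(applicants), key=lambda a: a[2], reverse=True)[:n_to_hire]

print("Top-down on raw scores:", [name for name, _, _ in top_down])
print("Within-group norming:  ", [name for name, _, _ in normed])

With these invented scores, the two decision rules select different applicants, which is precisely the property that makes within-group norming a form of differing standards rather than a quota: no openings are set aside, but group membership changes who is ranked highest.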

Question 9: How Have Laws and the Legal Environment Affected the Practice of Science-Based Employee Selection in This Country? In only a few countries (i.e., Canada, South Africa, and the United States) is the legal environment seen as having a large effect on science-based employee selection. In general, the separation between legal issues and science-based practice can be attributed partially to the much more amorphous legal standards and consequences regarding employment discrimination in most countries surveyed. However, the reciprocal relationship between science-based selection and the legal environment will need continued monitoring because many countries are still in the process of developing legal statutes and requirements or establishing guidelines for prosecution and rulings on employment discrimination. Overall, most employers in the countries surveyed have great latitude in choosing what selection procedures to utilize. However, most employers are aware of the social and political nature of selection procedures and seem to err on the side of mainstream, popular, and usually well-validated selection methods. The most common types of selection procedures do vary by country. It is common to see reports of increased use of the tools and techniques of science-based selection, but the driving forces appear more commonly to be the presence of multinational firms and consulting firms that import these techniques into the country.

DISCUSSION In the original version of this chapter, we offered 35 broad summary statements about the patterns emerging from the narratives from the countries surveyed (e.g., although every country has a law or directive that prevents discrimination on the basis of sex or race/ethnic origin, in many countries few cases are actually filed or brought to trial because workers do not understand their rights or because the evidence needed to establish discrimination is not clear). It appears that over the subsequent five to seven years, the landscape regarding the legal environment for selection has remained more similar than different. This is not entirely surprising given that it typically takes time for countries to alter their employment policies, regulations, and laws. Thus, we believe that our prior summaries and conclusions generally still stand, and we encourage interested readers to revisit our original chapter for these specifics (Sackett et al., 2010, pp. 673–675). In looking forward, we asked commentators to identify trends that they see emerging for both selection more generally, as well as specifically with regards to the legal environment for selection. Several commentators noted the increased use of new technologies by organizations for recruitment and selection, particularly social media, and that doing so may bring to the forefront new concerns regarding privacy as well as what information the employer can and should have access to about applicants. Many commentators also believe that concerns about fairness and discrimination will continue to grow. In particular, commentators in countries that have recently adopted new policies (e.g., more aggressive affirmative action efforts in Kenya) are curious as to whether and to what extent these laws will be effective in promoting greater representation of historically disadvantaged groups in the workplace. Other commentators highlight that laws may be insufficient in bringing about change if minority groups lack faith in mainstream institutions, which may need to be more proactive in their enforcement of anti-discrimination laws in order to change the public’s perception. Finally, given the ongoing global refugee crisis, the large influx of migrants, particularly in many European countries, may ultimately serve to substantially alter


the prevalence of disadvantaged groups and the nature of such groups in many countries in the future. In conclusion, this compilation of information about perspectives from a wide range of countries should be a valuable resource to students, researchers, and practitioners around the globe as a starting point for further research and improved practice. We encourage international collaborations on other workplace issues, and we hope this project provides a useful model of an effective partnership.

NOTE 1. All authors contributed equally to this chapter. Winny Shen and Paul Sackett integrated the text materials provided by each author. Portions of this chapter were previously drawn from an article by a subset of the authors: Myors, B., Lievens, F., Schollaert, E., Van Hoye, G., Cronshaw, S. F., Mladinic, A., et al. (2008). International perspectives on the legal environment for selection. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 200–256. Used/reprinted by permission of the © Society for Industrial and Organizational Psychology and Cambridge.

REFERENCES Cascio, W. F., & Aguinis, H. (2008). Applied psychology in human resource management (6th ed.). Upper Saddle River, NJ: Pearson Education. De Soete, B., Lievens, F., Oostrom, J., & Westerveld, L. (2013). Alternative predictors for dealing with the diversity–validity dilemma in personnel selection: The constructed response multimedia test. International Journal of Selection and Assessment, 21, 239–250. Evers, A., Sijtsma, K., Lucassen, W., & Meijer, R. R. (2010). The Dutch review process for evaluating the quality of psychological tests: History, procedure, and results. International Journal of Testing, 10, 295–317. Foldes, H. J., Duehr, E. E., & Ones, D. S. (2008). Group differences in personality: Meta-analyses comparing U.S. racial groups. Personnel Psychology, 61, 579–616. Government Gazette. (1999). Employment Equity Act, 1998 (Act No. 55 of 1998), R 1360. Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence Erlbaum. Joubert, T., & Venter, N. (2013). The occupational personality questionnaire. In S. Laher & K. Cockroft (Eds.), Psychological assessment in South Africa (pp. 277–291). Johannesburg, SA: Wits University Press. Kriek, H. J. (May 2006). Personality assessment: Group differences, language proficiency and fairness. Presented at the Society of Industrial and Organizational Psychology Conference, Dallas, TX. Myors, B. (2003). Within-group norming: Just because it’s illegal in America, doesn’t mean we can’t do it here. Paper presented at the 5th Australian Conference on Industrial/Organisational Psychology, Melbourne, Australia. Myors, B., Lievens, F., Schollaert, E., Van Hoye, G., Cronshaw, S. F., Mladinic, A., . . . & Sackett, P. R. (2008). International perspectives on the legal environment for selection. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 206–256. Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., & Tyler, P. (2001). Ethnic group differences in cognitive ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297–330. Sackett, P. R., Shen, W., Myors, B., Lievens, F., Schollaert, E., Van Hoye, G., . . . & Aguinis, H. (2010). Perspectives from twenty-two countries on the legal environment for selection. In J. L. Farr & N. T. Tippins (Eds.), Handbook of employee selection. New York, NY: Routledge. Society for Industrial and Organisational Psychology in South Africa. (2005). Guidelines for the validation and use of assessment procedures for the workplace. Retrieved on June 7, 2007, from http://www.siopsa. org.za te Nijenhuis, J., van der Flier, H., & van Leeuwen, L. (1997). Comparability of personality test scores for immigrants and majority group members: Some Dutch findings. Personality and Individual Differences, 23, 849–859. te Nijenhuis, J., van der Flier, H., & van Leeuwen, L. (2003). The use of a test for neuroticism, extraversion, and rigidity for Dutch immigrant job-applicants. Applied Psychology: An International Review, 52, 630–647. U.S. Code. (1964). Pub. L. 88–352. U.S. Code. (1991). Pub. L. 102–166. U.S. Equal Employment Opportunity Commission. (1978). Uniform guidelines on employee selection procedure. 29 CFR 1607.1. Washington, DC: Author.


U.S. Office of General Accounting. (2003). Women’s earnings: Work patterns explain difference between men and women’s earning. Retrieved on June 20, 2008, from http://usgovinfo.about.com/gi/dynamic/offsite.htm?zi=1/XJ&sdn=usgovinfo&cdn=newsissues&tm=110&gps=64_261_1276_825&f=00&tt=2&bt=0&bts=0&zu=http%3A//www.gao.gov/new.items/d0435.pdf
van Leest, P. F. (1997). Persoonlijkheidsmeting bij allochtonen [Assessment of personality for ethnic minorities]. Lisse, the Netherlands: Swets & Zeitlinger.


30 A CONSIDERATION OF INTERNATIONAL DIFFERENCES IN THE LEGAL CONTEXT OF EMPLOYMENT SELECTION1 EMILEE TISON, KRISTEN PRYOR, MICHAEL AAMODT, AND ERIC DUNLEAVY

Chapter 29 in this Handbook does an excellent job of identifying and discussing international differences in employment law. In this chapter, we attempt to understand these differences by establishing a framework around the development and evolution of employment law as it relates to employee selection, promotion, and termination. We will first summarize three main categories of international differences in employment law, particularly in terms of primary differences from the U.S., and then propose a model to explain why these differences exist.

INTERNATIONAL DIFFERENCES IN THE LEGAL CONTEXT OF SELECTION There seem to be three major categories in which the U.S. differs from most other countries in employment law:

• The groups that are defined as protected
• The legality of preferential treatment for underrepresented groups
• The evidentiary standard required to prove or refute charges of discrimination

Differences in Protected Group Status Many of the countries discussed in Chapter 29 have more inclusive policies than the U.S. concerning which groups of people are protected under law. For example, marital status, political affiliation, and even socio-economic status (SES) are delineated as protected in many countries, but not in the U.S.2 SES, for instance, is a covered status in several countries (e.g., India, Italy, Japan) but is not formally protected under U.S. law. This is possibly because SES is a difficult concept to define and measure in the context of U.S. culture, whereas in some other countries there is a more formal definition on which the protected status is based (e.g., the caste system in India, where lower castes are protected). However, the notion that employment opportunities may differ as a function of financial and social status is intuitive. Additionally, SES has played a role in some important case law in the U.S. in the context of educational opportunities, albeit often as an


alternative to race/ethnicity. For example, in several recent cases concerning affirmative action as an operational need (i.e., Gratz v. Bollinger, 2003; Grutter v. Bollinger, 2003; Parents v. Seattle School District, 2007), the Supreme Court discussed the use of SES as a plus factor in narrowly tailored affirmative action policies used by educational institutions. The recent resurgence of socioeconomic class disparity discourse within the U.S. (e.g., the 2011 Occupy Wall Street movement protesting income inequality and a wave of federal, state, and local regulations implemented since 2014 to increase minimum wage rates), coupled with the discussion of SES as a legitimate factor in education, raises the possibility that SES may yet become a protected status in the U.S. Though SES is not yet a factor commonly discussed in employment selection, it will be interesting to monitor the role of SES in the ongoing development of the legal context of employee selection in the U.S.

Relatedly, Chapter 29 provides insight into the emergence of various protected classes over time, across the international community. Table 30.13 of this chapter highlights some of these key developments over time to illustrate international differences in the legal adoption of protected classes. This temporal presentation shows that most of the countries that were early to begin legally defining some protected class(es) went through a sequential process in which several pieces of legislation were enacted over time, each of which added a new protected class or classes. For example, Australian legislation covered race in 1975, followed by sex in 1984, disability in 1992, and then age in 2004; U.S. legislation covered sex, race, religion, color, and national origin in 1964, age in 1967, and disability in 1990; Japanese legislation covered national origin in 1947, disability in 1960, age in 1971, and sex in 1999. Approximately two-thirds of the countries listed in Table 30.1 included sex as a protected class in their first regulation; however, only three of those countries enacted this protection as a stand-alone regulation. Approximately half of the countries, including the U.S., protected several classes in their first regulation, most commonly including sex, race, and religious protections. Protections for disability status and age were more likely to be found in stand-alone regulations for any given country.

A review of Chapter 29 also shows that some countries enacted protections for certain groups earlier than others. For example, Israel, the U.S., and Japan each adopted protections for a particular class well before the majority of the other surveyed countries. Israel included sexual orientation as a protected class in 1988, five years ahead of New Zealand (1993) and ten years ahead of Ireland (1998). The U.S. included age as a protected class in 1967, four years ahead of Japan in 1971, 13 years ahead of Spain in 1980, and 21 years ahead of Israel in 1988. Japan’s protection of the disabled in 1960 was 20 years ahead of Spain in 1980 and 30 years ahead of the U.S. in 1990.

Many factors influenced when employment laws were enacted in various countries and can make cross-country comparisons difficult. For example, most countries listed in Table 30.1 include prohibitions against discrimination in their constitutions. Yet, typically these prohibitions only affect employees in the public sector and at times do not include sanctions for violating the prohibitions.
Furthermore, employment law enforcement may be more or less centralized across countries. As another example, it is difficult to compare Canada to other countries because employment laws are enacted by each province rather than by the national government. Comparisons of member countries of the European Community are complicated by the fact that each member had its own set of employment laws prior to 2000, but many had to create new ones or modify existing ones to be in compliance with the antidiscrimination provisions of Directive 2000/78/EC. When discussing protected class status in the U.S., the focus here is on federally defined and enforced employment law issues.4 Additionally, it is important to note that the U.S. identifies and interprets protected group status differently when referring to private businesses, federal contractors, or federal agencies. For example, political affiliation is a protected group when referring to employees of federal agencies, but not when referring to employees of private businesses. Figure 30.1 illustrates some of these differences. The base of the figure defines the foundation of protected class statuses in the U.S. These protected class statuses are enforced by the Equal Employment Opportunity Commission (EEOC),5 regardless of industry. The middle of the figure defines those additional protected class statuses added for federal contractors and enforced


TABLE 30.1  Summary of Protected Groups Over Time (the original matrix also marks, for each statute, which of race, sex, religion, national origin, disability, age, and sexual orientation it covers)

Year | Country | Statute
1947 | Japan | Labor Standards Act
1960 | Japan | Employment Promotion Act for the Physically Disabled
1964 | United States | Civil Rights Act of 1964
1966 | Italy | Act 604, 15 July 1966
1967 | United States | Age Discrimination in Employment Act
1971 | Japan | Law Concerning Stabilization of Employment of Older Persons
1972 | New Zealand | Equal Pay Act
1975 | Australia | Racial Discrimination Act
1975 | United Kingdom | Sex Discrimination Act
1976 | United Kingdom | Race Relations Act
1976 | Netherlands | Act of Equal Treatment of Men and Women
1977 | Italy | Act 903, 9 December 1977
1978 | Belgium | Law Equality of Men-Women
1980 | Spain | Workers’ Statute Law
1984 | Australia | Sex Discrimination Act
1987 | Finland | Equality Act
1988 | Israel | The Employment (Equal Opportunity) Law
1988 | Argentina | Anti-Discrimination Law
1989 | Brazil | Consolidation of Labor Laws (CLT)
1990 | United States | Americans with Disabilities Act
1990 | Italy | Act 108, 11 May 1990
1990 | Mexico | Federal Labor Law—Article 3
1990 | Korea | Act of Employment Promotion and Vocational Rehabilitation for the Disabled
1991 | Korea | Aged Employment Protection Act
1991 | Czech Republic | Act No. 1/1991
1992 | Australia | Disability Discrimination Act
1992 | Hungary | Act XXII of the Labor Code
1992 | Taiwan | Article 5 of the Employment Services Act
1993 | New Zealand | Human Rights Act
1993 | Singapore | Retirement Age Act
1994 | The Netherlands | General Equal Treatment Act
1995 | Switzerland | Federal Law on Equality of Women and Men
1995 | United Kingdom | Disability Discrimination Act
1998 | Ireland | Employment Equality Act
1998 | Israel | Equal Rights for Handicapped Persons Law
1998 | South Africa | Employment Equity Act
1999 | Fiji | Human Rights Commission Act
1999 | Hungary | Rights of Handicapped Persons
1999 | Belgium | Law of May 7, 1999
1999 | Japan | Equal Employment Opportunity Act
1999 | Korea | Gender Discrimination Prevention and Relief Act
2000 | European Community | Directive 2000/78/EC
2002 | Chile | Article 2 of Labour Code
2002 | Taiwan | Gender Equality in Employment Law
2003 | Ethiopia | Labour Proclamation No. 377/2003 (Article 14)
2003 | Kenya | Persons with Disabilities Act 14
2004 | Czech Republic |
2004 | Switzerland | Federal Law on Equality of Disabled Persons
2004 | Australia | Age Discrimination Act
2005 | Greece | Act 3304


FIGURE 30.1  Protected Group Statuses in the U.S. Based on Employment Sector

by the Office of Federal Contract Compliance Programs (OFCCP).6 The apex of the figure defines the additional protected class statuses added for federal employees that are enforced by the EEOC.7 Figure 30.1 highlights the complicated manner of defining protected class status in the U.S.—without even considering the idiosyncrasies that may exist across state and local governments. It is important to note that nuances of this sort are likely to exist in other countries as well.

Differences in Preferential Treatment U.S. law and regulation also appear to be different from other countries when it comes to the status of preferential treatment methods in selection. Preferential treatment in this context refers to the differential treatment of a protected group, generally providing an advantage or special consideration. Many countries allow various forms of minority preference decision systems, including within-group norming, separate cut scores for protected groups, and quota systems. Specifically, a majority of participating countries allow some sort of preference in their selection practices. As Chapter 29 describes, since 1991, U.S. policy generally considers quota systems, within-group norming, and separate cut scores to be illegal. Even more flexible minority preference decision systems, such as minority preference sliding bands, also are


illegal in the U.S. (Gutman & Christiansen, 1997). These international differences may reflect the extent to which preferential treatment is perceived as reverse discrimination in each country, a concern that appears to be more salient in the U.S.

Importantly, preferential treatment and affirmative action may be some of the most misunderstood topics in the U.S. (Aamodt, 2016). In contrast to preferential treatment, the concept of affirmative action is that focusing on efforts to identify and recruit more diverse qualified individuals will necessarily increase diversity over time (based on the assumption that the more diverse the pool to choose from, the more diverse the selections will be). Some misunderstanding may stem from the differences in how affirmative action programs are applied in the employment setting versus in the education setting—where different laws apply (e.g., Title IX of the U.S. Educational Amendments of 1972, renamed the Patsy Mink Equal Opportunity in Education Act in 2002). For example, in educational admissions settings, protected group status has been used as a factor in case-by-case selection decision making, not just in outreach and recruitment, albeit controversially (e.g., the Grutter and Gratz cases). As discussed above, more recent U.S. Supreme Court cases have further limited the scope of the controversial aspects of affirmative action in the educational setting. Specifically, the courts have continued to define and limit the ‘narrowly tailored’ instances where race is an acceptable consideration as a factor in University admission processes. The courts require educational institutions to “verify that it is ‘necessary’ . . . to use race to achieve the educational benefits of diversity” (Bakke, 438 U.S. at 305). Furthermore, the courts have required that reasonable alternatives be sought before race is used as a factor (Wygant v. Jackson Board of Education, 476 U.S. 267, 280; Grutter, 539 U.S. at 339). One such recent Supreme Court case is Fisher v. University of Texas. The initial Supreme Court ruling in 2013 remanded the case so that strict scrutiny could be applied to the extent to which the University’s consideration of race was narrowly tailored to achieve diversity. The Fifth Circuit reaffirmed judgment for the University, noting that the University’s holistic use of race in pursuit of diversity was not reliant on quotas or targets, and this was supported by the fact that racial considerations were not limited to the furtherance or advancement of minority students (Fisher v. University of Texas, 2014). The case was heard again by the Supreme Court in December 2015; on June 23, 2016, the Court upheld the Fifth Circuit’s judgment, holding that the race-conscious admissions program was lawful under the Equal Protection Clause (Fisher v. University of Texas, 2016).

In the employment context, the focus of affirmative action is typically on outreach and recruitment, although analyses are conducted to evaluate the level of diversity in the workforce compared to the availability of qualified diverse individuals. There are specific instances, particularly when an organization is seeking to remedy past discrimination, where race considerations may be lawful, but those are exceptions more than the rule (e.g., court-ordered remedy or consent decree; CRA, 1991). These nuances to affirmative action in the U.S.
(i.e., “narrow tailoring” of race considerations in educational settings, exceptions to the ban on preferential treatment to remedy past discrimination, and inappropriate implementation of action-oriented programs when affirmative action goals are set) contribute to the misconceptions that exist. Additionally, various political and social groups that oppose affirmative action may equate affirmative action with preferential treatment. Regardless of the confusion, U.S. Executive Order 11246 requires that government contractors and subcontractors take affirmative action to advance and employ minorities and females in their organizations. The regulations require contractors to assess the diversity of their workforce and set goals where it is determined a lack of diversity exists. However, an affirmative action goal in this context does not allow an employer to apply preferential treatment measures. Thus, the goal does not represent a target or quota that must be met. Rather, the employer must cast a wider net in the outreach and recruitment process to try to attract qualified minorities and females and encourage them to apply. Once those individuals apply for a position, equal employment law applies, and contractors are required to hire qualified applicants based on legitimate nondiscriminatory factors.
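For readers unfamiliar with the sliding-band approach mentioned earlier in this subsection, the sketch below illustrates the general banding logic: scores within a band defined by the standard error of the difference are treated as statistically indistinguishable. The scores, standard deviation, reliability, and band multiplier are hypothetical; how candidates are chosen within the band is the policy question at issue, and the minority-preference version of that choice is the form described above as impermissible in the U.S.

from math import sqrt

def band_lower_bound(top_score, sd, reliability, c=1.96):
    """Lower edge of a score band anchored at the current top score (hypothetical values)."""
    sem = sd * sqrt(1 - reliability)   # standard error of measurement
    sed = sem * sqrt(2)                # standard error of the difference between two scores
    return top_score - c * sed

# Hypothetical test scores (sorted high to low), SD, and reliability.
scores = [96, 93, 91, 88, 84, 80]
lower = band_lower_bound(top_score=scores[0], sd=10, reliability=0.90)
in_band = [s for s in scores if s >= lower]
print(f"Band runs from {lower:.1f} to {scores[0]}; scores treated as equivalent: {in_band}")
# How to break ties among candidates inside the band is a separate policy decision;
# applying minority preference at that step is the variant the chapter notes is not
# permitted in the U.S.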


Differences in Evidence of Discrimination Another pattern identified in Chapter 29 is the role of adverse impact in the legal context of selection. Specifically, it appears that adverse impact theory plays a more prominent role in the U.S. than it does in other countries. As a case in point, U.S. regulations require federal contractors and subcontractors to conduct adverse impact analytics proactively on a variety of personnel activities, including hiring and promotion selection decisions, as well as terminations. Furthermore, revisions to existing regulations and new regulatory requirements forthcoming in the U.S. may broaden the types of personnel activities required to be analyzed via an adverse impact framework (e.g., performance ratings, compensation, training, and transfer decisions). The U.S. has developed a clear chronology of burden in the adverse impact judicial scenario and highly specific and scientific evidentiary rules for each phase (i.e., adverse impact detection via statistical or practical significance, evidence of job relatedness/business necessity, and identification of reasonable and less adverse alternatives). In contrast, although some countries allow for adverse impact evidence in claims of discrimination, it is often necessary, but not sufficient, evidence of discrimination. Instead, disparate treatment appears to be the more generally accepted theory of discrimination internationally. This structured process of shifting burden in the U.S. appears to be rare in other countries. For example, Chapter 29 indicates that countries are still in the process of developing legal statutes and requirements, and relatively few cases have been brought under these still-developing legal frameworks. However, as noted in Chapter 29, a few countries are poised to take steps to refine and outline a more structured process for claims in the coming years. See Chapter 29 for more details.

A MODEL TO EXPLAIN THE LEGAL CONTEXT OF SELECTION Chapter 29 outlines the most commonly protected individual and group characteristics and highlights them across nationalities and cultures. There is now an almost universal affirmation that ethnicity, race, sex, and disability status groups are afforded protected status. Despite this agreement, there is wide variability in additional characteristics that are afforded protection across cultures and nationalities, as well as variability in the temporal emergence of legally protected classes and employment laws across countries. So, why do these differences, and other legal context variations, exist? Figure 30.2 presents an interactive model or framework that may explain some of these differences. The model depicts overarching forces at work in creating the catalysts for the development of legal protections and the accompanying enforcement/case law, as well as the advancement of the science of selection and the accompanying professional guidelines. We view this model as a preliminary first step in considering why there are differences between the U.S. and many other countries with regard to the legal context of selection. This section will describe the components of the framework and illustrate how each portion interacts with the rest of the model to influence the legal context of selection. Case examples will be highlighted throughout this section to illustrate the components of the framework.

Zeitgeist8 Defined as the general intellectual, moral, and cultural climate of an era, the zeitgeist, in the context of discrimination, is an important consideration when assessing why certain differences and similarities in the other framework factors may exist across countries; it also provides an intuitive starting point, as a country’s history and values clearly affect what characteristics are deemed important. Over time, the group(s) identified as needing protection from discrimination, as well as the enforcement of said protection, are likely to evolve as the country’s values evolve. Therefore, the zeitgeist can be a driving force of change and evolution in the legal context of


FIGURE 30.2  An Interactive Model to Explain the Legal Context of Selection

selection, and one critical component that fuels change in the legal arena is the general cultural understanding of fairness. The zeitgeist can also change over time, as it is influenced by other framework factors. For example, a consistent opposition to the status quo—based on vocal public reactions—can influence the tone and tenor of the general cultural understanding. Over time, this opposition may in fact fundamentally alter or shift the country’s values in such a way that changes to the legal context of selection are demanded.9 The U.S., as an example, recently experienced a zeitgeist shift in relation to perceptions of the LGBT community—where the general public opinion is now more favorable.10 After this shift, public reactions to continued perceived injustices against the LGBT community resulted in the proposal of federal legislation to add sexual orientation as a protected group nationally (the Employment Non-Discrimination Act, ENDA); adverse political reactions to the proposal, however, stalled enactment of the legislation. Despite this outcome, enforcement and case law have addressed the issue, as evidenced by the EEOC enforcement actions under existing sex discrimination guidance (EEOC directive and Macy v. Holder, 2012) and the Federal Courts (Glenn v. Brumby, 11th Circuit, 2011). It can be argued that these changes were demanded by the zeitgeist expressed through public demonstrations and petitions. Continued government responses to the zeitgeist have manifested as executive action (by amending Executive Order 11246) for federal contractors and individual state actions to formally adopt sexual orientation as a protected group.


Case Study 1: Preferential Treatment In this case study, the U.S. and India both are taking action to rectify past discrimination; however, different interpretations of how selection activities fit into this understanding are applied. United States. After the U.S. enacted the Civil Rights Act of 1964, several employers and academic institutions instituted preferential treatment systems in an effort to rectify the effects of past and ongoing discrimination against women and minorities. In other words, employers and academic institutions were making race- or sex-based decisions during the selection process to ‘make up’ for past discrimination and promote equality in the current selection systems. These preferential treatment systems created strong public reactions—a push for all individuals to be treated equally during the selection process. Although preferential treatment allowed for the selection of qualified minority members, it also presented additional obstacles to overcome—namely, the negative reactions to selections under this system. Preferential treatment allowed other individuals, whether selected or not, who were not members of the minority group to view the minority individual as less capable and/or as having been given the job based on his/her minority status instead of his/her qualifications. In response to some of these concerns, the revisions to the Civil Rights Act in 1991 made the differential standard form of preferential treatment illegal. India. To alleviate discriminatory practices against historically disadvantaged groups, such as the Scheduled Castes and Tribes, India has, since the 1930s, also employed programs to ensure that these groups can compete successfully with the rest of society. Specifically, the Indian Constitution expressly allows “reservations” or quotas. This means that some employment opportunities, including selections and promotions, are reserved for individuals in the Scheduled Castes and Tribes. Unlike in the U.S., this preferential treatment has been generally accepted by the Indian public, with additional demographic groups being added to the reservation system since the 1930s; however, recently there has been criticism of the reservation system. Opponents of the system claim that other, high-quality candidates are disadvantaged—such that they do not have a fair chance to attain advancement. Given the recent opposition, however, it will be interesting to see if any changes to the reservation system occur.

The zeitgeist inherently interacts with and reacts to other factors in the framework to influence and shape the legal context of selection. However, it is important to note that the zeitgeist, alone, cannot control the legal context of selection.

Reactions (Public, Private, and Government)

Reactions are defined as actions performed or feelings experienced in response to a situation or event (e.g., changes in the strength of a country's economy, political posturing, new or revised laws, case outcomes, enforcement). For example, the September 11, 2001, terrorist attacks in the U.S. resulted in anti-terrorism laws, regulations, and practices that would have been difficult to enact if the attacks had not taken place. Likewise, the 2015 mass murder of nine African Americans at a Charleston, South Carolina, church resulted in the South Carolina legislature overwhelmingly passing legislation to remove the Confederate flag from the State House grounds. Again, a law was passed whose approval was highly unlikely a week prior to the hate-based mass murder.

The model postulates that there is a near-constant feedback loop occurring between zeitgeist and reaction. Specifically, reactions to the same type of situation or event can vary over time, as the zeitgeist evolves. Consider that while there may be a general moral or intellectual understanding in an era, there will also be a minority voice that disagrees. Indeed, modern times have seen a rise in extreme polarity of ideas.



Case Study 2: Protected Groups

In this case study, the U.S. and South Africa both are taking steps to address ongoing racial discrimination; however, differential government reactions to this issue impact the legal context.

United States. After the abolition of slavery, many former Confederate states enacted laws to keep the races and people of color "separate but equal." However, during the 1950s a shift in reactions to segregation began to build, with increasing numbers of demonstrations and events expressing outrage with existing inequalities. These reactions were emboldened by court cases striking down the "separate but equal" laws, most notably the 1954 Supreme Court decision in Brown v. Board of Education (347 U.S. 483, 1954). Over the next decade, events driven by public, private, and government reactions led to the enactment of the Civil Rights Act of 1964, prohibiting discrimination based on color, race, sex, religion, or national origin. Although there was a growing push from the majority of the general public for equal rights, the involvement of the federal government was necessary to ensure enforcement, particularly in states where the local or state government was unwilling to enforce change.

South Africa. In 1948, the all-white government implemented apartheid, under which existing policies of racial segregation were enforced. Because the majority of the population was being discriminated against openly, the zeitgeist of South Africa was one of understanding that apartheid was wrong and needed to end. However, despite strong and consistent opposition to apartheid, no action was taken by the government to end the policy. Protests and demonstrations continued, with many ending in violence. This violence drew international attention, which, in the end, forced government action due in large part to external pressures for change. However, only small changes were enacted in the late 1980s. It wasn't until 1994 that a new constitution took effect, enfranchising blacks and other racial groups.

In the context of this framework, public reaction is a collective response to changes in other aspects of the framework. In every facet of the framework there is an element of public reaction: in the minority reaction to the zeitgeist, in the often varying reactions to changes in legal protections and enforcement, and even in the science of selection, where we see public reactions to changes in the prevalence of certain selection methods. Reactions, however, are not unique to the general public. They can also originate from members of private organizations and the government. Private reaction has become more visible and vocal in this framework over time. This, in part, could be due to a rise in the available outlets for dispersing private reactions (e.g., social media, 24-hour news cycle). However, the rise in visibility is likely heavily influenced by the increased regulatory burden placed on the private sector, beginning with the 1964 Civil Rights Act and followed by several additional government actions (whether through affirmative legislation or court decree), up to the present day. Government reactions can also be the catalyst to changes in zeitgeist (i.e., enacting protections to facilitate a change in zeitgeist) or the response to zeitgeist (i.e., enacting protections or changing enforcement priorities because of the zeitgeist). As the zeitgeist sets the general tone of how the legal context of selection is viewed, it is intuitive that the factors of the model or other events would constantly generate reactions from the public, private industry, and government. We argue that this aspect of the feedback loop often provides the impetus for changes in the entire framework.

Legal Protection, Enforcement, Case Law

Figure 30.2 also identifies other factors that may be useful in explaining similarities and differences in the legal context across countries. These include the enforcement landscape, court
systems, and relevant case law. The discrimination enforcement landscape is a factor that was not surveyed in Chapter 29 and may explain some of the differences the authors identified. For instance, the U.S. may have a more active and defined enforcement environment relative to other countries, as a number of federal agencies are charged with monitoring and enforcing employment discrimination in the U.S. (e.g., the Equal Employment Opportunity Commission [EEOC] and the Office of Federal Contract Compliance Programs [OFCCP]). The EEOC enforces claim-based discrimination primarily under Title VII, the Americans with Disabilities Act (ADA/ADAAA),11 and the Age Discrimination in Employment Act (ADEA). As described elsewhere (e.g., Zink & Gutman, 2005), the EEOC investigates many discrimination claims in a given year and seeks to address them by identifying unsubstantiated claims, settling claims of merit in favor of the claimant, and taking a small number of claims to court. The EEOC has recently developed initiatives focusing on systemic discrimination that affects large classes of applicants and employees. As such, employee selection has been the highest-priority area of enforcement focus under the EEOC's Strategic Enforcement Plan since 2013.

The OFCCP, an arm of the Department of Labor (DoL), also actively enforces anti-discrimination policy in the U.S. The OFCCP proactively enforces Executive Order 11246 (women and minorities), Section 503 of the Rehabilitation Act of 1973 (Section 503, individuals with disabilities), and the Vietnam Era Veterans' Readjustment Assistance Act of 1974 (VEVRAA, certain classes of veterans), which require contractors working for the federal government to implement affirmative action programs that ensure equal opportunity for previously disadvantaged groups. In 2013, revised regulations governing Section 503 and VEVRAA were finalized. These regulations went into effect in 2014 and required contractors, for the first time, to gather data on disability and veteran status at both applicant and employee stages. In addition, the regulations focused on more robust outreach and recruitment, as well as numerical assessments based on veteran and disability status (i.e., comparing employment of disabled individuals to an availability metric and assessing hiring of veterans against a benchmark figure). The OFCCP's work is primarily audit based; the Agency proactively investigates the employment practices (e.g., hiring, promotion, termination, compensation) of a subset of federal contractors each year. Like the EEOC, the OFCCP has recently developed initiatives intended to identify and remedy systemic discrimination and has focused on employee selection. Thus, the OFCCP and EEOC represent an active enforcement landscape that together is both claim and audit based and is linked to multiple federal protections. These agencies have the power to impose both financial and reputational consequences of discrimination via published compensatory and punitive settlements, as well as eventual litigation.

Although the EEOC and OFCCP are the most widely recognized enforcement agencies, it is important to note that they are not the only ones in existence. In fact, the Civil Rights Division within the Department of Justice also investigates and seeks remedies for employment discrimination in the state and local government context. Perhaps the active enforcement landscape of the U.S. partially explains the more stringent legal context of selection in the U.S.
Is the enforcement landscape in other countries similar? Chapter 29 provides more insight into this question (see Table 29.3).

Case Study 3: Levels of Authority

In this case study, the U.S. and the European Union are both seeking to clarify an overarching legal framework for addressing claims of discrimination. Despite similarities in the high courts' levels of authority—in the oversight and enforcement of discrimination—contextual factors can impact the legal context differently.

United States. An established hierarchical framework exists in the U.S. to interpret and enforce the legal context of discrimination. The U.S. court system works to interpret a national, written constitution; therefore, it provides clarity and guidance to assist with interpretation issues on matters of federal law. Although states are
permitted to enact legislation that adds to federal laws/regulations, federal laws supersede state and local laws and regulations in instances where they are in conflict. For example, protection against sexual orientation discrimination is new for the U.S. In 2015, the Supreme Court ruled against four state bans defining marriage as between a man and a woman. This ruling, coupled with the EEOC's stance (since 2012) that sexual orientation discrimination can be addressed under existing sex discrimination legislation (Title VII), opened the door for opportunities to redress instances of discrimination based on sexual orientation. Despite these developments, it is important to note that controversy is occurring on two fronts: (1) public reactions asserting a conflict in which either an individual's right to religious beliefs or an individual's right to marry another of the same sex must be forsaken, and (2) local and state government reactions either proactively invalidating their own same-sex marriage bans or asserting that the ruling does not explicitly strike down their existing laws, thereby seeking to maintain the status quo until such time as it is clear their particular laws must be changed. In this case, the hierarchical nature of the legal framework for redressing claims provides state and local governments with interpretive guidance on enacting the legal context of discrimination—though that guidance may take some time to be incorporated.

The European Union. The EU has also established a framework to interpret and enforce the legal context of discrimination. However, the Court of Justice of the European Union (CJEU) was created by treaty and is tasked with interpreting EU treaties and laws. Therefore, the CJEU spans multiple countries with sometimes differing legal interpretations or enforcement procedures. Similar to the U.S., EU directives supersede individual Member State laws and, in instances of conflict, the CJEU provides clarification and guidance. The European Commission can also review Member State laws to look for inconsistencies with EU Directives. That said, Member States' judicial systems are trusted to apply the laws, which can result in wide variability in addressing claims of discrimination across nations. For example, although the Racial Equality and Employment Equality Directives were transposed into national law by all 28 Member States, the CJEU launched infringement proceedings against 25 Member States between 2005 and 2007. The general finding was that the implementation of the required regulatory changes was (a) lacking in appropriately defining prohibited actions, (b) too limiting in scope, or (c) too interpretive in identifying exemptions. By 2013, only one Member State remained in breach of its obligation to transpose the Directives. Furthermore, there is considerable variability where application in case law is concerned, with some Member States (e.g., Germany, Denmark) regularly referring to the CJEU for case-law decision and overview, whereas other Member States have not received cases with which to test their newly transposed laws (e.g., Estonia, Finland). In this case, the legal framework for redressing claims empowers individual Member States with interpreting and enacting the legal context of discrimination through case law and regulations, which spans a variety of cultures, languages, and constitutions.

The U.S. also has an active court system responsible for remedying those instances where employment discrimination claims cannot be settled between enforcement agency and employer. Thus, courts provide an additional level of enforcement; case law has even given rise to judicial scenarios not available via statute, as was the case for the adverse impact scenario in Griggs v. Duke Power (1971) and Albemarle v. Moody (1975). Taken together, the U.S. has an active enforcement landscape and a court system designed to resolve discrimination cases where the enforcement landscape cannot. Both systems can impose meaningful consequences if anti-discrimination law is violated by an employer. Differences in the role, functions, and powers of courts and regulatory agencies also may explain differences in the legal contexts across countries observed in Chapter 29.



Science of Selection, Professional Guidelines

Chapter 29 hinted at the notion that international differences in the science of selection may explain differences in the legal context of selection. For example, the authors noted that (a) subgroup mean differences in various predictor and performance constructs were more clearly documented in the scientific literature in the U.S. as compared to the scientific literature in other countries; and (b) I-O psychology was more affected by the legal context of selection in the U.S. as compared to other countries. The science of selection is used differently across countries and, as such, may differentially affect the legal context. Advances in the science of selection include emerging research into methods for improving selection assessments, processes, and scoring, and methods of assessing validity, reliability, and adverse impact. While these advances ideally inform enforcement as well as professional guidelines and technical authorities, they are also influenced by zeitgeist, reactions, and existing legal protections.

In the U.S., there has been considerable research on, and social debate about, adverse impact. In 1978, adverse impact as a phenomenon was memorialized in the Uniform Guidelines on Employee Selection Procedures (UGESP, 1978), which is a technical authority published jointly by various enforcement agencies. The UGESP also delineates the technical requirements for legally defensible employee selection systems by establishing 'minimum requirements' for research intended to show job-relatedness and business necessity of selection procedures. These requirements are essentially enforced as administrative law by the OFCCP and are used often by the EEOC to evaluate selection programs (Jeanneret, 2005); earlier courts explicitly gave the UGESP 'great deference' in reaching decisions. Additionally, professional organizations have published their own documents. For instance, the Society for Industrial and Organizational Psychology (SIOP) published a revised edition of the Principles for the Validation and Use of Personnel Selection Procedures (Principles) in 2003.12 The Principles are a technical authority often used by practitioners in the design of selection systems. Furthermore, multiple social science institutions have jointly published and recently updated the Standards for Educational and Psychological Testing (2014), which are also often used by practitioners in the U.S. when designing selection systems. These three technical authorities intend to capture the scientific state of selecting employees, although there is some debate over the usefulness of the UGESP given its age. While many of the tenets in the UGESP remain relevant, there have been notable advances in validity evidence and statistics, as examples, that this document does not address. Regardless, these technical authorities are used by the I-O community13 in the U.S.

Differences in legal context across countries may be partially explained by differential familiarity among the legal and regulatory communities with the state of science in selection (e.g., technical authorities and their applications) or differences in the availability and application of these types of technical authorities in designing and evaluating selection systems. According to the analysis in Chapter 29, professional guidelines and technical authorities appear to be most prominent in countries where there is an active enforcement landscape and specific selection requirements to which systems must adhere.
Importantly, the authors conclude that the practice of I-O psychology is relatively novel in many countries and, as such, perhaps the international development and use of these technical authorities will increase over time. Given the nature of the UGESP—as codified professional guidelines—this document is uniquely situated as a direct bridge between the science of selection and legal enforcement and case law. Ideally, this type of document can inform enforcement agencies and courts of the best practices in selection, providing guidance on acceptable and unacceptable practices. Unfortunately, particular sections from the UGESP are likely in need of updates based on the contemporary science of selection (e.g., tripartite versus unidimensional theory of validity), and particular sections may be inconsistently interpreted in the legal context (e.g., some courts prefer significance tests over the four-fifths rule to measure adverse impact; some courts allow content-oriented approaches to validate measures of mental constructs but others do not).
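To make the contrast between the two approaches mentioned above concrete, the following is a minimal Python sketch, using hypothetical applicant-flow numbers, of the four-fifths ratio alongside a pooled two-proportion z-test. It is an illustration of the statistics being debated, not the UGESP's or any court's prescribed procedure; the function name and the data are ours.

```python
import math

def adverse_impact_summary(sel_minority, n_minority, sel_majority, n_majority):
    """Compare two common screens for adverse impact on hypothetical hiring data:
    the four-fifths (80%) rule and a pooled two-proportion z-test."""
    rate_min = sel_minority / n_minority      # selection rate, protected group
    rate_maj = sel_majority / n_majority      # selection rate, comparison group
    impact_ratio = rate_min / rate_maj        # four-fifths rule flags ratios below 0.80

    # Pooled two-proportion z-test of the difference in selection rates
    pooled = (sel_minority + sel_majority) / (n_minority + n_majority)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_minority + 1 / n_majority))
    z = (rate_min - rate_maj) / se
    # Two-sided p-value from the standard normal distribution
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

    return {
        "impact_ratio": round(impact_ratio, 3),
        "fails_four_fifths": impact_ratio < 0.80,
        "z": round(z, 2),
        "two_sided_p": round(p, 4),
    }

# Hypothetical applicant flow: 30 of 100 minority and 50 of 100 majority applicants
# pass the screen. The ratio of 0.60 fails the four-fifths rule; z is about -2.89
# and the two-sided p-value is about .004, so both screens flag this example.
print(adverse_impact_summary(30, 100, 50, 100))
```

With such small, made-up samples the two screens happen to agree; the debate noted in the text arises because, with very large or very small samples, the ratio test and the significance test can point in different directions.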



SOME FINAL THOUGHTS

The purpose of this chapter is to establish a framework around the global development and evolution of employment law as it relates to employee selection, promotion, and termination. Therefore, the presented model is designed to foster an understanding of how or why international differences emerge in this legal arena. As mentioned in the chapter introduction, the U.S. is an outlier on various dimensions as compared to the majority of the countries surveyed in Chapter 29. The variations appear to cluster around three main issues:

• Differences in protected group status
• The legality of preferential treatment for underrepresented groups
• The evidentiary standard required to prove or refute charges of discrimination

Our position is that these variations may be attributable to differences in relevant aspects of the presented model and interactions among model components. For example, will the focus on equity in compensation in the U.S. change the enforcement priorities of agencies charged with finding and remedying discrimination? New regulations impacting federal contractors may soon require additional analyses and focus in this area; new state regulations, including two that took effect in January 2016,14 are already impacting organizations with operations in those states. The OFCCP has indicated that compensation equity is a priority, but the results of efforts to identify and remedy this form of discrimination are as yet unclear. The lack of results may be, in part, due to the complexity of compensation and the difficulty of determining whether and where discrimination in the various forms of compensation may be lurking. In recent years there has been a dramatic increase in the amount of reporting on wealth inequalities in general, and sex differences in compensation in particular, as a zeitgeist shift appears to be in progress. In reaction to this increased scrutiny, many organizations, including several technology companies, are now publicly reporting their overall pay gap results. Several states are also adding to the regulatory impetus, with at least seven enacting gender pay equality measures over the last two years, and other states considering legislation moving forward.15

Current global and local events highlight the constantly shifting and evolving nature of this topic. Globally, the immigrant crisis creates a landscape ripe for testing of existing immigrant, religious, and national origin protections and strong reactions from all sides to these provisions as well as any suggestions for change. The recent (2015) terrorist attacks in Paris and subsequent calls for identifying and closing extremist mosques in the country, along with other reactive measures, will test the resolve of the égalité mindset. In the U.S., a recent example of potential change is the controversy surrounding the U.S. Supreme Court's ruling on same-sex marriage and the subsequent refusal of a county clerk in Kentucky to sign marriage certificates for any couples in order to avoid being required to sign them for gay or lesbian couples. Much of the controversy was created when the refusal was framed as an impasse between the rights of an individual with religious beliefs and the newly minted right of an individual to marry another of the same sex.

Although our intent is not to use the model to evaluate or place value on any identified differences, a logical next step is to seek information related to beneficial practices. Given U.S. activity in some areas of the model, it is reasonable to ask whether the legal context of selection in the U.S. should be viewed as a practice to model. This question seems particularly important given recent civil rights activity in South Africa, where the U.S. has been used as a model of legal context for a country that was hindered by generations of discrimination. The U.S. seems like a useful model along a number of dimensions surveyed in Chapter 29, such as the burdens of proof required to establish or refute discrimination, accounting for reverse discrimination in preferential treatment policy, and enforcing meaningful consequences for discrimination; it is clear that the U.S. has a highly developed legal system in place to evaluate and enforce protections.
However, it is apparent that a developed legal system does not eliminate negative outcomes. Given the highly developed system in the U.S., the likelihood of a legal challenge is high—whether the challenge is merited or not. Take, for instance, the government's
approach to discrimination: to seek out discrimination and enforce sanctions against organizations found to discriminate. The issue arises when this stance is muddied with operational realities. Enforcement agencies' existence (i.e., funding) is linked to the agencies' success in making companies pay for discrimination, which can incentivize the identification of discrimination. Additionally, the advancements the U.S. has made in some areas of the legal arena are missing in others; in regard to which classes are protected, the U.S. is less inclusive than many countries as there is no federal protection for socio-economic status, political opinion, marital status, etc., although many U.S. states have these protections. Furthermore, the UGESP, the U.S. codified professional guidelines, is in dire need of updates based on the science of selection, as mentioned previously in this chapter.

One final factor to consider is a country's history and experience related to both civil rights and the science of selection, specifically within the legal context of selection. For example, various social scientific communities have contributed to a long history of research on employee selection in the U.S. That said, Title VII is 52 years old, and the U.S. has experienced social and cultural shifts emphasizing civil rights during its existence. Furthermore, U.S. enforcement agencies and court systems have been dealing with anti-discrimination enforcement for multiple decades. In many of the countries surveyed in Chapter 29, legal protection is only now developing, as are enforcement agencies, court systems, and I-O psychology. Given time for science to develop and for legislative and enforcement structure to reach equilibrium, perhaps the legal context for selection will be much more similar across the international community 50 years from now.

The model we present in this chapter is complex yet useful when attempting to understand how international differences emerge in the legal context of selection. In short, there is more to consider than just the establishment of employment laws; the broader context, including reactions and societal views, must also be considered.

NOTES

1. The first two authors contributed equally to this chapter. This chapter is an extension of Dunleavy, E. M., Aamodt, M. G., Cohen, D. B., & Schaeffer, P. (2008). A consideration of international differences in the legal context of selection. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 247–254. Used with permission of the Society for Industrial and Organizational Psychology and Cambridge University Press.
2. In this chapter, discussions of protected group status focus primarily on those defined in the Federal Regulations. However, it is important to note that the U.S. also identifies protected group statuses through the use of presidential Executive Orders (EO), as well as state and local legislation and EOs. As an example, the Civil Service Reform Act of 1978 identifies political affiliation and marital status as defining protected groups when referring to federal employees (this point is discussed in more detail below).
3. This table does not list every law indicated in Table 29.3 of Chapter 29, but instead lists only some of the key laws enacted for some of those countries, and includes key laws enacted in countries not covered by Chapter 29 as well.
4. The responsibility of defining and enforcing employment law issues is split between the federal and state governments. Despite this, any individual state variations that may exist are not included in this discussion.
5. Enforced through Title VII of the Civil Rights Act of 1964 (as amended), the Equal Pay Act of 1963 (EPA), the Age Discrimination in Employment Act of 1967 (ADEA), the Americans with Disabilities Act of 1990 (ADA), the Civil Rights Act of 1991, and the Genetic Information Nondiscrimination Act of 2008 (GINA).
6. Enforced through EO 11246, Section 503 of the Rehabilitation Act (Section 503), and the Vietnam Era Veterans' Readjustment Assistance Act (VEVRAA).
7. Enforced through the Civil Service Reform Act of 1978 and EO 13152 (applied under Title VII).
8. The term zeitgeist was selected, in part, due to the complexities associated with its definition. Although we use it in this context to refer to the "general intellectual, moral, and cultural climate of an era," our model highlights how the zeitgeist can change—and that change could occur rapidly due to other
factors in the model. Therefore, this word incorporates the fluidity that may exist with this factor of the model.
9. It is important to note that a discussion of values at the national level is a complex topic—and outside the scope of this chapter.
10. Based on Gallup trends for Gay and Lesbian Rights in the U.S. (http://www.gallup.com/poll/1651/gay-lesbian-rights.aspx) and the Pew Research Center survey on homosexuality acceptance in the U.S. (http://www.pewglobal.org/2013/06/04/the-global-divide-on-homosexuality/), the U.S. is generally accepting of the community and their marriage rights.
11. The ADA was amended in 2008 to provide a broader definition of disability and to clarify that the regulations are intended to be interpreted as erring on the side of inclusion.
12. Note that SIOP is currently working on another update to the Principles.
13. It is important to note that the legal community also uses these technical authorities, though the Uniform Guidelines is often given the most deference in the legal context.
14. California signed the California Fair Pay Act (CFPA) into law on October 6, 2015; New York signed "an act to amend the labor law, in relation to the prohibition of differential pay because of sex" into law on October 21, 2015.
15. https://www.shrm.org/legalissues/stateandlocalresources/pages/state-equal-pay-laws.aspx

REFERENCES

Aamodt, M. (2016). Industrial/organizational psychology: An applied approach (8th ed.). Boston, MA: Cengage Learning.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Gutman, A., & Christiansen, N. (1997). Further clarification of the judicial status of banding. The Industrial-Organizational Psychologist, 35, 75–81.
Jeanneret, R. (2005). Professional and technical authorities and guidelines. In F. J. Landy (Ed.), Employment discrimination litigation: Behavioral, quantitative, and legal perspectives (pp. 47–100). San Francisco, CA: Jossey-Bass Publishing.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author.
Uniform Guidelines on Employee Selection Procedures. 29 C.F.R. §1607 et seq. (1978).
Zink, D. L., & Gutman, A. (2005). Statistical trends in private sector employment discrimination suits. In F. J. Landy (Ed.), Employment discrimination litigation: Behavioral, quantitative, and legal perspectives (pp. 101–131). San Francisco, CA: Jossey-Bass Publishing.

Cases Cited

Albemarle Paper Co. v. Moody (1975) 422 U.S. 405.
Regents of Univ. of California v. Bakke (1978) 438 U.S. 265.
Fisher v. University of Texas Austin (2014) 758 F.3d 633 (5th Cir.).
Fisher v. University of Texas Austin (2016) 579 U.S. ____.
Glenn v. Brumby (2011) No. 10–14833 (11th Cir.).
Gratz v. Bollinger (2003) 539 U.S. 244.
Griggs v. Duke Power Co. (1971) 401 U.S. 424.
Grutter v. Bollinger (2003) 539 U.S. 306.
Macy v. Holder (2012) Appeal No. 0120120821.
Parents Involved in Cmty. Schs. v. Seattle School Dist. No.1 (2007) 127 S. Ct. 2738.
Wygant v. Jackson Bd. of Educ. (1986) 476 U.S. 267.


Part VII
EMPLOYEE SELECTION IN SPECIFIC ORGANIZATIONAL CONTEXTS
RICK JACOBS AND DOUGLAS H. REYNOLDS, SECTION EDITORS

31
SELECTION AND CLASSIFICATION IN THE U.S. MILITARY
WAYNE S. SELLMAN, TERESA L. RUSSELL, AND WILLIAM J. STRICKLAND

The quality of a military workforce, or any workforce for that matter, depends on the quality of its people. Successful attainment of military missions requires a force composed of dedicated, knowledgeable, and competent members. When an organization can hire persons with prior experience, an evaluation of past performance can serve as the primary criterion for selection and assignment into jobs. Other characteristics such as aptitudes and education assume less importance. However, when organizations hire young people without job experience, it becomes important to evaluate aptitudes, education, interests, and other characteristics known to predict success in jobs sought by the applicants.

The Department of Defense (DoD) is the largest employer of young people in the United States. Depending on personnel requirements, the DoD screens hundreds of thousands of youth for enlistment annually. During the late 1970s, the DoD screened approximately 600,000 applicants each year; that number declined to about 380,000 during the first years of the 21st century and to around 250,000 in 2010 (U.S. Department of Defense, 2013). As noted above, the military's task in screening potential recruits is complicated by the fact that the available personnel pool is composed predominantly of young men and women who have never held a permanent full-time job of any kind and who almost without exception have no experience performing jobs similar to those in which they will be trained. Consequently, the services must depend mainly on indicators of potential performance, such as aptitude and education.

MILITARY PERSONNEL SYSTEM

The U.S. military separates its personnel into two or three categories: enlisted personnel, commissioned officers, and (for all services except the Air Force) warrant officers. Comprising approximately 85% of the entire military, the enlisted force consists of (a) entry-level soldiers, sailors, airmen, and marines; (b) noncommissioned officers and petty officers (NCOs); and (c) senior NCOs and senior petty officers. These levels correspond to different levels of training, education, experience, and leadership. Individuals at the entry level are in training, or have just achieved initial competence in their occupational specialties. NCOs are technical experts in their primary jobs and serve as first-line supervisors, who teach, train, and supervise entry-level personnel. Senior NCOs are seasoned individuals who have experienced a myriad of technical jobs, held numerous supervisory positions, and have performed their technical and supervisory duties at high levels of proficiency.


Commissioned officers are the senior leadership and management of the military. Similar to the enlisted force, the officer force also is divided into three categories: (a) company-grade officers, (b) field-grade officers, and (c) general or flag officers. Company-grade officers are the military's action officers and are largely involved in the tactical level of the military organization. Field-grade officers typically fill many operational-level positions and most command and staff assignments. General or flag officers are executives and are primarily engaged in strategic, policy-making decisions that affect the organization in the long term.

The Army, Navy, and Marine Corps also have warrant officers who fill highly specialized leadership positions. Unlike their commissioned officer counterparts, whose experiences are broad and service-encompassing, warrant officers are employed in positions that require highly specialized or technical skills (e.g., helicopter pilots). Selection as a warrant officer is highly competitive and only available to those who meet rank and length-of-service requirements in the enlisted force.

Distinct from the civilian sector, the military has a completely closed personnel system; this means that the services fill personnel vacancies with members who are already employed within their ranks. The services do not hire individuals from outside the military to enter mid- or senior-level ranks. Because it takes years to successfully replace a member who leaves the military, attracting officer and enlisted candidates is a high priority for military policy makers. Each service uniquely recruits, trains, and professionally develops its members. Moreover, selecting the correct number of high-quality individuals each year is essential to sustain a flow of seasoned leaders for the future.

Within the services, there are literally hundreds of military occupations. Although many are similar to civilian jobs, there also are large numbers of occupations that are unique to the military. Because of the large number of military enlistees (about 245,000 in the active and reserve components in Fiscal Year 2014) (D. J. Drogo, personal communication, 2015) who must be assigned into a large number of military occupations, the services, unlike most civilian employers, must be proficient at job classification as well as personnel selection. However, classification into military occupations depends on eligibility, individual preference, and availability of openings (Campbell & Knapp, 2001; Rumsey & Arabian, 2014a). With an enormous diversity of occupations, a vast number of openings at specific positions, and a variety of individual skills, the challenge of military job classification is appreciable.

NEED FOR ENLISTED MILITARY SELECTION AND CLASSIFICATION

Military recruiting is a supply-and-demand phenomenon (Sellman, Born, Strickland, & Ross, 2010; Sellman, 1999) that is influenced by the costs of recruiting qualified individuals for enlistment. When recruiting prospers, the services raise their enlistment standards. When times are bad, the services sometimes lower their standards and allow enlistment of somewhat lower-quality recruits, thus enabling them to meet their recruiting goals. Military recruiting, assignment, and training of young, unskilled people is an investment; the underlying purpose of the personnel selection and job classification process is to reduce the risk that an investment will be made in persons who are unable (or unwilling) to perform their duties.

There are costs associated with recruit quality levels. It is more difficult and costly to recruit high-quality youth than their lower-quality peers. Thus, recruit quality standards directly influence recruiting resource requirements. Once admitted into service, recruits are expected to progress through training, to perform their duties competently, and to observe military order and discipline. Nevertheless, not all enlistees get through basic training and job skill training and, even for those who do, not all manage to avoid disciplinary problems. Still others may play by the rules but may perform well below par on the job for reasons not related to low aptitude but rather to lack of motivation. The consequences for substandard performance may include slow promotion progress, reassignments, various forms of punishment from reprimands to incarceration, and in many cases an early exit from service. The most analyzed indicator of maladjustment to the military is first-term attrition, failure to complete an obligated period of service (White, Rumsey, Mullins, Nye, & LaPort, 2014).


According to the U.S. Government Accountability Office (GAO), it cost $40,000 in 1997 to replace (recruit, train, and equip) each individual who failed to successfully complete a first tour of duty (U.S. Government Accountability Office, 1997, 1998). Given the substantial increase in recruiting resources associated with recruiting challenges brought on by the wars in Iraq and Afghanistan, today that number is considerably higher, with recruiting resources expended in FY 2014 alone running about $17,600 for each enlistee (D. J. Drogo, personal communication, 2015; U.S. Department of Defense, 2014). Cost information on training and equipping new recruits who replace those who leave service early is not available at this time.

McCloy (2012) calculated that the cost of training a single soldier from recruitment to his/her first duty station was $50,000 in 2005. With an estimated 30% attrition rate, he calculated the annual cost of attrition to be $2.5 billion. If more service members leave prematurely, then the recruiting requirements and related recruiting, training, and equipping costs must increase to maintain the force. In addition, there are non-pecuniary or indirect costs, which include force instability, lowered morale, and lack of readiness. Individuals also may pay a personal price. Failure in military service may significantly affect their future employment opportunities and earning potential. Consequently, it is in the interest of recruits and the services to reduce first-term attrition (Strickland, 2005).
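As a back-of-the-envelope check on these figures (our own arithmetic, not a calculation McCloy reports), the stated per-soldier cost and attrition rate imply the approximate annual accession base behind the $2.5 billion estimate:

$$
0.30 \times N \times \$50{,}000 \approx \$2.5 \times 10^{9}
\quad\Longrightarrow\quad
N \approx \frac{2.5 \times 10^{9}}{0.30 \times 5.0 \times 10^{4}} \approx 1.7 \times 10^{5} \text{ accessions per year.}
$$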

DEFINING RECRUIT QUALITY

The use of aptitude and educational attainment as measures of "recruit quality" within the DoD and the services goes back more than 50 years (Sellman, 1997; Sellman & Valentine, 1981; Sticha, Sellman, Axelrad, McCloy, Barnes, & Gribben, 2014). These quality indices are used in lieu of evaluating past work experience—a criterion that rarely exists for enlisted military applicants, who are mostly recent high school graduates. Recruits who score above average in aptitude on the DoD enlistment test are more trainable, have higher levels of job performance, and are less likely to get into trouble than their lower-scoring peers. In addition, recruits with a traditional high school diploma are twice as likely to complete a three-year enlistment as recruits with alternative educational credentials (e.g., high school equivalency exams, adult high school diploma programs, experiential learning) or high school dropouts. Because it is costly to recruit, train, and equip replacements for people who leave the military prematurely, the services prefer to enlist traditional high school graduates with above average aptitude (Sellman, Born, Strickland, & Ross, 2010).

For enlisted selection and classification, the Armed Services Vocational Aptitude Battery (ASVAB) is the single test used to determine enlistment eligibility and job placement for all services (including the Coast Guard) as well as their reserve components. The ASVAB comprises 10 tests that measure verbal, mathematics, and science/technical skills and knowledge and is administered in computer adaptive format. The Armed Forces Qualification Test (AFQT), a weighted composite of the ASVAB, measures verbal (word knowledge and paragraph comprehension) and mathematics (arithmetic reasoning and mathematics knowledge) abilities. A measure of general mental ability, the AFQT is the primary enlistment screen for all services and is the DoD's first index of recruit quality (Sellman, 1997; Sticha et al., 2014). Tests of science/technical knowledge include general science, electronics information, mechanical comprehension, auto information, shop information, and assembling objects (Sackett, Eitelberg, & Sellman, 2013; Sellman, 2004). Tests in the current ASVAB and a brief description of the abilities, or constructs, they measure are shown in Table 31.1.

Each service develops and validates its own set of aptitude area composites based on the combination of tests that correlate most closely with performance criteria for its occupational clusters (Campbell & Knapp, 2001; Rumsey, Walker, & Harris, 1994). Each service's composites take into account the distinct functions required to fulfill its respective missions and are used to assign new recruits to the occupations that are most well-suited to their abilities. For example, the Army and Marine Corps have extensive ground combat responsibilities that are quite different from most Navy and Air Force activities.



TABLE 31.1 ASVAB Tests and Measured Constructs

Verbal
Word Knowledge (WK): Ability to select the correct meaning of words presented in context and to identify the best synonym for a given word
Paragraph Comprehension (PC): Ability to obtain information from written passages

Mathematics
Arithmetic Reasoning (AR): Ability to solve arithmetic word problems
Mathematics Knowledge (MK): Knowledge of high school mathematics principles

Science/Technical
General Science (GS): Knowledge of physical and biological sciences
Electronics Information (EI): Knowledge of electricity and electronics
Auto Information (AI): Knowledge of automobile terminology and technologies
Shop Information (SI): Knowledge of tools and shop terminology and practices
Mechanical Comprehension (MC): Knowledge of mechanical and physical principles
Assembling Objects (AO): Ability to figure out how an object will look when its parts are put together

Source: Sackett, Eitelberg, & Sellman, 2013; Sellman, 2004.

Consequently, for ostensibly the same occupations, such as electronic repair specialists, motor mechanics, cooks, supply technicians, or clerks, the particular equipment used by personnel in the different services or the environment in which they serve might dictate a different mix of abilities (Sackett et al., 2013; Sellman, 2004; Waters, Laurence, & Camara, 1987). Table 31.2 shows the composites currently used by the services (Diaz, Ingerick, & Lightfoot, 2004; Sackett et al., 2013).

The ASVAB is normed against a nationally representative sample of young people ages 18–23 years old tested in 1997 as part of the Bureau of Labor Statistics' National Longitudinal Survey of Youth (Sackett et al., 2013; Segall, 2004). Such norms allow the comparison of applicant and recruit aptitude levels with those of the contemporary civilian youth population from which they come. AFQT scores are expressed on a percentile scale and grouped into five categories for reporting purposes. Table 31.3 shows the percentile score ranges and percentage of civilian youth that correspond with each AFQT category. Persons who score in Categories I and II tend to be above average in cognitive ability; those in Category III, average; those in Category IV, below average; and those in Category V, markedly below average. (Category III is divided at the 50th percentile into subcategories A and B. This facilitates reporting the proportion of scores above and below the mean of the AFQT distribution.) By law, Category V applicants and those in Category IV who have not graduated from high school are not eligible for enlistment.

The best single predictor of successful adjustment to military life is possession of a high school diploma. About 80% of high school diploma graduates complete their first three years of service, compared to only 50% of high school dropouts (Laurence, 1997; U.S. Department of Defense, 1996, 2013; White et al., 2014). Completion rates for enlistees holding an alternative credential such as a General Education Development (GED) certificate fall between the high school diploma graduate and non-graduate rates (Elster & Flyer, 1981; Flyer, 1959; Laurence, 1993, 1997). Thus, educational achievement is the DoD's second index of recruit quality (Sellman, 1997; Sellman et al., 2010). Over the past 25 years, there has been a proliferation of education credentials in the United States. In addition to earning a regular high school diploma, young people can receive alternative educational credentials through adult education programs and homeschooling, through experiential learning, and by taking high school equivalency tests (Laurence, 1984; Sticha et al., 2014).


TABLE 31.2 ASVAB Tests Used for Classification by Service

ALL SERVICES
Armed Forces Qualification Test (AFQT): AR + MK + (2 X VE), where VE = WK + PC

ARMY
General Technical: AR, WK, PC
Clerical: GS, AR, AI, SI, MK, MC, EI, WK, PC
Combat: GS, AR, AI, SI, MK, MC, EI, WK, PC
Electronics Repair: GS, AR, AI, SI, MK, MC, EI, WK, PC
Field Artillery: GS, AR, AI, SI, MK, MC, EI, WK, PC
General Maintenance: GS, AR, AI, SI, MK, MC, EI, WK, PC
Mechanical Maintenance: GS, AR, AI, SI, MK, MC, EI, WK, PC
Operators/Food: GS, AR, AI, SI, MK, MC, EI, WK, PC
Surveillance/Communications: GS, AR, AI, SI, MK, MC, EI, WK, PC
Skilled Technical: GS, AR, AI, SI, MK, MC, EI, WK, PC

NAVY
General Technical: WK, PC, AR
Electronics: AR, MK, EI, GS
Basic Electricity & Electronics: AR, MK, GS
Engineering: MK, AI, SI
Mechanical: AR, MC, AI, SI
Mechanical 2: AR, MC, AO
Nuclear Field: WK, PC, AR, MK, MC
Operations: WK, PC, AR, MK, AO
Hospitalman: WK, PC, MK, GS
Administration: WK, PC, MK

AIR FORCE
Mechanical: AR, WK, PC, MC
Administrative: WK, PC, MK
General: WK, PC, AR
Electronic: AR, MK, EI, GS

MARINE CORPS
Mechanical Maintenance: AR, EI, MC, AI, SI
General Technical: WK, PC, AR, MC
Electronics Repair: AR, MK, EI, GS

Test Abbreviations: WK = Word Knowledge; PC = Paragraph Comprehension; AR = Arithmetic Reasoning; MK = Mathematics Knowledge; GS = General Science; EI = Electronics Information; AI = Auto Information; SI = Shop Information; MC = Mechanical Comprehension; AO = Assembling Objects.

Note: With the exception of the AFQT, weights for the tests are not included in the above composites. The formula for computing the AFQT is AR + MK + (2 X VE), where VE (Verbal) = PC + WK. The VE score is determined by adding the raw scores from the PC and WK tests.
Source: Diaz, Ingerick, & Lightfoot, 2004; Sackett et al., 2013; U.S. Department of Defense, 2004.
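The composite formula in the note to Table 31.2 lends itself to a small illustration. The following Python sketch is ours, for illustration only; the function name and raw scores are hypothetical, and operational AFQT scoring converts this composite to a percentile using DoD norm tables that are not reproduced in this chapter.

```python
def afqt_composite(ar, mk, wk, pc):
    """Illustrative AFQT composite per the note to Table 31.2:
    AFQT = AR + MK + 2 * VE, where VE (Verbal) = WK + PC (raw scores).
    Operationally, this raw composite is converted to a percentile score
    against national norms; that conversion is not modeled here."""
    ve = wk + pc            # Verbal composite: raw PC plus raw WK
    return ar + mk + 2 * ve

# Hypothetical raw scores (not actual ASVAB score scales):
# 25 + 20 + 2 * (30 + 12) = 129
print(afqt_composite(ar=25, mk=20, wk=30, pc=12))
```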


TABLE 31.3 Armed Forces Qualification Test (AFQT) Categories by Corresponding Percentile Score Ranges and Percent of Civilian Youth Population

AFQT Category   Percentile Score Range   Percent of Civilian Youth
I               93–100                   8
II              65–92                    28
IIIA            50–64                    15
IIIB            31–49                    19
IV              10–30                    21
V               01–09                    9

Source: Sticha et al., 2014; U.S. Department of Defense, 1996.
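For readers who want the category bands in executable form, here is a minimal Python sketch (our own illustration; the function name is hypothetical) that maps a percentile score to the AFQT categories listed in Table 31.3:

```python
def afqt_category(percentile):
    """Map an AFQT percentile score to a category using the ranges in Table 31.3."""
    if not 1 <= percentile <= 100:
        raise ValueError("percentile must fall within the ranges shown in Table 31.3")
    if percentile >= 93:
        return "I"
    if percentile >= 65:
        return "II"
    if percentile >= 50:
        return "IIIA"
    if percentile >= 31:
        return "IIIB"
    if percentile >= 10:
        return "IV"
    return "V"

print(afqt_category(58))  # -> IIIA (average range)
print(afqt_category(9))   # -> V (not eligible for enlistment by law)
```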

The DoD uses a three-tier system to classify education credentials. The system was developed after research indicated a strong relationship between level of education and successful completion of the first term of military service (Laurence, 1997; U.S. Department of Defense, 1996). Tier 1 includes regular high school diploma graduates, adult diploma holders, and non-graduates with at least 15 hours of college credit. Tier 2 comprises alternative credential holders, such as those with GED diplomas or certificates of completion or attendance, and Tier 3 is composed of non–high school graduates (Sackett et al., 2013; Sellman et al., 2010). The services prefer to enlist people in Tier 1 because they have a higher likelihood of completing a first term of service than do individuals in Tiers 2 and 3. Consequently, education enlistment standards refer to the application of progressively higher aptitude test score minimum requirements for high school diploma graduates, alternative credential holders, and non-graduates, respectively. The rationale for this policy is based on the differential attrition rates of individuals in these three education groups. That is, members of Tiers 2 and 3 are about twice as likely as those in Tier 1 to leave service before completing their enlistment contract. Higher aptitude requirements for Tiers 2 and 3 are used to accept only the “best” from the statistically less successful and thus less preferred group of applicants (Sticha et al., 2014; U.S. Department of Defense, 1996).
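As a rough illustration of how the tier system interacts with the statutory floor described above, the sketch below encodes the rules stated in the text: Category V applicants are ineligible, as are Category IV applicants who have not graduated from high school. The credential-to-tier mapping, function name, and treatment of Tier 2 applicants are simplifying assumptions of ours, not DoD or service policy, and the services apply higher, service-specific standards in practice.

```python
# Simplified tier assignments based on the three-tier system described in the text;
# this mapping is illustrative, not official DoD policy.
CREDENTIAL_TIER = {
    "regular high school diploma": 1,
    "adult diploma": 1,
    "non-graduate with 15+ college credit hours": 1,
    "GED or other alternative credential": 2,
    "non-high school graduate": 3,
}

def meets_statutory_floor(afqt_category, credential):
    """Apply the legal minimums stated in the chapter: Category V is ineligible,
    and Category IV applicants without a high school education (Tier 3) are
    ineligible. Treatment of Tier 2 Category IV applicants and all higher
    service-specific cutoffs are policy details not modeled here."""
    tier = CREDENTIAL_TIER[credential]
    if afqt_category == "V":
        return False
    if afqt_category == "IV" and tier == 3:
        return False
    return True

print(meets_statutory_floor("IV", "non-high school graduate"))        # -> False
print(meets_statutory_floor("IIIB", "regular high school diploma"))   # -> True
```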

SHORT HISTORY OF MILITARY PERSONNEL TESTING (BEFORE AN ALL-VOLUNTEER FORCE)

Although current testing methods are codified into U.S. law today, they have not always been in place. Because of the advent of new weaponry in World War I (tanks, airplanes, chemicals, etc.), the American military started using tests to screen people for service and assign them to military occupations. In 1917–1918, the Army Alpha and Army Beta tests were developed so commanders could have some measure of the ability of their men (Waters, 1997). Army Alpha was a verbal, group-administered test that measured verbal ability, numerical ability, and ability to follow directions and information. Army Beta was a non-verbal, group-administered counterpart to Army Alpha. It was used to evaluate the aptitude of illiterate, unschooled, or non-English-speaking inductees (Yerkes, 1921). Both tests are recognized as prototypes for subsequent group-administered cognitive ability tests.

Building on the foundations of the Army Alpha and Beta tests, the Army General Classification Test (AGCT) of World War II replaced its predecessors. The AGCT's intent was similar to that of the Alpha and Beta tests in that it was designed to be a general learning test used for job placement. Although it served the services successfully throughout the World War II years, at the war's conclusion each service developed its own aptitude test for service entry. Eitelberg, Laurence, and Waters (1984) noted, "Though different in structure, primarily with respect to qualifying scores,
the service tests were essentially the same with respect to content area, relying on the time-honored items of vocabulary, arithmetic, and spatial relationships."

In 1950, the military returned to a single test, the AFQT, to be used in conjunction with the Selective Service System draft. The AGCT served as the AFQT's model: the AFQT measured basically the same variables as the AGCT and the previous Army Alpha and Beta tests; however, contrary to the previous tests, the AFQT was specifically designed to be used as a screening device (Karpinos, 1966). Thus, the AFQT was established for the purpose of (a) measuring examinees' general ability to absorb military training and (b) providing a uniform measure of examinees' potential usefulness to the service, if qualified on the test (Maier, 1993; Uhlaner & Bolanovich, 1952).

MOVING TO AN ALL-VOLUNTEER FORCE

Throughout most of American history, our military has been composed of volunteers. However, conscription was the primary means of obtaining military personnel during World Wars I and II and the Korean Conflict, to the point that its renewal became perfunctory. The decision to move to an all-volunteer military evolved from criticism of the inequities of conscription during the Vietnam War—who shall serve when not all serve?

In the late 1960s, President Richard Nixon established a commission to develop a comprehensive plan for eliminating conscription and moving toward an all-volunteer force. The commission built a case for a volunteer military by pointing out the unfairness of conscription, establishing the feasibility of a volunteer force on economic grounds, and refuting all major arguments against ending conscription and relying totally on volunteers (Gates, 1970; Lee & Parker, 1977). The commission believed that sufficient numbers of qualified youth could be persuaded to volunteer by increasing military pay to levels more competitive with civilian wages. They disputed claims that total reliance on volunteers would lead to a mercenary force consisting mainly of minorities, the poor, and the uneducated, and to a loss of civilian control. After much debate within the Administration and Congress and across the country, it was decided that an all-volunteer force was feasible, affordable, and would not jeopardize the nation's security (Defense Manpower Commission, 1976; Rostker, 2006). Thus, the authority for conscription was allowed to lapse on July 1, 1973, and the last conscript entered the Army in December 1972. With adequate resources and support to attract and retain higher-aptitude and better-educated personnel, conscription is not needed to meet future military personnel requirements (Bicksler & Nolan, 2006).

An all-volunteer force is more expensive than a conscription force in terms of military compensation and funds for advertising and enlistment incentives, but a voluntary military is less expensive in overall costs (Fredland, Gilroy, Little, & Sellman, 1996; Lee & McKenzie, 1992; Warner & Asch, 1996). It is more stable and career-oriented, thereby leading to extra performance and experience, with reduced training and other turnover costs (Oi, 1967). During conscription, 10% of new inductees reenlisted; today's new recruits reenlist at a 50% rate. In short, military service is an economically rational choice for high-quality men and women looking for an edge on life. The military also is a good choice for people who want to serve a greater cause (Bicksler, Gilroy, & Warner, 2004).

During the first years of the all-volunteer force, the AFQT was used to identify individuals who had a reasonable probability of success in service, and other service-specific tests were required for job classification. The Army Classification Battery, the Navy Basic Test Battery, and the Airman Qualifying Examination, just to name a few, were used from the late 1950s to the mid-1970s (Waters, 1997). During this period, the AFQT was administered to military applicants (including draft inductees) at Armed Forces Examining and Entrance Stations (AFEES) across the country for selection purposes. Because women were not subject to the draft, a different aptitude test was used for female applicants for enlistment. The Armed Forces Women's Selection Test was administered to female applicants in lieu of the AFQT from 1956 to 1974.
If individuals successfully “passed” the AFQT and were accepted for service, they were sent to basic training, although the specific occupation to which they would be assigned had not yet been
determined. During basic training, new enlistees were administered their service’s classification tests and were assigned to their appropriate military occupations. During the mid-1970s, the DoD determined that a single test should be used for both aptitude screening and job placement, resulting in the development and implementation of the ASVAB, which is still in use today (Sellman, 2012; Sellman & Valentine, 1981). The ASVAB’s creation and implementation enabled the DoD to screen applicants, match applicants with job positions, reserve job skill training for applicants if they qualified, and provide a uniform measure on which all applicants could be ranked. This was a departure from previous procedures, under which selection testing was conducted at AFEES during the entrance process (for either enlistment volunteers or draft inductees) and classification testing was accomplished at service basic training centers preparatory to assigning new enlistees to military occupations and sending them for job skills training. By combining selection and classification testing at the AFEES, the testing process was made more expedient for the newly implemented all-volunteer military. Young people volunteering for enlistment would take one test and come away from the AFEES knowing not only if they qualified for enlistment but also, if qualified, the military occupation to which they would be assigned. Thus, the new testing process enabled the services to improve the matching of applicants with available occupations before they actually reported for duty and allowed job guarantees for individuals who qualified for enlistment (Sellman, 2012). With the end of conscription and the advent of the all-volunteer force, there was a significant change in the composition of new recruit cohorts (Sellman, Carr, & Lindsley, 1996). The percentage of African American enlisted accessions rose slightly, with some fluctuation, following the end of the draft (MacGregor, 1981). In 1973, the last year of the draft, African Americans made up 17% of new recruits. As African American men and women viewed the military as an opportunity for upward mobility, a gradual increase in African American accessions ensued through the 1990s. African American participation among active component enlisted personnel has remained relatively stable at around 20% thus far in the 21st century (U.S. Department of Defense, 2008, 2013). It also should be noted that with the exception of the ASVAB misnorming period described in the following section, African American recruits have met all aptitude and education enlistment standards, thereby demonstrating their qualifications for military service. The percentage of female enlisted accessions more than tripled, rising from 5% in 1973 (Goldman, 1973) to approximately 17% in 2006 among non–prior service members (Manning & Griffith, 1998; U.S. Department of Defense, 2008). As of 2013, that percentage remained stable at approximately 17% (U.S. Department of Defense, 2013). Although the services have increased their proportions of women, youth propensity polls indicate that young women are still approximately 50% less likely than young men to indicate an interest in joining the military (Handy & Ramsberger, 2014; Ramsberger, 1993; Sackett & Mavor, 2004; U.S. Department of Defense, 2008). Hispanics make up a smaller but growing proportion of the military services than do African Americans.
Enlisted Hispanics constituted just over 1% in the early 1970s, but by the late 1980s, that percentage had increased to nearly 5%. There has been a steady rise in new recruits of Hispanic descent ever since. In 2013, that percentage had increased to 16%. However, this group remained underrepresented relative to the growing comparable civilian population (20%; U.S. Department of Defense, 2013).

ASVAB MISNORMING AND JOB PERFORMANCE MEASUREMENT PROJECT In 1980, the DoD announced that the ASVAB in use since 1976 had been misnormed, with the result that scores in the lower ranges were artificially inflated (Boldt, 1980; Jaeger, Linn, & Novick, 1980; Maier & Grafton, 1980; Sims & Truss, 1978, 1979, 1980). In other words, in developing norms for the ASVAB, an error was made in the sample and method used to convert raw scores to percentile scores. As a result, approximately 360,000 men and women entered service during the period 1976–1980 who would not otherwise have met enlistment standards (Eitelberg, 1988). About one out of every four male recruits across all services in those years
would have been disqualified under the aptitude standards the services intended to apply. Young African American men appear to have been the biggest beneficiaries of the misnorming. Over 40% of African American recruits during this period had test scores that ordinarily would have kept them out of the military. Hispanics, too, were affected by the misnormed ASVAB. Almost 33% would have been ineligible under the correct aptitude standards (Eitelberg, 1988). The quality of Army recruits fell to an all-time low during this period, even lower than during the period of heavy mobilization for World War II (U.S. Department of Defense, 1985). The ASVAB misnorming episode turned out to be a natural experiment with large numbers of new recruits entering service “unselected.” The misnorming presented a unique opportunity to study, on a large scale, the validity of selection standards in a less restricted population. The people who were admitted to the military with aptitude scores below the cutoff points were assumed by their supervisors to have had scores above the enlistment standards. Individuals with legitimately qualifying scores did appreciably better than their lower-scoring peers in terms of training performance, promotions, disciplinary problems, and attrition. At the same time, the low-aptitude recruits were able to successfully perform in low- and medium-demand occupations (Greenberg, 1980; Means, Nigam, & Heisey, 1985; Shields & Grafton, 1983). As a consequence of the misnorming, members of Congress and policy makers in the DoD became interested in the methods used to set enlistment standards and to establish recruit quality requirements (Sellman, 2012; Sellman & Campbell, 2012). In the congressional view, the fact that the ASVAB traditionally had been validated against success in training rather than on-the-job performance was potentially problematic. Supporting studies regarding the relationship between recruit quality and military performance lacked persuasive power because proxy measures (e.g., attrition, promotion rates, or reenlistment eligibility) were used rather than actual measures of job performance. Congressional scrutiny of the ASVAB misnorming and surrounding issues of recruit quality and entry standards led to the Joint-Service Job Performance Measurement/Enlistment Standards Project (JPM Project; Sellman, 1991; Sellman & Campbell, 2012). The JPM Project comprised three phases: (a) determine the feasibility of measuring hands-on job performance; (b) if feasible, validate the ASVAB against on-the-job performance; and (c) develop an enlistment standards cost/performance tradeoff model that linked recruit quality, recruiting resources, and job performance. The overall project strategy called for each service to develop and demonstrate various job performance measurement approaches that could be used to link enlistment standards to job performance (U.S. Department of Defense, 1991; Wigdor & Green, 1986, 1991). Because of the complexity of the JPM research goals, the DoD turned to the National Research Council to provide scientific oversight and an independent technical review by nationally recognized experts as the research progressed (Wigdor & Green, 1991). An exemplar of this research, documenting the relationship between the ASVAB and various measures of job performance, is Project A, the Army’s JPM contribution.
This multiyear effort sought to assess the validity of cognitive abilities as well as supplemental predictors, such as temperament, vocational interests, and psychomotor skills. The ultimate goal was to generate a database of validity information needed for developing an organization-wide selection and classification system that would generalize across Army jobs (Campbell, 1990a, 1990b; McHenry, Hough, Toquam, Hanson, & Ashworth, 1990; Sellman & Campbell, 2012; Wise, McHenry, & Campbell, 1990). Each service developed and demonstrated hands-on job performance measures in several military occupations. These job performance measures were used to evaluate certain surrogate measures of performance (less expensive, easier to administer tests or existing performance information) as substitutes for the more expensive, labor-intensive, hands-on job performance tests (Armor & Roll, 1984; Green, Wing, & Wigdor, 1988). Performance tests consisted of tasks chosen from the domain of tasks in selected military occupations, on which examinees (job incumbents) were evaluated. These measures were designed to replicate actual job performance yet provide objective evaluation of the performance demonstrated. Integration of the different service research efforts into a joint service product was accomplished through development of a common data analysis plan. These analyses (a) described the distributions of hands-on performance test scores, aptitude scores, job experience, and
educational attainment; (b) assessed the reliability of the hands-on performance test scores; and (c) measured the degree of relationship (i.e., correlation) between performance test scores and other variables of interest. These tests were administered to 8,000 incumbent, first-term soldiers, sailors, airmen, and marines assigned to 24 different occupations (U.S. Department of Defense, 1991). Occupations were selected to be representative of all military occupations with large numbers of recruits entering job skills training (McCloy, 1994). The examinees averaged 25.1 months in service, and the average AFQT score was 55.1 on a 100-point percentile scale (U.S. Department of Defense, 1991). The average split-half reliability coefficient for the performance tests across all 24 occupations in the JPM Project was .72 (U.S. Department of Defense, 1991). Split-half estimates were preferred to Cronbach’s alpha given the heterogeneous task content of the performance tests. Measures of reliability showed an acceptable degree of consistency in the performance test scores, suggesting that a reliable benchmark measure had been developed against which to compare the various surrogate measures of job performance (U.S. Department of Defense, 1991). The National Research Council scientists also wanted an established benchmark to which they could then compare other potential performance measures. In their view, hands-on performance tests represented the pinnacle of performance measurement, and given the acceptable reliability those tests demonstrated, they were confident in fielding the surrogate measures as part of the subsequent selection and classification research. The correlation between the AFQT and hands-on performance tests, corrected for restriction in range, yielded an average validity coefficient of .40 (U.S. Department of Defense, 1991). This level of validity is of interest because the AFQT is a test of general aptitude, whereas the performance test scores reflected observable performance in different types of occupations. Thus, the JPM Project established the link between measured aptitude for performing a job and demonstrated performance of that job. Given the nature of the performance test criterion, a validity coefficient of .40 compared well with other military validity studies (Armor & Sackett, 2004). The job performance measurement research completed by the services provided measures that closely replicated actual job performance. Rather than assessing, via a paper-and-pencil test, what enlisted personnel might know about calibrating a piece of precision avionics equipment or operating a weapon’s targeting system, the services were able to assess how well enlisted job incumbents did such tasks. Although the two are related, knowledge about a job is not the same thing as being able to do the job. Typically, (corrected) validities of military aptitude tests for predicting training success or supervisor ratings have ranged between .30 and .60 (Hartigan & Wigdor, 1989). Research shows a strong relation between ASVAB (including AFQT) scores and success in military job skills training and hands-on job performance across a range of occupations (Campbell, 1990a; Claudy & Steel, 1990; Dunbar & Novick, 1988; Earles & Ree, 1992; Holmgren & Dalldorf, 1993; Hunter, Crosson, & Friedman, 1985; Mayberry & Carey, 1997; Welsh, Kucinkas, & Curran, 1990; Wigdor & Green, 1991).
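The split-half reliability (.72) and the range-restriction correction behind the .40 validity coefficient are standard psychometric computations. The following minimal Python sketch shows the textbook versions of both: the Spearman-Brown step-up applied to a half-test correlation, and the Thorndike Case II correction for direct range restriction on the predictor. The numbers are illustrative only and are not the actual JPM data.

```python
import numpy as np

def spearman_brown(r_half: float) -> float:
    """Step a half-test correlation up to a full-test (split-half) reliability."""
    return 2 * r_half / (1 + r_half)

def correct_range_restriction(r_restricted: float, sd_unrestricted: float,
                              sd_restricted: float) -> float:
    """Thorndike Case II correction for direct range restriction on the predictor."""
    u = sd_unrestricted / sd_restricted
    return (r_restricted * u) / np.sqrt(1 + (u**2 - 1) * r_restricted**2)

# Illustrative values only (not the JPM data): an observed predictor-criterion
# correlation of .30 among enlistees, with an applicant-pool SD twice the
# enlisted-sample SD, corrects to roughly .53.
print(round(spearman_brown(0.56), 2))                       # ~0.72
print(round(correct_range_restriction(0.30, 2.0, 1.0), 2))  # ~0.53
```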
The services value recruits with above average aptitude because they are more trainable and their job performance is superior to that of their lower-scoring peers. Even with on-the-job experience, enlistees with lower aptitude continue to lag behind those with higher aptitude. As is shown in Figure 31.1, below average (AFQT Category IV) recruits require more than three years of experience to attain the level of performance at which the higher aptitude recruits (AFQT Categories I-II) begin (Armor & Roll, 1984; Armor & Sackett, 2004; U.S. Department of Defense, 1991). Higher aptitude personnel also experience fewer disciplinary problems. The information shown in Figure 31.1 came from the JPM Project (U.S. Department of Defense, 1991). Although collected more than two decades ago, these job performance data continue to be the best source of information about the job performance of enlisted personnel. For one thing, research has consistently demonstrated that cognitive ability, such as is measured by the AFQT, is a strong predictor of job performance across a variety of occupations (Campbell, 1990a; Campbell, 1990b; Hunter & Hunter, 1984; Schmitt, Gooding, Noe, & Kirsch, 1984; Welsh, Watson, & Ree, 1990). In addition, recent interviews with military training specialists responsible for the occupations used in the research reported that the occupations had changed little since the original job performance data were collected. Thus, it is safe to generalize from these data and to conclude that the relation between aptitude, experience, and job performance is still pertinent. One of the major objectives of the JPM Project was development of a mathematical model to link recruit quality, recruiting resources, and job performance. Working with the National Research Council, in 1991 the DoD used that model to establish the DoD recruit quality benchmarks (Sellman, 1997; Sellman & Campbell, 2012; Sticha et al., 2014). In general, enlistment standards are based on judgments by service policy makers as to the level of job performance required. However, standards should be guided by empirical evidence of the relationship between recruit quality and the required level of performance. Although it is extremely difficult to specify an absolute value of performance sufficient to guarantee successful military mission accomplishment, research performed within the JPM Project developed reliable and valid measures of individual job performance that became the basis for the linkage model. For years, industrial psychologists contended that job performance was the ultimate criterion for validating selection tests. In fact, S. Rains Wallace (1965), an eminent psychologist, once called it the holy grail of industrial psychology. Measuring job performance is a very expensive proposition. With the support of Congress and the DoD’s effort to recover from the embarrassing misnorming episode, $40 million was made available for the JPM Project. Another aspect of this research effort that made it unique was its sustainability. It was widely recognized as a project of great merit, and it lasted for more than 15 years, spanning five presidential administrations, both Democrat and Republican.

[Figure 31.1 appears here: hands-on performance test score (vertical axis, roughly 38–58) plotted against job experience in months (0–12, 13–24, 25–36, 37+), with separate lines for AFQT Categories I–II, IIIA, IIIB, and IV.]
FIGURE 31.1  Hands-On Job Performance Scores as a Function of Aptitude and Experience. AFQT Percentile: I (93–99); II (65–92); IIIA (50–64); IIIB (31–49); IV (10–30). Source: Based on Office of the Assistant Secretary of Defense—Force Management and Personnel (1991: 2–4).
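The AFQT category bands listed in the figure caption lend themselves to a trivial lookup. The helper below is hypothetical (the function name is invented, and treating percentiles below 10 as Category V follows the statutory floor discussed later in the chapter); it is included only to make the banding concrete.

```python
def afqt_category(percentile: int) -> str:
    """Map an AFQT percentile (1-99) to the category bands shown in Figure 31.1."""
    bands = [(93, "I"), (65, "II"), (50, "IIIA"), (31, "IIIB"), (10, "IV")]
    for floor, label in bands:
        if percentile >= floor:
            return label
    return "V"  # percentiles below 10; below the statutory enlistment floor

assert afqt_category(97) == "I"
assert afqt_category(55) == "IIIA"
assert afqt_category(31) == "IIIB"
```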

ENLISTED SELECTION AND CLASSIFICATION IN TODAY’S MILITARY Currently, the U.S. military recruits nearly 200,000 young people annually into full-time, active-duty service and another 150,000 into the reserve components (U.S. Department of
Defense, 2013). Standards for enlistment are established under the authority of Title X of the U.S. Code (January 2009). Enlistment criteria are based on the needs of the services and are designed to ensure that those individuals accepted are qualified for general military duties. These individuals must be able to cope successfully with a wide range of demands occurring in a military situation, such as exposure to danger, emotional stress, harsh environments, and the handling or operation of dangerous equipment. Furthermore, the services require all military members to be available for worldwide duty 24 hours a day without restriction or delay. Operating at the service-wide level are several mechanisms that probably do more than formal enlistment standards to determine the character of entering recruits. The most important of these is the general recruiting environment—the ever-varying willingness of high-aptitude youth with high school diplomas to enter the military. This willingness cannot be considered part of a service’s enlistment standards, but it sometimes directly affects the standards that a service sets. For example, during good recruiting times, a service may stop accepting non-graduates in AFQT Category IIIB (percentiles 31–49), even though they satisfy the entrance standards codified in Title X of the U.S. Code. Each service attempts to assign the highest quality recruit possible into the various military occupations. Consequently, composite cut scores for occupational classification represent a compromise between service ideals and fluctuating supply/demand pressures. Service officials set cut scores on the basis of personnel requirements, equipment used, training curricula, retention, the economy, and the availability of recruits with various composite aptitudes. Because the ASVAB is used to determine enlistment eligibility and job placement, it is important to the DoD and the services that the test be fair and equitable for all military applicants, no matter their gender or race/ethnicity. Over the years, military personnel researchers have devoted considerable effort to ensure that the ASVAB is a valid predictor of job training success and performance on the job and to minimize adverse impact for various subgroups. While the ASVAB yields subgroup differences that are similar in magnitude to those typically observed for comparable cognitive test batteries (Roth, Bevier, Bobko, Switzer, & Tyler, 2001; Sackett & Shen, 2009), research indicates that the ASVAB is valid for minorities and women. Equations for prediction of final grades in military training courses from the ASVAB were essentially the same for Whites and minorities and men and women (Held, Fedak, Crookenden, & Blanco, 2002; Mayberry, 1997; Wise et al., 1992). Where differences in prediction of school grades were observed, technical training performance of minorities was overpredicted by the ASVAB. For women, the ASVAB slightly overpredicted technical training performance in non-traditional career fields. No differences were found for traditional military occupations. The Office of the Secretary of Defense asked the Defense Advisory Committee on Military Personnel Testing to review the Wise et al. (1992) research, which looked at applicants across all services.
In responding, the chair of that committee noted: “The conclusions from the analyses—that the ASVAB technical composites are fair and sensitive—are clear and compelling, and the use of the same enlistment standards and qualification scores for military occupations for all young people is justified” (Drasgow, 1992, p. 2).
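The fairness analyses summarized above (comparing prediction equations across groups and checking for over- or underprediction) are typically run as a Cleary-style moderated regression. The sketch below is a generic illustration on synthetic data, not the Wise et al. (1992) analyses; the variable names and the use of statsmodels are assumptions made for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
# Synthetic records standing in for applicant data: an aptitude composite,
# a group indicator, and a training grade generated from the composite alone.
df = pd.DataFrame({
    "composite": rng.normal(50, 10, n),
    "group": rng.integers(0, 2, n),   # 0 = reference group, 1 = focal group
})
df["grade"] = 60 + 0.4 * df["composite"] + rng.normal(0, 5, n)

# Cleary-style test: do intercepts or slopes differ across groups?
model = smf.ols("grade ~ composite * group", data=df).fit()
print(model.params[["group", "composite:group"]])    # both near zero here
print(model.pvalues[["group", "composite:group"]])
```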

ENLISTMENT PROCESS Young men and women interested in joining the military enter the enlistment process by contacting service recruiters. In addition to providing information about service life, opportunities, and benefits, recruiters also begin the initial screening of applicants. Most prospects take an enlistment-screening test at a recruiting office. This enlistment-screening test is used to predict the likelihood of “passing” the AFQT (Barnes & Brown, 2013). Estimates are that 10–20% of prospects do not continue beyond this point (U.S. Department of Defense, 2004). Applicants must meet multiple requirements before they are selected for service. After recruiters have completed the preliminary screening and prospects have decided to enlist, they can go either to a Military Entrance Processing Station (MEPS) or a military entrance testing (MET) site to take the ASVAB. Military and civilian staffs at MEPS evaluate applicants’ medical qualifications, aptitude, and moral character on the basis of standards predetermined by the
services. Some services also require a test of physical ability at the MEPS. (Military Entrance Processing Stations were previously known as Armed Forces Examining and Entrance Stations.) If an applicant achieves qualifying ASVAB scores and wants to continue the application process, a physical examination and background review is conducted at the MEPS. The physical exam assesses medical fitness for military service and includes the measurement of blood pressure, pulse, visual acuity, and hearing; blood testing and urinalysis; drug and HIV testing; and medical history. If a correctable or temporary medical problem is detected, applicants may be required to get treatment before proceeding. Other applicants may require a service waiver for some disqualifying medical conditions before being allowed to enlist (Sackett & Mavor, 2006). Furthermore, applicants must meet rigorous moral character standards. Applicants undergo detailed interviews covering any involvement with civil law enforcement (e.g., arrests, convictions), and some undergo a financial check or computerized search for criminal records. Some types of criminal activity are immediately disqualifying; other cases may offer the possibility of a waiver of the rule, wherein the services examine applicants’ circumstances and make an individual determination of qualification (Putka, Noble, Becker, & Ramsberger, 2004). Moreover, applicants with existing financial problems are not likely to overcome those difficulties on junior enlisted pay. Consequently, credit histories may be considered as part of the enlistment decision. If the applicant’s ASVAB score, education credentials, medical fitness, and moral character qualify for entry, then the applicant meets with a service classification counselor at the MEPS to discuss options for enlistment (Sackett & Mavor, 2003). The counselor considers the applicant’s qualifications along with service training or skill openings, schedules, and enlistment incentives. In this classification process, high-scoring recruits are discouraged from choosing jobs that require only low aptitude, and recruits who want to enter jobs for which they barely meet the standard but who have high aptitudes in other areas are encouraged to choose jobs for which they are better qualified. Each service has incorporated its algorithms into computerized job reservation systems that service counselors at MEPS use to match individuals’ desires with the needs of the services; one component of that match is how well recruits’ ASVAB scores suit them for the various jobs. Generally, those who score higher on tests will have more occupational options. Although the process differs by service, specific skills and occupational grouping are arranged similarly to an airline reservation system, with the training “seat” and time of travel (to recruit training) based on the school or the field unit position openings. Using enlistment incentives (cash bonuses or extra money that can be used to cover college costs), recruiters may encourage the applicant to choose hard-to-fill occupational specialties. Ultimately, it is the applicant’s decision to accept or reject the offer. Some applicants discuss options with their family and friends, and others decide not to enlist (Sackett & Mavor, 2006).

RECRUIT QUALITY BENCHMARKS AND ENLISTMENT STANDARDS With the advent of the all-volunteer force (AVF) in 1974, recruit quality became a significant issue within the DoD and the Congress. Many critics of the AVF believed that the services would not be able to attract a sufficient number of high-quality recruits. Consequently, since that time Congress has tracked trends in recruit quality, just as it does for recruit quantity. In 1975, the Office of the Secretary of Defense (OSD) submitted a report to Congress on service recruit quality requirements (U.S. Department of Defense, 1975). In this report, the OSD retrospectively estimated that the Army would need 22–23% new recruits in AFQT Categories I–II, 57–58% in Category III, and 20% in Category IV. It is interesting to note that in the early days of the AVF, recruiting analysts in the Pentagon were willing to accept a much lower level of recruit quality than became the case with experience and certainly is true today. A decade later, in 1985, the OSD submitted a second report to Congress on recruit quality requirements (U.S. Department of Defense, 1985). This report reflected 10 years of experience in projecting recruiting requirements and was a more sophisticated effort to tie recruit quality to job performance. The recruit quality requirements in the 1985 report were prepared by the
services, instead of being developed by the OSD, as they were in 1975. In the 1985 report, the Army projected that it would need 90% high school diploma graduates, 59% Categories I–IIIA, and 10% Category IV for FY 1989 recruits. With the exception of the percentage of low-aptitude recruits that the Army said would be acceptable, the Army’s other quality requirements were virtually identical to today’s recruiting benchmarks (Sticha et al., 2014). How does the U.S. military decide how many high school diploma graduates and above-average-aptitude recruits to enlist? The goal is to maximize recruit quality (aptitude and education) while minimizing recruiting, training, and attrition costs. In conjunction with the National Research Council, and based on the results of the JPM Project discussed earlier, the DoD developed a mathematical model that links job performance to recruit quality and recruiting resources; this model specifies the number of high-quality recruits who will provide the desired level of job performance for the least cost (Harris et al., 1991; McCloy, 1994; Smith & Hogan, 1994; Wise, 1994). Scores from the JPM Project define the job performance variable (Green & Mavor, 1994; Wigdor & Green, 1991). Costs reflect training costs, compensation costs, and recruiting costs (e.g., recruiter compensation and money for advertising, education benefits, and enlistment bonuses). Using these relations, the model allows “what-if” analyses to examine how changes in one or more of these variables affect the other variables. For example, the model could answer how decreasing the DoD advertising budget by $20 million would affect recruit quality and job performance. What should be the desired level of performance? Recruit quality benchmarks are used to help ensure that recruit performance is sufficient to complete military missions. The model cannot estimate how much quality is enough; rather, policy decision makers/recruiting analysts within the DoD set the desired level of performance. Nevertheless, the model can help specify a cohort of recruits that will provide the desired level of performance for the lowest cost (Sellman & Campbell, 2012). The performance level identified by the recruiting analyst is a minimally acceptable value. The DoD has chosen the level of performance provided by the 1990–1991 enlisted cohort (the cohort in service during Operations Desert Shield and Desert Storm). Specifying this level of desired performance resulted in recruit quality benchmarks that call for 60% of recruits to score above the 50th percentile on the AFQT (i.e., to be in Categories I–IIIA) and 90% to have high school diplomas (Sellman, 1994). These benchmarks are not enlistment standards that the services use to establish entrance eligibility. Rather, they are recruiting goals that the services strive to meet to maximize performance and minimize recruiting costs. Standards codified in Title X of the U.S. Code are considerably lower (i.e., AFQT scores at the 10th and 31st percentiles for high school diploma graduates and non-graduates, respectively) than standards actually used by the services for enlistment purposes (Sellman, 2004).
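The “what-if” logic of the cost/performance tradeoff model can be conveyed with a deliberately simple sketch. Everything below (the quality cells, the cost and performance numbers, and the grid-search approach) is invented for illustration; the actual DoD linkage model is far richer and is documented in the sources cited above.

```python
import itertools

# Toy stand-ins for the linkage model's inputs (all values invented):
# cost per recruit by quality cell and a stylized expected-performance index.
COST = {("HSDG", "I-IIIA"): 18000, ("HSDG", "IIIB-IV"): 12000,
        ("NHSG", "I-IIIA"): 10000, ("NHSG", "IIIB-IV"): 8000}
PERF = {("HSDG", "I-IIIA"): 1.00, ("HSDG", "IIIB-IV"): 0.85,
        ("NHSG", "I-IIIA"): 0.90, ("NHSG", "IIIB-IV"): 0.70}

def cheapest_mix(target_perf: float, step: float = 0.05):
    """Grid-search the share of each quality cell that meets the performance
    floor at the lowest expected cost per recruit."""
    cells = list(COST)
    shares = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    best = None
    for mix in itertools.product(shares, repeat=len(cells) - 1):
        last = round(1 - sum(mix), 2)
        if last < 0:
            continue  # shares must sum to 1
        weights = dict(zip(cells, list(mix) + [last]))
        perf = sum(w * PERF[c] for c, w in weights.items())
        cost = sum(w * COST[c] for c, w in weights.items())
        if perf >= target_perf and (best is None or cost < best[0]):
            best = (cost, weights)
    return best

print(cheapest_mix(0.90))  # lowest-cost mix meeting the illustrative floor
```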

NEED FOR MILITARY OFFICER SELECTION AND CLASSIFICATION Military officers must be adept decision makers and problem solvers, good managers and supervisors, and effective leaders (Paullin et al., 2014; Wolters et al., 2014). But, unlike their civilian leader counterparts, military officers must be prepared to endure physical hardships over long periods of time and the strain of being and placing others in harm’s way. Officer selection research repeatedly supports the importance of stress tolerance and physical fitness for effective officer performance (Allen et al., 2014; Legree, Kilcullen, Putka, & Wasko, 2014), and U.S. Military Academy research has demonstrated that grit and hardiness are important predictors of performance of West Point graduates (Kelly, Matthews, & Bartone, 2014). Because officer education represents a substantial investment, the services also need to select officers who will stay in the service beyond their initial active-duty service obligation (ADSO). As mentioned previously, military selection is a closed system. American military leaders are “grown” from the junior ranks. Clearly, officer selection strategies must yield a pool of junior officers who are committed to serving, well-equipped with the skills and abilities they need to serve effectively, and capable of attaining higher ranks (Eitelberg, Laurence, & Brown, 1992).


SELECTION FOR OFFICER COMMISSIONING PROGRAMS Military officers are commissioned through several sources. About 35% of all new active-duty officers are commissioned through Reserve Officer Training Corps (ROTC) programs, with the remaining new officers coming from four-year service academies, Officer Candidate School (OCS)/Officer Training School (OTS) programs, and direct appointments. The Army and Air Force rely more heavily than the other services on ROTC, with roughly half of their new officers being commissioned through ROTC. The Navy’s distribution of new officers is fairly even across all sources. The Marine Corps has a very small ROTC program and counts on its OCS program for more than 60% of its new officers. To ensure that the officer corps can meet the rigorous demands placed on officers, the service academies use a highly selective “whole person” approach to select their students and, hence, their officer candidates. Candidates are selected based on scores on college admission tests, such as the Scholastic Assessment Test (SAT) and American College Test (ACT), and a host of other variables designed to get at leadership potential, achievement orientation, physical fitness, and endurance, such as participation in athletics, high school grades, and teacher recommendations. Applicants must be unmarried, U.S. citizens, and between 17 and 23 years of age. Most applicants must obtain a letter of nomination from a member of the U.S. Congress, and the process of securing a nomination is lengthy and highly selective, with its own requirements and deadlines. Each service’s ROTC has a collection of programs. Four-year ROTC scholarship programs use a whole-person selection approach that resembles service academy selection (without the nomination requirement). Like the service academies, ROTC scholarship programs involve a substantial monetary investment in the candidate’s education. Consequently, selection of individuals who will succeed in college is critical at this stage, and SAT and ACT scores have been shown to predict college grades (Morgan, 1990; Richardson, Abraham, & Bond, 2012). The services typically do not require cadets who drop out of school or out of ROTC within the first year or two to reimburse tuition and expenses. Therefore, attrition or disenrollment from college or ROTC is a concern. In recent years, the Army added a measure of temperament, the Cadet Background and Experience Form (CBEF), to its whole-person score to improve selection of cadets who are likely to complete college and become commissioned officers (Legree et al., 2014). College students also may apply for two- or three-year ROTC scholarships. These students will have taken military coursework, and the evaluation for scholarships takes collegiate performance and course grades into account. Selection for non-scholarship ROTC programs varies by service and location and can include service-specific test scores in conjunction with other academic, physical fitness, and experiential requirements. ROTC programs make a distinction between the first two basic years of the ROTC curriculum and the last two, advanced or professional, years. As cadets transition to more advanced levels, they participate in field training courses outside of the classroom and may take service-specific aptitude measures. Air Force ROTC cadets take the Air Force Officer Qualifying Test (AFOQT; Weissmuller, Schwartz, Kenney, & Gould, 2004).
The Navy and Marine Corps use a similar test, the Aviation Selection Test Battery (ASTB), to select student flight officers and student pilots (Naval Operational Medicine Institute, n.d.). It takes four years to produce an officer through ROTC programs or the service academies. In contrast, OCS/OTS programs are two to four months long, providing a relatively quick means of meeting officer manning requirements. Consequently, the number of available OCS/OTS seats is highly variable over time, depending upon current demands. OCS/OTS programs provide an avenue for college graduates with no ROTC experience and enlisted service members who also are college graduates to become military officers. Selection is based on college grades, cognitive test scores, and scores on physical fitness measures as well as interviews. For cognitive measures, the Army and Marine Corps rely primarily on scores on the ACT, SAT, or for enlisted personnel, the ASVAB. The Navy, Marine Corps, and Air Force use service-specific tests, the ASTB and the AFOQT, respectively. The Army and Air Force currently are conducting research using noncognitive measures in an attempt to select candidates who are
likely to perform well, stay on active duty beyond their initial commitment, and fit well in their military occupations. One salient finding from the Army’s longitudinal OCS research is that the two populations in OCS (i.e., college graduates with no military experience and enlisted personnel with college degrees) differ substantially in terms of demographic background, motivation, and experience. Prior enlisted candidates tend to be older and more committed to staying in the Army until retirement (Allen & Young, 2012; Thirtle, 2001). Separate prediction equations are needed to predict important outcomes for the two groups (Allen et al., 2014). The smallest and most specialized commissioning method is through direct appointment. This program is designed for individuals who currently possess an advanced degree and wish to enter the military in the fields of medicine, dentistry, law, or the chaplaincy. Upon selection, individuals are immediately commissioned and subsequently attend a short training course to prepare them for the military.

SELECTION OF PILOTS AND AIR CREWS The Navy, Air Force, and Marine Corps select student pilots well before commissioning, before or during ROTC and OCS/OTS. Pilot and air crew selection has an engaging history, beginning in World War I at a time when the science of flight was in its infancy and psychometric measurement methods were maturing (Russell & Rumsey, 2012). Little was known about the skills and abilities required for pilot and air crew positions, let alone how to select for them. Army and Navy researchers such as E. L. Thorndike (1947) and John Flanagan (1948) pioneered a host of assessment methods during World Wars I and II that serve as the foundation for test batteries used today. The ASTB grew out of a World War II research effort, the Pensacola 1000 Aviator Study, which examined more than 60 psychological, psychomotor, and physical tests (North & Griffin, 1977). The Naval Operational Medicine Institute (NOMI) revised the ASTB in 2004 (NOMI, n.d.). The current version has seven tests: the Math Skills Test, Reading Comprehension Test, Mechanical Comprehension Test, Aviation and Nautical Information Test, Naval Aviation Trait Facet Inventory, Performance Based Measures Battery, and Biographical Inventory with Response Validation. Four composite scores—Academic Qualifications Rating (AQR), Pilot Flight Aptitude Rating (PFAR), Flight Officer Aptitude Rating (FOFAR), and Officer Aptitude Rating (OAR)—are used to select students for pilot and flight crew jobs. The AFOQT was developed and validated by the Air Force Human Resources Laboratory and is now managed by the Air Force Personnel Center (Weissmuller et al., 2004). The AFOQT covers areas such as word knowledge, math knowledge, general science, table reading, and aviation information. Scores contribute to five composites: Verbal, Quantitative, Academic Aptitude, Pilot, and Navigator/Technical. Aviator and flight officer candidates (e.g., pilot, combat systems operator, and air battle manager) must meet minimum scores on the Pilot and Navigator/Technical composites and pass the Pilot Candidate Selection Method (PCSM), which comprises results on the AFOQT and the Test of Basic Aviation Skills (TBAS; Carretta, 2005)—a test of spatial, dichotic listening, and psychomotor abilities. After the Air Force became a separate service in 1947, the Army concentrated on the selection of rotary wing, or helicopter pilots, most of whom were warrant officers coming through the enlisted ranks. Currently, the Army uses the Selection Instrument for Flight Training (SIFT; Paullin, Katz, Bruskiewicz, Houston, & Damos, 2006) for the selection of rotary wing pilots. SIFT includes portions of the ASTB as well as components that were specifically developed for the Army (e.g., the Army Aviation Knowledge Test).

OFFICER OCCUPATIONAL ASSIGNMENT To make assignments to occupations other than pilot, flight officers, and direct commissioned occupations, each service begins with an overall officer accession target. Overall targets are divided across commissioning sources so that each source has an assigned target, or quota, for
the number of officers overall and the number of officers in occupations. In turn, commissioning sources try to meet several goals in assigning cadets/candidates to occupations:
• Meet strength and manning distribution requirements;
• Ensure that the higher quality leaders are distributed among all occupations;
• Balance officer demographic characteristics across occupations;
• Maximize satisfaction by assigning cadets to occupations they prefer; and
• Assign cadets to occupations where they will perform the best, based on their skills, abilities, and interests.

These goals often conflict. Assignments that maximize performance might assign a large proportion of the highest-performing cadets to the same occupation, so that quality would not be acceptably distributed. There also are a number of constraints on assignments, the first being the number of available slots for an occupation. Additionally, some occupations require a degree or other special qualifications. For example, a weather officer must have a degree in meteorology or a related field. Clearly, this complicates optimal assignment of cadets/candidates to occupations. The needs of the service are paramount to other classification goals, and those needs are embedded in algorithmic and judgment-based methods the services use to make assignments. For example, the Air Force uses an optimization algorithm to determine the targets (or quotas) for commissioning programs and make assignments to non-flying occupations. The algorithm takes account of degree requirements and desirable qualifications for occupations, overall accession goals for each occupation, and student preferences (Sickorez, 2003). Mismatches where the remaining occupational requirements do not match candidate qualifications must be resolved judgmentally. The Army recently has made a number of changes to the officer assignment process. Like the other services, the Army has traditionally slotted cadets from each graduating class (service academy, ROTC, or OCS) into branches (e.g., occupational clusters such as Infantry or Armor) by matching cadets’ branch preferences and with their rank ordering on an order of merit list (OML) based on academic, military and physical records. In 2014, the United States Military Academy (USMA) implemented a new officer branching system. The new system involves (a) comprehensive assessment of cadets during their junior year to help them understand their own talents and occupational interests, (b) training and developmental experiences to help cadets become better informed about specific branch characteristics (e.g., typical branch missions, capabilities, equipment, assignments, etc.), and (c) matching of cadet assessment scores to branch requirements. Military Academy OML ranking, cadet preferences, results of the matching process, and Army requirements are all considered in the final branching decision (Sönmez & Switzer, 2013). Army ROTC also recently implemented changes to its branching process. Cadets who meet a minimum criterion on the OML ranking have the option of extending their ADSO by three years in exchange for their branch of choice. Finally, OCS branching changed in 2008. The new method rewards candidates for strong performance in OCS. At the end of the sixth week of the 12-week OCS course, each candidate, in order of OML ranking, selects the branch they want from the remaining options. Candidates with lower OML rankings have fewer branch choices.
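The step the Air Force algorithm performs (matching cadets to occupation slots so as to maximize overall utility, subject to qualification constraints) has the structure of the classic assignment problem. The sketch below is a generic illustration using SciPy’s assignment-problem solver; the utility scores and the way disqualifications are encoded are assumptions made for this example, not the services’ actual algorithms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical utility of assigning each cadet (rows) to each occupation slot
# (columns), combining predicted performance and stated preference; a large
# negative value marks a cadet who lacks a required qualification (e.g., degree).
utility = np.array([
    [8.0, 6.5, -1e6],   # cadet 0 is not degree-qualified for occupation 2
    [7.0, 7.5,  5.0],
    [6.0, 9.0,  4.5],
])

rows, cols = linear_sum_assignment(utility, maximize=True)
for cadet, occ in zip(rows, cols):
    print(f"cadet {cadet} -> occupation slot {occ} (utility {utility[cadet, occ]})")
```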

NEW DIRECTIONS IN MILITARY SELECTION AND CLASSIFICATION Enlisted and officer selection and classification programs are maintained by service research organizations and at the DoD level. This involves extensive and ongoing test development and validation and development and analysis of near-term and longitudinal data.

ASVAB Research Since the ASVAB was implemented in 1976, the Cattell-Horn-Carroll (CHC) model of human intellect (Carroll, 1993; Cattell, 1987; Horn, 1989) has garnered a broad base of support in the research literature. It is a hierarchical model. At the highest level, a general factor (g) accounts for
the correlations that exist among all ability measures. There is a wealth of evidence indicating that g is a good predictor of job performance (e.g., Ree & Earles, 1992). (A toy numerical sketch of extracting such a general factor from a battery’s correlation matrix appears after the list below.) The next level consists of eight broad factors including crystallized and fluid intelligence. The third, and lowest, level in the hierarchy consists of more specific abilities relating to each of the eight broad factors. Crystallized and fluid intelligence are particularly important because they are so central to g and because they are broader than the other six, more-specific factors of g. Crystallized intelligence underlies performance on knowledge or information tests. Fluid intelligence subsumes virtually all forms of reasoning—inductive, conjunctive, deductive, and so forth. It is at the heart of what is typically called intelligence, and it facilitates accumulation of crystallized knowledge (Carroll, 1993; Horn, 1989). The ASVAB technical tests are crystallized intelligence tests. Identifying a fluid intelligence test is a little trickier. Most tests require both knowledge and reasoning ability. Tests are good reasoning measures to the extent that they contain words or materials that are equally familiar, or unfamiliar, for all examinees; otherwise, variance due to knowledge makes them crystallized intelligence measures (Carroll, 1993; Horn, 1989). Noting that fluid ability is not strongly represented in ASVAB, a panel of experts recommended that the services review tests developed in earlier research projects, looking for fluid intelligence measures and measures of specific abilities that might be useful for classification purposes (Drasgow, Embretson, Kyllonen, & Schmitt, 2006). The services have taken a number of steps toward that end, including:
• Reconsidering Coding Speed (CS). CS, a measure of cognitive speediness, was dropped from the ASVAB in 2002 due to concerns about item response theory (IRT) scoring and potential lack of portability across paper-and-pencil and computer-administered modes of administration. Even so, the Navy retained it as a special test (Held, Carretta, & Rumsey, 2014), and the DoD continued to conduct studies of its portability and made scoring improvements (Segall, 1997). In recent validation studies by the Navy, CS provided small increments in validity over AFQT for predicting performance in Navy jobs, reduced adverse impact compared to other measures, and improved classification (Held et al., 2014).
• Reconsidering Working Memory Capacity (WMC). WMC is a process that is involved in the performance of reasoning tasks (Carroll, 1993; Kyllonen & Christal, 1990). The DoD is evaluating the Mental Counters test (MCt), a WMC test that was a part of the Enhanced Computer Administered Test (ECAT) battery (Alderton, Wolfe, & Larson, 1997). ECAT research showed that MCt (a) loaded strongly on g, (b) provided incremental validity over AFQT for predicting performance in military occupations, (c) showed classification potential, and (d) minimized male–female and White–Hispanic subgroup differences (Sager, Peterson, Oppler, Rosse, & Walker, 1997). MCt is currently being administered to all Navy applicants at MEPS (Moreno, 2014).
• Adding a matrix-type test. Matrix-type tests are well-established measures of fluid intelligence (Carroll, 1993). Trends are embedded in the rows and columns of a figural matrix, and examinees must find the figure that belongs in a specified cell of the matrix based on those trends. The DoD is currently preparing to pilot the test (Moreno, 2014).
• Making better use of Assembling Objects (AO). AO was a part of the JPM research test batteries (Alderton et al., 1997; Campbell & Knapp, 2001; Peterson et al., 1990) and became part of the ASVAB in 2002. Factor-analytic research suggests that AO, which has spatial content, is a measure of visual perception and nonverbal reasoning. In the ECAT project, it provided modest incremental validity over the ASVAB for predicting performance in a wide array of military occupations and demonstrated classification potential for some occupations (Held et al., 2014; Sager et al., 1997).
• Identifying cyber talent. A Cyber Test was developed and validated as a special supplement to ASVAB for selection into cybersecurity occupations (Trippe, Moriarty, Russell, Carretta, & Beatty, 2014). In June 2014, the Air Force began operational use of the Cyber Test, and the Army is currently conducting validation studies on it.
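As noted above, the claim that a general factor accounts for the correlations among ability measures can be illustrated with a toy computation: the first principal component of a battery’s correlation matrix is a common rough proxy for g loadings. The correlation matrix below is invented, and this is only a sketch, not how the ASVAB is actually scored or normed.

```python
import numpy as np

# Invented correlation matrix among four subtests (e.g., verbal, math,
# technical knowledge, spatial); all entries are illustrative.
R = np.array([
    [1.00, 0.60, 0.45, 0.40],
    [0.60, 1.00, 0.50, 0.45],
    [0.45, 0.50, 1.00, 0.35],
    [0.40, 0.45, 0.35, 1.00],
])

eigvals, eigvecs = np.linalg.eigh(R)        # eigenvalues in ascending order
first = eigvecs[:, -1] * np.sqrt(eigvals[-1])
loadings = np.sign(first.sum()) * first     # fix the arbitrary sign
print(np.round(loadings, 2))                # rough "g" loadings of each subtest
print(round(eigvals[-1] / R.shape[0], 2))   # share of variance in the first component
```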

Noncognitive Measures The biggest stumbling block the services have encountered in trying to implement noncognitive measures of personal characteristics such as personality and interests is that applicants tend to present themselves in an overly positive light, or “fake.” Knowing this, the services have conducted a number of research efforts, including (a) evaluating noncognitive measures in
an operational setting, (b) conducting additional research on reducing or detecting faking, and (c) using test development and administration methods known to reduce faking. Decades of research have culminated in two forced-choice, adaptive personality measures to overcome faking—the Tailored Adaptive Personality Assessment System (TAPAS) and Navy Computer Adaptive Personality Scales (NCAPS) (Rumsey & Arabian, 2014b). The TAPAS presents pairs of statements, often representing different traits, and asks test takers to select the one that is most like them (see Stark et al., 2014). The Army uses TAPAS to make decisions about applicants who have a high school diploma but fall into AFQT Categories IIIB and IV. Those with very low TAPAS scores are screened out. The Army also is conducting research on the possible use of TAPAS in officer selection. The Air Force does not use TAPAS for selection but does use TAPAS scores for classification into Special Operations Forces positions. The NCAPS presents examinees with pairs of statements representing different levels of a trait (unlike TAPAS, which often presents pairs of statements representing different traits) and asks them to select the statement that is most like them. Up to 15 pairs of statements are presented for each trait, until a precise estimate is obtained. Ongoing research continues to investigate the NCAPS (e.g., Oswald, 2010; Schneider et al., 2007), and it has been used operationally to help the Navy select individuals into Special Operations training assignments.
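To make the adaptive, paired-statement format concrete, the toy loop below narrows an interval around an examinee’s trait standing as each pair of statements pitched at different levels is “answered.” This is purely illustrative of the general idea; it is not the TAPAS or NCAPS item-selection or scoring model, and every parameter is invented.

```python
import random

def adaptive_trait_estimate(true_theta: float, n_pairs: int = 15) -> float:
    """Toy adaptive loop: each 'item' pairs statements pitched at two trait
    levels; the simulated examinee endorses the statement closer to their
    standing, and the working interval shrinks. Illustrative only."""
    lo, hi = -3.0, 3.0                     # working range on a z-score metric
    for _ in range(n_pairs):
        low_stmt, high_stmt = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        noise = random.gauss(0, 0.3)       # noisy endorsement
        if abs(true_theta + noise - high_stmt) < abs(true_theta + noise - low_stmt):
            lo = low_stmt                  # examinee looks closer to the higher level
        else:
            hi = high_stmt
    return (lo + hi) / 2

random.seed(1)
print(round(adaptive_trait_estimate(0.8), 2))  # converges near the true value
```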

Educational Attainment Another possible use of noncognitive measures would be to replace educational achievement in the enlistment screening process. The DoD has moved some alternative educational credentials from Tier 2 to Tier 1, despite higher attrition rates for personnel attaining them, because the DoD wants to avoid the appearance of devaluing alternative programs; doing so, however, reduces the effectiveness of the tier system in minimizing attrition. One solution to this issue would be to discontinue educational achievement as an enlistment screen. Education credentials could be replaced by noncognitive measures, which have been shown to be useful predictors of attrition. Educational credentials would still be documented, reported, and included in research but not used for selection. Research simulating the effects of such a move could help support or reject the notion. If this policy were adopted, the recruit quality benchmarks, described earlier in the chapter, would need to be redefined and revalidated to include noncognitive measures in place of educational credentials, using contemporary data.
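The kind of simulation called for above can be prototyped quickly. The Monte Carlo sketch below compares first-term attrition under an education-credential screen versus a noncognitive screen applied at the same selection ratio; every distribution, correlation, and cutoff is invented for illustration and would need to be replaced with contemporary applicant data.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Invented applicant pool: a latent attrition propensity, a noncognitive score
# negatively related to it, and an education tier loosely related to it.
propensity = rng.normal(0.0, 1.0, n)
noncog = -0.5 * propensity + rng.normal(0.0, np.sqrt(0.75), n)
tier1 = (propensity + rng.normal(0.0, 1.5, n)) < 1.0     # True = Tier 1 credential
attrite = propensity > 0.8                               # stand-in for attrition

educ_screen = tier1                                      # credential-based screen
cut = np.quantile(noncog, 1.0 - educ_screen.mean())      # match the selection ratio
noncog_screen = noncog >= cut

print("attrition, education screen:   ", round(attrite[educ_screen].mean(), 3))
print("attrition, noncognitive screen:", round(attrite[noncog_screen].mean(), 3))
```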

CONCLUDING REMARKS Given the size of the military, the services need selection and classification methods that can be implemented efficiently on a large scale and used to identify personnel who will fit well in the military, perform well in their jobs, and stay in their assigned occupations, at least long enough for the services to make the investment in soldier/sailor/airman/marine/officer education worthwhile. Over the last century, these needs have driven the services to pioneer testing methodologies such as computer-based testing, IRT, adaptive personality measurement, statistical methods, and a host of other techniques that are beyond the scope of this chapter. These accomplishments are the product of a vibrant military testing community that continues to conduct large-scale, cutting-edge, short-term, and longitudinal investigation of methods to enhance selection and classification.

REFERENCES Alderton, D. L., Wolfe, J. H., & Larson, G. E. (1997). The ECAT battery. Military Psychology, 9, 5–37. Allen, M. T., Bynum, B. B., Oliver, J. T., Russell, T. L., Young, M. C., & Babin, N. E. (2014). Predicting leadership performance and potential in the U.S. Army Officer Candidate School (OCS). Military Psychology, 26(4), 310–326.

32 PUBLIC SECTOR EMPLOYMENT
RICK JACOBS AND DONNA L. DENNING

Historians often cite the origin of civil service or public sector testing as far back as 2200 BCE, when a Chinese emperor used a process of systematic assessment to determine if his officials were fit for office (DuBois, 1970; Frank, 1963). In these early times, individuals were assessed with what might now be labeled job-relevant work samples; they included tests of specific skills such as horsemanship and archery. The Han Dynasty (202 BCE to 200 CE) is credited with moving testing from the actual actions required on the job to a surrogate, written format that included five areas of knowledge: civil law, military affairs, agriculture, revenue, and geography (Gregory, 1996). Candidates who met rigorous cutoff scores on local examinations were deemed appropriate to continue to testing at regional and higher levels of the process. Thus, in many respects, these ancient tests were prototypes of what has become known generically as civil service examinations or, more generally, public sector testing, and they can be seen as a way to guard against the potential negative consequences of patronage as well as to embrace the benefits of standardization and more accurate indicators of future performance. This brief historical description depicts the genesis of public sector testing and illustrates that it shares some similarities with current practices, but important differences exist. Most noteworthy, use of these early tests was deficient in terms of systematic evaluation of outcomes: demonstration of their predictive validity. Furthermore, the tests were conducted under extreme conditions that required candidates to spend long hours in confined spaces, conditions that would never be tolerated today, and they routinely had failure rates, often well in excess of 90%, that would rarely prove viable today.

Moving forward 2,000 years, from China (200 CE) to France (1791), England (1833), and finally the United States (1883), we see the more immediate historical roots of modern-day public sector testing (Graham & Lily, 1984). In these systems, tests were used to select individuals for government positions in a way that was intended to be free of patronage and fair to all candidates. These tests were each designed to identify the individuals most likely to succeed in a given position on the basis of the specific subject matter that made up the content of the tests, a precursor to what is now routinely labeled validity based on test content. Although much has been done over the years to improve the characteristics of these assessments, such as more carefully matching test materials to job requirements, further standardizing testing processes, and evaluating predictive efficiencies through validation studies, the basic ideas underlying civil service examining have a long and rich history, one that long predates the emergence of the discipline of industrial psychology.

This chapter provides details of the distinctive characteristics of testing in the public sector. It starts with the process of identifying the positions that are part of a competitive examination process and then moves on to discuss the development and administration of entry-level
examinations. The following section reviews validity, or linking tests to jobs. The next section addresses recruitment of candidates; optimal selection decisions require maximizing the number of individuals competing for the job. Part five of this chapter turns to testing for promotional opportunities. Next is a discussion of legal considerations surrounding testing in the public sector. The chapter concludes with a summary of how public sector testing has evolved over the past century and a view of where it may be headed in the 21st century.

POSITION CLASSIFICATION IN THE PUBLIC SECTOR

To fully appreciate the extent to which use of formal civil service examinations is entrenched in public sector employee selection, the role of position classification in the public sector must be considered. In this context, a “position” is the segment of work to be performed by one person. Classification of positions involves documentation and analysis of the work of each position, then grouping the positions with sufficiently similar work into a “class” of positions. More formally, a “class” may be defined as follows:

a group of positions . . . sufficiently similar in respect to the duties, responsibilities, and authority thereof that the same descriptive title may be used with clarity to designate each position allocated to the class, that the same requirements as to education, experience, capacity, knowledge, proficiency, ability, and other qualifications should be required of the incumbents, that the same tests of fitness may be used to choose qualified employees, and that the same schedule of compensation may be used. (Committee on Position-Classification and Pay Plans in the Public Service, 1941, p. 45, italics added)

Historically, this has been a judgmental exercise, but more recently it may include use of formal job analytic methods. Subsequent to the creation of a class, additional positions are “allocated” (assigned) to the class when an added need for comparable work is determined and documented and the additional work is deemed sufficiently similar to the work performed by incumbents in the class to warrant inclusion of the position into the existing class. Similarly, an existing class is abolished when the need for the work performed by those in the class no longer exists. A description of the work performed by incumbents in a class and the qualifications necessary for performing this work are then documented in a “class specification.” The “position classification plan” of the organization, then, “consists of (1) the system of classes and class specifications and (2) a code of formal fundamental rules for installation and maintenance of the classification plan” (Committee on Position-Classification and Pay Plans in the Public Service, 1941, p. 47). The classification of positions is a formal process that provides the underlying rationale for assessment. With the specification of position qualifications and requirements, organizations can seek to identify existing tests or construct new tests that match this information. What we have in this approach are the roots of a content-based validation strategy, which may then stand on its own or may be supplemented by additional validation information such as criterion-related evidence of validity.
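
For readers who find a data model helpful, the relationships described above—positions grouped into classes, a class specification per class, and a classification plan spanning the organization—can be sketched roughly as follows. The field names and the duty-overlap check are illustrative assumptions, not part of any agency’s actual classification system; in practice the similarity judgment is made by classification analysts.

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Position:
    """One segment of work to be performed by a single person."""
    position_id: str
    duties: Set[str]                      # documented tasks/duties for this position

@dataclass
class JobClass:
    """Positions similar enough to share a title, qualifications, tests of fitness,
    and a compensation schedule (documented in the class specification)."""
    title: str
    class_specification: str
    positions: List[Position] = field(default_factory=list)

    def duty_overlap(self, position: Position) -> float:
        # Jaccard overlap between the position's duties and duties already in the class
        class_duties = set().union(*(p.duties for p in self.positions)) if self.positions else set()
        if not class_duties:
            return 1.0
        return len(position.duties & class_duties) / len(position.duties | class_duties)

    def allocate(self, position: Position, threshold: float = 0.8) -> bool:
        """Allocate (assign) the position to the class only if its work is sufficiently similar."""
        if self.duty_overlap(position) >= threshold:
            self.positions.append(position)
            return True
        return False

# A classification plan is then, in essence, the collection of classes plus the rules
# for creating, maintaining, and abolishing them.
classification_plan: List[JobClass] = []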

CIVIL SERVICE EXAMINATIONS

The provision in the definition of “class” that all positions in it require comparable qualification led to regulatory provisions regarding the evaluation of qualification. The U.S. Code (Section 2301, Title 5), which governs the U.S. federal civil service system and serves as a model for many other government agencies, stipulates that “selection and advancement should be determined solely on the basis of relative ability, knowledge and skills after . . . competition” (i.e., competitive examination). The Charter of the City of Los Angeles is even more explicit on this point, stating that “Examinations shall . . . test the relative capacity of the persons examined to discharge the duties of the class” (The City of Los Angeles, 2009, p. 69).

A separate civil service examination (either a single test or often a series of tests, the scores on which are combined in a specific way to form a final examination score) is typically conducted for selection into each class. Results of the examination appear as a list of candidates who successfully completed all portions of the examination, ranked in descending order by their score. This list is variously referred to as an “eligible list” or “register of eligibles,” indicating that all persons on it are eligible for employment in the class on the basis of their having demonstrated in the civil service examination appropriate qualification to occupy a position in the class. Adoption of the eligible list/register of eligibles, usually by a civil service commission or the head of the department responsible for examining, marks the end point of the examination; but the selection process has not concluded, because no one has yet been hired or promoted. This final step is accomplished by the department with the vacancy requesting a “certification” of the list. Then, in accordance with specific, strict rules, a designated number of candidates’ names are provided to the department for their final hiring consideration (technically, they are provided to the “appointing authority,” who is typically the department head and is the only person who can legally fill a position in a civil service class). This certification rule has many variants and can range, for example, from a “rule of one” (the highest scorer only, who is then hired unless there is reason not to do so, in which case the second highest scorer is considered, and so forth), a “rule of three” (the three highest scorers), a “rule of 2N + 1” (2 times the number of vacancies, plus 1, of the highest scorers), to the very liberal “rule of the list” (all persons on the list may receive final consideration for selection). Evaluation of candidates for this final selection decision is to be based on additional job-related criteria not found in the testing process. These can be difficult to identify when a thorough examination has been given, one that includes not only specific job-based knowledge but also other components such as work-based behavioral measures, personality indicators, and/or biographical information such as work history.

In 1983, voters in Los Angeles approved a City Charter amendment for use of a rule of “Three Whole Scores” for certification selection. This rule accomplished two things: (1) the rounding of scores to whole numbers eliminated the minuscule, decimal-point differences in scores that had previously separated the ranks of candidates, and (2) the hiring department was able to consider an expanded pool of candidates for final selection. In all instances described, all candidates tied at a given score are treated the same (either certified for final hiring consideration or not), so rounding scores to whole numbers has a greater impact than might be expected in grouping candidates at a given score (rank) and thus expanding the pool from which a selection can be made.

Civil service examinations are seen as embodying “merit principles” in selection in that they are based on job-related criteria and provide a ranking of candidates in terms of their relative degree of qualification. Position classification invariably results in a class specification document that at the very least provides a starting point for construction of a job-relevant examination.
The description of the job in the class specification document is often supplemented by a more detailed job analysis. Job analysis procedures may vary but have the common objective of specifying the work performed (tasks) on the job. Additionally, and especially relevant for the purpose of developing selection testing, the job analysis also often provides identification of the knowledge, skills, abilities, and possibly other personal characteristics needed to perform these tasks. This information then allows for designation of the most appropriate types of tests for use and their content. Once again, these provisions require that examinations are based on job requirements or the ability to “discharge the duties of the class,” which logically results in a content-based test construction strategy. This, coupled with classification practices that are based on extreme similarity of the work required of all positions in a class, results in nearly universal reliance on content-based testing. And not incidentally, the narrowness in scope of the work performed by incumbents in a class thus precludes local empirically based test validation strategies due to limited sample size.
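
Because certification rules are, in effect, small algorithms applied to a ranked eligible list, they are easy to illustrate. The sketch below shows two of the variants described above; the candidate names and scores are hypothetical, and the whole-score banding is one interpretation of a “Three Whole Scores”-style rule rather than any jurisdiction’s exact procedure.

from typing import List, Tuple

Eligible = Tuple[str, float]        # (candidate name, final examination score)

def certify_2n_plus_1(eligibles: List[Eligible], vacancies: int) -> List[str]:
    """Certify the 2N + 1 highest scorers; candidates tied at the cutoff score are all included."""
    ranked = sorted(eligibles, key=lambda e: e[1], reverse=True)
    k = 2 * vacancies + 1
    if len(ranked) <= k:
        return [name for name, _ in ranked]
    cutoff = ranked[k - 1][1]
    return [name for name, score in ranked if score >= cutoff]

def certify_whole_scores(eligibles: List[Eligible], n_bands: int = 3) -> List[str]:
    """Round scores to whole numbers and certify everyone tied within the top n_bands whole scores."""
    banded = [(name, round(score)) for name, score in eligibles]
    top = sorted({s for _, s in banded}, reverse=True)[:n_bands]
    return [name for name, s in banded if s in top]

eligible_list = [("A", 97.42), ("B", 97.38), ("C", 96.90), ("D", 95.10),
                 ("E", 94.70), ("F", 92.30), ("G", 89.60), ("H", 88.20)]
print(certify_2n_plus_1(eligible_list, vacancies=1))   # ['A', 'B', 'C']
print(certify_whole_scores(eligible_list))             # everyone in the 97, 95, and 92 whole-score bands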

Testing for a Multiplicity of Jobs

The statutory requirement that an objective assessment program be in place for each job (class) and ensuing mandates that examinations be tailored to the unique demands of each are major
challenges facing public sector organizations; usually these organizations have a very large number of classes relative to the number of employees. As an example, in one county in the state of Ohio, the public library service employs just over 1,000 individuals, and these 1,000 employees are classified into more than 110 civil service classes. The turnover rate in this organization is approximately 6% annually, indicating that in any given year there may be about 60 job openings, and these 60 openings may span nearly as many classes, each requiring a different civil service examination (State of Ohio, personal communication, February 2008). Similarly, in a medium-size city in Pennsylvania, the Civil Service Commission must monitor staffing for more than 400 classes. Although some testing for larger classes is predictable and regular, many jobs may have a single vacancy only occasionally (and unpredictably), and the organization must be ready to examine candidates for every one of them at any point in time. In the City of Los Angeles, the Personnel Department is responsible for testing nearly 1,000 classes. The sheer number of classes and the requisites of the civil service system make it extremely challenging to tailor a selection program to each job and still complete the development of all examinations in a timely manner.

Another factor contributing to the volume of civil service examinations that must be developed is the reluctance within many agencies to reuse tests. Test security reigns supreme, given the high stakes of these examinations and the need to preserve the integrity of the process, to the extent that considerable caution is exercised even with respect to repeat exposure of test material. And this caution is well founded; incidents of candidates colluding to reproduce a test (by each memorizing a specific set of items) have been repeatedly encountered.

Defining Test Content

One approach that helps meet the demand for a separate examination for each job, given the multiplicity of jobs within a given organization, is a systematic analysis across jobs with a focus on the commonality among them. This helps organizations bring order to jobs in terms of their similarities and, potentially, to assessment tools and processes. Such commonalities are identified by analyzing individual jobs and then comparing them for patterns of similar tasks, duties, responsibilities, and/or, most importantly, knowledge, skills, abilities, and other characteristics (KSAOs). Public sector organizations that must select for many classes can help reduce the burden of creating a complete examination unique to each class by constructing assessment procedures and processes for use across multiple classes on the basis of their similarities. This not only makes the development of selection systems more efficient, but such a process can also result in the compilation of normative information for a much larger sample of individuals, which, in turn, can improve the understanding of the tests used and the applicants being considered for employment. Implementation of such a process requires use of job analysis procedures that are consistent across jobs and that yield results that allow for comparison of jobs. Once this is accomplished, tests that meet the needs of multiple jobs may be created and any modifications for individual jobs can be made. This approach can also simultaneously facilitate consideration of a candidate for multiple jobs through administration of a comprehensive battery of tests. From this perspective, either the candidate, through application to multiple positions with similar requirements, or the organization, via evaluation of candidates for multiple positions simultaneously, can benefit from knowing the relationship among jobs. As stated earlier, for many public sector organizations, the number of jobs considered distinct (i.e., classes) is daunting, and the use of common testing across jobs can make the ultimate goal of timely administration of a formal examination for each class with a vacancy (or of always having an eligible list available for each class) far more attainable.

Although many public sector organizations continue to use traditional job description and job analysis procedures to define jobs, the past decade has seen a rise in the use of competency modeling as an underlying process for identifying job requirements, parallel to its use in the private sector. A competency model may be constructed for higher-level jobs, especially those that are considered leadership positions (Hollenbeck, McCall & Silzer, 2006), or for all jobs in the
organization. In both cases, the competencies identified form the basis of the examination plan (Rodriguez, Patel, Bright, Gregory, & Gowing, 2002).
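
One way to operationalize the search for commonality is to profile each class on a common set of KSAO importance ratings gathered with a consistent job analysis instrument and then compare the profiles. The minimal sketch below illustrates the idea; the class names echo examples used later in this chapter, and the ratings, the KSAO set, and any grouping threshold an analyst might apply are hypothetical.

import math

# Hypothetical KSAO importance ratings (1-5) on a common instrument:
# [attention to detail, physical ability, first aid/rescue, clerical procedures, negotiation]
ksao_profiles = {
    "Library Clerical Worker": [5, 1, 1, 5, 2],
    "Meter Reader":            [5, 3, 1, 4, 1],
    "Lifeguard":               [2, 5, 5, 1, 1],
    "Purchasing Agent":        [4, 1, 1, 3, 5],
}

def cosine_similarity(a, b):
    """Cosine similarity between two KSAO profiles (1.0 = identical pattern of importance)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Pairwise comparison: high values flag classes that might share common test components
classes = list(ksao_profiles)
for i, c1 in enumerate(classes):
    for c2 in classes[i + 1:]:
        print(f"{c1} vs. {c2}: {cosine_similarity(ksao_profiles[c1], ksao_profiles[c2]):.2f}")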

LINKING TESTS TO JOBS: TOOLS AND PROCESSES FOR IDENTIFYING STRONG CANDIDATES

Any examination used for employee selection, whether it takes advantage of similarities identified across jobs or not, must have a logical framework demonstrating how the tests are linked to the job or, more generally, an evaluation of test validity. Note that regardless of the validity evidence used to support the tests included in the selection process, the examiner (or examination analyst, as they are often called) must engage in developing an examination plan. Examination plans link the information about the job to the types of assessments that are included in the selection process. Examination plans provide a logical underpinning not only to the use of a given type of test and its content but also to the weight that each test receives in the final examination score. As an example, in police officer selection, there has been a movement to establish a more broad-based assessment consisting not only of cognitive ability but also of the personality characteristics and experiential information that lead to effective policing. An examination plan for the class of police officer would likely include multiple types of tests with specific assessment dimensions for each and instructions as to how these tests are to be considered (pass/fail or weighted) in the final composite score on which candidates are ranked. Following this logic, it is not hard to see that very different jobs (e.g., library clerical worker, meter reader, lifeguard, and purchasing agent) would have examination plans that differ from that for police officer and from one another, given the nature of each job and the KSAOs necessary to perform in the position.

Public sector employment covers a very wide range of jobs and thus requires use of a correspondingly wide range of assessment tools. Although the final examination may differ for various positions, a similar process is used for examination development across the vast array of public sector jobs. First, an analysis of the job is undertaken (details of job analysis methods are in Chapter 6, this volume). Then, based on the results of the job analysis, an examination plan is developed that identifies the optimal (and feasible) type(s) of test(s) necessary to assess the knowledge, skills, and/or abilities and aptitudes critical to performance of the job. For library clerical workers and meter readers, tests might focus on attention to detail, whereas for lifeguards the certification of successful completion of a first aid course might be supplemented with a physical abilities test that includes water rescue.
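
Stripped to its essentials, an examination plan is a scoring rule: which components are pass/fail hurdles, which are weighted, and how the weighted components combine into the final score on which candidates are ranked. The sketch below shows one such rule for a hypothetical police officer examination; the components, weights, and cutoffs are illustrative assumptions, not an actual agency’s plan.

from typing import Dict, Optional

# Hypothetical examination plan: pass/fail hurdles plus a weighted final composite
HURDLES: Dict[str, float] = {"physical_ability": 70, "background_check": 1}   # minimums to stay in the process
WEIGHTS: Dict[str, float] = {"cognitive": 0.50, "structured_interview": 0.30, "biodata": 0.20}

def final_examination_score(scores: Dict[str, float]) -> Optional[float]:
    """Return the weighted composite (0-100), or None if the candidate fails any hurdle."""
    for component, cutoff in HURDLES.items():
        if scores.get(component, 0) < cutoff:
            return None                                    # candidate drops out at this hurdle
    return sum(weight * scores[component] for component, weight in WEIGHTS.items())

candidate = {"physical_ability": 85, "background_check": 1,
             "cognitive": 88.0, "structured_interview": 76.0, "biodata": 81.0}
print(final_examination_score(candidate))                  # 0.5*88 + 0.3*76 + 0.2*81 = 83.0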

Minimum Qualifications

Public sector organizations rely heavily on minimum qualifications as an initial step in the employment process. Minimum qualifications (alternatively referred to as “requirements”) are threshold requirements, usually in the form of education, training, experience, or certification attained, that potential applicants must meet to participate in the competitive examination. In reality, they are the first “test” in the examination, because they consist of carefully established job-related criteria that each applicant must meet precisely to be allowed to proceed further in the examination process. These criteria are clearly communicated to applicants (so those lacking them can self-select out at the earliest possible time), consistently and rigidly applied, and often subject to verification. Their use is completely consistent with the content-based approach to testing that is so prevalent in the public sector, in that the criteria individuals must meet to participate in the examination for a given class are those that indicate a reasonable likelihood that they will have acquired the knowledge, skills, and abilities that will be subjected to more refined assessment through the remainder of the examination. Examples of minimum qualifications run the range of characteristics, such as age for air traffic controllers (both minimum
and maximum ages are specified by the Federal Aviation Administration); education at a specified level; a particular type of license, such as a commercial driver’s license for a bus driver; and, for police officers in some jurisdictions, the absence of a felony conviction.
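
Because minimum qualifications are applied as strict pass/fail criteria before any scored assessment, they amount to a mechanical screen. The sketch below is a minimal illustration patterned on the examples above; the specific criteria, field names, and values are hypothetical.

def meets_minimum_qualifications(applicant: dict, mqs: dict) -> bool:
    """Every criterion must be met exactly for the applicant to enter the competitive examination."""
    checks = [
        mqs["min_age"] <= applicant["age"] <= mqs["max_age"],
        applicant["education_years"] >= mqs["min_education_years"],
        mqs["required_license"] in applicant["licenses"],
        not (mqs["felony_disqualifies"] and applicant["felony_conviction"]),
    ]
    return all(checks)

mqs = {"min_age": 21, "max_age": 35, "min_education_years": 12,
       "required_license": "commercial driver", "felony_disqualifies": True}
applicant = {"age": 27, "education_years": 14,
             "licenses": {"commercial driver"}, "felony_conviction": False}
print(meets_minimum_qualifications(applicant, mqs))    # True -> proceeds to the scored examination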

Identifying Potential Selection Tools

Once this (preliminarily) qualified pool of applicants is established, public sector testing personnel identify or construct selection instruments that can be used for more refined assessment in the remainder of the examination. They may search the Internet, the professional literature, test publisher catalogues, and/or professional volumes that review tests, such as Tests in Print and the Mental Measurements Yearbook (Murphy, Plake, & Spies, 2006; Spies, Plake, & Geisinger, 2007). At times, this search results in instruments that are appropriate and sufficient in their coverage to constitute the entire examination used for selecting the most qualified applicants. Even when this is the case, however, test security concerns may make a readily available test inappropriate, because some candidates may gain access to it whereas others cannot. In many instances, certain features of the job or the need to address specific issues of job content require the creation of new tests; in fact, this is often a primary responsibility of the public sector testing professional. Clearly, the development of new assessment tools requires a great deal of time, effort, and skill, and when that is multiplied by the number of jobs in the organization, the workload can become overwhelming. Many public sector organizations therefore pursue another option in some cases by outsourcing to individuals or consulting firms specializing in instrument development. This is especially true for high-stakes positions in which many jobs are being filled and the likelihood of follow-up objections and legal action on the part of candidates is high.

Role of the Interview

As in the private sector, interviews are an extremely common type of test used in the public sector. As with other types of tests, public sector organizations most often use interviews that are carefully tailored to the job. For many jobs, formal written or actual work sample tests may not be a viable alternative, at times simply because there are very few candidates and the cost of test development does not warrant the effort. In these instances, an interview may be the only test in the examination (beyond the minimum qualifications). For other jobs for which there are many tests in the examination, those responsible for examining have the added obligation of creating and implementing an interview procedure that is well integrated with other tests in the process. Interview materials are typically developed directly from information contained in the job analysis. A viable set of questions for the interview and scoring criteria must be established. In addition, the most effective interview programs include careful standardization of interview administration, a written guide to conducting the interview, and a training session for interviewers. In the public sector, an interview panel is virtually always used as opposed to a single interviewer (or even sequential interviews). Although an interview panel introduces more costs in terms of time and personnel, it has the distinct advantage of enhancing the reliability of the process, and, most importantly, it provides a way of documenting that reliability along with a greater appearance of fairness. The American Public Transportation Association (APTA) has developed a Bus Operator Selection System (BOSS) that includes a 75-item survey of attitudes, beliefs, and experiences, followed by a multifaceted interview designed to be conducted by a panel of three interviewers, each representing a different perspective on the job: operations, training, and human resources (HR). This system is being used in about 30 transit organizations and has been administered to more than 160,000 candidates nationwide. The original work documenting the system is described in Jacobs, Conte, Day, Silva, and Harris (1996) and highlights the utility of multiple performance predictors for selecting bus operators.
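
The reliability advantage of a panel can be documented from the panel’s own ratings. One common approach is to compute the mean inter-rater correlation and step it up to the reliability of the averaged panel score with the Spearman-Brown formula; the three-rater ratings below are hypothetical and simply echo the three perspectives represented on a BOSS-style panel.

from itertools import combinations
from statistics import correlation          # Python 3.10+

# Hypothetical structured-interview ratings (one value per candidate) from a three-person panel
ratings = {
    "operations": [4.0, 3.5, 2.0, 4.5, 3.0, 2.5],
    "training":   [4.5, 3.0, 2.5, 4.0, 3.5, 2.0],
    "hr":         [4.0, 3.0, 2.0, 5.0, 3.0, 3.0],
}

pairs = list(combinations(ratings, 2))
mean_r = sum(correlation(ratings[a], ratings[b]) for a, b in pairs) / len(pairs)

k = len(ratings)                                           # number of panel members
panel_reliability = k * mean_r / (1 + (k - 1) * mean_r)    # Spearman-Brown stepped-up reliability
print(f"mean inter-rater r = {mean_r:.2f}; reliability of the panel average = {panel_reliability:.2f}")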

Alternative Measures

All employers should identify appropriate predictors for use in employee selection and, in addition, are required to search for alternative predictors for any original predictor that demonstrates a marked difference in pass rates on the basis of designated candidate demographic/cultural group membership. For example, cognitive ability tests are effective predictors of subsequent job performance, but the use of cognitive ability tests alone will also usually result in large racial/ethnic group differences in pass rates, with majority group members passing at a higher rate than members of most minority groups. In this case, employers are required to seek out other predictors that can be used in conjunction with or in place of the test(s) that result in large group difference(s) in pass rates. The search for alternatives may take the form of identifying additional dimensions upon which to assess candidates. This approach is reflected in the previously mentioned police officer selection example, in which, for many decades, police candidates were given a cognitive ability test for the initial identification of qualified candidates and then further vetted via interviews, physical ability tests, background investigations, and medical/psychological evaluations. More recently, systems for initial screening commonly include a cognitive test along with additional areas of measurement, such as personality or biographical data. These assessment systems can also include other testing formats such as video-based tests or job simulations. Both of these approaches expand test content and format and should be considered when meeting the mandate to consider alternative tests. Recently, a group of forward-thinking psychologists at Shaker Consulting Group has pioneered an expanded set of predictors they refer to as “The Virtual Job Tryout.” These systems have been built for specific clients to capture the complexities of jobs using a variety of types of selection tools (see http://www.shakercg.com). By combining more traditional tests and surveys with video-based scenarios requiring candidates to identify a preferred response, a more complete picture of each candidate’s overall job fitness emerges.
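
The “marked difference in pass rates” that triggers the search for alternatives is commonly screened with the four-fifths (80%) rule described in the Uniform Guidelines: a group whose selection rate is less than 80% of the rate for the highest-passing group is generally regarded as showing evidence of adverse impact. The computation is simple; the group labels and counts below are hypothetical.

# Hypothetical applicant and pass counts by group
counts = {"group_A": {"applied": 400, "passed": 240},
          "group_B": {"applied": 200, "passed": 80}}

pass_rates = {g: c["passed"] / c["applied"] for g, c in counts.items()}
highest_rate = max(pass_rates.values())

for group, rate in pass_rates.items():
    impact_ratio = rate / highest_rate
    status = "below the four-fifths threshold" if impact_ratio < 0.80 else "within the four-fifths threshold"
    print(f"{group}: pass rate {rate:.2f}, impact ratio {impact_ratio:.2f} ({status})")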

Risks and Legal Challenges

Ultimately, any testing system must have a formal evaluation regarding its ability to accurately select individuals. Public sector testing programs are often the first to be challenged because they require use of formalized testing and they affect large numbers of applicants to jobs that are highly visible and pervasive in our society. The Equal Employment Opportunity Commission (EEOC), the U.S. Justice Department, and state and local fair employment agencies often scrutinize these highly visible testing programs, and the Uniform Guidelines (Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, Department of Justice, 1978) help define the process. Any test used for selecting a few successful candidates from a large number of applicants is a likely target for challenge, and those responsible for the testing process must have some way of demonstrating its links to the job, its ability to select the right people, and, possibly, why this test was used whereas others (alternatives) were not. This sets a high standard, and one that is required by law. It also demands a strong logical basis for decisions made in developing the testing process and, ultimately, leads to higher-quality tests and more capable individuals being selected. Even with all of these objectives met, those who believe the results were unfair retain the right to, and can, challenge a civil service examination.

CREATING A TALENT PIPELINE: RECRUITING CANDIDATES

No selection program can be successful unless the number of candidates exceeds the number of positions available. This has been an operating principle of employee selection for years and was first codified in the work of Taylor and Russell (1939). The challenge that faces public sector employers is twofold: (a) to create efficient recruitment and selection systems for all jobs to maximize the number of qualified applicants and (b) to expend extra effort to find candidates
for jobs where demand has outstripped supply. With respect to the first challenge, although a larger number of candidates is generally seen as a positive in terms of selection utility, assessing very large numbers of candidates can result in much higher selection costs, thereby driving down the overall utility of the program. A case in point is a large police force in New York State. A test is administered once every four years, and approximately 50 new officers are hired each year (about 200 hired over the life of the list). Because this is a particularly attractive job in terms of prestige, location, and salary, the number of candidates is very high. In 2015, there were over 17,000 test takers (in previous years the number has exceeded 30,000). Clearly, a selection ratio of 1 in 85 is well beyond what is needed for effective selection and simply increases the cost of testing. In this case, the testing process required use of more than 150 schools and 3,000 test monitors and administrators, and its total cost was estimated to exceed $2 million (EB Jacobs, LLC, personal communication, June 2015).

In contrast, finding sufficient applicants is the major issue for several public sector jobs. For some jobs, there may be as many openings as candidates and, in some cases, fewer qualified individuals than positions available. When this occurs, no selection is taking place, and the focus must turn to recruiting. The job analysis identifies the KSAOs of the people the organization should attempt to attract, making it possible to target the recruiting effort. Prior hiring information can also help identify schools, vocational programs, and other training grounds where those who have been successful in the job were found. Establishing partnerships with educational institutions may also be considered; at times, completion of such a program has even become an MQ (minimum qualification) for the examination.

Not all recruiting efforts return positive results. Recruiting without information on the KSAOs required to be successful is likely to be a waste of time and effort. In some cases, “random recruiting” may actually detract from the goal of more efficient testing. When recruits who are poorly prepared or not truly interested in the job are solicited and become candidates, the potential for changes in the passing rates of the various demographic applicant groups increases. The result can be a higher rather than a lower level of adverse impact, no doubt the opposite of the desired outcome. Indeed, effective recruitment is vital to creating an effective selection program, but the process must be guided by knowledge of what is required by the job, what have historically been successful avenues for finding the needed talent, and what new approaches (e.g., pre-training programs and educational partnerships) may prove viable in the future.
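
The arithmetic behind the selection-ratio point can be made concrete, and the Taylor and Russell (1939) logic can be approximated directly rather than read from their tables. The sketch below computes the selection ratio from the figures cited above and then, purely for illustration, estimates the proportion of successful hires under a bivariate-normal model; the assumed validity of .30 and base rate of .50 are hypothetical values, not estimates for any actual department.

from math import sqrt
from scipy.integrate import quad
from scipy.stats import norm

hires_over_list_life, test_takers = 200, 17_000
selection_ratio = hires_over_list_life / test_takers        # roughly 1 in 85
print(f"selection ratio = {selection_ratio:.4f}")

# Illustrative Taylor-Russell-style estimate under a bivariate-normal model
validity, base_rate = 0.30, 0.50                             # assumed for illustration only
x_cut = norm.ppf(1 - selection_ratio)                        # predictor cutoff implied by the selection ratio
y_cut = norm.ppf(1 - base_rate)                              # performance cutoff implied by the base rate

def joint_tail(a, b, rho):
    """P(X > a and Y > b) for a standard bivariate normal with correlation rho."""
    integrand = lambda x: norm.pdf(x) * (1 - norm.cdf((b - rho * x) / sqrt(1 - rho ** 2)))
    return quad(integrand, a, 40)[0]

success_rate = joint_tail(x_cut, y_cut, validity) / selection_ratio
print(f"expected proportion of successful hires = {success_rate:.2f}")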

PROMOTIONAL PROCESSES: USING WHAT WE KNOW ABOUT PEOPLE AND THEIR CAPABILITIES TO OUR ADVANTAGE A word about some differences between public sector entry-level testing and public sector promotional testing is important to fully understand the various approaches to selection that are required. For many entry-level positions, a formal training program exists, and those selected for the position will be placed in that program once they formally accept the job. In many public sector promotional systems, the individual who was at a lower-level job on Monday may find herself at a higher-level job on Tuesday with little or no training prior to moving up the organizational ladder. This distinction has important implications for testing. Because training will occur for the first example and not for the second, it means that testing for the lower-level position should not include the knowledge, skills, and expertise that will be learned prior to moving into the position. In the promotion situation, the requisite information, skills, and expertise are needed on day one of the higher-level position incumbency, so it is legitimate to test for all. In practice, what this often means is that entry-level tests focus on underlying abilities requisite for learning the job, whereas promotional tests are more closely linked to actual job requirements. There are also distinctions in terms of the validation processes most often encountered when it comes to entry-level versus promotional testing. A content strategy for validation is likely to be used for either entry or promotional testing. As stated earlier, this method of validation establishes logical links between test requirements and job requirements, often supported with judgment data from subject matter experts (SMEs). In many programs of validation for entrylevel testing, this strategy is supplemented with a criterion-related validity study or by citing

evidence of generalized validity. However, it is rare that a criterion-related study is part of the validation process for promotional exams. This happens for various reasons, including relatively small sample sizes for many promotable positions, difficulties in motivating current incumbents to sit for an “experimental testing session,” and issues of test security that arise as a function of administering a test to a group of incumbents and then using it again for candidates.

The number and variety of promotional examinations are also daunting, and the task of creating and implementing job-related examinations is equally difficult. In the vast majority of instances, promotional examinations include minimum qualifications that specify the lower-level class or classes from which promotion to a given class must be made, as well as the required number of years of service in the lower-level class(es). The process of developing a promotional examination is similar to that for entry-level examinations, with one very important difference: Most agencies emphasize promotion from within (again, often legally mandated). As such, the amount of information known about the candidates greatly exceeds what is known about entry-level applicants. This could be a tremendous advantage in the identification of talent if properly tapped.

Developing Promotional Tests

Promotion examinations are developed based on the concept that those in lower-level jobs acquire the KSAOs required for the next job in the hierarchy. Similar to entry examinations, potential applicants for promotional examinations are prepared for testing by informing them about (a) the duties and responsibilities of the job, (b) the required knowledge base, and (c) the underlying skills, abilities, and other characteristics required by the job. This information is conveyed to candidates via a test announcement, or “bulletin,” which outlines the types of tests, and often their content and scoring, as well as the hurdles (decision points for progression in the examination) that candidates will encounter. These promotional processes can range from a single knowledge-based test to very elaborate, multistage assessments involving simulations and assessment center exercises that unfold over a long period of time. For some positions, this may require very little preparation, but for others (e.g., police sergeant or fire captain), agencies often announce the examination six months or more in advance to give candidates adequate time to prepare for the various tests that make up the promotion process.

Appraising Past Performance

One frequently missing element in promotional processes is the assessment of past performance. Although this has the potential to be the most important single indicator of future performance, its rare use in promotional processes stems from a lack of confidence that performance ratings have been or will be consistent and accurate. Concerns about bias in supervisors’ ratings abound. More generally, it is typically believed that performance ratings lack the psychometric rigor required for any formal testing process. Indeed, one clear opportunity for improving promotional processes is the more effective use of past performance in determining who will move up in the organization. To this end, several assessment techniques, some of which have been used in private sector selection and, especially, in employee development programs, have been devised for the measurement of past performance. Behavioral accomplishment records (Hough, 1984), ratings of “promotability,” career review boards, and behavior-based interviews have all been seen as useful additions to the overall promotional processes used in public sector testing. It remains the task of testing professionals to further enhance promotional processes by continuing to improve these techniques that capture prior job-relevant performance. An often expressed sentiment of promotion candidates is: “The examination should not evaluate what you do on that one test day. It should assess what you do the other 364 days.” Tools like accomplishment records and career reviews attempt to incorporate this perspective.


PERSONNEL DECISION MAKING AND LEGAL JEOPARDY

As noted above, the promotion process (as well as the selection of new employees) in public sector organizations can often lead to legal disputes and challenges by individuals, groups, and government entities, such as the U.S. Department of Justice. Most of the time what is at issue is disparate impact, in which the results of the selection or promotion system appear to disadvantage one or more demographic/cultural groups. When this occurs, as for private employers, the public sector agency is required to demonstrate the validity of the process. This demonstration can take many forms, and it is not unusual to provide multiple sources of validity evidence: content-based evidence (the most common form), extensive documentation of criterion-related validity based on research conducted internally or by external consultants for the organization, and/or generalized evidence of test validity. With respect to the last of these, “the jury is still out” regarding the degree to which validity generalization will be seen as acceptable by courts and regulatory bodies.

Unique Competitive Processes

The stakes in public sector promotional testing can be very high. As stated above, in many public sector jobs, the only way to advance is by having served in one or more specific job(s) for a minimum number of years, sometimes additionally having successfully completed specialized education/training or other formal certification, and by successfully competing in the promotional examination process. In these examinations, some candidates succeed, but a larger number do not. This direct competition among peers can have negative consequences for the individuals involved and for the organization. The entire process may be challenged by individuals who did not do well enough to be promoted and have thereby concluded that the process was flawed and unfair. When this happens, colleagues find themselves on opposite sides of a legal battle, in which the candidates who were successful during the testing process hope the test results will be upheld, and those who did not do sufficiently well on the examination to be promoted work to discredit the process. Unfortunately, most candidates have invested a great deal of preparation time, and many feel frustrated by the delay in implementing the results. These types of challenges may stretch out for years, creating problems for all participating entities: the candidates, the HR professionals, and the management of the agency wishing to promote its employees.

Another factor that affects the tendency for legal challenges of selection processes within the public sector, in contrast to much of the private sector, is that these processes are by design open and visible; unquestionably, the examination is “responsible for” selection outcomes. This provides disappointed candidates an obvious target for pursuit of litigation. Furthermore, because civil service systems still very frequently have mandated candidate appeal or “protest” rights, filing a lawsuit may seem nothing more than an obvious extension of a right candidates are already afforded. In point of fact, formal and stringent requirements govern what constitutes an appeal or protest and how it is adjudicated, but these rights at times are misinterpreted simply as a right to register complaints. Once administrative remedies have been exhausted and the outcome remains negative to the candidate’s interest, it may seem a natural next step to pursue litigation.

Negative Consequences for Individuals and Organizations

The legal challenges that at times confront public sector testing may create a crisis of confidence in testing. Individuals may begin to question the ability of the people responsible for testing and speculate that the system has come under the control of the legal system without regard for merit. As these cases drag on, temporary appointments may be made, which further complicate the situation. When order is finally restored, another problem can occur with respect to what to

do with those who were placed in the higher-level job as a provisional appointment. When a new testing program is instituted and someone who has been in the job for many months or even years does not achieve a successful score, the immediate question that arises is “How could the test be relevant to the job (valid) if someone who has managed to do the job successfully for the past few months/years cannot pass it?” In this context, no consideration is given to actual job performance, and those who have been successful for months or years can be taken out of that job based on the results of a day or less of testing. There is no simple or single answer to this dilemma.

Balancing Validity and Diversity

Public sector agencies are in a constant struggle to simultaneously increase the validity of their selection and promotion processes and to improve the diversity of the group that is selected. This complex task may involve actions that result in a focus on one of these objectives at the expense of the other (Aguinis & Smith, 2007; De Corte, Lievens, & Sackett, 2007; Lindsey, King, McCausland, Jones, & Dunleavy, 2013; Ployhart & Holtz, 2008; Pyburn, Ployhart, & Kravitz, 2008; Sackett & Lievens, 2008). Many of the tests used to predict future performance show large differences among various groups. With respect to a variety of measures of cognitive ability and knowledge-based multiple-choice tests, both of which are popular with public sector agencies because of their clear right-and-wrong response format, Caucasian candidates consistently outperform Black and Hispanic candidates. When the selection procedure switches from cognitive ability to physical ability, women typically score lower than men. Agencies take steps to minimize these differences, and although some approaches may be helpful (e.g., practice testing), none eliminate the group differences completely, and some practitioners argue that those who already have what it takes just get stronger.

Recently, many public sector agencies, although still acknowledging the need for some component of the process to test for cognitive ability, have created examinations that include noncognitive measures such as personality tests or biographical information. These types of tests are sometimes met with protest from applicants, unions, and other interested parties on several grounds, but, at least when it comes to the selection of new candidates (versus promotional testing), such testing has been implemented. In some instances, the inclusion of different types of instruments has reduced the group differences observed when testing only for cognitive ability and has also enhanced overall validity, but it has not eliminated adverse impact (Sackett & Lievens, 2008). In some instances, group differences have been reduced by changing the weighting of the various test components (De Corte et al., 2007; De Corte, Lievens, & Sackett, 2011). One of the reasons for this failure to eliminate adverse impact is a low but consistent correlation between cognitive-ability-oriented predictors and less traditional selection tools such as personality indicators. Although many personality scales show no difference between minority and majority group members, our work with police officers and firefighters has shown that some scales do show differences in the context of public safety selection. The difference is also in favor of majority test takers, in a way that is believed to be linked to the positive correlation between these personality measures and cognitive ability and in a manner that inhibits their ability to reduce adverse impact (Cascio, Jacobs, & Silva, 2010).

This problem is made even more difficult by the fact that, for many public sector jobs, the selection ratio is quite favorable for the organization (i.e., many candidates and few individuals selected). As the selection rate gets smaller and smaller (more candidates relative to the number of positions to be filled), the impact of any group difference grows quickly. Even small group differences can cause large levels of adverse impact when selection ratios drop below .20.
This further complicates the situation for the agency, because one goal is to make the jobs widely available, but doing so can have a negative consequence for diversity. It is important to recognize that selection rates often go down precisely because the number of applicants is quite high, and with large numbers of applicants, statistical tests are more likely to flag a given group difference as adverse impact. This relationship among the size of the applicant pool, the selection ratio, and adverse impact has been highlighted by Jacobs, Murphy, and colleagues (Jacobs, Deckert, & Silva, 2011; Jacobs, Murphy, & Silva, 2012; Murphy & Jacobs, 2012).
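The interplay between group mean differences and the selection ratio can be illustrated with a simple normal-distribution model. The sketch below (Python, standard library only) assumes that scores in the two groups are normally distributed with equal variances and differ only by a standardized mean difference d; neither the particular d values nor the model itself comes from this chapter, so the numbers are illustrative rather than estimates for any specific test.

```python
# Illustrative model: how a fixed standardized group difference (d) translates into
# adverse impact (the four-fifths ratio) at different selection ratios.
# Assumes normally distributed scores with equal variances in both groups.
from statistics import NormalDist

std_normal = NormalDist()

def impact_ratio(d: float, majority_selection_rate: float) -> float:
    """Return the minority/majority selection-rate ratio for a top-down cutoff."""
    cutoff = std_normal.inv_cdf(1 - majority_selection_rate)   # cutoff in majority-group z units
    minority_selection_rate = 1 - std_normal.cdf(cutoff + d)   # minority mean sits d SDs lower
    return minority_selection_rate / majority_selection_rate

for d in (0.3, 0.5, 1.0):                   # small, moderate, and large group differences
    for sr in (0.50, 0.20, 0.10, 0.05):     # increasingly stringent selection ratios
        ratio = impact_ratio(d, sr)
        flag = "below 4/5 rule" if ratio < 0.8 else "ok"
        print(f"d = {d:.1f}, selection ratio = {sr:.2f}: impact ratio = {ratio:.2f} ({flag})")
```

Even with d = 0.3, a difference far smaller than those typically reported for cognitive ability measures, the impact ratio in this model falls from roughly .76 at a 50% selection rate to about .63 at 20% and .57 at 10%, which is the pattern described above.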


Defensibility of Process

Ultimately, a public sector agency must make its employee selection systems (entry and promotional) defensible. To do this, steps must be taken, and these steps must not only conform to the laws and guidelines governing selection, but they must also be meticulously documented (Guion, 1998). In entry and promotional selections, there are winners and losers. The winners get what they desired, namely the job, and those who are less fortunate walk away either without a job (entry) or in the same job they held prior to the promotional process. At times, those in the latter category seem to decide that challenging the test is a means of obtaining a second chance. In some situations, the tests may actually be poorly prepared, lacking in job relevance, undocumented with respect to how they were created and/or linked back to the job, or simply administered without regard to accepted testing practices. However, in other cases, the allegations about the test may be a disingenuous attempt to vacate the results and provide all test takers with another opportunity for success. When a test is challenged, it does not automatically mean the test was deficient or that the process violated the laws and guidelines that prevail. Challenging the test is a right of any candidate who can establish an underlying legal basis, most often in the form of adverse impact. Once adverse impact is established, it becomes the responsibility of those using the test to establish the validity of the process. Given the above, it is important to consider what must be present to make a hiring or promotional system defensible. Below we provide further details on the following critical factors for defending a testing process:

• Job analysis
• Links between test elements and aspects of the job
• Logic behind the combination of multiple test scores into a final composite
• Test administration details
• Scoring processes
• Documentation of the entire testing program from start to finish

There is general agreement that a fair test is only possible with confirmation that those responsible for the test understand the job. In the context of public sector testing, this means that the job in question has been defined and documented, which occurs at the inception of a job class through the creation of a class specification, and then is often supplemented with a more detailed job analysis and/or during examination development with the assistance of job experts. The results are widely accepted by incumbents, supervisors, HR specialists, and potential applicants as reflecting the important features of the job. To be useful as underlying test development documents, the class specification and job analysis must reflect not only the tasks and responsibilities of the job but also the knowledge base required by the job and those skills, abilities, and other personal characteristics that facilitate job performance.

A second important element of providing the necessary information for the defense of a test is evidence that links the test items, work samples, and other components of the examination to the tasks performed on the job. It is helpful to think of this in one of two ways. The most direct way can be seen in work-sample testing, where the test is actually a sample of the tasks performed on the job. Physical ability testing provides the best example of this direct linking. Firefighter candidates are often asked to perform a series of job-related activities such as advancing a fire hose, dragging a dummy, and climbing stairs with equipment. Each event included in the test replicates what is done on the job. One challenge here is to make sure that key skills learned during training are not required in the performance of test events. Recently, a test was developed for store clerks for a retail drug store chain. Clerks repeatedly engage in unloading boxes and in stocking shelves. These job activities require little in the way of specialized training, and a test was developed to replicate the unloading and stocking tasks. Here there is a direct link between the job and the “items” on the test. More often this logic requires two linkages: the first relates the tasks and requirements of the job to a set of knowledges, skills, and abilities, and a second linkage is then required to show the relationship of the test items/elements to these same KSAs.

This linking process often takes the form of surveys that identify the knowledge or other attributes underlying the test questions and ask job experts to indicate the degree to which each aids in the completion of various tasks. This process is commonly accepted as a means of establishing validity on the basis of test content. At the root of any successful demonstration of validity is a clear listing of the test items, the job tasks, and the “linkage” of the two. Critical to this approach to the demonstration of validity based on test content is an appropriate sampling of incumbents, supervisors, and/or other job experts, along with clear instructions to those who are providing the responses. Although surveys are often used, this process can also be accomplished with review meetings involving job experts in which the material is analyzed and discussed and consensus judgments of the experts are documented.

In most contemporary entry and promotional processes, a single test does not represent the full range of job requirements, so multiple tests are used for selection or promotion. Yet a single, final score in the examination is required; therefore, the manner in which that score is calculated becomes an important consideration. Selection is often based on a written knowledge- or ability-based test and a series of interviews. The process requires that a list of candidates be created, and to do this, the different assessments must be turned into a composite score. The logic of how the composite is formed can be taken from job analytic information indicating the degree to which each score is related to job KSAs, the reliability of each score, and the overall importance of each indicator to job performance, or from other evidence that provides a rationale for aggregating the information. (Composite predictor scores are addressed in more detail in Chapter 17, this volume.) Another and perhaps more complex example can be seen in police officer and firefighter selection, where there often is a written and a physical test. Combining these two scores becomes an issue because the weight assigned to each score will determine, in part, its impact on the final score. Years of job analysis for both of these jobs have yielded consistent results. Although both jobs require physical capability, the firefighter job is more physically demanding. Results from our own job analysis work across various police and fire departments have shown that the job of a firefighter is between 40% and 60% physical, with the remainder requiring cognitive abilities, whereas the job of a police officer is often reported by incumbents to be between 20% and 30% physical and 70–80% cognitive. The two jobs clearly need different weights for the physical test when it comes to creating a selection composite. Like the evidence for linking the test elements to the job requirements, a rationale for the weighting used to form the final test score composite is necessary. This is often based on input from job experts and professionals in the testing area. The important point is that the rationale is tied to the requirements of the job. There is no one best way to establish these weights; also, in many testing situations, the components of the examination are correlated with one another. When correlation exists among components, the weights become somewhat less of an issue because small variations in weights do not substantially change the overall results. As the correlations increase, the impact of differentially weighting the components becomes far less of an issue, and at some point, simply weighting the test components equally works in a similar manner to elaborately defining a very precise set of differential weights. On the other hand, some would argue for the use of equal weights simply because of their demonstrated robustness in prediction (Schmidt, 1971). Either way, differential versus equal weights, there is a need to standardize each test score used so that the effective weights approach the intended weights. Without standardization of scores, the tests with the larger score variances will have larger contributions to the total score.
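A minimal sketch of this composite-building logic appears below (Python, with hypothetical component scores and job-analysis-based weights; nothing here reproduces an actual examination). It standardizes each component before applying the intended weights, the requirement just noted, so that a component with a larger raw-score spread does not receive more effective weight than intended.

```python
# Forming a weighted composite from several examination components.
# Scores and weights are hypothetical; in practice, weights would come from job analysis.
from statistics import mean, stdev

def standardize(scores):
    """Convert raw scores to z-scores so each component contributes on a common scale."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]

# Raw scores for five hypothetical candidates on two components with very different spreads.
written_raw = [72, 85, 91, 64, 88]          # written knowledge test (0-100 scale)
physical_raw = [210, 340, 180, 390, 260]    # physical ability course (points)

# Intended weights, e.g., 70% written / 30% physical for a police officer composite.
weights = {"written": 0.70, "physical": 0.30}

written_z = standardize(written_raw)
physical_z = standardize(physical_raw)

composites = [
    weights["written"] * w + weights["physical"] * p
    for w, p in zip(written_z, physical_z)
]

for candidate, score in enumerate(composites, start=1):
    print(f"Candidate {candidate}: composite z = {score:+.2f}")
```

Had the raw scores been combined directly, the physical component's much larger variance would have dominated the composite regardless of the 70/30 intent; standardizing first keeps the effective weights close to the intended ones. For a firefighter composite, the same code would simply be run with weights closer to an even split, consistent with the job analysis findings reported above.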
A fourth area for defensibility is the actual administration of the tests. The best-developed tests, the ones with the highest degree of validity evidence and the strongest rationale for the weighting of components, can become useless if test administration processes are deficient. Threats to administration can come in various forms, ranging from failure to protect testing materials before the actual test date to administering the test in a room with poor lighting, loud outside noises, or missing pages in test booklets. Although this seems to be the least difficult part of defending a test and the easiest to achieve, it is often the Achilles heel of a testing process. Care must be given to all phases of test administration; for example, materials such as instructions to candidates, information about the testing locations and facilities, and any irregularities in the actual administration of the test all must be well documented. Otherwise, one may do all

of the right things when it comes to test development, but then compromise it all during test administration.

All test materials must be scored, and the scoring process represents a fifth area in which threats to the defense of a test can occur. Many modern tests are administered via paper and pencil and scored by scanning machines, or taken online and scored automatically. Scoring integrity must be demonstrated in the form of getting the correct outcome for each candidate. In the case of scanned answer sheets, this means that all answer sheets must be reviewed for irregularities. It is a good idea to scan each test sheet twice and to compare scores for any differences between the two scans; any differences indicate scanner problems or simple “hiccups” in the scanning process. Another way to ensure accuracy is to compare candidates’ hand-scored tests with their scanned scores (usually for a sample of candidates). For online testing, periodically sending through a “phantom candidate” with a known score to make sure that the algorithm is generating the correct score is a useful step (a minimal sketch of such checks appears at the end of this section). With respect to other types of potential test scoring problems, demonstrations of inter-rater agreement and other forms of reliability help to substantiate the appropriateness of scoring protocols. Although a full discussion of reliability is beyond the scope of this chapter (see Chapter 1, this volume), any and all steps that can be taken to show the consistency of test results will be of great assistance in addressing challenges to the scoring process.

A final step in the defensibility of a testing program is the documentation of all steps in the process. This includes specification of how each test was developed, how each test was administered and scored, and how the final examination score was calculated for all candidates. Creating a paper trail of your work not only allows everyone to see the steps taken but also memorializes the process. In many challenges to public sector testing, legal proceedings take place years after the examination was developed and administered. Relying on memory and randomly filed memos of what happened will never provide the information necessary to support the adequacy of the process. Public sector testing is best concluded with the compilation of a final report or file that details the project from start to finish. This documentation should be clear, and it should contain all of the necessary surveys, instructions, and tests used in the examination. There is no better way to defend a test than to have it well documented. In situations in which a challenge is presented to an agency regarding the testing process, the agency can provide the potential plaintiff with a copy of the report. On more than one occasion, this has ended the challenge to the test.
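As a concrete illustration of the scoring verification checks described above, the sketch below (Python; the candidate identifiers, scores, and data structures are hypothetical and not part of any agency's actual scoring system) compares two independent scoring passes and confirms that a "phantom candidate" with a known score is reproduced by the scoring routine.

```python
# Verification checks for a scoring process: (1) compare two independent scoring passes
# (e.g., two scans of the same answer sheets) and (2) confirm a phantom candidate's known score.
# Candidate IDs and scores are hypothetical.

first_pass = {"C001": 87, "C002": 74, "C003": 91, "PHANTOM": 80}
second_pass = {"C001": 87, "C002": 76, "C003": 91, "PHANTOM": 80}

# 1. Flag any candidate whose two scores disagree; disagreements point to scanner
#    or scoring "hiccups" that must be resolved before an eligibility list is established.
mismatches = {
    cid: (first_pass[cid], second_pass[cid])
    for cid in first_pass
    if cid in second_pass and first_pass[cid] != second_pass[cid]
}
print("Score mismatches to investigate:", mismatches)

# 2. Confirm the phantom candidate, whose correct score is known in advance,
#    comes out of the scoring routine with exactly that score.
expected_phantom_score = 80
if first_pass["PHANTOM"] == expected_phantom_score == second_pass["PHANTOM"]:
    print("Phantom candidate scored correctly; scoring algorithm appears intact.")
else:
    print("Phantom candidate score is wrong; halt and audit the scoring process.")
```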

CONCLUSIONS

Public sector testing has evolved over the past two centuries in terms of test content and test format. We have seen the movement from tests based solely on memory and other cognitive abilities to the inclusion of social judgment, personality, and biographical information. We have seen simple paper-and-pencil testing transition to testing formats that are computer-based and inclusive of video stimulus materials. It is our sincere wish that those involved in public sector testing continue to engage in the scientist-practitioner model, in which innovations in assessment that appear in the research literature are incorporated into testing programs and progressive testing programs are, in turn, described in our research journals. One example of such action is the move to provide assessments of cultural proficiency during the hiring process for police officers. Much has been written about racial bias in policing and about its assessment, and the news has documented many cases in which police response has been lethal and beyond what is considered reasonable; police departments are now pushing to include mechanisms for measuring racial bias during the selection process. Success in this area should be further documented so that others can benefit from the application of scientific knowledge to real-world problems.

With respect to the identification of critical underlying job requirements, we have seen public sector testing programs expand their use of systematic job analytic techniques that approach not only single jobs under study but also groupings of jobs, so that the inherent interrelationships among jobs can be identified to better take advantage of opportunities to use a common testing

system across jobs. With respect to the legal arena, public sector testing is often singled out as the test case for examining the defensibility of specific test formats and test content, as well as the way in which test scores are used in the decisions made about people and jobs. Clearly, as challenges to the fairness of various types of testing programs move forward, public sector applications will be part of the landscape. Public sector employees are less susceptible than their private sector counterparts, although still not immune, to layoffs and downsizing, although hiring freezes are common. This fiscal reality means that most cities and public agencies, even in their toughest financial times, continue to require substantial levels of staffing. Therefore, the enormous demand for testing programs for the hiring and promotion of public sector employees will continue, and the need for accomplished and creative test development professionals will offer tremendous opportunities to further develop the way in which we measure candidates against job requirements.

REFERENCES

Aguinis, H., & Smith, M. A. (2007). Understanding the impact of test validity and bias on selection errors and adverse impact in human resource selection. Personnel Psychology, 60, 165–199.
Baruch, I. (Chair, Committee on Position-Classification and Pay Plans in the Public Service). (1941). Position-classification in the public service. Chicago, IL: Public Personnel Association.
Cascio, W. F., Jacobs, R. R., & Silva, J. (2010). Validity, utility and adverse impact: Practical implications from 30 years of data. In J. Outtz (Ed.), Adverse impact: Implications for organizational staffing and high stakes selection (pp. 271–288). New York, NY: Psychology Press.
City of Los Angeles. (2009). Official City of Los Angeles charter. American Legal Publishing Corporation. Retrieved November 16, 2009, from http://www.amlegal.com/nxt/gateway.dll?f=templates&fn=default.htm&vid=amlegal:laac_ca
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393.
De Corte, W., Lievens, F., & Sackett, P. R. (2011). Designing Pareto-optimal selection systems: Formalizing the decisions required for selection system development. Journal of Applied Psychology, 96, 907–926.
DuBois, P. H. (1970). A history of psychological testing. Boston, MA: Allyn & Bacon.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43, 38290–38315.
Franke, W. (1963). The reform and abolition of the traditional Chinese examination system. Cambridge, MA: Harvard University Press.
Graham, J. R., & Lilly, R. S. (1984). Psychological testing. Englewood Cliffs, NJ: Prentice Hall.
Gregory, R. J. (1996). Psychological testing: History, principles and applications (2nd ed.). Boston, MA: Allyn & Bacon.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence Erlbaum.
Hollenbeck, G. P., McCall, M. W., & Silzer, R. F. (2006). Leadership competency models. The Leadership Quarterly, 17, 398–413.
Hough, L. M. (1984). Development and evaluation of the accomplishment record: Method of selecting and promoting professionals. Journal of Applied Psychology, 69, 135–146.
Jacobs, R. R., Conte, J. M., Day, D. V., Silva, J. M., & Harris, R. (1996). Selecting bus drivers: Multiple perspectives on validity and multiple estimates of validity. Human Performance, 9, 199–218.
Jacobs, R. R., Deckert, P. J., & Silva, J. (2011). Adverse impact is far more complicated than the Uniform Guidelines indicate. Industrial and Organizational Psychology: Perspectives on Science and Practice, 4, 558–561.
Jacobs, R. R., Murphy, K., & Silva, J. (2012). Unintended consequences of EEO enforcement policies: Being big is worse than being bad. Journal of Business and Psychology, 28, 467–471.
Lindsey, A., King, E., McCausland, T., Jones, K., & Dunleavy, E. (2013). What we know and don’t: Eradicating employment discrimination 50 years after the Civil Rights Act. Industrial and Organizational Psychology, 6, 391–412.
Murphy, K., & Jacobs, R. R. (2012). Using effect size measures to reform the determination of adverse impact in equal employment litigation. Psychology, Public Policy, and Law, 18, 477–499.
Murphy, L. L., Plake, B. S., & Spies, R. A. (Eds.). (2006). Tests in print VII. Lincoln, NE: Buros Institute of Mental Measurement.

Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153–172.
Pyburn, K. M., Jr., Ployhart, R. E., & Kravitz, D. A. (2008). The diversity-validity dilemma: Overview and legal context. Personnel Psychology, 61, 143–151.
Rodrigues, D., Patel, R., Bright, A., Gregory, D., & Gowing, M. (2002). Developing competency models to promote integrated human resource practices. Human Resource Management (Special Issue: Human Resources Management in the Public Sector), 41, 309–324.
Sackett, P., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 419–450.
Schmidt, F. L. (1971). The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 31, 699–714.
Spies, R. A., Plake, B. S., & Geisinger, K. F. (Eds.). (2007). The seventeenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurement.
Taylor, H. C., & Russell, J. T. (1939). The relationship of validity coefficients to the practical effectiveness of tests in selection. Journal of Applied Psychology, 23, 565–578.


33
SELECTION METHODS AND DESIRED OUTCOMES
Improving Entry- and Mid-level Leadership Performance Through the Use of Assessment Technologies
SCOTT C. ERKER, CHARLES J. COSENTINO, AND KEVIN B. TAMANINI

The importance of effective leadership selection to modern organizations cannot be overstated. By now, it is safe to assume that most private organizations use some form of structured selection tool, method, or process to make decisions about people: whom to hire, promote, and/or accelerate toward future leadership roles. Organizations have paid the price for unstructured selection procedures. Lack of consistency in the selection of leaders can lead to poor motivational and skills fit between the individual and the job, as well as ineffective leader-follower relationships, resulting in low performance, unmet expectations, and high unwanted turnover. This ultimately leads to sub-optimized organizational productivity, inconsistent customer service, low employee engagement, and disengaged leaders lacking confidence in their skills. Because modern organizations must frequently respond to new market demands, emerging competitors, and rapidly advancing technologies, unstructured selection methods that rely on the “gut feel” of hiring managers pose great risk to organizational growth and survival. The requirements for leaders, and the selection methods used to identify those who are the most likely to excel in these demanding roles, must keep up with the rapid pace of change in business. Indeed, Johansen (2009) identified 10 leadership skills that are needed for an uncertain world. The term “VUCA world” (volatility, uncertainty, complexity, and ambiguity) was first used by the U.S. military to discuss preparedness, but Johansen popularized the phrase to describe the environment that organizations face.

Today, organizations benefit from decades of science and practice that provide guidance for maximizing the effectiveness of decisions about whom to hire or promote into a leadership role. When you consider the direct costs (e.g., hiring and training costs, compensation, and benefits) of employing an individual across their potential tenure in an organization and the indirect costs associated with a poor leader managing a sub-optimized team, each hire/promotion can be an investment in the millions of dollars. When looked at from an aggregate level, the cumulative effect of effective selection decisions can lead to extraordinary business performance and become a true competitive advantage.

Personnel selection has been one of the central topics in the study of work behavior (Guion, 1998) and ultimately aims to identify the individuals who will constitute the workforce in any given organization (Salgado, Viswesvaran, & Ones, 2001). As noted by Howard (2006),

effective systems boost organizational performance and allow individuals to excel by engaging in work they enjoy because the organization gets the right people into the right jobs. Although much of the research literature has focused on selection issues associated with entry-level jobs (e.g., Campbell, McHenry, & Wise, 1990), the selection systems at higher levels within organizations are just as, if not more, critical. In a recent report by Mitchell, Ray, and van Ark (2014), 4 of the top 10 strategies CEOs prioritized for human capital management focused specifically on improving the effectiveness of both front-line and senior managers and on improving succession planning for current and future needs. Getting the right leaders, who can deal with the dynamics of an ever-changing business landscape, into the top positions will stimulate organizations to grow and prosper (Howard, 2006; Mitchell, Ray, & van Ark, 2014). Indeed, the financial health of an organization is predicated on the optimal selection and placement of employees. From a utility perspective, selecting a superior manager will result in 48% more output than an average manager (as compared to 32% for skilled workers and 19% for lower-level jobs; Hunter, Schmidt, & Judiesch, 1990). The bottom line is that it pays to have dynamic and effective selection systems, especially when dealing with leader-level positions.

This chapter focuses on the selection of entry- and mid-level leaders within private sector organizations. Selection is defined as the processes used to select new hires, promote internal candidates, and select individuals into developmental acceleration pools. From this point forward, the authors refer to this as the selection of leaders or leadership selection. A review of contemporary organizational challenges unique to entry- and mid-level leadership sets the stage for a discussion of themes and strategies for enhancing leadership effectiveness and bench strength through improved leadership selection practices. Next, a comprehensive set of leadership assessment tools and procedures is described. Real-world case studies are used to illustrate the application and results achieved from enhanced selection programs. We conclude this chapter by overviewing common business scenarios in which assessment is used to guide leadership selection in the private sector.

Recently, Development Dimensions International (DDI), a consulting firm focused on improving leadership insight and growth, conducted a global leadership forecast study in conjunction with The Conference Board. Survey participation included 13,124 leaders and 1,528 global human resource (HR) executives from 2,031 participating organizations. In this landmark study, researchers examined findings spanning leaders across 4 levels, 48 countries, and 32 major industries (Sinar, Wellins, Ray, Abel, & Neal, 2014). They found that, compared to previous studies, the proportion of leaders who expressed confidence in the overall quality of leadership in their organization increased slightly; 40% of leaders rated current quality as high in 2014, as compared to 37% in 2009 and 38% in 2011. Interestingly, only one in four organizations evaluated their leader performance as effective.
This research also showed that organizations in an aggressive-growth mode have a significantly higher proportion of Millennials (30%) in leadership positions as compared to organizations focused on cautious growth (25%) or moderate growth (21%). These younger leaders are also more likely to intend to leave within the next 12 months as compared to other generational groups. This poses unique challenges for organizations when determining both how to continually fill leadership roles and how to plan for longer-term leadership succession. Indeed, only 15% of organizations rated their future bench strength as strong, which is in alignment with a similar trend from previous research (Bernthal & Erker, 2005) showing that most organizations are not confident that they have the leadership to address current and future needs. As Bernthal and Erker (2005) noted, 52% of respondents expected to have problems filling mid-level leadership positions with qualified candidates, and 28% anticipated problems filling first-level leader positions with qualified candidates. The cumulative impact of poor selection at the entry and mid-levels can be debilitating for strategy, execution, and culture, especially given the volume of leader job changes that may be required to manage growth or turnover. As Sinar et al. (2014) noted, in order to mitigate the risk of poor selection, organizations need to prepare internal leader candidates by providing accelerated development programs for people who are in the leadership pipeline. Organizations that seek external leader candidates can attract new leaders from nontraditional

external sources. These organizations can consider expanding the pool of candidates to attract leaders from other industries and then provide intense onboarding experiences (e.g., coaching, mentoring, networking opportunities) to facilitate their socialization into the organization.

CURRENT BUSINESS TRENDS AFFECTING LEADERSHIP SELECTION AND DEVELOPMENT

Some current business trends have exacerbated the difficulty organizations have in selecting and proactively preparing individuals for leadership positions.

Trend 1. Flatter and Leaner Organizations Have Limited Critical On-the-Job Development Experiences to Prepare Leaders for High-Level Assignments

The delayering of organizations has diminished the number of opportunities people have to develop and practice their leadership skills. In the 1980s and earlier, extensive management trainee programs, with development opportunities and exposure to more senior leaders while occupying lower-level positions (e.g., assistant managers), were effective means for identifying and developing future leaders. Since then, organizations have reduced leadership levels and eliminated assistant manager positions. This has diminished organizations’ ability to identify and develop those individuals with the greatest leadership potential. Reduced organizational levels have made each transition into a new level of leadership responsibility more difficult for newly promoted leaders. This has increased the importance of other strategies to identify, select, and accelerate the development of leaders and leadership capability. Given this trend, measures of potential have become more important than past performance and achievement in evaluating candidates (Tormala, Jia, & Norton, 2012). In addition, many indicators of success at the individual contributor level, such as technical skills, dedication, and loyalty, are only marginally related to leadership potential and talent.

Trend 2. New Business Realities and Retiring Baby Boomers Have Challenged Organizations to Find New Ways to Define Leader Requirements and Attract, Identify, and Accelerate Leader Development

When organizations have long-tenured employees, they have the benefit of an experienced workforce, but they are at risk of large-scale retirements. Many organizations that we work with, from chemical processing to transportation companies to financial service organizations, expect up to 70% of their senior team to retire over the next five years. At the same time, new business realities have made some of the strategies, approaches, and behaviors used by current leaders less relevant. These factors work together to create a “perfect storm.” Succession planning, especially at the first and mid-levels, is largely absent as a strategy for managing these tremendous changes. Unmotivated and unprepared leaders can be faced with an impossible situation. Organizations do not have the time they had in the past to grow leaders through a series of developmental assignments. It has become increasingly important for organizations to have pre-promotion acceleration programs to help leaders with key transitional leadership positions.

Rapidly evolving business realities related to technology advancement, increased buyer sophistication (through social networking and better access to information), and comparable products/services among competitors have increased the pressure on leaders to drive change within their teams. Traditional methods of leading and/or managing are challenged by this pace of change. As a result, practitioners are using more visionary job analysis methods that study the

new leadership challenges facing the organization and select competencies most closely associated with addressing those changes. Observations of leaders and focus groups with job content experts are not adequate to define these rapidly changing job requirements. Incumbent leader judgment is seen as skewed by behaviors and skills that were important in the past.

Trend 3. Globalization Has Impacted Leadership Requirements and Turnover

Whether leaders are working in their home countries or are taking an assignment in another part of the world, the effect of globalization over the last decade has been profound. Plans for international growth are on the rise, with 69% of organizations intending to add offices or facilities outside of their home country (Sinar et al., 2014). Globalization in the matrixed organization has made the leader’s work of managing diversity, creating a culture of trust and development, and making decisions much more complex. The ability to adapt one’s leadership behaviors to individuals from different cultures and a more diverse workforce is a considerable challenge for leaders at the first and second levels.

Organizations that attempt to tackle these challenges struggle to balance the implementation of local versus centralized talent programs. In a survey conducted by Mitchell, Bolling, Phang, and Schott (2013), more than 1,500 HR professionals evaluated the effectiveness of leader-focused talent management programs, and they found that corporate-owned programs were the most effective when selecting both front-line and mid-level leaders, but a balance between corporate and locally owned programs was most effective for the ongoing development of these leaders. This study demonstrates that a consistent set of talent management practices (i.e., tools/processes) that are scalable and allow for alignment across the organization, while still providing flexibility for local market nuance, is the best combination for selecting and developing successful leaders.

For those who choose to take an extended assignment in another country, the job challenges are compounded by the challenge of adapting to a different culture. It has been reported that 28% of expatriate leaders leave their assignment early because of family concerns (Society for Human Resource Management, 2009). In addition, the attrition rate among expatriates once they have returned to their home country is significant (27% leave within the first year, compared with 13% attrition for leaders who do not take an expatriate assignment; Society for Human Resource Management, 2009) because they are challenged with reassimilating into their companies as organizations struggle to fully utilize and recognize these leaders’ newfound skills.

Trend 4. Organizational Commitment Is Declining

Frequent downsizing, mergers and acquisitions, and conservative investment in employee incentives over the last few decades have reduced employees’ commitment and trust in organizations (Burke & Cooper, 2000; Kramer, 1999; Modern Survey, 2016). As a result, individuals are more focused on managing their own careers than on relying on employers. Employees no longer have a strong belief that the company will develop them in ways aligned to their career interests. Both managers and employees cite the lack of concrete career plans as a significant reason for turnover (Chakraborty & Rudbeck, 2014). This trend has a substantial impact on Millennials, who commonly have even lower trust in large companies’ development and promotion practices and in their companies’ long-term viability in the marketplace (Marston, 2007). Millennials seek work that they find meaningful, with a good balance between personal and work time. They are more willing and able to find new career opportunities that enable them to achieve those goals more quickly. The advent of social media and web-based search makes it easier to find new opportunities. When employees become dissatisfied with their current job and have no sanctioned career plans, the likelihood of turnover is high (Oracle, 2012). Exit interviews with high-performing employees have shown that a lack of career advancement is more important than pay and benefits as

the predominant reason for turnover (Spencer, 2014). More employees are engaging in job-hopping and commonly capitalize on their current company’s brand to enhance their value to lesser-known companies with which they seek employment. This change in employee views creates a dilemma for organizations: they must balance the need to take action and accelerate the development of employees against the risk that newly developed leaders become a retention problem if they are not rewarded (e.g., with the greater responsibility and compensation they often desire almost immediately). Organizations need to follow through on raised expectations and explicit or implicit promises made to leaders or risk losing their investment in development through turnover. Candidates’ easy access to online information and networking gives them greater knowledge about a company’s true culture, making recruiting new talent more difficult (Bersin, 2015). Well-thought-out and planned leadership identification and development programs, communicated clearly through explicit career planning processes, are critical to the engagement and retention of highly talented leaders.

Trend 5. Leader Readiness for Promotion Is Low, and Programs Designed to Increase “Speed to Productivity” Are Being Defined as a Critical Business Process

When new leaders do not have the skills to engage their team, the impact is damaging at two levels. First, senior leaders fail to realize the results they had planned in terms of goals being achieved, and second, the employees managed by these ineffective leaders become frustrated and lose focus, resulting in lost workforce productivity and turnover. The complexity and rapid growth of business today, and truncated efforts to prepare leaders for new roles, leave leaders with far less time and support to achieve mastery in their roles. Leadership mistakes are costly at a personal and organizational level. Lack of confidence in leading a new team can result in apprehension or avoidance in handling difficult leader challenges, requests to return to former positions, or micromanagement of direct reports. These effects are apparent in various organizational settings and industries. For example, they affect newly promoted leaders in service industries who ask to return to the crew because they lack the confidence, motivation, or skills to manage former peers’ performance. Ineffective leader behaviors are also apparent in technology and financial companies when new leaders focus on directing others’ technical activities rather than coaching and building a successful team. This behavior is shaped by their “comfort zone,” that is, their greater confidence in their technical skills than in their leadership skills. It impacts the new sales manager whose “coaching” consists of expounding on what has worked for him or her in the past. This often results in sales associates working around their leaders and developing their own, often unproductive, strategies to address a new competitor or market reality. There is growing recognition that speed to productivity, enhanced through effective new leader selection, onboarding, and development programs, is an important lead measure of success in meeting and exceeding goals.

In summary, the five trends outlined above have contributed to a laser-like focus on entry- and mid-level leader positions. Although this trend list is not exhaustive, it does highlight some of the more severe contextual challenges that must be taken into account when designing a sustainable leadership selection program. Our contention is that those organizations that can best anticipate the impact of these trends and then take action to implement programs that select the right leaders will be prepared with the right people in place to meet future business challenges.

ASSESSMENT PRINCIPLES FOR LEADERSHIP READINESS

To make the best possible entry- and mid-level leadership decisions, many organizations turn to various forms of assessment. Assessment helps organizations gather and organize information about their current and potential future leaders. Better decisions are made when assessments are applied in the context of hiring, promotion, or development. The best

assessment techniques are aligned not only with leaders’ job challenges but also with business and cultural strategies. Specifically, effective leadership assessment (a) increases the probability that an individual who is chosen for a target position has the behaviors, experience, knowledge, and skills needed to succeed in the leadership position and to drive business success; and (b) provides insights into leadership potential and readiness that can accelerate an individual’s development in the role. To build selection criteria and tools that will assess individuals accurately, we believe it is important to understand the role the leader will play as well as the specific business context in which he/she will play it. Understanding the leader’s role from this perspective shapes the selection criteria and approach needed to make accurate predictions of candidates’ potential and readiness.

Fundamentally, from a psychological perspective, leaders maintain group cohesiveness, manage conflicts, sustain the group’s value to the broader organization, and, most importantly, manage external events that may threaten the group’s value to the organization or the customers it serves. Leaders provide the structure through which priorities are set and norms are established to ensure the group’s value to the broader organization is sustained (Katz & Kahn, 1978). Leadership in this deeper sense cannot be bestowed on an individual by an organization. Although formal leaders can be given status and authority, leaders need to earn the role described above. If leaders fail to gain personal influence, their role is limited to becoming the “enforcer” who monitors compliance with rules. To add value, leaders need to provide more value/benefits than others. According to Hollander (2006, 2008), this gives leaders idiosyncrasy credits. This bank of earned credit, or perceived greater value, gives leaders the influence to change established procedures, behaviors, or decisions that do not add value to internal or external customers. This enhanced power and credibility enables greater control of decisions, greater receptivity to their ideas, and various rewards in the form of greater respect and monetary incentives. Leaders need strong skills in influencing and engaging others to achieve important business and people objectives and in making sound decisions and plans. In private sector organizations, the value that leaders bring to the organization is translated into higher productivity, customer satisfaction, and the effective management of competitive threats. Leaders ensure that, for every member, the net benefits of staying with the group outweigh those associated with leaving (Bandura, 2006; Hollander, 2008).

To truly maximize the predictive power of entry- and mid-level leadership selection, a number of important assessment principles should be taken into account.

Assessment Principle 1. Multiple Selection Techniques and Multiple Evaluators Create Better Prediction and Mitigate the Risk of Selection Error

Past performance and results achieved as an individual contributor have limited power for predicting future leadership performance when the uniqueness and complexity of the leadership role are significant and when there are substantial differences in the skill sets required of leaders and individual contributors. Screening assessments of various types (e.g., basic qualifications, experience and knowledge reviews, biographical questions, and inventories) are very effective when used to screen out the less qualified. For the remaining candidates, multiple selection methods (e.g., situational, personality, and cognitive ability tests, which more comprehensively assess candidates on dispositions and abilities, as well as interviews and behavioral simulations) provide a more comprehensive view of potential and readiness. Simulations and tests are particularly important when entry-level leader candidates have little previous leadership experience.

There are no silver bullets in leadership selection. When practitioners are trying to mitigate the risk of selection error, comprehensiveness through multiple measures of critical assessment targets is critically important. Given all of the sources of error variance (e.g., methods and evaluators) and the rather low correlations between many selection tools and job performance, it is beneficial to have multiple processes in place. Similar to the mindset of an engineer who is designing a fail-safe system, a multiple-hurdle selection process is helpful in ensuring that only

743

Scott C. Erker et al. the best candidates are selected. Multiple evaluators involved in collecting selection data adds to the reliability of the process. It is important that a selection panel has access to all data collected through the selection process so that all relevant data is considered when making selection decisions.
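
To illustrate the value of pooling judgments from multiple evaluators, the short Python sketch below applies the Spearman-Brown prophecy formula to show how the reliability of a composite rating grows as independent evaluators are added. The single-rater reliability value and the number of evaluators are hypothetical illustrations, not figures from this chapter.

    def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
        """Estimated reliability of the average of n_raters parallel ratings."""
        r = single_rater_reliability
        return (n_raters * r) / (1 + (n_raters - 1) * r)

    # Hypothetical single-rater reliability for an interview or simulation rating.
    r_single = 0.50
    for k in (1, 2, 3, 4):
        print(f"{k} evaluator(s): composite reliability = {spearman_brown(r_single, k):.2f}")

With these assumed numbers, moving from one evaluator to three raises the estimated reliability of the pooled rating from .50 to .75, which is the statistical rationale behind the principle stated above.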

Assessment Principle 2. Leadership Selection and Development, When Leveraged Together, Can Have Significant Impact

In a well-designed and implemented leadership succession process (hiring, promotion, and succession management), assessment should focus on all elements of the job requirements, whereas development focuses on trainable elements. Not all leader requirements are equally developable. A well-designed assessment program will examine both non-trainable and trainable dimensions of success. Especially for behavioral competencies, a well-designed behavioral diagnostic can build awareness of the need for development and provide focus for development planning that is very useful to learners and facilitators. The value and impact of an assessment program are greatly enhanced when the assessment is followed by a well-designed, actionable learning process aligned with the assessment results. The assessment results provide insights into relevant and focused learning paths for participants and can help the organization make the best initial placement decisions.

Assessment Principle 3. Transparency About Assessment Results and Their Impact on Careers Is Particularly Important When Selecting Leaders

The best candidates for entry-level leadership positions are often (a) the best and most valued individual contributors and (b) external candidates who are highly sought after by other companies. Most individuals are resistant to evaluation, especially when they are uncertain of how it will affect their employment possibilities or careers. Explaining the importance of the leadership role and of objective assessment to the company and to the candidates' own career development reduces the natural resistance to being evaluated and produces greater acceptance of the process and its results. Internal candidates also should know who will review the results, how the results will be used, and how they will affect their careers. Having alternative career paths for valued employees who are not successful in the leadership selection process is critical to reducing the potential negative impact of failure.

ASSESSMENT TOOLS AND TECHNIQUES

An ideal selection (and onboarding) process for leadership positions will consist of multiple hurdles. Arraying multiple assessment methods across multiple hurdles is a common way to screen out less qualified candidates efficiently and to evaluate the most qualified in greater depth. The process often begins with a screening of candidates on the basis of an evaluation of relevant knowledge and experience; tests and inventories are then used to provide more information about skills, potential, and attributes. Remaining candidates can be put through more in-depth assessments that can include simulations and interviews. Once candidates are hired, development plans are built upon their selection results and are incorporated into the onboarding process. This ensures that new hires are brought up to speed quickly, thereby reducing time to meaningful contributions.

Various tools may be utilized to effectively evaluate candidates' capabilities for each of the success profile components. Some methods (i.e., tests) assess basic psychological constructs or job knowledge, whereas other methods (e.g., work samples, simulations, and interviews) are more contextual and directly measure critical job challenges and behavioral competencies. These methods may also be placed along a continuum that ranges from measuring signs of behavior to samples of behavior (Wernimont & Campbell, 1968). Signs of behavior include an individual's personality or dispositions and motivations related to job success, whereas samples of behavior refer to the demonstration of behaviors related to job success. Thus, methods may also be categorized as those that provide inferences about behavior (e.g., personality tests, cognitive tests), assess descriptions of work behavior (e.g., biodata, interviews), or demonstrate behavior (e.g., job simulations) (Howard, 2006). This is an important difference for organizations because the use of different methods requires different validation strategies. Effective entry- and mid-level leadership assessment programs use multiple assessment tools. Whether selection methods measure constructs or focus on job content—that is, depict signs (inferences) of behavior or samples (descriptions or demonstrations of behavior)—some have been shown to be better predictors of leader performance than others. Although there is an abundance of literature on the validity of selection predictors across jobs (mainly relying on entry-level jobs; e.g., Hunter & Hunter, 1984), much less has focused primarily on entry- and mid-level leader selection. The unique nature of these leadership positions demands targeted study, and more research should be conducted in this area.
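
The multiple-hurdle flow described above can be made concrete with a small sketch. The Python code below is a minimal illustration under assumed stage names, cutoff scores, and candidate fields (none of which come from this chapter); it simply shows how successively more expensive assessments are applied only to candidates who clear the earlier, cheaper hurdles.

    # Minimal sketch of a multiple-hurdle screening flow; all cutoffs, stage names,
    # and candidate data are hypothetical illustrations, not a prescribed standard.

    def screen_basic_qualifications(candidate):
        return candidate["years_experience"] >= 2 and candidate["eligible_to_work"]

    def passes_test_battery(candidate, cutoff=50):
        return candidate["test_score"] >= cutoff

    def passes_simulation(candidate, cutoff=3.0):
        return candidate["simulation_rating"] >= cutoff

    def run_hurdles(candidates):
        """Apply each hurdle in sequence; only survivors reach the next, costlier stage."""
        hurdles = [screen_basic_qualifications, passes_test_battery, passes_simulation]
        remaining = list(candidates)
        for hurdle in hurdles:
            remaining = [c for c in remaining if hurdle(c)]
        return remaining

    candidates = [
        {"name": "A", "years_experience": 4, "eligible_to_work": True,
         "test_score": 62, "simulation_rating": 3.4},
        {"name": "B", "years_experience": 1, "eligible_to_work": True,
         "test_score": 70, "simulation_rating": 4.1},
    ]
    print([c["name"] for c in run_hurdles(candidates)])  # ['A']

In practice each hurdle would be a validated instrument with empirically set cutoffs rather than the arbitrary values used here.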

Screening Methods

Biographical Data

Biographical data or biodata measures are empirically developed and quantify descriptions of past activities and accomplishments, such as life experiences, hobbies, and other pursuits. As Mumford, Stokes, and Owens (1990) noted over 25 years ago, studying patterns of life history sheds light on the ecology of human individuality. Indeed, more recent research has shown biodata to be one of the best predictors of employee performance (Breaugh et al., 2014; Schmidt & Hunter, 1998; Schmitt & Golubovich, 2013; Zibarras & Woods, 2010). Although this might be true, recent reviews have noted that the use of biodata has not been extensively leveraged by organizations for making employment decisions (Gatewood, Feild, & Barrick, 2011), nor has there been much research over the last few decades (Cortina & Luchman, 2013; for a more thorough review of biodata research, see Mumford, Barrett, & Hester, 2012). As evidence of this apparent lack of use within organizations, Furnham (2008) surveyed 255 Human Resource (HR) professionals concerning their views on 12 selection methods (e.g., references, interviews), and the respondents ranked biodata 10th in terms of its perceived validity, 9th in terms of its practicality, and 10th in terms of its perceived legality. Although Furnham (2008) did not gather data to determine why these HR professionals felt that biodata was not a practical measure to include in the selection process, others have postulated that one possible explanation for these results could be the use of incumbent samples rather than applicant samples in most of the empirical studies (Breaugh et al., 2014; Stokes et al., 1993). Although there seems to be minimal use of biodata within organizations, it is important to note that biodata has been shown to predict performance with greater accuracy (when used in an appropriate structured format) than many other commonly used selection tools (e.g., Schmidt & Hunter, 1998) and has also been shown to have incremental validity when used in combination with cognitive or personality measures (Mount, Witt, & Barrick, 2000). Although there has been relatively little attention from researchers on the issue of adverse impact, studies have indicated that biodata has minimal adverse impact in terms of gender (Becton, Matthews, Hartley, & Whitaker, 2009), whereas studies on race have been mixed (e.g., Becton et al., 2009; Van Iddekinge et al., 2003). Clearly, the research has shown that there can be benefits to the use of biodata as a part of the selection process, but there is still relatively little empirical research that focuses on the use of this type of tool for entry- and mid-level leader selection. Much of the more recent research has focused on the development of biodata scales as well as the process for effectively structuring the use of those measures. Those studies that do exist that target leader-level roles, albeit older, do provide support for the use of these types of measures for selecting managers
(e.g., Carlson et al., 1999), front-line leaders (e.g., Rothstein et al., 1990), and for predicting leadership potential (Stricker & Rock, 1998). Considering the new dynamics that leaders face in the "VUCA" world, it is clear that further research on the predictive validity of life experiences for early success as a leader is needed to substantiate these dated findings. Gathering additional data from HR professionals about why they have not incorporated biodata more fully into their processes could help expand and explain the findings from Furnham (2008). It is very possible that many are using biodata in an unstructured way and are therefore questioning the utility of such tools regardless of the empirical support.
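
Because the chapter describes biodata measures as "empirically developed," a brief sketch of one common development approach, empirical keying against a criterion, may help. Everything in the Python code below (sample size, item weights, the 0.10 retention threshold) is a made-up illustration rather than a recommended standard, and a real key would be cross-validated on a holdout or applicant sample.

    import numpy as np

    # Illustrative empirical keying of biodata items; data and threshold are hypothetical.
    rng = np.random.default_rng(0)
    n_people, n_items = 200, 6
    items = rng.integers(1, 6, size=(n_people, n_items)).astype(float)  # 1-5 responses
    criterion = 0.4 * items[:, 0] + 0.3 * items[:, 2] + rng.normal(size=n_people)

    # Develop the key: weight each item by its correlation with the criterion.
    weights = np.array([np.corrcoef(items[:, j], criterion)[0, 1] for j in range(n_items)])
    key = np.where(np.abs(weights) >= 0.10, weights, 0.0)  # drop near-zero items

    def biodata_score(responses: np.ndarray) -> float:
        """Weighted sum of item responses under the empirically derived key."""
        return float(responses @ key)

    print(np.round(key, 2))
    print(biodata_score(items[0]))

The incumbent-versus-applicant sampling concern raised above applies directly here: a key developed on incumbents may not hold up when applicants are scored with it.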

Behavioral Consistency Method

The behavioral consistency method of evaluating training and experience is a type of biodata evaluation. Although some have categorized the behavioral consistency method as biodata (i.e., Hough & Oswald, 2000), most others have differentiated the two types of measures (e.g., Howard, 2006; Robertson & Smith, 2001; Schmidt & Hunter, 1998). Also called individual achievement records, career achievement records, or career achievement profiles, this method is based on the well-established principle that the best predictor of future performance is past performance and, according to Howard (2006), is a useful tool for leader selection. Applicants are asked to describe their past achievements or experiences, either in writing or orally. Managers then score these achievements with the aid of anchored rating scales. This works well for mid-level leadership selection but is problematic when individuals have no formal leadership experience and are applying for an entry-level leadership job: there are few relevant past behaviors to document, giving this method limited practical utility in that situation. Research has also shown that contemporary items (current or ongoing behaviors/experiences) tend to be more valid than hypothetical/future items (potential behaviors) or historical items (past experiences), and items that ask respondents about others' opinions of them are more valid than direct self-report items (Lefkowitz, Gebbia, Balsam, & Dunn, 1999). Although the behavioral consistency method is time-consuming and costly to construct, Schmidt and Hunter (1998) noted that the method is well worth the cost and effort for higher-level jobs, such as entry- and mid-level leaders. Indeed, an aspect of this method that many practitioners would likely find appealing is the flexibility it offers for developing a highly engaging and job-relevant assessment experience for candidates. By adapting this approach to align with the job-specific competency profile for any leader-level position, an organization can gather data that are aligned to the unique facets of a leader-level role within its unique context. Although there can be challenges with the calibration of raters and the consistency of the scoring process, the opportunities the behavioral consistency method provides to practitioners should not be overlooked. Certainly, an opportunity for entry- and mid-level leadership research is to directly examine the predictive validity of achievement profiles for entry- to mid-level leaders.
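
As a small illustration of the scoring step described above, the Python sketch below (which needs Python 3.10+ for statistics.correlation) averages two raters' anchored-scale scores for a set of achievement records and computes a rough inter-rater calibration check. The anchors, ratings, and number of records are hypothetical.

    import statistics

    # Hypothetical anchored rating scale (1-5) for scoring achievement descriptions.
    ANCHORS = {
        1: "Achievement shows no leadership-relevant behavior",
        3: "Achievement shows leadership behavior with moderate scope and impact",
        5: "Achievement shows sustained leadership behavior with broad organizational impact",
    }

    # Two trained raters score the same four achievement records (scores are illustrative).
    rater_a = [3, 4, 2, 5]
    rater_b = [3, 5, 2, 4]

    final_scores = [statistics.mean(pair) for pair in zip(rater_a, rater_b)]
    calibration = statistics.correlation(rater_a, rater_b)  # rough inter-rater check

    print("Final achievement scores:", final_scores)
    print("Inter-rater correlation:", round(calibration, 2))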

Tests and Inventories

Cognitive Ability Tests

Since the earliest research on personnel selection, cognitive ability measures have been one of the major methods used to differentiate among candidates. Specifically, various cognitive ability tests (e.g., verbal, numerical, and spatial tests) intercorrelate, and the common variance often operationalizes a general cognitive ability factor, called g (e.g., Sackett & Lievens, 2008; Schmitt, 2014). Among the various measures that might be used for personnel selection, cognitive ability (g) is one predictor that has demonstrated strong validity across most jobs. Interestingly, the main factor that moderates the validity of g as a predictor of performance is the complexity of the job. Hence, tests that measure g have their highest validity for complex jobs. General cognitive ability is an excellent predictor of academic achievement and professional expertise. It may not predict the interpersonal leadership complexity involved in operating in a business setting. Complexity in leadership positions often centers on mastering ambiguous business situations and dealing with difficult social interaction and persuasion; this complexity is somewhat different from that found in other professional positions, such as engineering and finance. Indeed, Schmidt and Hunter (1998) reported an adjusted correlation of .58 with performance for managers. Similarly, in Aberdeen's 2013 Human Capital Management Trends study, Lombardi (2013) reviewed best-in-class companies and found critical thinking and cognitive ability assessments to be more valuable than any other assessment method for identifying high-potential talent. Although cognitive ability tests are unquestionably valid, they commonly demonstrate considerably large group differences that often result in adverse impact across levels of job complexity (e.g., Berry, Clark, & McClure, 2011), and they do not measure all of the elements of leadership success. For this reason, many practitioners (in the United States) have avoided using cognitive tests as the sole screening tool early in the selection process (Sackett & Wilk, 1994). Sinar (2013) also noted that, for executive selection, it is increasingly important to determine the best way to fold cognitive skills assessments into a broader selection process. Because there is likely to be a restriction of range in cognitive ability as leaders move up the management hierarchy (Howard, 2006), determining which aspects of executive success are linked to cognitive ability can help organizations develop a comprehensive selection strategy that appropriately incorporates cognitive ability measures. Based on a data set of 857 senior executives across 22 companies, Sinar (2013) was able to show which key executive competencies were most linked to cognitive ability. He found that certain behaviors were driven by cognitive ability, some were influenced by cognitive ability, and some were completely distinct from cognitive ability. Clearly, cognitive ability is a key component of leader selection, and determining the right balance with other predictive measures is important for ensuring adequate coverage of the complexities of leader-level roles.
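
Because the passage above raises adverse impact, a compact sketch of the familiar four-fifths rule of thumb from the Uniform Guidelines may be useful for readers who want to monitor a screening stage. The applicant and hire counts below are invented for illustration; the rule is a screening heuristic, not a substitute for statistical significance testing or a validation argument.

    def selection_ratio(hired: int, applicants: int) -> float:
        return hired / applicants

    def four_fifths_check(ratios: dict) -> dict:
        """Compare each group's selection ratio to the highest ratio (4/5 rule of thumb)."""
        highest = max(ratios.values())
        return {group: ratio / highest >= 0.8 for group, ratio in ratios.items()}

    # Hypothetical counts for a single cognitive-test cutoff.
    ratios = {
        "group_1": selection_ratio(hired=40, applicants=100),
        "group_2": selection_ratio(hired=24, applicants=100),
    }
    print(four_fifths_check(ratios))  # {'group_1': True, 'group_2': False}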

Personality Measures

Personality measurement has been extensively researched, and practitioners continue to explore the practical value of personality for predicting leadership success. Within personnel selection, personality predictors can be roughly divided into two categories: (a) general measures of adult personality (e.g., NEO-PI, 16PF, HPI) that are intended to provide a comprehensive measure of the full range of personality and (b) more narrow measures of personality (such as integrity tests, violence scales, drug and alcohol scales, etc.) that are used to predict individual differences in specific categories of behavior such as theft and absenteeism (Salgado et al., 2001). Despite the extensive research on the Big Five for predicting job performance (e.g., Barrick & Mount, 1991) and relatively high validity coefficients for both conscientiousness (.31; Schmidt & Hunter, 1998) and integrity tests (.41; Ones, Viswesvaran, & Schmidt, 1993) for entry-level and professional jobs, these measures may have low validity for management jobs depending upon how the construct (e.g., conscientiousness) is defined (Hough & Oswald, 2000). For example, Hogan and Ones (1997) defined conscientiousness as conformity and socially prescribed impulse control. On the basis of this definition, Hough and Oswald believed that conscientiousness would not predict performance in jobs in which creativity and innovation are highly important (characteristics of many leadership positions). Although Ones and Viswesvaran (1996) argued that broad personality domains are better than narrow domains for predicting performance across job levels, others have shown that conscientiousness was not a valid predictor of managerial performance (Robertson, Barron, Gibbons, MacIver, & Nyfield, 2000). In contrast to Robertson et al. (2000), Bartram (2004) indicated that scores on scales of the Occupational Personality Questionnaire (OPQ) (SHL, 2015) and ratings of work behavior on the Inventory of Management Competencies showed an average uncorrected validity of .48, with a range of .29 to .69 (zero-order correlations). Additionally, personality measures have also been shown to predict leadership style (Hogan & Kaiser, 2005). More recently, Hogan, Davies, and Hogan (2007) proposed a conceptual model that links certain personality variables to workplace behaviors. They outlined various strategies for utilizing the validity evidence from prior research to apply to other positions, and they used research from managerial jobs as examples. Although there is some debate as to the level of analysis that should be used (e.g., Robertson & Smith, 2001), and there have been some conflicting findings regarding the validity for leader selection, personality measures (whether conscientiousness or integrity) add a degree of validity (i.e., incremental validity) over and beyond cognitive ability. An advantage of personality measures over cognitive ability measures is that personality measures do not demonstrate large group differences that can drive adverse impact to the same extent as other measures (Hogan & Hogan, 1995). As with cognitive ability tests, various group differences tend to be associated with personality measures; however, these differences tend to involve sex rather than race. Indeed, as noted previously, personality tests tend not to show significant group differences (i.e., potential for adverse impact) with regard to racial groups. For example, Ones and Viswesvaran (1998) compared the scores of African Americans, Hispanics, Native Americans, Asians, and Whites and found trivial differences. They went on to note that group differences of these "trivial magnitudes" are not likely to cause any discernible adverse impact. With regard to sex differences and the Big Five facet of conscientiousness, women tend to score higher than men (Feingold, 1994). Hough, Oswald, and Ployhart (2001) note that women tend to score higher on "dependability" scales, whereas men tend to score higher on "achievement" scales. Similarly, Ones and Viswesvaran (1998) found that women tended to score higher than men on overt integrity tests. Overall, the use of personality measures for making employment decisions is accepted, and the validity evidence for certain scales is growing (Hogan, Hogan, & Roberts, 1996), especially for use in entry- and mid-level leader selection. Indeed, Bergner, Neubauer, and Kreuzthaler (2010) found that narrow traits added incremental validity beyond the Big Five in the prediction of managerial success (both salary progression and supervisory ratings).
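
The incremental validity argument above can be expressed as a simple hierarchical regression: fit performance on cognitive ability alone, then add the personality scale and examine the gain in variance explained. The Python sketch below uses simulated data and made-up effect sizes purely to show the computation; it is not a reanalysis of any study cited here.

    import numpy as np

    # Illustrative check of incremental validity: does a personality scale add
    # predictive power beyond cognitive ability? All data below are simulated.
    rng = np.random.default_rng(1)
    n = 300
    cognitive = rng.normal(size=n)
    conscientiousness = rng.normal(size=n)
    performance = 0.5 * cognitive + 0.25 * conscientiousness + rng.normal(size=n)

    def r_squared(X: np.ndarray, y: np.ndarray) -> float:
        """R-squared from an ordinary least squares fit with an intercept."""
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        return 1 - resid.var() / y.var()

    r2_cog = r_squared(cognitive.reshape(-1, 1), performance)
    r2_both = r_squared(np.column_stack([cognitive, conscientiousness]), performance)
    print(f"R2 cognitive only: {r2_cog:.3f}")
    print(f"Incremental R2 from personality: {r2_both - r2_cog:.3f}")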

Construct-Based Assessments (e.g., Situational Judgment Tests)

Situational judgment tests (SJTs) are characterized by items that provide a work-related scenario and then ask test takers to choose among a list of actions that respond to the scenario. These tests of decision making and judgment in work settings can be constructed as low-fidelity job simulations (Salgado et al., 2001) and are used primarily at lower levels of management (Howard, 2006; Weekley, Ployhart, & Holtz, 2006). Indeed, McDaniel, Morgeson, Finnegan, Campion, and Braverman (2001) estimated the population validity of SJTs with job performance at .34 for leader and non-leader jobs. SJTs have also been shown to provide incremental validity over personality, cognitive ability, and experience measures (Chan & Schmitt, 2002; Clevenger, Pereira, Wiechmann, Schmitt, & Harvey, 2001; McDaniel et al., 2001), and applicants (as well as employers) tend to react positively to SJTs due to the face validity of the content and the perception of the test as job-related (Kluger & Rothstein, 1993; Ployhart & Ryan, 1998). Indeed, recent meta-analytic results also showed that the reduced fidelity of SJTs (as compared to high-fidelity assessment centers) did not impact their criterion-related validity (Christian, Edwards, & Bradley, 2010). Despite widespread research on and applied use of SJTs, there is still limited consensus on what might be considered best practice in the writing, scoring, and use of SJTs (Weekley, Ployhart, & Holtz, 2006). Indeed, some have called for more construct-based SJTs (e.g., Ployhart, 2006; Schmitt & Chan, 2006), and some consulting organizations (e.g., DDI) have already leveraged a variation on SJT items in which respondents are presented with a leadership situation that is based on well-defined competency constructs and asked to evaluate action statements (e.g., from very effective to very ineffective).

Another related, although different, construct that has attracted considerable attention is emotional intelligence. Specifically, emotional intelligence (Goleman, 1996) refers to the ways in which people perceive, understand, and manage emotion. Sackett and Lievens (2008) noted that this construct has received the greatest attention in both the practitioner and academic literature; however, ambiguity about its definition, dimensions, and operationalization has led to considerable scrutiny. This criticism is also the result of questionable claims of validity and incremental validity (e.g., Landy, 2005; Matthews, Roberts, & Zeidner, 2004; Mayer, Roberts, & Barsade, 2008). While consistently defining the construct has presented challenges for researchers, practitioners (and organizations) have noted a clear link between this construct and success as a leader. Adele Lynn (2005), in her book The EQ Difference, highlights how people's behavior can affect feelings, how feelings can influence performance, and how performance on the job can be enhanced through positive behaviors. Lynn also notes that emotional intelligence is certainly not the only factor that determines success as a leader and that it must be blended with other critical criteria. Van Rooy and Viswesvaran (2004) found that, generally, emotional intelligence measures produce a meta-analytic mean correlation of .23 with performance; however, this included measures of performance in many domains beyond just job performance. Advances in measuring emotional intelligence as a construct can expand our ability to effectively predict success as an entry- or mid-level leader.
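
One way to score the rate-the-effectiveness item format mentioned above is to compare a candidate's effectiveness ratings with a subject-matter-expert consensus key and convert the distance into a score. The Python sketch below is a generic illustration with invented keys and responses; it is not DDI's or any other vendor's proprietary scoring algorithm.

    # Illustrative scoring of a rate-the-effectiveness SJT item against an SME key.
    # Keys, responses, and the scoring rule are hypothetical.
    sme_key = {"action_a": 5, "action_b": 2, "action_c": 4}   # SME consensus (1-5)
    candidate = {"action_a": 4, "action_b": 1, "action_c": 4}  # candidate ratings (1-5)

    def sjt_item_score(response: dict, key: dict) -> float:
        """Higher score = candidate ratings closer to the SME consensus."""
        max_distance = 4 * len(key)  # worst possible total distance on a 1-5 scale
        distance = sum(abs(response[a] - key[a]) for a in key)
        return 1 - distance / max_distance

    print(round(sjt_item_score(candidate, sme_key), 2))  # 0.83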

Assessment Centers

Assessment centers (ACs) have a long and varied history in both the selection and development of leaders within organizations. Assessment center refers to an evaluation method or process that includes multiple exercises designed to assess dimensions (i.e., competencies, the categories of behavior associated with success or failure in the target position) and to simulate critical managerial job activities (Bray, 1982; Kuncel & Sackett, 2014; Thornton, Rupp, & Hoffman, 2014). An AC simulates critical and representative job challenges. It may include written simulations (e.g., in-basket, fact finding, analysis, and planning exercises) and interactive simulations (e.g., role-plays, presentations, group discussions, and business games; Howard, 2006; Kuncel & Sackett, 2014; Thornton, Rupp, & Hoffman, 2014). The groundbreaking work with ACs was the Management Progress Study at AT&T, which led to the use of ACs as an aid in selecting first-line supervisors (Bray & Howard, 1983).

In contemporary virtual ACs, participants interact with a diverse set of trained assessors who role-play direct reports, peers, and customers. In some implementations, role players are replaced by highly engaging virtual interactions, giving participants 24/7 access to the assessment experience. Participants working online engage in a series of activities that simulate those commonly faced by front-line and mid-level managers on the job. They get information from a corporate intranet, video clips, and e-mails. Innovative online tools help participants coach and lead their teams, investigate problems, plan and prioritize work activities, and deploy resources to meet deadlines. Performance in this process helps predict who will succeed in meeting these new leadership challenges. Participants are given detailed feedback on the likely impact of leadership and managerial behaviors on the direct reports, peers, and managers with whom they currently work or will work in the future.

ACs have been shown to demonstrate an impressive record of predictive validity (.37; Schmidt & Hunter, 1998) for managerial selection. Thornton and Rupp (2006) indicated that estimates of the relationship between AC ratings and management success range from .31 to .43, and Gaugler, Rosenthal, Thornton, and Bentson (1987) found in their meta-analysis an upper bound of .63 under "optimal" conditions. AC researchers and practitioners disagree about the appropriate way to approach AC research. Most practitioners agree that competencies are categories of behavior related to job success and not psychological constructs. Most research, however, treats competencies as constructs, and factor-analytic studies indicate that the key factors emerging from an analysis of AC data are related to exercises rather than to the competencies that are the assessment target (Robertson & Smith, 2001). This issue around the construct validity of AC ratings has persisted for nearly 30 years, influenced heavily by Sackett and Dreher (1982). Their observations led to numerous studies that almost universally confirmed the notion that the scoring of AC exercises was more appropriate than the scoring of dimensions (i.e., competencies) overall (e.g., Lance, 2008). This general conclusion in turn led to additional research focused on the development of design and training techniques, which helped improve the construct validity of dimension ratings but still left the exercises as the dominant factor (e.g., Bowler & Woehr, 2006; a full review of the AC construct validity issue can be found in Jackson, Lance, & Hoffman, 2012). Thornton and Rupp (2012) continued to argue that dimension ratings should be the primary focus because of their critical role for prediction, diagnosis, and development purposes. Indeed, recently, Kuncel and Sackett (2014) developed a framework in which multiple exercise ratings were aggregated into an overall dimension rating, and this eliminated the finding that exercise variance dominates dimension variance. With this framework, they showed that dimension scoring can be psychometrically appropriate under many conditions and that the aggregation approach can yield overall dimension scores dominated by dimension rather than exercise variance. The findings from this research essentially brought an end to the three-decade-long debate and justified the shift in focus from the construct validity of the exercises to the construct validity of the overall dimension ratings.

Particular attention has been given to group differences associated with ACs. The findings from this research have generally been mixed, with a relatively even split between studies indicating that women scored somewhat higher than men and those showing no significant differences (Anderson, Lievens, van Dam, & Born, 2006). Anderson and colleagues (2006) presented an overview of the gender differences research over a 20-year period. From a leadership perspective, Bobrow and Leonards (1997) developed an AC for first-line supervisors in a customer service division (i.e., requiring substantial interpersonal skills) and found no differences between Whites and minorities. Similarly, Hoffman and Thornton (1997) reported that, although Whites tend to score higher on overall AC ratings, the differences are typically lower than those found with cognitive ability. More recently, Anderson and colleagues (2006) examined gender differences in ACs for officer entry in the British Army and found that women were rated higher on interpersonally oriented leadership constructs (e.g., communication, interaction skills) as well as on drive and determination. It is generally agreed that the racial differences that do appear are associated with the measurement of cognitive components (Hough, Oswald, & Ployhart, 2001; Ryan & Ployhart, 2014).
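
In the spirit of the aggregation idea described above (though far simpler than Kuncel and Sackett's psychometric treatment), the Python sketch below rolls hypothetical exercise-by-dimension ratings up into overall dimension scores by averaging across exercises. Exercise names, dimension labels, and ratings are all invented for illustration.

    import statistics

    # Simplified sketch of rolling exercise-level dimension ratings up into overall
    # dimension scores; ratings and labels are hypothetical.
    ratings = {  # exercise -> {dimension: rating on a 1-5 scale}
        "in_basket":  {"planning": 4, "coaching": 3, "judgment": 4},
        "role_play":  {"planning": 3, "coaching": 4, "judgment": 3},
        "group_disc": {"planning": 4, "coaching": 4, "judgment": 5},
    }

    def overall_dimension_scores(exercise_ratings: dict) -> dict:
        """Average each dimension across all exercises to form overall dimension scores."""
        dimensions = next(iter(exercise_ratings.values())).keys()
        return {
            dim: round(statistics.mean(ex[dim] for ex in exercise_ratings.values()), 2)
            for dim in dimensions
        }

    print(overall_dimension_scores(ratings))  # one aggregated score per dimension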

Interviews

Interviews are the most frequently used procedures in personnel selection across all countries, jobs, and levels (McDaniel, Whetzel, Schmidt, & Maurer, 1994; Salgado et al., 2001) and are likely the most frequently utilized method for leadership selection. It is estimated that practically 100% of selection processes use one or more interviews, although not all types of interviews are as valid, or even as useful, as others. The employment interview has been the target of significant research (e.g., McDaniel et al., 1994). Krajewski and colleagues (2006) compared the validity of situational versus past-experience interviews for predicting managerial performance. Using a sample of 157 applicants to managerial positions, they found that the experience-based interview significantly predicted overall performance (.32), whereas the situational interview did not (.09). Additionally, in an examination of the construct differences between the two interview types, Krajewski and colleagues also showed that the experience-based interviews were highly related to manager-relevant work sample measures (i.e., AC exercises), cognitive ability facets, and personality traits.

Another interviewing trend many practitioners encounter is the use of panel or team-based interviews, especially when selecting leaders into an organization. Stakeholders often have strong perceptions about panel versus one-on-one interviews, and while panels offer some advantages, they also carry disadvantages. Some of the perceived advantages of panel interviews are that they (a) indicate to the candidate that collaboration is an important value in the organization; (b) provide an opportunity for more people to meet and/or collect data about the candidate; (c) reduce time; and (d) give those who are not asking the questions an opportunity to observe the candidate and refine their own follow-up questions. While many practitioners believe these advantages outweigh any disadvantages, there are potential challenges with treating this approach as a best practice. In particular, even though each interviewer spends an allotted amount of time with the candidate, that time is less than it would be in a one-on-one interview, which ultimately translates into an inefficient use of each interviewer's time and limits the opportunity to gather more comprehensive data. An independent interviewer brings unique data to data integration sessions, and that can be lost with panel interviews. Finally, a key potential disadvantage revolves around the candidate experience: panel interviews can be intimidating, which could affect the candidate's performance during the data gathering. While Sackett and Lievens (2008) noted a focus on interview structure and construct measures, more recent research has shifted to impression management during the interview (e.g., Kleinmann & Klehe, 2011; Stewart, Darnold, Barrick, & Dustin, 2008).

Ultimately, the most effective selection system will use various methods, and this is especially true for entry- and mid-level leadership jobs in which the job requires balance among experience, knowledge, interpersonal competencies, leadership motivation, and personality. On the basis of Schmidt and Hunter (1998), "incremental validity" can be translated into increases in utility (i.e., practical value). The issue of the incremental validity provided by different methods is useful for assessing the extent to which combinations of methods are useful or, by contrast, overly redundant. For example, personality tests and assessment simulations measure different job requirements. Using both methods together should produce incremental validity, thereby leading to a stronger relationship with leader performance.
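
The link between incremental validity and utility mentioned above is usually expressed with the Brogden-Cronbach-Gleser utility model. The Python sketch below plugs entirely hypothetical numbers into that model to estimate the dollar gain from adding a method that raises operational validity; none of the figures come from this chapter or from Schmidt and Hunter (1998).

    def utility_gain(n_selected, years, validity_increase, sd_y, mean_z_selected,
                     n_applicants, cost_per_applicant):
        """Brogden-Cronbach-Gleser style estimate of the dollar gain from adding a
        selection method that raises operational validity (all inputs hypothetical)."""
        benefit = n_selected * years * validity_increase * sd_y * mean_z_selected
        return benefit - n_applicants * cost_per_applicant

    # Hypothetical numbers: 20 leaders hired per year from 200 applicants, average
    # tenure 5 years, SDy = $40,000, mean predictor z-score of those selected = 1.0,
    # and adding a simulation raises validity by .10 at $300 per applicant.
    print(utility_gain(20, 5, 0.10, 40_000, 1.0, 200, 300))  # 340000.0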

CASE STUDIES OF LEADERSHIP SELECTION

The final section of this chapter will examine common, high-stakes organizational contexts in which selection systems are likely to be deployed. Two cases are described: one that illustrates high-velocity hiring of leaders for an organization start-up and a second that illustrates a promotional process for leadership succession.

Case 1. High-Velocity Hiring for Entry- and Mid-Level Leadership Positions

Most private sector organizations, if successful, face the prospect of starting up a new facility, plant, or store. This is a positive outcome of success and growth in the business. Although there are many positive aspects to growth, in these instances the pressure for immediate success is very high. Senior leaders are under pressure to make sound expansion decisions with good return on investment (ROI). They must choose the right site, pick the right product mix, install the right technology, and create the right culture. In this complex mix of business issues is a unique opportunity to hire the right people—the first time. For "greenfield" facility start-ups, if executed in a well-planned way, a new culture can be more easily created because there is no existing culture to change. This situation can be contrasted with "brownfield" or retrofit work, in which existing facilities and incumbent employees need to be pointed in a new direction. In this less enviable situation, current operations must overcome the natural inertia caused by years, if not decades, of work conducted in the older operating style and culture.

The authors have worked on many facility start-ups. In our experience, these capital-intensive projects are entered into with a high degree of hope for return as well as incredible pressure on the people involved to be successful. One "greenfield" start-up in particular had this mix of factors at play from the start. The goal of this company's start-up in the Midwest was to build a mid-size automobile engine at better quality and lower cost. The pressure was high given that most of the company's high-quality engines were built outside of the U.S. If quality and cost goals could not be achieved, the plant would be seen as a failure. Early in the planning for the plant, it was recognized that a new style of manufacturing would be required to achieve the goals. With this in mind, the plant start-up team set out to define a new work culture. In the new plant, lean operating procedures would be the core concept that defined requirements for people and teams. Identification and elimination of waste is a core concept of lean manufacturing. This requires everyone in the plant to work together to follow established operating procedures and to put improvements in place on a fast, continuous basis. Leader success required higher levels of teamwork, empowerment, coaching, and initiative. The management team struggled to break away from past practices to define this new working culture.

A job analysis was conducted with the new management team. A series of "visionary job analysis" discussions was held with the plant manager and his direct reports, the functional managers within operations, engineering, maintenance, and HR. Targets were set for the behavioral, motivational, and technical skills required of leaders. It was recognized that front- and mid-level leaders would be critical for creating the desired culture and for executing the operating model that the senior leaders had established. A rigorous program was required to identify those leaders who would accept and excel in this progressive manufacturing environment. The recruitment effort was complicated by the fact that leader candidates would be selected from existing manufacturing facilities (where older manufacturing practices were the norm). It was critical to select leaders with the right skills and dispositions for the new culture.

The first step in the hiring process was a comprehensive job application form that covered work experience and technical skills. Special care was taken to fill the selection funnel with a broad pool of leader applicants to achieve a high number of people with the potential to display the right skill set and diversity mix. Screening of applicants was limited to minimal education achieved, technical skill requirements, and eligibility to work in the U.S. The next step involved a comprehensive test battery that targeted behavioral, personality, and motivational competencies consistent with a lean manufacturing environment. Candidates were prioritized for the next step according to "fit" (as measured by the test battery) with the defined roles leaders would play in the plant. A third step employed a day-in-the-life AC. This simulation-based set of exercises involved pre-work about the fictitious company's operation (market, competitors, structure, and culture) used in the simulation, an in-basket exercise that challenged the candidate on issues ranging from planning the production schedule to dealing with HR issues, a coaching exercise to improve a direct report's performance, and a peer exercise requiring partnering and negotiating skills. The AC was designed to reflect the operating environment of the new plant and give the candidates the opportunity to display behaviors required in leadership roles. Assessors were contractors trained in the assessment process and the client's business context. The benefit of the AC was realized in two ways. First, candidates had the chance to experience a realistic preview of the leadership job for which they were applying. Second, assessors had the chance to see how candidates performed in exercises that were very similar to the target job. The final step in the selection process was a behavior-based interview, during which candidates described how they had performed in past jobs. Each candidate participated in two one-on-one structured interviews conducted by a line or HR manager.
Interview questions were designed to elicit information about target competencies. This provided candidates with the opportunity to describe their previous work and the results they had achieved in these situations. Answers were evaluated against the target job requirements, with relevancy, recency, and similarity to the target job used as guiding evaluation criteria. At the end of the process, all of the data were integrated by a selection panel of line managers facilitated by an HR specialist. Successful candidates were given a contingent job offer (candidates needed a successful reference check and drug screen to get the job). The plant management team recognized the importance of selecting the right leaders for the eventual success of the facility. The selection of the first leaders to come on board in a new facility is especially critical, as they play multiple roles early in the start-up and set the right tone for the desired culture. For this plant start-up, new leaders were supported with additional training on the concepts of lean manufacturing, coaching, interviewing, and team skills.

The results at this plant have been impressive. To date, the new engine manufacturing facility is on target to reach its production goals and has started up on time and on budget, due in part to the successful hiring of its first wave of leaders. The workforce is measured on safety (number of safety violations and on-the-job injuries), quality (number of defects and achievement of quality goals), and engagement (a workforce survey conducted yearly). Engagement levels at the new plant are among the highest across the organization's network of plants. The plant management team attributes the high engagement level of the workforce to the quality of the front-line and mid-level leaders. This benchmark facility is held up as an example of how a new start-up should be implemented and of the culture of the future.

Case 2. A Leadership Pipeline Approach for Entry- and Mid-Level Leadership Positions

More and more organizations are executing talent management strategies to close the leadership readiness gap at entry- and mid-level leader positions discussed previously (see Trend 5 above). They achieve this by getting individuals ready to face the challenges encountered at this level prior to promotion. A robust pipeline for entry-level leaders encourages promotion from within (which can be less risky than hiring from the outside) and demonstrates to employees with potential that the company supports a "grow from within" strategy. The pipeline concept received considerable attention as a result of the book The Leadership Pipeline: How to Build the Leadership Powered Company (Charan, Drotter, & Noel, 2000) and was later expanded upon in Grow Your Own Leaders and Leaders Ready Now (Byham, Smith, & Paese, 2002; Paese, Smith, & Byham, 2016). According to Byham, Concelman, and Cosentino (2007), the leadership pipeline can be defined as "a critical business process that provides organizations with a sustainable supply of quality leaders (at all levels) to meet the challenges of today and tomorrow" (p. 3). A strong pipeline is an integrated assessment and development approach supported by senior management. It is not a single program or tool; rather, it is a process that provides the right quantity and quality of leaders in time to step up and meet pressing business challenges.

Traditional methods of succession management used at the senior level tend to fall apart when applied to lower organizational levels due to lack of scalability. At higher levels there are fewer candidates, all of whom have known track records. High-touch, lengthy assessment programs and development plans tailored to each individual can be developed to support transitions into executive levels of leadership. A scalable pipeline strategy is needed at first- and mid-level leadership because leadership assessment and development processes need to be applied to potentially large numbers of first- and second-level leader candidates. We believe that an effective pipeline approach must (a) focus on early identification of leadership potential and readiness and (b) provide individuals with accelerated development prior to their promotion so they are confident in their leadership skills on day one. Practitioners responsible for entry- and mid-level career management look to mitigate the risk of early-leadership failure by integrating assessment and development solutions. The pressure to demonstrate payback to the company for the expense of these programs in terms of time and money is significant, and ROI analyses are critical for sustained implementations.

The company described here took a programmatic approach to leadership pipeline management. This Fortune 500 technology company was interested in identifying individuals in its sales force who had the motivation and potential to be effective district managers, a first-level leader position. The company was committed to a "grow your own" leadership strategy because it recognized that internally promoted leaders performed more effectively than those hired from the outside. The organization's primary business involved advising potential clients on a number of technically complex products and supporting their launch in the clients' organizations.
Leaders with minimal technical and sales experience specific to the company lacked credibility with the sales associates and were not effective coaches. Unfortunately, individuals promoted from within (under the process then in place) were only marginally more effective as leaders. This was disconcerting to HR leaders because they spent considerable resources on training internal leader candidates prior to their promotion. Candidates for this leadership training program were picked on the basis of top management's judgment. These strategies were not working as the company faced new market opportunities, new competitors, and an increasingly complex set of market offerings. The first step taken to improve the program was to develop a success profile for leaders. The selection criteria for new leaders had not changed for many years, whereas the business
challenges for sales leaders had changed substantially. Working with senior sales leaders, the authors identified the business drivers for sales leadership positions. Business drivers represent the critical activities on which incumbents must focus to successfully implement the company's sales strategy. In this case, they included better targeting of opportunities, establishing broader client networks, and insight selling. The data from the visionary job analysis formed the basis for success profiles—the specific competencies, experiences, knowledge, and personal attributes needed to address current leadership challenges. These conclusions were confirmed by successful managers. Key job activities related to each performance area were documented and confirmed by senior managers. On the basis of this success profile, tools were developed and a new promotion process was designed. The program included the following:

1. Realistic job preview and assessment of motivational fit for leadership. Current sales associates who had been above-average performers for three years had access to an online leadership career site. On the site, they obtained a balanced view of a career in sales leadership. This site included insights from current successful sales leaders about the transition challenges they had faced. Without identifying themselves, users had access to a motivational fit inventory in which they documented the degree to which they liked various work activities. Their responses were computer-scored, and they were given immediate and confidential access to their motivational matches and mismatches with a leadership career. The profile gave associates data and insights to help them make better-informed decisions about their fit with a leadership career. The associates were encouraged, but not required, to share the results with their managers so the managers could help them make the best possible decision about pursuing a leadership career.

2. Assessment of leadership dispositions. When they decided to continue in the process, the associates documented their relevant experience and knowledge online and completed a leadership insights inventory that was predictive of most of the required leadership competencies. The inventory consisted of a variety of item types, including situational judgment, biodata, and personality items. The items were grouped into subscales that mapped back to the target success profile dimensions. Only the managers of candidates had access to the results. Based on the assessment results, candidates had varying degrees of readiness for a leadership career. These managers received training in interpreting the results and provided feedback to candidates. Managers were required to have a feedback discussion with candidates. Managers were to try to influence candidates' career choices, but the final decision to proceed was left to the candidates.

3. Online training. Candidates had access to online leadership courses that they completed at their own pace and on their own time. After candidates completed the coursework, they were encouraged to discuss their training results with their managers and decide jointly if they were ready for the next step in the process.

4. Competency assessment. Candidates who decided to proceed in the process had access to an online assessment of the new sales leader competencies.
The online assessment asked candidates to respond to a series of leadership challenges by rating the effectiveness of various actions to address each challenge. The leadership challenges provided to candidates were tailored to challenges identified by the job analysis process. Responses were computer-scored, and results were provided to a promotional panel, who conducted a behavioral interview to further evaluate readiness. The promotional panels integrated the interview and assessment results in order to make the best decisions. Candidates were provided with feedback. If the decision was not to proceed, there was a career planning discussion.

5. Ongoing leadership training. Candidates placed in a promotion pool had access to more in-depth leadership training.

In a concurrent validity study, 153 randomly selected incumbent managers completed the competency assessment. Ratings of the leadership competencies of the participating managers were made by their direct supervisors. The correlation between the predictor (the competency assessment) and criterion (the ratings of supervisors) was .56. Under the new process, satisfaction with the slate of candidates and success rates in the final promotional interview were much higher. There was no increase in turnover among sales associates who did not succeed in the final steps of the promotional process, suggesting that those not selected saw the process as fair. Performance in pre-promotional training was substantially better than before the process redesign.
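
To make the concurrent validation step concrete, the Python sketch below (which needs Python 3.10+ for statistics.correlation) shows how a validity coefficient of this kind is computed from paired assessment scores and supervisor ratings. The ten data points are invented solely to illustrate the calculation; they are not the study's data, and a real analysis would use the full incumbent sample and consider corrections for range restriction and criterion unreliability.

    import statistics

    # Illustrative computation of a concurrent validity coefficient: incumbent
    # managers' competency-assessment scores paired with supervisor ratings.
    # The data below are made up; only the method mirrors the study described above.
    assessment_scores = [62, 74, 58, 81, 69, 77, 55, 90, 66, 72]
    supervisor_ratings = [3.1, 3.8, 2.9, 4.2, 3.3, 4.0, 2.7, 4.5, 3.5, 3.6]

    validity = statistics.correlation(assessment_scores, supervisor_ratings)
    print(f"Concurrent validity estimate: {validity:.2f}")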

CONCLUSIONS

Current trends in business suggest that the demand for high-quality leaders is high, the complexity of their jobs has increased, and the process for readying future leaders is more difficult for organizations to implement. Selection processes that have multiple phases and methods have the greatest likelihood of success. These processes include well-developed and validated screening tools (e.g., cognitive tests, biodata instruments, personality and/or situational tests) accompanied by more in-depth evaluations, such as simulations, assessment centers, and structured interviews. Sound leadership selection processes that are tied to development have the greatest impact on performance, especially when there is a well-planned implementation strategy (e.g., the way the need for assessment is communicated to participants; transparency of results with those impacted by the assessment; clear accountability for participants, managers, and HR; alignment with other HR systems; and success metrics that can be used to demonstrate ROI). As the case studies demonstrate, differing organizational needs and contexts, such as start-ups and leadership pipelines, have differing demand characteristics that shape the tools and processes used and the implementation strategy. Other organizational contexts, such as mergers and acquisitions and the desire to improve employee and customer engagement, also have a differential impact on the assessment targets, as well as implications for how they are measured and implemented.

It is clear that private organizations are not all in the same place when it comes to improving the performance of leaders. Although there are bright spots that can be pointed to as examples that others should follow, the lack of a systematic approach to identifying, selecting, and developing leaders provides opportunity for the future. Leadership is a topic that has been written about extensively in the academic and popular business press, and there is no lack of theory or advice on defining leadership or conceptualizing what steps should be taken to improve leadership performance. There is, however, a lack of agreement on the best way to assess leadership potential and performance and how to get individuals ready for entry- and mid-level leadership roles. To be useful, future practice and research should seek to evaluate the specific tools, processes, and implementation strategies that will create the best ROI within specific organizational contexts. Modern, successful selection systems balance the science of leadership selection with the practical realities of the business environment. Sustainable, long-term impact is achieved by taking a holistic and practical approach to interpreting organizational context, weighing the potential impact of various selection tools, and rigorously executing the implementation plan.

REFERENCES

Anderson, N., Lievens, F., van Dam, K., & Born, M. (2006). A construct-driven investigation of gender differences in a leadership-role assessment center. Journal of Applied Psychology, 91, 555–566.
Bandura, A. (2006). Social cognitive theory. In S. Rogelberg (Ed.), Encyclopedia of industrial/organizational psychology. Beverly Hills, CA: Sage.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26.
Bartram, D. (2004). Assessment in organizations. Applied Psychology: An International Review, 53, 237–259.
Becton, J. B., Matthews, M. C., Hartley, D. L., & Whitaker, D. H. (2009). Using biodata to predict turnover, organizational commitment, and job performance in healthcare. International Journal of Selection and Assessment, 17, 189–202.
Bergner, S., Neubauer, A. C., & Kreuzthaler, A. (2010). Broad and narrow personality traits for predicting managerial success. European Journal of Work and Organizational Psychology, 19, 177–199.
Bernthal, P. R., & Erker, S. (2005). Selection forecast: Recruiting and hiring talent. Pittsburgh, PA: Development Dimensions International.
Berry, C. M., Clark, M. A., & McClure, T. K. (2011). Racial/ethnic differences in the criterion-related validity of cognitive ability tests. Journal of Applied Psychology, 96, 881–906.
Bersin, J. (2015). Predictions for 2015 (Research report). Deloitte/Bersin Consulting. Retrieved from http://blog.bersin.com/predictions-for-2015-redesigning-the-organization-for-a-rapidly-changing-world/
Bobrow, W., & Leonards, J. S. (1997). Development and validation of an assessment center during organizational change. Journal of Social Behavior and Personality, 12(5), 217.
Bowler, M. C., & Woehr, D. J. (2006). A meta-analytic evaluation of the impact of dimension and exercise factors on assessment center ratings. Journal of Applied Psychology, 91, 1114–1124.
Bray, D. W. (1982). The assessment center and the study of lives. American Psychologist, 37, 180–189.
Bray, D. W., & Howard, A. (1983). The AT&T longitudinal studies of managers. New York, NY: The Guilford Press.
Breaugh, J., Labrador, J., Frye, K., Lee, D., Lammers, V., & Cox, J. (2014). The value of biodata for selecting employees: Comparable results for job incumbent and job applicant samples. Journal of Organizational Psychology, 14(1), 40–51.
Burke, R. J., & Cooper, C. L. (2000). The organization in crisis: Downsizing, restructuring, and privatization. Hoboken, NJ: Blackwell.
Byham, T. M., Concelman, J., & Cosentino, C. (2007). Optimizing your leadership pipeline. Pittsburgh, PA: Development Dimensions International.
Byham, W. C., Smith, A. B., & Paese, M. J. (2002). Grow your own leaders. Upper Saddle River, NJ: Prentice Hall.
Campbell, J. P., McHenry, J. J., & Wise, L. L. (1990). Modeling job performance in a population of jobs. Personnel Psychology, 43, 313–333.
Carlson, K. D., Scullen, S. E., Schmidt, F. L., Rothstein, H., & Erwin, F. (1999). Generalizable biographical data validity can be achieved without multi-organizational development and keying. Personnel Psychology, 52, 731–755.
Chakraborty, R., & Rudbeck, S. (2014). Career management: Making it work for employees and employers. TowersWatson. Retrieved from https://www.towerswatson.com/en/Insights/Newsletters/Europe/HRmatters/2014/12/Career-management-Making-it-work-for-employees-and-employers
Chan, D., & Schmitt, N. (2002). Situational judgment and job performance. Human Performance, 15, 185–199.
Charan, R., Drotter, S., & Noel, J. (2000). The leadership pipeline: How to build the leadership powered company. Hoboken, NJ: Jossey-Bass.
Christian, M. S., Edwards, B. D., & Bradley, J. C. (2010). Situational judgment tests: Constructs assessed and a meta-analysis of their criterion-related validities. Personnel Psychology, 63, 83–117.
Clevenger, J., Pereira, G., Wiechmann, D., Schmitt, N., & Harvey, V. S. (2001). Incremental validity of situational judgment tests. Journal of Applied Psychology, 86, 410–417.
Cortina, J. M., & Luchman, J. N. (2013). Personnel selection and employee performance. In N. W. Schmitt & S. Highhouse (Eds.), Handbook of psychology, Vol. 12: Industrial and organizational psychology (pp. 143–183). Hoboken, NJ: John Wiley & Sons.
Feingold, A. (1994). Gender differences in personality: A meta-analysis. Psychological Bulletin, 116, 429–456.
Furnham, A. (2008). HR professionals' beliefs about and knowledge of assessment techniques and psychometric tests. International Journal of Selection and Assessment, 16, 300–305.
Gatewood, R. D., Feild, H. S., & Barrick, M. (2011). Human resource selection. Mason, OH: Thompson Southwestern.
Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology, 72(3), 493–511.
Goleman, D. (1996). Emotional intelligence. London, England: Bloomsbury.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence Erlbaum.
Hoffman, C. C., & Thornton, G. C. (1997). Examining selection utility where competing predictors differ in adverse impact. Personnel Psychology, 50, 455–470.
Hogan, J., Davies, S., & Hogan, R. (2007). Generalizing personality-based validity evidence. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 181–229). Hoboken, NJ: John Wiley & Sons.
Hogan, J., & Ones, D. S. (1997). Conscientiousness and integrity at work. In The handbook of personality psychology (pp. 849–870). New York, NY: Academic Press.
Hogan, R., & Hogan, J. (1995). Hogan Personality Inventory manual. Tulsa, OK: Hogan Assessment Systems.
Hogan, R., Hogan, J., & Roberts, B. W. (1996). Personality measurement and employment decisions: Questions and answers. American Psychologist, 51, 469–477.
Hogan, R., & Kaiser, R. B. (2005). What we know about leadership. Review of General Psychology, 9, 169–180.

Hollander, E. (2006). Influence processes in leadership-followership: Inclusion and the idiosyncrasy credit model. In D. A. Hantula (Ed.), Theoretical & methodological advances in social & organizational psychology: A tribute to Ralph Rosnow (pp. 293–312). Mahwah, NJ: Lawrence Erlbaum.
Hollander, E. (2008). Inclusive leadership and leader-follower relations: Concepts, research, and applications. New York, NY: Routledge/Psychology Press.
Hough, L. M., & Oswald, F. L. (2000). Personnel selection: Looking toward the future—remembering the past. Annual Review of Psychology, 51, 631–664.
Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detections and amelioration of adverse impact in personnel selection procedures: Issues, evidence and lessons learned. International Journal of Selection and Assessment, 9, 152–194.
Howard, A. (2006). Best practices in leader selection. In J. A. Conger & R. E. Riggio (Eds.), The practice of leadership: Developing the next generation of leaders (pp. 11–40). San Francisco, CA: Jossey-Bass.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72–98.
Hunter, J. E., Schmidt, F. L., & Judiesch, M. K. (1990). Individual differences in output variability as a function of job complexity. Journal of Applied Psychology, 75, 28–42.
Jackson, D. J. R., Lance, C. E., & Hoffman, B. J. (Eds.). (2012). The psychology of assessment centers. New York, NY: Routledge.
Johansen, B. (2009). Leaders make the future: Ten new leadership skills for an uncertain world. San Francisco, CA: Berrett-Koehler.
Katz, D., & Kahn, R. L. (1978). The social psychology of organizations. Hoboken, NJ: John Wiley & Sons.
Kleinmann, M., & Klehe, U.-C. (2011). Selling oneself: Construct and criterion-related validity of impression management in structured interviews. Human Performance, 24, 29–46.
Kluger, A. N., & Rothstein, H. R. (1993). The influence of selection test type on applicant reactions to employment testing. Journal of Business and Psychology, 8, 3–25.
Krajewski, H. T., Goffin, R. D., McCarthy, J. M., Rothstein, M. G., & Johnston, N. (2006). Comparing the validity of structured interviews for managerial-level employees: Should we look to the past or focus on the future? Journal of Occupational and Organizational Psychology, 79, 411–432.
Kramer, R. M. (1999). Trust and distrust in organizations: Emerging perspectives, enduring questions. Annual Review of Psychology, 50, 569–598.
Kuncel, N. R., & Sackett, P. R. (2014). Resolving the assessment center construct validity problem (as we know it). Journal of Applied Psychology, 99, 38–47.
Lance, C. (2008). Why assessment centers don't work the way they're supposed to. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 84–97.
Landy, F. J. (2005). Some historical and scientific issues related to research on emotional intelligence. Journal of Organizational Behavior, 26(4), 411–424.
Lefkowitz, J., Gebbia, M. I., Balsam, T., & Dunn, L. (1999). Dimensions of biodata items and their relationships to item validity. Journal of Occupational and Organizational Psychology, 72, 331–350.
Lombardi, M. (2013). Human capital management trends 2013: It's a brave new world. Aberdeen Group. Retrieved from http://www.aberdeen.com/assets/report-preview/8101-RA-human-capital-management.pdf
Lynn, A. (2005). The EQ difference: A powerful plan for putting emotional intelligence to work. New York, NY: AMACOM.
Marston, C. (2007). Motivating the "What's in it for me?" workforce. Hoboken, NJ: John Wiley & Sons.
Matthews, G., Zeidner, M., & Roberts, R. D. (2004). Emotional intelligence: Science and myth. Cambridge, MA: MIT Press.
Mayer, J. D., Roberts, R. D., & Barsade, S. G. (2008). Human abilities: Emotional intelligence. Annual Review of Psychology, 59, 507–536.
McDaniel, M. A., Morgeson, F. P., Finnegan, E. B., Campion, M. A., & Braverman, E. P. (2001). Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86, 730–740.
McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79, 599–616.
Mitchell, C., Ray, R. L., & van Ark, B. (2014, January). The Conference Board CEO Challenge® 2014: People and performance. New York, NY: The Conference Board.
Mitchell, S., Bolling, B., Phang, N., & Schott, T. (2013). Talent beyond borders: An organizational guide to delivering the promise of global talent management. Pittsburgh, PA: Development Dimensions International.
Modern Survey. (2016). The corporate trust crisis. Minneapolis, MN: Modern Survey.
Mount, M. K., Witt, L. A., & Barrick, M. R. (2000). Incremental validity of empirically keyed biodata scales over GMA and the five factor personality constructs. Personnel Psychology, 53, 299–323.
Mumford, M. D., Barrett, J. D., & Hester, K. S. (2012). Background data: Use of experiential knowledge in personnel selection. In N. Schmitt (Ed.), The Oxford handbook of personnel assessment and selection (pp. 353–382). New York, NY: Oxford University Press.

Mumford, M. D., Stokes, G. S., & Owens, W. A. (1990). Patterns of life history: The ecology of human individuality. Hillsdale, NJ: Lawrence Erlbaum.
Ones, D. S., & Viswesvaran, C. (1996). What do pre-employment customer service scales measure? Explorations in construct validity and implications for personnel selection. Paper presented at the Annual Meeting of the Society for Industrial and Organizational Psychology, San Diego, CA.
Ones, D. S., & Viswesvaran, C. (1998). Gender, age and race differences on overt integrity tests: Analyses across four large-scale applicant data sets. Journal of Applied Psychology, 83, 35–42.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: Findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology, 78, 679–703.
Oracle Corporation. (2012). Talent retention: Six technology enabled best practices. Retrieved from http://www.oracle.com/us/media1/talent-retention-6-best-practices-1676595.pdf
Paese, M. J., Smith, A. B., & Byham, W. C. (2016). Leaders ready now: Accelerating growth in a faster world. Bridgeville, PA: DDI Press.
Ployhart, R. E. (2006). Staffing in the 21st century: New challenges and strategic opportunities. Journal of Management, 32, 868–897.
Ployhart, R. E., & Ryan, A. M. (1998). Applicants' reactions to the fairness of selection procedures: The effects of positive rule violations and time of measurement. Journal of Applied Psychology, 83(1), 3.
Robertson, I. T., Barron, H., Gibbons, P., MacIver, R., & Nyfield, G. (2000). Conscientiousness and managerial performance. Journal of Occupational and Organizational Psychology, 66, 225–244.
Robertson, I. T., & Smith, M. (2001). Personnel selection. Journal of Occupational and Organizational Psychology, 74, 441–472.
Rothstein, H. R., Schmidt, F. L., Erwin, F. W., Owens, W. A., & Sparks, C. P. (1990). Biographical data in employment selection: Can validities be made generalizable? Journal of Applied Psychology, 75, 175–184.
Ryan, A. M., & Ployhart, R. E. (2014). A century of selection. Annual Review of Psychology, 65, 693–717.
Sackett, P. R., & Dreher, G. F. (1982). Constructs and assessment center dimensions: Some troubling empirical findings. Journal of Applied Psychology, 67(4), 401.
Sackett, P. R., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 419–450.
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929–954.
Salgado, J. F., Viswesvaran, C., & Ones, D. S. (2001). Predictors used for personnel selection: An overview of constructs, methods, and techniques. In N. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work, and organizational psychology. Vol. 1: Personnel psychology (pp. 165–199). Thousand Oaks, CA: Sage.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274.
Schmitt, N. (2014). Personality and cognitive ability as predictors of effective performance at work. Annual Review of Organizational Psychology and Organizational Behavior, 1, 45–65.
Schmitt, N., & Chan, D. (2006). Situational judgment tests: Method or construct? In J. Weekley & R. E. Ployhart (Eds.), Situational judgment tests (pp. 135–156). Mahwah, NJ: Lawrence Erlbaum.
Schmitt, N., & Golubovich, J. (2013). Biographical information. In K. F. Geisinger (Ed.), APA handbook of testing and assessment in psychology. Washington, DC: American Psychological Association.
Sinar, E. (2013). Leadership insights: A 10-year culmination of executive analytics. Pittsburgh, PA: Development Dimensions International.
Sinar, E., Wellins, R. S., Ray, R., Abel, A. L., & Neal, S. (2014). Ready-now leaders: 25 findings to meet tomorrow's business challenges. Development Dimensions International and The Conference Board. Retrieved from http://www.ddiworld.com/ddi/media/trend-research/global-leadership-forecast-2014-2015_tr_ddi.pdf?ext=.pdf
Society for Human Resource Management. (2009). SHRM's 2009 HR trend book. Alexandria, VA: Author.
Spencer, G. (2014). Career development framework at IBM. Delray Beach, FL: Brandon Hall Case Study.
Stewart, G. L., Darnold, T., Barrick, M. R., & Dustin, S. D. (2008). Exploring the handshake in employment interviews. Journal of Applied Psychology, 93, 1139–1146.
Stokes, G. S., Hogan, J. B., & Snell, A. F. (1993). Comparability of incumbent and applicant samples for the development of biodata keys: The influence of social desirability. Personnel Psychology, 46, 739–762.
Stricker, L. J., & Rock, D. A. (1998). Assessing leadership potential with a biographical measure of personality traits. International Journal of Selection and Assessment, 6, 164–184.

Thornton, G. C., & Rupp, D. E. (2006). Assessment centers in human resource management: Strategies for prediction, diagnosis, and development. Mahwah, NJ: Lawrence Erlbaum.
Thornton, G. C., & Rupp, D. E. (2012). Research into dimension-based assessment centers. In D. J. R. Jackson, C. E. Lance, & B. J. Hoffman (Eds.), The psychology of assessment centers (pp. 141–172). New York, NY: Taylor & Francis.
Thornton, G. C., Rupp, D. E., & Hoffman, B. J. (2014). Assessment center perspectives for talent management strategies. New York, NY: Routledge.
Tormala, Z. L., Jia, J. S., & Norton, M. I. (2012). The preference for potential. Journal of Personality and Social Psychology, 103(4), 567–583.
Van Iddekinge, C. H., Eidson, C. E., Kudisch, J. D., & Goldblatt, A. M. (2003). A biodata inventory administered via interactive voice response (IVR) technology: Predictive validity, utility, and subgroup differences. Journal of Business and Psychology, 18, 145–156.
Van Rooy, D. L., & Viswesvaran, C. (2004). Emotional intelligence: A meta-analytic investigation of predictive validity and nomological net. Journal of Vocational Behavior, 65(1), 71–95.
Weekley, J. A., Ployhart, R. E., & Holtz, B. C. (2006). On the development of situational judgment tests: Issues in item development, scaling, and scoring. In J. A. Weekley & R. E. Ployhart (Eds.), Situational judgment tests: Theory, measurement, and applications (pp. 157–182). Mahwah, NJ: Lawrence Erlbaum.
Wernimont, P. F., & Campbell, J. P. (1968). Signs, samples, and criteria. Journal of Applied Psychology, 52, 372–376.
Zibarras, L. D., & Woods, S. A. (2010). A survey of UK selection practices across different organization sizes and industry sectors. Journal of Occupational and Organizational Psychology, 83, 499–511.


34
BLUE-COLLAR SELECTION IN PRIVATE SECTOR ORGANIZATIONS
ROBERT P. MICHEL AND SHANNON BONNER

This chapter focuses on selection for blue-collar jobs. Several aspects of blue-collar jobs and blue-collar work environments make selection efforts different from those for white-collar or other types of jobs. Without a clear understanding of these contextual considerations, the selection practitioner's efforts at developing selection systems for blue-collar jobs can be undermined. Our goals in this chapter are to highlight the issues that make blue-collar selection unique, provide guidance to practitioners who deal with blue-collar selection, and suggest areas for future research that would help inform these efforts.

DEFINITION/BOUNDARIES

The term "blue-collar" has been used since the early 20th century to describe working-class jobs and contrast them with "white-collar" jobs that are professional or managerial in nature and typically occur in an office environment. Similar terms have been added to the lexicon over the years to describe other classes of workers (e.g., pink, green, and gold collar), but blue and white collar remain the primary distinctions when differentiating classes of workers. Because of the type of work involved in blue-collar jobs, several stereotypes have developed over time, some positive (e.g., work ethic, pride, and loyalty) and some negative (e.g., unskilled and unrefined), but we use the term neutrally, simply to describe a specific class of jobs.

What constitutes a blue-collar job varies depending on the source. For our purposes, we define blue-collar jobs as those that involve some type of manual or physical labor, often involving the use of tools or specialized equipment, and typically occurring in "non-office" environments that are sometimes hazardous. Clearly falling within this definition are skilled trades or craft jobs in manufacturing and construction industries, such as mechanics, machinists, electricians, welders, plumbers, and carpenters. However, blue-collar jobs can range from completely unskilled (e.g., manual laborer) to very specialized, highly skilled jobs (e.g., nuclear reactor operator).

As with the broad range of skill required by blue-collar jobs, the complexity and autonomy of the work vary widely as well. Jobs such as helper, assistant, or laborer may entail following direct and concise instructions. These jobs have relatively low information processing demands and involve minimal planning or decision making. At the other end of the spectrum, jobs such as locomotive engineer, nuclear reactor operator, and demolition expert are highly complex and have extensive information processing demands, as incumbents must plan carefully, evaluate a myriad of possible outcomes, and coordinate the work of others to ensure safe and effective outcomes.

Characteristic of many skilled trade occupations is the use of an apprenticeship system, in which employees are hired as apprentices (or even pre-apprentices) and progress to journeymen and then, in some cases, masters. The purpose of the apprenticeship is for the employee to learn the trade through on-the-job training (OJT), typically complemented by classroom training. Given the hands-on nature of most blue-collar jobs, OJT is a perfect fit. Apprenticeships typically last three to six years, at which point a successful apprentice would become a journeyman. While the use of the terms "apprentice" and "journeyman" by employers in the U.S. is not regulated, official recognition of apprentice or journeyman status is regulated. Many states and jurisdictions have formal requirements to obtain a journeyman license, and the Department of Labor has a formal Registered Apprenticeship program, through which the worker receives a nationally recognized certification.

CONTEXTUAL CONSIDERATIONS

A handful of contextual factors are important to consider when designing selection programs for blue-collar jobs. None of these factors is necessarily unique to blue-collar jobs, but each is more salient than it is for white-collar jobs. When one or more of these factors is a consideration for a particular blue-collar job, it can affect both the tools that are included in the selection process and the amount of resources dedicated to selection system design.

Work Environment

Just as the level of skill can vary across jobs classified as blue collar, so, too, can the work environment. Some blue-collar jobs, such as iron worker or roofer, may entail performing tasks outdoors in varying weather conditions, and workers in others, such as firefighters and police officers, may experience extreme temperatures and loud noises. At the other end of the spectrum are jobs such as high-technology manufacturing operator, where workers operate computers and robots in clean, climate-controlled environments. Despite the variation in blue-collar work environments, they all differ from the typical white-collar environment, where much of the workday is spent either sitting at a desk in front of a computer or in meetings.

Another common characteristic of many blue-collar work environments is potentially hazardous working conditions, such as extensive physical demands (e.g., firefighter and police officer), use of heavy equipment and machinery (e.g., construction worker), working with dangerous materials (e.g., demolition expert), and working at heights (e.g., line worker). For this reason, safety is often a central component of the work culture for these jobs. Research shows meaningful individual differences in the extent of safety behaviors in which people typically engage (O'Connell & Delgado, 2011), so safety is often an important focus of the selection process for blue-collar jobs. Depending on the nature of the job, this might entail using assessments that evaluate requisite physical abilities (Campion, 1983), specialized knowledge (Hoffman, Jacobs, & Landy, 1995), or safety awareness (Vredenburgh, 2002).

A final component of some blue-collar work environments worth noting is shift work. In many blue-collar industries, 24-hour operations are either necessary (e.g., utilities and emergency response occupations) or desirable for efficiency (e.g., manufacturing). This means that some employees must work at times other than traditional working hours, including swing and night shifts. Some organizations use rotating shifts, in which an employee works on each of the various shifts over the course of several weeks. While this helps to distribute the burden of non-traditional work hours across all employees, it can also be disruptive to one's personal life, and some research suggests that rotating shifts may even be damaging to one's health (De Bacquer et al., 2009).

Tenure

Another important consideration when designing selection programs for blue-collar jobs is the average tenure of employees. In many blue-collar industries, employees stay with their employer longer than in other industries. For example, according to the Bureau of Labor Statistics (2015), over the past decade manufacturing employees have consistently had the highest median tenure for any private sector industry, at about six years. When you drill down to specific industry sectors, the average tenure is even higher, with paper and printing (9.7 years) and utilities (9.2 years) leading the way. Moreover, because these are average tenures across entire industries or sectors, they include both blue- and non-blue-collar jobs. Undoubtedly, if these tenure figures were isolated by specific blue-collar job groups, the median tenure for many groups would be even higher.

Logically, the longer an employee stays with an organization, the more pronounced the impact of his or her performance on overall workforce productivity becomes. This is accounted for in utility formulas by the inclusion of a multiplier for the average tenure of those hired (Schmidt & Hunter, 1983b). The economic value to an organization of a specific selection procedure depends on multiple factors, including the dollar value of job performance variability. Determining the variability in job performance for a specific job can be time consuming and complex, but Schmidt and Hunter (1983a) estimated that the lower-bound standard deviation of employee contributions in dollars is 40% of salary. Using a selection tool with a predictive validity of .30 for a job with a starting salary of $64,000 and a selection ratio of 40%, this translates into a utility benefit of about $7,500 per hire in the first year. Actual utility gains across an employee's tenure are complicated and involve multiple factors, including the availability and cost of replacement employees (Cascio & Boudreau, 2011), but for example's sake, if we assume the impact is linear, the cumulative effect on an organization's bottom line for a job with an average tenure of 10 years is $75,000 per hire. When this is multiplied by the number of hires per year, the effect of average tenure on selection utility becomes quite pronounced, so it makes sense to consider investing more in the selection process for high-tenure jobs to ensure you reap the benefits.
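For readers who want to reproduce the arithmetic above, the following is a minimal sketch of the underlying Brogden-Cronbach-Gleser utility logic, assuming top-down selection from a normally distributed applicant pool and ignoring testing costs. The function names are ours, and the inputs simply mirror the hypothetical example in the text (validity of .30, $64,000 starting salary, 40% selection ratio, SDy set at 40% of salary).

```python
from math import erfc, exp, pi, sqrt

def mean_std_score_of_selected(selection_ratio: float) -> float:
    """Mean standardized predictor score of selectees under top-down selection:
    ordinate of the normal curve at the cut score divided by the selection ratio."""
    lo, hi = -6.0, 6.0
    for _ in range(100):                        # bisect for the cut score z
        mid = (lo + hi) / 2.0
        upper_tail = 0.5 * erfc(mid / sqrt(2.0))
        if upper_tail > selection_ratio:
            lo = mid
        else:
            hi = mid
    z_cut = (lo + hi) / 2.0
    ordinate = exp(-z_cut ** 2 / 2.0) / sqrt(2.0 * pi)
    return ordinate / selection_ratio

def utility_per_hire(validity, salary, selection_ratio, sd_y_pct=0.40, years=1):
    """Per-hire utility gain: validity * SDy * mean z of selectees * years."""
    sd_y = sd_y_pct * salary                    # 40%-of-salary rule of thumb
    return validity * sd_y * mean_std_score_of_selected(selection_ratio) * years

print(round(utility_per_hire(0.30, 64_000, 0.40)))            # ~7,400, near the text's rounded $7,500
print(round(utility_per_hire(0.30, 64_000, 0.40, years=10)))  # ~74,000 over a 10-year tenure
```

The ten-year figure simply multiplies the annual gain, matching the text's assumption of a linear cumulative effect.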

Labor Unions

Many blue-collar workers are represented by labor unions, which exist to protect the interests of their members, particularly around wages, benefits, job security, and working conditions. The best way for unions to achieve their goals is by having influence or control over organizational policies and procedures. Agreements regarding workers' rights are negotiated between union leadership and company management and are codified in labor contracts or collective bargaining agreements. Concessions made by either side during the negotiation process become part of the labor contract and generally cannot be renegotiated until the labor contract expires.

Because unions exist to protect the interests of their members, who are already company employees, they are generally more concerned with internal selection practices (e.g., promotions and transfers) than the hiring of external applicants for entry-level jobs. However, in some instances the union may also care about external selection practices since those hired are their future members. The greater the impact the quality of hires has on the union's well-being, the greater their interest will be. For example, in one organization with which the authors are familiar, company management reserves the right to outsource certain functions if specific performance targets are not met. In this instance, the quality of hires has a direct bearing on the union's well-being, and they take a keen interest in the external selection process.

As for internal selection, from the union's vantage point the fairest and most objective way to handle selection decisions is through seniority (Bownas, 2000). When employee seniority is the sole factor for determining promotions and transfers, the union never has to favor one member over another and union loyalty is rewarded. To the extent that additional selection procedures are used, the union will want them to be as objective and job-relevant as possible, preferably based on current performance (e.g., training success, apprenticeship completion, or completion of verifiable goals). If some form of testing is used, the preference will be for tests that have strong fidelity with the job, such as work samples. In contrast to the goals of union leaders, line managers want fast and accurate selection of competent performers to ensure the productivity of their work unit. Because of the inherently competing priorities of labor unions and management, there is a tradition of conflict and distrust between the two groups. The relationship can vary greatly from organization to organization, or even from union to union within a specific organization, and can range from downright acrimonious to positive and productive.

For the selection professional working with unionized jobs, understanding the perspectives of both labor and management is critical. As Bownas (2000) astutely pointed out, the two sides are really no different in what motivates them. They are both looking out for their self-interests, which is only natural. The selection professional's aim, then, should be to remain an objective third party, whose goal is to develop the most effective selection process possible that helps, at least partly, address the concerns of each side. Effectively working with both parties also requires an understanding of the history of the union–management relationship and what, if anything, has been incorporated into the labor agreement that would impact the ability to implement and manage a selection process.

Organizations should take care during the negotiation process to retain the flexibility to administer and score tests as needed. In some instances, the union may advocate including specific tests or specifying cutoff scores in the collective bargaining agreement. This can be problematic in a number of ways. If the job changes, new selection tests may be warranted yet impossible to implement due to the agreement. Likewise, in situations where cutoff scores are specified in the labor agreement, organizations lose the flexibility to adjust to changing labor markets or organizational needs. In the worst-case scenario, stipulations in the labor agreement are no longer legally defensible. Since the union is not responsible for defending the selection procedure against a legal charge, they have little motivation for considering the future legal ramifications of decisions made during the negotiation process. Consequently, it is prudent for the organization to either include a selection professional in these decisions or ensure that those responsible for making the decisions are fully informed.

Another important consideration when working with unions is whether to include union leadership and their members in the selection tool development and validation process. To increase buy-in and promote transparency, it is certainly desirable to do so, but union members' willingness to participate will depend largely on the relationship between the union and company management, as represented employees are very unlikely to participate in any type of data collection without the approval of union leadership. Without their participation, the success of the process depends on either the availability of non-represented employees in the same or similar positions or the ability to conduct a predictive validation study with applicants, which will take much longer and increase costs. Direct union involvement also increases buy-in, helps ensure the ultimate quality of the selection process, and helps avoid questions and issues post-implementation.

Throughout this process, candor and clear communication are critical. If the union suspects that the selection professional is hiding something, cooperation will become almost impossible. This does not mean sharing every aspect of the selection process; rather, it means being clear about what you will (and will not) share and why.
For example, many companies consider selection test scoring formulas and cutoff scores to be secure information, and only individuals with a true need to know and who have been trained in how to interpret the information are allowed access to them. In instances like this, the important thing is being clear about why you cannot share the information and explaining that many other stakeholders (e.g., applicants, employees, hiring managers, and recruiters), not just the union, are not given access to the information either. In addition, you have to tailor your messaging to the audience. While company management will typically be interested in how the selection process provides value to the organization, union leadership will want to understand how the process impacts their members. If prior union–management relations have been constructive, both sides are included in the development of the selection program, and communication is clear throughout the process, it should result in a rigorous selection process and a smooth implementation. However, if one or more of those three components is missing, union cooperation is unlikely, and if implementation succeeds at all, grievances are likely to follow.

Applicant Population Issues

During the early phases of designing a selection system, test developers should critically evaluate the probable applicant pool. For example, an organization with a large, well-qualified applicant pool can afford to be more selective and still meet hiring needs, so it could choose to develop a rigorous selection process with multiple hurdles. In contrast, if there is a dearth of qualified applicants, the organization may need to limit its selection criteria to only the most critical factors to ensure enough applicants can pass all steps of the selection process.

Another applicant pool factor that can impact selection system design is the level of job-relevant experience or skill applicants are likely to possess. Aptitude tests might be most appropriate if the pool is filled with inexperienced workers, but if an organization anticipates mainly experienced, journeyman-level applicants, job knowledge and work sample tests could be used.

The level of test-taking experience and skill may also be important for many blue-collar applicant pools. The average age of some blue-collar applicant pools may skew higher, meaning that many of the applicants have not been in traditional educational or learning environments for quite some time. The goal of the selection process is to measure job-relevant knowledge, skills, and abilities (KSAs), rather than contaminating factors such as test savviness, so for these applicant populations, it is particularly critical to clearly outline what to expect in the hiring process. If aptitude or knowledge tests are used, you should also provide as many resources as possible (e.g., descriptive test brochures or practice tests) to allow applicants to familiarize themselves with the test format and ensure that only job-relevant KSAs are assessed once they test.

Selection system developers can also benefit from considering likely future changes to their applicant pools. For example, an increase in education standards or advances in technology can lead to a shift in the knowledge and skills that applicants possess. Likewise, a change in the generational composition of the applicant pool can lead to different applicant values and expectations. Advance consideration of these issues can allow developers to create a selection system with greater long-term utility.

English Language Proficiency

In certain regions of the country, a large portion of the applicant population for some blue-collar jobs may lack English language fluency. This raises the question of which language should be used for any selection tests. Regulatory bodies, such as the Equal Employment Opportunity Commission (EEOC), have filed lawsuits against companies that have required English proficiency without justifying it as needed for safe and effective performance of the job (Equal Employment Opportunity Commission, 2014). At a minimum, an organization should gather job analysis data to determine the language skills required on the job. If English proficiency cannot be established as a job requirement, it may mean that selection tests should be offered in multiple languages, or that nonverbal tests, such as Raven's Progressive Matrices, should be considered.

Assuming English proficiency can be established, it is essential that care be taken to match the reading level required on the job with that required by the test. This can be accomplished by collecting and evaluating training manuals and other written job materials. Patterning test content after these materials will ensure that the reading level required by the test is not harder than that required by the job itself. This approach will also help with stakeholder buy-in and ensure legal compliance, should the selection procedure ever be challenged.
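One rough way to operationalize the reading-level comparison described above is to score excerpts of training manuals and draft test items with a standard readability index. The sketch below uses the Flesch-Kincaid grade-level formula with a crude vowel-group syllable counter; it is illustrative only, the sample strings are invented, and in practice one would score full job documents and item pools (or use a dedicated readability tool) rather than two short snippets.

```python
import re

def count_syllables(word: str) -> int:
    """Crude heuristic: count groups of consecutive vowels (minimum of 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

# Invented excerpts standing in for real job materials and a draft test item.
manual_excerpt = ("Lock out the breaker before you open the panel. "
                  "Verify zero energy with the meter. Report any damaged insulation.")
draft_item = ("Prior to commencing maintenance activities, which administrative "
              "procedure governs the isolation of energized electrical apparatus?")

print(f"Manual excerpt grade level: {flesch_kincaid_grade(manual_excerpt):.1f}")
print(f"Draft item grade level:     {flesch_kincaid_grade(draft_item):.1f}")
```

A large gap between the two scores would signal that item wording, not job-relevant knowledge, may be driving test difficulty.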

Partnerships Among Industry, Education, and Government

In the global economy, many employers are under increasing pressure to reduce costs and enhance efficiency. This can translate into a reluctance to hire unskilled workers and invest in years of training or apprenticeship. Instead, many employers seek workers who are already skilled in a particular craft. Because the pool of skilled applicants is often smaller than the hiring needs of large organizations, one potential solution is to create a pipeline of skilled applicants through partnerships with educational institutions or government training programs. In such partnerships, the employer helps identify the skills that are needed for successful performance, and the educational institution or government program designs curriculum to support the development of those skills. Students graduate from these programs with a base level of job skill that allows employers to spend less time and money engaging in on-the-job training.

From a selection standpoint, the obvious benefit of such partnerships is a pipeline of better-qualified applicants. Applicants coming from such partnerships may also reflect greater diversity than already skilled applicants. Of course, this only pays dividends to the organization if it hires and retains enough of the graduates to offset its investment in the partnership. Understanding this requires a supply-chain approach to staffing (Cascio & Boudreau, 2011), in which all steps of the staffing process, from an understanding of the labor pool to employee retention practices, are evaluated. Potentially more important, but less tangible and more difficult to quantify, are benefits such as company brand and goodwill in the community.

At the same time, these types of partnerships can present unique challenges to the organization. The programs providing the training, whether educational institutions or government bodies, have a vested interest in their graduates' success, so one potential difficulty is a partner who starts teaching to the employer's selection requirements rather than the full range of KSAs needed for successful job performance. This can be problematic since most selection processes target a subset of critical KSAs rather than the entire job domain. If the partner teaches specifically to the selection requirements, individuals hired from the program may have unexpected gaps in their abilities.

Organizations considering entering into partnerships with schools or government programs should be mindful of potential benefits and drawbacks. When successful, organizations can fill positions with a diverse group of well-qualified applicants, often drawn from local communities. This can have a positive impact on the organization's reputation as well as on hiring and training costs. Alternately, an organization's reputation can be harmed if it engages in these partnerships but is unable or unwilling to hire expected numbers of program graduates. The decision not to hire program graduates may be the result of changes in hiring needs or budgets, or may occur because the training program, once executed, lacked sufficient rigor to produce qualified applicants. Whatever the reason, failure to hire program graduates can lead to dissatisfaction and backlash from both the graduates and the education/government partner, ultimately harming community relations.

SELECTION TOOL CONSIDERATIONS

In this section, we review selection tools frequently used for blue-collar jobs. Our discussion focuses specifically on considerations for using these tools with blue-collar jobs. For a more complete treatment of many of the topics, we refer the reader to the relevant chapter(s) in Part III of this book (Categories of Individual Difference Constructs for Employee Selection).

As noted earlier, the complexity of blue-collar jobs varies greatly. Given the importance of aligning a selection system with the complexity and demands of the required work, particular care must be taken when identifying selection tools appropriate for predicting successful performance in blue-collar jobs. Measuring skills or abilities that are too complex for the position can result in inappropriate restrictions in applicant flow, at best, and legal risk, at worst. Alternately, measuring KSAs that are too simple for the position can result in new hires who cannot perform the work effectively. It is also important to identify the KSAs that employees must possess upon hire. Those KSAs that employees are trained on shortly after hire add no value to the selection process.

Interviews

Interviews are a ubiquitous selection tool, for blue-collar jobs or otherwise. Research on interviews has consistently shown that adding structure and standardization is critical to maximizing their validity (Campion, Palmer, & Campion, 1997; Huffcutt & Arthur, 1994). In blue-collar environments, structured interviews that focus on the technical knowledge and skills needed on the job are particularly appealing. They can be less labor intensive to develop and administer than knowledge or work sample tests, yet still allow employers to assess the knowledge and skill of external job applicants in a manner similar to job skills checklists or observational assessments, which are often used for internal blue-collar selection.

With a structured, technical interview, job analysis data are used to identify critical tasks and the necessary inputs to and outcomes of those tasks. Interview questions are then developed to describe the task scenario and provide the input. Interviewees are scored on the extent to which they provide the correct outcome. Unlike many behaviorally based situational interviews, such as those described by Latham, Saari, Pursell, and Campion (1980), technically focused structured interviews can have objectively scored right and wrong answers. When such objectively scored interviews are used in combination with multiple trained interviewers, inter-rater reliability is enhanced, defensibility is improved, and candidate perceptions of fairness may increase.

Organizations seeking to use a technical interview must carefully evaluate the extent to which subject matter experts are available to administer the interview. With technical interviews, the interview panel must be fully knowledgeable of the technical subject in order to effectively administer and score the interview. In addition, due to the open-ended nature of spoken responses to technical interview questions, the test development team must dedicate a substantial amount of time to clearly defining scoring anchors and training interview panelists during the validation process.

For internal selection in union environments, the use of interviews can be particularly challenging. Depending on union–management relations, union leadership may perceive interviews as a method to allow management to circumvent the seniority system. Unions that are unwilling to accept behaviorally based interviews may be open to the use of technical interviews as a reasonable replacement for work samples or knowledge tests. If feasible, including knowledgeable union members on the interview panel will further help alleviate any suspicion on the part of the union. This also provides additional subject matter expertise and can strengthen union–management relations.
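As a simple illustration of the calibration checks that can accompany an objectively anchored technical interview, the sketch below computes mean pairwise rater correlations and exact-agreement rates across a small panel. The ratings are invented, the 0–3 scale is hypothetical, and in operational work more formal indices (e.g., intraclass correlations) would often be preferred; this is only a back-of-the-envelope pilot check.

```python
from itertools import combinations
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length rating lists."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Hypothetical pilot data: three raters score the same six candidates on a
# 0-3 anchored technical question (3 = fully correct outcome described).
ratings = {
    "rater_A": [3, 2, 0, 3, 1, 2],
    "rater_B": [3, 2, 1, 3, 1, 2],
    "rater_C": [2, 2, 0, 3, 1, 1],
}

pairs = list(combinations(ratings, 2))
pairwise_r = [pearson(ratings[a], ratings[b]) for a, b in pairs]
exact_agree = [mean(1 if x == y else 0 for x, y in zip(ratings[a], ratings[b]))
               for a, b in pairs]

print(f"Mean pairwise correlation: {mean(pairwise_r):.2f}")
print(f"Mean exact agreement:      {mean(exact_agree):.2f}")
```

Low agreement on particular questions would flag anchors that need tightening before the interview is used operationally.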

Cognitive Ability Tests

It is a firmly established finding that cognitive ability is the best single predictor of training success and job performance for most jobs (Ghiselli, 1973; Hunter & Hunter, 1984; Schmidt & Hunter, 1998). Schmidt and Hunter estimated the average predictive validity of general mental ability (GMA) tests to be .51 for overall job performance and .56 for training success. The validity of GMA is largely driven by its impact on the acquisition of job knowledge, and this is evident in increasingly higher validities for more complex jobs (Schmidt & Hunter, 2004). As noted earlier, blue-collar jobs vary widely in the complexity of the work, from unskilled to very complex. Based on the research evidence, we would expect cognitive ability to be a valid predictor for all blue-collar jobs, but particularly so for those that are more complex (e.g., locomotive engineer, nuclear reactor operator, and demolition expert) or require significant training in order to become proficient (e.g., most skilled trades).

One meta-analysis focused specifically on blue-collar construction and skilled trades jobs in the electric utility industry (Jones & Gottschalk, 1988). Mean corrected validities for specific cognitive abilities ranged from .30 (memory) to .53 (mechanical ability). Consistent with previous research, the correlations were even higher for training criteria, ranging from .55 (memory) to .77 (quantitative ability). Other meta-analytic studies that included blue-collar jobs have also found consistently significant validities (e.g., Ghiselli, 1966; Schmidt & Hunter, 1978; Schmidt, Hunter, Pearlman, & Shane, 1979).

The importance of the complexity or learning curve of many blue-collar jobs is compounded by the need for safe performance. An inability to properly evaluate the consequences of different actions or to learn the information needed to perform the work safely can have disastrous consequences. For example, five workers were killed at an Illinois chemical plant in 2004 when a worker failed to follow instructions and opened the wrong reactor. Similarly, a 2008 explosion that killed 14 and injured hundreds at the Imperial Sugar refinery in Georgia occurred, in part, because cleaning workers did not understand the consequences of sugar dust build-up. As these examples illustrate, for many blue-collar jobs it is critical that those who are hired have the aptitude to learn the work and perform it effectively and safely. The most effective and efficient way to do this, at least for entry-level selection, is using relevant measures of cognitive ability.

Although the utility of cognitive ability tests for blue-collar selection is firmly established, their use presents several challenges. First, given the adverse impact associated with cognitive ability tests, users should anticipate legal challenges and grievances. This mindset should guide the job analysis, test development, and validation processes to ensure they are planned properly, executed carefully, and documented thoroughly. Strategies for reducing adverse impact, such as those outlined by Ployhart and Holtz (2008), should be considered. Because a reduction in adverse impact will typically come at the expense of some predictive validity, the challenge is one of comparing different implementation strategies and determining what level of tradeoff is acceptable. In the context of evaluating how to weight multiple specific abilities in a predictor battery, Wee, Newman, and Joseph (2014) demonstrated one particularly innovative strategy for doing this based on Pareto optimization.

The face validity of cognitive ability tests with blue-collar populations is another challenge that must be considered. Applicant reactions may be negative if the link between the test and the job is not clear. This can both dissuade good applicants and increase the chances for challenges. Consequently, even if an efficient GMA test is a valid predictor, it may make sense to use a longer test battery that incorporates the specific abilities needed for successful job performance and has items that are contextualized to the work.
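To make the validity-versus-adverse-impact tradeoff concrete, the sketch below sweeps over weights for a two-predictor composite and keeps the weightings that are not dominated on both criteria. This is not the Wee, Newman, and Joseph (2014) procedure itself, just a brute-force illustration of the Pareto idea; the validities, subgroup differences, and intercorrelation are invented, and the composite d formula ignores the small effect of subgroup differences on the pooled standard deviation.

```python
from math import sqrt

# Hypothetical predictor properties (not taken from the studies cited above):
r1, r2 = 0.50, 0.35   # criterion-related validities of predictors 1 and 2
d1, d2 = 1.00, 0.30   # standardized subgroup differences on each predictor
rho = 0.30            # intercorrelation between the two predictors

def composite(w1: float):
    """Approximate validity and subgroup d of the composite w1*X1 + (1-w1)*X2."""
    w2 = 1.0 - w1
    sd = sqrt(w1**2 + w2**2 + 2 * w1 * w2 * rho)
    return (w1 * r1 + w2 * r2) / sd, (w1 * d1 + w2 * d2) / sd

points = [(w / 20, *composite(w / 20)) for w in range(21)]

# Pareto-efficient weightings: no other weighting has both higher validity
# and a lower subgroup difference.
pareto = [p for p in points
          if not any(q[1] > p[1] and q[2] < p[2] for q in points)]

for w1, validity, d in pareto:
    print(f"w1={w1:.2f}  validity={validity:.3f}  subgroup d={d:.3f}")
```

Even in this toy example, some heavy weightings of the higher-impact predictor are dominated, which is the practical message of the Pareto approach: the acceptable tradeoff can be chosen from the efficient frontier rather than from all possible weightings.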

Knowledge and Experience Tests

Many blue-collar jobs require specialized knowledge. Jobs such as electrician, chemical technician, or nuclear power plant operator cannot be performed safely and effectively in the absence of specialized knowledge. Placing a worker who lacks the requisite knowledge in these types of roles could have terrible consequences for safety, both for the worker and for the general public. Selection systems that measure job-relevant knowledge, through written knowledge tests, experience checklists, or job simulations, can help identify applicants who have sufficient knowledge to perform such critical work safely and effectively (Burke, Sarpy, Tesluk, & Smith-Crowe, 2002). Knowledge tests, thus, often make sense in lieu of cognitive ability tests for the external selection of experienced blue-collar workers. Internally, they are often used for promotion in skilled trades or similar blue-collar jobs. The knowledge one needs to learn in order to become proficient in the trade is dictated by the trade itself and is typically well defined (e.g., electrical knowledge for electricians or mechanical knowledge for mechanics), so it is important to ensure that employees possess the requisite knowledge before moving into higher-level roles.

In general, while knowledge tests provide an easy method to identify already skilled workers, they can be labor-intensive to develop and maintain. Defining the relevant content domain and writing items can require many hours of subject matter expert time, and, particularly in industries where technology drives rapid change, the content may need to be updated regularly. Thus, test developers are advised to consider the tradeoffs between the efficacy of knowledge testing as a way to identify skilled workers and the cost of development and upkeep.

Work Samples

In many cases, especially in union environments, there can be heavy resistance to knowledge tests for skilled positions. When the work entails hands-on performance, job applicants often perceive written knowledge tests as too far removed from the work or too esoteric to be appropriate. In these situations, work samples or other job performance tests may be deemed more palatable by both test takers and the union (Steiner & Gilliland, 1996).

Because work samples are designed to reflect critical job content, they must be based on a thorough job analysis that includes stakeholder input and carefully weighs which tasks to include. One question test developers must answer is whether to include rarely used but very critical tasks in the work sample. For example, line workers who perform their jobs dozens or even hundreds of feet up in the air on utility poles or towers may never be called upon to rescue an injured colleague, but the ability to perform that task can mean the difference between life and death. In this case, the criticality of the task may be sufficient to merit evaluating job applicants' ability to perform it.

For work samples that use raters to evaluate applicants' performance, consideration must be given to the specific behaviors to be scored, as well as to the qualification, training, and calibration of the raters. In order to effectively score any type of hands-on assessment, the raters must typically be experts in the subject, and time must be spent to ensure they are calibrated and will assign the same ratings after observing the same behaviors. Test developers can reduce rater variability by incorporating standardized equipment into the work sample and creating measurable evaluation criteria. For example, a machinist might be given specifications, raw materials, and tools and be asked to manufacture a part to the specifications. The assessment could be scored based upon the extent to which the part matches the specifications, as determined using calibrated measurement tools. Such an assessment should be highly reliable and provide a standardized test-taker experience.

For internal applicants, another job performance test that can be effective is a job skills checklist that requires applicants to demonstrate proficiency on a pre-selected set of skills on their current job and obtain "sign-off." Checklists are especially valuable in promotional situations, when the higher-level job requires skills that can be learned and demonstrated in the current, lower-level job. In these situations, the employee is observed performing specific tasks over the course of days, weeks, or even months. A knowledgeable assessor, such as a trainer or supervisor, makes a note when the individual has successfully completed a task on the checklist. When a certain number of tasks have been "checked off," the individual is deemed to be qualified. Such assessments have the benefit of evaluating proficiency in performing a wide range of tasks.

One potential concern with job skills checklists is whether those being evaluated will have ample opportunity to perform all of the tasks on the checklist. Some tasks, although sufficiently critical to merit inclusion on a checklist, may only occur on certain shifts or at certain times of the year. This can be inherently unfair to employees who are on different shifts or trying to obtain promotions at other times of the year. Because employees who are dissatisfied with the fairness of the promotional process are more likely to engage in behaviors that are detrimental to organizational goals (McCarthy, Hrabluik, & Jelley, 2009), the test developer must carefully consider the availability of opportunities to perform tasks when creating job skills checklists for promotional purposes.
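To make the sign-off mechanics concrete, here is a minimal sketch of how a job skills checklist might be tracked. The task names, assessor identifiers, and sign-off threshold are all invented for illustration; real checklists would be derived from job analysis and would record far more detail.

```python
from datetime import date

# Hypothetical checklist for a promotional assessment (task -> sign-off record).
checklist = {
    "isolate and lock out energized equipment": None,
    "calibrate pressure transmitter": None,
    "perform monthly safety inspection": None,
    "restore unit after planned outage": None,   # may only occur at certain times of year
}
REQUIRED_SIGNOFFS = 3  # invented threshold for being deemed qualified

def sign_off(task: str, assessor: str) -> None:
    """Record a verified demonstration of the task by a qualified assessor."""
    checklist[task] = (assessor, date.today())

sign_off("isolate and lock out energized equipment", "trainer_LM")
sign_off("calibrate pressure transmitter", "supervisor_RT")

completed = sum(1 for record in checklist.values() if record is not None)
print(f"{completed}/{len(checklist)} tasks signed off; "
      f"qualified: {completed >= REQUIRED_SIGNOFFS}")
```

Tracking sign-off dates also makes it easier to audit whether employees on different shifts had comparable opportunities to demonstrate each task, the fairness concern raised above.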

Physical Ability Tests

Physical ability testing can be instrumental in identifying qualified job candidates for blue-collar jobs with specific physical requirements. However, test developers face unique challenges when creating physical ability tests, and their development should be approached carefully and knowledgeably. First and foremost, test developers must clearly understand the actual physical demands of the job in order to develop measures that evaluate the ability to meet those demands. They then must decide whether to develop complex physical work samples that reflect relevant portions of the job or more straightforward measures of specific physical capacities needed for successful performance. Physical capacity tests (e.g., maximum amount of weight that can be lifted or speed in running a specified distance) are often relatively simple to administer and have been shown to predict future job performance (Henderson, Berry, & Matic, 2007), but they may lack fidelity with the tasks performed on the job, which can make them subject to greater legal scrutiny (e.g., Berkman v. City of New York, 1982). Physical capacity tests require the test developer to engage in more comprehensive research to justify how the capacity measured relates to the tasks performed on the job. This may be necessary in situations where physically demanding job tasks are only learned after extensive on-the-job training and cannot practically be included in the selection process. Physical work samples (e.g., carry a 150-pound victim down a flight of stairs or remove and re-set a cross arm on a utility pole), on the other hand, may be more complex to administer but are more easily linked to job task performance.

With any type of physical testing, care must be taken to protect test takers from injury and avoid inappropriate collection of medical data. If there is concern over the physical safety or health of test takers, test developers must determine if pre-testing will be used before allowing individuals to take the test. They should also identify and provide clear guidance on any situations for which the test administrator should intervene, including when a test taker engages in behaviors that are likely to cause injury.

A final challenge with physical testing is adverse impact. Physical tests, especially those focused on muscular strength and cardiovascular endurance, frequently produce adverse impact against women (Hough, Oswald, & Ployhart, 2001). Addressing adverse impact can be challenging, especially when working with incumbent populations that contain relatively few members of the groups for whom impact is most likely. In order to include underrepresented group members in validation studies, test developers may have to solicit participation from outside the sponsoring organization. Test developers can further enhance the chances of selecting underrepresented group members by offering training programs that give potential job applicants an opportunity to build the physical skills needed for the test and the job. This not only has the potential to improve applicant diversity, but it also demonstrates the organization's commitment to creating a level playing field, which can be important should a legal challenge arise. Further discussion of physical ability tests is outside the scope of this chapter, but interested readers are encouraged to read Chapter 12 of this volume for a more in-depth treatment of the subject.
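A routine first step in monitoring the adverse impact discussed above is the impact ratio, the "four-fifths rule" screen associated with the Uniform Guidelines. The sketch below shows the basic calculation; the pass counts are invented, and in practice the ratio is only a rough screen that is typically supplemented with statistical significance tests and practical-significance judgments.

```python
def impact_ratio(focal_pass: int, focal_total: int,
                 ref_pass: int, ref_total: int) -> float:
    """Ratio of the focal group's selection rate to the reference group's.
    Values below 0.80 are commonly treated as a flag for potential adverse
    impact under the four-fifths rule of thumb."""
    focal_rate = focal_pass / focal_total
    ref_rate = ref_pass / ref_total
    return focal_rate / ref_rate

# Hypothetical pass data for a physical capacity test.
ratio = impact_ratio(focal_pass=18, focal_total=40, ref_pass=130, ref_total=200)
print(f"Impact ratio: {ratio:.2f}")   # 0.69 here, below the 0.80 benchmark
```

A ratio below the benchmark does not by itself make a test indefensible, but it signals that the validation evidence and alternatives analysis deserve particular care.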

Personality Tests

The use of personality tests for high-stakes, high-volume selection is a common practice. Several influential meta-analyses in the 1990s (Barrick & Mount, 1991; Ones, Viswesvaran, & Schmidt, 1993; Tett, Jackson, & Rothstein, 1991) brought about a resurgence in their use as a pre-employment selection tool after the field had largely abandoned them for several decades. Only the Barrick and Mount research evaluated validities separately for different occupational groups. For the skilled/semi-skilled group, Conscientiousness was the only Big Five dimension with a lower-bound credibility value that was not negative. However, the skilled/semi-skilled group included both jobs we would consider blue collar (e.g., production worker, assembler, and truck driver) and those we would not consider blue collar (e.g., clerical, flight attendant, and nurse's aide), making it difficult to draw firm conclusions.

Although personality tests are now commonly used for pre-employment selection, their practical usefulness is hotly debated (cf. Morgeson et al., 2007; Ones, Dilchert, Viswesvaran, & Judge, 2007; Tett & Christiansen, 2007). Much of the debate centers around the potential for intentional response distortion on the part of applicants. Attempts at reducing or eliminating response distortion through alternative item designs (e.g., forced-choice scales) or measurement approaches (e.g., those based on item response theory) have shown promise but continue to show mixed results. We refer the interested reader to Ziegler, MacCann, and Roberts (2012) for a thorough discussion of the topic.

One practical approach to dealing with response distortion in high-volume selection contexts is to use personality tests in a select-out rather than a select-in manner. Mueller-Hanson, Heggestad, and Thornton (2003) found that response distortion in a simulated selection setting was much more prevalent at the top end of the distribution, and criterion-related validity was significantly higher at the lower end of the distribution. Given this finding, rather than selecting the highest scorers by rank ordering or using a high cutoff score, an approach that may yield greater utility is to use the test as an early screen in the process to eliminate the lowest scorers and then use other methods to identify the best candidates.

In addition to considering a select-out approach, we would give blue-collar selection practitioners two other pieces of advice when evaluating personality tests for pre-employment selection. First, pay close attention to predictor-criterion alignment when validating personality tests. Because of the breadth of blue-collar jobs, the importance of contextual performance factors varies greatly. For example, some blue-collar work is carried out in a team environment or involves extensive customer contact, whereas other blue-collar jobs are less dependent on interpersonal interactions than on solitary performance. Some blue-collar work involves tasks that require attention to detail and careful focus (e.g., electrician), and, as discussed earlier, safety is an important consideration for many blue-collar jobs (e.g., public safety jobs, utility jobs, and manufacturing jobs). Consequently, the use of personality tests for blue-collar selection requires carefully aligning predictor constructs with facets of job performance that are important to a particular job or job family.

Second, do not rely on concurrent validation designs as the only source of validity evidence. While research on the magnitude of the loss of validity caused by intentional response distortion is mixed, most researchers agree at this point that applicants can and do distort their responses. The Army's experience transitioning from research to operational settings is a stark example of the false security that validation with incumbent samples can provide (White, Young, Hunter, & Rumsey, 2008). If a pure predictive validation study is not possible, we advise that you at least follow up a concurrent study with predictive evidence by tracking those hired once the test is implemented and collecting performance data at some later point (e.g., after six months or one year on the job).
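For clarity, the following is a small sketch of the select-out screen described earlier in this section, contrasted with a top-down (select-in) use of the same scores. The applicant IDs, scores, and cutoff are invented, and the cutoff would in practice be set from validation and distortion evidence rather than chosen arbitrarily.

```python
# Hypothetical applicant records: (applicant_id, personality_screen_score)
applicants = [
    ("A01", 12), ("A02", 31), ("A03", 45),
    ("A04", 27), ("A05", 9),  ("A06", 38),
]

SELECT_OUT_CUTOFF = 15  # invented low cutoff: eliminate only the lowest scorers

# Select-out: drop only those below the low cutoff; everyone else advances
# to other tools (e.g., cognitive tests, structured interviews).
screened_in = [a for a in applicants if a[1] >= SELECT_OUT_CUTOFF]
print("Advance to next stage:", [a[0] for a in screened_in])  # A01 and A05 screened out

# Select-in alternative for contrast: rank-order on the personality score,
# which the distortion findings above suggest may reward inflated top-end scores.
top_down = sorted(applicants, key=lambda a: a[1], reverse=True)[:3]
print("Top-down (select-in) picks:", [a[0] for a in top_down])
```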

Biodata

The only non-cognitive predictor other than personality for which we could find validity evidence specific to blue-collar jobs is biodata. Jones and Gottschalk (1988) found little support for biodata as a predictor of training or job proficiency criteria in their meta-analysis, but in his literature review, Pannone (1994) found mostly successful examples of biodata predicting important criteria for blue-collar jobs, including tenure, job performance, training performance, and test performance (in later stages of the selection process). Another study demonstrating impressive results for biodata was conducted by Jacobs et al. (1996). In a large-scale investigation of bus operator performance, two different biodata scales together significantly predicted both subjective (i.e., supervisor ratings of safety, attendance, and customer service) and objective (i.e., absences) criteria.

One important consideration when using biodata, or other non-cognitive measures, with blue-collar populations is face validity. Pannone (1994) posited that biodata face validity is a more critical concern with blue-collar than white-collar applicants, making prior training and work experience key components of any biodata measure used for blue-collar selection. He also highlighted a potential dilemma this poses, since face-valid biodata forms tend to be more easily faked. This can be particularly challenging for internal selection, as unions will almost always oppose measures that bear little resemblance to actual work behavior (Bownas, 2000).

Realistic Job Previews

Given the unique working conditions for many blue-collar jobs, it is particularly important that applicants develop a clear understanding of the work environment during the selection process. In fact, selection practitioners should consciously approach selection system design for blue-collar jobs as if two decisions are being made—the organization deciding whom it wants to hire and applicants deciding whether the job is the right fit for them. The best way to ensure that applicants fully appreciate the work environment is through the use of realistic job previews (RJPs).

An RJP can take many forms, and the specific demands of a particular blue-collar job may make certain approaches more effective than others. In some instances, a simple written RJP that clearly outlines the working environment may be sufficient. In other instances, an RJP video is the best way to demonstrate the working environment. If practical, an interactive simulation in which the applicant answers questions based on information provided in the RJP may be even more effective. Finally, if the job lends itself to it and the consequences of a poor hire are particularly costly to the organization, more resource-intensive RJP approaches, such as ride-alongs or job shadowing, may be warranted.

RJPs can also be incorporated into other selection tools. Interviews are a particularly easy way to add RJPs to the selection process, by having the interviewer describe the work environment or specific job demands and asking applicants for their reactions or having them describe similar environments in which they have worked. Another possibility is the use of high-fidelity computer simulations that measure job-relevant KSAs while simultaneously giving the applicant a preview of the actual work.

Raising applicants' understanding of the actual work environment through RJPs can accomplish two goals. First, it can discourage individuals who prefer not to work in the job's true environment from continuing with the application process. Second, it can clarify potentially misguided perceptions and attract applicants who might otherwise avoid working conditions they perceive to be undesirable (Premack & Wanous, 1985). Both of these RJP goals are important for blue-collar jobs. Clearly, the unique working conditions for many blue-collar jobs are not for everyone. For applicants for whom this is true, it is in their and the organization's best interests to realize this during the selection process, before they have each dedicated time and resources to what ultimately turns out to be a poor fit. On the other hand, stereotypical perceptions of blue-collar work environments may lead some applicants to believe that working conditions are worse than they actually are or to overlook potential benefits of the work environment that they had not considered. Providing applicants with as much information as possible about the work environment can help accomplish both goals.

SELECTION PROCESS CONSIDERATIONS

Chapter 16 of this volume provides a general overview of issues to consider when administering assessment tools. In this section, we address several test administration considerations that are particularly salient in blue-collar selection contexts.

Standardization

Standardization of the testing process should be a primary concern for any selection system, and this is particularly so for blue-collar selection. First, for large organizations with blue-collar jobs, having a standardized selection process allows the organization to have confidence in the skill level of employees who move from one location to another. Second, many blue-collar jobs are unionized, and differences in details as small as the specific model of calculator or spell checker used by test takers (Delta Twp. v. F/F Assn. of Michigan, 1998), the noise level in the room during testing, or the location of test administration (Palm Beach County Sheriff's Office v. PBC Police PBA, 2005) can lead to grievances that can undermine the entire testing program.

Test developers can take several steps to facilitate standardization. First, test developers should exercise care in clearly specifying all aspects of test administration protocol. This includes everything from room configuration (e.g., spacing between workstations) to testing aids (e.g., model of calculator, number of pieces of scratch paper) to scripts for test administrators to read aloud to candidates when scheduling and administering tests. Test developers should also create a standardized process to communicate test results. Once the entire test administration process has been thoroughly documented, test developers can conduct training for test administrators. In situations where accuracy is most critical, such as union environments or large-scale hiring, organizations should consider developing a test administrator certification process and conducting spot checks to confirm that the entire protocol is being followed as designed.

Test Security

Given the high-stakes nature of most blue-collar selection and the development cost associated with many selection tools, test security breaches can be particularly costly. Test developers can take
many precautions to reduce the risk of security breaches. First, when developing a test, caution should be taken in selecting participating job experts. Job experts, especially those involved in test creation, should be vetted to ensure they have no history of dishonesty and no conflicts of interest related to the test. This may mean excluding trainers or educators, who are often evaluated on the extent to which their students can pass tests. In addition, all participating job experts should be made aware of the importance of test security and sign confidentiality agreements that articulate what information about the test development process can be shared and with whom.

Additional steps that can be taken during the test development phase to maximize test security include creation of multiple or adaptive versions of the test (Guo, Tay, & Drasgow, 2009) or item pools from which different content is drawn for each test administration (Zhang, Chang, & Yi, 2012). Because creating multiple or adaptive forms entails extensive additional work, such an approach is recommended for high-volume situations and situations where sufficient data can be collected during the test validation process to effectively create multiple versions. For high-volume, ongoing hiring, using adaptive testing or multiple, equated test forms is recommended. For positions that are filled in classes, such as police officers, there may be a bit more flexibility in variance between test forms if new forms are developed for each class.

Test administration protocols can also have an impact on test security. Although unproctored testing is increasingly popular because of the efficiency benefits, it increases item exposure and creates an opportunity for test takers to utilize assistance on the test. Using proctored testing, and training proctors to actively monitor for cheating, greatly reduces both of these concerns. Testing policies, such as limiting the number of attempts on a test or requiring a waiting period before a retest is allowed, can also help improve test security.

Another, sometimes overlooked, threat to test security is unwittingly allowing test content, scoring keys, or scoring formulas to enter the public domain in response to a request or subpoena from a regulatory agency, union, or plaintiff. Without the proper protections in place, these materials are potentially accessible by third parties, such as competitors or applicants. In the case of an agency demand, once the government possesses the materials, they can be accessed through a simple Freedom of Information Act (1967) request. Agencies, unions, and plaintiffs do not have the same sensitivity or motivation to protect secure test material, so it becomes the responsibility of the test developer and test user to preserve security, a professional and ethical responsibility outlined clearly in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). If the request for test materials is unreasonable or irrelevant to the issues being investigated, test users are within their rights to refuse disclosure. However, if the request is legitimate, a common remedy is to execute a protective order, limiting disclosure of the secure materials to qualified testing experts under secure conditions.
Preventing test security breaches should always be the primary goal, but identifying them once they have occurred is also important. Test developers should monitor test performance over time to look for changes in pass rates that could indicate test security breaches (Guo & Drasgow, 2010). Monitoring should account for differences across locations, administrators, and versions of the test, and software can help analyze patterns of test answers in search of evidence of possible cheating.
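As one hedged illustration of such monitoring, the sketch below compares each location-by-month pass rate against the overall baseline using a simple two-proportion z statistic and flags large upward shifts for follow-up. It is not a complete forensic analysis, and the file and column names are hypothetical; dedicated data-forensics tools go further, for example by examining answer-pattern similarity across test takers.

```python
# Minimal monitoring sketch (hypothetical data layout): flag location/month cells
# whose pass rates drift well above the historical baseline.
import math
import pandas as pd
from scipy.stats import norm

df = pd.read_csv("test_results.csv")  # columns: location, test_form, month, passed (0/1)

baseline_rate = df["passed"].mean()
baseline_n = len(df)

summary = df.groupby(["location", "month"])["passed"].agg(["mean", "count"]).reset_index()

def two_prop_z(p1, n1, p2, n2):
    """Two-proportion z statistic with a pooled variance estimate."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0

for _, row in summary.iterrows():
    z = two_prop_z(row["mean"], row["count"], baseline_rate, baseline_n)
    if z > norm.ppf(0.999):  # deliberately conservative one-tailed flag for upward shifts
        print(f"Review {row['location']} / {row['month']}: "
              f"pass rate {row['mean']:.0%} vs. baseline {baseline_rate:.0%} (z = {z:.1f})")
```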

Technology

Technology can aid in various aspects of test administration, and organizations involved in blue-collar selection must weigh issues of complexity, fidelity, and cost when deciding which technologies make sense to incorporate in the selection process. Part VIII of this book deals with the role of technology in employee selection, and we encourage blue-collar selection practitioners to read the relevant chapters to develop a more thorough understanding of the issues involved. In this section, we discuss the most commonly encountered issues in our experience. Although research findings are mixed, there is evidence that technology can be used to administer tests in ways that garner positive test-taker reactions (Bauer, Truxillo, Mack, & Costa, 2011).

Many types of tests, particularly those that use multiple-choice items (e.g., aptitude, knowledge, and non-cognitive tests), easily lend themselves to computer-based test (CBT) administration. CBT delivery can also be used for more complex assessment types that incorporate videos or interactive simulations, and many CBT delivery systems offer the ability to create a robust test-taker experience. At even higher levels of sophistication, adaptive tests can reduce test length by more efficiently and precisely determining a test taker's skill level by selecting test items based upon real-time test performance (Tonidandel, Quiñones, & Adams, 2002). CBT systems can also reduce test administrator burden. Some systems display test instructions on screen or announce them via pre-recorded audio, eliminating the need for a human test administrator. CBT systems can also manage the timing of the test and score tests without human interaction. These features can improve the accuracy and reliability of test scores by eliminating or reducing human error during administration or scoring.

Although computerized testing offers many benefits, its use with some blue-collar jobs can present challenges. In terms of efficiency, CBT is most effective when used in an unproctored setting, eliminating the need for test administrators and, in some cases (e.g., unproctored Internet testing), space and computer resources. However, many types of tests used for blue-collar selection (e.g., aptitude and knowledge tests) require strict test security and a need to minimize opportunities for cheating. Technology (e.g., computer adaptive testing and remote monitoring) and follow-up verification testing provide ways to address some of these concerns, but these approaches are not foolproof. They also require extensive resources to develop, making them cost prohibitive for many organizations. In addition, even if these approaches are successful at eliminating cheating at the group level (e.g., the impact on validity is minimal), in many blue-collar selection environments the testing process is under constant scrutiny, and a single incident of confirmed cheating or a security breach can create significant problems for the organization.

Another issue with the use of technology in blue-collar selection environments is the availability of computer resources in some locations. In many organizations that employ a large number of blue-collar workers, much of the hiring takes place at dispersed or remote operational sites. In other instances, affordable technology may not be capable of withstanding the necessary environmental conditions. Mobile devices can help overcome some of these challenges (e.g., using tablets or cell phones to complete job skills checklists in the field), but they may not be suitable for many types of assessments. Consequently, test developers should pre-test technological solutions to validate that they work as intended and reduce the burden for end users. Otherwise, paper-and-pencil alternatives should be considered.

A final challenge with CBT in blue-collar selection environments is the fidelity of the technology to the job. While computer use is increasingly common in the workplace, many blue-collar workers have little daily interaction with computers. In these situations, requiring job applicants to use a computer for selection testing, when a computer will not be used on the job, can pose a problem.
This is especially true if the use of technology is more likely to screen out older workers, who may have had less exposure to technology during their education. Test developers should consider the match of the technology to the job, and to the applicant pool, when evaluating test delivery technology.
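The adaptive testing mentioned above works by re-estimating a candidate's ability after each response and then administering the unused item that is most informative at that estimate. The sketch below shows that core selection step under a two-parameter logistic (2PL) model; it is only a conceptual illustration with invented item parameters, not a production CAT engine, which would also handle ability estimation, exposure control, content balancing, and stopping rules.

```python
# Conceptual sketch of adaptive item selection under a 2PL model.
# Item parameters below are invented for illustration.
import math

items = {  # item_id: (discrimination a, difficulty b)
    "i1": (1.2, -1.0), "i2": (0.8, 0.0), "i3": (1.5, 0.5), "i4": (1.0, 1.2),
}

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1 - p)

def next_item(theta_hat, administered):
    """Pick the unadministered item with maximum information at the current estimate."""
    remaining = {k: v for k, v in items.items() if k not in administered}
    return max(remaining, key=lambda k: item_information(theta_hat, *remaining[k]))

# Example: after a few responses the ability estimate is 0.3, and items i1 and i2 were used.
print(next_item(theta_hat=0.3, administered={"i1", "i2"}))  # most informative remaining item
```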

SELECTION POLICY CONSIDERATIONS

It is good practice for any organization that uses employment tests to have a defined selection testing policy, but for several reasons this can be particularly true for organizations employing blue-collar workers. The progression paths, higher-than-average tenure, and presence of labor unions associated with many blue-collar jobs can lead to a myriad of situations for which the "right" answer is less than obvious without clear guidelines. A good selection policy outlines the roles and responsibilities associated with the testing function and defines key terms. Most selection policies address issues such as retest intervals and how long test results remain valid. Examples of additional issues that could be included are discussed below, and a sample selection testing policy is shown in Figure 34.1.

COMPANY SELECTION TESTING POLICY

A. Background and Purpose
Pre-employment testing makes a significant contribution to fair, valid, and objective employee selection and placement. This Testing Policy is adopted to ensure the proper use of tests and protect the rights of individual test takers. Compliance with the letter and spirit of this policy will enable the company to adhere to legal requirements, professional standards and guidelines, and standard business practice. All persons involved in the testing function, and especially those who have direct contact with applicants, tests, scoring keys, and test scores, must uphold a high level of integrity and honesty. It is the policy of the company to treat employees and applicants equally, with dignity, fairness, and respect. Persons working in testing must be fair and impartial, and it is important to remember that even the perception of unfairness, dishonesty, or discrimination by applicants or others can have devastating effects on the testing program. If anyone encounters a situation where these perceptions might exist, he/she should immediately contact the Testing Program Manager.

B. Key Definitions
Accommodation: A modification or special arrangement in the testing process for persons with disabilities that allows the person to participate in the testing process.
Legal Requirements: Various laws, guidelines, and executive orders that govern one or more aspects of the testing process and are enforced by a federal, state, or local agency. These agencies include the Equal Employment Opportunity Commission (EEOC), the U.S. Department of Labor, Office of Federal Contract Compliance (OFCCP), and state and local fair employment practice agencies.
Professional Standards: Various guidelines published by professional associations such as the American Psychological Association (APA), the Society for Industrial and Organizational Psychology (SIOP), and the Society for Human Resource Management (SHRM).
Reliability: The degree of consistency in test scores.
Retest Period: The amount of time that must pass before an applicant can take a test again. This period varies for different types of tests depending on issues of reliability, validity, practice effects, variation in applicant test performance, and variation in administration procedures.
Selection Procedure: Any assessment device used as the basis for employment decisions. These include not only tests, but also minimum qualifications, interviews, reference checks, probation periods, seniority, etc.
Test Administrator: The person who administers and scores tests. This person is normally a non-union employee.
Test Result: An interpretation of an individual's test scores in terms of expected work performance or other business-related outcome (e.g., pass/fail, accept/reject, and probability of success).
Validity: The job-relatedness of a test or selection procedure.

C. General Requirements
1. Application and Scope: This Testing Policy applies to all employment tests used by the company. All employees involved in the administration and use of employment tests must be alert to possible abuse, misuse, or other problems that arise involving the use of the tests.
2. Problems: Any problems, including complaints or grievances, should be brought immediately to the attention of the Testing Program Manager.
3. Enforcement: The Testing Program Manager is responsible for the enforcement of the Testing Policy. The testing function will be monitored through: 1) audits conducted by the Testing Program Manager, 2) statistical monitoring of test data, and 3) other procedures as appropriate.
4. Violation: To maintain the integrity of the testing program, appropriate corrective action will be taken in response to any policy violations. The Testing Program Manager (and direct supervisor of the violator as appropriate) will review each alleged case of violation and determine appropriate action. Depending on the intent and severity of the violation, corrective action may include suspension from testing activities, re-training, de-certification, and/or discipline, up to and including discharge by the direct supervisor.

D. Certification of Test Administrators
All Test Administrators must be certified. The certification process includes attendance in Test Administrator training conducted by the Testing Program Manager or his/her designee. Certification also requires passing the Test Administrator certification test.

E. Test Administration
Tests must be administered only by certified Test Administrators. Administration instructions are to be followed verbatim to ensure that all applicants are treated the same. The Test Administrator should remain in the testing room at all times. This practice ensures the security and integrity of the testing session and test materials. Questions, other than routine ones, raised by applicants should be referred to the Testing Program Manager.

F. Security and Storage
All support staff that handle tests share the responsibility for maintaining test security. Handling of tests and materials must be accomplished in a manner that safeguards their security.
• All test booklets, answer sheets, and scoring keys must remain locked in the test storage room when not in use. When tests are being administered, it is essential that test materials remain in the possession of the person administering the tests. As soon as practical, the tests should be returned to their normal storage location.
• The test storage room and testing rooms that contain test materials or equipment must be kept secure at all times. No one outside of certified Test Administrators should have access to test storage areas. Any other requests for access should be directed to the Testing Program Manager. Keys to these areas must remain in the possession of certified Test Administrators, and should not be duplicated.
• When appropriate, test materials must be securely destroyed (i.e., shredded) in the presence of a certified Test Administrator.
• Any loss or compromise of tests or test materials must be reported immediately to the Testing Program Manager.
• Tests should not be lent or shown to anyone other than certified Test Administrators (except applicants during bona fide testing sessions). Requests for test materials by anyone else should be directed to the Testing Program Manager.

G. Reporting Test Results and Scores
Users of test results must be trained in their interpretation. They must understand the test results with which they work, and they must protect the privacy of the individual test taker. Test results and scores should be reported on a need-to-know basis, and should be reported only by and to the appropriate person(s). No person is entitled to test results by virtue of rank, position, or title.

H. Retest Periods and Results Expiration
Each company selection procedure has a specified retest period. Retest periods are based upon issues of reliability, validity, practice effects, applicant developmental activities, and administrative issues. Skill and knowledge tests have a 3-month retest period, and all other tests have a 6-month retest period. Skill and knowledge test results remain in effect for 1 year, unless the skill or knowledge being tested (and thus required for successful job performance) changes. The results for all other tests are good indefinitely unless there is a significant change to a test, in which case it will be considered a new test.

I. Grandparenting and Exemptions from Testing
Employees currently in a job progression when a selection test is implemented do not have to meet the testing requirement. Except for skill and knowledge tests, employees who leave a job after a test is implemented and later repost for the same job are exempt from meeting the testing requirement, as long as they had held the job for 6 months or more AND had a performance rating of satisfactory or higher when they left the job. In the case of skill and knowledge tests, they are exempt from meeting the testing requirement if they return to the same job within 1 year of leaving it, unless the skill or knowledge being tested has changed, as defined in Section H.

J. Test Accommodations under the Americans with Disabilities Act (ADA)
Prior to scheduling testing, and again prior to administering any tests, all applicants must be provided with an opportunity to request an accommodation in the testing process under the ADA. Any applicant who requests an accommodation must be referred immediately to the Testing Program Manager. The Testing Program Manager will review the request with the company's accommodation review team, and take the necessary steps to ensure that appropriate accommodations are made.

FIGURE 34.1  Sample Selection Testing Policy

Grandparenting Rules

While all external applicants should be required to take a selection test once it is implemented, the same is not necessarily true of internal candidates. Unlike with external applicants, the employing organization already has firsthand knowledge of its current employees' job performance, and if an employee is already performing work substantially similar to that for which a selection test is designed, it likely does not make sense for him/her to have to meet the selection requirement. To deal with this, organizations typically employ "grandparenting" rules. That is, the day an employment test is implemented, all internal employees currently in the line of progression are exempted from ever having to pass the test. This can also apply to employees who are not currently in the line of progression but who have previously performed the work successfully.

One gray area around grandparenting rules concerns jobs that are not in a particular line of progression but entail substantially similar work. Assuming the KSAs required for successful performance are the same in two progressions, it would seem to make sense that grandparenting rules would extend to candidates moving between the lines of progression. The challenge is defining what constitutes "substantially similar," which is not always straightforward.
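Rules of this kind are easiest to apply consistently when they are written down unambiguously, and one hedged way to check a policy for ambiguity is to express it as explicit logic. The sketch below encodes the exemption rules from the sample policy in Figure 34.1 as a small function; the field names and the function are illustrative only and are not part of any actual policy system.

```python
# Illustrative encoding of the exemption rules from the sample policy in Figure 34.1.
# Field and function names are hypothetical.
from dataclasses import dataclass

@dataclass
class Candidate:
    in_progression_at_implementation: bool  # already in the progression when the test went live
    previously_held_job: bool
    months_in_prior_job: int
    left_with_satisfactory_rating: bool
    months_since_leaving_job: int
    test_is_skill_or_knowledge: bool
    job_content_changed: bool  # the tested skill/knowledge has changed (Section H)

def exempt_from_test(c: Candidate) -> bool:
    """Return True if the internal candidate is exempt under the sample policy."""
    if c.in_progression_at_implementation:
        return True  # grandparented on the day the test was implemented
    if not c.previously_held_job:
        return False
    if c.test_is_skill_or_knowledge:
        # Skill/knowledge tests: exempt only if returning within 1 year and content is unchanged
        return c.months_since_leaving_job <= 12 and not c.job_content_changed
    # All other tests: exempt if the job was held 6+ months with a satisfactory rating at exit
    return c.months_in_prior_job >= 6 and c.left_with_satisfactory_rating
```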

Journeyman-level Hiring/Experience Exemptions

Because of employee development and morale benefits, as well as the lower recruiting and hiring costs, many organizations employing blue-collar employees in jobs with defined progression structures prefer to fill higher-level openings internally. However, this isn't always possible, and sometimes the organization must seek experienced applicants to fill higher-level openings. In these situations, the organization must decide whether the same selection process for entry-level employees will apply to experienced applicants. The employer has three options in this case: (1) require experienced applicants to complete the same selection process as entry-level applicants, (2) modify the selection process by substituting components that better apply to experienced applicants, or (3) waive specific components of the selection process.

The first option ensures the most consistency, but it may not make sense in all instances. For example, it is likely desirable to at least modify the interview to include questions that focus on the applicant's relevant experience. It also may make sense to use a knowledge test in lieu of, or in addition to, any tests that entry-level applicants must take. Waiving selection requirements for experienced applicants may seem intuitively appealing to some in the organization, but this is rarely the most appropriate route. Those who argue for this approach typically lack an appreciation of the potential impact of different selection, performance, and promotion standards of other employers for whom applicants have worked. One instance where waiving selection requirements might make sense is when states or localities have very standard, verifiable criteria for obtaining a journeyman card.

Selecting for Higher-level Jobs

For some blue-collar jobs, entry-level positions are unskilled or semi-skilled jobs that become "feeders" for skilled craft positions within the organization. In these instances, it is in the organization's best interest to ensure that those hired into the feeder jobs have the ability to successfully learn and perform the work of the skilled craft positions they will ultimately assume. The best way to do this is to hire entry-level workers using selection procedures validated for the higher-level job. However, if a selection procedure has adverse impact, it is only acceptable to do this if the majority of employees in the feeder job progress to the higher-level job in a reasonable amount of time (Uniform Guidelines on Employee Selection Procedures, 1978), and it is the organization's responsibility to evaluate this.

Establishing that the promotion rate supports selecting for the higher-level job is easy when there is a specified training or probationary period after which all employees in the feeder job
move into the higher-level job. However, if progression isn't so automatic, it requires an evaluation of historical promotional data. This can be fairly straightforward if employee movement has been regularly and carefully logged in an applicant tracking system, but it can become quite challenging if recordkeeping has been shoddy or inconsistent. It is also important to be cognizant of "seasoned" versus "unseasoned" data. For example, determining the specific percentage of employees who have progressed into a higher-level job within a five-year timeframe requires using only employees who were hired at least five years ago. This is because it remains unknown whether those hired more recently who are still in the feeder job will progress to the higher-level job within five years.

Once it is established that the progression rate supports the use of a selection procedure for a higher-level job, the organization must monitor the progression rate over time. It is possible that the progression rate supports the use of the procedure at one point in time but may not do so 5, 10, or 20 years later. This can be due to internal organizational changes (e.g., progression structures) or external economic changes (e.g., a downturn in the economy that impacts the retirement rate in the higher-level jobs, creating fewer opportunities for those in the entry-level job).
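To make the "seasoned data" point concrete, the brief sketch below computes a five-year progression rate using only employees hired at least five years before the analysis date, so that every person in the cohort has had the full window in which to progress. The file layout and column names are hypothetical.

```python
# Minimal sketch (hypothetical columns): a "seasoned" five-year progression rate,
# computed only over employees whose five-year outcome is fully observable.
import pandas as pd

df = pd.read_csv("feeder_job_hires.csv", parse_dates=["hire_date", "promotion_date"])
as_of = pd.Timestamp("2017-01-01")
window = pd.Timedelta(days=5 * 365)

# Seasoned cohort: hired at least five years before the analysis date
seasoned = df[df["hire_date"] <= as_of - window]

# Promoted within five years of hire (promotion_date is blank/NaT if never promoted)
promoted = (seasoned["promotion_date"].notna()
            & (seasoned["promotion_date"] - seasoned["hire_date"] <= window))

rate = promoted.mean()
print(f"Five-year progression rate (seasoned cohort, n = {len(seasoned)}): {rate:.0%}")
```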

Contingent Workers

Many organizations employing blue-collar workers use contingent workers (e.g., contractors and temps) to address short-term or periodic needs. When hiring contingent workers, one question is whether to put them through the same selection process as regular full-time employees occupying the same position. If the same process is used, it can take longer for the employer to fill needed slots, removing some of the flexibility associated with using contingent workers. However, if contingent workers are not required to complete the same selection process, it can create several land mines for the employer.

First, if full-time employees are working alongside contingent workers who did not have to complete the same selection process, it becomes easier to question the business necessity of the process. This may not be an issue if contingent workers are used to complete a short-term assignment, since they likely will not be performing the full scope of the job. However, the risk increases the longer the contingent workers are retained, so if the employer knows it will be a longer-term assignment, it makes sense to either put contingent workers through the full selection process or intentionally narrow the scope of their work up front to avoid any challenges to the business necessity of the selection process down the line.

Another political issue arises when a supervisor really likes a contractor or temp and wants to bring them on full time. If the contingent worker did not have to complete the selection process when he/she was initially hired and is now unable to meet the selection criteria to become a full-time employee (e.g., fails a pre-employment test), the selection practitioner should expect pushback. The supervisor may ask for an exception to the hiring criteria or may openly question the value of the selection tool. In either instance, it can quickly undermine the entire process.

In addition, the Internal Revenue Service (IRS) has recently settled several major co-employment cases with large employers (e.g., Microsoft in 2007). The IRS's rules on co-employment are complex, so organizations are well served to seek legal counsel when considering how, or if, to apply internal selection procedures to contingent workers.

CONCLUSION

The basic principles of selection for blue-collar jobs are similar to those for other types of jobs, but contextual, process, and policy considerations unique to blue-collar environments must be attended to when designing selection programs for blue-collar jobs. The work environments for many blue-collar jobs are much different than those for white-collar jobs, and blue-collar employees tend to stay with their employers much longer than other workers do. In addition, many blue-collar jobs are unionized, which necessitates a different approach and philosophy to
selection system design. The selection practitioner must also be aware of applicant population characteristics for some blue-collar jobs that differ from white-collar applicant populations in important ways (e.g., education level, test savviness, and language proficiency). All of these issues can impact the types of selection tools that are useful or appropriate for blue-collar selection.

For entry-level selection, cognitive ability tests have particular utility for blue-collar jobs because learning and knowledge acquisition is often a key component of these jobs, particularly skilled trades that use a journeyman structure and complex blue-collar jobs with far-reaching safety implications. For experienced hires, knowledge tests and work samples are typically more appropriate, and for internal selection in union environments, objective selection procedures that look like the job will be most preferred.

As far as selection policy is concerned, the defined promotional path for many blue-collar jobs has important implications. It can necessitate focusing on jobs beyond the entry level when designing and validating selection processes and require implementing grandparenting policies to define which internal candidates must complete the selection process and which are exempt. The use of contingent workers in many blue-collar industries also has important implications for selection policies, as does the hiring of experienced employees in journeyman progressions.

Employee selection research specific to blue-collar jobs is scant and often inconsistent. Given the prevalence of blue-collar jobs in the workforce and their importance to the economy, more research is warranted. There are three areas where more research would be particularly beneficial to the blue-collar selection practitioner. The first is more systematic evidence regarding the efficacy of non-cognitive predictors for blue-collar work. Ability, knowledge, and skills tests are often used for blue-collar selection, and there is significant opportunity to complement these with non-cognitive tests if more consistent research can be brought to bear to help guide these efforts for practitioners. Second, with the high average tenure in many blue-collar industries, research on the longitudinal effectiveness of different predictors would be instructive. That is, as blue-collar employees gain experience and become acculturated to the work environment, do some predictors show more (or less) effectiveness as time passes? Third, given the prevalent use of cognitive ability and physical ability tests for blue-collar selection, continued research on adverse impact reduction would be useful. Recent efforts at developing and expanding theories of adverse impact (e.g., Cottrell, Newman, & Roisman, 2015; Outtz & Newman, 2010) should help guide these efforts.

REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance. Personnel Psychology, 44, 1–26.
Bauer, T. N., Truxillo, D. M., Mack, K., & Costa, A. B. (2011). Applicant reactions to technology-based selection. In N. T. Tippins & S. Adler (Eds.), Technology-enhanced assessment of talent (pp. 190–223). San Francisco: Jossey-Bass.
Berkman v. City of New York, 536 F. Supp. 177 (E. D. N. Y. 1982).
Bownas, D. (2000). Selection programs in a union environment: A commentary. In J. Kehoe (Ed.), Managing selection in changing organizations (pp. 197–209). San Francisco: Jossey-Bass.
Bureau of Labor Statistics. (2015). Union members summary [Economic news release]. Retrieved from www.bls.gov/news.release/union2.nr0.htm
Burke, M. J., Sarpy, S. A., Tesluk, P. E., & Smith-Crowe, K. (2002). General safety performance: A test of a grounded theoretical model. Personnel Psychology, 55, 429–457.
Campion, M. A. (1983). Personnel selection for physically demanding jobs: Review and recommendations. Personnel Psychology, 36, 527–550.
Campion, M. A., Palmer, D. K., & Campion, J. E. (1997). A review of structure in the selection interview. Personnel Psychology, 50, 655–702.
Cascio, W. F., & Boudreau, J. W. (2011). Utility of selection systems: Supply-chain analysis applied to staffing decisions. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology, Vol. 2: Selecting and developing members for the organization (pp. 421–444). Washington, DC: American Psychological Association.
Cottrell, J. M., Newman, D. A., & Roisman, G. I. (2015). Explaining the black-white gap in cognitive test scores: Toward a theory of adverse impact. Journal of Applied Psychology, 100, 1713–1736.
De Bacquer, D., Van Risseghem, M., Clays, E., Kittel, F., De Backer, G., & Braeckman, L. (2009). Rotating shift work and the metabolic syndrome: A prospective study. International Journal of Epidemiology, 38, 848–854.
Delta Twp. v. F/F Assn. of Mich., M.E.R.C. # A97 J-0057, 111 Lab. Arb. Rep. (BNA) 936 (1998) (Sugerman, Arb.).
Equal Employment Opportunity Commission. (2014). EEOC sues Wisconsin Plastics for discrimination against Hmong and Hispanic employees [Press release]. Retrieved from http://www.eeoc.gov/eeoc/newsroom/release/6-9-14.cfm
Freedom of Information Act, 5 U.S.C. § 552 (1967).
Ghiselli, E. E. (1966). The validity of occupational aptitude tests. New York, NY: John Wiley & Sons.
Ghiselli, E. E. (1973). The validity of aptitude tests in personnel selection. Personnel Psychology, 26, 461–477.
Guo, J., & Drasgow, F. (2010). Identifying cheating on unproctored internet tests: The Z-test and the likelihood ratio test. International Journal of Selection and Assessment, 18, 351–364.
Guo, J., Tay, L., & Drasgow, F. (2009). Conspiracies and test compromise: An evaluation of the resistance of test systems to small-scale cheating. International Journal of Testing, 9, 283–309.
Henderson, N. D., Berry, M. W., & Matic, T. (2007). Field measures of strength and fitness predict firefighter performance on physically demanding tasks. Personnel Psychology, 60, 431–473.
Hoffman, D. A., Jacobs, R., & Landy, F. (1995). High reliability process industries: Individual, micro, and macro organizational influences on safety performance. Journal of Safety Research, 26, 131–149.
Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection, and amelioration of adverse impact in personnel selection procedures: Issues, evidence, and lessons learned. International Journal of Selection and Assessment, 9, 152–194.
Huffcutt, A. I., & Arthur, W. (1994). Hunter and Hunter (1984) revisited: Interview validity for entry-level jobs. Journal of Applied Psychology, 79, 184–190.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72–98.
Jacobs, R. R., Conte, J. M., Day, D. V., Silva, J. M., & Harris, R. (1996). Selecting bus drivers: Multiple predictors, multiple perspectives on validity, and multiple estimates of utility. Human Performance, 9, 199–217.
Jones, D. P., & Gottschalk, R. J. (1988). Validation of selection procedures for electric utility construction and skilled trades occupations: Literature review and meta-analysis of related validation studies. Washington, DC: Edison Electric Institute.
Latham, G. P., Saari, L. M., Pursell, E. D., & Campion, M. A. (1980). The situational interview. Journal of Applied Psychology, 65, 422–427.
McCarthy, J., Hrabluik, C., & Jelley, B. (2009). Progression through the ranks: Assessing employee reactions to high-stakes employment testing. Personnel Psychology, 62, 793–832.
Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck, J. R., Murphy, K., & Schmitt, N. (2007). Reconsidering the use of personality tests in personnel selection contexts. Personnel Psychology, 60, 683–729.
Mueller-Hanson, R., Heggestad, E. D., & Thornton, G. C. (2003). Faking and selection: Considering the use of personality from select-in and select-out perspectives. Journal of Applied Psychology, 88, 348–355.
O'Connell, M. S., & Delgado, K. (2011). Safer hiring. Industrial Management, 53, 24–30.
Ones, D. S., Dilchert, S., Viswesvaran, C., & Judge, T. A. (2007). In support of personality assessment in organizational settings. Personnel Psychology, 60, 995–1027.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: Findings and implications for personnel selection and theories of job performance [Monograph]. Journal of Applied Psychology, 78, 679–703.
Outtz, J. L., & Newman, D. A. (2010). A theory of adverse impact. In J. L. Outtz (Ed.), Adverse impact: Implications for organizational staffing and high stakes selection (pp. 53–94). New York, NY: Routledge.
Palm Beach County Sheriff's Office v. PBC Police PBA, AAA Case No. 32–390–100713–04, 121 Lab. Arb. Rep. (BNA) 1624 (2005) (Smith, Arb.).
Pannone, R. (1994). Blue collar selection. In G. S. Stokes, M. D. Mumford, & W. A. Owens (Eds.), Biodata handbook: Theory, research, and use of biographical information and selection and performance prediction (pp. 261–273). Palo Alto, CA: CPP Books.
Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153–172.
Premack, S. L., & Wanous, J. P. (1985). A meta-analysis of realistic job preview experiments. Journal of Applied Psychology, 70, 706–719.
Schmidt, F. L., & Hunter, J. E. (1978). Moderator research and the law of small numbers. Personnel Psychology, 31, 215–231.
Schmidt, F. L., & Hunter, J. E. (1983a). Individual differences in productivity: An empirical test of estimates derived from studies of selection procedure utility. Journal of Applied Psychology, 68, 407–414.
Schmidt, F. L., & Hunter, J. E. (1983b). Quantifying the effects of psychological interventions on employee job performance and work-force productivity. American Psychologist, 38, 473–478.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274.
Schmidt, F. L., & Hunter, J. (2004). General mental ability in the world of work: Occupational attainment and job performance. Journal of Applied Psychology, 86, 162–173.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Shane, G. S. (1979). Further tests of the Schmidt-Hunter Bayesian validity generalization model. Personnel Psychology, 32, 257–281.
Steiner, D. D., & Gilliland, S. W. (1996). Fairness reactions to personnel selection techniques in France and the United States. Journal of Applied Psychology, 81, 134–141.
Tett, R. P., & Christiansen, N. D. (2007). Personality tests at the crossroads: A response to Morgeson, Campion, Dipboye, Hollenbeck, Murphy, and Schmitt (2007). Personnel Psychology, 60, 967–993.
Tett, R. P., Jackson, D. N., & Rothstein, M. (1991). Personality measures as predictors of job performance: A meta-analytic review. Personnel Psychology, 44, 703–742.
Tonidandel, S., Quiñones, M. A., & Adams, A. A. (2002). Computer-adaptive testing: The impact of test characteristics on perceived performance and test takers' reactions. Journal of Applied Psychology, 87, 320.
Uniform Guidelines on Employee Selection Procedures. 29 C.F.R. § 1607 et seq. (1978).
Vredenburgh, A. G. (2002). Organizational safety: Which management practices are most effective in reducing employee injury rates? Journal of Safety Research, 33, 259–276.
Wee, S., Newman, D. A., & Joseph, D. L. (2014). More than g: Selection quality and adverse impact implications of considering second-stratum cognitive abilities. Journal of Applied Psychology, 99, 547–563.
White, L. A., Young, M. C., Hunter, A. E., & Rumsey, M. G. (2008). Lessons learned in transitioning personality measures from research to operational settings. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 291–295.
Zhang, J., Chang, H., & Yi, Q. (2012). Comparing single-pool and multiple-pool designs regarding test security in computerized testing. Behavior Research Methods, 44, 742–752.
Ziegler, M., MacCann, C., & Roberts, R. D. (Eds.). (2012). New perspectives on faking in personality assessment. New York, NY: Oxford University Press.

35

SELECTION FOR SERVICE AND SALES JOBS

JOHN P. HAUSKNECHT AND ANGELA L. HEAVEY

According to data from the Bureau of Labor Statistics (BLS), U.S. organizations currently employ more than 30 million workers in service and sales occupations (U.S. Department of Labor, 2015). Although annual turnover rates can exceed 100% for some jobs in services and sales, even a conservative estimate of 20% turnover reveals that U.S. organizations select more than 6 million service and sales workers each year. As such, many organizations have adopted formal assessment methods to improve hiring decisions and ultimately increase organizational effectiveness. Research shows that the use of validated selection tools as part of a broader, strategic approach to human resource (HR) management is associated with higher productivity, lower employee turnover, and better corporate financial performance (Huselid, 1995; Terpstra & Rozell, 1993). However, it is clear that not all selection methods are equally effective, nor do research findings apply uniformly to all occupations. This chapter provides a review of selection research for service and sales occupations and is organized into three major sections. First, we describe the nature of service and sales work and define the competencies that underlie success in these jobs. Second, we summarize past research concerning the methods that have been used to select service and sales employees with attention to issues of validity, applicant reactions, and adverse impact. Finally, we discuss the implications of this body of work for practice and future research, highlighting several important but often overlooked issues concerning selection system design for this critical workforce segment.

NATURE OF SERVICE AND SALES WORK

Companies rely upon their core service and sales workers to execute service-driven strategies and place the organization's products and services in the hands of customers and clients (Vinchur, Schippmann, Switzer, & Roth, 1998). Service and sales jobs share many similarities because service- and sales-related tasks can be found in both types of occupations, and there is a large degree of competency overlap (Frei & McDaniel, 1998). As detailed below, many of the similarities are attributable to the high degree of interpersonal interaction with clients or customers required in these jobs (Mount, Barrick, & Stewart, 1998).

Major Duties and Responsibilities

Broadly defined, service work involves relational processes between service providers and customers. Unlike goods, services are relatively intangible, cannot be stored or transported, require
the participation of the customer, and because of changing situational demands, tend to be less standardized (Bruhn & Georgi, 2006; Schneider & White, 2004). BLS data show that service workers (broadly defined) have come to dominate the U.S. economy, as more than 80% of jobs involve at least some aspect of service provision as opposed to goods production. Some of the most common job titles for service workers in the United States include retail sales worker (approximately 8.6 million employees, based on Occupational Employment Statistics from BLS, May 2014) and waiter/waitress (2.4 million). Table 35.1 provides a sampling of these and other job titles commonly found within the service sector. Occupational information from the O*NET™ database (http://www.onetcenter.org) reveals that the core activities of service workers often involve (a) interacting directly with the public (i.e., customers), (b) processing customer requests (e.g., billing inquiries, food orders, bank deposits), (c) soliciting sales of new products and services, and (d) routinely dealing with unpleasant and angry people, such as when resolving complaints.

In contrast, the general nature of most sales work involves selling products and services to customers, clients, or businesses. (See Table 35.1 for a sample of common sales-related job titles.) This group consists largely of retail sales workers, cashiers, and sales representatives. On the basis of O*NET information, the core activities of sales workers include (a) locating new clients or customers, (b) determining customers' needs, (c) providing information about products or services (e.g., features, benefits, pricing), (d) convincing customers to purchase products or services, (e) negotiating sale prices and terms, and (f) providing follow-up services.

TABLE 35.1  Job Titles for Common Occupations in Services and Sales

Services                                  Sales
Flight attendants                         Retail sales workers
Customer service representatives          Real estate sales agents
Ticket agents and travel clerks           Sales representatives
Tellers                                   Telemarketers
Hotel, motel, and resort desk clerks      Insurance sales agents
Waiters and waitresses                    Travel agents
Gaming service workers                    Advertising sales agents
Concierges                                Cashiers

Source: Information obtained from the O*NET database (http://www.onetcenter.org).

Competencies Required for Success

O*NET data reveal several competencies (i.e., knowledge, skills, abilities, and other characteristics [KSAOs]) that underlie successful performance in common service and sales occupations. These competencies and their O*NET definitions are summarized in Table 35.2. For the knowledge dimension, understanding basic customer and personal service principles and processes is necessary for both types of jobs, but importance ratings for this dimension are generally higher for service occupations than for sales occupations. In contrast, knowledge of sales and marketing concepts is essential for many sales jobs but is rated as much less important for service positions. In terms of required skills, speaking, active listening, service orientation, and social perceptiveness are critical for service and sales occupations. Time management and persuasion tend to be rated high in importance only for sales jobs. Analysis of the ability requirements reveals that both types of occupations require high levels of oral expression and oral comprehension ability. Examination of O*NET importance ratings for the final dimension—work styles—reveals that conscientiousness and adjustment are rated highly for both types of occupations. Interpersonal orientation is rated higher for service occupations, whereas achievement orientation is rated higher for sales jobs.

TABLE 35.2  Important Worker Requirements and Characteristics for Service and Sales Occupations

Worker Requirements

Knowledge
  Customer and personal service(a): Knowledge of customer-service principles and processes (e.g., customer needs assessment, quality service standards, evaluating customer satisfaction)
  Sales and marketing(b): Knowledge of principles and methods for promoting and selling products and services (e.g., marketing strategies, product demonstrations, sales techniques)

Skills
  Speaking: Talking to others to convey information effectively
  Active listening: Giving full attention to what others are saying, taking time to understand points made, and asking questions as appropriate
  Service orientation: Actively looking for ways to help people
  Social perceptiveness: Maintaining an awareness of others' reactions and understanding why they react as they do
  Time management(b): Managing one's own time and the time of others
  Persuasion(b): Persuading others to change their minds or behavior

Worker Characteristics

Abilities
  Oral comprehension: The ability to listen to and understand information and ideas presented through spoken words and sentences
  Oral expression: The ability to communicate information and ideas in speaking so that others will understand

Work styles
  Conscientiousness: Being dependable, reliable, attentive to detail, and trustworthy
  Adjustment: Poise, flexibility, maintaining composure, and dealing calmly with high-stress situations
  Interpersonal orientation(a): Being pleasant, cooperative, sensitive to others, and preferring to associate with other organizational members
  Achievement orientation(b): Setting personal goals, persisting in the face of obstacles, and willing to take on responsibilities and challenges

Note: According to O*NET information, worker requirements are defined as "descriptors referring to work-related attributes acquired and/or developed through experience and education." Worker characteristics are defined as "enduring characteristics that may influence both work performance and the capacity to acquire knowledge and skills required for effective work performance."
(a) Rated as more important for service-related occupations.
(b) Rated as more important for sales-related occupations.
Source: Information obtained from the O*NET database (http://www.onetcenter.org).

Contrasting Service and Sales Jobs

Although there are many similarities between service and sales occupations, closer examination of O*NET data reveals several notable differences in the degree to which certain characteristics are deemed critical to successful job performance. When compared to service occupations, sales jobs require higher levels of initiative, persistence, persuasiveness, negotiation, and time management. In contrast, service work requires higher levels of interpersonal orientation and greater knowledge of customer service principles, and the importance of sales and marketing knowledge is somewhat diminished. More broadly, sales workers are rewarded differently (e.g., commission-based pay) and tend to operate independently of supervision (Vinchur et al., 1998).

Despite these differences, the selection systems ultimately adopted for service and sales workers are often very similar. In the research review presented in the following section, we do not make strong distinctions between the two unless warranted. Instead, we organize the review around the competencies that have been routinely assessed in past research.

RESEARCH ON SELECTION FOR SERVICE AND SALES WORKERS

It is clear from our review of selection research published over the last 50 years or so that there are no simple solutions when it comes to designing selection systems for service and sales workers that are valid, fair, legally defensible, and relatively simple to administer. The review emphasizes validity evidence to reflect the focus of past research and concludes with information regarding applicant perceptions and adverse impact considerations.

Selection Research on Personality and Personality-Related Characteristics

By far, most of the published literature on selection for service and sales workers involves personality assessment. This is perhaps not surprising given the interpersonal and motivational skills required for success in these occupations (see Table 35.2). Although there are exceptions, most of the published work in this area concerns assessment of the "Big Five" dimensions of personality using self-report, paper-and-pencil inventories. A smaller number of studies examine personality dimensions that are more narrowly defined or evaluate personality-related constructs that are developed specifically for service or sales occupations. Although we generally restrict the focus to personality measures used in service and sales domains, a broader discussion of personality and selection can be found in Chapter 13, this volume.

Big Five Personality Dimensions

The dimensions of the Big Five (or Five-Factor Model) include agreeableness, conscientiousness, emotional stability, extraversion, and openness to experience. Agreeableness is generally defined as being flexible, trusting, cooperative, forgiving, and tolerant (Barrick & Mount, 1991; Vinchur et al., 1998). Conscientiousness refers to one’s level of dependability, achievement orientation, and perseverance (Barrick & Mount, 1991). Emotional stability (the opposite pole of neuroticism) reflects a relative freedom from traits such as anxiousness, depression, anger, embarrassment, and insecurity (Barrick & Mount, 1991), whereas extraversion assesses interpersonal interaction, tapping traits such as assertiveness and sociability (Vinchur et al., 1998). Finally, openness to experience refers to one’s propensity to be imaginative, curious, intelligent, or artistically sensitive (Barrick & Mount, 1991). Many scales, often containing several hundred items, have been developed to assess the Big Five.

Associations between Big Five personality dimensions and performance in sales jobs have been summarized using meta-analysis. When supervisor-provided ratings were the performance criterion, Vinchur et al. (1998) found average unadjusted correlations of .03 (agreeableness), .11 (conscientiousness), .05 (emotional stability), .09 (extraversion), and .06 (openness to experience). Effects were somewhat larger after corrections for criterion unreliability and range restriction were applied (r = .03 to .12). When examining objective sales performance as the criterion, they found average unadjusted correlations of −.02 (agreeableness), .17 (conscientiousness), −.07 (emotional stability), .12 (extraversion), and .03 (openness to experience). Values were generally larger once corrected for range restriction, particularly in the case of conscientiousness (.31) and extraversion (.22). Vinchur et al. also reported relatively larger effects for those studies that used an alternative taxonomy of personality dimensions. In particular, unadjusted validity coefficients for achievement (defined as a subdimension of conscientiousness) and potency (a subdimension of extraversion) as predictors of supervisor ratings were .14 and .15, respectively (corrected values were .25 and .28, respectively). When considering objective sales criteria, unadjusted validity estimates for achievement and potency were .23 and .15, respectively (corrected values were .41 and .26).

In service contexts, dozens of studies (e.g., Avis, Kudisch, & Fortunato, 2002; Hunthausen, Truxillo, Bauer, & Hammer, 2003; Hurley, 1998; Liao & Chuang, 2004; Mount et al., 1998) reveal correlations with job performance ratings ranging from .09 to .20 (agreeableness), .11 to .33 (conscientiousness), .09 to .21 (emotional stability), .07 to .26 (extraversion), and .09 to .20 (openness to experience). Hurtz and Donovan (2000) also provide personality validity coefficients for sales and customer service samples, values of which are comparable to those reported above. Across studies, differences in types of jobs studied, the rating criteria adopted, and other study characteristics likely explain variability in effect-size estimates, but many of these moderators have not been empirically evaluated to date. One recent exception is Judge and Zapata’s (2015) investigation of situational strength, or the degree to which work situations have clear structure, rules, and cues that govern expected behavior. They argued that weak situations (i.e., those with limited external control and greater individual discretion) would allow personality traits to play a more prominent role in determining performance. Results were supportive, as all Big Five dimensions revealed higher predictive validity in weak situations (though some results vary depending on how situational strength was operationalized; see Judge & Zapata, 2015, p. 1162).

In other studies, interactive effects among personality dimensions, moderating contextual influences, and other design considerations have been found to account for an additional 2–9% of the variance in performance ratings. Brown, Mowen, Donovan, and Licata (2002) studied front-line restaurant service workers and found that customer orientation partially mediated the relationship between certain personality traits (emotional stability, agreeableness, need for activity) and self- and supervisor-provided performance ratings. The results indicated that customer orientation accounted for an additional 2% of the variance in supervisor-reported performance and an additional 9% of the variance in self-reported performance. In a selection context, such results show the potential value of assessing certain traits (i.e., customer service orientation) in conjunction with more traditional personality characteristics.

Research has also found that certain cognitive-motivational work orientations, specifically accomplishment-striving and status-striving, may mediate the relationship between certain personality traits (i.e., conscientiousness and extraversion) and supervisor-rated job performance (Barrick, Stewart, & Piotrowski, 2002). Barrick et al. sampled telemarketing sales representatives and found that an individual’s orientation toward status-striving mediated the relationship between extraversion and job performance such that individuals scoring higher on extraversion were more likely to strive for status, which in turn resulted in higher supervisor ratings of effectiveness. Similarly, individuals high in conscientiousness were more likely to strive for accomplishment, which led to higher effectiveness ratings indirectly through status-striving.

Goal-setting behavior is another motivational variable that has been found to mediate the relationship between personality and job performance. Looking specifically at the personality trait of conscientiousness, Barrick, Mount, and Strauss (1993) studied sales representatives of a large appliance manufacturer and found that the relationship between conscientiousness and supervisor-rated job performance was mediated by goal commitment and autonomous goal-setting, such that individuals scoring high in conscientiousness were more likely to set and commit to goals, which then led to increased job performance.
The above studies help illustrate the process by which personality affects job performance. In terms of design, researchers have found that using supervisor, coworker, and customer ratings of employee personality (rather than self-ratings alone) increases the total explained variance in performance ratings by an additional 11–20% (Mount, Barrick, & Strauss, 1994). In addition, when job performance is measured using more specific versus general job criteria, personality characteristics appear to more accurately predict job performance ratings (Hogan & Holland, 2003). Regarding personality measurement, Hunthausen et al. (2003) studied entry-level customer service managers at a major airline and found that using an “at-work” frame of reference (i.e., asking respondents to think about how they behave at work when responding to survey questions) resulted in stronger relationships between two dimensions of the Big Five (extraversion and openness to experience) and supervisory performance ratings (controlling for cognitive ability). Huang and Ryan (2011) found meaningful variation in personality states within individuals over time, suggesting that workers’ abilities to adapt their personality to situational demands may be more important than having a certain “average” or stable level of any given characteristic. Finally, Grant (2013) found evidence of a curvilinear relationship between extraversion and sales revenue among call-center representatives, observing that performance gains associated with extraversion begin to decline at higher levels of this personality dimension.

785

Such findings suggest that the optimum level of certain personality characteristics may vary depending on the focal occupation and corresponding job demands.
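Curvilinear effects such as the one Grant (2013) reports are typically probed by adding a squared trait term to a regression model and inspecting the sign of its coefficient. The sketch below illustrates the general approach on simulated data; the variable names and values are illustrative assumptions, not Grant's measures or results.

```python
# Illustrative sketch with simulated data; not Grant's (2013) dataset or analysis code.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 300
extraversion = rng.normal(0.0, 1.0, n)          # standardized trait scores (hypothetical)
# Simulate an inverted-U: gains from extraversion taper off at high levels.
sales = 50 + 8 * extraversion - 3 * extraversion**2 + rng.normal(0, 10, n)

X = sm.add_constant(np.column_stack([extraversion, extraversion**2]))
fit = sm.OLS(sales, X).fit()
b0, b1, b2 = fit.params
print(f"linear = {b1:.2f}, quadratic = {b2:.2f} (negative suggests an inverted-U)")

if b2 < 0:
    peak = -b1 / (2 * b2)                        # trait level where predicted sales peak
    print(f"Predicted sales peak near extraversion = {peak:.2f} SDs above the mean")
```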

Narrow Personality Traits

Although a large amount of research centers on broad measures of personality such as the Big Five, researchers have also examined relationships between specific or narrow traits of personality and job performance. In general, there is debate concerning whether broad or narrow measures of personality are best for predicting job performance. Although some contend that broad measures are more successful at predicting overall performance (Ones & Viswesvaran, 1996), others maintain that narrow measures account for more variance and argue that researchers should use narrow personality traits to predict specific aspects of job performance (Schneider, Hough, & Dunnette, 1996). In doing so, criterion-related validity may be improved because the predictors (traits) are more closely attuned to the criterion (job performance). Although not as plentiful as the research involving broad traits, there is evidence supporting a narrow-traits approach to studying job performance. A meta-analysis conducted by Dudley, Orvis, Lebiecki, and Cortina (2006) found (in their overall analysis, which included all types of jobs) that four narrow traits of conscientiousness (dependability, cautiousness, achievement, and order) have incremental validity over the global conscientiousness construct in predicting performance. Specifically, the narrow traits explained an additional 3.7% of variance in overall performance. Breaking performance into more specific criteria, narrow traits explained an additional 5–26% of the variance in specific aspects of job performance, such as task performance (4.6%), job dedication (25.9%), interpersonal facilitation (5.8%), and counterproductive work behaviors (13.6%).

In addition, Dudley et al. (2006) examined the incremental validity of narrow traits of conscientiousness on the basis of occupational type. Jobs were divided into four occupation types: sales, customer service, managerial, and skilled/semi-skilled. Across all occupational categories, narrow conscientiousness traits were found to have incremental validity of 1–24% over the global dimension. Although the incremental validity of narrow traits over the global trait was relatively small for the customer service occupational group (1.2%), it rose to 5.4% for the sales group. The managerial occupational group showed a 9.3% increase in variance explained, whereas the skilled/semi-skilled group posted the largest increase at 24%. On the basis of these results, the authors note that the degree of prediction offered by narrow traits depends in large part on the type of job and the aspect of performance under study (Dudley et al., 2006). In the context of sales and service selection, such results suggest that although the assessment of narrow conscientiousness traits may be useful for selection of salespeople, such assessment may have less utility for those positions with a customer service focus. Although further research is necessary to examine the utility of a narrow-traits approach to personality assessment (particularly for other personality dimensions), initial results suggest the assessment of narrow traits may be useful in predicting performance for certain jobs.
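Incremental validity of the kind Dudley et al. (2006) report is commonly expressed as the change in R-squared between nested regression models, one containing the global trait alone and one adding the facet scores. The following sketch shows that comparison on simulated data; the facet labels follow Dudley et al., but all numbers are invented.

```python
# Simulated illustration of an incremental validity (delta R-squared) check;
# the values do not reproduce Dudley et al. (2006).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
facets = rng.normal(size=(n, 4))      # dependability, cautiousness, achievement, order
global_c = facets.mean(axis=1) + rng.normal(0, 0.5, n)   # global conscientiousness composite
performance = 0.30 * facets[:, 2] + 0.15 * facets[:, 0] + rng.normal(0, 1, n)

step1 = sm.OLS(performance, sm.add_constant(global_c)).fit()
step2 = sm.OLS(performance, sm.add_constant(np.column_stack([global_c, facets]))).fit()

print(f"R2, global trait only     : {step1.rsquared:.3f}")
print(f"R2, global trait + facets : {step2.rsquared:.3f}")
print(f"Incremental variance explained by narrow traits: {step2.rsquared - step1.rsquared:.3f}")
```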

Service/Customer/Sales Orientation

Given the distinctive features of service and sales work, researchers have developed composite scales to assess candidates’ dispositions toward customers, service, and/or sales. Sometimes referred to as “criterion-focused occupational personality scales” (Ones & Viswesvaran, 2001), these self-report, noncognitive composite measures typically assess a pattern of personality characteristics that are thought to underlie successful performance in service and sales domains. Service orientation is one such construct, and it is defined as a set of basic predispositions to provide helpful customer service, including dimensions such as friendliness, reliability, responsiveness, courteousness, and cooperativeness (Cran, 1994; Frei & McDaniel, 1998; Hennig-Thurau, 2004; Hogan, Hogan, & Busch, 1984).

Meta-analysis findings provide evidence of validity for service orientation measures. In a review of 41 studies, and with supervisory performance ratings serving as the criterion, Frei and McDaniel (1998) reported an unadjusted validity coefficient of .24. They also showed that service orientation was moderately correlated (approximately .30 to .40) with several Big Five personality constructs (agreeableness, emotional stability, and conscientiousness), sales drive, and social vocational interests. Service orientation was generally unrelated to extraversion, openness to experience, cognitive ability, or other vocational interests. One caveat noted by Frei and McDaniel is that most of the coefficients summarized in the meta-analysis were drawn from unpublished studies that were produced by the test vendor. More recently, McDaniel, Rothstein, and Whetzel (2006) conducted a case study of test vendor technical reports and found evidence of “moderate-to-severe publication bias” such that two of the four test vendors studied showed a greater likelihood of reporting only statistically significant validity coefficients for particular scales. A second concern is that researchers have found that service orientation measures fare no better than general personality dimensions in predicting performance and do not predict service-focused criteria any better than they predict broader criteria such as overall performance or counterproductive work behaviors (Ones & Viswesvaran, 2001; Rosse, Miller, & Barnes, 1991).

Several measures have been developed to evaluate sales potential, customer-oriented selling orientation, or sales ability. These scales variously reflect composite measures of personality facets that are important for success in sales occupations (e.g., Hakstian, Scratchley, MacLeod, Tweed, & Siddarth, 1997; Hogan, Hogan, & Gregory, 1992; Li & Wang, 2007), self-assessments of behaviors taken when selling (Saxe & Weitz, 1982), or knowledge of basic selling principles (Bruce, 1953, 1971, as cited in Vinchur et al., 1998). These studies generally find that sales potential is predictive of supervisory ratings and objective sales (Farrell & Hakstian, 2001; Hogan et al., 1992; Li & Wang, 2007). Regarding selling/customer orientation, meta-analytic evidence from 19 studies reveals unadjusted validity coefficients of .17 for subjective performance measures and .06 for objective performance indicators, although confidence intervals for the two criteria overlap (Jaramillo, Ladik, Marshall, & Mulki, 2007). Vinchur et al. (1998) summarized the predictive validity of sales ability measures using meta-analysis and reported unadjusted average correlations of .26 (supervisory performance ratings) and .21 (objective sales). Finally, a recent study by Gupta, Ganster, and Kepes (2013) showed that a specific measure of sales self-efficacy, defined as “a person’s belief that he or she has the ability to sell products and services and that he or she enjoys the tasks involved in selling products or services” (p. 691), was more predictive of both objective and subjective performance than were the broader Big Five measures (when using a concurrent validation strategy). The authors thus caution against the exclusive reliance on generalized personality measures when selecting sales workers.
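Several of the meta-analytic estimates above distinguish unadjusted coefficients from values corrected for range restriction and criterion unreliability. A minimal sketch of those standard psychometric corrections appears below; the observed correlation, reliability, and range-restriction ratio are hypothetical stand-ins, not values taken from the studies cited.

```python
import math

def correct_for_range_restriction(r_obs: float, u: float) -> float:
    """Thorndike Case II correction for direct range restriction on the predictor.
    u = SD of predictor scores in the applicant pool / SD in the restricted (hired) group."""
    return (r_obs * u) / math.sqrt(1 + r_obs**2 * (u**2 - 1))

def correct_for_criterion_unreliability(r: float, ryy: float) -> float:
    """Classical disattenuation: divide by the square root of criterion reliability."""
    return r / math.sqrt(ryy)

# Hypothetical inputs, loosely in the range of the coefficients reviewed above.
r_observed = 0.17   # unadjusted validity in an incumbent sample
u_ratio = 1.4       # assumed applicant-to-incumbent SD ratio
ryy = 0.52          # assumed reliability of supervisory ratings

r_rr = correct_for_range_restriction(r_observed, u_ratio)
r_corrected = correct_for_criterion_unreliability(r_rr, ryy)
print(f"observed r = {r_observed:.2f} -> corrected r = {r_corrected:.2f}")
```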

Selection Research on Background, Experience, Interests, and Other Life History Dimensions

In addition to personality testing, the other dominant approach to the selection of service and sales workers involves systematic assessment of candidates’ personal histories using biodata inventories. The most common approach has been to develop paper-and-pencil questionnaires that ask candidates about various domains such as work history, experience, interests, values, attitudes, and leadership activities (e.g., Allworth & Hesketh, 2000; Jacobs, Conte, Day, Silva, & Harris, 1996; McManus & Kelly, 1999; Ployhart, Weekley, Holtz, & Kemp, 2003; Schoenfeldt, 1999; Stokes, Toth, Searcy, Stroupe, & Carter, 1999). Regarding sales occupations, meta-analysis evidence reveals an average unadjusted correlation of .31 between biodata and job performance ratings and .17 between biodata and objective sales (Vinchur et al., 1998). Dalessio and Silverhart (1994) found that biodata predicted 12-month survival and first-year commissions among life insurance sales agents, although effects tended to be smaller than those typically found for performance rating criteria. Research also supports biodata as a predictor in customer service contexts. Allworth and Hesketh (2000) found that a biodata inventory that measured experience with tasks and behaviors required in service jobs provided incremental validity beyond cognitive ability and personality measures in explaining supervisory performance ratings. Although biodata inventories encompass multiple aspects of an applicant’s background, work experience is one element of such inventories that deserves more detailed examination.

Organizations routinely advertise that “previous experience is required” for many service and sales jobs, but experience is rarely addressed in most validation studies. Drawing from two broader meta-analyses that included (but were not limited to) sales and service jobs reveals some support for work experience as a predictor of performance. Schmidt and Hunter (1998) reported an adjusted correlation of .18 between previous work experience (in years) and job performance. When work experience measures were categorized according to their level of specificity (task, job, and organization) and measurement mode (amount, time, and type), researchers found adjusted correlations with performance ranging from .16 to .43 (Quinones, Ford, & Teachout, 1995) and suggested that validity can be maximized by measuring the amount of work experience and tailoring measures to the task level. Although neither study was conducted with an exclusive focus on sales or service settings, other research demonstrates the potential of assessing an applicant’s previous work experience in these contexts. Allworth and Hesketh (2000) approached the construct of work experience by collecting job requirements biodata from incumbents at an international hotel. This type of biodata asked participants to gauge how much their previous or current jobs required them to enlist certain customer service behaviors. Overall, the authors found that job requirements biodata accounted for 7.6% of unique variance in job performance. Further validation studies by Weekley and Jones (1997, 1999) in multiple service contexts found correlations between previous work experience and future performance that ranged from .14 to .19. Work experience was assessed using a multidimensional measure that asked participants to report their total full-time work experience, maximum tenure with any single organization, retail-specific work experience, number of different employers, and tenure in last job.

Selection Research on Cognitive Ability Cognitive ability testing is somewhat of an enigma in the context of service and sales occupations. Although cognitive ability is a strong predictor of performance for a wide range of jobs (Hunter & Hunter, 1984; see also Chapter 12, this volume), research that is specific to service and sales occupations yields mixed results. Some studies report finding no relationship between cognitive ability and performance (Jacobs et al., 1996; Robie, Brown, & Shepherd, 2005), whereas others have found statistically significant effects, with validity coefficients generally ranging from .10 to .25 (Allworth & Hesketh, 2000; Avis et al., 2002; Cellar, DeGrendel, Klawsky, & Miller, 1996; Hakstian et al., 1997; McCarthy et al., 2013; Miner, 1962; Rosse et al., 1991; Stokes, Hogan, & Snell, 1993; Weekley & Jones, 1997, 1999). A meta-analysis of the cognitive ability–performance relationship for sales jobs in particular may help explain these discrepant findings. Vinchur et al. (1998) found an unadjusted validity coefficient of .23 for general cognitive ability when the criterion was supervisory ratings of job performance (based on 22 studies) but only .02 when the criterion was objective sales volume (12 studies). Unadjusted validity coefficients involving verbal ability and quantitative ability (two facets of general cognitive ability) were generally low (−.17 to .08) and were largely based on a small number of studies. Thus, variance in performance criteria, predictor dimensions, and sample characteristics may account for the differences in effect sizes observed across studies. One final consideration is that O*NET data for common sales and service occupations reveal importance ratings for problem-solving and critical thinking skills that are comparably lower than those for social skills, which may also explain why cognitive ability is not a stronger predictor of performance in service and sales contexts. On the other hand, certain service and sales jobs may indeed require fairly high levels of critical thinking and problem-solving skills, such as those that require consultative selling and ongoing relationship management (e.g., pharmaceutical sales; see Ahearne, Bhattacharya, & Gruen, 2005).

Selection Research on Situational Judgment

Situational judgment tests (SJTs) present candidates with various job-related scenarios and ask how they would respond to each situation (McDaniel, Hartman, Whetzel, & Grubb, 2007; Weekley & Jones, 1997). For example, candidates for service-related positions may be asked how they would respond when a customer requests an item that the store does not carry (Weekley & Jones, 1999). On the basis of scoring guidelines established during test development, responses are weighted based on how well they match the judgment exercised by high-performing incumbents. Research shows unadjusted validity coefficients averaging in the mid-.20s when SJTs are used to predict job performance (McDaniel et al., 2007). Although this meta-analysis was not restricted to service and sales research, the findings are consistent with individual studies conducted in service contexts (McCarthy et al., 2013), including those that have used a video-based mode of administration rather than paper and pencil (Cellar et al., 1996; Weekley & Jones, 1997, 1999). These latter studies also show that SJTs offer incremental validity over cognitive ability as a predictor of performance.
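The scoring logic described above, weighting each chosen option by how closely it matches the judgment of high-performing incumbents, can be sketched in a few lines. The items, options, and weights below are entirely hypothetical and stand in for a key built during test development.

```python
# Hypothetical SJT scoring sketch: option weights reflect endorsement patterns
# among high-performing incumbents gathered during test development.
from typing import Dict

SCORING_KEY: Dict[str, Dict[str, float]] = {
    "item_01_out_of_stock": {"offer_alternative": 1.0, "check_other_store": 0.5,
                             "apologize_only": 0.0, "ignore_request": -0.5},
    "item_02_angry_caller": {"listen_then_resolve": 1.0, "transfer_immediately": 0.0,
                             "argue_policy": -0.5, "hang_up": -1.0},
}

def score_sjt(responses: Dict[str, str], key: Dict[str, Dict[str, float]] = SCORING_KEY) -> float:
    """Sum the keyed weight of each chosen option; unknown items or options score zero."""
    return sum(key.get(item, {}).get(choice, 0.0) for item, choice in responses.items())

candidate = {"item_01_out_of_stock": "offer_alternative",
             "item_02_angry_caller": "transfer_immediately"}
print(score_sjt(candidate))   # -> 1.0
```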

Applicant Reactions In addition to validity concerns, it is important to consider how applicants will respond to different selection procedures. Broadly speaking, research on applicant reactions involves understanding candidates’ perceptions of the fairness and job-relatedness of different selection procedures. The general arguments put forth in this area suggest that candidates who hold negative perceptions of the selection process will be less attracted to the company, less likely to recommend the company to others, and perhaps even less likely to perform well or remain on the job (Gilliland, 1993). Literature reviews and meta-analytic evidence confirm many of these propositions (Hausknecht, Day, & Thomas, 2004; Ryan & Ployhart, 2000), with the exception of the hypothesized performance and retention outcomes, which are only beginning to be systematically addressed. For instance, across studies involving sales and service samples, McCarthy et al. (2013) found some evidence that reactions affected job performance indirectly via test scores, but did not find any support for the notion that candidate reactions affect the criterion-related validity of test scores. When compared with a list of other possible selection methods, participants have among the least favorable reactions to personality inventories and biodata, whereas reactions to cognitive ability testing tend to be somewhat more positive but not as favorable as they are to interviews or work samples (Hausknecht et al., 2004). We are not aware of any published work on applicants’ reactions to occupation-specific inventories. Given their strong association with personality inventories, one might expect reactions to be somewhat negative. However, because these tests have been designed for particular applications in service and sales contexts, fairness and job-relatedness perceptions may improve because of the close connections to relevant aspects of the job. Smither and colleagues found that applicants’ perceptions were more positive for item types that were less abstract, suggesting that occupation-specific predictors may fare somewhat better on this dimension (Smither, Reilly, Millsap, Pearlman, & Stoffey, 1993). Applicant reactions to SJTs have been studied infrequently, but evidence from Chan and Schmitt (1997) indicated that reactions to a video-based SJT were favorable and comparable in magnitude to those found for work sample tests in the Hausknecht et al. (2004) meta-analysis. Bauer and Truxillo (2006) noted that reactions to SJTs may be dependent on the stimulus and response formats used (i.e., written vs. video, multiple-choice vs. open-ended) but suggested that reactions to SJTs overall should be more favorable than reactions to selection procedures with more abstract content.

Adverse Impact

Given the legal context of selection and employment testing, concerns about adverse impact must be given due consideration in selection system design and administration. Although a detailed treatment of adverse impact research is beyond the scope of this chapter (see Hough, Oswald, & Ployhart, 2001), several findings are summarized here concerning subgroup differences in test scores for the predictor classes reviewed above. We note upfront that even small subgroup differences can produce adverse impact (as defined by the four-fifths rule), particularly as organizations become more selective (Sackett & Ellingson, 1997). Furthermore, adverse impact calculations involving a small number of hires and/or low selection ratios tend to produce higher numbers of false positives, meaning that adverse impact can be found even though subgroup differences are not statistically significant (Roth, Bobko, & Switzer, 2006). Finally, it is important to point out that many of the estimates reported in this section are based on reviews that include, but are not limited to, samples drawn from service and sales domains. At this point in the literature, there are simply too few published studies available to make definitive conclusions concerning adverse impact in sales and service settings.

Generally speaking, subgroup differences based on ethnic/cultural background, gender, and age for Big Five personality measures tend to be minimal and, when found, are typically less than one-tenth of a standard deviation. The largest effects have been found for measures of agreeableness (women tend to score about four-tenths of a standard deviation higher than men) and emotional stability (men tend to score about one-quarter of a standard deviation higher than women; Hough et al., 2001). Subgroup differences have not been comprehensively assessed for measures of service/sales/customer orientation, although the large overlap with personality constructs suggests that differences would be relatively small. Hogan et al. (1992) examined archival data for a personality-based sales potential inventory and found no differences when comparing scores across ethnic/cultural and gender-based subgroups. Published data concerning subgroup differences for biodata inventories in sales and service contexts are limited, although broader reviews find that the average score for Whites is about one-third of a standard deviation higher than that for Blacks (Bobko, Roth, & Potosky, 1999).

For measures of cognitive ability, the cumulative evidence (across all types of occupations) indicates that Whites score approximately one standard deviation higher than Blacks, over one-half of a standard deviation higher than Hispanics, and approximately two-tenths of a standard deviation lower than Asians (Hough et al., 2001; Roth, Bevier, Bobko, Switzer, & Tyler, 2001). These estimates are moderated by job complexity such that subgroup differences tend to be larger for less complex jobs. Thus, given that many service and sales occupations are relatively low in complexity (see O*NET), subgroup differences may be somewhat larger in these domains. Regarding age and gender differences, research shows that age and cognitive ability test scores tend to be negatively related, whereas cognitive ability test performance does not generally differ between males and females (Hough et al., 2001). Finally, research on subgroup differences for video-based and written SJTs shows that Whites tend to score about four-tenths of a standard deviation higher than members of other ethnic/cultural groups, whereas women tend to score slightly higher (approximately one-tenth of a standard deviation) than men (Nguyen, McDaniel, & Whetzel, 2005, cited in Ployhart & Holtz, 2008). Potential age-based differences for SJTs have not been reported in the published literature. In summary, validity and adverse impact considerations often represent tradeoffs.
Selection methods with strong evidence of predictive validity often share variance with cognitive ability, and cognitively loaded measures tend to produce the highest levels of adverse impact. Pyburn, Ployhart, and Kravitz (2008) termed this situation the “diversity-validity dilemma.” From a practical standpoint, there are many strategies available to selection specialists who must balance diversity and validity concerns, and the interested reader is directed to several recent papers that provide valuable critiques of these various approaches (Aguinis & Smith, 2007; De Corte, Lievens, & Sackett, 2007; Kravitz, 2008; Ployhart & Holtz, 2008). One common conclusion from this line of research is that, to date, there are no universal solutions that successfully maximize validity and eliminate adverse impact.
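The four-fifths rule and the standardized subgroup differences cited in this section reduce to simple arithmetic, as the sketch below shows; the applicant counts and score summaries are hypothetical.

```python
import math

def selection_rate(hired: int, applicants: int) -> float:
    return hired / applicants

def adverse_impact_ratio(rate_focal: float, rate_highest: float) -> float:
    """Four-fifths rule: ratios below 0.80 are commonly treated as evidence of adverse impact."""
    return rate_focal / rate_highest

def cohens_d(mean_a: float, mean_b: float, sd_a: float, sd_b: float, n_a: int, n_b: int) -> float:
    """Standardized subgroup mean difference using a pooled standard deviation."""
    pooled = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled

# Hypothetical hiring outcomes for two applicant subgroups.
rate_group_1 = selection_rate(hired=30, applicants=100)   # 0.30
rate_group_2 = selection_rate(hired=10, applicants=50)    # 0.20
ratio = adverse_impact_ratio(rate_group_2, rate_group_1)
print(f"AI ratio = {ratio:.2f} ({'below' if ratio < 0.8 else 'meets'} the four-fifths threshold)")

# Hypothetical test-score summaries; a pooled d near 1.0 mirrors the largest
# cognitive ability differences summarized above.
print(f"d = {cohens_d(100, 85, 15, 15, 200, 200):.2f}")
```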

IMPLICATIONS FOR PRACTICE AND FUTURE RESEARCH

Despite the wealth of information available concerning service and sales selection, several opportunities remain to enhance our understanding of the factors that contribute to effective selection in these domains. We raise several issues with regard to past research in terms of (a) the criteria adopted, (b) the range of predictors studied, (c) the temporal perspectives addressed, and (d) the levels of analysis considered.


Criterion Issues Much of the research reviewed here has included supervisory performance ratings as the sole criterion. Although these ratings serve many important functions, rarely are organizations as interested in boosting performance appraisal ratings as they are in increasing sales volume and service quality perceptions. Objective sales figures have obvious implications for an organization’s bottom line, and customer perceptions of service quality are an important leading indicator of future sales and repeat business (Bowman & Narayandas, 2004). Furthermore, particularly in sales domains, research has shown that validity coefficients vary considerably across different criteria (Vinchur et al., 1998). Thus, organizations that use cognitive ability tests, for example, may see no benefit in terms of enhanced sales volume among new hires (but would identify candidates who will be rated highly by supervisors). Despite the appeal of using objective sales criteria, such validation work requires adequate consideration of situational opportunities that may influence performance (Stewart & Nandkeolyar, 2006). For example, researchers argue for controlling geographic or territorial constraints such as market potential, workload, company presence in a particular area, local economic conditions, and other region-specific factors (Cravens & Woodruff, 1973; McManus & Brown, 1995). In addition to considering alternative measures of job performance, researchers might broaden the types of criteria examined in future research. Only a handful of studies reviewed here examined withdrawal behaviors or counterproductive work behaviors (e.g., Dalessio & Silverhart, 1994; Jacobs et al., 1996; Ones & Viswesvaran, 2001). Given the significant costs associated with these outcomes, it would be useful to broaden the scope of selection research by incorporating these criteria into validity studies whenever possible.

Predictor Issues Almost exclusively, published research in this area tends to feature self-report, paper-and-pencil personality tests or biodata inventories. This work is valuable, but research must also respond to new and different forms of assessment. For example, Winkler (2006) estimated that about 5% of organizations (e.g., Toyota, SunTrust Bank) are using technology to assess important competencies via online job simulations. These interactive assessments place candidates in a virtual environment that mirrors the work that they would be doing on the job and allows companies to assess important competencies while providing a realistic preview of the work. Other forms of capturing live behavior (e.g., assessment centers) may also be appropriate for assessing service and sales candidates, although little work has been published in this area (see Burroughs & White, 1996, for an exception). The format of predictors is another important consideration, particularly as organizations consider how to leverage technology when building selection systems. Technology-based selection measures differ from their paper-and-pencil counterparts in several ways (Weekley & Jones, 1997, 1999; see also Chapter 39, this volume) and suggest a different profile of considerations for organizations in terms of costs, applicant reactions, administrative ease, and so forth. Until additional research examines these alternative approaches in the context of what we already know, it is unclear what (if any) effect these alternative forms of assessment have on selection outcomes in service and sales contexts.

Temporal Issues

Another issue raised by this analysis is that we currently know very little about the role of time in the selection process. Much of the research reviewed here uses concurrent (i.e., cross-sectional) designs or time-lagged predictive designs with a fairly short temporal window (e.g., six-month performance review). Yet, recent explorations into predictors of performance trends suggest that past findings may not readily generalize across time (Ployhart & Hakel, 1998; Stewart & Nandkeolyar, 2006; Thoresen, Bradley, Bliese, & Thoresen, 2004). For example, in a study of insurance sales personnel, Hofmann, Jacobs, and Baratta (1993) found that the performance of sales agents followed a quadratic trend over time such that mean performance was initially positive and linear, then curved asymptotically with time. The authors suggested that different skills and abilities may be predictive of performance at early and later stages of the sales agents’ careers. Goal orientation was advanced as a potential determinant of intraindividual performance trends, such that highly goal-oriented individuals may be better equipped to learn from unsuccessful sales calls over time and more likely to engage in the self-development activities that ultimately lead to improved performance.

Other researchers have shown that conclusions about personality-performance relationships differ when comparing cross-sectional and longitudinal designs such that certain characteristics are more predictive of performance trends than they are of initial performance (Thoresen et al., 2004), whereas others moderate the effect of situational opportunities on performance over time (Stewart & Nandkeolyar, 2006). Conclusions about the predictive validity of cognitive ability measures are also likely time-dependent in service and sales contexts. Keil and Cortina (2001) found that validity coefficients decline with time, and although their review was not confined to sales and service contexts, Cascio and Aguinis (2005) argued that task performance should be dynamic in service contexts (thus making it more difficult to predict over time) because service workers often have to adapt to new work processes as new products or services are introduced. These studies demonstrate that by focusing more closely on temporal dynamics, organizations can not only select candidates who are likely to perform well soon after hire but also identify those who have the capacity to increase their performance over time or reach proficiency in a shorter period, both of which are critically important to long-term organizational success.
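Intraindividual performance trends of the kind Hofmann, Jacobs, and Baratta (1993) describe are usually modeled with individual growth curves. The sketch below fits a mixed-effects model with linear and quadratic time terms to simulated quarterly data; it is a generic illustration with a random intercept only, not a reconstruction of their analysis.

```python
# Simulated illustration of modeling intraindividual performance trends;
# not the Hofmann, Jacobs, & Baratta (1993) data or model specification.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n_agents, n_quarters = 100, 8
rows = []
for agent in range(n_agents):
    baseline = rng.normal(50, 8)                 # agent-specific starting level
    for t in range(n_quarters):
        # positive early growth that flattens over time (quadratic trend)
        perf = baseline + 4.0 * t - 0.35 * t**2 + rng.normal(0, 3)
        rows.append({"agent": agent, "time": t, "performance": perf})
df = pd.DataFrame(rows)
df["time_sq"] = df["time"] ** 2

# Random intercept per agent; fixed linear and quadratic time effects.
model = smf.mixedlm("performance ~ time + time_sq", df, groups=df["agent"]).fit()
print(model.summary())   # a negative time_sq coefficient indicates flattening growth
```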

Levels Issues A final consideration is that nearly all of the studies reviewed here focus on selection at the individual level of analysis. This reflects a long tradition in psychology of examining individual difference characteristics that predict individual job performance. However, selection researchers have argued that the field needs to examine relationships at higher levels of analysis (Ployhart, 2004, 2006). In one recent empirical example, Ployhart, Weekley, and Baughman (2006) found unique personality-performance associations at individual, job, and organizational levels and concluded that higher-level relationships may occur because certain personality factors relate to the teamwork and coordination behaviors critical for success in service work. Another multilevel study found that a manager’s personality may play a role in shaping service climate (Salvaggio et al., 2007). Core self-evaluations were administered to managers, in which participants rated themselves on certain personality traits (i.e., self-esteem, self-efficacy, neuroticism, etc.). Results indicated that managers with more positive self-evaluations had higher service quality orientations, which in turn led to more positive service climates. As the authors note, these results demonstrate the impact that individual managers’ personality traits may have on the overall workplace service climate. Considering that service climate positively relates to sales volume via customer-focused citizenship behaviors and customer satisfaction (Schneider, Ehrhart, Mayer, Saltz, & Niles-Jolley, 2005), such findings show that careful attention to employee selection may be useful in predicting not only individual performance but also more distal indicators of success. Taken together, these studies demonstrate that multilevel approaches are valuable for addressing the longstanding question of how to improve organizational effectiveness through selection (see also Chapter 5, this volume).
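One simple entry point to the multilevel questions raised here is to aggregate individual trait scores to the unit level and relate the unit means to unit outcomes, a far cruder version of what the emergence research by Ployhart and colleagues does. The sketch below uses simulated data and invented variable names.

```python
# Hypothetical sketch of a unit-level (aggregated) personality-performance check; data are simulated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_units, n_per_unit = 60, 25
unit_rows, person_rows = [], []
for unit in range(n_units):
    climate = rng.normal(0, 1)                       # unobserved unit-level service climate
    unit_rows.append({"unit": unit, "unit_sales": 100 + 10 * climate + rng.normal(0, 5)})
    for _ in range(n_per_unit):
        person_rows.append({"unit": unit,
                            "conscientiousness": rng.normal(0.3 * climate, 1)})
units = pd.DataFrame(unit_rows)
people = pd.DataFrame(person_rows)

# Aggregate individual trait scores to the unit level, then merge with unit outcomes.
unit_personality = people.groupby("unit", as_index=False)["conscientiousness"].mean()
merged = unit_personality.merge(units, on="unit")
print(merged[["conscientiousness", "unit_sales"]].corr())
```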

CONCLUSIONS

Service and sales workers represent a significant portion of the global workforce, and the economic success of many organizations hinges upon their performance. Although much remains to be learned, the research reviewed here shows that careful attention to selection system design provides organizations with an opportunity to improve the overall quality of hiring decisions for service and sales employees. Results clearly indicate that investments in formal selection methods improve the odds of finding service and sales workers who will perform well on the job. The validity coefficients discussed here are not large, but they can translate into substantial benefits in terms of reduced hiring and training costs, increased sales productivity, and better service quality. Combining the results of these individual-level studies with what is known about similar relationships at higher levels and over time shows that effective selection is a viable means by which organizations can generate competitive advantage.

REFERENCES Aguinis, H., & Smith, M. A. (2007). Understanding the impact of test validity and bias on selection errors and adverse impact in human resource selection. Personnel Psychology, 60, 165–199. Ahearne, M., Bhattacharya, C. B., & Gruen, T. (2005). Antecedents and consequences of customer-company identification: Expanding the role of relationship marketing. Journal of Applied Psychology, 90, 574–585. Allworth, E., & Hesketh, B. (2000). Job requirements biodata as a predictor of performance in customer service roles. International Journal of Selection and Assessment, 8, 137–147. Avis, J. M., Kudisch, J. D., & Fortunato, V. J. (2002). Examining the incremental validity and adverse impact of cognitive ability and conscientiousness on job performance. Journal of Business and Psychology, 17, 87–105. Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26. Barrick, M. R., Mount, M. K., & Strauss, J. P. (1993). Conscientiousness and performance of sales representatives—Test of the mediating effects of goal-setting. Journal of Applied Psychology, 78, 715–722. Barrick, M. R., Stewart, G. L., & Piotrowski, M. (2002). Personality and job performance: Test of the mediating effects of the motivation among sales representatives. Journal of Applied Psychology, 87, 43–51. Bauer, T. N., & Truxillo, D. M. (2006). Applicant reactions to situational judgment tests: Research and related practical issues. In J. A. Weekley & R. E. Ployhart (Eds.), Situational judgment tests: Theory, measurement, and applications (pp. 233–249). Mahwah, NJ: Erlbaum. Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications of a meta-analytic matrix incorporating cognitive ability, alternative predictors, and job performance. Personnel Psychology, 52, 561–589. Bowman, D., & Narayandas, D. (2004). Linking customer management effort to customer profitability in business markets. Journal of Marketing Research, 41, 433–447. Brown, T. J., Mowen, J. C., Donovan, D. T., & Licata, J. W. (2002). The customer orientation of service workers: Personality trait effects on self- and supervisor performance ratings. Journal of Marketing Research, 39, 110–119. Bruhn, M., & Georgi, D. (2006). Services marketing: Managing the service value chain. Essex, England: Pearson. Burroughs, W. A., & White, L. L. (1996). Predicting sales performance. Journal of Business and Psychology, 11, 73–84. Cascio, W. F., & Aguinis, H. (2005). Test development and use: New twists on old questions. Human Resource Management, 44, 219–235. Cellar, D. F., DeGrendel, D. J. D., Klawsky, J. D., & Miller, M. L. (1996). The validity of personality, service orientation, and reading comprehension measures as predictors of flight attendant training performance. Journal of Business and Psychology, 11, 43–54. Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity. Journal of Applied Psychology, 82, 143–159. Cran, D. J. (1994). Towards validation of the service orientation construct. Service Industries Journal, 14, 34–44. Cravens, D. W., & Woodruff, R. B. (1973). An approach for determining criteria of sales performance. Journal of Applied Psychology, 57, 242–247. Dalessio, A. T., & Silverhart, T. A. (1994). Combining biodata test and interview information: Predicting decisions and performance criteria. Personnel Psychology, 47, 303–315. 
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393. Dudley, N. M., Orvis, K. A., Lebiecki, J. E., & Cortina, J. M. (2006). A meta-analytic investigation of conscientiousness in the prediction of job performance: Examining the intercorrelations and the incremental validity of narrow traits. Journal of Applied Psychology, 91, 40–57.


John P. Hausknecht and Angela L. Heavey Farrell, S., & Hakstian, A. R. (2001). Improving salesforce performance: A meta-analytic investigation of the effectiveness and utility of personnel selection procedures and training interventions. Psychology & Marketing, 18, 281–316. Frei, R. L., & McDaniel, M. A. (1998). Validity of customer service measures in personnel selection: A review of criterion and construct evidence. Human Performance, 11, 1–27. Gilliland, S. W. (1993). The perceived fairness of selection systems: An organizational justice perspective. Academy of Management Review, 18, 694–734. Grant, A. M. (2013). Rethinking the extraverted sales ideal: The ambivert advantage. Psychological Science, 24, 1024–1030. Gupta, N., Ganster, D. C., & Kepes, S. (2013). Assessing the validity of sales self-efficacy: A cautionary tale. Journal of Applied Psychology, 98, 690–700. Hakstian, A. R., Scratchley, L. S., MacLeod, A. A., Tweed, R. G., & Siddarth, S. (1997). Selection of telemarketing employees by standardized assessment procedures. Psychology & Marketing, 14, 703–726. Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection procedures: An updated model and meta-analysis. Personnel Psychology, 57, 639–683. Hennig-Thurau, T. (2004). Customer orientation of service employees—Its impact on customer satisfaction, commitment, and retention. International Journal of Service Industry Management, 15, 460–478. Hofmann, D. A., Jacobs, R., & Baratta, J. E. (1993). Dynamic criteria and the measurement of change. Journal of Applied Psychology, 78, 194–204. Hogan, J., Hogan, R., & Busch, C. M. (1984). How to measure service orientation. Journal of Applied Psychology, 69, 167–173. Hogan, J., Hogan, R., & Gregory, S. (1992). Validation of a sales representative selection inventory. Journal of Business and Psychology, 7, 161–171. Hogan, J., & Holland, B. (2003). Using theory to evaluate personality and job-performance relations: A socioanalytic perspective. Journal of Applied Psychology, 88, 100–112. Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration of adverse impact in personnel selection procedures: Issues, evidence and lessons learned. International Journal of Selection and Assessment, 9, 152–194. Huang, J. L., & Ryan, A. M. (2011). Beyond personality traits: A study of personality states and situational contingencies in customer service jobs. Personnel Psychology, 64, 451–488. Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72–98. Hunthausen, J. M., Truxillo, D. M., Bauer, T. N., & Hammer, L. B. (2003). A field study of frame-of-reference effects on personality test validity. Journal of Applied Psychology, 88, 545–551. Hurley, R. F. (1998). Customer service behavior in retail settings: A study of the effect of service provider personality. Journal of the Academy of Marketing Science, 26, 115–127. Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The Big Five revisited. Journal of Applied Psychology, 85, 869–879. Huselid, M. A. (1995). The impact of human resource management practices on turnover, productivity, and corporate financial performance. Academy of Management Journal, 38, 635–672. Jacobs, R. R., Conte, J. M., Day, D. V., Silva, J. M., & Harris, R. (1996). Selecting bus drivers: Multiple predictors, multiple perspectives on validity, and multiple estimates of utility. Human Performance, 9, 199–217. 
Jaramillo, F., Ladik, D. M., Marshall, G. W., & Mulki, F. P. (2007). A meta-analysis of the relationship between sales orientation-customer orientation (SOCO) and salesperson job performance. Journal of Business and Industrial Marketing, 22, 302–310. Judge, T. A., & Zapata, C. P. (2015). The person–situation debate revisited: Effect of situation strength and trait activation on the validity of the Big Five personality traits. Academy of Management Journal, 58, 1149–1179. Keil, C. T., & Cortina, J. M. (2001). Degradation of validity over time: A test and extension of Ackerman’s model. Journal of Applied Psychology, 127, 673–697. Kravitz, D. A. (2008). The diversity-validity dilemma: Beyond selection—the role of affirmative action. Personnel Psychology, 61, 173–193. Li, L., & Wang, L. (2007). Development and validation of the salespeople forced choice behavioral style test in the information technology industry. Personality and Individual Differences, 42, 99–110. Liao, H., & Chuang, A. (2004). A multilevel investigation of factors influencing employee service performance and customer outcomes. Academy of Management Journal, 47, 41–58. McCarthy, J. M., Van Iddekinge, C. H., Lievens, F., Kung, M., Sinar, E. F., & Campion, M. A. (2013). Do candidate reactions relate to job performance or affect criterion-related validity? A multistudy investigation of relations among reactions, selection test scores, and job performance. Journal of Applied Psychology, 98, 701–719.


Selection for Service and Sales Jobs McDaniel, M. A., Hartman, N. S., Whetzel, D. L., & Grubb, W. L. (2007). Situational judgment tests, response instructions, and validity: A meta-analysis. Personnel Psychology, 60, 63–91. McDaniel, M. A., Rothstein, H. R., & Whetzel, D. L. (2006). Publication bias: A case study of four test vendors. Personnel Psychology, 59, 927–953. McManus, M. A., & Brown, S. H. (1995). Adjusting sales results measures for use as criteria. Personnel Psychology, 48, 391–400. McManus, M. A., & Kelly, M. L. (1999). Personality measures and biodata: Evidence regarding their incremental predictive value in the life insurance industry. Personnel Psychology, 52, 137–148. Miner, J. B. (1962). Personality and ability factors in sales performance. Journal of Applied Psychology, 46, 6–13. Mount, M. K., Barrick, M. R., & Stewart, G. L. (1998). Five-factor model of personality and performance in jobs involving interpersonal interactions. Human Performance, 11, 145–165. Mount, M. K., Barrick, M. R., & Strauss, J. P. (1994). Validity of observer ratings of the Big 5 personality-factors. Journal of Applied Psychology, 79, 272–280. Ones, D. S., & Viswesvaran, C. (1996). Bandwidth-fidelity dilemma in personality measurement for personnel selection. Journal of Organizational Behavior, 17, 609–626. Ones, D. S., & Viswesvaran, C. (2001). Integrity tests and other criterion-focused occupational personality scales (COPS) used in personnel selection. International Journal of Selection and Assessment, 9, 31–39. Ployhart, R. E. (2004). Organizational staffing: A multilevel review, synthesis, and model. Research in Personnel and Human Resource Management, 23, 121–176. Ployhart, R. E. (2006). Staffing in the 21st century: New challenges and strategic opportunities. Journal of Management, 32, 868–897. Ployhart, R. E., & Hakel, M. D. (1998). The substantive nature of performance variability: Predicting interindividual differences in intraindividual performance. Personnel Psychology, 51, 859–901. Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153–172. Ployhart, R. E, Weekley, J. A., & Baughman, K. (2006). The structure and function of human capital emergence: A multilevel examination of the attraction-selection-attrition model. Academy of Management Journal, 49, 661–677. Ployhart, R. E., Weekley, J. A., Holtz, B. C., & Kemp, C. (2003). Web-based and paper-and-pencil testing of applicants in a proctored setting: Are personality, biodata, and situational judgment tests comparable? Personnel Psychology, 56, 733–752. Pyburn, K. M., Ployhart, R. E., & Kravitz, D. A. (2008). The diversity-validity dilemma: Overview and legal context. Personnel Psychology, 61, 143–151. Quinones, M. A., Ford, J. K., & Teachout, M. S. (1995). The relationship between work experience and job performance: A conceptual and meta-analytic review. Personnel Psychology, 48, 887–910. Robie, C., Brown, D. J., & Shepherd, W. J. (2005). Interdependence as a moderator of the relationship between competitiveness and objective sales performance. International Journal of Selection and Assessment, 13, 274–281. Rosse, J. G., Miller, H. E., & Barnes, L. K. (1991). Combining personality and cognitive ability predictors for hiring service-oriented employees. Journal of Business and Psychology, 5, 431–445. Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., III., & Tyler, P. (2001). 
Ethnic group differences in cognitive ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297–330. Roth, P. L., Bobko, P., & Switzer, F. S., III. (2006). Modeling the behavior of the 4/5ths rule for determining adverse impact: Reasons for caution. Journal of Applied Psychology, 91, 507–522. Ryan, A. M., & Ployhart, R. E. (2000). Applicants’ perceptions of selection procedures and decisions: A critical review and agenda for the future. Journal of Management, 26, 565–606. Sackett, P. R., & Ellingson, J. E. (1997). The effects of forming multi-predictor composites on group differences and adverse impact. Personnel Psychology, 50, 707–721. Salvaggio, A. N., Schneider, B., Nishii, L. H., Mayer, D. M., Ramesh, A., & Lyon, J. S. (2007). Manager personality, manager service quality orientation, and service climate: Test of a model. Journal of Applied Psychology, 92, 1741–1750. Saxe, R., & Weitz, B. A. (1982). The SOCO scale: A measure of the customer orientation of salespeople. Journal of Marketing Research, 19, 343–351. Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274. Schneider, B., Ehrhart, M. G., Mayer, D. M., Saltz, J. L., & Niles-Jolly, K. (2005). Understanding organization‑customer links in service settings. Academy of Management Journal, 48, 1017–1032. Schneider, B., & White, S. S. (2004). Service quality: Research perspectives. Thousand Oaks, CA: Sage. Schneider, R. J., Hough, L. M., & Dunnette, M. D. (1996). Broadsided by broad traits: How to sink science in five dimensions or less. Journal of Organizational Behavior, 17, 639–655.


John P. Hausknecht and Angela L. Heavey Schoenfeldt, L. F. (1999). From dust bowl empiricism to rational constructs in biographical data. Human Resource Management Review, 9, 147–167. Smither, J. W., Reilly, R. R., Millsap, R. E., Pearlman, K., & Stoffey, R. W. (1993). Applicant reactions to selection procedures. Personnel Psychology, 46, 49–76. Stewart, G. L., & Nandkeolyar, A. K. (2006). Adaptation and intraindividual variation in sales outcomes: Exploring the interactive effects of personality effects of personality and environmental opportunity. Personnel Psychology, 59, 307–332. Stokes, G. S., Hogan, J. B., & Snell, A. F. (1993). Comparability of incumbent and applicant samples for the development of biodata keys—The influence of social desirability. Personnel Psychology, 46, 739–762. Stokes, G. S., Toth, C. S., Searcy, C. A., Stroupe, J. P., & Carter, G. W. (1999). Construct/rational biodata dimensions to predict salesperson performance: Report on the U.S. Department of Labor sales study. Human Resource Management Review, 9, 185–218. Terpstra, D. E., & Rozell, E. J. (1993). The relationship of staffing practices and organizational level measures of performance. Personnel Psychology, 46, 27–48. Thoresen, C. J., Bradley, J. C., Bliese, P. D., & Thoresen, J. D. (2004). The Big Five personality traits and individual job performance growth trajectories in maintenance and transitional job stages. Journal of Applied Psychology, 89, 835–853. U.S. Department of Labor. (2015). Occupational employment and wages. Washington, DC: Author. Vinchur, A. J., Schippmann, J. S., Switzer, F. S., & Roth, P. L. (1998). A meta-analytic review of predictors of job performance for salespeople. Journal of Applied Psychology, 83, 586–597. Weekley, J. A., & Jones, C. (1997). Video-based situational testing. Personnel Psychology, 50, 25–49. Weekley, J. A., & Jones, C. (1999). Further studies of situational tests. Personnel Psychology, 52, 679–700. Winkler, C. (2006). Job tryouts go virtual: Online job simulations provide sophisticated candidate assessments. HR Magazine, 51, 131–134.


36

SELECTION IN MULTINATIONAL ORGANIZATIONS

PAULA CALIGIURI AND KAREN B. PAUL

Four concurrent changes created the era of globalization in the early 1990s: (1) the transition to a market economy in many former Soviet-bloc countries, (2) the liberalization of markets and increases in regional economic integration (e.g., NAFTA and the European Union), (3) advances in technology and communication enabling firms of all sizes to compete globally and share information in real time, and (4) increases in firms' global reach through foreign direct investment, joint ventures, acquisitions, and the like (Dunning, 2009). Multinational companies (MNCs), in the era of globalization, need to strategically adapt, reconfigure, and acquire the resources needed for the ever-changing global marketplace. A critical resource for strategic advantage within MNCs is their human talent, which, like other resources, needs to be managed and leveraged effectively. Across subsidiaries and operations around the world, the right skills need to be in the right locations when needed. Cascio and Aguinis (2008, p. 135) noted that "the company of the future will call on talent and resources—especially intellectual capital—wherever they can be found around the globe."

Companies need to attract and select employees globally with the technical skills necessary for ever-expanding international operations. This is a challenge because global talent shortages are among the leading risks affecting MNCs' operational agility, competitiveness, and strategic growth (EY, 2013). One-third of CEOs have had to cancel global strategic initiatives due to talent shortages (PriceWaterhouseCoopers, 2012), and the concern is present in almost every country (Manpower, 2011). In addition to finding the right employees with the necessary skill set, MNCs also need culturally agile professionals who can work effectively in different countries and with people from different cultures. CEOs report a dearth of culturally agile leaders who are able to manage the complexity of diverse environments, negotiate cultural challenges, and understand regulatory requirements and stakeholder demands in foreign countries (PriceWaterhouseCoopers, 2007).

Consistent with these talent-related challenges, this chapter is divided into three major sections. The first section discusses the strategic alignment of employee selection systems in MNCs: centralized systems for greater global integration, localized systems for greater local responsiveness, and synergistic or hybrid systems responsive to both local and global demands (Bartlett & Ghoshal, 1989; Prahalad & Doz, 1987). The second section covers the specific challenges in developing MNCs' employee selection systems. From both cross-cultural and cross-national perspectives, it emphasizes the effect of national culture on the selection and assessment methods used, culture's influence on candidates' reactions, and cross-national differences in HR systems affecting the selection methods used (e.g., discrimination and privacy laws, unemployment rates, education systems). The third section focuses on the selection of culturally agile professionals who can effectively staff and lead strategic initiatives globally, whether as international assignees, global team members, or business travelers.

STRATEGIC ALIGNMENT OF EMPLOYEE SELECTION IN MNCS

MNCs and domestic firms differ along two dimensions: geographic dispersion and multiculturalism (Adler, 2001). Geographic dispersion is the extent to which a firm operates across borders and must coordinate operations across borders in order to be effective. Multiculturalism is the extent to which the workers, customers, suppliers, etc. are from diverse cultural backgrounds and the extent to which the organization must coordinate the activities of people from diverse cultures in order to be effective. In leveraging both geographic dispersion and multiculturalism, MNCs must achieve a unique balance between the need to be centralized, or tightly controlled by headquarters, and the need to be decentralized, or operating differently across diverse locations (Bartlett & Ghoshal, 1989).

The achievement of this balance between centralization and decentralization can happen in various ways. Extreme centralization can provide an organization with a variety of competitive benefits, such as economies of scale (and the associated cost controls), improved value chain linkages, product/service standardization, and global branding. Extreme decentralization, however, can also be highly strategic, enabling a firm to modify products or services to fully meet local customer needs, respond to local competition, remain compliant with various governments' regulations in different countries of operation, readily attract local employees, and penetrate local business networks. In most MNCs, these extremes are not useful strategies organization-wide. To be successful, MNCs (and units within MNCs) should adopt a strategy that "fits" the complexity of their environment and how they compete (Bartlett & Ghoshal, 1989; Ghoshal & Bartlett, 1990; Ghoshal & Nohria, 1993).

When greater global integration or standardization is strategically desired, MNCs leverage economies of scale, share costs and investments throughout the organization, and have greater control over systems and functions. When greater local responsiveness is desired, MNCs vary their products and services to suit the preferences of clients in each of their diverse markets and allow foreign subsidiaries to run their operations as needed. When a transnational approach is desired to enable innovation and learning, units around the world share approaches and ideas and spend more resources to assimilate approaches to be used globally. Research has found that MNCs' strategy affects their approach to managing human resources (Caligiuri & Colakoglu, 2008; Gomez & Sanchez, 2005).

Global Integration and Employee Selection in MNCs

When MNCs (or units within MNCs) desire greater standardization, key functions and tasks are managed and controlled by headquarters, for example, when customers expect consistency, such as outstanding quality (e.g., Sony), a luxury fashion image (e.g., Louis Vuitton), or global standards for fast food (e.g., McDonald's). Production workers at Sony must maintain worldwide quality standards, regardless of where they are in the world. Sales agents at Louis Vuitton must provide world-class customer service. Food preparation staff at McDonald's must prepare food to the company's well-known global standards, and janitorial staff must clean restrooms to a global standard of sanitation and hygiene. In all of these cases, the standard is set by corporate headquarters, and the uniformity is a competitive advantage.

To maintain standards and consistency, MNCs will tend to have centrally developed dimensions to be included in selection systems, or possibly even centrally developed selection systems. For example, a global fast-food restaurant chain is competitive, in part, by delivering consistency to its customers in terms of food, service, cleanliness, and restaurant appearance. It follows that this same fast-food restaurant chain would include friendliness and personal hygiene in its selection systems, regardless of country. In technically oriented roles, in which international consistency is needed, selection dimensions are more objective and relatively easy to maintain across cultures.


At 3M, for example, a global prehire test for sales representatives has been developed and is currently used in 22 countries. The test was originally developed in the local languages and administered online so that it would be available regardless of applicants' time zones. The idea was that 3M should develop one test and retain the rights to it; this would avoid intellectual property issues in test publishing, such as the difficulty of obtaining permission to translate an existing test into another language or move it to a different system. As a result, part of 3M's solution was to create its own test using the sales competencies that were jointly developed with 3M Sales and Marketing. The competency model for sales representatives globally (see Table 36.1) has been integrated into 3M's selection system for potential new hires and also into the training and development programs for incumbent sales representatives.

In developing this competency model as the basis for the common global test, an international job analysis was conducted to assess whether the content domain of the 3M sales representative position was similar around the world. A job analysis questionnaire (JAQ) was administered to sales representatives and sales subject matter experts in 10 countries. The JAQ assessed work behaviors from the content domain for 3M sales representatives shown in Table 36.1. In 2006, 3M sales representatives from Brazil, Russia, India, China, and Poland (labeled BRICP) completed the JAQ. In 2007, 3M sales representatives in Australia, Singapore, Taiwan, Japan, and Korea (labeled APAC) completed the JAQ. The seven most important work behavior dimensions are presented in Table 36.2, along with the average importance rating of each dimension for both country sets. As Table 36.2 illustrates, the results of this global job analysis found that the job content was the same around the world (an illustrative summary of these ratings follows Table 36.2).

TABLE 36.1 Core Sales Competencies for 3M Sales Representatives

Cluster                    Functional Competencies
Selling                    Customer Consultation; Sales Channel Management; External Organization Acumen
Customer Management        Opportunity Pipeline Management; Managing Business at Risk
Analysis and Planning      Strategic Sales Planning

TABLE 36.2 Mean Importance Ratings for Work Behavior Dimensions: 3M Sales Representatives' Job Analysis Results Across Two Sets of Countries

                                                          Country Set
Work Behavior Dimensions                              APAC         BRICP
Conduct sales and follow-up                           3.7          3.7
Work with others                                      3.4          3.7
Provide information to customers and distributors     3.7          3.6
Plan, organize, and prioritize                        3.5          3.6
Maintain knowledge and skills                         3.4          3.5
Negotiate and persuade                                3.4          3.5
Document work: keep records                           3.3          3.5

Note: The APAC country set includes Australia, Japan, Korea, Singapore, and Taiwan. The BRICP country set includes Brazil, Russia, India, China, and Poland. Job analysis importance ratings were obtained on a five-point scale (5 = highly important).
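To make the comparison in Table 36.2 concrete, the short Python sketch below (not part of the original study) summarizes how close the two importance-rating profiles are. The summary statistics it prints are illustrative and are not reported in the chapter.

```python
# Illustrative sketch: one way to summarize how similar the JAQ importance
# profiles in Table 36.2 are across the two country sets.
dimensions = [
    ("Conduct sales and follow-up", 3.7, 3.7),
    ("Work with others", 3.4, 3.7),
    ("Provide information to customers and distributors", 3.7, 3.6),
    ("Plan, organize, and prioritize", 3.5, 3.6),
    ("Maintain knowledge and skills", 3.4, 3.5),
    ("Negotiate and persuade", 3.4, 3.5),
    ("Document work: keep records", 3.3, 3.5),
]

diffs = [abs(apac - bricp) for _, apac, bricp in dimensions]
mean_abs_diff = sum(diffs) / len(diffs)
max_diff = max(diffs)

print(f"Mean absolute difference in importance ratings: {mean_abs_diff:.2f}")
print(f"Largest difference on any single dimension:     {max_diff:.1f}")
# On a five-point scale, differences this small are consistent with the
# conclusion that the content domain is highly similar across country sets.
```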


The greater challenge for global organizations is maintaining consistency on more subjective dimensions of the type generally found in critical leadership roles (e.g., a global firm's top management team). It is critical for organizations to select leaders who have integrity, work well in teams, are committed, and are results-oriented. However, the interpretation of these dimensions can vary greatly depending on the culture of the subsidiary location. There are also practical challenges with headquarters-controlled international selection systems. Using the same test across countries may be difficult for reasons ranging from culture-based interpretations that lower the validity of the test to the basic logistics of testing. The challenge of maintaining consistency in employee selection is discussed later in this chapter.

Local Responsiveness and Employee Selection in MNCs

MNCs (or units within MNCs) with greater local responsiveness allow for the greatest level of differentiation within countries, such that key decisions are made at the subsidiary level. The benefit of this strategy is that global firms are able to compete locally, with local knowledge, which may be especially important when a country has a unique infrastructure, market, client base, governmental roles, etc. It follows that localization of selection systems is best for positions where a localization strategy is being deployed. The weakness of this strategy at the company level is that companies lose economies of scale, the ability to maintain consistency and standards around the world, and the possibility of global talent management. In selection, for example, multiple selection tests would need to be validated, and candidates could not be compared across countries.

Transnational Strategy and Employee Selection in MNCs

When MNCs (or units within MNCs) prefer to compete with a synergistic and interdependent global network of subsidiaries, these units are integral parts of a whole system with both global and local objectives. Each subsidiary makes its unique contribution to the MNC through knowledge sharing, learning, and collaboration. In this context, organizations often prefer employee selection systems that are consistent around the world, based on strategic necessity, but that are also culturally acceptable across the participating countries. Many MNCs aspire to (or believe themselves to follow) this type of transnational business strategy. As such, there is increased pressure to develop HR systems (and employee selection systems in particular) that are acceptable and integrated across cultures.

In the case of 3M's prehire sales test, the solution was a hybrid: standardize the test (on the basis of the common competency model outlined in Table 36.1) but allow each country's HR department the freedom to vary when and how the test was given. For example, in some countries it did not make sense to offer the test at the beginning of the selection process but rather somewhat later, for instance in a particularly competitive job market or a more remote location. By working very pragmatically, 3M came up with a variety of approaches to implementing the online test, making sure the process advanced the interests of both the country and the company rather than being something prescribed and imposed by corporate headquarters. In the end, 3M's solution to global testing was implemented, and the prehire test for sales representatives was rolled out globally. A simple sketch of this kind of country-configurable process follows.
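A minimal sketch of how such a hybrid arrangement might be represented: the assessment itself is standardized globally, while each country's HR team configures where the test sits in its hiring process. The stage names, country codes, and configuration structure below are hypothetical illustrations, not 3M's actual system.

```python
# Hypothetical sketch: a standardized global test whose placement in the
# hiring process is configured locally. Stage names and country codes are
# illustrative only.
from dataclasses import dataclass, field

GLOBAL_STAGES = ["application_review", "prehire_test", "interview", "offer"]

@dataclass
class CountrySelectionProcess:
    country: str
    # Default: the test sits early in the funnel; countries may reorder stages.
    stages: list = field(default_factory=lambda: list(GLOBAL_STAGES))

    def move_test_after(self, stage: str) -> None:
        """Place the standardized prehire test immediately after `stage`."""
        self.stages.remove("prehire_test")
        self.stages.insert(self.stages.index(stage) + 1, "prehire_test")

# Corporate standard: the same test content is used everywhere.
default_process = CountrySelectionProcess("US")

# Local responsiveness: in a tight labor market, test later in the funnel.
poland = CountrySelectionProcess("PL")
poland.move_test_after("interview")

print(default_process.stages)  # ['application_review', 'prehire_test', 'interview', 'offer']
print(poland.stages)           # ['application_review', 'interview', 'prehire_test', 'offer']
```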

CHALLENGES IN DEVELOPING MNCS' EMPLOYEE SELECTION SYSTEMS

There are several challenges in developing employee selection systems from the transnational perspective. The first challenge is determining selection constructs that are applicable to candidates for the same positions across subsidiaries, regardless of country (Ployhart, Wiechmann, Schmitt, Sacco, & Rogg, 2003). This means that the content domain is comparable across cultures within positions.


It also means that selection systems based on the common content domain would have validity coefficients generalizable across countries (Lievens, 2007; Salgado & Anderson, 2002). Once the common content domain is determined, the second challenge is creating conceptual equivalence in the assessment tools. This may include everything from language comparability in selection tests, to developing behavioral indices of the selection dimensions so that raters (e.g., interviewers, assessors in assessment centers) can make cross-culturally comparable ratings, to changing cut scores and norms within countries to appropriate levels (a brief sketch of within-country norming appears below). The third and fourth challenges are cross-cultural and cross-national differences, the former affecting the types of selection methods that are preferred and the latter affecting the types of selection methods allowed under a country's legal system. We discuss these challenges in greater detail below, with the caveat that a thorough review of all of the measurement, methodological, and cultural issues embedded in them is beyond the scope of this section.
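As a concrete illustration of the within-country norming mentioned above, the sketch below converts the same raw test score to a z-score against hypothetical country-specific applicant norms and applies a common standardized cut score. All norm values and the cutoff are invented for illustration.

```python
# Hypothetical sketch of within-country norming: the same raw test score is
# interpreted against the applicant norms of the country in which it was
# earned. All norm values and the cut score are illustrative, not real data.
COUNTRY_NORMS = {  # mean and standard deviation of applicant raw scores
    "BR": {"mean": 52.0, "sd": 9.5},
    "JP": {"mean": 58.0, "sd": 7.0},
    "PL": {"mean": 49.0, "sd": 10.5},
}

def standardized_score(raw: float, country: str) -> float:
    """Convert a raw score to a z-score using that country's applicant norms."""
    norms = COUNTRY_NORMS[country]
    return (raw - norms["mean"]) / norms["sd"]

def passes_cut(raw: float, country: str, z_cut: float = 0.0) -> bool:
    """Apply a common standardized cut score to country-normed scores."""
    return standardized_score(raw, country) >= z_cut

# The same raw score of 55 clears the cut in Brazil and Poland but not in
# Japan, where applicant scores run higher in this illustration.
for country in ("BR", "JP", "PL"):
    print(country, round(standardized_score(55, country), 2), passes_cut(55, country))
```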

Determining the Selection Constructs Applicable Across Cultures

As with the development of selection systems in the domestic context, the first step is to determine the broad content domain for a given position, repeating this step across countries for the same position to determine whether the jobs are, in fact, comparable. In validity language, the selection systems (predictors) would need to tap the same performance domain across countries. This step is particularly challenging for more contextual, less technical roles. For leadership roles, the multi-country Global Leadership and Organizational Behavior Effectiveness (GLOBE) project found that "executives tend to lead in a manner more or less consistent with the leadership prototypes endorsed within their particular culture. In turn, leaders who behave according to expectations are most effective" (Dorfman, Javidan, Hanges, Dastmalchian, & House, 2012, p. 504). Leaders' behaviors, and perceptions of their success, can differ from country to country.

In firms transferring people across borders, the conceptual equivalence and validity generalization challenge may be further exacerbated. When an employee is selected in one country and transferred to another country in the same role (where the performance domain may differ), the validity of the original selection system might be lowered (Lievens, 2007). For example, a study of relocating professionals working in public relations and as economic and political analysts found that the tasks involved in performing their jobs changed depending on where the jobs were performed, even though the jobs themselves did not change (Shin, Morgeson, & Campion, 2007). In more collectivistic cultures, the jobs required more relationship-oriented tasks (e.g., coordinating, team building) than in more individualistic cultures.

The exception can be found, depending on the level of specificity and topic, in companies with strong cultures and in more technical roles where constructs and behaviors are heavily communicated and reinforced. "Setting the Agenda," a common leadership behavior, and "Territory Management," a common sales behavior, can both be endorsed as occurring or not, along with their relative importance to the role; yet how each is operationalized can differ due to culture. Measuring items and tasks at the right level is crucial if generalizability is desired. (This issue is addressed again in the last section of the chapter, where international assignments are discussed.) Many MNCs have driving corporate cultural values that appear in managerial selection systems around the world. These corporate values may include dimensions such as managing with integrity, taking appropriate risks, being customer-focused, being results-oriented, and the like. After these broad performance dimensions are named, the challenge turns to creating conceptual equivalence for each dimension across cultures. Once this equivalence is established, selection systems to assess candidates against these dimensions are created.

Creating Conceptual Equivalence Across Cultures

Cultural values are socialized in each individual through various agents such as nationality, religion, family, education, company, and profession.


This foundation of individuals' culture can influence the sphere of work. Thus, individuals' work-related values are formed when their overarching cultural values are applied to the work situation (Hofstede, 1980). Comparative management researchers have found that individuals within one country have more values in common than individuals from different countries (Hofstede, 1980), especially when corporate or professional cultures are weak. In the context of this chapter, culturally laden work values can affect the weight that one places on a particular selection dimension or the actual interpretation of applicants' behaviors, creating a challenge for assessing candidates through a single cultural lens.

Applied psychologists and HR practitioners working internationally have been grappling with the challenge of developing assessment and measurement methods that are conceptually comparable across cultures, beyond a mere translation of words (see Hult et al., 2008, for a summary). In this context, the goal is to create enough conceptual equivalence for comparisons of candidates to be meaningful. The past decade has brought a plethora of published articles aimed at establishing the cross-cultural conceptual equivalence of various constructs of interest. By definition, conceptual equivalence occurs when constructs have similar meanings across cultures. For example, customer service orientation may translate into "complete attention to customers' needs" in Japan, where anticipating needs is important. However, in Italy, where shopkeepers with exquisite taste are highly valued, customer service may mean "providing honest feedback." In this example, "customer service orientation" lacks conceptual equivalence. However, in both Japan and Italy, the construct "expending effort for clients" may be defined as working hard to find a desired item or to help a client resolve a problem. In this example, "expending effort for clients" does possess conceptual equivalence. Maximizing conceptual equivalence may be especially problematic when constructs in the content domain are more subjective and less objective.

Challenges of conceptual equivalence also arise at the item level. For one item, written through the lens of the U.S.-based 3M HR team, a response alternative involved inviting a new client to lunch. Taking a new client to lunch is within standard operating procedure for most U.S. sales representatives, yet in a different cultural context the same activity conveys a level of familiarity that is inconsistent with establishing a new relationship, making the response option less viable cross-culturally. In countries such as Brazil, inviting a person to lunch implies a deeper relationship than had yet been established between the sales representative and the potential new client. The option would not be selected as written and was ultimately rewritten to reflect a universally appropriate response.

Cultural Differences and Employee Selection

Once the dimensions to be included in the selection system have been established, the next cross-cultural concern is the appropriateness of the assessment methods and the logistics of those methods in a given cross-cultural context. With respect to testing methods, it is important to understand whether certain selection methods are perceived more (or less) favorably by applicants around the globe. In a meta-analysis of applicant reactions to various selection methods, Anderson, Salgado, and Hülsheger (2010) found that, across 17 countries, the most preferred methods were work samples and interviews, followed by résumés, cognitive tests, references, biodata, and personality inventories. The least preferred methods across countries were honesty tests and personal contacts. The picture is likely more nuanced than an overall conclusion that certain methods have universal appeal. For example, one study comparing perceptions of selection methods in Singapore and the United States found that Singaporeans rated personality tests more favorably than did Americans (Phillips & Gully, 2002).

Although applicant reactions to selection methods may be generally similar across countries, their usage is not. Multicountry survey-based studies have found that countries vary significantly in the employee selection procedures used (Ryan, McFarland, Baron, & Page, 1999; Shackleton & Newell, 1997). Ryan et al. (1999) found that national-level cultural values, such as uncertainty avoidance, predicted which selection procedures were more likely to be used across countries.


Countries higher in uncertainty avoidance were more likely to rely heavily on interviews and testing, presumably as a way of reducing hiring risks. Further research on cross-cultural differences in the use and acceptance of selection methods is important to further understanding of global employee selection and, hopefully, to reduce resistance to it (for a review, see Lievens, 2007).

Even in situations where the same employee selection method is used, culture might affect the validity of the approach or the way in which it is used. In employment interviews, for example, Manroop, Boekhorst, and Harrison (2013) suggest that when interviewers from one country are interviewing candidates from another, differences in self-promotion and in verbal and nonverbal behaviors can influence the interviewer's judgment of the interviewee. They noted, for example, that "when foreign-born job candidates from collectivistic cultures perceive an absence of behavioral mirroring on the part of the interviewers, they may infer a lack of rapport, and hence become anxious and experience psychological stress, which may, in turn, hinder their performance in the interview" (p. 3524).

Even within regions of the world, subtle cross-national differences exist. Tixier (1996) noted differences in the qualities viewed as valuable for managerial candidates across the Nordic countries of Finland, Norway, Sweden, and Denmark, suggesting that the content of résumés, cover letters, and interviews should reflect those differences.

With respect to logistics, testing assumptions need to be questioned cross-culturally. For example, when 3M was rolling out its prehire sales test globally, one of the basic assumptions was that testing would be done in a fairly controlled environment with multiple computers so that multiple applicants could take the online test simultaneously. As it turned out, this was easier said than done. First, many of the 3M subsidiaries around the world did not have an available testing room (i.e., an empty room with multiple Internet-connected computers). Second, some of the subsidiaries had sales territories that covered vast regions. If 3M was looking for sales representatives for a given region, it needed to be able to connect with candidates in remote locations; in Russia, for example, that meant reaching candidates in places as remote as Siberia. Practical decisions needed to be made regarding how far a candidate could reasonably be expected to travel just to take the prehire test. Third, as 3M learned, the idea of having multiple applicants take the test simultaneously was flawed in some countries. In some cultures, and in highly competitive job markets, it was undesirable and discouraging for applicants to see how many people were competing, and in some cultures this kind of group testing is simply unacceptable. Even the idea of a controlled testing room with a closed door raised challenges in some small subsidiaries and in countries with predominantly open floor plans, such as Japan.

National Differences and Employee Selection

HR systems vary from country to country along some relatively fixed dimensions, including the given country's work systems (Begin, 1992). These country-level factors may affect the practice of employee selection across countries because they affect employment laws, workforce competence, and the availability of talent. Although not intended to be comprehensive, this section offers some illustrative examples of the ways in which countries' work systems affect employee selection.

Countries differ with respect to laws governing the practice of employee selection. (See chapters in this volume for more details about national differences in legal issues concerning employee selection.) For example, the United States has a body of nondiscrimination employment laws, including Title VII of the Civil Rights Act of 1964, the Age Discrimination in Employment Act, and the Americans with Disabilities Act. As in the United States, laws exist in almost every country that define the types of firms that must abide by a given law prohibiting discrimination (e.g., size of the firm, public or private sector) and define who is considered protected under the given law (e.g., race, sex, age, sexual orientation).


In India, for example, Article 15 of the Indian Constitution prohibits discrimination on the basis of caste. Most of these laws around the world state that an employee selection system cannot discriminate against a protected group; however, the way in which discrimination is adjudicated and the penalty for violating the law vary greatly from country to country.

Another legal issue affecting international selection is data privacy. For example, the European Union (EU) Directive on Data Protection prohibits the transfer of personal information from Europe to other countries unless adequate protection of privacy, notice, and consent is provided. This EU Directive affects selection practices globally in the way data are collected and shared. Countries also have their own privacy laws, as illustrated by the example of 3M in Poland. To implement the prehire sales test for representatives in Poland, 3M faced some added challenges. Article 22 of the Polish Labor Code limits the personal data that may be required by an employer from a candidate for employment, mainly to items such as name, surname, date of birth, education, and history of previous employment. In order not to be even remotely viewed as risking a violation, 3M chose not to require candidates to provide personal data other than those specifically outlined in Article 22. To comply with the Polish Act on Personal Data Protection, additional adjustments were made regarding the demographic data collected. For example, given that some information would reside on a U.S.-based server, names needed to be removed from the information collected. Furthermore, because the test was processed on a U.S. server, written (rather than electronic) informed consent had to be signed and collected from each applicant before testing began. These steps, among others, are examples of how cross-national differences in laws may affect the logistics of the testing situation.

Countries also vary in workforce competence, which, in turn, influences the competence and readiness of candidates. Organizations such as the U.N. Educational, Scientific, and Cultural Organization (UNESCO) and the International Archive of Education Data (IAED) report large differences in literacy rates, education levels, and test scores across countries, which have implications for the quality of a given country's workforce. For example, Germany is considered to have one of the best-trained workforces in the world, with an extensive apprenticeship program in which employers help train students in their intended trades and professions.

Within-country selection systems rely on an ample supply of qualified talent relative to the organization's demand for talent for a given job. Given that countries differ in labor economics, the availability of talent will influence selection ratios, making selection systems more or less effective within the country. The general labor economics of a given country or city affects the size and quality of applicant pools. The supply of talent also affects the concern companies will have for candidate reactions to their (even validated) selection methods. For example, in both India and Poland, skilled labor is in high demand.
Often a hiring manager simply wants someone to fill a position without the extra hurdle of giving applicants a test, which lengthens the time needed to make a hiring decision and could result in losing some viable candidates. One way 3M accommodated the high demand for skilled labor in Poland and India was to move testing to a later hurdle in the selection process. The goal was to keep more qualified candidates in the pipeline for the interpersonally interactive aspects of the selection system, such as the interview, and not turn them off with the testing process. Testing was conducted after the relationship with 3M was built, which also helped ensure that top talent was selected.
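The Polish accommodations described above (limiting the data collected to the items permitted by Article 22 and removing names before information resides on a U.S.-based server) amount to a data-minimization step. The sketch below illustrates the general idea with hypothetical field names; it is not 3M's actual process.

```python
# Hypothetical sketch of data minimization: strip direct identifiers from a
# candidate record before test results are stored on a server outside the
# candidate's country, keeping only the fields needed for scoring.
# Field names and the transferable-field list are illustrative only.
import uuid

TRANSFERABLE_FIELDS = {"education", "previous_employment", "test_responses"}

def minimize_for_transfer(candidate: dict) -> dict:
    """Return a pseudonymized record containing only transferable fields."""
    record = {k: v for k, v in candidate.items() if k in TRANSFERABLE_FIELDS}
    # Pseudonymous key; the mapping back to the person would be kept locally.
    record["candidate_id"] = str(uuid.uuid4())
    return record

applicant = {
    "given_name": "Jan",
    "surname": "Kowalski",
    "date_of_birth": "1990-05-01",
    "education": "MSc, Chemistry",
    "previous_employment": ["Sales representative, 3 years"],
    "test_responses": [3, 1, 4, 2],
}
print(minimize_for_transfer(applicant))  # no name, surname, or date of birth
```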

SELECTION FOR CULTURALLY AGILE PROFESSIONALS

Our chapter thus far has focused on employee selection systems and the challenges MNCs face in developing selection systems consistent with business strategy and the cultural context. We now shift our focus from selection for the purpose of staffing globally to selecting those who can lead key strategic initiatives globally, specifically global leaders and international assignees.



Selecting Global Leaders1

Whether leading a global business, a virtual global team, or people from or in different countries, a global leader is "an individual who inspires a group of people to willingly pursue a positive vision in an effectively organized fashion while fostering individual and collective growth in a context characterized by significant levels of complexity, flow and presence," doing so in an international or cross-cultural context (Mendenhall, Reiche, Bird, & Osland, 2012, p. 500). Studies have found that effective global leaders share certain competencies (i.e., predictors of their success), which sort into three categories: self-management, relationship management, and business management (Bird, Mendenhall, Stevens, & Oddou, 2010). Competency-based selection systems for global leaders should include an assessment of these dimensions, broadly defined.

With respect to self-management, certain competencies affect leaders' ability to maintain their composure and adjust to the ambiguity of working in multicultural and intercultural environments (Bird et al., 2010; Caligiuri, 2012). Cross-cultural competencies such as tolerance of ambiguity and self-efficacy improve global leaders' self-management, enabling them to work quickly and comfortably in different cultures and with people from different cultures. With regard to relationship management, global leadership competencies include those affecting an individual's multicultural and intercultural interactions at the group level and the ability to build strong dyadic relationships with people from different cultures (Bird et al., 2010; Caligiuri, 2012). Global leaders with cross-cultural competencies such as perspective taking and rapport building are better able to develop relationships in different cultures and with people from different cultures. With respect to business management, these competencies affect leaders' abilities to take an enterprise-wide mindset and operate from an international strategic perspective (Bird et al., 2010; Caligiuri, 2012). Global leaders need to be able to integrate a wide range of dynamic factors from the organization and the local environment. This requires a high level of cognitive complexity, which enables leaders to understand and integrate broader bases of knowledge and balance the demands of global integration with local responsiveness (Dragoni & McAlpine, 2012; Levy, Beechler, Taylor, & Boyacigiller, 2007). Global leaders with cross-cultural competencies such as cognitive complexity and the ability to think creatively are more effective in their global roles.

When the tasks of global leaders are identified through a job analytic approach, one more dimension emerges: response management. Research suggests that global leaders need to have a variety of cultural responses available to them and that some tasks require different, if not opposite, responses (Caligiuri, 2012; Levy et al., 2007). For example, tasks such as "interacting with external clients from other countries" and "maintaining a budget globally" might require opposite responses: the former requires adaptation, whereas the latter may require that the leader maintain an organizational standard while minimizing the effects of culture (Caligiuri, 2006). Response management means that leaders respond with cultural agility rather than always adapting to the behavioral norms of the cultural context. Cultural adaptation is only one possible response and not always the correct one.
At times, leaders might also use cultural minimization to communicate and influence, minimizing differences across cultures in order to maintain some necessary standard (e.g., safety, quality, or ethics). In other contexts, such as leading a team, the situation might dictate the use of cultural integration, where team and facilitation skills help create an entirely new approach, one that represents no individual's culture completely.

Selecting International Assignees2

There are many challenges in developing selection systems for international assignee candidates who will be living and working outside of their own national borders. International assignees are nationals of one country who are sent by a parent organization to live and work in another country.


For the purpose of this chapter, international assignees are defined as those who are sent by their organizations on an assignment (rather than a self-initiated relocation) to another country for at least one year. This section describes the individual-level antecedents that are most important for inclusion in international assignee selection systems and then discusses process issues in international assignee candidate selection.

When thinking about international assignee selection, unlike traditional selection, we are considering ways to predict success within the job context (i.e., working in a foreign country) rather than success on job content in the traditional sense. In the research literature on international assignees, cross-cultural adjustment (psychological comfort living and working in another country) is most often considered an important dependent variable when considering selection across assignee types, given that adjustment is important for almost all expatriates. In a meta-analysis of the antecedents and consequences of expatriate adjustment, Bhaskar-Shrinivas, Harrison, Shaffer, and Luk (2005) found that language ability, previous overseas experience, withdrawal cognitions, job satisfaction, and spousal adjustment were predictors of cross-cultural adjustment. In another meta-analysis, Hechanova, Beehr, and Christiansen (2003) found that self-efficacy, frequency of interaction with host nationals, and family support were predictors of cross-cultural adjustment. These meta-analyses also suggest that greater cross-cultural adjustment in international assignees generally predicts greater job satisfaction, less strain, and higher levels of organizational commitment. Another meta-analysis, examining personality characteristics as predictors of expatriate performance (Mol, Born, Willemsen, & Van Der Molen, 2005), found that extraversion, emotional stability, agreeableness, and conscientiousness were predictive of expatriate performance; it also found cultural sensitivity and local language ability to be predictive. Across these meta-analyses, three categories of individual-level antecedents emerge as predictors of cross-cultural adjustment that lend themselves to international assignee selection systems: personality characteristics, language skills, and prior experience living in a different country (see Caligiuri & Tarique, 2006, for a review).

Personality Characteristics

Extensive research has found that well-adjusted and high-performing international assignees tend to share certain personality traits (e.g., Mol et al., 2005; Shaffer, Harrison, Gregersen, Black, & Ferzandi, 2006). Personality characteristics enable international assignees to be open and receptive to learning the norms of new cultures, to initiate contact with host nationals, to gather cultural information, and to handle the higher amounts of stress associated with the ambiguity of their new environments, all of which are important for international assignee success (Shaffer et al., 2006). Each of the Big Five personality characteristics relates to international assignee success in a unique way (Shaffer et al., 2006) and should be included in a selection system for international assignees for a different reason (see Van Vianen, De Pater, & Caligiuri, 2005, for a review).

On the basis of the meta-analysis conducted by Mol et al. (2005), the estimated true population effect size for the relationship between conscientiousness and international assignee success is positive (ρ = .17), reflecting the cognitive complexity of working in a host country. Bhaskar-Shrinivas et al.'s (2005) meta-analysis found that relational skills, which aid social learning in the host country, are positively related to cross-cultural adjustment (ρ = .32). Mol and colleagues (2005) also found the estimated true population effect sizes for the relationships of international assignee success with the relationship-oriented personality characteristics, extraversion and agreeableness, to be positive (ρ = .17 and .11, respectively). Given that stress is often associated with living and working in an ambiguous and unfamiliar environment (Stahl & Caligiuri, 2005), it is not surprising that the meta-analysis conducted by Mol et al. (2005) found that the estimated true population effect size for the relationship between emotional stability and international assignee success is positive (ρ = .10). Lastly, openness should be related to international assignee success because individuals who are higher on this personality characteristic hold less rigid views of appropriate and inappropriate contextual behavior and are more likely to be accepting of the new culture.


Mol et al.'s (2005) meta-analysis found that the estimated true population effect size for the relationship between openness and international assignee success is positive (ρ = .06); however, this relationship was not significant, as the confidence interval included zero. The authors noted that "moderated support was found for the relationship of openness" (p. 608), which is consistent with other research. For example, Caligiuri (2000) found moderated support for openness as a personality characteristic relating to expatriate adjustment, such that greater contact with host nationals was positively related to cross-cultural adjustment when an individual possessed the personality trait of openness.

Collectively, these personality characteristics should be included in any selection program for international assignees (Van Vianen et al., 2005). It is important to note that this type of assessment predicts who will adjust well to a cross-cultural job context; it does not predict success in the actual job tasks. Likewise, the necessary level of each personality characteristic may be contingent upon the type of international assignment under consideration. For example, relational skills might be important for all international assignees but needed at a higher level for senior executives, who may need to network with, persuade, and influence host nationals, media, government officials, and nongovernmental organizations (NGOs) to be successful, than for technical assignees, who may interact with host nationals mostly around tasks involving computer systems or equipment.
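As a purely illustrative sketch of how evidence like the meta-analytic estimates above might inform a screening composite, the code below weights standardized Big Five scores by the ρ values reported by Mol et al. (2005). These values are used heuristically, not as validated regression weights; an operational system would require local validation and properly derived weights.

```python
# Illustrative sketch only: a heuristic screening composite that weights
# standardized Big Five scores by the meta-analytic validities reported by
# Mol et al. (2005). This is not a validated scoring algorithm.
META_ANALYTIC_RHO = {
    "conscientiousness": 0.17,
    "extraversion": 0.17,
    "agreeableness": 0.11,
    "emotional_stability": 0.10,
    "openness": 0.06,   # moderated support only; the confidence interval included zero
}

def composite_score(z_scores: dict) -> float:
    """Weighted sum of standardized trait scores (weights = meta-analytic rho)."""
    return sum(META_ANALYTIC_RHO[trait] * z for trait, z in z_scores.items())

# Hypothetical candidate with standardized (z) trait scores:
candidate = {
    "conscientiousness": 1.2,
    "extraversion": 0.4,
    "agreeableness": -0.3,
    "emotional_stability": 0.8,
    "openness": 1.5,
}
print(round(composite_score(candidate), 3))
```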

Language Skills

Many have noted a positive relationship between language skills and international assignee success (Shaffer, Harrison, & Gilley, 1999). In their meta-analytic studies, Mol et al. (2005) and Bhaskar-Shrinivas et al. (2005) found that local language ability is a positive predictor of international assignee success, generally defined as adjustment (ρ = .19 and .22, respectively).

Prior International Experience

From a social learning perspective, the more contact international assignees have with host nationals and the host culture, the greater their cross-cultural adjustment (Toh & DeNisi, 2007), provided the past experience does not reinforce previously held stereotypical beliefs or foster negative, unrealistic expectations of the foreign culture. Past experience is likely most helpful in predicting success on an expatriate assignment when the experience provides an accurate and realistic representation of the host country's norms, customs, values, etc. Bhaskar-Shrinivas et al.'s (2005) meta-analytic results showed that prior international experience was a weak but positive predictor of interaction adjustment and work adjustment (ρ = .13 and .06, respectively). It is likely that the quality of the prior international experience is an important factor.

Practices in International Assignee Selection

While the aforementioned individual difference variables (personality, language skills, and prior experience) can be used as the basis for an expatriate selection system, the Brookfield Global Relocation Trends 2015 survey of global firms found that only about 20% use selection tools to assess expatriate candidates. Traditional selection methods are often challenging to employ when expatriates' skills are scarce and needed to fill important skills gaps in host countries. Expatriate candidates' willingness to relocate has been, and continues to be, the most frequently cited selection criterion. In the early 1980s, Rosalie Tung's seminal work found that the vast majority of firms (over 90%) named "interest in overseas work" as a criterion for selection (Tung, 1981). Nearly 80% of firms today use the same predictor, an individual's willingness to go on an international assignment, in selection (Brookfield, 2015).


A "willingness to relocate" might be a sufficient predictor for more technical assignments designed to fill a skills gap, but it will not be sufficient for managerial or organizational development. Getting the right expatriates into key developmental opportunities will have a longer-term benefit for the organization, and for this to occur, selection (especially for personality characteristics) is critical. This has become particularly important recently as the number of expatriates sent abroad to fill critical skills gaps is shrinking compared to the number sent abroad for organizational or leadership development (Caligiuri & Bonache, 2016). This trend is evident in the increasing number of firms adopting expatriate selection systems (Brookfield, 2015) and the increasing number of firms integrating their global mobility and talent management functions (Cerdin & Brewster, 2014; Collings, 2014).

Another trend is the increased use of self-assessment for better decision making. Given that the demographic, personal, and family situations of international assignee candidates vary, self-assessment (or self-selection) has been found to be an effective method for sharing realistic assessments in a tailored way (Caligiuri & Phillips, 2003). For example, an unmarried candidate for an international assignment might have a different set of concerns compared with a married candidate with a family and elderly parents (Caligiuri, Hyland, Joshi, & Bross, 1998). With expatriate self-assessment tools, candidates assess their fit with the personality and lifestyle requirements of the assignment, which helps them make a thoroughly informed and realistic decision about it (Caligiuri & Phillips, 2003). Many firms have found that this self-assessment step fosters the creation of a candidate pool of potential international assignees. This candidate pool can be organized to include information such as the availability of the employee (when and to what countries), the languages the employee speaks, the countries preferred, and technical knowledge, skills, and abilities (see the illustrative sketch at the end of this section). Caligiuri and Phillips (2003) found that providing realistic previews prior to international assignments did not change candidates' interest in possible assignments but did increase candidates' self-efficacy for an international assignment.

Most multinational companies acknowledge that the wrong person in an expatriate assignment can result in poor job performance, early repatriation, anxiety or other emotional problems, and personal and professional upheaval for accompanying family members. With the risks so high, expatriate selection (designed to identify who will have the greater likelihood of success) is critical. The efficacy of expatriate selection programs is challenged when transnational firms report (as they often do) that there are not enough people to fill current expatriate assignments. The natural reaction, in this case, is to believe that expatriate selection would not apply. However, ignoring proper selection is extremely shortsighted given the risks to the firm and the individual if the global assignment is unsuccessful. This reaction is especially limited given that, when selection is thorough, firms cast wider nets for possible candidates and generally find multiple candidates with a higher probability of success. These comprehensive selection systems generally have four distinct phases: (1) the creation of a candidate pool, (2) self-assessment, (3) technical and managerial selection, and (4) placement.
The placement in a host country will be most successful when agreement is mutual among the candidate, the candidate’s family, the sending unit, and the host national unit.
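The candidate pool described above lends itself to a simple structured record per employee. The sketch below is a hypothetical illustration using the kinds of information listed in this section (availability, languages, preferred countries, and technical knowledge, skills, and abilities); the field names and filtering logic are assumptions, not a prescribed system.

```python
# Hypothetical sketch of a candidate-pool record for potential international
# assignees, mirroring the kinds of information described in this section.
from dataclasses import dataclass
from datetime import date

@dataclass
class AssigneeCandidate:
    employee_id: str
    available_from: date
    available_countries: list        # countries the employee is willing to go to
    languages: list
    preferred_countries: list
    technical_ksas: list             # knowledge, skills, and abilities
    completed_self_assessment: bool = False

def match_candidates(pool, country: str, language: str):
    """Filter the pool to candidates available for, and able to work in, a location."""
    return [
        c for c in pool
        if country in c.available_countries
        and language in c.languages
        and c.completed_self_assessment
    ]

pool = [
    AssigneeCandidate("E001", date(2018, 1, 1), ["BR", "PL"], ["English", "Portuguese"],
                      ["BR"], ["process engineering"], completed_self_assessment=True),
    AssigneeCandidate("E002", date(2018, 6, 1), ["JP"], ["English", "Japanese"],
                      ["JP"], ["sales management"]),
]
print([c.employee_id for c in match_candidates(pool, "BR", "Portuguese")])  # ['E001']
```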

CONCLUSIONS

This chapter covered the many challenges of developing international selection systems, the challenges of construct development with respect to cross-cultural comparability, and the selection of culturally agile employees. As the need for strategically oriented and conceptually equivalent international selection systems continues to grow, so do the demands on HR professionals and applied psychologists to respond to this complex need.

Many dynamic changes happening today will increase both the need for and the ease of adopting internationally integrated selection systems. For example, increasingly strong worldwide corporate cultures, in which employees globally share values and norms, may diminish the influence of national cultures. Strong global corporate cultures create a common frame of reference for more subjective constructs and ease the integration of international selection systems. Subjective constructs, such as "integrity," "teamwork," and "trust," will have a company-driven understanding that levels any nationally driven cultural differences.


This move to stronger corporate cultures will increasingly ease the integration of international selection systems. For instance, 3M was able to define generalizable tasks because of its strong company culture, with low between-country variability on many work-related issues. In fact, as seen in Table 36.2, countries with widely different cultures and at different economic stages produced similar results because company culture and approach had a much bigger impact on the job than the country in which the role resides.

Although the technical issues of employee selection are important, implementing selection systems globally requires more than merely validating employee selection tests in different countries. Employee selection tests are created and adopted by HR professionals located around the world. These HR professionals, from different cultures and with different levels of knowledge of the science and practice of employee selection, ultimately affect whether a given selection system can be integrated globally. As described in this chapter, the concept of testing, and the very idea of individual differences, varies from country to country. Likewise, the science of testing and the level of acceptance of U.S.-oriented industrial-organizational psychology standards for practice also vary from country to country. In some cultures, testing is rooted in education (not industrial-organizational psychology), where teachers create and give tests and assign grades accordingly. Test validation, in these cultures, can seem like a burdensome and unnecessary process. Creating standards of practice for a company's HR professionals globally is an important step toward developing selection systems that can be validated and accepted globally.

The future success of international employee selection may also rely on headquarters-based HR professionals' and industrial-organizational psychologists' abilities to manage relationships cross-nationally. Developing relationships with in-country HR leaders and line managers is critical for the successful integration of selection systems. In-country HR professionals will likely be the first to identify any country-specific problems and ways to eventually solve them. Because this willingness to help relies on the goodwill of in-country HR professionals (some of whom may initially need to be convinced that testing is appropriate), the ability of headquarters-based testing professionals to develop respectful, collegial, and lasting relationships is critical.

Lastly, the future success of international employee selection may be determined by whether employee selection systems are integrated as part of a whole strategic HR system (or high-performance work system). HR professionals would be addressing only part of the picture if they developed employee selection systems in isolation. Ideally, selection and assessment should be integrated with training and development, performance management, and reward systems. Collectively, when these systems globally reinforce the predictors of performance in a comprehensive manner, the needle moves much more quickly toward a high-performing, globally competitive organization.

NOTES

1. Ideas in the section are abstracted from Caligiuri, P. M., & Dragoni, L. (2015). Global leadership development. Invited chapter for D. Collings, G. Wood, & P. Caligiuri (Eds.), Companion to International Human Resource Management (Routledge). Please refer to that chapter for more information.
2. For more information, please see Caligiuri, P. M., & Bücker, J. J. L. E. (2015). Selection for international assignments. In D. Collings, G. Wood, & P. Caligiuri (Eds.), Companion to International Human Resource Management (Routledge).

REFERENCES

Adler, N. J. (2001). International dimensions of organizational behavior (4th ed.). Cincinnati, OH: Southwestern.
Anderson, N., Salgado, J. F., & Hülsheger, U. R. (2010). Applicant reactions in selection: Comprehensive meta-analysis into reaction generalization versus situational specificity. International Journal of Selection and Assessment, 3, 291–304. doi: 10.1111/j.1468-2389.2010.00512.x
Bartlett, C. A., & Ghoshal, S. (1989). Managing across borders: The transnational solution. Boston, MA: Harvard Business School Press.
Begin, J. P. (1992). Comparative Human Resource Management (HRM): A systems perspective. International Journal of Human Resource Management, 3(3), 379–408.
Bhaskar-Shrinivas, P., Harrison, D. A., Shaffer, M., & Luk, D. M. (2005). Input-based and time-based models of international adjustment: Meta-analytic evidence and theoretical extensions. Academy of Management Journal, 48(2), 257–281.
Bird, A., Mendenhall, M., Stevens, M. J., & Oddou, G. (2010). Defining the content domain of intercultural competence for global leaders. Journal of Managerial Psychology, 25(8), 810–828.
Brookfield Global Relocation Services. (2015). Global Relocation Trends Survey Report. Woodridge, IL: Brookfield.
Caligiuri, P. (2000). Selecting expatriates for personality characteristics: A moderating effect of personality on the relationship between host national contact and cross-cultural adjustment. Management International Review, 40(1), 61–80.
Caligiuri, P. (2006). Performance measurement in a cross-national context: Evaluating the success of global assignments. In W. Bennett, D. Woehr, & C. Lance (Eds.), Performance measurement: Current perspectives and future challenges (pp. 227–245). Mahwah, NJ: Lawrence Erlbaum.
Caligiuri, P. (2012). Cultural agility: Building a pipeline of successful global professionals. San Francisco, CA: Jossey-Bass.
Caligiuri, P., & Bonache, J. (2016). The enduring and evolving challenges in global mobility. Journal of World Business, 51(1), 127–141.
Caligiuri, P., & Colakoglu, S. (2008). A strategic contingency approach to expatriate assignment management. Human Resource Management Journal, 17, 393–410.
Caligiuri, P., Hyland, M., Joshi, A., & Bross, A. (1998). A theoretical framework for examining the relationship between family adjustment and expatriate adjustment to working in the host country. Journal of Applied Psychology, 83, 598–614.
Caligiuri, P., & Phillips, J. (2003). An application of self-assessment realistic job previews to expatriate assignments. International Journal of Human Resource Management, 14, 1102–1116.
Caligiuri, P., & Tarique, I. (2006). International assignee selection and cross-cultural training and development. In I. Björkman & G. Stahl (Eds.), Handbook of research in international human resource management (pp. 302–322). London, England: Edward Elgar.
Cascio, W. F., & Aguinis, H. (2008). Staffing twenty-first-century organizations. Academy of Management Annals, 2(1), 133–165. doi: 10.1080/19416520802211461
Cerdin, J. L., & Brewster, C. (2014). Talent management and expatriation: Bridging two streams of research and practice. Journal of World Business, 49(2), 245–252. doi: 10.1016/j.jwb.2013.11.008
Collings, D. G. (2014). Integrating global mobility and global talent management: Exploring the challenges and strategic opportunities. Journal of World Business, 49(2), 253–261. doi: 10.1016/j.jwb.2013.11.009
Dorfman, P., Javidan, M., Hanges, P., Dastmalchian, A., & House, R. (2012). GLOBE: A twenty year journey into the intriguing world of culture and leadership. Journal of World Business, 47, 504–518.
Dragoni, L., & McAlpine, K. (2012). Leading the business: The criticality of global leaders' cognitive complexity in setting strategic directions. Industrial & Organizational Psychology, 5(2), 237–240. doi: 10.1111/j.1754-9434.2012.01438.x
Dunning, J. H. (2009). Location and the multinational enterprise: John Dunning's thoughts on receiving the Journal of International Business Studies 2008 Decade Award. Journal of International Business Studies, 40(1), 20–34.
EY. (2013). Business Pulse: Exploring dual perspectives on the top 10 risks and opportunities in 2013 and beyond. New York, NY: EY.
Ghoshal, S., & Bartlett, C. A. (1990). The multinational corporation as an interorganizational network. Academy of Management Review, 15(4), 603–626.
Ghoshal, S., & Nohria, N. (1993). Horses for courses: Organizational forms for multinational corporations. Sloan Management Review, 2, 23–35.
Gomez, C., & Sanchez, J. I. (2005). Human resource control in MNCs: A study of the factors influencing the use of formal and informal control mechanisms. International Journal of Human Resource Management, 16(10), 1847–1861. doi: 10.1080/09585190500298438
Hechanova, R., Beehr, T. A., & Christiansen, N. D. (2003). Antecedents and consequences of employees' adjustment to overseas assignment: A meta-analytic review. Applied Psychology: An International Review, 52(2), 213–236.
Hofstede, G. (1980). Culture's consequences: International differences in work-related values. Thousand Oaks, CA: Sage.
Hult, G. T. M., Ketchen, D. J., Griffith, D. A., Finnegan, C. A., Gonzalez-Padron, T., Harmancioglu, N., Huang, Y., Talay, M. B., & Cavusgil, S. T. (2008). Data equivalence in cross-cultural international business research: Assessment and guidelines. Journal of International Business Studies, 39, 1027–1044. doi: 10.1057/palgrave.jibs.8400396
Levy, O., Beechler, S., Taylor, S., & Boyacigiller, N. (2007). What we talk about when we talk about global mindset: Managerial cognition in multinational corporations. Journal of International Business Studies, 38, 231–258.
Lievens, F. (2007). Research on selection in an international context: Current status and future directions. In M. M. Harris (Ed.), Handbook of research in international human resource management (pp. 107–123). Mahwah, NJ: Lawrence Erlbaum.
Manpower. (2011). 2011 Talent shortage survey results. Milwaukee, WI: Manpower Group.
Manroop, L., Boekhorst, J. A., & Harrison, J. A. (2013). The influence of cross-cultural differences on job interview selection decisions. International Journal of Human Resource Management, 24(18), 3512–3533. doi: 10.1080/09585192.2013.777675
Mendenhall, M. E., Reiche, B. S., Bird, A., & Osland, J. S. (2012). Defining the "global" in global leadership. Journal of World Business, 47(4), 493.
Mol, S. T., Born, M. P., Willemsen, M. E., & Van Der Molen, H. T. (2005). Predicting expatriate job performance for selection purposes: A quantitative review. Journal of Cross-Cultural Psychology, 36, 590–620.
Phillips, J. M., & Gully, S. M. (2002). Fairness reactions to personnel selection techniques in Singapore and the United States. International Journal of Human Resource Management, 13, 1186–1205.
Ployhart, R. E., Wiechmann, D., Schmitt, N., Sacco, J. M., & Rogg, K. (2003). The cross-cultural equivalence of job performance ratings. Human Performance, 16, 49–79.
Prahalad, C. K., & Doz, Y. L. (1987). The multinational mission: Balancing local demands and global vision. New York, NY: Free Press.
PriceWaterhouseCoopers. (2007). 10th Annual Global CEO Survey. New York, NY: PriceWaterhouseCoopers.
PriceWaterhouseCoopers. (2012). 15th Annual Global CEO Survey (2011). New York, NY: PriceWaterhouseCoopers.
Ryan, A. M., McFarland, L., Baron, H., & Page, R. (1999). An international look at selection practices: Nation and culture as explanations for variability in practice. Personnel Psychology, 52, 359–391.
Salgado, J. F., & Anderson, N. R. (2002). Cognitive and GMA testing in the European community: Issues and evidence. Human Performance, 15, 75–96.
Shackleton, V., & Newell, S. (1997). International assessment and selection. In N. Anderson & P. Herriot (Eds.), International handbook of selection and assessment (pp. 82–95). New York, NY: Wiley.
Shaffer, M. A., Harrison, D. A., & Gilley, K. M. (1999). Dimensions, determinants and differences in the expatriate adjustment process. Journal of International Business Studies, 30, 557–581.
Shaffer, M. A., Harrison, D. A., Gregersen, H., Black, J. S., & Ferzandi, L. A. (2006). You can take it with you: Individual differences and expatriate effectiveness. Journal of Applied Psychology, 91, 109–115.
Shin, S. J., Morgeson, F. P., & Campion, M. (2007). What you do depends on where you are: Understanding how domestic and expatriate work requirements depend upon the cultural context. Journal of International Business Studies, 38, 64–83.
Stahl, G., & Caligiuri, P. M. (2005). The relationship between expatriate coping strategies and expatriate adjustment. Journal of Applied Psychology, 90, 603–616.
Tixier, M. (1996). Cross-cultural study of managerial recruitment tools in Nordic countries. International Journal of Human Resource Management, 7(3), 753–775.
Toh, S.
M., & DeNisi, A. S. (2007). Host country nationals as socializing agents: A social identity approach. Journal of Organizational Behavior, 28(3), 281–301. Tung, R. (1981). Selection and training of personnel for overseas assignments. Columbia Journal of World Business, 16, 21–25. Van Vianen, A. E. M., De Pater, I. E., & Caligiuri, P. M. (2005). Expatriate selection: A process. In A. Evers, O. Smit-Voskuyl, & N. Anderson (Eds.), The handbook of personnel selection (pp. 458–475). Oxford, England: Blackwell.

811

37
SELECTION FOR TEAM MEMBERSHIP
Complexity, Contingency, and Dynamism Across Multiple Levels
SUSAN MOHAMMED AND ALEXANDER S. MCKAY

For well over half a century, scholars have agreed that selecting the right team members is a key variable in the team effectiveness equation (e.g., Mann, 1959; Mathieu, Maynard, Rapp, & Gilson, 2008). However, despite the importance of team selection, significant knowledge gaps remain regarding how to distinguish “team players” from “team inhibitors” and how to create teams whose members have the right mix of competencies. Ironically, despite a wealth of accumulated knowledge about how to select individuals to fit jobs and a burgeoning team literature, relatively little of this research has systematically focused on team selection issues (e.g., Mathieu, Tannenbaum, Donsbach, & Alliger, 2013; Zaccaro & DiRosa, 2012). Instead, the team composition literature has been described as fragmented and in need of coherence (Mathieu, Tannenbaum, Donsbach, & Alliger, 2014). Therefore, the purpose of this chapter is to review and integrate what is currently known about team selection with the goals of identifying deficiencies in current knowledge and underscoring promising avenues for future research. In doing so, we emphasize the complexity underlying staffing teams by adopting a dynamic, contingency, and multilevel perspective. Recent work has highlighted that team membership is far more dynamic than assumed in team research, with individuals joining and leaving teams with increasing frequency (Tannenbaum, Mathieu, Salas, & Cohen, 2012). With respect to contingency, one of the overarching themes of the present work is that selection approaches will differ for diverse types of teams and tasks because the nature of the team and why it exists plays such a prominent role in determining what member characteristics are needed. The multilevel nature of team functioning acknowledges that choosing team members based on individual competencies alone is not sufficient to ensure team success. Rather, it is important to consider the configuration of members with regard to knowledge, skills, abilities, and other characteristics (KSAOs) such as personality traits and experience levels. Therefore, mechanisms must be developed to determine how a potential employee will “fit” into a particular team.

CONCEPTUAL FRAMEWORK FOR UNDERSTANDING SELECTION FOR TEAM MEMBERSHIP

Figure 37.1 presents a conceptual framework that captures the dynamic, contingency, and multilevel approaches to team selection. Each component of Figure 37.1 is discussed in the following sections.


FIGURE 37.1  Conceptual Framework for Understanding Selection for Team Membership. The framework links the contingency perspective on the team and task context, comprising task demands (interdependence, coordination demands, behavioral discretion, role specialization, formal and informal structure, level of autonomy, and environmental stability) and team type (skill differentiation, authority differentiation, and temporal stability), together with dynamic considerations (changing task demands, team membership fluidity, team longevity, and the team developmental cycle) and the multilevel perspective on team-level staffing considerations (compatibility/fit among members, balance/coverage of competencies, homogeneity/heterogeneity of competencies, organizational/functional scheme, internal/external leadership, current and optimal size, climate/norms, and over/under staffing level), to core/generic and contingent teamwork competencies (knowledge, skills, abilities, attitudes, and other personality characteristics) at the individual and team levels.

Core and Contingent Teamwork Competencies

The first step in selection for team membership is to garner a thorough understanding of the KSAOs needed for effective team performance. Team selection subsumes the requirements of traditional selection, such as ensuring that individuals possess technical competence and maximizing the fit between the person and job. However, team members must also possess teamwork skills that enable interdependent work. Because taskwork skills are not unique to the team context, we focus on two types of teamwork competencies: core (general teamwork behaviors common to all team tasks) and contingent (dependent on the task and the team’s configuration). Similar to other researchers (Cannon-Bowers, Tannenbaum, Salas, & Volpe, 1995), we posit that core teamwork competencies are generic or transportable; that is, they are important regardless of the particular task or team at hand. Examples of such competencies include interpersonal skills, knowledge of teamwork, communication skills, preference for teamwork, and agreeableness. Furthermore, we propose that these attributes can be measured at the individual level. In contrast to core competencies, contingent teamwork competencies are particular to the team and task for which an individual is being selected and must therefore consider team-level attributes. Because of the dynamic nature of teams, particular needs may change as a function of the team’s changing structure, configuration, size, and/or life cycle. Drawing together the other categories of variables presented in Figure 37.1, contingent teamwork competencies are influenced by team type, task demands, dynamism, and team staffing variables, which are described in the following sections.


Team Type

Integrating the plethora of team taxonomies proposed in the literature, Hollenbeck, Beersma, and Schouten (2012) identified three critical dimensions underlying diverse team types. First, skill differentiation refers to the degree to which members possess specialized knowledge. Second, authority differentiation describes whether decision-making responsibility resides in individuals, subgroups, or the team as a whole. Third, temporal stability captures the extent to which team members have worked together in the past and plan to do so in the future. Because teams vary with regard to each of these dimensions, selection requirements will clearly differ for diverse team types. To illustrate, for stable, self-managed teams with low skill differentiation whose members make decisions by consensus and have a history and future of working together, a premium would be placed on contingent teamwork characteristics in the selection process. In contrast, for ad hoc emergency crisis teams with high skill differentiation, in which one member has decision-making authority and which disband after task completion, emphasis would be placed on taskwork characteristics in the selection process. Generic teamwork competencies that enhance human capital would be needed for both types of teams (Mathieu et al., 2013). Consideration must also be given to the particular form of staffing situation an organization is facing. Mathieu and colleagues (2013) delineated six types of team composition human resource decisions. Regarding existing teams, (1) a single member may be added, subtracted, or replaced, (2) multiple team members may be concurrently replaced, or (3) new personnel might be simultaneously assigned to multiple teams. Concerning new team creation, (4) a single team may be staffed at once (team cluster hiring), (5) multiple teams may be staffed concurrently, or (6) members may be reconfigured into multiple teams. Of these six categories, the most frequently cited staffing situations in 21 interviews with team staffing experts were team cluster hiring (4) and multiple member replacement in an existing team (2) (Donsbach et al., 2009). Each approach has benefits and drawbacks. For example, team cluster hiring is proposed to be most useful in highly competitive industries because it better mitigates external threats, exploits diversity opportunities, and increases team motivation compared to individual staffing approaches (Munyon, Summers, & Ferris, 2011). Although pre-employment expenses are predicted to be higher for cluster hiring than for individual selection, costs should decrease over time if there is little member turnover (Munyon et al., 2011).

Task Demands

Influenced in large part by team type, the nature of the task includes different types of interdependence (Saavedra, Earley, & van Dyne, 1993), the behavioral requirements of members during performance (McGrath, 1984), coordination demands (Bowers, Morgan, Salas, & Prince, 1993), behavioral discretion (the degree of control team members have in performing the task as dictated by the level of proceduralization; Cannon-Bowers, Salas, & Blickensderfer, 1998), role specialization (how roles are defined in the team; Kilduff, Angelmar, & Mehra, 2000), structure (the nature of the formal organization and communication channels; Price, Harrison, & Gavin, 2006), and level of autonomy (the degree to which the team manages itself; Langfred, 2007). Given the importance of task demands, team researchers have developed team task analysis methods.

Team Task Analysis  Analogous to job analysis for individuals (see Chapter 6, this volume), team task analysis (TTA) involves a comprehensive understanding of the nature of the team and the key skills necessary to function effectively as a collective unit (Baker, Salas, & Cannon-Bowers, 1998). Specifically, team competencies, job characteristics, and cognitive demands are three categories of information gathered during TTA (Lorenzet, Eddy, & Klein, 2003). Nevertheless, because of the lack of validated TTA techniques, traditional job analysis methods are often used for teams, violating multilevel principles (Lorenzet et al., 2003) and overlooking interactive teamwork processes, coordination, and interdependence requirements (Morgan & Lassiter, 1992).


Arthur, Edwards, Bell, Villado, and Bennett (2005) developed and validated three generic task analysis scales measuring team relatedness (extent to which tasks cannot be performed by a single individual), team workflow (paths through which information flows throughout the team), and team-task ratio (ratio of the number of tasks that cannot be performed alone to the total number of tasks). In addition, groupware task analysis has been proposed as a method for studying group activities, which involves modeling structure, workflow, artifacts, and the work environment (van Welie & van der Veer, 2003). Furthermore, cognitive TTA investigates the cognitive components underlying teamwork processes, including knowledge of goals, task procedures, sequences, timing, roles, and teammate characteristics (Blickensderfer, Cannon-Bowers, Salas, & Baker, 2000). Despite these promising developments, additional research is needed to validate existing TTA methodologies and to develop new tools.

Dynamism

Despite the prevalence of cross-sectional research designs assuming a high degree of stability, dynamism across levels of analysis is a reality in modern-day teams (Tannenbaum et al., 2012). Shifting task and environmental demands motivated by internal or external forces (Keck & Tushman, 1993) may cause members to join or leave teams at different times as well as lengthen or shorten a team’s longevity. Thus, the fluidity and permeability of team and membership boundaries must be taken into account in team staffing decisions.

Team Staffing Considerations

The level of complexity of team selection is substantially increased by the need to consider an additional set of team-relevant KSAOs and to navigate multiple levels of analysis. Indeed, a fundamental difference between selection for individual and team positions is that in team situations, the fit of members with each other and the team as a whole must be taken into account (Zaccaro & DiRosa, 2012). Therefore, when considering team selection systems, it is crucial to consider the mix of attributes across members, as well as issues like size, current staffing levels, member compatibility, and the team’s climate. Another potential difference between traditional and team selection involves the locus of responsibility for staffing. Although normally ascribed to management, some autonomous work groups are tasked with member recruitment, testing, and hiring (Hackman, 2002; Wellins, Byham, & Wilson, 1991). With the rising popularity of self-managing teams, member-initiated team selection is becoming increasingly common (D’Souza & Colarelli, 2010). In a policy-capturing study of hypothetical team member selection decisions, task skills were significantly more important than attitudinal similarity, race, or physical attractiveness in selecting members of virtual teams (D’Souza & Colarelli, 2010). In face-to-face teams, only gender had a significant effect on decision policies; women chose women more than men did in both face-to-face and virtual teams. As these results highlight, team staffing decisions are generally made from candidate pools within the company, which may therefore be smaller and less heterogeneous than the external candidate pools more common to individual selection (Zaccaro & DiRosa, 2012). The decision to recruit internal or external candidates for team positions should consider the longevity of the team as well as the depth and breadth of the candidate pool for the task and team skills needed (Zaccaro & DiRosa, 2012). In addition to the distinction between internal and external candidates, the criteria considered by organizational insiders and outsiders making selection decisions are also likely to vary. To illustrate, Whiting and Maynes (2016) found that National Football League (NFL) insiders valued contextual performance much more than external experts did, although both utilized prior task performance in evaluating college football players in the NFL draft. Contrary to predictions, workplace deviance did not significantly affect insider or outsider evaluations.


INDIVIDUAL-LEVEL CONSIDERATIONS

Individual Attributes That Contribute to Effective Teamwork

From a purely practical perspective, organizations typically hire employees individually even if they are going to work as part of a team. For this reason, it behooves team selection researchers to attempt to identify teamwork competencies that can predict as much variance in team performance as possible. Table 37.1 provides a summary of the knowledge, skills, attitudes, and personality traits that are important for team selection, although we do not claim that the list is exhaustive. In some cases, the variables displayed here have been studied, and even validated, in a selection context. However, in other cases, we have made the link to selection by extrapolating from the broader team performance literature, particularly if the attribute is difficult to train.

Measurement and Validation

Survey-Based Measures and Tests

Because of ease of administration and relatively low cost, surveys are a popular means of assessing KSAOs and personality traits for team member selection. The Teamwork KSA test is commercially available and frequently used by organizations for team selection (O’Neill, Goffin, & Gellatly, 2012). The test consists of 35 situational judgment items answered in a multiple-choice format (Stevens & Campion, 1999). On the basis of the conceptual model of teamwork requirements developed by Stevens and Campion (1994), the test captures both interpersonal (conflict resolution, collaborative problem-solving, communication) and self-management (goal-setting, performance management, planning, and coordination) KSAs. Validation efforts showed that the Teamwork KSA test correlates with supervisory ratings of teamwork and taskwork performance (Leach, Wall, Rogelberg, & Jackson, 2005; Stevens & Campion, 1999), peer nominations of teamwork (Stevens & Campion, 1999), team task proficiency (Hirschfeld, Jordan, Field, Giles, & Armenakis, 2006), observed ratings of effective teamwork (Hirschfeld et al., 2006), and contextual performance (Morgeson, Reider, & Campion, 2005) in organizational and military samples. Moreover, one sample revealed incremental criterion-related validity beyond employment aptitude tests (Stevens & Campion, 1999). Higher scores on the Teamwork KSA test also yielded higher observable teamwork behavior scores and peer ratings of individual effectiveness in a student sample (McClough & Rogelberg, 2003). In a recent quantitative review of the Teamwork KSA test, which included nine studies (33 coefficients), O’Neill and colleagues (2012) found an average criterion-related validity of .20. Despite these strengths, a cautionary note is that strong correlations with cognitive ability (.80) have raised the issue of redundancy. Furthermore, in a field sample of 268 job candidates in a team-based organization using a predictive validity design, O’Neill and colleagues (2012) concluded that none of the observed correlations between the Teamwork KSA test and team performance were significant. In addition, subscale reliabilities were inadequate, and no interpretable factor structure emerged, although these findings are not uncommon for situational judgment test (SJT) measures. The Teamwork KSA test also correlated more strongly with taskwork criteria than with teamwork criteria (perhaps because of the strong correlation with cognitive ability). Another SJT developed and validated for team member selection is the Team Role Test (Mumford, Van Iddekinge, Morgeson, & Campion, 2008), which assesses declarative and procedural knowledge of team role types and the situational contingencies needed for role adaptability. The Team Role Test consists of nine team scenarios, each requiring one appropriate role, with 10 items per scenario. In academic and work team samples, the Team Role Test was positively related to peer ratings of team role performance (Mumford et al., 2008). Furthermore, the SJT demonstrated incremental validity beyond cognitive ability and Big Five traits in predicting role performance (Mumford et al., 2008).


TABLE 37.1  Examples of KSAOs That May Be Important for Team Selection

Ability

Cognitive Ability
Definition: Capacity to perform higher mental processes such as problem solving and reasoning.
Related/Subsidiary Constructs: Verbal, numerical, spatial.
Validation/Measurement Issues: Evidence of a positive relationship with team performance across multiple meta-analyses (Bell, 2007; Devine & Philips, 2001; Stewart, 2006).

Knowledge

Knowledge of teamwork skills
Definition: Understanding of the necessary underpinnings and behavioral requirements of effective team performance.
Related/Subsidiary Constructs: Understanding teamwork, familiarity with teamwork, knowledge of teamwork KSAs.
Validation/Measurement Issues: Assessed via the Teamwork KSA test. Some validation data are predictive of effective teamwork (Hirschfeld et al., 2006; McClough & Rogelberg, 2003; Stevens & Campion, 1999), but there is also non-supportive predictive validity evidence (O’Neill et al., 2012).

Knowledge of team roles
Definition: Knowledge of team roles and their situational contingencies.
Validation/Measurement Issues: Assessed via the Team Role Test. Validation data show that this test predicts role performance (Mumford et al., 2008).

Skills

Adaptability
Definition: Ability of team members to adjust their strategies in response to task demands by reallocating team resources.
Related/Subsidiary Constructs: Compensatory behavior, backing-up behavior, dynamic reallocation of function, mutual adjustment, workload balancing.
Validation/Measurement Issues: Best assessed in a work sample or other simulation. Some data suggest that adaptability improves teamwork (Salas, Nichols, & Driskell, 2007).

Interpersonal
Definition: Ability of team members to optimize the quality of team member interactions through resolution of dissent, motivational reinforcement, and cooperative behaviors.
Related/Subsidiary Constructs: Morale building, conflict resolution, negotiation, cooperation, consulting with others, interpersonal trust, social perception, persuasion, helping others.
Validation/Measurement Issues: May be assessed through a combination of survey-based and behavioral measures. Some validation data suggest that interpersonal skills predict teamwork (Morgeson et al., 2005).

Team management/leadership
Definition: Ability of team members to direct and coordinate activities; assign tasks; organize workflow among members; and plan, organize, and establish a positive climate.
Related/Subsidiary Constructs: Task motivation, goal-setting, planning and task coordination, establishing roles and expectations, instructing others, planning, organizing.
Validation/Measurement Issues: Best assessed in a work sample or other simulation, although survey-based instruments may add value. Some research indicates that individual leadership skills are associated with teamwork effectiveness (Burke et al., 2006).

Assertiveness
Definition: Capacity of team members to communicate effectively by sharing ideas clearly and directly in interpersonal situations.
Related/Subsidiary Constructs: Task-related assertiveness, a component of extraversion.
Validation/Measurement Issues: Can be assessed via survey-based measures or tests, but behavioral measures are better. Some validation data exist (Pearsall & Ellis, 2006; Smith-Jentsch, Salas, & Baker, 1996).

Mutual performance monitoring
Definition: Ability of team members to accurately monitor and assess the work of others; ability to give, seek, and receive task-clarifying feedback in a constructive manner, and to offer advice.
Related/Subsidiary Constructs: Accepting suggestions/criticism; giving suggestions/criticism; intrateam feedback; monitoring and giving feedback; cross checking; error correction; team maintenance.
Validation/Measurement Issues: Best assessed through a combination of survey-based and behavioral measures. Has been linked to team performance (Marks & Panzer, 2004).

Communication
Definition: Ability to clearly and accurately articulate and exchange information among team members using accepted terminology; acknowledgment of receipt of information; clarification of messages when needed.
Related/Subsidiary Constructs: Active listening; information exchange; closed-loop communication; information sharing; open exchange; consulting with others.
Validation/Measurement Issues: Best assessed through a combination of survey-based and behavioral measures. Closed-loop communication has been shown to predict teamwork (Bowers, Pharmer, & Salas, 2000).

Cross-boundary
Definition: External, task-related actions directed to other teams or the larger organizational context.
Related/Subsidiary Constructs: Organizational awareness, organizational resourcefulness, building relationships with other teams.
Validation/Measurement Issues: Survey-based measure developed by Druskat and Kayes (1999).

Attitudes

Preference for teamwork
Definition: Inclination and desire to be part of a team; willingness to engage with other people in pursuit of task success; appreciation for the importance of teamwork in accomplishing challenging tasks.
Related/Subsidiary Constructs: Team/collective orientation; importance of teamwork; appreciation for teamwork; desire to work in a team; collectivism; preference for teamwork.
Validation/Measurement Issues: Assessed with survey-based measures. Some evidence suggests that a collective orientation leads to better teamwork (Driskell & Salas, 1992) and that those who enjoy working in a team engage in less social loafing (Stark, Shaw, & Duffy, 2007) and have better team performance (Bell, 2007; Helmreich & Foushee, 1993).

Self-efficacy for teamwork
Definition: Degree to which individuals believe that they have the requisite knowledge, skills, and other attributes to be a successful team member.
Related/Subsidiary Constructs: Teamwork self-efficacy.
Validation/Measurement Issues: Measured with surveys (McClough & Rogelberg, 2003). Some data support the link to effective teamwork (e.g., Tasa, Taggar, & Seijts, 2007).

Other Characteristics

Team Role Experience and Orientation (TREO) dimensions
Definition: Predisposition to occupy team roles based on prior experience.
Related/Subsidiary Constructs: Organizer, doer, challenger, innovator, team builder, and connector roles.
Validation/Measurement Issues: Measured with surveys. Some content and predictive validity evidence for the TREO dimensions (Mathieu et al., 2015).

Personality

Conscientiousness
Definition: Extent to which a person is self-disciplined and organized.
Related/Subsidiary Constructs: Need for achievement, ambition, responsible, dependable.
Validation/Measurement Issues: Assessed with survey-based measures. Evidence of a positive relationship with team performance from multiple meta-analyses (Bell, 2007; Mount, Barrick, & Stewart, 1998; Peeters, van Tuijl, Rutte, & Reymen, 2006). Positively related to contextual performance in team settings (Morgeson et al., 2005).

Extraversion
Definition: Extent to which an individual is social, outgoing, and talkative.
Related/Subsidiary Constructs: Enthusiasm, optimism, assertiveness, dominance, gregariousness.
Validation/Measurement Issues: Assessed with survey-based measures. Small but significant positive relationship with team performance in two meta-analyses (Bell, 2007; Peeters et al., 2006). Positively related to contextual performance in team settings (Morgeson et al., 2005).

Agreeableness
Definition: Extent to which an individual is gentle and cooperative.
Related/Subsidiary Constructs: Likeability, interpersonal facilitation, trustworthy, tolerance, courteousness.
Validation/Measurement Issues: Assessed with survey-based measures. Evidence of a strong, positive relationship with team performance from multiple meta-analyses (Bell, 2007; Mount et al., 1998; Peeters et al., 2006). Positively related to contextual performance in team settings (Morgeson et al., 2005).

Emotional stability
Definition: Extent to which an individual is calm and poised.
Related/Subsidiary Constructs: Neuroticism (negative relationship), adjustment, lack of nervous tendencies, not anxious, security.
Validation/Measurement Issues: Assessed with survey-based measures. Small, positive relationship with team performance when mean-aggregated in two meta-analyses (Bell, 2007; Mount et al., 1998). Positively (but only marginally) related to contextual performance in team settings (Morgeson et al., 2005).

Openness to experience
Definition: Extent to which an individual is curious and imaginative.
Related/Subsidiary Constructs: Original, daring, and broad-minded.
Validation/Measurement Issues: Assessed with survey-based measures. Positively related to team performance in two meta-analyses (Bell, 2007; Mount et al., 1998).

Recently, a 48-item survey measure has been developed and validated to assess members’ propensities to occupy different team roles independent of particular team contexts (organizer, innovator, doer, challenger, team builder, and connector; Mathieu, Tannenbaum, Kukenberger, Donsbach, & Alliger, 2015). Self-reports on these six Team Role Experience and Orientation (TREO) dimensions were content validated, found to be distinguishable from Big Five personality constructs, and predicted corresponding peer ratings of members’ role behaviors three months later (Mathieu et al., 2015). It should be noted that there were high intercorrelations (averaging .70 across samples) among the six dimensions.

Work Sample and Interview Measures

Although the advantages of behaviorally based measures for team processes and performance are readily acknowledged by team scholars (Salas, Burke, Fowlkes, & Priest, 2004), placing applicants in realistic team situations is more difficult and expensive than administering survey-based measures and tests. Nevertheless, team-oriented assessment centers utilizing team consensus exercises have been successfully implemented (Kirksey & Zawacki, 1994; Wellins, Byham, & Dixon, 1994). Moreover, interviews have been shown to effectively measure interpersonal skills (Huffcutt, Conway, Roth, & Stone, 2001). Indeed, a study investigating the selection of individuals in organizational teams found that social skills, as measured by a structured interview, predicted contextual performance beyond Big Five traits and the Teamwork KSA test (Morgeson et al., 2005). Technologies such as intelligent video-based systems may also prove useful in providing a realistic context in which to assess team skills (Cannon-Bowers, Bowers, & Sanchez, 2007).

TEAM-LEVEL CONSIDERATIONS

Thus far, we have discussed the individual-level KSAOs needed for team functioning, which assumes that teams whose members score higher on taskwork and teamwork competencies will perform better. However, “when individuals form groups the effects of a valid selection procedure can be nullified by any lack of cooperation within groups and by bottlenecks, shirking, and social loafing” (Schneider, Smith, & Sipe, 2000, p. 99). Therefore, it is critical that the overall team context be considered in selection for team membership. In the following sections, we discuss team size, person-group fit, and team composition.

Team Size

Because too few members can result in unreasonable work demands and too many members can produce unnecessary redundancy, an important consideration in team staffing involves determining an appropriate team size. Although larger teams are generally advantaged in terms of division of labor and knowledge resources, they are disadvantaged by lower member involvement and heightened coordination difficulties (Aube, Rousseau, & Tremblay, 2011; Staats, Milkman, & Fox, 2012). Managers tend to focus on the potential for process gains when increasing team size, but they underestimate process losses (Staats et al., 2012). This is unfortunate, as a number of studies have found negative outcomes for increasing the number of team members. For example, Aube and colleagues (2011) found a negative relationship between team size and the quality of group experience in organizational teams, as mediated by counterproductive work behaviors (e.g., interpersonal aggression, boastfulness, misuse of resources). Across 329 U.S. work groups, Wheelan (2009) concluded that groups with 3–6 members were more productive and more developmentally advanced than groups with 7–10 members or more than 11 members (no significant difference between the latter two categories). Evidencing the same trend, groups of 3–4 members were more productive and developmentally advanced than groups of 5–6 members (Wheelan, 2009).


Based on the studies presented above, one prescription is to staff teams with the smallest number required to do the work, but determining the optimal figure is contingent on team and task type (Steiner, 1972). To illustrate, a meta-analysis by Stewart (2006) found that the overall relationship between team size and performance was very small, but moderation effects revealed stronger positive results for project and management teams as compared to production teams. Because project and management teams involve unstructured tasks and interaction with external constituencies, more team members may be desirable when the environment is complex (Stewart, 2006). Thus, the right size for a team depends on its goals and purpose.

Person-Group Fit

Subsumed under the broad, multilevel construct of person-environment (PE) fit, person-group (PG) or person-team fit refers to the compatibility between members and their groups (Werbel & Johnson, 2001). Two general categories of PG fit have been identified. Supplementary PG fit occurs when the individual and the workgroup share similar personality, goals, values, and abilities. In contrast, complementary PG fit occurs when members have different competencies, offsetting others’ weaknesses and offering resources that support each other (Werbel & Johnson, 2001). For example, a person with a marketing background may fill a gap in a team comprising engineers with complementary fit, whereas a person with an engineering background may join a team of other engineers with supplementary fit. Research on supplementary fit or PG congruence has examined fit on a variety of content domains, such as values (e.g., Adkins, Ravlin, & Meglino, 1996; DeRue & Morgeson, 2007), goals (e.g., Kristof-Brown & Stevens, 2001), and personality traits (e.g., Kristof-Brown, Barrick, & Stevens, 2005a). Among these various content dimensions, PG value congruence appears to have the strongest correlations with various outcomes, given the relative constancy of value systems (Kristof-Brown, Zimmerman, & Johnson, 2005b). With respect to complementary fit on personality traits, there is some evidence that extraverts are more attracted to teams of introverts, whereas introverts are more attracted to teams of extraverts (Kristof-Brown et al., 2005a). Compared with the other types of fit (e.g., person-job, person-organization, person-supervisor), PG fit has received the least research attention. However, research activity has grown in the past several years. Individual member characteristics have been shown to be important predictors of PG fit. In particular, individual performance and growth satisfaction of team members were found to positively predict person-team congruence on values and person-role demands-abilities fit (DeRue & Morgeson, 2007). In addition, individuals who had worked in many companies in the past placed greater emphasis on person-organization fit, whereas individuals with longer working experience prioritized person-job fit, deflating the significance of PG fit when evaluating satisfaction with work and team (Kristof-Brown, Jansen, & Colbert, 2002). Hollenbeck (2000) discussed the various ways in which individual traits can be matched with team type to improve team performance. For example, to achieve internal person-team fit, it is recommended that researchers and practitioners select individuals who are high on cognitive ability for teams characterized by broad and undefined roles, but select individuals who are relatively high on openness to experience for teams that constantly need to change and adapt to the environment (Hollenbeck, 2000). Additionally, functional team structures, which are defined by roles that are narrow and low in scope, require agreeable members, whereas self-managing teams are better suited for high-conscientiousness members. Finally, misaligned team structures, which occur when the team structure is not well matched to the environment, need emotionally stable individuals to handle the stress of associated problems (Hollenbeck, 2000). Research on PG fit has also demonstrated various advantages for the individual and team.
For example, PG value congruence contributed to increased satisfaction with work and social relationships, improved performance on interpersonal dimensions, and reduced tardiness and absenteeism (Adkins et al., 1996). Additionally, similarity between the individual and team on perceived self and team mastery goals as well as self and team performance goals led to increased interpersonal contributions to the workgroup (Kristof-Brown & Stevens, 2001).


Self-team performance goal congruence also improved satisfaction with work and the team (Kristof-Brown & Stevens, 2001). The PG fit-outcome relationship can be characterized as reciprocal and cyclical, in that improved PG fit enhances individual and group outcomes, which then results in better perceived PG fit (DeRue & Morgeson, 2007). It is important for researchers to measure the perceptions of team members in assessing PG fit, as studies have shown the greater salience of perceived PG fit as opposed to actual PG fit in determining individual outcomes (Kristof-Brown & Stevens, 2001). Indeed, shared team member perceptions of high supplementary and high complementary fit were associated with better performance (De Cooman, Vantilborgh, Bal, & Lub, 2016). Two meta-analyses have shed light on the relationship between PG fit and a number of outcomes. First, Kristof-Brown and colleagues (2005b) established that PG fit (broadly defined) taps an independent conceptual domain distinct from other types of fit. Interestingly, PG fit predicted outcomes such as work satisfaction and overall performance as well as more established dimensions of fit did (Kristof-Brown et al., 2002; Kristof-Brown et al., 2005b). Specifically, PG fit was positively correlated with job satisfaction, organizational commitment, supervisor satisfaction, overall performance, and contextual performance and negatively correlated with intention to quit (Kristof-Brown et al., 2005b). Coworker satisfaction and group cohesion exhibited particularly strong relationships with PG fit. In a second meta-analysis, Oh et al. (2014) obtained results similar to those of Kristof-Brown and colleagues (2005b) but also compared the relationship between PG fit and various outcomes across cultures. They found that the relationship between PG fit and organizational commitment, job satisfaction, and performance was stronger in East Asian samples than in North American samples. These differences appeared to be driven by the cultural values of in-group and institutional collectivism and power distance. The results indicate that culture plays an important role in shaping PG fit, which has implications for cross-cultural team selection. Although previous research focused on individual-level outcomes, recent studies have begun focusing on team-level outcomes. Kristof-Brown, Seong, Degeest, Park, and Hong (2014) examined team-level collective fit, which was defined as “team members’ shared assessment of compatibility with each other and with the requirements of the task environment” (p. 971). Team-level collective fit positively predicted team cohesion, team efficacy, and team performance beyond individual-level fit. Team-level collective fit also positively predicted individual-level commitment and performance beyond individual-level fit. Also at the team level, Seong, Kristof-Brown, Park, Hong, and Shin (2015) found that supplementary and complementary fit were better represented as a single PG fit factor. Sex diversity and work experience diversity were negatively related to PG fit perceptions, whereas age diversity and education diversity were positively related to PG fit perceptions. Furthermore, team-level fit was more strongly related to performance than was the individual-level fit-performance relationship indicated by the Kristof-Brown and colleagues (2005b) meta-analysis.
Given the advantages gleaned from PG fit, it is important for managers and practitioners to consider the match between individuals and the groups to which they are assigned. Measuring individual-level teamwork skills is necessary, but not sufficient, for team selection, as the interaction between individual characteristics, the team environment, and culture must be taken into account. One available tool for determining PG fit is the Team Selection Inventory, which assesses an individual’s preferred style for working in a team as compared to the team’s current climate (Burch & Anderson, 2004). Evidence of acceptable psychometric quality was reported across six studies (Burch & Anderson, 2004).

Team Composition

Composition is a broad term referring to configurations of attributes within small groups (Levine & Moreland, 1990). Whereas the PG fit literature has mostly examined individual-level criteria, team composition studies aggregate member characteristics to the group level and investigate their impact on group-level outcomes.


The emerging conceptual framework reflects a contingency perspective by suggesting that how and why composition variables influence team outcomes will depend on a multiplicity of factors, including the aggregation method used, the individual differences assessed, the particular outcomes studied, and the nature of the team task (Mathieu et al., 2013). For example, team composition research is complicated by the various ways that individual scores can be combined to arrive at a group score (e.g., mean, variance, the lowest or highest team member scores). Studies have demonstrated that results differ, depending on the type of aggregation used, and that each captures a unique aspect of team composition (e.g., Barrick, Stewart, Neubert, & Mount, 1998; Bell, 2007). In the following sections, we organize our discussion of these contingency factors by reviewing three broad approaches to assessing team composition: mean values, diversity indices, and more complex configurations (Mathieu et al., 2008).
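To make these aggregation alternatives concrete, the following minimal Python sketch (not drawn from any of the studies cited here; the member scores are hypothetical) computes several common team-level operationalizations of a single member-level attribute.

```python
from statistics import mean, pvariance

def team_level_scores(member_scores):
    """Compute common team-level operationalizations of one member-level attribute.

    The mean treats members as equally influential, the variance indexes dispersion,
    and the minimum/maximum assume that the weakest or strongest member drives
    team outcomes.
    """
    return {
        "mean": mean(member_scores),
        "variance": pvariance(member_scores),
        "minimum": min(member_scores),
        "maximum": max(member_scores),
    }

# Hypothetical conscientiousness scores (1-5 scale) for a five-member team
print(team_level_scores([4.2, 3.8, 4.5, 2.1, 3.9]))
```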

Mean Values

The most popular and straightforward approach to aggregating individual scores to the team level is to simply average each member’s responses. Cognitive ability has yielded the most robust results in team composition research, replicating across field maintenance teams (Barrick et al., 1998), student laboratory groups (Day, Arthur, Miyashiro, Edwards, & Hanson, 2004), human resource teams (Neuman & Wright, 1999), military tank crews (Tziner & Eden, 1985), and hierarchical decision-making teams (LePine, Hollenbeck, Ilgen, & Hedlund, 1997). Mirroring the strong positive relationship between cognitive ability and individual-level performance (Schmidt, 2002), several meta-analyses have concluded that teams with smarter members do better (Bell, 2007; Devine & Philips, 2001; Stewart, 2006). When different operationalizations of cognitive ability are compared (e.g., mean, maximum, minimum, variance), the mean has emerged as the strongest predictor of team performance across several task types (Day et al., 2004; Devine & Philips, 2001). Although the results for cognitive ability were notably stronger, a meta-analysis by Stewart (2006) found a small positive relationship between expertise (mean-aggregated member experience and education) and team performance. Regarding personality traits, much of the existing mean-aggregated research has focused on the Five-Factor Model (conscientiousness, extraversion, agreeableness, neuroticism, and openness to experience). Multiple meta-analyses have concluded that teams composed of conscientious and agreeable members perform better (Bell, 2007; Peeters, van Tuijl, Rutte, & Reymen, 2006; Stewart, 2006). Bell’s (2007) meta-analysis also found that mean levels of all five traits of the Five-Factor Model positively predicted performance in field settings. Not surprisingly, these personality traits generally exhibited stronger relationships with performance for organizational teams as compared to laboratory groups (Bell, 2007; Peeters et al., 2006). In terms of values, there is meta-analytic support for a positive relationship between team performance and both mean team collectivism and mean preference for teamwork in field settings (Bell, 2007). In addition, a study by Hobman, Bordia, and Gallois (2004) found that group openness to diversity was positively associated with team involvement.

Diversity Indices

Diversity describes the “distribution of differences among the members of a unit with respect to a common attribute” (Harrison & Klein, 2007, p. 1200). Diversity can be represented as differences of opinion among group members on a horizontal continuum (separation), differences in access to distinct sources of information (variety), or differences regarding valued resources (disparity; Harrison & Klein, 2007). As the team diversity literature is voluminous, we will briefly highlight mostly meta-analytic work on demographics, job-related diversity, and personality. The results of multiple meta-analyses have consistently yielded negligible effects for the relationship between heterogeneity on demographic variables (e.g., gender, race, age) and team performance (Bell, Villado, Lukasik, Belau, & Briggs, 2011; Bowers, Pharmer, & Salas, 2000; Horowitz & Horowitz, 2007; Joshi & Roh, 2009; Stewart, 2006; van Dijk, van Engen, & van Knippenberg, 2012; Webber & Donahue, 2001). Therefore, researchers have been strongly advised to explore moderating influences rather than focus solely on main effects (van Knippenberg & Schippers, 2007). Team and task types have been strongly implicated as moderator variables that account for the inconsistency in research findings concerning the effect of composition variables on team outcomes (e.g., Bell et al., 2011; Bowers et al., 2000; Webber & Donahue, 2001). The potentially positive effects of work group diversity on group performance are more likely to emerge in teams performing relatively complex tasks that require information processing, creativity, and collaborative decision making, where the exchange and integration of diverse task-related information may stimulate thorough consideration of ideas (Bowers et al., 2000; Stewart, 2006; van Knippenberg, De Dreu, & Homan, 2004). Time is another moderator that has proven fruitful in explaining some of the null and inconsistent research findings. Specifically, the effects of demographic diversity on team processes have been shown to weaken over time (or with greater group tenure), whereas the effects of deep-level diversity (e.g., job-related attitudes) strengthen over time (e.g., Harrison, Price, & Bell, 1998; Harrison, Price, Gavin, & Florey, 2002). In addition to team/task types and time, accounting for contextual factors such as industry and occupation increased the size of the relationship between demographic diversity and team performance in a meta-analysis by Joshi and Roh (2009). Moreover, meta-analytic results revealed the role of rater biases in that the relationship between demographic diversity and performance was negative when performance was rated by external team leaders but nonsignificant when performance was objectively measured or rated by internal team leaders or team members (van Dijk et al., 2012). Meta-analytic results for job-related diversity have also been inconsistent. Although Webber and Donahue (2001) found that highly job-related diversity (functional, educational, and industry background) was not related to team outcomes, Horowitz and Horowitz (2007) found a positive relationship with both the quality and quantity of team performance. A more recent meta-analysis by Bell et al. (2011) established that diversity of functional background measured as variety (but not educational diversity) was positively associated with team performance (Bell, 2007). Once again, interactive effects play a key role in interpreting mixed results. In their meta-analysis, van Dijk and colleagues (2012) found that task complexity moderated the relationship between job-related diversity and team performance, and that job-related diversity was more positively associated with innovative performance than with in-role performance. Similarly, functional background and educational diversity yielded stronger effects with performance when innovation rather than efficiency was the criterion (Bell et al., 2011). Industry, occupation, and team context also meta-analytically emerged as moderators of the relationship between job-related diversity and team performance (Joshi & Roh, 2009). With regard to personality, heterogeneity may be disadvantageous for some traits and advantageous for others.
Because low- and high-conscientiousness members hold different perspectives on how much effort to invest toward goal achievement, diversity on conscientiousness has been negatively related to performance (Barrick et al., 1998; Humphrey, Hollenbeck, Meyer, & Ilgen, 2011). In contrast, diversity on extraversion may lead to more positive outcomes because roles are complementary, with some members talking/leading and others listening/following (e.g., Humphrey, Hollenbeck, Meyer, & Ilgen, 2007; Neuman, Wagner, & Christiansen, 1999). Several studies have found favorable results for variability on extraversion (e.g., Barry & Stewart, 1997; Humphrey et al., 2011; Mohammed & Angell, 2003; Neuman et al., 1999), but meta-analytic results have not been supportive (Bell, 2007; Peeters et al., 2006). In general, meta-analyses investigating member heterogeneity on personality characteristics have not yielded strong findings (e.g., Bell, 2007; Stewart, 2006). Extending beyond Big Five personality traits, research has begun to demonstrate that temporal diversity on traits such as time urgency (chronic hurriedness), polychronicity (preference for multitasking), and pacing style (pattern of effort distribution in working toward deadlines) has implications for team processes and performance (Mohammed & Angell, 2004; Mohammed & Nadkarni, 2011, 2014).
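As a rough illustration of Harrison and Klein's (2007) three forms of diversity defined at the start of this section, the sketch below computes one conventional index for each: the standard deviation for separation, Blau's index for variety, and the coefficient of variation for disparity. These operationalizations are common choices in the diversity literature rather than prescriptions from this chapter, and the member data are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def separation(opinions):
    """Separation: spread along an attitudinal continuum; indexed here by the standard deviation."""
    return pstdev(opinions)

def variety(categories):
    """Variety: differences in kind (e.g., functional background); Blau's index = 1 - sum(p_k**2)."""
    n = len(categories)
    return 1 - sum((count / n) ** 2 for count in Counter(categories).values())

def disparity(resources):
    """Disparity: concentration of a valued resource (e.g., pay); coefficient of variation."""
    return pstdev(resources) / mean(resources)

# Hypothetical five-member team
print(separation([2, 3, 3, 4, 5]))                                             # attitude scores
print(variety(["engineering", "marketing", "engineering", "finance", "engineering"]))
print(disparity([50_000, 55_000, 60_000, 58_000, 140_000]))                    # salaries
```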


Complex Configurations

Whereas mean and diversity aggregation methods assume that all members make equal contributions to the team, selecting the maximum or minimum team member score assumes that particular members exert a disproportionate influence on team processes and outcomes (e.g., Mathieu et al., 2014). For example, the Bell (2007) meta-analysis found that a single disagreeable member impaired team performance. Considerably less research has been devoted to compilational models capturing complex patterns of lower-level constructs in comparison to compositional models representing more straightforward combinations like the mean or variance (Kozlowski & Klein, 2000). Also representing a compilational approach, faultline theory explores the hypothetical dividing lines that may split members into subgroups based on one or more attributes (Lau & Murnighan, 1998). Rather than focusing on a single demographic characteristic at a time (e.g., gender), the faultline approach recognizes that individuals have multiple identities simultaneously (e.g., Hispanic female under 30) and that the configuration of those differences matters in teams. Meta-analytic evidence shows that the more demographic differences converge with each other (e.g., all male members of a work group are Caucasian, while all female members are Hispanic), the more groups experience heightened task and relationship conflict as well as decreased cohesion, satisfaction, and performance (Thatcher & Patel, 2011).
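To illustrate what it means for demographic differences to converge, the sketch below flags whether two categorical attributes split a hypothetical team into exactly the same subgroups. This is a deliberately crude indicator intended only to convey the configural idea; it is not one of the published faultline strength measures analyzed by Thatcher and Patel (2011).

```python
def perfectly_aligned(attr_a, attr_b):
    """Return True if two categorical attributes partition members into identical subgroups.

    Perfect alignment (e.g., every male member is Caucasian and every female member is
    Hispanic) is the kind of convergence that faultline theory links to stronger
    subgroup splits.
    """
    a_to_b, b_to_a = {}, {}
    for a, b in zip(attr_a, attr_b):
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False  # a category of one attribute spans two categories of the other
    return True

gender = ["M", "M", "F", "F"]
ethnicity = ["Caucasian", "Caucasian", "Hispanic", "Hispanic"]
print(perfectly_aligned(gender, ethnicity))  # True: the two attributes converge into two subgroups
```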

Team and Task Type Revisited

Steiner’s (1972) task typology has been the most commonly used approach to specifying the appropriate operationalization in the team composition literature. According to Steiner (1972), mean aggregation is best suited for additive tasks, in which group performance is the sum of each member’s contribution (e.g., shoveling snow). Minimum scores are deemed appropriate for conjunctive tasks where the weakest member determines team performance (e.g., mountain climbing), and maximum scores are deemed appropriate for disjunctive tasks where the most competent member determines team performance (e.g., problem solving). However, studies have been critical of this rationale (e.g., Day et al., 2004), and a meta-analysis found that stronger effects were not observed when the operationalization matched the task type of Steiner’s typology (Bell, 2007). Because Steiner’s (1972) task taxonomy focused exclusively on the way in which group members’ contributions combine into a team outcome, additional variables must be considered in determining the appropriate method of aggregation, including the predictor and outcome variables being assessed as well as team and task type. For example, in a sample of business student teams, Mohammed and Angell (2003) found that diversity on agreeableness, neuroticism, and extraversion affected oral presentation scores, but mean cognitive ability positively affected written reports. Reflecting these findings, Bell’s (2007) meta-analysis concluded that the best aggregation method depended on the composition variable of interest and that no single operationalization emerged as superior for all composition variables. To illustrate, the strongest relationships with team performance were observed when conscientiousness was operationalized as the team mean but when agreeableness was operationalized as the team minimum (one disagreeable member was enough to be a disruptive force) (Bell, 2007).

Team Composition Tools

In recent years, computer-based systems have been developed to assist with the multiplicity of factors that should be taken into account when composing teams. In this section, we feature three tools, the first specifically designed for student teams and the second and third developed for organizational teams. A team of academics developed a free web-based system (www.CATME.org) designed to compose student teams and track their performance, called the Comprehensive Assessment of Team Member Effectiveness (CATME; Layton, Loughry, Ohland, & Ricco, 2010). Relevant to team selection, the Team-Maker tool in the CATME system allows instructors to collect student data on various criteria (e.g., demographic information, grade point average, preferred team roles, meeting availability) via a computer-aided team formation survey. Instructors can then select which criteria to use, weight each factor, and determine the maximum and minimum team size in assigning members to teams (Hrivnak, 2013). The system algorithm then automatically composes teams as specified. Based on interviews with team staffing experts from a variety of industries, researchers developed a generic, customizable tool to help decision makers compose teams (Donsbach et al., 2009). The Team Optimal Profile System (TOPS) provides an algorithm that balances competing demands, including individual team and task competencies, task interdependence, and interrelationships among members. A variety of team staffing decisions are accommodated, including assigning multiple people to a new team or more than one person to an existing team. Leaders provide information in the customization process, including individual KSAOs, minimum job requirements, member availability, and constraints such as which individuals should not be paired together. Decision makers also assign each attribute a weight representing its importance. The TOPS algorithm then optimizes the mix of members’ KSAOs with job demands, and changes can be made as new information becomes available (Donsbach et al., 2009). Millhiser, Coen, and Solow (2011) investigated how information about employee interdependencies could be used to compose teams to maximize performance. Computer simulation was used to run thousands of experiments testing various interdependence configurations. Specifically, policies that divided members equally across teams based on individual performance were compared with policies that distributed members based on how well they worked together. Results revealed that dividing skilled workers equally across teams (“spreading the talent around”) was less effective than allowing good performers to maintain most of their relationships and disrupting the relationships of poor performers. Thus, Millhiser and colleagues (2011) recommended that managers respect prior member interdependencies (e.g., how supportive members are to each other) in forming teams to maximize performance across teams.
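As a rough sketch of the kind of weighted, criterion-based assignment such tools automate (this is not the actual CATME Team-Maker or TOPS algorithm; the candidate names, criteria, and weights are hypothetical), the following greedy heuristic sorts candidates by a weighted composite score and places each on the team with the lowest running total so that the composite is roughly balanced across teams.

```python
def balanced_assignment(candidates, weights, n_teams):
    """Greedily assign candidates to teams so that a weighted composite score is balanced.

    candidates: dict mapping candidate name -> dict of criterion scores
    weights: dict mapping criterion name -> importance weight
    Real systems add constraints (availability, "do not pair" rules, minimum job
    requirements) and use more sophisticated optimization than this heuristic.
    """
    def composite(scores):
        return sum(weight * scores.get(criterion, 0.0) for criterion, weight in weights.items())

    teams = [[] for _ in range(n_teams)]
    totals = [0.0] * n_teams
    for name, scores in sorted(candidates.items(), key=lambda item: composite(item[1]), reverse=True):
        idx = totals.index(min(totals))  # team currently lowest on the weighted composite
        teams[idx].append(name)
        totals[idx] += composite(scores)
    return teams

# Hypothetical candidates scored on cognitive ability and preference for teamwork (1-5 scales)
candidates = {
    "Ana": {"cognitive": 4.5, "teamwork": 3.0},
    "Ben": {"cognitive": 3.0, "teamwork": 4.5},
    "Chen": {"cognitive": 4.0, "teamwork": 4.0},
    "Dita": {"cognitive": 2.5, "teamwork": 3.5},
    "Emeka": {"cognitive": 3.5, "teamwork": 2.5},
    "Farah": {"cognitive": 4.2, "teamwork": 3.8},
}
weights = {"cognitive": 0.6, "teamwork": 0.4}
print(balanced_assignment(candidates, weights, n_teams=2))
```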

DISCUSSION

Implications for Research

Although many theoretically derived variables have been hypothesized and investigated as important contributors to team effectiveness, few studies have been conducted to validate the predictive power of these attributes in a selection context. Moreover, studies that assess the combinatorial contributions of individual- and team-level factors are required in order to optimize the prediction of effective teamwork. Because many aspects of team functioning cannot be easily measured via surveys, efforts to develop behaviorally based assessment tools to capture observed team actions objectively and reliably are also sorely needed. New technologies have emerged as candidates for simulating team environments realistically, including role players, video-based systems, and virtual world technologies (Cannon-Bowers et al., 2007), but they must be validated for team selection.

Although meta-analytic results have been straightforward regarding mean-aggregated characteristics (e.g., Bell, 2007; Stewart, 2006), findings have been far less conclusive regarding how to improve the mix of competencies in a team or how to select new team members while considering existing team member KSAOs. Because the team composition literature has been criticized as "conceptually scattered" (McGrath, 1984, p. 256) and "atheoretical" (Levine & Moreland, 1990, p. 594), well-developed models adopting a contingency and multilevel perspective are needed to help clarify the complex patterns of variables that are deemed important in this literature. A comprehensive "meso" approach to team staffing involves not only multiple levels but also cross-level interactions (Ployhart, 2004).

Team boundaries are becoming more dynamic, permeable, and difficult to identify because many employees are members of multiple teams simultaneously, work in multiple geographies and/or time zones, may join or leave teams at different times, and are expected to self-govern (Tannenbaum et al., 2012). Given the increasing dynamism and complexity of many team contexts, the team composition literature needs to revisit many of its simplistic assumptions regarding membership stability and equal member contributions to team dynamics (as assumed by mean aggregation). Qualitative research, longitudinal designs, computational models, and network approaches can help achieve higher levels of sophistication theoretically, methodologically, and analytically (Mathieu et al., 2014).

Implications for Practice

Clearly, the starting point in selection for team membership should be a team-based task analysis that specifies the nature of the team and the purposes for which it exists. Because both individual member performance and the performance of the team as a whole must be accounted for, we suggest that a multi-phase procedure be utilized for team selection. In the first stage, generic team and task competencies would be assessed, including cognitive ability, conscientiousness, agreeableness, preference for teamwork, and interpersonal KSAs. In the second stage, a contingency framework would be adopted to examine the synergy of several factors, including the type of team and the outcomes that are important, task-specific and team-specific competencies, and the capability and personality compatibility of members. Group-role analysis, which identifies the nature of group norms, group-specific task roles, maintenance roles, and role interactions, should also be leveraged in the process of identifying the complementary and supplementary needs of the team (Werbel & Johnson, 2001).

Convergent meta-analytic results offer some guidance to practitioners in their quest to staff teams effectively in organizations. Both taskwork and teamwork competencies have been shown to contribute unique variance as predictors of team performance (Bell, 2007). Specifically, multiple meta-analyses have confirmed that teams with smart, conscientious, and agreeable members perform better (Bell, 2007; Peeters et al., 2006; Stewart, 2006). Individual meta-analyses have also found that higher mean levels of expertise (Stewart, 2006) and team collectivism (Bell, 2007) are related to higher team performance. Compared with the range of predictors investigated by researchers (e.g., demographics, personality, attitudes, abilities, experience), practitioners appear to formally consider a narrower subset of variables in team assignments (Donsbach et al., 2009). Although it is recommended that the heterogeneity/homogeneity of member characteristics be explored in team selection (McClough & Rogelberg, 2003), the inconsistency and complexity of current research findings preclude the kind of straightforward prescriptions that appeal to practitioners. Whereas moderated results are attractive to researchers in specifying the conditions under which diversity will aid or hinder team performance, the number of contingencies to be considered significantly complicates the feasibility of interventions to compose teams. However, computer-based systems like CATME and TOPS are promising developments that can incorporate a wide range of individual and team-based factors when composing teams.

To summarize, Mathieu and colleagues (2013) recommend a seven-step process for composing teams, beginning with (1) describing the team (e.g., positions most critical for team success, interdependence levels, member strengths and weaknesses) and (2) clarifying position, team, and organizational requirements. Next, (3) the candidate pool is established, taking into account the eligibility, availability, and constraints of members. Candidates are then (4) assessed in terms of individual and team competencies and (5) tentatively assigned to teams. Finally, the proposed team composition is (6) assessed to ensure that important positions are staffed with high-quality candidates and (7) adjusted as needed.

The legal issues underlying selection for team membership must also be considered. Whereas the legal perspective emphasizes standardization and the importance of evaluating all applicants according to a common set of metrics, the team contingency perspective emphasizes customization and the value of member compatibility as well as skill heterogeneity. Is it legally defensible for an employer to reject a candidate who has the same competencies as other team members and select another candidate with different competencies? What are the legal ramifications when selection for team membership is seen as promotion or special placement? These questions have yet to be fully explored and will likely remain unresolved because significant legal concerns about team placement are uncommon in practice. This is because organizations are generally choosing among employees who have already been selected into the organization, in contrast to the more scrutinized decisions regarding external candidate pools. Thus, organizations with many teams have considerable latitude both to ensure fairness according to legal standards and to place individuals in collectives that maximize team effectiveness.2

CONCLUSIONS

Understanding how to form superior teams is the key to harnessing selection as a tool for improving team performance. Given the importance of teams in many modern organizations, it is surprising that the state of the science and practice in team selection has not advanced further. Although there is no shortage of variables that have been hypothesized to affect team performance, specific studies validating predictors of team effectiveness in a selection context are relatively rare. However, computer-based tools (e.g., CATME and TOPS) have begun to offer greater sophistication and precision in composing teams by considering a range of competencies, task features, and constraints (Donsbach et al., 2009; Layton et al., 2010; Millhiser et al., 2011). Nevertheless, more work is needed regarding the categories of attributes that are necessary to optimize team functioning—those that are held by individual members and those that transcend individual members and exist at the team level. Although the increasing complexity of modern-day teams makes conducting team research even more challenging than it already is, furthering our understanding of team selection practices may be one of the most fruitful directions for future research, with clear implications for practice.

NOTES

1. For the purpose of this chapter, teams are defined as "collectives who exist to perform organizationally relevant tasks, share one or more common goals, interact socially, exhibit task interdependencies, maintain and manage boundaries, and are embedded in an organizational context that sets boundaries, constrains the team, and influences exchanges with other units in the broader entity" (Kozlowski & Bell, 2003, p. 334).
2. The authors would like to thank Nancy Tippins and Doug Reynolds for this addition.

REFERENCES

Adkins, C. L., Ravlin, E. C., & Meglino, B. M. (1996). Value congruence between co-workers and its relationship to work outcomes. Group & Organization Studies, 21, 439–460. Arthur, W., Jr., Edwards, B. D., Bell, S. T., Villado, A. J., & Bennett, W., Jr. (2005). Team task analysis: Identifying tasks and jobs that are team-based. Human Factors, 47(3), 654–669. Aube, C., Rousseau, V., & Tremblay, S. (2011). Team size and quality of group experience: The more the merrier? Group Dynamics: Theory, Research, and Practice, 15(4), 357–375. Baker, D. P., Salas, E., & Cannon-Bowers, J. (1998). Team task analysis: Lost but hopefully not forgotten. Industrial-Organizational Psychologist, 35, 79–83. Barrick, M. R., Stewart, G. L., Neubert, M. J., & Mount, M. K. (1998). Relating member ability and personality to work-team processes and team effectiveness. Journal of Applied Psychology, 83(3), 377–391. Barry, B., & Stewart, G. L. (1997). Composition, process, and performance in self-managed groups: The role of personality. Journal of Applied Psychology, 82(1), 62–78. Bell, S. T. (2007). Deep-level composition variables as predictors of team performance: A meta-analysis. Journal of Applied Psychology, 92(3), 595–615. Bell, S. T., Villado, A. J., Lukasik, M. A., Belau, L., & Briggs, A. L. (2011). Getting specific about demographic diversity variable and team performance relationships: A meta-analysis. Journal of Management, 37, 709–743. Blickensderfer, E., Cannon-Bowers, J. A., Salas, E., & Baker, D. P. (2000). Analyzing knowledge requirements in team tasks. In J. M. Schraagen, S. F. Chipman, & V. L. Shalin (Eds.), Cognitive task analysis (pp. 431–450). Philadelphia, PA: Lawrence Erlbaum.


Bowers, C. A., Morgan, B. B., Jr., Salas, E., & Prince, C. (1993). Assessment of coordination demand for aircrew coordination training. Military Psychology, 5, 95–112. Bowers, C. A., Pharmer, J. A., & Salas, E. (2000). When member homogeneity is needed in work teams: A meta-analysis. Small Group Research, 31(3), 305–327. Burch, G. S. J., & Anderson, N. (2004). Measuring person-team fit: Development and validation of the team selection inventory. Journal of Managerial Psychology, 19(4), 406–426. Burke, C. S., Stagl, K. C., Klein, C., Goodwin, G. F., Salas, E., & Halpin, S. (2006). What type of leadership behaviors are functional in teams? A meta-analysis. The Leadership Quarterly, 17, 288–307. Cannon-Bowers, J. A., Bowers, C. A., & Sanchez, A. (2007). Using synthetic learning environments to train teams. In V. I. Sessa & M. London (Eds.), Work group learning: Understanding, improving, and assessing how groups learn in organizations (pp. 315–347). Mahwah, NJ: Lawrence Erlbaum. Cannon-Bowers, J. A., Salas, E., & Blickensderfer, E. (April 1998). On training crews. In R. J. Klimoski (Chair), When is a work team a crew—and does it matter? Paper presented at the 13th annual meeting of the Society of Industrial and Organizational Psychology, Dallas, TX. Cannon-Bowers, J. A., Tannenbaum, S. I., Salas, E., & Volpe, C. E. (1995). Defining competencies and establishing team training requirements. In R. A. Guzzo & E. Salas (Eds.), Team effectiveness and decision making in organizations (pp. 333–380). San Francisco, CA: Jossey-Bass. Day, E. A., Arthur, W., Miyashiro, B., Edwards, B. D., & Hanson, T. (2004). Criterion-related validity of statistical operationalizations of group general cognitive ability as a function of task type: Comparing the mean, maximum, and minimum. Journal of Applied Social Psychology, 34(7), 1521–1549. De Cooman, R., Vantilborgh, T., Bal, M., & Lub, X. (2016). Creating inclusive teams through perceptions of supplementary and complementary person-team fit: Examining the relationship between person-team fit and team effectiveness. Group and Organization Management, 41(3), 310–342. DeRue, D. S., & Morgeson, F. P. (2007). Stability and change in person-team and person-role fit over time: The effects of growth satisfaction, performance, and general self-efficacy. Journal of Applied Psychology, 92(5), 1242–1253. Devine, D. J., & Philips, J. L. (2001). Do smarter teams do better?: A meta-analysis of cognitive ability and team performance. Small Group Research, 32(5), 507–532. Donsbach, J. S., Tannenbaum, S. I., Alliger, G. M., Mathieu, J. E., Salas, E., Goodwin, G. F., & Metcalf, K. A. (2009). Team composition optimization: The Team Optimal Profile System (TOPS). Technical report 1249 for U.S. Army Research Institute. Driskell, J. E., & Salas, E. (1992). Collective behavior and team performance. Human Factors, 34, 277–288. Druskat, V. U., & Kayes, D. C. (1999). The antecedents of competence: Toward a fine-grained model of self-managing team effectiveness. In E. A. Mannix, M. A. Neale, & R. Wageman (Eds.), Research on managing groups and teams: Groups in context (Vol. 2, pp. 201–231). Greenwich, CT: JAI Press. D'Souza, G., & Colarelli, S. M. (2010). Team member selection decisions for virtual versus face-to-face teams. Computers in Human Behavior, 26, 630–635. Hackman, J. R. (2002). Leading teams: Setting the stage for great performances. Boston, MA: Harvard Business School Press. Harrison, D. A., & Klein, K. J. (2007). What's the difference?
Diversity constructs as separation, variety, or disparity in organizations. The Academy of Management Review, 32(4), 1199–1228. Harrison, D. A., Price, K. H., & Bell, M. P. (1998). Beyond relational demography: Time and the effects of surface-and deep-level diversity on work group cohesion. Academy of Management Journal, 41, 96–107. Harrison, D. A., Price, K. H., Gavin, J. H., & Florey, A. T. (2002). Time, teams, and task performance: Changing effects of surface-and deep-level diversity on group functioning. Academy of Management Journal, 45(5), 1029–1045. Helmreich, R. L., & Foushee, H. C. (1993). Why crew resource management? Empirical and theoretical bases of human factors in training and aviation. In E. Wiener, B. G. Kanki, & R. L. Helmreich (Eds.), Cockpit resource management (pp. 3–45). San Diego, CA: Academic Press. Hirschfeld, R. R., Jordan, M. H., Feild, H. S., Giles, W. F., & Armenakis, A. A. (2006). Becoming team players: Team members’ mastery of teamwork knowledge as a predictor of team task proficiency and observed teamwork effectiveness. Journal of Applied Psychology, 91(2), 467–474. Hobman, E. V., Bordia, P., & Gallois, C. (2004). Perceived dissimilarity and work group involvement: The moderating effects of group openness to diversity. Group and Organization Management, 29, 560–587. Hollenbeck, J. R. (2000). A structural approach to external and internal person-team fit. Applied Psychology: An International Review, 49(3), 534–549. Hollenbeck, J. R., Beersma, B., & Schouten, M. E. (2012). Beyond team types and taxonomies: A dimensional scaling conceptualization for team description. Academy of Management Review, 37(1), 82–106. Horowitz, S. K., & Horowitz, I. B. (2007). The effects of team diversity on team outcomes: A meta-analytic review of team demography. Journal of Management, 33(6), 987–1015.


Hrivnak, G. A. (2013). CATME smarter teamwork. Academy of Management Learning & Education, 12. Huffcutt, A. L., Conway, J. M., Roth, P. L., & Stone, N. J. (2001). Identification and meta-analytic assessment of psychological constructs measured in employment interviews. Journal of Applied Psychology, 86(5), 897–913. Humphrey, S. E., Hollenbeck, J. R., Meyer, C. J., & Ilgen, D. R. (2007). Trait configurations in self-managed teams: A conceptual examination of the use of seeding for maximizing and minimizing trait variance in teams. Journal of Applied Psychology, 92(3), 885–892. Humphrey, S. E., Hollenbeck, J. R., Meyer, C. J., & Ilgen, D. R. (2011). Personality configurations in self-managed teams: A natural experiment on the effects of maximizing and minimizing variance in traits. Journal of Applied Social Psychology, 41(7), 1701–1732. Joshi, A., & Roh, H. (2009). The role of context in work team diversity research: A meta-analytic review. Academy of Management Journal, 52(1), 599–627. Keck, S., & Tushman, M. (1993). Environmental and organizational context and executive team structure. Academy of Management Journal, 36(6), 1314–1344. Kilduff, M., Angelmar, R., & Mehra, A. (January 2000). Top management-team diversity and firm performance: Examining the role of cognitions. Organization Science, 11, 21–34. Kirksey, J., & Zawacki, R. A. (1994). Assessment center helps find team-oriented candidates. Personnel Journal, 73(5), 92. Kozlowski, S. W. J., & Klein, K. J. (2000). A multilevel approach to theory and research in organizations: Contextual, temporal, and emergent processes. In S. W. Kozlowski & K. J. Klein (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 3–90). London, England: Wiley. Kristof-Brown, A., Barrick, M. R., & Stevens, C. K. (2005a). When opposites attract: A multi-sample demonstration of complementary person-team fit on extraversion. Journal of Personality, 73, 935–958. Kristof-Brown, A. L., Jansen, K. J., & Colbert, A. E. (2002). A policy-capturing study of the simultaneous effects of fit with jobs, groups, and organizations. Journal of Applied Psychology, 87(5), 985–993. Kristof-Brown, A. L., Seong, J. Y., Degeest, D. S., Park, W.-W., & Hong, D-S. (2014). Collective fit perceptions: A multilevel investigation of person-group fit with individual-level and team-level outcomes. Journal of Organizational Behavior, 35, 969–989. Kristof-Brown, A. L., & Stevens, C. K. (2001). Goal congruence in project teams: Does the fit between members' personal mastery and performance goals matter? Journal of Applied Psychology, 86(6), 1083–1095. Kristof-Brown, A. L., Zimmerman, R. D., & Johnson, E. C. (2005b). Consequences of individuals' fit at work: A meta-analysis of person-job, person-organization, person-group, and person-supervisor fit. Personnel Psychology, 58(2), 281–342. Langfred, C. (2007). The downside of self-management: A longitudinal study of the effects of conflict on trust, autonomy, and task interdependence in self-managing teams. Academy of Management Journal, 50, 885–900. Lau, D. C., & Murnighan, J. K. (1998). Demographic diversity and faultlines: The compositional dynamics of organizational groups. The Academy of Management Review, 23(2), 325–340. Layton, R. A., Loughry, M. L., Ohland, M. W., & Ricco, G. D. (2010). Design and validation of a web-based system for assigning members to teams using instructor-specified criteria. Advances in Engineering Education, 2, 1–28.
Leach, D. J., Wall, T. D., Rogelberg, S. G., & Jackson, P. R. (2005). Team autonomy, performance, and member job strain: Uncovering the teamwork KSA link. Applied Psychology: An International Review, 54, 1–24. Lepine, J. A., Hollenbeck, J. R., Ilgen, D. R., & Hedlund, J. (1997). Effects of individual differences on the performance of hierarchical decision-making teams: Much more than g. Journal of Applied Psychology, 82(5), 803–811. Levine, J. M., & Moreland, R. L. (1990). Progress in small group research. Annual Review of Psychology, 41(1), 585–634. Lorenzet, S. J., Eddy, E. R., & Klein, G. D. (2003). The importance of team task analysis for team human resource management. In M. M. Beyerlein, D. A. Johnson, & S. T. Beyerlein (Eds.), Team-based organizing (Vol. 9, pp. 113–145). New York, NY: Elsevier Science. Mann, R. D. (1959). A review of the relationships between personality and performance in small groups. Psychological Bulletin, 56(4), 241–270. Marks, M. A., & Panzer, F. J. (2004). The influence of team monitoring on team processes and performance. Human Performance, 17, 25–41. Mathieu, J., Maynard, M. T., Rapp, T., & Gilson, L. (2008). Team effectiveness 1997–2007: A review of recent advancements and a glimpse into the future. Journal of Management, 34, 410–476. Mathieu, J. E., Tannenbaum, S. I., Donsbach, J. S., & Alliger, G. M. (2013). Achieving optimal team composition for success. In E. Salas (Ed.), Developing and enhancing high-performance teams: Evidence-based practices and advice (pp. 520–551). San Francisco: Jossey-Bass.


Mathieu, J. E., Tannenbaum, S. I., Donsbach, J. S., & Alliger, G. M. (2014). A review and integration of team composition models: Moving toward a dynamic and temporal framework. Journal of Management, 40(1), 130–160. Mathieu, J. E., Tannenbaum, S. I., Kukenberger, M. R., Donsbach, J. S., & Alliger, G. M. (2015). Team role experience and orientation: A measure and tests of construct validity. Group and Organization Management, 40(1), 6–34. McClough, A. C., & Rogelberg, S. G. (2003). Selection in teams: An exploration of the teamwork knowledge, skills, and ability test. International Journal of Selection and Assessment, 11, 56–66. McGrath, J. E. (1984). Groups: Interaction and performance. Englewood Cliffs, NJ: Prentice-Hall. Millhiser, W. P., Coen, C. A., & Solow, D. (2011). Understanding the role of worker interdependence in team selection. Organization Science, 22(3), 772–787. Mohammed, S., & Angell, L. C. (2003). Personality heterogeneity in teams: Which differences make a difference for team performance? Small Group Research, 34(6), 651–677. Mohammed, S., & Angell, L. (2004). Surface- and deep-level diversity in workgroups: Examining the moderating effects of team orientation and team process on relationship conflict. Journal of Organizational Behavior, 25, 1015–1039. Mohammed, S., & Nadkarni, S. (2011). Temporal diversity and team performance: The moderating role of temporal leadership. Academy of Management Journal, 54(3), 489–508. Mohammed, S., & Nadkarni, S. (2014). Are we all on the same temporal page? The moderating effects of temporal team cognition on the polychronicity diversity-team performance relationship. Journal of Applied Psychology, 99(3), 404–422. Morgan, B. B., & Lassiter, D. L. (1992). Team composition and staffing. In R. W. Swezey & E. Salas (Eds.), Teams: Their training and performance (pp. 75–100). Westport, CT: Ablex. Morgeson, F. P., Reider, M. H., & Campion, M. A. (2005). Selecting individuals in team settings: The importance of social skills, personality characteristics, and teamwork knowledge. Personnel Psychology, 58(3), 583–611. Mount, M. K., Barrick, M. R., & Stewart, G. L. (1998). Five factor model of personality and performance in jobs involving interpersonal interactions. Human Performance, 11, 145–165. Mumford, T. V., van Iddekinge, C. H., Morgeson, F. P., & Campion, M. A. (2008). The team role test: Development and validation of a team role knowledge situational judgment test. Journal of Applied Psychology, 93, 250–267. Munyon, T. P., Summers, J. K., & Ferris, G. R. (2011). Team staffing modes in organizations: Strategic considerations on individual and cluster hiring approaches. Human Resource Management Review, 21, 228–242. Neuman, G. A., Wagner, S. H., & Christiansen, N. D. (1999). The relationship between work-team personality composition and the job performance of teams. Group and Organization Management, 24, 28–45. Neuman, G. A., & Wright, J. (1999). Team effectiveness: Beyond skills and cognitive ability. Journal of Applied Psychology, 84(3), 376–389. Oh, I-S., Guay, R. P., Kim, K., Harold, C. M., Lee, J-H., Heo, C-G., & Shin, K-H. (2014). Fit happens globally: A meta-analytic comparison of the relationships of person-environment fit dimensions with work attitudes and performance across East Asia, Europe, and North America. Personnel Psychology, 67, 99–152. O'Neill, T. A., Goffin, R. D., & Gellatly, I. R. (2012).
The knowledge, skill, and ability requirements for teamwork: Revisiting the teamwork-KSA Test’s validity. International Journal of Selection and Assessment, 20(1), 36–52. Pearsall, M. J., & Ellis, A. P. J. (2006). The effects of critical team member assertiveness on team performance and satisfaction. Journal of Management, 32, 575–594. Peeters, M. A. G., van Tuijl, H., Rutte, C. G., & Reymen, I. (2006). Personality and team performance: A meta-analysis. European Journal of Personality, 20(5), 377–396. Ployhart, R. E. (2004). Organizational staffing: A multilevel review, synthesis, and model. Research in Personnel and Human Resources Management, 23, 121–176. Price, K., Harrison, D., & Gavin, J. (2006). Withholding inputs in team contexts: Member composition, interaction processes, evaluation structure, and social loafing. Journal of Applied Psychology, 91(6), 1375–1384. Saavedra, R., Earley, P. C., & van Dyne, L. (1993). Complex interdependence in task performing groups. Journal of Applied Psychology, 78, 61–72. Salas, E., Burke, C. S., Fowlkes, J. E., & Priest, H. A. (2004). On measuring teamwork skills. In J. Thomas (Ed.), Comprehensive handbook of psychological assessment. Vol. 4: Industrial/organizational assessment (pp. 427– 442). Hoboken, NJ: Wiley. Salas, E., Nichols, D. R., & Driskell, J. E. (2007). Testing three team training strategies in intact teams: A meta-analysis. Small Group Research, 38, 471–488. Schmidt, F. L. (2002). The role of general cognitive ability and job performance: Why there cannot be a debate. Human Performance, 15, 187–210.


Schneider, B., Smith, D. B., & Sipe, W. P. (2000). Personnel selection psychology. In S. W. Kozlowski & K. J. Klein (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 91–120). London, England: Wiley. Seong, J. Y., Kristof-Brown, A. L., Park, W-W., Hong, D-S., & Shin, Y. (2015). Person-group fit: Diversity antecedents, proximal outcomes, and performance at the group level. Journal of Management, 41, 1184–1213. Smith-Jentsch, K. A., Salas, E., & Baker, D. P. (1996). Training team performance-related assertiveness. Personnel Psychology, 49, 909–936. Staats, B. R., Milkman, K. L., & Fox, C. R. (2012). The team scaling fallacy: Underestimating the declining efficiency of larger teams. Organizational Behavior and Human Decision Processes, 118, 132–142. Stark, E. M., Shaw, J. D., & Duffy, M. K. (2007). Preference for group work, winning orientation, and social loafing behavior in groups. Group and Organization Management, 32(6), 699–723. Steiner, I. D. (1972). Group processes and productivity. New York, NY: Academic Press. Stevens, M. J., & Campion, M. A. (1994). The knowledge, skill, and ability requirements for teamwork: Implications for human resource management. Journal of Management, 20(2), 503–530. Stevens, M. J., & Campion, M. A. (1999). Staffing work teams: Development and validation of a selection test for teamwork settings. Journal of Management, 25(2), 207–228. Stewart, G. L. (2006). A meta-analytic review of relationships between team design features and team performance. Journal of Management, 32(1), 29–54. Tannenbaum, S. I., Mathieu, J. E., Salas, E., & Cohen, D. (2012). Teams are changing: Are research and practice evolving fast enough? Industrial and Organizational Psychology: Perspectives on Science and Practice, 5(1), 2–24. Tasa, K., Taggar, S., & Seijts, G. H. (2007). The development of collective efficacy in teams: A multilevel and longitudinal perspective. Journal of Applied Psychology, 91(1), 17–27. Thatcher, S. M. B., & Patel, P. C. (2011). Demographic faultlines: A meta-analysis of the literature. Journal of Applied Psychology, 96(6), 1119–1139. Tziner, A., & Eden, D. (1985). Effects of crew composition on crew performance: Does the whole equal the sum of its parts? Journal of Applied Psychology, 70(1), 85–93. Van Dijk, H., van Engen, M. L., & van Knippenberg, D. (2012). Defying conventional wisdom: A meta-analytical examination of the differences between demographic and job-related diversity relationships with performance. Organizational Behavior and Human Decision Processes, 119, 38–53. van Knippenberg, D., De Dreu, C. K. W., & Homan, A. C. (2004). Work group diversity and group performance: An integrative model and research agenda. Journal of Applied Psychology, 89, 1008–1022. van Knippenberg, D., & Schippers, M. C. (2007). Work group diversity. Annual Review of Psychology, 58, 515–541. van Welie, M., & van der Veer, G. C. (2003). Pattern languages in interaction design: Structure and organization. Proceedings of Interact, 3, 1–5. Webber, S. S., & Donahue, L. M. (2001). Impact of highly and less job-related diversity on work group cohesion and performance: A meta-analysis. Journal of Management, 27(2), 141–162. Wellins, R. S., Byham, W. C., & Dixon, G. R. (1994). Inside teams: How 20 world-class organizations are winning through teamwork. San Francisco, CA: Jossey-Bass. Wellins, R. S., Byham, W. C., & Wilson, J. M. (1991).
Empowered teams: Creating self-directed work groups that improve quality, productivity, and participation. San Francisco, CA: Jossey-Bass. Werbel, J. D., & Johnson, D. J. (2001). The use of person-group fit for employment selection: A missing link in person-environment fit. Human Resource Management, 40(3), 227–240. Wheelan, S. A. (2009). Group size, group development, and group productivity. Small Group Research, 40(2), 247–262. Whiting, S. W., & Maynes, T. D. (2016). Selecting team players: Considering the impact of contextual performance and workplace deviance on selection decisions in the National Football League. Journal of Applied Psychology, 101(4), 484–497. Zaccaro, S. J., & DiRosa, G. A. (2012). The process of team staffing: A review of relevant studies. International Review of Industrial and Organizational Psychology, 27, 197–229.


38
SELECTING LEADERS
Executives and High-Potentials
GEORGE C. THORNTON III, STEFANIE K. JOHNSON, AND ALLAN H. CHURCH

Despite the importance of leaders at the senior-most levels of organizations, there has been relatively little research on executive selection. There is a large literature on leadership concepts (Dinh, Lord, Gardner, Meuser, Liden, & Hu, 2014) and leadership development (Day, Fleenor, Atwater, Sturm, & McKee, 2014) but surprisingly little empirical research on selecting leaders into top-level positions. Although selection in general is a major area of practice in industrial-organizational (I-O) psychology, and executive selection is extremely important in any organization, recent books on assessment and selection (Geisinger et al., 2013; Scott & Reynolds, 2010) provide little or no guidance on the selection of executive leaders. A notable exception is Howard and Thomas (2010), who compare factors distinguishing assessment of lower-, mid-, and executive-level managers and describe systems for designing and implementing assessment programs. There are many reasons for the dearth of research on executive selection (e.g., small samples, proprietary concerns, organization-specific requirements), which we will address in subsequent sections.

Much of our understanding of executive selection over the years has come from applied research, surveys of practice, and experience. For example, Hollenbeck (1994) summarized observations from his experience and eight books on the selection of chief executive officers (CEOs). Sessa and Taylor (2000) summarized results of a series of studies conducted at the Center for Creative Leadership in the 1990s using simulations and surveys of executives. More recently, Church and Rotolo (2013) reported on the practices of executive assessment for decision making among 84 companies that do assessment, development, and selection well. They found that senior executives were the most frequent target of assessment (90%) in the major corporations studied. Clearly, the practice of executive selection remains quite important to organizational success.

The purpose of this chapter is to review research and practice in the selection of executive leaders and those who have high potential for these positions, and to comment on these developments over the past decades on the basis of our observations and results of surveys of organizational practices. We begin the chapter by defining our focal group of executives and high-potentials. To clarify our focus, we make distinctions among leader behaviors, leaders, and management. Next, we describe a number of the attributes that are important for the effectiveness of top leaders and review techniques to assess these attributes in high-potentials and executive candidates. Then, we describe the importance of an integrated process of recruitment, development, and management of high-potentials and executives in the context of several factors. The process involves a multiyear, multistage program of assessing and developing leaders, and it must take into account the performance and potential of those leaders as well as the organization's talent management strategy, including the outcomes expected, follower characteristics, diversity, and country culture. We include discussion of the increasingly significant role of the board of directors in C-suite decision making (Charan, Carey, & Useem, 2014) and the importance of transparency of processes and results. Finally, we discuss what may be the most difficult and challenging issue: an evaluation of whether leader selection methods work. We conclude with a review of past and present executive selection research and a discussion of the roles (actual and potential) of I-O psychologists in executive selection.

EXECUTIVES AND HIGH-POTENTIALS: WHO ARE THEY AND WHAT DO THEY DO?

Definitions of "executives" and "high-potentials" are highly variable and elusive. By "executives" we mean those at the top of the hierarchy in organizations, those who carry responsibility for major organizational units, or those who occupy key jobs that are essential to the purpose of the organization (e.g., chief scientist, marketing officer). In many large organizations these represent the top 200–300 key roles and are the focus of core talent management and succession planning efforts (Church & Waclawski, 2010). In publicly traded companies, executives are individuals in the top 10–15 C-suite roles running various business units; they are often Section 16 Executive Officers as defined by the Securities Exchange Act of 1934. Executives, defined by level in the organization and by participation in the company's executive compensation plan, generally make up less than 2% of the total employee population in large organizations. In smaller organizations, the half-dozen or so executives are a much smaller percentage of the employee population. Of these top-level leaders, few are women and minorities. For example, only 19 Fortune 500 companies are run by people of color and 21 are run by women (Catalyst Organization, 2013; Diversity Inc., 2013).

In general, "executive" refers fairly exclusively to those at the top levels; by contrast, "high-potential" refers inclusively to one deemed to be capable, with the right development, of occupying a senior executive position at some time in the future. Thus, we include mid-level and lower-level managers who may have long-range potential in the pool of high-potentials. High-potentials are typically high-performing managers who demonstrate the capabilities required for future success (Church & Silzer, 2014; Ready, Conger, & Hill, 2010). Although a track record of successful performance is important, it is only a leading indicator of potential, as the popularity of the nine-box model crossing three levels of performance with three levels of potential makes clear. At PepsiCo, a high-potential is defined as "A highly valuable contributor with a great deal of stretch capability within the organization. Such individuals are typically promoted to higher levels beyond their current role, and a select few can be seen as leading the organization at the senior levels" (Church & Waclawski, 2010, p. 627). In short, high-potentials are those thought to be able to reach senior executive jobs. Depending on the resources devoted by the organization to leadership development, high-potentials may be identified quite early in their careers.

The requirements of high potential for executive jobs are often organization-specific. Though clearly some aspects of leadership potential are universal, the term may be used more narrowly than general leadership potential. This is because the organization is answering the question "potential for what?" in its own context (i.e., a specific role), which is very different from focusing on the identification of raw potential for general pipeline development at lower levels (Church & Silzer, 2014). In this context, then, classifying employees as high-potential grows out of an organization's efforts to (a) ensure continuity in its supply of executives through talent management and succession planning, (b) develop leaders within its culture, (c) respond to an increase in the number of retirements, and (d) capitalize on the knowledge it has about internal staff members.
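The nine-box model referenced above is simply a 3 x 3 grid crossing rated performance with rated potential. The hypothetical sketch below shows one way such ratings might be mapped to grid cells; the cell labels and decision rules are illustrative assumptions, not a standard published taxonomy or any particular organization's practice.

```python
# Hypothetical nine-box sketch: three levels of current performance crossed
# with three levels of assessed potential. Labels are illustrative only.
LEVELS = ("low", "moderate", "high")

def nine_box(performance, potential):
    """Map 1-3 ratings of performance and potential to a nine-box cell."""
    perf, pot = LEVELS[performance - 1], LEVELS[potential - 1]
    if perf == "high" and pot == "high":
        label = "high-potential (ready for stretch roles)"
    elif pot == "high":
        label = "emerging potential (develop performance)"
    elif perf == "high":
        label = "strong contributor in current role"
    else:
        label = "monitor / develop"
    return f"{perf} performance x {pot} potential: {label}"

print(nine_box(performance=3, potential=3))
print(nine_box(performance=3, potential=1))
```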
Executive jobs have changed dramatically since the 1950s, when management was associated with large, stable organizations and consisted of the classic functions of planning, organizing, and controlling. Hemphill's classic studies of executive behavior arrived at 10 dimensions of management at the executive level (Campbell, Dunnette, Lawler, & Weick, 1970) that only faintly resemble the way executive work is described today. Now, those classic management functions must be augmented with complex, diverse, and situation-specific leadership behaviors required by dynamic, global, competitive business environments. Bass (1990) captured the essence of the distinction between manager and leader: "Leaders manage and managers lead, but the two activities are not synonymous" (p. 383). To manage, the executive carries out the classic functions; to lead, the executive behaves in ways that inspire and influence the behavior of others. Members throughout the organization may carry out leadership behaviors, and a full description of a modern understanding of leadership behaviors is beyond the scope of this chapter (see Salancik, Calder, Rowland, Leblebici, & Conway, 1975, for the classic distinction between leadership and leaders).

Today, the simplest answer to the question "What do executives do?" may be "Whatever it takes." Lengthy executive position descriptions have given way to outcome-oriented objectives relating to what the executive is expected to contribute to the strategic mission of the organization. This is one of the reasons why the concept of critical experiences, first introduced in the late 1980s (McCall, Lombardo, & Morrison, 1988), has become so important in the development and selection of senior executives today. Many major corporations are now basing their talent management systems on the types of experiences, learnings, and outcomes that leaders achieve to determine their future succession paths (McCauley & McCall, 2014), and as a result career paths are far more dynamic and organic compared to the career models of the 1970s through 1990s. In addition, organizations, positions, and executives are seen as dynamic and rapidly changing. Executives are expected to change the jobs they are in and are expected to be changed by those jobs. The higher the level of the executive position, the more the incumbents shape the position to their preferences, talents, and abilities to advance the organization's mission. The key question about selecting an executive has changed from simply "What must the executive do?" to the more complex "What must get done and what does it take to get that done?" The answer is typically a list of competencies or human attributes believed to underlie executive success. The answer to the question "Potential for what?" provides the finishing touches to the overall framework of potential for an executive selection process.

EXECUTIVE COMPETENCIES, DIMENSIONS, AND ATTRIBUTES

Organizations seek to identify and articulate the key competencies, dimensions, and attributes needed for executive success in their given culture or context. These competencies, often expressed in terms of clusters of behaviors, are used in various assessments and feedback interventions (such as 360-degree feedback, structured interviews, and simulations) to both assess and develop readiness in high-potential talent. Not only do they specify what is needed in the organization, but they also communicate what is perceived to be important by senior leadership (often endorsed by the CEO). This is why many organization development (OD) practitioners advocate a custom approach to designing leadership frameworks for organizational change (Church, Waclawski, & Burke, 2001). Surprisingly, the vast majority of these competencies are consistent from one organization to the next (e.g., Church, 2014; Schippmann, 2010). What differs is the relative emphasis placed on each (e.g., inclusion, innovation, inspiring others).

While many practice-based applications exist today, very few theoretical approaches capture the full range of characteristics. One such model, the Leadership Potential BluePrint, introduced by Silzer and Church (2009), has gained significant traction in the field in the past few years (see Figure 38.1). Based on a comprehensive review and synthesis of the psychological literature and both internal and external talent management efforts, the BluePrint represents a comprehensive approach for framing the identification and prediction of future leadership success. It is currently used in assessment and development efforts at several large organizations, including Citibank, Eli Lilly, and PepsiCo (Church & Silzer, 2014). A basic assumption of the BluePrint is that potential is a multidimensional construct consisting of a mixture of traits, specific capabilities, knowledge, and skills that contribute individually and collectively to long-term leadership success in organizations. Conceptually, these attributes consist of three sets of dimensions: foundational, growth, and career. They are layered in the model from more stable traits to more developable skills and capabilities in leaders.


FIGURE 38.1 The Leadership Potential BluePrint. Source: Adapted from Church & Silzer (2014) and Silzer & Church (2009). The figure layers three sets of dimensions: Foundational Dimensions (Personality: traits, preferences, derailers; Cognitive Abilities: intelligence, strategic, conceptual); Growth Dimensions (Learning: learning agility, openness, adaptability; Motivation: ambition, energy and drive, career aspirations); Career Dimensions (Leadership Competencies: inspiring others, developing teams, global mindset; Functional/Technical: operational excellence, business acumen, industry knowledge).
Foundational Dimensions represent the most basic and enduring attributes of an individual. They are largely genetically determined and/or shaped early in life. They include two core factors: personality (e.g., traits, preferences) and cognitive capabilities (e.g., intelligence, strategic thinking). They are generally quite stable throughout one's adult life and career. High-potentials and successful executives are seen as smarter, more strategic thinkers, with a constellation of personality factors related to strong interpersonal skills.

Growth Dimensions reflect an individual's ability and orientation toward development. They include learning (e.g., learning ability/agility, openness, adaptability) and individual motivation (e.g., ambition, drive, and career aspirations). Here, high-potentials are broad and fast learners with high ambition who seek out and apply learnings from prior developmental experiences.

Career Dimensions are perhaps the most widely targeted of the BluePrint areas in executive assessment and selection. The two core factors here are leadership (e.g., inspiring others, developing teams, global mindset) and functional and technical skills (e.g., operational excellence, business and industry knowledge). High-potentials and executives are seen as possessing the right mix of leadership capability to set the vision and strategy, while also having the breadth of knowledge to lead in a complex global business environment.

Whether or not one believes that competencies are unique to a given organization, they represent a universal set of characteristics in the language of many managers. They may or may not even be the right areas to focus on for development (e.g., Hollenbeck, McCall, & Silzer, 2006; Schippmann, 2010). The key point here is that the BluePrint ensures that a relevant and all-encompassing set of capabilities has been identified and articulated for the purpose of answering the question: Potential for what? Beyond that, the target of assessment becomes organizationally specific. Although academic reviews of the BluePrint have suggested that additional areas may need further highlighting, such as dark-side personality constructs (e.g., Dalal & Nolan, 2009) or the importance of organizational culture and other contextual factors (e.g., Dominick & Gabriel, 2009; Heslin, 2009), the framework has resonated with many in the field and in practice. Aside from being embedded in various talent management efforts in large organizations (Church & Silzer, 2014), it has formed the underlying basis of recent consulting approaches (e.g., Aon-Hewitt, 2013), as well as scholar-practitioner models and reviews of potential in various publications (e.g., MacRae & Furnham, 2014; Piip & Harris, 2014). In addition, it was recently featured in a white paper on leadership development (Dugan & O'Shea, 2014) published jointly by the Society for Human Resource Management (SHRM) and the Society for Industrial and Organizational Psychology (SIOP).

In short, the Leadership Potential BluePrint is a framework covering the landscape of factors contributing to high-potential and executive success, and as such it can be used to review an entire talent management agenda (Church, 2014). The framework also highlights the need for a multitrait, multimethod approach to assessment and development in any high-potential process (Church & Rotolo, 2013). Recent research on 80 companies excelling in assessing, developing, and selecting executives supports these points, noting that dimensions in the BluePrint accounted for the vast majority of the content (ranging from 50–75%) being assessed today in major corporations (Church, Rotolo, Ginther, & Levine, 2015). Thus, as a framework it captures the broad realm of attributes useful for high-potential assessment and development.

Along with determining the nature of potential broadly, organizations also face the challenge of deciding whether leadership differs at various levels. Since the early work of Katz (1955) and Mann (1965), researchers have proposed that the roles of leaders differ with leader level: technical skills are thought to matter most at low levels, interpersonal skills at middle levels, and conceptual skills at higher levels. Indeed, research on derailment (Hogan, Hogan, & Kaiser, 2010), advancement (Freedman, 2005; Kates & Downey, 2005), and decision making (Brousseau, Driver, Hourihan, & Larsson, 2006) suggests that performance requirements change with level. In contrast, the Leadership Strataplex model suggests that high-level leaders do not lose the need for previous skill levels (technical, interpersonal), but that new skills are needed as leaders progress (Mumford, Campion, & Morgeson, 2007). Consistent with both schools of thought, Kaiser and Craig (2011) found that top-level leaders tended to be high on all leadership skills, but that the relationship between different skills and performance differed by level. For example, the job complexity of the executive role makes learning agility and empowering leadership particularly important, whereas the interpersonal demands of the middle-manager role make supportive leadership and a lack of abrasiveness most important.

In the next section, we discuss some of the research as it applies to key attributes associated with successful executives and high-potential leaders. Scores of human attributes have been associated with effective leadership, ranging literally from "a to z" (Bass, 1990). We will focus on those that are the most enduring, conceptually distinct, and currently used in practice today: cognitive abilities, personality attributes, and learning ability. Although functional and technical skills are important, these tend to be more domain- and organization-specific, and therefore are less generalizable for this discussion.

Cognitive Abilities

Executives must have a fairly high level of intelligence to run complex organizations. The well-documented relationship of job complexity to cognitive abilities (Ones, Viswesvaran, & Dilchert, 2005) suggests that intelligence is important for executive success, but the type of intelligence and the relationship of intelligence to leader effectiveness have been debated for decades. Long ago, Korman (1968) concluded that verbal intelligence predicts the performance of supervisors but not of higher-level managers. Cavazotte, Moreno, and Hickmann (2012) found that intelligence had stronger indirect effects on managerial performance than all five of the Big Five personality variables. Menkes (2005) found that cognitive skills, such as analyzing and anticipating business trends, differentiate "star" executives from their peers. Crystallized intelligence (i.e., knowledge of facts) may be more important at lower levels, whereas fluid intelligence (akin to creativity) is important at executive levels. Although a certain (probably considerable) amount of cognitive ability is important, additional increments may not be related to executive performance. In fact, some evidence suggests a curvilinear relationship: both lower and higher levels of intelligence may be detrimental to leadership success (Bass, 1990). The search for elements of cognitive ability that are important to executive leadership has led to the specification of different types of intelligence. For example, Dries and Pepermans (2012) suggest that cognitive ability is part of a greater construct that is central to assessing leadership, which they call analytical skills (e.g., intellectual curiosity, decision making, problem solving, strategic insight). Although at least a substantial amount of some form of cognitive ability is important for executive performance, the use of typical cognitive tests may be problematic, as described later in this chapter.



Personality

The Five-Factor Model ("Big Five") has become one of the most widely used and popular conceptualizations of personality (e.g., McCrae & Costa, 1989). It includes conscientiousness, agreeableness, openness to experience, neuroticism (sometimes called anxiety), and extraversion. Variables in the model, especially extraversion and conscientiousness, have been associated with leadership effectiveness (Judge, Bono, Ilies, & Gerhardt, 2002). Personality traits may be particularly predictive of success for top-level leaders given the amount of autonomy that is characteristic of such positions (Barrick & Mount, 1991). Indeed, researchers at companies such as Sears (Bentz, 1990) have demonstrated the importance of personality to top-level leaders in organizations. For example, Colbert, Barrick, and Bradley (2014) found that CEO conscientiousness and top-management team conscientiousness were related to organizational performance. On the flip side, Hogan and Hogan (2001) suggested that "dark side" personality traits (e.g., paranoia and passive-aggressiveness) can also be used to predict leadership failure. In fact, many executive coaches and organizational assessment programs have embraced the "derailer" concept to the point that they find these traits more useful than positive personality dimensions for development purposes. This is the premise of the book Why CEOs Fail (Dotlich & Cairo, 2003), which, along with the Hogan assessment suite, has launched a trend in the industry. The popularity of the derailer approach may be a function of several factors (e.g., dark-side characteristics tend to manifest themselves under stress, and they can be easily identified). In addition, their negative effects may be mitigated more easily than core personality traits through behavioral interventions and adaptations in the work environment. For example, it is far easier to coach an executive to be less excitable during times of stress than it is to help him or her become less anxious in general. Curiously, there is also evidence that extremely low levels of the dark-side leader personality variables are associated with ineffective leadership (Kaiser, LeBreton, & Hogan, 2015).

Learning Ability

The ability to learn and then adapt one's leadership is a complex and controversial competency. The job complexity and changing nature of the executive role make learning ability particularly important (Kaiser & Craig, 2011). There is no agreed-upon definition of learning ability, and there is controversy around the related construct of learning agility; these constructs may not be distinguishable from cognitive ability, or they may be all that is needed for executive potential. Learning ability seems to encompass aspects of motivation, a positive orientation toward learning, and flexibility in thinking. For example, Maurer and Lippstreu (2008) highlight the importance of motivation to develop leadership, specifically, as a driver of leader engagement in learning. Reichard and Johnson (2011) argued that it includes learning goal orientation as it interacts with organizational norms to create motivation to learn. The BluePrint highlights two aspects of growth orientation: learning ability (e.g., what some call learning agility, openness, adaptability, feedback-seeking behavior) and individual motivation (e.g., ambition, drive, career aspirations, and achievement focus). High-potentials are generally characterized as fast learners who are open to feedback and individual development and driven to succeed and advance. Many organizations seek to assess learning ability through a review of relevant background experiences in the screening process.

ASSESSMENT TECHNIQUES

Methods to assess these and other attributes range from using internal performance appraisal data to elaborate testing and assessment. In the following section, we summarize research on several different types of assessment techniques and, where data exist, discuss their use and validity in executive selection. The strength of the relationship between a specific attribute measured by a given assessment technique and a specific criterion of leadership effectiveness is typically only moderate; the correlation is seldom over .35 to .40. Combinations of multiple measures of multiple attributes often yield correlations in the range of .50 to .60. In other words, approximately 12–36% (i.e., .35² to .60²) of the variation in leadership effectiveness is predicted from one or more human attributes. Readers seeking more information on specific tests, questionnaires, and assessment methods in I-O psychology will find the handbook by Thomas (2004) quite useful.
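For readers who want the arithmetic spelled out, the proportion of criterion variance explained by a predictor is the square of the validity coefficient; applying this to the ballpark coefficients cited above gives:

```latex
% Variance explained is the squared validity (correlation) coefficient.
R^2 = r_{xy}^{2}, \qquad
0.35^{2} \approx 0.12, \quad 0.40^{2} = 0.16, \quad
0.50^{2} = 0.25, \quad 0.60^{2} = 0.36
```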

Cognitive Ability Tests

Although there is little question that executives must have relatively high cognitive ability, there is mixed evidence regarding whether cognitive ability tests are widely used, valid, or useful for selection into the top ranks of the organization. In a survey of 628 staffing directors, Howard, Erker, and Bruce (2007) found that approximately 50% of organizations surveyed used ability tests at high managerial levels. In another survey, Silzer and Church (2010) found that 20% of companies surveyed used cognitive ability to identify high-potentials. More recently, Church and Rotolo (2013) found that 39% and 38% of companies used cognitive ability tests to assess high-potentials and senior executives, respectively. Cognitive ability tests are commonly used in individual psychological assessment (as described later in this chapter). Although cognitive ability tests (in comparison with other measures) have been shown to have some of the highest validity correlations with performance throughout the managerial ranks, some organizations may be reluctant to use them for executive selection for a variety of reasons. Fiedler (1995) pointed out that measures of an individual's cognitive abilities have been only marginally successful in predicting how a leader will perform in a particular job. Furthermore, cognitive ability tests have a potential for adverse impact (Ones, Dilchert, & Viswesvaran, 2012). At the highest executive levels, marked restriction of range in test scores may provide little meaningful differentiation among candidates and may severely attenuate correlation coefficients. Finally, from a practical standpoint, it is difficult to use cognitive test results for developmental purposes at very senior levels. Thus, they can be perceived negatively as part of an assessment battery unless the real intent is to use the results for decision making.

Personality Questionnaires Tett, Jackson, and Rothstein (1991) reported that personality tests, and in particular measures of the Five-Factor Model of personality, were frequently used in selection contexts. More recently, research has demonstrated that personality measures more specific than the broad Big Five factors are more predictive than the Big Five themselves (Pulakos, Borman, & Hough, 2008). Thus, the use of personality assessments seems to be on the rise. In Howard et al.'s (2007) survey, 65% of the organizations had never used a personality inventory for selection. However, Silzer and Church (2010) found that 55% of companies used personality as an indicator of high potential, and then in Church and Rotolo's (2013) survey of organizations, 66% of companies reported using personality to assess high-potentials, and 57% used such methods to assess executives. Considerable controversy exists over the extent to which responses to self-report personality tests are influenced by contaminants such as self-enhancement biases and faking. Contrasting opinions on these matters are expressed in a series of articles in the Autumn and Winter 2007 issues of Personnel Psychology. Morgeson et al. (2007) raised questions about the utility of self-report personality questionnaires for selection purposes, but other authors argued for the utility of personality measures in selection, particularly when more sophisticated weighting schemes and conceptualizations of personality are used (Tett & Christiansen, 2007). Although 85% of Howard et al.'s (2007) respondents had never used integrity tests for executive selection, there is some evidence that they could be useful at the executive level (Ones, Viswesvaran, & Schmidt, 1993). Ones et al.'s (1993) meta-analysis demonstrated that integrity tests were equally good at predicting performance in jobs ranging from low to medium to high complexity, and that integrity tests were better at predicting counterproductive work behaviors for highly complex jobs. In practice, at more senior levels, personality tests tend to be used in conjunction with other tools and focused on both development and decision making. Church and Rotolo (2013), for example, reported that companies in their survey were using on average about four different assessment tools at the same time, including personality measures, in their assessment and development efforts with executives.

Biodata Questionnaires There is little question that one's experiences during childhood, adolescence, education, military training, and initial work up to the time of being considered to have high potential or being chosen for an executive position are relevant to later success. Considerable research has shown that systematic measures of early life experiences can be highly predictive of performance in various jobs (Schmitt & Golubovich, 2013), including supervisory and managerial jobs (Stokes & Cooper, 1995). However, at the executive level, systematically gathering information about individuals' early life experiences can be problematic because the relevance may not be apparent. Formally scored biodata questionnaires are not used frequently, although their prevalence as a selection device may be increasing. In a 2006 survey of staffing directors, 60% reported that they used application forms and 23% (up from 13% in 2004) used a biographical data form (Howard et al., 2007). More recently, Church and Rotolo (2013) reported that biodata questionnaires were the fourth most commonly used tool in assessment and development efforts at 84 top development companies for both high-potentials (43%) and senior executives (43%). At the more senior levels, general biodata may have been replaced by critical experiences (i.e., a pre-set list of experiences seen as necessary for success). Many organizations are focusing on key work experiences that have developed management talent during the previous 10 years and can guide planning for the next 10–15 years to develop that talent further for C-suite roles (McCauley & McCall, 2014). Biodata may also be gathered in interviews.

Multisource Feedback Multisource or 360-degree performance feedback questionnaires are often used for selection and development of high-potentials and for screening of executive candidates. Managers are rated on a questionnaire by their supervisors, subordinates, peers, and themselves, and even by internal customers, external customers, vendors, or suppliers (Bracken, Timmreck, & Church, 2001). The content of the questions may provide assessment of a variety of decision-making, interpersonal, and leadership capabilities. Estimates of the use of 360-degree appraisals range from 12–29% of all organizations (Church, 2000), to 60% of companies assessing executives and 66% assessing high-potentials (Church & Rotolo, 2013), to as many as 90% of Fortune 500 companies (Atwater & Waldman, 1998). Although 360-degree appraisals have primarily been used as a feedback and development tool for managers (Goodstone & Diamante, 1998), Tornow (1993) suggested that the method can also be used for appraisal, selection, and promotion. Indeed, Halverson, Tonidandel, Barlow, and Dipboye (2005) found that ratings and self–other agreement on 360-degree ratings predicted promotion rate throughout one's career in the United States Air Force. Despite the trend toward using 360-degree appraisals for administrative purposes (Bracken & Church, 2013), there is controversy over their application to selection (Toegel & Conger, 2003). Specifically, writers have expressed several concerns: (a) the focal manager has a strong motivation to self-enhance and convey that he or she has been performing well (Craig & Hannum, 2006); (b) raters who know the assessment is for administrative purposes may not wish to negatively impact the focal manager (DeNisi & Kluger, 2000); and (c) employment decisions based on ratings from unknown and minimally trained raters may not be legally defensible. In fact, Morgeson, Mumford, and Campion (2005) noted that one of the leading organizational consultants in the use of 360-degree appraisals, the Center for Creative Leadership, restricts their use to developmental purposes. However, the authors of the Handbook of Multisource Feedback (Bracken et al., 2001) and practitioners in the 360 area (Bracken & Church, 2013) have since made a formal case for using 360-degree feedback for decision making, citing a number of factors that have changed since the original concerns of the 1990s. In support of this position, Church and Rotolo (2013) cite the recent increased use of 360-degree appraisals to assess high-potentials and executives, and Murphy, Cleveland, and Mohler (2001) summarize evidence of their reliability, validity, and meaningfulness.

Assessment Centers The assessment center (AC) method has been used for evaluating executive potential for more than 50 years (Thornton, Rupp, & Hoffman, 2015). Originally validated with longitudinal studies as an indicator of early management potential of men and women (Howard & Bray, 1988), the method has been used to assess executives in numerous industries and countries. The unique aspect of the AC method is its combination of features: overt behavior is observed in multiple simulations of organizational challenges by multiple trained observers, who then integrate their evaluations through consensus discussion, statistical formulas, or a combination of the two. Some ACs involve the consideration of information from other techniques, such as cognitive ability tests, personality questionnaires, multisource feedback, or a background interview. Older and recent surveys show that large numbers of organizations use ACs for selecting executives and high-potentials (Thornton & Krause, 2009). Executive ACs involve dimensions such as global awareness and strategic vision, with exercises calling for strategic decisions such as launching joint ventures, managing the talent pool, or promoting a turnaround. Studies have found that ACs predict senior management potential (Ritchie, 1994) and that an AC added incremental validity over cognitive ability tests in predicting executive success in a public organization in Germany (Krause, Kersting, Heggestad, & Thornton, 2006).

Leadership Questionnaires The reader may be surprised that we have not reviewed leadership behavior and style questionnaires (Clark & Clark, 1990). To be sure, scores of self-report questionnaires have been developed over the years, such as the Leader Behavior Description Questionnaire and the Leadership Opinion Questionnaire. The respondent is typically asked to indicate how often he or she engages in certain behaviors, such as directing the work of subordinates or providing them with support. Although these instruments have been useful in research, in helping individuals gain insight into their leadership styles, and in counseling and training settings, they have not been applied extensively in executive selection. The older questionnaires typically do not cover the broader set of leader characteristics deemed important in recent leadership studies, and they all suffer from the potential biasing effects of self-enhancement.

Individual Psychological Assessment The individual psychological assessment (IPA) procedure typically involves a single person administering some variable combination of in-depth background interview, tests of cognitive abilities and personality, and behavioral observations and ratings (Church & Rotolo, 2013). An individual assessor decides how to combine and interpret the assessment information and then makes judgments about the fit between the candidate and the job, executive team, and organization. Because of the idiosyncratic nature of IPAs, their effectiveness has been and remains a subject of disagreement. The job analysis is sometimes as informal as a discussion with the client organization about what the job incumbent must accomplish and what competencies are required. In organizations with more sophisticated talent management and assessment programs, significant job profiling may be conducted before the assessment to ensure rigor and validity, especially when the results are used for decision making. Inconsistency in how information is gathered, integrated, and reported is also of concern. Highhouse (2002) concluded: "The holistic approach to judgment and prediction has simply not held up to scientific scrutiny" (p. 391). He speculated that IPA, like psychotherapy before it, has achieved "functional autonomy [that] has enabled individual psychological assessment to survive and flourish" (p. 391). On the other hand, proponents claim the IPA has several advantages. It has been a well-known part of the toolkit of psychologist-practitioners in most consulting firms for decades (Ryan & Sackett, 1987). Its popularity over the years is due, in part, to its flexibility. It can be used to assess individual executives on the spur of the moment, and it can serve various purposes (Jeanneret & Silzer, 1998). Use of multiple methods is consistent with the assumption of the BluePrint that multiple measures of the same constructs are more useful for key talent management decision making, diagnostic discussions, and interventions than any single measure. An IPA can measure a variety of dimensions, including personality characteristics, cognitive ability, learning ability, and motivation. A recent large-scale meta-analysis (Morris, Daisley, Wheeler, & Boyer, 2015) demonstrated that the IPA has moderate criterion-related validity in relation to subjective criteria such as managerial ratings (mean r = .24) and administrative decisions (mean r = .19). Validity was higher when the IPA involved cognitive ability tests (but not personality tests, biodata, or interviews), when the same assessor was used for all candidates, and when the method was applied to managers rather than non-managers. IPAs are used by many large organizations. Piotrowski (2007) provided an informative description of an IPA program run at The Hartford using a cadre of outside psychologists in which more than 300 managers are assessed per year. Historically, the IPA has been used more frequently with individuals after they have been identified as high-potentials than as part of the high-potential identification decision itself. PepsiCo, however, has embraced the use of individual assessment and development at four different levels for different purposes with its multitier Leadership Assessment and Development (LeAD) program. Based on the Leadership Potential BluePrint, it provides an intensive, integrated assessment and development experience linked to key leadership transitions and targeted at individuals at key career stages in the organization (Church & Silzer, 2014). At the lowest levels in the organization, the program emphasizes the identification of future leadership potential, drawing more on the Foundational and Growth dimensions. At the next two levels, the focus is on confirmation and verification of high-potential status along with accelerated development of those already identified for future roles through the talent review process. The content focus here is balanced across all elements of the BluePrint. At the highest levels, the assessment program is more about shaping and refining executives for succession planning purposes, focusing on leadership capabilities and functional breadth rather than on selection decisions per se.
Interestingly, all four layers of assessment at PepsiCo use a combination of tools, but some of the specific tools vary based on the emphasis and requirements of the intervention goals (development versus decision making), the nature of the target audience (junior versus mid-level versus C-suite talent), and the cost, complexity, and timing of administering to small versus large numbers of employees globally in multiple languages. Thus, the total LeAD system uses a number of measures, including personality tools, custom online simulations, in-person assessment centers, structured interviews, biodata, 360-degree measures, and situational judgment tests. While the configuration of tools may differ somewhat by level (e.g., the OPQ is available in more languages globally than other personality measures but is less appropriate for senior executives), there is a concerted effort to ensure total coverage across the BluePrint dimensions and consistency in measurement wherever possible. Other measures, such as 360-degree feedback, are used more consistently across levels. The final result is a comprehensive multitrait, multimethod system based on a consistent framework that has been validated across the different levels of the BluePrint.

EXECUTIVE SELECTION IN ORGANIZATIONAL AND CULTURAL CONTEXT Although this chapter focuses on executive selection, the final screening procedure is only one phase of a long-term, complex, multistage process of identifying leaders. In this section we describe several stages of the selection of organizational leaders, including who makes the selection, the processes involved, and the criteria and standards for evaluating candidates. Organizations begin the process of selecting future leaders during college recruiting and the screening of management trainees. Recruiters, HR staff, and line managers evaluate the credentials of bachelor's, master's, and doctoral graduates, often favoring candidates with high grades and special extracurricular accomplishments from top-flight universities. Performance evaluations along the way are critical. For years it has been noted that, despite their well-known limitations, supervisory judgments were the most commonly used practice for predicting managerial effectiveness (Thornton & Byham, 1982). The same has been said more recently with regard to executive selection (Sessa, 2001) and for perhaps the same reasons: Performance reviews by supervisors are practical and widely accepted by those in the organization. In many organizations, more formal and complex systems of performance management have replaced simple performance appraisal programs, and common practice today includes review and participation by higher-level management and HR specialists. Nevertheless, a high performance evaluation from an employee's immediate supervisor has become "the admission price for future growth and development" (Charan, Drotter, & Noel, 2001, p. 166). The nine-box grid with three levels of performance crossed with three levels of potential is an integral part of many talent management processes today. Placement on the grid typically begins with the immediate manager's evaluation, reviewed by the next-level manager. Other inputs to the potential ratings are often based on a combination of evaluations by each individual's manager on performance in initial assignments, success in management training programs, and formal assessments of leadership potential using a variety of assessment methods described in previous sections of this chapter. Grid ratings, usually done annually, become part of the discussion at talent management meetings and inform developmental assignments. For such grid ratings to be effective, they must differentiate among individuals on both performance and potential, and the two ratings must not be highly correlated. The process has failed in organizations in which all individuals fall along the diagonal. There should also be concern that ratings of potential might be related to age and thus be unfairly biased against older managers. Once a manager is identified as a high-potential, a critical set of experiences takes place as he or she is recommended for and placed in a number of career-enhancing positions at successively higher managerial levels. Many persons are instrumental in such movement, including any assessor who recommends a developmental assignment, HR staff who know of relevant opportunities and openings, the candidate's immediate manager who endorses and fosters such assignments, mentors who are "looking out for" the individual, and higher-level managers in new organizational settings who are receptive to taking on new staff members. As managers advance within an organization, their reviewers become a broader group, typically including more senior executives, the CEO, and the board.
The board of directors of an organization has become increasingly involved in an organization’s selection (as well as compensation and evaluation) of the CEO as a result of the recent raft of corporate scandals and “a subsequent stream of regulations and guidelines” (Nadler, Behan, & Nadler, 2006, p. 174). Boards are involved in reviewing the performance of not only the CEO and senior executives but also current high-potentials and staff who are in the pipeline to become executives in the next several years. At a critical point, individuals are selected into the top levels of executive positions. Here consultants often provide individual psychological assessment of finalists to the CEO and the board of directors. Candidates are evaluated on a combination of high performance in a variety of critical assignments, potential to lead the organization toward long-range strategic goals, and fit with organizational requirements and the existing executive leadership team. In addition, because some organizations have board-mandated specific requirements for CEO candidacy (e.g., based on retirement planning scenarios), the current level and career stage of potential successors may also factor in an accelerated development strategy. Research by SpencerStuart (2008), an executive search firm, noted that the average age of Fortune 500 CEOs has decreased from 59 in the 1980s to 54 at the time of the survey, and the median tenure was five years. This trend puts pressure on organizations to ensure that individuals in their pipeline are developed with the right experiences and at the right speed. Here again, organizations should be concerned that the process is not biased against older managers.


Examples of the integration of evaluations of performance and potential are General Electric's widely copied process called Session C (Freedman, 2004) and PepsiCo's talent review model (Church & Waclawski, 2010). Both include formal annual review of past accomplishments and potential, input from multiple sources at the top executive levels, and suggestions for future developmental assignments.

Fit Most cases of executive failure are accompanied by a statement that "there wasn't a good fit." What does this mean? The traditional selection paradigm matches individuals with jobs. At executive levels, a broader array of characteristics of individuals and jobs is essential; fit becomes multidimensional on both sides. Hollenbeck (1994) argued that successful selection depends upon the fit among three sets of variables: those of the individual, the organization, and the external environment or strategic demands. Moses and Eggebeen (1999) suggested that fit changes over time (e.g., as a large, stable organization such as the earlier AT&T or IBM evolves into a faster-paced, versatile, constantly changing one). Sessa and Taylor (2000) discussed characteristics of the candidate, the organization, and its talent management strategy, but only recently has there been exploration of what a talent management strategy means and what its implications are.

Talent Management (TM) Strategies The processes of selecting high-potentials and executives are often claimed to reflect the TM strategy of the organization, but for this assertion to be meaningful, the term TM requires specification. As used in the human resource management (HRM) literature, TM means many different things, and there is no single universal operational form. Acknowledging that any typology of TM is somewhat oversimplified, Thornton et al. (2015) identified three general talent management strategies among the myriad descriptions in the recent HR literature. Personnel and human resource management (P/HRM) is the traditional emphasis, adhered to in many public organizations, on highly standardized procedures for evaluating persons for promotion based on merit, fitness, and freedom from patronage. Strategic human resource management (SHRM) is probably the most frequently endorsed form of TM carried out in organizations in recent years; it emphasizes the integration of numerous HR functions, including assessment, development, and selection of the organization's inclusive total talent pool. Targeted talent management (TTM) places major emphasis on just the most select and highly skilled staff members in highly critical positions. Selection of high-potential and executive leaders differs when organizations adhere to these three strategies. P/HRM places emphasis on fairness and transparency, SHRM attempts to be quite inclusive in fostering talent throughout the organization, and TTM recognizes the need to attend primarily to just the exclusive top talent in mission-critical positions. Any given organization may follow any combination of these strategies at different times, in different parts of the organization, and with different occupations and job positions.

Followers Follower characteristics may also impact what type of leader will be effective in a given situation. Many factors have been shown to moderate the effectiveness of leaders' behavior, including followers' satisfaction and perceptions of their abilities (Hersey & Blanchard, 1984), need for autonomy, openness to experience (Groves, 2005), motives (Wofford, Whittington, & Goodwin, 2001), and achievement orientation, self-esteem, and need for structure (Ehrhart & Klein, 2001). In particular, Hooijberg and Schneider (2001) suggested that executive leaders who are high in social intelligence may be better able to adapt to differences in followers' attitudes, personalities, and motives.

Diversity An increasingly important consideration for the selection of high-potentials is diversity in leadership ranks. In general, climates that promote diversity and inclusion have positive effects on organizational outcomes such as performance, innovation, firm reputation, recruitment, and organizational attitudes (Dezsö & Ross, 2012; Walker, Field, Bernerth, & Becton, 2012). Companies with outstanding records for diversity and inclusion have seen the benefits of diversity and inclusion efforts through organizational development techniques such as employee surveys, performance management, and training (Church, Rotolo, Shull, & Tuller, 2013). The benefits of diversity are most pronounced when women represent more than 22% and racial minorities represent 25% of the executive team (Labaye, 2012; Roberson & Park, 2007), numbers that most companies fall short of. Church et al. (2013) suggest that diversity should be included in the selection of high-potentials in two ways: (1) leaders who are effective at diversity relationships and inclusion should be more likely to be promoted, and (2) individuals who fulfill diversity objectives should be given added consideration in the selection process. They cite benchmarking studies from the Mayflower Group and The Conference Board showing that 59–82% of the companies surveyed consider diversity in their TM systems. In another survey, Silzer and Church (2010) found that 25% of companies set goals for the representation of women and minorities in their high-potential pool. Other companies monitor the percentage of these groups but do not set formal goals. The dearth of women and minorities at the top of organizations suggests that more work is needed in this area to ensure that women and minorities are making it into the pipeline and transitioning into top leader roles. Perhaps I-O psychologists could help fashion some variation of the Rooney Rule (Freedman, 2014) to encourage organizations to assess at least one woman at every phase of the executive succession process while still meeting affirmative action standards.

Country Culture The country culture in which a leader's organization is embedded can also impact leadership effectiveness. That is to say, certain leadership traits and behaviors will be perceived more positively in some cultures than in others. Considerable evidence for cross-cultural differences in leadership effectiveness has come from the work on the GLOBE project (Chhokar, Brodbeck, & House, 2007). For example, charismatic leadership is highly preferred in South Africa, whereas consensus-based leadership is preferred in Ireland. Moreover, the type of selection practices used to hire executives should reflect the national culture from which an organization is hiring; consider the prominent use of individual difference testing in the individualistic U.S. culture and its relative absence in more collectivistic cultures (Dipboye & Johnson, 2008).

Summary The evidence for the validity, relevance, and accessibility of these techniques for the selection of executives and high-potentials is mixed. Moreover, perhaps the most widely used process for selecting executives from outside the organization is managed by external executive recruiters (Howard et al., 2007), a process not discussed in detail here. Executive recruiters typically use interviews and references as the data-gathering methods, and then provide descriptions of candidates and make their own recommendations to the organization (Wackerle, 2001). When there are internal and external candidates for an executive position in an organization, there may well be much more information available about the internal employee (e.g., performance appraisals, test results). These differences raise questions about the comparability and fairness of the resulting selection decisions. Equity may be introduced if an external source assesses both internal and external candidates, a process that is being carried out more frequently (Howard, 2001). There does not appear to be any one best method for executive selection, and evidence of the prevalence of one selection technique, or of the lack of use of another, is not in itself evidence of the measure's validity or utility. Each of the measures discussed here may be useful for selecting executives. Organizations must examine the qualities they are looking for, their staffing strategy and philosophy, and their past success with different measures when choosing their selection plan. In addition, consideration must be given to a leader's fit with the organization.

DOES IT WORK? In the previous sections, we reviewed several techniques and processes used in executive selection. The amount and quality of published supportive evidence varies across assessment methods. This raises the question: Does the overall process of executive selection work? This question is particularly important given the marked increase in executive turnover since the 1990s, with many high-profile cases of executive failure (Walberg, 2014). High rates of top-level turnover have only intensified the "war for talent," creating greater reliance on outside selection rather than internal promotion (Aguinis, Gottfredson, & Joo, 2012). Russell (2001) reported a longitudinal study of performance among 98 top-level executives. Information from interviews and questionnaires was integrated into ratings on several performance dimensions by teams of executives and researchers via a process akin to the wrap-up discussion in an AC. Competency ratings in problem solving and people orientation predicted subsequent fiscal and nonfiscal performance trends. Why is there so little empirical research on the effectiveness of executive selection practices? Such research is difficult for several reasons. First, as Hollenbeck (1994) pointed out, there are several inherent difficulties of CEO selection: each CEO position is unique and may be changing in an uncertain future; the selection decision is unique; the decision makers may never have made such a decision and are probably not trained to do so; the process is probably not completely open; and outside forces may come into play. Second, it is difficult to conduct a good study to determine if the process was successful. Hollenbeck (1994, 2009) offered a partial list of explanations: the long time involved to select one person, the high secrecy surrounding this high-stakes choice, and the difficulty for credible researchers of gaining access to the expensive process. There are also technical research problems precluding classic criterion validation studies: small sample sizes, restricted range in measures of key variables such as intelligence, candidates' resistance to being subjected to onerous and sensitive assessment procedures, inherent limitations (e.g., faking, biased ratings by self and others) of some promising constructs, difficulty of accessing a comparison group of individuals who are not selected, and the complexity of any criterion measure. The difficulty of finding a meaningful criterion of effectiveness for selecting high-potentials and executive leaders bedevils researchers and practitioners. Appealing as it may appear, an index of organizational performance as a complex criterion measure has proven to be a contentious topic in the leadership literature. The lack of published empirical studies of the accuracy of executive selection procedures may be lamentable, but it is hardly surprising. Furthermore, there has been a debate over the extent to which leadership impacts organizational performance. For example, Pfeffer and Salancik (1978) argued that leadership has a minimal impact on organizational performance. Despite these arguments, other researchers have demonstrated the enormous potential for leaders to affect organizational performance. Estimates of the variance in profitability due to the CEO range from 16–20% (Hambrick & Quigley, 2014) to 43.9% (Weiner & Mahoney, 1981). However, the relationship between leadership and organizational performance may not be a good barometer of the success of leader selection practices.
The executive selection practice may be effective, but organizational performance may falter because success also depends to a large extent on the CEO's team. In addition, if all leader selection efforts were successful, there would be no variance in the resultant competencies, and thus there would be no statistical relationship with organizational performance. There is a definite need for research on the use of different selection methods in relation to various indices of organizational and leadership performance to further address this issue. Howard and Thomas (2010) offered suggestions for a variety of metrics to study the effectiveness of executive assessment programs (e.g., evaluation of the focus, process, outcomes, and impact of the methods).

Research Opportunities Future involvement of I-O psychologists could include further articulation of the competencies and attributes needed for diverse organizational challenges (e.g., defining and assessing characteristics related to long-term success such as character); specification of organizational and environmental variables that need to be considered in determining fit; understanding how the complex process of executive selection will be done differently in every job and every organization; and training CEOs, top executive teams, and boards of directors in processes of matching candidates to the demands of positions. Consulting firms are becoming more involved in assessing internal and external candidates for CEO positions, and these assignments may provide opportunities for I-O psychologists to apply more scientific methods to the selection of CEOs. On the basis of our observations of the field of executive selection, we note that there was much systematic research in the 1960s to 1980s on early identification of management potential and executive selection, but not as much recent published research, possibly due to the changing publication standards of scholarly journals. Previously, various assessment techniques (e.g., biodata, ACs, and cognitive ability tests) were evaluated for selection, but emphasis in the past two decades has been placed on development. Considering the noted scandals in executive ranks, selection may be gaining renewed importance as the cost of executive failure becomes higher. The concern for fit is probably the most significant development in executive selection in the last 20 years. More research is needed into the judgmental processes used to combine complex patterns of information about candidates, on the one hand, with the complex and changing patterns of organizational and situational demands, on the other. At the risk of appearing nihilistic, we suggest that the traditional statistical methods used by I-O psychologists to study relationships of predictor scores and criterion measures may not be up to the task of understanding the processes of executive selection at the highest levels of organizations. Studying these complex processes may call for different research methods, including clinical methods of judgment, policy capturing with executive recruiters' judgments, evaluation of broader measures of organization-level human capital (Birri & Melcher, 2011), systematic qualitative studies of successes and failures, and a return to dormant complex validation strategies such as synthetic validation (McPhail, 2007).

CONCLUSIONS There are many ways executives get the job done. There is no agreed-upon list of executive competencies or attributes, and only recently have more systematic hierarchies of these dimensions emerged. Many competencies that are commonly listed are too broad or vague to guide assessment efforts (e.g., strategic global perspective, thinking outside the box, performance orientation, and emphasis on people development). To get the job done, an executive must have the pattern of human attributes needed by the organization at the point in time of selection. No single attribute or simple profile of attributes is related to executive effectiveness. These attributes form a unique profile including some forms of intelligence, personality characteristics, and values, as well as experience, knowledge, and effective interpersonal skills. Organizations use various methods to assess these attributes. The quality of tests, personality questionnaires, and interviews has improved over the years, but these procedures are used in different ways at each stage of the process. They are used in more formal, systematic, and quantitative ways when screening candidates into pools of high-potentials, but in more informal and variable ways during the integration of information at the time of selection into executive ranks. The most common method of selecting executives remains the performance/potential review process by higher-level executives and the board of directors. In larger companies, a process patterned after GE's Session C has become common. The rigor of these final steps of leader selection varies considerably. Observers of these processes have lamented the lack of consistency and sophistication shown by many organizations. Suggestions have been made for more systematic processes of determining organization needs, assessing competencies in candidates, and matching competencies to needs. In fact, many organizations are following these practices, but little research has been conducted to evaluate these methods. I-O psychologists have helped articulate and evaluate the attributes related to leader effectiveness and have been involved in designing programs to screen candidates into high-potential pools and to develop leadership and managerial skills. In addition, I-O psychologists have specialized in individual assessment of external candidates. However, with the exception of the few psychologists who consult with CEOs and boards, they have not played extensive roles in the final stages of executive selection among internal candidates.

REFERENCES Aguinis, H., Gottfredson, R. K., & Joo, H. (2012). Using performance management to win the talent war. Business Horizons, 55, 609–616. Aon-Hewitt. (2013). Building the right high potential pool: How organizations define, assess and calibrate their critical talent. Consulting performance, rewards and talent paper Aon plc. Retrieved from http://www.aon.com/attachments/human-capital-consulting/2013_Building_the_Right_High_Potential_Pool_white_paper.pdf Atwater, L. E., & Waldman, D. (1998). Accountability in 360 degree feedback. HR Magazine, 43, 96–104. Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26. Bass, B. M. (1990). Bass and Stogdill’s handbook of leadership: Theory, research, and managerial applications (3rd ed.). New York, NY: Free Press. Bentz, V. J. (1990). Contextual issues in predicting high-level leadership performance. In K. E. Clark & M. B. Clark (Eds.), Measures of leadership (pp. 131–143). West Orange, NJ: Leadership Library of America. Birri, R., & Melcher, A. (2011). Building a talent for talent. In N. Povah & G. C. Thornton III. (Eds.), Assessment centres and global talent management (pp 175–192). Farnham, England: Gower. Bracken, D. W., & Church, A. H. (2013). The “New” performance management paradigm: Capitalizing on the unrealized potential of 360-degree feedback. People & Strategy Journal, 36, 34–40. Bracken, D., Timmreck, C. W., & Church, A. H. (Eds.) (2001). The handbook of multisource feedback: The comprehensive resource for designing and implementing MSF processes. San Francisco, CA: Jossey-Bass. Brousseau, K. R., Driver, M. J., Hourihan, G., & Larsson, R. (2006). The seasoned executive’s decisionmaking style. Harvard Business Review, 84, 110–121. Campbell, J. P., Dunnette, M. D., Lawler, E. F., III., & Weick, K. E., Jr. (1970). Managerial behavior, performance, and effectiveness. New York, NY: McGraw-Hill. Catalyst Organization. (2013). Women CEOs of the fortune 1000. Retrieved from http://catalyst.org/knowledge/ women-ceos-fortune-1000j (accessed January 2013). Cavazotte, F., Moreno, V., & Hickmann, M. (2012). Effects of leader intelligence, personality and emotional intelligence on transformational leadership and managerial performance. The Leadership Quarterly, 23, 443–455. Charan, R., Carey, D., & Useem, M. (2014). Boards that lead: When to take charge, when to partner, and when to stay out of the way. Boston: Harvard Business School Publishing. Charam, R., Drotter, S., & Noel, J. (2001).The leadership pipeline: How to build the leadership-powered company. San Francisco, CA: Jossey-Bass. Chhokar, J. S., Brodbeck, F. C., & House, R. J. (Eds.) (2007). Culture and leadership across the world. Mahwah, NJ: Lawrence Erlbaum. Church, A. H. (2000). Do higher performing managers actually receive better ratings? A validation of multirater assessment methodology. Consulting Psychology Journal: Practice and Research, 52, 99–116. Church, A. H. (2014). What do we know about developing leadership potential? The role of OD in strategic talent management. OD Practitioner, 46, 52–61. Church, A. H., & Rotolo, C. T. (2013). How are top companies assessing their high-potentials and senior executives? A talent management benchmark study. Consulting Psychology Journal: Practice and Research, 65, 199–223.


Selecting Leaders Church, A. H., Rotolo, C. T., Ginther, N. M., & Levine, R. (2015). How are top companies designing and managing their high-potential programs? A follow-up talent management benchmark study. Consulting Psychology Journal: Practice and Research, 67, 17–47. Church, A. H., Rotolo, C. T., Shull, A. C., & Tuller, M. D. (2013). Inclusive organization development. In B. M. Ferdman & B. R. Deane (Eds.), Diversity at work: The practice of inclusion (pp. 260–295). San Francisco, CA: Jossey-Bass. Church, A. H., & Silzer, R. (2014). Going behind the corporate curtain with a Blueprint for Leadership Potential: An integrated framework for identifying high-potential talent. People & Strategy, 36, 51–58. Church, A. H., & Waclawski, J. (2010). Take the Pepsi challenge: Talent development at PepsiCo. In R. Silzer & B. E. Dowell (Eds.), Strategy-driven talent management: A leadership imperative (pp. 617–640). San Francisco: Jossey-Bass. Church, A. H., Waclawski, J., & Burke, W. W. (2001). Multisource feedback for organization development and change. In D. W. Bracken, C. W. Timmreck, & A. H. Church (Eds.), The handbook of multisource feedback: The comprehensive resource for designing and implementing MSF processes (pp. 301–317). San Francisco, CA: Jossey-Bass. Clark, K. E., & Clark, M. B. (1990). Measures of leadership. West Orange, NJ: Leadership Library of America. Colbert, A. E., Barrick, M. R., & Bradley, B. H. (2014). Personality and leadership composition in top management teams: Implications for organizational effectiveness. Personnel Psychology, 67, 351–387. Craig, S. B., & Hannum, K. (2006). Research update: 360-degree performance assessment. Consulting Psychology Journal: Practice and Research, 58, 117–122. Dalal, D. K., & Nolan, K. P. (2009). Using dark side personality traits to identify potential failure. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 434–436. Day, D. V., Fleenor, J. W., Atwater, L. E., Sturm, R. E., & McKee, R. A. (2014). Advances in leader and leadership development: A review of 25 years of research and theory. The Leadership Quarterly, 25, 63–82. DeNisi, A. S., & Kluger, A. N. (2000). Feedback effectiveness: Can 360-degree appraisals be improved? Academy of Management Executive, 14, 129–139. Dezsö, C. L., & Ross, D. G. (2012). Does female representation in top management improve firm performance? A panel data investigation. Strategic Management Journal, 33, 1072–1089. Dinh, J. E., Lord, R. G., Gardner, W. L., Meuser, J. D., Liden, R. C., & Hu, J. (2014). Leadership theory and research in the new millennium: Current theoretical trends and changing perspectives. The Leadership Quarterly, 25, 36–62. Dipboye, R. L., & Johnson, S. K. (2008). A cross-cultural perspective on employee selection. In D. Stone, E. F., Stone-Romero, & E. Salas (Eds.), The influence of cultural diversity on human resources practices (pp. 53–84). Mahwah, NJ: Lawrence Erlbaum. Diversity Inc. (2013). Where is the diversity in fortune 500 CEOs? Retrieved from http://www.diversityinc. com/facts/wheres-the-diversity-in-fortune-500-ceos (accessed January 2013). Dominick, P. G., & Gabriel, A. S. (2009). Two sides to the story: An interactionist perspective on identifying potential. Industrial and Organizational Psychology Perspectives on Science and Practice, 2, 430–433. Dotlich, D. L., & Cairo, P. C. (2003). Why CEOs fail. San Francisco: Jossey-Bass. Dries, N., & Pepermans, R. (2012). 
How to identify leadership potential: Development and testing of a consensus model. Human Resource Management, 51, 361–385. Dugan, B. A., & O’Shea, P. G. (2014). Leadership development: Growing talent strategically. Society for Human Resource Management (SHRM) and Society for Industrial and Organizational Psychology (SIOP). Science of HR White Paper Series. Ehrhart, M. G., & Klein, K. J. (2001). Predicting followers’ preferences for charismatic leadership: The influence of follower values and personality. The Leadership Quarterly, 12, 153–179. Fiedler, F. (1995). Cognitive resources and leadership performance. Applied Psychology: An International Review, 44, 5–28. Freedman, A. (2004). The ‘Session C’ strategy. Human Resource Executive Online. Retrieved from http:// www.hreonline.com/HRE/view/story.jhtml?id=5359233 Freedman, A. (2005). Swimming upstream: The challenge of managerial promotions. In R. B. Kaiser (Ed.), Filling the leadership pipeline (pp. 25–44). Greensboro, NC: Center for Creative Leadership. Freedman, S. G. (2014). What works remains for the Rooney Rule. The New Yorker, February 11, 2014. Geisinger, K. F., Bracken, B. A., Carlson, J. F., Hansen, J. I. C., Kuncel, N. R., Reise, S. P., & Rodriguez, M. C. (2013). APA handbook of testing and assessment in psychology, Vol. 3: Testing and assessment in school psychology and education. Washington, DC: American Psychological Association. Goodstone, M. S., & Diamante, T. (1998). Organizational use of therapeutic change strengthening multisource feedback systems through interdisciplinary coaching. Consulting Psychology Journal: Practice and Research, 50, 152–163.


George C. Thornton III et al. Groves, K. S. (2005). Linking leader skills, follower attitudes, and contextual variables via an integrated model of charismatic leadership. Journal of Management, 31, 255–277. Halverson, S. K., Tonidandel, S., Barlow, C., & Dipboye, R. L. (2005). Self-other agreement on a 360-degree leadership evaluation. In S. Reddy (Ed.), Perspectives on multirater performance assessment (pp. 125–144). Nagarjuna Hills, Hyderabad, India: ICFAI Books. Hambrick, D. C., & Quigley, T. J. (2014). Toward more accurate contextualization of the CEO effect on firm performance. Strategic Management Journal, 35, 473–491. Hersey, P., & Blanchard, K. H. (1984), Management of organizational behaviour (4th ed.). Englewood Cliffs, NJ: Prentice-Hall. Heslin, P. A. (2009). “Potential” in the eye of the beholder: The role of managers who spot rising stars. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 420–424. Highhouse, S. (2002). Assessing the candidate as a whole: A historical and critical analysis of individual psychological assessment for personnel decision making. Personnel Psychology, 55, 363–396. Hogan, J., Hogan, R., & Kaiser, R. B. (2010). Management derailment: Personality assessment and mitigation. In S. Zedeck (Ed.), American Psychological Association handbook of industrial and organizational psychology (Vol. 3, pp. 555–575). Washington, DC: American Psychological Association. Hogan, R., & Hogan, J. (2001). Assessing leadership: A view from the dark side. International Journal of Selection and Assessment, 9, 40–51. Hollenbeck, G. P. (1994). CEO selection: A street-smart review. Greensboro, NC: Center for Creative Leadership. Hollenbeck, G. P. (2009). Executive selection: What’s right. . . and what’s wrong. Industrial and Organizational Psychology, 2, 130–143. And commentaries. Hollenbeck, G. P., McCall, M., & Silzer, R. (2006). Do competency models help or hinder leadership development? A debate. Leadership Quarterly, 17, 398–413. Hooijberg, R., & Schneider, M. (2001). Behavioral complexity and social intelligence: How executive leaders use stakeholders to form a systems perspective. In S. J. Zaccaro & R. J. Klimoski (Eds.), The nature of organizational leadership (pp. 104–131). San Francisco, CA: Jossey-Bass. Howard, A. (2001). Identifying, assessing, and selecting senior leaders. In S. J. Zaccaro & R. Klimoski (Eds.), The nature and context of organizational leadership (pp. 305–346). San Francisco, CA: Jossey-Bass. Howard, A., & Bray, D. W. (1988). Managerial lives in transition: Advancing age and changing times. New York, NY: Guilford Press. Howard, A., Erker, S., & Bruce, N. (2007). Selection forecast 2006/2007. Pittsburgh, PA: Development Dimensions International. Howard, A., & Thomas, J. N. (2010). Executive and managerial assessment. In J. C. Scott & D. H. Reynolds (Eds.), Handbook of workplace assessment (pp. 395–436). San Francisco, CA: Wiley. Jeanneret, P. R., & Silzer, R. (Eds.) (1998). Individual psychological assessment. San Francisco, CA: Jossey-Bass. Judge, T. A., Bono, J. E., Ilies, R., & Gerhardt, M. W. (2002). Personality and leadership: A qualitative and quantitative review. Journal of Applied Psychology, 87, 765–780. Kaiser, R. B., & Craig, S. B. (2011). Do the behaviors related to managerial effectiveness really change with organizational level? An empirical test. The Psychologist-Manager Journal, 14, 92–119. Kaiser, R. B., LeBreton, J. M., & Hogan, J. (2015). The dark side of personality and extreme leader behavior. 
Applied Psychology: An International Review, 64, 55–92. Kates, A., & Downey, D. (2005). The challenges of general manager transitions. In R. B. Kaiser (Ed.), Filling the leadership pipeline (pp. 45–68). Greensboro, NC: Center for Creative Leadership. Katz, R. L. (1955). Skills of an effective administrator. Harvard Business Review, 33(1), 33–42. Korman, A. K. (1968). The prediction of managerial performance: A review. Personnel Psychology, 21, 295–322. Krause, D. E., Kersting, M., Heggestad, E. D., & Thornton, G. C., III. (2006). Incremental validity of assessment center ratings over cognitive ability tests: A study at the executive management level. International Journal of Selection and Assessment. 14, 360–371. Labaye, E. (2012). Women Matter 2012: Making the Breakthrough. McKinsey Consulting Report. Retrieved from http://www.mckinsey.com/~/media/McKinsey/dotcom/client_service/Organization/PDFs/ Women_matter_mar2012_english.ashx MacRae, I., & Furnham, A. (2014). High potential: How to spot, manage and develop talented people at work. London: Bloomsbury. Mann, F. C. (1965). Toward an understanding of the leadership role in formal organizations. In R. Dubin, G. C. Homans, F. C. Mann, & D. C. Miller (Eds.), Leadership and productivity (pp. 68–103). San Francisco, CA: Chandler. Maurer, T. J., & Lippstreu, M. (2008). Who will be committed to an organization that provides support for employee development? Journal of Management Development, 27, 328–347.


Part VIII
TECHNOLOGY AND EMPLOYEE SELECTION
WALTER C. BORMAN AND MICHAEL D. COOVERT, SECTION EDITORS

39
TECHNOLOGY AND EMPLOYEE SELECTION
An Overview
DOUGLAS H. REYNOLDS AND DAVID N. DICKTER

The practice of employee selection has become heavily dependent on software and the technology systems used to deploy it. Once an administratively burdensome process, selection is now supported by a variety of technologies that are designed to engage candidates while assessing their suitability for employment. A hiring process might now begin with a social media contact, seamlessly screen online for basic qualifications, route acceptable candidates to an online test, and invite those who pass to an in-depth assessment; technology will support each of these steps, as well as the interviewing process and eventual onboarding into the organization. The rise of these technologies has been driven by the need for cost reduction, the desire to improve an imperfect organizational function for strategic advantage, and a large dose of venture capital flowing into the software development industry.

Can technology improve the manner by which organizations select their next generation of associates? The answer is surely positive, but the use of technology-enabled selection tools is also accompanied by a variety of challenges and recurring issues. Writing about technology for traditional media can be a folly; technology development cycles are far shorter than the publication process, and new technologies may become commonplace or obsolete within just a year or two. Fads are the norm with technology, so popular and novel techniques today are quickly replaced by tomorrow’s innovations. Nonetheless, a reader interested in modern selection practices should be aware of the classes of technologies available and the likely direction of their evolution.

Despite the rapid pace of change, some challenges and opportunities tend to be enduring. In this overview we isolate and review these recurring issues that arise when technology-supported selection procedures are used. The potential benefits of using these tools only accrue if the technology is effectively implemented and used, so implementers need to recognize and handle the recurring challenges and opportunities that surface along the journey. These issues may be represented as a set of questions that should be answered as new technologies are designed and implemented:

• Is a new technology compatible with more familiar formats and tools? This question is often posed as one of equivalence, with a key issue centering on whether a new technology introduces irrelevant variance into the results of an assessment process.
• Under what conditions and circumstances will users get access to the selection tools? The options available for deployment have broadened but still involve some basic choices about whether tools will be open for any user, if administration will be supervised, and if users will be authenticated.


• How will the technology systems be implemented within the broader context of the organization so that they will be accepted and used? Effective implementation is likely to have a bigger impact on the value of the system than any specific feature or capability.
• Will the system be deployed across broad geographic, cultural, and/or national boundaries? One of the clear benefits of technology-based HR systems is that they allow for globalized operations. However, global deployment adds new layers of complexity to a selection process.
• Technology-based assessment and selection systems generate a lot of data; some of these data are personal and sensitive. How will data be maintained and kept secure? Are systems designed and maintained with an eye on compliance with global data privacy regulations?
• Do new technologies allow for new ways to assess people for jobs? How can organizations pick through the many fads to see which innovations might stick and which ones will not?

In this chapter, we will review each of these questions in more detail. To establish context, we first provide an overview of the most common types of technology-based personnel selection system components. Our intention is to set the stage for understanding the challenges raised whenever new technologies are deployed and to provide a backdrop for the subsequent chapters in this section.

TECHNOLOGY-BASED SYSTEMS TO SUPPORT EMPLOYEE SELECTION

Technology-supported selection tools and supporting systems are popular because they add value to organizations; their contribution stems from a mix of tactical and strategic benefits. On the tactical side, technology-based systems often provide administrative efficiency gains, ease of use, cost savings, and advantages associated with scale and standardization. Their strategic benefits stem from promises of improved insight into job candidate characteristics through better measurement, the ability to generate strong engagement with the process, and, ultimately, information of depth and scope about talent that will help executives build and steer their organizations.

Several types of software systems have emerged to support employee selection processes. It is common to arrange multiple systems together into a multistage process, sometimes requiring applicants to pass each portion before gaining access to the next (multiple hurdle). Although it is rare for an organization to use all of the components of this arrangement, we will use the classic selection funnel configuration as a model for describing the role of each type of system. Figure 39.1 shows the set of systems arranged as if all components were in operation together. In practice, an organization may use only one or two components, often supported by an applicant tracking system. Each type of component is described in more detail in the following sections; readers are encouraged to review the latest offerings by providers of HR technologies because these techniques evolve quickly.

Candidate Sourcing

The first step in selection is the recruitment of individuals for consideration. Recruiting is often considered a separate process, both in practice (recruiters rarely make selection decisions) and in the scientific literatures tied to these functions. However, the technical systems supporting each should ideally be integrated to allow for efficient operation.

The way an organization recruits serves as a first selection decision, although it may not always be acknowledged as such. Most large organizations will publicize a broad recruiting stance—an openness to consider qualified applications regardless of background. In practice, it is also common to see recruitment patterns that emphasize a preference for certain universities, experiences, or other recruitment channels (e.g., Ivy League, military officers, physics majors). These patterns are often established because of a few high-profile successful recruits in the past, and they have the effect of limiting the applicant pool and potentially creating a discriminatory recruitment pattern. The introduction of Internet-based recruitment and screening tools allows organizations to sidestep these problematic practices and attract a large pool of potentially interested candidates, because that pool can easily be screened down to more manageable numbers in later steps of the process.



FIGURE 39.1  The Selection Funnel

Screening can be done algorithmically, and it can be checked regularly for validity and unintended demographic effects because the data flowing from the process are readily available.

There are many technology-centered methods for generating a pool of recruits. Most commonly, a “Careers” tab is built for the organization’s website. These sites can be elaborate, filled with videos and other rich content that both describe and sell the benefits of employment at the organization; once interested, the user can begin to submit expressions of interest and qualifications for screening directly on the site. Recruitment then becomes a process of driving traffic to the site from job boards, career fairs, press placements, social media posts, etc.

Some of the more technically sophisticated online recruitment approaches have borrowed a set of functions from the field of marketing automation. These tools allow for detailed tracking of electronic recruiting campaigns to build a list of warm applicants for later outreach. For example, an e-mail campaign may be orchestrated to build a talent pool; electronic interactions with the e-mail are then tracked and tallied to build an interest score for all recipients. Variables such as the open rate, tracked clicks through to linked sites, time on the site, connected topic areas opened, and number of repeat visits can all be tracked and scored. High-scoring prospects may then be contacted with more targeted communications or direct contact by a recruiter. By linking available information from social media accounts, recruitment automation systems can begin to build a profile of each prospect’s qualifications and likelihood of interest, as well as of success in the selection process and beyond. Similar data can be tracked from social media posts and other online activity. Note that much or all of this activity may be executed outside the awareness of the target candidates, who are simply reading an e-mail or social media post and acting upon it or not. Some research has even demonstrated the ability to reliably score candidates on personality variables based solely on the content of their social media postings (Park et al., 2015).

We expect these techniques to advance and proliferate as the range of trackable activities increases. Currently, attention-grabbing games, brainteasers, social media placements, and


similar techniques are used for collecting lists for recruitment. Essentially any technique developed for the well-funded arena of product marketing can be retooled as a recruiting device and will be if it allows an organization to gain competitive advantage by building a fresh and strong database of possible qualified candidates.

Once a talent pool is created, one of two classes of tools is often used to begin the process of selecting those who meet specified qualifications. One method is centered on the submission of a resume, and the other is based on structured screening questions. Sometimes these techniques are used in combination.
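To make the marketing-automation approach described above concrete, the sketch below shows one way an interest score might be tallied from tracked e-mail and site interactions. It is a minimal illustration only; the event names, weights, and outreach threshold are hypothetical and would need to be set (and evaluated) for any real campaign.

```python
# Minimal sketch: tallying an "interest score" from tracked campaign events.
# Event names, weights, and the outreach threshold are hypothetical.

EVENT_WEIGHTS = {
    "email_open": 1,
    "link_click": 3,
    "site_visit": 2,
    "topic_page_view": 2,
    "repeat_visit": 4,
}

def interest_score(events):
    """Sum weighted campaign interactions for one prospect.

    `events` is a list of event-name strings captured by the tracking system.
    Unknown events are ignored rather than raising an error.
    """
    return sum(EVENT_WEIGHTS.get(event, 0) for event in events)

def prospects_to_contact(tracking_log, threshold=10):
    """Return prospect IDs whose tallied score meets a (hypothetical) cutoff."""
    return [
        prospect_id
        for prospect_id, events in tracking_log.items()
        if interest_score(events) >= threshold
    ]

if __name__ == "__main__":
    log = {
        "prospect_001": ["email_open", "link_click", "site_visit", "repeat_visit"],
        "prospect_002": ["email_open"],
    }
    print(prospects_to_contact(log))  # ['prospect_001']
```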

Resume Storage, Parsing, and Search

Resume storage tools are typically built to work in concert with an Applicant Tracking System (ATS), once a resume has been submitted and a candidate record has been established in the database. These tools may also incorporate methods to pull resumes from the millions that are posted to recruitment sites. Resume management tools are designed to mine background and qualifications information to help manage the high volumes of candidates who attach a resume to their submission of interest to the organization.

Typical features for these tools include the use of text search and keyword parsing to support various search methods for mining the resume database. Resume parsing tools will automatically deconstruct the resume and put relevant information (e.g., contact information, degrees, educational institutions, and certifications) into database fields, thereby increasing the speed and accuracy of searches. Once critical information is gleaned from the resume, keyword search tools can be deployed to assist recruiters in the task of assembling a group of job seekers who meet criteria that can be imposed during a database search. These tools may include advanced technologies that allow the meaning of a word or phrase to be detected from the context in which it appears in the resume. For example, the resume of a financial analyst who worked for the company State Street would be parsed with “State Street” under experience, not as the job seeker’s address.

Resume search tools can help improve the efficiency of large-scale recruiting processes, but resumes have severe limitations for providing insight into job seeker qualities. The traditionally brief format of the resume does not reveal the quality of prior work or reflect the learning gained from prior experience. These essential characteristics can best be determined via more advanced screening, assessment, and interviewing techniques.
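As a rough illustration of the parsing step, the sketch below pulls a few common fields out of raw resume text with regular expressions. Production parsers are far more sophisticated (and increasingly rely on natural language processing); the patterns and field names here are simplified assumptions.

```python
import re

# Simplified sketch of resume parsing: pull a few fields out of raw resume
# text into database-ready fields. Patterns and field names are illustrative
# assumptions, not a production-grade parser.

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")
DEGREE_PATTERN = re.compile(r"\b(Ph\.?D\.?|M\.?B\.?A\.?|M\.?S\.?|B\.?S\.?|B\.?A\.?)")

def parse_resume(text):
    """Return a dictionary of candidate fields extracted from resume text."""
    email = EMAIL_PATTERN.search(text)
    phone = PHONE_PATTERN.search(text)
    return {
        "email": email.group() if email else None,
        "phone": phone.group() if phone else None,
        "degrees": sorted(set(DEGREE_PATTERN.findall(text))),
    }

if __name__ == "__main__":
    sample = (
        "Jane Doe\njane.doe@example.com\n(555) 123-4567\n"
        "B.S. Finance, M.B.A.\nFinancial Analyst, State Street, 2015-2018"
    )
    print(parse_resume(sample))
```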

Applicant Screening Tools

As an alternative or supplement to resume-based tools, structured qualification screening begins by requesting responses to qualification questions that may be scored and used as the basis for candidate ranking. Unlike resume-based tools, where candidates are pulled from a database of broadly collected resumes, screening questions may be constructed to be highly specific to available jobs. When they are designed and implemented effectively, they can provide a standardized method for quickly collecting background and qualifications data on job seekers and sorting them on the basis of their fit or predicted success for specific open positions. Common questions include work and educational history (some systems may extract this information from a resume and have candidates review and update the extracted information), basic qualifications (e.g., licenses, certifications, and years of relevant experience), and specific experiences (e.g., with equipment, work processes, or business issues common to the job).

The selection of questions and the complexity of the scoring system applied to qualification questions of this sort should be balanced against the sophistication of the administrative users. Sometimes an I-O psychologist will guide the design of the tool and the scoring scheme; in other situations, these roles may be delegated to a broader range of system users. Software developers and users may perceive this feature as a benefit, but the flexibility comes with the risk that qualifications are poorly defined; if not carefully designed, basic qualification questions can


be too specific, too strict, or bear little relationship to the target job. Well-designed systems will include role-based access to question construction, scoring, and deployment so organizations can standardize and control their approach to this selection stage just as they usually would for more complex aspects of the selection process. It should also be noted that the qualifications screening process supported by these tools has the same validation requirements as any other selection process. The easy configuration of screening questions may encourage casual users to set standards that are insupportably rigorous and lead to indefensible adverse impact when not properly validated.
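To illustrate the kind of logic these tools automate, the sketch below scores a small set of structured screening questions: required (knockout) qualifications are checked first, and remaining responses contribute points toward a ranking score. The specific questions, weights, and minimums are hypothetical, and, as noted above, any operational scoring scheme of this kind still requires validation.

```python
# Hedged sketch of automated qualification screening. Question keys, weights,
# and minimum requirements are hypothetical examples, not a recommended standard.

REQUIRED = {
    "has_forklift_certification": True,   # knockout: must be True to continue
    "willing_to_work_shifts": True,
}

POINT_RULES = {
    # question key: (minimum creditable value, points per unit above minimum)
    "years_relevant_experience": (1, 2),
    "years_supervisory_experience": (0, 3),
}

def screen_candidate(responses):
    """Return (passes_knockouts, ranking_score) for one candidate's responses."""
    for question, required_value in REQUIRED.items():
        if responses.get(question) != required_value:
            return False, 0
    score = 0
    for question, (minimum, points_per_unit) in POINT_RULES.items():
        value = responses.get(question, 0)
        if value >= minimum:
            score += (value - minimum) * points_per_unit
    return True, score

if __name__ == "__main__":
    candidate = {
        "has_forklift_certification": True,
        "willing_to_work_shifts": True,
        "years_relevant_experience": 4,
        "years_supervisory_experience": 1,
    }
    print(screen_candidate(candidate))  # (True, 9)
```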

Automated Testing

Compared with screening tools, testing provides greater insight into individual characteristics by deploying standardized and psychometrically sound instruments that can provide more accurate measurement of constructs that are difficult to index with screening questions alone, such as abilities, traits, and knowledge.

Automated testing systems tend to have several common features. Test takers are typically invited to the assessment by providing them with secure log-in and password information; standardized instructions, help menus, and practice items are then provided to orient them. During the test session, several features are deployed to facilitate test taking, including countdown timers, progress indicators, and other navigational aids.

Many test delivery systems simply deploy computerized versions of popular paper-based instruments, but test developers are also taking advantage of the broad range of available computing and display capabilities. Tests that use embedded audio, video, and animated graphics as part of the question stimuli are commonplace; response formats that involve hot-spots, drag and drop, and other interactive controls provide greater flexibility for handling a range of engaging item types. Advances in natural language processing add the potential for free-form responses to be used in these tools. Page-level timing and other measures of attention and performance provide the basis for new aspects of measurement to be investigated as well.

Question presentation and response analysis, once largely limited to classical linear test configurations, have now yielded to the power of more complex psychometric models, such as those reviewed in Chapter 42. Once rare in practice, measures based on Item Response Theory (IRT) now play a regular role in online testing systems due to their advantages for limiting question exposure, mitigating common cheating strategies, and shortening test length—all very desirable qualities for online selection tests.

The rise of gaming technologies has also extended the range of tools deployed to measure more complex traits. Game-based psychometrics provide scores on job-relevant characteristics derived from performance on videogame-style tasks (Rampell, 2014). This mix of technology, simulation, and psychometrics has yet to be researched at much depth, so conclusions about the effectiveness and longevity of the technique are premature. Chapter 44 reviews many of the current options in this arena and findings to date. We return to this issue in more detail later in the chapter.
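As a rough illustration of why IRT-based delivery helps, the sketch below computes two-parameter logistic (2PL) item information and picks the most informative item for a candidate's current ability estimate, which is the core step in many adaptive and exposure-controlled designs. The item parameters below are invented for illustration.

```python
import math

# Minimal 2PL sketch: probability of a correct response and item information,
# used to pick the most informative unadministered item for an ability estimate.
# The item parameters below are invented for illustration.

def p_correct(theta, a, b):
    """2PL probability of a correct response given ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def next_item(theta, item_bank, administered):
    """Return the id of the most informative item not yet administered."""
    candidates = {
        item_id: item_information(theta, a, b)
        for item_id, (a, b) in item_bank.items()
        if item_id not in administered
    }
    return max(candidates, key=candidates.get)

if __name__ == "__main__":
    bank = {"item1": (1.2, -0.5), "item2": (0.8, 0.0), "item3": (1.5, 0.4)}
    print(next_item(theta=0.3, item_bank=bank, administered={"item1"}))  # item3
```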

Behavioral Assessment Support

Tools for supporting behavioral assessment, such as work simulations and assessment centers, allow for presentation of stimuli via controlled e-mail inboxes, instant messaging tools, and voice and video mail. The addition of embedded video or audio interactions with live role players allows for the replication of the work environment in a manner that is a better reflection of how modern complex work is performed and has higher degrees of control and standardization than past versions of nonautomated assessment centers. Unlike most of the testing systems mentioned above, behavioral assessment will emphasize response fidelity and target behavioral


competencies in their measurement, such as Planning and Organizing, Communication, Analysis and Judgment, Financial Acumen, etc. The strength of these systems rests with their ability to deliver work-related challenges and present realistic situations that elicit work-relevant natural behavior. Behavior is then captured online (through constructed responses, such as a response to an e-mail), through recordings, or through the production of work products such as project plans. Recorded behavior may be categorized or pre-rated according to algorithms, but the final scoring and report development is often left in the hands of trained assessors, who use the same system to pull responses and provide ratings against common standards. Natural language processing of simulation responses has supplemented assessor judgment and will continue to play an expanding role in behavioral assessment.

These systems are most often used for complex jobs involving leadership, managerial, or executive-level requirements (see Reynolds & Rupp, 2010, for examples of these tools). Some variants have been developed for nonmanagement roles, such as selection in advanced manufacturing facilities where the cost of training is high. In this context, behavioral assessment allows for the reduction of training failures by providing standardized, monitored, and scored simulated production exercises. Heavy computerization of the activities through the use of sensors attached to physical exercise components allows for accurate tracking of complex motor behavior (Byham, 2010).

Interview Support

Automated tools have also been developed to help structure and facilitate the interview process. Interview facilitation tools often allow for the identification of the rating targets (e.g., competencies, past behaviors), the construction or identification of questions that assess these targets, the assignment of questions to interviewers, and a process for data combination across interviewers (e.g., Chambers & Arnold, 2015). Furthermore, the tools can help with records retention if the interview protocol, summary notes, and ratings are maintained in the system. Many of these steps are geared toward improving the efficiency and standardization of the interview process; if the tool is based on an interview technique that has been well researched and proven, additional insight into candidates may also be gained.

Just as is the case with behavioral assessment tools, interviewing tools now include capabilities for capturing live-streamed video between the interviewer and interviewee. Automated variants using avatar interviewers or video-based question delivery help maintain a degree of situational fidelity while capturing natural responses from candidates. Here again the use of natural language processing will likely drive additional efficiency and standardization into these tools as the technology evolves. Biometric tools (e.g., voice analysis) and facial recognition technologies are also being used in this context to confirm interviewee identity and aid in scoring.

Applicant Tracking Systems

The role of the ATS is to establish and build candidate records as the recruiting and selection process unfolds and to undergird the systems used to deliver each of the selection steps; these systems also manage job openings and candidate flow. The system should collect, track, and report critical information about open positions, candidates, and selection processes to enable the efficient management of recruiting and staffing functions. The ATS also frequently serves as a hub for additional services (e.g., job posting and background checking) to further extend the value provided to the hiring process through automation.

In addition to the main functions of data tracking and management, an ATS will enable reporting on candidate quality and flow rates throughout the staffing process. This allows for computation of effectiveness metrics, such as the success rates of recruiters and recruiting channels, time-to-hire, and the criterion validity of candidate information. Data storage and reporting are also critical for understanding how the system as a whole is operating with respect


to critical outcomes such as the diversity mix of the candidate pool at each stage of selection. These reports are required to support government recordkeeping requirements and to respond to audit and challenge requests. On its own, an ATS typically provides little sophistication for the measurement of people, so supplemental processes are often added to support the measurement required for strong selection decisions, and these tools are usually required to integrate with the processes and data formats supported by the ATS.

More detailed summaries of available technology systems to support employee selection are available (e.g., Reynolds & Weiner, 2008). Certainly, many other technology-based products exist and many more will be developed. Fundamental issues regarding efficiency, control, standardization, and measurement accuracy will continue to underlie the business value of new approaches. Novelty, user engagement, and marketing will also play a big part in which of these advances becomes popular. Selection system designers will need to evaluate advancements on the basis of the balance between their business value and the risks they might pose as increasingly complex measurement functions become automated and broadly accessible.
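A sketch of the kind of flow reporting an ATS supports appears below: stage-level selection rates by group and the commonly used four-fifths (80%) comparison of selection rates. Group labels and counts are hypothetical, and a ratio below 0.80 is only a screening signal that warrants closer statistical and validity review, not a legal conclusion.

```python
# Hedged sketch of adverse impact monitoring from ATS flow data.
# Counts and group labels are hypothetical.

def selection_rates(applicants, selected):
    """Selection rate per group: selected / applicants."""
    return {group: selected[group] / applicants[group] for group in applicants}

def impact_ratios(applicants, selected):
    """Each group's selection rate divided by the highest group rate.

    Ratios under 0.80 (the 'four-fifths' rule of thumb) are often flagged
    for further review.
    """
    rates = selection_rates(applicants, selected)
    highest = max(rates.values())
    return {group: rate / highest for group, rate in rates.items()}

if __name__ == "__main__":
    applicants = {"group_a": 200, "group_b": 150}
    selected = {"group_a": 60, "group_b": 30}
    print(impact_ratios(applicants, selected))
    # group_a rate = 0.30, group_b rate = 0.20 -> ratios of 1.0 and ~0.67
```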

COMMON ISSUES ENCOUNTERED WITH TECHNOLOGY-BASED SELECTION

Various technology-centered tools and techniques to support selection will come and go, but the challenges they raise for implementers and users will likely persist. Several factors shape the potential impact of technology. Some of these issues are unique to technology deployment (e.g., the equivalence of assessments across media), whereas others may be inherent to the endeavor of personnel selection (e.g., test security), but their impact may be magnified by the use of technology. In the sections that follow, we review the common practices and research on the issues that arise due to the use of technology in the selection process.

Equivalence

As technology-based selection may utilize a variety of formats and tools, the question of cross-mode equivalence must be addressed. Until about the last decade, the equivalence between paper and computerized assessment was a common concern because computer tests were often derived from legacy paper tools, so the comparability of the psychometric characteristics of the old instrument to the new one was essential to establish. Research supported the paper–computer equivalence of power (unspeeded) cognitive ability tests and noncognitive assessments—in particular, personality and biodata (Bartram & Brown, 2004; Salgado & Moscoso, 2003)—but not speeded cognitive tests (Mead & Drasgow, 1993), where score differences likely resulted from the examinees’ interactions with the test materials and input devices. Other studies indicated the need for continued caution when computerizing some types of tests (e.g., Ployhart, Weekley, Holtz, and Kemp, 2003, on situational judgment tests). Reviewers have urged more within-group, repeated-measure studies of testing modality (Potosky & Bobko, 2004) and more systematic study of the factors that might affect equivalence (Stone, Lukaszewski, Stone-Romero, & Johnson, 2013).

Now, with the ever-broadening variety of input devices and operating software, the question has moved on to Technology X versus Technology Y test equivalence, and the importance of understanding in what way, if any, the inevitable new technology introduces undesired test variance. Trying to make a new test approximate older technology to achieve equivalence is likely to be a step backward; it would be better to develop some theoretical groundwork or framework explaining equivalence across testing formats (Potosky, 2008). Borrowing in part from Barry and Fulmer’s (2004) theory on the use of communication media, Potosky (2008) offers one such framework, in which the test is the medium of exchange (between providers and examinees), and social bandwidth, interactivity, surveillance, and transparency are factors that influence this exchange. Social bandwidth refers to the amount of informational cues that are used, such as the


use of audio to simulate aspects of an interview, and interactivity refers to the exchange of information (e.g., rapid, reciprocal, synchronous interactive simulations vs. slow, one-sided, asynchronous fixed-form multiple-choice tests). Surveillance refers to the possibility of outside monitoring, and transparency refers to the fidelity and clarity of the test content that can be conveyed without distraction—for example, by the test controls or interface (Potosky, 2008) and by the environment when mobile technology is used (Illingworth, Morelli, Scott, & Boyd, 2015). Using such a framework will be an improvement on rough classifications by test mode or format (e.g., paper, computer, tablet, and phone), particularly as the technology delivery methods multiply.

The primacy of equivalence analysis as a prelude to broader usage in a selection context is evident whenever new technologies and delivery platforms emerge. The latest wave of comparisons focuses on the use of small-platform mobile devices. Several recent studies have compared results from mobile users to those using full-screen platforms such as laptops and desktops. Huff (2015) found that scores on a personality measure did not vary between mobile and computer-based administrations using a within-subjects design. Arthur, Doverspike, Muñoz, Taylor, and Carr (2014), using a natural sample of job applicants, similarly found no differences in personality scores across platforms, but scores on a speeded test of general mental ability did vary across formats. However, demographic differences between the format groups as well as the speeded nature of the cognitive measure may have impacted these findings. Morelli, Mahan, and Illingworth (2014) examined construct equivalence across formats and found few differences across mobile and computer platforms for personality, cognitive ability, biodata, and situational judgment measures, but these authors did note mean differences on the situational judgment test, perhaps as a result of text-heavy stimuli in that measure. Similarly, Illingworth et al. (2015) did not find differences across platforms for personality and biodata measures.

Perhaps the most remarkable finding from studies of measurement equivalence across technology platforms is the robust consistency of most measures. Despite the many variables that can be imposed by screen size, keyboard controls, connection speed, browser format and functions, and assorted other potentially moderating conditions, most media comparisons have found few differences in how typical selection measures operate. This observation does not diminish the need to make sensible design decisions and to confirm equivalence across likely delivery formats, but as long as obvious interactions between assessment conditions and delivery format are avoided, most measures seem to generalize readily across available formats. The availability of responsive web design formats (which automatically scale to the user’s media format) should further support this general finding.
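A minimal sketch of one piece of an equivalence analysis appears below: comparing mean scores across two delivery platforms with a standardized mean difference (Cohen's d). The scores are fabricated placeholders; a real equivalence study would also examine reliabilities, score distributions, and measurement invariance, as the studies cited above did.

```python
from statistics import mean, stdev

# Minimal sketch: standardized mean difference (Cohen's d) between scores
# collected on two delivery platforms. Scores below are fabricated placeholders;
# a full equivalence study would go well beyond a single effect size.

def cohens_d(group1, group2):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)
    pooled_sd = (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5
    return (mean(group1) - mean(group2)) / pooled_sd

if __name__ == "__main__":
    desktop_scores = [24, 27, 25, 30, 28, 26, 29, 31]
    mobile_scores = [23, 26, 27, 25, 28, 24, 27, 29]
    print(round(cohens_d(desktop_scores, mobile_scores), 2))
```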

Deployment Strategies

How will potential job applicants access the steps used to recruit and select them into an organization? What are the side effects of the procedures used to allow them access? For example, will screening procedures be available to anyone who lands on a corporate recruiting site; and if so, how will you know the person completing the process is the same person who shows up for the interview? These questions have formed the basis for a sizable volume of research and commentary as the use of technology in selection rapidly accelerated in the early 2000s.

The options are perhaps best summarized by the International Test Commission’s Guidelines for Computer-Based and Internet-Delivered Testing (2006). The Guidelines delineated four common strategies for deploying Internet-based assessment tools based on the level of oversight and control asserted over the assessment process. In brief, these strategies fall into the following categories:

• Open access. The assessment can be accessed via the Internet from any location with no authentication of the user (i.e., proof that the participant is who she/he claims to be) and no direct supervision of the administration of the assessment.
• Controlled delivery. The assessment is made available only to known participants (e.g., by sending a one-time access invitation to screened candidates), yet no direct authentication or supervision of the assessment session is involved.


• Supervised delivery. The identity of the assessment participant can be authenticated (e.g., by requesting ID, agreeing to video supervision, or passing a biometric test), and there is a degree of direct supervision over the administration.
• Managed delivery. The assessment session is highly controlled, often through the use of dedicated testing centers, where there is oversight over authentication, access, security, the qualifications of the administrators, and the technical specifications of the computers used to deliver assessments.

The first two options in the list above are variations of what has come to be known as UIT (Unproctored Internet Testing), and much has been written about the practical and ethical implications of implementing selection systems where the potential for gaining unfair advantage is heightened by the delivery conditions (e.g., Burke, Mahoney-Phillips, Bowler, & Downey, 2011; Ryan & Ployhart, 2014; Tippins, 2009). Despite the risks, it can be argued that organizational selection processes have never been perfect; the use of technology to accelerate and broaden the selection process merely amplifies existing threats. Several questions are important to evaluate as the concerns raised by the various access methods are assessed: Is the person responding to a selection step the same person who will show up to work? Have sensitive assessments been compromised and/or has the opportunity to cheat been increased? Are candidates given the same opportunity to perform regardless of how they access the selection steps?

Examinee Identification

There is no foolproof way to verify that the person taking the test on the Internet is the actual candidate, but a number of techniques have been developed to reduce the risk of the substitution of a confederate responder during the selection process. One common technique sidesteps the issue by requiring minimal identification during early selection stages (often deployed with “controlled delivery” techniques), but candidates are warned that similar measures will be used later in the process under supervised conditions and, once the second test is administered, similar scores will be expected (e.g., Burke et al., 2011). These procedures provide some discouragement and safeguards against confederate test takers, but they raise questions about the appropriate treatment of cases where discrepant scores between administrations are found; typically, the most secure administration is treated as the score of record, and no further action is taken to investigate the difference. Of course, imperfect test reliability dictates that a portion of all test takers will receive a different result (e.g., pass or fail) on a second administration of a measure, and the probability of a different result will be higher for candidates who score near the decision cut point.

Technical solutions to the problem of examinee identification have been deployed and will become easier to implement as the required hardware and supporting technologies become more prevalent. Foster (2009) describes several identification techniques ranging from keystroke analytics, where the examinee’s typing cadence is analyzed and confirmed across registration and assessment events, to the use of webcams and data forensics, such as the analysis of response patterns and latencies. Many devices are now packaged with cameras and fingerprint readers that can be used for establishing identity.

The problem of remote identification will likely persist despite these advances. Careful implementation of the various stages of selection can help minimize issues associated with security and identity compromises, for example, by placing short measures of verifiable biographical information early in the selection process where open or controlled delivery is used. More sensitive measures with a bigger impact on final selection can then be deployed on a smaller population of prescreened candidates under supervised or managed conditions.

Test Security and Cheating

Security is a concern for both the users and publishers of assessments. The user might be concerned about a compromise to a carefully designed process necessitating redesign; test publishers are additionally concerned about the loss of intellectual property and the loss of value of a carefully designed product. Although the test delivery system can block the ability to copy the


test using the local computer’s operating system, there is no stopping the determined candidate from using other technology to circumvent these security features (e.g., by taking pictures of the computer screen). Test publishers typically include a statement to which the candidate must agree before accessing the test, stipulating that he or she will take the test honestly and will not distribute the information or risk disqualification. The organization also may attempt to limit access to the test to specific devices, conditions, and time periods, and to candidates who have been extensively prescreened. Strong countermeasures, such as regular web patrols to locate compromised test content on the Internet, are also critical for maintaining the security of testing materials.

Candidates who are motivated to cheat can often find ways to do so, and Internet-based testing provides increased opportunity for unethical advantages to be gained. There have been several attempts to quantify the frequency and impact of cheating on test results. In one frequently cited study, unproctored test scores were compared to proctored scores, and when a criterion of 1 standard error of measurement (SEM) score difference across conditions was applied, 7.8% of the subjects were identified as cheaters (Arthur, Glaze, Villado, & Taylor, 2010). These authors later comment that this figure is a likely underestimate (Arthur & Glaze, 2011). Others estimate higher rates of cheating, but the notion of a stable base rate for cheating is probably misguided. Malfeasant test-taking behavior likely varies based on the stakes associated with the exam, the deployment strategy used to deliver it, and the moderation provided by various countermeasures.

The prospect that dishonest test-taking strategies may have a substantial impact on the resulting group of selected candidates has refreshed interest in the use of adaptive testing (e.g., McCloy & Gibby, 2011) and variations such as linear-on-the-fly testing (e.g., Burke et al., 2011). These techniques have the dual benefit of reducing cheating, because each candidate can be given a different set of test questions, and increasing security, because the exposure rate of each test question can be monitored and controlled.
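The sketch below illustrates the score-verification logic referenced above: the standard error of measurement is derived from the test's standard deviation and reliability, and a candidate is flagged when the supervised verification score falls more than one SEM below the unproctored score. The numeric values are hypothetical, and such a flag is a trigger for review rather than proof of cheating.

```python
import math

# Hedged sketch of SEM-based score verification for unproctored testing.
# Test statistics and candidate scores are hypothetical.

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def flag_for_review(unproctored, proctored, sd, reliability, criterion=1.0):
    """Flag when the proctored score is more than `criterion` SEMs lower."""
    sem = standard_error_of_measurement(sd, reliability)
    return (unproctored - proctored) > criterion * sem

if __name__ == "__main__":
    # e.g., scale SD = 5.0, reliability = .90 -> SEM of about 1.58
    print(flag_for_review(unproctored=32, proctored=29, sd=5.0, reliability=0.90))  # True
```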

Test Delivery Standardization

Internet-based test deployment broadens the range of conditions under which tests are taken. Remote deployment strategies inherently limit the ability of the test administrator to control the test environment, leading to situations that may disadvantage some candidates. In addition to the distractions that may be present in an unsupervised setting, unforeseen hardware and software issues may arise. Differences among examinees’ computers and Internet connection speeds could affect testing, particularly on speeded tests. The imperfect remedy for this situation is to limit the test to deployment on devices that meet predetermined specifications, a process that can be automated prior to the initiation of the test. This practice has disadvantages because some number of otherwise qualified applicants may be discouraged from continuing with the selection process if the required hardware, software, and connectivity conditions are difficult to obtain. To make matters worse, availability of the required system features may be correlated with the demographics of the candidates (e.g., lower-income candidates may have access only through wireless mobile devices). For this reason, test providers are often reluctant to place strong restrictions on access, thereby increasing the variability in testing conditions across candidates. A common practice for resolving this dilemma is to provide strong recommendations to candidates about the appropriate testing conditions and let the users’ judgment guide the degree to which they are able to replicate the desired testing conditions.

As shown in this section, unproctored testing brings ethical issues to the forefront. The threat of cheating or unfair conditions creates an ethical dilemma for I-O psychologists, who have an obligation to ensure the quality and standardization of assessments. Professional standards and guidelines for testing dictate the importance of material security, standardization of the test environment, and the control of factors that may impact test performance aside from the construct being assessed (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 2014; Society for Industrial and Organizational Psychology, 2003). However, operational pressures push toward remote deployment methods because they allow employers to efficiently gain insight into job candidates before they are brought on-site for more expensive activities. Selection practitioners must balance these pressures as they manage the quality and fairness of the programs they design.
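As a simple illustration of automating such a pre-test check, the sketch below compares a device's reported characteristics against minimum specifications before a timed test is launched. The specification values and the keys of the reported-device dictionary are assumptions; real systems typically gather this information from the browser or a test-delivery client.

```python
# Hedged sketch of a pre-test device check. Minimum specifications and the
# keys of the reported-device dictionary are illustrative assumptions.

MINIMUM_SPECS = {
    "screen_width_px": 1024,
    "connection_mbps": 5.0,
    "supported_browser": {"chrome", "firefox", "edge", "safari"},
}

def device_check(device):
    """Return a list of problems; an empty list means the device passes."""
    problems = []
    if device.get("screen_width_px", 0) < MINIMUM_SPECS["screen_width_px"]:
        problems.append("screen too small for the test interface")
    if device.get("connection_mbps", 0.0) < MINIMUM_SPECS["connection_mbps"]:
        problems.append("connection speed below recommended minimum")
    if device.get("browser") not in MINIMUM_SPECS["supported_browser"]:
        problems.append("unsupported browser")
    return problems

if __name__ == "__main__":
    reported = {"screen_width_px": 390, "connection_mbps": 12.0, "browser": "safari"}
    print(device_check(reported))  # ['screen too small for the test interface']
```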



Implementation Effectiveness

I-O psychologists must consider a range of issues when implementing technology-driven selection systems, including development, administration, and support. Designing and managing technology-based selection systems involves skills that are related to the analytical, decision-making, and organizational change skills of an I-O psychologist, yet it also requires skills that are less central, such as business acumen and IT expertise. Development includes making a business case for purchasing or building a solution, acting as an information architect for the new system, and managing the transition to new systems. A model for guiding the choices involved in the implementation of technology-based hiring systems was offered in Reynolds (2011); a synopsis of these considerations is provided in this section.

Justification for a technology system will often rely upon reduction of labor costs and improved efficiency associated with new or improved automation. One of the first considerations in the business case is whether to buy or build; the choice affects the time horizon for the implementation. Now that there are numerous commercial off-the-shelf solutions with a range of functions, organizations may seek the assistance of vendors in making the business case for them (e.g., by sending out a formal or informal request for information (RFI) prior to soliciting bids). When seeking bids, the costs of any customization should be clearly identified to the extent it is possible to do so. Many factors can drive the need for customization, including creating customized score reports, migrating tests or examinee data onto the vendor’s platform, and setting up systems that mirror the company’s organizational and/or regional structures (Kehoe, Dickter, Russell, & Sacco, 2005). Although many organizations may see benefits to customization, it should also be recognized that there are significant drawbacks, usually in higher upfront costs and ongoing maintenance fees, because the resulting software is divergent from the provider’s standard platform. For these reasons, configuration of available options within a system is usually preferable to customization of software. Table 39.1 provides general guidance about the categories of costs associated with technology-based selection systems.

I-O psychologists implementing selection technology within organizations should keep in mind three levels of users with a stake in the day-to-day operation of the system: the candidate, the HR manager or administrator, and the manager or supervisor receiving the candidates (Gilliland & Cherry, 2000).

TABLE 39.1 Costs of Implementing Technology-Based Selection (sources of cost and examples)

Development
• Software development, customization, and/or configuration
• Technology integration (with applicant tracking systems, enterprise resource planning, etc.)
• Equivalency studies
• Hardware (as applicable; e.g., computers, tablets, kiosks)

Deployment
• Field testing/quality assurance
• System hosting fees
• Installation costs if locally hosted, or costs for software-as-a-service for cloud-based solutions
• Account setup (process flows, permissions, reporting requirements)
• Training

Maintenance
• Tracking and reporting
• Schedules for upgrades
• Security monitoring
• Backups and failure recovery


Flowcharts should be developed to map the current and desired processes and features and to understand them from each type of stakeholder’s point of view (Kehoe et al., 2005). The solution’s functionality will need to be scalable and flexible. For example, it should be possible to implement the standardized solution without new development or workarounds to accommodate different departments in the organization (scalability). The solution must also be adaptable to meet future requirements and technology upgrades.

Kehoe et al. (2005) discussed questions the I-O psychologist should ask when developing a technology-based selection system. These include how the administrative rights to the system will be managed, how candidates will gain access to tests, how test security will be assured, whether the system can be configured to apply the organization’s test policies (such as retests or disability accommodation), and how test results will be stored and communicated.

The organization’s available hardware/software and IT infrastructure are also key considerations. The technology’s requirements (e.g., operating systems, browsers) must be compatible with the organization’s special requirements (such as security protocols and firewalls). Whether the solution is created in-house or procured from a vendor, the IT department must assist with its implementation.

Managing the Implementation

Implementing a technology solution can be a complex project, from the design of the new selection process (as described in other chapters in this volume) to managing the technology. These projects involve software development and database administration, two skills that are not standard in I-O training. The plan for implementation should include specifications of functionality, whether for building the system or understanding how an off-the-shelf solution will work with the organization’s hardware and software. The new system also must accommodate any legacy processes and data. The more complex the organization’s technology infrastructure, the more fine-grained the details should be about software functionality. Failing to specify the most critical software requirements ahead of time could delay or derail the project.

Because of the high-stakes nature of selection and the precision required for selection systems, a quality assurance process plan is also essential. The system should be beta-tested to make sure the software functionality is intact and in line with user expectations. The selection software may also need to integrate with other systems the organization may be using—whether other selection systems or related systems (e.g., applicant tracking or career management tools). Once configured, an application program interface (API) will allow these programs to exchange data automatically.
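To make the integration point concrete, the sketch below posts a completed assessment result to a hypothetical ATS endpoint using the widely used requests library. The URL, payload fields, and authentication header are placeholders; a real integration would follow the specific vendor's API documentation and security requirements.

```python
import requests

# Hedged sketch of pushing an assessment result into an ATS over a REST API.
# The endpoint URL, payload fields, and token are placeholders, not a real API.

ATS_ENDPOINT = "https://ats.example.com/api/v1/assessment-results"  # placeholder
API_TOKEN = "replace-with-a-securely-stored-token"                  # placeholder

def send_result(candidate_id, assessment_id, score, status):
    """POST one candidate's assessment outcome to the (hypothetical) ATS."""
    payload = {
        "candidate_id": candidate_id,
        "assessment_id": assessment_id,
        "score": score,
        "status": status,  # e.g., "pass" or "fail"
    }
    response = requests.post(
        ATS_ENDPOINT,
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    send_result("cand-123", "mech-apt-01", 42, "pass")
```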

Administration

When administering the system, the organization must pay special attention to the ways in which information is accessed, processed, and stored. Different types of users will have varying levels and methods of access to the selection tools and systems. The organization must have policies detailing when, or if, test information is available to a user (i.e., applicants, HR staff, and hiring managers). I-O psychologists also must decide to what extent the information will be processed by rules and automation, and to what extent HR experts will be involved to make judgments and carry out transactions. The method of storage partly determines the administration possibilities. Example configurations are provided in Table 39.2.

TABLE 39.2 Administration of Technology-Based Selection Systems (functions and example configurations)

Role-Based Access
• External applicants are granted test access only after a screening and approval process categorizes them as viable candidates.
• Internal applicants, but not externals, are provided with a feedback report about results.
• Hiring managers can access only those candidates who are eligible for interviews.
• Test administrators have basic privileges (access to deliver tests and see only pass/fail results).
• HR managers can view summary reports.
• Analysts and I-O psychologists can run database queries of raw data and conduct item analyses.
• Business managers and HR generalists can access workforce and adverse impact analyses directly.

Automation of Rules and Reports
• Rule-based progression of passing candidates to the next step in the selection process
• Automatic communications triggered to candidates and stakeholders based on status within the selection system

Data Storage and Archiving
• Database structure that permits data integration across systems, such as assessments from different vendors
• Automatic archiving of assessment data after a specified usable lifespan

Supporting the Implementation

The selection system will not be a static entity. As technology progresses, the organizational structure changes, and HR processes are upgraded, there will be a need for consulting and dedicated staff time in the form of technical support, ongoing maintenance, and user training. Table 39.3 highlights examples in which each type of service may be needed. When provided by a vendor, these support services should be included in a service-level agreement as part of the contract for the vendor’s services.

Importantly, the overarching trend that influences development, implementation, and support is technology integration. Many technology providers are building market share by supporting a broader array of HR functions across the “talent management lifecycle,” requiring the integration of data from recruitment and organizational entry through management and career progression.

TABLE 39.3 Support Services for Technology-Based Selection Systems (services and example needs)

Technical Support
• Services to each major group of stakeholders (i.e., candidates, HR staff, and business managers)
• Candidate support: information on testing technology, troubleshooting access to tests
• HR staff encountering technical difficulties with aggregate reporting
• Business managers seeking guidance about access to reports

Maintenance
• Ongoing maintenance, updates, and revisions (e.g., content and feature changes to hosted websites) that may require downtime
• Commonly and increasingly, organizations may need to adjust to an agile development approach where software is updated continuously instead of in occasional bulk releases

Staff Training
• Multiple methods for training, including live sessions, websites with frequently asked questions, and self-guided training modules
• Retraining readily available to account for organization changes, software upgrades, staff turnover
• Guidance on integrated systems when hiring data are merged with other HR systems


It is reasonable to expect that in the future it will be rare for a technology-based selection system to be implemented and administered in isolation from other systems. Some talent management systems already integrate a broad range of HR data into a single resource planning database for use in making strategic decisions about human capital. This trend has several implications for I-O psychologists. Researchers and practitioners will be able to obtain data that are harder to come by today (e.g., performance and return-on-investment data for use in validation studies, program evaluation, and employee-organization linkage research). Psychologists will also be able to earn a broader role in strategy setting and decision making if they can use this information in an analytical, forward-thinking way to help inform organizational decisions.

Globalization

Technology-based systems enable broader and more standardized application of selection procedures across a wide geography. This provides many advantages for large organizations that seek to centralize the operation of their selection process. Internet deployment also means selection system designers need to consider a range of cultural factors as these systems are constructed. Caligiuri and Paul (in Chapter 36 of this volume) detail many of the considerations faced by multinational organizations in this regard. Our short summary focuses on the conditions that system designers should consider as they deploy assessments over the Internet when applicants from across cultural boundaries are expected.

At the root of the potential concerns that arise when selection processes are broadly available over the Internet is the fact that people are being compared (either to each other or to a common standard) when a portion of the differences between them are due to different cultures, languages, or a combination of both of these factors. Unless these differences are acknowledged and managed, the quality of the resulting decisions will be diminished. By attending to the variables across which the selection system is intended to generalize, the potential for bias and harm due to poorly calibrated assessment can be reduced. The procedures for properly adapting assessments across languages and cultures can be technically complex and time-consuming, and laypersons involved with the selection process will often underestimate the impact of these broad variables as well as the complexity of the adaptation process, creating challenges for the selection practitioner. The professional standards cited earlier, as well as the International Testing Commission’s International Guidelines for Test Adaptation (2005), describe the obligations for the assessment professional as well as the steps to consider when making adjustments for language and culture. Assuming the job requirements and working conditions are similar across locations, most cross-culture and/or cross-language applications can be supported with techniques such as re-norming to appropriately support the purpose of the assessment, translation (using procedures designed for use with assessments), local validation, and construct equivalence studies when sample sizes allow.

Ryan and Ployhart (2014) note in their review that the globalization of selection processes is a recent phenomenon, and there is a great need for more research in this area. The critical issue for the selection specialist is to recognize the complexities involved and plan a course of research to support the types of comparisons and generalizations being made. One recent study (Lievens et al., 2015) examined a myriad of issues involved with transporting a situational judgment test across cultures; after careful translation and comparability analyses, the measure was found to operate with a reasonable degree of consistency across two cultures, but the study provides a detailed example of the issues that can arise as tests are generalized in this manner.
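As a small illustration of the re-norming step mentioned above, the sketch below converts a raw score to a z-score and percentile against a locally collected norm group. The norm values are fabricated, and deciding when local norms are appropriate is itself a professional judgment.

```python
from statistics import NormalDist

# Hedged sketch of re-norming a raw score against a local norm group.
# Norm means and standard deviations below are fabricated for illustration.

LOCAL_NORMS = {
    "country_a": {"mean": 50.0, "sd": 10.0},
    "country_b": {"mean": 46.0, "sd": 9.0},
}

def local_standing(raw_score, norm_group):
    """Return (z-score, percentile) of a raw score within the chosen norm group."""
    norms = LOCAL_NORMS[norm_group]
    z = (raw_score - norms["mean"]) / norms["sd"]
    percentile = NormalDist().cdf(z) * 100
    return z, percentile

if __name__ == "__main__":
    z, pct = local_standing(raw_score=55, norm_group="country_b")
    print(round(z, 2), round(pct, 1))  # 1.0 84.1 (approximately)
```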

Data Usage

As the use of the Internet for recruitment and selection extends the reach of these processes to a worldwide audience, organizations must comply with the various international rules that


apply to the transfer of individual data across national borders. The political and legal context for online privacy and data security has become more complicated in recent years, as high-profile data breaches have become commonplace and revelations of government-supported data monitoring have become public. These incidents expose the different approaches various countries have taken toward privacy protections for their citizens. Perhaps the most advanced framework for data privacy was established by the European data protection regulations. These rules are relevant for technology-based selection systems because they govern the cross-border transfer of personal information. For example, online tools that allow job seekers in Europe to apply for a job in Europe might be hosted on computers located in the United States; this process involves collecting personal data and transferring these data across borders. By engaging in this activity, an organization could potentially be in violation of the domestic laws of the European Union (EU) Member States. These laws were implemented as a result of the EU Directive on Data Protection, which aims to prohibit the free flow of personal information from EU nations to countries deemed to have inadequate privacy protection, such as the United States. The Directive, which went into effect in 1998, underscores the differences between the cultures and legal systems of Europe and the United States. In Europe, privacy protection is viewed as a personal right. To protect this right, the various EU Member States have, over the past several decades, enacted legislation administered through government data protection agencies. The Directive's primary intended purpose is to set minimum privacy protection standards for each of the EU Member States and to make it easier to transfer personal data within the EU. In the United States, by contrast, the protection of private information is viewed less uniformly, with differing standards for varying circumstances; therefore, privacy protection in the United States is guided more by limited legislation and regulation, and by self-regulation.

Organizations that seek to deploy Internet-based HR systems that involve international data transfers have several compliance options. Two approaches are common. First, organizations can establish a data handling agreement under contract directly with their European partners and/or clients that governs how data will be handled. As long as the procedures are consistent with EU regulations, the data transfers handled under these contracts are allowed. Second, U.S.-based organizations may join a U.S. Department of Commerce program that certifies them as a safe harbor for personal data. This certification states the organization's willingness to adhere to seven Safe Harbor Privacy Principles that the Commerce Department negotiated with the EU. This program, and the EU laws to which it relates, are described at www.export.gov/safeharbor. Unfortunately, the status of this program was challenged in 2015 by a European court ruling that disallowed the Safe Harbor under EU law ("The Court of Justice," 2015). At the time of this writing, a revision to the program has just been developed and is in the process of being implemented. The core privacy principles upon which the EU laws were based remain intact, but the new regulations require enhanced transparency and monitoring, and the penalties that may be imposed by EU authorities are now substantially higher ("European Commission unveils EU-US Privacy Shield," 2016). In brief, the seven privacy principles that form the basis for allowable data transfer are as follows:

1. Notice: Individuals must be informed, as early as possible and in unambiguous language, about the organization's reasons for collecting and using their personal information.
2. Choice: Individuals must be allowed to decide if and how their information is to be used or disclosed to third parties beyond the purpose originally specified and authorized by the organization collecting the information.
3. Onward transfer: Personal information may only be transferred to a third party under the Notice and Choice conditions specified above. One organization can transfer data to another without participant assent only if the third-party organization is also qualified as a safe harbor or otherwise satisfies the requirements of the Directive.
4. Access: Within logistical reason, individuals must have access to and be able to correct, add to, or delete their personal information where it is deemed inaccurate.
5. Security: Data must be reasonably protected from loss, misuse, unauthorized access, and disclosure.
6. Data integrity: Personal information must be relevant, reliable, accurate, current, complete, and used only for the purpose for which it was collected and authorized by the individual.
7. Enforcement: Organizations must provide mechanisms for complaints and recourse, as well as procedures for verifying adherence to the safe harbor principles and remedying any problems.

The liabilities associated with these responsibilities need to be carefully examined with respect to any HR processes, and online recruitment and selection processes are of particular concern because of their broad reach to the public at large. Companies that use online recruitment and selection processes should be aware of these privacy considerations and take steps to ensure their online tools are compliant with the latest legislation in the regions in which they operate; otherwise, Internet-based systems that collect information broadly from job seekers will raise substantial risks associated with liability for data processing and transfer.

New Measurement Methods

This is an exciting time for selection system developers and researchers, as the amount, variety, and interconnected nature of data bring new opportunities for measurement and for improving the validity of selection procedures. Chapter 44 covers some emerging technologies in detail, though it is impossible to know what other innovations might be germinating at a technology firm's laboratory (or in an entrepreneur's garage) that could have implications for selection. This section skims the surface and offers a few observable themes. Awareness of current trends and emerging themes will help I-O psychologists apply new technology to selection and assessment systems, as well as lead implementation and research efforts that are cutting-edge, theory-based, valid, practical, and fair.

How can technology continue to enhance the practice of assessment? A few themes are apparent from the progress that has been made already. Technology can be used to collect more data faster and aggregate it into bigger data sets than we have worked with before. It will support the development of more powerful analytic and algorithmic tools, and it can enable the collection of smaller slices of behavior or wider samples of behavior. Technology can alter the stimulus or response format to increase fidelity or engagement. It can gather new types of response information in new contexts that could assist with measurement. Technology can also support greater interconnectedness across other organizational functions. These possibilities are evident in advancements already being witnessed in some assessment contexts:

1. Data Supply: The promise of large amounts of data, perhaps replenished frequently, whether from within an organization or gathered from Internet sources, will transform how we work and live. Social media will continue to connect our professional and personal lives, as statements we make publicly online are put to use in understanding our personalities, interests, potential for success at work, and predicted attraction to specific job openings.
2. Aggregation: A related theme is the centralization of data. It is becoming common for organizations to link assessment and other HR data (e.g., learning and development processes) with enterprise data to demonstrate how talent pipelines help generate organization-level outcomes. Aggregation may also take place in the broader economy, with centralized marketplaces (e.g., LinkedIn) matching candidates and competencies to job opportunities and recommending avenues for professional development.
3. Analytics and Algorithms: New analytic techniques and software will need to be developed to process the tremendous supply of data. Innovations could encompass everything from machine learning to improved predictive algorithms to the use of increasingly complex artificial intelligence (AI) programs. (It is tempting to speculate on the extent to which such AI might one day be involved in the research and development cycle, from meta-analysis of published empirical findings to model development, assessment delivery, analysis and refinement, and the publication of results. How much of an educated individual's job, including that of an I-O psychologist, might one day be aided by an artificial agent?)
4. Novel Data Types and Micro-behaviors: We are able to collect information in small bits and from different devices and integrate this information. Micro-behaviors such as mouse-over hover times (the time spent with a mouse above a link before deciding whether to click it) and response latencies are currently observable, and new uses for current devices are possible (e.g., smartphone accelerometers that measure movement for motor skills assessment). Soon there will be other, more complex responses that can be interpreted by a computing device, such as eye tracking and facial recognition of micro-expressions (see the sketch following this list). We will grow accustomed to data collection and connectivity between everyday devices that contain computer chips feeding connected databases. Wearable technologies bring both the possibility of biometric data (e.g., heart rate) and control methods (e.g., brain wave sensors/biofeedback to control devices).
5. Virtual and Augmented Reality: High-fidelity simulations will become even more realistic, and where desired, information may be provided to a virtual-reality test taker (e.g., labels on objects in view) to augment the experience, whether to clarify or to enhance its complexity. One can imagine how the stimulus or response fidelity of a virtual experience, with video, audio, and other sensory information, could transform the realism of a work-sample simulation and inspire the participant to perform more as he or she would in a natural setting.
6. Gamification: As concern over applicant reactions to technology (e.g., Bauer, Truxillo, Mack, & Costa, 2011) yields to competition over applicant engagement, there is increasing interest in using game-like features in the hiring process. It may become common to use games as passive recruitment tools or as low-stakes assessments. Candidates might not know that a particular activity, perhaps a learning game, contains an assessment, or even that they are being recruited to become a candidate. This trend is explored at length in Chapter 44.
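To make the micro-behavior theme above concrete, the hypothetical Python sketch below derives two of the features just mentioned, mouse-over hover time and response latency, from a timestamped event log. The event format and field names are our own assumptions for illustration; real assessment platforms log richer and differently structured telemetry.

```python
# Hypothetical event log: (seconds since item onset, event name, target)
events = [
    (0.0, "item_displayed", "item_17"),
    (2.4, "hover_start", "option_b"),
    (3.9, "hover_end", "option_b"),
    (5.1, "click", "option_b"),
]

def hover_time(log, target):
    """Total time the pointer dwelled over a given on-screen target."""
    starts = [t for t, e, tgt in log if e == "hover_start" and tgt == target]
    ends = [t for t, e, tgt in log if e == "hover_end" and tgt == target]
    return sum(e - s for s, e in zip(starts, ends))

def response_latency(log):
    """Elapsed time from item onset to the first click event."""
    onset = next(t for t, e, _ in log if e == "item_displayed")
    first_click = next(t for t, e, _ in log if e == "click")
    return first_click - onset

print(round(hover_time(events, "option_b"), 2))  # dwell time on option B
print(round(response_latency(events), 2))        # time to respond to the item
```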

Whether each of these themes will yield a selection technology that one day results in improved predictive validity remains to be seen, and all carry practical constraints and ethical considerations that must be taken into account. In some cases, a particular innovation may also cut across one or more themes to provide solutions that are truly new and that change expectations regarding the validity and utility of selection systems.

CONCLUSION

Looking ahead, the practice of employee selection can be expected to continue to experience rapid change as a result of technology advancements. The chapters in this section of the Handbook consider the critical issues raised in this chapter in more detail, such as the cybersecurity of selection processes amid persistent threats from determined hackers (Chapter 41), the current availability of Big Data sources and analytic techniques for mining information about candidates and employees (Chapter 43), the implications for employee selection of proliferating mobile devices and the gamification of organizational systems (Chapter 44), and updates on the latest advancements in job classification in a technology-fueled labor market churning out new types of jobs (Chapter 40). This chapter and those that follow should be read with the caveat that it is challenging to survey a changing landscape that outpaces the research literature. Much of the existing literature that deals with technology in the hiring process is results-focused and practice-oriented, so there is great opportunity for programmatic research and the development of related psychological theory. To generate further research interest, we have highlighted some of the issues raised in this chapter in the categorized list below.

• Equivalence: Use within-group, repeated-measure studies of technology-based tests and assessments to investigate the equivalence of assessments across multiple technology modes of administration.
• Assessment Environment: Using quasi-experimental field studies, classify the factors in test modality (e.g., delivery technology, administration environments such as mobile access) and usage conditions (e.g., proctored/unproctored, high-stakes) that influence scores, pass rates, and validity in order to develop a taxonomy of influences and acceptable administration protocols. Develop and test theory-based explanations for underlying similarities and differences.
• New Measurement: Determine the assessment value and opportunities added by virtual-reality simulations, tracking various micro-behaviors, and using novel assessment data types, including biometric data.
• Big Data, Aggregation, Analytics, and Algorithms: Study the practicality, benefits, and ethics of using the data that are being amassed from the Internet.
• Applicant Engagement: Study the conditions that lead job seekers to engage with and persist in assessment processes. How can longer interactions with technology-based assessment systems be encouraged in order to provide more data to support the validity and insights derived from the experience? How can potential applicants be identified and drawn into the process more easily?



REFERENCES

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Arthur, W., Jr., Doverspike, D., Muñoz, G. J., Taylor, J. E., & Carr, A. E. (2014). The use of mobile devices in high-stakes remotely delivered assessments and testing. International Journal of Selection and Assessment, 22, 113–123.
Arthur, W., Jr., & Glaze, R. M. (2011). Cheating and response distortion on remotely delivered assessments. In N. T. Tippins & S. Adler (Eds.), Technology-enhanced assessment of talent (pp. 99–152). San Francisco, CA: Jossey-Bass.
Arthur, W., Jr., Glaze, R. M., Villado, A. J., & Taylor, J. E. (2010). The magnitude and extent of cheating and response distortion effects on unproctored internet-based tests of cognitive ability and personality. International Journal of Selection and Assessment, 18, 1–16.
Barry, B., & Fulmer, I. S. (2004). The medium and the message: The adaptive use of communication media in dyadic influence. Academy of Management Review, 29(2), 272–292.
Bartram, D., & Brown, A. (2004). Online testing: Mode of administration and the stability of OPQ 32i scores. International Journal of Selection and Assessment, 12, 278–284.
Bauer, T. N., Truxillo, D. M., Mack, K., & Costa, A. B. (2011). Applicant reactions to technology-based selection: What we know so far. In N. T. Tippins & S. Adler (Eds.), Technology-enhanced assessment of talent (pp. 190–223). San Francisco, CA: Jossey-Bass.
Burke, E., Mahoney-Phillips, J., Bowler, W., & Downey, K. (2011). Going online with assessment. In N. T. Tippins & S. Adler (Eds.), Technology-enhanced assessment of talent (pp. 355–379). San Francisco, CA: Jossey-Bass.
Byham, W. C. (October 2010). Rethinking assessment centers: Multiple variations to meet multiple needs. Keynote presentation at the 35th International Congress on Assessment Center Methods, Singapore.
Chambers, B. A., & Arnold, J. D. (2015). Using technology to improve the interview as a selection tool. Personnel Assessment and Decisions, 1(1), Article 7. Retrieved from http://scholarworks.bgsu.edu/pad/vol1/iss1/7
Court of Justice declares the Commission's US Safe Harbour decision is invalid. (October 6, 2015). Retrieved from http://curia.europa.eu/jcms/upload/docs/application/pdf/2015-10/cp150117en.pdf
European Commission unveils EU-US Privacy Shield. (February 29, 2016). Retrieved from http://ec.europa.eu/justice/newsroom/data-protection/news/160229_en.htm
Foster, D. (2009). Secure, online, high-stakes testing: Science fiction or business reality? Industrial and Organizational Psychology, 2, 31–34.
Gilliland, S., & Cherry, B. (2000). Customers of selection processes. In J. F. Kehoe (Ed.), Managing selection strategies in changing organizations (pp. 158–196). San Francisco, CA: Jossey-Bass.
Huff, K. C. (2015). The comparison of mobile devices to computers for web-based assessments. Computers in Human Behavior, 49, 208–212.
Illingworth, A. J., Morelli, N. A., Scott, J. C., & Boyd, S. L. (2015). Internet-based, unproctored assessments on mobile and non-mobile devices: Usage, measurement equivalence, and outcomes. Journal of Business and Psychology, 30, 325–343.
International Test Commission. (2005). International guidelines on test adaptation. Retrieved from https://www.intestcom.org/page/16
International Test Commission. (2006). Guidelines for computer-based and internet-delivered testing. International Journal of Testing, 6, 143–172.
Kehoe, J. F., Dickter, D. N., Russell, D. P., & Sacco, J. M. (2005). e-Selection. In H. Gueutal, D. L. Stone, & E. Salas (Eds.), The brave new world of eHR: Human resources in the digital age (pp. 54–103). New York, NY: Wiley & Sons.
Lievens, F., Corstjens, J., Sorrel, M. A., Abad, F. J., Olea, J., & Ponsoda, V. (2015). The cross-cultural transportability of situational judgment tests: How does a US-based integrity situational judgment test fare in Spain? International Journal of Selection and Assessment, 23, 361–372.
McCloy, R. A., & Gibby, R. E. (2011). Computerized adaptive testing. In N. T. Tippins & S. Adler (Eds.), Technology-enhanced assessment of talent (pp. 153–189). San Francisco, CA: Jossey-Bass.
Mead, A. D., & Drasgow, F. (1993). Effects of administration medium: A meta-analysis. Psychological Bulletin, 114, 449–458.
Morelli, N. A., Mahan, R. P., & Illingworth, A. J. (2014). Establishing the measurement equivalence of online selection assessments delivered on mobile versus nonmobile devices. International Journal of Selection and Assessment, 22, 124–138.


Park, G., Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Kosinski, M., Stillwell, D. J., Ungar, L. H., & Seligman, M. E. P. (2015). Automatic personality assessment through social media language. Journal of Personality and Social Psychology, 108, 934–952.
Ployhart, R. E., Weekley, J. A., Holtz, B. C., & Kemp, C. (2003). Web-based and paper-and-pencil testing of applicants in a proctored setting: Are personality, biodata, and situational judgment tests comparable? Personnel Psychology, 56, 733–752.
Potosky, D. (2008). A conceptual framework for the role of administration medium in the personnel assessment process. Academy of Management Review, 33(3), 629–648.
Potosky, D., & Bobko, P. (2004). Selection testing via the Internet: Practical considerations and exploratory empirical findings. Personnel Psychology, 57, 1003–1034.
Rampell, C. (January 22, 2014). Your next job application could involve a video game. New York Times. Retrieved from http://www.nytimes.com/2014/01/26/magazine/your-next-job-application-could-involve-a-video-game.html?_r=0
Reynolds, D. H. (2011). Implementing assessment technologies. In N. T. Tippins & S. Adler (Eds.), Technology-enhanced assessment of talent (pp. 66–98). San Francisco, CA: Jossey-Bass.
Reynolds, D. H., & Rupp, D. E. (2010). Advances in technology-facilitated assessment. In J. Scott & D. Reynolds (Eds.), Handbook of workplace assessment (pp. 609–641). San Francisco, CA: Jossey-Bass.
Reynolds, D. H., & Weiner, J. A. (2008). Online recruiting and selection: Innovations in talent acquisition. Malden, MA: Wiley-Blackwell.
Ryan, A. M., & Ployhart, R. E. (2014). A century of selection. Annual Review of Psychology, 65, 693–717.
Salgado, J. F., & Moscoso, S. (2003). Internet-based personality testing: Equivalence of measures and assessees' perceptions and reactions. International Journal of Selection and Assessment, 11, 194–203.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author.
Stone, D. L., Lukaszewski, K. M., Stone-Romero, E. F., & Johnson, T. L. (2013). Factors affecting the effectiveness and acceptance of electronic selection systems. Human Resource Management Review, 23, 50–70.
Tippins, N. T. (2009). Internet alternatives to traditional proctored testing: Where are we now? Industrial and Organizational Psychology, 2(1), 2–10.


40 ADVANCING O*NET DATA, APPLICATION, AND USES
DAVID RIVKIN, CHRISTINA M. GREGORY, JENNIFER J. NORTON, DENISE E. CRAVEN, AND PHIL M. LEWIS

The Occupational Information Network (O*NET™), sponsored by the U.S. Department of Labor (DOL), is a comprehensive system for collecting and disseminating information on occupational and worker requirements for 974 occupations covering the U.S. economy. O*NET development efforts were initiated in response to the changing world of work. Prior to O*NET, the DOL supported the Dictionary of Occupational Titles (DOT) (U.S. Department of Labor, 1991a, 1991b). The DOT, first developed in 1939 and last published in 1991, listed more than 12,000 occupations. However, the DOT was difficult to maintain, and there were questions about its relevance in the new world of work. The DOT was heavily weighted with blue-collar and manufacturing occupations, jobs were changing more rapidly than in the past, workers required new skills, and technology was advancing. Employers, educators, policy makers, workforce development professionals, job seekers, and others needed occupational information to make important career development and work life decisions (Miller, Treiman, Cain, & Roos, 1980; Committee on Techniques for the Enhancement of Human Performance: Occupational Analysis, 1999; National Center for O*NET Development, 2015). The O*NET System, developed to replace the DOT, uses a standardized common language of work. The system provides updated occupational information that is easily accessible to potential users, and it offers multiple windows of occupational information, allowing users to focus on the information they need. The first version of the O*NET database was released in 2001; the O*NET 20.3 database is currently in production. Since its initial release, the O*NET program has continued efforts to improve the database and its associated websites, products, and tools. O*NET information has gained increasing popularity over the last decade, being adapted and used by government, private- and public-sector organizations, and international users, as well as by individual job seekers and career explorers.

This chapter builds upon Peterson and Sager (2010). First, we briefly discuss DOL's evolution from supporting the DOT to its development of O*NET; this historical background was originally presented in Peterson and Sager (2010). We then present an overview of the O*NET Content Model and the O*NET-SOC Taxonomy, the foundations of the O*NET program, and focus on the multimethod approach to O*NET data collection, one of the keys to the program's ability to collect high-quality, usable data. Next, we present recent data enhancements that have made the occupational information more current and complete. We also describe the O*NET websites used to disseminate data to different types of interested parties and provide examples of how customers use O*NET data. Finally, we summarize O*NET program accomplishments, challenges, and possible future enhancements.



MOVING FROM THE DOT TO O*NET

The DOT was first implemented in the late 1930s, and the move to O*NET occurred in the early 1990s (National Center for O*NET Development, 2005; USDOL, 1991a). Although the DOT had many users, there were increasing concerns about its viability because of the magnitude of the effort and cost required to keep it updated (Advisory Panel for the Dictionary of Occupational Titles [APDOT], 1993; Dye & Silver, 1999). In the following section we present a brief description of the content and development of the DOT, adapted from Peterson and Sager (2010).

DOT Content and Development

The DOT was a major attempt by the DOL to provide structured, comprehensive occupational information (Droge, 1988; National Center for O*NET Development, 2015; USDOL, 1991a). One of the DOT's goals was to provide standardized occupational information that specified the levels of performance necessary for on-the-job success. With this information, job seekers, educators, and employers would have a better understanding of the requirements of work. Additionally, the U.S. Employment Service (USES), seen as a primary front-line customer for the DOT, used DOT information in the development of job descriptions, job advertisements, training curricula, and a variety of assessment tools. Through the USES, millions of individuals benefited from DOT information.

The DOT was a static printed book, and each occupation was assigned a nine-digit code. The code represented the occupational category, occupational division, and occupational grouping, identifying where the occupation was located in the DOT taxonomy. Occupational data provided in the DOT were categorized as work performed (what tasks/activities are performed) and worker characteristics (worker attributes important to successful job performance). In the last edition of the DOT, there were more than 12,000 occupations, with approximately 100 occupational descriptors presented for each (USDOL, 1991a). DOT work activity descriptors included the following:

• Worker functions: A set of three ratings depicting the worker's level of involvement with the well-known "data," "people," and "things" (DOL, 1972; Fine, 1968a, 1968b, 1988)
• Work fields: One chosen from a set of 100 technological and socioeconomic objectives, such as "material moving," "cooking-food preparing," and "researching," with associated definitions and verbs
• Work devices: A list of machines, tools, equipment, and work aids used on the job
• Materials, products, subject matter, and services (MPSMS): One or more of these chosen from an organization of 48 groups with 328 constituent categories, such as a "marine life" group with "finfish," "shellfish," and "other marine life" categories

DOT worker characteristic descriptors included:

• General educational development: Three ratings, on six-point scales, of the reasoning, mathematical, and language development required on the job
• Special vocational preparation: Amount of training required, using a time scale ("short demonstration only" through "over 10 years")
• Aptitudes: Estimates of levels of 11 abilities "predictive" of an individual's job performance, based largely on the General Aptitude Test Battery components with two additions, Eye-Hand-Foot Coordination and Color Discrimination
• Interests: Selection of a primary interest factor from a defined list of 12
• Temperaments: Selection of the primary temperaments (e.g., adaptability requirements of a job) from a list of 11
• Physical demands: A single rating of the strenuousness of the job, on a five-point scale, and estimates of the importance and frequency of 28 specific physical demands
• Environmental conditions: Ratings of the importance of 14 environmental conditions, with seven of these labeled as hazards

The structure of the DOT contributed to a standardized process of reporting occupational and worker characteristics. Cross-occupational comparisons were possible because an attempt


was made to collect the same information on the almost 12,000 DOT occupations. The data collected assisted in the services provided by the U.S. Employment Service and other workforce development agencies for decades.

DOT Data Collection

Although great efforts were taken to establish a DOT data collection protocol, the system was difficult to monitor and control (APDOT, 1993). At the time of the DOT development, the USES had offices and research centers in almost all 50 states. Occupational analysts employed by these offices were responsible for collecting data through observational interviews with incumbents. Methods varied greatly from state to state, and even within state offices. The number of businesses and incumbents required to participate for any given occupation was not standardized. The total number of completed "job analysis reports" used to develop DOT content was not well documented and appeared to vary greatly across the occupations studied.

DOT Users

Despite its flaws, the DOT served an important purpose in the workforce development community. It was one of the primary sources of occupational information in the United States and was used by both government and nongovernmental institutions. The National Research Council (1980) reported that the many users of the DOT included schools, libraries, human resources departments, veterans and rehabilitation programs, and the Social Security Administration. The DOT was even written into government legislation, mandating its use for specific workforce development activities. DOT information was used for job description writing, transferable skills analysis, curriculum development, and disability assessment. It played a major role in the job placement services offered by the USES. It contributed to the development of a number of personnel assessment instruments and was used by social science researchers, including psychologists, sociologists, and economists.

The DOL recognized the importance of providing occupational information for many types of users and supported several reviews of the DOT (APDOT, 1993; National Research Council, 1980). Recommendations from these studies and panels initiated the development of the O*NET system.

DEVELOPMENT OF O*NET

The first electronically available O*NET database was released in 1998 (National Center for O*NET Development, 2002). O*NET 98 was referred to as the "Analyst Database" because the information presented was the result of occupational analysts converting DOT occupational information into the new O*NET structure. O*NET has undergone major transformations since this early release. The database has been fully updated with new data, improvements have been made in data collection methods, new types of data have been included, and new modes of dissemination have been developed. These continuous improvement efforts by the O*NET program have led to increases in the number and types of O*NET users. In the next sections of this chapter we discuss the development of O*NET and the advances that have been made in the program over the last decade.

Advisory Panel for the Dictionary of Occupational Titles

In the late 1980s, DOL's Employment and Training Administration (ETA) assembled a panel of private sector and public sector experts, including individuals representing the military services,


to review the DOT. Appointed by the Secretary of Labor, members of the Advisory Panel for the Dictionary of Occupational Titles (APDOT) included psychologists, educators, economists, and policy makers. Peterson and Sager (2010) summarized the panel's dramatic recommendations on how to improve the occupational information provided by the USDOL:

• Develop a common language of jobs, occupations, and skills. This common language could facilitate workforce development activities such as writing job descriptions, job ads, and resumes; developing curricula and assessment instruments; improving tools for career development and exploration; and enabling more accurate cross-occupational comparisons that could ease job transitions.
• Incorporate multiple windows of occupational information to enable users to find and use the occupational information most appropriate for their specific tasks.
• Use a common occupational classification system with a manageable number of occupations to enable more linkages between labor market information and other sources of occupational data.
• Develop a hierarchical structure of data to give users more opportunities to find the level of data needed for their purposes.
• Implement a more structured, repeatable, and valid data collection methodology. Empirically based sampling techniques and the use of standardized questionnaires were recommended to improve the efficiency and quality of the data collected.
• Produce a relational electronic database that could be readily updated and easily searched by users.

These recommendations served as one of the primary bases for developing the occupational information system we now know as O*NET. They continue to shape the direction of the O*NET program and the development of O*NET products and tools.

O*NET Content Model

One of the major themes of the APDOT report (APDOT, 1993) was the need for a common language and a hierarchical structure of occupational information. The Content Model, the conceptual foundation of O*NET data, addresses these recommendations (National Center for O*NET Development, 2015). It provides the structure and framework for O*NET information and identifies important types of information about workers and occupations, integrating them into a theoretically and empirically sound system. The development of the Content Model is well documented (Peterson et al., 2001; Peterson, Mumford, Borman, Fleishman, & Levine, 1997; Peterson, Mumford, Borman, Jeanneret, & Fleishman, 1995); it was built on research on job and organizational analysis. It embodies a view that reflects the character of occupations (via job-oriented descriptors) and of people (via worker-oriented descriptors). The Content Model allows occupational information to be applied across jobs, sectors, or industries (cross-occupational descriptors) and within occupations (occupation-specific descriptors). It was designed to reflect the whole occupation and to define the "entire world-of-work." O*NET descriptors are organized into six major domains. These "windows" of information enable users to focus on the occupational information that is most useful to them. Peterson and Sager (2010) summarized the six major domains of the Content Model as follows:

Worker Characteristics

These cross-occupation descriptors were conceptualized to represent the enduring characteristics of individuals that are relevant to job performance and/or to the capacity to acquire the knowledge and skills necessary for work (National Center for O*NET Development, 2007). The Abilities in both the prototype and the current version of O*NET include 52 descriptors that address enduring human capabilities that are not substantially affected by experience (e.g., Oral Comprehension, Deductive Reasoning, Spatial Orientation, and Near Vision), originating from work by



FIGURE 40.1  Current Version of the Content Model
Note: The Content Model has retained its basic structure since its original inception. A detailed, interactive description of the Content Model, its domains, and individual descriptors can be found at http://www.onetcenter.org/content.html.
Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.

Fleishman (1975). The Occupational Interests are six descriptors based on Holland's (1976) taxonomy of interests (i.e., Realistic, Investigative, Artistic, Social, Enterprising, and Conventional) that represent preferences for work environments and outcomes. The Work Values consist of six values that work can satisfy. They are based on Dawis and Lofquist's (1984) Theory of Work Adjustment and include values such as achievement, independence, and recognition. The Work Styles are 16 descriptors that represent personal characteristics relevant to how well a person performs work, traditionally referred to as personality and/or temperament variables; examples include achievement, cooperation, and self-control. The prototype included a list of 17 descriptors that underwent only minor modifications in the operational version.

Worker Requirements

These cross-occupation descriptors are work-related attributes that can be developed through education, training, and experience (National Center for O*NET Development, 2007). Basic Skills are 10 developed capabilities that facilitate the acquisition of information and learning (e.g., Writing, Mathematics, Critical Thinking). Cross-Functional Skills are currently 25 developed capacities relevant to the performance of activities occurring across occupations (Mumford, Peterson, & Childs, 1999). The Knowledges are 33 descriptors that organize sets of principles and facts applying to general domains of knowledge (e.g., Economics and Accounting, Mechanical, Biology, and Law and Government). Education covers the educational experiences required to perform an occupation in terms of level (e.g., high school, vocational school, college). The educational portion of the current O*NET is fairly similar to its prototype version.



Experience Requirements

These descriptors are both cross-occupation and occupation-specific (National Center for O*NET Development, 2007). They begin with an indication of the amount of different types of experience required to be hired (e.g., related work experience, on-the-job training, and apprenticeship). Next are the basic and cross-functional skills required to be hired for the occupation; these are the same skills referred to under Worker Requirements. Finally, this part of the Content Model includes licenses, certificates, and registrations relevant to the occupation.

Occupational Requirements

These work-oriented descriptors describe the requirements of each occupation (National Center for O*NET Development, 2007). The generalized work activities (GWAs) in the current version of O*NET include 42 descriptors that address general work behaviors occurring across occupations (e.g., Getting Information, Analyzing Data or Information, Performing General Physical Activities, and Resolving Conflicts and Negotiating with Others). The GWA taxonomy was substantially influenced, for nonsupervisory jobs, by the theory behind and content of the Position Analysis Questionnaire (PAQ; McCormick et al., 1989) and, for supervisory and management jobs, by several managerial dimension systems summarized in Borman and Brush (1993). The prototype GWAs included 42 descriptors (Jeanneret, Borman, Kubisiak, & Hanson, 1999). Recently, however, the GWA hierarchy was expanded to include 332 intermediate work activities (IWAs) and 2,059 detailed work activities (DWAs). Additionally, linkages were developed among all three levels of activities, as well as to tasks (Hansen, Norton, Gregory, Meade, Foster-Thompson, Rivkin, Lewis, & Nottingham, 2014); see the sketch at the end of this subsection for a simple illustration of these linkages. This revised information was developed to provide the additional cross-occupational and occupation-specific data requested by users. The development of this new hierarchy is presented later in the Data Enhancements section of this chapter.

Organizational Context descriptors reflect characteristics of the organizations in which occupations are embedded. (These descriptors are not part of the O*NET data collection.) They cover a variety of areas, such as the level of employee empowerment, skill variety, reward systems, and organizational culture. There were some adjustments between the prototype and current versions of the organizational context items (Arad, Hanson, & Schneider, 1999). Finally, the Work Context descriptors cover physical and social factors that influence the work environment. The current version has 57 work context descriptors. They are organized somewhat differently than in the prototype version, but they are otherwise fairly similar (Strong, Jeanneret, McPhail, Blakely, & D'Egidio, 1999).
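The sketch below illustrates, with invented activity names and toy lookup tables, how the task-to-DWA-to-IWA-to-GWA linkages described above let a user roll a specific task up to a generalized work activity. The actual O*NET linkage files are far larger and use their own identifiers; only "Getting Information" below is a real GWA named in this section.

```python
# Hypothetical linkage tables (the real O*NET files map thousands of activities)
task_to_dwa = {"Review patient charts": ["Review medical records"]}
dwa_to_iwa = {"Review medical records": "Evaluate patient documentation"}
iwa_to_gwa = {"Evaluate patient documentation": "Getting Information"}

def roll_up(task):
    """Trace a task to its detailed, intermediate, and generalized work activities."""
    paths = []
    for dwa in task_to_dwa.get(task, []):
        iwa = dwa_to_iwa[dwa]
        gwa = iwa_to_gwa[iwa]
        paths.append((dwa, iwa, gwa))
    return paths

print(roll_up("Review patient charts"))
# [('Review medical records', 'Evaluate patient documentation', 'Getting Information')]
```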

Workforce Characteristics

This part of the Content Model contains information from the Bureau of Labor Statistics (BLS) regarding wages, level of employment, and employment outlook for each occupation (National Center for O*NET Development, 2007). It includes information such as average salary, number of job openings, and projected employment growth rates.

Occupation-Specific Information

This portion of the Content Model serves as the primary location of occupation-specific information (National Center for O*NET Development, 2015). The Title and Description are included here; they are needed to identify and define the occupation. Additionally, Tasks, Tools and Technology, and Alternate Titles relevant to each occupation are presented. The Tasks are work


behaviors more specific than GWAs and DWAs; they are unique to the occupation. Tools and Technology are the equipment, machines, tools, and software that an incumbent can use to perform the occupation. Alternate Titles are "lay titles" used in the world of work to identify the occupation; they help users of O*NET data identify and link to O*NET-SOC occupations of interest. The O*NET prototype included Tasks, but not Tools and Technology or Alternate Titles.

Hierarchical Structure

The six individual domains of the Content Model (i.e., Worker Characteristics, Worker Requirements, Experience Requirements, Occupational Requirements, Workforce Characteristics, and Occupation-Specific Information) group information hierarchically (National Center for O*NET Development, 2015). For example, the Worker Characteristics domain contains four types of information: Abilities, Occupational Interests, Work Values, and Work Styles. Of these four, the Abilities area, in turn, contains four types of abilities: Cognitive, Psychomotor, Physical, and Sensory. Each of these types of abilities contains further levels of detail. For example, the Psychomotor type includes Fine Manipulative, Control Movement, and Reaction Time and Speed. Finally, Fine Manipulative contains three specific descriptors: Arm-Hand Steadiness, Manual Dexterity, and Finger Dexterity. (It is at this lowest level that O*NET collects data.) Hierarchies are a useful means of both organizing occupational information and allowing for its access at different levels of specificity.

The hierarchical structure and standardized worker and occupational descriptors facilitate the use of a common language of work. The O*NET structure enables examinations of individual occupations as well as cross-occupational comparisons. A user can look at approximately 239 descriptors for each occupation or can focus on a particular domain (e.g., Occupational Requirements) or content area (e.g., Work Values). Because standardized descriptors are used, comparisons across occupations can easily be made. The use of cross-occupational comparisons can be seen in O*NET OnLine (http://www.onetonline.org), where one of the functions users can perform is to search for occupations that require similar levels of a particular O*NET descriptor (e.g., Abilities, Knowledges, Skills).
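The hierarchy just described can be pictured as a nested structure. The abbreviated Python sketch below spells out only the psychomotor branch named in this section and shows how a user might drill from a domain down to the specific descriptors at which data are collected; it is an illustrative slice, not a complete rendering of the Content Model.

```python
# Abbreviated, illustrative slice of the Content Model hierarchy
content_model = {
    "Worker Characteristics": {
        "Abilities": {
            "Psychomotor": {
                "Fine Manipulative": [
                    "Arm-Hand Steadiness", "Manual Dexterity", "Finger Dexterity"
                ],
                "Control Movement": [],
                "Reaction Time and Speed": [],
            },
            "Cognitive": {}, "Physical": {}, "Sensory": {},
        },
        "Occupational Interests": {}, "Work Values": {}, "Work Styles": {},
    },
}

def descriptors(tree):
    """Collect the lowest-level descriptors, where O*NET data are collected."""
    if isinstance(tree, list):
        return tree
    found = []
    for branch in tree.values():
        found.extend(descriptors(branch))
    return found

print(descriptors(content_model["Worker Characteristics"]["Abilities"]["Psychomotor"]))
# ['Arm-Hand Steadiness', 'Manual Dexterity', 'Finger Dexterity']
```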

Updates to the Content Model

Some additions have been made to the Content Model over time. Briefly, under Occupational Requirements, the Work Activities domain has been expanded and updated (Hansen et al., 2014). Intermediate and Detailed Work Activities have been added to develop a more complete hierarchy of information. This new "intermediate level" was developed to address user interest in additional data that could be used for workforce activities such as resume writing, job description development, and training program design. The new hierarchy now allows users to link O*NET task information to GWAs. A more complete description of this new information is provided in the Data Enhancements section of this chapter.

In the Content Model window on Occupation-Specific Information, Alternate Titles and Tools and Technology (T2s) have been added. Alternate Titles enable users to more easily link occupations they are interested in to the occupations provided in O*NET. Tools and Technology are critical in helping O*NET data keep up with the changing world of work. Using real-time data applications and transactional data from career and job seeker websites, the O*NET program is able to identify new tools and technology for occupations. These T2s often represent new "hard skills" necessary for successful job performance. Job seekers, policy makers, employers, and educators are keenly interested in the T2 data to help them make more informed decisions. Alternate Titles and T2 development is presented in the Data Enhancements section of this chapter.



O*NET-SOC Taxonomy

Along with the Content Model, one of the foundations of the O*NET program is the O*NET-SOC Taxonomy (National Center for O*NET Development, 2006, 2010, 2015). This occupational taxonomy identifies the occupations included in the O*NET data collection, and it is the taxonomy used when developing O*NET products and tools. The O*NET-SOC Taxonomy structure is based on the Standard Occupational Classification (SOC) developed by the BLS (U.S. Department of Commerce, 1980). The Office of Management and Budget (OMB) mandates that all federal agencies undertaking occupational data collection align with the SOC. This gives users of this information the opportunity to take advantage of the wide array of occupational information collected by the federal government, and by using this common taxonomy, federal agencies are able to share occupational information, facilitating a common language of work. The 2010 SOC is the most recent version of the taxonomy; the next update is scheduled for 2018 (http://www.bls.gov/soc/update.htm). The O*NET-SOC 2010 taxonomy used by the O*NET program is aligned with the 2010 SOC (U.S. DOL, 2010) and adds more detailed occupations. This is the fifth update of the O*NET-SOC taxonomy. Figure 40.2 presents the O*NET-SOC 2010 taxonomy.

Like the DOT, each O*NET-SOC occupation is assigned a code. This eight-digit code identifies where the occupation lies in the O*NET-SOC hierarchy. The first six digits match the SOC coding scheme. The SOC occupational taxonomy has four levels of aggregation: 23 major groups, 97 minor groups, 461 broad occupations, and 840 detailed occupations. The last two digits of the code indicate whether the O*NET-SOC occupation is a one-to-one match with the SOC occupation (.00) or is a detailed O*NET breakout of the SOC occupation (e.g., .01, .02). Figure 40.3 provides an illustrated example of the O*NET-SOC coding system for the SOC-level occupation Nuclear Technician and its associated detailed occupational breakouts.

FIGURE 40.2  2010 O*NET-SOC Taxonomy
[Figure: a breakdown of the 2010 O*NET-SOC taxonomy based on the 2010 SOC classification (23 major groups, 97 minor groups, 461 broad occupations, and 840 detailed SOC occupations), showing how the 840 detailed SOC occupations expand to 1,110 O*NET-SOC 2010 occupations, of which 974 are data-level occupations with titles and data and 136 are title-only occupations.]
Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.



19–0000 Life, Physical, and Social Science Occupations (SOC major group)
  19–4000 Life, Physical, and Social Science Technicians (SOC minor group)
    19–4050 Nuclear Technicians (SOC broad occupation)
      19–4051 Nuclear Technicians (SOC detailed occupation)
        19–4051.01 Nuclear Equipment Operation Technicians (detailed O*NET-SOC occupation)
        19–4051.02 Nuclear Monitoring Technicians (detailed O*NET-SOC occupation)
FIGURE 40.3  19–4050 Nuclear Technicians
Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.
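As a rough sketch of the coding scheme illustrated in Figure 40.3, the Python function below splits an O*NET-SOC code into its SOC levels and flags whether the final two digits mark a direct SOC match (.00) or a detailed O*NET breakout. It assumes a well-formed code in the XX-XXXX.XX format and the common XX-X000 minor-group pattern shown in the figure, which does not hold for every SOC minor group; the function name is our own.

```python
def parse_onet_soc(code):
    """Decompose an O*NET-SOC code such as '19-4051.01' (format assumed XX-XXXX.XX)."""
    soc, suffix = code.split(".")
    major, detail = soc.split("-")
    return {
        "major_group": f"{major}-0000",
        # Assumes the common XX-X000 minor-group pattern (true for the Figure 40.3 example)
        "minor_group": f"{major}-{detail[0]}000",
        "broad_occupation": f"{major}-{detail[:3]}0",
        "detailed_soc": soc,
        "onet_breakout": suffix != "00",  # .01, .02, ... are detailed O*NET-SOC breakouts
    }

print(parse_onet_soc("19-4051.01"))
# {'major_group': '19-0000', 'minor_group': '19-4000',
#  'broad_occupation': '19-4050', 'detailed_soc': '19-4051', 'onet_breakout': True}
```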

In order to keep up with the changing world of work, the O*NET program conducts projects to update the O*NET-SOC taxonomy to include New and Emerging (N&E) occupations (National Center for O*NET Development, 2009). As shown in Figure 40.2, the 2010 O*NET-SOC taxonomy has 1,110 occupational titles, 974 of which are included in the O*NET data collection; these are identified as data-level occupations. A multimethod data collection approach is used to collect occupational information: incumbents, occupational experts, and job analysts complete questionnaires, based on Content Model descriptors, for the data-level occupations. Information on occupations is also collected from employer job postings, transactional data, web research, professional associations, and customers.

Of the 974 data-level occupations included in the O*NET-SOC 2010 taxonomy, 152 are N&E occupations. These occupations were identified within 17 in-demand industries, including health care, information technology, and education. To be included as an N&E occupation, the occupation had to meet specific criteria, such as having significant employment and a positive projected growth rate. In addition to making sure that the occupation is indeed relevant to the world of work, these criteria help ensure that data can be successfully collected for the N&E occupation (National Center for O*NET Development, 2009).

The O*NET program continues to evaluate the completeness and specificity of the O*NET-SOC taxonomy. The need for more detailed occupations is constantly weighed against project resources, and ways to update the taxonomy and the possibility of adding more occupations continue to be studied. Recently, the DOL has initiated efforts to identify ways of using real-time data and "Big Data" (i.e., large open-source databases of current transactional data) to identify N&E occupations. O*NET will follow this research to see if it can contribute to enhancing the O*NET-SOC taxonomy.

O*NET DATA COLLECTION

One of the keys to the success of the O*NET program is its multimethod data collection methodology. The methodology was developed to meet the need for a standardized system with enough flexibility to face the challenges of collecting more than 200 descriptors on almost 1,000 occupations covering the entire U.S. economy. The system had to be cost effective, mindful of the burden it would place on employers and incumbents, and capable of producing high-quality data that could be used by a wide range of users. The OMB examines the technical quality, costs, and burden to the public of the data collection and ensures that there is no duplication between the O*NET data collection and other collection efforts sponsored by the federal government. The O*NET data collection has received OMB clearance five times thus far, first in 2002 and most recently in 2015 (National Center for O*NET Development, 2016).

Efforts to continually improve the data collection methods are a cornerstone of the O*NET program. Currently, methods include data collection from job incumbents, occupational experts, and job analysts. Other sources of information include expert research, data from government programs, transactional data, employer job postings, and customer input (National Center for O*NET Development, 2015). Table 40.1 demonstrates the breadth of the O*NET data collection. It summarizes the number of descriptors and scales in the O*NET data collection program questionnaires (downloadable at http://www.onetcenter.org/questionnaires.html).


TABLE 40.1  O*NET Data Collection Questionnaires

O*NET Data Collection Program Questionnaire    Number of Descriptors    Number of Scales per Descriptor    Total Number of Scales    Data Source
Skills                                         35                       2                                  70                        Analysts
Knowledge                                      33                       2                                  66                        Job incumbents
Work Styles^a                                  16                       1                                  16                        Job incumbents
Education and Training^a                       5                        1                                  5                         Job incumbents
Generalized Work Activities                    41                       2                                  82                        Job incumbents
Work Context                                   57                       1                                  57                        Job incumbents
Abilities                                      52                       2                                  104                       Analysts
Tasks^b                                        Varies                   2                                  Varies                    Job incumbents
Total (not including Tasks)                    239                      NA                                 400                       NA

Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.
Notes: Occupation experts use the same questionnaires as job incumbents for those occupations whose data collection is by the Occupation Expert Method. NA = not applicable.
^a The Knowledge Questionnaire packet also contains the Work Styles Questionnaire and the Education and Training Questionnaire.
^b All job incumbents are asked to complete a Task Questionnaire in addition to the domain questionnaire.

Descriptors are identified from the O*NET Content Model domains. Data are collected by means of 239 descriptors that include 400 scales (e.g., Importance, Level, and Frequency). Currently, to collect ratings for the Abilities and Skills domains, trained occupational analysts review updated information (e.g., Tasks, Generalized Work Activities) provided by job incumbents (Reede & Tsacoumis, 2015; Tsacoumis, 2007; Tsacoumis & Van Iddekinge, 2006). No data collection is planned for the Workforce Characteristics domain; information for it is provided through links to the employment, wage, and long-term projections databases produced by the BLS, the state employment security agencies, and other agencies.
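As a quick arithmetic check on the totals just cited, the short Python sketch below (the structure and names are ours, not part of the O*NET program) tallies the descriptor and scale counts from the fixed-length questionnaires in Table 40.1, excluding the task questionnaires whose counts vary by occupation, and reproduces the 239-descriptor, 400-scale totals.

```python
# (number of descriptors, scales per descriptor) for each fixed-length questionnaire
questionnaires = {
    "Skills": (35, 2),
    "Knowledge": (33, 2),
    "Work Styles": (16, 1),
    "Education and Training": (5, 1),
    "Generalized Work Activities": (41, 2),
    "Work Context": (57, 1),
    "Abilities": (52, 2),
}

total_descriptors = sum(d for d, _ in questionnaires.values())
total_scales = sum(d * s for d, s in questionnaires.values())
print(total_descriptors, total_scales)  # 239 400
```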

Establishment Data Collection

One of the largest components of the O*NET data collection is the survey of job incumbents. Research has indicated that incumbents who actually perform the job are excellent sources of occupational information (Fleishman & Mumford, 1988; National Center for O*NET Development, 2015; Peterson, Owens-Kurtz, Hoffman, Arabian, & Whetzel, 1990). By surveying job incumbents, the O*NET program is more likely to capture new requirements of work as occupations change due to advances in technology and other work demands. The O*NET program has had great success in having job incumbents provide ratings on a variety of questions regarding their work. Thus far, more than 50,000 businesses have provided access to their job incumbents to participate in the data collection, and more than 200,000 job incumbents have completed O*NET surveys (Lewis & Rivkin, 2015).

Each job incumbent completes one of three domain surveys: Knowledge (including Work Styles and Education and Training), Generalized Work Activities, or Work Context. Each survey takes about 30 minutes to complete. Job incumbents are also asked to provide demographic information, answer questions that confirm they are in the occupation being surveyed, and complete a task questionnaire specific to their occupation. Spanish versions of the questionnaires are also available, which contributes to improved response rates. Paper-and-pencil and online surveys are both available.



Sample Design

The O*NET data collection has to be cost effective, efficient, and timely. O*NET employs a sample design that considers the quality of data necessary, as well as the burden allocations for the project; burden allocation includes the costs and time required for business and respondent participation. For the establishment data collection method, which is used for approximately 75% of occupations, a stratified two-stage sampling approach is used. Components of the sampling design that improve efficiency include the use of multiple sub-waves of data collection and the implementation of Model Assisted Sampling (MAS).

Two-Stage Approach  In the first stage of sampling, a sample of businesses is selected from a national database provided by Dun & Bradstreet (D&B). This database is continuously updated and currently contains approximately 17 million establishments. The sample is selected with probability proportional to the expected number of workers employed in specific occupations. During the second stage, job incumbents are randomly selected from lists of workers in the occupations, provided by the establishments. In certain cases, where the D&B frame is not sufficient for particular occupations, a special frame is developed, and professional or trade associations provide membership lists from which workers are sampled. The special frame is usually used to supplement the D&B frame.

Multiple Waves  To improve the efficiency of the sampling design, a wave approach is used in releasing samples for data collection. Each "primary wave" consists of approximately 50 occupations that have been clustered together because they can be found in similar industries. This helps limit the number of establishments contacted, because multiple occupations are likely to be found at a particular establishment. Following the release of the primary wave, up to three additional sub-waves are released to complete an occupation. Based on experience from the primary wave, the sub-waves can be targeted toward the industries and establishments in which the occupations were most frequently found.

This sampling approach maximizes the efficiency of the "establishment" portion of the data collection. First, the methodology is more likely to find the occupations being sought because it empirically identifies which establishments are most likely to employ them. Second, it minimizes oversampling of an occupation: if an occupation is completed in an early wave, sampling efforts in the remaining waves can be targeted toward occupations that have not yet been completed.

Model Assisted Sampling  To help control the employee sample selection, Model Assisted Sampling (MAS) is used in the O*NET data collection (National Center for O*NET Development, 2012, 2015). This methodology enables the desired distribution of selected employees to be defined for each occupation before data collection begins. Targets are set in terms of census region, business size, and industry division. By using MAS, sampled employees are distributed across the target cells in proportions that reflect the expected distribution in the total population. Targets are determined based on the Occupational Employment Statistics (OES) survey conducted by the BLS and the establishment information found in the D&B frame. Cell targets are monitored during data collection. Once a cell is complete, data collection for that cell is stopped, and more effort can then be directed toward cells that have not yet met their MAS targets.
The use of MAS has greatly improved the efficiency of the O*NET data collection with negligible effects on the quality of descriptor information (Berzofsky, Welch, Williams, & Biemer, 2008). MAS has significantly reduced the number of establishments that need to be contacted in order to retain a random sample of establishments and has minimized issues of oversampling and unnecessary public burden. Finally, it has helped control costs associated with the establishment data collection.
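To make the two-stage, model-assisted design more concrete, the following is a minimal Python sketch of the general logic under stated assumptions: establishments are drawn with probability proportional to their expected employment in a target occupation, incumbents are randomly selected from establishment rosters, and completed cases are tallied against MAS cells defined by census region, business size, and industry division. All field names, data structures, and targets here are illustrative assumptions rather than the O*NET program's actual implementation.

    import random
    from collections import Counter

    def pps_sample_establishments(frame, occupation, n):
        # First stage (illustrative): draw establishments with probability
        # proportional to expected employment in the occupation (with replacement,
        # for simplicity).
        weights = [est["expected_workers"].get(occupation, 0) for est in frame]
        return random.choices(frame, weights=weights, k=n)

    def select_incumbents(roster, k):
        # Second stage (illustrative): random selection from the employee roster
        # supplied by a participating establishment.
        return random.sample(roster, min(k, len(roster)))

    def mas_cell(establishment):
        # MAS target cells are defined by census region, business size, and
        # industry division.
        return (establishment["census_region"], establishment["size_class"],
                establishment["industry_division"])

    def remaining_mas_targets(completed_cases, targets):
        # Stop collecting in cells whose targets are met; redirect effort to the rest.
        counts = Counter(mas_cell(case["establishment"]) for case in completed_cases)
        return {cell: max(goal - counts.get(cell, 0), 0) for cell, goal in targets.items()}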

Occupational Expert Data Collection

Occupational Expert (OE) data collection is another key component of the multimethod approach of O*NET data collection (Lewis & Rivkin, 2015; National Center for O*NET Development, 2015).


Approximately 25% of occupations have data collected using the OE method. The OE method is used when it is difficult to locate incumbents in the occupation. Conditions that contribute to this challenge include small employment size; inaccessibility of job incumbents (e.g., those working on an oil tanker); or status as a new and emerging occupation in the economy that has not yet been included in the D&B or BLS sample frames. Examples of occupations that have data collected via the OE method include Robotics Technicians, Chief Sustainability Officers, and Bridge and Lock Tenders. Participants for the OE method are selected from lists of potential experts provided by multiple professional and trade associations. To be considered an occupational expert, an individual must have:

• Actively worked in the occupation within the last six months. This includes working, supervising, and/or training others in the occupation.
• Actively worked in the occupation for a minimum of 5 years.
• Performed the duties of the occupation for at least one year.

Individuals meeting these requirements are frequently supervisors, managers, trainers, or academics. Stratified samples are selected to meet a goal of 20 completed questionnaires of each type. Regional distribution is considered when selecting samples. OEs who participate are asked to complete all three domain questionnaires (Knowledge, Generalized Work Activities, and Work Context), as well as a task questionnaire specific to the occupation in question. They also provide background data.

Data Collection Operations

Well-trained business liaisons (BLs) are one of the keys to the success of the O*NET data collection program (National Center for O*NET Development, 2002, 2012, 2015). BLs are responsible for contacting sampled establishments and gaining cooperation from business points of contact (POCs). Once establishments agree to participate, the POC provides a listing of eligible employees. The BL randomly selects employees for participation and sends all related data collection materials to the POC. The POC distributes the materials to selected employees. The employees then mail the completed surveys directly back to the O*NET program.

An online case management system (CMS) is employed by the BLs to ensure that data collection operations are standardized and efficient. The CMS functionality includes mechanisms for sending questionnaires to POCs, tracking questionnaire receipt rates, and generating reminders for follow-up with participating establishments. The functionality of the CMS is continuously improved based on BL experiences and data collection results.

Occupational Analyst Data Collection

In addition to job incumbents and occupational experts, occupational analysts (OAs) also play an important part in the O*NET multimethod data collection. Ability and skill data for occupations are populated by ratings from OAs. A cadre of 16 trained OAs rates the importance and level requirements for 35 skills and 52 abilities for each O*NET-SOC occupation.

Abilities and Skill Ratings

For the ability and skills domains, there was a concern that, because of the abstract nature of the constructs, incumbents might have a difficult time providing accurate ratings. Research supported this concern, suggesting that trained occupational analysts could provide more accurate ability and skills information than incumbents (Morgeson & Campion, 1997; Tsacoumis, 2007). A study comparing O*NET incumbent versus analyst skills ratings indicated that they were substantially similar, including underlying psychometric properties (Tsacoumis & Van Iddekinge, 2006).


Research also suggested that incumbents may be prone to inflating their ratings, as compared to analysts, due to concerns with how ratings could affect compensation, future training opportunities, or the general status of their occupation. Finally, analysis revealed, as expected, that using OAs would significantly reduce the cost and burden of data collection.

Based on a review of the O*NET Content Model (Donsbach, Tsacoumis, Sager, & Updegraff, 2003), the following information is presented to OAs:

• Title and occupation description
• Job Zone of the occupation
• Task statements with ratings
• Generalized Work Activity data
• Work Context items and data
• Knowledge domains and data

The data presented provide the OAs with an updated and complete picture of the occupations. The title and occupation description are obviously necessary to identify and understand the occupation being rated. The Job Zone (a measure of education, training, and experience requirements) is provided to enable the OAs to better understand the complexity of the occupation. Tasks, GWAs, and Work Context give OAs a good picture of what the incumbent actually does on the job, and knowledge helps the OAs understand what is required for the occupation. Only descriptors rated important by incumbents are delivered to OAs. OA rating methodology continues to be evaluated (Fleisher & Tsacoumis, 2012a, 2012b). As more occupations are rated multiple times, comparisons between ratings over time and documented changes in the occupations may provide interesting areas of research.

Data Collection Status

The partnership between the DOL and private and public economic sectors has been very productive. For the incumbent data collection method, the cumulative participation rate for establishments is over 75%, and for incumbents it is over 65%. These response rates were obtained within burden and cost limits. More than 50,000 businesses/organizations and 200,000 job incumbents have provided data for O*NET. For the OE data collection method, the cumulative participation rate is approximately 78%. Working through more than 700 national associations, more than 7,000 OEs have provided data. These results compare favorably to other federal surveys using similar methodologies. The O*NET program continues to look for ways to improve response rates while containing costs and burden.

DATA ENHANCEMENTS

O*NET information is designed to be organic, driven by current data, and responsive to the needs of its wide variety of users. Ongoing projects seek not only to update data but also to provide new ways of organizing and linking information so that users are able to work with a tool that provides maximum flexibility and expanded application opportunities. To that end, O*NET continually strives to provide new types of data along with the enhancement of existing data. In the next several sections, we will present details of the development of new Work Activity Statements, Tools and Technology information, and Alternate Occupational Titles. All of these enhancements can improve O*NET users' opportunities to make quality workforce development and work-life decisions.

New Work Activity Statements

A recent important addition to the O*NET database is the development of new work activities.


FIGURE 40.4  Overview of Occupational Updates, 2003–2015. The average number of occupations updated per year is 570; the number updated in a given year ranges from 177 to 974. The O*NET Occupation Update Summary Page (http://www.onetcenter.org/dataUpdates.html#summary) provides information on specific updates by occupation and descriptor area and is an excellent resource for O*NET users to understand the currency of O*NET data. Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.

Detailed Work Activities (DWAs) and Intermediate Work Activities (IWAs) have been developed for every occupation in the O*NET taxonomy. These activity statements have completely replaced an older and less specific list of DWAs and provide a new type of data presented in a hierarchical structure. See Figure 40.5 for a comparison of the structures of the old and new work activities frameworks. Hansen et al. (2014) contains a complete discussion of the development of the new work activities structure and data.

DWA and IWA information was developed in response to specific user data needs. As the world of work changes, worker skill requirements change as well. The DWAs and IWAs provide information on such skills and use multiple feedback sources to incorporate current information about skill changes on a timely basis, enhancing how the data can be used. For example, as job demand for occupations in areas such as manufacturing has decreased, other areas (e.g., health care) have grown rapidly, often incorporating new technologies but not necessarily requiring completely new skills. O*NET users wanted to be able to explore alternative or related occupations using the current occupational task information. Using DWAs and IWAs, which build upon O*NET task information, users can now search across occupations to identify between-occupation similarities, even if occupations are not in the same job family or industry area. O*NET now has 2,059 DWAs and 332 IWAs; these statements were generated by qualitative analysis and clustering of the 19,450 tasks in the O*NET 18.0 database (Hansen et al., 2014).

Work Activities Definitions and Hierarchy

Work activities within O*NET range from the very general (found across many occupations) to the very specific (unique to a single occupation).


FIGURE 40.5  Original and Updated Work Activities Hierarchy. The current hierarchy runs from the general to the specific as GWAs → DWAs → Tasks; the revised hierarchy adds an intermediate layer: GWAs → IWAs → DWAs → Tasks. Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.

In the O*NET work activities hierarchy, the 41 GWAs cross multiple occupations and occupational areas and, as the name implies, are very general (e.g., "thinking creatively"). The most specific work activity information available through O*NET is the task statement. Each occupation has a list of unique task statements in a standardized format. The completely revamped DWAs and the newly added IWAs fill the gap between GWAs and tasks and establish, for the first time, a complete chain of linkages from an individual task to an individual GWA. The nested work activities hierarchy allows for multiple avenues of drill-up or drill-down searching, along with cross-occupational skill matching. Some examples of how this nesting works appear in Table 40.2.

DWAs are simple work activity statements (e.g., "Record patient medical histories") that have some degree of occupational context, in contrast to the broader GWAs. However, these statements apply to the activities of multiple occupations. They are more general than task statements, which are normally specific to a single occupation. DWAs were developed within the SOC system's 22 major groups. Most of the 2,069 DWAs link to four or more tasks and three or more occupations within a single job family. The 322 IWAs are nested between GWAs and DWAs and are more general activity statements, common to many occupations. Many cross major occupational groups or job families. All of the GWAs cross major occupational groups or families.

TABLE 40.2  47–2111.00 Electricians—Example of Nested Activity Data

GWA: Inspecting Equipment, Structures, or Material
  IWA: Inspect commercial, industrial, or production systems or equipment.
    DWA: Inspect electrical or electronic systems for defects.
      Task: Inspect electrical systems, equipment, or components to identify hazards, defects, or the need for adjustment or repair, and to ensure compliance with codes.

GWA: Estimating the Quantifiable Characteristics of Products, Events, or Information
  IWA: Estimate project development or operational costs.
    DWA: Estimate construction project costs.
      Task: Provide preliminary sketches or cost estimates for materials or services.

GWA: Thinking Creatively
  IWA: Create visual designs or displays.
    DWA: Create construction or installation diagrams.
      Task: Provide preliminary sketches or cost estimates for materials or services.

Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.

Work Activities Development

DWAs and IWAs were developed by groups of analysts who evaluated and sorted the 19,450 tasks in the O*NET 18.0 database based on similarity of activity, objects, purpose, context, and technology. Precisely worded DWA statements were then developed to reflect those common characteristics and to distinguish those activities from other activity statements. Once the DWAs were developed, several data refinements were applied. First, relevant original, or "legacy," DWAs were integrated into the DWA data set if they filled a conceptual gap in the set of new DWAs. Then the entire set of DWAs was reviewed to identify identical or nearly identical DWAs that could be combined. Finally, tasks that contained information about multiple activities were linked to additional DWAs. This process led to the 2,092 DWAs currently in the O*NET 20.3 database.

The project team conducted the same two-stage process to develop IWA statements. DWA statements were clustered using the same rational process used for task clustering. DWA statements reflecting similar activities were grouped, and activity statements were written to reflect the common activity themes in the DWA clusters. This work resulted in the 322 IWAs in the O*NET 20.3 database.

Potential Uses of Work Activity Data

Work activities data can support career exploration, resume building, skills gap analysis, work requirements profiling, and industry skills standards development. The new work activities framework expands the potential applications of work activities data in two ways. First, the new DWAs improve access to information about high-growth and emerging occupations. Second, the development of IWAs and the integration of work activity data (linking of tasks to DWAs to IWAs to GWAs) expand and enhance the overall applications of work activities data. Following are some examples of potential uses of the enhanced data.

Displaced workers can benefit from the hierarchical structure of the work activities statements. They can identify the skills they already have, both general and specific, which enables them to enter career exploration at multiple points. For example, a displaced worker could look for jobs that require the ability to assemble electrical wiring. With the new work activities taxonomy, they can explore occupations that link to the DWA for that task ("Install electrical components, equipment or systems") or that are linked to the higher-level IWAs and GWAs ("Install commercial or production equipment" and "Handling and Moving Objects," respectively). This provides multiple means of identifying transferable skills. Once such skills are identified, the related work activities can be used to construct resumes containing skills that are targeted for jobs an individual is interested in pursuing.

Employers also benefit from the new work activities taxonomy. The more detailed structure enables better cross-occupational comparisons. Employers can analyze skill gaps between the existing workforce and current and future worker requirements. They can use the new work activities to write more detailed and standardized job orders or position descriptions. DWA and IWA information can also be integrated into the development of work training programs, allowing for more specific training objectives and a more targeted training curriculum.
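The search paths described above can be pictured as a small set of linkage tables. The sketch below is a minimal Python illustration of drill-up and cross-occupational matching; the dictionary entries are assumptions built around the electrical-wiring example in the text, not actual O*NET linkage data.

    # Illustrative linkage tables; the real task-DWA-IWA-GWA linkages ship with
    # the O*NET database files, and the entries below are assumed for the example.
    task_to_dwa = {
        "Assemble electrical wiring": "Install electrical components, equipment or systems",
    }
    dwa_to_iwa = {
        "Install electrical components, equipment or systems": "Install commercial or production equipment",
    }
    iwa_to_gwa = {
        "Install commercial or production equipment": "Handling and Moving Objects",
    }
    dwa_to_occupations = {
        "Install electrical components, equipment or systems": ["Electricians", "Solar Photovoltaic Installers"],
    }

    def related_occupations_for_task(task):
        # Drill up from a task to its DWA, then fan out to the other occupations
        # sharing that DWA -- the transferable-skill search described above.
        dwa = task_to_dwa.get(task)
        return sorted(dwa_to_occupations.get(dwa, [])) if dwa else []

    def gwa_for_task(task):
        # Full drill-up: task -> DWA -> IWA -> GWA.
        dwa = task_to_dwa.get(task)
        iwa = dwa_to_iwa.get(dwa)
        return iwa_to_gwa.get(iwa)

    print(related_occupations_for_task("Assemble electrical wiring"))
    print(gwa_for_task("Assemble electrical wiring"))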


Hierarchical work activities data can also be used by industry groups, such as associations. Because the work activities follow a standardized style and structure, there is a more defined common language of cross-occupational activity. This information can be used to help unify industry skills descriptions and standards. These skills standards can then be used to assist in the development of professional certification specifications.

In addition to the workplace issues described above, the O*NET work activities data can be used to address some newer challenges. The new activities include, for example, those related specifically to the greening of the workplace. Veterans returning to the workplace can use the work activities to better understand how skills used in military work can be transferable to the civilian world of work. In an increasingly unstable labor market, in which re-skilling can be frequent and industries often change the numbers and skills of the workers they need, the work activities can help bring structure and information to both employers and job seekers by identifying which skills are changing and pinpointing the types of changes that are occurring. The DWAs can address new skills and technologies that have not previously been in great demand in the workplace but for which training and skill standards are now needed.

The IWAs have excellent potential utility in the current labor market. As global competition has increased, particularly in manufacturing, and new technologies have emerged, there have been significant shifts in the types and sizes of labor pools needed in the U.S. economy. This has required many American workers to look for jobs outside the traditional boundaries of their industries or what they see as similar occupational groups. The IWAs can help job seekers see where else their skills are needed. Employers and workforce specialists can also use the IWAs to look outside their industries to identify and recruit workers with valuable skills that can be applied to the jobs they need filled.

Maintaining the Currency of Work Activities

The O*NET database is intended to be dynamic, with regular updating and revisions. The DWAs and IWAs are intended to be dynamic as well, allowing for continuous improvement over time. As new tasks are added to the database—and as other tasks are deleted—the DWAs and IWAs are regularly and systematically evaluated to ensure that work activities data are changed as necessary. With each O*NET database update, there are approximately 150 task changes, of which 40–60 are considered to be substantive and likely to affect work activities data. By continuing to update work activities, the O*NET database remains more current and useful to policy makers, educators, employers, workforce development professionals, and job seekers. This new information allows the O*NET program to provide both cross-occupation and occupation-specific information and supports a common language of work.
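A hedged sketch of the kind of maintenance check implied here: given the task IDs that changed in a database update, flag the work activities whose linkages should be re-evaluated. The data structures are assumptions for illustration; the actual evaluation is performed by analysts against the published linkage files.

    def activities_needing_review(changed_task_ids, task_to_dwa_links, dwa_to_iwa_links):
        # Flag every DWA linked to a changed task, then every IWA linked to a flagged DWA.
        flagged_dwas = set()
        for task_id in changed_task_ids:
            flagged_dwas.update(task_to_dwa_links.get(task_id, set()))
        flagged_iwas = set()
        for dwa_id in flagged_dwas:
            flagged_iwas.update(dwa_to_iwa_links.get(dwa_id, set()))
        return flagged_dwas, flagged_iwas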

Tools and Technology

Another addition to the O*NET database has been the Tools and Technology (T2) information. T2s are important across jobs in the U.S. economy, for both high- and low-skilled occupations, and have been developed for all 974 occupations presented in the O*NET 20.3 database, with more than 67,000 T2 objects linked to these occupations. T2 data provide occupational information regarding machines, equipment, tools, and software. Special emphasis is placed on cutting-edge technologies and emerging workplace practices. These occupational data can be used for a wide range of O*NET applications, such as workforce development, employee training, and vocational and career guidance. This database was developed in response to user requests for more specific occupational information, and it is being refined to provide more frequently updated information.

The T2 data in the O*NET database are simultaneously generic and specific. They are specific in that they are described in the "language" of the occupation, industry, or field.


FIGURE 40.6  17–2061.00 Computer Hardware Engineers. Example of Tools and Technology Data Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.

Thus, T2 data can function as carriers of information specific to an individual occupation. At the same time, T2 data are organized by generic classifications in a standardized taxonomy—the United Nations Standard Products and Services Code (UNSPSC). For more details on the UNSPSC, see www.unspsc.org. This taxonomic organization allows for comparisons of T2 data across multiple occupations. By possessing both generic and specific attributes, the same tool or technology for an occupation can serve multiple end-user purposes (Dierdorff, Drewes, & Norton, 2006).

The example in Figure 40.6 shows the online T2 information display for Computer Hardware Engineers. The bold words at the beginning of each line are the UNSPSC commodity titles; the actual T2 objects that are classified to these commodities appear after the dash. For example, "Development environment software" is the UNSPSC commodity classification, and the objects linked to that classification include C and Microsoft Visual Basic.
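In code, the dual generic/specific character of T2 data amounts to a simple grouping of occupation-specific objects under UNSPSC commodity titles. The minimal sketch below assumes a nested-dictionary layout; the commodity title and objects come from the Computer Hardware Engineers example above, and everything else is illustrative.

    # T2 objects for one occupation, keyed by UNSPSC commodity title (assumed layout).
    t2_by_occupation = {
        "Computer Hardware Engineers": {
            "Development environment software": ["C", "Microsoft Visual Basic"],
            # ...additional commodities and objects would follow
        },
    }

    def occupations_using(technology, t2_data):
        # Cross-occupation search: which occupations list a given T2 object?
        return [occ for occ, commodities in t2_data.items()
                if any(technology in objects for objects in commodities.values())]

    print(occupations_using("C", t2_by_occupation))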

Updates and Additions

The initial set of 67,000 T2 objects was gathered primarily from analysts' Internet research of sources that included job descriptions, educational curricula, certification requirements, professional association information, and the contents of individual job advertisements. Update efforts include the examination and incorporation of data culled from real-time labor market information as well as data gathered directly from O*NET users. The initial collection, while thorough, was also extremely time- and labor-intensive. New methodology enhancements are aimed at ensuring that the T2 objects reflect change in the workforce as it is occurring. Cutting-edge and high-demand technologies are of particular interest in this process.


Real-Time Labor Market Information

This method leverages "Big Data" to acquire large amounts of information in a timely manner. Current employer job postings captured in "real time" are used to evaluate the current lists of occupational technologies for completeness and currency. This leveraging takes two forms. The first involves evaluation of job postings data related to each O*NET-SOC title. For each occupation, real-time job postings data, representing thousands of job postings, are reviewed. Additions to the T2 database are made to reflect both new technological objects, such as new software packages, and the linking of these new objects to the UNSPSC taxonomic classification.

The second way in which the job postings data are used involves compilation of a list of the current top or "hot" occupational technologies (Lewis & Norton, 2016). A list of technologies—those occurring most frequently across occupations—is gathered from the complete set of job postings across all occupations. The top technologies are identified, and the list is evaluated to ensure that all objects it contains are appropriate T2 objects according to database definitions. A list of occupations with job postings linked to each technology on the "top technology" list is compiled, and then those O*NET occupations' T2 lists are examined to see if the technologies are already included. Where appropriate, top technologies are linked to additional occupations and added to their T2 data.
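The frequency pass over job postings can be sketched as follows. The posting records and field names are assumptions for illustration, and the production process also involves analyst review before anything is added to the T2 database.

    from collections import Counter

    def top_technologies(postings, k=25):
        # Count how often each technology string appears across postings for all
        # occupations and return the k most frequent candidates.
        counts = Counter(tech for posting in postings for tech in posting["technologies"])
        return [tech for tech, _ in counts.most_common(k)]

    def occupations_missing_technology(tech, postings, t2_by_occupation):
        # Occupations whose postings mention a top technology but whose current
        # T2 list does not yet include it -- candidates for a new linkage.
        posted_in = {p["occupation"] for p in postings if tech in p["technologies"]}
        missing = []
        for occ in sorted(posted_in):
            current = {obj for objects in t2_by_occupation.get(occ, {}).values() for obj in objects}
            if tech not in current:
                missing.append(occ)
        return missing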

User Information

O*NET users' feedback is used to update T2 data by identifying new or "missing" objects as well as by identifying outdated or obsolete tools and technologies. User feedback is solicited in two primary ways. First, users of O*NET OnLine are offered a Feedback link within the display of an occupation's T2 object list. Second, T2 feedback is solicited directly from professional associations and OEs participating in the OE data collection.

Data gathered from real-time sources, incumbents, occupational experts, and professional associations are added to the database on a regular and frequent basis. Each potential new object is analyzed for validity, formatted to match current T2 data standards, and then classified appropriately using the UNSPSC information. The frequent updates ensure that the database contains information reflective of real-world occupational experiences.

Uses

In O*NET websites, T2 data are displayed for each occupation (National Center for O*NET Development, 2015). Additionally, in O*NET OnLine (https://www.onetonline.org/), users can identify occupations that require a specific T2 by searching on that T2. The T2 database can be searched to yield T2 data for a specific occupation or to provide occupation data for a specific technology. There are multiple uses for this data:

• Job seekers can determine which technologies are relevant to a given occupation in order to see if they are qualified for that occupation.
• Individuals seeking a career change can identify occupations that have T2 requirements closely matching their skills.
• Career development professionals can use the information to help students and other clients determine which occupations might be good fits for their interests and skills.
• Curriculum developers can use the information as a guide to what types of tools and technology knowledge and skills a student might need for successful job performance.

As with other O*NET data, T2 information undergoes continual examination and enhancement.


By using T2s, individuals can make better work-life decisions, workforce professionals can have quality information to assist their clients, and developers can use valid, updated data in their products.

Development of Occupation Alternate Titles

A key concern of the O*NET program is to make it as easy as possible for users to find the occupations they wish to explore (Gregory & Lewis, 2015). The development of Alternate Titles for occupations is essential for successful occupational searches, as it links commonly used lay titles to O*NET occupations. The Alternate Titles improve keyword searches in several DOL Internet applications—O*NET OnLine (www.onetonline.org), My Next Move (www.mynextmove.org), My Next Move for Veterans (www.mynextmove.org/vets/), and O*NET Code Connector (www.onetcodeconnector.org). They are also incorporated into a number of public and private keyword searches through O*NET Web Services (http://www.onetcenter.org/dev_web.html). O*NET Web Services is an application programming interface (API) that developers can use to display O*NET information in their applications and take advantage of tools such as the occupation keyword search featured in My Next Move and O*NET OnLine. For more detailed information on O*NET's web services feature, see http://www.onetcenter.org/dev_web.html.

Alternate Titles provide customers with a better understanding of the O*NET occupations. For example, "Ultrasound Technician" is an alternate title for O*NET's Diagnostic Medical Sonographers; "Cosmetology Instructor" and "Culinary Arts Instructor" are alternate titles for the O*NET occupation Vocational Education Teachers, Postsecondary; and "Stockbroker" is an alternate title for Sales Agents, Securities and Commodities. To date, there are 59,634 entries (45,472 distinct titles) covering the 1,110 occupational titles within the O*NET-SOC classification. The average number of Alternate Titles per occupation is 54, with the majority of occupations having between 10 and 100 titles (Gregory & Lewis, 2015).

A multimethod data collection approach is used to populate the Alternate Titles (see Gregory & Lewis, 2015, for the complete Alternate Titles development methodology). In the following sections, we describe the different sources for Alternate Titles and summarize the procedures for developing the data included in O*NET products and tools.

Alternate Titles Data Sources

Multiple data sources are used to develop Alternate Titles. Our multisource approach helps to develop a complete list of lay titles that are used by many different types of O*NET users, including employers, job seekers, educators, and workforce development professionals. The five sources for Alternate Titles are described in the following sections.

Incumbent and Occupational Expert Write-in Titles  Incumbent and Occupational Expert (OE) data are collected through the O*NET data collection program on the background questionnaire (see http://www.onetcenter.org/ombclearance.html for data collection questionnaires). Job incumbents and OEs write in their job titles on these questionnaires, responding to the following question: "What is the title of your current job?" After data cleaning and write-in title collapsing, O*NET retains a list of these titles and their frequencies for inclusion in the Alternate Titles database.

Employer Job Postings  Employers post job advertisements on multiple national job boards. Alternate Titles data are gathered from these job postings and compared to current O*NET Alternate Titles data. If a job posting contains a title that is not currently present in the Alternate Titles database and also has a sufficiently high frequency of related job postings, the title is obtained for inclusion.


Occupational Code Assignment Submissions  An occupational code assignment (OCA) is a process established to help occupational information users relate a job title or occupational specialty to an O*NET-SOC occupation (http://www.onetcenter.org/oca.html). Businesses, training and educational institutions, labor and occupational organizations, and professional associations can use the OCA process to determine if a job title or occupational specialty is recognized within the O*NET-SOC system and the U.S. labor market. Submitted job titles are obtained for Alternate Titles data.

Transactional Analyses  Analysis of customers' transactions on DOL-sponsored career and job seeker websites provides a source for Alternate Titles. For example, America's Career InfoNet (ACINet; 2016) collects user search terms that return no results on their website. Unmatched search terms are collected by ACINet and provided to the O*NET Center approximately every six months.

Miscellaneous Submissions  A variety of other sources of Alternate Titles information are used to build the database. These sources include requests from associations that support and participate in the O*NET data collection efforts, professional groups, customers, and other occupational classification systems. All Alternate Titles submissions are catalogued throughout the year for Alternate Titles data.

Alternate Titles Procedural Summary  A standardized set of procedures is used to develop Alternate Titles. O*NET gathers data from all five data sources annually. Each title undergoes an extensive multistep review process and is reviewed by multiple occupational analysts. First, incumbent write-in titles, provided from survey questionnaires, are reviewed to identify exclusionary titles, duplicates, acronyms, abbreviations, deletions, and compound titles. The goals of this first review are to reformat titles as needed to match style guidelines, to expand any acronyms or abbreviations, and to mark titles as needed for exclusion (i.e., the title is generally associated with a different O*NET-SOC occupation) or deletion (the title contains no useful content). Second, write-in titles with a frequency of three or greater and non-write-in titles (e.g., SOC, DOT) are identified for the database of Alternate Titles. Write-in titles with a frequency of one or two are excluded from this database. The Alternate Titles database is available for download at http://www.onetcenter.org/dictionary/20.3/excel/alternate_titles.html. Third, based on the review, a database of sample reported titles is created. This is a subset of the Alternate Titles database and includes the job titles most frequently reported by incumbents and occupational experts on data collection surveys. These titles are displayed on occupational reports in the O*NET OnLine and O*NET Code Connector web applications; up to 10 titles for each occupation are displayed and included in this file. Up to four titles are also displayed in My Next Move, My Next Move for Veterans, and Mi Próximo Paso. This database is available for download at http://www.onetcenter.org/dictionary/20.3/excel/sample_of_reported_titles.html.

The database of Alternate Titles and the Sample of Reported Titles are updated annually. (These databases are also available in Spanish.) They provide current, relevant job titles for the O*NET program as well as for other developers of career exploration tools.
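The review steps above can be condensed into a small filtering pass: normalize write-in titles, expand known acronyms, drop exclusions and empty entries, and keep titles reported three or more times. The frequency threshold comes from the text; the field names, acronym map, and exclusion set are illustrative assumptions.

    from collections import Counter

    def build_alternate_titles(write_in_titles, exclusions, acronym_map):
        # Tally cleaned write-in titles, then apply the review rules described above.
        counts = Counter(title.strip() for title in write_in_titles if title and title.strip())
        kept = []
        for title, freq in counts.items():
            expanded = acronym_map.get(title, title)   # expand known acronyms/abbreviations
            if expanded in exclusions:                 # title belongs to a different occupation
                continue
            if freq >= 3:                              # frequency rule from the procedure above
                kept.append((expanded, freq))
        return sorted(kept, key=lambda pair: -pair[1])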

O*NET PRODUCTS AND USERS

The O*NET program has a wide array of products and tools that incorporate O*NET data. These resources help the O*NET program disseminate information to a wide variety of users. O*NET products are available free of charge.


On O*NET's Resource Center site, www.onetcenter.org, users can learn about the development and uses of O*NET information; download the O*NET database and assessment tools (e.g., the Interest, Ability, and Work Values Profilers); download development and technical reports on all O*NET research and products; get details about the O*NET data collection; download O*NET data collection questionnaires; sign up for O*NET updates; and link to additional O*NET websites, including the O*NET Academy, where training materials related to O*NET are provided. The O*NET Resource Center site is the "library" for all things O*NET and is a critical resource for O*NET users.

O*NET has developed a number of products and tools to help O*NET customers succeed in using O*NET data. These tools help individuals explore careers and find jobs. They help workforce development professionals and educators develop assessments, job descriptions, training programs, and performance systems. Researchers and government agencies can use the products to examine the changing world of work, perform skills gap analyses, develop competency models, and identify new and emerging occupations. O*NET products and tools are useful for a seemingly endless array of workforce development activities.

In the next sections of this chapter, we will describe some of the major O*NET products, the O*NET database, and major O*NET websites. The O*NET Resource Center has complete information and links to access these products. We will also describe the web services now available to developers, enabling them to more readily integrate O*NET data, reports, and websites directly into their websites or web applications. Then, we will present a broad description of O*NET user statistics, as well as specific examples of the variety of ways O*NET is used.

O*NET DATABASE

The heart of the O*NET program is the O*NET data. O*NET collects data on 974 occupations. Each occupation has over 270 descriptor ratings (e.g., importance, level, and frequency). This information is stored in the O*NET database. Updates to the database are ongoing and occur annually. Currently, O*NET version 20.3 is the latest database; it can be downloaded from the O*NET Resource Center and contains data on all 974 occupations in the O*NET-SOC taxonomy. (Additionally, over 500 occupations have had more than one update.) Detailed knowledge, skills, and ability ratings, task information, work activities, T2s, and alternate titles are part of the database.

Recently, improvements have been made to the structure of the database to make it easier for developers and other interested parties to use. Instead of the use of supplementary files, all files are now part of the main database. Individual files for each type of descriptor (e.g., knowledge, skills, work activities) are available for download. The database has been formatted in Microsoft Excel and Oracle. In addition to tab-delimited text, SQL files for MySQL, Microsoft SQL Server, and Oracle have been prepared.
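For readers who want to work with the downloaded files directly, here is a minimal sketch using the tab-delimited release. The file names ("Occupation Data.txt", "Skills.txt"), column names, and the "IM" importance scale code follow recent O*NET text-file releases, but they should be verified against the data dictionary for the version actually downloaded.

    import pandas as pd

    # Load two of the individual descriptor files from the tab-delimited release.
    occupations = pd.read_csv("Occupation Data.txt", sep="\t")
    skills = pd.read_csv("Skills.txt", sep="\t")

    # Importance ratings for each occupation, joined to occupation titles.
    importance = skills[skills["Scale ID"] == "IM"].merge(
        occupations, on="O*NET-SOC Code", how="left")

    electricians = importance[importance["Title"] == "Electricians"]
    print(electricians[["Element Name", "Data Value"]]
          .sort_values("Data Value", ascending=False)
          .head(10))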

O*NET Data Dictionaries

To help navigate the O*NET database, customized data dictionaries for each format of the database are provided. The dictionary includes:

• An outline of the data structure
• Definitions for database elements, including descriptors and rating scales
• Meta-data, including means, standard errors, and upper and lower confidence intervals
• Data suppression rules used by the O*NET Center in determining whether or not to publish the data
• The source of data and the date of data collection

The Data Dictionaries are available in interactive online forms or they can be downloaded (http://www.onetcenter.org/dictionary/). The Dictionaries are invaluable resources for users to help navigate the O*NET Databases.
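The meta-data listed in the dictionary can also be applied programmatically, for example to honor suppression recommendations and to flag imprecise estimates. Column names such as "Recommend Suppress" and "Standard Error" follow recent releases and should be confirmed in the data dictionary; the precision threshold used below is an arbitrary illustration.

    import pandas as pd

    ratings = pd.read_csv("Skills.txt", sep="\t")

    # Keep only rows the Center does not recommend suppressing.
    published = ratings[ratings["Recommend Suppress"] != "Y"].copy()

    # Flag estimates with wide confidence intervals or large standard errors.
    published["ci_width"] = published["Upper CI Bound"] - published["Lower CI Bound"]
    imprecise = published[published["Standard Error"] > 0.5]

    print(len(published), "publishable rows;", len(imprecise), "flagged as imprecise")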


O*NET® OnLine (www.onetonline.org)

O*NET OnLine is the most comprehensive of the O*NET websites in presenting O*NET data. It was developed to provide access to "all" users of O*NET data, and thus it has many different features. It is a free website with easy access to information on more than 900 occupations. The landing page for O*NET OnLine is displayed in Figure 40.7. O*NET OnLine offers users the opportunity to:

• Find occupations to explore
• Search for occupations that use their skills
• Look at related occupations
• View occupational summaries of the worker and the requirements of the work
• View details of occupations, such as skills, knowledge, interests, and activities
• Use crosswalks from other classification systems to find corresponding O*NET occupations
• Connect to other online career information resources

FIGURE 40.7  Landing Page—O*NET OnLine (https://www.onetonline.org/) Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.


O*NET OnLine provides a number of key features that optimize customer ease of use. First, the site offers three types of search functions to facilitate easy navigation of occupational data. The simplest search function, "Find Occupations," allows for a quick search of occupations using keywords or O*NET-SOC codes. In addition, users can browse groups of similar occupations, such as Job Families (occupations grouped based on work performed, skills, education, training, and credentials), O*NET Descriptors (categories of occupational information collected and available for O*NET-SOC occupations), and Career Clusters (groups that contain occupations in the same field of work that require similar skills). The Advanced Search allows users to explore occupations with skill sets similar to theirs or to use machines, equipment, tools, or software to find high-demand occupations. The Crosswalk Search locates O*NET-SOC occupations using any of several different occupational classification systems [Classification of Instructional Programs (CIP), Dictionary of Occupational Titles (DOT), Military Occupational Classification (MOC), Occupational Outlook Handbook (OOH), Registered Apprenticeship Partners Information Data System (RAPIDS), and Standard Occupational Classification (SOC)].

Second, customers can select summary, detailed, or custom reports for quick and easy prepared reports or to display and print only those worker characteristics and occupational requirements of interest. Table 40.3 outlines the descriptors included in all three types of reports.

TABLE 40.3  O*NET Descriptors Listed in Summary, Details, and Custom Reports

Tasks: Work activities that are specific to each occupation, such as "analyzing and testing computer programs or systems to identify errors"
Tools & Technology: Machines, equipment, tools, and software that workers may use for successful performance on the job, such as "laser measuring systems" or "computer-aided design CAD software"
Knowledge: Organized sets of principles and facts that apply to a wide range of situations, such as knowledge of "mathematics," "chemistry," or "fine arts"
Skills: Capacities developed through education or experience that help you perform your job, such as "reading comprehension"
Abilities: Enduring attributes of an individual that influence performance, such as "deductive reasoning"
Work Activities: Tasks that may be performed across multiple occupations, like "thinking creatively"
Work Context: Physical and social factors that influence the nature of work, such as "the amount of time spent sitting"
Interests: Preferences for work environments and outcomes. For example, an interest in "investigative occupations" signals an interest in working with ideas and thinking.
Work Values: Global aspects of work that are important to a person's satisfaction, like "independence"
Work Styles: Work characteristics that can connect what is important to a worker with occupations that reflect or develop those values, such as "Initiative," "Persistence," or "Cooperation"
Job Zones: Job Zones group occupations into one of five categories based on the levels of education, experience, and training necessary to perform the occupation.
Related Occupations: Occupations similar to the selected occupation in required knowledge areas, skills, abilities, work environment, and work activities
Wages and Employment: National wage information and employment prospects for your selected occupation. State information is provided through a link to CareerOneStop (www.careeronestop.org/).

Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.


Reports also link directly to corresponding wages and employment outlook, job listings, and job banks found in CareerOneStop (http://www.careeronestop.org), a DOL-sponsored website. There are also direct links to training, certifications, licenses, and apprenticeships from myskillsmyfuture (http://www.myskillsmyfuture.org/), another DOL-sponsored website. Users can save occupational information for easy use in word processing, spreadsheet, or database programs.

Third, users can not only search for occupations of interest but also view related occupations. Related occupations are generated based on the comparison of similar tasks, skills, and other descriptor information collected for each occupation. This new feature of O*NET OnLine was developed as an expansion of the Detailed Work Activities project (see the New Work Activity Statements section) to further develop and utilize cross-occupational data and cross-occupational searches. This feature, although new, has been very popular. In January 2016 alone, it was used more than 600,000 times (National Center for O*NET Development, 2016).

As discussed previously, O*NET OnLine is an inclusive website, presenting many different aspects of O*NET. The next three websites described—My Next Move, My Next Move for Veterans, and Mi Próximo Paso—are targeted to more specific groups. They are good examples of how O*NET data can be used to serve the needs of particular populations.

My Next Move (www.mynextmove.org)

In an effort to develop a simplified version of O*NET OnLine for new job seekers, students, or other adults with lower literacy and computer skills, O*NET released My Next Move. The landing page for My Next Move is presented in Figure 40.8. This web-based interactive tool was designed to assist users in managing their education and career plans and learning more about their career options. It provides easy access to career exploration, educational and training programs, and job postings. Students and those in transition find this streamlined website a great tool in helping determine their next move on the road to a satisfying career. Users can:

• Explore more than 900 different careers and see important information, including skills, tasks, salaries, and employment outlook, on easy-to-read one-page career reports
• Look at related apprenticeships and training, and search actual job openings
• Find careers through a keyword search; by browsing industries; or through the O*NET Interest Profiler, a tool that offers customized career suggestions based on a person's interests and level of education and work experience

The simplified career reports provided in My Next Move feature the most important knowledge, skills, and abilities needed to perform the work, explained in language that is easy to understand. Outlook and education sections let users find salary information, job postings, and training opportunities. The visual design enables users to identify a career's key points or to explore a career in depth. Job seekers interested in specific careers can start exploring quickly with an intuitive keyword search. Users looking for a broader range of opportunities can browse more than a dozen industries, each featuring a range of careers from which to choose, including those in the green economy and those with a bright outlook for job opportunities.

My Next Move also includes a web-based version of the popular O*NET Interest Profiler, a tool designed to assess an individual's vocational interests. The web-based version of the tool features 60 items that, along with information about the user's education and work experience, guide users to careers they may enjoy. The O*NET Interest Profiler Short Form page (http://www.onetcenter.org/IPSF.html) has more information about this career exploration tool.

My Next Move for Veterans (www.mynextmove.org/vets/)

Because of the influx of veterans returning to civilian jobs in recent years, O*NET proactively developed My Next Move for Veterans.


FIGURE 40.8  Landing Page—My Next Move (www.mynextmove.org) Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.

This web-based interactive tool for U.S. veterans enables them to use their military experience to explore the civilian world of work. Like My Next Move, the site has tasks, skills, salary information, job listings, and more for more than 900 different careers. Veterans can also take advantage of the O*NET Interest Profiler, a tool that offers personalized career suggestions based on a person's interests and level of work experience. Figure 40.9 presents the landing page for My Next Move for Veterans. Similar to My Next Move, users can:

• Explore more than 900 different careers and see important information, including skills, tasks, salaries, and employment outlook, on easy-to-read career reports
• Look at related apprenticeships and training, and search actual job openings
• Find careers through a keyword search or by browsing industries
• Find careers through a military transition search using military job titles; similar civilian careers relevant to their military experience are recommended


FIGURE 40.9  Landing Page—My Next Move for Veterans (http://www.mynextmove.org/vets/) Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.

The career reports in My Next Move for Veterans replicate those in My Next Move. They feature the most important knowledge, skills, and abilities needed to perform the work, explained in language that’s easy to understand. Outlook and education sections let users find salary information, job postings, and training opportunities. In addition, the career reports display related job titles from military classification systems so veterans can compare careers using familiar terms. They can also enter their current military job code or title into the site’s military transition search and see a list of civilian careers that have similar tasks or requirements.


Mi Próximo Paso (http://www.miproximopaso.org/)

The Spanish-speaking population of job seekers is on the rise in the United States, and demand for both English and Spanish applications and translations is increasing. In response to customer feedback and the need to allow the Spanish-speaking population to easily utilize the O*NET system, O*NET developed Mi Próximo Paso. Figure 40.10 presents the landing page for this website. A web-based interactive tool, the site was developed for Spanish-speaking job seekers, students, and other career explorers to learn more about their career options. Mi Próximo Paso includes all the features of the English-language site My Next Move. The site also has a web-based, Spanish-translated version of the O*NET Interest Profiler, developed to improve the career exploration possibilities for Spanish-speaking job explorers.

FIGURE 40.10  Mi Próximo Paso (http://www.miproximopaso.org/) Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.


O*NET Web Services

O*NET website features were designed not only for job seekers and career explorers but also for application developers. The O*NET program has introduced web services that improve access to O*NET products and tools, especially for application developers. An intuitive screen interface and comprehensive contextual information help make the applications easy to use without training and support. For developers, published application program interfaces (APIs) are available to connect vendor systems to key features of O*NET web applications. Through O*NET Web Services, developers can integrate O*NET tools into their own website or web-enabled application (a brief request sketch follows the list), including:

• Keyword Search—both the My Next Move search and the OnLine occupation search are available for use in career sites. The REST web services API returns occupations matching a word, phrase, title, or full or partial O*NET-SOC code. The results include the code and title of each matching occupation.
• My Next Move Career Reports—concise, easy-to-read overviews for each occupation in My Next Move. Key knowledge, skills, and abilities are available for more than 900 occupations. APIs also provide Bright Outlook and Green information, job outlook, and more.
• Summary and Details Occupation Reports—detailed information from O*NET OnLine for more than 900 occupations. User applications can include an occupation's most important or all tasks, knowledge, skills, abilities, tools and technology, and more.
• Military Search—the military transition search used in My Next Move for Veterans is also available through the web services API. The search returns relevant O*NET-SOC occupations based on full or partial codes and titles from the Army, Navy, Air Force, Marine Corps, and Coast Guard classification systems.
• Spanish Keyword Search—the Spanish-language keyword search used in Mi Próximo Paso is part of the web services API. Occupation titles are returned, in Spanish, matching a Spanish word or phrase. A wide range of features from Mi Próximo Paso are also available, including detailed career reports and Interest Profiler questions and scoring.
• Interest Profiler—this assessment tool can be included in customer career tool sites using the IFrame Widget. After adding a simple block of HTML code, users can take the O*NET Interest Profiler without leaving their career resources. For tighter integration, a REST web services API is offered. It provides scoring services and career results from the range of O*NET-SOC occupations. This tool is provided in both English and Spanish.
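As a hedged illustration of how a developer might call the keyword search from a script: the service root, resource path, JSON field names, and HTTP Basic authentication shown below are assumptions based on the public documentation referenced at http://www.onetcenter.org/dev_web.html, and registration is required before credentials are issued. Consult that page for the current interface before relying on any of these details.

    import requests

    SERVICE_ROOT = "https://services.onetcenter.org/ws/"   # assumed service root
    CREDENTIALS = ("your_username", "your_password")        # placeholder; issued on registration

    response = requests.get(
        SERVICE_ROOT + "online/search",                     # assumed keyword-search resource
        params={"keyword": "ultrasound technician"},
        headers={"Accept": "application/json"},
        auth=CREDENTIALS,
        timeout=30,
    )
    response.raise_for_status()

    # The field names below ("occupation", "code", "title") are also assumptions.
    for occupation in response.json().get("occupation", []):
        print(occupation.get("code"), occupation.get("title"))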

Organizations using web services include federal, state, and local government agencies, military services, educational institutions, assessment and career information delivery systems, public workforce investment systems, private organizations and corporations, and international users (National Center for O*NET Development, 2015). Web services can significantly reduce the cost and effort for developers to update their applications with O*NET products and tools. One of the advantages of web services is that O*NET data updates are incorporated seamlessly, so no new programming is required by developers. Additionally, as new features are added to O*NET web applications, new web services are designed so that developers can have immediate access to them and update their applications in an efficient and timely manner.

O*NET Web and Product Use

O*NET websites serve three general purposes: (1) making O*NET occupational data available to a range of users; (2) describing O*NET products and their potential uses; and (3) providing historical, technical, and procedural information on the O*NET data collection program and O*NET product development. Site use has increased steadily since each site's initial launch and, in particular, over the past three years (National Center for O*NET Development, 2015). The National Center for O*NET Development maintains statistics on visits to O*NET websites and downloads of O*NET products.


The term "products" refers to O*NET database files, assessment tools, assessment tool software, and other products, such as crosswalks across O*NET-SOC taxonomies, O*NET questionnaires, and the Toolkit for Business.

Site Statistics, Linkages, and Product Downloads

In 2014, the O*NET program's six websites received a combined total of nearly 52 million visits, over 1.45 billion hits, and over 211 million page views. Table 40.4 presents these statistics by site. O*NET OnLine currently averages 3.4 million visits per month, three times as many as the reported average in 2011. The O*NET Resource Center (http://onetcenter.org) averages 670,000 visits per month, twice the number of visitors recorded in 2011. Annual user statistics compiled from 2002 to 2014 show the upward trend in site use. Figure 40.11 presents site visits by year.

The number of Internet sites that link to O*NET websites is also impressive. According to a search conducted by the National Center for O*NET Development (2015):

• Over 18,000 sites link to O*NET OnLine.
• Over 900 sites link to the O*NET Code Connector.
• Over 800 sites link to the O*NET Resource Center.
• Over 3,600 sites link to My Next Move.
• Over 750 sites link to My Next Move for Veterans.
• Over 580 sites link to Mi Próximo Paso.

In 2014, users performed nearly 105,000 downloads of O*NET products: over 10,000 of the database, nearly 65,000 of the assessment tools, over 16,000 of the computerized assessment tools, and over 13,000 of other products. Table 40.5 presents these statistics by product. By looking at O*NET site statistics, linkages to O*NET websites, and O*NET downloads, it is clear that O*NET is reaching more and more users. The introduction of the My Next Move websites (which are geared toward specific populations) and O*NET Web Services should facilitate the increasing dissemination and use of O*NET products and tools. The following section presents actual examples of O*NET product and data use.

O*NET Case Studies

As organized within the O*NET Content Model, O*NET descriptors capture both job-oriented and worker-oriented characteristics to guide job seekers and career changers to occupations that match their interests and skills. They also inform the work of professionals in career counseling, human resources consulting, and workforce development.

TABLE 40.4  2014 O*NET Web Statistics by Site (in millions)

Site                           Visits    Hits        Page Views
Resource Center                22.78     68.76       11.50
OnLine                         21.34     860.98      111.98
My Next Move                   5.97      440.71      67.92
My Next Move for Veterans      .78       42.02       9.83
Mi Próximo Paso                .62       22.07       5.62
Total:                         51.49     1,434.54    206.85

Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.


FIGURE 40.11  O*NET Site Visits by Year (2002 Through 2014), in millions, for the O*NET Resource Center, O*NET OnLine, My Next Move, My Next Move for Veterans, and Mi Próximo Paso. Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.

TABLE 40.5  2014 O*NET Downloads, by Product

Product                        Downloads
Database                       10,145
Ability Profiler               15,906
Interest Profiler              34,474
Work Importance Locator        14,457
  CIP-WIP Software             16,335
Other Products                 13,582
Total:                         104,899

Source: Courtesy of U.S. Department of Labor, Employment and Training Administration—Used with Permission.

Meta-data provided with the database can help users decide appropriate ways to use O*NET data (http://www.onetcenter.org/dictionary/20.3/excel/). Meta-data is available at both the occupational level (e.g., response rates, questionnaire completeness rates, respondent experience information, and industry sector) and the item level (e.g., standard error, confidence intervals, data suppression recommendations, and relevance flags). The National Center for O*NET Development maintains O*NET Products at Work (PAW) and the O*NET Reference List on the Resource Center site (www.onetcenter.org/paw.html). The PAW function enables users to share their stories on how they use O*NET. The PAW is an excellent source for O*NET users to gain insight into the multiple ways O*NET can help them accomplish various workforce development activities. The reference list is also an excellent resource for O*NET users. The following paragraphs describe O*NET data and product use through case studies from the government, private, and military sectors. These and other "stories" can be found in the O*NET PAW.
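As a concrete illustration of how the item-level metadata might be honored in practice, the short sketch below loads a downloaded O*NET data file and drops ratings flagged for suppression. This is our own minimal example, not an official O*NET utility; the file name and column names (e.g., "Recommend Suppress", "Standard Error") are assumptions based on the published data dictionary and should be checked against the release you download.

```python
# A minimal sketch (not an official O*NET utility) of applying item-level
# metadata when working with a downloaded O*NET data file. Column names are
# assumptions based on the published data dictionary; verify them for your release.
import pandas as pd

# e.g., the tab-delimited "Knowledge.txt" file from the O*NET database download
ratings = pd.read_csv("Knowledge.txt", sep="\t")

# Keep only ratings that the metadata does not flag for suppression,
# and carry the standard error forward so users can judge precision.
usable = ratings[ratings["Recommend Suppress"] != "Y"]
print(usable[["O*NET-SOC Code", "Element Name", "Scale ID",
              "Data Value", "Standard Error"]].head())
```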



Government
The DOL integrates O*NET data in its online tools to assist individuals and businesses with a variety of career and workforce development objectives:

• CareerOneStop (http://www.careeronestop.org) is an online resource for assistance in career exploration and preparation; job searches; talent acquisition, development, and retention; and disaster recovery assistance relating to employment. Its career exploration interface uses the O*NET occupational taxonomy, data, and assessment tools to match users' interests, skills, experience, and work values to jobs.
• America's Career InfoNet (http://www.careerinfonet.org/) incorporates O*NET knowledge, skills, abilities, and task data in the occupation profiles presented in its enhanced job search tool.
• Job Description Writer (2016; http://www.careeronestop.org/businesscenter/jdw/gettingstarted.aspx) supplies eight categories of occupation-specific O*NET data, which the user may customize in building a functional job description.
• Competency Model Clearinghouse (2016; http://www.careeronestop.org/competencymodel/) provides two interactive online tools: Build a Competency Model and Build a Career Ladder/Lattice. Both incorporate O*NET occupations' titles, tasks, vocational preparation levels, and Job Zones at the models' highest levels of specificity.

In 2008, the Social Security Administration (SSA) began development of an Occupational Information System (OIS) for use in its disability adjudication process. The system, which was intended to replace the Dictionary of Occupational Titles (DOT) as a source of occupation-specific work requirements, expands on the DOT bank of descriptors to include basic mental and cognitive work requirements and further describe occupations' exertion and skill ratings. To avoid duplication of ongoing work by the DOL, the SSA has investigated numerous existing sources of information critical to the disability adjudication process (https://www.ssa.gov/disability/step4and5.htm#&a0=2). To develop data elements describing the mental and cognitive requirements of work, the SSA has drawn upon O*NET's mental and cognitive descriptors. The agency continues to work with the DOL Employment and Training Administration to identify and incorporate O*NET's task statements, lay titles, and occupationally relevant tools and technology. O*NET task statements and T2 data are especially relevant to Step 4 of SSA's five-step process of determining individual disability: Can the claimant do the work he or she did previously? T2 data helps provide the specific level of occupational information needed to determine what the worker is required to do on the job (How We Decide If You Are Disabled; https://www.ssa.gov/disability/step4and5.htm#&a0=2; https://www.ssa.gov/disabilityresearch/occupational_info_systems.html).

The Connecticut Department of Labor used O*NET data to respond to workforce investment area requirements that states assess current and future job opportunities in the state, the skills necessary to obtain these jobs, and the skills necessary to meet the economic development needs of the state. To meet these requirements, Connecticut collected and analyzed labor market information (LMI) and published an extensive report. O*NET skills and skill descriptions were used in the report sections describing skills necessary for Connecticut's high-demand occupations and industry sectors important for state economic development (Connecticut Department of Labor; http://www.doleta.gov/programs/ONET/ct.cfm).

Private Sector Companies
The multinational human resources consulting firm ManpowerGroup has used the O*NET occupational and skills taxonomy to match candidates to jobs (https://www.doleta.gov/programs/ONET/Manpower.cfm). Owing to its clients' diverse classification and coding systems, the firm faced a challenge in standardizing the classification of job titles and skills. To efficiently analyze the occupational mix and high-demand skill sets, ManpowerGroup recoded client jobs to map to O*NET occupational classifications, permitting use of O*NET's common-language descriptors in standardizing the characteristics of client jobs. Standardization permitted the firm to improve accuracy in (a) identification of job placement types, (b) consolidation of information to facilitate market and other analyses, (c) global reporting of firm activity, and (d) tracking of staffing trends.

A report from the research division of IBM relied upon O*NET data to demonstrate the feasibility of organizing knowledge-based workers according to a clustering of their attributes (Leung & Glissmann, 2010). In the study, O*NET data was used to identify clusters of attributes required of workers in the insurance industry. Mapping the requirements of knowledge-intensive jobs to O*NET-SOC classifications enabled the identification of relevant O*NET knowledges and skills for those jobs. Using level means of the O*NET descriptors, a statistical clustering procedure was used to group workers in seemingly diverse jobs, such as claims processor, underwriter, and sales manager, into teams. By mapping the functional requirements of jobs to O*NET occupations and their knowledge and skill requirements, this study supports the use of O*NET data in organizational design (see the illustrative sketch at the end of this subsection).

Assessment firms have incorporated O*NET data in customized workforce development tools for public and private sector use. One such company, Profiles International (PI), assesses job seekers on soft skills, job behavior traits, thinking style, and occupational interests. Results are then matched to O*NET occupations and presented in a Career Compatibility Report, which lists occupations that may be a good fit. PI develops a profile of the ideal candidate for a specific job opening by administering the same assessment to a firm's most successful incumbent workers. Job seekers' profiles are compared with the company's job profile, and the results are presented in a Placement Report, which displays the degree of match between each applicant and the job (Global Assessment Center; https://www.profilesgac.com/Login.aspx).
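A minimal sketch of the general clustering idea described in the IBM example above (not the procedure used in that study): occupations are profiled by their mean skill level ratings and then grouped with a standard clustering algorithm. The file layout and column names are assumptions based on the O*NET database downloads.

```python
# A minimal sketch of clustering jobs by O*NET skill level means (assumed file layout).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

skills = pd.read_csv("Skills.txt", sep="\t")        # O*NET database file (assumed layout)
levels = skills[skills["Scale ID"] == "LV"]          # keep level (not importance) ratings

# Rows = occupations, columns = skill descriptors, values = mean level ratings
profile = levels.pivot_table(index="O*NET-SOC Code",
                             columns="Element Name",
                             values="Data Value")

X = StandardScaler().fit_transform(profile.fillna(0))
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
profile["cluster"] = clusters
print(profile["cluster"].value_counts())
```

In an applied setting, the resulting clusters would only be a starting point; analysts would review them before using them to inform organizational design.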

U.S. Armed Services
The U.S. military has recognized the value of O*NET data and career tools in its various transition programs, recruiting activities, and human systems development projects. Presented here are a few examples of the range of O*NET products being put to work in the armed forces.

As described above, My Next Move for Veterans is designed for U.S. veterans who are current job seekers. This interactive tool helps veterans learn about their career options. The site has tasks, skills, salary information, job listings, and more for more than 900 different careers. Veterans can find careers through keyword search, by browsing industries that employ different types of workers, or by discovering civilian careers that are similar to their jobs in the military. Veterans can also take advantage of the O*NET Interest Profiler, a tool that offers personalized career suggestions based on a person's interests and level of work experience.

Transition GPS, a civilian-workforce re-entry tool for separating and retiring service members, uses assessment results from the O*NET Interest Profiler to generate interest-based civilian job options for clients. The virtual delivery of its learning modules opens up much-needed educational opportunities that are not often available to service members because of their mobility, varying time zones, accessibility, and stage of demobilization and integration. All that is required for attendance is a broadband Internet connection and a telephone. The series of 11 webinar course offerings includes one on decoding military skills for civilian employment, which prominently features both O*NET OnLine and My Next Move as tools to help service members make a successful transition to civilian employment (http://www.dol.gov/vets/programs/tap/).

A report prepared for the Air Force Personnel Center details how O*NET assists the process of selecting the candidates who are most likely to succeed as either remotely piloted aircraft (RPA) pilots or sensor operators. Specifically, the O*NET Content Model provided an organizing framework for the relevant skills, abilities, and other characteristics required by these jobs. O*NET descriptors allowed the researchers to "reduce redundancy across constructs and to ensure broad coverage of several different domains of individual differences." Their use of the Content Model resulted in two different options for selection batteries for each of the critical positions of RPA pilot and sensor operator (Paullin, Ingerick, Tripp, & Wasco, 2011).

The various case examples collected in O*NET Products at Work indicate that O*NET is being used for a variety of purposes by both public and private sector organizations. As more advanced and easy-to-use technologies become available, and more customized applications are developed, additional users will have access to O*NET data, products, and tools.

O*NET ACCOMPLISHMENTS, CHALLENGES, AND FUTURE ENHANCEMENTS
The O*NET program has made significant progress since its inception in the early 1990s. In many respects it has successfully addressed the recommendations of the APDOT panel for improving occupational information provided by the DOL to the workforce development community. Historical experience, advancing technology, and an emphasis on continuous improvement have helped the O*NET program provide quality data to its users. Yet the O*NET program still faces challenges in performing data collection and dissemination. To reach more customers and expand services, enhancements to the program might be necessary. In the following sections, we present some major accomplishments of the program as well as challenges faced. We also discuss some future enhancements that might address these challenges and improve the quality of data and services provided by the O*NET program.

Accomplishments
Using the O*NET Content Model and O*NET-SOC taxonomy provides the common language of work and the structure necessary for successful data collection and dissemination. These components of the O*NET program help address the APDOT committee's recommendations for a common language of occupational and worker descriptors, a hierarchical structure of data, multiple windows of information, and the use of a common classification system. The Content Model and O*NET-SOC taxonomy enable O*NET to expand and change occupations and descriptors within a sound structure. They make it possible for O*NET to communicate with users in a very specific manner about what occupations and descriptors are included in the O*NET program.

The core of the O*NET program is data collection. O*NET's multimethod approach to data collection is one of the keystones of the program's success. The partnership between the DOL and private and public organizations has been substantial. More than 50,000 businesses, 200,000 job incumbents, 7,000 occupational experts, and 700 national associations have participated in the data collection. Using standardized, repeatable measures and procedures helps ensure the collection of quality data. Having a data collection operations center with well-trained business liaisons ensures a continuous, well-managed effort. Collecting data using multiple sources (e.g., incumbents, occupational experts, job analysts, transactional data, real-time employer data, customer input, and web research) helps obtain complete information on a broad range of occupations. The data collection procedures and operations address the APDOT committee's recommendation for a systematic approach to data collection. The data collection design allows for the inclusion of new occupations or the removal of outdated occupations. Since the same descriptors are collected for all occupations using similar methods, comparisons can be made across occupations. Occupational changes over time can also be tracked, as data is collected multiple times for each occupation.

Data dissemination and use has also been a major accomplishment of the O*NET program. The APDOT panel discussed the need for an electronic relational database that could be accessed by different types of users. Thus, one of the early goals of the program was to make the data accessible to the many potential users of the data. The O*NET database and websites have greatly enhanced the accessibility of the O*NET data. O*NET 20.3, published in late 2015, is the latest version of the database. Like almost all O*NET products and tools, the database is offered free of charge for download via the O*NET Resource Center (www.onetcenter.org). The database has been continually updated over the past 15 years. It is available in multiple formats, and individual files have been created for all descriptor categories. The development of multiple O*NET websites (e.g., O*NET OnLine, My Next Move, My Next Move for Veterans, Mi Próximo Paso, O*NET Resource Center) has improved the delivery of O*NET information to a variety of users. The addition of O*NET Web Services will greatly assist developers in incorporating O*NET data and functions easily and efficiently (see the illustrative sketch at the end of this subsection). O*NET usage statistics and Products at Work lend support to the notion that O*NET is widely used by both the public and private sectors for workforce development activities.
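To illustrate what the Web Services option can look like for a developer, the hedged sketch below requests a single occupation record over HTTP. The base URL, endpoint path, and response fields shown here are illustrative assumptions; registered developers should follow the official O*NET Web Services documentation and use the credentials issued to them.

```python
# A hedged sketch of pulling one occupation record from O*NET Web Services.
# Endpoint path and response fields are assumptions for illustration only.
import requests
from requests.auth import HTTPBasicAuth

BASE = "https://services.onetcenter.org/ws"          # assumed base URL
AUTH = HTTPBasicAuth("your_username", "your_password")  # credentials issued at registration

resp = requests.get(f"{BASE}/online/occupations/17-2051.00/",
                    auth=AUTH,
                    headers={"Accept": "application/json"},
                    timeout=10)
resp.raise_for_status()
occupation = resp.json()
print(occupation.get("title"), "-", occupation.get("description"))
```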

Challenges
Despite its accomplishments, O*NET continues to face challenges. The program faces a few key questions related to data collection. First, what occupations should be included in the data collection? Currently, the program collects data on occupations based on the SOC. This enables users to take advantage of other data collection programs using the SOC taxonomy. The program has added more detailed occupations to create the O*NET-SOC taxonomy. However, the question arises as to whether this level of detail is enough. Feedback from users varies. Some would like a more detailed occupational taxonomy, whereas others think it is very important to stay closely aligned with the SOC.

Related to the level of detail of occupations in O*NET, the question of adding new and emerging occupations is frequently discussed. Is the O*NET-SOC taxonomy keeping up with the changing world of work? Is it capturing new occupations that enter the workplace? The SOC taxonomy is updated approximately every 7 to 10 years. The last update was in 2010. The next update is scheduled for 2018 (http://www.bls.gov/soc/update.htm). The question persists: should the O*NET program increase efforts to identify new and emerging occupations based on additional research? The program does have the Occupational Code Assignment (OCA) system (http://www.onetcenter.org/oca.html). This procedure enables O*NET users to request help in finding their occupation of interest in the O*NET-SOC taxonomy. The OCA system (1) leads to code assignments, (2) helps update the O*NET Alternate Titles file, and (3) is considered during the O*NET-SOC occupational classification review and development. Although this process is useful, it might be necessary for the DOL to initiate a project similar to the New and Emerging Occupations project conducted by the O*NET Center in 2009, which resulted in over 150 occupations being added to the O*NET-SOC taxonomy (National Center for O*NET Development, 2009). The DOL might also want to investigate ways to update the SOC more frequently, such as leveraging real-time and "Big Data" sources (see the sketch at the end of this section). This would then allow government programs to update their occupational taxonomies more frequently (Davenport, 2014).

Another challenge faced by the O*NET program is the timeliness of the data collected. Are the skills and other activities required for occupations changing so rapidly that the data collection can't keep up? To help address this issue, the O*NET program has implemented several procedures. As discussed earlier in this chapter, O*NET has recently added the use of real-time data to update tools and technology and alternate titles (specific information prone to frequent updates or trends). Although referred to as tools and technology in the O*NET system, this information is often called "skills" by others in government and in the public and private workforce development communities. Additionally, the collection of new task information from incumbents (part of the data collection procedures) helps keep the O*NET information more relevant. This information is used to update occupational work activities. The use of real-time data and incumbent feedback will greatly improve the ability of the O*NET system to keep up with the changing skills of the workforce.
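As a speculative illustration of the real-time data idea raised above, the sketch below compares a scraped sample of job-posting titles against the O*NET Alternate Titles file and surfaces frequent titles that do not yet map to the taxonomy. The posting feed, file names, and column names are hypothetical assumptions, and any output would only be a starting point for analyst review.

```python
# A speculative sketch: surface frequently posted job titles not yet covered by
# the O*NET Alternate Titles file. File names and columns are assumptions.
import pandas as pd

postings = pd.read_csv("job_postings_sample.csv")            # hypothetical scraped feed
alternate = pd.read_csv("Alternate Titles.txt", sep="\t")    # O*NET database file (assumed layout)

known = set(alternate["Alternate Title"].str.lower())
unmatched = (postings["title"].str.lower()
             .loc[lambda s: ~s.isin(known)]
             .value_counts()
             .head(25))
print(unmatched)   # frequent unmatched titles are candidates for analyst review
```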

Possible Future Enhancements
The O*NET program continually looks for ways to improve the services, products, and tools it delivers to customers. Several enhancements are currently under consideration related to the Content Model, the O*NET-SOC taxonomy, the currency of data, and the O*NET assessment tools.

First, there have been discussions about extending the Content Model. The DOL has expressed interest in extending Worker Requirements to provide more detail under knowledges and skills. For example, could the math knowledge area have more specific data (e.g., algebra, geometry)? These new descriptors would have to fit into the structure of the Content Model, and it would have to be reasonable to expect that data could reliably and accurately be collected for them.

Different methods for updating the O*NET-SOC taxonomy are also being considered. Are there ways to update the taxonomy more frequently? In addition to traditional methods (e.g., literature review, customer outreach), are there new ways to capture emerging occupations that are not currently included in the O*NET-SOC taxonomy? Data mining and real-time data tools that examine "Big Data" sources might be able to capture emerging occupations and skills. These sources include databases from private, government, and not-for-profit organizations. The O*NET program is using real-time data to update Alternate Titles and Tools and Technology. It might be possible to update the O*NET-SOC taxonomy using these resources. More private-public partnerships could help the DOL update this taxonomy (and other areas of the program as well). Resources like LinkedIn, Monster.com, and private and public job boards may provide database sources to mine for changes in occupations and new and emerging skills.

Limiting data collection to certain areas of the content domain might also be a way of facilitating the expansion of the O*NET-SOC taxonomy. Resources for the O*NET program have remained relatively fixed over time. Adding occupations to the taxonomy for data collection may increase costs. However, if only certain rapidly changing elements were collected for the more specific occupations, costs could be controlled. For example, if a more detailed layer of occupations related to computer science were added, real-time data and other data mining techniques could be used to collect Alternate Titles, T2 information, and DWAs. Other areas, such as Ability and GWA data, could be gleaned from the higher-SOC-level occupation to which the more specific occupation is linked.

New versions of the O*NET Assessment Tools are also being considered. These assessments are among the most widely used O*NET products and tools. Public workforce development programs have indicated a need for shortened, mobile-phone-friendly tools. Currently, a shortened version of the O*NET Interest Profiler is being tested. It includes nonverbal emoji rating scales to help a more varied audience make better use of the tool. A new computerized version of the Work Importance Locator is being explored. With new computer technologies available, it might be possible to develop a computerized version of the O*NET Ability Profiler in a more cost-effective manner.

In summary, the O*NET program is always exploring ways to improve data collection, with the goal of delivering quality data in a useful and efficient manner. It also looks for ways to improve other products (e.g., assessment tools, reports, taxonomy, training, websites). The program attempts to keep abreast of the most current methods and technologies in order to keep advancing the system.
With the expanding use of O*NET, the program is hopeful for continued and increased support from both public and private sector organizations as well as individual users.

REFERENCES
Advisory Panel for the Dictionary of Occupational Titles (APDOT). (1993). The new DOT: A database of occupational titles for the twenty-first century (Final Report). Washington, DC: Employment Service, U.S. Department of Labor Employment and Training Administration.
America's Career InfoNet. (2016). Retrieved from http://careeronestop.org/Toolkit/ACInet.aspx
Arad, S., Hanson, M. A., & Schneider, R. J. (1999). Organizational context. In N. G. Peterson, M. D. Mumford, P. R. Jeanneret, & E. A. Fleishman (Eds.), An occupational information system for the 21st century: The development of O*NET (pp. 147–174). Washington, DC: American Psychological Association.
Berzofsky, M., Welch, B., Williams, R., & Biemer, P. (2008). Using a model-aided sampling paradigm instead of a traditional sampling paradigm in a nationally representative establishment survey. Durham, NC: RTI International.
Borman, W. C., & Brush, D. H. (1993). More progress toward a taxonomy of management performance requirements. Human Performance, 6, 1–21.
Committee on Techniques for the Enhancement of Human Performance: Occupational Analysis. (1999). The changing nature of work: Implications for occupational analysis. Washington, DC: National Academy Press.
Competency Model Clearinghouse. (2016). Retrieved from http://www.careeronestop.org/competencymodel/
Connecticut Department of Labor, Office of Research. (n.d.). Using O*NET to implement the workforce investment act. Retrieved from https://www.doleta.gov/programs/ONET/ct.cfm
Davenport, T. H. (2014). Big data at work. Boston, MA: Harvard Business Review.
Dawis, R. V., & Lofquist, L. H. (1984). A psychological theory of work adjustment: An individual-differences model and its applications. Minneapolis, MN: University of Minnesota Press.
Dierdorff, E., Drewes, D., & Norton, J. (2006). O*NET tools and technology: A synopsis of data development procedures. Raleigh: North Carolina State University.
Donsbach, J., Tsacoumis, S., Sager, C., & Updegraff, J. (2003). O*NET analyst occupational abilities ratings: Procedures (DFR-03-22). Alexandria, VA: Human Resources Research Organization.
Droege, R. C. (1988). Department of Labor job analysis methodology. In S. Gael (Ed.), The job analysis handbook for business, industry, and government (Vol. 2, pp. 993–1018). New York, NY: Wiley.
Dye, D., & Silver, M. (1999). The origins of O*NET. In N. G. Peterson, M. D. Mumford, P. R. Jeanneret, & E. A. Fleishman (Eds.), An occupational information system for the 21st century: The development of O*NET (pp. 9–19). Washington, DC: American Psychological Association.
Fine, S. A. (1968a). The use of the dictionary of occupational titles as a source of estimates of educational and training requirements. Journal of Human Resources, 3, 363–375.
Fine, S. A. (1968b). The 1965 third edition of the dictionary of occupational titles—Content, contrasts, and critique. Kalamazoo, MI: Upjohn Institute for Employment Research.
Fine, S. A. (1988). Functional job analysis scales: A desk aid. Milwaukee, WI: Author.
Fleishman, E. A. (1975). Manual for Ability Requirement Scales (MARS). Bethesda, MD: Management Research Institute.
Fleishman, E. A., & Mumford, M. D. (1988). The ability requirement scales. In S. Gael (Ed.), The job analysis handbook for business, industry, and government. New York, NY: Wiley.
Fleisher, M. S., & Tsacoumis, S. (2012a). O*NET analyst occupational abilities ratings: Procedures update. Alexandria, VA: Human Resources Research Organization.
Fleisher, M. S., & Tsacoumis, S. (2012b). O*NET analyst occupational skills ratings: Procedures update. Alexandria, VA: Human Resources Research Organization.
Global Assessment Center. (n.d.). Retrieved from https://www.profilesgac.com/Login.aspx
Gregory, C., & Lewis, P. (2015). Alternate titles procedures. Raleigh, NC: National Center for O*NET Development.
Hansen, M., Norton, J., Gregory, C., Meade, A., Foster Thompson, L., Rivkin, D., Lewis, P., & Nottingham, J. (2014). O*NET work activities project technical report. Raleigh, NC: National Center for O*NET Development.
Holland, J. L. (1976). Vocational preferences. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 521–570). Chicago, IL: Rand McNally.
How We Decide If You Are Disabled. (n.d.). Retrieved from https://www.ssa.gov/disability/step4and5.htm#&a0=2
Jeanneret, P. R., Borman, W. C., Kubisiak, U. C., & Hanson, M. A. (1999). Generalized work activities. In N. G. Peterson, M. D. Mumford, P. R. Jeanneret, & E. A. Fleishman (Eds.), An occupational information system for the 21st century: The development of O*NET (pp. 105–125). Washington, DC: American Psychological Association.
Job Description Writer. (2016). Retrieved from http://www.careeronestop.org/businesscenter/jdw/gettingstarted.aspx
Leung, Y. T., & Glissmann, S. M. (2010). A clustering approach to the organization design of knowledge-intensive service providers. Retrieved from http://domino.research.ibm.com/library/cyberdig.nsf/papers/58C6B1D509E8DCA68525780000603960
Lewis, P., & Norton, J. (2016). Identification of "hot technologies" in the O*NET system. Raleigh, NC: National Center for O*NET Development.
Lewis, P., & Rivkin, D. (2015, January). O*NET data overview. Paper presented at the O*NET pilot open data round table sponsored by the Open Data Enterprise, Washington, DC.
McCormick, E. J., Mecham, R. C., & Jeanneret, P. R. (1989). Technical manual for the position analysis questionnaire (2nd ed.). Palo Alto, CA: Consulting Psychologists Press.
Mi Próximo Paso. (n.d.). Retrieved from http://www.onetcenter.org/miproximopaso.html
Miller, A. R., Treiman, D. J., Cain, P. S., & Roos, P. A. (Eds.). (1980). Work, jobs, and occupations: A critical review of the Dictionary of Occupational Titles. Washington, DC: National Academy Press.
Morgeson, F. P., & Campion, M. A. (1997). Social and cognitive sources of potential inaccuracy in job analysis. Journal of Applied Psychology, 82(5), 627–655.
Mumford, M. D., Peterson, N. G., & Childs, R. A. (1999). Basic and cross-functional skills. In N. G. Peterson, M. D. Mumford, P. R. Jeanneret, & E. A. Fleishman (Eds.), An occupational information system for the 21st century: The development of O*NET (pp. 49–69). Washington, DC: American Psychological Association.
My Next Move. (n.d.). Retrieved from http://www.onetcenter.org/mynextmove.html
My Next Move for Veterans. (n.d.). Retrieved from http://www.onetcenter.org/veterans.html
National Center for O*NET Development. (2016). O*NET 20.3 data dictionary: Using O*NET data and meta data. Retrieved from http://www.onetcenter.org/dictionary/20.3/excel/
National Center for O*NET Development. (2015, January). O*NET OnLine web statistics. Raleigh, NC: Author.
National Center for O*NET Development. (2015). O*NET data collection program: Office of management and budget clearance package. Part A: Justification. Raleigh, NC: Author.
National Center for O*NET Development. (2012). O*NET data collection program: Office of management and budget clearance package. Part A: Justification. Raleigh, NC: Author.
National Center for O*NET Development. (2010). Updating the O*NET-SOC taxonomy: Incorporating the 2010 SOC structure. Raleigh, NC: Author.
National Center for O*NET Development. (2009). New and emerging occupations of the 21st century: Updating the O*NET-SOC taxonomy. Raleigh, NC: Author.
National Center for O*NET Development. (2007). The O*NET® content model detailed outline with descriptions. Retrieved January 30, 2008, from http://www.onetcenter.org/dl_files/ContentModel_DetailedDesc.pdf
National Center for O*NET Development. (2006). Updating the O*NET-SOC taxonomy: Summary and implications. Raleigh, NC: Author.
National Center for O*NET Development. (2005). O*NET data collection program: Office of management and budget clearance package. Part A: Justification. Raleigh, NC: Author.
National Center for O*NET Development. (2002). O*NET data collection program: Office of management and budget clearance package. Part A: Justification. Raleigh, NC: Author.
National Research Council. (1980). In A. R. Miller, D. J. Treiman, P. S. Cain, & P. A. Roos (Eds.), Work, jobs, and occupations: A critical review of the Dictionary of Occupational Titles. Washington, DC: National Academy Press.
Occupational Information System Project. (n.d.). Retrieved from https://www.ssa.gov/disabilityresearch/occupational_info_systems.html
O*NET Code Connector. (n.d.). Retrieved from http://www.onetcenter.org/codeconnector.html
O*NET in Action: ASVAB. (2010). Retrieved from https://www.doleta.gov/programs/ONET/asvab.cfm
O*NET OnLine. (n.d.). Retrieved from http://www.onetcenter.org/online.html
O*NET Resources: A Piece of the Action. (2009). Retrieved from https://www.doleta.gov/programs/ONET/Manpower.cfm
O*NET Resource Center. (n.d.). Retrieved from http://www.onetcenter.org/
Paullin, C., Ingerick, M., Trippe, D. M., & Wasko, L. (2011). Identifying best bet entry-level selection measures for US Air Force remotely piloted aircraft (RPA) pilot and sensor operator (SO) occupations. Retrieved from http://www.dtic.mil/dtic/tr/fulltext/u2/a554209.pdf
Peterson, N. G., Mumford, M. D., Borman, W. C., Jeanneret, P. R., Fleishman, E. A., & Levin, K. Y. (Eds.). (1997, September). O*NET final technical report. Salt Lake City: Utah Department of Workforce Services, through a contract with the American Institutes for Research.
Peterson, N. G., Mumford, M. D., Borman, W. C., Jeanneret, P. R., & Fleishman, E. A. (Eds.). (1999). An occupational information system for the 21st century: The development of O*NET. Washington, DC: American Psychological Association.
Peterson, N. G., Mumford, M. D., Borman, W. C., Jeanneret, P. R., & Fleishman, E. A. (1995, September). Development of prototype Occupational Information Network (O*NET) content model. Salt Lake City: Utah Department of Workforce Services, through a contract with the American Institutes for Research.
Peterson, N. G., Mumford, M. D., Borman, W. C., Jeanneret, P. R., Fleishman, E. A., Levin, K. Y., & Campion, M. A. (2001). Understanding work using the occupational information network (O*NET): Implications for practice and research. Personnel Psychology, 54, 451–492.
Peterson, N. G., Owens-Kurtz, C., Hoffman, R. G., Arabian, J. M., & Whetzel, D. C. (1990). Army synthetic validation project. Alexandria, VA: U.S. Army Research Institute for the Behavioral Sciences.
Peterson, N., & Sager, C. E. (2010). The dictionary of occupational titles and the occupational information network. In J. L. Farr & N. T. Tippins (Eds.), Handbook of employee selection (pp. 887–908). New York, NY: Routledge/Taylor & Francis Group.
Reeder, M. C., & Tsacoumis, S. (2015). O*NET occupational skills ratings: Analysis cycle 16 results. Alexandria, VA: Human Resources Research Organization.
Strong, M. H., Jeanneret, P. R., McPhail, S. M., Blakely, B. B., & D'Egidio, E. L. (1999). Work context: Taxonomy and measurement of the work environment. In N. G. Peterson, M. D. Mumford, P. R. Jeanneret, & E. A. Fleishman (Eds.), An occupational information system for the 21st century: The development of O*NET (pp. 127–145). Washington, DC: American Psychological Association.
Transition Assistance Program (TAP) Information. (n.d.). Retrieved from http://www.dol.gov/vets/programs/tap
Tsacoumis, S. (2007, May). The feasibility of using O*NET to study skill changes. Paper presented at the Workshop on Research Evidence Related to Future Skills Demands sponsored by the Center for Education, National Research Council, Washington, DC.
Tsacoumis, S., & Van Iddekinge, C. H. (2006). A comparison of incumbent and analyst ratings of O*NET skills. Alexandria, VA: Human Resources Research Organization.
U.S. Department of Commerce. (1980). Standard occupational classification manual. Washington, DC: Author.
U.S. Department of Labor. (2010). Standard occupational classification system. Washington, DC: Author.
U.S. Department of Labor. (1991a). Dictionary of occupational titles (4th ed., rev.). Washington, DC: U.S. Government Printing Office.
U.S. Department of Labor. (1991b). The revised handbook for analyzing jobs. Washington, DC: U.S. Government Printing Office.
U.S. Department of Labor. (1972). Handbook for analyzing jobs. Washington, DC: U.S. Government Printing Office.
U.S. Department of Labor. (1939). Dictionary of occupational titles. Washington, DC: U.S. Government Printing Office.


41
CYBERSECURITY ISSUES IN SELECTION
DAVID W. DORSEY, JACLYN MARTIN, DAVID J. HOWARD, AND MICHAEL D. COOVERT

INTRODUCTION
Exponential technologies, or technologies whose performance-to-cost ratio grows faster than the pace of Moore's law, are drastically changing the modern world by propelling society forward, often with unexpected consequences (Arena, 2014; Briggs & Shingles, 2015). Examples of exponential technologies include artificial intelligence, quantum computing, and cybersecurity. These technologies challenge previously held systems in society (Briggs & Shingles, 2015). For instance, Tesla allows consumers to bypass car dealerships by selling directly to customers, while companies like Uber and Lyft are largely replacing taxi cabs. Exponential technologies also greatly influence selection processes. The changes due to the information and telecommunications revolution in the 1980s sparked research on the role of technology in the employee selection process (see Farr & Tippins, 2010; Tippins, 2015). In many ways, technology appears to improve the selection process: online applications increase the applicant pool through greater accessibility; novel technologies facilitate the collection, storage, and analysis of assessment responses; and technology advances test development methods. Still, the growth of exponential technologies introduces specific challenges. Namely, the rate of disruption, or the speed with which technological innovations are changing societal processes, continues to increase considerably. This indicates that no technology is stable—that is, people will continue to develop novel technologies to replace existing ones. While this does advance society, it makes it difficult for researchers and practitioners to keep up with ever-changing practices.

One constant in this era of technological change is the continuous threat of security breaches—that is, the many threats to the cybersecurity of systems that are used throughout the selection process. The aim of this chapter is to provide insight for researchers and practitioners into the challenges that are unique to cybersecurity in the selection context. Specifically, this chapter outlines the current trends in cyber attacks, cybersecurity issues within the context of current selection methods, other issues in selection, and finally, recommendations and directions for future research.

CURRENT TRENDS IN CYBERSECURITY
More than ever before, today's organizations must be aware of the threats to their internal networks and information databases. In order to defend against these attacks, it is important to know how an intruder can gain access. Attackers use several different vectors, or pathways. These vectors include threats that occur at the personal device level, such as malware taking advantage of vulnerabilities in operating systems (e.g., Windows, Linux) and software; physical threats such as theft, unauthorized physical access, and distribution of malicious hardware (e.g., USB drives); and general network threats such as network hacking and Wi-Fi and cellular attacks. In this section, we present an overview of how attackers are able to obtain workplace information by nefarious methods and an overview of the different types of possible attacks against organizations.

Social Engineering
Before we learn some of the types of attacks intruders use to gain access to networks, it is important to understand how an attacker enters an organization's computer systems in the first place. The most common point of entry for intrusion is through social engineering. Social engineering is defined as the use of social deception, psychological tricks, and cultural ploys to help a hacker gain unauthorized access to a computer network (Abraham & Chengalur-Smith, 2010; Erbschloe, 2005). Social engineering techniques are superior to other forms of hacking because they manipulate the most vulnerable part of the system, the end user (Krombholz, Hovel, Huber, & Weippl, 2014). It is vital to understand that no technology user is immune to social engineering tactics. Successful attacks have occurred on companies as technologically savvy as Google (Zetter, 2010), Microsoft, and Facebook (Schwartz, 2011), with social engineering methods being the initial foot-in-the-door for the attacks on all three companies.

One social engineering method known as spear-phishing (i.e., receiving a targeted fraudulent e-mail that appears to be from someone you know) is an incredibly effective way for hackers to breach networks. In June 2015, this point could not be more evident as Kaspersky Labs, makers of one of the most popular antivirus programs on the market, found evidence of a nation-state spyware similar to Stuxnet and Duqu on their own internal networks, with the initial attack being traced back to a spear-phishing e-mail and zero-day exploit targeted at one of their employees in their Asia-Pacific offices (Zetter, 2015). Zero-day exploits occur when an attacker preys on a software vulnerability that is not yet known to the developer. Zero-day exploits are difficult for organizations to analyze because data on the exploit are not available until after the attack is complete (Bilge & Dumitras, 2012).

The rise of e-mail as the major form of communication in organizations gave hackers a tool to utilize social engineering manipulations. E-mail, however, is not the only method of intrusion. In 2011, the Department of Homeland Security (DHS) conducted an experiment using a technique called baiting, whereby they placed USB drives with and without the department's logo in parking lots used by government employees and private contractors. The only purpose of the USB drive when plugged into a computer was to contact the DHS experimenters to inform them that an employee had taken the bait and plugged in the USB drive that could have contained malicious code. Sixty percent of the USB drives that did not have the DHS logo were plugged into a computer. Even more shocking, 90% of the drives emblazoned with the DHS logo were plugged into a PC (Rosenzweig, 2012; Schwartz, 2011).
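As a toy illustration of the kind of automated screening that complements user training against spear-phishing, the sketch below flags links in a message whose host does not match the claimed sender's domain. This is our own simplified heuristic, not a description of any vendor's product; real e-mail security gateways combine many such signals.

```python
# A toy heuristic (illustrative only): flag links whose host does not match
# the sender's claimed domain, one common spear-phishing signal.
import re
from urllib.parse import urlparse

def suspicious_links(sender_address: str, body: str) -> list[str]:
    sender_domain = sender_address.rsplit("@", 1)[-1].lower()
    urls = re.findall(r"https?://\S+", body)
    flagged = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # Flag links whose host neither matches nor is a subdomain of the sender's domain
        if host != sender_domain and not host.endswith("." + sender_domain):
            flagged.append(url)
    return flagged

print(suspicious_links("it-support@example.com",
                       "Reset your password at http://example-com.badsite.net/reset"))
```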

Malware
Malware encompasses many different types of software, each designed with malicious intent to gain access to computers and networks (Hsu, Chen, Ristenpart, Li, & Su, 2006). Some of the more common examples of malware include viruses, worms, and Trojan horses (Siponen & Oinas-Kukkonen, 2007). Malware is, perhaps, the most well-known type of attack as most employees have heard of computer viruses and spyware. In fact, many employees may have anti-malware software, such as Norton Antivirus, McAfee Antivirus, or Malwarebytes AntiMalware, installed on their personal computers. Less familiar to employees are the important differences in malware type and how each gains access to or spreads itself across a computer network. While the end result of each type of malware is the same—to have unauthorized entry to a computer or network and execute the creator's intention—there are fundamental differences between the aforementioned malware attacks.

A virus is a program that can perform unauthorized actions on a computer and then replicate and spread itself to other computers (Cohen, 1987). Most viruses share three main components: (1) a replication mechanism allowing the virus to spread, (2) a task to perform, and (3) a trigger to execute the replication mechanism or task (Erbschloe, 2005). A worm is similar to a virus in purpose and its ability to replicate and search for other computers to infect. However, a virus requires an infected host file (i.e., carrier software) to replicate, and a worm is able to replicate without a host file and carry out its purpose as standalone software. Thus, the main difference is in how the malware travels (Cisco, n.d.; Symantec, 2015).

Trojan horses differ from worms and viruses because they do not have the ability to replicate. Instead, much like the ancient Greek horse having to be pulled into the city of Troy, digital Trojan horses rely on a user to download a file that appears to be something of interest. A well-known example is the 2001 Trojan horse that promised downloaders photos of tennis star Anna Kournikova (Glass, 2001). Once a user has downloaded the file, the Trojan horse goes into operation and infects the computer with its malicious intent. A more recent example of a Trojan horse attack occurred in November 2014, when Symantec discovered the Regin Trojan horse, malware designed to collect information on the energy, airline, hospitality, and research industries (Summers, 2014). The primary purpose of the Regin Trojan horse is thought to be espionage, and with industries now operating in an online world, it is easy to imagine the potential damage to internal systems that can be wrought by outsiders. If an employee were to download an attachment in an e-mail, and that attachment contained a Trojan horse, the creator of the Trojan horse could have unmonitored access to an organization's intellectual property.

Distributed Denial of Service (DDoS)
Distributed Denial of Service (DDoS) attacks operate in an entirely different manner from malware assaults. Traffic on the Internet is handled by web servers, and each server can only handle a set amount of traffic. DDoS attacks take advantage of a server's maximum traffic cap by flooding the server with packets of information (Kumar, 2015). In general, once a server is flooded by a DDoS attack, the server is unable to process legitimate traffic requests (e.g., an applicant attempting to access an organization's job application webpage). From the first DDoS attacks on Yahoo! and Amazon in 1999 (Bhuyan, Kashyap, Bhattacharyya, & Kalita, 2014) to the DDoS attacks that occurred over the Christmas 2014 holiday weekend that crippled the Microsoft Xbox Live and Sony Playstation Network servers (BBC News, 2014), the result remains the same: an inability to access an organization's websites or servers. As organizations move their employee application process to the online world, a DDoS attack can cripple the ability for potential applicants to begin the initial application process and also hinder the organization from accessing its own online presence.
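The flooding principle behind DDoS can be illustrated with a simple per-client token bucket that sheds requests once a sender exceeds its allowance. This sketch is only illustrative; meaningful DDoS mitigation is typically handled upstream by network providers, content delivery networks, and scrubbing services rather than in application code.

```python
# Illustrative per-client token bucket: excess requests are shed once a sender
# exceeds its allowance. Not a substitute for upstream DDoS mitigation.
import time
from collections import defaultdict

RATE = 5          # tokens added per second
BURST = 10        # maximum bucket size

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow_request(client_ip: str) -> bool:
    bucket = buckets[client_ip]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True
    return False   # request dropped: this client is flooding the server

print(all(allow_request("203.0.113.7") for _ in range(10)))  # within burst allowance
print(allow_request("203.0.113.7"))                          # 11th immediate request is shed
```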

Man-in-the-Middle Attacks (MITM)
Man-in-the-middle (MITM) attacks occur when a third party is privy to information between two unsuspecting parties and uses that access to eavesdrop on or alter the communication. An example attack could have one employee sending information encrypted with a public key1 to another employee in the department. A third party, the attacker, could intercept the public key en route and either monitor and record the information, or the hacker might decrypt the key and control the communication between both parties. With the MITM attack in progress, the two employees believe they are communicating directly with each other (Thurimella & Mitchell, 2009), having no idea of the digital presence of the third party. An attack such as this leaves organizations particularly vulnerable, as the two employees communicating with each other believe they are using a protected method (encryption) to transfer information.
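The standard countermeasure to the interception scenario above is to authenticate the party with whom keys are exchanged, which in everyday practice means verifying server certificates against a trusted authority. The brief sketch below shows the idea for an HTTPS call; the host name and certificate bundle path are illustrative assumptions.

```python
# Illustrative sketch: certificate verification as a MITM countermeasure.
# The host name and CA bundle path are hypothetical.
import requests

# verify (enabled by default) checks the server certificate chain; pointing it
# at an internal CA bundle also rejects interception proxies the organization
# has not explicitly sanctioned.
resp = requests.get("https://assessments.example.com/api/session",
                    verify="/etc/ssl/corp-ca-bundle.pem",
                    timeout=10)
resp.raise_for_status()

# Never do this: verify=False silently accepts a man-in-the-middle certificate.
# requests.get("https://assessments.example.com/api/session", verify=False)
```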

Advanced Persistent Threat
Advanced persistent threats (APTs) are a particularly dangerous form of cyber attack on organizations. APTs are frequently engineered to attack a specific organization, targeting that organization's extremely high-value data (Brewer, 2014). APTs combine multiple attack vectors, including malware, social engineering, and physical means, to accomplish their objective. Three major characteristics of APTs that differentiate them from other forms of attacks are that (1) APTs repeatedly attack their target over time; (2) APTs are resistant to the target's defenses; and (3) APTs maintain the level of interaction needed to accomplish their goal (Joint Task Force Transformation Initiative, 2011). Even more troublesome is the fact that the presence of APTs on a computer network is often difficult to identify (Thomson, 2011). Intellectual property theft is a common objective of APTs, and thus any proprietary selection methods an organization creates are at risk of no longer being owned exclusively by the company that invested the resources (e.g., time, money) in developing those methods.

Anti-Forensics
An emerging trend in cybersecurity is the use of anti-forensic methods. Cyber attacks using anti-forensic methods hide their presence on a network in several different ways, including trail obfuscation, data hiding, artifact wiping, and attacks against the tools designed to detect the attack (Harris, 2006). The previously mentioned Regin Trojan horse is an example of malware designed with anti-forensic traits. While Regin's initial deployment is as a Trojan horse, the entirety of the Regin attack is polymorphic, taking place through a five-stage process. Each stage in the process is activated by the previous stage, so there is no way for digital forensics to collect complete information about the attack at any one time. Malware researchers know they are not able to see all variants of the attack on a single victim, as only one component (i.e., one of the five stages) is present per victim at a time (Summers, 2014). Traits such as these make detection and removal of attacks like the Regin Trojan horse difficult. Furthermore, experienced attackers are aware that slow-moving, quiet attacks are camouflaged by the "noisier" activity of less-experienced attackers and can employ methods that make detection akin to finding a needle in a haystack.

Breach Prevention and Response
If even the employees of the DHS, Kaspersky, and Google can be deceived into falling for social engineering ploys, then how can organizations begin to prevent attacks on their systems? Organizations must be mindful that while hacker groups such as Anonymous and Lizard Squad do not have a large number of members, the relatively few hackers in existence can have disproportionate effects. Additionally, nation-states have entered into the cyber attack domain, with North Korea being responsible for the late 2014 attacks on Sony Pictures Entertainment (Niccolai, 2015) and China now suspected of being responsible for the cyber attacks on the U.S. Office of Personnel Management, in which the personnel data for 21.5 million Americans were purloined (Banker, 2015).

One must also recognize that there is no such thing as an absolutely secure system. It is not that Target, Home Depot, Sony, JPMorgan Chase, the Postal Service, the Office of Personnel Management, and the White House simply had bad security practices when their networks were breached. To be sure, some of their security systems were better than others (e.g., Home Depot was warned of lax security policies prior to their attack; Creswell & Perlroth, 2014). However, any threatening party putting enough time on target will get in. Attackers also benefit from the fact that there is little recourse across multiple international borders and that most companies are focused on business strategy and not on business defense (Auty, 2015).

Since there are no impervious networks, organizations must consider breach response in addition to breach prevention. In order to effectively respond to security intrusions, it is necessary to have a breach response plan in place and ready to be enacted in the inevitable event of a cyber attack (United States Department of Justice, 2015). Critical components of a breach response plan include assembling a response team prior to any security breach, fully investigating and containing the compromised computers involved, communicating and working with government and law enforcement agencies applicable to the intrusion, employing external partners such as forensic analysts and public relations firms, notifying customers who have been affected by the breach, and responding to inquiries from those customers (Experian Data Breach Resolution, 2014; United States Department of Justice, 2015). Additionally, the computers affected should be disconnected from the network and imaged for analysis, and under no circumstance should an effort be made to hack into the hackers. While this is not an exhaustive response plan, we would direct you to the Department of Justice's "Best Practices for Victim Response and Reporting of Cyber Incidents," available at http://www.justice.gov/sites/default/files/criminal-ccips/legacy/2015/04/30/04272015reporting-cyber-incidents-final.pdf, for more detailed instructions on breach response.

CURRENT SELECTION METHODS
While technological advancements facilitate the development of sophisticated cyber attacks, these innovations also encourage the growth of novel personnel selection methods. Specifically, technological evolutions recently led many companies to change the medium of selection assessment from computers to mobile devices. Moreover, companies now use technological innovation to shape selection methods through technologies involving simulation, gaming, social media, and Big Data (see Chapters 43 and 44 in this volume). This section will outline such current trends in selection methods and address the associated potential cyber threat vectors. It is essential to consider the cybersecurity issues mentioned in the previous section within the selection context because investments in test security can have huge implications for the return on investment (ROI) of organizations' hiring processes.

The selection process is now widely mediated through technology (Farr & Tippins, 2010). Applicants complete virtually all assessments (personality tests, cognitive ability tests, structured application forms) utilizing some form of technology. Specifically, applicants might complete assessments through on-site computers, personal computers, tablets, or mobile phones. Fortunately, a growing body of research suggests this transition does not always pose validity concerns, as computer-based assessment scores are often found to be statistically and practically comparable to paper-based assessment scores (Mead & Drasgow, 1993; Ployhart, Weekley, Holtz, & Kemp, 2003). However, the movement of selection assessment to technology-mediated platforms introduces several cybersecurity concerns.

Specifically, sophisticated cyber attacks—such as those mentioned in the previous section—provide the potential for unauthorized users to gain access to both the test taker's information and test content. Such information theft can result in serious issues for the organization and the individual, including identity theft and the lessening of test integrity through the copying and distribution of test content. Moreover, information theft can be especially damaging to the test maker when the theft of intellectual property results in substantial exposure of test content to test takers. For instance, Chinese students have created chatrooms in which to record as many questions from the computerized Graduate Record Exam (GRE) as they can remember after taking the exam (Hornby, 2011). This gives students who review the questions on the chatrooms a significant advantage when they complete the exam. Though the organization that produces the GRE, the Educational Testing Service (ETS), attempted to address this cheating by changing the GRE in China to paper only and retiring questions after each test, many Chinese students now fly to countries with the computerized version to complete the exam (Hornby, 2011).

In addition, the prominence of so-called braindump sites on the Internet is equally troubling. Such sites actively promote the sharing of proprietary test content. Smith (2004) demonstrated that via braindump sites, about 25% of a test item bank was exposed within three weeks of the exam being published live and with a fair amount of accuracy. After eight months, nearly the entire exam bank, more than 200 items, was posted with nearly perfect accuracy, including the answer key. Fortunately, organizations have started to evolve methodologies for combating online theft and cheating. For example, Gibson and Mulkey (2016) presented a number of techniques for using data forensics to identify stolen or compromised test material and responses. These included analyzing braindump answer keys, analyzing test response patterns to identify anomalous trends, and even using "trojan horse" items to create a test-within-a-test to detect cheating (an illustrative sketch appears later in this section).

Another area of specific concern for cybersecurity is the growing use of mobile devices in selection. Mobile device security is simply not keeping up with the increases in mobile device usage. Survey statistics reveal that mobile devices now account for one-third of all web traffic worldwide ("StatCounter Global Stats," 2015). Furthermore, the International Data Corporation (IDC) predicts that by 2017, tablets and smartphones will constitute 87% of the connected device market, with desktop and laptop PCs accounting for only 13% (Columbus, 2013). Mobile assessment in the workplace follows this global movement in that Censuswide (2014) found that 78% of job applicants polled in the U.S., UK, and Australia would apply for jobs on their mobile devices. There are several reasons for the rising use of mobile devices in selection. The application of mobile devices is beneficial to organizations because it increases the applicant pool and allows for easier and faster distribution of application materials and tests, which cuts costs (Tippins, 2011). However, organizations need to be aware of the cybersecurity issues associated with mobile device usage. Mobile devices lack much of the security that PCs encompass. Specifically, mobile devices often lack firewalls, antivirus programs, and encryption (Ruggiero & Foote, 2011). The combination of widespread usage and inadequate security in mobile devices provides cyber attackers an opportunity for information theft that could result in device hijacking, identity theft, and threats to the integrity of the assessment.

Cyber attackers can compromise the security of information kept on mobile devices in several ways, including but not limited to the installation of malicious applications, "vishing," and "SMiShing" (Ruggiero & Foote, 2011; Wisenberg Brin, 2012). Vishing occurs when someone sends a fraudulent request via a voicemail message to have a person call a certain number. SMiShing occurs when a fraudulent text message is sent with a URL or phone number. Both of these methods use a message the attacker believes will entice the user to click on the link or call back the number and ultimately give the hacker access to the device. Cyber attackers can also hide malicious code that allows access to the device in seemingly innocent mobile applications (apps). Like the aforementioned phishing scams, vishing and SMiShing use social engineering to gain access to your device. However, in this case, the device is your phone.
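As a toy illustration of the response-pattern forensics described earlier in this section (and not a description of any vendor's method), the sketch below flags candidates whose answers agree with a recovered braindump key far more often than chance would predict. The data files and layout are hypothetical; operational forensics would use stronger baselines, such as agreement with the key's erroneous answers.

```python
# Illustrative sketch (hypothetical data layout): flag candidates whose responses
# match a recovered braindump key improbably often relative to a chance baseline.
import pandas as pd
from scipy.stats import binom

responses = pd.read_csv("candidate_responses.csv", index_col="candidate_id")   # hypothetical export
braindump_key = pd.read_csv("braindump_key.csv", index_col="item_id")["answer"]

n_items = len(braindump_key)
p_chance = 0.25                       # four-option multiple choice assumed
match_count = responses.eq(braindump_key, axis="columns").sum(axis=1)

# Probability of matching the leaked key this often (or more) by chance alone
p_values = match_count.apply(lambda k: binom.sf(k - 1, n_items, p_chance))
flagged = p_values[p_values < 1e-6].index
print(flagged.tolist())               # candidates referred for follow-up review
```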
Another way the use of mobile devices can put assessment information at risk is when mobile devices are used on public networks, which are not secure. Essentially, if a job applicant is able to access an assessment (e.g., a situational judgment test) on a public network, there is a much greater chance that the content of the assessment can be hacked. If the scenarios/questions from a test are stolen, it would compromise the integrity of the assessment and diminish the validity of the selection tool. Most attention to cybersecurity problems with mobile devices in the organizational context focuses on the potential for information or data theft from current employees’ mobile devices (Wisenberg Brin, 2012). Organizations need to consider the cybersecurity implications for assessments in the selection context, as more selection procedures move to mobile devices. In addition to the changing medium of selection assessment, organizations now utilize novel technologies in the development of selection tools. These technologies include simulation, gaming, social media, and Big Data. In practice, organizations use a wide range of simulations or virtual role plays that measure performance on tasks that closely match those that would be performed on the job. These simulations range from low-fidelity simulations, like multiple-choice text-based situational judgment tests (SJTs), to high-fidelity multimedia simulations, which present a highly realistic job scenario and encompass numerous response options. For


Cybersecurity Issues instance, a simulation may test piloting performance by producing a life-like scenario through flight simulator software and a hands-on-throttle-stick (HOTAS) device (Drollinger et al., 2015). Such a simulation can incorporate real-world models of physics, weather, instrument responses, and failures. Simulations offer many advantages, such as the ability to predict a wide range of job-relevant skills with lower rates of adverse impact and a lessened susceptibility to coaching effects (Weekley, Hawkes, Guenole, & Ployhart, 2015). Conversely, research lacks clarity on which constructs simulations measure, and simulations can be costly to the organization in terms of production (Weekley et al., 2015). Moreover, it is unclear precisely what cybersecurity issues simulations may present. One concern is that simulations may become more susceptible to coaching as organizations begin to distribute the simulations via the Internet, as it is more difficult to secure the content of the test from cyber attacks. To increase the security of high-stakes assessment, Naglieri et al. (2004) recommend using a “three-tier server model” that incorporated three independent servers (an internet server, a test application server, and a database server). The authors conclude that “this configuration reduces the possibility of unauthorized intrusions into client test data” (Naglieri et al., 2004). Furthermore, the authors suggest that server traffic be actively and continuously monitored for intrusions (Naglieri et al., 2004), although this would only help with known viruses and malware. Weekley et al. (2015) also describe the likelihood for simulations to increasingly move to distribution via mobile devices, which presents some of the aforementioned vulnerability issues. Social media represents a new and unregulated source of selection information for many practitioners. In a survey conducted by the Society for Human Resource Management (SHRM), one-fifth of respondents indicated that they use social media (i.e., Facebook, LinkedIn, Twitter, etc.) during the selection process (“Social Networking Websites and Recruiting/Selection,” 2013). HR representatives find that scanning a job candidate’s Facebook or LinkedIn profile is advantageous because it is a quick and cheap way to gather information on a candidate. Though there is little research on the use of social media in the employee selection process, initial examination shows mixed results in terms of the validity of the practice (Golbeck, Robles, & Turner, 2011; Park et al., 2015; Roth, Bobko, Van Iddekinge, & Thatcher, 2013; Van Iddekinge, Lanivich, Roth, & Junco, 2013). Aside from validity issues, the use of social media in selection presents ethical and legal concerns, as most social media platforms include information about the user’s age, race, gender, religion, and other information that is unlawful to use in selection. The primary cybersecurity concerns are the particular vulnerabilities presented by social engineering. Cyber attackers can profile the LinkedIn or Facebook accounts of people who work at a company to ascertain information they can employ to spoof an e-mail enticing a user to click a link that appears to be to a social media login site. By doing this, a hacker would be able to record the victim’s credentials or to obtain names of coworkers, connections, or Facebook friends to use in such e-mails. “Big Data” is a topic that has recently received much attention in organizational research and practice. 
As data storage costs decrease, the collection and storage of large amounts of user information increases. Certain websites, such as Facebook and Amazon, use this technology to personalize advertisements reflecting user preferences based on information gathered from previous searches, likes, and geographic location. Facebook has even used its large pool of user data to identify the times of the year when couples are most likely to break up (Gross, 2010). Evidently, the days leading up to spring break and the winter holidays offer the most distinct peaks in breakups. Given this type of data, perhaps employee posts on social media could be used to predict counterproductive work behaviors. Or perhaps certain times of the year could be identified as significantly less productive, so interventions could be introduced to alleviate this effect. In a selection context, companies can use information from Big Data to evaluate applicants. Though this is a new area, a startup exists that can compile a comprehensive folder with all publicly available online information for a certain candidate (Preston, 2011). Because the use of Big Data in selection is still in preliminary stages, the cybersecurity implications are not yet known. However, it follows that when employers collect, store, and analyze an increasingly large amount of data, there will be greater risks for information theft, manipulation, and massaging.


As technology progresses and assessments use richer media, cyber attackers will develop new methods to gain access to test and user information. Practitioners need to be aware of the potential for cyber attacks on new and existing selection tools, because such attacks can undermine the psychometric standards (i.e., validity and reliability) of selection assessments. While IT departments develop preventative methods to increase cybersecurity in the workplace, increased awareness will help practitioners understand the human side of preventing breaches. Furthermore, an increased understanding of cybersecurity will aid in the development of protocols for security breach response.

ADDITIONAL ISSUES IN SELECTION Aside from the selection methods themselves, it is essential to consider cybersecurity weaknesses in other aspects of the selection process, such as test development, applicant tracking, and the use of cybersecurity competency assessment in selection. This section gives an overview of these facets. Many considerations accompany the development of selection assessments. For instance, organizations aim to develop assessments that have high predictive validity for job performance and positive applicant reactions. Technology enables improvement in these domains in a number of ways, such as computerized adaptive testing, computational content analysis, and new assessment validation methods. These technologies each offer ways to improve the test development stage of selection, often resulting in more valid methods that have more positive applicant reactions. On the other hand, the movement of the test development process to technology-mediated platforms introduces some cybersecurity concerns, which are important to consider, as the security of test content is essential to maintaining the validity of the assessment. CAT technology is a method of computer-based assessment that adapts to the test taker’s ability level throughout the assessment process (Wainer, Dorans, Flaugher, Green, & Mislevy, 2000). This tailored testing allows for shortened testing time without compromising the reliability of the assessment (Davison, Maraist, Hamilton, & Bing, 2012). Researchers note an increase in test security as another advantage to the CAT development method because exposure control ensures that there is minimal overlap in questions on assessments between examinees (Davey & Parshall, 1995). However, with the exponential increases in methods of information theft through cyber hacks in the last couple of decades, these controls may no longer be sufficient to ensure test security. For example, the International Test Commission (ITC) publishes guidelines on the security of tests, examinations, and other assessments that outline categories of cheating, and test theft threats now range from stealing questions during testing through digital photography to recording test content electronically (ITC, 2014). One method that provides the potential to address some of these security concerns is the development of counter-technologies, such as on-the-fly item generation (Bejar et al., 2002). Not only would this method of item development increase test security by reducing the overlap of test items, but it also prevents the storage of items, making the content more difficult to appropriate. Another technology that could be instrumental in the development and scoring of assessments in selection is computational content analysis that allows for the automated analysis and scoring of text responses to test items (Ryan & Ployhart, 2014). The movement of essay scoring to computerized methods would presumably decrease costs while increasing the reliability. However, these methods could introduce cybersecurity concerns as the collection and storage of these large amounts of data (Big Data) for data mining electronically further invite information theft of test content, scoring algorithms, and applicant information. A novel assessment validation method that was presented at the 2015 Society for Industrial and Organizational Psychology Conference was the use of Amazon’s Mechanical Turk (MTurk) to pilot selection assessments (Beatty, Buckley, Sprenger, & Russell, 2015). 
MTurk is an online marketplace that essentially employs people to complete "human intelligence tasks," that is, tasks that cannot easily be automated (www.mturk.com). The researchers found positive indicators of data quality and participant motivation (Beatty et al., 2015). As with CAT and computerized content analysis, the cybersecurity challenges lie in the security and storage of the test content and test user data.
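To make the exposure-control idea mentioned above concrete, here is a minimal sketch (not from the chapter) of one common family of approaches, a "randomesque" rule: rather than always administering the single most informative item, the algorithm samples at random from the few most informative unused items and skips items whose running exposure rate exceeds a cap. The item bank, information values, test length, and 25% cap are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(7)

def pick_item(info, administered, exposure_counts, n_examinees, k=5, max_rate=0.25):
    """Randomesque item selection with a simple exposure cap (illustrative only)."""
    rates = exposure_counts / max(n_examinees, 1)
    eligible = [i for i in range(len(info))
                if i not in administered and rates[i] < max_rate]
    if not eligible:  # fall back if the exposure cap blocks every remaining item
        eligible = [i for i in range(len(info)) if i not in administered]
    # take the k most informative eligible items, then pick one of them at random
    top_k = sorted(eligible, key=lambda i: info[i], reverse=True)[:k]
    choice = int(rng.choice(top_k))
    exposure_counts[choice] += 1
    return choice

# toy demonstration: 50 hypothetical items, 200 examinees, 10-item adaptive tests
info = rng.uniform(0.2, 2.0, size=50)   # stand-in for item information at the current theta estimate
counts = np.zeros(50)
for examinee in range(200):
    administered = set()
    for _ in range(10):
        administered.add(pick_item(info, administered, counts, examinee + 1))
print("maximum item exposure rate:", round(counts.max() / 200, 2))
```

Operational CAT programs use more elaborate schemes (e.g., Sympson-Hetter exposure control), but the basic trade-off between measurement precision and item exposure is the same.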


Cybersecurity Issues Organizations commonly use applicant tracking systems (ATS) to collect and organize information on candidates throughout the selection process. This software tracks the candidate from the resume search stage through phone screens and interviews (and after selection, through performance management, training, and even compensation management). The use of this technology is advantageous because it is cheaper and faster than traditional hard-copy filing systems. For instance, a standardized mass e-mail can be sent to candidates who were not selected for certain positions. This technology introduces vulnerability in the potential for cyber attackers to access the plethora of personally identifiable information (PII) on candidates that is stored on the software. If cyber attackers get access to an organization’s applicant tracking system, they would be able to find an employee’s social security number, address, date of birth, phone number, and any other information that was recorded in the system throughout the selection process. The announcement that the U.S. Office of Personnel Management (OPM) was hacked for personal information from background check data, affecting an estimated 22.1 million current and former government employees, demonstrates the widespread effects that these cyber attacks on personnel data can have (Levine & Date, 2015). Though this chapter has thus far focused on the importance of test and user information security during the selection process, another essential consideration is how selection can be used to prevent or reduce cyber breaches for organizations after employees are hired. Although test security and test user information security are important, organization-wide information theft can be much more serious. Recent examples of cyber theft for large organizations are shown in Table 41.1. Wiederhold (2014) emphasizes that even an organization with the most cutting-edge technologies for security systems cannot prevent cyber attacks when employees fall victim to social engineering. The fact that simple mistakes (such as failing to create a strong password or sending work files to personal e-mail accounts) increase an organization’s vulnerability to cyber attacks demonstrates why many refer to employees as the “weakest link” in cybersecurity (Belbey, 2015; Crossler et al., 2013; Wiederhold, 2014). Assessing which characteristics predict cybersecurity compliance in employees could prevent some of the cyber attacks that are caused by human error. Some such assessments are in the developmental stages. For instance, Trippe, Moriarty, Russell, Carretta, and Beatty (2014) evaluated the incremental validity of a “Cyber Test” for Army personnel over technical knowledge tests for ASVAB. This test was developed for cybersecurity and information technology (IT)

TABLE 41.1 Examples of Recent Cyber Attacks

| Organization | Date | Estimated number of people affected | Information stolen |
| Anthem | February 2015 | 80 million customers and employees | Names; birthdates; SSNs; addresses; income data |
| Sony Pictures | November 2014 |  | Contracts; salary lists; film budgets; entire films; SSNs; sensitive e-mails |
| Home Depot | September 2014 | 56 million customers | Credit card information |
| JPMorgan Chase | July 2014 | 83 million customers | Checking and savings account information |

Adapted from Granville (2015).


positions. Still, outside of these positions, the assessment of cybersecurity competence, and specifically tests that could predict compliance with day-to-day cybersecurity procedures (e.g., locking computers, changing passwords), is particularly important for human resource and IT department positions, as those occupations typically have administrative access to all employee system login information. It is also important to note that much of today's cybersecurity activity is conducted in teams (e.g., Cyber Incident Response Teams). Accordingly, the large body of literature around the science of team formation, development, and performance has much to offer (Steinke et al., 2015), including how to select individuals into teams (also see Chapter 37 in this volume). For most organizations, it is beneficial to consider cybersecurity compliance for all employees, yet some organizations might focus solely on cybersecurity competencies in critical positions. The obvious focus for cybersecurity competence is on positions in the IT department or other similarly technology-heavy positions. The following section will address recommendations for the selection of such individuals in addition to outlining specific organizational interventions to minimize cyber vulnerabilities.
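As a purely hypothetical illustration of the incremental-validity logic behind efforts such as the Cyber Test research cited above, the sketch below compares the variance in a simulated performance criterion explained by a baseline technical-knowledge score alone with the variance explained after adding a cybersecurity-specific score. All variable names, coefficients, and data are simulated; the example shows only the delta R-squared computation, not any actual military or ASVAB results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500

# simulated scores: a baseline technical-knowledge predictor, a correlated cyber-specific
# predictor, and a performance criterion related to both (all values are made up)
tech_knowledge = rng.normal(size=n)
cyber_score = 0.5 * tech_knowledge + rng.normal(size=n)
performance = 0.4 * tech_knowledge + 0.3 * cyber_score + rng.normal(size=n)

X_base = tech_knowledge.reshape(-1, 1)
r2_base = LinearRegression().fit(X_base, performance).score(X_base, performance)

X_full = np.column_stack([tech_knowledge, cyber_score])
r2_full = LinearRegression().fit(X_full, performance).score(X_full, performance)

print(f"R^2, technical knowledge only: {r2_base:.3f}")
print(f"R^2, adding cyber score: {r2_full:.3f}")
print(f"Incremental validity (delta R^2): {r2_full - r2_base:.3f}")
```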

RECOMMENDATIONS Cybercrime is a costly business for organizations. The 2013 Norton Report, published by the makers of Norton Antivirus, states that the total direct annual cost of cybercrime globally has reached US$113 billion and continues to grow (Symantec, 2013). The estimate of the total cost annually (including work-hours lost responding to cyber attacks and lawsuits as a result of security breaches) ballooned to US$445 billion annually (Nakashima & Peterson, 2014), with financial damage to organizations from intellectual property (IP) theft at US$160 billion annually (Reuters, 2014). On a mechanical level, many steps may be taken by organizations to minimize the potential of cyber attacks on their networks. Computers used by employees should always have firewall and antivirus/anti-malware software installed, with the definitions (i.e., updates) kept current. When malware attacks are recognized by antivirus software companies such as McAfee and Kaspersky, the resolution to thwart the attacks is continuously updated in the software. Keeping antivirus software updated makes it more difficult for a hacker to breach an organization’s network with a known attack. Furthermore, all operating systems (e.g., Microsoft Windows) and software (e.g., Microsoft Word, Adobe Reader, etc.) should be patched regularly to the latest version (Federal Communications Commission, n.d.; Norton, 2015). While some software updates contain fixes for software bugs, here are often patches for software vulnerabilities. Software flaws must be patched as soon as their presence is known, so hackers cannot take advantage of vulnerabilities with zero-day exploits. Training employees in security principles is an important step to take in curbing cyber attacks. Cybersecurity training should include advising employees to avoid social engineering techniques such as “click on this link” e-mails or foreign USB sticks and guiding employees to report any suspicious e-mail or possible malware attacks (Department of Homeland Security, n.d.). Additionally, it is necessary to require employees to have strong passwords for network access and critical applications. The latest annual report from Splashdata compiles the most common passwords that were compromised by cyber attacks in 2014, a total of 3.3 million passwords. The top 25 most common passwords represented 2.2% of all passwords. The top five included “password,” “qwerty,” and sequential numbers such as “123456” (Martin, 2015; PRWeb, 2015). Employees who use passwords such as these leave organizations vulnerable to hackers. Employees using non-secure passwords are not the only obstacle to adequate cyber defense. SailPoint conducted a survey among 1,000 global workers and found that 20% of employees share passwords with other employees, 56% reuse passwords, and 14% use the same password for all logins. Even more troublesome, some employees admitted they would be willing to sell their passwords for as little as US$150 (Business Wire, 2015). Recent research has shown differences in how the current workforce views security in the workplace. Duggan, Johnson, and Grawemeyer (2012) modeled password use and security among three different worker groups:


Cybersecurity Issues computer scientists, administrative workers, and students. Their results showed that although password security positively correlated with the sensitivity of the task, the three groups did not display the same ideology toward password use. The computer scientists viewed information security as part of their job, and passwords were a means to complete their tasks. However, administrative workers and students both felt using passwords was a cost incurred in completing their tasks. The students’ and administrative workers’ mindset reveals behavioral patterns and thought processes to overcome in training. Another major aspect of cybersecurity is a focus upon security (security culture) as part of overall organizational culture. Similar to occupational health professionals analyzing safety culture and climate as part of an organizational assessment, companies must assess norms, standards, and practices that have arisen around cybersecurity. Parsons et al. (2015) reported a positive correlation between the information security culture of an organization and employees making sound security decisions. Likewise, da Veiga and Martins (2015) conducted a case study of an international financial institution over an eight-year period and found that monitoring, assessing, and influencing information security culture aided compliance with security procedures. Furthermore, as suggested earlier, companies must focus upon cybersecurity considerations as part of employee selection. As organizations become more familiar with what constitutes a new “cyber worker,” employee selection will serve as the first line of defense against cyber attacks. Currently, the field lacks a common and generally accepted definition of “cyber worker.” In addition, it is unclear what specific knowledge, skills, abilities, and other characteristics drive performance in this domain. Given the seemingly infinite complexity in technologies, it is possible that the nature of technology knowledge is hierarchical, interconnected, and complex, like mathematics, thus potentially requiring different types of knowledge assessments (e.g., Davis & Yi, 2004; Dorsey, Campbell, Foster, & Miles, 1999). In addition, when focused upon specific cybersecurity skills, the typical applicant will have spent many years honing his/her knowledge and skills, thus experience (and specific types of experience) likely plays a prominent role in determining performance (Assante & Tobey, 2011). Beyond knowledge structures and deep learning over time, we need to examine the constructs that make a worker less susceptible to social-engineering tactics or indifference to password security (e.g., individuals high in the personality trait conscientiousness). Simply put, we need good predictor and criterion models that elucidate the cyber domain. When hiring cybersecurity workers, it is important to remember that cybersecurity work requires a unique set of skills, such as reverse engineering and knowledge of domains that are not completely understood at this point. We must also ask ourselves where hackers and cybersecurity workers acquire the knowledge they use to administer or prevent attacks. Some of the most prominent hackers and cybersecurity experts in the world are high school or college dropouts (e.g., Kevin Mitnick), and currently there are few degrees available in hacking (even ethical hacking). 
Unfortunately, hiring professionals are currently too reliant on applicants holding technical certifications and degrees, thus many potential workers are rejected that would otherwise be excellent candidates (Yankelovich, 2013). However, with help from the National Security Agency (NSA) and others, collegiate programs are now underway at universities across the U.S. and internationally (Dewey, 2013). Furthermore, the DHS recently created the Secretarial Honors Program, a two-year program designed to develop cyber professionals from recent college graduates (Nakashima, 2012). These advances in educational programs will help clear the ambiguity around the mysteries of the ever-changing cyber domain and thus assist in development of sound selection procedures when hiring cybersecurity employees. Another aspect to consider in the hiring of cybersecurity employees is the difference in culture and ethos between these employees and a typical white-collar employee. Many potential cybersecurity workers tend to want to choose (1) whether to work alone (or not), (2) when they want to work, and (3) what tasks they do at any given time (Lawrence, 2014). These traits can add to the difficulty in hiring decisions. Another roadblock to the creation of well-defined hiring models for cybersecurity exists because even within the hacker community there are black hat, white hat, and gray hat hackers who have different motivations and end goals in mind when they hack. Black hat hackers are known for doing cyber attacks for their personal gain or to cause


David W. Dorsey et al. chaos; white hat hackers are ethical hackers and generally are on the “good” side (e.g., exposing security risks in hardware, software, and websites); and gray hat hackers are a mix of the black and white (Kovacs, 2015). The picture of ethos and intent is even fuzzier when one considers growing concern over “insider threats”—those who were hired to apply cyber skills in service of the organization but who then turn and use these skills to advance a personal agenda (Azaria, Richardson, Kraus, & Subrahmanian, 2014). The challenge of insider threat is receiving increasing attention across a broad and interdisciplinary body of research, which includes contributions from computer scientists, psychologists, criminologists, and security practitioners (Azaria et al., 2014). Although there are many obstacles to overcome in creating well-defined selection models for cybersecurity employees, the desire for these technical employees is only increasing the need for good predictors. In 2013, the demand for cybersecurity workers was 3.5 times greater than the IT worker demand overall, and 12 times the demand of the overall job market (Rosenbush, 2013). In 2014, the Pentagon announced plans to triple its cybersecurity workforce by 2016, and the Federal Bureau of Investigation announced plans to add 2,000 cyber workers that year alone (Lawrence, 2014). In 2015, after the well-publicized breaches of Target and Sony Pictures, big business’s demand for cybersecurity workers can be described as “insatiable” (Anand, 2015). This increase in demand for the cybersecurity worker must be met with an equal increase in prioritization of selection methods.

The Internet of Things In 1999, years before the current landscape of exponential technologies, Kevin Ashton foresaw the development of multiple devices connected through the Internet working together and coined the term “The Internet of Things” (Wood, 2015). Ashton proved prescient, as Gartner estimates 4.9 billion things will be connected to the Internet in 2015, up 30% from 2014. Gartner predicts the number of Internet-connected devices to reach 25 billion by 2020, with the fastest growing segment of internet-connected devices being automobiles (Gartner, 2014). Although car shoppers may be joyous that their new vehicle sports an Internet connection, organizations must be mindful that this increase in Internet-connected devices and technological advancements also grants hackers more pathways for infiltration. Thus, it is vital for companies to constantly learn, adapt, and evolve to the ever-changing technological landscape. As mentioned previously, the ubiquity of smartphones and increase in tablet computing is the most recent technological shift impacting organizations. Gartner’s (2014) survey revealed that 40% of workers use their personal smartphones at least casually for work, with 26% of workers using their personal tablets (e.g., iPad). Disturbingly, half of those users stated they perform work on their personal devices without the knowledge of their employer (Gartner, 2014). In 2014, there was a surge in malware created to breach the Android and iOS operating platforms. Malware, such as Wirelurker, preyed upon even non-jail-broken iOS devices (Epper Hoffman, 2015). An Alcatel-Lucent malware report estimated there were 15 million mobile devices infected with malware globally. The growth rate of malware infection on mobile devices was double for the first six months of 2014 as it was for the year 2013, with Android smartphones infected at the fastest rate (Alcatel-Lucent, 2014). Because of statistics such as these, policies for mobile device use, anti-malware software on personal devices used for work, and training on using personal devices for work should be a priority for organizations.

FUTURE RESEARCH DIRECTIONS Given the nascent nature of cybersecurity research and practice, the topic of future research directions could be a chapter in and of itself. Here, we merely skim the surface of how future research might contribute to understanding cybersecurity and its role in selection systems. As a general framework for thinking about cybersecurity research, consider the three-by-three table shown in Table 41.2.


TABLE 41.2 General Framework for Cybersecurity Research Relevant to Selection

|  | Individual | Team/Unit | Organization |
| Inputs | Selecting for specific cyber skills, knowledge, attitudes, and fit; developing creative recruiting and sourcing tools to find cyber talent, including social media | Selecting for and building cyber teams | Creating organizational policies, norms, and standards around cybersecurity for individual employees |
| Throughputs | Predicting aspects of insider threat; providing security interventions at the level of each employee | Monitoring selection systems for exposure and compromise; countering assessment exposure with advanced adaptive testing systems | Building an internal security climate and culture |
| Outputs | Reinforcing security practices among external customers/end users | Building breach-response procedures and teams | Communicating an organization's cybersecurity posture externally |

As shown in this table, cybersecurity is a multilevel phenomenon, affected by actions and events at the individual, team, and organizational levels. Moreover, one can consider activities, events, and interventions that are input, throughputs, or outputs of an organizational system. The entries in the cells are just a few of the myriad of possible research topics that need further exploration. We further explore just a few of these specific areas for future work as follows. As stated earlier, one-fifth of HR professionals use social media as part of their selection process (“Social Networking Websites and Recruiting/Selection,” 2013). Social media usage has skyrocketed, with Facebook alone having 1.44 billion monthly active users, and 65% of those users accessing Facebook daily (Protalinski, 2015). The amount of data being collected by these social media websites is astounding, and one potential area we commend for future research is using Latent Semantic Analysis (LSA) and related text analysis tools to understand critical cybersecurity activities and trends. Latent semantic analysis is a method of natural language processing, which employs a matrix algebra method to model combinations of words. Research has shown that cybercriminals tend to increasingly exchange cybercrime knowledge and transact via online social media (Lau & Xia, 2013); thus, text analysis of social media data could be used to understand potential threats, identify potential organizational vulnerabilities, and even to vet potential cyber job applicants. Another fascinating and important area for research is the role of simulations, gaming, and competitions in hiring decisions, particularly as they relate to the world of cyber. The cyber worker of tomorrow might not see the value in traditional education or technical certification, but rather be self-taught in STEM fields such as computer science and programming. A traditional unstructured interview, cognitive test, or personality assessment may not motivate a person with requisite hacking skills to excel in the selection process. However, the same person might react positively when presented with a simulation scenario to prevent a new virus attack or a video game based on avoiding social engineering tactics. With the demand for excellent cybersecurity workers at an all-time high, researchers must seek creative solutions to appeal to those who are proficient in STEM skills. Although developers and sponsors of cyber games, simulations, and competitions endorse their use, few empirical studies of their efficacy exist, and evidence from other fields such as computer science and mathematics competitions has been mixed (Tobey, Pusey, & Burley, 2014). Some research does support that cyber competitions attract experienced individuals who will remain in the profession for the long term, but future research is needed to understand how competitions may engage more diverse applicants, including those new to the field (Tobey et al., 2014).
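As a minimal sketch of the latent semantic analysis idea described above (and not an implementation used by any of the cited studies), the snippet below builds a TF-IDF weighted term-document matrix from a few toy posts, reduces it with truncated singular value decomposition (the core matrix algebra step in LSA), and compares documents in the reduced space. The toy posts, the scikit-learn dependency, and the choice of two latent dimensions are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# toy "posts" standing in for scraped social media text
posts = [
    "selling fresh credit card dumps and fullz",
    "new exploit kit for sale, bypasses antivirus",
    "tips for writing a strong resume and cover letter",
    "interview advice for new graduates",
]

# term-document matrix (TF-IDF weighted), then rank-2 SVD: the core of LSA
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(posts)
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X)

# documents that load on the same latent "topics" end up close together
sims = cosine_similarity(X_lsa)
print(sims.round(2))
```

In practice, such a model would be trained on far larger corpora, and the resulting document similarities would feed downstream threat-monitoring or screening analyses rather than be inspected directly.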


Additionally, it is important that research continues in the cybersecurity culture domain. The applicants selected for employment become a part of the organizational culture, and that culture must promote adherence to the organization's cybersecurity policies. Although previous research has focused on defining information security culture and assessing a cybersecurity culture (Parsons et al., 2015), future work remains, such as coworker intervention studies (e.g., behavior of workers who notice another employee leaving their computer unlocked) and analysis of integrating new cyber employees into existing organizational cultures.

CONCLUSION Exponential technologies, such as the Internet and smartphones, have forever changed the selection process for employees. Where once an applicant would have filled out a paper application and then either hand-delivered the application or mailed it through the postal service, now the applicant can seek employment by surfing the Internet on his/her smartphone, and the employer can easily collect and store the applicant’s information without any physical contact. Technologies such as computer adaptive tests, social media, and high-fidelity multimedia tests have contributed to changing traditional selection methods. Although these technologies have improved the selection process immeasurably, the improvement does not come without a cost. All of the roads on the information superhighway are two-way streets, and some of the streets have “drivers” (hackers) who do not follow the rules. However, unlike a real-world traffic stop manned by the sheriff in town, it is up to organizations to police the traffic in and out of their networks and for researchers and practitioners to improve the tools used in such efforts.

NOTE
1. A public key is a value provided by a designated authority, which, when combined with a private key, can be used to encrypt messages.
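As a small, hypothetical illustration of the public/private key idea in this note, the snippet below uses the third-party Python cryptography package to generate an RSA key pair, encrypt a short message with the public key, and decrypt it with the matching private key. It is a toy example, not a recommendation of a particular key size or protocol.

```python
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# generate a private key; the matching public key can be shared openly
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# anyone holding the public key can encrypt...
ciphertext = public_key.encrypt(
    b"applicant test scores",
    padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                 algorithm=hashes.SHA256(), label=None),
)

# ...but only the private-key holder can decrypt
plaintext = private_key.decrypt(
    ciphertext,
    padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                 algorithm=hashes.SHA256(), label=None),
)
print(plaintext)
```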

REFERENCES Abraham, S., & Chengalur-Smith, I. (2010). An overview of social engineering malware: Trends, tactics, and implications. Technology in Society, 32(3), 183–196. Alcatel-Lucent. (2014). Alcatel-Lucent malware report reveals that more apps are spying on us, stealing personal information and pirating data minutes. Retrieved July 29, 2015, from https://www.alcatel-lucent.com/press/2014/ alcatel-lucent-malware-report-reveals-more-apps-are-spying-us-stealing-personal-information-and Anand, P. (2015). Attention, graduates: Hackers wanted. Retrieved July 29, 2015, from http://www.marketwatch.com/story/hackers-wanted-the-ethical-ones-2015–04–22 Arena, C. (August 13 2014). 4 reasons why exponential technologies are taking off. Retrieved August 26, 2015. Assante, M., & Tobey, D. (2011). Enhancing the cybersecurity workforce. IT Professional, 13(1), 12–15. doi: 10.1109/MITP.2011.6 Auty, M. (2015). Anatomy of an advanced persistent threat. Network Security, 4, 13–16. Azaria, A., Richardson, A., Kraus, S., Subrahmanian, V. S. (2014). Behavioral analysis of insider threat: A survey and bootstrapped prediction in imbalanced data. IEEE Transactions on Computational Social Systems, 1(2), 135–155. Banker, S. (2015). The office of personnel management security breach: How does it affect our supply chains? Retrieved July  16, 2015, from http://www.forbes.com/sites/stevebanker/2015/07/14/ the-office-of-personnel-management-security-breach-how-does-it-affect-our-supply-chains/ BBC News. (December 27 2014). Xbox and PlayStation resuming service after attack—BBC News. Retrieved July 14, 2015. Beatty, A., Buckley, K., Sprenger, A., & Russell, T. L. (2015). MTurk: Piloting critical thinking items and guiding test development. Paper presented at the Society for Industrial and Organizational Psychology’s Annual Conference, Philadelphia, PA. Bejar, I. I., Lawless, R. R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2002). A feasibility study of on-the-fly item generation in adaptive testing. ETS Research Report Series, 2, i–44.


Cybersecurity Issues Belbey, J. (2015). The Weakest Link in Cybersecurity. Forbes. Bhuyan, M. H., Kashyap, H. J., Bhattacharyya, D. K., & Kalita, J. K. (2014). Detecting distributed denial of service attacks: Methods, tools and future directions. Computer Journal, 57(4), 537–556. Bilge, L., & Dumitras, T. (2012). Before we knew it: An empirical study of zero-day attacks in the real world. In Proceedings of the 2012 ACM conference on computer and communications security (pp. 833–844). ACM. Brewer, R. (2014). Advanced persistent threats: Minimising the damage. Network Security, 4, 5–9. Briggs, B., & Shingles, M. (January 29 2015). Tech trends 2015, exponentials. Retrieved August 26, 2015. Business Wire. (2015). SailPoint Survey Confirms: Employees Will Sell Passwords for $150. Austin, TX: Business Wire. Cisco. (n.d.). What is the difference: Viruses, worms, trojans, and bots? Retrieved September 2, 2015, from http:// www.cisco.com/web/about/security/intelligence/virus-worm-diffs.html Cohen, F. (1987). Computer viruses: Theory and experiments. Computers & Security, 6, 22–35. doi:10.1016/0167–4048(87)90122–2 Columbus, L. (2013). IDC: 87% of connected devices sales by 2017 will be tablets and smartphones. Retrieved from http://www.forbes.com/sites/louiscolumbus/2013/09/12/idc-87-of-connected-devices-by-2017will-be-tablets-and-smartphones/ Creswell, J., & Perlroth, N. (September 19 2014). Ex-employees say home depot left data vulnerable. Retrieved September 9, 2015, from http://www.nytimes.com/2014/09/20/business/ex-employees-say-home-depot-left-data-vulnerable.html Crossler, R. E., Johnston, A. C., Lowry, P. B., Hu, Q., Warkentin, M., & Baskerville, R. (2013). Future directions for behavioral information security research. Computers & Security, 32, 90–101. da Veiga, A., & Martins, N. (2015). Improving the information security culture through monitoring and implementation actions illustrated through a case study. Computers & Security, 49, 162–176. doi: 10.1016/j. cose.2014.12.006 Davey, T., & Parshall, C. G. (1995). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA. Davis, F. D., & Yi, M. Y. (2004). Improving computer skill training: Behavior modeling, symbolic mental rehearsal, and the role of knowledge structures. Journal of Applied Psychology, 89(3), 509. Davison, H. K., Maraist, C. C., Hamilton, R. H., & Bing, M. N. (2012). To screen or not to screen? Using the internet for selection decisions. Employee Responsibilities and Rights Journal, 24(1), 1–21. Department of Homeland Security. (n.d.). Cybersecurity 101. Retrieved on July 29, 2015, from: http://www. dhs.gov/sites/default/files/publications/cybersecurity-101_4.pdf Dewey, C. (2013, September 2011). The NSA sponsors ‘cyber operations’ training at universities. Here’s what students learn. Washington Post online. Retrieved from: https://www.washingtonpost.com/news/ the-switch/wp/2013/09/11/the-nsa-sponsors-cyber-operations-training-at-universities-heres-what-students-learn/?utm_term=.8c90e3eabc2b Dorsey, D. W., Campbell, G. E., Foster, L. L., & Miles, D. E. (1999). Assessing knowledge structures: Relations with experience and posttraining performance. Human Performance, 12, 31–57. Drollinger, S. M., Brennan, D. C., Moclaire, C. M., Olson, T. M., Vorm, E. S., & Foster, C. (2015). Impact of gaming and simulator experience on flight performance. 
Paper presented at the Society for Industrial and Organizational Psychology’s Annual Conference, Philadelphia, PA. Duggan, G. B., Johnson, H., & Grawemeyer, B. (2012). Rational security: Modeling everyday password use. International Journal Of Human-Computer Studies, 70415–70431. doi: 10.1016/j.ijhcs.2012.02.008 Epper Hoffman, K. (2015). Malware on the move. SC Magazine: For IT Security Professionals (15476693), 26(3), 16. Erbschloe, M. (2005). Trojans, worms, and spyware. [electronic resource]: A computer security professional’s guide to malicious code. Amsterdam, Boston : Elsevier Butterworth Heinemann, c2005. Experian Data Breach Resolution. (2014). Data breach response guide. Retrieved September 9, 2015, from http://www.experian.com/assets/data-breach/brochures/2014–2015-data-breach-response-guide.pdf Farr, J. L., & Tippins, N. T. (2010). Handbook of employee selection. New York: Routledge. Federal Communications Commission. (n.d.) Cybersecurity for Small Business. (n.d.). Retrieved July 29, 2015, from https://www.fcc.gov/cyberforsmallbiz Gartner. (2014). Gartner says 4.9 billion connected. Retrieved August 27, 2015, from http://www.gartner.com/ newsroom/id/2905717 Gartner. (2014). Gartner says 40 percent of U.S. employees of large enterprises use personally owned devices for work. Retrieved July 29, 2015, from http://www.gartner.com/newsroom/id/2881217 Gibson, K., & Mulkey, J. (2016). Dumping the dopes who use braindump sites: How IBM turned the tables using data forensics. ATP Innovations in Testing Conference, Orlando, FL. Glass, B. (May 8 2001). Know Your Enemy. PC Magazine, pp. 90–91. Golbeck, J., Robles, C., & Turner, K. (2011). Predicting personality with social media. Paper presented at the CHI’11 extended abstracts on human factors in computing systems.


David W. Dorsey et al. Granville, K. (February 5 2015). 9 recent cyberattacks against big businesses. The New York Times. Retrieved from http://www.nytimes.com Gross, D. (2010). Facebook knows when you’ll break up. Retrieved from http://www.cnn.com/2010/TECH/ social.media/11/02/facebook.breakups/ Harris, R. (2006). Arriving at an anti-forensics consensus: Examining how to define and control the anti-forensics problem. Digital Investigation, 3(Supplement), 44–49. doi:10.1016/j.diin.2006.06.005 Hornby, L. (July 27 2011). Gaming the GRE test in China, with a little online help (K. Wills & A. Richardson, Eds.). Retrieved August 26, 2015. Hsu, F., Chen, H., Ristenpart, T., Li, J., & Su, Z. (2006, December). Back to the future: A framework for automatic malware removal and system repair. Paper presented at 22nd Annual Computer Security Applications Conference (pp. 257–268). International Test Commission. (2014). ITC guidelines on quality control in scoring, test analysis, and reporting of test scores. International Journal of Testing, 14(3), 195–217. Joint Task Force Transformation Initiative. (2011). NIST Special Publication 800-39. Managing information security risk: Organization, mission, and information system view. Gaithersburg, MD: National Institute of Standards and Technology. Kovacs, N. (2015). What is the difference between Black, White and Grey Hat Hackers? Retrieved July 29, 2015, from http://community.norton.com/en/blogs/norton-protection-blog/what-difference-between-blackwhite-and-grey-hat-hackers Krombholz, K., Hobel, H., Huber, M., & Weippl, E. (2014). Advanced social engineering attacks. Journal of Information Security and Applications, 22, 113–122. doi: 10.1016/j.jisa.2014.09.005 Kumar, K. V. (2015). Distributed Denial of Service (DDOS) attack, networks, tools and defense. International Journal of Applied Engineering Research, 10(8), 20959–20971. Lau, R., & Xia, Y. (2013). Latent text mining for cybercrime forensics. International Journal of Future Computer and Communication, 2(4), 368–371. Lawrence, D. (2014). The U.S. Government wants 6,000 New ‘Cyberwarriors’ by 2016. Retrieved July 29, 2015, from http://www.bloomberg.com/bw/articles/2014–04–15/uncle-sam-wants-cyber-warriors-but-canhe-compete Levine, M., & Date, J. (2015). 22 Million Affected by OPM Hack, Officials Say. ABC News Network. Retrieved from http://abcnews.go.com/US/exclusive-25-million-affected-opm-hack-sources/story?id=32332731 Martin, A. (2015). Weakest, common passwords of 2014 revealed. Retrieved July 29, 2015, from http://www. welivesecurity.com/2015/01/21/weakest-common-passwords-2014-revealed/ Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114(3), 449. Naglieri, J. A., Drasgow, F., Schmit, M., Handler, L., Prifitera, A., Margolis, A., & Velasquez, R. (2004). Psychological testing on the Internet: New problems, old issues. American Psychologist, 59(3), 150. Nakashima, E. (2012). Federal agencies, private firms fiercely compete in hiring cyber experts. Retrieved July 29, 2015, from https://www.washingtonpost.com/world/national-security/federal-agencies-private-firms-fiercely-compete-in-hiring-cyber-experts/2012/11/12/a1fb1806–2504–11e2-ba29–238a6ac36a08_story.html Nakashima, E., & Peterson, A. (June 9 2014). Cybercrime and espionage is costing the global economy near half a trillion dollars annually. 
Retrieved September 9, 2015, from https://www.washingtonpost.com/world/national-security/report-cybercrime-and-espionage-costs-445-billion-annually/2014/06/08/8995291c-ecce11e3–9f5c-9075d5508f0a_story.html Niccolai, J. (2015). Code typo helps tie North Korea to the Sony hack. Retrieved July 15, 2015, from http://www. computerworld.com/article/2885534/code-typo-helps-tie-north-korea-to-the-sony-hack.html Norton. (2015). Cybercrime Prevention Tips from Norton | Norton. (n.d.). Retrieved July 29, 2015, from http:// us.norton.com/prevention-tips/article Park, G., Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Kosinski, M., Stillwell, D. J., . . . Seligman, M. E. P. (2015). Automatic personality assessment through social media language. Journal of Personality and Social Psychology, 108(6), 934. Parsons, K. M., Young, E., Butavicius, M. A., McCormac, A., Pattinson, M. R., & Jerram, C. (2015). The influence of organizational information security culture on information security decision making. Journal of Cognitive Engineering and Decision Making, 9(2), 117–129. Ployhart, R. E., Weekley, J. A., Holtz, B. C., & Kemp, C. (2003). Web-based and paper-and-pencil testing of applicants in a proctored setting: Are personality, biodata, and situational judgment tests comparable? Personnel Psychology, 56(3), 733–752. Preston, J. (2011). Social media history becomes a new job hurdle. New York Times. Protalinski, E. (April 22 2015). Facebook passes 1.44B monthly active users and 1.25B mobile users; 65% are now daily users. Retrieved September 9, 2015, from http://venturebeat.com/2015/04/22/facebook-passes-1– 44b-monthly-active-users-1–25b-mobile-users-and-936-million-daily-users/


Cybersecurity Issues PRWeb. (2015). “123456” maintains the top spot on SplashData’s annual “Worst Passwords” List. Retrieved July 29, 2015, from http://www.prweb.com/releases/2015/01/prweb12456779.htm Reuters. (June 9, 2014). Cyber crime costs global economy $445 billion a year: Report. Retrieved July 29, 2015, from http://www.reuters.com/article/2014/06/09/us-cybersecurity-mcafee-csis-idUSKBN0EK0SV20140609 Rosenbush, S. (2013). Demand for Cyber Security jobs is soaring. Retrieved July 29, 2015, from http://blogs.wsj. com/cio/2013/03/04/demand-for-cyber-security-jobs-is-soaring/ Roth, P. L., Bobko, P., Van Iddekinge, C. H., & Thatcher, J. B. (2013). Social media in employeeselection-related decisions a research agenda for uncharted territory. Journal of Management, 42, 269–298. doi: 0149206313503018 Rosenzweig, P. (2012). Thinking about cybersecurity: From cyber crime to cyber warfare. Chantilly, VA: The Great Courses. Ruggiero, P., & Foote, J. (2011). Cyber threats to mobile phones. Carnegie Mellon University Retrieved from https://http://www.us-cert.gov/sites/default/files/publications/cyber_threats-to_mobile_phones.pdf Ryan, A. M., & Ployhart, R. E. (2014). A century of selection. Annual Review of Psychology, 65, 693–717. Schwartz, M. (2011). How USB sticks cause data breach, Malware Woes. Retrieved July 16, 2015, from http:// www.darkreading.com/risk-management/how-usb-sticks-cause-data-breach-malware-woes/d/ d-id/1099437? Siponen, M. T., & Oinas-Kukkonen, H. (2007). A review of information security issues and respective research contributions. Database for Advances in Information Systems, 38(1), 60. doi:10.1145/1216218.1216224. Smith, R. W. (2004). The impact of Braindump sites on item exposure and item parameter drift. Paper presented at the annual meeting of the American Education Research Association, San Diego, CA. Social Networking Websites and Recruiting/Selection. (2013). SHRM Staffing Research. StatCounter Global Stats. (2015). Retrieved April 14, 2015, from http://gs.statcounter.com/ Steinke, J., Bolunmez, B., Fletcher, L., Wang, V., Tomassetti, A. J., Repchick, K. M., . . . & Tetrick, L. E. (2015). Improving cybersecurity incident response team effectiveness using teams-based research. IEEE Security & Privacy, 13(4), 20–29. Summers, D. (2014). Regin, a new piece of spyware, said to infect telecom, energy, airline industries. Retrieved July 11, 2015, from http://fortune.com/2014/11/23/regin-malware-surveillance/ Symantec. (October 1 2013). 2013 Norton Report. Retrieved July 29, 2015. Symantec. (2015). What is the difference between viruses, worms, and Trojans? Retrieved July 14, 2015, from https://support.symantec.com/en_US/article.TECH98539.html Thomson, G. (2011). APTs: A poorly understood challenge. Network Security, 11, 9–11. Thurimella, R., & Mitchell, W. (2009). Cloak and Dagger: Man-In-The-Middle and Other Insidious Attacks. International Journal of Information Security & Privacy, 3(3), 55. doi: 10.4018/jisp.2009100704 Tippins, N. T. (2011). Overview of technology-enhanced assessments. In N. T. Tippins, S. Adler, & I. Kraut (Eds.), Technology-enhanced assessment of talent (pp. 1–18). San Francisco, CA: Jossey-Bass. Tippins, N. T. (2015). Technology and assessment in selection. Annual Review of Organizational Psychology and Organizational Behavior, 2, 551–582 (Volume publication date April 2015). Tobey, D. H., Pusey, P., & Burley, D. L. (2014). Engaging learners in cybersecurity careers: Lessons from the launch of the national cyber league. ACM Inroads, 5(1), 53–56. 
doi: =10.1145/2568195.2568213 http:// doi.acm.org/10.1145/2568195.2568213 Trippe, D. M., Moriarty, K. O., Russell, T. L., Carretta, T. R., & Beatty, A. S. (2014). Development of a cyber/information technology knowledge test for military enlisted technical training qualification. Military Psychology, 26(3), 182. U.S. Department of Justice. (2015). Best practices for victim response and reporting of cyber incidents. 3, 2015, from http://www.justice.gov/sites/default/files/criminal-ccips/legaRetrieved August  cy/2015/04/30/04272015reporting-cyber-incidents-final.pdf Van Iddekinge, C. H., Lanivich, S. E., Roth, P. L., & Junco, E. (2013). Social media for selection? Validity and adverse impact potential of a facebook-based assessment. Journal of Management, 42, 1811–1835. doi: 0149206313515524. Wainer, H., Dorans, N. J., Flaugher, R., Green, B. F., & Mislevy, R. J. (2000). Computerized adaptive testing: A primer. New York: Routledge. Weekley, J. A., Hawkes, B., Guenole, N.,  & Ployhart, R. E. (2015). Low-fidelity simulations. Annual Review of Organizational Psychology and Organizational Behavior, 2, 295–322. doi: 10.1146/annurev-orgpsych-032414–111304 Wiederhold, B. K. (2014). The role of psychology in enhancing cybersecurity. Cyberpsychology, Behavior, and Social Networking, 17(3), 131–132. Wisenberg Brin, D. (2012). Employer beware: Spyware comes to mobile. Retrieved from http://www.shrm.org/ hrdisciplines/technology/articles/pages/spyware-comes-to-mobile.aspx Wood, A. (2015). The internet of things is revolutionising our lives, but standards are a must. Retrieved August 27, 2015, from http://www.theguardian.com/media-network/2015/mar/31/the-internet-of-things-is-revolutionisingour-lives-but-standards-are-a-must


Yankelovich, M. (August 6, 2013). Cybersecurity threats: A people problem HR can solve. Retrieved August 27, 2015, from https://hr.blr.com/whitepapers/HR-Administration/Employee-Privacy/Cybersecurity-threats-A-people-problem-HR-can-solv
Zetter, K. (2010). Google hack attack was ultra sophisticated, new details show. Retrieved July 6, 2015, from http://www.wired.com/2010/01/operation-aurora/
Zeller, K. (2015). Kaspersky finds new Nation-State attack in its own network. Retrieved July 12, 2015, from http://www.wired.com/2015/06/kaspersky-finds-new-nation-state-attack-network/


42 MODERN PSYCHOMETRIC THEORY TO SUPPORT PERSONNEL ASSESSMENT AND SELECTION
STEPHEN STARK, OLEKSANDR S. CHERNYSHENKO, AND FRITZ DRASGOW

Advances in computing technology have rapidly expanded the options available for psychological assessment in work contexts. Faster computers, mobile devices with multimedia capabilities, and Internet access have virtually eliminated the need for paper-and-pencil tests. The types of items that can be presented to examinees have also expanded beyond traditional multiple-choice and Likert-type formats to include more complex stimuli, such as videos and immersive task simulations (Drasgow & Olson-Buchanan, 1999; Mills, Potenza, Fremer, & Ward, 2002; Stark, Martin, & Chernyshenko, 2015). Today, virtually all assessments can be offered “on-demand,” meaning that they can be accessed by test takers at any time, and results can be provided readily to stakeholders to expedite the personnel screening process. One of the important implications of computerization is that more sophisticated measurement technologies have gradually been implemented to support assessment needs. Item response theory (IRT) methods are particularly well-suited for designing and evaluating selection tests, because the parameters that describe items are invariant across examinee subpopulations, and neither the properties of individual items nor test scores depend fundamentally on the subset of items composing a test. IRT methods can thus be used to construct parallel or tailored test forms, to match test difficulty to individual examinee capabilities as in computerized adaptive testing (CAT), to link item properties and test scores across different measurement occasions, to identify aberrant examinee response patterns, and to test for measurement invariance across different examinee groups (e.g., Embretson & Reise, 2000; Hulin, Drasgow, & Parsons, 1983; Maydeu-Olivares & McArdle, 2005). As more flexible IRT methods are developed, IRT is likely to become the methodology of choice for supporting structured assessments. The aim of this chapter is to introduce readers to IRT models and methods now being used in personnel assessment and selection. We hope to help readers who lack in-depth IRT training to better understand some models and applications. First, we describe four IRT models commonly used in cognitive ability and “noncognitive” (e.g., personality, attitudes, and interests) testing. Second, we discuss how examinee response data are scored and show how adding or removing items from a test affects measurement precision. Third, we discuss item response and item information functions and how they can be used to increase measurement efficiency with adaptive item selection. Finally, we describe the concept of “person fit” and how person-fit methods can be used to verify the results of unproctored tests and identify unmotivated or careless examinees. We also discuss some examples involving actual workplace tests and IRT techniques used in practice.


IRT MODELS

Three-Parameter Logistic Model (3PLM)

Item response theory involves probabilistic mathematical models that describe how an examinee's trait level and an item's properties jointly influence item responses. For example, one of the most parsimonious and well-recognized IRT models is the one-parameter logistic or Rasch model (Rasch, 1960) for dichotomous data (correct-incorrect; agree-disagree). The Rasch model uses examinee trait level (θ) and just one item property, item difficulty (the location of an item on the trait continuum), to predict the probability of answering an item correctly. In most psychological domains, however, more item properties (parameters) are used to model the probability of correct responses. Specifically, for multiple-choice items, like those often used in cognitive ability tests, item discrimination (how well an item differentiates examinees of different ability levels) and guessing are also taken into account. In the three-parameter logistic model (3PLM; Birnbaum, 1968), each item is characterized by an item difficulty (a.k.a. extremity or location) parameter (b), an item discrimination parameter (a), and a lower asymptote or "guessing" parameter (c). The probability of a correct or positive response to item i for the 3PLM is given by:

P(u_i = 1 \mid \theta) = c_i + \frac{1 - c_i}{1 + \exp[-1.7 a_i (\theta - b_i)]} \qquad (1)

where ui denotes an examinee’s response to item i (ui = 1 if correct; 0 if incorrect), P(ui = 1|θ) is the probability of a correct response for a randomly chosen examinee having trait level θ, and 1.7 is a scaling factor that is included for historical reasons. Note that the two-parameter logistic model (2PLM) and the one-parameter logistic model (1PLM or Rasch model) can be obtained from Equation 1 by placing “constraints” on the a- and c-parameters. If one assumes no guessing and sets c = 0 for all items, then the 2PLM results. If one also assumes that all items are equally discriminating (e.g., set all a = 1), then the 1PLM results.

The structure of the 3PLM is most easily understood by plotting the probability of a correct response as a function of θ on the latent trait range [−3.0, +3.0] and examining how the curve changes as a function of the item parameters. Figure 42.1 shows 3PLM item response functions (IRFs) for three hypothetical items having the same discrimination and guessing parameters (a = 1.5 and c = 0.2, respectively) but different difficulty parameters (b = −1.0, 0.0, 1.0, respectively). It can be seen that the item difficulty parameter affects the lateral position of the IRFs along the trait continuum. As the difficulty parameter increases from −1.0 to 1.0, the probability of a correct response, at a particular θ, decreases. For example, only examinees having θ > 1.5 have a high probability of answering the item with b = 1 correctly.

Figure 42.2 illustrates how the item discrimination parameter affects the shape of IRFs. Shown are items that exhibit rather low discrimination (a = 0.5), medium discrimination (a = 1.0), and high discrimination (a = 2.0). It is evident that as a increases, the IRFs become steeper near θ = b. Also note that the difference in response probabilities at trait levels of θ = −0.5 and θ = 0.5 increases as item discrimination increases. When a = 0.5, the response probability difference is only 0.2, but when a = 2.0, the response probability difference is nearly 0.8. Thus, items with large a-parameters better differentiate examinees at trait levels near the item difficulty parameter.

Finally, Figure 42.3 illustrates the effect of the c-parameter of the 3PLM. With a typical multiple-choice item, even low-ability examinees can sometimes guess the correct answer. As shown, the c-parameter affects the lower asymptote of the IRF so that the probability of a correct response remains above zero for any trait level. Values of the c-parameter typically range from 0.1 to 0.3 for items having four or five response options.

The 3PLM is often applied to cognitive ability test data, because in most cases, it is reasonable to assume that items are not equally discriminating and guessing is a realistic possibility (Hulin et al., 1983). The process of estimating item parameters is called item calibration.
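To make Equation 1 concrete, the following minimal sketch (ours, assuming only NumPy) computes 3PLM response probabilities using the hypothetical item parameters from Figure 42.1 and shows how the constraints described above recover the 2PLM and 1PLM.

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.7):
    """3PLM item response function (Equation 1).
    theta: latent trait value(s); a, b, c: discrimination, difficulty, guessing."""
    theta = np.asarray(theta, dtype=float)
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

theta = np.linspace(-3, 3, 13)

# Three items like those in Figure 42.1: same a and c, different difficulty b
for b in (-1.0, 0.0, 1.0):
    p = irf_3pl(theta, a=1.5, b=b, c=0.2)
    print(f"b = {b:+.1f}:", np.round(p, 2))

# Constraining parameters recovers the simpler models:
p_2pl = irf_3pl(theta, a=1.5, b=0.0, c=0.0)   # 2PLM: no guessing
p_1pl = irf_3pl(theta, a=1.0, b=0.0, c=0.0)   # 1PLM/Rasch: equal discrimination, no guessing
```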


FIGURE 42.1  3PLM IRFs for Three Items Having Different Difficulty Parameters (probability of a positive response plotted against theta for b = −1, 0, and 1)

FIGURE 42.2  3PLM IRFs for Three Items Having Different Discrimination Parameters (probability of a positive response plotted against theta for a = .5, 1, and 2)

Because the 3PLM equation has only one θ, the test data must be essentially unidimensional, meaning that just one dominant factor underlies the item responses. Drasgow and Parsons (1983), Hattie (1985), and Stout (1987) discuss several methods for testing the unidimensionality assumption with dichotomous data, but one of the simplest approaches, which works reasonably well in practice, is exploratory factor analysis of item tetrachoric correlations. If the ratio of the first to second eigenvalue is greater than 3, then the response data may be viewed as sufficiently unidimensional (see Lord, 1980; see also Chapter 8 of Hulin et al., 1983, concerning violations of tetrachoric correlation assumptions and the alternative modified parallel analysis procedure).
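As a small illustration of this screening rule, the sketch below computes the first-to-second eigenvalue ratio of an item correlation matrix. It assumes the tetrachoric correlations have already been estimated with specialized software (that step is not shown), and the toy matrix is hypothetical.

```python
import numpy as np

def first_to_second_eigenvalue_ratio(corr):
    """Ratio of the first to second eigenvalue of an item correlation matrix.
    `corr` is assumed to be a tetrachoric correlation matrix computed elsewhere."""
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]   # descending order
    return eigvals[0] / eigvals[1]

# Toy 4 x 4 matrix standing in for a real tetrachoric matrix (hypothetical values)
corr = np.array([[1.00, 0.60, 0.50, 0.55],
                 [0.60, 1.00, 0.45, 0.50],
                 [0.50, 0.45, 1.00, 0.40],
                 [0.55, 0.50, 0.40, 1.00]])

ratio = first_to_second_eigenvalue_ratio(corr)
print(round(ratio, 2),
      "-> essentially unidimensional" if ratio > 3 else "-> examine dimensionality further")
```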


FIGURE 42.3  3PLM IRFs for Three Items Having Different “Guessing” Parameters (probability of a positive response plotted against theta for c = .1, .2, and .3)

Several specialized software packages are available for estimating 1PLM, 2PLM, and 3PLM IRT parameters. BILOG-MG (Zimowski et al., 2003) is still widely used, but many newer applications have been developed, and statistical packages such as Mplus (Muthén & Muthén, 2015), SAS, and R now contain functions for fitting these models. For best results, sample sizes of 250, 500, and 1,000 are often recommended for the 1PLM, 2PLM, and 3PLM, respectively. Readers seeking guidance on parameter estimation are referred to Hulin et al. (1983) and Hambleton and Swaminathan (1985). For readers interested in model derivations, Baker and Kim (2004) is recommended.

Samejima’s Graded Response Model (SGRM)

Models for polytomous data are more complex than models for dichotomous data, because they must account for multiple response categories. In polytomous IRT terminology, the function that relates trait level to the probability of endorsing, or choosing, a particular response category is called an option response function (ORF). Among the most widely used polytomous IRT models are Samejima’s graded response model (SGRM; Samejima, 1969), Bock’s nominal model (Bock, 1972), and Muraki’s generalized partial credit model (Muraki, 1992). Here, we focus on the SGRM, because it is the most commonly used with questionnaire data involving Likert-type response formats (e.g., strongly disagree, disagree, neutral, agree, strongly agree). The mathematical form of the SGRM is

P(u = k \mid \theta) = \frac{1}{1 + \exp[-1.7\,a(\theta - b_k)]} - \frac{1}{1 + \exp[-1.7\,a(\theta - b_{k+1})]} , \quad (2)

where u denotes the response to a polytomously scored item, k is the particular option selected by the respondent, a is the item discrimination parameter, and bk is referred to as a location or extremity parameter. Note that an item with k options will have one discrimination parameter and k − 1 extremity parameters. These parameters are used to calculate what are known as boundary response functions, and the differences between successive boundary response functions give the respective option response functions, which relate trait level to the probability of endorsing a particular response category, as shown in Equation 2. Example SGRM ORFs for a five-option Likert-type item with a = 2.0, b1 = −1.5, b2 = −0.5, b3 = 0.7, and b4 = 1.2 are shown in Figure 42.4.

As shown in Figure 42.4, the b-parameters correspond to the points on the trait continuum where adjacent ORFs intersect. For example, the ORF for option 1, which is the left-most monotonically decreasing curve, intersects with the ORF for option 2 at b1 = −1.5. Similarly, the ORF for option 2 intersects with the ORF for option 3 at b2 = −0.5, and so on. The five ORFs are distinct (steep slopes and narrow peaks), because the discrimination parameter is large (a = 2.0). As a result, there are clearly identifiable regions of the trait continuum where the endorsement of a particular response option is most likely. For example, examinees located between −0.2 and 0.6 on the trait continuum have a 60% or higher chance of endorsing option 3, a 10–20% chance of endorsing options 2 or 4, and virtually zero chance of endorsing options 1 or 5. Thus, endorsing option 3 tells us that an examinee is most likely located in the −0.2 to 0.6 range. In contrast, if an item has a low discrimination parameter, the endorsement of a particular response option provides less information with regard to an examinee’s location. For example, Figure 42.5 shows the ORFs for a five-option item having the same b-parameters as in the previous figure but an a-parameter of 0.4. As can be seen in the figure, the ORFs for options 2, 3, and 4 are very flat across the trait continuum and overlap to a large extent. Thus, selecting any of these options tells us relatively little about the examinee’s trait level.

Historically, the most widely used software for estimating SGRM parameters was the MULTILOG computer program (Thissen, 1991). However, as with the previously mentioned dichotomous models, SGRM parameters can now be estimated with Mplus (Muthén & Muthén, 2015), SAS, and R, as well as newer specialized applications. Samples of 1,000 have been recommended for SGRM parameter estimation, but much smaller samples may be acceptable in many settings. One concern, however, with small samples is that at least 20 to 50 persons per category are recommended for parameter estimation; otherwise, infrequently endorsed categories should be collapsed.
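Because the ORFs are simply differences of adjacent boundary response functions, they are straightforward to compute. The minimal sketch below (ours, assuming only NumPy) builds SGRM ORFs from Equation 2 using the hypothetical item parameters shown in Figure 42.4.

```python
import numpy as np

def brf(theta, a, b, D=1.7):
    """Boundary response function: P(response in category k or higher | theta)."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def sgrm_orfs(theta, a, bs, D=1.7):
    """SGRM option response functions (Equation 2).
    bs: ordered extremity parameters b_1..b_{K-1} for a K-option item.
    Returns an array of shape (K, len(theta)) of category probabilities."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    # Boundaries for the lowest and highest categories are fixed at 1 and 0
    upper = np.vstack([np.ones_like(theta)] +
                      [brf(theta, a, b, D) for b in bs] +
                      [np.zeros_like(theta)])
    return upper[:-1] - upper[1:]   # differences of successive boundaries

# Five-option item from Figure 42.4: a = 2.0, b = (-1.5, -0.5, 0.7, 1.2)
probs = sgrm_orfs(theta=[-2.0, 0.0, 2.0], a=2.0, bs=[-1.5, -0.5, 0.7, 1.2])
print(np.round(probs, 2))    # rows = options 1..5, columns = theta values
print(probs.sum(axis=0))     # each column sums to 1.0
```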

FIGURE 42.4  SGRM ORFs for a Five-Option Item Having a High Discrimination Parameter (option endorsement probabilities for options 1–5 plotted against theta)


FIGURE 42.5  SGRM ORFs for a Five-Option Item Having a Low Discrimination Parameter (option endorsement probabilities for options 1–5 plotted against theta)

Readers interested in a more detailed description of the SGRM and applications in noncognitive testing may refer to Embretson and Reise (2000) and Chernyshenko et al. (2001).

Generalized Graded Unfolding Model (GGUM)

Both the 3PLM and SGRM described above belong to a class of IRT models known as dominance models. Dominance models assume that the probability of a correct response (or endorsement of the highest response category) increases as an examinee’s trait level increases. Thus, respondents with very high trait scores (e.g., +3.0) are those who are most likely to answer an item correctly or endorse a “strongly agree” response option (assuming negatively worded items have been reverse scored). An alternative family of models, known as ideal point models, makes a different assumption about item responding: namely, that the probability of endorsement increases as a function of the similarity between a person and an item. For example, a person is most likely to endorse an attitude item expressing an opinion similar to his or her own. Ideal point models have been found to work well with items measuring job attitudes (Carter & Dalal, 2010), personality (Stark et al., 2006), and vocational interests (Tay et al., 2009). In particular, these models have been shown to accommodate neutral (moderate) items, which are typically discarded when dominance models are applied (Chernyshenko et al., 2007).

James Roberts and colleagues have developed a number of item response models that implement an ideal point response process. The most general and widely used is the Generalized Graded Unfolding Model (GGUM; Roberts, Donoghue, & Laughlin, 2000). Under the GGUM, the probability of obtaining an observed response Ui = u is defined as:

P(U_i = u \mid \theta_j) = \frac{\exp\left\{\alpha_i\left[u(\theta_j - \delta_i) - \sum_{k=0}^{u}\tau_{ik}\right]\right\} + \exp\left\{\alpha_i\left[(M - u)(\theta_j - \delta_i) - \sum_{k=0}^{u}\tau_{ik}\right]\right\}}{\sum_{w=0}^{C}\left(\exp\left\{\alpha_i\left[w(\theta_j - \delta_i) - \sum_{k=0}^{w}\tau_{ik}\right]\right\} + \exp\left\{\alpha_i\left[(M - w)(\theta_j - \delta_i) - \sum_{k=0}^{w}\tau_{ik}\right]\right\}\right)} \quad (3)

where:

θj = the location of respondent j on the continuum underlying responses,
αi = the discrimination parameter for item i,
δi = the location of item i on the continuum underlying responses,
u = 0, 1, 2, . . ., C; u = 0 corresponds to the response option with the strongest level of disagreement and u = C corresponds to the response option with the strongest level of agreement,
C = the number of observable response options minus 1,
w = an index for summing over observable response options 0 to C,
M = 2C + 1 is the number of subjective response categories, indexed k = 0, 1, 2, . . ., M, and
τik = the location of the kth subjective response category threshold on the latent continuum relative to the location of item i, where τi0 = 0 and \sum_{k=0}^{M}\tau_{ik} = 0.

The GGUM equation defines the option response function for each observable response. According to the model, two subjective responses underlie each observable response; so, a respondent may agree or disagree with an item from a position that is above or below the item on the trait continuum (the formula for subjective response functions can be found in Roberts et al., 2000). Consequently, each ORF is obtained by summing the two corresponding subjective response functions, as shown in Equation 3. Figure 42.6 displays ORFs for a hypothetical item (i) with four response options, where 1 = strongly disagree, 2 = disagree, 3 = agree, 4 = strongly agree, αi = 2, δi = 0, C = 3, M = 7, τi1 = −1, τi2 = −.7, τi3 = −.4, τi4 = 0, τi5 = .4, τi6 = .7, and τi7 = 1. The values of τik indicate where successive subjective response functions intersect. GGUM item parameters may be estimated using the GGUM2004 computer program (Roberts & Fang, 2006) or Markov chain Monte Carlo (MCMC) algorithms developed in various statistical programming languages (e.g., de la Torre, Stark, & Chernyshenko, 2006; Wang et al., 2014). Although samples of 700 or more have been recommended for GGUM calibration, we have found that 400–500 may be satisfactory when the primary emphasis is on scoring.
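For readers who wish to experiment with Equation 3, a minimal NumPy sketch follows. It uses the hypothetical item parameters listed above for Figure 42.6; the function names are ours and are not part of GGUM2004 or any published package.

```python
import numpy as np

def ggum_orfs(theta, alpha, delta, taus):
    """GGUM option response functions (Equation 3) for one item.
    taus: thresholds tau_0..tau_M with tau_0 = 0 (here M = 2*C + 1).
    Returns an array of shape (C + 1, len(theta)): P(U = u | theta) for u = 0..C."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    M = len(taus) - 1            # thresholds are indexed k = 0..M
    C = (M - 1) // 2             # number of observable options minus 1
    dist = theta - delta
    cum_tau = np.cumsum(taus)    # sum of tau_0..tau_u for each u

    def pair(u):
        # Both exponentials in a numerator term share the same threshold sum (Equation 3)
        s = cum_tau[u]
        return (np.exp(alpha * (u * dist - s)) +
                np.exp(alpha * ((M - u) * dist - s)))

    numer = np.vstack([pair(u) for u in range(C + 1)])
    return numer / numer.sum(axis=0)   # denominator is the sum over all options

# Hypothetical item from Figure 42.6: alpha = 2, delta = 0, four observable options (C = 3, M = 7)
taus = [0.0, -1.0, -0.7, -0.4, 0.0, 0.4, 0.7, 1.0]
probs = ggum_orfs(theta=[-2.0, 0.0, 2.0], alpha=2.0, delta=0.0, taus=taus)
print(np.round(probs, 2))   # rows = options 1 ("strongly disagree") .. 4 ("strongly agree")
print(probs.sum(axis=0))    # columns sum to 1.0
```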

FIGURE 42.6  GGUM ORFs for an Item Having Four Observable Response Categories (option endorsement probabilities for options 1–4 plotted against theta)


Multi-Unidimensional Pairwise Preference (MUPP) Model

Historically, most noncognitive measures used for personnel screening have required respondents to indicate their level of agreement with individual statements using a Likert-type response format. More recently, however, there has been a great deal of interest in forced-choice formats that require respondents to rank or choose between two or more statements. These statements typically measure different dimensions and are matched in terms of social desirability in an effort to reduce faking and other response biases (Stark et al., 2014). Some recent examples of forced-choice measures in the personality domain are the OPQ-32 (https://online.shl.com/gb/en-gb/products/opq32r), TAPAS (Stark et al., 2014), and the ETS WorkFORCE Assessment (Naemi et al., 2014).

Traditional approaches to scoring forced-choice measures suffer from the problem of ipsativity (Cattell, 1944; Hicks, 1970; Salgado, Anderson, & Tauriz, 2014). A set of scales is said to be ipsative when the total score, obtained by summing the scale scores, is a constant. In this situation, scores can be compared meaningfully within persons, but between-person comparisons are problematic. However, IRT methods have since been developed to overcome these ipsativity problems (Böckenholt, 2004; Brown & Maydeu-Olivares, 2011; de la Torre et al., 2012; Stark, 2002; Stark, Chernyshenko, & Drasgow, 2005), thus opening new possibilities for the use of forced-choice measures in job selection.

An example of a forced-choice IRT model for pairwise preference data is the Multi-Unidimensional Pairwise Preference model (MUPP; Stark, 2002; Stark et al., 2005). The model assumes that when a respondent is presented with a pair of statements, denoted s and t, and is asked to choose the statement that is “more like you,” he or she evaluates each statement independently until a preference is reached. The probability of preferring statement s to statement t in item i, given trait scores (θds, θdt) on the dimensions ds and dt represented by those statements, can be written as

P_{(s>t)_i}(\theta_{d_s}, \theta_{d_t}) = \frac{P_s(1 \mid \theta_{d_s})\,P_t(0 \mid \theta_{d_t})}{P_s(1 \mid \theta_{d_s})\,P_t(0 \mid \theta_{d_t}) + P_s(0 \mid \theta_{d_s})\,P_t(1 \mid \theta_{d_t})} , \quad (4)

where:

P(s > t)i(θds, θdt) = probability of a respondent preferring statement s to statement t in pairwise preference item i;
d = index for dimensions, where d = 1, . . ., D, ds represents the dimension assessed by statement s, and dt represents the dimension assessed by statement t;
s, t = indices for the first and second statements, respectively, in an item;
(θds, θdt) = latent trait scores for the respondent on dimensions ds and dt, respectively;
Ps(1|θds) = probability of endorsing statement s given trait score θds;
Ps(0|θds) = probability of not endorsing statement s given trait score θds;
Pt(1|θdt) = probability of endorsing statement t given trait score θdt; and
Pt(0|θdt) = probability of not endorsing statement t given trait score θdt.

The probability of preferring a statement in a pairwise preference item thus depends on a respondent’s trait scores, θds and θdt, and the unidimensional model chosen to compute endorsement probabilities for the individual statements composing a pair. To date, the majority of measures based on the MUPP model have used the dichotomous version of the GGUM (Roberts et al., 2000), but other IRT models have been explored (e.g., Seybert, 2013). In most applications, parameters for the statements representing each dimension have been estimated by administering statements individually to large samples of examinees (e.g., 400–500) using an ordinal response format (Stark, 2002). The ordinal responses are then dichotomized and calibrated using software for the selected unidimensional IRT model. Alternatively, statement parameters may be calibrated directly from pairwise preference responses by using MCMC methods (e.g., Lee, 2016; Seybert, 2013).
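A minimal sketch of Equation 4 follows. It assumes the dichotomous GGUM as the unidimensional model for each statement, as is typical in MUPP applications; the statement parameter values are hypothetical and the function names are ours.

```python
import numpy as np

def ggum_endorse(theta, alpha, delta, tau):
    """Dichotomous GGUM endorsement probability for a single statement
    (C = 1, M = 3, with tau_1 = tau and the usual constraint tau_0 = 0)."""
    def pair(u, s):
        return (np.exp(alpha * (u * (theta - delta) - s)) +
                np.exp(alpha * ((3 - u) * (theta - delta) - s)))
    num_agree = pair(1, tau)
    return num_agree / (pair(0, 0.0) + num_agree)

def mupp_prob(theta_s, theta_t, params_s, params_t):
    """Probability of preferring statement s over statement t (Equation 4)."""
    ps = ggum_endorse(theta_s, *params_s)   # P_s(1 | theta_ds)
    pt = ggum_endorse(theta_t, *params_t)   # P_t(1 | theta_dt)
    return (ps * (1 - pt)) / (ps * (1 - pt) + (1 - ps) * pt)

# Hypothetical statement parameters (alpha, delta, tau_1) on two dimensions
dominance_stmt = (1.2, 0.5, -1.0)
responsibility_stmt = (1.0, -0.3, -0.8)
print(mupp_prob(theta_s=1.0, theta_t=0.0,
                params_s=dominance_stmt, params_t=responsibility_stmt))
```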


With pairwise preference items that involve statements representing different dimensions, the relationship between trait levels and endorsement probabilities is represented by a three-dimensional surface, which has many peaks and valleys. An example item response surface for personality statements reflecting Dominance and Responsibility is shown in Figure 42.7. In the figure, values along the vertical axis indicate the probability of preferring statement s to statement t given a respondent’s standing on the respective dimensions and each statement’s GGUM parameters; these values were computed using Equation 4. Note that when forced-choice items involve more than two statements (e.g., triples or tetrads), more complex IRT models are needed (e.g., de la Torre et al., 2012; Joo, Lee, & Stark, 2016; Lee, 2016). When such items involve more than two dimensions, there is no point in creating item response surfaces because they would be difficult to display and interpret.

IRT SCORING

The logic of scoring examinees in IRT is different from classical test theory (CTT). In CTT, item responses are scored dichotomously and summed over items to obtain a total (number correct) score, which may be standardized and transformed to another metric for score reporting (e.g., the IQ metric with mean 100 and standard deviation 15 or the SAT metric with mean 500 and standard deviation 100). In IRT, estimating trait levels is analogous to clinical diagnosis, where a clinician tries to estimate the most likely “disease” given a set of presenting “symptoms” (see Embretson & Reise, 2000). In IRT, the symptoms are item responses and the disease is an examinee’s trait level. Note that both IRT and clinical diagnosis assume that other outcomes are also possible (an examinee may have a different trait level or a patient can have a different disease), but the diagnosed outcome (trait level) is the most likely one. Therefore, in IRT, scoring is a search process in which the presenting behaviors (item responses and their parameters) are used to determine what trait level is most likely.

FIGURE 42.7  MUPP Item Response Surface for a Pairwise Preference Item Measuring Dominance and Responsibility


Consider, for example, an examinee with a correct and an incorrect response to two 3PLM items. Suppose the correctly answered item has an item discrimination parameter of a = 1.0, an item difficulty parameter of b = −1.0, and a guessing parameter of c = 0.1. Suppose the incorrectly answered item has item parameters of 1.0, 2.0, and 0.2, respectively. The conditional probability or “likelihood” of observing the response pattern (correct, incorrect), given a value of θ and the item parameters specified above, is simply the product of the individual item response probabilities given by Equation 1. The product rule comes from the fact that, in IRT, item responses are assumed to be independent after conditioning on θ; this is known as the local independence assumption. In nontechnical terms, local independence implies that the response probability for a given item is a function only of an examinee’s trait level and that item’s parameters; thus, the response for one item does not depend on how the examinee answers other items. Formally, the likelihood of a response pattern, u = (u_1, . . ., u_n), for examinee j, given a value of θ and a vector of 3PLM parameters for item i, β_i = (a_i, b_i, c_i), is given by

L(u \mid \theta_j, \beta_1, \ldots, \beta_n) = \prod_{i=1}^{n} P_i(\theta_j)^{u_i}\, Q_i(\theta_j)^{1 - u_i} , \quad (5)

where P_i is the probability of a correct response to the ith item and Q_i = 1 − P_i.

To illustrate the product rule written in Equation 5 above, we have multiplied the probability of the correct response to Item 1 and the probability of the incorrect response to Item 2, computed at trait levels on the interval [−3, −2.9, . . ., +3.0], and plotted the results in Figure 42.8. The resulting curve, which is called the “likelihood function,” is single-peaked with a maximum at θ = 0.7. The value of theta corresponding to the peak of the curve is called the maximum likelihood estimate (MLE) of theta and represents the value of the latent trait that makes the observed response pattern most likely. Note that this procedure for finding the MLE of theta is known as a grid search. It can be used to estimate trait scores in most situations, but more computationally efficient methods, such as Newton-Raphson iterations, are available and typically used. Readers interested in the details of these procedures should refer, for example, to Hambleton and Swaminathan (1985).
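The grid-search logic just described is easy to reproduce. The sketch below (ours, assuming only NumPy) evaluates Equation 5 for the two hypothetical items from the example over a grid of trait values and reports the value at which the likelihood peaks.

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# Two items from the text's example, as (a, b, c)
items = [(1.0, -1.0, 0.1),   # answered correctly (u = 1)
         (1.0,  2.0, 0.2)]   # answered incorrectly (u = 0)
responses = [1, 0]

theta_grid = np.arange(-3.0, 3.01, 0.1)
likelihood = np.ones_like(theta_grid)
for (a, b, c), u in zip(items, responses):
    p = irf_3pl(theta_grid, a, b, c)
    likelihood *= p**u * (1 - p)**(1 - u)   # product rule of Equation 5

theta_mle = theta_grid[np.argmax(likelihood)]   # theta at which the likelihood peaks
print(f"MLE of theta (grid search): {theta_mle:.1f}")
```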

FIGURE 42.8  Likelihood of Response Pattern for a Hypothetical Examinee (likelihood L(u|theta) plotted against theta)

With dominance models, a problem with maximum likelihood estimation is that trait scores cannot be estimated for examinees with all “correct” or all “incorrect” response patterns: the maximum of the likelihood function would occur at plus or minus infinity. Therefore, in practice, Bayes modal or expected a posteriori (EAP) estimation is used instead (e.g., Thissen & Wainer, 2001). The same is generally true for ideal point and forced-choice models. In either case, scoring is accomplished by computing the posterior likelihood of an observed response pattern, using previously estimated item parameters, and then the mode or mean of this posterior likelihood is found.

An important feature of IRT is that trait scores do not depend inherently on the specific subset of items that are administered. Specifically, examinee trait scores can be compared even if the examinees answered different subsets of items, provided that the item parameters are all on the same metric (which can be accomplished via concurrent calibration or linking; see Kolen & Brennan, 2014). This invariance property is critical to applications, such as computerized adaptive testing (CAT), where examinees receive individually tailored item sets to optimize measurement precision.
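As a minimal sketch of EAP scoring, the code below combines the response-pattern likelihood with a standard normal prior on a quadrature grid and returns the posterior mean and standard deviation. The item parameters are hypothetical, and operational programs use more refined quadrature schemes and priors.

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def eap_score(responses, item_params, grid=np.linspace(-4, 4, 81)):
    """Expected a posteriori (EAP) trait estimate with a standard normal prior,
    evaluated on a fixed quadrature grid."""
    prior = np.exp(-0.5 * grid**2)                  # unnormalized N(0, 1) density
    like = np.ones_like(grid)
    for (a, b, c), u in zip(item_params, responses):
        p = irf_3pl(grid, a, b, c)
        like *= p**u * (1 - p)**(1 - u)
    post = prior * like
    post /= post.sum()
    eap = np.sum(grid * post)                       # posterior mean
    se = np.sqrt(np.sum((grid - eap)**2 * post))    # posterior SD as a standard error
    return eap, se

# Works even for an all-correct pattern, where the MLE would be unbounded
items = [(1.2, -0.5, 0.2), (1.0, 0.0, 0.15), (1.4, 0.8, 0.2)]   # hypothetical (a, b, c)
print(eap_score([1, 1, 1], items))
```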

Information and the Precision of IRT Trait Scores

An important question for IRT applications is: How do the items administered to an examinee affect the precision of the θ estimate? The MLE procedure, described above, yields the single most likely trait score for a given response pattern and set of item parameters. A different pattern of responses or a different set of item parameters would change the shape of the likelihood function (its height and width) and possibly the precision of the trait estimate, because different response probabilities would be used in the computations. In general, it is best to present items having IRFs that are steep in the ability range where an examinee is located, because the resulting likelihood function will be higher and narrower, thus reducing trait score estimation error.

The steepness of an IRF over a particular range of theta influences how much information an item provides in that part of the trait continuum. Information is an IRT concept that is commonly used to judge item quality and suitability for particular examinees. The more information an item provides, the more it reduces trait score estimation error. To examine the quality of a test form (i.e., a set of items), one can sum the item information functions to obtain the test information function, which shows where the test provides the best measurement. An example of a test information function for a three-item 3PLM test is presented in Figure 42.9a. As can be seen in the figure, the test information function for this short measure peaks at θ = 0.0, and it is relatively high between θ = −1.0 and θ = 1.5. At the same time, the test provides almost no information at the extremes of the trait continuum. This translates into greater measurement precision in the central region of the trait continuum and high imprecision at the extremes, as illustrated by the standard error plot in Figure 42.9b. In IRT, the standard error is computed as the inverse square root of test information. The error at the extremes of the trait continuum could be reduced by adding items that provide information at high or low trait levels.
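As a rough illustration of these ideas, the sketch below computes item information numerically using the general result that, for a dichotomously scored item, information equals the squared slope of the IRF divided by P(θ)[1 − P(θ)]; it then sums the item information functions into a test information function and converts that to a conditional standard error. The item parameters are hypothetical.

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c, eps=1e-4):
    """I(theta) = P'(theta)^2 / [P(theta) * (1 - P(theta))] for a dichotomous item.
    The derivative is approximated numerically so the same code works for any IRF."""
    p = irf_3pl(theta, a, b, c)
    dp = (irf_3pl(theta + eps, a, b, c) - irf_3pl(theta - eps, a, b, c)) / (2 * eps)
    return dp**2 / (p * (1 - p))

theta = np.linspace(-3, 3, 61)
items = [(1.5, -1.0, 0.2), (1.5, 0.0, 0.2), (1.5, 1.0, 0.2)]   # hypothetical (a, b, c) triples

tif = sum(item_information(theta, a, b, c) for a, b, c in items)  # test information function
se = 1.0 / np.sqrt(tif)                                           # conditional standard error

best = np.argmax(tif)
print("Theta where the test measures best:", round(theta[best], 2))
print("Standard error at that theta:", round(se[best], 2))
```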

COMPUTERIZED ADAPTIVE TESTING (CAT)

Unlike traditional testing environments where one or more test forms are constructed in advance and items are administered to examinees in a prescribed order, CATs can be constructed on the fly so that each examinee receives a unique set of items that provides near-maximum information at his or her estimated trait score at every stage of an exam. At the start of a test, it is often assumed that an examinee has an average trait score on the construct being assessed. The first item is chosen to provide near-maximum information at that trait score. After answering the item, the examinee’s trait score is updated, and the next item is selected to provide near-maximum information at the new trait score, subject perhaps to content and item exposure constraints.


FIGURE 42.9  Test Information and Standard Error Functions for a Hypothetical Three-Item Test (panel a: test information function; panel b: standard error function; both plotted against theta)

This process continues until a predetermined number of items has been administered or until the standard error of the trait score falls below a preset level of acceptability. (These test termination criteria are known as fixed-length and variable-length stopping rules, respectively.) Adaptive testing in this fashion is psychometrically efficient, often yielding precision similar to nonadaptive tests having nearly twice as many items. In addition, CATs tend to provide better accuracy and precision at extreme trait levels than do nonadaptive tests, which improves the utility of CATs for decision making and diagnostic feedback.

To illustrate CAT in more detail, consider, for example, the item information equation for the 3PLM shown below:

I_i(\theta) = a_i^2 \left[\frac{1 - P_i(\theta)}{P_i(\theta)}\right]\left[\frac{(P_i(\theta) - c_i)^2}{(1 - c_i)^2}\right]. \quad (6)

Before an item is selected, Equation 6 can be used to compute the information provided by each available item at the examinee’s estimated trait score.


Selecting the item that provides the most information is optimal in a psychometric sense but leads to items with the largest a-parameters being overused across testing sessions, particularly in the early stages of exams when examinees have very similar scores. Methods to control item exposure vary in complexity. The Sympson-Hetter procedure (Hetter & Sympson, 1997) is a sophisticated item exposure control method that uses exposure parameters, derived from simulation research, to prevent overuse. A much simpler method is to identify, for example, the top five most informative items and select one randomly from that group. Importantly, any procedure that results in less discriminating items being administered reduces the efficiency of CAT somewhat, but the added security provided by controlling item exposure may justify the cost.

Figure 42.10 presents illustrative test information and standard error functions for simulated 20-item adaptive and nonadaptive 3PLM tests. The nonadaptive tests were constructed by random selection from a diverse pool of items, while the fixed-length adaptive tests were created by sequentially selecting items that provided near-maximum information at an examinee’s estimated trait score at every point during a test.
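A single item-selection step of the kind just described can be sketched as follows. The item pool, provisional trait estimate, and exposure-control rule (choosing at random from the five most informative remaining items) are illustrative assumptions, not a description of any operational CAT.

```python
import numpy as np

rng = np.random.default_rng(42)

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c):
    """3PLM item information following Equation 6.
    (Constant scaling factors would not change which item ranks as most informative.)"""
    p = irf_3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c)**2 / (1 - c)**2)

# Hypothetical 50-item pool: columns are a, b, c
pool = np.column_stack([rng.uniform(0.5, 2.0, 50),
                        rng.uniform(-2.5, 2.5, 50),
                        rng.uniform(0.1, 0.3, 50)])
administered = {3, 17}          # items already given this session
theta_hat = 0.4                 # current provisional trait estimate

info = np.array([info_3pl(theta_hat, a, b, c) for a, b, c in pool])
info[list(administered)] = -np.inf        # never re-administer an item
top5 = np.argsort(info)[-5:]              # five most informative remaining items
next_item = rng.choice(top5)              # "randomesque" exposure control
print("Administer item", int(next_item))
```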

FIGURE 42.10  Comparison of Test Information Functions (TIFs) and Standard Error Functions (SEs) for Simulated 20-Item Adaptive and Nonadaptive Tests (average TIFs and average SEs plotted against theta)


As can be seen in the figure, test information is highest in the central regions of the trait continuum, with adaptive tests yielding two to four times as much information as the nonadaptive tests. In addition, it can be seen that the corresponding standard error is markedly lower for adaptive tests, especially at extreme trait scores.

Examples of well-known 3PLM CATs used for personnel screening, selection, or classification include CAT-ASVAB (see Sands, Waters, & McBride, 1997) and the Procter & Gamble Reasoning Screen (see McCloy & Gibby, 2011). CAT-ASVAB is a cognitive ability test battery used for screening and classifying U.S. military applicants. CAT-ASVAB was developed as an alternative to the paper-and-pencil ASVAB to reduce the number of items administered and the time needed to evaluate applicants. Today, CAT-ASVAB consists of ten 11-item to 16-item unidimensional subtests, which all together take about 2.5 hours to complete. The Procter & Gamble (P&G) Reasoning Screen is a 15-item unproctored Internet-based CAT used to screen P&G job applicants worldwide. The Reasoning Screen is available in 20 languages and contains items measuring figural, numeric, and logical reasoning. Applicants passing the Reasoning Screen must subsequently complete the proctored nonadaptive Reasoning Test, which serves as an independent hurdle as well as a score verification tool in the hiring process.

Although the adaptive testing principles and examples above focused on the 3PLM, it is important to note that CATs are being developed based on a variety of models to measure both cognitive and noncognitive constructs. For example, the National Institutes of Health PROMIS assessment (Reeve et al., 2007) uses SGRM-based CATs to efficiently measure a wide variety of psychological and physical health indicators. The Tailored Adaptive Personality Assessment System (TAPAS) uses MUPP-based CATs to measure a collection of narrow personality factors for screening military job applicants (Stark et al., 2014), and the ETS WorkFORCE Assessment uses MUPP-based CATs to predict applicants’ job fit (Naemi et al., 2014). In noncognitive testing applications, CAT is especially useful because organizations typically want to measure many constructs in a short time. The trait scores for the various constructs may be used to form profiles or composites for evaluating the suitability of applicants for multiple job roles.

DETECTING ABERRANT RESPONSE PATTERNS

When validating and using tests for selection and licensure, it is important to screen examinee data for potential aberrant responding. This is especially true for unproctored and noncognitive tests, for which cheating and faking good are major concerns (National Research Council, 2015; Tippins et al., 2006). Unmotivated and random responding are also key concerns when pretesting new items, especially in research contexts where there are no clear incentives to answer carefully. For these reasons, many CTT and IRT methods for detecting aberrance have been developed (Drasgow, 1982; Karabatsos, 2003; Meade & Craig, 2012; Meijer & Sijtsma, 1995). One IRT index that has consistently been found to perform well is l_z (Drasgow, Levine, & McLaughlin, 1987), which represents the standardized log likelihood of a response pattern. For unidimensional dichotomous models (e.g., 3PLM), the log likelihood of an n-item response pattern can be written

l_0 = \sum_{i=1}^{n} \left\{ u_i \log P_i(u_i = 1 \mid \hat{\theta}) + (1 - u_i)\log\left[1 - P_i(u_i = 1 \mid \hat{\theta})\right] \right\},

where \hat{\theta} is an estimate of θ. The approximate expectation of this log likelihood is

E(l_0) \approx \sum_{i=1}^{n} \left\{ P_i(u_i = 1 \mid \hat{\theta})\log P_i(u_i = 1 \mid \hat{\theta}) + \left[1 - P_i(u_i = 1 \mid \hat{\theta})\right]\log\left[1 - P_i(u_i = 1 \mid \hat{\theta})\right] \right\}.

The approximate variance is

\mathrm{Var}(l_0) \approx \sum_{i=1}^{n} P_i(u_i = 1 \mid \hat{\theta})\left[1 - P_i(u_i = 1 \mid \hat{\theta})\right]\left\{ \log \frac{P_i(u_i = 1 \mid \hat{\theta})}{1 - P_i(u_i = 1 \mid \hat{\theta})} \right\}^2.

Finally, the approximately standardized index is

l_z = \frac{l_0 - E(l_0)}{\sqrt{\mathrm{Var}(l_0)}}. \quad (7)

Since then, l_z has been extended for use with multiple subtests, polytomous responses, and, most recently, forced-choice models (Drasgow, Levine, & McLaughlin, 1987, 1991; Lee, Stark, & Chernyshenko, 2014; Stark, Chernyshenko, & Drasgow, 2012). The l_z indices developed by Drasgow et al. focus on identifying persons who respond inconsistently with model predictions. Responding in a way that is incongruent with one’s true trait scores over the course of a long test leads to large negative l_z values. Thus, based on early research showing that the distribution of l_z is approximately standard normal for long tests (e.g., 80 items), critical values for a one-tailed z-test can be used to classify response patterns as normal or aberrant. For example, if one wants to screen response patterns with a 5% false-positive rate (i.e., 5% of normal response patterns will be misclassified as aberrant), the critical l_z for a lower one-tailed z-test would be −1.65. If a respondent’s observed l_z were less than the critical value, then the response pattern would be flagged as aberrant; otherwise, the pattern would be considered normal.

In addition to indices such as l_z that are generally sensitive to inconsistencies with model predictions, methods have been developed to detect specific forms of aberrance, such as patterned responding (AAA, BBB) and rapid responding (Chernyshenko, Stark, & Drasgow, 2012). With modern computer-based testing, response time has become particularly easy to track, and, in noncognitive testing, the number of items answered in less than two seconds can serve as an effective flag for careless responding. Indices have also been developed to detect item pool and test compromise as well as to verify the integrity of scores on tests administered in unproctored environments (e.g., Segall, 2001, 2002; Wang, Zheng, & Chang, 2014).

Regardless of which methods are used to identify aberrant responders and vet test scores, it is incumbent upon organizations to have policies describing how flagged examinees will be treated. Nonzero false-positive rates guarantee that some percentage of examinees will be inappropriately flagged. Therefore, to promote fairness and guard against potential litigation, retesting, rather than disqualification of flagged examinees, may be the more prudent course of action.
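A minimal sketch of the l_z computation for the dichotomous case is shown below. The item parameters and response pattern are simulated for illustration, and the −1.65 cutoff corresponds to the 5% false-positive rate discussed above.

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def lz_index(responses, item_params, theta_hat):
    """Standardized log-likelihood person-fit index l_z (Equation 7)."""
    u = np.asarray(responses, dtype=float)
    p = np.array([irf_3pl(theta_hat, a, b, c) for a, b, c in item_params])
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    e_l0 = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    var_l0 = np.sum(p * (1 - p) * np.log(p / (1 - p))**2)
    return (l0 - e_l0) / np.sqrt(var_l0)

rng = np.random.default_rng(7)
item_params = np.column_stack([rng.uniform(0.8, 2.0, 40),
                               rng.uniform(-2.0, 2.0, 40),
                               np.full(40, 0.2)])          # hypothetical 40-item test

# A pattern inconsistent with theta_hat (e.g., random responding) tends to yield a large negative l_z
responses = rng.integers(0, 2, 40)
lz = lz_index(responses, item_params, theta_hat=1.0)
print(round(lz, 2), "flag as aberrant" if lz < -1.65 else "treat as normal")
```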

SUMMARY AND CONCLUSION

IRT is a continuously expanding and improving technology for constructing, administering, scoring, and evaluating a variety of assessment tools. Unlike classical test theory statistics, IRT item parameters are invariant across subpopulations; person parameters do not depend on the specific set of items administered; and measurement precision can be readily evaluated as a function of trait level. These properties make IRT methods useful for computerized adaptive testing, for detecting measurement bias and assessing growth (or decline) in trait levels over time, and for model-based detection of aberrant responding. One limitation, of course, is that large samples are needed for some applications (e.g., 250 or more per group for measurement bias analyses), and minimum sample size recommendations tend to increase with model complexity.

In this chapter, we discussed just a few models and two IRT applications. However, as described in a 2015 National Research Council report entitled Measuring Human Capabilities, there are many more IRT models and methods for improving the quality of structured assessments used for organizational decision making.


There remains, however, a pressing need for new psychometric technology to support emerging forms of assessment, such as simulations, serious games, collaborative exercises, and constructed response tasks, which involve stochastic elements and interdependencies that most current psychometric models cannot account for (Chernyshenko & Stark, 2015; Stark, Martin, & Chernyshenko, 2015). We anticipate that future IRT research will attempt to address these challenges, and it will be interesting to see whether today’s prevailing models will play a central role in future assessment programs.

REFERENCES

Baker, F. B., & Kim, S. H. (2004). Item response theory: Parameter estimation techniques. New York, NY: Dekker. Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–472). Reading, MA: Addison-Wesley. Bock, D. R. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51. Böckenholt, U. (2004). Comparative judgments as an alternative to ratings: Identifying the scale origin. Psychological Methods, 9, 453–465. Brown, A., & Maydeu-Olivares, A. (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71, 460–502. Carter, N. T., & Dalal, D. K. (2010). An ideal point account of the JDI work satisfaction scale. Personality and Individual Differences, 49, 743–748. Cattell, R. B. (1944). Psychological measurement: Normative, ipsative, interactive. Psychological Review, 51, 292–303. Chernyshenko, O. S., & Stark, S. (2015). Mobile psychological assessment. In F. Drasgow (Ed.), Volume 2 of the NCME Book Series. Technology in testing: Measurement Issues (pp. 206 – 216). NJ: Wiley-Blackwell. Chernyshenko, O. S., Stark, S., Chan, K. Y., Drasgow, F., & Williams, B. A. (2001). Fitting item response theory models to two personality inventories: Issues and insights. Multivariate Behavioral Research, 36, 523–562. Chernyshenko, O. S., Stark, S., & Drasgow, F. (July 2012). Investigating effects of unmotivated responding on validities of multidimensional forced choice personality tests. Invited presentation at the 8th conference of the International Test Commission. Amsterdam, NE. Chernyshenko, O. S., Stark, S., Drasgow, F., & Roberts, B. W. (2007). Constructing personality scales under the assumptions of an ideal point response process: Toward increasing the flexibility of personality measures. Psychological Assessment, 19, 88–106. de la Torre, J., Ponsoda, V., Leenen, I., & Hontangas, P. (2012). Some extensions of the multiunidimensional pairwise preference model. Paper presented at the 26th annual meeting of the Society for Industrial and Organizational Psychology. Chicago, IL. de la Torre, J., Stark, S., & Chernyshenko, O. S. (2006). Markov chain Monte Carlo estimation of item parameters for the generalized graded unfolding model. Applied Psychological Measurement, 30, 1–17. Drasgow, F. (1982). Choice of test models for appropriateness measurement. Applied Psychological Measurement, 6, 297–308. Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59–79. Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for multidimensional test batteries. Applied Psychological Measurement, 15, 171–191. Drasgow, F., & Olson-Buchanan, J. B. (Eds.) (1999). Innovations in computerized assessment. Mahwah, NJ: Erlbaum. Drasgow, F., & Parsons, C. K. (1983). Application of unidimensional item response theory models to multidimensional data. Applied Psychological Measurement, 7, 189–199. Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates. Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff. Hattie, J. A. (1985).
Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9, 139–164. Hetter, R. D., & Sympson, J. B. (1997). Item exposure control in CAT-ASVAB. In W. A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141–144). Washington, DC: American Psychological Association. Hicks, L. E. (1970). Some properties of ipsative, normative, and forced-choice normative measures. Psychological Bulletin, 74, 167–184.


Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Applications to psychological measurement. Homewood, IL: Dow Jones Irwin. Joo, S-H., Lee, P., & Stark, S. (April 2016). Information functions of multidimensional forced-choice IRT models. Paper presented at the annual conference of the National Council on Measurement in Education. Washington, DC Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277–298. Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking. Methods and practices. New York, NY: Springer. Lee, P. (2016). Investigating parameter recovery and item information for triplet multidimensional forced choice measures: An application of the GGUM-RANK model. Doctoral dissertation. University of South Florida. Tampa, FL. Lee, P., Stark, S., & Chernyshenko, O. S. (2014). Detecting aberrant responding on unidimensional pairwise preference tests: An application of Lz based on the Zinnes Griggs ideal point IRT model. Applied Psychological Measurement, 38, 391–403. Lord, F. M. (1980). Applications of item response theory to practical testing problems. New York, NY: Routledge. Maydeu-Olivares, A., & McArdle, J. (Eds.) (2005). Contemporary psychometrics. A Festschrift to Roderick P. McDonald. Mahwah, NJ: Lawrence Erlbaum. McCloy, R., & Gibby, R. (2011). Computer adaptive testing. In N. T. Tippins & S. Adler (Eds.), Technology-enhanced assessment of talent (pp. 153–190). San Francisco, CA: Jossey-Bass. Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17, 437–455. Meijer, R. R., & Sijtsma, K. (1995). Detection of aberrant item score patterns: A review and new developments. Applied Measurement in Education, 8, 261–272. Mills, C. N., Potenza, M. T., Fremer, J. J., & Ward, W. C. (Eds.) (2002). Computer-based testing: Building the foundation for future assessments. Mahwah, NJ: Lawrence Erlbaum Associates. Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176. Muthén, L. K., & Muthén, B. O. (1998–2015). Mplus user’s guide (7th ed.). Los Angeles, CA: Muthén & Muthén. Naemi, B., Seyert, J. M., Robbins, S., & Kyllonen, P. (2014). Examining the WorkFORCE assessment for job fit and core capabilities of the FACETS engine (Research report ETS-RR-14–32). Princeton, NJ: ETS. National Research Council. (2015). Measuring human capabilities: An agenda for basic research on the assessment of individual and group performance potential for military accession. Committee on Measuring Human Capabilities: Performance Potential of Individuals and Collectives. Board on Behavioral, Cognitive, and Sensory Sciences; Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. Reeve, B., Hays, R. D., Bjorner, J., Cook, K., Crane, P. K., Teresi, J. A., Thissen, D., Revicki, D. A., Weiss, D. J., Hambleton, R. K., Liu, H., Gershon, R., Reise, S. P., Lai, J. S., Cella, D., & on behalf of the PROMIS cooperative group. (2007). Psychometric evaluation and calibration of health-related quality of life item banks: Plans for the Patient-Reported Outcome Measurement Information System (PROMIS). Medical Care, 45, 22–31. Roberts, J. S., Donoghue, J. R., & Laughlin, J. E.
(2000). A general item response theory model for unfolding unidimensional polytomous responses. Applied Psychological Measurement, 24, 3–32. Roberts, J. S., & Fang, H-R. (2006). GGUM2004: A Windows-based program to estimate parameters in the Generalized Graded Unfolding Model. Applied Psychological Measurement, 30, 64–65. Salgado, J. F., Anderson, N., & Tauriz, G. (2014). The validity of ipsative and quasi-ipsative forced-choice personality inventories for different occupational groups: A comprehensive meta-analysis. Journal of Occupational and Organizational Psychology, 88, 797–834. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, 17, 1–100. Sands, W. A., Waters, B. K., & McBride, J. R. (1997). Computerized adaptive testing: From inquiry to operation. Washington, DC: American Psychological Association. Segall, D. O. (2001). Detecting test compromise in high-stakes computerized adaptive testing: A verification testing approach. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Seattle, WA. Segall, D. O. (2002). An item response model for characterizing test compromise. Journal of Educational and Behavioral Statistics, 27, 163–179.


Seybert, J. (2013). A new item response theory model for estimating person ability and item parameters for multidimensional rank order responses. Doctoral dissertation. University of South Florida. Tampa, FL. SHL. (2015). Occupational Personality Questionnaire (OPQ32). Retrieved from https://online.shl.com/gb/ en-gb/products/opq32r Stark, S. (2002). A new IRT approach to test construction and scoring designed to reduce the effects of faking in personality assessment: The generalized graded unfolding model for multi-dimensional paired comparison responses. Doctoral dissertation. University of Illinois at Urbana-Champaign. Urbana-Champaign, IL. Stark, S., Chernyshenko, O. S., & Drasgow, F. (2005). An IRT approach to constructing and scoring pairwise preference items involving stimuli on different dimensions: An application to the problem of faking in personality assessment. Applied Psychological Measurement, 29, 184–201. Stark, S., Chernyshenko, O. S., & Drasgow, F. (July 2012). Development of a person-fit index for multidimensional pairwise preference tests. Invited presentation at the 8th conference of the International Test Commission. Amsterdam, NE. Stark, S., Chernyshenko, O. S., Drasgow, F., White, L. A., Heffner, T., Nye, C. D., & Farmer, W. L. (2014). From ABLE to TAPAS: A new generation of personality tests to support military selection and classification decisions. Military Psychology, 26, 153–164. Stark, S., Chernyshenko, O. S., Drasgow, F., & Williams, B. A. (2006). Examining assumptions about item responding in personality assessment: Should ideal point methods be considered for scale development and scoring. Journal of Applied Psychology, 91, 25–39. Stark, S., Martin, J., & Chernyshenko, O. S. (2015). Technology and testing: Developments in education, work, and healthcare. In F. Leong, F. Cheung, K. Geisinger, D. Bartram, & D. Iliescu (Eds.), ITC international handbook of testing and assessment. NY: Oxford University Press. Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52, 589–617. Tay, L., Drasgow, F., & Rounds, J., & Williams, B. A. (2009). Fitting measurement models to vocational interest data: Are dominance models ideal? Journal of Applied Psychology, 94, 1287–1304. Thissen, D. (1991). MULTILOG user’s guide: Multiple, categorical item analysis and test scoring using item response theory. Chicago: Scientific Software. Thissen, D., & Wainer, H. (2001). Test scoring. Mahwah, NJ: Lawrence Erlbaum Associates. Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., & Shepherd, W. (2006). Unproctored internet testing in employment settings. Personnel Psychology, 59, 189–225. Wang, W., de la Torre, J. D., Drasgow, F., Meade, T., & Louden, R. (2014). MCMC GGUM v1.2 user’s guide. Orlando, FL: University of Central Florida. Wang, C., Zheng, Y., & Chang, H. H. (2014). Does standard deviation matter? Using “standard deviation” to quantify security of multistage testing. Psychometrika, 79, 154–174. Zimowski, M., Muraki, E.,  & Bock, D. (2003). BILOG-MG. Lincolnwood, IL: Scientific Software International.


43

USING BIG DATA TO ENHANCE STAFFING

Vast Untapped Resources or Tempting Honeypot?

RICHARD N. LANDERS, ALEXIS A. FINK, AND ANDREW B. COLLMUS

The overall purpose of organizational staffing is to deliver fresh hires into organizations. Efforts to improve staffing have historically involved pursuing two primary goals: improving job applicant quality and improving the process used to quantify and make decisions about those applicants. Industrial-organizational (I-O) psychologists, based upon decades of research, have many specific processes they commonly employ to meet these goals. Despite this, a family of technologies commonly referred to as big data has begun to appear in staffing processes without much, if any, validation from I-O psychologists. Data scientists have claimed that such technologies have the potential to “disrupt” the bedrock staffing procedures on which much of modern I-O psychology has been built. The truth of this claim is difficult to determine for many reasons, but most glaringly because data scientists and I-O psychologists come from such different theoretical perspectives that it is often difficult to find common ground even in casual conversation.

As noted above, I-O psychologists rely upon a great deal of existing research to support the consideration of a wide range of individual differences as predictors in selection systems (i.e., KSAOs: knowledge, skills, abilities, and other characteristics) alongside methods to measure them (e.g., surveys and questionnaires, assessment centers, work sample tests). Using this toolkit, I-O psychologists can consistently improve hiring outcomes in terms of applicant reactions, task performance, and/or contextual performance for just about any organization. Importantly, the development of this approach was based upon a number of assumptions and theoretical perspectives that are not shared by everyone attempting to improve staffing. Specifically, I-O psychologists primarily practice within the broader assumptions of psychological science and the measurement guidelines commonly associated with it. Our science is one of theory and reflective constructs; that is, we assume certain persistent underlying human characteristics exist regardless of our measurement of them and that the data we solicit from job applicants are reflections of those characteristics. These assumptions are driven by psychological theory that was created, developed, and refined by psychological researchers based upon the scientific method over many decades, if not longer. For example, we have theory to suggest that there are persistent non-cognitive differences between people, which we call personality, and that these differences are associated with work-related outcomes, including job performance (Barrick & Mount, 1991). Thus, we might administer a personality measure to job applicants as part of a selection system in order to predict their future job performance.

Scientists in other fields of inquiry that are highly relevant to staffing do not necessarily share these same values.


The empirical branch of computer science, for example, is primarily concerned with the development and testing of computers and related technologies, including algorithms (Newell & Simon, 1976). From this perspective, the world of computer languages is much more “real” than that of psychological constructs; there is no unmeasurable, unknowable characteristic of a computer that must be assumed to exist, tested only by proxy and by inference. To many computer scientists, a psychological construct is itself inherently unknowable, and, taken to its logical conclusion, studying the unknowable is a waste of researcher time and effort. In contrast, the patterns within data potentially caused by such constructs are a well-defined problem. They are data, and patterns within data can be modelled. With sufficiently high-quality data, such models could be used to predict other data that do not yet exist. One never needs to worry about constructs; the patterns tell the story. Thus, the major objectives for computer scientists in this domain are to increase the quantity of data from which to create models and to improve the predictive value of modelling. This sort of thinking lies at the foundation of big data and, to a degree, at the foundation of the older and broader field of business intelligence/business analytics (Chen, Chiang, & Storey, 2012). From this perspective, data are not necessarily reflective of a larger problem to be solved; they are the problem to be solved.

Until recently, practitioners of business analytics and big data analytics have applied this perspective primarily as a means to increase the effectiveness of marketing. For example, to maximize conversion from website visitor and online advertisement-viewer to purchaser of products and services, marketers collect and interpret incredibly vast sources of behavioral data as they occur. Computer clicks and taps, the keywords and phrases used in search engines, specific webpages visited and the amount of time spent on them, the roads a smartphone has travelled down during a person’s commute, the advertisements and conversations that a smartphone has overheard throughout the day, and many other such sources of information may all be collected and tied to a particular digital identity. An algorithm, refined automatically from these vast data sets that are continually updated to maximize prediction, is used to automatically identify advertisement content that maximizes click-through rates and to display that content during online shopping, all in a fraction of a second. This algorithmic approach has become so effective that many consumers view the accuracy of such systems as emotionally disturbing (Ur, Leon, Cranor, Shay, & Wang, 2012), due in part to the significant number of perceived privacy violations (Cumbley & Church, 2013). Regardless of the ethics of these practices, the sheer amount of detailed information available about almost everyone with access to the Internet has grown exponentially.

As these data sources have grown in size and complexity, researchers have continued to expand the analytic toolkits used to make sense of and draw conclusions from them. Because both these data sources and analytic toolkits offer a great deal of potential for staffing, the purpose of this chapter is to explore how this potential might be realized and how researchers and practitioners are already realizing it. Perhaps more importantly, we also explore whether it should be realized.
Big data are not necessarily high-quality data, and I-O psychology already has many techniques to obtain, analyze, and apply high-quality small data. Research is not yet available demonstrating specific validity or utility advantages to big data staffing approaches above and beyond more well-established small data techniques, and ultimately, big data may be little more than a fad (Davenport, 2014) and therefore only a short-term distraction (Dunnette, 1966). Thus, to provide some guidance in this domain, this chapter begins with an exploration of the concept of big data, including the introduction of a framework of big data functions based upon current applications in staffing. Next, we explore each of the dimensions of that framework by presenting case studies drawn from the experiences of I-O psychologists working with big data in the area of staffing, each case study paired with literature-based exploration of related cautions and new horizons. Finally, we draw conclusions regarding cross-functional benefits and risks.

A FRAMEWORK OF BIG DATA FUNCTIONS IN STAFFING

As with most new technologies that attract significant interest from industry, there are many definitions of big data, although most share a common thread. In the Harvard Business Review, McAfee and Brynjolfsson (2012) describe the most common breakdown of big data, defining its three key features as volume, velocity, and variety. First, volume refers to the quantity of data analyzed, which is sometimes expressed in exabytes. An exabyte is 1,000 petabytes, a petabyte is 1,000 terabytes, and a terabyte is 1,000 gigabytes; a single exabyte of nothing but Microsoft Word documents would thus contain approximately 50 trillion of them. According to IBM (2015), 2.5 exabytes of data are created worldwide each day; tapping into this vast resource is part of what big data proponents seek to accomplish. Second, velocity refers to the speed of both data creation and analysis. In addition to the speed of data creation described above, big data analytic techniques allow real-time analysis of virtually any phenomenon of interest. For example, an internal, employee-directed social network site might be automatically monitored for emotional content using real-time text mining. In doing so, management could get an up-to-the-minute estimate of the emotional state of its employees. Third, variety refers to the many different forms big data might take. Although text data such as electronic communications and electronic records are the most common, metadata such as Internet history, radio-frequency identification (RFID) data such as physical location and the amount of time spent in various parts of an office, global positioning satellite (GPS) data, audio and video data, and other types of data are now often collected and analyzed together (Cumbley & Church, 2013).

Although this set of three characteristics is commonly found in big data definitions, additional dimensions are often added, and these dimensions, and therefore definitions of big data, vary by discipline (Hitzler & Janowicz, 2013). For example, value refers to the specific explanatory power of information to solve specific problems or challenges, veracity refers to the uncertainty surrounding collected information, and variability refers to the often inconsistent nature of collected data. Some authors have further defined big data as data that cannot be meaningfully processed or analyzed using conventional approaches (e.g., Dumbill, 2013), a category that includes all standard statistical software commonly applied in staffing, such as SPSS and SAS. From this perspective, big data is defined largely by the necessity of distributed processing, a technology in which tens, hundreds, or thousands of computers run in tandem as a cluster to achieve the speed and accuracy of data handling necessary to meet whatever demand exists. For example, during a sales event in 2015, online retailer Amazon.com sold 398 items per second (Garcia, 2015), each purchase requiring a significant amount of data to be accessed and updated at a rate currently impossible for a single personal computer to achieve.

Even among data scientists, those academics and practitioners most directly connected to big data, the definition of big data (and, for that matter, of data science) is currently contested (Provost & Fawcett, 2013). Given these disagreements, a precise and agreed-upon definition of big data may be less useful in the staffing context than a framework demonstrating how the various technologies typically involved in big data might be used to improve organizational functioning.
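To make the velocity example above concrete, the following minimal sketch (in Python, with an invented word list and invented posts standing in for a trained sentiment model and a real message stream) scores posts as they arrive and maintains a rolling estimate of emotional tone:

```python
# A minimal sketch of "velocity": scoring a stream of internal posts for
# emotional tone as they arrive. The lexicon and messages are toy stand-ins;
# a production system would use a trained sentiment model and a streaming platform.
from collections import deque

POSITIVE = {"great", "excited", "thanks", "love", "win"}
NEGATIVE = {"frustrated", "angry", "overworked", "quit", "broken"}

def score(post: str) -> int:
    words = post.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

window = deque(maxlen=500)  # rolling window of the most recent posts

def ingest(post: str) -> float:
    """Score one incoming post and return the up-to-the-minute average tone."""
    window.append(score(post))
    return sum(window) / len(window)

for msg in ["Excited about the new release, great work team!",
            "Frustrated that the build is broken again"]:
    print(round(ingest(msg), 2))
```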
Advantages of big data are proposed to be quite broad. For example, a report by the McKinsey Global Institute described five general organizational advantages of incorporating big data: (1) increased transparency and usability of data, (2) increased accuracy and detail of data, (3) increased specificity of data, (4) improved decision making based upon data, and (5) improved research and development pipelines within organizations (Manyika et al., 2011). From the perspective of an I-O psychologist or other staffing specialist, many of these supposed advantages likely seem quite familiar. The introduction of quantitative measurement to management formalized the data-gathering process, and data regarding human resources are now commonly collected and maintained in order to make the best decisions possible regarding organizational personnel. In this sense, organizations already collect transparent, usable, accurate, detailed, specific data about human resources that can be used to improve decision making and ultimately increase value. If big data is to provide new value to staffing, it must measurably improve one or more of these properties beyond what is currently possible with the existing I-O toolkit. Data must be more transparent, usable, accurate, detailed, and/or specific in such a way that an advantage is gained, despite the increased costs of specialized computer programming expertise and complex computing systems.

To illustrate where this value might be found, we have developed and present here a framework of big data functions based upon its four major application areas in staffing. These areas are not intended to be an exhaustive list of the ways in which big data might be used in staffing, nor do we mean to imply that these areas are orthogonal; big data applications typically apply multiple technologies simultaneously. Instead, we have created this framework to illustrate the most common ways that big data technologies are currently applied in order to highlight where industry staffing professionals believe the greatest added value might be achieved. Thus, we contend that if there is value to be found for staffing, it is likely to be in one of these areas. However, this does not preclude the creation of new application areas in the future, nor does it preclude additional uses beyond those we describe.
The first of these areas, big data gathering, refers to the use of big data technologies to collect data that were never before realistically collectable. One of the most relevant applications to staffing is the extraction of data from both external social media platforms, such as LinkedIn and Facebook, and internal social media platforms (Landers & Goldberg, 2014). Using social media, current employees and job applicants create lengthy and complex attitudinal and behavioral records that are often accessible to organizations. In the case of Facebook, Twitter, and other personal social media platforms, this behavioral record is focused largely on the personal life of the person in question but may still contain job-relevant information. For example, using big data analytic techniques on a sample of 86,220 Facebook users, researchers developed an algorithm that can predict self-report personality ratings from Facebook likes better than judgments by the users' friends can (Youyou, Kosinski, & Stillwell, 2014), which the researchers framed as "computers outpacing humans in personality judgment" (p. 1036). Alternative measurement methods like this potentially bring many advantages to the measurement of predictors of performance, such as data collection speed and reduced fakeability, in comparison to self-report surveys.
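As a rough illustration of how prediction from digital footprints of this kind can work in principle, the sketch below simulates a binary user-by-like matrix and fits a cross-validated regularized regression to predict a trait score; the data, trait labels, and model choices are invented for illustration and are not those of Youyou et al. (2014):

```python
# Hypothetical sketch: predicting a self-report trait score from binary "like"
# indicators. All data are simulated; a real application would use recorded likes
# and validated trait scores.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_users, n_likes = 1000, 200
likes = rng.binomial(1, 0.05, size=(n_users, n_likes))           # user x like matrix
true_weights = rng.normal(0, 1, n_likes) * (rng.random(n_likes) < 0.1)
extraversion = likes @ true_weights + rng.normal(0, 1, n_users)   # simulated trait score

model = Ridge(alpha=10.0)
predicted = cross_val_predict(model, likes, extraversion, cv=5)
print("cross-validated r =", round(np.corrcoef(predicted, extraversion)[0, 1], 2))
```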
The second of these areas, big data storage, refers to the use of big data technologies to maintain massive databases, far larger than any traditional staffing data set. Most relevant to staffing is the incredible quantity of data now captured by wearable electronic devices, such as electronic employee badges. Wearables have existed for some time, although primarily for the purpose of personal healthcare (Lymberis, 2003), with only a recent expansion into broad consumer and enterprise applications, such as smartwatches. Wearables may be considered one part of a broader concept called the Internet of Things, which refers to the increasing movement toward providing Internet access to a wide variety of objects that have never before had it (Xia, Yang, Wang, & Vinel, 2012), including household appliances. In the case of wearables at work, sometimes called enterprise wearables (Sacco, 2014), an employee badge might collect data on the specific location of the wearer throughout the workday, the doors accessed throughout the building, the other people with whom that person has been in close proximity, any sounds that resemble spoken words (from which the speaker of those words can often be identified), and other such information. Once the badge passes within proximity of a reader, strategically located throughout the office, this information is uploaded to a central location. Although such information is now often collected in corporate environments where electronic badges are used, it is unclear what value this information might hold for the organization. Perhaps more importantly, the liability of holding onto this information is also unknown. This liability may be legal or may be felt more as a violation of employees' sense of privacy, trust, and respect in their relationship with their employer.

The third area, big data analytics, refers to the wide variety of data analysis techniques that have been developed as a result of the complexity of big data. Perhaps the most prominent of these techniques are machine learning and data mining (Chen, Chiang, & Storey, 2012). In contrast to I-O psychology's "theory-first" deductive approach, data scientists approach data holistically and inductively, seeking ways to simplify the data and extract meaning; theory is the result of this process, not the cause. Where psychology relies on the deductive approach to minimize the degree to which conclusions are drawn from statistical artifacts, data science has developed statistical approaches to do this post hoc, generally based upon multivariate statistical approaches familiar to I-O psychologists. For example, in staffing, linear regression is often used to develop an equation predicting job performance from selection predictors. Used this way, regression works reasonably well with a relatively small set of predictors. In the case of big data, however, the number of potential predictors might increase to a few hundred or a few thousand. Because regression prioritizes explanatory power when adding predictors to a model, such an analysis would likely result in a high degree of capitalization on chance. To deal with this problem, data scientists might combine linear regression with a least absolute shrinkage and selection operator (LASSO) technique in order to maximize prediction while also maintaining parsimony (Tibshirani, 1996). With this technique, all possible combinations of predictors can be modelled simultaneously to determine the tradeoff between explanatory power and parsimony, allowing a data scientist to pick the regression model that best achieves a desirable balance. Models like these can also be developed automatically, programmatically, and iteratively, using a wide range of statistical techniques.
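A minimal sketch of this idea, using simulated applicant data and scikit-learn's cross-validated LASSO implementation, might look like the following; the variable names and data are hypothetical:

```python
# A minimal sketch of LASSO-penalized regression (Tibshirani, 1996) selecting a
# parsimonious subset from many candidate predictors; all data are simulated.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n_applicants, n_predictors = 500, 300
X = rng.normal(size=(n_applicants, n_predictors))
beta = np.zeros(n_predictors)
beta[:5] = [0.6, 0.5, 0.4, 0.3, 0.2]            # only five predictors truly matter
performance = X @ beta + rng.normal(size=n_applicants)

model = LassoCV(cv=5).fit(X, performance)        # penalty strength chosen by cross-validation
kept = np.flatnonzero(model.coef_)
print(f"{len(kept)} of {n_predictors} predictors retained:", kept[:10])
```

The penalty shrinks most coefficients to exactly zero, so the retained predictors constitute the parsimonious model described above; an ordinary regression on all 300 predictors would instead capitalize heavily on chance.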


Many, although certainly not all, big data analytic techniques are distinct but recognizable cousins of statistical approaches common in I-O psychology. One commonly discussed technique in data science is machine learning, which is often used to sort ambiguous data into categories. I-O psychologists are generally familiar with two statistical techniques that accomplish the same general goal: factor analysis and cluster analysis. In both of these approaches, patterns within data are used to develop a broader classification scheme that can be used later for other purposes. For example, the Big Five personality traits were originally developed in part by using factor analysis to sort personality judgments into categories based upon words found in English that can be used to describe people. Machine learning is often employed to do similar sorts of categorization, but with a much greater degree of flexibility and autonomy. For example, the Big Five traits might be identifiable by programming a computer to comb the Internet (Landers, Brusso, Cavanaugh, & Collmus, in press), identifying words that appear to be descriptors of people based upon their position in each sentence. Next, the computer could iteratively process every sentence it identified to determine which personality words tend to cluster together, aided by a database of synonyms for reference, developing a personality model as it went. As the computer continued to collect more data, it would incrementally refine this model to better represent the data it had already collected, correcting for chance variation increasingly over time based upon the size of the data set at the time. In this way, such a machine could develop the Big Five automatically and algorithmically using cutting-edge technologies, yet this approach has the same conceptual basis as what psychologists did in the 1930s (i.e., Allport & Odbert, 1936).
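A toy sketch of this lexical strategy, using a four-sentence stand-in for a web crawl and k-means clustering in place of a full machine learning pipeline, might look like this:

```python
# A toy illustration of the lexical strategy described above: cluster
# person-descriptive words by the contexts in which they co-occur. The corpus and
# vocabulary are tiny, invented stand-ins for a web-scale crawl.
import numpy as np
from sklearn.cluster import KMeans

corpus = [
    "she is outgoing talkative and sociable at every event",
    "a quiet reserved and shy colleague who avoids meetings",
    "he is organized diligent and careful with every report",
    "careless sloppy and disorganized work all quarter",
]
vocab = ["outgoing", "talkative", "sociable", "quiet", "reserved", "shy",
         "organized", "diligent", "careful", "careless", "sloppy", "disorganized"]

# Represent each descriptor by the sentences it appears in (a crude co-occurrence profile).
profiles = np.array([[word in doc.split() for doc in corpus] for word in vocab], dtype=float)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(profiles)
for k in sorted(set(labels)):
    print(k, [w for w, lab in zip(vocab, labels) if lab == k])
```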
The final area, big data visualization, refers to the use of interactive displays of data that allow viewers to parse the meaning of data in highly complex ways without any data science expertise. Data visualization was developed in part to help people make sense of fleeting data before their value disappears (Keim, Qu, & Ma, 2013). For example, in the time it might take for a data scientist to analyze data and develop a report to interpret its findings, the competitive advantage that might be gained for that organization could be lost. Additionally, in an environment where new data are created constantly and old data may become obsolete in a very short time, such a report may even provide faulty or harmful recommendations. Using data visualizations, key decision makers can explore summaries of data in real time, as those data change. Such a person could click and drag to explore organization-wide patterns to draw insights or "zoom in" to see differences between individual organizational units. In the context of enterprise wearables, a manager might be able to see the current locations of all employees in a real-time interactive map but also obtain real-time summaries of how many employees are at their desks, how many are at the water cooler, how many are in the restroom, and how many are on smoke breaks.

As demonstrated above, the possibilities of big data are far-reaching. However, reality often lags behind possibility. In the next four sections, we explore each of these four functions of big data (gathering, storage, analytics, and visualization) by presenting an anonymized case study describing how I-O psychologists working in staffing have utilized big data. After each case study, we consider those applications from the perspective of the available research in both the staffing and data science literatures to identify strengths, weaknesses, and future directions.

BIG DATA GATHERING

Case Study

A moderately sized, regional organization grew dramatically by acquisition over a period of five years, from 3,000 people into a geographically distributed, global organization of over 10,000. The original company had enjoyed a favorable reputation in its community as a good employer, and staffing processes had been fairly simple, based largely on employee referrals and a good relationship with the local university. Those close relationships meant that, in most cases, new applicants had been known to the company as interns, scholarship recipients, or secondhand through recommendations from their professors, and selecting "the best" among them had seemed quite straightforward, given the work samples available from internship and classwork performance, observed directly or vicariously.


The company's expansion had been based on product line complementarity, and none of the recruiting infrastructure of a strong employment brand and close university relationships was present in the newly acquired firms. In fact, in most cases, the existing goodwill that the legacy firms had enjoyed in their communities was damaged by the acquisition. Furthermore, as is common after acquisitions, there was a spike in attrition within most of the companies the organization acquired. Thus, the organization had to simultaneously address several challenges in its previously sleepy staffing function. It needed to understand who was leaving, why, and where they were going. It needed to understand its own employer brand and position in the landscape of employers, and it had to figure out how to recruit mid-career professionals for the first time.

A team of three talent analysts built a big data strategy to address these challenges, using multiple sources of data, including social media. In the first phase, they tackled the problem of attrition and talent flows. To do this, they applied natural language processing, a technique to extract meaning from text data, to exit interview notes and survey data, then applied machine learning to understand who was choosing to leave and what the key drivers of attrition were. They then collected a large volume of social media data, primarily via LinkedIn's tools, to identify where their former employees had gone. Their review also revealed that a handful of employees had left after the acquisition but later returned; these people were asked to provide interviews.

The second phase of their work was understanding their employment brand in the marketplace. The team analyzed social media ratings and comments regarding their company, the companies that had absorbed most of their exiting employees and were thus their biggest talent competitors, and the legacy company names from prior to the acquisition. This provided insight into what at least a sample of employees and former employees viewed as important in their employment relationship and how each of the companies studied fared in the eyes of employees. This gave the researchers an idea not only of their competitiveness in the marketplace but also of key assets they could highlight in their employment brand communications and key limitations they could work to address within the company. This information was especially helpful as they considered, for the first time, recruiting mid-career professionals. Here, the researchers reached out to recruiters from the acquired organizations for best practices and supplemented those practices with insights from the social media review.

From all of these efforts, the researchers learned that the company's generous leave policies, including unlimited vacation and periodic sabbaticals, were very highly valued by employees, especially emerging professionals.
However, employees, especially the mid-career and senior leaders who left, were frustrated by what they saw as very limited opportunities for influence and promotion in a company where interpersonal trust, built over many years of working closely together, was key to decision making. Based upon these findings, the company invested in highlighting its generous leave as a key employee benefit early in the recruiting process. The researchers also took their discovery about departing employees' frustrations to company leadership and influenced the organizational structure to visibly include a critical mass of leaders from outside the original, acquiring organization. Finally, the company invested in a specialized leadership recruiting team that extensively used professional social media to identify candidates with appropriate skills and experience. The researchers built a playbook that highlighted the organization's employee value proposition in contrast to those of key talent competitors and trained the leadership recruiters to subtly use that perspective in wooing candidates, highlighting areas where the company was comparatively attractive as an employer.

As they worked to improve their ability to identify, attract, hire, and retain these mid-career employees, they continually revisited their original analyses, periodically re-examining exit trends, talent transfer rates among the companies with the highest talent flows, and social media sites. They adjusted and enhanced their employer branding materials on their company pages on professional social media, as well as in college recruiting, in response to new information, and watched with pleasure as their employer ratings improved on social media sites. As their sophistication with social media grew, they also monitored visits to their company pages on professional social media and noted which changes to employment brand messages resulted in better candidate flows (Table 43.1).
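Before turning to the summary in Table 43.1, a minimal sketch of the text-mining step described above may be useful; the exit-interview snippets, labels, and model choice here are invented for illustration rather than drawn from the case:

```python
# A minimal sketch (with invented exit-interview snippets) of the text-mining step
# in the case study: vectorize free text and inspect which terms most distinguish
# voluntary leavers from stayers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

notes = [
    "no path to promotion and decisions made by a closed circle",
    "loved the unlimited vacation but saw limited influence",
    "relocation after the acquisition and a better title elsewhere",
    "happy with the team and the sabbatical policy",
    "manager supportive and flexible schedule works well",
    "enjoying the new product line and growth opportunities",
]
left = [1, 1, 1, 0, 0, 0]  # 1 = voluntary leaver, 0 = stayer

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(notes)
clf = LogisticRegression().fit(X, left)

terms = vec.get_feature_names_out()
weights = clf.coef_[0]
top = sorted(zip(weights, terms), reverse=True)[:5]
print("terms most associated with leaving:", [t for _, t in top])
```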


TABLE 43.1 Summary of "Big Data Gathering" Case Study

Staffing Application: The organization needed to gather information about the causes of employee turnover.

Limitations of Small Data: Exit interviews are time-consuming and resource intensive, relying upon thoughtful answers in a face-to-face setting, which does not promote frank honesty.

Advantages to Big Data: The harvesting of social media data, in combination with machine learning and natural language processing, allows organizations to develop insights about turnover based upon not-previously-accessible information. Employee in-flow and out-flow analysis based upon this data helps draw conclusions regarding motivation.

Cautions: The ubiquity of social media does not necessarily solve more fundamental sampling challenges. Furthermore, very large samples are required for predictive accuracy.

Conclusions, Cautions, and New Horizons

As shown in this case study, big data gathering techniques can be used to collect multiple dissimilar types of information, such as text extracted from interviews and social media streams, to produce a single model from which insights can be drawn and predictions made. Such data collection is particularly useful in this context for two reasons. First, the collection of unstructured data from social media enables follow-up with ex-employees whose opinions would normally be inaccessible to the organization. Second, because there is relatively little theoretical guidance on which specific human resources policy changes might be perceived as problematic after a merger, this approach enables high-quality, data-driven decision making. The results from this approach will be highly organization-specific, but a highly organization-specific solution is precisely what was needed to solve this problem.

Importantly, big data techniques do not avoid the traditional challenges of sampling. As Harford (2014) notes, it is seductive to think of big data as "N = All," yet this is a risky assumption. Landers and Behrend (2015) describe the considerations associated with using convenient sources of data like these, big or small. In particular, relationships of interest must not covary with membership in the convenient sample, or results from that sample will be biased. In this case, it would be important to ensure that the reasons shared on social media were common among all employees who left the organization and not unique to those complaining on social media. Here, the organization saw improvements in its staffing function, but the benefits might have been even greater with a better source of information, perhaps even one from small data, had such data been otherwise attainable.

For big data gathering of this type to be effective, the data source must also be quite large. Thus, the organization also benefited from its own size, which enabled a significant amount of social media data to be collected. In an organization with a low absolute turnover rate, big data of this type may be less useful because fewer data are likely to be available. Much as with I-O psychology's mainstream selection techniques, small samples and small employee populations add a great deal of noise to available data, decreasing the evident value of many staffing practices (Sackett & Arvey, 1993). Small organizations may find greater value in gathering big data from public but highly relevant sources, such as those that can be geographically targeted. However, this introduces generalization challenges. Specifically, many of the scaling challenges associated with synthetic validation apply similarly to big data. Synthetic validation refers to validity evidence gathered by logical inference to draw conclusions about particular jobs based upon broader, non-organization-specific validation efforts when a traditional concurrent or predictive validation study is not feasible due to either small sample size or lack of criterion data (Scherbaum, 2005). Similarly, big data of a desirable type may not be available from current employees. In such cases, staffing specialists will need to determine how dissimilar the data can be and still provide useful information. In this case study, the organization decided that whatever information was posted on social media by current employees of competitors and its own ex-employees was trustworthy. The only statistical test that could confirm this assumption would require the very population data whose absence necessitates the assumption, so this will always be an assumption for practitioners to make, and it is one that should be made cautiously.


The approach taken here also highlights another risk of big data. When researchers assume that data created in the past must contain all the answers needed in the future, the biases of those data become part of the conclusions drawn. Specifically, big data are typically previously collected data, and their availability may discourage researchers from considering creative, alternative solutions that are not represented in them. In this case study, the talent analyst team started with the assumption that exit interviews and social media would highlight the most efficacious solutions for the organization. Just as with traditional research strategies, the design of the study that created the data set drives the conclusions that can be drawn from it. The original data collection decisions that created big data, such as the social media data described here, are rarely under the researcher's control, which introduces a degree of risk. Given this, we recommend that researchers considering big data approaches to their talent problems also carefully consider what creative solutions not relying on existing data might be employed. Ideally, a combination of both forward- and backward-looking data should be used as the basis for decision making.

BIG DATA STORAGE

Case Study

At a large organization, the staffing group was having a problem making good hires for sophisticated manufacturing technician roles that required specific manual skills and high degrees of both teamwork and coordination. About 30% of new hires failed their 90-day evaluation period and were terminated before completing it. Although the number of employees in these roles was not large, errors were costly, and it was beneficial to the organization to limit the risk of poor performance even during this 90-day trial period. To increase the proportion of successful hires, the staffing group decided to capture actual performance of job tasks and to test this performance in a simulation that would become a key part of the selection process. To do this, the organization implemented wearables, which were intended to collect a massive volume of information about performance.

The project began by identifying a core group of successful employees. These employees volunteered to spend 40 total hours over three months working with the project team in order to build a realistic simulation of the essential functions of the job. The volunteers wore a wrist-mounted device on their dominant hand that measured specific locations, proximity to equipment, interactions with that equipment, and interactions with team members. The volunteers also wore a head-mounted, eyeglass-style device that tracked eye movements and thereby measured attention to specific pieces of information. The level of detail enabled by the wearable devices vastly increased the number of available variables for measurement and prediction. Both devices were lightweight and judged to be non-intrusive. Personal biometrics, such as stress responses, were not measured, out of a concern that employees and job candidates might perceive such measurement as a violation of their privacy.

Once the simulation was built and the quality of the measurement system had been well established, additional content-related validity evidence and concurrent criterion-related validity evidence were gathered by asking additional employee volunteers to spend one hour participating in the simulation while wearing both the wrist-mounted and head-mounted devices. These employees were then asked how accurate and relevant the experiences in the simulation were, and data collected from their wearable devices were compared to metrics of actual on-the-job performance. The predictive model developed during the first phase was refined based on this larger set of results. The first group of candidates to complete the simulation was a test group, used as part of a predictive validation study. For this group, the simulation was used as a realistic job preview, but the results were not shared with interviewers and thus were not considered in the final hiring decisions. This way, data from applicants could also be used to refine the predictive algorithm. After a few final adjustments were made to the predictive model, it was incorporated into the hiring process.


TABLE 43.2 Summary of "Big Data Storage" Case Study

Staffing Application: The organization wanted to understand job performance at a high level of detail to better predict those behaviors.

Limitations of Small Data: The sheer quantity of tiny, difficult-to-observe pieces of information makes it difficult to know a priori which of them are actually relevant to job performance and in what combination. Even if specific information could be chosen, problems with rater training and rater accuracy are significant.

Advantages to Big Data: The ability to store a massive amount of data enables a model to be built based upon that massive amount of data. The specific challenges associated with identifying what is relevant are reduced because, to a machine learning algorithm, all of the data can be considered simultaneously.

Cautions: Privacy is a concern when storing massive data because many (perhaps most) people are not aware of how much data is really collected. Organizations may incur heightened legal risk if opposing counsel subpoenas those data and mines for chance relationships. Data security is also a major concern and requires significant technical expertise.

During the first year of implementation of the wearable-enhanced simulation, the failure rate during the evaluation period was reduced by 60%; that is, it went from 30% of new hires to 12% of new hires. Furthermore, the cost reduction associated with reducing errors by new hires paid for the program investment within the first 10 months of implementation (Table 43.2).

Conclusions, Cautions, and New Horizons

As illustrated in this case study, big data techniques can be used effectively as additions to selection and training techniques already well known in staffing. In this case, big data storage enabled the collection of a wide variety of data types at high velocity within a simulation, itself a method already commonly used for both selection and training when high-fidelity representation of job tasks is a priority (Boyce, Corbet, & Adler, 2013). Importantly, the addition of big data does not diminish the importance of a traditional and comprehensive validation process. Here, content-related validity evidence was collected from an initial pool of subject matter experts and later from a broader employee sample. Criterion-related validity evidence was also collected, first in a concurrent design and later in a predictive design, as commonly recommended by selection experts (Society for Industrial and Organizational Psychology [SIOP], 2003). The inclusion of wearables does not change this; it only adds greater breadth and depth to the types of data collected.

Although the wearables used in this case study increased both the breadth and depth of data collected from those participating in the simulation, such data are not necessarily useful. If data relevant to the problem to be solved are never collected and stored, no degree of analytic complexity will be able to extract useful information from them. Thus, it is important to consider precisely what kind of data is being stored by the devices creating those data. In this case, the wrist-mounted devices worn by participants primarily captured distances, calculated from the locations of other wrist-mounted devices and stationary objects broadcasting their location. If distances were not relevant to job success, then the distance data stored by the wearables would be effectively useless, despite the vast size and complexity of those data. To prevent the collection of low-value data, we therefore recommend carefully linking existing theory and research to each particular problem to be solved. With big data, size alone is insufficient.
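As an illustration of how low-level such stored data can be, the following hypothetical sketch derives proximity "interaction" events from timestamped badge positions; the positions, identifiers, and threshold are invented:

```python
# A hypothetical sketch of the kind of low-level data wrist-mounted devices might
# store: timestamped positions, from which pairwise proximity events are derived.
import numpy as np

# (timestamp, worker_id, x, y) position pings from badges; all values are invented
pings = np.array([
    [0, 1, 2.0, 3.0], [0, 2, 2.5, 3.2], [0, 3, 9.0, 1.0],
    [1, 1, 2.1, 3.1], [1, 2, 6.0, 7.0], [1, 3, 9.1, 1.1],
])

THRESHOLD = 1.0  # metres; closer than this counts as an interaction

for t in np.unique(pings[:, 0]):
    frame = pings[pings[:, 0] == t]
    for i in range(len(frame)):
        for j in range(i + 1, len(frame)):
            dist = np.linalg.norm(frame[i, 2:] - frame[j, 2:])
            if dist < THRESHOLD:
                print(f"t={int(t)}: workers {int(frame[i, 1])} and {int(frame[j, 1])} within {dist:.2f} m")
```

Whether such derived interaction counts relate to job success is an empirical question; if they do not, storing the underlying pings adds cost and risk without value.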


Inspired by this case, we identified three other major cautions related to the long-term storage of vast quantities of data. First, privacy is a major concern. Existing research in selection already notes the impact of perceived privacy violations on applicant reactions (Bauer et al., 2006), and such violations are much easier to commit when a firm's big data philosophy involves the collection of as much and as varied data as possible. Importantly, perceptions of privacy violation and actual privacy violations are distinct. Applicants may perceive that their privacy has been violated when it in fact has not, and vice versa. Big data that are collected surreptitiously will not influence applicant perceptions until the collection effort becomes known; however, such a policy creates the potential for a highly publicized public outcry when it is discovered (e.g., Hackett, 2015). Even in the relatively low-risk case study described here, in which big data were only collected on job incumbents and used to generalize to applicants, staffing specialists were concerned that the wearables might collect information that their employees would see as "off-limits." Such potential privacy violations should be carefully considered whenever an organization plans to create big data, and the targets of planned big data gathering efforts should be consulted before any databases are actually created.

Second, there may be a degree of legal risk associated with the collection and maintenance of vast quantities of big data. For example, if the staffing specialists in this case study had provided wearables to all incumbent employees, rather than just targeted individuals during the simulation development process, a vast database containing all movements of all employees over an indeterminate amount of time would have been created. In certain types of legal challenges, organizations might be required by subpoena to provide their big data to opposing counsel, as is common in adverse impact cases (Guion, 2011). Because big data are so complex, there are many ways to analyze them without generally agreed-upon standards, making competing interpretations likely (Bollier, 2010). For this reason, we recommend organizations collect only those big data that are needed for specific purposes, and retain them only as long as necessary for those purposes, echoing older recommendations regarding small data (Binning & Barrett, 1989).

Third, precautions should be taken to ensure that big data are stored securely. Small data sets are generally easy to anonymize (Ghinita, Karras, Kalnis, & Mamoulis, 2007), limiting the damage that can be done if those data sets are accessed by unauthorized personnel. Even in cases where data are somewhat more complex, such as personnel records, there are many well-established security practices to keep those data safe. In contrast, the scope of big data means that information may be stored across many systems, many user accounts, many physical locations, and potentially many organizations. Each of these is a potential breach point and must be treated with the same care as any single data source, while also meeting the requirements of the various legal systems within which those data exist. This is less of a concern for large organizations that are already accustomed to maintaining large, secure data warehouses; for them, the storage of big data requires only an expansion of existing resources. In smaller organizations without existing standards-compliant secure data storage, a great deal of caution must be exercised to ensure that security standards are met as data storage capacity is increased to handle these new requirements.
For such organizations, we instead recommend cloud-based solutions such that all potentially sensitive big data are stored and secured by organizations specializing in data warehousing and data security. Importantly, such a strategy is still not risk-free. Whereas cloud storage is likely to have superior countermeasures and protection, it is also a much more tempting target to hackers than a lone organization’s databases.
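One modest safeguard consistent with these recommendations, sketched below under the assumption of a Python-based pipeline, is to replace employee identifiers with keyed pseudonyms before records enter long-term storage; the identifiers, key, and record structure are invented:

```python
# A minimal sketch of pseudonymizing employee identifiers before storage. The key
# would be held separately from the data warehouse; identifiers here are invented.
import hmac
import hashlib

SECRET_KEY = b"rotate-and-store-this-outside-the-warehouse"

def pseudonymize(employee_id: str) -> str:
    digest = hmac.new(SECRET_KEY, employee_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

record = {"employee_id": "E10443", "badge_zone": "lab-2", "minutes": 47}
stored = {**record, "employee_id": pseudonymize(record["employee_id"])}
print(stored)
```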

BIG DATA ANALYTICS

Case Study

A global employer became concerned that its keyword-search-and-filter-based process for identifying job candidates within its Applicant Tracking System (ATS) was missing successful candidates. The company received hundreds of thousands of applicants per year across several job titles and ultimately hired more than 1,000 employees each year. These parameters led it to believe that artificial intelligence could add both efficiency and accuracy to its candidate identification process. In this case, the type of artificial intelligence targeted was machine learning, a process by which algorithms are developed iteratively and automatically to produce a predictive model (Kotsiantis, Zaharakis, & Pintelas, 2007).


TABLE 43.3 Summary of "Big Data Analytics" Case Study

Staffing Application: The organization wanted to improve its recruitment pipeline to identify and target higher-quality applicants at a faster rate.

Limitations of Small Data: Minor indicators of success often go unnoticed by recruiters, who are also subject to a variety of personal biases that may influence their judgments. Recruiters also cannot respond to shifts in the labor market without interpreting a significant amount of business intelligence.

Advantages to Big Data: Algorithms can identify and make judgments about candidates automatically without intervention, responding to labor market shifts as they occur. Algorithms can also respond to internal personnel records as they change, keeping the predictive model as current as possible.

Cautions: Although these models are powerful, it is important to maintain existing, well-supported I-O processes. Big data recommendations should be validated and treated as a distinct hurdle in the selection process. Feeding data to an algorithm will result in predictions based upon those data, so care is needed when considering what sort of data to feed it.

Given the high direct and replacement costs of attrition at this employer, the company considered both hiring rates and retention rates when identifying five particular job titles for a pilot project. Within each of these job titles, a group of employees was identified as "successful" based upon two characteristics: (1) a tenure of at least two years and (2) current high job performance records. The original job applications were used to train a machine learning model tasked with identifying this group. Inputs for the model included both resume data and process data, such as the channel by which the person applied (e.g., as an employee referral, a participant in a job fair, a student at a target school). The model was then refined by providing data on candidates who were hired but not in the successful group: candidates with poor performance and those who left voluntarily. This helped develop a set of markers for candidates at risk of being false positives. Separate models were built for each of the target positions.

Given the volume of candidates present in the ATS, it was assumed that, in addition to the false positives identified in the step above, the ATS contained a number of false negatives; that is, the previous approach was likely overlooking good candidates who were already in the applicant pool. To investigate this, the machine learning algorithms already developed were next applied to candidates who had not been hired but remained in the ATS. The machine learning approach was successful: the algorithms were able to identify additional candidates who were likely to perform well but who had been overlooked initially. These applicants were then hired and did generally perform well. This information was then used to further improve the algorithms.

The organization chose to use the algorithms as a complement to its existing, recruiter-driven process rather than rely exclusively on the machine learning approach. This was primarily to ensure that the organization remained nimble as the industry and competitors evolved; the organization was concerned that relying exclusively on a backwards-looking approach would cause it to miss market shifts. Thus, the machine learning algorithm was applied as a final step in preparing candidate slates, rather than as the first one (Table 43.3).
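A simplified sketch of this kind of pilot, using simulated application features and a generic classifier rather than the organization's actual model, might look like the following:

```python
# A simplified sketch of the pilot described above: train a model on features from
# past applications to flag likely "successful" hires, then score candidates still
# sitting in the ATS. Features and outcomes are simulated placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 2000
X = np.column_stack([
    rng.integers(0, 2, n),        # employee referral (1/0)
    rng.integers(0, 2, n),        # target-school graduate (1/0)
    rng.normal(5, 2, n),          # years of relevant experience
])
p = 1 / (1 + np.exp(-(0.8 * X[:, 0] + 0.5 * X[:, 1] + 0.15 * X[:, 2] - 1.5)))
successful = rng.binomial(1, p)   # 1 = tenured two years with high performance

X_train, X_test, y_train, y_test = train_test_split(X, successful, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("holdout accuracy:", round(model.score(X_test, y_test), 2))

# Score not-yet-hired candidates remaining in the ATS (also simulated)
ats_pool = np.column_stack([rng.integers(0, 2, 5), rng.integers(0, 2, 5), rng.normal(5, 2, 5)])
print("probability of success:", np.round(model.predict_proba(ats_pool)[:, 1], 2))
```

In the case described above, such scores were treated as one hurdle among several rather than as the sole basis for decisions.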

Conclusions, Cautions, and New Horizons

As illustrated in this case study, big data analytics can be used to improve the predictive power of existing employee selection processes. Big data approaches do not need to replace existing practices and can be used as a supplemental selection tool. What remains unclear are the consequences of this improved prediction.


I-O psychologists go to great lengths to ensure a high degree of construct validity for the measures they choose (Binning & Barrett, 1989). This is done, first and foremost, to ensure that prediction of job performance is based upon a well-defined characteristic of each applicant. If a conscientiousness measure is used, we must be confident that the measure is in fact one of conscientiousness. This value is reflected in all commonly accepted measurement guidelines (e.g., American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing, 2014). Practically speaking, this is also in part to reduce the risk of loss in litigation; in the event that a selection system is legally challenged, a clear record of validation efforts is necessary to defend it (Guion, 2011).

The consequences of ignoring the validation process and instead relying entirely upon a machine learning algorithm, the internal details of which are often unknown to its users, may be significant. If any variable contained within the data is correlated with membership in a protected class, that variable will be included in any resulting algorithms and result in biased selection. For example, if information found within a "Personal Interests" section of a resume provides useful information in the prediction of job performance, but the presence of such a section is by chance correlated with sex, a sex bias will be introduced into the resulting algorithm. Selecting on anything highly correlated with membership in any protected class creates significant legal risk in the United States (Hough, Oswald, & Ployhart, 2001), which leads us to conclude that machine learning algorithms cannot be used indiscriminately in selection systems. To minimize potential problems, we recommend that organizations use machine learning algorithms in hiring only as a distinct hurdle, as described in the case study. This way, the recommendations from the algorithm can be validated independently, as is recommended in hurdle systems (Mendoza, Bard, Mumford, & Ang, 2004). If a problematic bias is discovered, the algorithm can be modified and the effects observed directly.

This problem is reminiscent of the days of so-called dust-bowl empiricism in I-O psychology, an era when any characteristic of a person that improved prediction of job performance was considered a reasonable hiring tool (Bryan & Vinchur, 2013). Many of the problems associated with that approach have reappeared here. In particular, because big data invites the inclusion of any and all even vaguely relevant data sources to improve its algorithms, the job relevance of included data may be quite low. In the case above, process data were restricted to sources that the staffing team believed likely to aid in prediction, such as source of referral. However, a much broader array of process data could be collected and included in the algorithm, including the amount of time spent on individual web pages in the application process, the font size used on the resume, or any other such discrete piece of information provided by the job candidate. Anything given as input to the machine learning process to improve its algorithm's prediction may be used. Although such information may in fact improve prediction, the lack of job-relatedness may be both legally and ethically problematic.
To avoid this, we recommend only providing input to the machine learning process that is theoretically consistent with the prediction of job performance. In the case above, referral source was included, which has a supporting research literature (Zottoli & Wanous, 2000). Specific times spent on application pages were not. Machine learning is closely related to another concept called data mining, which brings somewhat different challenges. In contrast to the traditional descriptive and inferential statistical approaches commonly used in staffing, data mining is a more flexible, computationally driven approach to understanding data (Hand, 1999). In a data mining approach, algorithms are developed by a researcher to identify patterns in data and build predictive models; automation might be used but is not necessary (Olson & Delen, 2008). Machine learning identifies such patterns and builds upon them automatically; in short, the researcher creates the intelligence, and the intelligence creates the algorithm. Data mining brings many of the same advantages and disadvantages of machine learning described above; however, the more hands-on role of the researcher potentially mitigates some of the disadvantages. Staffing specialists with knowledge of both data mining techniques and I-O psychology practices may be able to blend the best of both approaches, although this has not yet been demonstrated in the research literature. Most papers in this area to date have been written by data mining researchers (e.g., Chien & Chen, 2008; Cho & Ngai, 2003).
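One simple check consistent with this recommendation is to compare algorithmic selection rates across groups before deployment; the sketch below applies the conventional four-fifths rule to simulated scores and group labels:

```python
# A minimal check of the bias concern raised above: compare algorithmic selection
# rates across two hypothetical applicant groups using the four-fifths rule.
import numpy as np

rng = np.random.default_rng(3)
scores = rng.normal(0, 1, 1000)                       # algorithm scores (simulated)
group = rng.choice(["A", "B"], size=1000)             # protected-class indicator (simulated)
scores[group == "B"] -= 0.3                           # simulate a correlated artifact

selected = scores > np.quantile(scores, 0.8)          # top 20% advance to the next hurdle
rate_a = selected[group == "A"].mean()
rate_b = selected[group == "B"].mean()
ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
print(f"selection rates: A={rate_a:.2f}, B={rate_b:.2f}, impact ratio={ratio:.2f}")
# An impact ratio below 0.80 is a conventional red flag warranting closer review.
```

A check of this kind does not substitute for a full validation effort, but it can flag problems with an algorithm's recommendations before they affect hiring decisions.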


Big data analytic techniques are evolving at a rapid rate. The community tends to be practice-oriented, so new research is not always published in traditional outlets. Additionally, as is common in many fields related to and including computer science, the primary outlet for new research by those developing these techniques tends to be academic conferences. As a result, staffing specialists who are more familiar with traditional statistical approaches are likely to have difficulty both accessing and judging the quality of new research in big data analytics. Although some efforts related to big data research in staffing have begun to appear, the literature is quite sparse in comparison to the data science literature more broadly. As a result, for those seeking to implement big data analytics, we currently recommend seeking out and collaborating closely with professional data scientists who specialize in this domain, although this may change over the next few years as resources more accessible to staffing specialists are developed.

BIG DATA PRESENTATION AND VISUALIZATION

Case Study

A complex global organization with hundreds of standard job titles and dozens of major locations in multiple countries wanted to improve its overall staffing processes, including both recruitment and selection. Due to the complexity and volume of its data, the organization turned to data visualization in two projects to help identify important patterns and to enable dynamic exploration of the data by organizational stakeholders without significant statistical or analytics expertise. In taking this approach, the researchers hoped to empower decision makers to act on data without the traditional complexities of statistical reporting.

Their first project was intended to improve staffing processes. In this case, the organization built a visualization that displayed key process steps, recruiting channels, job titles nested within job families, geographies, levels, and recruiter caseload for each job requisition. The initial data display showed the global average and standard deviation of time for each step in the recruiting process. Users could then click on each process step to drill down to any combination of variables of interest. This enabled users to quickly identify outliers, as well as best- and worst-in-class cases, within each class and for each set of variables being targeted. The organization was able to explore thousands of combinations and visually identify three process steps that introduced the greatest variability. The best-in-class examples were then used as prototypes to build new standard processes.

The second project was intended to better understand the current workforce and available labor markets in order to build new recruiting strategies. For this visualization, a map view of the organization was created showing unit populations and recruiting trends within each population. After consideration of the most challenging areas revealed by this visualization, additional data on recruiting times and barriers were added to better understand which strategies would be most effective in these challenging areas. Next, external labor market data, drawn from census data and other sources, were added to enable the organization to identify which positions could be best served with local searches and which should be bundled together and addressed with a multisite, national, or global search. This approach improved efficiency in search times and in costs related to relocation and retention. Finally, the organization analyzed the efficiency of each recruiting channel and strategy at the local and national levels in order to identify optimal criteria for each recruiting strategy. Specifically, the organization was able to explore which strategies best served each combination of recruiting circumstances.

In doing so, the organization built a recruiting strategy around insights gleaned from visualized data. This increased the degree of data-driven decision making in the organization, because before this point, the personal insights and creativity of executives and recruiters had typically driven recruiting strategy. Little attention had generally been paid to the key roles and groups of roles that were particularly hard to fill, because the difficulty of filling these roles only became obvious with the visualization. Based upon conclusions drawn from the visualization, the organization also established a satellite team near a particular university to capitalize on the flow of candidates from that school in that specific area.
They furthermore segmented the recruiter organization: part of the team focused on efficiency and transactions in areas with a highly liquid talent market, and the remainder focused on proactive recruitment of passive candidates in areas that were more difficult to fill (Table 43.4).


TABLE 43.4 Summary of "Big Data Presentation and Visualization" Case Study

Staffing Application: The organization had such a large quantity of data that it was difficult to understand all aspects of the recruitment and selection pipeline simultaneously.

Limitations of Small Data: Traditional visualization and presentation of data involves taking a snapshot of current data relationships. These visualizations may become outdated quickly. Creating such visualizations is also usually the task of a data analyst, which adds a step between the collection of data and action based upon those data.

Advantages to Big Data: Big data visualization techniques enable data to be visualized live as changes occur. Instead of considering a snapshot of data now, a stream of data is considered as it is created.

Cautions: Many of the same downsides to small data visualization still exist with big data visualization. A great deal of power is provided to the visualization designer to dictate what viewers see and consider when making decisions. Unique to big data is the sheer quantity and variety of data, which exacerbates this problem. High quality design is critical.

Conclusions, Cautions, and New Horizons

As demonstrated in this case study, big data visualizations can serve as powerful analytic tools (Frankel & Reid, 2008). This is in stark contrast to the use of visualizations as a supplement to statistical analyses, where they are unfortunately often an afterthought (Gelman, Pasarica, & Dodhia, 2002). Visualizations in both small and big data contexts can provide intuitive displays of complex data, enabling new insights if designed well. In the big data context, visualizations go beyond the capabilities of traditional figures and charts by adding interactivity. Those viewing big data visualizations can in effect create and interpret cross-sectional analyses at any level of specificity without ever looking at a number; thousands of static figures may be contained within a single visualization, and a person interested in one of those figures can view it immediately and automatically upon request. Big data visualization tools can even be used with small data, although the added complexity is only worthwhile when this sort of interactivity would be valuable to the target audience.

The implication of this interactivity is that the specificity of insights is much greater, and this brings both unique opportunities and unique challenges. Because users may drill down to any of thousands of figures, and because the people creating visualizations rarely look at all possible permutations of the figures those visualizations enable, drill-downs containing spurious results are likely. In the circle packing visualization shown in Figure 43.1, for example, circle sizes represent the total number of employees in a large organization within each first-order job grouping (division), divided further based upon a second-order job grouping (product team). A user might click on any given circle to gain more specific information about that grouping and its subgroups, and then click within subgroups to get information about even smaller subgroups, as shown on the right side of Figure 43.1. In such cases, chance variation alone may cause a particular requested figure to misrepresent larger trends, a common problem with multilevel data (Klein, Dansereau, & Hall, 1994). In the same way that simple statistical tests can be misleading when contextual assumptions are not met, visualizations can be misinterpreted when viewers forget, ignore, or do not have access to the bigger picture. Because images in general are more persuasive than other, more numerically oriented forms of information (Latour, 1990), visualizations have a great deal of power to misinform as well as inform.

Even when decision makers are prepared to consider visualization data from multiple perspectives to avoid this problem, the sheer quantity of information produced may be overwhelming. When a thousand different cross-sectional figures can be obtained, it is often unclear which should be prioritized and trusted. Humans can readily consider only a relatively small number of sources of information simultaneously in decision making (Payne, 1976); thus, the availability of so many figures may in this way be harmful. Statistical approaches were developed, in part, to simplify decision making from vast quantities of data. Although big data visualization tools may make it somewhat easier to sift through large amounts of data meaningfully, there is still a limit to human information processing.
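As a brief sketch of how an interactive drill-down of this general kind might be produced, the following uses a hypothetical hierarchical headcount table and Plotly's sunburst chart (a close cousin of circle packing) rather than the actual system shown in Figure 43.1:

```python
# A brief sketch of an interactive drill-down over a division/team hierarchy,
# using an invented headcount table and a sunburst chart.
import pandas as pd
import plotly.express as px

headcount = pd.DataFrame({
    "division":  ["Research", "Research", "Sales", "Sales", "Operations"],
    "team":      ["Discovery", "Platform", "Enterprise", "SMB", "Logistics"],
    "employees": [120, 340, 210, 95, 400],
})

fig = px.sunburst(headcount, path=["division", "team"], values="employees")
fig.show()  # clicking a division "zooms in" to its teams
```

Even a sketch this small illustrates the caution above: every additional level of the hierarchy multiplies the number of views a decision maker might request, each of which can misrepresent the larger pattern.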

FIGURE 43.1  Sample Big Data Visualization. Source: Courtesy of Evan Sinar (Development Dimensions International).

in part, to simplify decision making from vast quantities of data. Although big data visualization tools may make it somewhat easier to sift through large amounts of data meaningfully, there is still a limit to human information processing. Given these challenges, we recommend that visualizations be used only in contexts where the specific affordances of data interactivity would aid in decision making. In such cases, the visualization should still be carefully designed to provide only relevant and actionable data to the viewer. Although excess variables can easily be included in visualizations, simplicity is still a virtue: only those visualization options that are theoretically linked to the problem to be solved should be included. Because of the potential for misleading results, we also recommend that big data visualizations, when used analytically, serve only as the first step in the decision-making process, to be followed up with small data investigations using traditional research methods.
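
To make the drill-down idea concrete, the following minimal Python sketch (using pandas, with hypothetical column names such as division, product_team, and tenure_band) performs the kind of hierarchical aggregation that underlies a circle-packing view and flags subgroups so small that any figure computed from them is vulnerable to the chance variation discussed above. It illustrates the general idea only; it is not the tool shown in Figure 43.1.

    import pandas as pd

    # Hypothetical employee-level data; in practice this would come from an HRIS extract.
    employees = pd.DataFrame({
        "division":     ["Sales", "Sales", "Sales", "R&D", "R&D", "R&D"],
        "product_team": ["West",  "West",  "East",  "Core", "Core", "Labs"],
        "tenure_band":  ["0-2",   "3-5",   "0-2",   "0-2",  "3-5",  "0-2"],
    })

    MIN_N = 30  # flag drill-downs too small to interpret on their own

    # Each additional grouping level multiplies the number of possible "figures."
    for level in (["division"],
                  ["division", "product_team"],
                  ["division", "product_team", "tenure_band"]):
        counts = employees.groupby(level).size().rename("n_employees").reset_index()
        counts["too_small_to_trust"] = counts["n_employees"] < MIN_N
        print(f"\nDrill-down level: {level} ({len(counts)} possible figures)")
        print(counts)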

CONCLUSION

In summary, we have presented four case studies highlighting each of the four functional areas of big data in staffing: gathering, storage, analytics, and presentation/visualization. Across these areas, there is a great deal of potential for staffing to be transformed by big data. We can now collect information we could never collect before, at a scale we could never collect it before, applying a wide variety of analytic techniques based upon artificial intelligence research to identify patterns that can be acted upon. We can create interactive visualizations so that people with no statistical expertise can explore data powerfully, putting data-driven decision making well within the reach of even the most numbers-phobic organizational leader. This provides an incredible opportunity to increase the accuracy of both staffing decisions and staffing research. There is also a great deal of potential to mislead ourselves. These techniques are quite powerful, bringing many opportunities to head down a harmful path based upon seemingly minor
decisions. The ease of data gathering means that far more data can be collected than are useful, encoding information with unclear value and potential legal risk. Big data storage is so inexpensive and vast that massive amounts of data can be stored essentially indefinitely. This can create a tempting target for hackers, yet sensitive electronic information cannot be stolen if it is not accessible to the Internet (or to big data practitioners). Big data analytics offer the ability to extract insights from data that were never before extractable, identifying subtle patterns of numbers that a human analyst running traditional analyses would likely never find, but these approaches are often quite brute force, extracting patterns in samples when no such patterns may exist in the population. Big data visualizations that enable non-statisticians to dive deeply into data may also create a false sense of security, and the type of information conveyed by such visualizations is entirely under the control of the visualization designer, who will likely make hundreds or thousands of small decisions along the path from raw data to a particular visualization.

Given this combination of potential and caution, we contend that the greatest value will be found at the intersection points between big data and traditional staffing research. When these two families of techniques are used in concert, when insights are discovered with big data and verified with the collection of in-depth small data, we can be maximally confident that the right decisions are being made. Echoing recommendations for mixed-methods research (Creswell & Plano Clark, 2011), we contend that the convergence of multiple methods on the same recommendation is the best evidence to initiate a particular organizational intervention. When these multiple methods do not converge, it is time for further investigation; conclusions drawn from big data are neither inherently better nor worse than those drawn from small data. Instead, an interdisciplinary perspective will provide the answers organizations seek, and I-O psychologists, staffing specialists, and big data practitioners should try to build this perspective.

NOTE
1. We would like to thank Evan Sinar for his gracious contribution of Figure 43.1.

REFERENCES

Allport, G. W., & Odbert, H. S. (1936). Trait names: A psycholexical study. Psychological Monographs, 47(1).
American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing (U.S.). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26.
Bauer, T. N., Truxillo, D. M., Tucker, J. S., Weathers, V., Bertolino, M., Erdogan, B., & Campion, M. A. (2006). Selection in the information age: The impact of privacy concerns and computer experience on applicant reactions. Journal of Management, 32, 601–621. doi: 10.1177/0149206306289829
Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494.
Bollier, D. (2010). The promise and peril of big data (Aspen Institute Report). Retrieved from http://www.aspeninstitute.org/sites/default/files/content/docs/pubs/The_Promise_and_Peril_of_Big_Data.pdf
Boyce, A. S., Corbet, C. E., & Adler, S. (2013). Simulations in the selection context: Considerations, challenges, and opportunities. In M. Fetzer & K. Tuzinski (Eds.), Simulations for personnel selection (pp. 17–41). New York, NY: Springer.
Bryan, L. K., & Vinchur, A. J. (2013). Industrial-organizational psychology. In D. K. Freedheim & I. B. Weiner (Eds.), Handbook of psychology, Vol. 1: History of psychology (2nd ed., pp. 407–428). Hoboken, NJ: John Wiley & Sons Inc.
Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36, 1165–1188.
Chien, C-F., & Chen, L-F. (2008). Data mining to improve personnel selection and enhance human capital: A case study in high-technology industry. Expert Systems with Applications, 34, 280–290.
Cho, V., & Ngai, E. W. T. (2003). Data mining for selection of insurance sales agents. Expert Systems, 20, 123–132.

Creswell, J. W., & Plano Clark, V. L. (2011). Designing and conducting mixed methods research. Thousand Oaks, CA: SAGE Publications.
Cumbley, R., & Church, P. (2013). Is "big data" creepy? Computer Law & Security Review, 29, 601–609.
Davenport, T. (2014). Big data at work: Dispelling the myths, uncovering the opportunities. Cambridge, MA: Harvard Business Review Press.
Dumbill, E. (2013). Making sense of big data. Big Data, 1(1), 1–2.
Dunnette, M. D. (1966). Fads, fashions, and folderol in psychology. American Psychologist, 21, 343–352.
Frankel, F., & Reid, R. (2008). Big data: Distilling meaning from data. Nature, 455, 30.
Garcia, A. (2015, July 16). Amazon 'Prime Day' shattered global sales records. CNN Money. Retrieved from http://money.cnn.com/2015/07/15/news/amazon-walmart-sales/
Gelman, A., Pasarica, C., & Dodhia, R. (2002). Let's practice what we preach. The American Statistician, 56, 121–130. doi: 10.1198/000313002317572790
Ghinita, G., Karras, P., Kalnis, P., & Mamoulis, N. (2007). Fast data anonymization with low information loss. Paper presented at the Proceedings of the 33rd International Conference on Very Large Data Bases, Vienna, Austria.
Guion, R. M. (2011). The legal context for personnel decisions. In R. M. Guion (Ed.), Assessment, measurement, and prediction for personnel decisions (pp. 163–207). New York, NY: Taylor & Francis.
Hackett, R. (2015, June 9). Massive federal data breach affects 7% of Americans. Time Magazine. Retrieved from http://time.com/3952071/opm-data-breach-federal-employees/
Hand, D. J. (1999). Statistics and data mining: Intersecting disciplines. SIGKDD Explorations, 1(1), 16–19.
Harford, T. (2014). Big data: A big mistake? Significance, 11(5), 14–19.
Hitzler, P., & Janowicz, K. (2013). Linked data, big data, and the 4th paradigm. Semantic Web, 4, 233–235.
Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration of adverse impact in personnel selection procedures: Issues, evidence and lessons learned. International Journal of Selection and Assessment, 9, 152–194. doi: 10.1111/1468-2389.00171
IBM. (2015). Bringing big data to the enterprise. Retrieved from https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
Keim, D., Qu, H., & Ma, K-L. (2013). Big-data visualization. IEEE Computer Graphics and Applications, 33, 50–51.
Klein, K. J., Dansereau, F., & Hall, R. J. (1994). Levels issues in theory development, data collection, and analysis. Academy of Management Review, 19, 195–229.
Kotsiantis, S. B., Zaharakis, I. D., & Pintelas, P. E. (2007). Machine learning: A review of classification and combining techniques. Artificial Intelligence Review, 26, 159–190. doi: 10.1007/s10462-007-9052-3
Landers, R. N., & Behrend, T. S. (2015). An inconvenient truth: Arbitrary distinctions between organizations, Mechanical Turk, and other convenience samples. Industrial and Organizational Psychology, 8, 142–164.
Landers, R. N., Brusso, R. C., Cavanaugh, K. J., & Collmus, A. B. (in press). A primer on theory-driven web scraping: Automatic extraction of big data from the internet for use in psychological research. Psychological Methods.
Landers, R. N., & Goldberg, A. S. (2014). Online social media in the workplace: A conversation with employees. In M. D. Coovert & L. F. Thompson (Eds.), Psychology of workplace technology (pp. 284–306). New York, NY: Routledge Academic.
Latour, B. (1990). Drawing things together. In M. Lynch & S. Woolgar (Eds.), Representation in scientific practice (pp. 19–68). Cambridge, MA: MIT Press.
Lymberis, A. (2003). Smart wearables for remote health monitoring, from prevention to rehabilitation: Current R&D, future challenges. In R. Summers (Chair) & E. Carson (Co-Chair), 4th International IEEE EMBS Special Topic Conference on Information Technology Applications in Biomedicine 2003 (pp. 272–275). Piscataway, NJ: IEEE.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition and productivity. Retrieved from http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review, 90(10), 61–67.
Mendoza, J. L., Bard, D. E., Mumford, M. D., & Ang, S. C. (2004). Criterion-related validity in multiple-hurdle designs: Estimation and bias. Organizational Research Methods, 7, 418–441. doi: 10.1177/1094428104268752
Newell, A., & Simon, H. A. (1976). Computer science as empirical inquiry: Symbols and search. Communications of the ACM, 19(3), 113–126.
Olson, D. L., & Delen, D. (2008). Advanced data mining techniques. Berlin, Germany: Springer-Verlag.
Payne, J. W. (1976). Task complexity and contingent processing in decision making: An information search and protocol analysis. Organizational Behavior and Human Performance, 16, 366–387.
Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-driven decision making. Big Data, 1, 51–59.

Sacco, A. (2014, Spring). Enterprise wearables present tech challenges and management pitfalls. CIO, 44–45.
Sackett, P. R., & Arvey, R. D. (1993). Selection in small N settings. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 418–447). San Francisco, CA: Jossey-Bass.
Scherbaum, C. A. (2005). Synthetic validity: Past, present, and future. Personnel Psychology, 58, 481–515.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Society for Industrial and Organizational Psychology.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Ur, B., Leon, P. G., Cranor, L. F., Shay, R., & Wang, Y. (2012). Smart, useful, scary, creepy: Perceptions of online behavioral advertising. In L. F. Cranor (Chair), Proceedings of the Eighth Symposium on Usable Privacy and Security (pp. 1–15). New York, NY: ACM Press.
Xia, F., Yang, L. T., Wang, L., & Vinel, A. (2012). Internet of things. International Journal of Communication Systems, 25, 1101–1102.
Youyou, W., Kosinski, M., & Stillwell, D. (2014). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences, 112, 1036–1040.
Zottoli, M. A., & Wanous, J. P. (2000). Recruitment source research: Current status and future directions. Human Resource Management Review, 10, 353–382. doi: 10.1016/s1053-4822(00)00032-2

44
THE IMPACT OF EMERGING TECHNOLOGIES ON SELECTION MODELS AND RESEARCH
Mobile Devices and Gamification as Exemplars
WINFRED ARTHUR JR., DENNIS DOVERSPIKE, TED B. KINNEY, AND MATTHEW O'CONNELL

INTRODUCTION

To stay current, assessment professionals must track developments in a plethora of emerging technologies including mobile assessment, gamification, serious games, simulations, social media, artificial intelligence, avatars, and Big Data. The introduction of each new technology seems to result in a similar cycle of calls for research and validation efforts, studies on equivalence, and the publication of findings in journals or presentations at conferences. This is accompanied by much brow-beating regarding the lag between the adoption of these technologies and scientific publication and the limited impact of the academy on practice. Of course, there will always be emerging technologies, some of which may have implications for employment-related testing, assessment, and research, and others which will not; a cynic might argue that this has always been true and that once upon a time we worried about OpScan sheets or computer anxiety. In writing this chapter, we are cognizant of the likelihood that any discussion of the topic of "emerging" technologies could very well be outdated soon after it appeared in print. Therefore, instead of a discussion of the whole gamut of emerging technologies, this chapter focuses on mobile devices and gamification as exemplars of the interplay between the emergence of new technologies and the practices and methods of the fields of personnel assessment and industrial-organizational (I-O) psychology. Specifically, we examine and explore the extent to which emerging technologies may lead to disruptive innovations in the way the field conceptualizes and implements the traditional methods and approaches to test development and validation; that is, will any of these technologies alter the basic psychometric tools of our profession?

Three themes characterize discussions about emerging and new technologies with organizational stakeholders: (1) Many organizations want all of these trending topics applied to their selection program (e.g., hypothetical client question: "Are you able to provide my organization with an avatar-based simulation with artificial intelligence and game-like features that can
be administered on any Internet device . . . Oh, and it would be helpful if you can also farm big data sets to get reliable measures of traits from Facebook posts. Can you do that?"); (2) There is very little agreement on how any of these trends are defined or how effectively they can be applied in a selection context (e.g., hypothetical client request: "I don't really know what gamification is, but I know that Nike and Walmart do it, so I need to have gamification incorporated into all of my talent strategies too"); and (3) Empirical research from I-O psychology lags practice on each of these topics. In fact, to date, only six empirical investigations of assessments delivered via mobile devices have been published in typical I-O or other related applied journals (Arthur, Keiser, & Doverspike, 2017), and we were unable to locate any published empirical investigations of the use of gamification in employment-related testing and assessment. Despite this absence of empirical research, Dale (2014) projected that organizations would spend more than $2.8 billion on gamification by 2015. With such a high level of organizational interest in the use of emerging technology in talent acquisition and interventions, there is clearly both value and need for personnel psychology to devote research attention to these topics in an effort to not only keep up with but also get ahead of these trends.

Subsequent sections of this chapter first present a review of the literature on the selected exemplars, specifically mobile devices and gamification (including games and simulations), as it pertains to employment-related testing and assessment. Next, we present a discussion of the traditional test development and validation model and its intersection with said emerging technologies. Finally, the chapter concludes with a discussion of recommendations, the need for research, other emerging technologies, and some future-oriented speculation.

MOBILE DEVICES AND GAMES: REVIEW OF THE LITERATURE AND IMPLICATIONS FOR EMPLOYMENT-RELATED TESTING AND ASSESSMENT

Delivering Assessments on Mobile Devices

For years there have been serious concerns in the I-O community about unproctored Internet testing (UIT). There were legitimate concerns about test security, equivalence, and cheating (Pearlman, 2009; Tippins et al., 2006). Although some of these concerns remain, especially in high-stakes testing situations such as certification tests, research has consistently failed to show practical or meaningful differences between UIT and non-UIT tests and assessments in reference to psychometric properties, test score validity, or candidate reactions (Davies & Waddlington, 2006; Do, Shepherd, & Drasgow, 2005; O'Connell, Delgado, & Kung, 2012). It is acknowledged that UIT is here to stay (O'Connell, Arthur, & Doverspike, 2015), and the advent of mobile devices has made it even easier for candidates to take tests anywhere, anytime, and almost exclusively in unproctored environments. Usage data suggest that test taking on mobile devices continues to increase substantially (e.g., see Illingworth, Morelli, Scott, & Boyd, 2015; McClure Johnson & Boyce, 2015). This increase in usage gives more people than ever the opportunity to apply for jobs, reduces testing-related costs for organizations, and also increases the size of the applicant pool, thereby resulting in smaller selection ratios, which favor the hiring organization.

A pivotal issue in this research domain is "What is and is not a mobile device, and what theories or constructs would even lead us to expect that this differentiation should affect outcomes that are psychologically interesting and meaningful?" In an effort to provide theoretical guidance to inform why Internet-based testing (IT) device type (i.e., "mobile" vs. non-mobile) should or should not have an effect on test scores, Arthur et al. (2017) presented a framework for conceptualizing device types in terms of the construct-irrelevant information processing demands placed on the test taker while taking the assessment. Said information processing demands translate into additional, construct-irrelevant cognitive load, which interacts with the device type, resulting in differential outcomes as a function of the construct assessed. So, instead of differentiating mobile and non-mobile devices simply in terms of whether said devices are tethered to the wall or not (i.e., wireless vs. wired connection
to the Internet), Arthur et al. (2017) identified four information processing variables—working memory, perceptual speed and visual acuity, psychomotor ability, and selective attention—that correspond to four structural characteristics of IT assessment devices, specifically screen size, screen clutter, response interface, and permissibility (i.e., distractibility). Arthur et al.'s Structural Characteristics/Information Processing (SCIP) model permits the classification of current IT devices on a continuum that ranges from desktops at one end (i.e., large screen, low clutter, easy response interface, and low permissibility, which translates into lower construct-irrelevant cognitive load) to smartphones at the other (small screen, relatively high clutter, difficult response interface, and high permissibility, which translates into high construct-irrelevant cognitive load). Thus, when the literature uses the label mobile device, in terms of Arthur et al.'s SCIP model, this refers to devices at the high end of the information processing continuum, which definitely includes smartphones and may also include tablets, which fall lower than smartphones on Arthur et al.'s continuum.
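
The SCIP framework can be made concrete with a minimal Python sketch that encodes the four structural characteristics for each device type and combines them into a rough construct-irrelevant load index. The device names follow the continuum described in this chapter, but the numeric ratings and the unweighted sum are illustrative assumptions, not part of Arthur et al.'s (2017) model.

    from dataclasses import dataclass

    @dataclass
    class DeviceProfile:
        """Hypothetical ratings of the four SCIP structural characteristics (1 = low load, 5 = high load)."""
        name: str
        screen_size_demand: int   # smaller screens -> higher perceptual/working memory demand
        screen_clutter: int       # more clutter -> higher selective attention demand
        response_interface: int   # harder input -> higher psychomotor demand
        permissibility: int       # more usage contexts -> higher distractibility

        def construct_irrelevant_load(self) -> int:
            # An illustrative unweighted sum; any real index would require empirical calibration.
            return (self.screen_size_demand + self.screen_clutter
                    + self.response_interface + self.permissibility)

    devices = [
        DeviceProfile("desktop",    1, 1, 1, 1),
        DeviceProfile("laptop",     2, 2, 2, 2),
        DeviceProfile("tablet",     3, 3, 3, 4),
        DeviceProfile("smartphone", 5, 5, 5, 5),
    ]

    # Print devices from lowest to highest assumed construct-irrelevant cognitive load.
    for d in sorted(devices, key=lambda p: p.construct_irrelevant_load()):
        print(d.name, d.construct_irrelevant_load())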

Measurement Equivalence

The issue of interest here is whether the psychometric properties of a test administered on a mobile device are similar to those of the same test administered on a non-mobile device such as a desktop computer. A large-scale, high-stakes study comparing mobile to non-mobile devices with a sample of 2.8 million applicants (approximately 49,000 of whom used mobile devices) found comparable reliabilities, factor loadings, and intercorrelations for cognitive and non-cognitive measures (Arthur, Doverspike, Muñoz, Taylor, & Carr, 2014). Similar findings are reported in other studies as well (Lawrence, Wasko, Delgado, Kinney, & Wolf, 2013; Morelli, Mahan, & Illingsworth, 2014; Parker & Meade, 2015). In summary, the vast majority of research, using large sample sizes, suggests that the psychometric properties, including factor structure and reliability, are similar for mobile and non-mobile devices when assessments are intentionally designed to be administered across devices. These findings hold for the measurement of both cognitive and non-cognitive constructs.
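
As a minimal illustration of the kind of comparison these equivalence studies report, the following Python sketch computes Cronbach's alpha separately for mobile and non-mobile respondents on the same (hypothetical, randomly generated) five-item scale. Published studies go considerably further, for example by testing the invariance of factor loadings with multi-group confirmatory factor analysis.

    import numpy as np
    import pandas as pd

    def cronbach_alpha(items: pd.DataFrame) -> float:
        """Cronbach's alpha for a respondents-by-items matrix."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars / total_var)

    rng = np.random.default_rng(0)
    # Hypothetical 1-5 Likert responses plus a column used only to split respondents by device.
    data = pd.DataFrame(rng.integers(1, 6, size=(200, 6)),
                        columns=["item1", "item2", "item3", "item4", "item5", "device"])
    data["device"] = np.where(data["device"] > 3, "mobile", "non-mobile")

    # Note: with purely random responses alpha will be near zero; real item data would be correlated.
    for device, group in data.groupby("device"):
        alpha = cronbach_alpha(group[["item1", "item2", "item3", "item4", "item5"]])
        print(f"{device}: alpha = {alpha:.2f}")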

Mean Differences

Mean score differences between mobile and non-mobile devices appear to be a function of the constructs assessed. Thus, a robust finding that characterizes this literature is that whereas there are no mean differences on non-cognitive assessments (e.g., personality) taken on mobile and non-mobile devices (e.g., Arthur et al., 2014; Dages & Jones, 2015; Morelli et al., 2014; Wood, Stephens, & Sliter, 2015), there are pronounced differences for cognitive constructs, with scores on mobile devices being consistently and substantially lower. For instance, Arthur et al. (2014) reported a d of .90. Impelman (2013) found similar performance decrements on cognitive measures across four organizational samples. Wood et al. (2015) reported ds of .46 and .35 for two cognitive ability tests and .93 and .26 for two mechanical aptitude tests. Finally, to the extent that UIT devices are used to take assessments in the form of complex interactive simulations and situational judgment tests (SJTs), assessments that generally engender higher construct-irrelevant cognitive load, one would expect lower scores on mobile devices compared to non-mobile devices. This pattern of findings is in accord with the precepts of Arthur et al.'s (2017) model: to the extent that the four information processing variables (i.e., working memory, perceptual speed and visual acuity, psychomotor ability, and selective attention) that correspond to the four structural characteristics of UIT assessment devices (i.e., screen size, screen clutter, response interface, and permissibility [distractibility]) play a role in using the UIT device, they result in additional construct-irrelevant cognitive load that is likely to influence performance on the test when said cognitive demands are not the focal construct of interest.
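
The differences above are reported as standardized mean differences (Cohen's d). A minimal sketch of that computation with a pooled standard deviation, applied to hypothetical mobile and non-mobile score distributions, is shown below.

    import numpy as np

    def cohens_d(group1: np.ndarray, group2: np.ndarray) -> float:
        """Standardized mean difference using the pooled standard deviation."""
        n1, n2 = len(group1), len(group2)
        pooled_var = (((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1))
                      / (n1 + n2 - 2))
        return (group1.mean() - group2.mean()) / np.sqrt(pooled_var)

    rng = np.random.default_rng(1)
    non_mobile_scores = rng.normal(loc=25.0, scale=5.0, size=1000)  # hypothetical cognitive test scores
    mobile_scores = rng.normal(loc=22.5, scale=5.0, size=1000)      # hypothetical lower mobile mean

    print(f"d = {cohens_d(non_mobile_scores, mobile_scores):.2f}")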

Criterion-Related Validity

Research on the criterion-related validity of mobile device assessments, and more importantly on how it compares with that of non-mobile assessments, is almost non-existent. However, a limited number of studies have examined the comparative criterion-related validity of proctored versus unproctored assessments, and their findings indicate little if any differences (Beaty et al., 2011; Wasko, Lawrence, & O'Connell, 2015; Weiner & Morrison, 2009). So, although the volume of research is quite small, the preceding lends credence to the proposition that there is little theoretical or conceptual basis to expect differential criterion-related validity in comparisons of mobile versus non-mobile Internet devices (Kinney, Chang, Lawrence, & Moretti, 2015; O'Connell et al., 2015).
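
If comparative criterion-related validity studies were conducted, one basic analysis would be a test of whether the validity coefficient differs across device types. The sketch below applies the standard Fisher r-to-z comparison of two independent correlations to hypothetical validity coefficients and sample sizes; it illustrates the analysis only and does not reproduce any published result.

    import numpy as np
    from scipy import stats

    def compare_independent_correlations(r1: float, n1: int, r2: float, n2: int) -> float:
        """Two-tailed p-value for the difference between two independent correlations (Fisher r-to-z)."""
        z1, z2 = np.arctanh(r1), np.arctanh(r2)
        se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
        z = (z1 - z2) / se
        return 2 * (1 - stats.norm.cdf(abs(z)))

    # Hypothetical validity coefficients: test scores correlated with job performance,
    # estimated separately for candidates tested on mobile vs. non-mobile devices.
    p = compare_independent_correlations(r1=0.28, n1=400, r2=0.25, n2=2500)
    print(f"p = {p:.3f}")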

Demographic Differences in Usage

The research to date suggests that African Americans, Hispanics, and females are more likely than white males to take a test on a mobile device (Arthur et al., 2014; Illingworth et al., 2015; McClure Johnson & Boyce, 2015). If taking cognitive tests on mobile devices results in lower scores, and the tendency to take assessments on mobile devices covaries with specified protected group status, then this raises the specter of observed subgroup differences and higher adverse impact potential resulting from the use of mobile devices in employment-related assessments. However, we are aware of no research that has examined this issue for cognitive constructs. That being said, a detailed look at Arthur et al.'s (2014) data suggests that the mobile device effect appears to be a main effect; it does not appear to interact with demography to produce larger subgroup differences. This pattern of results appears to be similar to those reported by Arthur, Edwards, and Barrett (2002) and Edwards and Arthur (2007) in their comparisons of constructed-response and multiple-choice tests. Finally, it should be noted that for non-cognitive constructs, the absence of meaningful subgroup differences reported in the general personnel selection and assessment literature is observed for mobile device assessments as well (e.g., Golubovich & Boyce, 2013; Kinney, Lawrence, & Chang, 2014; McClure Johnson & Boyce, 2015).
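
The adverse impact concern raised here is ultimately a question about subgroup selection rates. A minimal sketch of the familiar four-fifths (80%) rule comparison, using hypothetical applicant and pass counts, is shown below; an actual analysis would also need to model device usage rates and device effects on scores.

    def adverse_impact_ratio(selection_rate_focal: float, selection_rate_reference: float) -> float:
        """Four-fifths (80%) rule comparison of subgroup selection rates."""
        return selection_rate_focal / selection_rate_reference

    # Hypothetical selection rates; the question raised in the text is whether device effects
    # on cognitive scores, combined with differential mobile usage, could lower a subgroup's rate.
    focal_rate = 45 / 120        # focal subgroup: 45 of 120 applicants pass the cut score
    reference_rate = 130 / 300   # reference subgroup: 130 of 300 applicants pass

    ratio = adverse_impact_ratio(focal_rate, reference_rate)
    print(f"Adverse impact ratio = {ratio:.2f} "
          f"({'below' if ratio < 0.80 else 'meets'} the 4/5ths threshold)")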

Applicant Reactions

Assessment professionals and organizations generally consider providing applicants with the opportunity to take tests on mobile devices to be a positive attribute (Fursman & Tuzinski, 2015; Gutierrez, Meyer, & Fursman, 2015). However, it is unclear whether applicants actually prefer to take assessments on mobile devices over desktops or personal computers (PCs). For instance, although Kinney et al. (2014) found no difference in applicant satisfaction based on mode of delivery, other researchers have found applicants to have much more favorable reactions to PCs than mobile devices (Fursman & Tuzinski, 2015; Gutierrez & Meyer, 2013; Landers, Reddock, Cavanaugh, & Proaps, 2014). Hence, whereas applicants generally indicated that test takers should be given the opportunity to complete assessments on mobile devices, they also generally had more negative reactions to using mobile devices for assessments and consistently expressed a preference for PCs over mobile devices when taking personnel selection tests and assessments.

In summary, a number of conclusions and recommendations can be made concerning the use of mobile devices in personnel selection and assessment. First, the growth in unproctored mobile device testing continues to display an accelerating upward trend. Second, it is important to improve the experience for applicants by designing mobile tests in an optimized manner (e.g., maximizing the use of screen space, limiting unnecessary buttons). Third, the permissibility of mobile devices (i.e., the ability to use them in a variety of locations and conditions) means that they also potentially engender high levels of distractibility, which may be a contributory
factor in the lower scores observed for cognitive constructs (Arthur et al., 2017). Consequently, organizations and testing and assessment professionals should consider instructing and encouraging candidates to take control of their test environment and make sure they are free of distractions during the assessment. Fourth, because the use of mobile devices generally results in substantially lower scores on cognitive constructs, research that directly investigates differential subgroup differences on mobile versus non-mobile assessments of cognitive constructs is needed. Fifth, comparative criterion-related validity studies are woefully absent from the literature, even as conference presentations. Finally, with very few exceptions (e.g., Arthur et al., 2017), at present most research uses a simple classification of mobile versus non-mobile devices, with only a very small number of recent studies recognizing distinctions among PCs, tablets, and smartphones. Consequently, future research needs to pay closer attention to finer device-type designations. For instance, on the basis of their two dimensions (structural characteristics and information processing variables), Arthur et al. (2017) currently place UIT devices on the following continuum: desktops→laptops→tablets→phablets→smartphones, with desktops engendering the lowest levels of construct-irrelevant cognitive load and smartphones the highest. In conclusion, the growth of mobile device testing poses a number of challenges but at the same time opens up a wide range of exciting opportunities for reaching non-traditional candidates, expanding the applicant pool, and reducing testing-related costs for organizations, especially those pertaining to test administration.

Gamification (and Serious Games and Simulations)

Prevalence of Gamification and Game-Thinking in Organizations

Game-thinking is a term that has been used to refer broadly to the concepts of gamification and serious games (Armstrong, Collmus, & Landers, 2015). Gamification has been embraced as a common technique to facilitate change in organizations by making traditional interventions more engaging. Game-thinking is regularly applied in organizations for a host of purposes, including recruitment, training, sales prospecting, professional development, and performance reviews (Oprescu, Jones, & Katsikitis, 2014). Starting with e-learning systems, there has been exponential growth in interest in gamification (Dale, 2014), with gamification appearing on Google Trends in 2010 (DuVernet & Popp, 2014). In 2011, the Oxford Dictionary added gamification to its word-of-the-year shortlist, with a definition referring to the application of game features to non-game applications. The recent and projected growth in game-thinking is due to the convergence of cheaper technology and the prevalence of games in society in general (Deterding, Sicart, Nacke, O'Hara, & Dixon, 2011).

Gamification versus Serious Games versus Simulations

With the rapid growth of game-thinking in organizations and the relative lag of scientific inquiry on these approaches, several definitions have emerged for the seemingly related concepts of gamification, serious games, and simulations. Most researchers broadly define gamification as the application of game mechanics, elements, and features to non-game environments (Attali & Arieli-Attali, 2015; Dale, 2014; Deterding et al., 2011; Gartner, 2011). For instance, Figure 44.1 presents a screenshot of a gamified assessment designed to measure attention to detail and critical thinking. Some traditional activities in organizations (e.g., assessments, surveys) are built by leveraging technology that is not particularly eye-catching or engaging, whereas games are designed to be fun. The basic concept of gamification is to apply the elements that make games interesting to non-game contexts to make them more entertaining than they would otherwise be in their traditional form (Attali & Arieli-Attali, 2015). In fact, Dale (2014) reported that the

FIGURE 44.1  Screenshot of a Gamified Assessment Designed to Measure Attention to Detail and Critical Thinking. Developed by Select International Inc. Reprinted with permission.

primary purpose of gamification was to engage participants and motivate future and lasting participation. Interventions involving gamification are not necessarily games. Huizinga (2000) defined a game as a non-serious but intensely engaging voluntary activity structured by rules. Kapp (2014) explained that games are self-contained units with discrete starting points, game play sections, and clearly defined endings, with winning being a possibility. Thus, participants are aware they are playing a game. Unlike gamification, serious games are a type of game. The concept of a serious game simply refers to the application of a game for non-trivial or non-entertainment purposes (Petridis, Baines, Lightfoot, & Shi, 2014; Simpson & Jenkins, 2015). Participants in a serious game are aware that they are in a game, and there is a defined start and end point; however, the elements of fun or engagement discussed in descriptions of gamification are less relevant. Consequently, serious games and gamification have several features in common but also important differences. Gamification is a technique or collection of techniques applied to programs, assessments, or other content. A serious game is a discrete unit developed for a particular purpose. Figure 44.2 presents a screenshot of a vehicle assembly task designed as a serious game.

How do simulations relate to game-thinking? Just as serious games and gamification are not the same concept, simulations are not necessarily the same as gamification or serious games. Gamification, again, is the process of making a tool more "game-like"; a serious game is a game developed for a serious purpose. That being said, serious games can be considered a "type" of simulation. Not all simulations are serious games, but all serious games are simulations (e.g., a flight simulator with scores, levels, and objectives used to train pilots). Likewise, the process of gamification can certainly be applied (and often is) to simulations of all types. Today's organizations routinely build simulations that feature certain common gamification elements such as progress bars, timed sections, narrative stories, and challenges. Simulations also often have features common to games (as opposed to simply applying game features to an existing assessment, as is the case with gamification) in that there is typically a defined start and end. Simulations are also often created to capture a work sample or other clearly job-related behavior in a high-fidelity, engaging user experience. A differentiator between game-thinking and simulations

FIGURE 44.2  Screenshot of a Vehicle Assembly Task Designed as a Serious Game. Developed by Select International Inc. Reprinted with permission.

is that there are several core concepts of games and gamification that are typically not built into simulations, such as leaderboards, "win states," levels, and community sharing.

In sum, game-thinking is a term used to describe gamification and serious games. Gamification and serious games are different concepts used for different purposes (gamification is a process; serious games are a type of simulation). Simulations are related to both, and the use of simulations in selection contexts is increasing (see Fetzer & Tuzinski, 2013, for a comprehensive review). However, questions remain about the value of gamification in selection contexts. In particular, what key features of games are appropriate to leverage in a selection context? To consider whether or not gamification is appropriate in most selection contexts, further explanation of what gamification typically entails is needed.

Gamification Concepts

Gamification is about more than merit badges; it is about understanding, influencing, and rewarding desired behaviors (Dale, 2014). Like any effective applied psychological intervention, a gamification manipulation should focus on a specified outcome. Dale (2014) explained that good gamification design is user-centric and not mechanism-centric. Gamification is not just about adding attractive technology to an existing measure; instead, gamification manipulations should add features and elements to existing content to increase interest, engagement, and participation. The typical desired outcome of gamification is engagement; however, whether this is of particular value in high-stakes employment testing is debatable. Nevertheless, engagement is
created by what Gartner (2011) referred to as the three Ms: Motivation (rewards, both extrinsic and intrinsic), Momentum (sustained participation, often determined by the match between task difficulty and participant skill), and Meaning (the extent to which the outcome of the experience is desired). Deterding et al. (2011) and Attali and Arieli-Attali (2015) explained that points, leaderboards, and badges are among the most basic elements of games and that these lead to engagement. Game characteristics include rules, tools, mechanics, and players. Rules and tools are specific to the particular game. Players, obviously, are the participants. The mechanics employed in a game vary, but there are consistent features and common elements such as achievements (points, levels, bonuses), exercises (challenges or quests), synchronization with the community (leaderboards), results transparency (experience bars, continuous feedback), time (countdowns, speed), and luck (lottery, random achievement). These game mechanics make up the typical "toolset" applied in gamification (Dale, 2014). Kapp (2014) described two classes of gamification: structural and content. Structural gamification refers to the application of game elements to encourage engagement with content without changing the content itself. Content gamification is the application of game elements, game mechanics, and game-thinking to alter content to make it more game-like (e.g., through the use of stories, challenges, and quests). Bailey, Pritchard, and Kernohan (2015) applied these concepts in a study of survey research in a marketing context and found that applying game elements to marketing surveys increased survey completion rates. Nevertheless, the motivational dynamics of participating in a market survey are quite different from those of completing an assessment in a high-stakes employment-related context as a job applicant.
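
To make the distinction between these mechanics and the assessment content itself more concrete, the following minimal Python sketch represents a purely structural gamification layer as a configuration object added on top of an existing assessment. The field names and default values are hypothetical illustrations, not any vendor's implementation.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class StructuralGamification:
        """Hypothetical configuration of game mechanics layered onto an existing assessment
        (structural gamification in Kapp's sense: the items themselves are unchanged)."""
        points_per_correct: int = 10
        levels: List[str] = field(default_factory=lambda: ["Rookie", "Pro", "Expert"])
        show_progress_bar: bool = True    # results transparency
        countdown_seconds: int = 90       # time pressure
        show_leaderboard: bool = False    # community sharing; often inappropriate in selection
        immediate_feedback: bool = False  # see Geimer et al. (2015) on the risks of feedback

    config = StructuralGamification()
    print(config)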

Challenges to Applying Game-Thinking in Selection Contexts

The common theme in the gamification literature is that applying game principles leads to increased engagement. The idea is that increased engagement leads to desirable outcomes such as increased completion rates, sustained participation, and competency development or behavior change. When considering how gamification can be leveraged in selection contexts, it is important to consider whether or not these outcomes provide value to the organization. In a selection context, job candidates are typically highly motivated to engage in the selection component and pay close attention to the assessment content because the outcome of high performance (e.g., progression to the next stage in the selection process) is highly valued. Consequently, it would seem that the primary value of gamification, engagement, is not a major or particularly important outcome in a selection context. Candidates do not need game-like interventions to motivate them to try hard; repeated participation is limited and delayed; and selection assessments typically measure individual characteristics that are not expected to change (over and beyond measurement error) across multiple administrations for any individual candidate. As DuVernet and Popp (2014, p. 41) point out: "in an assessment application the goal is to measure a skill or characteristic rather than to train or motivate, thus repeated exposure to content or feedback may not be desirable." The most common game mechanic used in many gamification initiatives is to add scoring, badges, rewards, or some other form of feedback for the participant. However, as Geimer, Sanderson, and Popp (2015) note, providing feedback to job candidates can have unintended negative consequences: when gamification introduces a feedback component, negative performance information may increase anxiety, hinder concentration, and reduce the perception of having an opportunity to perform.

The Case for Using Gamification in Personnel Selection

Although there may be several arguments against using gamification in selection contexts, some characteristics of this approach translate into promising possibilities
for further exploration of the use of gamification in personnel selection and assessment. For instance, gamification can shift the frame of reference of the candidate to a job-relevant context by applying game mechanics, such as a work-related quest. This approach could assist candidates in drawing on work-related past behavioral examples when responding to items. Similarly, Armstrong and Landers (2015) suggested that game-thinking may be desirable in assessment contexts if it makes the desired behavior less transparent and thus less susceptible to socially desirable responding. Bailey et al. (2015) noted the difference between "hard" and "soft" gamification. Hard gamification refers to embedding items into gamified solutions such that participants are not aware of the items, whereas soft gamification refers to simpler interventions that "frame" items and encourage participation with item presentation features. To the extent that a hard gamification approach can be accommodated, there may be some value in gamification's ability to reduce certain socially desirable response patterns. Other possible positive outcomes from gamification include an improved candidate experience and greater face validity. Armstrong et al. (2015) suggested that gamification can increase a sense of job relevance, but also that applicant reactions may be improved by gamification only under certain conditions. In summary, whereas there may be some value to creating interactive and attractive assessments, simulations, and serious games, simply building a great-looking assessment is not in and of itself a gamification intervention.

What Does the Future Hold for Gamification in Personnel Selection?

The future of game-thinking in selection is difficult to foretell. To date, there has been very little empirical research on how game-thinking can add value to the selection process. In a broad review of the general gamification literature, Hamari, Koivisto, and Sarsa (2014) identified only 24 empirical studies across multiple disciplines. These studies all generally addressed whether or not gamification "works." They found that the most common application of gamification is in education and learning contexts. No empirical studies on the use of gamification in selection were identified. Geimer et al. (2015) reported that to date there is no known empirical research on gamified assessments. As technology advances, selection tools and assessment devices will continue to become more attractive. For example, detailed high-fidelity simulations are becoming more commonplace, with a corresponding emergence of SJTs incorporating "stories" and images (Tippins, 2014; Weekley, Hawkes, Guenole, & Ployhart, 2015). These features, which enhance the look and feel of assessments, will certainly continue to be used, but they are not at the core of gamification. Even if future research fails to indicate that gamification for assessment purposes provides a return on investment to organizations, there may be value in adding gaming elements to other human resources processes that are related to selection, such as recruiting, onboarding, and training. As such, there does appear to be a place for gamification in organizations, even if its fit with personnel assessment and selection is not an easy and natural one.

INTERSECTION OF EMERGING TECHNOLOGIES AND THE TRADITIONAL TEST DEVELOPMENT AND VALIDATION MODEL

In the preceding section, we provided a brief overview of several exemplars of current emerging technologies. In the present section, we ask the question: "What implications do the emerging technologies have for the traditional model used in the development and validation of tests in personnel selection and assessment?" The traditional model in employee selection and placement follows a well-established sequence of steps, as illustrated in Figure 44.3. We discuss each step in this sequence and the extent to which it is impacted by, and can readily incorporate, the exemplar emerging technologies that are the focus of this chapter. Table 44.1 presents a brief summary of the key features and characteristics of each step, along with the role and influence, or lack thereof, of the emerging technologies of interest here.

FIGURE 44.3  Prototypical Test Development and Validation Sequence. KSAOs = Knowledge, Skills, Abilities, and Other Characteristics. The flowchart depicts the following steps: job analysis (collect job-related information); determine major work behaviors (collect ratings); determine or identify (work) performance criterion or criteria; identify KSAOs or constructs underlying major work behaviors (collect ratings); select or develop test or assessment tool to measure KSAOs; validation study (determine approach, study design, sample, etc.); if validation study is empirical, then collect and analyze data; make decision to use, revise, or abandon test or assessment tool.

TABLE 44.1  Summary of Key Features and Characteristics of the Test Development and Validation Process and the Role of Simulations, Games, and Mobile Devices
(Each entry lists the test development and validation step, its key features and characteristics, and the impact of emerging technologies.)

Work/Job Analysis: Information gathering method
Key features and characteristics: Wide range of methods available. Choice of methods is determined by practical constraints and other factors. Based on the extant research, the recommendation is to use multiple methods.
Impact of emerging technologies: Influence is in the form of technological aids in the collection of data, such as online surveys for job analysis questionnaires, video recording of performance episodes instead of live observation, and remote focus groups and interviews. Simulations and games are unlikely to impact this step. However, job analysis questionnaires could be completed on mobile devices.

Work/Job Analysis: Collecting ratings—rating scales
Key features and characteristics: Rating data/scales pertaining to factors such as importance, time spent, frequency, consequences of error, time to proficiency, difficulty, and needed-upon-entry. Choice of scales is determined by the purpose of the job analysis and professional judgment.
Impact of emerging technologies: Potential ease of online data collection might facilitate data collection and thus increase the number of factors assessed. Simulations and games are unlikely to impact this step. However, ratings could be completed on mobile devices. Limitations of mobile devices may impact the use of large matrices and also the length of questionnaires.

Work/Job Analysis: Collecting ratings—raters
Key features and characteristics: Decisions about the source of ratings pertain to (a) the level of expertise [and experience], (b) the level in the organizational hierarchy [incumbents, subordinates, supervisors], and (c) the number of raters. The extant research supports the use of experts, sampling across the organizational hierarchy, and including as large a number of raters as possible.
Impact of emerging technologies: Potential ease of online data collection, especially if extended to mobile devices, vastly increases the ease with which larger numbers of raters can be sampled and, correspondingly, a wider range of rater types. However, simulations and games are unlikely to impact this step.

Work/Job Analysis: Identifying KSAOs/constructs and work performance criteria
Key features and characteristics: Arrived at on the basis of the statistical analysis of the job analysis ratings and also informed by the expertise, experience, and judgment of the job analyst, assessment researcher, or professional.
Impact of emerging technologies: Gamification may increase the need for precision and detail in the collection of KSAOs. The need for high levels of fidelity will put additional strains on the work/job analysis system. Other types of games and simulations put less emphasis on specific KSAOs and instead use a broader, more work sample, behaviorally based approach.

Selecting or Developing the Test or Assessment Tool
Key features and characteristics: Primarily entails determining the specific method(s) or approach (e.g., interviews, SJT, work sample) to measuring the specified constructs.
Impact of emerging technologies: Emerging technologies offer not only new methods (e.g., simulations and games) of assessing specified constructs but also different delivery platforms (e.g., mobile devices).

Validation Study
Key features and characteristics: Decisions pertain to the choice of validation approach or source of validity evidence. If empirical (e.g., criterion-related or construct-related), then one would design and implement a research study. Some design choices include the type of correlational design (e.g., predictive vs. concurrent) and the sample size and type (i.e., applicants vs. incumbents).
Impact of emerging technologies: To the extent that they are the test or assessment tool (i.e., simulations and games) or the platform via which the test is administered (i.e., mobile devices), emerging technologies play an important role in this step because they are the source of the scores being validated.

Conclusion or Decision to Use, Revise, or Abandon Test or Assessment Tool
Key features and characteristics: Primarily informed by the results of the validation study, that is, evidence that speaks to the job-relatedness (or lack thereof) of the test or assessment scores.
Impact of emerging technologies: Decisions should be carefully made by informed experts, including those trained in psychometrics and I-O psychology. Regardless of the fidelity of the assessment, we need to know whether we can make appropriate inferences regarding work performance from the obtained scores.

Work/Job Analysis

Gathering Information

Job analysis is recognized in both the scientific and professional literatures (e.g., Society for Industrial and Organizational Psychology [SIOP], 2003) and legal guidelines (Equal Employment Opportunity Commission [EEOC], 1978) as a pivotal initial step in test development and validation. Work (job) analysis (recently broadened to include competency modeling efforts as well) is a process via which information is gathered about work and jobs with the objective of identifying and describing what incumbents do, how they do it, and the knowledge, skills, abilities, and other characteristics (KSAOs) or competencies that are required to successfully
perform said job tasks and activities. (See Chapter 6 in this volume for more details concerning work analysis.) In the implementation of a work analysis, several decisions and choices must be made. For instance, one of the first decisions is the choice of information-gathering method. A wide range of methods is available to researchers and practitioners, ranging from interviews, observations, questionnaires, and the job analyst performing the tasks/activities to the use of materials and sources such as training materials, task inventories and checklists, employee log books and diaries, previous/old job descriptions, and the O*NET. With limited research demonstrating the superiority of one method over others, the general recommendation is to use multiple methods to permit a more complete information-gathering effort that balances the tradeoffs between the strengths and weaknesses of the various methods. The role or influence of the emerging technologies of interest is primarily in the form of facilitating the information-gathering process. Thus, for instance, whereas gamification is unlikely to influence or play a role in the information-gathering process, mobile devices (e.g., smartphones, tablets, and even laptops) can broaden the scope of the online administration of job analysis surveys and questionnaires. Furthermore, recognizing that they might pose their own set of challenges, the video capabilities of mobile devices have the potential to permit remote job analysis interviews, focus groups, and "video job analysis" involving the recording of activities as they occur in the workplace.

Collecting Ratings

The next step in the work analysis process is typically to obtain ratings on the major work behaviors and tasks that have been identified in the preceding information-gathering step. To this end, ratings on factors such as importance, time spent, frequency, consequences of error, difficulty, and task interdependence are collected in an effort to further elucidate and refine the list of major work behaviors and tasks. Next, the KSAOs that underlie the successful performance of the major work behaviors and tasks are identified. Once again, ratings of the KSAOs (on factors such as importance and needed-upon-entry) will be obtained. Finally, as an additional step toward developing the test specification plan, linkages between the KSAOs and the major work behaviors will be made, again by means of a questionnaire, resulting in a task-by-KSAO matrix.

Another decision in the collection of ratings pertains to the source of the ratings (i.e., the individuals who will provide the ratings). Using raters from multiple levels of the organizational hierarchy is encouraged because it permits the triangulation of the data and thus, in the aggregate, higher levels of completeness and accuracy. It should be noted that regardless of their position in the organizational hierarchy, all raters should be fairly knowledgeable about the job to provide informed ratings. Finally, because doing so has the additional advantage of giving every employee a voice and fostering a sense of participation, the recommendation is to sample all eligible respondents unless the sample sizes are too large to be manageable (Doverspike & Arthur, 2012). As with the information-gathering phase, gamification is unlikely to influence or play a role in the rating process. In contrast, mobile devices can once again broaden the scope of the rating process. Specifically, the extension of online data collection to mobile devices vastly increases the ease with which larger numbers of raters can be sampled and, correspondingly, a wider range of rater types as well. However, the small screen size of some mobile devices may limit the ability to use large linkage matrices of the type possible with traditional presentations on paper. Job analysis surveys may also have to be shortened to limit the amount of time respondents have to spend completing surveys on mobile devices.
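
The task-by-KSAO linkage matrix described above can be illustrated with a short Python sketch (using pandas) that pivots linkage ratings into a work-behavior-by-KSAO grid. The work behaviors, KSAOs, and ratings are invented for illustration; a real application would aggregate ratings from many raters.

    import pandas as pd

    # Hypothetical linkage ratings: how important is each KSAO for performing each major work behavior?
    # (e.g., 0 = not relevant ... 4 = essential)
    linkage_ratings = pd.DataFrame({
        "work_behavior": ["Resolve customer complaints", "Resolve customer complaints",
                          "Prepare shift reports", "Prepare shift reports"],
        "ksao":          ["Oral communication", "Problem solving",
                          "Written communication", "Attention to detail"],
        "rating":        [4, 3, 4, 3],
    })

    # Average across raters (here a single rating per cell) and pivot into a task-by-KSAO matrix.
    linkage_matrix = linkage_ratings.pivot_table(index="work_behavior", columns="ksao",
                                                 values="rating", aggfunc="mean", fill_value=0)
    print(linkage_matrix)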

Identifying KSAOs/Constructs to Be Assessed (and Work Performance Criteria)

In the context of personnel selection, the primary objective of the test development and validation process is the development of an assessment tool or predictor whose scores can be used to make inferences about future job/work performance. The demonstration of this then puts one
in a position to use the scores from the test or assessment tool for employment decision-making purposes. Consequently, this requires that one has a criterion against which the test is validated (i.e., the work performance criterion or criteria that one is trying to "forecast" with the use of the predictor scores). As such, one goal of the work analysis process is to determine or identify the specified work performance criterion or criteria. As illustrated in Figure 44.3, these criteria are typically the outcomes associated with the successful performance of the specified major work behaviors and tasks. As previously noted, as part of this process, the KSAOs that underlie the successful performance of the major work behaviors and tasks are identified. Deciding on the final list of KSAOs is based on the statistical analysis of the job analysis ratings and is also informed by the expertise, experience, and judgment of the job analyst, assessment researcher, or professional (Doverspike & Arthur, 2012). For instance, on the basis of the ratings, KSAOs that are linked to low-importance and low-frequency major work behaviors and tasks, and that are not needed upon entry, would typically not be assessed. In addition, despite what the ratings may indicate, professional decisions have to be made about the psychometric and practical feasibility of measuring the specified constructs because some constructs may be more amenable to measurement (e.g., GMA) than others (e.g., "vision," "inspiration"). Gamification, especially the use of simulations and games, may increase the need for precision and detail in the collection of KSAOs, especially as developers attempt to achieve 100% physical and psychological fidelity with the actual work environment. The need for such high levels of fidelity between the assessment and the job will put additional strains on the work analysis system in terms of depth and detail, making the process more similar to those carried out in human factors investigations. At the same time, other types of games and simulations seem to put less emphasis on specific KSAOs, using a broader, more work sample and behaviorally based approach (Wernimont & Campbell, 1968).
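
The screening logic described above, retaining KSAOs that are tied to important, frequently performed work behaviors and that are needed upon entry, can be sketched as a simple filtering rule. The thresholds, KSAOs, and ratings below are hypothetical, and in practice the final decision rests on professional judgment rather than a mechanical cutoff.

    import pandas as pd

    # Hypothetical aggregated job analysis ratings for candidate KSAOs (1-5 scales).
    ksao_ratings = pd.DataFrame({
        "ksao":              ["General mental ability", "Conscientiousness", "Vision", "Forklift operation"],
        "importance":        [4.6, 4.1, 2.1, 3.8],
        "frequency":         [4.8, 4.5, 1.9, 3.5],
        "needed_upon_entry": [True, True, False, False],
    })

    # Illustrative screening rule; cutoffs are assumptions, not a published standard.
    keep = ((ksao_ratings["importance"] >= 3.0)
            & (ksao_ratings["frequency"] >= 3.0)
            & ksao_ratings["needed_upon_entry"])
    ksao_ratings["assess_in_selection"] = keep
    print(ksao_ratings)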

Selecting or Developing the Test or Assessment Tool

This step entails determining the specific method(s) or approach(es) (e.g., interviews, assessment centers, SJTs, paper-and-pencil tests) for measuring the specified KSAOs or constructs. As noted by Doverspike and Arthur (2012), this may entail either selecting a previously developed assessment tool that measures the constructs/KSAOs of interest or developing an assessment tool from scratch. An important issue associated with this step of the test development and validation process is the pivotal distinction between constructs (what is being measured) and methods (how the construct is being measured; Arthur & Villado, 2008). It is also important to recognize the distinction among methods, modes, and delivery platforms. The SJT, for example, is a method that can be administered in different modes (e.g., text vs. video) on different delivery platforms (e.g., desktop computer vs. smartphone). These distinctions are important because they highlight the fact that emerging technologies such as simulations and games offer not only new methods of assessing specified constructs but also new and different platforms via which those methods can be implemented or delivered. Interestingly, a review of the simulation and gaming literature indicates that the issues noted by Arthur and Villado (2008) characterize these methods as well. Specifically, there is an absence of attention to the specific constructs measured by these assessments, with an almost exclusive focus on the methods (i.e., the simulations or games). Of course, within the context of the sign versus sample distinction (Wernimont & Campbell, 1968), an argument could be made that simulations and games are more aligned with a sample than with a sign approach to assessment, but such a position is not explicitly or clearly articulated by the developers of these assessment tools.

On a related note, the traditional test development and validation model usually entails an item-writing phase in which, on the basis of a test specification plan, items are generated, reviewed, revised, and finalized (Doverspike & Arthur, 2012). Indeed, this focus on items serves as the basis for psychometric item analysis procedures. However, in the context of emerging technologies such as simulations and games, it is unclear what constitutes an "item" because there is no single item, query, or problem to which the test taker provides an answer or response. For instance, in Arthur et al.'s (2015) Crisis in the Kodiak, an oil-rig disaster search-and-rescue simulation, the participant's tasks are to (1) shut off four burning oil valves (50 points each), (2) locate and heal the 20 survivors on the burning oil rig (10 points for each survivor healed), and (3) rescue the healed survivors (10 points for each survivor successfully evacuated off the oil rig). Hence, there are no traditional "items" (i.e., queries to which the participant provides an answer or response), but instead a series of tasks that must be completed in a fluid and dynamic fashion within a limited amount of time.

In summary, even if the overall test development and validation model remains the same, emerging technologies greatly increase the number of options for methods, modes, and delivery platforms, with an associated increase in the possible item types and scoring methods. Unfortunately, although a large body of knowledge and guidelines now exists on developing traditional multiple-choice tests (e.g., see Haladyna & Rodriguez, 2013), we know far less about the features that matter in the design of assessments based on emerging technologies. Nevertheless, it is important that these design choices be informed by psychological theories and constructs, and not be left to the information technologists.
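Because such simulations have no discrete items, the observed "score" is typically an aggregation of task accomplishments. The sketch below illustrates this using the point values reported above for Crisis in the Kodiak (50 points per valve shut off, 10 per survivor healed, 10 per survivor evacuated); the function is our illustrative reconstruction of that scoring logic, not the simulation's actual code, and real game-based assessments often also capture process data (time, errors, paths taken).

```python
# Illustrative reconstruction of task-based scoring for a simulation with no
# traditional items, using the point values reported for Crisis in the Kodiak.
# This is not the simulation's actual scoring code.

POINTS_PER_VALVE = 50       # four burning oil valves to shut off
POINTS_PER_HEAL = 10        # 20 survivors to locate and heal
POINTS_PER_EVACUATION = 10  # each healed survivor evacuated off the rig

def simulation_score(valves_shut: int, survivors_healed: int,
                     survivors_evacuated: int) -> int:
    """Aggregate task accomplishments into a single performance score."""
    if not (0 <= valves_shut <= 4 and 0 <= survivors_healed <= 20):
        raise ValueError("counts exceed the scenario's task limits")
    if survivors_evacuated > survivors_healed:
        raise ValueError("only healed survivors can be evacuated")
    return (valves_shut * POINTS_PER_VALVE
            + survivors_healed * POINTS_PER_HEAL
            + survivors_evacuated * POINTS_PER_EVACUATION)

# Maximum possible score: 4*50 + 20*10 + 20*10 = 600
print(simulation_score(valves_shut=3, survivors_healed=12, survivors_evacuated=9))  # 360
```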

Validation Study¹

Consonant with the prevailing unitarian view of validity (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014; SIOP, 2003), the "validation of personnel selection decisions is merely a special case of the more general validation process" (Binning & Barrett, 1989, p. 480). Hence, content-, criterion-, and construct-related validity are simply different strategies for demonstrating the construct validity of a test or measure; that is, what a test measures, how well it does so, and the accumulation of evidence that speaks to the extent to which the inferences drawn from the test scores are appropriate (Binning & Barrett, 1989). Consequently, because the unitarian framework treats content-related, criterion-related, and construct-related evidence as three of several evidential bases for demonstrating the construct validity of a test or measure (e.g., see AERA et al., 2014; SIOP, 2003), a decision must be made as to the most appropriate validation approach for the specified circumstances. If this decision results in the use of an empirical validation approach that generates a validity coefficient (e.g., criterion-related or construct-related), then one would design and implement an empirical correlational research study. In implementing such a study, design choices include the type of correlational design (e.g., predictive vs. concurrent) as well as the sample size and type (i.e., applicants vs. incumbents).

Concerning the role of emerging technologies, to the extent that they are the test or assessment tool itself (i.e., simulations and games) or the platform via which the test is administered (i.e., mobile devices, and perhaps Google Glass, the Apple Watch, and the Samsung Gear S2 in the future), these technologies play an important albeit indirect role in this step because they are the source of the scores being validated. In the case of simulations, especially those with close to 100% physical and psychological fidelity, there may be an argument as to whether the simulated work task should serve as the predictor, the criterion, or both (i.e., "perfect" overlap between the predictor and the criterion, as in high-end commercial aircraft simulators). In addition, as is the case with SJTs, developers may argue that validation is not needed for simulations or games because the development process itself guarantees job-relatedness. We would certainly argue that our traditional validation models apply to assessments based on emerging technologies and that the input of I-O psychologists is critical in designing appropriate validation studies. Even if one were to disagree with this position, we would still argue for the necessity of validation given the likely viewpoint of regulatory agencies in the United States. In particular, the existence of an algorithm does not preclude the need for professional involvement in the development process leading up to the creation of the algorithm, nor does it eliminate the need to validate the inferences made from machine-generated scores. In summary, it is our view that in terms of scientific, professional, and legal standards, the use of emerging technologies in employment-related decision making needs to be validated and held to the same psychometric standards as any other assessment tool (AERA et al., 2014; SIOP, 2003).
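For the empirical route described above, the core statistic is typically a correlation between predictor scores and a criterion measure. The sketch below assumes a concurrent design with simulated incumbent data and shows only the basic computation; it deliberately omits the corrections for range restriction and criterion unreliability, and the applicant-versus-incumbent design issues, that an actual validation study would have to address.

```python
# Minimal sketch of a criterion-related validity analysis (concurrent design,
# simulated data). Omits range-restriction and unreliability corrections.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 150                                              # incumbents with both scores
predictor = rng.normal(50, 10, n)                    # e.g., simulation or test scores
criterion = 0.3 * predictor + rng.normal(0, 10, n)   # e.g., supervisor performance ratings

r, p_value = stats.pearsonr(predictor, criterion)

# Approximate 95% confidence interval via Fisher's z transformation.
z = np.arctanh(r)
se = 1 / np.sqrt(n - 3)
ci = np.tanh([z - 1.96 * se, z + 1.96 * se])

print(f"observed validity r = {r:.2f}, p = {p_value:.4f}, "
      f"95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```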

Conclusion or Decision to Use, Revise, or Abandon the Test or Assessment Tool

The final step in the test development and validation model entails deciding whether to use the assessment tool for employment decision-making purposes, to revise or further refine it, or to abandon it. This decision is primarily informed by whether the validation study furnishes evidence of the job-relatedness of the test or assessment scores, but it also requires an analysis of practicality, utility, the impact on protected classes, and user acceptance and reactions. As argued above, these decisions should be made carefully by informed experts, including those trained in psychometrics and I-O psychology.

The use of mobile devices has raised a variety of practical and ethical questions. In particular, attention must be paid to security issues, including verification of the test taker's identity. Furthermore, because mobile devices allow for a more diverse and geographically distributed applicant pool, there are also issues of global distribution, translation, and accommodation of disabilities. A reliance on gamification, including games and simulations, brings its own set of practical concerns. This is especially true when job candidates must complete the assessments on mobile devices, where some of the concerns listed above, such as translation and global distribution, may be magnified. In addition, gamification may increase the information processing load, require greater effort and time from the candidate, and increase costs to the organization.
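One common way to begin examining the "impact on protected classes" mentioned above is to compare group selection rates, for example with the four-fifths (80%) rule of thumb associated with the Uniform Guidelines. The sketch below applies that heuristic to hypothetical applicant counts; a complete adverse impact analysis would also involve statistical significance testing and a judgment about practical significance.

```python
# Illustrative four-fifths (80%) rule computation on hypothetical applicant data.
# A complete adverse impact analysis would also include significance testing.

def selection_rate(hired: int, applied: int) -> float:
    return hired / applied

def impact_ratio(focal_rate: float, reference_rate: float) -> float:
    """Ratio of the focal group's selection rate to the highest (reference) rate."""
    return focal_rate / reference_rate

group_a = selection_rate(hired=60, applied=100)  # reference group, rate = 0.60
group_b = selection_rate(hired=30, applied=75)   # focal group, rate = 0.40

ratio = impact_ratio(group_b, max(group_a, group_b))
print(f"impact ratio = {ratio:.2f}")             # 0.67 < 0.80 -> flagged for review
```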

FINAL THOUGHTS, DISCUSSION, AND CONCLUSION

This chapter started with an acknowledgment that there have always been, and will always be, emerging technologies. So, rather than attempt to review a laundry list of emerging technologies, we sought to examine the extent to which new technologies, as exemplified by mobile devices and gamification, are compatible with traditional approaches to developing and validating tests. Put another way: "Do emerging technologies lead to disruptive innovations in the way we think about the traditional methods we use to develop and validate tests?" Although we may be biased by our professional affiliation, our conclusion is that existing approaches to test development and validation remain relevant and appropriate for evaluating the application of emerging technologies to assessment. That is, regardless of the fidelity of the assessment, the amount of fun created, or the technologies involved, the fundamental question remains whether we can make accurate inferences regarding future work performance from the scores obtained from the assessment. Emerging technologies, rather than reducing the need for validation, increase the importance of ensuring that decisions about the use of these devices and technologies for employee assessment and testing include input from assessment experts with backgrounds in I-O psychology. Thus, the good news is that our methods, approaches, and expertise are probably more needed and relevant than ever, but we need to address the way we do, share, and communicate research. For instance, we may need to expedite our research initiation and communication cycles to keep up with the pace of technological change and innovation. Hence, it is not surprising that our "emerging technology" research is more likely to be found in conference presentations (which have a short initiation-to-communication cycle) than in peer-reviewed publications (e.g., see Arthur et al.'s [2017] review).

The Need for Research

One of the current oddities is that it is often easier to do field research on emerging technologies (in the case of mobile devices, collecting millions of cases a year) than it is to do laboratory research. Nevertheless, we feel strongly that there is a need for cooperation among testing companies, consultants, and academics in order to carry out well-designed laboratory research. Although we would be among the first to argue for the increased publication of practice-oriented articles, there is also a need for theoretical development, as well as theory-guided research. For example, in the case of mobile devices, one theory (Arthur et al., 2017) is that certain types of devices increase information processing demands, which in turn lead to differences in test scores across device types. For gamification and games, a basic hypothesis is that gamification increases engagement, which in turn leads to improved user reactions, increased motivation, and potentially greater effort. Such mediated models can be tested in the field but are probably easier to test first in the laboratory.

Another concern is the interaction between the introduction of new technologies and the demographic characteristics of users. For example, concerning the role of mobile devices as a delivery platform, a detailed review of the literature reported in Arthur et al. (2017) indicated that demographic groups differ in the extent to which they use mobile devices to complete employment-related assessments, with African Americans, Hispanics, and women displaying higher mobile device usage. A resulting question is: "What implications does this have for the diversity of the candidates selected for employment?" A similar issue emerges regarding the reactions of various cultural and gender groups to the gamification of assessments.

It is not enough to conduct research; it must also be shared and communicated. This may be our greatest challenge, as the traditional journal publication model does not always allow for particularly rapid dissemination of results. In order to remain relevant and to contribute to the conversation on new technologies, we will have to find ways to expedite the process of peer review and professional publication.
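The mediated model sketched above (gamification increases engagement, which in turn improves reactions) lends itself to straightforward laboratory tests. The code below is a simulated, regression-based illustration of that logic; it is not a reanalysis of any published data, the effect sizes are invented, and a real study would typically estimate the indirect effect with bootstrapped confidence intervals and validated measures of engagement and reactions.

```python
# Simulated illustration of a simple mediation test:
# gamified format -> engagement -> applicant reactions.
# Not a reanalysis of published data; a real study would bootstrap the indirect effect.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200
gamified = rng.integers(0, 2, n)                       # 0 = standard test, 1 = gamified
engagement = 0.5 * gamified + rng.normal(0, 1, n)      # hypothesized mediator
reactions = 0.6 * engagement + 0.1 * gamified + rng.normal(0, 1, n)

# Path a: condition -> engagement
a_model = sm.OLS(engagement, sm.add_constant(gamified)).fit()
# Paths b and c': engagement and condition -> reactions
X = sm.add_constant(np.column_stack([engagement, gamified]))
b_model = sm.OLS(reactions, X).fit()

a = a_model.params[1]
b = b_model.params[1]
print(f"indirect effect (a*b) = {a * b:.3f}; direct effect c' = {b_model.params[2]:.3f}")
```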

Other Emerging Technologies

Admittedly, mobile devices and gamification represent only two potential technologies. Other chapters in this book address additional technologies. Some technologies worth noting include:

• Big data
• Mining Facebook and other social media to extract personality and other data
• Automated scoring of essays and written material
• Applications of machine learning
• The use of avatars (discussed in more detail in the next section)

Although we have restricted our attention to the direct impact on assessment, technologies affect selection in other ways as well. Technology leads to the creation of new jobs, as well as the elimination of some occupations. Organizations are also changed through technology, although we have yet to see the widespread emergence of virtual organizations, accompanied by the elimination of all jobs, as was predicted in the 1990s.

The Future

One emerging technology that we believe will significantly impact assessment in the near future is the use of artificial intelligence (AI)-enhanced avatars. The combination of AI, natural language processing, and realistic avatars is being used in assessment applications to enhance the applicant experience, increase realism, and deliver tailored feedback. Computer-generated avatars in one form or another have been used in assessments for almost a decade (see Fetzer & Tuzinski, 2013, for a review). Typically, they have been used to enhance the look and feel of SJTs and other simulations. In recent years, however, avatars, both human and computer-generated, have become more intelligent. Applications of such AI-enhanced avatars include guiding candidates through the hiring process, from initial application or resume submission through testing and final interviews, by answering questions, explaining the human resources hiring process, introducing company culture, providing functional position details, scheduling tests and interviews, and keeping applicants informed regarding their status in the process. In these applications, the goal is to improve the applicant experience, increase the likelihood that top candidates remain in the selection process, and improve their perceptions of the organization.

Another area where AI-enhanced avatar technology is being deployed is in providing tailored feedback on test results to individuals, typically in developmental, as opposed to selection, situations. These applications strive to marry the richness of a professional coach with cost effectiveness and 24/7 access. Avatar-based coaches understand the individual's profile based on the assessment results; they can go over the results, answer questions, recommend a course of action, keep people on track with reminders, set up a personalized dashboard for tracking progress, and even link the individual to other people, trusted others, and learning resources. While these applications are in their infancy, it is clear that AI-based avatars will take on more significant roles in the assessment, application, and feedback processes in the not-too-distant future. Interestingly, in some healthcare applications that use a human avatar, patients are more likely to provide detailed feedback to the avatar than to a live nurse or even a person over the phone. The same pattern has been found in retail applications where customers provide product feedback. It is likely that we will see similar findings when avatars are used as test administrators, as actual components of the assessment, and in providing developmental feedback regarding test performance.

If futurists and science fiction novelists are correct, someday soon technology will eliminate jobs because all decisions will be made by robots or machines (Autor, 2015; Frey & Osborne, 2013). Hopefully, and optimistically, I-O psychologists may be some of the last individuals working, matching the last few job applicants to the few remaining roles performed by people.

NOTE

1. It is recognized that in some instances, the implementation of a selection procedure may occur concurrently with the validation process.

REFERENCES American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. Armstrong, M. B., Collmus, A. B., & Landers, R., N. (2015). Game-thinking in human resource management. Poster presented at the 30th Annual Conference of the Society for Industrial and Organizational Psychology, Philadelphia, PA. Armstrong, M. B., & Landers, R. N. (2015). Game-thinking in Assessment: Applications of gamification and serious games. Symposium presented at the 30th Annual Conference of the Society for Industrial and Organizational Psychology, Philadelphia, PA. Arthur, W., Jr., Doverspike, D., Muñoz, G. J., Taylor, J. E., & Carr, A. E. (2014). The use of mobile devices in high-stakes remotely delivered assessments and testing. International Journal of Assessment and Selection, 22, 113–123. Arthur, W., Jr., Edwards, B. D., & Barrett, G. V. (2002). Multiple-choice and constructed-response tests of ability: Race-based subgroup performance differences on alternative paper-and-pencil test formats. Personnel Psychology, 55, 985–1008. Arthur, W., Jr., Keiser, N. L., & Doverspike, D. (2017). An information processing-based conceptual model of the effects of the use of internet-based testing devices on scores in employment-related assessments and tests. Manuscript submitted for publication. Arthur, W., Jr., Naber, A. N., Muñoz, G. J., McDonald, J. N., Atoba, O. A., Cho, I., Keiser, N. L., White. C. D., Glaze, R. M., Jarrett, S. M., Schurig, I., & Bennett, W., Jr. (2015). An investigation of skill decay and reacquisition of individual- and team-based skills in a synthetic training environment. Toronto, ON, Canada: American Psychological Association Division 19 Suite presentation at the 123rd Annual Convention of the American Psychological Association.


Winfred Arthur Jr. et al. Arthur, W., Jr., & Villado, A. J. (2008). The importance of distinguishing between constructs and methods when comparing predictors in personnel selection research and practice. Journal of Applied Psychology, 93, 435–442. Autor, D. H. (2015). Why are there still so many jobs? The history and future of workplace automation. The Journal of Economic Perspectives, 29, 3–30. Attali, Y., & Arieli-Attali, M. (2015). Gamification in assessment: Do points affect test performance? Computers and Education, 83, 57–63. Bailey, P., Pritchard, G., & Kernohan, H. (2015). Gamification in market research: Increasing enjoyment, participation engagement and richness of data, but what about data validity? International Journal of Market Research, 57, 17–28. Beaty, J. C., Nye, C. D., Borneman, M. J., Kantrowitz, T. M., Drasgow, F., & Grauer, E. (2011). Proctored versus unproctored internet tests: Are unproctored noncognitive tests as predictive of job performance? International Journal of Assessment and Selection, 19, 1–10. Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494. Dages, K., & Jones, J. (2015). Mobile device Administration: Does length or level of assessment matter? In N. A. Morelli (Chair), Mobile devices in talent assessment: The next chapter. Symposium presented at the 30th Annual Conference of the Society for Industrial and Organizational Psychology, Philadelphia, PA. Dale, S. (2014). Gamification: Making work fun, or making fun of work? Business Information Review, 31, 82–90. Davies, S. A., & Wadlington, P. L. (2006). Factor and parameter invariance of a five factor personality test across proctored/unproctored computerized administration. Paper presented at the 21st Annual Conference of the Society for Industrial and Organizational Psychology, Dallas, TX. Deterding, T. C., Sicart, M., Nacke, L., O’Hara, K., & Dixon, D. (2011). Gamification: Toward a definition. Proceedings of the CHI2011 Gamification Workshop, Vancouver, BC, Canada. Do, B. R., Shepherd, W. J., & Drasgow, F. (2005). Measurement equivalence across proctored and unproctored administration modes of web-based measures. Paper presented at the 20th Annual Conference of the Society for Industrial and Organizational Psychology, San Francisco, CA. Doverspike, D., & Arthur, W., Jr. (2012). The role of job analysis in test selection and development. In M. A. Wilson, W. Bennett, Jr., S. G. Gibson, & G. M. Alliger (Eds.), The handbook of work analysis in organizations: Methods, systems, applications, and science of work measurement in organizations (pp. 381–399). New York, NY: Routledge/Psychology Press. DuVernet, A. M., & Popp, E. (2014). Gamification of workplace practices. The Industrial Organizational Psychologist, 52(1), 39–44. Edwards, B. D., & Arthur, W., Jr. (2007). An examination of factors contributing to a reduction of subgroup differences on a constructed-response paper-and-pencil test of scholastic achievement. Journal of Applied Psychology, 92, 794–801. Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice. (1978). Adoption by four agencies of uniform guidelines on employee selection procedures. Federal Register, 43, 38290–38315. Fetzer, M., & Tuzinski, K. (Eds.) (2013). Simulations for personnel selection. New York, NY: Springer. Frey, C. B., & Osborne, M. A. (2013). 
The future of employment: How susceptible are jobs to computerisation? Oxford, UK: The Oxford Martin Programme on the Impacts of Future Technology. Retrieved from www.futuretech.ox.ac.uk/ Fursman, P. M., & Tuzinski, K. A. (2015). Reactions to mobile testing from the perspective of job applicants. Paper presented at the 30th Annual Conference of the Society for Industrial and Organizational Psychology, Philadelphia, PA. Gartner, Inc. (2011, November 9). Gartner predicts over 70 percent of Global 2000 organisations will have at least one gamified application by 2014. Retrieved from http://www.gartner.com/newsroom/id/1844115 [6/17/2015] Geimer, J., Sanderson, K., & Popp, E. (2015). Effects of gamification on test performance and test taker reactions. Symposium presented at the 30th Annual Conference of the Society for Industrial and Organizational Psychology, Philadelphia, PA. Golubovich, J., & Boyce, A. S. (2013). Hiring tests: Trends in mobile device usage. Paper presented at the 28th Annual Conference of the Society for Industrial and Organizational Psychology, Houston, TX. Gutierrez, S. L., & Meyer, J. M. (2013). Assessment on the go: Applicant reactions to mobile testing. In J. C. Scott & N. Morelli (Chairs), Mobile devices in talent assessment: Where are we now? Symposium presented at the 28th Annual Conference of the Society for Industrial and Organizational Psychology, Houston, TX. Gutierrez, S. L., Meyer, J. M., & Fursman, P. (2015). What exactly drives positive reactions to mobile device administration? Paper presented at the 30th Annual Conference of the Society for Industrial and Organizational Psychology, Philadelphia, PA.


Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. New York, NY: Routledge. Hamari, J., Koivisto, J., & Sarsa, H. (2014). Does gamification work? A literature review of empirical studies on gamification. Paper presented at the 47th Hawaii International Conference on System Science. Huizinga, J. (2000). Homo ludens: A study of the play-element in culture. London, UK: Routledge. Illingworth, A. J., Morelli, N. A., Scott, J. C., & Boyd, S. L. (2015). Internet-based, unproctored assessments on mobile and non-mobile devices: Usage, measurement equivalence, and outcomes. Journal of Business and Psychology, 30, 325–343. Impelman, K. (2013). Mobile assessment: Who's doing it and how it impacts selection. Paper presented at the 28th Annual Conference of the Society for Industrial and Organizational Psychology, Houston, TX. Kapp, K. M. (2014, Spring). What L&D professionals need to know about gamification. Training Industry Magazine, pp. 17–19. Kinney, T. B., Chang, L., Lawrence, A. D., & Moretti, D. M. (2015). Demonstrating criterion-related validity equivalence with PC, phone, and tablet test-takers. Paper submitted to the 31st Annual Conference of the Society for Industrial and Organizational Psychology, Anaheim, CA. Kinney, T. B., Lawrence, A. D., & Chang, L. (2014). Understanding the mobile candidate experience: Reactions across device and industry. Paper presented at the 29th Annual Conference of the Society for Industrial and Organizational Psychology, Honolulu, HI. Landers, R. N., Reddock, C. M., Cavanaugh, K. J., & Proaps, A. B. (2014). Talent assessment using mobile devices. In T. Kantrowitz & C. M. Reddock (Chairs), Shaping the future of mobile assessment: Research and practice update. Symposium presented at the 29th Annual Conference of the Society for Industrial and Organizational Psychology, Honolulu, HI. Lawrence, A. D., Wasko, L., Delgado, K., Kinney, T., & Wolf, D. (2013). Does mobile assessment administration impact psychological measurement? Paper presented at the 28th Annual Conference of the Society for Industrial and Organizational Psychology, Houston, TX. McClure Johnson, T. K., & Boyce, A. S. (2015). Selection testing: An updated look at trends in mobile device usage. In N. Morelli (Chair), Mobile devices in talent assessment: The next chapter. Symposium presented at the 30th Annual Conference of the Society for Industrial and Organizational Psychology, Philadelphia, PA. Morelli, N. A., Mahan, R. P., & Illingworth, A. J. (2014). Establishing the measurement equivalence of online selection assessments delivered on mobile versus nonmobile devices. International Journal of Selection and Assessment, 22, 124–138. O'Connell, M. S., Arthur, W., Jr., & Doverspike, D. (2015). Mobile assessment: The horses have left the barn . . . now what? Pre-conference workshop presented at the 29th Annual Conference of the Society for Industrial and Organizational Psychology, Philadelphia, PA. O'Connell, M. S., Delgado, K., & Kung, M. C. (2012). Does proctoring impact measurement methods differently? An evaluation in a high stakes testing environment. Paper presented at the 26th Annual Conference of the Society for Industrial and Organizational Psychology, San Diego, CA. Oprescu, F., Jones, C., & Katsikitis, M. (2014). I PLAY AT WORK: Ten principles for transforming work processes through gamification. Frontiers in Psychology, 5, 1–5. Parker, B., & Meade, A. (2015). Smartphones in selection: Exploring measurement invariance using item response theory. In N. Morelli (Chair), Mobile devices in talent assessment: The next chapter. Symposium presented at the 30th Annual Conference of the Society for Industrial and Organizational Psychology, Philadelphia, PA. Pearlman, K. (2009). Unproctored internet testing: Practical, legal, and ethical concerns. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 14–19. Petridis, P., Baines, T., Lightfoot, H., & Shi, V. G. (2014). Gamification: Using gaming mechanics to promote a business. Available at http://www.researchgate.net/profile/Panagiotis_Petridis/publication/263236374_Gamification__using_gaming_mechanics_to_promote_a_business/links/53e4b36c0cf2fb748710dbe4.pdf [6/17/15] Simpson, P., & Jenkins, P. (2015). Gamification and human resources: An overview. Retrieved from https://www.brighton.ac.uk/_pdf/research/crome/gamification-and-hr-overview-january-2015.pdf [6/12/15] Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author. Tippins, N. T. (2014). Technology and assessment in selection. Annual Review of Organizational Psychology and Organizational Behavior, 2, 5.1–5.32. Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W., Pearlman, K., Segall, D., & Shepherd, W. (2006). Unproctored internet testing in employment settings. Personnel Psychology, 59, 189–225. Wasko, L. E., Lawrence, A. D., & O'Connell, M. S. (2015). What matters in the test environment? Paper presented at the 29th Annual Conference of the Society for Industrial and Organizational Psychology, Philadelphia, PA.


Weekley, J. A., Hawkes, B., Guenole, N., & Ployhart, R. E. (2015). Low-fidelity simulations. Annual Review of Organizational Psychology and Organizational Behavior, 2, 18.1–18.28. Weiner, J. A., & Morrison, J. D. (2009). Unproctored online testing: Environmental conditions and validity. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 27–30. Wernimont, P. F., & Campbell, J. P. (1968). Signs, samples, and criteria. Journal of Applied Psychology, 52, 372–376. Wood, E., Stephens, K., & Sliter, K. (2015). Apples to oranges? Use and comparative scores for mobile and nonmobile selection assessments. Paper presented at the 2015 Annual Conference of the International Personnel Assessment Council, Atlanta, GA.


INDEX

Bold page numbers refer to figures. aberrant response patterns 944 – 5 absenteeism and medical treatment 530 abusive supervision 539 Academic Qualifications Rating (AQR) 712 accepting offers in recruiting models 171 – 3 achievement motivations 334 action-oriented knowledge 342 ADA Amendments Act (2008) 86, 282 adaptability, defined 449 adaptation, defined 449 adaptive and citizenship behavior at work: adaptive behavior, defined 448 – 9; adaptive performance measurement 460 – 1; citizenship, defined 450, 450 – 1; citizenship-related variables 461 – 2; cognitive ability 451, 454; conceptualization 448; conclusions 467 – 8; distal individual differences 451, 456 – 7; immediate/proximal determinants 458 – 60; impact on organizational outcomes 465 – 6; individual differences predictors 451 – 76, 452–3; measurement 460 – 2; moderators and mediators 462 – 4; motivations, interests, and previous experience 455 – 6; overemphasis on 466 – 7; personality dimensions 454 – 5; previous experience in 455 – 6 adaptive performance 401, 449, 560 adaptive transfer method 461 adaptive value 338 – 9 adaptive workers 538 advanced persistent threats (APTs) 916 advantaged groups 660 adverse impact: Age Discrimination in Employment Act 650 – 1; defined 41; discrimination 617 – 18; employee selection constructs 373; job-relatedness and 645 – 7; measurement of 634 – 6; personality measurement and validity 315 – 16; science of selection 690; service/sales jobs selection 789 – 90 Advisory Panel for the Dictionary of Occupational Titles (APDOT) 877, 907 aerobic capacity 279, 280 – 2, 292 Age Discrimination in Employment Act of 1967 (ADEA): adverse impact 650 – 1; claim-based discrimination 688; cognitive ability and 262; employment practice laws 593; national differences 803; Title VII and 632; Uniform Guidelines and 617; see also discrimination aggregate personality 300 aggregation, defined 870

air crew selection 712 Air Force Human Resources Laboratory 712 Air Force Officer Qualifying Test (AFOQT) 711 Albemarle Paper Co. v. Moody (1975) 637, 639 – 40, 642, 689 Albertsons v. Kirkingburg (1999) 652 – 3 Alcatel-Lucent malware report 924 alternate test formats 189 – 90 alternate titles database 893 – 4 altruistic citizenship 466 Amazon’s Mechanical Turk (MTurk) 920 American College Test (ACT) 711 – 12 American Educational Research Association (AERA) 3, 282, 601, 980 American Psychological Association (APA): employment-related psychometrics 600; ethical standards 408, 577, 593; introduction 3; job simulations 282; role of ethics codes 584 – 6; unitarian view of validity 980 Americans with Disabilities Act (ADA): accommodations in testing 189, 282; disability-related legal decisions 369 – 70, 651 – 3; legal principles 632; national differences 803; overview 290; prediction rationale 86; research with human participants 593; wellness programs and 533 – 4; work analysis 639 anaerobic energy 280 – 1 analysis of variance (ANOVA) 5, 11, 18 – 19, 27 analyze data stage 208 Anger (Neuroticism) trait 488 anti-forensic methods 916 applicant tracking systems (ATS) 858, 860 – 1, 921, 958 – 9 application program interfaces (APIs) 902 apprenticeships 761 Apsley v. Boeing (2013) 636 Armed Forces Examining and Entrance Stations (AFEES) 703 – 4 Armed Forces Qualification Test (AFQT) 699 – 700, 706 Armed Services Vocational Aptitude Battery (ASVAB) 44, 699 – 700, 704 – 8, 707, 713 – 14 Army General Classification Test (AGCT) 702 – 3 Army Research Institute (ARI) 442 assertive behavior in interviews 173 assessment centers (ACs) 257, 310 – 11, 588 – 9, 749 – 50, 841

987

Index Assessment of Background Life Experiences (ABLE) 444 assessments, employment decisions: behavioral assessment support 859 – 60; clinical assessment 388 – 90; compensatory methods 390 – 3; conclusion 395; delivery modes 241; evaluation of tools 378 – 81; introduction 388; multiple combination methods 394 – 5; multiple-hurdle approach 393 – 4; see also psychological assessments assessments, feedback: from applicants 409 – 14; from assessment centers 420; benefits and costs 414 – 19; candidate’s perspective 416; case examples 406 – 7; conclusions 422; executive selection 407, 421; implications for practice 419 – 21; implications for research 421 – 2; internal vs. external candidates 416; introduction 406; online methods 417 – 18; opportunities with different methods 417; organization’s perspective 415; professional standards 408 – 9; self-image and 412 – 14; test feedback 421; test preparation and coaching 418 – 19; unemployed individuals 419 assessments, selection methods and outcomes: behavior consistency method 746; cognitive ability tests 350, 746 – 7; construct-based assessments 748 – 9; impact 744; interviews 750 – 1; leaders/executives selection 838 – 42; leadership dispositions 754; leadership readiness 742 – 4; multiple selection techniques 743 – 4; personality measures 747 – 8; tools and techniques 744 – 51; transparency of 744 assets vs. cost 126 attribute requirement descriptors 136 – 7, 139 augmented reality 871 authority and alignment principle 221 – 2 Aviation and Nautical Information Test 712 avoid performance goal orientation (APGO) 333 – 4 banding in academic situations 45 bands in personnel selection procedures 615 Baty v. Willamette Industries, Inc. (1999) 649 Beck v. University of Wisconsin Bd. of Regents (1996) 653 behavioral assessment support 859 – 60 behavioral dimensions 437 – 8 behavioral indicators (BIs) 153 behaviorally anchored rating scales (BARS) 432, 434 behavioral measures 509 – 10 behavioral social intelligence 346 behavioral tendency instructions 310 behavior consistency method 746 behavior observation scales (BOS) 433 behavior summary scales (BSS) 432 – 3, 442 Bernard v. Gulf Oil Corp. (1989) 641 Berndt v. Kaiser Aluminum & Chemical Sales, Inc. (1986) 648 between-sample variance 97 Bew v. City of Chicago (2001) 645 bias: absence of bias in criterion measures 76; personnel selection procedures 613 – 14; predictive bias 266 – 8, 267; sample estimates 104; unfairness bias 626 big data (BD): analytics 958 – 61, 959; conclusion 963 – 4; framework for staffing 950 – 3; gathering of 953 – 6, 955; introduction 949 – 50; presentation and visualization 961 – 3, 962, 963; storage of 956 – 8, 957; use of 590 – 1

988

Big-Five traits: adaptive performance and 455; development of 953; hierarchical taxonomy of human abilities 184; leaders/executives selection 837 – 8; personality dimensions 747 – 8, 784 – 6; validity of 98 biodata: blue-collar selection 770; industrialorganizational (I-O); inventories of 787 – 8; leadership questionnaires 840; personality measurement and validity 307 – 8; questionnaires for 840; scales 745; tests for 185 – 6, 417 Biographical Inventory with Response Validation 712 biomechanical analysis in physical performance tests 278 – 9, 279 Biondo v. City of Chicago (2004) 647 bivariate relationships 356 – 7 blue-collar selection: applicant population issues 763 – 4; biodata 770; conclusion 777 – 8; ­contextual considerations 761 – 5; ­contingent workers 777; definition/­boundaries 760 – 1; English language proficiency 764; ­grandparenting rules 776; higher-level jobs 776 – 7; introduction 760; journeymen selection 761, 776; knowledge and experience tests 767; labor unions 762 – 3; partnerships with educational/governmental institutions 764 – 5; personality tests 769 – 70; physical ability tests 768 – 9; policy consideration 773 – 5, 774 – 5; realistic job previews 770 – 1; selection process considerations 771 – 3; standardization 771; technology 772 – 3; tenure considerations 761 – 2; test security 771 – 2; tool considerations 765 – 71; work environment 761; work samples 767 – 8 Bona Fide Occupational Defense (BFOQ) 643 Borkowski v. Valley Central (1995) 639 – 40 Boyd v. Ozark Air Lines (1977) 643 Bradley v. City of Lynn (2006) 637 Bradley v. Pizzaco of Nebraska, Inc. (1993) 643 braindump sites 918 Briscoe v. City of Newhaven (2010) 638 Brown v. Board of Education (1954) 687 Bultemeyer v. Fort Wayne (1996) 653 Bureau of Labor Statistics (BLS) 781 Burlington Industries, Inc. v. Ellerth (1998) 649 Burlington N. & SFR Co. v. White (2006) 649 business-unit-level value 116 – 17, 123 – 5 Bus Operator Selection System (BOSS) 727 Cadet Background and Experience Form (CBEF) 711 California Psychological Inventory (CPI) 674 Canadian Psychological Association 593 career dimensions 836 CareerOneStop 898 caring/benevolence principle 580 – 1 case law in legal context of selection 687 – 9 Castaneda v. Partida (1977) 635 causal ambiguity 119 CEB Leadership Study 199 Center for Creative Leadership 833, 840 – 1 Center of Excellence (COE) 208, 213 central tendency and dispersion 262 – 6, 264 change management 228 – 30, 229 chief executive officers (CEO) 833, 847; see also leaders/executives selection Chi-square test of association 635 Civil Rights Acts (1964, 1991) 262, 593, 617, 632, 640, 687, 803

Index Civil Rights Commission (CRC) 616 Civil Service Commission (CSC) 600, 616, 725 civil service examinations 723 – 4 classical test theory (CTT): conceptual perspective vs. estimation methods 20 – 1; goal of reliability estimation 16; introduction 3 – 4; limitations of 18; logic of scoring examinees 939; overview 7 – 12; summary 15; violations of 26 Classic Model of performance 558 Classification of Instructional Programs (CIP) 897 Coates v. Sundor Brands, Inc. (1998) 649 Coding Speed (CS) 714 coefficient of equivalence and stability (CES) 18 cognitive ability: acceptability of 251 – 4, 254; adaptive and citizenship behavior at work 451, 454; age differences 267; central tendency and dispersion 262 – 6, 264; conclusion 269 – 70; counterproductive work behavior and 260, 489 – 90; criterion-related validity evidence 258 – 62, 259, 261; definitions and theoretical underpinnings 254 – 5; future challenges 269; group differences 262 – 8, 264, 267; introduction 251; job performance predictors and 259 – 62, 261; leaders/executives selection 837, 839; measurement of 256 – 8; service/sales jobs selection 788; structure of 255 – 6; team composition research 823; tests for 350, 746 – 7; test validities 259, 259; validity differences and predictive bias 266 – 8, 267 cognitive antecedents 546 – 7 cognitive styles: construct-oriented studies 329 – 32; current concerns and emerging issues 331 – 2; study of 330; varieties of 330 – 1 cognitive analysis: task (CTA) 147 – 51; work (CWA) 148 compensatory models 286 – 7, 614; of assessment criteria 390 – 3; of scoring 80 – 1 competency assessment 754 competency modeling (CM) 152 – 4 competitive advantage of selection: alignment of selection and strategy 120 – 3, 121; businessunit-level value 116 – 17, 123 – 5; conclusion 129 – 30, 129 – 30; critical mass 128 – 9; diversity and 127 – 8; global considerations 127; human capital theory 119 – 20; human resource activities 129; introduction 115 – 16; as lever/barrier to change 126 – 7; multilevel selection 123 – 5; resource-based view 118 – 19; retention and 125 – 6; strategic human resource management 116, 117 – 20; sustainable competitive advantage 118, 123; talent as asset vs. cost 126; talent segmentation 128 Comprehensive Assessment of Team Member Effectiveness (CATME) 825 – 7 computer-based test (CBT) 773 computerized adaptive rating scales (CARS) 433 – 4 computerized adaptive testing (CAT) 187, 242, 404, 931, 941 – 4, 942, 943 computerized scoring algorithms 398 – 9 conceptual differentiation 331 conceptual equivalence 801 – 2 concurrent validation designs 48 configural perspective 118 confirmatory factor analytic (CFA) frameworks 4, 20 – 5, 402 congeneric tests 5, 20 conscientiousness measures 5, 357 consequentialist theories 579 – 80

consequential validity 39 construct-based assessments 748 – 9 construct-irrelevant variation 6 construct-oriented studies: cognitive styles 329 – 32; conclusion 339; introduction 326; motivational constructs 332 – 6; practical considerations 336 – 7; strategic agenda 338 – 9; values 326 – 9 construct validity 37, 59, 344, 620 – 1, 625 content-format confusion 257 content validity 34, 612, 620, 625, 640, 956; evidence 35 context-based advertising 174 context-based selection 199 – 200 contextual performance 559 – 60 contingency perspective 118 contingent workers 777 continuous learning 300 convergent validity 603 – 4, 625 Cooperation (Agreeableness) trait 488 – 9 core and contingent competencies 813 – 15 core self-evaluations 413 corrected coefficients 106 – 8 counterproductive work behavior (CWB): age variables 490 – 1; assessment of 477 – 8; attitudes and emotions 492; climate factors 495 – 6; cognitive ability and 260, 489 – 90; consequences 496; demographic and background variables 490 – 1; employee selection implications 498 – 9; environmental factors 492 – 6; Five-Factor Model (FFM) 488 – 9; gender variables 490; human resource management practices 494 – 5; individual differences 479, 480 – 4, 485 – 7; individual personality traits 489; integrity tests 479; introduction 476; nature of 476 – 7; organizational justice 494; organizational tenure 491; person and environment 496; potential antecedents of 478, 479; relationships with supervisors 493 – 4; social networking websites 497 – 8; stressors and 492 – 3 Counterproductive Work Behavior Checklist (CWB-C) 478 Counterproductive Work Behavior interpersonal (CWB-I) 477 – 8, 495 Counterproductive Work Behavior organizational (CWB-O) 477 – 8 Court of Justice of the European Union (CJEU) 689 covariance in CFA model 22 – 3 criterion development 135 criterion-related validity 36 – 7, 344, 611, 620, 640 – 2, 970 criterion theory and measurement: business unit strategy 567 – 8; challenges and mitigation strategies 521 – 4; conclusions 568 – 9; deficiency in 560 – 1; definition and assumptions 556 – 7; health and well-being 561 – 3, 562; introduction 554; in I-O psychology 556 – 63, 558, 562; multilevel issues in performance and success 563 – 7; outmoded assumptions 555 – 6; service/ sales jobs selection 791; success, defined 554 – 8; validity studies 258 – 62, 259, 261, 285, 287; work-family conflict 565 – 7 Critical Incident Technique 139 cross-cultural/cross-national differences 801 – 3 cross-mode equivalence 861 – 2 cross-situational consistency hypothesis 93 – 4 crystallized intelligence 255 – 6 cultural adaptation in recruitment 173 – 4

989

Index cultural differences in construct-oriented studies 337 culturally agile professionals 804 – 8 curvilinear citizenship relationships 464 cut scores 45 – 7, 197 – 8, 591 – 2, 614 – 15, 618, 626, 644 – 5; critical 644; effective cut score 644 cybersecurity issues in selection: additional issues 920 – 2, 921; advanced persistent threats 916; anti-forensics 916; breach prevention/response 916 – 17; conclusion 926; current selection methods 917 – 20; current trends 913 – 17; Distributed Denial of Service (DDoS) attacks 915; future research directions 924 – 6, 925; Internet of Things 924; introduction 913; malware 914 – 15; Man-in-the-middle (MITM) attacks 915 – 16; recommendations 922 – 4; social engineering 914 Cyber Test 714 Dalton v. Suburu-Izuzu (1998) 653 data collection: establishment data collection 883; international legal environment for selection 659 – 60; occupational analysts (OAs) data collection 885; Occupational Information Network 882 – 6, 883; occupation expert (OE) data collection 884 – 5; operations 885; status of 886; validation studies for primary studies 35 – 9; work analysis 141 – 3 data dictionaries 895 Data Protection Act (UK) 219 data supply 870 data types 870 data usage 868 – 70 Davis v. Dallas (1985) 643 decision-making modes 44 – 8, 234 Dees v. Johnson Controls (1999) 649 Defense Advisory Committee on Military Personnel Testing 708 deontological theories 579 – 80 Detailed Work Activities (DWAs) 886 – 90 Detroit Police Association v. Young (1979) 646 Development Dimensions International (DDI) 739 deviance behavior 476 – 7 diagnosed outcome (trait level) 939 Diagnostic Analysis of Nonverbal Accuracy (DANVA2) 350 Dictionary of Occupational Titles (DOT) 138 – 9, 874 – 7, 897, 905 differential validity 266 dimensionality of constructs in construct-oriented studies 338 disability-related legal decisions 651 – 3 disadvantaged groups 660 discriminant validity 603 – 4 discrimination: adverse impact 617 – 18; differences in evidence 684; employment discrimination 688; evidence of 684; item discrimination 932; laws against 662, 667, 668, 668; prima facie evidence of 669; see also Age Discrimination in Employment Act of 1967 distal individual differences 451, 456 – 7 Distributed Denial of Service (DDoS) attacks 915 distributive justice 409 diversity and selection 127 – 8 domain sampling 135 dominance models 936 Dothard v. Rawlinson (1977) 643 downsizing efforts 647 – 8

990

drama-based training 540 Dun & Bradstreet (D&B) 884 – 5 Dutch Association of Psychologists 674 Dutch Committee on Tests and Testing (COTAN) 674 Dutifulness (Conscientiousness) trait 488 – 9 dynamically administered tests 186 – 7 dynamic change 228 – 30, 229 Educational Testing Service (ETS) 917 effective weights 193 18-dimension taxonomy of performance ­categories 436 emerging technologies: conclusion 981 – 3; future of 982 – 3; gamification 871, 971 – 5, 972, 973; introduction 967 – 8; mobile devices 968 – 75; research needs 981 – 2; traditional test ­development and validation 975 – 81, 976, 976 – 7; work/job analysis 977 – 9 Emotional Competence Inventory (ECI) 348 emotional intelligence (EI): cognitive ability tests 350; conceptual framework 353, 353 – 4; conclusions 347; defined 343 – 5; developing better measures 354 – 5; method-construct distinction 356; multilevel perspective 358; multiple-choice SJTs 351; predictor and criterion matching 355 Emotionality (Openness) trait 488 – 9 empirical weighting schemes 193 employee selection: assessing outcomes 237; benchmarking 240 – 1, 241; beyond hiring and HR management 231, 231; business value of 232 – 3; challenges to business value of 227 – 8; cognitive ability measures 251 – 4, 254; ­conclusion 244 – 5; counterproductive work behavior 498 – 9; dynamic change and change management 228 – 30, 229; effectiveness drivers 237 – 8; executive team 230; expatriate selection 212 – 13; future of 242 – 3; governance and 206, 216 – 17; high vs. low scores 235, 235 – 7, 236; interpersonal processes 232 – 8, 234, 235, 236; introduction 226; issues with methods of 588 – 91; job candidates 231; line managers 230; management and 238 – 40, 239; multimedia immersive simulations 243; multiple ­organizational stakeholders 230 – 1; as organizational process 228 – 31, 229, 231; outcomes of 233; pre-employment selection 292, 770; selection program managers 231; traditional model 226 – 7; utility analyses 233 – 5, 234; see also blue-collar selection; competitive advantage of selection; cybersecurity issues in selection; ethics of employee selection; international differences in legal context of selection; international legal environment for selection; leaders/executives selection; legal principles in employment selection; military workforce selection; personnel selection procedures; psychometric theory in personnel selection; service/sales jobs selection; sustainable selection programs; team membership selection; technology employee selection; Uniform Guidelines on Employee Selection Procedures; validation considerations in selection systems employee selection, constructs: administration of tests 381 – 3; administrative challenges 374 – 5; administrative concerns 375 – 7; adverse impact 373; alternative considerations 373 – 4; applicant

Index reactions 378; conclusions 386; evaluating assessment tools 378 – 81; hiring decisions ­consequences 377; introduction 367 – 9, 369; measurement of 369 – 72; methods of measuring 372 – 8; organization reactions to 377 – 8; test validity information 379; timing considerations 372 – 3; using test scores 383 – 6; validation strategies 379 – 80 employee selection, methods and outcomes: assessment principles for leadership readiness 742 – 4; assessment tools and techniques 744 – 51; case studies 751 – 4; conclusions 755; globalization 741; high-velocity hiring 751 – 3; impact of 744; introduction 738 – 40; leader readiness 742; leadership selection trends 740 – 2; multiple selection techniques 743 – 4; organizational commitment declines 741 – 2; pipeline approach to leadership positions 753 – 4; screening methods 745 – 6; tests and inventories 746 – 50; employee skill strategy 209 – 10 employer job postings 893 employment and credentialing tasks 605 – 6 Employment and Training Administration (ETA) 876 Employment Equity Act (1998) 674 enforcement of laws related to selection 687 – 9 English language proficiency 764 enlisted personnel selection 707 – 8 entry-level: leaders 739; leader selection 745 entry-level leader selection 745 environmental working conditions 279 – 80 The EQ Difference (Lynn) 749 Equal Employment Opportunity (EEO) 631 – 2, 648 – 9 Equal Employment Opportunity (EEO) Act (1972) 616 Equal Employment Opportunity Commission (EEOC) 600, 616, 632 – 4, 648 – 53, 679, 764 Equal Opportunity Coordinating Council 616 Equal Pay Act of 1963 (EPA) 632 Equal Protection Clause 683 Ercegovich v. Goodyear Tire and Rubber Co. (1998) 648 ergonomic analysis in physical performance tests 278 – 9, 279 Ernst et al. v. City of Chicago 291 error-prone situations 333 essential task identification 278 essential tau-equivalence 9 establishment data collection 883 estimation traditions 25 – 7, 26 ethics of care 580 – 1 ethics of employee selection: assessment centers 588 – 9; big data 590 – 1; computer-/webbased testing 589 – 90; conclusion 593 – 4; cut scores 591 – 2; dilemmas with 579 – 84; expert testimony 592 – 3; forms of 581 – 4; individual level assessments 588; introduction 575; issues and problems 586 – 93; meta-issues 575 – 9; ­multiple responsibilities 588; principles of 579 – 81; ­professional competence 587; retesting issues 592; role of ethics codes 584 – 6; selection method issues 588 – 91; test data misuse 592; test security 587 – 8; unionized organization 591; universal interests 578 – 9; validity issues 587 ETS WorkFORCE Assessment 944 European Federation of Psychologists’ Associations 599

European Union (EU) Directive on Data Protection 804, 869 European Value Survey 328 evaluating assessment tools 378 – 81 examinee identification 863 Excitement Seeking (Extraversion) trait 488 expatriate selection 212 – 13 expectancy charts 197 expected a posteriori (EAP) estimation 941 expected to perform scale 278 experience requirements 879 expert-only strategy 213 – 14 expert-owner strategy 214 expert testimony 592 – 3 external evaluation of citizenship 461 face-to-face feedback 418 fairness principle 580, 613, 62 false positive/negative errors 41 – 2 family conflict and work stress 542 Faragher v. City of Boca Raton (1998) 649 Federal Aviation Administration (FAA) 511 Feliberty v. Kemper Corp. (1996) 653 field independence 330 financial metrics 522; and perspective in evaluation 238 Fink v. New York City (1995) 653 Fisher’s Exact Test 635 Fisher v. University of Texas (2013) 683 fit, defined 844; person-organization (P-O) fit 169, 198, 207, 821; person-team (P-T) 207 – 8; Fitzpatrick v. Atlanta (1993) 643 Five-Factor Model (FFM) 301 – 2, 312 – 13, 488 – 9, 541 Fleishman Ability Requirement Scales (ARS) methodology 139 Flight Officer Aptitude Rating (FOFAR) 712 fluid intelligence 255 – 6 forced-choice item response formats 304 – 6 Force Personnel Center 712 for-research-only performance ratings 432 four-fifths rule 636 “frame-of-reference” training 462 Frazier v. Garrison I.S.D. (1993) 636 Functional Job Analysis 139 game-thinking 971, 974 gamification 871, 971 – 5, 972, 973 Geller v. Markham (1980) 651 gender differences and stress 542 General Aptitude Test Battery (GATB) 266 general cognitive ability 443 General Education Development (GED) 700 generalizability theory (G-theory): conceptual perspective vs. estimation methods 20 – 1; goal of reliability estimation 16; overview 8, 10 – 12; ­random-effects model underlying 18 – 20; ­summary 15; violations of 26 Generalized Graded Unfolding Model (GGUM) 315, 936 – 7, 937 generalized work activities (GWAs) 438 general job performance taxonomy 436 – 7 general mental abilities (GMAs) 227, 233, 766 – 7 Gentry v. Export Packing (2001) 649 geographic disbursement 798 Gillespie v. State of Wisconsin (1985) 641, 644 – 5

991

Index globalization 741, 797 global leader selection 805 global mobility functions 808 global positioning satellite (GPS) data 951 global recruiting 173 – 4 global strategic initiatives 797 GLOBE project 845 governance and selection 206, 216 – 17 governing policies and rules in validation ­considerations 81 – 8 Government Accountability Office (GAO) 511 grade point average (GPA) 355, 646 Graduate Record Exam (GRE) 917 graphic rating scales (GRS) 433 Gratz v. Bollinger (2003) 646 Grenier v. Cyanamid Plastics (1995) 653 GRE Psychology test 398 – 9 grid ratings 843 Griggs v. Duke Power (1971) 639 – 40, 689 group administration 382 group-level variables 300 Group Task Circumplex 462 Grutter v. Bollinger (2003) 646 Guardians v. Civil Service (1980) 639 – 41, 644 – 5 Guidelines for Computer-Based and Internet-Delivered Testing (ITC) 862 – 3, 868 Gulino v. New York State Education Department (2006) 641 Handbook of Multisource Feedback (Bracken) 841 hands-on-throttle-stick (HOTAS) device 919 Hare v. Potter (2007) 650 HARKing (hypothesizing after the results are known) 49 – 50 Hayden v. Bollinger (2003) 646 Hayden v. County of Nassau (1999) 646, 650 Hazelwood School Dist. v. United States (1977) 635 Hazen v. Biggins (1993) 651 HDS dark-side personality measure 315 health and safety of employees: conclusions 548; healthcare costs 531 – 2; interpersonal relationships 539 – 40; introduction 530 – 1, 531; job-level stressors 537 – 9; occupational safety 543 – 8, 544; organizational-level stressors 536 – 7; organizational wellness programs 532 – 4; personal characteristics 541 – 2; safety behavior predictors 546 – 7; safety compliance 547; safety criteria and systems 544 – 5; safety participation 547 – 8; selecting healthy workers 534 – 5; work and non-work interface 542 – 3; workplace accident predictors 545 – 6; work stress 535 – 43 Health Insurance Portability and Accountability Act (HIPAA) 219, 533 Hedberg v. Indiana Bell (1995) 653 HEXACO model 301 – 2 hierarchical task analysis 148 higher-level jobs selection 776 – 7 high-fidelity multimedia simulations 918 high-fidelity physical abilities tests 376 high-performance individuals 146 – 7 high-performance workplaces 145 – 6 high-performance work systems 118 high-potentials selection 834 – 5 high-quality evaluation information 523, 524 – 5 high-velocity hiring 751 – 3 hiring decisions 231, 231, 377

HIV testing 709 Homeland Security (DHS) 511 Horace v. Pontiac (1980) 643 host country nationals (HCNs) 212 Human Capital Management Trends study 747 human capital theory 119 – 20 human resource management (HRM) 844 human resources (HR): administrative costs 368; beyond hiring and HR management 231, 231; competitive advantage of selection 115, 129; criterion models 555; criterion-related validity 641 – 2; individual objectives in 514; information system for 154; Internet-based HR systems 869; leadership insight and growth improvements 739; management practices 494 – 5; processes 64; recruiting models 165; selection method view 745; selection strategy 206, 208 – 16; transnational strategy of MNCs 800; validated selection tools 781; work-family research 566; see also ethics of employee selection Hunter, Jack 252 Hyland v. Fukada (1978) 643 ideal point: model 936; response methods 305 immediate proximal determinants 458 – 60 imprecise inferences from corrected validities 105 – 8, 106 incremental contributions to overall criterion prediction 77 – 8 incremental validity 313 Incumbent and Occupational Expert (OE) data 893 Index of Vocal Emotion Recognition (Vocal-I) 350 individual test administration 382 individual differences: counterproductive work behavior 479, 480 – 4, 485 – 7; distal individual differences 451, 456 – 7; measures of 661 – 2, 663 – 6; predictors of 451 – 76, 452 – 3, 525 – 6; validation considerations 67 – 8 individual health measurements 561 – 3, 562 individual performance objectives 512 – 15, 513, 514 individual psychological assessment (IPA) 841 – 2 individual workplace behavior 510 – 11 inferential variant 95 information-gathering process in job analysis 978 information system for human resources 154 information technology (IT) 187, 921 – 2 instrumental stakeholder theory 578 instrumental values 327 intellectual property (IP) theft 922 intelligence constructs: assessment center exercises 351 – 2; bivariate relationships 356 – 7; conceptual framework 353, 353 – 4; defined 255; developing better measures 354 – 5; emotional intelligence 343 – 5; epilogue 358 – 9; future research strategies 354 – 9; intelligence quotient (IQ) 650; interviews 350; introduction 342; longitudinal validation designs 357 – 8; measurement approaches 347 – 52, 348; method-construct distinction 355 – 6; multilevel perspective 358; other reports 349; performance-based tests 349 – 50; practical intelligence 342 – 3; predictor and criterion matching 355; self reports 348 – 9; situational judgment tests 350 – 1, 356; social intelligence 345 – 6 intelligence quotient (IQ) 650 intentional distortion 305 – 6

Index interactive voice response (IVR) 183 interactivity, defined 862 interconstruct relationships 339 Intermediate Work Activities (IWAs) 886 – 7, 889 – 90 International Archive of Education Data (IAED) 804 international assignees 805 – 6 International Data Commission (IDC) 918 international differences in legal context of selection: evidence of discrimination 684; final thoughts 691 – 2; introduction 678; legal protection, enforcement, case law 687 – 9; levels of authority 688 – 9; model of explanation 684 – 90, 685; preferential treatment 682 – 3, 686; protected group status 678, 680 – 1, 682, 682, 687; reactions to 686 – 7; science of selection 690; zeitgeist in 684 – 6, 685 international legal environment for selection: advantaged/disadvantaged groups 660; ­consequences of violation 669; data collection methodology 659 – 60; discrimination laws 662, 667, 668, 668; discussion 675 – 6; individual difference measures 661 – 2, 663 – 6; limited or banned methods 669, 670 – 3, 674; minority groups, preferential treatment 674 – 5; prima facie evidence of discrimination 669; science-based employment selection 675; women in the workplace 660 – 1 International Personality Item Pool (IPIP) 184 International Standards Organisation 599 International Task Force on Assessment Center Guidelines 593 International Task Force on Assessment Center Operations 599 International Test Commission (ITC) 74, 187, 599, 862, 868, 920 Internet-based HR systems 869 Internet-based recruitment 168 – 9, 856 Internet-based testing (IT): deployment 864; device 968 – 7 Internet of Things 924, 952 Internet recruitment case law 633 – 4 interpersonal processes 232 – 8, 234, 235, 236 interpersonal relationships 539 – 40 inter-rater reliability 103 – 4 interview measures 820 interviews in cognitive task analysis 149 – 50 Intraclass Correlation Coefficients (ICCs) 18 – 19 involuntary reduction in force (IRIF) 642 Isabel v. City of Memphis (2005) 645 isokinetic strength tests 281 isometric strength tests 281 isotonic strength tests 281 item calibration 932 item discrimination 932 item exposure control model 943 item response function (IRFs) 932, 941 item response theory (IRT) 186, 304, 316, 404, 434, 931 – 2 Japanese and Caucasian Brief Affect Recognition Test (JACBART) 350 job analysis 39, 193, 277 – 80; selection-oriented work analysis 152 job component validity 612, 625 job embeddedness 496

job knowledge tests (JKTs) 71, 431 job-level stressors 537 – 9 job performance: dimensionality of 435 – 42, 439 – 40; measurement project 704 – 7, 707; ­predictors of 259 – 62, 261; variability of 762 job relatedness 59 – 60, 642 – 7, 824 job resources 493 job simulation tests 282 job skills checklists 768 Job Zone 886 Johnson v. City of Memphis (2006) 637 Joint-Service Job Performance Measurement/ Enlistment Standards Project (JPM Project) 705 journeymen selection 761, 776 justice principle 580 Kaiser Daily Health Report 531 Karasek’s Demands-Control model 538 knowledge, skills, abilities, and other characteristics (KSAOs): big data 949 – 50; competitive advantage 115 – 16; conclusion 368; contamination of test instruments 402; determining importance 144; domain sampling 135; employee selection constructs 379 – 80; execute respective functions 127; grandparenting rules 776; high-fidelity computer simulations 771; incumbent ratings of 142; individual differences in 51; individual-level selection 120, 122 – 3; intelligence constructs 353 – 4; job analysis 729; job performance and 368; job-relevant tests of 98, 369 – 70; job-­ relevant work samples 722; job-relevant tests of KSAs 98; job training 35; linking process 734; measurement of 370 – 2; motivational constructs 336; multilevel selection 123 – 5; normal distribution of 126; promotional tests 730; ­psychometric tests 93; selection process measures 764 – 5; team membership 813, 820, 826; test content, defined 725 – 6; test scores and 182; test ­sufficiency 402; work analysis 638, 977 – 9 knowledge and experience tests 767 knowledge instructions 310 labor market information 892 labor unions 762 – 3 Lancaster’s Mid-P (LMP) test 635 language skills 807 Lanning v. Septa (1999) 285, 288, 290, 644 Latent Semantic Analysis (LSA) 925 leader-member exchange (LMX) 459, 463 – 4 leaders/executives selection: assessment centers 841; assessment techniques 838 – 42; biodata questionnaires 840; cognitive ability 837, 839; competencies, dimensions, attributes 835 – 8, 836; conclusions 847 – 8; country culture 845; diversity considerations 845; effectiveness of 846 – 7; executives and high-potentials 834 – 5; fit, defined 844; follower characteristics 844 – 5; introduction 833 – 4; learning ability 838; organizational and cultural content 842 – 6; questionnaires 841 – 2; research opportunities 847; talent management 844; trends in 740 – 2 Leadership Assessment and Development program (LeAD) 842 The Leadership Pipeline: How to Build the Leadership Powered Company (Charan, Drotter, Noel) 753 Leadership Potential BluePrint model 835 – 7, 836

Index leadership training 754 learning ability 838 learning goal orientation (LGO) 333 – 4 Leftwich v. Harris Stowe State College (1983) 651 legal principles in employment selection: adverse impact measurement 41, 634 – 6, 645 – 7; alternatives to procedures 636 – 8; compliance 221; conclusions 653 – 4; construct-oriented studies 336 – 7; criterion-related validity 641 – 2; cut scores 644 – 5; disability-related legal decisions 651 – 3; downsizing efforts 647 – 8; EEO violations 648 – 9; Internet recruitment 633 – 4; introduction 631 – 2; job relatedness 642 – 7; legal context of 687 – 9; reductions-in-force 647 – 8, 650 – 1; Title VII statute 632, 650 – 1; validation evidence 640 – 1; work analysis 638 – 40 levels issues in service/sales jobs selection 792 Levels of Emotional Awareness scale (LEAS) 350 LGBT community 685 linear on the fly (LOFT) testing 186 – 7 LinkedIn 175 – 6 linking process 734 local independence assumption 940 local responsiveness 798, 800 Lomack v. City of Newark (2006) 646 longitudinal validation designs 357 – 8 Lopez v. City of Lawrence (2014) 636, 641 macro (strategy) scholarship 124 maladaptive personality measures 313 malware 914 – 15 management by objectives (MBO) 512 manager-relevant work sample measures 750 man-in-the-middle (MITM) attacks 915 – 16 Markov chain Monte Carlo (MCMC) algorithms 937 matrix-type test 714 maximum likelihood estimate (MLE) 940, 940 Mayer-Salovey-Caruso Emotional Intelligence Test (MSCEIT) 350, 356 Mayflower Group 845 McDonnell Douglas v. Green (1973) 634 McKinsey Global Institute 951 Meacham v. Knolls Atomic Power Lab (KAPL) (2006) 642, 651 measuring emotional management (MEMA) 351 Measuring Human Capabilities report 945 Mechanical Comprehension Test 712 Mechanical Turk (MTurk) 920 mental health measurements 562, 562 – 3 Merit Systems Protection Board (MSPB) 632 micro-behaviors 870 – 1 micro (strategy) scholarship 124 mid-level leader selection 739, 745 Military Entrance Processing Station (MEPS) 708 – 9 military entrance testing (MET) 708 military occupational specialties (MOS) 277, 442 military personnel system 697 – 8 military workforce selection: all-volunteer force 703 – 4; ASVAB research 713 – 14; conclusion 715; educational attainment 715; enlisted selection 707 – 8; introduction 697; job performance measurement project 704 – 7, 707; military personnel system 697 – 8; new directions in 713 – 15; noncognitive measures 714 – 15; officer
commissioning programs 711 – 12; officer occupational assignment 712 – 13; officer selection and classification 710; pilots and air crews 712; recruiting system 698 – 9; recruit quality 699 – 702, 700 – 2, 709 – 10 Miller v. Illinois (1996) 653 minimum qualification (MQ) 650, 729 Minnesota Multiphasic Personality Inventory (MMPI) 546, 674 minority groups, preferential treatment 674 – 5 Mi Proximo Paso 901, 901 mistreatment climate 495 mobile devices 968 – 71 Model Assisted Sampling (MAS) 884 modified parallel analysis procedure 934 Montreal Set of Facial Displays of Emotion (MSFDE) 350 Moore v. Philadelphia (2006) 650 Moore v. Southwestern Bell Telephone Co. (1979) 636 moral character 581 moral sensitivities 594 moral values 327 motivational constructs: achievement motivations 334; adaptive and citizenship behavior 455 – 6; construct-oriented studies 332 – 6; current concerns and emerging issues 335 – 6; examples of 333 – 5; interest measures 334 – 5; safety behaviors 546 – 7; study of 332 – 3; trait goal orientations 333 – 5 multiculturalism 798 Multidimensional Emotional Intelligence Assessment (MEIA) 349 multidimensionality of social intelligence 346 multifaceted replicates in CFA 24 – 5 multilevel issues in performance and success 563 – 7 MULTILOG computer program 935 multimedia immersive simulations 243 multinational companies (MNCs): challenges in developing systems 800 – 4; conceptual equivalence 801 – 2; conclusions 808 – 9; cross-cultural/ cross-national differences 801 – 3; culturally agile professionals 804 – 8; expatriate selection 212; global integration 798 – 800, 799; global leader selection 805; international assignees 805 – 6; introduction 797 – 8; language skills 807; local responsiveness 800; national differences 803 – 4; personality characteristics in employee selection 806 – 7; prior international experience 807; selection constructs 800 – 1; strategic alignment 798 – 800; transnational strategy of 800 multiple-hurdle approach 196, 286 – 7, 393 – 4, 614 multiple organizational stakeholders 230 – 1 Multipurpose Occupational Systems Analysis Inventory-Close-Ended (MOSAIC) 140 multisource (360-degree) feedback survey results 417 multi-trait multi-method (MTMM) 43 Multi-Unidimensional Pairwise Preference model (MUPP) 938 – 9, 939 Muraki’s generalized partial credit model 934 Murphy v. UPS (1999) 652 muscular strength tests 281 Myers-Briggs Type Indicator (MBTI) 330 My Next Move 898, 899 My Next Move for Veterans 898 – 900, 900, 906

Index narrow personality traits 786 National Council on Measurement in Education (NCME) 3, 282, 601, 980 National Council on Measurement Used in Education (NCMUE) 601 national differences in employee selection 803 – 4 national-level cultural values 802 National Research Council 705, 945 native/aboriginal people 660 The Nature of Human Values (Rokeach) 326 Naval Aviation Trait Facet Inventory 712 Naval Operational Medicine Institute (NOMI) 712 Navy Computer Adaptive Personality Scales (NCAPS) 304, 715 Need for Achievement (McClelland) 334 negative affectivity 541 negative psychological effects (NPEs) 413 – 14 NEO Personality Inventory 309 new jobs and test validation 380 Newton-Raphson iterations 940 new work activity statements 886 – 90, 887, 888, 888 – 9 nominal weights 193 nomological network 42 – 3 noncognitive measures 714 – 15 noncognitive personality tests 83 – 5 noncrossed measurement designs 24 – 5 nongovernmental organizations (NGOs) 807 nonmaleficence principle 581 “non-office” environments 760 nonsubstitutable resources 119 normative stakeholder model 578 NYC v. Beazer (1979) 643 occupational analysts (OAs) data collection 885 occupational code assignment (OCA) 894, 908 occupational credentialing 606 Occupational Employment Statistics (OES) 782, 884 Occupational Information Network (O*NET): accomplishments 907 – 8; alternate titles 893 – 4; attribute requirements 144; case studies 903, 903 – 4, 904, 904; challenges 908; content model 877 – 80, 878, 880; core activities of service workers 782; database 895 – 907; data collection 882 – 6, 883; data dictionaries 895; data enhancements 886 – 94; Detailed Work Activities 886 – 90; development of 140, 154, 876 – 82; Dictionary of Occupational Titles 874, 875 – 7, 897, 905; experience requirements 879; generalized work activities 438; government and 905; hierarchical structure 880; information gathering with 977; Intermediate Work Activities 886 – 7, 889 – 90; introduction 874; labor market information 892; new work activity statements 886 – 90, 887, 888, 888 – 9; occupational requirements 879; occupation-specific information 879 – 80; O*NET-SOC Taxonomy 881, 881 – 2, 882, 909; online websites 896, 896 – 8, 897; physical ability requirements 280; possible future enhancements 908 – 10; private sector companies 905 – 6; products and users 894 – 5; sample design 884; service/sales jobs selection 782 – 3; tools and technology 890 – 1, 891, 892; updates and additions 891; U.S. Armed Services and 906 – 7; user information 892; web services 902; worker characteristics 875, 877 – 8, 878;
worker requirements 878; workforce characteristics 875, 879 Occupational Interests 878 Occupational Outlook Handbook (OOH) 897 Occupational Personality Questionnaire (OPQ) 747 occupational safety 543 – 8, 544 occupation expert (OE) data collection 884 – 5 Office for Human Research Protection (OHRP) 577 Office of Federal Contract Compliance Programs (OFCCP) 601, 616, 632 – 5, 682, 688 Officer Aptitude Rating (OAR) 712 officer commissioning programs 711 – 12 officer occupational assignment 712 – 13 officer selection and classification 710 Officer Training School (OTS) 711 – 12 Older Workers Benefit Protection Act (1990) 648 O*NET-SOC Taxonomy 881, 881 – 2, 882, 908 operational perspective in evaluation 238 operational validity 107 – 8 OpScan sheets 967 option response function (ORF) 934 – 5 order of merit list (OML) 713 organizational citizenship behavior (OCB) 357, 450, 456 – 9, 464, 559 – 60 organizational commitment declines 741 – 2 organizational constraints in validation considerations 68 organizational deviance behaviors 560 organizational expectancy charts 235, 235 – 7, 236 organizational health 563 organizational-level stressors 536 – 7 organizational wellness programs 532 – 4 organization context for sustainable selection 205 – 6, 206 organization development (OD) 152, 835 Oubre v. Entergy (1998) 648 parent country nationals (PCNs) 212 Parents v. Seattle School District (2007) 647 Pareto optimization 193, 392 – 3 PARI method 150 Pearson correlation 16 – 18 perceived organizational support (POS) 102 perceptual/psychomotor ability in task performance measures 443 – 4 performance appraisal programs 843 performance assessments 148 – 9, 228 Performance Based Measures Battery 712 performance-based tests 349 – 50 performance goal orientation 333 performance objective measures 521 – 2, 522 performance success 563 – 7 Personal Aggression 477 personality heterogeneity 824 personality measurement and validity: adaptive and citizenship behavior 454 – 5; adverse impact 315 – 16; assessments 83 – 5; biodata 307 – 8; changing demographics 299 – 300; changing environments 298 – 9; characteristics in employee selection 806 – 7; conclusions 316; factors affecting usefulness 311 – 13; forced-choice item response formats 304 – 6; ideal point response methods 305; incremental validity 313; intentional distortion 305 – 6; interviews 308 – 9; introduction 298; measurement 303 – 11, 747 – 8; mega-data availability 300; moderating variables 314;
Index predictor-criterion relationships 314 – 15; scale scores 75; self-report questionnaire measures 303 – 4; simulations and assessment centers 310 – 11; situational judgement tests 309 – 10; structure of variables 300 – 3; task performance measures 444; team performance 823; tests 769 – 70; trends affecting use 298 – 300 personality-performance relationships 314 – 15 personality/personality-related characteristics for sales and service jobs 784 – 7 Personality-Related Position Requirements Form (PPRF) 139 personally identifiable information (PII) 921 personal maladjustment 546 personal values 327 personnel and human resource management (P/HRM) 844 personnel selection procedures: analysis of work 610 – 11; application to litigation 610; bands in 615; brief history 609; content validity evidence 612; criterion-related validity evidence 611; cutoff scores vs. rank order 614 – 15; decision making 731 – 5; fairness and bias 613 – 14; gamification in 974 – 5; generalizing validity evidence 612 – 13; introduction 3; meta-analysis 613; mobile devices and gamification 967; operational considerations 614 – 15; principles vs. standards 610; purpose of 609 – 10; summary 615; technical validation report requirements 615; utility computations 615; validity, defined 611 person-task interaction variance 15 Petit v. City of Chicago (2003) 646 PGA v. Martin (2001) 639 phishing scams 918 physical performance tests: administration of 289; benefits and trends 292 – 3; environmental conditions 279 – 80; ergonomic/biomechanical/ physiological analysis 278 – 9, 279; factors to consider 281 – 3, 283; final test battery 286; introduction 277; job analysis for arduous jobs 277 – 80; legal issues 290; overview 281 – 3, 283, 768 – 9; passing scores 287 – 8; physiological analysis in 278 – 9, 279; preparation for 289; required physical abilities 280 – 1; site visits and essential task identification 278; types of scoring 286 – 7; validity of 285 – 6 Pilot Candidate Selection Method (PCSM) 712 Pilot Flight Aptitude Rating (PFAR) 712 pilot selection 712 pipeline approach to leadership positions 753 – 4 Police Officers v. City of Columbus (1990) 641 Polish Act on Personal Data Protection 804 Polish Labor Code 804 Political Deviance 477 Position Analysis Questionnaire (PAQ) 139 position classification in public sector employment 723 positive psychological effects (PPEs) 413 – 14 power tests 189, 382 practical intelligence: conceptual framework 353, 353 – 4; conclusions 347; defined 342 – 3; multilevel perspective 358; predictor and criterion matching 355 prediction rationale in validation considerations 66, 69 – 81, 86, 88 predictive bias 266 – 8, 267

predictive inference vs. evidence 60 – 1, 61 predictive validity 83 predictor and criterion relationships: absence of bias in criterion measures 76; criterion assessment process 74 – 6; development of 135; governing policies and rules 83 – 6; managing and maintaining 87 – 8; nature of 314 – 15; quality of scores 71 – 2, 80; scores in validation considerations 66; service/sales jobs selection 791; validity 57 – 8 pre-employment selection 292, 770 preferential treatment differences 682 – 3 presentation in big data 961 – 3, 962, 963 preventing harm paradigm 582 previous research criteria 78 – 9 prima facie evidence of discrimination 669 Principles for the Validation and Use of Personnel Selection Procedures (SIOP Principles) 369, 371; see also professional guidelines/standards prior international experience 807 privacy in big data 957 – 8 procedural justice climate (PJC) 409, 463 proctored testing 376, 382 Production Deviance 477 production rates in task performance measures 430 professional credentialing 606 professional guidelines/standards: application of 602; brief history 601; cautions offered by 603; central focus 600; comparisons among 621 – 6, 622 – 3; conclusions 627 – 8; cut scores/cutoff scores 626; development and administration 608; employment and credentialing 605 – 6; employment selection application 600; expertise 213 – 14; fairness in 607 – 8, 626; importance of authorities 600 – 1; inconsistencies 626 – 7; introduction 599; meta-analysis 625; professional and occupational credentialing 606; purpose of 602; reliability and measurement errors 605, 625 – 6; review of 607; rights and responsibilities 609; selection decision-making 603; summary 609, 626; validity, defined 602, 624; validity evidence 603 – 4, 606, 624; validity generalization 604, 625; validity standards 604 – 5 Profile of Nonverbal Sensitivity (PONS) 350 PROMIS assessment 944 promotional processes 729 – 30 Property Deviance 477 protected group status 678, 680 – 1, 682, 682, 687 protocols/process training 150 prove performance goal orientation (PPGO) 333 – 4 proximal determinants 458 – 60 psychological assessments: conclusions 405; considerations in choosing 403 – 4; contamination of 402 – 3; efficiency of 404; introduction 397; momentary time-limited factors 399 – 400; norms in 404; rater errors 398 – 9; reliability information, use of 400; reliability of 397 – 400; sources of errors 400; sufficiency of 401 – 2; validity 401 – 3; validity evidence in 403; see also assessments psychological well-being 536 psychometric characteristics 185, 969 psychometric meta-analysis (PMA): in applied contexts 108 – 10; biases sample estimates 104; conclusions 110; continuum perspective 96 – 9;
Index corrected coefficients 106 – 8; employment-­ related 600; failure to model situational attributes 99 – 100; imprecise inferences from corrected validities 105 – 8, 106; inter-rater reliability 103 – 4; introduction 16, 93 – 4; ­questionable estimates 102 – 3; suspending standards 104 – 5; untested statistical assumptions 100 – 2; validity generalization 95 – 6 psychometric theory in personnel selection: computerized adaptive testing 941 – 4, 942, 943; detecting aberrant response patterns 944 – 5; Generalized Graded Unfolding Model 936 – 7, 937; introduction 931; item response theory 931 – 2; Multi-Unidimensional Pairwise Preference model 938 – 9, 939; Samejima’s graded response model 934 – 6, 935, 936; summary/ conclusion 945 – 6; three-parameter logistic model 932 – 4, 933, 934 public sector employment: alternative measures 728; appraising past performance 730; balancing validity and diversity 732; civil service examinations 723 – 4; conclusions 735 – 6; defensibility of process 733 – 5; defining test content 725 – 6; identifying strong candidates 726 – 8; interview role 727; introduction 722 – 3; minimum qualifications 726 – 7; multiplicity of jobs 724 – 5; negative consequences 731 – 2; personnel decision making 731 – 5; position classification in 723; potential selection tools 726; promotional processes 729 – 30; recruiting candidates 728 – 9; risks and legal challenges 728; unique competitive processes 731 qualitative outcomes 237 quality-of-data concerns 43 – 4 quality-of-life outcomes 536 quantitative overload 538 Racial Equality and Employment Equality Directives 689 radio-frequency identification (RFID) data 951 RAND Health report 532 random-effects ANOVA models 18 – 19, 27 Rasch model 932 rater training 435 rating formats 432 – 4 rational weighting of assessments 391 Raven’s matrices 256 Reading Comprehension Test 712 realistic job previews (RJPs) 537, 770 – 1 Reasonable Factors Other Than Age (RFOA) 650 – 1 recruiting models: accepting offers 171 – 3; candidate information 166 – 9; conclusions 175 – 6; global recruiting 173 – 4; introduction 165; maintaining interest 170 – 1; practice implications 176 – 7; reaching potential applicants 165 – 70; strategy considerations 174 – 5 reductions-in-force (RIF) 647 – 8, 650 – 1 Regents of University of California v. Bakke (1978) 646 Regin Trojan horse 915 – 16 Registered Apprenticeship Partners Information Data System (RAPIDS) 897 regression weighting of assessments 390 – 1 Rehabilitation Act (1973) 617 reliability concepts: central concepts 5; characteristics of the measurement procedure 14 – 15;

classical test theory 3 – 4, 7 – 12; classic tradition 17 – 18; closing thoughts on 27 – 8; components of variance 12 – 14; confirmatory factor analytic tradition 20 – 4; consistency and inconsistency 6 – 7; desired generalizations 12 – 13; estimation of 15 – 17; estimation traditions 25 – 7, 26; expectation 6; generalizability theory 8, 10 – 12; intended use of scores 13 – 14; introduction 3 – 4; measurement design 24 – 5; measurement models 7 – 8; overview 4 – 5; professional guidelines/ standards 605, 625 – 6; random-effects model 18 – 20; replication concept 5 – 6; summary 7, 15 reporting test scores 194 – 5 resource-based view (RBV) of the firm 118 – 19 respect principle 580 results measures in workplace: avoiding goal conflict 520 – 21; challenges with individual objectives 516 – 19; challenges with team-based objectives 519 – 21; choosing important measures 523, 523; comparable and fair objectives 517 – 18; conclusions 526; employee control of objectives 518; fluid situations with 519; high-quality evaluation information 524, 524 – 5; important aspects of performance 519; individual behavior 510 – 11; individual difference predictors 525 – 6; individual performance objectives 512 – 15, 513, 514; interdependence and 520; introduction 509 – 10; job relevance and 516 – 17, 517; manager and staff training 516; mitigation strategies challenges 522 – 5; number of criteria 524; partial attributable objectives 518 – 19; performance objectives 521 – 2, 522; team-based performance measurement 511 – 12; team-based performance objectives 515 – 16; team uniqueness considerations 521 resume storage tools 858 retesting issues 83 – 4, 190 – 1, 592 return on investment (ROI) 751, 755, 917 Ricci v. Destefano (2006) 637 – 8; (2008) 650 The Rights and Responsibilities of Test Takers: Guidelines for Testing Professionals (APA) 408 Robinson v. Shell Oil (1997) 649 Rokeach Value Survey 326 role ambiguity 538 role conflict paradigm 582 – 3 Rooney Rule 845 Rudin v. Lincoln Land Community College (2005) 647 Safe Harbor Privacy Principles 869 safety behavior predictors 546 – 7 safety compliance 547 safety criteria and systems 544 – 5 Samejima’s graded response model (SGRM) 934 – 6, 935, 936 Scholastic Achievement Test (SAT) 711 Schutte Self-Report Emotional Intelligence Test (SREIT) 349 science, technology, engineering, and math (STEM) fields 263, 925 score usage 79 – 81, 190 – 1 Secretarial Honors Program 923 Secretary’s Commission on Achieving Necessary Skills (SCANS) 139 Securities Exchange Act (1934) 834 selection constructs 800 – 1 selection errors 743 – 4

Index selection systems see validation considerations in selection systems self-image and assessment feedback 412 – 14 self-management 805 self reports 303 – 4, 337, 348 – 9 serious games 971 service/sales jobs selection: adverse impact 789 – 90; applicant reactions 789; background research 787 – 8; cognitive ability 788; composite scales to assess 786 – 7; conclusion 792 – 3; contrasts among 783; criterion issues 791; duties and responsibilities 781 – 2; future research 790 – 2; introduction 781; levels issues 792; narrow personality traits 786; nature of 781 – 3; personality/personality-related characteristics 784 – 7; predictor issues 791; required competencies for success 782, 782 – 3, 783; research on 784 – 90; sales jobs 430; situational judgment 788 – 9; temporal issues 791 – 2 sex differences in cognitive ability 263, 264 Shaw v. AutoZone (1999) 649 simulations 310 – 11, 971 situational judgment interviews (SJIs) 417 situational judgment tests (SJTs) 71, 151, 185, 191; assessing cognitive ability 257 – 8; construct-based assessments 748 – 9; cybersecurity issues 918; employee skill strategy 209; intelligence constructs 350 – 1, 356; mobile devices 969; overview 309 – 10; practical intelligence 343; service/sales jobs selection 788 – 9 Situational Specificity (SS) hypotheses 94, 96 – 100 Situational Test of Emotional Understanding (STEU) 351 Situational Test of Emotion Management (STEM) 351 SMART goals 515 Smith v. City of Boston (2015) 636, 641 Smith v. City of Jackson (2005) 647 – 8, 651 social bandwidth 861 – 2 social capital theory 119 – 20 social complexity 119 social dominance orientation 170 social engineering trend 914 social exchange theory (SET) 457, 463 – 4 social identity theory 412 social intelligence: conceptual framework 353, 353 – 4; conclusions 347; defined 345 – 6; multilevel perspective 358; predictor and criterion matching 355 social issues in construct-oriented studies 336 – 7 social maladjustment 546 social networking websites 497 – 8 Social Security Administration (SSA) 876, 905 social values 327 Society for Human Resource Management (SHRM) 240, 919 Society for Industrial and Organizational Psychology (SIOP) 244, 285, 593, 601, 631, 920 Society for Industrial and Organizational Psychology in South Africa (SIOPSA) 674 socio-economic status (SES) 678 Southeastern Community College v. Davis (1979) 653 spatial ability in task performance measures 443 – 4 Spearman-Brown prophecy formula 9 – 10, 12 – 13, 17 – 18, 108 specify cross-level relationships stage 207

specify within-level relationships stage 207 speeded tests 189, 382 split-half estimates 706 Spurlock v. United Airlines (1972) 643 Stagi v. National RR Passenger Corp. (Amtrak, 2012) 636 Standard Occupational Classification (SOC) 881, 897 Standards for Educational and Psychological Testing (AERA) 3, 34, 56, 291, 369, 401, 408, 577; see also professional guidelines/standards Starceski v. Westinghouse Elec. Corp. (1995) 648 Stone v. City of Mt. Vernon (1997) 640, 653 strategic human resource management (SHRM) 116 – 20 Strategic Job Modeling (SJM) 140 strategic perspective in evaluation 238 strategic work analysis (SWA) 152 stressors and counterproductive work behavior 492 – 3 Structural Characteristics/Information Processing (SCIP) model 969 structural equations modeling (SEM) 107 Stutts v. Freeman (1983) 653 subgroup differences in construct-oriented studies 337 subject matter experts (SMEs): associated KSAOs 135 – 6, 379; difficulties categorizing items 462; inaccessibility to 40; job knowledge measurements 459; promotional processes 729; rational weights 193; validity standards 604 success considerations 554 – 8, 563 – 7 summary judgment for defendants (SJDs) 647 – 8 Supreme Court of the United States (SCOTUS) 109 sustainable competitive advantage 118, 123 sustainable selection programs: access to programs 220 – 1; approval roles 211 – 12; authority and alignment principle 221 – 2; conclusions 224; corporate/unit interests 211; defined 206; development and delivery alignment 215 – 16; employee skill strategy 209 – 10; expert-only strategy 213 – 14; expert-owner strategy 214; funding considerations 211; governance 216 – 17; guiding principles 217; human resources selection strategy 206, 208 – 16; introduction 205; legal compliance 221; location differences 212 – 13; organization context for 205 – 6, 206; organization purposes 207 – 8; policy of 217 – 18; professional selection expertise 213 – 14; selection process management 222 – 3; taxonomy of 218; use of 219 – 20 Sutton v. United Air Lines (1999) 652 Swinburne University Emotional Intelligence Test (SUEIT) 349 synthetic validity 38, 612, 625, 640 systematic development 40 tacit knowledge 342 Tacit-Knowledge Inventory for Military Leaders 351 Tailored Adaptive Personality Assessment System (TAPAS) 304 – 5, 715, 944 Talbert v. City of Richmond (1981) 646 Talent Acquisition 238 talent management (TM) 808, 844 talent segmentation 128 targeted talent management (TTM) 844 Task Inventory/Comprehensive Occupational Data Analysis Programs (TI/CODAP) 138
Index task performance measures: dimensionality of job performance 435 – 42, 439 – 40; general cognitive ability 443; integration of 438, 439 – 40, 441 – 2; introduction 429; job knowledge tests 431; objective criteria 429 – 31; overview 559; personality research 444; predictors of 442, 442 – 5; production rates 430; rater training 435; rating formats 432 – 4; sales jobs 430; spatial and ­perceptual/psychomotor ability 443 – 4; subjective criteria 431 – 5; vocational interests 444 – 5; work samples 430 – 1 taxonomies of job performance 183 Taylor v. Principle Financial (1996) 653 team-based performance measurement 511 – 12 team-based performance objectives 515 – 16 team membership selection: complex configurations of 824; composition of 822 – 6; conceptual framework for understanding 812, 813; conclusions 828; core and contingent competencies 813 – 15; discussion on 826 – 8; diversity indices 823 – 4; dynamism 815; individual-level considerations 816 – 20, 817 – 19; introduction 812; mean values 823; measure and validation 816, 820; person-group fit 821 – 2; practice implications 827 – 8; research implications 826 – 7; size considerations 820 – 1; staffing considerations 815; task demands 814 – 15; task type revisited 824; team-level considerations 820 – 6; team type 814; work sample and interview measures 820 Team Optimal Profile System (TOPS) 826 – 7 Team Role Experience Orientation (TREO) 820 team task analysis (TTA) 814 – 15 technical validation report requirements 615 technology, use in employee selection: administration 866; applicant screening tools 858 – 9; applicant tracking systems 860 – 1; automated testing 859; behavioral assessment support 859 – 60; candidate sourcing 856 – 8, 857; common issues with 861 – 71; conclusion 871; cross-mode equivalence 861 – 2; data usage 868 – 70; deployment strategies 862 – 3; examinee identification 863; globalization 868; implementation effectiveness 865, 865 – 6; implementation management 866; implementation support 866 – 8, 867; interview support 860; new measurement methods 870 – 1; overview 855 – 6; resume storage, parsing, and search 858; support systems for 856 – 61; test delivery standardization 864; test security and cheating 863 – 4 temporal issues in service/sales jobs selection 791 – 2 temptation paradigm 582 tenure considerations 761 – 2 test administration and scores: aggregation inferences stage 208; alternate forms 184 – 7; alternate test formats 189 – 90; available testing time 188 – 9; blue-collar selection test security 771 – 2; choosing predictors 192 – 3; collection of 184 – 91; combining scores 384; computation of 191 – 5; conclusion 200; content defined 725 – 6; context-based selection 199 – 200; cut scores 197 – 8; data misuse 592; delivery standardization 864; dynamically administered tests 186 – 7; feedback on 385 – 6; information function 941; inventories for selection methods 746 – 50; job relatedness vs. 59 – 60; knowledge and experience tests 767; means 373; methods of

selection 195 – 7; mode of administration 187 – 8; personality tests 769 – 70; power tests 189, 382; predictor composites 192; preparation 84 – 5, 382 – 3; reporting test scores 194 – 5; retesting 83 – 4, 190 – 1, 592; security 184, 415, 587 – 8; security and cheating concerns 863 – 4; selection decisions 195 – 200; selection to fit a profile 198 – 9; speeded tests 189, 382; technology employee selection 859; test security 184, 415; use of 182, 383 – 6; validity information 379; weighing predictors 193 – 4; see also assessments; physical performance tests Test of Basic Aviation Skills (TBAS) 712 theory-first deductive approach 952 Theory of Work Adjustment 878 third country nationals (TCNs) 212 three-parameter logistic model (3PLM) 932 – 4, 933, 934, 941, 943 – 4 three-tier server model 919 time compression diseconomies 119 Title VII (CRA) statute 632, 650 – 1 Title X (CRA) of the U.S. Code 708 Tools and Technology (T2) information 890 – 1, 891, 892 top-down selection 45, 195 – 7 traditional test development and validation 975 – 81, 976, 976 – 7 training in cognitive task analysis 149 trait activation theory 98 trait affectivity 545 – 6 Trait Emotional Intelligence Questionnaire (TEIQue) 348 trait goal orientations 333 – 4 Trait Meta-Mood Scale (TMMS) 348 transparency of assessments 744 transportability of validity 38, 612, 625, 640 Treadwell v. Alexander (1983) 653 Trojan horse malware 915 – 16 true score variance 24 “truncated” validation strategies 42 12-dimension taxonomy of performance dimensions 437 Type A behavior pattern 541 U.N. Educational, Scientific, and Cultural Organization (UNESCO) 804 unfairness bias in the Principles and Standards 626 unidimensional criterion 48 – 9 unidimensional dichotomous IRT models 944 Uniform Guidelines on Employee Selection Procedures (Uniform Guidelines): adverse impact as a phenomenon 690; alternative selection procedure 618 – 19; applications and limitations 616 – 17; bottom line 618; brief history 616; cutoff scores 618; discrimination/adverse impact 617 – 18; documentation required 616; fairness 618; fourfifths rule 636; importance of 369; Internet recruitment case law 633 – 4; job-relatedness/ business necessity 619; legal scrutiny tests 662; physical performance tests 285, 287; purpose of 616; selection procedures/employment decisions 617; test score means 373; utility computations 621; validation evidence 640; validity 619 – 21; work analysis 638; see also professional guidelines/standards unionized organizations 591

999

Index unitarian conceptualization of validity 42 United Nations Standard Products and Services Code (UNSPSC) 891 – 2 United States v. Buffalo (1978) 643 unit weighting of assessments 390 universalistic perspective 117 unproctored Internet testing (UIT) 187 – 8, 375 – 7, 258, 382, 589 – 90, 968 unreliability-corrected correlations 257 U.S. Army Research Institute 300 U.S. Department of Justice (DoJ) 600, 616, 731 U.S. Department of Labor (DoL) 443, 599 – 600, 616, 632, 688, 761, 874 U.S. Departments of Defense (DoD) 277, 511, 697, 699, 702 – 4, 710 U.S. Employment Service (USES) 266, 875 U.S. Government Accountability Office (GAO) 699 U.S. Occupational Safety and Health Act (1970) 530 U.S. Office of Personnel Management (OPM) 511, 921 U.S. Supreme Court cases see individual cases U.S. v. Commonwealth of Virginia (1978) 636 US Department of Labor 140, 154 utility: analyses 233 – 5, 234; computations 615, 612 validation considerations in selection systems: absence of bias in criterion measures 76; choosing predictors 73 – 4; concept of validity 57 – 63; conclusions 89; criterion assessment process 72 – 9; description of work 70 – 2; design and implementation of 63 – 70; framework for description of 64 – 5, 64 – 70; governing policies and rules 81 – 8; individuals vs. test score use 62 – 3; inference vs. test validity 58; intended uses and outcomes 66 – 9, 73 – 4, 76 – 7, 82 – 3, 87; introduction 56; key inferences 66; legal principles in 640 – 1; managing and maintaining 87 – 8; meaning of validity 63 – 4; mode of administration 85 – 6; overview 56 – 7; personality scale scores 75; prediction rationale 66, 69 – 70, 76 – 9, 86, 88; predictive inference vs. evidence 60 – 1, 61; predictor-criterion relationships 57 – 8, 69, 74 – 6, 83 – 8; retesting 83 – 4; score usage 79 – 81; stages of 67; summary 88; test preparation 84 – 5; test scores vs. job relatedness 59 – 60; traditional test development and validation 975 – 81, 976, 976 – 7; validity evidence vs. validity types 58 – 9 validity: available evidence 135 – 6, 604; big data 955; Big-Five traits 98; construct validity 37, 59, 344, 620 – 1, 625; content validity 612, 625; convergent validity 603 – 4, 625; criterion-related validity 36 – 7, 344, 611, 620, 640 – 2, 970; criterion theory and measurement 258 – 62, 259, 261, 285; defined 602, 611; differences and predictive bias 266 – 8, 267; ethics of employee selection 587; evidence in assessments 403; generalizing evidence 612 – 13; job component validity 612, 625; meta-analysis 640; predictor and criterion relationships 57 – 8; psychological assessments 401 – 3; public sector employment 732; sources of 603 – 4; standards of 604 – 5; synthetic validity 38, 612, 625, 640; transportability of 38, 612, 625, 640; Uniform Guidelines 619 – 21; see also personality measurement and validity

1000

validity generalization (VG): evidence for 96 – 9; failure to model situational attributes 99 – 100; imprecise inferences from corrected validities 105 – 8, 106; introduction 93 – 4; overview 38 – 9, 95 – 6; professional guidelines/standards 604, 625; questionable estimates 102 – 3; situational artifacts, independence 101 – 2; situational variables, independence 100 – 1; untested statistical assumptions 100 – 2; work analysis 135 – 6 validity of inference vs. validity of test 58 video-based assessments 376 Vietnam Era Veteran’s Readjustment Act of 1974 (VEVRAA) 688 violation of ethics paradigm 583 – 4 virtual reality 871 visualization in big data 961 – 3, 962, 963 vocational interests 444 – 5 Vocational Rehabilitation Act (1973) 632 VUCA world 738 Waisome v. Port Authority (1991) 636 Watson v. Fort Worth Bank (1988) 642 web-based assessment 258 Whiteback v. Vital Signs (1997) 653 white-collar selection 760 Wong Law Emotional Intelligence Scale (WLEIS) 349 work analysis: attribute requirement descriptors 136 – 7, 139; cognitive task analysis 147 – 51; competency modelling 152 – 4; criterion development 135; data collection issues 141 – 3; domain sampling 135; evaluation of 144; future steps 151; high-performance individuals 146 – 7; high-performance workplaces 145 – 6; hybrid systems 139 – 40; inferential leaps and linkages 143 – 4; information framework 136 – 7, 137; introduction 134; limits of expertise 151; multiple levels of 154 – 5; practical implications/ perspectives 140 – 1, 141; predictor development 135; reaction time 151; review of 137 – 40, 138; selection process and 638 – 40; selection-related applications 134 – 6; synopsis and conclusion 155 – 6; validity evidence extension 135 – 6; work-oriented content descriptors 136, 138 work environment and blue-collar selection 761 worker characteristics 875, 877 – 8, 878 worker-oriented content descriptors 136 work-family conflict 467 workforce characteristics 875, 879 Workgroup Emotional Intelligence Profile 349 Working Memory Capacity (WMC) 714 work-oriented content descriptors 136, 138 workplace accident predictors 544 – 5 work-related constraints in validation considerations 68 work sample tests (WSTs) 71, 282, 430 – 1, 820 work stress 535 – 43 O*NET Work Styles 878 O*Net Work Values 878 World Values Survey 328 Youth Attitude Tracking Study 171 Zaccagnini v. Chas. Levy Circulating Co. (2003) 648 Zuniga v. Boeing Company (2007) 648