Economic Analysis of the Digital Economy

National Bureau of Economic Research Conference Report

Economic Analysis of the Digital Economy

Edited by

Avi Goldfarb, Shane M. Greenstein, and Catherine E. Tucker

The University of Chicago Press Chicago and London

Avi Goldfarb is professor of marketing at the Rotman School of Management at the University of Toronto. Shane M. Greenstein is the Kellogg Chair in Information Technology and professor of management and strategy at the Kellogg School of Management at Northwestern University. Catherine E. Tucker is the Mark Hyman Jr. Career Development Professor and associate professor of management science at the MIT Sloan School of Management. All three editors are research associates of the National Bureau of Economic Research.

The University of Chicago Press, Chicago 60637
The University of Chicago Press, Ltd., London
© 2015 by the National Bureau of Economic Research
All rights reserved. Published 2015.
Printed in the United States of America
24 23 22 21 20 19 18 17 16 15   1 2 3 4 5
ISBN-13: 978-0-226-20684-4 (cloth)
ISBN-13: 978-0-226-20698-1 (e-book)
DOI: 10.7208/chicago/9780226206981.001.0001

Library of Congress Cataloging-in-Publication Data
Economic analysis of the digital economy / edited by Avi Goldfarb, Shane M. Greenstein, and Catherine E. Tucker.
pages cm — (National Bureau of Economic Research conference report)
ISBN 978-0-226-20684-4 (cloth : alk. paper) — ISBN 978-0-226-20698-1 (e-book)
1. Digital media—Economic aspects. 2. Digital media—Government policy. 3. Internet—Economic aspects. I. Goldfarb, Avi. II. Greenstein, Shane M. III. Tucker, Catherine (Catherine Elizabeth) IV. Series: National Bureau of Economic Research conference report.
ZA4045.E26 2015
302.23'1—dc23
2014035487

This paper meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).

National Bureau of Economic Research

Officers
Martin B. Zimmerman, chairman
Karen N. Horn, vice chairman
James M. Poterba, president and chief executive officer
Robert Mednick, treasurer
Kelly Horak, controller and assistant corporate secretary
Alterra Milone, corporate secretary
Denis Healy, assistant corporate secretary

Directors at Large
Peter C. Aldrich, Elizabeth E. Bailey, John H. Biggs, John S. Clarkeson, Don R. Conlan, Kathleen B. Cooper, Charles H. Dallara, George C. Eads, Jessica P. Einhorn, Mohamed El-Erian, Linda Ewing, Jacob A. Frenkel, Judith M. Gueron, Robert S. Hamada, Peter Blair Henry, Karen N. Horn, John Lipsky, Laurence H. Meyer, Michael H. Moskow, Alicia H. Munnell, Robert T. Parry, James M. Poterba, John S. Reed, Marina v. N. Whitman, Martin B. Zimmerman

Directors by University Appointment
Jagdish Bhagwati, Columbia; Timothy Bresnahan, Stanford; Alan V. Deardorff, Michigan; Ray C. Fair, Yale; Edward Foster, Minnesota; John P. Gould, Chicago; Mark Grinblatt, California, Los Angeles; Bruce Hansen, Wisconsin–Madison; Benjamin Hermalin, California, Berkeley; Marjorie B. McElroy, Duke; Joel Mokyr, Northwestern; Andrew Postlewaite, Pennsylvania; Cecilia Rouse, Princeton; Richard L. Schmalensee, Massachusetts Institute of Technology; David B. Yoffie, Harvard

Directors by Appointment of Other Organizations
Jean-Paul Chavas, Agricultural and Applied Economics Association; Martin Gruber, American Finance Association; Ellen L. Hughes-Cromwick, National Association for Business Economics; Arthur Kennickell, American Statistical Association; William W. Lewis, Committee for Economic Development; Robert Mednick, American Institute of Certified Public Accountants; Alan L. Olmstead, Economic History Association; Peter L. Rousseau, American Economic Association; Gregor W. Smith, Canadian Economics Association; William Spriggs, American Federation of Labor and Congress of Industrial Organizations; Bart van Ark, The Conference Board

Directors Emeriti
George Akerlof, Glen G. Cain, Carl F. Christ, Franklin Fisher, George Hatsopoulos, Saul H. Hymans, Rudolph A. Oswald, Peter G. Peterson, Nathan Rosenberg, John J. Siegfried, Craig Swan

Relation of the Directors to the Work and Publications of the National Bureau of Economic Research

1. The object of the NBER is to ascertain and present to the economics profession, and to the public more generally, important economic facts and their interpretation in a scientific manner without policy recommendations. The Board of Directors is charged with the responsibility of ensuring that the work of the NBER is carried on in strict conformity with this object.

2. The President shall establish an internal review process to ensure that book manuscripts proposed for publication do not contain policy recommendations. This shall apply both to the proceedings of conferences and to manuscripts by a single author or by one or more co-authors, but shall not apply to authors of comments at NBER conferences who are not NBER affiliates.

3. No book manuscript reporting research shall be published by the NBER until the President has sent to each member of the Board a notice that a manuscript is recommended for publication and that in the President's opinion it is suitable for publication in accordance with the above principles of the NBER. Such notification will include a table of contents and an abstract or summary of the manuscript's content, a list of contributors if applicable, and a response form for use by Directors who desire a copy of the manuscript for review. Each manuscript shall contain a summary drawing attention to the nature and treatment of the problem studied and the main conclusions reached.

4. No volume shall be published until forty-five days have elapsed from the above notification of intention to publish it. During this period a copy shall be sent to any Director requesting it, and if any Director objects to publication on the grounds that the manuscript contains policy recommendations, the objection will be presented to the author(s) or editor(s). In case of dispute, all members of the Board shall be notified, and the President shall appoint an ad hoc committee of the Board to decide the matter; thirty days additional shall be granted for this purpose.

5. The President shall present annually to the Board a report describing the internal manuscript review process, any objections made by Directors before publication or by anyone after publication, any disputes about such matters, and how they were handled.

6. Publications of the NBER issued for informational purposes concerning the work of the Bureau, or issued to inform the public of the activities at the Bureau, including but not limited to the NBER Digest and Reporter, shall be consistent with the object stated in paragraph 1. They shall contain a specific disclaimer noting that they have not passed through the review procedures required in this resolution. The Executive Committee of the Board is charged with the review of all such publications from time to time.

7. NBER working papers and manuscripts distributed on the Bureau's web site are not deemed to be publications for the purpose of this resolution, but they shall be consistent with the object stated in paragraph 1. Working papers shall contain a specific disclaimer noting that they have not passed through the review procedures required in this resolution. The NBER's web site shall contain a similar disclaimer. The President shall establish an internal review process to ensure that the working papers and the web site do not contain policy recommendations, and shall report annually to the Board on this process and any concerns raised in connection with it.

8. Unless otherwise determined by the Board or exempted by the terms of paragraphs 6 and 7, a copy of this resolution shall be printed in each NBER publication as described in paragraph 2 above.

Contents

Acknowledgments  xi

Introduction  1
Avi Goldfarb, Shane M. Greenstein, and Catherine E. Tucker

I. Internet Supply and Demand

1. Modularity and the Evolution of the Internet  21
Timothy Simcoe
Comment: Timothy F. Bresnahan

2. What Are We Not Doing When We Are Online?  55
Scott Wallsten
Comment: Chris Forman

II. Digitization, Economic Frictions, and New Markets

3. The Future of Prediction: How Google Searches Foreshadow Housing Prices and Sales  89
Lynn Wu and Erik Brynjolfsson

4. Bayesian Variable Selection for Nowcasting Economic Time Series  119
Steven L. Scott and Hal R. Varian

5. Searching for Physical and Digital Media: The Evolution of Platforms for Finding Books  137
Michael R. Baye, Babur De los Santos, and Matthijs R. Wildenbeest
Comment: Marc Rysman

6. Ideology and Online News  169
Matthew Gentzkow and Jesse M. Shapiro

7. Measuring the Effects of Advertising: The Digital Frontier  191
Randall Lewis, Justin M. Rao, and David H. Reiley

8. Digitization and the Contract Labor Market: A Research Agenda  219
Ajay Agrawal, John Horton, Nicola Lacetera, and Elizabeth Lyons
Comment: Christopher Stanton

9. Some Economics of Private Digital Currency  257
Joshua S. Gans and Hanna Halaburda

III. Government Policy and Digitization

10. Estimation of Treatment Effects from Combined Data: Identification versus Data Security  279
Tatiana Komarova, Denis Nekipelov, and Evgeny Yakovlev

11. Information Lost: Will the "Paradise" That Information Promises, to Both Consumer and Firm, Be "Lost" on Account of Data Breaches? The Epic is Playing Out  309
Catherine L. Mann
Comment: Amalia R. Miller

12. Copyright and the Profitability of Authorship: Evidence from Payments to Writers in the Romantic Period  357
Megan MacGarvie and Petra Moser
Comment: Koleman Strumpf

13. Understanding Media Markets in the Digital Age: Economics and Methodology  385
Brett Danaher, Samita Dhanasobhon, Michael D. Smith, and Rahul Telang

14. Digitization and the Quality of New Media Products: The Case of Music  407
Joel Waldfogel

15. The Nature and Incidence of Software Piracy: Evidence from Windows  443
Susan Athey and Scott Stern
Comment: Ashish Arora

Contributors  481
Author Index  485
Subject Index  491

Acknowledgments

It almost goes without saying, but it is worth saying nonetheless: we are grateful to our authors and discussants for working with us on this project. This was a collective effort of many contributors, and we thank all of the participants. We also thank the Sloan Foundation for their support and encouragement. In addition to funding, Danny Goroff, Josh Greenberg, and Paul Joskow each provided the advice, criticism, and praise necessary to create a successful project. Josh Lerner, Scott Stern, Nick Bloom, and Jim Poterba enabled the creation of a digitization initiative at the NBER. The NBER provided the intellectual home to the project, and we are grateful for the infrastructure and environment conducive to creative economic thinking about the impact of digitization. The staff at the University of Chicago Press and Rob Shannon and Helena Fitz-Patrick at the NBER provided essential support without which this book would have been impossible to complete. We thank the Kellogg School of Management for hosting our preconference in Chicago and Ranna Rozenfeld for hosting our dinner. Finally, we thank Rachel, Ranna, and Alex, as well as all of our children, for their patience with us as this project developed.


Introduction
Avi Goldfarb, Shane M. Greenstein, and Catherine E. Tucker

Research on the economics of digitization studies whether and how digital technology changes markets. Digital technology has led to a rapid decline in the cost of storage, computation, and transmission of data. As a consequence, economic activity is increasingly digital. The transformative nature of digital technology has implications for how we understand economic activity, how consumers behave, how firms develop competitive strategy, how entrepreneurs start new firms, and how governments should determine policy.

This volume explores the economic impact of digitization in a variety of contexts and also aims to set an agenda for future research in the economics of digitization. While no one volume can be comprehensive, the objective is to identify promising areas of research. The chapters summarize and illustrate areas in which some research is already under way and that warrant further exploration by economists.

Of the various technology drivers enabling the rise of digital technology, growth in digital communication—particularly the Internet—has played a central role. It is constructive to focus a volume around digital communication as a key driver of economic activity.

Avi Goldfarb is professor of marketing at the Rotman School of Management, University of Toronto, and a research associate of the National Bureau of Economic Research. Shane M. Greenstein is the Kellogg Chair of Information Technology and professor of management and strategy at the Kellogg School of Management, Northwestern University, and a research associate of the National Bureau of Economic Research. Catherine Tucker is the Mark Hyman Jr. Career Development Professor and associate professor of management science at the Sloan School of Management, Massachusetts Institute of Technology, and a research associate of the National Bureau of Economic Research. For acknowledgments, sources of research support, and disclosure of the authors' material financial relationships, if any, please see http://www.nber.org/chapters/c12987.ack.


In particular, digitization has some features that suggest many well-studied economic models may not apply, pointing to a need for a better understanding of how digitization changes market outcomes.

The development of a nearly ubiquitous Internet has motivated many new questions. In particular, the Internet's deployment and adoption encouraged the growth of digital products and services, and many of these display very low marginal costs of production and distribution. Correspondingly, digital markets are often easy to enter. These features have motivated questions about how digitization has restructured economic activities across much of the economy.

Similarly, low communication costs, even over long distances, have also brought about economic restructuring by creating opportunities for new marketplaces. This motivates questions about how new marketplaces can overcome information asymmetries between buyers and sellers in different places, and reduce search costs for either type of participant in a market.

Low communication costs also translate into low distribution costs for information services. That means that nonexcludable information services resemble public goods that can be consumed at enormous scales, by hundreds of millions of people, and perhaps by billions in the future. That has focused attention on the incentives to develop such public goods and on how they diffuse. It has also focused attention on the valuation issues that arise when businesses and households reallocate their time to unpriced goods.

While these features of digital markets and services do not generally require fundamentally new economic insight, they do require more than simply taking theoretical and empirical results from other markets and assuming the implications will be the same. For example, digital information can be stored easily and aggregated to improve measurement. This creates previously unseen challenges for privacy and security, and those issues are not salient in other economic analyses because they do not have to be.

More broadly, many policies that have been settled for many years seem poorly adapted to digital markets. It is no secret that firms and governments have struggled to apply copyright, security, and antitrust regulations to the digital context, as the reasoning that supported specific policies came under pressure from piracy, or lost relevance in a new set of economic circumstances. General pressures to alter policies are coming from mismatches between historical institutions and the present circumstances, and these mismatches generate calls from private and public actors to make changes. These pressures will not disappear any time soon, nor will the calls for change. Economic research on digitization can inform the debate.

We do not think the economics of digitization is a new field. Rather, digitization research touches a variety of fields of economics, including (but not limited to) industrial organization, economic history, applied econometrics, labor economics, tax policy, monetary economics, and international economics.


Many of the key contributions to the economics of digitization have also found an intellectual home in these fields. What distinguishes research on the economics of digitization is an understanding of the role of digital technology. Research on the economics of digitization therefore has a consistent framing, even if the applications are diverse.

There are two complementary approaches to motivating new work. One characterizes the progress to date in addressing fundamental research questions, as a handbook might do (see, e.g., Peitz and Waldfogel 2012). The other approach, which this volume pursues, stresses different ways to address open research questions by providing extensive examples of how to frame, execute, and present research on the frontier. These are not mutually exclusive approaches, and many chapters in this book dedicate substantial attention to the prior literature before providing new analysis and ideas.

As might be expected, the scope of the book is quite broad, and drawing boundaries required several judgment calls. In general, the topics in the book emphasize the agenda of open questions and also tend to stress unsettled issues in public policy. A few traits are shared by all the chapters. The topics are representative of many of the active frontiers of economic research today and are not slanted toward one subdiscipline's approach to the area. More affirmatively, the chapters illustrate that the economics of digitization draws from many fields of economics and matches the approach to the question. No chapter argues for any form of digitization exceptionalism—as if this research requires the economic equivalent of the invention of quantum mechanics, or a fundamental break from prior precedents in positive analysis or econometric methods. The Internet contains unique features that require additional data, as well as sensitivity to new circumstances, not a radical abandoning of prior economic lessons.

The volume's chapters take steps toward building a theoretically grounded and empirically relevant economic framework for analyzing the determinants and consequences of digitization. For example, several chapters examine questions about how digitization changes market structure and market conduct. These changes are especially evident in newspapers, music, movies, and other media. Relatedly, there are many broad questions arising in areas where copyright plays an important role. The application of copyright to online activity has altered the incentives for both innovation and creativity. The advent of piracy has altered the monetization of these products and services. Digitization has also altered the costs of collecting, retaining, and distributing personal information, which is an important development in itself. It is also consequential for the personalization of commerce, such as in targeted advertising. Many chapters address policy issues related to digitization, including copyright law, privacy law, and efforts to restructure the delivery of and access to digitized content and data.


There is also a strong emphasis on developing unbiased approaches to economic measurement. Unbiased measurement can assess the extent of digitization and begin the long-term conversation toward understanding the private and social costs of digitization. As a consequence, this will improve understanding of the rate of return on investments in digitization by public and private organizations.

An astute reader will notice that some topics do not make it into the book. Perhaps most directly, all chapters focus on digitization enabled through the Internet rather than other consequences of digital technology, such as increased automation in manufacturing and services or increased use of digital medical records. In addition, some relevant Internet-related topics do not appear. There is only a peripheral discussion about universal service for new communications technology, such as broadband Internet access to homes in high-cost regions or low-income regions. It is an important topic, but the economic issues are not fundamentally unique to digitization and resemble universal service debates of the past. There is also limited coverage of many issues in the design of markets for search goods (e.g., keyword auctions) because there is already a robust conversation in many areas related to these services. Therefore, the book focuses attention on frontier questions that remain open, such as search and online matching in labor markets.

Finally, the volume also largely eschews the well-known debates about the productivity of information technology (IT) and what has become known as the Solow Paradox, often stated as "We see IT everywhere except in the productivity statistics." Again, that is because the literature is large and robust. The contrast is, however, particularly instructive. This volume stresses vexing issues in measuring the value of digital services where the measurement issues are less widely appreciated by academic economists and where mainstream economic analysis could shed light.

The remainder of the introduction provides some detail. The first set of chapters discusses the basic supply and demand for Internet access. The next set of chapters discusses various ways in which digitization reduces economic frictions and creates new opportunities and challenges for business. The final set of chapters lays out some policy issues that these opportunities and challenges create. All the chapters received comments from discussants at a conference held in June 2013, in Park City, Utah. In some cases the discussants chose to make their commentary available, and these comments are provided as well.

Internet Supply and Demand

The Internet is not a single piece of equipment with components from multiple suppliers. It is a multilayered network in which different participants operate different pieces. Sometimes these pieces are complements to one another, and sometimes substitutes. Many years ago, the "Internet" referred to the networking technology that enables computer networks to communicate.
Over time it has also come to mean the combination of standards, networks, and web applications (such as streaming and file sharing) that have accumulated around the networking technology.

Internet technology has evolved through technological competition. Many firms possess in-house technical leadership that enables them to develop and sell components and services that are valuable to computer users. Firms that do not possess such capabilities can acquire them through the market by, for example, hiring a team of qualified engineers. Consequently, multiple firms can possess both the (expensive) assets and the (rare) employees with the skills to reach the frontier and commercialize products near the technical frontier. Bresnahan and Greenstein (1999) call this feature of market structure "divided technical leadership," contrasting it with earlier eras in which a single firm could aspire to control the vast majority of inputs near the technical frontier. Therefore, one of the big open research questions is: What are the principles of competition in this area of divided technical leadership?

Computing market segments are typically defined by "platforms," which Bresnahan and Greenstein define as "a reconfigurable base of compatible components on which users build applications." Platforms are identified by a set of technical standards or by engineering specifications for compatible hardware and software. The emergence of platforms with many stakeholders (including firms, academics, and nonprofits) increased the importance of organizations that design standards and platforms, referred to as "standard-setting committees" (Mowery and Simcoe 2002; Simcoe 2012). The key standard-setting committees for the Internet, such as the Internet Engineering Task Force (IETF), the Institute of Electrical and Electronics Engineers (IEEE), and the World Wide Web Consortium (W3C), made decisions that shaped much of the equipment that underlies the Internet, with the IETF shaping the infrastructure layer, the IEEE shaping local area network and wireless communications, and the W3C shaping the web-based software and applications layer.

The chapter by Simcoe (chapter 1) inquires whether modularity shaped technological competition and specialization. The chapter offers an empirical examination of the consequences of the Internet architecture using data from the IETF and W3C. Both organizations adopted modular architectures, which produced a specialized division of labor in designing and operating protocols. The chapter analyzes citations between Internet standards as further evidence of this specialization. Such specialization is the key to avoiding diminishing marginal returns in scaling up these networks. Modularity helps these technologies adapt to new circumstances and heterogeneous applications, helping them deploy more widely. This particular approach arises frequently with digital technologies and, the chapter argues, warrants attention as a fundamental feature of the digital economy.


In his comments, Timothy Bresnahan stresses that modularity should be distinguished from openness. The former is a partitioning of the technical architecture, while the latter arises from the policies and actions of those involved with commercialization, typically making information available. Bresnahan stresses these two aspects of the Transmission Control Protocol/Internet Protocol (TCP/IP) commercial experience and argues that these processes turned TCP/IP into what it is today. That leads Bresnahan to raise questions about platform governance and the evolution of general-purpose technologies. In his view, Simcoe's chapter illustrates a major unaddressed question in digitization economics, namely, why processes that depart from strict contractual approaches have had a successful historical record. Modularity's value, therefore, may depend not merely on the specialization that it permits but also on the institutional processes that guide the specialists. In that sense, Bresnahan speculates that Simcoe has introduced the reader to a potentially rich new agenda.

Many fundamental questions remain open. Competition between platforms determines prices for customers deciding between platforms, and divided technical leadership shapes the supply of vendor services that build on top of a platform. How does such competition shape the division of returns within a platform? How do these two margins differ when a third type of participant, such as an advertiser, plays an important role in creating market value for the platform? If platforms differentiate in terms of their capabilities and approaches to generating revenue, does that alter the composition of returns to their participants? If platforms develop in collective organizations, what type of firm behavior shapes participation in standards committees? How do these incentives shape the direction of innovation in markets connecting multiple platforms? In practice, do most of the returns for new platform development go to existing asset holders in the economy or to entrepreneurial actors who create and exploit value opened by the technical frontier? These are rich areas for additional research, and some of the following chapters also touch on these questions.

In addition to understanding how the technology evolved, how the infrastructure was built, and how decentralized platforms develop standards, it is also important to understand demand for digital technology. Without an understanding of the value of the technology to users, it is difficult to tease out policy implications. Several recent studies have examined demand for services. For example, Greenstein and McDevitt (2011) examine the diffusion of broadband Internet and its associated consumer surplus by looking at revenue of Internet service providers over time. Rosston, Savage, and Waldman (2010) use survey data to estimate household demand.

Wallsten (chapter 2) examines the microbehavior underlying household demand. In particular, he examines what people do when they are online, which often involves many choices between priced and unpriced options, or among unpriced options. The chapter provides detailed insight into the debate about how the Internet has changed lives, particularly in households, where many of the changes involve the allocation of leisure time. This allocation will not necessarily show up in gross domestic product (GDP) statistics, thereby framing many open questions about valuing the changes. The chapter is also novel for its use of the American Time Use Survey from 2003 to 2011 to estimate crowd-out effects. Those data show that time spent online and the share of the population engaged in online activities have been increasing steadily since 2000. At the margin, each minute of online leisure time is correlated with fewer minutes on all other types of leisure. The findings suggest that any valuation of these changes must account for both opportunity costs and new value created, both of which are hard to measure.

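To make the crowd-out exercise described above concrete, the sketch below runs the kind of regression one could estimate on time-diary data: minutes of an offline leisure category regressed on minutes of online leisure, with year and demographic controls. It is a minimal illustration on simulated data, not Wallsten's actual specification; the variable names and magnitudes are assumptions chosen only for the example.

```python
# Illustrative crowd-out regression on simulated time-diary data.
# This is a sketch, not the specification used in chapter 2.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5_000  # hypothetical respondent-days

df = pd.DataFrame({
    "year": rng.integers(2003, 2012, n),          # survey years 2003-2011
    "age": rng.integers(18, 80, n),
    "online_leisure": rng.gamma(1.5, 20, n),      # minutes of online leisure
})
# Simulate one offline leisure category (say, TV) that is partly crowded out.
df["tv_minutes"] = (
    150 - 0.3 * df["online_leisure"] + 0.5 * df["age"] + rng.normal(0, 40, n)
).clip(lower=0)

# At the margin, how many TV minutes move with one extra online minute?
result = smf.ols("tv_minutes ~ online_leisure + age + C(year)", data=df).fit(
    cov_type="HC1"  # heteroskedasticity-robust standard errors
)
print(result.params["online_leisure"])  # roughly -0.3 by construction
```

Running the same regression for each leisure category gives the set of marginal crowd-out coefficients that a valuation exercise would then have to weigh against the (unpriced) value of the online time itself.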

Chris Forman's discussion of Wallsten's chapter emphasizes a household's trade-off in terms of opportunity costs and links it to the prior literature on the implications of online behavior for offline markets. The discussion suggests opportunities for future research to leverage differences across locations in order to understand how the relative value of the Internet varies with the availability of offline substitutes.

Many open questions remain. If Internet use changes the allocation of leisure time, then what about the converse? How do changes in leisure time (for example, over the life cycle) affect Internet use and demand for Internet access? Do wireless access and ubiquitous connectivity (for example, in transit) change the relative benefit of different types of Internet use? How do particular applications (e.g., social networks, online shopping) affect the adoption and usage intensity of wireless and wireline Internet by consumers and businesses? Will improvements in technology, such as speed and memory, change demand and spill over into other areas of economic activity? How do these changes in demand reshape the allocation of supply? Many of these issues arise in other chapters, especially where public policy shapes markets.

Digitization, Economic Frictions, and New Markets

Among the major themes in the literature on digitization is an assessment of how it changes economic transactions. In particular, the literature identifies a variety of economic frictions that are increased or decreased as a consequence of digitization. Much of the literature on digitization has emphasized the impact of the cost of storage, computation, and transmission of data on the nature of economic activity. Specifically, technology makes certain economic transactions easier, reducing several market frictions. This could lead to increased market efficiency and increased competition. At the same time, if the technology reduces some frictions but not others, it could distort market outcomes, helping some players and hurting others. Broadly, changes related to digitization have changed economic measurement, altered how some markets function, and provided an opportunity for new markets to arise.


The influx of data due to the reduced cost of collecting and storing information, combined with improvements in the tools for data analysis, has created new opportunities for firms and policymakers to measure the economy and predict future outcomes. The economics literature on the opportunities presented by data analysis of this kind is relatively sparse. Goldfarb and Tucker (2012) describe the opportunities from Internet data with respect to advertising; Einav and Levin (2013) describe the opportunities from Internet data for economics researchers; and Brynjolfsson, Hitt, and Kim (2011) document that companies that use data often tend to do better. Two different chapters in this book emphasize the potential of Internet data to improve measurement. In the policy section, two other chapters emphasize the challenges created by ubiquitous data. Just as predicting the weather had profound consequences for much economic activity, such as agriculture, better measurement and prediction of a wide range of economic activity could generate profound economic gains for many participants in the economy.

Wu and Brynjolfsson (chapter 3) highlight the potential of online data to predict business activity. They ask whether there is a simple but accurate way to use search data to predict market trends. They illustrate their method using the housing market. After showing the predictive power of their method, they suggest several directions for future work regarding the potential of detailed data to help consumers, businesses, and policymakers improve decision making.

Scott and Varian (chapter 4) also highlight the potential of online data to improve the information that goes into decision making. Rather than prediction, they emphasize "nowcasting," or the ability of online data in general (and search data in particular) to provide early signals of economic and political indicators. They develop an approach to deal with one of the main challenges in using online data for prediction: there are many more potential predictors than there are observations. Their method helps identify the key variables that are most useful for prediction. They demonstrate the usefulness of the method in generating early measures of consumer sentiment and of gun sales.

Together, chapters 3 and 4 demonstrate that online data has the potential to substantially improve the measurement of current economic activity and the ability to forecast future activity. These chapters represent early steps toward identifying (a) what types of economic activity are conducive to measurement with online data, (b) the specific data that is most useful for such measurement, and (c) the most effective methods for using digital data in economic measurement. However, as both chapters note, there is still much work to be done, and open questions remain around refining these methods, developing new methods, and recognizing new opportunities.

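The "fat regression" problem the nowcasting chapter tackles, far more candidate predictors than observations, can be illustrated with a simple sparse-regression stand-in. The sketch below uses an off-the-shelf LASSO on simulated data rather than the Bayesian spike-and-slab machinery of chapter 4; it only shows why some form of variable selection is unavoidable in this setting.

```python
# Illustrative stand-in for variable selection in nowcasting: simulated data,
# LASSO instead of the Bayesian spike-and-slab approach used in chapter 4.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
T, K = 100, 500          # 100 monthly observations, 500 candidate search queries
X = rng.normal(size=(T, K))
true_idx = [3, 42, 137]  # only three queries actually track the target series
y = X[:, true_idx] @ np.array([1.5, -2.0, 1.0]) + rng.normal(scale=0.5, size=T)

# Ordinary least squares is hopeless here (K > T); a sparsity-inducing
# estimator instead picks out the handful of informative predictors.
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("selected predictors:", selected)   # should recover roughly {3, 42, 137}
```

The Bayesian variable-selection approach in chapter 4 pursues the same goal with spike-and-slab priors, which additionally yield posterior inclusion probabilities for each candidate predictor rather than a single selected set.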

The next three chapters discuss ways in which digitization has altered how markets function. Digital technology makes some activities easier, thereby changing the nature of some economic interactions. Perhaps the oldest and largest stream of research on the Internet and market frictions emphasized reduced search costs. This literature, still going strong, builds on an older theory literature in economics (e.g., Stigler 1961; Diamond 1971; Varian 1980) that examines how search costs affect prices. This older literature showed that prices and price dispersion should fall when search costs fall. Digitization of retail and marketing meant that consumers could easily compare prices across stores, so the empirical work on Internet pricing examined the impact on prices and price dispersion. Following hypotheses first advanced by Bakos (1997), the first wave of this research empirically documented lower prices, but still substantial dispersion (Brynjolfsson and Smith 2000; Baye, Morgan, and Scholten 2004; Ellison and Ellison 2009).

Baye, De los Santos, and Wildenbeest (chapter 5) offer a good example of the newest wave of this research, which collects data about online searches to examine the actual search process that consumers undertake when looking for a product online. They focus on the question of how consumers search for books and booksellers online. This is in itself an interesting topic, both because books have often been the focus of studies that explore the "long tail" and because there have been policy concerns about how online sales of books have affected offline channels. The chapter asks whether most book searches have been conducted on proprietary systems such as Amazon's Kindle and Barnes & Noble's Nook rather than on general search engines such as Google or Bing, meaning that search might be mismeasured in the literature. This question also emphasizes that the final stage of purchase is often controlled by a more familiar retail environment, and it raises questions about the growing importance of standards and platforms in the distribution of creative content.

As noted earlier, near-zero marginal costs of distribution for information goods might change where and how information goods get consumed. Geographic boundaries might be less important if information can travel long distances for free (Sunstein 2001; Sinai and Waldfogel 2004; Blum and Goldfarb 2006). A big open question concerns the incidence of low distribution costs. The benefits might vary by location, with locations with fewer offline options generating a larger benefit from digitization (Balasubramanian 1998; Forman, Ghose, and Goldfarb 2009; Goldfarb and Tucker 2011a).

Gentzkow and Shapiro (chapter 6) explore the potential of near-zero marginal costs of distribution to affect political participation and the nature of news consumption. In particular, they ask whether technology-driven reductions in the costs of news distribution, both within and across geographic boundaries, affect the diversity of media production and consumption. Digital media could increase the diversity of news consumption because it enables inexpensive access to a broad range of sources; digital media could decrease the diversity of news consumption because it may permit specialized outlets that serve niche tastes that are not viable when physical production costs are high or when demand is limited to a geographically localized market.


This contribution addresses an important open question: Will digitization of news content exacerbate existing political divisions as consumers access only content that supports their existing political ideology? The work of these authors does not stoke the worst fears. Their findings suggest that those who have niche tastes in news are still obtaining the majority of their news content from mainstream sources.

For many pure information goods, online platforms link readers to advertisers. Given the challenges of protecting online information content from being shared (a topic we discuss below in the context of policy), advertising has become an important source of revenue for many providers of pure information goods. Because of this, it is important to understand how online advertising works in order to appreciate the opportunities and challenges faced by providers of digital information goods. Goldfarb and Tucker (2011b) emphasize that online advertising is better targeted and better measured because of the ease of data collection. The study of online advertising continues to attract attention because this is the principal means for generating revenues in much of the Internet ecosystem. Most of the content on the Internet and many of the services (such as search or social networking) rely on advertising revenues for support.

Lewis, Rao, and Reiley (chapter 7) discuss the methods used for measuring the effects of advertising. To do this, they draw on their previous and current work that has used multiple field experiments to try to measure how effective online display advertising is at converting eyeballs into actual incremental sales. They emphasize that an important challenge to the accurate measurement of advertising is the high noise-to-signal ratio. This chapter suggests that as clients become increasingly sophisticated about measurement, this revenue source may be called into question.

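The noise-to-signal problem has concrete arithmetic behind it: plausible advertising lifts are tiny relative to the variance of purchase behavior, so detecting them requires enormous experiments. The back-of-the-envelope power calculation below, with conversion rates assumed purely for illustration, shows the order of magnitude involved.

```python
# Back-of-the-envelope power calculation for an ad-lift experiment.
# The baseline and treated conversion rates are illustrative assumptions.
from scipy.stats import norm

p0 = 0.0050            # baseline conversion rate (0.5%)
p1 = 0.00525           # treated rate under a 5% relative lift
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

# Standard two-proportion sample-size formula, users per experimental arm.
n_per_arm = (z_alpha + z_beta) ** 2 * (p0 * (1 - p0) + p1 * (1 - p1)) / (p1 - p0) ** 2
print(f"{n_per_arm:,.0f} users per arm")   # on the order of a million users
```

Even before accounting for the additional variance contributed by purchase amounts, an experiment of this size is feasible only at very large scale, which is part of why the chapter leans on field experiments run on major platforms.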

Many other markets have also been changed by digitization. Other promising areas of research include rating mechanisms and quality signals (e.g., Cabral and Hortaçsu 2010; Jin and Kato 2007; Mayzlin, Dover, and Chevalier 2014), niche products and superstar effects (e.g., Brynjolfsson, Hu, and Smith 2003; Fleder and Hosanagar 2009; Bar-Isaac, Caruana, and Cunat 2012), and skill-biased technical change and the organization of work (e.g., Autor 2001; Garicano and Heaton 2010). Chapters 8 and 9 discuss examples of markets that have been enabled by digitization.

Agrawal, Horton, Lacetera, and Lyons (chapter 8) examine online markets for contract labor, another area in which digitization reduces frictions. In particular, digitization makes it easier for an employer to hire someone for information-related work without ever meeting the employee in person. If the work can be described digitally, completed off site, and then sent back to the employer digitally (such as with computer programming), then there might be an opportunity for long-distance North-South trade in skilled labor. The key challenges relate to information asymmetries regarding the quality of the employee and the trustworthiness of the employer. The chapter frames a large agenda about the role of online platforms in reducing these information asymmetries, thereby changing the types of contract labor transactions that are feasible online. They lay out a clear research agenda around the key players, their incentives, and the potential welfare consequences of this market.

Stanton's discussion extends this agenda. He speculates on whether the digitization of labor relationships enables labor outsourcing to other countries even without a platform intermediary.

Chapter 8 also extends a fourth stream of research related to frictions and digitization, the potential for new markets and new business models that take advantage of the lower frictions. Many successful Internet firms provide platforms that facilitate exchange, including eBay, Monster, Prosper, Airbnb, and oDesk. This is another channel through which digitization has restructured the supply of services. New policies—for copyright, privacy, and identity protection, for example—directly shape firm incentives by shaping the laws that apply to these new business models. Several other chapters also touch on these themes.

Addressing an important policy issue for governments, Gans and Halaburda (chapter 9) discuss the potential of digitization to create markets for private currencies that support activities on a particular platform, seemingly bypassing state-sponsored monetary authorities. They focus on the viability of the market for private digital currencies with noncurrency-specific platforms and speculate on the potential for a privacy-oriented entity to launch a real currency to compete with government-backed currencies such as the dollar and the euro. They lay out a model in which a platform currency offers "enhancements" to people who spend time on the platform. People allocate time between working and using the platform. They ask whether platforms have incentives to allow users to exchange private digital currency for government-backed currency at full convertibility. Their analysis illustrates the broad open question about whether private currencies in support of a platform are likely to migrate beyond the platform.

Online labor markets and private currencies are just two examples of markets enabled by digitization. Other promising related research areas include markets for user-generated content and the provision of public goods (e.g., Zhang and Zhu 2011; Greenstein and Zhu 2012), online banking and finance (e.g., Agrawal, Catalini, and Goldfarb 2013; Rigbi 2013; Zhang and Liu 2012), and "the sharing economy" of hotels and car services (e.g., Fradkin 2013). Thus, the chapters in this section summarize some of the impacts of digitization on a variety of markets. This is a big and growing area of research, and much remains to be done. As digital technology advances, new opportunities for markets (and new ideas for research) will continue to arise.

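One way to see the economic question in the Gans and Halaburda discussion above is through a deliberately stylized time-allocation problem. The notation below is ours and is only a sketch; it is not the model in chapter 9. A user with one unit of time chooses platform time $t$ and platform-currency holdings $m$ to solve

\[
\max_{t,\,m \ge 0} \; u\big(c,\; v(t) + \theta m\big)
\quad \text{s.t.} \quad
c = w(1 - t) - e\,m,
\]

where $w$ is the outside wage, $v(t)$ is the value of time spent on the platform, $\theta m$ captures the "enhancements" that platform currency buys, and $e$ is the price of the platform currency in state-backed money. Permitting users to convert $m$ back into dollars changes the effective cost $e$ and the resale value of holdings, and therefore both the demand for the currency and the time allocated to the platform; the platform's convertibility choice trades off these margins, which is the incentive question the chapter poses.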

Government Policy and Digitization

Increasing digitization has implications for policy, but the literature on the impact of digitization on policy is still in its infancy. As hinted above, ubiquitous data yields new challenges to privacy and security that policymakers need to address (e.g., Goldfarb and Tucker 2012; Miller and Tucker 2011; Arora et al. 2010). Near-zero marginal costs of distribution and the nonrival nature of digital goods pose challenges to copyright policy (e.g., Rob and Waldfogel 2006; Oberholzer-Gee and Strumpf 2007). The ease with which digital goods can be transferred over long distances and across borders might affect tax policy (e.g., Goolsbee 2000), financial regulation (e.g., Agrawal, Catalini, and Goldfarb 2013), and trade policy (Blum and Goldfarb 2006).

Privacy and data security are areas where digitization has substantially changed the costs and benefits to various economic actors. The current policy structure was implemented in a different regime, when data sharing was costly and data security was not an everyday concern. It is important to assess whether such laws match the needs of a digital era in which everyone is of sufficient interest (relative to costs) to warrant data tracking by firms and governments.

Komarova, Nekipelov, and Yakovlev (chapter 10) make an important contribution. They combine a technically rich approach to econometrics with the question of how researchers, and research bodies that share data with those researchers, can protect the security and privacy of the people in the data. This is important because, all too often, researchers are unable to make use of the increasing scale and detail of data sets collected by government bodies because access is restricted due to unspecified privacy and data security concerns. This means that many potentially important research questions are being left unanswered, or are being answered using less adequate data, because of our technical inability to share data without creating privacy concerns. The authors develop the notion of the risk of "statistical partial disclosure" to describe the situation where researchers are able to infer something sensitive about an individual by combining public and private data sources. They develop an example to emphasize that there is a risk to individual privacy due to researchers' ability to combine multiple anonymized data sets. However, beyond that, they also suggest that there are ways that data-gathering research bodies can minimize such risks by adjusting the privacy guarantee level.

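The mechanics of statistical partial disclosure are easy to see with a toy linkage example: two data sets that each look harmless can re-identify people once they are joined on shared quasi-identifiers. The sketch below is a generic illustration with fabricated records; it is not the chapter's formal framework or its privacy-guarantee calculus.

```python
# Toy linkage illustration of disclosure risk; all records are fabricated.
import pandas as pd

# A "de-identified" research extract: no names, but quasi-identifiers remain.
extract = pd.DataFrame({
    "zip3": ["606", "606", "100", "941"],
    "birth_year": [1970, 1985, 1970, 1962],
    "sex": ["F", "M", "F", "M"],
    "sensitive_outcome": ["diagnosis A", "diagnosis B", "diagnosis A", "diagnosis C"],
})

# A public roster (e.g., a voter file) with names and the same quasi-identifiers.
roster = pd.DataFrame({
    "name": ["J. Smith", "K. Lee", "A. Patel", "R. Gomez"],
    "zip3": ["606", "606", "100", "312"],
    "birth_year": [1970, 1985, 1970, 1955],
    "sex": ["F", "M", "F", "M"],
})

linked = roster.merge(extract, on=["zip3", "birth_year", "sex"])
# Any name that matches exactly one extract row is effectively re-identified.
counts = linked.groupby("name").size()
reidentified = linked[linked["name"].isin(counts[counts == 1].index)]
print(reidentified[["name", "sensitive_outcome"]])
```

Coarsening or perturbing the quasi-identifiers before release is one way a data-holding body can trade this risk off against the usefulness of the data, which is the kind of adjustment the chapter treats as a choice of privacy guarantee level.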

Mann (chapter 11) looks at a related question about data security. She provides several frameworks for analyzing how data breaches should be evaluated in economic terms. She argues that markets for data security are incomplete and suggests that a good market analog to consider is the market for pollution. This market is similarly characterized by negative economies of scale, asymmetric information, and systematic uncertainty. She also provides useful data to calibrate just how large the problem of data breaches actually is, and why breaches tend to occur. Interestingly, despite policy emphasis on external threats such as hacking and fraud, most breaches occur because of carelessness on the part of the data curator. She emphasizes that typically the number of records involved in a data breach is surprisingly small, and that many data breaches stem from the medical sector, though the data breaches that involve the release of a Social Security number are often from retail. She concludes by emphasizing the complexity introduced into the issue by questions of international jurisdiction.

Miller's discussion of Mann's chapter provides a useful synthesis of other literature on this topic. She focuses on the extent to which traditional policymaking on data security issues can backfire if it distorts incentives. For example, emphasizing encryption can lead firms to focus only on external threats to data and to ignore internal threats from employee fraud or incompetence. She also points to the difficulty of making policy recommendations about differences in US and EU approaches to data security when there is, so far, scant information about the relative perceived costs to firms and consumers of data breaches.

A second area of policy interest concerns intellectual property. The digitization process resembles the creation of a giant free photocopier that can duplicate any creative endeavor with little or no cost. Varian (2005) supplies a theoretical framework for thinking about this change from an economics perspective. Usually, the economic effect on copyright holders in the context of free copying is considered to be negative. However, Varian suggests an important counterargument. If the value a consumer puts on the right to copy is greater than the reduction in sales, a seller can increase profits by allowing that right. Varian also provides a detailed description of several business models that potentially address the greater difficulty of enforcing copyrights as digitization increases. These models span strategies based on balancing prices, selling complementary goods, selling subscriptions, personalization, and supporting the goods being sold through advertising.

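Varian's counterargument above can be put in one line; the notation here is ours and is introduced only to make the comparison explicit. If each of $N$ buyers values the right to copy at $v$, the seller can raise the price by $v$; if granting that right displaces $\Delta Q$ sales at price $p$, allowing copying raises profit whenever

\[
N\,v - \Delta Q\,p > 0,
\]

that is, whenever the extra willingness to pay collected from buyers exceeds the revenue lost to displaced sales. The business models listed above (complementary goods, subscriptions, personalization, advertising support) can loosely be read as different ways of tilting this comparison in the seller's favor.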

Empirical research has not reached the point of having established a set of accepted facts about the merits or demerits of these different strategies, which the earlier sociological and political science literature has discussed in broad terms (Castells 2003). This volume provides a sample of the range of new thinking in this area, and it complements existing work on the effect of digital music downloading on copyright holders (e.g., Rob and Waldfogel 2006; Hong 2007; Oberholzer-Gee and Strumpf 2007). The four chapters on this topic all shed light on how business activity changes when the protection of intellectual property changes. Together these chapters demonstrate the importance of copyright policy for market outcomes.

MacGarvie and Moser (chapter 12) address an argument often made by proponents of stronger copyright terms. Due to the scarcity of data about the profitability of authorship under copyright, they turn to history to discover whether an episode that increased copyright terms did, in fact, encourage creativity by increasing the profitability of authorship. Their historical study also examines a setting with much shorter copyright lengths than our current term of seventy years after the author's death. That is an advantage, since further extensions today—beyond seventy—may not have any effects on the profitability of authorship, whereas in their study further extensions could have major consequences. The chapter also introduces a new data set of publishers' payments to authors of British fiction between 1800 and 1830. These data indicate that payments to authors nearly doubled following an increase in the length of copyright in 1814.

Further exploring themes related to copyright's influence on the incentives to distribute creative works, this volume also includes a chapter by Danaher, Dhanasobhon, Smith, and Telang (chapter 13). It examines research opportunities related to the erosion of copyright caused by Internet file sharing. Digitization has created many new opportunities to empirically analyze open questions by leveraging new data sources. This chapter discusses methodological approaches to leverage the new data and natural experiments in digital markets to address these questions. The chapter closes with a specific proof-of-concept research study that analyzes the impact of legitimate streaming services on the demand for piracy.

Waldfogel (chapter 14) explores another side of these questions, namely, how copyright policy alters incentives to create music. Revenue for recorded music has collapsed since the explosion of file sharing, and yet, Waldfogel argues, the quality of new music has not suffered. He considers an explanation that stresses changes on the supply side, namely, that digitization has allowed a wider range of firms to bring far more music to market using lower-cost methods of production, distribution, and promotion. Prior to the supply change, record labels found it difficult to predict which albums would find commercial success. In that situation many released albums necessarily would fail, and, relatedly, many nascent but unpromoted albums might have been successful. After the change in supply conditions, the increasing number of products released would allow consumers to discover more appealing choices if they can sift through the offerings. The chapter argues that digitization is responsible for such a supply shift: specifically, that Internet radio and a growing cadre of online music reviewers provide alternatives to radio airplay as means for new product discovery.

Despite a long history of piracy in software markets, researchers have not been able to assemble informative data about the phenomenon, much less its causes.


Athey and Stern (chapter 15) make a novel contribution by analyzing data that permits direct measurement of piracy for a specific product—Windows 7. They are able to use anonymized telemetry data to characterize the ways in which piracy occurs, the relative incidence of piracy across different economic and institutional environments, and the impact of enforcement efforts on choices to install pirated versus paid software. The chapter has several provocative new observations. For example, most piracy in this setting can be traced back to a small number of widely distributed "hacks" that are available through the Internet. Despite the availability of these hacks to any potential Internet user, they do not get used everywhere. The microeconomic and institutional environment appears to play a crucial role in fostering or discouraging piracy. Moreover, piracy tends to focus on the most "advanced" version of Windows (Windows Ultimate). The chapter lays out a broad agenda for this area of research.

These chapters all demonstrate the important role of copyright policy in digital markets. Copyright enforcement affects what is produced and what is consumed. Still, as should be evident from these chapters, many open policy questions remain. Questions about the role of policy in determining copyright rules, privacy norms, and security practices arise in many markets for digital goods and services. The principles for redesigning these policies also remain elusive. We hope this book motivates further investigation into the economics underlying these policy issues.

Conclusions

The emerging research area of the economics of digitization improves our understanding of whether and how digital technology changes markets. Digitization enables outcomes that were not possible a few decades earlier. It has not only reduced existing costs but has also enabled the development of new services and processes that did not exist before because they were just too costly or merely technologically infeasible. The opportunities created by digitization have also generated dramatic resource reallocation and restructuring of routines, market relationships, and patterns of the flow of goods and services. This in turn has led to a new set of policy questions and made several existing policy questions more vexing.

References

Agrawal, Ajay, Christian Catalini, and Avi Goldfarb. 2013. "Some Simple Economics of Crowdfunding." In Innovation Policy and the Economy, vol. 14, edited by Josh Lerner and Scott Stern, 63–97. Chicago: University of Chicago Press. Arora, A., A. Nandkumar, C. Forman, and R. Telang. 2010. "Competition and Patching of Security Vulnerabilities: An Empirical Analysis." Information Economics and Policy 10:164‒77.


Autor, David H. 2001. “Wiring the Labor Market.” Journal of Economic Perspectives 15 (1): 25–40.
Bakos, J. 1997. “Reducing Buyer Search Costs: Implications for Electronic Marketplaces.” Management Science 43 (12): 1676–92.
Balasubramanian, S. 1998. “Mail versus Mall: A Strategic Analysis of Competition between Direct Marketers and Conventional Retailers.” Marketing Science 17 (3): 181–95.
Bar-Isaac, H., G. Caruana, and V. Cunat. 2012. “Search, Design, and Market Structure.” American Economic Review 102 (2): 1140–60.
Baye, Michael, John Morgan, and Patrick Scholten. 2004. “Price Dispersion in the Small and in the Large: Evidence from an Internet Price Comparison Site.” Journal of Industrial Economics 52 (4): 463–96.
Blum, Bernardo S., and Avi Goldfarb. 2006. “Does the Internet Defy the Law of Gravity?” Journal of International Economics 70 (2): 384–405.
Bresnahan, T., and S. Greenstein. 1999. “Technological Competition and the Structure of the Computing Industry.” Journal of Industrial Economics 47 (1): 1–40.
Brynjolfsson, Erik, L. M. Hitt, and H. H. Kim. 2011. “Strength in Numbers: How Does Data-Driven Decision-Making Affect Firm Performance?” http://ssrn.com/abstract=1819486.
Brynjolfsson, Erik, Yu “Jeffrey” Hu, and Michael D. Smith. 2003. “Consumer Surplus in the Digital Economy: Estimating the Value of Increased Product Variety.” Management Science 49 (11): 1580–96.
Brynjolfsson, Erik, and Michael Smith. 2000. “Frictionless Commerce? A Comparison of Internet and Conventional Retailers.” Management Science 46 (4): 563–85.
Cabral, L., and A. Hortaçsu. 2010. “Dynamics of Seller Reputation: Theory and Evidence from eBay.” Journal of Industrial Economics 58 (1): 54–78.
Castells, M. 2003. The Internet Galaxy: Reflections on the Internet, Business, and Society. Abingdon, UK: Taylor and Francis.
Diamond, P. 1971. “A Simple Model of Price Adjustment.” Journal of Economic Theory 3:156–68.
Einav, Liran, and Jonathan D. Levin. 2013. “The Data Revolution and Economic Analysis.” NBER Working Paper no. 19035, Cambridge, MA.
Ellison, G., and S. F. Ellison. 2009. “Search, Obfuscation, and Price Elasticities on the Internet.” Econometrica 77 (2): 427–52.
Fleder, D., and K. Hosanagar. 2009. “Blockbuster Culture’s Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity.” Management Science 55 (5): 697–712.
Forman, C., A. Ghose, and A. Goldfarb. 2009. “Competition between Local and Electronic Markets: How the Benefit of Buying Online Depends on Where You Live.” Management Science 55 (1): 47–57.
Fradkin, Andrey. 2013. “Search Frictions and the Design of Online Marketplaces.” Working Paper, Department of Economics, Stanford University.
Garicano, Luis, and Paul Heaton. 2010. “Information Technology, Organization, and Productivity in the Public Sector: Evidence from Police Departments.” Journal of Labor Economics 28 (1): 167–201.
Goldfarb, Avi, and Catherine Tucker. 2011a. “Advertising Bans and the Substitutability of Online and Offline Advertising.” Journal of Marketing Research 48 (2): 207–28.
———. 2011b. “Privacy Regulation and Online Advertising.” Management Science 57 (1): 57–71.
———. 2012. “Privacy and Innovation.” In Innovation Policy and the Economy, vol. 12, edited by J. Lerner and S. Stern, 65–89. Chicago: University of Chicago Press.


Goolsbee, A. 2000. “In a World without Borders: The Impact of Taxes on Internet Commerce.” Quarterly Journal of Economics 115 (2): 561–76.
Greenstein, S., and R. McDevitt. 2011. “The Broadband Bonus: Estimating Broadband Internet’s Economic Value.” Telecommunications Policy 35:617–32.
Greenstein, S., and F. Zhu. 2012. “Is Wikipedia Biased?” American Economic Review 102 (3): 343–48.
Hong, Seung-Hyun. 2007. “The Recent Growth of the Internet and Changes in Household Level Demand for Entertainment.” Information Economics and Policy 3–4:304–18.
Jin, G. Z., and A. Kato. 2007. “Dividing Online and Offline: A Case Study.” Review of Economic Studies 74 (3): 981–1004.
Mayzlin, Dina, Yaniv Dover, and Judith Chevalier. 2014. “Promotional Reviews: An Empirical Investigation of Online Review Manipulation.” American Economic Review 104 (8): 2421–55.
Miller, A., and C. Tucker. 2011. “Encryption and the Loss of Patient Data.” Journal of Policy Analysis and Management 30 (3): 534–56.
Mowery, D., and T. Simcoe. 2002. “The Origins and Evolution of the Internet.” In Technological Innovation and Economic Performance, edited by R. Nelson, B. Steil, and D. Victor, 229–64. Princeton, NJ: Princeton University Press.
Oberholzer-Gee, Felix, and Koleman Strumpf. 2007. “The Effect of File Sharing on Record Sales: An Empirical Analysis.” Journal of Political Economy 115 (1): 1–42.
Peitz, Martin, and Joel Waldfogel. 2012. The Oxford Handbook of the Digital Economy. New York: Oxford University Press.
Rigbi, Oren. 2013. “The Effects of Usury Laws: Evidence from the Online Loan Market.” Review of Economics and Statistics 95 (4): 1238–48.
Rob, Rafael, and Joel Waldfogel. 2006. “Piracy on the High C’s: Music Downloading, Sales Displacement, and Social Welfare in a Sample of College Students.” Journal of Law & Economics 49 (1): 29–62.
Rosston, Gregory, Scott J. Savage, and Donald M. Waldman. 2010. “Household Demand for Broadband Internet in 2010.” The B.E. Journal of Economic Analysis & Policy 10 (1): article 79.
Simcoe, T. 2012. “Standard Setting Committees: Consensus Governance for Shared Technology Platforms.” American Economic Review 102 (1): 305–36.
Sinai, T., and J. Waldfogel. 2004. “Geography and the Internet: Is the Internet a Substitute or a Complement for Cities?” Journal of Urban Economics 56 (1): 1–24.
Stigler, George J. 1961. “The Economics of Information.” Journal of Political Economy 69 (3): 213–25.
Sunstein, C. 2001. Republic.com. Princeton, NJ: Princeton University Press.
Varian, H. 1980. “A Model of Sales.” American Economic Review 70:651–59.
———. 2005. “Copying and Copyright.” Journal of Economic Perspectives 19 (2): 121–38.
Zhang, Juanjuan, and Peng Liu. 2012. “Rational Herding in Microloan Markets.” Management Science 58 (5): 892–912.
Zhang, X., and F. Zhu. 2011. “Group Size and Incentives to Contribute: A Natural Experiment at Chinese Wikipedia.” American Economic Review 101:1601–15.

1

Modularity and the Evolution of the Internet
Timothy Simcoe

1.1

Introduction

The Internet is a global computer network comprised of many smaller networks, all of which use a common set of communications protocols. This network is important not only because it supports a tremendous amount of economic activity, but also as a critical component within a broader constellation of technologies that support the general-purpose activity of digital computing. Given its widespread use and complementary relationship to computing in general, the Internet is arguably a leading contemporary example of what some economists have called a general purpose technology (GPT). The literature on GPTs highlights the importance of positive feedback between innovations in a GPT-producing sector and the process of “coinvention” (i.e., user experimentation and discovery) in various application sectors that build upon the GPT.1 Much of this literature elaborates on the implications of coinvention for understanding GPT diffusion and the timing of associated productivity impacts.2 However, the literature on GPTs is Timothy Simcoe is associate professor of strategy and innovation at Boston University School of Management and a faculty research fellow of the National Bureau of Economic Research. This research was funded by the NBER Digitization program with support from the Kauffman Foundation. Useful comments were provided by Tim Bresnahan, Shane Greenstein, Avi Goldfarb, Joachim Henkel, and Catherine Tucker. All errors are my own, and comments are welcome: [email protected]. For acknowledgments, sources of research support, and disclosure of the author’s material financial relationships, if any, please see http://www.nber.org/chapters /c13000.ack. 1. See Bresnahan (2010) for a recent review of this literature. 2. For a historical example, see Paul David (1990) on the role of coinvention in industrial electrification. For a contemporary quantitative application of these ideas, see Dranove et al.’s (2012) analysis of the productivity benefits from adopting health information technology.


less precise about how the supply of a GPT can or should be organized, or what prevents a GPT from encountering decreasing returns as it diffuses to application sectors with disparate needs and requirements. This chapter provides an empirical case study of the Internet that demonstrates how a modular system architecture can have implications for industrial organization in the GPT-producing sector, and perhaps also prevent the onset of decreasing returns to GPT innovation. In this context, the term “architecture” refers to an allocation of computing tasks across various subsystems or components that might either be jointly or independently designed and produced. The term “modularity” refers to the level (and pattern) of technical interdependence among components. I emphasize voluntary cooperative standards development as the critical activity through which firms coordinate complementary innovative activities and create a modular system that facilitates a division of innovative labor. Data collected from the two main Internet standard-setting organizations (SSOs), the Internet Engineering Task Force (IETF), and World Wide Web Consortium (W3C), demonstrate the inherent modularity of the Internet architecture, along with the division of labor it enables. Examining citations to Internet standards provides evidence on the diffusion and commercial application of innovations within this system. The chapter has two main points. First, architectural choices are multidimensional, and can play an essential role in the supply of digital goods. In particular, choices over modularity can shape trade-offs between generality and specialization among innovators and producers. Second, SSOs play a crucial role in designing modular systems, and can help firms internalize the benefits of coordinating innovation within a GPT-producing sector. While these points are quite general, it is not possible to show how they apply to all digital goods. Instead, I will focus on a very specific and important case, showing how modularity and SSOs played a key role in fostering design and deployment of the Internet. The argument proceeds in three steps. First, after reviewing some general points about the economics of modularity and standards, I describe the IETF, the W3C, and the Transmission Control Protocol/Internet Protocol (TCP/IP) “protocol stack” that engineers use to characterize the Internet’s architecture. Next, I use data from the IETF and W3C to illustrate the modularity of the system and the specialized division of labor in Internet standard setting. In this second step, I present results from two empirical analyses. The first analysis demonstrates the modular nature of the Internet by showing that citations among technical standards are highly concentrated within “layers” or modules in the Internet Protocol stack. The second analysis demonstrates that firms contributing to Internet standards development also specialize at particular layers in the protocol stack, suggesting that the technical modularity of the Internet architecture closely corresponds to the division of labor in standards production. The final step in the chapter’s


broader argument is to consider how components within a modular system evolve and are utilized through time. To illustrate how these ideas apply to the Internet, I return to citation analysis and show that intermodule citations between standards occur later than intramodule citations. Similarly, citations from patents (which I use as a proxy for commercial application of Internet standards) occur later than citations from other standards. These patterns suggest that modularity facilitates asynchronous coinvention and application of the core GPT, in contrast to the contemporaneous and tightly coupled design processes that occur within layers. 1.1.1

Modularity in General

Modularity is a general strategy for designing complex systems. The components in a modular system interact with one another through a limited number of standardized interfaces. Economists often associate modularity with increasing returns to a finer division of labor. For example, Adam Smith’s famous description of the pin factory illustrates the idea that system-level performance is enhanced if specialization allows individual workers to become more proficient at each individual step in a production process. Limitations to such increasing returns in production may be imposed by the size of the market (Smith 1776; Stigler and Sherwin 1985) or through increasing costs of coordination, such as the cost of “modularizing” products and production processes (Becker and Murphy 1992). The same idea has been applied to innovation processes by modeling educational investments in reaching the “knowledge frontier” as a fixed investment in human capital that is complementary to similar investments made by other workers (Jones 2008). For both production and innovation, creating a modular division of labor is inherently a coordination problem, since the ex post value of investments in designing a module or acquiring specialized human capital necessarily depend upon choices and investments made by others. A substantial literature on technology design describes alternative benefits to modularity that have received less attention from economists. Herb Simon (1962) emphasizes that modular design isolates technological interdependencies, leading to a more robust system, wherein the external effects of a design change or component failure are limited to other components within the same module. Thus, Simon highlights the idea that upgrades and repairs can be accomplished by swapping out a single module instead of rebuilding a system from scratch. Baldwin and Clark (2000) develop the idea that by minimizing “externalities” across the parts of a system, modularity multiplies the set of options available to component designers (since design constraints are specified ex ante through standardized interfaces, as opposed to being embedded in ad hoc interdependencies), and thereby facilitates decentralized search of the entire design space. Economists often treat the modular division of labor as a more or less


inevitable outcome of the search for productive efficiency, and focus on the potential limits to increasing returns through specialization. However, the literature on technology design is more engaged with trade-offs that arise when selecting between a modular and a tightly integrated design. For example, a tightly integrated or nondecomposable design may be required to achieve optimal performance. The fixed costs of defining components and interfaces could also exceed the expected benefits of a modular design that allow greater specialization and less costly ex post adaptation. Thus, modularity is not particularly useful for a disposable single-purpose design. A more subtle cost of modularity is the loss of flexibility at intensively utilized interfaces. In a sense, modular systems “build in” coordination costs, since modifying an interface technology typically requires a coordinated switch to some new standard.3 The virtues of modular design for GPTs may seem self-evident. A technology that will be used as a shared input across many different application sectors clearly benefits from an architecture that enables decentralized end-user customization and a method for upgrading “core” functionality without having to overhaul the installed base. However, this may not be so clear to designers at the outset, particularly if tight integration holds out the promise of rapid development or superior short-run performance. For example, during the initial diffusion of electricity, the city electric light company supplied generation, distribution, and even lights as part of an integrated system. Langlois (2002) describes how the original architects of the operating system for the IBM System 360 line of computers adopted a nondecomposable design, wherein “each programmer should see all the material.”4 Similarly, Bresnahan and Greenstein (1999) describe how divided technical leadership—which might be either a cause or a consequence of product modularity—did not emerge in computing until the personal computer era. The evolution or choice of a modular architecture may also reflect expectations about the impact of modularity on the division of rents in the GPTproducing sector. For example, during the monopoly telecommunications era, AT&T had a long history of opposing third-party efforts to sell equipment that would attach to its network.5 While the impact of compatibility on competition and the distribution of rents is a complex topic that goes beyond the scope of this chapter, the salient point is that the choice of a modular architecture—or at a lower level, the design of a specific interface—will 3. A substantial economics literature explores such dynamic coordination problems in technology adoption, starting from Arthur (1989), David (1985), and Farrell and Saloner (1986). 4. The quote comes from Brooks (1975). 5. Notable challenges to this arrangement occurred in the 1956 “Hush-a-Phone” court case (238 F.2d 266, D.C. Cir., 1956) and the Federal Communication Commission’s 1968 Carterphone ruling (13 F.C.C.2d 420).


not necessarily reflect purely design considerations in a manner that weighs social costs and benefits.6 It is difficult to say what a less modular Internet would look like. Comparisons to the large closed systems of earlier eras (e.g., the IBM mainframe and the AT&T telecommunications network) suggest that there would be less innovation and commercialization by independent users of the network, in part because of the greater costs of achieving interoperability. However, centralized design and governance could also have benefits in areas such as improved security. Instead of pursuing this difficult counterfactual question, the remainder of this chapter will focus on documenting the modularity of the Internet architecture and showing how that modularity is related to the division of labor in standardization and the dynamics of complementary innovation. 1.1.2

Setting Standards

If the key social trade-off in selecting a modular design involves up-front fixed costs versus ex post flexibility, it is important to have a sense of what is being specified up front. Baldwin and Clark (2000) argue that a modular system partitions design information into visible design rules and hidden parameters. The visible rules consist of (a) an architecture that describes a set of modules and their functions, (b) interfaces that describe how the modules will work together, and (c) standards that can be used to test a module’s performance and conformity to design rules. Broadly speaking, the benefits of modularity flow from hiding many design parameters in order to facilitate entry and lower the fixed costs of component innovation, while its costs come from having to specify and commit to those design rules before the market emerges. The process of selecting globally visible design parameters is fundamentally a coordination problem, and there are several possible ways of dealing with it. Farrell and Simcoe (2012) discuss trade-offs among four broad paths to compatibility: decentralized technology adoption (or “standards wars”); voluntary consensus standard setting; taking cues from a dominant “platform leader” (such as a government agency or the monopoly supplier of a key input); and ex post efforts to achieve compatibility through converters and multihoming. In the GPT setting, each path to compatibility provides an alternative institutional environment for solving the fundamental contracting problem among GPT suppliers, potential inventors in various applications sectors, and consumers. That is, different modes of standardization imply alternative methods of distributing the ex post rents from complementary inventions, and one can hope that some combination of conscious 6. See Farrell (2007) on the general point and MacKie-Mason and Netz (2007) for one example of how designers could manipulate a specific interface.


choice and selection pressures pushes us toward a standardization process that promotes efficient ex ante investments in innovation. While all four modes of standardization have played a role in the evolution of the Internet, this chapter will focus on consensus standardization for two reasons.7 First, consensus standardization within SSOs (specifically, the IETF and W3C, as described below) is arguably the dominant mode of coordinating the design decisions and the supply of new interfaces on the modern Internet. And second, the institutions for Internet standard setting have remarkably transparent processes that provide a window onto the architecture of the underlying system, as well as the division of innovative labor among participants who collectively manage the shared technology platform. If one views the Internet as a general purpose technology, these standard-setting organizations may provide a forum where GPT-producers can interact with application-sector innovators in an effort to internalize the vertical (from GPT to application) and horizontal (among applications) externalities implied by complementarities in innovation across sectors, as modeled in Bresnahan and Trajtenberg (1995). 1.2

Internet Standardization

There are two main organizations that define standards and interfaces for the Internet: the Internet Engineering Task Force (IETF) and World Wide Web Consortium (W3C). This section describes how these two SSOs are organized and explains their relationship to the protocol stack that engineers use to describe the modular structure of the network. 1.2.1

History and Process

The IETF was established in 1986. However, the organization has roots that can be traced back to the earliest days of the Internet. For example, all of the IETF’s official publications are called “Requests for Comments” (RFCs), making them part of a continuous series that dates back to the very first technical notes on packet-based computer networking.8 Similarly, the first two chairs of the IETF’s key governance committee, called the Internet Architecture Board (IAB), were David Clark of MIT and Vint Cerf, who worked on the original IP protocols with Clark before moving to the Defense Advanced Research Projects Agency (DARPA) and funding the 7. For example, Russell (2006) describes the standards war between TCP/IP and the OSI protocols. Simcoe (2012) analyzes the performance of the IETF as a voluntary SSO. Greenstein (1996) describes the NSF’s role as a platform leader in the transition to a commercial Internet. Translators are expected to play a key role in the transition to IPv6, and smartphones are multihoming devices because they select between Wi-Fi (802.11) and cellular protocols to establish a physical layer network connection. 8. RFC 1 “Host Software” was published by Steve Crocker of UCLA in 1969 (http://www .rfc-editor.org/rfc/rfc1.txt). The first RFC editor, Jon Postel of UCLA, held the post from 1969 until his death in 1998.


initial deployment of the network. Thus, in many ways, the early IETF formalized a set of working relationships among academic, government, and commercial researchers who designed and managed the Advanced Research Projects Agency Network (ARPANET) and its successor, the National Science Foundation Network (NSFNET). Starting in the early 1990s, the IETF evolved from its quasi-academic roots into a venue for coordinating critical design decisions for a commercially significant piece of shared computing infrastructure.9 At present the organization has roughly 120 active technical working groups, and its meetings draw roughly 1,200 attendees from a wide range of equipment vendors, network operators, application developers, and academic researchers.10

The W3C was founded by Tim Berners-Lee in 1994 to develop standards for the rapidly growing World Wide Web, which he invented while working at the European Organization for Nuclear Research (CERN). Berners-Lee originally sought to standardize the core web protocols, such as the Hypertext Markup Language (HTML) and Hypertext Transfer Protocol (HTTP), through the IETF. However, he quickly grew frustrated with the pace of the IETF process, which required addressing every possible technical objection before declaring a consensus, and decided to establish a separate consortium, with support from CERN and MIT, that would promote faster standardization, in part through a more centralized organization structure (Berners-Lee and Fischetti 1999).

The IETF and W3C have many similar features and a few salient differences. Both SSOs are broadly open to interested participants. However, anyone can “join” the IETF merely by showing up at a meeting or participating on the relevant e-mail listserv. The W3C must approve new members, who are typically invited experts or engineers from dues-paying member companies. The fundamental organizational unit within both SSOs is the working group (WG), and the goal of working groups is to publish technical documents. The IETF and W3C working groups publish two types of documents. The first type of document is what most engineers and economists would call a standard: it describes a set of visible design rules that implementations should comply with to ensure that independently designed products work together well. The IETF calls this type of document a standards-track RFC, and the W3C calls it a Recommendation.11 At both SSOs, new standards must be approved by consensus, which generally means a substantial supermajority, and in practice is determined by a WG chair, subject
9. Simcoe (2012) studies the rapid commercialization of the IETF during the 1990s, and provides evidence that it produced a measurable slowdown in the pace of standards development.
10. http://www.ietf.org/documents/IETF-Regional-Attendance-00.pdf.
11. Standards-track RFCs are further defined as proposed standards, draft standards, or Internet standards to reflect their maturity level. However, at any given time, much of the Internet runs on proposed standards.


Fig. 1.1

Total RFCs and W3C publications (1969–2011)

Notes: Figure 1.1 plots a count of publications by the IETF and W3C. Pre-IETF publications refer to Request for Comments (RFCs) published prior to the formation of the IETF as a formal organization. Standards are standards-track RFCs published by IETF and W3C Recommendations. Informational publications are nonstandards-track IETF RFCs and W3C notes.

to formal appeal and review by the Internet Engineering Steering Group (IESG) or W3C director.12 The IETF and W3C working groups also publish documents that provide useful information without specifying design parameters. These informational publications are called nonstandards-track RFCs at the IETF and Notes at the W3C. They are typically used to disseminate ideas that are too preliminary or controversial to standardize, or information that complements new standards, such as “lessons learned” in the standardization process or proposed guidelines for implementation and deployment. Figure 1.1 illustrates the annual volume of RFCs and W3C publications between 1969 and 2011. The chart shows a large volume of RFCs published during the early 1970s, followed by a dry spell of almost fifteen years, and then a steady increase in output beginning around 1990. This pattern coincides with a burst of inventive activity during the initial development of ARPANET, followed by a long period of experimentation with various 12. For an overview of standards-setting procedures at IETF, see RFC 2026 “The Internet Standards Process” (http://www.ietf.org/rfc/rfc2026.txt). The W3C procedures are described at http://www.w3.org/2005/10/Process-20051014/tr.


networking protocols—including a standards war between TCP/IP and various proprietary implementations of the open systems interconnection (OSI) protocol suite (Russell 2006). Finally, there is a second wave of sustained innovation associated with the emergence of TCP/IP as the de facto standard, commercialization of the Internet infrastructure and widespread adoption.

If we interpret the publication counts in figure 1.1 as a proxy for innovation investments, the pattern is remarkably consistent with a core feature of the literature on GPTs. In particular, there is a considerable time lag between the initial invention and the eventual sustained wave of complementary innovation that accompanies diffusion across various application sectors. There are multiple explanations for these adoption lags, which can reflect coordination delays such as the OSI versus TCP/IP standards war; the time required to develop and upgrade complementary inputs (e.g., routers, computers, browsers, and smartphones); or the gradual replacement of prior technology that is embedded in substantial capital investments. With respect to replacement effects, it is interesting to note that the share of IETF standards-track publications that upgrade or replace prior standards has averaged roughly 20 percent since 1990, when it becomes possible to calculate such statistics.

Another notable feature of figure 1.1 is the substantial volume of purely informational documents produced at IETF and W3C. This partly reflects the academic origins and affiliations of both SSOs, and highlights the relationship between standards development and collaborative research and development (R&D). It also illustrates how, at least for “open” standards, much of the information about how to implement a particular module or function is broadly available, even if it is nominally hidden behind the layer of abstraction provided by a standardized interface.

To provide a better sense of what is actually being counted in figure 1.1, table 1.1A lists some of the most important IETF standards, as measured by the number of times they have been cited in IETF and W3C publications, or as nonpatent prior art in a US patent in table 1.1B. All of the documents listed in tables 1.1A and 1.1B are standards-track publications of the IETF.13 Both tables contain a number of standards that one might expect to see on such a list, including Transmission Control Protocol (TCP) and Internet Protocol (IP), the core routing protocols that arguably define the Internet; the HTTP specification used to address resources on the Web; and the Session Initiation Protocol (SIP) used to control multimedia sessions, such as voice and video calls over IP networks. Several differences between the two lists in tables 1.1A and 1.1B are also noteworthy.
13. I was not able to collect patent cites for W3C documents, and the W3C Recommendation that received the most SSO citations was a part of the XML protocol that received 100 cites.


Table 1.1A    Most cited Internet standards (IETF and W3C citations)

Document    Year    IETF & W3C citations    Title
RFC 822     1982    346    Standard for the format of ARPA Internet text messages
RFC 3261    2002    341    SIP: Session Initiation Protocol
RFC 791     1981    328    Internet Protocol
RFC 2578    1999    281    Structure of Management Information Version 2 (SMIv2)
RFC 2616    1999    281    Hypertext Transfer Protocol—HTTP/1.1
RFC 793     1981    267    Transmission Control Protocol
RFC 2579    1999    262    Textual conventions for SMIv2
RFC 3986    2005    261    Uniform Resource Identifier (URI): Generic syntax
RFC 1035    1987    254    Domain names—implementation and specification
RFC 1034    1987    254    Domain names—concepts and facilities

Note: This list excludes the most cited IETF publication, RFC 2119 “Key Words for Use in RFCs to Indicate Requirement Levels,” which is an informational document that provides a standard for writing IETF standards, and is therefore cited by nearly every standards-track RFC.

For example, table 1.1A shows that IETF and W3C publications frequently cite the Structure of Management Information Version 2 (SMIv2) protocol, which defines a language and database used to manage individual “objects” in a larger communications network (e.g., switches or routers). On the other hand, table 1.1B shows that US patents are more likely to cite security standards and protocols for reserving network resources (e.g., Dynamic Host Configuration Protocol [DHCP] and Resource Reservation Protocol [RSVP]). These differences hint at the idea that citations from the IETF and W3C measure technical interdependencies or knowledge flows within the computer-networking sector, whereas patent cites measure complementary innovation linked to specific applications of the larger GPT.14 I return to this idea below when examining diffusion. 1.2.2

The Protocol Stack

The protocol stack is a metaphor used by engineers to describe the multiple layers of abstraction in a packet-switched computer network. In principle, each layer handles a different set of tasks associated with networked communications (e.g., assigning addresses, routing and forwarding packets, session management, or congestion control). Engineers working at a particular layer need only be concerned with implementation details at that layer, since the functions or services provided by other layers are described in a set of standardized interfaces.
14. Examining citations to informational publications reinforces this interpretation: The nonstandards-track RFCs most cited by other RFCs describe IETF processes and procedures, whereas the nonstandards-track RFCs most cited by US patents describe technologies that were too preliminary or controversial to standardize, such as Network Address Translation (NAT) and Cisco’s Hot-Standby Router Protocol (HSRP). On average, standards receive many more SSO and patent citations than informational publications.

Table 1.1B    Most cited Internet standards (US patent citations)

Document    Year    US patent citations    Title
RFC 2543    1999    508    SIP: Session Initiation Protocol
RFC 791     1981    452    Internet Protocol
RFC 793     1981    416    Transmission Control Protocol
RFC 2002    1996    406    IP mobility support
RFC 3261    2002    371    SIP: Session Initiation Protocol
RFC 2131    1997    337    Dynamic Host Configuration Protocol
RFC 2205    1997    332    Resource ReSerVation Protocol (RSVP)—Version 1
RFC 1889    1996    299    RTP: A transport protocol for real-time applications
RFC 2401    1998    284    Security architecture for the Internet Protocol
RFC 768     1980    261    User Datagram Protocol

Saltzer, Reed, and Clark (1984) provide an early description of this modular or “end-to-end” network architecture that assigns complex application-layer tasks to “host” computers at the edge of the network, thereby allowing routers and switches to focus on efficiently forwarding undifferentiated packets from one device to another. In practical (but oversimplified) terms, the protocol stack allows application designers to ignore the details of transmitting a packet from one machine to another, and router manufacturers to ignore the contents of the packets they transmit. The canonical TCP/IP protocol stack has five layers: applications, transport, Internet, link (or routing), and physical. The IETF and W3C focus on the four layers at the “top” of the stack, while various physical layer standards are developed by other SSOs, such as the IEEE (Ethernet and Wi-Fi/802.11b), or 3GPP (GSM and LTE). I treat the W3C as a distinct layer in this chapter, though most engineers would view the organization as a developer of application-layer protocols.15

In the management literature on modularity, the “mirroring hypothesis” posits that organizational boundaries will correspond to interfaces between modules. While the causality of this relationship has been argued in both directions (e.g., Henderson and Clark 1990; Sanchez and Mahoney 1996; Colfer and Baldwin 2010), the IETF and W3C clearly conform to the basic cross-sectional prediction that there will be a correlation between module and organizational boundaries. In particular, both organizations assign individual working groups to broad technical areas that correspond to distinct modules within the TCP/IP protocol stack. For each layer, the IETF maintains a technical area comprised of several related working groups overseen by a pair of area directors who sit on the Internet Engineering Steering Group (IESG).
15. Within the W3C there are also several broad areas of work, including Web design and applications standards (HTML, CSS, Ajax, SVG), Web infrastructure standards (HTTP and URI) that are developed in coordination with IETF, XML standards, and standards for Web services (SOAP and WSDL).


Fig. 1.2

Evolution of the Internet Protocol Stack

Notes: Figure 1.2 plots the share of all IETF and W3C standards-track publications associated with each layer in the Internet Protocol Stack, based on the author’s calculations using data from IETF and W3C. The full layer names are: RTG = routing, INT = Internet, TSV = transport, RAI = real-time applications and infrastructure, APP = applications, and W3C = W3C. The figure excludes RFCs from the IETF operations and security areas, which are not generally treated as a “layer” within the protocol stack (see figure 1.3).

In addition to the areas corresponding to layers in the traditional protocol stack, the IETF has created a real-time applications area to develop standards for voice, video, and other multimedia communications sessions. This new layer sits “between” application and transport-layer protocols. Finally, the IETF manages two technical areas—security and operations—that exist outside of the protocol stack and develop protocols that interact with each layer of the system.

Figure 1.2 illustrates the proportion of new IETF and W3C standards from each layer of the protocol stack over time. From 1990 to 1994, protocol development largely conformed to the traditional model of the TCP/IP stack. Between 1995 and 1999, the emergence of the Web was associated with an increased number of higher-level protocols, including the early IETF work on HTML/HTTP, and the first standards from the W3C and real-time applications and infrastructure layers. From 2000 to 2012 there is a balancing out of the share of new standards across the layers of the protocol stack. The resurgence of the routing layer between 2005 and 2012 was based on a combination of upgrades to legacy technology and the creation of new standards, such as label-switching protocols (MPLS) that allow IP networks to function more like a switched network that maintains a specific path between source and destination devices.
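To make the layering concrete, the following short Python sketch illustrates the division of labor that the protocol stack imposes: each layer wraps the payload it receives from the layer above with its own header and treats that payload as opaque. The layer names follow the stack described above, but the header fields and addresses are invented for exposition and are not drawn from any RFC.

# Illustrative only: each layer adds its own header and never inspects the
# payload handed down from the layer above. Field values are invented.

def encapsulate(message: str) -> dict:
    application = {"layer": "application", "payload": message}
    transport = {"layer": "transport",
                 "header": {"src_port": 49152, "dst_port": 80},
                 "payload": application}
    internet = {"layer": "internet",
                "header": {"src_ip": "192.0.2.1", "dst_ip": "198.51.100.7"},
                "payload": transport}
    link = {"layer": "link", "header": {"frame": "example"}, "payload": internet}
    return link

def route(packet: dict) -> dict:
    # A router reads only the internet-layer header; the transport and
    # application layers remain an opaque payload.
    return packet["payload"]["header"]

if __name__ == "__main__":
    print(route(encapsulate("GET / HTTP/1.1")))
    # {'src_ip': '192.0.2.1', 'dst_ip': '198.51.100.7'}

Changing how the application layer formats its message requires no change to the routing function, which is the sense in which a standardized interface "hides" design parameters from the layers below.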


Figure 1.2 illustrates several points about the Internet’s modular architecture that are linked to the literature on GPTs. If one views the Web as a technology that enables complementary inventions across a wide variety of application sectors (e.g., e-commerce, digital media, voice-over IP, online advertising, or cloud services), it is not surprising to see initial growth in application-layer protocol development, followed by the emergence of a new real-time layer, followed by a resurgence of lower-layer routing technology. This evolution is broadly consistent with the notion of positive feedback from application-sector innovations to extensions of the underlying GPT. Unfortunately, like most papers in the GPT literature, I lack detailed data on Internet-related inventive activity across the full range of application sectors, and I am therefore limited to making detailed observations about the innovation process where it directly touches the GPT. Nevertheless, if one reads the RFCs and W3C Recommendations, links to protocols developed by other SSOs to facilitate application sector innovation are readily apparent. Examples include standards for audio/video compression (ITU/H.264) and for specialized commercial applications of general-purpose W3C tools like the XML language.

Figure 1.2 also raises several questions that will be taken up in the remainder of the chapter. First, how modular is the Internet with respect to the protocol stack? In particular, do we observe that technical interdependencies are greater within than between layers? Is there a specialized division of labor in protocol development? Second, is it possible to preserve the modularity of the entire system when a new set of technologies and protocols is inserted in the middle of the stack, as with the real-time area? Finally, the dwindling share of protocol development at the Internet layer suggests that the network may be increasingly “locked in” to legacy protocols at its key interface. For example, the IETF has long promoted a transition to a set of next generation IP protocols (IPv6) developed in the 1990s, with little success. This raises the question of whether modularity and collective governance render technology platforms less capable of orchestrating “big push” technology transitions than alternative modes of platform governance, such as a dominant platform leader. 1.3

Internet Modularity

Whether the Internet is actually modular in the sense of hiding technical interdependencies and, if so, how that modularity relates to the division of innovative labor, are two separate questions. This section addresses them in turn. 1.3.1

Decomposability

Determining the degree of modularity of a technological system is fundamentally a measurement problem that requires answering two main questions: (1) how to identify interfaces or boundaries between modules, and


(2) how to identify interdependencies across modules. The TCP/IP protocol stack and associated technical areas within the IETF and W3C provide a natural way to group protocols into modules. I use citations among standards-track RFCs and W3C Recommendations to measure interdependencies. The resulting descriptive analysis is similar to the use of design structure matrices, as advocated by Baldwin and Clark (2000) and implemented in MacCormack, Baldwin, and Rusnak (2012), only using stack layers rather than source files to define modules, and citations rather than function calls to measure technical interdependencies. Citation data were collected directly from the RFCs and W3C publications. Whether these citations are a valid proxy for technical interdependencies will, of course, depend on how authors use them. Officially, the IETF and W3C distinguish between normative and informative citations. Normative references “specify documents that must be read to understand or implement the technology in the new RFC, or whose technology must be present for the technology in the new RFC to work.” Informative references provide additional background, but are not required to implement the technology described in an RFC or Recommendation.16 Normative references are clearly an attractive measure of interdependency. Unfortunately, the distinction between normative and informative cites was not clear for many early RFCs, so I simply use all cites as a proxy. Nevertheless, even if we view informative cites as a measure of knowledge flows (as has become somewhat standard in the economic literature that relies on bibliometrics), the interpretation advanced below would remain apt, since a key benefit of modularity is the “hiding” of information within distinct modules or layers.

Figure 1.3 is a directed graph of citations among all standards produced by the IETF and W3C, with citing layers/technical areas arranged on the Y-axis and cited layers/areas arranged on the X-axis. Shading is based on each cell’s decile in the cumulative citation distribution. Twenty-seven percent of all citations link two documents produced by the same working group, and I exclude these from the analysis.17 In a completely modular or decomposable system, all citations would be contained within the cells along the main diagonal. Figure 1.3 suggests that the Internet more closely resembles a nearly decomposable system, with the majority of technical interdependencies and information flows occurring either within a module or between a module and an adjacent layer in the protocol stack.18 If we ignore the security and operations areas, 89 percent of all citations in figure 1.3 are on the main diagonal or an adjacent cell, whereas a uniformly random citation probability would lead to just 44 percent of all citations on or adjacent to the main diagonal.
16. For the official IESG statement on citations, see http://www.ietf.org/iesg/statement/normative-informative.html.
17. Including within-WG citations would make the Internet architecture appear even more modular.
18. An alternative nonmodular and non-interdependent design configuration would be a hierarchy, with all cites either above or below the main diagonal.


Fig. 1.3


Citations in the Internet Protocol Stack

Notes: Figure 1.3 is a matrix containing cumulative counts of citations from citing layer standards-track publications to cited layer standards-track publications based on the author’s calculations using data from IETF and W3C. Layer names are: RTG = routing, INT = Internet, TSV = transport, RAI = real-time applications and infrastructure, APP = applications, W3C = W3C, SEC = security, and OPS = operations.
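The near-diagonal citation share reported in the text, and the corresponding benchmark under uniformly random citation, can be computed in a few lines. The Python sketch below uses a small, invented layer-by-layer citation-count matrix; the actual counts behind figure 1.3 are not reproduced here.

import numpy as np

# Rows = citing layer, columns = cited layer, ordered RTG, INT, TSV, RAI, APP, W3C.
# These counts are invented for illustration.
C = np.array([
    [120,  40,   5,   2,   1,   0],
    [ 35, 150,  20,   3,   2,   0],
    [  6,  30, 100,  15,   4,   1],
    [  2,  10,  25,  90,  20,   3],
    [  1,   5,  10,  18, 110,  25],
    [  0,   1,   2,   5,  30,  80],
])

def near_diagonal_share(counts):
    """Share of citations on the main diagonal or in a cell adjacent to it."""
    n = counts.shape[0]
    near = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= 1
    return counts[near].sum() / counts.sum()

def random_benchmark(n_layers):
    """Expected near-diagonal share if citations were spread uniformly across
    cells: the fraction of cells on or adjacent to the diagonal (0.44 for six layers)."""
    near = np.abs(np.subtract.outer(np.arange(n_layers), np.arange(n_layers))) <= 1
    return near.mean()

print(round(near_diagonal_share(C), 2), round(random_benchmark(6), 2))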

The exceptions to near-decomposability illustrated in figure 1.3 are also interesting. First, it is fairly obvious that security and operations protocols interface with all layers of the protocol stack: apparently there are some system attributes that are simply not amenable to modularization. While straightforward, this observation may have important implications for determining the point at which a GPT encounters decreasing returns to scale due to the costs of adapting a shared input to serve heterogeneous application sectors. The second notable departure from near-decomposability in figure 1.3 is the relatively high number of interlayer citations to Internet layer protocols. This turns out to be a function of vintage effects. Controlling for publication-year effects in a Poisson regression framework reveals that Internet layer specifications are no more likely to receive between-layer citations than other standards.19
19. These regression results are not reported here, but are available from the author upon request.


Of course, the vintage effects themselves are interesting to the extent that they highlight potential “lock in” to early design choices made for an important interface, such as TCP/IP. Finally, figure 1.3 shows that real-time and transport-layer protocols have a somewhat greater intermodule citation propensity than standards from other layers. Recall that these layers emerged later than the original applications, Internet, and routing areas (see figure 1.2). Thus, this observation suggests that when a new module is added to an existing system (perhaps to enable or complement coinvention in key application areas), it may be hard to preserve a modular architecture, particularly if that module is not located at the “edges” of the stack, as with the W3C. 1.3.2

Division of Labor

While figure 1.3 clearly illustrates the modular nature of the Internet’s technical architecture, it does not reveal whether that modularity is associated with a specialized division of labor. This section will examine the division of labor among organizations involved in IETF standards development by looking at their participation at various layers of the TCP/IP protocol stack.20 The data for this analysis are extracted from actual RFCs by identifying all e-mail addresses in the section listing each author’s contact information, and parsing those addresses to obtain an author’s organizational affiliation.21 The analysis is limited to the IETF, as it was not possible to reliably extract author information from W3C publications. On average, IETF RFCs have 2.3 authors with 1.9 unique institutional affiliations. Because each RFC in this analysis is published by an IETF working group, I can use that WG to determine that document’s layer in the protocol stack. In total, I use data from 3,433 RFCs published by 328 different WGs, whose authors are affiliated with 1,299 unique organizations. Table 1.2 lists the fifteen organizations that participated (i.e., authored at least one standard) in the most working groups, along with the total number of standards-track RFCs published by that organization.

One way to assess whether there is a specialized division of labor in standards creation is to ask whether firms’ RFCs are more concentrated within particular layers of the protocol stack than would occur under random assignment of RFCs to layers (where the exogenous assignment probabilities equal the observed marginal probabilities of an RFC occupying each layer in the stack). Comparing the actual distribution of RFCs across layers to a simulated distribution based on random choice reveals that organizations participating in the IETF are highly concentrated within particular layers.
20. In principle, one might focus on specialization at the level of the individual participant. However, since many authors write a single RFC, aggregating to the firm level provides more variation in the scope of activities across modules.
21. In practice, this is a difficult exercise, and I combined the tools developed by Jari Arkko (http://www.arkko.com/tools/docstats.html) with my own software to extract and parse addresses.
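As a rough illustration of the affiliation-extraction step, the Python sketch below pulls e-mail addresses from the author contact section of an RFC and uses each address's domain as a proxy for the author's organization. It is a simplification of the actual procedure (which combined Jari Arkko's tools with custom software), and the sample text and addresses are invented.

import re

# Invented stand-in for the contact-information section of an RFC.
sample_rfc_text = """
Authors' Addresses

   Jane Engineer
   Example Networks, Inc.
   EMail: jane.engineer@example.com

   John Researcher
   Example University
   EMail: j.researcher@cs.example.edu
"""

EMAIL_RE = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")

def author_affiliations(rfc_text):
    """Return a crude organization label (registrable domain) for each e-mail
    address found in the RFC text. Real data require additional cleaning of
    subdomains, country codes, and corporate name changes."""
    domains = EMAIL_RE.findall(rfc_text)
    return [".".join(d.lower().split(".")[-2:]) for d in domains]

print(author_affiliations(sample_rfc_text))   # ['example.com', 'example.edu']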

Table 1.2    Major IETF participants

Sponsor      Unique WGs    Total standards
Cisco        122           590
Microsoft     65           130
Ericsson      42           147
IBM           40           102
Nortel        38            78
Sun           35            76
Nokia         31            83
Huawei        28            49
AT&T          27            50
Alcatel       26            64
Juniper       25           109
Motorola      24            42
MIT           24            42
Lucent        23            41
Intel         23            33

Specifically, I compute the likelihood-based multinomial test statistic proposed by Greenstein and Rysman (2005) and find a value of –7.1 for the true data, as compared to a simulated value of –5.3 under the null hypothesis of random assignment.22 The smaller value of the test statistic for the true data indicates agglomeration, and the test strongly rejects the null of random choice (SE = 0.17, p = 0.00).

To better understand this pattern of agglomeration in working group participation, it is helpful to consider a simplistic model of the decision to contribute to drafting an RFC. To that end, suppose that firm i must decide whether to draft an RFC for working group w in layer j. Each firm either participates in the working group or does not: ai = 0,1. Let us further assume that all firms receive a gross public benefit Bw if working group w produces a new protocol. Firms that participate in the drafting process also receive a private benefit Siw that varies across working groups, and incur a participation cost Fij that varies across layers. In this toy model, public benefits flow from increasing the functionality of the network and growing the installed base of users. Private benefits could reflect a variety of idiosyncratic factors, such as intellectual property in the underlying technology or improved interoperability with proprietary complements. Participation costs are assumed constant within-layer to reflect the idea that there is a fixed cost to develop the technical expertise needed to innovate within a new module. If firms were all equally capable of innovating at any layer (Fij = Fik, for all i, j ≠ k), there would be no specialized division of labor in standards production within this model.
22. Code for performing this test in Stata has been developed by the author and is available at http://econpapers.repec.org/software/bocbocode/s457205.htm.
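The logic of the agglomeration test described above can also be sketched in a few lines of Python. The statistic below sums the multinomial log-probability of each firm's observed layer allocation under the null of random assignment (with the observed marginal layer shares) and compares it to the same statistic for simulated random allocations; allocations that are more concentrated than typical random draws are less probable under the null, so a smaller value indicates agglomeration. The counts are invented, and the exact formula and normalization may differ from the published Greenstein and Rysman (2005) statistic.

import numpy as np
from scipy.stats import multinomial

rng = np.random.default_rng(0)

# Invented counts: rows = firms, columns = layers (the real data cover 43 firms).
counts = np.array([
    [12,  1,  0,  2,  0,  1],
    [ 0, 10,  3,  0,  1,  0],
    [ 1,  0,  0,  9,  2,  0],
    [ 0,  2,  1,  0,  8,  4],
])
layer_probs = counts.sum(axis=0) / counts.sum()   # marginal layer shares (the null)

def statistic(mat, probs):
    # Sum of multinomial log-probabilities of each firm's layer allocation
    # under random assignment with probabilities `probs`.
    return sum(multinomial.logpmf(row, n=row.sum(), p=probs) for row in mat)

observed = statistic(counts, layer_probs)
simulated = [statistic(np.array([rng.multinomial(row.sum(), layer_probs)
                                 for row in counts]), layer_probs)
             for _ in range(1000)]

print(observed, np.mean(simulated), np.std(simulated))
# An observed value well below the simulated mean rejects random assignment
# in favor of agglomeration, as in the comparison reported in the text.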


To derive a firm’s WG-participation decision, let Φw represent the endogenous probability that at least one other firm joins the working group. Thus, firm i’s payoff from working group participation is Bw + Siw – Fij, while the expected benefit of not joining is ΦwBw. If all firms have private knowledge of Siw, and make simultaneous WG participation decisions, the optimal rule is to join the committee if and only if (1 – Φw)Bw + Siw > Fij. While dramatically oversimplified, this model yields several useful insights. First, there is a trade-off between free riding and rent seeking in the decision to join a technical committee. While a more realistic model might allow for some dissipation of rents as more firms join a working group, the main point here is that firms derive private benefits from participation, and are likely to join when Siw is larger. Likewise, when Siw is small, there is an incentive to let others develop the standard, and that free-riding incentive increases with the probability (Φ) that at least one other firm staffs the committee. Moreover, because Φ depends on the strategies of other prospective standards developers, this model illustrates the main challenge for empirical estimation: firms’ decisions to join a given WG are simultaneously determined.

To estimate this model of WG participation I treat Siw as an unobserved stochastic term, treat Bw as an intercept or WG random effect, and replace Φw with the log of one plus the actual number of other WG participants.23 I parameterize Fij as a linear function of two dummy variables—prior RFC (this layer) and prior RFC (adjacent layer)—that measure prior participation in WGs at the same layer of the protocol stack, or at an adjacent layer conditional on the same-layer dummy being equal to zero. These two dummies for prior RFC publication at “nearby” locations in the protocol stack provide an alternative measure of the division of labor in protocol development that may be easier to interpret than the multinomial test statistic reported above. The regression results presented below ignore the potential simultaneity of WG participation decisions. However, if the main strategic interaction involves a trade-off between free riding and rent seeking, the model suggests that firms will be increasingly dispersed across working groups when the public benefits of protocol development (Bw) are large relative to the private rents (Siw). Conversely, if we observe a strong positive correlation among participation decisions, the model suggests that private benefits of exerting some influence over the standard are relatively large and/or positively correlated across firms. It is also possible to explore the rent-seeking hypothesis by exploiting the difference between standards and nonstandards-track RFCs, an idea developed in Simcoe (2012).
23. An alternative approach would be to estimate the model as a static game of incomplete information following Bajari et al. (2010). However, I lack instrumental variables that produce plausibly exogenous variation in Φw, as required for that approach.
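In display form, the participation rule stated above follows from comparing the expected payoff of joining with the expected payoff of free riding:

\[
\underbrace{B_w + S_{iw} - F_{ij}}_{\text{payoff from joining WG } w}
\;>\;
\underbrace{\Phi_w B_w}_{\text{expected payoff from not joining}}
\quad\Longleftrightarrow\quad
(1-\Phi_w)B_w + S_{iw} > F_{ij},
\]

so firm i contributes when its private benefit, plus the expected increment to the public benefit from ensuring that the standard gets written, exceeds the layer-specific fixed cost.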

Table 1.3    Summary statistics

Variable                          Mean    SD      Min.    Max.
Stds.-track WG participation      0.06    0.24    0       1
Nonstds.-track participation      0.05    0.22    0       1
Prior RFC (this layer)            0.34    0.47    0       1
Prior RFC (adjacent layer)        0.17    0.38    0       1
log(1 + other participants)       2.11    0.86    0       4.51

Specifically, if the normative aspects of standards-track documents provide greater opportunities for rent seeking (e.g., because they specify how products will actually be implemented), there should be a stronger positive correlation among firms’ WG participation decisions, leading to more agglomeration when “participation” is measured as standards-track RFC production than when it is measured as nonstandards-track RFC publication.

The data used for this exercise come from a balanced panel of 43 organizations and 328 WGs where each organization contributed to ten or more RFCs and is assumed to be at risk of participating in every WG.24 Table 1.3 presents summary statistics for the estimation sample and table 1.4 presents coefficient estimates from a set of linear probability models.25

The first four columns in table 1.4 establish that there is a strong positive correlation between past experience at a particular layer of the protocol stack and subsequent decisions to join a new WG at the same layer. Having previously published a standards-track RFC in a WG in a given layer is associated with a 5 to 7 percentage-point increase in the probability of joining a new WG at the same layer. There is a smaller but still significant positive association between prior participation at an adjacent layer and joining a new WG. Both results are robust to adding fixed or random effects for the WG and focal firm. Given the baseline probability of standards-track entry is 6 percent, the “same layer” coefficient corresponds to a marginal effect of 100 percent, and is consistent with the earlier observation that participation in the IETF by individual firms is concentrated within layers. The fifth column in table 1.4 shows that the number of other WG participants has a strong positive correlation with the focal firm’s participation decision. A 1 standard deviation increase in participation by other organizations, or roughly doubling the size of a working group, produces a 5 percentage-point increase in the probability of joining and is therefore roughly equivalent to prior experience at the same layer.
24. Increasing the number of firms in the estimation sample mechanically reduces the magnitude of the coefficient estimates (since firms that draft fewer RFCs participate in fewer working groups, and therefore exhibit less variation in the outcome) but does not qualitatively alter the results.
25. The linear probability model coefficients are nearly identical to average marginal effects from a set of unreported logistic regressions.
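A sketch of the estimation behind table 1.4 is below, written in Python with statsmodels rather than the software actually used, and run on an invented firm-by-WG panel. It fits a linear probability model with WG fixed effects and standard errors clustered by WG, in the spirit of column (3); the variable names and data-generating process are placeholders.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Synthetic stand-in for the 43-firm by 328-WG panel (14,104 firm-WG pairs).
df = pd.DataFrame([(f, w) for f in range(43) for w in range(328)],
                  columns=["firm", "wg"])
df["prior_same_layer"] = rng.binomial(1, 0.34, len(df))
df["prior_adjacent_layer"] = np.where(df["prior_same_layer"] == 0,
                                      rng.binomial(1, 0.25, len(df)), 0)
latent = (0.06 * df["prior_same_layer"] + 0.02 * df["prior_adjacent_layer"]
          + rng.normal(0, 0.2, len(df)))
df["participate"] = (latent > np.quantile(latent, 0.94)).astype(int)  # ~6% entry

# Linear probability model with WG fixed effects; cluster standard errors by WG.
fit = smf.ols("participate ~ prior_same_layer + prior_adjacent_layer + C(wg)",
              data=df).fit(cov_type="cluster", cov_kwds={"groups": df["wg"]})
print(fit.params[["prior_same_layer", "prior_adjacent_layer"]])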

Table 1.4    Linear probability models of IETF working group participation

Outcome: standards-track participation in columns (1)–(5); nonstandards-track participation in column (6)

                                (1)          (2)           (3)          (4)          (5)           (6)
Prior RFC (this layer)          0.06         0.07          0.07         0.05         0.06          0.06
                                [6.87]***    [11.98]***    [9.64]***    [6.25]***    [11.24]***    [11.19]***
Prior RFC (adjacent layer)      0.02         0.02          0.02         0.01         0.02          0.01
                                [3.27]***    [3.12]***     [2.72]***    [1.54]       [3.49]***     [2.36]**
log(other WG participants)                                                           0.06          0.04
                                                                                     [23.70]***    [17.82]***
WG random effects               N            Y             N            N            N             N
WG fixed effects                N            N             Y            Y            N             N
Firm fixed effects              N            N             N            Y            N             N
Observations                    14,104       14,104        14,104       14,104       14,104        14,104

Notes: Unit of analysis is a firm-WG. Robust standard errors clustered by WG (except random effects model). T-statistics in brackets.
***Significant at the 1 percent level.
**Significant at the 5 percent level.
*Significant at the 10 percent level.

The last column in table 1.4 changes the outcome to an indicator for publishing a nonstandards-track RFC in a given WG. In this model, the partial correlation between a focal firm’s participation decision and the number of other organizations in the WG falls by roughly one-third, to 0.04. A chi-square test rejects the hypothesis that the coefficient on log(other participants) is equal across the two models in columns (5) and (6) (χ2(1) = 6.22, p = 0.01). The stronger association among firms’ WG participation decisions for standards-track RFCs than for nonstandards-track RFCs suggests that the benefits of exerting some influence over the standards process are large (relative to the participation costs and/or the public-good benefits of the standard) and positively correlated across firms.26

26. In unreported regressions, I allowed the standards/nonstandards difference to vary by layer and found that the standards-track association was larger at all layers except applications and operations, with statistically significant differences for real-time, Internet, and routing and security.
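One way to implement a test like the chi-square comparison of columns (5) and (6) is to stack the two outcomes and test whether the log(other participants) slope differs by outcome type. The sketch below is illustrative only and is not necessarily the chapter's procedure; the column names and the firm-WG pair identifier used for clustering are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ietf_firm_wg_panel.csv")  # hypothetical file (see earlier sketch)

# Stack standards-track and nonstandards-track participation into one outcome,
# with an indicator for the nonstandards-track rows.
stacked = pd.concat(
    [
        df.assign(y=df["stds_track"], nonstd=0),
        df.assign(y=df["nonstds_track"], nonstd=1),
    ],
    ignore_index=True,
)

# firm_wg_id: assumed identifier for the firm-WG pair, so both stacked rows
# for the same pair fall in the same cluster.
fit = smf.ols(
    "y ~ (prior_same + prior_adjacent + log_other) * nonstd",
    data=stacked,
).fit(cov_type="cluster", cov_kwds={"groups": stacked["firm_wg_id"]})

# Wald test of equal log_other slopes across the two outcomes.
print(fit.wald_test("log_other:nonstd = 0"))
```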


In summary, data from the IETF show that the division of labor in protocol development does conform to the boundaries established by the modular protocol stack. This specialized division of labor emerges through firms’ decentralized decisions to participate in specification development in various working groups. The incentive to join a particular WG reflects both the standard economic story of amortizing sunk investments in developing expertise at a given layer, and idiosyncratic opportunities to obtain private benefits from shaping the standard. The results of a simple empirical exercise show that forces for agglomeration are strong and suggest that incentives to participate for private benefit are typically stronger than free-riding incentives (perhaps because the fixed costs of joining a given committee are small). Moreover, firms’ idiosyncratic opportunities to obtain private benefits from shaping a standard appear to be correlated across working groups, suggesting that participants know when a particular technical standard is likely to be important.

Finally, it is important to note that while this analysis focused on firms that produce at least ten RFCs in order to disentangle their motivations for working group participation, those forty-three firms are only a small part of the total population of 1,299 unique organizations that supplied an author on one or more RFCs. Large active organizations do a great deal of overall protocol development. However, the organizations that only contribute to one or two RFCs are also significant. By hiding many of the details of what happens within any given layer of the protocol stack, the Internet’s modular architecture lowers the costs of entry and component innovation for this large group of small participants.

1.4 Diffusion across Modules and Sectors

The final step in this chapter’s exploration of Internet modularity is to examine the distribution of citations to RFCs over time. As described above, lags in diffusion and coinvention occupy center stage in much of the literature on GPTs for two reasons: (1) they help explain the otherwise puzzling gap between the spread of seminal technologies and the appearance of macroeconomic productivity effects, and (2) they highlight the role of positive innovation externalities between and among application sectors and the GPT-producing sector.

Analyzing the age distribution of citations to standards can provide a window onto the diffusion and utilization of the underlying technology. However, it is important to keep in mind the limitations of citations as a proxy for standards utilization in the following analysis. In particular, we do not know whether any given citation represents a normative technical interdependency or an informative reference to the general knowledge embedded in an RFC. One might also wish to know whether citations come from implementers of the specification, or from producers of complements, who reference the interface in a “black box” fashion. While such fine-grained interpretation of citations between RFCs is not possible in the data I use here, examining the origin and rate of citations does reveal some interesting patterns that hint at the role of modularity in the utilization of Internet standards.

1.4.1 Diffusion across Modules

I begin by examining citation flows across different modules and layers within the IETF and the TCP/IP protocol stack. If the level of technical interdependency between any two standards increases as we move inward from protocols in different layers, to protocols in the same layer, to protocols in the same working group, we should expect to see shorter citation lags. The intuition is straightforward: tightly coupled technologies need to be designed at the same time to avoid mistakes that emerge from unanticipated interactions. Two technologies that interact only through a stable interface need not be contemporaneously designed, since a well-specified interface defines a clear division of labor.27

To test the idea that innovations diffuse within and between modules at different rates, I created a panel of annual citations to standards-track RFCs for sixteen years following their publication. Citation dates are based on the publication year of the citing RFC. The econometric strategy is adapted from Rysman and Simcoe (2008). Specifically, I estimate a Poisson regression of citations to RFC i in citing year y that contains a complete set of age effects (where age equals citing year minus publication year) and a third-order polynomial in citing years to control for time trends and truncation: E[Citesiy] = exp{λage + f(citing year)}.

To summarize these regression results, I set the citing year equal to 2000 and generate the predicted number of citations at each age. Dividing by the predicted cumulative cites over all sixteen years of RFC life yields a probability distribution that I call the citation-age profile. These probabilities are plotted and used to calculate a hypothetical mean citation age, along with its standard error (using the delta method).

27. The costs of time shifting when the division of labor is not clearly defined ex ante will be familiar to anyone who has worked on a poorly organized team project.
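A sketch of the estimation and normalization steps just described, with assumed file and column names and without the delta-method standard errors, might look like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One row per RFC and citing year; column names are illustrative assumptions.
#   cites -- citations received in that citing year
#   age   -- citing year minus publication year (0 through 15)
#   cyear -- citing year
panel = pd.read_csv("rfc_citation_panel.csv")  # hypothetical file

pois = smf.poisson(
    "cites ~ C(age) + cyear + I(cyear ** 2) + I(cyear ** 3)",
    data=panel,
).fit()

# Citation-age profile: predicted cites at each age with the citing year held
# at 2000, normalized to sum to one over the sixteen years of RFC life.
grid = pd.DataFrame({"age": np.arange(16), "cyear": 2000})
profile = pois.predict(grid)
profile = profile / profile.sum()

mean_age = float((np.arange(16) * profile).sum())
print(f"hypothetical mean citation age: {mean_age:.2f} years")
```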

Fig. 1.4  Age profiles for RFC-to-RFC citations

Figure 1.4 illustrates the citation-age profile for standards-track RFCs using three different outcomes: citations originating in the same WG, citations originating in the same layer of the protocol stack, and citations from other layers of the protocol stack.28 The pattern is consistent with the idea that more interconnected protocols are created closer together in time. Specifically, I find that the average age of citations within a working group is 3.5 years (SE = 0.75), compared to 6.7 years (SE = 0.56) for cites from the same layer and 8.9 years (SE = 0.59) for other layers.

The main lesson contained in figure 1.4 is that even within a GPT, innovations diffuse faster within than between modules. This pattern is arguably driven by the need for tightly interconnected aspects of the system to coordinate on design features simultaneously, whereas follow-on innovations can rely on the abstraction and information hiding provided by a well-defined interface. The importance of contemporaneous design for tightly coupled components may be compounded by the fact that many interface layers may need to be specified before a GPT becomes useful in specific application sectors. For example, in the case of electricity, the alternating versus direct current standards war preceded widespread agreement on standardized voltage requirements, which preceded the ubiquitous three-pronged outlet that works with most consumer devices (at least within the United States). While this accretion of interrelated interfaces is likely a general pattern, the Internet and digital technology seem particularly well suited to the use of a modular architecture to reduce the rate at which technical knowledge depreciates and to facilitate low-cost reuse and time shifting.

28. For this analysis, I exclude all cites originating in the security and operations layers (see figure 1.3).

1.4.2 Diffusion across Sectors

To provide a sense of how the innovations embedded in Internet standards diffuse out into application sectors, I repeat the empirical exercise described above, only comparing citations among all RFCs to citations from US patents to RFCs. The citing year for a patent-to-RFC citation is based on the patent’s application date. While there are many drawbacks to patent citations, there is also a substantial literature that argues for their usefulness as a measure of cumulative innovation based on the idea that each cite limits the scope of the inventor’s monopoly and is therefore carefully assessed for its relevance to the claimed invention. For this chapter, the key assumption is simply that citing patents are more likely to reflect inventions that enable applications of the GPT than citations from other RFCs.

Fig. 1.5  Age profiles for RFC-to-RFC and US patent-to-RFC citations

Figure 1.5 graphs the age profiles for all RFC cites and all patent cites. The RFC age profile represents a cite-weighted average of the three lines in figure 1.4, and the average age of an RFC citation is 5.9 years (SE = 0.5). Patent citations clearly take longer to arrive, and are more persistent in later years than RFC cites. The average age of a US patent nonprior citation to an RFC is 8.2 years (SE = 0.51), which is quite close to the mean age for a citation from RFCs at other layers of the protocol stack.

At one level, the results illustrated in figures 1.4 and 1.5 are not especially surprising. However, these figures highlight the idea that a GPT evolves over time, partly in response to the complementarities between GPT-sector and application-sector innovative activities. The citation lags illustrated in these figures are relatively short compared to the long delay between the invention of packet-switched networking and the emergence of the commercial Internet illustrated in figure 1.1. Nevertheless, it is likely that filing a patent represents only a first step in the process of developing application-sector-specific complementary innovations. Replacing embedded capital and changing organizational routines may also be critical, but are harder to measure, and presumably occur on a much longer time frame.

1.5 Conclusion

The chapter provides a case study of modularity and its economic consequences for the technical architecture of the Internet. It illustrates the modular design of the Internet architecture, the specialized division of innovative labor in Internet standards development, and the gradual diffusion of new ideas and technologies across interfaces within that system. These observations are limited to a single technology, albeit one that can plausibly claim to be a GPT with significant macroeconomic impacts.


At a broader level, this chapter suggests that modularity and specialization in the supply of a GPT may help explain its long-run trajectory. In the standard model of a GPT, the system-level trade-off between generality and specialization is overcome through “coinvention” within application sectors. These complementary innovations raise the returns to GPT innovation by expanding the installed base, and also by expanding the set of potential applications. A modular architecture facilitates the sort of decentralized experimentation and low-cost reusability required to sustain growth at the extensive margin, and delivers the familiar benefits of a specialized division of labor in GPT production.

Finally, this chapter highlights a variety of topics that can provide grist for future research on the economics of modularity, standard setting, and general-purpose technologies. For example, while modularity clearly facilitates an interfirm division of labor, even proprietary systems can utilize modular design principles. This raises a variety of questions about the interaction between modular design and “open” systems, such as the Internet, which are characterized by publicly accessible interfaces and particular forms of platform governance. The microeconomic foundations of coordination costs that limit the division of innovative labor within a modular system are another broad topic for future research. For example, we know little about whether or why the benefits of a modular product architecture are greater inside or outside the boundaries of a firm, or conversely, whether firm boundaries change in response to architectural decisions. Finally, in keeping with the theme of this volume, future research might ask whether there is something special about digital technology that renders it particularly amenable to the application of modular design principles. Answers to this final question will have important implications for our efforts to extrapolate lessons learned from studying digitization to other settings, such as life sciences or the energy sector.

References

Arthur, W. Brian. 1989. "Competing Technologies, Increasing Returns, and Lock-In by Historical Events." Economic Journal 97:642–65.
Bajari, P., H. Hong, J. Krainer, and D. Nekipelov. 2010. "Estimating Static Models of Strategic Interactions." Journal of Business and Economic Statistics 28 (4): 469–82.
Baldwin, C. Y., and K. B. Clark. 2000. Design Rules: The Power of Modularity, vol. 1. Boston: MIT Press.


Becker, G. S., and K. M. Murphy. 1992. "The Division of Labor, Coordination Costs, and Knowledge." Quarterly Journal of Economics 107 (4): 1137–60.
Berners-Lee, T., and M. Fischetti. 1999. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by its Inventor. San Francisco: Harper.
Bresnahan, T. 2010. "General Purpose Technologies." In Handbook of the Economics of Innovation, vol. 2, edited by B. Hall and N. Rosenberg, 761–91. Amsterdam: Elsevier.
Bresnahan, T. F., and S. Greenstein. 1999. "Technological Competition and the Structure of the Computer Industry." Journal of Industrial Economics 47 (1): 1–40.
Bresnahan, T., and M. Trajtenberg. 1995. "General Purpose Technologies: Engines of Growth?" Journal of Econometrics 65 (1): 83–108.
Brooks, F. 1975. The Mythical Man-Month. Boston: Addison-Wesley.
Colfer, L., and C. Baldwin. 2010. "The Mirroring Hypothesis: Theory, Evidence and Exceptions." Working Paper no. 10-058, Harvard Business School, Harvard University.
David, Paul A. 1985. "Clio and the Economics of QWERTY." American Economic Review 77 (2): 332–37.
David, Paul A. 1990. "The Dynamo and the Computer: An Historical Perspective on the Modern Productivity Paradox." American Economic Review Papers and Proceedings 80 (2): 355–61.
Dranove, D., C. Forman, A. Goldfarb, and S. Greenstein. 2012. "The Trillion Dollar Conundrum: Complementarities and Health Information Technology." NBER Working Paper no. 18281, Cambridge, MA.
Farrell, J. 2007. "Should Competition Policy Favor Compatibility?" In Standards and Public Policy, edited by S. Greenstein and V. Stango. Cambridge: Cambridge University Press.
Farrell, J., and G. Saloner. 1986. "Installed Base and Compatibility—Innovation, Product Preannouncements, and Predation." American Economic Review 76 (5): 940–55.
Farrell, J., and T. Simcoe. 2012. "Four Paths to Compatibility." In Oxford Handbook of the Digital Economy, edited by M. Peitz and J. Waldfogel, 34–58. Oxford: Oxford University Press.
Greenstein, S. 1996. "Invisible Hand versus Invisible Advisors." In Private Networks, Public Objectives, edited by Eli Noam. Amsterdam: Elsevier.
Greenstein, S., and M. Rysman. 2005. "Testing for Agglomeration and Dispersion." Economics Letters 86 (3): 405–11.
Henderson, R., and K. B. Clark. 1990. "Architectural Innovation: The Reconfiguration of Existing Product Technologies and the Failure of Established Firms." Administrative Science Quarterly 35 (1): 9–30.
Jones, B. F. 2008. "The Knowledge Trap: Human Capital and Development Reconsidered." NBER Working Paper no. 14138, Cambridge, MA.
Langlois, R. 2002. "Modularity in Technology and Organization." Journal of Economic Behavior & Organization 49 (1): 19–37.
MacCormack, A., C. Baldwin, and J. Rusnak. 2012. "Exploring the Duality between Product and Organizational Architectures: A Test of the 'Mirroring' Hypothesis." Research Policy 41:1309–24.
MacKie-Mason, J., and J. Netz. 2007. "Manipulating Interface Standards as an Anticompetitive Strategy." In Standards and Public Policy, edited by S. Greenstein and V. Stango, 231–59. Cambridge: Cambridge University Press.
Russell, A. 2006. "'Rough Consensus and Running Code' and the Internet-OSI Standards War." Annals of the History of Computing, IEEE 28 (3): 48–61.


Rysman, M., and T. Simcoe. 2008. "Patents and the Performance of Voluntary Standard Setting Organizations." Management Science 54 (11): 1920–34.
Saltzer, J. H., D. P. Reed, and D. D. Clark. 1984. "End-to-End Arguments in System Design." ACM Transactions on Computer Systems 2 (4): 277–88.
Sanchez, R., and J. T. Mahoney. 1996. "Modularity, Flexibility, and Knowledge Management in Product and Organization Design." Strategic Management Journal 17:63–76.
Simcoe, T. 2012. "Standard Setting Committees: Consensus Governance for Shared Technology Platforms." American Economic Review 102 (1): 305–36.
Simon, H. A. 1962. "The Architecture of Complexity." Proceedings of the American Philosophical Society 106 (6): 467–82.
Smith, A. 1776. Wealth of Nations, vol. 10, Harvard Classics, edited by C. J. Bullock. New York: P. F. Collier & Son.
Stigler, G., and R. Sherwin. 1985. "The Extent of the Market." Journal of Law and Economics 28 (3): 555–85.

Comment

Timothy F. Bresnahan

In “Modularity and the Evolution of the Internet” Tim Simcoe brings valuable empirical evidence to bear on the structure and governance of the Internet’s more technical, less customer-facing, layers. His main empirical results are about the Internet’s protocol stack, that is, the structure of the technical layers’ modular architecture and of the division of labor in invention of improvements. To organize my discussion, I will follow Simcoe’s main results. There are, however, three distinctions that I want to draw before proceeding: (1) modularity is not the same as openness; (2) one can say that an architecture is modular (or open), which is not the same as saying the process by which the architecture changes is modular (or open); and (3) the Internet, like most ICT platforms, includes both purely technical standards and de facto standards in customer-facing products.

1. Modularity is related to, but not the same as, openness. Modularity is an engineering design concept. A large, complex problem can be broken up into pieces, and engineers working on one piece need know only a small amount about all the other pieces. They do need to know how their piece can interact with the other pieces—for which they (ideally) need know only the information contained in the interface standards described in the IETF (and preceding) and W3C documents analyzed by Simcoe. In contrast, openness is an economic organization concept. It refers to the availability and control of information about interface standards and to the role of a platform sponsor as a gatekeeper.

Timothy F. Bresnahan is the Landau Professor in Technology and the Economy at Stanford University and a member of the board of directors of the National Bureau of Economic Research. For acknowledgments, sources of research support, and disclosure of the author’s material financial relationships, if any, please see http://www.nber.org/chapters/c13056.ack.


In a closed (or proprietary) architecture, a GPT sponsor controls certain interface standards, and access to information about those standards flows to other firms through contracting with the sponsor. The sponsor can compel others to contract either because only it has the interface information or because it controls access to distribution to customers, or both. Modularity makes openness feasible, but many proprietary architectures are quite modular.

2. Modularity is most precisely used as a modifier of an architecture at a moment in time. Modularity in this sense means that the boundaries between layers exist and “local” inventive effort can proceed. An architecture can remain modular over time, however, either by respecting the old boundaries (a part of “backward compatibility”) or by moving them in light of new technical or market developments. As we move to this dynamic viewpoint, an important element of openness is that outsiders can define new general-purpose layers and add them to the stack.

3. The Internet, like most multilayered GPTs, has both technical layers and user-facing layers among its general-purpose components. Simcoe focuses on technical layers and the interfaces between them. He does not focus on the commercial layers that connect the Internet to customers. Search, from Google or Microsoft, is an important general-purpose layer in the Internet for both users and advertisers. So, too, is product search inside Amazon or eBay or other storefronts, for both merchants and consumers. For a long time, the Internet index created by Yahoo appeared to be a general-purpose component. Other examples abound. The key point is that not all of the general components associated with the Internet fall within the organized standard setting of the IETF or the W3C. Some are, instead, set in markets or by dominant firms in some layer.

A Great Transformation as New Uses Are Found

Simcoe usefully notes that the time-series pattern of the count of Internet documents (RFCs and W3C publications) corresponds to the role of the Internet as a GPT, or more precisely, a GPT for which important applications were discovered after a lag. If we interpret the count of documents as an indicator of the amount of inventive activity, there is a burst of invention in the 1970s, comparatively less until the 1990s, and a steady growth from the mid-1990s through the present day. This corresponds broadly to the two main eras of the application of the Internet. From its invention until the commercialization of the Internet in the early 1990s, the Internet largely connected technical users in military and academic labs. While there was steady invention throughout this period, Simcoe shows that the architecture of the Internet, at least as measured by the count of documents, needed to be invented to support this technical-user era but, once invented, did not need radical expansion in capabilities.


The second main era in the application of the Internet is its widespread use for commercial and mass market electronic communication, commerce, and content, hereafter EC3. The commercial portion of this begins in the early to mid-1990s, and, famously, the mass market part of this in the mid- to late 1990s. As Simcoe shows, the ongoing explosion in the range of applications of the Internet that began then and continues to the present has been associated with a dramatic expansion in the number of Internet documents. His interpretation, which is clearly right, is that the wider range of applications elicited new improvements in the general-purpose components.

This pulls together a familiar and an unfamiliar aspect of GPT economics. Familiarly, important applications of a GPT can lag years behind its original invention. Less familiarly, new applications, particularly if they involve much larger demand for the GPT than earlier ones, can call for changes in the technical capabilities of the general-purpose components themselves.

Surprising Persistence of Openness

As Simcoe suggests, this transformation involves at least two surprising and very positive developments: commercialization without proprietization and expansion by outsiders. Both are related to modularity and openness.

Most commercial computing and communications platforms are proprietary.1 The IBM 360 family was proprietary from the get-go, though an essential feature of the family was its modular architecture. The personal computer (PC) began as an open system, but is now the proprietary Microsoft Windows platform, even though there is a great deal of modularity in its architecture. The Oracle or SAP software platforms of the present are at once modular and proprietary. In each case, a single-firm GPT sponsor maintains control over the GPT and, in particular, either controls or commodifies supply of general-purpose layers.

The Internet moved from being mostly a technical-uses GPT to being mostly a commercial-uses GPT without (yet) becoming a proprietary platform with a dominant sponsor firm, and with continued openness. This is a borderline miracle.

How the miracle of commercialization without proprietization was achieved is partly reflected in Simcoe’s tables. Within the technical layers there continues to be an open architecture, and he shows this. Still, our best understanding of how and why this miracle occurred comes from detailed examinations of the important historical epochs at which there was a risk of some or all of the Internet becoming proprietary.

1. As Bresnahan and Greenstein (1999) point out, this tendency is less marked for technical platforms such as minicomputers. Thus, the distinction between the technical layers of the Internet and the commercial GPTs running “on top of” them is economically important.


Shane Greenstein (forthcoming) writes with compelling depth and understanding of the exit of the NSF from Internet funding, the “commercialization of the Internet.” At that stage, it could easily have transited to being an IBM technology—only a very thoughtful exit by the NSF prevented this. Another moment when the Internet might have become proprietary was after Microsoft won the browser war. Faced with substantial scope diseconomies between the businesses offering Windows and the Internet (Bresnahan, Greenstein, and Henderson 2012), the firm ultimately focused on maintaining control of the Windows standard for mass market computing and chose not to use command of the browser to proprietize the Internet.

These important historical transitions illustrate an important theme about causation. The technical layers of the Internet stack studied by Simcoe have remained open and modular in part because of their governance, as Simcoe suggests. Equally important, however, has been the absence of a takeover of standards setting by the firm supplying a complementary commercial layer.

Outsider Innovation

The second surprising and very positive development is expansion of the set of open, modular, general-purpose layers of the Internet by outsiders. An important pair of examples is the World Wide Web (WWW) and the web browser. These inventions transformed the Internet into a mass medium. Today, if you ask most consumers what the Internet is, they will answer in terms of the WWW viewed through a browser.

Both the WWW and the web browser were new layers in the stack. Economically, they are complements to the preexisting layers of the Internet. The openness of the Internet architecture meant that the WWW could be invented without getting the permission of any suppliers of existing Internet components or engaging in contracts with them. Instead, the WWW could be defined in a way that it “runs on top of” the Internet; that is, that it interacts with the other layers through open interface standards. This is, as Shane Greenstein (forthcoming) has emphasized, an important element of open organization.

In turn, the outsiders who invented and (some of whom) later commercialized the web browser did not need to get the permission of the inventors of the WWW or engage in contracts with them. This would have gone badly if it were required, since Tim Berners-Lee, inventor of the Web, strongly disapproved of the web browser once it became commercialized at Netscape. This is an important example of uncontrolled, uncontracted-for invention by outsiders permitted by open systems, for the series of events culminating in the commercialization of the web browser is one of the top ten economic growth innovations of the twentieth century.


Decomposability, Division of Labor, and Diffusion

Simcoe uses citations—from later Internet documents and from patents—to Internet documents to examine the structure of Internet innovation, both organizationally and technically, and the diffusion of new applications of the Internet. This is an extremely valuable undertaking and we can learn much from it. Of course, it also suffers from the difficulties of citations analysis generally.

Simcoe’s analysis of the division of innovative labor seems to me to be a particularly successful deployment of citations methods. The Internet is largely modular in its different technical layers, and firms that work on a layer also tend to patent inventions that are related to that layer. As he points out, considerable gains have been made by having multiple firms inventing and supplying general-purpose components.

The study of the diffusion of new applications for the Internet is a difficult one, and particularly so from a technical-layer-centric perspective. This is, of course, not particularly a weakness of Simcoe’s chapter. Data sets on new technologies generally emphasize the technical rather than application. One cautionary note, however, concerns what an “application” is from the measurable perspective used here. Most of the “applications” studied by Simcoe are themselves GPTs, which connect to the Internet and to which, in turn, many specific applications are connected. This is not a small point. A list of things that are not applications from the perspective of the citations used in this chapter includes Google Search, Facebook social networking, and Apple media and applications sales in the iTunes store. My interpretation would be that there is no doubt that the enormous transformation of the uses of the Internet to the commercial realm and then to mass market EC3 is behind these tables, but that it is less obvious that the timing or breadth of the spread of applications can be seen in these tables.

A difficulty for patent citations is that patent policy is changing over the relevant time period, so that it is not obvious whether the quantitative growth lies in the breadth of applications or in the tendency to patent inventions. The difficulty with Internet document citations is that they are, by their nature, from within the standardized GPT layers of the Internet, not from applications. Only insofar as new applications lead to a change in the GPT layers will an expansion of applications be reflected there.

The Framework

Ultimately, the most interesting thing about Simcoe’s chapter is the perspective it takes on the analysis. We have two very different literatures on coordination between suppliers of general-purpose components and applications. These are sufficiently different, especially in their treatment of the optimal form of coordination, that much confusion has arisen.


The first literature, typically writing about “two-sided markets” or “platform economics,” is concerned mostly with the coordination of production and prices.2 The literature takes a contractual approach to the coordination of applications supply with platform (GPT) supply. To facilitate the contractual approach, the most common assumption is that the general-purpose components are supplied by a single firm. By that I mean each platform or GPT cluster has a single supplier of general-purpose components at its center, and that this firm contracts with, or offers incentives to, suppliers of applications. Sometimes there is competition to be (or to become) the dominant platform or GPT, so that there are competing central sponsors, each offering contracts or incentives to an atomless distribution of applications developers.

While the second literature, typically calling itself “GPT” or “Recombination,”3 treats the same industries, it emphasizes very different phenomena and modeling elements. First, this literature is concerned with the problem of invention, especially repeated rounds of invention, much more than pricing and production. This arises because the practical GPT literature has had to deal with the phenomenon—so emphasized by Simcoe—of general-purpose components supplied by many firms. The “layered” architecture of systems like the Internet involves competition within each layer (rather than competition between whole systems), but complementary invention of improvements across layers. An important general point of this literature is that explicit contracts to coordinate innovation may be impossible, so that “softer” governance structures such as the one described by Simcoe are optimal.

Why might the softer governance structures work? Are they optimal only because the governance structure we would really like, explicit contracts among complementary suppliers, is impossible? There are several important points to make here.

The most important point concerns the possibility of unforeseen and perhaps unforeseeable change. Sometimes after a period of exploitation of a general-purpose technology, new demands or new inventions call for improvements in the general-purpose components. This is a moment at which not drawing too sharp a distinction between “applications” and general-purpose components can be valuable. A system that is open to the invention of new applications (in the strong sense that they do not need to contract with anyone) will have low barriers to entry. If an application is very widely used and itself becomes a general-purpose input into new applications, then the platform is transformed. In Simcoe’s chapter, as in other studies, we see the value of uncoordinated (or only loosely coordinated) innovation for this kind of ex post flexibility.

2. See Jullien (2011) or Rysman (2009). An important exception is Tirole and Weyl (2010), which attempts to extend this framework to invention.
3. See Bresnahan and Trajtenberg (1995).


Modularity and openness permit flexible innovation ex post. They permit flexibility not only in reconfiguration of the platform’s general-purpose components but also in allowing an ex post opportunity for multiple heterogeneous innovators to undertake differentiated efforts to improve the general-purpose components of the same GPT. Elsewhere (Bresnahan 2011) I have argued that it was the modularity and openness of the Internet that made it the winner in a multiway race to be the general-purpose technology underlying the enormous EC3 breakthroughs of the last two decades. Simcoe offers us a fascinating glimpse into the workings of that modularity and openness underlying flexible improvements in the Internet’s GPT components.

References

Bresnahan, T. 2011. "General Purpose Technologies." In Handbook of the Economics of Innovation, edited by Bronwyn Hall and Nathan Rosenberg. North Holland: Elsevier.
Bresnahan, T., and S. Greenstein. 1999. "Technological Competition and the Structure of the Computer Industry." Journal of Industrial Economics 47 (1): 1–40.
Bresnahan, T., S. Greenstein, and R. Henderson. 2012. "Schumpeterian Competition and Diseconomies of Scope: Illustrations from the Histories of Microsoft and IBM." In The Rate and Direction of Inventive Activity Revisited, edited by Josh Lerner and Scott Stern. Chicago: University of Chicago Press.
Bresnahan, T., and Manuel Trajtenberg. 1995. "General Purpose Technologies: 'Engines of Growth'?" Journal of Econometrics special issue 65 (1): 83–108.
Greenstein, S. Forthcoming. Innovation from the Edges. Princeton, NJ: Princeton University Press.
Jullien, B. 2011. "Competition in Multi-Sided Markets: Divide-and-Conquer." American Economic Journal: Microeconomics 3 (4): 1–35.
Rysman, M. 2009. "The Economics of Two-Sided Markets." Journal of Economic Perspectives 23:125–44.
Tirole, Jean, and Glen Weyl. 2010. "Materialistic Genius and Market Power: Uncovering the Best Innovations." IDEI Working Paper no. 629. Institut d'Économie Industrielle (IDEI), Toulouse, France.

2 What Are We Not Doing When We Are Online?
Scott Wallsten

2.1 Introduction

The Internet has transformed many aspects of how we live our lives, but the magnitude of its economic benefits is widely debated. Estimating the value of the Internet is difficult, not just because many online activities do not require monetary payment, but also because these activities may crowd out other, offline, activities. That is, many of the activities we do online, like reading the news or chatting with friends, we also did long before the Internet existed. The economic value created by online activities, therefore, is the incremental value beyond the value created by the activities crowded out. Estimates of the value of the Internet to the economy that do not take these transfers into account will, therefore, overstate the Internet’s economic contribution.

This observation is, of course, not unique to the Internet. In the 1960s Robert Fogel noted that the true contribution of railroads to economic growth was not the gross level of economic activity that could be attributed to them, but rather the value derived from railroads being better than previously existing long-haul transport such as ships on waterways (Fogel 1962, 1964).

Scott Wallsten is vice president for research and senior fellow at the Technology Policy Institute. I thank Alexander Clark and Corwin Rhyan for outstanding research assistance and Avi Goldfarb, Chris Forman, Shane Greenstein, Thomas Lenard, Jeffrey Macher, Laura Martin, John Mayo, Gregory Rosston, Andrea Salvatore, Robert Shapiro, Amy Smorodin, Catherine Tucker, and members of the NBER Economics of Digitization Group for comments. I am especially grateful to Avi, Catherine, and Shane for including me in this fun project. I am responsible for all mistakes. For acknowledgments, sources of research support, and disclosure of the author’s or authors’ material financial relationships, if any, please see http://www.nber.org/chapters/c13001.ack.


The true net economic benefit of the railroad was not small, but it was much smaller than generally believed.

This chapter takes to heart Fogel’s insight and attempts to estimate changes in leisure time spent online and the extent to which new online activities crowd out other activities. If people mostly do online what they used to do offline, then estimates of the benefits of time spent online are biased upward, potentially by a lot. In other words, if online time substitutes for offline time, then that online time purely represents an economic transfer, with the net incremental benefit deriving from the advantages of doing the activity online, but not from the time doing the activity per se. By contrast, brand-new online activities, or those that complement offline activities, do create new value, with the activities crowded out representing the opportunity cost of that new activity.

Using the available data, this chapter does not evaluate which online activities substitute for or complement offline activities. Instead, it estimates the opportunity cost of online leisure time. The analysis suggests that the opportunity cost of online leisure is less time spent on a variety of activities, including leisure, sleep, and work. Additionally, the effect is large enough that better understanding the value of this opportunity cost is a crucial issue in evaluating the effects of online innovation.

To my knowledge, no empirical research has investigated how leisure time online substitutes for or complements other leisure activities.1 In this chapter I begin to answer that question using detailed data from the American Time Use Survey, which allows me to construct a person-level data set consisting of about 124,000 observations from 2003 to 2011.

I find that the share of Americans reporting leisure time online has been increasing steadily, and that much of that time crowds out other activity. On average, each minute of online leisure is associated with 0.29 fewer minutes on all other types of leisure, with about half of that coming from time spent watching TV and video, 0.05 minutes from (offline) socializing, 0.04 minutes from relaxing and thinking, and the balance from time spent at parties, attending cultural events, and listening to the radio. Each minute of online leisure is also correlated with 0.27 fewer minutes working, 0.12 fewer minutes sleeping, 0.10 fewer minutes in travel time, 0.07 fewer minutes in household activities, and 0.06 fewer minutes in educational activities, with the remaining time coming from sports, helping other people, eating and drinking, and religious activities.

Among the interesting findings by population groups, the crowd-out effect of online leisure on work decreases beyond age thirty but remains fairly constant with income. Online leisure has a large crowd-out effect on time spent on education among people age fifteen to nineteen, but the effect decreases steadily with age.

1. One existing study tries to investigate the effects of information technology (IT) use using the same data I use in this chapter, though only from 2003 to 2007. The author finds no particular effect of IT use on time spent on other activities, though the empirical test is simply whether IT users and nonusers spend significantly different amounts of time on various activities. See Robinson (2011).
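The crowd-out estimates summarized above come from regressions presented later in the chapter. The following stylized sketch, with assumed file and column names and a deliberately simplified set of controls, is not the chapter's specification; it only illustrates the kind of regression that produces a "minutes per minute of online leisure" coefficient.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Person-level ATUS extract; all column names are illustrative assumptions.
#   tv_min     -- diary-day minutes watching TV and video
#   online_min -- diary-day minutes of computer use for leisure (no games/e-mail)
atus = pd.read_csv("atus_person_year.csv")  # hypothetical file

# The coefficient on online_min is read as the change in minutes of the other
# activity associated with one more minute of online leisure.
crowd_out = smf.ols(
    "tv_min ~ online_min + age + C(sex) + C(year)",
    data=atus,
).fit(cov_type="HC1")  # heteroskedasticity-robust standard errors

print(crowd_out.params["online_min"])
```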


2.2 Existing Research on the Economic Value of the Internet

The value of the Internet is intrinsically difficult to estimate, in part because it enables so many activities and in part because many of the most popular online activities are “free” in the sense that they have no direct monetary cost to consumers. Several tools exist for valuing nonmarket goods, ranging from contingent valuation surveys to revealed preference inferred from related market activities (Boardman et al. 1996). Those mechanisms have shortcomings. In principle, contingent valuation can tell you willingness to pay, but people often have no reason to respond truthfully to contingent valuation surveys. Measuring spending on relevant complements reveals how much people spend on an activity, but not how much they would be willing to spend.

Given those weaknesses, perhaps the most common approach to valuing time spent on activities outside of work is to value that time at the wage rate under the implicit assumption that the marginal minute always comes from work. Of course, that assumption may be problematic, as those who employ that approach readily admit. Nevertheless, it is a useful starting point. Goolsbee and Klenow (2006) were among the first to apply this approach to the Internet. They estimated the consumer surplus of personal (i.e., nonwork) online time using the wage rate as the measure of time value and an imputed demand curve, putting consumer surplus at about $3,000 per person. Setting aside the question of whether the wage rate is an accurate measure of the value of all leisure time, this approach provides an estimate of gross consumer surplus because it does not measure incremental benefits.

Brynjolfsson and Oh (2012) improve on Goolsbee and Klenow with newer survey data from 2003 to 2010 to measure the value of incremental time spent online. Although they also use the wage rate to estimate surplus, their estimates are smaller in magnitude because they focus on the increase in time spent online over this time period rather than the aggregate time spent online. Based on that approach, they estimate the increase in consumer surplus from the Internet to be about $33 billion, with about $21 billion coming from time spent using “free” online services.

Both Goolsbee and Klenow (2006) and Brynjolfsson and Oh (2012) almost certainly overestimate the true surplus created by the Internet, even setting aside the question of whether all leisure time should be valued at the wage rate. In particular, they neglect to factor in the extent to which consumers are simply doing some things online that they used to do offline, and that new activities must, at least partially, come at the expense of activities they are no longer doing. Spending an hour reading the paper online shows up as a “free” activity, assuming no subscriber paywall, but it is not intrinsically more valuable than the same hour spent reading the news on paper.


Similarly, the net benefit of reading an electronic book on a Kindle, for example, does not include the time spent enjoying the book if it would have otherwise been read in dead-tree format. Instead, the net benefit is only the incremental value of reading an electronic, rather than paper, book. To be sure, the online version of the newspaper must generate additional consumer surplus relative to the offline version or the newspaper industry would not be losing so many print readers, but not all time spent reading the paper online reflects the incremental value of the Internet. Additionally, at a price of zero the activity might attract more consumers than when the activity was paid, or consumers might read more electronic books than paper books because they prefer the format, or because e-books are so much easier to obtain. But even if lower prices increase consumption of a particular activity, the cost of that additional consumption is time no longer spent on another activity.

Activities that once required payment but became free, such as reading the news online, represent a transfer of surplus from producers to consumers, but not new total surplus. Of course, these transfers may have large economic effects as they can lead to radical transformations of entire industries, especially given that consumers spend about $340 billion annually on leisure activities.2 Reallocating those $340 billion is sure to affect the industries that rely on it. Hence, we should expect to see vigorous fights between cable, Netflix, and content producers even if total surplus remains constant. Similarly, as Joel Waldfogel shows in this volume (chapter 14), the radical transformation in the music industry does not appear to have translated into radical changes in the amounts of music actually produced. That is, the Internet may have thrown the music industry into turmoil, but that appears to be largely because the Internet transferred large amounts of surplus to consumers rather than changing net economic surplus.

As the number and variety of activities we do online increases, it stands to reason that our Internet connections become more valuable to us. Greenstein and McDevitt (2009) estimate the incremental change in consumer surplus resulting from upgrading from dialup to broadband service based on changes in quantities of residential service and price indices. They estimate the increase in consumer surplus related to broadband to be between $4.8 billion and $6.7 billion.

2. See table 57 at http://www.bls.gov/cex/2009/aggregate/age.xls. The $340 billion estimate includes expenditures on entertainment, which includes “fees and admissions,” “audio and visual equipment and services,” “pets, toys, hobbies, and playground equipment,” and “other entertainment supplies, equipment, and services.” I added expenditures on reading to entertainment under the assumption that consumer expenditures on reading are likely to be primarily for leisure.


Rosston, Savage, and Waldman (2010) explicitly measure consumer willingness to pay for broadband and its various attributes using a discrete choice survey approach. They find that consumers were willing to pay about $80 per month for a fast, reliable broadband connection in 2010, up from about $46 per month in 2003. In both years the average connection price was about $40, implying that (household) consumer surplus increased from about $6 per month in 2003 to $40 per month in 2010. That change suggests an increase of about $430 per year in consumer surplus between 2003 and 2010. Translating this number into total consumer surplus is complicated by the question of who benefits from each broadband subscription and how to consider their value from the connection. That is, a household paid, on average, $40 per month for a connection, but does each household member value the connection at $80? Regardless of the answer to that question, Rosston, Savage, and Waldman’s (2010) estimate is clearly well below that of Goolsbee and Klenow (2006).

In the remainder of the chapter I build on this research by explicitly estimating the cost of online activities, investigating the extent to which online activities crowd out previous activities.

2.3 The American Time Use Survey, Leisure Time, and Computer Use

Starting in 2003, the US Bureau of Labor Statistics and the US Census began the American Time Use Survey (ATUS) as a way of providing “nationally representative estimates of how, where, and with whom Americans spend their time, and is the only federal survey providing data on the full range of nonmarket activities, from childcare to volunteering.”3 Each year the survey includes about 13,000 people (except in 2003, when it included about 20,000) whose households had recently participated in the Current Population Survey (CPS).4 From the relevant BLS files we constructed a 2.5 million-observation data set at the activity-person-year level for use in identifying the time of day in which people engage in particular activities, and a 124,000-observation, person-year-level data set for examining the crowd-out effect.

3. http://www.bls.gov/tus/atussummary.pdf.
4. More specifically, BLS notes that “Households that have completed their final (8th) month of the Current Population Survey are eligible for the ATUS. From this eligible group, households are selected that represent a range of demographic characteristics. Then, one person age 15 or over is randomly chosen from the household to answer questions about his or her time use. This person is interviewed for the ATUS 2–5 months after his or her household’s final CPS interview.” See http://www.bls.gov/tus/atusfaqs.htm.


The ATUS has several advantages for estimating the extent to which online time may crowd out or stimulate additional time on other activities. First, each interview covers a full twenty-four-hour period, making it possible to study how time spent on one activity might affect time spent on another activity. Second, it is connected to the CPS, so it includes copious demographic information about the respondents. Third, the survey focuses on activities, not generally on the tools used to conduct those activities. So, for example, reading a book is coded as “reading for personal interest” regardless of whether the words being read are of paper or electronic provenance.5 As a result, the value of the time spent reading would not be mistakenly attributed to the Internet when using these data. Similarly, time spent watching videos online would be coded as watching TV, not computer leisure time.

The survey does, however, explicitly include some online activities already common when the survey began in 2003. In particular, time spent doing personal e-mail is a separate category from other types of written communication.6 Online computer games, however, are simply included under games. The ATUS coding rules therefore imply that any computer- or Internet-based personal activity that did not exist in 2003 as its own category would be included under “Computer use for leisure (excluding games),” which includes “computer use, unspecified” and “computer use, leisure (personal interest).”7 For example, Facebook represents the largest single use of online time today, but ATUS has no specific entry for social media, and therefore Facebook would almost certainly appear under computer use for leisure. This feature of the ATUS means that increases in computer use for leisure represent incremental changes in time people spend online and that it should be possible to determine the opportunity cost of that time—what people gave up in order to spend more time online.

It is worth noting, however, that the ATUS does not code multitasking, which is a distinct disadvantage to this research to the extent that online behavior involves doing multiple activities simultaneously. In principle the survey asks whether the respondent is doing multiple activities at a given time, but only records the “primary” activity. To reiterate, the ATUS does not make it possible to determine, say, how much time spent watching video has migrated from traditional television to online services like Netflix. It does, however, tell us how new online activities since 2003 have crowded out activities that existed at that time and—to extend the video example—how much those activities have crowded out (or in) time spent watching video delivered by any mechanism.

5. More explicitly, reading for pleasure is activity code 120312: major activity code 12 (socializing, relaxing, and leisure), second-tier code 03 (relaxing and leisure), third-tier code 12 (reading for personal interest). http://www.bls.gov/tus/lexiconwex2011.pdf.
6. Code 020904, “household and personal e-mail and messages,” is different from code 020903, “household and personal mail and messages (not e-mail).” See http://www.bls.gov/tus/lexiconwex2011.pdf, p. 10. Inexplicably, however, any time spent doing volunteer work on a computer is its own category (150101) (http://www.bls.gov/tus/lexiconwex2011.pdf, p. 44).
7. See http://www.bls.gov/tus/lexiconwex2011.pdf, p. 34.
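As a sketch of how the person-level extract described above could be built from activity-level records, the snippet below sums activity durations into daily minutes of online leisure per respondent. The file name, column names, and the single placeholder activity code are assumptions rather than actual ATUS variable names or codes; the correct code(s) should come from the ATUS lexicon cited in the notes.

```python
import pandas as pd

# Activity-level records: one row per reported activity spell on the diary day.
# Column names (person_id, year, activity_code, duration_min) are illustrative.
acts = pd.read_csv("atus_activity_level.csv", dtype={"activity_code": str})

# Placeholder set of codes for "computer use for leisure (excluding games)".
online_leisure_codes = {"120308"}  # assumed; verify against the ATUS lexicon

acts["online_min"] = acts["duration_min"].where(
    acts["activity_code"].isin(online_leisure_codes), 0
)

person_day = acts.groupby(["year", "person_id"], as_index=False).agg(
    online_min=("online_min", "sum"),
    total_min=("duration_min", "sum"),  # roughly 1,440 for a complete diary day
)

print(person_day.head())
```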

Fig. 2.1  Evolution of examples of “computer use for leisure” provided for ATUS coders
Source: “ATUS Single-Year Activity Coding Lexicons,” 2003–2011, http://www.bls.gov/tus/lexicons.htm.

A significant disadvantage, however, is that the ATUS is a survey and, as discussed above, respondents have little reason to respond truthfully, especially about sensitive subjects. For example, would viewing pornography online be categorized under “computer use for leisure” (based on the “unspecified” example in the codebook), or under “personal/private activities” (also the “unspecified” example under this subcategory)?

2.3.1 “Computer Use for Leisure” is Online Time

The relevant ATUS category is time spent using a computer for leisure.8 This measure explicitly excludes games, e-mail, and computer use for work and volunteer activities. While some computer leisure activities may not necessarily involve the Internet, nearly all of the many examples provided to interviewers under that heading involve online activities (figure 2.1). Additionally, while the measure is coded as “computer use for leisure,” based on the coding instructions it also likely includes mobile device use.

8. Computer games are simply recorded as “leisure/playing games,” and e-mail is coded as “household and personal e-mail and messages.” Text messaging is recorded as “telephone calls.” Bureau of Labor Statistics (2010).


Table 2.1    Top ten online activities by time spent on them

                                               Share of time                        Position change
Rank   Category                           May–11 (%)   Jun–10 (%)   Jun–09 (%)        ’10–’11
1      Social networks                      22.50         22.70        15.80             ↔
2      Online games                          9.80         10.20         9.30             ↔
3      E-mail                                7.60          8.50        11.50             ↔
4      Portals                               4.50          4.40         5.50             ↔
5      Videos/movies (a)                     4.40          3.90         3.50             ↑1
6      Search                                4.00          3.50         3.40             ↑1
7      Instant messaging                     3.30          4.40         4.70             ↓2
8      Software manufacturers                3.20          3.30         3.30             ↔
9      Classifieds/auctions                  2.90          2.70         2.70             ↑1
10     Current events and global news        2.60          —            —                ↑1
       Multicategory entertainment           —             2.80         3.00             ↓2
       Other (b)                            35.10         34.30        37.30

Source: Nielsen NetView (June 2009–2010) and Nielsen State of the Media: The Social Media Report (Q3 2011).
(a) Nielsen's videos/movies category refers to time spent on video-specific (e.g., YouTube, Bing Videos, Hulu) and movie-related websites (e.g., IMDB, MSN Movies, and Netflix). It does not include video streaming on non-video-specific or movie-specific websites (e.g., streamed video on sports or news sites).
(b) Other refers to 74 remaining online categories for 2009–2010 and 75 remaining online categories for 2011 visited from PC/laptops.

tionally, while the measure is coded as “computer use for leisure,” based on the coding instructions it also likely includes mobile device use. Based on what the ATUS measure excludes and other sources of information detailing what online activities include, we can get a good idea of what people are probably spending their time doing. Nielsen identifies the top ten online activities (table 2.1). Of the top ten, the ATUS variable excludes online games, e-mail, and any Internet use for work, education, or volunteer activities. Based on this list, it is reasonable to conclude that the top leisure uses included in the ATUS variable are social networks, portals, and search. 2.3.2

How Do Americans Spend Their Time?

The New York Times produced an excellent representation of how Americans spend their time from the ATUS (figure 2.2). As the figure highlights, ATUS data track activities by time of day and activity, as well as by different population groupings due to coordination with the CPS. Each major activity in the figure can be broken down into a large number of smaller activities under that heading. The figure reveals the relatively large amount of time people spend engaged in leisure activities, including socializing and watching TV and movies. The ATUS includes detailed data on how people spend their leisure time.

What Are We Not Doing When We Are Online?

Fig. 2.2

63

How Americans spent their time in 2008, based on ATUS

Source: New York Times (2009). http://www.nytimes.com/interactive/2009/07/31/business /20080801 –metrics-graphic.html.

The ATUS has seven broad categories of leisure, but I pull “computer use for leisure” out of the subcategories to yield eight categories of leisure. Figure 2.3 shows the share of time Americans spent on these leisure activities in 2011. The total time Americans engage in leisure on average per day has remained relatively constant at about five hours, increasing from 295 minutes in 2003 to about 304 minutes in 2011, though it has ranged from 293 to 305 minutes during that time. Figure 2.4 shows the average number of minutes spent per day using a computer for leisure activities. While the upward trend since 2008 is readily apparent, the data also show that, on average, at about thirteen minutes per day, leisure time online is a small share of the total five hours of daily leisure activities the average American enjoys. This average is deceptively low, in part, not just because it does not include time spent doing e-mail, watching videos, and gaming, but also because it is calculated across the entire population, so is not representative of people who spend any time online. Figure 2.5 shows that the average is low primarily because a fairly small share of the population reports spending any leisure time online (other than doing e-mail and playing games). However, the figure shows that the share of the population who spend nongaming and non– e-mail leisure time online is increasing, and, on average, people who spend any leisure time online spend about 100 minutes a day—nearly one-third of their total daily leisure time.

Fig. 2.3

Share of leisure time spent on various activities, 2011

Source: ATUS 2011 (author’s derivation from raw data). Note: Average total daily leisure time is about five hours.

Fig. 2.4

Average minutes per day spent using computer for leisure

What Are We Not Doing When We Are Online?

65

Fig. 2.5 Share of population using computer for leisure and average number of minutes per day among those who used a computer for leisure

Fig. 2.6

Minutes and share of leisure time online by age group in 2010

2.3.3 Who Engages in Online Leisure? Online leisure time differs across many demographics, including age and income. As most would expect, the amount of online leisure time decreases with age, more or less (figure 2.6). People between ages fifteen and seventeen spend the most time online, followed by eighteen- to twenty-four-year-olds. Perhaps somewhat surprisingly, the remaining age groups report spending similar amounts of time engaged in online leisure. However, because total

66

Scott Wallsten

Fig. 2.7

Time spent using computer for leisure by age and year

leisure time increases with age, beginning with the group age thirty-five to forty-four, the share of leisure time spent online continues to decrease with age. Perhaps not surprisingly given the trends discussed above, both the amount of leisure time spent online (figure 2.7) and the share of respondents reporting spending leisure time online is generally increasing over time (figure 2.8). Leisure time also varies by income. Figure 2.9 shows average total leisure time excluding computer use and computer use for leisure by income. The figure shows that overall leisure time generally decreases with income. Computer use for leisure, on the other hand, appears to increase with income. People with higher incomes, however, are more likely to have computer access at home, meaning average computer use by income is picking up the home Internet access effect. Goldfarb and Prince (2008) investigated the question of online leisure by income in a paper investigating the digital divide. Based on survey data from 2001, they find that conditional on having Internet access, wealthier people spend less personal time online than poorer people. Their key instrument identifying Internet access is the presence of a teenager living in the house, which may make a household more likely to subscribe to the Internet but not more likely to spend personal time online except due to having Internet access.

Fig. 2.8

Share of respondents reporting using computer for leisure by age and year

Fig. 2.9

Leisure time by income

68

Scott Wallsten

With the ATUS data I can attempt to replicate their instrumental variables results using this more recent data. While I know the ages of all household members, the data do not indicate whether a household has Internet access. However, I can identify some households that have access. In particular, any ATUS respondent who spends any time at home involved in computer leisure, e-mail, or using a computer for volunteer work must have home Internet access. Following Goldfarb and Prince, I estimate the following two simultaneous equations using two-stage least squares: home Internet accessi =

(1)

 incomei,educationi, agei, sexi, racei, marriedi, number of children  in householdi, Spanish-speaking onlyi, labor force statusi, f  (metro, suburban, rural)i , leisure excluding computer usei, yeart,   survey day of weeki, teenager in housei

     

 computer use for leisurei = f ((Z) , home Internet access i ) ,  where i indicates a respondent, and Z is the vector of independent variables included in the first equation. Note the absence of a t subscript—no individual appears more than once in the survey, so the data are a stacked cross section rather than a pure time series. “Labor force status” is a vector of dummy variables indicating whether the respondent is employed and working, employed but absent from work, employed but on layoff, unemployed and looking for work, or not in the labor force. I include year dummy variables to control for time trends. I include an indicator for the day of the week the survey took place since certain activities—leisure time especially— differs significantly across days. As mentioned, my indicator for home Internet access identifies only a portion of households that actually have Internet access. This method implies that only 17 percent of households had access in 2010 when the US Census estimated that more than 70 percent actually had access.9 Nevertheless, in the first stage of this two-stage model the variable is useful in creating a propensity to have access for use in the second stage in that while the level is wrong, the fitted trend in growth in Internet access tracks actual growth in access reasonably well. The fitted propensity to have access increases by about 70 percent while actual home Internet access increased by about 78 percent during that same time period.10 Table 2.2 shows the (partial) results of estimating the set of equations above. The first column replicates Goldfarb and Prince. These results mirror theirs: conditional on home Internet access, computer leisure time decreases with income. In order to see whether computer leisure looks different from (2)

9. See http://www.ntia.doc.gov/files/ntia/data/CPS2010Tables/t11_2.txt. 10. See http://www.pewinternet.org/Trend-Data-(Adults)/Internet-Adoption.aspx.

Table 2.2

Variable $10k–$19.9k $20k–$29k $30k–$49k $50k–$75k $75k–$99k $100k–$149k > = $150k Age Male Grade 6 Grades 7, 8, 9 High school, no diploma High school grad. Some college Associate/vocational degree Bachelor’s Master’s Professional Doctoral

Computer leisure as a function of income Computer leisure

Computer as share of leisure

0.00264 (0.00453) –1.015 (–1.371) –2.352*** (–2.621) –3.510*** (–3.079) –3.993*** (–3.257) –4.690*** (–3.530) –4.701*** (–3.699) –0.0244 (–1.355) 4.164*** (18.00) 3.459** (2.086) 2.044** (2.180) 3.450*** (3.910) 1.777** (2.154) 0.0904 (0.144) –0.0462 (–0.0563) –4.004*** –5.909*** (–5.276) –2.928** (–2.374) –5.557*** (–3.809)

0.00124 (0.748) –0.00238 (–1.134) –0.00622** (–2.477) –0.0101*** (–3.148) –0.0108*** (–3.155) –0.0122*** (–3.241) –0.0124*** (–3.447) –6.83e–05 (–1.241) 0.00661*** (9.131) 0.0119*** (2.582) 0.00827*** (3.005) 0.0100*** (3.903) 0.00345 (1.429) –0.00229 (–1.281) –0.00250 (–1.067) –0.00969*** –0.0163*** (–5.321) –0.0116*** (–3.292) –0.0116*** (–2.820)

Variable Black   American Indian   Asian   White American Indian White Asian   White Asian Hawaiian Spanish only Hhld Monday   Tuesday   Wednesday   Thursday   Friday   Saturday   Constant   Observations R-squared  

Computer leisure

Computer as share of leisure

3.078*** (4.590) 1.176 (0.829) 2.250*** (3.194) –2.314* (–1.842) 8.130*** (3.112) 42.55*** (4.270) 0.906* (1.741) –2.568*** (–5.955) –3.292*** (–7.548) –4.565*** (–9.920) –3.374*** (–7.596) –0.781* (–1.758) 0.217 (0.498) 1.988 (1.067) 110,819 0.176

0.0101*** (5.414) 0.00801** (1.975) 0.0122*** (5.864) –0.00195 (–0.545) 0.0227*** (3.026) 0.450*** (15.62) 0.00177 (1.254) 0.00127 (0.875) –0.000695 (–0.461) –0.00189 (–1.107) –0.00289* (–1.853) 0.00240** (1.975) 0.00127 (1.030) 0.00500 (1.289) 106,869 0.238

 

 

Notes: Other variables included but not shown: year fixed effects; number of household children; urban, rural, suburban status; labor force status. (Abridged results of second stage only; full results, including first stage, in appendix at http://www.nber.org/data-appendix/c13001/appendix-tables.pdf.) ***Significant at the 1 percent level. **Significant at the 5 percent level. *Significant at the 10 percent level.

70

Scott Wallsten

other types of leisure, I change the dependent variable to computer leisure as a share of total leisure (column [2]). These results are similar in that conditional on home Internet access, computer time as a share of total leisure time decreases with income, although the effect is fairly small in magnitude above $50,000 in annual family income. I also find that computer use for leisure decreases with education, conditional on access, although the effect on computer use as a share of leisure is less straightforward. For example, online leisure as a share of total leisure is less for people with master’s degrees than for people with doctorate degrees. By race, people who identify as “White-Asian-Hawaiian” spend the most time engaged in online leisure, followed by “White-Asian,” “Black,” and finally “White.” Not surprisingly, the largest amount of online leisure takes place on Saturday and Sunday, followed closely by Friday. Wednesday appears to have the least online leisure. As Goldfarb and Prince note, these results shed some light on the nature of the digital divide. In particular, while we know from census and other data that a significant gap remains on Internet access conditional on access, poorer people and minorities are more likely to engage in computer leisure than are rich people and white people. Goldfarb and Prince note that these results are consistent with poorer people having a lower opportunity cost of time. These results, using ATUS data, are also consistent with that hypothesis. However, because, as shown above, poorer people engage in more leisure time overall, the results also suggest that online leisure may not be so different from offline leisure, at least in terms of how people value it. 2.3.4

What Times Do People Engage in Online Leisure?

As discussed, to better understand the true costs (and benefits) of time spent online, it is important to figure out the source of the marginal minute online—What activities does it crowd out? It is reasonable to assume that much of it comes from other leisure activities, since leisure time has remained unchanged for so many years, but it need not necessarily come only from other leisure time. To begin to understand where online time comes from, we first look at it in the context of some other (major) activities throughout the day. Figure 2.10 shows how sleep, work, leisure (excluding computer time), and computer time for leisure are distributed throughout the day. Not surprisingly, most people who work begin in the morning and end in the evening, with many stopping mid-day, presumably for lunch. People begin heading to sleep en masse at 9:00 p.m. with nearly half the population over age fifteen asleep by 10:00 p.m. and almost everyone asleep at 3:00 a.m. Leisure time begins to increase as people wake up and increases steadily until around 5:00 p.m. when the slope increases and the share of people engaged in leisure peaks at about 8:45 p.m. before dropping off as people go to sleep. Time engaged in computer leisure, a subcategory of leisure, tracks overall

What Are We Not Doing When We Are Online?

71

Fig. 2.10 Percentage of people who engage in major activities doing that activity throughout the day

leisure fairly well, but exhibits somewhat less variation. In particular, the peak in the evening is not as pronounced and continues later in the evening. This time distribution suggests that computer leisure may, in principle, crowd out not just other leisure activities, but also work, sleep, and other (smaller) categories. The next section investigates the extent to which online leisure crowds out these other categories. 2.4

What Does Online Leisure Crowd Out?

The ATUS has seventeen major categories of activities (plus one unknown category for activities that the interviewer was unable to code). Each of these major categories includes a large number of subcategories. The first step in exploring where online leisure time comes from is to investigate its effects at the level of these major categories. The second step will be investigating the effects within those categories. 2.4.1

Major Activity Categories

Figure 2.11 shows the average time spent on each of the eighteen major categories. Personal care, which includes sleep, represents the largest block of time, followed by leisure, work, and household activities. To explore potential crowd-out effects, I begin by estimating eighteen versions of equation (3), once for each major activity category.

72

Scott Wallsten

Fig. 2.11

Average time spent on daily activities, 2003‒2011

major activityi =  computer leisurei, incomei,educationi, agei, sexi, racei, marriedi,    (3)  number children in householdi, occupationi . f  Spanish-speaking onlyi, labor force statusi, (metro, suburban, rural)i ,      yeart, survey day of weeki

Table 2.3 shows the coefficient (and t-statistic) on the computer leisure variable from each of the eighteen regressions.11 Figure 2.12 shows the results graphically. Perhaps not surprisingly, since computer use for leisure is a component of the major leisure category, computer use for leisure has the largest effect on other leisure. Each minute spent engaged in computer leisure represents almost 0.3 minutes less of doing some other type of leisure. Online leisure appears to have a relatively large effect on time spent at work as well, with each minute of online leisure correlated with about 0.27 minutes less time working. Each minute of online leisure is also correlated with 0.12 minutes of personal care. Most other activities also show a negative, though much smaller, correlation with online leisure. Travel time, too, is negatively correlated with online leisure time. Avoided 11. The full regression results are in an online appendix at http://www.nber.org/data -appendix/c13001/appendix-tables.pdf.

What Are We Not Doing When We Are Online? Table 2.3

73

Estimated crowd-out effects of computer leisure on major categories Leisure (excluding computer) Work activities Personal care (including sleep) Travel Household activities Education Sports Helping household members Eating and drinking Helping nonhousehold members Religion Unknown Volunteer Professional care and services Household services Government and civic obligations Consumer purchases Phone calls

–0.293*** (22.34) –0.268*** (19.38) –0.121*** (12.36) –0.0969*** (17.36) –0.0667*** (7.149) –0.0574*** (8.560) –0.0397*** (9.17) –0.0368*** (7.589) –0.0254*** (6.991) –0.0232*** (6.763) –0.0146*** (5.758) –0.0141*** (4.080) –0.0120*** (3.503) –0.00360* (1.896) –0.00129 (1.583) –0.000177 (0.303) 0.00368 (1.025) 0.0134*** (7.433)

Note: Equation (3) shows the variables included in each regression. Full regression results in appendix at http://www.nber.org/data-appendix/c13001/appendix-tables.pdf. ***Significant at the 1 percent level. **Significant at the 5 percent level. * Significant at the 10 percent level.

travel time is generally considered a benefit, suggesting at least one area where the trade-off yields clear net benefits. Phone calls are positively correlated with online leisure time, although the magnitude is small. It is conceivable that this result reflects identifying the type of person who tends to Skype. Calls made using Skype or similar VoIP services would likely be recorded as online leisure rather than phone calls

74

Scott Wallsten

Fig. 2.12

Estimated crowd-out effects of online leisure on major categories

since phone calls are specifically time spent “talking on the telephone.”12 If people who are inclined to talk on the phone are also inclined to Skype, then perhaps the correlation is picking up like-minded people. The analysis above controls for demographics, but any crowd-out (or crowd-in) effects may differ by those demographics, as well. Table 2.4 shows the abridged regression results by demographic group. Men and women show few differences in terms of crowd-out effects, except for time spent helping household members. While online leisure time is not statistically significantly correlated with helping household members for men, each minute of online leisure is associated with 0.08 fewer minutes helping household members for women. This result, however, is at least partly because women spend more than 50 percent more time helping household members than men do. Among race, black people show the biggest crowd-out correlation between online and other leisure, while Hispanic people show the smallest crowding out. Black, white, and Hispanic people show similar levels of crowding out 12. See http://www.bls.gov/tus/tu2011coderules.pdf, p.47.

What Are We Not Doing When We Are Online? Table 2.4

75

Crowd-out effect on selected major categories by demographics

Demographic

Leisure (other than online)

Travel

Household activities

Education

Helping household members

Work

Men Women White Black Asian Hispanic

–0.307*** –0.283*** –0.274*** –0.394*** –0.305*** –0.230***

–0.258*** –0.264*** –0.273*** –0.308*** –0.151** –0.275***

–0.0638*** –0.0554*** –0.0680*** –0.00453 –0.0589*** –0.0590***

–0.0668*** –0.0642*** –0.0732*** –0.0348 0.00178 –0.174***

–0.0620*** –0.0555*** –0.0546*** –0.0450** –0.227*** 0.0177

–0.00833 –0.0724*** –0.0418*** 0.00511 –0.0195 –0.0709***

0 be an indicator for whether a household reports being conservative, where τ0 is a cutoff. With some abuse of notation, let

∑i =1ci Kij I ∑i =1Kij I

cj =

(7)

denote the share of visitors to site j who are conservative. The econometrician observes {c j}Jj =1. The econometrician can therefore impose the following J constraints: ∞

∫ j () f ()()d c j = ∞0 . ∫ −∞j () f ()()d

(8)

These constraints are necessary to identify τ0 and the γj’s in a sample of households whose ideology is unknown. 6.4.2

Model of Supply of Online News

Setup and Notation We define several summaries of the number of visits to site j. Let Vj denote the total number of visitors to site j. Let Sj denote the fraction of consumers

Ideology and Online News

179

who visit site j at least once. Let Xj denote the fraction of consumers who visit site j and no other site. Write the operating profits of outlet j as

j = a (Vj , Sj , X j ) − g ( j ,j ), where a (Vj , Sj , X j ) is annual advertising revenue and g(αj, γj ) is the annual cost of content production. The function a( ) allows for several possible advertising technologies. The case where a (Vj , Sj , XJ ) = aV  j for some constanta corresponds to a constant  per-viewer advertising rate. The case where a (Vj , Sj , XJ ) = aS  j exhibits  strong diminishing returns to additional impressions to the same viewer on the same site. The case where a (Vj , Sj , X j ) = aX  j exhibits strong diminishing  returns to additional impressions both across and between sites. This last form of diminishing returns is especially interesting in light of the theoretical literature on multihoming (Armstrong 2002; Ambrus and Reisinger 2006; Anderson, Foros, and Kind 2010; Athey, Calvano, and Gans 2013). The function g ( j , j ) is similarly abstract. A convenient starting point is that g ( j , j ) = g ( j ) strictly increasing in αj . Such an assumption implies that it is costly to produce quality but free to locate anywhere on the ideological spectrum for a given quality. Audience Metrics Using our demand model it is possible to derive simple expressions for the various audience metrics that we define above. The number of visits to site j by the average consumer is given by (9)

Vj =







∫ T∑=0j ()T Pr(T |)()d = ∫ j () f ()()d.

−∞

−∞

The derivation uses the fact that E(T | τ) = f(τ). The share of consumers who ever visit site j is given by (10)



   S j = ∫ ∑ (1− (1− j ()) )Pr(T |)()d = 1− ∫   ()d. T =0 f () () +  −∞ −∞  j ∞





T

To derive the second expression from the first, observe that ∞

∑ (1 −  j ())T Pr(T |) = ET |((1 − j ())T ) = ET |(exp(T ln(1 − j ())))

T =0



   = ,  f ()j () +   where the last step follows from the moment-generating function of the negative binomial. The share of consumers who visit site j and no other site is given by

180

Matthew Gentzkow and Jesse M. Shapiro

Xj = (11) =





∫ T∑=1(j ())T Pr(T |)()d

−∞

       − ∫    ()d.  f () +    −∞   f ()(1 −  j ()) +   



The derivation here is analogous to that for Sj , but begins by noting that ∞

∑ (j ())T Pr (T |) = ET |((j ())T ) − Pr (T = 0|).

T =1

Equilibrium Choice of Attributes Given the set of outlets, we suppose that attributes { j , j }Jj =1 are a Nash equilibrium of a game in which all outlets simultaneously choose attributes. The first-order conditions are that (12)

∂ j ∂ j = = 0∀j. ∂ j ∂ j

The first-order conditions are a useful starting point for empirical work, because the game we have specified will in general have many equilibria. (For example, any set of attributes that constitutes an equilibrium is also an equilibrium under a relabeling of the outlets.) Coupled with an estimate of demand, the first-order conditions have substantial empirical content. Consider, for example, the case in which

= aV  j − g ( j ) for some constanta.  Then the model implies that  j ∂V g′( j ) = a j ∀j. ∂ j  An estimate of the demand model implies a value for ∂Vj / ∂ j and the constanta may be approximated from aggregate data. By plotting g′( j ) against αj for all outlets j one can trace out the shape of the cost function for quality. The model also implies that (13)

0=

∂Vj ∀j. ∂j

That is, since we have assumed that ideology can be chosen freely, each outlet must be at the visit-maximizing ideology. This is a version of Gentzkow and Shapiro’s (2010) test for the optimality of print newspapers’ choice of slant. Equilibrium Number of Outlets If news outlets are substitutes in demand then, in general, the profits of all outlets will decline in the number of outlets. A natural way to define the equilibrium number of outlets is then the number of outlets such that the next entering outlet would be unprofitable. For such a number to exist there

Ideology and Online News

181

must be a sunk entry cost. Suppose that this cost is uniform across potential entrants. Then the sunk cost can be bounded above by the operating profit of the least profitable outlet and below by the operating profit that the J + 1st outlet would earn if it were to enter and choose the optimal position given the positions of the existing J outlets. 6.5 6.5.1

Estimation and Results Empirical Strategy and Identification

Our demand estimator solves the following problem: (14)

min

0,, f ( ),{ j , j}Jj =1

ln (L)



(15)

s.t.

∫  j () f ()()d c j = ∞0 ∀j. ∫ −∞j () f ()()d

subject to a normalization of the location of the αs and γs. Our data include panel microdata on individual households, but to develop intuition for model identification it is useful to imagine data that consist only of the shares cj and the market shares of each site. Consider the problem of identifying τ0 and { j , j }Jj =1 taking as given the parameters governing the number of sites visited by each household. There are J conservative shares cj and J – 1 market shares (these must sum to one): 2J – 1 empirical objects that can vary separately. Up to an appropriate normalization, there are J – 1 qualities αj , J – 1 site ideologies γj , and one reporting cutoff τ0: 2J – 1 parameters. We assume that τ ~ N(0, 1). We parameterize f(τ) = κ for some constant κ. This allows us to factor the likelihood into two components: the likelihood for the count model of total visits and the likelihood for the logit model of outlet choice. We exploit this factoring to estimate the model via two-step maximum likelihood, first fitting the count model to the total number of visits Ti , then fitting the logit choice model to each household’s individual sequence of visits. In the second step we limit attention to consumers who make fifteen or fewer visits to the five sites in our sample. Appendix table 6A.1 presents Monte Carlo evidence on the performance of our estimator. 6.5.2

Demand Estimates

Table 6.1 presents estimates of model parameters and their standard errors. We normalize γ so that it has a visit-weighted mean of zero. We normalize α so that it is equal to zero for the least-visited site. Estimates are in general very precise; this precision is somewhat overstated as we do not incorporate uncertainty in the constraints in equation (15). We explore several dimensions of model fit.

182

Matthew Gentzkow and Jesse M. Shapiro

Table 6.1

Model parameters γ CNN Drudge Report Fox News Huffington Post New York Times

‒0.0127 (0.00058) 0.7229 (0.0000) 0.5320 (0.00015) ‒0.3645 (0.00082) ‒0.2156 (0.00072)

α CNN Drudge Report Fox News Huffington Post New York Times θ κ Pr(τ > τ 0)

4.3252 (0.0488) 0 (.) 2.7345 (0.0475) 1.8632 (0.0547) 3.6381 (0.0502) 0.3132 (0.0000) 3.0259 (0.0000) 0.5431 (0.00087)

Notes: The table presents the estimated parameters of the model presented in section 6.4. Estimates use 2008 comScore data for five sites. Estimation is by two-step maximum likelihood, estimating (θ, κ) in the first step and the remaining parameters in the second step. We normalize γ to have a visit-weighted mean of zero across all sites, and α to take value zero for the least-visited site. Asymptotic standard errors are in parentheses.

Figure 6.2 shows that the negative binomial model provides a good fit to the distribution of total visits across machines in our panel. Table 6.2 shows that the model provides a good fit to the overall size and ideological composition of the sites. Table 6.3 shows that the model does an adequate job of replicating the distribution of conservative exposure in the data. Table 6.4 shows that the model predicts far more cross-visiting than is observed in the data. 6.5.3

Supply Estimates

We focus on the supply model’s implications for sites’ choice of ideology. To get a feel for how the model works, we begin with the incentives of a hypothetical news site. Consider a world with J = 2 and α1 = α2 = 0. Suppose

Ideology and Online News

Fig. 6.2

183

Fit of model to total visit counts

Note: Plot shows total visits to the five sites in our sample in 2008 for each machine in the panel and the density predicted from our estimated model. Table 6.2

Model fit to size and ideology of news outlets Share of total visits

CNN Drudge Report Fox News Huffington Post New York Times

Conservative share of site visits

Data

Simulation

Data

Simulation

0.5297 0.0113 0.1401 0.0483 0.2707

0.5348 0.0101 0.1339 0.0488 0.2724

0.5504 0.9266 0.8669 0.3008 0.4027

0.5604 0.9270 0.8731 0.3079 0.4080

Notes: The table presents, for each site, the share of total visits that each site receives, and the share of visits to each site from conservative consumers, along with analogues from a single simulation at the estimated parameters.

that site 1 chooses γ1 = 0. Should site 2 stick to the center as well or move out to the extremes? Figure 6.3 plots our three audience size metrics—average visits Vj , share ever visiting Sj , and share visiting exclusively Xj —as a function of site 2’s choice of γ2. We find that site 2 maximizes visits and the share ever visiting by being centrist. In the case of a site maximizing exclusive visits, it is optimal to be slightly to the right or to the left of the center. Moving away from the center attracts viewers who are not attracted to site 1, and hence who are more likely to visit site 2 exclusively. Figure 6.4 explores the incentive to differentiate ideologically in the con-

184

Matthew Gentzkow and Jesse M. Shapiro

Table 6.3

Model fit to conservative exposure Conservative exposure of households visiting at least one site Percentile

Data Simulation

5th

25th

50th

75th

95th

Mean

Standard deviation

0.4027 0.4080

0.4256 0.4842

0.5504 0.5604

0.5504 0.5805

0.8669 0.8213

0.5387 0.5516

0.1360 0.1155

Notes: The table presents statistics of the distribution of conservative exposure in the data and in a single simulation at the estimated model parameters. A consumer’s conservative exposure is the visit-weighted average share conservative across the sites visited by the consumer. Table 6.4

Model fit to cross-visiting patterns Also visiting site:

Share of visitors to site: CNN Drudge Report Fox News Huffington Post New York Times

Data Simulation Data Simulation Data Simulation Data Simulation Data Simulation

CNN

Drudge Report

Fox News

Huffington Post

New York Times

— — 0.4131 0.8495 0.4774 0.8019 0.4640 0.8442 0.4472 0.7896

0.0087 0.0406 — — 0.0140 0.0814 0.0090 0.0261 0.0089 0.0342

0.1635 0.3254 0.2278 0.6905 — — 0.1847 0.2857 0.1516 0.3213

0.0711 0.1781 0.0656 0.1153 0.0826 0.1485 — — 0.0805 0.2164

0.3027 0.5667 0.2857 0.5133 0.2996 0.5684 0.3556 0.7363 — —

Notes: For each site, the table shows the share of visitors to that site who also visit each of the other sites, both for the empirical data and for a single simulation at the estimated parameters.

text of the five sites in our data. We take the αs as given at their estimated values. For each site j, we plot our audience size metrics as a function of γj, taking as given the estimated γs for the other sites. The plot also shows the estimated position ˆ j for each site. Whether a given site would increase its audience by moving closer to or further from the center depends on the audience metric of interest. Most sites would get more households to visit at least once by moving to the center. But most would get more exclusive visitors by moving further from the center. Most sites would also increase total visits by becoming more ideologically extreme. 6.6

Discussion and Conclusions

We propose a model of the demand and supply of online news designed to capture key descriptive features of the market. We estimate the model on

Fig. 6.3

Audience size and ideology: Hypothetical news site

Notes: The figure shows objects computed from our model using the values of the parameters θ and κ in table 6.1. In each plot we assume that J = 2, that α1 = α2 = 0, and that γ1 = 0, and we plot measures of the size of the audience for outlet j = 2 as a function of its ideology γ2. “Average visits” is the number of visits V2 made by the average consumer to site 2 across all consumers. “Share ever visiting” is the share of consumers S2 who visit site 2 at least once. “Share visiting exclusively” is the share of consumers X2 who visit site 2 and only site 2. See text for formal definitions. Audience size metrics are approximated using Gaussian quadrature.

Audience size and ideology: Actual news sites

Notes: Panel A, CNN; panel B, Drudge Report; panel C, Fox News; panel D, Huffington Post; and panel E, New York Times. The figure shows objects computed from our model using the values of the parameters γ, α, θ, and κ in table 6.1. In each plot we show measures of the size of the audience for outlet j as a function of its ideology γj, holding constant all other parameters. “Average visits” is the number of visits Vj made by the average consumer to site j across all consumers. “Share ever visiting” is the share of consumers Sj who visit site j at least once. “Share visiting exclusively” is the share of consumers Xj who visit site j and only site j. See text for formal definitions. Audience size metrics are approximated using Gaussian quadrature. The dashed line indicates the site’s estimated ideology ˆ j .

Fig. 6.4

188

Matthew Gentzkow and Jesse M. Shapiro

data from a panel of Internet users and explore its fit to consumer behavior. We then study the model’s implications for the supply of news. We stop short of a full equilibrium model of the supply of news, but we believe such a model can be estimated with the primitives we propose. A proposed strategy is as follows. From our demand model, it is possible to calculate how much each outlet would gain in terms of audience from increasing its quality. Using a model of equilibrium advertising rates, one can translate this audience gain into a revenue gain. Conditions for a static equilibrium imply that the gain in revenue must equal the cost of additional content. By performing this exercise for a large set of sites, it is in principle possible to trace out the marginal cost of quality at different points in the quality distribution, and hence to recover the shape of the cost function for quality. A similar exercise could, in principle, yield a cost function for ideology. Given cost functions and a notion of equilibrium, the model implies a set of equilibrium positions for news outlets under various assumptions. For example, it would be possible to contemplate changes in the value of online audience to advertisers, or changes in fixed costs or other elements of the news production technology. The model will imply a mapping from these primitives to features of consumer demand such as the extent of ideological segregation. Stepping further back, it may also be interesting to explore how well the same model can perform in rationalizing patterns of demand in other domains. As we note in section 6.2, many of the descriptive features of news consumption are reminiscent of other domains such as DVD-by-mail rental patterns. Though the conditions of supply likely differ greatly across domains, common features in demand may suggest a similar underlying model of consumer behavior. Finally, it is important to note that we focus on the supply and demand for news but not its impact on political beliefs or behavior. As technology evolves it will be important to accumulate theory and evidence on how media platforms change politics.

Ideology and Online News

189

Appendix Table 6A.1

Monte Carlo experiments Baseline estimate

Average estimate across simulations

Asymptotic standard errors

Bootstrap standard errors

CNN Drudge Report Fox News Huffington Post New York Times

‒0.0127 0.7229 0.5320 ‒0.3645 ‒0.2156

‒0.0127 0.7230 0.5321 ‒0.3645 ‒0.2157

0.0006 0.0000 0.0002 0.0008 0.0007

0.0000 0.0003 0.0002 0.0001 0.0001

CNN Drudge Report Fox News Huffington Post New York Times

4.3252 0.0000 2.7345 1.8632 3.6381 0.3132 3.0259 0.5431

4.3264 0.0000 2.7389 1.8663 3.6393 0.3132 3.0259 0.5432

0.0488 0.0000 0.0475 0.0547 0.0502 0.0000 0.0000 0.0009

0.0267 0.0000 0.0237 0.0303 0.0249 0.0000 0.0000 0.0003

Parameter γ

α

θ κ Pr(τ > τ0)

Notes: The table reports the results of Monte Carlo experiments in which we first simulate ten data sets from our model at the parameter values shown in the first column, then reestimate our model on each simulated data set with the starting parameters set at the estimated values.

References Ambrus, Attila, and Markus Reisinger. 2006. “Exclusive vs. Overlapping Viewers in Media Markets.” Working Paper, Harvard University. Anderson, Chris. 2006. The Long Tail: Why the Future of Business is Selling Less of More. New York: Hyperion. Anderson, Simon P., Øystein Foros, and Hans Jarle Kind. 2010. “Hotelling Competition with Multi-Purchasing: Time Magazine, Newsweek, or Both?” CESifo Working Paper no. 3096, CESifo Group Munich. Armstrong, Mark. 2002. “Competition in Two-Sided Markets.” Working Paper, Nuffield College. Athey, Susan, Emilio Calvano, and Joshua S. Gans. 2013. “The Impact of the Internet on Advertising Markets for News Media.” Working Paper no. 2180851, Rotman School of Management, University of Toronto. Berry, Steven, and Joel Waldfogel. 2010. “Product Quality and Market Size.” Journal of Industrial Economics 58 (1): 1‒31. Cutler, David M., Edward L. Glaeser, and Jacob L. Vigdor. 1999. “The Rise and Decline of the American Ghetto.” Journal of Political Economy 107 (3): 455‒506. DellaVigna, Stefano, and Ethan Kaplan. 2007. “The Fox News Effect: Media Bias and Voting.” Quarterly Journal of Economics 122 (3): 1187‒234. Edmonds, Rick. 2013. “New Research Finds 92 Percent of Time Spent on News Consumption is Still on Legacy Platforms.” Poynter Institute for Media Stud-

190

Matthew Gentzkow and Jesse M. Shapiro

ies. http://www.poynter.org/latest-news/business-news/the-biz-blog/212550/new -research-finds-92–percent-of-news-consumption-is-still-on-legacy-platforms/. Elberse, Anita. 2008. “Should You Invest in the Long Tail?” Harvard Business Review 86 (7): 88‒96. Fiorina, Morris P., and Samuel J. Abrams. 2008. “Political Polarization in the American Public.” Annual Review of Political Science 11:563‒88. Gentzkow, Matthew, and Jesse M. Shapiro. 2010. “What Drives Media Slant? Evidence from US Daily Newspapers.” Econometrica 78 (1): 35‒71. ———. 2011. “Ideological Segregation Online and Offline.” Quarterly Journal of Economics 126 (4): 1799‒839. Gentzkow, Matthew, Michael Sinkinson, and Jesse M. Shapiro. 2011. “The Effect of Newspaper Entry and Exit on Electoral Politics.” American Economic Review 101 (7): 2980‒3018. Greene, William H. 2012. Econometric Analysis. New York: Prentice Hall. McCarty, Nolan, Keith T. Poole, and Howard Rosenthal. 2006. Polarized America: The Dance of Ideology and Unequal Riches. Cambridge, MA: MIT Press. Mullainathan, Sendhil, and Andrei Shleifer. 2005. “The Market for News.” American Economic Review 95 (4): 1031‒53. Prior, Markus. 2005. “News vs. Entertainment: How Increasing Media Choice Widens Gaps in Political Knowledge and Turnout.” American Journal of Political Science 49 (3): 577‒92. ———. 2013. “Media and Political Polarization.” Annual Review of Political Science 16:101‒27. Shaked, Avner, and John Sutton. 1987. “Product Differentiation and Industrial Structure.” Journal of Industrial Economics 36 (2): 131‒46. Sunstein, Cass R. 2001. Republic.com. Princeton, N.J.: Princeton University Press. Webster, James G., and Thomas B. Ksiazek. 2012. “The Dynamics of Audience Fragmentation: Public Attention in an Age of Digital Media.” Journal of Communication 62:39‒56. White, Michael J. 1986. “Segregation and Diversity Measures in Population Distribution.” Population Index 52 (2): 198‒221.

7

Measuring the Effects of Advertising The Digital Frontier Randall Lewis, Justin M. Rao, and David H. Reiley

7.1

Introduction

In the United States, advertising is a $200 billion industry, annually. We all consume “free” services—those monetized by consumer attention to advertising—such as network television, e-mail, social networking, and a vast array of online content. Yet despite representing a relatively stable 2 percent of gross domestic product (GDP) since World War I and subsidizing activities that comprise most of Americans’ leisure time (Bureau of Labor Statistics 2010), advertising remains poorly understood by economists. This is primarily because offline data have typically been insufficient for a firm (or researcher) to measure the true impact of advertising on consumer purchasing behavior. Theories of advertising (Demsetz 1982; Kessides 1986; Becker and Murphy 1993) that have important implications for competition are even harder to empirically validate. The digital era offers an unprecedented opportunity to bridge this informational divide. These advances, both realized and potential, can be attributed to two key factors: (1) individual-level data on ad delivery and subsequent purchasing behavior can be linked and made available to advertisers at low cost; and (2) ad delivery can be randomized at the individual level, generating exogenous variation essential to Randall Lewis is an economic research scientist at Google, Inc. Justin M. Rao is an economic researcher at Microsoft Research. David H. Reiley is a research scientist at Google, Inc. Much of this work was done when all the authors were at Yahoo! Research. We thank Garrett Johnson, Dan Nguyen, Sergiy Matusevych, Iwan Sakran, Taylor Schreiner, Valter Sciarillo, Christine Turner, Michael Schwarz, Preston McAfee, and numerous other colleagues for their assistance and support in carrying out the research. For acknowledgments, sources of research support, and disclosure of the authors’ material financial relationships, if any, please see http:// www.nber.org/chapters/c12991.ack.

191

192

Randall Lewis, Justin M. Rao, and David H. Reiley

identifying causal effects.1 In this chapter we explore the dramatic improvement in the empirical measurements of the returns to advertising, highlight fundamental challenges that currently remain, and look to what solutions we think the future will bring. Digital advertising has led to standard reporting of precise quantitative data for advertising campaigns, most notably the click-through rate (CTR). Of course, the CTR of an ad is only an intermediate proxy for the real outcome of interest to the advertiser: increased purchases by consumers, both in the present and future.2 Despite these limitations, intermediate metrics such as the CTR have proved to be enormously useful dependent variables in automated targeting algorithms that match ads with consumers and contexts (Pandey and Olston 2006; Gonen and Pavlov 2007). Related intermediate metrics come from “purchasing intent” surveys paired with randomized exposure to a firm’s advertising. Cross-experiment analysis of such surveys has provided estimates of the relative value of targeted (versus untargeted) advertising (Goldfarb and Tucker 2011b), contextual relevance and ad intrusiveness (Goldfarb and Tucker 2011a), and has informed the debate on privacy (Tucker 2012). The advances in both academic understanding and business best-practice attributable to these intermediate metrics should not be understated. But while general insights on how ad features impact users can guide advertising spend and CTR maximizing algorithms can make spending more efficient, a firm is presumably interested in measuring the overall returns on advertising investment: dollars of sales causally linked to the campaign versus dollars spent. An overreliance on intermediate metrics can draw attention away from the true underlying goal, and research has shown it can lead to highly suboptimal spending decisions (Blake, Nosko, and Tadelis 2014). Along with deficiencies in intermediate metrics, endogeneity of advertising exposure is the other key challenge in measuring advertising returns. Traditional econometric measurements typically rely on aggregate data fraught with identification problems due to the targeted nature of advertising (Bagwell 2007).3 Yet despite the ability to run very large randomized control trials made possible by digital delivery and measurement, we have discovered a number of conceptual flaws in standard industry data collection and anal1. There have been experimental approaches to measuring advertising effectiveness in the past, see most notably the split-cable experiments of Lodish et al. (1995), but these were typically conducted as small pilots and not using the normal ad delivery pipeline. 2. Toward these ends, advertisers use browser cookies and click beacons to obtain a “conversion rate,” the ratio of transactions attributed to the campaign to ad exposures. This measure seems ideal, but the attribution step is critical and current methods of assigning attribution have serious flaws, which we discuss in detail. 3. The split cable TV experiments reported in Lodish et al. (1995) are a notable exception. The sample sizes in these experiments, run in a small US town, were far smaller than online experiments, and the authors did not report per experiment confidence intervals, rather they used cross-experiment techniques to understand what factors tended to influence consumers (for a follow-up analysis, see Hu, Lodish, and Krieger [2007]).

Measuring the Effects of Advertising

193

ysis methods used to measure the effects of advertising. In other words, the deluge of data on advertising exposures, clicks, and other associated outcomes have not necessarily created greater understanding of the basic causal effects of advertising, much less an understanding of more subtle questions such as the relative effectiveness of different types of consumer targeting, ad creatives, cross-channel effects, or frequency of exposure. The voluminous data, it seems to us, have not only created opportunity for intelligent algorithmic advances, but also mistaken inference under the guise of “big data.” First, many models assume that if you do not click on the ad, then the ad has no effect on your behavior. Here we discuss work by coauthors Lewis and Reiley that showed online ads can drive offline sales, which are typically not measured in conversion or click rates; omitting these nonclick-based sales leads to underestimating the total effects of advertising. Linking online and offline sales requires a dedicated experimental infrastructure and third-party data merging that have only recently become possible. Second, many models assume that if you do click on an ad and subsequently purchase, that conversion must have been due to that ad. This assumption seems particularly suspect in cases, such as search advertising, where the advertising is deliberately targeted at those consumers most likely to purchase the advertised product and temporally targeted to arrive when a consumer is performing a task related to the advertised good. Research has shown, for example, that a person searching for “ebay shoes” is very likely to purchase shoes on eBay regardless of the intensity of advertising (Blake, Nosko, and Tadelis 2014). While this is an extreme example, Blake, Nosko, and Tadelis (2014) also show that the problem arises generally, and measuring the degree to which advertising crowds out “organic conversions” is difficult to measure precisely. Näive approaches effectively assume this problem away, but since only “marginal clicks” are valuable and all clicks count toward the CTR, these methods will always overstate the causal effect on users who clicked the ad. Third, more sophisticated models that do compare exposed to unexposed users to establish a baseline purchase rate typically rely on natural, endogenous advertising exposure and can easily generate biased estimates due to unobserved heterogeneity (Lewis, Rao, and Reiley 2011). This occurs when the pseudo-control group does not capture important characteristics of the treated group, such as purchase intent or browsing intensity, which we show can easily be correlated with purchases whether advertising is present or not. Using data from twenty-five large experiments run at Yahoo! (Lewis and Rao 2013), we have found that the standard deviation of purchases is typically ten times the mean. With such a noisy dependent variable, even a tiny amount of endogeneity can severely bias estimates. Beyond inducing bias in coefficient estimates, these specification errors also give rise to an overprecision problem. Because advertising typically explains only a very small fraction of the variance in consumer transaction behavior, even cleanly

194

Randall Lewis, Justin M. Rao, and David H. Reiley

designed experiments typically require over a million subjects in order to be able to measure economically meaningful effects with any statistical precision (but even experiments with one million subjects can have surprisingly weak power, depending on the variance in sales). Since experiments are generally considered the gold standard for precision4 (treatment is exogenous and independent across individuals), we should be suspicious if observational methods claim to offer higher precision. Further, with nonexperimental methods, omitted heterogeneity or selection bias (so long as it can generate a partial R-squared of 0.00005 or greater) can induce bias that swamps plausible estimates of advertising effectiveness. Thus, if an advertiser does not use an experiment to evaluate advertising effectiveness, she has to have a level of confidence in her model that, frankly speaking, we find unreasonable given the obvious selection effects due to ad targeting and synchronization of advertising with product launches (e.g., new iPad release) and demand shocks (e.g., holiday shopping season). Experimental work on measuring the dollar returns to advertising has given us a deeper appreciation for the limits of current data and methods. For example, we show that seemingly simple “cross-channel” complementarity measures are exceedingly difficult to reliably estimate. Here we present evidence taken from Lewis and Nguyen (2013) that display advertising can increase keyword searches for the advertised brand. Some clicks on sponsored links are incorrectly attributed entirely to the search ad, but while the directional impact on searches can be documented, we cannot tell if search ads perform better or worse in terms of the conversion rate when paired with display advertising. A similar experimental design at a much larger scale could answer this sort of question, but advertising to over five to ten million individuals may be out of reach5 for most advertisers. These findings are confirmed by similar work on online advertising spillovers (Rutz and Bucklin 2011; Papadimitriou et al. 2011). So while some questions are answerable with feasible (at least for some market participants) scale, we believe other questions are still outside the statistical power of current experimental infrastructure and methods. The most prominent example is the long-run effects of advertising. Essentially any analysis of the impact of advertising has to make a judgment call on which time periods to use in the analysis. Often this is the “campaign window” or the campaign window plus a chosen interval of time (typically one to four weeks). These thresholds are almost certainly “wrong” because any impact that occurs after the cutoff should count in the return on investment (ROI) calculation. We explain why practitioners typically choose relatively short impact windows. The intuition is that the longer the time window 4. Not all experiments are created equal and methodologies to use preexperiment data to enhance power as well as postexperiment trimming have advanced considerably in the digital era (Deng, Kohavi, and Walker 2013). 5. Pun intended.

Measuring the Effects of Advertising

195

under study, the lower the signal-to-noise ratio in the data (presuming the ad gets less impactful over time): point estimates of the cumulative effect tend to increase with longer time horizons, but standard errors of the effect increase by even more. This leads to an estimation “impossibility” analogous to the well-known “curse of dimensionality.” In the next two sections we shift our gaze further into the future. First, we discuss how computational methods have increased advertising effectiveness through automated targeting and bidding. With automated targeting, the conversation is usefully shifted from “who to hit” to “what should I get.” Currently, the key parameters of the automated system such as the valuation of actions such as clicks or conversions, the budget of the campaign and the duration, must still be entered by a human. Indeed, these are the exact parameters that we have argued are very difficult to estimate. However, there is no major technical barrier to incorporating controlled randomization—on the fly experimentation—into the core algorithm. By constantly incorporating experimentation, an informative prior could be developed and returns could be more precisely estimated (which would then govern bid, budget, and so forth). To unlock the full potential of this class of algorithms, ad exchanges would have to provide data to participants on the outcomes of auctions in which the bidder intentionally lost. Currently, outcome tracking is only possible if you win the auction, meaning today this type of experimentation is limited to temporal and geography-based identification, severely limiting power. In our final section we extend the discussion on how advances in ad delivery, measurement, and infrastructure are creating opportunities to advance the science of advertising. We discuss how the provision of these features and data relates to the incentives facing the advertising platform. In the final section we present concluding remarks. 7.2

Selection and Power

In today’s dollars, the average American is exposed to about $500 worth of advertising per year.6 To break even, the universe of advertisers needs to net about $1.35 in marginal profits per person per day. Given the gross margins of firms that advertise, our educated guess is that this roughly corresponds to about four to six dollars in incremental sales per day. When an advertiser enters this fray, it must compete for consumers’ attention. The cost per person of a typical campaign is quite low. Online “display” (banners, rectangular units, etc.) campaigns that deliver a few ads per day to a targeted individual cost about one to two cents per person per day. Televi6. Mean GDP per American is approximately $50,000 in 2011, but median household income is also approximately $50,000. The average household size is approximately 2.5, implying an individual’s share of median household income is roughly $20,000. Thus, while 2 percent of GDP actually implies a per capita expenditure of $1,000, we use $500 as a round and conservative figure that is more representative of the average American’s ad exposure.

196

Randall Lewis, Justin M. Rao, and David H. Reiley

sion ads delivered once per person per day are only a bit more expensive. Note that even an aggressive campaign will typically only garner a small percentage of an individual’s daily advertising exposure. We see many ads per day and presumably only a minority of them are relevant enough to a given person to impact his behavior. The relatively modest average impact per person makes it difficult to assess costeffectiveness. What complicates matters further is that individual-level sales are quite volatile for many advertisers. An extreme example is automobiles—the sales impact is either tens of thousands of dollars or zero.7 While not as extreme, many other heavily advertised categories, including consumer electronics, clothing and apparel, jewelry, air travel, banking, and financial planning also have volatile consumption patterns.8 Exceptions to this class are single goods sold through direct conversion channels. Here we summarize work presented in Lewis and Rao (2013), which used twenty-five large advertising field experiments to quantify how individual expenditure volatility impacts the power of advertising effectiveness (hereafter, adfx) experiments. In general, the signal-to-noise ratio is much lower than we typically encounter in economics. We now introduce some formal notation to clarify the argument. Consider an outcome variable y (sales), an indicator variable x equal to 1 if the person ˆ of the average was exposed to the advertising, and a regression estimate, , difference between the exposed (E) and unexposed (U) groups. In an experiment, exposure is exogenous—determined by a flip of the proverbial coin. In an observational study, one would also condition on covariates W, which could include individual fixed effects, and the following notation would use y |W. All the following results go through with the usual “conditional upon” caveat. We consider a regression of y on x, whose coefficient ˆ will give us a measure of the average dollar impact of the advertising per consumer. We use standard notation for the sample means and variances of the sales of the exposed and unexposed groups, the difference in means between those groups, and the estimated standard error of that difference in means. We assume for simplicity that the exposed and unexposed samples are the same size (NE = NU = N ) as well as equal variances (σE = σU = σ) to simplify the formulas: yE ≡

(1) (2)

ˆ 2E ≡

1 1 ∑ yi, yU ≡ NE i∈E NU

∑ yi

i∈U

1 1 ∑ ( yi − yE )2, ˆ U2 ≡ ∑ ( yi − yU )2 NE − 1 i∈E NU − 1 i∈U

7. The marginal profit impact is large, but clearly smaller, as it is the gross margin times the sales impact. 8. For a bank, the consumption pattern once you sign up might be predictable, but the bank is making money from consumer switching, which is “all or nothing.”

Measuring the Effects of Advertising

197

y ≡ yE = yU

(3) ˆ y ≡

(4)

ˆ 2E ˆ 2 + U = NE NU

2 ⋅ . ˆ N

We focus on two familiar econometric statistics. The first is the R2 of the regression of y on x, which gives the fraction of the variance in sales explained by the advertising (or, in the model with covariates, the partial R2 after first partialing out covariates—for more explanation, see Lovell [2008]): (5)

R2 =

( )

∑i∈U (yU − y)2 + ∑i∈E(yE − y)2 = 2N[(1/2) y]2 = 1 y 2 . 2Nˆ 2 4 ˆ ∑i(yi − y)2

Second is the t-statistic for testing the hypothesis that the advertising had no impact: (6)

t y =

y = ˆ y

( )

N y 2 ˆ

In both cases, we have related a standard regression statistic to the ratio between the average impact on sales and the standard deviation of sales between consumers. In the following hypothetical example, we calibrate values using approximately median values from nineteen retail sales experiments run at Yahoo!. For expositional ease, we will discuss it as if it is a single experiment. The campaign goal is a 5 percent increase in sales during the two weeks of the campaign, which we will use as our “impact period” of interest. During this period, customers of this advertiser make purchases with a mean of $7 and a standard deviation of $75.9 The campaign costs $0.14 per customer, which amounts to delivering 20‒100 display ads at a price of $1‒$5 CPM,10 and the gross margin (markup over cost of goods sold, as a fraction of price) is assumed to be about 50 percent.11 A 5 percent increase in sales equals $0.35 per person, netting profits of $0.175 per person. Hence, the goal for this campaign is to deliver a 25 percent return on investment (ROI): $0.175/$0.14 = 1.25.12 The estimation challenge facing the advertiser in this example is to detect a $0.35 difference in sales between the treatment and control groups amid 9. Based on data-sharing arrangements between Yahoo! and a number of advertisers spanning the range from discount to high-end retailers, the standard deviation of sales is typically about ten times the mean. Customers purchase goods relatively infrequently, but when they do, the purchases tend to be quite large relative to the mean. 10. CPM is the standard for impression-based pricing for online display advertising. It stands for “cost per mille” or “cost per thousand”; M is the Roman numeral for 1,000. 11. We base this assumption on our conversations with retailers and our knowledge of the industry. 12. For calibration purposes, note that if the gross margin were 40 percent instead of 50 percent, this would imply a 0 percent ROI.

198

Randall Lewis, Justin M. Rao, and David H. Reiley

the noise of a $75 standard deviation in sales. The ratio is very low: 0.0047. From our derivation above, this implies an R2 of: (7)

R2 =

2

1 $0.35  ⋅ = 0.0000054. 4  $75 

That is, even for a successful campaign with a relatively large ROI, we expect an R2 of only 0.0000054. This will require a very large N to identify any influence at all of the advertising, let alone give a precise confidence interval. Suppose we had two million unique users evenly split between test and control in a fully randomized experiment. With a true ROI of 25 percent and a ratio of 0.0047 between impact size and standard deviation of sales, the expected t-stat is 3.30, using the above formula. This corresponds to a test with power of about 95 percent at the 10 percent (5 percent one-sided) significance level, as the normally distributed t-statistic should be less than the critical value of 1.65 about 5 percent of the time given the true effect is a 25 percent ROI. With 200,000 unique customers, the expected t-statistic is 1.04, indicating the test is hopelessly underpowered to reliably detect an economically relevant impact: under the alternative hypothesis of a healthy 25 percent ROI, we fail to reject the null 74 percent of the time.13 The low R2 = 0.0000054 for the treatment variable x in our hypothetical randomized trial has serious implications for observational studies, such as regression with controls, difference-in-differences, and propensity score matching. A very small amount of endogeneity would severely bias estimates of advertising effectiveness. An omitted variable, misspecified functional form, or slight amount of correlation between browsing behavior and sales behavior generating R2 on the order of 0.0001 is a full order of magnitude larger than the true treatment effect. Compare this to a classic economic example such as the Mincer wage/schooling regression (Mincer 1962), in which the endogeneity is roughly 1/8 the treatment effect (Card 1999). For observational studies, it is always important to ask, “What is the partial R2 of the treatment variable?” If it is very small, as in the case of advertising effectiveness, clean identification becomes paramount, as a small amount of bias can easily translate into an economically large impact on the coefficient estimates. Our view has not yet been widely adopted, however, as evidenced by the following quotation from the president of comScore, a large data provider for online advertising: Measuring the online sales impact of an online ad or a paid-search campaign—in which a company pays to have its link appear at the top of a page of search results—is straightforward: We determine who has viewed 13. Note that when a low-powered test does, in fact, correctly reject the null, the point estimates conditional on rejecting will be significantly larger than the alternatively hypothesized ROI. See Gelman and Carlin (2013) regarding this “exaggeration factor.”

Measuring the Effects of Advertising

199

the ad, then compare online purchases made by those who have and those who have not seen it. (Abraham 2008) The argument we have made shows that simply comparing exposed to unexposed can lead to bias that is many orders of magnitude larger than the true size of the effect. Indeed, this methodology led the author to report as much as a 300 percent improvement in outcomes for the exposed group, which seems surprisingly high (it would imply, for instance, that advertisers are grossly underadvertising). Since all ads have some form of targeting,14 endogeneity is always a concern. For example, most display advertising aims to reach people likely to be interested in the advertised product, where such interest is inferred using demographics or past online behavior of that consumer. Similarly, search advertising targets consumers who express interest in a good at a particular point in time, where the interest is inferred from their search query (and potentially past browsing behavior). In these cases, comparing exposed to unexposed is precisely the wrong thing to do. By creating exogenous exposure, the first generation of advertising experiments have been a step in the right direction. Experiments are ideal—necessary, in fact—for solid identification. Unfortunately, for many advertised products the volatility of sales means that even experiments with millions of unique users can still be underpowered to answer basic questions such as “Can we reject the null hypothesis that the campaign had zero influence on consumers’ purchasing behavior?” Measuring sales impact, even in the short run, turns out to be much more difficult than one might have thought. The ability to randomize ad delivery on an individual level and link it to data on customer-level purchasing behavior has opened up new doors in measuring advertising effectiveness, but the task is still by no means easy. In the remainder of the chapter we discuss these challenges. The next section focuses on using the right metrics to evaluate advertising. 7.3

The Evolution of Advertising Metrics

The click-through-rate, or CTR, has become ubiquitous in the analysis and decision making surrounding online advertising. It is easy to understand why: clicks are cleanly defined, easily measurable, and occur relatively frequently. An obvious but intuitively appealing characteristic is that an ad click cannot occur in the absence of an ad. If one runs 100,000 ads and gets a 0.2 percent CTR (a typical rate for a display ad or a low-ranked search ad), it is tempting to conclude the ad caused 200 new website visits. The assump14. “Untargeted” advertising usually has implicit audience targeting based on where the ads are shown or implicit complementary targeting due to other advertisers purchasing targeted inventory and leaving the remnant inventory to be claimed by advertisers purchasing “untargeted” advertising inventory.

200

Randall Lewis, Justin M. Rao, and David H. Reiley

tion may well be true for new or little-known brands. But for well-known advertisers, there are important ways that consumers might navigate to the site in the absence of an ad, such as browsing directly to the site by typing the name in the URL window of the browser or finding it in organic (that is, not paid or “sponsored”) search results on a topic like “car rental.” It is a mistake to assume that all of those 200 visits would not have occurred in the absence of the ad—that is, those clicks may be crowding out visits that would have happened via other means (Kumar and Yildiz 2011; Chan et al. 2010). The overcounting problem is surmountable with randomized trials where the control group is used to estimate the “baseline arrival rate.” For example, a sponsored search ad could be turned off during random times of the day and the firm could measure arrivals from the search engine for when the ad is running and when it is not (this approach is used in Blake, Nosko, and Tadelis [2014]).15 A deeper problem with the CTR is what it misses. First, it does little for “brand advertisers”—firms that are not trying to generate immediate online sales, but rather to promote awareness and goodwill for the brand. To assess their spend, brand advertisers have traditionally relied on surveys that attempt to measure whether a campaign raised the opinion of the firm in the minds of their target consumers (Goldfarb and Tucker 2011b). Linking the surveys to future purchasing behavior adds another layer of complexity, both because the time frame from exposure to sale is longer (something we will discuss in more detail in section 7.5) and because it requires a reliable link from hypothetical responses to actual behavior, which can be fraught with what is known as “hypothetical bias” (Dickie, Fisher, and Gerking 1987; Murphy et al. 2005). One common approach to neutralize hypothetical bias is to use the surveys to make relative comparisons between campaigns. For advertisers that sell goods both online and in brick-and-mortar stores the click (or online conversions) can be a poor proxy for overall ROI. Lewis and Reiley (2013a) show that for a major retailer, the majority of the sales impact comes offline. Johnson, Lewis, and Reiley (2013) link the offline impact to consumers who lived in close physical proximity to one of the retailer’s locations. These studies indicate purely online measurements can induce a large negative bias in measuring the returns to advertising. For firms that do business on- and offline it is essential to develop the infrastructure to link online ad exposure to offline sales. An alternative to the click is the further downstream outcome measure known as a “customer acquisition” (which itself might be considered a short-term proxy for the net-present-discounted value of a customer). Advertisers can now run “cost per acquisition” (CPA) advertising on many 15. Despite the simplicity of their design, Blake, Nosko, and Tadelis (2014) estimate that their employer, eBay, had been wasting tens of millions of dollars a year.

Measuring the Effects of Advertising

201

ad exchanges.16 An acquisition, or conversion, is defined as a successful transaction that has a “qualifying connection” to the advertisement. On the surface, focusing on conversions seems more attractive than clicks because it is a step closer to sales. Unfortunately, this benefit brings with it what is known as the “attribution problem”: which ad gets “credit” for a given sale? Suppose a consumer views and clicks a given ad, but does not purchase on the same day. Over the next few days, she sees a host of other ads for the product (which is likely, given a practice known as “retargeting”) and then purchases the good. Which ad should get credit for the purchase? Ad exchanges tend to use a set of rules to solve these problems from an accounting perspective. Common rules include requiring a click for credit or only counting the “last click” (so if a consumer clicks a retargeted ad, that ad gets credit). Requiring a click seems to make sense and is enormously practical as it means a record of all viewers that see the ad but do not click need not be saved.17 However, requiring a click errs in assuming that ads can only have an impact through clicks, which is empirically not true (Lewis, Reiley, and Schreiner 2012). The “last click” rule also has intuitive appeal. The reasoning goes as follows: had the last click not occurred, the sale would not have happened. Even if this were true, which we doubt, the first click or ad view might have led to web search or other activity, including the behavioral markers used for retargeting, which made the last click possible. The causal attribution problem is typically solved by ad hoc rules set by the ad exchange or publisher such as “the first ad and the last ad viewed before purchase each get 40 percent of the credit, while the intermediate ad views share the remaining 20 percent of the credit for the purchase.”18 A proliferation of such rules gives practitioners lots of choices, but none of them necessarily gives an unbiased measurement of the performance of their ad spending. In the end, such complicated payment rules might make the click more attractive after all. The attribution problem is also present in the question of complementaries between display and search advertising. Recent work has shown that display ads causally influence search behavior (Lewis and Nguyen 2013). The authors demonstrate this by comparing the search behavior of users exposed to the campaign ad to users who would have been served the campaign ad but were randomly served a placebo. Brand-related keywords were significantly more prevalent in the treatment group as compared to the control. The attribution problem has received more attention in online advertising because of the popularity of cost-per-acquisition and cost-per-click payment mechanisms, but it applies to offline settings as well. How do we 16. But not the major search engines, as of August 2013. 17. A CTR of ≈ 0.2 percent meaning, storage, and processing costs of only clicks involves only 1/500 of the total ad exposure logs. 18. Source: https://support.google.com/analytics/bin/answer.py?hl=en\&answer=1665189.

202

Randall Lewis, Justin M. Rao, and David H. Reiley

know, for example, whether an online ad was more responsible for an online conversion than was the television ad that same user saw? Nearly every online campaign occurs contemporaneously with a firm’s offline advertising through media such as billboards and television because large advertisers are continuously advertising across many media.19 Directly modeling the full matrix of first-order interactions is well beyond the current state of the art. Indeed in every paper we know of evaluating online advertising, the interactions with offline spending is ignored. Our discussion thus far has indicated that the evolution of advertising metrics has brought forth new challenges linking these metrics to the causal impact on sales. However, one way in which intermediate metrics have proved unambiguously useful for advertisers is providing relatively quick feedback on targeting strategies allowing for algorithmic adjustments to the ad-serving plan. For instance, while it may be unreasonable to assume that the click captures all relevant effects of the ad, it may very natural to assume that within a given class of advertisements run by a firm a higher CTR is always preferred to a lower one. If so, bandit algorithms can be applied to improve the efficiency of advertising spend and give relative comparisons of campaign effectiveness, allowing one to prioritize better performing advertisements (Pandey and Olston 2006; Gonen and Pavlov 2007). We discuss these advances in more detail in section 7.7. 7.4

A Case Study of a Large-Scale Advertising Experiment

To get a better idea of how large advertising experiments are actually run, in this section we present a case study taken from Lewis and Reiley (2013a) (herein “LR”). Lewis and Reiley ran a large-scale experiment for a major North American retailer. The advance the paper makes is linking existing customers in the retailer’s sales records, for both online and brick-and-mortar sales, to a unique online user identifier, in this case the customer’s Yahoo! username. The experiment was conducted as follows. The match yielded a sample of 1,577,256 individuals who matched on name and either e-mail or postal address. The campaign was targeted only to existing customers of the retailers as determined by the match. Of these matched users, LR assigned 81 percent to a treatment group who subsequently viewed two advertising campaigns promoting the retailer when logged into Yahoo’s services. The remaining 19 percent were assigned to the control group and prevented from seeing any of the retailer’s ads from this campaign on the Yahoo! network of sites. The simple randomization was designed to make the treatment-control assignment independent of all other relevant variables. 19. Lewis and Reiley (2013b) show that Super Bowl commercials cause viewers to search for brand-related content across a wide spectrum of advertisers.

Measuring the Effects of Advertising Table 7.1

203

Summary statistics for the campaigns

Time period covered Length of campaign Number of ads displayed Number of users shown ads Treatment group viewing ads Mean ad views per viewer

Campaign 1

Campaign 2

Both campaigns

Early fall ’07 14 days 32,272,816 814,052 63.7% 39.6

Late fall ’07 10 days 9,664,332 721,378 56.5% 13.4

41,937,148 867,839 67.9% 48.3

Source: Lewis and Reiley (2013a).

The treatment group of 1.3 million Yahoo! users was exposed to two different advertising campaigns over the course of two months in fall 2007, separated by approximately one month. Table 7.1 gives summary statistics for the campaigns, which delivered 32 million and 10 million impressions, respectively. The two campaigns exposed ads to a total of 868,000 users in the 1.3-million-person treatment group. These individuals viewed an average of forty-eight ad impressions per person. The experiment indicated an increase in sales of nearly 5 percent relative to the control group during the campaign, a point estimate that would translate to an extremely profitable campaign (with the retailer receiving nearly a 100 percent rate of return on the advertising spending). However, purchases had sufficiently high variance (due in part to 95 percent of consumers making zero purchases in a given week) to render the point estimate not statistically significantly different from zero at the 5 percent level. Controlling for available covariates (age, gender, state of residence) did not meaningfully reduce standard errors. This is a good example of how economically important effects of advertising can be statistically very difficult to detect, even with a million-person sample size. Just as we saw in section 7.2, we see here that the effects of advertising are so diffuse, explaining such a small fraction of the overall variance in sales, that the statistical power can be quite low. For this experiment, power calculations show that assuming the alternative hypothesis that the ad broke even is true, the probability of rejecting the null hypothesis of zero effect of advertising is only 21 percent. The second important result of this initial study was a demonstration of the biases inherent in using cross-sectional econometric techniques when there is endogenous advertising exposure. This is important because these techniques are often employed by quantitative marketing experts in industry. Abraham (2008), for example, advocates comparing the purchases of exposed users to unexposed users, despite the fact that this exposure is endogenously determined by user characteristics and browsing behavior, which might easily be correlated with shopping behavior. To expose the biases in these methods, LR temporarily “discarded” their control group and compared the levels of purchases between exposed and (endogenously)

204

Randall Lewis, Justin M. Rao, and David H. Reiley

unexposed parts of the treatment group. The estimated effects of advertising were three times as large as in the experiment, and with the opposite sign! This erroneous result would also have been deemed highly statistically significant. The consumers who browsed Yahoo! more intensely during this time period (and hence were more likely to see ads) tended to buy less, on average, at the retailer, regardless of whether they saw the ads or not (this makes sense, because as we will see most of the ad effect occurred offline). The control group’s baseline purchases prior to the ad campaign showed the same pattern. Without an experiment an analyst would have had no way of realizing the extent of the endogeneity bias (in this case, four times as large as the true causal effect size) and may have come to a strikingly wrong conclusion. Observing the consistent differences between exposed and unexposed groups over time motivated LR to employ a difference-in-differences estimator. Assuming that any unobserved heterogeneity was constant over time allowed LR to take advantage of both exogenous and endogenous sources of variation in advertising exposure, which turned out to reduce standard errors to the point where the effects were statistically significant at the 5 percent level. The point estimate was approximately the same as (though slightly higher than) the straight experimental estimate, providing a nice specification check. With this estimator, LR also demonstrated that the effects of the advertising were persistent for weeks after the end of the campaign, that the effects were significant for in-store as well as online sales (with 93 percent of the effect occurring offline), and that the effects were significant even for those consumers who merely viewed but never clicked the online ads (with an estimated 78 percent of the effect coming from nonclicking viewers). In a companion paper (Lewis and Reiley, forthcoming), the authors also showed that the effects were particularly strong for the older consumers in the sample—sufficiently strong to be statistically significant even with the simple (less efficient) experimental estimator. In a follow-up study, Johnson, Lewis, and Reiley ([2013], henceforth JLR) improved on some of the weaknesses of the design of the original LR experiment. First, JLR ran “control ads” (advertising one of Yahoo!’s own services) to the control group, allowing them to record which control-group members would have been exposed to the ad campaign if they had been in the treatment group. This allowed them to exclude from their analysis those users (in both treatment and control groups) who were not exposed to the ads and therefore contributed noise but no signal to the statistics. Second, JLR convinced the advertiser to run equal-sized treatment and control groups, which improved statistical power relative to the LR article’s 81:19 split. Third, JLR obtained more detailed data on purchases: two years of precampaign sales data on each individual helped to explain some of the variance in purchases, and disaggregated daily data during the campaign allowed them to exclude any purchases that took place before the first ad

Measuring the Effects of Advertising

205

delivery to a given customer (which, therefore, could not have been caused by the ads, so including those purchases merely contributed noise to the estimates). The more precise estimates in this study corroborate the results of LR, showing point estimates of a profitable 5 percent increase in advertising, which are statistically significant at the 5 percent level, though the confidence intervals remain quite wide. 7.5

Activity Bias

In the preceding sections, we have presented this argument on an abstract level, arguing that the since the partial R2 of advertising, even for a successful campaign, is so low (on the order of 0.00001 or less), the likelihood of omitted factors not accounting for this much variation is unlikely, especially since ads are targeted across time and people. In this section we show that our argument is not just theoretical. Here we identify a bias that we believe is present in most online ad serving; in past work, we gave it the name “activity bias” (Lewis, Rao, and Reiley 2011). Activity bias is a form of selection bias based on the following two features of online consumer behavior: (1) since one has to be browsing online to see ads, those browsing more actively on a given day are more likely to see your ad; and (2) active browsers tend to do more of everything online, including buying goods, clicking links, and signing up for services. Any of the selection mechanisms that lead to their exposure to the advertising are highly correlated with other online activities. Indeed, many of the selection mechanisms that lead to their exposure to the advertising, such as retargeting20 and behavioral targeting, are highly correlated with other online activities. Hence, we see that ad exposure is highly and noncausally correlated with many online activities, making most panel and time-series methods subject to bias. In a nonexperimental study, the unexposed group, as compared to the group exposed to an ad, typically failed to see the ad for one or both of the following reasons: the unexposed users browsed less actively or the user did not qualify for the targeting of the campaign. When the former fails, we have activity bias. When the latter fails, we have classic selection bias. In our 2011 paper, we explored three empirical examples demonstrating the importance of activity bias in different types of web browsing. The first application investigates the causal effects of display ads on users’ search queries. In figure 7.1 we plot the time series of the number of searches by exposed users for a set of keywords deemed to be brand-relevant for a firm. The figure shows results for a time period that includes a one-day-display advertising campaign for a national brand on www.yahoo.com. The campaign excluded a randomized experimental control group, though for the moment we ignore the control group and focus on the sort 20. For a discussion and empirical analysis of retargeting see Lambrecht and Tucker (2013).

206

Fig. 7.1

Randall Lewis, Justin M. Rao, and David H. Reiley

Brand keyword search patterns over time

Source: Lewis, Rao, and Reiley (2011).

of observational data typically available to advertisers (the treatment group, those that saw the firm’s advertisements). The x-axis displays days relative to the campaign date, which is labeled as Day 0. One can easily see that on the date of the ad, ad viewers were much more likely to conduct a brandrelevant search than on days prior or following. The advertising appears to double baseline search volume. Is this evidence of a wildly successful ad? Actually, no. Examining the control group, we see almost the same trend. Brand-relevant keyword searches also spike for those who saw a totally irrelevant ad. What is going on? The control group is, by design of the experiment, just as active online as the treatment group, searching for more of everything, not just the brand-relevant keywords of interest. The time series also shows that search volume is positively serially correlated over time and shows striking day-of-week effects—both could hinder observational methods. The true treatment-control difference is a statistically significant, but far more modest, 5.1 percent. Without an experiment, we would have no way of knowing the baseline “activity-related increase” that we infer from the control group. Indeed, we might have been tempted to conclude the ad was wildly successful. Our second application involves correlation of activity not just across a publisher and search engine, but across very different domains. We ran a marketing study to evaluate the effectiveness of a video advertisement promoting the Yahoo! network of sites. We recruited subjects on Amazon Mechanical Turk, showed them the video, and gave them a Yahoo! cookie so we could track their future behavior. Using the cookie we could see if the

Measuring the Effects of Advertising

207

Fig. 7.2 The effect on various Yahoo! usage metric of exposure to treatment/ control ads Source: Lewis, Rao, and Reiley (2011). Note: Panels A, B, and C: probability of at least one visit to the Yahoo! network, Yahoo.com, and Yahoo! mail, respectively. Panel D: total page views on the Yahoo! network.

ad really generated more Yahoo! activity. The control group saw a political ad totally unrelated to Yahoo! products and services. Again, we ignore the control group to begin. Figure 7.2 has the same format as figure 7.1 Day 0 on the x-axis labels the day an individual saw the video ad (with the actual calendar date depending on the day the subject participated in the study). Examining the treatment group, we can see that on the day of and the days following ad exposure, subjects were much more likely to visit a Yahoo! site as compared to their baseline propensity, indicating a large apparent lift in engagement. However, data on the control group reveals the magnitude of activity bias—a very similar spike in activity on Yahoo! occurs on the day of placebo exposure as well. Both groups also show some evidence of positive serial correlation in browsing across days: being active today makes it more likely that you will be active tomorrow as compared to several days from now. People evidently do not engage in the same online activities (such as visiting Yahoo! and visiting Amazon Mechanical Turk) every day, but they engage in somewhat bursty activity that is contemporaneously correlated

208

Randall Lewis, Justin M. Rao, and David H. Reiley

across sites. Online activity leads to ad exposure, which mechanically tends to occur on the same days as outcome measures we hope to affect with advertising. In the absence of a control group, we can easily make errors in causal inference due to activity bias. In this particular case, the true causal effect of the ad was estimated to be small and not statistically significant—given the cost of running a video ad, it was probably not worth showing, but the biased estimates would have led us to a wrong conclusion in this regard. The third application again involves multiple websites. This time the outcome measure was filling out a new account sign-up form at an online brokerage advertised on Yahoo! Finance. Again, our results show that even those who were randomly selected to see irrelevant placebo ads were much more likely to sign up on the day they saw the (placebo) ad than on some other day. We refer the reader to our original paper for the details, stating here that the results are very similar to the ones we have just presented (the now familiar mountain-shaped graphs are again present). With activity bias it seems that one could erroneously “show” that nearly any browsing behavior is caused by nearly any other browsing behavior! We hope that our results will cause industry researchers to be more cautious in their conclusions. Activity bias is a real form of bias that limits the reliability of observational methods. In the absence of an experiment, researchers may be able to use some other cross-validation technique in order to check the robustness of causal effects. For example, one could measure the effect of movie advertisements on searches for the seemingly irrelevant query “car rental.” Similarly, one could check whether (placebo) ad views of a Toyota ad on the New York Times website on May 29 causes the same effect on Netflix subscriptions that day as did the actual Netflix ad on the New York Times website on May 30. Differences in differences using such pseudo-control groups will likely give better estimates of true causal effects than simple time-series or crosssectional studies, though, of course, a randomized experiment is superior if it is available (Lewis, Rao, and Reiley 2011).21 Is activity bias a new phenomenon that is unique to the online domain? While it is not obvious that offline behavior is as bursty and as contemporaneously correlated as online behavior, before our study we did not think these patterns were obvious in online behavior either (and scanning industry white papers, one will see that many others still do not find it obvious!). We believe the importance of activity bias in the offline domain is an open question. It is not difficult to come up with examples in which offline advertising exposure could spuriously correlate with dependent variables of interest. Billboards undoubtedly “cause” car accidents. Ads near hospitals “cause” illness. Restaurant ads near malls probably “cause” food consumption in 21. In some cases, even such placebo tests may fail as the qualifications for seeing the ad may be intrinsically correlated with the desired outcome, as may be the case for remarketing and other forms of targeting, which account for search activity and browsing behavior.

Measuring the Effects of Advertising

209

general. Exposure to ads in the supermarket saver are likely correlated with consumption of unadvertised products, and so forth. The superior quality of data (and experiments) available in online advertising has laid bare the presence of activity bias in this domain. We believe the level of activity bias in other domains is an interesting, open question. 7.6

Measuring the Long-Run Returns to Advertising

Any study of advertising effectiveness invariably has to specify the window of time to be included in the study. While effects of advertising could in principle last a long time, in practice one must pick a cut-off date. From a business perspective, making decisions quickly is an asset worth trading decision accuracy for at the margin. But can patient scholars (or firms) hope to measure the long-run effects of advertising? Here we address the statistical challenges of this question. The answer, unfortunately, is rather negative. As one moves further and further from the campaign date, the cumulative magnitude of the sales impact tends to increase. (This is not guaranteed, as ads could simply shift purchases forward in time, so a short time window could measure a positive effect while a long time window gives a zero effect. But in practice, we have so far noticed point estimates of cumulative effects to be increasing in the time window we have studied.) However, the amount of noise in the estimate tends to increase faster than the increase in the signal (treatment effect) itself because in the additional data the control and treatment groups look increasingly similar, making long-run studies less statistically feasible than short-run ones. In the remainder of this section we formalize and calibrate this argument. We again employ the treatment versus control t-statistic indexed by little t for time. For concreteness, let time be denominated in weeks. For notational simplicity, we will assume constant variance in the outcome over time, no covariance in outcomes over time,22 constant variance across exposed and unexposed groups, and balanced group sizes. We will consider the long-term effects by examining a cumulative t-statistic (against the null of no effect) for T weeks rather than a separate statistic for each week. We write the cumulative t-statistic for T weeks as: N  ∑t=1 yt  . 2  T ˆ  T

(8)

t yt =

At first glance, this t-statistic appears to be a typical ( T ) asymptotic rate with the numerator being a sum over T ad effects and the denominator grow22. This assumption is clearly false: individual heterogeneity and habitual purchase behavior result in serial correlation in purchasing behavior. However, as we are considering the analysis over time, if we assume a panel structure with fixed effect or other residual-variance absorbing techniques to account for the source of this heterogeneity, this assumption should not be a first-order concern.

210

Randall Lewis, Justin M. Rao, and David H. Reiley

ing at a T rate. This is where economics comes to bear. Since yt represents the impact of a given advertising campaign during and following the campaign (since t = 1 indexes the first week of the campaign), yt ≥ 0. But the effect of the ad each week cannot be a constant—if it were, the effect of the campaign would be infinite. Thus, it is generally modeled to be decreasing over time. With a decreasing ad effect, we should still be able to use all of the extra data we gather following the campaign to obtain more statistically significant effects, right? Wrong. Consider the condition necessary for an additional week to increase the t-statistic: t yT < t yT +1

∑t=1 yt

T +1

T

T


0, and [∂2v(xi,x j )]/( ∂xi ∂x j ) > 0. Each user has total time Z available. The time can be spent either using the platform or working. When working, the user can earn wage w per unit of time. The total amount of money earned allows the user to consume a numeraire good (i.e., a composite of goods and services consumed outside of the platform), which adds to the user’s utility. Both users are the same, with the exception of the wage—user A earns a higher wage than user B(wA > wB ). Hence, if user i spends ni time to earn the numeraire, then he can consume niwi of the numeraire. Each user aims to maximize his or her utility given the time constraint: max xi,ni

v(xi,x j ) + niwi

such that xi + ni ≤ Z. The constraint binds in the optimum, so ni = Z − xi , and the utility maximization problem simplifies to maxxiv(xi,x j ) + (Z − xi )wi. In the interior solution,13 the optimal usage xˆi is given by

12. The model can be easily extended to A and B denoting types of users with an arbitrary number of agents in each type. The qualitative results stay the same, but the notation is more complicated. 13. Corner solutions may happen for very high and very low w’s. When wi is low enough that {[∂v(xˆi, x j )] / ∂xi}| xi = Z > wi, then the user spends all of his or her time using the platform, xˆi = Z . Notice that, in such a case, increasing xj does not change xˆi , but decreasing xj may decrease xˆi below Z if the derivative decreases to {[∂v(xˆi, x j )] / ∂xi}| xi = Z < wi . Similarly, when wi is high enough that {[∂v (xˆi, x j )] / ∂xi}| xi = 0 < wi , then the agent spends no time using the platform, xˆi = 0. Decreasing xj will not change i’s consumption decision. But increasing xj may induce i to set positive xˆi > 0, in the case when the increase in xj increases the derivative to {[∂v (xˆi, x j )] / ∂xi}| xi = 0 > wi.

Some Economics of Private Digital Currency

263

∂v(xˆi,x j ) = wi. ∂xi

(1)

Since [∂2v(xi, x j )]/ ∂xi2 < 0, wA > wB implies xˆA < xˆB . That is, the user earning the higher wage is choosing to spend less time on the platform. Example. Suppose that v(xi, x j ) = xix1− j , for  > 1/2. Combining the firstorder conditions, we get wA  xˆB  =  wB  xˆA 

2(1−)

.

Then, wA > wB implies that xˆA < xˆB . Moreover, there are multiple equilibria possible. Any combination of xA and xB such that wA / wB = (xˆB / xˆA )2(1−) and xB ≤ Z constitutes an equilibrium. Multiplicity of equilibria is not surprising, given the consumption complementarity. 9.3.2

The Platform

We assume that the platform’s revenue directly depends on the usage, r(xA + xB ) where r > 0 is the revenue, say from advertising, related to the total level of activity on the platform, xA + xB . Higher level of activity induces higher revenue. For now, we assume that this is the only source of the platform’s revenue. Under this assumption, the platform aims at maximizing the total usage, xA + xB . Later in the analysis, we allow other sources of revenue, for example, the sale of platform-specific currency. In that latter case, the platform’s optimal decisions do not necessarily maximize total usage. Notice that, due to consumption complementarity, there may exist multiple equilibria with different total usage. Example (continued). Given multiplicity of equilibria, the platform’s usage depends on the equilibrium played. In our example, the largest usage that may be obtained in an equilibrium is for xˆB = Z and xˆA = Z(wB / wA)1/[2(1−)]. The smallest one is arbitrarily close to 0, when xˆB = ε ≠ 0 and xˆA = ε(wB /wA)1/[2(1−)]. 9.3.3

Enhancing the Platform: “Buying” and “Earning”

Suppose that now the platform allows the users to acquire options, ei , that enhance the value of platform usage. For example, this may be additional options in a game. The enhancement increases the usage utility; that is, for the same level of usage, v (xi, ei′, x j ) > v (xi, ei, x j ) for ei′ > ei. Moreover, we assume that [∂v (xi, ei′, x j )]/ ∂xi > [∂v (xi, ei, x j )]/ ∂xi , [∂v (xi′, ei, x j )]/ ∂ei > [∂v (xi, ei, x j )]/ ∂ei for xi′ > xi and [ ∂v (xi, ei, x j )]/ ∂ei → as ei → 0.14 The enhancement may be obtained by “buying” it, or by “earning” it (e.g., through testing functionality or simply by playing the game more intensively). Specifically, we assume that 14. This is on top of the usual second-order conditions: [∂2v (xi, ei, x j )] / ∂xi 2 < 0 and [∂2v (xi, ei, x j )] / ∂ei 2 < 0.

264

Joshua S. Gans and Hanna Halaburda

ei = yi + ti , where yi are the units of the numeraire (buying) and ti are in units of time (earning). User i’s utility in the environment with the enhancement is (2)

v (xi, ei (ti, yi ), x j ) + (Z − xi − ti )wi − yi ,

which the user maximizes by choosing xi , ti , and yi subject to the constraints that yi ≤ (Z − xi − ti )wi and Z ≥ xi + ti . For a solution interior in all three variables, the first-order conditions are (3)

w.r.t.xi :

∂v (xi, ei, x j ) = wi ∂xi

(4)

w.r.t.ti :

∂v(xi, ei, x j )  = wi ∂ei

(5)

w.r.t. yi :

∂v (xi, ei, x j )  = 1. ∂ei

Notice, however, that ti and yi are perfect substitutes in achieving ei . Therefore, each user chooses only one way of obtaining ei , whichever is cheapest. Buying a unit of ei costs the user 1/, while earning it costs wi /. If wi < /, then user i only earns the enhancement, and yi = 0. Then, the two relevant first-order conditions are (6)

∂v (xi, ei, x j ) ∂v (xi, ei, x j ) = wi and  = wi . ∂xi ∂ei

When wi > /, then user i only buys, that is, ti = 0. Then, the two relevant first-order conditions are (7)

∂v (xi, ei, x j ) ∂v (xi, ei, x j ) = wi and  = 1. ∂xi ∂ei

For exogenously given w’s, ϕ, and γ, we assume here that Z is large enough that solutions on the relevant parameters (xi and ti , or xi and yi ) are interior. For an interior xi , we can prove the following result. Lemma 1. Holding ei and xj fixed, a user i with lower wi optimally chooses higher usage, xi. Proof. Since Z is large enough for xi to be interior for both users, [∂v (xi, ei, x j )]/ ∂xi = wi . With wA > wB , for the same ei and x j , the derivative is higher for the higher-wage user. And since ∂2v (xi,⋅) / ∂xi2 < 0, the derivative is higher for smaller usage xi . Hence xA < xB if ei and x j are unchanged. Given that users have different wages, in equilibrium it will not be the case that ei and x j are the same for both users. With the higher usage xi , the marginal benefit of enhancement is higher. Thus, users with lower wi choose larger ei , which further increases their optimal usage.

Some Economics of Private Digital Currency

265

Lemma 2. The low-wage user acquires more enhancements and has higher usage in equilibrium. Proof: We conduct this proof in two steps. In the first one, we show that the low-wage user acquires more enhancements for a fixed xi and x j . In the second step, we combine the result of the first step and Lemma 1 to complete the proof for the equilibrium outcome. When both wA and wB are greater—or both lower—than  / , we find that the low-wage user acquires more enhancement directly from the secondorder conditions (for a fixed xi and x j ). The interesting case is when wA >  /  > wB. In this case, the first-order conditions are (∂v / ∂eB ) = wB and (∂v / ∂eA) = 1. Those conditions imply that ∂v / ∂eB = wB /  and ∂v / ∂eA = 1/ . And since  /  > wB ⇔ 1/  > wB / , then ∂v / ∂eA > ∂v / ∂eB . Therefore, if faced with the same xi and x j , eA < eB . In the second step of the proof, notice, from Lemma 1, that we know that xA < xB for the same ei and x j . Moreover, because own consumption has  a  larger effect on utility than x j , it is still true that xA(xB ) < xB (xA) for  the  same ei . Moreover, from the previous step of this proof, given xi and x j , eA < eB reinforces the fact that in equilibrium x*A < xB* (i.e., xB (e) − xA(e) < xB*(eB*) − x*A(e*A )). Notice that usage increases more when both ways of procuring ei are available. Because users choose the cheapest way, they choose more ei than they would if only one way of procurement was allowed. Higher ei leads to higher xi . Moreover, due to consumption complementarities, it further increases the consumption of the other user, x j . Therefore, by allowing users to both earn and buy an enhancement of the platform usage (e.g., Facebook Credits), the platform increases usage, as compared to allowing for only one type of enhancement procurement. Proposition 1. When the platform allows for both earning and buying of the enhancement, the direct usage, xA + xB , (weakly) increases by more than when the platform allows for only one type of enhancement procurement (only buying or only earning). The increase is weak because if both users are choosing the same means of obtaining the enhancement, and the only option is the optimal option, then adding a new option does not strictly improve usage. The following proof focuses on the interesting case where improvement is strict. Proof. Let wA > / > wB . Suppose that only option buy is available. Both i = A, B choose their enhancement investment and usage based on equation (7). Let B’s optimal choices in this case be xˆB and eˆB. When it becomes possible to earn, user B prefers to go for the new option, and chooses enhancement eˆˆB according to condition (6). Since wB / < 1/, then [∂v (xˆB, eˆB,xA)]/ ∂eB > [∂v (xˆB, eˆˆB,xA)]/ ∂eB , which implies eˆˆB (xˆB ) > eˆB (xˆB ). But then, also, xˆˆB > xˆB . So, in equilibrium eˆˆB and xˆˆB > xˆB . Given the comple-

266

Joshua S. Gans and Hanna Halaburda

mentarity in users’ activity, increasing xB also increases xA . Thus, allowing for earning of platform enhancement along with buying increases total platform usage by increasing both xB and xA . In a similar way, we can also show that starting from earning only, and then allowing buying as well, increases total platform usage by increasing both xA and xB . It is useful to consider the relevance of this proposition for digital currency. For instance, Facebook Credits represent a unit of account. It could have been that, like Microsoft and Nintendo, these credits were solely bought. In this way, they would merely be a way of converting real currency into onplatform payments. However, to the extent that some users of the platform are income or wealth constrained, this would reduce their use of enhancements. Complementarity among users would then imply a reduction in overall activity on the platform. Instead, by offering a means of earning enhancements, the platform provides an alternative pathway for incomeconstrained users. Of course, this may be strengthened if such earning was itself platform activity—as sometimes occurs—but we have supressed that effect here. Later, in section 9.3.5, we also discuss how Proposition 1 may sometimes fail if the platform has different objectives than maximizing total usage. The proposition also demonstrates that allowing “inward convertibility” from real currency onto the platform encourages more usage from incomerich users. Once again, complementarity among users leads to more overall usage from convertibility. Thus, while World of Warcraft may officially prohibit “Gold farming,” there is a sense in which it increases platform usage. Of course, it could be imagined that digital currencies associated with platforms could go further and allow outward convertibility—the reverse exchange back into state-issued currency. It is this feature that would put those currencies on a path to competing with state-issued currencies. We examine this option next. 9.3.4

Reverse Exchange

In this section, we show that if the platform were to allow for the reverse exchange of earned credits into state-issued currency, it would decrease platform usage. Proposition 2. If the platform allows for the reverse exchange of ei into yi at any positive rate, it lowers platform usage. Proof. Suppose that user i can spend ti to get ei = ti , but then can convert it back into cash at a rate of μ: yi = ei / = ti /. Then, the effective wage of user i is yi /ti =  /. If the platform puts no restrictions on this exchange, it allows all agents with outside wage wi <  / to achieve the effective wage of wˆ =  /. But, from the previous results, we know that increasing the wage

Some Economics of Private Digital Currency

267

lowers the equilibrium usage xi , and also lowers how much of ei is actually used by the agent on the enhancement, (as the agent may redeem15 part or all of ei for yi ). The proof here does not take into account the fact that reverse exchange would be costly for the platform. In other words, it is unambiguously detrimental to the platform. Thus, as long as the goal of the platform is to maximize direct activity (xA + xB ), platforms have no incentive to allow for outward convertibility or reverse exchange. In other words, despite the concern of commentators, platforms that utilize digital currencies for within platform transactions have no incentive to move toward full convertibility. It is worth considering the assumption that drives this strong result. Here we have assumed that platform activity—including the incentive to purchase an enhancement—is solely driven by utility earned within the platform. Specifically, the enhancement increases the marginal utility from activity and is reduced if currency is redeemed outside of the platform. However, it could be the case that by earning the enhancement, activity is increased even if the currency earned is redeemed rather than spent within the game. In this case, the incentive to earn that currency increases activity and could be enhanced by allowing convertibility. This may be part of the rationale for allowing full convertibility of Linden dollars in the game Second Life. 9.3.5

Optimal Choice of γ and ϕ

Until now, we have taken γ and ϕ as given. Typically, however, the platform sets γ and ϕ. Each user’s choice of whether to earn or purchase an enhancement depends on the prices, 1/γ and 1/ϕ, and their relationship to the user’s wage. The prices chosen by a platform depend on its precise objective. Thus far, we have focused on the impact of various platform choices on xA + xB, direct platform usage. This would be relevant if the platform’s only source of revenue was, say, advertising, related to platform usage. In this case, the platform would aim to set both γ and ϕ as high as possible while still assuring that, regardless of how a user chooses to obtain the enhancement, each does so. In effect, the enhancement would be so ubiquitous that it would be an integral part of the platform, and there would be few interesting questions regarding currencies. In some cases, the platform may also earn the same advertising revenue from users’ activity while earning an enhancement. In this case, the platform would aim to maximize r (xA + xB + tA + tB ). The platform may then benefit from users engaging in a variety of activities (depending on the nature of v(.)), but, regardless, it would want ϕ to be as high as possible while still assuring that all users earn the enhancement. For γ, the platform faces a trade-off. Decreasing γ can induce high-wage types to switch their activity 15. Since part or all of the enhancement is redeemed, it does not enter as ei into v(xi, ei, x j ).

268

Joshua S. Gans and Hanna Halaburda

toward earning the enhancement, which directly increases tA. However, this involves some substitution away from xA which, depending upon v(.), may lead to a reduction in activity by B. Thus, it is not possible to characterize this price in the general case, as the optimal price will depend on the particular functional forms. Of course, the purchases of enhancements can also represent an alternative revenue stream for the platform. In this case, it would be reasonable  to  consider the platform as maximizing r (xA + xB ) + (yA + yB ) or r (xA + xB + tA + tB ) + (yA + yB ). Depending on the level of r, the platform may prefer to withdraw the possibility of earning an enhancement and force all agents to buy it. In such a case, Proposition 1 may fail. Regardless of whether Proposition 1 holds or fails, the platform will set the prices so that each user’s time constraint is binding and focused on the platform, either through activity or income. That is, for users buying an enhancement, ti = 0 and yi = (Z − xi )wi , while for a user earning the enhancement, yi = 0 and ti = Z − xi. This allows us to identify the first-order conditions for users. For users earning the enhancement, it is (8)

∂v (xi, ei, x j ) ∂v (xi, ei, x j ) = . ∂xi ∂ei ei =(Z −xi ) ei =(Z −xi )

Notice that this condition is independent of wi. Thus, the optimal usage schedule for those earning the enhancement is independent of wage. That is, if both high-wage and low-wage agents decide to earn the enhancement, they would earn the same ei and consume the same xi. For a user buying the enhancement, the first-order condition yields (9)

∂v (xi, ei, x j ) ∂v (xi, ei, x j ) = wi . ∂xi ∂ei ei =(Z −xi )wi ei =(Z −xi )wi

Thus, users who buy the enhancement will differ in their usage levels, depending on the wage. This suggests that allowing users to buy enhancements can be useful when it is optimal to exploit their differential usage rather than ignore it. Of course, a precise characterization is not possible in the general case. For our running example, however, we can provide a more precise conclusion. Example (continued). Suppose that, in our example, the platform  introduces the enhancement and now v (xi, ei, x j ) = xix1− j ei . Moreover,   1− ei = yi + ti . Then, user i’s utility is xi x j (yi + ti ) + (Z − xi − ti )wi − yi . For wi <  /, that is, yi = 0:   xi−1x1− = wi j (ti )  ⇒ ti = xi .  −1  1−   xi x j (ti ) = wi

Some Economics of Private Digital Currency

269

Using ti = ( /)xi , the first-order condition yields xi+−1 = wi /(1−x1− j ) if the solution is interior, that is, when ti < Z − xi . When ϕ is large enough +−1 (i.e.,  > ( /){[wi ( + )+−1]/[x1− ]}1/ ), so that ti = Z − xi, the j (Z)  1− user’s problem becomes maxxi xi x j ((Z − xi )) . The optimal usage is then xi = Z/( + ) and ti = Z/( + ). Notice that it does not depend on ϕ once the time constraint is binding. For wi >  /, that is, ti = 0,   xi−1x1− = wi j (yi )  ⇒ yi = xiwi .  −1  1−   xi x j (yi ) = 1

And further it yields xi^{α+β−1} = wi^{1−β}/(α^{1−β}β^β γ^β xj^{1−α}) for the interior solution. The corner solution, which arises when γ is sufficiently large, is xi = αZ/(α + β) and yi = [βZ/(α + β)]wi. Depending on the wages and "prices" (φ and γ), there are three situations possible: both agents earn the enhancement, both buy it, or one buys and the other earns. We analyze each case in turn (for the interior solution).
1. When both agents earn the enhancement, then any consumption patterns in equilibrium must satisfy (xB/xA)^{2(1−α)−β} = wA/wB. Together with the formula for xi derived above, it yields

  xi = (wj/wi)^{(1−α)/[β(2(1−α)−β)]} [wi/(α^{1−β}β^β φ^β)]^{1/β}.

This is a complicated formula, but it uniquely characterizes xi with respect to the exogenous parameters.
2. When both agents buy the enhancement, then in any equilibrium it must be that (xB/xA)^{2(1−α)−β} = (wA/wB)^{1−β}. Then,

  xi = (wj/wi)^{[(1−α)(1−β)]/[β(2(1−α)−β)]} [wi^{1−β}/(α^{1−β}β^β γ^β)]^{1/β}.

3. When agent A buys the enhancement, while agent B earns, then in any equilibrium it must be that (xA/xB)^{2(1−α)−β} = (wB/wA^{1−β})(γ/φ)^β. And then,

  xA = (wB/wA^{1−β})^{(1−α)/[β(2(1−α)−β)]} (γ/φ)^{(1−α)/[2(1−α)−β]} [wA^{1−β}/(α^{1−β}β^β γ^β)]^{1/β},
  xB = (wA^{1−β}/wB)^{(1−α)/[β(2(1−α)−β)]} (φ/γ)^{(1−α)/[2(1−α)−β]} [wB/(α^{1−β}β^β φ^β)]^{1/β}.

Notice that, in all three cases, introducing the enhancement eliminates multiplicity of equilibria, since now xA and xB are uniquely characterized by the exogenous parameters. Now consider the platform setting prices ϕ and γ to maximize its objective. We consider four possible objective functions for the platform:

1. max r(xA + xB): The platform is indifferent as to whether users buy or earn. Whether γ is so high that both buy, φ so high that both earn, or one buys and one earns, the platform can always achieve the global maximum of xA = xB = αZ/(α + β).
2. max r(xA + xB) + (yA + yB): The platform raises γ so that not only do both users buy the enhancement, but both reach the corner consumption schedule. The platform reaches the global maximum of xA = xB = αZ/(α + β) and yi = [βZ/(α + β)]wi, i = A, B.
3. max r(xA + xB + tA + tB): The platform raises φ so that not only do both users earn the enhancement, but both reach the corner consumption schedule. The platform reaches the global maximum of xA = xB = αZ/(α + β) and tA = tB = βZ/(α + β), earning 2Z. If the platform were to set φ lower so that wB < φ/γ < wA, then tA = 0 and xA = αZ/(α + β). Thus, the platform would earn Z[1 + α/(α + β)] < 2Z.
4. max r(xA + xB + tA + tB) + (yA + yB): Optimal prices (and the optimal users' consumption schedules) depend on the wi's and r. The interesting case is when wB < r < wA. Then the platform is strictly better off by setting the prices such that user A buys and user B earns the enhancement, with consumption achieving a global maximum, xA = xB = αZ/(α + β), tB = βZ/(α + β), and yA = [βZ/(α + β)]wA.
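To make these corner allocations concrete, here is a minimal numerical sketch of an earning user's problem in the running example, solved by brute-force grid search. All parameter values (α, β, Z, φ, γ, xj, the wages) are illustrative assumptions rather than values from the chapter; the only check is that, once φ is large enough for the time constraint to bind, the optimum lands near xi = αZ/(α + β) and ti = βZ/(α + β) regardless of the wage.

```python
import numpy as np

# Illustrative parameters (assumptions, not taken from the chapter).
alpha, beta = 0.5, 0.3      # exponents in v(x_i, e_i, x_j) = x_i^a * x_j^(1-a) * e_i^b
Z = 10.0                    # time endowment
phi, gamma = 16.0, 1.0      # phi/gamma = 16, so both example wages below prefer to earn
x_j = 3.0                   # the other user's activity, taken as given

def utility_earner(x, t, w):
    """Utility of a user who earns the enhancement: y = 0, e = phi * t."""
    if x < 0 or t < 0 or x + t > Z:
        return -np.inf
    return x**alpha * x_j**(1 - alpha) * (phi * t)**beta + (Z - x - t) * w

def best_response_earner(w, grid=401):
    xs = np.linspace(1e-6, Z, grid)
    X, T = np.meshgrid(xs, xs)
    U = np.vectorize(utility_earner)(X, T, w)
    i = np.unravel_index(np.argmax(U), U.shape)
    return X[i], T[i]

# With phi this large the time constraint binds, and the optimum should be close to
# x = alpha*Z/(alpha+beta), t = beta*Z/(alpha+beta), independently of the wage.
for w in (0.5, 1.0):
    x_star, t_star = best_response_earner(w)
    print(w, round(x_star, 2), round(t_star, 2),
          "theory:", round(alpha * Z / (alpha + beta), 2), round(beta * Z / (alpha + beta), 2))
```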

9.3.6 Summary

For a platform whose main source of revenue is advertising (e.g., Facebook), its objective is to increase the activity of its users (e.g., the use of social games). When activity on the platform is more valuable for a user when other users increase their activity (e.g., from the social component), there is complementarity in activity on the platform. A platform can provide an enhancement of user experience to encourage more activity (e.g., buying special versions of crops for your farm in FarmVille, which have a higher yield than regular crops). Higher activity by one user increases the utility— and activity—of other users, due to the complementarity. For this reason, if two users acquire the enhancement, the increase in activity is larger than double the increase in activity resulting from a single user's enhancement. Therefore, it is optimal for the platform to encourage all users to acquire the enhancement. But some users may find the monetary cost too high, for example, if they have a low wage. Then, the platform gains if it allows for both buying and earning the enhancement. High-wage users will prefer to spend money rather than time, while low-wage users can spend time instead of money. Both types will acquire the enhancement and increase activity on the platform. This reflects the policies of many social networks and also some gaming platforms. Of particular significance is Proposition 2, under which the platform does not allow platform-specific currencies to be traded back for state-issued currency.

This provides a strong result that such platforms are not interested in introducing currencies that would directly compete with existing state-issued currencies. That said, for a platform such as Facebook, there is a flow of money back through developer payments: that is, a developer writes a game that induces people to purchase enhancements. The developer then receives part of the revenue that Facebook receives when Credits are purchased. Nonetheless, this is really just an extension of the platform notion, where the game itself is the platform of interest. Indeed, in mid-2012, Facebook announced that it would phase out Credits by the end of 2013 and rely only on state-issued currencies. Users often needed to further convert Facebook Credits into currencies within apps and games, for example, zCoins in Zynga's games. Users and developers were against this additional layer of complication and wanted a direct link to state-issued currencies. This is consistent with the model, in that, for Facebook's core activity (the news feed), all features were available to all users. It could still earn essentially "referral" fees for revenue generated by others on its platform, but for its core activity, a currency would perform no additional role. By contrast, it is easy to imagine that app developers such as Zynga introduced their own currencies for exactly the same reason as in our main model: to increase activity on their "app platform." Just as Facebook Credits once bought or earned could not be exchanged back into cash, so zCoins—once bought or earned—cannot be exchanged back into state-issued currency (or indeed Facebook Credits when they were available). This policy is driven by Zynga's objective to maximize activity on its own platform. This may, however, conflict with Facebook's objective to increase activity on the Facebook platform, possibly across different apps. A richer model would be required to explore issues arising from interlocking platforms. A distinct argument lies behind Amazon Coins, introduced in the beginning of 2013. Amazon announced that it would give away "millions of US dollars worth" of Amazon Coins to customers, starting in May 2013. Like all other introductions of digital currencies, this attracted the usual concern about the threat to state-issued currencies. "But in the long term what [central banks] should perhaps be most worried about is losing their monopoly on issuing money," wrote the Wall Street Journal. "A new breed of virtual currencies are starting to emerge—and some of the giants of the web industry such as Amazon.com Inc. are edging into the market."16 However, Amazon Coins is simply a subsidy to buyers to participate in the platform (Kindle Fire), with the purpose of starting and accelerating any indirect network effects benefiting Amazon's app platform. When Kindle Fire users purchase Amazon Coins, they receive an effective discount on
16. Wall Street Journal, MarketWatch. http://articles.marketwatch.com/2013-02-13/commentary/37064080_1_currency-war-bitcoin-central-banks.

apps (from 5 to 10 percent, depending on how many Coins are purchased), something that was a feature of Facebook Credits as well. Due to uncertainty about the quality of apps, a subsidy to users is more effective than a subsidy to the developers, since users will "vote" with their Coins for the best apps. At the same time, introducing Amazon Coins is potentially more convenient than subsidizing via cash, since it ensures that the subsidy is spent on the Amazon app platform, and not on other services on Amazon or outside.

9.4 Regulatory Issues

Our analysis of platform-specific currencies shows that voices calling for specific regulation of them overstate their case, since the purpose of those currencies is a natural complement to the business models associated with platforms such as Facebook or Amazon. To maximally benefit the platform, the use of the currencies needs to be restricted. Thus, it is not in the interest of the platforms to provide fully functional currencies that could compete with state currencies. In our analysis, however, we have not considered Bitcoin, which is a fully convertible, purely digital currency not associated with a given platform. It is explicitly designed to compete with state currencies. In March 2013, the US government for the first time imposed regulations on online currencies.17 Virtual currencies are to be regulated by the US Treasury, since the Financial Crimes Enforcement Network (FinCEN) decided they fall under the anti-money-laundering laws.18 According to the new rules, transactions worth more than $10,000 need to be reported by companies involved in issuing or exchanging online currencies. The rules do not single out Bitcoin, but apply to all "online currencies." This clarification of the FinCEN rules was issued after evidence emerged that Bitcoin is used for illegal activity (e.g., Silk Road). Illegal activity is a concern because the anonymity of Bitcoin allows for untraceable trades. There may be other reasons to regulate online currencies that apply to both anonymous and account-based currencies. The European Central Bank released a report at the end of 2012 analyzing whether virtual currency schemes can affect price stability, financial stability, or payment stability.19 The report distinguishes between closed virtual currency schemes (i.e., used only within games or apps, akin to virtual Monopoly money) and virtual currency schemes that interact with state currencies (i.e., can be used to purchase real goods and services, or even directly converted to state currencies).20
17. http://finance.fortune.cnn.com/tag/facebook-credits/.
18. http://www.newscientist.com/article/mg21729103.300–us-to-regulate-bitcoin-currency-at-its-alltime-high.html.
19. http://www.ecb.europa.eu/pub/pdf/other/virtualcurrencyschemes201210en.pdf. The report focused specifically on case studies of Bitcoin and Linden dollars, but the conclusions were more general.

Closed virtual currency schemes are not a concern in the view of the report, since only virtual currency that interacts with the real economy can affect price stability, financial stability, and payment stability. However, the report also concluded that, currently, virtual currency that interacts with state currencies poses no risks, since such money creation is at a low level. Moreover, the interaction of Linden dollars, Bitcoin, and similar schemes with the real economy is low because those currencies are used infrequently, by a small group of users, and—most importantly—their use is dispersed geographically, across many state currencies; hence the impact on any one state currency is negligible. In the case of Q-coin, used only in China, the impact could be significant enough for the central bank to step in and regulate the use of virtual currencies. A social networking site, Tencent QQ, introduced Q-coin to allow for virtual payments. This was not a platform-sponsored currency as we have modeled above, but instead a substitute for state-sponsored currency. Indeed, Q-coins are purchased with Chinese state currency. Thus, while Q-coin was intended for the purchase of virtual goods and services provided by Tencent, users quickly started transferring Q-coin as peer-to-peer payments, and merchants started accepting Q-coin as well.21 As the amount of Q-coins traded in one year reached several billion yuan, the Chinese authorities stepped in with regulation. In June 2009, the Chinese government banned exchanging virtual currencies for real goods and services, in order to "limit the possible impact on the real financial system."22

9.5 Future Directions

This chapter has considered the economics of pure digital currencies and demonstrated that, in most cases, private currencies issued in support of a platform are unlikely to have implications that extend beyond the platform. Of course, our approach has been theoretical, but it does provide a framework to examine digital currencies as a lens for understanding platform strategy. What is of broader future concern is the emergence of digital currencies that compete with state-issued currency. For this, the gap in economic knowledge arises from an imperfect set of frameworks for analyzing money and its uses per se, let alone whether they are real or virtual. That said, considering our exploration of these issues, we speculate here that platform economics may actually have a role in assisting a broader understanding of monetary economics.
20. The European Central Bank report also acknowledges that virtual currency schemes "can have positive aspects in terms of financial innovation and the provision of additional payment alternatives for consumers" (47). However, the position of a central bank is to protect state currencies from the risks the virtual currencies may pose.
21. http://voices.yahoo.com/a-virtual-currency-qq-coin-has-taken-real-value-278944.html.
22. http://english.mofcom.gov.cn/aarticle/newsrelease/commonnews/200906/20090606364208.html.

Any currency can be viewed as a platform, where people need to "join" by believing in its value; that is, they join by accepting it. Transactions occur only between people who accept the currency and have joined the platform. Currencies also exhibit network effects: the more people accept it, the more value there is to accepting it. If we were to consider any other technology platform instead of currency, the concerns expressed by regulators (e.g., in the European Central Bank report) would be akin to protecting the market power of an incumbent against innovative entrants. We know from the technology literature that such protection usually leads to a loss of efficiency because new entrants can come up with ways to serve the market better and more cheaply, and perhaps also to expand the market. Is there a good reason for such protection? The nineteenth and early twentieth centuries in North America saw a period of so-called "free banking," where private banks were allowed, under some initial conditions, to issue their own currency. That is, the state did not have a monopoly on issuing currency. However, throughout this period, regulatory interventions increased, and in the early twentieth century it became common practice to outlaw the issuing of currency by anyone except the state (Frankel 1998). Issuing currency is profitable, since the issuer gains seigniorage. Thus, one reason for the state to institute a monopoly would be the incentive to capture the whole seigniorage profit—to the detriment of innovation. However, economic historians23 point to other factors leading to the increasingly stricter regulation and eventual monopolization of currency. One such factor is frequent bank failures. In a competitive environment, firms often fail and new ones enter. Prior to the early twentieth century in North America, however, bank failures left customers with bank notes redeemable for only a fraction of their nominal value, and sometimes not redeemable at all (i.e., worthless). This undermined financial stability and the public's trust in paper currency overall. Lack of trust sometimes resulted in bank runs, which led to more bank failures. The trust issues were also reflected in exchange rates between currencies from different issuers. Some private bank notes circulated at a discount (i.e., a $1 bank note was considered worth less than the nominal $1) when there were doubts about the bank's solvency. Another reason for lower trust was counterfeiting, which is, of course, also a concern with state-issued currency. But with multiple issuers the number and variety of notes in circulation is larger, and it is harder for the public to keep track of genuine features. Since the notes were only redeemable at the issuing bank and banks were typically local, the acceptance of some notes would be geographically restricted.
23. See, for example, Rockoff (1974) or Smith (1990).

Farther away from the issuing bank's location, the notes would be accepted at a discount, if they were accepted at all. Both of these factors—lack of trust and varying exchange rates—created difficulties for trade. At times, it even created worries that trade could collapse altogether. But how do those well-known factors compare to the analyses in the technology literature? We know that the presence of network effects often creates multiple equilibria—either lots of people join the platform because they expect lots of other people to join, or no one joins because they do not expect others to join. Similar equilibria can be seen in currency usage. Trust in the currency helps to coordinate on the better equilibrium, where people generally adopt paper currency. Another parallel in the technology literature is compatibility. Having multiple networks with limited or no compatibility lowers efficiency as compared to one single network, since under limited compatibility the network effects cannot be realized to their full value. This brings out a well-known tension: On the one hand, the presence of multiple competing platforms creates inefficiency by limiting the extent of network effects (when compatibility is limited), and presents the risk of coordination failure when users do not join at all. On the other hand, a single, well-established dominant platform overcomes the issue of coordination and renders compatibility irrelevant, while stifling innovation and possibly extracting monopoly profit from the users. In issuing currency, states have, since the twentieth century, traditionally considered one single network as the better side of this trade-off. Whether that conclusion remains valid with respect to online currencies is a question for future research.

References

Armstrong, M. 2006. "Competition in Two-Sided Markets." RAND Journal of Economics 37 (3): 668‒91.
Evans, D. S. 2012. "Facebook Credits: Do Payments Firms Need to Worry?" PYMNTS.com, February 28. http://www.pymnts.com/briefing-room/commerce-3–0/facebook-commerce-2/Facebook-Credits-Do-Payments-Firms-Need-to-Worry-2/.
Frankel, A. S. 1998. "Monopoly and Competition in the Supply and Exchange of Money." Antitrust Law Journal 66 (2): 313‒61.
Gans, J. S., and S. P. King. 2003. "The Neutrality of Interchange Fees in Payments Systems." B.E. Journal of Economic Analysis and Policy 3 (1). doi:10.2202/1538-0653.1069.
Rochet, J-C., and J. Tirole. 2002. "Cooperation among Competitors: Some Economics of Payment Card Associations." RAND Journal of Economics 33 (4): 549‒70.
———. 2003. "Platform Competition in Two-Sided Markets." Journal of the European Economic Association 1 (4): 990‒1029.
Rockoff, H. 1974. "The Free Banking Era: A Reexamination." Journal of Money, Credit and Banking 6 (2): 141‒67.

Smith, V. C. 1990. The Rationale of Central Banking and the Free Banking Alternative. Indianapolis: Liberty Fund.
Weyl, E. G. 2010. "A Price Theory of Multi-Sided Platforms." American Economic Review 100 (4): 1642‒72.
Yglesias, M. 2012. "Social Cash: Could Facebook Credits Ever Compete with Dollars and Euros?" Slate, February 29. http://www.slate.com/articles/business/cashless_society/2012/02/facebook_credits_how_the_social_network_s_currency_could_compete_with_dollars_and_euros_.html.

10  Estimation of Treatment Effects from Combined Data: Identification versus Data Security

Tatiana Komarova, Denis Nekipelov, and Evgeny Yakovlev

10.1 Introduction

In policy analysis and decision making, access to individual data that may be considered sensitive or damaging when released publicly is instrumental in many areas. For instance, a statistical analysis of the data from clinical studies that can include information on the health status of their participants is crucial to study the effectiveness of medical procedures and treatments. In the financial industry, a statistical analysis of individual decisions combined with financial information, credit scores, and demographic data allows banks to evaluate risks associated with loans and mortgages. The resulting estimated statistical model will reflect the characteristics of individuals whose information was used in estimation. The policies based on this statistical model will also reflect the underlying individual data. The reality of the modern world is that the amount of publicly available (or searchable) individual information that comes from search traffic, social networks, and personal online file depositories (such as photo collections) is increasing on a daily basis. Thus, some of the variables in the data sets used for policy analysis may be publicly observable.1
Tatiana Komarova is assistant professor of economics at the London School of Economics and Political Science. Denis Nekipelov is assistant professor of economics at the University of California, Berkeley. Evgeny Yakovlev is assistant professor at the New Economic School in Moscow, Russia.
We appreciate helpful comments from Philip Haile, Michael Jansson, Phillip Leslie, Aureo de Paula, Martin Pesendorfer, James Powell, Pasquale Schiraldi, John Sutton, and Elie Tamer. We also thank participants of the 2013 NBER conference "Economics of Digitization: An Agenda" for their feedback. For acknowledgments, sources of research support, and disclosure of the authors' material financial relationships, if any, please see http://www.nber.org/chapters/c12998.ack.

Frequently, various bits of information regarding the same individual are contained in several separate data sets. Individual names or labels are most frequently absent from the available data (either for the purposes of data anonymization or as an artifact of the data collection methodology). Each individual data set in this case may not pose a direct security threat to individuals. For instance, a collection of online search logs will not reveal any individual information unless one can attach names or other identifying information to the generic identifiers attached to each unique user. However, if one can combine information from multiple sources, the combined array of data may pose a direct security threat to some or all individuals contained in the data. For instance, one data set may be a registry of HIV patients in which the names and locations of the patients are removed. Another data set may be the address book that contains names and addresses of people in a given area. Both of these data sets individually do not disclose any sensitive information regarding concrete individuals. A combined data set will essentially attach names and addresses to the anonymous labels of patients in the registry and, thus, will disclose some sensitive individual information. The path to digitization in a variety of markets, with the simultaneous availability of data from sources like social networks, makes this scenario quite realistic. Clearly, from a policy perspective the prevention of a further increase in the availability of such multiple sources is unrealistic. As a result, a feasible solution seems to be to assure some degree of anonymization as a security measure. At the same time, inferences and conclusions based on such multiple sources may be vital for making accurate policy decisions. Thus, a key agenda item in the design of methods and techniques for secure data storage and release is finding a trade-off between keeping the data informative for policy-relevant statistical models and, at the same time, preventing an adversary from reconstructing sensitive information in the combined data set. In this chapter we explore one question in this agenda. Our aim is to learn how one can evaluate the treatment effect when the treatment status of an individual may present sensitive information while the individual demographic characteristics are either publicly observable or may be inferred from some publicly observable characteristics. In such cases we are concerned with the risk of disclosing sensitive individual information. The questions that we address are, first, whether the point identification of treatment effects from the combined public and sensitive data is compatible with formal restrictions on the risk of the so-called partial disclosure. Second, we want to investigate how the public release of the estimated statistical model can lead to an increased risk of such a disclosure.
1. Reportedly, many businesses indeed rely on combined data. See, for example, Wright (2010) and Bradley et al. (2010), among others.

In our empirical application we provide a concrete example of the analysis of treatment effects and propensity scores from two "anonymized" data sets. The data that we use come from the Russian Longitudinal Monitoring Survey (RLMS), which combines several questionnaires collected on a yearly basis. The respondents are surveyed on a variety of topics, from employment to health. However, for anonymization purposes any identifying location information is removed from the data, making it impossible to verify where exactly each respondent is located. Due to the vast Soviet heritage, most people in Russia live in large apartment developments that include several blocks of multistory (usually five floors and up) apartment buildings connected together with common infrastructure, shops, schools, and medical facilities. With such a setup in place, the life of each family becomes very visible to most of the neighbors. Our specific question of interest is the potential impact of the dominant religious affiliation in the neighborhood on the decision of parents to get their children checked up by a doctor in a given year, as well as the decision of the parents to vaccinate their child with the age-prescribed vaccine. Such an analysis is impossible without neighborhood identifiers. Neighborhood identifiers are made available to selected researchers upon a special agreement with the data curator (University of North Carolina and the Higher School of Economics in Moscow). This allows us to construct the benchmark where the neighborhood identification is known. Then we consider a realistic scenario where such an identification needs to be restored from the data. Using a record linkage technique adopted from the data mining literature, we reconstruct neighborhood affiliation using the individual demographic data. Our data linkage technique relies on observing data entries with infrequent attribute values. Accurate links for these entries may disclose individual location and then lead to the name disclosure based on the combination of the location and demographic data. We note that the goal of our work is not to demonstrate the vulnerability of anonymized personal data, but to construct a synthetic situation that reflects a component of actual data-driven decision making and to show the privacy versus identification trade-off that arises in that situation. Further, we analyze how the estimates of the empirical model will be affected by the constraints on partial disclosure. We find that any such limitation leads to a loss of point identification in the model of interest. In other words, we find that there is a clear-cut trade-off between the restrictions imposed on partial disclosure and the point identification of the model using individual-level data. Our analysis combines ideas from the data mining literature with those from the literature on statistical disclosure limitations, as well as the literature on model identification with corrupted or contaminated data. We provide a new approach to model identification from combined data sets as a limit in a sequence of statistical experiments. A situation when the chosen data combination procedure provides a link between at least one data entry in the data set with sensitive information (such as consumer choices, medical treatment, etc.) and auxiliary individual information from another data set with a probability exceeding the selected confidence threshold presents a case of a successful linkage attack and the so-called individual disclosure.

The optimal structure of such attacks as well as the requirements in relation to the data release have been studied in the computer science literature. The structure of linkage attacks is based on the optimal record linkage results that have long been used in the analysis of databases and data mining. To some extent, these results were used in econometrics for combining data sets, as described in Ridder and Moffitt (2007). In record linkage, one provides a (possibly) probabilistic rule that can match the records from one data set with the records from the other data set in an effort to link the data entries corresponding to the same individual. In several striking examples, computer scientists have shown that the simple removal of personal information such as names and Social Security numbers does not protect the data from individual disclosure. Sweeney (2002b) identified the medical records of William Weld, then governor of Massachusetts, by linking voter registration records to "anonymized" Massachusetts Group Insurance Commission (GIC) medical encounter data, which retained the birth date, sex, and zip code of the patient. Recent "depersonalized" data released for the Netflix prize challenge turned out to lead to a substantial privacy breach. As shown in Narayanan and Shmatikov (2008), using auxiliary information one can detect the identities of several Netflix users from the movie selection information and other data stored by Netflix. Modern medical databases pose even larger threats of individual disclosure. A dramatic example of a large individual-level database is the data from genome-wide association studies (GWAS). The GWAS are devoted to an in-depth analysis of genetic origins of human health conditions and receptiveness to diseases, among other things. A common practice of such studies was to publish the data on the minor allele frequencies. The analysis of such data allows researchers to demonstrate the evidence of a genetic origin of the studied condition. However, there is a publicly available single nucleotide polymorphism (SNP) data set from the HapMap NIH project that consists of SNP data from four populations with about sixty individuals each. Homer et al. (2008) demonstrated that they could infer the presence of an individual with a known genotype in a mix of DNA samples from the reported averages of the minor allele frequencies using the HapMap data. To create the privacy breach, one can take an individual DNA sequence and then compare the nucleotide sequence of this individual with the reported averages of minor allele frequencies in the HapMap population and in the studied subsample. Provided that the entire list of reported allele frequencies can be very long, individual disclosure may occur with an extremely high probability. As a result, if a particular study is devoted to the analysis of a particular health condition or a disease, the discovery that a particular individual belongs to the studied subsample means that this individual has that condition or that disease.
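A stylized version of such a linkage attack, in the spirit of the quasi-identifier join just described (birth date, sex, zip code), can be sketched in a few lines. All records, names, and column labels below are synthetic and purely illustrative; the point is only that re-identification reduces to an ordinary database join.

```python
import pandas as pd

# "Anonymized" medical encounters: direct identifiers removed, quasi-identifiers kept.
encounters = pd.DataFrame({
    "birth_date": ["1946-02-11", "1951-09-30", "1963-05-22"],
    "sex":        ["M", "F", "M"],
    "zip":        ["02138", "02139", "02142"],
    "diagnosis":  ["hypertension", "diabetes", "asthma"],
})
# Public voter roll: names attached to the same quasi-identifiers.
voters = pd.DataFrame({
    "name":       ["A. Adams", "B. Brown", "C. Clark", "D. Davis"],
    "birth_date": ["1946-02-11", "1951-09-30", "1963-05-22", "1963-05-22"],
    "sex":        ["M", "F", "M", "M"],
    "zip":        ["02138", "02139", "02142", "02142"],
})

# The linkage attack is just a join on the shared quasi-identifiers.
linked = encounters.merge(voters, on=["birth_date", "sex", "zip"], how="left")

# A record is re-identified when its quasi-identifier combination matches exactly one
# voter; combinations shared by several voters remain ambiguous.
n_candidates = linked.groupby(["birth_date", "sex", "zip"])["name"].transform("count")
print(linked.assign(unique_match=n_candidates == 1))
```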

Samarati and Sweeney (1998), Sweeney (2002a, 2002b), LeFevre, DeWitt, and Ramakrishnan (2005), Aggarwal et al. (2005), LeFevre, DeWitt, and Ramakrishnan (2006), and Ciriani et al. (2007) developed and implemented the so-called k-anonymity approach to address the threats of linkage attacks. Intuitively, a database provides k-anonymity, for some number k, if every way of singling an individual out of the database returns records for at least k individuals. In other words, anyone whose information is stored in the database can be "confused" with k others. Several operational prototypes for maintaining k-anonymity have been offered for practical use. The data combination procedure will then respect the required bound on the individual disclosure (disclosure of identities) risk if it only uses the links with at least k possible matches. A different solution has been offered in the literature on synthetic data. Duncan and Lambert (1986), Duncan and Mukherjee (1991), Duncan and Pearson (1991), Fienberg (1994, 2001), Duncan et al. (2001), and Abowd and Woodcock (2001) show that synthetic data may be a useful tool in the analysis of particular distributional properties of the data such as tabulations, while guaranteeing a certain value for the measure of the individual disclosure risk (for instance, the probability of "singling out" some proportion of the population from the data). An interesting feature of the synthetic data is that they can be robust against stronger requirements for the risk of disclosure. Dwork and Nissim (2004) and Dwork (2006) introduced the notion of differential privacy, which provides a probabilistic disclosure risk guarantee against the privacy breach associated with an arbitrary auxiliary data set. Abowd and Vilhuber (2008) demonstrate a striking result that the release of synthetic data is robust to differential privacy. As a result, one can use the synthetic data to enforce the constraints on the risk of disclosure by replacing the actual consumer data with synthetic consumer data for a combination with an auxiliary individual data source. In our chapter we focus on the threat of partial disclosure. Partial disclosure occurs if the released information, such as statistical estimates obtained from the combined data sample, reveals with high enough probability some sensitive characteristics of a group of individuals. We provide a formal definition of partial disclosure and show that generally one can control the risk of this disclosure, so the bounds on the partial disclosure risk are practically enforceable. Although our identification approach is new, to understand the impact of the bounds on the individual disclosure risk we use ideas from the literature on partial identification of models with contaminated or corrupted data. Manski (2003), Horowitz et al. (2003), Horowitz and Manski (2006), and Magnac and Maurin (2008) have shown that many data modifications, such as top-coding, suppression of attributes, and stratification, lead to the loss of point identification of parameters of interest.
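The k-anonymity requirement discussed above is straightforward to audit in practice. The following minimal sketch (synthetic records, illustrative column names) checks whether a table is k-anonymous with respect to a set of quasi-identifiers and raises k by generalizing an attribute.

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Smallest number of records sharing any one combination of quasi-identifiers."""
    return int(df.groupby(quasi_identifiers).size().min())

# Synthetic example with made-up values.
df = pd.DataFrame({
    "zip": ["02138", "02138", "02139", "02139", "02139"],
    "sex": ["F", "F", "M", "M", "M"],
    "age": [34, 36, 41, 43, 45],
})

print(k_anonymity(df, ["zip", "sex", "age"]))        # 1: exact ages single people out

# A common remedy is to generalize attributes, e.g., coarsen age into decade bands.
df["age_band"] = (df["age"] // 10) * 10
print(k_anonymity(df, ["zip", "sex", "age_band"]))   # 2: every cell now has >= 2 records
```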

Consideration of the general setup in Molinari (2008) allows one to assess the impact of some data anonymization as a general misclassification problem. In this chapter we find the approach to the identification of the parameters of interest by constructing sets compatible with the chosen data combination procedure extremely useful. As we show in this chapter, the sizes of such identified sets for the propensity scores and the average treatment effect are directly proportional to the pessimistic measure of the partial disclosure risk. This is a powerful result that essentially states that there is a direct conflict between the informativeness of the data used in the consumer behavioral model and the security of individual data. An increase in the complexity and nonlinearity of the model can further worsen the trade-off. In the chapter we associate the ability of a third party to recover sensitive information about consumers from the reported statistical estimates based on the combined data with the risk of partial disclosure. We argue that the estimated model may itself be disclosive. As a result, if this model is used to make (observable) policy decisions, some confidential information about consumers may become discoverable. Existing real-world examples of linkage attacks on consumer data using observable firm policies have been constructed for online advertising. In particular, Korolova (2010) gives examples of privacy breaches through micro ad targeting on Facebook.com. Facebook does not give advertisers direct access to user data. Instead, the advertiser interface allows them to create targeted advertising campaigns with a very granular set of targets. In other words, one can create a set of targets that will isolate a very small group of Facebook users (based on the location, friends, and likes). Korolova shows that certain users may be perfectly isolated from other users with a particularly detailed list of targets. Then, one can recover the "hidden" consumer attributes, such as age or sexual orientation, by constructing differential advertising campaigns such that a different version of the ad will be shown to the user depending on the value of the private attribute. The advertiser's tools then allow the advertiser to observe which version of the ad was shown to the Facebook user. When a company "customizes" its policy regarding individual users, for example, when a PPO gives its customers personalized recommendations regarding their daily routines and exercise, or hospitals reassign specialty doctors based on the number of patients in need of specific procedures, then the observed policy results may disclose individual information. In other words, the disclosure may occur even when the company had no intention of disclosing customer information. Security of individual data is not synonymous with privacy, as privacy may have subjective value for consumers (see Acquisti [2004]). Privacy is a complicated concept that frequently cannot be expressed as a formal guarantee against intruders' attacks. Considering personal information as a "good" valued by consumers leads to important insights in the economics of privacy. As seen in Varian (2009), this approach allowed researchers to analyze the release of private data in the context of the trade-off between the network effects created by the data release and the utility loss associated with this release.

The network effect can be associated with the loss of competitive advantage of the owner of personal data, as discussed in Taylor (2004), Acquisti and Varian (2005), and Calzolari and Pavan (2006). Consider the setting where firms obtain a comparative advantage due to the possibility of offering prices that are based on past consumer behavior. Here, the subjective individual perception of privacy is important. This is clearly shown both in the lab experiments in Gross and Acquisti (2005) and Acquisti and Grossklags (2008), and in the real-world environments in Acquisti, Friedman, and Telang (2006), Miller and Tucker (2009), and Goldfarb and Tucker (2010). Given all these findings, we believe that disclosure protection plays a central role in the privacy discourse, as privacy protection is impossible without data protection. The rest of the chapter is organized as follows. Section 10.2 describes the analyzed treatment effects models and the availability of the data, and gives a description of the data combination procedures employed in the chapter. Section 10.3 provides a notion of the identified values compatible with the data combination procedure for the propensity score and the average treatment effect. It looks at the properties of these values as the sizes of the available data sets go to infinity. Section 10.4 introduces formal notions of partial disclosure and partial disclosure guarantees. It discusses the trade-off between the point identification of the true model parameters and partial disclosure limitations. Section 10.5 provides an empirical illustration.

10.2 Model Setup

In many practical settings the treatment status of an individual in the analyzed sample is a very sensitive piece of information, much more sensitive than the treatment outcome and/or the individual's demographics. For instance, in the evaluation of the effect of a particular drug, one may be concerned with the interference of this drug with other medications. Many anti-inflammatory medications may interfere with standard HIV treatments. To determine the effect of the interference one would evaluate how the HIV treatment status influences the effect of the studied anti-inflammatory drug. The fact that a particular person participates in the study of the anti-inflammatory drug does not necessarily present a very sensitive piece of information. However, the information that a particular person receives HIV treatment medications may be damaging. We consider the problem of estimating the propensity score and the average treatment effect in cases when the treatment status is a sensitive (and potentially harmful) piece of information. Suppose that the response of an individual to the treatment is characterized by two potential outcomes Y1, Y0 ∈ 𝒴 ⊂ ℝ, and the treatment status is characterized by D ∈ {0, 1}. Outcome Y1 corresponds to the individuals receiving the treatment and Y0 corresponds to the nontreated individuals.

Each individual is also characterized by the vector of individual-specific covariates X ∈ 𝒳 ⊂ ℝ^p such as the demographic characteristics, income, and location. Individuals are also described by vectors V and W containing a combination of real-valued and string-valued variables (such as Social Security numbers, names, addresses, etc.) that identify the individual but do not interfere with the treatment outcome. The realizations of V belong to the product space 𝒱 = 𝒱* × ℝ^{d_v}, where 𝒱* is a finite space of arbitrary (nonnumeric) nature. 𝒱*, for instance, may be the space of combinations of all human names and dates of birth (where we impose some "reasonable" bound on the length of the name, e.g., thirty characters). The string combination {'John', 'Smith', '01/01/1990'} is an example of a point in this space. Each string in this combination can be converted into the digital binary format. Then the countability and finiteness of the space 𝒱* will follow from the countability of the set of all binary numbers of fixed length. We also assume that the space is endowed with a distance. There are numerous examples of definitions of a distance over strings (e.g., see Wilson et al. 2006). We can then define the norm in 𝒱* as the distance between a given point in 𝒱* and a "generic" point corresponding to the most commonly observed set of attributes. We define the norm in 𝒱 as the weighted sum of the defined norm in 𝒱* and the standard Euclidean norm in ℝ^{d_v}, and denote it ‖·‖_v. Similarly, we assume that W takes values in 𝒲 = 𝒲** × ℝ^{d_w}, where 𝒲** is also a finite space. The norm in 𝒲 is defined as a weighted norm and denoted ‖·‖_w. Spaces 𝒱* and 𝒲** may have common subspaces. For instance, they both may contain the first names of individuals. However, we do not require that such common elements indeed exist. Random variables V and W are then defined by the probability space with a σ-finite probability measure defined on Borel subsets of 𝒱 and 𝒲. We assume that the data-generating process creates N_y i.i.d. draws from the joint distribution of the random vector (Y, D, X, V, W). These draws form the (infeasible) "master" sample {y_i, d_i, x_i, ν_i, w_i}_{i=1}^{N_y}. However, because either all the variables in this vector are not collected simultaneously or some of the variables are intentionally deleted, the data on the treatment status (treatment outcome) and individual-specific covariates are not contained in the same sample. One sample, the i.i.d. sample {x_i, ν_i}_{i=1}^{N_y} containing N_y observations, is in the public domain. In other words, individual researchers or research organizations can get access to this data set. The second data set is a subset of N ≤ N_y observations from the "master" data set and contains information regarding the treatment-related variables {y_j, d_j, w_j}_{j=1}^{N}.2
2. Our analysis applies to other frameworks of split data sets. For instance, we could consider the case when x and y are contained in the same data subset, while d is observed only in the other data subset. We could also consider cases when some of the variables in x (but not all of them) are observed together with d. This is the situation we deal with in our empirical illustration. The important requirement in our analysis is that some of the relevant variables in x are not observed together with d.
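As a minimal illustration of this data structure, the sketch below simulates a master sample and then splits it into the two files the researcher actually observes. All sizes, distributions, and variable names are made up, and a single string attribute stands in for the identifying vectors ν_i and w_j.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N_y, N = 1000, 600   # public sample size and private subset size, N <= N_y (illustrative)

# Infeasible "master" sample {y_i, d_i, x_i, nu_i, w_i}: one continuous covariate,
# a string identifier, a binary treatment, and an outcome.
x = rng.normal(size=N_y)
name = [f"person_{i:04d}" for i in range(N_y)]
p = 1.0 / (1.0 + np.exp(-x))                 # propensity score P(x)
d = rng.binomial(1, p)
y = 1.0 * d + x + rng.normal(size=N_y)       # treatment effect of 1.0 by construction

master = pd.DataFrame({"name": name, "x": x, "d": d, "y": y})

# What is actually available: a public file with covariates and identifiers, and a
# private file with the treatment-related variables, with no shared numeric key.
public = master[["name", "x"]]
private = master.loc[rng.choice(N_y, size=N, replace=False), ["name", "d", "y"]]
```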

data set is private in the sense that it is only available to the data curator (e.g., the hospital network) and cannot be acquired by external researchers or general public. We consider the case when, even for the data curator, there is no direct link between the private and the public data sets. In other words, the variables in i and w j do not provide immediate links between the two data sets. In our example of the HIV treatment status, we could consider cases where the data on the HIV treatment (or testing) are partially or fully anonymized (due to the requests by the patients) and there are only very few data attributes that allow the data curator to link the two data sets. We impose the following assumptions on the elements of the model: Assumption 1. (a) The treatment outcomes satisfy the conditional unconfoundedness, that is, (Y1,Y0) ⊥ D|X = x. (b) At least one element of X has a continuous distribution with the density strictly positive on its support. We consider the propensity score P(x) = E [D|X = x] and suppose that for some specified 0 <  < 1 the knowledge that the propensity score exceeds (1 − )—that is, P(x) > 1 − , constitutes sensitive information. The next assumption states that there is a part of the population with the propensity score above the sensitivity threshold. Assumption 2. Pr (x : P(x) > 1 − ) > 0. P will denote the average propensity score over the distribution of all individuals: P = E [P(x)]. We leave distributions of potential outcomes Y1 and Y0 conditional on X nonparametric with the observed outcome determined by Y = DY1 + (1 − D)Y0 . In addition to the propensity score, we are interested in the value of the conditional average treatment effect tATE (x) = E [Y1 − Y0 |X = x], or the average treatment effect conditional on individuals in a group described by some set of covariates0 :

t_ATE(𝒳0) = E[Y1 − Y0 | X ∈ 𝒳0],

as well as the overall average treatment effect (ATE)

t_ATE = E[Y1 − Y0]. In this chapter we focus on the propensity score and the overall average treatment effect. The evaluation of the propensity score and the treatment effects requires us to observe the treatment status and the outcome together with the covariates. A consistent estimator for the average treatment effect t_ATE could then be constructed by, first, evaluating the propensity score and then estimating the overall effect via propensity score weighting:

(2.1)   t_ATE = E[ DY / P(X) − (1 − D)Y / (1 − P(X)) ].
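A minimal sketch of the weighting estimator in equation (2.1) is given below for simulated data in which the two files are assumed to be already combined; the logistic propensity model, the sample size, and the effect size of 2.0 are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=(n, 1))
p_true = 1 / (1 + np.exp(-x[:, 0]))          # true propensity score P(x)
d = rng.binomial(1, p_true)
y = 2.0 * d + x[:, 0] + rng.normal(size=n)   # constant treatment effect of 2.0

# Step 1: estimate the propensity score P(x) = E[D | X = x].
p_hat = LogisticRegression().fit(x, d).predict_proba(x)[:, 1]

# Step 2: propensity-score weighting, the sample analogue of equation (2.1).
t_ate = np.mean(d * y / p_hat - (1 - d) * y / (1 - p_hat))
print(round(t_ate, 2))                       # close to 2.0
```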

In our case, however, the treatment and its outcome are not observed together with the covariates. To deal with this challenge, we will use the information contained in the identifying vectors V and W to connect the information from the two split data sets and provide estimates for the propensity score and the ATE. Provided that the data curator is interested in correctly estimating the treatment effect (to further use the findings to make potentially observable policy decisions, for example, by putting a warning label on the package of the studied drug), we assume that she will construct the linkage procedure that will correctly combine the two data sets with high probability. We consider a two-step procedure that first uses the similarity of information contained in the identifiers and covariates to provide the links between the two data sets. Then, the effect of interest will be estimated from the reconstructed joint data set. To establish similarity between the two data sets, the researcher constructs vector-valued variables that exploit the numerical and string information contained in the variables. We assume that the researcher constructs variables Z^d = Z^d(D, Y, W) and Z^x = Z^x(X, V) (individual identifiers) that both belong to the space 𝒵 = 𝒵* × ℝ^{d_z}. The space 𝒵* is a finite set of arbitrary nature, such as a set of strings, corresponding to the string information contained in 𝒱* and 𝒲**. We choose a distance in 𝒵* constructed using one of the commonly used distances defined on strings, d_σ(·,·). Then the distance in 𝒵 is defined as a weighted combination of d_σ and the standard Euclidean distance,

d_z(Z^x, Z^d) = (λ_s d_σ(z_s^x, z_s^d)^2 + λ_z ‖z_z^x − z_z^d‖^2)^{1/2},

where Z^x = (z_s^x, z_z^x), Z^d = (z_s^d, z_z^d), and λ_s, λ_z > 0. Then we define the "null" element ζ_0 in 𝒵 as the observed set of attributes that shares the largest number of components with the other observed sets of attributes. The norm in 𝒵 is then defined as the distance from the null element:

‖Z‖_z = (λ_s d_σ(z_s, ζ_0)^2 + λ_z ‖z_z‖^2)^{1/2}.

The construction of the variables Z^d and Z^x may exploit the fact that W and V can contain overlapping components, such as individuals' first names and dates of birth. Then the corresponding components of the identifiers can be set equal to those characteristics.
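A minimal sketch of how such a combined distance might be computed in practice is given below; difflib's similarity ratio stands in for the unspecified string distance d_σ, the weights λ_s and λ_z are illustrative, and the example identifiers are made up.

```python
from difflib import SequenceMatcher
import numpy as np

def string_distance(a: str, b: str) -> float:
    """A simple off-the-shelf string dissimilarity in [0, 1] (1 minus similarity ratio)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def identifier_distance(z_x, z_d, lam_s=1.0, lam_z=1.0):
    """Weighted combination of a string distance and a Euclidean distance,
    in the spirit of d_z(Z^x, Z^d) above; the weights are illustrative."""
    s_x, num_x = z_x
    s_d, num_d = z_d
    d_str = string_distance(s_x, s_d)
    d_num = np.linalg.norm(np.asarray(num_x, float) - np.asarray(num_d, float))
    return np.sqrt(lam_s * d_str**2 + lam_z * d_num**2)

# Identifiers built from the public file (name, year of birth) and the private file.
z_x = ("ivanov petr", [1971.0])
z_d = ("ivanov pyotr", [1971.0])
z_other = ("smirnova anna", [1985.0])

print(identifier_distance(z_x, z_d))      # small: plausibly the same individual
print(identifier_distance(z_x, z_other))  # large: clearly different records
```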

However, the identifiers may also include a more remote similarity of the individual characteristics. For instance, V may contain the name of an individual and W may contain the race (but not contain the name). Then we can make one component of Z^d take values from 0 to 4 corresponding to the individual in the private data set either having the race not recorded, or being black, white, Hispanic, or Asian. Then, using the public data set we can construct a component of Z^x that will correspond to the guess regarding the race of an individual based on his name. This guess can be based on some simple classification rule, for example, whether the individual's name belongs to the list of top 500 Hispanic names in the US Census or if the name belongs to the top 500 names in a country that is dominated by a particular nationality. This classifier, for instance, will classify the name "Vladimir Putin" as the name of a white individual, giving Z^x value 2, and it will classify the name "Kim Jong Il" as the name of an Asian individual, giving Z^x value 4. When the set of numeric and string characteristics used for combining two data sets is sufficiently large or it contains some potentially "hard to replicate" information such as the individual's full name, then if such a match occurs it very likely singles out the data of one person. We formalize this idea by expecting that if the identifiers take infrequent values (we model this situation as the case of identifiers having large norms), then the fact that the values of Z^d and Z^x are close implies that with high probability the two corresponding observations belong to the same individual. This probability is higher the more infrequent are the values of Z^d and Z^x. Our maintained assumptions regarding the distributions of the constructed identifiers are listed below.
Assumption 3. We fix some ε̄ ∈ (0, 1) such that for any ε ∈ (0, ε̄):
(a) (Proximity of identifiers)

Pr(d_z(Z^x, Z^d) < ε | X = x, D = d, Y = y, ‖Z^d‖_z > 1/ε) ≥ 1 − ε.

(b) (Nonzero probability of extreme values)

lim_{ε→0} Pr(‖Z^d‖_z > 1/ε | D = d, Y = y) / φ(ε) = 1,
lim_{ε→0} Pr(‖Z^x‖_z > 1/ε | X = x) / ψ(ε) = 1,

for some nondecreasing and positive functions φ(·) and ψ(·).
(c) (Redundancy of identifiers in the combined data) There exists a sufficiently large M such that for all ‖Z^d‖_z ≥ M and all ‖Z^x‖_z ≥ M,

f(Y | D = d, X = x, Z^d = z^d, Z^x = z^x) = f(Y | D = d, X = x).

Assumption 3(a) reflects the idea that more reliable matches are provided by the pairs of identifiers whose values are infrequent. In other words, if, for example, in both public and private data sets collected in Durham, NC, we found observations with an attribute "Denis Nekipelov," we expect them to belong to the same individual with a higher probability than if we found two attribute values "Jane Doe."

Thus, the treatment status can be recovered more reliably for more unique individuals. We emphasize that infrequency of a particular identifier does not mean that the corresponding observation is an "outlier." In fact, if both public and private data sets contain very detailed individual information such as a combination of the full name and the address, most attribute values will be unique. Assumption 3(b) requires that there are a sufficient number of observations with infrequent attribute values. This fact can actually be established empirically in each of the observed subsets and, thus, this assumption is testable. Assumption 3(c) is the most important one for identification purposes. It implies that, even for the extreme values of the identifiers and the observed covariates, the identifiers only serve the purpose of data labels as soon as the "master" data set is recovered. There are two distinct arguments that allow us to use this assumption. First, in cases where the identifiers are high dimensional, infrequent attribute combinations do not have to correspond to unusual values of the variables. If both data sets contain, for instance, first and last names along with the dates of birth and the last four digits of the Social Security number of individuals, then a particular combination of all attributes can be extremely rare, even for individuals with common names. Second, even if the identifiers can contain model-relevant information (e.g., we expect the restaurant choice of an individual labeled as "Vladimir Putin" to be different than the choice of an individual labeled as "Kim Jong Il"), we expect that information to be absorbed in the covariates. In other words, if the gender and the nationality of an individual are relevant for the model, then we include that information in the covariates. We continue our analysis with the discussion of identification of the model from the combined data set. In the remainder of the chapter we suppose that Assumptions 1‒3 hold.

10.3 Identification of the Treatment Effect from the Combined Data

Provided that the variables are not contained in the same data set, the identification of the treatment effect parameter becomes impossible without having some approximation to the distribution of the data in the master sample. A way to link the observations in two data sets is to use the identifiers that we described in the previous section. The identifiers, on the other hand, are individual-level variables. Even though the data-generating process is characterized by the distribution over strings, such as names, we only recover the master data set correctly if we link the data of one concrete "John Smith" in the two data sets.

This means that the data combination is intrinsically a finite sample procedure. We represent the data combination procedure by the deterministic data combination rule τ_N that for each pair of identifiers z_j^d and z_i^x returns a binary outcome M_ij = τ_N(z_i^x, z_j^d), which labels two observations as a match (M_ij = 1) if we think they belong to the same individual, and labels them as a nonmatch (M_ij = 0) if we think that the observations are unlikely to belong to the same individual or are simply uncertain about this. Although we can potentially consider many nonlinear data combination rules, in this chapter we focus on the set of data combination rules that are generated by our Assumption 3(a). In particular, for some prespecified ε̄ ∈ (0, 1) we consider the data combination rule

τ_N = 1{d_z(z_i^x, z_j^d) < γ_N, ‖z_i^x‖_z > 1/γ_N},

generated by a Cauchy sequence γ_N such that 0 < γ_N < ε̄ and lim_{N→∞} γ_N = 0. The goal of this sequence is to construct the set of thresholds that would isolate in the limit all of the infrequent observations. To guarantee that, such a sequence would have to satisfy the following two conditions. For infrequent observations, the probability of the correct match would be approaching one, as the probability of observing two identifiers taking very close values for two different individuals would be very small (proportional to the square of the probability of observing the infrequent attribute values). On the other hand, the conditional probability that the values of identifiers are close for a particular individual with infrequent values of the attributes would be of a larger order of magnitude (proportional to the probability of observing the attribute value). Thus, an appropriately scaled sequence of thresholds would be able to single out correct matches. Let m_ij be the indicator of the event that observation i from the public data set and observation j from the private data set belong to the same individual. Given that we can make incorrect matches, M_ij is not necessarily equal to m_ij. However, we would want these two variables to be highly correlated, meaning that the data combination procedure that we use is good. With our data combination procedure we will form the reconstructed master data set by taking the pairs of all observations from the public and the private data sets that we indicated as matches (M_ij = 1) and discarding all other observations. We can consider more complicated rules for reconstructing the master sample. In particular, we can create multiple copies of the master sample by varying the threshold γ_N and then combine the information from those samples by downweighting the data sets that were constructed with higher threshold values.
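The logic behind restricting matches to infrequent identifier values can be illustrated with a small self-contained simulation. Everything below is an illustrative assumption (the heavy-tailed attribute standing in for the identifier norm, the noise level, the threshold γ_N = 0.02); the only point is that, once attention is restricted to identifiers with norm above 1/γ_N, essentially all declared matches link the correct individual, whereas common attribute values would generate many false links.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
gamma_n = 0.02   # illustrative threshold; the rule keeps identifiers with norm > 1/gamma_n = 50

# Toy identifiers: a heavy-tailed continuous attribute, so large values are infrequent.
# The same individuals appear in both files in the same (unknown) order, with small
# measurement noise between the two copies.
attr = rng.pareto(1.5, size=n) + 1.0
z_public = attr
z_private = attr + rng.normal(scale=0.005, size=n)

rare = z_public > 1.0 / gamma_n            # the "large norm" part of the rule
matches = []
for i in np.flatnonzero(rare):
    close = np.flatnonzero(np.abs(z_private - z_public[i]) < gamma_n)
    matches += [(i, j) for j in close]     # declare a match when identifiers are close

correct = sum(i == j for i, j in matches)
print(len(matches), correct)               # nearly every declared match links the right person
```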

292

Tatiana Komarova, Denis Nekipelov, and Evgeny Yakovlev

observations that are identified as matches by the decision rule $\mathcal{M}_N$. We use $f_N^{\varepsilon_N}(y_i \mid d_j, x_i, z_i^x, z_j^d)$ to denote the conditional density of the outcome distribution with the decision rule applied to samples of size $N$. Provided that the decision rule does not perfectly identify the information from the same individual, the density $f_N^{\varepsilon_N}(\cdot)$ will be a mixture of the “correct” distribution with the distribution of outcomes that were incorrectly identified as matches:
$$f_N^{\varepsilon_N}(y_j \mid d_j, x_i, z_i^x) = f_{Y|D,X}(y_j \mid d_j, x_i)\Pr\big(m_{ij} = 1 \mid \mathcal{M}_N(z_i^x, z_j^d) = 1\big) + f_{Y|D}(y_j \mid d_j)\Pr\big(m_{ij} = 0 \mid \mathcal{M}_N(z_i^x, z_j^d) = 1\big),$$



where we used the fact that identifiers are redundant once a correct match was made, as well as the fact that in the i.i.d. sample the observations have to be independent. Thus, if an incorrect match was made, the outcome should not be correlated with the treatment. By $E_N^{\varepsilon_N}[\cdot \mid d_j]$ we denote the conditional expectation with respect to the density product $f_N^{\varepsilon_N}(\cdot \mid d_j, x_i, z_i^x)\,f(x_i, z_i^x)$. We can also introduce the propensity score implied by the finite sample distribution, which we denote $P_N^{\varepsilon_N}(\cdot)$. The finite sample propensity score is characterized by the mixture distribution combining the correct propensity score and the average propensity score:
$$P_N^{\varepsilon_N}(x) = P(x)\Pr\big(m_{ij} = 1 \mid x_i = x,\ \mathcal{M}_N(z_i^x, z_j^d) = 1\big) + P\,\Pr\big(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(z_i^x, z_j^d) = 1\big).$$
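To make the mechanics of such a threshold rule concrete, the following is a minimal Python sketch; it is our own illustration, not the authors' implementation. The string-distance function, the frequency-based proxy for the norm $\|z\|_z$, and the toy records are all assumptions made for the example.

```python
import difflib
from collections import Counter

def string_distance(a, b):
    # Normalized dissimilarity in [0, 1]; stands in for the metric d_z.
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def reconstruct_master(public, private, eps):
    """Apply a threshold rule of the form 1{d_z < eps, 'infrequency' > 1/eps}
    and return the pairs declared matches (M_ij = 1); all other pairs are discarded.
    `public` holds (identifier, covariates) records and `private` holds
    (identifier, treatment/outcome) records -- hypothetical containers."""
    freq = Counter(z for z, _ in public)      # empirical identifier frequencies
    n = len(public)
    matched_pairs = []
    for z_x, x_row in public:
        infrequency = n / freq[z_x]           # crude stand-in for ||z||_z: rare values score high
        for z_d, d_row in private:
            if string_distance(z_x, z_d) < eps and infrequency > 1.0 / eps:
                matched_pairs.append((x_row, d_row))
    return matched_pairs

# Tiny illustration with made-up records: with a loose threshold the rule also
# links "jon smith" to "john smith", i.e., it admits a false match
# (M_ij = 1 while m_ij = 0), which is exactly what generates the mixture above.
public = [("john smith", {"x": 1}), ("jon smith", {"x": 2}), ("a rare name", {"x": 3})]
private = [("john smith", {"d": 1, "y": 10.0}), ("a rare name", {"d": 0, "y": 7.5})]
print(reconstruct_master(public, private, eps=0.4))
```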

We can extend our data combination method by choosing sequences $\varepsilon_N$ depending on the value of $x$. Then the value of $\Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(z_i^x, z_j^d) = 1)$ will depend on $x$ even in the limit. We allow for such situations. In fact, later in the chapter we make use of this opportunity to choose different threshold sequences for different values of $x$. To stress that we permit the threshold sequences to depend on $x$, we denote a sequence of thresholds chosen for $x$ as $\varepsilon_{N,x}$ (instead of $\varepsilon_N$).

In the beginning of this section, we indicated that estimation that requires combining the data based on the string-valued identifiers is an intrinsically finite sample procedure. As a result, we suggest analyzing identification of this model as the limit of a sequence of data combination procedures. We allow for situations when the data curator may want to use several sequences $\varepsilon_{N,x}$ for some $x$ and denote the collection of such sequences as $\mathcal{C}_{0,x}$.

Definition 1. By $\mathcal{P}_N$ we denote the set of all functions $p: \mathcal{X} \to [0,1]$ (where $\mathcal{X}$ denotes the support of $X$) that correspond to the set of finite sample propensity scores for all sequences $\varepsilon_{N,x}$ in $\mathcal{C}_{0,x}$:
$$\mathcal{P}_N = \bigcup_{\{\varepsilon_{N,x}\} \in \mathcal{C}_{0,x}} \{P_N^{\varepsilon_{N,x}}(\cdot)\}.$$
We call $\mathcal{P}_N$ the N-identified set for the propensity score compatible with the data combination procedure with a threshold decision rule. By $\mathcal{T}_N$ we denote the subset of $\mathbb{R}$ that corresponds to the set of treatment effects calculated as in equation (2.1) for all sequences $\varepsilon_{N,x}$ in $\mathcal{C}_{0,x}$, using the propensity score $P_N^{\varepsilon_{N,x}}(\cdot)$ corresponding to $\varepsilon_{N,x}$:
$$\mathcal{T}_N = \bigcup_{\{\varepsilon_{N,x}\} \in \mathcal{C}_{0,x}} E_N^{\varepsilon_{N,x}}\!\left[\frac{D_j Y_j}{P_N^{\varepsilon_{N,x}}(X_i)} - \frac{(1 - D_j)Y_j}{1 - P_N^{\varepsilon_{N,x}}(X_i)}\right].$$
We call $\mathcal{T}_N$ the N-identified set for the average treatment effect compatible with the data combination procedure with a threshold decision rule.

Definition 2 below characterizes the identified set compatible with the data combination procedure as the set of all limits of the estimated treatment effects and the propensity scores under all possible threshold sequences that are bounded and converge to zero. Because the reconstructed master sample depends on the sample size, the set of treatment effect parameters that are compatible with the data combination procedure applied to random split samples of size N will depend on N. Because the small sample distribution in the sample of size N will always be a mixture of the correct joint distribution and the marginal outcome distribution for the outcomes that are misidentified as matches, the only way to attain point identification is in the limit. Thus, we consider the concept of parameter identification in terms of the limiting behavior of the identified sets compatible with the data combination procedure constructed from the finite sample distributions as the sample size N approaches infinity.

Definition 2. (a) We call $\mathcal{P}_\infty$ the identified set for the propensity score under the threshold decision rules if $\mathcal{P}_\infty$ is the set of all partial pointwise limits of sequences of propensity score functions from the N-identified sets $\mathcal{P}_N$. That is, a function $f(\cdot) \in \mathcal{P}_\infty$ if and only if for any $x$ in the support of $X$,
$$f(x) = \lim_{N_k \to \infty} f_{N_k}(x)$$
for some $f_{N_k}(\cdot) \in \mathcal{P}_{N_k}$.
(b) Similarly, we call $\mathcal{T}_\infty$ the identified set for the average treatment effect under the threshold decision rules if $\mathcal{T}_\infty$ is the set of all partial limits of sequences of ATEs from the N-identified sets $\mathcal{T}_N$. That is, $t \in \mathcal{T}_\infty$ if
$$t = \lim_{N_k \to \infty} t_{N_k}$$
for some $t_{N_k} \in \mathcal{T}_{N_k}$.
(c) The propensity score is point identified from the combined data if $\mathcal{P}_\infty = \{P(\cdot)\}$. Otherwise, it is identified only up to a set compatible with the decision threshold rules.
(d) The average treatment effect parameter is point identified from the combined data if the identified set is a singleton $\mathcal{T}_\infty = \{t_{ATE}\}$. Otherwise, it is identified only up to a set compatible with the decision threshold rules.

Our next step is the characterization of the sets for the average treatment effect parameter and the propensity score identified under the given threshold decision rule under Assumption 3. We start our analysis with the following lemma, which follows directly from the combination of Assumptions 3(b) and (c).

Lemma 1. Under Assumption 3 the propensity score can be point identified from the observations with infrequent attribute values as follows:
$$P(x) = E\!\left[D \,\Big|\, X = x,\ d_z(Z^x, Z^d) < \varepsilon_{N,x},\ \|Z^x\|_z > \frac{1}{\varepsilon_{N,x}}\right].$$
Also, the average treatment effect can be point identified from the observations with infrequent attribute values as follows:
$$t_{ATE} = E\!\left[\frac{DY}{P(X)} - \frac{(1 - D)Y}{1 - P(X)} \,\Big|\, d_z(Z^x, Z^d) < \varepsilon_{N,x},\ \|Z^x\|_z > \frac{1}{\varepsilon_{N,x}}\right].$$

This lemma states that if we are able to correctly reconstruct the master data set only for the observations with infrequent values of the attributes, those observations are sufficient for correct identification of the components of interest. Two elements are crucial for these results. First, we need Assumption 3(c) to establish redundancy of identifiers for matches constructed for observations with infrequent values of those identifiers. Second, we need Assumption 3(b) to guarantee that there is a nonzero probability of observing individuals with those infrequent values of identifiers.

The biggest challenge in our analysis is to determine which Cauchy sequences have appropriate behavior to isolate the infrequent attribute values as $N \to \infty$ and guarantee that the probability of a mismatch, conditional on the observation being in the reconstructed master sample, approaches zero. We do so by an appropriate inversion of the probability of misidentifying a pair of observations as a match. We can provide a general result that delivers a fixed probability of a mismatch in the limiting reconstructed master sample.

Proposition 1. Suppose that for $x \in \mathcal{X}$ the chosen sequence $\{\varepsilon_{N,x}\} \in \mathcal{C}_{0,x}$ satisfies $\Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1) \to \pi(x)$ for some $\pi(x) \in [0,1]$ as $N \to \infty$. Then
$$(3.2)\qquad P_N^{\varepsilon_{N,x}}(x) = E_N^{\varepsilon_{N,x}}[D_j \mid X_i = x] \to (1 - \pi(x))P(x) + \pi(x)P,$$
and
$$(3.3)\qquad T_N^{\varepsilon_{N,x}} = E_N^{\varepsilon_{N,x}}\!\left[\frac{D_j Y_j}{P_N^{\varepsilon_{N,x}}(X_i)} - \frac{(1 - D_j)Y_j}{1 - P_N^{\varepsilon_{N,x}}(X_i)}\right] \to t_{ATE} + E\!\left[\big(E[Y_1] - E[Y \mid X, D = 1]P\big)\frac{\pi(X)}{(1 - \pi(X))P(X) + \pi(X)P}\right] - E\!\left[\big(E[Y_0] - E[Y \mid X, D = 0](1 - P)\big)\frac{\pi(X)}{1 - (1 - \pi(X))P(X) - \pi(X)P}\right].$$

Proposition 1 states that if one controls the mismatch probability in the combined data set, then the propensity score recovered through such a procedure is a convex combination of the true propensity score and the expected fraction $P$ of treated individuals. Thus, the propensity score recovered through the data combination procedure will be biased toward the expected fraction of treated individuals. Also, the resulting identified average treatment effect will be the sum of the true ATE and a nontrivial term. In other words, the presence of mismatched observations in the “limiting” reconstructed master data set biases the estimated ATE toward zero.

The formulated proposition is based on the premise that a sequence in $\mathcal{C}_{0,x}$ that leads to the limiting probability of an incorrect match equal to $\pi(x)$ exists. The proof of existence of fundamental sequences satisfying this property is given in Komarova, Nekipelov, and Yakovlev (2011). These sequences are determined from the behavior of the functions characterizing the distribution of the identifiers. The result in that paper demonstrates that for each $\pi(x) \in [0,1]$ we can find a Cauchy sequence that leads to the limiting mismatch probability equal to $\pi(x)$. Our next goal is to use one particular sequence that makes the mismatch probability approach zero in the limit.

Theorem 1. (Point identification of the propensity score and the ATE). Suppose that for each $x \in \mathcal{X}$ the chosen sequence $\{\varepsilon_{N,x}\} \in \mathcal{C}_{0,x}$ satisfies
$$\lim_{N \to \infty} \Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1) = 0.$$
Then $P_N^{\varepsilon_{N,x}}(\cdot) \to P(\cdot)$ pointwise everywhere on $\mathcal{X}$ and $T_N^{\varepsilon_{N,x}} \to t_{ATE}$ as $N \to \infty$. In other words, the propensity score and the treatment effect are point identified.
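As a purely illustrative calculation with numbers of our own choosing (they do not come from the chapter): suppose the true propensity score for some cell is $P(x) = 0.9$, the population share of treated individuals is $P = 0.4$, and the chosen threshold sequence yields a limiting mismatch probability $\pi(x) = 0.25$. Then the released score converges to $(1 - 0.25)\cdot 0.9 + 0.25\cdot 0.4 = 0.675 + 0.10 = 0.775$, pulled from 0.9 toward the population average 0.4; driving $\pi(x)$ to zero recovers $P(x) = 0.9$ exactly, which is the content of Theorem 1.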


10.4 Inference of the Propensity Score and the Average Treatment Effect with Limited Partial Disclosure

The calculations of the propensity score and the treatment effect require the data curator to have a technique that combines the two data sets using the available observation-identifying information. Our approach to data combination described above is based on constructing the threshold decision rule that identifies observations as “a match” corresponding to the data on a single individual if the observed individual attributes are close in terms of the chosen distance. With this approach we can construct sequences of thresholds that lead to very high probabilities of correct matches for a part of the population, which allows us to point identify the propensity score and the treatment-effect parameter. If we provide a high-quality match, then we have a reliable link between the public information regarding the individual and this individual’s treatment status. The release of the reconstructed master data set would then constitute an evident threat to the individual’s privacy. However, even if the reconstructed master data set is not public, the release of the estimated propensity score and/or the value of the treatment effect itself may pose a direct threat to the security of individual data. To measure the risk of such a disclosure in possible linkage attacks, we use a measure based on the notion of disclosure in Lambert (1993). We provide a formal definition for this measure.

Partial disclosure can occur if the released information that was obtained from the data may potentially reveal some sensitive characteristics of an individual. In our case, the information we are concerned with is the propensity score and the treatment effect. In particular, the sensitive characteristic of an individual is her treatment status, or how likely an individual with given characteristics is to receive the treatment. Below we provide a formal definition of the risk of partial disclosure for the propensity score. The definition takes as given the following two parameters. One parameter is $1 - \tau$, and it characterizes the sensitivity level of the information about the propensity score. Namely, the information that the propensity score of an individual is above $1 - \tau$ is considered to be damaging. The other parameter is denoted $\delta$ and represents a tolerance level; specifically, $\delta$ is the upper bound on the proportion of individuals for whom the damaging information that $P(x) > 1 - \tau$ may be revealed.

Another important component of our definition of partial disclosure is how much information about the data combination procedure is revealed to the public by the data curator. We denote this information as $\mathcal{I}$. For instance, if the data curator reveals that $\Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1) \to \pi(x)$ for some $\pi(x)$, then the public can determine that in the limit the released propensity score for an individual with characteristics $x$ has the form $(1 - \pi(x))P(x) + \pi(x)P$. If, in addition, the data curator releases the value of
$\Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1)$ or the value of $\pi(x)$, then the public can pin down the true propensity score $P(x)$3 and, thus, obtain potentially damaging information if this propensity score is above $1 - \tau$.

Definition 3. Let $\mathcal{I}$ be the information about the data combination procedure released to the public by the data curator. Let $\tau \in (0,1)$ and $\delta \in [0,1]$. Given $\mathcal{I}$, we say that a $(1 - \tau, \delta)$ bound guarantee is given for the risk of partial disclosure if the proportion of individuals in the private data set for whom the public can determine with certainty that $P(x) > 1 - \tau$ does not exceed $\delta$. The value of $\delta$ is called the bound on the risk of partial disclosure.

Setting $\delta = 0$ means that we want to protect all the individuals in the private data set. The idea behind our definition of partial disclosure is that one can use the released values of $P_N^{\varepsilon_{N,x}}$ (or $\lim_{N\to\infty} P_N^{\varepsilon_{N,x}}$) from the model to determine whether the probability of the positive treatment status exceeds the given threshold. If this can be determined with a high confidence level for some individual, then this individual is identified as one with “the high risk” of the positive treatment status. Such information can be extremely damaging. In the following theorem we demonstrate that the release of the true propensity score is not compatible with a low disclosure risk.

Theorem 2. Suppose that
$$(4.4)\qquad \lim_{N \to \infty} \Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1) = 0 \quad \text{for } x \in \mathcal{X}.$$
If the data curator releases information (4.4), then for sufficiently large $N$ the release of the propensity score $P_N^{\varepsilon_{N,x}}$ (or its limit) is not compatible with the bound $\delta$ on the risk of partial disclosure for sufficiently small $\delta$.

The formal result of Theorem 2 relies on Assumption 2 and Theorem 1, and is based on two elements. First, using the threshold decision rule we were able to construct a sequence of combined data sets where the finite-sample distribution of covariates approaches the true distribution. Second, from the estimated distribution we could improve our knowledge of the treatment status of individuals in the data. For some individuals the probability of the positive treatment status may be very high.

This result forces us to think about ways to avoid situations where potentially very sensitive information may be learned about some individuals. The bound guarantee on the risk of partial disclosure essentially requires the data curator to keep a given proportion of incorrect matches in data sets of any size. As discussed in Proposition 1, a fixed proportion of incorrect matches causes the calculated propensity score to be biased toward the proportion of treated individuals in the population, and also causes bias in the average treatment effect.

3. Note that the value P is known from the public data set.
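As a small illustration of Definition 3 and Theorem 2 (our own sketch, with hypothetical inputs): if the curator announces that the matching error vanishes, the released limiting score equals $P(x)$, so checking the guarantee amounts to counting how many released scores exceed $1 - \tau$.

```python
import numpy as np

def violates_guarantee(released_scores, tau, delta):
    """Check a (1 - tau, delta) bound on the risk of partial disclosure under the
    announcement that matching is asymptotically exact, so that the released
    limiting score equals the true P(x).  Purely illustrative helper."""
    disclosed_share = np.mean(np.asarray(released_scores) > 1.0 - tau)
    return disclosed_share > delta

# Two of the four hypothetical scores exceed 0.95, so a (0.95, 0) guarantee fails.
print(violates_guarantee([0.20, 0.50, 0.97, 0.99], tau=0.05, delta=0.0))  # True
```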


Theorem 3. Suppose the value of $P$ is publicly available, and $P < 1 - \tau$. A $(1 - \tau, 0)$ bound guarantee for the risk of partial disclosure can be achieved if the data curator chooses $\varepsilon_{N,x}$ in such a way that
$$\pi(x) = \lim_{N \to \infty} \Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1) > 0 \quad \text{for all } x \in \mathcal{X},$$
and for individuals with $P(x) > 1 - \tau$ the value of $\pi(x)$ is chosen large enough to guarantee that
$$\lim_{N \to \infty} P_N^{\varepsilon_{N,x}} = (1 - \pi(x))P(x) + \pi(x)P < 1 - \tau.$$

We assume that the data curator provides information that the data were matched with an error and the matching error does not approach 0 as $N \to \infty$, but does not provide the values of $\Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1)$ or $\pi(x)$. In this case, the behavior of the released propensity score and the treatment effect is as described in equations (3.2) and (3.3), and thus the true propensity score and the true treatment effect are not identified. Note that in the framework of Theorem 3, for individuals with small $P(x)$ the data curator may want to choose a very small $\pi(x) > 0$, whereas for individuals with large $P(x)$ the bias toward $P$ has to be large enough.

Remark 1. Continue to assume that $P < 1 - \tau$. Note that if the released propensity score for an individual with $x$ is strictly less than $P$, then the public will be able to conclude that the true propensity score for this individual is strictly less than $P$. If the released propensity score for an individual with $x$ is strictly greater than $P$, then the public will be able to conclude that the true propensity score for this individual is strictly greater than $P$ but, under the conditions of Theorem 3, will not know whether $P(x) > 1 - \tau$. If the released propensity score for an individual with $x$ is equal to $P$, then the public is unable to make any nontrivial conclusions about $P(x)$—that is, $P(x)$ can be any value in $[0,1]$.

We can consider other approaches the data curator may exploit regarding the release of the propensity score values and the information provided with this release. For instance, for some individuals with $P(x) < 1 - \tau$ she may choose $\pi(x) = 0$ and provide information that for some individuals the data were matched without an error in the limit, but for the other individuals the matching error is strictly positive and does not approach 0 as $N \to \infty$ (given that she does not specify the values of $\Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1)$ or $\pi(x)$). In this case, the result of Theorem 3 continues to hold.

The next theorem gives a result on privacy protection when the data curator releases more information.

Theorem 4. Suppose the value of $P$ is publicly available, and $P < 1 - \tau$. A $(1 - \tau, 0)$ bound guarantee for the risk of partial disclosure can be achieved if the data curator chooses $\varepsilon_{N,x}$ in such a way that
$$\Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1) \ge \lambda \quad \text{for all } x \in \mathcal{X}$$
for all $N$, and for individuals with $P(x) > 1 - \tau$ the value of $\Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1)$ is chosen large enough to guarantee that
$$P_N^{\varepsilon_{N,x}} = \big(1 - \Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1)\big)P(x) + \Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1)P < 1 - \tau$$
for all $N$. We assume that the data curator provides information that the data were matched with an error and the matching error is greater than or equal to the known $\lambda$, but does not provide the values of $\Pr(m_{ij} = 0 \mid x_i = x,\ \mathcal{M}_N(Z_i^x, Z_j^d) = 1)$ or $\pi(x)$. In this case, the behavior of the released propensity score and the treatment effect is as described in equations (3.2) and (3.3), and thus the true propensity score and the true treatment effect are not identified.

To summarize, the fact that we want to impose a bound on the risk of disclosure leads us to the loss of point identification of the true propensity score and the true average treatment effect. This means that point identification of the econometric model from the combined data set is incompatible with the security of individual information. If the publicly observed policy is based on the combination of the nonpublic treatment status and the public information regarding the individual, then the treatment status of any individual cannot be learned from this policy only if it is based on a biased estimate for the propensity score and a biased treatment effect.

The next theorem considers the case when $P > 1 - \tau$. It shows that in this case any release of point estimates of the propensity score from the treatment effect evaluation is not compatible with a low disclosure risk.

Theorem 5. Suppose the value of $P$ is publicly available, and $P > 1 - \tau$. Then the released propensity score will reveal all the individuals with $P(x) > 1 - \tau$ even if the data are combined with a positive (even very large) error. Let $p^* = \Pr(x : P(x) > 1 - \tau)$; that is, $p^*$ is the proportion of individuals with the damaging information about the propensity score. Then a $(1 - \tau, \delta)$ bound guarantee cannot be attained for the risk of partial disclosure if $\delta \le p^*$.

In the framework of Theorem 5 the release (or publicly observable use) of the propensity score is blatantly nonsecure. In other words, there will exist a sufficient number of individuals for whom we can learn their high propensity scores. To protect their privacy, no propensity scores should be released.
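The inequality in Theorem 3 can be inverted to see how much mismatch the curator needs to inject. Solving $(1 - \pi(x))P(x) + \pi(x)P < 1 - \tau$ for $\pi(x)$ when $P(x) > 1 - \tau > P$ gives $\pi(x) > \big(P(x) - (1 - \tau)\big)/\big(P(x) - P\big)$. The following helper (our own sketch, not part of the chapter) computes that lower bound.

```python
def min_mismatch_prob(p_x, p_bar, tau):
    """Smallest limiting mismatch probability pi(x) that caps the released
    score (1 - pi) * p_x + pi * p_bar strictly below 1 - tau.
    Assumes p_bar < 1 - tau; returns 0.0 when p_x is already below the cap."""
    cap = 1.0 - tau
    if p_x <= cap:
        return 0.0
    return (p_x - cap) / (p_x - p_bar)

# Example: P(x) = 0.98, population share of treated P = 0.4, tau = 0.05.
# Any pi(x) above roughly 0.052 keeps the released score below 0.95.
print(round(min_mismatch_prob(0.98, 0.4, 0.05), 3))
```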

10.5 Does a Religious Affiliation Affect a Parent’s Decision on Childhood Vaccination and Medical Checkups?

To illustrate our theoretical analysis, we bring our results to real data.


Even though in the main body of this chapter we do not develop a formal theory of the statistical estimation of $P_N^{\varepsilon_{N,x}}(\cdot)$ or the true propensity score $P(\cdot)$ in a finite sample, in this section we want to illustrate an empirical procedure one could implement in practice.

The data come from the Russian Longitudinal Monitoring Survey (RLMS).4 The RLMS is a nationally representative annual survey that covers more than 4,000 households (the number of children varies between 1,900 and 3,682), from 1992 until 2011. The survey gathers information on a very broad set of questions, including demographic and household characteristics, health, religion, and so on. The survey covers 33 Russian regions—31 oblasts (krays, republics), and also Moscow and St. Petersburg. Islam is the dominant religion in two regions, and Orthodox Christianity is the dominant religion in the rest. We combine our data from two parts of the RLMS—the survey for adults and the survey for children.

The question that we want to answer can be informally stated as follows: Does the religion of family members affect the probability of a child getting regular medical checkups or being vaccinated against tuberculosis? More specifically, we analyze whether (1) religious (Muslim or Orthodox Christian) families have their children seen by doctors or have their children vaccinated against tuberculosis with lower probability; and (2) families from neighborhoods with high percentages of religious people have their children seen by doctors with lower probability.

From the data set for children we extract the following individual characteristics for a child: the indicator for whether the child had a medical checkup in the last twelve (or three) months, the indicator for whether the child was vaccinated against tuberculosis, the indicator for whether the child lives in a city, and the child’s age. We also have the following information on the child’s family: the share of Orthodox Christian family members, the share of Muslim family members,5 and the share of family members with a college degree. From other publicly available data sets we obtain the following information for the child’s region: the share of Muslims and the gross regional product per capita. The summary statistics of all these variables are presented in table 10.1.

4. This survey is conducted by the Carolina Population Center at the University of North Carolina at Chapel Hill, and by the Higher School of Economics in Moscow. Official source name: “Russia Longitudinal Monitoring Survey, RLMS-HSE,” conducted by the Higher School of Economics and ZAO “Demoscope” together with the Carolina Population Center, University of North Carolina at Chapel Hill, and the Institute of Sociology RAS (RLMS-HSE websites: http://www.cpc.unc.edu/projects/rlms-hse, http://www.hse.ru/org/hse/rlms).
5. Variables for the shares of Muslims and Orthodox Christians in a family are constructed based on the following definition of a Muslim (Orthodox Christian). We say that a person is a Muslim (Orthodox Christian) if the person (a) says that she believes in God, and (b) says that she is a Muslim (Orthodox Christian). There are people in the survey who said, for example, that they are Muslims, but at the same time said that they are not believers. We consider such people nonbelievers.
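As an illustration of how the family-level religion shares described in footnote 5 could be constructed, here is a short pandas sketch. The data frame and its column names (family_id, believes_in_god, self_reported_religion) are hypothetical; the actual RLMS files use different variable names.

```python
import pandas as pd

# Hypothetical adult-level records; the RLMS variable names differ.
adults = pd.DataFrame({
    "family_id":              [1, 1, 1, 2, 2],
    "believes_in_god":        [1, 1, 0, 1, 1],
    "self_reported_religion": ["Orthodox", "Muslim", "Muslim", "Orthodox", "Orthodox"],
})

# Footnote 5's rule: a person counts as Muslim (Orthodox) only if she both
# believes in God and self-identifies with that religion.
adults["is_muslim"] = (adults["believes_in_god"] == 1) & (adults["self_reported_religion"] == "Muslim")
adults["is_orthodox"] = (adults["believes_in_god"] == 1) & (adults["self_reported_religion"] == "Orthodox")

family_shares = adults.groupby("family_id")[["is_muslim", "is_orthodox"]].mean()
print(family_shares)  # share of Muslim / Orthodox members per family
```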

Table 10.1  Summary statistics of various variables for a child

Variable                                        Obs.     Mean    Std. Dev.   Min.    Max.
Child: Medical checkup in last 12 months?      33,924    0.69      0.46       0       1
Child: Medical checkup in last 3 months?       62,316    0.45      0.50       0       1
Child: Vaccinated (tuberculosis)?              49,464    0.96      0.19       0       1
Child: I (lives in a city)                     73,100    0.38      0.49       0       1
Child: Age                                     73,100    7.19      4.09       0      18
Family: Share of Orthodox Christians           59,142    0.22      0.35       0       1
Family: Share of Muslims                       59,142    0.06      0.23       0       1
Family: Share of those with college degree     66,314    0.26      0.37       0       1
Region: Share of Muslims                       73,100    0.09      0.17       0       0.71
Region: Log GRP per capita                     71,466   10.96      1.38       7.04   13.50

Our analysis focuses on the propensity scores that represent the probability of the child getting regular checkups (or being vaccinated against tuberculosis). In our model, the following information is considered to be sensitive: that a propensity score is below a given threshold; that the share of Orthodox Christian (or Muslim) family members has a negative marginal effect on the propensity score; and that the share of Orthodox Christians (or Muslims) in the child’s neighborhood has a negative marginal effect on the propensity score.

The RLMS data set has a clustered structure, as people are surveyed within small neighborhoods with a population of around 300 people (so-called census districts; see Yakovlev [2012]). Thus, it is possible to construct characteristics of neighborhoods—in particular, the shares of Orthodox Christians (or Muslims) in neighborhoods—by using the religion variable from the RLMS data set for adults6 if one has information on neighborhood labels. Due to a vast Soviet heritage, the majority of people in Russia live in large communal developments that combine several multistory apartment buildings. These developments have common infrastructure, shops, and schools. High concentration in a relatively small area makes the life of each family very visible to all the neighbors. The neighborhoods are defined precisely by such developments. Neighborhood labels were publicly available till 2009 but then were deleted by the RLMS staff due to privacy concerns.7

In our study, we exploit the RLMS survey data from 1994 until 2009 because the neighborhood identifiers were publicly available in those years and, thus, one was able to consider the child’s neighborhood and then use the religious affiliation variable from the adult data set to construct the data for religion in that particular neighborhood, and use the income variable from the adult data set to calculate the average logarithm of income in that particular neighborhood.

6. Thus, the variable for the shares of Muslims and Orthodox Christians in a neighborhood is constructed using the same principle as in the case of families.
7. Fortunately, we happened to have the data on neighborhood identifiers.


Table 10.2  Summary statistics of neighborhood characteristics

Variable                           Obs.     Mean    Std. Dev.   Min.   Max.
Neighborhood: Share of Muslims    53,800    0.06      0.20       0      1
Neighborhood: Share of Orthodox   53,800    0.23      0.18       0      1
Neighborhood: Log(income)         58,578    6.25      1.86       0     10.9

The summary statistics of neighborhood characteristics are presented in table 10.2.

In order to answer the posed questions, we estimate the following probit regression:
$$\Pr(D_{it} = 1) = \Phi\big(\beta_1\,\text{share of Muslims in family}_{it} + \beta_2\,\text{share of Orthodox Christians in family}_{it} + \gamma_1\,\text{share of Muslims in neighborhood}_{it} + \gamma_2\,\text{share of Orthodox Christians in neighborhood}_{it} + \theta' q_{it}\big),$$
where $D_{it}$ stands for the indicator of whether a child had a medical checkup within the last twelve (or three) months, or the indicator of whether a child has a vaccination against tuberculosis. The set of controls $q_{it}$ contains the child’s characteristics (age, I(live in city)), regional characteristics such as the GRP per capita and the share of Muslims in the region, family characteristics such as family income and the share of family members with a college degree, neighborhood characteristics (average income in the neighborhood), and the year fixed effects. For notational simplicity, we write $\Pr(D_{it} = 1)$ instead of $\Pr(D_{it} = 1 \mid \text{religious characteristics}_{it}, q_{it})$.

The estimation results are presented in table 10.3. Columns (2) and (4) in the table show evidence that a higher percentage of Muslims in the family is associated with a lower chance of the child being regularly seen by a doctor. This holds for the sample of all children and for the subsample of children with health problems. Also, when the sample of all children is considered, a higher percentage of Muslims in the neighborhood has a negative marginal effect on the probability of the child being vaccinated against tuberculosis as well as being regularly seen by a doctor. The variables for the shares of Orthodox Christians are not significant.

The discussion below considers the sample of all children. The first two graphs in figure 10.1 are for the case when the dependent variable is the indicator for a checkup within the last twelve months. The last two graphs in that figure are for the case when the dependent variable is the indicator for a vaccination against tuberculosis.

Table 10.3  Probit regression estimation

                                                Sample: All children                          Sample: Children with health problems
                                                Medical checkup         Vaccinated against    Medical checkup
Variable                                        in last 12 months?      tuberculosis?         in last 3 months?
Child: Age                                      ‒0.0423 [0.0032]***     0.0685 [0.0047]***    ‒0.0438 [0.0067]***
Child: I (live in city)                         0.1704 [0.0313]***      ‒0.2062 [0.0441]***   0.0601 [0.0543]
Family: Share of Muslims                        ‒0.3314 [0.1127]***     ‒0.1506 [0.1686]      ‒0.4193 [0.2515]*
Family: Share of Orthodox Christians            0.0478 [0.0394]         ‒0.0936 [0.0604]      ‒0.0244 [0.0711]
Family: Average log(income)                     0.0602 [0.0151]***      ‒0.0169 [0.0211]      0.0437 [0.0303]
Family: Share of those with a college degree    0.0741 [0.0367]**       0.0296 [0.0571]       0.1561 [0.0651]**
Region: Share of Muslims                        ‒0.0129 [0.1421]        ‒0.3195 [0.2062]      0.2551 [0.3075]
Region: Log GRP per capita                      0.1838 [0.0308]***      ‒0.0412 [0.0463]      ‒0.0858 [0.0544]
Neighborhood: Share of Muslims                  ‒0.3416 [0.1757]*       ‒0.429 [0.2319]*      ‒0.4922 [0.4512]
Neighborhood: Share of Orthodox                 ‒0.105 [0.0840]         ‒0.0169 [0.1272]      ‒0.1603 [0.1603]
Year fixed effects                              yes                     yes                   yes
Constant                                        ‒2.0794 [0.3701]***     1.9472 [0.4039]***    ‒3.9003 [103.6494]
Observations                                    10,780                  17,413                2,902

***Significant at the 1 percent level. **Significant at the 5 percent level. *Significant at the 10 percent level.
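A probit of this form can be estimated with standard software. Below is a hedged Python sketch using statsmodels; the synthetic data frame and its column names are placeholders for the constructed RLMS variables (they will not reproduce the estimates in table 10.3), and the control set is abbreviated relative to the full specification in the chapter.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: one row per child-year with placeholder variable
# names; the real RLMS variables differ.  Purely to make the sketch runnable.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "share_muslim_family":         rng.uniform(0, 1, n),
    "share_orthodox_family":       rng.uniform(0, 1, n),
    "share_muslim_neighborhood":   rng.uniform(0, 1, n),
    "share_orthodox_neighborhood": rng.uniform(0, 1, n),
    "age":                         rng.integers(0, 18, n),
    "lives_in_city":               rng.integers(0, 2, n),
    "family_log_income":           rng.normal(6, 2, n),
    "year":                        rng.integers(1994, 2010, n),
})
latent = 0.5 - 0.4 * df["share_muslim_family"] - 0.3 * df["share_muslim_neighborhood"] + 0.02 * df["age"]
df["checkup_12m"] = (latent + rng.normal(0, 1, n) > 0).astype(int)

formula = (
    "checkup_12m ~ share_muslim_family + share_orthodox_family"
    " + share_muslim_neighborhood + share_orthodox_neighborhood"
    " + age + lives_in_city + family_log_income + C(year)"
)
probit_fit = smf.probit(formula, data=df).fit(disp=False)
print(probit_fit.params.round(3))
```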

The large dot in the first graph in figure 10.1 shows the pair (‒0.3416, ‒0.3314) of estimated coefficients for the share of Muslims in the family and the share of Muslims in the neighborhood from column (2) in table 10.3. The large dot in the second graph in figure 10.1 shows the pair (‒0.105, ‒0.3314) of estimated coefficients for the share of Orthodox Christians in the neighborhood and the share of Muslims in the neighborhood, respectively, from column (2) in table 10.3. The large dot in the third graph in figure 10.1 shows the pair (‒0.1506, ‒0.429) of estimated coefficients for the share of Muslims in the family and the share of Muslims in the neighborhood from column (3) in table 10.3. The large dot in the fourth graph in figure 10.1 shows the pair (‒0.105, ‒0.429) of estimated coefficients for the share of Orthodox Christians in the neighborhood and the share of Muslims in the neighborhood, respectively, from column (3) in table 10.3.


Fig. 10.1 Sets of estimates from 1,000 data sets combined using neighborhoods. Contour sets are for the cases of 2-anonymity

Finally, we analyze how the estimates of our parameters would change if we enforce a bound on the risk of partial disclosure and consider the bound of 0.5—that is, $\Pr(m_{ij} = 0 \mid \mathcal{M}_N(Z_i^x, Z_j^d) = 1) \ge \lambda$, where $\lambda = 0.5$. This is the case of attaining 2-anonymity.

In order to attain 2-anonymity we conduct the following exercise. For every child in our sample we create two possible neighborhoods—one neighborhood is the true one, and the other one is drawn randomly from the empirical distribution of neighborhoods in the corresponding region. Such empirical distributions can easily be obtained from the publicly available data in the RLMS. As a result, for every child we have two possible sets of values of neighborhood characteristics. Ideally, we would then like to simulate all possible combined data sets, but the number of these data sets grows exponentially, namely at the rate $2^n$. Instead of considering all possible combined data sets, we randomly simulate only 1,000 such data sets. For each simulated combined data set we conduct the probit estimation. Thus, we end up with 1,000 different sets of estimated coefficients (as well as the propensity scores).

The contour sets in the graphs in figure 10.1 are the convex hulls of the obtained estimates. Namely, the contour set in the first graph in figure 10.1 is the convex hull of the 1,000 pairs of estimated coefficients for the share of Muslims in the family and the share of Muslims in the neighborhood, respectively. The contour set in the second graph in figure 10.1 is the convex hull of the 1,000 pairs of estimated coefficients for the share of Orthodox Christians in the neighborhood and the share of Muslims in the neighborhood, respectively. Similarly for the other two graphs.
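A sketch of this simulation exercise might look as follows (our own illustration; `children`, `neighborhood_stats`, the column names, and the probit call are hypothetical stand-ins for the actual RLMS construction). Each replication swaps every child's neighborhood for a randomly drawn one from the same region with probability 1/2, re-estimates the probit, and stores the two coefficients whose convex hull is plotted in figure 10.1.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def simulate_estimates(children, neighborhood_stats, formula, n_sims=1000, seed=0):
    """children: one row per child with a 'region' column and the true neighborhood
    covariates; neighborhood_stats: candidate neighborhood characteristics by region.
    Returns the simulated (family, neighborhood) Muslim-share coefficient pairs."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_sims):
        df = children.copy()
        # With probability 1/2 replace each child's neighborhood characteristics
        # with those of a randomly drawn neighborhood from the same region
        # (2-anonymity: the true neighborhood is correct only half the time).
        swap = rng.random(len(df)) < 0.5
        for idx in df.index[swap]:
            region = df.at[idx, "region"]
            candidates = neighborhood_stats[neighborhood_stats["region"] == region]
            drawn = candidates.sample(1, random_state=int(rng.integers(1_000_000))).iloc[0]
            df.at[idx, "share_muslim_neighborhood"] = drawn["share_muslim_neighborhood"]
            df.at[idx, "share_orthodox_neighborhood"] = drawn["share_orthodox_neighborhood"]
        fit = smf.probit(formula, data=df).fit(disp=False)
        draws.append((fit.params["share_muslim_family"], fit.params["share_muslim_neighborhood"]))
    return pd.DataFrame(draws, columns=["family_muslim_coef", "neighborhood_muslim_coef"])
```

The convex hulls in figure 10.1 could then be computed from the 1,000 stored coefficient pairs (for example with scipy.spatial.ConvexHull); at RLMS scale a vectorized swap would be preferable to the per-row loop above.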


As can be seen, in the analysis of the probability of a medical checkup in the last twelve months, all 1,000 coefficients corresponding to the share of Muslims in the family and the share of Muslims in the neighborhood are negative.8 If the data curator thinks that the release of these sets of estimates is not satisfactory with regard to partial disclosure guarantees, then she should increase the guarantee level by, for instance, attaining 3-anonymity. As for the probability of being vaccinated against tuberculosis, among the 1,000 coefficients corresponding to the share of Muslims in the family there are some positive ones, even though all 1,000 coefficients corresponding to the share of Muslims in the neighborhood are negative.9 Again, the data curator may want to increase the guarantee level.

10.6 Conclusion

In this chapter we analyze how the combination of data from multiple anonymized sources can lead to serious threats of disclosure of individual information. While the anonymized data sets by themselves may pose no direct threat, such a threat may arise in the combined data. The main question that we address is whether statistical inference based on the information from all these data sets is possible without the risk of disclosure. We introduce the notion of statistical partial disclosure to characterize a situation when data combination allows an adversary to identify a certain individual characteristic with a small probability of misidentification. We focus our analysis on the estimation of treatment effects where the treatment status of an individual is sensitive and, thus, the possibility of the statistical recovery of this treatment status may be highly undesirable.

We show that a variety of techniques from the data mining literature can be used for reconstruction of the combined data sets with little to no auxiliary information. We also demonstrate that point identification of the statistical model for the average treatment effects is incompatible with bounds imposed on the risk of statistical partial disclosure to protect individual information. We illustrate our findings in the empirical study of the impact of the religious affiliation of parents on the probability of a child’s medical checkups and vaccination against tuberculosis, using individual-level data from Russia.

Statistical partial disclosure is becoming of central importance in the “big data” world. While many consumer companies have been routinely collecting private consumer data, the modern data-driven business paradigm calls for using these data in business decisions. A common example is the online ad-targeting technology where the consumer is exposed to ads based on

8. These variables are significant in each of the 1,000 cases (even though the confidence intervals are not depicted in the graphs).
9. The variable of the share of Muslims in the neighborhood is significant in each of the 1,000 cases.
the past consumer behavior and the known consumer characteristics. The ad delivery is based on an estimator that predicts the consumer’s click on the ad from the historical behavior of the given consumer and of other consumers similar in some sense to the consumer of interest. Forbes magazine published a story explaining how Target uses credit card information to identify repeated purchases from the same customer and, using a variety of sources, identifies a set of demographic characteristics. Then, based on the collected demographic information and the sets of products that consumers purchased in the past, Target was able to identify the sets of purchased products that most likely indicate that a customer (a female) is pregnant. Based on this prediction, Target sent out coupons for the baby section in the store. Forbes then proceeds with the anecdotal story of when Target customer service got a call from an angry father of a teenager stating that his daughter got the coupon. A week later the father called Target back with an apology, as his daughter had indeed turned out to be pregnant.

With further advancement in econometric and machine-learning methods, similar stories will emerge in a large variety of settings, from medical services (where people already get customized automatic medical advice based on their reported lifestyle, eating, and exercise habits) to real estate (where companies like Zillow give homeowners automated recommendations for the timing of a house sale and purchase). We argue that confidentiality restrictions can go hand in hand with big data tools to provide technologies that both aim at higher consumer welfare (leading to better consumer targeting) and provide formal privacy guarantees. We have studied some of these technologies in this chapter.

References

Abowd, J., and L. Vilhuber. 2008. "How Protective Are Synthetic Data?" Privacy in Statistical Databases 5262:239–46.
Abowd, J., and S. Woodcock. 2001. "Disclosure Limitation in Longitudinal Linked Data." In Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, edited by P. Doyle, J. Lane, L. Zayatz, and J. Theeuwes, 215–77. Amsterdam: North Holland.
Acquisti, A. 2004. "Privacy and Security of Personal Information." In Economics of Information Security, vol. 12, edited by L. Jean Camp and Stephen Lewis, 179–86. New York: Springer Science+Business Media.
Acquisti, A., A. Friedman, and R. Telang. 2006. "Is There a Cost to Privacy Breaches? An Event Study." Proceedings of the Twenty-Seventh International Conference on Information Systems. doi: 10.1.1.73.2942&rep=rep1&type=pdf.
Acquisti, A., and J. Grossklags. 2008. "What Can Behavioral Economics Teach Us about Privacy?" In Digital Privacy: Theory, Technologies, and Practices, edited by A. Acquisti, S. Gritzalis, S. DiVimercati, and C. Lambrinoudakis, 363–79. Boca Raton, FL: Auerbach Publications, Taylor & Francis Group.
Acquisti, A., and H. Varian. 2005. "Conditioning Prices on Purchase History." Marketing Science 33:367–81.
Aggarwal, G., T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. 2005. "Approximation Algorithms for k-anonymity." Journal of Privacy Technology, Paper no. 2005112001.
Bradley, C., L. Penberthy, K. Devers, and D. Holden. 2010. "Health Services Research and Data Linkages: Issues, Methods, and Directions for the Future." Health Services Research 45 (5, pt. 2): 1468–88.
Calzolari, G., and A. Pavan. 2006. "On the Optimality of Privacy in Sequential Contracting." Journal of Economic Theory 130 (1): 168–204.
Ciriani, V., S. di Vimercati, S. Foresti, and P. Samarati. 2007. "k-Anonymity." In Secure Data Management in Decentralized Systems, vol. 33, edited by T. Yu and S. Jajodia. Berlin: Springer-Verlag.
Duncan, G., S. Fienberg, R. Krishnan, R. Padman, and S. Roehrig. 2001. "Disclosure Limitation Methods and Information Loss for Tabular Data." In Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, edited by P. Doyle, 135–66. Amsterdam: North Holland.
Duncan, G., and D. Lambert. 1986. "Disclosure-Limited Data Dissemination." Journal of the American Statistical Association 81 (393): 10–18.
Duncan, G., and S. Mukherjee. 1991. "Microdata Disclosure Limitation in Statistical Databases: Query Size and Random Sample Query Control." In Proceedings of IEEE Symposium on Security and Privacy, 278–87.
Duncan, G., and R. Pearson. 1991. "Enhancing Access to Microdata While Protecting Confidentiality: Prospects for the Future." Statistical Science 6 (3): 219–32.
Dwork, C. 2006. "Differential Privacy." In Automata, Languages and Programming, edited by M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, 1–12. Berlin: Springer-Verlag.
Dwork, C., and K. Nissim. 2004. "Privacy-Preserving Data Mining on Vertically Partitioned Databases." In Advances in Cryptology–CRYPTO 2004, edited by M. Franklin, 134–38. New York: Springer.
Fienberg, S. 1994. "Conflicts between the Needs for Access to Statistical Information and Demands for Confidentiality." Journal of Official Statistics 10:115.
———. 2001. "Statistical Perspectives on Confidentiality and Data Access in Public Health." Statistics in Medicine 20 (9–10): 1347–56.
Goldfarb, A., and C. Tucker. 2010. "Online Display Advertising: Targeting and Obtrusiveness." Marketing Science 30 (3): 389–404.
Gross, R., and A. Acquisti. 2005. "Information Revelation and Privacy in Online Social Networks." In Proceedings of the 2005 ACM Workshop on Privacy in the Electronic Society, edited by V. Atluri, S. di Vimercati, and R. Dingledine, 71–80. New York: Association for Computing Machinery.
Homer, N., S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling, J. Pearson, D. Stephan, S. Nelson, and D. Craig. 2008. "Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays." PLoS Genetics 4 (8): e1000167.
Horowitz, J., and C. Manski. 2006. "Identification and Estimation of Statistical Functionals Using Incomplete Data." Journal of Econometrics 132 (2): 445–59.
Horowitz, J., C. Manski, M. Ponomareva, and J. Stoye. 2003. "Computation of Bounds on Population Parameters When the Data are Incomplete." Reliable Computing 9 (6): 419–40.
Komarova, T., D. Nekipelov, and E. Yakovlev. 2011. "Identification, Data Combination and the Risk of Disclosure." CeMMAP Working Paper no. CWP39/11, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
Korolova, A. 2010. "Privacy Violations Using Microtargeted Ads: A Case Study." In IEEE International Workshop on Privacy Aspects of Data Mining (PADM'2010), 474–82, Washington, DC. doi:10.1109/ICDMW.2010.137.
Lambert, D. 1993. "Measures of Disclosure Risk and Harm." Journal of Official Statistics 9:313.
LeFevre, K., D. DeWitt, and R. Ramakrishnan. 2005. "Incognito: Efficient Full-Domain k-Anonymity." In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, edited by Fatma Ozcan, 49–60. Association for Computing Machinery.
———. 2006. "Mondrian Multidimensional k-anonymity." In ICDE'06 Proceedings of the 22nd International Conference on Data Engineering, 25. Institute of Electronics and Electronic Engineers.
Magnac, T., and E. Maurin. 2008. "Partial Identification in Monotone Binary Models: Discrete Regressors and Interval Data." Review of Economic Studies 75 (3): 835–64.
Manski, C. 2003. Partial Identification of Probability Distributions. Berlin: Springer-Verlag.
Miller, A., and C. Tucker. 2009. "Privacy Protection and Technology Diffusion: The Case of Electronic Medical Records." Management Science 55 (7): 1077–93.
Molinari, F. 2008. "Partial Identification of Probability Distributions with Misclassified Data." Journal of Econometrics 144 (1): 81–117.
Narayanan, A., and V. Shmatikov. 2008. "Robust De-Anonymization of Large Sparse Datasets." In SP 2008 IEEE Symposium on Security and Privacy, 111–125. Institute of Electronics and Electrical Engineers.
Ridder, G., and R. Moffitt. 2007. "The Econometrics of Data Combination." Handbook of Econometrics 6 (6b): 5469–547.
Samarati, P., and L. Sweeney. 1998. "Protecting Privacy When Disclosing Information: k-Anonymity and Its Enforcement through Generalization and Suppression." Technical Report SRI-CSL-98-04, Computer Science Laboratory, SRI International.
Sweeney, L. 2002a. "Achieving k-Anonymity Privacy Protection Using Generalization and Suppression." International Journal of Uncertainty Fuzziness and Knowledge-Based Systems 10 (5): 571–88.
———. 2002b. "k-Anonymity: A Model for Protecting Privacy." International Journal of Uncertainty Fuzziness and Knowledge-Based Systems 10 (5): 557–70.
Taylor, C. 2004. "Consumer Privacy and the Market for Customer Information." RAND Journal of Economics 35 (4): 631–50.
Varian, H. 2009. "Economic Aspects of Personal Privacy." In Internet Policy and Economics, edited by W. H. Lehr and L. M. Pupillo, 101–09. New York: Springer Science+Business Media.
Wilson, A., T. Graves, M. Hamada, and C. Reese. 2006. "Advances in Data Combination, Analysis and Collection for System Reliability Assessment." Statistical Science 21 (4): 514–31.
Wright, G. 2010. "Probabilistic Record Linkage in SAS®." Working Paper, Kaiser Permanente, Oakland, CA.
Yakovlev, E. 2012. "Peers and Alcohol: Evidence from Russia." CEFIR Working Paper no. 182, Center for Economic and Financial Research.

11
Information Lost: Will the “Paradise” That Information Promises, to Both Consumer and Firm, Be “Lost” on Account of Data Breaches? The Epic is Playing Out

Catherine L. Mann

Catherine L. Mann is the Barbara ’54 and Richard M. Rosenberg Professor of Global Finance at Brandeis University. Excellent research assistance from Alok Mistry, who experienced his own data breach (stolen laptop) during the course of this project. For acknowledgments, sources of research support, and disclosure of the author’s material financial relationships, if any, please see http://www.nber.org/chapters/c12990.ack.

11.1 Introduction

The expanding scope of Internet use yields a widening array of firms with access to ever-expanding databases of information on individuals’ search, transactions, and preferences. This information translates into consumers’ ease of transacting, range of complementary purchases, targeted news and advertising, and other directed goods, services, and information, all of which increase customer value—but also raise the probability and consequences of information loss. Similarly, firms have unprecedented windows into customer behavior and preferences with which they can improve products, segment markets, and, therefore, enhance profits—but also raise the probability of losing or abusing information. The Digitization Agenda can help frame and balance the benefits to firms and consumers of information gained with the risk and costs of information lost, particularly in the context of increasingly global flows of information and transactions.

A first priority is a conceptual framework. Three key elements in the structure of the information marketplace influence the valuation and balancing of benefits and costs. First, information exhibits economies of scale and scope, which challenges the ability of the market to efficiently price information. Second, participants in the information marketplace are not atomistic; rather, they are asymmetric in terms of market power, which affects the incidence and distribution of benefits and costs. Third, information loss
is a probabilistic event, but with unknown distribution, which challenges the valuation of benefits and, particularly, costs. A final element is that the information marketplace is global, populated by heterogeneous firms and consumers, and by policymakers who differ in their policy responses to the imperfections in the marketplace.

A second requirement is empirical analysis of the frameworks. Mandated disclosure of data breaches in the United States was the watershed enabling this study and its references. Without disclosure, it is impossible to investigate the risks and potential costs of information loss against the benefits of information collection and aggregation. Disclosure helps reveal to consumers, firms, and policymakers the nature of data loss, and may change incentives and affect the incidence and balancing of costs and benefits. However, disclosure can be along a spectrum from every incident being announced to everyone to only critical incidents being communicated to a few. In fact, there is no globally consistent approach to disclosure, nor even to the notion of disclosure at all, so the window into the empirical valuation of costs and benefits of the information marketplace is narrow. Even so, evidence on how disclosure works is starting to emerge. If the market response to disclosure is sufficient to apportion and balance costs and benefits, then, in principle, no policy intervention into the marketplace is needed. So far, this does not appear to be the case. More information on the nature of data breaches, on the incidence of benefits and costs, on market participant response, and on evidence of the efficacy of policy intervention should help prioritize the Digitization Agenda.

This chapter proceeds along the following path. The next section reviews various conceptual frameworks with which we can analyze the structure of the information marketplace. Section 11.3 presents evidence on the extent and nature of information lost. What are the trends? Size of loss, sector of loss, source of loss, cost of loss, market value of information, and so on, including in the global context. Section 11.4 addresses market and policy responses to information loss, and reviews legislative and legal strategies that could complement market discipline. Particular attention is given to the challenges of cross-border information flows, including differences in attitudes and priorities toward data security. Section 11.5 concludes with priorities for the Digitization Agenda.

11.2 Frameworks for Analyzing the Information Marketplace and Data Breaches

That consumers gain from using the Internet is clear from increased competition and reduced prices (Morton 2006), greater variety (Goolsbee and Klenow 2006), and faster access to a wider range of public information (Greenstein and McDevitt 2011; Yan, Jeon, and Kim 2013). Wallsten (chapter 2, this volume) continues the work of valuing the consumer benefits of
using the Internet. Yet, with rapidly changing technology and social interaction, it is hard to pin down exactly how large the increase in consumer surplus might be, so there is much more work to do.

Using the Internet generates the information that is the basic building block of the marketplace for information. A conceptual framework for the incidence and balance of costs and benefits of information in this marketplace includes the valuing of consumer gain from using the Internet, but it is a more complex framework with more players. For simplicity, suppose the information marketplace is populated by originators of information (say, consumers, as they reveal their preferences through search and transactions); intermediaries of the information (say, firms that transmit data, and those that collect, aggregate, and retain information); and final users (say, firms that call on the aggregated data to improve products). How should we model the interactions between the information and the three players? Is the information atomistic, or are there economies of scale and scope in aggregation of the information into a database? Are the players atomistic and equally numerous, or do they differ in concentration and market power in their economic relationships? What about the nature of uncertainty? Answering these questions helps to determine to what extent the information marketplace is “classic” in the Adam Smith sense and “complete” in the Arrow-Debreu sense, or whether it is a market with imperfections.

Various authors have taken up the challenge of modeling the information marketplace, some explicitly in the context of data breaches. The several papers reviewed below are put into the context of a general framework that focuses on a market structure that includes economies of scale and scope in data aggregation and multiple nonatomistic players, in an environment of uncertainty over the nature and probability of data breaches and their consequences. In this kind of market structure it is challenging to value the cost and incidence of a data breach. A further challenge is that the basic building blocks of information may be valued differently across geographies and cultures. When information can flow across borders, these differences (and policymakers’ responses) may create arbitrage opportunities.

11.2.1 Complete Markets: The Benchmark Market Structure

The purpose of outlining the characteristics of the perfectly competitive marketplace—the Adam Smith marketplace—is to provide a benchmark against which the structure of the global information marketplace can be assessed. If the environment for undertaking information-rich activities is characterized by perfect competition, then Adam Smith’s invisible hand—whereby each acts in his own self-interest—achieves the highest economic well-being for all players. In Adam Smith’s market, one-off transactions generate unique prices for each transaction. In this classic marketplace, buyers, intermediaries, and sellers are all atomistic. There are no databases with a history of a specific
buyer’s transactions or those of buyers of similar characteristics that create correlations between transactions across time or across individuals. No information is retained, so no information can be lost. Balancing the benefits of information exchange with the potential cost of information lost is not an issue.

An extension of Adam Smith’s market allows for transactions across time, proximity, currency, and uncertainty. In the so-called Arrow-Debreu “complete” market (Arrow and Debreu 1954), economic instruments exist for all possible transactions that the set of market participants can undertake with each other. A complete market accommodates all dimensions of a transaction through time, space, and under uncertainty and yields a unique and market-determined price for that transaction in a frictionless world. Whereas these transactions may be correlated and/or uncertain, the correlations of transactions (such as interest rates and exchange rates) and uncertainties (probability of default) are fully known (in the complete market), and therefore will be efficiently embodied in the relevant prices. In a complete-markets framework, both the private and the social optimum outcomes can be achieved because there is a perfect (complete) and frictionless match between transactions and atomistic market participants over all possible states of nature and time. With full information about correlation and uncertainty, prices will fully reflect the benefits of information exchange, which can then be balanced against the potential cost of information lost. There are no market imperfections.

11.2.2 The Information Marketplace: Violating the Complete-Markets Framework

In a number of ways the information marketplace violates key assumptions of the complete-markets framework, which makes pricing information difficult, and opens up for consideration the topics of market imperfections and problems of ranking the second best. More specifically, without accurate prices, the benefit-cost calculation surrounding information exchange as against information lost through a data breach will be very challenging. The first violation is the assumption that transactions are one-off or uncorrelated as in Adam Smith’s work. In fact, information is characterized by economies of scale and scope. That is, the value of information over a series of transactions for an individual is greater than the sum of the individual transactions because of the correlations across the individual’s behavior; for example, the information marketplace is characterized by economies of scale. The value of information aggregated over many individuals is greater than the sum of any individual’s set because of the correlations across individuals; the information marketplace is characterized by economies of scope. Even if each unique piece of information had a uniquely matched price, there would be an incomplete mapping between the value of that morsel of


information by itself, its value in one database, and its value if two (or N) databases are merged together.1 Databases, which are the product of the information marketplace, are characterized by economies of scale and scope so that the pricing of information is imperfect, unless there is full information about all the correlations among each morsel of data. Because the information marketplace and its players are evolving rapidly with technology and Internet use, it is clear that the correlations needed for the complete-markets framework cannot be known in sufficient detail or timeliness to incorporate them into the information price. The second challenge that the information marketplace brings to the complete-markets framework is the nature of uncertainty. Uncertainty enters the information marketplace through the possible misuse of information. A complete markets set-up could, in theory, price insurance that pays off in the case of a data breach, but since such price determination in the information marketplace is, in practice, nearly impossible, an actuarially fair price for insurance is also extremely difficult to establish. The information marketplace exhibits two types of uncertainty that are difficult to price. First, and most challenging, is the potential correlation over time of information lost. A data breach today cannot be valued with certainty because the value of the information lost today is a function of all possible data breaches in the future. Future data breaches matter for today’s valuation of information lost because of the unknown relationship between the information lost in today’s breach and the information lost in a future breach. Economies of scale and scope in information in the future affect valuation in the present. The second type of uncertainty is that information lost may not be information abused. The cost of information lost should differ depending on whether the lost information is used maliciously or not. The two uncertainties together make valuing information lost quite difficult. The insurance contracts, which are key instruments in the complete-market framework, are not likely to exist.2 The third violation of the assumptions that underpin the complete-markets benchmark model is that the players are not atomistic. Recall that the information marketplace has consumers (originators of information), intermediaries (transmitters and aggregators), and firms (that use information to improve products). Consumers are numerous. Firms are numerous. Transmitters and aggregators are concentrated and have several types 1. Another way to think about why there are economies of scope in information in databases is to consider the analogy from financial markets: there are diversification gains associated with merging two not-identical financial portfolios. 2. According to a Financial Times article (April 23, 2014), AIG is offering a “first of its kind” insurance product to protect firms against cyber attacks on the “Internet of things” that yield product liability and bodily harm (Alloway 2014). In these cases (product liability and bodily harm) the consequences of a data breach are seen at the time of the incident.


of market power that will affect the price and value of information and therefore influence cost-benefit calculations associated with information exchange, information security, and information loss. Moreover, the degree of market power and the rules under which the intermediaries operate vary substantially across countries and policy environments. These violations of the complete-markets framework offer jumping-off points for research. The following selected papers focus on modeling the information marketplace. Some papers specifically address how to model the cost-benefit calculation in the case of information lost.3
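To see why economies of scale and scope frustrate simple pricing, it can help to write down a toy value function in which a merged database is worth more than the sum of its parts. The sketch below is a minimal illustration under an assumed functional form and assumed numbers; nothing in it comes from the chapter.

    # Minimal sketch of a superadditive (economies-of-scope) value function for
    # databases. The pairwise-correlation term is an illustrative assumption.
    def db_value(n_records):
        # Value grows with the number of record pairs that can be cross-correlated,
        # so it rises faster than linearly in the number of records.
        return n_records + 0.01 * n_records * (n_records - 1) / 2

    a, b = 1000, 1000
    print(db_value(a) + db_value(b))   # value of two separate databases: 11,990
    print(db_value(a + b))             # value of the merged database:    21,990

Because the value of any one record depends on which other records it can be combined with, no single per-record price can capture that value, which is the pricing problem described above.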

11.2.3 Applying the Pollution Model to Information Flows

Pollution seems like a good analogy for the information marketplace: pollution has (negative) economies of scale, asymmetric market position of participants (upstream-downstream), and uncertainties as to costs and benefits of exposure and remediation. Hirsch (2006) uses the pollution model and focuses on the negative economies of scale. He presumes that collecting and aggregating personal information generates negative externalities. “There is a growing sense that the digital age is causing unprecedented damage to privacy . . . digital economy businesses often do not bear the cost of the harms that they inflict” (9). Just as pollution is an outcome of production, so too is information aggregation an externality of “production” (search and transaction on the Internet). In the pollution model of the information marketplace, no data breach is necessary to generate harm. Aggregation alone departs from the complete-markets framework. With the economy of scale inherent in information aggregation, there will be a price wedge between the valuation of information by the consumer and by intermediaries and firms in the marketplace. Hirsch continues with the pollution analogy and reviews the evolution of policy strategy from “command and control” compliance (quantities) to “second-generation” (prices) or “outcome-oriented” policy whereby the regulated entities find their own cost-effective strategy to achieve the legislated goal. Tang, Hu, and Smith (2007) take these strategies to the information marketplace. They model information collection looking through the lens of consumer preferences for trust. Standardized regulation does not map into the heterogeneity of consumer preferences for trust (with some consumers being too regulated, others not enough), so overall economic well-being is reduced by such an approach. In contrast, they find that under circumstances of clarity and credibility, self-regulation can achieve a nuanced strategy that meets the heterogeneous preferences in the marketplace. On the other hand, Ioannidis, Pym, and Williams (2013) argue that “information 3. The literature addressed in this chapter focuses on the benefits of information exchange and the costs of information lost. Other research focuses more specifically on the topic of privacy. For more on modeling privacy, see US Dept of Commerce, NTIA chapter compendium of articles; Roberds and Schreft (2009), Anderson (2006), and references therein.


stewardship” internalizes the social costs of data loss (much as a corporate social responsibility policy might internalize the firm’s approach to its pollution or as an environmental group might publicize polluting behavior). With the prodding of such an information steward, firms internalize some of the costs of data loss and therefore undertake higher investments in information security than they otherwise would have. Using their model, Ioannidis, Pym, and Williams find that social welfare is enhanced. Whereas environmental economics offers a model for the information marketplace, the analogy is stretched because consumers and firms do gain from information aggregation, and it is hard to imagine anyone actually gaining from downstream pollution. Moreover, although the pollution model allows for market power and uncertainty, so far researchers have not put all three elements of economies of scale/scope, market power, and uncertainty together in the context of the information marketplace.

11.2.4 Too Much Information: Trade-Offs with Limits to Rationality

Full information and frictionless markets are key in the complete-market framework. Acquisti (2010) starts by arguing that the information marketplace is all about trade-offs. “In choosing the balance between sharing or hiding one’s personal information (and in choosing the balance between exploiting or protecting individuals’ data), both individuals and organizations face complex, sometimes intangible, and often ambiguous tradeoffs. . . . But trade-offs are the natural realm of economics”(3). But then, he notes that limited consumer rationality and transactions costs make calculating these trade-offs difficult. Both of these issues affect the pricing of information, as well as the distribution of benefits and costs of information aggregation and potentially of its loss. If consumers do not know the value of their information, they cannot calculate the trade-off between allowing collection and aggregation against the possible cost of a data breach. These issues depart from the complete-markets model and are Acquisti’s (2010) jumping-off point for his modeling of the cost-benefit calculations. How significant are these departures in the information marketplace from the complete-markets framework? Researchers have attempted to calculate the value of the aggregation of one’s own information. Conjoint analysis by Hann et al. (2002) finds that consumers trade their information for about $40–$50 of product value. Convenience is often cited as a rationale for allowing the aggregation of one’s own personal information, as in online banking (Lichtenstein and Williamson 2006). Another way to value personal information is to calculate the cost to firms of the inability to use individual and aggregate personal information to target advertising (Goldfarb and Tucker 2010). The empirical work on value of information to the consumer suggests that limited rationality is an important problem. Policymakers and businesses differ in their response to the limited ratio-


nality of consumers. The European Union (EU) Privacy Directive is at one extreme, disallowing the collection and retention of personal information on the grounds that consumers do not know what they are giving up, and strengthening this approach in early 2014 with the “right to be forgotten.” Other policy approaches require active consent (opt-in) or more transparency (e.g., this website uses cookies . . . click here for our cookie policy). Some firms are finding a market opportunity in responding to the limited rationality problem. Incorporated into their websites are easy-to-use tools that allow customers to edit the information stream associated with their search and transactions activity and thereby improve the accuracy and targeting of their own information.4 However, the presence of economies of scale and scope in information aggregation, as well as the nature of uncertainty regarding data breaches, means that the analysis of the balancing of the benefits from information transmitted against the potential cost of information lost is more complex than just the limited rationality of individuals.

11.2.5 Multiple Players, Market Power, and the Role for Disclosure

Much of the literature that addresses the benefit of information aggregation versus cost when information is lost uses a two-player framework— so-called data subjects (such as customers that provide the information) and so-called data holders (such as a firm that aggregates customer data to create customized products). In fact, there is a third player in the information marketplace—the intermediaries—through which information “transits” and/ or “rests.” Examples range from Visa, Amazon, and Google to less familiar companies such as ChoicePoint or Acxiom. Atomistic interaction among market players is an important underpinning of the complete-markets framework, but is clearly violated in the information marketplace. In particular, intermediaries are very highly concentrated: Google accounts for about 70 percent of all search,5 collecting and retaining all that information; Visa accounts for about three-quarters of all US card transactions, creating a thick financial and purchase trail;6 and Amazon accounts for 15 percent of all US online sales and is ranked fifteenth among all retail companies, collecting reams of data along the way.7 On the other hand, there are billions of consumers and merchants that use Google and Visa and shop with Amazon. Virtually none of them interact with an intermediary such as ChoicePoint or Acxiom, although their information rests there. The differential interactions and differential concentra4. Singer (2013). 5. Multiple sources as of April, May, and June 2013. 6. http://www.forbes.com/sites/greatspeculations/2013/05/03/visa-and-mastercard-battle -for-share-in-global-shift-to-plastic/. 7. http://www.prnewswire.com/news-releases/amazoncom-captures-28–of-top-online -retailer-sales-205427331.html.


tions are important for the valuation of information and the magnitude and incidence of costs in the case of a data breach. Considering interactions and concentration, Romanosky and Acquisti (2009) use a systems control strategy to map alternative legislative approaches to reducing harm from information loss. Two of the three approaches draw from accident legislation: First, ex ante “safety regulation” (e.g., seat belts) in the context of the information marketplace would include promulgation and adherence by intermediaries to, say, Payment Card Industry Data Security Standards. But these authors argue that ex ante standards focus on inputs (encryption) rather than outcomes (harm), so they are not efficient. Second, ex post liability law (e.g., legal suits) could include fines for negligence in the protection of information. But, ex post litigation may be ineffective because courts have been unwilling to award damages based on the probability of some future harm coming as a consequence of a data breach (see the evolving legal landscape in section 11.4). A third approach is disclosure of data breaches. Disclosure of data breaches is a key ingredient to calculating costs and benefits of providing and protecting information, and of apportioning responsibility and costs in the case of a data breach. Romanosky and Acquisti note that consumer cognitive bias (misperception of risk) and costs of disclosing the data breach itself (disclosing what to whom; see discussion that follows) are important caveats for the effectiveness of disclosure. Romanosky and Acquisti use their framework to outline an empirical example where cognitive bias and disclosure costs are less significant because of the concentrated market structure of intermediaries. Specifically, they analyze the relationship between credit card-issuing institutions and firms that hold (and lose) credit card data. They argue that information disclosure has promoted the internalization of the costs of remediation by the data holders (and losers), which increases the incentives for the adequate protection of personal information even when the individual who has provided that information cannot demand such protection. Why does disclosure help align (some of the) private interests? First, a sufficient number of data breaches have occurred such that these costs have begun to be quantified (to be discussed in sections 11.3 and 11.4). Second, the number of affected intermediaries (card issuers in this case) is sufficiently small that they have market power to demand remediation (or impose punishment) from the other concentrated intermediary, the data aggregators/holders. Third, the chain of causation between information loss and required remediation is revealed because of data-breach disclosure laws. The disclosure laws along with quantification of costs, as well as the small number of players, promote the transfer of remediation costs from the card issuers to the database aggregators, those who actually lost the information. Thus, at least some of the cost of the data breach was internalized in this example. However, the costs of information loss borne by individual card holders


were not transferred to those firms where the data breach occurred. The market power of individuals was insignificant, and in a transactions sense, the individuals were distant from the data aggregators/holders. Individuals can change card issuers, but they have no power to affect the relationship between their card issuer and what firm aggregates the transactions of that card. Thus, the cost of the data breach incurred by individuals was not internalized by the intermediaries, and the individuals had no market power to affect such an internalization. Unlike the atomistic players in the complete-market framework, the information marketplace has disparities in concentration and market power that affect the distribution of costs of a data breach, as well as the price and willingness to pay for techniques to avoid such a breach. (See more on disclosure in sections 11.3 and 11.4.)8

11.2.6 The Probability Distribution of Data Breaches

The third key underpinning of the complete-markets framework is the pricing of uncertainty. For a number of reasons, it is challenging to estimate, and therefore price, the uncertainty of incurring and then the uncertain consequences of a data breach. Nevertheless, in the face of costly data breaches (see section 11.3) firms increasingly are turning to risk modeling for the decision to invest in information technology security. The shape of the probability distribution of data breach events is crucial to calculate both the costs of a breach and benefits of undertaking security investments. Assuming that data breaches follow a normal distribution will yield a different calculation than if data breaches are characterized by “fat tails” or extreme outlier distributions. Thomas et al. (2013) consider alternative probability distributions in a theoretical model of investment in information security. An analogy comes from the market for foreign exchange and the financial instruments that are priced and used in that market. Suppose a firm wants to put a floor on the value in the home currency of the revenue stream earned abroad in the foreign currency. In a complete-markets framework, the firm could buy an option that will pay off when the home-to-foreign currency exchange rate reaches a particular value. In a complete-markets framework, the probability distribution of exchange rate movements is fully known. The option would be priced exactly so as to make the firm indifferent to buying it or not (and on the sell side, the seller indifferent to selling the option or not.) The factor inducing one firm to buy the option and the other to sell the option is differences in risk appetite, among other factors. 8. The massive Target data breach in the fall of 2013 opened a new front in the market power relationships. Although Target credit card transactions were the locus of the data breach, the company argued that chip-and-pin technology would have significantly altered the likelihood of the data breach. Since credit card companies have not generally supported chip-and-pin in the United States (despite this being the technology used in Europe), Target diverted some of the blame to the credit card companies.


But, suppose the probability distribution is not accurately parameterized. For example, suppose that exchange rate fluctuations are assumed to follow a normal distribution, but the true distribution has fat tails. The probability of the foreign currency depreciation that triggers the option will be underestimated relative to its value under the true probability distribution. The firm will not buy the option, and it will experience an uncompensated loss. On the other hand, if the firm assumes the extreme outlier distribution is correct, when the true distribution is normal, then the firm will buy too expensive an option, given the very small likelihood of the extreme event. In the information marketplace there is a similar problem of deriving the correct probability distribution of a data breach. Information on the probability of incurring a data breach is limited, and the probability of incurring a breach is not identical to the probability of data abuse. Without knowing the correct probability distribution, too much investment in information security or too little are equally possible. Moreover, whether the correct market player is the target of the security effort remains unclear. For example, Anderson et al. (2012) point out that one automated spammer accounted for about one-third of global spam in 2010 and profited $2.7 million. But the 2010 worldwide spending on preventing spam exceeded $1 billion. So neither the level of spending nor the target appeared to have been optimal. The challenges to optimizing investment in data security run deeper because of the economies of scale and scope in the information and the differential market power of the players. Does the cost-benefit calculation for information security differ between many small breaches (say, the normal distribution) and a rare but large data breach (the “black swan” event, from Taleb 2007, 2010)? Is a large data breach more likely to lead to abuse of data, or less likely? The hypothesis of economies of scale and scope in information suggests that large data breaches, experienced over time, accumulate to enhance potential abuse of all revealed information, whether abused before or not. Differential market power has already been seen to shift the burden of costs of a data breach; it could similarly shift the burden of responsibility to invest in information security. Free riding and moral hazard are other aspects of differential market power that cause the information marketplace to deviate from the complete-markets framework.
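A small simulation can make the mispricing argument concrete. The sketch below compares the probability of an extreme loss under a normal assumption with the probability under a heavier-tailed alternative; the Student-t distribution, the threshold, and the scaling are illustrative assumptions, not parameters drawn from the chapter's data.

    # Minimal sketch: assuming normality understates tail risk when the true
    # loss distribution has fat tails. All parameters are illustrative.
    from scipy import stats

    sigma = 1.0          # both distributions scaled to unit variance
    nu = 3               # degrees of freedom for the fat-tailed (Student-t) case
    t_scale = sigma * ((nu - 2) / nu) ** 0.5   # so the t also has variance sigma**2

    threshold = 4.0      # an "extreme" loss, measured in standard deviations

    p_normal = stats.norm.sf(threshold, scale=sigma)
    p_fat = stats.t.sf(threshold / t_scale, df=nu)

    print(f"P(loss > {threshold}) assuming normality:      {p_normal:.2e}")
    print(f"P(loss > {threshold}) under fat tails (t, 3 df): {p_fat:.2e}")
    # The normal assumption treats the extreme event as far rarer than the
    # fat-tailed truth, so the implied insurance or security spending is too low.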

11.2.7 Information Marketplace: Challenges to Pricing and Balancing Benefits and Costs

In sum, the information marketplace violates the classic complete-markets framework in three ways. First, information is characterized by economies of scale and scope, so it is difficult to price and value. Moreover, the benefits of aggregation increase, but so may the cost in the case where information is lost. Second, the various market players are not atomistic. The relationships between the originators of information—the intermediaries that transmit, aggregate, and hold information—and the users of the aggregated data to


enhance products are characterized by differential market power. The differential market power affects the distribution of both benefits of information and the potential costs when information is lost. Finally, there is substantial uncertainty about the probability distribution describing both the data breach event and potential abuse of information that is exposed, so it is hard to value information lost. Collectively, these departures from the complete-markets framework point to potential inefficiencies in market pricing and in participant behavior. Whether such inefficiencies suggest policymaker intervention requires more analysis.

11.3 Trends in Information Lost

The literature and framework presented in section 11.2 pointed to a variety of data needs: how to value information that incorporates economies of scale and scope, the nature of the market-power relationships between different market actors, and the parameters of the probability distributions of information lost and/or misused. All of this is needed to evaluate whether the information marketplace is efficiently balancing the value of information aggregated against the costs of information lost. Against this variety of data needs, this section presents evidence on only the extent and nature of information lost. What are the trends in the size of loss, sector of loss, source of loss, cost of loss, market value of information, probability of abuse given a breach, and so on, including in the global context? The raw data come from several sources, including the Privacy Rights Clearinghouse and Open Security Foundation, which draw from public news sources; a number of consulting firms that employ industry surveys such as the Ponemon Institute, Symantec, Verizon, Javelin Strategy and Research, KPMG Europe; and the Federal Trade Commission and the Department of Justice, which draw on the consumer fraud online report database. Only some of the raw data are available for research use; most are proprietary, and this chapter draws on the public sources. Access by researchers to proprietary data would be quite valuable.

11.3.1 How Much Information is Lost? And by What Means?

Privacy Rights Clearinghouse (PRC) data for 2005 to 2012 show that after a notable drop in data breaches in 2009, during the depths of the recession, data breaches are on the increase again.10 (The number of records lost in each breach, which is a different measure of information lost, will be discussed below.) The PRC disaggregates breaches into various types: losing paper documents or losing computers (static desktop or portable); 9. www.privacyrights.org/data-breach. 10. The California disclosure law (discussed in section 11.4) passed in 2003. The jump in breaches from 2005 to 2006 is more likely a consequence of more widespread reporting of data breach announcements and collection into the database than it is an actual dramatic jump.

Fig. 11.1 Data breach, total number and by method
Source: Privacy Rights Clearinghouse.

inadvertent disclosure (such as using “cc” instead of “bcc” in an e-mail list); and various types of fraud (by an insider employee, by an outsider hacker, through payment card).11 The first three types of information lost are more by mistake, although the disclosed information could still be misused. The three types of fraud are presumed to have some malicious intent. Hacking dominates, and insider fraud is an increasingly important source of data breaches. But a surprising number of data breaches still take place the “old-fashioned way” by losing paper documents or laptops and through unintended disclosure. (See figure 11.1.) Whereas the announcement of a breach indicates that information has been compromised, the actual number of records involved in each breach could be a better measure of potential cost of the breach in that a record represents granular information about an individual. Not all breach disclosures reveal how many records were lost in the breach. In fact, only about half of the announcements include that information. (See more discussion of breaches that reveal Social Security numbers below.) For the breach disclosures that reveal the number of records lost over the 2005–2012 period, the histogram of records lost per breach shows that the most frequent breach is small, involving 1–10,000 records. There is some reduction in breaches with medium-sized losses (100,000–500,000 records 11. The Open Security Foundation also uses this classification scheme.

Fig. 11.2 Records per breach, all sectors
Source: Privacy Rights Clearinghouse.

lost), but little progress in stemming breaches of either small or huge size. In particular, huge breaches (1,000,000 records and up), though infrequent, have not been controlled (witness the enormous 2013 Target breach). This histogram of breaches offers an insight into the probability distribution of a breach event. A cross-tabulation of the type of breach with the size of the breach could help target investment in information security. However, it is not known whether huge breaches are more likely to lead to information abuse, or whether data from small breaches are more likely to be misused. (See figure 11.2.)
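Both the histogram and the cross-tabulation described above are straightforward to construct once the disclosure announcements are in tabular form. The sketch below assumes a hypothetical extract, breaches.csv, with columns year, breach_type, and records_lost (missing when the announcement does not disclose a count); the file name, column names, and bin edges are illustrative placeholders, not the actual Privacy Rights Clearinghouse schema.

    # Sketch: histogram of records lost per breach and a breach-type-by-size
    # cross-tabulation, from a hypothetical breaches.csv extract.
    import pandas as pd

    df = pd.read_csv("breaches.csv")                  # hypothetical file name
    disclosed = df.dropna(subset=["records_lost"])    # roughly half of announcements

    bins = [0, 10_000, 100_000, 500_000, 1_000_000, float("inf")]
    labels = ["1-10k", "10k-100k", "100k-500k", "500k-1m", "1m+"]
    disclosed["size_bin"] = pd.cut(disclosed["records_lost"], bins=bins, labels=labels)

    print(disclosed["size_bin"].value_counts().sort_index())             # histogram
    print(pd.crosstab(disclosed["breach_type"], disclosed["size_bin"]))  # type x size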

11.3.2 What Kind of Information is Lost?

Revealing a Social Security number (SSN) during a data breach generates far greater concern and potential for costly information loss compared to a data breach that compromises other types of personal information (see evidence in section 11.4). Based on the PRC data, there is a mixed picture of whether more or less high-value information is being lost. In part, this mixed picture appears to be because reporting of SSN losses is increasingly incomplete. Over the time period, the number of breaches that reveal SSN has increased, but as a share of all data breaches, those that reveal SSN have declined. The number of reported records where the SSN was compromised declined from a peak in 2007, although not in trend fashion. So, this suggests that SSN breaches are becoming less prevalent, perhaps because of enhanced security. (See figure 11.3.) On the other hand, recall that not all breach announcements reveal the

Fig. 11.3 Breaches with SSN
Fig. 11.4 SSN records lost, percentage of breaches not disclosing number of records

number of records lost. For breaches that compromise SSN, the share of those breaches that do not disclose the number of SSN-related records lost has increased over time (figure 11.4). Considering a sectoral decomposition of data breach announcements, the business-other (BSO) category is the largest sector that does not disclose whether SSN records have been compromised. Sectors that are perhaps under greater scrutiny, such as medical


Fig. 11.5 Undisclosed number of records with SSN breaches (percentage of total breaches with SSN)

(MED), financial (BSF), and retail (BSR) appear to disclose more information. (See figure 11.5.) In sum, interpreting the data on SSN breaches and required disclosure requires more analysis. Required disclosure may have led to security investment and thus fewer SSN-related breaches. Or, required disclosure may just have prompted less transparency in public reporting.
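The shares discussed in this subsection reduce to a few group-by computations. The sketch below reuses the hypothetical breaches.csv extract from the previous sketch, assuming additional columns ssn_exposed (boolean) and sector; the field names are placeholders rather than the actual PRC layout.

    # Sketch: share of breaches exposing SSNs, and, among those, the share that
    # do not disclose how many records were lost, by year and by sector.
    import pandas as pd

    df = pd.read_csv("breaches.csv")                      # hypothetical file
    print(df.groupby("year")["ssn_exposed"].mean())       # share of breaches with SSN

    ssn = df[df["ssn_exposed"]]
    print(ssn.groupby("year")["records_lost"].apply(lambda s: s.isna().mean()))
    print(ssn.groupby("sector")["records_lost"].apply(lambda s: s.isna().mean()))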

11.3.3 Is There Differentiation by Sector?

Looking behind the averages, are there differences by sector? Which sectors are the most prone to data breaches, by what means, and does the size of breach and information revealed differ by sector? The PRC data can be aggregated into business sectors (finance, retail, medical, other), government, education, and NGO.12 Data breaches in the medical sector are about double any other sector, with a huge increase in the last couple of years. This could be a fact, a function of disclosure, or a function of disclosure and reporting. In contrast to the aggregated data, the main source of data breach in the medical sector is lost paper documents and lost laptops. But, insider fraud has a rising role. 12. More granular data, including firm identifiers, can be obtained directly from the PRC website. The Open Security Foundation did have a public online database (until 2007, see it used in the Karagodsky and Mann [2011] reference), but it now is behind a permission wall. Efforts to obtain access were not successful. These two sources both draw from public announcements of data breaches. Cursory analysis comparing the two databases for overlapping years shows similarity, but they are not identical.

Fig. 11.6 Data breaches, medical institutions
Fig. 11.7 Medical, records per breach over time

(Recall that for the aggregated data, outsider hacking appears the greatest threat.) The vast majority of data breaches for medical institutions are small breaches—1,000 to 10,000 records lost—but a lot of these data breaches reveal SSN. However, when the number of records lost with SSN is considered relative to other sectors, the medical sector is not the largest problem sector. (See figures 11.6, 11.7, and 11.8.)

Fig. 11.8 Number of breaches with SSN by sector
Fig. 11.9 Records with SSN by industry

The chart (fig. 11.9) on records lost that compromise SSN reveals that retail is another sector that has a lot of data breaches. As shown in figures 11.10 and 11.11, the vast majority of data breaches in retail are by hackers. The number of records lost per breach is generally very small, and the number of breaches that reveal SSN is generally quite small. But, when the retail sector experiences a big exposure (2007 and 2011, and Target in 2013,

Fig. 11.10 Data breaches, retail
Fig. 11.11 Retail, records per breach over time

not yet in the data set), the loss of records with SSN is enormous. The chart also reveals that 2009, which was the low point for overall breaches, was low because of the low number of small retail breaches. The Great Recession hit consumer spending and small business retailing relatively hard. So, the relationship between macroeconomic activity and data breaches may warrant further analysis. (See figures 11.10 and 11.11.)

Fig. 11.12 Data breaches, finance/insurance
Fig. 11.13 Financial, records per breach over time

A third sector of particular interest is financial and insurance institutions. The number of data breaches appears to be under control. However, the origin of the breach through insiders is a significantly greater share than in other sectors, and both hackers and unintended disclosures are also large. Very large breaches occur nearly every year, along with mid-size breaches, and these breaches often contain SSN. (See figures 11.12 and 11.13.) Government and educational institutions lose data both from hacking and from unintended disclosure. The bulk of the losses in the education

Fig. 11.14 Data breaches, educational institutions
Fig. 11.15 Data breaches, government entities

sector are small, but the government has experienced some very large losses, and with a large number of records containing the SSN. (See figures 11.14, 11.15, 11.16, and 11.17.) In sum, the sectoral decomposition of the data suggests that a one-size-fits-all approach to evaluating the costs of data breaches or the approach to data security is not appropriate. The sectors differ in terms of how data are lost and which size breach is most prevalent.

Fig. 11.16 Educational institutions, records per breach (2005–2012)
Fig. 11.17 Government, records per breach (2005–2012)

11.3.4 Cross-Border Data Breaches

Cross-border data breaches have two dimensions. A US institution or consumer may lose information to foreign perpetrators, or a US institution, when it incurs a data breach, may expose the personal information of a foreign person or firm. What are the characteristics of these cross-border breaches? The picture is quite murky. First, only the United States has, since 2003, required public announcement. So a time series of public

Table 11.1 Geographical origin of external information lost, percent of incidents

                     2007    2008    2009    2010    2011
America-North          23      15      19      19
America-South           3       6     n/a

T0, where T0 is a quality/marketability threshold such that products brought to market are expected to cover costs. Technological change then brings two shocks to the market. First, piracy makes it more difficult to generate revenue, which raises the entry threshold T. But concurrent technological changes make it possible to record music and make it available to the public (and to learn its true quality) at lower cost. This allows firms to operate with a reduced T, which we refer to as T1 when they use the lower-cost mode of production, promotion, and distribution. If artist marketability were perfectly predictable at the time of investment, then all artists with true (realized) quality above the threshold (q > T) would be brought to market. If the threshold fell from T0 to T1, then additional products with less ex ante promise would be brought to market. This would perforce benefit consumers, but the benefit would be relatively small, since all of the newly available products would have quality between T0 and T1. But as noted above, artist marketability is very unpredictable, so a relaxation of the entry threshold can raise the number of products that are highly marketable ex post, not just the number of products with ex post value between T0 and T1. Under the lower threshold, a product is launched when ex ante promise exceeds T1, which occurs when qi > T1 − εi. Provided that ex post success is sufficiently unpredictable—var(ε) is sufficiently large—the lower-cost entry condition will give rise to additional entry of products with ex post marketability in excess of T0. In short, provided that T1 < T0 and artist marketability is unpredictable, we can expect an increase in the quantity of high-quality products brought to market when T declines. This framework, while simple, puts some structure on our inquiry. The first question is whether, in light of both piracy and potential cost reductions, the effective threshold has risen or fallen (and, by extension, whether more or fewer products come to market). Given an affirmative answer to the first question, a second question is whether the new products with less ex ante promise—and which previously would have been less likely to be launched—add substantially to the welfare delivered by available products. This is a difficult question, but we can certainly ask whether products launched by independent labels—and using low-cost methods of production, promotion, and distribution—grow more likely to become commercially


successful. These questions, along with evidence about mechanism, occupy most of the rest of the study.
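A small simulation makes the logic concrete: when the ex ante signal is a noisy guide to realized appeal, lowering the entry threshold admits additional products whose realized quality turns out to exceed the old bar. The thresholds, distributions, and sample size below are illustrative assumptions, not parameters estimated in this chapter.

    # Sketch: entry thresholds with unpredictable success. q is realized (ex post)
    # marketability; the ex ante promise observed at investment time is q + e.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000                          # candidate products (illustrative)
    q = rng.normal(0.0, 1.0, n)          # realized marketability
    e = rng.normal(0.0, 2.0, n)          # large var(e): success is hard to predict
    promise = q + e                      # what firms see when deciding to launch

    T0, T1 = 2.0, 1.0                    # old and new entry thresholds (assumed)

    winners_old = int(((promise > T0) & (q > T0)).sum())
    winners_new = int(((promise > T1) & (q > T0)).sum())

    print("releases with ex post marketability above T0, threshold T0:", winners_old)
    print("releases with ex post marketability above T0, threshold T1:", winners_new)
    # Lowering the threshold raises the count of ex post high-quality releases
    # precisely because the unpredictable component is large.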

14.2.1 Data

I develop two basic data sets for this study using data from nine underlying sources. The first data set is a list of albums released in the United States from 1980 to 2010, where for each album I attempt to classify its label (major, independent, self-released) and its format (physical versus digital). The second basic data set is a list of commercially successful albums based on their inclusion on weekly top-selling album lists, along with my estimates of the albums’ actual sales. These albums are then linked with measures of traditional radio airplay, promotion on Internet radio, coverage by music critics, and a designation of whether the album is on an independent record label. The nine underlying data sources for this study may be grouped into six components. First, I have weekly rankings of US album sales, from three separate weekly Billboard charts. First among these charts is the Billboard 200 (from 1990 to 2011), which lists the top 200 bestselling albums of the week, based on Soundscan data.15 Second, I observe the Heatseekers chart (2000‒2011), which shows the weekly top 50 albums among artists who have never appeared in the top 100 of the Billboard 200, nor have they ever appeared in the top 10 of the more specialized Billboard charts.16 Heatseeker artists can be viewed as artists emerging as commercially successful. Finally, I also observe the Billboard Independent chart, which shows the week’s topselling albums from independent music labels. I observe this for 2001–2011.17 All of the Billboard charts are obtained from Billboard.biz. Second, I observe two measures of traditional US airplay, from the Billboard Hot 100 airplay chart which, ironically, lists the 75 most aired songs of the week in the United States and from USA Top 200, which lists “the top 200 songs on US radio” each week. The Billboard chart lists “the week’s most popular songs across all genres, ranked by radio airplay audience impressions measured from Nielsen BDS.” Spins are weighted by numbers of apparent listeners.18 I observe this for 1990–2011, again from Billboard. biz. Because I observe the top 75 songs of each week and not the entire universe of songs aired on the radio, I refer to the songs on the airplay charts as songs with “substantial airplay.” I have a separate measure of airplay, the USA Airplay Top 200 (“The most played tracks on USA radio stations”) between February 2009 and the end of 2011.19 The latter source has the 15. The underlying data include 272,000 entries from weekly top-200 album sales charts, 1990–2011. 16. The underlying data include 31,775 entries from weekly top-50 album charts, 2000–2011. 17. The underlying data include 28,775 entries from weekly top-50 independent album charts, 2001–2011. 18. http://www.billboard.com/charts/radio-songs#/charts/radio-songs. 19. See http://www.charly1300.com/usaairplay.htm, accessed June 15, 2012.


advantage of covering nearly three times as many songs per week. Because airplay data cover songs while my sales data describe albums, I aggregate both to the artist-year for linking and analysis. Third, I observe critical assessments of new albums from Metacritic. Metacritic reports an assessment of each album on a 100-point scale. They report a score if at least three of over 100 underlying critical sites report a review of an album. Metacritic appeared in 2000, so these reviews cover the period 2000‒2011, and the coverage grows over the decade. There are 485 reviews in 2000, 867 in 2005, and 1,037 in 2010. According to Metacritic, We try to include as many new releases as possible, in a variety of genres. Generally, major pop, rock, rap and alternative releases will be included. We also try to include many indie and electronic artists, as well as major releases in other categories (country, etc.). Occasionally, we will also include import-only items (generally, UK releases) if it appears that they will not be released in the United States in the foreseeable future (otherwise, we will typically wait for the US release). Remember, if an album does not show up in at least 3 of the publications we use, it probably will not be included on the site.20 Fourth, I have data on the weekly rankings of songs aired at Internet radio site Last.fm from April 3, 2005 to May 29, 2011. While Pandora is the largest and most prominent Internet radio site, I lack Pandora listening data.21 However, listening data on Last.fm are more readily available. According to Alexa.com, Pandora was the 308th ranked global site, and the fifty-fifth US site, on June 11, 2012. Last.fm is lower ranked: 766 globally and 549 in the United States. Last.fm reports the top 420 songs, according to the number of listeners, for each week. Fifth, I observe RIAA data on total album shipments by year (1989‒2011) as well as gold (0.5 million), platinum (1.0 million), and multiplatinum album certifications, 1958‒2011. As I detail in section 14.3, I use the certification data in conjunction with Billboard sales rankings to construct weekly estimates of album sales, by album. Sixth, I have a list of works of new recorded music, from Discogs.com. Discogs is a user-generated data set that bills itself as “the largest and most accurate music database . . . containing information on artists, labels, and their recordings.” Using Discogs, I created a data set consisting of every US album released from 1980 to 2010. This is a total of 203,258 separate releases. (I aggregate versions on different media, e.g., CD, vinyl, file, into a single release.) My focus is albums, so I exclude singles. There are 38,634 distinct labels among my Discogs data, and classifying 20. From “How do you determine what albums to include on the site?”, at https://metacritic.custhelp.com/app/answers/detail/a_id/1518/session/L3Nuby8wL3NpZC9DOFVxQkczaw==, published June 10, 2010. 21. See http://www.edisonresearch.com/wp-content/uploads/2013/04/Edison_Research_Arbitron_Infinite_Dial_2013.pdf.


labels as major versus independents turns out to be challenging. Major labels are generally understood to be those labels owned by three underlying firms: Universal, Sony/BMG, Warner, and, until recently, EMI. Unfortunately, for the purpose of identifying them in the data, labels operate with many imprints, as the tallies above suggest. While published sources document the histories of some of the major imprints (e.g., Southall 2003), such published sources cover only a small fraction of the labels in these data. Fortunately, I can rely on a few other approaches to identify many labels that are either definitely major or definitely independent. First, a recent study by Thomson (2010) attempts to calculate the share of music on the radio released by independent record labels. For this purpose she needed to classify thousands of underlying albums’ labels as major or independent. She enlisted the help of the American Association of Independent Music (A2IM) to create a list of major and independent record labels. Her list includes 6,358 labels, of which all but 688 could be coded as major or independent.22 I begin with her classification. I also classify as major a label whose name includes the name of a major label (e.g., Warner, EMI, etc.). Finally, I classify as independent any label that Discogs refers to as “underground,” “independent,” “experimental,” “minor,” or “not a real label.” Despite all of these efforts, matching is incomplete. Of the works in Discogs, 26 percent can be identified as being on major labels. Another 20 percent of works can be identified as independent-label releases, and 3 percent are self-released. This leaves the label types for 51 percent of the albums in the database unidentified. That said, there is reason to believe that the releases on unknown labels are not from major record labels. Of the releases on unknown labels, 40 percent are on labels that release albums by no more than five artists. In some calculations below, I treat the unclassified labels as nonmajor labels.
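The classification rules just described map directly into a small rule-based function. The sketch below is a hypothetical implementation: the A2IM lookup table, the keyword list, and the Discogs tag set are stand-ins for the actual sources, and real matching would also need to handle imprint variants.

    # Rule-based sketch of the major/independent/unknown label classification
    # described in the text. Lookup tables and tag lists are illustrative.
    MAJOR_KEYWORDS = ["universal", "sony", "bmg", "warner", "emi"]
    INDIE_TAGS = {"underground", "independent", "experimental", "minor", "not a real label"}

    def classify_label(name, a2im_lookup, discogs_tags):
        """Return 'major', 'independent', 'self-released', or 'unknown'."""
        key = name.strip().lower()
        if key in a2im_lookup:                           # Thomson/A2IM coding first
            return a2im_lookup[key]
        if any(word in key for word in MAJOR_KEYWORDS):  # name contains a major's name
            return "major"
        if INDIE_TAGS & {t.lower() for t in discogs_tags}:
            return "independent"
        if key in {"self-released", "not on label"}:
            return "self-released"
        return "unknown"

    # Example: classify_label("Warner Bros. Records", {}, []) returns 'major'.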

14.3 Inferring Sales Quantities from Sales Ranks and Album Certifications

We would like to have data on the quantities sold for all albums, by album, but such data are unfortunately expensive to obtain. Fortunately, we can use the data at hand to construct reasonable estimates of sales for almost all albums. We have data on the weekly sales ranks of the top 200 selling albums, as well as sales milestones (0.5 million and multiples of one million) for high-selling albums. In addition, we have data on the total sales of all albums by year. It is usual to assume that sales distributions follow power laws (see Chevalier and Goolsbee 2003; Brynjolfsson, Smith, and Hu 2003). That is, sales 22. A small number of additional labels have the classifications Disney and legacy, respectively.


quantities are believed to bear simple relationships with sales ranks. To be specific, s_it = α r_it^(−β), where s_it is sales of album i in week t, r_it is the sales rank of album i in week t, and α and β are parameters. Because we observe when sales pass various thresholds, say, 0.5 million at gold certification, we can econometrically estimate α and β. Define the cumulative sales for album i in period τ as S_iτ. Thus, S_iτ = α ∑_{t=0}^{τ} r_it^(−β). If we include an additive error, we can estimate the parameters via nonlinear least squares. The coefficients have the following interpretation: α provides an estimate of the weekly sales of a number one-ranked album. The parameter β describes how quickly sales fall in ranks. A few adjustments are needed for realism. Because the size of the market is changing over time, the parameters are not necessarily constant. We have data on thousands of album certifications across many years, so we can be flexible about the parameters. Given estimates of the parameters, we can construct estimated sales of each album in each week (or each year). We can use these data to calculate, say, the share of sales attributable to independent-label albums. We can also calculate the extent to which sales are concentrated in each year. Data on certification-based sales provide some guidance on parameter stability. We can calculate the sales for the top-selling albums of the 1970s, 1980s, 1990s, and the first decade of the twenty-first century. We can then compare the log sales-log rank relationships across decades. (To be clear, these are not the Billboard weekly sales ranks referred to as r_it above; rather, these are ranks based on total sales ever from RIAA certification data.) Table 14.1 presents a regression of log sales on log ranks, where the constant and slope coefficients are allowed to vary across releases from the different decades, 1970‒2010. Not surprisingly, the constant term varies substantially

Table 14.1 Log sales and log rank using certification data

                    Coef.       Std. err.
Alpha
  1970              Omitted
  1980              0.8232      0.0649
  1990              1.2295      0.0596
  2000              0.1156      0.0610
Beta
  1970              ‒0.6717     0.0093
  1980              ‒0.7547     0.0063
  1990              ‒0.7376     0.0043
  2000              ‒0.6105     0.0048
Constant            3.8853      0.0515

Note: Regression of the log certification-based sales of albums released 1970‒2010 on their log sales rank within the decade.


across decades, reflecting the differing sales levels in the different decades. The constant term rises from the 1970s to the 1990s, then falls substantially in the first decade of the twenty-first century. (The exponentiated constants provide estimates of the sales of the top-ranked album of each decade.) The slope coefficient varies less across decades. In particular, it rises in absolute value from 0.65 in the 1970s to 0.75 during the 1980s and 1990s. The coefficient then falls in the first decade of the twenty-first century back to its level in the 1970s. A lower slope coefficient indicates that sales fall off less in ranks. The recent decline in the slope coefficient indicates that recent sales are less concentrated among the highest-ranked albums. These results indicate that we will want to allow the constant term to vary over time. We implement the nonlinear least squares estimation with 3,272 albums receiving certification, released between 1986 and 2010. There is an apparent bunching of certifications of particular albums. That is, the gold and platinum certifications sometimes appear on the same date. Hence, I use only the sales associated with the highest certification for each album, and I assume that the accumulated certification level of sales has occurred by the time of the last certification. Table 14.2 reports results. The first column reports a restrictive specification that holds both α and β constant over time. The second specification relaxes the constancy of α. Regardless of the method used, the β estimate is roughly 0.6. The α term varies over time with overall album sales. The rise in α in 2010 arises because the certification data end in 2010. Hence, the coefficient reflects the relationship between BB200 weekly ranks and the selected sample of albums that quickly achieve sales certification. Putting the 2010 coefficient aside, the pattern of α coefficients tracks overall sales trends, peaking around 1999 and falling thereafter. Figure 14.1 plots coefficients against total annual album shipments, both normalized to 1 in 1999, and the correspondence is close. One shortcoming of the above approach is that it does not incorporate information about annual aggregate album sales. That is, nothing constrains the sum of simulated sales across albums to equal total reported shipments for the year. If we were to assume that the sales of albums that never appear on the Billboard weekly top 200 are negligible—in effect, that only about 500–1,000 albums per year had nonzero sales—then we would expect the sum of the implied sales across weeks in a year to equal the year’s aggregate sales. That is, if we define σ_y as the aggregate album sales in year y, then α ∑_{i=1}^{T} ∑_{t=0}^{52} r_it^(−β) = σ_y. This can be rewritten as α = σ_y / (∑_{i=1}^{T} ∑_{t=0}^{52} r_it^(−β)). That is, once we have an estimate of β that we wish to apply to year y, we can infer α for that year as well. The sum of the simulated sales of the albums appearing in the Billboard 200 at some point during the year then equals the actual aggregate sales. I use this approach, which causes the sales tabulations of Billboard 200 albums to equal total shipments.


Table 14.2 Nonlinear least squares estimates of the relationship between RIAA certification-based sales and weekly Billboard album sales ranks

                    (1)         (2)
Alpha               0.3422
Beta                0.60063     0.61577
Alpha
  1986                          0.3495
  1987                          0.04438
  1988                          0.3216
  1989                          0.3928
  1990                          0.30106
  1991                          0.23195
  1992                          0.31962
  1993                          0.4321
  1994                          0.58778
  1995                          0.44124
  1996                          0.46895
  1997                          0.42882
  1998                          0.4038
  1999                          0.53432
  2000                          0.45097
  2001                          0.48995
  2002                          0.40985
  2003                          0.32757
  2004                          0.4351
  2005                          0.2871
  2006                          0.20662
  2007                          0.24924
  2008                          0.23785
  2009                          0.15882
  2010                          0.82928

Notes: Estimates calculated using amoeba search algorithm. Standard errors to follow via bootstrapping.
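To make the two steps concrete, here is a minimal sketch of the nonlinear least squares fit and of the shipments-based calibration of α. The rank lists, certification sales, and starting values are invented placeholders, and the code is a simplified stand-in for the chapter's actual procedure, which pools thousands of certified albums and lets α vary by year.

    # Sketch of the power-law sales model s_it = alpha * r_it**(-beta), fit to
    # certification milestones, plus the aggregate-shipments calibration of alpha.
    import numpy as np
    from scipy.optimize import least_squares

    # Hypothetical inputs: weekly Billboard ranks observed up to each album's
    # certification date, and the certification-based cumulative sales.
    ranks = [np.array([1.0, 2, 5, 9, 20, 44]), np.array([3.0, 7, 15, 40, 80])]
    certified_sales = np.array([2_000_000, 500_000])

    def residuals(params):
        alpha, beta = params
        predicted = np.array([alpha * np.sum(r ** (-beta)) for r in ranks])
        return predicted - certified_sales

    alpha_hat, beta_hat = least_squares(residuals, x0=[100_000.0, 0.6]).x

    # Alternative calibration: hold beta fixed and choose alpha_y so that the
    # simulated sales of the year's Billboard 200 albums sum to total shipments.
    def calibrate_alpha(total_shipments, ranks_in_year, beta):
        return total_shipments / sum(np.sum(r ** (-beta)) for r in ranks_in_year)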

14.4 The Changing Information Environment for Consumers

14.4.1 Internet versus Traditional Radio

Traditional radio operates in a relatively small number of predefined programming formats (top 40, adult contemporary, and so on), providing venues for the promotion of a relatively small share of new music. Major-label music dominates airplay on traditional radio. Thomson (2010) documents that between 2005 and 2008, music from independent labels accounted for 12–13 percent of US airplay. Three recent developments hold the possibility of changing the number of new music products of which consumers are cognizant: Internet radio, expanded online criticism, as well as social media. While traditional radio

Fig. 14.1 Alpha and album shipments

stations have publicized a small number of artists in preordained formats, Internet radio allows listeners to tailor stations narrowly to their tastes. At Pandora, for example, users “seed” their stations with songs or artists that they like. Pandora then presents other songs that are similar. Last.fm operates similarly. While this personalization need not lead to a greater variety of artists receiving airplay—it would be possible for all listeners to seed their stations with the same songs or artists—in practice, personalization provides promotion for artists not receiving substantial traditional airplay. To explore Internet radio listening patterns, I obtained song-listening statistics from Last.fm’s weekly song chart, Feb. 2005–July 2011. Each week Last.fm reports the number of listeners for each of the top 420 songs at Last.fm. Figure 14.2 provides a characterization of listener volumes as a function of song rank on Last.fm. In 2010, a top-ranked song (according to volume of listeners) had about 38,000 weekly listeners. The 100th-ranked song had about 13,000, and the 400th song had roughly 8,000. I then compare the artists on Last.fm with those on traditional radio airplay charts. Unfortunately, both of my airplay data sources are incomplete. Thomson (2010) documents that, over the course of a year (between 2005 and 2008), the top 100 songs accounted for about 11 percent of airplay, the top 1,000 songs accounted for almost 40 percent, and the top 10,000 accounted for nearly 90 percent. While the Billboard airplay data include 3,900 (75 × 52) song listings per year, songs persist on the charts, so the total number of distinct songs making the Billboard airplay charts is about 330 per year. The USA

Fig. 14.2 Listening rank and weekly listeners, 2010

Airplay data go deeper. In 2010, the chart included 10,400 entries and 662 distinct songs. While I am missing more than half of the songs on the radio, I can still document stark differences between radio airplay and Internet radio artist coverage. Despite the differences in list depth, both the Billboard airplay charts and Last.fm’s song chart include roughly the same number of artists per year. In 2006 (with the first full year of data on Last.fm), Billboard’s weekly top 75 lists included a total of 253 artists across the year. Last.fm’s weekly songs lists included a total of 183 artists. Only thirty-three artists appeared on both lists. The overlap is quite similar in subsequent years. The degree of overlap by listening is somewhat larger than the overlap by artists: of the 2006 listening at Last.fm, 26 percent was to artists also on the Billboard airplay charts. Figures for 2007‒2010 are similar. While this leaves open the possibility that the Last.fm songs are nevertheless on the radio, the degree of overlap with the longer USA Top 200 Airplay list is similarly low. In 2010, nearly 70 percent of the songs on Last.fm are not among those on the USA Top 200 list. We see other indications that airplay patterns differ between traditional and Internet radio. I can construct crude indices of song listening from rank data as the reciprocal of the weekly rank, summed across weeks in the year. The correlation between this measure of listening across the two traditional airplay data sets is 0.75. The correlation between the airplay index from the Top 200 data and the Last.fm listening measure is 0.15. These results indicate that the majority of Last.fm listening appears to be for music not widely played on traditional radio and that Internet radio provides promotion for music that is less heavily promoted on commercial radio.
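The reciprocal-rank index and the overlap tabulations amount to a few lines of data manipulation. The sketch below assumes hypothetical chart extracts with columns artist, week, and rank; the file and column names are placeholders rather than the actual Billboard or Last.fm formats.

    # Sketch: reciprocal-rank listening/airplay index by artist, and the overlap
    # between a traditional airplay chart and the Last.fm weekly song chart.
    import pandas as pd

    def rank_index(chart):
        # Sum of 1/rank over all weekly chart appearances, by artist.
        return (1.0 / chart["rank"]).groupby(chart["artist"]).sum()

    bb = pd.read_csv("billboard_airplay_2006.csv")     # hypothetical extracts
    fm = pd.read_csv("lastfm_songs_2006.csv")

    bb_index, fm_index = rank_index(bb), rank_index(fm)
    overlap = bb_index.index.intersection(fm_index.index)

    print("artists on both charts:", len(overlap))
    print("correlation among overlapping artists:",
          bb_index[overlap].corr(fm_index[overlap]))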

Fig. 14.3 Ranks in 2006 among artists on both

Among the songs on BB airplay and Last.fm lists, the correlation of airplay frequency is low (see figure 14.3 for scatter plot). There is other evidence that the two kinds of outlets allow the promotion of different sorts of artists. Tables 14.3 and 14.4, respectively, provide lists of the most heavily played artists on Last.fm not appearing on the BB list, and vice versa. Comparison of the lists shows clearly that Last.fm is comparatively skewed toward independent-label artists. Despite the shortcomings of the available airplay data, it seems clear that traditional and Internet radio provide promotional opportunities for different kinds of artists.

14.4.2 Growing Online Criticism

Critical assessments also substantively expand the set of artists promoted to consumers. Along with many other effects of digitization, the Internet has led to an explosion of outlets providing critical assessment of new music. Since 1995 the number of outlets reviewing new music—and the number of reviews produced per year—has doubled. Moreover, these reviews are made available freely on the Web (through sites like Metacritic and Pitchfork). These information sources hold the possibility of challenging radio’s centrality in influencing musical discovery. Of course, music criticism predates the Internet, but the growth of the Internet has been accompanied by a substantial growth in outlets offering music criticism. Metacritic.com is a website offering distilled numerical ratings of new music. It has operated since 2000 and draws on over 100 sources of professional music criticism. Metacritic reports a “Metascore” for an album—a translation of reviews into a numerical score


Table 14.3 Top artists on Last.fm in 2006 without BB airplay

Artist                  Listeners
Death Cab for Cutie     5,200,000
Coldplay                5,200,000
Radiohead               4,700,000
Muse                    3,900,000
Arctic Monkeys          3,000,000
The Postal Service      2,800,000
The Beatles             2,400,000
System of a Down        2,300,000
Bloc Party              2,100,000
Nirvana                 1,900,000
The Arcade Fire         1,900,000
Franz Ferdinand         1,700,000
Pink Floyd              1,400,000
The Strokes             1,300,000
The Shins               1,100,000
Interpol                1,100,000
Metallica               1,000,000
Linkin Park             973,630
Placebo                 914,018
Thom Yorke              860,097
Jack Johnson            823,208
The White Stripes       806,304
Oasis                   759,511
Yeah Yeah Yeahs         685,532
Sufjan Stevens          674,766

Note: “Listeners” is the sum of weekly listeners for each of the artists’ songs appearing on the weekly top song lists across all weeks in the year. Included artists are those not appearing on the Billboard airplay list during the year.

between 0 and 100—if at least three of its underlying sources review an album. Underlying sources include originally offline magazines such as Rolling Stone, as well as newspapers. But many sources, such as Pitchfork, came into existence with, or since, the Internet. Of the reviews in Metacritic for albums released since 2000, over half are from sources founded since 1995. (See figure 14.4.) If these outlets can inform consumers about music, they may supplant the traditional role of radio. The number of albums reviewed at Metacritic has grown from 222 in 2000 to 835 in 2010, as table 14.5 shows. The vast majority of these albums are by artists who do not receive substantial airplay on traditional radio stations. I also note that social media are likely having significant effects on consumers’ awareness of music and other media products. Pew (2012) documents that across twenty countries, the median share of respondents “using social networking sites to share their views about music and movies” was 67 percent. An emerging body of evidence examines links between user-generated content and the success of new media products (see, e.g., Dellarocas, Awad, and Zhang 2007; Dewan and Ramaprasad 2012). The evidence

Table 14.4 Top 2006 airplay artists not on Last.fm weekly top 420

Artist                  BB airplay index
Mary J. Blige           14.3111
Beyonce                 12.01077
Ne-Yo                   10.25575
Cassie                  9.814961
Chris Brown             9.78202
Yung Joc                8.242962
Shakira                 6.865558
Ludacris                6.041351
Chamillionaire          5.734164
Akon                    5.227035
Chingy                  4.291855
The Pussycat Dolls      3.868749
T.I.                    3.838763
Nelly                   3.655194
Dem Franchize Boyz      3.337012
Field Mob               3.009316
Lil Jon                 2.825482
Jamie Foxx              2.409102
Natasha Bedingfield     2.189499
E-40                    2.088703
Rascal Flatts           1.898755
Cherish                 1.891394
Bow Wow                 1.870972
Ciara                   1.863268
T-Pain                  1.803415

Note: BB airplay index is the sum of (1/rank) across airplay chart entries for the artist within a year. Included artists are those not appearing on the Last.fm weekly top song lists during the year.

Fig. 14.4 Growth in reviews, sources founded since 1980 with over 2,000 reviews in Metacritic


Table 14.5 Number of artists appearing annually on lists

Year    Discogs releases    BB airplay    Metacritic    BB 200    Last.fm
1990    2,534               88            n/a           575       n/a
1991    2,742               244           n/a           507       n/a
1992    3,008               237           n/a           474       n/a
1993    3,425               238           n/a           530       n/a
1994    3,893               211           n/a           514       n/a
1995    4,532               204           n/a           532       n/a
1996    3,880               197           n/a           570       n/a
1997    5,029               220           n/a           598       n/a
1998    5,198               217           n/a           599       n/a
1999    5,482               194           17            605       n/a
2000    5,586               216           222           661       n/a
2001    5,709               206           306           723       n/a
2002    5,768               213           353           737       n/a
2003    6,057               202           419           781       n/a
2004    6,566               220           448           800       n/a
2005    7,118               202           462           810       175
2006    7,862               211           492           877       183
2007    8,707               195           484           927       182
2008    9,191               206           798           1,021     197
2009    8,875               198           954           1,101     208
2010    8,226               178           835           1,018     229

presented above on Internet radio and criticism almost surely understates the growth in the richness of the information environment surrounding new media products.

14.5 Results

We are now in a position to evaluate the net effect of piracy and cost reduction, in conjunction with the changed information environment, on the volume and quality of new work brought to market. Do we see a greater volume of releases by artists with less ex ante promise? And does these artists’ music contribute substantially to the products with ex post success?

14.5.1 Volumes of Major- and Independent-Label Releases

The first question is how the number and mix of new products has evolved. Have the majors reduced the number of new releases? Have the independent labels increased their volume of releases? I have access to two broad measures of the numbers of albums released each year in the United States. The first is an aggregate time series of album releases from the Nielsen Soundscan database. To appear among those data, an album must sell at least one copy during the year. According to Nielsen, the number of new albums released annually was 36,000 in 2000, grew to 106,000 in 2008, and

Fig. 14.5 Major, indie, and self-releases, excluding unknowns

has since fallen to about 75,000.23 It is quite clear, as Oberholzer-Gee and Strumpf (2010) have pointed out, that there has been substantial growth in the number of albums released annually since 2000.24 Because I lack access to the underlying Nielsen data, I cannot classify those releases by label type. The Discogs data, while they cover only about a tenth of the total releases in Soundscan, contain album-level info along with label type. It is difficult to know how the Discogs and Soundscan samples relate to one another. Soundscan includes all music genres, while the Discogs figure here include only rock music. Inclusion in Discogs is not mechanically driven by sales; rather, albums are included because users contribute information. It is nevertheless encouraging that the total numbers of albums released according to respective data sources follow similar trends, rising from 2000 to 2009, then falling. With the caveat about representativeness in mind, we can use the Discogs data to see how releases evolve over time by label type. Figure 14.5 provides a description based on only the identifiable label observations. Releases from major labels far outnumber independent releases between 1980 and roughly 2001. Since then, major-label releases have declined by more than half. The numbers of identifiable independent-label releases and self-released albums show a different pattern. While independent releases were a fraction of major-label releases between 1980 and 1995, they surpassed major-label 23. Data for 2000, 2008–2010 are reported at http://www.digitalmusicnews.com/stories /021811albums. Data for 2011 are reported at http://www.businesswire.com/news/home/2012 0105005547/en/Nielsen-Company-Billboard%E2%80%99s-2011-Music-Industry-Report. 24. See also Handke (2012).

Fig. 14.6 Releases by type, including unknowns

releases in 2001. In 2010 identifiable independent-label releases outnumber major-label releases by a factor of two. Self-released recordings have also increased sharply, from a few hundred in the year 2000 to over a thousand in 2010.25 Figure 14.6 aggregates independent releases, self-releases, and the releases on unknown labels (which we suspect generally to be independent of the majors). While major-label releases are, again, declining, it is clear that overall releases are increasing. We have argued that the growth in new releases is driven by changed technologies for production and distribution. We see some direct evidence for this in a breakdown of new releases by whether they are physical or digital, in figure 14.7. I classify as “digital” the releases available only as digital files. Interestingly, there is a fairly substantial decline in the number of releases that include a physical version, but there is a rather substantial growth in digital-only releases, which by their nature have lower distribution costs.26 While major-label releases have declined sharply over the past decade, releases of independent and self-released albums have increased even more, driven in part by growth in purely digital products. The number of new 25. A curious feature of the data is that the number of releases—both independent and major—appears to have fallen recently. Annual major label releases peak in 1999; annual independent label releases peak in 2007. It is not clear whether the decline is real—it may be an artifact of the user-contributed nature of Discogs. Perhaps it takes a few years for users to fill-in recent years. Regardless of these timing issues, the number of major-label releases has fallen relative to the number of independent-label releases. This is a rather significant change relative to earlier periods covered in these data. 26. I include only multisong compilations in the data; that is, singles are excluded.
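As an illustration of the tallies behind figures 14.5 through 14.7, the sketch below groups a hypothetical Discogs-style release table by year and label type and flags digital-only releases. The schema and the physical-format list are assumptions for exposition; the actual Discogs data are structured differently.

```python
# Minimal sketch of the release counts by label type and by format, assuming a
# hypothetical table with one row per release and columns: year, label_type
# ("major", "indie", "self", "unknown"), and formats (the release's formats).
import pandas as pd

releases = pd.DataFrame({
    "year": [2000, 2000, 2001, 2010, 2010, 2010],
    "label_type": ["major", "indie", "major", "indie", "self", "unknown"],
    "formats": [["CD"], ["CD"], ["CD", "File"], ["File"], ["File"], ["CD"]],
})

# A release is "digital-only" if no physical format is attached to it.
PHYSICAL = {"CD", "Vinyl", "Cassette"}
releases["digital_only"] = releases["formats"].apply(
    lambda fmts: not any(f in PHYSICAL for f in fmts)
)

# Counts by year and label type (figure 14.5/14.6 style).
by_label = releases.groupby(["year", "label_type"]).size().unstack(fill_value=0)

# Counts of physical vs. digital-only releases (figure 14.7 style).
by_format = releases.groupby(["year", "digital_only"]).size().unstack(fill_value=0)

print(by_label)
print(by_format)
```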

Fig. 14.7 Physical and digital releases

products coming into existence has continued to grow over time despite the collapse in revenue. While growth in releases, as indicated by both Soundscan and Discogs, is consistent with growth in the number of products that consumers might discover to be appealing, neither the Soundscan nor the Discogs lists provide a direct measure of what we would like to observe. The story I am advancing here depends on digitization allowing more pieces of new music to get tested in the market. More releases may be a piece of this, but more releases do not by themselves indicate more “experimentation.” Determining whether a product has appeal requires some substantial subset of consumers to listen and decide whether they find it appealing. Prior to digitization there was a relatively bright line between releases promoted on the radio and others. In the digital era, releases that are not promoted on the radio can nevertheless get exposure with consumers. Quantifying the extent of experimentation is challenging, if not impossible. At one extreme it is clear that the number of products that consumers can evaluate has risen. But even in the new digital world, it seems implausible to think that all 75,000 (or 100,000) new releases can be vetted to determine whether they are appealing to consumers. Still, in the language of the model, more products, including those with less ex ante promise, are now coming to market.27 27. The growth in the releases echoes a growth in the number of record labels than Handke (2012) documents operating in Germany.

Fig. 14.8 Distinct artists on the BB 200

14.5.2 Sales Concentration

A growth in the available number of products tends generally to effect a combination of market expansion and business stealing, as new options draw some people to consumption and others from existing to new products. The spread of music piracy after 1999 (and the attendant reduction in music sales) obscures any market-expanding impacts of appealing new products. What we can study, instead, is whether new kinds of products (e.g., those that would not previously have been released) take market share from traditional types of products. We begin this inquiry in this section by documenting the evolution of sales concentration over the past few decades. By construction, the number of weekly Billboard 200 listings is 10,400 per year (52 × 200). The number of distinct artists on the list, by contrast, depends on the number of distinct albums per artist (typically only one) and the length of time an album remains on the list. If albums remained on the list for only one week, and if each artist had only one album per year, then 10,400 artists would appear on the list during the year. At the other extreme, if albums remained on the list all year, then with one album per artist, 200 artists would appear on the list during a year. Because albums tend to remain on the list for a long time, the actual number of artists appearing on the weekly Billboard 200 in a year is far closer to 200 than 10,000. After fluctuating around 600 between 1986 and 1999, the number of distinct artists has grown steadily from 600 to 1,000 at the end of the decade (see figure 14.8). We can explore sales concentration more directly with our simulated sales

Fig. 14.9 Simulated album log sales distributions, graphs by year

data. To this end, we predict weekly sales for each album, then aggregate these sales across weeks and artists to produce annual sales by artist. Figure 14.9 shows the distributions of log sales across artists for each year, 1990–2010. In the early years, the log sales distributions are single peaked, with a peak near zero, meaning that the central tendency is for albums to have nearly one million in sales. As time goes on, mass in the distribution shifts left as a growing share of artists make shorter appearances on the chart (and a growing share of sales is accounted for by artists making short chart appearances). This figure makes it clear that sales are becoming less concentrated in a handful of artists. To say this another way, the increase in the number of available products seems to be manifested in a growth in the number of products achieving commercial success. This fact is interesting in itself, as it indicates a shift toward consumption of a broader array of music. It is also interesting as an example of a more general phenomenon. Entry, resulting from a reduction in entry costs relative to market size, need not reduce the concentration of consumption. Sutton (1991) describes contexts where quality is produced with fixed costs and consumers agree on quality. Some media products, including daily newspapers and motion pictures, conform to these conditions very well (see Berry and Waldfogel 2010; Ferreira, Petrin, and Waldfogel 2012). Music provides a contrast. Here, growth in the number

Fig. 14.10A Share of BB 200 with Billboard airplay

of products reaching consumers draws consumption to a wider array.28 This raises the question of how consumers are becoming aware of the growing number of new products.
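The concentration measures discussed in this section can be illustrated with a short sketch: counting distinct charting artists per year and computing the distribution of log annual sales from simulated weekly sales. The table layout and toy numbers below are assumptions, not the chapter's data.

```python
# Sketch of the two concentration measures, assuming a hypothetical weekly
# chart table with columns: year, week, artist, and a simulated weekly_sales
# figure for each chart entry. Neither the real chart data nor the sales
# simulation is reproduced here.
import numpy as np
import pandas as pd

chart = pd.DataFrame({
    "year": [1995, 1995, 1995, 2010, 2010, 2010, 2010],
    "week": [1, 2, 3, 1, 1, 2, 2],
    "artist": ["A", "A", "A", "A", "B", "C", "D"],
    "weekly_sales": [120_000, 110_000, 90_000, 60_000, 40_000, 35_000, 30_000],
})

# (1) Number of distinct artists appearing on the chart each year (figure 14.8).
distinct_artists = chart.groupby("year")["artist"].nunique()
print(distinct_artists)

# (2) Distribution of log annual sales by artist and year (figure 14.9):
# aggregate simulated weekly sales to the artist-year level, then take logs
# (here in millions, so a value near zero means roughly one million in sales).
annual = chart.groupby(["year", "artist"])["weekly_sales"].sum()
log_sales = np.log(annual / 1_000_000)
print(log_sales)
```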

14.5.3 Success and Promotional Channels

Airplay has traditionally been an important element of albums’ commercial success. Of the artists appearing in the Billboard 200 in 1991, just over 30 percent experienced substantial radio airplay. The top 200 includes albums selling both large and moderate quantities. If we restrict attention to the top 25 albums on the weekly Billboard 200, we see that 60 percent of BB top 25 artists also appeared on the BB airplay charts in 1991. While the share of BB top 25 artists receiving airplay fluctuated somewhat over the decade, it averaged about 50 percent and remained as high as 50 percent in 2001. In the past decade, the share of the BB top 25 with BB airplay has fallen steadily and now stands at about 28 percent. See figures 14.10A and 14.10B. Because Heatseekers are by definition not yet widely successful artists, we would expect less airplay, and we see this. But we also see a reduction in their airplay between 2000 and 2010. The share of Heatseeker artists with airplay falls from 8 percent to about 1 percent. See figure 14.11. 28. This suggests that horizontal differentiation is more important in music than in movies or newspapers, a finding reinforced in another study on the effect of market enlargement on music consumption. In Ferreria and Waldfogel (2013), a growth in world music trade promotes greater consumption of local music.

Fig. 14.10B Share of BB 25 with Billboard airplay

Fig. 14.11 Share of Heatseekers with Billboard airplay


Fig. 14.12 Share of BB 200 sales in albums with Billboard airplay

Using our simulated sales data, we can also calculate the share of sales attributable to albums with substantial airplay. Figure 14.12 shows that the share of sales for artists with concurrent radio airplay fell from about 55 percent of sales in 2000 to about 45 percent in 2010. While the share of artists with airplay declines, the share covered in Metacritic instead rises. The share of the Billboard 200 artists with contemporary (same-year) Metacritic coverage rises from 15 to 35 percent between 2000 and 2010 (see figure 14.13) while the share of Heatseeker artists with Metacritic coverage rises from 6 to 30 percent (see figure 14.14). We observe Last.fm airplay for the limited period between 2005 and 2011, but during this period one-fifth of Billboard 200 artists receive substantial Last.fm play. Thus far, we see (a) that there are more products, (b) that more products achieve success, and (c) that a growing share of products achieve success without substantial airplay. An important remaining question is whether a wider variety of new products, including those lacking major-label backing and substantial airplay (i.e., those with less ex ante promise), can achieve success.
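A minimal sketch of these share calculations appears below: it merges an artist-year chart table with an airplay list and computes both the share of artists and the sales-weighted share with concurrent airplay. Column names and toy values are assumptions for illustration.

```python
# Sketch of the "share with airplay" calculations behind figures 14.10-14.12,
# assuming hypothetical artist-year tables: Billboard 200 artists with simulated
# annual sales, and a list of artist-years with substantial Billboard airplay.
import pandas as pd

bb200 = pd.DataFrame({
    "year": [2000, 2000, 2000, 2010, 2010, 2010],
    "artist": ["A", "B", "C", "A", "D", "E"],
    "sales": [900_000, 400_000, 300_000, 500_000, 250_000, 200_000],
})
airplay = pd.DataFrame({"year": [2000, 2000, 2010], "artist": ["A", "B", "A"]})

merged = bb200.merge(airplay.assign(has_airplay=True),
                     on=["year", "artist"], how="left")
merged["has_airplay"] = merged["has_airplay"].fillna(False).astype(bool)

# Share of charting artists with concurrent airplay, by year (figures 14.10A/B).
artist_share = merged.groupby("year")["has_airplay"].mean()

# Sales-weighted share: fraction of chart sales from artists with airplay (figure 14.12).
merged["airplay_sales"] = merged["sales"] * merged["has_airplay"]
sales_share = (merged.groupby("year")["airplay_sales"].sum()
               / merged.groupby("year")["sales"].sum())

print(artist_share)
print(sales_share)
```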

14.5.4 Whose Albums Achieve Success? (Independent vs. Major)

We have seen that independent labels account for a large and growing share of new music releases. If this wider-scale experimentation is responsible for the sustained flow of high-quality music since Napster, then at a minimum it must be true that these albums with less ex ante promise make up a growing share of the albums that ultimately become successful with

Fig. 14.13 Share of BB 200 with Metacritic reviews

Fig. 14.14 Share of Heatseekers with Metacritic reviews


consumers. To examine this we ask whether albums from independent labels account for a growing share of top-selling albums. Before turning to data on this question, we note that there is a substantial amount of controversy in the measurement of the volume of independent-record sales. Nielsen reports the volume of independent-record sales in its year-end music sales report. These reports are available online for the past decade, and they show that independent-record labels have sold a roughly constant 15 percent of overall music sales. However, Nielsen calculates the independent share according to the entity distributing a record rather than the entity producing the recording. The different methodologies produce very different results. While Nielsen reported an independent share of just under 13 percent for the first half of 2011, the American Association of Independent Music (A2IM) advocates a different methodology that produces an independent share of nearly one-third. As they put it, “Ownership of master recordings, not distribution, should be used to calculate market share. . . . But Billboard reports market share based on distributor and as a result sales from [independent labels] are embedded within the major-label market share totals.”29 We take a conservative approach, calculating the independent share among commercially successful albums by merging the list of artists appearing on the weekly Billboard 200 each year (during any week of the year) with the artists appearing on the Billboard independent ranking during the year. Figure 14.15 shows results. The upper-left panel shows that the independent share among the full Billboard 200 rises from 14 percent in 2001 to 35 percent in 2010. We get a similar increase, albeit at a lower level, in the independent share among albums appearing in the weekly top 100, top 50, or top 25 among the Billboard 200. The independent share among artists appearing in the Billboard 25 rises from 6 percent in 2001 to 19 percent in 2010. We see a similar pattern in sales terms. As figure 14.16 shows, the share of BB 200 sales of albums from independent labels rises from 12 percent to about 24 percent between 2000 and 2011. The growth in the independent-label role among the commercially successful artists confirms that products with less ex ante promise are not only coming to market but also appearing among the products generating commercial success and, therefore, welfare benefit.
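The sketch below illustrates the merge just described, computing the independent share among the Billboard 200 and among progressively deeper cuts of the chart. Representing chart depth with each artist's best weekly rank is a simplifying assumption; the chapter works from the weekly lists themselves.

```python
# Sketch of the conservative independent-share calculation, assuming
# hypothetical artist-year lists: Billboard 200 entries with each artist's best
# weekly rank, and the set of artists on the Billboard independent chart that year.
import pandas as pd

bb200 = pd.DataFrame({
    "year": [2010] * 5,
    "artist": ["A", "B", "C", "D", "E"],
    "best_rank": [3, 18, 40, 120, 170],
})
indie_artists = {(2010, "B"), (2010, "D"), (2010, "E")}

bb200["indie"] = [
    (y, a) in indie_artists for y, a in zip(bb200["year"], bb200["artist"])
]

# Independent share among the full Billboard 200 and among progressively
# deeper cuts of the chart (top 100, top 50, top 25), as in figure 14.15.
for depth in (200, 100, 50, 25):
    subset = bb200[bb200["best_rank"] <= depth]
    share = subset.groupby("year")["indie"].mean()
    print(f"top {depth}:")
    print(share)
```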

14.6 Discussion and Conclusion

The growth in file sharing in the past dozen years has created a tumultuous period for the recorded music industry, presenting an enormous challenge
29. See Ed Christman, “What Exactly is an Independent Label? Differing Definitions, Differing Market Shares.” Billboard, July 18, 2011; and Rich Bengloff, “A2IM Disputes Billboard/SoundScan’s Label Market-Share Methodology—What Do You Think?” Billboard, March 3, 2011.


Fig. 14.15 Indie share among Billboard 200, Billboard 100, Billboard 50, and Billboard 25

to the business model of traditional major music labels, leading to a great deal of research on the sales-displacing impacts of file sharing on revenue. Yet cost-reducing technological change in production and distribution, along with a digitally enabled growth in music criticism, have allowed smaller music labels (and individuals) to both release more music and bring it to consumers’ attention. Much of the music originating in the low-cost sector is succeeding commercially. Music from independent labels now accounts for over one-third of the artists appearing on the Billboard 200 each year. In effect, consumers are exposed to much more music each year. In the past consumers would not have been exposed to the independent-label music, and the majors would dominate commercial success. The growing presence of independent-label music in the Billboard 200 means that, when exposed to this broader slate of new music, consumers find much of the independent music to be more appealing than much of the diminished major-label fare. While the usual caveat that more research is needed probably applies, these results nevertheless provide a possible resolution of the puzzling increase in music quality documented elsewhere. Beyond a possible explanation of continued music quality, the findings


Fig. 14.16 Independent share of BB 200 sales

from this exercise may have some implications for the effects of digitization on product markets generally. Digitization, with its attendant reductions in entry costs relative to market size, was supposed to bring about both frictionless commerce and a proliferation of product varieties to serve niche tastes. In many contexts, the increase in market size along with reductions in fixed costs have not produced this sort of fragmentation. Sutton (1991) outlines circumstances in which an increase in market size need not give rise to fragmentation, in particular, that product quality is produced with fixed costs and that consumers largely agree on which products are better (i.e., competition is vertical). The first of these conditions clearly holds for recorded music. Quality is produced entirely with investments in fixed costs. Whether consumers agree on quality is less clear. Results here suggest that consumers do not agree—that competition has an important horizontal component. Hence, an increase in the number of products available leads to fragmentation of consumption. This feature of music provides a sharp contrast with some other media products, such as daily newspapers and motion pictures, where competition has more important vertical aspects. Music appears to be one product, however, where digitization leads to fragmentation and perhaps the satisfaction of niche tastes. Other contexts where these effects predominate remain to be documented. The mechanism explored in this chapter is not limited to recorded music products. Further research could fruitfully explore the impacts of digitization on both the creation of new books, movies, and video games, to name


a few creative products, and the effect of new products on buyers and sellers.

References Berry, Steven T., and Joel Waldfogel. 2010. “Product Quality and Market Size.” Journal of Industrial Economics 58:1‒31. Blackburn, David. 2004. “On-line Piracy and Recorded Music Sales.” Unpublished manuscript, Harvard University. December. Brynjolfsson, Erik, Michael D. Smith, and Yu (Jeffrey) Hu. 2003. “Consumer Surplus in the Digital Economy: Estimating the Value of Increased Product Variety at Online Booksellers.” Management Science 49 (11): 1580–96. Caves, Richard E. 2000. Creative Industries: Contracts between Art and Commerce. Cambridge, MA: Harvard University Press. Chevalier, Judith, and Austan Goolsbee. 2003. “Measuring Prices and Price Competition Online: Amazon vs. Barnes and Noble.” Quantitative Marketing and Economics I 2:203–22. Dellarocas, C., N. Awad, and X. Zhang. 2007. “Exploring the Value of Online Product Reviews in Forecasting Sales: The Case of Motion Pictures.” Journal of Interactive Marketing 21 (4): 23‒45. Dewan, S., and J. Ramaprasad. 2012. “Music Blogging, Online Sampling, and the Long Tail.” Information Systems Research 23 (3, part 2): 1056–67. Ferreira, Ferando, Amil Petrin, and Joel Waldfogel. 2012. “Trade and Welfare in Motion Pictures.” Unpublished manuscript, University of Minnesota. Ferreira, F., and J. Waldfogel. 2013. “Pop Internationalism: Has Half a Century of World Music Trade Displaced Local Culture?” Economic Journal 123:634–64. doi: 10.1111/ecoj.12003. Handke, Christian. 2012. “Digital Copying and the Supply of Sound Recordings.” Information Economics and Policy 24:15‒29. International Federation of the Phonographic Industry (IFPI). 2010. “Investing in Music.” London. http://www.ifpi.org/content/library/investing_in_music.pdf. Knopper, Steve. 2009. Appetite for Self-Destruction: The Spectacular Crash of the Record Industry in the Digital Age. New York: Free Press. Leeds, Jeff. 2005. “The Net is a Boon for Indie Labels.” New York Times, December 27. Liebowitz, Stan J. 2006. “File Sharing: Creative Destruction or Just Plain Destruction?” Journal of Law and Economics 49 (1): 1‒28. Oberholzer-Gee, Felix, and Koleman Strumpf. 2007. “The Effect of File Sharing on Record Sales: An Empirical Analysis.” Journal of Political Economy 115 (1): 1–42. ———. 2010. “File Sharing and Copyright.” In Innovation Policy and the Economy, vol. 10, edited by Josh Lerner and Scott Stern, 19‒55. Chicago: University of Chicago Press. Pew Research Center. 2012. “Social Networking Popular across Globe.” Washington, DC. http://www.pewglobal.org/files/2012/12/Pew-Global-Attitudes-Project -Technology-Report-FINAL-December-12–2012.pdf. Rob, Rafael, and Joel Waldfogel. 2006. “Piracy on the High C’s: Music Downloading, Sales Displacement, and Social Welfare in a Sample of College Students.” Journal of Law and Economics 49 (1): 29‒62.


Sandstoe, Jeff. 2011. “Moby: ‘Major Labels Should Just Die.’” The Hollywood Reporter. February 28. http://www.hollywoodreporter.com/news/moby-major -labels-should-just-162685. Southall, Brian. 2003. The A-Z of Record Labels. London: Sanctuary Publishing. Sutton, John. 1991. Sunk Costs and Market Structure. Cambridge, MA: MIT Press. Tervio, Marko. 2009. “Superstars and Mediocrities: Market Failure in the Discovery of Talent.” Review of Economic Studies 72 (2): 829‒50. Thomson, Kristin. 2010.“Same Old Song: An Analysis of Radio Playlists in a PostFCC Consent Decree World.” Future of Music Coalition. http://futureofmusic .org/feature/same-old-song-analysis-radio-playlists-post-fcc-consent-decree -world. Vogel, Harold. 2007. Entertainment Industry Economics, 7th ed. Cambridge: Cambridge University Press. Waldfogel, Joel. 2011. “Bye, Bye, Miss American Pie? The Supply of New Recorded Music Since Napster.” NBER Working Paper no. 16882, Cambridge, MA. ———. 2012. “Copyright Protection, Technological Change, and the Quality of New Products: Evidence from Recorded Music since Napster.” Journal of Law and Economics 55 (4): 715–40. Zentner, Alejandro. 2006. “Measuring the Effect of File Sharing on Music Purchases.” Journal of Law and Economics 49 (1): 63–90.

15 The Nature and Incidence of Software Piracy: Evidence from Windows
Susan Athey and Scott Stern

15.1 Introduction

In the summer of 2009, Microsoft planned to release a new version of its flagship operating system, Windows 7. Relative to Windows Vista, Windows 7 offered significant improvements for consumers, including “driver support to multitouch groundwork for the future, from better battery management to the most easy-to-use interface Microsoft has ever had” (CNET 2009b). The redesign of the core operating system, as well as the development of bundled applications and features, represented a significant investment on the part of Microsoft, with approximately 2,500 developers, testers, and program managers engaged on the project for multiple years. Perhaps more than any other Microsoft product before it, Windows 7 was designed with a global market in mind (Microsoft 2009). Microsoft explicitly included a large number of features aimed at serving this global market, including the Multilingual User Interface included in Windows Ultimate and creating a Susan Athey is the Economics of Technology Professor and a professor of economics at the Graduate School of Business, Stanford University, and a research associate and codirector of the Market Design Working Group at the National Bureau of Economic Research. Scott Stern is the David Sarnoff Professor of Management of Technology and Chair of the Technological Innovation, Entrepreneurship, and Strategic Management Group at the MIT Sloan School of Management and a research associate and director of the Innovation Policy Working Group at the National Bureau of Economic Research. This research was conducted while both researchers were Consulting Researchers to Microsoft Research. This chapter has benefited greatly from seminar comments at the NBER Economics of Digitization conference, Microsoft Research, the MIT Microeconomics at Sloan conference, and by Ashish Arora, Shane Greenstein, Markus Mobius, and Pierre Azoulay. Exceptional research assistance was provided by Bryan Callaway and Ishita Chordia. For acknowledgments, sources of research support, and disclosure of the authors’ material financial relationships, if any, please see http://www.nber.org/chapters/c13002.ack.


low-priced version, Windows Home Basic, which was targeted specifically at emerging markets. However, just weeks after the release of the final version of the software and individualized product “keys” to original equipment manufacturers (OEMs), a number of websites reported that one of the original equipment manufacturer master product keys issued to Lenovo had been hacked and released onto the Internet (CNET 2009a). Websites quickly assembled stepby-step instructions on how to gain access to a prerelease, pirated version of Windows 7, and developed tools and protocols that allowed users to install an essentially complete version of Windows 7 Ultimate in a small number of transparent steps. While Microsoft chose to discontinue the leaked product key for OEM installation (they issued a new key for legitimate use by Lenovo), users were allowed to activate Windows 7 with the leaked key. In addition, though they did receive a modest functionality downgrade, users of the leaked Lenovo key were able to receive regularized product support and updates for their system. Microsoft argues that this approach ensures that they can “protect users from becoming unknowing victims, because customers who use pirated software are at greater risk of being exposed to malware as well as identity theft” (CNET 2009b). Over the course of 2009, a number of additional leaked keys and methods for pirating Windows 7 appeared on the Internet, and, by 2012, there were a large number of country-specific unauthorized Windows installation web pages, often tailored to specific languages or countries. By and large, most discussions of digital piracy—the use of the Internet to enable the unauthorized (and unpaid) replication of digital media including music, movie, and software—are based on specific instances of piracy, discussions of specific file-sharing websites (such as the Pirate Bay), or are closely tied to specific advocacy efforts. As emphasized by a recent National Academies study, the policy debate over piracy and the appropriate level of copyright enforcement is hampered by the lack of direct empirical evidence about the prevalence of piracy or the impact of enforcement efforts (Merrill and Raduchel 2013). This empirical vacuum is particularly important insofar as appropriate policy over piracy requires the consideration of both benefits and costs of particular policies. For example, the case for aggressive enforcement against piracy is strongest when piracy results from a simple lack of enforcement (or the absence of a legal framework for enforcing software copyright), while the argument for piracy tolerance is strongest when the primary impact of piracy is to provide access to low-income consumers whose alternative is nonconsumption. The development of appropriate policy, therefore, depends on an empirical assessment of the form that piracy takes in key settings. This chapter addresses this need by undertaking a systematic empirical examination of the nature, relative incidence, and drivers of software piracy.


We focus specifically on a product—Windows 7—which was unambiguously associated with a significant level of private-sector investment by a private sector company. The key to our approach is the use of a novel type of data that allows us to undertake a direct observational approach to the measurement of piracy. Specifically, we take advantage of telemetry data that is generated passively by users during the process of Windows Automatic Update (WAU) and is maintained in an anonymized fashion by Microsoft. For machines in a given geographic area, we are able to observe the product license keys that were used to initially authenticate Windows, as well as machine characteristics (such as the model and manufacturer). We are able to use these data to construct a conservative definition of piracy, and then calculate the rate of piracy for a specific geographic region.1 The primary focus of our empirical analysis is then to assess how the rate and nature of that piracy varies across different economic, institutional, and technological environments. We document a range of novel findings. First, we characterize the nature of “simple” software piracy. While software piracy has, of course, always existed, our examination of Windows 7 suggests that the global diffusion of broadband and peer-to-peer systems such as Pirate Bay has given rise to a distinctive type of software piracy: the potential for global reuse of individual product keys, with sophisticated and active user communities that develop easy-to-follow instructions and protocols. While the use of peerto-peer networking sites has been associated for more than a decade with piracy for smaller products such as music or video, there is now a relatively direct way that any broadband user can access a fully functional version of Windows for free through the Internet. In particular, we document that a very small number of abused product keys are responsible for the vast bulk of all observed piracy, and that the vast majority of piracy is associated with the most advanced version of Windows (Windows Ultimate). This finding suggests that one proposed type of antipiracy initiative—offering a “barebones” version at a greatly reduced price—may be of limited value, since such efforts will have no direct impact on the availability of a fully featured version of Windows for free (and may be considered a poor substitute). We are also able to detect a distinctive industrial organization to piracy: piracy rates are much higher for machines where the OEM does not install Windows during the production process, and the rate of piracy is much lower for machines produced by leading OEMs. Third, we are able to evaluate how software piracy varies across differ1. In constructing a novel and direct observational measure of piracy, our work complements but also offers an alternative to the small prior literature on software piracy that has used a more indirect measure of piracy that infers the rate of piracy from the “gap” between the stock of sales/licenses allocated to a particular region/segment and audits of the “software load” for typical devices for users within that region/segment (Business Software Alliance 2011).


ent economic, institutional, and technology environments. In addition to traditional economic measures such as gross domestic product (GDP) per capita (and more nuanced measures, such as the level of income inequality), we also gather data characterizing the overall quality of the institutional environment (e.g., using measures such as the World Bank Rule of Law Index or the Foundational Competitiveness Index; Delgado et al. [2012]), the ability of individuals within a country to take advantage of broadband, and the innovation orientation of a country. Our results suggest that the level of piracy is closely associated with the institutional and infrastructure environment of a country. In particular, the level of piracy is negatively associated with measures of the quality of institutions in a given country, including commonly used aggregate indices of institutional quality as well as more granular measures that capture the role of specific institutions such as property rights. At the same time, piracy has a positive association with the accessibility and speed of broadband connections (as faster broadband reduces the time required for pirating) and is declining in the innovation intensity of a country. Most importantly, after controlling for a small number of measures for institutional quality and broadband infrastructure, the most natural candidate driver of piracy—GDP per capita—has no significant impact on the observed piracy rate. In other words, while the pairwise correlation between piracy and GDP per capita is strongly negative, there is no direct effect from GDP per capita. Poorer countries tend to have weaker institutional environments (Hall and Jones [1997], among many others), and it is the environment rather than income per se that seems to be correlated with the observed level of piracy. Importantly, this finding stands in contrast to prior research, which has not effectively disentangled the role of institutions from the role of income per se. Finally, we take advantage of time-series variation in our data to directly investigate the impact of the most notable antipiracy enforcement efforts on the contemporaneous rate of Windows 7 piracy. Specifically, during the course of our 2011 and 2012 sample period, a number of individual countries imposed bans on the Pirate Bay website, the single-largest source of pirated digital media on the Internet. Though such policy interventions are endogenous (the bans arise in response to broad concerns about piracy), the precise timing of the intervention is reasonably independent of Windows 7 piracy in particular, and so it is instructive to examine how a change in the level of enforcement against piracy impacts the rate of Windows 7 software piracy. Over a range of different antipiracy enforcement efforts, we find no evidence for the impact of enforcement efforts on observed piracy rates. Overall, this chapter offers the first large-scale observational study of software piracy. Our analysis highlights the value of emerging forms of passively created data such as the Windows telemetry data, and also the role of both institutions and infrastructure in shaping the overall level of piracy.

15.2 The Economics of Software Piracy

The economics of piracy and the role of intellectual property in software is a long-debated topic (Landes and Posner 1989; Merrill and Raduchel 2013; Danaher, Smith, and Telang 2013). Like other forms of intellectual property such as patents, the copyright system has the objective of enhancing incentives for creative work and technological innovation by discouraging precise copying of expression, and is a particularly important form of intellectual property for software. In the case of global software products such as Windows, uneven copyright enforcement across different countries can result in a reduction in incentives to innovation and a distortion in the level of country-specific investment (e.g., companies may limit investment in language and character support in countries with high rates of piracy). The impact on regional investment would be of particular concern if the underlying driver of variation in piracy was the result of simple differences in legal institutions (such as the strength and respect for property rights) rather than the result of income differences (in which case there might also be a low willingness-to-pay for such value-added services). Piracy also has the potential to impose direct incremental costs on both software producers and purchasers of valid and updated software by facilitating the diffusion of viruses and other forms of malware. Because of the potential for a negative externality from the diffusion of pirated software, many software companies (including Microsoft) provide security updates (and some number of functionality updates) for pirated software. More generally, because software production is characterized by high fixed costs and near-zero replication costs, piracy redistributes the burden of funding the fixed costs of production onto a smaller share of the user population. Interestingly, the main argument against strict copyright enforcement is also grounded in the structure of production costs. With near-zero costs of replication, enhancing access to a broader user base (whether or not they are paying or not) increases the social return of software (even as it limits private incentives to incur the initial sunk costs). This argument is particularly salient to the extent that there is a limited impact of piracy (or the level of copyright enforcement) on the level of creative expression or innovation (Waldfogel 2011). However, many of the most widely diffused software products are produced by profit-oriented firms in which product development is the single most important component of overall costs. It is also possible that the main impact of piracy arises not simply from enhancing access, but from facilitating implicit price discrimination (Meurer 1997; Gopal and Sanders 1998). If there is a strong negative relationship between price sensitivity and willingness to incur the “costs” of piracy (e.g., time, potential for functionality downgrades), then tolerance of piracy may facilitate a segmentation of the market, in which suppliers charge


the monopoly price to the price-insensitive segment, and allow the pricesensitive segment to incur a higher level of transaction costs or a lower level of product quality. This argument is reinforced when the underlying product also exhibits significant network effects, so that even the price-insensitive consumers benefit from more widespread diffusion (Conner and Rumelt 1991; Oz and Thisse 1999). Importantly, the role that piracy plays in facilitating price discrimination depends on whether the segmentation that results between pirates and paying users reflects the type of consumer heterogeneity emphasized in these models. For example, the price discrimination rationale is more pertinent to the extent that piracy is concentrated among low willingness-to-pay consumers (e.g., consumers with a low level of income). Both the benefits and costs to piracy may be evolving over time with the increasing diffusion of the Internet and broadband connectivity. During the era of desktop computing, software piracy required physical access to at least one copy of the software media (such as a disk or CD), the bulk of piracy involved a limited degree of informal sharing among end users, and so the level of piracy was likely to have been roughly proportional with the level of commercial sales (Peace, Galletta, and Thong 2003). However, the Internet has significantly increased the potential for digital piracy, since a single digital copy can now, in principle, be shared among an almost limitless number of users (and there is no requirement that pirates have any prior or subsequent social or professional contact with each other). Internet-enabled piracy is likely to have increased over the last decade with the diffusion of broadband and the rise of download speeds. Since the middle of the first decade of the twenty-first century, there has been a very significant increase in the diffusion of broadband to mainstream consumers in the United States and abroad (Greenstein and Prince 2006), which has reduced the cost of large-scale software piracy. For example, a pirated version of Windows 7 requires downloading a ~ 10 GB file; it is likely that the extent and nature of piracy are qualitatively different when download times are at most a few hours as opposed to a few days. With the rise of the Internet and ubiquitous broadband connections, the potential for software piracy for large software product has become divorced from local sales of physical media.2 Despite the potential growing importance of software piracy, and the development of a rapidly emerging and even abundant literature examining the incidence of piracy and the role of copyright enforcement on digital mass media entertainment goods such as music, movies, and books (OberhozerGee and Strumpf 2010; Merrill and Raduchel 2013; Danaher, Smith, and Telang 2013), systematic empirical research on software piracy is at an early 2. The rise of the Internet and broadband has also reshaped the interaction between users and software producers. During the desktop era, a software product was essentially static, and users received only limited updates or software fixes. With the rise of the Internet and broadband, software authorization and distribution is routinely achieved through an online connection, and users receive regular security and functionality updates to their software.


stage. Nearly all prior studies of software piracy depend on a single data source, the Business Software Alliance (BSA). The BSA measure is calculated based on an indirect auditing methodology (see BSA [2011], for a more complete discussion of the BSA methodology). In particular, the BSA undertakes an inventory of the “software load” for typical devices within a particular region (broken down by particular types of software), and then compares the level of installed software with observed shipments and payments to software suppliers through authorized channels. In other words, the BSA infers the rate of piracy as the “residual” between the level of measured software and paid software in a given country and for particular software segment. Taken at face value, the BSA data suggests that software piracy is a highly significant phenomena; the BSA estimates that the annual “lost sales” due to piracy are worth more than 60 billion USD as of 2011, and that the rate of software piracy is well above 50 percent of all software in many regions around the world, including Latin America, Asia, and Eastern Europe (North America registers the lowest level of piracy as a region). Though the BSA methodology for inferring piracy is imperfect, this approach has the advantage of offering a consistent measurement of piracy across countries, software product segments, and over time. However, as it is an inherently indirect measure, such data cannot be utilized for the types of observational studies that have sharpened our understanding of piracy in the context of areas such as music and movies. A small literature exploits the BSA data to evaluate the extent of software piracy and the relationship between software piracy and the economic, institutional, and technology environment. The most common focus of this literature is to examine the relationship between piracy and the level of economic development (Burke 1996; Marron and Steele 2000; Silva and Ramello 2000; Gopal and Sanders 1998, 2000). Over time, this literature has been extended to also include more nuanced measures of the institutional environment and the level of technology infrastructure (such as Banerjee, Khalid, and Strum 2005; Bezmen and Depken 2006). For example, Goel and Nelson (2009) focus on a broad cross-sectional examination of the determinants of the BSA piracy rate, including not only GDP per capita, but also measures of institutional “quality” such as the Heritage Foundation Property Rights and Economic Freedom Index. Goel and Nelson also include a number of measures of technology infrastructure. Among other findings, they discover that countries with higher prices for telephone service have a lower rate of piracy (i.e., reduced telecommunications access limits piracy). Finally, this literature suggests that measures of variation within the population, such as income inequality, may also promote piracy; with a higher level of income inequality, the monopoly price for paying customers will be sufficiently high that a higher share of individuals will select into incurring the transactions costs associated with piracy (Andres 2006). Overall, our understanding of software piracy is still in a relatively embry-


onic state. On the one hand, similar to other debates about intellectual property enforcement, theory provides little concrete guidance about optimal policy in the absence of direct empirical evidence. The need for empirical evidence is particularly important given the likelihood that the nature and extent of piracy is changing as the result of the global diffusion of broadband infrastructure. At the same time, the extant empirical literature usefully highlights a number of broad correlations in the data, but has been limited by reliance on an indirect measure of piracy and a loose connection to the theoretical literature. Three key issues stand out. First, while the prior literature emphasizes both the role of GDP per capita as well as the role of the institutional environment in shaping piracy, the policy debate suggests that it is important to disentangle the relative role of each. For example, if the primary driver of piracy is poverty (i.e., a negative association with GDP per capita), then the case for aggressive antipiracy enforcement efforts is limited, as piracy is likely serving to simply enhance access to software but is not likely to be a source of significant lost sales. In contrast, if piracy is the result of a lowquality institutional environment, then any observed correlation with GDP per capita may be spurious; instead, the lack of strong legal and property rights institutions may be contributing to a low level of economic development as well as a high level of piracy. In that case, antipiracy enforcement actions may have a salutary effect by directly enhancing the institutional quality and property rights environment of a given location. Second, the global diffusion of broadband may have changed the nature of piracy. To the extent that piracy is facilitated by broadband diffusion, the rate of piracy should be higher for countries and regions where broadband infrastructure is more prevalent (e.g., where there are higher access speeds and/or lower prices for broadband service). To the extent that changes in “frictions” like the cost of downloading have a nontrivial effect on piracy, it suggests that there are a fair number of individuals “at the margin” between pirating and not pirating, and that piracy can be influenced through institutional changes or frictions imposed by regulation or product design features that make piracy more challenging. Finally, existing studies have not been able to isolate the impact of antipiracy enforcement efforts on software piracy. Consistent with recent studies of enforcement efforts in music and movies, an observational study of software piracy alongside shifts over time in the level of enforcement may be able to offer direct evidence about the efficacy of such efforts in restricting the unauthorized distribution of software. 15.3

The Nature of Software Piracy: A Window onto Windows Piracy

In our initial investigation of software piracy, we found relatively little systematic information within the research literature about how software piracy actually works as a phenomenon: How does one actually pirate a piece


of software? How hard is piracy, and how does that depend on the type of software that one seeks to pirate, and the type of telecommunications infrastructure that one has access to? How does pirated software actually work (i.e., are there significant restrictions in terms of functionality or updates)? What are the main “routes” to piracy?

15.3.1 The Organization of Windows 7 Distribution Channels

To understand the nature of digital software piracy (and how we will measure piracy with our data set), we first describe how users are able to receive, authenticate, and validate a legitimate copy of Windows, focusing in particular on the practices associated with individual copies of Windows 7. We then examine the nature of software piracy within that environment. To authenticate a valid version of Windows requires a Product License Key, a code that allows Microsoft to confirm that the specific copy of Windows that is being installed on a given machine reflects the license that has been purchased for that machine. Product License Keys are acquired as part of the process of acquiring Windows software, which occurs through three primary distribution channels: the OEM channel, the retail channel, and the volume licensing program.

The OEM Channel

By far, the most common (legal) way to acquire a copy of Windows is through an OEM. The OEMs install Windows as part of the process of building and distributing computers, and the vast majority of OEM-built computers include a copy of Windows. To facilitate the authentication of Windows licenses, each OEM receives a number of specialized Product License Keys (referred to as OEM SLP keys), which they can use during this OEM-installed process. In other words, while OEM SLP keys may be used multiple times, legal use of these keys can occur only on machines that (a) are from that specific OEM and (b) had Windows preinstalled. Users with OEM-installed Windows have the option to enroll in Windows Automatic Update, which provides security and functionality updates over time.

The Retail Channel

A second channel to legally acquire Windows is through a retail store (which can be either an online store or a bricks-and-mortar establishment). The retail channel primarily serves two types of customers: users who are upgrading their version of Windows (e.g., from Windows XP or Vista), and users who purchased a “naked” machine (i.e., a computer that did not have a preinstalled operating system). Each retail copy comes with a unique Retail Product Key, which is valid for use for a limited number of installations (usually ten). Retail Product Keys should therefore be observed only a small number of times. Users with a Retail Product Key have the option to enroll


in Windows Automatic Update, which provides security and functionality updates over time.

The Enterprise Channel

The final way to acquire Windows is through a contractual arrangement between an organization and Microsoft. For large institutional customers (particularly those that want to preinstall other software for employees), Microsoft maintains a direct customer relationship with the user organization, and issues that organization a Volume License Key Server, which allows the organization to create a specific number of copies of Windows for the organization. While each Volume License Key is unique, most Windows Enterprise customers receive updates through the servers and IT infrastructure of their organization, rather than being enrolled directly in programs such as Windows Automatic Update. In each distribution channel, each legal Windows user undergoes a process of authenticating their copy of Windows. In the case of OEM-installed Windows or the retail channel, that authentication process occurs directly with Microsoft. In the case of Windows Enterprise, that authentication occurs via the server system that is established as part of the contract between Microsoft and the volume license customer.
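As a purely hypothetical illustration of how channel rules like these could be used to screen activation records, the sketch below flags retail keys observed on implausibly many machines and OEM preinstallation keys appearing on another manufacturer's hardware. The field names, the threshold, and the rules are assumptions for exposition; they are not Microsoft's detection logic or the authors' exact piracy definition.

```python
# Hypothetical sketch of screening telemetry-style records for the kinds of
# key abuse described above. All fields and rules are illustrative assumptions.
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    machine_id: str
    key_id: str
    key_type: str          # "OEM_SLP", "RETAIL", or "VOLUME"
    key_oem: Optional[str]  # OEM the SLP key was issued to (None otherwise)
    manufacturer: str       # manufacturer reported by the machine

RETAIL_INSTALL_LIMIT = 10  # retail licenses allow only a small number of installs

def flag_suspicious(records: list[Record]) -> set[str]:
    """Return machine_ids whose activation looks inconsistent with its channel."""
    flagged = set()
    machines_per_retail_key = defaultdict(set)
    for r in records:
        if r.key_type == "OEM_SLP" and r.key_oem and r.key_oem != r.manufacturer:
            # An OEM preinstallation key showing up on another maker's hardware.
            flagged.add(r.machine_id)
        elif r.key_type == "RETAIL":
            machines_per_retail_key[r.key_id].add(r.machine_id)
    for key, machines in machines_per_retail_key.items():
        if len(machines) > RETAIL_INSTALL_LIMIT:
            # A single retail key reused far beyond its installation allowance.
            flagged.update(machines)
    return flagged

sample = [
    Record("m1", "k-lenovo", "OEM_SLP", "Lenovo", "Lenovo"),
    Record("m2", "k-lenovo", "OEM_SLP", "Lenovo", "CustomBuild"),
    Record("m3", "k-retail-1", "RETAIL", None, "Dell"),
]
print(flag_suspicious(sample))  # flags only the mismatched OEM machine here
```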

15.3.2 The Routes to Windows Piracy

We define software piracy as the "unauthorized use or reproduction of copyrighted software" (American Heritage Dictionary 2000). While software piracy has always been an inherent element of software distribution (and has often been closely associated with hacker culture), the nature of piracy changes over time and reflects the particular ways in which users are able to access software without authorization or payment. There seem to be three primary "routes" to piracy of a mass-market, large-format software product such as Windows: local product key abuse, sophisticated hacking, and distributed product key abuse.

Local Product Key Abuse

Since the development of software with imperfect version copying, individual users have occasionally engaged in the unauthorized "local" replication of software from a single legal version. Indeed, the ability to replicate a single copy of Windows across multiple computers is explicitly recognized in the Windows retail licensing contract, which allows users up to ten authorized replications. Abuse of that license can involve significant replication of the software among social or business networks, or deployment within an organization well beyond the level that is specified in a retail license or reported through a volume license key server. Most users who engage in local product key abuse will continue to anticipate receiving software updates from the software vendor. A useful observation is that, when a certain


limited number of copies is to be expected (e.g., less than 100), the seller can simply set a price to reflect the scalability of each piece of software once it is deployed in the field.

Sophisticated Hacking

A second route to piracy requires far more active engagement on the part of users and involves an explicit attempt to "hack" software in order to disable any authentication and validation protocols that are built into it. Though this does not seem to be the primary type of piracy that occurs in the context of a mainstream software product such as Windows, such piracy is nonetheless extremely difficult to measure (particularly using the type of passively generated data that is at the heart of our empirical work).

Distributed Product Key Abuse

The third route to piracy is arguably the most "novel" and follows the evolution of piracy for smaller-sized digital products such as music or even movies. In distributed peer-to-peer unauthorized sharing, users access a software copy of Windows (a file of roughly 10 GB) through a peer-to-peer torrent site such as the Pirate Bay, and then separately download a valid/usable product license key from the Internet. Users then misrepresent that the key was obtained through legal means during the authentication and validation process. In our preliminary investigation of this more novel type of piracy, we found the "ecosystem" for peer-to-peer sharing to be very well developed, with online forums and sites focusing on pirating a few quite specific keys. To get a sense of how piracy occurs, and the role of globally distributed abused product keys in that process, it is useful to consider a small number of "dossiers" that we developed for a select number of such keys:

The Lenovo Key (Lenny). Approximately three months prior to the commercial launch of Windows 7, Microsoft issued a limited number of OEM System Lock Pre-Installation (SLP) keys to leading OEMs such as Lenovo, Dell, HP, and Asus. Issuing these keys allowed these OEMs to begin their preparations to preinstall Windows on machines for the retail market. Within several days of the release of these keys to the OEMs, the Lenovo key for Windows 7 Ultimate was released onto the Internet (REFS). This widely reported leak led Microsoft to issue a separate key to Lenovo for the same product (i.e., so that all "legitimate" Lenovo computers would have a different product key than the key that was available on the Internet). Also, Microsoft imposed a functionality downgrade on users who authenticated Windows 7 with the Lenovo key; a message would appear every thirty minutes informing the user that their product key was invalid, and the desktop would be defaulted to an unchangeable black background. Within a few weeks (and still well before the commercial introduction of Windows 7), a


number of websites had been established that provided step-by-step instructions about how to download a clean "image" of Windows 7 from a site such as the Pirate Bay or Morpheus, and not only how to authenticate Windows with "Lenny" (the Lenovo product key) but also how to disable the limited functionality losses that Microsoft imposed on users that authenticated with the Lenovo key (Reddit 2013; My Digital Life 2013).3 It is useful to emphasize that the Lenovo key leak allowed unauthorized users to gain access to a fully functional version of Windows 7 prior to its launch date and also receive functional and security updates on a regular basis. As of April 2013, Google reports more than 127,000 hits for a search on the product key associated with Lenny, and both the Windows 7 software image and the Lenovo product key are widely available through sites such as the Pirate Bay.

3. This latter reference is but one of many making claims such as the following: "This is the loader application that's used by millions of people worldwide, well-known for passing Microsoft's WAT (Windows Activation Technologies) and is arguably the safest Windows activation exploit ever created. The application itself injects a SLIC (System Licensed Internal Code) into your system before Windows boots; this is what fools Windows into thinking it's genuine" (My Digital Life 2013). That post is associated with more than 7,000 "thank yous" from users.

The Dell Key (Sarah). Though the Lenovo key received the highest level of media and online attention (likely because it seemed to be the "first" leaked OEM key associated with Windows 7), the Dell OEM SLP key for Windows 7 was also released onto the Internet within weeks after its transmission to Dell, and months before the commercial introduction of Windows 7. Similar to Lenny, a large number of websites were established providing step-by-step instructions about how to download an image of Windows that would work with the Sarah product key, and instructions about how to use the leaked product key and disable the minor functionality downgrades that Microsoft imposed on users with this key. In contrast to the Lenovo key, the Dell key was never discontinued for use by Dell itself; as a result, there are literally millions of legitimate copies of Windows 7 that employ this key. However, by design, this key should never be observed on a non-Dell machine, or even a Dell machine that was shipped "naked" from the factory (i.e., a Dell computer that was shipped without a preinstalled operating system). For computers that validate with this key, a simple (and conservative) test of piracy is an observation with Sarah as the product key on a non-Dell machine or a Dell machine that was shipped "naked" (a characteristic also observable in our telemetry data).

The Toshiba Key (Billy). Not all OEM SLP Windows keys are associated with a high level of piracy. For example, the Toshiba Windows Home Premium key is associated with much less piracy. This key was not released onto the Internet until just after the commercial launch of Windows 7 (October 2009), and there are fewer Google or Bing hits associated


with this product key (less than 10 percent of the number of hits associated with the Lenovo and Dell keys described above). In other words, while this version of Windows could be pirated at a much more intensive level if other copies (including all copies of Windows Ultimate) were unavailable, the Windows piracy community seems to focus its primary attention on a small number of keys, with a significant focus on leading OEM SLP Ultimate keys.

Overall, these short dossiers of the primary ways in which retail Windows 7 software piracy has actually been realized offer some insight into the nature of Windows piracy as a phenomenon, and guidance as to the relative effectiveness of different types of enforcement actions either by government or by Microsoft. First, while discussions of software piracy that predate widespread broadband access emphasize the relatively local nature of software piracy (e.g., sharing of physical media by friends and neighbors, instantiating excess copies of a volume license beyond what is reported to a vendor such as Microsoft), Windows 7 seems to have been associated with a high level of digital piracy flowing from a small number of digital point sources. In our empirical work, we will explicitly examine how concentrated piracy is in terms of the number of product keys that are associated with the vast bulk of piracy. Second, the globally distributed nature of the ways to access a pirated version of Windows suggests that it may be difficult to meaningfully impact the piracy rate simply by targeting a small number of websites or even product keys. Based on the voluminous material and documentation publicly available on the Internet (and reachable through traditional search engines), small changes in the supply of pirated software are likely to have little impact on the realized level of piracy.

15.4 Data

The remainder of this chapter undertakes a systematic empirical examination of the nature and incidence of software piracy. Specifically, we take advantage of a novel data set that allows us to observe statistics related to a large, global sample of machines that install Windows and receive regular security and functionality updates from Microsoft. Though these data have important limitations (which we discuss below), they offer the opportunity to undertake a direct observational study of software, and in particular the ability to identify whether machines in a given region are employing a valid or pirated version of Windows. We combine this regional measure of piracy with measures of other attributes of machines, as well as regional variables describing the institutional, economic, and technology environment, to evaluate the nature and relative incidence of piracy.

15.4.1 Windows 7 Telemetry Data

Our estimates of the piracy rates of Windows 7 are computed by drawing on a data set that captures information about machines (including “hashed”


data providing their regional location) that enroll in a voluntary security update program known as Windows Automatic Updates (WAU). When a machine enrolls in the program, a low-level telemetry control, formally known as Windows Activation Technologies (WAT), is installed, which performs periodic validations of the machine's Windows 7 license. During each of these validations, which occur every ninety days by default, data are passively generated about a machine's current hardware, operating system configuration, and basic geographic location. This information is transmitted to Microsoft and maintained in a hashed manner consistent with the privacy protocols established by Microsoft.4

4. During the validation process, no personal information that could be used to identify an individual user is collected. For more details, see http://www.microsoft.com/privacy/default.aspx.

More than 400 million individual machines transmitted telemetry information to Microsoft during 2011 and 2012, the period of our sample. We make use of a research data set consisting of an anonymized sample of 10 million machines, where, for a given machine, the data set includes the history of validation attempts for that machine over time. For each of these validation episodes, the data set includes information on the broad geolocation of the machine at the time of validation,5 the product key used to activate Windows 7, the version of Windows 7 installed, and a set of machine characteristics including the manufacturer (OEM) and the machine model, the PC architecture, and whether an OEM installed a version of Windows during the manufacturing process.

5. The geographical location of a machine during its WAT validation attempt is constructed based on the Internet Protocol (IP) address that was used to establish a connection with Microsoft in order to undergo validation. In order to preserve anonymity, only the city and country from which the IP address originates is recorded in our data set.

Though the Windows telemetry data offers a unique data source for observing software in the field, users face a choice about whether to enroll in the WAU program. Self-selection into Windows Automatic Update engenders two distinct challenges for our data. First, Windows Enterprise customers and others that employ volume licensing contracts with Microsoft primarily opt out of WAU and instead manage updating Windows through their own IT departments (a process which allows them, for example, to include organization-specific updates as well). While we do observe a small number of machines that report a volume license key, we exclude this population entirely from our analysis in order to condition the analysis on users who attempt to validate with either an OEM SLP or retail product key license.6


In that sense, our empirical analysis can be interpreted as an examination of piracy by individual users and organizations without any direct contract with Microsoft. Second, users with pirated versions of Windows may be less likely to enroll in the automatic update program. As such, conditional on being within the sample of users who validate with an OEM SLP or retail product key, we are likely estimating a lower bound on the rate of piracy within the entire population of machines.

6. There are a number of reasons for doing this, most notably the fact that, due to the highly idiosyncratic nature of VL agreements, it is extremely difficult to determine what constitutes an abused VL product key.

15.4.2 Defining Piracy

Using the system information recorded by the Windows telemetry data, we are able to check for the presence of key indicators that provide unambiguous evidence for piracy. We take a conservative approach to defining each of these in order to ensure that our overall definition of piracy captures only machines consistently possessing what we believe are unambiguous indicators of piracy. Consistent with the discussion in section 15.3, we therefore identify a machine as noncompliant for a given validation check if it meets one or more of the following criteria:

• For those validating with an OEM key:

a. Machines associated with known leaked and/or abused keys and in which there is a mismatch between the OEM associated with the key and the OEM of the machine.

b. Machines with an unambiguous mismatch between the product key and other machine-level characteristics.

• For those validating with a retail key:

a. Known leaked and/or abused retail product keys with more than 100 observed copies within the machine-level WAT population data set.

This definition captures the key cases that we highlighted in section 15.3. For example, all machines that validate with the "leaked" Lenovo product key, Lenny, will be included in this definition, since this is a known leaked key that should not be matched with any machine. This also captures all uses of the Dell OEM SLP key (Sarah) in which validation is attempted on a non-Dell machine or a Dell machine that also reports having been shipped naked from the factory (in both of these cases, there would be no legal way to receive a valid version of the Dell OEM SLP key). Similarly, any machine that was exclusively designed for Windows XP or Vista (so that no OEM key for Windows 7 was ever legally installed on that machine) but reports an OEM SLP Windows 7 key will be measured as an instance of noncompliance. Because we observe the full history of validation attempts for any given machine (though the data for each machine is anonymized beyond its broad regional location), we are able to define piracy as the persistence of noncompliance across all validation attempts by a given machine. In other words, if a machine originally uses a noncompliant version of Windows but then reauthorizes with a valid license key, we define that machine as being in


"compliance" in terms of our overall definition. We therefore define a machine as a pirate if, for each of the validation attempts associated with that machine, it satisfies one of the unambiguous noncompliant criteria stated above. We then construct our key measure, PIRACY RATE, as the number of pirated machines observed within a given region divided by the total number of machines we observe within that region (see table 15.1). Weighted by the number of machines per country, the overall piracy rate is just over 25 percent; if each of the 95 countries in our sample is treated as a separate observation, the average country-level piracy rate is just under 40 percent.
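To make this construction concrete, the following sketch (in Python with pandas) shows one way the persistence-based classification and the country-level piracy rate could be computed. The record layout, column names, and noncompliance checks below are illustrative assumptions, not the authors' actual schema or code.

```python
import pandas as pd

# Illustrative validation-level records; all column names are hypothetical.
validations = pd.DataFrame({
    "machine_id":       [1, 1, 2, 2, 3],
    "country":          ["US", "US", "BR", "BR", "BR"],
    "key_type":         ["oem", "oem", "oem", "oem", "retail"],
    "key_leaked":       [True, True, True, False, False],  # key is on a known leaked/abused list
    "oem_mismatch":     [True, True, True, False, False],  # key's OEM differs from the machine's OEM
    "retail_key_count": [0, 0, 0, 0, 5],                   # observed copies of this retail key
})

def noncompliant(row):
    """Flag a single validation check using the 'unambiguous' criteria described in the text."""
    if row["key_type"] == "oem":
        return bool(row["key_leaked"] and row["oem_mismatch"])
    return row["retail_key_count"] > 100  # conservative threshold for retail keys

validations["noncompliant"] = validations.apply(noncompliant, axis=1)

# A machine counts as a pirate only if *all* of its validation attempts are noncompliant;
# a machine that later reauthorizes with a valid key is treated as compliant.
pirate = validations.groupby("machine_id")["noncompliant"].all()
country = validations.groupby("machine_id")["country"].first()
machines = pd.DataFrame({"country": country, "pirate": pirate})

# PIRACY RATE: pirated machines divided by machines observed, by country.
piracy_rate = machines.groupby("country")["pirate"].mean()

machine_weighted = machines["pirate"].mean()  # pools all machines (just over 25 percent in the chapter)
country_average = piracy_rate.mean()          # treats each country equally (just under 40 percent)
print(piracy_rate, machine_weighted, country_average)
```

The distinction between the last two lines mirrors the machine-weighted versus country-average rates reported above.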

15.4.3 Machine Characteristics

We are able to observe and then aggregate a number of additional characteristics of machines within a given country. While the decision of whether to pirate Windows, like other hardware and vendor choices, is clearly endogenous, we nonetheless believe that it is informative to understand what types of machines tend to be associated with pirated software (or not). Specifically, we define four measures that we believe usefully characterize key machine attributes and along which it is useful to compare how the rate of piracy varies (a simple construction of these indicators is sketched after the list):

• Frontier Model: An indicator equal to one for machine models that were exclusively built following the launch of Windows 7.

• Leading Manufacturer: An indicator for whether a machine was produced by one of the leading twenty OEMs, as determined by their market share within the telemetry population.

• Frontier Architecture: An indicator equal to one for machines with a 64-bit CPU instruction set (also known as an x86-64 processor). Approximately 63 percent of the machines in our sample are equipped with an x86-64 processor.

• Windows Home Premium/Professional/Ultimate: An indicator for whether the installed version of Windows 7 on a machine is Windows Home Premium, Professional, or Ultimate, respectively.
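A minimal sketch of how these indicators might be constructed from machine-level telemetry fields is below; the column names are hypothetical stand-ins, not the fields of the actual data set.

```python
import pandas as pd

# Hypothetical machine-level telemetry fields; names are illustrative only.
machines = pd.DataFrame({
    "model_first_built": ["2010-03", "2008-11", "2011-06"],  # when the machine model was introduced
    "oem_market_rank":   [3, 45, 12],                        # OEM rank by telemetry market share
    "cpu_arch":          ["x86-64", "x86", "x86-64"],
    "win7_edition":      ["Ultimate", "Home Premium", "Professional"],
})

WIN7_LAUNCH = "2009-10"  # Windows 7 general availability (October 2009)

machines["frontier_model"] = machines["model_first_built"] > WIN7_LAUNCH   # model built after launch
machines["leading_manufacturer"] = machines["oem_market_rank"] <= 20       # top twenty OEMs
machines["frontier_architecture"] = machines["cpu_arch"] == "x86-64"       # 64-bit instruction set
for edition in ["Home Premium", "Professional", "Ultimate"]:
    col = "windows_" + edition.lower().replace(" ", "_")
    machines[col] = machines["win7_edition"] == edition
```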

15.4.4 Economic, Institutional, and Infrastructure Variables

Once we have classified each of the machines in our sample, we construct a measure of the incidence of piracy for each of the ninety-five countries in our sample, which we then incorporate into a data set of country-level economic, institutional, and technology infrastructure variables. Our data on country-level characteristics can be classified into three broad categories: (a) economic and demographic factors, (b) institutional quality, and (c) technology and innovative capacity. The variable names, definitions, and means and standard deviations are in table 15.1. For our basic economic and demographic measures, such as GDP per capita, the current rate of inflation, and population,

Table 15.1  Summary statistics

Each row reports the variable, its definition in brackets, and the mean (standard deviation) at the country level (N = 95), followed by the country level weighted by machines (N = 95).

Dependent variable
Piracy rate [Share of noncompliant (i.e., pirated) machines]: .38 | .25

Machine characteristics
Frontier model [Windows 7 ready model]: .57 (.09) | .63 (.48)
Leading manufacturer [Indicator for whether machine is produced by one of 20 top manufacturers (by market share)]: .75 (.12) | .80 (.40)
Frontier architecture [Indicator for 64-bit CPU architecture]: .50 (.16) | .63 (.48)
Windows Ultimate [Indicator for whether machine is Windows Ultimate]: .40 (.20) | .25 (.43)
Windows Professional [Indicator for whether machine is Windows Professional]: .19 (.06) | .18 (.38)
Windows Home Premium [Indicator for whether machine is Windows Home Premium]: .41 (.19) | .58 (.49)

Economic, institutional, and demographic indicators
GDP per capita [GDP per capita (IMF)]: 22,215.42 (17,898.45) | 32,498.37 (15,628.68)
Foundational Competitiveness Index [Competitiveness Index score (Delgado et al. 2012)]: .22 (.78) | .55 (.76)
WB Rule of Law [World Bank Rule of Law Index]: .36 (.97) | .75 (.99)
Settler mortality [European settler mortality (Acemoglu et al. 2001)]: 111.96 (298.43) | 39.12 (107.39)
Property rights [Heritage Foundation Property Rights Index]: 53.66 (24.99) | 63.47 (26.64)
Gini coefficient [Gini coefficient for income inequality (Central Intelligence Agency 2007)]: 38.3 (10.18) | 39.57 (8.58)
Lending rate [Lending interest rate (EIU)]: 8.83 (5.93) | 7.38 (7.72)
Inflation [Annual (%) change in CPI (IMF)]: 4.63 (3.66) | 3.53 (5.77)
Population (in millions) [Total population (IMF)]: 61.32 (189.35) | 167.52 (219.43)
Population density [People per sq. km (WDI)]: 301.25 (1,021.36) | 188.29 (696.51)

Measures of innovative & technological capacity
Patents per capita [USPTO-filed patents per one million inhabitants (USPTO)]: 34.5 (70.09) | 118.37 (116.96)
Broadband speed [Wired broadband speed per 100 Mbit/sec (ICT/ITU)]: 4.96 (12.84) | 6.73 (16.98)
Broadband monthly rate [Wired broadband monthly subscription charge (USD) (ICT/ITU)]: 24.15 (11.47) | 24.12 (10.07)
Computer [Percent of households with a computer (ICT/ITU)]: 54.07 (27.30) | 66.94 (22.08)
Internet [Internet users (%) of population (ICT/ITU)]: 52.21 (23.99) | 65.72 (20.74)

Note: With the exception of the CIA Factbook's Gini coefficient, which was computed in 2008, we take the average of all indicators over our sample period (2011–2012), unless otherwise indicated.


we use standard data from the International Monetary Fund (IMF) for the most current year (2012 in nearly all cases). The Gini coefficient for each country is drawn from the CIA World Factbook, and a measure of the lending interest rate is drawn from the Economist Intelligence Unit.

We then incorporate four different measures of overall "institutional quality" of a country. Our first measure, foundational competitiveness, is drawn from Delgado et al. (2012), who develop a multiattribute measure that captures a wide range of factors that contribute to the baseline quality of the microeconomic environment, as well as the quality of social and political institutions within a given country. Foundational competitiveness incorporates a wide range of prior research findings on the long-term drivers of country-level institutional quality, and reflects differences across countries in their institutional environment in a way that is distinguishable from simply the observed level of GDP per capita (Delgado et al. 2012). We also include two additional contemporary measures of institutional quality, including the Rule of Law measure developed as a part of the World Bank Doing Business Indicators (Kaufmann, Kraay, and Mastruzzi 2009), and a Property Rights Index developed by the Heritage Foundation. Finally, building on Acemoglu, Johnson, and Robinson (2001), we use settler mortality (as measured in the early nineteenth century) as a proxy for the historical origins of long-term institutional quality. Environments where European settler mortality was low saw more investment in setting up more inclusive institutions, resulting in a historical path leading to more favorable institutions over time. We will therefore be able to examine how the historical conditions giving rise to institutions in a given location impact the rate of piracy today. It is important to note that all of these measures are highly correlated with each other, and our objective is not to discriminate among them in terms of their impact on piracy. Instead, we will evaluate how each of these measures relates to piracy, and in particular consider whether their inclusion reduces the relative salience of contemporary economic measures such as GDP per capita or the Gini coefficient.

Finally, we use a number of measures of the technological and innovative capacity of a country. In terms of telecommunications infrastructure, we use two different measures (from the International Telecommunications Union) of broadband infrastructure, including broadband speed and broadband monthly rate. We also investigate alternative measures of the information technology and Internet infrastructure, including the percentage of households with a computer and the percentage of the population with access to an Internet connection. Interestingly, for the purpose of evaluating the incidence of operating systems piracy, we believe that measures associated with broadband connectivity are likely to be particularly important, since low-cost and rapid broadband connections would be required for downloading the large files that are required for Windows 7 piracy. Though we experimented with a wide range of measures, we use the number of


USPTO-filed patents per capita as our measure of the innovation orientation of an economy (other measures lead to similar findings).

15.5 Empirical Results

Our analysis proceeds in several steps. First, we examine some broad patterns in the data, highlighting both the nature and distribution of Windows 7 piracy around the world. We then examine the impact of the economic and institutional environment on piracy, both looking at cross-country comparisons and undertaking a more detailed examination of cities within and across countries. We also briefly consider how the rate of piracy varies with particular populations of machines and computer characteristics, in order to surface some of the potential mechanisms that underlie differences in piracy across different environments and among individuals within a given environment. Finally, we undertake a preliminary exercise to assess the causal short-term impact of the primary antipiracy enforcement effort—the blocking of websites such as Pirate Bay—on observed piracy rates in our data.

15.5.1 The Nature and Incidence of Piracy

We begin in figure 15.1, where we consider how piracy is propagated, focusing on the incidence of individual product keys within the population of pirated machines. The results are striking. Consistent with our qualitative discussion in section 15.3, the vast bulk of observable piracy is associated with a relatively small number of product keys. The top five keys each account for more than 10 percent of observed piracy in our data, and more than 90 percent of piracy is accounted for by the top twelve product keys.

Fig. 15.1 Individual product key piracy as a percentage of overall piracy

Fig. 15.2A Piracy and Windows 7 models: Share of machines

Fig. 15.2B Piracy and leading manufacturers: Share of machines

At least in part, this extreme concentration is consistent with the idea that global piracy is associated with user communities that provide easy-to-follow instructions for individual product keys, and so there is similarity across users in their precise "route" to piracy. Of course, enforcement efforts focused on individual keys would likely simply shift potential pirates to other keys (and websites would spring up to facilitate that process).

We continue our descriptive overview in figures 15.2A and 15.2B, where

Fig. 15.3 Piracy and Windows 7 version: Share of machines

we break out the rate of piracy by the type of OEM and the type of machine. In figure 15.2B, we examine how the rate of piracy varies by whether the machine is associated with a leading OEM or not. The rate of piracy is much higher among machines shipped by "fringe" rather than leading OEMs; however, such machines make up only a small share of the entire sample (less than 20 percent). In other words, while the incidence of piracy is much lower among machines produced by leading OEMs, the bulk of piracy is nonetheless associated with machines from leading OEMs. Similarly, figure 15.2A describes how the rate of piracy varies depending on whether a particular computer model was introduced before or after the debut of Windows 7. Just over a third of observed copies of Windows 7 are associated with machines that were produced prior to the introduction of Windows 7, and so are likely machines that are "upgrading" from Windows Vista (or an even earlier version of Windows). Interestingly, the rate of piracy for machines associated with clear upgrades is only slightly higher than for machines produced after the introduction of Windows 7 (29 versus 22 percent).

Finally, there is striking variation across the version of Windows installed. While the piracy rate associated with Home Premium is quite modest, more than 70 percent of all piracy is of Windows Ultimate, and, amazingly, nearly 70 percent of all observed copies of Windows Ultimate are pirated (figure 15.3). Windows piracy is thus disproportionately associated with machines that are not produced by leading manufacturers, and, conditional on choosing to pirate, users choose to install the most advanced version of Windows software.

15.5.2 The Economic, Institutional, and Technological Determinants of Software Piracy

We now turn to a more systematic examination of how the rate of piracy varies by region (the drivers of which are the main focus of our regression analysis in the next section). Figures 15.4A and 15.4B highlight a very wide range of variation across regions and countries. While the observed piracy rate in Japan is less than 3 percent, Latin America registers an average piracy rate of 50 percent.

Fig. 15.4A Piracy rate by region

Fig. 15.4B Windows 7 piracy rate by country

Fig. 15.5 GDP per capita versus country-level piracy rates

Interestingly, after Japan, there is a group of advanced English-speaking countries—the United States, Canada, Australia, and the United Kingdom—which register the lowest rates of piracy across the globe. At the other extreme is Georgia, where the observed piracy rate reaches nearly 80 percent. Perhaps more importantly, a number of large emerging countries such as Russia, Brazil, and China are each recorded at nearly 60 percent. Finally, it is useful to note that a number of reasonably "wealthy" countries (e.g., South Korea, Taiwan, and Israel) register piracy rates between 30 and 40 percent. As emphasized in figure 15.5 (which simply plots the country-level piracy rate versus GDP per capita), there is a negative but noisy relationship between piracy and overall prosperity.

These broad correlations provide the foundation for the more systematic examination in table 15.2. Column (1) of table 15.2 presents a simple regression that documents the relationship illustrated in figure 15.5—there is a negative correlation between piracy and GDP per capita. However, as we discussed in section 15.3, the relationship between piracy and GDP per capita is subtle: Is this relationship driven by the fact that poor countries tend to have poor institutional environments (and so are likely to engage in more piracy), or does this reflect differences in opportunity cost or price sensitivity? To disentangle these effects, we include a simple set of measures associated with country-level institutional quality. In table 15.2, column (2), we include foundational competitiveness (Delgado et al. 2012) as an overall index that aggregates many different facets of the institutional environment, and in table 15.2, column (3), we focus on a more straightforward (but perhaps more blunt) measure of institutional quality, the World Bank Rule of Law Index.


Table 15.2  Software piracy and the economic, institutional, and infrastructure environment

Dependent variable: Windows 7 piracy rate

Ln GDP per capita: (1) ‒.151*** (.014); (2) ‒.082*** (.021); (3) ‒.061*** (.018); (4) ‒.026 (.02)
Competition index: (2) ‒.096*** (.017); (4) ‒.039** (.019)
WB Rule of Law: (3) ‒.097*** (.014)
Ln patents per capita (a): (4) ‒.023*** (.007)
Ln broadband download speed: (4) .008 (.009)
Ln broadband monthly rate: (4) ‒.087*** (.026)
Lending rate: (4) .005*** (.001)
Observations: (1) 95; (2) 95; (3) 95; (4) 95
R-squared: (1) .599; (2) .674; (3) .708; (4) .762

Note: Robust standard errors in parentheses.
a. Ln patents per capita is defined as Ln(1 + patents per capita).
***Significant at the 1 percent level. **Significant at the 5 percent level. *Significant at the 10 percent level.

In both cases, the coefficient on GDP per capita declines by half, and is only marginally significantly different from zero. We investigate this further in table 15.2, column (4), where we include a small number of additional controls for the quality of the telecom infrastructure and the degree of innovation orientation of the economy. Consistent with earlier studies emphasizing the importance of the telecommunications infrastructure in piracy (Goel and Nelson 2009), piracy is declining in the price of broadband access and (not significantly) increasing in average broadband speed. The relative significance of these two coefficients depends on the precise specification (and they are always jointly significantly different from zero); the overall pattern suggests that piracy is sensitive to the ability to download and manage large files, consistent with the hypothesis that broadband downloads of pirated content are a primary channel through which Windows piracy occurs. The piracy rate is also declining in patents per capita—the rate of piracy by consumers and businesses is lower in countries with a higher rate and orientation toward innovation. In unreported regressions, we found that a number of alternative measures of the "innovation environment" (e.g., measures of the overall


R&D budget, as well as various indices of innovative capacity) had a negative association with piracy. However, we were unable to disentangle the separate effect of these broader measures of the innovation orientation of a region and a simpler measure such as patents per capita. Finally, computers are capital goods; in countries with a higher lending rate, the observed rate of piracy is higher (and this result is robust to the use of the real rather than nominal lending rate as well). Perhaps most importantly, once we control for these direct effects on the piracy rate (and across a wide variety of specifications including only a subset or variant of these types of measures), the coefficient on GDP per capita is both small and insignificant. Rather than income per se, the results from table 15.2 provide suggestive evidence that piracy rates are driven by the institutional and technological attributes of a given country, including, most importantly, whether a country has institutions that support property rights and innovation. Poorer countries tend to have weaker institutional environments (Hall and Jones [1997], among many others), and it is the environment rather than income per se that seems to be correlated with the observed level of piracy.

We explore the robustness of this core finding in table 15.3, where we examine several alternative ways of capturing the baseline institutional environment of a country and evaluate the impact of GDP per capita on piracy once such measures are included. In table 15.3, columns (1) and (2), we simply replace the Foundational Competitiveness Index with the World Bank Rule of Law measure and the Heritage Foundation Property Rights Index, respectively. In both cases, the broad pattern of results remains the same, and the coefficient on GDP per capita remains very small and statistically insignificant. In table 15.3, columns (3) and (4), we extend this analysis by focusing on the subset of countries highlighted in the important work of Acemoglu, Johnson, and Robinson (2001). Acemoglu, Johnson, and Robinson argue that the colonial origins of individual countries have had a long-term impact on institutional quality, and they specifically highlight a measure of settler mortality (from the mid-1800s) as a proxy for the "deep" origins of contemporary institutional quality. We build on this idea by directly including their measure of settler mortality. Though the sample size is much reduced (we are left with only forty-three country-level observations), the overall pattern of results is maintained, and there is some (noisy) evidence that settler mortality itself is positively associated with piracy (consistent with a high level of settler mortality being associated with long-term weakness in the institutional environment); most notably, in both columns (3) and (4) of table 15.3, the coefficient on GDP per capita remains small and insignificant.

We further explore these ideas by looking at a few case studies, examples of city pairs that share roughly the same income level but are located in countries with wide variation in their institutional environment. Drawing on city-specific GDP per capita data from the Brookings Global Metro Monitor Project, we identify four city-pairs with similar income levels but


Table 15.3  Alternative measure of institutional quality

Dependent variable: Windows 7 piracy rate

Ln GDP per capita: (1) ‒.015 (.018); (2) ‒.018 (.017); (3) ‒.019 (.039); (4) ‒.006 (.039)
WB Rule of Law: (1) ‒.074*** (.021)
Ln patents per capita (a): (1) ‒.013 (.009); (2) ‒.016** (.008); (3) ‒.022 (.014); (4) ‒.023 (.01)
Ln broadband download speed: (1) .013 (.009); (2) .014 (.009); (3) 4e-4 (.015); (4) .005 (.014)
Ln broadband monthly rate: (1) ‒.087 (.025); (2) ‒.076*** (.024); (3) ‒.106** (.045); (4) ‒.122*** (.045)
Lending rate: (1) .005*** (.001); (2) .005*** (.001); (4) .005*** (.002)
Prop. rights: (2) ‒.003*** (.001)
Ln settler mortality: (3) .045* (.026); (4) .04 (.027)
Observations: (1) 95; (2) 93; (3) 43; (4) 43
R-squared: (1) .786; (2) .79; (3) .664; (4) .694

Note: Robust standard errors in parentheses.
a. Ln patents per capita is defined as Ln(1 + patents per capita).
***Significant at the 1 percent level. **Significant at the 5 percent level. *Significant at the 10 percent level.

wide variation in institutional quality (as measured by the World Bank Rule of Law Index) (see table 15.4). The results are striking. While Johannesburg and Beijing have roughly the same GDP per capita, the piracy rate in Beijing is recorded to be more than twice as high as in Johannesburg (a similar comparison can be made between Shenzhen, China, and Berlin, Germany). A particularly stark comparison can be drawn between Moscow, Russia, and Sydney, Australia, where relatively modest differences in "prosperity" cannot explain a nearly fourfold difference in the observed piracy rate. While these suggestive examples are simply meant to reinforce our more systematic regression findings, we believe that this strategy—exploiting variation within and across countries in both GDP and institutions through regional analyses—offers a promising approach going forward in terms of evaluating the drivers of piracy in a more nuanced way.

Figure 15.6 sharpens this analysis by plotting the actual piracy rate versus the predicted piracy rate (as estimated from table 15.2, column [4]).

Table 15.4  City-pair comparisons: Rule of law and GDP per capita comparisons by piracy rate

Pair 1: Johannesburg, South Africa. GDP per capita (thousands [$], PPP rates): 17.4; Rule of Law Index (WB): 0.10; piracy rate: 0.24
Pair 1: Beijing, China. GDP per capita: 20.3; Rule of Law Index: ‒0.45; piracy rate: 0.55
Pair 2: Kuala Lumpur, Malaysia. GDP per capita: 23.9; Rule of Law Index: 0.51; piracy rate: 0.29
Pair 2: São Paulo, Brazil. GDP per capita: 23.7; Rule of Law Index: 0.013; piracy rate: 0.55
Pair 3: Moscow, Russia. GDP per capita: 44.8; Rule of Law Index: ‒0.78; piracy rate: 0.56
Pair 3: Sydney, Australia. GDP per capita: 45.4; Rule of Law Index: 1.77; piracy rate: 0.15
Pair 4: Shenzhen, China. GDP per capita: 28; Rule of Law Index: ‒0.45; piracy rate: 0.44
Pair 4: Berlin, Germany. GDP per capita: 33.3; Rule of Law Index: 1.69; piracy rate: 0.24

Fig. 15.6 Predicted versus actual piracy rate
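The exercise behind figure 15.6 amounts to comparing fitted and observed piracy rates and inspecting the residuals. A minimal sketch, reusing the illustrative `results` and `countries` objects from the synthetic regression example above (hypothetical names, not the authors' code), might look as follows.

```python
# Reusing 'results' and 'countries' from the illustrative regression sketch above.
countries["predicted"] = results.fittedvalues
countries["residual"] = countries["piracy_rate"] - countries["predicted"]

# Countries above the 45-degree line in figure 15.6 pirate more than their economic,
# institutional, and infrastructure fundamentals would predict; those below pirate less.
largest_gaps = (countries.assign(abs_residual=countries["residual"].abs())
                         .nlargest(10, "abs_residual"))
print(largest_gaps[["piracy_rate", "predicted", "residual"]])
```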

Several notable countries with high piracy rates and intense public attention on the issue (such as China and Brazil) have an observed piracy rate only slightly above that which would be predicted by their "fundamentals." The leading English-speaking countries and Japan have low piracy rates, but those are even lower than predicted by the model. It is also useful to highlight some of the most notable outliers: New Zealand registers a piracy rate far below that which would be predicted by observable factors, and South Korea realizes a level of piracy well above that which would be predicted by observables. Overall, our results suggest that the wide variation of piracy observed across countries reflects a combination of systematic and idiosyncratic factors.

Finally, in table 15.5, we examine a number of other potential drivers of piracy that have been discussed in the prior literature.


Table 15.5  Other potential drivers of piracy

Dependent variable: Windows 7 piracy rate

Ln GDP per capita: (1) ‒.025 (.02); (2) ‒.028 (.02); (3) ‒.026 (.026); (4) ‒.012 (.024); (5) ‒.025 (.02)
Competition Index: (1) ‒.039** (.019); (2) ‒.034* (.02); (3) ‒.052** (.022); (4) ‒.032 (.021); (5) ‒.036* (.019)
Ln broadband download speed: (1) .008 (.009); (2) .01 (.009); (3) .009 (.01); (4) .01 (.009); (5) .008 (.009)
Ln broadband monthly rate: (1) ‒.087*** (.026); (2) ‒.094*** (.027); (3) ‒.09*** (.027); (4) ‒.087*** (.025); (5) ‒.087*** (.026)
Ln patents per capita (a): (1) ‒.023*** (.007); (2) ‒.024*** (.007); (3) ‒.02* (.01); (4) ‒.022*** (.007); (5) ‒.023*** (.007)
Lending rate: (1) .005*** (.001); (2) .005*** (.001); (3) .005*** (.002); (4) .005*** (.001); (5) .005*** (.002)
Ln population: (1) 3e-4 (.007)
Ln population density: (2) ‒.009 (.006)
Gini coefficient: (3) 2e-4 (.001)
Internet: (4) ‒.001 (.001)
Inflation: (5) .002 (.003)
Observations: (1) 95; (2) 95; (3) 85; (4) 95; (5) 95
R-squared: (1) .762; (2) .767; (3) .769; (4) .764; (5) .762

Note: Robust standard errors in parentheses.
a. Ln patents per capita is defined as Ln(1 + patents per capita).
***Significant at the 1 percent level. **Significant at the 5 percent level. *Significant at the 10 percent level.

For example, in table 15.5, columns (1) and (2), we include measures of population and population density, while in table 15.5, column (3), we include a measure of country-level income inequality. The inclusion of these measures does not have a material effect on our earlier findings, and the added measures are estimated to have a small and insignificant impact. Similar patterns are observed when we include a measure of overall Internet penetration, or a measure of inflation. While the small size of our country-level data set precludes us from drawing firm conclusions about the relative importance of these additional factors, our overall pattern of results suggests that software piracy is closely associated with fundamental features of the institutional and technological environment, rather than being primarily driven by measures of income or income inequality.

15.5.3 The Relationship between Piracy and Machine Characteristics

While the primary focus of the analysis in this chapter has been on the impact of the broader economic, institutional, and technological environment on country-level piracy, it is also useful to explore the composition of piracy within a country, and specifically examine the relationship between piracy and other elements of the machines that users are purchasing and/or upgrading. To do so, we reorganize our data set to capture the level of piracy within a given country for a certain "type" of machine (e.g., the rate of piracy for computers that are produced by a leading OEM after Windows 7 was introduced). We are therefore able to examine how the rate of piracy varies among different populations of machines; we control for country-level differences in the overall rate of piracy by including country-level fixed effects in our specifications, as well. We weight the regressions so that each country is weighted equally, but we weight each machine type within a country according to its share within the country-level population. The results are presented in table 15.6. First, consistent with the global averages we presented in figures 15.2A and 15.2B, table 15.6, columns (1) and (2), document that the rate of piracy is much higher for machines that are produced by fringe manufacturers or assemblers, and is modestly higher among machines that are unambiguously receiving an upgrade (i.e., from

Table 15.6  Piracy and machine characteristics

Dependent variable: Windows 7 piracy rate

OEM leading manufacturer: (1) ‒.227*** (.008); (3) ‒.274*** (.011); (4) ‒.234*** (.009)
Windows 7 model: (2) ‒.391*** (.011); (3) .003 (.011); (4) ‒.005 (.008)
OEM leading manufacturer × Windows 7 model (a): (3) ‒.180*** (.013); (4) ‒.047*** (.009)
Frontier architecture: (4) ‒.094*** (.005)
Windows Professional: (4) .048*** (.008)
Windows Ultimate: (4) .426*** (.07)
Observations: (1) 4,518; (2) 4,518; (3) 4,518; (4) 4,518
R-squared: (1) .360; (2) .487; (3) .534; (4) .862

Note: Robust standard errors in parentheses.
a. Ln patents per capita is defined as Ln(1 + patents per capita).
***Significant at the 1 percent level. **Significant at the 5 percent level. *Significant at the 10 percent level.


Windows Vista) as the machine was not produced after Windows 7 was launched. Perhaps more interestingly, there is a very strong interaction effect between these two machine characteristics. Essentially, the highest rate of piracy is observed among older machines (i.e., not Windows 7 models) that are produced by fringe manufacturers or assemblers. This core pattern of interaction is robust to the inclusion or exclusion of a variety of controls, including a control for whether the machine has frontier hardware (i.e., a 64-bit versus 32-bit microprocessor), and to accounting for the precise version of Windows that is installed. Also consistent with our earlier descriptive statistics, it is useful to note that the rate of piracy is much higher for machines with Windows Pro and Windows Ultimate; given the global availability of all versions of Windows, it is not surprising that pirates choose to install the highest level of software available.
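A sketch of how a weighted specification of this kind, with country fixed effects and an OEM-by-model interaction, might be set up is below. The cell-level data frame, variable names, and weights are illustrative assumptions (synthetic data), not the authors' implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
country_codes = [f"c{i}" for i in range(20)]

# One row per (country, machine type): hypothetical cell-level piracy rates and the
# cell's share of machines within its country (shares sum to one per country).
rows = []
for c in country_codes:
    shares = rng.dirichlet(np.ones(4))
    for (leading, w7), s in zip([(0, 0), (0, 1), (1, 0), (1, 1)], shares):
        rows.append({"country": c, "leading_oem": leading, "win7_model": w7,
                     "share": s, "piracy_rate": rng.uniform(0, 1)})
cells = pd.DataFrame(rows)

# Weighting each cell by its within-country share gives every country equal total weight,
# while machine types within a country are weighted by their population shares.
spec = "piracy_rate ~ leading_oem * win7_model + C(country)"
results = smf.wls(spec, data=cells, weights=cells["share"]).fit(cov_type="HC1")
print(results.params.filter(like="leading_oem"))
```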

15.5.4 The Impact of Antipiracy Enforcement Efforts on Software Piracy

Finally, we take advantage of time-series variation in our data to directly investigate the impact of the most notable antipiracy enforcement efforts on the contemporaneous rate of Windows 7 piracy. Specifically, during the course of our 2011 and 2012 sample period, a number of individual countries imposed bans on the Pirate Bay website, the single largest source of pirated digital media on the Internet. Though such policy interventions are broadly endogenous (the bans arise in response to broad concerns about piracy), the precise timing of the intervention is arguably independent of changes over time in Windows 7 piracy in particular, and so it is instructive to examine how a change in the level of enforcement against piracy impacts the rate of Windows 7 software piracy.

We examine three interventions: the ban of Pirate Bay by the United Kingdom in June 2012, by India in May 2012, and by Finland in May 2011. For each country, we define a "control group" of peer countries that can be used as a comparison, both in terms of the preintervention level of piracy and in terms of having enough geographic/cultural similarity that any unobserved shocks are likely common to both the treatment and control countries. For the United Kingdom, the control group is composed of France and Ireland; for India, we include both geographically proximate countries such as Bangladesh and Pakistan, as well as the other BRIC countries (Brazil, Russia, and China); and for Finland, we use the remainder of Scandinavia. For each country and for each month before and after the intervention, we calculate the rate of piracy among machines that are first observed within the telemetry data for that month. As such, we are able to track the rate of "new" pirates within each country over time. If restrictions on the Pirate Bay were salient for software piracy, we should observe a decline in the rate of new piracy for those countries impacted by the restriction (relative to the trend in the control countries), at least on a temporary basis. Figures 15.7A, 15.7B, and 15.7C present the results.
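As a rough illustration of this comparison, the sketch below outlines how monthly new-pirate rates for treated and control countries could be aggregated around a ban date. The data layout and country codes are hypothetical, and this is not the authors' code; the example call mirrors the UK case described above.

```python
import pandas as pd

# 'new_machines' is assumed to hold one row per machine with the month it first appears
# in the telemetry data ('first_month', a datetime), its country, and a 'pirate' flag.

def monthly_new_piracy_rate(new_machines: pd.DataFrame) -> pd.DataFrame:
    """Share of newly observed machines classified as pirates, by country and month."""
    return (new_machines
            .groupby(["country", "first_month"], as_index=False)["pirate"]
            .mean()
            .rename(columns={"pirate": "new_pirate_rate"}))

def event_comparison(rates, treated, controls, ban_month, window=6):
    """Mean new-pirate rate for treated vs. control countries around the ban month."""
    ban = pd.Timestamp(ban_month)
    rel = ((rates["first_month"].dt.year - ban.year) * 12
           + (rates["first_month"].dt.month - ban.month))
    out = rates.assign(rel_month=rel,
                       group=rates["country"].map(
                           lambda c: "treated" if c in treated
                           else "control" if c in controls else None))
    out = out[out["rel_month"].between(-window, window)].dropna(subset=["group"])
    return out.groupby(["group", "rel_month"])["new_pirate_rate"].mean().unstack("group")

# Example: the UK ban (June 2012) with France and Ireland as the control group.
# print(event_comparison(monthly_new_piracy_rate(new_machines),
#                        treated={"GB"}, controls={"FR", "IE"}, ban_month="2012-06-01"))
```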

Fig. 15.7A UK piracy rate (effective ban date, June 2012)

Fig. 15.7B India piracy rate (effective ban date, May 2012)

Across all three interventions, there does not seem to be a meaningful decline in the rate of piracy after the Pirate Bay restriction, either on an absolute basis or relative to the trend followed by the control countries. We were unable to find a quantitatively or statistically significant difference that resulted from these interventions. This "nonfinding" suggests that, at least for operating system piracy, the main focus on supply-side enforcement efforts may be having a relatively small impact;

Fig. 15.7C Finland piracy rate (effective ban date, November 2011)

there may simply be too many alternative sources of pirated Windows, and the pirate-user community may be sufficiently pervasive so as to provide potential pirates with new routes to piracy in the face of supply-side enforcement efforts.

15.6 Conclusions

The primary contribution of this chapter has been to conduct the first large-scale observational study of software piracy. By construction, this is an exploratory exercise, and even our most robust empirical findings are limited to considering the specific domain of piracy of Windows 7. With that said, we have established a number of novel findings that should be of interest to researchers in digitization and piracy going forward.

First, our research underscores the global nature of software piracy, and the role of large-scale global sharing of software and piracy protocols. Relative to the pre-Internet era, when piracy may indeed have been pervasive but its diffusion was local (almost by definition), the diffusion of the Internet, the widespread availability of broadband, and the rise of user communities that specifically provide guidance about how and what to pirate have changed the nature of contemporary software piracy.

Second, though the type of data that we use is novel, the bulk of our analysis builds on a small but important literature that has linked the rate of piracy to the economic, institutional, and technological environment. At one level, our findings using observational data are broadly consistent with


that prior literature; however, our analysis has allowed us to clarify a key empirical distinction: at least in the context that we examined, it is the quality of the institutional environment, rather than income per se, which is more closely linked with piracy. This finding is particularly salient, since a key argument against copyright enforcement depends on income-based price discrimination. The distinction between the quality of institutions and income can be seen in a particularly sharp way by comparing cities that have similar income levels but are located in countries with different institutional environments. Though we only undertake a small number of comparisons of this type, our exploratory work looking at cities suggests a future direction of research that can sharpen our identification argument: Do cities that are at different levels of income but share the same institutions behave more similarly than cities with the same level of income but with different institutions?

Finally, our observational data allows us to directly assess the impact of the most high-profile enforcement efforts against piracy—the choices by individual countries to restrict access to the Pirate Bay over the last several years. Over a number of different experiments, and examining a number of alternative control groups, we are not able to identify a meaningful impact of these enforcement efforts on the observed rate of Windows 7 piracy. While such enforcement efforts may be having a meaningful effect on other types of piracy (e.g., movies or music), supply-side enforcement initiatives have not yet meaningfully deterred large-scale operating systems piracy.

More generally, our analysis highlights the potential value of exploiting new types of data that passively capture user behavior in a direct way. By observing the actual choices that users make about what types of software to install (and where and in conjunction with what types of machine configurations), our analysis offers new insight into both the nature and incidence of software piracy. By and large, our results are consistent with prior measures such as those produced by the Business Software Alliance that suggest that the rate of software piracy is a large and meaningful economic phenomenon. Our results suggest that those earlier findings are not simply the result of the BSA methodology, but reflect the underlying phenomenon. This is particularly important since the rate of piracy is extremely low in the United States, and so claims about piracy are often met with some skepticism. Our direct observational approach not only reinforces those earlier findings, but has allowed us to document both the nature and drivers of piracy in a way that may be instructive for policy and practice going forward.

References

Acemoglu, D., S. Johnson, and J. Robinson. 2001. "The Colonial Origins of Comparative Development: An Empirical Investigation." American Economic Review 91 (5): 1369‒401.


American Heritage Dictionary, 4th Ed. 2000. "Software Piracy." Boston: Houghton Mifflin.
Andres, A. R. 2006. "Software Piracy and Income Inequality." Applied Economics Letters 13:101–05.
Banerjee, D., A. M. Khalid, and J. E. Strum. 2005. "Socio-Economic Development and Software Piracy: An Empirical Assessment." Applied Economics 37:2091–97.
Bezmen, T. L., and C. A. Depken. 2006. "Influences on Software Piracy: Evidence from Various United States." Economics Letters 90:356–61.
Burke, A. E. 1996. "How Effective are International Copyright Conventions in the Music Industry?" Journal of Cultural Economics 20:51–66.
Business Software Alliance (BSA). 2011. "Ninth Annual BSA Global Software 2011 Piracy Study." http://globalstudy.bsa.org/2011/.
Central Intelligence Agency. 2007. The World Factbook 2008. New York: Skyhorse Publishing.
CNET. 2009a. "Microsoft Acknowledges Windows 7 Activation Leak." News by Dong Ngo. http://news.cnet.com/8301–10805_3–10300857–75.html.
———. 2009b. "Microsoft Windows 7." Online Professional Review. http://reviews.cnet.com/windows/microsoft-windows-7–professional/4505–3672_7–33704140–2.html.
Conner, K. R., and R. P. Rumelt. 1991. "Software Piracy: An Analysis of Protection Strategies." Management Science 37 (2): 125–37.
Danaher, B., M. D. Smith, and R. Telang. 2014. "Piracy and Copyright Enforcement Mechanisms." In Innovation Policy and the Economy, vol. 14, edited by Josh Lerner and Scott Stern, 24–61. Chicago: University of Chicago Press.
Delgado, M., C. Ketels, E. Porter, and S. Stern. 2012. "The Determinants of National Competitiveness." NBER Working Paper no. 18249, Cambridge, MA.
Goel, Rajeev, and M. Nelson. 2009. "Determinants of Software Piracy: Economics, Institutions, and Technology." Journal of Technology Transfer 34 (6): 637‒58.
Gopal, R. D., and G. L. Sanders. 1998. "International Software Piracy: Analysis of Key Issues and Impacts." Information Systems Research 9 (4): 380–97.
———. 2000. "Global Software Piracy: You Can't Get Blood Out of a Turnip." Communications of the ACM 43 (9): 82–89.
Greenstein, S., and J. Prince. 2006. "The Diffusion of the Internet and the Geography of the Digital Divide in the United States." NBER Working Paper no. 12182, Cambridge, MA.
Hall, Robert E., and Charles I. Jones. 1997. "Levels of Economic Activity across Countries." American Economic Review 87 (2): 173‒77.
Kaufmann, D., A. Kraay, and M. Mastruzzi. 2009. "Governance Matters VIII: Aggregate and Individual Governance Indicators, 1996–2008." World Bank Policy Research Working Paper no. 4978, World Bank.
Landes, W. M., and R. A. Posner. 1989. "An Economic Analysis of Copyright Law." Journal of Legal Studies 18:325‒66.
Marron, D. B., and D. G. Steel. 2000. "Which Countries Protect Intellectual Property? The Case of Software Piracy." Economic Inquiry 38:159–74.
Merrill, S., and W. Raduchel. 2013. Copyright in the Digital Era: Building Evidence for Policy. Washington, DC: National Academies Press.
Meurer, M. J. 1997. "Price Discrimination, Personal Use and Piracy: Copyright Protection of Digital Works." Buffalo Law Review. https://ssrn.com/abstract=49097.
Microsoft. 2009. "Announcing the Windows 7 Upgrade Option Program & Windows 7 Pricing - Bring on GA!" Windows 7 Blog by Brandon LeBlanc. http://blogs.windows.com/windows/archive/b/windows7/archive/2009/06/25/announcingthe-windows-7–upgrade-option-program-amp-windows-7–pricing-bring-on-ga.aspx.



My Digital Life. 2013. "Windows Loader: Current Release Information." Forum. http://forums.mydigitallife.info/threads/24901-Windows-Loader-Current-release-information.
Oberholzer-Gee, F., and K. Strumpf. 2010. "File Sharing and Copyright." In Innovation Policy and the Economy, vol. 10, edited by Josh Lerner and Scott Stern, 19–55. Chicago: University of Chicago Press.
Oz, S., and J. F. Thisse. 1999. "A Strategic Approach to Software Protection." Journal of Economics and Management Strategy 8 (2): 163–90.
Peace, A. G., D. F. Galletta, and J. Y. L. Thong. 2003. "Software Piracy in the Workplace: A Model and Empirical Test." Journal of Management Information Systems 20 (1): 153–77.
Reddit. 2013. "Is Anyone Using a Pirated Copy of Windows 7 or 8?" Reddit thread. http://www.reddit.com/r/Piracy/comments/1baus9/is_anyone_using_a_pirated_copy_of_windows_7_or_8/.
Silva, F., and G. B. Ramello. 2000. "Sound Recording Market: The Ambiguous Case of Copyright and Piracy." Industrial and Corporate Change 9:415–42.
Waldfogel, J. 2011. "Bye, Bye, Miss American Pie? The Supply of New Recorded Music since Napster." NBER Working Paper no. 15882, Cambridge, MA.

Comment

Ashish Arora

Ashish Arora is the Rex D. Adams Professor of Business Administration at the Fuqua School of Business at Duke University and a research associate of the National Bureau of Economic Research. For acknowledgments, sources of research support, and disclosure of the author's or authors' material financial relationships, if any, please see http://www.nber.org/chapters/c13127.ack.

The growth of the digital economy has also increased interest in the unauthorized use of digital goods. The existing literature has tended to focus either on the issue of whether a particular instance of piracy—unauthorized use—is a net social "bad" (e.g., whether it is a form of de facto price discrimination), or on the efficacy of specific types of enforcement efforts. Some studies do provide estimates of the extent of piracy, but the results are not credible because the studies are linked to advocacy efforts and suffer from weaknesses in methods and implausible assumptions. The question has become more salient with the rise of broadband technologies that have apparently made it easier to distribute digital products, including pirated products. Athey and Stern have done an important service by providing a reasonable measure of the problem for an important product.

An important contribution of the chapter is its careful attention to measurement. Even with the new technology that allows Microsoft to discern whether the product use is based on an authorized key, matters are not straightforward. For instance, I know from personal experience that unless laptops are regularly connected to the network of the institution that purchased the license to the software, Microsoft policy is to incorrectly treat that use as unauthorized.



Athey and Stern get around this problem by focusing on specific keys, and by attending to whether the machine eventually reauthorizes with a valid key.

Given the conservatism of the estimates, the results give one pause. The scale of the problem is large. Over a quarter of all copies of Windows 7 are unauthorized, with significant variation across countries. My "back-of-the-envelope" calculations (made explicit in the sketch at the end of this comment) indicate that a 25 percent piracy rate for Windows alone implies $6.1 billion in lost revenue and $3.8 billion in lost operating income for Microsoft. These are consistent with the large estimates of losses due to piracy reported by advocacy organizations such as the BSA, but they assume a direct correspondence between the extent of piracy and the extent of the loss. One needs better estimates of the demand (for the authorized product and for the pirated one) to assess the validity of such estimates.

Premium versions of the software are more prone to the problem, implying that this is not a case of de facto price discrimination. Put differently, a common prescription in both IT and pharmaceuticals for combating piracy is for manufacturers to introduce lower-priced versions in poorer countries. The Athey-Stern results suggest that this prescription will not work.

They note another interesting result, albeit without comment. Although machines from smaller manufacturers tend to have a higher percentage of pirated software, the bulk of the pirated software is in computers produced by the leading manufacturers (OEMs). Further, these are also the manufacturers responsible for the keys that allow for unauthorized installations of Windows 7. Some obvious questions arise. Are Microsoft's contracts, or the enforcement of those contracts, with these OEMs at fault? Are the OEMs contriving to reduce their payments to Microsoft by shipping machines without Windows? What liability do OEMs face when a key given to them is leaked?

Athey and Stern instead focus on relating observed levels of piracy to country-specific institutions. They conclude that institutions associated with a greater respect for private property reduce piracy, even after controlling for how rich the country is. In plain words, the incidence of piracy is greatest in middle-income countries afflicted with corrupt governments or weaker capitalist institutions, or both. This finding could reflect greater moral acceptance on the part of buyers of pirated products, or a greater profitability (for a given level of demand for pirated products) of supplying pirated products, or both. It appears that this is mostly a demand-side explanation, because greater enforcement (e.g., shutting down Pirate Bay) appears to have little effect on the measured rate of piracy. More precisely, greater enforcement against suppliers of pirated products appears to be ineffective in reducing piracy. If so, then producers of digital goods face an uncomfortable decision, namely, whether to coerce their customers to use authorized products only.



Indeed, Microsoft appears to have moved in this direction, forcing users to regularly authenticate their software and imposing modest downgrades of product functionality. It appears that this is not enough to dissuade a significant number of buyers from choosing the pirated products, which are cheaper or perhaps even free.

Two sets of research questions arise. The first relates to pricing. In effect, countries with higher rates of piracy have a lower willingness to pay for the authentic product. If so, might the problem lie in how the authentic product is priced? It would be interesting to know whether Microsoft has experimented with discounts and other ways of tweaking its price, and what this tells us about the implied willingness to pay for pirated products. It may well be that Microsoft is already pricing optimally, given the ineffectiveness of supply-side enforcement efforts.

A second, and related, question is whether customers should be induced to eschew pirated products by downgrading their functionality, for example by denying updates and patches. Such a strategy may also be costly because some legitimate users may be incorrectly classified as using unauthorized software. Other possible costs include greater security risks for legitimate users (a larger fraction of unauthorized users may have compromised machines), legal liability, and reputation costs. It is obvious that such an exercise requires estimates of the willingness to pay for the authentic product as well as the willingness to pay for the pirated product. More generally, sensible estimates of demand would also help inform us about the magnitude of the lost revenue and profits. It is striking, though perhaps not surprising, that the chapter is silent on the issue.

However, any such exercise must also take into account competitive conditions. It may well suit a dominant producer to have its product crowd out a possible competitor, be it an alternative operating system product (Linux) or a competing platform (Apple). Tolerating or even encouraging some level of piracy may be a way to keep competitors at bay. Thus, it would be interesting to explore whether countries with high rates of piracy also have higher shares of Microsoft Windows relative to alternative operating systems.

Regardless, this study makes an important contribution by carefully documenting the incidence of piracy across the world and correlating it with the level of institutional development of the country.
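A note on the back-of-the-envelope figures cited above: the following is a minimal sketch of the arithmetic, not a calculation reported in the chapter or in this comment. It assumes, as BSA-style loss estimates do, that every unauthorized install displaces exactly one paid license, and the revenue and operating-income baselines shown are simply backed out of the quoted loss figures rather than taken from Microsoft's reported financials.

\[
\text{imputed loss} \;=\; \text{observed amount} \times \frac{p}{1-p},
\qquad p = 0.25 \;\Longrightarrow\; \frac{p}{1-p} = \frac{1}{3},
\]
\[
\$6.1\ \text{billion} \;\approx\; \tfrac{1}{3} \times \$18.3\ \text{billion (implied revenue base)},
\qquad
\$3.8\ \text{billion} \;\approx\; \tfrac{1}{3} \times \$11.4\ \text{billion (implied operating-income base)}.
\]

A demand-based estimate would instead count only those unauthorized users whose willingness to pay meets or exceeds the license price, which is why the comment calls for demand estimates for both the authorized and the pirated product.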

Contributors

Ajay Agrawal Rotman School of Management University of Toronto 105 St. George Street Toronto, ON M5S 3E6 Canada

Ashish Arora Fuqua School of Business Duke University Box 90120 Durham, NC 27708-0120

Susan Athey Graduate School of Business Stanford University 655 Knight Way Stanford, CA 94305

Michael R. Baye Department of Business Economics and Public Policy Kelley School of Business Indiana University Bloomington, IN 47405

Timothy F. Bresnahan SIEPR Landau Economics Building, Room 325 579 Serra Mall Stanford, CA 94305-6072

Erik Brynjolfsson MIT Sloan School of Management 100 Main Street, E62-414 Cambridge, MA 02142

Brett Danaher Department of Economics Wellesley College Wellesley, MA 02481

Babur De los Santos Department of Business Economics and Public Policy Kelley School of Business Indiana University Bloomington, IN 47405

Samita Dhanasobhon School of Information Systems and Management Heinz College Carnegie Mellon University Pittsburgh, PA 15213

Chris Forman Georgia Institute of Technology Scheller College of Business 800 West Peachtree Street, NW Atlanta, GA 30308




Joshua S. Gans Rotman School of Management University of Toronto 105 St. George Street Toronto ON M5S 3E6 Canada

Elizabeth Lyons IR/PS UC San Diego 9500 Gilman Drive, MC 0519 La Jolla, CA 92093-0519

Matthew Gentzkow University of Chicago Booth School of Business 5807 South Woodlawn Avenue Chicago, IL 60637

Megan MacGarvie Boston University School of Management 595 Commonwealth Avenue, Room 522H Boston, MA 02215

Avi Goldfarb Rotman School of Management University of Toronto 105 St. George Street Toronto, ON M5S 3E6 Canada

Catherine L. Mann International Business School Brandeis University Waltham, MA 02453

Shane M. Greenstein Kellogg School of Management Northwestern University 2001 Sheridan Road Evanston, IL 60208-2013

Amalia R. Miller Department of Economics University of Virginia P. O. Box 400182 Charlottesville, VA 22904

Hanna Halaburda Bank of Canada 234 Laurier Avenue West Ottawa, ON, K1A 0G9 Canada

Petra Moser Department of Economics Stanford University 579 Serra Mall Stanford, CA 94305-6072

John Horton Stern School of Business New York University 44 West Fourth Street, 8-81 New York, NY 10012

Tatiana Komarova Department of Economics London School of Economics and Political Science Houghton Street London, WC2A 2AE England

Nicola Lacetera University of Toronto 105 St. George Street Toronto, ON M5S 2E9 Canada

Randall Lewis Google, Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043

Denis Nekipelov Monroe Hall, Room 254 University of Virginia P.O. Box 400182 Charlottesville, VA 22904

Justin M. Rao Microsoft Research 641 Avenue of the Americas, 7th Floor New York, NY 10011

David H. Reiley Google, Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043

Marc Rysman Department of Economics Boston University 270 Bay State Road Boston, MA 02215

Steven L. Scott Google, Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043

Catherine E. Tucker MIT Sloan School of Management 100 Main Street, E62-533 Cambridge, MA 02142

Jesse M. Shapiro University of Chicago Booth School of Business 5807 S. Woodlawn Avenue Chicago, IL 60637

Hal R. Varian Google, Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043

Timothy Simcoe Boston University School of Management 595 Commonwealth Avenue Boston, MA 02215

Michael D. Smith School of Information Systems and Management Heinz College Carnegie Mellon University Pittsburgh, PA 15213

Christopher Stanton University of Utah David Eccles School of Business 1655 East Campus Center Drive Salt Lake City, UT 84112

Scott Stern MIT Sloan School of Management 100 Main Street, E62-476 Cambridge, MA 02142

Koleman Strumpf University of Kansas School of Business Summerfield Hall 1300 Sunnyside Avenue Lawrence, KS 66045-7601

Rahul Telang School of Information Systems and Management Heinz College Carnegie Mellon University Pittsburgh, PA 15213


Joel Waldfogel 3-177 Carlson School of Management University of Minnesota 321 19th Avenue South Minneapolis, MN 55455

Scott Wallsten Technology Policy Institute Suite 520 1099 New York Ave., NW Washington, DC 20001

Matthijs R. Wildenbeest Department of Business Economics and Public Policy Kelley School of Business Indiana University Bloomington, IN 47405

Lynn Wu University of Pennsylvania The Wharton School JMHH 561 3730 Walnut Street Philadelphia, PA 19104

Evgeny Yakovlev New Economic School Nakhimovsky pr., 47, off. 905 Moscow 117418, Russia

Author Index

Abhishek, V., 163n25 Abowd, J., 283 Abraham, M., 199, 203 Abramovsky, L., 239 Abrams, S. J., 171 Acemoglu, D., 460, 467 Acquisti, A., 284, 285, 315, 317, 335, 344 Agarwal, D., 212 Aggarwal, G., 283 Agrawal, A., 11, 12, 222, 239, 241, 244, 250 Akerlof, G. A., 244 Alloway, T., 313n2 Ambrus, A., 176, 179 Anderson, C., 172, 236 Anderson, H. E., 314n3 Anderson, R., 318 Anderson, S. P., 176, 179 Andres, A. R., 449 Antras, P., 239 Appleton-Young, L., 91 Armstrong, M., 176, 179, 229, 258n2 Arola, C., 119 Arora, A., 12, 344n28, 345 Arrow, K. J., 93, 312 Arthur, W. B., 24n3 Athey, S., 176, 179 August, T., 345 Autor, D. H., 10, 228, 229, 244 Awad, N., 426 Ayres, I., 353

Bachlechner, D., 341n18 Bagwell, K., 192 Bajari, P., 38n23 Bakos, J., 9 Balasubramanian, S., 9 Baldwin, C. Y., 23, 25, 31, 34 Ballantyne, J. A., 376 Bamberger, K. A., 343n26 Banerjee, D., 449 Bar-Isaac, H., 10, 238 Basuroy, S., 358n2 Bautz, A., 361, 366n22 Baye, M., 9, 139n4, 139n5, 143n11, 143n12, 149 Becker, G. S., 23, 191 Berners-Lee, T., 27 Berry, S. T., 176, 433 Bertrand, M., 390n4 Bezmen, T. L., 449 Bhagwati, J., 245 Blackburn, D., 407n1 Blake, T., 192, 193, 200, 200n15, 246 Bloom, N., 229n3, 254 Blum, B. S., 9, 12 Boardman, A., 57 Bojanc, R., 344n28 Bradley, C., 280n1 Brecht, M., 341n18 Bresnahan, T. F., 5, 21n1, 24, 26, 49n1, 50, 52n3 Broder, A., 211




Brodersen, K., 128 Brooks, F., 24n5 Brynjolfsson, E., 8, 9, 10, 57, 84, 91, 93, 114, 139, 147, 237, 239, 419 Bucklin, R. E., 194 Burke, A. E., 449 Burks, S., 252n1 Cabral, L., 10, 230 Calvano, E., 176, 179 Calzolari, G., 285 Campagnoli, P., 120 Campbell, K., 340 Card, D., 198 Carlin, J., 198n13 Carrière-Swallow, Y., 119 Carter, C. K., 132 Carty, M., 344n28 Caruana, G., 10, 238 Case, K. E., 94, 106 Castells, M., 13 Castle, J. L., 120 Catalini, C., 11, 12 Caves, R. E., 412, 413 Chakrabarti, D., 212 Chan, D., 200 Chevalier, J., 10, 139, 419 Chipman, H., 122 Choi, H., 93, 113, 119, 129, 152n19, 246 Chown, T., 343n26 Ciriani, V., 283 Clark, D. D., 30 Clark, K. B., 23, 25, 31, 34 Clay, K., 139 Clyde, M. A., 133 Coles, P., 243 Colfer, L., 31 Conner, K. R., 448 Cuñat, V., 10, 238 Cutler, D. M., 174 D’Amuri, F., 152n19 Danaher, B., 386n1, 387n2, 391n5, 394n7, 400, 447, 448 Davenport, T. H., 93 David, P. A., 21n2, 24n3 Deazley, R., 361 Debreu, G., 312 De Jong, P., 132 Delgado, M., 446, 460, 465 Dellarocas, C., 229n2, 426 DellaVigna, S., 169

De los Santos, B., 139n5, 140, 143n11, 143n12, 144n13, 149 Demetz, L., 341n18 Demsetz, H., 191 Deng, A., 194n4, 215 Depken, C. A., 449 Dettling, L. L., 228, 245 Dewan, S., 426 DeWitt, D., 283 Diamond, P., 9 Dickie, M., 200 DiCola, P., 358, 367 Dover, Y., 10 Dranove, D., 21n2 Duflo, E., 390n4 Duncan, G., 283 Durbin, J., 120, 132 Dutcher, E. G., 229n3 Dwork, C., 283 Eckert, S. E., 341 Edmonds, R., 169 Einav, L., 8 Elberse, A., 172, 236 Ellison, G., 9 Ellison, S. F., 9 Evans, D. S., 259 Fader, P. S., 92 Farrell, J., 24n3, 25, 25n6 Fawcett, N. W. P., 120 Feather, J., 360 Ferreira, F., 433, 434n28 Fienberg, S., 283 Fiorina, M. P., 171 Fischetti, M., 27 Fisher, A., 200 Fleder, D., 10 Fogel, R. W., 55 Forman, C., 9, 83, 139 Foros, Ø., 176, 179 Fradkin, A., 11 Francois, J., 245 Frankel, A. S., 274 Friedman, A., 285 Frühwirth-Schnatter, S., 132 Fryer, H., 343n26 Galan, E., 119 Galletta, D. F., 448 Gans, J. S., 176, 179, 258n1, 258n2 Garicano, L., 10, 254

Author Index Garside, P. D., 376, 376n37 Gelman, A., 133, 198n13 Gentzkow, M., 83, 84, 169, 170, 171, 172, 173, 174, 175, 180 George, E. I., 121, 133 Gerking, S., 200 Geva, T., 114 Ghani, E., 223, 244, 250 Ghose, A., 9, 83, 139 Ghosh, J., 133, 246 Ginsberg, J., 93, 152n19 Glaeser, E. L., 94, 174 Goel, R., 449, 466 Goldfarb, A., 8, 9, 10, 11, 12, 66, 83, 84, 140, 192, 200, 285, 315 Gonen, R., 192, 202 Goolsbee, A., 12, 57, 59, 139, 310, 419 Gopal, R. D., 447, 449 Gordon, L. A., 344n28 Greene, W. H., 177 Greenstein, S. M., 5, 6, 11, 24, 37, 49n1, 50, 58, 310, 448 Grierson, H. G. C., 365n15, 375, 376 Griffith, R., 239 Gross, R., 285 Grossklags, J., 285 Grossman, G. M., 239 Guzmán, G., 152n19 Gyourko, J., 94 Hall, R. E., 446, 467 Han, L., 94 Handke, C., 429n24, 431n27 Hann, I.-H., 315 Harris, M., 358n2 Harvey, A., 120 Heald, P. J., 359n5 Heaton, P., 10 Hellerstein, R., 119 Helpman, E., 239 Henderson, R., 31, 50 Hendry, D. F., 120 Hirsch, D. D., 314, 353 Hitt, L. M., 8, 239 Hoekman, B., 245 Hoffman, D. A., 344 Holley, R. P., 141n6 Homer, N., 282 Hong, H., 139 Hong, S.-H., 13 Horowitz, J., 283 Horrigan, J. B., 91


Hortaçsu, A., 10, 140, 230 Horton, J. J., 219, 223, 242, 244, 245, 246, 250, 253 Hosanagar, K., 10 Hu, Y. J., 10, 91, 139, 147, 159, 192n3, 248, 314, 419 Ioannidis, C., 314 Israel, M., 77n14 Jabs Saral, Krista, 229n3 Jeon, G. Y., 310 Jerath, K., 163n25 Jerman-Blazic, B., 344n28 Jin, G. Z., 10 Johnson, G., 200, 204, 214n26 Johnson, J., 163n25 Johnson, M. E., 344 Johnson, S., 460, 467 Jones, B. F., 23 Jones, C. I., 446, 467 Jullien, B., 52n2 Kahn, L., 252n1 Kaplan, E., 169 Karagodsky, I., 324n12, 340 Kato, A., 10 Katz, M., 77n14 Kaufmann, D., 460 Kaya, C., 237 Kee, K. F., 78n15 Kelley, S., 376 Kerr, W. R., 223, 244, 250 Kessides, I. N., 191 Khalid, A. M., 449 Khan, B. Z., 359n5 Kim, H. H., 8 Kim, Y.-M., 310 Kind, H. J., 176, 179 King, S. P., 258n1, 258n2 Klenow, P. J., 57, 59, 310 Knight, S. C., 341 Knopper, S., 414, 415 Kohavi, R., 194n4, 215 Kohn, R., 132 Komarova, T., 295 Koopman, S. J., 120, 132 Korolova, A., 284 Kraay, A., 460 Krieger, A. M., 192n3 Krishnan, R., 139 Krugman, P., 90

488

Author Index

Ksiazek, T. B., 172 Kuhn, P., 252n2 Kumar, D., 200 Kuruzovich, J., 93 Kwon, J., 344 Labbé, F., 119 Lacetera, N., 222, 239, 241, 244, 250 Lambert, D., 283, 296 Lambrecht, A., 205n20 Landes, W. M., 447 Langlois, R., 24 Lauinger, T., 391 Lazear, E. P., 245 Lazer, D. A., 93 Leeds, J., 414 LeFevre, K., 283 Levin, J. D., 8 Levitt, S. D., 353 Lewis, R. A., 193, 194, 196, 200, 201, 202, 202n19, 204, 205, 214n26, 246 Li, X., 359, 369n26, 378 Liebowitz, S. J., 77n14, 357, 407n1 Liu, P., 11, 129 Lockhart, J. G., 374, 374n35 Lodish, L., 192n1, 192n3 Loeb, M. P., 344n28 Lovell, M., 197 Lyons, E., 222, 239, 241, 244, 245, 250, 254n4 MacCarthy, M., 332 MacCormack, A., 34 MacGarvie, M., 359, 369n26, 378 MacKie-Mason, J., 25n6 Madigan, D. M., 121, 123 Magnac, T., 283 Mahoney, J. T., 31 Manley, L., 141n6 Mann, C. L., 324n12, 340, 341 Manski, C., 283 Mansour, H., 252n2 Marcucci, J., 152n19 Margolis, S., 357 Marron, D. B., 449 Mastruzzi, M., 460 Maurin, E., 283 Mayzlin, D., 10, 139 McAfee, A., 93, 246 McCarty, N., 171 McCulloch, R. E., 121, 133

McDevitt, R., 6, 58, 310 McLaren, N., 119 Merrill, S., 447, 448 Meurer, M. J., 447 Middeldorp, M., 119 Milgrom, P., 240 Mill, R., 222, 244, 250 Miller, A. R., 12, 285, 353, 355 Mincer, J., 198 Moe, W. W., 92 Moffitt, R., 282 Molinari, F., 284 Moore, R., 343n26 Moraga-González, J. L., 144n14 Moreau, F., 237 Morgan, J., 9, 139n4 Morton, F. S., 310 Moser, P., 359, 369n26, 378 Mowery, D., 5 Mukherjee, S., 283 Mullainathan, S., 170, 172, 390n4 Mulligan, D. K., 343n26 Murphy, J., 200 Murphy, K. M., 23, 191 Nandkumar, A., 345 Narayanan, A., 282 Nekipelov, D., 295 Nelson, M., 449, 466 Netz, J., 25n6 Nguyen, D. T., 194, 201 Nissim, K., 283 Nosko, C., 192, 193, 200, 200n15, 215, 246 Nowey, T., 341n18 Oberholzer-Gee, F., 12, 13, 236, 357, 407n1, 429, 448 Oh, J. H., 57, 84 Olston, C., 192, 202 Orr, M., 341 Oussayef, K. Z., 341 Oz, S., 448 Pallais, A., 220, 239, 241, 243, 244, 250, 252n1 Panagariya, A., 245 Pandey, S. D., 192, 202, 212 Park, N., 78n15 Patterson, L., 360 Pavan, A., 285 Pavlov, E., 192, 202

Author Index Peace, A. G., 448 Pearson, R., 283 Peitz, M., 3 Peltier, S., 237 Pentland, A. S., 90 Peranson, E., 240 Petrin, A., 433 Petris, G., 120 Petrone, S., 120 Petrongolo, B., 244 Pimont, V., 344n28 Pissarides, C. A., 244 Poole, K. T., 171 Posner, R. A., 447 Prince, J., 66, 84, 448 Prior, M., 171 Pym, D., 314 Qin, X., 120 Raduchel, W., 447, 448 Raftery, A. E., 121, 123 Rahman, M. S., 91 Ramakrishnan, R., 283 Ramaprasad, J., 426 Ramello, G. B., 449 Rao, J. M., 193, 196, 205 Ravid, S. A., 358n2 Reed, D. P., 30 Reed, W. R., 120 Reichman, S., 114 Reiley, D., 193, 200, 201, 202, 202n19, 204, 205, 214n26, 246 Reisinger, M., 176, 179 Resnick, P., 230 Retzer, K., 335 Ridder, G., 282 Rigbi, O., 11 Rob, R., 12, 13, 407n1 Roberds, W., 314n3 Robinson, J., 56n1, 460, 467 Rochet, J.-C., 258n1, 258n2 Rockoff, H., 274n23 Romanosky, S., 317, 335, 344 Rosen, S., 235 Rosenthal, H., 171 Rossi-Hansberg, E., 239 Rosston, G., 6, 59 Roth, A. E., 240 Rue, H., 132 Rumelt, R. P., 448


Rusnak, J., 34 Russell, A., 26n7, 29 Rutz, O. J., 194 Rysman, M., 37, 42, 52n2, 142n10, 229 Saloner, G., 24n3 Saltzer, J. H., 30 Sanchez, R., 31 Sanders, G. L., 447, 449 Sands, E., 252n1 Sandstoe, J., 415 Savage, S. J., 6, 59 Scherer, F. M., 358, 358n3 Schmid, D. W., 344n28 Scholten, P., 9, 139n4 Schreft, S., 314n3 Schreiner, T., 201 Scott, S. L., 123, 130 Shaked, A., 176 Shanbhoge, R., 119 Shapiro, J.M., 169, 170, 171, 172, 173, 174, 175, 180 Sharp, R., 335 Shepard, N., 132 Sher, R., 358, 366n17 Sherwin, R., 23 Shiller, R. J., 94, 106 Shleifer, A., 170, 172 Shmatikov, V., 282 Shum, M., 139 Silva, F., 449 Simcoe, T., 5, 25, 26n7, 27n9, 38, 42 Simester, D., 237 Simon, H. A., 23, 90, 111 Sinai, T., 9, 85 Singer, N., 316n4 Sinkinson, M., 169 Smarati, P., 283 Smith, A., 23 Smith, M. D., 9, 10, 139, 159, 237, 314, 386n1, 391n5, 400, 419, 447, 448 Smith, V. C., 274n23 Srinivasan, T. N., 245 St. Clair, W., 361n7, 365, 371n28, 375n36 Stanton, C. T., 222, 223, 229, 244, 250, 253 Steele, D. G., 449 Stigler, G. J., 9, 23, 139, 244 Strum, J. E., 449 Strumpf, K., 12, 13, 357, 407n1, 429, 448 Suhoy, T., 119 Sullivan, R. J., 332

490

Author Index

Sunstein, C., 9, 170, 171 Sutton, J., 176, 411, 433, 440 Sweeney, L., 282, 283 Tadelis, S., 192, 193, 200, 200n15, 215, 246 Tang, Z., 314 Taylor, C., 285 Telang, R., 285, 335, 345, 386n1, 400, 447, 448 Tervio, M., 408, 411, 415 Thisse, J. F., 448 Thomas, C., 222, 229, 244, 250, 253 Thomas, R. C., 318 Thomson, K., 419, 423 Thong, J. Y. L., 448 Tirole, J., 52n2, 258n1, 258n2 Trajtenberg, M., 26, 52n3 Tucker, C. E., 8, 9, 10, 12, 192, 200, 205n20, 236, 285, 315, 353, 355 Tunca, T. I., 345 Turow, S., 357 Valenzuela, S., 78n15 Vanham, P., 219 Van Reenen, J., 254 Varian, H. R., 9, 13, 93, 113, 119, 123, 130, 152n19, 246, 284, 285 Vigdor, J. L., 174 Vilhuber, L., 283 Vogel, H., 412, 412n7 Volinsky, C., 123

Waldfogel, J., 3, 9, 12, 13, 85, 176, 357, 407n1, 408, 408n3, 433, 434n28, 447 Waldman, D. M., 6, 59 Walker, T., 194n4, 215 Webster, J. G., 172 Wellman, B., 78n15, 79 Weyl, E. G., 52n2, 258n2 Wheeler, C. H., 244 White, M. J., 174 Wilde, L. L., 244 Wildenbeest, M. R., 139n5, 140, 143n11, 143n12, 144n13, 144n14, 149 Williams, J., 314 Wolff, E., 139 Woodcock, S., 283 Wright, G., 280n1 Wu, L., 92 Xu, Y., 194n4, 215 Yakovlev, E., 295, 301 Yan, C., 310 Yglesias, M., 259 Yildiz, T., 200 Zeckhauser, R., 230 Zentner, A., 77n14, 237, 407 Zhang, J., 11, 236 Zhang, X., 11, 426 Zhang, Z. J., 163n25 Zhu, F., 11

Subject Index

Page numbers followed by f or t refer to figures or tables, respectively. Activity bias, 205–9 Acxiom, 316 Adam Smith marketplace, 311–12 Ad exchanges, 201 Advertising, 191; activity bias and, 205–9; case study of large-scale experiment, 202–5; challenges in measuring, 192–95; computational, advances in, 211–13; computational methods for improving effectiveness of, 195–99; evolution of metrics for, 199–202; measuring long-run returns to, 209–11; study of online, 10; targeted, 3, 195, 199; untargeted, 199n14. See also Digital advertising Agency model, Apple’s, 160 Airbnb, 11 Amazon, 140, 316 Amazon Coins, 257 American Time Use Survey (ATUS), 7, 56, 59–71; computer use for leisure, 61–62, 62t; demographics of online leisure time, 65–70; ways Americans spend their time, 62–63, 63f, 64f, 65f Antipiracy enforcement efforts, impact of, 472–74 Apple: agency model, 160; iBookstore, 160; platform-specific currencies of, 259 Appliances, home, predicting demand for, 100–101

Arrow-Debreu “complete” market, 312 Attribution problem, 201–2 ATUS. See American Time Use Survey (ATUS) Authors, payments to, 357–60; data, 361–65; income from profit sharing, 371–73; lump sum, 365–71; total income to, 373–77. See also Copyrights Automated targeting, 195, 199 Barnes & Noble, 140, 141; top search terms leading users to, 145–47, 146t Barnesandnoble.com, 140 Basic structural model, 120–21 Bayesian model averaging, 123. See also Variable selection Bayesian Structural Time Series (BSTS), 120, 124, 129, 130 Beckford v. Hood, 361 Berners-Lee, Tim, 27 Bitcoin, 258, 259, 272 Book industry, 138–39; current retail, 140; data sets for, 143–44; overview of, 140–44 Book industry, online: literature on, 139–40; price dispersion and, 139 Book-oriented platforms, search activity on, 150–51, 152t Book-related searches: combining data from comScore and Google Trends, 152–55;




Book-related searches (cont.) dynamics of, 151–52; for specific titles, 155–59 Books: booksellers’ sites for finding, 148–49; online sales of, 137–38; online searching for, 144–51; price comparison sites for, 144. See also E-books; Print books Books, searching for, 144–51 Book searches, 9 Booksellers, searching for, 144–51 Booksellers’ sites: activities of searchers after visiting, 149–50, 149t; for finding books, 148–49 Bookstores, online, for book searches, 144 Bookstores, revenue of leading, 143, 143t Borders, 140 Boundaries, firm, online contract labor markets and, 239–40 Brick-and-mortar books stores: retail sales of, 142–43, 142f BSTS. See Bayesian Structural Time Series (BSTS) Business Software Alliance (BSA), 449 Case-Shiller index, 91–92 Cerf, Vint, 26 CERN. See European Organization for Nuclear Research (CERN) ChoicePoint, 316 Clark, David, 26 Click-through rate (CTR), 192, 199–200 Communications costs, effect of low, 2 Complementaries, between display and search advertising, 201–2 Complete-markets framework, 311–12; atomistic interaction among players and, 316–17; frictionless markets and, 315; full information and, 315; tradeoffs and, 315; violating, 312–14 Computational advertising, advances in, 211–13. See also Advertising “Computer use for leisure,” 62–63, 64f Computing market segments: platforms and, 5 comScore, 152–55 Consumer research behavior, literature on, 139 Consumer sentiment: nowcasting, 124–27; University of Michigan monthly survey of, 124 Contract labor, demand for, 228

Contract labor markets: influence of information frictions on matching outcomes in, 220–23; introduction to, 219–22; patterns of trade in, 220 Contract labor markets, online: boundaries of the firm and, 239–40; demand for contract labor in, 228; design of, 240–43; digitization and, 11; economics of, 226–30; geographic distribution of work and, 230–35; growth in, 221–22; income distribution and, 235–39; labor supply and, 227–28; platforms and, 229–30, 240–43; social welfare implications of, 243–45 Copyrights, 357; data for analysis of, 361– 65; digitization and, 13–14; evidence on effects of stronger, 357–58; example of Sir Walter Scott, 373–77; income from profit sharing and, 371–73; lump sum payments to authors and, 365–71; in romantic period Britain, 360–61 CTR. See Click-through rate (CTR) Currencies: platform-specific, 258–59; private, 11. See also Private digital currencies Customer acquisition, 200–201 Data, online, potential of, 8 Data breaches: cross-border, 330–33; discipline by equity markets and, 339–41; discipline by equity markets and, literature review of, 347; disclosure of, 317; frameworks for analyzing, 310–20; probability distribution of, 318–19; at Target, 318n8; trends in business costs of, 335–39. See also Information loss Data holders, 316 Data security, digitization and, 12–13 Data subjects, 316 Dell key (Sarah), 454 Digital advertising, 191–92; data reporting for, 192. See also Advertising Digital books. See E-books Digital currencies. See Private digital currencies Digital information: challenges of privacy and security and, 2 Digital media, studies of: case 1: effect of graduated response antipiracy law on digital music sales, 387–90; case 2: effect of Megaupload shutdown on dig-

Subject Index ital movie sales, 391–94; case 3: effect of digital distribution of television on piracy and DVD sales, 394–96 Digital movie sales, effect of Megaupload shutdown on, 391–94 Digital music sales, effect of graduated response antipiracy law on, 387–90 Digital news. See News, online Digital piracy, 357; defined, 444 Digital Rights Management (DRM), 141–42 Digital technology, 1; demand for, 6; role of growth of digital communication in rise of, 1–2; search costs and, 8–9 Digitization: economic impact of, 1; economic transactions and, 7–8; frictions and, 11; government policy and, 12–15; markets changed by, 10; markets enabled by, 10–11; online labor markets and, 11; online sales and, 137; personal information and, 3; private currencies and, 11; ways markets function and, 8–9 Digitization Agenda, 309, 310 Digitization research, 2–3 Digitized money transfer systems, 258; platforms and, 258 Disclosure, of data breaches, 317, 334–35 Distribution, near-zero marginal costs of, 9–10, 12 Donaldson v. Becket, 360–61 DRM. See Digital Rights Management (DRM) eBay, 11 E-books, 138, 139, 141; prices of, 159–62; sales of, vs. print books, 141t; searching for, 144–45; shift to, 140–41. See also Print books Economic transactions: digitization and, 7–8 Economic trends, predicting, 92–94 ePub format, 141 Equity markets, discipline by, and data breaches, 339–41; literature review of, 347 E-readers: definition of, 141; formats for, 141–42; Kindle, 9, 141; Nook, 9, 141; Sony LIBRIé, 140–41 European Organization for Nuclear Research (CERN), 27


European Union (EU) Privacy Directive, 316 Facebook, privacy breaches and, 284 Facebook Credits (FB Credits), 257, 259, 260; case study of, 260–62 Financial Crimes Enforcement Network (FinCEN), 272 Forecasting, traditional, 89. See also Nowcasting; Predictions Frictions, digitization and, 11 General purpose technology (GPT), 21–22 Genome-wide association studies (GWAS), 282 Gold farming, 259 Google, 316 Google Correlate, 119 Google Trends, 95–96, 115, 119, 124, 152–55 Government policy, digitization and, 12–15 GWAS. See Genome-wide association studies (GWAS) Hacking, 321; origins, 331 HADOPI, 387–90 HapMap data, 282 Hart, Michael, 140 Home appliances, predicting demand for, 100–111 Household behavior, 6–7 Housing market, 90–92; empirical results of models, 100–111; implications of advances in information technology for, 111–14; indicators, 96–97; literature review of for predicting, 92–94; modeling methods for predicting, 97–100. See also Predictions Housing price index (HPI), 92, 96, 100–101, 102, 104, 105, 106–9 Housing trends, predicting, 93 Hulu.com, 398–404 Hypertext Markup Language (HTML), 27 Hypertext Transfer Protocol (HTTP), 27 IAB. See Internet Architecture Board (IAB) iBookstore, 160 IEEE. See Institute for Electrical and Electronics Engineers (IEEE) Income distribution, online contract labor markets and, 235–39

494

Subject Index

Individual disclosure, 282; modern medical databases and, 282–83 Information, personal, digitization and, 3 Information aggregation, 314; literature on benefits vs. costs, 316–18; value of personal, 315 Information flows, applying pollution model to, 314–15 Information loss: amounts, 320–22, 321f; costs of, 317–18; costs of increased security and, 344–45; creating insurance markets and products for, 344; cross-border, 330–33; data needs and analysis for, 346; differences by sector, 324–30; disclosure of, 317; legal recourses, 343–43; legislative approaches to reducing harm from, 317; market discipline vs. nonmarket regulatory/legal discipline and, 333–45; market value of, 339; methods, 320–22, 321f; policy interventions for, 341–43; trends, 320–33; types of, 322–24; in US, 333. See also Data breaches; Information marketplaces Information marketplaces, 312–14; balancing benefits and costs of, 319–20; challenges to pricing and, 319–20; conceptual framework for, 345; frameworks for analyzing, 310–20; international jurisdiction and, 346; pollution model of, 314–15. See also Information loss Information stewardship, 314–15 Information technology, implications of advances in, for housing market, 111–14 Insider fraud, 321 Institute for Electrical and Electronics Engineers (IEEE), 5 Intellectual property, 13. See also Copyrights Internet, 2, 4–5; digital piracy and, 448; estimating value of, 55–56; evolution of protocol stack, 32f; existing research on economic value of, 57–59; housing market and, 91–92; online sales and, 137; standardization of, 26–30; supply and demand, 4–7 Internet Architecture Board (IAB), 26 Internet data, potential of, 8. Internet Engineering Task Force (IETF), 5, 22, 26–30; linear probability models of, 39–40, 40t; major participants, 36–41,

37t; most cited standards, 29–30, 30t, 31t; protocol stack and, 31–33; summary statistics, 39, 39t Kalman filters, 120–21 k-anonymity approach, 283 Kindle, 9, 141 Labor, division of, Internet modularity and, 36–41 Labor markets. See Contract labor markets Labor supply, online contract labor markets and, 227–28 “Last click” rule, 201 Leisure time: ways Americans spend their, 62–63, 63f, 64f Leisure time, online: computer use for, 61–62; demographics of, 65–70; items crowded out by, 71–80; opportunity cost of, 56; times people engage in, 70–71 Lenovo Key (Lenny), 453–54 Liberty Exchange, 258 LIBRIé e-book reader, Sony, 140–41 Linden dollars, 258 Linkage attacks, 282, 283, 284 Lump sum payments, to authors, 365–71 MAE. See Mean absolute error (MAE) Market-making platforms, 229–30 Marketplace: Adam Smith’s, 311–12; complete, 312; information, 312–14 Markov Chain Monte Carlo (MCMC) technique, 123, 131–33 Mean absolute error (MAE), 100, 100n10, 103–5 Mean squared error (MSE), 100n10 Media, polarization and, 171–72. See also News, online Medical databases, individual disclosure and, 282–83 Megaupload, 391–94 Megaupload Penetration Ratio (MPR), 391–94 Metrics, advertising, evolution of, 199–202 Microsoft, 443–44; platform-specific currencies of, 258–59 Microsoft Points, 257 Mirroring hypothesis, 31 Models, 40t; Apple’s agency model, 160; basic structural, 120–21; Bayesian model averaging, 123; linear probability

Subject Index models of IETF, 39–40; platform, 262– 72; pollution model of information marketplaces, 314–15; for predicting housing market, 97–100; of production and consumption of online news, 170–71, 175–81; structural time series models, 130–31; structural time series modes, 120–21; theoretical, of recorded music industry, 415–17; of treatment effects, 285–90 Modular design, virtues of, 24 Modularity, Internet, 23–25; age profiles for RFC-to-RFC citations, 42–43, 43t; age profiles for RFC-to-RFC citations and US patent-to-RFC citations, 44, 44t; decomposability and, 33–35; distribution of citations to RFCs over time, 41–44; division of labor and, 36–41; protocol stack and, 30–33; setting standards and, 25–26 Modular system architecture, 22 Monster, 11 Movies, online sales of, 137–38 M-Pesa, 258 MPR. See Megaupload Penetration Ratio (MPR) MSE. See Mean squared error (MSE) Music, online sales of, 137–38 Music industry. See Recorded music industry Nanoeconomics, 93 Napster, 407, 408 National Association of Realtors (NAR), 100–101, 106 National Instant Criminal Background Check (NICS), 128 Network effects, 285 News, online, 169; data sources for, 173–74; descriptive features of consumption of, 174–75; discussion of model’s results, 184–88; estimation and results of model of, 181–84; model of production and consumption of, 170–71, 175–81; politics and, 169–70; segregation of consumption of, 174–75, 175f. See also Media Nintendo, platform-specific currencies of, 258 Nook, 9, 141 Nowcasting, 8, 119; consumer sentiment, 124–27; gun sales, 128


oDesk, 11, 219–20; users of, 240; work process on, 226–30 Online currencies. See Currencies; Private digital currencies Partial disclosure: occurrence of, 296; statistical, 305–6; threat of, 283, 284 Payment Card Industry Data Security Standards, 317 Payments, to authors, 357–60; data, 361–65; income from profit sharing, 371–73; lump sum, 365–71; total income to, 373–77. See also Copyrights PayPal, 258 Personal information, digitization and, 3 Piracy, 385; effect of television streaming on, 396–404. See also Digital piracy; Recorded music industry; Software piracy Platforms, 5; competition between, 6; computing market segments and, 5; defined, 5, 258; digitized money transfer systems and, 258; literature, 258; marketmaking, 229–30; model, 262–72; online contract labor markets, 240–43; private digital currencies and, 258; pure information goods and, 10. See also Private digital currencies Platform-specific currencies, 258–59 Polarization: media and, 171–72; rising US, 171 Policy, government, digitization and, 12–15 Pollution model, applying, to information flows, 314–15 Predictions: for demand for home appliances, 100–101; economic, 90–91; empirical methods for, 97–100; information technology revolution and, 89–90; literature review, 92–97; social science research and, 90. See also Housing market Price comparison sites, for books, 144 Price dispersion, 139 Print books, 141; prices of, 159–62; sales of, vs. e-books, 141t. See also E-books Priors, 123–24 Privacy: challenges of, and digital information, 2; digitization and, 12; role of disclosure protection and, 285; security vs., 284–85 Privacy Rights Clearinghouse (PRC) data, 320–21, 324

496

Subject Index

Private digital currencies, 11, 257; vs. digitization of state-issued currencies, 257– 58; economic model of, 262–72; future directions for, 273–75; platforms and, 258, 262–72; regulatory issues, 272–73 Productivity, 4 Product License Keys, 451 Product searches, online, 138 Project Gutenberg, 140 Prosper, 11 Protocol stack, 30–33; citations in, 35f; evolution of, 32–33, 32f; TCP/IP, 31 “Purchasing intent” surveys, 192

Real estate economics, 94 Real estate market. See Housing market Recorded music industry, 407–8; background of, 411–15; data used for study of, 417–19; effective cost reduction for new work and piracy in, 409–10; inferring sales quantities from sales ranks and album certifications for, 419–22; Internet vs. traditional radio and, 422–25; online criticism and, 425–28; results of net effect of piracy and cost reduction in, 428–38; systematic data analysis of, 410; theoretical framework for production selection problem in, 408–9; theoretical model of, 415–17 Requests for Comments (RFCs), 26, 29, 30t Russian Longitudinal Monitoring Survey (RLMS), 281–82, 300–305

and, 2; costs of, and information loss, 344–45; data, digitization and, 12–13; privacy vs., 284–85 Selective prediction, 171–72 Social science research, predictions and, 90 Social trends, predicting, 92–94 Software piracy, 14–15, 444–46; defined, 452, 457–58; economic, institutional, and infrastructure variables of, 458–61; economics of, 447–50; machines associated with, 458; methods, 450–55; results between machine characteristics and, 471–72, 471t; results for nature and incidence of, 461–63; results of economic, institutional, and technological determinants of, 464–71; results of impact of antipiracy enforcement efforts on, 472–74; routes to, 452–55; summary statistics, 459t. See also Windows 7 Solow Paradox, 4 Sony: LIBRIé e-book reader, 140–41; platform-specific currencies of, 259 Spike-and-slab variable selection, 121–23 Standards, setting, modularity and, 25–26 Standard-setting organizations (SSOs), Internet, 22 State-issued currencies, digitization of, 257–58 Statistical partial disclosure, 305–6 Stock market, discipline by, and data breaches, 339–41 Streaming, television, effect of, on piracy, 396–404 Structural time series models, 130–31; for variable selection, 120–21 Synthetic data, 283

Sales, online, 137 Scott, Sir Walter, 373–77 Search costs: digital technology and, 8–9 Search engine optimization (SEO) market, 242 Search engines: book-related searches on, 145–48; real estate agents, 91; using, for books, 144 Search engine technology, 90 Searches, online, 9 Search Planner, 145–48, 146t Search terms, top twenty-five Google, leading users to Barnes & Noble, 145–47, 146t Security: challenges of privacy and security

Target data breach, 318n8 Targeted advertising, 3, 195, 199 TCP/IP. See Transmission Control Protocol/ Internet Protocol (TCP/IP) Television streaming, effect of, on piracy, 396–404 Time series forecasting, 120–21 Toshiba key (Billy), 454–55 Transmission Control Protocol/Internet Protocol (TCP/IP), 6, 22, 29; protocol stack, 31 Treatment effects, 280–85; case study of religious affiliation and parent’s decision on childhood vaccination and medical checkups, 299–305; identification of,

Q-coin, 273 qSearch database, 150–51 Query technology, 89–90

Subject Index from combined data, 290–95; inference of propensity score and average, 296–99; models of, 285–90 UK Copyright Act of 1814, 359; extensions in length of, 361 UK Copyright Act of 2011, 357 Untargeted advertising, 199n14 US Copyright Act of 1998, 357 Variable selection: approaches to, 120–23; Bayesian model averaging, 123; spikeand-slab, 121–23; structural time series for, 120–21 Visa, 316


Walmart, 140 Windows 7: authenticating valid version of, 451; data for estimating piracy rates of, 455–57; legal ways of acquiring, 451–52; routes to pirating, 452–55. See also Software piracy Work, geographic distribution of, online contract labor markets and, 230–35 World of Warcraft (WoW) Gold, 259 World Wide Web Consortium (W3C), 5, 22, 27–30; protocol stack and, 31–33; publications, 28–29, 28f Zellner’s g-prior, 122