Data Protection and Privacy: Data Protection and Democracy 9781509932740, 9781509932771, 9781509932757



English Pages [333] Year 2020


Table of contents :
Preface
Table of Contents
List of Contributors
1. The Dataset Nutrition Label: A Framework to Drive Higher Data Quality Standards
I. Introduction
II. Labels in Context and Related Work
III. Methods
IV. Results
V. Discussion
VI. Future Directions
VII. Conclusions
Acknowledgements
References
2. A Right to a Rule: On the Substance and Essence of the Fundamental Right to Personal Data Protection
I. Introduction
II. Birth of a Strange Fundamental Right
III. On the Distinction between Privacy and Data Protection
IV. Data Protection and the Structure of the Charter
V. A Right to a Rule
VI. The Essence of the Right to Data Protection
VII. Conclusion
Acknowledgements
References
3. What’s in an Icon? Promises and Pitfalls of Data Protection Iconography
I. Introduction
II. Limitations of Privacy Policies: A Brief Literature Review
III. The End or a New Era for Mandated Disclosures?
IV. Machine-Readable Legal Information
V. On the Nature of (Data Protection) Icons
VI. Design and Evaluation of DaPIS
VII. Future Research
VIII. Conclusions
Acknowledgements
References
4. ‘We’re All in This Together’: Actors Cooperating in Enhancing Children’s Rights in the Digital Environment after the GDPR
I. Introduction
II. Relevant Factors to be Considered When Debating About Children’s Digital Rights
III. Children’s Digital Rights under the General Data Protection Regulation (GDPR)
IV. A Few Pathways to Empower Children Online
V. Conclusions
References
5. Risk to the ‘Rights and Freedoms’: A Legal Interpretation of the Scope of Risk under the GDPR
I. Introduction
II. ‘Personal Data’ and ‘Data Controller’: Role and Scope of Two Key Concepts in the GDPR
III. Role of Risk in the GDPR
IV. Scope of Risk in the GDPR
V. Conclusion
References
6. Modelling and Verification in GDPR’s Data Protection Impact Assessment: A Case Study on the AccuWeather/Reveal Mobile Case
I. Introduction
II. Case Description and Legal Classification
III. Logic and CAPVerDE
IV. Data Protection Impact Assessment According to Article 35 GDPR
V. Discussion
VI. Conclusion
References
7. In Search of Data Protection’s Holy Grail: Applying Privacy by Design to Lifelogging Technologies
I. Introduction
II. The Concept of Privacy by Design
III. Guidelines
IV. Conclusion
Bibliography
8. Public Registers Caught between Open Government and Data Protection – Personal Data, Principles of Proportionality and the Public Interest
I. The Growth of Public Registers
II. Concepts and Methodology
III. Requirements of Proportionality
IV. Types of Public Registers and their Purposes
V. Conclusions
References
9. Examination Scripts as Personal Data: The Right of Access as a Regulatory Tool against Teacher-Student Abuses in Cameroon Universities
I. Introduction
II. Teacher–Student Abuses in Cameroon Universities: An Overview
III. Personal Data and the Right of Access in EU Data Protection Law
IV. Personal Data under the AU Data Protection Convention
V. Right of Access to Evaluated Exam Scripts as Personal Data in Cameroon: Potential Impacts on Teacher-Student Abuses
VI. Conclusion
References
10. The Proposed ePrivacy Regulation: The Commission’s and the Parliament’s Drafts at a Crossroads?
I. Introduction
II. The Relationship between the ePrivacy Regulation and the GDPR: Is There Life Beyond Personal Data Processing?
III. The Material and Territorial Scope of the Draft ePrivacy Regulation
IV. Confidentiality of Communications: Protection of Electronic Communications of Natural and Legal Persons in the Draft ePrivacy Regulation
V. The Issue of Consent and its Effect on Software Architecture and Settings
VI. Conclusion
References
11. CPDP: Closing Remarks
Index

DATA PROTECTION AND PRIVACY

The subjects of this volume are more relevant than ever, especially in light of the raft of electoral scandals concerning voter profiling. This volume brings together papers that offer conceptual analyses, highlight issues, propose solutions, and discuss practices regarding privacy and data protection. It is one of the results of the twelfth annual International Conference on Computers, Privacy and Data Protection, CPDP, held in Brussels in January 2019. The book explores the following topics: dataset nutrition labels, lifelogging and privacy by design, data protection iconography, the substance and essence of the right to data protection, public registers and data protection, modelling and verification in data protection impact assessments, examination scripts and data protection law in Cameroon, the protection of children’s digital rights in the GDPR, the concept of the scope of risk in the GDPR and the ePrivacy Regulation. This interdisciplinary book has been written at a time when the scale and impact of data processing on society – not only on individuals, but also on social systems – is becoming ever starker. It discusses open issues as well as daring and prospective approaches, and will serve as an insightful resource for readers with an interest in computers, privacy and data protection.

Computers, Privacy and Data Protection

Previous volumes in this series (published by Springer)
2009 Reinventing Data Protection? Editors: Serge Gutwirth, Yves Poullet, Paul De Hert, Cécile de Terwangne, Sjaak Nouwt ISBN 978-1-4020-9497-2 (Print) ISBN 978-1-4020-9498-9 (Online)
2010 Data Protection in A Profiled World? Editors: Serge Gutwirth, Yves Poullet, Paul De Hert ISBN 978-90-481-8864-2 (Print) ISBN: 978-90-481-8865-9 (Online)
2011 Computers, Privacy and Data Protection: An Element of Choice Editors: Serge Gutwirth, Yves Poullet, Paul De Hert, Ronald Leenes ISBN: 978-94-007-0640-8 (Print) 978-94-007-0641-5 (Online)
2012 European Data Protection: In Good Health? Editors: Serge Gutwirth, Ronald Leenes, Paul De Hert, Yves Poullet ISBN: 978-94-007-2902-5 (Print) 978-94-007-2903-2 (Online)
2013 European Data Protection: Coming of Age Editors: Serge Gutwirth, Ronald Leenes, Paul de Hert, Yves Poullet ISBN: 978-94-007-5184-2 (Print) 978-94-007-5170-5 (Online)
2014 Reloading Data Protection: Multidisciplinary Insights and Contemporary Challenges Editors: Serge Gutwirth, Ronald Leenes, Paul De Hert ISBN: 978-94-007-7539-8 (Print) 978-94-007-7540-4 (Online)
2015 Reforming European Data Protection Law Editors: Serge Gutwirth, Ronald Leenes, Paul de Hert ISBN: 978-94-017-9384-1 (Print) 978-94-017-9385-8 (Online)
2016 Data Protection on the Move: Current Developments in ICT and Privacy/Data Protection Editors: Serge Gutwirth, Ronald Leenes, Paul De Hert ISBN: 978-94-017-7375-1 (Print) 978-94-017-7376-8 (Online)
2017 Data Protection and Privacy: (In)visibilities and Infrastructures Editors: Ronald Leenes, Rosamunde van Brakel, Serge Gutwirth, Paul De Hert ISBN: 978-3-319-56177-6 (Print) 978-3-319-50796-5 (Online)

Previous titles in this series (published by Hart Publishing)
2018 Data Protection and Privacy: The Age of Intelligent Machines Editors: Ronald Leenes, Rosamunde van Brakel, Serge Gutwirth, Paul De Hert ISBN: 978-1-509-91934 5 (Print) 978-1-509-91935-2 (EPDF) 978-1-509-91936-9 (EPUB)
2019 Data Protection and Privacy: The Internet of Bodies Editors: Ronald Leenes, Rosamunde van Brakel, Serge Gutwirth, Paul de Hert ISBN: 978-1-509-92620-6 (Print) 978-1-509-92621-3 (EPDF) 978-1-509-9622-0 (EPUB)

Data Protection and Privacy: Data Protection and Democracy

Edited by Dara Hallinan, Ronald Leenes, Serge Gutwirth and Paul De Hert

HART PUBLISHING
Bloomsbury Publishing Plc
Kemp House, Chawley Park, Cumnor Hill, Oxford, OX2 9PH, UK
1385 Broadway, New York, NY 10018, USA

HART PUBLISHING, the Hart/Stag logo, BLOOMSBURY and the Diana logo are trademarks of Bloomsbury Publishing Plc

First published in Great Britain 2020

Copyright © The editors and contributors severally 2020

The editors and contributors have asserted their right under the Copyright, Designs and Patents Act 1988 to be identified as Authors of this work.

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage or retrieval system, without prior permission in writing from the publishers.

While every care has been taken to ensure the accuracy of this work, no responsibility for loss or damage occasioned to any person acting or refraining from action as a result of any statement in it can be accepted by the authors, editors or publishers.

All UK Government legislation and other public sector information used in the work is Crown Copyright ©. All House of Lords and House of Commons information used in the work is Parliamentary Copyright ©. This information is reused under the terms of the Open Government Licence v3.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3) except where otherwise stated.

All Eur-lex material used in the work is © European Union, http://eur-lex.europa.eu/, 1998–2020.

A catalogue record for this book is available from the British Library.

Library of Congress Cataloging-in-Publication data
Names: Annual International Conference on Computers, Privacy, and Data Protection (12th : 2019 : Brussels, Belgium) | Hallinan, Dara, editor. | Leenes, Ronald, editor. | Gutwirth, Serge, editor. | Hert, Paul De, editor.
Title: Data protection and privacy : data protection and democracy / [edited by] Dara Hallinan, Ronald Leenes, Serge Gutwirth, Paul De Hert.
Description: Chicago : Hart Publishing, an imprint of Bloomsbury Publishing, 2020.  |  Series: Computers, privacy and data protection; volume 12  |  Includes bibliographical references and index. Identifiers: LCCN 2019044167 (print)  |  LCCN 2019044168 (ebook)  |  ISBN 9781509932740 (hardback)  |  ISBN 9781509932764 (Epub) Subjects: LCSH: Data protection—Law and legislation—Congresses.  |  Privacy, Right of—Congresses. Classification: LCC K3264.C65 A75 2019 (print)  |  LCC K3264.C65 (ebook)  |  DDC 342.08/58—dc23 LC record available at https://lccn.loc.gov/2019044167 LC ebook record available at https://lccn.loc.gov/2019044168 ISBN: HB: 978-1-50993-274-0 ePDF: 978-1-50993-275-7 ePub: 978-1-50993-276-4 Typeset by Compuscript Ltd, Shannon To find out more about our authors and books visit www.hartpublishing.co.uk. Here you will find extracts, author information, details of forthcoming events and the option to sign up for our newsletters.

PREFACE

It is the end of June 2019 as we write this foreword. Data protection is now more relevant than ever. Until recently, data protection seemed to be something of a niche topic, considered only by a small community of experts. Over the past year, however, following both the long-awaited applicability of the GDPR and the raft of prominent scandals concerning the illicit gathering and use of personal data – particularly those concerning the use of personal data in electoral campaigns – the relevance of data protection for society at large came clearly into focus. Now, everyone has an opinion. This year thus arguably represented the moment in which data protection truly arrived in the public consciousness. It is no longer unusual to hear matters of data protection mentioned in the daily news or in coffee shop conversation. Yet, the prominence of the topic does not necessarily mean more, or better, data protection. Rather, the prominence of the topic simply means the fora in which it plays a role have grown more numerous and the balances it strikes have become more contested. There are likely few data controllers, for example, who now wish to collect less personal data due to the GDPR.

In the meantime, the international privacy and data protection crowd gathered in Brussels for the twelfth time to participate in the international Computers, Privacy and Data Protection Conference (CPDP) – between 30 January and 1 February 2019. An audience of over 1,100 people had the chance to discuss a wide range of contemporary topics and issues with 440 speakers in 90 panels, during the breaks, side events and at ad-hoc dinners and pub crawls. Striving for diversity and balance, CPDP gathers academics, lawyers, practitioners, policymakers, computer scientists and civil society from all over the world to exchange ideas and discuss the latest emerging issues and trends.
This unique multidisciplinary formula has served to make CPDP one of the leading data protection and privacy conferences in Europe and around the world. The conference bustled with a sense of purpose. Conversations naturally dealt with the implementation and applicability of the GDPR. However, conversations also addressed much broader themes. Amongst these themes, the role of data protection in safeguarding democratic processes and democratic values – the core theme of the conference – featured prominently. Also heavily discussed were cross-cutting issues emerging around the need for, and the substance of, algorithmic regulation – the core topic of next year’s conference.

The CPDP conference is definitely the place to be, but we are also happy to produce a tangible spin-off every year: the CPDP book. CPDP papers are cited very frequently and the series has a significant readership. The conference cycle starts with a call for papers in the summer preceding the conference. The paper submissions are peer reviewed and those authors whose papers are accepted present their work in the various academic panels at the conference. After the conference, speakers are also invited to submit papers based on panel discussions. All papers submitted on the basis of these calls are then (again) double-blind peer reviewed. This year, we received 14 papers in the second round, of which nine were accepted for publication. It is these nine papers that are to be found in this volume, complemented by the conference closing speech traditionally given by the EDPS chair (then Giovanni Buttarelli).

The conference addressed many privacy and data protection issues in its 90 panels ranging from the impact of data processing on democracy, to AI regulation, to blockchain, to border control, to Islamic privacy, to research, to the implementation of the GDPR. The conference covered far too many topics to completely list them all here. For more information, we refer the interested reader to the conference website: www.cpdpconferences.org. The current volume only offers a very small part of what the conference has to offer. Nevertheless, the editors feel the current volume represents a valuable set of papers describing and discussing contemporary privacy and data protection issues. All the chapters of this book have been peer reviewed and commented on by at least two referees with expertise and interest in the relevant subject matters.
Since their work is crucial for maintaining the scientific quality of the book, we would explicitly take the opportunity to thank all the CPDP reviewers for their commitment and efforts: Alessandro Mantelero, Anni Karakassi, Arnold Roosendaal, Ashwinee Kumar, Aviva de Groot, Bart Van der Sloot, Bert-Jaap Koops, Bettina Berendt, Carolin Moeller, Chiara Angiolini, Christopher Millard, Claudia Quelle, Colette Cuijpers, Damian Clifford, Daniel Le Métayer, Deepan Kamalakanthamurugan Sarma, Diana Dimitrova, Edoardo Celeste, Eleni Kosta, Emre Bayamlıoglu, Franziska Boehm, Frederik Zuiderveen Borgesius, Gabriela Zanfir-Fortuna, Gergely Biczók, Gianluigi Riva, Hideyuki Matsumi, Hiroshi Miyashita, Inge Graef, Ioannis Kouvakas, Ioulia Konstantinou, Iraklis Symeonidis, Irene Kamara, Ivan Szekely, Jaap-Henk Hoepman, Jef Ausloos, Joris van Hoboken, Joseph Savirimuthu, Kristina Irion, Lina Jasmontaite, Linnet Taylor, Lorenzo Dallacorte, Maria Grazia Porcedda, Marit Hansen, Massimo Durante, Michael Birnhack, Michael Friedewald, Michael Veale, Monica Palmirani, Nicholas Martin, Nicolo Zingales, Nora Ni Loideain, Omer Tene, Raphael Gellert, Robin Pierce, Rosamunde Van Brakel, Sascha Van Schendel, Shaz Jameson, Silvia De Conca, Simone Casiraghi, Tetyana Krupiy, Tjerk Timan and Yung Shin Van Der Sype.

As had become customary, the conference concluded with closing remarks from the European Data Protection Supervisor, Giovanni Buttarelli. All of us in the privacy community were profoundly saddened by Giovanni’s passing away in August 2019. He was a fervent and inspirational champion of privacy and digital rights. In recent years, he spearheaded efforts to put data protection at the heart of debates on digital ethics and democracy in the digital age. Giovanni’s support and fondness for CPDP was as invaluable as it was reciprocated, and he will be greatly missed. It is fitting and poignant that his closing remarks to the 2019 CPDP are the final chapter in this volume.

Dara Hallinan, Ronald Leenes, Serge Gutwirth & Paul De Hert
1 July 2019


TABLE OF CONTENTS

Preface
List of Contributors
1. The Dataset Nutrition Label: A Framework to Drive Higher Data Quality Standards
Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph and Kasia Chmielinski
2. A Right to a Rule: On the Substance and Essence of the Fundamental Right to Personal Data Protection
Lorenzo Dalla Corte
3. What’s in an Icon? Promises and Pitfalls of Data Protection Iconography
Arianna Rossi and Monica Palmirani
4. ‘We’re All in This Together’: Actors Cooperating in Enhancing Children’s Rights in the Digital Environment after the GDPR
Domenico Rosani
5. Risk to the ‘Rights and Freedoms’: A Legal Interpretation of the Scope of Risk under the GDPR
Katerina Demetzou
6. Modelling and Verification in GDPR’s Data Protection Impact Assessment: A Case Study on the AccuWeather/Reveal Mobile Case
Wolfgang Schulz, Florian Wittner, Kai Bavendiek and Sibylle Schupp
7. In Search of Data Protection’s Holy Grail: Applying Privacy by Design to Lifelogging Technologies
Liane Colonna
8. Public Registers Caught between Open Government and Data Protection – Personal Data, Principles of Proportionality and the Public Interest
Geert Lokhorst and Mireille van Eechoud
9. Examination Scripts as Personal Data: The Right of Access as a Regulatory Tool against Teacher-Student Abuses in Cameroon Universities
Rogers Alunge
10. The Proposed ePrivacy Regulation: The Commission’s and the Parliament’s Drafts at a Crossroads?
Elena Gil González, Paul De Hert and Vagelis Papakonstantinou
11. CPDP: Closing Remarks
Giovanni Buttarelli
Index

LIST OF CONTRIBUTORS

Rogers Alunge is a candidate for a Joint PhD in Law, Science and Technology at the University of Bologna, Italy.
Kai Bavendiek is a PhD candidate at Hamburg University of Technology.
Kasia Chmielinski is the Project Lead of the Data Nutrition Project, an initiative that launched out of Assembly (MIT Media Lab and Harvard University) which builds tools to improve the health of artificial intelligence through healthier data.
Liane Colonna is a post-doctoral fellow at the Swedish Law and Informatics Research Institute (IRI).
Lorenzo Dalla Corte is a PhD candidate at Tilburg Law School (TILT) and a researcher at TU Delft (A+BE).
Paul De Hert is Professor of Criminal Law and Co-Director of the Law, Science, Technology & Society Research Group, Vrije Universiteit Brussel.
Katerina Demetzou is a PhD Researcher at the Business and Law Research Center (OO&R) and at the Institute for Computing and Information Sciences (iCIS) in Radboud University, Nijmegen, The Netherlands.
Elena Gil González is a PhD candidate at CEU San Pablo University of Madrid.
Sarah Holland is a member of the 2018 cohort of Assembly at the Berkman Klein Center & MIT Media Lab.
Ahmed Hosny is a machine learning scientist at Dana Farber Cancer Institute.
Joshua Joseph is a member of the Data Nutrition Project.
Geert Lokhorst is a research master student at the University of Amsterdam, Institute for Information Law.
Sarah Newman is a Senior Researcher at metaLAB at Harvard University, and a co-founder of the Data Nutrition Project, which creates tools to mitigate bias in algorithms by assessing the quality of the underlying data.
Monica Palmirani is full professor at CIRSFID (University of Bologna).
Vagelis Papakonstantinou is a professor of law at the Faculty of Law & Criminology of the Vrije Universiteit Brussel (VUB). He is the Coordinator of VUB’s Cyber and Data Security Lab (CDSL), a core member of VUB’s Research Group on Law Science Technology & Society (LSTS), and a research member of the Brussels Privacy Hub.
Domenico Rosani is a research and teaching associate at the University of Innsbruck, Department of Italian Law.
Arianna Rossi is a postdoc researcher at SnT – Interdisciplinary Centre for Security, Reliability and Trust (University of Luxembourg).
Wolfgang Schulz is the director of the Leibniz-Institute for Media Research | Hans-Bredow-Institute (HBI) and holds the chair for Media Law and Public Law including their Theoretical Foundations at University of Hamburg.
Sibylle Schupp is Head of the Software Technology Systems (STS) Institute at Hamburg University of Technology.
Mireille van Eechoud is Professor of Information Law at IVIR, University of Amsterdam.
Florian Wittner is a PhD candidate at Leibniz-Institute for Media Research | Hans-Bredow-Institute (HBI).

1. The Dataset Nutrition Label: A Framework to Drive Higher Data Quality Standards

SARAH HOLLAND,1 AHMED HOSNY,2 SARAH NEWMAN,3 JOSHUA JOSEPH4 AND KASIA CHMIELINSKI5

Abstract

Data is a fundamental ingredient in building Artificial Intelligence (AI) models and there are direct correlations between data quality and model robustness, fairness and utility. A growing body of research points to AI systems deployed in a wide range of use cases, where algorithms trained on biased, incomplete, or ill-fitting data produce problematic results. Despite the increased critical attention, data interrogation continues to be a challenging task with many issues being difficult to identify and rectify. Algorithms often come under scrutiny only after they are developed and deployed, which exacerbates this problem and underscores the need for better data vetting practices earlier in the development pipeline. We introduce the Dataset Nutrition Label,6 a diagnostic framework built by the Data Nutrition Project, comprising a label that provides a distilled yet comprehensive overview of dataset ‘ingredients’. The label is designed to be flexible and adaptable; it is comprised of a diverse set of qualitative and quantitative modules generated through multiple statistical and probabilistic modelling backends. Working with the ProPublica dataset ‘Dollars for Docs’, we developed an open source tool7 consisting of seven sample modules. Consulting such a label prior to AI model development promotes vigorous data interrogation practices, aids in recognising inconsistencies and imbalances, provides an improved means of selecting more appropriate datasets for specific tasks and subsequently increases the overall quality of AI models. We also explore some challenges of the label, including generalising across diverse datasets, as well as discussing research and public policy agendas to further advocate its adoption and ultimately improve the AI development ecosystem.

Keywords

Artificial intelligence, machine learning, data ethics, bias, ethics.

1 Assembly, MIT Media Lab and Berkman Klein Center at Harvard University.
2 Dana-Farber Cancer Institute, Harvard Medical Institute.
3 metaLAB @ Harvard, Berkman Klein Center for Internet & Society, Harvard University.
4 MIT Quest for Intelligence.
5 Assembly, MIT Media Lab and Berkman Klein Center at Harvard University.
6 Available at: https://datanutrition.media.mit.edu/.
7 Available at: https://ahmedhosny.github.io/datanutrition/.

I. Introduction

Data-driven decision-making systems play an increasingly important role in our lives. These frameworks are built on increasingly sophisticated artificial intelligence (AI) systems and are created and tuned by a growing population of data specialists8 to arrive at a diversity of decisions: from movie and music recommendations to digital advertisements and mortgage applications.9 These systems deliver untold societal and economic benefits, but they can also be harmful to individuals and society at large.

Figure 1.1 Model Development Pipeline

8 The term ‘data specialist’ is used instead of ‘data scientist’ in the interest of using a term that is broadly scoped to include all professionals utilising data in automated decision-making systems: data scientists, analysts, machine learning engineers, model developers, artificial intelligence researchers and a variety of others in this space.
9 Thomas H Davenport and Jeanne G Harris, ‘Automated Decision Making Comes of Age’ (2005) 46 MIT Sloan Management Review 83.

Data is a fundamental ingredient of AI and the quality of a dataset used to build a model will directly influence the outcomes it produces. An AI model trained on problematic data will likely produce problematic outcomes. Examples of these include gender bias in language translations surfaced through natural language processing10 and skin shade bias in facial recognition systems due to non-representative data.11

Typically, the model development pipeline (Figure 1.1) begins with a question or goal. Within the realm of supervised learning, for instance, a data specialist will curate a labelled dataset of previous answers in response to the guiding question. Such data is then used to train a model to respond in a way that accurately correlates with past occurrences. In this way, past answers are used to forecast the future. This is particularly problematic when outcomes of past events are contaminated with (often unintentional) bias.

Models often come under scrutiny only after they are built, trained and deployed. If a model is found to perpetuate a bias – for example, over-indexing for a particular race or gender – the data specialist returns to the development stage to identify and address the issue. This feedback loop is inefficient, costly and does not always mitigate harm; the time and energy of the data specialist is a sunk cost and, if in use, the model deployment may have already produced problematic outcomes. Some of these issues could be avoided by undertaking a thorough interrogation of data at the outset of model development. However, this is still not a widespread practice within AI model development efforts.

Figure 1.2 (A) Survey results about data analysis best practices in respondents’ organisations and (B) Survey results about how respondents learned to analyse data

10 Tolga Bolukbasi and others, ‘Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings’, Advances in Neural Information Processing Systems (2016).
11 Joy Buolamwini and Timnit Gebru, ‘Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification’ in Sorelle A Friedler and Christo Wilson (eds), Proceedings of the 1st Conference on Fairness, Accountability and Transparency (PMLR 2018).

We conducted an anonymous online survey (see Figure 1.2), the results of which further lend credence to this problem. Although many (47%) respondents report conducting some form of data analysis prior to model development, most (74%) indicate that their organisations do not have explicit best practices for such analysis. Fifty-nine per cent of respondents reported relying primarily on experience and self-directed learning (through online tutorials, blogs, academic papers, Stack Overflow and online data competitions) to inform their data analysis methods and practices. This survey indicates that, despite limited current standards, there is widespread interest in improving data analysis practices and making them accessible.

To improve the accuracy and fairness of AI systems, it is imperative that data specialists can assess more quickly the viability and fitness of datasets and more easily find and use better-quality data to train their models. As a proposed solution, we introduce a dataset nutrition label, a diagnostic framework to address and mitigate some of these challenges by providing critical information to data specialists at the point of data analysis. The label thus acts as a first point of contact where decisions regarding the utility and fitness of specific datasets can be made. This is achieved by allowing the recognition of dataset inconsistencies and exclusions as well as promoting dataset interrogation as a crucial and inevitable procedure in the AI model development pipeline – with the ultimate goal of improving the overall quality of AI systems.

We begin with a review of related work, largely drawing from the fields of nutrition and privacy, where labels are a useful mechanism to distill essential information, enable better decision-making and influence best practices. We then discuss the dataset nutrition label prototype, our methodology, demonstration dataset and key results. This is followed by an overview of the benefits of the tool, its potential limitations and ways to mitigate those limitations. We then briefly summarise some future directions, including research and public policy agendas that would further advance the goals of the label. Lastly, we discuss implementation of the prototype and key takeaways.
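The idea of surfacing dataset inconsistencies and imbalances before any model is trained can be illustrated with a minimal sketch. It is not the Data Nutrition Project's tool and does not reproduce any of its modules; the function names (`summarise_column`, `build_label`), the choice of statistics, the rule that `None` and empty strings count as missing, and the toy records are all assumptions made purely for illustration.

```python
from collections import Counter


def summarise_column(values):
    """Return a small, label-style summary for one column of a dataset."""
    total = len(values)
    # Assumption: None and empty strings are treated as missing entries.
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    return {
        "rows": total,
        "missing_fraction": (total - len(present)) / total if total else 0.0,
        "distinct_values": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


def build_label(rows):
    """Build a per-column summary for a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return {
        col: summarise_column([row.get(col) for row in rows])
        for col in columns
    }


# Hypothetical toy records, invented for illustration only.
records = [
    {"gender": "F", "approved": "yes"},
    {"gender": "M", "approved": "yes"},
    {"gender": "M", "approved": "no"},
    {"gender": "M", "approved": None},
]

label = build_label(records)
for column, summary in label.items():
    print(column, summary)
```

Even a summary this crude would show, at the point of data analysis, that one group dominates the hypothetical `gender` column and that a quarter of the `approved` values are missing, which is the kind of signal a data specialist might want before training.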

II.  Labels in Context and Related Work

To inform the development of our prototype and concept, we surveyed the literature for labelling efforts. Labels and warnings are utilised effectively in product safety,12 pharmaceuticals,13 energy14 and material safety.15 We largely draw from the fields of nutrition, online privacy and algorithmic accountability as they are particularly salient for our purposes. The first is the canonical example and a long-standing practice subject to significant study, while the latter two provide valuable insights into the application of a ‘nutrition label’ in other domains, particularly in subjective contexts and where there is an absence of legal mandates and use is voluntary. Collectively, they elucidate the impacts of labels on audience engagement, education and user decision making.

In 1990, the US Congress passed the Nutrition Labeling and Education Act (P.L. 101–535), which includes a requirement that certain foodstuffs display a standardised ‘Nutrition Facts’ label.16 The mandated label communicated vital nutritional facts in the context of the ‘Daily Value’ benchmark, so consumers could quickly assess nutrition information and more effectively abide by dietary recommendations at the moment of decision.17 In the nearly three decades since its implementation, several studies have examined the efficacy of the now ubiquitous ‘Nutrition Facts’ label; these studies include analyses of how consumers use the label18 and the effect it has had on the market.19 Though some cast doubt on the benefits of the mandate in light of its cost,20 most research concludes that the ‘Nutrition Facts’ label has had a positive impact.21 Surveys demonstrate widespread consumer awareness of the label and its influence in decision making around food, despite a relatively short time since the passage of the Nutrition Labeling and Education Act.22 According to the International Food Information Council, more than 80 per cent of consumers reported they looked at the ‘Nutrition Facts’ label when deciding what foods to purchase or consume and only 4 per cent reported never using the label.23 Five years after the mandate, the Food Marketing Institute found that about one-third of consumers stopped buying a food because of what they read on the label.24 With regard to the information contained on the

12 US Consumer Product Safety Commission. Office of the General Counsel, Compilation of Statutes Administered by CPSC (US Consumer Product Safety Commission 1998).
13 Foster D McClure and United States. Food and Drug Administration, FDA Nutrition Labeling Manual: A Guide for Developing and Using Databases (US Food and Drug Administration 1993).
14 Europäische Union, ‘Directive 2009/28/EC of the European Parliament and of the Council of 23 April 2009 on the Promotion of the Use of Energy from Renewable Sources and Amending and Subsequently Repealing Directives 2001/77/EC and 2003/30/EC’ (2009) 5 Official Journal of the European Union 2009.
15 Occupational Safety and Health Administration and others, ‘Hazard Communication Standard: Safety Data Sheets’ [2012] OSHA Brief.

16 United States. Congress. House. Committee on Energy and Commerce, Nutrition Labeling and Education Act of 1990: Report (to Accompany H.R. 3562) (including Cost Estimate of the Congressional Budget Office) (US Government Printing Office 1990).
17 Ibid; Siva K Balasubramanian and Catherine Cole, ‘Consumers’ Search and Use of Nutrition Information: The Challenge and Promise of the Nutrition Labeling and Education Act’ (2002) 66 Journal of Marketing 112; Joanne F Guthrie and others, ‘Who Uses Nutrition Labeling and What Effects Does Label Use Have on Diet Quality?’ (1995) 27 Journal of Nutrition Education 163.
18 Balasubramanian and Cole (n 17); Guthrie and others (n 17).
19 Bruce A Silverglade, ‘The Nutrition Labeling and Education Act: Progress to Date and Challenges for the Future’ (1996) 15 Journal of Public Policy & Marketing 148.
20 Paul J Petruccelli, ‘Consumer and Marketing Implications of Information Provision: The Case of the Nutrition Labeling and Education Act of 1990’ (1996) 15 Journal of Public Policy & Marketing 150.
21 Mario F Teisl, Alan S Levy and others, ‘Does Nutrition Labeling Lead to Healthier Eating?’ (1997) 28 Journal of Food Distribution Research 18; Andreas C Drichoutis, Panagiotis Lazaridis and Rodolfo M Nayga Jr, ‘Consumers’ Use of Nutritional Labels: A Review of Research Studies and Issues’ (2006) Academy of Marketing Science Review 1.
22 United States Congress. House Committee on Energy and Commerce (n 16).
23 Susan Borra, ‘Consumer Perspectives on Food Labels’ (2006) 83 The American Journal of Clinical Nutrition 1235S.
24 United States Congress. House Committee on Energy and Commerce (n 16).

label and consumer understanding, researchers found that ‘label format and inclusion of (external) reference value information appear to have (positive) effects on consumer perceptions and evaluations’,25 but consumers indicated confusion about the ‘Daily Value’ comparison, suggesting that more information about the source and reliability of ground truth information would be useful.26 The literature focuses primarily on the impact on consumers rather than on industry operations such as production and advertising. However, the significant impact of reported sales and marketing materials on consumers27 provides a foundation for further inquiry into how this has affected the greater food industry.

In the field of privacy and privacy disclosures, the nutrition label serves as a useful point of reference and inspiration.28 Researchers at Carnegie Mellon and Microsoft created the ‘Privacy Nutrition Label’ to better surface essential privacy information to assist consumer decision making with regard to the collection, use and sharing of personal information.29 The ‘Privacy Nutrition Label’ operates much like ‘Nutrition Facts’ and sits atop existing disclosures. It improves the functionality of the Platform for Privacy Preferences, a machine-readable format developed by the World Wide Web Consortium, itself an effort to standardise and improve the legibility of privacy policies.30 User surveys that tested the ‘Privacy Nutrition Label’ against alternative formats found that the label outperformed alternatives with ‘significant positive effects on the accuracy and speed of information finding and reader enjoyment with privacy policies’ as well as improved consumer understanding.31

Ranking and scoring algorithms also pose challenges in terms of their complexity, opacity and sensitivity to the influence of data. End users and even model developers face difficulty in interpreting an algorithm and its ranking outputs and this difficulty is further compounded when the model and the data on which it is trained are proprietary or otherwise confidential, as is often the case. ‘Ranking Facts’ is a web-based system that generates a ‘nutrition label’ for scoring and ranking algorithms based on factors or ‘widgets’ to communicate an algorithm’s methodology or output.32 Here, the label serves more as an interpretability tool

25 Scot Burton, Abhijit Biswas and Richard Netemeyer, ‘Effects of Alternative Nutrition Label Formats and Nutrition Reference Information on Consumer Perceptions, Comprehension and Product Evaluations’ (1994) 13 Journal of Public Policy & Marketing 36.
26 Borra (n 23).
27 Silverglade (n 19).
28 Corey A Ciocchetti, ‘The Future of Privacy Policies: A Privacy Nutrition Label Filled with Fair Information Practices’ (2008) 26 The John Marshall Journal of Computer & Information Law 1.
29 Patrick Gage Kelley and others, ‘A Nutrition Label for Privacy’, Proceedings of the 5th Symposium on Usable Privacy and Security (Association for Computing Machinery (ACM) 2009).
30 Patrick Gage Kelley and others, ‘Standardizing Privacy Notices: An Online Study of the Nutrition Label Approach’, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (ACM 2010).
31 Kelley and others (n 29); Kelley and others (n 30).
32 Ke Yang and others, ‘A Nutritional Label for Rankings’. Available at: www.cs.drexel.edu/~julia/documents/cr-sigmod-demo-2018.pdf.

than as a summary of information, which is how the ‘Nutrition Facts’ and ‘Privacy Nutrition Label’ operate. The widgets work together, not modularly, to assess the algorithm on author-created categories of transparency, fairness, stability and diversity. The demonstration scenarios, which use real datasets from college rankings, criminal risk assessment and financial services, establish that the label is potentially applicable to a diverse range of domains. This lends credence to the potential utility in other fields as well, including the rapidly evolving field of AI.

More recently, in an effort to improve transparency, accountability and outcomes of AI systems, AI researchers have proposed methods for standardising practices and communicating information about the data itself. The first draws from computer hardware and industry safety standards where datasheets are an industry-wide standard. In datasets, however, they are a novel concept. Datasheets are functionally comparable to the label concept and, like labels, which by and large objectively surface empirical information, can often include other information such as recommended uses which are more subjective. ‘Datasheets for Datasets’, a proposal from researchers at Microsoft Research, Georgia Tech, University of Maryland and the AI Now Institute, seeks to standardise information about public datasets, commercial APIs and pretrained models. The proposed datasheet includes dataset provenance, key characteristics, relevant regulations and test results, but also significant yet more subjective information such as potential bias, strengths and weaknesses of the dataset, API, or model and suggested uses.33 As domain experts, dataset, API and model creators would be responsible for creating the datasheets, not end users or other parties.
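
As an illustration only, the categories of information listed above could be held in a small record type that keeps the largely empirical entries apart from the more subjective ones. The Python sketch below is not the schema proposed in ‘Datasheets for Datasets’: the class name `Datasheet`, its field names, the helper `subjective_entries` and the example values are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Datasheet:
    """Hypothetical record; field names are assumptions, not the proposal's schema."""

    # Entries that are largely empirical
    provenance: str
    key_characteristics: List[str] = field(default_factory=list)
    relevant_regulations: List[str] = field(default_factory=list)
    test_results: List[str] = field(default_factory=list)
    # Entries that rest more on the creator's judgement
    potential_bias: List[str] = field(default_factory=list)
    strengths: List[str] = field(default_factory=list)
    weaknesses: List[str] = field(default_factory=list)
    suggested_uses: List[str] = field(default_factory=list)

    def subjective_entries(self) -> Dict[str, List[str]]:
        """Return only the judgement-based entries, keyed by field name."""
        return {
            "potential_bias": self.potential_bias,
            "strengths": self.strengths,
            "weaknesses": self.weaknesses,
            "suggested_uses": self.suggested_uses,
        }


# Invented example values, for illustration only.
sheet = Datasheet(
    provenance="Collected by the dataset creator (hypothetical)",
    key_characteristics=["tabular", "10,000 rows"],
    potential_bias=["under-represents rural respondents"],
    suggested_uses=["exploratory analysis"],
)
print(sheet.subjective_entries())
```

Keeping the two groups separate in the record makes it easy for a reader of the datasheet to see which entries report measurements and which report the creator's opinion.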
We are also aware of a forthcoming study from the field of natural language processing (NLP), ‘Data Statements for NLP: Toward Mitigating System Bias and Enabling Better Science’.34 The researchers seek to address ethics, exclusion and bias issues in NLP systems. Borrowing from similar practices in other fields, the position paper puts forward the concept and practice of ‘data statements’, which are qualitative summaries that provide detailed information and important context about the populations the datasets represent. The information contained in data statements can be used to surface potential mismatches between the populations used to train a system and the populations in planned use prior to deployment, to help diagnose sources of bias that are discovered in deployed systems and to help understand how experimental results might generalise. The authors suggest that data statements should eventually become required practice for system documentation and academic publications for NLP systems and should be extended to other data types (eg image data) albeit with tailored schema.

We take a different, yet complementary, approach. We hypothesise that the concept of a ‘nutrition label’ for datasets is an effective means to provide a scalable

33 Timnit Gebru and others, ‘Datasheets for Datasets’ (2018). Available at: http://arxiv.org/abs/1803.09010.
34 n/a, ‘Data Statements for NLP: Toward Mitigating System Bias and Enabling Better Science’.

and efficient tool to improve the process of dataset interrogation and analysis prior to and during model development. To support our hypothesis, we created a prototype, a dataset nutrition label. Three goals drive this work. First, to inform and improve data specialists’ selection and interrogation of datasets and to prompt critical analysis. Consequently, data specialists are the primary intended audience. Second, to gain traction as a practical, readily deployable tool, we prioritise efficiency and flexibility. To that end, we do not suggest one specific approach to the label or charge one specific community with creating the label. Rather, our prototype is modular and the underlying framework is one that anyone can utilise. Lastly, we leverage probabilistic computing tools to surface potential corollaries, anomalies and proxies. This is particularly beneficial because these issues take excess development time to resolve and can lead to undesired correlations in trained models.
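
To make the idea of a modular label concrete, the following Python sketch assembles a label from independent modules that a label author may include or omit. It is a simplified stand-in and not the prototype's implementation: it substitutes plain Pearson correlation for the probabilistic computing tools mentioned above, it handles numeric tabular columns only, and the names `build_label`, `summary_module` and `proxy_module`, the 0.9 threshold and the toy data are all assumptions made for this example.

```python
from math import sqrt
from typing import Callable, Dict, List

Table = Dict[str, List[float]]  # column name -> numeric values (tabular data only)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Plain Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def summary_module(table: Table) -> dict:
    """Hypothetical module: basic per-column statistics."""
    return {
        name: {"count": len(col), "min": min(col), "max": max(col)}
        for name, col in table.items()
    }


def proxy_module(table: Table, threshold: float = 0.9) -> dict:
    """Hypothetical module: flag column pairs whose |r| meets the threshold."""
    names = sorted(table)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(table[a], table[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return {"possible_proxies": flagged}


def build_label(table: Table, modules: Dict[str, Callable[[Table], dict]]) -> dict:
    """Run whichever modules the label author chose; each stands alone."""
    return {name: module(table) for name, module in modules.items()}


# Toy data, invented for illustration only.
toy = {
    "age": [23.0, 35.0, 47.0, 59.0],
    "years_employed": [1.0, 13.0, 25.0, 37.0],  # moves in step with age
    "postcode_id": [4.0, 1.0, 3.0, 2.0],
}
label = build_label(toy, {"summary": summary_module, "proxies": proxy_module})
print(label["proxies"])
```

Because each module takes the table and returns its own section of the label, a module can be added, dropped or replaced without touching the others, which is the sense of modularity the sketch is meant to show. A flagged pair signals only that one column may act as a proxy for another and warrants closer interrogation by the data specialist.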

III. Methods

Some assumptions are made to focus our prototyping efforts. Only tabular data is considered. Additionally, we limit our explorations to datasets