Stefan Schiffner • Sebastien Ziegler • Adrian Quesada Rodriguez Editors
Privacy Symposium 2022 Data Protection Law International Convergence and Compliance with Innovative Technologies (DPLICIT)
Editors

Stefan Schiffner
University of Münster
Münster, Germany

Sebastien Ziegler
Mandat International
Geneva, Switzerland

Adrian Quesada Rodriguez
Mandat International
Geneva, Switzerland
ISBN 978-3-031-09900-7    ISBN 978-3-031-09901-4 (eBook)
https://doi.org/10.1007/978-3-031-09901-4
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022, corrected publication 2023

Chapters 8 and 10 are licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). For further details see license information in the chapter.

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
With the European General Data Protection Regulation (GDPR) entering into force in 2016, a chain reaction was triggered: many other jurisdictions are adopting similar regulations or have at least accelerated their legislative processes. Simultaneously, digitalization was already changing the world at high speed and was further propelled by the emergence of the COVID-19 pandemic. These developments have impacted all economic and societal sectors. Consequently, almost any human endeavor leaves digital footprints, generating an exponentially growing volume of personal data. In this context, many questions and challenges emerge: How to support convergence of data protection requirements among and across distinct jurisdictions? How to apply a data protection by design approach to emerging technologies? How to integrate certification and processes to demonstrate compliance into organizational structures?

The present proceedings are a collection of original work presented in the scientific track at the Privacy Symposium in Venice, April 5–7, 2022. The Symposium promotes international dialogue, cooperation, and knowledge sharing on data protection, innovative technologies, and compliance. Legal and technology experts, together with researchers, professionals, and data protection authorities, meet and share their knowledge with decision-makers and prescribers, including supervisory authorities, regulators, law firms, DPO associations, DPOs, and the C-level of large companies.

In order to reach out to the community, in early 2021 we issued a call for papers stating: We welcome multidisciplinary contributions bringing together legal, technical, and societal expertise, including theoretical, analytical, empirical, and case studies. We particularly encourage submissions that fall under one of the following thematic areas:

Law and Data Protection

– Multidisciplinary approaches, arbitration and balance in data protection: arbitration in data protection law and compliance, multi-stakeholder and multidisciplinary approaches in data protection and compliance.
– International law and comparative law in data protection and compliance: cross-border data transfer approaches and solutions, international evolution of data protection regulations, international evolution of compliance regulations and norms, comparative law analysis in the data protection domain, comparative law analysis in the compliance domain, international law development in the data protection domain, international law development in compliance, interaction between regulations, standards, and soft law in data protection.
– Data subject rights: right to be informed, right to access and rectify personal data, right to restrict or object to the processing of personal data, right to limit access, processing and retention of their personal data, right to lodge a complaint with a supervisory authority, right not to be subject to a decision based solely on automated processing, including profiling, right to withdraw consent at any time, right to data portability, delegation and representation of data subjects' rights, effective processes, implementations and monitoring of data, automated mechanisms and tools to support data subjects' rights and consent management.

Technology and Compliance

– Emerging technologies compliance with data protection regulation: emerging technologies and data protection compliance, data protection compliant technologies, artificial intelligence, compliance and data protection, blockchain and distributed ledger technology, 5G and beyond, data protection by design.
– Data protection compliance in Internet of Things, edge, and cloud computing: enabling data protection compliance in networking technologies, impact of extreme edge on privacy, network virtualization, seamless compliance from edge to core in multi-tenant environments.
– Technology for compliance and data protection: privacy enhancing technologies (PET), anonymization and pseudonymization, privacy by default, innovative legal tech and compliance technology, compliance standardization and interoperability, data sovereignty.

Cybersecurity and Data Protection

– Cybersecurity and data protection measures: technical and organizational measures for data protection, cybersecurity, privacy and data protection by design and by default, authentication, digital identities, cryptography, network inspection, GDPR compliance, evaluation of the state-of-the-art technology compliance with data protection, cybercrime and data protection, identity theft and identity usurpation.

Data Protection in Practice

– Audit and certification: audit and certification methodologies, innovative solutions and services for audit and certification.
– Data protection best practices across verticals: health and medical technology, mobility, connected vehicles, smart cities, supply chains, telecommunication.
– Data protection economics: market analysis, economic models and their impact on data protection, compliance and financial valuation, compliance by technology, economic impact of international convergence in data protection, impact of data protection regulations, unintended harms of cybersecurity measures.

Our call was answered by 31 researchers or groups of researchers. Our program committee and additional referees carefully reviewed these contributions and provided feedback to the authors. We selected 12 papers for presentation at the conference, resulting in a 39% acceptance rate. We further asked the authors of these contributions to consider our feedback and compile a final version of their work. The present book contains these reviewed and revised manuscripts.

Lastly, we would like to express our gratitude to the program committee members and referees for their voluntary contributions and to the authors and co-authors for their patience during the review process. We are looking forward to the exchange of ideas in Venice.

Münster, Germany
April 2022
Stefan Schiffner
On behalf of the organizing committee and as Program Chair
Privacy Symposium 2022
Organization
The Privacy Symposium conference has been established to support international dialogue, cooperation, and knowledge sharing on data protection. The 2022 edition was hosted by the Ca' Foscari University of Venice and was organized in collaboration with several institutions, including the Data Protection Unit of the Council of Europe, European Centre for Certification and Privacy (ECCP), European Law Students' Association (ELSA), European Cyber Security Organization (ECSO), European Federation of Data Protection Officers (EFDPO), European Centre on Privacy and Cybersecurity (ECPC), Italian Institute for Privacy (IIP), Italian Association for Cyber Security (CLUSIT), IoT Forum, IoT Lab, Mandat International, and several European research projects such as NGIOT, Gatekeeper, and CyberSec4Europe. The overall coordination was provided by the foundation Mandat International. The call for papers of this first edition focused on Data Protection Law International Convergence and Compliance with Innovative Technologies (DPLICIT).
Executive Committee

General Chair: Sébastien Ziegler (IoT Forum, ECCP, Switzerland)
Program Chair: Stefan Schiffner (University of Münster, Germany)
Steering Committee

Paolo Balboni (European Centre for Privacy and Cybersecurity)
Natalie Bertels (KU Leuven)
Luca Bolognini (Italian Institute for Privacy)
Andrew Charlesworth (University of Bristol)
Afonso Ferreira (CNRS-IRIT)
Romeo Kadir (University of Utrecht)
Latif Ladid (University of Luxembourg)
Kai Rannenberg (Goethe University Frankfurt)
Antonio Skarmeta (University of Murcia)
Geert Somers (Timelex)
Program Committee

Florian Adamsky (Hof University of Applied Sciences)
Ignacio Alamillo Domingo (University of Murcia)
José M. del Álamo (Technical University of Madrid)
Bettina Berendt (TU Berlin)
Luca Bolognini (Italian Institute for Privacy and Data Valorisation)
Wilhelmina Maria Botes (SnT, University of Luxembourg)
Athena Bourka (ENISA)
Josep Domingo-Ferrer (Universitat Rovira i Virgili)
Afonso Ferreira (CNRS, IRIT)
Michael Friedewald (Fraunhofer ISI)
Gabriele Lenzini (SnT, University of Luxembourg)
Dominik Herrmann (University of Bamberg)
Meiko Jensen (Kiel University of Applied Sciences)
Sokratis Katsikas (Norwegian University of Science and Technology)
Stefan Katzenbeisser (University of Passau)
Christiane Kuhn (Karlsruhe Institute of Technology)
Dimosthenis Kyriazis (University of Piraeus)
Elwira Macierzyńska-Franaszczyk (Kozminski University)
Sebastian Pape (Goethe University Frankfurt)
Robin Pierce (TILT Tilburg Law School)
Davy Preuveneers (KU Leuven)
Delphine Reinhardt (University of Göttingen)
Arnold Roosendaal (Privacy Company)
Arianna Rossi (SnT, University of Luxembourg)
María Cristina Timón López (University of Murcia)
Julián Valero-Torrijos (University of Murcia)
Additional Referees

Monica Arenas
Yod Samuel Martín García
Demosthenes Ikonomou
Michael Mühlhauser
Sponsoring Institutions

We are grateful to the sponsors of the conference, which, at the time of printing this book and in alphabetical order, were: Apple Inc., the BSI Group, Deloitte Touche Tohmatsu Limited, ICT Legal Consulting, Meta Platforms Inc., Microsoft Corporation, the Prighter Group, and Usercentrics GmbH. Lastly, the conference was greatly supported by our local hosts: the Ca' Foscari University of Venice and the Scuola San Rocco.
Contents
Part I Privacy Friendly Data Usage

1 An Overview of the Secondary Use of Health Data Within the European Union: EU-Driven Possibilities and Civil Society Initiatives
Sara Testa

2 Multi-Party Computation in the GDPR
Lukas Helminger and Christian Rechberger

3 A Critique of the Google Apple Exposure Notification (GAEN) Framework
Jaap-Henk Hoepman

Part II Implications of Regulatory Framework in the European Union

4 Global Data Processing Identifiers and Registry (DP-ID)
Sébastien Ziegler and Cédric Crettaz

5 Europrivacy Paradigm Shift in Certification Models for Privacy and Data Protection Compliance
Sébastien Ziegler, Ana Maria Pacheco Huamani, Stea-Maria Miteva, Adrián Quesada Rodriguez, and Renata Radocz

Part III What is Beyond Brussels? International Norms and Their Interactions with the EU

6 Untying the Gordian Knot: Legally Compliant Sound Data Collection and Processing for TTS Systems in China
Stefanie Meyer, Sven Albrecht, Maximilian Eibl, Günter Daniel Rey, Josef Schmied, Rewa Tamboli, Stefan Taubert, and Dagmar Gesmann-Nuissl

7 Regulating Cross-Border Data Flow Between EU and India Using Digital Trade Agreement: An Explorative Analysis
Vanya Rakesh

8 When Regulatory Power and Industrial Ambitions Collide: The "Brussels Effect," Lead Markets, and the GDPR
Nicholas Martin and Frank Ebbers

Part IV The Ethics of Privacy and Sociotechnical Systems

9 Nobody Wants My Stuff and It Is Just DNA Data, Why Should I Be Worried
Lipsarani Sahoo, Mohamed Shehab, Elham Al Qahtani, and Jay Dev

10 Unwinding a Legal and Ethical Ariadne's Thread Out of the Twitter Scraping Maze
Arianna Rossi, Archana Kumari, and Gabriele Lenzini

Part V User Perception of Data Protection and Data Processing

11 You Know Too Much: Investigating Users' Perceptions and Privacy Concerns Towards Thermal Imaging
Lipsarani Sahoo, Nazmus Sakib Miazi, Mohamed Shehab, Florian Alt, and Yomna Abdelrahman

12 Why Is My IP Address Processed?
Supriya Adhatarao, Cédric Lauradoux, and Cristiana Santos

Correction to: When Regulatory Power and Industrial Ambitions Collide: The "Brussels Effect," Lead Markets, and the GDPR

Index
Part I
Privacy Friendly Data Usage
Chapter 1
An Overview of the Secondary Use of Health Data Within the European Union: EU-Driven Possibilities and Civil Society Initiatives

Sara Testa
Abstract In recent years, the European Union has been heavily impacted by the increasing costs of its healthcare systems: data-driven health research could play a major role in such a framework. This chapter aims at presenting the legal framework and the initiatives currently implemented for the secondary use of health data. The European Commission has defined a strategy that aims at supporting the exploitation of data while ensuring the fundamental rights of citizens, and it is working on the definition of a legislative framework allowing the secondary use of data. The reference points related to the processing of personal data are the General Data Protection Regulation (GDPR), which defines the main fundamental principles to be complied with, and the Data Governance Act proposal, which aims at enabling the safe reuse of certain categories of public-sector data subject to the rights of others, such as data concerning health. However, the proposal's compliance with the GDPR has been questioned by the European Data Protection Board and the European Data Protection Supervisor. Within the healthcare sector, initiatives were developed to boost the secondary use of data. Some of these are led by policymakers, such as the European Health Data Space, while others have emerged from civil society and leverage the concepts of data donation and data altruism. Still, the application of both these concepts raises legal uncertainty: therefore, it is crucial to guarantee a legislative framework that supports the positive exploitation of data.

Keywords GDPR · Secondary use of data · Data altruism · Data donation · Healthcare data
S. Testa
Fondazione Bruno Kessler, Via Santa Croce, Trento, Italy
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
S. Schiffner et al. (eds.), Privacy Symposium 2022, https://doi.org/10.1007/978-3-031-09901-4_1
1 Introduction: The Potential of the Secondary Use of Data for Health and Care in the European Union

The current and future trends of the healthcare system are one of the most pressing issues within the European Union. The healthcare expenditure of the European Union (EU-27) in 2017 amounted to 9.9% of the Union's Gross Domestic Product (GDP), corresponding to more than 1 trillion euros [1]. Since 2012, in all European Member States (except Greece), healthcare expenditure has risen, reaching a 54% increase in Romania [1]. In terms of fiscal pressure and burden on the healthcare system, in the European Semester Thematic Factsheet on Health Systems [2], the European Commission (EC) recognises the trend of the ageing population and the related higher impact of chronic diseases as one of the growing challenges of the years to come. In such a framework, the role of medical research becomes pivotal, and its potential can have a massive positive impact if associated with (big) data. For example, the advancement of personalised and predictive medicine might result in a direct effect of lower costs for the healthcare system, not to mention a positive impact on the quality of care [3]. By tailoring care pathways to the specificities of each patient, personalised medicine would increase their effectiveness, while predictive medicine might curb the onset of preventable diseases. Medical research can be considered a data-driven science that used to acquire the greatest part of its data from clinical trials [4]: nowadays, the dependence on data has increased [5] as the gold standard data from clinical trials are enriched with real-world data. When there is 'high volume, high diversity biological, clinical, environmental, and lifestyle information collected from single individuals to large cohorts, in relation to their health and wellness status, at one or several time points [4]', we refer to 'big data'. Big data analysis has a huge potential, thanks to the recent uptake of artificial intelligence algorithms and deep-learning techniques that have the capability to generate high-quality and accurate results, useful for different applications such as medical diagnosis, care pathways, and innovation in the healthcare field.

The European Commission has addressed the importance of data within healthcare research in its main research and innovation programme, Horizon Europe, as well as in the related mission on health: 'Conquering cancer: mission possible'. The Report of the Mission Board for Cancer [6] underscores the central role of data throughout the whole document, especially in Recommendations 2, 4, 5, and 8¹: in general, the creation of platforms for collecting, disseminating, and analysing health data would be exploited for screening, early detection, definition of risks and risk factors, and definition of personalised care plans; in addition, Recommendation 8 aims at creating a European Cancer Patient Digital Centre where cancer patients and survivors can deposit the health data generated by healthcare providers and will obtain a health passport summarising all their data as well as follow-up recommendations. The document stresses the necessity of compliance with current regulations and ethical standards and the need for the European Commission to find common ground to exploit digitalisation positively and efficiently without threatening citizens' fundamental rights.

The following sections define the efforts (and the constraints) of the European Commission in the definition of a legislative framework related to the governance of personal and non-personal data, which has a huge impact on research activities. Then, this chapter presents other initiatives within the healthcare sector that have arisen both from European and national institutions and from civil society.

¹ The document defines 13 Recommendations that will shape the mission. The ones mentioned above are Rec. 2 – Develop an EU-wide research programme to identify (poly-)genic risk scores; Rec. 4 – Optimise existing screening programmes and develop novel approaches for screening and early detection; Rec. 5 – Advance and implement personalised medicine approaches for all cancer patients in Europe; and Rec. 8 – Create a European Cancer Patient Digital Centre where cancer patients and survivors can deposit and share their data for personalised care.
2 The Legal Framework in the European Union

In the last decade, the European Union has been developing a regulatory framework concerning data issues and will continue its effort to improve and expand it in the years to come. In the document devoted to the "European strategy for data" [3], the EC underlines the centrality of data within the digital transformation experienced in the last years: while recognising its value for boosting innovation and preserving the public good, the Commission places human beings, their values, and their fundamental rights at the centre. The European Commission has been shaping a legal framework for the sake of ensuring a high standard of privacy, security, safety, and ethics for its citizens. First of all, the General Data Protection Regulation 2016/679 (GDPR) lays the basis for digital trust, and its upcoming review [3] will further strengthen it; other legislative actions in this perspective are the Regulation on the free flow of non-personal data 2018/1807 (FFD), the Cybersecurity Act 2019/881 (CSA), the Open Data Directive 2019/1024, and the upcoming Data Governance Act. In terms of advancement within the healthcare sector, a paramount role is played by artificial intelligence (AI): in this respect, the proposal for an Artificial Intelligence Act [38], published by the European Commission in April 2021, should be mentioned, "laying down harmonised rules on artificial intelligence" and defining the main criteria for the development and use of AI technologies within the European Union.
2.1 The Secondary Use of Health Data Within the General Data Protection Regulation (GDPR)

The legal framework hereinabove might hinder the free flow of data and their usage for research purposes, especially sensitive data – such as health data [3]. Within the healthcare sector, it is possible to distinguish between two different categories of data usage, primary and secondary: the first is related to the processing of data for delivering healthcare services, while the second refers to the processing of data for research, policymaking, and the development of private health services [7]. The following text briefly analyses how scientific research – thus the secondary use of data – is dealt with within the GDPR, especially in reference to health research data.

First of all, art. 5 GDPR defines the principles to comply with in order to process personal data, that is, the criteria of lawfulness, fairness and transparency, purpose limitation, data minimisation, accuracy, storage limitation, integrity and confidentiality, and accountability. Art. 89 GDPR lists a few exceptions to this framework, namely the processing of data for archiving purposes in the public interest, scientific or historical research purposes, or statistical purposes: these are granted a 'preferential regime' if, in particular, data minimisation and storage limitation are ensured by applying technical and organisational measures. Such an outline hints at how important it is whether an activity is considered scientific research or not. While the Regulation does not provide a definition of scientific research [8], recital 159 of the GDPR lists a series of examples for framing scientific studies, that is, 'technological development and demonstration, fundamental research, applied research, and privately funded research', which should anyway be 'interpreted in a broad manner', especially in view of achieving the common goal of a European Research Area [9]. In addition, the recital explicitly mentions that studies within the public health field conducted in the public interest should be framed as scientific research as well. The Article 29 Working Party [10] specifies that a scientific research project should be carried out 'in accordance with relevant sector-related methodological and ethical standards, in conformity with good practice'.

Having reviewed what can be considered 'scientific research', the notion of health data should be framed as well. Article 4(15) provides the definition of 'data concerning health', meaning personal data related to the physical or mental health of a natural person, including the provision of health care services, which reveal information about his or her health status; this concept is further expanded in recital 35 of the Regulation, which states that personal health data 'should include all data pertaining to the health status of a data subject which reveal information relating to the past, current or future physical or mental health status of the data subject'. As mentioned by the Article 29 Working Party [11],² medical data include those generated within professional and medical contexts as well as those generated by devices or apps. Nonetheless, the document states that not all data generated by lifestyle apps and devices should be considered health data in case the specific information does not allow inferring information about that person's health: for example, the number of steps of a single walk does not have significant relevance in terms of assessing the whole health status without the medical context of the app and might be considered only raw data if the app does not geolocalise and process the steps taken.

² Even though the document refers to the Data Protection Directive 95/46/EC, the reasoning and the motivation of the Working Party might be considered applicable to the GDPR as well.
2.2 The Data Governance Act: A Debated First Approach to the Secondary Use of Data

In November 2020, the European Commission released the first draft of the Data Governance Act [12]. This Regulation is part of the wider set of policy measures identified by the EC in the European Data Strategy document [3] and will be followed by a Data Act which will further define other aspects such as the support for business-to-business data sharing, usage rights on co-generated industrial data, portability rights for individuals, and so forth. The Data Governance Act complements the Open Data Directive 2019/1024 by addressing the reuse of public-sector data subject to the rights of others (personal data, trade secrets, etc.) which fall outside of the Directive's scope, and is without prejudice to specific legislative frameworks (e.g. the GDPR). Despite recognising the necessity for common rules for data sharing and for enhancing trust in data intermediaries in all Member States, in their Joint Opinion [13] the European Data Protection Board (EDPB) and the European Data Protection Supervisor (EDPS) highlight the main criticalities of the Proposal, and especially its inconsistencies with pre-existing legislation such as the GDPR and the Open Data Directive.³ This text will not analyse the whole argument set out in the Joint Opinion, but will rather present the main themes, going into detail only for the topics more relevant here. The Joint Opinion addresses six main topics: (i) general issues related to the relationship of the Proposal with Union law in the field of personal data protection, (ii) reuse of certain categories of protected data held by public-sector bodies, (iii) requirements applicable to data-sharing service providers, (iv) data altruism, (v) international transfers of data, and (vi) horizontal provisions on institutional settings (such as the European Data Innovation Board expert group, penalties, evaluation and review).

The Act is without prejudice to the GDPR, and thus it should guarantee the application of the GDPR without amending or removing any definitions. However, the Joint Opinion finds ambiguities with the GDPR related to the subject matter and scope, definitions and terminology, the legal basis for processing, an unclear distinction between the processing of personal and non-personal data, and the governance and powers of bodies set up by the Proposal. The new actors introduced in the Proposal (data-sharing service provider, data altruism organisation, data user, data holder) are defined in a way that might raise legal uncertainties and might lead to confusion when related to the actors in the GDPR: for example, the Proposal mentions only data-sharing service providers as eligible for the role of controller or processor, excluding both data altruism organisations and data users even though they could qualify as (joint) data controllers or data processors according to the GDPR. Another issue is the concept of 'permission of data holders' introduced in the Proposal, which, however, ambiguously relates to personal data, non-personal data, or both: this is considerably relevant in the definition of the legal basis for the processing (art. 6.1 GDPR for personal data).

The Joint Opinion has exposed the criticalities of the Data Governance Act: these, however, should be framed within the wider context of the European Data Strategy. With the definition of the GDPR, the EC has set the basis for a legal framework that centres on the citizens and their fundamental rights: the increased demand for data sharing to boost innovation poses a great challenge for the release of data while still preserving the citizens' rights.

³ To be noted that, in the Explanatory Memorandum of the DGA, the EC states that 'the measures in the DGA are designed in a way that fully complies with the data protection legislation, and actually increases in practice the control that natural persons have over the data they generate'.
2.3 Artificial Intelligence and Personal Data in the EU Framework

In the healthcare domain, the implementation of AI technology is strictly connected with the management of data and personal data. The Guidelines for Trustworthy AI of 2019 [39] explicitly list 'privacy and data governance' as one of the fundamental requirements to be met for the realisation of trustworthy AI, meaning AI that is lawful, ethical, and robust from a technical and social perspective. In more detail, this requirement is related to the prevention of harm principle: in a trustworthy framework, it must be ensured that personal data will not be used unlawfully or to discriminate against people. Furthermore, the Guidelines underline the quality and integrity of data and the necessity of adequate protocols and procedures to be put in place for accessing the data. The Guidelines for Trustworthy AI, together with the European Strategy on AI in 2018 [39], the White Paper on AI in 2020 [40], etc., are among the documents laying the basis of the Proposal on Artificial Intelligence (Artificial Intelligence Act, the Proposal), which is the first-ever legal framework in this field. With the Proposal [38], the European Commission aims at balancing the emerging risks and negative consequences against the EU's technological leadership and the socio-economic opportunities that AI can provide, while reaffirming EU values, fundamental rights, and principles. The Proposal seems to present some analogies [41] with the GDPR in terms of, for example, governance structure, by creating a European AI Board coordinating the National Competent Authorities; sanctions, to be defined based on a severity scale to be assigned by Member States' competent bodies; and a risk assessment-based approach, similar to the Data Protection Impact Assessment. Most importantly, the Proposal might share with the GDPR the same risks of fragmentation and heterogeneity among the different Member States, particularly in terms of the decision review mechanism, which is managed at national level and might thus present ambiguities.

In their Joint Opinion [42], the EDPB and EDPS highlight the interaction with the existing data protection framework in terms of (i) the relationship with the current EU data protection legislative framework, (ii) sandboxes and further processing, (iii) transparency, (iv) special categories of data, and (v) compliance mechanisms. In particular, sandboxes are key to fostering AI innovation as they 'shall provide a controlled environment that facilitates the development, testing and validation of innovative AI systems for a limited time before their placement on the market or putting into service pursuant to a specific plan' (art. 53 of the Proposal). Their usage will be extremely useful for accelerating access to markets, especially by removing barriers for small and medium enterprises and start-ups. While recognising the importance of using sandboxes within legal boundaries, the EDPB and EDPS raise doubts as to whether the implementing acts of the sandboxes might limit the national adaptation of sandboxes according to local policies and needs. Furthermore, additional clarification is needed to ensure that enterprises' reuse of data is in line with the processing scope: this might not be in line with the GDPR accountability principle, especially when it is held by the data controller and not by a competent authority. The advancement of AI technologies provides opportunities as well as challenges in terms of data protection: the GDPR applies in all cases where artificial intelligence deals with personal data. However, the draft AI Proposal lacks clear references to the principles enshrined in the GDPR, and thus these aspects should be clarified in its final version.
3 Initiatives Towards the Use of Data Within Healthcare Research

3.1 Policymaker Initiatives

The EC has been implementing both legislative and non-legislative measures aimed at boosting the secondary use of data. In the fourth quarter of 2020, the Commission launched a public consultation for the European Health Data Space (EHDS) that will lead to a proposal for a regulation in the fourth quarter of 2021 [14, 15]; at the same time, the approval and immediate entry into force of the Data Governance Act is crucial as it will enable the establishment of the European Data Spaces by setting the basis for increased trust in data sharing, not to mention the compliance with the EU Recovery Plan. The aim of the EHDS is to 'promote safe exchange of patients' data, [ . . . ] and citizens' control over their health data. It will support research on treatments, medicines, medical devices and outcomes and encourage the use of health data for research, policy-making and regulatory purposes, with a trusted governance and respecting data protection rules'. The EHDS will be organised around four pillars targeting the cross-border data exchange for primary use, the secondary use of data for policy and research purposes, the free movement of digital services for telehealth and m-health, and the application of AI technologies in health and care. These will be supported by initiatives related to data governance (interoperability, citizens' rights), data quality (data FAIRification), the creation of infrastructures, and capacity building (training, best practices).

The need for exploiting data within healthcare research is addressed not only at European level, but also at national level. For example [16], in Finland the Ministry of Social Affairs and Health has set up a pilot project, Isaacus, aimed at defining the technical, operational, and organisational basis for a new data permit authority (later called Findata) and a new legislative framework for the secondary use of data. In addition, Finland has issued the 'Act on the secondary use of health and social data', which entered into force in 2019. The Act has defined a framework compliant with the GDPR for the secondary use of data (such as scientific research, statistics, development and innovation, teaching, knowledge management). Furthermore, the Act officially established Findata, which is the operational branch of the Act, collecting and managing the data coming from different data sources and supervising the use of data and knowledge management. Another recent example [17] comes from Germany, where, in October 2020, the German Parliament approved a bill that will provide, at the latest from January 2023, the possibility for patients to voluntarily donate their electronic health record data for specific scientific purposes, such as improving the quality of healthcare: citizens will be able to narrow the possibility to access their data and define the scope of usage upon signing a consent form.
3.2 Civil Society Initiatives

The following text aims at presenting possible solutions to the secondary use of healthcare data that pose civil society as the main actor: so far these initiatives have focused on the concept of data 'donation', as this is a concept widely defined in the literature, both as posthumous data donation and as data donation by living persons. The concept of data 'altruism' is a more novel concept introduced with the Data Governance Act that cannot yet be found in the literature, and it will be presented taking into consideration the EDPB and EDPS Joint Opinion.

Data Donation In the last few years, new projects addressing the healthcare data-sharing issue have been launched by civil society based on the concept of data donation. Connected to this concept, there are very sensitive issues from both a legal and an ethical standpoint. In what follows, the main concepts are briefly reviewed. In the literature [18], donation has been defined, from a legal viewpoint, as a situation where 'the owner of a thing transfers it to another person or entity without consideration of what he/she will receive in return'; in other words, donation carries the intrinsic notion of transferring a consumable good from one entity to another, such that the donor can no longer use it: the case of a kidney donation well represents this concept. In other words, donation implies a transfer of ownership: however, can data be 'owned'? Are personal data an alienable right under EU law [19]? As the protection of personal data is a fundamental right (article 8 of the Charter of Fundamental Rights of the European Union), personal data are inalienable and cannot be sold. If we consider the practical situation of living persons donating, for example, their personal health record to a researcher for a specific activity, can this be considered feasible? This can be done, for sure, but those persons will still be able to access their data, and even decide to 'donate' them to someone else as well. Notwithstanding the controversial definition of data donation and its implications, two different types of donation can be defined, presented next.

To start with, the first form of data donation presented is posthumous medical data donation (PMDD), which refers to the possibility of donating data of a deceased person. Similarly to anonymised data, data of deceased persons are outside the GDPR scope (recital 27), and the Member States are empowered to further regulate the processing of deceased persons' personal data. In terms of rights after death, a reference can be made to art. 7.1 of the Berne Convention for the Protection of Literary and Artistic Works, which protects economic rights for 50 years after the author's death, while art. 6bis states that moral rights can be maintained after the author's death [20]. Another example is organ donation, which shares with PMDD the same goal of supporting healthcare research for the sake of saving or ameliorating the life of one or more persons [21].⁴ Organ donation is supported and regulated both nationally and internationally: for example, the international legal framework includes the Oviedo Convention of 1997 [22] and the United Nations Universal Declaration of Bioethical Principles of 2005. Data protection is safeguarded in the context of human rights, such as the European Convention on Human Rights. European scholarship argues that data propertisation and commodification are not viable within the European Union [20], and, as a consequence, data cannot be transmitted to heirs posthumously. In other words, PMDD cannot be decided by heirs nor by the individual itself: 'If post-mortem privacy was recognised in law [ . . . ] the deceased would be able to decide as to what happens to their medical data post-mortem as well, and this would facilitate the practice of PMDD' [23]. At the moment, in other words, even though PMDD has been explored in the literature in recent years from both the ethical and legal points of view [23–25], it is still in a grey area of the European legislative framework.

⁴ It has to be noted that the two types of donation present, of course, remarkable differences as well, such as the lack of physical intrusion and lack of urgency for PMDD, and the fact that a living person can withdraw the consent to donation at any time. In addition, data processing might result in the discovery of information related to the deceased person's relatives that might have harmful consequences, such as discrimination for insurance coverage based on a hereditary disease.

The second type of data donation concerns, more generally, 'people [that] voluntarily contribute their own personal data that was generated for a different purpose, to a collective dataset' [26].
This essay presents two examples of citizen-led initiatives leveraging this concept. In Denmark, the Data for Good Foundation [27] was established following the principles of 'collaboration, ethics, insight and transparency' while ameliorating the citizens' quality of life and the possibility of being a part of the data-driven economy. The approach of the Foundation starts from the fragmentation of data in different silos: its aim is to empower the citizens by allowing them to take control of their own personal data through a digital space. On the one hand, the Foundation's online data store allows citizens to grant consent to share their data and, on the other hand, it creates an ecosystem of anonymised data useful for the industry, public administrations, and for generating new knowledge and value for the citizens themselves. A similar initiative was developed in Spain, called Salus Coop [28], which is a citizen data cooperative for health research. It aims at defining a citizen-driven model of cooperative governance and management of health data, facilitating data-driven research and innovation in healthcare for boosting social and community benefits. On their website [29], the 'Salus CG License' is available, which provides a framework to simplify the donation of citizens' health data for research purposes. The research projects should have five specific characteristics: (i) they should be related to biomedical research purposes for health or social studies, (ii) without commercial purposes, and (iii) their results should be available easily and free of charge; (iv) in addition, the data must be given pseudonymised, and (v) the data subject has the power to withdraw from the study or change access conditions.

However, there are a few elements of this approach that present some criticalities and may be in contrast with the GDPR implementation.⁵ First of all, it is not clear who is storing the citizens' data: if Salus Coop is the entity providing this service, it should be specified what the lawful basis for legitimisation is and for which purpose this is done. Moreover, art. 9.4 delegates to the Member States the possibility to expand or dwindle the processing of genetic data, biometric data, or data concerning health. In terms of GDPR roles, the data controller is never mentioned in the public documents available on the website: art. 5.2 GDPR defines the controller as accountable for respecting the principles relating to the processing of personal data and thus for the collection, storage, and processing of personal data. Literature [30, 31] argues that the sponsor of a clinical trial corresponds to the data processor, and this is coherent as well with article 56 of the Regulation on clinical trials on medicinal products for human use 536/2014 [32], which appoints the sponsor or the investigator as responsible for recording, processing, handling, and storing all clinical trial information. In the Salus Coop case, therefore, it could be said that the data processor is also the data controller as it (likely) is the entity responsible for the tasks of the data controller.⁶ Another aspect concerns the lawfulness of data processing: for health-related data processing explicit consent is needed (art. 9.2.a), meaning a clear affirmative action (opt-in) by the data subject (art. 4.11). The information sheet template provides the objective of the study, outlines the individual rights, meaning 'Access, Rectification, Cancellation/Suppression, Opposition, limitation of data processing and portability, as well as any other right recognized in the terms and conditions established by current legislation on Data Protection', and clearly explains how these rights might be exercised; however, the document lacks any reference to data minimisation, and thus to the requirement for data to be adequate, relevant, and limited with reference to the purpose set by the data controller. In addition, the document does not indicate for how long data will be stored, thereby possibly breaching the principle of storage limitation. The document does not inform about the possibility to conduct a Data Protection Impact Assessment (DPIA), which is 'particularly required' (art. 35 GDPR) for those personal data included in article 9 and when data processing constitutes a threat to the rights and freedoms of natural persons (e.g. when using new technologies).

⁵ On the Salus Coop website, two documents are available for download: a pdf with the Terms of Use defining the five necessary conditions of the projects and a text (Word) document which is a template for the consent (information sheet) that should be signed by the data subjects and filled in with the project details. The documents have been used as reference for the definition of its ambiguities with the GDPR.

⁶ Given the context, it seems unlikely that Salus Coop is the entity determining the purposes of the processing.

The examples presented above are in line with the 'citizen/patient-centric' approach of the GDPR [16]. The aim of the regulation, as a matter of fact, is to provide citizens with control over the usage, processing, and transfer of their personal data while maintaining a high standard of security. However, at the basis of the two examples stands the intrinsic notion of donation, which is related to the transfer of ownership, and thus presents ambiguities with the current practice, as persons can donate their data to multiple subjects. In this perspective, is it possible that this practice can be better defined as data 'sharing' instead of 'donating'? Researchers [33] have pointed out that the difference between the two practices lies in two different dimensions: exclusivity and motivation. In relation to the first concept, as mentioned above, the sharing of a good implies that the sharer can still use that commodity, while with donation this is not possible; on the other hand, as far as the second dimension is concerned, sharing implies a sort of exchange in return, while with a donation there might be a stronger motivation behind it, a symbolic gesture that can be compared to giving a gift. Even though a gift might still be part of a framework with trade and exchange, it usually goes beyond it: for example, when the situation of donation refers to health data, most people do not expect a direct return. Because of this selfless nature, domain experts [33] suggest the importance of providing a legal framework sustaining and enabling this practice. However [33], it can be argued that when a citizen donates his or her data, in 'exchange' he or she expects that this data is treated responsibly and for philanthropic ends.⁷ In addition, as mentioned below, the empowerment of people in the use of their data does not always imply the absence of return, as data might also be used within a data marketplace, in exchange for money. Next, the concept of data altruism, which may be a solution to the ethical and legal aspects presented here, is described.

⁷ Of course, the GDPR prescribes that, by means of the informed consent, citizens are made aware of the necessary information related to the purposes of the research and how data will be treated; this statement was just meant to better frame the concept of donation and its implications.

Data Altruism In the literature [34], the contribution of citizens to healthcare research, either through data sharing or by participating in a clinical trial, is viewed as an act of altruism. A contribution that is selfless and that will benefit other people, and not the participant directly, is considered altruistic: all these elements, together with the presence of a sacrifice of personal well-being, are connected with the need to thoroughly inform the participants of the purposes of the research and their legitimate rights. In the Data Governance Act, the definition of data altruism reflects this perspective as it is 'the consent by data subjects to process personal data pertaining to them, or permissions of other data holders to allow the use of their non-personal data without seeking a reward, for purposes of general interest, such as scientific research purposes or improving public services'. This 'altruistic' form refers to both personal and non-personal data as well as to individuals and companies. According to the Proposal, data altruism is part of a strategy to boost the development of activities which need the processing of (big) data, such as machine learning and data analytics, while ensuring a high level of trust at the same time. In this perspective, the EC sets the basis for 'Data Altruism Organisations' which will register and will have an EU-wide application. In addition, it introduces a European data altruism consent form which shall be defined by the European Data Innovation Board (EDIB),⁸ together with the European Data Protection Board, with possible sectoral adjustments: this form, particularly necessary for scientific research and statistical use of data, should ease the collection of consents and data portability.

In their Joint Opinion, the EDPB and EDPS express their concern in relation to how data altruism and the related concepts are developed in the Proposal. First of all, the GDPR already grants data subjects the possibility to share their data for scientific research: the DGA codifies this possibility in the form of data altruism but does not define what its 'added value' would be. Particular concern is expressed in relation to the consent form to be defined by the EDIB and EDPB, as it is not clear what its elements should be, that is, the same as the GDPR consent or others, and what its lawfulness conditions would be. In addition, the Act should stress that the consent should be easily withdrawn to the same extent it can be provided, and it should clearly define how data users respect withdrawal and personal data deletion in accordance with art. 17.1.b GDPR. In conclusion, data altruism should be applied without prejudice to the GDPR: this means that no matter the benefit for the general interest, and research purposes in particular, the fundamental right to privacy cannot be waived in any case. A more detailed definition of data altruism is of paramount importance for the protection of personal data in the EU, especially the purposes for which data are collected: this should be specified in a dedicated article, to avoid misinterpretation related to further processing, purposes of general interest, and the consent to areas of scientific research. For health research, the upcoming EDPB guidelines on the processing of health data for scientific research will create the basis to further clarify how such data should be considered and processed.

⁸ To be established within the Data Governance Act.
4 Discussion Without the claim to be exhaustive, the text above aims at providing an overview of the complex panorama for the secondary use of health data for research purposes. A first crucial element is related to the special nature of data concerning health: the EDPB guidelines on the definition of what could or could not be considered health data and scientific research might help in providing guidance to the stakeholders. Another element concerns the processing of personal data for scientific purposes: as scientific research is not a lawful condition for processing on its own, consent of the data subject might be deemed to be the preferable lawful basis considering the ethical aspects connected to the participation of humans, as mentioned for data altruism. In this case, however, research activities are constantly jeopardised by the possibility of consent withdrawal: in this context, it seems difficult to balance research activities without jeopardising the privacy and usage of personal data. Because of this scenario, Finland [8] has paved the way for the other two options mentioned above in compliance with article 9.4 that provides the Member States with the possibility to ‘maintain or introduce further conditions, including limitations, with regard to the processing of genetic data, biometric data or data concerning health’. On the contrary, the use of data for scientific health research in Italy is very complex: for example, the Italian Data Protection Authority (DPA) has issued a document [35] in response to the request of the Autonomous Province of Trento related to the use of data within activities for the management of chronic diseases that focuses on prevention and early diagnosis (i.e. medicina d’iniziativa), for patient stratification, and for care or statistical purposes, which presents multiple issues to be considered. First of all, the secondary uses of data, thus different from the original clinical purpose, lay on different principles of lawfulness (presupposti di liceità) and protection (tutela) of the data subjects that would infringe different principles of the GDPR: lawfulness, fairness, purpose limitation, and data minimisation (art. 5 GDPR). Secondly, automated processing might bring discriminatory effects (recital 71 GDPR) that data controllers should avoid by putting in place the appropriate organisational and technical measures; in addition, data subjects should be aware of the usage of automated processing of decision-making procedures, and it should be guaranteed that a decision on a data subject is not solely based on automated processing (principio di non esclusività). Furthermore, the DPA refers to the necessity for informed consent about the additional processing for care purposes to be signed by data subjects, and for a reference to the public interest motivation related to statistical analysis (as the request of the Autonomous Province of Trento
16
S. Testa
was related to the COVID-19 emergency) [35]. Within the Competence Center on digital health of the Autonomous Province of Trento, TrentinoSalute4.0 (TS4.0) [43], it is currently under evaluation the definition of an agreement with the citizens, ‘Patto col Cittadino’, where the data subjects will be able to provide access to a limited portion of data of their Electronic Health Record needed for specific research projects. It can be argued that anonymised data are a viable solution for the exploitation of data within the healthcare sector: however, this solution presents criticalities as well. Anonymisation is a form of processing of personal data which uses different techniques, such as randomisation or generalisation, to irreversibly prevent identification [36]; once a personal data is anonymised, it is no longer a personal data and, thus, falls out of the scope of GDPR (recital 26 GDPR). Anonymisation is often used for open data databases, such as Zenodo,9 an open repository part of the Open Access Infrastructure for Research in Europe (OpenAIRE). However, these practices present some criticalities: first of all, the anonymisation process implies the removal of so many pieces of information which may compromise the quality and possibility of further analysis [4]. Second of all, some data, such as genetic data, are intrinsically tied with the person they belong to and thus can never be completely anonymised [33]. Literature [44] defines the ‘curse of anonymisation’ as the impossibility to fully anonymise sensitive data while maintaining the complete usefulness of such data: the higher the data anonymity the lower the usability of the data. In general, the Spanish Data Protection Authority and EDPS [37] have clearly stated that, regardless of the type of personal data, the risk of reidentification cannot be completely removed and anonymisation is a process that could be reversed in time with the use of novel technologies. Another aspect to be considered is that the process of anonymisation results in relevant costs as the definition of true and/or adequate anonymisation is not straightforward [44]; in addition, anonymisation of Electronic Health Record data is hampered by the gap between healthcare professionals and privacy legal framework which requires the involvement of a multidisciplinary approach, thus including data scientists, clinical research, and IT and law experts. Therefore, it is debatable whether anonymisation could be considered the optimal solution from high-quality healthcare research or from a legal viewpoint. Another possible emerging solution is the use of synthetic data, whose creation is defined by the EDPS [45] as taking ‘an original data source (dataset) and create new, artificial data, with similar statistical properties from it’. The use of synthetic data might have a positive impact on the training of AI models as well as for transferring data outside the EU; nonetheless, synthetic data share some of the issues anonymous
9 “Where data that was originally sensitive personal data is being uploaded for open dissemination through Zenodo, the uploader shall ensure that such data is either anonymised to an appropriate degree or fully consent cleared.” Zenodo policies available at https://help.zenodo.org/#policies – last visit 16 November 2021.
data present, such as the risk of reidentification, as well as the possibility of recognising whether a synthesised data sample is part of a training dataset. The COVID-19 emergency has made the need for cross-border data exchange and data sharing in general more urgent than ever: because of this, the initiatives foreseen by the EC within the European Data Strategy might have a huge impact on healthcare research by providing the legal basis for exploiting citizens’ data stored in hospitals, laboratories, and personal health records (PHR). Different types of data processing are available for enabling the secondary use of personal data while limiting the risk of reidentification; however, such techniques still need to become more reliable.
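To make the EDPS definition of synthetic data more concrete, the following toy sketch fits simple per-column statistics to an original dataset and samples new, artificial records from them. The records and column names are invented for the example, and real generators are far more sophisticated; notably, this naive version models each column independently, so correlations between columns are lost, which hints at why synthetic data may still fall short of the reliability needed for research.

```python
# Toy illustration of synthetic data generation (not a production method):
# fit per-column normal distributions to an original dataset and sample new records.
import random
import statistics

original = [  # invented example records
    {"age": 54, "systolic_bp": 142},
    {"age": 61, "systolic_bp": 155},
    {"age": 47, "systolic_bp": 130},
    {"age": 58, "systolic_bp": 148},
]

def synthesise(records, n):
    """Sample n artificial records with similar per-column statistics."""
    columns = records[0].keys()
    fitted = {c: (statistics.mean(r[c] for r in records),
                  statistics.stdev(r[c] for r in records)) for c in columns}
    return [{c: round(random.gauss(mu, sigma)) for c, (mu, sigma) in fitted.items()}
            for _ in range(n)]

print(synthesise(original, 3))  # artificial records, not tied one-to-one to real patients
```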
5 Conclusions

Health scientific research is one of the priorities of the EC, especially after the COVID-19 emergency dramatically exposed its importance. As recognised by both European and national institutions, data analysis represents a vital resource for this activity. While the GDPR provides a legal framework that, on the surface, might appear to hamper scientific research, it actually guarantees citizens’ fundamental rights in relation to personal data processing, providing them with the tools for remaining in control of the usage of their data. Recently, different initiatives led mainly by civil society have emerged, but doubts about their compliance with GDPR principles may arise. The EC is defining a set of legislative documents to support the growth of a data-driven economy and to ease altruistic data sharing; however, the EDPB and EDPS urge compliance with the GDPR in order to ensure the fundamental rights of individuals.

Acknowledgements This work was partially funded by the Competence Center of Digital Health of the Autonomous Province of Trento ‘TrentinoSalute4.0’. The author thanks the anonymous reviewers for their constructive comments. A special thanks goes to O. Mayora for his valuable feedback on the definition and revision of this chapter.
References 1. Eurostat, Healthcare expenditure statistics – https://ec.europa.eu/eurostat/statistics-explained/ index.php?title=Healthcare_expenditure_statistics#Healthcare_expenditure 2. European Semester Thematic Factsheet on Health Systems https://ec.europa.eu/info/sites/info/ files/file_import/european-semester_thematic-factsheet_health-systems_en_0.pdf – last visit 27 April 2021. 3. Communication from The Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions, “A European strategy for data”, Brussels, 19.2.2020, COM (2020) 66 final - https://eur-lex.europa.eu/legal-content/EN/ TXT/HTML/?uri=CELEX:52020DC0066&from=EN – last visit 27 April 2021.
4. Abedjan Z. et al, (2019). Data science in healthcare: Benefits, challenges and opportunities. In Data Science for Healthcare. Springer: Cham, pp. 3–38 5. Kariotis, T., Ball, M., et al. (2020). Emerging health data platforms: From individual control to collective data governance. Data & Policy, 2, E13. doi:https://doi.org/10.1017/dap.2020.14 6. Directorate-General for Research and Innovation, Conquering Cancer: mission possible, September 2020 https://op.europa.eu/en/publication-detail/-/publication/b389aad3-fd56-11eab44f-01aa75ed71a1/ – last visit 27 April 2021. 7. https://digital-strategy.ec.europa.eu/en/news/member-states-meet-european-commissiondiscuss-protection-personal-data-health-sector – last visit 13 May 2021 8. Ducato R., Data Protection, Scientific Research, and the Role of Information (January 10, 2020). Computer Law and Security Review, doi:https://doi.org/10.1016/j.clsr.2020.105412. 9. Consolidated version of the Treaty on the Functioning of the European Union (TFEU), art. 179 (1) available at https://eur-lex.europa.eu/legal-content/HU/TXT/?uri=OJ:C:2016:202:TOC 10. Article 29 Working Party, Guidelines on consent under Regulation 2016/679. Adopted on 28 November 2017, as last Revised and Adopted on 10 April 2018. 11. Article 29 Working Party, ANNEX - health data in apps and devices available at https://ec.europa.eu/justice/article-29/documentation/other-document/files/2015/ 20150205_letter_art29wp_ec_health_data_after_plenary_annex_en.pdf 12. Proposal for a REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL on European data governance (Data Governance Act) https://eur-lex.europa.eu/legalcontent/EN/TXT/?uri=CELEX%3A52020PC0767 13. EDPB - EDPS, Joint Opinion 03/2021 on the Proposal for a regulation of the European Parliament and of the Council on European data governance (Data Governance Act), published 11 March 2021, available at https://edpb.europa.eu/our-work-tools/our-documents/edpbedpsjoint-opinion/edpb-edps-joint-opinion-032021-proposal_en 14. Digital health data and services – the European health data space https://ec.europa.eu/info/law/ better-regulation/have-your-say/initiatives/12663-A-European-Health-Data-Space- - last visit 30 April 2021. 15. Communication from The Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions (Annexes to the) Commission Work Programme 2021 19.10.2020 https://eur-lex.europa.eu/ resource.html?uri=cellar%3A91ce5c0f-12b6-11eb-9a54-01aa75ed71a1.0001.02/ DOC_2&format=PDF 16. Piai S., Allocato A., Besana G., Story 6 – The Secondary Use of Health Data and Data-driven Innovation in the European Healthcare Industry, IDC, 2020. This report is part of the Call for tenders of the European Commission DG Connect: Update of the European data market study — SMART 2016/0063. 17. Gesetz zum Schutz elektronischer Patientendaten in der Telematikinfrastruktur available at https://www.bundesgesundheitsministerium.de/fileadmin/Dateien/3_Downloads/ Gesetze_und_Verordnungen/GuV/P/PDSG_bgbl.pdf; Koyuncu A., Mei V., Germany Prepares New Law for Patient Data Protection and Increased Digitalisation in Healthcare and for “Data Donations” for Research Purposes. Available at https://www.covingtondigitalhealth.com/ 2020/08/germany-prepares-new-law-for-patient-data-protection-and-increased-digitalisationin-healthcare-and-for-data-donations-for-research-purposes/ - last visit 7 May 2021. 18. Prainsack B., Data Donation: How to Resist the iLeviathan, in J. Krutzinna, L. 
Floridi (eds.), The Ethics of Medical Data Donation (2019), Philosophical Studies Series 137, doi:https:// doi.org/10.1007/978-3-030-04363-6_6 19. Piéchaud Boura M., Personal Data Donation: a Legal Oxymoron Beneficial to Science? Available at https://www.timelex.eu/en/blog/personal-data-donation-legal-oxymoronbeneficial-science - last visit 12 May 2021. 20. Malgieri, G. (2018). R.I.P.: Rest in Privacy or Rest in (Quasi-)Property? Personal Data Protection of Deceased Data Subjects between Theoretical Scenarios and National Solutions in R. Leenes, R. Van Brakel, S. Gutwirth & P. De Hert (Eds.). Data Protection and Privacy: The Internet of Bodies. (pp. 300 – 320) Hart Publishing.
21. Harbinja E. Posthumous Medical Data Donation: The Case for a Legal Framework in J. Krutzinna, L. Floridi (eds.), The Ethics of Medical Data Donation (2019), Philosophical Studies Series 137, doi:https://doi.org/10.1007/978-3-030-04363-6_6 22. Convention for the protection of Human Rights and Dignity of the Human Being with regard to the Application of Biology and Medicine: Convention on Human Rights and Biomedicine, Oviedo, ETS No.164. 23. J. Krutzinna, L. Floridi (eds.), The Ethics of Medical Data Donation (2019), Philosophical Studies Series 137, doi:https://doi.org/10.1007/978-3-030-04363-6 24. Harbinja E., Pearce H., Your data will never die, but you will: A comparative analysis of US and UK post-mortem data donation frameworks, Elsevier 2020, doi:https://doi.org/10.1016/ j.clsr.2020.105403 25. Shaw D.M., Gross J.V., Erren T.C., Data donation after death, The Lancet, Volume 386, Issue 9991, P 340, July 25, 2015, doi:https://doi.org/10.1016/S0140-6736(15)61410-6 26. Bietz, M., Patrick, K., & Bloss, C. (2019). Data Donation as a Model for Citizen Science Health Research. Citizen Science: Theory and Practice, 4(1), 6. DOI: https://doi.org/10.5334/cstp.178 cited in Piai S., Allocato A., Besana G. (2020) 27. Data for Good Foundation Homepage https://dataforgoodfoundation.com/en - last visit 8 May 2021. 28. Salus Coop Homepage https://www.saluscoop.org/ - last visit 8 May 2021. 29. Salus Coop licence page https://www.saluscoop.org/licencia - last visit 8 May 2021. 30. Van Quathem K., The GDPR and Clinical Trials – Are study sites controllers or processors?, Pharm. Ind. 81, Nr. 6, 809–813 Editio Cantor Verlag, Aulendorf, 2019. 31. Stocks Allen F., Crawford G.E., Clinical trials under the GDPR: What should sponsors consider?, Privacy laws & business United Kingdom Report, January 2019. 32. REGULATION (EU) No 536/2014 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 16 April 2014 on clinical trials on medicinal products for human use, and repealing Directive 2001/20/EC, Available at https://ec.europa.eu/health/sites/health/files/files/ eudralex/vol-1/reg_2014_536/reg_2014_536_en.pdf 33. Hummel P., Braun M., Dabrock P., Data Donations as Exercises of sovereignty in J. Krutzinna, L. Floridi (eds.), The Ethics of Medical Data Donation (2019), Philosophical Studies Series 137, https://doi.org/10.1007/978-3-030-04363-6_6 34. Raj M., De Vries R., Nong P., Kardia S. L. R., Platt J. E. Do people have an ethical obligation to share their health information? Comparing narratives of altruism and health information sharing in a nationally representative sample (2020) PLoS One. 2020; 15(12): e0244767. Published online 2020 Dec 31. https://doi.org/10.1371/journal.pone.0244767 35. Garante della Privacy, Parere alla Provincia autonoma di Trento sul disegno di legge provinciale concernente “Ulteriori misure di sostegno per le famiglie, i lavoratori e i settori economici connesse all’emergenza epidemiologica da COVID-19 e conseguente variazione al bilancio di previsione della Provincia autonoma di Trento per gli esercizi finanziari 2020-2022” 8 maggio 2020 available at https://www.garanteprivacy.it/web/guest/home/docweb/-/docwebdisplay/docweb/9344635 - last visit 29 April 2021. 36. Article 29 Data Protection Working Party, Opinion 05/2014 on Anonymisation Techniques, available at https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/ files/2014/wp216_en.pdf 37. 
AEPD-EDPS, Joint paper on 10 misunderstandings related to anonymisation, published on 27 April 2021, available at https://edps.europa.eu/data-protection/our-work/publications/papers/aepd-edps-joint-paper-10-misunderstandings-related_en 38. Proposal for a REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL LAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE (ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION LEGISLATIVE ACTS, Available at https://eur-lex.europa.eu/resource.html?uri=cellar:e0649735-a37211eb-9585-01aa75ed71a1.0001.02/DOC_1&format=PDF 39. ETHICS GUIDELINES FOR TRUSTWORTHY AI, Independent high-level expert group on artificial intelligence set up by the European Commission, Available at https://www.aepd.es/sites/default/files/2019-12/ai-ethics-guidelines.pdf
40. WHITE PAPER on Artificial Intelligence - A European approach to excellence and trust, Available at https://ec.europa.eu/info/sites/default/files/commission-white-paperartificial-intelligence-feb2020_en.pdf 41. A PROPOSAL FOR (AI ) CHANGE? A succinct overview of the Proposal for Regulation laying down harmonised rules on Artificial Intelligence, Rubén Cano, Available at https://lawreview.luiss.it/files/2016/09/A-proposal-for-AI-Change-A-succint-overviewof-the-Proposal-for-Regulation-laying-down-harmonised-rules-on-Artificial-Intelligence.pdf last visit 10 February 2022. 42. EDPB - EDPS, Joint Opinion 05/2021 on the proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act), published 18 June 2021 https://edps.europa.eu/system/files/2021-06/202106-18-edpb-edps_joint_opinion_ai_regulation_en.pdf 43. Mayora Ibarra O., Forti S., Conforti D., Tessari P., Testa S., Trentino salute 4.0 - the creation of a competence center on digital health integrating policy, healthcare trust and research in trentino territory. International Journal of Integrated Care. 2019;19(4):56. DOI: https://doi.org/ 10.5334/ijic.s3056 44. Zuo Z., Watson M., Budgen D., Hall R, Kennelly C., Al Moubayed N., Data Anonymization for Pervasive Health Care: Systematic Literature Mapping Study JMIR Med Inform 2021;9(10):e29871 doi: https://doi.org/10.2196/29871 45. EDPS web page on Synthetic Data, available at https://edps.europa.eu/press-publications/ publications/techsonar/synthetic-data_en - last visit 10 February 2022.
Chapter 2
Multi-Party Computation in the GDPR Lukas Helminger and Christian Rechberger
Abstract The EU GDPR has two main goals: protecting individuals from personal data abuse and simplifying the free movement of personal data. Privacy-enhancing technologies promise to fulfill both goals simultaneously. A particularly effective and versatile technology solution is multi-party computation (MPC). It makes it possible to protect data during a computation involving multiple parties. This chapter aims for a better understanding of the role of MPC in the GDPR. Although MPC is relatively mature, little research has been dedicated to its GDPR compliance. First, we try to give an understanding of MPC for legal scholars and policymakers. Then, we examine the GDPR provisions relevant to MPC with a technical audience in mind. Finally, we devise a test that can assess the impact of a given MPC solution with regard to the GDPR. The test consists of several questions, which a controller can answer without the help of a technical or legal expert. Going through the questions will classify the MPC solution as: (1) a means of avoiding the GDPR, (2) data protection by design, or (3) having no legal benefits. Two concrete case studies should provide a blueprint on how to apply the test. We hope that this chapter also contributes to an interdisciplinary discussion of MPC certification and standardization.

Keywords Multi-party Computation · GDPR · Compliance · Privacy-enhancing Technologies · Privacy by design
L. Helminger () Know-Center GmbH, Graz, Austria University of Technology Graz, Graz, Austria e-mail: [email protected] C. Rechberger Graz University of Technology, Graz, Austria e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 S. Schiffner et al. (eds.), Privacy Symposium 2022, https://doi.org/10.1007/978-3-031-09901-4_2
1 Introduction

The EU General Data Protection Regulation's (GDPR) [19] two primary objectives seem to interfere with each other. On one side, the GDPR is best known as the world's strictest privacy and security law, so most people believe it is all about protecting individuals from personal data abuse. On the other, less well-known side, the GDPR aims to simplify the free movement of personal data. In that sense, the regulation is also very business-friendly. Obviously, there are situations where those two aims create tension. Privacy-enhancing technologies (PETs) promise to manage the balancing act between personal data protection and an open data economy, at least to some extent. The idea of using technology for data protection while simultaneously not restricting data-driven business is rooted in Article 25 GDPR (data protection by design and by default). For a data controller, the range of suitable PETs to choose from is extremely broad. It spans from access controls and VPNs to cryptographic concepts such as multi-party computation (MPC) [36], differential privacy [15], and zero-knowledge proofs [22]. Especially technologies that reduce the need to trust data controllers have received a lot of funding lately. As a result, one could see a high level of research activity in PETs in the last couple of years. Fortunately, this effort led to significant performance gains, up to a point where the technologies scale to enterprise-size data sets. Also, the technical readiness level improved, so that more and more PETs can be called state of the art. Consequently, the research results were exploited, and today many companies offer advanced PETs. Nevertheless, the adoption of PETs leaves a lot to be desired. In particular, the most effective PETs are being neglected to a large extent. Instead, businesses and public authorities choose organizational measures and weak PETs. A major reason for this is the legal uncertainty surrounding PETs. Often the impact of a specific PET regarding GDPR compliance is hard to estimate. It is only understandable that businesses want to know the legal consequences of using a PET before deploying it. The purpose of this chapter is to assess the GDPR compliance of multi-party computation. MPC is a highly versatile PET. Its applications range from secure distributed genome analysis [26] and collaborative fraud detection [31] to privacy-preserving machine learning [27]. The ability to protect data during a joint computation involving several parties makes it an excellent fit for the GDPR. It drastically reduces the need to trust data controllers or processors. In addition, MPC facilitates privacy-friendly data sharing across different companies. Lastly, as a subfield of cryptography, MPC comes with mathematical guarantees. Its security level can be compared to long-established encryption standards.

Paper Organization In Sect. 2, we provide an introduction to MPC. Due to the lack of space, we omit all technical details that are irrelevant from a legal perspective. We then focus on the GDPR articles and their interpretations that are of particular concern to MPC. More concretely, we try to understand the definition of personal data (Sect. 3) and the meaning of data protection by design (Sect. 4). In
Sect. 5, we present a test for assessing the legal implications of the use of MPC. In addition, we apply our test to concrete use cases in Sect. 6.
2 Multi-Party Computation

This section aims to provide an overview of MPC. We focus on the aspects relevant to the GDPR and try to refrain from being unnecessarily technical. For a more technically rigorous treatment of MPC, the reader is referred to these excellent textbooks [10, 20, 32]. Before starting with MPC, we want to recall what is generally understood by the term computation (alternatively also algorithm or analysis): a computation is a procedure transforming some given input data into an output.
2.1 Introduction

MPC is a subfield of cryptography dating back to the 1980s. It allows two or more parties to input data and receive output in a privacy-preserving way. In particular, MPC is able to protect the input of each party during a mutual computation. The data protection guarantees offered by MPC can be best understood by the following thought experiment. Imagine an ideal world where there exists a fully trusted third party recognized by everyone. Then, whenever two or more distrusting parties want to analyze their combined data, they send their data to this trusted third party. It then performs the requested analysis on the pooled data and returns the result (output) to the parties. Since the trusted third party cannot be corrupted, nothing except the output of the computation gets revealed to the parties. MPC replaces the need for such a hypothetical trusted third party by means of cryptography. In other words, advanced encryption-like techniques provide the same data protection in the real world as the trusted third party in the ideal world. To sum up, MPC protects input data during computation but not the computation's output.
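The thought experiment can be made concrete with a minimal sketch: the ideal world is a single function that pools the parties' private inputs and returns only the agreed output. The party names and figures below are invented for illustration; MPC aims to provide the same guarantee without any such trusted party actually existing.

```python
# Ideal-world thought experiment: a hypothetical trusted third party receives
# every party's private input, runs the agreed analysis, and reveals only its result.
def trusted_third_party(private_inputs: dict, analysis):
    """Pool the inputs, run the agreed analysis, and return only the output."""
    return analysis(private_inputs)

# Example: two parties learn only the average over their combined (private) figures.
result = trusted_third_party(
    {"party_A": [10, 12, 9], "party_B": [14, 11]},
    analysis=lambda data: sum(sum(v) for v in data.values())
                          / sum(len(v) for v in data.values()),
)
print(result)  # 11.2 -- nothing about the individual inputs is revealed to the other party
```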
2.2 Private Set Intersection

The introduction of a specific MPC protocol1 called private set intersection (PSI) [13] will offer an intuitive view of the concept of MPC. PSI allows two parties to jointly compute the intersection of their data sets. Thereby, neither party learns information from the protocol except for the elements in the intersection. Consider two hypothetical companies that would like to find out their shared customers.
1 If we write MPC protocol, we mean a specific MPC (program would be a synonym for protocol).
Fig. 2.1 Private set intersection: Company A inputs John, Joe, Max, Ana, David, Luis; Company B inputs Susan, Rosa, Max, Gina, Ana, Matt. The multi-party computation returns only the shared customers, Max and Ana, to both companies.
However, both are reluctant to share their list of customers. PSI enables them to perform a seemingly counterintuitive computation, where both end up knowing the names of their shared customers but nothing more (see Fig. 2.1). Instead of sending both data sets to a trusted third party, a PSI protocol will encrypt both data sets. Then, the whole computation (calculating the intersection) is performed on encrypted data shares. Therefore, the input of every party is protected and cannot be retrieved by the other party. This line of argumentation applies to every MPC protocol and is essential for the classification of MPC in the GDPR.
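For readers who prefer code, the following toy sketch illustrates one classic way to realize PSI, based on commutative blinding of hashed items (a Diffie-Hellman-style construction). It is purely illustrative: the modulus is far too small for real use, the hashing is simplified, and, unlike Fig. 2.1, only one party computes the intersection and would then have to share it; it conveys the idea rather than serving as an implementation.

```python
# Toy PSI sketch based on commutative blinding: H(x)^(a*b) is the same no matter
# which party applies its secret exponent first, so equal items can be matched
# without ever exchanging them in the clear. Illustrative only, not secure as-is.
import hashlib
import secrets

P = (1 << 127) - 1  # a small Mersenne prime; real protocols use far larger groups

def hash_to_group(item: str) -> int:
    """Map an item to a number in [2, P-2] (simplified stand-in for hashing to a group)."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 3) + 2

class PSIParty:
    def __init__(self, items):
        self.items = list(items)
        self.key = secrets.randbelow(P - 3) + 2   # private blinding exponent

    def blind_own(self):
        """First pass: blind own items as H(x)^key."""
        return [pow(hash_to_group(x), self.key, P) for x in self.items]

    def blind_peer(self, blinded):
        """Second pass: apply own key to the peer's already blinded items."""
        return [pow(v, self.key, P) for v in blinded]

def intersect(a_items, b_items):
    a, b = PSIParty(a_items), PSIParty(b_items)
    a_blinded, b_blinded = a.blind_own(), b.blind_own()   # exchanged between the parties
    a_double = b.blind_peer(a_blinded)                    # H(x)^(ka*kb), returned to A
    b_double = a.blind_peer(b_blinded)                    # H(y)^(kb*ka), computed by A
    matches = set(b_double)
    return [x for x, v in zip(a.items, a_double) if v in matches]

print(intersect(["John", "Joe", "Max", "Ana", "David", "Luis"],
                ["Susan", "Rosa", "Max", "Gina", "Ana", "Matt"]))  # -> ['Max', 'Ana']
```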
2.3 Security

There are three dimensions to MPC security. The academic community considers MPC protocols to be secure if they offer at least 128-bit computational security (which guards against brute-force attacks). This is equivalent to the security of AES [11], the most popular encryption scheme, often used as a benchmark for security levels. The second dimension is the so-called security model of the MPC protocol. There are two major ones: the semi-honest and the malicious security model. In the semi-honest security model, it is assumed that no party deviates from the protocol. More concretely, security is guaranteed as long as every party follows the protocol. In contrast, the malicious security model protects the input data even if some party deviates from the protocol. Orthogonal to the security model is the trust assumption. Any MPC protocol implicitly makes a trust assumption. Protocols differ in how many colluders can be tolerated before security is broken.
3 GDPR: Personal Data

The GDPR only applies if personal data is processed, whereas non-personal data falls outside its scope of application. Thus the definition of personal data is of
utmost practical relevance to anyone processing data. Article 4(1) GDPR defines personal data as follows: ‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;
We can already see that this definition is very broad. There is one passage in the GDPR's preamble, a so-called recital, that aims to clarify the definition of personal data. Although such recitals are not legally binding, they provide information about how the articles should be read and are often considered in court. Recital 26 GDPR tries to offer a legal test to differentiate between personal and non-personal data: To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
There are two widely recognized interpretations of Article 4(1) and Recital 26 GDPR. Because their differences impact how MPC is seen in the GDPR, we summarize the main arguments.
3.1 Absolute Approach

The absolute approach argues that as long as a data object is personal data to someone, it is personal data to everyone. In other words, personal data is independent of the perspective. This even holds if the data necessary to identify a data subject is spread between different parties. Note that we are only talking about the absolute approach in terms of perspective [16]. In the literature, the absolute approach is sometimes mixed with the question of whether the GDPR favors a risk-based approach. Although there are links between the discussion of the risk and the perspective, we want to keep them separate for clarity (see Sect. 3.3). The main argument in favor of the absolute approach is the wording "by the controller or any other person" in Recital 26 GDPR. The most authoritative document emphasizing the absolute approach is the Article 29 Working Party's (WP29) opinion [2] (WP29 is now the European Data Protection Board). It states that rendering personal data anonymous should be as permanent as erasure and thereby achieve irreversible de-identification. To further clarify the practical implications, they provide the following cautionary example. A data controller collects data on individual travel movements. Then the data controller removes direct identifiers and subsequently offers this data set to third parties. WP29 concludes
that even if nobody except the data controller could identify a data subject in the data set, the data would be personal data to everyone. National authorities' rulings further support the absolute approach. As pointed out by Bergauer et al. [3], the Austrian Data Protection Authority considered that data can still qualify as personal data despite the fact that the data controller itself cannot identify a data subject.2 Similarly, the French Conseil d'Etat—the highest national administrative court—emphasizes that there is no difference between whether the data subject can be identified by the data controller or by a third person.3
3.2 Relative Approach

In contrast, the relative approach argues that personal data is a relative concept. More specifically, it is sufficient to look only at the data controller's perspective to determine whether information constitutes personal data. As a consequence, the same data item can be anonymous for one party while being personal data for another party. Advocates of the relative approach stress that the emphasis should lie on the term "means reasonably likely to be used." According to them, the logic of the case Breyer v Germany4 favors the relative approach. The judgment of the European Court of Justice (ECJ) in Breyer v Germany is the leading case on the interpretation of Recital 26 GDPR. Breyer's dynamic IP address was collected by the German government when he visited a public authority website. The government could not identify him without additional information held by an Internet Service Provider (ISP). The ECJ concluded that the IP address qualified as personal data because of the government's power to obtain the additional information from the ISP in the event of a cyberattack. Mourby et al. [28] draw the conclusion that, in the absence of this legal channel, the IP address would not have been personal data for the government, thus confirming that personal data is a relative concept.
3.3 Risk-Based Approach

There is a broad consensus that Recital 26 GDPR formulates a risk-based approach. More concretely, if identification is not reasonably likely, data can be considered non-personal. Recital 26 GDPR states factors that shall be taken into account in such a risk assessment:
2 DSB-D122.970/0004-DSB/2019. 3 ECLI:FR:CECHR:2017:393714.20170208. 4 Case C-582/14 Patrick Breyer [2016] EU:C:2016:779.
To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
If the GDPR did not favor a risk-based approach—i.e., if it set a zero-risk threshold—then all data would count as personal data. There is always at least a theoretical possibility that data can be linked to a natural person. For instance, even sensor data could give information about the person who installed the sensor. Important authoritative documents clearly reject a zero-risk threshold. WP29 points out that "a mere hypothetical possibility to single out the individual is not enough to consider the person as identifiable" [2]. Moreover, the Irish Data Protection Authority argues that: "[I]f it can be shown that it is unlikely that a data subject will be identified given the circumstances of the individual case and the state of technology, the data can be considered anonymous" [8]. We stress here that, in our opinion, the risk-based approach is orthogonal to the discussion of the relative versus the absolute approach. The risk-based approach can equally be applied to the absolute and the relative approach; they differ only in which perspective the data controller has to take into account in the risk assessment. Advocates of the relative approach nevertheless often use the risk-based approach as an argument in their direction [21], reasoning that a data controller cannot sensibly calculate the risk of identification from the perspective of every party in the world.
3.4 Conclusion

We want to summarize the implications of the different approaches presented for this work. First, any analysis of PETs in the context of the GDPR only makes sense if there is no zero-risk threshold. No technology can offer a 100% guarantee, so we have to assume some risk tolerance in our analysis. Because of the uncertainty regarding the absolute versus the relative approach, our analysis covers both. The use cases in Sect. 6 should highlight the differences between the approaches with respect to MPC.
4 GDPR: Data Protection by Design

Cavoukian, then Ontario Privacy Commissioner, coined the term "Privacy by Design" (PbD) in 2009 [5]. PbD aims to integrate privacy objectives into the entire development of personal data processing technology, in contrast to the prevalent design process, where privacy concerns are discussed at a late stage of development, if at all. In addition, PbD is about organizational procedures that enhance privacy during personal data processing. PbD is an interdisciplinary concept combining law and computer science. In both fields, the origins of PbD go back long before 2000 [6, 9].
4.1 Article 25

PbD as a policy discourse culminated in Article 25 GDPR, bearing the title "data protection by design and by default." Article 25 is a general obligation to integrate core data protection principles into the design of data processing architectures. It is seen as one of the GDPR's "most innovative and ambitious norms" [4]. Article 25(1)—the data protection by design (DPbD) part—reads as follows: Taking into account the state of the art, the cost of implementation and the nature, scope, context and purposes of processing as well as the risks of varying likelihood and severity for rights and freedoms of natural persons posed by the processing, the controller shall, both at the time of the determination of the means for processing and at the time of the processing itself, implement appropriate technical and organisational measures, such as pseudonymisation, which are designed to implement data protection principles, such as data minimisation, in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of this Regulation and protect the rights of data subjects.
The range of Article 25 is broad, its language is vague, and it offers little guidance. Thus it is not straightforward how MPC should be seen in the context of Article 25. To answer this question, we look at it from the privacy engineering side (Sect. 4.2) and through the latest European Data Protection Board (EDPB) guidelines (Sect. 4.3).
4.2 Privacy Engineering

The task of privacy engineering is to translate the objectives of Article 25 into concrete design strategies. In their groundbreaking work, Spiekermann and Cranor [33] came up with a framework for designing privacy-friendly systems. Their framework distinguishes between a privacy-by-policy and a privacy-by-architecture approach. The privacy-by-policy approach concentrates on the realization of proper notice, choice, and purpose limitation principles. In contrast, the privacy-by-architecture approach focuses on the implementation of data minimization by means of reducing identifiability and linkability. Hoepman [25] built upon this framework and derived eight privacy design strategies, four for each approach. These strategies were later adopted by the European Union Agency for Cybersecurity's report "Privacy and Data Protection by Design—from policy to engineering" [12]. For our purpose, the four data-oriented strategies (privacy-by-architecture) are of particular interest: minimize, hide, separate, and aggregate. MPC can contribute positively to all these strategies, but most naturally to data minimization. Gürses, Troncoso, and Diaz, in their seminal work "Engineering Privacy by Design" [23] and a follow-up article [24], give a blueprint on how to apply data minimization strategies following four activities. The first activity when designing privacy-friendly systems is to divide the system into a user and a service domain. This distinction is essential because, as pointed out in their second article, "data
minimization" is an ambiguous term. We often do not aim for data minimization in the information-theoretic sense. Instead, we want to minimize the amount of personal data that a user has to disclose in order to use a service. In other words, the overall goal is to reduce the need for trust. The next activity is to identify the data necessary for achieving the purpose of the system. Afterward, one has to map the data to the user and service domains. The last activity's goal is to remove as much data as possible from the service domain by using privacy-enhancing technologies.
4.2.1 Privacy-Enhancing Technologies
The European Commission [17] defines PETs as technology that "can help to design information and communication systems and services in a way that minimises the collection and use of personal data and facilitate compliance with data protection rules." Similar to privacy design strategies, the diverse set of PETs can roughly be grouped into two categories [14]. Soft PETs aim to enforce data subject rights (e.g., transparency, erasure, information) that match privacy-by-policy objectives. Hard PETs, in contrast, try to minimize personal data in the above sense. From a threat model perspective, we could say that hard PETs include controllers and processors in the threat model, whereas soft PETs do not. Since MPC places limited trust in controllers and processors, it is a prime example of a hard PET. Rubinstein and Good [30] argue that Article 25 obligates controllers to adopt not only soft PETs but also hard PETs, assuming they are both available and suitable for the task at hand. Their argumentation can be summarized as follows. Article 25 requires that controllers shall implement "appropriate technical and organizational measures" that "implement data protection principles, such as data minimization, in an effective manner." As pointed out by the EDPB, "effectiveness is at the heart of the concept data protection by design" [18]. Hard PETs are more effective since they offer strong technical guarantees, whereas soft PETs can only offer vague policy commitments.
4.3 EDPB Guidelines

We now summarize the most relevant points of the recent EDPB guidelines on "Data Protection by Design and by Default" [18] concerning MPC. The most compelling argument against MPC is its computational overhead over plain5 solutions and, consequently, its higher monetary costs. While Article 25 allows the controller to take into account "the cost of implementation," the EDPB clarifies that costs can only guide the decision on how to implement DPbD, not whether controllers and processors should implement DPbD at all. Consequently, the cost of MPC has to
5 Plain solution refers to a solution without MPC.
be compared to that of alternatives providing the same effectiveness of data minimization, rather than to the plain implementation. Another often neglected fact is that "DPbD applies at existing systems that are processing personal data," and this even includes "systems pre-existing before the GDPR entered into force" [18]. Lastly, if GDPR breaches occur, DPbD has an impact on the level of monetary sanctions.
4.4 Conclusion

We have seen strong evidence that MPC is a perfect match for satisfying the obligations resulting from Article 25(1)—DPbD. First, MPC is an effective PET for the purpose of data minimization and can be considered state of the art. Moreover, the EDPB guidelines provide very compelling arguments for MPC adoption. This is true especially for public administrations, as the EDPB re-emphasizes Recital 78, according to which public administrations should lead by example.
5 MPC in the GDPR

This section aims to devise a test to determine the impact of an MPC protocol with regard to the GDPR.
5.1 Related Work

Most closely related to this work is an article by Spindler et al. [34], who analyzed the role of personal data and encryption in the GDPR. A subsection is devoted to MPC. They conclude that the GDPR is not applicable to MPC under the relative approach. Our analysis comes to the same conclusion for a specific class of use cases. However, there are many use cases where we believe that the GDPR does apply to MPC. This divergence in opinions may stem from Spindler et al.'s failure to acknowledge that MPC only protects data during the computation but not the computation's output. We show that the output of an MPC protocol can still be personal data, even under the relative approach. Our paper shares a similar objective with the work of Nissim et al. [29]. They manage to bridge the gap between legal privacy requirements and a mathematical privacy model. In particular, they show that a specific PET—differential privacy [35]—satisfies the privacy protection set forth by the US regulation Family Educational Rights and Privacy Act 1974 (FERPA) [1]. A follow-up work [7] concluded that differential privacy most likely satisfies the GDPR's notion of protection against "singling out". Nevertheless, we do not expect a similar result for the GDPR as for FERPA. In our context, both MPC as a PET and the GDPR as legislation are far broader than differential privacy and FERPA. To sum up, a use-case-independent result seems rather unrealistic.
5.2 Test

In order for the test to be concrete and concise, a few assumptions have to be made. The test is designed from the point of view of a (potential) controller or processor, i.e., should the MPC protocol involve multiple controllers or processors, each of them has to go through the test. Further, the test assumes that the MPC protocol is secure; what exactly is meant by secure is discussed in Sect. 5.3. The test is depicted in Fig. 2.2.
5.2.1 Absolute vs. Relative
The first question is which approach to follow. In the absolute approach, the GDPR applies as soon as one party inputs personal data. It does not matter whether the personal data is in secret-shared form or protected by any other technique used to realize MPC protocols. Because the data is personal data for at least one party, it automatically is for all other parties under the absolute approach.
Fig. 2.2 Assessment scheme for MPC in the GDPR: starting from the choice between the absolute and the relative approach, questions about personal data in the inputs and outputs of the computation lead to one of three outcomes: GDPR avoidance, data protection by design (DPbD), or no legal benefits.
If no personal data is involved, the computation is out of the GDPR's scope regardless of the use of MPC. In contrast, the situation is more subtle in the relative approach. Here, the specifics matter: who provides which input and, even more importantly, who receives what kind of output.
5.2.2 Input Data
In the relative approach, the first question a controller or processor should ask herself concerns the nature of the input data. It makes a difference whether the input data provided constitutes personal data. Notwithstanding that MPC will protect the input data during the computation, one must be careful about the output: it could result in a transfer of personal data.
5.2.3 Output Data
The most crucial part of the assessment is the set of questions regarding the output. Recall that the output of every meaningful computation, including MPC, discloses some information. There are up to three questions a controller or processor has to answer. The first two are only relevant in the relative approach. If the controller or processor provides personal data, it matters greatly whether any other party receives personal data via the output. In such a case, personal data is transferred from the controller or processor to a third party. Thus the GDPR applies to the entity performing the test as well as to the recipient of the personal data. The relationship under the GDPR (e.g., joint controllers) between the involved parties can then be assessed independently of this test. The second question determines whether the controller receives personal data from a third party via the output. It must only be considered if the controller or processor does not input personal data or nobody else receives personal data. Arriving at this question and answering it with no leads to the non-applicability of the GDPR for this MPC protocol. The GDPR can be avoided because no personal data is transferred to or received from another party, and MPC protects potential personal data input during the computation.
5.2.4 Data Minimization
The remaining question tries to answer whether the particular MPC protocol implements DPbD. If, until now, the applicability of the GDPR could not be ruled out, this question is relevant in both the relative and the absolute approach. The controller or processor has to check whether the personal data output to the parties (including herself) is minimized through the use of MPC. Only in the affirmative case is MPC a suitable candidate for DPbD. Note that this minimization should not be seen strictly in the
information-theoretic sense. Instead, it aims to ensure that MPC lowers the risk to the rights and freedoms of the data subject(s).
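The decision logic described above can be condensed into a short sketch. The function below is an illustrative encoding of the questions in this section, following one possible reading of Fig. 2.2; it does not replace the legal assessment itself, and the parameter names are invented for the example.

```python
# Sketch of the assessment scheme of Sect. 5.2: answer the questions as booleans
# and obtain one of the three possible classifications.
from enum import Enum

class Outcome(Enum):
    GDPR_AVOIDANCE = "GDPR does not apply to this MPC protocol"
    DPBD = "MPC is a candidate for data protection by design (Art. 25 GDPR)"
    NO_LEGAL_BENEFITS = "MPC brings no legal benefits"

def assess_mpc(absolute_approach: bool,
               any_party_inputs_personal_data: bool,
               i_input_personal_data: bool,
               others_receive_personal_output: bool,
               i_receive_personal_output: bool,
               mpc_minimizes_personal_output: bool) -> Outcome:
    if absolute_approach:
        # Absolute approach: the GDPR applies as soon as any party inputs personal data.
        if not any_party_inputs_personal_data:
            return Outcome.GDPR_AVOIDANCE
    else:
        # Relative approach: only my own inputs and outputs matter.
        if i_input_personal_data:
            if not others_receive_personal_output and not i_receive_personal_output:
                return Outcome.GDPR_AVOIDANCE
        elif not i_receive_personal_output:
            return Outcome.GDPR_AVOIDANCE
    # The GDPR applies; the remaining question is whether MPC minimizes the output.
    return Outcome.DPBD if mpc_minimizes_personal_output else Outcome.NO_LEGAL_BENEFITS

# Example: PSI that outputs only the intersection size (Solution 2 in Sect. 6.1),
# assessed by authority B under the relative approach.
print(assess_mpc(absolute_approach=False,
                 any_party_inputs_personal_data=True,
                 i_input_personal_data=True,
                 others_receive_personal_output=False,
                 i_receive_personal_output=False,
                 mpc_minimizes_personal_output=True))  # -> Outcome.GDPR_AVOIDANCE
```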
5.3 Security Model and Trust Assumption

Which security model and trust assumption are sufficient depends on the use case. From a data protection point of view, a very rigorous approach is always preferable. More specifically, the most reliable data protection is given if the MPC protocol is maliciously secure. In addition, the trust assumption should be reduced as far as possible, to a level where the protocol is secure as long as one party does not collude with the others. MPC protocols fulfilling the properties above are highly complex and involve significant computational overhead. Thus, an interesting question is whether there are use cases where semi-honest MPC protocols or a different trust assumption suffice. The following paragraphs should by no means be seen as the last word on the subject. Rather, they are intended to spur an interdisciplinary discussion. In our view, the question cannot be settled from a purely technical point of view, as legal and normative considerations also have to be taken into account.
5.3.1 GDPR Avoidance
If MPC is applied to avoid the GDPR, we should demand exceptionally robust security guarantees, because if a security violation happens, personal data may be exposed. In addition, since the GDPR was initially not applicable to this computation, there is the possibility that neither party has any procedures in place to mitigate such an exposure of personal data (notification and communication of a personal data breach—Articles 33 and 34 GDPR). Thus, we recommend that only maliciously secure protocols can lead to the non-applicability of the GDPR. We would also prefer that colluding parties cannot compromise security; at the very least, the parties should be legally bound by law or contract not to collude.
5.3.2 Data Protection by Design
If MPC is applied for DPbD objectives, we can discuss less strict requirements. Here the use of MPC is meant to achieve data minimization. Controllers and processors should be encouraged to deploy MPC protocols and not be scared off by overly high standards. Hence, we advocate that in such a case even semi-honest secure MPC protocols could suffice. Further, the trust assumption could also be relaxed. For instance, one could assume that parties do not collude if collusion runs counter to their business interests. This assumption should be checked regularly, as it is subject to change.
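The recommendations of Sects. 5.3.1 and 5.3.2 can be captured in a similar sketch. The profile fields and thresholds below reflect one reading of those recommendations and are not a normative rule.

```python
# Sketch of the security requirements suggested in Sect. 5.3 for the two purposes.
from dataclasses import dataclass

@dataclass
class MPCSecurityProfile:
    computational_security_bits: int       # e.g. 128, comparable to AES [11]
    maliciously_secure: bool               # False means semi-honest model only
    collusion_resistant: bool              # secure even if other parties collude
    non_collusion_bound_by_contract: bool  # legal safeguard against collusion

def meets_recommendation(profile: MPCSecurityProfile, purpose: str) -> bool:
    if profile.computational_security_bits < 128:
        return False
    if purpose == "gdpr_avoidance":
        # Exceptionally robust guarantees: malicious security, plus collusion
        # resistance or at least a legal obligation not to collude.
        return profile.maliciously_secure and (
            profile.collusion_resistant or profile.non_collusion_bound_by_contract)
    if purpose == "dpbd":
        # Less strict: even semi-honest protocols may suffice here.
        return True
    raise ValueError("purpose must be 'gdpr_avoidance' or 'dpbd'")
```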
6 Scenarios

In this section, we present two common scenarios where MPC is of particular interest. Both use cases involve only two parties to make the analysis easier to follow. Nevertheless, there would be no substantial difference in applying our test to more than two parties.
6.1 Private Set Intersection

Two public authorities, A and B, perform a joint data analysis. Authority A has a database containing records about vaccinated individuals. On the other side, authority B holds a database consisting of individuals who have diabetes (see Fig. 2.3 for a small example). The goal of the data analysis is that authority B can answer the following question: What is the vaccination rate among individuals who have diabetes?
Fig. 2.3 Medical conditions databases

Authority A (Vaccinated)    Authority B (Diabetes)
Maynard Collins             Leeann Marsden
Breanna Sanders             Breanna Sanders
Buster Joyce                Seymour Barlow
Devon Robertson             Devon Robertson
Georgia Judd                Darby Samson
Napoleon Blakely            Napoleon Blakely
Callan Fitzroy              Glen Hoggard
Bryon Morin                 Bryon Morin
6.1.1 Solution 1
Authorities A and B engage in a private set intersection protocol. It is similar to the protocol in Sect. 2.2; the only difference is that now just one party—authority B—receives the intersection. So the protocol's output will be the identifiers (e.g., names) of individuals who are vaccinated and have diabetes. Thus authority B can compute the vaccination rate among people with diabetes.
Analysis To determine the solution's impact with regard to the GDPR, we follow the test from above. We start from the perspective of authority B and assume the relative approach. Clearly, the information on whether someone has diabetes constitutes personal data. Authority A does not receive an output from the computation. However, authority B itself ends up knowing the names in the intersection (the names underlined in Fig. 2.3). It can then deduce that every individual in that intersection is vaccinated. So we arrive—as we also would in the absolute approach—at the question of whether the MPC solution minimizes the personal data output. It does indeed, because in the naive solution authority A would send its complete database to authority B. Consequently, authority B would know not only who from its diabetes database is vaccinated but who is vaccinated in general. Thus, the use of this MPC solution qualifies as DPbD. The outcome for authority A is the same, albeit with a slightly different argumentation: authority A does not receive personal data but transfers personal data (vaccination statuses) to authority B.

6.1.2 Solution 2
This solution is based on the previous one, with the difference that the output is the size of the intersection. In other words, authority B only learns how many individuals in its diabetes database are vaccinated (50% in our small example) but not their names.

Analysis In the absolute approach, the new solution does not change the outcome of the above analysis. The more interesting case is the relative approach. The crucial difference compared to the previous solution is that the protocol's output is now non-personal data, provided that the databases are not artificially small. Hence, neither authority A nor authority B receives personal data through the computation. Thus, this computation does not deal with personal data at all and therefore falls outside the scope of the GDPR.
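For intuition, the toy PSI sketch from Sect. 2.2 needs only a small change to realize this variant: the doubly blinded values are shuffled before they are matched, so the receiving party can count matches but can no longer link them to individual entries. This again reuses the illustrative PSIParty helper from the earlier sketch and is not the protocol an authority would actually deploy.

```python
# PSI-cardinality sketch: shuffling the doubly blinded values unlinks them from
# the individual inputs, so only the number of matches is learned.
import random

def intersection_size(a_items, b_items) -> int:
    a, b = PSIParty(a_items), PSIParty(b_items)          # from the sketch in Sect. 2.2
    a_double = b.blind_peer(a.blind_own())               # H(x)^(ka*kb) for A's items
    random.shuffle(a_double)                             # break the link to A's entries
    b_double = a.blind_peer(b.blind_own())               # H(y)^(kb*ka) for B's items
    return len(set(a_double) & set(b_double))

print(intersection_size(["Maynard Collins", "Breanna Sanders", "Devon Robertson"],
                        ["Leeann Marsden", "Breanna Sanders", "Devon Robertson"]))  # -> 2
```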
6.1.3 Solution 3
One should not get the impression from this example that MPC is always favorable. To show this, we construct a hypothetical MPC solution to the problem above. It is important to mention that no serious privacy engineer would propose such a solution. One could design an MPC protocol that outputs A’s database to B.
Analysis The computation would still be done by means of MPC, but that protection would be void since the output reveals all the personal data anyhow. Obviously, in this case, MPC does not minimize the personal data output in any form. Thus, although a PET was used, there should be no legal benefit from it.
6.2 Outsourcing

Jane Doe is concerned about one of her moles. She takes a photo of it with her cell phone. Afterward, she uploads it to the cloud service MoleChecker. Based on its classification algorithm, the service tells her how likely it is that her mole is cancerous.
6.2.1 Solution
One can design an MPC protocol with the following properties. The protocol's output is still the likelihood of the mole being cancerous, but now the output is received only by Jane Doe. MoleChecker gets neither the picture (which is protected by MPC) nor the result.
Analysis Jane Doe is a data subject, and therefore we solely check the perspective of MoleChecker in its role as a potential controller. The photo is personal data for Jane Doe. Consequently, it is also personal data for MoleChecker, assuming the absolute approach. Hence, the GDPR applies to this computation, and MoleChecker becomes a controller. Since the MPC solution minimizes the personal data output (MoleChecker never sees the photo or learns the result), it qualifies as DPbD. If we switch to the relative approach, another picture emerges. Because MoleChecker does not input personal data or receive any through the MPC protocol, the GDPR does not apply in this situation (accordingly, MoleChecker is not a controller).
6.2.2 Variant
Finally, we take a look at a variant of this image classification use case. The functionality stays the same, but the setting is slightly different: a hospital outsources the classification of MR images to a company called MRClassifier. Since the solution is identical from the technical perspective, the analysis is as well. However, it is interesting to consider what this means for the involved parties. In the absolute approach, the hospital becomes a controller, which turns MRClassifier into a processor. Accordingly, a data processing agreement between the parties is mandatory. Since, as already seen above, the GDPR does not apply to this computation in the relative approach, no such agreement is necessary.
7 Conclusion

We believe the use of MPC would lead to better data protection without significant restrictions on data-driven business opportunities. Hopefully, this chapter contributes to more legal certainty when applying MPC. To see more widespread adoption of MPC, we can only back the EDPB's recommendation for certification [18]. The certification of MPC protocols has two benefits. First, it would guide controllers on how to use MPC properly in their processing operations. Second, data subjects could better see for themselves which controllers follow best-practice data protection measures. At best, our work starts a discussion on MPC certification and standardization with the goal of better data protection for individuals. In the end, such an initiative can only be successful if it is highly interdisciplinary, involving MPC, legal, privacy engineering, and domain experts as well as the data protection authorities. In a future line of work, we would like to investigate the role of MPC in the upcoming ePrivacy Regulation.

Acknowledgments We thank the reviewers of the Privacy Symposium for their comments, which helped improve the paper's quality. We also thank Aisling Connolly for many fruitful discussions in the early stages of this work. This work was supported by EU's Horizon 2020 project Safe-DEED, grant agreement no. 825225 and TRUSTS grant agreement no. 871481.
References 1. Family educational rights and privacy act of 1974, 20 u.s.c.§1232g (2012) 2. Article 29 Working Party: Opinion 05/2014 on anonymisation techniques (2014), https://ec. europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en. pdf 3. Bergauer, C., Gosch, N.: Die pseudonymisierung personenbezogener daten gemäß der dsgvo (2020)
4. Bygrave, L.A.: Data protection by design and by default: Deciphering the EU’s legislative requirements. Oslo Law Review 4(02), 105–120 (2017) 5. Cavoukian, A., et al.: Privacy by design: The 7 foundational principles. Information and privacy commissioner of Ontario, Canada 5, 12 (2009) 6. Chaum, D.: Security without identification: Transaction systems to make big brother obsolete. Communications of the ACM 28(10), 1030–1044 (1985) 7. Cohen, A., Nissim, K.: Towards formalizing the GDPR’s notion of singling out. Proceedings of the National Academy of Sciences 117(15), 8344–8352 (2020) 8. Commission, D.P.: Guidance on anonymisation and pseudonymisation (2019), https://www. dataprotection.ie/sites/default/files/uploads/2019-06/190614%20Anonymisation%20and %20Pseudonymisation.pdf 9. Council of European Union: Council of Europe convention 108: Convention for the protection of individuals with regard to automatic processing of personal data 1981, ETS 108 (1981), https://www.coe.int/en/web/conventions/full-list/-/conventions/rms/ 0900001680078b37?module=treaty-detail&treatynum=108 10. Cramer, R., Damgård, I.B., et al.: Secure multiparty computation. Cambridge University Press (2015) 11. Daemen, J., Rijmen, V.: AES proposal: Rijndael (1999) 12. Danezis, G., Domingo-Ferrer, J., Hansen, M., Hoepman, J.H., Metayer, D.L., Tirtea, R., Schiffner, S.: Privacy and data protection by design-from policy to engineering. arXiv preprint arXiv:1501.03726 (2015) 13. De Cristofaro, E., Tsudik, G.: Practical private set intersection protocols with linear complexity. In: International Conference on Financial Cryptography and Data Security. pp. 143–159. Springer (2010) 14. Deng, M., Wuyts, K., Scandariato, R., Preneel, B., Joosen, W.: A privacy threat analysis framework: supporting the elicitation and fulfillment of privacy requirements. Requirements Engineering 16(1), 3–32 (2011) 15. Dwork, C.: Differential privacy. In: International Colloquium on Automata, Languages, and Programming. pp. 1–12. Springer (2006) 16. European Commission: EU study on the legal analysis of a single market for the information society (2014), https://op.europa.eu/s/sA5L 17. European Commission: Promoting data protection by privacy enhancing technologies (PETs) (2007), https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A52007DC0228 18. European Data Protection Board: Guidelines 4/2019 on article 25 data protection by design and by default (2020), https://edpb.europa.eu/our-work-tools/our-documents/ guidelines/guidelines-42019-article-25-data-protection-design-and_en 19. European Parliament and of the Council: Regulation (EU) 2016/679 of the European parliament and of the council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46/EC (general data protection regulation), oj 2016 l 119/1 (2016), https://eur-lex.europa. eu/eli/reg/2016/679/oj 20. Evans, D., Kolesnikov, V., Rosulek, M.: A pragmatic introduction to secure multi-party computation. Foundations and Trends® in Privacy and Security 2(2–3) (2017) 21. Finck, M., Pallas, F.: They who must not be identified—distinguishing personal from nonpersonal data under the GDPR. International Data Privacy Law (2020) 22. Goldwasser, S., Micali, S., Rackoff, C.: The knowledge complexity of interactive proof systems. SIAM Journal on computing 18(1), 186–208 (1989) 23. Gürses, S., Troncoso, C., Diaz, C.: Engineering privacy by design. 
Computers, Privacy & Data Protection 14(3), 25 (2011) 24. Gürses, S., Troncoso, C., Diaz, C.: Engineering privacy by design reloaded. In: Amsterdam Privacy Conference. pp. 1–21 (2015) 25. Hoepman, J.H.: Privacy design strategies. In: IFIP International Information Security Conference. pp. 446–459. Springer (2014)
2 MPC in the GDPR
39
26. Kamm, L., Bogdanov, D., Laur, S., Vilo, J.: A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29(7), 886–893 (2013) 27. Mohassel, P., Zhang, Y.: SecureML: A system for scalable privacy-preserving machine learning. In: 2017 IEEE symposium on security and privacy (SP). pp. 19–38. IEEE (2017) 28. Mourby, M., Mackey, E., Elliot, M., Gowans, H., Wallace, S.E., Bell, J., Smith, H., Aidinlis, S., Kaye, J.: Are ‘pseudonymised’ data always personal data? implications of the GDPR for administrative data research in the UK. Computer Law & Security Review 34(2), 222–233 (2018) 29. Nissim, K., Bembenek, A., Wood, A., Bun, M., Gaboardi, M., Gasser, U., O’Brien, D.R., Steinke, T., Vadhan, S.: Bridging the gap between computer science and legal approaches to privacy. Harv. JL & Tech. 31, 687 (2017) 30. Rubinstein, I.S., Good, N.: The trouble with article 25 (and how to fix it): the future of data protection by design and default. International Data Privacy Law (2020) 31. Sangers, A., van Heesch, M., Attema, T., Veugen, T., Wiggerman, M., Veldsink, J., Bloemen, O., Worm, D.: Secure multiparty PageRank algorithm for collaborative fraud detection. In: International Conference on Financial Cryptography and Data Security. pp. 605–623. Springer (2019) 32. Smart, N.P., Smart, N.P.: Cryptography made simple. Springer (2016) 33. Spiekermann, S., Cranor, L.F.: Engineering privacy. IEEE Transactions on software engineering 35(1), 67–82 (2008) 34. Spindler, G., Schmechel, P.: Personal data and encryption in the European general data protection regulation. J. Intell. Prop. Info. Tech. & Elec. Com. L. 7, 163 (2016) 35. Wood, A., Altman, M., Bembenek, A., Bun, M., Gaboardi, M., Honaker, J., Nissim, K., O’Brien, D.R., Steinke, T., Vadhan, S.: Differential privacy: A primer for a non-technical audience. Vand. J. Ent. & Tech. L. 21, 209 (2018) 36. Yao, A.C.: Theory and application of trapdoor functions. In: 23rd Annual Symposium on Foundations of Computer Science (SFCS 1982). pp. 80–91. IEEE (1982)
Chapter 3
A Critique of the Google Apple Exposure Notification (GAEN) Framework Jaap-Henk Hoepman
Abstract As a response to the COVID-19 pandemic, digital contact tracing has been proposed as a tool to support the health authorities in their quest to determine who has been in close and sustained contact with a person infected by the coronavirus. In April 2020, Google and Apple released the Google Apple Exposure Notification (GAEN) framework, as a decentralised and more privacy-friendly platform for contact tracing. The GAEN framework implements exposure notification mostly at the operating system layer, instead of fully at the app(lication) layer. In this chapter, we study the consequences of this approach. We argue that this creates a dormant functionality for mass surveillance at the operating system layer. We show how it does not technically prevent the health authorities from implementing a purely centralised form of contact tracing (even though that is the stated aim). We highlight that GAEN allows Google and Apple to dictate how contact tracing is (or rather is not) implemented in practice by health authorities and how it introduces the risk of function creep.
1 Introduction Large parts of the world are still suffering from a pandemic caused by the Severe Acute Respiratory Syndrome CoronaVirus 2 (SARS-CoV-2) that raised its ugly head somewhere in late 2019 and that was first identified in December 2019 in Wuhan, mainland China [18]. Early work of Ferretti et al. [7], modelling the infectiousness of SARS-CoV-2, showed that (under a number of strong assumptions) digital contact tracing could in principle help reduce the spread of the virus. This spurred the development of contact tracing apps (also known as proximity tracing or
J.-H. Hoepman, Radboud University Nijmegen, Nijmegen, Netherlands; Karlstad University, Karlstad, Sweden; University of Groningen, Groningen, Netherlands. e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. S. Schiffner et al. (eds.), Privacy Symposium 2022, https://doi.org/10.1007/978-3-031-09901-4_3
exposure notification apps1) that aim to support the health authorities in their quest to quickly determine who has been in close and sustained contact with a person infected by this virus [12, 22]. The main idea underlying digital contact tracing is that many people carry a smartphone most of the time and that this smartphone could potentially be used to more or less automatically collect information about people someone has been in close contact with. Even though the effectiveness of contact tracing is contested [3] and there are ethical concerns [13], Bluetooth-based contact tracing apps in particular have quickly been embraced by governments across the globe (even though Bluetooth signal strength is a rather poor proxy for being in close contact [5]). Bluetooth-based contact tracing apps broadcast an ephemeral identifier on the short-range Bluetooth radio network at regular intervals while at the same time collecting such identifiers transmitted by other smartphones in the vicinity. The signal strength is used as an estimate for the distance between the two smartphones, and when this distance is determined to be short (within 1–2 m) for a certain period of time (typically 10–20 min), the smartphones register each other's ephemeral identifier as a potential risky contact. Contact tracing is by its very nature a privacy-invasive affair, but the level of privacy infringement depends very much on the particular system used. In particular, it makes a huge difference whether all contacts are registered centrally (on the server of the National Health Authority, for example) or in a decentralised fashion (on the smartphones of the users that installed the contact tracing app) [9, 12].2 In the first case, the authorities have a complete and perhaps even real-time view of the social graph of all participants. In the second case, information about one's contacts is only released (with consent) when someone tests positive for the virus. One of the first contact tracing systems was the TraceTogether app deployed in Singapore.3 This inherently centralised approach lets phones exchange regularly changing pseudonyms over Bluetooth. A phone of a person who tests positive is requested to submit all pseudonyms it collected to the central server of the health authorities, who are able to recover phone numbers and identities from these pseudonyms. The Pan-European Privacy-Preserving Proximity Tracing (PEPP-PT) consortium quickly followed suit with a similar centralised proposal for contact tracing to be rolled out in Europe.4 This consortium had quite some traction at
1 Over time the meanings of these terms have diverged to describe different kinds of systems, see Sect. 2.
2 Note that essentially all systems for contact tracing require a central server to coordinate some of the tasks. The distinction between centralised and decentralised systems is therefore not made based on whether such a central server exists, but based on where the matching of contacts takes place.
3 See https://www.tracetogether.gov.sg.
4 See this Wikipedia entry: https://en.wikipedia.org/wiki/Pan-European_Privacy-Preserving_Proximity_Tracing (the original domain https://www.pepp-pt.org has been abandoned, but some information remains on the project GitHub pages https://github.com/pepp-pt).
the European policy level, but there were serious privacy concerns due to its centralised nature. As a response, a large group of academics, led by Carmela Troncoso and her team at EPFL, left the PEPP-PT consortium and rushed to publish the Decentralised Privacy-Preserving Proximity Tracing (DP-3T) protocol5 as a decentralised alternative for contact tracing with better privacy guarantees [17]. See [21] for some details on the history. All these protocols require low-level access to the Bluetooth network stack on a smartphone to transmit and receive the ephemeral identifiers used to detect nearby contacts. However, both Google's Android and Apple's iOS use a smartphone permission system to restrict access to critical or sensitive resources, such as the Bluetooth network. This proved to be a major hurdle for practical deployment of these contact tracing apps, especially on iPhones as Apple refused to grant the necessary permissions. Perhaps in an attempt to avoid being manoeuvred into a position where they would have to grant access to any contact tracing app (regardless of its privacy risk), Google and Apple instead developed their own platform for contact tracing called Google Apple Exposure Notification (GAEN). Around the time that DP-3T released their first specification to the public (early April 2020), Google and Apple released a joint specification for contact tracing as well (which they later updated and renamed to exposure notification6) with the aim of embedding the core technology in the operating system layer of recent Android- and iOS-powered smartphones.7 Their explicit aim was to offer a more privacy-friendly platform for exposure notification (as it is based on a distributed architecture) instead of allowing apps direct access to the Bluetooth stack to implement either contact tracing or exposure notification themselves. Contact tracing is a complex and hotly contested topic, about which many things could be (and have been) said from many different perspectives. This chapter studies a very specific topic: the consequences of pushing exposure notification down the stack from the app(lication) layer into the operating system layer. We first explain how contact tracing and exposure notification work when implemented at the app layer in Sect. 2. We then describe the GAEN framework in Sect. 3 and the technical difference between the two approaches in Sect. 4. Section 5 then discusses the concerns raised by pushing exposure notification down the stack: it creates a dormant functionality for mass surveillance at the operating system layer, it does not technically prevent the health authorities from implementing a purely centralised form of contact tracing (even though that is the stated aim), it allows Google and Apple to dictate how contact tracing is (or rather is not) implemented in practice
5 See https://github.com/DP-3T/documents.
6 See https://techcrunch.com/2020/04/24/apple-and-google-update-joint-coronavirus-tracing-tech-to-improve-user-privacy-and-developer-flexibility/.
7 See https://www.google.com/covid19/exposurenotifications/ and https://www.apple.com/covid19/contacttracing/.
by health authorities, and it creates a risk of function creep.8 We finish this chapter with some general conclusions in Sect. 6.
2 How Contact Tracing and Exposure Notification Works Although the terms are sometimes used interchangeably, contact tracing and exposure notification actually refer to two different concepts. Contact tracing informs a central authority of the contacts of an infected person. Exposure notification only notifies these contacts themselves. As mentioned before, centralised systems for digital contact tracing automatically register all contacts of all people who installed the contact tracing app in a central database maintained by the health authorities. Once a patient tests positive, their contacts can immediately be retrieved from this database. But as the central database collects contacts regardless of infection, the authorities can retrieve someone's contacts at any time. This explains the huge privacy risks associated with such a centralised approach. Decentralised systems for digital contact tracing only record contact information locally on the smartphones of the people who installed the app: there is no central database. Therefore, the immediate privacy risk is mitigated.9 Once a person tests positive, however, some of the locally collected data is revealed. Some schemes (e.g., DESIRE10 and other proposals [9]) reveal to the health authorities the identities of the people who have been in close contact with the person who tested positive. Those decentralised variants still implement contact tracing. Most distributed schemes however only notify the persons who have been in close contact with the person who tested positive by displaying a message on their smartphone. The central health authorities are not automatically notified (and remain in the dark unless the people notified take action and get tested, for example). Such systems instead implement exposure notification. In other words, the distinction between contact tracing and exposure notification is made based on who gets notified: the health authorities or the exposed user (not on whether the architecture is centralised or not). Most exposure notification systems (such as DP-3T and GAEN) distinguish a collection phase and a notification phase that each work roughly as follows (see also Fig. 3.1). During the collection phase, the smartphone of a participating user generates a random ephemeral proximity identifier Id,i every 10–20 min and broadcasts this
8 This chapter is partially based on two blog posts written by the author earlier: https://blog.xot.nl/2020/04/19/google-apple-contact-tracing-gact-a-wolf-in-sheeps-clothes/ and https://blog.xot.nl/2020/04/11/stop-the-apple-and-google-contact-tracing-platform-or-be-ready-to-ditch-your-smartphone/.
9 Certain privacy risks remain, however. See for example [2, 17, 19].
10 See https://github.com/3rd-ways-for-EU-exposure-notification/project-DESIRE.
Fig. 3.1 Exposure notification using an app
identifier over the Bluetooth network every few minutes.11 The phone also stores a copy of this identifier locally. The smartphones of other nearby participating users receive this identifier and, when the signal strength indicates the other user is within the threshold distance of 1–2 m, store this identifier (provided they see the same identifier several times within a 10–20 min time interval). A smartphone of a participant thus contains a database S of identifiers it sent itself and another database R of identifiers it received from others. The time an identifier was sent or received is also stored, at varying levels of precision (in hours, or days, for example). The databases are automatically pruned to delete any identifiers that are no longer epidemiologically relevant (for COVID-19, this is any identifier that was collected more than 14 days ago). The notification phase kicks in as soon as a participating user tests positive for the virus and agrees to notify their contacts. In that case, the user instructs their app to upload the database S of identifiers the app sent itself to the server of the health authorities.12 The smartphone app of other participants regularly queries this server for recently uploaded identifiers of infected people and matches any new identifiers it receives from the server with entries in the database R of identifiers it received from other people who were in close proximity somewhere during the last few weeks. If there is a match, sometime during those weeks the app must have received and stored an identifier of someone who now tested positive for the virus. The app therefore notifies its user that they have been in close contact with
11 In this notation, Id,i denotes the ephemeral proximity identifier generated for the i-th 10–20 min time interval on day d.
12 A contact tracing version of the app would instead request the smartphone of the user to upload the database R of received identifiers that the app collected.
an infected person recently (sometimes indicating the day this contact took place). It then typically offers advice on how to proceed, such as pointers to more information, and strongly suggests contacting the health authorities, getting tested, and going into self-quarantine.
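To make the two phases concrete, the following sketch models this app-layer flow in plain Python: each device keeps a set S of identifiers it broadcast and a set R of identifiers it received, an infected user uploads S, and everyone else intersects the published identifiers with their local R. It is an illustrative sketch only, not DP-3T or GAEN code; the 16-byte identifier size and the exact pruning policy are assumptions based on the description above.

```python
import os
import time

RETENTION_DAYS = 14  # identifiers older than this are no longer epidemiologically relevant

class Device:
    """Minimal model of one participating smartphone (illustrative only)."""

    def __init__(self):
        self.sent = {}      # S: identifier -> time it was broadcast
        self.received = {}  # R: identifier -> time it was received

    def broadcast(self, now=None):
        """Collection phase: generate and broadcast a fresh random ephemeral identifier."""
        ident = os.urandom(16)
        self.sent[ident] = now or time.time()
        return ident

    def record_contact(self, ident, now=None):
        """Store an identifier received from a nearby phone (close and sustained contact)."""
        self.received[ident] = now or time.time()

    def prune(self, now=None):
        """Drop identifiers outside the epidemiologically relevant window."""
        cutoff = (now or time.time()) - RETENTION_DAYS * 24 * 3600
        self.sent = {i: t for i, t in self.sent.items() if t >= cutoff}
        self.received = {i: t for i, t in self.received.items() if t >= cutoff}

    def identifiers_to_upload(self):
        """Notification phase, infected user: upload S (never R) to the health authority."""
        return list(self.sent)

    def check_exposure(self, published):
        """Notification phase, other users: match the published identifiers locally against R."""
        return sorted(self.received[i] for i in set(self.received) & set(published))

# Usage: Alice and Bob meet; Alice tests positive and uploads her sent identifiers.
alice, bob = Device(), Device()
bob.record_contact(alice.broadcast())
if bob.check_exposure(alice.identifiers_to_upload()):
    print("Bob's app shows an exposure notification")
```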
3 The GAEN Framework Google and Apple's framework for exposure notification follows the same paradigm, with the notable exception that, instead of an app implementing all the functionality, most of the framework is implemented at the operating system layer.13 Although GAEN is a joint framework, there are minor differences in how it is implemented on Android (Google) and iOS (Apple). GAEN works on Android version 6.0 (API level 23) or higher, and on some devices as low as version 5.0 (API level 21).14 On Android, GAEN is implemented as a Google Play service. GAEN works for Apple devices running iOS 13.5 or higher. At the Bluetooth and "cryptographic" layers, GAEN works the same on both platforms, however. This implies that ephemeral proximity identifiers sent by any Android device can be received and interpreted by any iOS device and vice versa. In other words: users can in principle get notified of exposures to infected people independent of the particular operating system their smartphone runs and independent of which country they are from. (In practice, some coordination between the exposure notification apps and the back-end servers of the different health authorities involved is required.) As an optimisation step, devices do not randomly generate each and every ephemeral proximity identifier independently. Instead, the ephemeral proximity identifier Id,i to use for a particular interval i on day d is derived from a temporary exposure key Kd (which is randomly generated each day) using some public deterministic function f (the details of which do not matter for this chapter). In other words, Id,i = f(Kd, i), see Fig. 3.2. With this optimisation, devices only need to store exposure keys in S, as the actual ephemeral proximity identifiers they sent can always be reconstructed from these keys. Generating, broadcasting, and collecting ephemeral proximity identifiers happen automatically at the operating system layer, but only if the user has explicitly enabled this by installing an exposure notification app and setting the necessary
13 The summary of GAEN and its properties is based on the documentation offered by both Google (https://www.google.com/covid19/exposurenotifications/) and Apple (https://www.apple.com/covid19/contacttracing/) online and was last checked January 2022. The documentation offered by Google and Apple is terse and scattered. The extensive documentation of the Dutch CoronaMelder at https://github.com/minvws proved to be very helpful.
14 See https://developers.google.com/android/exposure-notifications/exposure-notifications-api.
Fig. 3.2 Temporary exposure keys and ephemeral proximity identifiers
permissions,15 or by enabling exposure notifications in the operating system settings.16 Apple and Google do not allow exposure notification apps to access your device location.17 By default, exposure notification is disabled on both platforms. When enabled, the database S of exposure keys and the database R of identifiers received are stored at the operating system layer, which ensures that data is not directly accessible by any app installed by the user. Actual notifications are the responsibility of the exposure notification app. In order to use the data collected at the operating system layer, the app needs to invoke the services of the operating system through the GAEN Application Programming Interface (API). Apps can only access this API after obtaining explicit permission from Google or Apple. The API offers the following main functions (see also Fig. 3.3):
– Retrieve the set of exposure keys (stored in S). "The app must provide functionality that confirms that the user has been positively diagnosed with COVID-19."18 But this is not enforced at the API layer. In other words, the app (once approved and given access to the API) has access to the exposure keys.
– Match a (potentially large) set of exposure keys against the set of ephemeral proximity identifiers received from other devices earlier (stored in R), and return a list of risk scores (either a list of daily summaries or a list of individual, smaller than 30 min, exposure windows). This function is rate limited to a few calls per day.19
15 Both Android and iOS require Bluetooth to be enabled. On Android 10 and lower, the device location setting needs to be turned on as well (see https://support.google.com/android/answer/9930236).
16 For those countries where the national health authorities have not developed their own exposure notification app but instead rely on Exposure Notification Express (see https://developers.google.com/android/exposure-notifications/en-express).
17 See https://support.google.com/android/answer/9930236 and https://covid19-static.cdn-apple.com/applications/covid19/current/static/contact-tracing/pdf/ExposureNotification-FAQv1.2.pdf.
18 See https://developers.google.com/android/exposure-notifications/exposure-notifications-api. Also see the verification system Google designed for this: https://developers.google.com/android/exposure-notifications/verification-system.
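To illustrate how an app drives these two functions, the mock below sketches the app-side flow: releasing the exposure keys only after the user confirms a positive diagnosis and consents, uploading them, and later submitting downloaded diagnosis keys for local matching. The class and function names are placeholders standing in for the GAEN API and the health-authority backend; they are not the real method names.

```python
import os

class MockExposureService:
    """Stand-in for the OS-level exposure notification service (not the real GAEN API)."""

    def get_exposure_keys(self, user_consents: bool):
        """Release the last 14 temporary exposure keys, gated on user consent."""
        if not user_consents:
            raise PermissionError("user consent required to release exposure keys")
        return [os.urandom(16) for _ in range(14)]

    def match_diagnosis_keys(self, diagnosis_keys):
        """Match downloaded keys against locally stored identifiers; return daily summaries.
        The actual matching happens inside the OS; here it is simply stubbed out."""
        return [{"day": d, "risk_score": 0} for d, _ in enumerate(diagnosis_keys)]

def on_positive_test(service, upload_to_health_authority, user_consents):
    """App flow when the user reports a positive test."""
    keys = service.get_exposure_keys(user_consents)
    upload_to_health_authority(keys)

def daily_check(service, download_from_health_authority, notify):
    """App flow run periodically on every participating phone."""
    summaries = service.match_diagnosis_keys(download_from_health_authority())
    if any(s["risk_score"] > 0 for s in summaries):
        notify("You have recently been in close contact with an infected person.")
```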
Fig. 3.3 The GAEN framework
The API also ensures that the user is asked for consent whenever an app enables exposure notification for the first time and whenever user keys are retrieved for upload to the server of the health authorities after the user tested positive for COVID-19. The API furthermore offers functions to tune the computation of the risk scores. The idea is that through the API a user who tests positive for COVID-19 can instruct the app to upload all its recent (actually, the last 14) temporary exposure keys to the server of the health authorities. The exposure notification app of another user can regularly query the server of the health authorities for recently uploaded exposure keys of infected devices. Using the second GAEN API function allows the app to submit these exposure keys to the operating system that, based on the database R of recently collected proximity identifiers, checks whether there is a match with such an exposure key (by deriving the proximity identifiers locally). A list of matches is returned that contains the day of the match and an associated risk score; the actual key and identifier matched are not returned, however. The day of the contact, the duration of the contact, the signal strength (as a proxy for distance of contact), and the type of test used to determine infection are used to compute the risk score. Note that these detailed parameters are hidden from the app. Developers do have influence on how this risk score is computed by providing weights for all the parameters. Using the returned list, the app can decide to notify the user when there appears to be a significant risk of infection. Note that by somewhat restricting the
19 On iOS 13.7 and the most recent version of the API, the use of this method is limited to a maximum of six times per 24-h period. On Android, the most recent version of the API also allows at most six calls per day, but "allowlisted" accounts (used by health authorities for testing purposes) are allowed 1,000,000 calls per day.
way risk scores are computed, GAEN makes it harder for a malicious app to determine exactly which exposure key triggered a warning (and hence makes it harder to determine exactly with whom someone has been in physical proximity).
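The derivation step can be made concrete with a few lines of code. The sketch below is not the actual GAEN cryptography (which has its own key schedule); it merely instantiates the public deterministic function f from the previous section with HMAC-SHA256 truncated to 16 bytes, as an assumption, to show why storing the daily keys Kd is enough to reconstruct every identifier a device has sent, and how batch matching against published keys works in principle.

```python
import hmac
import hashlib
import os

def derive_identifier(daily_key: bytes, interval: int) -> bytes:
    """Illustrative f(Kd, i): HMAC-SHA256 over the interval number, truncated to 16 bytes.
    (An assumption for this sketch; the real GAEN key schedule differs.)"""
    return hmac.new(daily_key, interval.to_bytes(4, "big"), hashlib.sha256).digest()[:16]

def identifiers_for_day(daily_key: bytes, intervals_per_day: int = 144) -> set:
    """All ephemeral proximity identifiers a device would broadcast on one day."""
    return {derive_identifier(daily_key, i) for i in range(intervals_per_day)}

def batch_match(diagnosis_keys: list, received: set) -> int:
    """OS-side batch matching: derive identifiers from the published daily keys of infected
    users and count how many were actually observed by this device. The real API returns
    risk scores or exposure windows rather than raw match counts."""
    return sum(len(identifiers_for_day(k) & received) for k in diagnosis_keys)

# Usage: a device that stored one identifier derived from an infected user's daily key.
k_d = os.urandom(16)                         # temporary exposure key for one day
observed = {derive_identifier(k_d, 42)}      # identifier received during interval 42
print(batch_match([k_d], observed))          # -> 1
```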
4 How the GAEN Framework Differs from a Purely App-Based Approach Given its technical architecture, the GAEN framework fundamentally differs from a purely app-based approach to exposure notification in the following four aspects. First of all, the functionality and necessary code for the core steps of exposure notification (namely broadcasting, collecting, and matching ephemeral proximity identifiers) come pre-installed on all modern Google and Apple devices. In a purely app-based approach, this functionality and code are solely contained in the app itself and not present on the device when the app is not installed (and removed when the app is de-installed). Second, all relevant data (ephemeral proximity identifiers and their associated metadata such as date, time, and possibly location) are collected and stored at the operating system level. In a purely app-based approach, this data is collected and stored at the user/app level. This distinction is relevant as in modern computing devices the operating system runs in a privileged mode that renders the data it processes inaccessible to a "user land" app. Data processed by apps is accessible to the operating system in raw form (in the sense that the operating system has access to all bytes of memory used by the app), but the interpretation of that data (i.e., which information is stored where) is not necessarily easy to determine. Moreover, the framework is interoperable at the global level: users can in principle get notified of exposures to infected people independent of the particular operating system their smartphone runs and independent of which country they are from. This would not necessarily be the case (and would probably be impossible to achieve in practice) in a purely app-based approach. Finally, the modes of operation are set by Google and Apple: the system notifies users of exposure; it does not automatically inform the health authorities. The app is limited to computing a risk score and does not receive the exact location nor the exact time when a "risky" contact took place. In a purely app-based approach, the developers of the app themselves determine the full functionality of the app (within the possible limits imposed by the app stores).
5 A Critique of the GAEN Framework It is exactly because of the above properties that the GAEN framework appears to protect privacy: the health authorities (and the users) are prevented from obtaining details about the time and location of a risky contact, thus protecting the privacy of
infected individuals, and the matching is forced to take place in a decentralised fashion (which prevents the health authorities from directly obtaining the social graph of users). However, there is more to this than meets the eye, and there are certainly remaining privacy issues and broader concerns that stem from the way GAEN works. In summary:
– GAEN creates a dormant functionality for mass surveillance at the operating system layer.
– The exposure notification microdata are under Google/Apple's control.
– A decentralised framework such as GAEN can be recentralised.
– GAEN allows Google and Apple to dictate how contact tracing is or is not implemented in practice by health authorities.
– GAEN introduces significant risks of function creep.
These concerns are discussed in detail in the following sections. These are by no means the only ones,20 see for example also [6, 11, 15], but these are the ones that derive directly from the architectural choices made.
5.1 GAEN Creates a Dormant Mass Surveillance Tool Instead of implementing all exposure notification functionality in an app, Google and Apple push the technology down the stack into the operating system layer, creating a Bluetooth-based exposure notification platform. This means the technology is available all the time, for all kinds of applications beyond just exposure notification. As will be explained in the next section, GAEN can be (ab)used to implement centralised forms of contact tracing as well. Exposure notification is therefore no longer limited in time or limited in use purely to trace and contain the spread of COVID-19. This means that two very important safeguards to protect our privacy are thrown out of the window. Moving exposure notification down the stack fundamentally changes the amount of control users have: you can uninstall an (exposure notification) app, but you cannot uninstall the entire OS (although on Android you can in theory disable and even delete Google Play Services). The only thing a user can do is disable exposure notification using an operating system-level setting. But this does not remove the actual code implementing this functionality. But the bigger picture is this: it creates a platform for contact tracing in the more general sense of mapping which people have been in close physical contact (regardless of whether there is a pandemic that needs to be fought). Moreover, this platform for contact tracing works all across the globe for most modern smart
20 See https://www.eff.org/deeplinks/2020/04/apple-and-googles-covid-19-exposure-notification-api-questions-and-answers.
phones (Android Marshmallow and up, and iOS 13-capable devices) across both OS platforms. Unless appropriate safeguards are in place, this would create a global mass surveillance system that would reliably track who has been in contact with whom, at what time, and for how long.21 GAEN works much more reliably and extensively to determine actual physical contact than any other system would be able to, whether it is based on GPS or mobile phone network location data (using cell towers). It is important to stress this point because some people believe this kind of precise tracking is something companies such as Google (using GPS or WiFi network names) have already been able to do for years. This is not the case. This type of contact tracing really brings it to another level. In those regions that opt for Exposure Notification Express,22 the data collection related to exposure notification starts as soon as you accept the operating system update and enable it in the settings. In other regions, this only happens when people install an exposure notification app (but that uses the API to retroactively find contacts based on the data phones have already collected). But this only describes the current situation, as we understand it, based on information offered by Google and Apple. There is an underlying assumption that both Apple and Google indeed refrain from offering other apps access to the exposure notification platform. This can change, at any time. They could be forced (by government decision or legal decree) to offer access to other apps; economic incentives could change their minds and make them decide to use the platform themselves. GAEN creates a dormant functionality for mass surveillance [20] that can be turned on with the flip of a virtual switch at Apple or Google HQ. All in all this means we all have to place massive trust in Apple and Google, both to properly monitor the use of the GAEN API by others, and to not succumb to the temptation to use it themselves.
5.2 Google and Apple Control the Exposure Notification Microdata Because the exposure notification is implemented at the operating system layer, Google and Apple fully control how it works and have full access to all microdata generated and collected. In particular, they have, in theory, full access to the temporary exposure keys and the ephemeral proximity identifiers and control how these keys are generated. We have to trust that the temporary exposure keys are really generated at random and not stealthily derived from a user identifier that
21 GAEN does not currently make all this information available in exact detail through its API, but it does collect this information at the lower operating system level. It is unclear whether GAEN records location data at all (although it would be easy to add this, and earlier versions of the API did in fact offer this information).
22 See https://developers.google.com/android/exposure-notifications/en-express.
would allow Google or Apple to link proximity identifiers to a particular user. And even if these keys are generated truly at random, at any point in time, Google or Apple could decide to surreptitiously retrieve these keys from a certain device, again with the aim of linking previously collected proximity identifiers to this particular device. In other words, we have to trust that Google and Apple will not abuse GAEN themselves. They do not necessarily have an impeccable track record that warrants such trust.
5.3 Distributed Can Be Made Centralised The discussion in the preceding paragraphs implicitly assumes that the GAEN platform truly enforces a decentralised form of exposure notification and that it prevents exposure notification apps from automatically collecting information on a central server about who was in contact with whom. This assumption is not necessarily valid, however (although it can be enforced provided Apple and Google are very strict in the vetting process used to grant apps access to the GAEN platform). In fact, GAEN can easily be used to create a centralised form of exposure notification, at least when we limit our discussion to centrally storing information about who has been in contact with an infected person. The idea is as follows. GAEN allows an exposure notification app on a phone to test daily exposure keys of infected users against the proximity identifiers collected by the phone over the last few days. This test is local; this is why GAEN is considered decentralised. The app, however, could immediately report back the result of this test to the central server, without user intervention (or without the user even noticing).23 It could even send a user-specific identifier along with the result, thus allowing the authorities to immediately contact anybody who has recently been in the proximity of an infected person. This is the hallmark of a centralised solution (although it does not go as far as allowing the central server to collect all encounters). In other words: the GAEN technology itself does not prevent a centralised solution. The only thing preventing it would be Apple and Google being strict in vetting exposure notification apps. But they could already do so now, without rolling out their GAEN platform, by strictly policing which apps they allow access to the Bluetooth network stack, and which apps they allow on their app stores. The actual protection offered therefore remains procedural: we need to trust Google and Apple to disallow centralised implementations of contact tracing apps offered through their app stores. A malicious app could do other things as well. By design, GAEN does not reveal which infected person a user has been in contact with when matching keys on the
23 Recall that the API enforces user consent when retrieving exposure keys, but not when matching them.
user's phone. Calls to the matching function in the API are rate limited to a few calls each day, the idea being that a large number of keys can be matched in batch without revealing which particular key resulted in a match. But this still allows a malicious app (and accompanying malicious server) to test a few daily tracing keys (for example of persons of interest) one by one, to keep track of each daily tracing key for which the test was positive, and to report these back to the server. As the server knows which daily tracing key belongs to which infected person, this allows the server to know exactly with which infected persons of interest the user of this phone has been in contact. If the app is malicious, even non-infected persons are at risk because such an app could retrieve the exposure notification keys even if a user is not infected (provided it can trick the user into consenting to this). Clearly, a malicious exposure notification app not based on GAEN could do the same (and much more). But this does show that GAEN by itself does not protect against such scenarios, while making the impact of such scenarios far greater because of its global reach.
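The re-identification risk described here boils down to a small loop. The sketch below is illustrative only: os_match stands in for the OS-level matching call and report for the app's own network call; neither is a real GAEN function, and the matching logic itself is stubbed out.

```python
def os_match(diagnosis_keys: list, received_identifiers: set) -> bool:
    """Placeholder for the OS matching service: True if any identifier derived from the
    submitted keys was observed by this phone (the actual derivation is stubbed out)."""
    return any(key in received_identifiers for key in diagnosis_keys)

def report(server_log: list, user_id: str, person_of_interest: str) -> None:
    """Placeholder for the malicious app reporting a match back to its own server."""
    server_log.append((user_id, person_of_interest))

def malicious_check(keys_by_person: dict, received: set, user_id: str, server_log: list) -> None:
    """Instead of matching all keys in one batch, test them per person of interest, so a
    positive result tells the server exactly whom this user has met."""
    for person, keys in keys_by_person.items():
        if os_match(keys, received):
            report(server_log, user_id, person)
```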
5.4 Google and Apple Dictate How Contact Tracing Works Apple and Google's move is significant for another reason: especially on Apple iOS devices, access to the hardware is severely restricted. This is also the case for access to Bluetooth. In fact, without approval from Apple, you cannot use Bluetooth "in the background" for your app (which is functionality that you need in order to be able to collect information about nearby phones even if the user's phone is locked). One could argue that this improves privacy as it adds another checkpoint where some entity (in this case Apple) decides whether to allow the proposed app or not. But Apple (and by extension Google) uses this power as leverage to grab control over how contact tracing or exposure notification can and cannot work. This is problematic as this allows them to set the terms and conditions, without any form of oversight. With this move, Apple and Google make themselves indispensable, ensuring that this potentially global surveillance technology (see above) is forced upon us. And as a consequence, all microdata underlying any contact tracing system is stored on the phones they control (again, see above). For example, the GAEN framework prevents notified contacts from learning the nature of the contact and making a well-informed decision about the most effective response: get tested, or go into self-quarantine immediately. It also prevents the health authorities from learning the nature of the contact and hence makes it impossible to build a model of how contacts influence the spread of the virus. "The absence of transmission data limits the scope of analysis, which might, in the future, give freedom to people who can work, travel and socialise, while more precisely targeting others who risk spreading the virus." [10]. This happens because the GAEN framework is based on a rather corporate understanding of privacy as giving control and asking for consent. But under certain specific conditions—and a public-health emergency such as the current pandemic is surely one—individual
consent is not an appropriate mechanism: “to be effective, public-health surveillance needs to be comprehensive, not opt-in.” [4]. It also prevents the health authorities from compiling statistics about how the exposure notification system performs.24 Now the arguments put forward in this section may seem to contradict the ones made in the previous section, when discussing the risk that the GAEN platform is repurposed to implement a more centralised solution. But this is not the case: the previous section discussed how GAEN does not technically prevent a centralised solution (which was one of the reasons to implement it). This section argues that we, as a society, lose any say in how contact tracing should be working in a pandemic. Google and Apple set the boundaries and do not allow us to work around them: even bypassing the restrictions as discussed in the previous section only offers a limited space to manoeuvre. In other words, the claimed benefit of the GAEN framework (no centralised solution) is not necessarily a benefit, and it is certainly not for Apple and Google to unilaterally decide it is, and is in actual fact not something GAEN can enforce technically.
5.5 Function Creep The use of exposure notification functionality as offered through GAEN is not limited to controlling just the spread of the COVID-19 virus. As this is not the first corona-type virus, it is only a matter of time until a new dangerous virus rears its ugly head. In other words, exposure notification is here to stay. And with that, the risk of function creep appears: with the technology rolled out and ready to be (re)activated, other uses of exposure notification will at some point in time be considered and deemed proportionate. Unless Apple and Google strictly police the access to the GAEN API (based on some publicly agreed upon rules) and ensure that it is only used by the health authorities, and only for controlling a pandemic like COVID-19, the following risks are apparent. Consider the following hypothetical example of a government that wants to trace the contacts or whereabouts of certain people, which could ensue when Google and Apple fail to strictly enforce access. Such a government could coerce developers to embed this tracking technology in innocent-looking apps, in apps that you are more or less required to have installed, or in software libraries used by such apps. Perhaps it could even coerce Apple and Google themselves to silently enable exposure notifications for all devices sold in their country, even if the users do not install any app.25 It is known that Google and Apple in some cases do bow to government pressure to
24 Apple and Google have since released a white paper describing a privacy-preserving method called ENPA for offering a very limited set of aggregate statistics to health authorities [1]. See also https://github.com/google/exposure-notifications-android/blob/master/doc/enexpress-analytics-faq.md.
25 Note that when Google and Apple first announced their exposure notification platform, the idea was that your phone would start emitting and collecting proximity identifiers as soon as the feature
enable or disable certain features: like filtering search results,26 removing apps from the app store,27 and even moving cloud storage servers,28 offering Chinese authorities far easier access to text messages, email, and other data stored in the cloud. Because the ephemeral proximity identifiers are essentially random, they cannot be authenticated. In other words: any identifier with the right format advertised on the Bluetooth network with the correct service identifier will be accepted and recorded by any device with GAEN active. Moreover, because the way ephemeral identifiers are generated from daily exposure keys is (necessarily) public, anybody can build a cheap device broadcasting ephemeral identifiers from chosen daily exposure keys that will be accepted and stored by a nearby device with the GAEN platform enabled. A government could install such Bluetooth beacons at fixed locations of interest for monitoring purposes. The daily exposure keys of these devices could be tested against phones of people of interest running the apps as explained above. Clearly, this works only for a limited number of locations because of rate limiting, but note that at least under Android this limit is not imposed for "allowlisted" apps for testing purposes, and then the question is again whether Google can be forced to allowlist a certain government app. China could consider using it to further monitor Uyghurs. Israel could use it to further monitor Palestinians. You could monitor the visitors of abortion clinics, coffee shops, gay bars, and more. Indeed, the exact same functionality offered by exposure notification could allow the police to quickly see who has been close to a murder victim: simply report the victim's phone as being "infected." Some might say this is not a bug but a feature, but the same mechanism could be used to find whistle-blowers, or the sources of a journalist. For centralised contact tracing apps, we already see function creep creeping in. The recent use of the term "contact tracing" in the context of tracking protesters in Minnesota after demonstrations erupted over the death of George Floyd at the hands of a police officer29 is ominous, even if the term refers to traditional police investigation methods.30 More concrete evidence is the discovery
was enabled in the operating system settings, even if no exposure notification app was installed, see https://techcrunch.com/2020/04/13/apple-google-coronavirus-tracing/.
26 See https://www.sfgate.com/business/article/Google-bows-to-China-pressure-2505943.php.
27 See https://www.nytimes.com/2019/10/10/business/dealbook/apple-china-nba.html.
28 See https://www.reuters.com/article/us-china-apple-icloud-insight/apple-moves-to-store-icloud-keys-in-china-raising-human-rights-fears-idUSKCN1G8060.
29 See https://bgr.com/2020/05/30/minnesota-protest-contact-tracing-used-to-track-demonstrators/.
30 And what to think of the following message posted by Anita Hazenberg, the Director Innovation Directorate at Interpol: "Is your police organisation considering how tracing apps will influence the way we will police in the future? If you are a (senior) officer dealing with policy challenges in this area, please join our discussion on Wednesday 6 May (18.00 Singapore time) during an INTERPOL Virtual Discussion Room (VDR). Please contact [email protected] for more info. Only reactions from law enforcement officers are appreciated." See: https://www.linkedin.com/posts/anita-hazenberg-b0b48516_is-your-police-organisation-considering-how-activity-6663040380965130242-q8Vk.
that Australia’s intelligence agencies were “incidentally” collecting data from the country’s COVIDSafe contact tracing app.31 The Singapore authorities recently announced that the police can access COVID-19 contact tracing data for criminal investigations.32 Now one could argue that these examples are an argument supporting the privacy-friendly approach taken by Google and Apple. After all, by design, exposure notification does not have a central database that is easily accessible by law enforcement or intelligence agencies. But as explained above, this is not (and cannot be) strictly enforced by the GAEN framework. Contact tracing also has tremendous commercial value. A company could install Bluetooth beacons equipped with this software at locations of interest (e.g., shopping malls). By reporting a particular beacon as “infected,” all phones (that have been lured into installing a loyalty app or that somehow have the SDK of the company embedded in some of the apps they use) will report that they were in the area. Facebook used a crude version of contact tracing (using the access it had to WhatsApp address books) to recommend friends on Facebook [8, 16]. The kind of contact tracing offered by GAEN (and other Bluetooth-based systems) gives a much more detailed, real time, insight in people’s social graph, and its dynamics. How much more precise could targeted adverting become? Will Google and Apple forever be able to resist this temptation? If you have Google Home at home, Google could use this mechanism to identify all people who have visited your place. Remember: they set the restrictions on the API. They can at any time decide to change and loosen these restrictions.
6 Conclusion We have described how the shift, by Google and Apple, to push exposure notification down the stack from the app layer to the operating system layer fundamentally changes the risk associated with exposure notification systems, and despite the original intention, unfortunately not for the better. We have shown that from a technical perspective, it creates a dormant functionality for global mass surveillance at the operating system layer, that it takes away the power to decide how contact tracing works from the national health authorities and the national governments, and how it increases the risks of function creep already nascent in digital exposure notification and contact tracing systems. These risks can only be mitigated by Google and Apple as they are the sole purveyors of the framework and have sole discretionary power over who to allow access to the framework, and under which conditions. We fully rely on their faithfulness and vigilance to enforce the rules and
31 https://techcrunch.com/2020/11/24/australia-spy-agencies-covid-19-app-data/.
32 https://www.zdnet.com/article/singapore-police-can-access-covid-19-contact-tracing-data-for-criminal-investigations/.
restrictions they have committed to uphold and have very few tools to verify this independently. The solution to these problems is simple: stop shifting this functionality to the operating system layer, and implement this at the app level instead with appropriate oversight by the competent national authorities, based on prior consultation of all stakeholders, including civil society. Note that such consultation should first and foremost consider the question of whether any form of contact tracing or exposure notification is a proportional intervention33 and not simply a form of tech-solutionism [14].
References
1. Apple and Google. "Exposure Notification Privacy-preserving Analytics (ENPA) White Paper". Apr 2021.
2. L. Baumgärtner et al. "Mind the GAP: Security & Privacy Risks of Contact Tracing Apps". In: 19th TRUSTCOM (Guangzhou, China, Dec. 29, 2020–Jan. 1, 2021). IEEE, 2020, pp. 458–467.
3. J. Bay. "Automated contact tracing is not a coronavirus Panacea". Medium (Apr 11, 2020).
4. J. E. Cohen, W. Hartzog, and L. Moy. "The dangers of tech-driven solutions to COVID-19". Brookings TechStream (June 17, 2020).
5. P.-O. Dehaye. "Inferring distance from Bluetooth signal strength: a deep dive". Medium (May 19, 2020).
6. T. Duarte. "Google and Apple Exposure Notifications System: Exposure Notifications or Notified Exposures?" LawArXiv (Nov. 5, 2020).
7. L. Ferretti et al. "Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing". Science (Mar. 31, 2020).
8. K. Hill. "Facebook recommended that this psychiatrist's patients friend each other". Splinter (Aug. 29, 2016).
9. J.-H. Hoepman. "Hansel and Gretel and the Virus. Privacy Conscious Contact Tracing". CoRR abs/2101.03241. Jan. 2021. arXiv: 2101.03241 [cs.CR]
10. I. Ilves. "Why are Google and Apple dictating how European democracies fight coronavirus?" The Guardian (June 16, 2020).
11. N. Klein. "How big tech plans to profit from the pandemic". The Guardian (May 13, 2020).
12. T. Martin, G. Karopoulos, J. L. Hernandez Ramos, G. Kampourakis, and I. Nai Fovino. "Demystifying COVID-19 digital contact tracing: A survey on frameworks and mobile apps". Wireless Communications and Mobile Computing (2020), p. 8851429.
13. J. Morley, J. Cowls, M. Taddeo, and L. Floridi. "Ethical guidelines for COVID-19 tracing apps. Protect privacy, equality and fairness in digital contact tracing with these key questions." Nature 582.29–31 (May 28, 2020).
14. E. Morozov. To Save Everything, Click Here. The Folly of Technological Solutionism. New York: PublicAffairs, 2013.
15. T. Sharon. "Blind-sided by privacy? Digital contact tracing, the Apple/Google API and big tech's newfound role as global health policy makers". Ethics and Information Technology (July 18, 2020).
16. A. Tait. "Why does Facebook recommend friends I've never even met?" Wired (May 29, 2019).
17. C. Troncoso et al. Decentralized Privacy-Preserving Proximity Tracing. Whitepaper. DP-3T Consortium, May 25, 2020.
33 See also https://www.safeagainstcorona.nl.
18. L. van Dorp et al. "Emergence of genomic diversity and recurrent mutations in SARS-CoV-2". Infection, Genetics and Evolution 83 (2020), p. 104351.
19. S. Vaudenay. "Analysis of DP3T". Cryptology ePrint Archive 2020/399 (2020).
20. M. Veale. "Privacy is not the problem with the Apple-Google contact-tracing toolkit". The Guardian (July 1, 2020).
21. M. Veale. "Sovereignty, privacy and contact tracing protocols". In: L. Taylor, G. Sharma, A. Martin, and S. Jameson (eds.). Data Justice and COVID-19: Global Perspectives. London: Meatspace Press, 2020, pp. 34–39.
22. World Health Organization. "Contact tracing in the context of COVID-19, Interim guidance". May 10, 2020.
Part II
Implications of Regulatory Framework in the European Union
Chapter 4
Global Data Processing Identifiers and Registry (DP-ID) Sébastien Ziegler and Cédric Crettaz
Abstract The chapter presents the DP-ID Web application using global data processing identifiers. This application was developed in the context of the Fed4FIRE+ project, which is the biggest Next-Generation Internet (NGI) federation of testbeds employed for developing and testing new ICT solutions. The DP-ID application allows the registration of all the data processing activities of each participating organization and thus facilitates compliance with the regulations, particularly the General Data Protection Regulation (GDPR). The utilization of the DP-ID Web application is not restricted to the testbed providers and the experimenters of the Fed4FIRE+ testbed federation; it can be employed in any other use case involving data processing activities.
Keywords Data processing · Identifiers · GDPR
1 Introduction Fed4FIRE+ [1] (Federation for Future Internet Research Experimentation Plus) is a European research project developed in the context of the Horizon 2020 research program. It aims at establishing the largest federation of Next-Generation Internet (NGI) testbeds by federating multiple testbeds. These testbeds are interconnected through open APIs, which the project has brought to standardization at the International Telecommunication Union, where the main Open APIs have been formally standardized under Recommendation Q.4068 "Open application program interfaces (APIs) for interoperable testbed federations" [2]. Fed4FIRE+ provides remote access to a large variety of facilities and testbeds to support research and innovation with information and communication technologies.
S. Ziegler · C. Crettaz, Mandat International, Geneva, Switzerland. e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. S. Schiffner et al. (eds.), Privacy Symposium 2022, https://doi.org/10.1007/978-3-031-09901-4_4
It also aims at researching how to make research infrastructure (1) simpler and more cost-efficient, (2) more trustworthy, and (3) sustainable. Like any other European data controllers and processors, the Fed4FIRE+ partners must comply with the obligations of the European General Data Protection Regulation (GDPR). The project set up a dedicated team in charge of supporting and monitoring GDPR compliance of the project. Very quickly, it appeared that the project would have to overcome three important challenges:
1. Legal complexity of GDPR compliance for non-legal experts: While nobody is supposed to ignore the law, the GDPR includes obligations that are not always easy for engineers and other project leaders without a legal education to understand and implement.
2. Complexity of distributed data processing: Fed4FIRE+ involves many different organizations and universities, which are themselves collaborating with third parties acting as data processors.
3. Scalability requirement: Fed4FIRE+ aims at supporting large-scale federation of testbeds and infrastructure, which raises the need to address a scalability-by-design requirement.
Altogether, these challenges have led the project to research and develop a global data processing identifiers registry, named DP-ID, to simplify data protection compliance in large-scale multi-tenant environments. The DP-ID solution has been designed to be applicable to all data processing activities and serve the international community. This chapter provides an overview of the DP-ID innovative approach and applicability.
2 Global Data Processing Identifier Registry Concept The global data processing identifiers registry (DP-ID) concept aims at providing a global registry of data processing identifiers [3] with public information on the registered data processing made available by the data controllers. On the one hand, the DP-ID registry enables compliance with several data protection-related obligations (more details below). On the other hand, it simplifies data processing compliance management by enabling various stakeholders to refer to and use clear, unique, and interoperable identifiers. The DP-ID architecture has taken into account the lessons learned from the Internet domain name server architecture to satisfy scalability requirements. It also leverages the IPv6 address model, as we will detail below.
3 Facilitating GDPR Compliance DP-ID has been designed to address and support compliance with a focus on several data protection obligations.
3.1 Obligation to Inform Art. 12, 13, 14 The GDPR Art. 12 requires that "the controller shall take appropriate measures to provide any information (...) relating to processing to the data subject in a concise, transparent, intelligible and easily accessible form, using clear and plain language, in particular for any information addressed specifically to a child. The information shall be provided in writing, or by other means, including, where appropriate, by electronic means." More specifically, DP-ID has been designed to cover the information that must be made available to data subjects where personal data are collected from the data subject (Art. 13 GDPR) or have not been obtained from the data subject (Art. 14 GDPR). By generating a DP-ID record, the data controller can publish all the required information, satisfy these obligations, and facilitate access to the mandatory information as detailed in these articles.
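As an illustration of what such a record has to capture, the sketch below models the main Art. 13/14 information items as a simple data structure. The field names, identifier format, and example values are hypothetical, chosen only for this example; they are not the actual DP-ID schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataProcessingRecord:
    """Illustrative record holding the information GDPR Art. 13/14 require to be made
    available to data subjects (field names are hypothetical, not the DP-ID schema)."""
    dp_id: str                          # unique identifier assigned by the registry
    controller: str                     # identity and contact details of the controller
    dpo_contact: Optional[str]          # contact details of the data protection officer, if any
    purposes: List[str]                 # purposes of the processing
    legal_basis: str                    # e.g. consent, contract, legitimate interest
    categories_of_data: List[str]       # categories of personal data processed
    recipients: List[str] = field(default_factory=list)            # recipients or categories
    third_country_transfers: List[str] = field(default_factory=list)
    retention_period: str = "unspecified"
    data_subject_rights: List[str] = field(
        default_factory=lambda: ["access", "rectification", "erasure",
                                 "restriction", "portability", "objection"])
    source_of_data: Optional[str] = None  # Art. 14 only: where the data were obtained

# Usage: a controller registers one processing activity and publishes the record.
record = DataProcessingRecord(
    dp_id="dp-id:example:0001",
    controller="Example University, privacy@example.org",
    dpo_contact="dpo@example.org",
    purposes=["testbed access management"],
    legal_basis="performance of a contract",
    categories_of_data=["name", "e-mail address", "login records"],
)
```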
3.2 Data Protection by Design and Default (Art. 25 GDPR)
Another important requirement of the GDPR relates to the obligation to adopt a data protection by design and default approach, as stated in Art. 25 GDPR: " . . . the controller shall, both at the time of the determination of the means for processing and at the time of the processing itself, implement appropriate technical and organizational measures, such as pseudonymization, which are designed to implement data-protection principles, such as data minimization, in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of this Regulation and protect the rights of data subjects." In addition to providing transparent information to the data subjects, DP-ID aims at providing a technical and organizational measure that guides data controllers in clarifying the scope of their data processing activities and supporting compliance with their data protection obligations.
4 Data Processing Identifier Requirements
The DP-ID has been designed to satisfy several requirements:
• Compliance with GDPR obligations: The application was designed to fulfill the obligations described in the GDPR. It is a concrete means to support the effective implementation of GDPR certifications across jurisdictions and to facilitate the tracking of data processing activities across multiple organizations. The application makes it possible to inform data subjects about data processing activities and to improve their trustworthiness.
• Simplicity of use: The application was designed and implemented to be easy to use for any kind of user, independently of the domains in which the data processing activities occur. The graphical user interface (GUI) was kept very simple: the number of fields is limited to those required by the GDPR obligations.
• Scalability: The backend of the DP-ID application was built to support a very high number of DP-IDs registered in the database. Access to the data stored in the database is also facilitated by a modern API (Application Programming Interface) supporting a query language, so that all the data concerning each registered data processing activity can be retrieved efficiently.
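The chapter does not publish the API's endpoint or schema, so the following minimal sketch only illustrates, under assumed names, how a registered record might be retrieved through a query-language request from Python. The URL path and field names are hypothetical; only the registry domain comes from the references.

```python
# Illustrative sketch only: the DP-ID API endpoint and schema below are assumptions,
# not a published interface. "requests" is the standard third-party HTTP library.
import requests

QUERY = """
query {
  dataProcessing(dpId: "00a1:b2c3:0100:002a") {
    organization
    purpose
    dataCategories
    crossBorderTransfer
  }
}
"""

# Hypothetical endpoint on the DP-ID server
response = requests.post(
    "https://www.dp-id.com/api/graphql",
    json={"query": QUERY},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```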
5 DP-ID Implementation
The online registry is publicly accessible to publish, search, and retrieve information about data processing. Each data processing registered in the registry is attributed a unique identifier: a DP-ID. To register new data processing and generate the corresponding DP-IDs, users representing the organization responsible for the data processing must create an account that allows them to describe all the data processing activities undertaken by their organization and its data processors. The DP-ID application has been developed by following co-creation and co-design methodologies through an iterative approach. Several iterations in the development of the application were made by involving different stakeholders, including DPOs of the Fed4FIRE+ project, Web developers, and external users as alpha/beta testers. This process made it possible to improve the DP-ID application in several steps, taking into account the comments and improvements suggested by the different stakeholders. The graphical user interface has been optimized to keep it simple and user-friendly, while collecting the information required by the obligations set out in the GDPR. The home page allows searching for a data processing through its identifier as well as through the company or organization name. Any anonymous user can run a query on the home page of the DP-ID application. The home page of the application is shown in Fig. 4.1.
Fig. 4.1 Home page
Fig. 4.2 Dashboard
A user can manage several organizations. A dashboard allows users to quickly see all the data processing activities of all the organizations for which they have generated records. The information of each organization and each related DP-ID can be edited, shared, and deleted from the dashboard. In the same manner, it is possible to add a new DP-ID associated with an organization from the dashboard (Fig. 4.2). A DP-ID is displayed in a Web browser as shown in Fig. 4.3. The first section presents all the information that is requested by the GDPR. The data controller and data processors are then mentioned with their respective addresses, followed by the data subject rights, including the links to the data protection policy and the cookie policy, and the link to contact the data protection officer (DPO) overseeing the registered data processing. The information to be provided in compliance with the GDPR is entered through the form presented in Fig. 4.4. Access to this form is of course restricted to authenticated users, and the information to be entered consists of
Fig. 4.3 An example of DP-ID
Fig. 4.4 Form for a new DP-ID
• Name of the new data processing activity, i.e., of the DP-ID
• Identifier of the Europrivacy certification, if it exists
• Purpose of the data processing
• Description of the data processing
• Special categories of data (e.g., if the data are sensitive)
• Data recipients (who ultimately receives the data)
• Cross-border transfer
• Category of the data (personal, etc.)
• Data sharing: yes or no
• Automated decision: yes or no
• Logic involved in the data processing
• Significance of processing activities
• Consequences for the data subjects
• Nature of the requirement for the data processing
• Consequences if data are not provided
6 Data Processing Identifier Format
The identifiers used to locate each DP-ID in the related database use a particular format elaborated for this application. The DP-ID identifier is represented by a string of hexadecimal digits, similarly to the host part of an IPv6 address. As shown below, the first segment of the DP-ID is the Company ID. It is a unique identifier attributed to the organization or company responsible for the registered data processing. This segment is made of up to eight hexadecimal digits, which provide over 4 billion distinct unique identifiers for the organization or company controlling the data processing. The second segment of the DP-ID is the Category ID. It is made of two hexadecimal digits, specifying up to 256 distinct categories of data processing activity. It makes it easy to distinguish data processing related to IoT devices, websites, administrative processes, etc. A reference list of Category IDs has been developed and is being refined (Fig. 4.5). The third segment corresponds to the specific data processing identifier within the Category ID of the company (Company ID). It is made of up to six hexadecimal digits, enabling over 16 million distinct data processing activities to be identified per category of data processing for each company. The identifier used in the DP-ID has voluntarily been aligned with the IPv6 address format. Not only does this allow taking advantage of the high scalability of the IPv6 addressing scheme, but it also makes it possible to develop services accessible through a simple IPv6 address, where each DP-ID record can be accessed through its unique IPv6 address (Fig. 4.6).
Fig. 4.5 Data processing identifier
Format: ####:####:####:#### (16 hexadecimal digits). Company ID: 8 hex digits (4,294,967,296 values); Category ID: 2 hex digits (256 values); DP ID: 6 hex digits (16,777,216 values).
Fig. 4.6 IPv6 address structure of DP-IDs: the Global Routing Prefix (n bits) and the Subnet ID (64 − n bits) correspond to the address of the DP-ID server, while the 64-bit Interface ID carries the unique DP-ID identifier
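The segment lengths above are enough to sketch how such an identifier can be split into its components and embedded as the Interface ID of an IPv6 address. The following minimal Python sketch assumes a 16-hex-digit DP-ID and a hypothetical /64 prefix for the DP-ID server; it illustrates the format, not the project's implementation.

```python
import ipaddress

def split_dp_id(dp_id: str) -> dict:
    """Split a 16-hex-digit DP-ID (e.g. '00a1:b2c3:0100:002a') into its segments."""
    digits = dp_id.replace(":", "").lower()
    if len(digits) != 16:
        raise ValueError("a DP-ID is expected to contain 16 hexadecimal digits")
    return {
        "company_id": digits[:8],      # 8 hex digits -> ~4.3 billion organizations
        "category_id": digits[8:10],   # 2 hex digits -> 256 processing categories
        "processing_id": digits[10:],  # 6 hex digits -> ~16.8 million processings per category
    }

def dp_id_to_ipv6(dp_id: str, server_prefix: str) -> ipaddress.IPv6Address:
    """Use the 64-bit DP-ID as the Interface ID within the DP-ID server's /64 prefix."""
    interface_id = int(dp_id.replace(":", ""), 16)
    network = ipaddress.IPv6Network(server_prefix)
    return network.network_address + interface_id

# Example with hypothetical values (2001:db8::/32 is the IPv6 documentation prefix)
print(split_dp_id("00a1:b2c3:0100:002a"))
print(dp_id_to_ipv6("00a1:b2c3:0100:002a", "2001:db8:abcd:12::/64"))
```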
7 Enabling Data Processing Mapping
Most data processing activities involve several organizations or companies: a controller or several joint controllers sharing data with data processors and subprocessors. The distributed nature of multiactor data processing makes it very difficult to get a comprehensive view of the whole chain of processors and subprocessors involved in a data processing activity; the information is fragmented. DP-ID makes it possible to aggregate and consolidate information on such distributed data processing and to reconstitute the whole chain of processing activities from the data source (data subject or data owner) to all processors and subprocessors involved. Such aggregation serves all parties involved in the data processing, including data subjects, controllers, and processors.
8 Demonstrating Integrability and Portability
The DP-ID registry makes it easy to share information on recorded data processing. The registry already provides several modes of transmission, including the following:
1. Each DP-ID is accessible via a unique URL composed of the DP-ID server name followed by the data processing ID page, or via the address of the page listing the registered DP-IDs of the organization or company.
2. An HTML code is provided by the online registry for each stored DP-ID. This code can be integrated into an HTML page of the organization's or company's website. It embeds a button in the corresponding website, enabling its visitors to easily access information on a data processing with direct access to the record of the corresponding DP-ID.
3. The registry also provides a standard QR code that redirects the user to the corresponding DP-ID or to the list of DP-IDs of the organization or company.
4. Finally, it is possible to share the IPv6 address of the DP-ID, or to use the DP-ID as the Interface ID if the user has the Global Routing Prefix and Subnet ID of the DP-ID server.
Fig. 4.7 Sharing a DP-ID
Fig. 4.7 presents an example of the first three solutions for sharing information about a DP-ID of an organization.
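As an illustration of the first three sharing modes, the snippet below builds a record URL, a minimal HTML button, and a QR code for a given DP-ID. The URL pattern and link text are assumptions made for this sketch (the chapter does not publish the exact URL scheme); the QR code is generated with the third-party qrcode package.

```python
# Hypothetical sketch of the sharing modes described above; the URL pattern is assumed.
import qrcode  # third-party package: pip install "qrcode[pil]"

dp_id = "00a1:b2c3:0100:002a"
record_url = f"https://www.dp-id.com/{dp_id}"        # mode 1: unique URL (assumed pattern)

embed_snippet = (                                     # mode 2: button embedded in a website
    f'<a class="dp-id-button" href="{record_url}">'
    f"View our data processing record (DP-ID {dp_id})</a>"
)

qr_image = qrcode.make(record_url)                    # mode 3: QR code redirecting to the record
qr_image.save("dp-id-qr.png")

print(embed_snippet)
```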
9 Demonstrating Interoperability
The DP-ID application is interoperable and integrates other elements concerning data protection and related certifications. In this context, the user can enter the identifier of a Europrivacy [4] certification of compliance with the GDPR and other data protection regulations. Europrivacy is a certification scheme developed through the European research program to assess and certify the compliance of data processing activities with the applicable regulations. It can be extended to other national data protection regulations, including its application to emerging technologies. Each Europrivacy certificate has a unique identifier that can be integrated and accessed through the registry. The user can also enter the identifier of a Privacy Pact. Privacy Pact [5] is another tool developed through the European research program. It enables companies and organizations to express their commitment to respect the GDPR by signing a contractually binding commitment. The signed commitment is stored and published on the Privacy Pact website, and the organization receives a Privacy Pact seal. The corresponding signed pacts can also be reached directly through the DP-ID registry.
10 Demonstrating Cross-Organization Data Protection Compliance Management
DP-ID has been researched, developed, and successfully used in the context of Fed4FIRE+. It has enabled the data protection officer (DPO) of the project to access a comprehensive view and mapping of the project's data processing activities. DP-ID proved to be a powerful tool to aggregate and simplify the management of data processing activities from a compliance management perspective. An interesting aspect of DP-ID is its ability to streamline the documentation and compliance process with regard to data protection regulations. By requiring the
various controllers involved in the project to fill and complete their data processing profiles in the registry, it implicitly enabled them to comply with some of the obligations contained in the GDPR.
11 Use Cases
This chapter has presented the utilization of the DP-ID in the context of the Fed4FIRE+ project, notably to monitor the data processing activities in the different testbeds and experiments. However, the use cases in which the DP-ID would be applicable are numerous. For instance, the DP-ID and its Web application can be used for any personal data processing performed by public administrations or commercial services. It would enable such data controllers or processors to have a comprehensive view of all their processing. It can also be used in more technology-focused processing, such as connected vehicles, to list for end users all the data processing activities performed on V2X (vehicle-to-everything) communications and the related vehicular networks, and to enable a synthetic overview of data processing activities that involve multiple organizations or that are performed across diverse jurisdictions. This allows better transparency concerning the data collected and processed in the context of automated driving, as the entities involved are intrinsically numerous. As such, use of this service may provide a clearer high-level visualization of the data processing activities and the involved stakeholders (carmakers, V2X infrastructure managers, telecommunications operators, service providers, entertainment providers, etc.). Another example is the listing of all the data processing activities carried out by a website. Many websites use third-party services for payments or for specific functionalities or modules. In order to ensure a smooth user experience, this complexity is usually hidden and invisible to the visitor. Each data processing activity performed on a given website can be entered in the DP-ID registry and logically interlinked. As a result, the people in charge of website management, and also the visitors of the website, can discover and obtain a clear view of all the data processing activities, including the different data controllers and processors involved in the services provided by the website, as well as, where applicable, data sharing with third parties and/or cross-border data transfers. There are some limitations and constraints, due mainly to data processing activities carried out outside the organization that has recorded its data processing activities. If an organization shares data with data processors, each data processor should register the corresponding DP-ID of the data processing linked to the received data. The DP-ID application automatically invites these external data processors to complete the information on their interconnected data processing. However, this relies on the goodwill and interest of these processors. As it is done on a voluntary basis, some external data processing activities may be missing. In order to overcome this risk, the value proposition must be strong enough, for instance, by
facilitating compliance with the legal obligations contained in data protection regulations such as the GDPR.
12 Conclusion and Future Work
The DP-ID registry continues to be researched and developed following an agile and iterative development process. It brings together a community of legal experts and engineers to work hand in hand on new solutions and tools to manage an exponentially growing volume of data and to support compliance with increasingly complex regulatory norms. Researchers and companies interested in working with us on this project are welcome to contact us.
References
1. Fed4FIRE+ European research project, https://www.fed4fire.eu/, last accessed 2022/02/25.
2. Recommendation Q.4068: Open application program interfaces (APIs) for interoperable testbed federations, https://www.itu.int/rec/T-REC-Q.4068-202108-P, last accessed 2022/02/25.
3. Global data processing identifiers registry, https://www.dp-id.com, last accessed 2022/02/25.
4. Europrivacy website, https://europrivacy.com/, last accessed 2022/02/25.
5. Privacy Pact website, https://www.privacypact.com/, last accessed 2022/02/25.
Chapter 5
Europrivacy Paradigm Shift in Certification Models for Privacy and Data Protection Compliance Sébastien Ziegler, Ana Maria Pacheco Huamani, Stea-Maria Miteva, Adrián Quesada Rodriguez, and Renata Radocz
Abstract The European General Data Protection Regulation (GDPR) makes over 70 references to compliance certification of data processing activities. While numerous certifications already exist in the domain of cybersecurity and data management, the GDPR contains very specific requirements that make the regular certification model inadequate. Europrivacy is a certification scheme developed through the European research programme. It has been specifically tailored to address the GDPR requirements. This chapter presents some of the most innovative characteristics of the Europrivacy model. Altogether, they lead to a paradigm shift compared to regular models for certification schemes with a hybrid model that combines characteristics of both universal and specialised certification schemes. Keywords GDPR certification · Hybrid certification model · Data protection · Privacy · Compliance management
S. Ziegler () · A. Quesada Rodriguez · R. Radocz European Centre for Certification and Privacy, Luxembourg, Luxembourg Mandat International, Geneva, Switzerland e-mail: [email protected] A. M. P. Huamani · S.-M. Miteva European Centre for Certification and Privacy, Luxembourg, Luxembourg Archimede Solutions, Geneva, Switzerland © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 S. Schiffner et al. (eds.), Privacy Symposium 2022, https://doi.org/10.1007/978-3-031-09901-4_5
1 Introduction
1.1 Europrivacy Genesis
The European General Data Protection Regulation (GDPR) makes over 70 references to compliance certification of data processing activities. While numerous certifications already exist in the domain of cybersecurity and data management, the GDPR contains very specific requirements, which, alongside recent guidelines by the EDPB [1], impair the applicability of regular certification models. Europrivacy is a certification scheme [2] developed through the European research programme Horizon 2020 to assess and certify the compliance of data processing activities with the European General Data Protection Regulation (GDPR) and other applicable regulations for data protection [3]. The Europrivacy certification scheme has been transferred to the European Centre for Certification and Privacy (ECCP) in Luxembourg, which is in charge of its management. It is supervised and kept up to date by the Europrivacy International Board of Experts established by ECCP. It was the first certification scheme brought by a Member State (Luxembourg) to the European Data Protection Board (EDPB) for consideration as an official European GDPR certification scheme under Art. 42 GDPR. The Europrivacy certification scheme has been researched and developed to be applicable to a large variety of data processing activities. This process has resulted in rethinking and reinventing the usual certification scheme model and proposing a new one with a paradigm shift. All this has led to the creation of an innovative hybrid certification scheme that combines the advantages of universal certification schemes and specialised certification schemes, as explained in Sect. 3.
1.2 Purpose and Scope of the Chapter
This chapter aims at presenting how Europrivacy has addressed some of the key challenges of delivering a certification scheme that complies with the GDPR requirements. After contextualising the role and significance of 'certification' in the light of the GDPR, the chapter focuses on the following four specific aspects that have been subject to the scheme's research activities and subsequently implemented in Europrivacy itself:
1. Overcoming the dilemma between specialised and universal certification schemes to address technology and domain-specific requirements through a single scheme
2. Supporting multi-jurisdictional requirements
3. Addressing a fast-changing normative environment
4. Reducing the risk of subjectivity in certification processes
2 GDPR Certification
The primary aim of the GDPR is to protect the rights and freedoms of natural persons with regard to the processing of their personal data. It enhances their rights to the protection of their personal data. Furthermore, it lays down a set of rules relating to the free movement of personal data, facilitating the data flow in the digital single market, thus developing the digital economy across the internal market and improving business opportunities [4]. In doing so, the Regulation does not only empower data subjects with certain rights; it obliges any 'establishment of a controller or a processor in the Union' [5] whose activities involve processing of personal data to follow certain principles, such as lawfulness, fairness and transparency, purpose limitation, data minimisation, accuracy, storage limitation, integrity and confidentiality, and accountability [6]. The principle of accountability is a cornerstone of the GDPR. According to Art. 5(2) GDPR, the controller is responsible not only for complying with the Regulation but also for demonstrating its compliance. Adherence to approved codes of conduct (Art. 40 GDPR) or an approved certification mechanism (Art. 42 GDPR) is explicitly recognised as a means to demonstrate compliance. The principle of accountability entails a more profound function, which goes beyond solely serving as a form of assessment of statutory obligations; there are multiple layers to this principle. On the one hand, the accountable controller pays respect to the data subjects and their fundamental human right to data protection. On the other hand, the controller's accountability assures external stakeholders, from business partners to Data Protection Authorities. A key aspect in each of these cases is the creation of trust and trustworthiness. Consequently, being able to effectively demonstrate compliance is a valuable asset to any controller. Even more so if compliance is demonstrated by an external certification, which provides a recognised external validation. While certification under the GDPR is optional and voluntary, the GDPR makes over 70 references to certification. According to GDPR Art. 42(1): 'The Member States, the supervisory authorities, the Board and the Commission shall encourage, in particular at Union level, the establishment of data protection certification mechanisms and of data protection seals and marks, for the purpose of demonstrating compliance with this Regulation of processing operations by controllers and processors. The specific needs of micro, small and medium-sized enterprises shall be taken into account' [7]. Recital 100 of the Regulation also clarifies that certification shall enable 'data subjects to quickly assess the level of data protection of relevant products and services' [8]. As a consequence, a GDPR certification should be transparent, easy to understand, and closely aligned with the text of the Regulation. The use of a certification scheme recognised under Art. 42 GDPR will be interpreted and understood by most data subjects, in line with Recital 100, as an indication of compliance of the certified data processing with all the obligations contained in the GDPR. Conversely, data
subjects may feel misled by certification schemes that only partly address GDPR obligations, which could damage their trust in the certification. There are already existing privacy-related certification schemes and seals available on the market. However, the GDPR and the EDPB have set a series of formal requirements that are particularly restrictive in order to ensure that certification under Art. 42 GDPR is reserved for highly reliable certification processes. It is nonetheless important to stress the provision in Art. 42(4) [9] that a GDPR certification does not equal recognition of a 'blanket' compliance. It shall provide a fair indication of compliance with, or conformity to, certification criteria derived from the GDPR through an impartial third-party assessment performed by qualified experts at a given point in time, but it does not constitute a guarantee of compliance. Moreover, the scope of certification has been clarified by the EDPB as being applicable to data processing only [1]. As a consequence, Art. 42 GDPR is not applicable to targets of evaluation that would include all the data processing activities of a company. That is one of the reasons why certifications of management systems, such as ISO/IEC 27001 or 27701, are not eligible for Art. 42 GDPR. This restriction with regard to the target of evaluation requires focusing on specific data processing activities. On the one hand, this may be an issue, particularly for SMEs with limited resources, when aiming to certify all their data processing activities. On the other hand, it enables SMEs to select priority data processing activities to start with. They can proceed step by step and certify only the data processing activities for which a certification makes sense in terms of risk management or interaction with third parties. It also ensures that the target of evaluation can be effectively assessed and certified in a meaningful and reliable manner. Indeed, certifying a company as a whole, considering the number of data processing activities involved, may be neither realistic nor reliable.
3 Certification Scheme Model Dilemma
There are three main models of certification schemes:
A. Partial certification schemes that focus on assessing only specific obligations of the regulation
B. Universal certification schemes that are applicable to any data processing activities
C. Specialised certification schemes that have been developed and tailored to focus on specific categories of data processing activities
These models contribute to reducing the risk of non-compliance but face inherent limitations.
3.1 Partial Certification Schemes Limits
A partial certification scheme will focus on a specific subset of obligations, for instance, the qualification of the Data Protection Officer (Art. 37 GDPR), the technical and security measures in place (Art. 32 GDPR), or the accountability mechanisms (Art. 5 and 24 GDPR). As previously indicated, a partial certification is likely to be misleading and misinterpreted by the data subjects. Partial certifications may also mislead data controllers when selecting a data processor whose data processing has been certified under such certification schemes on the basis of Art. 28(5) GDPR.
3.2 Universal Certification Schemes Limits
The main advantage of universal certification schemes is that they propose to cover all data processing activities and deliver homogeneous and consistent certifications. Such certifications are a cost-effective option for small and medium enterprises (SMEs), which is deemed an important consideration under Art. 42(1) GDPR. However, there is a high probability that a universal certification does not adequately address domain- and technology-specific requirements and the resulting risks for the data subjects. For instance, when certifying a website, the evaluation of cookie policy compliance would constitute an important aspect to be assessed. Certain specific rules published by the Article 29 Working Party and the EDPB, as well as by national supervisory authorities, apply to the use of cookies by data controllers. Not assessing them would expose the data subjects and controllers to important risks. On the other hand, including criteria on cookie policies in a universal certification scheme would not make sense for all the data processing activities that do not use Web interfaces.
3.3 Specialised Certification Schemes Limits
Specialised certification schemes focus on specific categories of data processing activities, such as video surveillance, websites, Internet of Things deployments, medical data processing, or smart metering. An important benefit of such specialised certification schemes is their ability to address technology-specific obligations. If we take our previous example of cookie policies, a specialised certification scheme for data processing in websites can easily focus on the cookie policy requirements. It offers a tailored and more precise assessment process. Nevertheless, specialised certification schemes face several major limits that we will detail below.
Fig. 5.1 Example of a single data processing based on a smartwatch
Data processing is becoming more and more complex. A single data processing activity tends to combine and involve several technologies. By focusing on specific requirements of a specific technology or application domain, these specialised schemes may be fit to assess the compliance of the data processing with one of the technologies involved, but not with the others. For instance, let us consider a simple data processing activity collecting physiological data, such as heartbeat, through a smartwatch and displaying the results on the smartphone of the smartwatch owner, the data subject. This example is inspired by Gatekeeper, an ongoing European research project in which we are participating [10]. As presented in Fig. 5.1, the data processing starts with the smartwatch, an Internet of Things (IoT) device. As showcased below, this kind of data processing collects biometric data and sends them through a cellular network to a server in the cloud. There the data are usually analysed by artificial intelligence algorithms in order to extract relevant information to be displayed on the smartphone application of the data subject. Using a specialised certification scheme for the Internet of Things would be adequate for the smartwatch part of the data processing but would not adequately cover the risks for the data subjects related to the use of artificial intelligence, health data, or mobile phone applications. Similarly, a specialised certification scheme for processing in the cloud or based on artificial intelligence would face the same limitations by delivering only a partial, incomplete, and potentially misleading assessment of compliance. Using a single specialised certification scheme would be inadequate to deliver a reliable and comprehensive certification of data processing that has an increasingly complex nature. In order to have a comprehensive compliance assessment, the data processing needs to be certified by a number of specialised certification schemes: a specialised certification scheme for IoT, another one for the cloud, another one for the use of artificial intelligence, and a last one for the smartphone application. Such a composite certification can be performed in theory. However, combining several specialised certification schemes for a single data processing activity requires multiplying the certification processes and costs, which would end up being too costly and probably not affordable for SMEs.
This remark is all the more important since a single company can control hundreds of data processing activities. Should such a company be interested in certifying all the data processing activities that expose it to legal and financial risks, and should it apply distinct specialised certification schemes to each of them, it will face the following problems:
1. Finding specialised schemes for each category of data processing is unlikely and would result in excluding a series of data processing activities from being certified.
2. Preparing certification with many different certification schemes, methodologies, and levels of criteria granularity would make the exercise quite complex, difficult, and time-consuming.
3. As each certification scheme requires specific training and qualification of its auditors, companies will most likely have to contract several certification bodies and audit teams to certify their various data processing activities.
4. The fragmentation of qualified auditors will increase the cost of certification.
One way or another, enterprises will be charged for these additional costs related to the use of a multiplicity of certification schemes. They will end up paying not only for the fragmentation and multiplication of qualifications but also for the inevitable administrative costs. This is particularly burdensome for SMEs and would be in contradiction with Art. 42(1) GDPR, which states that 'the specific needs of micro, small and medium-sized enterprises shall be taken into account'. Finally, it is unlikely that enough specialised certification schemes will be specified to address each and every category of data processing activities. As a consequence, focusing on specialised certification schemes would lead to excluding many data processing activities from GDPR certification.
4 The Europrivacy Hybrid Model
To address and overcome the inherent limits of the abovementioned models of certification, Europrivacy has developed an innovative hybrid model of certification that combines the advantages of the universal certification scheme with the added value of specialised certification schemes. It intends to combine the advantages of a universal certification scheme, with its comprehensive list of core criteria, together with complementary domain- and technology-specific criteria. Hence, the Europrivacy certification scheme is structured into distinct sets of criteria, including two important ones that characterise the hybrid model:
A. The Europrivacy Core GDPR Criteria
B. The Europrivacy complementary contextual criteria for assessing domain- and technology-specific data protection obligations
The applicability of the complementary checks and controls is determined by factual and objective factors, such as the technology used by the data processing in
Fig. 5.2 Example of Europrivacy Hybrid Model applicability to a biometric data processing
the target of evaluation. An important requirement is to avoid the risk of subjective application of the scheme by the auditor. That is why the application of the second category of criteria is triggered by the specific characteristics of the data processing to be assessed in the target of evaluation and not by the choice of the auditor. Taking again the previously mentioned case of smartwatch data processing as an example, a Europrivacy auditor will start by assessing the compliance of the data processing in the target of evaluation with (A) the Europrivacy Core GDPR Criteria. Once this initial compliance is validated, the auditor will complement it by applying (B) the Europrivacy complementary contextual criteria that are applicable to AI, IoT, Blockchain, and websites. These additional criteria have been developed on the basis of the EDPB guidelines and European research projects with experts in the corresponding domains to address these specific risks for the data subjects (Fig. 5.2). Additionally, for cases with increased complexity, Europrivacy's design allows the scheme to be easily combined with other certification schemes and complementary requirements. For instance, in the context of IoT data processing, it is possible to combine a Europrivacy certification with domain-specific standards, such as IEC 62443 or ETSI EN 303 645. This approach ensures a systematic, comprehensive, and highly reliable assessment of data protection. While Europrivacy certifications must comply with a set of Core GDPR Criteria, encompassing the core obligations of the GDPR applicable to all data processing, the auditor must also apply these complementary criteria to assess the requirements associated with specific technologies or application domains that are present in the target of evaluation. These domain- and technology-specific complementary criteria can be further extended to address the evolution of technology and the associated risks for the data subjects. These complementary criteria are contextual, but they are not optional. If they are applicable to the target of evaluation, they must be assessed and validated. A non-conformity with them has the same effect as a non-conformity with one of the core criteria.
5 Supporting Multi-jurisdictional Requirements
Another challenge that emerges in the context of GDPR certification is related to the coexistence of the GDPR at the European level with various other data protection regulations at the national level. EU Member States have adopted diverse national specific obligations, for instance, with regard to the age of consent for minors. This is even more obviously the case in non-EU jurisdictions, which are not directly subject to the GDPR and have adopted distinct data protection regulations. Nowadays, data processing activities are quite often distributed over several jurisdictions and involve cross-border data transfers. For instance, a single website located in a given country may be offering online services to users based in many different countries. With regard to national obligations within the European Economic Area (European Union Member States plus three additional countries subject to direct GDPR applicability), the Europrivacy certification requires a specific compliance assessment of the target of evaluation against the complementary national obligations that are applicable to the Applicant. This assessment is performed by a qualified legal expert and documented in the form of a National Obligations Compliance Assessment Report (NOCAR). With regard to other national regulations, Europrivacy has been designed since its inception to be easily extensible to non-EU regulations. The research work around Europrivacy was developed to address both the GDPR and the Swiss Federal Data Protection Act requirements. Subsequently, the structure of the certification scheme has been optimised to easily extend the assessment process to other complementary national obligations. Since then, the research team has worked on more than 15 non-EU regulations. As a result, international and cross-border data processing can be certified through a single certification scheme (Europrivacy) that can certify compliance with the GDPR and other data protection regulations.
6 Addressing a Fast-Changing Normative Environment
A further challenge for a GDPR certification is related to its very purpose. As stated in Art. 42(1) GDPR, the certification aims at 'demonstrating compliance with this regulation'. Data protection obligations contained in the GDPR are evolving together with the related jurisprudence, the adoption of complementary regulations at the European (e.g. the ePrivacy and NIS Directives) or national level, and the publication of EDPB guidelines that intend to clarify the interpretation of the norm and contribute to the development of associated soft law. Consequently, a GDPR certification needs to evolve together with these regulatory and normative changes. A direct consequence is that Europrivacy has been designed to be more agile than regular certification schemes, which tend to be frozen and unchanged for several years. Regulatory and normative changes, as
well as jurisprudence and technology developments, are closely monitored by the Europrivacy International Board of Experts in order to keep the certification scheme and its criteria consistent with the evolution of legislative initiatives and technology-related risks. The Europrivacy certification scheme requires lead auditors to systematically report to ECCP any identified areas of improvement regarding the scheme and its criteria. This feedback is compiled and reviewed by the Europrivacy International Board of Experts. Additionally, Europrivacy is supported by a community of experts (through the Europrivacy community website) and a network of research centres and partners. Where required, the Europrivacy International Board of Experts transmits its intended changes to the Supervisory Authority or the EDPB for approval. Once validated, changes in the Europrivacy requirements are communicated to all Certification Bodies, which are then in charge of informing their certified clients. Additionally, all amendments are communicated through the Europrivacy Community platform as well as through a dedicated email alert channel. Companies and administrations with certified data processing are expected to address and comply with the additional requirements without undue delay, and their continuous compliance will be assessed by the auditor during the next audit. When performing surveillance or recertification audits, criterion S.1.4.2 in the 'S – Complementary Surveillance Checks and Controls' set requires the auditor to systematically verify the compliance of the target of evaluation with any normative or requirement changes.
7 Reducing the Risk of Subjectivity in Certification Processes
The risk of auditors' subjectivity when assessing the compliance of a data processing activity with criteria is an important challenge [2]. To effectively address and mitigate this risk, Europrivacy had to develop a new model of criteria, which focuses on the evaluation of factual requirements and uses precise wording that minimises potentially diverging interpretations. An important effort has been dedicated to fine-tuning the wording of each criterion to minimise the risk of subjective interpretation. The structure of the criteria uses clear Boolean relationships among the requirements mentioned in each criterion. Such a model reduces the risk that an auditor would assess only part of the requirements contained in a criterion. Moreover, this new format makes it easy to parse and integrate the criteria into software and applications. More generally, all Europrivacy criteria were defined and specified in order to be
• Adequate for assessing the compliance with the corresponding legal requirement
• Auditable by ensuring that the requirement can be effectively assessed and demonstrated
• Objective and factual by focusing on verifiable facts and evidence to minimise the subjectivity in the assessment
• Clearly worded to avoid any ambiguity or room for misinterpretation
• Homogeneously applicable with the same relevance to diverse data processing activities and data controllers
• Efficient to deliver a reliable assessment of compliance, without unnecessary workload
• Party Neutral by ensuring it can be used by the Applicant, the Certification Body, and any other third party
In addition, the use of the criteria is supported by an online academy designed to train implementers and auditors [11], as well as a community website providing online resources [12].
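The criteria themselves are not reproduced in this chapter, so the following Python sketch is purely hypothetical: it only illustrates how a criterion whose requirements are linked by explicit Boolean relationships, as described above, could be represented and evaluated in software. The article references and requirement texts are invented for illustration.

```python
# Hypothetical illustration of a machine-readable criterion with Boolean structure.
# The actual Europrivacy criteria format is not published here; all names are invented.
from dataclasses import dataclass

@dataclass
class Requirement:
    reference: str      # legal basis the requirement is derived from (illustrative)
    description: str
    satisfied: bool     # outcome of the auditor's factual check

def all_of(*reqs: Requirement) -> bool:
    """AND relationship: every requirement must be demonstrated."""
    return all(r.satisfied for r in reqs)

def any_of(*reqs: Requirement) -> bool:
    """OR relationship: at least one alternative requirement must be demonstrated."""
    return any(r.satisfied for r in reqs)

criterion_passed = all_of(
    Requirement("Art. 13(1)(a) GDPR", "Controller identity and contact details provided", True),
    Requirement("Art. 13(1)(b) GDPR", "DPO contact details provided", True),
)
print(criterion_passed)  # True -> the criterion is validated
```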
8 Conclusion and Future Work
The research around Europrivacy has required us to innovate in the way certification schemes are designed, structured, applied, and specified. The research has led to the development of a hybrid model of certification that combines the advantages of a universal certification scheme applicable to all sorts of data processing while considering the domain- and technology-specific risks and requirements associated with the processing activities to be certified. The inherent complexity and constant evolution of data processing regulations inspired the creation of a modular certification scheme that is extensible to complementary national regulations, as well as to complementary domain- and technology-specific requirements. To further develop and validate its innovative model, the Europrivacy research team is currently collaborating with several European research projects in diverse domains such as medical data, artificial intelligence, smart grids, and connected vehicles. The application of Europrivacy's pioneering model to other regulations, such as the upcoming European Regulation on Artificial Intelligence in the domain of medical data, is being researched and currently prepared. More generally, our research team is engaged in researching, extending, and reinventing certification processes and models. Researchers and companies interested in working with us are welcome to contact us.
References
1. See: European Data Protection Board. Guidelines 1/2018 on certification and identifying certification criteria in accordance with Articles 42 and 43 of the Regulation - version adopted after public consultation. Available at: https://edpb.europa.eu/our-work-tools/ourdocuments/guidelines/guidelines-12018-certification-and-identifying_en. And its addendum: Guidance on certification criteria assessment (Addendum to Guidelines 1/2018 on certification and identifying certification criteria in accordance with Articles 42 and 43 of the Regulation). Available at: https://edpb.europa.eu/our-work-tools/documents/public-consultations/2021/guidance-certification-criteria-assessment_en
2. Both the Europrivacy Certification Scheme, its criteria, and associated documentation can be accessed through the Europrivacy Academy (academy.europrivacy.com) and Community (community.europrivacy.com) sites.
3. Europrivacy has been developed through the European research programme and its management has been transferred to the European Centre for Certification and Privacy. More information available at: https://www.europrivacy.org.
4. Article 1(1),(2) Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation).
5. Art. 3(1) Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation).
6. Chapter II Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation).
7. Art. 42(1) Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation).
8. GDPR Recital 100 of the Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) states that: 'In order to enhance transparency and compliance with this Regulation, the establishment of certification mechanisms and data protection seals and marks should be encouraged, allowing data subjects to quickly assess the level of data protection of relevant products and services.'
9. 'A certification pursuant to this Article does not reduce the responsibility of the controller or the processor for compliance with this Regulation( . . . )', Article 42(4) Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation).
10. GATEKEEPER is a European Multi Centric Large-Scale Pilot on Smart Living Environments. More information available at: https://www.gatekeeper-project.eu/
11. Europrivacy online academy available at: https://academy.europrivacy.com
12. Europrivacy community website available at: https://community.europrivacy.com
Part III
What is Beyond Brussels? International Norms and Their Interactions with the EU
Chapter 6
Untying the Gordian Knot: Legally Compliant Sound Data Collection and Processing for TTS Systems in China
Stefanie Meyer, Sven Albrecht, Maximilian Eibl, Günter Daniel Rey, Josef Schmied, Rewa Tamboli, Stefan Taubert, and Dagmar Gesmann-Nuissl
Abstract Within a remarkably short period of time, China has enacted a third data protection law, the "Personal Information Protection Law," which came into force on November 1, 2021. With the previously enacted "Cybersecurity Law" and "Data Security Law," Chinese data protection law offers numerous regulations that also need to be complied with by European addressees given their wide scope of application with, in part, extraterritorial claims. If data is to be collected and partially processed in China as part of an international and interdisciplinary research project, all applicable European and Chinese regulations have to be observed. This contribution compares the regulations and draws conclusions about similarities and differences in legislation across countries in order to ensure data collection and processing in conformity with data protection. It can be seen that the General Data Protection Regulation and the Personal Information Protection Law are similar in many respects, particularly with regard to the regulatory system and content of the regulations. Nonetheless, while the primary focus of the General Data Protection Regulation is on the protection of individuals and fundamental rights, the Personal Information Protection Law also formulates governmental goals that, in direct comparison, may seem surprising and, in some circumstances, give cause for concern in terms of legal policy.
We show that data collection and processing in an international research project are possible and feasible, and we provide suggestions for best practices.
Keywords Data protection law · International data collection · GDPR and PIPL
S. Meyer () · D. Gesmann-Nuissl
Professorship for Private Law and Intellectual Property Rights, Chemnitz University of Technology, Chemnitz, Germany
e-mail: [email protected]
S. Albrecht · J. Schmied
Professorship for English Language and Linguistics, Chemnitz University of Technology, Chemnitz, Germany
M. Eibl · S. Taubert
Professorship for Media Informatics, Chemnitz University of Technology, Chemnitz, Germany
G. D. Rey · R. Tamboli
Professorship for Psychology of Learning with Digital Media, Chemnitz University of Technology, Chemnitz, Germany
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
S. Schiffner et al. (eds.), Privacy Symposium 2022, https://doi.org/10.1007/978-3-031-09901-4_6
1 Introduction
Traditionally, data protection has not featured prominently in public discourse in and about the People's Republic of China [25]. This might be attributed to the omnipresence of government and corporate surveillance in combination with little awareness of privacy in the general population [4] and a general focus on economic and technological development. The issue of privacy entered public discourse in the People's Republic of China in the 1990s [25], raising legislators' awareness of the topic [19]; China has thus moved closer to other Asian countries, such as Korea [3] and Japan [11, 12, 17]. China has made fundamental reforms in data protection in recent years, resulting in the "Cybersecurity Law" (CSL) in 2016 and, most recently, the "Data Security Law" (DSL) and the "Personal Information Protection Law" (PIPL), enacted in June 2021 and August 2021, respectively. The DSL came into force on September 1, and the PIPL followed on November 1, 2021. China has thus created a standard that is, perhaps surprisingly for observers, similar in many respects to the European General Data Protection Regulation (GDPR) and departs from the comparatively softer data protection regulations in the United States. However, it is important to note that the PIPL is primarily addressed to private sector operators (see Sect. 3.3). Although the obligations of state institutions are regulated in Art. 33–37 PIPL, they are nowhere near as intensive as the regulations for private operators. Especially in the private sector, European transnational data traffic often uses Standard Contractual Clauses [Implementing Decision (EU) 2021/914 of the EU Commission dated June 4, 2021 – Ref. C(2021) 3972, OJ EU No. L 199/31 of June 7, 2021], so that the scope of application of the PIPL may lose additional significance, although it can be assumed that China intends to counteract this contractual corrective through its legislation. The development of data protection law in China has an impact on data processing that is pursued by European researchers in connection with China, as this chapter will explain. We will focus on the respective legal frameworks and leave aside any contractual arrangements, such as the Standard Contractual Clauses just mentioned, although this option of course offers further extensive possibilities for researchers. The subject of many sciences is the human being – especially in psychology or computer science, personal data form the basis for gaining scientific knowledge. For this reason, science requires sufficiently precise and valid personal data, along with the possibility of linking these data and evaluating them in a targeted manner. In the GDPR, this need was recognized and exceptions to the strict principles of data protection were created at various points in order to balance scientific freedom
(cf. Art. 13 CFR) and the protection of the fundamental rights to data protection and informational self-determination (Art. 7, 8 CFR) to support researchers in their work with personal data. However, special challenges arise when part of the data processing necessarily has to take place abroad. In order to achieve the research goal of a research project used as an example (see Sect. 2), data have to be collected in China and sent to Europe for further processing in compliance with data protection laws. This challenges the researchers in that both the GDPR and China’s PIPL, which was adopted in August 2021, have an extraterritorial scope of application, and therefore the requirements of the two legislations (along with supplementary provisions) have to be reconciled. Similarities and differences between the two legal provisions will be examined here using the example of the research project, which can certainly only represent an excerpt of the wealth of regulations. To this end, the project and its data collection procedure will first be described (Sect. 2), followed by a description of the legal influences of data protection regulations on data collection and data implementation (Sect. 3). Finally, in Sect. 4, the regulations will be compared and contrasted. Overall, from a legal perspective, we show that data collection in China seems to be a Gordian knot as complex data protection standards in quite different legal systems seem to complicate research on both sides, but it can be untied thanks to the adoption of data protection standards in China that are similar to EU requirements.
2 Sample Project Description
The project just mentioned, which will serve as an example for the investigation, is an interdisciplinary project entitled "Credible Conversational Pedagogical Agents." It is carried out within the framework of the Collaborative Research Center CRC 1410 "Hybrid Societies," which focuses on the interaction of humans with technology (https://hybrid-societies.org/).
2.1 Project Frame and Objectives
The project, which brings together interdisciplinary research by linguists, psychologists, and computer scientists, combines conversational and pedagogical agents to create a Conversational Pedagogical Agent (CPA) designed to improve linguistic credibility1 through specific linguistic cues [31]. The aim is to establish credible digital teachers for non-native learners, so the focus of this project is to investigate the specific non-native linguistic behavior of CPAs interacting with humans.
To this end, the Conversational Pedagogical Agent needs to integrate a text-to-speech system that is able to transform a text into specific acoustically perceptible English using speech synthesis methods. In order to be able to integrate a Chinese accent in a meaningful way, speech data of native speakers have to be collected beforehand, which can be used as a basis for further processing in this system. For this purpose, interviews are conducted with some participants in their home country (China). This interview data, consisting of voice recordings and metadata, has to be stored permanently.
1 Linguistic credibility differs from other forms of credibility such as journalistic credibility or media credibility by focusing on linguistic cues that facilitate accepting the communication partner (sociolinguistic) and accepting the truth value of the message (pragma-semantic).
Fig. 6.1 Illustration of data flows within Germany (green) and outside Europe (red)
2.2 Focus on Data Collection and Processing

The following description is intended to facilitate the understanding of the legal problems, to enable the targeted identification of specific issues, and to support the development of conceivable solutions that are secure under current data protection law. Figure 6.1 shows the data flows that occur during data collection for technical reasons.

Technical Procedure During the interviews, data of the participants are recorded, the so-called interview data. This consists of voice recordings and metadata, both of which are stored permanently. Using this information, phonetic transcriptions of the recordings are created, which are then used together with the raw audio to train the neural speech synthesis network. In combination with the metadata, which is stored
separately, it is possible to generate user-defined accents by specifying, for example, information such as gender or province of residence. Although the two data sets are stored separately, it is still necessary to link them back together for implementation and training; to pseudonymize the participants, each person is assigned a number ("PersonID"). These PersonIDs are created using shuf from the GNU coreutils package, which produces adequately random IDs from which no order of participants can be determined. The data collection takes place at Sun Yat-sen University (中山大学, SYSU) in Zhuhai/Guangzhou, China, a top-tier public university. Our partners there record reading passages on a Zoom H4n handheld recorder and use the same device to record interviews conducted via Tencent Meet (branded Voov for Western markets). The interviews are also recorded using the cloud recording functionality of the conferencing platform as a failsafe. The recordings from the Zoom H4n recorder are then transferred to a data collection computer at SYSU, stored in an encrypted container, and uploaded to the TUCcloud (a NextCloud instance hosted by TU Chemnitz). The Tencent Meet recordings are downloaded directly to a full-disk-encrypted data collection computer at TU Chemnitz. The metadata collection is done via the LimeSurvey platform hosted by BPS Bildungsportal Sachsen GmbH; the results are exported and stored on the encrypted data collection computer.

Data Collected Data is collected by interviewing the participants and having them read out text passages. The content of the interview is not relevant to the purpose of the project; usually, topics are chosen that the participants feel comfortable with, for example, childhood, school, or hobbies. Various publicly available texts offering a rich variation of linguistic features are chosen as reading passages. In total, the process (interview and reading passage) takes about 2 hours, with the reading passage part lasting continuously for 30 minutes. In addition, metadata is collected in order to be able to assign the acoustic accents to a geographical area and thus train the speech synthesis network. For this purpose, certain personal data are requested, such as the participant's English proficiency, gender, age, major, and data on places of residence, that is, the Chinese province in which the person was born and had longer stays, as well as whether longer stays occurred abroad. No other data is collected. For the sake of completeness, it should be mentioned that the data collection is supervised by the ethics committee of Chemnitz University of Technology to ensure that ethical standards are met.
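To make the pseudonymization step more concrete, the following is a minimal sketch of how the PersonID assignment described above could look in practice. It is illustrative only: the file names, the ID range, and the use of Python as a wrapper are assumptions rather than part of the project's actual tooling; only the use of shuf from GNU coreutils to derive random, order-free IDs is taken from the description above.

```python
# Minimal sketch (assumptions: file names, ID range, Python wrapper around shuf).
# It assigns a random PersonID to each participant and keeps the mapping in a
# separate file, so that recordings and metadata can be stored under the PersonID only.
import csv
import subprocess

participants = ["participant_A", "participant_B", "participant_C"]  # hypothetical

# GNU coreutils shuf: emit len(participants) distinct random integers from 1000-9999.
out = subprocess.run(
    ["shuf", "-i", "1000-9999", "-n", str(len(participants))],
    capture_output=True, text=True, check=True,
)
person_ids = out.stdout.split()

# The mapping (the re-identification key) is written to its own file, stored
# separately from the interview data and metadata, as described in the text.
with open("personid_mapping.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["participant", "person_id"])
    for name, pid in zip(participants, person_ids):
        writer.writerow([name, pid])
```

Because the IDs come from a shuffled range rather than a counter, the order of participation cannot be reconstructed from them; re-identification is possible only via the separately stored mapping, which is what makes this pseudonymization rather than anonymization.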
3 Legal Impacts on the Collection of Data

Due to the special focus of the research project, the processing of the data takes place both in China (collection, sending) and in Germany, Europe (evaluation, linking, implementation in the TTS system). It is therefore of interest, and as a first step necessary, to clarify which law is applicable in this cross-border situation. It should be taken into account that both countries started from different initial positions, which
contributed to the creation of their respective data protection regulations: the GDPR was created to protect natural persons and their fundamental rights and freedoms (Art. 1 (1) and (2) GDPR), while the PIPL aims to create a fair environment for governments, businesses, industry organizations, and the public (Art. 11 PIPL) and thus to foster their participation in data protection. In addition, Art. 1 PIPL defines the overarching goal of standardizing data processing activities. In this way, China's economic competitiveness is to be strengthened in the long term [25].
3.1 Legal Basis for Transnational Research

According to Art. 3 (1) GDPR, the GDPR applies even if the data processing does not take place in the European Union (EU), provided that the controller processes the data in the context of the activities of its establishment in the EU. Article 4 No. 7 GDPR defines who the "controller" is in the sense of the GDPR. For research projects, it is argued that the researchers are the controllers since they have a decisive impact on the type, scope, and purpose of the data processing [27]; however, this cannot be asserted as a general rule, and a consideration of the specific individual case is necessary. Since the employees of the university run the research project as part of their official duties (this is also specified in the employment contract), it is the university that bears responsibility externally and thus toward the persons concerned [27]. In our case, the university is located in Chemnitz, Germany, and thus in any case within the EU, so that the GDPR applies. While the Chinese CSL follows the territoriality principle (Art. 2 CSL) and is only applicable to matters that take place within the territory of the People's Republic of China [25], the DSL (Art. 2 DSL) and the PIPL (Art. 3 (2) No. 2, Art. 38–43 PIPL) also provide for extraterritorial application. For this reason, all Chinese data protection regulations have to be observed in addition to the GDPR when processing data on Chinese territory (cf. the stations marked in red in Fig. 6.1); the DSL and, in particular, the PIPL may also apply to the transfer and evaluation of data on European territory. This invites a comparison of the different prerequisites of the provisions and of their regulatory content. Extending the analysis to all three current Chinese data protection laws would go beyond the scope of this study and reduce its clarity, which is why the following focuses on the most recent law – the PIPL.
3.2 Applicability of GDPR and PIPL

Both data protection acts set out basic requirements that need to be fulfilled in order for the statutory provisions to be applicable. Basically, the prerequisites are the same: the data concerned has to be personal data or personal information that
is "processed" within the meaning of the laws. The standards are directed at the respective data controllers, who are specified in more detail in the laws.

Personal Data/Personal Information The GDPR provides a legal definition of personal data in Art. 4 No. 1 GDPR. According to this, personal data is any information relating to an identified or identifiable natural person, the so-called data subject. Chinese data protection law also defines the term personal information. Unlike the United States, which affirms a sufficient personal reference only if a person can be directly identified [Privacy Act, 5 U.S.C. § 552a (a) (2)], China, like the EU, also allows identifiability to be sufficient. Article 4 PIPL defines personal information as information relating to identified or identifiable natural persons that is recorded electronically or by other means. The definitions provided by the two legal acts are very similar (China already anticipates aspects of processing that the GDPR only includes in Art. 4 No. 2 GDPR). Since the beginning of the data protection movement in China, a distinction has been made there – as in the EU – between general and sensitive data. Such a distinction does not exist in the United States, for example [25, 32], which again underscores the convergence of Chinese data protection with European ideas [25]. General personal data include, for example, name and home address; sensitive data are data whose alteration or publication could negatively affect the data subject, such as health data (cf. Art. 4 No. 15 GDPR), political or religious beliefs, and location data [8, 9, 20, 23, 38]. In contrast to the GDPR, the PIPL even contains a legal definition of sensitive data in Art. 28 PIPL. Based on the wording of Art. 31 PIPL, all personal data of residents under 14 years of age are also considered sensitive data. In most research projects – no matter what kind – personal data are collected, since the gain in knowledge is often based on such data (psychology and computer science were mentioned at the beginning; many other disciplines, such as linguistics, could be added) [27]. In the interdisciplinary project described here, the metadata, consisting of gender, age, field of study, country of birth, etc., also has to be collected and linked to the language data. The metadata indisputably represent personal data according to the definitions just mentioned, regardless of whether the GDPR or the PIPL is followed. The fact that these data are stored by means of a pseudonymous PersonID is an important step toward data minimization (see below). However, the anonymization of data is itself a type of processing (see below) [2], and it first has to be verified whether this anonymization actually renders the persons no longer identifiable [30].

Processing The fact that the research work with the associated data collection, evaluation, and implementation in the TTS system constitutes (data) processing within the meaning of the EU and Chinese data protection laws is probably obvious, but it also follows from the wording of the relevant legislation. In the GDPR, a corresponding description of the term can be found in Art. 4 No. 2 GDPR, according to which processing is any operation performed with or without the assistance of automated procedures in relation to personal data – this includes
any data processing from collection to destruction. Article 4 (2) PIPL – following the definition of personal information – describes the activities that the "processing" of data includes, meaning the same as "processing" under the GDPR: activities such as collecting, storing, using, processing, transmitting, making available, and disclosing. The GDPR and the PIPL are thus also consistent with respect to this concept; data collection in China and its further processing are covered by the wording of both laws.
3.3 Addressees of the Legal Provisions

The addressees of the data protection provisions are the "controllers"; they are required to comply with the statutory provisions and may be obliged to demonstrate compliance. This conceptual classification is provided by both the GDPR and the PIPL. Article 4 No. 7 GDPR contains a definition of controller: the natural or legal person, public authority, agency, or other body that alone or jointly with others determines the purposes and means of the processing of personal data. In this regard, it has already been described in Sect. 3.1 that, due to the orientation of the research project and the position of the employees, it is the university that is responsible externally and toward the data subjects [27]. According to Art. 73 No. 1 PIPL, the controllers are organizations and individuals who determine the purpose and nature of the processing and other related issues and who are responsible for taking the appropriate measures. Despite the provisions for state institutions in Art. 33–37 PIPL, these institutions do not appear to be covered by the legal definition of responsible entity in Art. 73 No. 1 PIPL, which – in addition to the mitigated obligations – further diminishes their obligations under the act. Article 72 PIPL adds that the law is also applicable to the processing of personal data within the framework of statistical and archival management activities organized and carried out by the people's governments and their departments; the law does not apply to other state activities. In the present case, therefore, the university that employs the Chinese actors is also responsible under Chinese law. The main actors in the data collection are researchers at the Chinese university, as can be seen from the procedures of the research project "Credible Conversational Pedagogical Agents" described above. Therefore, from a European perspective, the construct of a data processor can be considered. According to Art. 4 No. 8 GDPR, a processor is a natural or legal person, authority, institution, or other body that processes personal data on behalf of the controller; this construct is regulated in Art. 28 GDPR. A processor relationship would be assumed if the Chinese actors are bound by the instructions of the researchers at the German university [34]. In this case, a contract between the parties involved would also be required (Art. 28 (3) GDPR). In addition to this commissioned processing, European law also recognizes joint responsibility, so-called joint control, which is standardized in Art. 26 GDPR. In the case of joint responsibility in this sense, the processing of the data is also carried out jointly (so that there are two controllers within the meaning of Art. 4
No. 7 GDPR); the decisive criterion for differentiation here is the authority to make decisions on the methods and purposes of the data processing [7]. For a long time, Chinese data protection law did not distinguish between controller and processor as provided for in the GDPR; this distinction was only added to the PIPL in the third draft [1]. The version of the PIPL now adopted, however, provides for this constellation. Joint control (Art. 20 PIPL) can be agreed between two or more processors if the data processing serves a common purpose; in the event of a violation of data subjects' rights, the controllers are jointly and severally liable. Cooperation mechanisms comparable to those for the processor can be found in Art. 21 PIPL. According to this, the controller – just as under the GDPR – is obliged to contractually agree with the processor on the purpose of the processing, the methods, the types of personal data, the protection mechanisms, and the rights and obligations of the parties, and to monitor compliance.
3.4 Data Processing in the EU and China

With the implementation of the Chinese data protection laws, data processing in China and Europe is equally structured as a prohibition subject to permission. This means that data processing is generally not permitted unless an element of permission applies. The GDPR conclusively lists the various legal grounds in Art. 6 GDPR. According to this, data processing is permissible if a legal provision allows it or the data subject has consented to it (Art. 6 GDPR) – a principle that applies to all phases of data processing. In the PIPL, the only relevant legal basis for private actors – consent – is standardized in Art. 13 (1) No. 1, 14 et seq. PIPL.

Legitimate Interests Here, there is a clear difference to European law: within the scope of application of the GDPR, data processing is permitted if it is necessary to fulfill the tasks of public bodies (Art. 6 (1) (e) GDPR). Now, it might be argued that the task of an institution of higher education is research, so that the processing of personal data can be regarded as legitimate on this basis. However, the processing of the data here and the special personal reference of the requested metadata (gender, province of residence) pose such a particular risk to the privacy or informational self-determination of the individual that no researcher should rely on this legitimation alone. For the same reason, permission on the basis of legitimate interest, Art. 6 (1) (f) GDPR, is not a viable basis either. The PIPL does not recognize permission based on a legitimate or justified interest; this is problematic because in comparable cases implied consent is then constructed, which somewhat undermines its value [20]. Instead, Art. 13 PIPL contains provisions that affirm the influence of the state. According to Art. 10 PIPL, no organization or private individual may participate in activities related to the handling of personal data that are detrimental to national security or the public interest. According to Art. 11 PIPL, the government establishes a structure for the protection of personal data to prevent infringing acts, including strengthening propaganda and awareness for the protection
of personal data. This can still be well justified on the basis of protecting national security [21]. However, more far-reaching regulations can also be found, such as Art. 13 (1) No. 5 PIPL. According to this provision, the processing of personal data for purposes of state propaganda and the monitoring of public opinion is permitted. A research institute, however, cannot invoke any of these grounds as a basis for legitimacy, so that consent alone remains.

Consent Alternatively, within the scope of application of the GDPR, the consent of the data subjects can legitimize the data processing. It should always be asked in advance what data processing is taking place and for what purpose – which also has to be documented precisely, otherwise there is a violation of Art. 5 (2) GDPR. The GDPR sets strict standards for the effectiveness of consent: according to Art. 4 No. 11 and Art. 7 GDPR, consent has to be freely given, specific, informed, unambiguous, and verifiable. The controller (in our case the university) needs to be able to prove that the data subject has consented to the processing of his or her data, Art. 7 (1) GDPR. This is an expression of the transparency requirement under Art. 5 (1) (a) GDPR and the accountability requirement under Art. 5 (2) GDPR. This requirement excludes any implied consent (see Rec. 32). In addition, the declaration of consent has to be easily understandable and accessible, Art. 7 (2) GDPR – another expression of the transparency requirement. Accordingly, the data subject should be provided, in advance of the data processing, with all necessary information as to what will happen to the data in the course of further research, so that they are sufficiently informed.2 This includes, in particular, a detailed description of the data collected, the purposes of processing, the possible further use, and the time of deletion. The information should be written in a way that is comprehensible to the intended audience [8, 33, 36, 38]. In addition, the data subject is to be informed about the possibility of revocation, Art. 7 (3) GDPR. The criterion of voluntariness already follows from Art. 8 CFR: an unfree or coerced expression of the data subject's will cannot legitimize any data processing [15, 20]. With regard to research, there are the exceptions mentioned at the beginning: a less specific description may be permissible if there are compelling scientific reasons why the form of processing cannot yet be fully determined.3 [27] This allows researchers to obtain declarations of consent, for example, for specific research areas or parts of research projects, without the need to obtain separate consent again later for each individual use (see Rec. 33 of the GDPR) [23, 26].
2 In some disciplines, such as sociolinguistics, that depend on collecting natural language data in casual settings, unimpeded by the presence of the researcher, extensive paperwork prior to the data collection might result in suboptimal data (observer’s paradox). However, distributing informed consent forms before collecting data has become the standard in sociolinguistics [37]. 3 Future forms of processing are especially challenging to determine when the collected (language) data will be contributed to a database accessible to the wider research community, such as CLARIN [16].
In China, the construct of data subject consent also exists, Art. 14 PIPL. However, the situation there resembles that in the United States: consent is the only relevant legal basis for data collection and the associated data processing by private actors; there is no construct comparable to legitimate interest. In general, as an overall consideration of the wording of the regulations shows, there is a rather liberal understanding of what constitutes implied or implicit consent [25]. Although it is not stated anywhere that implicit consent can be used as a legal basis instead of explicit consent, it has been clarified that explicit consent is only required where the wording of the law expressly calls for it (as, e.g., with regard to sensitive data in Art. 29 PIPL – "separate consent") [29]. Where only the term "consent" is used, less stringent requirements apply. The PIPL also requires sufficient information about the data collection: Art. 17 PIPL provides that, before processing the personal data, the controller shall inform the data subject in a conspicuous manner and in clear and understandable language about (1) the identity and contact details of the person processing the data, (2) the purpose, nature, and duration of the processing, (3) the manner of legal protection, and (4) other circumstances. This is to enable individuals to better understand the processing of their data [13, 25, 35, 39]. The requirement is derived from a 2012 resolution of the National People's Congress [24], which requires explicit disclosure of the purpose, manner, and scope of data collection and processing; an obligation that has been incorporated into all of the data protection laws under consideration [14]. However – and this is the difference from the GDPR – the formal requirements for consent are less strict, and consent does not have to be provable. Otherwise, the construct of implied consent would hardly be conceivable. A research privilege comparable to that of the GDPR is not known in Chinese law, so a comparatively open description of purpose cannot be relied on here: Art. 14 (2) PIPL stipulates that in the event of a change in the purpose of the processing (e.g., for a different research content) or a change in the manner of processing, the consent of the data subjects has to be obtained again. The GDPR imposes the stricter requirements on consent, which is the only available basis for research data collection of this kind. Since GDPR-compliant consent also satisfies China's comparatively looser requirements, it is possible to rely on such consent of the participants in order to collect the data in a legally secure manner. The research privileges do not apply under the PIPL, so the purpose has to be described precisely in order to avoid having to obtain further approvals at a later time.
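Because the GDPR requires consent to be provable (Art. 7 (1) GDPR) while the PIPL requires renewed consent for any change of purpose, it can help to think of each declaration of consent as a small, dated record. The following sketch is purely illustrative and not part of the project described above; the field names and the storage format are assumptions, chosen only to show what a provable, purpose-specific consent record might minimally capture.

```python
# Illustrative sketch only: field names and storage format are assumptions.
# The idea is to keep, per participant, a dated record of what exactly was
# consented to (purposes, transfer, retention), so consent is provable (GDPR
# Art. 7 (1)) and a later change of purpose is visibly not covered (PIPL Art. 14 (2)).
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class ConsentRecord:
    person_id: str                   # pseudonymous PersonID, not the participant's name
    purposes: list[str]              # precise purposes consented to
    transfer_to_eu: bool             # separate, explicit consent to the transfer
    retention_note: str              # when the data will be deleted
    revocation_info_given: bool      # participant informed of the right to revoke
    obtained_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = ConsentRecord(
    person_id="4711",  # hypothetical
    purposes=["voice recording for TTS training", "metadata for accent modeling"],
    transfer_to_eu=True,
    retention_note="deleted after publication of the findings",
    revocation_info_given=True,
)

# A new purpose not listed in `purposes` would require a new record,
# i.e., renewed consent, rather than a silent reuse of the old one.
with open(f"consent_{record.person_id}.json", "w") as f:
    json.dump(asdict(record), f, indent=2)
```

Whether such a record is kept as a signed form, a database row, or a file is a matter of implementation; the legally relevant point is that the documented purposes, the transfer, and the deletion date match exactly what the participants were told.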
3.5 Data Minimization, Anonymization, and Pseudonymization

According to the definitions provided in the GDPR and the PIPL, personal data also include information that allows a person to be identified. Article 4 PIPL already excludes anonymized information from this legal definition. The literature on
Art. 4(1) GDPR likewise indicates that anonymized information no longer renders a person identifiable [30]. Anonymization and pseudonymization are expressions of the principle of data minimization. The term anonymization is to be distinguished from pseudonymization, which is legally defined in Art. 4 No. 5 GDPR. This distinction is crucial since pseudonymized personal data are still personal data that fall within the scope of data protection law (cf. Recital 26 sentence 2). Decisive for the distinction between anonymization and pseudonymization is the question of when the identification of a natural person would involve a disproportionate effort. In contrast to the GDPR, the PIPL defines the term anonymization: Art. 73 No. 4 PIPL describes it as the processing of personal data in such a way that it can no longer be attributed to a specific natural person and can no longer be reconstructed. Both legal acts also recognize the principle of data minimization – the GDPR in Art. 5 (1) (c); Art. 6 PIPL contains a corresponding provision, according to which the processing of personal data has to have a clear and appropriate purpose and has to be limited to the minimum necessary to achieve that purpose. Anonymous data are the opposite of personal data [22]; the very definition – which is explicit in Art. 4 (1) PIPL – distinguishes anonymous data from personal data. The decisive factor is that the data contain information about a specific person, but that no reference can be made to an identified or identifiable natural person [28]. Based on the type of metadata that is collected (English proficiency, gender, age, major, and the Chinese province where the person was born and has had extended stays), a person remains identifiable – especially when this information is linked to a spoken text. Anonymization in the sense of removing the reference to a person therefore cannot be achieved, even by means of a PersonID. However – and this is why it is mentioned here – pseudonymization in the sense of Art. 4 No. 5 GDPR does exist. Chinese data protection law recognizes the concept of "de-identification" (cf. Art. 73 No. 3 PIPL), which denotes a similar concept but is not given the same prominence in the PIPL. Nevertheless, this construct – if it complies with the principle of data minimization in European law – can also serve this purpose in Chinese data protection law. Especially for research purposes, Art. 89 (1) sentence 3 GDPR explicitly mentions pseudonymization as a way of complying with the principle of data minimization. By assigning the PersonIDs in advance of the metadata collection, considerable efforts have already been made so that the data collection complies with the principle of data minimization in both China and Europe.
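The point that the collected metadata keeps participants identifiable even without names can be made tangible with a simple uniqueness check over the quasi-identifiers. The following sketch is illustrative only; the column names and the example records are assumptions and do not reproduce any real project data.

```python
# Illustrative sketch: counts how many records share each combination of
# quasi-identifiers. A combination that occurs only once singles a person out,
# which is why the PersonID data set is pseudonymized, not anonymized.
from collections import Counter

# Hypothetical metadata rows keyed by PersonID (no names present).
metadata = [
    {"person_id": "4711", "gender": "f", "age": 23, "major": "linguistics", "province": "Guangdong"},
    {"person_id": "1093", "gender": "m", "age": 24, "major": "computer science", "province": "Hunan"},
    {"person_id": "8821", "gender": "f", "age": 23, "major": "linguistics", "province": "Guangdong"},
]

quasi_identifiers = ("gender", "age", "major", "province")
combos = Counter(tuple(row[k] for k in quasi_identifiers) for row in metadata)

for row in metadata:
    combo = tuple(row[k] for k in quasi_identifiers)
    if combos[combo] == 1:
        # This participant is unique on the quasi-identifiers alone and could be
        # re-identified by anyone who knows these attributes about them.
        print(f"PersonID {row['person_id']} is unique on {quasi_identifiers}")
```

In a small interview cohort, most combinations of such attributes will be unique, which is exactly the situation described above: the reference to a person is reduced, but not removed, so the data remain personal data under both the GDPR and the PIPL.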
3.6 Retention or Deletion Period

Further important issues are the retention and deletion periods. According to Art. 17 (1) (a) GDPR, data are to be deleted immediately once they are no longer necessary to achieve the purpose. A similar wording can be found in Art. 19 PIPL:
the retention period for personal data is the minimum period necessary to achieve the purpose of the processing. In principle, it can be assumed that the purpose has been achieved as soon as the desired scientific findings have been published4 [6, 13]. While the GDPR allows exceptions for scientific research in Art. 5 (1) (e), no such exception exists under Chinese law: the data have to be deleted as soon as the purpose is achieved.
3.7 Data Transmission from China to the EU

For further processing, as shown in Fig. 6.1, the metadata, including the language data, are transferred to Germany, Europe, in order to continue the research there. Article 25 PIPL stipulates that any transfer of data requires the explicit consent of the data subjects. This is the type of consent that approximates the European understanding; implicit consent is not conceivable here. In addition, the requirements of Art. 51 et seq. PIPL have to be observed. According to Art. 53 PIPL, the processor of personal data has to designate a responsible person, whose name and contact details also have to be provided as part of the consent information (see above). This person is very similar to what European legislation understands as a data protection officer (Art. 37 GDPR), without being designated as such. Pursuant to Art. 56 (1) No. 3 PIPL, this data controller is required to conduct a risk assessment in connection with the transfer of data abroad, which has to be recorded in writing and kept for at least 3 years. When requesting assistance from Chinese stakeholders at the partner university in China, this requirement needs to be considered in addition to that of specific consent. Article 36 PIPL also stipulates that personal data processed by state organs are to be stored in the People's Republic of China. However, since the university is not a state institution, this requirement does not apply here; if it did, it would also stand in blatant contradiction to data minimization and the GDPR.
3.8 Data Implementation

The further processing of metadata and language data in the context of implementation in the TTS system and the CPA opens up a further field of legal issues in which EU and Chinese data protection regulations need to be examined, compared, and reconciled. The abundance of regulations requires a detailed discussion in a separate paper.
4 In the EU, a competing interest might be the construction of research data databases according to the FAIR principles, for example, in the CLARIN project.
3.9 Formalities

As a result, for research projects such as the present one, sufficient consent on the part of the participants has to be ensured. It is assumed that the German and Chinese universities act as joint data controllers (Art. 26 GDPR or Art. 20 PIPL). In addition to the general consent to study participation, the data protection regulations require further consents: on the one hand, to the data processing by collecting, evaluating, linking, and implementing the obtained data in the TTS system, and, on the other hand, to the data transfer from China to Germany, without which such processing could not take place at all. This requires a precise description of the purpose, manner, and scope of the data collection, as well as a description of the duration of storage and notice of the right of revocation. The privilege allowing European researchers to describe the purpose in general terms does not exist in China, so any change of purpose requires new consent. For this reason, the information given to participants in advance of the interviews needs to be as specific and precise as possible. The additional relaxation of the storage limitation principle provided for by the GDPR is likewise not found in Chinese legislation. Therefore, the time of deletion has to be stated as well in order not to violate Chinese law. Lastly, the consent also has to cover the transfer of the data to Europe, to which the participants have to explicitly agree.
3.10 Supervisory Authorities

Participants in research studies that take place in the international sphere are also particularly interested in which national authorities supervise compliance with data protection regulations and have corresponding supervisory powers. Supervisory authorities play a central role in enforcing rights under both the GDPR and the PIPL. In the GDPR, the rules are contained in Art. 55–59 GDPR. The supervisory authorities not only have the capacity for ex-post control of data processing operations, but also an important preventive function, which is reflected in the list of tasks in Art. 57 GDPR [5]. They control both state administrative actions and private actors. A broad catalog of responsibilities and competencies can be found in Art. 58 GDPR. For reasons of transparency, an activity report has to be prepared. In Chinese data protection law, the regulations on state supervision are provided in Art. 60–65 PIPL. Article 59 PIPL initially provides for the establishment of sector-specific and regional supervisory authorities. Their duties and rights are defined in Art. 60, 61, and 63–64 PIPL and coordinated by the Cyberspace Administration of China (CAC). The CAC also formulates the rules and standards to be observed (see Art. 62 PIPL). This also illustrates – in contrast to the specific tasks in the GDPR – the influence of the government administration. According to Art. 61 No. 4 and Art. 63 PIPL, the supervisory authorities are granted comprehensive investigatory powers to examine unlawful
personal data processing activities and, according to Art. 64 PIPL, also remedial powers and, according to Art. 66 PIPL, the possibility to impose fines and other sanctions.
4 Final Comparison

The PIPL is China's third and very recent data protection law, whose provisions come very close to the requirements of the GDPR. Although a comprehensive comparison is worthwhile in any case, the scenario described here identifies the issues relevant to cross-border research by examining which identically or differently regulated provisions have to be taken into account when collecting and processing interview data. As can be observed since the enactment of the CSL in 2016, Chinese data protection law is increasingly converging with European data protection law [25]. For relevant key terms such as personal data, processing, or data controller, the Chinese PIPL provides the terms personal information, related activities, and processor as synonymous bases for the law's applicability. Unlike US law, the PIPL even recognizes the category of sensitive data, including a definition. By placing minors within the specially protected scope, it even goes beyond the idea of the GDPR. Since the third draft, the PIPL has also recognized a kind of processor construct as well as joint responsibility. Despite many similarities, however, this comparative analysis has shown that there are also notable differences, which may be unfortunate for European stakeholders. Consent serves as the only legitimation for data processing by private actors in China. Other legitimate grounds such as those described in Art. 6 (1) (f) GDPR do not exist, which can lead to problems. Consent – apart from cases expressly mentioned in the law – can also be given implicitly, and evidence of it does not have to be provided. In this way, the consent requirement – even if it is supposed to be equated with the European requirement – is fundamentally undermined. If consent is not provided on an informed basis and, above all, not voluntarily, the requirements cannot be considered equivalent. Cases that would be justified by legitimate interests under European law are instead handled via this construct of implied consent, which further dilutes the requirement. In addition – and this is the focus of the area presented here – the GDPR offers some facilitations for research. This balance between privacy protection and scientific freedom is not known in Chinese law, and thus not in the PIPL either. This is unfortunate, as it can impede the progress of research in China itself as well as across borders.
5 Conclusion

For economic actors, an establishment of data protection such as that brought about in 2016 by the CSL and in 2021 by the DSL and now the PIPL had long been pending. When non-binding guidelines were first created, Chinese experts proudly declared, "We are stricter than the U.S., but not as strict as the EU" [18]. This trend is clearly evident in the PIPL. This development is not accidental, as each country or institution had its own reasons for designing data protection in one way or another: after the experience of World War II, the EU aimed to prevent the misuse of personal data and intrusions into privacy; in the United States, the interests of business were balanced against the interests of government security agencies [25]. China wants to move forward in terms of new business models – artificial intelligence – and in its 2017 "Next Generation Artificial Intelligence Development Plan" [10] set as a priority that laws and regulations be developed to support these measures. In 2019, the Ministry of Science and Technology responded by establishing the "New Generation AI Governance Expert Committee," which established the principle that privacy compliance needs to be part of responsible AI. It is evident that China can only participate in the international economy if it adheres to global privacy standards (which include those of the United States and Europe alike). Whether the promotion of AI is actually the main reason for China's actions, or whether it is rather a matter of ensuring data protection that is adequate from the EU's point of view, remaining competitive, and putting upstart companies (such as digital corporations) in their place, remains to be seen, since data protection regulations also have to be complied with by private operators. The fact that the DSL and PIPL have now emerged from these considerations is consistent – as is the rapprochement with Europe. This also enables an international research team to work together to gain and process knowledge – as technology grows together, so does science. Despite all this, one critical objection remains: while consumer protection is strengthened with respect to private institutions, it is still unclear how the state's powers of intervention are to be dealt with. Fundamental to this question is the indeterminate and unlimited intervention provision of Art. 13 (1) No. 5 PIPL, which only found its way into the law in the final phase of the legislative process.

Acknowledgment This study was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – SFB 1410.
References

1. ASIFMA Response: Review of the Personal Information Protection Law ("Second Review PIPL") of the People's Republic of China (2021).
2. BfDI – Bundesbeauftragte für den Datenschutz und die Informationsfreiheit: Positionspapier zur Anonymisierung unter der DSGVO unter besonderer Berücksichtigung der TK-Branche (2020).
3. Bier, C.: Das koreanische Datenschutzrecht. Zeitschrift Datenschutz und Datensicherheit (DuD), 457–460 (2013).
4. Binding, J.: Grundzüge des Verbraucherdatenschutzrechts der VR China. Leitfaden für die Praxis. Zeitschrift für Datenschutz (ZD), 327–336 (2014).
5. Boehm, F.: Art. 51 Aufsichtsbehörde. Aufsichtsbehörden im Regelungsgefüge der DS-GVO. In: Kühling, J., Buchner, B. (eds.) Datenschutzgrundverordnung BDSG. Vol. 3, Rn. 31–34. Beck, München (2020).
6. CLARIN. https://www.clarin.eu/, last accessed 2021/09/23.
7. Conrad, I., Treeger, C.: § 34 Recht des Datenschutzes. Joint Control, Art. 26 DS-GVO. In: Auer-Reinsdorff, A., Conrad, I. (eds.) Handbuch IT- und Datenschutzrecht. Beck, München (2019).
8. European Union Agency for Fundamental Rights and Council of Europe: Handbook on European Data Protection Law (2018).
9. European University Institute: Guide on Good Data Protection Practice in Research (2019).
10. FLIA: Artificial Intelligence Development Plan. https://flia.org/notice-state-council-issuing-new-generation-artificial-intelligence-development-plan/, last accessed 2021/09/23.
11. Geminn, C., Fujiwara, S.: Das neue japanische Datenschutzrecht. Reform des Act on the Protection of Personal Information. Zeitschrift für Datenschutz (ZD), 363–368 (2016).
12. Geminn, G., Laubach, A., Fujiwara, S.: Schutz anonymisierter Daten im japanischen Datenschutzrecht. Kommentierung der neu eingeführten Kategorie der "Anonymously Processed Information". Zeitschrift für Datenschutz (ZD), 413–420 (2018).
13. GoFair: FAIR Principles. https://www.go-fair.org/fair-principles/, last accessed 2021/09/23.
14. Hanhua, Z.: Consumer Data Protection in China. In: Metz, R., Binding, J., Haifeng, P. (eds.) Consumer Data Protection in Brazil, China and Germany – A Comparative Study, 35–71 (2016).
15. Heckmann, D., Paschke, A.: Art. 7 Bedingungen für die Einwilligung. Freiheit von Zwang und Wahlfreiheit. In: Ehmann, E., Selmayr, M. (eds.) DS-GVO. Vol. 2, Rn. 49–56. Beck, München (2018).
16. Hinrichs, E., Krauwer, S.: The CLARIN Research Infrastructure: Resources and Tools for E-Humanities Scholars. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pp. 1525–1531 (2014).
17. Hoeren, T., Wada, T.: Datenschutz in Japan. Aktuelle Entwicklungen bei der globalen Datennutzung und -übermittlung. Zeitschrift für Datenschutz (ZD), 3–5 (2018).
18. Hong, Y.: Responses and explanations to the five major concerns about the Personal Information Security Specification. WEIXIN (2018).
19. Hornung, G., Spiecker gen. Döhmann, I.: Einleitung. Das Verhältnis der DSGVO zur internationalen Entwicklung. In: Simitis, S., Hornung, G., Spiecker, I. (eds.) Datenschutzrecht, Vol. 1, Rn. 258–263. Nomos, Baden-Baden (2019).
20. Information Commissioner's Office: Guide to the General Data Protection Regulation (2018).
21. Johannes, P.C.: Datenschutz und Datensicherheit in China. Überblick zu PIPL und DSL. Zeitschrift für Datenschutz (ZD), 90–98 (2022).
22. Klar, M., Kühling, J.: Art. 4 personenbezogene Daten (inkl. betroffene Person). Anonyme Daten. In: Kühling, J., Buchner, B. (eds.) Datenschutzgrundverordnung BDSG. Vol. 3, Rn. 31–34. Beck, München (2020).
23. Kuner, C., Bygrave, L.A., Docksey, C., Drechsler, L., Tosoni, L.: The EU General Data Protection Regulation: A Commentary/Update of Selected Articles. Oxford University Press (2021).
24. National People's Congress of the People's Republic of China. http://www.npc.gov.cn/wxzl/gongbao/2013-04/16/content_1811077.htm, last accessed 2021/09/23.
25. Pernot-Leplay, E.: China's Approach on Data Privacy Law: A Third Way Between the U.S. and the E.U.? Penn State Journal of Law & International Affairs 8(1), 49–117 (2020).
26. Rat für Informationsinfrastrukturen (RFII): Datenschutz und Forschungsdaten (2017).
27. Rossnagel, A.: Datenschutz in der Forschung. Die neuen Datenschutzregelungen in der Forschungspraxis von Hochschulen. Zeitschrift für Datenschutz (ZD), 157–164 (2019).
28. Rossnagel, A.: Datenlöschung und Anonymisierung. Verhältnis der beiden Datenschutzinstrumente nach DS-GVO. Zeitschrift für Datenschutz (ZD), 188–192 (2021).
29. Sacks, S.: China's Emerging Data Privacy System and GDPR. Center for Strategic and International Studies (2018).
30. Schild, H.: Art. 4 Begriffsbestimmungen. Identifizierbarkeit. In: Wolff, H.A., Brink, S. (eds.) BeckOK Datenschutzrecht, Ed. 36, Rn. 18–21. Beck, München (2021).
31. Schmied, J.: Credibility in academic and journalistic writing and beyond. In: Schmied, J., Dheskalis, J. (eds.) REAL, Credibility, honesty, ethics, and politeness in academic and journalistic writing. Vol. 14, pp. 1–14. Cuvillier, Göttingen (2018).
32. Schwartz, P.M., Solove, D.: Reconciling Personal Information in the United States and European Union. California Law Review 102(4), 877–916 (2014).
33. Spindler, G., Dalby, L.: Art. 7 Bedingungen für die Einwilligung. Leichte Verständlichkeit und Zugänglichkeit schriftlicher Einwilligungen, die mehr als einen Sachverhalt betreffen (Abs. 2). In: Spindler, G., Schuster, F. (eds.) Recht der elektronischen Medien, Vol. 4, Rn. 7–10. Beck, München (2019).
34. Spoerr, W.: Art. 28 Auftragsverarbeiter. Maßgebend: Verarbeitung in Unterordnung unter die Verarbeitungszwecke des Verantwortlichen. In: Wolff, H.A., Brink, S. (eds.) BeckOK Datenschutzrecht, Ed. 36, Rn. 18–21. Beck, München (2021).
35. Stifterverband: Memorandum on "Public Understanding of Sciences and Humanities". https://www.stifterverband.org/ueber-uns/geschichte-des-stifterverbandes/push-memorandum, last accessed 2021/09/23.
36. Taeger, J.: Bedingungen für die Einwilligung. Klare und einfache Sprache. In: Taeger, J., Gabel, D. (eds.) DSGVO – BDSG, Vol. 3, Rn. 59–62. Fachmedien Recht und Wirtschaft, dfv, Frankfurt am Main (2019).
37. Tagliamonte, S.A.: Analysing Sociolinguistic Variation. Cambridge University Press (2006).
38. Voigt, P., von dem Bussche, A.: The EU General Data Protection Regulation. Springer (2017).
39. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E., Bouwman, J., Brookes, A.J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C.T., Finkers, R., Gonzalez-Beltran, A., ..., Mons, B.: The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018 (2016).
Chapter 7
Regulating Cross-Border Data Flow Between EU and India Using Digital Trade Agreement: An Explorative Analysis

Vanya Rakesh
Abstract With data flows forming an integral part of cross-border digital trade and e-commerce models, cross-border data protection measures have gained prominence in the domain of international trade. However, the relevant provisions in Europe and India lay down several conditions to regulate such flows. On the one hand, the EU's provisions may create the "Brussels effect" (requiring the world to comply with its unilateral restrictions); on the other hand, countries such as India appear to be introducing protectionist measures such as data localization as part of their proposed data protection regimes. This chapter therefore analyzes whether such cross-border data flows, as part of digital trade between the EU and India, can be regulated by a digital trade agreement to avoid an e-commerce splinternet. To answer this, it considers whether restrictive cross-border data transfer mechanisms as well as fragmented rules may create an e-commerce splinternet, and whether they further qualify as a trade barrier under international trade law, specifically the GATS. The chapter concludes by recommending factors to be taken into consideration should a trade agreement be proposed between the EU and India, to facilitate e-commerce between the two countries and to resolve the issue of fragmented regulations concerning digital trade.

Keywords Cross-border data transfer · GDPR · WTO · Data privacy · Digital trade · e-commerce
1 Introduction

The advent of new technologies and the digitalization of businesses have led to a rapid increase in the use and transfer of data across the world. In modern digital markets, e-commerce is fueled by a variety of data, including personal data, wherein
V. Rakesh, CRANIUM, Utrecht, the Netherlands
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
S. Schiffner et al. (eds.), Privacy Symposium 2022, https://doi.org/10.1007/978-3-031-09901-4_7
cross-border data flows have become a key feature of the exchange of trade, goods, and services [1]. This has led to trade and data protection being indispensably interconnected, to the point where processing personal data is crucial for providing various market-competitive services [2]. However, differing or restrictive measures among countries to regulate cross-border data flows may act as an emerging trade barrier1 and may be subject to scrutiny under international trade law [3]. Recital 9 of the GDPR2 also highlights how differences in the level of protection of personal data at the level of the Union may prevent the free flow of personal data, hence acting as a possible barrier, especially to the economic pursuits of the EU. Considering the same, this chapter aims to analyze how restrictive measures to regulate cross-border data flows, focusing on mechanisms applicable to data flows between Europe and India, may be seen as barriers to trade. For example, the EU General Data Protection Regulation (GDPR) restricts the transfer of personal data to a non-EEA country lacking "adequate" data protection unless other safeguards are in place, such as Standard Contractual Clauses (SCCs) adopted by the European Commission. Alternatively, undertakings or enterprises can sign and adopt a group document called Binding Corporate Rules (BCRs). The Brussels effect of the GDPR3 shall be analyzed, as developing countries have argued that the requirements under the law are too time-consuming or too costly to implement [4]. On the other hand, India has proposed the introduction of data localization requirements in its draft Data Protection Bill, which may come across as a data protectionist measure, possibly hindering the growth of digital trade between India and the EU [5]. Fragmented laws and policies highlight the need for efficient regulation of digital economies, for which traditional data protection and trade rules may need to be harmonized to ensure regulatory cooperation, as a policy crossroads may lead to an e-commerce "splinternet" [8]. This means that global e-commerce and the flow of information may become fragmented due to fundamental differences between countries over data flows, fracturing digital trade by segmenting the Internet. For this purpose, this chapter shall further explore the possibility of a free-trade agreement with relevant and necessary provisions as one of the methods to ensure the protection of data as well as facilitate digital trade (i.e., services provided through digital technologies, such as SaaS-type services) between the EU and India, and the relevant factors that should be taken into account.
1 Organization for Economic Cooperation and Development: The impact of digitalization on trade. https://www.oecd.org/trade/topics/digital-trade/.
2 General Data Protection Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016, OJ L 119.
3 Mark Scott, Laurens Cerulus: Europe's new data protection rules export privacy standards worldwide. Politico (2018).
2 Regulatory Landscape of Cross-Border Data Transfer

2.1 Current Regulatory Landscape for Cross-Border Data Transfer in the EU

This section briefly discusses the data transfer mechanisms that the GDPR provides for under Chapter V, especially those relevant in the context of e-commerce and digital trade. Article 45 of the GDPR allows the transfer of data based on an adequacy decision, without requiring any specific authorization. In this case, the European Commission (Commission) may decide that a third country, a territory, one or more specified sectors within that third country, or the international organization in question ensures an adequate level of protection, and any transfer to such a country shall not require any specific authorization. It is pertinent to highlight that although India aims to seek adequacy status,4 it has not yet been granted it. Where no adequate level of data protection has been recognized for a country, transfers outside the EU/EEA can still be permitted subject to appropriate safeguards, including Binding Corporate Rules, or under specific derogations.5 Article 46 of the GDPR further lists these appropriate safeguards, which are intended to compensate for the lack of data protection in a third country and to allow the transfer of data without requiring any specific authorization from a supervisory authority.6 These may include Binding Corporate Rules (BCRs), Standard Data Protection Clauses adopted by the Commission, or Contractual Clauses authorized by a supervisory authority. Additional forms of safeguards include an approved code of conduct or an approved certification mechanism. An example of such a mechanism was the EU-U.S. Privacy Shield Framework, designed by the U.S. Department of Commerce and the European Commission to enable companies with such mechanisms to comply with data protection requirements when transferring personal data from the European Union to the United States, in order to facilitate transatlantic commerce. However, it was invalidated by the Court of Justice of the European Union on July 16, 2020. Article 47 of the GDPR provides for the use of approved Binding Corporate Rules by a group of undertakings or enterprises engaged in a joint economic activity for international transfers from the EU to another organization within the same group of undertakings or enterprises. The BCR regime acts as an organization-based tool and
4 Megha Mandavia: India to approach the EU seeking 'adequacy' status with the GDPR. Economic Times (2019). https://economictimes.indiatimes.com/internet/india-to-approach-the-eu-seeking-adequacy-status-with-the-general-data-protection-regulation/articleshow/70440103.cms.
5 Recital 107 of the General Data Protection Regulation.
6 Recital 108 of the General Data Protection Regulation.
a safe haven for data protection purposes, especially for multinational corporations (MNCs) with group companies in different locations [6]. Besides BCRs, Model Contract Clauses (MCCs) or Standard Contractual Clauses (SCCs) can also be used. These clauses form part of a contract, include specific provisions dealing with data protection, are approved by the EU Commission, and facilitate data transfers from data controllers in the EU to data controllers/processors established outside the EU or the European Economic Area (EEA). However, the legality of SCCs was challenged in the Schrems II case7 before the Court of Justice of the European Union (CJEU); although the validity of this instrument was upheld, the challenge highlights a certain amount of legal uncertainty regarding the use of SCCs. Such legal uncertainty also paves the way for considering trade agreements as an alternative means of regulating cross-border data transfers. Additionally, Codes of Conduct can be used to ensure compliance with the GDPR and allow cross-border data transfer, as per Article 46(2)(e). Such a code applies to associations representing controllers or processors and must be approved by the Commission. Approved certification mechanisms (such as privacy seals or marks) provided for in the GDPR under Article 46(2)(f) also act as an appropriate safeguard and allow for the development of data protection seals and marks to demonstrate compliance with the GDPR by processors and controllers within the EU. In the absence of an adequacy decision or an appropriate safeguard, international data transfers are allowed only under certain conditions as per Article 49 of the GDPR. Such derogations include the explicit consent of the data subject, a transfer necessary for the performance of a contract, or a transfer in the interest of the data subject or the public interest. However, the administrative burden or compliance challenge that a country may face due to the requirements of the GDPR deserves consideration. For example, obtaining adequacy status may require significant efforts from a third country, such as ensuring the independence of data protection supervisors, establishing cooperation mechanisms with the member states' data protection authorities, and ensuring that effective and enforceable rights and judicial remedies are provided to data subjects.8 This can have a financial as well as an administrative bearing on organizations of the third country. Also, though SCCs are a relatively cost-effective standard solution, they are not tailored to the specific needs of companies and may result in extra liabilities. SCCs are unavailable when a controller wants to move data outside the EU within the same organization, and they may not regulate processor-to-sub-processor relationships. Additionally, the legal uncertainty regarding the use of SCCs mentioned earlier could also have led to serious cost implications for companies had the CJEU declared them invalid. Also, practitioners have mentioned that SCCs seem to be ill-suited to modern-day cross-border outsourcing transactions, besides being inflexible and
7 Case C-311/18, Data Protection Commissioner v Facebook Ireland Limited, Maximillian Schrems.
8 Recital 104 of the GDPR.
adding to costs for SMEs in particular [7]. With respect to BCRs, practice shows that these can be a cumbersome option as well: although they are tailored to groups of companies, the procedural requirements of authorization, due to the consistency mechanism provided under Article 63 of the GDPR, may take a long time and involve extensive legal spending and high budgets. Further, BCRs only work within a group of companies, which submits the BCRs for approval to the competent data protection authority in the EU; approval takes place in accordance with Article 63 and may involve several supervisory authorities, since the group applying for approval of its BCRs may have entities in more than one member state. Member states may also impose many additional national requirements, which could be diverse in nature and seriously undermine the benefits for MNCs of adopting BCRs. In light of these data transfer mechanisms provided for under the GDPR to facilitate cross-border data transfer outside the EU/EEA, it can be said that this set of requirements or conditions can be seen as restrictions, which may possibly act as a trade barrier.
2.2 Current Regulatory Landscape for Cross-Border Data Transfer in India
At present, India does not have a comprehensive data protection legislation in place. Currently, only the Information Technology (Reasonable security practices and procedures and sensitive personal data or information) Rules, 2011 (IT Rules 2011), adopted by the Department of Information Technology, Ministry of Communications and Information Technology, under Section 43A of the Information Technology Act 2000, regulate the collection, protection, and storage of personal data, including sensitive personal information (please see Annex II for definitions), by corporate entities. These Rules also provide for the transfer of personal data from India: Rule 7 states that such data can be transferred by a body corporate, within India or outside, if the recipient ensures the same level of data protection as that adhered to by the body corporate under the IT Rules 2011. The Rule provides two circumstances under which such a transfer can be made: if the transfer is necessary for the performance of a contract, or if the transfer has been consented to by the provider of information, i.e., the data subject. Besides the IT Rules 2011, the Personal Data Protection Bill of 2019 (PDPB) was introduced in the Lower House of the Parliament on December 11, 2019. The PDPB aims to act as a comprehensive, cross-sectoral data privacy legislation in India. If approved in its current form, the proposed legislation would allow a copy of sensitive personal data (please see Annex II for definitions) to be transferred outside of India only if the following cumulative conditions are met:
(a) The data principal (i.e., the data subject) provides explicit consent to such a transfer.
(b) One of the following applies:
– The transfer is made pursuant to a contract or intra-group scheme approved by the Data Protection Authority of India (DPAI).
– The government has deemed a country or a class of entities within a country to provide adequate protection.
– The DPAI has specifically authorized the transfer for a specific purpose.
This implies that the explicit consent of the data subject is a necessary requirement, along with one of the conditions listed under point (b) above. Despite the restrictions on transfer, such data shall continue to be stored in India (Section 33(1)). Also, the chapter on restriction of transfer of personal data outside India states that critical personal data (please see Annex II for definitions) shall only be processed in India, except under emergency circumstances or where the government has approved the transfer, considering India's security and strategic interests.9 Hence, although the PDPB envisions international data transfer mechanisms similar to those of the GDPR, it does not eliminate the need to collect the data subject's explicit consent. These provisions and requirements also reflect the data localization approach that India wants to adopt as part of its proposed data protection regime. It can also be ascertained that no localization or data transfer restrictions apply to personal data that is not considered "sensitive" or "critical." However, the proposed requirement to store information on local servers will be cumbersome for many smaller businesses and may act as a restrictive measure for service providers. Also, the mandate to have a "Privacy by Design" policy requiring certification by the DPAI will act as a restrictive certification and licensing regime. Considering the above, it can be said that the EU and, to a more limited extent, India have imposed restrictions on the cross-border flow of data. While such conditions may, on the one hand, be imposed to safeguard privacy and security (as in the case of the GDPR), measures like those of India may, on the other hand, appear to be a case of digital protectionism. The following section aims to highlight how these provisions may act as a trade barrier and probably reduce market access for foreign suppliers of digital services, impeding trade and investment opportunities, increasing costs, and limiting the service choices of individual businesses.
9 Kurt Wimmer, Gabe Maldoff, Diana Lee, The International Association of Privacy Professionals: Comparison: Indian Personal Data Protection Bill 2019 vs. GDPR (2020).
3 Regulation Acting as Trade Barrier
3.1 GDPR and Its "Brussels Effect"—Limitation on Cross-Border Data Flow and Challenges of Implementation
Given the magnitude, scope, and requirements of complying with the data protection measures regulating cross-border data flow outside Europe, as listed above, exploring the "Brussels effect" (a term coined by Anu Bradford, a professor at Columbia University) of the GDPR is essential. The concept essentially highlights the possibility of the EU exerting its global power by way of its legal institutions and regulations, whose influence is exported globally. An example is the exercise of power by the EU where its regulations are adopted in the legal frameworks of developed and developing countries alike [8]. The implications of the GDPR can also help in unpacking the idea of "unilateral regulatory globalization," which refers to a single state being able to externalize its laws and regulations outside its borders through market mechanisms, resulting in the globalization of standards. Ever since the GDPR came into force in May 2018, the EU approach to the protection of privacy rights has been spreading outside its boundaries. Additionally, given the extraterritoriality of the GDPR by virtue of Article 3, the Brussels effect can be seen playing out. Consider, for example, a scenario in which using an electronic commerce platform involves multiple data flows between servers of different services based in India (e.g., e-payment services, the e-commerce portal) and the customer's computer or digital device (possibly in the EU, where the business model involves offering such services globally, including in Europe). In this case, "data" refers both to the digitized content being transferred as part of the service and to the data generated when users access or use digital services, applications, and websites. The latter would be subject to compliance with the GDPR, since the user-generated data would most likely entail personal information such as names, contact details, user preferences, and data collected via cookies, to list a few. This would again require the businesses to adopt a suitable transfer instrument. Also, personal data has not only social value but also business value, as it is traded via various digital services and drives the digital economy, with approximately 75% of digital data generated by Internet users falling within the scope of personal data, making it instrumental for enabling digital services. This exemplifies the Brussels effect playing out. However, even if the GDPR succeeds in setting global standards for trade and regulation, particularly for privacy and data governance, whether because corporations decide to adopt its principles across their global operations or because governments increasingly draw inspiration from the GDPR model, the effects may not be benign for all enterprises. For SMEs in particular, which, compared to big corporations, have fewer monetary capabilities and less expertise to comply, the burden of compliance may act as a detriment to business operations. It has been observed that the digital trade environment has resulted in increased access to online inputs, which contribute to the competitiveness of SMEs by helping them operate across distant markets and overcome trading costs.
For example, the Internet and international data transfers help SMEs better connect, improve their ability to secure and fulfill global contracts, and access global supply chains [9]. But stringent conditions on data exchange could limit these flourishing operations, denting economic exchange. Also, data protection requirements risk limiting the opportunities for innovation and can often be perceived as a significant obstacle to data flows, especially for smaller businesses. Restrictive measures for enabling cross-border data transfers and imposing specific conditions (whether on the basis of an adequacy decision, or through Binding Corporate Rules and Standard Contractual Clauses) on digital service providers may affect how these digital services are offered. Additionally, restrictions on data flows could also reduce users' choice of technology and services on the Internet, again acting as a barrier, or even deter businesses from entering new markets. As part of a survey, Indian respondents reported investments in new technologies to support GDPR requirements, because Indian firms handle large amounts of EU citizens' data as processors, a result of the offshoring trend visible over the last 20 years and the drive to be at the forefront of technology. This also reflects the impact of EU legislation reaching far beyond the European border. It is important to understand that digital inclusion is a key component, which requires promoting developing countries' interests in digital trade. The objectives of developing countries and least developed countries (LDCs) on issues such as trade facilitation, logistics, and online payments, particularly to facilitate the participation of SMEs in digital trade and e-commerce, are relevant here. SMEs find it cumbersome or unprofitable to navigate these complex regulations and might effectively be deterred from operating in the EU. Further, companies across all sectors of the economy find it difficult to conduct digital trade in markets with a poor domestic regulatory framework in critical areas such as data protection, cybersecurity, and online consumer protection. Importantly for data flows, these companies seek to achieve some common ground rules for the digital marketplace, where increasingly inadequate and incompatible national regulations are seen as a significant digital trade barrier [10].
3.2 India and Its Data Localization Measure—Protectionist Measures and Challenges to Digital Trade
Another widely perceived trade barrier has been data flow restrictions combined with local storage requirements, i.e., the requirement that data be stored locally. Data localization measures require certain types of data to be stored on local servers and often also include local processing requirements. However, a local storage requirement does not always correspond to a complete prohibition of cross-border transfer. Cross-border data flow restrictions can take one of several forms, including a prohibition on data transfer outside the national border; restricted data transfer outside national borders, where a copy must be maintained domestically; or merely a condition requiring prior consent before global transfers are allowed.
Earlier, India had adopted a strict approach toward data localization, demanding that data or a copy of it be hosted on local servers, and had created restrictions on the transfer of data outside national borders. The 2018 version of the Personal Data Protection Bill proposed in India received a global backlash due to its mirroring provision, which mandated that a live serving copy of all personal data be stored in India, along with a restriction on any cross-border transfer of data notified as "critical personal data."10 However, this stringent data localization requirement was considerably toned down in the 2019 version of the Bill. In its current form, the Bill only requires the storage of "sensitive personal data" (please refer to Annex II) within India, and such data can be transferred outside the country subject to certain conditions. Besides this, sectoral measures were introduced, whereby the Reserve Bank of India mandated localization of payment systems data, a measure implemented without much consultation. The most recent proposal to include provisions on localization was the draft e-commerce policy, a revised version of which was introduced in February 2019. However, whether the data covered by this policy will be dealt with in the PDPB is yet to be confirmed. Considering this, it becomes essential to explore in what sense data localization may be deemed a barrier to digital trade, and the need to balance the facilitation of e-commerce on the one hand with comprehensive data protection requirements on the other. It is argued that such policies either cut off data flows or make transfers difficult or expensive, especially for small and solely Internet-based firms and platforms that do not have the resources to deal with such burdensome restrictions. They may also face the additional challenge of compliance with the data protection measures of other countries, such as the GDPR, simply adding to their woes. This also carries the risk of making an organization or sector less competitive, forcing companies to spend more than necessary on IT services, especially data storage services, and possibly preventing the transfer of data needed for day-to-day activities, such as human resources, which means companies may have to pay for duplicative services. These additional costs are borne either by the customer or by the firm, which undermines the firm's competitiveness [11]. Also, a CIGI/Chatham House study shows that data localization and other data regulations in many countries, such as Brazil, China, and Russia, as well as the European Union and India, significantly decreased total factor productivity [12]. Another study estimated that if these countries enacted economy-wide data localization, higher prices and displaced domestic demand would lead to consumer welfare losses; for India, the reported loss was 14.5 billion dollars [13]. It is estimated that the benefits of digital trade to the Indian economy could grow 14-fold, from 35 billion dollars now to 512 billion dollars by 2030, if India dismantles barriers to digital trade and fully enables cross-border data flow.
10 Arindrajit Basu: The Retreat of the Data Localization Brigade: India, Indonesia and Vietnam, The Diplomat (2020) https://thediplomat.com/2020/01/the-retreat-of-the-data-localization-brigade-india-indonesia-and-vietnam/.
A report11 claims that India holds the potential to play an instrumental role in pushing for "facilitative" digital trade rules in its various bilateral and multilateral trade negotiations and has several opportunities to enhance its current domestic regulatory approach to data. With the Internet of Things and 5G networks enabling increased amounts of data flows, there will be a rise in digitally enabled trade in services and goods, powered by data. The Indian e-commerce market is projected to grow by 21.5% in 202212 and is expected to reach 99 billion dollars by 2024,13 with India having developed into an IT/Information Technology-enabled Services (ITeS) business process outsourcing hub and being a proponent of the liberalization of Mode 1 services trade (services supplied from the territory of one member into the territory of another member) in the WTO as well as in its bilateral and regional trade agreements. In terms of international trade, there is no data on trade in e-commerce in India yet [14]. Also, trade in services between the EU and India went up from €22.3 billion in 2015 to €32.7 billion in 2020,14 with telecom/IT said to be one of the dominant sectors when it comes to trade in services between the EU and India [15]. Hence, such restrictions may pose a threat to the overall economy of the country. So even though the GDPR has its own set of requirements for data protection when it comes to cross-border transfers, in parallel, different countries are also developing privacy laws that reflect their own sense of how to reap the opportunities of data flows for growth and trade while minimizing privacy risks. It becomes challenging to strike a balance when different data privacy regimes are at loggerheads. What is required is probably a principle-based approach, which prioritizes mutual agreement on common privacy outcomes and gives each country the flexibility to achieve these goals without creating unnecessary economic and trade costs. This is all the more relevant since governments would have to ensure their countries are digitally ready and have a strategy for making the most of new digital trade opportunities, which may be a bigger challenge for developing countries. Additionally, data localization requirements may be introduced by governments to protect domestic companies from online competition, where one of the key adverse consequences could be inconsistency with the country's commitments under the World Trade Organization.
11 Hinrich Foundation, the All India Management Association (AIMA) and AlphaBeta Advisors: The Data Opportunity: The Promise of Digital Trade for India (2019) https://www.hinrichfoundation.com/press-releases/2019/hinrich-foundation-and-aima-launch-new-digital-trade-report/.
12 ET Bureau: India's ecommerce market to grow by 21.5% in 2022: GlobalData, The Economic Times (2022) https://economictimes.indiatimes.com/tech/technology/indias-ecommerce-market-to-grow-by-21-5-in-2022-globaldata/articleshow/89038696.cms?from=mdr.
13 Garima Bora: India's e-commerce market to be worth 99 billion dollars by 2024: Report, The Economic Times (2021) https://economictimes.indiatimes.com/small-biz/sme-sector/indias-ecommerce-market-to-be-worth-99-billion-by-2024-report/articleshow/81583312.cms.
14 European Commission, Trade Policy, Countries and Regions: India https://ec.europa.eu/trade/policy/countries-and-regions/countries/india/.
3.3 Analyzing Restrictive Data Transfer Mechanisms Acting as a Barrier Under Trade Law—GATS
3.3.1 GATS and Its Relevance to Cross-Border Data Flows
Currently, the World Trade Organization (WTO) has several agreements that govern aspects of digital trade. These include the Information Technology Agreement (ITA), the Agreement on Trade-Related Aspects of Intellectual Property Rights (TRIPS), and the General Agreement on Trade in Services (GATS). Although none of these agreements explicitly addresses or governs cross-border data flows, in 1999 the WTO nevertheless asserted its authority over data flows. This was done by the GATS Council on Services, which found that much of e-commerce would fall within the GATS' scope and that GATS obligations could cover measures affecting the electronic delivery of services [16], perhaps including data flows. As many digital services nowadays are data-intensive, they tend to employ a large amount of data that crosses borders in order to facilitate trade in services over the Internet. Studies suggest15 that restrictive data policies, especially those concerning the cross-border movement of data, result in lower imports of data-intensive services for the countries imposing them. These can include policies that affect cross-border data transfer, such as mandating data localization or imposing additional requirements for data to be transferred abroad, as well as policies that apply to the use of data domestically [17]. In the traditional sense, the term "non-tariff measures" (NTMs) covers a diverse set of measures comprising all policy measures other than tariffs and tariff-rate quotas that have a direct impact on international trade. These measures can be categorized either as "technical" measures, including regulations, standards, testing, and certification, or as "non-technical" measures, including quantitative restrictions such as quotas, price measures, and forced logistics. In the new era of trade, the measures discussed here can be defined as a new set of NTMs. Although data protection laws help create trust, facilitate business, and support the use of digital transactions via e-commerce, policy measures that inhibit the cross-border transfer of data may consequently increase the cost of trading activities. The GATS thus appears to contain disciplines relevant to cross-border data flows that enable the supply of services, applicable to any government measure affecting trade in services, and hence covering these next-generation NTMs within its ambit. Also, the EU was a founding member of the WTO and is a party to the core international trade agreements, including the GATS; India has likewise been a signatory to the Agreement since its entry into force in 1995. For this purpose, the relevant obligations of the GATS framework governing global trade in services are worth exploring. The Most Favored Nation (MFN) Treatment requirement, an essential obligation, is binding on each WTO member state from the moment of its accession to the GATS; it requires each member to treat services and service suppliers of any WTO member in a manner that is "no less favourable" than "like" services and service suppliers of any other country.
15 Martina F. Ferracane, Janez Kren, Erik van der Marel: The cost of data protectionism, VOX CEPR Policy Portal (2018) https://voxeu.org/article/cost-data-protectionism.
Here, the intention is to eliminate discrimination, and the interpretation of "like" and "no less favourable" is decided on the basis of WTO jurisprudence. Services are said to be "like" if they are "essentially or generally the same in competitive terms." Although countries are allowed, in limited circumstances, to discriminate in services, trade agreements between countries only permit these exceptions under strict conditions. The MFN obligation in GATS also allows for some flexibility in compliance. GATS Article I:2 defines cross-border "trade in services" as including, in part, the "cross border supply" of a service from a provider located in "the territory of one Member into the territory of any other Member." Data transfers and digital transfers are therefore said to fall broadly under this provision of the GATS. Other key commitments include market access, which incorporates obligations relating to quantitative restrictions on trade in services, and national treatment, which bans discrimination between domestic and foreign services and service suppliers. These specific commitments become binding only if, and to the extent that, the member country has indicated them in its Schedules of Specific Commitments (Services Schedules), which constitute an integral part of the GATS. Moreover, limitations may be attached to commitments to reserve the right to operate measures inconsistent with full market access and/or national treatment. Hence, it is pertinent to highlight that any data flow restriction that affects the supply of a service covered by the Agreement would engage obligations under the MFN clause, which applies to all services, whereas market access and national treatment commitments apply only in sectors where a WTO member has scheduled commitments. The national treatment obligation further requires that "like" foreign services and service suppliers receive "treatment no less favourable" than their domestic counterparts in the WTO member state and that no discriminatory measures are maintained. Under GATS Article XVII(2) and (3), "no less favourable" treatment requires de jure ("formally identical") or de facto ("formally different") treatment that "does not modify the conditions of competition" in favor of domestic services and service suppliers compared to "like" services and service suppliers of a WTO member. Therefore, any regulatory measure relating to data flows, such as a data localization requirement, must provide no less favorable treatment to foreign services and suppliers than that given to domestic "like" suppliers. Consequently, any such requirement that leads to additional costs for local processing or storage of data and thereby adversely impacts the competitive position of foreign service suppliers compared to their counterparts of national origin would be inconsistent with a national treatment commitment.
3.3.2 Possibility of Data Protection Laws of EU and India Being in Contravention of GATS
Considering the commitments that WTO member states have under the GATS, it is crucial to explore how restrictive cross-border data policies may violate these obligations and contravene international trade law principles. The European Commission highlights the economic potential of data sharing between companies and the need to identify hurdles hindering it, as mentioned in the policy paper titled "A European strategy for data."16 Against this background, the GDPR's trade-restrictiveness is worth exploring. In the case of the EU, as the GDPR requires an adequate level of protection for transferring, processing, and controlling data outside the EU/EEA, only 13 third countries have been granted an adequacy decision or have an arrangement in place, leaving most non-EU countries outside of this scheme. Of the ten biggest trading partners of the EU,17 so far only three countries, namely Norway, Switzerland, and Japan, have been granted adequacy status. This reflects how the EU's data protection rules are yet to gain wider acceptance and are still far from delivering the intended results. It also seems to reinforce the view that this mechanism amounts to unfavorable treatment among third countries under the MFN provision. Additionally, the system of essential equivalence for data flows creates a wide opportunity gap between EU and third-country suppliers, which has been argued to modify "the conditions of competition in favor of services based in EU/EEA." Without an adequacy decision, third-country suppliers may need to adopt additional measures or appropriate safeguards to raise their data protection standards as per Article 46 of the GDPR, leading to additional costs. This can be challenged before the WTO, where the dispute settlement body may determine, on a case-by-case basis, whether services or service suppliers are "like" in the first place, and whether a de facto differential treatment negatively affects the conditions of competition for non-EU suppliers. Analyzing the impact of such regulations as a technical barrier to trade, it is said that, although the cost of the impact on international trade is difficult to ascertain, complying with different foreign technical regulations and standards certainly involves significant costs for producers and exporters, further discouraging service suppliers from providing services abroad.18 On the other hand, in the case of India, data localization requirements may appear to be trade-distortive as well. In the context of the GATS, data localization measures would require companies to take actions relating to how they handle the data that is necessary for the supply of a particular service and can further have a bearing on the conditions of competition between foreign and national suppliers.
16 European Commission: Industrial applications of artificial intelligence and big data, Internal Market, Industry, Entrepreneurship and SMEs https://ec.europa.eu/growth/industry/strategy/advanced-technologies/industrial-applications-artificial-intelligence-and-big-data_en.
17 European Commission, Directorate General for Trade: Client and Supplier Countries of the EU27 in Merchandise Trade (2021) https://trade.ec.europa.eu/doclib/docs/2006/september/tradoc_122530.pdf.
18 World Trade Organization: Technical Information on Technical barriers to trade https://www.wto.org/english/tratop_e/tbt_e/tbt_info_e.htm.
In this case, situations may be created that lead to less favorable treatment for foreign services and service suppliers compared to domestic firms. Studies further reveal how data localization measures have acted as barriers to data flows in other countries, leading to decreased total factor productivity (TFP) and reduced Gross Domestic Product (GDP). Additionally, the cost of building data centers may require local companies to pay more on the one hand and, on the other, act as an entry barrier for foreign SMEs and start-ups that would want to enter the Indian market, due to the increased costs of data localization. Beyond digital trade, data localization can threaten major new advances in information technology, such as cloud computing, the Internet of Things, and big data. Considering the requirements of both the EU and India and the extent to which they may violate GATS obligations, the exceptions and the necessity test, and the extent to which they can suffice, are vital for consideration. Both the EU and India could argue that their respective data protection measures fall within the ambit of the privacy exception in Article XIV(c) of GATS, which permits measures "necessary to secure compliance with laws or regulations... relating to the protection of the privacy of individuals in relation to the processing and dissemination of personal data." But this raises the question whether the measures taken by the EU and India constitute arbitrary or unjustifiable discrimination between countries where similar conditions prevail and pose a restriction on trade in services. The core element of the Article XIV(c)(ii) exception is the "necessity test," which involves "weighing and balancing" several factors, including the importance of the objective, the contribution of the measure to that objective, the trade-restrictiveness of the measure, and the extent to which it is non-discriminatory [18]. First, a measure should contribute to the enforcement of domestic laws that pursue a public policy objective and are not inconsistent with the provisions of the GATS. Additionally, the restrictive effect of the measure on international trade in services needs to be duly assessed. Therefore, the less restrictive the measure, and the greater its contribution to the enforcement of the public interest, the more likely it is that the measure will meet the necessity test, which means that the pursued public policy objective can be achieved without prohibitive costs or substantial technical difficulties for that party [19]. As WTO members have in the past faced difficulty in establishing that their challenged measures were intended to "secure compliance" with WTO-consistent domestic laws, it has been stated that a domestic law inconsistent with WTO law would also not meet the terms of paragraph (c). Hence, even though the exception explicitly refers to privacy, confidentiality, and personal data, it would not necessarily justify a regulation on those grounds alone. For example, in the case of data localization, although such a requirement may aid the goal of data protection, it may also compromise privacy or national security to the extent that it can be shown that server localization compromises the security of data, for instance by increasing susceptibility to malware and other attacks.
Hence, these kinds of measures may face problems in establishing necessity under GATS Article XIV(a) or (c), depending on the specific circumstances and available evidence.
In addition to this test, if a data transfer restriction were challenged before the WTO, the panel or Appellate Body would also assess whether a less trade-restrictive alternative existed that was reasonably available to the respondent member and that would make an equal contribution to the identified objective, depending on the nature of the measure and the facts of the case [20]. This leaves room for deliberation when it comes to the applicability of GATS obligations as well as the exception under GATS Article XIV, indicating the possibility of measures of the EU or India being challenged before the WTO. Therefore, considering the requirements and challenges discussed in this section, it can be said that the application of general trade disciplines to data transfer restrictions and localization requirements remains uncertain. This calls for better synchronization to meet the realities of the digital economy and adjust to modern business requirements, probably by way of a bilateral or multilateral rule architecture.
4 Need for Harmonization
4.1 Free-Trade Agreements as an Alternative?
As the EU acknowledges data sharing between companies to be essential for the economy, as highlighted in the previous section, testing the GDPR's WTO readiness and exploring the possibility of alternative arrangements are worthwhile. This is important because the EU also acknowledges India as a key trading partner, with which the Commission wants to engage actively and which it wants to consider for an adequacy decision. In the absence of such a decision, a balance needs to be struck between reducing the administrative burden on businesses and ensuring that the personal data of individuals is protected after it is transferred outside the EEA. Harmonization is therefore vital, especially given the growing concern about regulatory fragmentation, where countries legislate beyond their borders with global impact, if traditional rules of international law are to be respected. Also, as states may deliberately restrict information flows, and national approaches to digital trade may have unintended consequences, the prospect of a splinternet in the near future has led governments to push for a shared approach to Internet governance. For example, Japan's G20 presidency called for "data free flow with trust," pledging international cooperation to "encourage the interoperability of different frameworks."19
19 Martin Sandbu: Europe should not be afraid of the 'splinternet', Financial Times (2019) https://www.ft.com/content/e8366780-9be5-11e9-9c06-a4640c9feebb.
The European Commission is also of the view that EU data protection rules cannot be the subject of negotiations in a free-trade agreement,20 all the more so as the EU Charter of Fundamental Rights guarantees the right to the protection of personal data, which is regarded as instrumental for the protection of fundamental rights in Europe, especially after the Lisbon Treaty entered into force on 1 December 2009. Hence, to ensure the uninhibited flow of personal data and to facilitate commercial exchanges involving such transfers to a third country, there is a need to ease trade negotiations, which can also be done by complementing existing trade agreements or by exploring factors to be taken into consideration for future trade agreements. Besides this, the Commission also seeks to use EU trade agreements to set rules for e-commerce and cross-border data flows and to tackle new forms of digital protectionism, in compliance with and without prejudice to the EU's data protection rules. India has its own set of requirements under the proposed Bill; for example, the requirement that data fiduciaries, i.e., data controllers in this context, mandatorily have a privacy-by-design policy in place, to be approved by the DPA for certification of every organization's privacy program, creates a burden. Under the GDPR, by comparison, the corresponding requirement applies after taking into account the cost of implementation and the nature, scope, context, and purposes of processing, as well as the risks of varying likelihood and severity for the rights and freedoms of natural persons posed by the processing. This example of diverging regulatory requirements under EU and Indian law is also worth highlighting in the case of cross-border data requirements. Where the GDPR requires a country either to have "adequacy" status or to have other measures in place, such as SCCs, BCRs, or certifications, India, on the other hand, requires that sensitive personal data continue to be stored in India, that critical personal data be processed only in India, and that the transfer of such data be subject to the conditions laid down in Section 34 of the Bill. Hence, it can be argued that overlapping regulatory requirements, especially for data flows, may act as a dampener on international business, acting in particular as an onerous burden on small start-ups that may not have the resources to adhere to a multitude of complex compliance laws. The free-flowing nature of Internet content and information is said to be diminished. For this purpose, a free-trade agreement with the necessary provisions to regulate the flow of data for the exchange of goods and services could be a possible alternative. This is all the more important as, in digital trade, personal data itself can be a subject of trade, especially in the business of data brokers and the trading of big data troves. Many scholars have highlighted the uncertainty of the application of general exceptions to these regulatory measures and have concluded, at least for the EU measures, that even if they were ever to qualify as a GATS violation, they may not be capable of meeting all the requirements of the exception.
20 European Commission: European Commission endorses provisions for data flows and data protection in EU trade agreements (2018) https://www.libreresearchgroup.org/en/a/data-flow-and-data-protection-in-eu-trade-agreements.
Further to this, the EU is also currently negotiating the next generation of bilateral or multilateral free-trade agreements on trade in services, especially in e-commerce. As there is slow progress on e-commerce and digital issues within the multilateral framework of the WTO, issues related to cross-border data flows are increasingly featuring in mega-regional or bilateral free-trade agreements (FTAs). Although the WTO's dispute settlement bodies have asserted that WTO rules apply to data flows, the WTO has not kept up with the new data-driven economy. While such provisions are generally found in the digital trade or e-commerce chapters of the agreements, some relevant provisions might also be found in the context of sectoral commitments. Notable examples include Preferential Trade Agreements (PTAs) venturing into specific rules for cross-border data flows, such as the CPTPP (Comprehensive and Progressive Agreement for Trans-Pacific Partnership), a free-trade agreement agreed between 11 Pacific-Rim nations, and the USMCA (United States–Mexico–Canada Agreement). These have become important platforms for regulating restrictions on cross-border data transfer. The CPTPP has also introduced, in its Electronic Commerce chapter, binding provisions prohibiting data localization and imposing requirements on the cross-border transfer of data. This can ensure targeted cooperation among governments in dealing with concerns raised by such NTMs and offer an efficient option to achieve regulatory objectives while reducing potentially unnecessary trade costs, which in turn can lead to greater interoperability across national regulatory regimes. In my view, such an approach would also uphold and be in sync with Article 50 of the GDPR. This provision reflects the recommendation of the Organization for Economic Cooperation and Development (OECD) of 12 June 2007 on cross-border cooperation in the enforcement of laws protecting privacy. Having trade agreements ensure the same could be the right step in this direction, thereby additionally enabling economies to take better advantage of the welfare-enhancing benefits of trade. The OECD also confirms the view that international regulatory cooperation helps create options that reduce unnecessary diversity of domestic regulation among trading partners while maintaining national policy objectives. This would also help avert the possibility of a splinternet, as discussed previously. Under the WTO framework, while many agreements do not explicitly refer to mutual regulatory cooperation, some agreements contain references to harmonization and mutual recognition; such references are embedded, for example, in the Technical Barriers to Trade (TBT) Agreement.
4.2 Analyzing Proposed EU Horizontal Provisions
To facilitate cross-border data flows as part of digital trade, and as part of the EU's commitment to the WTO's ongoing negotiations on e-commerce, horizontal provisions for regulating cross-border data flows and data protection have been proposed for trade agreements by the European Commission.21
These provisions are intended to be included in all agreements that the EU will be pursuing with third countries, in an attempt to ensure that its trade deals do not undermine the spirit of the GDPR to protect privacy [21]. Though still at the draft stage, if these clauses are agreed upon by the EU member states, they will serve as the starting point for negotiating provisions on cross-border data transfer to be incorporated in free-trade agreements and bilateral investment treaties. The EU's approach specifically prohibits data localization and data storage measures. This could contrast with India's approach, as discussed in previous sections. A stringent stand by the EU to incorporate provisions without room for negotiation, regardless of the prevailing conditions regarding data protection in a third country, reflects the Brussels effect at work. The Joint Communication to the European Parliament and the Council titled "Elements for an EU strategy on India"22 highlights how a comprehensive data protection law in India would facilitate bilateral data flows and foster trade relations between the two economies, and at the same time reflects the EU's own objectives. It states that the EU's goal is to work toward a business environment for European companies that trade with or invest in Indian companies that is non-discriminatory and sound. Facilitating enhanced market access for EU companies would require the prevention of barriers, including non-tariff barriers such as data localization restrictions. In the current situation, any form of data localization requirement could act as a barrier for the EU. However, this one-way approach on the EU's part does not consider India's objectives in introducing such requirements. This paves the way for the "Brussels effect" argument. This can also be seen in the approach of the International Chamber of Commerce (ICC), where a letter from ICC Secretary General John W.H. Denton AO to Emil Karanikolov, Minister of Economy of Bulgaria (the nation then presiding over the Council of the European Union), urged EU member states to incorporate strong clauses in the proposal of horizontal provisions to regulate cross-border data flows between the EU and a third country. This move was intended to restrain countries from introducing "damaging protectionist measures," which could be unjustified and could impact the value of any trade agreements.23 As the EU itself has upheld the need for a comprehensive and balanced agreement, especially with India, to respond to the trade interests of both parties for mutual sustainable growth and development, such a stringent approach toward horizontal provisions reflects a different picture. It comes across as a move not to consider the rationale of third countries for adopting particular forms of data protection measures, which underlines the need to negotiate balanced, ambitious, and mutually beneficial agreements on trade and data protection.
21 Horizontal provisions for cross-border data flows and for personal data protection https://trade.ec.europa.eu/doclib/docs/2018/may/tradoc_156884.pdf.
22 European Commission: Joint Communication to the European Parliament and the Council - Elements for an EU strategy on India (2018).
23 International Chamber of Commerce: ICC calls on EU member states to increase ambition on cross-border data flows in EU trade agreements (2018) https://iccwbo.org/media-wall/news-speeches/eu-member-states-must-strengthen-provisions-cross-border-data-flows-says-icc-secretary-general/.
Common approaches and standards to promote data protection values and facilitate data flows could be achieved by the EC adopting a data adequacy decision toward India; alternatively, flexibility in the horizontal clauses, with more room for negotiation, should also be promoted. A lack of interoperability across policy and regulatory environments can lead to unnecessary administrative burdens and compliance inconsistencies across jurisdictions. Also, the fact that the protection of personal data is a fundamental right in the EU, and hence not subject to negotiations in the context of EU trade agreements, leaves very little scope for consideration of the policy situations and objectives of third countries. The EU's assertive stance has also been touted as a form of protectionism,24 which businesses operating in India may not agree to comply with, as it amounts to adopting the domestic laws of another country, making trade between the two jurisdictions difficult. Though it is true that strict protectionist measures are also not trade-friendly and may be challenged under Article XIV GATS, where such a measure may not be "necessary" even under the exception provision, the decision as to whether a regulatory approach is "unjustified" or not should be negotiable. Therefore, the ICC's approach may seem one-sided and could be seen as a move to spread the Brussels effect, especially given that India's data protection regulation may incline toward the adoption of data localization provisions. The horizontal provisions do seem to allow the parties to review and assess the functioning of the provisions three years after the entry into force of the Agreement; however, there does not seem to be any room for review of the clauses as proposed in their existing form. For example, India's position, where the government requires data to be localized so that regulators can access it (for instance, financial regulators requiring financial data to remain local in case they need access to it for regulatory purposes),25 should be open for deliberation before any provisions are finalized as part of a digital trade agreement. Currently, the horizontal provisions proposed by the EU do not seem flexible enough to allow alternative policy objectives to be considered. Currently subject to discussions in the EU Council,26 if agreed upon by the EU member states, these provisions introduced by the EC in 2018 can serve as a mandate. Therefore, to facilitate trade, it is recommended that an adequate balance be achieved.
24 Hans von der Burchard, Jacopo Barigazzi, Kalina Oroschakoff: Here comes European Protectionism, Politico (2019) https://www.politico.eu/article/european-protectionism-trade-technology-defense-environment/.
25 Joshua P. Meltzer: Data and the transformation of international trade, Brookings (2020) https://www.brookings.edu/blog/up-front/2020/03/06/data-and-the-transformation-of-international-trade/.
26 Last featured on the agenda of the March 7, 2018 meeting of the Working Party on Information Exchange and Data Protection https://data.consilium.europa.eu/doc/document/CM-1755-2018-INIT/en/pdf.
Perhaps the review clause can be invoked by India once the PDPB is in place, and, considering India's position, an adequacy decision could additionally be aimed at. This was also done in the Japan–EU Economic Partnership Agreement (EPA), which includes a review clause on data flows. According to this provision, "the Parties shall reassess within three years of the date of entry into force of this Agreement the need for inclusion of provisions on the free flow of data into this Agreement." The primary reason for including this clause was the ongoing internal Commission discussions on data flows at the time the political agreement was concluded in July 2017; within three years of the entry into force of the EU–Japan EPA, the Commission is to assess the need to include provisions on data flows and data protection. Also, the criteria for the "necessity" assessment need to be elaborated upon. Since it often implies consideration of alternative measures, a claim of "necessity" may be challenged by the other party if it can establish the existence of a less trade-restrictive measure that is "reasonably available." Such an alternative qualifies if it allows a party to achieve the same desired level of protection of the public interest pursued, without prohibitive cost or substantial technical difficulties. In the case of the EU, as its approach can be seen to be more restrictive when it comes to the cross-border flow of personal data, it is said that the wide implementation of less trade-restrictive mechanisms to ensure compliance with the domestic data privacy framework suggests the reasonable availability of alternatives to the EU [22]. Therefore, even in the case of the horizontal provisions, the EU's stance should be revisited to be more inclusive and flexible. Additionally, given the lack of a comprehensive mandate at the WTO level to regulate this, free-trade agreements or economic partnership agreements at the bilateral, regional, or multilateral level have emerged as the primary method of setting rules. Trade agreements now contain a chapter on e-commerce, with provisions on privacy and data protection as an essential component. The United States–Mexico–Canada Agreement (USMCA) is a case in point, which includes negotiated trade rules on privacy and cross-border data flows in a trade agreement. Another key bilateral agreement is the EU–Japan Economic Partnership Agreement, which entered into force on 1 February 2019. Though at the time of negotiations with Japan the EU faced demands to include general provisions on free data flows in the agreement, the parties settled for a "rendezvous" clause. This clause requires the parties to reassess the inclusion of provisions on free data flows within three years of the entry into force of the agreement. This agreement could be a case in point for EU–India relations, since India is yet to formalize its PDPB and also aims to seek adequacy status.
5 Conclusion and Proposed Way Forward
Considering the multiple challenges posed to the global trading order by stringent regulations, adopting a balanced approach becomes of paramount importance.
A patchwork of regulations, without a uniform or standard approach, may create obstacles for international trade, both from a regulatory perspective, due to the possibility of non-compliance, and from an economic standpoint, by creating obstacles for businesses seeking to expand their operations. This can further pave the way for the fracturing of the global Internet, where a lack of agreement on global norms for trade may split the Internet into virtual trading blocs or lead to authoritarianism in Internet regulation. Though regulatory divergence across economies, especially with respect to data protection, may seem to create difficulties in ensuring harmonization, since each economy's privacy standards are bound by differing cultural factors, it is submitted that these issues can, to a great extent, be addressed and dealt with. Considering this, the following factors must be considered in case a digital trade agreement is proposed between the EU and India to facilitate e-commerce between the two countries and to resolve the issue of fragmented regulations concerning digital trade:
• It is proposed that both parties, i.e., the EU as well as India, should ensure that, without compromising on privacy and data protection, new negotiations offer the states the opportunity to clarify their commitments and policy objectives. This will uphold the objectives of their respective data privacy norms and reduce the possibility of them being violative of relevant GATS provisions.
• This could further be formalized by introducing trade principles at a multilateral level exclusively to address digital trade concerns such as cross-border data flows, which would take into account the trade concerns of ensuring non-discrimination and an open market ideology while also according protection to personal data being transferred. The principles should also leave ample scope for members to regulate public policy concerns when prohibiting data localization.
• One key principle to be considered should be that of legal interoperability. To address new regulatory concerns such as data privacy, cooperative global regulation on issues that transcend market access would require countries to aim for interoperability, rather than trying to ensure that their own model of regulation is adopted globally. This will also ensure the creation of a trusted environment for trade in the digital economy to reach its highest potential. Mutual goals to facilitate the cross-border transfer of information for business operations could be a small move toward harmonization. This would ensure that businesses in both jurisdictions face fewer cost implications, achieve increased productivity, meet consumer demands, and enhance innovation. Attempts by regulators to develop a shared sense of good practices for data governance at an international level, such as the review of the OECD Privacy Guidelines as well as the work on principles on AI, both involving countries beyond the OECD membership, further highlight the need to move toward the adoption of internationally accepted principles in light of issues arising from the advent of digitalization. In the case of international trade law, policymakers should promote the establishment of new trade principles, with the underlying objective of allowing the flow, storage, and handling of all types of data across borders, subject to privacy and security laws and other laws affecting data flows covered under GATS Article XIV. Any trade agreements addressing data flows should enable cross-border data movement in a way that does not allow blanket restrictions, establishing fair rules to facilitate data movement. An interoperability mechanism could either be created under the auspices of the WTO, or an existing system could be expanded, to allow for open data flows between different data privacy regimes to facilitate trade. This could be in line with Article 50 of the GDPR: although the EDPB ensures cooperation with international organizations and frameworks on matters of privacy and data protection, such as the OECD or the Asia-Pacific Economic Cooperation (APEC), such an interoperability mechanism for facilitating trade while protecting privacy can be read to follow this provision.
• In the case of the EU, the strategic agenda for the term 2019–2024 also highlights the need for a level playing field in trade, ensuring fair competition, reciprocity, and mutual benefit in trade policy. To ensure a balanced approach by the EU, specifically with regard to the proposed horizontal provisions, which reflect the EU's lack of willingness to provide for reciprocity, the provisions should either be revisited to make them flexible or reworded to soften the language, so that the EU's stance does not appear to be moving toward setting global Internet standards for data and emerging technologies and gaining the first-mover advantage. This would also be essential to accommodate its commercial interest in using technologies such as AI to better understand and serve foreign markets.
• On the other hand, in the case of India, the toned-down provisions for data localization in the latest version of the PDPB are a welcome move; such measures, if stringent in nature, could impede the economic competitiveness of a country, and this should be duly acknowledged. Softening the data localization measures is a first encouraging move in this direction, to be followed by the introduction of a comprehensive data protection legislation. To balance the interests of a developing country from the global south, more flexibility must be allowed for it to develop its own regulatory framework rather than being coerced into complying with standards set by the developed world, so as to bridge the economic divide between the EU and India.
Over a period of time, it can be observed that e-commerce chapters have found a place in FTAs, with parties becoming increasingly aware of the challenges that barriers to e-commerce pose to global and regional trade. However, as it has been clarified time and again that EU data protection rules in particular are not subject to negotiations in a free-trade agreement, and considering the challenges that complying with the GDPR to achieve "adequacy" status may entail, the possibility of introducing partial or sector-specific adequacy could be another viable option. In this case, specific sectors, for example, financial services or IT sectors involving the trade of goods and services concerning the international movement of data, could be considered. This may involve consideration of the nature of the privacy regime and the extent to which sectors in an economy are exposed to data flows from the EU. This view is also supported by the European Commission itself. However, this should complement any trade agreements, either existing or to be negotiated in the future, that set rules for e-commerce and cross-border data flows while ensuring compliance with the respective laws of the member states.
flows while ensuring compliance with the respective laws of the member states. This could be a viable option especially for India, as the road to adequacy status may be a long one for now, with the Bill yet to be formalized. To conclude, cross-border data flows between the EU and India can be partially regulated by a digital trade agreement so as to avoid an e-commerce splinternet, provided the factors proposed above are taken into consideration. Internationally agreed principles to govern trade agreements and e-commerce provisions are therefore required if negotiations for a bilateral trade agreement between the EU and India are initiated in the future. This should be in line with the proposed international cooperation mechanisms, which could serve as a means of regulatory harmonization and legal interoperability. It could entail common principles across a subset of WTO members – the EU and India in this case – and enhanced cooperation between them. Such an agreement could play an important role in applying trade law to data-restrictive measures, particularly by providing a sound framework that balances domestic Internet regulation and liberalized data flows. It is crucial to highlight that the aim of this chapter is not to suggest provisions under trade agreements as a replacement for existing instruments, for example in the GDPR, but to consider them as an additional alternative. Such a principle-based approach, which prioritizes mutual agreement on common privacy outcomes and gives each country the flexibility to achieve these goals without creating unnecessary economic and trade costs, is an alternative worth considering and exploring, especially given the difference in power dynamics between the EU, a developed region, and India, a developing country. Free-trade agreements can, to a certain extent, act as a tool to harmonize trade law with data protection law, ensuring both the protection of data and the facilitation of e-commerce. They can ensure that the relationship is economically meaningful, delivering real new market openings in all sectors to both sides, and contains a solid rule-based component.
Chapter 8
When Regulatory Power and Industrial Ambitions Collide: The “Brussels Effect,” Lead Markets, and the GDPR
Nicholas Martin and Frank Ebbers
Abstract This chapter explores certain innovation-promoting effects of the GDPR and their geographical dispersion. It shows that while the GDPR has sparked substantial innovation and the birth of a new industry in the field of “privacy tech,” or technological solutions for data protection compliance, this industry is largely dominated by North America-based companies. Despite the GDPR’s origin in Europe – and despite the hopes of European policymakers that it might spark a wave of new technology innovation in Europe – European companies seem to have struggled to establish themselves in this market. The chapter draws on two concepts from regulatory studies and innovation studies – the “Brussels effect” and regulation-induced lead markets – to explain why, arguing that this surprising outcome (a European law sparking the birth of a new technology industry in America) derives not from idiosyncratic factors connected to the GDPR or even software industries, but from structural factors related to the logic of regulatory globalization.
Keywords Privacy-enhancing technologies · Privacy tech · Innovation · Brussels effect · Lead markets · Porter hypothesis · Regulation
1 Introduction
Unrestrained technology development and economic activity can lead to externalities like pollution or unsafe products. To overcome such market failure, governments resort to regulation: coercive rules that guide firms’ behavior by banning, restricting, mandating, or incentivizing desired or undesired practices. At least initially, regulation is liable to impose additional costs or constraints. This can erode firms’ international
competitiveness if it is imposed in just one country. Especially for open, export-oriented economies, this is a legitimate concern. Unsurprisingly, attempts to impose new regulation are regularly accompanied by warnings about detrimental effects on the affected industries. But while the notion that by imposing higher costs or other complications regulation must necessarily reduce competitiveness remains common in public debate, scholars have shown that its economic effects are more complex. For one, regulation pioneered in one jurisdiction can diffuse to others, thus equalizing the competitive playing field. At least under some conditions, far from rendering the pioneer country uncompetitive as others “race to the bottom,” stringent regulation can spark “races to the top” wherein the most stringent standard becomes the global standard, a phenomenon theorized as the “Brussels” or “California effect” [11, 12, 57]. For another, regulation can induce innovation and create new markets for compliant or compliance-supporting products. If the regulation diffuses globally, this can create export opportunities for the pioneer country, a process theorized as regulation-induced “lead markets” [7, 10, 37].
Both the idea of a “Brussels effect” and regulation-induced “lead markets” have some empirical support. They have also attracted considerable political and policy interest as they seem to offer strategies to profit from global regulatory influence and arguments to defend regulatory initiatives against charges of sapping competitiveness. For example, EU Commission officials repeatedly voiced hopes that the General Data Protection Regulation (GDPR, the “Regulation”) might spark a wave of innovation and privacy-friendly digital technologies built in Europe. This possibility was also discussed in the official Impact Assessment of the Regulation [20].
This chapter argues that these hopes are likely to often be misplaced. The “Brussels effect” is real and regulation can create new markets, but the mechanisms underlying the EU’s global regulatory power are liable to actively weaken the formation of specifically European lead markets and lead suppliers. Paradoxically, the stronger the “Brussels effect,” the lower the likelihood of lead markets and suppliers emerging in the EU, absent supportive policy or especially benign demand and supply conditions.
The GDPR is a good example of this. While the Regulation rapidly became the benchmark for privacy regulation globally and almost single-handedly created a large and rapidly growing market for so-called “privacy tech” – innovative software products to help companies govern their data processing and attain compliance – this market seems to be largely dominated by North American firms and venture capital funds. European vendors have been relegated to the sidelines. Understanding this paradoxical outcome and how it may vary across different markets, industries, and forms of regulation is important given the EU’s emergence as a regulatory superpower able to shape global rules, and European policymakers’ interest in strengthening the EU’s industrial base.
The chapter is structured as follows. In Sect. 2, the two theories of the “Brussels effect” and regulation-induced lead markets are laid out and the interaction of their underlying causal mechanisms analyzed. Section 3 uses the development of
the “privacy technologies” industry in Europe and the United States after 2016 in response to the GDPR as a case study to illustrate how the “Brussels effect” can undermine lead market and supplier formation in Europe. Section 4 discusses policy implications and concludes.
2 The “Brussels Effect” and Regulation-Induced Lead Markets
2.1 “Unilateral Regulatory Globalization”: The “Brussels Effect”
The “Brussels effect” was theorized by Anu Bradford [11, 12]. It draws on earlier debates about whether globalization was leading to regulatory races “to the bottom.” Against critics who argued that globalization was leading to the erosion of standards as countries competing for investment reciprocally lowered regulatory requirements, scholars like David Vogel and Robert Kagan pointed to cases where countries seemed to be “regulating up” and the most demanding regulatory standard, often set in California or Brussels, emerged as the global norm [57, 58]. Bradford expanded these empirical observations into a systematic theory, specifying conditions under which single jurisdictions like the EU could engage in “unilateral regulatory globalization,” that is, “externalize [their] laws and regulations outside [their] borders through market mechanism,” so that their standards became the global standard ([11]: p. 3).
Briefly stated, jurisdictions like the EU can do this when four conditions hold: the jurisdiction must (1) have a large domestic market and (2) significant regulatory capacity, (3) the regulation in question must set standards for an “inelastic” (i.e., immobile) target, such as consumer markets, as opposed to elastic/mobile targets (e.g., capital), and (4) the affected firms’ conduct or production must be “nondivisible,” meaning that it is economically unattractive (or impractical) for them to produce or conduct themselves simultaneously according to multiple standards across different jurisdictions. Given these conditions, the jurisdiction’s standard is likely to emerge as the world standard, especially if it is the most demanding standard globally ([11]: p. 5).
The causal mechanism behind this globalization of the most stringent regulatory standard is as follows: most foreign (non-EU) companies will be unwilling to forego the EU market (condition 1) but also cannot shirk, undermine, or evade the EU’s regulation given EU regulatory capacity (condition 2) and the fact that it regulates conduct in the geographically immobile consumer market (condition 3). At the same time, business economics makes it unattractive to produce to multiple regulatory standards (condition 4). Thus, firms adopt the most stringent standard voluntarily, as they thereby automatically conform to all weaker standards too. The EU standard thus emerges as global de facto standard among export-oriented firms (“de facto Brussels effect”).
The compliance costs these non-EU firms incur in order to adapt their production (and/or other practices) to the EU standard in turn incentivize them to lobby their home governments to adopt the EU standard in order to level the playing field vis-à-vis domestic, non-export-oriented competitors (“de jure Brussels effect”) ([11]: p. 5–6).1
Note two implications of this theory. Firstly, the timing of these moves by foreign (non-EU) exporters implicit in the theory: since even temporary noncompliance with the new EU standard would mean loss of access to the EU market, they will adopt compliance measures at the same time as EU firms, that is, immediately upon the regulation’s enactment. Any adjustments to business processes, introductions of new products, or purchases of compliance products will be made at the same time as EU firms make them. Secondly, the kinds of foreign and domestic firms primarily affected by (and responding to) the regulation: while all domestic (EU) firms active in the regulated field will be affected, among foreign (non-EU) firms it will primarily be export-oriented companies, as only they sell to the EU market. Foreign producers who only address their own domestic market are unaffected by EU regulation. This matters as export-oriented companies are likely to be, on average, larger and relatively more competitive (i.e., better capitalized, with higher technological capabilities) than purely domestically oriented firms.
2.2 Regulation-Induced Lead Markets The notion that regulation might induce the formation of lead markets and lead suppliers derives from the literature on the Porter hypothesis and on lead markets. The former has several variants. Roughly summarized, it suggests that environmental regulation can induce innovation that will improve firms’ competitiveness [45]. While the hypothesis’ validity remains disputed [1], scholars took from it the idea that regulation could spark compliance innovations – either among the regulated firms or among suppliers who provided them with equipment to achieve compliance (e.g., pollution control technology) – that might be of economic value [40, 48, 52]. Research on lead markets was initially mainly concerned with strengthening European competitiveness. Scholars like Meyer-Krahmer and Reger [42] and Beise [4–6] sought to define the characteristics of markets in which globally dominant products might emerge to guide industrial policy. This was part of a larger interest in the role of demand-side factors in innovation and industrial policy, and has had a sustained influence on Germany’s “High Tech Strategy” and EU policy.
1 Global policy diffusion of course can be driven by more factors than just these neo-Stiglerian regulatory politics by self-interested producers [53]. Learning [23], symbolic emulation [24] and competition and coercion [50] can be important too.
A “lead market” is commonly understood as the (geographically delimited) market in which internationally successful products are first widely adopted and assume their final shape or “dominant design,” which is then subsequently exported [7, 19, 54, 55]. The structure of demand plays a key role in this concept. Lead markets are those markets where the largest and most sophisticated buyers of the new technology sit – whose requirements for quality and functionality shape the emergent product characteristics – and where aggregate demand for the new technology is sufficiently large as to enable suppliers to develop economies of scale/scope that will yield durable competitive advantage. That is, the lead market is the geographic area where significant volumes of sophisticated demand for the new technology emerge first, giving local suppliers a significant first-mover advantage vis-à-vis companies elsewhere and enabling them to emerge as lead suppliers. (The concept assumes that innovating the new product usually requires suppliers to be located close to their customers.) Scholars have debated the preconditions necessary for a market to emerge as lead market. Of interest here is how these ideas were applied to the case of regulation-induced innovation and how they intersect with the “Brussels effect.”
Driven by the observation that (environmental) policies and regulation tend to diffuse globally, from pioneer countries who regulate first, to others, the literatures on lead markets and the “Porter hypothesis” were brought together in studies on the potential of regulation to create markets and export success for “eco-innovations” [7, 13, 36, 37, 47, 48, 59]. Scholars theorized that if (1) pioneer country regulation prompted the emergence of compliance innovations, and (2) this regulation subsequently diffused, then (3) pioneer country suppliers who had innovated the necessary compliance equipment to supply their own home market, where the regulation had happened first, could enjoy first-mover advantages and gain export market share [7, 37]. That is, by creating domestic lead markets with export potential, supposedly competitiveness-sapping consumer or environmental regulation might contribute to economic growth. The idea found some empirical support, especially in the case of renewable energy technologies, where industry development seemed to derive from initial market-making home country regulation with subsequent export success as policy diffusion created markets abroad. It proved highly influential especially in the German Environment Ministry [13, 47, 59].
Note three assumptions implicit in this theory. Firstly, timing: there must be a nontrivial time lag between the regulation’s adoption in the pioneer country and its international diffusion. Otherwise, pioneer country suppliers may struggle to build up first-mover advantages vis-à-vis other potential suppliers abroad. Secondly, low regulatory internationalization: the theory assumes that the main economic area where the regulation unfolds its effects is the regulating jurisdiction’s home market. Prior to diffusion, the firms affected (as targets of regulation or suppliers of compliance equipment) are mostly domestic. At least, the largest and most sophisticated demand for the new compliance products – that will do most to drive innovation and determine technological trajectories – is assumed to be located
in the domestic market. In effect, the regulation is assumed to create a sheltered home market where domestic suppliers develop first-mover advantages. Thirdly, demand-side emphasis: while theorists going back to Meyer-Krahmer and Reger [42] had noted the role of supply-side factors, the theory and related policy actions like the EU Lead Markets Initiative focused on the role of domestic demand in bringing forth domestic supply and then making suppliers internationally competitive.
Impressed by Chinese suppliers’ ability to profit from European renewables mandates, scholars have critiqued this neglect of the supply side. Quitzow et al. [47] point out that while pioneer country regulation may create domestic demand, it does not guarantee that supply will remain domestic. Given free trade, domestic demand can suck in supply from elsewhere if suitable technological and production capabilities exist abroad. Similarly, if a regulation diffuses to other countries with more favorable conditions for technology deployment, then the locus of the developing industry might move too. The pioneer market may thus have to cede lead status to other geographies.
These critiques are important. The argument pursued here, though, is more fundamental: when the EU sets (de facto) global standards, significant demand for compliance innovations itself will automatically emerge elsewhere (if the regulated industry is distributed globally) and will do so instantly upon enactment of the regulation, or at least at the same pace as demand emerges in Europe. That is, the logic of the “Brussels effect” vitiates key assumptions around timing and slow regulatory internationalization of the idea of regulation-induced lead markets. The time lag crucial to the idea – between the development of the home market in the regulating jurisdiction (wherein the domestic suppliers are to develop first-mover advantages) and the development of the global market for their products following regulatory diffusion – may never materialize. Indeed, if the largest and most sophisticated pools of demand for compliance technologies sit abroad, then the lead market and supply base themselves might emerge abroad, including many of the strongest and technologically most advanced suppliers. This is quite a different phenomenon to that of low-cost emerging economy producers flooding European markets with cheap, essentially commodified products in response to European regulation/subsidies, as occurred in solar. Here, the “Brussels effect” may lead to the most advanced suppliers emerging abroad. The following sections use a case study of the emergence of “privacy tech” in response to the GDPR to show that these dynamics are not merely hypothetical possibilities.
3 The GDPR and “Privacy Tech”
3.1 The GDPR: Setting Rules for Foreign Technology Companies
Reportedly the single-most lobbied regulation in the EU’s history, the GDPR was first proposed in 2012. Finally enacted in 2016, it entered into force on May 25, 2018. It updates the 1995 Data Protection Directive and creates a comprehensive set of rules for processing personal data. An important concern was to make the GDPR binding upon foreign (read US) technology companies, which had come to dominate much of the Internet. Thus, Article 3 GDPR specifies that the Regulation applies not only to data controllers and data processors (firms, other organizations) established in the Union but also to controllers and processors located anywhere in the world if they process “personal data of data subjects who are in the Union” in the context of “offering goods or services” to them (including goods or services offered for free) or “monitoring their behavior.” Virtually any online or offline interaction between foreign companies (and other institutions) and individuals located in the European Union is covered by the GDPR, irrespective of whether the foreign entity is physically present in the territory of the Union. It is a prime example of the unilateral regulatory globalization by the EU – the de facto extension of European standards to the rest of the world – that is at the heart of the “Brussels effect.”
To incentivize companies to adhere to the GDPR, the Regulation drastically increased the fines that can be imposed for violations. Under the Data Protection Directive, penalties had been set by Member State legislation and rarely amounted to large sums [3, 8, 38]. In Germany, fines prior to 2018 seem to have mostly been in the range of a few thousand euros. Data protection law was widely ignored [38, 41]. This changed with the GDPR, which threatened fines of up to 4% of annual worldwide turnover (Article 83(5)). Regulators signaled that they would enforce the law much more aggressively than before. Within 2 years, British, German, and French authorities had begun to hand out fines as high as several hundred million euros. Unsurprisingly, firms seem to have generally taken GDPR compliance very seriously.
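The territorial-scope and sanction provisions just described lend themselves to a brief worked illustration. The following minimal Python sketch is ours and purely didactic: the function names and the simplified applicability test are not part of the Regulation or of any compliance product. It restates the logic of Article 3 and the fine ceiling of Article 83(5), which is the higher of EUR 20 million or 4% of total worldwide annual turnover.

```python
# Illustrative sketch only: a toy reading of the GDPR provisions discussed above.
# The names and the simplified applicability test are the authors' own
# illustration, not part of any statute or commercial product.

def gdpr_applies(established_in_eu: bool,
                 offers_goods_or_services_to_eu_subjects: bool,
                 monitors_eu_subjects: bool) -> bool:
    """Simplified territorial-scope test in the spirit of Article 3 GDPR."""
    return (established_in_eu
            or offers_goods_or_services_to_eu_subjects
            or monitors_eu_subjects)


def max_fine_eur(worldwide_annual_turnover_eur: float) -> float:
    """Upper fine bound under Article 83(5): EUR 20 million or 4% of
    worldwide annual turnover, whichever is higher."""
    return max(20_000_000.0, 0.04 * worldwide_annual_turnover_eur)


if __name__ == "__main__":
    # A non-EU firm selling to customers located in the Union is covered.
    print(gdpr_applies(False, True, False))           # True
    # With EUR 10 billion in turnover, the theoretical ceiling is EUR 400 million.
    print(f"{max_fine_eur(10_000_000_000):,.0f}")      # 400,000,000
```

For a firm with EUR 10 billion in worldwide turnover, the ceiling is thus EUR 400 million, which is the order of magnitude of the largest fines handed out within 2 years of the Regulation taking effect.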
3.2 Regulatory Diffusion European data protection law, especially the GDPR, has been a remarkable “export success.” As early as 2012, studies showed that globally most data privacy laws have substantially incorporated many key principles of the European laws [25, 26]. The GDPR accelerated this. After it entered into force, laws heavily modeled on the GDPR were enacted in Brazil (2020), California (2018), Thailand (2019), and Tunisia (2018, draft law), among others, and the GDPR has influenced ongoing
legislative endeavors in India and Canada [15–18, 27]. More broadly, the GDPR has substantially shaped global privacy norms, discourse, and expectations, not only among advocates but business leaders, regulators, and policymakers too [49]. In particular with regard to the United States and Canada, several interviewees noted that business tends to treat the GDPR as the baseline for global privacy compliance programs, with additional country-specific rules then bolted on, and that the GDPR was helping to define a new social license to operate (interviews 3 and 9).
3.3 Compliance Tools: “Privacy Tech”
Privacy and data protection are complex, situation-specific concepts. They include aspects of IT security but go far beyond this. Arguably the central concern of data protection law is to protect people from illegitimate use of their data by organizations acting in accordance with their internal (but illegal) rules and objectives, and more broadly to give individuals control over how their data is used. The GDPR, like other privacy laws, tries to accomplish this by mandating extensive process controls (e.g., specifying conditions under which data can be legally processed), transparency requirements and security measures, and by granting data subjects rights vis-à-vis the data controller.
This makes compliance complex. Aside from ensuring conventional IT security, compliance requires extensive documentation, risk assessments, and implementing processes for data governance (e.g., access rules) and for responding to data subjects’ requests and the exercise of their rights. This requires firms to establish a high level of internal transparency and control over their data processing. To be able to comply with the GDPR, they need to have a solid understanding of what data, exactly, is collected, processed, and stored where in the organization, by whom, for what purposes, and for how long, including maintaining accurate metadata. Much anecdotal evidence suggests that prior to the GDPR this was not the case in many organizations. Compliance is complicated further by the GDPR’s broad definition of “personal data” (cf. Art. 4(1) GDPR), the rapidly growing size and complexity of corporate data sets, and the fact that key principles of the GDPR run counter to data management practices hitherto prevalent. How challenging compliance can be is underscored by a survey suggesting that even in 2020 only 57% of German firms had, by their own account, “largely” or “fully” implemented the GDPR, while 41% claimed to still be in the process of doing so [9].
Traditionally, compliance was (in smaller organizations, often still is) handled manually – by defining organizational policies and processes, conducting pen-and-paper surveys and interviewing employees to track data flows and uses, with results recorded in forms, registers, and spreadsheets. But the growing scale and complexity of data processing, coupled with the GDPR’s demands, has made this approach increasingly impractical for larger firms or firms with large and complex data stores and processes. In response, entrepreneurs and technologists have begun
developing various technical solutions to facilitate compliance, sometimes called “privacy tech.” As a set of commercial technologies, the “privacy tech” space is still young and rapidly developing. The International Association of Privacy Professionals (IAPP),2 which seems to have coined the term “privacy tech,” identified nine product categories in its inaugural 2017 Privacy Tech Vendors Report [30–34]. By fall 2020, this had expanded to 11 categories [33]. These broadly overlap with the tools and actions described in Gartner’s [22] outline of a “technologically enabled privacy program” and also largely correspond to the functions named in IDC’s [44] definition of the “data privacy software market.”3 The IAPP product categories are listed in Table 8.1. They support (and increasingly, automate) core tasks of privacy professionals, including documentation, impact assessments, identifying and mapping the pieces of personal data held and their flows across the organizations, governance (access, processing) of this data, management of user consent and other legal bases, and responding to data subject requests and data breaches. “Privacy tech” also includes solutions to anonymize
Table 8.1 The IAPP’s “privacy tech” product categories
• Assessment managers automate functions like risk analysis, impact assessments, and compliance documentation
• Consent managers help collect, track, document, and manage user consents
• Data mapping and data discovery solutions help identify and classify personal data and its flows
• Data subject request solutions support responding to individuals’ requests to exercise rights like access or rectification
• Incident response solutions help organizations deal with data breaches, e.g., by informing stakeholders about what was compromised and what obligations follow
• Privacy information managers keep organizations up to date about evolving regulation
• Website scanning services check websites to identify embedded tracking technology and ensure compliance
• Activity monitoring solutions help determine and manage who gets access to data and when and how it is and may be processed
• Deidentification/pseudonymity solutions help data scientists to analyze data sets without compromising privacy (e.g., by performing analysis on encrypted data)
• Enterprise communications solutions support secure internal communications
Source: IAPP (2020)
2 The main international professional association for individuals working in corporate, legal, and government positions related to privacy. 3 IDC [44] excludes security controls from its definition of the “data privacy software market,” while the IAPP partially includes certain such controls in its definition of “privacy tech” (e.g., computation on encrypted data). In any event, the seven named companies, which IDC sees as dominating the “data privacy software market,” are all listed in the IAPP Vendors Reports and collectively provide solutions for all of the 12 product categories defined by the IDC.
or pseudonymize data and process it in privacy-preserving ways (e.g., multiparty computation). The products grouped together as “privacy tech” are related to so-called “privacy-enhancing technologies” (PETs) that, since the 1990s, have been developed in various academic contexts (if less often commercialized). Some “privacy tech” solutions are based on PETs, for example, solutions for privacy-preserving data analysis like secure multiparty computation or homomorphic encryption.4 However, “privacy tech” also includes functionalities not usually associated with PETs (e.g., data discovery and mapping, tracking evolving regulation and policy compliance, and others). Also, the underlying objective of commercial “privacy tech” is, often, in a sense orthogonal to PETs (which mostly derive from academic or privacy activist projects). PETs are a class of technologies aimed at “protecting the individual’s privacy,” for example, by providing users with “anonymity, pseudonymity, unlinkability, and unobservability” [28]. Privacy tech, conversely, aims to help organizations remain legally compliant.
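To make this concrete, the following minimal Python sketch illustrates, in highly simplified form, two of the product categories from Table 8.1 – data mapping/discovery and deidentification/pseudonymity. It is a didactic sketch under obvious simplifying assumptions (in-memory records, e-mail addresses as the only personal data, keyed hashing as the pseudonymization step) and does not reproduce the implementation of any vendor discussed in this chapter, whose products automate such steps across entire databases, SaaS applications, and file stores.

```python
# Minimal, illustrative sketch only: a toy "data discovery + pseudonymization"
# pass over in-memory records. Real privacy tech products automate this across
# many data stores; nothing here reflects any vendor's actual code.
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def discover_personal_data(records: list[dict]) -> dict[str, set[str]]:
    """Very rough data-mapping step: report which fields appear to hold
    personal data (here: anything matching an e-mail pattern)."""
    findings: dict[str, set[str]] = {}
    for record in records:
        for field, value in record.items():
            if isinstance(value, str) and EMAIL_RE.search(value):
                findings.setdefault(field, set()).add(value)
    return findings


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Keyed hashing as a simple pseudonymization step: the same input always
    maps to the same token, but the original value is not recoverable without
    the key (a stand-in for the deidentification tools described above)."""
    return hmac.new(secret_key, value.encode(), hashlib.sha256).hexdigest()[:16]


if __name__ == "__main__":
    crm = [{"customer": "Jane Doe", "contact": "jane.doe@example.org"},
           {"customer": "John Roe", "contact": "j.roe@example.org"}]
    key = b"rotate-and-store-me-in-a-vault"
    print(discover_personal_data(crm))   # {'contact': {...e-mail addresses...}}
    for record in crm:
        record["contact"] = pseudonymize(record["contact"], key)
    print(crm)                            # contact fields replaced by stable tokens
```

Keyed (rather than plain) hashing is chosen so that tokens cannot be reversed simply by hashing guessed values; even so, such tokenization is pseudonymization rather than anonymization, and the output generally remains personal data under the GDPR’s broad Art. 4(1) definition.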
4 Creating Lead Markets Abroad: The GDPR and the Development of a “Privacy Tech” Market and Industry 4.1 Research Strategy To understand the GDPR’s role in the emergence of the privacy tech industry and test the hypothesis that the Regulation has created a lead market and lead suppliers for these technologies in America rather than Europe, three sets of evidence were examined. Firstly, quantitative data from the IAPP’s biannual Privacy Tech Vendors Report, published since 2017. The Report is the best available overall guide to the structure and evolution of the industry. It has a regularly updated directory of all privacy tech firms known to the Report’s authors, including their nationality (headquarter location), year founded, and product portfolio.5 Secondly, various pieces of qualitative, documentary evidence: the IAPP’s Privacy Tech Newsfeed (https://iapp.org/news/privacy-tech/; 285 pieces of original reporting on the sector going back to 2015), market/industry reports by consultancies like Gartner, Forrester, and IDC, and general media and tech and business reporting, plus talks and conference presentations by technologists, investors, and entrepreneurs.
4 In the IAPP’s scheme, these fall into the “Deidentification/Pseudonymity” product category.
5 While it cannot be ruled out that the Report’s authors have on occasion missed the odd company, it is unlikely that they should miss any significant company, simply because companies themselves have an obvious interest to be listed in the directory. Interviewed industry insiders considered the report authoritative (interviews 2 and 7).
Thirdly, 10 semi-structured interviews were conducted with industry insiders. These were:
• Three Europe- and US-based analysts at different market research companies who covered the privacy tech sector and had authored reports on it (interviews 2, 3, and 8). One had also worked for a European privacy tech firm (interview 8)
• One senior Europe-based executive from a leading North American privacy tech company (interview 7)
• Two executives (product managers) from a major European enterprise software company that sells (among other products) GDPR compliance software tools (interviews 1 and 5)
• One senior Europe-based employee of a North American privacy tech startup (interview 6)
• One cofounder of a North American–European privacy tech startup (interview 9)
• One cofounder of a European privacy tech startup (interview 10)
• One academic who was also a cofounder of an Israeli–North American security and privacy tech startup (interview 4)
The interviewees were selected on the basis of their experience and knowledge of the privacy tech market and industry, as attested by their organizational seniority or publications. One interview came about through another interviewee’s recommendation. The interviewed companies represented a broad cross section of the industry, including purveyors of comprehensive product suites aiming to address all (or most) needs of the global privacy compliance office of large multinational enterprises (interview 7), vendors of software offering more basic functionalities targeted to smaller or less personal data-intensive companies (e.g., a sportswear manufacturer) (interviews 1, 5, 8, and 10), and developers of various kinds of specialized, technologically advanced solutions for privacy-preserving data analytics and data governance catering to data/analytics-focused clients in highly regulated sectors such as finance, health care, or law (interviews 4, 6, and 9). They included both companies that had been founded specifically to provide compliance solutions for the GDPR or other privacy laws (interviews 6–10) and firms whose main business was in other fields but had realized that their products could also be used for privacy compliance (interviews 1, 4, and 5).
The interviews followed a common outline, with variation to allow for interviewees’ different backgrounds. Interviewees were asked about the GDPR’s role in the growth of the privacy tech industry, their perception of the relative positions and product and market/customer strategies of European and American privacy tech vendors, how the market was segmented, which players covered which customer segments, and – if they stated that they perceived salient divergences in the relative positions or strategies of American and European vendors – what they believed the reasons for these divergences might be. Interviewees who worked for privacy tech firms were asked about their own firms’ strategies, customer base, competitors, and go-to-market experiences. The interviews were recorded, transcribed, and analyzed independently by both authors.
Fig. 8.1 Number of companies included in the IAPP Privacy Tech Vendors Report (bar chart of companies per half-yearly report release, 2017 H1 to 2021 H2, rising from 44 companies in the first 2017 report to 365 in the fall 2021 report)
4.2 Industry Growth The GDPR kick-started the rapid growth of the privacy tech market and industry. The crucial role of the Regulation emerges clearly from both the quantitative and qualitative data. Figure 8.1 shows the number of companies included in the IAPP Privacy Tech Vendors Report, which can be treated as a reliable representation of the overall growth of the industry. As the figure shows, the number of companies active in the space (as captured in the Report) has grown from 44 in 2017 to 365 in autumn 2021. Partly this reflects entry by established firms adding privacy compliance-related products to their offering. Thus, German enterprise software giant SAP entered the privacy tech market in 2018, with three software products for GDPR compliance (covering two of the IAPP’s privacy tech categories). By 2020, SAP had added further tools, now covering six of the IAPP’s categories. SAP also entered into a partnership with BigID, an American–Israeli privacy tech startup [31, 33]. Similarly, IBM entered the Vendors Report in 2019 with products for two of the IAPP categories. By 2020, IBM products covered six IAPP categories [32, 33]. But the growing number of companies also strongly reflects entry by newly founded firms. Fig. 8.2 shows annual company starts for firms included in the Vendors Report since 1997 (the year TrustArc, today widely considered number 2 in the space globally, was founded).6 The number of company starts began rising sharply in 2012, the year the GDPR was proposed, and accelerated in 2016, when it was passed, reaching a high in 2017, the year before the GDPR entered into force. Thereafter, the number of new starts declines again; likely reflecting increasing saturation of the space. Not all of these firms are focused solely on privacy. Notwithstanding the early foundation of privacy-only firms like TrustArc (1997) or Nymity (2002), many of the older firms in particular were likely founded to pursue other use cases (e.g., IT security) and only added privacy with the GDPR. Conversely, younger firms likely have a much greater privacy-only focus. Notably,
6 Twenty-six of the companies from the autumn 2021 Report were founded before 1997.
Fig. 8.2 Number of privacy tech company starts (bar chart of annual company starts, 1997–2021, annotated with the years in which the GDPR was proposed and the legislative process initiated, the GDPR was passed, and the CCPA was passed)
while a number of firms existed prior to the GDPR, many of the firms considered as leading in the privacy tech space are new companies. The only two privacy-only companies to have achieved unicorn status, OneTrust and BigID, were both founded in 2016 [14, 43]. Qualitative evidence, too, supports the contention that the GDPR served as the stimulus for the development of the privacy tech industry and prompted the initiation of new companies in particular. Interviewees consistently stated that the GDPR was the key cause triggering the industry’s development and prompting them to start their own firms (interviews 2, 3, 5, and 7–10). The founders of three of the most successful new privacy tech companies OneTrust, BigID, and Integris (all founded in 2016), too, have repeatedly cited the GDPR as the key event prompting them to start their firms as they believed it would create significant market opportunities. With the beginning of the GDPR legislative process, significant amounts of venture capital flowed into the privacy tech sector. No complete figures are available, but the following numbers are indicative. Since 2015, the six privacy tech companies specialized only in privacy compliance included in reports on leading companies in the “privacy management software” industry by the consultancies Forrester [29] and IDC [44] jointly raised almost $1.4 billion.7 Other significant VC investments in privacy tech companies reported in the media include Immuta, Privitar, Integris, and Ethyca, which have raised a total of $354.3 million since 2015.8
7 OneTrust, TrustArc, Nymity, WhireWheel, Securiti, and BigID.
8 Own calculations based on Crunchbase.com data.
Fig. 8.3 Geographical distribution of privacy tech companies (as of fall 2021) (bar chart of company counts by headquarters location)
4.3 Geographical Distribution of Privacy Tech Firms
Figure 8.3 shows the geographical distribution (headquarter locations) of the privacy tech firms in the Vendors Report (as of fall 2021). The large majority of companies are headquartered in North America (the United States and Canada) or Europe (EU27 plus the United Kingdom, Norway, and Switzerland). A small number sits in the rest of the world. North America has slightly more firms than Europe (183 versus 158). The United States has the largest number of privacy tech firms. In Europe, the United Kingdom has the most, followed by the Netherlands, Germany, and Ireland. Another 16 European countries have at least one firm (not shown).
At first glance, these numbers would seem to suggest that Europe, on the back of the GDPR, has managed to generate a significant privacy tech industry, much as the Commission’s GDPR Impact Assessment had hoped. Closer examination of the data though suggests that this is not so. Rather, almost all the leading privacy tech companies are US-based, and the main technological developments in this space, too, are occurring in North America. Several pieces of evidence support this claim.
Firstly, this was stated consistently by the interviewees (interviews 2, 3, and 6–10), who affirmed that the industry was dominated by US companies, and that most or all of the largest privacy tech firms were North American. As discussed further below, they also felt that the technologically most sophisticated and/or comprehensive (in terms of privacy dimensions covered) product offerings were mostly produced by North American companies, and that hence these were the companies that served the largest and/or technologically most sophisticated and demanding customer segments (which should also tend to be the most lucrative). Conversely, they consistently described European companies as smaller, mostly focused on national markets and smaller or technologically less sophisticated or demanding customers, with simpler products.
The above-cited market reports paint a similar picture. Of the 15 companies identified as the “most significant” privacy tech vendors by Forrester [29] on the basis of client feedback and own research,9 9 are North American (eight US, one
Canadian); 5 are European and 1 Australian. Iannopollo [29] divides these firms into four bands on the basis of their strength of product offering, strength of strategy, and market presence. Notably, the European firms cluster in the bottom band, while all the firms in the top two bands are American or Australian. The North American firms also tend to have the largest market presence. The market study by IDC [44] paints a similar picture. By IDC’s estimate, seven US companies held around 62% of the global “data privacy management software” market in 2019, with the two largest companies (OneTrust and TrustArc) together holding about 45.7% market share.
Further evidence for the dominance of US companies and the centrality of the North American ecosystem comes from reported mergers and acquisitions. With incipient market consolidation, the industry has seen growing M&A activity. Acquisitions serve to cement the position of already-powerful firms, but they are also indicative of which firms are considered to be particularly competitive, since acquirers usually try to buy companies with attractive technology, product, or customer portfolios. To understand acquisition activity, Crunchbase.com data on acquisitions by the 17 companies defined as leading privacy tech companies in the Forrester and IDC market reports was analyzed, alongside the IAPP Privacy Tech Newsfeed and Vendor Reports. Six privacy tech companies could be identified that have performed acquisitions: OneTrust (the United States), TrustArc (the United States), Exterro (the United States), Crownpeak (the United States), SAP (Germany), and SAI Global (France). Since 2012, they have made 43 acquisitions.10 Of these, 15 concerned target firms that were active in the privacy tech and/or broader governance and compliance space.11 Of these 15 acquisitions, 12 were of North American companies (11 US, 1 Canadian); 3 were European (one German, one British, and one Dutch firm).12 It is perhaps not surprising that American firms should tend to buy other American firms. However, both OneTrust and Crownpeak have made acquisitions in Europe. Hence, that the majority of their acquisitions should be in the United States suggests that that is where the more valuable target opportunities sit. This reading is also supported by the fact that, of the five acquisitions made by French SAI Global, four were of American firms and only one of a European firm. Likewise, SAP’s arguably most important strategic tie-up in the privacy tech space is with BigID – an American, not European, firm.
9 Private communication with Ms. Iannopollo, January 5, 2021.
10 The year the GDPR was proposed was chosen as cutoff date as this event effectively birthed the privacy tech industry.
11 The remaining 28 acquisitions were performed by nonprivacy-only firms (mostly SAP) and concerned a range of different software/IT firms.
12 Own analysis of Crunchbase.com data.
4.4 Explaining the Privacy Tech Industry’s Evolution
How to explain that, while a European law created the privacy tech market and industry and fundamentally shaped global privacy regulation, it is American companies that have come to dominate this industry? The interviews pointed to how, stimulated by the GDPR, demand- and supply-side factors have interacted to create this outcome.
Demand-Side Factors in North America
Corresponding to the logic set out in Sect. 2, the GDPR instantaneously created a large market for privacy tech and other compliance services in the United States as well as in Europe. Precise numbers are hard to come by, but the following are indicative. One 2017 survey of 200 US firms with more than 500 employees by the consulting company PwC found that for 92% of respondents the GDPR was either their top or one of their top privacy compliance priorities, with 77% planning to spend US$1 million or more on GDPR obligations [46]. Another 2017 survey by the IAPP and consultancy EY of 548 corporate privacy professionals, mostly in North America,13 reported similar results: 71% of the non-EU respondents believed the GDPR applied to their organization. Also, 50% of US respondents even described it as “driving” their privacy program. And 55% of respondents planned to “invest in technology” as part of their compliance strategy [35]. Based on the IAPP-EY data, the Financial Times estimated Fortune 500 companies would spend ~$7.8 billion on GDPR compliance [39].
13 In total, 59% of the IAPP-EY respondents represented US companies, 14% Canadian, and 22% Europeans, with 74% representing firms with 1000 or more employees.
This outcome was quite logical given American companies’ deep economic ties to the EU and legislators’ concern to make the GDPR binding on US tech firms. Moreover, several factors emerged from the interviews, suggesting that the market created in North America was especially suitable for the emergence of a privacy tech industry. For one, there is reason to believe that the North American (and particularly US) firms affected by the GDPR (and hence in the market for privacy tech solutions) will have disproportionately been large and/or technologically sophisticated. The reasons for this are the general correlation between export orientation and size and technological sophistication, the size of the US tech sector (which was particularly exposed to the GDPR), and America’s significantly greater population of large enterprises compared to Europe generally. This is supported also by the interviews and ancillary data. Interviewees consistently stated that the leading US privacy tech firms primarily served “the large enterprises of this world” (interview 2; similarly interviews 3, 7, 8, and 10) or firms engaged in fairly advanced data analysis (e.g., medical data) (interviews 6 and 10). US-based OneTrust, the largest privacy tech firm globally, has stated that ~50% of its market is in the United States, and that over half of the Fortune 500 firms are among its customers [2, 21]. Similarly, the (much smaller) Seattle-based Integris
worked with several Fortune 500 firms to develop its products and counted these plus a larger stable of “Fortune 1000” firms among its customers [51, 56]. In short, the GDPR created a significant home market for US vendors among large firms. Their requirements in turn shaped the vendors’ product offerings and technological development trajectory. According to interviewees, products’ capacity to handle high levels of complexity in customer organizations’ data and technology stacks, and – with the growing number of new, post-GDPR data privacy laws globally – the finer details of different jurisdictions’ regulations have emerged as a key competitive differentiator. This is also driving a trend toward automated and sometimes AI-based solutions (e.g., data discovery engines that can automatically identify and map personal data across the organization, or apply diverse governance or de-identification rules to it). Satisfying these customer requirements takes significant engineering and legal expertise, which is costly and time-consuming to build up, and which not many vendors possess (interviews 2, 3, and 6–9). Customer requirements seem to have also pushed major US vendors like OneTrust and TrustArc to develop fairly comprehensive compliance offerings – usually in the form of a modular, customizable platform – that (promise to) cover most dimensions of GDPR compliance (and increasingly of other laws like California’s CCPA or Brazil’s privacy law), insofar as they can be addressed technologically. One factor behind this push for “comprehensive” solutions seems to have been that, because prior to the GDPR fines for data protection law violations were minuscule, many US organizations seem to have neglected building up much compliance capacity. The GDPR then caught them flat-footed – they had to rapidly build up comprehensive capacities. Technology offered a tempting solution (interview 2). Indeed, beyond the immediate pain point of last-minute GDPR compliance, interviewees felt that, in their experience, US firms were generally more open to using technological solutions to address compliance challenges than European firms (interviews 2 and 6–10).
In summary, enactment of the GDPR instantaneously created a substantial market for compliance solutions in North America. This market consisted especially of relatively larger firms, who were open to addressing compliance challenges technologically (and therefore interested in privacy tech as a solution) and, on account of their size and complexity, had relatively demanding technological requirements for solutions.
Demand-Side Factors in Europe
That the United States, not Europe, should emerge as a major market for privacy tech seems to have partly taken industry players by surprise. According to one analyst, American firms had initially expected that Europe would be a key market, but then found their products in less demand than they had hoped (interview 2). This was also the experience of several executives (interviews 6 and 9). Several reasons were given by interviewees for this outcome. For one, they noted that Europe continued to be not a single but 27 + 5 separate national markets, divided by language, business networks, culture, and, despite the GDPR, regulatory approaches. Customizing products as language-sensitive as compliance software to
more than 20 different languages and building up corresponding sales organizations is usually too costly, leading European vendors to focus on one or a couple of national markets and languages. This in turn restricts growth potential (interviews 2, 3, 5, 7, 8, and 10).14 It also means that the number of large companies available as potential customers – and as spurs to technology development – is reduced. Interviewees felt that European vendors mostly served SMEs, including independent DPOs, or less personal data-intensive companies (e.g., manufacturers), whose technological requirements were correspondingly lower. Accordingly, European vendors’ product offerings tend to be simpler (e.g., digitized forms, templates, survey and process modeling tools to be filled out by hand, instead of the automated solutions and complex data governance policy machines developed by North American vendors) (interviews 2, 3, 7, and 8). As one European vendor (interview 10) put it, his company focused on “digitizing data protection documentation” (i.e., digitizing previously pen-and-paper-based data protection processes while remaining within a basically manual framework), contrasting this with attempts to wholly “technify (technisieren)” data protection (i.e., substitute manual processes with automation), which seemed to be pursued mainly by US vendors. European vendors, interviewees felt, still mainly offered only GDPR products – which makes sense if one is serving smaller, national market-focused customers. (Conversely, US vendors increasingly seek to offer comprehensive global solutions covering the privacy regulatory regimes of all, or at least all major, jurisdictions worldwide.) (Interviews 2, 3, 7, and 8.)15
Several interviewees also cited a further factor making it harder for European vendors to win big clients: especially larger European organizations and those in more highly regulated verticals like health care, finance, or communications often already had established privacy compliance programs. After all, much of the GDPR’s content had already been law in Europe since at least the 1995 Data Protection Directive. For these organizations, the GDPR represented an incremental regulatory evolution more than a revolution. Therefore, they had less need than perhaps large American companies to rapidly build up extensive new compliance processes and were accordingly less interested in investing in expensive new compliance software – especially if operating this would have also required them to significantly alter established, human-based compliance processes (interviews 2 and 6). While that seems to have been a key reason why US vendors found the European market less lucrative than hoped, it also restricted growth opportunities for native vendors and demand for them to develop more advanced product offerings.
14 There was some disagreement among interviewees about the difficulty of customizing products for different languages (likely reflecting differences in their respective products). All agreed though that building separate sales organizations for multiple European national markets was usually uneconomical, leading to a focus on one or a couple of countries.
15 This may be changing though: thus the company of interviewee 10 had recently expanded to Brazil and Canada, seeking to serve the same segment of less data-intensive SMEs and DPOs they also sold to in Europe.
More broadly, interviewees felt that European companies were generally less open to using technology to solve legal compliance challenges (interviews 2 and 6–10), noting also that European compliance departments still tend (in the interviewees' view) to be much more heavily dominated by lawyers and to take a more strongly law- and text-focused approach than US compliance teams (interviews 2, 6, and 7). In summary, while the GDPR did create a market for privacy tech in Europe, this was more a series of smaller national markets than a large single one, making it harder for vendors to scale – not least as they also seem to have lacked a sufficiently big pool of large companies interested in comprehensive compliance solutions, and had to contend with greater hesitance to try out new technology-based approaches to compliance.
Supply-Side Factors in Both Regions
Interviewees also pointed to several supply-side factors that constrained European and advantaged North American vendors. Most frequently mentioned was the much greater and easier access to venture capital in North America (interviews 2, 4, 7, 9, and 10). Almost as often mentioned was access to technical and entrepreneurial talent. Interviewees noted that many of the most successful US vendors had been built by individuals whose background was in the tech industry (often in enterprise software), not privacy, and who often also had prior entrepreneurial experience (interviews 3, 6, and 8). This is certainly true of the leadership teams of successful US vendors like OneTrust, BigID, WireWheel, Securiti, or Integris. Enterprise software (which is what privacy tech is) is a particularly hard market to enter, as customers are usually wary of purchasing from new firms, quality requirements are very high, and relationships are often crucial for winning contracts (interviews 1, 3, and 5). The much larger North American ecosystem means not only that employee talent is easier to source (interviews 8–10), but also that winning combinations of individuals with the right experience, talent, and relationships to exploit emergent entrepreneurial opportunities like the GDPR are more likely to emerge in North America.
5 Conclusion: Regulation and the Preconditions for the Emergence of Lead Markets
This chapter has argued that while regulation can create new markets and stimulate the innovation of new technologies, the very power of the European Union to set global standards makes it harder to predict where demand for these technologies will emerge. In particular, it is by no means guaranteed that, just because a regulation is European, any associated lead market and lead suppliers – that is, the largest pools of demand, with the highest technological requirements – will also emerge in Europe. The chapter used a case study of the GDPR to show that it is quite possible for EU regulation to create lead markets and lead suppliers abroad.
This finding has implications for theory and policy. On the theory side, it underscores that the economic and technological effects of regulation need to be understood in a global context. Even as theorizing on regulation-driven lead markets has recognized the potential for regulation-created demand to "suck in" compliance products from elsewhere (especially from low-cost producers in China), it has often continued to operate with an implicit vision of national (or regional) markets and regulatory jurisdictions as relatively closed and siloed entities. In fact, however, the two or three major global jurisdictions (the EU, the United States, perhaps China) are increasingly able to shape the de facto standards in each other's markets, thereby triggering concomitant market and technology developments. A second, more specific implication for theory concerns the role of supply-side factors in lead market development. While the case study underscores the role of large and/or technologically sophisticated lead customers in driving lead market development, it also suggests that this potential will only be exploitable if a sufficiently rich ecosystem of experienced entrepreneurs and technologists with ample access to (venture) capital exists. While early lead market theorists noted the importance of supply-side factors, seeing lead markets as emerging from the interplay of lead customers and suppliers [42], later theorists attempted to exclude all supply-side factors from the theory (see [47]). The policy implications flow from this. The case study underscores that regulation to address social (or environmental) externalities certainly can promote technological innovation and economic development. However, it also shows that profiting from this is not easy. In particular, it suggests that three factors may be decisive: the existence of a large – continental-scale – home market, the presence of a substantial pool of large and/or technologically sophisticated companies in this home market to act as lead users, and a sufficiently rich and well-funded supplier ecosystem. In the digital era, it is no longer possible to assume that these three factors are unproblematically given in Europe.
References 1. Ambec, S., Cohen, M. A., Elgie, S., and Lanoie, P. (2013). The Porter Hypothesis at 20: Can environmental regulation enhance innovation and competitiveness? Review of Environmental Economics and Policy, 7(1), 2–22 2. Azvevo, M.A. (2020). Atlanta-based OneTrust Raises $210M Series B, More Than Doubles Valuation to $2.7B. Crunchbase News, 20 February, https://news.crunchbase.com/venture/ atlanta-based-onetrust-raises-210m-series-b-more-than-doubles-valuation-to-2-7b/ 3. Bamberger, K. A., & Mulligan, D. K. (2013). Privacy in Europe: Initiald ata on governance choices and corporate practices. GeorgeWashington Law Review, 81(5), 1529–1664. 4. Beise, M., 2001. Lead Markets: Country-Specific Success Factors of the Global Diffusion of Innovations. Physika-Verlag, Heidelberg. 5. Beise, M., 2004. Lead markets: country-specific drivers of the global diffusion of innovations. Research Policy 33 (6/7), 997–1018 6. Beise M (2006) Die Lead-Markt-Strategie: Das Geheimnis weltweit erfolgreicher Innovationen. Springer, Berlin
7. Beise, M., and Rennings, K. (2004). Lead markets and regulation: a framework for analyzing the international diffusion of environmental innovations. Ecological Economics 52, 5–17 8. Bignami, Francesca (2011): Cooperative Legalism and the Non-Americanization of European Regulatory Styles: The Case of Data Privacy. In: American Journal of Com-parative Law 59 (2), S. 411–461. 9. bitkom (2020) DS-GVO und Corona –Datenschutzherausforderungen für die Wirtschaft. https://www.bitkom.org/sites/default/files/2020-09/bitkom-charts-pk-privacy-29-09-2020.pdf 10. Blind, K., Bührlen, B., Menrad, K., Hafner, S., Walz, R., & Kotz, C. (2004). New products and services: Analysis of regulations shaping new markets. Fraunhofer Institute for Systems and Innovation Research, Karlsruhe. http://publica.fraunhofer.de/documents/N-24301.html. 11. Bradford, A. (2012). The Brussels Effect, 107 Northwestern University Law Review 1. 12. Bradford, A. (2020). The Brussels Effect: How the European Union Rules the World, Oxford and New York: Oxford University Press 13. Bundesministerium für Bildung und Forschung BMBF (2011). Wettbewerbsfähiger durch Leitmarktstrategie? BMBF-Workshop vom 7. April 2011 (Dokumentation). (on file with author). 14. BusinessWire (2020) BigID Announces $70 Million in New Investment, Raising the Company’s Valuation to $1B. https://www.businesswire.com/news/home/20201216005157/en/ BigID-Announces-70-Million-in-New-Investment-Raising-the-Companys-Valuation-to-1B Accessed 5 June 2021 15. DLA Piper (2020): Data Protection Laws Around the World: India. https:// www.dlapiperdataprotection.com/index.html?t=law&c=IN&c2= Last visited 4/06/2021 16. DLA Piper (2021a): Data Protection Laws Around the World: Brazil. https:// www.dlapiperdataprotection.com/index.html?t=law&c=BR&c2= Last visited 4/06/2021 17. DLA Piper (2021b): Data Protection Laws Around the World: Tunisia, https:// www.dlapiperdataprotection.com/index.html?t=law&c=TN&c2= Last visited 4/06/2021 18. DLA Piper (2021c): Data Protection Laws Around the World: Thailand, https:// www.dlapiperdataprotection.com/index.html?t=law&c=TH&c2= Last visited 4/06/2021 19. Edler J (ed) (2007) Bedürfnisse als Innovationsmotor: Konzepte und Instrumente nachfrageorientierter Innovationspolitik. Studien des Büros für Technikfolgen-Abschätzung beim Deutschen Bundestag, vol 21. ed. sigma, Berlin 20. European Commission (2012): Impact Assessment Accompanying the document Regulation of the European Parliament and of the Council on the protection of individuals with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation) and Directive of the European Parliament and of the Council on the protection of individuals with regard to the processing of personal data by competent authorities for the purposes of prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, and the free movement of such data. Brussels, 25.1.2012, SEC(2012) 72 final 21. Foster, T. (2020). ‘A Growth Industry Like I’ve Never Seen:’ Inside America’s No. 1 Fastest-Growing Company, Inc.com Magazine, https://www.inc.com/magazine/202009/tomfoster/onetrust-kabir-barday-fastest-growing-company-2020-inc5000.html 22. Gartner (2020) Gartner Says By 2023, 65% of the World’s Population Will Have Its Personal Data Covered Under Modern Privacy Regulations. https://www.gartner.com/en/ newsroom/press-releases/2020-09-14-gartner-says-by-2023%2D%2D65%2D%2Dof-theworld-s-population-w. Accessed 5 June 2021 23. 
Gilardi, F. (2005). The institutional foundations of regulatory capitalism: the diffusion of independent regulatory agencies in Western Europe. The Annals of the American Academy of Political and Social Science, 598(1), 84–101. 24. Gilardi, F., Füglister, K., Luyet, S. (2008). Learning From Others: The Diffusion of Hospital Financing Reforms in OECD Countries. Comparative Political Studies, 42(4), 549–573 25. Greenleaf G (2012). Global Data Privacy Laws: 89 Countries, and Accelerating. Privacy Laws & Business International Report, Issue 115, Special Supplement, February 2012, Queen Mary School of Law Legal Studies Research Paper No. 98/2012
26. Greenleaf G (2017). ‘European’ data privacy standards implemented in laws outside Europe. Privacy Laws & Business International Report, 21 UNSWLRS 2 27. Greenleaf G (2018). Global Convergence of Data Privacy Standards and Laws: Speaking Notes for the European Commission Events on the Launch of the General Data Protection Regulation (GDPR) in Brussels & New Delhi, 25 May 2018. https://papers.ssrn.com/sol3/ papers.cfm?abstract_id=3184548. 28. Heurix J, Zimmermann P, Neubauer T, Fenz S (2015) A taxonomy for privacy enhancing technologies. Computers & Security 53:1–17. https://doi.org/10.1016/j.cose.2015.05.002 29. Iannopollo E (2020) The Forrester Wave™: Privacy Management Software, Q1 2020. https:// www.forrester.com/report/The+Forrester+Wave+Privacy+Management+Software+Q1+2020//E-RES146976. Accessed 6 June 2021 30. IAPP (2017) Privacy Tech Vendor Report. https://iapp.org/resources/article/privacy-techvendor-report 31. IAPP (2018) Privacy Tech Vendor Report. https://iapp.org/resources/article/privacy-techvendor-report 32. IAPP (2019) Privacy Tech Vendor Report. https://iapp.org/resources/article/privacy-techvendor-report 33. IAPP (2020) Privacy Tech Vendor Report. https://iapp.org/resources/article/privacy-techvendor-report 34. IAPP (2021) Privacy Tech Vendor Report. https://iapp.org/resources/article/privacy-techvendor-report 35. IAPP-EY 2017. IAPP-EY Annual Privacy Governance Report 2017. On file with author. 36. Jacob K, Beise M, Blazejczak J, Edler D, Haum R, Jänicke M, Löw T, Petschow U, Rennings K (2005) Lead Marketsfor EnvironmentalInnovations. Heidelberg: Phvsica-Verlag. R&D Internationalisation from an Indo-German Perspective 181 37. Jänicke, M., and Jacob, K. (2004). Lead Markets for Environmental Innovation: A new Role for the Nation State. Global Environmental Politics, 4(1), 29–46 38. Karaboga, M., Martin, N. & Friedewald, M., 2022, i.E. Governance der EUDatenschutzpolitik: Harmonisierung und Technikneutralität in und Innovationswirkung der DSGVO. In M. Friedewald & A. Roßnagel (Hrsg.) Die Zukunft von Privatheit und Selbstbestimmung: Analysen und Empfehlungen zum Schutz der Grundrechte in der digitalen Welt. Wiesbaden: Springer Vieweg. https://doi.org/10.1007/978-3-658-35263-9_2 39. Khan, M. 19 Nov. 2017. “Companies face high cost to meet new EU data protection rules”, Financial Times https://www.ft.com/content/0d47ffe4-ccb6-11e7-b781-794ce08b24dc 40. Martin N, Matt C, Niebel C, Blind K (2019a) How Data Protection Regulation Affects Startup Innovation. Inf Syst Front 21:1307–1324. https://doi.org/10.1007/s10796-019-09974-2 41. Martin, N., Bile, T., Nebel, M., Bieker, F., Geminn, C., Roßnagel, A., Schöning, C. (2019b): Das Sanktionsregime der Datenschutz-Grundverordnung: Auswirkungen auf Unternehmen und Datenschutzaufsichtsbehörden. Forschungsbericht, Forum Privatheit und selbstbestimmtes Leben in der digitalen Welt, http://publica.fraunhofer.de/documents/N-541115.html 42. Meyer-Krahmer F, Reger G (1999) New perspectives on the innovation strategies of multinational enterprises: lessons for technology policy in Europe. Research policy 28, 751–776 43. Miller R (2020) OneTrust nabs $300M Series C on $5.1B valuation to expand privacy platform. https://techcrunch.com/2020/12/21/onetrust-nabs-300m-series-c-on-5-1b-valuationto-expand-privacy-platform. Accessed 5 June 2021 44. O’Leary, R. (2020) Worldwide Data Privacy Management Software Forecast, 2020-2024. IDC International Data Corporation https://www.idc.com/getdoc.jsp?containerId=US46770219 45. Porter, M., and van der Linde, C. 
(1995). Toward a new conception of the environmentcompetitiveness relationship. Journal of Economic Perspectives, 9(4), 97–118 46. PWC 2017. GDPR Series: Pulse Survey: US Companies ramping up General Data Protection Regulation (GDPR) budgets. On file with author. 47. Quitzow, R., Walz, R., Köhler, J., Rennings, K. (2014). The concept of “lead markets” revisited: Contribution to environmental innovation theory, Environmental Innovation and Societal Transitions 10, 4–19
48. Rennings K, Rammer C (2011) The Impact of Regulation-Driven Environmental Innovation on Innovation Success and Firm Performance. Industry and Innovation 18:255–283. https:// doi.org/10.1080/13662716.2011.561027 49. Schwartz PM (2019) Global Data Privacy: The EU Way. N.Y.U. L. Rev. 94:771 50. Shipan CR, Volden C (2008) The mechanisms of policy diffusion. American journal of political science 52, 840–857 51. Soper, T. (2020) Seattle startup Integris acquired by data privacy giant OneTrust, Geekwire, 29 June, https://www.geekwire.com/2020/seattle-startup-integris-acquired-data-privacy-giantonetrust 52. Stewart LA (2010) The impact of regulation on innovation in the United States: A crossindustry literature review. Information technology & innovation foundation 6 53. Stigler GJ (1971) The Theory of Economic Regulation. The Bell Journal of Economics and Management Science 2(3). 54. Tiwari R (2016) Frugality in Indian Context: What Makes India a Lead Market for Affordable Excellence? In: Herstatt C, Tiwari R (eds) Lead Market India: Key Elements and Corporate Perspectives for Frugal Innovations. Springer International Publishing, Cham 55. Tiwari R, Herstatt C (2014) Setting the Scene. In: Tiwari R, Herstatt C (eds) Aiming big with small cars: Emergence of a lead market in India. Springer International Publishing, Cham, pp. 1–18 56. Trumbull, T. (2020) Privacy Software Company OneTrust Acquires Integris, Channele2e.com, 30 June, https://www.channele2e.com/investors/mergers-acquisitions/privacysoftware-company-onetrust-buys-integris/ 57. Vogel, D. (1995). Trading Up: Consumer and Environmental Regulation in a Global Economy. Cambridge: Harvard University Press 58. Vogel D, Kagan RA (eds) (2004) Dynamics of regulatory change: How globalization affects national regulatory policies. University of California international and area studies. Univ. of California Press, Berkeley, Calif. 59. Walz R, Köhler J (2014) Using lead market factors to assess the potential for a sustainability transition. Environmental Innovation and Societal Transitions 10, 20–41
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Part IV
The Ethics of Privacy and Sociotechnical Systems
Chapter 9
Nobody Wants My Stuff and It Is Just DNA Data, Why Should I Be Worried
Lipsarani Sahoo, Jay Dev, Mohamed Shehab, and Elham Al Qahtani
Abstract At-home DNA testing and sharing in public genealogy databases are becoming widespread. This facilitates finding out about ancestry, genetic relatives, and biological parents, making new connections, advancing medicine, and learning about predispositions to various diseases and health issues. However, the privacy implications of sharing this DNA data publicly remain largely unexplored. We close this gap with an interview study (N = 60). We mainly assess the perceptions of users with and without experience of at-home genetic testing to gauge their understanding of DNA data privacy. We demonstrated a few popular tools of a public genealogy database (GEDmatch), pointing out its current policies. We also discussed a few scenarios based on current practice, such as subpoenas by law enforcement, and potential threats, such as access to DNA data by insurance companies. Results show that users are mostly unaware of the interconnected nature of genetic data and are not informed about current policies and data practices. However, scenarios such as insurance companies gaining access to their data can heighten their privacy concerns. Our findings are valuable for researchers, practitioners, and policymakers concerned with at-home DNA testing.
Keywords At-home DNA · Public genealogy databases · DNA privacy
L. Sahoo · M. Shehab · E. Al Qahtani, University of North Carolina at Charlotte, Charlotte, NC, USA; e-mail: [email protected]; [email protected]; [email protected]
J. Dev, Illinois State University, Normal, IL, USA; e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. S. Schiffner et al. (eds.), Privacy Symposium 2022, https://doi.org/10.1007/978-3-031-09901-4_9
1 Introduction
With at-home genetic testing becoming affordable and accessible, more individuals are interested in genetic testing, which can bring huge opportunities in health care, personalized genetic medicine, and ancestry research. Direct-to-consumer genetic testing (DTC-GT) companies, such as 23andMe, have screened over ten million
individuals by April 2019 [1]. The compound annual growth rate of at-home genetic testing is expected to be around 10.65% between 2019 and 2024 [3]. These kits are now affordable, easily obtainable, and sold directly to consumers. The tests collect only a physical sample from the customer, such as saliva, to determine genetic information. Furthermore, the raw DNA data report can also be uploaded to various public online services such as GEDmatch. Tools provided by GEDmatch enable services like DNA matching to find genetic relatives. Genetic testing and public online services such as GEDmatch are widely popular in most developed countries [4]. There are many privacy concerns associated with at-home genetic testing, such as the identification and disclosure of genetic relatives and racial biases [13, 19]. Furthermore, the GEDmatch database is now frequently used by law enforcement to solve violent crimes [18]. GEDmatch policy allows users to opt in or out of being searched by police, enabling law enforcement to use the DNA database to solve violent crimes. Although sharing one's own data is opt-in, there are few to no systems in place to protect the genetic privacy of users' genetic relatives. A person's choice to share their own information implies that they are also sharing their relatives' information without their consent, which is utterly intrusive. Additionally, DNA testing companies are often exposed to data breaches [24]. Recent studies [8, 23] investigated users' perceptions of at-home DNA testing and found that privacy was not a primary consideration in users' choices. Considering these potential privacy risks and the sensitivity of DNA data, our study aims to close this gap by investigating: users' perceptions of sharing their genetic data not only with DTC-GTs but also in open databases like GEDmatch; users' awareness and perceptions of different entities (e.g., law enforcement) accessing their data; users' awareness of privacy policies about DNA testing and data sharing; users' understanding of the interrelated nature of DNA data; users' suggestions for future data sharing and privacy policies; and differences in perception and awareness between people who have experience of DNA testing with DTC-GT companies and those who have not done at-home DNA testing. We study users' understanding and perceptions of at-home genetic testing and DNA data while examining how experience impacts their perceptions. We noted users' future motivations and valuable suggestions for researchers, practitioners, and policymakers concerned with at-home DNA testing and sharing. For this purpose, we conducted and analyzed 60 interviews, dividing participants into two groups (Experienced and Non-Experienced). We found that users have an insufficient understanding of the nature of DNA data and of the risks of sharing DNA data in open databases. Users are not informed of the existing privacy policies and assume DNA data is not sensitive, leading to less concern for privacy. We also noted changes in points of view and increased privacy concerns after participants were nudged through the scenarios. We provide a comprehensive discussion of the implications of our findings.
2 Background and Related Work
In this section, we first discuss GEDmatch, followed by prior work analyzing the risks of genetic data sharing. Lastly, we discuss prior research aiming to understand users' attitudes toward DTC-GT and public genealogy databases.
2.1 Public Genealogy Database: GEDmatch
At-home DNA tests are popular for ethnicity prediction, family tree building, finding lost relatives, and identifying biological parents. Additionally, the user can also download the raw DNA data as a zipped (.zip) text file that can be uploaded to and analyzed by third-party tools, such as GEDmatch. GEDmatch offers features to upload raw genetic data results from various DTC-GTs. Nelson et al. [21] found that GEDmatch was the most commonly used tool (84% of participants used it), and participants agreed that it provided ancestry and relative information and helped them understand and interpret genetics. GEDmatch has different tools to analyze users' DNA files. For example, users can use the "One-to-Many Comparison Result" to search for relative matches within the database. It delivers a list of the 3000 closest matches with their names, email addresses, kit numbers, and testing companies.1 It also provides a "One-to-One comparison tool" that can be used to analyze the DNA of two individuals on a one-to-one basis. As these comparison results render names and email addresses, interested persons can contact their matches. The GEDmatch forum is a platform for users to share information and network. "Search all GEDCOM" is another popular tool, in which users can simply enter someone's first and last name to get details such as places of birth and death, father, mother, and children. The pedigree chart feature can be used to view the family tree. We demonstrated the above tools to give participants a brief overview. Exploring users' perceptions of such genetic data disclosure is thus essential.
1 GEDmatch details: https://drive.google.com/drive/folders/1Jmzt0n5F-LznwH73jU28Z5EjHHM7dIlx?usp=sharing
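To make the privacy-relevant mechanics of such matching tools more concrete, the following is a minimal, illustrative sketch of how a one-to-one autosomal comparison might detect shared DNA segments. It is not GEDmatch's actual algorithm: real tools work with genetic map distances (centimorgans), mismatch tolerance, and quality filtering, whereas this toy version simply looks for long runs of "half-identical" SNPs (positions where two kits share at least one allele). The kit representation, the segment threshold, and the function name are assumptions made purely for illustration.

def shared_segments(kit_a, kit_b, min_snps=500):
    # kit_a, kit_b: dicts mapping SNP position -> genotype string such as "AG".
    # A position is "half-identical" when the two kits share at least one allele.
    # Runs of at least min_snps consecutive half-identical positions are reported
    # as candidate shared segments (start_position, end_position, number_of_snps).
    positions = sorted(set(kit_a) & set(kit_b))
    segments, start, last_match, run = [], None, None, 0
    for pos in positions:
        if set(kit_a[pos]) & set(kit_b[pos]):   # at least one shared allele
            if start is None:
                start = pos
            last_match = pos
            run += 1
        else:                                   # opposite homozygotes break the run
            if run >= min_snps:
                segments.append((start, last_match, run))
            start, run = None, 0
    if run >= min_snps:
        segments.append((start, last_match, run))
    return segments

# Toy example: two kits that are half-identical over the first 600 positions only.
kit_a = {i: "AG" for i in range(1000)}
kit_b = {i: ("AG" if i < 600 else "CT") for i in range(1000)}
print(shared_segments(kit_a, kit_b, min_snps=100))   # [(0, 599, 600)]

Because any kit sharing long segments with an uploaded relative is reported back together with a name and contact email, even people who never uploaded their own DNA can become reachable through such matches; this interdependence is the point probed in the later sections.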
2.2 Genetic Data Sharing Risks
Genetic data are sensitive, as they contain individually identifiable health information, such as present, past, or expected health conditions, in addition to ancestry. Genetic data reveal sensitive information not only about the sharer but also about genetically related individuals [25]. Many studies have demonstrated the re-identification of anonymized genetic data [11, 20]. Marchini et al. [20] studied genotype imputation, which completes genetic information from partial data. A study by Humbert et al. [16] confirmed the feasibility
of genetic imputation by utilizing genetic datasets from OpenSNP.org (an Internet platform where genetic information is publicly available). Using Facebook searches, they were able to find relatives of individuals who had self-identified in their genetic datasets. Kaiser et al. [17] discussed how, in Iceland, the genetic data of an additional 200,000 living individuals who never donated their own DNA could be inferred. A study performed by Edge et al. [11] showed that the GEDmatch database is vulnerable to artificial datasets and described several methods to reveal users' raw genetic data. He et al. [15] discussed the possibility of revealing substantial information about one's putative relatives. Ney et al. [22] found that the high-resolution images provided to GEDmatch users comparing the chromosomes of any two users can potentially be misused to reconstruct the target's genotype.
2.3 Users' Motivation and Privacy Perception
This section summarizes studies that have investigated consumers' key concerns about DTC-GT data privacy and their motivations for sharing their genetic data in public genealogy databases. Researchers [9, 10] have studied individuals' motivations to undertake DNA testing. They found that users were interested in gaining information on their health status, genes, possible disease risks, or related family members. Recently, Baig et al. [8] explored the perceptions of users of at-home DNA testing companies. They found that users are frequently dismissive of privacy concerns about their own genetic data and that of their relatives. Similarly, Saha et al. [23] investigated users' concerns and knowledge regarding at-home DNA tests. The results showed that users have trouble comprehending many implications of sharing their genetic information with business entities. Similar to both studies, we found that users do not comprehend the nature of DNA data and the implications of sharing it. However, unlike both studies, we found that experienced users are less concerned about their privacy and showed a more defensive attitude.
3 Methodology
This paper focuses on studying users' perceptions of DNA data sharing; accordingly, we designed an interview study. To compare views, we decided to interview people with and without experience of at-home DNA testing. All interviews were conducted through online video conferences using Zoom. We divided the participants into two groups: the Experienced group (had already done at-home DNA testing with a DTC-GT) and the Non-experienced group (had not done an at-home DNA test).
3.1 Recruitment and Demographics
We recruited 60 participants, 30 for each group, based on an initial screening survey attached to the recruitment email. The screening survey asked whether they had done DNA testing or not. Participants were initially recruited by sending out emails via our university mailing lists. We complemented recruiting with snowball sampling to gain a more diverse representation, where initial participants suggested new interviewees. We did not mention privacy perceptions in our email, instead stating: "The purpose of this study is to explore the understanding, awareness, impression of DNA testing and sharing for ancestry and family finding user." Interviews occurred between May 2021 and July 2021. All participants were compensated with a $10 gift card. The study was approved by the university's Institutional Review Board. Among the 30 participants of the experienced group, 2 were males and 28 were females; their ages ranged from 18 to 65 years. Among the 30 participants of the non-experienced group, 8 were males and 22 were females; their ages ranged from 18 to 57 years. Participants had different fields of education or occupations, such as religious studies or IT analyst.2
2 Demographics: https://drive.google.com/drive/folders/1ZsHnwz-AW-RhFTHwq6VPuyAyCIXcv6zK?usp=sharing
3.2 Method and Analysis
The screening survey in the recruitment email, which asked about participants' experience with at-home DNA testing, helped us divide the participants into the experienced and non-experienced groups. We conducted all interviews through online video conferences [5] and used an external audio recorder (smartphone) to record the interviews with participants' consent. Having both experienced and non-experienced participants should enable us to investigate whether there is any difference or effect of experience on users' views about DNA data sharing. First, we asked the experienced group about the motivation behind their test and their feelings after the test. In turn, we asked the non-experienced participants whether they knew about at-home DNA testing. Subsequently, we gave a description of at-home DNA testing and asked them whether they would like to take the test if they received it as a gift, followed by the reason behind their decision. Then we introduced the GEDmatch site to both groups and showed them a video explaining a few popular tools, such as the one-to-many autosomal DNA comparison, the one-to-one autosomal DNA comparison, the GEDmatch forum, and GEDCOM. After showing the video, we asked the participants if they would be interested in sharing their results on these platforms. If a participant showed interest in sharing their DNA data, we asked why they wanted to share; if they did not show interest, we asked why they did not want to share. Then we discussed the opt-in and opt-out policy of
GEDmatch. After the opt-in and opt-out policy discussion, we asked them whether they would choose to opt in or opt out, followed by the rationale behind their decision. Subsequently, we discussed the "Golden State Killer" case, their opinions on law enforcement using the database, and their feelings about family members sharing their own DNA data, given that family members share part of one another's DNA. Next, we asked them about their interest in sharing DNA data for research and medical purposes and the motivation behind their choice, again followed by their feelings about family members sharing such data. We also talked about their views on sharing DNA data that reveals hereditary diseases and whether they would be interested in sharing it. After this, we asked about their opinions or concerns about insurance companies getting access to DNA data. Finally, we discussed their future expectations for DNA data sharing and design suggestions for DNA data sharing platforms.3
All interviews were transcribed for analysis, and we adopted an inductive coding approach. Both coders used the QDA Miner software [2]. The data was coded independently by two researchers. We then reviewed, refined, and updated the two sets of coded data to settle disagreements; thus, we did not administer Cohen's Kappa (inter-rater agreement). While discussing the results, we enumerate the participants from E1 to E30 for the Experienced group and NE1 to NE30 for the Non-experienced group. We also use keywords such as "all" to refer to 100% of participants, "most" for 80–99%, "majority" for 60–79%, "many" for 40–59%, "some" for 20–39%, and "few" for less than 19% of participants. We use "NE" as an abbreviation for the Non-experienced group and "EG" for the Experienced group.
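As a small aid for interpreting the result sections, the sketch below shows one hypothetical way to map a reported proportion of participants onto the quantifier keywords defined above. The boundary handling is an assumption, since the stated ranges leave the 19–20% edge unspecified; the function name and example are illustrative only.

def quantifier(share):
    # share: fraction of participants in [0, 1]; returns the keyword used in the text.
    # Ranges follow Sect. 3.2; values in the unstated 19-20% gap map to "some" here.
    pct = share * 100
    if pct >= 100:
        return "all"
    if pct >= 80:
        return "most"
    if pct >= 60:
        return "majority"
    if pct >= 40:
        return "many"
    if pct >= 19:
        return "some"   # assumption for the 19-20% edge
    return "few"

print(quantifier(25 / 30))   # "most": e.g., 25 of 30 participants in a group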
4 Results
4.1 Pre-introduction of DNA Testing: Non-experienced Group Only
Before introducing at-home DNA testing, NE members were asked about their familiarity with commercial DNA testing and their expected gains and concerns.
4.1.1 Foreknowledge about DNA Testing and Procedure
We asked the participants: "Do you know about at-home DNA testing?" All the participants knew about at-home DNA testing. Regarding the testing procedure, 29 out of 30 participants had varying levels of knowledge. The majority (17) of participants talked about collecting a saliva sample or swab and sending it to the labs. Ten participants talked about DNA sequencing, and two participants (with biology backgrounds) described the procedure in detail. The majority of the participants had watched commercials or ads, and some of their family members had done the testing. We can conclude that the participants were familiar with at-home DNA testing.
3 Screening survey and interview questions: https://drive.google.com/drive/folders/1Jmzt0n5FLznwH73jU28Z5EjHHM7dIlx?usp=sharing
4.1.2 Expected Benefits and Concerns
We asked participants about their expected benefits and concerns regarding taking at-home DNA testing. All our participants found the at-home DNA test easy to perform and considered it an easy and fast way to find out about their ethnicity, heritage, and ancestral line. Most (25) of the participants talked about the advantages of knowing their health predispositions, saying that this knowledge could help them prepare for their future. For instance, (NE29) said: "I've seen in TV; they show your heritage, which is interesting. If you are carrying any gene like cancer gene, they tell you that maybe regarding sleep paralysis, like just little things which are super important." A few participants (5) said that knowing health predispositions would increase anxiety if the issue has no remedy or treatment. Two of these participants questioned data ownership and privacy, another two talked about the accuracy of these tests, and one said these tests could reveal some unwanted things (truths). In this context, (NE4) said: "The concerns were ownership aspects, technically someone else owned my DNA is weird. Also, chance that criminal investigations, they can track family members through your DNA, which is concerning."
4.2 Post-introduction of DNA Testing: Non-experienced Group
After introducing at-home DNA testing, members of the NE group were asked about their interests, motives, and concerns about taking a test.
4.2.1 Interest and Motivations
In the Non-experienced group, 21 participants showed interest in taking a DNA test with a DTC-GT after the introduction of DNA testing. Three of these 21 mentioned that they would be interested in taking the test if they got it for free. Another two members talked about the hassle of mailing the samples, though they wanted to do the testing. The majority of these interested participants would like to take the test to learn about their health and predisposition to any health conditions. A few of the participants wanted to take the test out of curiosity or to explore their ancestry. One participant mentioned that they would be interested in taking the test in the USA but would never have shown interest in taking it in their home country (an Asian country); they mentioned that governmental oversight, discrimination, and tracking were the primary causes behind this.
4.2.2 No Interest and Concerns
Nine participants did not show interest in taking the test even if they were to receive it as a gift or free of cost. Overall, 28 out of 30 participants had some concerns. The reasons mentioned were the hassle of taking the test (2), lack of trust in companies (25), frequent policy changes (25), racism (9), data breaches (23), ownership of the data (14), unknown misuse of the data (27), selling data to third parties (29), and tracking for criminal activities (3). The primary concerns were around security, for example, the selling of their data or hacking. A few participants said they would not like to take the test because they feel they could be monitored more by the government. (NE23) said: "If a company has this data about you, then they can make certain guesses about you, You may be discriminated against because of your color, ethnicity, gender, heritage. Interestingly, you might not even know you come from that area, but your DNA kit says you come from that area."
4.3 Users' Experience: Experienced Group Only
In the screening survey and at the beginning of the interview, we asked participants about their motives for taking a DNA test and about the benefits and concerns they perceived after the test.
4.3.1 Background
We asked the participants whether they had taken an at-home DNA test and their reasons for doing so. Most of the participants did the testing to find family members, explore their personal identity or ancestry, or find their biological family. A few did so to satisfy curiosity, and a couple of participants did so to participate in genetic research. The majority of users did not share their testing results anywhere else. Some shared the ethnicity analysis outcomes on social networking sites. A few (3) had shared them with GEDmatch to find connections.
4.3.2 Concerns and Benefits After Taking the Test
Most participants did not express any concerns after doing the test. Only a few talked about the selling of data to third parties or curiosity about the ownership of the data, or wondered what happens to the sample after a user gets the result, though they never tried to find out. All participants perceived the test as beneficial. Almost all (28) participants responded that they gained insights into their ancestry, made new connections, and confirmed existing family relations. Some members said they obtained knowledge of family health-related problems and issues. (E15) commented: "I m adopted and I had very little information on my biological family, family health
history. so I did for health information, and the DNA relatives to find connections.” Overall, insight into the family was the primary driving factor.
4.4 Post GEDmatch Demo
This section discusses participants' views on GEDmatch and their perspectives after discussing the GEDmatch tools.
4.4.1 Benefits and Concerns
After the demo, we asked the participants whether they would like to share their DNA test results on GEDmatch. All participants of the NE group said GEDmatch was interesting; however, only nine NE participants would like to share their DNA data on GEDmatch, in contrast to most (25) of the EG members. The majority of the participants of both groups liked the matching tools of GEDmatch. Most liked the idea of finding family or connections, exploring their ancestry, building family trees, and the ability of this platform to let users upload raw DNA from various sites. A few mentioned these details could help genealogy and medical researchers. Though most participants admired the tools, many participants of both groups expressed concerns. They were worried about the presence of email addresses, names, and locations of birth and death in the results of GEDmatch tools. Most of the participants specifically pointed out that detailed results can be obtained by just typing someone's first and last name into the GEDCOM tool. Most participants considered these to be personal details with the potential to be misused in many ways, such as targeted tracking or stalking. Some of the participants asked the interviewer whether the detailed chromosome matching could be interpreted in a way that discloses sensitive details, for example, the health conditions of the matches. Some participants also spoke about identity theft. One participant indicated that the data could never be disconnected from the donor. Some participants doubted the reliability and accuracy of these tools. Incomprehension was prominent among many participants of both groups: these participants mentioned they were happy to share their DNA sequence but not their email addresses or names, believing names and email addresses to be more sensitive data than DNA sequences. Low privacy expectations were another recurring theme: a few participants said that almost all the information about a target could be obtained on the Internet anyway, so they would not be bothered about sharing their DNA data on GEDmatch. For example, (E10) said: "Someone could use these details to steal your identity, but as far as like the DNA information, it will just connect you with other people, I would be okay with that part of it. I would not put much personal information on there such as location, I'd create a separate email address." Some African American participants mentioned they would be more comfortable sharing their details with the African Ancestry company as
they might get more connections and would not face racial prejudice. Overall, both groups expressed interest in and concerns about sharing their information on GEDmatch. However, the EG group showed a more relaxed attitude toward sharing their DNA data on GEDmatch.
4.4.2 Expected Use of DNA Data
After asking about expected benefits and concerns, we asked the participants, "What do you think will happen to your data if you upload it to GEDmatch?", to learn their expectations of how GEDmatch would use their data. Overall, there were apparent tensions among participants of the NE group. While almost all (28) NE participants mentioned the possibility of the DNA data being used in unapproved or unknown ways, most EG members articulated beneficial uses. Participants of both groups discussed hacking or data breaches, selling data to third parties, and mining the data for targeted commercials or ads. The majority of members of both groups mentioned governmental access to and oversight of these databases. A few NE participants and the majority of Experienced participants said these databases were likely used for criminal investigations, genealogical research, and health research. A few participants of both groups said they did not know what would happen with the data. A couple of NE participants said DNA data could be manipulated against certain races or used to frame someone in a police case. Referring to this, (NE10) said: "They might not be selling the information now, but suppose in five years they were not profitable, they are going to sell." Overall, the experienced group talked about good uses more frequently, while the NE group was more concerned.
4.4.3 Expected Data Access and Data Handling
When we asked the participants who they thought would have access to their data on GEDmatch if they uploaded it, most NE members said anyone with internet access could access data on GEDmatch. On the other hand, the majority of experienced participants believed only website users could have access. The next question asked them to rate their concern about data handling on GEDmatch. The majority of the participants were little to moderately concerned. We can deduce that a majority of the participants in both groups did not comprehend the sensitivity of DNA data: they do not adequately comprehend the risks of sharing genetic data, and their concerns regarding their email ID, name, and location were far greater than those regarding their DNA data. (NE27) commented: "I am absolutely fine with sharing my DNA data if they would not give access to my sensitive data like my name, email ID, location, kind of the things that can identify me." A few users of the NE group and most of the experienced group assumed that only users who had uploaded their own DNA could gather data about anyone. Most participants showed little understanding of DNA data and were largely unaware of its implications. Some NE members had a misunderstanding that, through this DNA test, they would share only part of their DNA, not the entire DNA sequence.
This certainly demonstrates a huge gap in the general public's understanding of the characteristics of DNA data.
4.4.4 Opt-in or Opt-out
We talked about the opt-in and opt-out policy of GEDmatch and asked participants whether, if they uploaded their DNA data to GEDmatch, they would opt in or opt out. Twenty NE participants and fifteen EG members declared a wish to opt in. The recurring reasons were "Nothing to Hide," "Have not committed any crime," "Law-abiding," "Helpful for law enforcement," and "Helpful for law-enforcement to identify me if something happens to me." The most commonly reiterated comment was, "I have nothing to hide." Fourteen EG and seven NE members stated they would opt out. The reasons mentioned were "Minority and vulnerable race," "Do not want to deal with law enforcement," "Can put my relatives and me in trouble," "Misinterpreted or false allegation," "Framing or abuse," "Mistrust law enforcement," and "Want to keep my data in my control." A few said that it does not matter what they prefer; if law enforcement wants the data, they will get it even if users opt out.
4.5 DTC-GT vs. GEDmatch
Participants of both groups exhibited higher trust in DTC-GTs than in GEDmatch. We asked participants who had either already taken the test or were interested in taking one, but who did not want to share their data with GEDmatch, about their reasons. There was a considerable trust gap between GEDmatch and DTC-GTs. Most participants said they trust DTC-GTs and feel their data is safe and protected there, but not on GEDmatch. They mentioned that their familiarity with DTC-GTs, through ads or other channels, is far higher than with GEDmatch. They assumed that because DTC-GTs are paid companies, they would keep their data protected. Also, these companies are bound by terms and conditions that enforce laws, rules, and regulations and prevent the sharing of consumers' personal information, whereas GEDmatch is an open database. (E8) said: "DTCs are tied to agreements and laws. They can not share my data without my permission. I don't want my history to be pulled up into the system without my permission the way you can do in GEDmatch." We asked whether they would be interested in sharing their data after learning that DNA data can reveal hereditary diseases or the probability of future health conditions. Many participants showed reservations about sharing their DNA data on any online platform, though they would still be interested in taking a DNA test with a DTC-GT. The rationale behind this was that they had a higher degree of trust in DTC-GTs, which they regarded as verified companies. In contrast, they perceived that free online DNA data sharing platforms like GEDmatch could be breached or hacked or could be
easily accessible to insurance companies. This shows that users have higher confidence in DTC-GTs than in GEDmatch.
4.6 Scenarios: DNA Data Sharing
We investigated users' feelings about sharing their data with different entities in public databases like GEDmatch. To obtain a rich understanding, we asked the same set of questions after discussing four scenarios. The scenarios are based on an actual incident, a fact, or a potential future use or misuse of DNA data. Our questions are based on the fact that when someone shares their DNA, it affects not only the person who has given consent but also those who have not, and vice versa.
4.6.1 Subpoenas and Sharing: Law Enforcement
After discussing the "Golden State Killer" case, in which police tracked down the criminal using the GEDmatch database, we asked participants their opinions on law enforcement using at-home DNA databases to solve cold cases. We received mixed responses from most of the participants. While participants expressed that these databases are a great tool to capture notorious criminals and serve justice, there were at the same time apparent tensions and concerns. The concerns are as follows: (1) creating a fake account or entering false information is perjury and is immoral and unethical, (2) obtaining users' data without their consent oversteps boundaries and breaches users' privacy, (3) opt-in and opt-out are nearly meaningless, (4) it could be a slippery slope, (5) it could lead to wrongful convictions or false allegations, (6) it violates health data privacy, as DNA can reveal substantial health information, (7) it could lead to racial disparity and targeted accusations of minorities or religious groups, (8) it could be used to manipulate or frame, (9) it unwillingly involves innocent relatives, and (10) people could be dragged into investigations. The next question we asked was about their opinion on implied DNA data sharing, that is, the fact that when a genetic relative shares their DNA, they also share part of the participant's DNA. We found that almost all participants had not realized this. Most were concerned, and a few of them changed their mind about taking a DNA test. Themes of consent, being a victim of others' actions, and helplessness emerged, which are presented in detail in the "Scenarios Effects" section. Next, we explored how they felt about being involuntarily surveilled by law enforcement. The majority of the participants would be uncomfortable to very uncomfortable: they would feel upset, harassed, and that their privacy had been violated if dragged into criminal investigations. Some stated they could not do anything about this, as law enforcement holds a position of higher power. A few said they have nothing to hide and want criminals off the street; thus, they were comfortable. When the participants were asked if they would like to share their DNA data voluntarily with law enforcement, the majority of the participants felt comfortable, repeating the phrase "Nothing to hide." Some of the participants
were unwilling to share, mentioning reasons such as "framing," "Wrongful conviction or mishandling of data," "Racial discrimination," "Could drag family and me into unwanted matters," "Lack of trust," and "privilege to do anything if they own the data." Moving forward, we asked about their opinions on other family members sharing their DNA data with law enforcement or on platforms to which law enforcement can have access. Most of the participants were neutral about it, mentioning that they do not control others' decisions. A few of the participants were at the two extreme ends: some felt very comfortable because they have "nothing to hide" and the uniqueness or specifics of their DNA is not being shared, while a few others showed concerns, believing they could be tracked or surveilled by law enforcement. Finally, we asked participants what their families would think if they shared their data. The majority of the participants said their family would be neutral, as they are not knowledgeable about the specifics of DNA data sharing and are law-abiding people, and hence have nothing to hide. A few participants said their families would be annoyed if they could be traced back. Overall, the perceptions of both groups about law enforcement access were similar, though the Experienced group was more comfortable.
4.6.2 Perceptions about Sharing Data in Health Research
The majority of members of the NE group showed willingness to share their DNA data anonymously for research and the medical field, compared with almost all (27) EG participants. Many Experienced participants had already given DTC-GTs approval for their data to be used in research. The prime motivation was to advance medicine and the health field. They stated higher trust in research and medical organizations, as they believed those are bound by government regulations and policies. Some of them expressed the desire to know the credibility of the research institute and its research. On the other hand, some participants showed reservations about, or declined, sharing their DNA data with research and medical institutions. Their concerns revolved around detrimental research involving racial biases and the non-transparency of data practices and storage. A few raised concerns about the probability of insurance companies and employers looking at the data. (NE24) said: "I am familiar with the case of Henrietta Lacks, where basically, I am paying to have something done. And then with the potential of some company making a ton of money off of my biological product." Then, we asked participants' opinions on families sharing their data for research or medical purposes. As above, most of the participants were comfortable; some felt neutral, since they cannot constrain others' choices, and a few were worried for the same reasons discussed before. We then explored what participants thought their family would think if they shared their DNA data. The majority of them perceived that their family would appreciate it, as they would be helping advance science. A few NE participants said their family would be very uncomfortable because of the history of racial prejudice and might fear insurance access. On the whole, the experienced group was more comfortable.
4.6.3 Hereditary Diseases
We explicitly asked the participants whether they would like to share their DNA if it revealed hereditary diseases and the probability of certain health issues. Most participants showed interest in getting a DNA test (if possible with a medical institution, not with DTCs) to learn about their predispositions and hereditary health issues. They perceived that by learning this, they could be prepared and would take precautionary steps. In contrast, a few participants mentioned it would make them very anxious and could lead to emotional turmoil. Nevertheless, the majority of NE participants did not show interest in sharing their health data or DNA in open databases like GEDmatch. Their concerns were insurance access, employer access, or the disclosure of their family's sensitive health data. We also noted an interesting trend: participants of a specific race (we never asked them about their race; they self-disclosed it) explicitly stated they would need to undergo DNA testing to know their health predispositions before marrying or having children. Participants reacted similarly regarding family members sharing DNA data that reveals health conditions, and regarding their family's reaction to them sharing their own data. In this context, (NE21) said: "I don't want my aunt to suddenly call me up and say, cancer is in our family, you have a high likelihood you're going to have cancer, I don't want to know that. If I want to make that decision to try and find that out. I don't necessarily want someone just randomly to tell me that. I want to prepare myself for it."
4.6.4 DNA Data Access by Insurance Companies
Almost all participants of both groups strongly rejected the idea of sharing their data online if insurance companies had access to it. The foremost concerns noted were that genetic markers for pre-existing conditions, predispositions, or family history could make many people uninsurable, that premiums could be raised, and that claims could be denied. This could give insurance companies more power, ultimately leading to political lobbying or policy-making in their favor rather than in ordinary people's interest. (E25) said: "I feel like by them having even more access to family history, predispositions is almost finding a way not to insure the person." Only a couple of participants felt neutral or slightly comfortable; they perceived themselves as healthy and felt that access to their DNA would not create any problem. Some said that if an insurance company accessed or used their DNA data, they would take legal action against it. Most participants responded similarly (very uncomfortable) when we asked about their family members sharing their data. A few would be neutral, as they feel they do not have power over others' decisions. Also, most of them said their family would be uncomfortable only if they knew that family members' DNA is connected to each other. Many participants spontaneously asked the interviewer about the policies and rules governing whether insurance companies have or could gain access to their genetic data. Overall, this scenario led many participants to change their decisions and clearly pushed them to think about and dig into the privacy policies of genealogical databases.
4.7 Scenario Effects
This section articulates the impact of the scenarios on participants’ points of view and attitudes.
4.7.1 Helplessness and Resignation
Surprisingly, some participants from the NE group showed an attitude of resignation and expressed the desire to get the test because many family members had already taken it, which ultimately shared or publicized part of their data. They displayed helplessness and low efficacy or control over their data. In this context, (NE4) said: “All my family has done it. It’s almost like the cat has been left out of the bag. So it doesn’t matter anymore.” Feelings of helplessness were prevalent in the NE group. All participants from the NE group felt they have no power over others’ decisions. NE3 commented: “I can advise [to take the test or not] if I have been asked, but it’s ultimately their decision.”
4.7.2 Fear and Attitude Change
After discussing all the use cases and possible scenarios, most participants said they would be more cautious about privacy policies if they planned to do a DNA test with a DTC-GT or share their data. Interestingly, some NE members who initially showed high interest in taking the test changed their decision. Almost all participants in both groups expressed worry or changed their decision after the hypothetical insurance company scenario. (NE21) said: “If I had done the test before the interview, I would not have read the policy, but now if needed to take a test, my first step is extensive research on companies.” Most participants responded similarly (very uncomfortable) when we asked about their family sharing their data. All participants said for-profit businesses should not have access to DNA data.
4.7.3 Regret and Realization
We asked about users’ interest in sharing, or giving law enforcement access to, their DNA data in the GEDmatch database. Most participants expressed that they had never known or realized that law enforcement could access these databases. We found that most participants did not understand that sharing their DNA data also reveals part of their relatives’ data. After the scenario, when they realized their data could be used to trace them or their relatives, the majority of EG participants regretted taking an at-home DNA test. Also, most NE participants who were earlier interested in taking a DNA test now
changed their decision and expressed no interest in taking the test. (E6) said: “I wish I would have read their terms, I feel terrible about myself.” Similarly, (NE11) said: “Now, I realize I would be taking someone’s right to privacy away by doing it myself.”
4.7.4 Defensiveness and Low Expectation of Privacy
Many participants had low expectations of privacy for their data, especially regarding law enforcement surveillance. (NE29) said: “I think whatever you put online, law enforcement has access to it, your text messages, your phone call history. I’m not going to fight against it even if I do not like it. Because there’s nothing, I can do right now in my hands to change that.” Some EG participants showed a defensive attitude after learning that law enforcement could access the data. (E11) mentioned: “if somebody wants it [DNA], they can just follow you to McDonald or Starbucks to get it.”
4.7.5 Consent and Victimization
Some participants insisted that taking a DNA test or sharing DNA data should be done with the consent of family members. They said sharing DNA data without the family’s consent is a breach of privacy. (NE15) said: “It is a matter of consent; you should always discuss with your immediate family first, it’s unfair to share without their consent.” A few other participants perceived that they could become victims of others’ decisions. (NE14) said: “It feels like I can be a victim of some others’ ignorance.”
4.8 Lack of Knowledge
We also asked the participants whether they knew of any laws, rules, or policies protecting DNA data stored in public genealogy or DTC-GT databases. None of the participants had any knowledge of them, but most assumed there should be some laws and regulations, or that such data should fall under HIPAA. We also asked the participants whether they read terms and conditions when they share their data. Forty-nine participants said they never read terms and conditions because they are very lengthy, use convoluted legal language, and are boring. Some said that, as they want the service or access immediately, they do not have time to read them. Twenty-one experienced participants did not read the privacy policies or terms and conditions before taking the test or signing in to the testing company’s database to look at the dashboard. (E11) said: “No, I did not read them [policies] properly and do not remember much of it.” Overall, both groups lacked awareness of privacy policies and data access.
4.9 Race and Nationality: DNA Data Sharing
(We did not ask the participants their race; they mentioned it of their own accord.) Some participants expressed race- and nationality-related concerns about both testing and sharing. Almost all of these participants were from the NE group; a couple of EG participants acknowledged this as well. Access to DNA data by law enforcement can be used to trace people or put them under surveillance, and the concerns centered on distrust of law enforcement and of government in general. Participants mentioned that framing and racial discrimination could potentially be a problem. (NE1) said: “I am a minority; we have seen a lot of issues with African Americans and the law enforcement, like the recent Floyd case; a lot of things went wrong there. Our people are more vulnerable; history is evident that we are treated unfairly by any authority, even in the hospital. In education also, they think black means not being creative and manipulate science.” A few participants felt unsafe about the government accessing the database; they feared the government could target them for deportation. Further, they added that DNA test results speak to ethnicity, which could make some races uninsurable given prejudices about health conditions tied to race. Some participants expressed concerns about research organizations obtaining their data, believing DNA data could be manipulated to target a particular race. Unlike conventional data, shared DNA data affects not only the person sharing it but also others in their family, as it inherently reveals some of the family’s DNA; this was a concern among a few participants of particular races. We also found that people of a particular nationality perceived that DNA testing and sharing could put them in trouble and showed no interest in getting the test. Some participants, describing themselves as privileged, were very interested in getting the test for its benefits. (E20) said: “As a white person, I am privileged, I would not be targeted; I understand why a black person would resist.” This demonstrates that race and nationality can strongly shape privacy perceptions of DNA testing and sharing.
4.10 Future Expectations and Motivations for DNA Sharing
We asked participants about possible future motivations to take a DNA test and share DNA data. Nearly half of the NE group said they would not share their DNA data even if their families and friends were sharing theirs. The future motivations mentioned by participants included success stories of finding family, reconnecting with lost relatives, and finding out family history or biological parents if adopted. In the health field, learning about one’s own and one’s family’s health conditions in an emergency, or path-breaking DNA-based medicine, could likewise prompt them to take the test. Exploring and preserving data about heritage for future generations could be another motivation. (NE7) mentioned: “If needed before having children, I would do DNA testing to make sure that I wasn’t a carrier for any genetic diseases. That is
important to me; I think here benefits outweighed any possible risks.” Most participants of both groups expected at-home DNA testing and sharing to grow more popular as users are curious to explore and connect. Furthermore, in this era of globalization, the desire to preserve heritage, lineage, and ethnicity, or to discover biological parents if adopted, suggests that at-home DNA testing and sharing will become widespread in the future.
4.11 Users’ Suggestions (Privacy Preserving)
We asked the participants for their suggestions of settings, preferences, or policy changes that would make these DNA testing and sharing platforms more privacy preserving. Table 9.1 lists all the users’ suggestions gathered after discussing all the scenarios. Almost all users emphasized “No access to any business (59),” meaning that no business organization should have access to DNA data.
Table 9.1 Privacy-preserving settings suggested by participants

Suggestion (Freq.): Description
Checkboxes (50): Explicit tick marks or checkboxes for all data sharing practices
Each time asking (36): Not one-time opt-in or opt-out; every access should have opt-in or opt-out options
Transparent data sharing (58): Communicating information about who, when, and how data is being used
Consent from family (10): In order to upload DNA, family members should be asked and need to give consent
No access to any business (59): Any business should not have access to genetic data
Governmental oversight (24): DNA data should be owned and regulated by the government
Censor (26): DNA data should be censored before being uploaded to any commercial website
Secure encryption (32): Secure encryption should be enforced on public genealogy databases to prevent data breaches
Anonymous data (32): Genetic data needs to be anonymized without any identifier tracing back
Personalized, granular type (54): Different levels of functionality corresponding to the level of data sharing
Strictly genealogy (34): DNA data in public genealogy sites should only be used for genealogy
Control on information (51): DNA data should be kept private unless both parties accept each other’s requests to reveal information
Handout version of policy (57): A fine print or small, understandable version of the policy
HIPAA (24): Genetic data should be under HIPAA and made a protected class
DNA testing services should be offered only by healthcare doctors, as participants believe their data is safer with healthcare providers than with any other business company.
5 Discussion
We explored users’ perceptions of at-home DNA testing and of sharing in public genealogy databases and with different entities. Our motivation is that DNA data sharing in open databases is becoming increasingly popular [4]. Acknowledging this, we explored end-users’ views, opinions, and assumptions concerning the opportunities and risks of at-home DNA testing and DNA data sharing, and their future motivations to take the test and share. We split 60 participants into two groups to investigate the differences and similarities in point of view between users who have already taken the test and users who have not. We found that people are mostly ignorant of the interconnected nature of DNA data and became concerned about testing and sharing in open databases after learning about it. Nevertheless, they are interested in taking a DNA test in case of family health needs. We additionally observed an effect of experience on users’ understanding of DNA testing with DTC-GTs. The implications are discussed below.
5.1 Privacy Perceptions
Almost all participants were ignorant or unaware of the interconnected nature of DNA: most people do not comprehend that DNA is shared within a family. They were apprehensive when they learned about this phenomenon, which we explicitly mentioned while discussing the “Golden State Killer Case.” They felt it is a breach of privacy and simply unethical, as someone could be traced without consent. They considered themselves helpless, as they could be victims of others’ decisions. DNA data discloses not only ancestry but also health information such as hereditary diseases, which can be tied to other family members, raising a question of the privacy of health data. Like earlier studies [8, 23], we found that users are ignorant of the risks of DNA data sharing. Though most have some privacy concerns, these are not enough to stop them from taking a DNA test. However, they would prefer to test with a health organization for an essential and unavoidable purpose, not out of curiosity. They also showed interest in learning about privacy policies, T&Cs, and settings such as opt-in and opt-out options, which can guide informed decision-making about current data sharing practices. Nevertheless, learning about GEDmatch-like databases and potential data sharing with different entities (law enforcement, insurance) can hinder the sharing of genetic data on public platforms. Hence, making such testing commercially available to end-users through for-profit companies requires attention and caution, and proper rules and regulations must
be developed. Another misconception regarding DNA data was that giving away saliva or spit reveals only a part or piece of one’s DNA, not the entire DNA sequence. People therefore need to be made aware that even if they give away just their saliva, they are giving away their entire DNA sequence; learning this bothered many experienced participants too. Educating people is therefore very important when they share such sensitive data. Moreover, most people talked about the need for consent when using someone’s DNA from a DTC-GT or uploaded to open databases. The majority agreed to let law enforcement use the data for violent crimes but demanded to be asked beforehand. Unethical tracking, involuntary surveillance, and being dragged into law enforcement cases were strongly condemned. There are prevailing worries about the chances of false convictions, framing, or discrimination against certain races. People were pleased to share their data for health and medical research with credible organizations, provided they received full information about how their data is used and handled. There were pronounced privacy concerns after the insurance scenario, leading to decision changes. Further, the need for laws, rules, and regulations was raised in users’ opinions: they stated that if DNA testing is commercially available to end-users, there should be appropriate rules, legislation such as HIPAA, or government oversight. Also, restraints should be enforced on sharing and selling people’s genetic data to third parties, so that users’ private health information cannot be accessed or used without their consent. Race and nationality are among the most influential factors in people’s privacy perceptions. Participants kept pointing out the history of discrimination, biases, and prejudices against their race; they mentioned the “Floyd case” and prejudices in the health sector, such as assumptions of “high pain tolerance,” to explain their concerns. African American participants were much more concerned about taking the test and sharing their data with law enforcement, health research organizations, or insurance companies. Similarly, Hispanic participants were worried about law enforcement and government access to the data, citing the forceful deportation of immigrants. Some Asian participants were quite worried about whether the government or officials could obtain their data in their own countries. All these perceptions were obtained from an educated population. We conclude that appropriate risk communication [6, 7, 14] is needed to facilitate users’ informed decision-making.
5.2 Privacy Trade-off
From the analysis, we found that people are quite concerned about their privacy. For example, users were frequently against at-home DNA testing if it is done by a for-profit company (rather than a health organization), if the data can be accessed by other entities, especially insurance companies, and if there is minimal regulation or accountability for data practices. Additionally, their concerns revolved around surveillance and the use of their data without their consent. However, if they have an urgent need, for instance to learn about their health conditions before planning for children or to find family
if adopted, they consider at-home DNA testing quick and useful and do not debate privacy. Hence, we can infer that if users perceive the benefits as adequate, they would be interested in taking a test. Furthermore, users are highly concerned about various entities, such as law enforcement, research organizations, insurance companies, or other third parties, obtaining their DNA data. They also talked about the possibility of the data being sold to third parties. Still, they acknowledged that these databases could help solve cold cases or advance health research. Besides, people were concerned about being traced, being dragged into law enforcement investigations, or causing trouble for their family or genetic relatives. Still, they recognized and appreciated law enforcement catching heinous criminals. Some stated that these databases are an excellent tool for law enforcement, even if that would put a relative behind bars. They suggested that there should be appropriate regulations and procedures for using these databases, and that no race or person should be discriminated against on the basis of their DNA. Users also recognized the value of DNA data for advancing the health field. Most users mentioned that research should be done transparently and by credible research organizations, eliminating any biases or prejudice. This indicates an evident tension in users between usage and privacy: at-home DNA testing should be done by health organizations, or used by law enforcement to solve violent crimes, or by credible research organizations with consent and transparency, but not misused for framing, biased decisions, involuntary tracking, or insurance purposes.
5.3 Attitude Differences
We found differences between the experienced and non-experienced groups’ opinions regarding DNA testing and sharing. In general, privacy concerns about DNA data resonated strongly in the non-experienced group. Most NE participants disputed the usage of DNA data by law enforcement or insurance companies and showed little interest in sharing their data with GEDmatch. They felt more strongly than the experienced group that obtaining consent from, or discussing with, family before taking the test is essential. Additionally, the feeling of being a victim of others’ choices was prominent in the NE group. The experienced group, on the other hand, exhibited a more relaxed stance, although tones of regret and anxiety were very apparent after discussing the scenarios. The perceived benefits were higher in the experienced group, and “low expectations of privacy” were common among its members. The rationale behind these contrasts between the experienced and non-experienced groups may be the low efficacy of data control felt by experienced participants, as they have already shared their data; this might also explain their defensive attitude. Most experienced members regretted their decision when they realized the interconnected nature of DNA. There was a huge lack of understanding about DNA data before we explained our first scenario, and hardly any knowledge about the policies, terms and conditions, or data practices of the testing or sharing companies. We believe that our approach
to exploring people’s perceptions of at-home DNA testing successfully captured the gap between those who have already taken a test and those who have not. Consequently, it is apparent that there was a great difference between the two groups regarding testing and sharing DNA data in public genealogy databases. This gives insight into people’s privacy decision-making, suggesting that there should be adequate and clear information to assist people in making informed decisions.
5.4 Limitations and Future Work
We investigated users’ perceptions of at-home DNA testing and sharing using a qualitative approach. A common hurdle in such investigations is the sample size: participants are typically recruited until data saturation is reached [12]. We recognize, however, that these results do not allow for quantitative comparisons. Our participants were predominantly female (28 women in the experienced group and 22 in the non-experienced group) and more educated than the average population, as we recruited only from a university. Therefore, the results of these analyses cannot be generalized to the general population. Additionally, we discussed only a few functionalities of GEDmatch and a few scenarios; discussing more scenarios and GEDmatch tools might yield additional insights. Future research might apply different approaches, such as surveys, to obtain quantitative insights. Risk communication could be used to make users aware, as the insurance scenario caused most participants to either reconsider or regret their decision.
6 Conclusion
Commercial DNA testing and sharing in public genealogy sites are currently very popular. This study is an exploratory attempt to identify the critical investigation points in privacy research for at-home DNA testing and sharing. We contribute timely insights by investigating users’ perceptions of at-home DNA testing and of sharing in public genealogy databases, assessing the perceptions of users with and without experience of at-home DNA testing. We used a video demo of GEDmatch to give users a better understanding of public genealogy databases and gather their opinions, and we discussed scenarios inspired by current use cases and potential threats to elicit users’ judgments. We found that people, in general, need more awareness of the nature and risks of sharing DNA data. Most importantly, DNA data sharing needs further research on possible knowledge delivery methods to inform people of the privacy implications.
References
1. 23andme has more than 10 million customers—the DNA geek, https://thednageek.com/23andme-has-more-than-10-million-customers/, accessed: 2021-09-27
2. Free qualitative data analysis software—QDA Miner Lite, https://provalisresearch.com/products/qualitative-data-analysis-software/freeware/, accessed: 2021-10-12
3. Genetic testing market—growth, trends, and forecast (2019–2024), https://www.mordorintelligence.com/industry-reports/global-genetic-testing-market-industry, accessed: 2020-03-30
4. More than 26 million people have taken an at-home ancestry test, https://www.technologyreview.com/2019/02/11/103446/more-than-26-million-people-have-taken-an-at-home-ancestry-test/, accessed: 2021-09-06
5. Zoom Video Communications (2021). Cloud phone, webinars, chat, virtual events—Zoom. [online] available at: , accessed 16 September 2021
6. Al Qahtani, E., Shehab, M., Aljohani, A.: The effectiveness of fear appeals in increasing smartphone locking behavior among Saudi Arabians. In: Fourteenth Symposium on Usable Privacy and Security (SOUPS 2018), pp. 31–46 (2018)
7. Albayram, Y., Khan, M.M.H., Fagan, M.: A study on designing video tutorials for promoting security features: a case study in the context of two-factor authentication (2FA). International Journal of Human–Computer Interaction, pp. 1–16 (2017)
8. Baig, K., Mohamed, R., Theus, A.L., Chiasson, S.: “I’m hoping they’re an ethical company that won’t do anything that I’ll regret”: users’ perceptions of at-home DNA testing companies. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–13 (2020)
9. Baptista, N.M., Christensen, K.D., Carere, D.A., Broadley, S.A., Roberts, J.S., Green, R.C.: Adopting genetics: motivations and outcomes of personal genomic testing in adult adoptees. Genetics in Medicine 18(9), 924–932 (2016)
10. Childers, A.: Adoptees’ experiences with direct-to-consumer genetic testing: emotions, satisfaction, and motivating factors. Ph.D. thesis, University of South Carolina (2017)
11. Edge, M., Coop, G.: Attacks on genetic privacy via uploads to genealogical databases (2019). https://doi.org/10.1101/798272
12. Francis, J.J., Johnston, M., Robertson, C., Glidewell, L., Entwistle, V., Eccles, M.P., Grimshaw, J.M.: What is an adequate sample size? Operationalising data saturation for theory-based interview studies. Psychology and Health 25(10), 1229–1245 (2010)
13. Greenbaum, D., Sboner, A., Mu, X.J., Gerstein, M.: Genomics and privacy: implications of the new reality of closed data for the field. PLoS Computational Biology 7(12), e1002278 (2011)
14. Harbach, M., Hettig, M., Weber, S., Smith, M.: Using personal examples to improve risk communication for security & privacy decisions. In: Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems, pp. 2647–2656. ACM (2014)
15. He, D., et al., Eskin, E.: Identifying genetic relatives without compromising privacy. https://genome.cshlp.org/content/early/2014/03/09/gr.153346.112.abstract, accessed: 2020-03-31
16. Humbert, M., Ayday, E., Hubaux, J.P., Telenti, A.: Addressing the concerns of the Lacks family: quantification of kin genomic privacy. In: Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security (CCS ’13), pp. 1141–1152. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2508859.2516707
17. Kaiser, J.: Agency nixes deCODE’s new data-mining plan. Science 340(6139), 1388–1389 (2013). https://doi.org/10.1126/science.340.6139.1388
18. Levenson, E.: Enhanced control over files with document watermarking. https://www.cnn.com/2019/05/27/us/genetic-genealogy-gedmatch-privacy/index.html, last accessed 5 September 2021
19. Malin, B., Sweeney, L.: How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems. Journal of Biomedical Informatics 37(3), 179–192 (2004)
20. Marchini, J., Howie, B.: Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11(7), 499–511 (2010). https://doi.org/10.1038/nrg2796
21. Nelson, S.C., Bowen, D.J., Fullerton, S.M.: Third-party genetic interpretation tools: a mixed-methods study of consumer motivation and behavior. The American Journal of Human Genetics 105(1), 122–131 (2019)
22. Ney, P., Ceze, L., Kohno, T.: Genotype extraction and false relative attacks: security risks to third-party genetic genealogy services beyond identity inference. In: NDSS (2020)
23. Saha, D., Chan, A., Stacy, B., Javkar, K., Patkar, S., Mazurek, M.L.: User attitudes on direct-to-consumer genetic testing. In: 2020 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 120–138. IEEE (2020)
24. Shaban, H.: DNA testing service MyHeritage says 92 million customer email addresses were exposed. https://www.washingtonpost.com/news/the-switch/wp/2018/06/05/ancestryservice-myheritage-says-92-million-customer-email-addresses-were-exposed/, last accessed 5 September 2021
25. Weir, B.S., Anderson, A.D., Hepler, A.B.: Genetic relatedness analysis: modern data and new challenges. Nature Reviews Genetics 7(10), 771–780 (2006)
Chapter 10
Unwinding a Legal and Ethical Ariadne’s Thread Out of the Twitter Scraping Maze
Arianna Rossi, Archana Kumari, and Gabriele Lenzini
Abstract Social media data is a gold mine for research scientists, but such data carries unique legal and ethical implications, and there is no checklist that can be followed to effortlessly comply with all the applicable rules and principles. On the contrary, academic researchers need to find their way in a maze of regulations, sectoral and institutional codes of conduct, interpretations and techniques of compliance. Taking an autoethnographic approach combined with desk research, we describe the path we have paved to find the answers to questions such as: what counts as personal data on Twitter and can it be anonymized? How may we inform Twitter users of an ongoing data collection? Is their informed consent necessary? This article reports practical insights on ethical, legal, and technical measures that we have adopted to scrape Twitter data and discusses some solutions that should be envisaged to make the task of compliance less daunting for academic researchers. The subject matter is relevant for any social computing research activity and, more generally, for all those who intend to gather data of EU social media users.
Keywords Social media data scraping · Text and data mining · Social computing · Research ethics · GDPR compliance · Anonymization · Pseudonymization · Informed consent · Data integrity · Twitter
1 Introduction The social media “data gold mine” [25] is increasingly exploited by researchers from all domains. Platforms like Twitter, Reddit, and Facebook present a volume and variety of data that would be otherwise impossible to obtain, like the insights into the opinions and experiences of various communities regardless of their
A. Rossi () · A. Kumari · G. Lenzini University of Luxembourg, Interdisciplinary Center for Security, Reliability and Trust (SnT), Esch-sur-Alzette, Luxembourg e-mail: [email protected]; [email protected] © The Author(s) 2022 S. Schiffner et al. (eds.), Privacy Symposium 2022, https://doi.org/10.1007/978-3-031-09901-4_10
physical location [49]. However, the norms constituting ethical behaviour around the harvesting of information shared on social networks are the object of animated debate in the scientific community, especially when such data is sensitive (e.g., health data) or used to derive insights on vulnerable populations (e.g., mental health predictions). Harvesting social media data does not only raise ethical concerns but also raises questions about its legitimacy and lawfulness: just because researchers can easily access such data does not mean that they should dispose of it at will. Generally speaking, the mining and analysis of data available on social platforms engenders “new forms of discrimination, exclusion, privacy invasion, surveillance, control, monetisation and exploitation” (p. 42) [33]. It is indisputable that harvesting social media data to derive clinical insights may have legal and ethical implications: for instance, those suffering from mental health issues are considered a vulnerable population and hence deserve extra protection. But using social media as a source of material for research purposes raises issues even beyond such clearly questionable cases, for example, whether it is possible to apply a level of ethical safeguard comparable to other research studies on human beings. This article seeks to shed light on the use of Twitter data for academic research to address common challenges arising from social computing [44], which consists in harvesting information available on platforms that enable social interactions. In particular, it focuses on social media data scraping [38], an approach in text and data mining [30] that uses software to find and extract contents from websites, usually at scale. There is a lack of consensus among researchers about the ethical attitudes and the legal practices surrounding data scraping [60]. Similar dilemmas about the application of research ethics to online media exist among universities’ institutional review boards, at least in the USA [59]. Such uncertainty leaves many questions unanswered and creates practical difficulties for researchers like us who need to navigate a maze of legal obligations and considerations with ethical import to collect and analyse Twitter data. Some questions concern the legal protection of personal data and its technical implementation. For instance, which content disclosed on social media is personal data, and is it possible to anonymize it? Other questions concern ethical research conduct: how may ethical safeguards be transposed to an online context where data is gathered in the absence of the research subjects, who moreover are often unaware that what they post online may be collected by third parties [26, 46]? The questions about data protection and research ethics that we address in this article are provided in Sect. 2. To find suitable answers and discuss the legal and ethical implications of social media scraping, we follow two research methods. First, we review the relevant provisions of the General Data Protection Regulation (GDPR) and research ethics principles, together with their scholarly interpretation. Second, we follow an autoethnographic approach [2] to explore the interplay between research ethics and data protection compliance, and to reflect on the difficulties of devising and applying concrete protection measures to the analysis of Twitter data.
Contribution This paper contributes to the current discussion on the ethics and lawfulness of employing social media information in academic data science research. In particular:
– it offers practical recommendations for academic researchers who seek to navigate legal compliance and ethical foresight, by referring to the specific use case of Twitter;
– it explores the reciprocal influence between research ethics principles and data protection compliance measures;
– it proposes and critically discusses solutions that should be envisaged to make the task of compliance less daunting for academic researchers.
Relevance The matters that we discuss are relevant for research performed on social media platforms like Twitter, Facebook, and Reddit, to cite a few. Moreover, they could be of interest for both EU researchers (see, e.g., the ethical requirements for Horizon project proposals [35]) and non-EU researchers who intend to analyse data of individuals located in the EU.
Disclaimer This article is not meant to become a checklist for compliance, nor does it intend to provide a collection of golden rules to gain ethical clearance. What we describe in this work is exploratory in nature and specific to the context of our particular research study. Our discussions can serve as inspiration, but none of the measures should be applied uncritically to similar research activities, because they may not be appropriate for other types of platforms, institutions, jurisdictions, research goals, etc. As an integral part of other research activities, the respect of applicable legal obligations and of research ethics principles is a creative exercise based on critical thinking that should lead to the design of ad hoc solutions. For general guidance on internet research, dedicated guidelines [56, 62] are more appropriate.
2 Methodology and Research Questions Even though automatically collecting social media data for research is a widespread activity, there is no standardized practice on how to perform it legally and ethically, one reason being that it is challenging to define and circumscribe the exact issues concerning research ethics and the applicable laws. We first explore the data protection implications and how to fulfil the related legal obligations (Sect. 4). Our approach resorts to a classical desk review that analyses inputs from relevant articles of the GDPR; their scholarly interpretation, with a focus on academic research [16, 29, 40, 45, 51]; and official guidance from data protection and cybersecurity authorities (like the European Data Protection Board—EDPB—and the European Union Agency for Cybersecurity—ENISA) [4–6, 20, 22, 31]. Similarly, to address the ethics-oriented questions (Sect. 5), we base our discussion on guidelines from
professional associations [62], codes of conduct for researchers [3, 19, 43] and academic commentaries [26, 35, 59, 63]. Additionally, we bring our personal contribution to the topic through the reflections and arguments we have developed during our own experience of scraping social media data for research. These were enriched by the long conversations about legal and ethical matters that we had with the Data Protection Officer (DPO), with the Ethical Review Panel (ERP), and with other legal representatives of our institution, the University of Luxembourg. Even though we are a team offering complementary expertise in cybersecurity, data protection, online privacy, and research ethics, we have been challenged to find a way to mine and process social media information in compliance with the provisions of the law and the indications of the DPO and the ERP. To analyse and report such experiences, we follow an autoethnographic approach, which is a “research method that uses personal experience (auto) to describe and interpret (graphy) cultural texts, experiences, beliefs, and practices (ethno)” [2], thus offering an insider’s knowledge built on everyday experiences that cannot be easily captured through other research methods. Our field notes and observations, available on demand, are the MSc thesis of one of the authors [34], the documents we filed to ask legal and ethical clearance, and the communications we exchanged internally and with the legal teams.
Questions About Data Protection
– What counts as personal data in social media data?
– Is it possible to anonymize social media data?
– Which legal principles apply in this research context and which measures should be implemented to comply with them?
– How can a great number of social media users be informed that their personal data is being gathered?
Questions About Research Ethics
– Should researchers ask for social media users’ consent to research participation, even when their data is public?
– How might social media users express their will of abstaining from or withdrawing their participation and how can their data hence be excluded?
– How can data about underage users be excluded, when it is highly challenging—and even impossible—to determine whether social media contributors are adults or minors?
– What kinds of social responsibility do social media data scientists have?
– What additional safeguards should be adopted to minimize the harms and maximize the benefits of the research study?
– What are, if any, the points of tension and of interplay between legal and ethical principles?
3 Research Scenario: Dark Patterns, Twitter, and Accountability
We are currently collecting content submitted by social media users on Twitter on the topic of dark patterns [28, 39], as part of the interdisciplinary project “Deceptive patterns online (Decepticon)”.1 Dark patterns are defined as potentially manipulative designs that influence users to take decisions that are contrary to their interests but benefit the companies implementing them, for instance, selecting privacy-adverse settings in an application. Practices that unduly influence the behaviour and choices of individuals are considered illegal in the EU under consumer protection and data protection law, and dark patterns may fall under this category [36]. Understanding what makes a pattern “dark”, and potentially illegal, is the goal of the Decepticon project. On Twitter, there is a lively community that shares screenshots of potential dark patterns, with the hashtag #darkpatterns. Such content is a precious source of lay user perspectives on a topic that is dominated by expert opinions of lawyers, designers, and academics. It also provides manifold examples of dark patterns that can be used to build large corpora for, e.g., developing machine learning classifiers that automatically recognize dark patterns [9]. At the moment of writing, there is a dearth of data sets of dark pattern samples available for research. Hence, mining social media data could significantly contribute to our understanding of dark patterns and the design of solutions against them. Text and data mining for research purposes are permitted in the EU at the conditions enshrined in Art. 3 of the Directive (EU) 2019/790 (also known as the Copyright Directive) [17]. In line with this, Twitter has put in place a mechanism of verification to allow selected developers to access its content through a dedicated API: without authorization, data scraping is prohibited.2 In January 2021, Twitter launched an updated version of its API3 to grant academic researchers free access to full-archive search of public data and other endpoints, meaning that the whole Twitter database (holding data from 2011 till now) can be accessed, excluding the contents marked as private by their authors. Through this API, academic data scientists can scrape an impressive amount of tweets rapidly and free of charge, while leveraging enhanced features to gather specific data of interest. Before obtaining access to Twitter’s API we were compelled to describe the purpose of our research to the Twitter developers’ team, who reviewed our motivation, mission, and practices. We needed to prove that we are employees of an academic institution and explain how we intend to use Twitter, its APIs, and the collected data in our project. In another round of Q&A, we illustrated our methodology for analysing data and our intention of not disclosing it to government entities.
1 https://irisc-lab.uni.lu/deceptive-patterns-online-decepticon-2021-24/
2 “[S]craping the Services without the prior consent of Twitter is expressly prohibited”, https://twitter.com/en/tos#intlTerms
3 https://developer.twitter.com/en/docs/twitter-api/early-access
After this, we were authorized to open an academic developer account, get access to the API, and start mining data. Once we obtained the authorization, we started questioning what was the rightful way to proceed. From an ethical standpoint, data scientists are held accountable for their research conduct, including data management. Similarly, in the GDPR accountability corresponds to an ethical responsibility concerning the collection, storage, processing, and sharing of personal data. It means that the burden of evaluating the lawfulness and fairness of data processing falls primarily on those that collect and use the data, who also need to be able to demonstrate how they do it in a compliant manner [11]. Although resorting to social media data is a widespread practice both in academia and industry, e.g., to gather big data sets of user opinions on a specific topic, there is no sound checklist that can guide researchers towards legal and ethical behaviour. Understanding and implementing abstract rules into concrete decisions and actions appears labyrinthine even for experts, especially because each study deserves a customized solution.
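To make the workflow more concrete, the following is a minimal sketch of how such a full-archive search could be queried once an academic developer account has been granted. The endpoint, query operators, and page size reflect our understanding of the Twitter API v2 academic access track at the time of writing and may have changed since; the bearer token is a placeholder, not the project's actual configuration.

```python
# Hedged sketch: paginated full-archive search for #darkpatterns tweets
# via the Twitter API v2 academic access track (parameters are illustrative).
import requests

BEARER_TOKEN = "YOUR-ACADEMIC-BEARER-TOKEN"  # placeholder, not a real credential
SEARCH_URL = "https://api.twitter.com/2/tweets/search/all"


def search_dark_pattern_tweets(start_time="2011-01-01T00:00:00Z"):
    """Collect public, non-retweet tweets tagged #darkpatterns."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    params = {
        "query": "#darkpatterns -is:retweet",
        "start_time": start_time,
        "max_results": 500,  # largest page size allowed on the academic track
    }
    tweets = []
    while True:
        response = requests.get(SEARCH_URL, headers=headers, params=params)
        response.raise_for_status()
        payload = response.json()
        tweets.extend(payload.get("data", []))
        # The API signals further pages with a pagination token.
        next_token = payload.get("meta", {}).get("next_token")
        if not next_token:
            return tweets
        params["next_token"] = next_token
```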
4 Protecting Users’ Data and Identity
Although the Data Protection Directive 95/46/EC already regulated the use of personal data in the EU before the entry into force of the GDPR, research practices involving personal information have never been so much under the spotlight. The GDPR has generated anxiety within the academic walls [14], even though the regulation provides several exemptions and derogations that afford researchers a certain degree of freedom. Nevertheless, they are not exempted from the application of technical and organizational measures that are meant to safeguard the rights and freedoms of individuals (Art. 89) [16], like data minimization and access control. To carry out a study involving personal data analysis, it is mandatory to engage with the expert opinion of the Data Protection Officer (DPO) of the university and to maintain a record of processing activities (Art. 30 GDPR). Thus, before the data collection starts, at our institution, principal investigators are expected to fill in a Register of Processing Activities (RPA) where they detail the categories of personal data they intend to collect, the retention period, the envisaged security measures and other relevant information. This ex-ante activity has forced us to interpret the legal notions contextually, draft a concrete data protection plan adhering to principles like data minimization and storage limitation, and make data-protection-by-design choices [15, 20].
4.1 Personal Data in Social Media Research
According to the GDPR, personal data “is any information that relates to an identified or identifiable living individual” (Article 4). This definition is only
apparently simple because it hides a continuum of data. Even isolated pieces of information, when linked together, can lead to the identification of a person. Pseudonymized [40] data is considered personal data because it can be recombined to identify individuals, and so is encrypted data, when it can be easily decrypted. On the contrary, anonymized data that is irreversibly unlinkable to any individual is no longer considered personal data. But in a world where a simple search online of a quote can lead to its author, it is daunting to fully de-identify personal information. On social media, both actual data produced by the user and metadata about such information can be used to re-identify a person with little effort, like usernames (even pseudonyms [62]), pictures and screenshots, timestamps, location markers, tags, mentions of other users, shared user networks, and names that appear on comments and retweets. Hence, we decided to adopt a broad interpretation of the notion of personal data because both text and pictures may contain references or pieces of information that could be combined to allow re-identification. Thus, even when tweets are shared publicly, assuming that they probably constitute personal data helps to determine the permissible extent of scraping (Sect. 5.1) and the feasibility of anonymization (Sect. 4.2).
4.2 Confidentiality
Protecting the confidentiality of personal data and the identity of individuals greatly overlaps with one of the cornerstones of research ethics (i.e., respect for privacy, Sect. 5.1.2). From the onset of a research project, a data-protection-by-design approach [20] should assess whether personal data can be anonymized and thus be outside of the scope of the GDPR, which would exempt us from applying technical and organizational measures to protect the confidentiality and security of the data (Art. 32.1).
4.2.1 Anonymization and Pseudonymization
Although it would be comfortable to believe that once direct identifiers (i.e., user handles, mentions, etc.) have been removed, social media data becomes anonymized (as it is often claimed), such comfort is illusory. A number of parameters can serve as re-identification clues due to data indexing [56], ranging from looking for the text of online posts on search engines and thereby retrieving its author [7, 41], to the metadata related to online activities [64]. Other media (e.g., videos, images, etc.) associated with tweets may also contain personal data and their URL may readily redirect to the user profile. Hence, even though direct identifiers are masked (e.g., by nulling out that piece of information or by substituting it with a different data value [4]), a tweet can easily be re-associated with its author. On the other hand, a full anonymization of data, understood as an irreversible process that makes it impossible to infer the person to whom the data belongs,
is in fundamental tension with the utility of the data required for the analysis [4, 40]. Further, de-anonymization techniques [58] may invalidate the researchers’ endeavours to anonymize the data. Only by aggregating data (e.g., by providing aggregated statistics) and rendering individual events no longer identifiable can a dataset be genuinely anonymized. Thus, we opted for pseudonymization techniques that only remove certain identifiers with the intent of hindering re-identification, contributing to data minimization and enhancing data security. Pseudonymization, furthermore, offers the advantage of retrieving the authors of the tweets to ask for their consent to republication and to allow them to opt-out from the study (Sect. 5.1.3). Concretely, we have applied a deterministic pseudonymization [22] where each user ID and each tweet ID are replaced with a unique code (i.e., a pseudonym) issued by a counter. The codes are stored in a separate mapping table so that the correspondence between tweet and its author can be re-established at will. All mentions to other users (i.e., user handles) were deleted from the source file, but kept in the mapping table in view of being included in the data analysis since they contain mentions to companies employing dark patterns. As the media URLs may trace to the user, we reported them in a third file (i.e., the link file) and replaced them in the source file with a string that signals whether the tweet contains a media file or not. The timestamp and location of posting underwent a process of generalization [4] (i.e., city to country and exact time to month, year) since such an information granularity is not necessary for our analysis. As a result, we obtained three separate files that only when considered together can render the full picture: (a) the pseudonymized source file, where the tweets were polished from hashtags and user handles and numbered, the authors were replaced with pseudonyms, the time and place were generalized; (b) a mapping table, where user IDs are associated with tweet number and pseudonyms; (c) a link file where media URLs, user handles, and hashtags are stored. The images and other media are stored in a different folder. The files should be stored separately to prevent unauthorized access that would make the users identifiable.
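A minimal sketch of this deterministic pseudonymization step is given below. The field names, pseudonym format, and helper logic are our own illustrative assumptions, not the authors' actual schema; the point is that a counter issues one stable pseudonym per user, direct identifiers move to the mapping and link files, and time and place are generalized.

```python
# Hedged sketch of counter-based pseudonymization with generalization.
import re
from itertools import count

_user_counter = count(1)
mapping_table = {}  # user_id -> pseudonym; kept in a separate, protected file


def pseudonym_for(user_id):
    """Issue one stable pseudonym per user (deterministic pseudonymization)."""
    if user_id not in mapping_table:
        mapping_table[user_id] = f"P{next(_user_counter):04d}"
    return mapping_table[user_id]


def split_tweet(tweet, tweet_no):
    """Return (source_row, link_row): the polished record and the re-linkable data."""
    text = tweet["text"]
    source_row = {
        "tweet_no": tweet_no,
        "author": pseudonym_for(tweet["author_id"]),
        "text": re.sub(r"[@#]\w+", "", text).strip(),  # strip handles and hashtags
        "has_media": bool(tweet.get("media_urls")),
        "month": tweet["created_at"][:7],               # exact time -> year-month
        "country": tweet.get("country", "unknown"),     # city generalized to country
    }
    link_row = {
        "tweet_no": tweet_no,
        "handles": re.findall(r"@\w+", text),
        "hashtags": re.findall(r"#\w+", text),
        "media_urls": tweet.get("media_urls", []),
    }
    return source_row, link_row
```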
4.2.2 Encryption, Secure Authentication, and Access Control
An additional technical safeguard that enhances security and prevents unauthorized access is encryption (Art. 32), which can be considered as a pseudonymization measure [4]. Encryption, if done properly, converts human-readable data into a form that is unreadable and irreversible, unless one possesses the decryption key [54]. We have stored the encrypted data on a local hard disk and used an encrypted channel to transfer it to a private Git repository. The access to such repository is subject to a strong authentication procedure via SSH public key authentication. Encryption provides confidentiality and secure authentication, but encryption keys are hard to manage and can be irremediably lost, thus password authentication is a viable alternative to access services and data. However, the creation, storage, and retrieval of strong passwords and their sharing across different devices are cumbersome. A role-based access control [54] that distinguishes between students,
senior researchers, principal investigators, etc. ensures further protection from unauthorized access, but setting up and keeping access rules up-to-date may not be feasible for everybody without the support of a technical team. Git repositories that are locally managed by the university’s technical departments offer strictly monitored means of access control and allow us to avoid external cloud services that could expose the data to the risk of access from other jurisdictions without an adequate level of legal protection.
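As an illustration, symmetric encryption of the pseudonymized files before they leave the local disk could look like the sketch below, which uses the Fernet primitive from the Python cryptography package; the file names are hypothetical, and the key would have to be kept outside the repository (for example, in an institutional key vault).

```python
# Hedged sketch: encrypt the pseudonymized files at rest before pushing them
# to the access-controlled Git repository.
from cryptography.fernet import Fernet


def encrypt_file(path, key):
    cipher = Fernet(key)
    with open(path, "rb") as source:
        ciphertext = cipher.encrypt(source.read())
    with open(path + ".enc", "wb") as target:
        target.write(ciphertext)


key = Fernet.generate_key()  # generate once; never commit it alongside the data
for name in ("source_file.csv", "mapping_table.csv", "link_file.csv"):
    encrypt_file(name, key)
```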
4.3 Purpose Limitation
The principle of purpose limitation (Article 5(1)(b)) compels researchers to be clear and open about the reasons for obtaining personal data [31]. Purpose limitation is closely linked to the principles of fairness, lawfulness, and transparency (Article 5(1)(a)), but scientists benefit from an exception: the data may be used for secondary processing on the assumption of compatibility with the original purpose of collection (Art. 5(1)(b) and Recital 50 GDPR) [16]. Generally stating that the processing purpose is “to conduct scientific research” may not be enough though, as similar projects may expose individuals to different kinds of harm. In the RPA, we were required to disclose additional details about the intended modalities of data collection and analysis (e.g., automated analysis of images of dark patterns, with the goal of training classification algorithms able to recognize new instances of dark patterns on websites). Such information is key to infer whether an envisioned processing activity may result in a high risk for individuals and thus whether a Data Protection Impact Assessment (DPIA) is necessary [5]. Criteria to evaluate its necessity encompass whether, like in our case, personal data are not collected directly from data subjects and the processing happens at large scale. Our methods, however, do not involve an innovative use of technology with unknown social and personal consequences, nor privacy-intrusive activities like profiling or automated decision-making that may expose individuals to higher (ethical) risks [35]. Thus, it was established that a DPIA was not necessary.
4.4 Data Minimization
Intimately linked to the research purpose, the data minimization principle (Article 5(1)(c)) mandates to collect data that is adequate (i.e., sufficient with respect to the purpose), relevant (i.e., justified by the purpose), and limited to what is necessary (i.e., not more than is needed for that purpose) [31]. We abstained from gathering data on users’ profiles that was not relevant to our research and tweets that did not explicitly mention dark patterns. Instead, we collected usernames to trace back the users and ask their consent for tweet republication (Sect. 5.1.3), as well as the broad location and timestamps of the tweets to discern, e.g., in which countries users
are more concerned about dark patterns and whether this tendency evolves over time. We also scraped media URLs to build a database of dark pattern images, and retweets and likes to grasp the “popularity” of certain dark patterns. Moreover, we gathered mentions that can contain the name of companies employing dark patterns.
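In practice, data minimization can already be enforced at collection time by requesting only the fields needed for the analysis. The sketch below extends the search request shown earlier with an explicit field selection; the parameter names follow the Twitter API v2 conventions but should be treated as illustrative rather than a verified, current specification.

```python
# Hedged sketch: request only the fields justified by the research purpose.
MINIMAL_FIELDS = {
    "tweet.fields": "created_at,geo,public_metrics,entities",  # time, place, likes/retweets, mentions
    "expansions": "author_id,attachments.media_keys",          # author (for consent requests) and media
    "user.fields": "username",
    "media.fields": "url",
    # Deliberately not requested: profile descriptions, follower lists,
    # or any tweets that do not match the #darkpatterns query.
}

# Usage: merge into the search parameters of the earlier sketch, e.g.
# params.update(MINIMAL_FIELDS)
```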
4.5 Storage Limitation
The principle of storage limitation (Article 5(1)(e)) prescribes that personal data should only be kept as long as necessary to fulfil the processing purpose [31] and should be deleted or fully anonymized right afterwards. However, this requirement is at odds with research ethics: for reasons of data integrity and research transparency, it is recommended to keep the data for 10 years [19]. Although anonymization should be preferred, a specific research exemption allows storing the data in an identifiable form for longer than what would be strictly necessary [16], provided that reasonable justification and security measures (e.g., encryption, pseudonymization, etc.) are in place [35].
4.6 Legal Basis
Research activities involving personal data must be justified on an appropriate legal basis among consent, public interest or legitimate interest (Art. 6) [16] and may also depend on national laws. There is no straightforward way of determining whether one is more suitable than the other, as each comes with its own advantages and disadvantages. Since we indirectly collect information from many individuals, and thus the acquisition of consent would be “complicated or administratively burdensome” [37, p. 58], we followed the lead of the DPO to argue that our data processing is necessary for the performance of a task carried out in the public interest.
4.7 Transparency
Transparency is another fundamental principle of data protection (Art. 5(1)) and translates into the disclosure about the use of data and the rights of individuals in this respect [6]. Note that such disclosure is necessary even when the processing is not based on consent. The transparency obligation mandates the provision of specific pieces of information (Art. 13, e.g., the identity of the entity collecting data, the purposes of data collection, etc.) in an easily accessible and easy-to-understand manner (Art. 12) before the data collection starts. We benefit, however, from a research exemption when personal data are not directly obtained from the data
subject (Art. 14), since with scraping, data are first collected by the social media platform, and only then gathered and reused by researchers for their own purposes. In such cases, informing thousands of individuals of the processing activities before data collection occurs would constitute a disproportionate effort [16, 35]. Hence, the transparency obligation can be fulfilled by drafting a public notice addressed to all those that may be impacted by the research. We therefore published the notice on our project’s website4 but also pinned a summary of it on our Twitter profile.5 We followed best practices of legal design to present the information in a user-centred manner [6, 50, 51]. First, as our audience consists in a broad non-expert user base, we used plain language, removed unnecessary legalistic expressions and applied a conversational style that distinguishes the responsibilities and rights between “we” (i.e., the researchers) and “you” (i.e., research participants). We also clarified technical terms and explained the consequences of certain data practices, e.g., “We pseudonymize the information that we collect: we separate the identity of the users (like usernames) from the rest of the data (posts, images, etc.) and we replace real identities with fake names (also called pseudonyms). In this way, only us can re-identify the users to whom the data refers”. In addition to plain language, we used layering techniques (e.g., main words in boldface) to allow readers to skim through the text and quickly find the piece of information they are looking for. Moreover, sections are structured in accordions, i.e., they can be expanded on request and lighten an otherwise lengthy text, while headings are formulated as FAQs (e.g., What are your rights concerning your personal data?). However, such an on-demand notice [53] is tucked away on the project’s website, so there is little chance that Twitter users actively seek for and find it. Thus, we reported a shortened version of it on our Twitter profile and we devised a periodic notice [53] that once a month reminds Twitter users of the existence of our (imperceptible) data collection (see Fig. 10.1).
Fig. 10.1 The recurrent tweet scheduled for the first of each month and meant to warn Twitter users of our data collection. It indicates where they can find additional information and how to opt out of the study
4 https://irisc-lab.uni.lu/public-notice-for-twitter-and-reddit-web-scraping/.
5 https://twitter.com/ULDecepticon.
the hashtag #darkpatterns and a mention of the highly popular Twitter handle @darkpatterns to increase its visibility. It also includes a link to the public notice on our website and a short explanation of how to opt out of the study (Sect. 5.1.3).
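Posting such a recurrent notice can be automated. The sketch below illustrates one way to do so with the tweepy library and the Twitter API v2; it assumes developer credentials issued by the Twitter developer portal, and the notice text and cron trigger shown in the comments are illustrative rather than our actual configuration.

```python
# Minimal sketch of the monthly transparency tweet (hypothetical setup).
# Intended to be triggered once a month, e.g. by a cron entry such as:
#   0 9 1 * *  python post_periodic_notice.py
import tweepy

NOTICE = (
    "We collect public tweets containing #darkpatterns for research. "
    "Details and opt-out: "
    "https://irisc-lab.uni.lu/public-notice-for-twitter-and-reddit-web-scraping/"
)

def post_periodic_notice() -> None:
    client = tweepy.Client(
        consumer_key="...",          # placeholders: real keys come from the
        consumer_secret="...",       # Twitter developer portal
        access_token="...",
        access_token_secret="...",
    )
    client.create_tweet(text=NOTICE)  # create_tweet is available in tweepy >= 4.0

if __name__ == "__main__":
    post_periodic_notice()
```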
4.8 Data Subjects' Rights
Unlike anonymization, pseudonymization enables tweets' authors to exercise their rights: the right to access (Art. 15), rectify (Art. 16), and erase (Art. 17) their data, as well as to ask us to limit the data processing (Art. 18) and to object to processing (Art. 21), which would prevent us from using the data. However, Art. 89 provides for research-specific derogations when necessary and proportionate. Moreover, since the legal basis is the public interest, the right of erasure can be limited (Art. 17(3)(d)) because deleting data may prevent data verification and thereby undermine the scientific validity of the research study [16]. However, such limitations may (arguably) apply only to studies where the data analysis is concluded [45], because when that is not the case, exercising this right does not seriously hamper the research progress.
5 Ethics of Using Social Media Data for Research Purposes
Unlike legal provisions, norms contribute to, but do not determine, what ethical behaviour is [26]. There may be disagreement between researchers and other stakeholders (e.g., the public, ethicists, policymakers, etc.) about what constitutes ethical practice, which may depend on the context where the study occurs [60]. That said, in the EU there is a lively ethics culture surrounding research, crystallized in international, national and institution-specific codes of conduct [19] and even regulations [21]. EU scientific projects, for example, must respect the principle of proportionality and the rights to privacy, to the protection of personal data, to physical and mental integrity, and to non-discrimination (Art. 19 of the Regulation establishing Horizon Europe) [21], in line with the European Code of Research Integrity [3] based on principles like reliability, honesty, respect, and accountability. Unlike in the USA [26], in Europe research performed on social media data is considered full-fledged research with human subjects. As such, it is governed by the same ethical principles [62]. One of the unique features of this area of research is that content disclosed on a certain platform for a certain reason is extrapolated and reused for other purposes in a different context—without the awareness of the authors of such content [26, 46]. For instance, people may use social media to feel part of a community and share their experiences, e.g., about poor mental health [49]. Even if such information has been disclosed publicly, this does not imply
that one can lightheartedly scrape, analyse, or report it without users' knowledge. This concern recalls the construct of "contextual integrity" [42], which posits that confidentiality expectations are tied to the inherent norms of specific contexts. Given such premises, the participation of individuals in research performed through data mining can be considered neither voluntary nor informed. Data mining thus exposes researchers to risks and questions that are unique to the Internet and constitute an uncharted (or at least only partially explored) territory. The research ethics strategies outlined in the following sections refer to the four core principles underpinning ethical research conduct with human beings [43] and have received the ethical approval of the institutional ERP after two rounds of negotiations.
5.1 Respect for the Autonomy, Privacy, and Dignity
5.1.1 Public vs. Private Information
Data scientists may erroneously be under the impression that information publicly shared on social media platforms can be reused without limitations [35] and at will, without seeking consent from the interested parties or ethical clearance, similarly to what is permissible for research conducted in public places where individuals can reasonably expect to be observed [61]. It is true that users share views, pictures, etc. on online platforms that are easily accessible, persistent over time, and designed to boost visibility and engagement. Moreover, the terms and conditions commonly contain clauses warning against third-party access to the data6 [56]. However, studies demonstrate that most Twitter users are unaware that their content can be harvested by others [26, 46], as terms and conditions are simply clicked away. The distinction between public and private spheres on the internet is decisively blurred, but certain elements can help to infer what is acceptable behaviour (see the concept of contextual integrity recalled above). Whereas social media members disclosing content in a password-protected group may have reasonable expectations of high confidentiality [62], we may assume that users contributing to a public debate on a certain topic with a tweet containing a relevant hashtag, thus designed to be visible and retrievable, will expect less confidentiality (Sect. 5.1.3). Another crucial element is the sensitivity of the information at hand [62]: content about drug abuse, sexual orientation, and health-related issues, as well as data shared by minors, should be handled with utmost care, as it may cause harm to the authors when leaked outside of the context it was intended for. In such cases, researchers should reflect on the consequences of such disclosure to a different audience in a
6 The 2021 Twitter’s privacy policy states that “Most activity on Twitter is public, including your profile information, [. . .] your Tweets and certain information about your Tweets [. . .]. [. . .] By publicly posting content, you are directing us to disclose that information as broadly as possible, including through our APIs, and directing those accessing the information through our APIs to do the same”. Source: https://twitter.com/en/privacy.
different context. Moreover, user permission should be sought [26, 61]. In our case, we deliberated that scraping is permissible, since (i) we only gather manifestly public tweets containing hashtags meant to contribute to the lively debate about dark patterns, (ii) the topic is not sensitive, and (iii) we abstain from any privacy-invasive processing like profiling.
5.1.2 Anonymity and Confidentiality
No matter if public or private, the information we gather is nevertheless personal and deserves protection (Sect. 4.2). Research investigators have the duty to avoid causing physical, emotional, or psychological harm to participants [7]. Confidentiality seeks to eliminate risks of increased vulnerability, embarrassment, reputation damage, or prosecution [57], hence the need to proceed carefully when information is repurposed in a context different from where it was originally disclosed. With regard to the publication of individual excerpts (like verbatim quotes in academic articles), full anonymization seems difficult to attain because public tweets are retrievable with a simple online search. The only viable way to conceal users' identity consists of paraphrasing the tweet [62]. However, such an artificial alteration would compromise the ethical tenet of data accuracy and integrity (Sect. 5.2). Hence, we have decided to abstain from quoting verbatim tweets without a good reason [26] and, should such a reason arise, to ask the permission of the authors. In light of such considerations and given the envisaged opportunity to withdraw from the study, we applied pseudonymization to be able to map tweets to authors and added other security measures to protect the confidentiality of data and identities (Sect. 4.2).
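As a rough illustration of the pseudonymization step just described, the sketch below shows one way the mapping between authors and pseudonyms could be kept apart from the research corpus. It is our assumption of how such a step might look, not the project's actual pipeline, and the field and function names are hypothetical.

```python
# Sketch: pseudonymize scraped tweets by replacing usernames with random
# pseudonyms, keeping the re-identification key table separate from the corpus.
import secrets

def pseudonymize(tweets: list) -> tuple:
    """tweets: list of dicts like {"username": ..., "text": ..., "created_at": ...}."""
    key_table = {}        # username -> pseudonym; store separately and encrypted
    research_data = []
    for tweet in tweets:
        user = tweet["username"]
        if user not in key_table:
            # random, non-derivable pseudonym (not a hash of the username)
            key_table[user] = "user_" + secrets.token_hex(4)
        research_data.append({
            "pseudonym": key_table[user],
            "text": tweet["text"],
            "created_at": tweet["created_at"],
        })
    return research_data, key_table
```

Keeping the key table separate from (and more protected than) the corpus is what allows both re-identification for withdrawal requests and a lower risk profile for the data used in the analysis.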
5.1.3 Informed Consent and Opt-out of Unwitting Participants
Our research study has public interest as its legal basis rather than ("legal") consent (Sect. 4.6). Nevertheless, informed consent for research participation is still indispensable as an ethical safeguard ("ethical consent"), as it allows individuals to exercise their autonomy [43]. Only when informed about the modality and risks of the study can they freely express their willingness to take part in the study, refuse their consent, or withdraw it whenever they desire. However, scraping data at large subverts such logic, given that seeking consent from each potential participant before the study begins would represent a disproportionate effort. Moreover, it is assumed that gathering non-sensitive, visibly public data does not require consent [62]. Yet, informing participants about the study constitutes good practice and may be deemed sufficient, as some surveys reveal [26]. Thus, in our case, we do not ask for consent to participate in the study, but we provide public information (Sect. 4.7) and the possibility to withdraw from the study (i.e., retrospective withdrawal [62]), which also partially overlaps with the right to object to processing (Art. 21) enshrined in the GDPR. The withdrawal implies that any further analysis on the collected data must cease, although whether the
already processed data should be excluded from the research findings, as discussed in Sect. 4.8, may eventually depend on practical issues, such as the effort needed to exclude that data and carry out a new analysis. We inform Twitter users about opt-out options in the periodic tweet, on our Twitter profile, and on our website.
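A withdrawal request could then be honoured through the same key table used for pseudonymization; the following sketch is, again, an illustration under the assumptions above, with hypothetical function and field names.

```python
# Sketch: honouring an opt-out request by removing a user's records from any
# further analysis. Uses the key table from the pseudonymization step.
def handle_opt_out(username, research_data, key_table):
    pseudonym = key_table.pop(username, None)
    if pseudonym is None:
        return research_data              # user not present in the corpus
    # drop all of that user's records before any further analysis
    return [row for row in research_data if row["pseudonym"] != pseudonym]
```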
5.2 Scientific Integrity
5.2.1 Data Quality
To ensure the validity of the findings and conclusions of a piece of research, the data should not be tampered with nor transformed in an unreasonable manner [62]. This is why paraphrasing user-generated quotes to protect user confidentiality should be weighed against the possibility of altering the research results. An additional concern that could hamper data quality derives from the eventuality of including data of vulnerable populations, such as minors, and of false social media profiles powered by bots, while excluding the views of populations that are not present on that platform. Further, given the opacity surrounding the disclosure of data by social media platforms through their APIs, it is challenging to determine whether the data made available is representative or complete. Lastly, inaccurate and even misleading inferences [62] may be generated through text analysis (e.g., sentiment analysis) and other machine learning techniques (e.g., clustering). Such concerns will be duly addressed when we perform the data analysis and will be discussed in our next publications.
5.2.2 Minors
Unlike other research methods, data scraping does not make it possible to screen who participates in the research [62]. Such lack of oversight may not only jeopardize scientific integrity but also lead to the mining of the personal data of minors, who are considered a vulnerable population, without authorization from their legal guardians, because the minimum age requirement to register on Twitter and other social media is 13 years. What is more, such platforms do not offer ways to exclude minors' profiles from data collection. This constitutes a serious issue, as minors are not considered capable of providing their consent to research participation and should not even be contacted in this regard. The ad hoc solution we elaborated envisages that we only report the results in an aggregate form, given that the data that is the object of our research is not sensitive and does not point back to individuals. In case we intend to seek (ethical) consent with a view to republishing individual quotes, we envision contacting only users who are presumably over 18 according to the birth date on their profile (which can, however, be fake) or who have a professional profile associated with their social media profile (e.g., a profile on LinkedIn, an academic website, etc.). When obtaining consent is not possible but the republication of a
tweet is particularly meaningful, we envision republishing only paraphrased tweets. We have also asked Twitter to implement a functionality that allows developers to filter out minors' data in their next API release, given that they hold user age information.
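Until such an API-level filter exists, a coarse client-side check along the lines we envisage might look like the sketch below. It is purely illustrative: birth dates on profiles are self-declared and often missing or fake, so such a filter can only support, not replace, the aggregate-only reporting described above.

```python
# Sketch: excluding presumably-underage accounts before any contact for
# re-publication consent. Self-declared birth dates make this a coarse filter.
from datetime import date
from typing import Optional

def presumably_adult(birth_date: Optional[date], today: Optional[date] = None) -> bool:
    if birth_date is None:
        return False                      # no birth date -> do not contact
    today = today or date.today()
    age = today.year - birth_date.year - (
        (today.month, today.day) < (birth_date.month, birth_date.day)
    )
    return age >= 18
```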
5.3 Social Responsibility
5.3.1 Reputation Damage
An additional ethical dilemma derives from the research topic: dark patterns are ethically questionable and may also be illegal. But when people share their views about interfaces that they deem dark patterns within their digital entourage, they are expressing a personal opinion. Considering such opinions in our analysis may legitimate them and negatively associate the companies therein mentioned with dark patterns (e.g., if we provide figures about the most cited "evil" firms in our database). Even if the analysis is scientifically sound, companies may suffer reputation damage. This issue echoes ethical questions about vulnerability disclosure in cybersecurity research [48], which may lead to lawsuits. Researchers need to find ways to mitigate that risk or, alternatively, solidly motivate their position if they are convinced that naming and shaming would force companies to take action to repair their reputation. An ethical conduct of responsible disclosure demands engaging in a dialogue with the companies before going public whenever possible. In the case of dark patterns that may violate the law, we will submit that information to the relevant watchdogs without any public judgement about their lawfulness.
5.3.2 Dual Use
Like other online deception research (e.g., phishing) [24], this study is exposed to the risk of dual use of the findings, because they could lead the mentioned companies to replace a publicly exposed dark pattern with a more aggressive one and inspire other companies to employ dark patterns for their own benefit. However, the dilemma is mild: even if unconventionally delivered through user interface designs, the mechanisms behind manipulative commercial and cybersecurity practices have been known for decades (e.g., [10, 13]). Hence, we deem that our research does not encourage the adoption of dark patterns and that, on the contrary, its benefits outweigh the risks: it contributes to shedding light on the pervasiveness of such practices, increasing consumer awareness, and fostering the creation of solutions.
5.3.3 Risks for Researchers
Using social media can expose researchers to cyber risks, such as cyberbullying, stalking, and media pillories, which can lead to professional and personal issues
like loss of scientific reputation and anxiety. Such risks must be mitigated, especially when they concern early-stage researchers like master's or PhD students.
5.4 Maximize Benefits and Minimize Harm
All measures listed above are meant to minimize the risks for participants and researchers, while maximizing the research benefits. One further issue concerns compensation: it is common practice to reward research participants for their time and effort, but doing so with data that was collected indirectly is problematic. Surveyed individuals have expressed that, even without financial compensation, they could benefit from being informed about the publication of the study and its results [26]. This is why we plan to distribute this and subsequent articles in open access and disseminate them on Twitter using the hashtag #darkpatterns.
6 Discussion
A total of 15,784 public tweets and 3284 media images have been scraped for the period July 2011–July 2021. The data collection will continue for the 3-year duration of the project, hence the data practices described in these pages may evolve. What lessons have we drawn from our tentative path out of the legal and ethical maze of Twitter data mining?
6.1 The Question of Time and Expertise
The procedures to obtain ethical clearance and legal authorization should be initiated well in advance of the planned research activity to avoid any delay. In our case, several months passed between the decision to perform social media scraping and the actual start (Fig. 10.2), due to (a) the conception, design, and application of measures with legal and ethical import; (b) their constant review arising from a more nuanced understanding of the research challenges and the interplay between legal provisions and ethical principles (e.g., pseudonymization to enable opt-out); and (c) the iterative negotiation with the ERP and the DPO about issues like minors' consent, opt-out options, anonymization, etc. These concerns may appear to research scientists like dull conundrums that add unnecessary complexity to other already daunting research tasks. Furthermore, accurately planning data management requires an uncommonly high level of competence, both theoretical (e.g., the selection of an appropriate pseudonymization technique for the data at hand) and practical (e.g., its application). This is why we echo Vitak et al. [60] in
Fig. 10.2 This timeline exemplifies the process of devising data protection and ethical measures that started in November 2020 and ended in August 2021
recommending that ethical and lawful deliberation should be constructed together with expert colleagues, ethics boards and legal teams of the institution.
6.2 The Question of Motivation
Burdensome compliance may delay researchers, who are already under time pressure due to, e.g., the project timeline and personal career development plans. Moreover, the quality of researchers' work is almost exclusively evaluated through indicators like publications and project deliverables, without any attention to ethical and lawful conduct. Given that there are no one-size-fits-all approaches, researchers must tailor the limited existing guidelines to their specific use case based on questions like: How sensitive is the subject matter? Are these data meant to be public or private? Can anonymization techniques be successfully applied? Etc. One may wonder what the reward is for spending time and energy on compliance, while delaying the actual research, and what the penalty is for failing to do so, since there are no visible drawbacks. Although ethical and legal aspects are an integral part of researchers' responsibility, merely reiterating this message does not constitute effective leverage to trigger the desired behaviour. In certain ICT communities, like Human-Computer Interaction, it is impossible to publish a study involving humans without an explicit section about ethical considerations—and indeed, the ACM has recently updated its publication policy in this respect [1]. However, this is not common practice: only a minority of published studies on Twitter [63] and Reddit data [47] mention institutional ethical clearance. Even then, it is mostly to claim that their study was exempt. Obtaining approval for data mining may appear superfluous as the data is publicly available on the internet and is often republished in aggregate form. In addition, although we have not investigated this aspect, the research exemption enshrined in Art. 3 of the Copyright Directive may erroneously lead data scientists to assume that they have complete freedom to process data available on the internet. Moreover,
given that compliance with data protection provisions may appear "impossible to follow" [14], scientists' decision-making may be influenced by convenience considerations or by anxiety over unforeseen violations of the GDPR, which may lead to excluding EU participants from research studies [18] and impact research integrity. Given that Twitter is the most popular social network platform for examining human interaction [26] (so popular that it has released an API dedicated to academic developers), it is remarkable that data scraping can still be regarded as "an uncharted territory". Without convincing incentives that entice researchers to readily embrace cumbersome, time-consuming ethical and legal conduct, one common solution may consist of sidestepping the rules. Thus, which incentives could help researchers to strike a balance between productivity and accountability?
6.3 Incentives
We hereby examine two sorts of incentives. The first set of solutions aims at raising researchers' awareness of ethical and legal demands and strengthening their ability to address them. Traditional training about the GDPR and research integrity is routinely offered at academic institutions, although it is usually optional, while codes of conduct and documents clarifying procedures are available on the intranet. However, such guidance is often too general to be able to answer doubts about specific technologies or data processing practices. Going beyond awareness and training, in the last few years entire European projects have been dedicated to seeking ways to embed ethical and legal training into computer science and engineering curricula7 and to developing ethical frameworks to support technology developers in the assessment of the impact of their creations.8 Experiential learning initiatives are increasingly being experimented with, with the aim of building a more visceral sense of accountability and ethico-legal sensibility through discussion games,9 science-fiction storytelling [55] and immersive experiences based on videogames and Virtual Reality [32]. That said, all such endeavours are necessary to make researchers raise the relevant questions, but are insufficient to find and implement appropriate answers. In the domain of compliance, it is a well-known fact that higher awareness does not necessarily translate into safer behaviour—an inconsistency called the "knowledge-intention-behaviour gap" (p. 84) [12]. When technologies and procedures are too complex, human beings find workarounds to what they see as an obstacle to their primary task (i.e., performing research). Thus, the second order of solutions should make compliance and ethical decision-making less burdensome for research
7 For example, Legally-Attentive Data Scientists https://www.legalityattentivedatascientists.eu/; Ethics 4 EU http://ethics4eu.eu/.
8 For example, https://www.sienna-project.eu/.
9 Play Decide https://playdecide.eu/.
scientists. Guidance drafted as practical instructions in lay terms and best practices dedicated to certain technologies or use cases should be created and proactively brought to the attention of researchers. Toolkits for scientists can also prove useful in this respect, such as the toolkit on informed consent created by Sage Bionetworks [27], which contains resources, use cases and checklists to help researchers think through the consenting process. There is an urgent need for usable off-the-shelf solutions that simplify and expedite academic compliance tasks. For instance, together with its academic API, Twitter could offer user-friendly functionalities to filter out minors' data and directly provide pseudonymized data—not to mention a developer policy10 that would provide clear, simple rules and take less than 17 min to read (i.e., 4842 words). In-house applications that encrypt, pseudonymize, or anonymize information and offer transparent-by-design templates for information notices would also be helpful. When universities cannot develop their own solutions, they could offer contractors' services to their employees at reasonable prices.
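As an indication of how thin such an in-house layer could be, the following sketch wraps the Fernet primitive from the Python cryptography package to encrypt a scraped dataset at rest. It is a minimal illustration rather than a recommended implementation, and key management, the genuinely hard part, is deliberately not shown.

```python
# Sketch: symmetric encryption of a scraped dataset at rest with Fernet
# (from the "cryptography" package). The file name is hypothetical.
from cryptography.fernet import Fernet

def encrypt_file(path: str, key: bytes) -> None:
    f = Fernet(key)
    with open(path, "rb") as fh:
        ciphertext = f.encrypt(fh.read())
    with open(path + ".enc", "wb") as fh:
        fh.write(ciphertext)

if __name__ == "__main__":
    key = Fernet.generate_key()   # store securely, e.g. in an institutional vault
    encrypt_file("tweets_pseudonymized.json", key)
```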
7 Limitations and Future Work
The practices described in these pages may need to be updated as the research activities proceed and new (interpretations of) regulations are introduced. It would be important to gauge the effectiveness of our mechanisms of transparency (e.g., how many users tweeting about dark patterns are aware of our public notice and data collection?) and opt-out (e.g., how many Twitter users are aware they can opt out and find it easy to do so?), and, if necessary, devise better ones (e.g., an automated warning for each tweet that is scraped). Moreover, certain data protection measures could be stronger: we would like to apply random number generators for pseudonyms and additional processes to ensure unlinkability for individual values like tweets [22]. We are currently developing a Python library for the automated masking of location markers and of people and company names (the latter meant to enable sharing of our dataset with other parties without risking defamation charges) [52]; a sketch of the underlying idea closes this section. However, the named entity recognition techniques on which masking is based are not yet completely reliable, especially on tweets, which present atypical sentence structures and wording. An additional promising technique that would enhance the de-identifiability of tweets' authors by removing unique stylistic cues is differential privacy applied to authorship [23]. The question of whether it should be the exclusive responsibility of researchers to put in place such (often hard to implement) cybersecurity practices—or whether it should be the duty of the institution to which they belong to offer support as well as data management frameworks for that purpose—remains unanswered. Ariadne's thread has not guided us out of the maze yet. To perform our planned data scraping on Reddit11 we will need to assess the relevant ethico-legal concerns
10 https://developer.twitter.com/en/developer-terms/policy. 11 https://www.reddit.com/r/darkpatterns/.
and devise the most appropriate measures for that platform, which may depart from those adopted for Twitter data. Other dilemmas remain dangling: is it legal to share our dataset with other parties? One of the aims of our project is to publish part of the scraped dataset on an open source intelligence data sharing platform (i.e., MISP12) to enable collaborative data analysis. However, the research exception contained in the Copyright Directive only covers data mining, not data sharing. Such data disclosure poses original ethical challenges [8]. Future work will be dedicated to devising viable solutions.
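As announced above, the masking library we are developing relies on named entity recognition. The sketch below illustrates the basic idea with spaCy, assuming the small English model is installed; it is a simplified illustration, not the library's code, and, as noted, its output on tweets would need manual review.

```python
# Sketch: mask person, organisation and location names in tweet text with
# spaCy named entity recognition before sharing a dataset.
import spacy

LABELS_TO_MASK = {"PERSON", "ORG", "GPE", "LOC"}
nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm

def mask_entities(text: str) -> str:
    doc = nlp(text)
    masked = text
    # replace entities starting from the end of the string so offsets stay valid
    for ent in reversed(doc.ents):
        if ent.label_ in LABELS_TO_MASK:
            masked = masked[:ent.start_char] + "[" + ent.label_ + "]" + masked[ent.end_char:]
    return masked

# example sentence is invented; the output depends on the model's predictions
print(mask_entities("Jane Doe says ACME's cookie banner is a dark pattern."))
```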
8 Conclusions
There is no established protocol that researchers can simply follow to mine social media information legally and ethically. Existing guidelines and best practices leave plenty of questions unsolved; therefore, researchers are forced to invest resources to find their way in a maze of regulations, techniques of compliance, scholarly interpretations and codes of conduct that may differ from country to country. By self-reflecting on our experience gained with Twitter data scraping, we learned a few lessons which can be of help to other researchers, even though they are not intended to constitute a checklist. First, all data and metadata collected from social networks should be considered personal data and thus be subject to the obligations enshrined in the GDPR. Second, even when such data is public and apparently ready to be used, it should still be subject to the same ethical safeguards that apply to research on human subjects. Understanding the sensitivity of the subject matter and the risks deriving from data collection and analysis must inform the acceptability of certain research practices and the approach to data processing, with a golden rule: respect the norms of the context where the information was originally disclosed [60]. Third, even though consent may not be used as the legal basis for personal data processing, "ethical" informed consent to research participation should be sought. If this proves impossible due to the number of people involved and the content is not sensitive, other safeguards may suffice, like a public notice and a visible, user-friendly way to opt out of the study. Fourth, it is challenging to fully anonymize the content published on social networks (and no, masking usernames does not count as anonymization) as it can be easily traced to its authors. Moreover, full anonymization is at odds with data utility and integrity. Pseudonymization, on the other hand, enables research participants to exercise their data rights (e.g., demand the exclusion of their data from analysis) and to be contacted in case of republication of their posts. However, it is less risky and cumbersome to only publish aggregate data without verbatim quotes. Fifth, as data are only pseudonymized, a number of additional security measures should be taken, like encryption, access control, etc. In conclusion, before starting social media data collection, research scientists may want to ponder whether this is absolutely necessary and proportionate to their
12 https://www.misp-project.org/.
research goals. Although we are starting to see the light at the end of the ethico-legal maze, there are still a number of questions we need to address to continue carrying out research on social media data, whilst solutions to make legal and ethical compliance less demanding are lacking.
Acknowledgments This work has been partially supported by the Luxembourg National Research Fund (FNR)—IS/14717072 "Deceptive Patterns Online (Decepticon)" and the H2020-EU grant agreement ID 956562 "Legally-Attentive Data Scientists (LeADS)". We thank the Data Protection Officer, the legal team and the Ethics Review Panel of the University of Luxembourg for their support. We would also like to acknowledge Maria Botes, Xengie Doan, Rossana Ducato and Soumia Zohra El Mestari for their valuable feedback on an earlier draft of this article and for some of the ideas appearing in this version. Lastly, a great thanks to all Twitter users posting about dark patterns.
References 1. ACM Publications Board: ACM Publications Policy on Research Involving Human Participants and Subjects (Aug 2021), https://www.acm.org/publications/policies/research-involvinghuman-participants-and-subjects 2. Adams, T.E., Ellis, C., Jones, S.H.: Autoethnography. The International Encyclopedia of Communication Research Methods p. 1–11 (2017), https://onlinelibrary.wiley.com/doi/abs/10. 1002/9781118901731.iecrm0011 3. ALLEA—All European Academies: European Code of Conduct for Research Integrity. ALLEA—All European Academies, Berlin (2017), https://www.allea.org/wp-content/uploads/ 2017/05/ALLEA-European-Code-of-Conduct-for-Research-Integrity-2017.pdf 4. Article 29 Data Protection Working Party: Opinion 05/2014 on Anonymisation Techniques (2014), https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/ 2014/wp216_en.pdf 5. Article 29 Data Protection Working Party: Guidelines on Data Protection Impact Assessment (DPIA) and determining whether processing is “likely to result in a high risk” for the purposes of Regulation 2016/679 (Oct 2017), https://ec.europa.eu/newsroom/just/document. cfm?doc_id=47711 6. Article 29 Data Protection Working Party: Guidelines on Transparency under Regulation 2016/679, 17/EN WP260 rev.01 (Apr 2018), https://ec.europa.eu/newsroom/article29/ document.cfm?action=display&doc_id=51025 7. Beninger, K., Fry, A., Jago, N., Lepps, H., Nass, L., Silvester, H.: Research using Social Media; Users’ Views. NatCen Social Research (2014), https://www.natcen.ac.uk/media/ 282288/p0639-research-using-social-media-report-final-190214.pdf 8. Bishop, L., Gray, D.: Ethical challenges of publishing and sharing social media research data. In: Woodfield, K. (ed.) Advances in Research Ethics and Integrity, vol. 2, p. 159–187. Emerald Publishing Limited (Dec 2017), https://www.emerald.com/insight/content/doi/10.1108/S2398601820180000002007/full/html 9. Bongard-Blanchy, K., Rossi, A., Rivas, S., Doublet, S., Koenig, V., Lenzini, G.: I am definitely manipulated, even when I am aware of it. It’s ridiculous!—Dark Patterns from the End-User Perspective. Proceedings of ACM DIS Conference on Designing Interactive Systems (2021) 10. Boush, D.M., Friestad, M., Wright, P.: Deception In The Marketplace: The Psychology of Deceptive Persuasion and consumer self-protection. Routledge, first edition edn. (2009) 11. Buttarelli, G.: The EU GDPR as a clarion call for a new global digital gold standard. International Data Privacy Law 6(2), 77–78 (May 2016) 12. Carpenter, P.: Transformational security awareness: What neuroscientists, storytellers, and marketers can teach us about driving secure behaviors. John Wiley & Sons (2019)
13. Cialdini, R.B.: Influence: The Psychology of Persuasion. Harper Business (2006) 14. Cool, A.: Impossible, unknowable, accountable: Dramas and dilemmas of data law. Social Studies of Science 49(4), 503–530 (Aug 2019) 15. Danezis, G., Domingo-Ferrer, J., Hansen, M., Hoepman, J.H., Metayer, D.L., Tirtea, R., Schiffner, S.: Privacy and Data Protection by Design—from policy to engineering. Tech. rep., arXiv: 1501.03726 (2014), http://arxiv.org/abs/1501.03726 16. Ducato, R.: Data protection, scientific research, and the role of information. Computer Law & Security Review 37, 105412 (Jul 2020) 17. Ducato, R., Strowel, A.M.: Ensuring text and data mining: Remaining issues with the EU copyright exceptions and possible ways out. European Intellectual Property Review 43(5), 322–337 (2021) 18. Duncan, A., Joyner, D.A.: With or Without EU: Navigating GDPR Constraints in Human Subjects Research in an Education Environment. In: Proceedings of the Eighth ACM Conference on Learning @ Scale. p. 343–346. ACM (Jun 2021), https://dl.acm.org/doi/10.1145/3430895. 3460984 19. Ethics Review Panel, Animal Experimentation Ethics Committee, Legal Affairs Office: Research ethics guidelines (2017), shorturl.at/nHRV1 20. European Data Protection Board: Guidelines 4/2019 on Article 25 Data Protection by Design and by Default (2019), https://edpb.europa.eu/sites/edpb/files/consultation/ edpb_guidelines_201904_dataprotection_by_design_and_by_default.pdf 21. European Parliament and Council of the European Union: Regulation (EU) 2021/695 of the European Parliament and of the Council of 28 April 2021 establishing Horizon Europe—the Framework Programme for Research and Innovation, laying down its rules for participation and dissemination, and repealing Regulations (EU) No 1290/2013 and (EU) No 1291/2013 (Text with EEA relevance). OJ L 170 (Dec 2021) 22. European Union Agency for Cybersecurity (ENISA): Pseudonymisation techniques and best practices. Recommendations on shaping technology according to data protection and privacy provisions. ENISA (Nov 2019), https://www.enisa.europa.eu/publications/pseudonymisationtechniques-and-best-practices 23. Fernandes, N., Dras, M., McIver, A.: Generalised Differential Privacy for Text Document Processing, Lecture Notes in Computer Science, vol. 11426, pp. 123–148. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-17138-4, http://link.springer. com/10.1007/978-3-030-17138-4 24. Ferreira, A., Coventry, L., Lenzini, G.: Principles of Persuasion in Social Engineering and Their Use in Phishing. In: Human Aspects of Information Security, Privacy, and Trust (HAS 2015), Lecture Notes in Computer Science, vol. 90. Springer, Cham (2015) 25. Fiesler, C., Beard, N., Keegan, B.C.: No robots, spiders, or scrapers: Legal and ethical regulation of data collection methods in social media terms of service. Proceedings of the International AAAI Conference on Web and Social Media 14, 187–196 (May 2020) 26. Fiesler, C., Proferes, N.: “Participant” Perceptions of Twitter Research Ethics. Social Media + Society 4(1) (2018) 27. Governance Team Sage Bionetworks: The Elements of Informed Consent. A Toolkit V3.0. Tech. rep., Sage Bionetworks (July 2019), https://sagebionetworks.org/wp-content/uploads/ 2019/07/SageBio_EIC-Toolkit_V2_17July19_final.pdf 28. Gray, C.M., Kou, Y., Battles, B., Hoggatt, J., Toombs, A.L.: The dark (patterns) side of UX design. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems—CHI ’18. p. 1–14. 
ACM Press (2018), http://dl.acm.org/citation.cfm?doid=3173574. 3174108 29. Hintze, M.: Viewing the gdpr through a de-identification lens: a tool for compliance, clarification, and consistency. International Data Privacy Law 8(1), 86–101 (Feb 2018) 30. IBM: What is Text Mining? (last visit October 2021), https://www.ibm.com/cloud/learn/textmining 31. Information Commissioner Officer (ICO): Guide to the UK General Data Protection Regulation (UK GDPR) (last visit October 2021), https://ico.org.uk/for-organisations/guide-to-dataprotection/guide-to-the-general-data-protection-regulation-gdpr/
32. Jin, G., Tu, M., Kim, T.H., Heffron, J., White, J.: Game based cybersecurity training for high school students. In: Proceedings of the 49th ACM Technical Symposium on Computer Science Education. p. 68–73. ACM (Feb 2018). https://doi.org/10.1145/3159450.3159591, https://dl. acm.org/doi/10.1145/3159450.3159591 33. Kennedy, H.: What Should Concern Us About Social Media Data Mining? Key Debates, pp. 41–66. Palgrave Macmillan UK (2016). https://doi.org/10.1057/978-1-137-35398-6, http:// link.springer.com/10.1057/978-1-137-35398-6 34. Kumari, A.: Measures Necessary for Scraping and Processing Social Media Personal Data: Technical, Ethical, Legal Challenges and possible Solutions. Ph.D. thesis, Université du Luxembourg (Aug 2021) 35. Kuyumdzhieva, A.: General Data Protection Regulation and Horizon 2020 Ethics Review Process: Ethics Compliance under GDPR. Bioethica 5(11), 6–12 (Jul 2019) 36. Leiser, M.R.: “Dark Patterns”: The Case for Regulatory Pluralism. SSRN Repository (Jun 2020), https://papers.ssrn.com/abstract=3625637 37. Lindén, K., Kelli, A., Nousias, A.: To ask or not to ask: Informed consent to participate and using data in the public interest. In: Proceedings of CLARIN Annual Conference 2019. p. 56– 60. CLARIN ERIC (2019) 38. Marres, N., Weltevrede, E.: SCRAPING THE SOCIAL?: Issues in live social research. Journal of Cultural Economy 6(3), 313–335 (Aug 2013) 39. Mathur, A., Kshirsagar, M., Mayer, J.: What makes a dark pattern. . . dark?: Design attributes, normative considerations, and measurement methods. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. p. 1–18. ACM (May 2021), https:// dl.acm.org/doi/10.1145/3411764.3445610 40. Mészáros, J., Ho, C.h.: Big Data and Scientific Research: The Secondary Use of Personal Data under the Research Exemption in the GDPR. Hungarian Journal of Legal Studies 59(4), 403– 419 (Dec 2018) 41. Narayanan, A., Shmatikov, V.: De-anonymizing social networks. In: 2009 30th IEEE Symposium on Security and Privacy. p. 173–187 (May 2009) 42. Nissenbaum, H.: Privacy as contextual integrity. Washington Law Review 79(1), 119–158 (2004) 43. Oates, J., Carpenter, D., Fisher, M., Goodson, S., Hannah, B., Kwiatkowski, R., Prutton, K., Reeves, D., Wainwright, T.: BPS Code of Human Research Ethics. Tech. rep., The British Psychological Society (Apr 2021), https://www.bps.org.uk/sites/bps.org.uk/files/Policy%20%20Files/BPS%20Code%20of%20Human%20Research%20Ethics.pdf, iSBN 978-1-85433792-4 44. Parameswaran, M., Whinston, A.B.: Social Computing: An Overview. Communications of the Association for Information Systems 19 (2007) 45. Pormeister, K.: Genetic data and the research exemption: is the gdpr going too far? International Data Privacy Law 7(2), 137–146 (May 2017) 46. Proferes, N.: Information flow solipsism in an exploratory study of beliefs about twitter. Social Media + Society 3(1) (Jan 2017) 47. Proferes, N., Jones, N., Gilbert, S., Fiesler, C., Zimmer, M.: Studying reddit: A systematic overview of disciplines, approaches, methods, and ethics. Social Media + Society 7(2) (2021) 48. Pupillo, L., Ferreira, A., Varisco, G.: Software Vulnerability Disclosure in Europe: Technology, Policies and Legal Challenges. CEPS (2018) 49. Reynolds, E.: Psychologists Are Mining Social Media Posts For Mental Health Research — But Many Users Have Concerns. Research Digest (Jun 2020) 50. Rossi, A., Ducato, R., Haapio, H., Passera, S.: When design met law: Design patterns for information transparency. 
Droit de la Consommation = Consumenterecht: DCCR 122–123(5), 79–121 (2019) 51. Rossi, A., Lenzini, G.: Transparency by design in data-informed research: A collection of information design patterns. Computer Law & Security Review 37 (2020) 52. Rossi, A., Arenas, M.P., Kocyigit, E., Hani, M.: Challenges of protecting confidentiality in social media data and their ethical import. In: 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pp. 554–561. IEEE (2022).
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9799350&casa_token=H9cTFNoGQ MQAAAAA:waCDkAbaVeuseLSDt5lHFa9PMY5LKD3BFVYZviWPLVHECCtKfLTWm KYQrwevIh6L0dpzov_1Cyru&tag=1 53. Schaub, F., Balebako, R., Durity, A.L., Cranor, L.F.: A design space for effective privacy notices. In: Eleventh Symposium On Usable Privacy and Security (SOUPS 2015). p. 1–17 (2015) 54. Stallings, W.: Operating system security. John Wiley & Sons, Incorporated (2014) 55. Thornley, C., McLoughlin, S., Murnane, S.: “At the round earth’s imagined corners”: the power of Science Fiction to enrich ethical knowledge creation for responsible innovation. In: Proc. of 22nd European Conference on Knowledge Management, ECKM 2021 (2021) 56. Townsend, L., Wallace, C.: Social Media Research: a Guide to Ethics. Tech. rep., University of Aberdeen (2016) 57. Townsend, L., Wallace, C.: The Ethics of Using Social Media Data in Research: A New Framework. In: Woodfield, K. (ed.) The Ethics of Online Research, Advances in Research Ethics and Integrity, vol. 2, p. 189–207. Emerald Publishing Limited (Jan 2017), https://doi. org/10.1108/S2398-601820180000002008 58. Tripathy, B.K.: De-anonymization techniques for social networks. In: Dey, N., Borah, S., Babo, R., Ashour, A.S. (eds.) Social Network Analytics, p. 71–85. Academic Press (Jan 2019), https:// www.sciencedirect.com/science/article/pii/B9780128154588000049 59. Vitak, J., Proferes, N., Shilton, K., Ashktorab, Z.: Ethics regulation in social computing research: Examining the role of institutional review boards. Journal of Empirical Research on Human Research Ethics 12(5), 372–382 (Dec 2017) 60. Vitak, J., Shilton, K., Ashktorab, Z.: Beyond the Belmont principles: Ethical challenges, practices, and beliefs in the online data research community. In: Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing. p. 941– 953. ACM (Feb 2016) 61. Williams, M.L., Burnap, P., Sloan, L., Jessop, C., Lepps, H.: Users’ views of ethics in social media research: Informed consent, anonymity, and harm. In: Woodfield, K. (ed.) Advances in Research Ethics and Integrity, vol. 2, p. 27–52. Emerald Publishing Limited (Dec 2017), https:// www.emerald.com/insight/content/doi/10.1108/S2398-601820180000002002/full/html 62. Working Party on Internet-mediated Research: Ethics Guidelines for Internet-mediated Research. The British Psychological Society (2021), https://www.bps.org.uk/sites/www.bps. org.uk/files/Policy/Policy%20-%20Files/Ethics%20Guidelines%20for%20Internet-mediated %20Research.pdf 63. Zimmer, M., Proferes, N.J.: A topology of twitter research: disciplines, methods, and ethics. Aslib Journal of Information Management 66(3), 250–261 (Jan 2014) 64. Zook, M., Barocas, S., boyd, d., Crawford, K., Keller, E., Gangadharan, S.P., Goodman, A., Hollander, R., Koenig, B.A., Metcalf, J., et al.: Ten simple rules for responsible big data research. PLOS Computational Biology 13(3), e1005399 (Mar 2017)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Part V
User Perception of Data Protection and Data Processing
Chapter 11
You Know Too Much: Investigating Users' Perceptions and Privacy Concerns Towards Thermal Imaging
Lipsarani Sahoo, Nazmus Sakib Miazi, Mohamed Shehab, Florian Alt, and Yomna Abdelrahman
Abstract Thermal cameras are becoming a widely available consumer technology. Several smartphones are already equipped with thermal cameras, and integration with personal devices is expected. This will enable compelling application areas for consumers, such as in-home security, energy-saving, non-invasive ways of child care, and home maintenance. However, the privacy implications of this technology remain largely unexplored. We close this gap with an interview study (N = 70). Specifically, we assess users’ perceptions with and without prior understanding of thermal imaging. We showed one group of the interviewees informative videos, pointing out opportunities and potential threats. Results show that users are most concerned about their privacy in cases where thermal cameras reveal information on their physiological state or invade their private space. Our findings are valuable for researchers, practitioners, and policymakers concerned with thermal cameras, as this technology continues to become widely used. Keywords Security · Thermal imaging · Privacy · Offensive technology
1 Introduction
Thermal cameras have evolved from specialized and expensive hardware to small, affordable consumer devices. Hence, they have the potential to become a technology to which users have access in their daily life as they are being integrated with
personal devices, such as smartphones or wearables (e.g., glasses). This assumption is backed by an analysis from Global Market Insight, reporting that the market size of thermal imaging crossed USD 5.5 billion in 2017 and is forecast to grow yearly by 8% between 2018 and 2024 [42]. The global number of shipped units is predicted to reach 4 million by 2024. FLIR,1 the world's largest thermal imaging device maker, is selling thermal camera add-ons for smartphones for less than $300. Furthermore, smartphones like the Caterpillar Cat S612 already integrate thermal cameras. Beyond specific, professional use cases, such as detecting passengers with fever or elevated temperature at airports (cf. the COVID-19 outbreak) or firefighters identifying dangerous areas, an increasing number of use cases in the consumer area emerge. These include, but are not limited to, in-home security, personal safety, energy efficiency, child and pet care, pest control, home maintenance, automotive care, and leisure. Abdelrahman et al. investigate domestic use cases [11]. At the same time, thermal imaging does not come without privacy and security implications. For instance, in 2001, the US Supreme Court decided that the use of sensors by the police to detect marijuana plants growing inside a home violated civil liberty, since thermal cameras could reveal things beyond what a person standing outside a home would be able to see. For example, whether "the lady of the house might be taking her daily sauna and bath" [2]. Similarly, during the winter of 2011, city officials in Boston, Massachusetts, used aerial and street thermal cameras to detect heat loss in houses, analyzing 20,000 thermal images per day. This helped optimize energy usage in the city. However, plans to involve residents to increase energy efficiency met strong resistance, as the approach could potentially reveal residents' movements and behavior inside their houses. The program was put on hold until the administration developed a privacy protection policy for homeowners [3]. Researchers started investigating the implications of thermal imaging on privacy and security. For example, it was shown that thermal imaging could be used to extract PINs from heat traces [6], to identify people from their hand veins, and to reveal mental states and behavior [20]. Yet, it remains largely unexplored how users perceive this technology and which privacy concerns they might have. Closing this gap is the focus of our work. To this end, we conducted an in-person semi-structured interview study with a total of 70 participants. To obtain a holistic view, we interviewed both people without an in-depth understanding and people to whom we demonstrated opportunities and potential privacy threats before the interviews through video showcases. Our analysis shows that participants from all groups are concerned about privacy in general, most notably about the disruption of their private space, physical privacy, the privacy of their cognitive state, and physiological privacy. Our investigation is complemented by discussing the implications of our findings and identifying directions for future research.
1 https://www.flir.com/flir-one/. 2 https://www.catphones.com/en-us/cat-s61-smartphone/.
Contribution Statement Our contribution is twofold: First, we investigate users' perceptions and privacy concerns towards thermal imaging while considering how understanding and priming influence their perception. To this end, we conducted and analyzed 70 interviews, splitting participants into two groups (primed and unprimed). Second, we provide an in-depth discussion of the implications of our findings. We found that users are most concerned about thermal cameras' capability to peek into a person's physiological state and to potentially invade private space.
2 Background and Related Work
Our work builds on three strands of prior work: (1) thermal imaging, (2) users' perceptions of sensing devices, and (3) privacy perceptions across different user groups.
2.1 Thermal Imaging
Thermal imaging operates in the infrared range of the electromagnetic spectrum (0.7–30 μm), i.e., it senses wavelengths beyond the ones visible to the human eye. Thermal cameras render thermal energy (heat) into false-color images that can be seen by human eyes. The images, called "thermograms," are analyzed through "thermography" tools. The first thermal camera was developed for military use. Later, thermal cameras were adopted for many other use cases, for example, detecting icebergs, automated machinery monitoring, analyzing structural integrity, firefighting, surveillance, and managing power line safety. Thermal cameras are also used in the health sector, for example, as a physiological monitoring tool to detect fever in humans and other warm-blooded animals [4]. In the context of the recent COVID-19 outbreak, an extensive use of thermal imaging could be observed at airports and train stations to identify infected people [5]. Until recently, thermal cameras were considered a relatively expensive technology, with camera prices reaching thousands of dollars. Hence, prior applications were often limited to specific domains such as medical, military, and industrial settings. However, with technology advances, affordable thermal cameras operating in the FIR spectrum are becoming available, with costs around a few hundred dollars. This has enabled a wide range of new applications and sparked much interest in the research community. These include in-home security, personal safety, energy saving, pest control, home maintenance, automotive care, and leisure. Gade and Moeslund reviewed the use of thermal imaging, highlighting the potential of using thermal imaging in different domains [26]. Abdelrahman et
al. [11] investigated potential use cases of thermal imaging in domestic setups. Researchers also explored thermal reflection properties to introduce novel interaction techniques [33, 39] or to extend our visual perception [7]. Another example is remote physiological monitoring [28], where researchers looked at changes in facial temperature to infer users' internal states (e.g., cognitive load and emotions). From the last example, it already becomes clear that thermal cameras allow sensitive information to be revealed. This becomes even more apparent when looking at security threats. Abdelrahman et al. [6] demonstrated that thermal cameras enable so-called thermal attacks, where thermal imaging can capture heat traces left after touching the surface of a smartphone, allowing the entered PIN or lock pattern to be retrieved. This raises the question of how users perceive this technology—in particular, regarding privacy. To close this gap, this work contributes an interview study. As becomes apparent from prior work, there are many use cases and opportunities, many of which are unknown to end-users. Hence, our exploration will focus on both novice and knowledgeable users.
2.2 Perception of Camera-Based Technologies
The privacy implications of sensing devices are of great interest to researchers [25, 32]. Cameras demand particular attention due to societal and legal expectations of privacy, as they seamlessly and unobtrusively capture users in the field of view. Koelle et al. investigated the privacy perception of body-worn cameras [30] and data glasses [31] from both a legal and a social perspective. They highlighted that despite body-worn cameras having potential benefits, they still impose ethical pitfalls and might affect bystander privacy. Widen extended this space by exploring the privacy concerns of smart cameras [45]. Unlike body-worn cameras and always-on cameras, a smart camera does not passively record information. Instead, it recognizes visual patterns using algorithms. Widen plotted a privacy matrix based on the users' location and vantage point. Researchers discussed the benefits, risks, and legalities of lifelogging [49] and concerns about dashcam video sharing [38]. Beyond cameras, Hassib et al. [27] investigated users' perception of bio-sensing and affective wearable devices and how they influence users' privacy. Recent studies investigated privacy concerns with drones [50], smartwatches [41], Internet-connected toys [35], and autonomous vehicles [15]. Researchers also explored the perception of the emerging field of Internet of Things (IoT) devices integrated into (smart) homes [23, 47, 53, 55], revealing concerns and challenges. Several approaches exist to mitigate privacy concerns. Examples include attempts to establish best practices among designers and developers when creating applications that deal with sensitive data [13], approaches that try to filter the collected data [16], or solutions in the form of data management [17]. However, as Jacobsson et al. [29] note, a prerequisite to creating meaningful approaches and strategies for privacy protection is to understand users. However,
there is no substantial work exploring end-users' perceptions regarding thermal cameras and how their privacy concerns could be mitigated. In this work, we investigate and report end-users' suggestions of rules, regulations, and censorship for thermal camera usage in public. Thus, our work can serve as a guideline for developers and policymakers.
2.3 Influence on Privacy Perception
Understanding privacy concerns is critical to determine end-users' attitudes and behavior regarding acceptance of a technology [18]. Privacy perception is influenced by different factors, most notably culture [34], country of residence [14], age [46], gender [52], and knowledge [37]. Users can be grouped by a wide range of characteristics beyond culture, including users' behavior, e.g., privacy minimalists, self-censors, and privacy balancers. Wisniewski et al. [48] categorized users into six profiles depending on their sharing and privacy attitudes in Online Social Networks (OSNs) and offered design implications per user group. Education and knowledge play a critical role in privacy perception as well. For example, previous work showed that educated people have more privacy knowledge and, hence, are more aware of privacy practices and contemporary privacy and security scenarios [21, 36, 40, 44]. For instance, Culnan et al. [21] investigated the characteristics of users who are aware of privacy-preserving features, such as the possibility to have names removed from mailing lists. Users who are unaware of this are less likely to be well-educated and less likely to be concerned about privacy. Youn [51] reports that teenagers who are aware of online information disclosure were less prone to giving out personal information and more inclined to engage in risk-reducing strategies, such as providing incomplete information or moving to alternative web sites that do not ask for personal information. Prior work highlights the challenge of holistically understanding privacy concerns towards a novel technology and the consumer's ability to enhance individual privacy protection through the use of technology [43]. To account for this influence of the understanding of thermal imaging, as well as to better understand its implications, we investigate privacy perceptions of thermal imaging among users with different levels of understanding of the technology.
3 Research Approach
To understand privacy concerns towards thermal cameras, we designed an interview study. To assess the views of people both with and without an in-depth understanding, we conducted part of the interviews with people to whom we first presented video showcases (primed) and part with people who were not shown the videos (unprimed). In
this way, we account for the fact that thermal imaging is still at an early stage of consumer market penetration and that participants generally do not own a thermal camera and, thus, do not have an in-depth understanding of thermal imaging.
3.1 Video Showcases
The video showcases were inspired by the literature. We created several scenarios demonstrating what can be done with thermal cameras. We opted for realistic scenarios that have been explored in prior research. These scenarios are based on current use cases of thermal cameras [8, 10, 12, 28]. We expanded them to reflect privacy/utility trade-offs. Scenarios were designed to highlight opportunities but also to point out how this technology could violate users' privacy. The videos were recorded using a FLIR One,3 a thermal camera attachable as an add-on to a smartphone.
Scenario 1—Detecting Stress/Cognitive Load This scenario presents an exam situation in which a student is asked questions by an examiner. A thermal camera captures the temperature of the student's forehead and nose. The increased temperature indicates that the student is under stress and increased cognitive load [28] (cf. Fig. 11.1a). The video includes an explanation of the process: the camera
Fig. 11.1 Screenshots from the video showcases used in the interview. (a) Scenario 1. (b) Scenario 2. (c) Scenario 3. (d) Scenario 4a. (e) Scenario 4b. (f) Scenario 5. (g) Scenario 6. (h) Scenario 7
3 https://www.flir.com/flir-one/.
captures the temperature; subsequently, machine learning methods are used to derive the exact temperature and, thus, the stress level and cognitive load (a simplified sketch of such an inference is given after the scenario descriptions below). The video also explains that it is possible to capture this information without the student's consent.
Scenario 2—Detecting Emotions This scenario shows a situation in which three friends (two male, one female) are having a chat and decide to take a selfie with a thermal camera. The selfie reveals a temperature difference on the female friend's cheek, indicating her shyness [28] (Fig. 11.1b). The video explains that this is just one example of how thermal imaging can reveal a user's emotion, even though the user (or a bystander) might not have intended to reveal that information.
Scenario 3—Use of Personal Objects In this scenario, a person is writing in his diary. After he leaves the room to go for a walk, another person enters the room and starts reading the diary. After the second person leaves, the diary's owner returns. The thermal image reveals that somebody touched the diary [19] (Fig. 11.1c).
Scenario 4—Seeing Through Clothes This video shows a scene in which a person captures a thermal image of a bystander from behind, with the bystander neither noticing nor giving consent [11] (Fig. 11.1d). This demonstrates both what the camera can reveal and how easy it is to obtain private information without consent.
Scenario 5—Locating Objects in the Dark In this video, a person puts a baby to sleep in a bedroom and turns off the light. Then he remembers that he left his phone in the baby's room. As he does not want to wake the baby, he scans the baby's room with a thermal camera without turning on the light and finds his phone [11] (Fig. 11.1f).
Scenario 6—Locating a Pet The video shows a person walking in the street. He hears a kitten crying behind a bush. Since he cannot see behind the bush, he is unable to locate the kitten. Using the thermal camera to scan the bush allows him to quickly locate the kitten, as its body temperature differs from that of the leaves [11].
Scenario 7—Emergency Situation In this video, two people are walking through a forest at night when one of them suddenly faints. The other person immediately starts looking for his friend but cannot easily locate him due to the darkness. So he scans the surroundings with a thermal camera to quickly locate his friend [8].
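To make the stress/cognitive-load inference sketched in Scenario 1 more concrete, the following minimal example illustrates one plausible way such a proxy could be computed from a single thermal frame. It is an illustrative sketch, not the pipeline used in the cited work [10, 28]: the region coordinates, the synthetic frame, and the use of the forehead-nose temperature difference as a load proxy are assumptions made for demonstration purposes.

```python
import numpy as np

def mean_roi_temp(frame: np.ndarray, roi: tuple) -> float:
    """Mean temperature (in degrees Celsius) inside a rectangular region of interest."""
    top, bottom, left, right = roi
    return float(frame[top:bottom, left:right].mean())

def cognitive_load_proxy(frame: np.ndarray, forehead_roi: tuple, nose_roi: tuple) -> float:
    """Forehead minus nose-tip temperature; larger differences have been
    associated with higher stress or cognitive load in prior work."""
    return mean_roi_temp(frame, forehead_roi) - mean_roi_temp(frame, nose_roi)

# Hypothetical example with a synthetic 240x320 thermal frame (values in degrees Celsius).
frame = np.full((240, 320), 34.0)
frame[40:80, 140:180] = 35.5    # warmer forehead region
frame[120:150, 150:170] = 33.0  # cooler nose tip
delta = cognitive_load_proxy(frame, (40, 80, 140, 180), (120, 150, 150, 170))
print(f"forehead-nose difference: {delta:.1f} degrees")  # a threshold on this value could flag stress
```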
3.2 Recruitment and Demographics
We recruited a total of 70 participants through our university's mailing list. The study was carried out between May and September 2019. To obtain a more diverse sample, we complemented recruiting with snowball sampling, where initial participants suggested additional interviewees. The study was approved by the Institutional Review Board. We had a primed and an unprimed group. Among the 37 primed participants, 22 were male and 15 female. These participants were aged from 19 to 61 (M = 27.1, SD = 8.9). For the unprimed group, we had 33 participants (19 male, 14 female), aged between 19 and 58 (M = 29.8, SD = 8.9). Participants had different occupations.4
3.3 Procedure and Analysis
As participants arrived at the lab, we introduced them to the study's purpose. Afterward, participants filled in a demographic questionnaire. We then asked participants whether they were generally familiar with thermal cameras and whether they had any prior experience with the technology. If they did, we asked them to describe their experience in detail. Afterwards, we explained the structure and functionality of a thermal camera. Finally, we randomly assigned participants to either the unprimed or the primed group. We showed the videos introduced in the previous section to the primed group before proceeding with the actual interview; the unprimed group proceeded with the interview immediately. In the interview, we first asked participants about their general perception and what they thought about the use of thermal cameras. We then asked, more specifically, how they would feel if friends, family members, or strangers used a thermal camera in their vicinity and vice versa. We also asked about the following aspects: asking others for consent before using a thermal camera in their vicinity, censorship, and sharing thermal images on social media.5 All interviews were transcribed manually for analysis. As the primary coders, two authors conducted inductive coding for three sample participants from each group and discussed the codes. Both coders used the QDA Miner software [1]. The authors agreed on a codebook containing 13 codes for the primed and 10 codes for the unprimed condition. Both coders then coded the remaining transcripts independently using the codebook, with no further changes made to it. When coding was complete, the researchers compared each code and discussed and resolved any disagreements. Disagreements were tracked, and inter-rater agreement was calculated at 89.82% for primed interviewees and 96.4% for the unprimed
4 https://bit.ly/3ARQ9Dj.
5 Interview questions: https://bit.ly/3ARQ9Dj.
interviewees. Overall, ten codes were the same between the two settings. As a final step, we compared the results of the primed and the unprimed settings. When discussing the results, we refer to participants as P1–P37 for the primed group and UP1–UP33 for the unprimed group.
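The agreement values reported above can be read as simple percent agreement, i.e., the share of coding decisions on which the two coders assigned the same code. The sketch below shows one way such a figure could be computed; the code labels and the example lists are hypothetical and not taken from our actual transcripts.

```python
def percent_agreement(coder_a: list, coder_b: list) -> float:
    """Share of coding decisions on which two coders assigned the same code."""
    assert len(coder_a) == len(coder_b), "both coders must rate the same segments"
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)

# Hypothetical example: codes assigned by two coders to ten transcript segments.
coder_a = ["consent", "censorship", "body", "consent", "health",
           "consent", "body", "censorship", "health", "consent"]
coder_b = ["consent", "censorship", "body", "health", "health",
           "consent", "body", "censorship", "health", "consent"]
print(f"{percent_agreement(coder_a, coder_b):.1f}% agreement")  # prints 90.0% agreement
```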
4 General Privacy Perceptions
We start by presenting common themes among interviewees from all study conditions before focusing on differences between the primed and unprimed groups in the following sections. To convey how widely a view was shared in our sample, we frequently specify the number of participants expressing a particular perception. We also use keywords: "majority" refers to more than 22 participants, "some" to 10–21 participants, and "few" to fewer than 9 participants.
4.1 Disruption of Private Space
The majority of the participants in both groups were highly concerned about their private space being invaded. They talked about the violation of their private space through the use of a thermal camera in their vicinity. The privacy of personal belongings and personal space, the possibility of being physically tracked, and becoming a victim of criminal activities because thermal cameras can see in the dark emerged as key perceptions. Users reacted negatively towards these opportunities for misuse.
It's like a double-edged weapon, good to use in places like airports, but if used on my room or home, it will be disturbing. (P33)
If the light is off, this maybe means somebody wants some privacy, but using this camera you can get to know what they are doing. (UP20)
Overall, there was a negative impression of thermal cameras in the users' minds, as they perceived them to violate their expectations of the right to privacy. Some talked about other potential misuses, such as hunting animals or night photography by criminals and terrorists.
4.2 Disclosure of Emotional/Cognitive State
All participants from the primed group and a few participants from the unprimed group talked about the violation of their cognitive and emotional privacy through the use of a thermal camera in their vicinity. Participants clearly understood that users' inner
emotions or cognitive state could be revealed with the help of machine learning applied to thermal camera data. Users noted that using this capability without appropriate consent can lead to inappropriate social interactions, allow tracking of people's emotions, and breach the right to privacy of thoughts and emotions. In many participants' opinions, a person's stress level can also be interpreted falsely through thermal imaging, causing inappropriate inferences. For instance, concerning inappropriate social interactions:
Sometimes people do not want to show [their emotion]. So a discussion might come up like why her emotions are like that. That is, again, an invasion of privacy. (P11)
Regarding the privacy of thoughts or feelings: Thoughts and feelings are private. I would not share them. People can express them. But nobody has permission to detect others’ emotions. (P31) If somebody captures my emotions without taking my permission, then that is a violation of my privacy and unethical. (P16) If somebody is anxious or delighted, this could show up on the thermal imaging. I would not be interested in others knowing about it. (UP32)
Concerning the interpretation of emotion: Your thoughts are private to you unless you decide to reveal it to another person. Others might interpret the whole conversation in a very different way, for example, if your stress level increased due to a different thought. (P12)
In summary, participants shared a common view against the usage of thermal cameras, based on the fact that these can reveal their sentiments and mental state without their consent and that such readings can be misused or misinterpreted.
4.3 Privacy of Body Parts
Thermal imaging can potentially reveal body shapes, including the shapes of private body parts, through clothing. This major privacy issue was prominent among participants' comments. Most participants explicitly spoke about privacy concerns regarding their body structure in connection with the use of thermal cameras in their vicinity. Moreover, concerns about physical privacy being invaded, especially for women, became apparent, and participants responded strongly against the use of thermal imaging cameras by the general population. A few participants said that the use of a thermal camera without consent in public can contravene religious practice and offend religious freedom. Not surprisingly, in some participants' opinions, the misuse of a thermal camera to peek through clothes can be humiliating and embarrassing and can be considered a sexual offense.
Thermal cameras should be restricted to areas like airports to find illegal weapons underneath clothes. But, if someone [in public] can see through outfit, people would feel insecure. It will invade their privacy. (P35)
Furthermore, regarding thermal cameras invading the physical privacy of women:
In this example, it was just a boy, but there can be girls, so more privacy invasion could occur there. (P11)
Also, thermal cameras can contravene religious freedom: I was thinking if it can see through my clothes and as a Hijabi person I don’t want anyone to see my body structure with this camera. (P19)
Referring to the misuse of the thermal camera, UP2 said:
If I took a picture using a thermal camera of another person without their consent, it is not an x-ray device. But you can see, you can tell bodily parts underneath clothing. It can be misused in many ways. (UP2)
These comments confirm overall strong privacy concerns about thermal imaging camera use without consent, as it potentially violates physical privacy.
4.4 Privacy of Physiological Data
Some participants expressed concerns about the use of thermal cameras to measure and interpret body temperature without the consent or knowledge of an individual, as this can reveal vital health data to third parties and violate privacy. Unlike conventional methods, a thermal camera can be used to learn about a person's body temperature and other health conditions in a non-invasive and contact-free way. As body temperature can be used to predict particular health conditions, some participants from the primed and unprimed groups shared these concerns.
If I have a fever I do not want people to know – but this type of camera can tell that. (UP20)
If you want to capture an image of any person, you are getting his/her body temperature for different parts [of the body] – so there is the matter of taking consent here. (UP9)
Participants also pointed out possible misuses of these data:
People can detect if a person is sick without consent and can create a chart of the heat signature of normal and sick people to misuse it. These are threats to my privacy. (UP20)
In medicine, it [thermal imaging] should be used for identifying issues with the body. But for daily life, it would make sense to block the use of thermal cameras as it reveals others' body heat signature. (P22)
Thermal cameras extract some health information, which of course, would then present a privacy threat, potentially even revealing very sensitive information about users. (UP25)
This demonstrates that users generally hold a strongly negative opinion of thermal camera usage to collect physiological data without consent.
5 Influence of Priming
To understand the influence of an in-depth understanding of thermal imaging, we divided participants into two groups (37 primed and 33 unprimed participants) using uniform random selection. As mentioned above, we interviewed all participants, asking them about their familiarity with thermal imaging. We then explained to them how thermal imaging works. After that, the primed group watched the videos demonstrating the use of thermal imaging cameras, including more detail on how information is obtained, thus creating a deeper understanding among these participants. The unprimed group did not watch any videos. We finally assessed the opinions of both groups. In the following, we discuss the effects the video demonstrations had on the understanding and perception of the primed group members in contrast to the unprimed members. Figure 11.2 presents a comparison of the differences.
5.1 Familiarity with Thermal Imaging Technology
Twenty-four participants from the primed group were familiar with thermal cameras before our introduction and before being shown the videos. The majority of the primed group participants mentioned that they had seen thermal
Fig. 11.2 Contrasting the opinions between the primed and the unprimed groups
cameras on TV, thus knowing that they detect heat and are used by the army and firefighters, for night vision, and for medical procedures. The majority of the participants from the unprimed group were also familiar with thermal cameras beforehand. Like the primed group, these participants had seen thermal cameras on TV and mentioned use cases such as obtaining heat signatures of living and non-living things as well as military and industrial usage.
5.2 Explanation of Basic Functionality
We explained the basic functionality and made participants familiar with thermal cameras before asking about their opinions. We used several still images of a thermal camera in action as probes, along with a short description of the thermal imaging procedure. On the one hand, most participants from the primed group mentioned that they considered the thermal camera privacy-invasive or a threat that can be misused by impostors. On the other hand, the majority of participants from the primed group also understood the beneficial usage of thermal cameras in a military context, for security purposes, and for firefighting.
It can be used for military purposes where a soldier can sense the threat ahead, [yet] if it's in the wrong hand it will be spoiled. Let's say, if it is in the hand of a thief, once he is entering or breaking into a house he can find out which person is there and in which place and can tackle them very easily. (P2)
The majority of the participants from the unprimed group seconded that thermal cameras can be misused by impostors to track people, hunt animals, spy, rob, and so on. Some participants also understood the benefits of thermal cameras for industrial usage, for child and pet care (by obtaining their temperature in a non-invasive way), and for military usage.
Thermal cameras can be privacy invasive as they see through camping tents and things like that. On the plus side they have a ton of applications in industry and maintenance. (UP17)
Overall, the understanding of participants from both groups was similar after the introduction and discussion.
5.3 Perception and Opinions About Thermal Camera Use Cases
As mentioned above, the primed group watched the videos of the different thermal camera scenarios before being asked about their opinions (participants from the unprimed group were asked right after the introductory discussion, without being shown the videos). When we asked about their perceptions of an unknown person nearby using a thermal camera, there was a clear difference
between the groups’ opinions. Majority of the participants felt negative about an unknown person using a thermal camera around from the primed group, compared to some participants from the unprimed group. This indicates that the videos closed a considerable gap in understanding between the primed and unprimed groups, which led to differences in their opinions. It can be a bit awkward definitely, if it [the camera] can penetrate through clothes and show temperature. But it does not [show body parts], it just shows some colors. (UP14)
This statement is not particularly accurate and shows a lack of understanding resulting from not having watched the videos: thermal cameras capture radiated infrared light rather than penetrating surfaces and, hence, cannot see through clothes. In the following, we explain the differences in perception in more detail.
5.3.1 Imposing Censorship
Previously, we found that the primed participants were more cautious about the usage of thermal cameras. Similarly, in the context of imposing censorship on thermal cameras, their opinions remained conservative. The majority of the primed participants desired some sort of censorship or restrictions on thermal camera usage in public, compared to only some participants from the unprimed group.
Thermal camera manufacturers should not give full access to end users. There have to be guidelines about the sort of intrusion, or capture of data at the [end-user] level. I feel that only certain approved organizations should capture these data. (P12)
[If] I need to see how well you can see [someone's] body [using a thermal camera], but if it's not super graphic, you should not censor it. I feel 'censored' is a strong word, and it depends on the person. (UP12)
Although we cannot clearly say that there is a difference in opinions about censorship between the primed and unprimed groups, we can certainly see a more relaxed attitude regarding censorship among the unprimed participants.
5.3.2 Interest in Giving Consent to Thermal Camera Use Around Them
The majority of the participants in both groups reported that they would like to be informed about the use of thermal cameras around them. We assume that the current trend of mobile applications having to request permissions to preserve users' privacy is very present in users' minds. Thus, most participants want to be informed if there is a thermal camera in use around them.
I would like to know where they are positioned, purpose of use in space. Do I have access to the data or is it possible to have access to the data? I think it's more so from the state of disknowing. Also, it's like the idea of big brothers. Your behavior is different when you know you're being watched versus when you know you're not. (UP19)
It’s an issue of like you’re getting documentation of me. That’s why I would want information informally or just having a small sign on the wall being like there is a thermal imaging camera. Like, there’s security cameras all over the place and when you go into buildings, there is signs of it. (UP31)
5.3.3 Interest in Asking for Consent Before Using a Thermal Camera
When we asked participants about their willingness to ask for consent from people before using a thermal camera, the majority of the primed group showed interest in obtaining consent. In contrast, unprimed participants showed a comparatively relaxed attitude.
I think at least people [around me] should be aware or informed that I'm using a thermal camera, so that there is a choice for the other person to say 'yes' or 'no'. Without consent, reaction of the other person depends upon the personality because some people are really sensitive about this kind of issues, and some are not. So, whoever is using it should have the consent of others. (P19)
The unprimed group, in contrast, displayed a more relaxed attitude towards obtaining consent before using a thermal camera.
It's only temperature, we are not producing any output we are just checking the general events. I will not take consent because I think there is no harm of using it. (UP22)
5.3.4 Willingness to Delete Recordings of Others Upon Request
We asked participants whether, if they captured a thermal image of a random person and that person asked them to delete the image, they would agree to do so. Some participants from both the primed and unprimed groups agreed to delete the picture upon request; the attitude of both groups was similar.
I will delete it without asking. But if the picture is worth or valuable to me, I will ask for consent, but if they want [it] to be deleted, I will [do that], because I respect my privacy and others', too. (P3)
If I were in that person's shoes, I would want him to delete that picture. So I will also do the same because I think it is private and it is his / her right to choose what to do with the picture. (P21)
No, I will not delete. Probably I will never see them again, there are billions of people. (UP10)
5.3.5 Interest in Sharing Their Thermal Image or Data on Social Media
Regarding sharing thermal images or data on social media, while some primed participants explicitly refused to share, the majority of unprimed participants had no issues with sharing.
It exposes more than I want. It shows something that I want to hide. I will not share it. (P3) I don’t want my personal things to be out for anyone. (P24) Yes, I don’t mind sharing my data, but if [it is] someone else’s data, I will ask them. (UP16)
It’s fine if I share my thermal image and readings with anyone. It doesn’t show anything. (UP30)
5.3.6 Self-Expected Use of Thermal Cameras
The majority of the participants in the primed group perceived thermal cameras as useful in various ways, for example, for finding lost objects or pets, as studying tools, for architecture, in emergencies, during camping, and so on. At the same time, only some of the unprimed participants mentioned useful cases, such as night vision and house maintenance. We also found that gaps in understanding led to less sensible decision-making by unprimed users. UP11 and UP12 said that they would use the camera to look at friends' bodies to find out their body heat signatures. UP11 and UP7 said that they would use the thermal camera to observe people in public places. Also, UP31 said:
I would use it for fun, to see friends reaction or state reflects changes in temperature around their bodies. Some people blush. Their faces feel different, or they get cold or hot or so on. (UP31)
5.3.7 Understanding Update After the Videos (Primed)
When we asked participants about their points of view on thermal cameras after they had watched the videos, we generally found a negative perception of thermal imaging cameras in the hands of end-users. Yet, participants were very positive towards thermal camera use by security officials and other professionals.
Earlier I felt thermal camera have only positive use cases. Now I think it is rather a threat in the hand of a common man. (P27)
A few participants also said thermal cameras should not be consumer-grade or accessible to common people. I like technology but I don’t like it to be consumer grade at all. (P13)
Table 11.1 Rules, regulations, and censorship measures suggested by participants, with the number of participants (n = 70) making each suggestion
Suggestion                                                    Frequency
Limited feature access to consumer-grade cameras              25
No filming without explicit consent                           32
Regulatory body to monitor usage of thermal camera and data   23
Privacy laws for thermal data                                 34
Censoring or blurring thermogram of private body parts        48
Signage or warning of thermal camera's presence               36
Registration of thermal camera by buyers                      37
5.4 End-Users' Suggestions of Rules, Regulations, and Censorship for Thermal Camera Usage in Public
We asked the participants for their suggestions regarding censorship and rules for thermal camera usage by end-users in public. Table 11.1 lists all the suggestions we gathered.
6 Discussion
We explored users' perceptions of thermal cameras, motivated by the fact that thermal imaging is becoming increasingly cheap and is likely to end up as an everyday accessory in users' hands. Considering this, we investigated end-users' thoughts and opinions about the risks and opportunities of thermal imaging and their expected behavior as potential thermal camera users. We conducted an interview study with 70 participants in total. We divided the participants into two groups to understand the effects of users having a detailed understanding of the technology. Our results show that people are greatly concerned about thermal imaging in general. Yet, they recognize the importance of thermal imaging in industrial, health, surveillance, and security contexts and value its potential for personal use. We also found an effect of understanding on users' perceptions of this new technology. In the following, we discuss the implications by reflecting on the results.
6.1 Privacy Implications
People mostly dislike the extended capabilities of thermal imaging cameras. They consider them highly privacy-invasive, as these cameras see the invisible [9] and are even capable of tracking people inside their homes without physical intrusion. Thermal cameras are capable of detecting objects or humans in the
visual field of view, including those imperceptible to the naked eye. This raises the question of informed consent. Therefore, making such devices commercially available to end-users requires caution, and appropriate regulations need to be developed. Another popular perception regarding privacy was that thermal cameras could be used to track behavioral patterns and, hence, can be privacy-invasive if used publicly without regulation. Similar to other existing technologies that allow for identifying behavioral patterns (cf. research on behavioral biometrics), regulations need to be put in place for obtaining such data. For example, the GDPR classifies such (biometric) data as particularly worthy of protection and requires user consent when it is collected and processed. Moreover, almost all participants discussed the need for consent while a thermal camera is being used around them. Most of them agreed to ask for consent from others when they own a camera and use it publicly. Designers and developers of thermal camera applications should therefore make applications capable of detecting when humans are in the visual field of view, so as to ask for consent first if this is the case, or at least inform the affected users. This is in line with current technologies where users are informed when they are in the view of tracking or recording systems. Also, the need for both explicit and implicit censorship was prominent among users' opinions. Participants pointed out that if the cameras are available to end-users, they should have reduced capabilities, such as blurring out the temperature of private body parts. Restrictions should also be imposed on the devices so that users cannot capture others' private data, such as quantifiable data about the human body and mind, without their consent. Thus, thermal imaging applications need to find a balance between which information to reveal and which to withhold. For instance, prior work suggested providing usable access control mechanisms for our devices [54]. Additionally, thermal camera applications should only show information derived from thermal imaging data that is needed for their primary purpose and should not allow other conclusions to be drawn. For example, if a thermal camera application is used to detect emotion based on changes in facial temperature [28], only the inferred emotion should be displayed, without showing the facial information (a minimal sketch of such a filter is given below). All these concerns were raised in particular by our more informed (primed) participants. We conclude that the current trend of standard privacy practices and the research on managing privacy boundaries in modern applications make people more aware of protecting their privacy and inherently more self-conscious.
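To illustrate the data-minimization idea discussed above, the following sketch blurs a sensitive region of a thermal frame and exposes only an inferred label instead of the raw thermogram. It is a minimal illustration under stated assumptions: the region coordinates, the pixelation-based blur, and the classify_emotion stub (a simple temperature threshold) are hypothetical placeholders; a real application would rely on proper person detection and validated inference models.

```python
import numpy as np

def blur_region(frame: np.ndarray, roi: tuple, block: int = 8) -> np.ndarray:
    """Return a copy of the frame with the given region pixelated (privacy filter)."""
    top, bottom, left, right = roi
    out = frame.copy()
    region = out[top:bottom, left:right]
    for y in range(0, region.shape[0], block):
        for x in range(0, region.shape[1], block):
            # Replace each block with its mean value, destroying fine detail.
            region[y:y + block, x:x + block] = region[y:y + block, x:x + block].mean()
    return out

def classify_emotion(face_roi: np.ndarray) -> str:
    """Placeholder inference: a real system would use a validated model."""
    return "neutral" if face_roi.mean() < 34.5 else "aroused"

def privacy_preserving_output(frame: np.ndarray, face_roi: tuple) -> dict:
    top, bottom, left, right = face_roi
    label = classify_emotion(frame[top:bottom, left:right])
    # Only the inferred label and a blurred preview leave the application;
    # the raw facial thermogram is never exposed.
    return {"emotion": label, "preview": blur_region(frame, face_roi)}

# Hypothetical 240x320 thermal frame with temperatures around 34 degrees Celsius.
frame = np.random.normal(34.0, 0.3, size=(240, 320))
result = privacy_preserving_output(frame, (60, 140, 120, 200))
print(result["emotion"])
```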
6.2 Privacy vs. Utility
As discussed above, people are reasonably concerned about their privacy. Nevertheless, if a system provides a sufficiently large benefit, users consider the privacy-utility trade-off (cf. the privacy calculus [22]). For instance, users were generally against thermal cameras being used on them without their consent. Yet, if they were lost or in urgent need, they thought a thermal camera could be a good way of helping them and did not argue about privacy. Besides, we learned that users are highly concerned about consumers obtaining thermal imaging tools easily
due to their low price and availability. Nevertheless, they recognized that widely available cameras could be useful in supporting children, older adults, pets, and wild animals with regard to health, emergencies, and rescue measures. Moreover, although there was a strong concern about being tracked inside the home, some people wanted to install such devices as home security devices. Some of them even mentioned that owning one could be comparable to keeping a legal gun at home for self-protection, subject to proper regulations. Users also acknowledged the use of thermal cameras in the health sector. Most users recognized that the use of thermal imaging by authorized personnel could be allowed. However, there was some resistance against the unaccountable use of thermal cameras in security and surveillance: authorized personnel should also have a clear purpose for using these cameras. This indicates an apparent tension between protecting personal space and recognizing need and purpose. For instance, thermal cameras should be used responsibly by authorized personnel, just as they use other tools, e.g., voice recorders or tracking devices, and should not be misused for non-professional purposes.
6.3 Effects of Understanding
We found differences between the primed and the unprimed groups' opinions regarding privacy concerns, censorship, and consent. Figure 11.2 shows that the primed users, in general, were more aware of thermal imaging use and more concerned about privacy. Overall, primed users talked about privacy from many more angles than the unprimed users. They were also aware of more thermal imaging use cases and of thermal cameras' risks and benefits. The reason behind these contrasts was the informative video scenarios we showed to the primed users to increase their understanding of thermal camera usage. Since thermal imaging is not yet widespread among consumers, many of our questions were hypothetical; we tried to mitigate this challenge through priming. We believe that our priming method was successful, as a gap in understanding between the two groups was observed and even prompted some unusual decisions from the unprimed users. Furthermore, the lack of a proper understanding led to concerning comments from several participants. For example, some unprimed participants said they would use the camera to look at their friends' bodies to find their body heat signatures. Others said they would use the thermal camera to observe people in public places for fun. Therefore, it is evident that there was a substantial effect of understanding on people's privacy decision-making. This suggests that there is a need to better educate people before introducing this technology to the mass market. The gap in understanding also became apparent while discussing use cases with the unprimed participants, who had misconceptions about thermal camera usage. For instance, unprimed participants stated that a thermal camera could reveal who touched their objects. However, the camera only shows that objects have been touched, not necessarily by whom.
7 Limitations and Future Work
In this work, we explored users' perception of thermal cameras by collecting qualitative feedback. A general challenge in such studies is the sample size. Participants are usually recruited until data saturation is reached [24]. We acknowledge that this does not allow for strong quantitative comparisons and generalizations; we followed recommendations from prior work. Furthermore, we tried to ensure that participants were demographically diverse. However, we acknowledge that many participants were from a university population, and future research including other samples might yield additional insights. Still, we believe our participants are among the early adopters and potential main users of thermal cameras. Thermal imaging has numerous applications, of which we showed a few use cases to participants. Showing more scenarios in future studies might lead to additional insights. As thermal imaging is still not widely used by consumers, most of the questions were hypothetical. This study is an attempt to identify important starting points for future investigations. Future work might employ other methods, e.g., surveys, to also collect more quantitative insights. Also, it would be interesting to see whether different scenarios or media for priming on privacy concerns could lead to additional or different insights. Future work on ways to inform users about how thermal cameras function is essential. Future work could also investigate appropriate approaches to censorship and eventually develop a privacy framework for consumer-grade thermal cameras in terms of privacy and censorship.
8 Conclusion
Thermal cameras are likely to be integrated with personal devices in the future. This study investigates users' privacy perceptions of thermal imaging. We contribute timely insights by investigating users' prior knowledge, understanding, opinions, and concerns. We compared the perceptions of users with and without in-depth knowledge about thermal imaging by using video showcases. These were inspired by current use cases from the literature and pointed out opportunities and potential threats. We found that perceptions were influenced by the prior knowledge of participants. This suggests that people, in general, should be made aware of the strengths and weaknesses of this technology (in particular, from a privacy perspective) before it becomes widely available and integrated with consumer devices. Researchers might look into methods of creating this awareness in the future. In this way, the implications of this technology will become more apparent.
Acknowledgments This research was supported by dtec.bw—Digitalization and Technology Research Center of the Bundeswehr [Voice of Wisdom].
References 1. Free qualitative data analysis software | QDA miner lite, https://provalisresearch.com/products/ qualitative-data-analysis-software/freeware/, accessed: 2021-10-12 06:41:14 2. Kyllo v. united states (2001), https://www.law.cornell.edu/supct/html/99-8508.ZO.html. Accessed on 3. Thermal Imaging Surveillance (Jan 2015), https://theyarewatching.org/technology/thermalimaging-surveillance. Accessed on 4. Top Uses and Applications of Thermal Imaging Cameras—Grainger Industrial Supply (2015), https://www.grainger.com/content/qt-thermal-imaging-applications-uses-features345. Accessed on 02/21/2020 5. Coronavirus outbreak: Safety measures at major airports and airlines (Feb 2020), https:// www.airport-technology.com/features/coronavirus-measures-world-airports/. Accessed on 02/21/2020 6. Abdelrahman, Y., Khamis, M., Schneegass, S., Alt, F.: Stay cool! understanding thermal attacks on mobile-based user authentication. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. pp. 3751–3763. ACM (2017) 7. Abdelrahman, Y., Knierim, P., Wozniak, P., Henze, N., Schmidt, A.: See through the fire: Evaluating the augmentation of visual perception of firefighters using depth and thermal cameras. In: Workshop on Ubiquitous Technologies for Augmenting the Human Mind (2017) 8. Abdelrahman, Y., Knierim, P., Wozniak, P.W., Henze, N., Schmidt, A.: See through the fire: evaluating the augmentation of visual perception of firefighters using depth and thermal cameras. In: Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers. pp. 693–696 (2017) 9. Abdelrahman, Y., Schmidt, A.: Beyond the visible: Sensing with thermal imaging. Interactions 26(1), 76–78 (Dec 2018). https://doi.org/10.1145/3297778, https://doi.org/10.1145/3297778 10. Abdelrahman, Y., Velloso, E., Dingler, T., Schmidt, A., Vetere, F.: Cognitive heat: exploring the usage of thermal imaging to unobtrusively estimate cognitive load. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1(3), 1–20 (2017) 11. Abdelrahman, Y., Woundefinedniak, P.W., Knierim, P., Weber, D., Pfeuffer, K., Henze, N., Schmidt, A., Alt, F.: Exploring the domestication of thermal imaging. In: Proceedings of the 18th International Conference on Mobile and Ubiquitous Multimedia. MUM ’19, Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3365610. 3365648, https://doi.org/10.1145/3365610.3365648 12. Abdelrahman, Y., Wo´zniak, P.W., Knierim, P., Weber, D., Pfeuffer, K., Henze, N., Schmidt, A., Alt, F.: Exploring the domestication of thermal imaging. In: Proceedings of the 18th International Conference on Mobile and Ubiquitous Multimedia. pp. 1–7 (2019) 13. Adams, D., Bah, A., Barwulor, C., Musaby, N., Pitkin, K., Redmiles, E.M.: Ethics emerging: the story of privacy and security perceptions in virtual reality. In: Fourteenth Symposium on Usable Privacy and Security ({SOUPS} 2018). pp. 427–442 (2018) 14. Anton, A.I., Earp, J.B., Young, J.D.: How internet users’ privacy concerns have evolved since 2002. IEEE Security Privacy 8(1), 21–27 (2010). https://doi.org/10.1109/MSP.2010.38 15. Bloom, C., Tan, J., Ramjohn, J., Bauer, L.: Self-driving cars and data collection: Privacy perceptions of networked autonomous vehicles. In: Proceedings of the Thirteenth USENIX Conference on Usable Privacy and Security. p. 357–375. SOUPS ’17, USA (2017) 16. 
Buschek, D., Bisinger, B., Alt, F.: Researchime: A mobile keyboard application for studying free typing behaviour in the wild. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. CHI ’18, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3173574.3173829, https://doi.org/10.1145/3173574. 3173829 17. Carminati, B., Colombo, P., Ferrari, E., Sagirlar, G.: Enhancing user control on personal data usage in internet of things ecosystems. In: 2016 IEEE International Conference on Services Computing (SCC). pp. 291–298. IEEE (2016)
18. Castañeda, J.A., Montoro, F.J.: The effect of internet general privacy concern on customer behavior. Electronic Commerce Research 7(2), 117–141 (2007) 19. Cho, K.W., Lin, F., Song, C., Xu, X., Gu, F., Xu, W.: Thermal handprint analysis for forensic identification using heat-earth mover’s distance. In: 2016 IEEE International Conference on Identity, Security and Behavior Analysis (ISBA). pp. 1–8. IEEE (2016) 20. Cross, C.B., Skipper, J.A., Petkie, D.: Thermal imaging to detect physiological indicators of stress in humans. In: SPIE Defense, Security, and Sensing. pp. 87050I–87050I. International Society for Optics and Photonics (2013) 21. Culnan, M.J.: Consumer awareness of name removal procedures: Implications for direct marketing. Journal of direct marketing 9(2), 10–19 (1995) 22. Dinev, T., Hart, P.: An extended privacy calculus model for e-commerce transactions. Information systems research 17(1), 61–80 (2006) 23. Emami-Naeini, P., Bhagavatula, S., Habib, H., Degeling, M., Bauer, L., Cranor, L.F., Sadeh, N.: Privacy expectations and preferences in an IoT world. In: Proceedings of the Thirteenth USENIX Conference on Usable Privacy and Security. p. 399–412. SOUPS ’17, USENIX Association, USA (2017) 24. Francis, J.J., Johnston, M., Robertson, C., Glidewell, L., Entwistle, V., Eccles, M.P., Grimshaw, J.M.: What is an adequate sample size? operationalising data saturation for theory-based interview studies. Psychology & Health 25(10), 1229–1245 (2010). https://doi.org/10.1080/ 08870440903194015, https://doi.org/10.1080/08870440903194015, pMID: 20204937 25. Friedewald, M., Finn, R., Wright, D.: Seven Types of Privacy, pp. 3–32 (01 2013). https://doi. org/10.1007/978-9426. Gade, R., Moeslund, T.B.: Thermal cameras and applications: a survey. Machine vision and applications 25(1), 245–262 (2014) 27. Hassib, M., Khamis, M., Schneegass, S., Shirazi, A.S., Alt, F.: Investigating user needs for biosensing and affective wearables. In: Proceedings of the 2016 Chi conference extended abstracts on human factors in computing systems. pp. 1415–1422 (2016) 28. Ioannou, S., Gallese, V., Merla, A.: Thermal infrared imaging in psychophysiology: potentialities and limits. Psychophysiology 51(10), 951–963 (2014) 29. Jacobsson, A., Davidsson, P.: Towards a model of privacy and security for smart homes. In: 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT). pp. 727–732. IEEE (2015) 30. Koelle, M., Rose, E., Boll, S.: Ubiquitous intelligent cameras–between legal nightmare and social empowerment. IEEE MultiMedia 26(2), 76–86 (April 2019). https://doi.org/10.1109/ MMUL.2019.2902922 31. Koelle, M., Kranz, M., Möller, A.: Don’t look at me that way! understanding user attitudes towards data glasses usage. In: Proceedings of the 17th international conference on humancomputer interaction with mobile devices and services. pp. 362–372 (2015) 32. Koops, B.J., Newell, B.C., Timan, T., Skorvanek, I., Chokrevski, T., Galic, M.: A typology of privacy. U. Pa. J. Int’l L. 38, 483 (2016) 33. Larson, E., Cohn, G., Gupta, S., Ren, X., Harrison, B., Fox, D., Patel, S.: Heatwave: Thermal imaging for surface user interaction. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp. 2565–2574. CHI ’11, ACM, New York, NY, USA (2011). https://doi.org/10.1145/1978942.1979317, http://doi.acm.org/10.1145/1978942.1979317 34. Li, Y., Kobsa, A., Knijnenburg, B.P., Nguyen, M.C.: Cross-cultural privacy prediction. Proceedings on Privacy Enhancing Technologies 2017(2), 113–132 (2017) 35. 
McReynolds, E., Hubbard, S., Lau, T., Saraf, A., Cakmak, M., Roesner, F.: Toys that listen: A study of parents, children, and internet-connected toys. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. pp. 5197–5207 (2017)
36. Mekovec, R., Vrček, N.: Factors that influence internet users' privacy perception. In: Proceedings of the ITI 2011, 33rd International Conference on Information Technology Interfaces. pp. 227–232. IEEE (2011)
37. Mekovec, R., Vrček, N.: Factors that influence internet users' privacy perception. In: Proceedings of the ITI 2011, 33rd International Conference on Information Technology Interfaces. pp. 227–232 (2011)
38. Park, S., Kim, J., Mizouni, R., Lee, U.: Motives and concerns of dashcam video sharing. In: Proceedings of the 2016 CHI. pp. 4758–4769 (2016) 39. Sahami Shirazi, A., Abdelrahman, Y., Henze, N., Schneegass, S., Khalilbeigi, M., Schmidt, A.: Exploiting thermal reflection for interactive systems. In: Proceedings of the 32Nd Annual ACM Conference on Human Factors in Computing Systems. pp. 3483–3492. CHI ’14, ACM, New York, NY, USA (2014). https://doi.org/10.1145/2556288.2557208, http://doi.acm.org/10. 1145/2556288.2557208 40. Saleh, M., Khamis, M., Sturm, C.: What about my privacy, Habibi? understanding privacy concerns and perceptions of users from different socioeconomic groups in the Arab world (April 2019), http://eprints.gla.ac.uk/186430/ 41. Udoh, E.S., Alkharashi, A.: Privacy risk awareness and the behavior of smartwatch users: A case study of Indiana university students. In: 2016 Future Technologies Conference (FTC). pp. 926–931. IEEE (2016) 42. Wadhwani, P., Gankar, S.: Thermal imaging market report 2024—global industry share forecast (Jun 2018), https://www.gminsights.com/industry-analysis/thermal-imaging-market. Accessed on 02/21/2020 43. Wang, H., Lee, M.K., Wang, C.: Consumer privacy concerns about internet marketing. Communications of the ACM 41(3), 63–70 (1998) 44. Wang, P., Petrison, L.A.: Direct marketing activities and personal privacy: A consumer survey. Journal of Direct Marketing 7(1), 7–19 (1993) 45. Widen, W.H.: Smart cameras and the right to privacy. Proceedings of the IEEE 96(10), 1688– 1697 (2008) 46. Wilkowska, W., Ziefle, M.: Perception of privacy and security for acceptance of e-health technologies: Exploratory analysis for diverse user groups. In: 2011 5th International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth) and Workshops. pp. 593–600 (2011). https://doi.org/10.4108/icst.pervasivehealth.2011.246027 47. Wilson, C., Hargreaves, T., Hauxwell-Baldwin, R.: Smart homes and their users: a systematic analysis and key challenges. Personal and Ubiquitous Computing 19(2), 463–476 (2015) 48. Wisniewski, P.J., Knijnenburg, B.P., Lipford, H.R.: Making privacy personal: Profiling social network users to inform privacy education and nudging. International Journal of HumanComputer Studies 98, 95–108 (2017). https://doi.org/https://doi.org/10.1016/j.ijhcs.2016.09. 006, https://www.sciencedirect.com/science/article/pii/S1071581916301185 49. Wolf, K., Schmidt, A., Bexheti, A., Langheinrich, M.: Lifelogging: You’re wearing a camera? IEEE Pervasive Computing 13(3), 8–12 (2014) 50. Yao, Y., Xia, H., Huang, Y., Wang, Y.: Free to fly in public spaces: Drone controllers’ privacy perceptions and practices. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. pp. 6789–6793 (2017) 51. Youn, S.: Teenagers’ perceptions of online privacy and coping behaviors: a risk–benefit appraisal approach. Journal of Broadcasting & Electronic Media 49(1), 86–110 (2005) 52. Youn, S., Hall, K.: Gender and online privacy among teens: Risk perception, privacy concerns, and protection behaviors. Cyberpsychology & behavior 11(6), 763–765 (2008) 53. Zeng, E., Mare, S., Roesner, F.: End user security & privacy concerns with smart homes. In: Proceedings of the Thirteenth USENIX Conference on Usable Privacy and Security. p. 65–80. SOUPS ’17, USENIX Association, USA (2017) 54. Zeng, E., Roesner, F.: Understanding and improving security and privacy in multi-user smart homes: A design exploration and in-home user study. 
In: 28th USENIX Security Symposium (USENIX Security 19). pp. 159–176. USENIX Association, Santa Clara, CA (Aug 2019), https://www.usenix.org/conference/usenixsecurity19/presentation/zeng 55. Zheng, S., Apthorpe, N., Chetty, M., Feamster, N.: User perceptions of smart home IoT privacy. Proc. ACM Hum.-Comput. Interact. 2(CSCW) (Nov 2018). https://doi.org/10.1145/3274469, https://doi.org/10.1145/3274469
Chapter 12
Why Is My IP Address Processed? No Data For Accountless Users
Supriya Adhatarao, Cédric Lauradoux, and Cristiana Santos
Abstract IP addresses are important identifiers used to route information all over the Internet. They are processed by all websites, but do we know exactly why IP addresses are collected and processed? To answer this question, we have analyzed the privacy policies of 109 websites. Most of these websites acknowledge in their privacy policy that (i) IP addresses are personal data, and (ii) they collect and process IP addresses. However, the reasons why IP addresses are processed are unclear even in their privacy policies. To clarify why IP addresses are processed, we have submitted IP-based subject access requests to these 109 websites. Taking the role of an accountless user, we asked these websites to provide us with all the data associated with the IP addresses used in our study. All our requests were denied, with different explanations. We were able to spot inconsistencies between the replies to our requests and the websites' privacy policies. Some replies state that IP addresses can identify multiple users and that, therefore, answering our requests affirmatively would create a data breach. It is tempting to conclude that, even with the GDPR in force, there are still transparency challenges regarding the processing of IP addresses.
Keywords Personal data · IP address · Privacy policy · Subject access request · Measurement · GDPR
S. Adhatarao · C. Lauradoux
Univ. Grenoble Alpes, Inria, Saint-Martin-d'Hères, France
e-mail: [email protected]; [email protected]
C. Santos
Utrecht University, Utrecht, Netherlands
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
S. Schiffner et al. (eds.), Privacy Symposium 2022, https://doi.org/10.1007/978-3-031-09901-4_12
1 Introduction
An IP address is an online identifier [4, 25] used for identifying devices online and to route information between websites and their end-user devices. The processing of
IP addresses is critical for websites: it is used to adapt a website's content to the browser of a user, and it is also used for security reasons, e.g., to detect incidents such as denial-of-service attacks. Furthermore, it can also be used to track Internet users, as demonstrated in [33]. User tracking based on IP addresses can be as effective as other established forms of tracking based on cookies [21, 27] or browser fingerprinting [29]. Almost every website collects and processes the IP addresses of its users, but how can one know whether this processing of personal data is lawful or not? In Europe, IP addresses are considered personal data according to the General Data Protection Regulation (Sect. 2), and therefore we can make use of the transparency mechanisms the GDPR provides to understand why IP addresses are processed: the right to be informed (Articles 13 and 14) and the right to access information (Article 15). We have used these two mechanisms, as explained in Sect. 3. In the quest to understand why IP addresses are processed, we first analyzed the privacy policies of 109 websites (Sect. 4). Our analysis includes the websites of 62 private companies and 47 public organizations. 32 websites do not mention IP addresses in their privacy policies; it is unclear whether they consider that they do not process IP addresses or whether they simply do not mention them. 72 websites acknowledge that IP addresses are personal data and mention them in their privacy policies. However, even in this case, it was rather difficult to understand why websites process IP addresses. Since analyzing the privacy policies was rather inconclusive, we asked the websites to provide the data associated with our IP addresses through subject access requests (Sect. 5). Therein, we took the role of an accountless user who has visited a website and wants to access his/her processed data. All our requests were denied; the websites used different reasons to justify their decisions. We found that some answers were not consistent with the website's privacy policy: for instance, some websites answered that they do not process IP addresses while their privacy policy states that they do. Other websites challenged the lawfulness of our requests, stating that an IP address can identify several users and, hence, that they cannot answer our request positively. While this is a legitimate answer, because Internet users are so far unable to prove that they have used a given IP address during a certain period of time, this argument reduces the GDPR transparency mechanisms available to data subjects to understand how IP addresses are processed and to detect IP-address-based tracking.
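To illustrate how readily IP addresses are available to websites, the following minimal sketch, using the Flask web framework, shows a server reading the client's IP address on every request and writing it to its log. The handler and the logged fields are illustrative assumptions, not taken from any of the 109 websites we studied.

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def index():
    # The client IP address is available to the server on every request,
    # even for visitors without an account.
    client_ip = request.remote_addr
    # Typical processing: written to access logs, used for security checks,
    # content adaptation, or analytics.
    app.logger.info("Visit from %s", client_ip)
    return "Hello"

if __name__ == "__main__":
    app.run()
```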
2 IP Addresses Are Personal Data
This section has two goals. Firstly, we wanted to analyze whether and when IP addresses are personal data. If they are considered personal data, websites are consequently obliged to mention their collection and processing in their privacy policies.
Secondly, we wanted to determine whether IP addresses alone can be used to submit subject access requests. We first resort to the definitional elements of the GDPR and then analyze court decisions that convey complementary reasoning. Personal data is defined as "any information relating to an identified or identifiable natural person" (Article 4(1) GDPR). "Identified" means that a person, within a group of persons, is "distinguished" from all other members of the group [17]. An "identifiable" person is one who, although not yet identified, may be identified in the future, either directly or indirectly. Such identification can happen in two ways: (i) by reference to identifiers, such as a name, an identification number, location data, an online identifier, or one or more factors specific to the physical, physiological, genetic, mental, economic, cultural, or social identity of that natural person (Article 4(1) GDPR [22]); and (ii) by "all the means likely reasonably to be used either by the controller or by any other person to identify the said person" (Recital 26 GDPR). Online identifiers, such as an IP address, are included in the definition of personal data (Article 4(1)). Moreover, Recital 30 of the GDPR asserts that a person can be associated with online identifiers provided by their devices, and names IP addresses explicitly. Nevertheless, it does not suffice to say that having an IP address identifies a person. Does this definition of personal data apply to IP addresses? The answer to this question is very important and has been subject to debate for years by both computer scientists and lawyers. Herein, we explain the current state of the debate. Let us consider a case study wherein Alice has subscribed to an Internet Service Provider (ISP) called Bob. Is Alice's IP address personal data for Bob? There is a consensus [5, 14] acknowledging that Alice's IP address is personal data for Bob. Indeed, Bob stores the identification data of all his subscribers (including Alice's) and assigns them their IP addresses. Therefore, Bob can identify Alice from her IP address. Now, Alice visits Eve's website. Is Alice's IP address personal data for Eve? The answer to this question divides the data protection scholarship [37, 40]. This community diverges in its understanding of "dynamic" or "temporary" IP addresses as personal data1 [3, 20, 23, 24, 28, 32, 38, 40]. In 2011, an official report of the Publications Office of the EU [8] studied case law regarding the circumstances in which IP addresses are considered personal data; it showed that of 49 decisions regarding the status of IP addresses in 13 EU Member States, 41 ruled (either explicitly or implicitly) that IP addresses should be considered personal data and 8 ruled against this interpretation. Currently, three stances prevail, as summarized in Table 12.1, which presents a non-exhaustive list of the legal positions from courts and stakeholders (EDPS and 29WP) on the legal status of IP addresses.
1 Such divergence does not arise in the case of "static" or "fixed" IP addresses, which are "invariable and allow continuous identification of the device connected to the network" [15] (para. 36).
Table 12.1 Summary of the legal positions concerning the status of IP addresses as personal data

| Decisions             | IP is personal data alone | IP is personal data with added info. | Added info obtained by legal means |
|-----------------------|---------------------------|--------------------------------------|------------------------------------|
| EDPS                  | •                         | –                                    | –                                  |
| Berlin district court | •                         | –                                    | –                                  |
| 29 WP                 | •                         | –                                    | –                                  |
| CJEU (Breyer)         | –                         | •                                    | •                                  |
| ECHR                  | •                         | –                                    | –                                  |
| Munich court          | –                         | –                                    | •                                  |
| Paris appeal court    | –                         | –                                    | •                                  |
1. IP addresses can per se identify a person;
2. IP addresses do not suffice alone, and additional information is needed to enable the identification of a person;
3. IP addresses only constitute personal data if such additional data is obtained by lawful means.
2.1 IP Addresses Alone
In a case decided at the European Court of Human Rights (ECHR) [16], the decision of the court shows that it regarded a dynamic IP address as information on the basis of which the offender at stake could be identified. In Germany, both the Berlin district court and an appellate court decided that IP addresses are personal data; they added that the "determinability" of a person should account for both legal and illegal means of obtaining additional data [8]. The Article 29 Working Party [1] declared that IP addresses should be treated as personal data by both ISPs and search engines (even if they are not always personal data), adding that unless an ISP or a search engine is in a position to distinguish with absolute certainty that the data correspond to users who cannot be identified, it will have to treat all IP information as personal data, to be on the safe side. The European Data Protection Supervisor (EDPS) noted that, for IP addresses to count as personal data, there is no requirement that the data controller knows the surname, first name, birth date, or address (among others) of the individual whose activity it was monitoring. It further stated that if an IP address shows a particular behavior, in terms of the transactions one can follow, "then in a reasonable world, that is an individual" [26].
2.2 IP Address with Additional Information
The Court of Justice of the European Union (CJEU) determined in the Breyer case [15] that dynamic IP addresses (temporarily assigned to a device) are not, per se, information relating to an "identified" person, because "such an address does not directly reveal the identity of the person who owns the computer from which a website was accessed, or that of another person who might use that same computer." IP addresses can nevertheless constitute personal data, provided that the relevant person's identity can be deduced from a combination of the IP address and additional identifying data. This additional information [15, 18] can consist of, e.g., a name, login details, an email address, a username (if different from the email address), a subscription to a newsletter, or other account data provided in the course of logging in and using the website; cookies containing a unique identifier [19]; device fingerprinting; or similar unique identifiers. By holding such additional data, the website can tie it to the visitor's IP address, and this visitor would therefore be identifiable [15]. This argument explains why everyone would agree that Alice's IP address is personal data for Bob (as ISP), because he also knows Alice's name. Likewise, if Eve has access to additional (uniquely) identifying information, then Alice's IP address is personal data for Eve.
2.3 IP Address with Lawfully Obtained Additional Information
Pursuant to this view, an IP address will only be personal data when a website has legal means to lawfully obtain access to sufficient additional data held by a third party in order to identify a person. Conversely, IP addresses will not constitute personal data when such additional data can only be obtained in a way prohibited by law, because an ISP has to meet its own legal obligations before it simply hands over the data. As ISPs are generally prohibited from disclosing information about a customer to a third party, the only means by which an ISP can be made to disclose IP address data are the customer's consent, a court order, or a request by law enforcement agencies or national security authorities [18]. The Paris appeal court, in two rulings, stated that IP addresses do not constitute personal data unless a law enforcement authority obtains the user's identity from an ISP [6, 7]. The Munich district court, in 2008, held that dynamic IP addresses lack the necessary quality of "determinability" to be personal data, meaning that they cannot easily be used to determine a person's identity without significant effort and by using "normally available knowledge and tools." The court recalled that ISPs are not legally permitted to hand over the information identifying an individual without a proper legal basis (only when ordered by a court) [34]. The CJEU in the Breyer case [15] concluded that a dynamic IP address constitutes personal data if the website operator has "legal means" for obtaining access to additional information held by the ISP that enables the website publisher to identify
that visitor, and there is another party (such as an ISP or a competent authority) that can link the dynamic IP address to the identity of an individual. Legal means could consist, for example, in bringing criminal proceedings in the event of denial-of-service attacks in order to obtain identifying information from the ISP.
2.4 Summary
IP addresses are personal data. Therefore, if a website is collecting and processing IP addresses, it needs to inform data subjects about it in its privacy policy. Through the privacy policies [2] of websites, users should be able to transparently access information on the types of data collected, such as IP addresses, and the purposes of the collection (Articles 13 and 14 GDPR). The GDPR applies to IP addresses, but it is unclear whether we can request from a website all the data associated with a given IP address. Court decisions (both at national and CJEU level) and stakeholder positions so far diverge on whether additional information is needed or not. This debate explains some of the answers we obtained later in our study. It also explains why there are some limitations on the application of the GDPR to the collection and processing of IP addresses.
3 Methodology
Our research focuses on understanding why IP addresses are collected and processed. We have visited two groups of websites: 62 websites maintained by private companies, and 47 websites maintained by public organizations, most of which are national Data Protection Authorities (DPAs) in Europe, as shown in Tables 12.4 and 12.5. Our study includes:
1. Privacy policies. We analyzed the privacy policies of these websites to check whether they mention the processing of IP addresses and why the processing occurs.
2. Subject access requests. We submitted subject access requests (section "SAR Template Used in Our Experiments") based on the IP addresses used to visit the corresponding websites.
3.1 Private Company Websites
Popular Websites Firstly, we have chosen 18 websites of private companies that were among the most visited around the world2 in 2021.
2 https://ahrefs.com/blog/most-visited-websites/.
Websites Setting Cookies on User's Browser The 29 Working Party (29WP) stated that a device with a unique identifier (set through a cookie) allows the tracking of users of a specific computer even when dynamic IP addresses are used. In other words, such identifiers enable data subjects to be "singled out," even if their real names are not known. Hence, our next choice consisted of a list of 44 websites of companies that set cookies on their users' browsers, where the computation of these cookies depends on the IP address of the user. We were able to figure this out using a guess-and-determine approach: first, a website is visited and a cookie is set; then, the browser is reset and the cookie is removed from the browser but kept in our log; then, one parameter in the computer's setup is changed and the website is visited again; the new cookie obtained this time is compared with the previous one from our log. Complete reverse engineering seems difficult; this methodology is detailed in [21], and a simplified sketch of the idea is given below. Table 12.4 in section "List of Private Company Websites" provides the names of all the 62 private companies we have considered in our work.
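The following minimal sketch only illustrates the spirit of this guess-and-determine check; it is not the authors' actual tooling (which is described in [21]). The same page is fetched twice with a fresh cookie jar, changing only the outgoing IP address (here via two hypothetical HTTP proxies), and the cookies set by the server are compared. In practice, cookie values may also change for unrelated reasons (e.g., random session identifiers), so each setup parameter has to be varied and controlled one at a time.

```python
import requests

def collect_cookies(url, proxy=None):
    """Fetch `url` with an empty cookie jar and return the cookies set by the server."""
    session = requests.Session()  # a fresh session plays the role of a "reset" browser
    proxies = {"http": proxy, "https": proxy} if proxy else None
    session.get(url, proxies=proxies, timeout=30)
    return session.cookies.get_dict()

def ip_dependent_cookies(url, proxy_a, proxy_b):
    """Return the names of cookies whose value changes when only the client IP changes."""
    first = collect_cookies(url, proxy_a)   # baseline visit, cookie values kept in our log
    second = collect_cookies(url, proxy_b)  # identical setup except for the outgoing IP
    return {name for name in first.keys() & second.keys() if first[name] != second[name]}

if __name__ == "__main__":
    # proxy-a/proxy-b are placeholders for two endpoints exposing different public IPs.
    changed = ip_dependent_cookies(
        "https://example.com",
        "http://proxy-a.example:8080",
        "http://proxy-b.example:8080",
    )
    print("Cookies that appear to depend on the visitor's IP:", changed)
```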
3.2 Public Organization Websites
We have chosen 47 websites from Data Protection Authorities (DPAs), the website of the European Data Protection Board (EDPB), and the website of the European Data Protection Supervisor (EDPS). DPAs are independent public organizations that supervise, through their investigative and corrective powers, the application of data protection law in each EU country. They all have a website; it was therefore logical to investigate how they consider IP addresses. We have considered all the DPAs listed by the GDPR hub.3 We have also visited the websites of the EDPB4 and the EDPS.5 Table 12.5 in the Appendix provides a list of the 45 DPAs whose websites we visited during our experiments.
3.3 Visit’s Details We have visited the websites with three different IP addresses. We have used the default dynamic IP address provided by our ISP and we have also requested a static IP address to the same ISP for our device. We visited also the websites through Tor Network.6 For the websites, using Tor means that we are using the IP address of the Tor exit node. It is likely that many devices and users use the same exit node and
3 https://gdprhub.eu/index.php?title=Category:DPA. 4 https://edpb.europa.eu/. 5 https://edps.europa.eu/about-edps_en. 6 https://www.torproject.org/.
238
S. Adhatarao et al.
therefore the same IP address. Even though we have used different IP addresses to access the websites, due to the fact that our request was always denied, we did not mention the use of different IP addresses in our further discussions. All the visited websites could be viewed as an accountless user or as an external user, but some of them could also be accessed as a registered user. We have always visited the websites as an accountless user. This hypothesis is particularly important. Tracking people online based on their IP addresses make sense when they are accountless. IP address based tracking has been demonstrated in [33] but nobody knows if it has actually done by websites. Exposing IP address based tracking was one of the goal of our study. However, we found out that the right to access cannot help us to expose such tracking. In fact, we have created accounts on 20 websites (Google, YouTube, Amazon, LinkedIn, Reddit, Zoom, Yahoo, eBay, Pinterest, Wikipedia, Twitter, Twitch, Roblox, Bitly, Fandom, Tripadvisor, Microsoft, Apple, Facebook, and Indeed) to observe if it would have changed the response obtained to our subject access requests. However, even with a registered user account, our requests were denied. Hence, we only address the visit of websites as an accountless user in our further discussions.
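As an illustration only (the chapter does not describe the tooling used for these visits), the sketch below shows how the same page could be fetched once directly, so that the site sees the ISP-assigned dynamic or static IP, and once through a local Tor client, so that the site sees the IP of a Tor exit node. It assumes a Tor SOCKS proxy listening on 127.0.0.1:9050 and the requests[socks] extra installed; the URL is a placeholder.

```python
import requests

# Local Tor client exposing a SOCKS5 proxy (default port 9050); "socks5h" makes
# DNS resolution happen through Tor as well.
TOR_PROXY = {"http": "socks5h://127.0.0.1:9050", "https": "socks5h://127.0.0.1:9050"}

def visit(url, via_tor=False):
    """Fetch `url`, either directly (ISP-assigned IP) or through Tor (exit-node IP)."""
    proxies = TOR_PROXY if via_tor else None
    response = requests.get(url, proxies=proxies, timeout=60)
    return response.status_code

print(visit("https://example.com"))                # website sees our ISP-assigned IP
print(visit("https://example.com", via_tor=True))  # website sees a Tor exit node's IP
```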
4 Privacy Policies and IP Addresses
In Sect. 2 we concluded that whenever websites process personal data, they need to inform data subjects in their privacy policies about the collected data and the processing purposes. As such, we can expect to find details on the processing of IP addresses in the privacy policies of a website. We visited and analyzed the privacy policies of each website considered in our study. The privacy policies of the examined websites are broad and cover different aspects of why IP addresses are collected and processed. At least 54 private company websites in our experiment provide options for users to create a personalized account, and some of these websites' privacy policies describe different ways of collecting personal data for registered users and for accountless users. Our findings about the processing of IP addresses of users are summarized in Table 12.2. This analysis of privacy policies and the labelled categories was performed by one legal scholar and one computer scientist.

Table 12.2 Processing of IP addresses in the privacy policies of 109 websites (62 private companies and 47 public organizations)

| Category          | Description                               | # Private (companies) | # Public (DPAs) | Personal data |
|-------------------|-------------------------------------------|-----------------------|-----------------|---------------|
| Process           | All users                                 | 43                    | 9               | Yes           |
|                   | Registered users only                     | 2                     | 0               | Yes           |
|                   | Unclear                                   | 3                     | 0               | Yes           |
| Anonymize         | Anonymize                                 | 1                     | 4               | Yes           |
|                   | Shortened IP                              | 0                     | 2               | Yes           |
|                   | Masking for de-identification             | 0                     | 1               | Yes           |
|                   | Store full IP for 7 days & then anonymize | 0                     | 2               | Yes           |
| Do not collect    | From EU users                             | 1                     | 0               | Yes           |
|                   | From any website users                    | 0                     | 4               | Yes           |
| Do not mention IP | –                                         | 10                    | 22              | Unknown       |
| No page found     | –                                         | 2                     | 3               | Unknown       |

Processing of IP Addresses in the Privacy Policies We have identified five different ways websites handle IP addresses:
(i) IP addresses are processed by the visited websites, as these are mentioned explicitly in the website's privacy policy. At least 43 private websites (out of 48) mention processing of IP addresses when a user visits their website either as a registered or as an accountless user. Two companies (Netflix and Office) mention processing only for their registered users. The processing of IP addresses is unclear from the privacy policies of three websites (Yahoo, Tripadvisor, and Pinterest): these three websites do not specify whether IP addresses are collected for all website users or not. Nine DPAs also mention processing the IP addresses of their website users.
(ii) Privacy policies can also mention that IP addresses are processed but then anonymized, shortened, masked, or only temporarily stored. Only one company website and nine DPA websites perform anonymization on the IP addresses of their website users, yet the techniques used to anonymize are never mentioned or detailed (a common truncation approach is sketched after this list). Still, it can be acknowledged that such a website considers an IP address as personal data, because anonymization of IP addresses renders them non-personal (anonymization is the process of turning personal data into data that no longer relates to an identified or identifiable person (Recital 26 of the GDPR)).
(iii) One company mentions explicitly that it does not provide services in the EU and hence does not collect data related to EU users. The privacy policies of 4 DPAs also explicitly state that IP addresses are not collected. Optimistically, one can assume that these websites mention explicitly that they are not processing IP addresses because they are personal data.
(iv) At least 10 company websites and 22 DPA websites do not mention IP addresses in their privacy policies. When IP addresses are not mentioned in the privacy policies, we do not know whether they are processed or not. It is possible that these websites do not collect IP addresses and hence do not mention them, though we were unable to validate the reasons for this omission.
(v) Finally, we were unable to find the privacy policy page of 2 private company websites and 3 DPA websites, and hence we do not know how IP addresses are handled in these cases.
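Since the surveyed policies never specify their anonymization technique, the sketch below only illustrates one widely used family of approaches, IP truncation (dropping the last IPv4 octet, or keeping only an IPv6 /48 prefix); this is our assumption for illustration, not a description of what any of these websites actually does. Whether such truncation amounts to anonymization in the sense of Recital 26, or merely to pseudonymization, depends on what other data the controller keeps.

```python
import ipaddress

def shorten_ip(address: str) -> str:
    """Coarsen an IP address by zeroing its host part (last octet for IPv4, /48 for IPv6)."""
    ip = ipaddress.ip_address(address)
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)

print(shorten_ip("198.51.100.23"))        # -> 198.51.100.0
print(shorten_ip("2001:db8:abcd:12::7"))  # -> 2001:db8:abcd::
```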
Purpose of Processing The European Data Protection Board (EDPB) [17] stated that it is crucial to evaluate the "purpose pursued by the data controller in the data processing". Accordingly, we analyzed the purposes for processing IP addresses described in the consulted privacy policies. We identified that the purposes of the collection and processing of IP addresses include the following: (i) enhancing the user experience and (ii) security. Enhancing the user experience can refer to identifying the location of the user, personalizing and improving products, customizing services, trend analysis, or website administration. Some organizations collect IP addresses for security reasons, to protect their business against fraudulent behavior, or in case of legal proceedings relating to a criminal investigation or alleged or suspected illegal activity. Notably, the mentioned purposes require some degree of user personalization. As the EDPB notes [17], "to argue that individuals are not identifiable, where the purpose of the processing is precisely to identify them, would be a sheer contradiction in terms. Therefore, the information should be considered as relating to identifiable individuals and the processing should be subject to data protection rules." As such, we reason that all these purposes potentially enable the collection of data that leads to the identification of a user without unnecessary or disproportionate effort.
4.1 Summary
Many websites acknowledge that IP addresses are personal data: they explicitly mention that they collect and process IP addresses, or they anonymize them. Only 4 websites state that they do not collect IP addresses. However, our analysis of the websites' privacy policies is not enough to understand the reasons why websites process IP addresses, as the policy disclosures and the purposes stated therein are often vague and unclear. As our analysis was rather inconclusive, we further resorted to subject access requests to obtain more information on the processing of IP addresses, as pursued in the following section.
5 IP-Based Subject Access Requests
The GDPR provides data subjects with a right of access: a right to obtain from the controller confirmation as to whether or not personal data concerning them are being processed and, where that is the case, access to those data (Article 15 of the GDPR). Hence we decided to exercise our right of access and examine whether websites answer a request based on an IP address. Requests to access personal data are referred to as subject access requests (SARs). Several studies [10–13, 30, 31, 35, 36, 39] have used SARs as a methodological tool to assess the transparency of certain data processing, the strength of authentication procedures, or readiness to comply with the GDPR. During our analysis, we have submitted IP-based subject access requests to private and public organizations.
Table 12.3 Number of websites categorized based on the responses obtained to our IP-based SARs

| Answer's category                                     | Private | Inconsistent | Public | Inconsistent |
|-------------------------------------------------------|---------|--------------|--------|--------------|
| No reply                                              | 26      | Unknown      | 21     | Unknown      |
| No, we have nothing about you                         | 8       | 7            | 20     | 1            |
| No, we do not store IP addresses                      | 0       | –            | 2      | 1            |
| No user account was found with this name              | 17      | 17           | –      | –            |
| No, request was not made with enough documents        | 1       | Unknown      | 3      | Unknown      |
| No, we do not process personally identifiable data    | 1       | 1            | 1      | 1            |
| No, because IP addresses can be dynamic/shared        | 3       | 3            | 0      | –            |
| No, IP is not a search criteria in our systems        | 2       | 2            | –      | –            |
| No, with a data breach                                | 1       | 1            | –      | –            |
| No, we avoid collecting any personal data of EU users | 1       | 0            | –      | –            |
| No, we are not able to help you                       | 2       | 2            | –      | –            |
We devised a generic subject access request for all the company and public organization websites. The full text of the subject access request letter can be found in section "SAR Template Used in Our Experiments". We used this letter to submit a SAR to all websites, and we complied with all the websites' requests for additional information (like the copy of an ID) to authenticate the SAR. Each recipient of the SAR was then allowed one month to respond to our request, as mandated by the GDPR. The responses received to our SARs were grouped into 11 different categories. Table 12.3 shows the categories and the number of organizations in each of them. Of the 109 organizations to which we submitted a SAR, only 62 responded (36 private companies and 26 public organizations, respectively). In the following sections we present the obtained responses per category along with their legal analysis.
5.1 No Reply
47 websites (26 private companies and 21 public organizations) did not reply to the SAR. It is unknown why companies and DPAs did not answer a SAR, and one could be tempted to conclude that they did not have a dedicated process in place to respond. This is particularly surprising for DPAs. However, the GDPR mandates that data controllers have an explicit obligation to facilitate the exercise of data subject rights (Articles 12(1) and 28(3)(e)), including facilitating SARs.
Recital 7 recalls that each person should have control of their own personal data, and a failure to reply to a SAR obstructs such control. Recital 59 further emphasizes that "modalities should be provided for facilitating the exercise of the data subject's rights".
5.2 No, We Have Nothing About You
28 websites (8 private companies and 20 public organizations) answered that they do not have any data matching our request. We compared these answers with the privacy policies of these websites, and the practices of some of them (7 private companies and 1 DPA) are not consistent: their privacy policies mention processing of IP addresses, yet they failed to find any data relating to the IP addresses included in the SAR. The answers from the remaining 19 DPA websites are consistent with their privacy policies because they either do not mention IP addresses, do not collect IP addresses, or anonymize them. In the latter case, a website is indeed unable to recover data corresponding to our IP addresses.
5.3 No, We Do Not Store IP Addresses
Two DPAs (Schleswig-Holstein and Liechtenstein) answered that they do not store the IP addresses of their website visitors. The privacy policy of the Schleswig-Holstein DPA is consistent with this reply, whereas the Liechtenstein DPA mentions anonymizing the IP address in its privacy policy but the answer to our SAR states "we don't store IP addresses."
5.4 No Account Was Found
17 private company websites explained that they cannot process the SAR because they did not find any account corresponding to our request in their system. We expected that websites would use our IP addresses to query their information system and then extract the corresponding information associated with these IP addresses. But it appears that many of them queried their information system using the email address used to submit the subject access request. As they did not find anything corresponding to this email address, they replied that they do not have data associated with this account, or that we need to log in or provide our login information to complete the request. Thus, their procedure was not able to handle our request. For instance, Roblox replied: You must verify ownership of any associated Roblox account. We must have a Roblox user name in order to proceed with a GDPR request. This kind of answer shows how websites process subject access requests: they have completely ignored the IP addresses provided in our requests to focus only
on the email address used to send the request. Though they hold the email addresses of registered users in their information systems, they want to ensure that they provide data only to a legitimate user. Such an implementation is (conceivably) motivated by the website's need to authenticate the request, as asserted in the GDPR: "The controller should use all reasonable measures to verify the identity of a data subject who requests access, in particular in the context of online services and online identifiers" (Recital 64 of the GDPR). Whilst such a request for additional identifying information is understandable for registered users, the same reasoning does not seem to consider users without any account on these websites, i.e., accountless users. Moreover, their privacy policies are unclear, or do not provide sufficient information to accountless users on what information is collected when they visit the website. Accountless users are hence disempowered to access their own personal data: they cannot verify the lawfulness of the processing and of any tracking, nor are they able to exercise their rights (deletion, rectification, restriction, etc.). Internet users need to provide a lot of personal data, like their last name, first name, gender, location, phone number, and email address, in order to create an account on a website. Even if it is possible to lie in some of these fields, it is tempting to visit websites without creating a user account: it is faster, and it is also a privacy-preserving choice.
5.5 No, Request Was Not Made with Enough Documents
Four websites (one private company and three DPAs) stated that the documents provided were not enough to process our SAR. Generally, these websites requested additional data to identify the requester. This implies that IP addresses alone were not sufficient to identify the requester (as per the second stance in Sect. 2). Such a request for additional data is in line with the GDPR, wherein a controller that has reasonable doubts concerning the identity of the person making the request may ask for additional information necessary to confirm the identity of the data subject (Article 12(6)). However, there seems to be no substantive reason for a data subject to reveal their real identity to these websites through the requested documents: an original signed letter (asked for by the Luxembourg and Latvian DPAs), a SAR in the official language of the DPA (Austrian DPA), or an INE number, i.e., a statistical number (orange.es). The documents requested by these DPAs should be proportionate and necessary in relation to the website's knowledge of the data subject. Moreover, the minimization principle mandates that personal data shall be adequate, relevant, and limited to what is necessary in relation to the purposes for which they are processed (Article 5(1)(c)). Recital 39 further specifies that personal data should be processed only if the purpose of the processing could not reasonably be fulfilled by other means. The "necessity" and "proportionality" principles that both these provisions note refer to both the quantity and the quality of personal data. It is then clear that these websites should not process excessive data if this entails a disproportionate interference with the data subject's rights and hence a privacy invasion.
5.6 No, We Do Not Process Personally Identifiable Data
Two websites (1 private company and 1 public organization) answered that they do not process any personally identifiable data of their website visitors. Both websites mention processing of IP addresses in their privacy policies, but their replies state otherwise.
5.7 No, Because IP Addresses Can Be Dynamic/Shared
Three websites (Tripadvisor, Office, and lyst.co.uk) did take into account the IP addresses provided in our request. However, according to them, they cannot process the request, since IP addresses are shared and dynamic. These websites considered that an IP address can be used by multiple users at the same time and that they are thus unable to distinguish, in their own information system, the data belonging to us from the data of other users using the same IP addresses. In such a situation, if a website is able to demonstrate that it is not in a position to identify a concrete user, it can deny the request (Article 11(2)). Additionally, there is a risk for the websites of disclosing the information of other users, and whenever such a personal data breach risk exists (Article 4(12)), a website can deny the request. The UK DPA notes that the level of checks a data controller should make may depend on the possible harm and distress that inappropriate disclosure of the information could cause to the individual concerned [9]. Recital 63 of the GDPR adds that this right "should not adversely affect the rights or freedoms of others". One may argue that these websites could have requested more identifying information from the user in order to identify the requester. In effect, Recital 64 of the GDPR states that the controller should use all reasonable measures to verify the identity of a data subject who requests access, in particular in the context of online services and online identifiers. This vague concept of "reasonable measures" might result in data controllers implementing weak or irrelevant identity verification means upon receiving a request. To note, a website is not obliged to collect additional information to identify the data subject for the sole purpose of complying with the GDPR subject access rights (Article 11(2) and Recital 57 of the GDPR); acquiring additional data when it is no longer necessary is thus excluded.
5.8 No, IP Is Not a Search Criteria in Our System
Two private company websites (AdsWizz and Louis Vuitton) answered that IP addresses are not search criteria in their system, whereas their privacy policies state
that the IP addresses of all website users are collected. These replies show that these websites do not have a process to handle IP-based subject access requests.
5.9 No, with a Data Breach
One private company website (rubiconproject.com) acknowledged the presence of our IP addresses in its log file. The reply states: we do process data associated with IP addresses XX.XX.XX.XX and YY.YY.YY.YY7, our searches suggest that these addresses are associated with multiple different devices across multiple territories. This indicates that these IP addresses are used by multiple different users etc. The response shows that this website follows a particular process to handle IP-based SARs. Yet, even only acknowledging the presence of our IP addresses is already a data breach (Article 4(12) GDPR): we could have put arbitrary IP addresses in our request to learn whether they are present in the log file of rubiconproject.com. If other websites (like Alcoholics Anonymous, www.aa.org) used the same procedure to reply to such a request, sensitive information about Internet users could be exposed to anybody submitting an IP-based subject access request.
5.10 No, We Avoid Collecting Any Personal Data of EU Users
One private company website (officedepot) answered that it does not collect any information associated with EU citizens, as mentioned in its privacy policy.
5.11 Summary
All our requests were denied, and in many cases the reason given to motivate the website's response was not consistent with the website's privacy policy. The most rational answer to our requests is that IP addresses can be dynamic and shared. This answer is also consistent with the decision of the Court of Justice of the European Union (CJEU) in the Breyer case [15] (Sect. 2): IP addresses are considered personal data when combined with additional information. This cautious position also prevents websites from sending data to attackers. Finally, websites need to be careful even when they do not send data: rubiconproject.com acknowledged the existence of our IP addresses in its system, which can already provide information to an adversary and is a form of data breach.
7 Anonymized.
6 Conclusion
Our analysis of the privacy policies of 109 websites shows that it is difficult to know why IP addresses are processed, despite their status as personal data. It is not possible for accountless users to prove that they have used an IP address while visiting a website. We believe that it is currently not possible for a website to accept a subject access request based on an IP address, because it is very hard to verify that a given IP address has actually been used by the data subject submitting the request. Therefore, it is easier for websites to deny IP-based subject access requests. On the other hand, this lack of oversight regarding accountless users creates opportunities for IP address based tracking that the GDPR transparency rights to information and access do not suffice to expose.
Appendix
SAR Template Used in Our Experiments
We have created a small template with the relevant information for data controllers. Along with a first and last name, we provide the set of IP addresses used to access their website through different networks.

Dear Data Controller,
I am hereby requesting a copy of all my personal data held and/or undergoing processing, according to Article 15 of the GDPR. Please confirm whether or not you are processing personal data concerning me. In case you are, I am hereby requesting access to the following information: All personal data concerning me that you have stored. This includes any data derived about me, such as opinions, inferences, settings, and preferences. Please make the personal data concerning me, which I have provided to you, available to me in a structured, commonly used and machine-readable format, accompanied with an intelligible description of all variables.
I am including the following information necessary to identify me:
Name: first-name last-name
IP addresses used to access are as follows:
XX.XX.XX.XXX
XXXX:XXXX:XXXX:XXXX::XXX:XX
Yours Sincerely,
First-name
(As laid down in Article 12(3) GDPR, you have to provide the requested information to me without undue delay and in any event within one month of receipt of the request. According to Article 15(3) GDPR, you have to answer this request without cost to me.)
List of Private Company Websites
We have chosen 18 popular websites used across the globe and 44 websites of companies that set cookies on their users' browsers, where the computation of these cookies depends on the IP address of the user. Table 12.4 lists these websites.

Table 12.4 List of 62 private companies visited as external user (18 popular companies and 44 companies that set cookies using IP address)

Popular websites: Zoom, Yahoo, Twitch, Roblox, Office, Netflix, Twitter, Indeed, Pinterest, Fandom, Facebook, eBay, Bitly, Euronews, Apple, Wikipedia, Tripadvisor, Reddit.

Websites that set cookies using the user's IP address: 1000.menu, adbox.lv, britishairways.com, BigCommerce, dsar.everydayhealth.com, forever21.com, rubiconproject.com, kuleuven.be, assets.new.siemens.com, mylu.liberty.edu, turktelekom.com.tr, pubmatic.com, spiceworks.com, smartadserver.com, pgatour.com, urbanfonts.com, worldpopulationreview.com, mckinsey.com, admanmedia.com, yandex.com.tr, yandex.kz, lifepointspanel.com, okta.com, russianfood.com, start.me, warriorplus.com, sinoptik.ua, zoho.com, caranddriver.com, gismeteo.ua, louisvuitton.com, my-personaltrainer.it, point2homes.com, wikiquote.org, vans.com, orange.es, adswizz.com, duda.co, gumgum.com, lyst.co.uk, officedepot.com, jpnn.com, trafficjunky.net, wikimedia.org.
List of Public Organizations
We have considered 47 public organizations: 45 DPAs, the EDPB, and the EDPS. It is interesting to see how they process IP addresses and how they respond to an IP-based SAR. Table 12.5 names all these public organizations.

Table 12.5 List of 47 public organizations (45 DPAs and EDPB, EDPS)

ULD (Schleswig-Holstein), AKI (Estonia), ANSPDCP (Romania), APD/GBA (Belgium), BayLfD (Bavaria, public sector), BayLDA (Bavaria), UOOU (Czech Republic), CNIL (France), LfDI (Mecklenburg-Vorpommern), Commissioner (Cyprus), LfDI (Baden-Württemberg), Datatilsynet (Denmark), Datatilsynet (Norway), UOOU (Slovakia), Datenschutzzentrum (Saarland), DSB (Austria), LFDI (Rhineland-Palatinate), LfD (Lower Saxony), Datenschutzstelle (Liechtenstein), HBDI (Hesse), Tietosuojavaltuutetun toimisto (Finland), ICO (UK), LDA (Brandenburg), EDPB, AEPD (Spain), ADA (Lithuania), LfD (Saxony-Anhalt), CNPD (Luxembourg), AZOP (Croatia), HDPA (Greece), BlnBDI (Berlin), TLfDI (Thuringia), CNPD (Portugal), BfDI (Germany), Datainspektionen (Sweden), NAIH (Hungary), LDI (North Rhine-Westphalia), Persónuvernd (Iceland), IP (Slovenia), DSB (Saxony), DVI (Latvia), UODO (Poland), IDPC (Malta), CPDP (Bulgaria), HmbBfDI (Hamburg), LfDI (Bremen), EDPS.
References
1. 29 Working Party: Opinion 1/2008 on Data Protection Issues Related to Search Engines, 8, 00737/EN/WP 148 (Apr. 4, 2008)
2. Guidelines on transparency under Regulation 2016/679 (WP260), 2018
3. The Data Protection Directive as applied to Internet Protocol (IP) addresses: uniting the perspective of the European Commission with the jurisprudence of Member States
4. Internet Protocol. RFC 791 (Sep 1981). https://doi.org/10.17487/RFC0791, https://rfc-editor.org/rfc/rfc791.txt
5. Case C-101/01, Criminal proceedings against Bodil Lindqvist (November 2003), ECLI:EU:C:2016:779
6. Paris Appeal Court decision, Anthony G. vs. SCPP (27.04.2007) (2007), http://www.legalis.net/jurisprudence-decision.php3?id_article=1954
7. Paris Appeal Court decision, Henri S. vs. SCPP (15.05.2007) (2007), http://www.legalis.net/jurisprudence-decision.php3?id_article=195
8. Publications Office of the EU: Study of case law on the circumstances in which IP addresses are considered personal data (2011), https://op.europa.eu/en/publication-detail/-/publication/d7c71500-75a3-4b1c-9210-96c74b6fa2be/language-en
9. Subject access code of practice (2020), https://ico.org.uk/media/for-organisations/documents/2259722/subject-access-code-of-practice.pdf
10. Ausloos, J., Dewitte, P.: Shattering one-way mirrors: data subject access rights in practice. International Data Privacy Law 8(1), 4–28 (2018)
11. Boniface, C., Fouad, I., Bielova, N., Lauradoux, C., Santos, C.: Security analysis of subject access request procedures: how to authenticate data subjects safely when they request their data. In: Annual Privacy Forum 2019, pp. 1–20. Rome, Italy (Jun 2019)
12. Cagnazzo, M., Holz, T., Pohlmann, N.: GDPiRated: stealing personal information on- and offline. In: Computer Security, ESORICS 2019, 24th European Symposium on Research in Computer Security. Lecture Notes in Computer Science, vol. 11736, pp. 367–386. Springer (September 2019)
13. Cormack, A.: Is the subject access right now too great a threat to privacy? European Data Protection Law Review 2(1) (2016). https://doi.org/10.21552/EDPL/2016/1/5
14. Court of Justice of the European Union: Case C-70/10, Scarlet Extended v SABAM (2011), ECLI:EU:C:2011:771
15. Court of Justice of the European Union: Case C-582/14, Patrick Breyer v Germany (2016), ECLI:EU:C:2016:779
16. ECHR, 2 December 2008, K.U. v. Finland, application no. 2872/02
17. European Data Protection Board (EDPB): Opinion 4/2007 on the concept of personal data (WP 136), adopted on 20.06.2007
18. European Data Protection Board (EDPB): Opinion 1/2008 on data protection issues related to search engines, adopted on 4 April 2008 (WP 148)
19. European Data Protection Board (EDPB): Working document, Privacy on the Internet: an integrated EU approach to on-line data protection, adopted on 21.11.2000
20. El Khoury, A.: Dynamic IP addresses can be personal data, sometimes. A story of binary relations and Schrödinger's cat. European Journal of Risk Regulation 8(1), 191–197 (2017). https://doi.org/10.1017/err.2016.26
21. Fouad, I., Santos, C., Legout, A., Bielova, N.: Did I delete my cookies? Cookies respawning with browser fingerprinting (May 2021), https://hal.archives-ouvertes.fr/hal-03218403, working paper or preprint
22. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (2016)
23. Hickman, T., Goetz, M., Gabel, D., Ewing, C.: IP addresses and personal data: did CJEU ask the right questions? Privacy Laws & Business International Report (February 2017)
24. Hildén, J.: Am I my IP address's keeper? Revisiting the boundaries of information privacy. The Information Society 33(3), 159–171 (2017). https://doi.org/10.1080/01972243.2017.1294127
25. Hinden, R., Deering, S.: Internet Protocol Version 6 (IPv6) Addressing Architecture. RFC 3513, RFC Editor (April 2003), https://www.rfc-editor.org/rfc/rfc3513.txt
26. Hustinx, P.: Nameless data can still be personal. OUT-LAW.COM, Nov. 6, 2008, http://www.out-law.com/page-9563
27. Kristol, D.M.: HTTP cookies: standards, privacy, and politics. ACM Trans. Internet Techn. 1(2), 151–198 (2001)
28. Lah, F.: Are IP addresses "personally identifiable information"? Journal of Law and Policy for the Information Society 4(3) (2008)
29. Laperdrix, P., Bielova, N., Baudry, B., Avoine, G.: Browser fingerprinting: a survey. ACM Trans. Web 14(2), 8:1–8:33 (2020)
30. Mahieu, R.: Technology and Regulation 2021, 62–75 (2021)
31. Martino, M.D., Robyns, P., Weyts, W., Quax, P., Lamotte, W., Andries, K.: Personal information leakage by abusing the GDPR 'Right of Access'. In: Fourteenth Symposium on Usable Privacy and Security, SOUPS 2018. USENIX Association, Santa Clara, CA, USA (August 2019)
32. McIntyre, J.: Balancing expectations of online privacy: why Internet Protocol (IP) addresses should be protected as personally identifiable information. DePaul Law Review 60, 895 (2010)
33. Mishra, V., Laperdrix, P., Vastel, A., Rudametkin, W., Rouvoy, R., Lopatka, M.: Don't count me out: on the relevance of IP address in the tracking ecosystem. In: WWW '20: The Web Conference 2020, pp. 808–815. ACM / IW3C2 (April 2020)
34. Mittman, J.M.: German court rules that IP addresses are not personal data. Proskauer, October 10, 2008, https://www.pinsentmasons.com/out-law/news/german-court-says-ip-addresses-in-server-logs-are-not-personal-data
35. Norris, C., de Hert, P., L'Hoiry, X., Galetta, A.: The Unaccountable State of Surveillance: Exercising Access Rights in Europe. Springer International Publishing (2017)
36. Pavur, J.: GDPArrrrr: using privacy laws to steal identities. In: Blackhat USA 2019. Las Vegas, NV, USA (2019)
37. Reid, A.S.: The European Court of Justice case of Breyer. Journal of Information Rights, Policy and Practice 2(1) (April 2017)
38. Sanchez-Bordona, M.C.: Opinion of Advocate General Campos Sanchez-Bordona in Case C-582/14 Breyer v Bundesrepublik Deutschland (May 2016), ECLI:EU:C:2016:339
39. Urban, T., Tatang, D., Degeling, M., Holz, T., Pohlmann, N.: A study on subject data access in online advertising after the GDPR. In: Data Privacy Management, Cryptocurrencies and Blockchain Technology, ESORICS 2019 International Workshops, DPM 2019 and CBT 2019. Lecture Notes in Computer Science, vol. 11737, pp. 61–79. Springer, Luxembourg (September 2019)
40. Zuiderveen Borgesius, F.: Breyer case of the Court of Justice of the European Union: IP addresses and the personal data definition (case note). European Data Protection Law Review 3(1) (June 2017)
Correction to: When Regulatory Power and Industrial Ambitions Collide: The "Brussels Effect," Lead Markets, and the GDPR
Nicholas Martin and Frank Ebbers

Correction to: Chapter 8 in: S. Schiffner et al. (eds.), Privacy Symposium 2022, https://doi.org/10.1007/978-3-031-09901-4_8

Chapter 8, "When Regulatory Power and Industrial Ambitions Collide: The "Brussels Effect," Lead Markets, and the GDPR", was previously published non-open access. It has now been changed to open access under a CC BY 4.0 license, and the copyright holder has been updated to "The Author(s)". The book has also been updated with this change.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
The updated version of this chapter can be found at https://doi.org/10.1007/978-3-031-09901-4_8
Index
A: Anonymization, vi, 93, 97–98, 185–186, 188, 190, 192, 195, 196, 199, 239; At-home DNA, 156–162, 166, 169, 172–176
B: Brussels effect, 106, 111–112, 122, 123, 129–148
C: Compliance, v–vii, 3, 5, 9, 15, 17, 22, 29, 62–65, 69–71, 73–83, 89, 95, 100, 102, 108, 111, 113, 118, 120, 123, 124, 127, 130, 132–141, 143–148, 180–182, 196–200; Compliance management, 62, 69–70; Cross-border data transfer, vi, 70, 81, 107–110, 112, 113, 115, 121, 122
D: Data altruism, 7, 8, 10, 14–15; Data donation, 10–14; Data integrity, 8, 188, 192, 199; Data privacy, 109, 114, 124–126, 135, 137, 143, 145, 158, 166; Data processing, 11, 13, 17, 27, 28, 37, 61–71, 74–83, 92, 94–97, 100, 101, 130, 136, 184, 190, 197, 199, 240; Data protection law, 88, 89, 92, 93, 95, 97, 98, 100, 101, 115, 117–119, 122, 127, 135, 137, 145, 183, 237; Digital trade, 106, 107, 112–115, 118, 120, 121, 123, 125, 127
E: e-commerce, 105–107, 111–115, 120, 121, 124–127
F: Function creep, 44, 50, 54–56
G: GDPR certification, 74–76, 79, 81; GDPR compliance, vi, 22, 62, 63, 135, 139, 140, 144, 145; General data protection regulation (GDPR), 5, 22, 62, 74, 88, 106, 130, 180, 224, 232
H: Healthcare data, 10; Hybrid certification model, 74
I: Identifiers, 25, 34, 42–49, 51, 52, 54, 55, 61–71, 231, 233, 235, 237, 243, 244; Informed consent, 13, 15, 192–193, 198, 199, 224; Innovation, 4, 5, 7–10, 12, 55, 61, 112, 125, 130, 132–134, 147, 148; IP address, 26, 231–248
L: Lead markets, 129–148
M: Measurement, vi, vii, 6, 7, 9, 15, 22, 28, 29, 37, 63, 77, 94, 102, 106, 110–124, 126, 127, 132, 136, 180–182, 184–186, 188, 192, 195, 196, 198, 199, 217, 225, 243, 244; Multi-party computation, 22–37
P: Personal data, v, vi, 5–17, 22, 24–27, 29–36, 63, 75, 88, 89, 91–99, 101, 102, 105–107, 110, 111, 113, 118–120, 122–125, 135–137, 139, 145, 146, 180, 182, 184, 185, 187–190, 193, 199, 232–236, 238–240, 242–246; Personal Information Protection Law (PIPL), 88, 89, 92–102; Platforms, 4, 43, 44, 46, 47, 50–52, 54, 55, 82, 91, 111, 113, 121, 145, 157–160, 163, 165, 166, 172, 173, 179–181, 189–191, 193, 197, 199; Porter hypothesis, 132, 133; Privacy, 5, 22, 42, 76, 88, 108, 130, 156, 180, 208, 232; Privacy by design, 110; Privacy-enhancing technologies (PETs), vi, 22, 27, 29, 30, 36, 138; Privacy friendly, 22, 28, 43, 56; Privacy policy, 156, 169, 170, 191, 232, 236, 238–240, 242, 244–246; Privacy tech, 130, 134–147; Pseudonymization, vi, 12, 28, 63, 91, 93, 97–98, 138, 185–186, 188–190, 192, 195, 198, 199; Public genealogy databases, 157, 158, 170, 172, 176
R: Regulation, 5, 22, 63, 74, 88, 106, 129, 165, 184, 211; Research ethics, 180–182, 185, 188, 191
S: Secondary use of data, 3–17; Security, 5, 13, 22, 24, 33, 77, 95, 96, 102, 109, 110, 118, 126, 136, 137, 139, 140, 162, 184–186, 188, 192, 199, 208–211, 219, 221–223, 225, 232, 235, 240; Social computing, 180; Social media data scraping, 180; Subject access request, 232, 236, 238, 240–246; Surveillance, 43, 50–51, 53, 54, 56, 77, 82, 88, 170, 171, 174, 180, 209, 223, 225
T: Text and data mining, 183; Thermal imaging, 207–226; Twitter, 179–200, 238, 247
W: World Trade Organization (WTO), 114–119, 121, 124, 126, 127