Digital Watermarking for Machine Learning Model: Techniques, Protocols and Applications



Table of contents:
Preface
Contents
Contributors
About the Editors
Acronyms
Mathematical Notation
Fundamentals
Machine Learning
Model Watermarking
Part I Preliminary
1 Introduction
1.1 Why Digital Watermarking for Machine Learning Models?
1.2 How Digital Watermarking Is Used for Machine Learning Models?
1.2.1 Techniques
1.2.2 Protocols
1.2.3 Applications
1.3 Related Work
1.3.1 White-Box Watermarks
1.3.2 Black-Box Watermarks
1.3.3 Neural Network Fingerprints
1.4 About This Book
References
2 Ownership Verification Protocols for Deep Neural Network Watermarks
2.1 Introduction
2.2 Security Formulation
2.2.1 Functionality Preserving
2.2.2 Accuracy and Unambiguity
2.2.3 Persistency
2.2.4 Other Security Requirements
2.3 The Ownership Verification Protocol for DNN
2.3.1 The Boycotting Attack and the Corresponding Security
2.3.2 The Overwriting Attack and the Corresponding Security
2.3.3 Evidence Exposure and the Corresponding Security
2.3.4 A Logic Perspective of the OV Protocol
2.3.5 Remarks on Advanced Protocols
2.4 Conclusion
References
Part II Techniques
3 Model Watermarking for Deep Neural Networks of Image Recovery
3.1 Introduction
3.2 Related Works
3.2.1 White-Box Model Watermarking
3.2.2 Black-Box Model Watermarking
3.3 Problem Formulation
3.3.1 Notations and Definitions
3.3.2 Principles for Watermarking Image Recovery DNNs
3.3.3 Model-Oriented Attacks to Model Watermarking
3.4 Proposed Method
3.4.1 Main Idea and Framework
3.4.2 Trigger Key Generation
3.4.3 Watermark Generation
3.4.4 Watermark Embedding
3.4.5 Watermark Verification
3.4.6 Auxiliary Copyright Visualizer
3.5 Conclusion
References
4 The Robust and Harmless Model Watermarking
4.1 Introduction
4.2 Related Work
4.2.1 Model Stealing
4.2.2 Defenses Against Model Stealing
4.3 Revisiting Existing Model Ownership Verification
4.3.1 The Limitation of Dataset Inference
4.3.2 The Limitation of Backdoor-Based Watermarking
4.4 The Proposed Method Under Centralized Training
4.4.1 Threat Model and Method Pipeline
4.4.2 Model Watermarking with Embedded External Features
4.4.3 Training Ownership Meta-Classifier
4.4.4 Model Ownership Verification with Hypothesis Test
4.5 The Proposed Method Under Federated Learning
4.5.1 Problem Formulation and Threat Model
4.5.2 The Proposed Method
4.6 Experiments
4.6.1 Experimental Settings
4.6.2 Main Results Under Centralized Training
4.6.3 Main Results Under Federated Learning
4.6.4 The Effects of Key Hyper-Parameters
4.6.5 Ablation Study
4.7 Conclusion
References
5 Protecting Intellectual Property of Machine Learning Models via Fingerprinting the Classification Boundary
5.1 Introduction
5.2 Related Works
5.2.1 Watermarking for IP Protection
5.2.2 Classification Boundary
5.3 Problem Formulation
5.3.1 Threat Model
5.3.2 Fingerprinting a Target Model
5.3.3 Design Goals
5.3.4 Measuring the Robustness–Uniqueness Trade-off
5.4 Design of IPGuard
5.4.1 Overview
5.4.2 Finding Fingerprinting Data Points as an Optimization Problem
5.4.3 Initialization and Label Selection
5.5 Discussion
5.5.1 Connection with Adversarial Examples
5.5.2 Robustness Against Knowledge Distillation
5.5.3 Attacker-Side Detection of Fingerprinting Data Points
5.6 Conclusion and Future Work
References
6 Protecting Image Processing Networks via Model Watermarking
6.1 Introduction
6.2 Preliminaries
6.2.1 Threat Model
6.2.2 Problem Formulation
6.3 Proposed Method
6.3.1 Motivation
6.3.2 Traditional Watermarking Algorithm
6.3.3 Deep Invisible Watermarking
6.3.3.1 Network Structures
6.3.3.2 Loss Functions
6.3.3.3 Ownership Verification
6.3.3.4 Flexible Extensions
6.4 Experiments
6.4.1 Experiment Settings
6.4.2 Fidelity and Capacity
6.4.3 Robustness to Model Extraction Attack
6.4.4 Ablation Study
6.4.5 Extensions
6.5 Discussion
6.6 Conclusion
References
7 Watermarks for Deep Reinforcement Learning
7.1 Introduction
7.2 Background
7.2.1 Markov Decision Process
7.2.2 Reinforcement Learning
7.2.3 Deep Reinforcement Learning
7.3 Related Work
7.3.1 Watermarks for Supervised Deep Learning Models
7.3.2 Watermarks for Deep Reinforcement Learning Models
7.4 Problem Formulation
7.4.1 Threat Model
7.4.2 Temporal Watermarks for Deep Reinforcement Learning
7.5 Proposed Method
7.5.1 Watermark Candidate Generation
7.5.2 Watermark Embedding
7.5.3 Ownership Verification
7.6 Discussion
7.7 Conclusion
References
8 Ownership Protection for Image Captioning Models
8.1 Introduction
8.2 Related Works
8.2.1 Image Captioning
8.2.2 Digital Watermarking in DNN Models
8.3 Problem Formulation
8.3.1 Image Captioning Model
8.3.2 Proof of Proposition 2
8.3.3 IP Protection on Image Captioning Model
8.4 Proposed Method
8.4.1 Secret Key Generation Process
8.4.2 Embedding Process
8.4.3 Verification Process
8.5 Experiment Settings
8.5.1 Metrics and Dataset
8.5.2 Configurations
8.5.3 Methods for Comparison
8.6 Discussion and Limitations
8.6.1 Comparison with Current Digital Watermarking Framework
8.6.2 Fidelity Evaluation
8.6.3 Resilience Against Ambiguity Attacks
8.6.4 Robustness Against Removal Attacks
8.6.5 Limitations
8.7 Conclusion
References
9 Protecting Recurrent Neural Network by Embedding Keys
9.1 Introduction
9.2 Related Works
9.3 Problem Formulation
9.3.1 Problem Statement
9.3.2 Protection Framework Design
9.3.3 Contributions
9.3.4 Protocols for Model Watermarking and Ownership Verification
9.4 Proposed Method
9.4.1 Key Gates
9.4.2 Methods to Generate Key
9.4.3 Sign of Key Outputs as Signature
9.4.4 Ownership Verification with Keys
9.5 Experiments
9.5.1 Learning Tasks
9.5.2 Hyperparameters
9.6 Discussion
9.6.1 Fidelity
9.6.2 Robustness Against Removal Attacks
9.6.3 Resilience Against Ambiguity Attacks
9.6.4 Secrecy
9.6.5 Time Complexity
9.6.6 Key Gate Activation
9.7 Conclusion
References
Part III Applications
10 FedIPR: Ownership Verification for Federated Deep Neural Network Models
10.1 Introduction
10.2 Related Works
10.2.1 Secure Federated Learning
10.2.2 DNN Watermarking Methods
10.3 Preliminaries
10.3.1 Secure Horizontal Federated Learning
10.3.2 Freeriders in Federated Learning
10.3.3 DNN Watermarking Methods
10.4 Proposed Method
10.4.1 Definition of FedDNN Ownership Verification with Watermarks
10.4.2 Challenge A: Capacity of Multiple Watermarks in FedDNN
10.4.3 Challenge B: Watermarking Robustness in SFL
10.5 Implementation Details
10.5.1 Watermark Embedding and Verification
10.5.1.1 Watermarking Design for CNN
10.5.1.2 Watermarking Design for Transformer-Based Networks
10.6 Experimental Results
10.6.1 Fidelity
10.6.2 Watermark Detection Rate
10.6.3 Watermarks Defeat Freerider Attacks
10.6.4 Robustness Under Federated Learning Strategies
10.6.4.1 Robustness Against Differential Privacy
10.6.4.2 Robustness Against Client Selection
10.7 Conclusion
References
11 Model Auditing for Data Intellectual Property
11.1 Introduction
11.2 Related Works
11.2.1 Membership Inference (MI)
11.2.2 Model Decision Boundary
11.3 Problem Formulations
11.3.1 Properties for Model Auditing
11.3.2 Model Auditing Under Different Settings
11.4 Investigation of Existing Model Auditing Methods
11.4.1 Distance Approximation to Decision Boundary
11.4.2 Data Ownership Resolution
11.4.3 Threat Model for Model Auditing
11.4.3.1 Removal Attack
11.4.3.2 Ambiguity Attack
11.5 Experimental Results
11.5.1 Main Results
11.5.2 Partial Data Usage
11.5.3 Different Adversarial Setting
11.5.3.1 Data Ambiguity Attack
11.5.3.2 Model Removal Attack
11.6 Conclusion
References

Lixin Fan Chee Seng Chan Qiang Yang   Editors

Digital Watermarking for Machine Learning Model Techniques, Protocols and Applications


Editors

Lixin Fan, AI Lab, WeBank, Shenzhen, China

Chee Seng Chan, Department of Artificial Intelligence, Universiti Malaya, Kuala Lumpur, Malaysia

Qiang Yang, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China

ISBN 978-981-19-7553-0    ISBN 978-981-19-7554-7 (eBook)
https://doi.org/10.1007/978-981-19-7554-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

In a modern digital economy, we care about the value that data can generate. Such value is oftentimes created by machine learning models empowered by enormous amounts of data of multiple forms. For example, using health-checkup data, medical doctors can train a stroke prediction model that can accurately predict the likelihood of a patient getting a stroke. A computer vision model in an autonomous vehicle can tell whether a traffic light is red or green even in foggy weather. An economic model can give explanations of why oil prices are volatile in a particular period of time. One can say that data are equivalent to raw materials such as coal and oil in the traditional economy, and in this analogy, machine learning models are the machines and vehicles that produce the value for the digital economy.

Similar to finance and goods, which need to be tracked and managed as well as protected by law, models in the foreseeable future will need to be protected, managed, and audited as well. Specifically, when we use a model purchased from a third party, we need to be certain that the model comes from a legitimate place. When we trade models in a marketplace, we need to have a fair methodology to ascertain the value of the model in a certain business context. When a model misbehaves, for instance if a stroke prediction model fails to predict a fatal stroke, we need to have the means to trace back to the responsible party that should handle the loss of life. When users with different roles, such as regulators, engineers, or end users, inquire about the model, we need to have a way to audit the model's history as well as give a fair explanation of the model's performance. Furthermore, when models are built out of multiple parties' data, it is important to be able to filter out 'semi-honest' parties who can use various opportunities to peek at other parties' data out of curiosity.

To be able to track and manage models, a typical approach is to embed a signature, known as a watermark, into a model. Furthermore, care should be taken to prevent the watermarking information from being altered. It is technically challenging to insert and manage watermarks for complex models that involve millions or even billions of model parameters. The technology of model watermarking is the central focus of this book. The watermarking technology must answer how to best balance the need to embed the watermarks and hide them from potential tampering while allowing the model training and inference to be efficient and effective. While there are watermarking algorithms for image data to confirm the ownership of images, and lately NFT technologies for digital arts, the watermarking techniques for models are novel and more challenging. This is partly due to the fact that models engage in an entire software product lifecycle in which there is a training process and an application process. There are issues related to ownership verification, transfer and model revision, mixtures and merges, model tracing, legal obligation, responsibility, rewards, and incentives. Once established, the model watermarking techniques will become a cornerstone of the future digital economy.

This book is the result of the most recent frontline research in AI contributed by a group of researchers who are active in fields including machine learning, data and model management, federated learning, and many fielded applications of these technologies. This book is in general suitable for readers with interests in machine learning and big data. In particular, the preliminary chapters provide an introduction and a brief review of the requirements for model ownership verification using watermarking. Chapters in Part II of the book elaborate on techniques that are developed for various machine learning models as well as security requirements. Part III of the book covers applications of model watermarking techniques in federated learning settings and model auditing use cases.

We hope the book will bring to the readers a new look into the digital future of human society, one that follows the widely accepted human values of modern people and society. We also expect this introductory book to be a good reference book for students studying artificial intelligence and a handbook for engineers and researchers in industry. To the best of our knowledge, this book is the first of its kind to showcase how to use digital watermarks to verify the ownership of machine learning models. Nevertheless, the book would have been impossible without kind assistance from many people. Thanks to everyone on the Springer editorial team, and special thanks to Celine, the ever-patient Editorial Director. The authors would like to thank their families for their constant support.

Lixin Fan, Shenzhen, China
Chee Seng Chan, Kuala Lumpur, Malaysia
Qiang Yang, Hong Kong, China
June 2022

Contents

Part I Preliminary

1 Introduction
Lixin Fan, Chee Seng Chan, and Qiang Yang

2 Ownership Verification Protocols for Deep Neural Network Watermarks
Fangqi Li and Shilin Wang

Part II Techniques

3 Model Watermarking for Deep Neural Networks of Image Recovery
Yuhui Quan and Huan Teng

4 The Robust and Harmless Model Watermarking
Yiming Li, Linghui Zhu, Yang Bai, Yong Jiang, and Shu-Tao Xia

5 Protecting Intellectual Property of Machine Learning Models via Fingerprinting the Classification Boundary
Xiaoyu Cao, Jinyuan Jia, and Neil Zhenqiang Gong

6 Protecting Image Processing Networks via Model Watermarking
Jie Zhang, Dongdong Chen, Jing Liao, Weiming Zhang, and Nenghai Yu

7 Watermarks for Deep Reinforcement Learning
Kangjie Chen

8 Ownership Protection for Image Captioning Models
Jian Han Lim

9 Protecting Recurrent Neural Network by Embedding Keys
Zhi Qin Tan, Hao Shan Wong, and Chee Seng Chan

Part III Applications

10 FedIPR: Ownership Verification for Federated Deep Neural Network Models
Bowen Li, Lixin Fan, Hanlin Gu, Jie Li, and Qiang Yang

11 Model Auditing for Data Intellectual Property
Bowen Li, Lixin Fan, Jie Li, Hanlin Gu, and Qiang Yang

Contributors

Yang Bai, Tsinghua University, Beijing, China
Xiaoyu Cao, Duke University, Durham, NC, USA
Chee Seng Chan, Universiti Malaya, Kuala Lumpur, Malaysia
Dongdong Chen, Microsoft Research, Redmond, WA, USA
Kangjie Chen, Nanyang Technological University, Singapore, Singapore
Lixin Fan, WeBank AI Lab, Shenzhen, China
Neil Zhenqiang Gong, Duke University, Durham, NC, USA
Hanlin Gu, WeBank AI Lab, Shenzhen, China
Jinyuan Jia, Duke University, Durham, NC, USA
Yong Jiang, Tsinghua University, Beijing, China
Jie Li, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Bowen Li, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Fangqi Li, Shanghai Jiao Tong University, Shanghai, China
Yiming Li, Tsinghua University, Beijing, China
Jing Liao, City University of Hong Kong, Hong Kong, China
Jian Han Lim, Universiti Malaya, Kuala Lumpur, Malaysia
Yuhui Quan, South China University of Technology and Pazhou Laboratory, Guangzhou, China
Zhi Qin Tan, Universiti Malaya, Kuala Lumpur, Malaysia
Huan Teng, South China University of Technology, Guangzhou, China
Shilin Wang, Shanghai Jiao Tong University, Shanghai, China
Hao Shan Wong, Universiti Malaya, Kuala Lumpur, Malaysia
Shu-Tao Xia, Tsinghua University, Beijing, China
Qiang Yang, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
Nenghai Yu, University of Science and Technology of China, Hefei, China
Jie Zhang, University of Science and Technology of China, Hefei, China
Weiming Zhang, University of Science and Technology of China, Hefei, China
Linghui Zhu, Tsinghua University, Beijing, China

About the Editors

Lixin Fan is currently the Chief Scientist of Artificial Intelligence at WeBank, Shenzhen, China. His research interests include machine learning and deep learning, privacy computing and federated learning, computer vision and pattern recognition, image and video processing, and mobile and ubiquitous computing. He was the Organizing Chair of workshops in these research areas held at CVPR, ICCV, ICPR, ACCV, NeurIPS, AAAI, and IJCAI. He is the author of 3 edited books and more than 70 articles in peer-reviewed international journals and conference proceedings. He holds more than 100 patents filed in the United States, Europe, and China, and he was Chairman of the IEEE P2894 Explainable Artificial Intelligence (XAI) Standard Working Group.

Chee Seng Chan is currently a Full Professor at the Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Malaysia. His research interests include computer vision and machine learning, where he has published more than 100 papers in related top peer-reviewed conferences and journals. He was the Organizing Chair of the Asian Conference on Pattern Recognition (2015) and General Chair of the IEEE Workshop on Multimedia Signal Processing (2019) and IEEE Visual Communications and Image Processing (2013). He was the recipient of Top Research Scientists Malaysia (TRSM) in 2022, Young Scientists Network Academy of Sciences Malaysia (YSN-ASM) in 2015, and the Hitachi Research Fellowship in 2013. Besides that, he is also a Senior Member (IEEE), Professional Engineer (BEM), and Chartered Engineer (IET). During 2020–2022, he was seconded to the Ministry of Science, Technology and Innovation (MOSTI) as the Undersecretary for the Division of Data Strategic and Foresight.

Qiang Yang is a Fellow of the Canadian Academy of Engineering (CAE) and the Royal Society of Canada (RSC), Chief Artificial Intelligence Officer of WeBank, and Chair Professor at the Computer Science and Engineering Department of Hong Kong University of Science and Technology (HKUST). He is the Conference Chair of AAAI-21, Honorary Vice President of the Chinese Association for Artificial Intelligence (CAAI), President of the Hong Kong Society of Artificial Intelligence and Robotics (HKSAIR), and President of the Investment Technology League (ITL). He is a fellow of the AAAI, ACM, CAAI, IEEE, IAPR, and AAAS. He was the Founding Editor in Chief of the ACM Transactions on Intelligent Systems and Technology (ACM TIST) and the Founding Editor in Chief of IEEE Transactions on Big Data (IEEE TBD). He received the ACM SIGKDD Distinguished Service Award in 2017. He served as Founding Director of Huawei's Noah's Ark Research Lab from 2012 to 2015, Founding Director of HKUST's Big Data Institute, Founder of 4Paradigm, and President of the IJCAI (2017–2019). His research interests include artificial intelligence, machine learning, data mining, and planning.

Acronyms

AI    Artificial intelligence
CE    Cross entropy
CNN   Convolutional neural network
DL    Deep learning
DNN   Deep neural network
DQN   Deep Q-network
DRL   Deep reinforcement learning
EM    Expectation maximization
FL    Federated learning
GAN   Generative adversarial network
GNN   Graph neural network
GPU   Graphics processing unit
KL    Kullback-Leibler
LSTM  Long short-term memory
ML    Machine learning
MSE   Mean square error
NLP   Natural language processing
RL    Reinforcement learning
RNN   Recurrent neural network

Mathematical Notation

Fundamentals

x                A scalar
x                A vector
X                A matrix
R                The set of real numbers
{· · ·}          Set
|| · ||_p        p-norm
sup(·)           Supremum
sgn(·)           Sign function
I(·)             Indicator function
P(X)             A probability distribution over a discrete variable
p(X)             A probability density function over a continuous variable, or over a variable whose type has not been specified
X ∼ p            The random variable X has distribution p
E(X)             Expectation of a random variable
Var(X)           Variance of a random variable
Cov(X, Y)        Covariance of two random variables
D_KL(P || Q)     Kullback-Leibler divergence of P and Q
N(x; μ, σ)       Gaussian distribution over x with mean μ and covariance σ
H                Hypothesis testing

Machine Learning

D                Distribution
D                Dataset
X                Sample space
x                Data sample in dataset
y                Data label in dataset
f                Model functionality
f(x)             Model prediction on data sample
W                Model parameters

Model Watermarking

N()              Neural network model
W                Model parameters
G()              Watermark generation process
N                Length of feature-based watermarks
S_B              Black-box watermarks
T                Triggers for black-box watermarks
S_W              White-box watermarks
θ                Extractors for white-box watermarks
E()              Watermark embedding process
L_D              The loss function for the main learning task
L_B              Black-box watermark embedding regularization term
L_W              White-box watermark embedding regularization term
V()              Watermark verification process
V_W()            White-box verification
V_B()            Black-box verification

Part I

Preliminary

Chapter 1

Introduction

Lixin Fan, Chee Seng Chan, and Qiang Yang

1.1 Why Digital Watermarking for Machine Learning Models?

Rapid developments of machine learning technologies such as deep learning and federated learning have nowadays affected every one of us. On the one hand, a large variety of machine learning models are used in all kinds of applications including finance, healthcare, public transportation, etc., reshaping our lives in an unprecedentedly profound manner. On the other hand, the wide applicability of these machine learning models calls for appropriate management of these models to ensure that their use complies with legislation and ethics concerning privacy, fairness, well-being, etc.¹

¹ For instance, see the Montréal Declaration for a Responsible Development of Artificial Intelligence at https://www.montrealdeclaration-responsibleai.com/the-declaration.

There are three compelling requirements in our view concerning the creation, registration, and deployment of machine learning models. First of all, machine learning models are precious intellectual properties that should be protected from illegal copying or re-distribution. This requirement lies in the fact that creating machine learning models from scratch is expensive, in terms of both the dedicated hardware and the exceedingly long time required for training, especially for deep neural networks with hundreds of billions of network parameters. For instance, it took millions of dollars to train the GPT-3 pre-trained natural language model [2]. Second, the creation of advanced machine learning models also requires a vast amount of training data, which is a precious asset to be protected. It is not uncommon to use terabytes of data to pre-train deep neural network models for the sake of improved model performance. Collecting and labelling such datasets, however, is extremely laborious and expensive; therefore, owners of these datasets would like to ensure that they are not misused to build machine learning models by unauthorized parties. Third, the outputs of machine learning models are subject to auditing and management for the sake of trustworthy AI applications. The necessity of such auditing is showcased, e.g., by the powerful deepfake technique that generates deceptive visual and audio content by using generative adversarial networks (GANs).² The capability to audit the outputs of deepfake ML models and trace them back to the owner and the responsible party who deploys these models can help to prevent misuse of ML models as such.

² https://en.wikipedia.org/wiki/Deepfake.

All the above-mentioned requirements boil down to the need for ownership verification of a specific ML model. The main theme of this book is how to verify whether a machine learning model in question indeed belongs to a party who claims to be its legitimate owner. The ownership is confirmed if the claimed owner, or a prover, either an organization or an individual person, can demonstrate indisputable evidence to convince another party, or a verifier, with sufficient confidence. The process that allows the verifier to check whether the claim that a machine learning model belongs to the prover is true or false is formally defined as model ownership verification (MOV). The inputs to a MOV consist of the model in question, the evidence submitted by the prover, and a protocol to verify the claimed ownership. The outcome of a MOV is either true or false.

There are a number of fundamentally different ways in which evidence can be created to support ownership verification. For instance, a model owner may choose to create evidence by using hashing or other blockchain technologies in the case that the machine learning model parameters have already been given [9]. However, evidence presented in this way is not indisputable, since a plagiarizer of the model may also create evidence by using the same hashing or blockchain technologies, even before the legitimate owner registers his/her ownership evidence. It is therefore preferable for the owner to create evidence as early as possible, for instance, during the learning process when model parameters are yet to be determined to accomplish certain learning tasks. In this book, we only consider the latter type of evidence, which is tightly coupled with the machine learning model in question, e.g., its parameters or its behaviors. Such evidence is called watermarks, signatures, or fingerprints of the model.³

³ In this book, we use the terms watermarks, signatures, and fingerprints interchangeably, unless their subtle differences are discussed and compared, e.g., in Chap. 5.

In essence, designated watermarks selected by owners are first embedded into the model during its learning stage, and subsequently, the embedded watermarks can be reliably detected to prove the ownership of the models in question. This watermark-based model ownership verification is of particular interest since it allows owners to prove their contributions in the early phase of creating the model in question. The unique feature of coupling ownership with the model learning process is crucial for many application use cases and places watermark-based ownership verification in a favorable position against alternative solutions.

1.2 How Digital Watermarking Is Used for Machine Learning Models?

We first give below an overview of the landscape of digital watermarking for machine learning models, which consists of three aspects, namely, techniques that allow watermarks to be created and verified for the given machine learning models, protocols that ensure indisputable evidence to support ownership verification, as well as a wide variety of application scenarios of watermark-based ownership verification.

1.2.1 Techniques

Digital watermarking is a traditional technique that allows one to embed specific digital markers into multimedia content such as audio, image, or video. The purpose of digital watermarking is to verify the authenticity of the carrier multimedia content or to show its identity [6]. While digital watermarking techniques have been developed for more than 20 years, their application to machine learning models is relatively new. The development of digital watermarking techniques for machine learning models, especially for deep neural network (DNN) models, originated with the seminal work published by Uchida et al. [15] in 2017. The initial idea of embedding watermarks into DNN models was implemented in a rather straightforward manner, i.e., a regularization term was added to the original learning task objective to enforce that the DNN model parameters exhibit specific features. These specific model parameter features are called watermarks and are later extracted from the model in question to prove the ownership. The underlying principle of this ownership verification (OV) process relies on the fact that features embedded as such are unlikely to be present in a model that is not learned with the purposely designed regularization term.

The ownership verification process designed in this manner must fulfill certain requirements, the most important of which is to ensure that the model performance for the original learning task is not affected or significantly degraded by the embedding of the watermarks. That is to say, the model embedded with watermarks is indistinguishable from the one without watermarks in terms of the model performance for the original learning task. This functionality-preserving capability ensures that the model can achieve certain tasks and is of business value worth protecting against plagiarism. The convincing power of model watermarks depends on how reliably the designated watermarks can be detected from the model in question. In particular, such reliability must be investigated under certain forms of removal attacks such as model fine-tuning or pruning. The convincing power also depends on how unlikely it is that a counterfeit watermark can be detected from the model in question. It is thus crucial for embedded model watermarks to be unique and robust against ambiguity attacks that aim to insert counterfeit watermarks into an already trained model.

The verification of digital watermarks from machine learning models can be accomplished in two different modes, namely the white-box and black-box modes, depending on how watermarks are extracted from the model in question. White-box verification refers to the case in which one has direct access to the model parameters to extract specific watermarks or features. Techniques belonging to this category are therefore also called feature-based watermarking. It is not always convenient to require access to model parameters; in particular, plagiarizers may purposely hide the copied models behind remote call APIs through which machine learning capabilities are provided as a service. Black-box verification therefore allows one to collect evidence through remote call APIs. This verification mode is accomplished by using adversarial samples during the training of ML models. Supporting evidence is then collected by submitting designated adversarial samples through the APIs and comparing the returned outcomes with the designated outputs. This type of technique is thus called backdoor-based ownership verification.
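The black-box mode just described can be pictured with a minimal sketch: the suspect model is reachable only through a prediction API (stubbed below as a plain Python callable), and ownership is claimed when the designated trigger inputs elicit the designated outputs sufficiently often. The trigger format, the stub API, and the 0.9 threshold are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def blackbox_verify(predict_api, triggers, designated_labels, threshold=0.9):
    """Query the suspect model on the designated triggers and check whether
    its outputs match the owner's designated labels often enough."""
    predictions = np.array([predict_api(x) for x in triggers])
    match_rate = float(np.mean(predictions == np.array(designated_labels)))
    return match_rate >= threshold, match_rate

# Illustrative usage with stand-ins: random trigger images, random designated
# labels, and a stub "remote" API that behaves like an unrelated classifier.
rng = np.random.default_rng(0)
triggers = [rng.normal(size=(32, 32, 3)) for _ in range(20)]
designated_labels = rng.integers(0, 10, size=20).tolist()

def suspect_api(x):
    return int(abs(hash(x.tobytes())) % 10)  # stand-in for a remote prediction API

claimed, rate = blackbox_verify(suspect_api, triggers, designated_labels)
print(f"trigger match rate = {rate:.2f}, ownership claimed: {claimed}")
```

An unrelated model, as in this stub, matches the designated labels only by chance, so the match rate stays far below the threshold; a genuinely backdoored copy would score close to 1.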

1.2.2 Protocols

The protocol of digital watermarking for machine learning models is a set of security requirements that guarantee that ownership verification is performed with indisputable evidence. Such a protocol usually incorporates the following determinants concerning watermark generation, embedding, verification, and persistency, as well as functionality preservation of the original machine learning models. As shown by the example protocol of deep neural network (DNN) model ownership verification below, a protocol may specify the minimal security requirements of these determinants, often quantified in terms of various measurements, e.g., the detection accuracy of watermarks. Readers may find these determinants throughout the whole book, and more examples of protocols are summarized in Chap. 2.


Definition 1.1 A deep neural network (DNN) model ownership verification scheme for a given network $\mathbb{N}()$ is defined as a tuple $\{\mathcal{G}, \mathcal{E}, \mathcal{V}_B, \mathcal{V}_W\}$ of processes, consisting of:

1. A watermark generation process $\mathcal{G}()$ that generates target white-box watermarks $S_W$ with extraction parameters $\theta$, and black-box watermarks $S_B$ with triggers $T$,

$\mathcal{G}() \rightarrow (S_W, \theta; S_B, T).$    (1.1)

2. A watermark embedding process $\mathcal{E}()$ that embeds the black-box watermarks $S_B$ and white-box watermarks $S_W$ into the model $\mathbb{N}()$,

$\mathcal{E}\left(\mathbb{N}() \mid (S_W, \theta; S_B, T)\right) \rightarrow \mathbb{N}().$    (1.2)

3. A black-box verification process $\mathcal{V}_B()$ that checks whether the model $\mathbb{N}()$ makes the designated inferences for the triggers $T$,

$\mathcal{V}_B(\mathbb{N}, S_B \mid T).$    (1.3)

4. A white-box verification process $\mathcal{V}_W()$ that accesses the model parameters $W$ to extract the white-box watermarks $\tilde{S}_W$ and compares $\tilde{S}_W$ with $S_W$,

$\mathcal{V}_W(W, S_W \mid \theta).$    (1.4)
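A minimal sketch of how the tuple of Definition 1.1 might be organized in code is given below. Definition 1.1 fixes only the interfaces of the four processes, so the concrete generation, embedding, and verification logic are left as caller-supplied callables; all names here are illustrative, not part of any particular scheme.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

# Evidence produced by G(): (white-box watermark S_W, extractor theta,
#                            black-box watermark S_B, triggers T)
Evidence = Tuple[Any, Any, Any, Any]

@dataclass
class OwnershipScheme:
    """Interface sketch of the tuple {G, E, V_B, V_W} in Definition 1.1."""
    generate: Callable[[], Evidence]               # G() -> (S_W, theta, S_B, T)
    embed: Callable[[Any, Evidence], Any]          # E(N | S_W, theta; S_B, T) -> N_WM
    verify_black: Callable[[Any, Any, Any], bool]  # V_B(N, S_B | T)
    verify_white: Callable[[Any, Any, Any], bool]  # V_W(W, S_W | theta)

def embed_then_verify(scheme: OwnershipScheme, model: Any,
                      get_params: Callable[[Any], Any]):
    """Run the protocol once: generate evidence, embed it, then check that
    both verification modes succeed on the watermarked model."""
    s_w, theta, s_b, triggers = scheme.generate()
    model_wm = scheme.embed(model, (s_w, theta, s_b, triggers))
    black_ok = scheme.verify_black(model_wm, s_b, triggers)
    white_ok = scheme.verify_white(get_params(model_wm), s_w, theta)
    return model_wm, black_ok, white_ok
```

Keeping the two verification modes behind a single interface lets the same ownership claim be exercised either remotely (black-box) or with full parameter access (white-box), mirroring the two modes discussed in Sect. 1.2.1.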

1.2.3 Applications

As will be shown by the following chapters of this book, digital watermarks can be used in a variety of application scenarios that can be briefly summarized as follows:

1. To protect model intellectual property rights (Model IPR): Training a machine learning model that is useful in practice is expensive, in terms of both the required training data and the cost of computing. For example, the training of large pre-trained deep neural network (DNN) models for natural language processing (NLP) may cost up to millions of dollars and weeks of training time [2]. The high costs of building up such competitive DNN models make it worth the risk for plagiarizers to illegally copy, re-distribute, or misuse those models. Digital model watermarking is therefore a technology that allows model owners to claim their ownership and to protect their respective intellectual property rights (IPR). This kind of application of model watermarking is showcased in Chaps. 3–9 of the present book.

2. To trace data usage: Modern machine learning algorithms are data-hungry, and it often requires trillions of bytes of data to develop large machine learning models, e.g., pre-trained deep neural network models for natural language processing (NLP). Nevertheless, the use of big training data has to be accomplished in a manner that protects data privacy and complies with regulatory rules such as the GDPR.⁴ The identification of unique watermarks embedded in a machine learning model helps to trace the training data used to develop the model in question. This application of model watermarking is illustrated in Chap. 11 of the book.

3. To identify model responsibility: Deep learning models are notoriously hard to predict, and models may misbehave in the presence of adversarial examples. Moreover, generative models (GANs) may be misused for illegal purposes, e.g., deepfake.⁵ For the sake of accountability, therefore, it is important to associate the outcomes of GAN models with the originating model and the owners of the model. This application of model watermarking is illustrated in Chap. 10.

⁴ GDPR is applicable as of May 25, 2018 in all European member states to harmonize data privacy laws across Europe. https://gdpr.eu/.
⁵ https://en.wikipedia.org/wiki/Deepfake.

1.3 Related Work

In general, the IPR of deep models can be protected by watermark embedding methods that can be broadly categorized into three types: (i) white-box watermark solutions [7, 8, 10, 13, 15]; (ii) black-box watermark methods [1, 5, 11, 14, 16]; and (iii) model fingerprinting methods [3, 12].

1.3.1 White-Box Watermarks

The first line of research [7, 8, 10, 15] embeds the watermark in the static content of deep neural networks (i.e., weight matrices), and the model owner verifies ownership with white-box access to all the model parameters. The first effort to employ watermarking technology in DNNs was the white-box protection by Uchida et al. [15], who successfully embedded watermarks into DNNs without degrading the host network performance. Recent works [7, 8] proposed passport-based verification schemes to enhance robustness against ambiguity attacks, which is fundamentally different from the watermark-based approaches. Ong et al. [13] provide IP protection for generative adversarial networks (GANs) in both a black-box and a white-box manner by embedding the ownership information into the generator through a novel regularization term. Furthermore, to defend against adversarial attacks that attempt to steal the model, Li et al. proposed to use external features that are robust to model stealing attacks [10].

1.3.2 Black-Box Watermarks

The white-box methods, however, are constrained in that they require access to all of the network weights or outputs to extract the embedded watermarks. As a remedy, Merrer et al. [4] proposed to embed watermarks into the classification labels of adversarial samples (trigger sets), allowing the watermarks to be extracted remotely via a service API without requiring access to the network's internal weight parameters. Later, Adi et al. [1] demonstrated that embedding watermarks in the network outputs (classification labels) amounts to an intentional backdoor. For RNN protection, Lim et al. [11] proposed a novel embedding framework that consists of two different embedding schemes to embed a unique secret key into the recurrent neural network (RNN) cell to protect the IP of image captioning models. For reinforcement learning, Chen et al. [5] proposed a temporal method that was applied to the IP protection of deep reinforcement learning models. The works in [14, 16] proposed to protect image processing networks trained for image denoising and recovery tasks.

1.3.3 Neural Network Fingerprints

DNN fingerprinting methods [3, 12] have been proposed as a non-invasive alternative to watermarking. This type of approach differs from watermarking in that the embedding of fingerprints does not require additional regularization terms during the training stage. Instead, these fingerprinting methods extract a unique set of features from the owner's model to differentiate it from other models. The ownership can be claimed if the same set of features, i.e., the identifiers of the owner's model, matches that of a suspect model. Due to the close relation between fingerprints and watermarks, we treat fingerprints as a special type of watermark in this book and refer to Chap. 5 for the use of fingerprints in protecting machine learning models.

1.4 About This Book

This book is, to the best of our knowledge, the first of its kind to provide an introductory coverage of how to use digital watermarks to verify the ownership of machine learning models. We hope that the book is a timely publication that introduces the related techniques, protocols, and applications that are necessary for people to understand and continue studying the topic. Given that machine learning models are becoming increasingly larger, more complex, and more expensive, the compelling need to identify the owners and origins of machine learning models has attracted much public attention. This book is in general suitable for readers with interests in machine learning and big data. In particular, people who are interested in deep learning, federated learning, image processing, natural language processing, etc., may find certain chapters relevant to their research. We also expect this introductory book to be a good reference book for students studying artificial intelligence and a handbook for engineers and researchers in industry.

References

1. Adi, Y., Baum, C., Cisse, M., Pinkas, B., Keshet, J.: Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In: 27th USENIX Security Symposium (USENIX) (2018)
2. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020)
3. Cao, X., Jia, J., Gong, N.Z.: IPGuard: Protecting intellectual property of deep neural networks via fingerprinting the classification boundary. In: Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security, pp. 14–25 (2021)
4. Chen, H., Rohani, B.D., Koushanfar, F.: DeepMarks: A digital fingerprinting framework for deep neural networks. arXiv e-prints, arXiv:1804.03648 (2018)
5. Chen, K., Guo, S., Zhang, T., Li, S., Liu, Y.: Temporal watermarks for deep reinforcement learning models. In: Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems, pp. 314–322 (2021)
6. Cox, I., Miller, M., Bloom, J., Fridrich, J., Kalker, T.: Digital Watermarking and Steganography, 2nd edn. Morgan Kaufmann, San Francisco (2007)
7. Fan, L., Ng, K.W., Chan, C.S.: Rethinking deep neural network ownership verification: Embedding passports to defeat ambiguity attacks. In: Advances in Neural Information Processing Systems, pp. 4714–4723. Curran Associates, Red Hook (2019)
8. Fan, L., Ng, K.W., Chan, C.S., Yang, Q.: DeepIP: Deep neural network intellectual property protection with passports. IEEE Trans. Pattern Anal. Mach. Intell. 1, 1–1 (2021)
9. Jeon, H.J., Youn, H.C., Ko, S.M., Kim, T.H.: Blockchain and AI meet in the metaverse. In: Fernández-Caramés, T.M., Fraga-Lamas, P. (eds.) Advances in the Convergence of Blockchain and Artificial Intelligence, Chap. 5. IntechOpen, Rijeka (2021)
10. Li, Y., Zhu, L., Jia, X., Jiang, Y., Xia, S.T., Cao, X.: Defending against model stealing via verifying embedded external features. In: AAAI (2022)
11. Lim, J.H., Chan, C.S., Ng, K.W., Fan, L., Yang, Q.: Protect, show, attend and tell: Empowering image captioning models with ownership protection. Pattern Recogn. 122, 108285 (2022)
12. Lukas, N., Zhang, Y., Kerschbaum, F.: Deep neural network fingerprinting by conferrable adversarial examples. In: International Conference on Learning Representations (2020)
13. Ong, D.S., Chan, C.S., Ng, K.W., Fan, L., Yang, Q.: Protecting intellectual property of generative adversarial networks from ambiguity attack. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
14. Quan, Y., Teng, H., Chen, Y., Ji, H.: Watermarking deep neural networks in image processing. IEEE Trans. Neural Netw. Learn. Syst. 32(5), 1852–1865 (2020)
15. Uchida, Y., Nagai, Y., Sakazawa, S., Satoh, S.I.: Embedding watermarks into deep neural networks. In: Proceedings of the 2017 ACM International Conference on Multimedia Retrieval, pp. 269–277 (2017)
16. Zhang, J., Chen, D., Liao, J., Zhang, W., Feng, H., Hua, G., Yu, N.: Deep model intellectual property protection via deep watermarking. IEEE Trans. Pattern Anal. Mach. Intell. 44(8), 4005–4020 (2022)

Chapter 2

Ownership Verification Protocols for Deep Neural Network Watermarks

Fangqi Li and Shilin Wang

Abstract To protect deep neural networks as intellectual properties, it is necessary to accurately identify their author or registered owner. Numerous techniques, spearheaded by the watermark, have been proposed to establish the connection between a deep neural network and its owner; however, only when such a connection is provably unambiguous and unforgeable can it be leveraged for copyright protection. The ownership proof is feasible only after multiple parties, including the owner, the adversary, and the third party to whom the owner wants to present a proof, operate under deliberate protocols. The design of these ownership verification protocols requires more careful insight into the knowledge and privacy concerns of the participants, during which process several extra security risks emerge. This chapter briefly reviews ordinary security requirements in deep neural network watermarking schemes, formulates several additional requirements regarding ownership proof under elementary protocols, and puts forward the necessity of analyzing and regulating the ownership verification procedure.

2.1 Introduction

The development of deep learning, a spearheading branch of modern artificial intelligence, is boosting the industrial application of deep neural networks (DNNs). After learning from abundant data, DNNs outperform traditional models in disciplines ranging from traditional signal processing to complex games and decision-making. The expense behind such intelligent models is prohibitive. To fulfill a specific task, sufficient domain-dependent data has to be manually collected, processed, and labeled. Designing the DNN architecture and tuning its parameters also involve expertise and tremendous effort. Therefore, emerging voices are calling for regulating DNNs as intellectual properties of their owners.


The prerequisite for DNN copyright management is to identify a DNN's legitimate owner by ownership verification (OV). Mainstream ownership protection methods for DNNs include fingerprints and watermarks. The fingerprint family extracts specific characteristics from a DNN as its identifier and uses them to probe potential piracy, and it need not modify the DNN [22]. Consequently, the correlation between a fingerprint and the owner's identity is intractable. On the other hand, DNN watermarking schemes embed an owner-dependent watermark into the DNN, whose later revelation proves the authentic ownership. Based on the access level at which the suspicious DNN can be interacted with, watermarking schemes can be classified into white-box ones and black-box ones. In the white-box setting, the owner or the third-party notary has full access to the pirated DNN. The watermark can be encoded into the model's parameters [16] or intermediate outputs for particular inputs [3]. The owner can also insert extra modules into the DNN's intermediate layers as passports [4]. As for the black-box setting, the pirated model can only be interacted with as an interface. Watermarking schemes for this case usually resort to backdoors [7, 21, 23], i.e., the final outputs for trigger inputs.

As shown in Fig. 2.1, a DNN watermarking scheme WM is composed of two modules, $\mathrm{WM} = \{\mathcal{G}, \mathcal{E}\}$. $\mathcal{G}$ is the identity information/watermark generation module, and $\mathcal{E}$ is the watermark embedding module. To insert a watermark into a DNN $\mathbb{N}$, the owner first runs $\mathcal{G}$ to produce its watermark/identifier $S$,

$S \leftarrow \mathcal{G}(N),$    (2.1)

where $N$ is the security parameter that determines the security level of the ownership protection and $\leftarrow$ denotes a stochastic generation process. Having obtained $S$, the owner inserts it into the DNN product by running $\mathcal{E}$,

$\{\mathcal{V}, \mathbb{N}_{\mathrm{WM}}\} \leftarrow \mathcal{E}(\mathbb{N}, S),$    (2.2)

Fig. 2.1 The DNN watermarking and ownership verification framework


in which $\mathbb{N}_{\mathrm{WM}}$ is the watermarked DNN and $\mathcal{V}$ is a verifier with which the ownership can be correctly identified:

$\mathcal{V}(\mathbb{N}_{\mathrm{WM}}, S) = \mathrm{Pass}.$    (2.3)

Most watermarking schemes can be boiled down to Eqs. (2.1)–(2.3). Degrees of freedom within this formulation include: (i) diversified definitions of the legal watermark space (there are also zero-bit schemes that do not digitally encode the owner's identity [6]); (ii) whether $\mathcal{E}$ involves extra knowledge, e.g., the training dataset; (iii) whether $\mathcal{V}$ is a product of running $\mathcal{E}$ or only depends on WM; and (iv) whether $\mathcal{V}$ retrieves $S$ from the watermarked DNN or takes $S$ as its input, etc.

Example 2.1 (Uchida [16]) As a pioneering white-box DNN watermark, the scheme of Uchida et al. consists of the following components:

• $\mathcal{G}$: Selecting $N$ parameter positions $\theta = (p_n)_{n=1}^{N}$ from $\mathbb{N}$, generating $N$ real numbers $(r_n)_{n=1}^{N}$, and returning $S = \left(\theta, (r_n)_{n=1}^{N}\right)$.
• $\mathcal{E}$: For $n = 1, 2, \cdots, N$, setting the parameter at $p_n$ to $r_n$.
• $\mathcal{V}$: Retrieving $(p_n)_{n=1}^{N}$ and $(r_n)_{n=1}^{N}$ from $S$; if the parameter at $p_n$ within $\mathbb{N}_{\mathrm{WM}}$ equals $r_n$ for $n = 1, 2, \cdots, N$, then returning Pass.

In this formulation, $\mathcal{V}$ is independent of $S$. Uchida's definition can be modified by moving $\theta$ into the verifier, so different users would obtain different verifiers.

Example 2.2 (Random Triggers [21]) As an exemplary black-box watermark for image classification DNNs, the random trigger scheme encodes the digital identity information into triggers and their labels. Its components are as follows:

• $\mathcal{G}$: Generating $N$ random real number vectors $\mathbf{S} = \{s_1, s_2, \cdots, s_N\}$. The $n$-th trigger is $T_n = T(s_n)$, and its label is $l_n = l(s_n)$, where $T(\cdot)$ is an image generator that transforms a real number vector into an image (e.g., a pixelwise mapping), and $l(\cdot)$ is a pseudorandom function that maps a real number vector into a legal label of the classification task. Returning $S = \{\mathbf{S}, T(\cdot), l(\cdot)\}$.
• $\mathcal{E}$: Training $\mathbb{N}$ on both the labeled triggers and the normal training dataset.
• $\mathcal{V}$: Reconstructing the $N$ triggers and their labels from $S$; if $\mathbb{N}_{\mathrm{WM}}$'s accuracy on the trigger set is higher than a predefined threshold (e.g., 90%), then returning Pass.

In this setting, the initial vectors $\mathbf{S}$ can be set as the owner's digital signature. The mappings $T(\cdot)$ and $l(\cdot)$ could be fixed as a part of $\mathcal{G}$ (and therefore a part of WM), so $S$ is reduced to $\mathbf{S}$. Notice that backdoor-based DNN watermarking schemes usually require the training dataset during $\mathcal{E}$ to preserve the model's functionality, and a clean model $\mathbb{N}$ might never exist.

This chapter first reviews the basic security requirements of DNN watermarks from a formal perspective. Then we proceed to the ownership verification protocols under which the owner proves its proprietorship to a third party, where several new threats and security requirements emerge. We provide a computationally formal analysis of these extra security definitions and highlight their necessity in designing provably secure and practical DNN watermarking schemes. Finally, we raise a series of challenges that urge solutions to practical OV protocols for DNNs.
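To make Example 2.1 concrete, the following toy sketch follows its simplified description literally, with the model's parameters exposed as a flat NumPy array. Practical white-box schemes such as Uchida et al.'s embed the watermark through a training-time regularizer rather than by overwriting parameters directly, so this is an illustration of the (G, E, V) interface, not the published algorithm.

```python
import numpy as np

def generate(num_params: int, N: int, seed: int = 0):
    """G: pick N parameter positions and N target values (the watermark S)."""
    rng = np.random.default_rng(seed)
    positions = rng.choice(num_params, size=N, replace=False)
    values = rng.normal(size=N)
    return positions, values

def embed(weights: np.ndarray, watermark):
    """E: set the parameter at each chosen position to its target value."""
    positions, values = watermark
    marked = weights.copy()
    marked[positions] = values
    return marked

def verify(weights: np.ndarray, watermark, tol: float = 1e-6) -> bool:
    """V: ownership passes only if every chosen position still holds its value."""
    positions, values = watermark
    return bool(np.all(np.abs(weights[positions] - values) < tol))

weights = np.random.default_rng(1).normal(size=10_000)  # stand-in for flattened DNN weights
wm = generate(weights.size, N=64)
marked = embed(weights, wm)
print(verify(marked, wm), verify(weights, wm))  # True False (with overwhelming probability)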

2.2 Security Formulation

2.2.1 Functionality Preserving

As a security mechanism, the watermark should not sacrifice the protected DNN's performance. Although this aspect has been empirically studied by current DNN watermarking schemes, a formal and universal analysis of the trade-off between functionality and security remains absent. The expected decline of the DNN's performance after watermarking, $\Delta(\mathrm{WM}, N)$, depends on both the watermarking scheme and the security parameter. The numerical dependency of $\Delta$ on WM is intractable, and we have no more results on $\Delta$'s correlation with $N$ other than monotonicity. For white-box DNN watermarking schemes that modify a portion of the DNN's parameters, the DNN's functionality might drop catastrophically if too many parameters have been modified. The influence of parameter modification within a DNN is similar to that of neuron-pruning, as in the example given in Fig. 2.2. The concern about functionality preservation also holds for black-box DNN watermarking schemes that inject a backdoor, i.e., a collection of labeled triggers whose distribution differs from normal samples, as the ownership evidence. Tuning a DNN on the triggers without referring to its original training dataset almost always destroys its functionality; an example is given in Fig. 2.3. Unfortunately, the assumption that the agent who runs a black-box DNN watermarking scheme has full access to the training dataset fails in many industrial scenarios. For example, the IP manager in a corporation might be oblivious to the training process, and the purchaser of a DNN product who does not know the data also needs to inject its identifier into the DNN.

Fig. 2.2 The variation of the classification accuracy on CIFAR-10 of ResNet-50

Fig. 2.3 The variation of the classification accuracy on CIFAR-10 of ResNet-50; N denotes the number of triggers, and the dashed curves denote the accuracy on the triggers

In conclusion, selecting an appropriate security parameter remains a nontrivial task for the owner, especially when the owner does not hold sufficient evidence to test the DNN's performance. Some works proposed to adopt regularizers to preserve the DNN's performance during watermark injection and reported a slight $\Delta$ when $N$ is small, but their generality to arbitrary DNN architectures and very large $N$ is questionable.
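One pragmatic way to choose the security parameter is to estimate $\Delta(\mathrm{WM}, N)$ empirically for several candidate values of $N$ before committing to one. The sketch below assumes a held-out evaluation function and an embedding routine are available; both are replaced by toy stand-ins here, so only the measurement loop reflects the discussion above.

```python
import numpy as np

def estimate_delta(weights, eval_fn, embed_fn, lengths=(16, 64, 256, 1024)):
    """Estimate Delta(WM, N) = acc(clean) - acc(watermarked) for several
    candidate watermark lengths N, so the owner can pick an acceptable N."""
    base = eval_fn(weights)
    return {N: base - eval_fn(embed_fn(weights, N)) for N in lengths}

# Toy stand-ins: held-out "accuracy" shrinks with the amount of perturbation,
# and "embedding" overwrites N randomly chosen parameters.
rng = np.random.default_rng(0)
clean = rng.normal(size=100_000)

def toy_eval(weights):
    return 0.95 - 0.5 * float(np.mean(np.abs(weights - clean)))

def toy_embed(weights, N):
    marked = weights.copy()
    idx = rng.choice(marked.size, size=N, replace=False)
    marked[idx] = rng.normal(size=N)
    return marked

print(estimate_delta(clean, toy_eval, toy_embed))  # Delta grows monotonically with N
```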

2.2.2 Accuracy and Unambiguity

The integrity of the ownership verification relies on the accuracy of the watermarking scheme's verifier. To formally quantify this security, the following form of Eq. (2.3) is usually adopted:

$\Pr\{\mathcal{V}(\mathbb{N}_{\mathrm{WM}}, S) = \mathrm{Pass}\} \geq 1 - \epsilon(N),$    (2.4)

where $\Pr\{\cdot\}$ integrates out the randomness in computing $\mathcal{V}$ and $\epsilon(N)$ is a positive negligible term in $N$, i.e., it declines faster than the inverse of any finite polynomial in $N$. Meanwhile, a random guess on the identity information must not pass the verifier. Otherwise, a cheap ambiguity attack could lead to confusion when $\mathcal{V}$ has been submitted as the evidence or when $\mathcal{V}$ is independent of $S$. The unambiguity property is formulated as

$\Pr\{\mathcal{V}(\mathbb{N}_{\mathrm{WM}}, S') = \mathrm{Pass}\} \leq \epsilon(N),$    (2.5)

where $S' \leftarrow \mathcal{G}(N)$ is a piece of randomly generated identity information.


If Eqs. (2.4) and (2.5) hold, then the owner can choose an appropriate $N$ for the desired level of security. One might also consider another version of unambiguity to reduce the false positive rate,

$\Pr\{\mathcal{V}'(\mathbb{N}'_{\mathrm{WM}}, S) = \mathrm{Pass}\} \leq \epsilon(N),$    (2.6)

in which $\mathcal{V}'$ and $\mathbb{N}'_{\mathrm{WM}}$ belong to an irrelevant party. Notice that Eq. (2.6) is precisely Eq. (2.5) after switching the positions of the two owners. Discussions on accuracy and unambiguity have prevailed in recent works on the DNN watermark, and many theoretical results have been reported. These properties form the mathematical foundation for provable and presentable ownership, and the corresponding watermarking schemes are more promising candidates than their zero-bit competitors.

Example 2.3 (MTL-Sign [10]) The MTL-Sign watermarking scheme models ownership verification as an extra task for the protected DNN. The watermarking task has an independent classification backend, as shown in Fig. 2.4. All $N$ input images and corresponding labels for the watermarking task are derived from $S$ using a pseudorandom mapping, e.g., QR codes. It has been proven that if the size of the domain of optional triggers is larger than $\log_2(N^3)$, then the scheme meets the unambiguity condition defined by Eq. (2.5). The proof is a straightforward application of the Chernoff theorem.

Fig. 2.4 MTL-Sign: $\mathcal{T}_{\mathrm{primary}}$ is the original primary task, and $\mathcal{T}_{\mathrm{WM}}$ is the watermarking task that encodes the owner's identity

Example 2.4 (Revisiting Random Triggers) If the label mapping $l(\cdot)$ in Example 2.2 is modified to a unique label for all triggers, i.e., $\forall n, n' \in \{1, 2, \ldots, N\}$, $l(s_n) = l(s_{n'})$, then the scheme's unambiguity is at risk. Since all triggers are generated by the encoder $T(\cdot)$, the watermarked DNN may capture the statistical characteristics of $T(\cdot)$, so triggers derived from other vectors evoke the same prediction. Consequently, an adversary's complexity in succeeding in an unambiguity attack is $O(1)$.

2.2.3 Persistency

DNNs are inherently robust to tuning, so an adversary can modify a stolen model without destroying its functionality. Consequently, the injected watermarks have to be robust against such modifications. This property is known as the DNN watermark's persistency or robustness. Persistency is defined w.r.t. a specific type of DNN tuning [1, 19]. Watermark designers have been focusing on three levels of tuning methods:

• In Blind Tuning [5, 14, 18], the adversary does not know the watermark. Naive fine-tuning, neuron pruning, fine-pruning [13], input reconstruction [15], model extraction, and compression can invalidate weak watermarks while preserving the pirated DNN's functionality. Most established watermarking schemes have been reported to be robust against blind tuning unless the adversary has comprehensive knowledge of the training dataset; in the latter case, the adversary also has to risk the DNN's performance for piracy (e.g., by model distillation).
• In Semi-knowledgeable Tuning, the owner is confronted by an adversary that has learned the watermarking scheme (e.g., by referring to published papers). Having learned the algorithm, the adversary can generate its own watermark and embed it into the pirated DNN. Such overwriting results in copyright confusion and spoils watermarking schemes that store the identity information in unique locations within the DNN.
• In Knowledgeable Tuning, the adversary acquires both the watermarking scheme and the identity information S. This attack occurs when the owner has submitted its evidence for ownership proof through an insecure channel or the ownership examiner betrays the owner. Once S is obtained, almost all watermarking schemes fail to function. For black-box schemes, the adversary can independently reproduce triggers and prevent them from entering the pirated DNN [15]. For white-box schemes, the adversary can modify the parameters where the identity information is embedded while minimizing the impact on the DNN's overall performance.

There have been plentiful empirical studies on the persistency against blind and semi-knowledgeable tuning. Formally, the persistency against blind, semi-knowledgeable, and knowledgeable tuning can be formulated through Algorithm 1: the persistency of WM against blind/semi-knowledgeable/knowledgeable tuning is tantamount to the statement that no efficient blind/semi-knowledgeable/knowledgeable adversary A can win Exp^tuning(N, WM, A, δ) for any N and a small δ. Current studies have not established theoretical persistency against tuning from Algorithm 1 by security reduction, so their security against unknown adversaries remains unreliable.


Algorithm 1 The tuning attack experiment, Exp^tuning(N, WM, A, δ)
Input: The clean DNN N, the watermarking scheme WM (with security parameter N), the adversary A, and the tolerance of functionality decline δ.
Output: Whether the adversary wins or not.
1: S ← G(N);
2: (N_WM, V) ← E(N, S);
3: A blind A is given N_WM;
4: A semi-knowledgeable A is given N_WM and WM;
5: A knowledgeable A is given N_WM, WM, S, and V;
6: A returns N̂;
7: if V(N̂, S) = Pass and the performance of N̂ declines by less than δ compared with N_WM then
8:   Return Win;
9: else
10:   Return Fail;
11: end if
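A minimal executable sketch of Algorithm 1 is given below. The generator G, embedder E, adversary, and performance function are passed in as hypothetical callables, since the experiment is agnostic to the concrete watermarking scheme.

```python
def exp_tuning(model, G, E, adversary, performance, delta, knowledge="blind"):
    """Tuning attack experiment Exp^tuning(N, WM, A, delta) from Algorithm 1 (sketch).

    G()            -> identity information S                      # hypothetical
    E(model, S)    -> (watermarked_model, verifier V)             # hypothetical
    adversary(...) -> tampered model N_hat                        # hypothetical
    performance(m) -> scalar score of model m on the primary task # hypothetical
    """
    S = G()
    wm_model, V = E(model, S)
    if knowledge == "blind":
        n_hat = adversary(wm_model)
    elif knowledge == "semi":
        n_hat = adversary(wm_model, scheme=(G, E))                 # adversary also knows WM
    else:  # knowledgeable
        n_hat = adversary(wm_model, scheme=(G, E), S=S, V=V)
    functionality_kept = performance(wm_model) - performance(n_hat) < delta
    return "Win" if V(n_hat, S) == "Pass" and functionality_kept else "Fail"
```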

Example 2.5 (Label Interpretation) A trivial attack against the random triggers scheme mentioned in Example 2.2 is to reinterpret the classification labels (e.g., using their hyponyms) so that the mapping introduced by l(·) might fail to hold. To tackle this attack, we remark that l(·)'s semantics lies in the identity between the predictions of specific trigger pairs. Concretely, V should work as follows:
1. Reconstruct the N triggers from S = {s_n}_{n=1}^N.
2. For each pair n, n′ ∈ {1, 2, . . . , N} with l(s_n) = l(s_{n′}), examine whether the suspicious model produces identical predictions, i.e., N(T(s_n)) = N(T(s_{n′})).
3. Return Pass if the accuracy is higher than a predefined threshold.
Under this setting, the size of the range of l(·) can be much smaller than that of the original task.

Example 2.6 (Neuron Permutation [11]) Knowing the layers or neurons where the watermark is injected, the adversary can permute homogeneous neurons within such a layer and invalidate white-box verification, as demonstrated in Fig. 2.5. Although it is possible to re-align the neurons, the cost is relatively high compared with that of the permutation itself. This attack has no impact on the DNN's overall functionality.
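The neuron-permuting attack of Example 2.6 can be reproduced in a few lines for a fully connected layer: permuting the neurons of one layer and permuting the next layer's input weights accordingly leaves the network's function unchanged while scrambling the parameter locations a white-box verifier may rely on. The two-layer perceptron below is a self-contained illustration, not the construction of [11].

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((64, 32)), rng.standard_normal(64)   # layer 1: 32 -> 64
W2, b2 = rng.standard_normal((10, 64)), rng.standard_normal(10)   # layer 2: 64 -> 10

def forward(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0.0)          # ReLU hidden layer
    return W2 @ h + b2

perm = rng.permutation(64)                     # permute the 64 hidden neurons
W1p, b1p = W1[perm], b1[perm]                  # rows of W1 / entries of b1 follow the neurons
W2p = W2[:, perm]                              # columns of W2 are permuted accordingly

x = rng.standard_normal(32)
assert np.allclose(forward(x, W1, b1, W2, b2), forward(x, W1p, b1p, W2p, b2))
```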

2.2.4 Other Security Requirements

Many additional security requirements might be important for special scenarios. For example, efficiency requires that the time consumption of watermark injection be small [19]. Covertness demands that any third party with the knowledge of WM cannot distinguish a watermarked DNN from a clean one [17]. Covertness is formally defined through the experiment in Algorithm 2.

Fig. 2.5 The neuron-permuting attack: (a) the normal ownership verification; (b) after the neuron-permuting attack

Algorithm 2 The covertness experiment, Exp^covertness(N, WM, A)
Input: The clean DNN N, the watermarking scheme WM (with security parameter N), and the adversary A.
Output: Whether the adversary wins or not.
1: S ← G(N);
2: (N_WM, V) ← E(N, S);
3: Let N_0 = N, N_1 = N_WM;
4: b ← {0, 1};
5: A is given N_b and WM;
6: A returns b̂;
7: if b = b̂ then
8:   Return Win;
9: else
10:   Return Fail;
11: end if

The bit b̂ returned from A represents the adversary's opinion on whether the model it received has been watermarked (b̂ = 1) or not (b̂ = 0). WM is covert iff no efficient adversary can win Exp^covertness(N, WM, A) with probability significantly higher than 1/2. The counterpart of covertness is to attach an extra program to each DNN watermarking scheme, with which any party can examine whether or not a DNN has been watermarked. We name this property Existential [9]. The necessity of this additional property becomes evident only after we introduce more complex DNN commercialization scenarios and protocols.


2.3 The Ownership Verification Protocol for DNN

The ownership verification involves three parties: the owner, the adversary, and the arbiter. The adversary has obtained the owner's DNN and uploaded it as its product or deployed it as a commercial service. The arbiter is an authority or a distributed community to whom the owner is going to present the ownership proof. To complete the ownership proof, the benign parties follow these steps:
1. The owner nominates the adversary and submits the evidence, including S and V, to the arbiter.
2. The arbiter accesses the adversary's N and examines whether the owner's identity can be retrieved by computing V(N, S).
3. The arbiter announces the verification result.
The entire process is demonstrated in Fig. 2.6. The ownership proof is valid only if an impartial arbiter independently accesses the suspicious DNN and runs the verifier. Schemes that focus on evaluating whether two DNNs are homologous can hardly form an ownership proof [2]. Some schemes incorporate access control into copyright regulation: they distribute different triggers to different levels of users, with which the users can evoke different levels of performance from the same DNN. However, the trigger set shares the same level of secrecy as the customers' secret keys (by which the server identifies their level of service) and brings no extra merit to the security.

Fig. 2.6 The ownership verification process

2.3.1 The Boycotting Attack and the Corresponding Security

Under the OV protocol, DNN theft is not the only threat. Instead, an adversary might abuse the OV protocol to challenge legal services. A representative threat occurs when an adversary succeeds in forging evidence that passes a verifier. Such an adversary can abuse the OV protocol to boycott normal services, and we name this attack the Boycotting Attack. Formally, the security against the boycotting attack is formulated through Algorithm 3.

Algorithm 3 The boycotting attack experiment, Exp^boycott(N, WM, A)
Input: The target DNN N, the watermarking scheme WM (with security parameter N), and the adversary A.
Output: Whether the adversary wins or not.
1: A is given N and WM;
2: A returns the forged evidence Ŝ and V̂;
3: if V̂(N, Ŝ) = Pass then
4:   Return Win;
5: else
6:   Return Fail;
7: end if

A watermarking scheme WM is secure against the boycotting attack iff, for any DNN model N and any efficient adversary A (an efficient adversary terminates within time complexity polynomial in N), the probability that Exp^boycott(N, WM, A) returns Win is negligible in N. Notice that the security against the boycotting attack is different from the unambiguity defined by Eq. (2.5): the adversary can now forge the evidence adversarially rather than randomly picking one. Consequently, the proof of security against the boycotting attack is harder and usually relies on extra assumptions and cryptological primitives. The boycotting attack is referred to as non-ownership piracy in some literature [19].

Example 2.7 (Chained Random Triggers [23]) The random triggers scheme in Example 2.2 is secure against the ambiguity attack but is insecure against the boycotting attack. Since the label of a trigger can be seen as a random guess, the expected number of trials needed to find a vector s whose trigger is classified as l(s) by the target model is at most linear in the number of classes C. Therefore, an adversary can forge its S within time complexity no larger than O(C · N). To alleviate this threat, we only have to chain up the triggers. Concretely, let S = s_1; for n = 2, 3, . . . , N, set s_n = h(s_{n−1}), where h(·) is a one-way hash function. Now the probability that an adversary succeeds in forging a correct watermark is roughly C^{−N}, since the boycotting adversary can do no better than a random guess. This is the prototype of the scheme proposed by Zhu et al. [23]. The two schemes are compared in Fig. 2.7.
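A minimal sketch of the chaining idea in Example 2.7 is shown below, using SHA-256 as the one-way hash h(·); the encoder T(·) and label mapping l(·) are scheme-specific details omitted here. An arbiter can regenerate the whole trigger chain from s_1 alone, so an adversary cannot assemble a valid chain by independently searching for triggers one at a time.

```python
import hashlib, os

def h(x: bytes) -> bytes:
    """One-way hash h(.) instantiated with SHA-256."""
    return hashlib.sha256(x).digest()

def make_chained_evidence(N: int):
    """Generate the chained triggers s_1, ..., s_N with s_n = h(s_{n-1})."""
    chain = [os.urandom(32)]                 # s_1: a random seed kept by the owner
    for _ in range(N - 1):
        chain.append(h(chain[-1]))
    return chain

def verify_chain(chain) -> bool:
    """The arbiter checks that every trigger is chained to its predecessor."""
    return all(s == h(prev) for prev, s in zip(chain, chain[1:]))

evidence = make_chained_evidence(16)
assert verify_chain(evidence)
```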


Fig. 2.7 The random triggers scheme vs. the chained random triggers scheme: (a) the random triggers scheme; (b) the chained random triggers scheme

Fig. 2.8 The overwriting attack timeline. Since the registration time satisfies t_1 < t_2, the DNN would be judged as party A's belonging

2.3.2 The Overwriting Attack and the Corresponding Security

Apart from boycotting a legal service, an adversary can download the DNN, embed its own watermark into it, and claim the watermarked DNN as its product. Since both the owner and the adversary can then prove their ownership to the arbiter, confusion arises and the owner might even be accused of boycotting. Most established DNN watermarking schemes cannot prevent an adversary from conducting this naive overwriting, so it is necessary to correlate the watermark evidence with an authorized time stamp. The owner is encouraged to register its ownership by storing the hash of the evidence in an unforgeable memory, e.g., with the IP service arbiter or a distributed community organized by a consensus protocol. This solution is reliable only if an adversary cannot register a piece of evidence, download a DNN, and then tune the DNN to fit the registered evidence; this Overwriting Attack is visualized in Fig. 2.8. We formulate the security against this attack through Algorithm 4. A watermarking scheme WM is defined to be secure against the overwriting attack iff no efficient adversary can win Exp^overwrite(N, WM, A, h, δ) with non-negligible probability for any N and one-way h(·). The overwriting attack is sometimes referred to as ownership piracy. One crux in analyzing this security is the ambiguity in the trade-off between watermarking and performance, as outlined in Sect. 2.2.1.


Algorithm 4 The overwriting attack experiment, Exp^overwrite(N, WM, A, h, δ)
Input: The target DNN N, the watermarking scheme WM (with security parameter N), the adversary A, and a fixed one-way hash function h(·).
Output: Whether the adversary wins or not.
1: A is given WM;
2: A returns a hash value H;
3: A is given N;
4: A returns N̂ and evidence Ŝ, V̂;
5: if V̂(N̂, Ŝ) = Pass and h(Ŝ, V̂) = H and the performance of N̂ declines by less than δ compared with N then
6:   Return Win;
7: else
8:   Return Fail;
9: end if

Most watermarking schemes are designed so that their injection has little influence on the DNN's performance, i.e., ∀S ← G(N), Δ(WM, S) ≤ δ. So an adversary can trivially win Exp^overwrite if V is independent of S and of the DNN to be watermarked. Concretely, the adversary works as follows:
1. Ŝ ← G(N).
2. Return H = h(Ŝ).
3. Receive N.
4. (N̂, V) ← E(N, Ŝ).
5. Return N̂, Ŝ, and V.
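The trivial adversary above can be written out explicitly. The sketch below assumes a hypothetical generator G, embedder E, and a hash over serialized evidence; it succeeds exactly because the evidence it registers does not depend on the DNN it later receives.

```python
import hashlib, pickle

def trivial_overwrite_adversary(G, E):
    """Adversary for Exp^overwrite when V is independent of the target DNN (sketch).

    G()          -> identity information S_hat            # hypothetical
    E(model, S)  -> (watermarked model, verifier)          # hypothetical
    """
    S_hat = G()                                            # step 1: pick evidence before seeing N
    H = hashlib.sha256(pickle.dumps(S_hat)).hexdigest()    # step 2: register its hash (time stamp)

    def after_receiving(model):                            # steps 3-5: fit the pirated DNN to S_hat
        model_hat, V = E(model, S_hat)
        return model_hat, S_hat, V

    return H, after_receiving
```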

As a result, a necessary condition for the security against the overwriting attack is that V depends on both S and the watermarked DNN. If V cannot be built without the DNN, then it is computationally infeasible for the adversary to predict its hash value H. Otherwise, an adversary can defeat the authentic owner with a fabricated time stamp. As discussed in Example 2.1, for most white-box DNN watermarking schemes, the separation between V and WM can be achieved without changing the algorithm. For black-box schemes, the dependency of V on the watermarked DNN cannot be examined from the arbiter's perspective, so complete security against overwriting fails to hold. One might propose to eliminate this insecurity for black-box schemes by demanding that all users register the hash of their DNN together with the evidence. However, in the black-box setting, it has been assumed that the adversary would not provide white-box access to its DNN, let alone its hash value. A complete solution to this dilemma has to rely on privacy-preserving or zero-knowledge ownership proofs, which have not yet been designed for DNNs. An alternative defense against the overwriting attack is to build the watermark embedding process upon the owner's characteristic knowledge. For example, if E implicitly relies on the training dataset, then an adversary can hardly conduct the overwriting attack. But this setting contradicts the watermarking scheme's availability, and it remains hard to infer a lower bound on the performance decline under an intelligent adversary's overwriting.

Fig. 2.9 The security against the boycotting attack vs. the security against the overwriting attack, contrasting adversaries that can predict the evidence without modifying the pirated DNN with adversaries that can neither predict the evidence nor keep the pirated DNN intact

Remark that, although expensive, the registration step can solve the boycotting attack, so a DNN watermarking scheme adopted under this protocol need not consider the corresponding security. The security against the overwriting attack is not strictly stronger than that against the boycotting attack, since the adversary in Exp^overwrite(N, WM, A, h, δ) is allowed to change the DNN, while the adversary in Exp^boycott(N, WM, A) cannot. An illustration of these two types of security is given in Fig. 2.9.

2.3.3 Evidence Exposure and the Corresponding Security

The security against the boycotting attack and the overwriting attack establishes the persistency of the ownership against semi-knowledgeable adversaries, i.e., adversaries that know the watermarking scheme and the OV protocol. Once the owner and the arbiter finish an ownership proof, extra threats emerge, since the adversary could have been enhanced into a knowledgeable one. The evidence might be exposed to the adversary if: (i) the adversary or its accomplice eavesdrops on the communication between the owner and the arbiter; or (ii) the arbiter betrays the owner. So far, no DNN watermarking scheme is provably secure against this attack, as discussed in Sect. 2.2.3. As a result, each piece of evidence can support only one ownership proof. One natural solution to this problem is to insert multiple watermarks into the DNN to be protected. This setting ends up with a collection of watermarks and verifiers, so that spoiling a single pair (S, V) cannot damage the overall copyright.


To put this solution into practice, the watermarking scheme has to meet two additional requirements [9]:
• Large capacity: A large number of watermarks, each with a different S, can be injected into and correctly retrieved from the DNN without significantly damaging its functionality.
• Independence: A knowledgeable attack against one watermark cannot invalidate the unrevealed watermarks.
The demands on capacity and independence can be seen as extensions of accuracy and persistency. There are only a few empirical results on watermarking capacity, let alone an analytic bound. Since watermarking reduces the DNN's performance on normal inputs, the accumulation of watermarks inevitably sabotages the DNN to be protected. Increasing the number of watermarks can usually be reduced to increasing the security parameter N; this is the reason why we are interested in Δ(WM, N) when N is very large. As demonstrated in Fig. 2.2, in general, Δ(WM, a × N) ≥ a × Δ(WM, N) for any positive integer a.

One lower bound, Cap(N, WM, δ), for the watermark capacity can be computed as in Algorithm 5. Intuitively, Cap(N, WM, δ) measures the maximal number of watermarks that can be correctly injected into and retrieved from N by WM without reducing the performance by more than δ. It is possible to adopt a better embedding framework than simply repeating the embedding, so Cap(N, WM, δ) is only a lower bound.

Algorithm 5 The watermarking capacity, Cap(N, WM, δ)
Input: The clean DNN N, the watermarking scheme WM (with security parameter N), and the bound of performance decline δ.
Output: The watermarking capacity.
1: N_0 = N;
2: k = 0;
3: flag = True;
4: while flag do
5:   ++k;
6:   S_k ← G(N);
7:   (N_k, V_k) ← E(N_{k−1}, S_k);
8:   if the performance of N_k has declined by more than δ then
9:     flag = False;
10:  end if
11:  for j = 1 to k do
12:    if V_j(N_k, S_j) = Fail then
13:      flag = False;
14:    end if
15:  end for
16: end while
17: Return (k − 1);
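A direct Python transcription of Algorithm 5 is sketched below; G, E, the verifiers, and the performance function are hypothetical callables supplied by a concrete scheme, and the loop simply keeps stacking watermarks until one fails to verify or the accumulated performance drop exceeds δ.

```python
def watermark_capacity(model, G, E, performance, delta, max_iters=1000):
    """Lower-bound capacity Cap(N, WM, delta) from Algorithm 5 (sketch)."""
    baseline = performance(model)
    current = model
    verifiers, identities = [], []
    k = 0
    while k < max_iters:
        k += 1
        S_k = G()                              # fresh identity information
        current, V_k = E(current, S_k)         # embed the k-th watermark on top of the rest
        verifiers.append(V_k)
        identities.append(S_k)
        if baseline - performance(current) > delta:
            break                              # functionality bound exceeded
        if any(V(current, S) != "Pass" for V, S in zip(verifiers, identities)):
            break                              # an earlier watermark no longer verifies
    return k - 1
```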


Watermarking independence can only be measured after fixing the adversary's knowledgeable attack scheme. A prototype evaluation method is outlined in Algorithm 6. Indep(N, WM, A, K, K̂) estimates the percentage of watermarks that can still be correctly retrieved from N after K̂ out of K watermarks have been exposed to A. An implementation of Algorithm 6 provides only an upper bound on the percentage of surviving watermarks, since the actual knowledgeable attack is invisible to the owner or the watermark designer. Theoretically, the watermarking independence after exposing K̂ out of K watermarks is

min_A Indep(N, WM, A, K, K̂),    (2.7)

where A ranges over all efficient adversaries.

Algorithm 6 The watermarking independence, Indep(N, WM, A, K, K̂)
Input: The clean DNN N, the watermarking scheme WM (with security parameter N), the adversary A, the number of watermarks K, and the number of exposed watermarks K̂.
Output: The watermarking independence.
1: N_0 = N;
2: for k = 1 to K do
3:   S_k ← G(N);
4:   (N_k, V_k) ← E(N_{k−1}, S_k);
5: end for
6: Sample an index set I of size K̂ from {1, 2, · · · , K};
7: A is given N_K, WM, and {S_i, V_i}_{i∈I};
8: A conducts the knowledgeable attack and returns a DNN N̂;
9: a = 0;
10: for k ∈ {1, 2, · · · , K} \ I do
11:   if V_k(N̂, S_k) = Pass then
12:     ++a;
13:   end if
14: end for
15: Return a / (K − K̂);

An estimation of capacity and independence for several watermarking schemes on three datasets is given in Table 2.1. For MNIST, CIFAR10, and CIFAR100, the underlying DNN was ResNet-50. The upper bound of the performance decline was set to the original classification error rate. For independence, the adversary tuned the DNN w.r.t. the gradient that minimizes the accuracy of V. We summarize the discussed threats and their corresponding security under the OV protocol in Table 2.2.

Table 2.1 Evaluation of advanced security requirements for DNN watermarks, K = 50 and K̂ = 5

Watermarking scheme | MNIST Capacity | MNIST Independence | CIFAR10 Capacity | CIFAR10 Independence | CIFAR100 Capacity | CIFAR100 Independence
Uchida's            | ≥1000          | 94.1%              | ≥1000            | 95.3%                | ≥1000             | 98.2%
Random triggers     | 111            | 30.2%              | 312              | 41.0%                | 412               | 21.2%
Wonder filter       | 194            | 41.3%              | 473              | 36.1%                | 479               | 12.9%
MTL-Sign            | ≥1000          | 79.5%              | ≥1000            | 78.0%                | ≥1000             | 77.5%

Table 2.2 Attacks on the DNN watermark under the OV protocol and their corresponding security definitions

Attack             | Security definition
Ambiguity          | Equation (2.5)
Blind              | Algorithm 1
Semi-knowledgeable | Algorithm 1
Knowledgeable      | Algorithms 1, 5, and 6
Boycott            | Algorithm 3
Overwrite          | Algorithm 4


2.3.4 A Logic Perspective of the OV Protocol

We are now ready to formulate the OV protocol for DNN copyright management as in Algorithm 7. The owner inserts altogether K independent watermarks into the watermarked model against potential knowledgeable attacks. During hashing, the K pairs of evidence are organized into a Merkle tree, so the owner does not have to present all K watermarks and verifiers to the arbiter during OV [12]. Concretely, Info_k in the sixth line of Algorithm 7 contains the least amount of information necessary to reconstruct h({S_k, V_k}_{k=1}^K) given S_k and V_k. This procedure is shown in Fig. 2.10.

Algorithm 7 The OV protocol for DNN copyright management for an owner and an arbiter
Input: A DNN watermarking scheme WM and the communication key between the owner and the arbiter, Key.
1: The owner finishes training and watermarking its DNN with {S_k, V_k}_{k=1}^K;
2: The owner posts a registering claim to the arbiter;
3: The arbiter returns a token n_0 encrypted by Key, ⟨n_0⟩_Key, to the owner;
4: The owner sends ⟨n_0, H = h({S_k, V_k}_{k=1}^K)⟩_Key to the arbiter;
5: To claim the ownership over N, the owner sends ⟨N, S_k, V_k, Info_k⟩_Key to the arbiter;
6: The arbiter examines S_k, V_k, and Info_k against h({S_k, V_k}_{k=1}^K) and returns V_k(N, S_k).
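The Merkle-tree trick used in Algorithm 7 can be sketched as follows: the owner registers only the root hash, and Info_k is the list of sibling hashes on the path from the k-th leaf to the root, so the arbiter can recompute the root from a single (S_k, V_k) pair. SHA-256 and byte-string leaves are illustrative choices, not mandated by the protocol.

```python
import hashlib

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def build_merkle(leaves):
    """Return all levels of the tree, from the leaf hashes up to the root."""
    level = [H(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                       # duplicate the last node on odd levels
            level = level + [level[-1]]
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def inclusion_proof(levels, k):
    """Info_k: the sibling hashes needed to recompute the root from leaf k."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        proof.append((k % 2, level[k ^ 1]))      # record whether node is a right child, and its sibling
        k //= 2
    return proof

def verify_inclusion(leaf, proof, root):
    node = H(leaf)
    for is_right_child, sibling in proof:
        node = H(sibling + node) if is_right_child else H(node + sibling)
    return node == root

# Hypothetical serialized evidence pairs (S_k, V_k) for K = 4 watermarks.
evidence = [b"evidence-%d" % k for k in range(4)]
levels = build_merkle(evidence)
root = levels[-1][0]
proof_2 = inclusion_proof(levels, 2)             # Info_2 for the third pair
assert verify_inclusion(evidence[2], proof_2, root)
```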

Fig. 2.10 The Merkle-tree hash in the OV protocol


The design of the protocol ensures that, when the registration finishes, the arbiter is convinced of the integrity of h({S_k, V_k}_{k=1}^K), and, when the owner claims the ownership, the arbiter believes that the owner already held S_k, V_k at the time of registration. Then the security against the overwriting attack implies authentic ownership. To formally establish these two propositions, we have to resort to a logic framework of the kind designed to analyze the security of cryptological protocols. The logic framework is a variant of belief logic that describes the belief of each participating party within a protocol. The logic system consists of standard reduction axioms in belief logic and several extra rules formulating the computational hardness of the underlying security primitives, e.g., the one-wayness of hash functions. We incorporate such extra rules into ordinary logic systems for cryptological analysis, e.g., BAN logic and CS logic. For example, the arbiter believes that the owner has control over its evidence:

arbiter believes owner controls H,    (2.8)

and the result of running V indicates ownership:

arbiter believes V_k(N, S_k) = Pass, arbiter believes owner said ⟨S_k, V_k⟩ at τ
⟹ arbiter believes owner owns N at τ.    (2.9)

Finally, when there is a confusion,

arbiter believes owner owns N at τ, arbiter believes owner′ owns N at τ′, τ < τ′
⟹ arbiter believes owner owns N at τ, arbiter believes owner′ pirated N during (τ, τ′).    (2.10)

Equation (2.8), combined with ordinary BAN reduction rules, is sufficient for deriving the statement that the arbiter believes in the integrity of H from the freshness of n_0. Equation (2.9), together with CS logic rules, the accuracy of WM defined in Eq. (2.4), and the one-wayness of h, completes the ownership proof for a single owner, and confusions are handled by Eq. (2.10). Remarkably, the assumption behind Eq. (2.9) is the security against the boycotting attack, while Eq. (2.10) is the result of the security against the overwriting attack (where we neglect the slight difference in DNN functionality). Although the logic framework provides a formal perspective on the analysis of DNN OV protocols, we emphasize that proposing extra reduction rules requires more formal examinations of the basic security of DNN watermarking schemes. We look forward to the development of a formal analysis of OV protocols for DNN.


2.3.5 Remarks on Advanced Protocols

Apart from the OV protocol for a single owner, DNN commercialization in the field calls for protocols for more sophisticated scenarios. For example, in DNN purchasing or copyright transfer, the seller must prove to the purchaser that the product does not contain the seller's watermark. Otherwise, the seller could breach the purchaser's interest by boycotting the latter's service. Hitherto, no DNN watermarking scheme has attached a module for examining whether a DNN has been watermarked by it, as discussed in Sect. 2.2.4. We believe that this Existential property is indispensable for DNN commercialization and look forward to the emergence of provably existential DNN watermarks.

Another scenario that substantially deviates from the ordinary single-owner case is Federated Learning (FL) [8, 20], where multiple authors cooperate to train one model without sharing data. The major problem within FL is that a malicious party might participate in the training process and steal the intermediate DNN, which must be exposed to all authors in each training epoch. Consequently, the aggregator has to register the ownership evidence before starting a training epoch; this demands that the watermarking scheme be very efficient. Empirical studies have shown that such registering does not harm the convergence of FL, as shown in Fig. 2.11. The registration cost can be reduced using a Merkle tree as in the previous section. Regarding the privacy concerns in FL, each independent author would prefer to embed its identification into its local model and expects the ownership proof to remain valid after model aggregation. Formally, an Aggregatable watermarking scheme WM for L independent authors satisfies, for all l ∈ {1, 2, . . . , L}:

S_l ← G(N), (N_l, V_l) ← E(N_0, S_l),
N ← aggregate(N_1, N_2, . . . , N_L),    (2.11)
Pr{V_l(N, S_l) = Pass} ≥ 1 − ε(N),

where the L authors participate in FL, N_0 is the DNN distributed to each author by the aggregator, and the aggregator updates the DNN using the operation aggregate. Although there have been works concentrating on DNN watermarks for FL, a formal analysis of the aggregatable property, which is indispensable for complete privacy-preserving machine learning cooperation, is still lacking.
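A minimal sketch of the aggregatable property under a FedAvg-style aggregator is given below: each author embeds its own watermark locally (via a hypothetical embed function), the parameters are averaged, and every author's verifier is then checked on the aggregate. Whether a concrete scheme survives the averaging is exactly what Eq. (2.11) demands.

```python
import numpy as np

def fedavg(param_list):
    """Aggregate by element-wise averaging of the authors' parameter dictionaries."""
    keys = param_list[0].keys()
    return {k: np.mean([p[k] for p in param_list], axis=0) for k in keys}

def check_aggregatable(global_params, L, embed, verify):
    """embed(params, S_l) -> watermarked params; verify(params, S_l) -> bool (both hypothetical)."""
    identities = [f"author-{l}" for l in range(L)]                   # stand-ins for S_l
    local = [embed(dict(global_params), S_l) for S_l in identities]  # (N_l, V_l) <- E(N_0, S_l)
    aggregated = fedavg(local)                                       # N <- aggregate(N_1, ..., N_L)
    return all(verify(aggregated, S_l) for S_l in identities)        # V_l(N, S_l) = Pass for all l
```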

2.4 Conclusion

Within the growing interest in ownership verification for DNNs, the OV protocol has received some attention, yet an amount that does not match the topic's importance.

Fig. 2.11 The loss variation for FL on (a) MNIST, (b) FashionMNIST, (c) CIFAR10, and (d) CIFAR100; L denotes the number of authors

The role of a third party, to whom the owner has to prove its ownership over a pirated DNN product under a protocol, ought to be an indispensable element of any DNN watermarking scheme. Designers of a complete DNN watermarking scheme have to clarify which sort of evidence has to be registered to establish the unforgeable time stamp, formally prove how their scheme meets the security against boycotting and overwriting, and provide a detailed analysis, at least empirical, of its capacity and independence before it can actually be adopted in industry. Moreover, the assumptions on the third party, the potential threats due to evidence exposure during the copyright examination of unique ownership as well as privacy leakage, and the formal analysis of whether a protocol is accurate and whether it can be abused to spoil legal services are topics whose value has been overlooked until recent studies. It is evident that studies on these topics would shed light on more practical DNN watermarking schemes, standard copyright regulation, and DNN product commercialization in the field.


References

1. Adi, Y., Baum, C., Cisse, M., Pinkas, B., Keshet, J.: Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In: 27th USENIX Security Symposium (USENIX Security 18), pp. 1615–1631 (2018)
2. Chen, J., Wang, J., Peng, T., Sun, Y., Cheng, P., Ji, S., Ma, X., Li, B., Song, D.: Copy, right? A testing framework for copyright protection of deep learning models. In: 2022 IEEE Symposium on Security and Privacy, pp. 1–6 (2022)
3. Darvish Rouhani, B., Chen, H., Koushanfar, F.: DeepSigns: An end-to-end watermarking framework for ownership protection of deep neural networks. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 485–497 (2019)
4. Fan, L., Ng, K.W., Chan, C.S., Yang, Q.: DeepIP: Deep neural network intellectual property protection with passports. IEEE Trans. Pattern Anal. Mach. Intell. 1, 1–1 (2021)
5. Guo, S., Zhang, T., Qiu, H., Zeng, Y., Xiang, T., Liu, Y.: Fine-tuning is not enough: A simple yet effective watermark removal attack for DNN models. In: ICML (2020)
6. Jia, H., Choquette-Choo, C.A., Chandrasekaran, V., Papernot, N.: Entangled watermarks as a defense against model extraction. In: 30th USENIX Security Symposium (USENIX Security 21), pp. 1937–1954. USENIX Association, Berkeley (2021)
7. Li, Z., Hu, C., Zhang, Y., Guo, S.: How to prove your model belongs to you: A blind-watermark based framework to protect intellectual property of DNN. In: Proceedings of ACSAC, pp. 126–137 (2019)
8. Li, T., Sahu, A.K., Talwalkar, A., Smith, V.: Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 37(3), 50–60 (2020)
9. Li, F.Q., Wang, S.L., Liew, A.W.-C.: Regulating ownership verification for deep neural networks: Scenarios, protocols, and prospects. In: IJCAI Workshop (2021)
10. Li, F., Yang, L., Wang, S., Liew, A.W.-C.: Leveraging multi-task learning for unambiguous and flexible deep neural network watermarking. In: AAAI SafeAI Workshop (2021)
11. Li, F.-Q., Wang, S.-L., Zhu, Y.: Fostering the robustness of white-box deep neural network watermarks by neuron alignment. In: 2022 IEEE ICASSP, pp. 1–6 (2022)
12. Li, F.Q., Wang, S., Liew, A.W.-C.: Watermarking protocol for deep neural network ownership regulation in federated learning. In: 2022 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1–5 (2022)
13. Liu, K., Dolan-Gavitt, B., Garg, S.: Fine-pruning: Defending against backdooring attacks on deep neural networks. In: International Symposium on Research in Attacks, Intrusions, and Defenses, pp. 273–294. Springer, Berlin (2018)
14. Liu, Y., Lee, W.C., Tao, G., Ma, S., Aafer, Y., Zhang, X.: ABS: Scanning neural networks for back-doors by artificial brain stimulation. In: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 1265–1282 (2019)
15. Namba, R., Sakuma, J.: Robust watermarking of neural network with exponential weighting. In: Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security, pp. 228–240 (2019)
16. Uchida, Y., Nagai, Y., Sakazawa, S., Satoh, S.: Embedding watermarks into deep neural networks. In: Proceedings of the 2017 ACM International Conference on Multimedia Retrieval, pp. 269–277 (2017)
17. Wang, T., Kerschbaum, F.: RIGA: Covert and robust white-box watermarking of deep neural networks. In: Proceedings of the Web Conference 2021, pp. 993–1004 (2021)
18. Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., Zhao, B.Y.: Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks. In: 2019 IEEE Symposium on Security and Privacy (SP), pp. 707–723. IEEE, Piscataway (2019)
19. Xue, M., Zhang, Y., Wang, J., Liu, W.: Intellectual property protection for deep learning models: Taxonomy, methods, attacks, and evaluations. IEEE Trans. Artif. Intell. 3(6), 908–923 (2021)


20. Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. 10(2), 1–19 (2019)
21. Zhang, J., Gu, Z., Jang, J., Wu, H., Stoecklin, M.P., Huang, H., Molloy, I.: Protecting intellectual property of deep neural networks with watermarking. In: Proceedings of the 2018 ACM Asia Conference on Computer and Communications Security, pp. 159–172 (2018)
22. Zhao, J., Hu, Q., Liu, G., Ma, X., Chen, F., Hassan, M.M.: AFA: Adversarial fingerprinting authentication for deep neural networks. Comput. Commun. 150, 488–497 (2020)
23. Zhu, R., Zhang, X., Shi, M., Tang, Z.: Secure neural network watermarking protocol against forging attack. EURASIP J. Image Video Process. 2020(1), 1–12 (2020)

Part II

Techniques

Chapter 3

Model Watermarking for Deep Neural Networks of Image Recovery

Yuhui Quan and Huan Teng

Abstract Recently, there has been increasing interest in applying deep learning to image recovery, an important problem in low-level vision. Publishing pre-trained DNN models of image recovery has become popular in the community. As a result, how to protect the intellectual property of the owners of those models has become a serious concern. To address it, this chapter introduces a framework developed in our recent work for watermarking the DNN models of image recovery. The DNNs of image recovery differ much from those of image classification in various aspects. Such differences pose additional challenges to model watermarking, but they also bring opportunities for improvement. Using image denoising and image super-resolution as case studies, we present a black-box watermarking approach for pre-trained models, which exploits the over-parameterization property of an image recovery DNN. Moreover, a watermark visualization method is introduced for additional subjective verification.

3.1 Introduction

With the rapid advance of deep learning, DNNs have emerged as powerful solutions to a broad spectrum of image recovery tasks; see e.g., image denoising [14, 20, 22, 31], super-resolution [2, 9, 12, 25], deblurring [10, 13, 21, 24], deraining [3, 7, 8, 16], and many others. The training of a large DNN model for image recovery may require significant computational resources (e.g., hundreds of expensive GPUs) and quite a few days to consume massive training data (e.g., millions of images). Thus, sharing pre-trained DNN models has become a trend in the community. In addition, there is also a trend for companies and institutes to publish their pre-trained models with charges for commercial usage.



Considering the costs of DNN training in terms of computational resources, manpower, and training data collection, a pre-trained DNN model is undoubtedly the intellectual property of its owners, which needs to be protected against copyright infringements, such as violation of license agreement and illegal usages by malicious attackers. One solution to addressing such issues is the so-called model watermarking [26]. Watermarking refers to concealing watermark information into a target for ownership identification. It has emerged as a widely used technique for identifying the copyrights of digital medias such as audios, images, and videos. For model watermarking, the task is to conceal specific watermark information into a trained DNN model, which can then be extracted for copyright verification. Since DNNs differ much from digital medias in terms of structures and properties, the existing watermarking techniques on digital medias are not applicable to DNN models. In recent years, model watermarking for DNNs has been paid an increasing attention, with both black-box and white-box approaches proposed; see e.g., [1, 6, 11, 23, 26, 33]. The black-box approaches (e.g., [1, 6, 11, 33]) enjoy higher practicability and better privacy than the white-box ones (e.g., [23, 26]), as the watermark extractor in a black-box approach is blind to the model weights of the suspicious DNN. Most existing model watermarking methods concentrate on the DNNs designed for the visual recognition tasks (e.g., classification) that map images to labels, while the DNNs designed for image recovery are forgotten. With the wider and wider application of DNNs in image recovery tasks of low-level vision, it is of a great value to have a model watermarking approach, particularly the black-box one, for image recovery DNNs. Unfortunately, the existing black-box model watermarking methods designed for image recognition DNNs cannot be directly applied to the model watermarking of image recovery DNNs. This is due to the fundamental differences between the two kinds of DNNs. First, a DNN for classification outputs a label, while a DNN for image recovery outputs an image. An image contains much richer structures than a label vector, which cannot be utilized in a model watermarking method designed for classification DNNs. Second, a DNN for classification encodes the decision boundaries among different classes, whereas a DNN for image recovery encodes the low-dimensional manifold of latent (ground-truth) images. Thus, the adversarial examples around decision boundaries, which are often used for blackbox model watermarking of recognition DNNs (e.g., [11]), cannot be transferred for watermarking the image recovery DNNs. Lastly, a DNN for image recovery is generally shallower than the ones for recognition. As a result, the over-parameterization degree of an image recovery DNN is often much lower than that for recognition, which introduces additional challenges to the watermarking process. This chapter introduces a black-box framework for watermarking image recovery DNN models, which is proposed in our recent work [18]. While the framework covers all main ingredients of a watermarking method, it is very difficult to have a universal approach applicable to all image recovery tasks, owing to the different characteristics among various image recovery tasks. Thus, we use two most often seen image recovery tasks, image denoising and image super-resolution, as the


case studies to show how the framework can be used. As a core technique in image recovery, image denoising is widely used in many low-level vision tasks, e.g., it is often called by many image restoration methods as a key inner process; see, e.g., [14, 17, 20, 22, 30, 32]. Image super-resolution is also an important image recovery task with wide applications, for which DNNs have seen tremendous success [2]. Indeed, super-resolution is a test bed in the development of new DNNs for image recovery; see e.g., [2, 5, 9, 19, 25]. The work [18] introduced in this chapter is the first one that studied the basic rules and principles for the model watermarking of image-to-image DNNs, as well as the first one that developed a black-box approach for tackling the problem. It not only allows efficient watermark embedding on an image recovery DNN, but also allows remote watermark identification by only one request (i.e., a single trigger) without accessing the model weights. In addition, an auxiliary copyright visualizer is also introduced for translating watermark data (in the form random images in our method) into a visually meaningful copyright image, so that both subjective inspection and objective visual quality measure can be utilized for further verification.

3.2 Related Works

There is an increasing number of studies on DNN model watermarking, with a particular focus on the DNNs for image recognition; see e.g., [1, 11, 23, 26]. This section reviews these related works. Based on whether the weights of a DNN model are accessible to the watermark extractor, the existing methods can be divided into two categories: white-box approaches and black-box approaches.

3.2.1 White-Box Model Watermarking

White-box model watermarking assumes access to the model weights for watermark extraction, which is applicable to the cases where the model weights are public (e.g., open projects) or can be disclosed to a trusted third party. One basic idea of white-box methods is to directly embed watermark information into the model weights. A seminal work on DNN model watermarking by Uchida et al. [26] is a white-box method, which also developed several principles for model watermarking. They proposed to embed the watermark, encoded as a bit string, into the weights of some DNN layers. The watermark embedding is done by adjusting the pre-trained model such that the layer weights can be mapped to the watermark under a learned linear transform. The watermark extraction is then straightforward: apply the learned linear transformation to the layer weights, with subsequent thresholding for verification. There are two weaknesses of such a method. First, the watermark size is


bounded by the number of model weights, which could be limited for a light-weight DNN. Second, the embedded watermark may be easily erased by writing another watermark into the model or by permuting the model weights. For improvement, Rouhani et al. [23] proposed a quite different scheme that contains two steps. First, sample some specific trigger keys from a Gaussian mixture and generate a set of random bit strings as the watermark. Second, adjust the model weights so that the specific inputs can robustly trigger the watermark within the probability density function of the intermediate activation maps of the model. Afterward, the watermark can be extracted by inputting the trigger keys to the suspicious model, and the verification is done by examining the probability density function of the resulting model activations. In such a method, the watermark is embedded into dynamic statistics instead of the static model weights. Thus, it offers higher resistance to watermark overwriting attacks. Moreover, the watermark size can be arbitrarily large, by increasing the number of trigger keys.
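To make the white-box idea concrete, here is a minimal sketch in the spirit of the weight-regularizer approach of Uchida et al. [26] described above: a secret random projection X maps flattened layer weights to logits, a regularizer pulls sigmoid(Xw) toward the owner's bit string during training, and extraction simply thresholds Xw. The layer shape, projection, and threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
bits = rng.integers(0, 2, size=64)            # the owner's watermark as a 64-bit string
w = rng.standard_normal(3 * 3 * 64 * 64)      # flattened weights of the layer to be watermarked
X = rng.standard_normal((64, w.size))         # secret embedding (projection) matrix

def embedding_regularizer(w, X, bits):
    """Binary cross-entropy between sigmoid(Xw) and the bit string; added to the task loss."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.mean(bits * np.log(p + 1e-12) + (1 - bits) * np.log(1 - p + 1e-12))

def extract_bits(w, X):
    """White-box extraction: apply the secret projection and threshold at zero."""
    return (X @ w > 0).astype(int)

ber = np.mean(extract_bits(w, X) != bits)     # bit error rate before any embedding (about 0.5)
```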

3.2.2 Black-Box Model Watermarking

Black-box model watermarking does not assume access to the model weights during watermark extraction. Such a feature makes black-box methods applicable to the case of limited privacy where model weights are encrypted, as well as the case of remote verification where DNNs are published as web services or APIs. A black-box method often encodes watermarks by specific model inputs and the expected model outputs. The embedding is done by fine-tuning the model to map specific trigger keys to their associated watermarks, and the verification then checks whether the expected input–output relationship exists.

Merrer et al. [11] proposed to use adversarial examples as the trigger keys and their class labels as the watermarks. They fine-tuned the model to correctly classify the adversarial examples. Such a process carefully adjusts the decision boundaries to fit the adversarial examples. As adversarial examples are statistically unstable, such adjustments complicate the decision boundaries with oscillations. As a result, the embedding may be fragile to model compression, which simplifies and smooths decision boundaries. Also, since adversarial examples are usually close to the training data, the embedded watermark is not robust to model fine-tuning, which recovers the original decision boundaries around the training samples.

Adi et al. [1] proposed a method to address the issues above. The trigger keys are constructed from abstract images that are unrelated to each other and to the training samples. The labels for the trigger keys (images) are randomly assigned. Empirically, this method showed better resistance to fine-tuning attacks. However, the space of abstract images may be so large that one can easily find another set of images that coincides with another meaningful watermark. To address this issue, Guo et al. [6] proposed to generate the trigger keys by modifying some training images with the signatures of the model owners.


Zhang et al. [33] combined several different strategies for trigger key generation so as to improve the robustness of watermark verification, including embedding meaningful content into original training data, using independent training data with unrelated classes, and injecting pre-specified noise. The watermarks are defined as the wrong or unrelated labels predicted from the trigger keys.

3.3 Problem Formulation

3.3.1 Notations and Definitions

Let N(·; W) denote a DNN model of some image recovery task, parameterized by W. It maps an input degraded image of low visual quality to a recovered image of high visual quality. Let W_0 denote the parameters of N trained on a set of images denoted by X, W* the parameters of N after watermarking, T the space of all possible trigger keys, and μ(·) some visual quality measure for images, e.g., peak signal-to-noise ratio (PSNR), weighted PSNR [28], or the Structural SIMilarity (SSIM) index [27]. There are three main parts in a black-box method of model watermarking [18]:

• Watermark generation: Generate a trigger key T̄ ∈ T and a watermark S̄.
• Watermark embedding: Adjust the host DNN model to carry the watermark, which amounts to finding W* such that N(T̄; W*) = S̄.
• Watermark verification: Examine the existence of the watermark in a suspicious model N′ ∈ M, i.e., check whether it holds that N′(T̄) = S̄.

3.3.2 Principles for Watermarking Image Recovery DNNs

The following are the principles for model watermarking of image recovery DNNs [18]:

• Fidelity: To make the watermarking meaningful, the performance of the host model, in terms of recovery accuracy, should not be noticeably decreased after watermark embedding. That is,

  μ(N(X; W*)) ≈ μ(N(X; W_0)), s.t. ∀X ∈ X.    (3.1)

• Uniqueness: Without related knowledge, it is unlikely to find a DNN model for the same task that also maps the trigger keys to the watermarks. That is,

  ∀T ∈ T, N′(T; W′) = N(T; W*) iff N′ = N and W′ = W*,    (3.2)

  where W′ encodes parameters that have not been tuned based on (T, N(T; W*)). Such a property is to avoid fraudulent claims of ownership made by attackers.
• Robustness: The watermark can still be correctly identified under a model-oriented attack of moderate strength. That is, for an arbitrarily small perturbation ε on W*, we have

  N(T; W* + ε) ≈ N(T; W*).    (3.3)

• Efficiency: Both watermark embedding and watermark verification are computationally efficient. Generally, they should take a much lower cost than the training of the original models, in terms of both time and computational resources.
• Capacity: Under the fidelity constraint, a model watermarking algorithm should embed sufficient (or as much as possible) watermark information to maximize the robustness of verification.

3.3.3 Model-Oriented Attacks to Model Watermarking

Model-oriented attacks attempt to damage the watermark by modifying the weights of the host model. Three types of model-oriented attacks are often taken into account in the existing studies:

• Model compression: Attackers could remove the watermark by reducing the number of model parameters, as watermark embedding often relies on the over-parameterization of a DNN to satisfy the fidelity constraint. This is often done directly by zeroing small weights, so as to avoid any extra training process (a code sketch follows this list).
• Model fine-tuning: Attackers could adjust the parameters of the watermarked model on new training data to gain performance on their own test data, which may severely destroy the embedded watermark.
• Watermark overwriting: Attackers could write one or more additional watermarks into the watermarked model using the same or a similar watermarking algorithm, so as to destroy the original watermark.
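As a concrete illustration of the first attack in the list above, the sketch below zeroes the smallest-magnitude fraction of each weight tensor, the kind of training-free pruning an attacker might apply to erase a watermark. The pruning ratio and the dictionary-of-arrays model representation are assumptions for illustration.

```python
import numpy as np

def magnitude_prune(weights, ratio=0.3):
    """Zero out the smallest `ratio` fraction of entries in every weight tensor."""
    pruned = {}
    for name, w in weights.items():
        flat = np.abs(w).ravel()
        k = int(ratio * flat.size)
        threshold = np.partition(flat, k)[k] if k > 0 else 0.0
        pruned[name] = np.where(np.abs(w) >= threshold, w, 0.0)
    return pruned

# Example: prune a toy two-layer parameter dictionary by 30%.
rng = np.random.default_rng(0)
params = {"conv1": rng.standard_normal((16, 1, 3, 3)), "conv2": rng.standard_normal((1, 16, 3, 3))}
attacked = magnitude_prune(params, ratio=0.3)
```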

3.4 Proposed Method

3.4.1 Main Idea and Framework

The approach proposed in our work [18] is based on the underlying mechanism behind image recovery and deep learning.

Fig. 3.1 Illustration of the basic idea of [18] for black-box watermarking of image recovery DNNs

Recall that an image (patch) in the form of an array can be viewed as a point in a high-dimensional vector space. In image recovery, the images (patches) to be recovered are usually assumed to lie on some low-dimensional manifolds in such a high-dimensional vector space, and many image recovery tasks are then about projecting input images (patches) onto such manifolds; see e.g., [4, 15, 29, 30]. For deep-learning-based image recovery, it is unlikely to collect training samples that sufficiently cover all important aspects of all possible images, as image content can vary quite a lot. As a result, a DNN for image recovery can learn only partial views of the low-dimensional manifolds, i.e., the manifold regions close to at least some points of the training data. Let B denote such regions. When applying the trained DNN to unseen images, the results will be acceptable if those images lie on B.

In order to watermark a DNN model of image recovery, the basic idea of our work [18] is to fine-tune the model so as to manipulate its predictive behavior in a specific domain denoted by D, such that the output images of the modified model on D approximate predefined outcomes; see Fig. 3.1 for an illustration. The domain D forms the space of all possible trigger keys, and the predefined outcomes serve as the watermarks. Accordingly, the watermark verification can be done by examining whether the trigger keys lead to their corresponding watermarks in the suspicious model.

The above idea leads to a black-box model watermarking scheme for image recovery DNNs; see Fig. 3.2 for the outline. Given a host DNN model of some owner, a trigger key and an initial watermark are first generated. In watermark embedding, this image pair is used to adjust the weights of the host model. Then, the trigger key is input to the watermarked model, and the output is used to update the watermark. Both the trigger key and the watermark are kept by the owner. In watermark verification, the verifier inputs the owner's trigger key to the suspicious model and compares the model's output to the owner's watermark for judgment.


Fig. 3.2 Diagram of the black-box framework for model watermarking of image recovery DNNs: (a) framework of watermark embedding; (b) framework of watermark verification

3.4.2 Trigger Key Generation

There are two key issues to be addressed in the above scheme: the definition of the domain D for trigger key construction and the definition of the predefined outcomes for watermark generation. Intuitively, the domain D should be distant from B: in this case, manipulating the model's behavior on D is likely to cause negligible influence on the model's behavior for images (patches) on or close to B, which benefits the fidelity of the watermarking; and vice versa, the watermark embedding is insensitive to model fine-tuning using data on B, such as the original training images or similar ones, which benefits the robustness against model fine-tuning attacks. The requirement that D be far from B implies that the trigger keys need to differ much from both training and test images, statistically. This can be achieved by using random images (i.e., images with totally random pixel values) as the trigger keys, as such images are very unlikely to appear in image recovery. Let U(a, b) denote the uniform distribution on the interval [a, b]. The trigger key, denoted by T, is generated by sampling from the i.i.d. uniform distribution:

T(i, j) ∼ U(0, 1).    (3.4)

Such a random process allows distributing keys to different owners in an efficient manner, and two owners are unlikely to be assigned very similar trigger keys, provided T is sufficiently large. See Fig. 3.3 for some instances of the trigger keys generated by Eq. (3.4); their appearance differs much from the images seen in common applications. In other words, the watermark embedding will operate on data different from those of both the original image recovery task and daily fine-tuning tasks.

Fig. 3.3 Examples of the trigger images generated by Eq. (3.4)
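A trigger key as in Eq. (3.4) can be produced in one line; the image size and seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=2023)       # the seed acts as the owner's private source of randomness
T = rng.uniform(0.0, 1.0, size=(256, 256))   # trigger key T with T(i, j) ~ U(0, 1), Eq. (3.4)
```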

3.4.3 Watermark Generation

The watermarks are constructed by applying some simple non-learning-based image recovery method G(·) to the trigger key images. Let S = G(T) denote the expected watermark corresponding to the trigger key T. Considering the fidelity of watermarking, G should be defined to play a similar role to the host model; otherwise, the watermark embedding would modify the DNN model to include a new function, which may conflict with the original task and thus noticeably degrade the model's performance. On the other hand, considering the discriminability and uniqueness of watermarking, G should be defined to maximize the difference between S and the output of the original model. The definition of G therefore depends on the task, and two tasks, image denoising and image super-resolution, are considered in our work as case studies.

For image denoising, G is defined by a naive smoothing operator as follows:

G(T) = T − ∇T    (for image denoising),    (3.5)

where ∇ denotes the gradient operator. The resulting watermark S is a simply smoothed version of the trigger key image. Since a smoothing operator reduces noise in a noisy image, the watermark embedding with S does not contradict the task of image denoising; thus, the performance impairment caused by the watermark embedding is expected to be minimal. In addition, a well-trained denoising DNN is supposed to be a sophisticated adaptive smoothing operator with much better performance than the naive smoothing operator G, and their results differ considerably from each other on the trigger key image. This makes the embedded watermark distinguishable.


In a similar spirit, the watermark generation for image super-resolution DNNs is built upon linear interpolation with gradient enhancement:

G(T) = T̃ + ∇T̃   (for super-resolution),    (3.6)

where T̃ is the upsampled version of T obtained via linear interpolation. Such an operation plays the role of super-resolution, but its results differ greatly from those of a well-trained super-resolution DNN, so the embedded watermark remains distinguishable.

A black-box method for watermarking recognition DNNs usually maps a trigger key to a class label, i.e., one bit of watermark is embedded per trigger key. As a result, multiple trigger keys are needed to embed sufficient watermark information. In contrast, a DNN for image recovery acts as an image-to-image mapping; in essence, the watermark on such a DNN is embedded in image patches. Thus, a single trigger key image containing many patches already yields a multi-bit watermark, providing sufficient watermark information in the host model for verification. Moreover, using multiple trigger keys in our method is essentially equivalent to stacking the trigger key images into a larger one. This allows the suspicious model to be queried only once for watermark verification (a one-time request), which is useful when efficiency also matters, e.g., in remote verification.
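Below is a minimal sketch of the watermark generation in Eqs. (3.5) and (3.6) for a grayscale trigger key. The finite-difference gradient and the interpolation order are illustrative choices; the chapter does not prescribe a particular discretization of the operator ∇.

```python
import numpy as np
from scipy.ndimage import zoom

def gradient_magnitude(img):
    """A simple finite-difference stand-in for the gradient operator."""
    gy, gx = np.gradient(img)
    return np.sqrt(gx ** 2 + gy ** 2)

def watermark_denoising(trigger_key):
    """Eq. (3.5): S = T - grad(T), a naively smoothed version of the trigger key."""
    return np.clip(trigger_key - gradient_magnitude(trigger_key), 0.0, 1.0)

def watermark_super_resolution(trigger_key, scale=2):
    """Eq. (3.6): S = up(T) + grad(up(T)), with up(.) a linear-interpolation upsampling."""
    upsampled = zoom(trigger_key, scale, order=1)  # order=1 -> linear interpolation
    return np.clip(upsampled + gradient_magnitude(upsampled), 0.0, 1.0)
```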

3.4.4 Watermark Embedding

Recall that a host image recovery DNN is denoted by N(·; W). The watermark is embedded by fine-tuning the model weights W using both the (trigger, watermark) pair (T, S) and the (input, ground-truth) image pairs {(X_i, Y_i)}_i of the original training data, via the loss

L(W) = Σ_i ‖N(X_i; W) − Y_i‖²₂ + λ‖N(T; W) − S‖²₂,    (3.7)

where λ ∈ R₊ controls the strength of embedding. The weights of the watermarked model are then given by

W* = arg min_W L(W).    (3.8)

The first term in (3.7) is the fidelity loss, which measures the loss of model performance, while the second term is the embedding loss, which measures the accuracy of the embedded watermark. Increasing λ degrades the model performance, while decreasing it weakens the watermarking robustness. The watermark embedding thus requires not only that N(T) approximates the expected watermark S well, but also that N performs well on its original training data.


The model fine-tuning with (3.7) for watermark embedding is stopped once the first term in (3.7) reaches a sufficiently small value. The computational cost of such a fine-tuning process is acceptable compared with the high cost of the original model training. Once N is trained, the watermark S is updated to N(T; W*) for consistency. One may be concerned about whether N₁(T₁; W₁*) ≈ N₂(T₂; W₂*) could hold for two watermarked models N₁(·; W₁*), N₂(·; W₂*) with very similar functions and two different keys T₁, T₂. Fortunately, this is generally not the case, because the initial S is very close to N(T; W*) after watermark embedding and (T₁, S₁) differs greatly from (T₂, S₂).
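The following PyTorch sketch illustrates one fine-tuning step on the joint objective of Eq. (3.7). The model, optimizer, and batch handling are generic assumptions rather than the exact training recipe of [18]; the trigger key and watermark are assumed to be batched tensors, and the mean squared error is used in place of the summed squared norms.

```python
import torch
import torch.nn.functional as F

def watermark_embedding_step(model, optimizer, batch_x, batch_y,
                             trigger_key, watermark, lam=1.0):
    """One step of minimizing Eq. (3.7): fidelity loss + lam * embedding loss."""
    optimizer.zero_grad()
    fidelity_loss = F.mse_loss(model(batch_x), batch_y)         # original recovery task
    embedding_loss = F.mse_loss(model(trigger_key), watermark)  # trigger -> watermark
    loss = fidelity_loss + lam * embedding_loss
    loss.backward()
    optimizer.step()
    return fidelity_loss.item(), embedding_loss.item()
```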

3.4.5 Watermark Verification

For a model N, its ownership can be verified given the (trigger, watermark) pair (T, S) of an owner. Feeding the trigger key T to N yields an image S′ = N(T). This requires only one forward pass of the model, which is very efficient. The ownership is then claimed if the distance between S′ and S is below a predefined threshold η. Concretely, the following criterion is used:

d(S, S′) = (1/#(S)) ‖S − S′‖₂ ≤ η,    (3.9)

where #(·) counts the number of elements, and S, S′ are normalized to [0, 1], which results in d(S, S′) ∈ [0, 1]. The threshold η bounds the error regarded as negligible in the verification.

Our work [18] provided the following scheme to set a proper value of η. Let E = S − S′. Suppose E(i, j) ∼ N(0, (1/4)²) for all i, j, that is, the error between S and S′ obeys an i.i.d. zero-mean normal distribution with standard deviation 1/4. The value 1/4 comes from the observation that an image corrupted by additive Gaussian white noise drawn from N(0, (1/4)²) is very noisy but still recognizable; see Fig. 3.4 for two examples. Then the square of each E(i, j) is still independent and obeys the same Gamma distribution Γ(1/16, 1/128), and hence Z = Σ_{i,j} [E(i, j)]² ∼ Γ(#(S)/2, 1/8). Viewing Z as a random variable, we apply the p-value approach with p < 0.05 to determine the threshold. In other words, we need to find β such that P[Z ≤ β] < 0.05, or equivalently, to find η such that P[d(S, S′) ≤ η] < 0.05, in order to safely reject the hypothesis that S′ is similar to S. By direct calculation, we obtain η = 6.07 × 10⁻³.

Fig. 3.4 Clear images (odd columns) and their corrupted versions (even columns) generated by adding i.i.d. Gaussian noise sampled from N(0, (1/4)² I)
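A minimal sketch of the verification criterion in Eq. (3.9) and the p-value-based threshold derivation is given below. The 64×64 watermark size is an illustrative assumption, and the Gamma distribution is used in the shape/scale parametrization Γ(#(S)/2, 2σ²) stated above; the threshold reported in [18] is 6.07 × 10⁻³.

```python
import numpy as np
from scipy.stats import gamma

def verification_distance(S, S_prime):
    """Eq. (3.9): d(S, S') = ||S - S'||_2 / #(S), with S, S' normalized to [0, 1]."""
    return np.linalg.norm(S - S_prime) / S.size

def verification_threshold(num_elements, sigma=0.25, p=0.05):
    """Find eta with P[d(S, S') <= eta] < p, assuming E(i, j) ~ N(0, sigma^2) i.i.d.,
    so that Z = sum E(i, j)^2 ~ Gamma(shape=num_elements / 2, scale=2 * sigma^2)."""
    beta = gamma.ppf(p, a=num_elements / 2, scale=2 * sigma ** 2)
    return np.sqrt(beta) / num_elements

# Example with a hypothetical 64x64 watermark: a nearly identical output passes.
S = np.random.rand(64, 64)
S_prime = np.clip(S + np.random.normal(0, 0.01, S.shape), 0.0, 1.0)
print(verification_distance(S, S_prime) <= verification_threshold(S.size))  # True
```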

3.4.6 Auxiliary Copyright Visualizer

Fig. 3.5 Structure of the auxiliary visualizer: a 3-layer convolutional encoder followed by a 6-layer convolutional decoder built from Conv/BN/ReLU blocks with a final Conv + Sigmoid layer, mapping the watermark to the copyright image

So far, the watermark is a random-looking image that is visually meaningless and essentially acts as a bit string for watermark verification. It would be desirable to have an approach that generates watermarks with visual meaning, so that an additional subjective verification can be performed. Our work [18] proposed a solution for this purpose, which trains a generative network f : R^{M₁×N₁} → R^{M₂×N₂} to map the watermark S to a recognizable copyright image by minimizing the MSE loss between its output and that copyright image. The network f acts as a copyright visualizer; the network structure used in our work [18] is illustrated in Fig. 3.5. Such an auxiliary tool is in effect a memory of the copyright image: it is activated by the watermark S to output the associated copyright image, while for other input images it is not activated and only outputs a visually meaningless image. In practical use, the model owner or a third party keeps the trained f. During watermark verification, it can be applied to the output of the suspicious model to examine whether that output corresponds to the copyright image of the owner. Such an approach is quite generic and can therefore be adopted in other DNN model watermarking methods. As a simple demonstration, Fig. 3.6 shows some visualization results of the extracted watermarks (rendered as a copyright image) from a model under different attacks, including model compression, model fine-tuning, and watermark overwriting. It can be seen that the extracted watermarks are robust to these attacks. We also show the watermark extracted using an irrelevant trigger key as input; such a watermark is a meaningless image. These results demonstrate the power of the proposed approach in terms of both robustness and uniqueness.
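A minimal PyTorch sketch of an encoder-decoder copyright visualizer in the spirit of Fig. 3.5 is given below. The channel width, kernel size, and exact layer arrangement are illustrative assumptions; [18] should be consulted for the exact architecture.

```python
import torch.nn as nn

class CopyrightVisualizer(nn.Module):
    """Maps a (single-channel) watermark image to the copyright image."""
    def __init__(self, channels=32):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                                 nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        # Encoder: 3 convolutional blocks.
        self.encoder = nn.Sequential(block(1, channels), block(channels, channels),
                                     block(channels, channels))
        # Decoder: 6 convolutional layers, ending with a Sigmoid to produce an image.
        self.decoder = nn.Sequential(*[block(channels, channels) for _ in range(5)],
                                     nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, watermark):
        return self.decoder(self.encoder(watermark))

# Training: minimize the MSE between CopyrightVisualizer()(S) and the copyright
# image, so that only the true watermark S "unlocks" the copyright image.
```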


Fig. 3.6 Visualization of watermark data (in the form of images output by the auxiliary visualizer) on the DnCNN model [31] in different cases. (a) Original copyright images; (b) Compression with 10% pruning; (c) Compression with 40% pruning; (d) Fine-tuning with 10 epochs on the original dataset; (e) Fine-tuning with 50 epochs on the original dataset; (f) Fine-tuning with 10 epochs on the texture dataset; (g) Fine-tuning with 50 epochs on the texture dataset; (h) Overwriting with a new trigger key; (i) Using an irrelevant trigger key

3.5 Conclusion

With the prevalence of publishing and sharing DNN models in the computer vision community, there is a rising need for model watermarking techniques that protect the intellectual property of trained DNN models. This chapter introduced our recent work on model watermarking of image recovery DNNs, which provides an effective black-box framework for watermarking DNN models used for image recovery. We demonstrated its practical usage in the context of image denoising and image super-resolution, with a copyright visualizer introduced for additional subjective watermark verification. Both the watermarking framework and the copyright visualizer are generic and have good potential for application to watermarking DNNs in other image processing tasks.

References 1. Adi, Y., Baum, C., Cisse, M., Pinkas, B., Keshet, J.: Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In: 27th USENIX Security Symposium (USENIX Security 18), pp. 1615–1631, Baltimore, 2018. USENIX Association, Berkeley (2018) 2. Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: Dataset and study. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 3. Chen, C., Li, H.: Robust representation learning with feedback for single image deraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7742–7751 (2021) 4. Dong, W., Li, X., Zhang, L., Shi, G.: Sparsity-based image denoising via dictionary learning and structural clustering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 457–464. IEEE, Piscataway (2011) 5. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199. Springer, Berlin (2014) 6. Guo, J., Potkonjak, M.: Watermarking deep neural networks for embedded systems. In: Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–8. IEEE, Piscataway (2018)


7. Huang, H., Yu, A., He, R.: Memory oriented transfer learning for semi-supervised image deraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7732–7741 (2021) 8. Jiang, K., Wang, Z., Yi, P., Chen, C., Wang, Z., Wang, X., Jiang, J., Lin, C.-W.: Rain-free and residue hand-in-hand: a progressive coupled network for real-time image deraining. IEEE Trans. Image Process. 30, 7404–7418 (2021) 9. Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1646–1654 (2016) 10. Kupyn, O., Budzan, V., Mykhailych, M., Mishkin, D., Matas, J.: DeblurGAN: Blind motion deblurring using conditional adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 11. Le Merrer, E., Perez, P., Trédan, G.: Adversarial frontier stitching for remote neural network watermarking (2017). Preprint arXiv:1711.01894 12. Lugmayr, A., Danelljan, M., Gool, L.V., Timofte, R.: SRFlow: Learning the super-resolution space with normalizing flow. In: European Conference on Computer Vision, pp. 715–732. Springer, Berlin (2020) 13. Ma, L., Li, X., Liao, J., Zhang, Q., Wang, X., Wang, J., Sander, P.V.: Deblur-NeRF: Neural radiance fields from blurry images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12861–12870 (2022) 14. Mao, X., Shen, C., Yang, Y.-B.: Image restoration using very deep convolutional encoderdecoder networks with symmetric skip connections. In: Advances in Neural Information Processing Systems, pp. 2802–2810 (2016) 15. Peyré, G.: Manifold models for signals and images. Comput. Vision Image Underst. 113(2), 249–260 (2009) 16. Quan, Y., Deng, S., Chen, Y., Ji, H.: Deep learning for seeing through window with raindrops. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019) 17. Quan, Y., Chen, M., Pang, T., Ji, H.: Self2Self with dropout: Learning self-supervised denoising from single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 18. Quan, Y., Teng, H., Chen, Y., Ji, H.: Watermarking deep neural networks in image processing. IEEE Trans. Neural Netw. Learn. Syst. 32(5), 1852–1865 (2020) 19. Quan, Y., Yang, J., Chen, Y., Xu, Y., Ji, H.: Collaborative deep learning for super-resolving blurry text images. IEEE Trans. Comput. Imaging 6, 778–790 (2020) 20. Quan, Y., Chen, Y., Shao, Y., Teng, H., Xu, Y., Ji, H.: Image denoising using complex-valued deep CNN. Pattern Recog. 111, 107639 (2021) 21. Quan, Y., Wu, Z., Ji, H.: Gaussian kernel mixture network for single image defocus deblurring. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 20812–20824. Curran Associates, Red Hook (2021) 22. Romano, Y., Elad, M., Milanfar, P.: The little engine that could: Regularization by denoising (RED). SIAM J. Imaging Sci. 10(4), 1804–1844 (2017) 23. Rouhani, B.D., Chen, H., Koushanfar, F.: DeepSigns: A generic watermarking framework for IP protection of deep learning models (2018). Preprint arXiv:1804.00750 24. Ruan, L., Chen, B., Li, J., Lam, M.: Learning to deblur using light field generated and real defocus images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16304–16313 (2022) 25. 
Tang, Y., Gong, W., Chen, X., Li, W.: Deep inception-residual Laplacian pyramid networks for accurate single-image super-resolution. IEEE Trans. Neural Netw. Learn. Syst. 31, 1–15 (2019) 26. Uchida, Y., Nagai, Y., Sakazawa, S., Satoh, S.: Embedding watermarks into deep neural networks. In: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pp. 269–277. ACM, New York (2017)


27. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004) 28. Wang, S., Zheng, D., Zhao, J., Tam, W.J., Speranza, F.: An image quality evaluation method based on digital watermarking. IEEE Trans. Circuits Syst. Video Technol. 17(1), 98–105 (2006) 29. Xu, R., Xu, Y., Quan, Y.: Factorized tensor dictionary learning for visual tensor data completion. IEEE Trans. Multimedia 23, 1225–1238 (2020) 30. Yang, X., Xu, Y., Quan, Y., Ji, H.: Image denoising via sequential ensemble learning. IEEE Trans. Image Process. 29, 5038–5049 (2020) 31. Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017) 32. Zhang, K., Zuo, W., Gu, S., Zhang, L.: Learning deep CNN denoiser prior for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3929–3938 (2017) 33. Zhang, J., Gu, Z., Jang, J., Wu, H., Stoecklin, M.P., Huang, H., Molloy, I.: Protecting intellectual property of deep neural networks with watermarking. In: Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 159–172 (2018)

Chapter 4

The Robust and Harmless Model Watermarking Yiming Li, Linghui Zhu, Yang Bai, Yong Jiang, and Shu-Tao Xia

Abstract Obtaining well-performing deep neural networks usually requires expensive data collection and training procedures. Accordingly, such networks are valuable intellectual property of their owners. However, recent literature revealed that adversaries can easily “steal” models by acquiring a function-similar copy, even when they have no training samples and no information about the victim models. In this chapter, we introduce a robust and harmless model watermark, based on which we design a model ownership verification via hypothesis testing. In particular, our model watermark persists through complicated stealing processes and does not introduce additional security risks. Specifically, our defense consists of three main stages. First, we watermark the model by embedding external features, based on modifying some training samples via style transfer. After that, we train a meta-classifier to determine whether a suspicious model is stolen from the victim, based on model gradients. The final ownership verification is conducted via a hypothesis test. Extensive experiments on the CIFAR-10 and ImageNet datasets verify the effectiveness of our defense under both centralized training and federated learning.

4.1 Introduction

Deep neural networks (DNNs) have been widely and successfully used in many mission-critical applications for their promising performance and efficiency [11, 33, 40]. Well-trained models are valuable intellectual property of their owners, due to their high commercial value and training costs. However, recent studies revealed that adversaries can easily “steal” well-performed DNNs and use them for their own purposes without authorization. This threat is called model stealing [41]. It can arise even for a deployed model that the adversaries can only query, not to mention that the adversaries can directly copy and use the victim models when they have access to their source files.


Accordingly, model stealing poses a realistic threat to model owners. In this chapter, we discuss how to defend against model stealing based on ownership verification: given a suspicious model, the defender intends to accurately determine whether it was stolen from a victim model. We first reveal that existing model watermarks either misjudge, cannot survive complicated stealing processes, or even introduce new security risks. Based on our analysis of these failures, we propose a robust and harmless model watermark for ownership verification. Our defense includes three main stages: (1) model watermarking with embedded external features, (2) training an ownership meta-classifier, and (3) model ownership verification with a hypothesis test. Specifically, we watermark the model by embedding external features, based on modifying some training samples via style transfer. Since we change only a few images and do not reassign their labels, the embedded external features have minor adverse effects on the learning of victim models. We then train a meta-classifier to determine whether a suspicious model is stolen from the victim, based on the gradients of the victim model and its benign version. At the end, we design a hypothesis-test-based method to further improve the verification confidence. Extensive experiments verify the effectiveness of our defense under both centralized training and federated learning. This chapter is developed based on our conference paper [28]. Compared with the preliminary conference version, we generalize the proposed method to scenarios under federated learning and provide more discussions and experiments.

4.2 Related Work

4.2.1 Model Stealing

Model stealing aims to “steal” the intellectual property of model owners by generating a function-similar substitute of the model. The existing model stealing methods can be categorized into three main types based on the adversary's capability: (1) dataset-accessible attacks, (2) model-accessible attacks, and (3) query-only attacks. Their detailed settings are as follows:

Dataset-Accessible Attacks (A_D) In this setting, the adversaries can access the training dataset but can only query the victim model. In this case, the adversaries can obtain a substitute model via knowledge distillation [13].

Model-Accessible Attacks (A_M) In this setting, the adversaries have full access to the victim model, e.g., when the victim model is open-sourced. In this case, the adversary can obtain a function-similar copy via data-free knowledge distillation [7] or by tuning the victim model with a few local samples.


In general, these attacks save the adversary the significant training resources needed to train a model from scratch.

Query-Only Attacks (A_Q) In this setting, the adversaries can only query the model. These attacks consist of two sub-classes, based on the type of model predictions: (1) label-query attacks [2, 16, 37] and (2) logit-query attacks [36, 41]. Specifically, the former use the victim model to annotate some unlabeled samples, which are then used to train the adversary's substitute model. In the latter, the adversaries perform model stealing by minimizing the distance between the substitute model's predicted logits and those of the victim model.

4.2.2 Defenses Against Model Stealing

The existing defenses against model stealing can be divided into two categories: non-verification-based defenses and verification-based defenses (i.e., dataset inference and backdoor-based model watermarking), as follows:

Non-verification-Based Defenses Most of these defenses reduce the stealing threat by making model stealing more difficult, based on modifying the results returned by the victim model. For example, defenders could round the probability vectors [41], add noise to model predictions [23], or return only the most confident label [36]. However, these defenses may significantly degrade the service for legitimate users and can even be bypassed by adaptive attacks [17]. Other methods [19, 20, 43] detect model stealing by identifying malicious queries; however, they can only detect some specific query patterns, which the adversary may not use.

Dataset Inference To the best of our knowledge, this is the first and the only verification-based method that can defend against different types of model stealing simultaneously. In general, dataset inference identifies whether a suspicious model contains the knowledge of inherent features that the victim model learned from its private training dataset. Specifically, given a K-class classification problem, for each sample (x, y), it first calculates the minimum-distance perturbation δ_t toward each class t by

min_{δ_t} d(x, x + δ_t), s.t. V(x + δ_t) = t,    (4.1)

where d(·) is a distance metric. The distances to all classes, δ = (δ₁, · · · , δ_K), can be regarded as the feature embedding of the sample (x, y). After that, the defender randomly selects some samples inside (labeled “+1”) or outside (labeled “−1”) the private dataset and uses their feature embeddings to train a binary meta-classifier C. To judge whether a suspicious model is stolen from the victim model, defenders generate the feature embeddings of private and public samples, respectively, based on which they perform a hypothesis test via the meta-classifier C. However, as we will show in the following experiments, dataset inference easily makes misjudgments, especially when the samples used for training the suspicious models have a latent distribution similar to that of the samples used by the victim model.


This misjudgment arises mostly because different DNNs can still learn similar features if their training samples share some similarities. This shortcoming makes the results of dataset inference unreliable.

Backdoor-Based Model Watermarking The main purpose of backdoor-based model watermarking is to detect model theft (i.e., direct model copying) rather than to identify model stealing. However, we notice that dataset inference has some similarities to backdoor-based model watermarking [1, 26, 46]. Accordingly, these methods could serve as potential defenses against model stealing. These approaches conduct ownership verification by making the watermarked model misclassify certain specific samples. Specifically, defenders first adopt backdoor attacks [10, 27, 45] to watermark the victim model and then perform the model ownership verification. In general, a backdoor attack is determined by three core components: (1) the trigger pattern t, (2) the adversary-predefined poisoned image generator G(·), and (3) the target class y_t. Given the benign training set D = {(x_i, y_i)}_{i=1}^N, the backdoor adversary selects a random subset D_s of D to generate its poisoned version D_p = {(x′, y_t) | x′ = G(x; t), (x, y) ∈ D_s}. Different attacks may assign different G. For instance, G(x; t) = (1 − λ) ⊗ x + λ ⊗ t is used in BadNets [10], while G(x; t) is an image-dependent generator in [29, 34, 35]. After D_p is obtained, D_p and the remaining benign samples D_b ≜ D \ D_s are used to train the model f_θ via

min_θ Σ_{(x,y)∈D_p∪D_b} L(f_θ(x), y).    (4.2)

In the ownership verification, similar to dataset inference, defenders examine the behavior of the suspicious model in predicting y_t. Specifically, if the predicted probability of the target class on the poisoned data is significantly larger than that on its benign version, the suspicious model is regarded as containing the specific backdoor watermark and therefore as stolen from the victim model. However, as we will show in the following experiments, these methods are of limited effectiveness in detecting model stealing, especially for attacks with complicated stealing processes, mainly because the hidden backdoor is modified during the stealing.
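A minimal sketch of this kind of backdoor-based watermark data generation is given below, using the BadNets-style blending G(x; t) = (1 − λ) ⊗ x + λ ⊗ t described above. The poisoning rate, blending mask, and array layout are illustrative assumptions of this sketch.

```python
import numpy as np

def poison_dataset(images, labels, trigger, blend_mask, target_label,
                   rate=0.1, seed=0):
    """Generate D_p via G(x; t) = (1 - lambda) * x + lambda * t on a random subset,
    relabel it to the target class, and return (D_p, remaining benign data)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    poisoned_x = (1.0 - blend_mask) * images[idx] + blend_mask * trigger
    poisoned_y = np.full(len(idx), target_label)
    benign_idx = np.setdiff1d(np.arange(len(images)), idx)
    return (poisoned_x, poisoned_y), (images[benign_idx], labels[benign_idx])
```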

4.3 Revisiting Existing Model Ownership Verification

As described in Sect. 4.2.2, both dataset inference and backdoor-based model watermarks rest on questionable latent assumptions, which may lead to unsatisfactory results in practice. In this section, we verify these limitations empirically.


Table 4.1 The test accuracy of victim models and the p-value of the verification processes. In this experiment, dataset inference misjudges in all cases

Model         | ResNet-D_r | VGG-D_l | VGG-D_l'
Test accuracy | 88.0%      | 87.7%   | 85.0%
p-value       | 10^-7      | 10^-5   | 10^-4

4.3.1 The Limitation of Dataset Inference

Settings In this section, we adopt the CIFAR-10 [21] dataset with VGG [39] and ResNet [12] as an example for discussion. Specifically, we separate the original training set D into two random, equal-sized, disjoint subsets D_l and D_r. After that, we train a VGG on D_l and a ResNet on D_r, respectively. We also train a VGG on a noisy version of D_l, denoted D_l' ≜ {(x', y) | x' = x + N(0, 16), (x, y) ∈ D_l}, for reference. In the verification process, we check whether VGG-D_l and VGG-D_l' are judged as stolen from ResNet-D_r and whether ResNet-D_r is judged as stolen from VGG-D_l. We use the p-value as the evaluation metric, as suggested in existing methods of model ownership verification; it is calculated with the approach described in Sect. 4.2.2. In particular, the larger the p-value, the less confident dataset inference is that the suspicious model was stolen from the victim.

Results As shown in Table 4.1, all models have promising accuracy, even though their training samples are only half of the original ones. In particular, all p-values are significantly smaller than 0.01, i.e., dataset inference believes that these models are all stolen from the victim. However, in each case the suspicious model should not be regarded as stolen, since the victim and the suspicious models are trained on different samples and with different model structures. These results show that dataset inference can make misjudgments. Besides, the p-value of VGG-D_l is smaller than that of VGG-D_l'. This phenomenon is most probably because the latent distribution of D_l' differs more from that of D_r (compared with that of D_l), and therefore the models learn more different features. It also reveals that the misjudgment is mostly due to the distributional similarity of the training samples used by the victim and the suspicious models.

4.3.2 The Limitation of Backdoor-Based Watermarking

Intuitively, the inference process of backdoor attacks is similar to opening a door with its key [25]. Accordingly, backdoor-based model watermarking can succeed only if the trigger pattern used by the victim matches a hidden backdoor contained in the suspicious model (assuming it is indeed stolen from the victim model). This requirement is satisfied in the originally considered scenario, where the suspicious model is identical to the victim model. However, it does not necessarily hold in model stealing, where hidden backdoors may be changed or even removed during the stealing process. Accordingly, these methods may fail to defend against model stealing.

Table 4.2 The performance (%) of different models. The low ASR of the stolen model corresponds to the failed verification case

Model type | Benign | Victim | Stolen
BA         | 91.99  | 85.49  | 70.17
ASR        | 0.01   | 100.00 | 3.84

Settings We adopt the most representative backdoor attack (i.e., BadNets [10]) as an example for discussion. Specifically, we first train the victim model with BadNets. We then obtain the suspicious model via data-free distillation-based model stealing [7] applied to the victim model. We use the attack success rate (ASR) [25] and the benign accuracy (BA) to evaluate the stolen model. In general, the smaller the ASR, the less likely the stealing is to be detected.

Results As shown in Table 4.2, the ASR of the stolen model is significantly lower than that of the victim model. These results reveal that the defender-adopted trigger no longer matches the hidden backdoor in the stolen model. Accordingly, backdoor-based model watermarking fails to detect model stealing.

4.4 The Proposed Method Under Centralized Training 4.4.1 Threat Model and Method Pipeline Threat Model Following the setting used by existing research [30, 42, 47], we consider the defense of model stealing under the white-box setting. Specifically, we assume that defenders have complete access to the suspicious model, while they have no information about the stealing. The defenders intend to accurately determine whether the suspicious model is stolen from the victim, based on the predictions of the suspicious and victim models. Method Pipeline Given the discussions in Sect. 4.3, we propose to embed external features instead of inherent features for ownership verification. Specifically, as shown in Fig. 4.1, our method has three main steps, including: (1) model watermarking via embedding external features, (2) training the ownership meta-classifier, and (3) model ownership verification. More details are in Sects. 4.4.2–4.4.4.

4.4.2 Model Watermarking with Embedded External Features In this section, we illustrate how to watermark the (victim) model with external features. Before we present its technical details, we first provide necessary definitions.


Fig. 4.1 The main pipeline of our defense. In the first step, the defenders conduct style transfer to modify some images without reassigning their label. This step is used to embed external features to the victim model during the training process. In the second step, the defenders train a metaclassifier used for determining whether a suspicious model is stolen from the victim, based on the gradients of transformed images. In the last step, the defenders perform model ownership verification via hypothesis test with the meta-classifier [28]

Definition 4.1 (Inherent and External Features) A feature f is called an inherent feature of dataset D if and only if ∀(x, y) ∈ X × Y, (x, y) ∈ D ⇒ (x, y) contains feature f. Similarly, f is called an external feature of dataset D if and only if ∀(x, y) ∈ X × Y, (x, y) contains feature f ⇒ (x, y) ∉ D.

Example 4.1 If an image is from MNIST [22], it is at least grayscale; if an image is of ink-painting type, it is not from ImageNet [5], which consists only of natural images.

Although external features are easy to define, generating them is still difficult, since the learning process of DNNs is a black box and the concept of a feature is itself complicated. Inspired by recent literature [4, 6, 9], we know that image style can serve as a feature in image or video recognition. Accordingly, we propose to adopt style transfer [3, 15, 18] to embed external features, based on a defender-specified style image. Defenders can also use other embedding methods, which we will discuss in future work.

Specifically, let D = {(x_i, y_i)}_{i=1}^N denote the unmodified training dataset, x_s a given style image, and T : X × X → X a style transformer. In this step, the defenders first randomly select γ% of the samples (i.e., D_s) from D to produce the transformed dataset D_t = {(x′, y) | x′ = T(x, x_s), (x, y) ∈ D_s}. The victim model V_θ then learns the external features contained in the style image during the training process:

min_θ Σ_{(x,y)∈D_b∪D_t} L(V_θ(x), y),    (4.3)

where D_b ≜ D \ D_s and L(·) is the loss function. In particular, we transform only a few images and do not modify their labels. As such, the embedding of external features has minor effects on the accuracy of victim models and does not introduce backdoor risks.
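A minimal sketch of the data preparation behind Eq. (4.3) is given below: γ% of the training images are style-transformed while their labels are kept. The function apply_style_transfer is a hypothetical stand-in for any style-transfer routine (e.g., AdaIN-style transfer [15]) and is not a specific library call.

```python
import numpy as np

def build_watermarked_dataset(images, labels, style_image, apply_style_transfer,
                              gamma=0.1, seed=0):
    """Transform a fraction gamma of the images with the style image T(x, x_s),
    keep all labels unchanged, and return the full training set D_b U D_t."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=int(gamma * len(images)), replace=False)
    watermarked = images.copy()
    for i in idx:
        watermarked[i] = apply_style_transfer(images[i], style_image)
    return watermarked, labels, idx  # idx identifies D_t for later verification
```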


4.4.3 Training Ownership Meta-Classifier

As mentioned in the previous section, the embedding of external features has minor effects on model predictions. Accordingly, different from backdoor-based watermarking and dataset inference, we have to train an additional meta-classifier to verify whether the suspicious model contains the knowledge of the external features. In this chapter, we use model gradients as the input to train the meta-classifier C_w : R^{|θ|} → {−1, +1}. We assume that the victim model V and the suspicious model S have the same model structure. This assumption can easily be satisfied, since defenders can retrain a model with the structure of the suspicious model on the watermarked dataset to serve as the victim model. Once defenders obtain the suspicious model, they can also train its benign version B on the unmodified dataset D. After that, the training set D_c of the meta-classifier C is obtained via

D_c = {(g_V(x′), +1) | (x′, y) ∈ D_t} ∪ {(g_B(x′), −1) | (x′, y) ∈ D_t},    (4.4)

where sgn(·) is the sign function [38], g_V(x′) = sgn(∇_θ L(V(x′), y)), and g_B(x′) = sgn(∇_θ L(B(x′), y)). We adopt the sign vector instead of the gradient itself to highlight the influence of the gradient direction, whose effectiveness is verified in Sect. 4.6.5. Finally, the meta-classifier C_w is trained by

min_w Σ_{(s,t)∈D_c} L(C_w(s), t).    (4.5)
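A minimal PyTorch sketch of constructing the meta-classifier training set D_c of Eq. (4.4) is given below. Flattening all parameter gradients into a single vector and using the cross-entropy loss are illustrative assumptions, not requirements of the method.

```python
import torch
import torch.nn.functional as F

def gradient_sign(model, x, y):
    """g(x') = sgn(grad_theta L(model(x'), y)), flattened into one vector."""
    model.zero_grad()
    loss = F.cross_entropy(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.sign(torch.cat([g.reshape(-1) for g in grads]))

def build_meta_training_set(victim, benign, transformed_loader):
    """Label gradient signs of the victim model +1 and of its benign version -1."""
    features, targets = [], []
    for x, y in transformed_loader:
        features.append(gradient_sign(victim, x, y)); targets.append(1)
        features.append(gradient_sign(benign, x, y)); targets.append(-1)
    return torch.stack(features), torch.tensor(targets)
```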

4.4.4 Model Ownership Verification with Hypothesis Test

Once the meta-classifier is trained, the defender can examine the suspicious model simply via the result of C(g_S(x′)), given a transformed image x′ and its label y, where g_S(x′) = sgn(∇_θ L(S(x′), y)). If C(g_S(x′)) = 1, our method treats the suspicious model as stolen from the victim. However, its effectiveness may be significantly influenced by the selection of x′. In this chapter, we design a hypothesis-test-based method to alleviate this problem, as follows:

Definition 4.2 Let X′ denote the variable of transformed images, while μ_S and μ_B indicate the posterior probabilities of the events C(g_S(X′)) = 1 and C(g_B(X′)) = 1, respectively. Given the null hypothesis H₀ : μ_S = μ_B (against H₁ : μ_S > μ_B), we claim that the suspicious model S is stolen from the victim if and only if H₀ is rejected.

In practice, we sample m random transformed images from D_t to conduct the single-tailed pair-wise T-test [14] and calculate its p-value. When the p-value is smaller than the significance level α, H₀ is rejected. Moreover, similar to dataset inference, we also calculate the confidence score Δμ = μ_S − μ_B to represent the verification confidence. The smaller the Δμ, the less confident the verification.
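The verification procedure of Definition 4.2 can be sketched as follows: evaluate the meta-classifier on the gradient signs of m transformed images for the suspicious and benign models, then run a one-sided paired t-test. Using scipy's ttest_rel with alternative='greater' is one possible realization of the single-tailed pair-wise T-test; the meta_classifier callable and its inputs are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import ttest_rel

def verify_ownership(meta_classifier, suspicious_signs, benign_signs, alpha=0.05):
    """Test H0: mu_S = mu_B against H1: mu_S > mu_B; return (stolen?, p-value, delta_mu)."""
    mu_s = np.array([float(meta_classifier(g) == 1) for g in suspicious_signs])
    mu_b = np.array([float(meta_classifier(g) == 1) for g in benign_signs])
    _, p_value = ttest_rel(mu_s, mu_b, alternative='greater')
    delta_mu = mu_s.mean() - mu_b.mean()   # confidence score
    return p_value < alpha, p_value, delta_mu
```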


4.5 The Proposed Method Under Federated Learning

In the previous section, we illustrated our method under centralized training. In this section, we explore how to adapt it to federated learning.

4.5.1 Problem Formulation and Threat Model Problem Formulation In this chapter, we consider the standard horizontal federated learning (HFL) [24, 32, 48], where a server intends to train DNNs across multiple decentralized edge devices holding some local samples without exchanging samples. HFL aims to protect the privacy of samples in each edge device, which is different from the traditional centralized training paradigm where all the local samples are uploaded to the server. In practice, the edge devices may only allow the server to use their samples for training one specific model. However, the server may use edge’s gradients to train multiple models for different purposes without authorization, which can be regarded as infringing the copyright of local datasets. This stealing process is stealthy since edge devices can hardly identify the specific use of gradients by the server. In this chapter, we discuss how to defend against it. Threat Model In this chapter, we follow the standard HFL setting, where each edge device can obtain the shared model delivered by the server during each iteration. The edge device calculates and uploads the gradient during each iteration, based on the shared model and local samples. The server will update the model according to the simple averaged aggregation of gradients uploaded by all edge devices. We assume that, in each iteration, all edge devices can modify the local samples used for calculating gradients and save the shared model.

4.5.2 The Proposed Method Similar to the scenarios under centralized training, the edge device can protect its data copyright by watermarking the model. There are two main training stages for the defense under federated learning, including: (1) model warm-up and (2) embedding external features. Specifically, in the first stage, the edge device uses local benign samples to calculate and update gradients. The stage is used to obtain the benign model by saving the one at the end of the stage. In the second stage, the edge device will watermark all local samples based on style transfer, as the approach illustrated in Sect. 4.4.2. The edge device will save the shared model at the end of this stage as the victim model. We notice that the edge device can also train the victim model directly based on its local samples. However, since the local samples are usually limited, the victim model may have relatively low benign accuracy resulting in poor performance of the meta-classifier trained based on it. We will further explore it in our future work.


Once the victim and the benign model are obtained, the edge device can train the meta-classifier according to the method presented in Sect. 4.4.3. The edge device can also validate whether a given suspicious model was trained on its local samples based on the approach proposed in Sect. 4.4.4.

4.6 Experiments 4.6.1 Experimental Settings Dataset Selection and Model Structure We conduct experiments on CIFAR-10 [21] and a subset of ImageNet [5]. As suggested in [31], we adopt the WideResNet [44] and ResNet [12] as the victim model on CIFAR-10 and ImageNet, respectively. Evaluation Metric As suggested in [31], we also use the confidence score Δμ and p-value for evaluation. Specifically, we randomly select 10 samples to calculate both Δμ and p-value. In general, the larger the Δμ and the smaller the p-value, the better the defense. We mark the best result among all defenses in boldface.

4.6.2 Main Results Under Centralized Training

Settings for Model Stealing Following the settings in [31], we conduct the model stealing methods illustrated in Sect. 4.2.1 to evaluate the effectiveness of the defenses. Besides, we also provide the results of directly copying the victim model (dubbed “Source”) and of examining a suspicious model that is not stolen from the victim (dubbed “Independent”) for reference.

Settings for Defenses We compare our defense with dataset inference [31] and with model watermarking based on BadNets [10], gradient matching [8], and entangled watermarks [17]. We poison 10% of the benign samples for all defenses. We use a white square as the trigger pattern for the BadNets-based approach and adopt an oil painting as the style image for our method. Other settings are the same as those used in the original papers. Examples of the images (e.g., transformed and poisoned images) used by the different defenses are shown in Fig. 4.2.

Results As shown in Tables 4.3 and 4.4, our method has the best performance


Fig. 4.2 The example of images used by different defenses. (a) Benign image; (b) poisoned image in BadNets; (c) poisoned image in Gradient Matching; (d) poisoned image in Entangled Watermarks; (e) style image; (f) transformed image [28]

in almost all cases. Specifically, the p-value of our method is three orders of magnitude smaller than that of the method with entangled watermarks in defending against dataset-accessible and query-only attacks on the CIFAR-10 dataset. The only exceptions appear in the direct-copy case, where the defense with entangled watermarks has some advantages; however, even in these cases our method still makes correct judgments with high confidence. Besides, our defense has minor side effects on the victim models: the benign accuracies of the models trained on CIFAR-10 and its transformed version are 91.99% and 91.79%, respectively. Moreover, different from backdoor-based model watermarking, our defense does not introduce new security threats. These benefits arise mainly because we do not change the labels of the transformed images and transform only a few images when training the victim model. In this case, the style transformation can be regarded as a special, harmless data augmentation.

4.6.3 Main Results Under Federated Learning Settings We assume that there are m different edge devices with the same capacity. Specifically, we split the original training set into m disjoint subsets with the same amount of samples and assume that each device holds one. We consider two scenarios including the direct training (dubbed “Steal”) and the no stealing (dubbed “Independent”). Other settings are the same as those used in Sect. 4.6.2. Results As shown in Table 4.5, our method can still accurately identify model stealing even under federated learning. An interesting phenomenon is that as the number of edge devices increases, the verification becomes better rather than worse. This phenomenon is mainly because poisoning all training samples will significantly reduce the benign accuracy of the model when there are only a few edge devices. We will further explore how to design adaptive methods to improve the performance of our defense under federated learning in the future.

Table 4.3 The main results on the CIFAR-10 dataset under centralized training. Each cell reports Δμ / p-value ("–" means Δμ is not reported by dataset inference)

Model stealing       | BadNets        | Gradient matching | Entangled watermarks | Dataset inference | Ours
Victim: Source       | 0.91 / 10^-12  | 0.88 / 10^-12     | 0.99 / 10^-35        | – / 10^-4         | 0.97 / 10^-7
A_D: Distillation    | −10^-3 / 0.32  | 10^-7 / 0.20      | 0.01 / 0.33          | – / 10^-4         | 0.53 / 10^-7
A_M: Zero-shot       | 10^-25 / 0.22  | 10^-24 / 0.22     | 10^-3 / 10^-3        | – / 10^-2         | 0.52 / 10^-5
A_M: Fine-tuning     | 10^-23 / 0.28  | 10^-27 / 0.28     | 0.35 / 0.01          | – / 10^-5         | 0.50 / 10^-6
A_Q: Label-query     | 10^-27 / 0.20  | 10^-30 / 0.34     | 10^-5 / 0.62         | – / 10^-3         | 0.52 / 10^-4
A_Q: Logit-query     | 10^-27 / 0.23  | 10^-23 / 0.33     | 10^-6 / 0.64         | – / 10^-3         | 0.54 / 10^-4
Benign: Independent  | 10^-20 / 0.33  | 10^-12 / 0.99     | 10^-22 / 0.68        | – / 1.00          | 0.00 / 1.00


Table 4.4 The main results on the ImageNet dataset under centralized training. Each cell reports Δμ / p-value ("–" means Δμ is not reported by dataset inference)

Model stealing       | BadNets        | Gradient matching | Entangled watermarks | Dataset inference | Ours
Victim: Source       | 0.87 / 10^-10  | 0.77 / 10^-10     | 0.99 / 10^-25        | – / 10^-6         | 0.90 / 10^-5
A_D: Distillation    | 10^-4 / 0.43   | 10^-12 / 0.43     | 10^-6 / 0.19         | – / 10^-3         | 0.61 / 10^-5
A_M: Zero-shot       | 10^-12 / 0.33  | 10^-18 / 0.43     | 10^-3 / 0.46         | – / 10^-3         | 0.53 / 10^-4
A_M: Fine-tuning     | 10^-20 / 0.20  | 10^-12 / 0.47     | 0.46 / 0.01          | – / 10^-4         | 0.60 / 10^-5
A_Q: Label-query     | 10^-23 / 0.29  | 10^-22 / 0.50     | 10^-7 / 0.45         | – / 10^-3         | 0.55 / 10^-3
A_Q: Logit-query     | 10^-23 / 0.38  | 10^-12 / 0.22     | 10^-6 / 0.36         | – / 10^-3         | 0.55 / 10^-4
Benign: Independent  | 10^-24 / 0.38  | 10^-23 / 0.78     | 10^-30 / 0.55        | – / 0.98          | 10^-5 / 0.99


Table 4.5 The main results on the CIFAR-10 dataset under federated learning. Each cell reports Δμ / p-value

             | m = 1 (centralized) | m = 2           | m = 4          | m = 8
Source       | 0.97 / 10^-7        | 0.83 / 10^-10   | 0.94 / 10^-22  | 0.97 / 10^-31
Independent  | 0.00 / 1.00         | 10^-11 / 0.99   | 10^-9 / 0.99   | 10^-9 / 0.99

Fig. 4.3 The effects of the transformation rate (%) and the number of sampled images [28]

4.6.4 The Effects of Key Hyper-Parameters

In this section, we take our method under centralized training on the CIFAR-10 dataset as an example for discussion. Unless otherwise specified, all settings are the same as those stated in Sect. 4.6.2.

The Effects of the Transformation Rate γ In general, the larger γ, the more training samples are transformed during the training of the victim model. As shown in Fig. 4.3, the p-value decreases as γ increases when defending against all model stealing attacks. These results show that increasing the transformation rate yields a more robust watermark, as expected. Note, however, that a large γ may also lower the benign accuracy of the victim model; defenders should choose γ based on their specific requirements in practice.

The Effects of the Number of Sampled Images M As described in Sect. 4.4.4, we sample some transformed images for the ownership verification. In general, the larger M, the less randomness in the sample selection and therefore the more confident the verification. This is mainly why, as shown in Fig. 4.3, the p-value also decreases as M increases.


Fig. 4.4 The style images that we used for evaluation [28]

Table 4.6 The effectiveness of our method with different style images on CIFAR-10. Each cell reports Δμ / p-value

Model stealing       | Pattern (a)   | Pattern (b)   | Pattern (c)    | Pattern (d)
Victim: Source       | 0.98 / 10^-7  | 0.97 / 10^-7  | 0.98 / 10^-10  | 0.98 / 10^-12
A_D: Distillation    | 0.68 / 10^-7  | 0.53 / 10^-7  | 0.72 / 10^-8   | 0.63 / 10^-7
A_M: Zero-shot       | 0.61 / 10^-5  | 0.52 / 10^-5  | 0.74 / 10^-8   | 0.67 / 10^-7
A_M: Fine-tuning     | 0.46 / 10^-5  | 0.50 / 10^-6  | 0.21 / 10^-7   | 0.50 / 10^-9
A_Q: Label-query     | 0.64 / 10^-5  | 0.52 / 10^-4  | 0.68 / 10^-8   | 0.68 / 10^-7
A_Q: Logit-query     | 0.65 / 10^-4  | 0.54 / 10^-4  | 0.62 / 10^-6   | 0.73 / 10^-7
Benign: Independent  | 0.00 / 1.00   | 0.00 / 1.00   | 0.00 / 1.00    | 10^-9 / 0.99

The Effects of Style Image In this part, we validate whether our method is still effective if we use different style images (as shown in Fig. 4.4). As shown in Table 4.6, the p-value is significantly smaller than 0.01 in all cases. In other words, our method can still accurately identify stealing when using different style images, although there will be some performance fluctuations. We will discuss how to optimize the style image in our future work.

4.6.5 Ablation Study

There are three main parts in our defense: (1) embedding external features with style transfer, (2) using a meta-classifier for ownership verification, and (3) using sign vectors instead of the gradients themselves for training the meta-classifier. In this section, we verify their effectiveness under centralized training.

The Effectiveness of Style Transfer To verify that the style-based watermark is more robust than patch-based ones during the stealing process, we compare our method with a variant that uses the trigger pattern of BadNets to generate the transformed images. As shown in Table 4.7, our defense is superior to its patch-based variant. The benefit is mostly because the style watermark is larger than the patch one and DNNs are more inclined to learn texture-related information [9].


Table 4.7 The effectiveness (p-value) of style transfer and the meta-classifier on CIFAR-10

              | Style transfer: Patch-based | Style transfer: Style-based (ours) | Meta-classifier: w/o | Meta-classifier: w/ (ours)
Distillation  | 0.17   | 10^-7 | 0.32 | 10^-3
Zero-shot     | 0.01   | 10^-5 | 0.22 | 10^-61
Fine-tuning   | 10^-3  | 10^-6 | 0.28 | 10^-5
Label-query   | 10^-3  | 10^-4 | 0.20 | 10^-50
Logit-query   | 10^-3  | 10^-4 | 0.23 | 10^-3

The Effectiveness of the Meta-Classifier To verify that our meta-classifier is also useful, we compare BadNets-based model watermarking with an extension that uses the meta-classifier for ownership verification, where the victim model is the watermarked one and the transformed images are those containing backdoor triggers. As shown in Table 4.7, using the meta-classifier significantly decreases the p-value. These results also partly explain why our defense is effective.

The Effectiveness of the Sign of Gradients In this part, we verify the effectiveness of using sign vectors instead of the gradients themselves in the training of our meta-classifier. As shown in Table 4.8, adopting the sign of gradients is significantly better than using gradients directly. This is probably because the “direction” of gradients contains more information than their “magnitude.”

4.7 Conclusion In this chapter, we revisited the defenses against model stealing based on model ownership verification. We revealed the limitations of existing methods, based on which we proposed to embed external features by style transfer as the robust and harmless model watermark. We verified the effectiveness of our defense under both centralized training and federated learning scenarios on benchmark datasets. We hope that this chapter can provide a new angle on model watermarks, to facilitate the design of more effective and secure methods.

Table 4.8 The performance of our meta-classifier trained with different features. Each cell reports Δμ / p-value

Model stealing | CIFAR-10: Gradient | CIFAR-10: Sign of gradient (ours) | ImageNet: Gradient | ImageNet: Sign of gradient (ours)
Source         | 0.44 / 10^-5       | 0.97 / 10^-7                      | 0.15 / 10^-4       | 0.90 / 10^-5
Distillation   | 0.27 / 0.01        | 0.53 / 10^-7                      | 0.15 / 10^-4       | 0.61 / 10^-5
Zero-shot      | 0.03 / 10^-3       | 0.52 / 10^-5                      | 0.12 / 10^-3       | 0.53 / 10^-4
Fine-tuning    | 0.04 / 10^-5       | 0.50 / 10^-6                      | 0.13 / 10^-3       | 0.60 / 10^-5
Label-query    | 0.08 / 10^-3       | 0.52 / 10^-4                      | 0.13 / 10^-3       | 0.55 / 10^-3
Logit-query    | 0.07 / 10^-5       | 0.54 / 10^-4                      | 0.12 / 10^-3       | 0.55 / 10^-4
Independent    | 0.00 / 1.00        | 0.00 / 1.00                       | 10^-10 / 0.99      | 10^-5 / 0.99


Acknowledgments We sincerely thank Xiaojun Jia from Chinese Academy of Science and Professor Xiaochun Cao from Sun Yat-sen University for their constructive comments and helpful suggestions on an early draft of this chapter.

References 1. Adi, Y., Baum, C., Cisse, M., Pinkas, B., Keshet, J.: Turning your weakness into a strength: watermarking deep neural networks by backdooring. In: USENIX Security (2018) 2. Chandrasekaran, V., Chaudhuri, K., Giacomelli, I., Jha, S., Yan, S.: Exploring connections between active learning and model extraction. In: USENIX Security (2020) 3. Chen, X., Zhang, Y., Wang, Y., Shu, H., Xu, C., Xu, C.: Optical flow distillation: Towards efficient and stable video style transfer. In: ECCV (2020) 4. Cheng, S., Liu, Y., Ma, S., Zhang, X.: Deep feature space Trojan attack of neural networks by controlled detoxification. In: AAAI (2021) 5. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009) 6. Duan, R., Ma, X., Wang, Y., Bailey, J., Qin, A.K., Yang, Y.: Adversarial camouflage: hiding physical-world attacks with natural styles. In: CVPR (2020) 7. Fang, G., Song, J., Shen, C., Wang, X., Chen, D., Song, M.: Data-free adversarial distillation (2019). arXiv preprint arXiv:1912.11006 8. Geiping, J., Fowl, L., Huang, W.R., Czaja, W., Taylor, G., Moeller, M., Goldstein, T.: Witches’ brew: industrial scale data poisoning via gradient matching. In: ICLR (2021) 9. Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A., Brendel, W.: ImageNettrained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In: ICLR (2019) 10. Gu, T., Liu, K., Dolan-Gavitt, B., Garg, S.: BadNets: evaluating backdooring attacks on deep neural networks. IEEE Access 7, 47230–47244 (2019) 11. Guo, Y., Wang, H., Hu, Q., Liu, H., Liu, L., Bennamoun, M.: Deep learning for 3d point clouds: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 43(12), 4338–4364 (2020) 12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016) 13. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NeurIPS Workshop (2014) 14. Hogg, R.V., McKean, J., Craig, A.T.: Introduction to Mathematical Statistics. Pearson Education (2005) 15. Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV (2017) 16. Jagielski, M., Carlini, N., Berthelot, D., Kurakin, A., Papernot, N.: High accuracy and high fidelity extraction of neural networks. In: USENIX Security (2020) 17. Jia, H., Choquette-Choo, C.A., Chandrasekaran, V., Papernot, N.: Entangled watermarks as a defense against model extraction. In: USENIX Security (2021) 18. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and superresolution. In: ECCV (2016) 19. Juuti, M., Szyller, S., Marchal, S., Asokan, N.: PRADA: protecting against DNN model stealing attacks. In: EuroS&P (2019) 20. Kesarwani, M., Mukhoty, B., Arya, V., Mehta, S.: Model extraction warning in MLaaS paradigm. In: ACSAC (2018) 21. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Technical report, Citeseer (2009) 22. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)


23. Lee, T., Edwards, B., Molloy, I., Su, D.: Defending against neural network model stealing attacks using deceptive perturbations. In: IEEE S&P Workshop (2019) 24. Li, T., Sahu, A.K., Talwalkar, A., Smith, V.: Federated learning: challenges, methods, and future directions. IEEE Signal Proc. Mag. 37(3), 50–60 (2020) 25. Li, Y., Jiang, Y., Li, Z., Xia, S.-T.: Backdoor learning: a survey. IEEE Trans. Neural Netw. Learn. Syst. (2022). 26. Li, Y., Zhang, Z., Bai, J., Wu, B., Jiang, Y., Xia, S.T.: Open-sourced dataset protection via backdoor watermarking. In: NeurIPS Workshop (2020) 27. Li, Y., Zhong, H., Ma, X., Jiang, Y., Xia, S.T.: Few-shot backdoor attacks on visual object tracking. In: ICLR (2022) 28. Li, Y., Zhu, L., Jia, X., Jiang, Y., Xia, S.T., Cao, X.: Defending against model stealing via verifying embedded external features. In: AAAI (2022) 29. Li, Y., Li, Y., Wu, B., Li, L., He, R., Lyu, S.: Invisible backdoor attack with sample-specific triggers. In: ICCV (2021) 30. Liu, H., Weng, Z., Zhu, Y.: Watermarking deep neural networks with greedy residuals. In: ICML (2021) 31. Maini, P., Yaghini, M., Papernot, N.: Dataset inference: Ownership resolution in machine learning. In: ICLR (2021) 32. McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: AISTATS (2017) 33. Minaee, S., Boykov, Y.Y., Porikli, F., Plaza, A.J., Kehtarnavaz, N., Terzopoulos, D.: Image segmentation using deep learning: a survey. IEEE Trans. Pattern Anal. Mach. Intell. (2021). 34. Nguyen, T.A., Tran, A.: Input-aware dynamic backdoor attack. In: NeurIPS (2020) 35. Nguyen, T.A., Tran, A.T.: WaNet-imperceptible warping-based backdoor attack. In: ICLR (2021) 36. Orekondy, T., Schiele, B., Fritz, M.: Knockoff nets: stealing functionality of black-box models. In: CVPR (2019) 37. Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: AsiaCCS (2017) 38. Sachs, L.: Applied Statistics: A Handbook of Techniques. Springer, Berlin (2012) 39. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015) 40. Stokes, J.M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N.M., MacNair, C.R., French, S., Carfrae, L.A., Bloom-Ackermann, Z., et al.: A deep learning approach to antibiotic discovery. Cell 180(4), 688–702 (2020) 41. Tramèr, F., Zhang, F., Juels, A., Reiter, M.K., Ristenpart, T.: Stealing machine learning models via prediction APIs. In: USENIX Security (2016) 42. Wang, T., Kerschbaum, F.: RIGA: covert and robust white-box watermarking of deep neural networks. In: WWW (2021) 43. Yan, H., Li, X., Li, H., Li, J., Sun, W., Li, F.: Monitoring-based differential privacy mechanism against query flooding-based model extraction attack. IEEE Trans. Depend. Secure Comput. (2021) 44. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016) 45. Zhai, T., Li, Y., Zhang, Z., Wu, B., Jiang, Y., Xia, S.-T.: Backdoor attack against speaker verification. In: ICASSP (2021) 46. Zhang, J., Gu, Z., Jang, J., Wu, H., Stoecklin, M.P., Huang, H., Molloy, I.: Protecting intellectual property of deep neural networks with watermarking. In: AsiaCCS (2018) 47. Zhang, J., Chen, D., Liao, J., Zhang, W., Hua, G., Yu, N.: Passport-aware normalization for deep model protection. In: NeurIPS (2020) 48. 
Zhu, L., Liu, X., Li, Y., Yang, X., Xia, S.-T., Lu, R.: A fine-grained differentially private federated learning against leakage from gradients. IEEE Internet Things J. (2021)

Chapter 5

Protecting Intellectual Property of Machine Learning Models via Fingerprinting the Classification Boundary Xiaoyu Cao, Jinyuan Jia, and Neil Zhenqiang Gong

Abstract Machine learning models are considered the model owners' intellectual property (IP). An attacker may steal and abuse others' machine learning models so that it does not need to train its own model, which would require a large amount of resources. Therefore, how to detect such compromises of IP becomes an urgent problem. Watermarking has been widely adopted as a solution in the literature. However, watermarking requires modification of the training process, which leads to utility loss and is not applicable to legacy models. In this chapter, we introduce another path toward protecting the IP of machine learning models via fingerprinting the classification boundary. This is based on the observation that a machine learning model can be uniquely represented by its classification boundary. Specifically, the model owner extracts some data points near the classification boundary of its model, which are used to fingerprint the model. Another model is likely to be a pirated version of the owner's model if the two models have the same predictions for most fingerprinting data points. The key difference between fingerprinting and watermarking is that fingerprinting extracts a fingerprint that characterizes the classification boundary of the model, while watermarking embeds watermarks into the model by modifying the training or fine-tuning process. In this chapter, we illustrate that we can robustly protect the model owners' IP with the fingerprint of the model's classification boundary.

5.1 Introduction Machine learning models have been widely deployed in various domains due to their ability to surpass human-level performance in different applications. However, the impressive performance comes at the cost of expensive training, which requires

X. Cao () · J. Jia · N. Z. Gong Duke University, Durham, NC, USA e-mail: [email protected]; [email protected]; [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 L. Fan et al. (eds.), Digital Watermarking for Machine Learning Model, https://doi.org/10.1007/978-981-19-7554-7_5


substantial computation resource as well as training data. This may motivate attackers to steal and abuse model owners' trained models, instead of training their own models. For instance, a model owner may deploy its trained model as a cloud service or client-side software. An attacker could steal the model via different approaches, such as malware infection, insider threats, or model extraction attacks [9, 24–28, 43, 50, 51, 55]. The attacker may further abuse the pirated model by deploying the model as its own cloud service or client-side software. Such theft and abuse of a machine learning model violate the model owner's IP, since the model owner has devoted its proprietary training data, confidential algorithm, and computational infrastructure to the training of the model. The compromise of IP highlights the need for effective ways to identify pirated models.

Most existing approaches leverage watermarking to protect the model owners' IP. The idea of watermarking originates from the area of multimedia [20]. The key idea of watermarking is to embed some (imperceptible) perturbation into the multimedia data, e.g., small pixel value changes for an image, such that it can later be extracted and verified as a proof of IP. Multiple studies have followed the same idea to apply watermarking techniques to machine learning models [1, 10, 11, 13, 14, 18, 32, 35, 37, 42, 45, 57]. Specifically, during the training or fine-tuning process, they embed some watermarks into the model owner's model, which we call the target model. The watermarks can be embedded either into the model parameters or into the target model's predictions for specific inputs. Given a model to be verified, which we call a suspect model, watermarking methods try to extract the watermarks from it. If the same or similar watermarks can be extracted from the suspect model, the suspect model is verified to be a pirated version of the target model. Watermarking techniques have shown their effectiveness in detecting pirated models. However, they have an intrinsic limitation: they require modification of the training or fine-tuning process of the target model. This prevents them from being applied to legacy models and leads to inevitable utility loss.

In this chapter, we introduce another path toward IP protection of machine learning models, i.e., via fingerprinting the classification boundary of the models. Our key intuition is that a machine learning model can be uniquely represented by its classification boundary. Specifically, the classification boundary of a machine learning model essentially partitions the input space into multiple regions, where the data points in each region are predicted as the same label by the model. We can verify whether a suspect model is a pirated version of the target model by comparing their classification boundaries. It is likely that the suspect model is pirated from the target model if the two boundaries overlap with each other. However, a key challenge is that it is non-trivial to represent or compare the classification boundaries directly, especially when the models are deep neural networks. Therefore, we propose fingerprinting the classification boundary of the target model using some data points near its classification boundary. Specifically, we call the data points we find near the classification boundary fingerprinting data points and treat them, together with their predicted labels given by the target model, as the fingerprint.

Figure 5.1 illustrates the fingerprinting data points and the classification boundary. Given a suspect model to verify, we feed the fingerprinting data points to it through its prediction API to obtain


Fig. 5.1 We can leverage some data points near the target model’s classification boundary and their labels predicted by the target model to fingerprint the target model

its predictions for the data points. Then we check whether the predictions given by the suspect model match those given by the target model. If most fingerprinting data points have the same predictions from the two models, we can verify that the suspect model is a pirated version of the target model.

We design five critical goals for fingerprinting a machine learning model, i.e., fidelity, effectiveness, robustness, uniqueness, and efficiency. Fidelity implies that the fingerprinting method should not sacrifice the utility of the target model. Effectiveness requires the fingerprinting method to successfully detect a copy of the target model. Robustness means that the fingerprints should be robust to post-processing. This is because after an attacker steals the target model, the attacker may post-process the model before deploying it as a service. Common post-processing includes model compression (e.g., model pruning) and fine-tuning. Post-processing does not change whether a model is pirated or not. Therefore, the fingerprints should be robust to post-processing, i.e., they should not be disabled by post-processing, and the model owner should still be able to verify them. Uniqueness means that the fingerprints should not be verified if a model is not pirated from the target model, i.e., it is independently trained. This requires the fingerprinting methods to extract different fingerprints for different models. Efficiency means that the cost of extracting fingerprints should not be too high. This is important especially when the model is trained using resource-constrained devices, e.g., mobile phones or IoT devices.

It is trivial to prove the fidelity of fingerprinting methods if they do not modify the target model. It is also simple to measure the effectiveness and efficiency of a method. However, it becomes challenging when it comes to the measurement of robustness and uniqueness. Cao et al. [5] proposed the area under the robustness–uniqueness curves (ARUC) as a metric to quantify the robustness and uniqueness jointly. Like the area under the ROC curve (AUC) in statistics [4], ARUC is a number in the range 0–1, and a larger value indicates a better robustness–uniqueness property.

The key component of fingerprinting the classification boundary is to find representative fingerprints. Existing methods [5, 31, 39, 52, 58] use fingerprinting data points near the classification boundary and their predicted labels as the fingerprint. We note that different methods essentially find different fingerprinting


data points. In this chapter, we take IPGuard [5] as an example to illustrate how to find representative fingerprinting data points near the classification boundary. To achieve both robustness and uniqueness goals, IPGuard considers their distance toward the classification boundary when selecting the fingerprinting data points. Specifically, the distance needs to be neither too small nor too large to obtain a satisfactory trade-off between the robustness and uniqueness. On one hand, if the distance between a data point and the classification boundary is too small, a small change (e.g., caused by post-processing) in the classification boundary would make the data point cross the boundary and result in a different prediction for it. This implies a low robustness of the fingerprinting data point. On the other hand, if the distance is too large, the data point may not have the capability to represent the classification boundary. For instance, independently trained machine learning models will all predict the same label for the data point, implying a low uniqueness. To find the fingerprinting data points with proper distance to the classification boundary, IPGuard formulates and solves an optimization problem. In the optimization problem, IPGuard uses a hyperparameter k to control the distance between the fingerprinting data points and the classification boundary, such that a good robustness–uniqueness trade-off can be achieved. In the rest of this chapter, we will first review the existing watermarking methods for IP protection and discuss their limitations. Next, we will introduce the classification boundary of a machine learning model. We will then formally formulate the problem of fingerprinting a machine learning model and introduce the design of IPGuard. Finally, we will discuss the limitations and potential challenges we may face, as well as potential future work for fingerprinting machine learning models.

5.2 Related Works 5.2.1 Watermarking for IP Protection Watermarking was first proposed to protect the IP for multimedia data [20]. Recently, researchers have generalized the idea to leverage watermarking to protect the IP of machine learning models [1, 10, 11, 13, 14, 18, 32, 35, 37, 42, 45, 57]. Given a target machine learning model, the watermarking methods embed some watermarks into the target model via modifying either its training or fine-tuning process. To verify a suspect model, the model owner will need to check whether the same watermarks could be found in the suspect model. The suspect model is verified as a pirated one if the model owner can observe the same or similar watermarks from the suspect model. Existing watermarking techniques can be roughly grouped into two categories based on how the watermark is embedded. The first one is parameter-based (or white box) watermarking, and the other one is label-based (or black box)


watermarking. Parameter-based watermarking [10, 42] embeds watermarks into the model parameters of the target machine learning model. For instance, the model owner can add some carefully designed regularization terms to the loss function during the training of the target model, such that its model parameters follow a certain distribution. Verifying parameter-based watermarks requires the model owner to have white-box access to the suspect model, i.e., to verify the watermark in a given suspect model, the model owner has to know the values of its model parameters. Label-based watermarking [1, 13, 18, 36, 57] embeds watermarks into the predicted labels or neuron activations of certain model inputs. Specifically, some genuine or crafted data points (e.g., abstract images [1], training data points with extra meaningful content [57], or adversarial examples [36]) are first assigned certain labels. Then, these data points are used for training data augmentation when training the target model. To verify whether a suspect model is a pirated version of the target model, the model owner queries the suspect model with these data points as model inputs. If the returned labels match the assigned labels, then the watermark is verified, and the suspect model is likely to be a pirated one.

Limitations of Watermarking We note that watermarking techniques require modification of the training or fine-tuning process of the target model. In other words, the target model with the watermarks embedded is different from the one without watermarks. This may lead to two key limitations. First, watermarking inevitably sacrifices the utility of the target model. It is challenging and takes significant effort to improve the model accuracy for a particular learning task, even by a small amount. For instance, it took lots of trial-and-error effort for the computer vision community to evolve the ResNet152 [21] model to a more advanced one, i.e., ResNet152V2 [22]. Such effort only increases the test accuracy on the benchmark ImageNet data set by roughly 1% [12]. However, watermarking can easily decrease the model accuracy due to the embedding process. For instance, the test accuracy decreases by 0.5% when only 20 watermark data points are embedded [13] into an ImageNet model. Even if the test accuracy is empirically shown not to be affected by the watermarks, it is unknown whether other properties of the target model have changed due to the watermarks, e.g., fairness or robustness. Second, watermarking techniques require embedding watermarks via tampering with the training or fine-tuning process of the target model, making them not applicable to legacy target models that cannot be retrained or fine-tuned. Instead, fingerprinting methods extract fingerprints from the target models without changing the training process, which is applicable to any model and is guaranteed to have no negative impact on the utility of the target model.
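To make the regularizer idea above concrete, the following is a minimal PyTorch sketch in the spirit of parameter-based watermarking: a secret projection of one layer's weights is pushed toward a target bit string during training. The projection matrix, bit length, and function names are illustrative assumptions, not the exact constructions of [10, 42].

```python
import torch

def watermark_regularizer(weights, embed_matrix, watermark_bits):
    """Penalty that pushes a secret projection of the weights toward target bits.

    weights:        flattened parameter tensor of one layer, shape (d,)
    embed_matrix:   secret projection matrix, shape (n_bits, d)  (an assumption)
    watermark_bits: target bit string in {0, 1}, shape (n_bits,)
    """
    projection = torch.sigmoid(embed_matrix @ weights)  # values in (0, 1)
    return torch.nn.functional.binary_cross_entropy(projection, watermark_bits)

# During training, the owner would add lambda * watermark_regularizer(...) to the
# task loss; at (white-box) verification time, thresholding the projection at 0.5
# should recover watermark_bits from the suspect model's weights.
d, n_bits = 1024, 64
w = torch.randn(d, requires_grad=True)
A = torch.randn(n_bits, d)
b = torch.randint(0, 2, (n_bits,)).float()
loss = watermark_regularizer(w, A, b)
loss.backward()
```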

5.2.2 Classification Boundary Suppose the target model is a $c$-class machine learning classifier, where the output layer is a softmax layer. Moreover, we denote by $\{g_1, g_2, \cdots, g_c\}$ the decision functions of the target model, i.e., $g_i(x)$ is the probability that the input data sample $x$ has label $i$, where $i = 1, 2, \cdots, c$. For convenience, we denote by $\{Z_1, Z_2, \cdots, Z_c\}$ the logits of the target model, i.e., $\{Z_1, Z_2, \cdots, Z_c\}$ are the outputs of the neurons in the layer before the softmax layer. Formally, we have

$$g_i(x) = \frac{\exp(Z_i(x))}{\sum_{j=1}^{c} \exp(Z_j(x))}, \quad (5.1)$$

where $i = 1, 2, \cdots, c$. The label $y$ of the example $x$ is predicted as the one that has the largest probability or logit, i.e., $y = \arg\max_i g_i(x) = \arg\max_i Z_i(x)$.

The classification boundary of the target model consists of data points whose labels the target model cannot decide. For a given data point, if the target model predicts the same largest probability or logit for at least two labels, then we can tell that the data point is on the classification boundary. Formally, we define the target model's classification boundary as follows:

$$CB = \{x \mid \exists\, i, j,\ i \neq j \ \text{and}\ g_i(x) = g_j(x) \geq \max_{t \neq i, j} g_t(x)\} = \{x \mid \exists\, i, j,\ i \neq j \ \text{and}\ Z_i(x) = Z_j(x) \geq \max_{t \neq i, j} Z_t(x)\}, \quad (5.2)$$

where CB is the set of data points that constitute the target model’s classification boundary. We note that for simple machine learning models, e.g., logistic regression and support vector machines (SVM), the classification boundary can be written as a closed-form expression. However, for highly non-linear and complex models such as deep neural networks (DNNs), no such expression is known, and a workaround for characterizing their classification boundary is to use a subset of data points on/near the classification boundary.
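Since exact ties between the two largest probabilities (Eq. 5.2) almost never occur for real-valued logits, a practical proxy is the gap between the two largest logits: the smaller the gap, the closer a point is to the classification boundary. The following minimal NumPy sketch computes Eq. 5.1 and this margin; the function names are illustrative, not part of the chapter's formulation.

```python
import numpy as np

def softmax(z):
    """Convert a logit vector {Z_1, ..., Z_c} into probabilities {g_1, ..., g_c} as in Eq. 5.1."""
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

def boundary_margin(logits):
    """Gap between the largest and second-largest logits.

    A margin of exactly 0 means the point lies on the classification boundary CB
    of Eq. 5.2; a small positive margin means the point is near the boundary.
    """
    top2 = np.sort(logits)[-2:]        # [second-largest, largest]
    return top2[1] - top2[0]

# Example: a 3-class logit vector whose top-2 logits nearly tie, i.e., a point
# close to the boundary between those two classes.
z = np.array([2.05, 2.00, -1.3])
print(softmax(z), boundary_margin(z))  # margin ~ 0.05
```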

5.3 Problem Formulation In this section, we will first describe the threat model we consider. Then we will formally define the problem of fingerprinting machine learning models. Next, we list several design goals for designing fingerprinting methods including fidelity, effectiveness, robustness, uniqueness, and efficiency. Finally, we propose a metric to evaluate the trade-off between the robustness and the uniqueness of a fingerprinting method.

5.3.1 Threat Model We consider two parties in our threat model, i.e., model owner and attacker.


Model Owner A model owner is the person or institute that trains a machine learning model (i.e., the target model) using a (proprietary) training data set and algorithm. The model owner could deploy the target model as a cloud service (also known as machine learning as a service, or MLaaS in short) or as client-side software (e.g., Amazon Echo). We note that the model owner devotes its time, effort, and resources to the training of the target model. Therefore, the trained target model is considered the model owner's valuable intellectual property. We aim to protect the model owner's IP via deriving a fingerprint from the target model. For a suspect model, the model owner verifies whether the same fingerprint can be extracted from it using its prediction API. If the same fingerprint can be found in the suspect model, then we can verify it to be a pirated version of the model owner's target model, and the model owner can take further follow-up actions, e.g., collecting other evidence and filing a lawsuit.

Attacker An attacker is a person or party who aims to be a free rider by stealing and abusing other people's models. Specifically, an attacker pirates the target model and may further deploy it as its own service or software. Moreover, we note that the attacker may post-process the target model before deploying it. Common post-processing includes fine-tuning [1, 13] and model compression [19, 33]. For instance, the attacker may use its own data set to fine-tune the target model such that the model fits the attacker's application better. Or the attacker may use model pruning techniques to reduce the size of the target model so that it can be deployed on resource-constrained devices such as smartphones and IoT devices.

5.3.2 Fingerprinting a Target Model Unlike watermarking techniques, which require watermark generation, embedding, and verification, we define fingerprinting a target model as a two-phase process, i.e., fingerprint extraction and verification. Specifically, we design an Extract function $E$ for extracting the fingerprint from the target model and a Verify function $V$ for verifying whether a suspect model is pirated from the target model. We describe the two functions as follows.

Extract Function $E$ Given a target model $N_t$, the model owner executes the Extract function $E$ to derive a fingerprint of the target model. Specifically, we have

$$F_{N_t} = E(N_t), \quad (5.3)$$

where $F_{N_t}$ is the fingerprint of the target model $N_t$. We note that in general, the fingerprint $F_{N_t}$ can be any property of the target model $N_t$, e.g., the distribution of the model parameters or the digital signature. In this chapter, we focus on fingerprinting the classification boundary of the target model, where the fingerprint of a target model is a set of fingerprinting data points $\{x_1, x_2, \cdots, x_n\}$ near the classification boundary as well as their labels $\{y_1, y_2, \cdots, y_n\}$. Formally, we can express the fingerprint $F_{N_t}$ as $F_{N_t} = \{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$.

Verify Function $V$ Given a suspect model $N_s$, the model owner can execute the Verify function $V$ to check whether the suspect model is a pirated version of the target model. The Verify function $V$ returns a suspect score, which implies the probability that the suspect model is pirated from the target model. Formally, we have

$$S = V(N_s, F_{N_t}), \quad (5.4)$$

where $S$ is the suspect score for the suspect model $N_s$. A suspect model is verified to be a pirated one if its suspect score $S$ exceeds a certain threshold $\tau$, i.e., $S > \tau$. For instance, when the fingerprint $F_{N_t}$ is a set of fingerprinting data points and their labels, we can define the suspect score as the fraction of fingerprinting data points whose labels predicted by the suspect model match those predicted by the target model. A larger fraction of matches indicates that the suspect model is more likely to be a pirated model.

We note that in general, the Verify function $V$ can be either white box or black box. Specifically, in the white-box setting, the model owner has full access to the suspect model, including its model parameters. For instance, when the fingerprint of the target model $F_{N_t}$ is the distribution of its model parameters, the model owner could verify whether the suspect model's parameters follow the same distribution as $F_{N_t}$ in the white-box setting. However, in the black-box setting, the model owner only has access to the prediction API of the suspect model. This is a more realistic setting and is generally applicable, no matter whether the suspect model is deployed as client-side software or a cloud service. In this chapter, we focus on black-box verification, where the model owner only leverages the suspect model's prediction API to verify the suspect model.
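The two-phase process of Eqs. 5.3 and 5.4 can be sketched in Python as follows; the prediction-API signature (a callable returning a single label), the default threshold value, and the function names are illustrative assumptions, and the suspect score is instantiated as the matching fraction described above.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Fingerprint:
    points: List[Any]   # fingerprinting data points x_1, ..., x_n
    labels: List[int]   # their labels y_i predicted by the target model

def extract(target_predict: Callable, candidate_points: List[Any]) -> Fingerprint:
    """Extract function E (Eq. 5.3): record the target model's predictions for the
    chosen fingerprinting data points (how the points are chosen is the subject of Sect. 5.4)."""
    return Fingerprint(candidate_points, [target_predict(x) for x in candidate_points])

def verify(suspect_predict: Callable, fp: Fingerprint, tau: float = 0.8) -> bool:
    """Verify function V (Eq. 5.4): black-box matching fraction compared against the threshold tau."""
    matches = sum(suspect_predict(x) == y for x, y in zip(fp.points, fp.labels))
    suspect_score = matches / len(fp.points)
    return suspect_score > tau
```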

5.3.3 Design Goals We should take the following goals into consideration when designing methods to fingerprint the classification boundary of the target model: • Fidelity. The fingerprinting method should not sacrifice the target model’s utility at all. We note that watermarking methods do not have such property because they need to tamper with the training or fine-tuning process of the target model to embed the watermarks. However, a fingerprinting method is naturally guaranteed to have this property since the extraction function .E does not change the target model .Nt . • Effectiveness. If the suspect model is the same as the target model, then the Verify function .V should produce a large suspect score. Formally, we should have .V (Ns , E(Nt )) > τ when .Ns = Nt , where .τ is the threshold of the suspect


score to distinguish benign and pirated models. A fingerprinting method is said to be effective if and only if it has such property, as it can successfully detect a copy of the target model as a pirated model. • Robustness. If the suspect model is some post-processed version of the target model, then the Verify function should produce a large suspect score. Formally, we should have V(Ns, E(Nt)) > τ if Ns is a post-processed version of Nt. Common post-processing includes but is not limited to model compression (e.g., model pruning), fine-tuning, and knowledge distillation. A fingerprinting method achieves the robustness goal if it can still distinguish the pirated model even after the attacker post-processes the model. • Uniqueness. The fingerprint should be unique to the target model. In other words, if a suspect model is neither the target model nor its post-processed version, then the Verify function should produce a small suspect score. Formally, we should have V(Ns, E(Nt)) ≤ τ if Ns is neither Nt nor a post-processed version of Nt. The uniqueness goal aims to reduce the false alarms in the verification. Specifically, given suspect models that are independently trained, the fingerprinting method should not incorrectly verify them as pirated from the target model. • Efficiency. The fingerprinting method should be efficient at extracting a fingerprint for a target model and verifying the fingerprint for a suspect model. This is important especially when the models are deployed or verified on resource-constrained devices, such as smartphones or IoT devices.

5.3.4 Measuring the Robustness–Uniqueness Trade-off We note that among the aforementioned five design goals, fingerprinting methods fundamentally satisfy the fidelity goal. Moreover, it is trivial to evaluate the effectiveness and efficiency of the methods. For robustness and uniqueness, we can use the fraction of pirated models that are correctly verified to measure the robustness of a method and use the fraction of benign models that are not verified to measure the uniqueness of a method. However, it is challenging to jointly measure the robustness and uniqueness because there is a threshold .τ involved. In fact, we note that the threshold .τ controls a trade-off between the robustness and the uniqueness of a fingerprinting method. Specifically, we can achieve better robustness when we select a smaller .τ because a larger fraction of pirated models will be successfully verified. However, it also results in worse uniqueness since more benign models may also be incorrectly verified. Similarly, it leads to a better uniqueness but a worse robustness if we select a larger .τ . Therefore, how to measure the robustness–uniqueness trade-off becomes the first challenge we need to address before we move on. An intuitive idea to eliminate the impact of selecting different threshold .τ when comparing multiple methods is to use AUC [4], a standard metric that is widely used in machine learning, as the evaluation metric. Specifically, we can rank all the suspect models, including both the pirated ones and benign ones, based on their


suspect scores in a descending order. Then, AUC characterizes the probability that a randomly sampled pirated model ranks higher than a randomly sampled benign model. AUC has shown its superiority and is widely adopted in many applications. However, it is not sufficient for our problem, where we aim to measure the robustness and uniqueness of the fingerprint in conjunction. Specifically, AUC provides no information about how large the gap is between the suspect scores of pirated models and those of the benign models. For instance, once all pirated models have larger suspect scores than all benign models, AUC reaches its maximum value 1, regardless of the gap between the suspect scores. However, such a gap is of high interest when evaluating different fingerprinting methods. This is because the pirated and benign models we can evaluate in practice may be only a subset of all possible suspect models in the wild. A larger gap between the suspect scores of evaluated pirated models and those of evaluated benign models means that we have a greater chance to achieve an ideal robustness–uniqueness trade-off in practice.

Therefore, we recommend using the area under the robustness–uniqueness curves (ARUC) [5] as the metric to evaluate the robustness and uniqueness of a fingerprinting method in conjunction. When we increase the suspect score threshold $\tau$ from 0 to 1, we can obtain the robustness and uniqueness curves with respect to $\tau$. Note that the robustness curve decreases as $\tau$ increases, while the uniqueness curve increases as $\tau$ increases. If we plot the two curves in a single figure, there will be an intersection of the two curves at a certain point. ARUC is defined as the area under the intersected robustness–uniqueness curves. Figure 5.2 illustrates ARUC in three scenarios. Specifically, ARUC has a value ranging from 0 to 1, and a larger value is better. This is because when ARUC is larger, we are more likely to find an appropriate suspect score threshold $\tau$ such that we can achieve both large robustness and uniqueness at the same time. Formally, ARUC is defined as follows:

$$ARUC = \int_{0}^{1} \min\{R(\tau), U(\tau)\}\, d\tau, \quad (5.5)$$

where .τ is the threshold, and .R(τ ) and .U (τ ) are the robustness and uniqueness when the threshold is .τ , respectively. In practice, it is challenging to evaluate ARUC

Fig. 5.2 Illustration of robustness–uniqueness curves and our metric ARUC to jointly measure robustness and uniqueness. Left: A perfect ARUC (i.e., ARUC = 1), where both robustness and uniqueness are 1 for any threshold. Middle: A mediocre ARUC, where both robustness and uniqueness are large only when the threshold is around 0.5. Right: A bad ARUC, where no threshold makes both robustness and uniqueness large


with continuous values of $\tau$. Instead, we can discretize $\tau$ and convert the integral into a summation. Specifically, we can divide the interval [0,1] into $r$ equal pieces and represent each piece using its rightmost point. Then, we can approximate ARUC as follows:

$$ARUC = \frac{1}{r} \sum_{\tau=1}^{r} \min\Big\{R\Big(\frac{\tau}{r}\Big), U\Big(\frac{\tau}{r}\Big)\Big\}, \quad (5.6)$$

where r needs to be large enough to achieve satisfactory approximation, e.g., .r ≥ 100.
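A minimal sketch of the discretized ARUC of Eq. 5.6, assuming we already have the suspect scores of a set of evaluated pirated models and a set of evaluated benign models (the robustness and uniqueness definitions follow Sect. 5.3.4); the function name and the example scores are illustrative.

```python
import numpy as np

def aruc(pirated_scores, benign_scores, r=1000):
    """Approximate ARUC (Eq. 5.6) from suspect scores of evaluated models.

    R(tau): fraction of pirated models whose suspect score exceeds tau (robustness).
    U(tau): fraction of benign models whose suspect score does not exceed tau (uniqueness).
    """
    pirated = np.asarray(pirated_scores)
    benign = np.asarray(benign_scores)
    total = 0.0
    for i in range(1, r + 1):
        tau = i / r
        robustness = np.mean(pirated > tau)
        uniqueness = np.mean(benign <= tau)
        total += min(robustness, uniqueness)
    return total / r

# Example: well-separated suspect scores give a large ARUC.
print(aruc([0.95, 0.9, 0.85], [0.1, 0.2, 0.05]))
```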

5.4 Design of IPGuard 5.4.1 Overview Designing a fingerprinting method is essentially designing the Extract function $E$ and the Verify function $V$. Next, we will describe the overview of the two functions in the design of IPGuard.

Extract Function $E$ In IPGuard, the fingerprint of a target model $N_t$ is a set of data points near its classification boundary and its predictions for the data points. Formally, given a target model $N_t$, IPGuard extracts the following fingerprint $F_{N_t}$ for $N_t$ using its Extract function $E$:

$$F_{N_t} = E(N_t) = \{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}, \quad (5.7)$$

where $x_i$ is the $i$th fingerprinting data point, and $y_i$ is its label predicted by the target model, i.e., $y_i = N_t(x_i)$. To search for the fingerprinting data points near the classification boundary, IPGuard formulates an optimization problem and leverages gradient descent to solve it. We will describe the details in Sect. 5.4.2.

Verify Function $V$ IPGuard leverages the prediction API of a suspect model for verification and assumes the prediction API returns a single predicted label for each query. Given a suspect model $N_s$, IPGuard queries the model's prediction API for the labels of the $n$ fingerprinting data points. Then, IPGuard defines the suspect score of the suspect model as the fraction of fingerprinting data points whose predicted


label given by the suspect model and the one given by the target model are the same. Formally, the suspect score $S$ of a suspect model $N_s$ is defined as follows:

$$S = V(N_s, F_{N_t}) = \frac{\sum_{i=1}^{n} \mathbb{I}(N_s(x_i) = y_i)}{n}, \quad (5.8)$$

where $\mathbb{I}(\cdot)$ is the indicator function: $\mathbb{I}(N_s(x_i) = y_i) = 1$ if $N_s(x_i) = y_i$, and $0$ otherwise. The suspect score $S$ is a value between 0 and 1, and $S$ is higher if the suspect model predicts more fingerprinting data points as the same labels predicted by the target model.

5.4.2 Finding Fingerprinting Data Points as an Optimization Problem A critical component in designing the Extract function is to find representative fingerprinting data points that can sufficiently characterize the classification boundary. One naive way is to find data points on the classification boundary by randomly sampling data points and checking whether they are on the classification boundary defined in Eq. 5.2. However, it is possible that such a naive sampling method cannot find a data point on the classification boundary at all. Even if a data point can finally be found, the time cost might be unaffordable for the model owner. This is because the classification boundary occupies only a negligible portion of the data manifold. Moreover, even if we could find data points exactly on the classification boundary, they would result in poor robustness because a small perturbation to the model would change their predicted labels. Instead, IPGuard formulates an optimization problem and solves it to find data points near the classification boundary of the target model. Specifically, IPGuard aims to solve the following optimization problem:

$$\min_{x}\ \mathrm{ReLU}\big(Z_i(x) - Z_j(x) + k\big) + \mathrm{ReLU}\big(\max_{t \neq i, j} Z_t(x) - Z_i(x)\big), \quad (5.9)$$

where $i$ and $j$ are two arbitrary labels, $\mathrm{ReLU}(\cdot)$ is defined as $\mathrm{ReLU}(a) = \max\{0, a\}$, and $k$ is a parameter that controls the distance between the fingerprinting data point $x$ and the classification boundary. We note that when $k$ equals 0, the objective function achieves its minimum value if the data point $x$ is on the classification boundary, i.e., $x$ satisfies $Z_i(x) = Z_j(x) \geq \max_{t \neq i,j} Z_t(x)$. When $k$ is greater than 0, IPGuard obtains the minimum value of the objective function if $Z_j(x) \geq Z_i(x) + k$ and $\max_{t \neq i,j} Z_t(x) \leq Z_i(x)$. Intuitively, when the two conditions are met, label $j$ and label $i$ have the largest and the second largest logits, respectively. Moreover, the difference between the logits of label $j$ and label $i$ is no smaller than the parameter $k$. Here, the parameter $k$ balances between the robustness and uniqueness of IPGuard. Specifically, a larger $k$ indicates a more


deterministic prediction, and thus it likely implies a larger distance of the data point x from the classification boundary. Therefore, on the one hand, when k is larger, a post-processed version of the target model is more likely to predict the same label for the data point .x, which means that IPGuard is more robust against post-processing. On the other hand, when k is larger, the probability that a benign model predicts the same label for the data point .x is also larger, which means that the fingerprint in IPGuard is less unique. In practice, the model owner can choose a proper value of k based on its need for the robustness–uniqueness trade-off. IPGuard uses gradient descent to solve the optimization problem in Eq. 5.9, which requires initialization of the data point .x and selection of labels i and j . Next, we will introduce the details about how IPGuard performs the initialization and label selection.
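Before turning to those details, a minimal PyTorch sketch of the gradient-descent search for one fingerprinting data point is given below; it assumes a differentiable model with more than two classes that returns a logit vector, and the optimizer, learning rate, and iteration budget are illustrative choices rather than the settings used in [5].

```python
import torch

def find_fingerprint_point(model, x_init, i, j, k=1.0, lr=0.01, steps=500):
    """Gradient-descent search for a fingerprinting data point (Eq. 5.9).

    model:  differentiable classifier returning logits of shape (1, c), with c > 2
    x_init: initial data point, shape (1, ...), e.g., a training example
    i, j:   the selected labels (i != j), see Sect. 5.4.3
    k:      margin controlling the distance to the classification boundary
    """
    x = x_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        z = model(x)[0]                                  # logit vector Z(x)
        a, b = min(i, j), max(i, j)
        others = torch.cat([z[:a], z[a + 1:b], z[b + 1:]])  # logits of all labels except i and j
        loss = torch.relu(z[i] - z[j] + k) + torch.relu(others.max() - z[i])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # (optionally clamp x to the valid input range here, e.g., [0, 1] for images)
        if loss.item() == 0:                             # both conditions of Eq. 5.9 are satisfied
            break
    return x.detach()
```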

5.4.3 Initialization and Label Selection To solve the optimization problem defined in Eq. 5.9, IPGuard needs to first initialize the data point .x and select the labels i and j . Different ways of initialization and selecting the labels may result in different robustness and uniqueness. Initialization Cao et al. [5] considered two ways to initialize a data point: • Training example. An intuitive way of initializing the data point .x is to randomly sample a training example as the initial data point. In this way, the initial data point will follow the same distribution as the training data. As a result, the fingerprinting data points will appear more natural and may be more difficult to be detected by the pirated model, if the attacker deploys some detection system to deny potential fingerprint queries. • Random. Another way to initialize a data point in IPGuard is to randomly sample a data point from the feature space. For instance, if the inputs of the target model are images whose pixel values are normalized to .[0, 1], IPGuard uniformly samples from .[0, 1]d as an initial data point, where d is the dimension of the images. Selection of Label i IPGuard selects the label of the initial data point .x predicted by the target model as label i. Such selection of label i helps reduce the computation cost of finding the fingerprinting data points. This is because if .x falls into the target model’s decision region of label i, then .x is surrounded by the classification boundary between label i and other labels. Therefore, by selecting the predicted label as i, it is likely that IPGuard needs less effort (e.g., fewer iterations of gradient descent) to find the fingerprinting data point, which characterizes the classification boundary between label i and another label j .


Selection of Label j Cao et al. considered two ways to select the label j: • Random. A simple way to select the label j is to randomly sample a label that is not i. The random sampling process does not make use of any information about the target model. • Least-likely label. In this method, IPGuard selects the least-likely label of the initial data point x as the label j. The least-likely label of a data point is defined as the label that has the smallest probability/logit predicted by the target model. The intuition of selecting the least-likely label is that different models may have similar labels that they are confident about because of the training loss they use, e.g., the cross-entropy loss. However, when it comes to the least confident labels, different models that are independently trained may disagree with each other. This implies that the target model's classification boundary near the least-likely label may be more unique. Cao et al. evaluated different choices of initialization and label selection, and they found that, in general, IPGuard achieves the best ARUC when initializing the data points as training examples and selecting least-likely labels as the label j [5].
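The initialization and label-selection choices that worked best above can be sketched as follows; the helper names and the tensor-based data layout are assumptions for illustration.

```python
import torch

def select_labels(model, x):
    """Pick label i as the model's prediction for the initial point and
    label j as its least-likely label, following the choices described above."""
    with torch.no_grad():
        logits = model(x)[0]
    i = int(torch.argmax(logits))
    j = int(torch.argmin(logits))   # least-likely label
    return i, j

def init_from_training_example(train_images):
    """Initialize the data point by sampling a training example at random."""
    idx = torch.randint(0, len(train_images), (1,)).item()
    return train_images[idx:idx + 1].clone()
```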

5.5 Discussion 5.5.1 Connection with Adversarial Examples Suppose we have a target model .N and a data point .x with ground-truth label y, where the target model can correctly predict the label of .x, i.e., .N(x) = y. It has been shown that an attacker can force the target model to misclassify the data point via adding carefully crafted perturbation to the data point. Such data points with perturbation that cause misclassification are called adversarial examples [48]. Based on the attacker’s goal, adversarial examples can be divided into two categories, i.e., untargeted adversarial examples and targeted adversarial examples. Untargeted adversarial examples aim to make the target model predict any wrong labels, while targeted adversarial examples aim to make the target model predict attacker-chosen labels. Many methods (e.g., [2, 3, 7, 16, 23, 29, 30, 53]) have been developed to construct adversarial examples. Fingerprinting data points and adversarial examples are similar in many aspects. For instance, they both formulate the problem as finding some perturbation such that when it is added to an initial data point, the perturbed data point will have some specific predictions. Moreover, since the adversarial examples need to move the data points across the classification boundary for misclassification, they also carry some information about the classification boundary of the target model, like the fingerprinting data points do. However, Cao et al. showed that the general vanilla adversarial examples [7, 16, 29] are not sufficient to characterize the classification boundary of a target model or be used as the fingerprinting data points. Specifically,


some of them [16, 29] cannot achieve the robustness and uniqueness design goals, while the others [7] miss the efficiency goal. Although vanilla adversarial example methods are insufficient to be directly applied to fingerprinting a target model, researchers have recently studied the potential of their variants for this problem [31, 39, 52, 58]. For instance, Le Merrer et al. [31] combined the idea of fingerprinting data points and watermarking together. They first found some adversarial examples of the target model and then stitched the classification boundary of the target model such that the adversarial examples were correctly classified by the stitched target model. They treated the adversarial examples and their labels as the fingerprint of the stitched target model. Zhao et al. [58] generated targeted adversarial examples that not only had a target label, but also had a target logit vector. They leveraged the adversarial examples and their target logit vectors as the fingerprint for a target model. They could verify a suspect model if it produced the same or similar logit vectors for the adversarial examples. We note that some adversarial examples have a property called transferability. Specifically, we can generate some adversarial examples based on one model N1 and test them using another model N2. If the adversarial examples are still successful when evaluated on N2, we say these adversarial examples are transferable to N2. Ideally, an adversarial example generated on the target model Nt (together with its predicted label) can be used as a fingerprinting data point of Nt if it: (1) can transfer to the pirated versions of Nt and (2) cannot transfer to other benign models. Lukas et al. [39] considered such adversarial examples in their work, which they called conferrable adversarial examples. Specifically, in their method, the model owner trains a target model and some other benign models independently. The model owner also creates multiple pirated versions of the target model. Then the model owner finds some adversarial examples based on the target model that could transfer to the pirated models but not to the benign models. They showed that their method could both robustly and uniquely fingerprint the target model.
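The selection idea behind conferrable adversarial examples can be illustrated with a simplified filter: keep only candidate points whose target-model label is reproduced by the owner's pirated copies but by none of the independently trained benign models. This is a sketch of the selection criterion only, not the training-based generation procedure of [39].

```python
def conferrable_filter(candidates, target_predict, pirated_models, benign_models):
    """Keep candidate fingerprinting points whose target-model label is
    reproduced by all pirated copies but by none of the benign models.

    Each *_predict / model argument is a callable mapping an input to a label.
    """
    kept = []
    for x in candidates:
        y = target_predict(x)
        transfers_to_pirated = all(m(x) == y for m in pirated_models)
        transfers_to_benign = any(m(x) == y for m in benign_models)
        if transfers_to_pirated and not transfers_to_benign:
            kept.append((x, y))
    return kept
```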

5.5.2 Robustness Against Knowledge Distillation An attacker may post-process the stolen target model in different ways before deploying it. Most existing works [5, 31, 39, 52, 58] on fingerprinting machine learning models evaluated common post-processing such as model pruning and fine-tuning. However, to the best of our knowledge, only one of them [39] evaluated another common post-processing method called knowledge distillation and illustrated the robustness against it. Knowledge distillation is the process of transferring the knowledge from one pre-trained model to a new model, usually from a large pre-trained model to a small new model. In knowledge distillation, there is a teacher model, i.e., the one to transfer knowledge from, and a student model that learns from the teacher model. During knowledge distillation, some training data are fed into the teacher model, and the teacher model outputs probability vectors for them. The student model will


be trained using these probability vectors produced by the teacher model as the ground-truth labels of the training data. In practice, it requires much less training data and much lower computation power to distill a model, compared to training the model from scratch. Therefore, in the problem of IP protection, an attacker may be motivated to transfer the knowledge from the stolen target model to its own student model. We consider this a compromise of the model owner's IP, since the attacker transfers the model owner's knowledge without permission. Therefore, it is critical to protect the model owner's IP from unauthorized knowledge distillation. However, this could be challenging because knowledge distillation may significantly change the target model's classification boundary, and the fingerprinting data points may not be robust enough for such a great change. In [39], Lukas et al. leveraged conferrable adversarial examples to achieve robustness against knowledge distillation. However, this comes at the cost of training as well as post-processing multiple models on the model owner side. This violates the efficiency goal of fingerprinting the target model, especially with a large training data set and a huge deep neural network. Besides finding conferrable adversarial examples via training multiple models, there may be other options to avoid illegal knowledge distillation. For instance, Ma et al. [40] proposed a new learning method that learns nasty models. The nasty models have the same performance as the regular models but are bad at teaching students. Specifically, if an attacker uses knowledge distillation to transfer knowledge from a nasty model to a student model, the student model will have low test accuracy, which is not acceptable for the attacker. Therefore, if a model is learned with the approach of [40], we can apply existing fingerprinting methods without concerns about knowledge distillation.
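For reference, one step of the knowledge-distillation process described above can be sketched as follows; the temperature and loss scaling are common illustrative choices rather than values prescribed here.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, optimizer, temperature=4.0):
    """One knowledge-distillation step: the student matches the teacher's
    softened probability vectors on the query batch x."""
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x) / temperature, dim=1)
    student_log_probs = F.log_softmax(student(x) / temperature, dim=1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```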

5.5.3 Attacker-Side Detection of Fingerprinting Data Points The attacker may adapt its attack if it knows the target model has been fingerprinted. Essentially, the attacker needs to evade the Verify function .V . One intuitive way to fool the Verify function .V is to manipulate the classification boundary of the pirated model such that the Verify function outputs a small suspect score .S for it. For instance, the attacker may post-process the model to manipulate its classification boundary. However, common post-processing methods have been shown to be insufficient [5, 31, 39, 52, 58]. Another way to evade verification is to tamper with the verification queries from the model owner. Recall that in the Verify function .V , the model owner queries the fingerprinting data points using the suspect model’s prediction API. Then the model owner compares the predicted labels of the fingerprinting data points to determine whether the suspect model is a pirated one. If the attacker could detect the verification queries from the model owner, then the attacker could either deny the queries or return random predictions for these queries to evade verification.


In adversarial machine learning, many approaches [8, 15, 17, 34, 38, 41, 44, 46, 47, 49, 54, 56] have been proposed to detect adversarial examples. The key idea of these detection methods is to distinguish the distribution difference between the genuine data points and the adversarial examples in either the feature space or a transformed space. These methods could also be applied to detecting fingerprinting data points because the fingerprinting data points also have a different distribution compared to genuine data [5], or they are essentially adversarial examples [31, 39, 52, 58]. We envision that there will be an arms race between a model owner and an attacker. For instance, an attacker could deploy detection systems to detect verification queries from the model owner, while a model owner searches for fingerprinting data points that evade the attacker's detection. For example, Carlini et al. [6] showed that one can evade existing adversarial example detection systems by generating adaptive adversarial examples. Such an arms race may result in a higher cost for an attacker to steal and abuse a model. We note that, as long as stealing and post-processing a target model or deploying detection systems require fewer resources (e.g., computation resources and training data) than training a model from scratch, an attacker may be motivated to steal the target model instead of training one from scratch. Therefore, the arms race between a model owner and an attacker may last until stealing the target model and evading the model owner's fingerprinting method require more resources than training a model from scratch.

5.6 Conclusion and Future Work Recent advances in machine learning have enabled models to reach or even surpass human-level performance, at the cost of enormous resources to train them. Therefore, an attacker may be motivated to steal a model from its owner and abuse it. Recent studies on model extraction attacks pose an even greater threat to the model owners' IP lying in their machine learning models. In this chapter, we discuss a general framework to protect the IP of machine learning models via fingerprinting the classification boundary. Specifically, the model owner extracts some fingerprint from the target model and verifies whether a suspect model is a pirated version of the target model by checking if the same fingerprint exists in the suspect model. The fingerprint should characterize the target model's classification boundary robustly and uniquely. For instance, the model owner may select some fingerprinting data points near the classification boundary and use them with their predicted labels as the fingerprint. Interesting future work includes designing new fingerprinting methods that better align with the five design goals and providing theoretical analysis of the fingerprinting methods.

Acknowledgments This work was supported by the National Science Foundation under grants No. 1937786 and 2112562.


References 1. Adi, Y., Baum, C., Cisse, M., Pinkas, B., Keshet, J.: Turning your weakness into a strength: watermarking deep neural networks by backdooring. In: 27th {USENIX} Security Symposium ({USENIX} Security 18), pp. 1615–1631 (2018) 2. Alzantot, M., Sharma, Y., Elgohary, A., Ho, B.-J., Srivastava, M., Chang, K.-W.: Generating natural language adversarial examples. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018) 3. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: International Conference on Machine Learning, pp. 284–293. PMLR (2018) 4. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 30(7), 1145–1159 (1997) 5. Cao, X., Jia, J., Gong, N.Z.: IPGuard: protecting intellectual property of deep neural networks via fingerprinting the classification boundary. In: Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security, pp. 14–25 (2021) 6. Carlini, N., Wagner, D.: Adversarial examples are not easily detected: bypassing ten detection methods. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14 (2017) 7. Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE (2017) 8. Carrara, F., Becarelli, R., Caldelli, R., Falchi, F., Amato, G.: Adversarial examples detection in features distance spaces. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018) 9. Chandrasekaran, V., Chaudhuri, K., Giacomelli, I., Jha, S., Yan, S.: Exploring connections between active learning and model extraction. In: 29th USENIX Security Symposium (USENIX Security 20), pp. 1309–1326 (2020) 10. Chen, H., Rohani, B.D., Koushanfar, F.: DeepMarks: a digital fingerprinting framework for deep neural networks (2018). arXiv preprint arXiv:1804.03648 11. Chen, K., Guo, S., Zhang, T., Li, S., Liu, Y.: Temporal watermarks for deep reinforcement learning models. In: Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems, pp. 314–322 (2021) 12. Chollet, F., et al.: Keras (2015). https://keras.io 13. Darvish Rouhani, B., Chen, H., Koushanfar, F.: DeepSigns: an end-to-end watermarking framework for ownership protection of deep neural networks. In: Proceedings of the TwentyFourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 485–497. ACM (2019) 14. Li, B., Fan, L., Gu, H., Li, J., Yang, Q.: FedIPR: ownership verification for federated deep neural network models. In: FTL-IJCAI (2021) 15. Fidel, G., Bitton, R., Shabtai, A.: When explainability meets adversarial learning: detecting adversarial examples using SHAP signatures. In: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2020) 16. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: ICLR (2015) 17. Grosse, K., Manoharan, P., Papernot, N., Backes, M., McDaniel, P.: On the (statistical) detection of adversarial examples (2017). arXiv preprint arXiv:1702.06280 18. Guo, J., Potkonjak, M.: Watermarking deep neural networks for embedded systems. In: 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8. IEEE (2018) 19. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. 
In: Advances in Neural Information Processing Systems, pp. 1135–1143 (2015) 20. Hartung, F., Kutter, M.: Multimedia watermarking techniques. Proc. IEEE 87(7), 1079–1107 (1999)


21. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770– 778 (2016) 22. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European conference on Computer Vision, pp. 630–645. Springer, Berlin (2016) 23. Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., Song, D.: Natural adversarial examples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15262–15271 (2021) 24. Hu, X., Liang, L., Deng, L., Li, S., Xie, X., Ji, Y., Ding, Y., Liu, C., Sherwood, T., Xie, Y., Neural network model extraction attacks in edge devices by hearing architectural hints. In: ASPLOS (2020) 25. Hua, W., Zhang, Z., Suh, G.E.: Reverse engineering convolutional neural networks through side-channel information leaks. In: 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2018) 26. Jagielski, M., Carlini, N., Berthelot, D., Kurakin, A., Papernot, N.: High accuracy and high fidelity extraction of neural networks. In: 29th USENIX Security Symposium (USENIX Security 20), pp. 1345–1362 (2020) 27. Juuti, M., Szyller, S., Marchal, S., Asokan, N.: PRADA: protecting against DNN model stealing attacks (2018). arXiv preprint arXiv:1805.02628 28. Kesarwani, M., Mukhoty, B., Arya, V., Mehta, S.: Model extraction warning in MLAAS paradigm. In: Proceedings of the 34th Annual Computer Security Applications Conference, pp. 371–380 (2018) 29. Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial examples in the physical world. In: ICLR (2017) 30. Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial examples in the physical world. In: Artificial Intelligence Safety and Security, pp. 99–112. Chapman and Hall/CRC (2018) 31. Le Merrer, E., Perez, P., Trédan, G.: Adversarial frontier stitching for remote neural network watermarking. Neural Comput. Appl. 32(13), 9233–9244 (2020) 32. Li, F.-Q., Wang, S.-L., Liew, A.-W.-C.: Regulating ownership verification for deep neural networks: scenarios, protocols, and prospects. In: IJCAI Workshop on Toward IPR on Deep Learning as Services (2021) 33. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient ConvNets (2016). arXiv preprint arXiv:1608.08710 34. Li, X., Li, F.: Adversarial examples detection in deep networks with convolutional filter statistics. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5764– 5772 (2017) 35. Li, Y., Zhu, L., Jia, X., Jiang, Y., Xia, S.-T., Cao, X.: Defending against model stealing via verifying embedded external features. In: AAAI (2022) 36. Li, Z., Hu, C., Zhang, Y., Guo, S.: How to prove your model belongs to you: a blind-watermark based framework to protect intellectual property of DNN. In: Proceedings of the 35th Annual Computer Security Applications Conference, pp. 126–137 (2019) 37. Lim, J.H., Chan, C.S., Ng, K.W., Fan, L., Yang, Q.: Protect, show, attend and tell: empowering image captioning models with ownership protection. Pattern Recogn. 122, 108285 (2022) 38. Lu, J., Issaranon, T., Forsyth, D.: SafetyNet: detecting and rejecting adversarial examples robustly. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 446– 454 (2017) 39. Lukas, N., Zhang, Y., Kerschbaum, F.: Deep neural network fingerprinting by conferrable adversarial examples. In: International Conference on Learning Representations (2021) 40. 
Ma, H., Chen, T., Hu, T.-K., You, C., Xie, X., Wang, Z.: Undistillable: making a nasty teacher that cannot teach students. In: International Conference on Learning Representations (2021) 41. Meng, D., Chen, H.: MagNet: a two-pronged defense against adversarial examples. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 135–147 (2017)


42. Nagai, Y., Uchida, Y., Sakazawa, S., Satoh, S.I.: Digital watermarking for deep neural networks. Int. J. Multimedia Inform. Retrieval 7(1), 3–16 (2018) 43. Oh, S.J., Schiele, B., Fritz, M.: Towards reverse-engineering black-box neural networks. In: ICLR (2018) 44. Pang, T., Du, C., Dong, Y., Zhu, J.: Towards robust detection of adversarial examples. In: Advances in Neural Information Processing Systems, vol. 31 (2018) 45. Quan, Y., Teng, H., Chen, Y., Ji, H.: Watermarking deep neural networks in image processing. IEEE Trans. Neural Netw. Learn. Syst. 32(5), 1852–1865 (2020) 46. Roth, K., Kilcher, Y., Hofmann, T.: The odds are odd: a statistical test for detecting adversarial examples. In: International Conference on Machine Learning, pp. 5498–5507. PMLR (2019) 47. Song, Y., Kim, T., Nowozin, S., Ermon, S., Kushman, N.: PixelDefend: leveraging generative models to understand and defend against adversarial examples. In: International Conference on Learning Representations (2018) 48. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In: ICLR (2014) 49. Tian, S., Yang, G., Cai, Y.: Detecting adversarial examples through image transformation. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018) 50. Tramèr, F., Zhang, F., Juels, A., Reiter, M.K., Ristenpart, T.: Stealing machine learning models via prediction APIs. In: USENIX Security Symposium (2016) 51. Wang, B., Gong, N.Z.: Stealing hyperparameters in machine learning. In: IEEE S & P (2018) 52. Wang, S., Chang, C.-H.: Fingerprinting deep neural networks-a DeepFool approach. In: 2021 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. IEEE (2021) 53. Xiao, C., Li, B., Zhu, J.-Y., He, W., Liu, M., Song, D.: Generating adversarial examples with adversarial networks. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 3905–3911 (2018) 54. Xu, W., Evans, D., Qi, Y.: Feature squeezing: detecting adversarial examples in deep neural networks. In: Network and Distributed System Security Symposium (2018) 55. Yan, M., Fletcher, C.W., Torrellas, J.: Cache telepathy: leveraging shared resource attacks to learn DNN architectures (2018). arXiv preprint arXiv:1808.04761 56. Yang, P., Chen, J., Hsieh, C.-J., Wang, J.-L., Jordan, M.: ML-LOO: detecting adversarial examples with feature attribution. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 6639–6647 (2020) 57. Zhang, J., Gu, Z., Jang, J., Wu, H., Stoecklin, M.P., Huang, H., Molloy, I.: Protecting intellectual property of deep neural networks with watermarking. In: Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 159–172. ACM (2018) 58. Zhao, J., Hu, Q., Liu, G., Ma, X., Chen, F., Hassan, M.M.: AFA: adversarial fingerprinting authentication for deep neural networks. Comput. Commun. 150, 488–497 (2020)

Chapter 6

Protecting Image Processing Networks via Model Watermarking Jie Zhang, Dongdong Chen, Jing Liao, Weiming Zhang, and Nenghai Yu

Abstract Deep learning has achieved tremendous success in low-level computer vision tasks such as image processing. To protect the intellectual property (IP) of such valuable image processing networks, the model vendor can sell the service in the form of an application programming interface (API). However, even if the attacker can only query the API, he is still able to conduct model extraction attacks, which steal the functionality of the target networks. In this chapter, we propose a new model watermarking framework for image processing networks. Under this framework, two strategies are further developed, namely, the model-agnostic strategy and the model-specific strategy. The proposed watermarking method performs well in terms of fidelity, capacity, and robustness.

6.1 Introduction Image processing covers a family of low-level computer vision tasks, including image deraining [13, 26], image dehazing [5, 11], medical image processing [17, 25], style transfer [3, 4], etc.; some visual examples are shown in Fig. 6.1. These tasks can be considered image-to-image translation tasks, where both the input and output are images. In the deep learning era, DNN-based image processing techniques have achieved significant progress and outperform traditional state-of-the-art methods by a large margin. Excellent general frameworks have been proposed one after another,

J. Zhang () · W. Zhang · N. Yu University of Science and Technology of China, Hefei, China e-mail: [email protected]; [email protected]; [email protected] D. Chen Microsoft Research, Redmond, WA, USA J. Liao City University of Hong Kong, Hong Kong, China e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 L. Fan et al. (eds.), Digital Watermarking for Machine Learning Model, https://doi.org/10.1007/978-981-19-7554-7_6


Fig. 6.1 Some examples of image processing tasks: deraining, dehazing, bone suppression, and style transfer

such as Pix2Pix [12], CycleGAN [32], etc. Nevertheless, it is still not trivial to obtain a good network for image processing tasks, which requires high-quality labeled data and expensive computational resources. Therefore, well-trained image processing networks can be regarded as the model creators' intellectual property (IP). To protect this core IP, a common practice is to encapsulate these networks into APIs and sell them for commercial profit. However, it is still possible to steal the functionality behind the APIs, which is known as a model extraction attack. For example, the attacker can repeatedly query the target API and receive the corresponding outputs. With enough input–output pairs, the attacker can learn a similar function in a supervised learning manner. Some recent works [1, 7, 8, 21, 30] have started paying attention to protecting DNN IP. Uchida et al. [21] propose a special weight regularizer to embed binary watermarks into a DNN. Adi et al. [1] use a particular set of inputs as indicators and let the model deliberately output specific incorrect labels, which is also known as a "backdoor attack." Fan et al. [7, 8] propose to embed a special passport layer into the target model and train it in an alternating manner. Very recently, some preliminary attempts [16] have emerged for IP protection of image processing networks. These works mainly fall into two categories: appending an extra output domain to the original output domain, or shifting the original output domain into a new specific domain. The former case is analogous to multi-task learning (MTL), where the target model learns an additional pre-defined image processing task to verify the model ownership. For example, Quan et al. [16] utilize a conventional smoothing algorithm as the watermark embedding task, while the main task is DNN-based image denoising. During the verification stage, the hand-crafted trigger image is fed into the target model, which is expected to output the result produced by the smoothing algorithm rather than by DNN-based denoising. The latter case focuses on constructing a special output domain, which is not far from the original domain but is embedded with watermark information for subsequent watermark extraction.


The recent work [24] belongs to the latter case and is similar to our work in this chapter. However, both methods mentioned above share a main limitation: they are fragile to model extraction attacks. Here, we point out two technical challenges for protecting the IP of image processing networks:

• Challenge A: How to verify the ownership of suspect models? In a practical scenario, the attacker will not voluntarily provide internal information of the suspect model, such as model parameters. Therefore, only the outputs of the suspect model can be utilized for ownership verification. Besides, the inputs used for querying during the verification process shall be normal ones, so that the attacker cannot evade them to disrupt the verification. In this chapter, we investigate the latter case mentioned above.

• Challenge B: How to guarantee that the watermark is still preserved after model extraction attacks? Because we aim at extracting watermarks from the model outputs for verification, this challenge means that the watermark shall be learned during the model extraction attack; namely, the watermark information needs to be transferred from the outputs of the protected model into the outputs of the stolen model.

In the following, we describe our solutions to these challenges in detail, and we provide both quantitative and qualitative results on different image processing tasks, demonstrating that the proposed model watermarking satisfies the requirements regarding fidelity, capacity, and robustness. Briefly, we summarize the main contributions of our work as follows:

• We propose a novel model watermarking framework to protect the IP of image processing networks, which is based on a spatial invisible watermarking mechanism.

• Under the proposed framework, we further design two strategies, namely, the model-agnostic strategy and the model-specific strategy, which enable flexible usage.

• We utilize full-size images as watermarks, which enlarges the capacity of the proposed watermarking method. Besides, it can easily be extended to multiple-watermark cases.

• A two-stage training strategy is leveraged to resist model extraction attacks, and experimental results demonstrate the robustness under different attack configurations. For a comprehensive evaluation, some adaptive attacks are also considered.

6.2 Preliminaries We first clarify the threat model for IP protection of image processing networks and then formulate the problem to be solved.


6.2.1 Threat Model Attacker's Ability The attacker has enough computational resources but lacks enough high-quality labeled data to train his own models. Besides, the attacker has no access to the internals of the well-trained target model. Therefore, it is impossible for him to conduct model-based plagiarism such as fine-tuning or model compression. The attacker can only query the target model, and we do not limit the number of queries. In short, the attacker is only able to steal the functionality of the target model (API) via model extraction attacks. Verifier's Ability Depending on whether the verifier has access to the internal information (e.g., model weights, model architectures, etc.) of the target model, ownership verification can be conducted in a white-box or a black-box way. In this chapter, we consider the more practical scenario in which the illegal plagiarist refuses to actively provide internal information. Therefore, the verifier can only leverage the outputs of the suspect model, and it is necessary to launch the verification with normal queries. Final Goals The model owner aims to design a model watermarking method for post hoc forensics if IP infringement happens. The model watermarking algorithm shall satisfy several requirements: (1) Fidelity: it cannot sacrifice the original performance of the target model; for image processing tasks, the watermarked outputs shall be visually consistent with the original outputs. (2) Capacity: it shall embed as much watermark information as possible. (3) Robustness: in this chapter, we mainly pay attention to robustness against model extraction attacks, while some adaptive attacks are also discussed.

6.2.2 Problem Formulation In this chapter, we mainly investigate how to protect image processing networks against model extraction attacks $A_{me}$. For better understanding, we first define the problem of image processing tasks, then provide a simple formulation of the model extraction task, and finally formulate the proposed model watermarking framework. Image Processing Networks Given an image processing network $N(\cdot)$, we train it with an input dataset $D_A$ and an output dataset $D_B$, where the data $\{a_1, a_2, \ldots, a_n\}$ and $\{b_1, b_2, \ldots, b_n\}$ are one-to-one paired. During the training stage, the model $N(\cdot)$ is constrained by some loss function $\mathcal{L}$ to learn the mapping from the data distribution $D_A$ to $D_B$. For image processing tasks, the loss functions usually adopt similarity metrics such as the $L_p$ loss at the pixel level and the perceptual loss [14] at the feature level. For a well-trained $N(\cdot)$, fed with an image $a \in D_A$, it will confidently output the corresponding processed image $b \in D_B$.


Model Extraction Attacks During such attacks, the adversary queries the target network $N(\cdot)$ with his own unlabeled data $\{a_1^*, a_2^*, \ldots, a_n^*\} \in D_{A^*}$ to obtain the corresponding labeled data $\{b_1^*, b_2^*, \ldots, b_n^*\} \in D_{B^*}$, where $D_{A^*}$ is similar to $D_A$. Then, the attacker can use the same training strategy mentioned above to train his own surrogate network $N^*(\cdot)$ with the datasets $D_{A^*}$ and $D_{B^*}$. For the configuration of $N^*(\cdot)$, the attacker can choose any network architecture and loss function, which may differ from the settings of the original target model. In any case, the attacker aims to acquire a functionality-preserving surrogate network $N^*(\cdot)$. Model Watermarking According to the two main challenges (Sect. 6.1) and the threat model, we can only use the outputs $O'$ of the illegal surrogate model $N^*(\cdot)$ for ownership verification. Meanwhile, we shall embed watermark information into the outputs $O$ of the original model $N(\cdot)$, namely, obtain the watermarked outputs $O_W$. Then, $O_W$ will replace $O$ as the final outputs of the API for the end user. If we can guarantee that $N^*(\cdot)$ captures the watermark information and learns it into its outputs, the subsequent verification will succeed.
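To make the attack described above concrete, the following is a minimal PyTorch sketch of a model extraction loop. The API handle `target_api`, the surrogate architecture, and the choice of MSE as the task loss are illustrative assumptions of this sketch, not details taken from the original formulation.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def extract_model(target_api, unlabeled_images, surrogate: nn.Module,
                  epochs: int = 300, lr: float = 2e-4):
    """Sketch of a model extraction attack: query the victim API to build a
    pseudo-labeled dataset, then train a surrogate on the collected pairs."""
    # 1. Query the target API with the attacker's own unlabeled images a_i*
    #    and collect the returned (possibly watermarked) outputs b_i*.
    with torch.no_grad():
        pseudo_labels = torch.cat([target_api(x.unsqueeze(0)) for x in unlabeled_images])
    loader = DataLoader(TensorDataset(unlabeled_images, pseudo_labels),
                        batch_size=16, shuffle=True)

    # 2. Train the surrogate N*() on the input-output pairs in a supervised way.
    optimizer = torch.optim.Adam(surrogate.parameters(), lr=lr)
    criterion = nn.MSELoss()  # the attacker may pick any task loss
    for _ in range(epochs):
        for a, b in loader:
            optimizer.zero_grad()
            criterion(surrogate(a), b).backward()
            optimizer.step()
    return surrogate
```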

6.3 Proposed Method 6.3.1 Motivation Before introducing our model watermarking method, we first illustrate a trivial idea, as shown in Fig. 6.2. Taking the deraining task as an example, fed with natural images containing raindrops, the target model $N(\cdot)$ removes the raindrops and produces the clean outputs $O$. Before releasing $O$ to the end user, we add the same watermark pattern, such as "Flower," onto all outputs and obtain the watermarked outputs $O'$,

Fig. 6.2 A straightforward idea to watermark all the outputs by the same visible watermark such as "Flower"


which are further used as the final feedback of $N(\cdot)$. If the attacker trains his surrogate model $N^*(\cdot)$ with such outputs $O'$, $N^*(\cdot)$ must learn the consistent watermark pattern into its outputs due to the loss-minimization characteristic. Below we provide a simple mathematical analysis. Feeding images $a_i$ to the target model $N(\cdot)$, the attacker obtains the corresponding outputs $b_i \in O$, and the objective of the surrogate model $N^*(\cdot)$ is then to minimize the loss function $\mathcal{L}$ between $N^*(a_i)$ and $b_i$:

$$\mathcal{L}(N^*(a_i), b_i) \rightarrow 0. \tag{6.1}$$

If each original output $b_i \in O$ is stamped with a consistent visible watermark $\delta$, forming the outputs $b'_i \in O'$, then $N^*(\cdot)$ targets minimizing the loss $\mathcal{L}$ between $N^*(a_i)$ and $b'_i$:

$$\mathcal{L}(N^*(a_i), b'_i) \rightarrow 0, \quad b'_i = b_i + \delta. \tag{6.2}$$

Based on Eqs. (6.1) and (6.2), there must exist a surrogate model $N^*(\cdot)$ that can learn a good mapping from $D_A$ to $O'$ because of the equivalence below:

$$\mathcal{L}(N^*_1(a_i), b_i) \rightarrow 0 \;\Leftrightarrow\; \mathcal{L}(N^*_2(a_i), b_i + \delta) \rightarrow 0 \quad \text{when} \quad N^*_2 = N^*_1 + \delta. \tag{6.3}$$

That is to say, if the surrogate model $N^*(\cdot)$ can learn the original image processing task well, $N^*(\cdot)$ must also learn the watermark $\delta$ into its outputs via this simple shortcut. We find that the consistency of the visible watermark pattern $\delta$ is an important component of this simple solution. However, visible watermarks are unsuitable for practical use because they break the fidelity requirement. To address this limitation, we shall make the watermark pattern invisible or stealthy while still guaranteeing the consistency of the watermark pattern embedded across different images. Based on the above analysis, we propose a new model watermarking framework for image processing networks against model extraction attacks. As shown in Fig. 6.3, to protect the target network $N(\cdot)$, we directly embed the target watermark $\delta$ into the original output $b_i$ with the watermark embedding algorithm $H$. Meanwhile, we shall ensure that the watermark extracting algorithm $R$ extracts the corresponding watermark $\delta'$ from the watermarked output $b'_i$. In this framework, it is crucial to satisfy two consistency requirements: (1) the watermarked image $b'_i$ shall be visually consistent with the original image $b_i$; (2) the extracted watermark $\delta'$ shall be consistent with the target watermark $\delta$. Because the watermark embedding $H$ is independent of the target network $N(\cdot)$, we call this default strategy the model-agnostic strategy. In Fig. 6.4, we further propose the model-specific strategy, which combines $H$ with the target image processing task. The model-agnostic strategy is more flexible for target models that need to be updated frequently, while the model-specific strategy builds a stronger connection with a specific model version.
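For illustration, a minimal sketch of the resulting model-agnostic pipeline follows. Here `N`, `H`, `R`, and `suspect_model` are assumed to be callables, the extractor output and the watermark are assumed to share the same shape, and the NC check with a 0.95 threshold anticipates the verification protocol described later; none of these names are fixed by the original text.

```python
import torch

def serve_watermarked(N, H, delta, query_image):
    """Owner side: process the query and watermark the output before release."""
    b = N(query_image)      # original output of the protected model
    b_prime = H(b, delta)   # embed the target watermark delta invisibly
    return b_prime          # only b_prime ever leaves the API

def verify_ownership(R, suspect_model, probe_images, delta, threshold=0.95):
    """Verifier side: query the suspect model with normal inputs and check
    whether the extractor recovers the target watermark."""
    for a in probe_images:
        extracted = R(suspect_model(a))
        nc = torch.sum(extracted * delta) / (extracted.norm() * delta.norm())
        if nc < threshold:
            return False
    return True
```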


Fig. 6.3 The proposed model watermarking framework with the model-agnostic strategy

Fig. 6.4 The proposed model watermarking framework with the model-specific strategy

6.3.2 Traditional Watermarking Algorithm For the model-agnostic strategy, we find that traditional watermarking algorithms can be directly applied for watermark embedding and extraction. We only consider invisible watermarking methods, covering the spatial domain and some transform domains. For spatial invisible watermarking, we adopt the common method called additive-based embedding. In detail, we first spread the watermark information into a sequence or a block that obeys a specific distribution and then hide it in the related


coefficients of the cover image. We formulate this process as follows:

$$O' = \begin{cases} O + \alpha C_0 & \text{if } T_i = 0, \\ O + \alpha C_1 & \text{otherwise,} \end{cases} \tag{6.4}$$

where $O$ and $O'$ denote the cover image and the watermarked image, respectively, $\alpha$ represents the watermark intensity, and $C_i$ denotes the pattern carrying the watermark information $T_i$ ($T_i \in \{0, 1\}$). The extraction process is simply the inverse of watermark embedding. The subsequent experimental results show that spatial watermarking methods can resist model extraction attacks only in a particular case, namely, when the surrogate model is trained with specific loss functions and network architectures. Moreover, traditional spatial watermarking has a limited capacity (e.g., 64 bits) because it needs many bits for error correction. We also tried some classic transform-domain watermarking methods, such as the discrete Fourier transform (DFT) domain [19], the discrete cosine transform (DCT) domain [10], and the discrete wavelet transform (DWT) domain [2]. However, all these methods are fragile to model extraction attacks. It is an interesting direction to investigate how to ensure that transform-domain watermarks are preserved in the outputs of the surrogate model.
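For concreteness, the following NumPy sketch illustrates additive-based embedding in the spirit of Eq. (6.4). The zero-mean pseudo-random patterns, the shared seed, and the non-blind extraction (which uses the cover image) are simplifying assumptions of this sketch rather than the exact scheme evaluated in this chapter.

```python
import numpy as np

def embed_bit(cover: np.ndarray, bit: int, alpha: float = 2.0, seed: int = 0):
    """Additive embedding of one bit T_i (Eq. 6.4): add a zero-mean pseudo-random
    pattern C0 or C1 to the cover image with intensity alpha."""
    rng = np.random.default_rng(seed)
    c0 = rng.standard_normal(cover.shape)
    c1 = rng.standard_normal(cover.shape)
    pattern = c1 if bit else c0
    return cover.astype(np.float64) + alpha * pattern

def extract_bit(marked: np.ndarray, cover: np.ndarray, seed: int = 0) -> int:
    """Inverse step: correlate the residual with both patterns (regenerated from
    the shared seed) and pick the stronger response."""
    rng = np.random.default_rng(seed)
    c0 = rng.standard_normal(marked.shape)
    c1 = rng.standard_normal(marked.shape)
    residual = marked.astype(np.float64) - cover
    return int(np.sum(residual * c1) > np.sum(residual * c0))
```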

6.3.3 Deep Invisible Watermarking To address the limitations of traditional spatial watermarking, we propose to leverage deep models for both watermark embedding and watermark extraction, as illustrated in Fig. 6.5. In the initial training stage, we use an embedding sub-network $H$ to embed an image watermark into cover images of the domain $B$, and meanwhile, we use an

Fig. 6.5 Deep invisible watermarking with the two-stage training strategy. In the initial training stage, we jointly train the embedding sub-network H and the extractor R. In the adversarial training stage, we further enhance the extracting ability of R


extractor $R$ to extract the corresponding watermark afterward. In particular, we train these two networks jointly, which is popular in most deep watermarking methods [20, 31]. Compared with traditional spatial watermarking, the main advantage is that we can introduce noise layers into the training process to enhance the robustness against specific attacks or processing operations, such as JPEG compression and screen shooting. During the initial training stage, we also find that if we only feed images from the domain $B'$ into $R$, it learns a trivial solution that directly outputs the target watermark no matter whether the inputs are watermarked or not. To remedy this, the images without watermarks from the domains $A$ and $B$ are also fed into $R$ and are forced to produce a constant blank image, which is helpful for effective forensics. However, if trained only with this initial stage, $R$ cannot extract the watermarks from the outputs $O^*$ of the surrogate model $N^*(\cdot)$. Leveraging the advantage of deep watermarking mentioned above, we therefore introduce model extraction attacks into the adversarial training stage. In detail, we first train a proxy surrogate model $N^*_0(\cdot)$ with the paired data from the domains $A$ and $B'$ to imitate model extraction attacks. Afterward, we collect the outputs of $N^*_0(\cdot)$ to constitute the domain $B''$, and $R$ is further fine-tuned on the mixed dataset of the domains $A$, $B$, $B'$, and $B''$. With the proposed two-stage training strategy, our model watermarking achieves robustness against model extraction attacks, even when the surrogate model $N^*(\cdot)$ differs from the proxy surrogate model $N^*_0(\cdot)$ used in the adversarial training stage. In the following, we give more details on the design of our method.

6.3.3.1 Network Structures

The embedding sub-network $H$ and the proxy surrogate model $N^*_0(\cdot)$ both adopt UNet [18] as the default network structure, which has been popular in image-to-image translation tasks [12, 32]. UNet contains multi-scale skip connections, which suits tasks where the output image shares common properties with the input image, as is the case for an unwatermarked image and its corresponding watermarked image. However, for the extractor $R$, UNet is not suitable because the extracted watermark (e.g., "Flower") is very different from the watermarked natural images. Empirical results verify the effectiveness of CEILNet [9], which is intrinsically an auto-encoder-like network structure.
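The exact CEILNet configuration is not reproduced here; the snippet below is only a hedged PyTorch stand-in that conveys the auto-encoder-like shape of the extractor R, with channel widths chosen for illustration.

```python
import torch
import torch.nn as nn

class SimpleExtractor(nn.Module):
    """Auto-encoder-like extractor: downsample the watermarked image into a
    compact representation, then decode it into the watermark image."""
    def __init__(self, channels: int = 3, width: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, width, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(width, channels, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))
```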

6.3.3.2 Loss Functions

The loss function of our method consists of two parts: the embedding loss $\mathcal{L}_{emd}$ and the extracting loss $\mathcal{L}_{ext}$, i.e.,

$$\mathcal{L} = \mathcal{L}_{emd} + \lambda \cdot \mathcal{L}_{ext}, \tag{6.5}$$


where $\lambda$ balances watermark embedding and extraction. We introduce these two loss terms in detail as follows.

Embedding Loss To guarantee the visual consistency between the cover image $b_i$ and the watermarked image $b'_i$, we consider the basic L2 loss $\ell_2$, which compares the similarity at the pixel level, i.e.,

$$\ell_2 = \sum_{b'_i \in B',\, b_i \in B} \frac{1}{N_c} \left\| b'_i - b_i \right\|^2, \tag{6.6}$$

where $N_c$ is the total number of pixels. We also constrain the similarity at the feature level with the well-known perceptual loss $\ell_{perc}$ [14], i.e.,

$$\ell_{perc} = \sum_{b'_i \in B',\, b_i \in B} \frac{1}{N_f} \left\| VGG_k(b'_i) - VGG_k(b_i) \right\|^2, \tag{6.7}$$

where $VGG_k(\cdot)$ denotes the features extracted at layer $k$ ("conv2_2" by default), and $N_f$ denotes the total number of feature neurons. To eliminate the domain gap between $B'$ and $B$, we further introduce a discriminator $D$ into the training of the embedding sub-network $H$, which can be regarded as GAN-style training, i.e.,

$$\ell_{adv} = \mathop{\mathbb{E}}_{b_i \in B} \log\left(D(b_i)\right) + \mathop{\mathbb{E}}_{b'_i \in B'} \log\left(1 - D(b'_i)\right). \tag{6.8}$$

The adversarial loss $\ell_{adv}$ forces $H$ to embed watermarks in a more stealthy way, so that the discriminator $D$ cannot differentiate between watermarked and unwatermarked images. For $D$, we adopt the widely used PatchGAN [12]. In short, the embedding loss is formulated as a weighted sum of the above three terms, i.e.,

$$\mathcal{L}_{emd} = \lambda_1 \cdot \ell_2 + \lambda_2 \cdot \ell_{perc} + \lambda_3 \cdot \ell_{adv}. \tag{6.9}$$
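The embedding loss of Eq. (6.9) can be assembled as in the hedged PyTorch sketch below. The frozen VGG-16 features up to relu2_2 are used here to approximate the "conv2_2" default, the PatchGAN discriminator `D` is assumed to be defined elsewhere and to output logits, and the generator-side adversarial term is the usual non-saturating variant; these are assumptions of this sketch, not prescriptions from the text.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16  # assumes a recent torchvision

# Frozen VGG-16 features (through relu2_2) for the perceptual term.
_vgg_feat = vgg16(weights="IMAGENET1K_V1").features[:9].eval()
for p in _vgg_feat.parameters():
    p.requires_grad_(False)

def embedding_loss(b_wm, b, D, lambdas=(1.0, 1.0, 0.01)):
    """Weighted sum of the pixel, perceptual, and adversarial terms (Eq. 6.9)."""
    l_pix = F.mse_loss(b_wm, b)                          # Eq. (6.6)
    l_perc = F.mse_loss(_vgg_feat(b_wm), _vgg_feat(b))   # Eq. (6.7)
    # Generator view of Eq. (6.8): H tries to make D label watermarked images as real.
    logits_fake = D(b_wm)
    l_adv = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    l1, l2, l3 = lambdas
    return l1 * l_pix + l2 * l_perc + l3 * l_adv
```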

Extracting Loss The extractor $R$ has two aims: it shall extract the target watermark image from watermarked images of $B'$, and it shall instead output a blank image for unwatermarked images from $A$ and $B$. The first two terms of $\mathcal{L}_{ext}$ are therefore the reconstruction loss $\ell_{wm}$ and the clean loss $\ell_{clean}$ for these two types of images, respectively, i.e.,

$$\ell_{wm} = \sum_{b'_i \in B'} \frac{1}{N_c} \left\| R(b'_i) - \delta \right\|^2, \qquad \ell_{clean} = \sum_{a_i \in A} \frac{1}{N_c} \left\| R(a_i) - \delta_0 \right\|^2 + \sum_{b_i \in B} \frac{1}{N_c} \left\| R(b_i) - \delta_0 \right\|^2, \tag{6.10}$$

where $\delta$ is the target watermark image and $\delta_0$ is the blank image. In addition, we further constrain the consistency among all the extracted watermark images $R(\cdot)$, which makes it easier for the surrogate model to capture the watermark information. Therefore, we add another consistent loss $\ell_{cst}$, i.e.,

$$\ell_{cst} = \sum_{x,\, y \in B'} \left\| R(x) - R(y) \right\|^2. \tag{6.11}$$

The final extracting loss $\mathcal{L}_{ext}$ is defined as follows:

$$\mathcal{L}_{ext} = \lambda_4 \cdot \ell_{wm} + \lambda_5 \cdot \ell_{clean} + \lambda_6 \cdot \ell_{cst}. \tag{6.12}$$
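A hedged PyTorch sketch of the extracting loss (Eqs. (6.10)–(6.12)) follows. The blank image δ0 is taken here to be all zeros, and the pairwise consistency term is approximated within a batch by comparing each extracted watermark to its neighbor; both choices are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def extracting_loss(R, wm_outputs, clean_inputs, clean_outputs, delta,
                    lambdas=(1.0, 1.0, 1.0)):
    """wm_outputs: watermarked images from B'; clean_inputs / clean_outputs:
    unwatermarked images from A and B; delta: target watermark image (C, H, W)."""
    extracted = R(wm_outputs)
    l_wm = F.mse_loss(extracted, delta.expand_as(extracted))        # Eq. (6.10), first term

    # Clean loss against an all-zero blank image (MSE to zero = mean of squares).
    l_clean = R(clean_inputs).pow(2).mean() + R(clean_outputs).pow(2).mean()

    # Batch-level stand-in for Eq. (6.11): penalize differences between
    # extracted watermarks of different watermarked images.
    l_cst = F.mse_loss(extracted, extracted.roll(shifts=1, dims=0))

    l4, l5, l6 = lambdas
    return l4 * l_wm + l5 * l_clean + l6 * l_cst                    # Eq. (6.12)
```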

Adversarial Training Stage The role of the adversarial training stage is to expose the extractor to the degradation introduced by model extraction attacks. Specifically, one proxy surrogate model $N^*_0(\cdot)$ is trained with the simple L2 loss and UNet by default, and its outputs are defined as the domain $B''$. In this stage, we fix the embedding sub-network $H$ and only fine-tune the extractor $R$. The corresponding terms of the extracting loss are rewritten as:

$$\ell_{wm} = \sum_{b'_i \in B'} \frac{1}{N_c} \left\| R(b'_i) - \delta \right\|^2 + \sum_{b''_i \in B''} \frac{1}{N_c} \left\| R(b''_i) - \delta \right\|^2, \qquad \ell_{cst} = \sum_{x,\, y \in B' \cup B''} \left\| R(x) - R(y) \right\|^2. \tag{6.13}$$

Training Details By default, we train the sub-networks $H$ and $R$ and the discriminator $D$ for 200 epochs with a batch size of 8. We adopt the Adam optimizer with an initial learning rate of 0.0002 and decay the learning rate by a factor of 0.2 if the loss changes only slightly within 5 epochs. For the training of $N^*_0(\cdot)$ and $N^*(\cdot)$, the batch size is set to 16 and 300 epochs are used for better performance. In the adversarial training stage, $R$ is fine-tuned with a lower initial learning rate of 0.0001. By default, all hyper-parameters $\lambda_i$ are equal to 1 except $\lambda_3 = 0.01$. All images are resized to 256 × 256 by default. The implementation code is publicly available.1

1 https://github.com/ZJZAC/Deep-Model-Watermarking.
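The optimizer setup described above might look as follows in PyTorch; `ReduceLROnPlateau` is used here as a stand-in for the "decay by 0.2 when the loss barely changes within 5 epochs" rule, which is an approximation rather than the released implementation.

```python
import itertools
import torch

def build_optimizers(H, R, D, lr=2e-4):
    """Adam optimizers plus a plateau-based decay, mirroring the defaults above."""
    opt_hr = torch.optim.Adam(itertools.chain(H.parameters(), R.parameters()), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    # Multiply the learning rate by 0.2 when the monitored loss stops improving
    # for 5 epochs (an approximation of the schedule described in the text).
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt_hr, factor=0.2, patience=5)
    return opt_hr, opt_d, sched

def build_finetune_optimizer(H, R, lr=1e-4):
    """Adversarial (second) stage: freeze H and fine-tune only R at a lower rate."""
    for p in H.parameters():
        p.requires_grad_(False)
    return torch.optim.Adam(R.parameters(), lr=lr)
```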

6.3.3.3 Ownership Verification

During the verification stage, we only need to feed some normal inputs into the suspect model $N^*(\cdot)$ and obtain the corresponding outputs $O^*$. Then, we utilize the extractor $R$ to extract watermark images and check whether $R$'s output matches the target watermark image. To evaluate the similarity, we introduce the classic normalized correlation (NC) metric, i.e.,

$$NC = \frac{\langle R(b'_i), \delta \rangle}{\left\| R(b'_i) \right\| \cdot \left\| \delta \right\|}, \tag{6.14}$$

where $\langle \cdot, \cdot \rangle$ and $\| \cdot \|$ denote the inner product and the L2 norm, respectively.
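Equation (6.14) and the success-rate criterion can be computed directly, as in the sketch below; the 0.95 threshold follows the evaluation metric defined later in the experiments, and the tensor shapes are assumed to match.

```python
import torch

def normalized_correlation(extracted: torch.Tensor, target: torch.Tensor) -> float:
    """NC between an extracted watermark and the target watermark (Eq. 6.14)."""
    inner = torch.sum(extracted * target)
    return float(inner / (extracted.norm() * target.norm() + 1e-12))

def success_rate(R, suspect_outputs, target, threshold: float = 0.95) -> float:
    """Fraction of suspect-model outputs whose extracted watermark exceeds the NC threshold."""
    hits = 0
    for out in suspect_outputs:
        wm = R(out.unsqueeze(0)).squeeze(0)
        hits += int(normalized_correlation(wm, target) >= threshold)
    return hits / max(len(suspect_outputs), 1)
```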

6.3.3.4 Flexible Extensions

Multiple-Watermark Strategy By default, we introduce only one specific watermark during the training of the embedding sub-network and the extractor. However, this is inconvenient when one image processing network has multiple released versions and we need different watermarks to represent different versions, because training multiple embedding sub-networks and extractors costs more storage and computation. Fortunately, the proposed framework supports embedding multiple watermark images with a single embedding sub-network and a single extractor. The only change is that $H$ randomly selects different images as watermarks to embed into the cover images, and $R$ extracts them correspondingly (see the sketch after this paragraph). Model-Specific Strategy In contrast to the traditional spatial watermarking methods, deep invisible watermarking can also be used with the model-specific strategy mentioned above. We illustrate the initial training stage with this strategy in Fig. 6.6. In this case, the original target model simultaneously learns the image processing and watermark embedding tasks. In other words, the target model $N(\cdot)$ outputs a watermarked image that is similar to the original watermark-free output. Meanwhile, we also add an extractor $R$ to extract the corresponding watermark images, which is jointly trained with $N(\cdot)$. We call such an $N(\cdot)$ the self-watermarked model hereinafter.
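A hedged sketch of the only change needed for the multiple-watermark setting follows; the call signature `H(cover, delta)` matches the earlier pipeline sketch and is an assumption of this illustration, as is the bank of logo tensors.

```python
import random
import torch

def embed_random_watermark(H, cover_batch: torch.Tensor, watermark_bank):
    """Pick one watermark per sample from the bank and embed it with the shared H."""
    marked, used = [], []
    for cover in cover_batch:
        delta = random.choice(watermark_bank)  # e.g., one of several logo images
        marked.append(H(cover.unsqueeze(0), delta.unsqueeze(0)))
        used.append(delta)
    return torch.cat(marked, dim=0), torch.stack(used)

# During training, the extractor R is then supervised with the matching targets,
# e.g., loss = F.mse_loss(R(marked), used).
```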

6.4 Experiments To demonstrate the effectiveness of the proposed model watermarking method, we take two example image processing tasks for experiments: image deraining and image debone.


Fig. 6.6 Illustration diagram to show the training difference between the original target model and the self-watermarked target model

6.4.1 Experiment Settings Dataset For the image deraining task, we use 6100 images from the PASCAL VOC dataset [6] as the domain $B$ and use the synthesis algorithm in [27] to generate rainy images as the domain $A$. These images are further split into two parts: 6000 for training and 100 for testing. All these images are used for the initial and adversarial training stages and for training the proxy surrogate model $N^*_0(\cdot)$. For model extraction attacks, since the images used by the attacker may differ from those used by the IP owner to train $H$ and $R$, we randomly select 6000 images from the COCO dataset [15] to train the surrogate model $N^*(\cdot)$. For image debone, we randomly select 6100 images from the open dataset ChestX-ray8 [23] as the domain $A$ and suppress the rib area with the algorithm in [25] to generate images as the domain $B$. These images are further divided in the same way as mentioned above. Without loss of generality, we adopt the gray-scale "Copyright" image and the color "Flower" image as the default target watermarks for debone and deraining, respectively, as shown in Fig. 6.7.


Fig. 6.7 Some visual examples to show the fidelity and the capacity of the proposed model watermarking: (a) unwatermarked image bi , (b) watermarked image bi , (c) the residual between bi and bi (enhanced 10×), (d) target watermark image, (e) extracted watermark image out from bi . The proposed method is suitable for both gray-scale and colorful watermark images

Evaluation Metric We leverage the widely used PSNR and SSIM to evaluate the fidelity. During the verification stage, if the NC value of an extracted watermark exceeds 0.95, we regard the watermark as successfully extracted. Based on this, the success rate (SR) is defined as the ratio of watermarked images in the testing set whose hidden watermark is successfully extracted. Unless otherwise specified, we report the results of the proposed method with the model-agnostic strategy.

6.4.2 Fidelity and Capacity In this section, we evaluate the fidelity and capacity of the proposed method in both qualitative and quantitative manners. Both simple gray-scale images and colorful images are considered as watermarks; for the gray-scale case, we even consider a QR code image. Several qualitative examples are shown in Fig. 6.7, where we also showcase the residuals between the unwatermarked image $b_i$ and the watermarked image $b'_i$. It can be seen that the proposed method not only embeds the watermark images into the cover images invisibly but also successfully extracts the embedded watermarks. Furthermore, we plot the gray histogram distributions of $b_i$ and $b'_i$ in Fig. 6.8, which also reflects that the proposed method achieves high fidelity.


Fig. 6.8 The gray histogram between the unwatermarked image and the watermarked image is almost coincident for both gray-scale and colorful watermark images

Table 6.1 Quantitative results of the proposed method. PSNR and SSIM are averaged over the whole testing set; "x-y" denotes task name x and watermark image name y

Task              PSNR   SSIM  NC      SR
Debone-Copyright  47.29  0.99  0.9999  100%
Debone-Flower     46.36  0.99  0.9999  100%
Debone-QR         44.35  0.99  0.9999  100%
Derain-Flower     41.21  0.99  0.9999  100%
Derain-Peppers    40.91  0.98  0.9999  100%
Derain-Lena       42.50  0.98  0.9999  100%
Derain-QR         40.24  0.98  0.9999  100%

In Table 6.1, we provide the quantitative results, namely, the average PSNR and SSIM, which further confirm that the fidelity requirement is well satisfied. On the other hand, our extractor $R$ still achieves a very high extracting ability, with a 100% success rate in all cases. Regarding capacity, our method can embed a full-size image (e.g., 256 × 256) as the watermark, which is much larger than what traditional spatial watermarking methods support (e.g., a 64-bit string).

6.4.3 Robustness to Model Extraction Attacks Besides the requirements of fidelity and capacity, another, more important goal is guaranteeing robustness against model extraction attacks. In a practical scenario, the attacker may train his surrogate model $N^*(\cdot)$ with different network structures and loss functions. Therefore, we simulate this case by using surrogate models trained with different settings to evaluate the robustness of the proposed method. In detail, we consider four different types of network structures: a vanilla convolutional network consisting only of several convolutional layers ("CNet"), auto-encoder-like networks with 9 and 16 residual blocks ("Res9," "Res16"), and the aforementioned UNet network ("UNet"). For the loss function, we adopt the


Table 6.2 The success rate (SR) of the proposed method resisting the attack from surrogate models N*() trained with L2 loss but different network structures

Settings  CNet   Res9   Res16  UNet
Debone    93%    100%   100%   100%
Derain    100%   100%   100%   100%
Debone†   0%     0%     0%     0%
Derain†   0%     0%     0%     0%

popular loss functions, namely the pixel-level L1 and L2 losses, the perceptual loss $\mathcal{L}_{perc}$, the adversarial loss $\mathcal{L}_{adv}$, and their combinations. Among all the loss settings, we find that using only the perceptual loss $\mathcal{L}_{perc}$ to train $N^*(\cdot)$ causes terrible performance (PSNR: 19.73; SSIM: 0.85) and makes the attack meaningless, so we discard the results under this setting. Because we utilize UNet and the L2 loss to train the proxy surrogate model $N^*_0(\cdot)$, we define the model extraction attack with this configuration as a white-box attack, and all other configurations as black-box attacks. Considering the limited computation resources, we conduct control experiments to demonstrate the robustness against model extraction attacks. We take the Debone-Copyright and Derain-Flower tasks as examples, and † denotes the results without adversarial training. Table 6.2 verifies that the proposed method can resist both white-box and black-box model extraction attacks, even though the proxy surrogate model $N^*_0(\cdot)$ is only trained with the L2 loss and UNet. For the CNet case, the SR is slightly lower than in the other cases; we attribute this to the fact that the surrogate model $N^*(\cdot)$ with CNet performs worse on the image debone task. To further demonstrate the robustness against different loss functions, we fix UNet as the network structure and train $N^*(\cdot)$ with different loss combinations. Table 6.3 shows that the proposed method resists different loss combinations with very high success rates. As described above, we introduce the adversarial training stage to enhance the extracting ability of $R$. To demonstrate its importance, we also conduct control experiments without adversarial training and provide the corresponding results in Tables 6.2 and 6.3 (annotated with "†"). It can be seen that if $N^*(\cdot)$ is trained with the default L2 loss but different network structures, the SR is 0% in all cases. With the default UNet structure, only the $N^*(\cdot)$ trained with some special loss functions can output watermarked images. The results in Tables 6.2 and 6.3 demonstrate the necessity of the adversarial training.

Table 6.3 The success rate (SR) of the proposed method resisting the attack from surrogate models N*() trained with the UNet network structure but different loss combinations

Settings  L1     L1+Ladv  L2     L2+Ladv  Lperc+Ladv
Debone    100%   100%     100%   100%     71%
Derain    100%   100%     100%   99%      100%
Debone†   0%     0%       27%    44%      0%
Derain†   0%     0%       0%     0%       0%


Fig. 6.9 Example watermarks extracted from the outputs of different surrogate models. (a) Unwatermarked input a_i; (b) watermarked output b'_i; (c)–(f) are N*() trained with L2 loss and different network structures: CNet, Res9, Res16, and UNet in turn; (g)–(k) are N*() trained with UNet and different loss combinations: L2, L2+Ladv, L1, L1+Ladv, and Lperc+Ladv in turn

In Fig. 6.9, we further provide one example watermark extracted from the outputs of different surrogate models. It shows that the extracted watermark image from the outputs of the surrogate model is visually consistent with the target watermark image, so our method can verify the model ownership intuitively.

6.4.4 Ablation Study Importance of Clean Loss and Consistent Loss At the watermark extracting stage, we add the extra clean loss $\ell_{clean}$ for effective forensics and the consistent loss $\ell_{cst}$ among different watermarked images for easier extraction. Here, we conduct two control experiments to demonstrate their importance. As shown in Fig. 6.10, without the clean loss, the extractor also extracts a meaningless watermark from unwatermarked images rather than a blank image. Moreover, the corresponding NC value is also high, which raises false alarms and makes the forensics invalid. As shown in Fig. 6.11, without the consistent loss, the extractor can only extract fragile watermark images from the outputs $O^*$ of the surrogate model $N^*(\cdot)$. This is because the lack of $\ell_{cst}$ makes it more challenging for the surrogate model to learn a unified watermark. By contrast, the extractor $R$ trained with $\ell_{cst}$ can always extract clear watermark images. Influence of Hyper-Parameter and Watermark Size For the hyper-parameter setting, we conduct control experiments with different $\lambda$, which balances the embedding and extracting abilities. As shown in Table 6.4, although the visual quality of the watermarked images and the NC value are influenced by different $\lambda$, the final results are all good, which means our algorithm is not sensitive to $\lambda$.


Fig. 6.10 Comparison results with (first row) and without (second row) clean loss: (a) and (c) are the unwatermarked images .ai , bi from domain .A, B, (b) and (d) are the extracted watermarks from images .ai , bi , respectively. Number on the top-right corner denotes the NC value

Fig. 6.11 Comparison results with (first row) and without (second row) consistent loss: (a) unwatermarked image .bi from domain .B, (b) watermarked image .bi from domain .O ∗ , (c) output   .bi of the surrogate model, (d) extracted watermark out from .bi


Table 6.4 Quantitative results of our method with different λ. We take the Debone-Copyright task as an example

λ     PSNR   SSIM    NC      SR
0.1   49.82  0.9978  0.9998  100%
0.5   48.44  0.9969  0.9998  100%
1     47.29  0.9960  0.9999  100%
2     45.12  0.9934  0.9999  100%
10    43.21  0.9836  0.9999  100%

Table 6.5 Quantitative results of our method with watermark images of different sizes. We take the Debone-Copyright task as an example

Size  PSNR   SSIM    NC      SR
32    46.89  0.9962  0.9999  100%
64    47.49  0.9965  0.9999  100%
96    48.06  0.9967  0.9999  100%
128   47.68  0.9965  0.9999  100%
256   47.29  0.9960  0.9999  100%

We further try different sizes of the watermark to test the generalization ability. Since we require the watermark size to be the same as the cover image in our framework, we pad the watermark with 255 if its size is smaller than 256. As shown in Table 6.5, our method generalizes well to watermarks of different sizes.

6.4.5 Extensions In this section, we provide the experimental results for the flexible extensions, i.e., the multiple-watermark strategy and the model-specific strategy. Multiple-Watermark Strategy As mentioned in Sect. 6.3.3.4, the proposed framework is flexible enough to embed multiple different watermarks with just a single embedding sub-network and a single extractor. To verify this, we take the debone task as an example and randomly select 10 different logo images from the Internet as watermarks for training, as shown in Fig. 6.12. For comparison, we use the logos "Copyright" and "Flower" as two representatives and compare them with the results of the default single-watermark setting. In Table 6.6, we find that the average PSNR for the multiple-watermark setting degrades from 47.76 to 41.87 compared to the default setting but is still larger than 40, which remains acceptable. We also display some visual examples in Fig. 6.13. In terms of robustness against model extraction attacks, we further provide comparison results in Tables 6.7 and 6.8. They show that the multiple-watermark setting is comparable to the default single-watermark setting according to the success extraction rate (SR). Model-Specific Strategy To demonstrate the feasibility of the model-specific strategy, we take the derain task as an example and jointly learn the deraining and watermark embedding processes within one single network. We call the target model


Fig. 6.12 Multiple logo images downloaded from the Internet

Table 6.6 Comparison between the multiple-watermark (*) and single-watermark settings in terms of fidelity and extracting ability

Task                PSNR   SSIM  SR
Debone_Copyright    47.29  0.99  100%
Debone_Copyright *  41.87  0.99  100%
Debone_Flower       46.36  0.99  100%
Debone_Flower *     41.67  0.98  100%

Fig. 6.13 Visual comparisons between the multiple-watermark (even columns) and the default single-watermark (odd columns) setting. The top row and the bottom row showcase the watermarked images and the corresponding extracted watermarks, respectively

trained with model-specific strategy as the self-watermarked model. In Table 6.9, we first compare the deraining performance between the original target model and the self-watermarked model. The self-watermarked model can keep the original deraining functionality well (PSNR:32.13 and SSIM: 0.93) while embedding the watermarks in the outputs. We further display one visual example in Fig. 6.14.


Table 6.7 Comparison of the robustness (SR) against model extraction attacks with different network structures between the multiple-watermark (*) and the default single-watermark setting

SR                  CNet  Res9  Res16  UNet
Debone_Copyright    93%   100%  100%   100%
Debone_Copyright *  89%   90%   94%    100%
Debone_Flower       73%   83%   89%    100%
Debone_Flower *     94%   97%   97%    100%

Table 6.8 Comparison of the robustness (SR) against model extraction attacks with different loss combinations between the multiple-watermark (*) and the default single-watermark setting

SR                  L1    L2    L1+Ladv  L2+Ladv  Lperc+Ladv
Debone_Copyright    100%  100%  100%     100%     71%
Debone_Copyright *  100%  100%  100%     100%     97%
Debone_Flower       100%  100%  100%     99%      100%
Debone_Flower *     100%  100%  100%     100%     99%

Table 6.9 The performance comparison between the original target model ("_original") and the self-watermarked target model ("_Flower")

Task             PSNR   SSIM  SR
Derain_original  32.49  0.93  NA
Derain_Flower    32.13  0.93  100%

Fig. 6.14 One visual example of the original target model and the self-watermarked model: (a) input image a_i, (b) ground-truth image (b_0)_i, (c) output b_i of the original target model, (d) output b'_i of the self-watermarked model, (e) target watermark, (f) extracted watermark from image b'_i

Besides the fidelity, the jointly trained extractor $R$ can also extract the watermarks well, with 100% SR. In Tables 6.10 and 6.11, we further compare the robustness of the model-agnostic strategy and the model-specific strategy against model extraction attacks with different network structures and loss functions. The results show that the model-specific strategy is very robust, with nearly 100% extraction success rates in all cases.


Table 6.10 The comparison between the model-agnostic strategy and the model-specific strategy against model extraction attacks with L2 loss but different network structures

SR              CNet  Res9  Res16  UNet
Model-agnostic  100%  100%  100%   100%
Model-specific  100%  100%  100%   100%

Table 6.11 The comparison between the model-agnostic strategy and the model-specific strategy against model extraction attacks with the UNet network structure but different loss combinations

SR              L1    L2    L1+Ladv  L2+Ladv  Lperc+Ladv
Model-agnostic  100%  100%  100%     99%      100%
Model-specific  100%  100%  100%     100%     100%

6.5 Discussion There are still some interesting questions to explore in the future. First, although training the embedding and extracting sub-networks jointly in an adversarial way makes it difficult to find an explicit pattern in the watermarked image, it is worth studying what implicit watermark is actually hidden. Second, it is difficult to give a theoretical proof of the effectiveness of the adversarial training stage. So far, we can only provide an intuitive explanation from two perspectives: (1) the surrogate models are trained with the same or similar task-specific loss functions, so their outputs are similar; (2) as explained in the recent work [22], even when trained with different loss functions, CNN-based image generators often share some common artifacts during the synthesis process. That is to say, the degradation brought by different surrogate models shares some common properties. In short, the adversarial training stage aims to make the extractor sub-network $R$ aware of what kind of degradation might appear in the output of a surrogate model, rather than of any particular surrogate model itself.

6.6 Conclusion IP protection for image processing networks is an important but seriously under-researched problem. In this chapter, we propose a novel model watermarking framework to protect image processing networks against model extraction attacks. Under this framework, the model-agnostic strategy, the model-specific strategy, and the multiple-watermark strategy are designed for flexible usage in practice. More experimental results can be found in [28, 29]. Acknowledgments This research was partly supported by the Natural Science Foundation of China under Grants U20B2047, 62072421, 62002334, 62102386, and 62121002, and the Exploration Fund Project of the University of Science and Technology of China under Grant YD3480002001. Thanks to Han Fang, Huamin Feng, and Gang Hua for helpful discussions and feedback.


References 1. Adi, Y., Baum, C., Cisse, M., Pinkas, B., Keshet, J.: Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In: USENIX (2018) 2. Barni, M., Bartolini, F., Piva, A.: Improved wavelet-based watermarking through pixel-wise masking. TIP 10(5), 783–791 (2001) 3. Chen, D., Liao, J., Yuan, L., Yu, N., Hua, G.: Coherent online video style transfer. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1105–1114 (2017) 4. Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G.: StyleBank: An explicit representation for neural image style transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1897–1906 (2017) 5. Dong, H., Pan, J., Xiang, L., Hu, Z., Zhang, X., Wang, F., Yang, M.-H.: Multi-scale boosted dehazing network with dense feature fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2157–2167 (2020) 6. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal Visual Object Classes (VOC) challenge. IJCV 88(2), 303–338 (2010) 7. Fan, L., Ng, K.W., Chan, C.S.: Rethinking deep neural network ownership verification: Embedding passports to defeat ambiguity attacks. In: Advances in Neural Information Processing Systems, pp. 4716–4725 (2019) 8. Fan, L., Ng, K.W., Chan, C.S., Yang, Q.: DeepIP: Deep neural network intellectual property protection with passports. IEEE Trans. Pattern Anal. Mach. Intell. (2021) 9. Fan, Q., Yang, J., Hua, G., Chen, B., Wipf, D.: A generic deep architecture for single image reflection removal and image smoothing. In: ICCV, pp. 3238–3247 (2017) 10. Hernandez, J.R., Amado, M., Perez-Gonzalez, F.: DCT-domain watermarking techniques for still images: Detector performance analysis and a new structure. TIP (2000) 11. Hong, M., Xie, Y., Li, C., Qu, Y.: Distilling image dehazing with heterogeneous task imitation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3462–3471 (2020) 12. Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. CVPR (2017) 13. Jiang, K., Wang, Z., Yi, P., Chen, C., Huang, B., Luo, Y., Ma, J., Jiang, J.: Multi-scale progressive fusion network for single image deraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8346–8355 (2020) 14. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and superresolution. In: ECCV, pp. 694–711. Springer (2016) 15. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Lawrence Zitnick, C.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision, pp. 740–755. Springer (2014) 16. Quan, Y., Teng, H., Chen, Y., Ji, H.: Watermarking deep neural networks in image processing. IEEE Trans. Neural Networks Learn. Syst. 32(5), 1852–1865 (2020) 17. Razzak, M.I., Naz, S., Zaib, A.: Deep learning for medical image processing: Overview, challenges and the future. Classif. BioApps, 323–350 (2018) 18. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI, pp. 234–241. Springer (2015) 19. Ruanaidh, J.J.K.O., Dowling, W.J., Boland, F.M.: Phase watermarking of digital images. In: ICIP. IEEE (1996) 20. Tancik, M., Mildenhall, B., Ng, R.: StegaStamp: Invisible hyperlinks in physical photographs. arXiv (2019) 21. 
Uchida, Y., Nagai, Y., Sakazawa, S., Satoh, S.: Embedding watermarks into deep neural networks. In: ICMR, pp. 269–277. ACM (2017) 22. Wang, S.-Y., Wang, O., Zhang, R., Owens, A., Efros, A.A.: CNN-generated images are surprisingly easy to spot... for now. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8695–8704 (2020)


23. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: CVPR (2017) 24. Wu, H., Liu, G., Yao, Y., Zhang, X.: Watermarking neural networks with watermarked images. IEEE Trans. Circuits Syst. Video Technol. (2020) 25. Yang, W., Chen, Y., Liu, Y., Zhong, L., Qin, G., Lu, Z., Feng, Q., Chen, W.: Cascade of multiscale convolutional neural networks for bone suppression of chest radiographs in gradient domain. Med. Image Anal., 35 (2017) 26. Yasarla, R., Sindagi, V.A., Patel, V.M.: Syn2Real transfer learning for image deraining using Gaussian processes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2726–2736 (2020) 27. Zhang, H., Patel, V.M.: Density-aware single image de-raining using a multi-stream dense network. In: CVPR, pp. 695–704 (2018) 28. Zhang, J., Chen, D., Liao, J., Fang, H., Zhang, W., Zhou, W., Cui, H., Yu, N.: Model watermarking for image processing networks. In: AAAI 2020 (2020) 29. Zhang, J., Chen, D., Liao, J., Zhang, W., Feng, H., Hua, G., Yu, N.: Deep model intellectual property protection via deep watermarking. IEEE Trans. Pattern Anal. Mach. Intell. (2021) 30. Zhang, J., Chen, D., Liao, J., Zhang, W., Hua, G., Yu, N.: Passport-aware normalization for deep model protection. Adv. Neural Inf. Process. Syst., 33 (2020) 31. Zhu, J., Kaplan, R., Johnson, J., Fei-Fei, L.: HiDDeN: Hiding data with deep networks. In: ECCV, pp. 657–672 (2018) 32. Zhu, J.-Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycleconsistent adversarial networks. In: ICCV, pp. 2223–2232 (2017)

Chapter 7

Watermarks for Deep Reinforcement Learning Kangjie Chen

Abstract In this chapter, we introduce a new watermarking scheme for deep reinforcement learning protection. To protect the intellectual property of deep learning models, various watermarking approaches have been proposed. However, considering the complexity and stochasticity of reinforcement learning tasks, we cannot directly apply existing watermarking techniques for deep learning models to the deep reinforcement learning scenario. Existing watermarking approaches for deep learning models adopt backdoor methods to embed special sample–label pairs into protected models and query suspicious models with these designed samples to claim and identify ownership. Challenges arise when applying such solutions to deep reinforcement learning models. Different from conventional deep learning models, which give a single output for each discrete input at one time instant, the current predicted outputs of a reinforcement learning model can affect subsequent states. Therefore, if we apply discrete watermark methods to deep reinforcement learning models, the temporal decision characteristics and the high randomness in deep reinforcement learning strategies may decrease the verification accuracy. Besides, existing discrete watermarking approaches may affect the performance of the target deep reinforcement learning model. Motivated by these limitations, in this chapter we introduce a novel watermark concept, temporal watermarks, which can preserve the performance of the protected models while achieving high-fidelity ownership verification. The proposed temporal watermarking method can be applied to both deterministic and stochastic reinforcement learning algorithms.

7.1 Introduction Deep reinforcement learning combines deep learning and reinforcement learning techniques to enable agents to perceive information from high-dimensional spaces, understand the context of their environments, and make optimal decisions based on the obtained information.

K. Chen () Nanyang Technological University, Singapore, Singapore e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 L. Fan et al. (eds.), Digital Watermarking for Machine Learning Model, https://doi.org/10.1007/978-981-19-7554-7_7


Fig. 7.1 Training a DRL model requires a huge amount of resources (training data, computational resources, and expertise), while the well-trained model faces plagiarism risks

Deep reinforcement learning (DRL) has been widely studied, and commercial applications of DRL have emerged and become increasingly prosperous, such as robotic control [42], autonomous driving [18], and game playing [20]. However, due to the complexity of reinforcement learning tasks (e.g., autonomous driving), as shown in Fig. 7.1, training such a well-performing DRL model requires huge computational resources, tremendous amounts of training data, and a great deal of expertise. Therefore, considering the huge resource consumption of DRL model training, many companies treat a well-trained deep reinforcement learning model as core intellectual property. However, the DRL model may be stolen or copied without authorization when the owner shares the model with a malicious user or deploys it onto edge devices such as autonomous vehicles and smart robots. Therefore, it is crucial to protect such intellectual property from illegal copying, unauthorized distribution, and reproduction. Watermarking, one of the most common approaches to protecting owners' intellectual property [6], was originally introduced to claim and verify the ownership of multimedia content such as images, audio, and videos. Generally, a set of watermarks (e.g., the owner's signature) is embedded into a multimedia signal while preserving its fidelity. Motivated by the idea of digital watermarks, researchers have proposed several deep learning watermarking methods to protect the intellectual property of deep learning models [1, 17, 31]. In these solutions, a set of unique input–output pairs, which normally will not be recognized by other models, is carefully crafted as watermarks. To protect a deep learning model, during model training the owner can optimize his model to memorize these special samples and labels. For verification, the designed input samples can be used to remotely query the suspicious


model, and then the model owner collects the corresponding predictions to identify the ownership. These solutions minimize the impact of the special pairs on the protected model's behavior for normal input queries while ensuring that the watermarks can be identified with high fidelity. For conventional supervised deep learning models, various watermarking techniques have been proposed. However, existing approaches cannot be applied to deep reinforcement learning models directly, considering the significant differences between DL and DRL models. Although a DRL model also adopts deep neural networks, reinforcement learning tasks, unlike traditional classification applications, deal with learning in sequential control problems. The performance and behavior of a DRL system are reflected by sequences of state–action pairs instead of discrete input–output pairs. An output (i.e., action) given by a DRL model for the current input (i.e., state) can affect the following states of the entire task. Therefore, existing backdoor-based watermark methods may lead to a wrong decision for a triggered state, which in turn affects all subsequent actions and states and may eventually cause the agent to fail or crash. Besides, the randomness in DRL policies and the stochasticity of reinforcement learning environments may decrease the verification accuracy when using discrete watermarks. Moreover, such special watermark samples behave abnormally compared with normal ones and can be detected and removed by an adversary [2, 5, 7]. As a result, existing watermarking methods for supervised deep learning models cannot be applied to DRL applications directly. Motivated by the above problems, we propose a novel temporal-based and damage-free watermarking scheme for either deterministic or stochastic deep reinforcement learning models. The key idea of our solution is to train the model such that its interaction with the environment follows unique state–action probability sequences, i.e., watermarks, which can be observed and verified by the model owner. We adopt the sequences of inputs (i.e., states) and output logits (i.e., action probability distributions) as the watermarks of deep reinforcement learning models. However, a naive realization of this strategy may cause failures, considering the temporal characteristics of DRL models. Therefore, we design three new techniques to overcome this problem. These techniques make the DRL model distinguishable by external users while still preserving the model's original behaviors and robustness. The rest of this chapter is organized as follows. Section 7.2 introduces the background of the Markov decision process, reinforcement learning, and deep reinforcement learning. We summarize related work on existing watermarking methods for supervised deep learning models and deep reinforcement learning models in Sect. 7.3. We introduce the concept of the temporal watermark and its requirements in Sect. 7.4. Section 7.5 presents the design of the temporal watermarking approach. We discuss some limitations of the proposed approach in Sect. 7.6 and conclude in Sect. 7.7.


7.2 Background

7.2.1 Markov Decision Process

Before introducing the concept of the Markov Decision Process (MDP), we first introduce some related definitions, starting with the Markov property. The Markov property refers to the memoryless property of a stochastic process. A state is said to satisfy the Markov property if p(s_{t+1} | s_t) = p(s_{t+1} | s_1, ..., s_{t−1}, s_t). In other words, given a state s_t, the future states depend only upon the current state and are not affected by past states. Based on the Markov property, we introduce the definition of the Markov process. A Markov process is a stochastic process in which the future is independent of the past, given the present; that is, all states in a Markov process satisfy the Markov property. Formally, a Markov process can be modeled as a tuple ⟨S, P⟩, where S is the state space and P is the state transition probability. If we add a reward function r and a discount factor γ to the Markov process, it becomes a Markov reward process, formulated as a tuple ⟨S, P, r, γ⟩. The reward function r specifies a real number for every state: at every time step t, a feedback reward r_t is given until the terminal state s_T is reached. The long-term profit therefore consists of the immediate rewards r_t, which is normally summarized as the return value R_t at time step t:

R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... + γ^{T−t−1} r_T = Σ_{i=0}^{T−t−1} γ^i r_{t+i+1}    (7.1)

where γ ∈ [0, 1] is a discount factor, which represents how much future rewards are worth at the present moment. There are several reasons for using a discount factor. In finance, immediate returns are more profitable than delayed returns, and the discount factor is consistent with the human trait of caring more about immediate interests. Other reasons include mathematical convenience and avoiding getting stuck in infinite loops. When we add a decision process to the Markov reward process, it becomes a Markov decision process (MDP). Formally, we denote an MDP by a tuple ⟨S, A, P, r, γ⟩, where S is the state space, A is the action space, P : S × A × S → [0, 1] is the state transition probability, and r(s, a) is the reward function that represents the reward obtained at state s_t when taking action a_t. Compared with the Markov reward process, the MDP has an action space, and the definitions of P and r depend on the chosen action. Formally, the transition probability is defined as P(s_t, a_t, s_{t+1}) = p(s_{t+1} | s_t, a_t).
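To make the return computation in Eq. (7.1) concrete, the following short Python sketch computes the discounted return for a recorded reward sequence. The function name and the example rewards are illustrative assumptions, not part of the original text.

def discounted_return(rewards, gamma=0.99):
    """Compute R_t as in Eq. (7.1) for t = 0, given rewards r_1, ..., r_T."""
    ret = 0.0
    # Iterate backwards so each reward is discounted once per remaining step.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# Example: a three-step episode with rewards 1, 0, 2 and gamma = 0.9
# gives R_0 = 1 + 0.9*0 + 0.81*2 = 2.62.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))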


Fig. 7.2 Reinforcement learning process

7.2.2 Reinforcement Learning

A reinforcement learning (RL) task can be described by a Markov decision process (MDP), formulated as a tuple ⟨S, A, P, r, γ⟩. As illustrated in Fig. 7.2, at time step t in a reinforcement learning environment, the agent observes the current state s_t ∈ S of the environment, chooses an action a_t ∈ A according to its policy π, and receives a reward r_t. This interaction loop continues until the agent reaches the terminal state s_T of the environment. The goal of the agent is to find an optimal policy π* that maximizes the expected cumulative reward R_t = Σ_{t=t_0}^{T} γ^{t−t_0} r_t, where T is the terminal time step and γ ∈ [0, 1] is a discount factor. A larger γ makes the agent farsighted, whereas the agent cares more about immediate rewards when γ is close to 0. To obtain a good policy that maximizes the expected cumulative reward, we first define the state value function, which evaluates how good a state s is. Specifically, for a reinforcement learning policy π, the state value function V_π(s) is the expected return of state s:

V_π(s) = E[R_t | s_t = s, π]    (7.2)

Likewise, for a state–action pair (s, a), the state–action value function Q_π(s, a) is defined as

Q_π(s, a) = E[R_t | s_t = s, a_t = a, π]    (7.3)

Based on Eq. (7.1), we can expand V_π(s) and Q_π(s, a) to express the relationship between two consecutive states s = s_t and s′ = s_{t+1} as

V_π(s) = Σ_a π(a | s) Σ_{s′} p(s′ | s, a) (r_{s→s′|a} + γ V_π(s′))    (7.4)


and

Q_π(s, a) = Σ_{s′} p(s′ | s, a) (r_{s→s′|a} + γ Σ_{a′} π(a′ | s′) Q_π(s′, a′))    (7.5)

where r_{s→s′|a} = E[r_{t+1} | s_t, a_t, s_{t+1}]. Equations (7.4) and (7.5) are called the Bellman equations. Because of the Markov property of an MDP, the sub-problem at a given state depends only on the previous state and action, so the Bellman equations decompose the overall problem into recursive sub-problems. Therefore, we can use dynamic programming to solve reinforcement learning tasks [19, 34, 39, 40]. However, dynamic programming is only suitable for tasks with a limited number of states and actions. The input data of many real-world reinforcement learning problems are high-dimensional, such as images and sounds. For example, in an autonomous driving task, the driving direction and speed of the car must be determined from the current camera, radar, and LiDAR data. Considering the limitations of memory and computational power, dynamic programming methods are infeasible for such complex reinforcement learning tasks.
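As a concrete illustration of solving a small, fully tabular MDP with dynamic programming, the sketch below performs value iteration using the Bellman backup underlying Eqs. (7.4) and (7.5). The transition and reward tables are hypothetical toy inputs, not taken from the chapter.

import numpy as np

def value_iteration(P, R, gamma=0.9, iters=100):
    """P[s, a, s'] = transition probability, R[s, a] = expected reward.
    Returns state values V and a greedy deterministic policy."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        # Bellman backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') V(s')
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

# Toy 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V, policy = value_iteration(P, R)
print(V, policy)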

7.2.3 Deep Reinforcement Learning

Reinforcement learning has attracted much attention due to its achievements in some simple tasks [22, 32]. However, traditional reinforcement learning algorithms cannot effectively scale up to large-scale scenarios since most of them rely on tabular representations. With the development of deep learning techniques, deep reinforcement learning (DRL) emerged. Instead of using tabular representations, DRL applies deep neural networks to represent the policy and demonstrates human-level or better performance in many control tasks with high-dimensional states [18, 20, 42]. These achievements are mainly attributed to the powerful function approximation capability of the deep neural networks adopted by DRL. Furthermore, deep neural networks can better understand the high-dimensional states of a complex environment by learning low-dimensional feature representations, and good state representations make it much easier to obtain optimal strategies for the complex environment. For example, if the environment states are raw, high-dimensional visual information, convolutional neural networks (CNNs) can be used in RL algorithms to learn state representations from the raw inputs. There has been a lot of research on deep reinforcement learning. Here, we divide these algorithms into three categories: value-based methods, policy-based methods, and hybrid methods that combine value-based and policy-based algorithms. As the name implies, value-based approaches need a value function. To obtain the optimal value function, deep neural networks are trained to approximate


it in DRL algorithms. Deep Q-network (DQN) [20], a popular DRL algorithm, trains a Q-value network to estimate the value of each state–action pair. This type of algorithm yields a deterministic policy that directly chooses the optimal action, i.e., the one with the maximum Q-value, from a discrete action space. Therefore, value-based algorithms are suitable for deterministic policies in tasks with discrete action spaces. However, for a task with a continuous action space (e.g., autonomous driving), quantities such as the acceleration of the vehicle are continuous actions, and value-based algorithms cannot be applied because maximizing over a continuous action space is computationally intractable. Policy-based algorithms, which directly optimize the policy, are more suitable for tasks with continuous action spaces. The classical RL algorithm REINFORCE [37] is of this type. A policy-based algorithm optimizes the policy by adjusting the action probability distribution over the action space. Once the optimal policy is obtained, the agent samples actions according to the action probability distribution over the action space. We call such policies, which select actions with randomness, stochastic policies. However, the randomness in policy-based algorithms may affect the convergence of the training process. To combine the advantages of the above two types of algorithms, the hybrid method, also called the actor-critic approach, was proposed. This type of algorithm trains a value function for the state and also optimizes the policy, and thus succeeds in both deterministic and stochastic cases. Some famous DRL algorithms, such as Proximal Policy Optimization (PPO) [27], Actor-Critic with Experience Replay (ACER) [36], and Actor-Critic using Kronecker-Factored Trust Region (ACKTR) [38], fall into this category and achieve state-of-the-art performance.
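The practical difference between the two policy types comes down to how an action is drawn from the model's output. The minimal Python sketch below contrasts the two; the arrays and function names are illustrative assumptions that mirror the notation used later in this chapter.

import numpy as np

def act_deterministic(q_values):
    """Value-based style (e.g., DQN): pick the action with the highest Q-value."""
    return int(np.argmax(q_values))

def act_stochastic(action_probs, rng=np.random.default_rng()):
    """Policy-based style (e.g., REINFORCE): sample from the action
    probability distribution (APD) over the discrete action space."""
    return int(rng.choice(len(action_probs), p=action_probs))

q = np.array([0.1, 0.7, 0.2])   # hypothetical Q-values
p = np.array([0.1, 0.7, 0.2])   # hypothetical APD (sums to 1)
print(act_deterministic(q), act_stochastic(p))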

7.3 Related Work

7.3.1 Watermarks for Supervised Deep Learning Models

Recently, several works have proposed watermarking schemes to protect the intellectual property of deep learning (DL) models. These works can be divided into two categories: white-box watermarking and black-box watermarking. In white-box DL watermarking approaches, the watermarks are explicitly embedded into the parameters of the DL model without degrading the performance of the watermarked model. Uchida et al. [33] proposed a white-box watermarking scheme which uses a bit vector as the watermark; to embed this watermark into the parameters of the target DL model, they introduced a particular parameter regularizer into the loss function. Later, a new white-box watermarking scheme, which injects watermarks into the probability density function of the activation layers, was proposed by Rouhani et al. [24]. Compared with the previous method, this scheme has less influence on the static properties of the model parameters. However, these white-box watermarking schemes have some


limitations. For example, during the verification phase, the model owner needs access to the parameters of the DL model. Therefore, these watermarking methods cannot be applied in scenarios where external parties cannot inspect the internals of the DL model, i.e., when the DL model is a black box. Black-box watermarking methods were proposed to handle this scenario. In a black-box watermarking scheme, instead of explicitly embedding watermarks into the parameters, the owner first carefully crafts some input samples and then trains the target DL model in a special way so that the model outputs unique results for these crafted samples. Meanwhile, these operations should not influence the performance of the DL model, meaning that the model still outputs correct results for normal samples. During the verification stage, the owner of the protected model only needs to feed the crafted samples to the suspicious model and collect the corresponding outputs. If the outputs are very similar to the predefined unique results for these crafted samples, then the suspicious model is highly likely to have been stolen from the protected model. Several black-box watermarking approaches are based on backdoor attack methods: they embed samples containing triggers, or out-of-distribution samples, into the target models as watermarks and query the suspicious model to verify the existence of the watermarks [1, 17, 21, 41]. Adversarial examples were also adopted to detect suspicious models by Le Merrer et al. [15]; this method extracts and compares the classification boundaries of the protected and suspicious models. During the verification process, black-box watermarking schemes do not need access to the parameters of the DL model. Therefore, black-box watermarking schemes are more widely applicable than white-box schemes, as they enable verification with only black-box access while achieving very satisfactory accuracy.

7.3.2 Watermarks for Deep Reinforcement Learning Models

Different from DL models, few works have studied watermarking solutions in the reinforcement learning setting, and it is more challenging to protect the IP of DRL policies with watermarks. First, a DRL model interacts with the environment in a sequential process. An abnormal prediction at one instant caused by the watermarks in the model can affect the entire process with a cascading effect; past work has demonstrated that a small perturbation added to one state can cause catastrophic consequences for the entire system and process [11]. Second, since DRL is usually adopted in safety-critical tasks (e.g., autonomous driving [14, 26] and robotic control [9, 16]), it has stricter requirements for robustness and accuracy, which restricts the owner's room to alter the policy for verification. As a result, although adversarial examples [23, 25, 29] and backdoor attacks [12, 35] have been realized against DRL models, they have not yet been utilized for the purpose of watermarking. Behzadan et al. [3] proposed, for the first time, a watermarking scheme for DRL models. They embed sequences of out-of-distribution (OOD) states and the corresponding specific


action pairs into deterministic DRL models. The ownership verification relies on the fact that only the watermarked DRL models follow that exact sequence. However, the robustness of the scheme remains unknown, and the extraction process can easily fail since the OOD states can be detected by adversaries. Besides, forcing specific actions on the OOD states can affect the performance of the watermarked model, especially for DRL applications with small action spaces. Moreover, the proposed method only considers the deterministic DQN policy and leaves stochastic policies unexplored. Since DRL policies also adopt DNN architectures, one straightforward direction is to follow the black-box watermarking scheme introduced in Sect. 7.3.1. A potential approach for DRL watermarking along this line is to adopt DRL backdoor techniques [12, 35]. For instance, Kiourti et al. [12] embedded backdoor patterns into the DRL model and augmented the policy with hidden malicious behaviors. Unfortunately, the owner is required to continuously insert the trigger, which is not feasible during the extraction process. On the other hand, similar to backdoor techniques for classification tasks, DRL backdoor samples can be easily detected, and the reinforcement learning system may crash once the backdoors are activated, which is intolerable in safety-critical systems.

7.4 Problem Formulation

In this section, we introduce the threat model for IP protection in the deep reinforcement learning context. Then we formally define the DRL watermarking problem and its solution, the temporal watermark.

7.4.1 Threat Model

For a reinforcement learning task, an agent tries to learn an optimal policy (i.e., a DRL model M) through interaction with the environment env, so as to make the optimal decision for each state s according to the rewards obtained from env. Given a single state s, similar to conventional deep learning models, a DRL model M outputs an action probability distribution (APD) P over the action space A. Given the APD, deterministic and stochastic DRL policies choose actions in different ways. Without loss of generality, we introduce the DRL watermarking approach for stochastic DRL policies, which are the more general form of DRL policies. The overview of the proposed temporal watermarking framework for deep reinforcement learning models is illustrated in Fig. 7.3. Similar to the system model used in conventional DL watermarking scenarios [1, 41], for a target DRL model M, an unauthorized user may copy the target model M illegally and try to use it in business scenarios. To evade inspection, the attacker may apply model transformation techniques to modify the copied model slightly.

Fig. 7.3 The overview framework of the watermarking approach for DRL models

For example, fine-tuning and model compression can change the parameters of a deep learning model while preserving its performance. To protect the intellectual property of a DRL policy, the model owner can embed designed watermarks into his model during the training process. Thus, for a suspicious model M′, the owner can claim ownership and verify whether the special watermarks can be detected in M′.

7.4.2 Temporal Watermarks for Deep Reinforcement Learning

Watermarking was first proposed to protect the intellectual property of multimedia content, such as images and videos. As shown in Fig. 7.4, we can add a special marker to an image. The markers are normally pictures with special meaning (e.g., names, trademarks, and logos), so that we can recognize the marker and claim ownership of the protected image. Recently, deep learning has achieved great success in various domains. Training a deep learning model requires huge resources, and thus well-trained DL

Fig. 7.4 Digital watermark for image intellectual property protection

Fig. 7.5 Spatial watermark for image classifier protection. (a) Unprotected image classifier with the original image, (b) unprotected image classifier with the triggered image, and (c) watermarked image classifier with the triggered image

models have become the core intellectual property of companies. To protect these precious assets, DL watermarks have been introduced for DL model protection. Backdoors can be embedded into a deep learning model so that the model gives designed outputs for carefully crafted input samples; therefore, backdoor techniques can be applied to embed watermarks into DL models. Zhang et al. [41] used special sample–label pairs as watermarks to protect the IP of deep learning models. The model owner adds these special pairs to the training dataset and trains a supervised DL model to make it remember these special data points. The designed sample–label pairs can then be verified as watermarks for the DL model while the accuracy on normal samples is preserved. During the verification stage, the owner queries the suspicious model with these special samples and identifies the existence of the watermarks based on the predictions. As illustrated in Fig. 7.5, unprotected image classifiers output the same results for both normal images and images with triggers, while a protected classifier gives a special prediction for the image with triggers, so that the model owner can verify and claim ownership of the model. Normally, watermarks used in DL model protection are input–output pairs, where each pair is independent and has no relationship to the others. Therefore, we refer to these discrete watermarks for deep learning models as spatial watermarks. Challenges arise when applying existing supervised DL watermarking solutions to deep reinforcement learning models. As discussed in Sect. 7.1, although a DRL model also adopts deep neural networks, reinforcement learning tasks, unlike supervised deep learning, deal with learning in sequential control problems. Spatial watermarks cannot reveal the


characteristics of DRL models. Therefore, we cannot apply existing DL watermarking techniques to DRL models directly and need to design a new watermark form for deep reinforcement learning models. With the designed watermark, the protected DRL model is expected to exhibit distinct behaviors on certain states within the same environment, behaviors that are unique enough for ownership verification against different model transformations yet have minimal impact on the model's operation. The method should also be general for various DRL policies (both deterministic and stochastic). Therefore, for the intellectual property protection problem of DRL models, we propose a novel concept, the temporal watermark. Considering the stochasticity and sequential characteristics of deep reinforcement learning models, a sequential watermark can reflect the characteristics of a DRL policy and improve the verification accuracy of the embedded watermark. Similar to watermarks for supervised models, we design a watermark formed from special input–output pairs. But, different from watermarks for DL models, these input–output pairs are correlated and arranged into a temporal sequence. Besides, given an input state, a DRL agent normally outputs an action probability distribution (APD) over the action space, so we cannot use a deterministic action or label, as in DL models, as the watermark. Therefore, we use sequences of state–APD pairs as watermarks for DRL models. However, there is a huge number of possible state–APD pair sequences in a reinforcement learning task. Which one should be selected as our temporal watermark? In image protection tasks, digital watermarks are normally selected from special markers, such as names, logos, and trademarks, which the image owner can add onto his images to protect their IP. We call these special markers watermark candidates. Similarly, we can generate watermark candidates for watermarking DRL models. These candidates should be easy to recognize and should not affect the function of the protected deep reinforcement learning models. Therefore, we need to define the requirements for these DRL watermark candidates. A good watermark should have several properties:

Functionality-Preserving For a well-trained model M without watermarks embedded, the protected model M̂ with embedded watermarks should retain competitive performance compared with M.

Damage-Free To protect the IP of supervised DL models, backdoors embedded in the target models are treated as watermarks. However, for DRL models, the embedded backdoors can significantly change the decisions on special samples (i.e., states), which can lead to severe consequences in safety-critical tasks, such as autonomous driving and robotic control. Therefore, the watermarks should not decrease the performance of the protected DRL model (i.e., they should be damage-free).

Imperceptible Out-of-distribution state sequences have been adopted as watermarks for DRL models [3]. However, these OOD states can be easily detected by an attacker, who will then evade the watermark verification. Therefore, we need to


use normal states in the environment as watermarks so that the watermark is imperceptible to the adversary.

To find suitable watermark candidates for deep reinforcement learning models, we need to make sure that these candidates satisfy the requirements of functionality preservation, damage-freeness, and imperceptibility. In traditional watermarking methods, taking image protection as an example, the embedded watermarks are normally selected carefully from meaningful words or pictures, such as names, logos, and trademarks. At the same time, these watermark candidates should not cover the key content of the original image or affect its meaning. Similarly, watermark candidates for deep reinforcement learning models should not affect the performance of the agent. Therefore, we introduce a new concept, the damage-free state, to achieve these goals:

Definition 7.1 (Damage-Free State) For a state–APD pair (s, P), s is sampled from the state space S and P defines the action probability distribution over the action space A. Given a state s, let a* ∈ A be the optimal action with the highest probability, and let σ = Var(P) denote the variance of P. Then s is an (ε, ψ) damage-free state if the variance of P is smaller than ε and the agent can obtain a minimum return of ψ in the current environment when it selects any legal action a ∈ A \ {a*}.

Informally, as shown in Fig. 7.6, given a state s_i, if the agent can select any legal action in the action space and still complete the task perfectly, s_i is a damage-free state. This means that the DRL policy treats all legal actions in s_i as essentially equivalent; the action probability distribution tends to be close to uniform over the action space and its variance is small. In contrast, if the agent strongly prefers a certain action a* in a state s_i, the APD variance will be large, which means that s_i is a critical state for the task and selecting other actions may cause a failure or crash. To reduce the negative impact of watermark embedding on the agent's behavior, we should select watermarks carefully. Since damage-free states have minimal effect on the performance of an agent, we can select a sequence of damage-free states to form a temporal watermark candidate for a deep reinforcement learning model.
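A minimal sketch of the damage-free test in Definition 7.1 is given below. The model/env interfaces and the score_fn callback are assumptions made for illustration: score_fn is expected to replay the current episode while forcing a given action at the given state and return the final episode score.

import numpy as np

def is_damage_free(state, model, env, score_fn, eps, psi):
    """Check the (eps, psi) damage-free condition of Definition 7.1."""
    apd = np.asarray(model.action_prob(state))   # action probability distribution P
    if np.var(apd) >= eps:                       # condition 1: APD is nearly uniform
        return False
    best = int(np.argmax(apd))                   # optimal action a*
    scores = [score_fn(env, state, a) for a in range(len(apd)) if a != best]
    return min(scores) > psi                     # condition 2: worst-case return above psi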




Fig. 7.6 Damage-free states in a deep reinforcement learning model





Fig. 7.7 Watermark candidate in a deep reinforcement learning model

Following the definition of the damage-free state, the temporal watermark candidate can be defined as follows.

Definition 7.2 (Watermark Candidate) For a target DRL model M, a watermark candidate is a sequence of unique temporal damage-free states and their corresponding APDs predicted by M:

TW = [(s_0, P_0), (s_1, P_1), ..., (s_{L−1}, P_{L−1})]

According to the definition of the damage-free state, a temporal watermark candidate ensures that changing the behavior in these states has minimal effect on the performance of the DRL model. The candidates can therefore be used as special watermarks for DRL models if we modify their APDs into special distributions. It is worth noting that it is not necessary to make the entire state sequence of an episode damage-free, since the action probabilities become harder to observe as the length of the sequence increases. Therefore, we only focus on the first L state–action-probability pairs. In particular, as shown in Fig. 7.7, for each watermark candidate, we choose a sequence whose first L states are damage-free to restrict the negative impact on performance. Besides, during the generation process, we can generate a set of watermark candidates so that we have some flexibility to choose a suitable one for watermark embedding. With the generated temporal watermark candidates at hand, the model owner can embed them into the target DRL model to protect its intellectual property. The detailed watermark embedding and verification processes are introduced in the following sections. In this section, we introduced the new concept of the temporal watermark, which can be used to protect the intellectual property of DRL models, together with a damage metric that identifies sequences within the same environment, usable as watermarks, that have minimal impact on the performance of the original DRL model. We hope this gives an initial understanding of the temporal watermark designed for deep reinforcement learning models. In the following sections, we describe in detail how the temporal watermark is used for DRL models, which involves three algorithms: watermark candidate generation, watermark embedding, and ownership verification.


7.5 Proposed Method

In this section, we introduce a novel IP protection methodology, temporal watermarking, for deep reinforcement learning models. Existing deep learning watermarking techniques focus on spatial watermarks, which cannot be applied to DRL models. Instead, we design three new algorithms that form a temporal watermarking scheme for DRL models. Figure 7.8 illustrates the workflow of our temporal watermarking approach.

(a) In the watermark embedding stage, according to the requirements for DRL watermarks discussed above, the model owner selects a set of watermark candidates with the WMGen algorithm. WMGen generates a dataset C consisting of n sequences of damage-free states and their corresponding APD pairs, each of length L. The watermark candidate set is defined as follows:

C = {TW_i}_{i=0}^{n−1},  TW_i = [(s_{i,0}, P_{i,0}), (s_{i,1}, P_{i,1}), ..., (s_{i,L−1}, P_{i,L−1})]

in which s_{i,j} is the j-th damage-free state of the i-th sequence and P_{i,j} is the corresponding APD.

(b) After that, the owner selects a watermark candidate from C and adds it to the training data. The owner then trains a watermarked model M̂ with Embed so that the protected model memorizes the watermark candidate (i.e., for all s_{i,j} with i ∈ [0, n), j ∈ [0, L), the APD of M̂ is changed from P_{i,j} to P̂_{i,j}). Due to the randomness in the training process, the final watermark embedded in the protected model may differ slightly from the designed APDs. Therefore, once the watermarked model is well trained, the model owner queries the model with the damage-free states in the watermark candidate and

Fig. 7.8 The overview of the proposed temporal watermarking scheme


obtains the final watermark sequences Ŵ, which are formally defined as follows:

Ŵ = {T̂W_i}_{i=0}^{n−1},  T̂W_i = [(s_{i,0}, P̂_{i,0}), (s_{i,1}, P̂_{i,1}), ..., (s_{i,L−1}, P̂_{i,L−1})]

(c) To verify the existence of the watermark, the model owner sends the damage-free states of his temporal watermark to a suspicious model M′ and collects the predictions to compute the state–APD pairs W′ of the suspicious model. The Verify algorithm runs the suspicious model M′ on the damage-free states and collects the corresponding state–APD sequences, which are formally defined as follows:

W′ = {TW′_i}_{i=0}^{n−1},  TW′_i = [(s_{i,0}, P′_{i,0}), (s_{i,1}, P′_{i,1}), ..., (s_{i,L−1}, P′_{i,L−1})]

To check whether the suspicious model M′ was copied from the watermarked one, the model owner compares Ŵ and W′ and calculates the similarity between them. If the similarity between Ŵ and W′ exceeds a predefined threshold τ, the suspicious model is judged to be very similar to the protected DRL model; otherwise, Verify outputs 0, indicating that the suspicious model is different from the protected one.

With the designed algorithms, our temporal watermarking scheme satisfies all three requirements for DRL model watermarking. Below, we describe each algorithm in more detail.

7.5.1 Watermark Candidate Generation

From the previous section, we know that the temporal watermark is a suitable watermark form for deep reinforcement learning models. We now need to select suitable temporal watermarks for the subsequent watermarking process. As discussed above, each DRL watermark candidate is formed by a short sequence of damage-free states obtained from normal reinforcement learning tasks. Therefore, in our scheme, we design a candidate generation algorithm, WMGen, which searches for watermark candidates in a brute-force way. The detailed procedure of WMGen is given in Algorithm 1. Given a target DRL model M to be protected,¹ WMGen identifies the damage-

¹ We consider two cases: (1) if the model owner has a well-trained DRL model and wants to embed watermarks into it, he can fine-tune the target model with the watermark samples; (2) if he wants to embed watermarks into a target model from scratch, he can first train a clean model to generate watermark candidates and then train a watermarked model with these watermark candidates.


Algorithm 1 WMGen: Generating (ε, ψ) damage-free temporal watermark candidates
Input: Clean DRL model M, environment env, candidate number n, length L
Output: Watermark candidate dataset C
1: C ← ∅
2: while |C| < n do
3:   TW ← ∅
4:   Randomly sample s ∈ S and env.reset(s)
5:   while current episode is not finished do
6:     P ← M.action_prob(s) and a* ← max_a(P)
7:     if |TW| < L then
8:       score ← the minimal score of the episodes that traverse all a ∈ A \ {a*}
9:       if score > ψ and Var(P) < ε then
10:        TW.add((s, P))
11:      else
12:        goto Line 2
13:      end if
14:      a ← sample an action following P
15:      s ← env.step(a)
16:    end if
17:    C.add(TW)
18:  end while
19: end while
20: Return C

free states and form a dataset of watermark candidates. To generate watermark candidates that satisfy the requirements, the model owner first initializes an empty watermark candidate set C and takes the following steps. (1) If the size of the candidate set C is smaller than the expected size n, he randomly samples an in-distribution state s from the state space S. He then queries the target model M with s, obtains the action probability distribution P, and determines the optimal action a* for state s, i.e., the action with the highest probability (Line 6). (2) Next, he checks whether s is a damage-free state. Specifically, he traverses all legal actions in the action space A except the optimal action a* and collects the minimal score for the current task obtained from env. If the minimal score is larger than the threshold ψ and the variance of the APD P is smaller than ε, then s is a damage-free state and the state–APD pair (s, P) is added to the watermark candidate sequence TW. Otherwise, he rolls back and starts from a new initial state (Line 2). The model owner repeats this procedure until the watermark candidate set C is full, at which point it consists of multiple temporal watermark candidates.
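The following Python sketch mirrors the spirit of Algorithm 1 for collecting one candidate sequence. The env/model interfaces (reset, step, action_prob), the score_fn callback, and the is_damage_free helper follow the earlier illustrative sketch and are assumptions rather than code from the chapter.

import numpy as np

def generate_candidate(model, env, score_fn, L, eps, psi, rng=np.random.default_rng()):
    """Collect one temporal watermark candidate: L damage-free (state, APD) pairs
    gathered along a normal rollout, or None if a non-damage-free state is hit."""
    state = env.reset()                       # start a fresh in-distribution episode
    candidate = []
    while len(candidate) < L:
        apd = np.asarray(model.action_prob(state))
        if not is_damage_free(state, model, env, score_fn, eps, psi):
            return None                       # roll back: restart from a new initial state
        candidate.append((state, apd))
        action = int(rng.choice(len(apd), p=apd))   # keep following the policy as usual
        state = env.step(action)
        # A real implementation would also stop if the episode terminates early.
    return candidate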

7.5.2 Watermark Embedding

Let M be the target model, which can be a freshly initialized or a well-trained DRL model. We now protect the target DRL model M with the generated watermark candidates. In


particular, we design a novel watermark embedding algorithm, Embed, to embed unique temporal watermarks into M while satisfying the requirements of functionality preservation and damage-freeness. For the damage-free states in the watermark candidates, Embed modifies the parameters of the model and encourages the target DRL model to give different actions (or at least different APDs). Let (s, P) be a damage-free state and its corresponding APD in a watermark candidate TW ∈ C, and let a* be the optimal action that the DRL policy would choose with the highest probability. We aim to encourage the model M to learn a different APD P̂ by forcing it to select a different action ã randomly sampled from A \ {a*}. To this end, for the damage-free states, we add an incentive reward on top of the original reward for the action ã, so that the agent tends to select actions other than the optimal action a*, and the APD of this state is thereby modified in the target model M. Formally, for a damage-free state s in a watermark candidate TW, the new reward function r^e(s, a) returns the sum of the original reward r(s, a) and an additional incentive reward η:

r^e(s, a) = r(s, a) + η,  if s ∈ TW and a = ã
r^e(s, a) = r(s, a),      otherwise    (7.6)
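Equation (7.6) amounts to a simple reward-shaping wrapper around the environment's reward. A minimal sketch is shown below; the watermark_actions mapping and the function name are illustrative assumptions.

def shaped_reward(original_reward, state, action, watermark_actions, eta=1.0):
    """Implement Eq. (7.6): add an incentive eta when the agent picks the
    designated substitute action a~ at a watermark (damage-free) state.
    watermark_actions maps each watermark state (or a hashable key of it) to a~."""
    if state in watermark_actions and action == watermark_actions[state]:
        return original_reward + eta
    return original_reward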

We choose the same loss function L(s) used in reinforcement learning to optimize the model, where we replace the original reward function with the new one. For stochastic deep reinforcement learning policies (e.g., REINFORCE [37]), the cross-entropy loss function is adopted to train the model:

L(s) = cross_entropy_loss(M(s), a) · G(s)    (7.7)

G(s) = r^e(s, a) + γ G(s′)    (7.8)

where G(s) is the discounted cumulative return with discount factor γ, and s′ is the next state in the environment. For a deterministic reinforcement learning model (e.g., DQN [20]), which simply takes the action with the highest Q-value instead of sampling from the APD, we adopt the temporal difference (TD) error [30] to optimize the model:

L(s, a) = (r^e(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a))^2    (7.9)

where Q(s, a) is the state–action value function estimating how good action a is in state s, and s′ and a′ are the next state and its corresponding action. The parameters of M can be optimized with the above loss using gradient descent:

θ_{t+1} = θ_t − lr ∇ Σ_{j=0}^{T−1} L(s_j)    (7.10)


Algorithm 2 Embed: Embedding watermarks into the DRL model M
Input: DRL model M, environment env, watermark candidates C, length T, reward threshold R
1: Initialize the training buffer B ← ∅
2: for (s, P) ∈ C do
3:   ã ← sample a random action in A \ {a*}
4:   r̃ ← r^e(s, ã)
5:   B.add((s, ã, r̃))
6: end for
7: for each seed ∈ S do
8:   while current episode is not finished do
9:     a ← sample an action following P
10:    s, r ← env.step(a)
11:    if s ∉ C then
12:      B.add((s, a, r))
13:    end if
14:  end while
15:  θ_M ← θ_M − lr ∇ Σ L(s)
16:  if eval(M) ≥ R then
17:    M̂ ← M
18:    goto Line 7
19:  end if
20: end for
21: for each TW ∈ C do
22:   s ← the first damage-free state of TW
23:   T̂W ← ∅
24:   while |T̂W| ≤ T do
25:     P̂ ← M̂.action_prob(s)
26:     T̂W.add((s, P̂))
27:     s ← env.step(max_a(P̂))
28:   end while
29:   Ŵ.add(T̂W)
30: end for
31: Return M̂, Ŵ

where θ denotes the parameters of M and lr is the learning rate. The training process ends when the watermarked M achieves a score higher than a given threshold R in a validation environment. Once the watermark embedding process is completed, the APDs of the watermarked model for the damage-free states will have changed. Due to the randomness in the embedding process, the new APDs may differ from the designed ones. To identify the final embedded temporal watermarks, the model owner sends the damage-free states to the watermarked model M̂ and records the outputs for each state. For each watermark candidate, he collects the final sequence of states and the corresponding APDs from the protected model M̂ as a watermark. Finally, he obtains a temporal sequence T̂W = [(s_0, P̂_0), (s_1, P̂_1), ..., (s_{L−1}, P̂_{L−1})] that forms a unique temporal watermark for this protected model. The details of how Embed embeds watermarks into a DRL model are given in Algorithm 2. First, the model owner initializes a training buffer B (Line 1). For each damage-free state s in the watermark candidates, he randomly


samples a legal action other than the optimal one a* and adds the incentive reward to the original reward (Lines 2–5). Then, similar to the normal DRL training process, he collects training samples and adds them to the training buffer B. During the optimization process, the owner samples training data from B, computes the loss, and updates the parameters of M with gradient descent (Lines 8–18). Once the embedding process is finished, the owner queries the watermarked model M̂ with the damage-free states in TW and collects the final APDs. The states and the collected APD pairs are treated as the final temporal watermark T̂W (Lines 21–29).
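The first phase of Algorithm 2 (Lines 2–5) simply pre-loads the training buffer with the reshaped watermark transitions. A sketch under the same assumed interfaces is shown below; reward_fn stands for the environment's original reward and is an illustrative assumption.

import numpy as np

def build_watermark_buffer(candidates, reward_fn, eta=1.0, rng=np.random.default_rng()):
    """For every damage-free (state, APD) pair, pick a random non-optimal action a~
    and store (state, a~, shaped reward) for later training (Alg. 2, Lines 2-5)."""
    buffer = []
    for state, apd in candidates:
        best = int(np.argmax(apd))                  # optimal action a*
        others = [a for a in range(len(apd)) if a != best]
        a_tilde = int(rng.choice(others))           # substitute action a~
        r_tilde = reward_fn(state, a_tilde) + eta   # r^e(s, a~) = r(s, a~) + eta
        buffer.append((state, a_tilde, r_tilde))
    return buffer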

7.5.3 Ownership Verification

After stealing the watermarked model, an adversary may not use it directly but instead modify it with model transformation methods to fit his own requirements. Therefore, the embedded watermarks should be robust against such transformations, i.e., they should not be removable by them. We formally define the robustness of temporal watermarking for deep reinforcement learning models as follows:

Robustness Let d_{M̂,M′} be the distance between the APDs of the watermarked model M̂ and the transformed suspicious model M′ on the watermark states:

d_{M̂,M′} = (1/n) Σ_{i=0}^{n−1} Σ_{j=0}^{L−1} distance(P̂_{i,j}, P′_{i,j})    (7.11)

where P̂_{i,j} and P′_{i,j} are the APDs of M̂ and M′ on the watermark state s_{i,j}. Given a predefined threshold, if the value of d_{M̂,M′} is smaller than the threshold, we say that M̂ is robust against the model transformation.

During the verification stage, the owner only needs to feed the watermark states into the agent model to extract the watermarks. The observed subsequent state–APD pairs can then be used to check whether the actions match the temporal watermark T̂W. The verification process is presented in Algorithm 3. Since the policy of a DRL agent may be stochastic, the observed action may be sampled from the corresponding APD for each watermark state in T̂W. To reduce the randomness caused by sampling, we use statistical characteristics for the analysis: during verification, the owner runs the agent model several times on every watermark state s, collects the observed actions, and computes their empirical probability distribution. In this way, the temporal sequence TW′ = [(s_0, P′_0), ..., (s_{L−1}, P′_{L−1})] can be obtained. After that, the distance between T̂W and TW′ is calculated for the similarity comparison. Because the states in both sequences are the same, only the distance between the APDs needs to be considered. Here, the Kullback–Leibler (KL) divergence [13], a popular measure for comparing distributions, is adopted. The distance between P̂_i and P′_i can thus be computed as

d_{s_i} = Σ_a p̂_{i,a} log (p̂_{i,a} / p′_{i,a})    (7.12)

where p̂_{i,a} and p′_{i,a} are the probabilities of selecting action a under the distributions P̂_i and P′_i, respectively. Based on this distance between two distributions, the distance d_{T̂W,TW′} between T̂W and TW′ is defined as the accumulated distance of all the corresponding APDs:

d_{T̂W,TW′} = Σ_{i=0}^{L−1} d_{s_i}    (7.13)

We take the average distance over all watermarks in Ŵ as the distance d_{M̂,M′} between the watermarked model M̂ and the suspicious model M′:

d_{M̂,M′} = (1/n) Σ_{T̂W ∈ Ŵ} d_{T̂W,TW′}    (7.14)

Algorithm 3 Verify: Extracting the embedded watermarks from a suspicious DRL model M′
Input: Watermark dataset Ŵ, distance threshold τ
Output: Verification result IsWatermarked
1: for each T̂W ∈ Ŵ do
2:   for each (s_i, P̂_i), i = 0, ..., L−1, in T̂W do
3:     Run the agent on s_i and calculate the APD P′_i
4:     d_{s_i} ← Σ_a p̂_{i,a} log (p̂_{i,a} / p′_{i,a})
5:   end for
6:   d_{T̂W,TW′} ← Σ_{i=0}^{L−1} d_{s_i}
7: end for
8: d_{M̂,M′} ← (1/n) Σ_{T̂W ∈ Ŵ} d_{T̂W,TW′}
9: if d_{M̂,M′} ≤ τ then
10:   IsWatermarked = True
11: else
12:   IsWatermarked = False
13: end if
14: Return IsWatermarked

Finally, to verify the existence of the watermarks, the owner only needs to compare the value of d_{M̂,M′} with a predefined distance threshold τ.

In this section, we gave a detailed description of the temporal watermark concept for protecting DRL models and proposed a watermarking scheme that selects suitable watermark candidates, embeds the candidates into DRL models, and verifies the ownership of protected models. The proposed temporal watermarking approach is universal and exhibits very high accuracy and a low error rate in the verification process under both deterministic and stochastic DRL environments and tasks.
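To make the verification rule concrete, the sketch below computes the KL-based distances of Eqs. (7.12)–(7.14) for one suspicious model and applies the threshold test. The data layout (lists of per-state APD arrays) and the small smoothing constant are illustrative assumptions.

import numpy as np

def kl_divergence(p_hat, p_prime, eps=1e-12):
    """Eq. (7.12): KL divergence between the watermark APD and the observed APD."""
    p_hat = np.asarray(p_hat) + eps
    p_prime = np.asarray(p_prime) + eps
    return float(np.sum(p_hat * np.log(p_hat / p_prime)))

def verify(watermarks, observed, tau):
    """Eqs. (7.13)-(7.14): sum per-state distances within each sequence, average
    over all watermarks, and compare against the threshold tau.
    watermarks[i][j] and observed[i][j] are the APDs for state s_{i,j}."""
    seq_dists = []
    for tw_hat, tw_obs in zip(watermarks, observed):
        seq_dists.append(sum(kl_divergence(p, q) for p, q in zip(tw_hat, tw_obs)))
    d_model = float(np.mean(seq_dists))
    return d_model <= tau          # True: the suspicious model carries the watermark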

7.6 Discussion

In this section, we discuss possible limitations of the temporal watermarking scheme for deep reinforcement learning models.

Stochasticity in DRL Considering the high stochasticity of MDP tasks and DRL models, two main problems may affect the verification accuracy. First, for the same state s_i, a stochastic DRL model may output different actions a_i sampled from the action probability distribution P_i. Second, even for deterministic DRL models, which always give the same action for a state s_i, the MDP environment may transition to different next states s_{i+1} because of the state transition probability. Therefore, during the verification process, even if we collect many actions for each state in the temporal watermark, the collected state–APD sequences of two tests on the same DRL model may still differ. Thus, for different DRL environments, we need to design the verification thresholds carefully; otherwise, the verification accuracy may drop significantly.

Robustness Against Model Compression and Fine-Tuning After stealing the model, the adversary may not use it directly but transform it, either to adapt the model to his own dataset or scenario or simply to evade plagiarism checks. Therefore, the watermarks should be robust against model transformations such as fine-tuning [28] and model compression [8], two frequently used transformation approaches. Whether applied intentionally or not, fine-tuning [28] seems to be the most practical type of attack, since it offers clear benefits: the adversary can take a well-trained model as the initial weights to train a new model on his own data, which reduces computational cost and can achieve higher performance than training from scratch. Moreover, the parameters of a protected model change after the fine-tuning process, so a good watermarking method should be robust against fine-tuning attacks. Model compression [8] plays an important role in deploying deep neural networks since it reduces the number of parameters, the memory requirements, and the computational cost. One of the most common approaches is model quantization [10], which compresses DL models by reducing the precision of the parameters of the target model. Model pruning, another type of model compression, cuts redundant parameters of the target model while maintaining similar performance on the original task. If the parameters carrying the watermarks are pruned, it will no longer be possible to identify the existence of the


watermarks. In this case, we cannot verify the ownership of a suspicious model. Therefore, it is important to design watermarking schemes that are more robust against model transformation attacks.

Ambiguity Attacks For deep learning watermarking methods, in addition to removal attacks, there are ambiguity attacks. An adversary may embed another watermark into the protected model. The real owner then cannot establish ownership with the original watermark, since both the adversary and the owner can verify their own watermarks in the suspicious model. In a more severe case, the original watermark may even be overwritten by the adversary's new watermark.

Application to Partially Observable MDP (POMDP) Tasks Different from supervised deep learning models, there are large gaps between different reinforcement learning tasks, so it is challenging to develop a unified watermarking scheme for various DRL tasks. Our temporal watermark scheme makes a first attempt to address this challenge. However, the proposed watermarking scheme only performs well in MDP environments; it may become ineffective when full information about the environment states and actions is not available (i.e., in POMDP tasks [4]). To adapt our watermarking scheme to this situation, we can replace the original states and actions with the observable states and actions as watermarks, and more watermark sequences can be adopted to increase fidelity. However, considering the loss of some information in the original states, the verification accuracy will still be lower than in fully observable MDP tasks.

7.7 Conclusion

In this chapter, we focused on the intellectual property protection of DRL models using watermarking techniques. DRL models can follow two types of policies, deterministic and stochastic, and we therefore proposed a novel temporal watermarking scheme that can be used for both. Existing watermarking schemes usually use spatial triggers, perturbations, or out-of-distribution states as watermarking or verification samples; however, these methods cannot be applied to reinforcement learning scenarios directly due to the special characteristics of DRL tasks and models. To this end, we carefully search for damage-free states as watermark candidates, which preserves the performance of the watermarked DRL models. During the watermark verification phase, the carefully crafted temporal watermark makes the protected DRL models uniquely distinguishable. In conclusion, our temporal watermarking scheme for DRL models meets the functionality-preserving, damage-free, and imperceptible requirements and performs well under different RL environments. We also discussed the limitations of the temporal watermark scheme, including verification accuracy, ambiguity, and robustness.


Acknowledgments This work is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-PhD-2021-08-023[T]).

References 1. Adi, Y., Baum, C., Cisse, M., Pinkas, B., Keshet, J.: Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In: USENIX Security Symposium, pp. 1615–1631 (2018) 2. Aiken, W., Kim, H., Woo, S.: Neural network laundering: Removing black-box backdoor watermarks from deep neural networks. Preprint (2020). arXiv:2004.11368 3. Behzadan, V., Hsu, W.: Sequential triggers for watermarking of deep reinforcement learning policies. Preprint (2019). arXiv:1906.01126 4. Cassandra, A.R.: A survey of pomdp applications. In: Working notes of AAAI 1998 Fall Symposium on Planning with Partially Observable Markov Decision Processes, vol. 1724 (1998) 5. Chen, X., Wang, W., Bender, C., Ding, Y., Jia, R., Li, B., Song, D.: REFIT: A unified watermark removal framework for deep learning systems with limited data. Preprint (2019). arXiv:1911.07205 6. Cox, I.J., Kilian, J., Thomson Leighton, F., Shamoon, T.: Secure spread spectrum watermarking for multimedia. IEEE Trans. Image Process. 6(12), 1673–1687 (1997) 7. Guo, S., Zhang, T., Qiu, H., Zeng, Y., Xiang, T., Liu, Y.: The hidden vulnerability of watermarking for deep neural networks. Preprint (2020). arXiv:2009.08697 8. Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with limited numerical precision. In: International Conference on Machine Learning, pp. 1737–1746 (2015) 9. Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P., Levine, S.: Composable deep reinforcement learning for robotic manipulation. In: IEEE International Conference on Robotics and Automation, pp. 6244–6251 (2018) 10. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. Preprint (2015). arXiv:1510.00149 11. Huang, S., Papernot, N., Goodfellow, I., Duan, Y., Abbeel, P.: Adversarial attacks on neural network policies. Preprint (2017). arXiv:1702.02284 12. Kiourti, P., Wardega, K., Jha, S., Li, W.: TrojDRL: Trojan attacks on deep reinforcement learning agents (2019) 13. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951) 14. Learning to Drive in a Day: https://wayve.ai/blog/learning-to-drive-in-a-day-withreinforcement-learning (June 2018). Accessed 09 Oct 2020 15. Le Merrer, E., Perez, P., Trédan, G.: Adversarial frontier stitching for remote neural network watermarking. Neural Comput. Appl., 1–12 (2019) 16. Levine, S., Finn, C., Darrell, T., Abbeel, P.: End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. (2016) 17. Li, Z., Hu, C., Zhang, Y., Guo, S.: How to prove your model belongs to you: A blind-watermark based framework to protect intellectual property of DNN. In: Annual Computer Security Applications Conference, pp. 126–137 (2019) 18. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. Preprint (2015). arXiv:1509.02971 19. Luo, B., Liu, D., Wu, H.-N., Wang, D., Lewis, F.L.: Policy gradient adaptive dynamic programming for data-based optimal control. IEEE Trans. Cybern. 47(10), 3341–3354 (2016) 20. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)


21. Namba, R., Sakuma, J.: Robust watermarking of neural network with exponential weighting. In: ACM Asia Conference on Computer and Communications Security, pp. 228–240 (2019) 22. Ng, A.Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., Liang, E.: Autonomous inverted helicopter flight via reinforcement learning. In: Experimental Robotics IX, pp. 363–372. Springer (2006) 23. Pattanaik, A., Tang, Z., Liu, S., Bommannan, G., Chowdhary, G.: Robust deep reinforcement learning with adversarial attacks. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 2040–2042 (2018) 24. Rouhani, B.D., Chen, H., Koushanfar, F.: DeepSigns: An end-to-end watermarking framework for protecting the ownership of deep neural networks. In: ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2019) 25. Russo, A., Proutiere, A.: Optimal attacks on reinforcement learning policies. Preprint (2019). arXiv:1907.13548 26. Safe, Multi-agent, Reinforcement Learning for Autonomous Driving: https://www.mobileye. com/our-technology/driving-policy/ (2020). Accessed 09 Oct 2020 27. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. Preprint (2017). arXiv:1707.06347 28. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. Preprint (2014). arXiv:1409.1556 29. Sun, J., Zhang, T., Xie, X., Ma, L., Zheng, Y., Chen, K., Liu, Y.: Stealthy and efficient adversarial attacks against deep reinforcement learning. Preprint (2020). arXiv:2005.07099 30. Sutton, R.S.: Learning to predict by the methods of temporal differences. Machine Learning 3(1), 9–44 (1988) 31. Szyller, S., Atli, B.G., Marchal, S., Asokan, N.: Dawn: Dynamic adversarial watermarking of neural networks. Preprint (2019). arXiv:1906.00830 32. Tesauro, G.: Temporal difference learning and td-gammon. Commun. ACM 38(3), 58–68 (1995) 33. Uchida, Y., Nagai, Y., Sakazawa, S., Satoh, S.: Embedding watermarks into deep neural networks. In: ACM on International Conference on Multimedia Retrieval, pp. 269–277 (2017) 34. Vamvoudakis, K.G., Lewis, F.L., Hudas, G.R.: Multi-agent differential graphical games: Online adaptive learning solution for synchronization with optimality. Automatica 48(8), 1598–1611 (2012) 35. Wang, Y., Sarkar, E., Maniatakos, M., Jabari, S.E.: Stop-and-Go: Exploring backdoor attacks on deep reinforcement learning-based traffic congestion control systems (2020) 36. Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., de Freitas, N.: Sample efficient actor-critic with experience replay. Preprint (2016). arXiv:1611.01224 37. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4), 229–256 (1992) 38. Wu, Y., Mansimov, E., Grosse, R.B., Liao, S., Ba, J.: Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In: Advances in Neural Information Processing Systems, pp. 5279–5288 (2017) 39. Zhang, H., Jiang, H., Luo, C., Xiao, G.: Discrete-time nonzero-sum games for multiplayer using policy-iteration-based adaptive dynamic programming algorithms. IEEE Trans. Cybern. 47(10), 3331–3340 (2016) 40. Zhang, H., Su, H., Zhang, K., Luo, Y.: Event-triggered adaptive dynamic programming for non-zero-sum games of unknown nonlinear systems via generalized fuzzy hyperbolic models. IEEE Trans. Fuzzy Syst. 27(11), 2202–2214 (2019) 41. 
Zhang, J., Gu, Z., Jang, J., Wu, H., Stoecklin, M.Ph., Huang, H., Molloy, I.: Protecting intellectual property of deep neural networks with watermarking. In: ACM Asia Conference on Computer and Communications Security, pp. 159–172 (2018) 42. Zhu, Y., Mottaghi, R., Kolve, E., Lim, J.J., Gupta, A., Fei-Fei, L., Farhadi, A.: Targetdriven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364. IEEE (2017)

Chapter 8

Ownership Protection for Image Captioning Models

Jian Han Lim

Abstract The development of digital watermarking for machine learning models has focused almost exclusively on classification tasks, while other tasks have been overlooked. In this chapter, we demonstrate that image captioning tasks, generally considered among the most difficult AI challenges, cannot be adequately protected by present digital watermarking architectures. To safeguard the image captioning model, we propose two distinct embedding strategies that operate on the recurrent neural network's hidden memory state. We demonstrate through empirical evidence that a forged key results in an unusable image captioning model, defeating the purpose of infringement. To the best of our knowledge, this is the first attempt to propose ownership protection for image captioning models. The effectiveness of our proposed approach in withstanding different attacks without compromising the original image captioning performance has been demonstrated by experiments on the MS-COCO and Flickr30k datasets.

8.1 Introduction

Artificial Intelligence (AI) is one of the most prominent topics in technology, owing to the high-performance models achieved by deep learning algorithms. The technique has been widely adopted in both research and real-world environments to solve tasks at different levels, such as translation, speech recognition, object detection, and image captioning. Deep Neural Networks (DNNs) are expensive and resource-intensive to develop and train, and require significant time when large datasets are involved. Therefore, it is critical to protect DNN ownership from IP infringement, which enables owners to gain recognition or benefit from the DNNs they develop and train. IP protection of DNNs through watermarking [1, 6, 9, 12, 13, 21, 31, 39] has been a prominent research field over the past few years, defending models against different invasions such as removal



attacks and ambiguity attacks. Watermarking is widely used in digital media: for example, in images, a watermark (text or image) is embedded somewhere in the image to claim copyright without harming the original content. However, the majority of watermarking techniques suggested so far are appropriate only for the classification task of mapping images to labels, neglecting other tasks such as image captioning, which generates sentences from images. Watermarking techniques designed for classification are not suitable for image captioning because of several fundamental differences that present challenges. First, classification models predict a label, whereas image captioning models produce a sentence. Second, while classification is concerned with determining the decision boundaries between classes, image captioning requires an in-depth comprehension of the image content beyond category and attribute levels, as well as a language model to produce a natural sentence. In Fig. 8.4 and Sect. 8.6.1, we show how a recent digital watermarking framework [12], previously used to safeguard classification models, fails to safeguard image captioning models. We propose a key-based technique that provides timely, preventive, and reliable ownership protection for image captioning, redefining the paradigm of digital watermarking IP protection for DNNs. To safeguard the image captioning model, we propose a novel IP protection framework that comprises two distinct strategies for embedding a key into the recurrent neural network's hidden memory state [17]. We show that embedding the key into the hidden memory state is the best option for the image captioning problem, since a forged key results in an unusable image captioning model, negating the intent of infringement. To the best of our knowledge, this is the first attempt to propose ownership protection for image captioning models. Experiments on the Flickr30k and MS-COCO datasets demonstrate that our approach withstands different attacks without compromising the original image captioning performance.

8.2 Related Works

In this section, we briefly review related work in two fields: image captioning and digital watermarking in DNN models.

8.2.1 Image Captioning

Image captioning is the task of automatically generating a sentence for a given image. It has become a mission-critical task of growing interest, involving taking images, analyzing their visual content, and generating textual descriptions.


Image captioning is a multimodal problem involving natural language processing and computer vision. Sequence learning techniques, which use CNN and RNN/Transformer models to produce novel sentences with flexible syntactic structures, are the dominant paradigm in modern image captioning. State-of-the-art models mostly follow the encoder–decoder framework and can be divided into CNN–RNN-based models [11, 19, 23, 27, 34, 37] and CNN–Transformer-based models [8, 16, 22, 26], where a CNN acts as an encoder to extract useful visual features from an image and an RNN/Transformer acts as a decoder to generate the caption. For instance, Vinyals et al. [34] proposed an end-to-end neural network architecture that utilizes an LSTM to generate a sentence for an image. Furthermore, attention models are widely used in image captioning to focus on salient objects when generating words, bringing significant improvements on most evaluation metrics. Xu et al. [37] added soft and hard attention mechanisms to the CNN–LSTM framework to automatically focus on salient objects when generating corresponding words. This introduces visual attention to the LSTM mechanism and improves performance over traditional methods. More recently, researchers have begun to explore the use of transformers in image captioning, introducing CNN–Transformer-based models. The Transformer architecture [32] uses dot-product attention to relate semantic information implicitly. In image captioning, Herdade et al. [16] proposed the object relation network, which injects relative spatial attention into the dot-product attention; it is a simple position encoding as proposed in the original transformer. Li et al. [22] introduced the entangled transformer model with dual parallel transformers to encode and refine visual and semantic information in images, fused by a gated bilateral controller. Although CNN–Transformer-based models have achieved more promising results, they suffer from slower inference speed, which hinders their adoption in many real-time applications. Therefore, we focus our IP protection on CNN–RNN-based image captioning models, which can be applied in many applications.

8.2.2 Digital Watermarking in DNN Models

Digital watermarking in DNN models can be divided into three categories: black-box solutions [1, 13, 21, 29, 39], white-box solutions [6, 31], and combinations of white-box and black-box solutions [9, 12]. The main difference is that a white-box solution assumes the owner has full access to all parameters of the suspect model, as the watermark is embedded into the model parameters during the training phase. In a black-box solution, the watermark is embedded into the dataset labels; the owner is assumed to have access only to the API of the remote suspect model and sends queries to obtain the predicted labels for ownership verification.


Uchida et al. [31] introduced the first white-box digital watermarking method for DNNs, using a parameter regularizer to embed a watermark into the model weights throughout training. The model weights must be accessible to the owners in order to extract the watermark during ownership verification. Zhang et al. [39], Adi et al. [1], Guo and Potkonjak [13], Merrer et al. [21], and Quan et al. [29] proposed black-box solutions that use trigger-set training to embed the watermark into the target model. Without access to the model parameters, the trigger-set watermark can be recovered remotely during ownership verification. For instance, Zhang et al. [39] introduced three different key generation methods: noise-based, content-based, and irrelevant-image-based. Merrer et al. [21] suggested altering the model's decision boundaries by using adversarial examples as the watermark key set. By taking advantage of the over-parameterization of models for image tasks such as super-resolution and image denoising, Quan et al. [29] established a black-box watermarking technique for image processing networks. Adi et al. [1] introduced a watermarking technique similar to [39], but their primary addition was the model verification procedure. Rouhani et al. [9] and Fan et al. [12] proposed watermarking techniques that work in both black-box and white-box scenarios. Rouhani et al. [9] used two extra regularization loss terms, a Gaussian mixture model (GMM) agent loss and a binary cross-entropy loss, to incorporate watermarks into the activations of chosen layers of a DNN. It is resistant to pruning, fine-tuning, and overwriting attacks, although it takes longer to compute. Fan et al. [12] extended the DNN model with a passport layer for ownership verification, which is the approach closest to ours. The model's performance drops dramatically when a forged passport is used. Due to the secretive nature of the passport layer weights, this concept requires the owner to conceal those weights from the attacker. However, we show empirically that [12] cannot adequately protect image captioning models.

8.3 Problem Formulation

In this section, we first describe the image captioning framework that we will empower with ownership protection. Next, we state a proposition, followed by its proof and a corollary. Lastly, we formulate IP protection for the image captioning model.

8.3.1 Image Captioning Model

As a fast-growing research field, image captioning has several different architectures, most of which follow the encoder–decoder framework. The main differences lie in the design of the encoder layer and the technique used in the decoder


layer. For example, RNNs, LSTMs, and transformers are widely used as the decoder layer of image captioning models. In this chapter, we focus on IP protection for the popular Show, Attend, and Tell framework, which serves as the foundation for later cutting-edge works on image captioning [3, 5, 10, 15, 18, 19, 30, 35, 36]. It uses an encoder–decoder architecture, where an image is encoded into a fixed-size representation through a convolutional neural network (CNN) and a caption is generated by a long short-term memory (LSTM) network. Here, we employ the LSTM [7] as the RNN cell. Using a CNN, an image $I$ is embedded into a $K$-dimensional vector $X$ as follows:

$$X = F_c(I) \tag{8.1}$$

where $X$ is the $K$-dimensional image feature and $F_c(\cdot)$ is the CNN encoder. Given a dictionary $\mathcal{S}$ and a word embedding $W_e \in \mathbb{R}^{K \times V}$, each word $Z \in \mathcal{S}$ is encoded into a $K$-dimensional vector $Y$. Therefore, the vectors $Y_0, \cdots, Y_M$ correspond to the words $Z_0, \cdots, Z_M$ in an image description. To produce the likelihood of the next word $Z_t$, the decoder feeds the $K$-dimensional image feature $X$ into the LSTM at each time step:

$$h_t = LSTM(X, h_{t-1}, m_{t-1}) \tag{8.2}$$

$$p(Z_t \mid Z_0, \cdots, Z_{t-1}, I) = F_1(h_t) \tag{8.3}$$

where $m_{t-1}$ is the previous memory cell state, $h_{t-1}$ is the previous hidden state of the LSTM, and $p$ is the likelihood of the next word $Z_t$, produced by a nonlinear function $F_1(\cdot)$ given the previous words $Z_0, \cdots, Z_{t-1}$ and image $I$. Generally, image captioning models are trained by maximum likelihood estimation, which directly maximizes the likelihood of producing a caption of length $T$ with tokens $\{Z_0, \cdots, Z_{T-1}\}$ that describes the image correctly:

$$\log p(Z \mid I) = \sum_{t=0}^{T} \log p(Z_t \mid I, Z_{0:t-1}, c_t) \tag{8.4}$$

where $p(Z_t \mid I, Z_{0:t-1}, c_t)$ is the likelihood of generating the next word given the context vector $c_t$, the previous words $Z_{0:t-1}$, and the image $I$, and $t$ is the time step. At training time, $p(Z \mid I)$ is modeled with an RNN and $(Z, I)$ is a training pair, where the memory $h_t$ expresses the variable number of words conditioned upon up to $t-1$. A nonlinear function $f$ updates this memory whenever a new input is received:

$$h_{t+1} = f(h_t, x_t) \tag{8.5}$$

where $x_t = W_e Z_t$ and $x_{-1} = CNN(I)$, with $W_e$ the word embedding. Usually, $f$ is an LSTM network.

Proposition 1 Consider an RNN with $i$ units. The image embedding vector is used to initialize the hidden state and memory cell state of the RNN as follows:

$$m_{t=-1} = 0, \qquad h_{t=-1} = W_I I_{embed} \tag{8.6}$$

where $h$ is the hidden state, $m$ is the memory cell state, and $W_I \in \mathbb{R}^{r \times h}$ is a weight matrix. The previous word embedding is concatenated with the context vector $c_t$ to act as an input to the RNN. The hidden state is then used to generate a probability distribution $p_t$ over the vocabulary:

$$
\begin{aligned}
p_t &= \mathrm{Softmax}(W_o h_t) \\
h_t, m_t &= \mathrm{RNN}(x_t, h_{t-1}, m_{t-1}) \\
x_t &= [W_w Z_{t-1}, c_t] \\
c_t &= \mathrm{SoftAttention}(F)
\end{aligned} \tag{8.7}
$$

where $p_t$ is the probability distribution over the vocabulary $V$, $c_t \in \mathbb{R}^a$ is the context vector, $Z_{t-1} \in \mathbb{R}^q$ is the vector of the previous word, $F$ is the CNN feature map, $[\,,\,]$ is the concatenation operator, and $W_w \in \mathbb{R}^{q \times v}$ and $W_o \in \mathbb{R}^{v \times i}$ are the input and output embedding matrices, respectively (see [17] for the proof).

Proposition 2 Let $K$ and $H$ be vector spaces over the same field $J$. A function $f: K \to H$ is said to be a linear map if for any two vectors $d, u \in K$ and any scalar $b \in J$, the following two conditions are satisfied:

$$f(d + u) = f(d) + f(u)$$

$$f(bd) = b f(d)$$

Now assume $K$ is the secret key and $H \in \{h_{t-1}, h_t, \cdots, h_L\}$ is the hidden states of the RNN. Then the performance of the RNN, $p_t$, depends on the knowledge of $K$. That is to say, without the correct key, the hidden state will affect the forget, input, and output gate operations in the RNN, causing the RNN cell to memorize the wrong information and deteriorating the overall model performance (see Sect. 8.3.2 for the proof).

Corollary For any given RNN, the performance of the image captioning model is intact, and specifically the output of the RNN can be reconstructed uniquely from the network's gradients, if Propositions 1 and 2 are fulfilled.
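To make the decoding equations concrete, the following is a minimal PyTorch sketch of Eqs. (8.6)–(8.7); the module and parameter names (CaptionDecoder, init_h, etc.) are illustrative assumptions and not taken from the chapter's implementation.

```python
# A minimal sketch of the captioning decoder step: the image feature initializes the
# LSTM hidden state (Eq. 8.6), and at each step the previous word embedding and an
# attention context vector are concatenated and fed to an LSTMCell, whose hidden
# state produces a distribution over the vocabulary (Eq. 8.7).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # W_w
        self.init_h = nn.Linear(feat_dim, hidden_dim)             # W_I (Eq. 8.6)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)           # soft attention scores
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim) # RNN(x_t, h, m)
        self.out = nn.Linear(hidden_dim, vocab_size)              # W_o (Eq. 8.7)

    def forward(self, feats, captions):
        # feats: (B, L, feat_dim) CNN feature map; captions: (B, T) word indices
        B, L, _ = feats.shape
        h = self.init_h(feats.mean(dim=1))            # h_{-1} = W_I * I_embed
        m = torch.zeros_like(h)                       # m_{-1} = 0
        logits = []
        for t in range(captions.size(1) - 1):
            # c_t = SoftAttention(F): weight each spatial feature by the current h
            scores = self.attn(torch.cat([feats, h.unsqueeze(1).expand(-1, L, -1)], dim=-1))
            alpha = F.softmax(scores, dim=1)          # (B, L, 1)
            c_t = (alpha * feats).sum(dim=1)          # context vector
            x_t = torch.cat([self.embed(captions[:, t]), c_t], dim=-1)  # [W_w Z_{t-1}, c_t]
            h, m = self.lstm(x_t, (h, m))
            logits.append(self.out(h))                # softmax(W_o h_t) via cross-entropy
        return torch.stack(logits, dim=1)
```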


8.3.2 Proof of Proposition 2

Given a single-hidden-layer RNN, the forward pass of an input at time step $t$ is

$$z_h = W_{hx} x + W_{hh} h + b_h \tag{8.8}$$

where $W_{hx}$ is the weight matrix between the input and hidden layers, $W_{hh}$ is the weight matrix of the hidden layer, and $b_h$ is the bias term. The activation is

$$h = \sigma_h(z_h) \tag{8.9}$$

where $\sigma$ is an activation function, in this case ReLU. Then, the forward pass of an output at time step $t$ is

$$y = \sigma_y(z_y) \tag{8.10}$$

where $z_y = W_{yh} h + b_y$ and $W_{yh}$ is the weight matrix between the hidden and output layers. Therefore, backpropagation through time is

$$L = \sum_{t=1}^{T} L^{(t)}$$

$$\frac{\partial L^{(t)}}{\partial W_{hh}} = \sum_{k=1}^{t} \frac{\partial L^{(t)}}{\partial y^{(t)}} \frac{\partial y^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial h^{(k)}} \frac{\partial h^{(k)}}{\partial W_{hh}} \tag{8.11}$$

where $\frac{\partial h^{(t)}}{\partial h^{(k)}} = \prod_{i=k+1}^{t} \frac{\partial h^{(i)}}{\partial h^{(i-1)}}$.

Now, assume that $K$ is a secret key and an embedding process $\mathcal{O}$ is

$$\mathcal{O}(h_{t-1}, K, o) = \begin{cases} h_{t-1} \oplus K, & \text{if } o = \oplus, \\ h_{t-1} \otimes K, & \text{else.} \end{cases} \tag{8.12}$$

The new forward pass of the RNN input and output at time step $t$ is now $\hat{z}_h = W_{hx} x + W_{hh} \hat{h} + b_h$ and $\hat{y} = \sigma_y(\hat{z}_y)$, respectively, where $\hat{h}$ is either $K \oplus h_{t-1}$ or $K \otimes h_{t-1}$. As such, for an incorrect key $\bar{k} \neq k$, it can be deduced that $\hat{y} \neq \bar{\hat{y}}$. The same holds for other RNN models such as the LSTM:

$$h = o_t \otimes \tanh(C) \tag{8.13}$$

where $o_t$ is the LSTM output gate for updating the values of the hidden units, represented as

$$o_t = \sigma(w_{ox} x + w_{oh} h + b_o) \tag{8.14}$$

and the LSTM cell state is

$$c = (C \otimes f_t) \oplus (i_t \otimes g_t) \tag{8.15}$$

where $f_t = \sigma(w_{fx} x + w_{fh} h + b_f)$ is the LSTM forget gate, $i_t = \sigma(w_{ix} x + w_{ih} h + b_i)$ is the LSTM input gate, and $g_t = \tanh(w_{gx} x + w_{gh} h + b_g)$ is the LSTM input node. The final output $\hat{o}_T \neq \bar{\hat{o}}_T$ if an incorrect key is used at inference with the LSTM model.
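The intuition behind Proposition 2 can be checked numerically. The following is a small, illustrative PyTorch sketch (not the chapter's code): applying the key to the hidden state as in Eq. (8.12) changes every downstream LSTM computation, so a forged key yields a different, degraded output.

```python
# Applying the embedding process O to the hidden state with a wrong key produces a
# different hidden state, and this error propagates through later time steps.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim, embed_dim = 8, 8
lstm = nn.LSTMCell(embed_dim, hidden_dim)

x = torch.randn(1, embed_dim)                  # current input x_t
h_prev = torch.randn(1, hidden_dim)            # previous hidden state h_{t-1}
m_prev = torch.zeros(1, hidden_dim)            # previous memory cell m_{t-1}

key_true = torch.randint(0, 2, (1, hidden_dim)).float() * 2 - 1   # entries in {-1, +1}
key_forged = -key_true                                            # an incorrect key

def step_with_key(key):
    # O(h_{t-1}, K, otimes): element-wise multiplication of the hidden state with the key
    h_hat = h_prev * key
    h, m = lstm(x, (h_hat, m_prev))
    return h

h_true = step_with_key(key_true)
h_forged = step_with_key(key_forged)
print(torch.allclose(h_true, h_forged))        # False: the hidden states diverge
```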

8.3.3 IP Protection on Image Captioning Model

The main goal of designing a protection framework for image captioning models is to ensure reliable, preventive, and timely IP protection at no additional cost. The protection framework must be effective against various attacks and demonstrate model ownership without deteriorating the performance of the original model. During authorized use, the protected model should exhibit almost the same performance as the original model. When an attacker tries to use the protected model in an unauthorized way, it should suffer a huge performance hit and fail to function properly.

An image captioning model ownership verification scheme for a given network $N()$ is defined as a tuple $\{\mathcal{G}, \mathcal{E}, \mathcal{V}_B, \mathcal{V}_W\}$ of processes, consisting of a generation process $\mathcal{G}()$, an embedding process $\mathcal{E}()$, a black-box verification process $\mathcal{V}_B()$, and a white-box verification process $\mathcal{V}_W()$. After training, $N[K]$ denotes the protected image captioning model with the secret key $K$ embedded, while $N()$ stands for the original unprotected image captioning model. A process $M$ that alters the model behavior in accordance with the running-time key $J$ describes the inference of the protected model:

$$M(N[K], J) = \begin{cases} M_K, & \text{if } J = K, \\ M_{\bar{K}}, & \text{otherwise,} \end{cases} \tag{8.16}$$

in which $\bar{K} \neq K$, $M_{\bar{K}}$ denotes the model performance with a forged key, and $M_K$ denotes the model performance with the correct key. To safeguard the models, $M(N[K], J)$ should have the following characteristics:

• The model performance $M_K$ should match the original model $N$ as closely as possible if $J = K$. More specifically, the protected model is said to be functionality-preserving if the performance disparity between $M_K$ and $N$ is lower than a predetermined threshold.
• The model performance $M_{\bar{K}}$ should be as different as possible from $M_K$ if $J \neq K$. The difference between $M_K$ and $M_{\bar{K}}$ is known as the protection strength. A small sketch of how this gap can be measured is given after Fig. 8.1.

Fig. 8.1 An overview of our proposed method: (a) Original LSTM Cell; (b) LSTM Cell with Embedded Key
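The following is an illustrative sketch (assumed evaluation loop, not the chapter's code) of how Eq. (8.16) and the protection strength could be checked empirically: evaluate the protected model once with the correct key and once with a forged key, and compare a caption metric such as CIDEr-D. The function and parameter names are hypothetical.

```python
def protection_strength(protected_model, eval_fn, correct_key, forged_key, dataset):
    """eval_fn(model, key, dataset) -> metric score (e.g., CIDEr-D); names are illustrative."""
    score_correct = eval_fn(protected_model, correct_key, dataset)   # M_K
    score_forged = eval_fn(protected_model, forged_key, dataset)     # M_K_bar
    return score_correct - score_forged   # larger gap = stronger protection
```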

8.4 Proposed Method

In this section, we first introduce the secret key generation process, followed by the proposed embedding operation for the image captioning model. Next, we discuss the ownership verification process, including black-box verification and white-box verification. An overview of the proposed method is shown in Fig. 8.1: the left figure is the original LSTM cell used in the decoder layer to generate the sentence, and we propose to embed the secret key into the LSTM cell via the hidden state as shown in Fig. 8.1b (see Sect. 8.4.2 for details).

8.4.1 Secret Key Generation Process

A secret key generation process $\mathcal{G}()$ generates target white-box watermarks $S_W$ with extraction parameters $\theta$ and black-box watermarks $S_B$ with trigger set $\mathcal{T}$. In our proposed approach, the white-box watermarks $S_W$ are the embedded key $K$ and the signature $G$:

$$\mathcal{G}() \to (S_W, \theta; S_B, \mathcal{T}). \tag{8.17}$$

A specific string given by the owner is converted to a binary vector to create the embedded key $K$, expressed in the form of $k_b$. However, we discovered that


there is only a 1-bit difference in the binary vector for highly similar alphanumeric characters, such as the strings C and A. Hence, to address this problem, a new transformation function $\mathcal{F}$ is used:

$$\mathcal{F}(C, E) = C \otimes E = k_b \tag{8.18}$$

where $C$ is sampled from the values $\{-1, 1\}$ according to a user-provided seed to form a binary vector. For the signature $G$, we follow [12] to generate the signature, where $G = \{g_n\}_{n=1}^{N}$ with $g_n \in \{-1, 1\}$. One significant distinction between our approach and that of [12], however, is that our signature is embedded in the output of the LSTM cell, i.e., the hidden state, instead of in the model weights. This is because we discovered how readily a channel permutation can modify a signature embedded in the model weights while maintaining the model's output. The trigger set $\mathcal{T}$ is a set of image–caption pairs that are mislabeled on purpose and then used to train the image captioning model along with the original training data. Here, a red color patch (noise) is added to the original image, and the caption is changed to a fixed sentence (e.g., "my protected model") that can be easily verified, generating the trigger set $\mathcal{T}$. A sketch of this generation process is given below.
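The following is a hedged sketch of the generation process $\mathcal{G}()$; the helper names are illustrative, and the exact construction of the owner-string vector $E$ is an assumption, since the chapter does not spell it out.

```python
# A seeded {-1, +1} vector C is combined with the owner string's binary form E to give
# the key k_b (Eq. 8.18), a {-1, +1} signature G is sampled, and trigger images receive
# a red patch with a fixed caption.
import numpy as np

def generate_key(owner_string, seed, dim=512):
    rng = np.random.default_rng(seed)
    bits = np.unpackbits(np.frombuffer(owner_string.encode(), dtype=np.uint8))
    E = np.resize(bits.astype(np.float32) * 2 - 1, dim)     # owner string as +/-1 vector (assumed form)
    C = rng.choice([-1.0, 1.0], size=dim)                   # seeded +/-1 vector
    return C * E                                            # k_b = C (x) E

def generate_signature(seed, dim=512):
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=dim)                # G = {g_n}, g_n in {-1, +1}

def make_trigger(image, caption="my protected model", patch=32):
    trig = image.copy()                                      # image: (H, W, 3) uint8 array
    trig[:patch, :patch] = (255, 0, 0)                       # red color patch (noise)
    return trig, caption                                     # fixed trigger caption
```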

8.4.2 Embedding Process

An embedding process $\mathcal{E}()$ embeds black-box watermarks $S_B$ and white-box watermarks $S_W$ into the model $N()$. An overview of the embedding process is shown in Fig. 8.2. In our proposed approach, the embedding process $\mathcal{E}()$ accepts training data $D = \{I, S\}$, white-box watermarks $S_W$ (secret key $K$ and signature $G$), black-box watermarks $S_B$ (trigger set $\mathcal{T}$), and model $N()$ to produce the protected model $N[W, G]()$:

$$\mathcal{E}\big(N() \mid (D; S_W, \theta; S_B, \mathcal{T})\big) \to N[W, G](). \tag{8.19}$$

Fig. 8.2 An overview of the embedding process $\mathcal{E}()$

The embedding process is an RNN learning process carried out during model training. It minimizes the given loss $L$ in order to optimize the model $N[W, G]()$. Two new key embedding operations $\mathcal{O}$ are introduced here: the element-wise addition model (M⊕) and the element-wise multiplication model (M⊗):

$$\mathcal{O}(h_{t-1}, K, o) = \begin{cases} h_{t-1} \oplus K, & \text{if } o = \oplus, \\ h_{t-1} \otimes K, & \text{else.} \end{cases} \tag{8.20}$$

We choose to embed the key into the LSTM cell's hidden state to ensure that the image captioning model has the best protection. Looking at an LSTM cell at time step $t_n$, let $h_{t_n}, c_{t_n}, o_{t_n}, i_{t_n}$, and $f_{t_n}$ represent the hidden state, memory cell state, output gate, input gate, and forget gate at that time step. The LSTM transition equations are as follows:

$$
\begin{aligned}
i_{t_n} &= \sigma(W_i x_{t_n} + U_i h_{t_n-1}) \\
o_{t_n} &= \sigma(W_o x_{t_n} + U_o h_{t_n-1}) \\
f_{t_n} &= \sigma(W_f x_{t_n} + U_f h_{t_n-1}) \\
u_{t_n} &= \tanh(W_u x_{t_n} + U_u h_{t_n-1}) \\
c_{t_n} &= i_{t_n} \odot u_{t_n} + f_{t_n} \odot c_{t_n-1} \\
h_{t_n} &= o_{t_n} \odot \tanh(c_{t_n}) \\
p_{t_n+1} &= \mathrm{softmax}(h_{t_n})
\end{aligned} \tag{8.21}
$$

Here, element-wise multiplication is represented by $\odot$ and the logistic sigmoid function by $\sigma$. $\{W_i, W_o, W_f, W_u, U_i, U_o, U_f, U_u\}$ are the LSTM parameters with dimension $\mathbb{R}^{K \times K}$. The memory cell stores the information processed up to the current time step in the unit's internal memory, and the extent of information that is forgotten, updated, and forward-propagated is intuitively controlled by each gating unit. Therefore, the memory cell of the unit is only partially visible in the hidden state due to the gates. The probability distribution of the generated word at each time step is equal to the conditional probability of the word given the image and previous words, $P(w_t \mid w_{1:t-1}, I)$. Additionally, the compositional vector representation of the word is based on the hidden state from the previous time step. This shows that the hidden state plays an important role in the image captioning


model; without the correct key, the hidden state affects the forget, input, and output gate operations in the LSTM cell and degrades the model performance. To strengthen our model further, we incorporate the sign loss regularization term [12] into the loss function:

$$L_g(G, h, \gamma) = \sum_{n=1}^{N} \max(\gamma - h_n g_n, 0) \tag{8.22}$$

where the designated binary bits for the hidden state $h$ are represented as $G = \{g_n\}_{n=1}^{N}$ with $g_n \in \{-1, 1\}$. We introduce a hyperparameter $\gamma$ into the sign loss to ensure that the magnitude of the hidden state is larger than zero. Throughout the training process, the signature is thus embedded into the LSTM cell's hidden state.
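The following is a minimal PyTorch sketch of the embedding process in this section; the module and function names are illustrative and not from the chapter's code.

```python
# The secret key is applied to the previous hidden state (Eq. 8.20) before every LSTM
# step, and the sign loss of Eq. (8.22) pushes the signs of the hidden units toward the
# signature G during training.
import torch
import torch.nn as nn

class KeyedLSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim, key, op="mul"):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim, hidden_dim)
        self.register_buffer("key", key)      # secret key K, shape (hidden_dim,)
        self.op = op                          # "add" -> M_plus, "mul" -> M_times

    def forward(self, x_t, h_prev, m_prev):
        # O(h_{t-1}, K, o): element-wise addition or multiplication with the key
        h_hat = h_prev + self.key if self.op == "add" else h_prev * self.key
        return self.cell(x_t, (h_hat, m_prev))

def sign_loss(h, signature, gamma=0.1):
    # Eq. (8.22): hinge loss encouraging sign(h_n) == g_n with margin gamma
    return torch.clamp(gamma - h * signature, min=0).sum(dim=-1).mean()

# Hypothetical training objective: total loss = captioning cross-entropy + sign loss
# loss = caption_loss + sign_loss(h_t, signature)
```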

8.4.3 Verification Process

An overview of our verification process is shown in Fig. 8.3. Both black-box and white-box verification processes are proposed in this chapter. A black-box verification process $\mathcal{V}_B()$ checks whether the model $N()$ makes inferences specifically for the triggers $\mathcal{T}$:

$$\mathcal{V}_B(N, S_B \mid \mathcal{T}). \tag{8.23}$$

Fig. 8.3 A verification process $\mathcal{V}$ takes as input either an image $I$ or a trigger set $\mathcal{T}$ and outputs the result used to verify ownership


A white-box verification process $\mathcal{V}_W()$ accesses the model parameters $W$ to extract the white-box watermarks $\tilde{S}_W$ and compares $\tilde{S}_W$ with $S_W$:

$$\mathcal{V}_W(W, S_W \mid \theta). \tag{8.24}$$

In this chapter, three verification methods are proposed: (1) trigger set verification, (2) signature-based verification, and (3) secret key-based verification. Signature-based verification and secret key-based verification are white-box verification processes, while trigger set verification is a black-box verification process.

Black-box verification process:
• Trigger set verification—trigger set verification is accomplished remotely through API requests. An image captioning model is first trained using the original training samples together with a set of trigger image–caption pairs. To train the model to produce the trigger sentence, we add a red color patch (noise) to the original image and change the caption to a fixed sentence (e.g., "my protected model"). During the verification stage, the owner sends the trigger image to the model and checks that it returns the trigger sentence.

White-box verification process:
• Signature verification—signature verification is accomplished during training by embedding a unique signature, via sign loss regularization, into the sign of the hidden state. To validate the signature, the owner requires access to the trained model and sends an image to generate a sentence. The sign of the LSTM cell's hidden state is extracted during inference and matched against the original signature to verify ownership. The owner's name or another human-readable text can be recovered from this binary signature.
• Secret key-based verification—there are two variants, depending on whether the secret key is private or public. In the first, the client receives both the public key and the trained model; the model input must include the public key to guarantee that the model performance is maintained during inference, and the provided key can be used directly to verify who owns the model. In the second, the private key is embedded into the model, and only images are needed as model input for inference; however, ownership verification requires accessing the model and extracting the key from the LSTM cell.

A sketch of the signature and trigger-set checks is given below.
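The following is an illustrative PyTorch-style sketch (not the chapter's code) of the two checks: matching the signs of the LSTM hidden state against the signature $G$, and querying a remote model with a trigger image. The `model_api` endpoint is a hypothetical placeholder.

```python
import torch

def signature_detection_rate(hidden_state, signature):
    # hidden_state: (hidden_dim,) extracted during inference; signature: {-1,+1} vector
    matches = (torch.sign(hidden_state) == torch.sign(signature)).float()
    return matches.mean().item()          # fraction of matching sign bits

def trigger_set_verification(model_api, trigger_image, trigger_caption="my protected model"):
    # model_api is assumed to be a remote captioning endpoint returning a sentence
    return model_api(trigger_image).strip().lower() == trigger_caption
```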


8.5 Experiment Settings

8.5.1 Metrics and Dataset

We use the Flickr30k [38] and MS-COCO [25] datasets, which are commonly used for the image captioning task, to train and test our methods. For both datasets, we apply the split commonly used in [19]. MS-COCO has 113,287 training images, each with five human-annotated sentences, and the test and validation sets each contain 5000 images. Flickr30k has 1000 images for testing, 1000 for validation, and the remaining 10,000 for training. All words are lowercased, and captions longer than 20 words are truncated. The vocabulary size for both datasets is fixed at 10,000 words. The common evaluation metrics for image captioning are used to evaluate our method: BLEU [28], CIDEr-D [33], METEOR [4], SPICE [2], and ROUGE-L [24]. It is customary to report all of the above metrics in image captioning, even though SPICE and CIDEr-D correlate better with human judgments than ROUGE and BLEU [2, 33].

8.5.2 Configurations

The encoder is ResNet-50 [14], pretrained on the ImageNet dataset. ResNet-50 extracts the image features, producing outputs of dimension 7 × 7 × 2048. A 30% dropout rate is used for the LSTM decoder. The hidden state and word embedding sizes are both set to 512. The CNN is fine-tuned with a learning rate of 0.00001 for up to 20 epochs, while the LSTM decoder is trained with a learning rate of 0.0001 for 8 epochs. The model is trained using the Adam [20] optimizer with the following settings: mini-batch size of 32, ε set to 1e-6, β1 = 0.9, and β2 = 0.999. To prevent exploding gradients, we apply gradient clipping so that the gradient norm stays below 5.0. To obtain the average performance, we run all experiments three times. At inference, the beam size is set to 3.
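The following is a hedged PyTorch sketch of the optimizer and gradient-clipping configuration described above; the encoder/decoder objects are placeholders, not the chapter's code.

```python
import torch
from torch import optim

encoder = torch.nn.Linear(10, 10)       # placeholder for the ResNet-50 encoder
decoder = torch.nn.LSTMCell(10, 10)     # placeholder for the LSTM decoder

encoder_opt = optim.Adam(encoder.parameters(), lr=1e-5, betas=(0.9, 0.999), eps=1e-6)
decoder_opt = optim.Adam(decoder.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-6)

def train_step(loss):
    encoder_opt.zero_grad()
    decoder_opt.zero_grad()
    loss.backward()
    # keep the gradient norm below 5.0 to prevent exploding gradients
    torch.nn.utils.clip_grad_norm_(encoder.parameters(), max_norm=5.0)
    torch.nn.utils.clip_grad_norm_(decoder.parameters(), max_norm=5.0)
    encoder_opt.step()
    decoder_opt.step()
```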

8.5.3 Methods for Comparison

The following models are compared:
• Baseline is an unprotected model based on the soft attention model of [37].
• SCST is implemented as in [30] and is an unprotected model.
• Up-Down is implemented as in [3] and is an unprotected model.
• Passport [12] is the work closest to ours, where "passport" layers are added into the DNN model.


• M⊕ is our proposed element-wise addition model presented in Sect. 8.4.2.
• M⊗ is our proposed element-wise multiplication model presented in Sect. 8.4.2.

Besides demonstrating our method on the well-known Show, Attend, and Tell image captioning framework [37], we also apply it to the Up-Down [3] and SCST [30] frameworks. The Up-Down model utilizes two LSTM layers to selectively attend to the image features when generating the caption, where bottom-up attention determines the most relevant regions based on bounding boxes. The SCST model directly optimizes the objective evaluation metrics via reinforcement learning. Thus, in contrast to our baseline, SCST and Up-Down both use different frameworks.

8.6 Discussion and Limitations

8.6.1 Comparison with the Current Digital Watermarking Framework

To compare with the current digital watermarking framework, we reproduce [12] using its official repository; we refer to this model as Passport. We choose [12] because its technical implementation is the closest to ours. Table 8.1 shows that, compared to our proposed methods and the baseline, the Passport model's overall performance on Flickr30k and MS-COCO is significantly worse. Compared with the baseline, its CIDEr-D scores decrease by 32.49% on Flickr30k and 10.45% on MS-COCO, whereas the drop in performance for our two proposed methods is only 3–4% on both datasets. Figure 8.4 shows that the sentences produced by the Passport model are comparatively terse when compared to our methods and the baseline. For example, for the second image in Fig. 8.4, our methods produce "a man in a white shirt is standing in a room", matching the baseline's ground truth, but the Passport model produces only "a man in room", completely omitting the rest of the rich backdrop. Similar observations hold for the remaining images.

Additionally, we experimented with forging a passport to attack the Passport model, and the results show that the Passport model can still achieve a high CIDEr-D score with a forged passport. Quantitative results of the Passport model with both correct and forged passports are presented in Table 8.2. On the Flickr30k and MS-COCO datasets, the forged Passport model produces results remarkably close to those of the correct Passport model; for instance, its CIDEr-D scores are 26.5 (vs. 28.22) and 83.0 (vs. 84.45) on the two datasets. Figure 8.5 illustrates the sentences produced by the Passport model with (a) correct and (b) forged passports for qualitative comparison. It is clear that, in terms of caption length and word choice, the two models produce essentially identical captions.

Table 8.1 Comparison between our methods (M⊗, M⊕), Passport [12], and the baseline on the Flickr30k and MS-COCO datasets, where B-N, R, S, M, and C denote BLEU-N, ROUGE-L, SPICE, METEOR, and CIDEr-D scores. Bold is the best result; * is the second-best result

Flickr30k
Methods         B-1      B-2      B-3      B-4      M        R        C        S
Baseline        63.40    45.18    31.68    21.90    18.04    44.30    41.80    11.98
Passport [12]   48.30    38.23    26.21    17.88    15.02    32.25    28.22     9.98
M⊕              62.43    44.40    30.90    21.13   *17.53    43.63   *40.07   *11.57
M⊗             *62.30   *44.07   *30.73   *21.10    17.63   *43.53    40.17    11.67

MS-COCO
Methods         B-1      B-2      B-3      B-4      M        R        C        S
Baseline        72.14    55.70    41.86    31.14    24.18    52.92    94.30    17.44
Passport [12]   68.50    53.30    38.41    29.12    21.03    48.80    84.45    15.32
M⊕              72.53    56.07    42.03    30.97    24.00    52.90   *91.40   *17.13
M⊗             *72.47   *56.03   *41.97   *30.90   *23.97    52.90    91.60    17.17


Fig. 8.4 Comparison of captions generated by (a) baseline, (b) M⊕, (c) M⊗, and (d) Passport [12]

We conclude that the image captioning model cannot be adequately protected by the current digital watermarking framework.

8.6.2 Fidelity Evaluation

Fidelity means that the performance of the protected model must match that of the original model. In this section, we demonstrate that the proposed embedding schemes do not negatively impact the quality of the generated sentences or the model performance. Table 8.1 reports the overall performance of the baseline model and our methods on Flickr30k and MS-COCO in terms of five image captioning metrics. We observe that M⊕ achieves the best performance, outperforming the baseline on the BLEU-1 score for the Flickr30k dataset and on the BLEU-1 to BLEU-3 scores for the MS-COCO dataset, and obtaining the second-best rating for the remaining metrics.

Table 8.3 presents the performance of our method with the SCST model and the Up-Down model on the MS-COCO dataset. SCST-M⊕ and Up-Down-M⊕ are our proposed key embedding methods applied to the SCST and Up-Down models, respectively; the forged-key variants in Table 8.3 are identical except that a forged key is used at inference. Although they score slightly lower than the original models, our proposed key embedding methods Up-Down-M⊕ and SCST-M⊕ still safeguard the models against a forged key: the models' performance is noticeably worse when a forged key is used. For example, the CIDEr-D score of Up-Down-M⊕ drops 17.27%, from 101.93 to 84.33, and SCST-M⊕'s CIDEr-D score declines from 101.87 to 90.53.

Table 8.2 Comparison of Passport [12] with (top) the correct passport and (bottom) a forged passport on the Flickr30k and MS-COCO datasets

Flickr30k
Methods             B-1      B-2      B-3      B-4      M        R        C        S
Passport            48.30    38.23    26.21    17.88    15.02    32.25    28.22    9.98
Passport (forged)   47.30    37.87    26.01    17.10    14.82    31.88    26.50    9.90

MS-COCO
Methods             B-1      B-2      B-3      B-4      M        R        C        S
Passport            68.50    53.30    38.41    29.12    21.03    48.80    84.45    15.32
Passport (forged)   67.50    52.65    37.15    29.01    20.95    47.90    83.00    15.00


Fig. 8.5 Comparison of captions produced by Passport [12] with (a) the correct passport and (b) a forged passport

Table 8.3 Comparison of our method (Up-Down-M⊕, SCST-M⊕) using the SCST [30] and Up-Down [3] models on the MS-COCO dataset. Rows marked (forged) are identical to Up-Down-M⊕ and SCST-M⊕, respectively, but evaluated with a forged key

Methods                B-1      B-4      M        R        C         S
Up-Down [3]            76.97    36.03    26.67    56.03    111.13    19.90
SCST [30]              –        33.87    26.27    55.23    111.33    –
Up-Down-M⊕             71.57    33.83    25.33    52.43    101.93    18.60
Up-Down-M⊕ (forged)    65.20    29.50    20.33    48.60    84.33     16.53
SCST-M⊕                –        31.60    24.97    52.43    101.87    –
SCST-M⊕ (forged)       –        29.83    22.63    50.17    90.53     –

8.6.3 Resilience Against Ambiguity Attacks

The proposed IP protection framework must be resilient against ambiguity attacks. Here, we simulate two cases:
• The attacker has direct access to the model but lacks the secret key, and attempts to attack the model with an arbitrary forged key. The proposed models' CIDEr-D scores under ambiguity attacks on the secret key for MS-COCO and Flickr30k are shown in Fig. 8.6a–b. It is clear that model performance suffers whenever a forged key is used. Notably, even when the forged key is 75% identical to the real key, the CIDEr-D score of M⊗ on MS-COCO decreases dramatically (by nearly 50%). This demonstrates that our proposed models are resilient to forged-key attacks.
• The attacker has the correct secret key and can therefore obtain the model's original performance. The signature, however, suffices for ownership verification, so the attacker instead attempts to attack the signature by altering its sign. As shown in Fig. 8.6c–d, when the signature is compromised, our proposed models' overall performance (CIDEr-D score) declines on both MS-COCO and Flickr30k. When only 10% of the signs are toggled (a relatively small change), the model's CIDEr-D score drops by at least 10% to 15%, and when 50% of the signs are toggled, the model becomes almost unusable (a simulation sketch of both attacks is given below).

Fig. 8.6 CIDEr-D on MS-COCO and Flickr30k under ambiguity attacks on (a), (b) the key and (c), (d) the signature
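The following is a hedged sketch (function and evaluation names are illustrative, not the chapter's code) of the two ambiguity attacks simulated above: forging a key that is only partially identical to the real one, and toggling a fraction of the signature's sign bits.

```python
import torch

def forge_key(true_key, identical_fraction, seed=0):
    g = torch.Generator().manual_seed(seed)
    forged = true_key.clone()
    n_flip = int((1.0 - identical_fraction) * true_key.numel())
    idx = torch.randperm(true_key.numel(), generator=g)[:n_flip]
    forged[idx] = -forged[idx]            # flip the remaining bits of the {-1,+1} key
    return forged

def toggle_signature(signature, toggle_fraction, seed=0):
    g = torch.Generator().manual_seed(seed)
    attacked = signature.clone()
    n_flip = int(toggle_fraction * signature.numel())
    idx = torch.randperm(signature.numel(), generator=g)[:n_flip]
    attacked[idx] = -attacked[idx]        # toggle the sign of a fraction of the bits
    return attacked

# Usage (hypothetical): evaluate CIDEr-D with keys that are 100%, 75%, 50%, ... identical
# for frac in (1.0, 0.75, 0.5, 0.25):
#     score = evaluate_cider(protected_model, forge_key(true_key, frac), test_set)
```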

8.6.4 Robustness Against Removal Attacks

A removal attack is another common model stealing technique that attempts to remove the embedded watermark from the protected model. Here, we simulate two cases:
• Fine-tuning the model: the attacker attempts to eliminate the embedded signature by fine-tuning the stolen model on a new dataset, producing a new model that inherits the performance of the stolen one. The proposed models' CIDEr-D scores and signature detection rates after fine-tuning are shown in Table 8.4. On the original task, the signature can be extracted with nearly 100% accuracy using our methods. After fine-tuning, our approaches produce a CIDEr-D score comparable to the baseline, but the signature detection rate drops to about 70%. This is a drawback of the proposed method, but since the secret key can still be used to prove ownership, the model's IP protection is not jeopardized. As a result, the secret key, together with the signature presented in this chapter, provides comprehensive ownership verification protection.


Table 8.4 Fine-tuning attack: signature detection rate (%) with CIDEr-D in brackets, for the proposed models and the baseline (left: trained on Flickr30k and fine-tuned on MS-COCO; right: trained on MS-COCO and fine-tuned on Flickr30k)

Methods    Flickr30k        MS-COCO (fine-tuned)    MS-COCO          Flickr30k (fine-tuned)
Baseline   – (41.80)        – (88.50)               – (94.30)        – (37.70)
M⊕         100 (40.07)      72.50 (87.30)           100 (91.40)      70.40 (37.50)
M⊗         99.99 (40.17)    71.35 (86.50)           99.99 (91.60)    71.50 (37.8)

Table 8.5 Fine-tuning key and signature attack: signature detection rate (%) with CIDEr-D in brackets, for the proposed models (left: Flickr30k; right: MS-COCO)

           Flickr30k                          MS-COCO
Methods    Protect          Attack            Protect          Attack
M⊕         100 (40.07)      69.14 (39.70)     100 (91.40)      68.08 (89.60)
M⊗         99.99 (40.17)    67.96 (38.1)      99.99 (91.60)    68.16 (89.6)

• Fine-tuning the key and signature: the attacker is fully aware of the model, including the key/signature, the training process, the training settings, and the dataset used. By fine-tuning the protected model with a different signature and key using the same training procedure, the attacker attempts to remove the original signature and key. Table 8.5 presents the signature detection rate and the CIDEr-D score of the proposed models after key and signature fine-tuning. The CIDEr-D score obtained by our approaches is only marginally lower (−1% to −5%) than that of the protected model after fine-tuning to a new key and signature. The signature detection rate, on the other hand, declines from about 100% to around 68%. This is the worst-case situation, in which the attacker has full knowledge of the model, including the training steps, and the model is tough to defend. A sketch of the fine-tuning attack procedure is given below.
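The following is an illustrative sketch (assumed helpers and interfaces; not the chapter's code) of the fine-tuning removal attack simulated above: the attacker continues training the protected decoder on a new dataset with only the plain captioning loss, after which the owner re-measures the signature detection rate.

```python
import torch

def finetune_attack(protected_decoder, new_loader, caption_loss_fn, epochs=8, lr=1e-4):
    opt = torch.optim.Adam(protected_decoder.parameters(), lr=lr)
    for _ in range(epochs):
        for feats, captions in new_loader:              # new dataset, e.g., Flickr30k
            logits = protected_decoder(feats, captions)
            loss = caption_loss_fn(logits, captions)    # plain captioning loss, no sign loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return protected_decoder
```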

8.6.5 Limitations

As shown by the simulations in Sect. 8.6, the proposed signature and secret key can safeguard the model from unauthorized use. However, embedding-based approaches have several unavoidable limitations. In the worst-case situation, where the attacker has full knowledge of the model, the signature detection rate can be reduced and the secret key can be removed. This makes it difficult to open-source the model, because the actual training parameters, the secret key, and the training procedure must be kept hidden from others.


8.7 Conclusion

We initiate the implementation of ownership protection for the image captioning task. To prevent unauthorized use of the image captioning functionality, two separate embedding approaches that make use of the RNN's hidden state are proposed. Through extensive experiments, we show that the functionality of the proposed image captioning model is well preserved in the presence of a correct key and, on the other hand, well protected against unauthorized use. In comparison to watermarking-based approaches, which rely on legal enforcement and government investigation, the proposed key-based protection is more timely, proactive, and cost-effective. The proposed key-based protection nevertheless has some weaknesses, such as the fact that its security is compromised when the attacker has full knowledge of the model. In future work we will address these weaknesses and ensure that the model is completely secured against various types of attackers.

References 1. Adi, Y., Baum, C., Cisse, M., Pinkas, B., Keshet, J.: Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In: USENIX, pp. 1615–1631 (2018) 2. Anderson, P., Fernando, B., Johnson, M., Gould, S.: Spice: Semantic propositional image caption evaluation. In: European Conference on Computer Vision, pp. 382–398. Springer, Berlin (2016) 3. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018) 4. Banerjee, S., Lavie, A.: Meteor: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72 (2005) 5. Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E., Ikizler-Cinbis, N., Keller, F., Muscat, A., Plank, B.: Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Intell. Res. 55, 409–442 (2016) 6. Chen, H., Rouhani, B.D., Fu, C., Zhao, J., Koushanfar, F.: DeepMarks: A secure fingerprinting framework for digital rights management of deep learning models. In: Proceedings of the 2019 on International Conference on Multimedia Retrieval, pp. 105–113 (2019) 7. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734 (2014) 8. Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R.: Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10578–10587 (2020) 9. Darvish Rouhani, B., Chen, H., Koushanfar, F.: DeepSigns: An end-to-end watermarking framework for ownership protection of deep neural networks. In: Proceedings of the TwentyFourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 485–497 (2019)


10. Ding, S., Qu, S., Xi, Y., Sangaiah, A.K., Wan, S.: Image caption generation with high-level image features. Pattern Recogn. Lett. 123, 89–95 (2019) 11. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634 (2015) 12. Fan, L., Ng, K.W., Chan, C.S.: Rethinking deep neural network ownership verification: Embedding passports to defeat ambiguity attacks. In: Advances in Neural Information Processing Systems, pp. 4716–4725 (2019) 13. Guo, J., Potkonjak, M.: Watermarking deep neural networks for embedded systems. In: Proceedings of the 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8. IEEE, New York (2018) 14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770– 778 (2016) 15. He, X., Shi, B., Bai, X., Xia, G.S., Zhang, Z., Dong, W.: Image caption generation with part of speech guidance. Pattern Recogn. Lett. 119, 229–237 (2019) 16. Herdade, S., Kappeler, A., Boakye, K., Soares, J.: Image captioning: Transforming objects into words. Adv. Neural Inf. Proces. Syst. 32, 1–12 (2019) 17. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 18. Ji, J., Du, Z., Zhang, X.: Divergent-convergent attention for image captioning. Pattern Recogn. 115, 107928 (2021) 19. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015) 20. Kinga, D., Ba Adam, J.: A method for stochastic optimization. In: ICLR, vol. 5 (2015) 21. Le Merrer, E., Perez, P., Trédan, G.: Adversarial frontier stitching for remote neural network watermarking. Neural Comput. Applic. 32(13), 1–12 (2019) 22. Li, G., Zhu, L., Liu, P., Yang, Y.: Entangled transformer for image captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8928–8937 (2019) 23. Lim, J.H., Chan, C.S.: Mask captioning network. In: Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), pp. 1–5. IEEE, New York (2019) 24. Lin, C.-Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81. Barcelona, Spain (2004). Association for Computational Linguistics 25. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European Conference on Computer Vision, pp. 740–755. Springer, Berlin (2014) 26. Luo, Y., Ji, J., Sun, X., Cao, L., Wu, Y., Huang, F., Lin, C.-W., Ji, R.: Dual-level collaborative transformer for image captioning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2286–2293 (2021) 27. Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., Yuille, A.: Deep captioning with multimodal recurrent neural networks (m-rnn). In: International Conference on Learning Representations (2015) 28. Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, New York (2002) 29. 
Quan, Y., Teng, H., Chen, Y., Ji, H.: Watermarking deep neural networks in image processing. IEEE Transactions on Neural Networks and Learning Systems 32(5), 1852–1865 (2020) 30. Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., Goel, V.: Self-critical sequence training for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024 (2017)


31. Uchida, Y., Nagai, Y., Sakazawa, S., Satoh, S.: Embedding watermarks into deep neural networks. In: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval (ICMR ’17), pp. 269–277, New York, NY, USA (2017). Association for Computing Machinery, New York 32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Proces. Syst. 30, 1–11 (2017) 33. Vedantam, R., Zitnick, C.L., Parikh, D.: Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015) 34. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015) 35. Wang, J., Wang, W., Wang, L., Wang, Z., Feng, D.D., Tan, T.: Learning visual relationship and context-aware attention for image captioning. Pattern Recogn. 98, 107075 (2020) 36. Xiao, X., Wang, L., Ding, K., Xiang, S., Pan, C.: Dense semantic embedding network for image captioning. Pattern Recogn. 90, 285–296 (2019) 37. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015) 38. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. T-ACL 2, 67–78 (2014) 39. Zhang, J., Gu, Z., Jang, J., Wu, H., Stoecklin, M.P., Huang, H., Molloy, I.: Protecting intellectual property of deep neural networks with watermarking. In: Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 159–172 (2018)

Chapter 9

Protecting Recurrent Neural Network by Embedding Keys

Zhi Qin Tan, Hao Shan Wong, and Chee Seng Chan

Abstract Recent advancements in artificial intelligence (AI) have resulted in the emergence of Machine Learning as a Service (MLaaS) as a lucrative business model that utilizes deep neural networks (DNNs) to generate revenue. With huge amounts of time, resources, and budget invested in researching and developing successful DNN models, it is important to protect their intellectual property rights (IPR), as these models can easily be replicated, shared, or redistributed without the consent of their legitimate owners. So far, no robust protection scheme designed for recurrent neural networks (RNNs) exists. Thus, this chapter proposes a complete protection framework, including both white-box and black-box protection, to enforce IPR on different variants of RNNs. Within the framework, a key gate is introduced to realize the idea of embedding keys to protect IPR. The framework designates methods to train RNN models in a specific way such that, when an invalid or forged key is presented, the performance of the embedded RNN model deteriorates. The key gate is inspired by the nature of the RNN model itself: it governs the flow of the hidden state and is designed such that no additional weight parameters are introduced.

9.1 Introduction

With recent advancements in artificial intelligence (AI), Machine Learning as a Service (MLaaS) has emerged as a lucrative business model that utilizes deep neural networks (DNNs) to generate revenue for various businesses. However, it is an undeniable fact that building a successful DNN model is not an easy task and often requires a huge investment of time, resources, and budget. With so much effort put into building them, DNN models can be considered inventions, and therefore their intellectual property (IP) should be protected to



prevent them from being replicated, redistributed, or shared by illegal parties. At the time of writing, such protection is already provided by various methods that embed digital watermarks into DNN models during the training phase and are robust to model modification techniques such as fine-tuning, model pruning, and watermark overwriting [1, 4, 6, 7, 12, 14, 16, 18, 19]. The basis of digital watermarking is to embed unique identification information into the network parameters without affecting the performance of the original task. Generally, efforts to enforce IP protection on DNNs through embedding digital watermarks can be categorized into two groups: (i) black-box (trigger-set-based) solutions that embed the watermark in the input–output behaviour of a model through adversarial samples with specific labels (trigger sets) [1, 7, 12, 19], and (ii) white-box (feature-based) solutions that embed the watermark into the internal parameters of a DNN model (i.e., model weights or activations) [4, 16, 18]. There are also protection schemes that utilize both black-box and white-box methods [6, 14]. Usually, white-box protection methods are more robust against model modification and removal attacks (i.e., fine-tuning and model pruning), while black-box protection methods have the advantage of an easier verification process, as verification does not involve accessing the internal parameters of the suspicious model. Verification typically involves first remotely querying a suspicious online model through API calls with the trigger set as input and observing the watermark information in the model output (black-box). If the model output exhibits behaviour similar to the previously embedded watermark, this is used as evidence to identify a suspected party who used the owner's model illegally. With the evidence collected, through authorized law enforcement, the owner can then request access to the suspicious model's internal parameters to extract the embedded watermark (white-box), so that a judge can analyse the extracted watermark and give the final verdict.

9.2 Related Works

In terms of digital watermarking in DNN models, Uchida et al. [18] were probably the first to propose white-box protection, embedding watermarks into a CNN by imposing a regularization term on the weight parameters. However, the method is limited to white-box settings, in that one needs access to the internal parameters of the model in question to extract the embedded watermark for verification. Therefore, [1] and [12] proposed embedding watermarks in the output classification labels of adversarial examples in a trigger set, so that the watermarks can be extracted remotely through API calls without access to the model weights (black-box). In both black-box and white-box settings, [4, 7, 16] demonstrated how to embed watermarks (or fingerprints) that are robust to various types of attacks such as model fine-tuning, model pruning, and watermark overwriting. Recently, [6, 20] proposed passport-based verification schemes to improve robustness against ambiguity attacks. Sheng et al. [14] also proposed a complete IP protection


framework for Generative Adversarial Networks (GANs) by imposing an additional regularization term on all GAN variants. However, all the aforementioned existing works are demonstrated using CNNs or GANs in the image domain only. At the time of writing, there are no previous works that aim to provide IP protection for RNNs. The lack of protection might be due to the difference in RNN application domains compared to CNNs and GANs. For instance, the protection framework proposed by Uchida et al. [18] for CNNs cannot be applied directly to RNNs because of the major difference in the input and output of RNNs compared to CNNs. Specifically, the input to an RNN is a sequence of vectors with variable length, while the output of an RNN can be either the final output vector or a sequence of output vectors, depending on the task (e.g., text classification or machine translation).

9.3 Problem Formulation

9.3.1 Problem Statement

As demonstrated by Fan et al. [6], there exists an effective ambiguity attack that casts doubt on the ownership verification methods of [1, 18] by forging additional watermarks for the DNN models. Moreover, the protection methods demonstrated by these studies have limited application domains and are demonstrated using Convolutional Neural Networks (CNNs) on image classification tasks. At the time of writing, a protection framework that is robust against model modification and ambiguity attacks for one of the popular deep learning (DL) models, the Recurrent Neural Network (RNN), which is used for various sequential tasks such as text classification and machine translation, does not yet exist and is thus urgently needed.

9.3.2 Protection Framework Design

When designing a protection framework for RNN networks, it is important to consider the following criteria:
• Fidelity. The proposed protection framework should not degrade the target model's performance. This means that after watermarks V_B and V_W are embedded into a target model N_t, the performance of the embedded model should be as close as possible to that of a vanilla, non-embedded model N.
• Robustness. The proposed protection framework should be robust and survive even if the target model N_t undergoes changes or modification. Common operations that a DNN model undergoes include fine-tuning and network pruning. The


watermarks .VB and .VW should remain extractable or verifiable after an attacker modifies its content through fine-tuning or network pruning. • Secrecy. From an outsider point of view, one should not be able to identify whether a DNN model is protected or watermarked. This means that the proposed protection framework should not introduce noticeable changes to the target model .Nt and one should not be able to differentiate between a protected model .Nt and a vanilla model .N. • Efficiency. The proposed protection framework should impose a very small overhead and does not increase the inference time of the target model .Nt . Inferencing a DNN model is always an expensive process, and thus we should not add extra burdens to it especially when it is deployed on resources-constrained devices. However, increase in training time is justifiable because it is performed by network owners with the aim to protect the model ownership. • Generality. A DNN protection framework should be designed in a way such that it is easily applicable to various types of model architectures trained on different tasks and domains. This means that the proposed protection framework can be easily applied on different types of RNNs. This also implies that the effectiveness of the protection framework should not be dependent on the architecture and/or tasks of the target model .Nt .

9.3.3 Contributions

Based on the design goals stated above,
• We put forth a novel and generalized key-gate RNN ownership protection technique (Eq. (9.5)) that exploits the gating mechanism of a particular RNN variant's cell to control the flow of the hidden state depending on the key presented.
• We propose a comprehensive key-based ownership verification scheme and perform extensive evaluations on two variants of RNN cells, namely Long Short-Term Memory (LSTM) [8] and Gated Recurrent Unit (GRU) [5].
Inspired by the work of [6], the inference performance of the RNN models is dependent on the availability of a valid key. If an invalid key is presented, the performance of the model will deteriorate. In our case, the influence of an invalid key is far more pronounced, as the disrupted hidden state is passed down the sequence and irreversibly deteriorates the inference of the model. The overall protection process is illustrated in Fig. 9.1. Extensive experiments show that the proposed ownership verification in both black-box and white-box settings is effective and robust against both removal and ambiguity attacks (see Tables 9.9, 9.10, and 9.11) without affecting the model's performance on its original tasks (see Tables 9.3, 9.4, and 9.5).


Fig. 9.1 Overview of our proposed protection scheme in white-box and black-box settings. White-box and black-box watermarks are embedded into the RNN model during the training phase

9.3.4 Protocols for Model Watermarking and Ownership Verification

In general, our proposed protection scheme for a given RNN network N() is defined as a tuple {G, E, V_B, V_W} of processes, consisting of:

1. A watermark generation process G() generates target white-box watermarks S_W and black-box watermarks S_B with triggers T. In our proposed framework, the white-box watermarks S_W would be the key k (see Sect. 9.4.2) and the sign signature B (see Sect. 9.4.3),

   G() → (S_W, θ; S_B, T).    (9.1)

2. A watermark embedding process E() embeds black-box watermarks S_B and white-box watermarks S_W into the model N(). The embedding process is done during the training phase of an RNN network,

   E(N() | (S_W, θ; S_B, T)) → N().    (9.2)

3. A black-box verification process V_B() checks whether the model N() makes inferences specifically for triggers T. Verification is done if the model accuracy on triggers T is greater than a set threshold P_T,

   V_B(N, S_B, P_T | T).    (9.3)


Fig. 9.2 Details of our proposed protection scheme in the black-box setting: (a) training, the process of embedding the trigger set into our model; (b) verification, examining suspicious remote models

4. A white-box verification process V_W() accesses the model parameters W to extract the white-box watermarks S̃_W and compares S̃_W with S_W. In our proposed framework, verification is done by comparing the performance of the network N() with a given key k and accessing the sign signature γ,

   V_W(W, S_W | N()).    (9.4)

Figures 9.1 and 9.2 depict the overview of the process in more detail. The verification process is done in two phases: (i) black-box verification by verifying the output of trigger set input and (ii) white-box verification by two processes: (a) in an ambiguous situation, ownership is verified using a fidelity evaluation process that is based on the network performances in favour of the valid key that was previously embedded and (b) extracting output signature of the key provided as a watermark.

9.4 Proposed Method Briefly, the main motivation of embedding keys is to design and train RNN models in a way such that, with an invalid key or forged signature, the inference performance of the original task will be significantly deteriorated. The idea of embedding keys for RNN models is to take advantage of its recurrent property (sequence-based) so that the information (hidden states) passed between time steps would be affected when an invalid key is presented. First, we shall illustrate how


to implement the desired property by introducing key gates to RNN cells, followed by a complete ownership protection framework that utilizes the embedded keys. We decided to demonstrate the proposed method on two major variants of RNN which are widely used in various tasks, namely LSTM [8] and GRU [5] to present the flexibility of the proposed key gate. However, since the key gate implementation is general, one can easily adapt it to other variants of RNN such as Multiplicative LSTM [10], BiLSTM [17], etc. Note that a key k is a sequence of vectors similar to the input data x, i.e., for natural language processing (NLP) task, key will be a sequence of word embeddings (see Sect. 9.4.2). Naturally, the key k will also have a varying length of time steps such that .kt is the key value at time step t. In our experiment, we use a batch of K keys to calculate the key gate by taking their mean value.

9.4.1 Key Gates

By the nature of an RNN model, the choice and amount of information carried forward to the subsequent cells is decided by a combination of gates that depends on the particular RNN type. Following this principle, we propose to add a key gate that empowers the RNN model to embed digital signatures (i.e., keys) secretly; the gate is invisible to the public because no extra weight parameters are introduced. The key gate controls the inference behaviour of the RNN model to achieve the signature-embedding purpose, and its values are computed from the weights of the RNN cell as follows:

   kg_t = σ(W_ik k_t + b_ik + W_hk h^k_{t−1} + b_hk)    (9.5)

   h^x_t = kg_t ⊙ h^x_t,   c^x_t = kg_t ⊙ c^x_t   (for LSTM)    (9.6)

in which σ denotes the sigmoid operation, ⊙ denotes element-wise multiplication, k_t is the input to the key gate, h^k_{t−1} is the previous hidden state of the key, and h^x_t and c^x_t (for LSTM) are the hidden states of the input. kg_t is the key gate used to control the hidden states of the input x between time steps. Since the introduced key gate should not add additional weight parameters to the RNN cell, we choose to reuse the original weights of the vanilla RNN to calculate the value of kg_t, i.e., for the LSTM cell we use W_f and b_f [8], while for the GRU cell we use W_r and b_r [5], as W_k and b_k. Note that the hidden state of the key at the next time step is calculated using the original RNN operation, i.e., h^k_t = RNN(k_t, h^k_{t−1}), in which RNN represents the operation of an RNN cell. Figure 9.3 outlines the architecture of RNN cells with the introduced key gate; the key gates, Eqs. (9.5) and (9.6), are represented by the dashed lines in the figure.
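To make the gating concrete, the sketch below shows one time step of a key-gated LSTM in PyTorch. This is a minimal illustration rather than the authors' implementation: the helper name key_gated_lstm_step is ours, ⊙ is treated as an element-wise product, and the forget-gate block of a standard torch.nn.LSTMCell is sliced out to serve as (W_f, b_f) in Eq. (9.5).

```python
# Minimal sketch (not the authors' code) of Eqs. (9.5)-(9.6) for a single LSTM
# time step. Assumes a standard torch.nn.LSTMCell, whose packed gate order is
# (input, forget, cell, output), so rows H:2H hold the forget-gate weights.
import torch

def key_gated_lstm_step(cell: torch.nn.LSTMCell, x_t, k_t, state_x, state_k):
    h_x, c_x = cell(x_t, state_x)      # vanilla LSTM update for the input stream
    h_k_prev = state_k[0]              # h^k_{t-1}, needed by the key gate
    h_k, c_k = cell(k_t, state_k)      # vanilla LSTM update for the key stream

    H = cell.hidden_size
    W_if, W_hf = cell.weight_ih[H:2 * H], cell.weight_hh[H:2 * H]
    b_if, b_hf = cell.bias_ih[H:2 * H], cell.bias_hh[H:2 * H]

    # Eq. (9.5): key gate computed from the reused forget-gate weights
    kg_t = torch.sigmoid(k_t @ W_if.T + b_if + h_k_prev @ W_hf.T + b_hf)

    # Eq. (9.6): gate the hidden and cell states of the input stream
    return (kg_t * h_x, kg_t * c_x), (h_k, c_k)
```

With a batch of K keys, kg_t would additionally be averaged over the keys before gating, as described in Sect. 9.4.2.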

Fig. 9.3 Key gates in two major variants of RNN: (a) LSTM cell with key gates and (b) GRU cell with key gates. Solid lines denote the original RNN operation for each cell type; dashed lines delineate the proposed key gate

For an RNN model trained with key k_e, its inference performance P(N[W, k_e], x_r, k_r) depends on the running-time key k_r, i.e.,

   P(N[W, k_e], x_r, k_r) = P_{k_e} if k_r = k_e, and P̄_{k_e} otherwise.    (9.7)

If a valid key is not presented (k_r ≠ k_e), the running-time performance P̄_{k_e} is significantly deteriorated because the key gate kg_t is calculated based on the wrong key, which disrupts the hidden states of the RNN cell (see Sect. 9.6.6). For instance, as shown in Tables 9.3, 9.4, and 9.5, a Seq2Seq model presented with an invalid key has its inference performance drop significantly in terms of BLEU score.


9.4.2 Methods to Generate Key Although weight parameters of a protected RNN model might be easily plagiarized, the plagiarizer has to deceive the network using a correct key, else they will be defeated during ownership verification stage. The chance of success of such method depends on the odds of guessing the secret key correctly which is very small. Three types of methods to generate keys, .G (), have been investigated in our work: • Random patterns, elements of key are randomly generated from a uniform distribution between [.−1, 1]. For sequential image classification task, we generate random noise pattern images and for NLP task a sequence of random word embeddings. • Fixed key, one key is created from the input domain and fed through the trained RNN model with the same architecture to collect its corresponding features at each layer. The corresponding features are used in the key gates. For sequential image classification task, we use proper images as key. For NLP task, a sentence from the input language domain is used as key. • Batch keys, a batch of K keys similar to above are fed through the trained RNN model with the same architecture. Each K features is used in the key gates, and their mean value is used to generate the final key gate activation. Since batch keys provide the strongest protection, we adopt this key generation method for all the experiments reported in this chapter. For example, in NLP task, the number of possible key combination is .(K × l)V , where K is the number of keys used, l is the length/time step of key, and V is the vocabulary size.
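As a rough illustration of the three generation strategies G() above, the sketch below builds random-pattern, fixed, and batch keys for an NLP model. The function names and the embedding-table argument are illustrative assumptions, not the authors' API.

```python
# Minimal sketch (illustrative) of the three key-generation options for an NLP
# task; keys are sequences of word embeddings, like the input data x.
import torch

def random_pattern_keys(num_keys, key_len, embed_dim):
    # random patterns: elements drawn uniformly from [-1, 1]
    return torch.empty(num_keys, key_len, embed_dim).uniform_(-1.0, 1.0)

def fixed_key(emb: torch.nn.Embedding, token_ids):
    # fixed key: one in-domain sentence mapped to its word embeddings
    return emb(token_ids).unsqueeze(0)        # shape (1, key_len, embed_dim)

def batch_keys(emb: torch.nn.Embedding, batch_token_ids):
    # batch keys: K in-domain sentences; the key gate averages over the K keys
    return emb(batch_token_ids)               # shape (K, key_len, embed_dim)
```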

9.4.3 Sign of Key Outputs as Signature

In addition to embedding a key using key gates, to further protect the RNN model's ownership from insider threats (e.g., a former employee who establishes a new business with all resources stolen from the original owner), one can enforce the hidden state of the key at the first time step h^k_{t=0} to take either positive or negative signs (+/−) as designated (storing information as bits), so that it forms a unique signature string (similar to a fingerprint). The capacity (the number of bits) of the sign signature is equal to the number of hidden units in the RNN. This idea is inspired by Fan et al. [6]. We adopt and modify the sign loss regularization term of [6] and add it to the combined loss:

   L_S(H^k_{t=0}, B) = Σ_{i=1}^{C} max(γ − Avg(h^k_{t=0,i}) · b_i, 0) + Σ_{j=1}^{C} 1 / Std(h^k_{t=0,j})    (9.8)


in which B = {b_1, . . . , b_C} ∈ {−1, 1}^C consists of the designated binary bits for the C hidden cell units in the RNN, and γ is a positive control parameter (0.1 by default unless stated otherwise) that encourages the hidden state to have magnitudes greater than γ. Avg and Std denote the average and standard deviation operations, respectively. The Std term across a batch of K keys is added to the regularization term to introduce variance between the hidden state outputs of the individual keys, encouraging the desired behaviour that the correct signature can only be extracted when all K keys are present. Note that the signs enforced in this way remain quite persistent against various adversarial attacks. Even when an illegal party attempts to overwrite the embedded key, the signatures remain robust against ambiguity attacks, as shown in Sect. 9.6.3.
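A compact PyTorch sketch of this regularizer is given below. It is an illustration under our own naming (sign_loss, h_k0), assuming h_k0 holds the first-time-step key hidden states for a batch of K keys with C units each.

```python
# Minimal sketch (illustrative) of the sign loss in Eq. (9.8).
import torch

def sign_loss(h_k0: torch.Tensor, b: torch.Tensor, gamma: float = 0.1):
    # h_k0: (K, C) key hidden states at t = 0; b: (C,) designated ±1 bits
    avg = h_k0.mean(dim=0)                          # Avg over the K keys, per unit
    std = h_k0.std(dim=0)                           # Std over the K keys, per unit
    hinge = torch.clamp(gamma - avg * b, min=0.0)   # push sign(avg_i) toward b_i
    return hinge.sum() + (1.0 / (std + 1e-8)).sum() # variance term of Eq. (9.8)
```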

9.4.4 Ownership Verification with Keys Using the proposed key embedding method, we design two ownership verification schemes that include both white-box (key and sign signature) and black-box (trigger set .T) protection. 1. Private Ownership Scheme: both the key and trigger set are embedded in the RNN model during training phase. The key is then distributed to the users so that they can use the trained RNN model to perform inference together with the valid key. 2. Public Ownership Scheme: both the key and trigger set are embedded in the RNN model during training phase, but the key is not distributed to the users, meaning that the embedded key is not needed during inference phase. This is achieved through multi-task learning where we perform two forward passes on the same input data, i.e., first forward pass without embedding key (so that the model can perform well without the key when distributed) and second forward pass with embedding key. However, the training time is increased by roughly 2 times as two forward pass is needed. Trigger Sets We introduce trigger sets for sequential tasks, i.e., NLP tasks suitable for RNN models since all the previously proposed trigger sets are only applicable in image domain on image classification tasks [1, 12]. Two types of NLP tasks are experimented with in this chapter: (i) text classification and (ii) machine translation. For text classification task, we randomly selected t samples as trigger set from the training dataset (i.e., TREC-6 [13]) and shuffled their labels. For machine translation task, we investigated two methods of creating trigger set: (i) randomly selected t samples as trigger set from the training dataset (i.e., WMT14 EN-FR translation dataset [3]) and shuffled their target translation and (ii) create random sentences from the vocabulary V of both source and target languages as trigger set. Note that both methods give similar performance, and however in method (i) the trigger set must come from a different domain to prevent the model from overfitting to the specific domain (e.g., training set—parliament speech and trigger set—news


Table 9.1 Examples of the trigger set T in text classification (TREC-6) and machine translation (WMT14 EN-FR). For text classification, the original labels are denoted in brackets, while for machine translation, the trigger output is constructed from the set of words from the target language vocabulary

Task                 | Trigger input                    | Trigger output
Text classification  | When was Ozzy Osbourne born?     | DESC (NUM)
Text classification  | What is ethology?                | NUM (DESC)
Text classification  | Who produces Spumante?           | LOC (HUM)
Machine translation  | Who are our builders?            | Nous avons une grâce du Pape
Machine translation  | But I do not get worked up       | Je suis pour cette culture
Machine translation  | Basket, popularity epidemics to  | Desquels le constatons habillement

commentary). Table 9.1 shows a few examples of trigger set used in the experiments. Note that the trigger output does not need to have proper grammatical structure or carry any meaning. For the trigger set, its embedding is jointly achieved using the same minimization process as the original tasks. The inclusion of both white-box and black-box ownership embedding significantly improved the robustness of RNN model protection, as we combined the advantages of both protection settings. Algorithms The pseudo-code of the proposed watermarking scheme is detailed in Algorithm 1. Note that when key is not distributed, it refers to Public Ownership Scheme (Item 2) where the key is not distributed to the end user and is not used during inferencing. Therefore, the training process becomes a multi-task learning training where .LD is the learning objective of the original tasks, while .LDk is the learning objective with embedding keys. Verification In detail, the black-box verification of ownership Algorithm 2, .VB () can be done firstly by performing remote API calls to the suspect RNN models without needing to access the model’s weight. The verification is performed in such a way that just minimal effort from the owner is required, by leveraging the embedded trigger set and examining the trigger outputs under black-box protection. Once a suspected model was first identified, the owner is backed by evidence and is able to proceed with more concrete reassertion of ownership by key verification and extracting sign signature in white-box setting, .VW (). Key verification is done through an evaluation process where ownership is claimed only when the difference between the inference performance with the embedded key and with the running time key (Eq. (9.7)) is less than a threshold, i.e., .Pke − Pkr < Pthres . The key and signature can also be constructed in such a way that it represents the owner, i.e., “this model belongs to XXX business” (see [6] for detailed examples of methods to encode signature).
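The two-stage verification just described can be summarised in a few lines of code. The sketch below is illustrative only: eval_fn, p_thres, acc_threshold, and the helper names are our assumptions; it simply encodes the checks named in the text (trigger-set accuracy for V_B, the fidelity gap P_ke − P_kr < P_thres and the sign signature for V_W).

```python
# Minimal sketch (illustrative) of the black-box and white-box checks.
import torch

def verify_blackbox(model, X_trigger, Y_trigger, acc_threshold=0.9):
    preds = model(X_trigger).argmax(dim=1)
    return (preds == Y_trigger).float().mean().item() >= acc_threshold

def verify_key(eval_fn, model, key_embedded, key_claimed, p_thres):
    # ownership is claimed only if P_ke - P_kr < P_thres (Eq. (9.7) notation)
    return eval_fn(model, key_embedded) - eval_fn(model, key_claimed) < p_thres

def verify_signature(h_k0: torch.Tensor, b_expected: torch.Tensor) -> bool:
    # signature bits are the signs of the averaged first-step key hidden state
    return torch.equal(torch.sign(h_k0.mean(dim=0)), b_expected.sign())
```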


Algorithm 1 Training step for multi-task learning
initialize key model N;
if use trigger set then initialize trigger sets T; end if
initialize keys K (aka S_W) in N;
specify the desired signature B in binary, to be embedded in the signs of the hidden state h^k_{t=0} in the RNN layers;
for number of training iterations do
    sample a minibatch of m samples D_X and targets D_Y;
    if use trigger set then
        sample t samples of T_x and trigger targets T_y (aka S_B);
        concatenate D_X with T_x, D_Y with T_y;
    end if
    if key not distributed then compute loss L_D using X, Y; end if
    compute loss L_{D_k} with embedded key using D_X, D_Y and K, using Eqs. (9.5) and (9.6);
    compute sign loss L_S using Eq. (9.8);
    compute combined loss L using L_D, L_{D_k}, and L_S;
    backpropagate using L and update N;
end for

Algorithm 2 Black-box verification
Y ← N[T_x];
if Y = D_Y then standard model; end if
if Y = T_y then ownership verified; end if

9.5 Experiments This section illustrates the empirical study of our proposed key-based protection framework for RNN models. We will mainly report our results from the aspect of the two most important requirements for DNN watermarking [2], i.e., fidelity and robustness. Unless stated otherwise, all experiments are repeated 5 times and tested against 50 fake keys to get the mean inference performance. To differentiate between the baseline models and the protected models, we denote our protected models with subscript k and kt where RNN.k models are protected in white-box settings by embedding key and signature using .LS (Eq. (9.8)) through multi-task learning, whereas RNN.kt represents models that are protected in both white-box and black-box settings using trigger sets.


9.5.1 Learning Tasks The proposed framework is tested in three different tasks: • Sequential image classification (SeqMNIST [11]). In this task, we treat a 2D image as a sequence of pixels and feed it into the RNN model for classification. This is particularly useful in cases where one cannot obtain the whole image in a single time frame. SeqMNIST [11] is a variant of MNIST where a sequence of image pixels that represent handwritten digit images is classified into 10 digit classes. • Text classification (TREC-6 [13]). TREC-6 [13] is a dataset for question classification consisting of open-domain fact-based questions divided into broad semantic categories. The dataset has 6 classes that are ABBR, DESC, ENTY, HUM, LOC, and NUM. • Machine translation (WMT14 EN-FR [3]). WMT14 [3] dataset is provided at the Ninth Workshop on Statistical Machine Translation. There are several pairs of parallel data available under translation task. For this experiment, we chose the task of translating English to French. We combined all parallel data of EN-FR and train the RNN model on 6M pairs of sentences.

9.5.2 Hyperparameters We chose [5, 11, 21] as the baseline model and followed the hyperparameters defined in these works for each task, i.e., machine translation on WMT14 EN-FR [3], sequential image classification on SeqMNIST [11], and text classification on TREC6 [13]. For machine translation task, we adopted a Seq2Seq model that comprises an encoder and a decoder with GRU layers similar to the baseline paper [5]. BLEU [15] score is used to evaluate the quality of translation results. Table 9.2 summarizes the hyperparameters used in the experiments.

Table 9.2 Summary of hyperparameters used in each task

Hyperparameter        | TREC-6   | SeqMNIST | WMT14 EN-FR
Vocabulary size       | –        | –        | 15,000
Max sentence length   | 30       | –        | 15 (EN) / 20 (FR)
RNN hidden units      | 300      | 128      | 1024
Embedding dimension   | 300      | –        | 300
Batch size            | 64       | 128      | 256
Bidirectional         | Yes      | No       | No
Optimizer             | Adam [9] | Adam     | Adam


9.6 Discussion 9.6.1 Fidelity In this section, we compare the performance of each RNN model against the RNN model protected using the proposed framework. Quantitative Results: as seen in Tables 9.3, 9.4, and 9.5, all of the protected RNN models were able to achieve comparable performance to their respective baseline models with only slight drop in performance. The largest drop in performance (i.e., BiGRU.kt ) is only less than 2.5% among all models that are embedded with keys and trigger set. In short, embedding keys, trigger set, and sign signature has minimal impact on the performance of the RNN model in their respective tasks. However, for text classification on TREC-6, there is only a small drop in performance when an invalid key is presented for both LSTM and GRU (see Table 9.3). This may be due to easier classification task (6 classes in TREC-6) as compared to the others (10 classes in SeqMNIST, 15,000 vocabulary/class in WMT14 EN-FR). Nevertheless, ownership can still be verified

Table 9.3 Quantitative results on TREC-6. T represents metrics evaluated on the trigger set (the value in brackets is the performance metric when an invalid key is used)

Model      | Train time | Acc.  | Acc. w. key   | T acc. | T acc. w. key | Sign acc.
BiLSTM     | 1.57       | 87.88 | –             | –      | –             | –
BiLSTM.k   | 6.53       | 86.71 | 86.92 (76.03) | –      | –             | 100 (99.52)
BiLSTM.kt  | 6.61       | 86.16 | 86.21 (75.78) | 100    | 99.81 (44.79) | 100 (99.78)
BiGRU      | 1.60       | 88.48 | –             | –      | –             | –
BiGRU.k    | 6.34       | 87.46 | 87.64 (84.11) | –      | –             | 100 (98.65)
BiGRU.kt   | 6.38       | 86.05 | 86.79 (83.76) | 100    | 100 (64.58)   | 100 (99.19)

Table 9.4 Quantitative results on SeqMNIST

Model    | Train time | Acc.  | Acc. w. key   | T acc. | T acc. w. key | Sign acc.
LSTM     | 4.86       | 98.38 | –             | –      | –             | –
LSTM.k   | 18.85      | 98.36 | 98.37 (18.36) | –      | –             | 100 (69.93)
LSTM.kt  | 19.53      | 98.17 | 98.18 (18.37) | 100    | 99.80 (6.51)  | 100 (66.28)
GRU      | 4.74       | 98.36 | –             | –      | –             | –
GRU.k    | 17.66      | 98.30 | 98.30 (22.68) | –      | –             | 100 (61.13)
GRU.kt   | 18.69      | 97.97 | 97.95 (21.15) | 99.80  | 99.80 (9.57)  | 100 (60.88)

Table 9.5 Quantitative results on WMT14 EN-FR

Model       | Train time | BLEU  | BLEU w. key   | T BLEU | T BLEU w. key | Sign acc.
Seq2Seq     | 3062.83    | 29.33 | –             | –      | –             | –
Seq2Seq.k   | 6090.78    | 29.60 | 29.74 (14.92) | –      | –             | 100 (51.65)
Seq2Seq.kt  | 6947.22    | 29.11 | 29.15 (13.62) | 100    | 100 (0.11)    | 100 (49.80)


Table 9.6 Qualitative results on TREC-6. The best-performing model that has both white-box and black-box protection is selected to demonstrate model performance with and without a valid key

Input                                            | Ground truth | Prediction with valid key | Prediction with invalid key
What is Mardi Gras?                              | DESC         | DESC                      | ENTY
What date did Neil Armstrong land on the moon?   | NUM          | NUM                       | DESC
What is New York 's state bird?                  | ENTY         | ENTY                      | DESC
How far away is the moon?                        | NUM          | NUM                       | LOC
What strait separates North America from Asia?   | LOC          | LOC                       | ENTY

Table 9.7 Qualitative results on SeqMNIST (the input column, showing handwritten digit images, is omitted here)

Ground truth | Prediction with valid key | Prediction with invalid key
2            | 2                         | 7
4            | 4                         | 7
5            | 5                         | 6
6            | 6                         | 0
8            | 8                         | 0

as there is still a consistent drop in performance on the trigger set or by extracting the sign signature.

Qualitative Results Tables 9.6, 9.7, and 9.8 show a few examples of incorrect predictions for different learning tasks when an invalid key is used compared to


Table 9.8 Qualitative results on WMT14 EN-FR Input They were very ambitious The technology is there to do it What sort of agreement do you expect between the cgt and goodyear? To me, this is n’t about winning or losing a fight But that ’s not all

Ground truth ils étaient très ambitieux la technologie est la pour le faire quel type d’ accord attendez-vous entre la cgt et goodyear? pour moi, ceci n’ est pas à propos de gagner ou de perdre une lutte mais ce n’ est pas tout

Prediction with valid key ils ont très ambitieux

Prediction with invalid key elles ont .unk. .unk. en la technologie est la la technologie le la pour le faire presente le .unk. quel type d’ accord quel genre de accord .unk. entre le .unk. et .unk. entre le .unk. et le? le? pour moi, ceci n’ est pour moi, n’ est pas le pas à de gagner le à .unk. pour de de perdre une lutte mais ce n’ est pas tout mais cela n’ est pas le à

when the valid key which is embedded into the RNN model during training phase is used. For classification tasks, Tables 9.6 and 9.7, when an invalid key is used, the RNN model gets confused between similar classes, i.e., DESC and ENTY for TREC-6 and 5 and 6 for SeqMNIST. For machine translation task, Table 9.8 demonstrates the translation results when a valid key is used versus when an invalid key is used. Observe that when an invalid key is used, the RNN model can still translate accurately at the beginning of the sentence, but the translation quality quickly deteriorated toward the end of the sentence. This conforms to our idea and design of key gate (Sect. 9.4.1) where the information (hidden state) passed between time steps would be disrupted with an invalid key and the output of RNN would deviate further from the ground truth the longer the time steps are.

9.6.2 Robustness Against Removal Attacks Pruning Pruning is a very common model modification technique to reduce redundant parameters without compromising the performance. It is a model compression technique that optimizes the model for real-time inference on resource-constrained devices. There are various types of pruning methods. Here, we use a global unstructured l1 pruning and test our protected RNN models with different pruning rates. This is to test the effectiveness of our proposed protection scheme when the counterfeited model weight is pruned. Figure 9.4 shows that, for classification models, i.e., (a) and (b) even when 60% of the model parameters are pruned, trigger set accuracy still has about 80–90% accuracy, accuracy on test set drops by 10%– 20%, while sign accuracy still maintained near 100%. As for translation task (c), the trained model is rather sensitive to pruning, with 20% parameters pruned, the model maintained 100% sign accuracy, while BLEU score on test set dropped by 30% and


Fig. 9.4 Robustness of protected model against removal attack (model pruning). Classification accuracy for classification tasks, BLEU score for translation task, and sign signature accuracy against different pruning rate

BLEU score on trigger set did not decrease at all. At 40% parameters pruned, BLEU score on test set dropped to 0, resulting in a useless model, and however, trigger set BLEU score decreased by 50%, while sign signature still has near 90% accuracy. This proves that our method is robust against removal attempt by model pruning as it will hurt the model performance before the signature can be removed. Fine-Tuning Fine-tuning is another common model modification technique to further improve a model performance by making very small adjustment to the model weights during training phase. Here, we simulate an attacker that fine-tunes a stolen model with training dataset to obtain a new model that inherits the performance of the stolen model while attempting to remove the embedded watermarks. The attacker does not have knowledge of how watermarks are embedded in the stolen model. In short, the host model is initialized using trained weights with embedded watermarks and then is fine-tuned without the presence of the key, trigger, set and regularization terms, i.e., LS . In Tables 9.9, 9.10, and 9.11, we can observe a consistently 100% sign signature accuracy after the model is fine-tuned. When the embedded key is presented to the fine-tuned model, all of them achieved comparable performance if not slightly higher performance on both test set and trigger set when

Table 9.9 Robustness against removal attacks on TREC-6. T represents metrics evaluated on the trigger set; all metrics reported are performance with the original key

Model      | Acc.  | T acc. | Sign acc.
BiLSTMkt   | 86.21 | 99.81  | 100
Fine-tune  | 86.56 | 98.77  | 100
Overwrite  | 85.91 | 98.08  | 100
BiGRUkt    | 86.79 | 100    | 100
Fine-tune  | 86.69 | 99.23  | 100
Overwrite  | 86.02 | 98.08  | 100

Table 9.10 Robustness against removal attacks on SeqMNIST

Model      | Acc.  | T acc. | Sign acc.
LSTMkt     | 98.18 | 99.8   | 100
Fine-tune  | 98.28 | 99.6   | 100
Overwrite  | 97.52 | 52     | 100
GRUkt      | 97.95 | 99.8   | 100
Fine-tune  | 98.09 | 99.4   | 100
Overwrite  | 97.53 | 78     | 100

Table 9.11 Robustness against removal attacks on WMT14 EN-FR

Model       | BLEU  | T BLEU | Sign acc.
Seq2Seqkt   | 29.15 | 100    | 100
Fine-tune   | 29.51 | 100    | 100
Overwrite   | 29.04 | 100    | 100

compared to the stolen model. This shows that the proposed method is robust against removal attempts by fine-tuning as the key remains embedded even after fine-tuning showing that an attacker cannot remove embedded watermarks (keys) by fine-tuning the model. Overwriting We also simulate an overwriting scenario where the attacker has knowledge of how the model is protected and attempts to embed a new key, k into the trained model using the same method as detailed in Sect. 9.4.1. In Tables 9.9, 9.10, and 9.11, we can see a consistent 100% sign accuracy after the protected model is overwritten with a new key. When inferencing with the embedded key, the performance of most models only dropped slightly (less than 1%) with the exception of trigger set on SeqMNIST (see Table 9.10). Nevertheless, the correct ownership can still be verified through the signature. Empirically, this affirms that the embedded key and signature cannot be removed by overwriting new keys. However, this introduces an ambiguous situation where there are multiple keys (i.e., the overwritten new key) that satisfy the key verification process as denoted in Sect. 9.4.4. To resolve the ambiguity, we can retrieve sign signature to verify ownership (see next).
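For concreteness, the sketch below simulates the two removal attacks discussed above (global unstructured L1 pruning and plain fine-tuning) with PyTorch utilities. It is an illustration under assumed names (model, train_loader, loss_fn), not the experimental code used in this chapter.

```python
# Minimal sketch (illustrative) of the removal attacks: global L1 pruning and
# fine-tuning without the key, trigger set, or sign-loss regularizer.
import torch
import torch.nn.utils.prune as prune

def prune_attack(model: torch.nn.Module, amount: float = 0.6):
    # global unstructured L1 pruning over every weight parameter in the model
    params = [(mod, name) for mod in model.modules()
              for name, _ in mod.named_parameters(recurse=False)
              if "weight" in name]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured,
                              amount=amount)
    return model

def finetune_attack(model, train_loader, loss_fn, epochs=5, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()   # no key, triggers, or L_S term
            opt.step()
    return model
```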


9.6.3 Resilience Against Ambiguity Attacks Through Tables 9.9, 9.10, and 9.11 and Fig. 9.4, we can observe that the embedded signature remains persistent after removal attacks as the accuracy remains near 100% throughout the experiments. Thus, we can deduce that enforcing the sign in the hidden state of the key using sign loss (Eq. (9.8)) is robust against diverse adversarial attacks. Here, we simulated a scenario of an insider threat where the key embedding and sign signature are completely exposed. With those knowledge, an illegal party was able to introduce an ambiguous situation by embedding a new key simultaneously. An illegal party may also attempt to modify the sign signature in the hidden state of key. However, the sign signature cannot be changed easily without compromising the model’s performance. As shown in Fig. 9.5, the model’s performance decreases significantly when 40% of the original signs are modified. In text classification task on TREC-6, the model’s accuracy dropped from 86.21 to 60.93; in sequential image classification task on SeqMNIST, the model’s accuracy dropped from 98.18

Fig. 9.5 Classification accuracy for classification tasks, BLEU score for translation task, and sign signature accuracy when different percentage (%) of the sign signature B is being modified/compromised


to 16.87, which is barely better than a random-guessing model; as for the translation task on WMT14 EN-FR, the model's performance dropped by 90% (BLEU score from 29.15 to 2.27). Based on these studies, we can conclude that the signs enforced in this way remain persistent against ambiguity attacks, and illegal parties will not be able to employ new (modified) sign signatures without compromising the protected model's performance.

9.6.4 Secrecy One of the requirements for watermarking techniques is secrecy [2], which means the presence of the watermark embedded should be secret and undetectable to prevent unauthorized parties from detecting it. In terms of digital watermarking in DNN, an unauthorized party may try to detect the presence of a watermark by inspecting the model weights, .W. In other words, this means that the protected model’s weights should stay roughly the same as an unprotected model. Figure 9.6 shows the weight distribution of the protected models and the original/vanilla

Fig. 9.6 Comparison of weight distribution between original and protected RNN layers


model; the weight distribution of the protected RNN layers is identical to the original RNN layers. Hence, one cannot easily identify whether an RNN model is protected with our proposed protection framework or not.

9.6.5 Time Complexity As stated by Fan et al. [6], computational cost at inference stage should be minimized as it will be performed frequently by end users while extra costs at training and verification stages are not prohibitive as they are performed by network owners with the motivation to protect DNN model ownership. In our proposed public ownership scheme (Item 2), since the key will not be distributed with the model, meaning the key is not needed during inference, there will be no additional computational cost (as in forward pass of a protected model is the same as the vanilla baseline model during inference). Tables 9.3, 9.4, and 9.5 shows the training time of protected RNN models with public ownership scheme. Since two forward passes are done during training stage, one can expect training time to increase by roughly 200%. For private ownership scheme (Item 1), only one forward pass is done during training stage, so the training time will not increase significantly. However, since the key is needed during the inference stage, there will be a small overhead added to the inference time.

9.6.6 Key Gate Activation

To prove the concept of the proposed key gate (Eq. (9.5)), we examine the output activation of each key gate in all models in the experiments (i.e., BiLSTM.kt, BiGRU.kt, LSTM.kt, GRU.kt, and Seq2Seq.kt).

Fig. 9.7 Comparison of the key gate activation (kg_t) distribution of a Seq2Seq.kt model between a valid key and an invalid key

Figure 9.7 illustrates the distribution


of key gate activation, .kgt when a valid and an invalid key are used. As seen in Fig. 9.7, when a valid key is used, the key gate activation is mostly near 1.0 allowing the appropriate flow of hidden states between time steps, while when an invalid key is used, the key gate activation is randomly distributed between 0.0 and 1.0, thus disrupting the flow of hidden states between time steps which causes the model to perform poorly.

9.7 Conclusion We successfully illustrated a complete and robust ownership verification scheme for RNNs in both white-box and black-box settings by embedding keys, sign signature, and trigger set. While only two major variants of RNN (i.e., LSTM and GRU) are experimented with, the formulation of key gate is generic and can be applied to other variants of RNN as well. Through extensive experiments, empirical results showed that the proposed method is robust against removal and ambiguity attacks, which aim to either remove the embedded key or embed counterfeit watermarks. We also showed that the performance of the protected model’s original tasks was not affected. For future works, one can try to apply the proposed key gate on other types of RNN such as Peephole LSTM, Multiplicative LSTM, RNN with attention mechanism, etc. Due to resources and time limits, we only experimented with three types of tasks, i.e., text classification, sequential image classification, and machine translation. Other types of learning tasks such as text summarization, text generation, and time series prediction can also be experimented with to prove the generality of the proposed protection framework. To compete with each other and gain business advantage, a large amount of resources is being invested by giant and/or startup companies to develop new DNN models. Hence, we believe that it is important to protect these inventions from being stolen or plagiarized. We hope that the ownership verification for RNNs will provide technical solutions in reducing plagiarism and thus lessen wasteful lawsuits.

References 1. Adi, Y., Baum, C., Cisse, M., Pinkas, B., Keshet, J.: Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In: 27th USENIX Security Symposium (USENIX Security 18), pp. 1615–1631 (2018). USENIX Association, Baltimore, MD 2. Boenisch, F.: A survey on model watermarking neural networks. CoRR, abs/2009.12153 (2020) 3. Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., Monz, C., Pecina, P., Post, M., Saint-Amand, H., Soricut, R., Specia, L., Tamchyna, A.: Findings of the 2014 workshop on statistical machine translation. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 12–58, Baltimore, Maryland, USA (2014). Association for Computational Linguistics, New York


4. Chen, H., Rouhani, B.D., Fu, C., Zhao, J., Koushanfar, F.: Deepmarks: A secure fingerprinting framework for digital rights management of deep learning models. In: Proceedings of the 2019 on International Conference on Multimedia Retrieval, pp. 105–113 (2019) 5. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Association for Computational Linguistics, Doha, Qatar (2014) 6. Fan, L., Ng, K.W., Chan, C.S.: Rethinking deep neural network ownership verification: Embedding passports to defeat ambiguity attacks. In: Advances in Neural Information Processing Systems (NeurIPS) (2019) 7. Guo, J., Potkonjak, M.: Watermarking deep neural networks for embedded systems. In: Proceedings of the International Conference on Computer-Aided Design (ICCAD ’18). Association for Computing Machinery, New York (2018) 8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 9. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2014) 10. Krause, B., Lu, L., Murray, I., Renals, S.: Multiplicative LSTM for sequence modelling. CoRR, abs/1609.07959 (2016) 11. Le, Q.V., Jaitly, N., Hinton, G.E.: A simple way to initialize recurrent networks of rectified linear units. CoRR, abs/1504.00941 (2015) 12. Le Merrer, E., Perez, P., Trédan, G.: Adversarial frontier stitching for remote neural network watermarking. CoRR, abs/1711.01894 (2017) 13. Li, X., Roth, D.: Learning question classifiers. In: Proceedings of the 19th International Conference on Computational Linguistics—Volume 1 (COLING ’02), pp. 1–7. Association for Computational Linguistics, USA (2002) 14. Ong, D.S., Chan, C.S., Ng, K.W., Fan, L., Yang, Q.: Protecting intellectual property of generative adversarial networks from ambiguity attack. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 15. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: A method for automatic evaluation of machine translation (ACL ’02), pp. 311–318. Association for Computational Linguistics, USA (2002) 16. Rouhani, B.D., Chen, H., Koushanfar, F.: Deepsigns: A generic watermarking framework for IP protection of deep learning models. CoRR, abs/1804.00750 (2018) 17. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997) 18. Uchida, Y., Nagai, Y., Sakazawa, S., Satoh, S.: Embedding watermarks into deep neural networks. In: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval (2017) 19. Zhang, J., Gu, Z., Jang, J., Wu, H., Stoecklin, M.P., Huang, H., Molloy, I.: Protecting intellectual property of deep neural networks with watermarking. In: Proceedings of the 2018 on Asia Conference on Computer and Communications Security (ASIACCS ’18), pp. 159–172. Association for Computing Machinery, New York (2018) 20. Zhang, J., Chen, D., Liao, J., Zhang, W., Hua, G., Yu, N.: Passport-aware normalization for deep model protection. In: Advances in Neural Information Processing Systems (NeurIPS) (2020) 21. Zhou, P., Qi, Z., Zheng, S., Xu, J., Bao, H., Xu, B.: Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. CoRR, abs/1611.06639 (2016)

Part III

Applications

Chapter 10

FedIPR: Ownership Verification for Federated Deep Neural Network Models Bowen Li, Lixin Fan, Hanlin Gu, Jie Li, and Qiang Yang

Abstract In federated learning, multiple clients collaboratively develop models upon their private data. However, IP risks including illegal copying, re-distribution, and free-riding threaten the collaboratively built models in federated learning. To address IP infringement issues, in this chapter we introduce a novel deep neural network ownership verification framework for secure federated learning that allows each client to embed and extract private watermarks in federated learning models for legitimate IPR. In the proposed FedIPR scheme, each client independently extracts the watermarks and claims ownership of the federated learning model while keeping the training data and watermarks private.

10.1 Introduction

In federated learning, multiple clients collaboratively develop models upon their private data, which requires substantial effort and cost in terms of expertise, dedicated hardware, and labeled training data. However, during the development and deployment of federated learning, the trained federated learning models are subject to severe IPR threats [6, 10, 17, 22]. Firstly, unauthorized parties such as commercial competitors might illegally copy or misuse the DNN model without authority [6]. Secondly, Fraboni et al. discovered the phenomenon that freeriders

B. Li · J. Li Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China e-mail: [email protected]; [email protected] L. Fan () · H. Gu WeBank AI Lab, WeBank, Shenzhen Bay Technology and Ecological Park, Shenzhen, China e-mail: [email protected] Q. Yang Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 L. Fan et al. (eds.), Digital Watermarking for Machine Learning Model, https://doi.org/10.1007/978-981-19-7554-7_10


could access the valuable models with no computation for the FedDNN model training [10]. In federated learning scenario, distributed parties contribute private data to train a global model, i.e., FedDNN [15, 26], and thus we need to protect the model IPR while protecting each party’s private data from leaking to other parties. As a prerequisite for model IPR, DNN watermarking techniques are proposed [2, 5, 9, 12, 16, 19, 23, 30] to embed watermarks into neural network models, and when the model ownership is in question, the embedded watermarks are extracted from the model to support ownership verification. Both feature-based watermarks [5, 19, 23] and backdoor-based watermarks [2] were proposed for the IPR of DNN models. Also, from the data privacy perspective in federated learning, where semihonest adversaries may attempt to reconstruct other party’s sensitive data, a secure federated learning (SFL) framework has been proposed [11, 15, 26] to train FedDNN models while keeping training data [31] and data property [14] from discovered. We propose that in the secure federated learning scenario, both the training data and the model IP should be protected; each client in SFL should be able to prove ownership of the trained model without disclosing their private watermarks or any information about private training data. The data privacy has been fulfilled by homomorphic encryption (HE) [18], differential privacy (DP) [1], or secret sharing [20]. The IPR protection demand motivates the research illustrated in this chapter. To protect model IPR in SFL scenario, we propose a general model watermarking framework called FedIPR consisting of (a) a watermark embedding process that allows each client in SFL to embed his/her own watermarks, (b) a verification process that allows each client in SFL to independently verify the ownership of model. Under the FedIPR framework, we provide a technical implementation for both processes and investigate the capacity and robustness issues for watermarks in FedDNN models: • Challenge A: how to ensure that different clients can embed private watermarks in the same time? When different clients’ watermarks are embedded into the same neural network model, those watermarks may potentially conflict with each other. As a solution, we propose a feature-based watermarking method that combines secret watermark extraction matrix and analyze the condition that different watermarks do not conflict with each other. • Challenge B: Whether the embedded watermarks are resilient to secure federated learning strategies? This challenge is because that various federated learning strategies, e.g., differential privacy [1] and client selection [15], have modified the local model updates in SFL training process. Our empirical investigations show that in FedIPR framework, robust watermarks are persistent under various SFL strategies. Also, in this chapter, we provide experimental results on image classification and text processing tasks that demonstrate that the proposed watermarks can be persistently detected for model IP protection.


10.2 Related Works 10.2.1 Secure Federated Learning In Secure Federated Learning (SFL) [11, 15, 26], multiple clients aim to collaboratively build up a DNN model without training data leakage [1, 18, 20, 21]. Moreover, techniques such as homomorphic encryption [18], differential privacy [1], and secret sharing [20] were widely adopted to protect exchanged model updates in SFL [15, 26].

10.2.2 DNN Watermarking Methods Model watermarking techniques were proposed to defend model IP infringements, and those methods can be divided into two categories. Backdoor-based methods [2, 13, 28] injected designated backdoor into the model in the training process and collected evidence of ownership in black-box manner. For Feature-based methods, designated binary strings as watermarks are embedded into the model parameters [5, 9, 19, 23, 29] and verified in white-box manner. For federated learning model ownership verification, Atli et al. [4] adopted backdoor-based model watermarking techniques to enable ownership verification from the central server, and however, they did not allow each client in FL to verify model intellectual property independently.

10.3 Preliminaries We first formulate key components of the SFL and existing model watermarking techniques.

10.3.1 Secure Horizontal Federated Learning

In a secure SFL setting [26], K clients collaboratively build local models with their own data and send the local models {W_k}_{k=1}^{K} to a central server to train a global model. The server aggregates the uploaded local updates with the following FedAvg algorithm [11, 15, 26]:

   W ← Σ_{k=1}^{K} (n_k / K) W_k,    (10.1)

where n_k is the averaging weight of each local update W_k.


10.3.2 Freeriders in Federated Learning

In FL, freerider clients [10] are those clients who contribute neither computing resources nor private data but upload superficial local updates in order to access the valuable models. There are several strategies for constructing a superficial local model [10]:

Plain Freerider Freeriders create a superficial update as below [10]:

   W_free = Free(W_t, W_{t−1}),    (10.2)

where W_t and W_{t−1} denote two previous updates. The superficial local updates are linear combinations of saved copies of previous model updates.

Gaussian Freeriders Freeriders add Gaussian noise to the previous global model parameters W_{t−1} to simulate a superficial local update:

   W_free = W_t + ξ_t,   ξ_t ∼ N(0, σ_t).    (10.3)
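As a rough sketch of the two constructions in Eqs. (10.2) and (10.3), assuming model parameters are held in state dictionaries and choosing one particular linear combination for Free(), a freerider's local "update" could look like the following; this is illustrative, not code from the chapter.

```python
# Minimal sketch (illustrative) of freerider update construction.
import torch

def plain_freerider(W_t: dict, W_prev: dict, lam: float = 1.0) -> dict:
    # one possible linear combination of the two saved global models (Eq. 10.2)
    return {k: W_t[k] + lam * (W_t[k] - W_prev[k]) for k in W_t}

def gaussian_freerider(W_t: dict, sigma_t: float = 1e-3) -> dict:
    # previous global parameters plus Gaussian noise (Eq. 10.3)
    return {k: v + sigma_t * torch.randn_like(v.float()) for k, v in W_t.items()}
```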

10.3.3 DNN Watermarking Methods

Now we proceed to the formulation of two categories of DNN watermarking methods:

Feature-Based Watermarks [5, 8, 19, 23] In the watermark embedding step, an N-bit target binary watermark S ∈ {0, 1}^N is embedded into the model parameters W during the model training process with regularization terms. During the verification step, an extractor θ extracts watermarks S̃ from the model parameters, and S̃ is compared with the prescribed watermarks S to test whether the model belongs to the party who claims ownership with the watermarks:

   V_W(W, (S, θ)) = TRUE if H(S, S̃) ≤ ε_W, and FALSE otherwise,    (10.4)

in which V_W() is the watermark detection process in a white-box manner.

Backdoor-Based Watermarks [2, 13] Backdoor-based watermarks are embedded in the form of a backdoor into the model during training. In the verification step, triggers T are input to the model N(). If the classification error of the model on the designated backdoor dataset is less than a threshold ε_B, the ownership is verified:

   V_B(N, T) = TRUE if E_T[I(Y_T ≠ N(X_T))] ≤ ε_B, and FALSE otherwise,    (10.5)


in which .VB () is the watermark detection process that only accesses model output in a black-box manner.

10.4 Proposed Method In this chapter, we introduce a watermark embedding and extraction scheme called FedIPR in SFL scenario. FedIPR ensures that each client can embed and extract its own private watermarks.

10.4.1 Definition of FedDNN Ownership Verification with Watermarks

We give below the formulation of FedIPR, which is also pictorially illustrated in Fig. 10.1.

Definition 10.1 FedDNN ownership verification (FedIPR) in SFL is a tuple V = (G, E, A, V_W, V_B), which consists of:

1. A client-side watermark generation process G() → (S_k, θ_k, T_k) generates target watermarks S_k and T_k and an extractor θ_k = {S_k, E_k}; in the secret extraction parameters θ_k = {S_k, E_k}, S_k denotes the watermark embedding location, and E_k denotes the secret extraction matrix.

Fig. 10.1 An illustration of FedIPR scheme. Each client generates and embeds private watermarks into the local models, which are aggregated in a global model (the left panel). In case the model is plagiarized, each client is able to call a verification process to extract watermarks from the model in both black-box and white-box manners (the right panel)


2. The watermark embedding process E() embeds trigger samples T_k and feature-based watermarks S_k by optimizing the two regularization terms L_W and L_B,

   L_k := L_{D_k}(W^t) + α_k L_{B_k}(W^t) + β_k L_{W_k}(W^t),   k ∈ {1, · · ·, K},    (10.6)

where the three terms correspond to the main task, the backdoor-based watermark, and the feature-based watermark, respectively, and D_k denotes the private data of client k. Then a local update process ClientUpdate(L_k, W^t) := argmin L_k optimizes the local model parameters and sends the local model to the aggregator (a code sketch of this local objective and the aggregation step is given after this definition).

3. A federated learning model aggregation process A() collects local updates from m randomly selected clients and performs model aggregation with the FedAvg algorithm [15], i.e.,

   W^{t+1} ← Σ_{k=1}^{K} (n_k / n) W_k^{t+1},    (10.7)

where W_k^{t+1} ← ClientUpdate(L_k, W^t) is the local model of client k, and n_k/n denotes the aggregation weight of the FedAvg algorithm.

4. To detect freeriders in secure federated learning, a server-side watermark verification process V_G checks whether the feature-based watermarks S = ∪_{k=1}^{K} S_k, θ = ∪_{k=1}^{K} θ_k can be successfully verified from the global model W, i.e.,

   V_G(W, S, θ) = V_W(W, (S, θ)),    (10.8)

.



TRUE, if ETk (I(YTk = N(XTk ))) ≤ B , FALSE, otherwise,

(10.9)

where .I() is an indicator function. ˜ 6. White-box  verification process .VW () extracts feature-based watermarks .Sk = sgn W, θk from the parameters .W with a sign function .sgn(),   .VW W, (Sk , θk ) =



TRUE, if H(Sk , S˜k ) ≤ W , FALSE, otherwise,

(10.10)

and if the extracted watermarks match the prescribed ones, the ownership is verified.

10 FedIPR: Ownership Verification for Federated Deep Neural Network Models

199

Watermark Detection Rate • For a feature-based watermark S with N bit-length, the detection rate .ηF is defined as ηF := 1 −

.

1 H(Sk , S˜k ). N

(10.11)

• For backdoor-based watermarks .T, the detection rate is ηT := ET (I(YT = N(XT ))),

.

(10.12)

which is defined as the accuracy of model classification over backdoor samples as the designated labels. There are two technical challenges that we will inevitably encounter while embedding watermarks in federated learning models.

10.4.2 Challenge A: Capacity of Multiple Watermarks in FedDNN In federated learning, multiple feature-based watermarks are embedded by K clients, and it leaves an open problem whether there is a common solution for different watermarks simultaneously in FedDNN. For a set of feature-based watermarks .{(Sk , θk )}K k=1 to be embedded into a FedDNN parameters .W, each client k embeds N bit-length of watermarks .Sk ∈ {+1, −1}N , and the extracted watermarks .S˜k = WEk are compared with the targeted watermarks .Sk for ownership verification: ∀j ∈ {1, 2, . . . , N}

.

k ∈ K, tkj (WEk )j > 0.

and

(10.13)

For example, as Fig. 10.2 shows, K different watermarks try to control the .sgn() of target parameters .Wk under a private condition on .Sk , but local models are aggregated into an unified .W, and these watermarking constraints may conflict. Theorem 10.1 elucidates the requirements where a common solution exists for K different .Sk in the same model without conflicts. Theorem 10.1 If K different watermarks (N bit-length each) are embedded in M channels of the global model parameters .W, we note watermark detection rate as .ηF , the watermark detection rate .ηF holds that the following: Case 1:

If .KN ≤ M, then there exists .W such that ηF = 1.

.

(10.14)

200

B. Li et al.

Unified

to an Aggregate unified one

1

Client

1



Client



= “11…0011”

= “010…0101”

Client

= “001…1010”

Fig. 10.2 Different clients in federated learning adopt different regularization terms to embed feature-based watermarks

Case 2:

If .KN > M, then, there exists .W such that ηF ≥

.

KN + M . 2KN

(10.15)

The proof is deferred in [7]. Theoretical results in case 1 give the conditions of watermark existence in a unified model, and results in case 2 (.KN > M) provide the lower bound of watermark detection rate.

10.4.3 Challenge B: Watermarking Robustness in SFL The robustness of watermarking indicates whether the watermarks can be persistently embedded and extracted against various model training strategies and attacks that might remove or modify the watermarks. In this chapter, we report the robustness of watermarks in terms of detection rate under varying unfriendly settings for model watermarking. Training Strategies In SFL, privacy-preserving techniques like DP [25] and client selection [15] are widely used for data privacy and communication efficiency. The model modification brought by those training strategies may affect the detection rate of watermarks: • Differential privacy mechanism [1] in secure federated learning adds noise to the local model updates of each client. • In every training round, the server adopts client selection strategies [15] to sample a fraction of clients and aggregate their local updates.

10 FedIPR: Ownership Verification for Federated Deep Neural Network Models

201

Extensive experimental results in Sect. 10.6.4 show that watermarks can be reliably detected under varying setting of training strategies.

10.5 Implementation Details

Section 10.4 gives the definition of the FedIPR framework, which enables clients to independently embed and extract private watermarks. In this section, we give an implementation in Algorithm 1.

10.5.1 Watermark Embedding and Verification

In FedIPR, each client k embeds its own (S_k, θ_k) with a regularization term L_W = L_{S_k, θ_k}:

L_{S_k, θ_k}(W^t) = L_{S_k}(S_k, W^t, E_k),    (10.16)

where the watermark extractor θ_k = (S_k, E_k) is kept secret. We propose to embed S_k into the normalization-layer scale parameters, i.e., S_k(W) = W_γ = {γ_1, · · · , γ_C}. FedIPR adopts a secret embedding matrix E_k ∈ θ_k = (S_k, E_k) for watermark embedding, and the watermarks are embedded with the following regularization term:

L_{S_k, θ_k}(W^t) = L_{S_k}(W^t_γ E_k, S_k) = HL(S_k, S̃_k) = Σ_{j=1}^{N} max(μ − b_j t_j, 0),    (10.17)

in which S̃_k = W^t_γ E_k are the extracted watermarks.

10.5.1.1 Watermarking Design for CNN

As illustrated in Fig. 10.3, for a CNN model, we choose the normalization-layer weights W_γ to embed feature-based watermarks.
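As a concrete illustration of Eq. (10.17), the hinge-like sign loss can be written in a few lines of PyTorch (a minimal sketch under our own naming; gamma, E_k, S_k, and mu stand for W_γ, the secret embedding matrix, the target bits, and the margin, respectively):

import torch

def watermark_regularizer(gamma, E_k, S_k, mu=0.5):
    """Hinge-like sign loss of Eq. (10.17).

    gamma : normalization-layer scale weights W_gamma, shape (C,)
    E_k   : secret embedding matrix, shape (C, N)
    S_k   : target watermark bits in {+1, -1}, shape (N,)
    mu    : margin hyper-parameter
    """
    b = gamma @ E_k                                   # extracted soft bits S~_k
    return torch.clamp(mu - b * S_k, min=0).sum()     # sum_j max(mu - b_j t_j, 0)

During verification only the signs of gamma @ E_k are compared against S_k, so the margin μ simply pushes each extracted bit away from zero in the direction of its target sign.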


Fig. 10.3 Layer structure of a convolution layer and normalization layer weights .Wγ

Algorithm 1 Embedding process E(·) of watermarks
1: Each client k holds its own watermark tuple (S_k, θ_k, T_k)
2: for communication round t do
3:    The server distributes the global model parameters W^t to each client and randomly selects cK out of K clients.
4:    Local Training:
5:    for k in the selected cK of K clients do
6:       Sample a mini-batch of m training samples X = {X^(1), · · · , X^(m)} and targets Y = {Y^(1), · · · , Y^(m)}.
7:       if backdoor-based watermarks are enabled then
8:          Sample t samples {X_Tk^(1), · · · , X_Tk^(t)} and {Y_Tk^(1), · · · , Y_Tk^(t)} from the trigger set (X_Tk, Y_Tk)
9:          Concatenate X with {X_Tk^(1), · · · , X_Tk^(t)} and Y with {Y_Tk^(1), · · · , Y_Tk^(t)}.
10:      end if
11:      Compute the cross-entropy loss L_c using X and Y {the batch-poisoning approach is adopted, thus α_l = 1 and L_c = L_Dk + L_Tk}.
12:      for layer l in the targeted layer set L do
13:         Compute the feature-based regularization term L^l_{Sk,θk} using θ_k and W^l
14:      end for
15:      L_{Sk,θk} ← Σ_{l∈L} L^l_{Sk,θk}
16:      L_k = L_c + β_k L_{Sk,θk}
17:      Backpropagate using L_k and update W^t_k
18:   end for
19:   Server Update:
20:   Aggregate the local models {W^t_k}_{k=1}^K with the FedAvg algorithm
21: end for
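The server-side aggregation in line 20 is plain FedAvg; a minimal sketch (assuming equally weighted clients and PyTorch state_dicts, which is our simplification) could look as follows:

import torch

def fedavg(client_states):
    """Average a list of client state_dicts with equal weights (simplified)."""
    global_state = {}
    for name in client_states[0]:
        global_state[name] = torch.stack(
            [state[name].float() for state in client_states]
        ).mean(dim=0)
    return global_state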

10.5.1.2 Watermarking Design for Transformer-Based Networks

As illustrated in Fig. 10.4, we also apply feature-based watermarks to transformer-based networks. A transformer encoder block [24] includes a layer-normalization layer, which has an important impact on the convergence of the network. We embed watermarks into the weights W_γ of the layer-normalization layer.
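For instance, the scale vectors of every layer-normalization layer in a PyTorch transformer can be collected as follows (an illustrative snippet of ours, not the FedIPR code):

import torch.nn as nn

def collect_layernorm_scales(model):
    """Return the W_gamma (scale) parameters of all LayerNorm layers."""
    return {
        name: module.weight
        for name, module in model.named_modules()
        if isinstance(module, nn.LayerNorm)
    }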


Fig. 10.4 Layer structure of an encoder block: normalization layer weights Wγ (in green) are used to embed feature-based watermarks

10.6 Experimental Results

This section presents the experimental results of the proposed FedIPR. The strong watermark detection performance demonstrates that FedIPR provides a reliable scheme for watermark embedding in FedDNN.

10.6.1 Fidelity

We compare the main task performance Acc_main of FedIPR against FedAvg to report the fidelity of the proposed FedIPR. In four different training tasks, a varying number of clients (from 10 to 100) may decide to embed backdoor-based watermarks (20 to 100 per client) and feature-based watermarks of different bit-lengths (50 to 500 bits per client). Figure 10.5 presents the worst-case drop of the classification accuracy Acc_main. Under these various watermarking settings, only a slight performance drop (not more than 2% compared with FedAvg) is observed across the four separate tasks.

(a) AlexNet with CIFAR10  (b) ResNet18 with CIFAR100  (c) DistilBERT with SST2  (d) DistilBERT with QNLI

Fig. 10.5 Subfigures (a)–(d) illustrate the main task accuracy Acc_main in image and text classification tasks with a varying number K of total clients (from 10 to 100). The results cover various settings of feature-based and backdoor-based watermarks; the main task accuracy Acc_main of FedIPR drops only slightly (not more than 2%) compared to the FedAvg scheme

(a) AlexNet with CIFAR10  (b) ResNet18 with CIFAR100  (c) DistilBERT with SST2  (d) DistilBERT with QNLI

Fig. 10.6 Subfigures (a)–(d) illustrate the feature-based watermark detection rate η_F in image and text classification tasks with varying bit-length per client, in SFL with K = 5, 10, and 20 clients; the dotted vertical line indicates M/K, the theoretical bound given by Theorem 10.1

10.6.2 Watermark Detection Rate

We present the watermark detection rate of the proposed FedIPR framework.

Feature-Based Watermarks Figure 10.6a–d illustrates the feature-based watermark detection rates η_F on four different datasets.
• Case 1: As shown in Fig. 10.6, the detection rate η_F remains constant (100%) to the left of the vertical line (i.e., M/K), where the total bit-length KN assigned by the multiple (K = 5, 10, or 20) clients does not exceed the capacity of the network parameters, which is determined by the channel number M of the parameters S(W) = W_γ.
• Case 2: When the total length of watermarks KN exceeds the channel number M (KN > M), Fig. 10.7 shows that the detection rate η_F drops to about 80% due to conflicts between overlapping watermark assignments, yet the measured η_F remains above the lower bound given by Case 2 of Theorem 10.1 (denoted by the red dotted line).

Backdoor-Based Watermarks Table 10.1 illustrates the detection rate η_T of backdoor-based watermarks, where different numbers of clients embed backdoor-based watermarks into the model.

(a) AlexNet on CIFAR10  (b) ResNet on CIFAR100  (c) DistilBERT on SST2  (d) DistilBERT on QNLI

Fig. 10.7 The figure shows the lower bound (red dotted line) of the feature-based watermark detection rate η_F given by Theorem 10.1; the empirical results (blue line) lie above the theoretical bound (Case 2) in an SFL setting of K = 5 clients

The results show that the detection rate η_T remains consistently high even when the number of triggers per client increases to N_T = 300. We ascribe the stable detection rate η_T to the over-parameterization of the networks, as demonstrated in [3, 27].

10.6.3 Watermarks Defeat Freerider Attacks

We investigate the setting where watermarks are embedded into the FedDNN as a precaution for freerider detection: benign clients can verify ownership by extracting their predefined watermarks from the FedDNN, while freeriders cannot. We simulate three types of local updates, including plain freeriders, freeriders with Gaussian noise (defined in Sect. 10.3.2), and benign clients that contribute data and computation. The server conducts feature-based verification on each client's local model. The results in Fig. 10.8 show that the benign clients' watermarks in the global model can be detected within 30 communication rounds of federated training, while the freeriders fail to verify their watermarks because they do not contribute actual training: their detection rate η_F is close to a random guess (50%), which exposes them as free-riders.

10.6.4 Robustness Under Federated Learning Strategies

As illustrated in technical Challenge B of Sect. 10.4.3, strategies like differential privacy [25] and client selection [15] intrinsically degrade the performance of the main classification task. We evaluate the detection rates η_F and η_T of watermarks under Challenge B to report the robustness of FedIPR.

Table 10.1 Backdoor-based watermark detection rate η_T (all above 95%) for varying numbers N_T of trigger samples per client, on CIFAR10 (AlexNet) and CIFAR100 (ResNet18) with 5, 10, and 20 clients

Model/Dataset       Clients  N_T=50          N_T=100         N_T=150         N_T=200         N_T=250         N_T=300
AlexNet/CIFAR10     20       99.34% ± 0.31%  99.64% ± 0.31%  99.35% ± 0.31%  99.03% ± 0.57%  99.17% ± 0.47%  98.85% ± 0.69%
AlexNet/CIFAR10     10       99.30% ± 0.60%  99.60% ± 0.20%  98.45% ± 0.67%  98.24% ± 0.57%  98.43% ± 0.15%  97.56% ± 1.07%
AlexNet/CIFAR10     5        99.59% ± 0.23%  99.86% ± 0.05%  98.15% ± 0.74%  98.71% ± 0.43%  98.28% ± 0.30%  98.39% ± 0.64%
ResNet18/CIFAR100   20       98.92% ± 0.20%  99.58% ± 0.41%  99.35% ± 0.31%  99.59% ± 0.46%  99.93% ± 0.05%  99.92% ± 0.07%
ResNet18/CIFAR100   10       99.29% ± 0.38%  98.89% ± 0.80%  98.56% ± 0.57%  99.84% ± 0.04%  99.83% ± 0.15%  99.88% ± 0.03%
ResNet18/CIFAR100   5        99.03% ± 0.44%  98.54% ± 1.30%  99.07% ± 2.34%  98.94% ± 0.73%  99.45% ± 0.06%  98.44% ± 0.25%



(a) 20 bits with AlexNet  (b) 40 bits with AlexNet  (c) 60 bits with AlexNet  (d) 20 bits with ResNet  (e) 40 bits with ResNet  (f) 60 bits with ResNet

Fig. 10.8 Comparison between three different types of clients: (1) freerider clients with previous local models (orange lines), (2) freerider clients disguised with Gaussian noise (blue lines), and (3) four benign clients in a 20-client SFL. The feature-based watermark detection rate η_F is measured in each communication round

(a) DP noise σ with AlexNet  (b) DP noise σ with ResNet  (c) Sample ratio with AlexNet  (d) Sample ratio with ResNet

Fig. 10.9 Performance of FedIPR under differential privacy and client sampling strategies in a federated learning setting of 10 clients. Subfigures (a) and (b) illustrate the feature-based detection rate η_F and backdoor-based detection rate η_T under varying differentially private noise σ, where the dotted lines show the main task accuracy Acc_main. Subfigures (c) and (d) illustrate the feature-based detection rate η_F and backdoor-based detection rate η_T under different sample ratios c, where the dotted lines show the main task accuracy Acc_main

10.6.4.1 Robustness Against Differential Privacy

We adopt the Gaussian noise-based method to provide a differential privacy guarantee. As Fig. 10.9a–b shows, the main task performance Acc_main decreases severely as the noise scale σ increases, whereas the feature-based detection rate η_F and the backdoor-based detection rate η_T drop only slightly while Acc_main remains within a usable range (more than 85%), which demonstrates the robustness of the watermarks.
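For reference, the Gaussian mechanism on a client update can be sketched as follows (our simplified illustration; the clipping norm and noise scale are assumed hyper-parameters, and a production DP-FL implementation would also track the privacy budget):

import torch

def dp_gaussian_update(update, clip_norm=1.0, sigma=0.1):
    """Clip a local model update (dict of tensors) and add Gaussian noise
    before uploading it to the server."""
    flat = torch.cat([p.flatten() for p in update.values()])
    scale = min(1.0, clip_norm / (flat.norm() + 1e-12))   # L2 clipping factor
    return {
        name: p * scale + sigma * clip_norm * torch.randn_like(p)
        for name, p in update.items()
    }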

10.6.4.2 Robustness Against Client Selection

In the practice of federated learning, in each communication round the server selects cK of the K clients (c < 1) to participate in training. Figure 10.9 shows that the watermarks cannot be removed even when the sampling ratio c is as low as 0.25. This result gives a lower bound on the client sampling rate at which watermarks can still be effectively embedded.

10.7 Conclusion

This chapter presents a model watermarking scheme to protect the IPR of FedDNN models against external plagiarizers in a secure federated learning setting. This work addresses an open issue in SFL, since protecting the model IPR is as important as protecting data privacy in secure federated learning.

References 1. Abadi, M., Chu, A., Goodfellow, I., McMahan, H.B., Mironov, I., Talwar, K., Zhang, L.: Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318 (2016) 2. Adi, Y., Baum, C., Cisse, M., Pinkas, B., Keshet, J.: Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In: 27th USENIX Security Symposium (USENIX) (2018) 3. Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via overparameterization. CoRR, abs/1811.03962 (2018) 4. Atli, B.G., Xia, Y., Marchal, S., Asokan, N.: Waffle: Watermarking in federated learning. arXiv preprint arXiv:2008.07298 (2020) 5. Chen, H., Rohani, B.D., Koushanfar, F.: DeepMarks: A Digital Fingerprinting Framework for Deep Neural Networks. arXiv e-prints, page arXiv:1804.03648 (2018) 6. de Boer, M.: Ai as a target and tool: An attacker’s perspective on ml (2020) 7. Fan, L., Li, B., Gu, H., Li, J., Yang, Q.: FedIPR: Ownership verification for federated deep neural network models. arXiv preprint arXiv:2109.13236 (2021) 8. Fan, L., Ng, K.W., Chan, C.S: Rethinking deep neural network ownership verification: Embedding passports to defeat ambiguity attacks. In: Advances in Neural Information Processing Systems, pp. 4714–4723. Curran Associates, Inc., New York (2019) 9. Fan, L., Ng, K.W., Chan, C.S., Yang, Q.: DeepIP: Deep neural network intellectual property protection with passports. IEEE Trans. Pattern Anal. Mach. Intell. 01, 1–1 (2021) 10. Fraboni, Y., Vidal, R., Lorenzi, M.: Free-rider attacks on model aggregation in federated learning. In: International Conference on Artificial Intelligence and Statistics, pp. 1846–1854. PMLR, New York (2021) 11. Kairouz, P., McMahan, H.B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A.N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R. and D’Oliveira, R.G.L., Rouayheb, S.E., Evans, D., Gardner, J., Garrett, Z., Gascón, A., Ghazi, B., Gibbons, P.B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Koneˇcný, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Özgür, A., Pagh, R., Raykova, M., Qi, H., Ramage, D., Raskar, R., Song, D.,


Song, W., Stich, S.U., Sun, Z., Suresh, A.T., Tramèr, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F.X., Yu, H., Zhao, S.: Advances and open problems in federated learning. Foundations and Trends in Machine Learning, abs/1912.04977 (2019) 12. Lim, J.H., Chan, C.S., Ng, K.W., Fan, L., Yang, Q.: Protect, show, attend and tell: Empowering image captioning models with ownership protection. Pattern Recogn. 122, 108285 (2022) 13. Lukas, N., Zhang, Y., Kerschbaum, F.: Deep neural network fingerprinting by conferrable adversarial examples. In: International Conference on Learning Representations (2020) 14. Luo, X., Wu, Y., Xiao, X., Ooi, B.C.: Feature inference attack on model predictions in vertical federated learning. In: 2021 IEEE 37th International Conference on Data Engineering (ICDE), pp. 181–192. IEEE, New York (2021) 15. McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-Efficient Learning of Deep Networks from Decentralized Data. In: Singh, A., Zhu, J. (eds.). Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 54, pp. 1273–1282. PMLR, Fort Lauderdale, FL, USA (2017) 16. Ong, D.S., Chan, C.S., Ng, K.W., Fan, L., Yang, Q.: Protecting intellectual property of generative adversarial networks from ambiguity attack. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 17. Orekondy, T., Schiele, B., Fritz, M.: Knockoff nets: Stealing functionality of black-box models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4954–4963 (2019) 18. Phong, L.T., Aono, Y., Hayashi, T., Wang, L., Moriai, S.: Privacy-preserving deep learning via additively homomorphic encryption. IEEE Trans. Inf. Forensics Secur. 13(5), 1333–1345 (2018) 19. Rouhani, B.D., Chen, H., Koushanfar, F.: DeepSigns: A Generic Watermarking Framework for IP Protection of Deep Learning Models. arXiv e-prints, page arXiv:1804.00750 (2018) 20. Ryffel, T., Pointcheval, D., Bach, F.R.: ARIANN: low-interaction privacy-preserving deep learning via function secret sharing. CoRR, abs/2006.04593 (2020) 21. Shokri, R., Shmatikov, V.: Privacy-preserving deep learning. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1310–1321 (2015) 22. Tramèr, F., Zhang, F., Juels, A., Reiter, M.K., Ristenpart, T.: Stealing machine learning models via prediction {APIs}. In: 25th USENIX security symposium (USENIX Security 16), pp. 601– 618 (2016) 23. Uchida, Y., Nagai, Y., Sakazawa, S., Satoh, S.I.: Embedding watermarks into deep neural networks. In: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pp. 269–277 (2017) 24. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, p. 30 (2017) 25. Wei, K., Li, J., Ding, M., Ma, C., Yang, H.H., Farokhi, F., Jin, S., Quek, T.Q., Poor, H.V.: Federated learning with differential privacy: Algorithms and performance analysis. IEEE Trans. Inf. Forensics Secur. 15, 3454–3469 (2020) 26. Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) 10(2), 12 (2019) 27. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: ICLR. OpenReview.net (2017) 28. 
Zhang, J., Gu, Z., Jang, J., Wu, H., Stoecklin, M.P., Huang, H., Molloy, I.: Protecting intellectual property of deep neural networks with watermarking. In: Proceedings of the 2018 on Asia Conference on Computer and Communications Security (ASIACCS), pp. 159–172 (2018) 29. Zhang, J., Chen, D., Liao, J., Zhang, W., Hua, G., Yu, N.: Passport-aware normalization for deep model protection. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.). Advances in Neural Information Processing Systems, vol. 33, pp. 22619–22628. Curran Associates, Inc., New York (2020)


30. Zhang, J., Chen, D., Liao, J., Zhang, W., Feng, H., Hua, G., Yu, N.: Deep model intellectual property protection via deep watermarking. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2021) 31. Zhu, L., Liu, Z., Han, S.: Deep leakage from gradients. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.). NeurIPS, pp. 14747–14756 (2019)

Chapter 11

Model Auditing for Data Intellectual Property

Bowen Li, Lixin Fan, Jie Li, Hanlin Gu, and Qiang Yang

Abstract Deep neural network models are built upon a tremendous amount of labeled training data, and data ownership must be correctly determined because a model developer may illegally misuse or steal another party's private data for training. To determine data ownership from a trained deep neural network model, in this chapter we propose a deep neural network auditing scheme that allows an auditor to trace illegal data usage from a trained model. Specifically, we propose a rigorous definition of meaningful model auditing, and we point out that any model auditing method must be robust to removal attacks and ambiguity attacks. We provide an empirical study of existing model auditing methods, which shows that they can enable data tracing under different model modification settings but fail when the model developer uses the training data in ways the data owner cannot anticipate, and thus cannot provide meaningful data ownership resolution. In this chapter, we rigorously present the model auditing problem for data ownership and open a new avenue in this area of research.

11.1 Introduction

Data are regarded as the oil of AI applications. Deep neural networks (DNNs), especially large foundation models, require a large amount of data to train large-scale, well-generalized, and precise models. In the expensive data collection procedures,

B. Li · J. Li Shanghai Jiao Tong University, Shanghai, China e-mail: [email protected]; [email protected] L. Fan () · H. Gu WeBank AI Lab, Shenzhen, China e-mail: [email protected] Q. Yang Hong Kong University of Science and Technology, Kowloon,, Hong Kong e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 L. Fan et al. (eds.), Digital Watermarking for Machine Learning Model, https://doi.org/10.1007/978-981-19-7554-7_11


companies or institutes may collect a large amount of personal data without authorization. Such illegal data collection and misuse in machine learning services infringes data intellectual property rights (IPRs). To protect data owners' legitimate interests, it is necessary to protect the intellectual property (IP) of data, i.e., to externally verify the ownership of data behind a machine learning service in case a data collector (a large company or institute) illegally misuses the data for model training. We name this procedure model auditing for data forensics.

How to audit data misuse from a trained model (e.g., one offered as an AI service, MLaaS) is an open problem, because a machine learning model is trained with knowledge of the given dataset but does not exhibit a straightforward relation to its training data. Previous works [5, 6, 10] study the relationship between training data and models, where a third party (such as an adversary) who wants to infer data usage has different levels of knowledge of the training set. Membership inference attacks (MIAs) [6, 7, 9–11] were proposed as privacy leakage attacks, in which an attacker without knowledge of the training set infers whether a single data point was in the training set of the model. Dataset inference [5] was proposed for a party with full knowledge of the training set, allowing the data owner to infer whether a model was trained with a given dataset.

However, there are two challenges for model auditing. (1) How to understand the relationship between the model and the training set? Membership inference attacks [6, 10] attempt to discover the membership of a data point from a trained model, whereas Rezaei et al. [7] pointed out that current state-of-the-art MIAs cannot accurately reveal data point usage from a trained model, because they have only low confidence and a high false positive rate for membership inference. (2) There is limited information for data forensics under realistic settings. On the one hand, the data owner has only access to the API of the AI service, but existing black-box MIA approaches [8] need the confidence vectors output by the trained model for membership information. On the other hand, the data may be used in ways that the data owner cannot anticipate; for example, facial images can be used for pretraining, image generation, or image classification.

We treat model auditing as an ownership verification problem, which must satisfy several requirements: (1) the model auditing method must apply in a model-agnostic setting, i.e., data usage must be traceable no matter which model type or architecture the data are used to train; (2) the model auditing method must be able to trace data usage in various data usage cases, such as data partially used for model training, or used for transfer learning, model pretraining, etc.; and (3) the model auditing method must be robust to removal attacks and ambiguity attacks. Based on the above requirements, we evaluate existing methods that claim to solve the model auditing problem. Maini et al. [5] proposed to solve the problem with a dataset inference (DI) method that exploits the disparity between the samples used in the training process and those not used.
Their key intuition is that the models are repeatedly trained on the finite training set. The training data


characterize the decision boundary of the model, and Maini et al. [5] have shown that, with white-box or black-box access to the suspicious trained model and with full knowledge of the training set, they can demonstrate that the model has stolen the victim dataset in varying settings. We reproduce the results of dataset inference and provide experimental results under varying settings according to our requirements: (1) only a portion of the whole training set is stolen, for example, the whole dataset has 20 classes but only 10 classes of data are stolen and used for model training (10 of 20 classes); (2) the stolen dataset is used for training with different model architectures; (3) the adversaries conduct model distillation or model fine-tuning after stealing the dataset for model training; and (4) the dataset is used for arbitrary tasks, e.g., partially used for model training or used for transfer learning, model pretraining, etc.

To sum up, in this chapter, our contributions include:
• We formulate model auditing for data forensics as a non-trivial ownership verification problem and propose rigorous requirements for model auditing methods.
• We revisit an existing model auditing method (dataset inference) and present empirical evaluation results of dataset inference in varying settings.
• Our empirical results show that dataset inference is resilient under removal attacks and ambiguity attacks, but existing methods do not work in task-agnostic settings such as partial data usage.

11.2 Related Works

11.2.1 Membership Inference (MI)

Membership inference attacks (MIAs) [6–8, 10] are a class of attacks that infer whether a single data point was used in model training. Given a trained model and a single data point, an MIA returns the probability that the data point was used in training. Binary classifier methods [6, 10] and metric-based methods [9, 11] have been proposed to infer data membership from a trained model in both white-box and black-box settings. Dataset inference [5] is a special case of MI, in which the data owner infers dataset membership with full knowledge of the training set. On the technical side, dataset inference infers membership according to the model decision boundary.

11.2.2 Model Decision Boundary

The distance between a data point and the model decision boundary can be viewed as the smallest adversarial perturbation that can be added to the data sample. Previous


works in adversarial learning share the observation that samples not used in training are closer to the model decision boundary and more vulnerable to perturbations [4, 13] compared to training samples, and thus the model decision boundary memorizes membership information of the training data. The model decision boundary can be measured with label-only access [1, 3] or in a white-box setting where the detector has access to all model parameters and outputs.

11.3 Problem Formulations

We consider a setting in which a victim O owns a private dataset D_train ⊂ R^{d+n} with D_train = {(x_i, y_i)}_{i ∈ |D_train|}, where x_i ∈ R^d is the sample and y_i ∈ R^n is the label. The data collector C illegally uses the data D_train to train a model f(), with training algorithm T and other auxiliary data D_aux:

Train(D_train, T, D_aux) → f().    (11.1)

In the rest of this chapter, for simplicity, we take the classification task as an example. In a C-class classification problem, we denote the classifier by f : X → Y, in which the instance space is X ⊂ R^d and the label space is Y ⊂ {0, 1, 2, . . . , C}; the training set is D_train = {(x_i, y_i) | i = 1, 2, . . . , N}, and D_train is composed of samples from a distribution D, i.e., D_train ∼ D.

Definition 11.1 (Model Auditing for Data IP) If the data owner O wants to audit the data IP of a trained model f(), the model auditing process, i.e., the dataset ownership verification process V, is denoted as

V(f(), D_train) = p,    (11.2)

where p is the output probability of dataset misuse; if p exceeds a predetermined threshold τ, the data owner O can determine that the suspect model f() has stolen the private dataset D_train.

Remark The victim O could have white-box or black-box access to the model f(): in the white-box setting, the victim O has access to the parameters and the outputs of model f(), while in the black-box setting, the victim O has only API access to the predictions of model f().


11.3.1 Properties for Model Auditing

Model auditing for data IP needs a strict protocol to provide sufficient evidence, which requires several key properties of model auditing methods:

1. Model Agnostic: when the dataset D_train is used to train various types of model architectures, e.g., a regression model, a recurrent neural network, a convolutional neural network, or a transformer-based network,

Train(D_train, T, D_aux) → f(),    (11.3)

i.e., no matter what the architecture of f() is, the owner O should be able to verify illegal data use from the model f() with method V:

V(f(), D_train) = p ≥ τ.    (11.4)

2. Task Agnostic: even if the private dataset is not used in the originally intended way, the model auditing method must still work. Given a dataset D_train originally collected by owner O for training algorithm T, suppose the adversary (the party who steals the dataset) C uses D_train for an arbitrary training algorithm T' that is unknown to O:

Train(D_train, T', D_aux) → f'().    (11.5)

The owner O should still be able to verify illegal data use from the model f'():

V(f'(), D_train) = p ≥ τ.    (11.6)

For example, if D_train consists of ImageNet samples but is used for an image generation task, the data owner O should still be able to trace data use from f'() with method V.

3. Robustness against post-processing: when models trained with personal data are post-processed, e.g., by pruning, model quantization, fine-tuning, or incremental learning — for example, the adversary modifies the model f() to f̂(), Removal(f()) = f̂() — the model auditing method must still work, i.e., V(f̂(), D_train) = p ≥ τ.

4. Robustness against ambiguity attack: a strict model auditing method for data IP must be robust to ambiguity attacks; that is, if an adversary without the true training data forges some data, Forge(f(), D_train) = D̂, the forged data cannot pass the ownership test, i.e.,

V(f(), D̂) = p ≤ τ.    (11.7)


Fig. 11.1 Black-box and white-box settings for model auditing

11.3.2 Model Auditing Under Different Settings

As shown in Fig. 11.1, the settings for model auditing can be categorized into black-box and white-box settings according to the auditor's access to the model.

Black-Box Setting (Label-Only Model Auditing) A black-box model auditing method verifies data misuse with label-only queries to the suspect API. The data owner O has full knowledge of the private dataset D_train and unlimited query access to the suspect model API f(); the data owner queries the API with a given input x and obtains an output f(x) = y.

White-Box Setting (Full Access to the Model Parameters and Outputs) In the white-box setting, the owner O has full access to the model parameters W, model outputs, and gradients. The auditor inputs data x to obtain the full prediction as well as the gradient information f'(x); in addition, features of the model parameters W also help the auditing.

11.4 Investigation of Existing Model Auditing Methods

Maini et al. [5] proposed a method to verify that a suspicious model f() has used the private dataset D_train for model training; the method is called dataset inference (DI). Dataset inference points out that the model decision boundary memorizes membership information of the training data, and thus the decision boundary can be used to audit illegal data usage from a trained model f(). We first proceed to the definition of the model decision boundary.

11.4.1 Distance Approximation to Decision Boundary

What is a decision boundary? For simplicity, in a C-class classification problem, Cao et al. [2] defined the decision boundary in terms of the confidence scores c(x) output by the classifier f().


Definition 11.2 (Decision Boundary [2]) Given a classifier f(), f : X → Y, in which the instance space is X ⊂ R^d and the label space is Y ⊂ {0, 1, 2, . . . , C}, the classifier's classification boundary is defined as the following set of data points:

{x | ∃ i, j, i ≠ j and c_i(x) = c_j(x) ≥ max_{t ≠ i,j} c_t(x)},    (11.8)

where c_i(x) is the confidence score for class i output by the classifier f().

The decision boundary is hard to depict directly, and thus we turn to the distance to the class-i decision boundary dist_f(x, y_i) for a given data point x, which is defined as follows:

Definition 11.3 (Distance to Decision Boundary) For a classifier f() of a C-class classification task and a data point x for which the oracle of f() predicts f(x) = y, the distance from x to the i-th target class decision boundary dist_f(x, y_i) is the minimal L2 distance min L2(x + δ, x) such that f(x + δ) = y_i, y_i ≠ y.

For a data point x, the distances between x and the other C − 1 target decision boundaries form a (C − 1)-dimensional vector. For a given classifier f() and a dataset D that contains private data samples, computing the distance to the decision boundary is exactly the task of finding the smallest adversarial perturbation, which can be conducted with white-box or black-box access to the classifier f() [5]; the decision boundary vector is

d = (dist_f(x, y_1), dist_f(x, y_2), . . . , dist_f(x, y_C)),    (11.9)

where each d corresponds to a label in {0, 1}: if the data point is in the training set of f(), the label is 1. Dataset inference [5] proposed decision boundary measurement methods in both white-box and black-box settings (as shown in Fig. 11.2).

White-Box Setting: MinGD [5] White-box model auditing is used in the scenario where the data owner O and an arbitrator, such as a court, both have white-box access to the model parameters and the outputs/gradients of the suspected adversary's model f(). For any data point (x, y), we evaluate its minimum L2 distance dist_f(x, y_i) to each neighboring target class y_i by gradient descent optimization of the following objective [12]: min_δ L2(x, x + δ) s.t. f(x + δ) = y_i, y_i ≠ y. The owner O thus obtains the (C − 1)-dimensional distance vector to the decision boundary d = (dist_f(x, y_1), dist_f(x, y_2), . . . , dist_f(x, y_C)).

Black-Box Setting: Blind Walk [5] In the black-box setting, the data owner O has only black-box access to the model; for example, the deployed model API f() only outputs the prediction label, which makes the owner O incapable of computing the gradients required for MinGD. Maini et al. proposed a query-based method called Blind Walk [5].


Fig. 11.2 Distance measurement to the decision boundary with random walk algorithm under black-box setting

First, Blind Walk samples a set of randomly initialized directions, one of which is denoted δ. Starting from an input (x, y), it takes k ∈ N steps in the same direction δ until f(x + kδ) = y_i, y_i ≠ y. Then kδ is used as a proxy for the distance to the decision boundary. The search is repeated over multiple random initial directions to approximate the distance, as shown in Fig. 11.2. Finally, the owner O obtains the (C − 1)-dimensional distance vector to the decision boundary d = (dist_f(x, y_1), dist_f(x, y_2), . . . , dist_f(x, y_C)).
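A label-only sketch of this random-direction search might look as follows (our illustrative code, not the authors' implementation; query_label stands for the black-box API, the step size and query budget are assumptions, and the search is simplified to the nearest boundary rather than per-class distances):

import numpy as np

def blind_walk_distance(x, y, query_label, step=0.02, max_steps=200, n_dirs=10):
    """Approximate the distance from x to the decision boundary of a
    label-only classifier by walking along random directions."""
    best = np.inf
    for _ in range(n_dirs):
        delta = np.random.randn(*x.shape)
        delta /= np.linalg.norm(delta)              # unit-norm random direction
        for k in range(1, max_steps + 1):
            if query_label(x + k * step * delta) != y:
                best = min(best, k * step)          # k steps crossed the boundary
                break
    return best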

11.4.2 Data Ownership Resolution [5]

To infer whether dataset information is contained in the model f() through the distance vectors d corresponding to dataset D_train, dataset inference builds an ownership tester g() that decides the membership of a data point x from its distance d to the model decision boundary, g(d) → [0, 1]. As shown in Fig. 11.3, before the dataset D_train is stolen for model training, the owner O first trains surrogate models f_sur on dataset D_train with training algorithm T, f_sur = T(D_train).

Fig. 11.3 An illustration of ownership classifier


Fig. 11.4 An illustration of attacks that challenge the model auditing process, and a rigorous model auditing protocol must address these issues

For each data sample (x, y) inside or outside D_train, the distance d is computed with the methods in Sect. 11.4.1, and the ownership tester g() is trained with the distance embedding d and the corresponding membership label {0, 1}, i.e., g(d, f()) → {0, 1}.

Remark The ownership tester g() is trained with both the victim's private data D_train and unseen publicly available data. Using the distance vectors d and the ground-truth membership labels, we train a regression model g(). Ownership verification is then performed with a hypothesis test: the dataset D_train along with public data D_pub is used to generate distance vectors d, and the ownership tester g() outputs the confidence vector μ_V for the victim dataset D_train and μ_pub for the public dataset D_pub. Dataset inference tests the null hypothesis H_0 : μ_V ≤ μ_pub, and if H_0 is rejected, it is asserted that the model f() has stolen the dataset D_train.
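The final decision step can be sketched with a one-sided two-sample t-test (an illustration under our own naming; g is the trained ownership tester with an sklearn-style predict method, and the two distance arrays are assumed inputs):

import numpy as np
from scipy import stats

def dataset_inference_test(g, d_victim, d_public, alpha=1e-3):
    """Reject H0: mu_V <= mu_pub if the victim samples' confidences are
    significantly larger than those of the public samples."""
    mu_v = g.predict(d_victim)          # confidences on victim-set distance vectors
    mu_p = g.predict(d_public)          # confidences on public-set distance vectors
    t, p_two_sided = stats.ttest_ind(mu_v, mu_p, equal_var=False)
    p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
    return p_one_sided, p_one_sided < alpha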

11.4.3 Threat Model for Model Auditing

In this subsection, we introduce adversaries who attack the model auditing methods (illustrated in Fig. 11.4).

11.4.3.1 Removal Attack

Suppose the owner O has a dataset D_train and the illegal data collector C has trained a model f() with the dataset D_train. The adversary C may modify the model f() to avoid data IP auditing on f(). For example, the adversary modifies the model f() to f̂(),

Removal(f()) = f̂(),    (11.10)

so that the model auditing process V(f(), D_train) fails to provide evidence:

V(f̂(), D_train) = p ≤ τ,    (11.11)


and in practice, the adversary may implement the attack with techniques like parameter pruning, model fine-tuning, knowledge distillation, etc. We give an example of a removal attack in Algorithm 1:

Algorithm 1 Removal attack
Input: Model f(), pruning rate p, additional training data D_fine-tune.
1: Fine-tuning
2: for epoch in 1, . . . , 50 do
3:    Train the model f() only on the main classification task with the additional training data D_fine-tune.
4: end for
5: Prune the model f() with pruning rate p.
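In PyTorch, the two steps of Algorithm 1 could be sketched as below (an illustrative, simplified attack; the optimizer settings and the choice of unstructured L1 pruning are our assumptions):

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def removal_attack(model, finetune_loader, pruning_rate=0.3, epochs=50, lr=1e-4):
    """Fine-tune on the main task only, then magnitude-prune the weights."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in finetune_loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
    for module in model.modules():                     # prune conv/linear weights
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=pruning_rate)
    return model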

11.4.3.2 Ambiguity Attack

For a given model f(), an adversary Adv_ambi may forge a dataset D̂ as evidence,

I(f(), V, D_aux) → D̂,    (11.12)

such that the adversary can pass the ownership verification process and claim that the model has misused the dataset D̂, i.e., V(f(), D̂) = p ≥ τ, where the output probability p exceeds the threshold τ, and Adv_ambi can then demand compensation. The ambiguity attack is an inversion process in which the adversary attempts to forge the evidence used by the ownership verification process; the formal definition is given below.

Definition 11.4 (Ambiguity Attack) An inversion process I(f(), V, D_aux) → D̂ exists and constitutes a successful ambiguity attack if:
1. A set of data samples D̂ can be reverse-engineered through (white-box or black-box) access to a given DNN model f().
2. The forged dataset D̂ can be successfully verified with respect to the given DNN model f(), i.e., V(f(), D̂) = p ≥ τ, where the output p exceeds the preset threshold τ.
3. If at least one inversion process I() exists for a model auditing scheme V, then the scheme V() is called an invertible scheme, i.e., V ∉ V^I, where V^I = {V | I(·, V, ·) = ∅} is the set of auditing schemes that cannot be inverse-engineered.


11.5 Experimental Results

In this section, we provide empirical evaluation results to investigate whether dataset inference (DI) [5] has the properties in Sect. 11.3.1, and we answer the following questions:
1. (Model Agnostic) Does DI work if the training data are used to train models of various architectures?
2. (Task Agnostic) Does DI work if the training data are only partially used for training?
3. (Robustness) Is DI robust to model removal attacks and data ambiguity attacks?

DNN Model Architectures The deep neural network architectures we investigate include the well-known AlexNet and ResNet-18.

Datasets For image classification tasks, the model auditing method is evaluated on the CIFAR10, CIFAR100, and ImageNet datasets.

Statistical Significance We adopt the p-value of a hypothesis test to quantify the statistical significance of model auditing: the smaller the p-value DI asserts, the higher the confidence that the model has stolen the dataset D_train. In this chapter, we set the threshold to 10^-3.

11.5.1 Main Results

First, we reproduce the main experimental results of DI. The data owner O holds a private CIFAR100 dataset and trains g() for different model architectures including AlexNet, ResNet-18, and WideResNet. If the dataset D_train is collected by C to train a model f(), the data owner can audit data usage from the model f() across these architectures. The experimental results in Table 11.1 show that both the random walk (black-box setting) and MinGD (white-box setting) methods of DI achieve p-values lower than 0.01 with m ≥ 10 samples, and the confidence can be guaranteed across different model architectures. DI can enable data forensics from a trained model when the data collector uses the data D_train in the designated way, i.e., DI is model agnostic.

Table 11.1 Hypothesis testing results (p-values) for both black-box and white-box settings. The CIFAR10 dataset is stolen and used to train AlexNet, ResNet-18, and WideResNet, and the data owner O samples m data points from the victim dataset D_train and the public dataset D_pub for hypothesis testing

Setting                  Model        m=10    m=20    m=30    m=40    m=50
p-value for black-box    AlexNet      10^-8   10^-13  10^-23  10^-25  10^-27
p-value for black-box    ResNet       10^-12  10^-22  10^-29  10^-42  10^-61
p-value for black-box    WideResNet   10^-11  10^-18  10^-22  10^-29  10^-42
p-value for white-box    AlexNet      10^-2   10^-4   10^-5   10^-8   10^-9
p-value for white-box    ResNet       10^-3   10^-4   10^-6   10^-9   10^-10
p-value for white-box    WideResNet   10^-2   10^-3   10^-5   10^-8   10^-11


Table 11.2 Hypothesis testing results (p-values) for both black-box and white-box settings when the data owner O owns only 2 or 5 of the 10 classes. The dataset D_train is used as part of the CIFAR10 dataset to train an AlexNet classifier, and the data owner O samples m data points from the victim dataset D_train and the public dataset D_pub for hypothesis testing

Setting                  Data usage            m=10   m=20   m=30   m=40   m=50
p-value for black-box    2 out of 10 classes   10^-1  10^-1  10^-1  10^-2  10^0
p-value for black-box    5 out of 10 classes   10^-2  10^-4  10^-7  10^-7  10^-10
p-value for white-box    2 out of 10 classes   10^-1  10^-1  10^-1  10^-3  10^0
p-value for white-box    5 out of 10 classes   10^-1  10^-1  10^-2  10^-2  10^-3


11.5.2 Partial Data Usage

In this section, we simulate a simplified task-agnostic setting: the data owner O owns 2 classes or 5 classes of images from the CIFAR10 dataset, i.e., D_train is a 2-class or 5-class subset of CIFAR10, and C uses D_train together with the rest of CIFAR10 to train a model f(). The owner O has only D_train for training the ownership tester g(); the ownership inference results are presented in Table 11.2.

The experimental results show what happens when the stolen dataset D_train makes up only part of the training set of f(): if D_train makes up 5 of the 10 classes, DI achieves p-values lower than 0.01 with m ≥ 10 samples, and the black-box random walk method achieves lower p-values, i.e., higher confidence; if D_train makes up only 2 of the 10 classes, the data owner has too little information about the true training set to train the ownership tester g(), and DI yields p-values greater than 10^-3 even with m ≥ 10, which is not sufficient to assert dataset plagiarism.

11.5.3 Different Adversarial Settings

11.5.3.1 Data Ambiguity Attack

Given a model f() trained with dataset D_train, the adversary Adv_ambi has not contributed training data but wants to pass the model auditing process and claim that he has contributed valuable training data in order to obtain commercial benefit. On the technical side, the adversary Adv_ambi implements I(f(), V, D_aux) → D̂ with the Projected Gradient Descent (PGD) algorithm.


Table 11.3 Hypothesis testing results (p-values) for both black-box and white-box settings against the ambiguity attack. D_train is the CIFAR10 training set; the adversary Adv_ambi forges D̂ from the CIFAR10 test set with the PGD algorithm and tries to pass the ownership verification with AlexNet and ResNet-18. The attacker Adv_ambi samples m data points from the forged dataset D̂ and the public dataset D_pub for hypothesis testing

Setting                  Model     m=10   m=20   m=30   m=40   m=50
p-value for black-box    AlexNet   10^-2  10^-1  10^-1  10^0   10^-1
p-value for black-box    ResNet    10^-1  10^-1  10^-1  10^0   10^0
p-value for white-box    AlexNet   10^-1  10^-1  10^-1  10^0   10^0
p-value for white-box    ResNet    10^-2  10^-1  10^-1  10^-1  10^0

The adversary Adv_ambi then conducts ownership verification V(f(), D̂), and the DI results with the forged dataset D̂ are presented in Table 11.3. The experimental results show that the forged dataset D̂ cannot pass DI verification with either the random walk (black-box setting) or the MinGD (white-box setting) method: the forged dataset achieves p-values greater than 0.01 with m ≥ 10 samples, which cannot provide enough confidence for ownership. To sum up, DI is robust to the ambiguity attack, and it is hard for an attacker to forge a dataset D̂ that passes DI verification.

11.5.3.2 Model Removal Attack

In the removal attack setting, given a model f() trained with dataset D_train, the data collector C modifies the trained model f() so that the data owner O cannot successfully complete the model auditing process. On the technical side, the data collector C modifies the model f() to f̂() with fine-tuning; in our experiments, we fine-tune f() on a preset validation set of the CIFAR10 dataset, and the DI results with the modified model f̂() are presented in Table 11.4.

Table 11.4 Hypothesis testing results (p-values) for both black-box and white-box settings under the removal attack. The CIFAR10 dataset is stolen and used to train AlexNet and ResNet-18, the trainer C applies different numbers of fine-tuning rounds t to modify the model f(), and the data owner O samples 30 data points from the victim dataset D_train and the public dataset D_pub for hypothesis testing with the modified model f̂()

Model/dataset AlexNet ResNet AlexNet ResNet

Fine-tuning round t 0 10 −23 −22 .10 .10 −29 −28 .10 .10 −5 −5 .10 .10 −6 −6 .10 .10

20

30

−20

.10

−29

.10

−4

.10

−6

.10

.10 .10 .10 .10

40

−18

.10

−28

.10

−17

−4

.10

−6

.10

−27 −4 −5

224

B. Li et al.

The experimental results show that both random walk (black-box setting) and MinGD (white-box setting) methods of DI achieve lower than 0.01 p-value with .m = 30 samples, and the confidence can be guaranteed with different model architectures. DI can enable data forensics from a trained model even if the data collector applies model fine-tuning to modify the trained model .f (), and DI methods are resilient to model fine-tuning attack.

11.6 Conclusion This chapter focuses on data Intellectual Property protection problem, and we formulate a model auditing scheme to protect the Intellectual Property Right (IPR) of Training data against plagiarizers who illegally use the dataset for model training. This chapter proposes rigorous protocol requirements for model auditing for data IP. And this chapter revisits the attempts to audit illegal data use from a trained model, and we provide empirical results to show that dataset inference is resilient under removal attacks and ambiguity attack, but existing methods do not work in the task-agnostic settings like partial data usage settings. We wish that the formulation illustrated in this chapter will lead to more works in the data IP protection problem.

References 1. Brendel, W., Rauber, J., Bethge, M.: Decision-based adversarial attacks: reliable attacks against black-box machine learning models (2017). arXiv preprint arXiv:1712.04248 2. Cao, X., Jia, J., Gong, N.Z.: Ipguard: protecting intellectual property of deep neural networks via fingerprinting the classification boundary. In: Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security, pp. 14–25 (2021) 3. Chen, J., Jordan, M.I., Wainwright, M.J.: Hopskipjumpattack: a query-efficient decision-based attack. In 2020 IEEE Symposium on Security and Privacy (SP), pp. 1277–1294. IEEE (2020) 4. Hu, S., Yu, T., Guo, C., Chao, W.-L., Weinberger, K.Q.: A new defense against adversarial images: turning a weakness into a strength. in: Advances in Neural Information Processing Systems, vol. 32 (2019) 5. Maini, P., Yaghini, M., Papernot, N.: Dataset inference: ownership resolution in machine learning. In: International Conference on Learning Representations (2020) 6. Rahman, M.A., Rahman, T., Laganière, R., Mohammed, N., Wang, Y.: Membership inference attack against differentially private deep learning model. Trans. Data Privacy. 11(1), 61–79 (2018) 7. Rezaei, S., Liu, X.: On the difficulty of membership inference attacks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7892–7900 (2021) 8. Sablayrolles, A., Douze, M., Schmid, C., Ollivier, Y., Jégou, H.: White-box vs black-box: Bayes optimal strategies for membership inference. In: International Conference on Machine Learning, pp. 5558–5567. PMLR (2019) 9. Salem, A., Zhang, Y., Humbert, M., Berrang, P., Fritz, M., Backes, M.: Ml-leaks: model and data independent membership inference attacks and defenses on machine learning models (2018). arXiv preprint arXiv:1806.01246

11 Model Auditing for Data Intellectual Property

225

10. Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18 (2017) 11. Song, L., Mittal, P.: Systematic evaluation of privacy risks of machine learning models. In: 30th USENIX Security Symposium (USENIX Security 21), pp. 2615–2632 (2021) 12. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks (2013). arXiv preprint arXiv:1312.6199 13. Tian, S., Yang, G., Cai, Y.: Detecting adversarial examples through image transformation. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)