Synthesis Lectures on Computer Vision
Jun Wan · Guodong Guo · Sergio Escalera · Hugo Jair Escalante · Stan Z. Li
Advances in Face Presentation Attack Detection Second Edition
Synthesis Lectures on Computer Vision

Series Editors:
Gerard Medioni, University of Southern California, Los Angeles, CA, USA
Sven Dickinson, Department of Computer Science, University of Toronto, Toronto, ON, Canada
This series publishes on topics pertaining to computer vision and pattern recognition. The scope follows the purview of premier computer science conferences, and includes the science of scene reconstruction, event detection, video tracking, object recognition, 3D pose estimation, learning, indexing, motion estimation, and image restoration. As a scientific discipline, computer vision is concerned with the theory behind artificial systems that extract information from images. The image data can take many forms, such as video sequences, views from multiple cameras, or multi-dimensional data from a medical scanner. As a technological discipline, computer vision seeks to apply its theories and models for the construction of computer vision systems, such as those in self-driving cars/navigation systems, medical image analysis, and industrial robots.
Jun Wan
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Guodong Guo
Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA

Sergio Escalera
Department of Mathematics and Informatics, University of Barcelona and Computer Vision Center, Barcelona, Spain

Hugo Jair Escalante
Department of Computer Science, Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico

Stan Z. Li
AI Lab, Westlake University, Hangzhou, China
ISSN 2153-1056 ISSN 2153-1064 (electronic) Synthesis Lectures on Computer Vision ISBN 978-3-031-32905-0 ISBN 978-3-031-32906-7 (eBook) https://doi.org/10.1007/978-3-031-32906-7 1st edition: © Morgan & Claypool Publishers 2020 2nd edition: © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The field of biometric face recognition has achieved great success in recent years, especially with the rapid progress in deep learning. Face recognition systems are now widely used in diverse applications, such as mobile phone unlocking, security supervision in railway or subway stations, and other access control systems. However, as promising as face recognition is, it also has potential flaws that must be addressed in practice. For instance, user photos can easily be found on social networks and used to spoof face recognition systems. Such face presentation attacks make authentication systems vulnerable. Face anti-spoofing technology is therefore important to protect sensitive data, such as a user's identity and privacy, on smartphones and similar devices. In this context, we organized a series of face anti-spoofing workshops and competitions around face presentation attack detection at CVPR 2019, CVPR 2020, and ICCV 2021. Each year focused on a different topic: multi-modal face presentation attack detection at CVPR 2019, cross-ethnicity face anti-spoofing recognition at CVPR 2020, and 3D high-fidelity mask face presentation attack detection at ICCV 2021, all of which are highly relevant and motivated by real applications. This book presents a comprehensive review of the solutions developed by participants in the face anti-spoofing challenges. The motivation behind organizing such competitions and a brief review of the state of the art are provided. The datasets associated with the challenges are introduced, and the results of the challenges are analyzed. Finally, research opportunities are outlined. This book provides, in a single source, a compilation that summarizes the state of the art in this critical subject; we foresee that the book will become a reference for researchers and practitioners in face recognition. This book would not be possible without the support of many people involved in the aforementioned events. In particular, we would like to thank all participants in the face anti-spoofing challenges, who provided us with abundant material, especially the top three winning teams. We would like to thank Ajian Liu, Benjia Zhou, and Jun Li, who
helped us prepare the book materials and proofread the whole book. We would also like to thank Springer for working with us to produce this manuscript.

Jun Wan, Beijing, China
Guodong Guo, Morgantown, USA
Sergio Escalera, Barcelona, Spain
Hugo Jair Escalante, Puebla, Mexico
Stan Z. Li, Hangzhou, China
Contents
1 Face Anti-spoofing Progress Driven by Academic Challenges
   1.1 Introduction
   1.2 Face Presentation Attack Detection
       1.2.1 Formulation of the Problem
       1.2.2 Motivation
   References
2 Face Presentation Attack Detection (PAD) Challenges
   2.1 The Multi-modal Face PAD Challenge
       2.1.1 The CASIA-SURF Dataset
       2.1.2 Workshop and Competition at CVPR 2019
       2.1.3 Availability of Associated Dataset
   2.2 Cross-Ethnicity Face PAD Challenge
       2.2.1 CeFA Dataset
       2.2.2 Workshop and Competition at CVPR 2020
       2.2.3 Availability of Associated Dataset
   2.3 3D High-Fidelity Mask Face PAD Challenge
       2.3.1 HiFiMask Dataset
       2.3.2 Workshop and Competition at ICCV 2021
       2.3.3 Availability of Associated Dataset
   References
3 Best Solutions Proposed in the Context of the Face Anti-spoofing Challenge Series
   3.1 Introduction
   3.2 Multi-modal Face PAD Challenge
       3.2.1 Baseline Method
       3.2.2 1st Ranked Solution (Team Name: VisionLabs)
       3.2.3 2nd Ranked Solution (Team Name: ReadSense)
       3.2.4 3rd Ranked Solution (Team Name: Feather)
       3.2.5 Additional Top Ranked Solutions
       3.2.6 Summary
   3.3 Cross-Ethnicity Face PAD Challenge
       3.3.1 Single-modal Face Anti-spoofing Challenge Track
       3.3.2 Multi-modal Face Anti-spoofing Challenge Track
       3.3.3 Summary
   3.4 3D High-Fidelity Mask Face PAD Challenge
       3.4.1 Baseline Method
       3.4.2 1st Ranked Solution (Team Name: VisionLabs)
       3.4.3 2nd Ranked Solution (Team Name: WeOnlyLookOnce)
       3.4.4 3rd Ranked Solution (Team Name: CLFM)
       3.4.5 Other Top Ranked Solutions
       3.4.6 Summary
   References
4 Performance Evaluation
   4.1 Introduction
   4.2 Multi-modal Face PAD Challenge
       4.2.1 Experiments
       4.2.2 Summary
   4.3 Cross-Ethnicity Face PAD Challenge
       4.3.1 Experiments
   4.4 3D High-Fidelity Mask Face PAD Challenge
       4.4.1 Experiments
       4.4.2 Summary
   References
5 Conclusions and Future Work
   5.1 Conclusions
       5.1.1 CASIA-SURF & Multi-modal Face Anti-spoofing Attack Detection Challenge at CVPR2019
       5.1.2 CASIA-SURF CeFA & Cross-Ethnicity Face Anti-spoofing Recognition Challenge at CVPR2020
       5.1.3 CASIA-SURF HiFiMask & 3D High-Fidelity Mask Face Presentation Attack Detection Challenge at ICCV2021
   5.2 Future Work
       5.2.1 Flexible Modal Face Anti-spoofing
       5.2.2 Generalizable Face Anti-spoofing
       5.2.3 Surveillance Face Anti-spoofing
   References
1 Face Anti-spoofing Progress Driven by Academic Challenges
1.1 Introduction
Face anti-spoofing is essential to protect face recognition systems from security breaches. Progress in this field has been largely driven by the availability of face anti-spoofing benchmark datasets that resemble realistic scenarios. Although research advances in recent years have been considerable, the existing face anti-spoofing benchmarks still have several limitations:

• The limited number of subjects (≤170) and data modalities (≤2), which hinders further development in the academic community;
• The existence of ethnic biases, whose importance has been extensively documented in other face recognition tasks;
• The fact that most benchmarks target 2D information, while the few existing 3D mask-based benchmarks consider low-quality facial masks and very small numbers of samples.

Through the organization of academic challenges and associated workshops, the ChaLearn Face Anti-Spoofing challenge series has contributed resources, evaluation protocols, and dissemination forums for advancing research on facial presentation attack detection. In summary, our work in recent years has addressed the above problems in turn, as follows:

• To facilitate face anti-spoofing research for the scientific community, we introduced a large-scale multi-modal dataset, namely CASIA-SURF. This dataset is the largest publicly available dataset for face anti-spoofing in terms of both subjects and modalities. Specifically, it consists of 1,000 subjects with 21,000 videos, and each sample was recorded in 3 modalities (i.e., RGB, Depth and IR). Moreover, we presented a novel
multi-modal multi-scale fusion method as a strong baseline, namely Multi-scale SEF, which performs feature re-weighting to select the more informative channel features while suppressing less useful ones for each modality across different scales.
• To study ethnic bias in face anti-spoofing, we introduced the CASIA-SURF Cross-ethnicity Face Anti-spoofing dataset, CeFA, covering 3 ethnicities, 3 modalities, 1,607 subjects, and 2D plus 3D attack types. We then proposed a novel multi-modal fusion method as a strong baseline to alleviate ethnic bias, namely PSMM-Net, which uses a partially shared fusion strategy to learn complementary information from multiple modalities.
• To bridge the gap to real-world applications, we introduced a large-scale High-Fidelity Mask dataset, namely HiFiMask. Specifically, a total of 54,600 videos were recorded from 75 subjects and 225 realistic masks using 7 kinds of sensors. Along with the dataset, we proposed a novel Contrastive Context-aware Learning framework, namely CCL. CCL is a new training methodology for supervised PAD tasks, which learns by accurately leveraging rich contexts (e.g., subjects, mask material and lighting) among pairs of live faces and high-fidelity mask attacks.

The remainder of this chapter introduces the problem of face presentation attack detection (PAD) and describes the need for research on this topic. We also outline the main limitations of current research practices, in particular in terms of benchmarking. Then we describe our efforts to stimulate the community to target the PAD task.
1.2 Face Presentation Attack Detection
1.2.1 Formulation of the Problem
With the development of science and technology and the advent of the information age, identity authentication systems based on traditional methods, such as signatures and seals, are gradually being replaced by biometric recognition systems. A biometric recognition system collects innate human physiological characteristics, analyzes them, and finally determines the real identity of the subject under analysis. In recent years, face recognition, an important branch of biometric recognition, has attracted extensive attention because of its unique advantages: it is intuitive, natural, real-time, non-intrusive, and contactless. Research on face recognition can be traced back to the late 1960s [1–3]. Since the 1990s, with the rapid improvement of computer hardware performance, face recognition has gradually become one of the important branches of computer vision and biometric recognition, with a series of studies represented by the eigenface recognition algorithms [4, 5] proposed by Kirby and Turk. Since 2000, the United States has organized the Face Recognition Vendor Test (FRVT) evaluations for face recognition providers. So far, FRVT 2000, FRVT 2002, FRVT 2006, FRVT 2010, FRVT 2013,
Fig. 1.1 Deployment scenarios of face recognition systems, including face verification, face payment, unmanned supermarkets, and self-service security inspection
FRVT 1:1, FRVT 1:N 2018, and other evaluations have been conducted. With the growth of processor computing power and the reduction of hardware prices driven by Moore's law [6], face recognition systems have gradually moved out of the laboratory and into commercial use. A series of mature products have emerged, as shown in Fig. 1.1, which are widely used in access control, financial payment, phone unlocking, surveillance, and other fields, and they appear increasingly in people's daily lives. The development of science and technology not only brings convenience but also carries substantial security risks. Face recognition systems can be an attractive target for spoofing attacks, that is, attempts to illegally access the system by presenting a copy of a legitimate face. Information globalization acts in favor of such misuse: personal data, including face images and videos, are nowadays widely available and can easily be downloaded from the Internet. If an intruder deceives and invades a face recognition system by displaying fake images or videos to the capture device, the consequences can be serious and unpredictable. With the maturity of 3D printing technology in recent years, 3D face masks have also become a new face forgery method that threatens the security of face recognition systems. Compared with traditional attack media such as photos or screens, a 3D mask is more realistic in terms of texture and depth information, and it is also easy to obtain. Printed photographs [7] of a user's face, digital photographs displayed on a device [8], video replays, and 3D masks [9] have already proven to be a serious threat to face recognition systems. Some spoofing samples are shown in Fig. 1.2. Against this background, in order to detect whether the face captured by a face recognition system is real or fake, Face Presentation Attack Detection (PAD) technology plays an increasingly important role as a key component of face recognition systems, and has become one of the central topics in biometric authentication research. Most existing approaches formulate the problem as binary classification, one-class classification, or binary classification with auxiliary supervision. Binary Classification. Most research has relied on handcrafted features (mainly texture-based ones) combined with a binary classifier distinguishing genuine vs. spoof images [10–14]. In fact, recent methods based on deep learning adhere to this formulation,
Fig. 1.2 Common spoofing samples, including print attack, video replay attack and mask attack
using features learned with a convolutional neural network (CNN) and a softmax layer as the classifier. One-class Classification. Because two-class formulations are in general not robust in real-world scenarios, due to poor generalization in the presence of novel attack types [15, 16], some authors [16–18] have treated face anti-spoofing as an anomaly detection problem and have addressed it with one-class classifiers. Compared to two-class classification methods, one-class classification can be robust to previously unseen and innovative attacks [17]. For instance, in [16], the anomaly detectors used for face anti-spoofing include four types: a one-class SVM, a one-class sparse-representation-based classifier, a one-class Mahalanobis distance, and a one-class Gaussian mixture model. Binary Classification with Auxiliary Supervision. Recently, some authors have resorted to auxiliary information for face anti-spoofing. Atoum et al. were the first to use facial depth maps as supervisory information, with two-stream CNNs extracting features from both local patches and holistic depth maps [19]. Liu et al. propose a method that fuses features from depth maps and temporal rPPG signals [20]. Shao et al. then use depth information as auxiliary supervision to learn invariant features across domains [21]. These works demonstrate the effectiveness of combining auxiliary information with binary classification [19–21].
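To make the plain binary-classification formulation above concrete, the sketch below shows a minimal PAD classifier in PyTorch: a CNN backbone with a two-way classification head trained with cross-entropy on cropped face images. It is an illustrative sketch only; the ResNet-18 backbone, the input size, and the label convention are assumptions, not the setup of any specific work cited above.

```python
import torch
import torch.nn as nn
from torchvision import models

class BinaryPADNet(nn.Module):
    """Minimal face PAD model: CNN features + a two-way (live vs. spoof) head."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)  # optionally load ImageNet weights
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.classifier = nn.Linear(backbone.fc.in_features, 2)         # spoof / live logits

    def forward(self, x):
        f = self.features(x).flatten(1)   # (B, 512) pooled features
        return self.classifier(f)         # (B, 2) class logits

if __name__ == "__main__":
    model = BinaryPADNet()
    faces = torch.randn(4, 3, 224, 224)          # a batch of cropped face images
    labels = torch.tensor([0, 1, 1, 0])          # 0 = spoof, 1 = live
    logits = model(faces)
    loss = nn.CrossEntropyLoss()(logits, labels)
    print(logits.shape, float(loss))
```

One-class and auxiliary-supervision formulations replace the two-way head above with, respectively, an anomaly score over live-only features or an extra regression target (e.g., a depth map), while the backbone remains similar.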
1.2.2 Motivation
1.2.2.1 Multi-modal Face Presentation Attack Detection
Face anti-spoofing (FAS) aims to determine whether the face captured by a face recognition system is real or fake, a vital step in keeping face recognition systems secure and reliable. In recent years, face PAD algorithms [20, 22] have achieved strong performance. One key factor behind this success is the availability of face anti-spoofing benchmark datasets [7, 20, 23–26]. However, the existing datasets have several shortcomings:
• Limited number of subjects. Compared to large image classification [27] and face recognition [28] datasets, face anti-spoofing datasets have fewer than 170 subjects and 6,000 video clips, as shown in Table 1.1. The limited number of subjects is not representative of the requirements of real applications.
• Limited number of modalities. As shown in Table 1.1, most existing datasets consider only a single modality (e.g., RGB). The few available multi-modal datasets [9, 29] are very small, including no more than 21 subjects.
• Evaluation metrics are not comprehensive enough. How to measure the performance of algorithms is an open issue in face anti-spoofing. Many works [20, 22, 23, 25] adopt the Attack Presentation Classification Error Rate (APCER), the Normal Presentation Classification Error Rate (NPCER) and the Average Classification Error Rate (ACER) as evaluation metrics, where APCER and NPCER measure the error rate on fake and live samples, respectively, and ACER is the average of the APCER and NPCER scores. However, in real applications one may be more concerned about the false positive rate, i.e., an attacker being accepted as a real/live user. The aforementioned metrics cannot meet this need.
• Evaluation protocols are not diverse enough. All existing face anti-spoofing datasets only provide within-modal evaluation protocols. To be more specific, algorithms trained on a certain modality can only be evaluated on the same modality, which limits the diversity of face anti-spoofing research.

To deal with the aforementioned drawbacks, a large-scale multi-modal face anti-spoofing dataset, namely CASIA-SURF [39, 40], was introduced. It consists of 1,000 subjects and 21,000 video clips with 3 modalities (RGB, Depth, IR). It has 6 types of photo attacks combined by multiple operations, e.g., cropping, bending the printed paper, and varying the stand-off distance. Some samples and other detailed information of the CASIA-SURF dataset are shown in Fig. 1.3 and Table 1.1. Compared to existing face anti-spoofing datasets, it has four main advantages:

• The highest number of subjects. The proposed dataset is the largest in terms of number of subjects, more than 6× larger than previous challenging face anti-spoofing datasets such as Spoof in the Wild (SiW) [20].
• More data modalities. Our CASIA-SURF is the only dataset that provides three modalities (i.e., RGB, Depth and IR); the other datasets have at most two modalities.
• The most comprehensive evaluation metrics. Inspired by face recognition [41, 42], we introduce the Receiver Operating Characteristic (ROC) curve for our large-scale face anti-spoofing dataset in addition to the commonly used evaluation metrics. The ROC curve can be used to select a suitable trade-off threshold between the False Positive Rate (FPR) and the True Positive Rate (TPR) according to the requirements of a given real application.
Fig. 1.3 The CASIA-SURF dataset. It is a large-scale and multi-modal dataset for face anti-spoofing, consisting of 492,522 images with 3 modalities (i.e., RGB, Depth and IR)
• The most diverse evaluation protocols. In addition to the within-modal evaluation protocols, our dataset also provides cross-modal evaluation protocols, in which algorithms trained on one modality are evaluated on other modalities. This allows the academic community to explore new issues.

In addition, the works [41, 42] present a novel multi-modal multi-scale fusion method, namely Multi-scale SEF, as a strong baseline for extensive experiments on the proposed dataset. The fusion method performs feature re-weighting to select the more informative channel features while suppressing less useful ones for each modality across different scales.
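As a rough illustration of the channel re-weighting idea behind such squeeze-and-excitation (SE) style fusion, the sketch below concatenates per-modality feature maps and learns per-channel weights at a single scale; a multi-scale variant would apply this kind of re-weighting at several scales. Tensor shapes and the reduction ratio are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    """SE-style channel re-weighting over concatenated multi-modal feature maps."""
    def __init__(self, channels_per_modality, num_modalities=3, reduction=16):
        super().__init__()
        c = channels_per_modality * num_modalities
        self.squeeze = nn.AdaptiveAvgPool2d(1)             # global context per channel
        self.excite = nn.Sequential(
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True),
            nn.Linear(c // reduction, c), nn.Sigmoid(),    # per-channel weights in (0, 1)
        )

    def forward(self, feats):
        # feats: list of (B, C, H, W) maps, one per modality (e.g., RGB, Depth, IR)
        x = torch.cat(feats, dim=1)                        # (B, 3C, H, W)
        w = self.excite(self.squeeze(x).flatten(1))        # (B, 3C)
        return x * w[:, :, None, None]                     # informative channels emphasized

# Example: fuse one feature scale of the three modalities
fuse = SEFusion(channels_per_modality=128)
rgb, depth, ir = (torch.randn(2, 128, 14, 14) for _ in range(3))
fused = fuse([rgb, depth, ir])                             # (2, 384, 14, 14)
```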
1.2.2.2 Cross-Ethnicity Face Presentation Attack Detection
Although ethnic bias has been shown to severely affect the performance of face recognition systems [43–45], this topic remains unexplored in face anti-spoofing. This is mainly due to the fact that there is no available dataset with ethnic annotations and an appropriate
Table 1.1 Comparison of the public face anti-spoofing datasets (∗ indicates the dataset only contains images, not video clips; STC-PRO is short for Seek Thermal Compact PRO sensor; the symbol − indicates that the item is not counted)

| Dataset | Year | # Subjects | # Videos | Camera | Modal types | Spoof attacks |
|---|---|---|---|---|---|---|
| Replay-Attack [24] | 2012 | 50 | 1,200 | VIS | RGB | Print, 2 Replay |
| CASIA-MFSD [7] | 2012 | 50 | 600 | VIS | RGB | Print, Replay |
| 3DMAD [9] | 2013 | 17 | 255 | VIS/Kinect | RGB/Depth | 3D Mask |
| I2BVSD [30] | 2013 | 75 | 681∗ | VIS/Thermal | RGB/Heat | 3D Mask |
| GUC-LiFFAD [31] | 2015 | 80 | 4,826 | LFC | LFI | 2 Print, Replay |
| MSU-MFSD [26] | 2015 | 35 | 440 | Phone/Laptop | RGB | Print, 2 Replay |
| Replay-Mobile [25] | 2016 | 40 | 1,030 | VIS | RGB | Print, Replay |
| 3D Mask [32] | 2016 | 12 | 1,008 | VIS | RGB | 3D Mask |
| Msspoof [29] | 2016 | 21 | 4,704∗ | VIS/NIR | RGB/IR | Print |
| SWIR [33] | 2016 | 5 | 141∗ | VIS/M-SWIR | RGB/4 SWIR bands | Print, 3D Mask |
| BRSU [34] | 2016 | 50+ | − | VIS/AM-SWIR | RGB/4 SWIR bands | Print, 3D Mask |
| EMSPAD [35] | 2017 | 50 | 14,000∗ | SpectraCam™ | 7 bands | 2 Print |
| SMAD [36] | 2017 | − | 130 | VIS | RGB | 3D Mask |
| MLFP [37] | 2017 | 10 | 1,350 | VIS/NIR/Thermal | RGB/IR/Heat | 2D/3D Mask |
| Oulu-NPU [23] | 2017 | 55 | 5,940 | VIS | RGB | 2 Print, 2 Replay |
| SiW [20] | 2018 | 165 | 4,620 | VIS | RGB | 2 Print, 4 Replay |
| WMCA [38] | 2019 | 72 | 6,716 | RealSense/STC-PRO | RGB/Depth/IR/Thermal | 2 Print, Replay, 2D/3D Mask |
| CASIA-SURF | 2018 | 1,000 | 21,000 | RealSense | RGB/Depth/IR | Print, Cut |
evaluation protocol for this bias issue. Furthermore, as shown in Table 1.2, the existing face anti-spoofing datasets (i.e., CASIA-FASD [7], Replay-Attack [24], OULU-NPU [23] and SiW [20]) have a limited number of samples, and most of them contain only the RGB modality. Although CASIA-SURF [40] is a large dataset compared to the existing alternatives, it still provides only limited attack types (only 2D print attacks) and a single ethnicity (East Asia). Therefore, in order to alleviate the above problems, the CASIA-SURF CeFA dataset, briefly named CeFA [46], was released; it is the largest face anti-spoofing dataset
Table 1.2 Comparisons among existing face PAD databases (i indicates the dataset only contains images. * indicates the dataset contains 4 ethnicities, but does not provide accurate ethnic labels for each sample and does not study ethnic bias in the protocol design. AS: Asian, A: Africa, U: Caucasian, I: Indian, E: East Asia, C: Central Asia)

| Dataset | Year | #Subject | #Num | Attack | Modality | Device | Ethnicity |
|---|---|---|---|---|---|---|---|
| Replay-Attack [8] | 2012 | 50 | 1200 | Print, Replay | RGB | RGB camera | – |
| CASIA-FASD [7] | 2012 | 50 | 600 | Print, Cut, Replay | RGB | RGB camera | – |
| 3DMAD [9] | 2014 | 17 | 255 | 3D print mask | RGB/Depth | RGB camera/Kinect | – |
| MSU-MFSD [26] | 2015 | 35 | 440 | Print, Replay | RGB | Cellphone/Laptop | – |
| Replay-Mobile [25] | 2016 | 40 | 1030 | Print, Replay | RGB | Cellphone | – |
| Msspoof [29] | 2016 | 21 | 4704i | Print | RGB/IR | RGB/IR camera | – |
| OULU-NPU [23] | 2017 | 55 | 5940 | Print, Replay | RGB | RGB camera | – |
| SiW [20] | 2018 | 165 | 4620 | Print, Replay | RGB | RGB camera | AS/A/U/I* |
| CASIA-SURF [40] | 2019 | 1000 | 21000 | Print, Cut | RGB/Depth/IR | Intel RealSense | E |
| CASIA-SURF CeFA | 2019 | 1500 | 18000 | Print, Replay | RGB/Depth/IR | Intel RealSense | A/E/C |
| | | 99 | 5346 | 3D print mask | RGB/Depth/IR | | |
| | | 8 | 192 | 3D silica gel mask | RGB/IR | | |
| | | Total: 1607 subjects, 23538 videos | | | | | |
up to date in terms of ethnicities, modalities, number of subjects, and attack types. A comparison with existing alternative datasets is given in Table 1.2. Concretely, the attack types of the CeFA dataset are diverse, including print attacks (printed on cloth), video replay attacks, and 3D print and silica gel mask attacks. More importantly, it is the first public dataset designed for exploring the impact of cross-ethnicity evaluation. Some original frames and the corresponding processed samples (i.e., keeping only the face region) are shown in Fig. 1.4. Moreover, to relieve the ethnic bias, a multi-modal fusion strategy is introduced in this work, based on the consideration that a real or fake face that is indistinguishable in one modality because of ethnic factors may exhibit quite different properties in another modality. Published fusion methods [40] restrict the interactions among different modalities, since the branches are independent before the fusion point, which makes it difficult to effectively exploit modality relatedness from the beginning of the network to its end. In [46], Liu et al. propose a Partially Shared Multi-modal Network, namely PSMM-Net, as a strong baseline to alleviate ethnic and attack-pattern bias. On the one hand, it fuses multi-modal features at each feature scale instead of starting from a certain fusion point. On the other hand, it allows
Fig. 1.4 Processed samples of the CeFA dataset. It contains 1,607 subjects of 3 different ethnicities (i.e., Africa, East Asia, and Central Asia), with 4 attack types (i.e., print attack, replay attack, 3D print and silica gel attacks) under various lighting conditions. A light red/blue background indicates a 2D/3D attack
information exchange and interaction among different modalities by introducing a shared branch. In addition, each single-modal branch (e.g., RGB, Depth or IR) uses a simple and effective backbone, ResNet [47], to learn static features for subsequent feature fusion.
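A minimal sketch of the partially shared idea is given below: each modality keeps its own stack of convolutional stages, while a shared branch aggregates the per-stage features of all modalities so that cross-modal information flows throughout the network rather than only at a late fusion point. The stage sizes, the aggregation rule, and the use of 3-channel Depth/IR inputs are illustrative assumptions and do not reproduce the published PSMM-Net.

```python
import torch
import torch.nn as nn

def conv_stage(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class PartiallySharedNet(nn.Module):
    """Three modality branches plus a shared branch that fuses features at every stage."""
    def __init__(self, stages=(32, 64, 128)):
        super().__init__()
        cins = (3,) + stages[:-1]
        def branch():
            return nn.ModuleList([conv_stage(ci, co) for ci, co in zip(cins, stages)])
        self.rgb, self.depth, self.ir, self.shared = branch(), branch(), branch(), branch()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(stages[-1] * 4, 2)            # 3 modal branches + shared branch

    def forward(self, rgb, depth, ir):
        xr, xd, xi = rgb, depth, ir
        xs = (rgb + depth + ir) / 3.0                       # shared branch input
        for lr, ld, li, ls in zip(self.rgb, self.depth, self.ir, self.shared):
            xr, xd, xi = lr(xr), ld(xd), li(xi)
            xs = ls(xs) + xr + xd + xi                      # per-stage cross-modal fusion
        feats = [self.pool(t).flatten(1) for t in (xr, xd, xi, xs)]
        return self.head(torch.cat(feats, dim=1))           # live / spoof logits

net = PartiallySharedNet()
x = torch.randn(2, 3, 112, 112)                             # Depth/IR replicated to 3 channels
print(net(x, x.clone(), x.clone()).shape)                   # torch.Size([2, 2])
```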
1.2.2.3 3D High-Fidelity Mask Face Presentation Attack Detection
With the maturity of 3D printing technology, face masks have become a new type of PA threatening the security of face recognition systems. Compared with traditional 2D PAs, face masks are more realistic in terms of color, texture, and geometric structure, making it easy to fool a face PAD system designed around coarse texture and facial depth information [20]. Fortunately, some works have been devoted to 3D mask attacks, including the design of datasets [38, 50, 55–57] and algorithms [33, 54, 58–60]. In terms of the composition of 3D mask datasets, several drawbacks limit the generalization ability of data-driven algorithms. From the existing 3D mask datasets shown in Table 1.3, one can see some of these drawbacks:

• Bias of identity. The number of mask subjects is smaller than the number of real-face subjects. Even in some public datasets such as [33, 36–38], the mask and live subjects correspond to completely different identities, which may lead a model to mistake identity for a discriminative PAD-related feature;
• Limited number of subjects and low skin tone variability. Most datasets contain fewer than 50 subjects, with low or unspecified skin tone variability;
• Limited diversity of mask materials. Most datasets [36, 37, 50–54] provide fewer than 3 mask materials, which makes it difficult to cover the attack masks that attackers may use;
• Few scene settings. Most datasets [50–53] only consider single deployment scenarios, without covering complex real-world scenarios;
Table 1.3 Comparison of the public 3D face anti-spoofing datasets. 'Y', 'W', and 'B' are shorthand for yellow, white, and black, respectively. 'Sub.', 'Mask Id.' and 'Light. Cond.' denote 'Live subjects', 'Mask identity numbers' and 'Lighting Condition', respectively. A number with '*' is statistically inferred and there may be inaccuracies

| Dataset, Year | Skin tone | #Sub. | #Mask Id. | Scenes | Material | Light. Cond. | Devices | #Videos (#Live/#Fake) |
|---|---|---|---|---|---|---|---|---|
| 3DMAD, [50] | W/B | 17 | 17 | Controlled | Paper, hard resin | Adjustment | Kinect | 255 (170/85) |
| 3DFS-DB, [52] | Y | 26 | 26 | Controlled | Plastic | Adjustment | Kinect, Carmine 1.09 | 520 (260/260) |
| BRSU, [33] | W/B | 137 | 6 | Disguise, Counterfeiting | Silicone, Plastic, Resin, Latex | Adjustment | SWIR, Color | 141 (0/141) |
| MARsV2, [51] | Y | 12 | 12 | Office | ThatsMyFace, REAL-F | Room light, Low light, Bright light, Warm light, Side light, Up side light | Logitech C920, EOS M3, Nexus 5, iPhone 6, Samsung S7, Sony Tablet S | 1008 (504/504) |
| SMAD, [36] | – | Online | Online | – | Silicone | Varying lighting | Varying cam. | 130 (65/65) |
| MLFP, [37] | Y/W/B | 10 | 7 | Indoor, Outdoor | Latex, Paper | Daylight | Visible, Near infrared, Thermal | 1350 (150/1200) |
| ERPA, [53] | W/B | 5 | 6 | Indoor | Resin, Silicone | Room light | Xenics Gobi thermal cam., Intel RealSense SR300 | 86 |
| WMCA, [38] | Y/W/B | 72 | 7* | Office | Plastic, Silicone, Paper | Office light, LED lamps, day-light | Intel RealSense SR300, Seek Thermal Compact PRO | 1679 (347/1332) |
| CASIA-SURF 3DMask, [54] | Y | 48 | 48 | Indoor, Outdoor | Plaster | Normal light, Back light, Front light, Side light, Sunlight, Shadow | Apple, Huawei, Samsung | 1152 (288/864) |
| HiFiMask (Ours), 2021 | Y/W/B | 75 | 75 | White, Green, Tricolor, Sunshine, Shadow, Motion | Transparent, Plaster, Resin | NormalLight, DimLight, BrightLight, BackLight, SideLight, TopLight | iPhone11, iPhoneX, MI10, P40, S20, Vivo, HJIM | 54,600 (13,650/40,950) |
• Controlled lighting environment. Lighting changes pose a great challenge to the stability of rPPG-based PAD methods [49]. However, all existing mask datasets avoid this by fixing the lighting, e.g., daylight or office light;
• Obsolete acquisition devices. Many datasets use outdated acquisition devices in terms of resolution and imaging quality.

To alleviate the previous issues, a large-scale 3D High-Fidelity Mask dataset for face PAD, namely HiFiMask [61], was introduced. As shown in Table 1.3, HiFiMask contains 25 subjects for each of the yellow, white, and black skin tones (75 subjects in total), in order to facilitate fair artificial intelligence (AI) and alleviate skin-tone-related biases. Each subject provides 3 kinds of high-fidelity masks made of different materials (i.e., plaster, resin, and transparent material); thus, a total of 225 masks were collected. In terms of recording scenarios, 6 scenes are considered, including indoor and outdoor environments, with 6 additional directional and periodic lighting conditions. As for the sensors for video recording, 7 mainstream imaging devices are used. In total, 54,600 videos were collected, of which 13,650 are live and 40,950 are mask videos. For 3D face PAD, both appearance-based [38, 54, 62, 63] and remote photoplethysmography (rPPG)-based [59, 64, 65] methods have been developed. As illustrated in Fig. 1.5, although both the appearance-based method ResNet50 [48] and the rPPG-based method GrPPG [49] perform well on the 3DMAD [50] and HKBU-MARs V2 (briefly, MARsV2) [51] datasets, these methods fail to achieve high performance on the proposed HiFiMask dataset. On the one hand, the high-fidelity appearance of 3D masks makes them harder to distinguish from bona fide faces. On the other hand, temporal light interference produces pseudo 'liveness' cues even for 3D masks, which can confuse rPPG-based attack detectors. To tackle the challenges of high-fidelity appearance and temporal light interference, the work in [61] proposes a novel Contrastive Context-aware Learning framework, namely CCL, which learns discriminability by comparing image pairs with diverse contexts (a generic sketch of such a pair-wise loss is given at the end of this section). Various kinds of image pairs are organized according to context attribute types, which provide rich and meaningful contextual cues for representation learning. For instance, constructing face pairs from the same identity with both a bona fide presentation (i.e., skin material) and a mask presentation (i.e., resin material) benefits fine-grained material feature learning. Because of the significant appearance variations between some 'hard' positive pairs, the convergence of the CCL framework can sometimes be unstable. To alleviate the influence of such 'outlier' pairs and accelerate convergence, a Context Guided Dropout module, namely CGD, is proposed for robust contrastive learning by adaptively discarding part of the unnecessary embedding features. As described above, current work on face presentation attack detection faces many limitations in the existing face anti-spoofing benchmarks. Therefore, in order to promote face anti-spoofing technology, we organized a series of face anti-spoofing workshops and competitions at top AI conferences (such as CVPR and ICCV). We hope this book can help readers gain a more comprehensive understanding of this research area.
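The pair-wise contrastive idea mentioned above can be illustrated with the generic margin-based loss below, where embeddings of paired face crops are pulled together when both are live or both are masks and pushed apart otherwise; context labels (identity, material, lighting) would guide how the pairs are constructed. This is only a standard contrastive-loss sketch under assumed inputs, not the published CCL objective or its Context Guided Dropout module.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(emb_a, emb_b, same_label, margin=0.5):
    """emb_a, emb_b: (B, D) embeddings of paired face crops.
    same_label: (B,) float tensor, 1.0 if both samples are live or both are masks,
    0.0 for a live/mask pair. Positives are attracted, negatives pushed beyond `margin`."""
    dist = 1.0 - F.cosine_similarity(emb_a, emb_b)           # (B,) cosine distance
    pos = same_label * dist.pow(2)                           # pull matching pairs together
    neg = (1.0 - same_label) * F.relu(margin - dist).pow(2)  # push mismatched pairs apart
    return (pos + neg).mean()

# Toy usage: the first pair shares the live/mask label, the second does not
emb_a, emb_b = torch.randn(2, 128), torch.randn(2, 128)
loss = pairwise_contrastive_loss(emb_a, emb_b, torch.tensor([1.0, 0.0]))
print(float(loss))
```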
[Figure 1.5: two bar-chart panels, (a) Challenges on High-Fidelity Appearances and (b) Challenges on Temporal Light Interference, reporting ACER (%) and EER (%) for ResNet50 and GrPPG on the 3DMAD, HKBU-MARsV2, and HiFiMask (Ours) datasets.]
Fig. 1.5 Performance of ResNet50 [48] and GrPPG [49] on 3DMAD [50], MARsV2 [51] and our proposed HiFiMask datasets. Despite satisfying mask PAD performance on 3DMAD [50] and MARsV2 [51], these methods fail to achieve convincing results on HiFiMask
References 1. Kaufman GJ, Breeding KJ (1976) The automatic recognition of human faces from profile silhouettes. IEEE Trans Smc 6(2):113–121 2. Goldstein AJ, Harmon LD, Lesk AB (1971) Identification of human faces. Proc IEEE 59(5):748– 760 3. Harmon LD, Khan MK, Lasch R, Ramig PF (1981) Machine identification of human faces. Pattern Recognit 13(2):97–110 4. Kirby M, Sirovich L (2002) Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Trans Pattern Anal Mach Intell 12(1):103–108 5. Turk M (1991) Eigenfaces for recognition. J Cogn Neurosci 3 6. Moore GE (2002) Cramming more components onto integrated circuits. Proc IEEE 86(1):82–85 7. Zhang Z, Yan J, Liu S, Lei Z, Yi D, Li SZ (2012) A face antispoofing database with diverse attacks. In: ICB 8. Chingovska I, Anjos A, Marcel S (2012) On the effectiveness of local binary patterns in face anti-spoofing. In: Biometrics special interest group
9. Erdogmus N, Marcel S (2014) Spoofing in 2d face recognition with 3d masks and anti-spoofing with kinect. In: BTAS 10. de Freitas Pereira T, Anjos A, De Martino JM, Marcel S (2013) Can face anti-spoofing countermeasures work in a real world scenario? In: 2013 international conference on biometrics (ICB). IEEE, pp 1–8 11. Yang J, Lei Z, Liao S, Li SZ (2013) Face liveness detection with component dependent descriptor. In: ICB 12. Komulainen J, Hadid A, Pietikäinen M (2013) Context based face anti-spoofing. In: 2013 IEEE sixth international conference on biometrics: theory, applications and systems (BTAS). IEEE, pp 1–8 13. Patel K, Han H, Jain AK (2016) Secure face unlock: spoof detection on smartphones. IEEE TIFS 14. Boulkenafet Z, Komulainen J, Hadid A (2016) Face spoofing detection using colour texture analysis. IEEE TIFS 15. Nikisins O, Mohammadi A, Anjos A, Marcel S (2018) On effectiveness of anomaly detection approaches against unseen presentation attacks in face anti-spoofing. In: 2018 international conference on biometrics (ICB). IEEE, pp 75–81 16. Fatemifar S, Arashloo SR, Awais M, Kittler J (2019) Spoofing attack detection by anomaly detection. In: ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 8464–8468 17. Arashloo SR, Kittler J, Christmas W (2017) An anomaly detection approach to face spoofing detection: a new formulation and evaluation protocol. IEEE Access 18. Fatemifar S, Awais M, Arashloo SR, Kittler J (2019) Combining multiple one-class classifiers for anomaly based face spoofing attack detection. In: 2019 international conference on biometrics (ICB). IEEE, pp 1–7 19. Atoum Y, Liu Y, Jourabloo A, Liu X (2017) Face anti-spoofing using patch and depth-based cnns. In: 2017 IEEE international joint conference on biometrics (IJCB). IEEE, pp 319–328 20. Liu Y, Jourabloo A, Liu X (2018) Learning deep models for face anti-spoofing: binary or auxiliary supervision. In: In Proceedings of the IEEE conference on computer vision and pattern recognition, pp 389–398 21. Shao R, Lan X, Li J, Yuen PC (2019) Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In: CVPR 22. Jourabloo A, Liu Y, Liu X (2018) Face de-spoofing: anti-spoofing via noise modeling. In: ECCV 23. Boulkenafet Z, Komulainen J, Li L, Feng X, Hadid A (2017) Oulu-npu: a mobile face presentation attack database with real-world variations. In: FGR, pp 612–618 24. Chingovska I, Anjos A, Marcel S (2012) On the effectiveness of local binary patterns in face anti-spoofing. In: BIOSIG 25. Costa-Pazo A, Bhattacharjee S, Vazquez-Fernandez E, Marcel S (2016) The replay-mobile face presentation-attack database. In: BIOSIG 26. Wen D, Han H, Jain AK (2015) Face spoof detection with image distortion analysis. IEEE TIFS 27. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: CVPR. IEEE 28. Yi D, Lei Z, Liao S, Li SZ (2014) Learning face representation from scratch. arXiv 29. Chingovska I, Erdogmus N, Anjos A, Marcel S (2016) Face recognition systems under spoofing attacks. In: Face recognition across the imaging spectrum 30. Dhamecha TI, Singh R, Vatsa M, Kumar A (2014) Recognizing disguised faces: human and machine evaluation. Plos One 9 31. Raghavendra R, Raja KB, Busch C (2015) Presentation attack detection for face recognition using light field camera. TIP
32. Liu S, Yuen PC, Zhang S, Zhao G (2016) 3d mask face anti-spoofing with remote photoplethysmography. In: ECCV. Springer 33. Steiner H, Kolb A, Jung N (2016) Reliable face anti-spoofing using multispectral swir imaging. In: ICB. IEEE 34. Steiner H, Sporrer S, Kolb A, Jung N (2016) Design of an active multispectral swir camera system for skin detection and face verification. J Sens 2016:1–16 35. Raghavendra R, Raja KB, Venkatesh S, Cheikh FA, Busch C (2017) On the vulnerability of extended multispectral face recognition systems towards presentation attacks. In: ISBA, pp 1–8 36. Manjani I, Tariyal S, Vatsa M, Singh R, Majumdar A (2017) Detecting silicone mask-based presentation attack via deep dictionary learning. TIFS 37. Agarwal A, Yadav D, Kohli N, Singh R, Vatsa M, Noore A (2017) Face presentation attack with latex masks in multispectral videos. In: CVPRW 38. George A, Mostaani Z, Geissenbuhler D, Nikisins O, Anjos A, Marcel S (2019) Biometric face presentation attack detection with multi-channel convolutional neural network. TIFS 39. Zhang S, Wang X, Liu A, Zhao C, Wan J, Escalera S, Shi H, Wang Z, Li SZ (2019) A dataset and benchmark for large-scale multi-modal face anti-spoofing. In: CVPR 40. Zhang S, Liu A, Wan J, Liang Y, Guo G, Escalera S, Escalante HJ, Li SZ (2020) Casia-surf: a large-scale multi-modal benchmark for face anti-spoofing. TBIOM 2(2):182–193 41. Liu W, Wen Y, Yu Z, Li M, Raj B, Song L (2017) Sphereface: deep hypersphere embedding for face recognition. In: CVPR 42. Wang X, Wang S, Zhang S, Fu T, Shi H, Mei T (2018) Support vector guided softmax loss for face recognition. arXiv 43. Are face recognition systems accurate? depends on your race (2016). https://www. technologyreview.com/s/601786 44. Alvi M, Zisserman A, Nellaker C. Turning a blind eye: explicit removal of biases and variation from deep neural network embeddings 45. Wang M, Deng W, Hu J, Tao X, Huang Y (2019) Racial faces in the wild: reducing racial bias by information maximization adaptation network. In: ICCV, Oct 2019 46. Liu A, Tan Z, Wan J, Escalera S, Guo G, Li SZ (2021) Casia-surf cefa: a benchmark for multimodal cross-ethnicity face anti-spoofing. In: WACV 47. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR 48. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, June 2016 49. Li X, Komulainen J, Zhao G, Yuen P-C, Pietikäinen M (2016) Generalized face anti-spoofing by detecting pulse from face videos. In: ICPR 50. Erdogmus N, Marcel S (2013) Spoofing in 2d face recognition with 3d masks and anti-spoofing with kinect. In: BTAS 51. Liu S, Yang B, Yuen PC, Zhao G (2016) A 3d mask face anti-spoofing database with real world variations. In: CVPRW 52. Galbally J, Satta R (2016) Three-dimensional and two-and-a-half-dimensional face recognition spoofing using three-dimensional printed models. IET Biometrics 53. Bhattacharjee S, Marcel S (2017) What you can’t see can help you-extended-range imaging for 3d-mask presentation attack detection. In: BIOSIG. IEEE 54. Yu Z, Wan J, Qin Y, Li X, Li SZ, Zhao G (2020) Nas-fas: static-dynamic central difference network search for face anti-spoofing. In: TPAMI 55. Li H, Li W, Cao H, Wang S, Huang F, Kot AC (2018) Unsupervised domain adaptation for face anti-spoofing. TIFS 56. Liu Y, Stehouwer J, Jourabloo A, Liu X (2019) Deep tree learning for zero-shot face anti-spoofing. In: CVPR
57. Heusch G, George A, Geissbuhler D, Mostaani Z, Marcel S (2020) Deep models and shortwave infrared information to detect face presentation attacks. TBIOM, p 1 58. Kose N, Dugelay J-L (2014) Mask spoofing in face recognition and countermeasures. Image Vis Comput 59. Liu S-Q, Lan X, Yuen PC (2018) Remote photoplethysmography correspondence feature for 3d mask face presentation attack detection. In: ECCV 60. George A, Marcel S (2021) Cross modal focal loss for rgbd face anti-spoofing. In: CVPR, pp 7882–7891 61. Liu A, Zhao C, Yu Z, Wan J, Su A, Liu X, Tan Z, Escalera S, Xing J, Liang Y et al (2022) Contrastive context-aware learning for 3d high-fidelity mask face presentation attack detection. IEEE Trans Inf Forensics Secur 17:2497–2507 62. Jia S, Guo G, Xu Z (2020) A survey on 3d mask presentation attack detection and countermeasures. Pattern Recognit 63. Jia S, Li X, Hu C, Guo G, Xu Z (2020) 3d face anti-spoofing with factorized bilinear coding. TCSVT 64. Lin B, Li X, Yu Z, Zhao G (2019) Face liveness detection by rppg features and contextual patch-based cnn. In: ICBEA. ACM 65. Liu S-Q, Lan X, Yuen PC (2021) Multi-channel remote photoplethysmography correspondence feature for 3d mask face presentation attack detection. TIFS
2 Face Presentation Attack Detection (PAD) Challenges
2.1 The Multi-modal Face PAD Challenge
This section summarizes the Multi-modal Presentation Attack Detection challenge co-located with CVPR 2019 [1]. We describe the dataset, the evaluation protocols, and the main findings derived from the analysis of results.
2.1.1 The CASIA-SURF Dataset
Existing PAD datasets consider a limited number of subjects and data modalities, which severely impedes the development of face PAD methods with recognition rates satisfactory enough to be applied in practice, for example in face payment or device unlocking applications. In order to address these limitations, we collected a new large-scale and multi-modal face PAD dataset, namely CASIA-SURF. To the best of our knowledge, it is currently the largest face anti-spoofing dataset, containing 1,000 Chinese subjects in 21,000 videos with three modalities (RGB, Depth, IR). Another motivation for creating this dataset, beyond pushing face anti-spoofing research further, is to explore the performance of recent face anti-spoofing methods when a large amount of data is available. In this section, we provide a detailed introduction to the proposed dataset, including acquisition details, data preprocessing, statistics, and the evaluation protocol.
2.1.1.1 Acquisition Details
Figure 2.1 shows the data acquisition procedure, i.e., how the multi-modal data is recorded with the multi-modal camera in diverse indoor environments. Specifically, we use the Intel RealSense SR300 camera to capture the RGB, Depth and Infrared (IR) videos simultaneously. During video recording, collectors are required to turn left or right, move
Fig. 2.1 Illustrative sketch of recording setups in the CASIA-SURF dataset
Fig. 2.2 Preprocessing details of three modalities of the CASIA-SURF dataset
up or down, and walk towards or away from the camera. The performers stand within a range of 0.3 to 1.0 m from the camera, and their face angle is required to be less than 30°. Four video streams, namely RGB, Depth, IR, and RGB-Depth-IR aligned images, are then captured simultaneously using the RealSense SDK (a minimal capture sketch is given after the attack list below). The resolution is 1280 × 720 for RGB images and 640 × 480 for Depth, IR and aligned images. Some examples of RGB, Depth, IR and aligned images are shown in the first column of Fig. 2.2. To obtain the attack faces, we print the collectors' color pictures on A4 paper. The printed flat or curved face images then have their eye, nose, and mouth areas cut out, alone or in combination, generating 6 different attacks. Thus, each sample includes 1 live video clip and 6 fake video clips. Fake samples are shown in Fig. 2.3. Detailed information on the 6 attacks is given below.
• Attack 1: A person holds his/her flat face photo where the eye regions are cut out.
• Attack 2: A person holds his/her curved face photo where the eye regions are cut out.
Fig. 2.3 Six attack styles in the CASIA-SURF dataset
• Attack 3: A person holds his/her flat face photo where the eye and nose regions are cut out.
• Attack 4: A person holds his/her curved face photo where the eye and nose regions are cut out.
• Attack 5: A person holds his/her flat face photo where the eye, nose and mouth regions are cut out.
• Attack 6: A person holds his/her curved face photo where the eye, nose and mouth regions are cut out.
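For reference, a minimal capture loop with the RealSense SDK (pyrealsense2) might look like the sketch below, which opens the color, depth and infrared streams simultaneously. The exact stream profiles depend on the device and SDK version, so the resolutions and formats shown here are assumptions rather than the recording configuration actually used for CASIA-SURF.

```python
import numpy as np
import pyrealsense2 as rs  # Intel RealSense SDK Python bindings

pipeline, config = rs.pipeline(), rs.config()
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)   # RGB stream
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)     # Depth stream
config.enable_stream(rs.stream.infrared, 640, 480, rs.format.y8, 30)   # IR stream
pipeline.start(config)
try:
    frames = pipeline.wait_for_frames()                                # one synchronized frame set
    rgb = np.asanyarray(frames.get_color_frame().get_data())           # (720, 1280, 3) uint8
    depth = np.asanyarray(frames.get_depth_frame().get_data())         # (480, 640) uint16
    ir = np.asanyarray(frames.get_infrared_frame().get_data())         # (480, 640) uint8
    # rs.align(rs.stream.color).process(frames) can additionally produce RGB-aligned frames
finally:
    pipeline.stop()
```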
2.1.1.2 Data Preprocessing
Data preprocessing, such as face detection and face alignment, is widely used in face recognition systems, and different preprocessing methods can affect face anti-spoofing algorithms. To focus on the face anti-spoofing task and to increase its difficulty, we process the original data via face detection and alignment. As shown in Fig. 2.2, we first use the Dlib [2] toolkit to detect the face in every frame of the RGB and RGB-Depth-IR aligned videos, respectively. Then we apply the PRNet [3] algorithm to perform 3D reconstruction and dense alignment on the detected faces. After that, we define a binary mask based on the non-active face reconstruction area from the previous steps. Finally, we obtain the face area of the RGB image via a point-wise product between the RGB image and the RGB binary mask; the Depth (or IR) face area is computed via a point-wise product between the Depth (or IR) image and the RGB-Depth-IR binary mask. After the data preprocessing stage, we manually check all processed RGB images to ensure that they contain a high-quality, large face.
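A simplified version of this preprocessing step is sketched below: Dlib detects the face in a frame and a binary mask is applied with a point-wise product to keep only the face area. In the real pipeline the mask comes from PRNet's dense 3D reconstruction; here, as an assumption for illustration, the detected rectangle is used as a stand-in mask, and the file names are hypothetical.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()        # HOG-based detector shipped with dlib

def keep_face_area(image, binary_mask):
    """Point-wise product between an image and a {0,1} mask of shape (H, W)."""
    return image * binary_mask[..., None].astype(image.dtype)

frame = cv2.imread("frame_rgb.png")                 # hypothetical sampled RGB frame
faces = detector(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 1)   # upsample once
if faces:
    r = faces[0]
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    mask[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()] = 1   # stand-in for the PRNet mask
    face_only = keep_face_area(frame, mask)
    cv2.imwrite("frame_rgb_face.png", face_only)
```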
2.1.1.3 Statistics
Table 2.1 presents the main statistics of the proposed CASIA-SURF dataset. (1) There are 1,000 subjects with variability in terms of gender, age, glasses/no glasses, and indoor environments; each subject has 1 live video clip and 6 fake video clips. (2) The data is divided into three subsets: the training, validation and testing subsets have 300, 100 and 600 subjects, with 6,300 (2,100 per modality), 2,100 (700 per modality) and 12,600 (4,200 per modality) videos, respectively. (3) From the original videos, there are about 1.5 million, 0.5 million and 3.1
Table 2.1 Statistical information of the proposed CASIA-SURF dataset

| | Training | Validation | Testing | Total |
|---|---|---|---|---|
| # Subject | 300 | 100 | 600 | 1,000 |
| # Video | 6,300 | 2,100 | 12,600 | 21,000 |
| # Original image | 1,563,919 | 501,886 | 3,109,985 | 5,175,790 |
| # Sampled image | 151,635 | 49,770 | 302,559 | 503,964 |
| # Processed image | 148,089 | 48,789 | 295,644 | 492,522 |
Fig. 2.4 Gender and age distribution of the CASIA-SURF dataset
million frames in total for the training, validation, and testing subsets, respectively. Owing to the huge amount of data, we keep one frame out of every 10, forming a sampled set of about 151 K, 49 K, and 302 K frames for the training, validation and testing subsets, respectively. (4) After removing frames in which no face was detected, due to extreme poses or lighting conditions, during data preprocessing, we finally obtain about 148 K, 48 K and 295 K images for the training, validation and testing subsets of the CASIA-SURF dataset. Gender statistics are shown on the left side of Fig. 2.4: 56.8% of the subjects are female and 43.2% are male. The right side of Fig. 2.4 shows the age distribution of the CASIA-SURF dataset. Ages range widely, from 20 to more than 70 years old, with most subjects under 70. The [20, 30) age range is dominant, accounting for about 50% of all subjects.
2.1.1.4 Evaluation Protocol
We select the live faces and Attacks 4, 5 and 6 for the training subset, and the live faces and Attacks 1, 2 and 3 for the validation and testing subsets. This makes the ratio of flat to curved faces and the extent of the cut-out regions differ between training and evaluation, in order to increase
the difficulty. The validation subset is used for model and hyper-parameter selection, and the testing subset for the final evaluation. Our dataset has two types of evaluation protocol: (1) within-modal evaluation, in which algorithms are trained and evaluated on the same modalities; and (2) cross-modal evaluation, in which algorithms are trained on one modality and evaluated on other modalities. Following the face recognition task, we use the ROC curve as the main evaluation metric. The ROC curve is a suitable indicator for algorithms deployed in real-world applications, because a suitable trade-off threshold between FPR and TPR can be selected according to the requirements. Empirically, we compute TPR@FPR = 10−2, 10−3 and 10−4 as quantitative indicators, and regard TPR@FPR = 10−4 as the main comparison. In addition, the commonly used metrics ACER, APCER and NPCER are also provided for reference.
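The TPR@FPR indicator can be computed directly from the ROC curve of the predicted liveness scores, for example with scikit-learn as sketched below. The score convention (higher means more likely live) is an assumption; only the operating points whose FPR stays below the target are considered.

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(labels, scores, target_fpr=1e-4):
    """labels: 1 = live, 0 = attack; scores: higher = more likely live.
    Returns the best TPR (and its threshold) among points with FPR <= target_fpr."""
    fpr, tpr, thr = roc_curve(labels, scores)
    ok = fpr <= target_fpr
    if not np.any(ok):
        return 0.0, None
    best = np.argmax(tpr[ok])
    return float(tpr[ok][best]), float(thr[ok][best])

# Toy example with synthetic scores; real evaluation uses the model's per-sample outputs
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=20000)
scores = 0.6 * labels + 0.5 * rng.random(20000)
print(tpr_at_fpr(labels, scores, target_fpr=1e-2))
```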
2.1.2 Workshop and Competition at CVPR 2019
In this section, we briefly describe the organized challenge, including the evaluation metric and the challenge protocol. We relied on the CASIA-SURF [4] dataset for the organization of the ChaLearn Face Anti-spoofing Attack Detection Challenge. Accordingly, the CASIA-SURF dataset was processed as follows:
1. The dataset was split into three partitions: training, validation and testing sets, with 300, 100 and 600 subjects, respectively. This partitioning corresponds to 6,300 (2,100 per modality), 2,100 (700 per modality) and 12,600 (4,200 per modality) videos for the corresponding partitions.
2. For each video, we retained 1 out of every 10 frames to reduce its size. This subsampling strategy results in 148K, 48K and 295K frames for the training, validation and testing subsets, respectively.
3. The background, except for the face area, was removed from the original videos to increase the difficulty of the task.

Evaluation. In this challenge, we selected the recently standardized ISO/IEC 30107-3 metrics:1 the Attack Presentation Classification Error Rate (APCER), the Normal Presentation Classification Error Rate (NPCER) and the Average Classification Error Rate (ACER), defined as follows:

APCER = FP / (FP + TN)        (2.1)
NPCER = FN / (FN + TP)        (2.2)
ACER = (APCER + NPCER) / 2    (2.3)
1 https://www.iso.org/obp/ui/iso.
22
2 Face Presentation Attack Detection (PAD) Challenges
where TP, FP, TN and FN corresponds to true positive, false positive, true negative and false negative, respectively. APCER and BPECER are used to measure the error rate of fake or live samples, respectively. Inspired by face recognition, the Receiver Operating Characteristic (ROC) curve is introduced for large-scale face Anti-spoofing detection in CASIA-SURF dataset, which can be used to select a suitable threshold to trade off the false positive rate (FPR) and true positive rate (TPR) according to the requirements of real applications. Finally, The value TPR@FPR=10−4 was the leading evaluation measure for this challenge. APCER, NPCER and ACER measures were used as additional evaluation criteria. Challenge protocol. The challenge was run in the CodaLab [5]2 platform, and comprised two stages as follows: Development Phase: (Phase duration: 2.5 months). During this phase participants had access to labeled training data and unlabeled validation samples. Participants could use training data to develop their models, and they could submit predictions on the validation partition. Training data were made available with samples labeled with the genuine and 3 forms of attack (4,5,6). Whereas samples in the validation partition were associated to genuine and 3 different attacks (1,2,3). For the latter dataset, labels were not made available to participants. Instead, participants could submit predictions on the validation partition and receive immediate feedback via the leader board. The main reason for including different attack types in the training and validation dataset was to increase the difficulty of FAD challenge. Final phase: (Phase duration: 5 days). During this phase, labels for the validation subset were made available to participants, so that they can have more labeled data for training their models. The unlabeled testing set was also released, participants had to make predictions for the testing partition and upload their solutions to the challenge platform. The considered test set was formed by examples labeled with the genuine label and 3 attack types (1,2,3). Participants had the opportunity to make 3 submissions for the final phase, this was done with the goal of assessing stability of their methods. The final ranking of participants was obtained from the performance of submissions in the testing sets. To be eligible for prizes, winners had to publicly release their code under a license of their choice and provide a fact sheet describing their solution.
2.1.3
Availability of Associated Dataset
We cooperated with a startup SurfingTech,3 which helped us to collect the face anti-spoofing data. This company focuses on data collection, data labelling, as well as it sells collected data with accurate labels. All participants had a monetary compensation and signed an agreement
2 https://competitions.codalab.org. 3 http://surfing.ai/.
2.2
Cross-Ethnicity Face PAD Challenge
23
to make data public for academic research. If you interested in this dataset, you can apply it in the link.4
2.2
Cross-Ethnicity Face PAD Challenge
The success of the multimodal PAD competition motivated us to keep working on this challenging domain, where a number of limitations have to be addressed before PAD technology is ready to be used in applications. We identified ethnic biases as a major challenge in PAD benchmarks and methodologies, and for that reason– we organized a new challenge focusing in this specific subject. This section summarizes the Cross-ethnicity face PAD challenge [6] colocaged with CVPR2020.
2.2.1
CeFA Dataset
In order to study the ethnic bias for face anti-spoofing, we introduce the largest CASIA-SURF Cross-ethnicity Face Anti-spoofing (CeFA) [7] dataset, covering 3 ethnicities, 3 modalities, 1,607 subjects, and 2D plus 3D attack types. As our knowledge, CASIA-SURF CeFA is the first dataset including explicit ethnic labels in current released datasets
2.2.1.1 Acquisition Details We use the Intel Realsense to capture the RGB, Depth and IR videos simultaneously at 30 f ps. The resolution is 1280 × 720 pixels for each frame in video. Subjects are asked to move smoothly their head so as to have a maximum of around 300 deviation of head pose in relation to frontal view. Data pre-processing is similar to the one performed in [4], expect that PRNet [3] is replaced by 3DDFA [8, 9] for face region detection. In this section, we introduce the CeFA dataset in detail, such as acquisition details, attack types, and protocols.
2.2.1.2 Statistics As shown in Table 1.2, CeFA consists of 2D and 3D attack subsets. As shown in Fig. 1.4, for the 2D attack subset, it consists of print and video-replay attacks captured by subjects from three ethnicites (e.g., African, East Asian and Central Asian). See from the Table 2.2, each ethnicity has 500 subjects, and each subject has 1 real sample, 2 fake samples of print attack captured in indoor and outdoor, and 1 fake sample of video-replay. In total, there are 18,000 videos (6,000 per ethnicity).
4 https://sites.google.com/view/face-anti-spoofing-challenge/welcome/challengecvpr2019?
authuser=0.
24
2 Face Presentation Attack Detection (PAD) Challenges
Table 2.2 Statistics of the 2D attack in CeFA Ethnicity
Real & Attack # RGB styles
# Depth
# IR
Subtotal
African East Asian
Real Cloth-indoor attack Cloth-outdoor attack Replay attack
500 500
500 500
500 500
6000 6000
500
500
500
6000
500
500
500
Central Asian
Total: 1500 subjects, 18000 videos
Table 2.3 Statistics of the 3D attack in CeFA 3D mask attack
Attack styles
Print mask 99 subjects & 6 lighting
Only mask 594 Wig without 594 glasses Wig with 594 glasses
Silica gel mask 8 subjects & 4 lighting
Wig without glasses Wig with glasses
# RGB
# Depth
# IR
Subtotal
594 594
594 594
5346
594
594
32
32
32
32
32
32
192
Total: 107 subjects, 5538 videos
For the 3D attack subset in Table 2.3, it has 3D print mask and silica gel face attacks. Some samples are shown in Fig. 1.4. In the part of 3D print mask, it has 99 subjects, each subject with 18 fake samples captured in three attacks and six lighting environments. Specially, attack types include only face mask, wearing a wig with glasses, and wearing a wig without glasses. Lighting conditions include outdoor sunshine, outdoor shade, indoor side light, indoor front light, indoor backlit and indoor regular light. In total, there are 5,346 videos (1,782 per modality). For silica gel face attacks, it has 8 subjects, each subject has 8 fake samples captured in two attacks styles and four lighting environments. Attacks include wearing a wig with glasses and wearing a wig without glasses. Lighting environments include indoor side light, indoor front light, indoor backlit and indoor normal light. In total, there are 192 videos (64 per modality).
2.2
Cross-Ethnicity Face PAD Challenge
25
2.2.1.3 Evaluation Protocol The main motivation behind the CeFA dataset is to provide a benchmark to measure the generalization performance of new PAD methods in three main aspects: cross-ethnicity, cross-modality, cross-attacks, and the fairness of PAD methods in different ethnicities. We design five protocols for the 2D attacks subset, as shown in Table 2.4, totalling 12 subprotocols (1_1, 1_2, 1_3, 2_1, 2_2, 3_1, 3_2, 3_3, 4_1, 4_2, 4_3, and 5). We divided 500 subjects per ethnicity into three subject-disjoint subsets (second and fourth columns in Table 2.4). Each protocol has three data subsets: training, validation and testing sets, which contain 200, 100, and 200 subjects, respectively. • Protocol 1 (cross-ethnicity): Most of the public face PAD datasets lack of ethnicity labels or do not provide with a protocol to perform cross-ethnicity evaluation. Therefore, we design the first protocol to evaluate the generalization of PAD methods for cross-ethnicity testing. One ethnicity is used for training and validation, and the left two ethnicities are used for testing. Therefore, there are three different evaluations (third column of Protocol 1 in Table 2.4). • Protocol 2 (cross-PAI): Given the diversity and unpredictability of attack types from different presentation attack instruments (PAI), it is necessary to evaluate the robustness of face PAD algorithms to this kind of variations (sixth column of Protocol 2 in Table 2.4). • Protocol 3 (cross-modality): Inspired by heterogeneous face recognition, we define three cross-modality evaluations, each of them having one modality for training and the two remaining ones for testing (fifth column of Protocol 3 in Table 2.4). Although there are no real world scenarios for this protocol until now, if algorithms trained on a certain modality data are able to perform well on other modalities data, this will greatly enhance their versatility for different scenes with different devices. Similar to [10], we aim to provide this cross-modal evaluation protocol for those possible real-world scenarios in the future. • Protocol 4 (cross-ethnicity & PAI): The most challenging protocol is designed via combining the condition of both Protocol 1 and 2. As shown in Protocol 4 of Table. 2.4, the testing subset introduces two unknown target variations simultaneously. • Protocol 5 (bias-ethnicity): Algorithm fairness has started to attract the attention of researchers in Artificial Intelligence (AI). According to this criterion: an ideally fair algorithm should have consistent performance on different protected attributes. Therefore, in addition to measuring the generalization performance of the new methods on crossethnicity (i.e., Protocol 1), we also consider the fairness of an algorithm, where it is trained with data that includes all ethnicities, and assessed on different ethnicities, respectively. Like [11], the mean and variance of evaluate metrics for five protocols are calculated in our experiments. Detailed statistics for the different protocols are shown in Table 2.4.
A
C&E
Valid
Test
5
4
3
2
A
Train
1
Ethnicity
A&C&E
Test
A
C&E
Valid
Test
A&C&E
A&C&E
A
Train
Valid
Test
5
A
Train
4_1
A&C&E
Valid
A&C&E
Test
A&C&E
A&C&E
Valid
Train
A&C&E
Train
1_1
Subset
Prot.
C
A&E
C
C
4_2
A&E
C
C
1_2
E
A&C
E
E
4_3
A&C
E
E
1_3
301–500
201–300
1–200
301–500
201–300
1–200
301–500
201–300
1–200
301–500
201–300
1–200
301–500
201–300
1–200
Subjects
R&D&I
R&D&I
R&D&I
R&D&I
R&D&I
R&D&I
D&I
R
R
3_1
R&D&I
R&D&I
R&D&I
R&D&I
R&D&I
R&D&I
Modalities
R&I
D
D
3_2
R&D
I
I
3_3
Print
Replay
Replay
2_2
Print&Replay
Print&Replay
Print&Replay
Print
Replay
Replay
Print&Replay
Print&Replay
Print&Replay
Replay
Print
Print
2_1
Print&Replay
Print&Replay
Print&Replay
PAIs
600/ 600/600
900
1800
1200/ 1200/1200
300/300/300
600/600/600
1200/ 1200/1200
300/300/300
600/ 600/600
1800/1800
900/900
1800/1800
1200/ 1200/1200
300/300/300
600/ 600/600
# real videos
3800/ 3800/3800
2700
5400
5400/ 5400/5400
300/300/300
600/ 600/600
5600/ 5600/5600
900/ 900/900
1800/ 1800/1800
4800/6600
1800/900
3600/1800
6600/ 6600/6600
900/ 900/900
1800/ 1800/1800
# fake videos
4400/ 4400/4400
3600
7200
6600/ 6600/6600
600/ 600/600
1200/ 1200/1200
6800/ 6800/6800
1200/ 1200/1200
2400/ 2400/2400
6600/8400
2700/1800
5400/3600
7800/ 7800/7800
1200/ 1200/1200
2400/ 2400/2400
# all videos
Table 2.4 Five protocols are defined for CeFA: (1) cross-ethnicity, (2) cross-PAI, (3) cross-modality, (4) cross-ethnicity&PAI, (5) bias-ethnicity. Note that the 3D attacks subset are included in each testing protocol (not shown in the table). & indicates merging; ∗_∗ corresponds to the name of sub-protocols. R: RGB, D: Depth, I: IR. Other abbreviated as in Table 1.2
26 2 Face Presentation Attack Detection (PAD) Challenges
2.2
Cross-Ethnicity Face PAD Challenge
2.2.2
27
Workshop and Competition at CVPR 2020
In this section, we describe the organized challenge, including a brief introduction to the CASIA-SURF CeFA dataset, evaluation metrics, and the challenge protocol. We relied on the CeFA dataset for the organization of the ChaLearn Face Anti-spoofing Attack Detection Challenge, which consists of single-modal (e.g., RGB) and multi-modal (e.g., RGB, Depth, Infrared (IR)) tracks around this novel resource to boost research aiming to alleviate the ethnic bias. The main motivation of CASIA CeFA dataset is to serve as a benchmark to allow for the evaluation of the generalization performance of new PAD methods. Concretely, four protocols were originally introduced to measure the robustness of methods under varied evaluation conditions: cross-ethnicity (Protocol 1), (2) cross-PAI (Protocol 2), (3) cross-modality (Protocol 3) and (4) cross-ethnicity and cross-PAI (Protocol 4). To make the competition more challenging, we adopted Protocol 4 in this challenge, which is designed by combining conditions of Protocols 1 and 2. As shown in Table 2.5, it has three data subsets: training, validation, and testing sets, which contain 200, 100, and 200 subjects for each ethnicity, respectively. Note that the remaining 107 subjects are 3D masks. To fully measure the crossethnicity performance of the algorithm, one ethnicity is used for training and validation, and the remaining two other ethnicities are used for testing. Since there are three ethnicities in CASIA-SURF CeFA, a total of 3 sub-protocols (i.e., 4_1, 4_2 and 4_3 in Table 2.5) are adopted in this challenge. In addition to the ethnic variation, the factor of PAIs is also considered in this protocol by setting different attack types in training and testing phases. Evaluation. In this challenge we selected the recently standardized ISO/IEC 30107-3 (https://www.iso.org/obp/ui/iso) metrics for evaluation: Attack Presentation Classification Error Rate (APCER), Normal Presentation Classification Error Rate (NPCER) and Average Classification Error Rate (ACER), these are defined as follows: A PC E R = F P/ (F P + T N )
(2.4)
N PC E R = F N / (F N + T P)
(2.5)
AC E R = (A PC E R + N PC E R) /2
(2.6)
Table 2.5 Protocols and Statistics. Note that the A, C, and E are short for Africa, Central Asia, and East Asia, respectively. Track(S/M) means the Single/Multi-modal track. The PAIs means the presentation attack instruments Track
S
Subset
Subjects (one ethnicity)
4_1
4_2
4_3
Train
1–200
A
C
E
Replay
33,713
34,367
Valid
201–300
A
C
E
Replay
17,008
17,693
17,109
Test
301–500
C&E
A&E
A&C
Print
105,457
102,207
103,420
M
Ethnicity
PAIs
# Num.img(rgb)
4_1
4_2
4_3 33,152
28
2 Face Presentation Attack Detection (PAD) Challenges
where TP, FP, TN, and FN correspond to true positive, false positive, true negative, and false negative, respectively. APCER and BPCER are used to measure the error rate of fake or live samples, respectively. Inspired by face recognition, the Receiver Operating Characteristic (ROC) curve is introduced for large-scale face Anti-spoofing detection in CASIA-SURF CeFA dataset, which can be used to select a suitable threshold to trade off the false positive rate (FPR) and true positive rate (TPR) according to the requirements of real applications. Challenge protocol. The challenge was run in the CodaLab platform, and comprised two stages as follows: Development Phase: (Phase duration 2.5 months). During this phase, participants had access to the labeled training subset and unlabeled validation subset. Since the protocol used in this competition (Protocol 4) comprises 3 sub-protocols, participants first need to train a model for each sub-protocol, then predict the score of the corresponding validation set, and finally, simply merge the predicted scores and submit them to the CodaLab platform and receive immediate feedback via a public leader board. Final phase (Phase duration: 10 days.). During this phase, labels for the validation subset and the unlabeled testing subset were released. Participants can firstly take the labels of the validation subset to select a model with better performance, then they can use this model to predict the scores of the corresponding testing subset samples, and finally submit the score files in the same way as the development phase. We made public all results of the three sub-protocols online, these include the obtained values of APCER, BPCER, and ACER. Like [11], the mean and variance of evaluated metrics for these three sub-protocols are calculated for final results. Note that to fairly compare the performance of participants’ algorithms, this competition does not allow the use of other training datasets and pre-trained models. To be eligible for prizes, winners had to publicly release their code under a license of their choice and provide a fact sheet describing their solution. Besides, the code was re-run and all of the results were verified by the organizing team after the final phase ended, the verified results were used for the final ranking.
2.2.3
Availability of Associated Dataset
CeFA is the largest face anti-spoofing dataset up to date in terms of modalities, number of subjects and attack types. More importantly, CeFA is the only public FAS dataset with ethnic label, which is available.5
5 https://sites.google.com/view/face-anti-spoofing-challenge/dataset-download/casia-surf-
cefacvpr2020?authuser=0.
2.3
3D High-Fidelity Mask Face PAD Challenge
2.3
29
3D High-Fidelity Mask Face PAD Challenge
Motivated by the results of the two initial versions of the PAD challenges, we further developed a more challenging and realistic testbed for PAD methodologies. For the third edition, we aimed to motivate participants to develop solutions for PAD in the presence of highresolution mask-attacks. This section provides an overview of the 3D high fidelity mask face PAD challenge.
2.3.1
HiFiMask Dataset
Given the shortcomings of the current mask datasets, we carefully designed and collected a new dataset: HiFiMask, which provides 5 main advantages over previous existing datasets. 1. To the best of our knowledge, HiFiMask is currently the largest 3D face mask PAD dataset, which contains 54, 600 videos captured from 75 subjects of three skin tones, including 25 subjects in yellow, white, and black, respectively. 2. HiFiMask provides 3 high-fidelity masks with the same identity, which are made of transparent, plaster, and resin materials, respectively. As shown in Fig. 2.5, our realistic masks are visually difficult to be distinguished from live faces. 3. We consider 6 complex scenes, i.e., White Light, Green Light, Periodic Three-color Light, Outdoor Sunshine, Outdoor Shadow, and Motion Blur for video recording. Among them, there is periodic lighting within [0.7, 4]Hz for the first three scenarios to mimic the human heartbeat pulse, thus might interfere with the rPPG-based mask detection technology [12]. 4. We repeatedly shoot 6 videos under different lighting directions (i.e., NormalLight, DimLight, BrightLight, BackLight, SideLight, and TopLight) for each scene to explore the impact of directional lighting. 5. A total of 7 mainstream imaging devices (i.e., iPhone11, iPhoneX, MI10, P40, S20, Vivo, and HJIM) are utilized for video recording to ensure high resolution and imaging quality.
2.3.1.1 Equipment Preparation In order to avoid identity information to interfere with the algorithm design, the plaster, transparent and resin masks are customized for real people. We use pulse oximeter CMS60C to record real-time Blood Volume Pulse (BVP) signals and instantaneous heart rate data from live videos. For scenes of White light, Green light, Periodic Three-color light (Red, Green, Blue and their various combinations), we use a colorful lighting to set the periodic frequency of illumination changes which is consistent with the rang of human heart rate. The change frequency is randomly set between [0.7,4]Hz and recorded for future research. At the same
30
2 Face Presentation Attack Detection (PAD) Challenges
Fig. 2.5 Samples from the HiFiMask dataset. The first row shows 6 kinds of imaging sensors. The second row shows 6 kinds of appendages, among which E, H, S, W, G, and B are the abbreviations of Empty, Hat, Sunglasses, Wig, Glasses, and messy Background, respectively. The third row shows 6 kinds of illuminations, and the fourth row represents 6 deployment scenarios
Fig. 2.6 The similarity of real faces and masks from different datasets. a, images from our HiFiMask. b, images from MARsV2. c, images from 3DMask. Results of FaceX-Zoo is marked in blue color and InsightFace in green. Best viewed in color
2.3
3D High-Fidelity Mask Face PAD Challenge
31
time, we use an additional light source to supplement the light from 6 directions (NormalLight, DimLight, BrightLight, BackLight, SideLight, and TopLight). The light intensity is randomly set between 400-1600 lux.
2.3.1.2 Collection Rules To improve the video quality, we pay attention to the following steps during the acquisition process: 1) All masks are worn on the face of a real person and uniformly dressed to avoid the algorithm looking for clues outside the head area; 2) Collectors were asked to sit in front of the acquisition system and look towards the sensors with small head movements; 3) During data collection stage, a pedestrian was arranged to walk around in the background to interfere with the algorithm to compensate the reflected light clues from the background [13]; 4) All live faces or masks were randomly equipped with decorations, such as sunglasses, wigs, ordinary glasses, hats of different colors, to simulate users in a real environment.
2.3.1.3 Data Preprocessing In order to save storage space, we remove irrelevant background areas from original videos, such as the part below the neck. For each video, we first use Dlib [2] to detect the face in each frame and save its coordinates. Then find the largest box from all the frames of in videos to crop the face area. After face detection, we sample 10 frames at equal intervals from each video. Finally, we name the folder of this video according with the following rule: Skin_Subject_T ype_Scene_Light_Sensor . Note that for the rPPG baseline [12], we use the first 10-second frames of each video for rPPG signal recovery without frame downsampling. To expose the realness of masks in our proposed HiFiMask dataset, we calculate the similarity between a real face and its corresponding mask within three popular mask datasets. As shown in Fig. 2.6, the similarity calculation is conducted by FaceX-Zoo [14] and InsightFace [15]. By sampling some typical examples, we find that the similarity in HiFiMask is notably higher than in MARsV2 [16] and 3DMask [17]. Six modules with different background lighting colors represent 6 kinds of scenes including white, green, three-color, sunshine, shadow and motion. The top of each module is the sample label or mask type, and the bottom is the scene type. Each row in one module corresponds to 7 types of imaging sensors (one frame is randomly selected for each video), and each column shows 6 kinds of lights.
2.3.1.4 Evaluation Protocol and Statistics We define three protocols on HiFiMask for evaluation: Protocol 1-‘seen’, Protocol 2-‘unseen’ and Protocol 3-‘openset’. The information used in the corresponding protocol is described in Table 2.6. Protocol 1-‘seen’. Protocol 1 is designed to evaluate algorithms’ performance when the mask types have been ‘seen’ in training and development sets. In this protocol, all
32
2 Face Presentation Attack Detection (PAD) Challenges
Table 2.6 Statistics of each protocol in HiFiMask. Please note that protocols 1, 2 and 3 in the fourth column indicate transparent, plaster and resin mask, respectively Pro.
Subset
Subject
Masks
# live
# mask
# all
1
Train Dev Test
45 6 24
1&2&3 1&2&3 1&2&3
8,108 1,084 4,335
24,406 3,263 13,027
32,514 4,347 17,362
2_1
Train Dev Test
45 6 24
2&3 2&3 1
8,108 1,084 4,335
16,315 2,180 4,326
24,423 3,264 8,661
2_2
Train Dev Test
45 6 24
1&3 1&3 2
8,108 1,084 4,335
16,264 2,174 4,350
24,372 3,258 8,685
2_3
Train Dev Test
45 6 24
1&2 1&2 3
8,108 1,084 4,335
16,233 2,172 4,351
24,341 3,256 8,686
3
Train Dev Test
45 6 24
1&3 1&3 1&2&3
1,610 210 4,335
2,105 320 13,027
3,715 536 17,362
skin tones, mask types, scenes, lightings, and imaging devices are presented in the training, development, and testing subsets, as shown in the second and third columns of Protocol 1 in Table 2.6. Protocol 2-‘unseen’. Protocol 2 evaluates the generalization performance of the algorithms for ‘unseen’ mask types. Specifically, we further define three leave-one-type-out testing subprotocols based on Protocol 1 to evaluate the algorithm’s generalization performance for transparent, plaster, and resin mask, respectively. For each protocol that is shown in the fourth columns of Protocol 2 in Table 2.6, we train a model with 2 types of masks and test on the left 1 mask. Note that the ‘unseen’ protocol is more challenging as the testing set’s mask type is unseen in the training and development sets. Protocol 3-‘openset’. Protocol 3 evaluates both discrimination and generalization ability of the algorithm under the open set scenarios. In other words, the training and developing sets contain only parts of common mask types and scenarios while there are more general mask types and scenarios on testing set. As shown in Table 2.6, based on Protocol 1, we define training and development sets with parts of representative samples while full testing set is used. Thus, the distribution of testing set is more complicated than the training and development sets in terms of mask types, scenes, lighting, and imaging devices. Different from Protocol 2 with only ‘unseen’ mask types, Protocol 3 considers both ‘seen’ and ‘unseen’ domains as well as mask types, which is more general and valuable for real-world deployment.
2.3
3D High-Fidelity Mask Face PAD Challenge
33
Table 2.7 Statistical information for Challenge Protocol. ‘#’ means the number of videos. Note that 1, 2 and 3 in the third column mean Transparent, Plaster and Resin mask, respectively Subset
Subj.
Mask
Scene
Light
Sensor
# live
# mask
Train
45
1,3
1,4,6
1,3,4,6
1,2,3,4
1,610
2,105
Dev
6
1,3
1,4,6
1,3,4,6
1,2,3,4
Test
24
1∼3
1∼6
1∼6
1∼7
2.3.2
# all 3,715
210
320
536
4,335
13,027
17,362
Workshop and Competition at ICCV 2021
In this section, we review the organized challenge, including a brief introduction of the HiFiMask dataset, the challenge process and timeline, the challenge protocol, and evaluation metrics. Challenge Protocol and Data Statistics. In order to increase the challenge of the competition and meet the actual deployment requirements, we consider a protocol that can comprehensively evaluate the performance of algorithm discrimination and generalization. In other words, the training and developing sets contain only parts of common mask types and scenarios while there are more general mask types and scenarios on the testing set. Based on Protocol 1 [18], we define training and development sets with parts of representative samples while a full testing set is used. Thus, the distribution of testing sets is more complicated than the training and development sets in terms of mask types, scenes, lighting, and imaging devices. Different from Protocol 2 [18] with only ‘unseen’ mask types, the challenge protocol considers both ‘seen’ and ‘unseen’ domains as well as mask types, which are more general and valuable for real-world deployment. In the challenge protocol, as shown in Table 2.7, all skin tones, part of mask types, such as transparent and resin materials (short for 1, 3), part of scenes, such as White Light, Outdoor Sunshine, and Motion Blur (short for 1, 4, 6), part of lightings, such as NormalLight, BrightLight, BackLight, and TopLight (short for 1, 3, 4, 6), and part of imaging devices, such as iPhone11, iPhone X, MI10, P40 (short for 1, 2, 3, 4) are presented in the training and development subsets. While all skin tones, mask types, scenes, lightings, and imaging devices are presented in the testing subset. For clarity, the dataset partition and video quantity of each subset of the challenge protocol are shown in Table 2.7. Challenge Process and Timeline. The challenge was run in the CodaLab6 platform, and comprised two stages as follows: Development Phase: (Phase duration: 2 months). During this phase, participants had access to labeled training data and unlabeled development data. Participants could use training data to train their models, and they could submit predictions on the development data. Training data was made available with samples labeled with the genuine, 2 types of the mask (short for 1, 3), 3 types of scenes (short for 1, 4, 6), 4 kinds of lightings (short for 1, 2, 4, 6) 6 https://competitions.codalab.org/competitions/30910.
34
2 Face Presentation Attack Detection (PAD) Challenges
and 4 imaging sensors (short for 1, 2, 3, 4). Although the development data maintains the same data type as the training data, the label is not provided to the participants. Instead, participants could submit predictions on the development data and receive immediate feedback via the leader board. Final phase: (Phase duration: 10 days). During this phase, labels for the development set were made available to participants, so that they can have more labeled data for training their models. The unlabeled testing set was also released, participants had to make predictions for the testing data and upload their solutions to the challenge platform. The test set was formed by examples labeled with the genuine, and all skin tones, mask types (short for 1∼3), scenes (short for 1∼6), lightings (short for 1∼6), and imaging devices (short for 1∼7). Participants had the opportunity to make 3 submissions for the final phase, this was done with the goal of assessing the stability of their methods. Note that the CodaLab platform defaults to the result of the last submission. The final ranking of participants was obtained from the performance of submissions in the testing sets. To be eligible for prizes, winners had to publicly release their code under a license of their choice and provide a fact sheet describing their solution. Evaluation Metrics. In this challenge, we selected the recently standardized ISO/IEC 30107-37 metrics: Attack Presentation Classification Error Rate (APCER), Normal Presentation Classification Error Rate (NPCER) and Average Classification Error Rate (ACER) as the evaluation metrics. The ACER on the testing set is determined by the Equal Error Rate (EER) thresholds on the development set. Finally, The value ACER was the leading evaluation measure for this challenge, and Area Under Curve (AUC) was used as additional evaluation criteria.
2.3.3
Availability of Associated Dataset
HiFiMask is a large-scale HiFiMask dataset with three challenging protocols, which will push cutting-edge research in 3D Mask face PAD. If you interested in this dataset, you can apply it in the link.8
References 1. Ajian L, Jun W, Sergio E, Hugo JE, Zichang T, Qi Y, Kai W, Chi L, Guodong G, Isabelle G, et al (2019) Multi-modal face anti-spoofing attack detection challenge at cvpr2019. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp 0–10 2. Davis EK (2009) Dlib-ml: A machine learning toolkit. In: JMLR 7 https://www.iso.org/obp/ui/iso. 8 https://sites.google.com/view/face-anti-spoofing-challenge/welcome/challengeiccv2021?
authuser=0.
References
35
3. Yao F, Fan W, Xiaohu S, Yanfeng W, Xi Z (2018) Joint 3d face reconstruction and dense alignment with position map regression network. In: ECCV 4. Shifeng Z, Xiaobo W, Ajian L, Chenxu Z, Jun W, Sergio E, Hailin S, Zezheng W, Stan ZL (2019) A dataset and benchmark for large-scale multi-modal face anti-spoofing. In: CVPR 5. Pavao A, Guyon I, Letournel A-C, Baró X, Escalante H, Escalera S, Thomas T, Zhen X (2022) CodaLab Competitions: An open source platform to organize scientific challenges. Technical report, Université Paris-Saclay, FRA 6. Ajian L, Xuan L, Jun W, Yanyan L, Sergio E, Hugo JE, Meysam M, Yi J, Zhuoyuan W, Xiaogang Y et al (2021) Cross-ethnicity face anti-spoofing recognition challenge: A review. IET Biomet. 10(1):24–43 7. Liu A, Tan Z, Wan J, Escalera S, Guo G, Stan ZL (2021) A benchmark for multi-modal crossethnicity face anti-spoofing. In: WACV, Casia-surf cefa 8. Xiangyu Z, Xiaoming L, Zhen L, Stan ZL (2017) Face alignment in full pose range: A 3d total solution. TPAMI 41(1):78–92 9. Jianzhu G, Xiangyu Z, Yang Y, Fan Y, Zhen L, Stan ZL (2020) Towards fast, accurate and stable 3d dense face alignment. In: ECCV 10. Shifeng Z, Ajian L, Jun W, Yanyan L, Stan ZL (2019) Casia-surf: A large-scale multi-modal benchmark for face anti-spoofing. TBMIO 11. Boulkenafet Z, Komulainen J, Li L, Feng X, Abdenour H (2017) A mobile face presentation attack database with real-world variations. In: FG, Oulu-npu 12. Xiaobai L, Jukka K, Guoying Z, Pong-Chi Y, Matti P (2016) Generalized face anti-spoofing by detecting pulse from face videos. In: ICPR 13. Si-Qi L, Xiangyuan L, Pong CY (2018) Remote photoplethysmography correspondence feature for 3d mask face presentation attack detection. In: ECCV 14. Jun W, Yinglu L, Yibo H, Hailin S, Tao M (2021) Facex-zoo: A pytorch toolbox for face recognition. In: Proceedings of the 29th ACM international conference on multimedia, pp 3779–3782 15. Deng J, Guo J, Niannan X, Stefanos Z (2019) Additive angular margin loss for deep face recognition. In: CVPR, Arcface 16. Siqi L, Baoyao Y, Pong CY, Guoying Z (2016) A 3d mask face anti-spoofing database with real world variations. In: CVPRW 17. Zitong Y, Wan J, Qin Y, Li X, Li SZ, Guoying Z (2020) Static-dynamic central difference network search for face anti-spoofing. In: TPAMI, Nas-fas 18. Liu A, Zhao C, Zitong Y, Wan J, Anyang S, Liu X, Tan Z, Escalera S, Xing J, Liang Y et al (2022) Contrastive context-aware learning for 3d high-fidelity mask face presentation attack detection. IEEE Trans. Inf. Foren. Security 17:2497–2507
3
Best Solutions Proposed in the Context of the Face Anti-spoofing Challenge Series
3.1
Introduction
The three editions of the PAD challenges we organized (See Chap. 2) faced problems of increasing difficulty. However, while the considered scenarios were very challenging to participants, in all three cases, solutions outperformed the baselines and pushed the state-ofthe-art performance in PAD. Furthermore, not only performance was improved but a variety of methods comprising novel technical contributions arose in the context of the challenge series. The remainder of the chapter provides an overview of solutions that were proposed in the context of the three editions of the PAD challenges. For further details, we suggest the reader to follow the corresponding references.
3.2
Multi-modal Face PAD Challenge
This section describes the top ranked solutions developed in the context of the ChaLearn Face Anti-spoofing attack detection challenge. Before describing these solutions, we introduce the baseline which we have developed for the competition.
3.2.1
Baseline Method
We developed a strong baseline method associated to this edition of the challenge series. Our aim was to provide a straightforward architecture achieving competitive performance in the CASIA-SURF dataset. In doing this, we approached the face anti-spoofing problem as a binary classification task (fake v.s., real) and conducted experiments using the ResNet-18 [1] classification network. ResNet-18 consists of five convolutional blocks (namely res1, res2,
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 J. Wan et al., Advances in Face Presentation Attack Detection, Synthesis Lectures on Computer Vision, https://doi.org/10.1007/978-3-031-32906-7_3
37
38
3 Best Solutions Proposed in the Context of the Face . . .
res3, res4, res5), a global average pooling layer and a softmax layer, which is a relatively shallow network but has strong classification capabilities. As described before, the CASIA-SURF dataset is characterized for being multi-modal (i.e., RGB, Depth, IR) and one of the main problems to solve is how to fuse the complementary information from the three available modalities. For the baseline, we use a multi-stream architecture with three subnetworks where RGB, Depth and IR data are processed separately by each stream, and then shared layers are appended at a point to learn joint representations and decisions. Halfway fusion is one of the commonly used fusion methods, which combines the subnetworks of different modalities at a later stage, i.e., immediately after the third convolutional block (res3) via the feature map concatenation (similar to Fig. 3.1, except no “Squeeze-and-Excitation” fusion). In this way, features from different modalities can be fused to perform classification. However, direct concatenating these features cannot make full use of the characteristics between different modalities by itself. Since different modalities have different characteristics: the RGB information has rich visual details, the Depth data are sensitive to the distance between the image plane and the corresponding face, and the IR data measure the amount of heat radiated from a face. These three modalities have different advantages and disadvantages for different ways of attack. Inspired by [2], we proposed the squeeze and excitation fusion method that uses
Fig. 3.1 Diagram of the proposed method. Each stream uses the ResNet-18 as the backbone, which has five convolutional blocks (i.e., res1, res2, res3, res4, res5). The res1, res2, and res3 blocks are proprietary to extract features of each modal data (i.e., RGB, Depth, IR). Then, these features from different modalities are fused via the squeeze and excitation fusion module. After that, the res4 and res5 block are shared to learn more discriminatory features from the fused one. GAP means the global average pooling
3.2
Multi-modal Face PAD Challenge
39
the “Squeeze-and-Excitation” branch to enhance the representational ability of the different modalities’ features by explicitly modelling the interdependencies between their convolutional channels. As shown in Fig. 3.1, our squeeze and excitation fusion method has a three-stream architecture and each subnetwork is feed with the image of different modalities. The res1, res2 and res3 blocks are proprietary for each stream to extract the features of different modalities. After that, these features are fused via the squeeze and excitation fusion module. This module newly adds a branch for each modal and the branch is composed of one global average pooling layer and two consecutive fully connected layers. The squeeze and excitation fusion module performs modal-dependent feature re-weighting to select the more informative channel features while suppress less useful ones for each modal, and then concatenates these re-weighted features to the fused feature. In this way, we can make full use of the characteristics between different modalities via re-weighting their features.
3.2.2
1st Ranked Solution (Team Name: VisionLabs)
Attack Specific Folds. Because attack types at test time can differ from attacks presented in the training set for this challenge, in order to increase the robustness to new attacks, Parkin et al. split training data into three folds [3]. Each fold containing two different attacks, while images of the third attack type are used for validation. Once trained, one treats three different networks as a single model by averaging their prediction scores. Transfer Learning. Many image recognition tasks with limited training data benefit from CNN pre-training on large-scale image datasets, such as ImageNet [4]. Fine tuning network parameters that have been pre-trained on various source tasks leads to different results on the target task. In the experiments, the authors of [3] use four datasets designed for face recognition and gender classification (please see Table 3.1), to generate a useful and promising initialization. They also use multiple backbone ResNet architectures and losses for initial tasks to increase the variability. Similar to networks trained for attack-specific folds in the last paragraph, authors average predictions of all the models trained with different initializations. Model Architecture. The final network architecture is based on the ResNet-34 and ResNet-50 backbone with SE modules which are illustrated in Fig. 3.2. Following the method
Table 3.1 Face datasets and the CNN architecture are used to pre-train the networks of VisionLabs [3] Backbone
Dataset
Task
1
ResNet-34
CASIA-Web face [5]
Face recognition
2
ResNet-34
AFAD-lite [6]
Gender classication
3
ResNet-50
MSCeleb-1M [7]
Face recognition
4
ResNet-50
Asian dataset [8]
Face recognition
40
3 Best Solutions Proposed in the Context of the Face . . .
Fig. 3.2 The proposed architecture (VisionLabs). RGB, IR and Depth streams are processed separately using res1, res2, res3 blocks from resnet-34 as a backbone. The res3 output features are re-weighted and fused via the squeeze and excitation (SE) block and then fed into res4. In addition, branch features from res1, res2, res3 are concatenated and processed by corresponding aggregation blocks, each aggregation block also uses information from the previous one. The resulting features from agg3 are fed into res4 and summed up with the features from the modality branch. On the diagram: GAP–global average pooling; ⊕ concatenation; +–elementwise addition
described in our baseline method [9], each modality is processed by the first three residual convolutional blocks, then the output features are fused using squeeze and excitation fusion module and processed by the remaining residual block. Differently from the baseline method, [3] enrich the model with additional aggregation blocks at each feature level. Each aggregation block takes features from the corresponding residual blocks and from previous aggregation block, making model capable of finding inter-modal correlations not only at a fine level but also at a coarse one. In addition, it trains each model using two initial random seeds. Given separate networks for attack-specific folds and different pre-trained models, our final liveness score is obtained by averaging the outputs of 24 neural network. Conclusions. The solution described in this section achieved the top 1 rank at the Chalearn LAP face anti-spoofing challenge. First, authors have demonstrated that careful selection of a training subset by the types of spoofing samples better generalizes to unseen attacks. Second, they have proposed a multi-level feature aggregation module which fully utilizes the feature fusion from different modalities both at coarse and fine levels. Finally, authors have examined the influence of feature transfer from different pre-trained models on the target task and showed that using the ensemble of various face related tasks as source domains increases the stability and the performance of the system. The code and pre-trained models are publicly available from the github repository at https://github.com/AlexanderParkin/ ChaLearn_liveness_challenge. More information of this method can be found in [3].
3.2
Multi-modal Face PAD Challenge
3.2.3
41
2nd Ranked Solution (Team Name: ReadSense)
Overall Architecture. In this contribution, Shen et al. proposed a multi-stream CNN architecture called FaceBagNet with Modal Feature Erasing (MFE) for multi-modal face antispoofing detection [10]. The proposed FaceBagNet consists of two components: (1) patchbased features learning, (2) multi-stream fusion with MFE. For the patch-based features learning, authors trained a deep neural network by using patches randomly extracted from face images to learn rich appearance features. For the multi-stream fusion, features from different modalities are randomly erased during training, which are then fused to perform classification. Figure 3.3 shows the high-level illustration of three streams along with a fusion strategy for combining them. Patch-based Features Learning. The spoof-specific discriminative information exists in the whole face area. Therefore, Shen et al. used the patch-level image to enforce convolution neural network to extract such information [10]. The usual patch-based approaches split the full face into several fixed non-overlapping regions. Then each patch is used to train an independent sub-network. For each modality, authors trained one single CNN on random patches extracted from the faces. Then authors used a self designed ResNext [11] network to extract deep features. The network consisted of five group convolutional blocks, a global average pooling layer and a softmax layer. Table 3.2 presents the network architecture in terms of its layers, i.e., size of kernels, number of output feature maps and number of groups and strides. Multi-stream Fusion with MFE. Since the feature distributions of different modalities are different, the proposed model makes efforts to exploit the interdependencies between different modalities as well. As shown in Fig. 3.3, it uses a multi-stream architecture with three sub-networks to perform multi-modal features fusion. Then authors concatenated feature maps of three sub-networks after the third convolutional block (res3).
Fig. 3.3 The proposed architecture (ReadSense). The fusion network is trained from scratch in which RGB, Depth and IR face patches are feed into it at the same time. Image augmentation is applied and modal features from sub-network are randomly erased during training
3 Best Solutions Proposed in the Context of the Face . . .
42
Table 3.2 Architecture of the proposed FaceBagNet [10] Patch size
Conguration
Layer1
Conv 3×3, 32
Layer2
[Conv 1×1, 64; Conv 3×3, 64; Group 32, Stride 2; Conv 1×1, 128]×2
Layer3
[Conv 1×1,128; Conv 3×3, 128; Group 32, Stride 2; Conv 1×1, 256]×2
Layer4
[Conv 1×1, 256; Conv 3×3, 256; Group 32, Stride 2; Conv 1×1, 512]×2
Layer5
[Conv 1×1, 512; Conv 3×3, 512; Group 32, Stride 2; Conv 1×1, 1024]×2
Layer6
Global Avg. Pooling, FC2
As studied in our baseline method [9], directly concatenating features from each subnetwork cannot make full use of the characteristics between different modalities. In order to prevent overfitting and for better learning the fusion features, Shen et al. designed a Modal Feature Erasing (MFE) operation on the multi-modal features [10]. For one batch of inputs, the concatenated feature tensor is computed by three sub networks. During training, the features from one randomly selected modal sub-network are erased and the corresponding units inside the erased area are set to zero. The fusion network is trained from scratch in which RGB, Depth and IR data are fed separately into each sub-network at the same time. Conclusions. This solution proposed a face anti-spoofing network based on Bag-oflocal-features (named FaceBagNet) to determine whether the captured multi-modal face images are real. A patch-based feature learning method was used to extract discriminative information. Multi-stream fusion with MFE layer was applied to improve the performance. It demonstrated that both patch-based feature learning method and multi-stream fusion with MFE were effective methods for face anti-spoofing. Overall, the proposed solution was simple but effective and easy to use in practical application scenarios. As the result, The proposed approach [10] obtained the second place in CVPR 2019 ChaLearn Face Antispoofing attack detection challenge.
3.2.4
3rd Ranked Solution (Team Name: Feather)
The existing face anti-spoofing networks [12–15] have the problems of large parameters and weak generalization ability. For this reason, Zhang et al. proposed a FeatherNets architecture, a light architecture for PAD (the name stands for a network that is light as a feather) [16]. The Weakness of GAP for Face Task. Global Average Pooling (GAP) is employed by many state-of-the-art networks for object recognition task, such as ResNets [1], DenseNet [17] and some light-weight networks, like MobilenetV2 [18], Shufflenet_v2 [19], IGCV3 [20]. GAP has been proved on its ability of reducing dimensions and preventing over-fitting for the overall structure [21]. For the face related tasks, [22] and [23] have observed that CNNs with GAP layer are less accurate than those without GAP. Meanwhile,
3.2
Multi-modal Face PAD Challenge
43
Fig. 3.4 Depth faces feature embedding CNN structure. In the last 7×7 feature map, the receptive field and the edge (RF2) portion of the middle part (RF1) is different, because their importance is different. DWConv is used instead of the GAP layer to better identify this different importance. At the same time, the fully connected layer is removed, which makes the network more portable. This figure is from [16]
MobileFaceNet [24] replaces the GAP with Global Depthwise Convolution (GDConv) layer, and explains the reason why it is effective through the theory of receptive field [25]. The main point of GAP is “equal importance” which is not suitable for face tasks. As shown in Fig. 3.4, the last 7 × 7 feature map is denoted as FMap-end, each cell in FMapend corresponds to a receptive field at different position. The center blue cell corresponds to RF1 and the edge red one corresponds to RF2. As described in [26], the distribution of impact in a receptive field distributes as a Gaussian, the center of a receptive field has more impact on the output than the edge. Therefore, RF1 has larger effective receptive field than RF2. For the face anti-spoofing task, the network input is 224 × 224 images which only contain the face region. As above analysis, the center unit of FMap-end is more important than the edge one. GAP is not applicable to this case. One choice is to use fully connected layer instead of GAP. It would introduce a large number of parameters to the whole model and increase the risk of over-fitting. Streaming Module. To treat different units of FMap-end with different importance, streaming module is designed which is shown in the Fig. 3.5. In the streaming module, a depthwise convolution (DWConv) layer with stride larger than 1 is used for down-sampling whose output, is then flattened directly into an one-dimensional feature vector. The compute process is represented by Eq. 3.1. K i, j,m · FI N y (i),I Nx ( j),m (3.1) F Vn(y,x,m) = i, j
In Eq. 3.1, FV is the flattened feature vector while N = H × W × C elements (H , W and C denote the height, width and channel of DWConv layer’s output feature maps respectively). n(y, x, m), computed as Eq. (3.2), denotes the n th element of FV which corresponds to the (y, x) unit in the m th channel of the DWConv layer’s output feature maps.
n(y, x, m) = m × H × W + y × H + x
(3.2)
44
3 Best Solutions Proposed in the Context of the Face . . .
Fig. 3.5 Streaming module. The last blocks’ output is down-sampled by a depthwise convolution [27, 28] with stride larger than 1 and flattened directly into an one-dimensional vector
On the right side of the Eq. (3.1), K is the depthwise convolution kernel, F is the FMap-end of size H×W×C (H, W and C denote the height, width and channel of FMap-end respectively). m denotes the channel index. i, j denote the spatial position in kernel K, and I N y (i), I N x ( j) denote the corresponding position in F. They are computed as Eqs. (3.3), (3.4) I N y (i) = y × S0 + i
(3.3)
I N x ( j) = x × S1 + j.
(3.4)
S0 is the vertical stride and S1 is the horizontal stride. A fully connected layer is not added after flattening feature map, because this will increase more parameters and the risk of overfitting. Streaming module can be used to replace global average pooling and fully connected layer in traditional networks. Network Architecture Details. Besides streaming module, there are BlockA/B/C as shown in Fig. 3.6 to compose FeatherNetA/B. The detailed structure of the primary FeatherNet architecture is shown in Table 3.3. BlockA is the inverted residual blocks proposed in MobilenetV2 [18]. BlockA is used as our main building block which is shown in the Fig. 3.6a. The expansion factors are the same as in MobilenetV2 [18] for blocks in our architecture. BlockB is the down-sampling module of FeatherNetB. Average pooling (AP) has been proved in Inception [29] to benefit performance, because of its ability of embedding multi-scale information and aggregating features in different receptive fields. Therefore, average pooling (2 × 2 kernel with stride = 2) is introduced in BlockB (Fig. 3.6b). Besides, in the network ShuffleNet [19], the down-sampling module joins 3 × 3 average pooling layer with stride = 2 to obtain excellent performance. Li et al. [30] suggested that increasing average pooling layer works well and impacts the computational cost little. Based on the above
3.2
Multi-modal Face PAD Challenge
45
Fig. 3.6 FeatherNets’ main blocks. FeatherNetA includes BlockA & BlockC. FeatherNetB includes BlockA & BlockB. (BN: BatchNorm; DWConv: depth wise convolution; c: number of input channels) Table 3.3 Network architecture: FeatherNet B. All spatial convolutions use 3 × 3 kernels. The expansion factor t is always applied to the input size, while c means number of channel. Meanwhile, every stage SE-module [31] is inserted with reduce = 8 and FeatherNetA replaces BlockB in the table with BlockC Input
Operator
t
2242 × 3
Conv2d,/2
–
c 32
1122 × 32 562 × 16
BlockB
1
16
BlockB
6
32
282 × 32 282 × 32
BlockA
6
32
BlockB
6
48
142 × 48 142 × 48
5×BlockA
6
48
BlockB
6
64
72 × 64 72 × 64
2×BlockA
6
64
Streaming
–
1024
analysis, adding pooling on the secondary branch can learn more diverse features and bring performance gains. BlockC is the down-sampling Module of our network FeatherNetA. BlockC is faster and with less complexity than BlockB. After each down-sampling stage, SE-module [31] is inserted with reduce = 8 in both FeatherNetA and FeatherNetB. In addition, when designing the model, a fast down-sampling strategy [32] is used at the beginning of our network which makes the feature map size decrease rapidly and without much parameters. Adopting this strategy can avoid the problem
46
3 Best Solutions Proposed in the Context of the Face . . .
of weak feature embedding and high processing time caused by slow down-sampling due to limited computing budget [33]. The primary FeatherNet only has 0.35M parameters. The FeatherNets’ structure is built on BlockA/B/C as mentioned above except for the first layer which is a fully connected. As shown in Table 3.3, the size of the input image is 224 × 224. A layer with regular convolutions, instead of depthwise convolutions, is used at the beginning to keep more features. Reuse channel compression to reduce 16 while using inverted residuals and linear bottleneck with expansion ratio = 6 to minimize the loss of information due to down-sampling. Finally, the Streaming module is used without adding a fully connected layer, directly flatten the 4 × 4 × 64 feature map into an one-dimensional vector, reducing the risk of over-fitting caused by the fully connected layer. After flattening the feature map, focal loss is used directly for prediction. Multi-modal Fusion Method. The main idea for the fusion method is to use a cascade inference on different modalities: depth images and IR images. The cascade structure has two stages, which is shown in the Fig. 3.7. Stage 1: An ensemble classifier, consisting of multiple models, is employed to generate the predictions. These models are trained on depth data and from several checkpoints of different networks, including FeatherNets. If the weighted average of scores from these models is near 0 or 1, input sample will be classified as fake or real respectively. Otherwise, the uncertain samples will go through the second stage. Stage 2: FeatherNetB learned from IR data will be used to classify the uncertain samples from stage 1. The fake judgement of IR model is respected as the final result. For the real judgement, the final scores are decided by both stage 1 and IR models. Conclusions. Zhang et al. proposed an extreme lite network architecture (FeatherNet A/B) with Streaming module, to achieve a well trade-off between performance and computational complexity for multi-modal face anti-spoofing. Furthermore, a novel fusion classifier
Fig.3.7 Multi-modal fusion strategy: two stages cascaded, stage 1 is an ensemble classifier consisting of several depth models. Stage 2 employs IR models to classify the uncertain samples from stage 1
3.2
Multi-modal Face PAD Challenge
47
with “ensemble + cascade” structure is proposed for the performance preferred use cases. Meanwhile, CASIA-SURF dataset [9] is collected to provide more diverse samples and more attacks to gain better generalization ability. All these are used to join the Face Anti-spoofing Attack Detection Challenge@CVPR2019 and get the third place in this challenge.
3.2.5
Additional Top Ranked Solutions
Hahahaha. Their base model is a Resnext [11] which was pre-trained with the ImageNet dataset [4]. Then, they fine-tune the network on aligned images with face landmark and use data augmentation to strengthen the generalization ability. MAC-adv-group. This solution used the Resnet-34 [1] as base network. To overcome the influence of illumination variation, they convert RGB image to HSV color space. Then, they sent the features extracted from the network into a fully-connected layer and a binary classification layer. ZKBH. Analyzing the training, validation and test sets, participants assumed that the eye region is promising to get good performance in FAD task based on an observation that the eye region is the common attack area. After several trials, the input of the final version they submitted adopted quarter face containing the eye region. Different from prior works that regard the face anti-spoofing problem as merely a binary (fake v.s., real) classification problem, this team constructed a regression model for differentiating the real face and the attacks. VisionMiracle. This solution was based on the modified shufflenet-V2 [19]. The featuremap was divided into two branches after the third stage, and connected in the fourth stage. GradiantResearch. The fundamental idea behind this solution was the reformulation of the face presentation attack detection problem (face-PAD) following an anomaly detection strategy using deep metric learning. The approach can be split in four stages (Fig. 3.8):
Fig. 3.8 General diagram of the GradiantResearch approach. The figure is provided by the GradiantResearch team
Stage 1: they use a pre-trained face recognition model and apply a classification-like metric learning approach on the GRAD-GPAD dataset [34] using only RGB images. Stage 2: they fine-tune the model obtained in Stage 1 on the CASIA-SURF dataset using metric learning for anomaly detection (semi-hard batch negative mining with a triplet focal loss), adding Depth and IR images to the input volume. Once the model converged, they trained an SVM classifier on the features of the last fully connected layer (128-D). Stage 3: they trained an SVM classifier on the normalized histogram of the depth image corresponding to the cheek region of the face (256-D). Stage 4: they performed a simple stacking ensemble of both models (Stage 2 and Stage 3) by training a logistic regression model on the scores of the training split (a sketch of this stacking appears at the end of this subsection).
Vipl-bpoic. This team focused on improving the generalization ability of face anti-spoofing by proposing an end-to-end trainable model with an attention mechanism. Due to sample imbalance, they assign weights of 1:3 according to the number of genuine and spoof faces in the training set. They then fuse the three modalities (RGB, Depth and IR) into a 5-channel input for a ResNet-18 [1] integrated with a convolutional block attention module. The center loss [35] and the cross-entropy loss are adopted to constrain the learning process in order to obtain more discriminative cues for FAD.
Massyhnu. This team paid attention to color information fusion and ensemble learning [36, 37].
AI4all. This team used VGG16 [38] as the backbone for face PAD.
Guillaume. Their method consists of a Multi-Channel Convolutional Neural Network (MC-CNN) taking face images of different modalities as input. Only near-infrared and depth images were used in their approach. The architecture of the proposed MC-CNN is based on the second version of LightCNN [39], containing 29 layers, and the pre-trained LightCNN model is used as the starting point for training. Training consists in fine-tuning the low-level convolutional layers of the network for each modality and in learning the final fully connected layers.
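Returning to the GradiantResearch pipeline summarized above (Stages 2 to 4), a rough scikit-learn sketch of the stacking could look as follows. The inputs emb_train (128-D metric-learning embeddings) and hist_train (256-bin cheek-depth histograms) are assumed to be precomputed, and the estimator settings are placeholders rather than the team's exact configuration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def fit_stacked_ensemble(emb_train, hist_train, y_train):
    """emb_train: (N, 128) metric-learning features; hist_train: (N, 256)
    normalized cheek-depth histograms; y_train: 1 = real, 0 = attack."""
    svm_emb = SVC(probability=True).fit(emb_train, y_train)     # Stage 2
    svm_hist = SVC(probability=True).fit(hist_train, y_train)   # Stage 3

    # Stage 4: stack the two scores with a logistic regression meta-model.
    scores = np.column_stack([
        svm_emb.predict_proba(emb_train)[:, 1],
        svm_hist.predict_proba(hist_train)[:, 1],
    ])
    meta = LogisticRegression().fit(scores, y_train)
    return svm_emb, svm_hist, meta

def predict_stacked(svm_emb, svm_hist, meta, emb, hist):
    scores = np.column_stack([
        svm_emb.predict_proba(emb)[:, 1],
        svm_hist.predict_proba(hist)[:, 1],
    ])
    return meta.predict_proba(scores)[:, 1]
```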
3.2.6
Summary
For the face anti-spoofing challenge organized at the CVPR2019 workshop, no team used traditional methods for FAD, such as detecting physiological signs of life like eye blinking, facial expression changes or mouth movements. Instead, all submitted face PAD solutions relied on model-based feature extractors, such as ResNet [1], VGG16 [38], etc. A summary is provided in Table 3.4. All teams used deep learning based methods, with or without models pre-trained on other datasets (such as the face datasets used by the VisionLabs and GradiantResearch teams), and only the Feather team used private FAD data. The top-3 teams used at least two modalities (RGB, Depth or IR). Interestingly, the Hahahaha team used only the depth modality but still obtained very promising results.
Table 3.4 Summary of the methods of all participating teams (rank, team name, method, model, pre-trained data, modality, additional dataset, pre-processing, fusion and loss function)
Rank 1, VisionLabs. Method: fine-tuning ensembling. Model: ResNet-34, ResNet-50. Pre-trained data: CASIA-WebFace, AFAD-Lite, MS-Celeb-1M, Asian dataset. Modality: RGB, Depth, IR. Additional dataset: no. Pre-processing: resize. Fusion and loss: squeeze-and-excitation fusion, score fusion, softmax.
Rank 2, ReadSense. Method: bag-of-local-feature ensembling. Model: SE-ResNeXt. Pre-trained data: no. Modality: RGB, Depth, IR. Additional dataset: no. Pre-processing: crop image, augmentation. Fusion and loss: squeeze-and-excitation fusion, score fusion, softmax.
Rank 3, Feather. Method: ensembling. Model: FishNet, MobileNetV2. Pre-trained data: no. Modality: Depth, IR. Additional dataset: private data. Pre-processing: resize, image adjust. Fusion and loss: score fusion, softmax.
Rank 4, Hahahaha. Method: fine-tuning. Model: ResNeXt. Pre-trained data: ImageNet. Modality: Depth. Additional dataset: no. Pre-processing: augmentation, aligned faces. Fusion and loss: softmax.
Rank 5, MAC-adv-group. Method: features fusion. Model: ResNet-34. Pre-trained data: no. Modality: RGB, Depth, IR. Additional dataset: no. Pre-processing: transfer color. Fusion and loss: features fusion, softmax.
Rank 6, ZKBH. Method: regression model. Model: ResNet-18. Pre-trained data: no. Modality: RGB, Depth, IR. Additional dataset: no. Pre-processing: crop image, augmentation. Fusion and loss: data fusion, regression loss.
Rank 7, VisionMiracle. Method: modified CNN. Model: ShuffleNet-V2. Pre-trained data: no. Modality: Depth. Additional dataset: no. Pre-processing: augmentation. Fusion and loss: softmax.
Rank 8, Baseline. Method: features fusion. Model: ResNet-18. Pre-trained data: no. Modality: RGB, Depth, IR. Additional dataset: no. Pre-processing: resize, augmentation. Fusion and loss: softmax.
Rank 9, GradiantResearch. Method: metric learning. Model: Inception-ResNet-v1. Pre-trained data: VGGFace2, GRAD-GPAD. Modality: RGB, Depth, IR. Additional dataset: no. Pre-processing: crop image, augmentation. Fusion and loss: ensemble, logistic regression.
Rank 10, Vipl-bpoic. Method: attention. Model: ResNet-18. Pre-trained data: no. Modality: RGB, Depth, IR. Additional dataset: no. Pre-processing: ratio of pos. and neg. samples. Fusion and loss: data fusion, center loss, softmax.
Rank 11, Massyhnu. Method: ensembling. Model: 9 classifiers. Pre-trained data: no. Modality: RGB, Depth, IR. Additional dataset: no. Pre-processing: resize, transfer color. Fusion and loss: color information fusion, softmax.
Rank 12, AI4all. Method: depth images. Model: VGG16. Pre-trained data: no. Modality: Depth. Additional dataset: no. Pre-processing: resize, augmentation. Fusion and loss: softmax.
Rank 13, Guillaume. Method: multi-channel CNN. Model: LightCNN. Pre-trained data: yes. Modality: Depth, IR. Additional dataset: no. Pre-processing: resize. Fusion and loss: data fusion, softmax.
3.3
Cross-Ethnicity Face PAD Challenge
For the second edition of the PAD challenge series, an improved dataset was released. The main interest of this edition was to challenge participants to develop solutions robust to a variety of ethnicities. In the final ranking stage, 19 teams submitted their code and fact sheets for evaluation. Based on the information provided, in the following we describe the solutions developed by each team, with detailed descriptions for the top-ranked participants in both the single-modal (RGB) and multi-modal (RGB-Depth-IR) face anti-spoofing recognition challenge tracks. Two tracks were proposed for this edition, one focusing on single-modal PAD and the other dealing with the multi-modal case; both are described in the following two subsections.
3.3.1
Single-modal Face Anti-spoofing Challenge Track
3.3.1.1 Baseline Method
We provided a baseline for approaching this task by designing SD-Net [40], which takes ResNet-18 [1] as the backbone. As shown in Fig. 3.9, it contains 3 branches: a static, a dynamic, and a static-dynamic branch, which learn hybrid features from static and dynamic images. The static and dynamic branches each consist of 5 blocks (i.e., conv, res1, res2, res3, res4) and 1 Global Average Pooling (GAP) layer, while in the static-dynamic branch the conv and res1 blocks are removed, because it takes the fused res1 features from the static and dynamic branches as input (Fig. 3.9). For dynamic image generation, a detailed description is provided in [40]; in short, we compute the dynamic image online with rank pooling over K consecutive frames. Our choice of dynamic images for rank pooling in SD-Net is further motivated by the fact that dynamic images have proved superior to regular optical flow [41, 42].
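For intuition about dynamic images, the rank-pooled summary of K frames can be approximated in closed form (approximate rank pooling, as popularized in the dynamic-image literature cited above) instead of solving the full ranking problem online; the sketch below is only an illustration under that assumption and is not the exact procedure used in SD-Net.

```python
import numpy as np

def approx_dynamic_image(frames):
    """Approximate rank pooling: a weighted sum of K consecutive frames.

    frames: array of shape (K, H, W, C). The coefficients follow the
    closed-form approximation of rank pooling, used here only to show how a
    dynamic image summarizes temporal evolution in a single image.
    """
    K = len(frames)
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, K + 1))))
    t = np.arange(1, K + 1)
    # alpha_t = 2(K - t + 1) - (K + 1)(H_K - H_{t-1}),  t = 1..K
    alpha = 2 * (K - t + 1) - (K + 1) * (harmonic[K] - harmonic[t - 1])
    dyn = np.tensordot(alpha, frames.astype(np.float64), axes=1)
    # Rescale to [0, 1] before feeding the result to a network.
    dyn = (dyn - dyn.min()) / (dyn.max() - dyn.min() + 1e-8)
    return dyn
```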
Fig. 3.9 The framework of SD-Net. The figure is provided by the baseline team, ranked No. 11 in the single-modal track
Fig. 3.10 The framework provided by the VisionLabs team. The SimpleNet architecture consists of 4 blocks of Conv 3 × 3, BatchNorm, ReLU and MaxPool with 16, 32, 64 and 128 filters, followed by a Conv 5 × 5 with 256 filters. The figure is provided by the VisionLabs team, ranked No. 1 in the single-modal track
3.3.1.2 1st Ranked Solution (Team Name: VisionLabs)
Due to the large differences between the train and test subsets (i.e., different ethnicities and attack types), the VisionLabs team used a data augmentation strategy to help train robust models. Similar to previous works that convert RGB data to the HSV and YCbCr color spaces [43] or to the Fourier spectrum [44], they decided to convert RGB to other “modalities” that contain more authenticity information instead of identity features; specifically, optical flow and rank pooling are used, as shown in Fig. 3.10. The proposed architecture consists of four branches: two branches process dynamic images produced by a rank pooling algorithm, and the other two process optical flow images. For the optical flow modality, they calculated two flows, one between the first and last frames of the RGB video and one between the first and second frames. For the rank pooling modality, they used the rank pooling algorithm [42] with different hyperparameters to generate two different dynamic images. Formally, an RGB video with K frames is represented by {X_i^k}, where i = 0, ..., K − 1 and t ∈ {0, 1} is the label (0 for fake, 1 for real). For each RGB video, they sample L = 16 images uniformly, obtaining {X_j^k} with j = 0, ..., 15. They then remove black borders, pad each image to a square of size (112, 112), and apply intensive equal color jitter to all images, emulating different skin colors. As shown in Fig. 3.10, they apply 4 “modality” transforms: RankPooling({X_j^k}, C = 1000), RankPooling({X_j^k}, C = 1), Flow(X_0^k, X_15^k), and Flow(X_0^k, X_1^k), where C is the hyperparameter of the SVM in the rank pooling algorithm [42]. The code for rank pooling was released at https://github.com/MRzzm/rank-pooling-python. These transforms return 4 tensors with sizes 3 × 112 × 112, 3 × 112 × 112, 2 × 112 × 112, and 2 × 112 × 112, respectively. The features of each modal sample are then extracted by an independent network (namely
SimpleNet, whose structure is depicted in Fig. 3.10) with output size d = 256, and all features are concatenated into a tensor of shape 4 × d. They then apply max, average and min pooling along the first dimension and concatenate the results to obtain a 3 × d tensor. Finally, a binary cross-entropy loss is adopted to train the network. The code of VisionLabs was released at https://github.com/AlexanderParkin/CASIA-SURF_CeFA.
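A condensed sketch of the fusion head described above (four per-"modality" descriptors pooled with max, average and min and classified with binary cross-entropy): SimpleNet itself is omitted, and the FusionHead module and its names are illustrative stand-ins rather than the released code.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Pools four 256-D modality descriptors and predicts real vs. fake."""
    def __init__(self, d=256):
        super().__init__()
        self.classifier = nn.Linear(3 * d, 1)  # max/avg/min pooled features

    def forward(self, feats):       # feats: (B, 4, d), one row per "modality"
        pooled = torch.cat([feats.max(dim=1).values,
                            feats.mean(dim=1),
                            feats.min(dim=1).values], dim=1)   # (B, 3d)
        return self.classifier(pooled).squeeze(1)              # logits

# Training would use binary cross-entropy on the logits:
# loss = nn.BCEWithLogitsLoss()(FusionHead()(feats), labels.float())
```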
3.3.1.3 2nd Ranked Solution (Team Name: BOBO)
Most CNN-based methods [12, 13, 45, 46] treat face anti-spoofing only as a binary classification task and train the neural network supervised by a softmax loss. However, these methods fail to explore the nature of spoof patterns [47], which consist of skin detail loss, color distortion, moiré patterns, motion patterns, shape deformation, and spoofing artifacts. To alleviate these issues, similar to [48], the BOBO team adopts depth supervision instead of a binary softmax loss for face anti-spoofing. Different from [48], they design a novel Central Difference Convolution (CDC) [49] and a Contrastive Depth Loss (CDL) for feature learning and representation. The structure of the depth map regression network based on CDC is shown in Fig. 3.11. It consists of 3 blocks, 3 attention layers connected after each block, and 3 down-sampling layers following each attention layer. Inspired by residual networks, they use a short-cut connection that concatenates the responses of the Low-level Cell (Block1), Mid-level Cell (Block2) and High-level Cell (Block3) and sends them to two cascaded convolutional layers for depth estimation. All convolutional layers use CDC, followed by a batch normalization layer and a rectified linear unit (ReLU) activation. The sizes of the input image and the regressed depth map are 3 × 256 × 256 and 1 × 32 × 32, respectively. The Euclidean Distance Loss (EDL) is used for pixel-wise supervision and is formulated as:
Fig. 3.11 The framework of the regression network. The figure is provided by the BOBO team, ranked No. 2 in the single-modal track
L_EDL = ||D_P − D_G||_2^2,  (3.5)

where D_P and D_G are the predicted depth and the ground-truth depth, respectively. EDL supervises the predicted depth pixel by pixel and ignores the depth differences among adjacent pixels. Intuitively, EDL only helps the network learn the absolute distance from objects to the camera; however, the depth relationship among different objects should also be supervised. Therefore, the Contrastive Depth Loss (CDL) is proposed to offer extra supervision, which improves the generality of the depth-based face anti-spoofing model:

L_CDL = Σ_i ||K_i^CDL ⊙ D_P − K_i^CDL ⊙ D_G||_2^2,  (3.6)

where K_i^CDL is the i-th contrastive convolution kernel, i ∈ [0, 7], and ⊙ denotes convolution. The details of the kernels can be found in Fig. 3.12. The total loss L_overall employed by this team is defined as follows:

L_overall = β · L_EDL + (1 − β) · L_CDL,  (3.7)

where β is a hyper-parameter trading off the EDL and CDL terms in the overall loss. Their code is publicly available at https://github.com/ZitongYu/CDCN/tree/master/FAS_challenge_CVPRW2020.
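As an illustration, Central Difference Convolution is often implemented as a vanilla convolution minus θ times a 1 × 1 convolution with the spatially summed kernel, as in the publicly released CDCN code; the sketch below follows that formulation and should be read as a re-implementation for clarity, not the team's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDC2d(nn.Module):
    """Central Difference Convolution: blends vanilla and central-difference terms."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)                      # vanilla convolution term
        if self.theta == 0:
            return out
        # Central-difference term: equivalent to convolving x with the
        # spatial sum of each kernel (a 1x1 convolution), scaled by theta.
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        out_diff = F.conv2d(x, kernel_sum)
        return out - self.theta * out_diff
```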
3.3.1.4 3rd Ranked Solution (Team Name: Harvest)
It can be observed from Table 2.5 that the attack types of the spoofs in the training and testing subsets are different. The Harvest team considered that the motion information of real faces is an important discriminative cue for face anti-spoofing, so effectively separating the motion of real faces from the interfering motion of replay attacks is a key step. As shown in Fig. 3.13, the live frames display obvious temporal variations, especially in expressions, while there is very little facial change in the print spoof samples of the same subject; this inspired the Harvest team to capture the subtle dynamic variations by relabelling the live sequences. Suppose the labels of spoof and live samples are 0 and 1, respectively. They define a new temporal-aware label
Fig. 3.12 The kernel K_i^contrast in the contrastive depth loss
Fig. 3.13 Visual comparison of the motion information of a real face, a replay attack, and a print attack for the Harvest team
by forcing the labels of the real face images in a sequence to change uniformly from 1 to 2, while the spoofing faces stay at 0. Let X = {x_1, x_2, ..., x_n} denote a video containing n frames, where x_1 and x_n represent the first and final frames, respectively. They encode this implicit temporal information by reformulating the ground-truth label as:

gt_i = 1 + i/n,  (3.8)

where the genuine label grows over time. Note that they do not encode temporal variations in the spoof videos due to their irregular variations in sequence. As shown in Fig. 3.14, the overall framework consists of two parts: (1) in the training stage, they encode the inherent discriminative information by relabelling the live sequences; (2) in the inference stage, they aggregate the static-spatial features with dynamic-temporal information for sample classification. Combined with the strong learning ability of the backbone, their method achieved 3rd place in the single-modal track, and the code is publicly available at https://github.com/yueyechen/cvpr20.
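The relabelling of Eq. 3.8 amounts to a few lines; the frame-index convention (starting at 1 so that the last live frame reaches exactly 2) is an assumption here.

```python
def temporal_labels(num_frames: int, is_live: bool) -> list:
    """Temporal-aware ground truth (Eq. 3.8): live labels grow uniformly
    from just above 1 toward 2 across the sequence, spoof labels stay at 0."""
    if not is_live:
        return [0.0] * num_frames
    return [1.0 + (i + 1) / num_frames for i in range(num_frames)]
```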
Fig. 3.14 The framework of the training and testing phases for Harvest. The figure is provided by the Harvest team, ranked No. 3 in the single-modal track
Fig. 3.15 Architecture of the network proposed for the single-modal track. The figure is provided by the ZhangTT team, ranked No. 4 in the single-modal track
3.3.1.5 Other Top Ranked Solutions
ZhangTT. Similar to the SD-Net baseline [40], this team proposes a two-branch network to learn hybrid features from static and temporal images, which they call the quality and time tensors, respectively. As shown in Fig. 3.15, they take ResNet [1] as the backbone of each branch and use a single frame and multiple frames as the inputs of the two branches. Specifically, the quality tensor and the time tensor are first sent to a standard convolution layer with a 7 × 7 receptive field for preliminary feature extraction. After feature extraction by three independent blocks, higher-level quality and time feature maps are obtained. The quality and time features are then concatenated to form a new feature map for final classification with a binary cross-entropy loss. The blocks in this work are the same as the ResNet block [1]. For data preprocessing, they first discarded the color information by converting the RGB modality to grayscale and then used histogram equalization to mitigate the skin-tone gap between ethnicities. Finally, they adopted the following four strategies to reduce the difference between replay and print attacks: 1) They regard face anti-spoofing as a classification task with 4 classes instead of 2; the 4 categories are live-invariable (label 0), fake-invariable (label 1), live-variable (label 2), and fake-variable (label 3). 2) Dithering each channel of the attack samples addresses the frame-to-frame consistency of print attacks. 3) To enhance robustness, Gaussian noise and gamma correction are randomly superimposed on each channel of the time tensor. 4) To discriminate texture differences, the first channel of the time tensor is separately identified and recorded as the quality tensor; it is sent to the network to extract features without noise superposition. Their code is publicly available at https://github.com/ZhangTTrace/CVPR2020-SingleModal.
Newland_tianyan. This team mainly explores the single-modal track from two aspects: data augmentation and network design. For data augmentation, on the one hand, they introduced print attacks into the training set by randomly pasting paper textures on real face samples. On the other hand, they performed random rotation, movement, brightness transformation, noise
Fig. 3.16 The architecture of the multi-task network for face anti-spoofing. The figure is provided by the Dopamine team, ranked No. 6 in the single-modal track
and fold-texture addition to the same frame of a real face to simulate the absence of micro-expression changes in print attacks. For network design, this team used a 5-layer sequence network which takes 16 frames as input to learn temporal features. To improve generalization across ethnicities, the images were subtracted from the neighborhood mean before being sent to the network, because samples of different ethnicities vary widely in skin color. Their code is publicly available at https://github.com/XinyingWang55/RGB-Face-antispoofing-Recognition.
Dopamine. This team uses face ID information for the face anti-spoofing task. As shown in Fig. 3.16, a multi-task network is designed to learn identity and authenticity features simultaneously. In the testing phase, the two scores are combined to determine whether a sample is a real face: they use the softmax score from the real/fake classifier and the feature computed by the backbone network (ResNet-100) to compute the minimal similarity to the same person. In theory, the feature-similarity score of an attack sample is close to 1, while that of a real face is close to 0. Their code is publicly available at https://github.com/xinedison/huya_face.
IecLab. This team uses FeatherNet and 3D ResNet [50] to learn authenticity and expression features of the samples, and finally merges the two features for the anti-spoofing task. Their code is publicly available at https://github.com/1relia/CVPR2020-FaceAntiSpoofing.
Chuanghwa Telecom Lab. This team combines sub-sequence features with Bag of local features [10] within the framework of MIMAMO-Net (https://github.com/wtomin/MIMAMO-Net). Finally, an ensemble learning strategy is used for feature fusion. Their code is publicly available at https://drive.google.com/open?id=1ouL1X69KlQEUl72iKHl0-_UvztlW8f_l.
Wgqtmac. This team focused on improving face anti-spoofing generalization ability and proposed an end-to-end trainable face anti-spoofing approach based on a deep neural network. They chose ResNet-18 [1] as the backbone and use a warm-up strategy to update the learning rate. The learned model performs well on the development subset; however, it easily overfits the training set and obtains worse results on the testing set. Their code is publicly available at https://github.com/wgqtmac/cvprw2020.git.
Fig. 3.17 The framework of PSMM-Net. The figure is provided by the baseline team, ranked No. 8 in the multi-modal track
Hulking. The main role of PipeNet, proposed by this team, is to selectively and adaptively fuse different modalities for face anti-spoofing. Since the single-modal track only allows the use of RGB data, the team's method has limited performance in this track. We detail the team's algorithm in Sect. 3.3.2. Their code is publicly available at https://github.com/muyiguangda/cvprw-face-project.
Dqiu. This team treats face anti-spoofing as a binary classification task and uses ResNet-50 [1] as the backbone to learn features. Since no additional effective strategies were used, no good results were achieved on the testing set.
3.3.2
Multi-modal Face Anti-spoofing Challenge Track
3.3.2.1 Baseline Method
In order to take full advantage of multi-modal samples to alleviate the ethnic and attack bias, we propose a novel multi-modal fusion network, namely PSMM-Net [40], shown in Fig. 3.17. It consists of two main parts: a) the modality-specific networks, which contain three SD-Nets to learn features from the RGB, Depth and IR modalities, respectively; b) a shared branch for all modalities, which aims to learn complementary features among the different modalities. To capture correlations and complementary semantics among the modalities, information exchange and interaction among the SD-Nets and the shared branch are designed. Two main kinds of losses are employed to guide the training of PSMM-Net. The first corresponds to the losses of the three SD-Nets for the color, depth and IR modalities, denoted as L_color, L_depth and L_ir, respectively. The second corresponds to the loss that guides the training of the entire network, denoted as L_whole, which is based on the summed features from all
SD-Nets and the shared branch. The overall loss L of PSMM-Net is denoted as:
L = L_whole + L_color + L_depth + L_ir.  (3.9)
3.3.2.2 1st Ranked Solution (Team Name: BOBO)
For the multi-modal track, as shown in Fig. 3.18, this team uses 3 independent networks (backbones) to learn features from the 3 modalities (i.e., RGB, Depth, IR). The entire structure therefore consists of two main parts: a) the modality-specific networks, which contain three branches (the backbone of each modality branch is not shared) to regress depth maps from the RGB, Depth and IR modalities, respectively; b) a fused branch (via concatenation) for all modalities, which aims to learn complementary features among the modalities and outputs a final depth map with the same size (1 × 32 × 32) as in the single-modal track. Similar to the single-modal track, the CDL and EDL loss functions are used in the multi-modal track in the form of a weighted sum. As the feature-level fusion strategy (see Fig. 3.18) might not be optimal for all protocols, they also tried two other fusion strategies: 1) input-level fusion, concatenating the three modal inputs directly into a 256 × 256 × 9 tensor, and 2) score-level fusion, weighting the predicted scores from each modality. For these two fusion strategies, the architecture of the single-modal CDCN (see Fig. 3.11) is used. Through comparative experiments, they concluded that input-level fusion (i.e., simple fusion by concatenation) might be sub-optimal because it is weak at representing and selecting the importance of the modalities. Therefore, the final result combines the best per-sub-protocol results (i.e., feature-level fusion for protocol 4_1 and score-level fusion for protocols 4_2 and 4_3). Specifically, for score fusion they average the results of the RGB and Depth modalities as the final score (i.e., fusion_score = 0.5 × RGB_score + 0.5 × depth_score). This simple ensemble strategy helped boost the performance significantly in their experiments.
Fig. 3.18 The framework of the regression network for the 3 modalities. The figure is provided by the BOBO team, ranked No. 1 in the multi-modal track
3.3.2.3 2nd Ranked Solution (Team Name: Super)
CASIA-SURF CeFA is characterized by multiple modalities (i.e., RGB, Depth, IR), and a key issue is how to fuse the complementary information between the three modalities. This team explored the multi-modal track from three aspects: (1) data preprocessing, (2) network construction, and (3) ensemble strategy design. Since the dataset used in this competition retains the black background area outside the face, the team tried to remove the background using a histogram-threshold method to mitigate its interference with model learning. To increase the diversity of training samples, they use random rotation within the range [−30°, 30°], flipping, cropping, and color distortion for data augmentation. Note that the three modalities of the same sample are transformed in a consistent manner to obtain features of the corresponding face region. Inspired by [51], which employs the “Squeeze-and-Excitation” block (SE block) [2] to re-weight the hierarchical features of each modality, this team adopts a multi-stream architecture with three sub-networks to study the dataset modalities, as shown in Fig. 3.19. The RGB, Depth, and IR data are learned separately by each stream, and shared layers are appended at a certain point (Res-4) to learn joint representations. However, a single-scale SE block [2] does not make full use of features from different levels. To this end, they extend the SE fusion from a single scale to multiple scales. As shown in Fig. 3.19, the Res-1, Res-2 and Res-3 blocks of each stream extract features from the different modalities. They first fuse the features from different modalities via SE blocks after Res-1, Res-2 and Res-3, respectively, then concatenate these fused features and send them to an aggregation block (Agg Block), and next merge these features (including the shared-branch features after Global Average Pooling (GAP)) via element-wise summation, similar to [52]. Finally, they use the merged features to predict real versus fake. Differently from [52], they add a dimension-reduction layer before the fully connected (FC) layer to avoid overfitting.
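For reference, one SE fusion step of the kind described (concatenating per-modality feature maps at a given scale and re-weighting channels with a squeeze-and-excitation block) could be sketched as follows; the channel sizes and the reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    """Concatenates per-modality feature maps and re-weights the channels
    with a squeeze-and-excitation block."""
    def __init__(self, channels_per_modality, reduction=16):
        super().__init__()
        c = 3 * channels_per_modality            # RGB + Depth + IR streams
        self.fc = nn.Sequential(
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True),
            nn.Linear(c // reduction, c), nn.Sigmoid())

    def forward(self, rgb, depth, ir):           # each: (B, C, H, W)
        x = torch.cat([rgb, depth, ir], dim=1)   # (B, 3C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze: global average pool
        return x * w.unsqueeze(-1).unsqueeze(-1) # excite: channel re-weighting
```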
Fig. 3.19 The framework of the Super team, with ResNet34 or IR_ResNet50 as the backbone. The figure is provided by the Super team, ranked No. 2 in the multi-modal track
Table 3.5 The network ensemble configurations adopted by the Super team. Networks A, B and C take ResNet34 as the backbone, while Network D takes IR_ResNet50; each network carries a different subset of the components SE block, dimension reduction, and Agg block (marked in the original table)
To increase the robustness to unknown attack types and ethnicities, they design several new networks based on the basic network, as shown in Table 3.5: for example, Network A has a dimension-reduction layer and no SE fusion after each res block, while Networks B and C are similar to [52] and [51], respectively. IR_ResNet50 uses an improved residual block aimed at the face recognition task. In their experiments, they found that different networks performed differently under the same sub-protocol. Therefore, they selectively trained these networks according to the different sub-protocols and obtained the final score by averaging the results of the selected networks. Their code is publicly available at https://github.com/hzh8311/challenge2020_face_anti_spoofing.
3.3.2.4 3rd Ranked Solution (Team Name: Hulking)
This team proposes a novel pipeline-based CNN fusion architecture (namely PipeNet), which takes a modified SENet-154 [2] as the backbone for multi-modal face anti-spoofing. Specifically, as shown in Fig. 3.20, it contains two modules, an SMP (Selective Modal Pipeline) module and an LFV (Limited Frame Vote) module, for the input of multiple modalities and sequences of video frames, respectively. The framework contains three SMP modules, and each module takes one modality (i.e., RGB, Depth, IR) as input. Taking the RGB modality as an example, they first use one frame as input and randomly crop it into patches, then send them to the Color Pipeline, which consists of data augmentation and feature extraction operations. They use a fusion strategy that concatenates the responses of the Color, Depth and IR Pipelines and sends them to the Fusion Module for further feature abstraction. After the linear layer, all frame features of the video are input to the LFV module, which iteratively calculates the probability that each frame belongs to a real face. Finally, the output is a prediction of the real-face probability of the input face video.
3.3.2.5 Other Top Ranked Solutions
Newland_tianyan. For the multi-modal track, this team uses two independent ResNet-9 [1] backbones to learn features from the depth and IR modalities, respectively. Similar to the
Fig. 3.20 The overall architecture of PipeNet. The figure is provided by the Hulking team, ranked No. 3 in the multi-modal track
single-modal track, the inputs of the depth branch are subtracted from the neighborhood mean before entering the network. In addition to data augmentation similar to the single-modal track, they transferred the RGB data of real samples to gray space and added light spots for augmentation. Their code is publicly available at https://github.com/Huangzebin99/CVPR-2020.
ZhangTT. A multi-stream CNN architecture called ID-Net is proposed for the multi-modal track. Given the different feature distributions of the different modalities, the proposed model attempts to exploit the interdependence between them. As shown in Fig. 3.21, this team trained two models: one using only IR as input and the other using both IR and Depth as inputs. Specifically, a multi-stream architecture with two sub-networks performs multi-modal feature fusion, and the feature maps of the two sub-networks are concatenated after a convolutional block. The final score is a weighted average of the results of the two models. Their code is publicly available at https://github.com/ZhangTT-race/CVPR2020-MultiModal.
Harvest. Unlike other teams that pay more attention to the network structure, this team mainly explores data preprocessing and data augmentation to improve generalization performance. Through experimental comparison, they found that IR data is more suitable for the face anti-spoofing task; therefore, in this multi-modal track, only the IR modality participates in model training. Similar to team Super, they first use a face detector to remove the background area outside the face. Concretely, they detect the face ROI (Region of Interest) in the RGB data and then map these ROIs to the IR data to get the corresponding face positions. Since only IR data is used, more sample augmentation strategies are employed during training to prevent overfitting; for example, each image is randomly divided into patches in an online manner before being sent to the network. Besides, they tried some tricks including triplet loss with semi-hard negative mining, sample interpolation augmentation, and label smoothing.
Fig. 3.21 Architecture of the network proposed for the multi-modal track. The figure is provided by the ZhangTT team, ranked No. 5 in the multi-modal track
Qyxqyx. Based on the work in [47], this team adds an additional binary classification supervision to improve performance in the multi-modal track. Specifically, the network structure is from [47, 53], and the additional binary supervision is inspired by [54]. As shown in Fig. 3.22, taking the RGB modality as an example, the input samples are supervised by two loss functions, a binary classification loss and a regression loss, after passing through the feature network. Finally, the weighted sum of the binary output and the pixel-wise regression output is used as the final score. Their code is publicly available at https://github.com/qyxqyx/FAS_Chalearn_challenge.
Skjack. The network structure is similar to that of team Super. They use ResNet-9 [1] as the backbone and fuse the RGB, Depth and IR features after the res-3 block; a 1 × 1 convolution is then used to compress the channels. Since there are no additional innovations, the team's algorithm did not perform well in this competition. Their code is publicly available at https://github.com/skJack/challange.git.
Fig. 3.22 The supervision and the network of the Qyxqyx team. The orange cubes are convolution layers. The pixel-wise binary label in their experiment is resized to 32 × 32 resolution. The figure is provided by the Qyxqyx team, ranked No. 7 in the multi-modal track
3.3.3
Summary
We organized the ChaLearn Face Anti-spoofing Attack Detection Challenge at CVPR2020 based on the CASIA-SURF CeFA dataset, with two tracks, running on the CodaLab platform. The two tracks attracted 340 teams in the development stage; finally, 11 and 8 teams submitted their codes in the single-modal and multi-modal face anti-spoofing recognition challenges, respectively. We described the associated dataset and the challenge protocol, including the evaluation metrics. We reviewed the proposed solutions in detail and reported the challenge results. Compared with the baseline method, the best participants improved the ACER from 36.62 to 2.72 in the single-modal track and from 32.02 to 1.02 in the multi-modal track. We analyzed the results of the challenge, pointing out the critical issues in the PAD task and the shortcomings of existing algorithms. Future lines of research in the field have also been discussed.
3.4
3D High-Fidelity Mask Face PAD Challenge
Based on the HiFiMask dataset and our protocol, we successfully held a competition, the 3D High-Fidelity Mask Face Presentation Attack Detection Challenge at ICCV2021 (https://competitions.codalab.org/competitions/30910), attracting 195 teams from all over the world. The results of the top three teams are far better than our baseline results, which greatly advances the current best performance of mask attack detection.
3.4.1
Baseline Method
In this section we introduce the Contrastive Context-Aware Learning (CCL) framework for 3D high-fidelity mask PAD. CCL trains models by contrastive learning in a supervised manner. As illustrated in Fig. 3.23, CCL contains a data pair generation module that builds input pairs by leveraging rich contextual cues, a contrastive learning architecture designed for the face PAD task, and a Context Guided Dropout (CGD) module that accelerates network convergence during the early training stages.
3.4.1.1 Data Pair Generation
To effectively leverage the rich contextual cues (e.g., skin, subject, type, scene, light, sensor, and inter-frame information) in the HiFiMask dataset, we organize the data into pairs and freeze some of the contexts to mine the discrimination of the others; e.g., when we select a live face and a resin mask face of the same subject, the contrast is the discrepancy between the materials of live skin and resin. We split the live and mask faces into fine-grained patterns to generate a variety of meaningful contextual pairs. Specifically, we generate contextual
Fig. 3.23 The CCL framework. The left part (yellow) denotes the data pair generation. Each pair of images is processed twice by the central framework, consisting of an online network (f, g, h), a target network without gradient backpropagation (f′, g′) and a classifier head (l). The right part (blue) denotes the CGD module; the positive embeddings are pulled closer by CGD
pairs in the following way: 1) in Pattern 1, we sample two frames from a single video as one kind of positive context pair; 2) in Pattern 5, we sample one fine-grained mask category and the living category with the same subject as a negative context pair; 3) positive and negative context pairs, including but not limited to the above combinations, are generated as the training set.
3.4.1.2 Network Architecture
Recently, self-supervised contrastive learning methods such as SimCLR [55], BYOL [56], and SimSiam [57] have achieved outstanding performance in downstream prediction tasks. The purpose of these algorithms is to learn effective visual representations in advance. Therefore, treating the FAS task as a downstream task for the first time, we build our approach on a self-supervised contrastive learning framework that aims to learn useful visual representations beforehand. Inspired by the architectures in self-supervised learning [55–57], we extend the self-supervised batch contrastive approach to the fully-supervised setting, allowing us to effectively leverage label information, and propose the CCL framework, consisting of an online network and a target network for pairwise contrastive information propagation. At the same time, an extra classifier is used for explicit supervision. As shown in Fig. 3.23, well-organized contextual image pairs are used as the inputs of CCL and are sent to the online and target networks. The online network is composed of three modules: an encoder network f (a backbone network and a fully connected layer), a projector g and a predictor h (with the same multi-layer perceptron structure). Similarly, the target network has an encoder f′ and a projector g′ with weights different from those of the online network. As shown in Eq. 3.10, the weights θ′ of the target network are an exponential moving average [58] of the online parameters θ. We perform the moving-average update after each step with the target decay rate τ in Eq. 3.11,
θ′ ← τ θ′ + (1 − τ) θ,  (3.10)

τ = 1 − (1 − τ_base) · (cos(π s/S) + 1)/2.  (3.11)
The base decay rate τ_base is set to 0.996, s is the current training step, and S is the maximum number of training steps. In addition, a classifier head l is added after the encoder f to perform supervised learning. During the inference stage, only the encoder f and the classifier l are applied to discriminate mask samples. In fact, the classifier can be trained jointly with the encoder and projector networks and achieves roughly the same results without requiring two-stage training [57]. Therefore, the proposed CCL is a supervised extension of self-supervised contrastive learning, and the effective visual representation pre-learning and the downstream FAS task are completed in a single stage.
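Eqs. 3.10 and 3.11 correspond to a BYOL-style moving-average update, which could be implemented as below; online and target are assumed to be modules with identical parameter layouts.

```python
import math
import torch

@torch.no_grad()
def update_target(online, target, step, max_steps, tau_base=0.996):
    """Exponential moving average of the online parameters into the target
    network, with the cosine schedule of Eq. 3.11 for the decay rate tau."""
    tau = 1.0 - (1.0 - tau_base) * (math.cos(math.pi * step / max_steps) + 1.0) / 2.0
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(tau).add_((1.0 - tau) * p_o)   # theta' <- tau*theta' + (1-tau)*theta
```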
3.4.1.3 Context Guided Dropout
In classical self-supervised contrastive learning frameworks [55–57], the input images x_1 and x_2 are augmented from a single source image x; as a result, the similarity loss between x_1 and x_2 decreases smoothly to a relatively low level. In contrast, our proposed CCL constructs positive contextual pairs from separate source images, which suffer from high dissimilarity, leading to unstable convergence. Moreover, the contextual features (e.g., scenes) might not always be relevant to the live/spoof cues, leading to a large similarity loss. Inspired by the dropout operator [59, 60], which randomly discards parts of the neurons during training, we propose Context Guided Dropout (CGD), which adaptively discards parts of the 'outlier' embedding features according to their similarities. For instance, given the embeddings of a positive pair, we assume that abnormal differences between them belong to the context information; therefore, we can automatically drop the abnormal embedding channels with large dissimilarities after ranking them. For a positive n-dimensional embedding pair z_1 and p_2, we first calculate the difference vector δ via

δ_i = | (z_1 / ||z_1||_2)_i^2 − (p_2 / ||p_2||_2)_i^2 |.  (3.12)
Afterward, we sort δ in descending order and record the indices of the largest p_d · n values, where p_d is the proportion of embedding feature channels to be discarded. We execute this procedure within a mini-batch to determine the discarding positions. Besides, the embedding, after discarding, is scaled by a factor of 1/(1 − p_d), similar to the inverted dropout method.
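A possible implementation of CGD following Eq. 3.12 is sketched below; the per-sample channel selection is a simplification of the mini-batch procedure described above, and the function name and signature are illustrative.

```python
import torch

def context_guided_dropout(z1, p2, p_drop=0.1):
    """Drops the embedding channels with the most 'context-like' (dissimilar)
    responses in a positive pair and rescales the rest, as in inverted dropout.

    z1, p2: (B, n) embeddings of a positive pair.
    """
    delta = ((z1 / z1.norm(dim=1, keepdim=True)) ** 2 -
             (p2 / p2.norm(dim=1, keepdim=True)) ** 2).abs()     # Eq. 3.12
    k = int(p_drop * z1.size(1))
    if k == 0:
        return z1, p2
    # Zero out the k channels with the largest differences per sample.
    idx = delta.topk(k, dim=1).indices
    mask = torch.ones_like(z1)
    mask.scatter_(1, idx, 0.0)
    scale = 1.0 / (1.0 - p_drop)
    return z1 * mask * scale, p2 * mask * scale
```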
3.4.2
1st Ranked Solution (Team Name: VisionLabs)
Due to the subtle spoof cues of 3D face masks and the difficulty of distinguishing them, team VisionLabs proposed a pipeline based on high-resolution face parts cropped from the original image, as shown in Fig. 3.24. These parts are used as additional information to classify the full images through the network. During the preparation stage, centered face crops are created using the Dual Shot Face Detector (DSFD) [61]. The crop bounding box is expanded 1.3× around the face detection bounding box; if the bounding box extends beyond the original image border, the missing parts are filled with black, and if no face is found, the original image is used instead of the crop. Then, five face regions are cropped using prior information from the face bounding box: eyes, nose, chin, left ear, and right ear. Each part is resized to 224 × 224 and fed into the backbone after data augmentation (i.e., rotation, random crop, color jitter). Additionally, as a regularization technique, they turned 10% of the images into trash images by scaling random tiny parts of the images. As shown in Fig. 3.24, team VisionLabs used a multi-branch network, including Face, Eyes, Nose, Chin, and Ears branches. Since pre-trained weights are prohibited in this competition, they tried to replicate the generalization ability of pre-trained convolutional filters by sharing the first block across branches, which makes the first-block filters learn more diverse features. All five branches adopt EfficientNet-B0 [62] as the backbone, and the descriptor size of each branch is reduced from 1280 to 320. Due to the presence of left and right ears, the Ears branch outputs two vectors. Then, the loss and confidence
Fig. 3.24 The pipeline of team VisionLabs. Original images are cropped by the DSFD detector [61] and split into five regions using prior knowledge. These parts are input into the backbone with one shared convolution block and five branches. Each branch outputs a 320-dimensional vector (two vectors for the ears branch); all vectors are concatenated into one vector used to compute L_total
of each branch can be obtained through its fully connected layer, giving L_face, L_eyes, L_nose, L_chin, L_Lear, and L_Rear (Lear for the left ear and Rear for the right ear). The six 320-dimensional descriptors are concatenated into a 1920-dimensional vector, which is used to compute the loss function L_total. All branches are trained simultaneously with the final loss:

L = 5 ∗ L_total + 5 ∗ L_face + L_eyes + L_nose + L_chin + 0.5 ∗ L_Lear + 0.5 ∗ L_Rear,  (3.13)
where all losses are binary cross-entropy (BCE) losses. Since the face parts do not always contain the subtle spoof cues, for the eyes, nose, chin and ears branches they increase the positive-class weight in the BCE loss by a factor of 5, so that the partial face-part descriptors are not punished too hard when they do not contain useful features. They trained the model with the Adam optimizer for 60 epochs, using an initial learning rate of 0.0006 and decreasing it every 3 epochs by a factor of 0.9. During the inference phase, they chose 0.7 as the test-set threshold, since it is unreliable to select a threshold from a validation set whose scores are close to the full score. Because some face parts may be cropped in the wrong way when relying on average positions and prior information, a test-time augmentation is introduced: they flip each image and obtain the final result by averaging the scores of the original and flipped faces.
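The final loss of Eq. 3.13 could be assembled as follows; using the pos_weight argument of BCEWithLogitsLoss for the factor-5 positive-class weighting is an assumption about how the weighting was realized, and the logits dictionary is a hypothetical interface.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()                                    # total and face branches
bce_pos5 = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.0))   # part branches

def total_loss(logits, y):
    """logits: dict with keys 'total', 'face', 'eyes', 'nose', 'chin',
    'lear', 'rear'; y: float labels (1 = real, 0 = fake)."""
    return (5 * bce(logits['total'], y) + 5 * bce(logits['face'], y)
            + bce_pos5(logits['eyes'], y) + bce_pos5(logits['nose'], y)
            + bce_pos5(logits['chin'], y)
            + 0.5 * bce_pos5(logits['lear'], y)
            + 0.5 * bce_pos5(logits['rear'], y))
```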
3.4.3
2nd Ranked Solution (Team Name: WeOnlyLookOnce)
In this method, considering that there is irrelevant noise in the raw training data, a custom algorithm is first used to detect black borders. After that, DSFD [61] is applied to detect potential faces in each image. Specifically, the training set is processed only by removing black borders, while the testing and validation sets are further cropped with a ratio of 1.5 times the bounding box. Moreover, there are far fewer positive samples than negative samples in the training set. The training augmentations include rotation, image crop, color jitter, etc. As shown in Fig. 3.25, the framework [63–66] consists of a CNN branch and a CDC branch. Both networks are self-designed lightweight ResNet12 models, and each of them is a three-class classification network aiming to distinguish real images from two kinds of masks. The CNN branch uses vanilla convolutions, while the CDC branch uses Central Difference Convolution [49]. To alleviate overfitting, the team additionally adopted a label smoothing strategy and an output distribution tuning strategy inspired by temperature scaling [67]. After computing the cross-entropy loss from the logits and labels, the total loss is calculated by the following equation:

L = L_concat + 0.5 ∗ L_cnn + 0.5 ∗ L_cdc.  (3.14)
Fig. 3.25 The framework of team WeOnlyLookOnce. The DSFD face detector is used to detect the bounding box; a lightweight self-defined ResNet12 then classifies the input into three categories. Label smoothing and output distribution tuning are used as additional tricks
To minimize the distribution gap between the validation and test sets, this team proposed an effective distribution tuner offering two strategies, both of which proved effective. In the first strategy, they reformulate the three-class classification task as a binary classification task by summing the two attack-class logits into one value, then dividing the real logit by a factor of 3.6 and the fake logit by a factor of 5.0 before the softmax operation. In the second strategy, the task remains a three-class classification problem, and the real score on the validation set is simply reduced by 0.07.
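The first tuning strategy can be written down directly from the description above (merge the two attack logits, temper the real and fake logits by 3.6 and 5.0, then apply softmax); the three-class logit layout assumed here is illustrative.

```python
import torch
import torch.nn.functional as F

def tuned_binary_score(logits):
    """logits: (B, 3) = [real, mask_type_1, mask_type_2].
    Returns the calibrated probability of the 'real' class."""
    real = logits[:, 0] / 3.6                   # temperature for the real class
    fake = (logits[:, 1] + logits[:, 2]) / 5.0  # merged, tempered attack classes
    return F.softmax(torch.stack([real, fake], dim=1), dim=1)[:, 0]
```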
3.4.4
3rd Ranked Solution (Team Name: CLFM)
Team CLFM built a model based on CDCN++ trained with only a cross-entropy loss, yet obtained a good result. Central difference convolution is used to replace traditional convolution, attention modules are introduced at each stage to make the model perform better, and the outputs of the three stages are fused into a feature vector before the fully connected layer. For data pre-processing, they adopt their own face detection model and take patches of the face as input. Notably, they applied several practical tricks to both the training and test sets. On the one hand, they found that hats and glasses are likely to lead the model in the wrong direction, so they first crop the face according to the bounding box and then crop the region around the mouth. The face size is randomly varied within a small range to improve the generalization of the model; if the region is not large enough, they flip and mirror the region to keep the texture consistent. The model input consists of square blocks resized to 56 × 56 and normalized with mean and standard deviation parameters computed on ImageNet. On
the other hand, they also noticed that for about 17% of the images in the test set the face detector did not find any face, in which case the model has no choice but to use the whole image as the bounding box. They therefore randomly set part of the training data's bounding boxes to the whole image and slightly changed the cropping process of the test set compared with the training set, so that the model is at least not purely guessing in this situation. Finally, they use self-voting: the patch is moved within a small range and the resulting scores are averaged as the final score.
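The self-voting step could be sketched as follows; the shift offsets, patch size and centering are illustrative assumptions, not the team's exact settings.

```python
import torch

def self_vote(model, image, patch=56, shifts=(-4, 0, 4)):
    """Averages model scores over a patch moved within a small range.

    image: (C, H, W) tensor; the patch is taken around the image center."""
    C, H, W = image.shape
    cy, cx = H // 2, W // 2
    scores = []
    for dy in shifts:
        for dx in shifts:
            y0 = max(0, min(H - patch, cy - patch // 2 + dy))
            x0 = max(0, min(W - patch, cx - patch // 2 + dx))
            crop = image[:, y0:y0 + patch, x0:x0 + patch].unsqueeze(0)
            scores.append(model(crop))
    return torch.stack(scores).mean()
```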
3.4.5
Other Top Ranked Solutions
Oldiron666 The team Oldiron666 proposed a self-dense regularization framework for face anti-spoofing. For data pre-processing, they expand the cropped face with an adaptive scale, which improves performance. The input size of the images is 256, and data augmentations such as random crop, Cutout and patch shuffle augmentation are performed to improve generalization. The team used a representation learning framework similar to SimSiam [68], but introduces a multi-layer perceptron (MLP) for supervised classification. During training, the face image X is randomly augmented to obtain two views, X_1 and X_2, as input. The two views are processed by an encoder network f, which consists of the backbone and an MLP head called the projector [69]. They found that a lighter network may bring better performance on HiFiMask; therefore, ResNet6 is utilized as the backbone, which has low computational complexity. The output of one view is transformed to match the other view by a dense predictor MLP head, denoted h. The dense similarity loss, denoted L_contra, maximizes the similarity between the two sides. To implement supervised learning, they attach a dense classifier c at the end of the framework and use the Mean Squared Error (MSE) to evaluate its output: the MSE loss with the ground-truth label on one side is denoted L_cls, while L_d is the difference between the category outputs of the two sides. The training loss is defined as

L = L_contra + L_cls + 0.1 ∗ L_d.  (3.15)
During the training process, they use half-precision floating point to obtain a faster training speed. The SGD optimizer is adopted, with an initial learning rate of 0.03, weight decay of 0.0005, and momentum of 0.9. During inference, only the X_1 side is executed to obtain the face anti-spoofing result.
Reconova-AI-Lab Team Reconova-AI-Lab contributed a variety of models and generated many different results, the best of which was used for the competition. They proposed a multi-task learning algorithm which mainly includes three branches: a direct classification branch, a branch in which real faces learn a Gaussian mask, and a Region of Interest (ROI) classification branch. In the rest of this section, we use Cls, Seg, and ROI as
Fig. 3.26 The application flow chart of team Reconova-AI-Lab. Raw images are first pre-processed by RetinaFace for face detection, cropping and alignment in the upper flow; then a multi-task learning algorithm with three branches is applied
abbreviations for these branches, respectively. The Cls branch uses a focal loss combining sigmoid and BCE loss as its supervision, denoted loss_classi. The Seg branch adopts the same loss function as Cls, denoted loss_seg. The ROI branch uses three loss functions, namely loss_cls1, loss_cls2 and loss_center: the first is the focal loss mentioned before, the second serves the alignment of the ROI used to calibrate the ROI pooling operation, and the last reduces the intra-class feature distance. The loss of the ROI branch is loss_roi = loss_cls1 + 0.01 × loss_center + loss_cls2. All branches are trained synchronously with an SGD optimizer for 800 epochs, and the total loss is formulated as follows:

total_loss = loss_classi + loss_seg + loss_roi.  (3.16)
Their application flow chart is shown in Fig. 3.26. First, the data pre-processing includes the use of RetinaFace to detect the face and generate 14 landmarks per face, including face coordinates and bounding boxes of left, right ear, and mouth. At that stage, they use some strategies to avoid large-angle posture and non-existence of face by constraining the size of the bounding box of ROI. Meanwhile, they take mirroring, random rotation, random color enhancement, random translation, and random scaling as treatments of data enhancement. Then they adopt a backbone called Res50_IR, which has stacked 3, 4, 14, and 3 blocks respectively in four stages. In order to enhance features, an improved Residual Bottleneck structure named Yolov3_FPN is connected to the different stages of the network. The slightly complicated network is followed by three branches mentioned before. All of the parameters are initialized by different methods according to different layers. Inspire The team firstly utilized a ResNet50 [1] based RetinaFace [70] to detect face bounding boxes for all images. To be noticed, three different threshold values of 0.8, 0.1, and 0.01 are used to record the different types of bounding boxes. If the detecting confidence value is above 0.1, the box label is set to be 2. If it is between 0.1 and 0.01, the box label is 1. While the value is less than 0.01, the box label remains to be 0. According to the box label depicted above, hard samples of the cropped images is partitive.
Fig. 3.27 The framework of team Inspire. Raw images are first processed in the upper flow; the team then trains within a Context Contrastive Learning (CCL) framework, with SE-ResNeXt101 as the backbone
For the training stage, SE-ResNeXt101 [11] was selected as the backbone. The team applied the Context Contrastive Learning (CCL) [71] architecture as the framework, shown in Fig. 3.27, and used the same sampling strategy as in [71]. The MSE loss L_MSE, the cross-entropy loss L_CE and the contrastive loss [72] L_Contra are combined into the total loss with the following weights:

L = L_MSE + L_CE + 0.7 ∗ L_Contra.  (3.17)
Afterward, the Ranger optimizer (https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer) is used as the learning strategy with an initial learning rate of 0.001. The total number of epochs is 70, and the learning rate decays by 0.1 at epochs 20, 30 and 60.
Piercing Eye The team Piercing Eye used a modified CDCN [49] as the basic framework, shown in Fig. 3.28. During data processing, the face regions are detected in the original images, resized to 256 × 256, and randomly cropped to 228 × 228. As with other teams, some types of data augmentation, such as color jitter, are used. In addition to the original output (a depth map) of CDCN, a multi-layer perceptron (MLP) is attached to the backbone to implement a global binary classification. The shape of the depth map is 32 × 32; the label of the real-face region is set to 1, while the background and fake-face regions are set to 0. They trained the model with an SGD optimizer for 260 epochs, using an initial learning rate of 0.002 and decreasing it by a factor of 0.5 at milestones. As in [49], both the mean square error loss L_MSE and the contrastive depth loss L_CDL are utilized for pixel-wise supervision. They also apply a cross-entropy loss in the global branch, denoted L_CE, so the overall loss function is formulated as

L = 0.5 ∗ L_MSE + 0.5 ∗ L_CDL + 0.8 ∗ L_CE.  (3.18)
Fig. 3.28 The framework of team Piercing Eye. Two branches are attached to the CDCN backbone, called map regression and global classifier, respectively
msxf_cvas From the analysis of the competition data, the team found two different distributions of spoof masks: transparent material and high-fidelity material. They treat the two high-fidelity materials (plaster and resin) as one category, since the features of these two types look similar. Besides, there is a small amount of noisy data without a human face, which contains neither spoof nor live features; therefore, the team classifies it as a separate category called non-face. The final task is thus to classify all data into four categories: live, transparent mask, resin mask, and non-face. Considering that the competition data contain many extreme poses, lighting conditions and low-quality samples, they focus on data augmentation strategies during training, including CutMix, ISONoise, RandomSunFlare, RandomFog, MotionBlur, and ImageCompression. First, the team applied a face detector to detect faces and align them by five points. After that, the mmclassification project (https://github.com/open-mmlab/mmclassification) was used to train the face anti-spoofing model: they chose ResNet-34 [1] as the backbone and the cross-entropy loss as the loss function. The whole framework is illustrated in Fig. 3.29.
VIC_FACE A prerequisite to note is that deep bilateral filtering has been successfully applied within convolutional networks to filter deep features instead of the original images. Inspired by this, team VIC_FACE proposed a novel method that fuses a deep bilateral operator (DBO) into the original CDCN in order to learn more intrinsic features by aggregating multi-level bilateral macro- and micro-information. As shown in Fig. 3.30, the backbone is an initial CDCN, which divides the network into
Fig. 3.29 The framework of team msxf_cvas. Raw images are first processed by face detection and alignment; a ResNet34 network is then used to classify the input image into four types: live, transparent mask, resin mask, and non-face
Fig. 3.30 The framework of team VIC_FACE
VIC_FACE
A useful prerequisite is that deep bilateral filtering has been successfully applied inside convolutional networks to filter deep features instead of the original images. Inspired by this, team VIC_FACE proposed a method that fuses a deep bilateral operator (DBO) into the original CDCN in order to learn more intrinsic features by aggregating multi-level bilateral macro- and micro-information. As shown in Fig. 3.30, the backbone is the original CDCN, which is divided into low-level, mid-level, and high-level blocks and predicts a gray-scale facial depth map of size 1 × 32 × 32 from a single RGB face image of size 3 × 256 × 256. The DBO, a channel-wise deep bilateral filter, mimics a residual layer embedded in the network and replaces the original convolution layer by representing the aggregated bilateral base and residual features. Specifically, in the first stage they detect and crop the face area from the full image as input to the model. Second, they randomly apply down-sampling and JPEG compression to the images, degradations that often occur unintentionally when images are captured by different devices. Moreover, besides regular data augmentation methods such as cutout, color jitter, and erasing to improve the generalization of the model, affine transformations of the brightness and color of random regions, implemented with OpenCV, are applied to simulate different lighting conditions in the training data. Finally, they design a contrastive loss function to control the contrast of the predicted gray-scale depth map and a mean-square error loss function to reduce the difference between the augmented input and the binary mask, and combine them into one loss optimized with Adam.
DXM-DI-AI-CV-TEAM
Because the challenge evaluates generalization to unknown attack scenarios, this team casts face anti-spoofing as a domain generalization (DG) problem. To let the model generalize well to unseen scenes, inspired by [73], the proposed framework trains the model to perform well under simulated domain shift, which is achieved by finding generalized learning directions in a meta-learning process. Different from [73], the team removed the branch using depth prior knowledge, since the real faces and the masks contain similar depth information. Besides, a series of data augmentation and training strategies are used to achieve the best results. In the challenge, the training data are collected in 3 scenes, namely White Light, Outdoor Sunshine, and Motion Blur (scenes 1, 4, and 6 for short). The objective of DG for this challenge is therefore to make a model trained on the 3 source scenes generalize well to unseen attacks from the target scene. To this end, as shown in Fig. 3.31, the team's framework is composed of a feature extractor and a meta learner. At each training iteration, they divide the 3 source scenes by randomly selecting 2 scenes as meta-train scenes and the remaining one as the meta-test scene. In each meta-train and meta-test scene, the meta learner conducts meta-learning in the feature space, supervised by image and label pairs denoted as x and y, where y is a binary ground-truth label (y = 0/1 for fake/real faces). In this way, the model learns how to perform well under scene shift through many training iterations and thus learns to generalize to unseen attacks.
Fig. 3.31 The framework of DXM-DI-AI-CV-TEAM
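A highly simplified sketch of the scene-splitting scheme is shown below: at every iteration two of the three source scenes serve as meta-train and the remaining one as meta-test, and the update follows the combined gradient (a first-order simplification). The inner-loop adaptation and regularization of [73] are omitted, and the loaders, model, and optimizer are placeholders.

import random
import torch
import torch.nn.functional as F

def meta_iteration(model, scene_loaders, optimizer):
    """One simplified training iteration over the 3 source scenes.

    scene_loaders: dict mapping scene id -> iterable of (images, labels) batches,
    with labels 0 = fake face and 1 = real face.
    """
    scenes = list(scene_loaders)
    random.shuffle(scenes)
    meta_train, meta_test = scenes[:2], scenes[2]

    optimizer.zero_grad()
    # Meta-train: supervised loss on two randomly chosen source scenes.
    for s in meta_train:
        x, y = next(iter(scene_loaders[s]))
        F.cross_entropy(model(x), y).backward()
    # Meta-test: the held-out scene simulates the domain shift seen at test time.
    x, y = next(iter(scene_loaders[meta_test]))
    F.cross_entropy(model(x), y).backward()
    # First-order update along the combined meta-train/meta-test direction.
    optimizer.step()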
3.4.6
Summary
Through the introduction and result analysis of the team methods in this challenge, we summarize the ideas that proved effective for mask attack detection: (1) At the data level, data expansion was adopted by almost all teams; data augmentation therefore plays an important role in preventing over-fitting and improving the stability of the algorithms. (2) Segmenting the face region not only enlarges local information, helping to mine the differences between a high-fidelity mask and a live face, but also avoids extracting irrelevant features such as face ID. (3) Multi-branch feature learning is a framework widely used by the participating teams: a multi-branch network first mines the differences between masks and live faces from multiple aspects, such as texture, color contrast, and material, and feature fusion is then used to improve the robustness of the algorithm.
References 1. Kaiming H, Xiangyu Z, Shaoqing R, Jian S (2016) Deep residual learning for image recognition. CVPR 2. Jie H, Li S, Gang S (2018) Squeeze-and-excitation networks. CVPR 3. Aleksandr P, Oleg G (2019) Recognizing multi-modal face spoofing with face recognition networks. In: The IEEE conference on computer vision and pattern recognition (CVPR) Workshops 4. Jia D, Wei D, Richard S, Li-Jia L, Kai L, Li F.-F (2009) Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255 5. Dong Y, Zhen L, Shengcai L, Stan ZL (2014) Learning face representation from scratch. arXiv preprint arXiv:1411.7923 6. Zhenxing N, Mo Z, Le W, Xinbo G, Gang H (2016) Ordinal regression with multiple output cnn for age estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4920–4928 7. Guo Y, Zhang L, Yuxiao H, He X (2016) Gao J (2016) Ms-celeb-1m: Challenge of recognizing one million celebrities in the real world. Electron Imaging 11:1–6 8. Jian Z, Yu C, Yan X, Lin X, Jianshu L, Fang Z, Karlekar J, Sugiri P, Shengmei S, Junliang X, et al (2018) Towards pose invariant face recognition in the wild. In: CVPR, pp 2207–2216 9. Shifeng Z, Xiaobo W, Ajian L, Chenxu Z, Jun W, Sergio E, Hailin S, Zezheng W, Stan ZL (2019) A dataset and benchmark for large-scale multi-modal face anti-spoofing. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 919–928 10. Tao S, Yuyu H, Zhijun T (2019) Facebagnet: Bag-of-local-features model for multi-modal face anti-spoofing. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops 11. Saining X, Ross G, Piotr D, Zhuowen T, Kaiming H (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1492–1500 12. Keyurkumar P, Hu H, Anil KJ (2016) Cross-database face antispoofing with robust feature representation. In: Chinese conference on biometric recognition. Springer, pp 611–619
13. Lei L, Xiaoyi F, Zinelabidine B, Zhaoqiang X, Mingming L, Abdenour H (2016) An original face anti-spoofing approach using partial convolutional neural network. In: 2016 sixth international conference on image processing theory, tools and applications (IPTA). IEEE, pp 1–6 14. Javier H-O, Julian F, Aythami M, Pedro T (2018) Time analysis of pulse-based face antispoofing in visible and nir. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 544–552 15. Zezheng W, Chenxu Z, Yunxiao Q, Qiusheng Z, Zhen L (2018) Exploiting temporal and depth information for multi-frame face anti-spoofing. arXiv preprint arXiv:1811.05118 16. Peng Z, Fuhao Z, Zhiwen W, Nengli D, Skarpness M, Michael F, Juan Z, Kai L (2019) Feathernets: Convolutional neural networks as light as feather for face anti-spoofing. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops 17. Gao H, Zhuang L, Laurens VDM, Kilian QW (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4708 18. Mark S, Andrew H, Menglong Z, Andrey Z, Liang-Chieh C (2018) Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4510–4520 19. Ningning M, Xiangyu Z, Hai-Tao Z, Jian S (2018) Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: Proceedings of the European conference on computer vision (ECCV), pp 116–131 20. Ke S, Mingjie L, Dong L, Jingdong W (2018) Igcv3: Interleaved low-rank group convolutions for efficient deep neural networks. arXiv preprint arXiv:1806.00178 21. Min L, Qiang C, Shuicheng Y (2013) Network in network. arXiv preprint arXiv:1312.4400 22. Bichen W, Alvin W, Xiangyu Y, Peter J, Sicheng Z, Noah G, Amir G, Joseph G, Kurt K (2018) Shift: A zero flop, zero parameter alternative to spatial convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 9127–9135 23. Jiankang D, Jia G, Niannan X, Stefanos Z (2019) Arcface: Additive angular margin loss for deep face recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4690–4699 24. Sheng C, Yang L, Xiang G, Zhen H (2018) Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. In: Chinese conference on biometric recognition. Springer, pp 428–438 25. Jonathan LL, Ning Z, Trevor D (2014) Do convnets learn correspondence? In: Advances in neural information processing systems, pp 1601–1609 26. Wenjie L, Yujia L, Raquel U, Richard Z (2016) Understanding the effective receptive field in deep convolutional neural networks. In: Advances in neural information processing systems, pp 4898–4906 27. François C (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1251–1258 28. Andrew GH, Menglong Z, Bo C, Dmitry K, Weijun W, Tobias W, Marco A, Hartwig A (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 29. Christian S, Wei L, Yangqing J, Pierre S, Scott R, Dragomir A, Dumitru E, Vincent V, Andrew R (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9 30. 
Junyuan X, Tong H, Zhi Z, Hang Z, Zhongyue Z, Mu L (2018) Bag of tricks for image classification with convolutional neural networks. arXiv preprint arXiv:1812.01187 31. Jie H, Li S, Gang S (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
32. Zheng Q, Zhaoning Z, Xiaotao C, Changjian W, Yuxing P (2018) Fd-mobilenet: Improved mobilenet with a fast downsampling strategy. In: 2018 25th IEEE international conference on image processing (ICIP). IEEE, pp 1363–1367 33. Chi ND, Kha GQ, Ngan L, Nghia N, Khoa L (2018) Mobiface: A lightweight deep learning face recognition on mobile devices. arXiv preprint arXiv:1811.11080 34. Artur C-P, David J-C, Esteban V-F, Jose LA-C, Roberto JL (2019) Generalized presentation attack detection: a face anti-spoofing evaluation proposal. In: International conference on biometrics 35. Yandong W, Kaipeng Z, Zhifeng L, Yu Q (2016) A discriminative feature learning approach for deep face recognition. In: European conference on computer vision. Springer, pp 499–515 36. Fei P, Le Q, Min L (2018) Face presentation attack detection using guided scale texture. Multimedia Tools Appl. 1–27 37. Fei P, Le Q, Min L (2018) Ccolbp: Chromatic co-occurrence of local binary pattern for face presentation attack detection. In: 2018 27th international conference on computer communication and networks (ICCCN). IEEE, pp 1–9 38. Karen S, Andrew Z (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 39. Xiang W, He R, Sun Z, Tan T (2018) A light cnn for deep face representation with noisy labels. IEEE Trans. Inf. Foren. Security 13(11):2884–2896 40. Ajian L, Zichang T, Xuan L, Jun W, Sergio E, Guodong G, Stan ZL (2020) Casia-surf cefa: A benchmark for multi-modal cross-ethnicity face anti-spoofing 41. Jue W, Anoop C, Fatih P (2017) Ordered pooling of optical flow sequences for action recognition. In: WACV. IEEE, pp 168–176 42. Fernando B, Gavves E, Oramas J, Ghodrati A, Tuytelaars T (2017) Rank pooling for action recognition. TPAMI 39(4):773–787 43. Zinelabidine B, Jukka K, Abdenour H (2016) Face spoofing detection using colour texture analysis. TIFS 44. Jiangwei L, Yunhong W, Tieniu T, Anil KJ (2004) Live face detection based on the analysis of fourier spectra. BTHI 45. Feng L, Po L-M, Li Y, Xuyuan X, Yuan F (2016) Terence Chun-Ho Cheung, and Kwok-Wai Cheung. A neural network approach. In: JVCIR, Integration of image quality and motion cues for face anti-spoofing 46. Jianwei Y, Zhen L, Stan ZL (2014) Learn convolutional neural network for face anti-spoofing. arXiv 47. Liu Y, Jourabloo A (2018) and Xiaoming Liu. Binary or auxiliary supervision. In: CVPR, Learning deep models for face anti-spoofing 48. Zezheng W, Zitong Y, Chenxu Z, Xiangyu Z, Yunxiao Q, Qiusheng Z, Feng Z, Zhen L (2020) Deep spatial gradient and temporal depth learning for face anti-spoofing. CVPR 49. Zitong Y, Chenxu Z, Zezheng W, Yunxiao Q, Zhuo S, Xiaobai L, Feng Z, Guoying Z (2020) Searching central difference convolutional networks for face anti-spoofing. CVPR 50. Kensho H, Hirokatsu K, Yutaka S (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? 51. Shifeng Z, Xiaobo W, Ajian L, Chenxu Z, Jun W, Sergio E, Hailin S, Zezheng W, Stan ZL (2019) A dataset and benchmark for large-scale multi-modal face anti-spoofing. In CVPR 52. Aleksandr P, Oleg G (2019) Recognizing multi-modal face spoofing with face recognition networks. In PRCVW 53. Yunxiao Q, Chenxu Z, Xiangyu Z, Zezheng W, Zitong Y, Tianyu F, Feng Z, Jingping S, Zhen L (2020) Learning meta model for zero- and few-shot face anti-spoofing. In: Association for advancement of artificial intelligence (AAAI)
54. Anjith G, Sébastien M (2019) Deep pixel-wise binary supervision for face presentation attack detection. CoRR, abs/1907.04047 55. Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: ICML, PMLR 56. Grill J-B, Strub F, Altché F, Tallec C, Richemond P, Buchatskaya E, Doersch C, Bernardo AP, Zhaohan G, Mohammad GA, Bilal P, Koray K, Remi M, Michal V (2020) Bootstrap your own latent–a new approach to self-supervised learning. In: NeurIPS 57. Xinlei C, Kaiming H (2020) Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566 58. Kaiming H, Haoqi F, Yuxin W, Saining X, Ross G (2020) Momentum contrast for unsupervised visual representation learning. In: CVPR 59. Junsuk C, Hyunjung S (2019) Attention-based dropout layer for weakly supervised object localization. In: CVPR 60. Tong X, Hongsheng L, Wanli O, Xiaogang W (2016) Learning deep feature representations with domain guided dropout for person re-identification. In: CVPR 61. Jian L, Yabiao W, Changan W, Ying T, Jianjun Q, Jian Y, Chengjie W, Jilin L, Feiyue H (2019) Dsfd: Dual shot face detector. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5055–5064 62. Mingxing T, Quoc L (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. In: Kamalika C, Ruslan S (eds) Proceedings of the 36th international conference on machine learning, vol 97 of Proceedings of machine learning research. PMLR, pp 6105–6114 63. Xinyao W, Taiping Y, Shouhong D, Lizhuang M (2020) Face manipulation detection via auxiliary supervision. In: International conference on neural information processing. Springer, pp 313–324 64. Shen C, Taiping Y, Yang C, Shouhong D, Jilin L, Rongrong J (2021) Local relation learning for face forgery detection 65. Wenxuan W, Bangjie Y, Taiping Y, Li Z, Yanwei F, Shouhong D, Jilin L, Feiyue H, Xiangyang X (2021) Delving into data: Effectively substitute training for black-box attack. In: CVPR, pp 4761–4770 66. Yin B, Wang W, Yao T, Guo J, Kong Z, Ding S, Li J, Cong L (2021) A new imperceptible and transferable attack on face recognition, Adv-makeup 67. Rafael M, Simon K, Geoffrey EH (2019) When does label smoothing help? In: Advances in neural information processing systems, vol 32 68. Xinlei C, Kaiming H (2021) Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 15750– 15758 69. Ting C, Simon K, Mohammad N, Geoffrey H (2020) A simple framework for contrastive learning of visual representations. In: Hal D, Aarti S (eds) Proceedings of the 37th international conference on machine learning, vol 119 of Proceedings of machine learning research. PMLR, pp 1597–1607 70. Jiankang D, Jia G, Yuxiang Z, Jinke Y, Irene K, Stefanos Z (2019) Retinaface: Single-stage dense face localisation in the wild. arXiv preprint arXiv:1905.00641 71. Ajian L, Chenxu Z, Zitong Y, Jun W, Anyang S, Xing L, Zichang T, Sergio E, Junliang X, Yanyan L, et al (2021) Contrastive context-aware learning for 3d high-fidelity mask face presentation attack detection. arXiv preprint arXiv:2104.06148 72. Raia H, Sumit C, Yann L (2006) Dimensionality reduction by learning an invariant mapping. In: Proceedings of the 2006 IEEE computer society conference on computer vision and pattern recognition, vol 2, pp 1735–1742 73. Rui S, Xiangyuan L, Pong CY (2020) Regularized fine-grained meta face anti-spoofing. 
In: Thirty-fourth AAAI conference on artificial intelligence (AAAI)
4
Performance Evaluation
4.1
Introduction
This chapter presents an overview of the results obtained by the solutions that qualified for the final stages of the three competitions we organized. For each competition, all results were verified, meaning that submitted solutions were re-run by the organizing team. The obtained results were then used for the final ranking and analysis. The results of the different competitions are described in separate sections.
4.2
Multi-modal Face PAD Challenge
For the multi-modal face PAD challenge, thirteen teams qualified for the final phase and were considered for the official leaderboard. In this section we present the results obtained by those teams, starting with the performance of the top three teams [1–3]. We then analyze the effectiveness of the proposed algorithms and point out some of their limitations. Please note that the evaluation metrics used for the challenge were introduced in Sect. 2.1.1.4.
4.2.1
Experiments
In this section, we present the performance obtained by the top three teams, covering their implementation details, pre-processing strategies, and results on the CASIA-SURF dataset.
4.2.1.1 1st Place (Team Name: VisionLabs)
The architecture of VisionLabs has been presented in Sect. 3.2.2, where three branches and a fusion strategy for the RGB, depth, and IR images are applied. Readers can refer to Sect. 3.2.2 for a more detailed description of the method; here, we only provide the experimental results of VisionLabs.
Implementation details. All the code was implemented in PyTorch [4] and models were trained on 4 NVIDIA 1080Ti cards. A single model trains in about 3 h, and inference takes 8 s per 1000 images. All models were trained with ADAM, a cosine learning rate schedule, and a standard two-class cross-entropy loss. Each model was trained for 30 epochs with an initial learning rate of 0.1 and a batch size of 128.
Preprocessing. CASIA-SURF already provides face crops, so no detection algorithm was used to align the images. Face crops were resized to 125 × 125 pixels and then a 112 × 112 center crop was taken. At the training stage, horizontal flipping was applied with probability 0.5. [1] also tested different crop and rotation strategies as well as test-time augmentation; however, this did not result in significant improvements, and no augmentation other than the above was used in the final model.
Baseline. Unless mentioned explicitly, results on the ChaLearn LAP challenge validation set are reported as obtained from the CodaLab evaluation platform. First of all, VisionLabs reproduced the baseline method [5] with a ResNet-18 backbone and trained it using a 5-fold cross-validation strategy. The folds are split based on subject identity, so images of the same person belong to only one fold. The scores of the five trained networks are then averaged and TPR@FPR = 10−4 is reported in Table 4.1. The resulting performance is close to perfect and similar to the results previously reported in [5], which were calculated on the test set; the test set differs from the validation set but follows the same spoofing attack distribution. Next, VisionLabs [1] expanded the backbone architecture to ResNet34, which improves the score by a large margin. Due to GPU limitations, VisionLabs focused on ResNet34 and added ResNet50 only at the final stage.
Attack-specific folds. Here VisionLabs [1] compared the 5-fold split based on subject IDs with a split based on spoof attack types. Real examples were assigned randomly, by subject identity, to one of three folds. Although the new model averages three network outputs, each trained on less data than under the subject 5-fold strategy, it achieves better performance than the baseline method (see Table 4.1). VisionLabs [1] attributed this to improved generalization to new attacks, since each network is trained on different attack types.
Initialization matters. In the next experiment, VisionLabs [1] initialized each of the three modality branches of the network with the res1, res2, and res3 blocks of an ImageNet pre-trained network. The fusion SE parts were left unchanged, and the final res4 block was also initialized with ImageNet pre-trained weights. Fine-tuning this model on the CASIA-SURF dataset gives a significant improvement over networks with random initialization (see Table 4.1).
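Subject-disjoint folds of the kind used for the baseline cross-validation above can be built, for example, with scikit-learn's GroupKFold, using the subject identity as the grouping key; the arrays below are random placeholders rather than CASIA-SURF metadata.

import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder metadata: one entry per image (not real CASIA-SURF annotations).
labels = np.random.randint(0, 2, size=1000)          # 0 = attack, 1 = live
subject_ids = np.random.randint(0, 200, size=1000)   # grouping key
images = np.arange(1000)                             # stand-in for image indices

# Every subject's images land in exactly one fold, so no identity
# leaks between the training and validation splits.
for fold, (train_idx, val_idx) in enumerate(
        GroupKFold(n_splits=5).split(images, labels, groups=subject_ids)):
    assert set(subject_ids[train_idx]).isdisjoint(subject_ids[val_idx])
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val images")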
Table 4.1 Results on CASIA-SURF validation subset

Method                | Initialization | Fold           | TPR@FPR = 10−4
[5]                   | –              | –              | 56.80
Resnet18              | –              | Subject 5-fold | 60.54
Resnet34              | –              | Subject 5-fold | 74.55
Resnet34              | –              | Attack 3-fold  | 78.89
Resnet34              | ImageNet       | Attack 3-fold  | 92.12
Resnet34              | CASIA-Webface  | Attack 3-fold  | 99.80
A. resnet34 with MLFA | CASIA-Webface  | Attack 3-fold  | 99.87
B. resnet50 with MLFA | MSCeleb-1M     | Attack 3-fold  | 99.63
C. resnet50 with MLFA | ASIAN dataset  | Attack 3-fold  | 99.33
D. resnet34 with MLFA | AFAD-lite      | Attack 3-fold  | 98.70
A,B,C,D ensemble      | –              | Attack 3-fold  | 100.00
Moreover, switching the pre-training to a face recognition task on the CASIA-WebFace dataset [6] improves results by an even larger margin and reaches an almost perfect TPR of 99.80%.
Multi-level feature aggregation. Here, VisionLabs [1] examine the effect of the multi-level feature aggregation (MLFA) described in the model architecture section; the results are shown in Table 4.1. The aggregation modules are initialized with random weights and the new architecture is trained following the best learning protocol. The ResNet34 network with MLFA blocks reduces the error by a factor of 1.5× compared to the network without MLFA blocks.
Ensembling. To improve the stability of the solution, VisionLabs uses four face-related datasets as initializations for the final model. They used publicly available networks with weights trained for face recognition on CASIA-WebFace [6], MSCeleb-1M [7], and a private Asian face dataset [8], and also trained a network for gender classification on the AFAD-lite [9] dataset. Different tasks, losses, and datasets imply different convolutional features, and the average prediction of models fine-tuned from these initializations reaches 100.00% TPR@FPR = 10−4. Such a high score meets the requirements of real-world security applications; however, it was achieved using a large ensemble of networks. In future work, VisionLabs plans to focus on reducing the size of the model and making it applicable to real-time execution.
Solution stability. The consistency and stability of model performance on unseen data are important, especially for real-world security applications.
Table 4.2 Shrinkage of the TPR@FPR = 10−4 score on the validation and test sets of the ChaLearn LAP face anti-spoofing challenge

Team   | Valid   | Test
Ours   | 100.00  | 99.8739
Team 2 | 100.00  | 99.8282
Team 3 | 100.00  | 99.8052
Team 4 | 100.00  | 98.1441
Team 5 | 99.9665 | 93.1550
Team 6 | 100.00  | 87.2094
Team 7 | 99.9665 | 25.0601
Fig. 4.1 Examples of fake and real samples with highest standard deviation among predicted liveness scores from models A,B,C,D. This figure is from [1]
During the validation phase of the challenge, seven teams achieved perfect or near-perfect accuracy; however, only three solutions managed to hold a similar level of performance on the test set (see Table 4.2), and VisionLabs showed the smallest drop compared to the validation results. They believe that the stability of the VisionLabs solution comes from the diversity of the networks in the final ensemble in terms of architectures, pre-training tasks, and random seeds.
Qualitative results. In this section, VisionLabs analyzes examples that are difficult for their proposed method. They ran four networks (namely A, B, C, and D in Table 4.1) on the ChaLearn LAP challenge validation set and selected the samples with the highest standard deviation (STD) of the liveness score; a high STD implies conflicting predictions from the different models. Figure 4.1 shows the examples on which the networks disagree the most. As can be seen, model D (which achieves the lowest TPR among the four models) tends to understate the liveness score, assigning real samples to the fake class, but it is helpful for hard fake examples on which two of the three other networks are wrong. Therefore, using only three models in the final ensemble would have led to a lower score on the validation set. Figure 4.2 shows fake and real samples that were close to the threshold at FPR = 10−4. While they are distinguishable by the human eye, for every such example one of the three modalities looks similar to a normal sample from the opposite class, so models based on a single modality may produce wrong predictions. Processing the RGB, depth, and IR channels together helps overcome this issue.
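The disagreement analysis above can be reproduced with a few lines of NumPy: stack the per-model liveness scores and rank samples by their standard deviation. The score matrix here is random placeholder data rather than the actual model outputs.

import numpy as np

# Placeholder liveness scores: rows = models A, B, C, D; columns = validation samples.
scores = np.random.rand(4, 5000)

std_per_sample = scores.std(axis=0)               # high STD = the models disagree
hardest = np.argsort(std_per_sample)[::-1][:10]   # indices of the most conflicting samples
print(hardest, std_per_sample[hardest])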
Fig. 4.2 Examples of fake and real samples from the validation subset where the predicted liveness score is close to the threshold at FPR = 10−4. This figure is from [1]

Table 4.3 The effect of modalities measured on the validation set. All models were pre-trained on the CASIA-WebFace face recognition task and fine-tuned with the same learning protocol

Modality         | TPR@FPR = 10−2 | TPR@FPR = 10−3 | TPR@FPR = 10−4
RGB              | 71.74          | 22.34          | 7.85
IR               | 91.82          | 72.25          | 57.41
Depth            | 100.00         | 99.77          | 98.40
RGB + IR + Depth | 100.00         | 100.00         | 99.87
Multi-modality. Finally, VisionLabs examines the advantage of multi-modal networks over networks trained on each of the three modalities separately. They take the proposed architecture with three branches and aggregation blocks, but instead of passing (RGB, IR, Depth) inputs, they train three models with (RGB, RGB, RGB), (IR, IR, IR), and (Depth, Depth, Depth) inputs. This allows a fair comparison with the multi-modal network, since all these architectures are identical and have the same number of parameters. As can be seen from Table 4.3, using only RGB images results in low performance: the corresponding model overfits the training set and achieves only 7.85% TPR at FPR = 10−4. The IR-based model shows remarkably better results, reaching 57.41% TPR at FPR = 10−4, since IR images contain fewer identity details and the dataset size is not as crucial as for the RGB model. The highest score of 98.40% TPR at FPR = 10−4 is achieved by the depth modality, suggesting the importance of facial shape information for the anti-spoofing task. However, the multi-modal network performs much better than the depth network alone, reducing the false rejection error from 1.6% to 0.13% and showing evidence of a synergetic effect of modality fusion.
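The single-modality ablation can be mimicked by replicating one modality across the three input branches so that model capacity stays identical; the toy network below is only a placeholder for the actual VisionLabs architecture (its branches and fusion are far simpler).

import torch
import torch.nn as nn

class ThreeBranchNet(nn.Module):
    """Toy stand-in for a three-branch model: one small conv branch per stream,
    fused by concatenation before a binary classifier."""
    def __init__(self):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten())
            for _ in range(3)])
        self.head = nn.Linear(16 * 3, 2)

    def forward(self, streams):          # streams: list of three (B, 3, H, W) tensors
        feats = [branch(x) for branch, x in zip(self.branches, streams)]
        return self.head(torch.cat(feats, dim=1))

model = ThreeBranchNet()
rgb, ir, depth = (torch.randn(2, 3, 112, 112) for _ in range(3))
multi_modal_logits = model([rgb, ir, depth])   # full multi-modal input
rgb_only_logits = model([rgb, rgb, rgb])       # RGB replicated for the ablation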
4.2.1.2 2nd Place (Team Name: ReadSense)
The overall architecture of ReadSense has been presented in Sect. 3.2.3, where three branches (one per modality) and a fusion strategy (namely, random modality feature learning) are applied. Readers can refer to Sect. 3.2.3 for detailed information; here, we only provide the experimental results of ReadSense.
Table 4.4 The comparisons on different patch sizes and modalities. All models are trained on the CASIA-SURF training set and tested on the validation set

Patch size | Modal  | ACER | TPR@FPR = 10−4
16 × 16    | RGB    | 4.5  | 94.9
16 × 16    | Depth  | 2.0  | 98.0
16 × 16    | IR     | 1.9  | 96.2
16 × 16    | Fusion | 1.5  | 98.4
32 × 32    | RGB    | 4.2  | 95.8
32 × 32    | Depth  | 0.8  | 99.3
32 × 32    | IR     | 1.5  | 98.1
32 × 32    | Fusion | 0.0  | 100.0
48 × 48    | RGB    | 3.1  | 96.1
48 × 48    | Depth  | 0.2  | 99.8
48 × 48    | IR     | 1.2  | 98.6
48 × 48    | Fusion | 0.1  | 99.9
96 × 96    | RGB    | 13.8 | 81.2
96 × 96    | Depth  | 5.2  | 92.8
96 × 96    | IR     | 13.4 | 81.4
96 × 96    | Fusion | 1.7  | 97.9
Full face  | RGB    | 15.9 | 78.6
Full face  | Depth  | 8.8  | 88.6
Full face  | IR     | 11.3 | 84.3
Full face  | Fusion | 4.8  | 93.7
Implementation details. The full face images are resized to 112 × 112. ReadSense [2] uses random flipping, rotation, resizing, and cropping for data augmentation, and patches are randomly extracted from the 112 × 112 full face images. All models are trained on one Titan X (Pascal) GPU with a batch size of 512, using the Stochastic Gradient Descent (SGD) optimizer with a cyclic cosine annealing learning rate schedule [10]. The whole training procedure has 250 epochs and takes approximately 3 h. Weight decay and momentum are set to 0.0005 and 0.9, respectively. PyTorch is used to train the network.
The effect of patch sizes and modality. In this setting, ReadSense uses different patch sizes with the same architecture as in Fig. 3.3, i.e., 16 × 16, 32 × 32, 48 × 48 and 64 × 64. For a fair comparison, all models are inferred 36 times, using 9 non-overlapping image patches and 4 flipped inputs. As shown in Table 4.4, for single-modal input the depth data achieve the best performance among the three modalities, with 0.8% ACER and TPR = 99.3% at FPR = 10−4. Fusing all three modalities gives strong performance across all patch sizes, so it can be concluded that the proposed method with modality fusion achieves the best results.
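The patch-based evaluation scheme can be sketched roughly as below (9 non-overlapping patches per face, here augmented only with horizontal flips rather than the 4 flips used by the team); the patch size, scoring model, and function names are assumptions for illustration.

import torch

def nine_patches(face, patch=32):
    """Split a (C, H, W) face crop into a 3x3 grid of non-overlapping patches."""
    _, h, w = face.shape
    ys = [0, (h - patch) // 2, h - patch]
    xs = [0, (w - patch) // 2, w - patch]
    return torch.stack([face[:, y:y + patch, x:x + patch] for y in ys for x in xs])

def patch_score(model, face):
    """Average the liveness probability over the 9 patches and their horizontal flips."""
    patches = nine_patches(face)
    views = torch.cat([patches, torch.flip(patches, dims=[-1])])
    with torch.no_grad():
        probs = torch.softmax(model(views), dim=1)[:, 1]
    return probs.mean().item()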
Table 4.5 The comparison on different training strategies. All models are trained with 32 × 32 image patches

Modal                 | ACER | TPR@FPR = 10−4
Fusion (w.o. CLR-MFE) | 1.60 | 98.0
Fusion (w.o. MFE)     | 0.60 | 98.5
Fusion (w.o. CLR)     | 0.60 | 99.2
Fusion                | 0.00 | 100.0
Fusion (Erase RGB)    | 0.51 | 99.3
Fusion (Erase Depth)  | 0.49 | 99.4
Fusion (Erase IR)     | 0.84 | 99.3
Fusion                | 0.00 | 100.0
The effect of modal feature erasing and training strategy. ReadSense investigates how random modal feature erasing (MFE) and the training strategy affect model performance for face anti-spoofing. "w.o. CLR" denotes conventional SGD training with a standard decaying learning rate schedule until convergence instead of the cyclic learning rate; "w.o. MFE" denotes that random modal feature erasing is not applied. As shown in Table 4.5, both the cyclic learning rate and the random modal feature erasing strategy are critical for achieving high performance. After training the fusion model, ReadSense also erases the features of one modality at a time and evaluates the performance of the trained fusion model with single-modal feature erasing. From the validation scores in Table 4.5, one can conclude that the complementarity among the different modalities is learned and yields better results.
Comparison with other teams in the ChaLearn Face Anti-spoofing challenge. The final submission in this challenge is an ensemble that combines the outputs of three models with different patch sizes (32 × 32, 48 × 48 and 64 × 64), and it ranked second in the end. ReadSense is the only team that did not use the full face image as model input. Compared with the other top-ranked teams, the result of FN = 1 shows that the patch-based learning method can effectively prevent the model from misclassifying a real face as an attack. As shown in Table 4.6, the results of the top three teams are significantly better than those of the other teams on the test set; in particular, the TPR@FPR = 10−4 values of ReadSense and VisionLabs are relatively close. Whereas VisionLabs used plentiful data from other tasks to pre-train the model, ReadSense only used a one-stage, end-to-end training schedule, which also confirms the strength of their solution.
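Random modal feature erasing can be emulated by zeroing the feature block of one randomly chosen modality before fusion during training; the fusion-by-concatenation layout and feature sizes below are assumptions for illustration rather than the team's exact design.

import random
import torch

def fuse_with_random_erasing(feats, p=0.5, training=True):
    """feats: dict of per-modality feature tensors, e.g. {'rgb': t1, 'depth': t2, 'ir': t3}.
    With probability p, zero out the features of one randomly chosen modality before
    concatenation, so the fused classifier cannot over-rely on a single modality."""
    if training and random.random() < p:
        victim = random.choice(list(feats))
        feats = dict(feats)
        feats[victim] = torch.zeros_like(feats[victim])
    return torch.cat([feats[k] for k in sorted(feats)], dim=1)

features = {m: torch.randn(4, 128) for m in ("rgb", "depth", "ir")}
fused = fuse_with_random_erasing(features)   # shape (4, 384)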
4.2.1.3 3rd Place (Team Name: Feather)
Data Augmentation. There are some differences in the images acquired by different devices, even when the same device model is used, as shown in Fig. 4.3.
Table 4.6 Test set results and rankings of the final stage teams in the ChaLearn Face Anti-spoofing attack detection challenge; the best indicators are in bold

Team name     | FP  | FN  | APCER(%) | BPCER(%) | ACER(%) | TPR(%)@FPR = 10−2 | TPR(%)@FPR = 10−3 | TPR(%)@FPR = 10−4
VisionLabs    | 3   | 27  | 0.0074   | 0.1546   | 0.0810  | 99.9885  | 99.9541 | 99.8739
ReadSense     | 77  | 1   | 0.1912   | 0.0057   | 0.0985  | 100.0000 | 99.9472 | 99.8052
Feather       | 48  | 53  | 0.1192   | 0.1392   | 0.1292  | 99.9541  | 99.8396 | 98.1441
Hahahaha      | 55  | 214 | 0.1366   | 1.2257   | 0.6812  | 99.6849  | 98.5909 | 93.1550
MAC-adv-group | 825 | 30  | 2.0495   | 0.1718   | 1.1107  | 99.5131  | 97.2505 | 89.5579
The first row of Fig. 4.3 shows depth images from the CASIA-SURF dataset: the depth variation within the face region is small, and it is difficult for the human eye to tell whether the face has contour depth. The second row shows depth images from the MMFD dataset,1 whose facial contours are clearly visible. In order to reduce the data differences caused by the devices, the depth of the real face images in MMFD is rescaled, as shown in the third row of Fig. 4.3. The data augmentation procedure is presented in Algorithm 1.

Algorithm 1 Data augmentation algorithm
scaler ← a random value in the range [1/8, 1/5]
offset ← a random value in the range [100, 200]
OutImg ← 0
1: for y = 0 to Height − 1 do
2:   for x = 0 to Width − 1 do
3:     if InImg(y, x) > 20 then
4:       off ← offset
5:     else
6:       off ← 0
7:     OutImg(y, x) ← InImg(y, x) · scaler + off
8: return OutImg
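A runnable NumPy equivalent of Algorithm 1 might look as follows, vectorized over the whole depth image; the background threshold of 20 and the value ranges are taken from the algorithm above, while the function name is only illustrative.

import numpy as np

def augment_depth(in_img):
    """Vectorized version of Algorithm 1: rescale depth values and
    shift foreground pixels (depth > 20) by a random offset."""
    scaler = np.random.uniform(1 / 8, 1 / 5)
    offset = np.random.uniform(100, 200)
    out_img = in_img.astype(np.float32) * scaler
    out_img[in_img > 20] += offset
    return out_img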
Training Strategy. PyTorch [4] is used to implement the proposed networks. All convolution and fully connected layers are initialized with a normal weight distribution [11]. Stochastic Gradient Descent (SGD) is adopted as the optimizer, with the learning rate starting at 0.001, decaying by a factor of 0.1 every 60 epochs, and momentum set to 0.9. The Focal Loss [12] is employed with α = 1 and γ = 3.
How useful is the MMFD dataset? A comparative experiment is executed to show the validity and generalization ability of the collected data (see Table 4.7).
1 This dataset was collected by the Feather team; it consists of 15 subjects with 15,415 real samples and 28,438 fake samples.
Fig. 4.3 Depth image augmentation. (Row 1): CASIA-SURF real depth images; (row 2): MMFD real depth images; (row 3): the augmentation method applied to MMFD. This figure is from [3]

Table 4.7 Performance of FeatherNetB trained on different datasets. The third column gives the ACER on the CASIA-SURF [5] validation set. It shows that the generalization ability of MMFD is stronger than that of the CASIA-SURF baseline data, and the performance is better than the baseline method using multi-modal fusion

Network     | Training dataset        | ACER in val
Baseline    | CASIA-SURF              | 0.0213
FeatherNetB | CASIA-SURF depth        | 0.00971
FeatherNetB | MMFD depth              | 0.00677
FeatherNetB | CASIA-SURF + MMFD depth | 0.00168
As shown in Table 4.7, the ACER of FeatherNetB trained with MMFD depth data is better than that trained with CASIA-SURF [5], even though only 15 subjects were collected. Meanwhile, the experiment shows that the best option is to train the network with both datasets. The results obtained with FeatherNetB are much better than the baseline that uses multi-modal data fusion, indicating that the proposed network adapts better than the baseline's three-stream ResNet18.
Table 4.8 Performance on the validation dataset. The baseline fuses the three modalities (IR, RGB, Depth) through a three-stream network; only depth data were used for training the other networks. FeatherNetA and FeatherNetB achieve higher performance with fewer parameters. Finally, the models are ensembled to reduce the ACER to 0.0

Model                       | ACER    | TPR@FPR = 10−2 | TPR@FPR = 10−3 | Params | FLOPS
ResNet18 [5]                | 0.05    | 0.883          | 0.272          | 11.18M | 1800M
Baseline [5]                | 0.0213  | 0.9796         | 0.9469         | –      | –
FishNet150 (our impl.)      | 0.00144 | 0.9996         | 0.998330       | 24.96M | 6452.72M
MobilenetV2(1) (our impl.)  | 0.00228 | 0.9996         | 0.9993         | 2.23M  | 306.17M
ShuffleNetV2(1) (our impl.) | 0.00451 | 1.0            | 0.98825        | 1.26M  | 148.05M
FeatherNetA                 | 0.00261 | 1.0            | 0.961590       | 0.35M  | 79.99M
FeatherNetB                 | 0.00168 | 1.0            | 0.997662       | 0.35M  | 83.05M
Table 4.9 Ablation experiments with different operations in CNNs

Model  | FC | GAP | AP-down | ACER
Model1 | ×  | ×   | ×       | 0.00261
Model2 | ×  | ×   | ✓       | 0.00168
Model3 | ✓  | ×   | ×       | 0.00325
Model4 | ✓  | ✓   | ×       | 0.00525
Comparison with other networks. As shown in Table 4.8, experiments are executed to compare FeatherNet with the performance of other networks. All models are trained on CASIA-SURF and MMFD depth images, and the performance is then verified on the CASIA-SURF validation set. It can be seen from Table 4.8 that the parameter size of FeatherNet is much smaller, only 0.35M, while its performance on the validation set is the best.
Ablation experiments. A number of ablations are executed to analyze models with different layer combinations, as shown in Table 4.9. The models are trained with the CASIA-SURF training set and the MMFD dataset.
Why AP-down in BlockB? Comparing Model1 and Model2, adding the average pooling branch to the secondary branch (called AP-down), as shown in block B of Fig. 3.6b, effectively improves performance with only a small number of additional parameters.
Why not use an FC layer? Comparing Model1 and Model3, adding a fully connected (FC) layer to the last layer of the network does not reduce the error, while an FC layer is computationally expensive.
Why not use a GAP layer? Comparing Model3 and Model4 shows that adding a global average pooling layer at the end of the network is not suitable for the face anti-spoofing task and reduces performance.
Algorithm 2 Ensemble Algorithm
1: scores[] ← score_FishNet150_1, score_FishNet150_2, score_MobilenetV2, score_FeatherNetA, score_FeatherNetB, score_ResNet_GC
2: mean_score ← mean of scores[]
3: if mean_score > max_threshold or mean_score < min_threshold then
4:   final_score ← mean_score
5: else if score_FishNet150_1 < fish_threshold then
6:   final_score ← score_FishNet150_1
7: else if score_FeatherNetBForIR < IR_threshold then
8:   final_score ← score_FeatherNetBForIR
9: else
10:   mean_score ← (6 · mean_score + score_FishNet150_1) / 7
11:   if mean_score > 0.5 then
12:     final_score ← max of scores[]
13:   else
14:     final_score ← min of scores[]
Competition details. The FEATHER fusion procedure is applied in this competition. Meanwhile, the proposed FeatherNets trained with depth data alone already provide a strong baseline (around 0.003 ACER). During the fusion procedure, the selected models have different statistical characteristics and can complement each other; for example, one model's low false negative (FN) rate is exploited to further eliminate fake samples. The detailed procedure is as follows.
Training. Seven models are trained: FishNet150_1, FishNet150_2, MobilenetV2, FeatherNetA, FeatherNetB, FeatherNetBForIR, and ResNet_GC. FishNet150_1 and FishNet150_2 are FishNet models taken from different epochs; the depth data are used for training, except for FeatherNetBForIR, which is a FeatherNetB trained on the IR data.
Inference. The inference scores go through the "ensemble + cascade" process described in Algorithm 2.
Competition result. The above procedure yields 0.0013 ACER, 0.999 TPR@FPR = 10−2, 0.998 TPR@FPR = 10−3, and 0.9814 TPR@FPR = 10−4 on the test set, showing excellent performance in the Face Anti-spoofing challenge@CVPR2019.
Table 4.10 Results and rankings of the final stage teams; the best indicators are in bold. Note that the results on the test set are tested by the model we trained according to the code submitted by the participating teams

Team name        | FP   | FN   | APCER(%) | NPCER(%) | ACER(%) | TPR(%)@FPR = 10−2 | TPR(%)@FPR = 10−3 | TPR(%)@FPR = 10−4
VisionLabs       | 3    | 27   | 0.0074   | 0.1546   | 0.0810  | 99.9885  | 99.9541 | 99.8739
ReadSense        | 77   | 1    | 0.1912   | 0.0057   | 0.0985  | 100.0000 | 99.9427 | 99.8052
Feather          | 48   | 53   | 0.1192   | 0.1392   | 0.1292  | 99.9541  | 99.8396 | 98.1441
Hahahaha         | 55   | 214  | 0.1366   | 1.2257   | 0.6812  | 99.6849  | 98.5909 | 93.1550
MAC-adv-group    | 825  | 30   | 2.0495   | 0.1718   | 1.1107  | 99.5131  | 97.2505 | 89.5579
ZKBH             | 396  | 35   | 0.9838   | 0.2004   | 0.5921  | 99.7995  | 96.8094 | 87.6618
VisionMiracle    | 119  | 83   | 0.2956   | 0.4754   | 0.3855  | 99.9484  | 98.3274 | 87.2094
GradiantResearch | 787  | 250  | 1.9551   | 1.4320   | 1.6873  | 97.0045  | 77.4302 | 63.5493
Baseline         | 1542 | 177  | 3.8308   | 1.0138   | 2.4223  | 96.7464  | 81.8321 | 56.8381
Vipl-bpoic       | 1580 | 985  | 3.9252   | 5.6421   | 4.7836  | 82.9877  | 55.1495 | 39.5520
Massyhnu         | 219  | 621  | 0.5440   | 3.5571   | 2.0505  | 98.0009  | 72.1961 | 29.2990
AI4all           | 273  | 100  | 0.6782   | 0.5728   | 0.6255  | 99.6334  | 79.7571 | 25.0601
Guillaume        | 5252 | 1869 | 13.0477  | 10.7056  | 11.8767 | 15.9530  | 1.5953  | 0.1595
4.2.2
Summary
4.2.2.1 Challenge Result Reports
To evaluate the performance of the solutions, we adopted the following metrics: APCER, NPCER, ACER, and TPR at FPR = 10−2, 10−3, and 10−4, with scores reported to 6 decimal places. The scores and ROC curves of the participating teams on the testing partition are shown in Table 4.10 and Fig. 4.4, respectively. Please note that, although we report performance for a variety of evaluation measures, the leading metric was TPR@FPR = 10−4. It can be observed that the best result (VisionLabs) achieves TPR = 99.9885%, 99.9541%, and 99.8739% at FPR = 10−2, 10−3, and 10−4, respectively, with TP = 17430, FN = 28, FP = 1, and TN = 40251 on the test set. In practice, different application scenarios have different requirements for each indicator: in high-security access control, for example, the FP should be as small as possible, while a small FN value is more important when screening suspects. Overall, the results of the first eight teams are better than the baseline method [5] at FPR = 10−4 on the test set.
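For completeness, these evaluation metrics can be computed as in the sketch below (APCER/NPCER from the attack and live score distributions, ACER as their mean, and TPR read off at a fixed FPR); the function and variable names are placeholders, not code used by the organizers.

import numpy as np

def pad_metrics(scores, labels, threshold, target_fpr=1e-4):
    """scores: higher = more likely live; labels: 1 = live (bona fide), 0 = attack."""
    live, attack = scores[labels == 1], scores[labels == 0]
    apcer = np.mean(attack >= threshold)            # attacks accepted as live
    npcer = np.mean(live < threshold)               # live faces rejected
    acer = (apcer + npcer) / 2
    # TPR at the operating point whose FPR on attack samples equals target_fpr.
    thr_at_fpr = np.quantile(attack, 1 - target_fpr)
    tpr = np.mean(live >= thr_at_fpr)
    return apcer, npcer, acer, tpr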
4.2.2.2 Challenge Result Analysis
As shown in Table 4.10, the results of the top three teams on the test set are clearly superior to those of the other teams, revealing that, under the same conditions, ensemble learning has a clear advantage over single-model solutions in deep learning (see Tables 4.11 and 4.4). At the same time, analyzing the stability of all submitted results through the ROC curves in Fig. 4.4, the top three teams are significantly better than the other teams on the test set (e.g., their TPR@FPR = 10−4 values are relatively close to each other and superior to those of the remaining teams).
Fig. 4.4 ROC curves of final stage teams on test set Table 4.11 Provided by VisionLabs team. The results on the valid and test sets of the VisionLabs team, different NN modules represent different pre-trained Resnet [13] NN1
NN1a
NN2
NN3
NN4
0.9943
–
0.9987
–
0.9870
–
0.9963
–
0.9933
– –
0.9983
–
0.9997
–
1.0000
–
1.0000
0.9988
TPR@FPR = 10e-4 (Test)
0.9963
TPR@FPR = 10e-4 (Val)
The team of ReadSense uses image patches as input to emphasize the importance of local features in the FAD task. The result of FN = 1 shows that local features can effectively prevent the model from misclassifying a real face as an attack, as shown in the blue box of Fig. 4.5.
Fig. 4.5 Mistaken samples of the top three teams on the Testing data set, including FP and FN. Note that the models were trained by us. This figure is also from our workshop paper [14]
Vipl-bpoic introduces an attention mechanism into the FAD task. Different modalities have different advantages: the RGB data have rich details, the depth data are sensitive to the distance between the image plane and the corresponding face, and the IR data measure the amount of heat radiated from a face. Based on these characteristics, Feather uses a cascaded architecture with two subnetworks to study CASIA-SURF with two modalities, in which the depth and IR data are learned successively by each network. Some teams consider face landmarks (e.g., Hahahaha) in the FAD task, and other teams (e.g., MAC-adv-group, Massyhnu) focus on color space conversion.
Instead of a binary classification model, ZKBH constructs a regression model to supervise the learning of effective cues, and GradiantResearch reformulates face PAD as anomaly detection using deep metric learning. Although these methods have their own advantages, some shortcomings emerged during the code reproduction stage of the challenge. As described before, CASIA-SURF is characterized by multi-modal data (i.e., RGB, Depth, and IR), and the main research question is how to fuse the complementary information between these three modalities. However, many teams apply ensemble learning that is in fact a form of naive halfway fusion [5], which cannot make full use of the characteristics of the different modalities. In addition, most of the ensemble methods perform model fusion in a greedy manner, repeatedly adding models as long as performance does not decrease on the validation set (Table 4.11), which inevitably brings additional time consumption and instability. To illustrate the shortcomings of the algorithms visually, we randomly selected 6 misclassified samples for each of the top three teams on the test set, 3 FP and 3 FN each, as shown in Fig. 4.5. Notably, the fake sample in the red box was misclassified as a real face by all three winners, even though the clues are clearly visible in the eye region of the color modality. From the misclassified samples of the VisionLabs team in Fig. 4.5, face pose is the main factor leading to FN samples (marked by a yellow box). As for the FP samples of ReadSense, the main clues are concentrated in the eye region (shown in the purple box); however, since this team feeds image patches to the network, misclassification can easily occur when a patch does not contain the eye region. The Feather team only used the depth and IR modalities, which results in misclassified samples that can be recognized easily by the human eye: as shown in the green box, obvious clues attached to the nose and eye regions in the color modality are discarded by their algorithm. Overall, the top three teams recognize Attacks 2, 4, and 6 (obtained by bending the corresponding flat attacks 1, 3, and 5) better than Attacks 1, 3, and 5 [5]; Fig. 4.5 shows that the bending operation used to simulate the depth of a real face is easily detected by the algorithms. Finally, the FP samples of the three teams are mainly caused by Attack 1, indicating that cutting out some regions of a printed face can bring in the depth information of a real face, but also introduces additional cues that reveal the sample as fake.
4.3
Cross-Ethnicity Face PAD Challenge
In this section, we first report the results of the participating teams from the perspective of both single-modal and multi-modal tracks, and then analyze the performances of the participants’ methods. Finally, the shortcomings and limitations of these algorithms are pointed out.
4.3.1
Experiments
4.3.1.1 Challenge Result Reports
Single-modal (RGB) track. Since the single-modal track only allows the use of RGB data, its purpose is to evaluate the performance of the algorithms in a face anti-spoofing system with a VIS camera as the acquisition device. The final results of the 11 participating teams are shown in Table 4.12, which reports the 3 considered indicators (APCER, BPCER, ACER) on the three sub-protocols (4_1, 4_2, 4_3). The final ranking is based on the average ACER over the three sub-protocols (smaller means better). We also report the thresholds used by each algorithm to decide between real faces and attack samples. The thresholds of the top three teams are either very large (more than 0.9 for BOBO), very small (0.01 for Harvest), or very different across sub-protocols (0.02 vs. 0.9 for VisionLabs). In addition, VisionLabs achieves the best APCER, with a value of 0.11%, meaning that its algorithm classifies attack samples most reliably, while Wgqtmac's algorithm obtains the best BPCER (0.66%), indicating that it classifies real faces best. Overall, the results of the first ten teams are better than the baseline method [15] when ranked by ACER, and the VisionLabs team achieves first place with a clear advantage.
Multi-modal track. The multi-modal track allows the participating teams to use all modalities. Its purpose is to evaluate the performance of the algorithms in anti-spoofing systems equipped with multi-optic cameras, such as the Intel RealSense or Microsoft Kinect sensors. The results of the eight participating teams in the final stage are shown in Table 4.13. The BOBO team's algorithm achieves first place, with APCER = 1.05%, BPCER = 1.00%, and ACER = 1.02%, while the Super team ranks second by a small margin, with ACER = 1.68%. It is worth noting that Newland-tianyan's algorithm achieves the best APCER, with a value of 0.24%. Similar to the single-modal track, most of the participating teams have relatively large thresholds computed on the validation set, especially the Super and Newland-tianyan teams with a value of 1.0 on all three sub-protocols, indicating that these algorithms treat face anti-spoofing as anomaly detection. In addition, the ACER values of the top four teams are 1.02%, 1.68%, 2.21%, and 2.28%, all better than the ACER of the first place in the single-modal track (2.72% for VisionLabs). This shows the value of the multi-modal track for improving accuracy in the face anti-spoofing task.
4.3.1.2 Challenge Result Analysis
In this section, we analyze the advantages and disadvantages of each participating team's algorithm in detail, according to the different tracks.
0.34±0.48
4 _3
Avg±Std
Data augment,
SimpleNet
Newland-tianyan
ZhangTT
Harvest
BOBO
0.10
4 _2
RankPooling,
Thre.
0.99 0.97±0.02
4 _3
Avg±Std
Attention module,
Depth supervision
0.01 0.01±0.00
4 _3
Avg±Std
Sequence,
ResNet
0.9 0.9
4 _3
Avg±Std
Data
Preprocessing
0.55 0.67±0.11
4 _3
Avg±Std
Neighborhood
Mean
0.7
4 _2
Temporal feature,
0.77
4 _1
Data augment,
0.9
4 _2
Time tensor,
0.9
4 _1
Quality tensor,
0.01
4 _2
Relabelling live
0.01
4 _1
Motion cues,
0.99
4 _2
Multi-level cell,
0.95
4 _1
CDC, CDL, EDL,
0.90
0.02
OpticalFlow,
VisionLabs
Prot. 4 _1
Method(keywords)
Team name
282±239
299
513
34
97±37
57
132
103
85±47
109
116
31
129±67
67
120
201
2±2
2
0
4
FP
44±62
6
11
117
75±31
108
45
74
55±10
67
51
48
10±2
12
8
10
21±9
31
12
21
FN
15.66±13.33
16.61
28.5
1.89
5.40±2.10
3.17
7.33
5.72
4.74±2.62
6.06
6.44
1.72
7.18±3.74
3.72
6.67
11.17
0.11±0.11
0.11
0.00
0.22
APCER(%)
11.16±15.67
1.5
2.75
29.25
18.91±7.88
27.0
11.25
18.5
13.83±2.55
16.75
12.75
12.0
2.50±0.50
3.0
2.0
2.5
5.33±2.37
7.75
3.00
5.25
BPCER(%)
13.41±3.77
9.06
15.62
15.57
12.16±2.89
15.08
9.29
12.11
9.28±2.28
11.4
9.6
6.86
4.84±1.79
3.36
4.33
6.83
2.72±1.21
3.93
1.50
2.74
ACER(%)
Table 4.12 The results of single-modal track. Avg±Std indicates the mean and variance operation and best results are shown in bold
5
4
3
2
(continued)
Rank 1
4.3 Cross-Ethnicity Face PAD Challenge 95
Avg±Std
Resnet100
Baseline
Dqiu
Hulking
Wgqtmac
Chunghwa-Telecom
0.86±0.06
Avg±Std
MIMAMO-Net
1.0 1.00±0.00
4 _3
Avg±Std
features,
RankPooling
1.0
4 _2
Dynamic
1.0
4 _1
1.00±0.00
Static and
1.0
4 _3
Avg±Std
Softmax
1.0
4 _2
ResNet50,
1.0
4 _1
0.67 0.76±0.08
4 _3
Avg±Std
Softamx
0.82
4 _2
PipeNet,
0.81
Softmax 4 _1
0.56 0.80±0.22
4 _3
Avg±Std
Warmup strategy,
1.0
4 _2
0.85
ResNet18,
4 _1
0.79
4 _3
Local feature,
0.93
4 _2
feature,
0.87
4 _1
0.40±0.07
Softmax
Subsequence
0.45
4 _3
Avg±Std
Fueature fusion,
0.45
4 _2
0.33
0.22
3D ResNet,
4 _1
0.07±0.11
4 _3
Score fusion,
IecLab
0.01
4 _2
Multi-task,
Thre. 0.02
Dopamine
Prot. 4 _1
Method(keywords)
ID information,
Team name
Table 4.12 (continued) FP
1182±300
836
1379
1331
849±407
664
567
1316
810±199
768
1027
635
928±310
1117
570
1098
444±93
442
352
538
597±103
489
606
696
442±168
636
367
325
FN
30±25
57
27
7
116±48
146
60
142
78±53
59
37
138
2±3
0
7
1
76±34
71
113
44
24±2
26
26
21
10±12
0
24
6
APCER(%)
65.66±16.70
46.44
76.61
73.94
47.16±22.62
36.89
31.5
73.11
45.00±11.07
42.67
57.06
35.28
51.57±17.24
62.06
31.67
61.0
24.66±5.16
24.56
19.56
29.89
33.16±5.76
27.17
33.67
38.67
24.59±9.37
35.33
20.39
18.06
BPCER(%)
7.58±6.29
14.25
6.75
1.75
29.00±12.13
36.5
15.0
35.5
19.50±13.27
14.75
9.25
34.5
0.66±0.94
0.0
1.75
0.25
19.00±8.69
17.75
28.25
11.0
6.08±0.72
6.5
6.5
5.25
2.50±3.12
0.0
6.0
1.5
ACER(%)
36.62±5.76
30.35
41.68
37.85
38.08±15.57
36.69
23.25
54.31
32.25±3.18
28.71
33.15
34.89
26.12±8.15
31.03
16.71
30.62
21.83±1.82
21.15
23.9
20.44
19.62±2.59
16.83
20.08
21.96
13.54±3.95
17.67
13.19
9.78
Rank
*
11
10
9
8
7
6
96 4 Performance Evaluation
ZhangTT
Newland-tianyan
Hulking
Super
Thre.
1.0 1.0±0.00
4 _3
Avg±Std
SE fusion,
Score fusion
1.0 0.98±0.02
4 _3
Avg±Std
Selective modal pipeline,
Limited frame vote
1.00±0.00
Avg±Std
Data augment
0.79 0.87±0.07
4 _3
Avg±Std
Feature fusion,
Score fusion
0.9
4 _2
0.94
ID Net,
4 _1
1.0
4 _3
Neighborhood mean,
1.0
4 _2
Data preprocessing,
1.0
4 _1
Resnet9,
1.0
4 _2
SENet-154,
0.96
4 _1
PipeNet,
1.0
4 _2
Dimension reduction,
1.0
0.95±0.02
4 _1
Avg±Std
Data preprocessing,
Depth supervision
0.95 0.94
4 _2 4 _3
Score fusion,
0.98
Feature fusion,
BOBO
Prot. 4 _1
Method(keywords)
CDC, CDL, EDL,
Team name
56±51
102
66
0
4±4
9
4
0
58±35
46
99
31
11.33±7.76
20
5
9
19±11
26
25
6
FP
17±17
0
34
19
17±12
23
26
3
4±4
9
5
0
11±6
5
17
11
4±2
7
3
2
FN
3.11±2.87
5.67
3.67
0.0
0.24±0.25
0.5
0.22
0.0
3.25±1.98
2.56
5.5
1.72
0.62±0.43
1.11
0.28
0.5
1.05±0.62
1.44
1.39
0.33
APCER(%)
4.41±4.25
0.0
8.5
4.75
4.33±3.12
5.75
6.5
0.75
1.16±1.12
2.25
1.25
0.0
2.75±1.50
1.25
4.25
2.75
1.00±0.66
1.75
0.75
0.5
BPCER(%)
3.76±2.02
2.83
6.08
2.37
2.28±1.66
3.12
3.36
0.37
2.21±1.26
2.4
3.37
0.86
1.68±0.54
1.18
2.26
1.62
1.02±0.59
1.6
1.07
0.42
ACER(%)
Table 4.13 The results of multi-modal track. Avg±Std indicates the mean and variance operation and best results are shown in bold
5
4
3
2
(continued)
Rank 1
4.3 Cross-Ethnicity Face PAD Challenge 97
Baseline
Skjack
Qyxqyx
Thre.
0.95±0.05
0.02 0.39±0.52
4 _3
Avg±Std
PSMM-Net,
Fusion fusion,
0.17
4 _2
A shared branch,
1.0
4 _1
SD-Net,
0.0 0.00±0.00
4 _3
Avg±Std
Softmax
0.01
4 _2
Resnet9,
0.0
Avg±Std
Score fusion 4 _1
0.89
4 _3
Pixel-wise regression,
0.98
0.98
4 _2
Binary supervision,
4 _1
0.92±0.04
Avg±Std
Semi-hard negative mining
0.93 0.96
4 _2 4 _3
Only IR,
0.87
Data augment,
Harvest
Prot. 4 _1
Method(keywords)
Data preprocessing,
Team name
Table 4.13 (continued) FP
872±463
864
1340
413
1012±447
511
1155
1371
92±142
257
19
1
104±84
119
180
13
FN
62±43
55
23
109
47±45
93
46
2
26±23
19
8
53
13±12
8
28
4
APCER(%)
48.46±25.75
48.0
74.44
22.94
56.24±24.85
28.39
64.17
76.17
5.12±7.93
14.28
1.06
0.06
5.77±4.69
6.61
10.0
0.72
BPCER(%)
15.58±10.86
13.75
5.75
27.25
11.75±11.37
23.25
11.5
0.5
6.66±5.86
4.75
2.0
13.25
3.33±3.21
2.0
7.0
1.0
ACER(%)
32.02±7.56
30.87
40.1
25.1
33.99±7.08
25.82
37.83
38.33
5.89±4.04
9.51
1.53
6.65
4.55±3.82
4.31
8.5
0.86
Rank
*
8
7
6
98 4 Performance Evaluation
Fig. 4.6 The ROC of 12 teams in a single-modal track. From left to right are the ROCs on protocol 4_1, 4_2 and 4_3
Single-modal. As shown in Table 2.5, the testing subset simultaneously introduces two unknown target variations, namely different ethnicities and attack types between the training and testing subsets, which poses a huge challenge for the participating teams. Nevertheless, most teams achieved relatively good results in the final stage compared to the baseline; in particular, the top three teams obtained ACER values below 10%. It is worth mentioning that different algorithms have their own unique advantages, even when their final ranking is relatively low. For example, the BPCER of the Wgqtmac team is 0.66%, meaning that about 1 real sample out of 100 real faces is treated as fake, while the APCER of 0.11% for the VisionLabs team indicates that about 1 fake sample out of 1000 attacks is treated as real. To compare the stability of the participating teams' algorithms, similar to [16], we introduce the receiver operating characteristic (ROC) curve in this challenge, which can be used to select a suitable trade-off threshold between the false positive rate (FPR) and the true positive rate (TPR) according to the requirements of a given real application. As shown in Fig. 4.6, the results of the top team (VisionLabs) on all three sub-protocols are clearly superior to those of the other teams, revealing that using an optical flow method to convert the RGB data to another sample space can effectively improve the generalization of the algorithm to different unknown factors. However, the TPR of the remaining teams decreases rapidly as the FPR is reduced (e.g., their TPR@FPR = 10−3 values are almost zero). In addition, although the ACER of the Harvest team is worse than that of the BOBO team, its TPR@FPR = 10−3 is significantly better, mainly because the false positive (FP) and false negative (FN) counts of the Harvest team are relatively close (see Table 4.12). Finally, for the top three teams, we randomly selected some mismatched samples, shown in Fig. 4.7. Most of the FN samples of the VisionLabs team are real faces with large motion amplitude, while most of its FP samples are 3D print attacks, indicating that the team's algorithm correctly classifies almost all 2D attack samples. In addition, due to the challenging nature of our competition dataset, for example the difficulty of distinguishing real faces from attack samples without labels, the BOBO and Harvest teams did not make correct decisions on some difficult samples.
Fig. 4.7 The misclassified samples of the top three teams in the single-modal track. FN and FP indicate false negatives and false positives, respectively
Fig. 4.8 The ROC curves of 9 teams in the multi-modal track. From left to right are the ROCs on protocols 4_1, 4_2, and 4_3
Multi-modal. From Table 4.13, we can see that the ACER values of the top 7 teams are relatively close, and the top 4 teams perform better than VisionLabs (ACER = 2.72%) in the single-modal track. This indicates that the complementary information across modalities can improve the accuracy of face anti-spoofing algorithms. Although newland-tianyan ranked fourth in ACER, it achieved the best result on the APCER indicator (APCER = 0.24%), meaning the smallest number of FP samples among all teams. In addition, from Table 4.13 and Fig. 4.8, although the ACER values of the top two algorithms are relatively close, the stability of the Super team is better than that of BOBO; for example, the TPR@FPR = 10^-3 values of Super and newland-tianyan are better than those of BOBO on all three sub-protocols. Finally, Fig. 4.9 shows that the FP samples of the top three teams contain many 3D print attacks, indicating that their algorithms are vulnerable to 3D face attacks.
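As a concrete illustration of how complementary per-modality predictions can be combined at the score level, the sketch below averages RGB, Depth, and IR liveness scores with optional weights. The modality names, weights, and toy scores are placeholders for illustration; they do not reproduce any team's actual fusion recipe.

```python
# Sketch: score-level fusion of per-modality liveness scores (RGB / Depth / IR).
# Modality names, weights and toy scores are illustrative placeholders.
import numpy as np

def fuse_scores(scores_by_modality, weights=None):
    """Weighted average of per-modality score arrays of shape [N]."""
    names = sorted(scores_by_modality)
    stacked = np.stack([scores_by_modality[m] for m in names], axis=0)  # [M, N]
    if weights is None:
        w = np.full(len(names), 1.0 / len(names))
    else:
        w = np.array([weights[m] for m in names], dtype=float)
        w = w / w.sum()
    return w @ stacked  # [N] fused liveness scores

scores = {
    "rgb":   np.array([0.91, 0.12, 0.55]),
    "depth": np.array([0.88, 0.05, 0.71]),
    "ir":    np.array([0.95, 0.20, 0.40]),
}
print(fuse_scores(scores))                                         # equal weights
print(fuse_scores(scores, {"rgb": 0.5, "depth": 0.3, "ir": 0.2}))  # manual weights
```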
Fig. 4.9 The misclassified samples of the top three teams in the multi-modal track. FN and FP indicate false negatives and false positives, respectively
4.4
3D High-Fidelity Mask Face PAD Challenge
4.4.1
Experiments
4.4.1.1 Challenge Result Reports
We adopted four metrics to evaluate the performance of the solutions: APCER, BPCER, ACER, and AUC. Please note that although we report performance for a variety of evaluation measures, the leading metric was ACER. From Table 4.14, which lists the results and rankings of the top 18 teams, we can draw three conclusions: (1) The ACER performance of the top 3 teams is relatively close, and the top 2 teams have the best results in all metrics. (2) The top 6 teams are from industry, which indicates that mask attack detection is no longer limited to academia but is also an urgent problem in practical applications. (3) The ACER performance of all teams is evenly distributed between 3% and 10%, which not only shows the rationality and selectivity of our challenge but also demonstrates the value of HiFiMask for further research (a minimal sketch of these metrics follows the table).
Table 4.14 Teams and results listed in the final ranking of this challenge

| R. | Team name | FP | FN | APCER (%) | BPCER (%) | ACER (%) | AUC |
| 1 | VisionLabs | 492 | 101 | 3.777 | 2.330 | 3.053 | 0.995 |
| 2 | WeOnlyLookOnce | 242 | 193 | 1.858 | 4.452 | 3.155 | 0.995 |
| 3 | CLFM | 483 | 118 | 3.708 | 2.722 | 3.215 | 0.994 |
| 4 | oldiron666 | 644 | 115 | 4.944 | 2.653 | 3.798 | 0.992 |
| 5 | Reconova-AI-LAB | 277 | 276 | 2.126 | 6.367 | 4.247 | 0.991 |
| 6 | inspire | 760 | 176 | 5.834 | 4.060 | 4.947 | 0.986 |
| 7 | Piercing Eyes | 887 | 143 | 6.809 | 3.299 | 5.054 | 0.983 |
| 8 | msxf_cvas | 752 | 232 | 5.773 | 5.352 | 5.562 | 0.982 |
| 9 | VIC_FACE | 1152 | 104 | 8.843 | 2.399 | 5.621 | 0.965 |
| 10 | DXM-DIAI-CV-TEAM | 1100 | 181 | 8.444 | 4.175 | 6.310 | 0.970 |
| 11 | fscr | 794 | 326 | 6.095 | 7.520 | 6.808 | 0.979 |
| 12 | VIPAI | 1038 | 268 | 7.968 | 6.182 | 7.075 | 0.976 |
| 13 | reconova-ZJU | 1330 | 183 | 10.210 | 4.221 | 7.216 | 0.974 |
| 14 | sama_cmb | 1549 | 188 | 11.891 | 4.337 | 8.114 | 0.969 |
| 15 | Super | 780 | 454 | 5.988 | 10.473 | 8.230 | 0.979 |
| 16 | ReadFace | 1556 | 202 | 11.944 | 4.660 | 8.302 | 0.965 |
| 17 | LsyL6 | 2031 | 138 | 15.591 | 3.183 | 9.387 | 0.951 |
| 18 | HighC | 1656 | 340 | 12.712 | 7.843 | 10.278 | 0.966 |
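For clarity, the sketch below shows how the APCER, BPCER, and ACER columns relate to raw error counts such as the FP and FN columns of Table 4.14. The totals n_attack and n_bonafide are placeholder values, not the actual HiFiMask protocol sizes, so the printed numbers are purely illustrative.

```python
# Sketch: relating raw error counts to the APCER / BPCER / ACER columns.
# n_attack and n_bonafide are placeholder totals, not the real test-set sizes.
def pad_metrics(fp, fn, n_attack, n_bonafide):
    apcer = 100.0 * fp / n_attack     # attack presentations accepted as bona fide (%)
    bpcer = 100.0 * fn / n_bonafide   # bona fide presentations rejected as attacks (%)
    acer = (apcer + bpcer) / 2.0      # the challenge's leading metric
    return apcer, bpcer, acer

apcer, bpcer, acer = pad_metrics(fp=50, fn=20, n_attack=1000, n_bonafide=500)
print(f"APCER={apcer:.2f}%  BPCER={bpcer:.2f}%  ACER={acer:.2f}%")  # 5.00 / 4.00 / 4.50
```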
4.4.1.2 Challenge Result Analysis
Through the introduction and analysis of the team methods in the challenge, we summarize the effective ideas for mask attack detection: (1) At the data level, data augmentation is a strategy adopted by almost all teams; it plays an important role in preventing over-fitting and improving the stability of the algorithms. (2) Segmenting the face region not only enlarges local information to mine the differences between a high-fidelity mask and a live face, but also avoids extracting irrelevant features such as face identity. (3) Multi-branch feature learning is a framework widely used by the participating teams (a minimal sketch follows below): a multi-branch network first mines the differences between mask and live faces from multiple aspects, such as texture, color contrast, and material, and feature fusion is then used to improve the robustness of the algorithm.
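The sketch below illustrates the multi-branch pattern from point (3): several small convolutional branches produce features that are concatenated and classified by a shared head. The branch architecture, feature dimensions, and the assumption that each branch sees a different view of the same face are placeholders, not any team's submitted network.

```python
# Sketch: multi-branch feature learning with late feature fusion for live/spoof
# classification. Branch design and dimensions are illustrative placeholders.
import torch
import torch.nn as nn

class MultiBranchFAS(nn.Module):
    def __init__(self, feat_dim=128, num_branches=3):
        super().__init__()
        # Each branch may focus on a different cue (texture, color contrast, material).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
            for _ in range(num_branches)])
        # Fuse concatenated branch features, then predict live vs. spoof.
        self.head = nn.Sequential(
            nn.Linear(feat_dim * num_branches, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 2))

    def forward(self, views):
        # views: list of [B, 3, H, W] tensors, one per branch (e.g. different crops).
        feats = [branch(v) for branch, v in zip(self.branches, views)]
        return self.head(torch.cat(feats, dim=1))

model = MultiBranchFAS()
views = [torch.randn(2, 3, 112, 112) for _ in range(3)]
print(model(views).shape)  # torch.Size([2, 2]) live/spoof logits
```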
4.4.2
Summary
We organized the 3D High-Fidelity Mask Face Presentation Attack Detection Challenge at ICCV2021 based on the HiFiMask dataset and ran it on the CodaLab platform. 195 teams registered for the competition and 18 teams made it to the final stage. Among the latter, 12 teams were from companies and 6 from academic institutes/universities. We first described the associated dataset, the challenge protocol, and the evaluation metrics. Then, we reviewed the top-ranked solutions and reported the results from the final phase. Finally, we summarized the relevant conclusions and pointed out the effective methods against mask attacks explored by this challenge.
References
1. Parkin A, Grinchuk O (2019) Recognizing multi-modal face spoofing with face recognition networks. In: The IEEE conference on computer vision and pattern recognition (CVPR) workshops
2. Shen T, Huang Y, Tong Z (2019) Facebagnet: bag-of-local-features model for multi-modal face anti-spoofing. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops
3. Zhang P, Zou F, Wu Z, Dai N, Mark S, Fu M, Zhao J, Li K (2019) Feathernets: convolutional neural networks as light as feather for face anti-spoofing. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops
4. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in pytorch
5. Zhang S, Wang X, Liu A, Zhao C, Wan J, Escalera S, Shi H, Wang Z, Li SZ (2019) A dataset and benchmark for large-scale multi-modal face anti-spoofing. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 919–928
6. Yi D, Lei Z, Liao S, Li SZ (2014) Learning face representation from scratch. arXiv:1411.7923
7. Guo Y, Zhang L, Hu Y, He X, Gao J (2016) MS-celeb-1m: challenge of recognizing one million celebrities in the real world. Electron Imaging 2016(11):1–6
8. Zhao J, Cheng Y, Xu Y, Xiong L, Li J, Zhao F, Jayashree K, Pranata S, Shen S, Xing J et al (2018) Towards pose invariant face recognition in the wild. In: CVPR, pp 2207–2216
9. Niu Z, Zhou M, Wang L, Gao X, Hua G (2016) Ordinal regression with multiple output CNN for age estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4920–4928
10. Huang G, Li Y, Pleiss G, Liu Z, Hopcroft JE, Weinberger KQ (2017) Snapshot ensembles: train 1, get m for free. arXiv:1704.00109
11. He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE international conference on computer vision, pp 1026–1034
12. Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp 2980–2988
13. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
14. Liu A, Wan J, Escalera S, Escalante HJ, Tan Z, Yuan Q, Wang K, Lin C, Guo G, Guyon I et al (2019) Multi-modal face anti-spoofing attack detection challenge at cvpr2019. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops
15. Liu A, Li X, Wan J, Liang Y, Escalera S, Escalante HJ, Madadi M, Jin Y, Wu Z, Yu X et al (2021) Cross-ethnicity face anti-spoofing recognition challenge: a review. IET Biom 10(1):24–43
16. Zhang S, Wang X, Liu A, Zhao C, Wan J, Escalera S, Shi H, Wang Z, Li SZ (2019) A dataset and benchmark for large-scale multi-modal face anti-spoofing. In: CVPR
5
Conclusions and Future Work
5.1
Conclusions
Although face anti-spoofing technology has made great progress, existing datasets have been unable to keep up with increasingly advanced attacks in terms of scale, attack type, attack-media quality, and so on. Over the past three years, we have held three consecutive face anti-spoofing challenges, covering dataset collection, benchmark design, and challenge protocol definition, which have attracted the attention of researchers and promoted the development of the face anti-spoofing community. The conclusions are summarized as follows.
5.1.1
CASIA-SURF & Multi-modal Face Anti-spoofing Attack Detection Challenge at CVPR2019
In 2019: (1) We build a large-scale multi-modal face anti-spoofing dataset, namely CASIA-SURF. It is the largest one in terms of the number of subjects, data samples, and visual modalities, consisting of 1,000 subjects and 21,000 video clips with 3 modalities (RGB, Depth, IR). Comprehensive evaluation metrics, diverse evaluation protocols, training/validation/testing subsets, and a measurement tool are also provided to form a new benchmark. (2) We propose a multi-modal multi-scale fusion method, namely MS-SEF, which performs modality-dependent feature re-weighting to select the more informative channel features while suppressing the less informative ones for each modality across different scales (a simplified re-weighting sketch appears at the end of this subsection). (3) We conduct extensive experiments on the proposed CASIA-SURF dataset to verify its significance and generalization capability. (4) We organize the Multi-modal Face Anti-spoofing Attack Detection Challenge at CVPR2019 based on the CASIA-SURF dataset, running on the CodaLab platform. Three hundred teams registered for the competition and thirteen teams made it to the final stage. We review in detail the proposed
solutions, report the results from both the development and final phases, and analyze the results of the challenge, pointing out the critical issues in the FAD task and the shortcomings of the existing algorithms.
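As a rough illustration of the modality-dependent channel re-weighting idea described in point (2), the sketch below applies a squeeze-and-excitation style block to each modality's feature map before fusion. The channel sizes, modality names, and single-scale setup are simplifying assumptions; this is not the published MS-SEF implementation.

```python
# Sketch: squeeze-and-excitation style channel re-weighting applied per modality
# before fusion (a simplified illustration, not the published MS-SEF code).
import torch
import torch.nn as nn

class ModalitySE(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                          # x: [B, C, H, W] for one modality
        w = self.fc(x.mean(dim=(2, 3)))            # squeeze: global average -> [B, C]
        return x * w.unsqueeze(-1).unsqueeze(-1)   # excite: re-weight channels

# One SE block per modality; re-weighted features are concatenated for fusion.
se_blocks = nn.ModuleDict({m: ModalitySE(64) for m in ["rgb", "depth", "ir"]})
feats = {m: torch.randn(2, 64, 28, 28) for m in ["rgb", "depth", "ir"]}
fused = torch.cat([se_blocks[m](feats[m]) for m in feats], dim=1)
print(fused.shape)  # torch.Size([2, 192, 28, 28])
```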
5.1.2
CASIA-SURF CeFA & Cross-Ethnicity Face Anti-spoofing Recognition Challenge at CVPR2020
In 2020: (1) We release the largest face anti-spoofing dataset to date, CASIA-SURF CeFA, which includes 3 ethnicities, 1,607 subjects, and 4 diverse 2D/3D attack types. More importantly, CeFA is the only public face anti-spoofing dataset with ethnic labels. (2) We provide a baseline, namely PSMM-Net, which learns complementary information from multi-modal data to alleviate ethnic bias. (3) Extensive experiments validate the utility of our algorithm and the generalization capability of models trained on the proposed dataset. (4) We organize the Chalearn Face Anti-spoofing Attack Detection Challenge at CVPR2020, which consists of single-modal (e.g., RGB) and multi-modal (e.g., RGB, Depth, IR) tracks around this novel resource to boost research aiming to alleviate ethnic bias. Both tracks attracted 340 teams in the development stage; finally, 11 and 8 teams submitted their code to the single-modal and multi-modal tracks, respectively. We describe the associated dataset and the challenge protocol, including the evaluation metrics. We review in detail the proposed solutions, report the challenge results, and analyze them, pointing out the critical issues in the PAD task and the shortcomings of the existing algorithms.
5.1.3
CASIA-SURF HiFiMask & 3D High-Fidelity Mask Face Presentation Attack Detection Challenge at ICCV2021
In 2021: (1) We release the large-scale CASIA-SURF HiFiMask dataset with three challenging protocols. Specifically, it consists of 54,600 videos of 75 subjects with 3 kinds of high-fidelity masks, and it is at least 16 times larger than existing datasets in terms of data amount. (2) We propose a novel CCL framework to efficiently leverage rich and fine-grained context between live and mask faces for discriminative feature representation (a generic contrastive-loss sketch appears at the end of this subsection). (3) We conduct a comprehensive set of experiments on both HiFiMask and three other 3D mask datasets, verifying the significance of the proposed dataset and method. (4) We organize the 3D High-Fidelity Mask Face Presentation Attack Detection Challenge at ICCV2021 based on the HiFiMask dataset, running on the CodaLab platform. 195 teams registered for the competition and 18 teams made it to the final stage. Among the latter, 12 teams were from companies and 6 from academic institutes/universities. We first describe the associated dataset, the challenge protocol, and the evaluation metrics. Then, we review the top-ranked solutions and report the results from the final phase. Finally, we summarize
the relevant conclusions, and point out the effective methods against mask attacks explored by this challenge.
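The following is a generic supervised-contrastive loss over live/mask embeddings, included only to illustrate the kind of objective that contrasts fine-grained context between the two classes. It is a standard SupCon-style formulation with assumed embedding shapes and a toy batch, not the actual CCL loss.

```python
# Sketch: a generic supervised-contrastive loss over live (0) and mask (1) embeddings.
# This is a standard SupCon-style loss, not the exact CCL formulation.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """features: [B, D] embeddings; labels: [B] integer class labels."""
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature                 # [B, B] similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float("-inf"))             # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)             # neutralize the diagonal
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = positives.sum(dim=1).clamp(min=1)
    # Mean negative log-probability of same-class (positive) pairs per anchor.
    return (-(log_prob * positives).sum(dim=1) / pos_counts).mean()

embeddings = torch.randn(8, 128)
labels = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
print(supervised_contrastive_loss(embeddings, labels))
```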
5.2
Future Work
5.2.1
Flexible Modal Face Anti-spoofing
The current FAS technology based on multi-modal fusion [1] usually uses independent branches to extract features from different modalities and then fuses all modal features at a certain node for classification. Although multi-modal FAS technology can exploit the complementary advantages of multi-modal information to improve model performance, it requires the same modality types to be available in the testing phase as in the training phase. This seriously limits the scope of use of multi-modal FAS technology, as many devices, such as phones, only provide a single-modality sensor. In addition, currently existing FAS algorithms [2–5] generally use convolutional neural networks to extract features, but they are easily confused by high-quality attack samples [2, 6–8]. In future work, we aim to design a flexible-modal FAS framework that can be freely deployed on any device with a given sensor. The framework should be trainable with all available multi-modal data during the training phase and should improve recognition performance without requiring all modalities to be present during the testing phase. On the other hand, the Vision Transformer (ViT) [9] has demonstrated superior performance in many computer vision tasks but has not been fully explored for the FAS task. Using ViT for FAS has three main advantages: (a) Long-range dependencies: local forgery traces can establish dependency relationships with other image patches and serve as long-range indicators. (b) The class token can summarize all the different types of spoofing information in the image. (c) The multi-head self-attention mechanism captures various spoofing traces in parallel. Based on the above analysis, designing a flexible-modal FAS framework based on ViT is a practical and valuable topic (a minimal sketch follows below).
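The sketch below shows one possible shape of such a flexible-modal transformer: each available modality is patch-embedded, tagged with a learnable modality embedding, and concatenated with a class token, so any subset of modalities can be fed at test time. The patch size, width, depth, 3-channel inputs for all modalities, the omission of positional embeddings, and the use of nn.TransformerEncoder are all simplifying assumptions rather than a published architecture.

```python
# Sketch: a flexible-modal transformer that accepts any subset of modalities at test
# time. Architectural choices here are simplifying assumptions, not a published model.
import torch
import torch.nn as nn

class FlexibleModalViT(nn.Module):
    def __init__(self, modalities=("rgb", "depth", "ir"), dim=256, patch=16):
        super().__init__()
        self.patch_embed = nn.ModuleDict(
            {m: nn.Conv2d(3, dim, kernel_size=patch, stride=patch) for m in modalities})
        self.modal_embed = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(1, 1, dim)) for m in modalities})
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, 2)  # live / spoof logits from the class token

    def forward(self, inputs):
        # inputs: dict mapping any available modality names to [B, 3, H, W] tensors.
        tokens = []
        for m, x in inputs.items():
            t = self.patch_embed[m](x).flatten(2).transpose(1, 2)  # [B, N, dim]
            tokens.append(t + self.modal_embed[m])                 # tag with modality
        b = tokens[0].shape[0]
        seq = torch.cat([self.cls_token.expand(b, -1, -1)] + tokens, dim=1)
        return self.head(self.encoder(seq)[:, 0])                  # classify class token

model = FlexibleModalViT()
print(model({"rgb": torch.randn(2, 3, 224, 224)}).shape)            # RGB only
print(model({"rgb": torch.randn(2, 3, 224, 224),
             "depth": torch.randn(2, 3, 224, 224)}).shape)          # RGB + Depth
```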
5.2.2
Generalizable Face Anti-spoofing
Although existing methods [2, 10–15] obtain remarkable performance in intra-dataset experiments, where training and testing data come from the same domain, cross-dataset evaluation remains an unsolved challenge due to large distribution discrepancies among different domains. There are two schemes to improve the generalization of Presentation Attack Detection (PAD) technology: (1) Domain Adaptation (DA). It aims to minimize the distribution discrepancy between the source and target domains by leveraging unlabeled target data. However, the target domain is difficult to collect, or even unknown
during training, which limits the utilization of DA methods [16–18]. (2) Domain Generalization (DG). It overcomes this by taking advantage of multiple source domains without seeing any target data. A straightforward strategy [19, 20] is to collect diverse source data from multiple relevant domains to train a model with more domain-invariant and generalizable representations (a minimal multi-source training sketch follows at the end of this subsection). Other methods use meta-learning strategies [21–25] to find generalized learning directions. Therefore, data diversification at the feature level, the search for a generalized feature space, and the suppression of liveness-irrelevant features are three effective strategies for improving generalization. However, the existing methods only explore one or two of them, leaving room for further development. Compared with CNNs, the Transformer encourages non-local computation, captures global context, and models long-range dependencies. ViTranZFAS [26], ViTAF [27], and MA-ViT [4] tentatively use pure ViT to solve the zero-shot, few-shot, and flexible-modal anti-spoofing tasks, respectively. Therefore, adopting ViT and combining the above strategies to improve domain generalization performance is a meaningful direction for future work.
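Below is a minimal sketch of the straightforward multi-source strategy mentioned above: one optimization step simply aggregates classification losses over batches drawn from several labelled source domains. The model, optimizer, domain names, and batch format are placeholders; meta-learning variants [21–25] would additionally split the sources into meta-train and meta-test domains, which this sketch does not do.

```python
# Sketch: one training step of the straightforward multi-source domain
# generalization strategy (aggregate losses over several source domains).
# Model, optimizer, domain names and batch contents are placeholder assumptions.
import torch
import torch.nn as nn

def multi_source_step(model, optimizer, domain_batches, criterion=nn.CrossEntropyLoss()):
    """domain_batches maps a domain name to a (images [B,3,H,W], labels [B]) tuple."""
    loss = sum(criterion(model(x), y) for x, y in domain_batches.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a tiny classifier and two synthetic source domains.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = {d: (torch.randn(4, 3, 32, 32), torch.randint(0, 2, (4,)))
           for d in ["source_domain_a", "source_domain_b"]}
print(multi_source_step(model, optimizer, batches))
```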
5.2.3
Surveillance Face Anti-spoofing
With the popularity of remote cameras and the improvement of surveillance networks, the development of smart cities has put forward higher requirements for traditional visual technologies in surveillance. Benefiting from the release of face recognition datasets [28–30] for surveillance scenes and driven by related algorithms [31–33], face recognition systems have gradually shed the constraint of verification distance and can use surveillance cameras for real-time capture, self-service access control, and self-service supermarket payment. However, the FAS community is still focused on protecting face recognition systems under short-distance conditions and cannot yet detect spoofing faces at long distances under natural behavior. We identify two reasons that hinder the development of PAD technologies: (1) Lack of a dataset that can truly simulate attacks in surveillance. The existing FAS datasets, whether for 2D print or replay attacks [1, 2, 6] or 3D mask attacks [8, 34–36], require the subjects to face the acquisition device under distance constraints. However, diversified surveillance scenes, rich spoofing types, and natural human behavior are important factors for surveillance FAS dataset collection. (2) Low-quality faces in surveillance scenarios cannot meet the requirements of fine-grained, feature-based FAS tasks. The existing FAS algorithms, whether based on color-texture feature learning [37], face depth structure fitting [2], or remote photoplethysmography (rPPG)-based detection [38], require high-quality image details to ensure high performance. Faces under long-distance surveillance have low resolution and contain noise from motion blur, occlusion, bad weather, and other adverse factors. These are new challenges for algorithm design in surveillance FAS. In order to fill the gap in surveillance
scenes for the FAS community, in future work we aim to address the two challenging problems analyzed above from two aspects: data collection and algorithm design. The collection of such a dataset [39] should meet the following requirements: (1) Data should be collected in real surveillance scenes, rather than as low-quality datasets obtained by manual degradation, such as GREAT-FASD-S [40]. Compared to previous PAD datasets captured in controlled environments, a dataset based on monitoring scenarios inevitably introduces low-resolution faces, pedestrian occlusion, changing postures, motion blur, and other challenging situations, which greatly increase the difficulty of FAS tasks. (2) Data should cover the most comprehensive set of attack types, each containing diverse spoofing methods. For example, 2D images, video replays, and 3D masks all appear in SuHiFiMask to evaluate an algorithm's perception of changes in paper color, screen moiré, and face structure in surveillance scenes. Different from the attack types used in classical, more constrained environments, data containing paper posters and humanoid stand-ups among the 2D attacks, and headgear and head molds among the 3D masks, to minimize the spoofing traces in surveillance scenes, will be more meaningful. In order to effectively prevent criminals from hiding their identities through local occlusion during security inspection, we will introduce the two most effective adversarial attacks (ADV), instead of simply masking the face with paper glasses [41] or partial paper [42]. (3) Data should include at least 40 common real-world surveillance scenes, including daily-life scenes (e.g., cafes, cinemas, and theaters) and security-check scenes (e.g., security check lanes and parking lots) where face recognition systems are deployed. In fact, the rich natural behaviors in different surveillance scenes greatly increase the difficulty of PAD due to pedestrian occlusion and non-frontal views. (4) The data should include at least four types of weather (e.g., sunny, windy, cloudy, and snowy days) and natural lighting (e.g., day and night), to fully simulate complex and changeable surveillance scenes. Different weather and lighting conditions bring diverse image styles and image artifacts, which put forward higher requirements for the generalization of PAD technology.
References
1. Zhang S, Liu A, Wan J, Liang Y, Guo G, Escalera S, Escalante HJ, Li SZ (2020) Casia-surf: a large-scale multi-modal benchmark for face anti-spoofing. TBIOM 2(2):182–193
2. Liu Y, Jourabloo A, Liu X (2018) Learning deep models for face anti-spoofing: binary or auxiliary supervision. In: CVPR
3. Liu A, Tan Z, Wan J, Liang Y, Lei Z, Guo G, Li SZ (2021) Face anti-spoofing via adversarial cross-modality translation. IEEE Trans Inf Forensics Secur 16:2759–2772
4. Liu A, Liang Y (2022) Ma-vit: modality-agnostic vision transformers for face anti-spoofing. In: Proceedings of the thirty-first international joint conference on artificial intelligence, IJCAI-22, pp 1180–1186. International joint conferences on artificial intelligence organization
5. Liu A, Wan J, Jiang N, Wang H, Liang Y (2022) Disentangling facial pose and appearance information for face anti-spoofing. In: 2022 26th international conference on pattern recognition (ICPR), pp 4537–4543. IEEE
6. Boulkenafet Z (2017) A competition on generalized software based face presentation attack detection in mobile scenarios. In: IJCB
7. Zhang Y, Yin Z, Li Y, Yin G, Yan J, Shao J, Liu Z (2020) Celeba-spoof: large-scale face anti-spoofing dataset with rich annotations. In: ECCV
8. Liu A, Zhao C, Yu Z, Wan J, Su A, Liu X, Tan Z, Escalera S, Xing J, Liang Y et al (2022) Contrastive context-aware learning for 3d high-fidelity mask face presentation attack detection. IEEE Trans Inf Forensics Secur 17:2497–2507
9. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR
10. Li X, Wan J, Jin Y, Liu A, Guo G, Li SZ (2020) 3dpc-net: 3d point cloud network for face anti-spoofing. In: 2020 IEEE international joint conference on biometrics (IJCB). IEEE, pp 1–8
11. Patel K, Han H, Jain AK (2016) Secure face unlock: spoof detection on smartphones. IEEE TIFS
12. George A, Marcel S (2019) Deep pixel-wise binary supervision for face presentation attack detection. In: ICB
13. Yu Z, Zhao C, Wang Z, Qin Y, Su Z, Li X, Zhou F, Zhao G (2020) Searching central difference convolutional networks for face anti-spoofing. In: CVPR
14. Zhang K-Y, Yao T, Zhang J, Tai Y, Ding S, Li J, Huang F, Song H, Ma L (2020) Face anti-spoofing via disentangled representation learning. In: ECCV
15. Liu Y, Stehouwer J, Liu X (2020) On disentangling spoof trace for generic face anti-spoofing. In: ECCV. Springer, pp 406–422
16. Li H, Li W, Cao H, Wang S, Huang F, Kot AC (2018) Unsupervised domain adaptation for face anti-spoofing. IEEE Trans Inf Forensics Secur 13(7):1794–1809
17. Tu X, Zhang H, Xie M, Luo Y, Zhang Y, Ma Z (2019) Deep transfer across domains for face anti-spoofing. J Electron Imaging 28(4):043001
18. Wang G, Han H, Shan S, Chen X (2019) Improving cross-database face presentation attack detection via adversarial domain adaptation. In: ICB. IEEE
19. Manoel Camillo O, Penna N, Koerich AL, Britto Jr AS, Israel A, Laurensi R, Menon LT (2019) Style transfer applied to face liveness detection with user-centered models. arXiv:1907.07270
20. Yin Z, Shao J, Yang B, Zhang J (2021) Few-shot domain expansion for face anti-spoofing. arXiv:2106.14162
21. Shao R, Lan X, Yuen PC (2020) Regularized fine-grained meta face anti-spoofing. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 11974–11981
22. Qin Y, Yu Z, Yan L, Wang Z, Zhao C, Lei Z (2021) Meta-teacher for face anti-spoofing. IEEE Trans Pattern Anal Mach Intell
23. Chen Z, Yao T, Sheng K, Ding S, Tai Y, Li J, Huang F, Jin X (2021) Generalizable representation learning for mixture domain face anti-spoofing. In: Proceedings of the AAAI conference on artificial intelligence, vol 35, pp 1132–1139
24. Wang J, Zhang J, Bian Y, Cai Y, Wang C, Pu S (2021) Self-domain adaptation for face anti-spoofing. In: Proceedings of the AAAI conference on artificial intelligence, vol 35, pp 2746–2754
25. Zhou Q, Zhang K-Y, Yao T, Yi R, Ding S, Ma L (2022) Adaptive mixture of experts learning for generalizable face anti-spoofing. In: Proceedings of the 30th ACM international conference on multimedia, pp 6009–6018
26. George A, Marcel S (2021) On the effectiveness of vision transformers for zero-shot face anti-spoofing
27. Huang H-P, Sun D, Liu Y, Chu W-S, Xiao T, Yuan J, Adam H, Yang M-H (2022) Adaptive transformers for robust few-shot cross-domain face anti-spoofing. arXiv preprint arXiv:2203.12175
28. Cheng Z, Zhu X, Gong S (2018) Surveillance face recognition challenge. arXiv preprint arXiv:1804.09691
29. Nada H, Sindagi VA, Zhang H, Patel VM (2018) Pushing the limits of unconstrained face detection: a challenge dataset and baseline results. In: 2018 IEEE 9th international conference on biometrics theory, applications and systems (BTAS), pp 1–10. IEEE
30. Grgic M, Delac K, Grgic S (2011) Scface-surveillance cameras face database. Multimedia Tools Appl 51(3):863–879
31. Kim M, Jain AK, Liu X (2022) Adaface: quality adaptive margin for face recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), June 2022, pp 18750–18759
32. Li P, Prieto L, Mery D, Flynn PJ (2019) On low-resolution face recognition in the wild: comparisons and new techniques. IEEE Trans Inf Forensics Secur 14(8):2000–2012
33. Zhong Y, Deng W, Hu J, Zhao D, Li X, Wen D (2021) Sface: sigmoid-constrained hypersphere loss for robust face recognition. IEEE Trans Image Process 30:2587–2598
34. Erdogmus N, Marcel S (2013) Spoofing in 2d face recognition with 3d masks and anti-spoofing with kinect. In: IEEE 6th international conference on biometrics: theory, applications and systems (BTAS'13), pp 1–8
35. Galbally J, Satta R (2016) Three-dimensional and two-and-a-half-dimensional face recognition spoofing using three-dimensional printed models. IET Biometrics
36. Steiner H, Kolb A, Jung N (2016) Reliable face anti-spoofing using multispectral swir imaging. In: ICB. IEEE
37. Jia S, Guo G, Xu Z (2020) A survey on 3d mask presentation attack detection and countermeasures. Pattern Recognit
38. Liu S-Q, Lan X, Yuen PC (2018) Remote photoplethysmography correspondence feature for 3d mask face presentation attack detection. In: ECCV
39. Fang H, Liu A, Wan J, Escalera S, Zhao C, Zhang X, Li SZ, Lei Z (2023) Surveillance face anti-spoofing. arXiv preprint arXiv:2301.00975
40. Chen X, Xu S, Ji Q, Cao S (2021) A dataset and benchmark towards multi-modal face anti-spoofing under surveillance scenarios. IEEE Access 9:28140–28155
41. George A, Mostaani Z, Geissenbuhler D, Nikisins O, Anjos A, Marcel S (2019) Biometric face presentation attack detection with multi-channel convolutional neural network. TIFS
42. Liu Y, Stehouwer J, Jourabloo A, Liu X (2019) Deep tree learning for zero-shot face anti-spoofing. In: CVPR