Cheng Wang

Anti-Fraud Engineering for Digital Finance: Behavioral Modeling Paradigm
Cheng Wang Department of Computer Science and Engineering Tongji University Shanghai, China
ISBN 978-981-99-5256-4    ISBN 978-981-99-5257-1 (eBook)
https://doi.org/10.1007/978-981-99-5257-1

Jointly published with Tongji University Press Co., Ltd. The print edition is not for sale in China (Mainland). Customers from China (Mainland) please order the print book from: Tongji University Press Co., Ltd.

© Tongji University Press 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore.

Paper in this product is recyclable.
Contents
1 Overview of Digital Finance Anti-fraud
  1.1 Situation of Anti-fraud Engineering
  1.2 Challenge of Anti-fraud Engineering
  1.3 Strategies of Anti-fraud Engineering
  1.4 Typical Application in Financial Scenarios
  1.5 Outline of This Book
  References

2 Vertical Association Modeling: Latent Interaction Modeling
  2.1 Introduction to Vertical Association Modeling in Online Services
  2.2 Related Work
    2.2.1 Composite Behavioral Modeling
    2.2.2 Customized Data Enhancement
  2.3 Fine-Grained Co-occurrences for Behavior-Based Fraud Detection
    2.3.1 Fraud Detection System Based in Online Payment Services
    2.3.2 Experimental Evaluation
  2.4 Conclusion
    2.4.1 Behavior Enhancement
    2.4.2 Future Work
  References

3 Horizontal Association Modeling: Deep Relation Modeling
  3.1 Introduction to Horizontal Association Modeling in Online Services
    3.1.1 Behavior Prediction
    3.1.2 Behavior Sequence Analysis
  3.2 Related Work
    3.2.1 Fraud Prediction by Account Risk Evaluation
    3.2.2 Fraud Detection by Optimizing Window-Based Features
  3.3 Historical Transaction Sequence for High-Risk Behavior Alert
    3.3.1 Fraud Prediction System Based on Behavior Prediction
    3.3.2 Experimental Evaluation
    3.3.3 Enhanced Anti-fraud Scheme
  3.4 Learning Automatic Windows for Sequence-Form Fraud Pattern
    3.4.1 Fraud Detection System Based on Behavior Sequence Analysis
    3.4.2 Experimental Evaluation
  3.5 Conclusion
    3.5.1 Behavior Prediction
    3.5.2 Behavior Analysis
    3.5.3 Future Work
  References

4 Explicable Integration Techniques: Relative Temporal Position Taxonomy
  4.1 Concepts and Challenges
  4.2 Main Technical Means of Anti-fraud Integration System
    4.2.1 Anti-fraud Function Divisions
    4.2.2 Module Integration Schemes
    4.2.3 Explanation Methods
  4.3 System Integration Architecture
    4.3.1 Anti-fraud Function Modules
    4.3.2 Center Control Module
    4.3.3 Communication Architecture
  4.4 Performance Analysis
    4.4.1 Experimental Set-Up
    4.4.2 Implementation
    4.4.3 Evaluation of System Performance
    4.4.4 Exemplification of CAeSaR's Advantages
  4.5 Discussion
    4.5.1 Faithful Explanation
    4.5.2 Online Learning
  4.6 Conclusion
  References

5 Multidimensional Behavior Fusion: Joint Probabilistic Generative Modeling
  5.1 Online Identity Theft Detection Based on Multidimensional Behavioral Records
  5.2 Overview of the Solution
  5.3 Identity Theft Detection Solutions in Online Social Networks
    5.3.1 Composite Behavioral Model
    5.3.2 Identity Theft Detection Scheme
  5.4 Evaluation and Analysis
    5.4.1 Datasets
    5.4.2 Experiment Settings
    5.4.3 Performance Comparison
  5.5 Literature Review
  5.6 Conclusion
  References

6 Knowledge Oriented Strategies: Dedicated Rule Engine
  6.1 Online Anti-fraud Strategy Based on Semi-supervised Learning
  6.2 Development and Present State
    6.2.1 Anti-fraud in Online Services
    6.2.2 Graph Neural Networks
    6.2.3 Weak Supervision
  6.3 Risk Prediction Measures in Online Lending Services
    6.3.1 Preliminary
    6.3.2 Graph-Oriented Snorkel
    6.3.3 Heterogeneous Graph Neural Network
    6.3.4 Loss Function
  6.4 Risk Assessment and Analysis
    6.4.1 Datasets and Evaluation Metrics
    6.4.2 Baseline Methods
    6.4.3 Implementation Details
    6.4.4 Performance Comparison
    6.4.5 Ablation Study
    6.4.6 Parameter Sensitivity
  6.5 Conclusion
  References

7 Enhancing Association Utility: Dedicated Knowledge Graph
  7.1 Gang Fraud Prediction System Based on Knowledge Graph
  7.2 Related Work
  7.3 Recovering-Mining-Clustering-Predicting Framework
    7.3.1 Recovering Missing Associations
    7.3.2 Mining Underlying Associations
    7.3.3 Clustering and Predicting
  7.4 Experimental Evaluation
    7.4.1 Dataset Description and Experiment Settings
    7.4.2 On Model Comparison
    7.4.3 On Address Disambiguation
    7.4.4 On Network Embedding
  7.5 Conclusion
  References

8 Associations Dynamic Evolution: Evolving Graph Transformer
  8.1 Dynamic Fraud Detection Solution Based on Graph Transformer
  8.2 Related Work
  8.3 Fraud Detection in Online Lending Services
    8.3.1 Preliminary
    8.3.2 Graph Transformer
    8.3.3 Evolving Graph Transformer
  8.4 Experimental Evaluation
    8.4.1 Datasets and Metrics
    8.4.2 Baseline Methods
    8.4.3 Implementation Details
    8.4.4 Results for Node Classification
    8.4.5 Results for Edge Classification
    8.4.6 Ablation Study
    8.4.7 Parameter Sensitivity
  8.5 Conclusion
  References
Chapter 1
Overview of Digital Finance Anti-fraud
1.1 Situation of Anti-fraud Engineering

The combination of digital technology and finance gives birth to new business forms. Under the trend of "Internet Plus", financial technology start-ups, innovative business models, and solutions are emerging, covering many fields, e.g., third-party payment [11], online insurance [1], online lending [2], and innovative business of traditional banking [3]. On the one hand, emerging digital financial institutions are constantly penetrating traditional financial businesses; on the other hand, traditional institutions are also entering digital finance in many ways. With the support of digital technology, the development potential of financial markets is gradually enlarged [4–6]. At the same time, hidden risks are increasing and frauds emerge in an endless stream. From the perspective of platform fraud, fraudulent platforms that default account for a huge proportion of losses. From the perspective of personal fraud, digital financial fraud, led by the Internet black market, has been rampant and has penetrated various links [7], e.g., digital financial marketing, registration, lending, and payment. Digital financial platforms, which use data analysis to carry out financial business, are among the main targets of black-market attacks. The high incidence of fraud reduces consumers' trust in digital financial services, so the risk control processes of digital finance generally face greater pressure.
1.2 Challenge of Anti-fraud Engineering

Financial businesses have achieved rapid development by rooting themselves in digital technology. Traditional financial businesses are constantly moving to online forms, and financial fraud is constantly evolving and becoming more complicated. Accordingly, fraud methods are characterized by specialization, industrialization, concealment, and scenario dependence [8–11].
Fig. 1.1 Digital financial frauds in different fields
Specialization. In the context of digital finance, fraud has evolved from simple account theft and unauthorized card use to more complex and diverse forms that exploit big data and other cutting-edge technologies. Criminals have shifted from widespread, indiscriminate fraud to precisely targeted fraud, and overlap complex and diverse fraud tricks (as shown in Fig. 1.1), e.g., pyramid selling, part-time money making, online purchase refunds, financial management, and virtual currency. The variety of fraud forms, together with the injection of new technologies such as digital finance and blockchain, makes digital financial fraud more confusing and difficult to identify, so that victims are often unable to prevent it.

Industrialization. Compared with traditional fraud, digital financial fraud is often organized and large-scale. Criminals have a clear division of labor and cooperate closely to commit crimes, forming a complete criminal industry chain. This chain mainly includes four links: development and production, wholesale and retail, fraud implementation, and money laundering. Through further subdivision, it can be divided into 15 specific divisions (as shown in Fig. 1.2), including software development, hardware production, network hackers, phishing retail, domain name dealers, personal information wholesale, bank card dealers, phone card dealers, ID card dealers, phone fraud, mass SMS sending, online promotion, cash withdrawal, e-commerce platform shopping, and pornographic, gambling, and drug websites.

Concealment. The virtual nature of the Internet and digital technologies leads to more covert fraud, which is mainly reflected in three aspects. First, criminals tend to commit crimes from different locations, so financial fraud is gradually becoming mobile. Digital financial scams are not limited by space, and even the criminals in the same fraud gang may come from all over the country. Second, criminals tend to use multiple small transfers to carry out fraud. Due to the universality of digital finance and its sinking customer services, most of the losses caused by a single fraud are less than 1,500 dollars. Third, it is difficult to obtain evidence with traditional solutions alone, since digital financial fraud often involves account theft and identity fraud.
Fig. 1.2 Basic chain of digital financial fraud
Scenario. Most digital financial services are carried out on the basis of specific scenarios, and the corresponding financial fraud also presents scenario-based characteristics. Taking online shopping as an example, digital financial institutions can carry out various financial services such as consumer finance, supply chain finance, and return-freight insurance by relying on online shopping. If the buyers and sellers collude and fabricate transactions, multiple fraudulent behaviors may occur. The sellers gain inflated trading volume and obtain a higher credit limit in supply chain finance. The buyers may use consumer finance to cash out through false purchases. In addition, the two parties can fraudulently claim the return-freight insurance by faking returns.

Traditional anti-fraud solutions face many challenges in this new situation, such as single dimension, low efficiency, and limited scope [12–14].

Single dimension. Traditional anti-fraud methods are based on a single dimension, so it is difficult to form a multi-dimensional user portrait. This makes it difficult to analyze customers' behavioral preferences, solvency, payment ability, and fraud tendency through user portraits. Take the credit investigation system of the People's Bank of China as an example: limited by a single data source, there are still a large number of "credit-white" households in China (people without credit cards or other borrowing records). It is necessary to build a multi-dimensional credit investigation system to reduce fraud risk.

Low efficiency. Traditional anti-fraud techniques require a lot of manual operation and involve high application costs. As the customer base of fintech business sinks, transactions become frequent, real-time, and high-volume. Traditional anti-fraud methods are not effective at identifying small-amount, high-frequency fraud, which makes it difficult for them to serve a sinking customer base.
Limited scope. With the in-depth development of digital technology, financial fraud is increasingly intertwined with other scenarios. Non-financial scenarios such as online shopping and online games also contain financial fraud risks, which are difficult to identify with traditional anti-fraud techniques.
1.3 Strategies of Anti-fraud Engineering

Nowadays, the fraud carried out by criminals is usually characterized by gangs, industrialization, and scale. Big data, artificial intelligence, and other cutting-edge technologies are widely used to enhance fraud capabilities. The detection capability of anti-fraud technology directly affects the actual effect of anti-fraud in digital finance. From the perspective of application and technology, digital financial anti-fraud technologies can be divided into data acquisition, data analysis, decision-making engines, and other types.

Data acquisition obtains customer-related data from clients or networks. The use of data acquisition technology should strictly follow laws, regulations, and regulatory requirements, and user data should only be collected with user authorization. Data acquisition technologies include device fingerprinting, web crawlers, biometrics, location identification, in-liveness detection, and so on.

Data analysis refers to discovering knowledge from data. Machine learning is a data analysis technology that achieves anti-fraud through model prediction. It relies on data, trains appropriate models through data analysis, and then uses the models for prediction to achieve the anti-fraud effect. It includes supervised, unsupervised, and semi-supervised machine learning modes.

The decision-making engine is the core of a digital anti-fraud system. A powerful decision engine can effectively integrate various anti-fraud methods such as reputation lists, expert rules, and anti-fraud models. It can also provide anti-fraud personnel with an efficient and functional human-computer interaction interface, greatly reducing anti-fraud operating costs and response time. A decision engine can be judged from multiple dimensions, such as processing capacity, response speed, and user interface.
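To make the integration role of the decision engine concrete, the following minimal sketch shows one way a reputation list, expert rules, and a model score could be combined into a single decision. It is our own illustration, not the book's engine; the thresholds, rule set, field names, and blacklist are hypothetical assumptions.

```python
# Minimal decision-engine sketch (illustrative only; thresholds, rules, and
# the transaction fields are hypothetical, not the book's system).

def decide(txn, blacklist, model_score, rules,
           reject_threshold=0.9, review_threshold=0.6):
    """Combine a reputation list, expert rules, and a model score into one decision."""
    # 1. Reputation list: a blacklisted account is rejected immediately.
    if txn["account"] in blacklist:
        return "reject", "blacklisted account"
    # 2. Expert rules: any triggered rule sends the transaction to manual review.
    for name, rule in rules.items():
        if rule(txn):
            return "review", f"rule triggered: {name}"
    # 3. Model score: thresholds map the fraud probability to a decision.
    if model_score >= reject_threshold:
        return "reject", f"model score {model_score:.2f}"
    if model_score >= review_threshold:
        return "review", f"model score {model_score:.2f}"
    return "accept", "no risk signal"

# Example usage with toy inputs.
rules = {
    "large_amount_at_night": lambda t: t["amount"] > 5000 and t["hour"] in range(0, 5),
    "new_device": lambda t: t["device_age_days"] < 1,
}
txn = {"account": "A1", "amount": 6200, "hour": 2, "device_age_days": 30}
print(decide(txn, blacklist={"B9"}, model_score=0.42, rules=rules))
# -> ('review', 'rule triggered: large_amount_at_night')
```

The ordering of the three stages mirrors the integration idea above: cheap list checks first, explainable rules next, and the model score as the final arbiter.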
1.4 Typical Application in Financial Scenarios

We mainly introduce two typical fraud scenarios and their security solutions on the basis of summarizing the manifestations of digital financial fraud, i.e., association mining solutions in online payment services and association enhancement solutions in online lending services. For the fraud in each scenario, we mainly introduce the anti-fraud technology and its application cases, and analyze potentially available technologies.
Fig. 1.3 The fraud process of online payment using stolen accounts
Online payment. In the payment stage, black-industry groups often steal and use personal names, mobile phone numbers, ID card numbers, bank card numbers, and other factors directly related to account security through social engineering. They mainly adopt fake WiFi hotspots, malicious QR codes, pirated apps, and Trojan links to steal users' private information. The collected key information is then classified and stored in a database. Account information (such as game accounts and financial accounts) is used for financial crimes and monetization through the black industry chain, while the user's real information is used for resale and fraudulent purchases. For instance, a college student found that 50,000 dollars in his bank account had gone "missing". After repeated inquiries, he was informed that a new account had been registered in his name on an e-commerce platform and used to purchase up to 49,966 dollars of goods, none of which were his own purchases. In the case of online payment using stolen accounts, four specific operations are usually involved (as shown in Fig. 1.3).

Step 1: Spread Trojan. The gang sent fake short messages with Trojan links through fake base stations around the university town. After the student clicked the link, his username and password were revealed.

Step 2: Trading. Because it is difficult and risky to steal from bank cards directly, fraudsters cash in through shopping in online malls after they have collected all kinds of information.

Step 3: Money transfer. After registering an account and binding the bank card, fraudsters buy high-value items such as gold and mobile phones through the online mall, and receive the goods by intercepting incoming calls or setting up call forwarding.

Step 4: Monetization. They monetize the purchased goods through the underground stolen-goods selling network of the black industry chain.

In this case, we can use behavior sequence, biological probe, and relationship mapping technologies to predict the risk in the early, middle, and late stages of the payment process. The behavior sequence technology can detect abnormal purchase records by comparing them with previously recorded shopping habits. The biological probe technology can profile the user's usage habits according to pressing force, finger contact surface, screen-sliding speed, and other indicators, and detect abnormal use during online shopping. In addition, the relationship mapping technology can estimate the user's credit through the user's social relationships and evaluate the user's demand for the purchased goods.
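As a simple illustration of the behavior sequence idea, the sketch below compares a new purchase against an account's recorded shopping habits. It is our own simplification, not the book's model; the chosen features, toy records, and the way deviation is scored are assumptions.

```python
# Sketch of a behavior-sequence check (illustrative; the feature choice and
# the toy history are hypothetical, not the book's method).
from statistics import mean, stdev

def amount_zscore(history_amounts, new_amount):
    """How far the new amount deviates from the account's historical amounts."""
    mu, sigma = mean(history_amounts), stdev(history_amounts)
    return abs(new_amount - mu) / sigma if sigma > 0 else 0.0

def unusual_hour(history_hours, new_hour):
    """Flag hours in which the account has never shopped before."""
    return history_hours.count(new_hour) == 0

history = [{"amount": a, "hour": h} for a, h in
           [(35, 12), (42, 13), (28, 19), (55, 20), (31, 12)]]
new_txn = {"amount": 49966, "hour": 3}

z = amount_zscore([t["amount"] for t in history], new_txn["amount"])
odd_hour = unusual_hour([t["hour"] for t in history], new_txn["hour"])
print(f"amount z-score={z:.1f}, unusual hour={odd_hour}")
# A very large deviation at an hour never seen before is an early fraud signal.
```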
Fig. 1.4 The process of identity fraud in online lending services
Online lending. The main forms of fraud in online lending services are as follows: agency, gang crime, machine behavior, account theft, identity fraud, and serial transactions. Among them, identity fraud is relatively common; it refers to an applicant forging the personal identity documents, property certificates, and other materials provided. Fraudsters may even use illegal means such as deception to obtain other people's information and then pose as another person to cheat. Figure 1.4 shows the process of identity fraud. In practice, a malicious agency may recruit college students for part-time jobs through social software. The agency gives each student a mobile phone card and asks the student to take the card to a bank to apply for a salary card. Using the bank card and mobile phone number obtained for registration, the agency can acquire the student's ID card, student status, education, and other information, and then apply for multiple credit products from online loan platforms. For fraud in online lending, we can adopt face recognition, user portrait, and other technologies. Face recognition technology can identify whether a loan application is initiated by the borrower himself or herself. Since some online loan platforms have no video verification process, applications need to be further verified with precise user portraits and other technologies. These platforms can depict the personal characteristics of customers through text semantic analysis, user behavior analysis, and terminal analysis throughout the whole process of an online loan transaction. For example, data analysis of behavior trajectories shows that normal customers stay for a few seconds at each node of the application and take at least 5 min to complete the entire loan application process, while fraudsters complete all the steps in less than 10 s.
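The dwell-time analysis described above can be illustrated with a short sketch. This is our own illustration, not a platform's implementation; the event format is hypothetical, while the 10-second cut-off follows the example in the text.

```python
# Dwell-time sketch for loan-application trajectories (illustrative; the event
# format and node names are assumptions).
from datetime import datetime

def total_duration_seconds(events):
    """events: list of (node_name, ISO timestamp) in the order they were visited."""
    times = [datetime.fromisoformat(ts) for _, ts in events]
    return (times[-1] - times[0]).total_seconds()

def too_fast(events, min_seconds=10):
    """Applications finished faster than any person could read the forms are suspicious."""
    return total_duration_seconds(events) < min_seconds

application = [
    ("identity_form",  "2023-05-01T10:00:00"),
    ("income_form",    "2023-05-01T10:00:03"),
    ("confirm_submit", "2023-05-01T10:00:07"),
]
print(too_fast(application))  # True: the whole flow took 7 s
```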
1.5 Outline of This Book

In this book, we introduce key technologies of anti-fraud engineering in two representative application backgrounds. The content structure of this book is shown in Fig. 1.5. To begin with, we mainly introduce solutions for behavior modeling based on different perspectives of associations.
• Latent interaction modeling via Vertical Associations (Chap. 2).
• Deep relation modeling via Horizontal Associations (Chap. 3).
Then, we introduce model-level integration technologies built on the two association modeling solutions to further improve detection performance.
Fig. 1.5 Architecture of this book
• We design a novel three-way taxonomy of function division and integration techniques to cope with complex and varied frauds (Chap. 4).
• We propose a joint (instead of fused) model to capture both the online and offline features of a user's composite behavior (Chap. 5).
Meanwhile, we also explore some advanced technologies to customize and improve the performance of our behavior association modeling in more cases.
• We propose a dedicated graph-oriented framework to address the scarcity of data labels (Chap. 6).
• We introduce a knowledge graph to address the low-quality data problem by enhancing the utility of associations (i.e., recovering missing associations and mining underlying associations) (Chap. 7).
• We propose a technical framework for dynamic heterogeneous graphs, which can realize effective fraud detection in the face of evolving behavior patterns (Chap. 8).
The key technologies and methods of anti-fraud engineering for digital finance include the following aspects:
• Fine-grained co-occurrence modeling. The effectiveness of behavior-based methods often depends heavily on the sufficiency of user behavioral data, so it is a big challenge to build high-resolution behavioral models from low-quality behavioral data. We mainly address this problem through data enhancement for behavioral modeling. We extract fine-grained co-occurrence relationships of transactional attributes by using a knowledge graph. Furthermore, we adopt heterogeneous network embedding to learn and improve the representation of these comprehensive relationships. More details will be provided in Chap. 2.
• Historical transaction sequence modeling. Account theft is indeed predictable based on users' high-risk behaviors, without relying on the behaviors of thieves. Accordingly, we propose an account risk prediction scheme to realize ex-ante fraud detection. It takes in an account's historical transaction sequence and outputs its risk score. The risk score is then used as early evidence of whether a new transaction is fraudulent or not, before the occurrence of the new transaction. More details will be provided in Chap. 3.
• Learning automatic windows technology. Since the most significant features of fraudulent online payment transactions are exhibited in sequential form, the sliding time window is a widely recognized and effective tool for this problem. However, the adaptive setting of sliding time windows is a big challenge, since the transaction patterns in real-life application scenarios are often too elusive to be captured. We pursue an adaptive learning approach to detect fraudulent online payment transactions with automatic sliding time windows. We design an intelligent window, called the learning automatic window, which utilizes learning automata to learn the proper parameters of time windows and adjust them dynamically and regularly according to the variation and oscillation of fraudulent transaction patterns (a minimal fixed-window feature sketch is given after this list). More details will be provided in Chap. 3.
• Integration scheme based on a three-way taxonomy of function division. The integration of proper function modules is an effective way to further improve detection performance by overcoming the inability of single-function methods to cope with complex and varied frauds. However, a qualified integration is hard to achieve under multiple demanding requirements, i.e., improving detection performance, ensuring decision explainability, and limiting processing latency and computing consumption. We propose a qualified integration system that can simultaneously meet all of the above requirements. In particular, it can adaptively assign the most effective decision strategy to the corresponding transaction by a devised stacking-based multi-classification. More details will be provided in Chap. 4.
• Multidimensional behavioral modeling through joint probabilistic generative modeling. We concentrate on the issue of bridging from coarse behavioral data to an effective, quick-response, and robust behavioral model in online social networks (OSNs), where users usually have composite behavioral records consisting of multi-dimensional low-quality data, e.g., offline check-ins and online user-generated content (UGC). As an insightful result, we validate that there is a complementary effect among different dimensions of records for modeling users' behavioral patterns. To deeply exploit such a complementary effect, we propose a joint (instead of fused) model to capture both the online and offline features of a user's composite behavior. More details will be provided in Chap. 5.
• Knowledge graph-oriented framework. Most ongoing transactions have no labels, since platforms cannot determine whether they are frauds until a certain number of transactions have occurred. Traditional machine learning methods are not good at dealing with this problem, though they have achieved qualified anti-fraud performance in other Internet financial scenarios. To address this issue, we propose a Snorkel-based semi-supervised GNN. We specially design an upgraded version of the rule engine, called Graph-Oriented Snorkel, a graph-specific extension of Snorkel (a widely used weakly supervised learning framework), which allows subject matter experts to design rules and resolves conflicts among them. More details will be provided in Chap. 6.
• Enhancing association utility technology. It is challenging that online lending gang fraud prediction needs to detect evolving and increasingly impalpable fraud patterns based on low-quality data, i.e., very preliminary and coarse applicant information. The technical difficulty mainly stems from two factors: the extreme deficiency of information associations and the weakness of data labels. In this work, we mainly address these challenges by enhancing the utility of associations (i.e., recovering missing associations and mining underlying associations) on a knowledge graph. Moreover, we propose an integrated framework which consists of four steps, Recovering, Mining, Clustering, and Predicting, for efficiently predicting gang fraud. More details will be provided in Chap. 7.
• Associations dynamic evolution technology. Fraud patterns and the associations among behavioral agents keep evolving over time. We propose a technical framework for dynamic heterogeneous graphs, based on an evolving graph transformer, which can realize effective fraud detection in the face of evolving behavior patterns. More details will be provided in Chap. 8.
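As referenced in the learning automatic windows item above, the sketch below computes plain fixed sliding-time-window features for an account. It is our own illustration with hypothetical data; the adaptive adjustment of the window by learning automata, which is the actual contribution of Chap. 3, is not reproduced here.

```python
# Fixed sliding-time-window features (illustrative; Chap. 3's learning automatic
# windows adjust the window length itself, which is not shown here).
from bisect import bisect_left

def window_features(timestamps, amounts, t_now, window_seconds):
    """Count and total amount of an account's transactions in the last window.

    timestamps must be sorted ascending and contain only past transactions.
    """
    start = bisect_left(timestamps, t_now - window_seconds)
    recent = amounts[start:]
    return {"count": len(recent), "sum": sum(recent)}

# Toy history: transaction times (seconds) and amounts for one account.
ts = [100, 3600, 3650, 3700, 3720]
amt = [20.0, 15.0, 500.0, 480.0, 510.0]
print(window_features(ts, amt, t_now=3725, window_seconds=300))
# -> {'count': 4, 'sum': 1505.0}  # a burst of high amounts inside 5 minutes
```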
References

1. M.E. Haque, M.E. Tozal, IEEE Trans. Serv. Comput. 15(4), 2356 (2022)
2. W. Min, Z. Tang, M. Zhu, Y. Dai, Y. Wei, R. Zhang, in Proceedings of Workshop on Misinformation and Misbehavior Mining on the Web, Marina Del Rey, CA (2018)
3. M.A. Ali, B. Arief, M. Emms, A.P.A. van Moorsel, IEEE Secur. Privacy 15(2), 78 (2017)
4. E. Bursztein, B. Benko, D. Margolis, T. Pietraszek, A. Archer, A. Aquino, A. Pitsillidis, S. Savage, Proc. ACM IMC 2014, 347–358 (2014)
5. T.C. Pratt, K. Holtfreter, M.D. Reisig, J. Res. Crime Delinq. 47(3), 267 (2010)
6. Z. Li, J. Song, S. Hu, S. Ruan, L. Zhang, Z. Hu, J. Gao, in Proceedings IEEE ICDE 2019, Macao, China (8–11 Apr 2019), pp. 1898–1903
7. Y. Zhang, Y. Fan, Y. Ye, L. Zhao, C. Shi, in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China (3–7 Nov 2019), ed. by W. Zhu, D. Tao, X. Cheng, P. Cui, E.A. Rundensteiner, D. Carmel, Q. He, J.X. Yu (ACM, 2019), pp. 549–558. https://doi.org/10.1145/3357384.3357876
8. A.D. Pozzolo, G. Boracchi, O. Caelen, C. Alippi, G. Bontempi, IEEE Trans. Neural Netw. Learn. Syst. 29(8), 3784 (2018)
9. B. Cao, M. Mao, S. Viidu, P.S. Yu, in Proceedings IEEE ICDM 2017, New Orleans, LA, USA (18–21 Nov 2017), pp. 769–774
10. C. Wang, C. Wang, H. Zhu, J. Cui, IEEE Trans. Dependable Secur. Comput. 18(5), 2122 (2021)
11. C. Wang, H. Zhu, Representing fine-grained co-occurrences for behavior-based fraud detection in online payment services. IEEE Trans. Dependable Secur. Comput. 19(1), 301–315 (2022)
12. C. Wang, H. Zhu, IEEE Trans. Inf. Forensics Secur. 17, 2703 (2022). https://doi.org/10.1109/TIFS.2022.3191493
13. C. Wang, H. Zhu, R. Hu, R. Li, C. Jiang, IEEE Trans. Big Data 1–1 (2022). https://doi.org/10.1109/TBDATA.2022.3172060
14. C. Wang, H. Zhu, B. Yang, IEEE Trans. Comput. Soc. Syst. 9(2), 428 (2022). https://doi.org/10.1109/TCSS.2021.3092007
Chapter 2
Vertical Association Modeling: Latent Interaction Modeling
2.1 Introduction to Vertical Association Modeling in Online Services

Online payment services have penetrated into people's lives. The increased convenience, though, comes with inherent security risks [1]. Cybercrime involving online payment services often has the characteristics of diversification, specialization, industrialization, concealment, scenario dependence, and cross-region operation, which makes the security prevention and control of online payment extremely challenging [2]. There is an urgent need for effective and comprehensive online payment fraud detection.

The behavior-based method is recognized as an effective paradigm for online payment fraud detection [3]. Generally, its advantages can be summarized as follows. Firstly, behavior-based methods adopt a non-intrusive detection scheme that guarantees the user experience, requiring no user operation during implementation. Secondly, they change the fraud detection pattern from one-time to continuous and can verify each transaction. Thirdly, even if a fraudster imitates the daily operation habits of the victim, the fraudster must eventually deviate from the user's behavior to profit at the victim's expense, and this deviation can be detected by behavior-based methods. Finally, behavior-based methods can be used cooperatively as a second line of security, rather than replacing other types of detection methods.

The effectiveness of behavior-based methods often depends heavily on the sufficiency of user behavioral data [4]. As a matter of fact, user behavioral data that can be used for online payment fraud detection are often low-quality or restricted due to the difficulty of data collection and user privacy requirements [5]. In a word, the main challenge here is to build a high-performance behavioral model by using low-quality behavioral data. This challenging problem can naturally be addressed in two ways: data enhancement and model enhancement.

For behavioral model enhancement, a widely recognized way is to build models from different aspects and integrate them appropriately. For model classification, one type is based on the behavioral agent, since it is a critical factor of behavioral models.
According to the granularity of agents, behavioral models can be further divided into individual-level models [6–9] and population-level models [10–13].

In this work, we focus on the other way, i.e., behavioral data enhancement. For this way, a basic principle is to deeply explore the relationships underlying the transaction data. More fine-grained correlations can provide richer semantic information for generating high-performance behavioral models. Existing studies on data enhancement for behavioral modeling mainly focus on mining and modeling the correlations (including co-occurrences) between behavioral features and labels [14]. To further improve data enhancement, a natural idea is to investigate and utilize more fine-grained correlations in behavioral data, e.g., those among behavioral attributes.

As the main contribution of our work, we aim to effectively model the co-occurrences among transactional attributes for high-performance behavioral models. For this purpose, we propose to adopt the heterogeneous relation network, a special form of the knowledge graph [15], to represent the co-occurrences effectively. Here, a network node (or say an entity) corresponds to an attribute value in transactions, and an edge corresponds to a heterogeneous association between different attribute values. Although the relation network can express the data more appropriately, it cannot by itself solve the data imperfection problem for behavioral modeling; that is, it has no effect on enhancing the original low-quality data.

An effective data representation preserving these comprehensive relationships can act as an important means of relational data enhancement. To this end, we introduce network representation learning (NRL), which effectively captures deep relationships [16]. Deep relationships make up for low-quality data in fraud detection and improve the performance of fraud detection models. By calculating the similarity between embedding vectors, more potential relationships can be inferred, which partly solves the data imperfection problem. In addition to data enhancement, NRL transforms traditional network analysis from artificially defined features to automatically learned features, which extracts deep relationships from numerous transactions.

The final performance of behavioral modeling for online fraud detection directly depends on the harmonious cooperation of data enhancement and model enhancement. Different types of behavioral models need matching network embedding schemes to achieve excellent performance. This is one of the significant technical problems in our work. We aim to investigate the appropriate network embedding schemes for population-level models, individual-level models, and models with different generalized behavioral agents. More specifically, for population-level models, we design a label-free heterogeneous network to reconstruct online transactions and then feed the features generated in the embedding space into state-of-the-art machine learning classifiers to predict fraud risks; for individual-level models, we turn to a label-aware heterogeneous network that distinguishes the relations between attributes of fraudulent transactions, and further design multiple naive individual-level models that match the representations generated from the label-aware network. Furthermore, we combine the population-level and individual-level models to realize complementary effects by overcoming each other's weaknesses.
The main contributions can be summarized as follows:
• We propose a novel and effective data enhancement scheme for behavioral modeling by representing and mining more fine-grained attribute-level co-occurrences. We adopt heterogeneous relation networks to represent the attribute-level co-occurrences, and extract those relationships in depth by heterogeneous network embedding algorithms.
• We devise a unified interface between network embedding algorithms and behavioral models by customizing the preserved relationship networks according to the classification of behavioral models.
• We implement the proposed methods on a real-world online banking payment service scenario. It is validated that our methods significantly outperform the state-of-the-art classifiers in terms of a set of representative metrics in online fraud detection.
2.2 Related Work

With the rapid development of online payment services, fraud in online transactions is emerging in an endless stream. Detecting fraud by behavioral models has become a widely studied area and attracted many researchers' attention.
2.2.1 Composite Behavioral Modeling

In this part, we briefly review different behavior-based fraud detection methods according to the types of behavioral agents [5, 17, 18].

Individual-Level Model. Many researchers have concentrated on individual-level behavioral models to detect abnormal behavior that is quite different from an individual's historical behavior. These works paid attention to user behavior that is almost impossible to forge at the terminal, or focused on users' online business behavior whose patterns differ from normal ones. Vedran et al. [19] explored the complex interaction between social and geospatial behavior and demonstrated that social behavior can be predicted with high precision. Yin et al. [4] proposed a probabilistic generative model combining spatiotemporal data and semantic information to predict user behavior. Naini et al. [7] studied the task of identifying users by matching the histograms of their data in an anonymous dataset with the histograms from the original dataset. Egele et al. [8] proposed a behavior-based method to identify compromises of high-profile accounts. Ruan et al. [3] conducted a study on online user behavior by collecting and analyzing user clickstreams of a well-known OSN. Rzecki et al. [20] designed a data acquisition system to analyze the execution of single-finger gestures on a mobile device screen and indicated the best classification methods for person recognition based on the proposed surveys.
Alzubaidi et al. [9] investigated representative methods for user authentication on smartphone devices, covering seven types of behavioral biometrics: hand waving, gait, touchscreen, keystroke, voice, signature, and general profiling.

Population-Level Model. These works mainly detect anomalous behaviors at the population level that are strongly different from other behaviors, but they do not consider that the individual-level coherence of user behavioral patterns can be utilized to detect online identity thieves. Mazzawi et al. [10] presented a novel approach for detecting malicious user activity in databases by checking users' self-consistency and global consistency. Lee and Kim [21] proposed a suspicious URL detection system to recognize anomalous user behaviors on Twitter. Cao et al. [11] designed and implemented a malicious account detection system for detecting both fake and compromised real user accounts. Zhou et al. [12] proposed an FRUI algorithm to match users among multiple OSNs. Stringhini et al. [22] designed a system named EVILCOHORT, which can detect malicious accounts on any online service with the mapping between an online account and an IP address. Meng et al. [23] presented a static sentence-level attention model for text-based speaker change detection by formulating it as a matching problem of utterances before and after a certain decision point. Rawat et al. [24] proposed three methodologies to cope with suspicious and anomalous activities, such as the continuous creation of fake user accounts, the hacking of accounts, and other illegitimate acts in social networks. VanDam et al. [25] focused on studying compromised accounts on Twitter to understand who the hackers were, what type of content they tweeted, and what features could help distinguish compromised tweets from normal tweets. They also showed that extra meta-information could help improve the detection of compromised accounts.
2.2.2 Customized Data Enhancement

To enhance the representation of data in behavioral models, researchers have focused on the deep relationships underlying the data. In the following, we summarize the related literature. Zhao et al. [26] proposed a semi-supervised network embedding model by adopting a graph convolutional network capable of capturing both the local and global structure of a protein-protein interaction network, even when no information is associated with each vertex. Li et al. [27] incorporated word semantic relations into latent topic learning through a word embedding method, to compensate for the Dirichlet Multinomial Mixture model's lack of access to background knowledge when modeling short texts. Baqueri et al. [28] presented a framework to model residents' travel and activities outside the study area as part of the complete activity-travel schedule by introducing external travel, in order to address distorted travel patterns. Chen et al. [29] proposed a collaborative and adversarial network (CAN), which explicitly models the common features between two sentences to enhance sentence similarity modeling.
Catolino et al. [30] devised and evaluated the performance of a new change prediction model that further exploits developer-related factors (e.g., the number of developers working on a class) as predictors of the change-proneness of classes. Liu et al. [31] proposed a novel method for disaggregating the coarse-scale values of group-level features in nested data, to overcome the limitation in predictive performance, especially the difficulty of identifying potential cross-scale interactions between local and group-level features when applied to datasets with limited training examples.
2.3 Fine-Grained Co-occurrences for Behavior-Based Fraud Detection

2.3.1 Fraud Detection System Based in Online Payment Services

We focus on the fraud detection issue in a typical pattern of online payment services, i.e., online B2C (Business-to-Customer) payment transactions. Here, to acquire the victim's money, fraudulent behaviors usually differ from the victim's daily behavior. This is the fundamental assumption behind the feasibility of behavior-based fraud detection. Based on this assumption, the research community is committed to designing behavioral models that effectively distinguish differences in behavioral patterns. The main challenge of this problem is to build a high-quality behavioral model by using low-quality behavioral data. Naturally, from both aspects, there are two corresponding ways to solve this problem: data enhancement and model enhancement. In this work, we aim at devising the corresponding data enhancement schemes for the state-of-the-art behavioral models that act as well-recognized approaches to model enhancement [14]. More specifically, to realize data enhancement for behavioral modeling effectively, we adopt the relation graph and heterogeneous network embedding techniques to represent and mine more fine-grained co-occurrences among transactional attributes. Then, based on the enhanced data, the corresponding behavioral models (or enhanced behavioral models) can be adopted to realize fraud detection. Thereout, as illustrated in Fig. 2.1, the whole flow of the data-driven fraud detection system consists of three main parts: data representation, data enhancement, and model enhancement. Before describing the detailed methods, we summarize the relevant concepts and notations in Table 2.1 as preparation.
Fig. 2.1 Workflow of the fraud detection system

Table 2.1 Notations of parameters

Variable    | Description
T           | The set of transaction history records
T_1         | The set of fraudulent transaction history records
T_0         | The set of normal transaction history records
T^B         | The set of B2C transaction history records
T^C         | The set of C2C transaction history records
t_i         | A transaction with unique identifier i
attr_i^j    | The j-th attribute of the transaction with unique identifier i
φ^P(·)      | The representation mapping function for the label-free network
φ^I(·)      | The representation mapping function for the label-aware network
sim(X, Y)   | The similarity between vectors X and Y
I_{g_u^a}   | The set of all identifiers involving the agent g_u^a
P_{g_u^a}   | The behavioral model with the agent g_u^a
r_a(i)      | The judgment of the a-type agent model on the transaction with unique identifier i

2.3.1.1 Data Representation
Online payment transaction records are usually relational data that consist of multiple entities representing the attributes in transactions. We employ a relation graph, which express the data more appropriately in online payment services, to reconstruct losslessly transaction record data, including B2C and C2C transactions. Lossless Native Graph. Every attribute of a transaction is regarded as the entity. For each transaction, we establish the relationships between each entity and its identifier, e.g., the transaction number. Furthermore, we attach each identifier a label to denote whether this transaction is fraudulent or normal. According to the property of trans-
2.3 Fine-Grained Co-occurrences for Behavior-Based Fraud Detection
17
Fig. 2.2 An exemplary procedure from the native graph (left) to derivative network (right), where B2C transaction contains .8 attributes and C2C transaction contains .3 attributes
actions, the set of transactions, denoted by .T , can be divided into two disjointed subsets, i.e., the normal and fraudulent transaction sets, denoted by .T0 and .T1 . Since an entity may appear in different transactions, we use the co-occurrence relationship to further connect the graphs formed by different transactions. Naturally, we call this graph formed by relational data a native graph, as illustrated in the left part of Fig. 2.2. Note that the data reconstruction by relation graph merely acts as the initialization of our data enhancement scheme, while it has no real effect on solving the insufficiency of behavioral data. The so-called data insufficiency for behavioral modeling means that, for a given behavioral agent, the existing data are not sufficient to reflect the behavior pattern of this agent. For example, when some accounts with low-frequency behavioral records are regarded as the behavioral agents, their existing behavioral data are possibly too sparse to effectively serve as a data basis for behavioral modeling. 2.3.1.2
Data Enhancement
In this work, we utilize network embedding techniques [16] to realize the data enhancement for behavioral modeling. Network embedding is outstanding in solving graph related problems and effectively mines deep relationships. Then, the network structure to be preserved should be determined before a network embedding operation is launched. The network embedding that preserves the network structure of native graph cannot directly help behavioral modeling for online payment fraud detection. The reasons can be summarized as follows: (1) Under the real-time requirement of online payment fraud detection, it is intolerable to perform network embedding operation for every new transaction due to the response latency lead by large computing overhead. Thus, the uniqueness of transaction number (i.e., identifier) directly destroys the possibility of adopting network embedding online. (2) There is no need to embed the identifier, say the transaction number, into the vector space, since it’s not a valid feature to represent user behavioral patterns. We are interested in the co-occurrence relationships among different behavioral entities rather than the relationship between a unique identifier and its entities.
Therefore, we need to generate a new derivative network of transaction attributes based on the native graph, in preparation for the network embedding. Customized Derivative Networks. In the data we collected, there are both B2C and C2C transactions. The proportion of frauds in C2C transactions is negligible compared with that in B2C transactions [32]. Moreover, the mechanism of C2C fraud transactions is essentially different from that of B2C ones [33]. Thus, we limit the scope of this work to online B2C fraudulent transaction detection. We utilize C2C transactions as supplementary (not necessary) information for extracting the relationships among behavioral agents of B2C transactions, i.e., account numbers, from the native graph. Then, we adopt different methods to handle B2C and C2C transactions in the native graph: (1) For B2C transactions, we define two different vertices, say u and v, that originally connect to the same unique identifier as a vertex pair, and view it as an edge e = (u, v). For example, a B2C transaction with m attributes has m + 1 vertices and m edges in the native graph, while it correspondingly has m vertices and m(m − 1)/2 edges in the derivative network. (2) For C2C transactions, we only choose a special attribute pair that has at least one attribute appearing in B2C transaction records as vertices, e.g., the pair of account number and account number, and use the other attributes of these transactions to weight the edges between the special attribute pair. We will analyze the impact of C2C transactions on the model and show the gains from C2C transactions in Sect. 2.3.2.3. We refer to such a denser network generated from the native graph as a derivative network. An exemplary illustration is provided in Fig. 2.2. The specific structure of derivative networks depends on the data requirements of specific behavioral models. We have also tried to assign different derivative network structures. From the complete graph to the minimum connected graph, we attempted to consider only the node pairs associated with account_number as edges. The results turn out to be much poorer than those of the complete graph structure, and we conclude that a special attribute, such as account_number, does not necessarily play a decisive role. So we adopt a complete graph structure including arbitrary node pairs in the derivative network, and compute the similarity between all node pairs. Heterogeneous Network Embedding. The specific vector spaces corresponding to the derivative networks are learned by heterogeneous network embedding algorithms [34–38]. For the behavioral models, we obtain the mapping functions from vertices to vectors in specific vector spaces, denoted by ϕ(·). To infer more potential relationships, we calculate the metric sim(X, Y) as a feature for each transaction, where the vectors X, Y stem from ϕ(·).
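The following minimal sketch (not the authors' code) illustrates the derivative-network construction described above: each B2C transaction becomes a clique over its attribute values, with co-occurrence counts as primary edge weights. It assumes pandas and networkx are available; column and value names are purely illustrative.

```python
# Illustrative sketch of turning B2C transaction records into a derivative
# co-occurrence network. `b2c` is assumed to be a pandas DataFrame whose columns
# are the selected attributes; values are prefixed with the column name so that
# nodes of different attribute types never collide.
from itertools import combinations

import networkx as nx
import pandas as pd


def build_derivative_network(b2c: pd.DataFrame) -> nx.Graph:
    g = nx.Graph()
    for _, row in b2c.iterrows():
        # One transaction with m attributes becomes a clique of m vertices and
        # m(m-1)/2 edges; the unique identifier itself is not embedded.
        nodes = [f"{col}={row[col]}" for col in b2c.columns]
        for u, v in combinations(nodes, 2):
            if g.has_edge(u, v):
                g[u][v]["weight"] += 1   # primary weight = co-occurrence count
            else:
                g.add_edge(u, v, weight=1)
    return g


if __name__ == "__main__":
    toy = pd.DataFrame(
        {"account_number": ["a1", "a1", "a2"],
         "merchant_number": ["m7", "m9", "m7"],
         "type": ["t1", "t1", "t2"]}
    )
    net = build_derivative_network(toy)
    print(net.number_of_nodes(), net.number_of_edges())
```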
2.3.1.3 Model Enhancement
In this work, we classify user behavioral models into two kinds according to the granularity of behavioral agents, i.e., the population-level model [13] and individual-
level model [6]. Accordingly, we establish the population-level model and the individual-level model based on the customized derivative network, respectively: Population-Level Models. The population-level models identify fraud by detecting population-level behavioral anomalies, e.g., behavioral outlier detection [39] and misuse detection [40]. Classifiers based on behavioral data can act as this type of model. For their data enhancement, we only need data refactoring for the classifiers by preserving the co-occurrence frequency of behavioral attributes. To this end, we generate a derivative network where the vertices are transaction attributes and the weighted edges represent the co-occurrence frequency, taking no account of transaction labels. We say such a derivative network is label-free. Transaction labels only come into play in the training process of the models. By embedding the label-free network, we get the mapping relationship ϕ^P(·). Then, we feed the features based on ϕ^P(·) into machine learning based classifiers [41]. Individual-Level Models. The individual-level models identify fraud by detecting the behavioral anomalies of individuals. They are regarded as a promising paradigm of fraud detection, but their efficacy heavily depends on the sufficiency of behavioral data. To build the individual-level regular/normal behavioral models, we need to represent the regularity and normality of transaction behavioral data. Then, we should take the labels into account when generating the derivative network. We extract positive relationships generated from T0 and negative relationships generated from T1. A positive relationship enhances the correlation between the agents involved, while a negative relationship weakens it. We say such a derivative network is label-aware. By applying the network embedding method to the label-aware network, we get the mapping relationship ϕ^I(·). Further, we establish the individual-level probability models in view of ϕ^I(·) [42]. Composite Behavioral Models. Learning models from different aspects can lead to more reliable results. We adopt a union approach to reconcile the judgments from different individual models to improve reliability [43]. Between the population level and the individual level, we utilize the intersection to integrate judgments; that is, a fraud is determined only if the judgments of both models are fraudulent. Our fraud detection model thus consists of two levels of models that play complementary roles. After employing the composite behavioral models, an incoming B2C transaction can be transformed into high-quality features based on the learned vectors, and further be predicted as either fraudulent or normal.
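A hedged sketch of the label-free versus label-aware edge weighting described above follows; it is a simplified illustration, not the authors' implementation. Each transaction is represented as a list of attribute-value nodes plus a 0/1 label, and the function name and gamma parameter are illustrative.

```python
# Label-free vs. label-aware edge weighting: normal transactions (label 0)
# contribute positively; in the label-aware case, fraudulent transactions
# (label 1) contribute negatively and non-positive edges are dropped, since
# they cannot be used by random-walk based embedding.
from collections import defaultdict
from itertools import combinations


def accumulate_weights(transactions, label_aware: bool, gamma: float = 1.0):
    weights = defaultdict(float)
    for nodes, label in transactions:
        sign = 1.0
        if label_aware and label == 1:
            sign = -gamma                     # negative contribution from T1
        for u, v in combinations(sorted(nodes), 2):
            weights[(u, v)] += sign
    if label_aware:
        weights = {e: w for e, w in weights.items() if w > 0}
    return dict(weights)


if __name__ == "__main__":
    txs = [(["acct=a1", "mcht=m7", "type=t1"], 0),
           (["acct=a1", "mcht=m7", "type=t1"], 1)]
    print(accumulate_weights(txs, label_aware=False))
    print(accumulate_weights(txs, label_aware=True))
```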
2.3.1.4 Graph Representation of Transaction Records
First of all, our method needs to represent the transactional data in the form of a heterogeneous information network, and then applies the attribute vectors to subsequent tasks. These attribute vectors are obtained from the heterogeneous network embedding of transactions. Next, we present the process of generating the heterogeneous native graph.
Denote a set of transaction history records
$$ T = T^B \cup T^C, $$
where T^B and T^C represent the sets of B2C and C2C transaction history records, respectively. Let t_i ∈ T denote a transaction, where i is the unique identifier of t_i. Transactions are characterized by a sequence of attributes. We denote the j-th attribute of a transaction t_i as attr_i^j. Usually, some attributes have continuous values; we need to discretize these values and then naturally build a native graph based on the unique identifiers of transactions. We choose the values of attributes and the unique identifiers as vertices of the native graph. A pair (i, attr_i^j) appearing in a transaction is defined as an edge of the native graph. In this work, we execute our method on an online banking payment dataset where a B2C transaction contains 8 attributes and a C2C transaction contains 3 attributes. To reconstruct the data losslessly, we build a native graph as illustrated in the left of Fig. 2.2. Here, an attribute value that appears in multiple different transactions corresponds to only one vertex in the native graph. Recall that we attach a label (0 or 1) to each identifier to divide all transaction history records into two disjoint subsets, i.e., the normal and fraudulent transaction sets, denoted by T0 and T1, respectively.
2.3.1.5 Network Embedding
Derivative Network. A heterogeneous information network that reflects the impact of transaction labels is what our model needs. We focus on treating the relationships generated from normal transactions and from fraudulent transactions unequally in the population-level and individual-level models. For the population-level model, which learns the difference between normal and fraudulent transactions from all transactions, we only represent the transactions and leave the task of identifying labels to the model. For the individual-level model, which establishes user behavioral patterns from normal transactions, it is necessary to embody the label of a transaction in the derivative network. For that we set two hyperparameters, β and γ, for fraudulent transactions to distinguish them from other transactions, and formulate the weight w_e of an edge e as
$$ w_e(\beta, \gamma) = \sum_{e \to T_0} \varepsilon_e + (-\gamma)^{\beta} \cdot \sum_{e \to T_1} \varepsilon_e, \qquad (2.1) $$
where the operator → means that a given edge in a derivative network corresponds to the relation between two attributes of the transactions in a specific transaction set, and ε_e > 0 is the primary weight of edge e depending on its type: when e → T0, the weight of e is equal to ε_e > 0; when e → T1, the weight of e is equal to (−γ)^β · ε_e, with γ ≥ 0 and β = 0 or 1 cooperatively acting as the adjustment coefficients. A larger weight of an edge indicates that its two vertices (corresponding to two transaction attributes) are more closely relevant. In this work, we simply divide the edges into two kinds according to whether or not the edges are directly relevant
to account numbers. We set the primary weights of the latter kind of edges as a proportion of those of the former kind. For example, we set this proportion to be 0.55, whose adjustment procedure will be introduced later in Sect. 2.3.2.2. We follow two principles in the process of constructing derivative networks. The principle of relationship extraction, as in Eq. (2.1), is that the more co-occurrences in T0, the greater the weights of edges. The other one is to remove the vertices corresponding to the transactional unique identifiers of the native graph. For the transactions in T^C, we retain on the derivative network the transactional account-number attributes that appear in T^B. For e → T^C, e merely contains one type, (account number, account number), and the other attributes are defined as influence factors of the edge's weight. In the B2C scenario, we retain all other attributes except the unique identifier, and then define two different vertices that connect to the same unique identifier in the native graph as a vertex pair, viewed as an edge in the derivative networks. In the above description, we find that the weights of edges exhibit a marked disproportion due to the summation over large datasets. For instance, the weight of an edge between an account number and a transactional time is small when there are very few transactions related to the account number in the dataset, but the weight of an edge between the transactional type and transactional time is tremendous because it can appear in transactions with various account numbers. This huge gap is not conducive to reflecting the real relationships of different vertices. We introduce a mapping function to smooth the gap in the weights of edges and map the weight w to the interval [0, 1], that is,
$$ S(w_e) = \frac{1}{1 + \exp(-\ln(\alpha \times w_e) + \theta)}, \qquad (2.2) $$
where the parameters α and θ change the weights and control how fast the gap is reduced. The parameter α controls the degree of change of the weights; the parameter θ also controls the degree of change, but it plays an important role when w is relatively large. We set α to a low value to ensure that the ratio of two edges' weights becomes smaller, and set θ to a high value to ensure that the ratio of two edges' weights stays as constant as possible when w is relatively small. In the dataset adopted in our work, we set α to 1.8 and θ to 5, whose adjustment procedures will be introduced later in Sect. 2.3.2.2. This strategy makes the gap shrink moderately when the weight w is tremendous and change as little as possible when it is small. Heterogeneous Network Embedding. Heterogeneous network embedding is a specific kind of network embedding. To transform networks from network structure to vector space, the commonly used models mainly include random walks [34], matrix factorization [16], and deep neural networks [37]. We use a well-recognized heterogeneous network embedding algorithm called HIN2Vec [35] to represent the derivative networks. Compared with other similar algorithms, HIN2Vec distinguishes the different relationships among vertices and treats them differently by learning the relationship vectors together. Besides, it does not rely on artificially defined meta-paths.
Table 2.2 Main parameters

Attribute | Value | Explanation
Dimensionality | 128 | Dimensionality of node vectors
Number of random walks | 10 | Number of random walks starting from each node
Length of random walks | 160 | Max length of each random walk
Length of meta-paths | 5 | Max window length of context
Negative sampling rate | 5 | Number of examples for negative sampling
Initial learning rate | 0.025 | Initial learning rate in stochastic gradient descent
The parameter settings in HIN2Vec affect the representation learning and the application performance. We explain some main parameters in Table 2.2, which shows the parameter values of our experiments for reference. Note that the settings are related to the size of the input network. A small dimensionality is not sufficient to capture the information embedded in the relationships among nodes, but a large value may introduce noise and cause overfitting; a larger network might need a larger dimensionality. The number and length of random walks determine the amount of sample data: the greater the values, the more sample data. Generally, the performance continues to improve and then converges when the values are large enough. Though a large meta-path length may not affect the performance significantly, it is still helpful in capturing high-hop relationships. The negative sampling rate determines the proportion of negative samples in representation learning.
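As a simplified, homogeneous stand-in for HIN2Vec (whose own interface is not reproduced here), the sketch below generates plain random walks on a networkx graph and feeds them to gensim's Word2Vec, only to show where the parameters of Table 2.2 plug in. All function names and default values are illustrative.

```python
# Random-walk corpus + skip-gram embedding as a stand-in for heterogeneous
# network embedding: dimensionality, number/length of walks, context window,
# negative sampling rate, and initial learning rate mirror Table 2.2.
import random

import networkx as nx
from gensim.models import Word2Vec


def random_walks(g: nx.Graph, num_walks: int = 10, walk_length: int = 160, seed: int = 0):
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in g.nodes():
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(g.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks


def embed(g: nx.Graph) -> Word2Vec:
    return Word2Vec(
        sentences=random_walks(g),
        vector_size=128,   # dimensionality of node vectors
        window=5,          # max context window (meta-path length analogue)
        negative=5,        # negative sampling rate
        alpha=0.025,       # initial learning rate
        sg=1, min_count=1, workers=4,
    )
```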
2.3.1.6 Fraud Detection Models
Fraud Detection in the Population-Level Model. A heterogeneous information network that fully reflects all transactions contains all the edges and vertices that have appeared in the native graph. We treat the co-occurrence relationships generated from T0 or T1 equally by setting γ = 1 and β = 0, that is, free of transaction labels. In the population-level model, we need to learn from fraudulent transactions, capturing the manifestations of fraudulent and normal transactions with advanced classifiers. So we select the label-free network as the input of the heterogeneous network embedding method. We then get a mapping function ϕ^P(·), which gives the vector representation of attributes in transactions. For an attribute attr_i^j, ϕ^P(attr_i^j) is its representation in the vector space learned from the label-free network. In the simplest case, we replace each attribute in a transaction with its vector representation. A transaction with m attributes is then represented as a matrix of size d × m, where d is the dimension of the vector representation. But we observe
that this solution does not work well and takes up plenty of computing and storage resources. What we need are features that can summarize a group of transactions, so the features should be shared by similar transactions. To this end, we choose to calculate the similarity of any two vector representations as new features based on the above matrix. Specifically, we get m(m − 1)/2 similarities to represent a transaction record. The procedure of computing the similarity is formalized as follows. Given a transaction with m attributes and unique identifier i, and the vectors ϕ^P(attr_i^1), ϕ^P(attr_i^2), ..., ϕ^P(attr_i^m), where ϕ^P(·) yields a d-dimensional vector, for ϕ^P(attr_i^j) and ϕ^P(attr_i^k) we have
$$ sim(\mathbf{X}, \mathbf{Y}) = \frac{\sum_{s=1}^{m} (x_s \times y_s)}{\sqrt{\sum_{s=1}^{m} x_s^2} \times \sqrt{\sum_{s=1}^{m} y_s^2}} \qquad (2.3) $$
by using the Cosine similarity, where X and Y respectively represent ϕ^P(attr_i^j) and ϕ^P(attr_i^k), and x_s, y_s respectively represent the value on the s-th dimension of the vectors X and Y. The Cosine similarity pays more attention to the difference between two vectors in direction and is not sensitive to numerical value. The population-level model fits well with the Cosine similarity since it focuses on the tendency of most individuals. To better represent a transaction, we also calculate the average and variance of the similarities. We denote by sim_avg(i) and sim_var(i) the average and variance for the transaction with unique identifier i. For a transaction without missing values, they are calculated as follows:
$$ sim\_avg(i) = \frac{2}{m(m-1)} \sum_{j=1}^{m-1} \sum_{k=j+1}^{m} sim\big(\varphi^P(attr_i^j), \varphi^P(attr_i^k)\big), $$
$$ sim\_var(i) = \frac{2}{m(m-1)} \sum_{j=1}^{m-1} \sum_{k=j+1}^{m} v(i, j, k), \qquad (2.4) $$
where
$$ v(i, j, k) = \Big(sim\big(\varphi^P(attr_i^j), \varphi^P(attr_i^k)\big) - sim\_avg(i)\Big)^2. $$
In reality, a transaction may have some missing values; we then consider the corresponding similarities as missing as well. When calculating the average and variance, we do not consider the items corresponding to those missing values. In this work, we use the cosine similarities between vectors, together with their average and variance, as new features. All the new features can be quickly calculated, thus ensuring that our model can easily complete the feature transformation based on network embedding. In the real online payment scenario, we divide training samples and testing samples in time order to avoid time-crossing problems [44]. Time-crossing means using information that has not yet occurred when a transaction is tested. We use all the data from the training samples to build a label-free network, and get the mapping
function ϕ^P(·) by heterogeneous network embedding. Then we complete the feature transformation on all data, training and testing samples, based on the mapping function ϕ^P(·). We get the population-level model by fitting the training samples with existing classifiers, e.g., XGBoost. For an incoming transaction or the testing samples, we input them into the population-level model after feature engineering, and make a discriminant prediction to obtain the probability of fraud in the transaction. Fraud Detection in Individual-Level Models. In the individual-level model, the derivative network needs to reflect the behavioral distribution of all normal transactions without wasting the information brought by fraudulent transactions. Our idea is that the information on normal transactions enhances the association of attribute vertices in the derivative network; on the contrary, the information brought by fraudulent transactions weakens this connection. Therefore, we stipulate that an edge has a positive weight value when it is generated from T0, and a negative weight value when the relationship occurs in T1, by setting γ = 1 and β = 1. This strategy effectively utilizes label information, which is also the biggest difference from the label-free network for the population-level models. In some cases, the number of a special relationship in T1 is much bigger than that in T0, which causes the weight of some edges to become negative or zero. Our solution is to remove these edges from the derivative networks. One reason is that, when the weight of an edge is negative or zero, the relationship it reflects is negligible in the behavioral distribution of all normal transactions that we want to obtain. The other reason is that negative weights cannot be applied to the random walk process of the network embedding method we adopt. Similar to the population-level model, we get a mapping function ϕ^I(·), which gives the vector representation of the attributes in T. Next, we discuss how to build behavioral models based on network embedding. We denote the agent as the basic unit in the models; that is, an agent is an individual, and all transactions sharing a common agent value reflect the agent's stable pattern. Taking our online transaction records as an example, the attribute account_number is a common choice as an agent. Under this agent, transactions are divided into different parts, such that all transactions in each part have the same account number. Assuming that an agent's behavioral pattern is stable, we can detect anomalies by comparing against the behavioral models. We discuss behavioral models from the perspectives of single agents and multiple agents, respectively. Single-Agent Behavioral Model. Similar to the feature transformation in the population-level model, we calculate the similarity of any two vector representations based on the d × m matrix that represents a transaction with m attributes, where d is the dimension of the vectors given by ϕ^I(·). One difference is that the similarity is calculated differently. Given vectors X and Y, let x_s, y_s respectively represent the value on the s-th dimension of X and Y. We have
$$ sim'(\mathbf{X}, \mathbf{Y}) = \sqrt{\sum_{s=1}^{m} (x_s - y_s)^2} \qquad (2.5) $$
by using the Euclidean distance, which emphasizes the difference in numerical value and is therefore appropriate for characterizing each individual. For the vectors ϕ^I(attr_i^j) and ϕ^I(attr_i^k), we calculate the similarity sim'(ϕ^I(attr_i^j), ϕ^I(attr_i^k)) according to Eq. (2.5). We introduce cohesivity to express the importance of a transaction in the behavioral model and denote by C(i) the cohesivity of the transaction with unique identifier i. The cohesivity C(i) can be computed in the following way:
$$ C(i) = \frac{1}{cm_0 + \sum_{j=1}^{m-1} \sum_{k=j+1}^{m} cm_l \times v'(i, j, k)}, \qquad (2.6) $$
where v'(i, j, k) = sim'(ϕ^I(attr_i^j), ϕ^I(attr_i^k)) and l = m(j − 1) + k − 1. The value cm_l represents the l-th value in the coefficient matrix
$$ \big[\, cm_0, cm_1, cm_2, \cdots, cm_{m(m-1)/2} \,\big], $$
where cm_l, for l = 0, 1, 2, ..., m(m − 1)/2, can be determined by linear regression. For all samples without missing values, we calculate the similarities as new features according to Eq. (2.5), and then fit a linear regression on them to obtain the coefficient matrix; the regression coefficients and the offset correspond to [cm_1, cm_2, ..., cm_{m(m−1)/2}] and cm_0, respectively. Denote an agent as g_u^a, where a is the attribute type corresponding to the agent, and u represents the value of attribute a for this agent. Accordingly, we denote the set of agents referring to a as G^a. Let I_u^a denote the set of all transactional identifiers involving g_u^a. Furthermore, we define I^a := ∪_u I_u^a. At this point, we formally define the behavioral model as follows. For a given agent g_u^a ∈ G^a, its behavioral model is defined as P_{g_u^a}, which is a discrete probability distribution function reflecting the normal transactional patterns. For every possible transaction identifier i, we have its corresponding probability p_{g_u^a}(i) of occurrence in P_{g_u^a}. The procedure of computing p_{g_u^a}(i) is formalized as
$$ p_{g_u^a}(i) = \frac{\sigma(C(i))}{\sum_{i' \in I_u^a} \sigma(C(i'))}, \qquad (2.7) $$
where σ(z) = 1/(1 + exp(z)) is the sigmoid function. In practice, the size of I_u^a, denoted by |I_u^a|, depends on the product of the number of available values of all attribute types other than the agent attribute a. So our behavioral model is a special case, a discrete probability distribution, obtained by calculating the probability of each transaction in fraud detection. We adopt the same method as for the population-level model to divide the training samples and test samples, and only use the training samples to build the model. For some u, |I_u^a| is often a large value, and the computational overhead of the probability distribution becomes unbearable. We use a clustering algorithm to overcome this problem. For vectors referring to the same attribute type in the vector space, the vectors of one cluster are represented by the cluster vector; that is, similar vectors are treated as one vector, which can quickly reduce the value of |I_u^a|.
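A hedged NumPy sketch of the single-agent model in Eqs. (2.5)-(2.7) follows: pairwise Euclidean distances, a cohesivity value, and a normalized occurrence probability over one agent's identifiers. The names `coef`, `cohesivity`, and `behavioral_model` are illustrative; `coef` plays the role of the coefficient matrix [cm_0, cm_1, ...].

```python
# Cohesivity C(i) from Eq. (2.6) and the probability p_{g_u^a}(i) from Eq. (2.7)
# for a single agent, assuming each transaction is given as an (m, d) array of
# attribute vectors.
from itertools import combinations

import numpy as np


def cohesivity(vectors: np.ndarray, coef: np.ndarray) -> float:
    dists = [np.linalg.norm(vectors[j] - vectors[k])
             for j, k in combinations(range(len(vectors)), 2)]
    return 1.0 / (coef[0] + float(np.dot(coef[1:], dists)))


def sigma(z: float) -> float:
    return 1.0 / (1.0 + np.exp(z))       # as defined after Eq. (2.7)


def behavioral_model(per_txn_vectors: dict, coef: np.ndarray) -> dict:
    """Map each transaction identifier of one agent to p_{g_u^a}(i)."""
    scores = {i: sigma(cohesivity(v, coef)) for i, v in per_txn_vectors.items()}
    total = sum(scores.values())
    return {i: s / total for i, s in scores.items()}
```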
Algorithm 2.1: Building multi-agent behavioral models
Input: the set of attribute types A
Output: the set of multi-agent behavioral models F
Initialize F;
foreach a ∈ A do
    foreach g_u^a ∈ G^a do
        Initialize P_{g_u^a};
        foreach i ∈ I_u^a do
            Compute C(i) using Eq. (2.6);
            Compute p_{g_u^a}(i) using Eq. (2.7);
            Add (i, p_{g_u^a}(i)) into P_{g_u^a};
        end
        Add P_{g_u^a} into F;
    end
end
Return the set of multi-agent behavioral models F;
In this work, we choose the account number as the agent type. In other words, we establish behavioral models for all account numbers that appear in the label-aware network. We observe that single-agent models based on the account number or other attributes often fail to achieve excellent performance due to the absence of agents. An effective way to solve this problem is modelling with multiple agents. Multi-Agent Behavioral Model. To cope with insufficient or missing historical transactions of a single agent, we resort to models under different agents rather than acquiring more complete and adequate historical transactions. This part describes how we build the behavioral model to better detect a transaction under multiple agents in the case of insufficient transactions. Similar to the commonly used agent, i.e., the account number, some other attributes, e.g., the merchant number and location number, can also act as agents to build behavioral models. Note that the value space of attribute types that act as agents should not be too small; otherwise the individual-level behavioral model loses its advantage. Let A denote a set of attribute types that can act as agents. For each attribute in A, we repeatedly build the single-agent behavioral model and then add those models to the final set F. With F, we can detect the fraud probability of a transaction under different agents. The procedure of building multi-agent behavioral models is described in Algorithm 2.1. We define the fraud detection problem in individual-level behavioral models as follows: given a transaction, its fraud score, rated by its corresponding probability in the single-agent behavioral model, determines whether the transaction is fraudulent or not. This may include the following scenarios: (1) the transaction provides complete information; (2) the transaction misses values in some attributes. For the former, we can directly get its probability from the behavioral models. Since all attributes are required to calculate the fraud score of a transaction in the behavioral model, it is difficult to judge a transaction with missing values.
Algorithm 2.2: The process of fraud detection
Input: the set of attribute types A, the set of transactional identifiers I
Output: the set of judgment results R
Initialize R = ∅;
foreach i ∈ I do
    r(i) := 0;
    foreach a ∈ A do
        Get r^a(i) using Eq. (2.10);
        r(i) := r(i) ∨ r^a(i);
    end
    Get r^a(i)' as in Sect. 2.3.1.6;
    r(i) := r(i) ∧ r^a(i)';
    Add r(i) into R;
end
Return the judgment results R;
So in our model, we compute the average probability of all transactions related to the existing attributes of the transaction with identifier i as the probability p_{g_u^a}(i), and define the set of these transaction identifiers as I'_i. We then take the behavioral model P_{g_u^a} corresponding to the agent g_u^a and denote the domain of P_{g_u^a} by 𝒫_{g_u^a}. For a transaction identifier i, we get a new distribution P'_{g_u^a} by removing I'_i from the domain of P_{g_u^a}, and denote the domain of P'_{g_u^a} by 𝒫'_{g_u^a}. Next, we calculate its score score_{g_u^a}(i) as described in Eq. (2.8):
$$ score_{g_u^a}(i) = \frac{p_{g_u^a}(i) \times \exp(-H_{g_u^a})}{N_0 + \frac{1}{|\mathcal{P}'_{g_u^a}|} \times \sum_{i' \in \mathcal{P}'_{g_u^a}} p_{g_u^a}(i')}, \qquad (2.8) $$
where
$$ H_{g_u^a} = - \sum_{i \in \mathcal{P}'_{g_u^a}} p_{g_u^a}(i) \times \log_2 p_{g_u^a}(i), \qquad (2.9) $$
|𝒫'_{g_u^a}| is the cardinality of 𝒫'_{g_u^a}, and N_0 is responsible for adjusting the degree of influence on the score of transactions other than the transaction t_i in the behavioral model. The larger N_0 is, the lower the influence of the other transactions on the score. In our work, we set N_0 = 0. We observe that there is a clear distinction between fraudulent and normal transaction scores. For an attribute type a ∈ A, we set an interval Ω_a and give the judgment result according to Eq. (2.10):
$$ r^a(i) = \begin{cases} 1, & score_{g_u^a}(i) \in \Omega_a \\ 0, & score_{g_u^a}(i) \notin \Omega_a \end{cases} \qquad (2.10) $$
We denote fraudulent transactions by label 1 and normal transactions by label 0. The upper and lower limits of the interval Ω_a depend on the score distribution of the training samples. Fraud Detection in Composite Models. A single-agent behavioral model can only give a confident judgment of fraud; its normal judgment may not be reliable due to the release of transactions that cannot be checked. In this work, we imitate a one-veto mechanism to synthesize the final results returned by the multi-agent models: as long as one agent behavioral model returns a judgment marked as fraud, the final result is marked as fraud. This strategy ensures that the multi-agent model is complementary enough to capture as many fraudulent transactions as possible. So far, we have two ways, at different levels, to identify whether a transaction is fraudulent. The two methods identify fraudulent transactions from different perspectives: population-level models compare the similarity between a transaction and the learned transactional patterns, while individual-level models distinguish a transaction by contrasting its current pattern with its past patterns. We compose these two models to further improve the performance of our methods. A transaction is detected as fraudulent if and only if the results from both models are judged as fraudulent. The consistency of judgment on fraudulent transactions reduces the probability of misjudging normal transactions, and ensures better performance than a single model, i.e., the population model or the individual model alone. For different performance objectives, other combinations can also be tried, which is reserved for future research. The process of building the fraud detection model is described in Algorithm 2.2.
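The sketch below illustrates Eqs. (2.8)-(2.10) under the stated choice N0 = 0: the score of a transaction under one agent, using the entropy of the remaining distribution, followed by the interval test against Ω_a. The dictionary `p_rest`, the tuple form of the interval, and the function names are assumptions made for the example only.

```python
# Fraud score (Eq. 2.8) with entropy term (Eq. 2.9) and interval judgment (Eq. 2.10).
import math


def fraud_score(p_i: float, p_rest: dict, n0: float = 0.0) -> float:
    entropy = -sum(p * math.log2(p) for p in p_rest.values() if p > 0)
    mean_rest = sum(p_rest.values()) / len(p_rest)
    return p_i * math.exp(-entropy) / (n0 + mean_rest)


def judge(score: float, omega: tuple) -> int:
    low, high = omega
    return int(low <= score <= high)      # 1 = fraudulent, 0 = normal


if __name__ == "__main__":
    rest = {"t1": 0.2, "t2": 0.5, "t3": 0.3}
    s = fraud_score(p_i=0.05, p_rest=rest)
    print(round(s, 4), judge(s, omega=(0.0, 0.1)))
```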
2.3.2 Experimental Evaluation

To evaluate the performance of the proposed models based on co-occurrence relationships in transactions, we build heterogeneous information networks to represent these relationships, and apply the vectors obtained by heterogeneous network embedding to generate behavioral models. Through the empirical evaluation of real-world transactions, we mainly aim to answer the following three research questions:

RQ1: How do the key parameters affect the performance of our models?
RQ2: How much gain does the data enhancement scheme based on network embedding bring to the population-level and individual-level models?
RQ3: How does the design of the enhancement scheme affect the performance of our models?
In what follows, we first introduce the experimental settings, and then answer the above research questions in turn.
2.3.2.1 Experiment Settings
Datasets. To validate the performance of the proposed models, the evaluation is implemented on a real-world online banking payment transaction dataset from one of the biggest commercial banks in China, which contains three consecutive months of B2C and C2C transaction records. The main statistics of the transactions are summarized in Table 2.3. We use the data of April and May 2017 as the training samples, and the data of June 2017 as the testing samples. We also utilize the C2C transactions of April and May 2017 when we build the heterogeneous information networks. All B2C transactions are labelled either positive (fraudulent) or negative (normal). The training samples contain 2,393,817 normal transactions and 40,393 fraudulent transactions, and the testing samples contain 1,003,539 normal transactions and 24,898 fraudulent transactions. In the original set of transactions, each transaction is characterized by 64 attributes. However, most of them have sparsely valid values (about 10% to 30% on average). We finally choose 8 attributes in all to build our models, which are shown in Table 2.4. The attributes time and amount have continuous values, so these attributes need further discretization. All C2C transactions are represented by 3 attributes, which are shown in Table 2.5. Note that the attribute amount in C2C transactions does not appear in the derivative network; it only has an impact on the weights of the incident edges.
Table 2.3 The transaction information

Month | 2017.04 | 2017.05 | 2017.06 | Total
B2C Normal | 1,217,101 | 1,176,680 | 1,003,461 | 3,397,242
B2C Fraudulent | 13,271 | 27,122 | 24,898 | 65,291
C2C | 166,356 | 205,614 | \ | 371,970
Table 2.4 The selected attributes in B2C transactions

Attribute | Value | Description
account_number | Discrete | Each account_number represents a user's account
merchant_number | Discrete | Each merchant_number represents a merchant in a B2C transaction
place_number | Discrete | Each place_number represents an issuing area of banking cards used for transactions
time | Continuous | The exact time when the transaction occurred
amount | Continuous | The amount of money transferred to the merchant in a B2C transaction
ip | Discrete | Whether a commonly used ip or not in a transaction
last_result | Discrete | Judgment of the last transaction in the relevant account_number
type | Discrete | Each type represents a transaction of a different type
Table 2.5 The selected attributes in C2C transactions

Attribute | Value | Description
account1_number | Discrete | The account1_number is the initiator of the C2C transaction
account2_number | Discrete | The account2_number is the recipient of the C2C transaction
amount | Continuous | The amount of money transferred to the recipient in a C2C transaction

Table 2.6 Attribute details

Attribute | Label-aware network | Label-free network
account_number | 190,268 | 221,040
merchant_number | 2,406 | 2,419
place_number | 327 | 327
time | 8 | 8
amount | 10 | 10
ip | 2 | 2
last_result | 2 | 2
type | 11 | 11
We discretize the attribute time inspired by [45]. The time of day can be divided into four intervals. We set the four hour intervals as [0, 3), [6, 11), [15, 24), and [3, 6) ∪ [11, 15), according to the time distribution of the transactions. We further divide the attribute time into 8 unique values by distinguishing whether the day is a weekday. We take different approaches to discretize the amount attribute in B2C and C2C transactions because of the different functions of this attribute. For B2C transactions, we discretize the amounts into four different values according to the intervals [0, 60), [60, 300), [300, 3600), and [3600, +∞). For C2C transactions, we assign them the values (1, 1.5, 2, 2.5, 3) according to the intervals [0, 100), [100, 1000), [1000, 5000), [5000, 50000), and [50000, +∞). We also count the number of nodes with different attributes in the label-aware and label-free networks, as detailed in Table 2.6. Note that the difference between the numbers of nodes with attributes account_number and merchant_number in the two networks is caused by the removal of edges and nodes with respect to the label-free network. In addition, we observe that the size of the individual models is 2,406 × 327 × 8 × 10 × 2 × 2 × 11 when we choose the attribute account_number as the agent, which is commonly too large to calculate. So we cluster the 2,406 agents with attribute merchant_number and the 327 agents with attribute place_number into 11 and 5 categories, respectively. Similarly, we cluster the 190,268 agents with attribute account_number into 5 categories when we choose the attribute merchant_number or place_number as the agent.
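A small sketch of the discretization described above follows, assuming pandas Series of raw hours, weekday flags, and amounts. The interval boundaries come from the text; the bin labels and function names are arbitrary illustrative choices.

```python
# Discretization of time-of-day (plus weekday/weekend split) and of transaction
# amounts for B2C and C2C records.
import pandas as pd


def discretize_hour(hour: int, weekday: bool) -> int:
    if 0 <= hour < 3:
        slot = 0
    elif 6 <= hour < 11:
        slot = 1
    elif 15 <= hour < 24:
        slot = 2
    else:                      # [3, 6) U [11, 15)
        slot = 3
    return slot + (4 if weekday else 0)   # 8 unique time values


def discretize_b2c_amount(amounts: pd.Series) -> pd.Series:
    bins = [0, 60, 300, 3600, float("inf")]
    return pd.cut(amounts, bins=bins, labels=[0, 1, 2, 3], right=False)


def discretize_c2c_amount(amounts: pd.Series) -> pd.Series:
    bins = [0, 100, 1000, 5000, 50000, float("inf")]
    return pd.cut(amounts, bins=bins, labels=[1, 1.5, 2, 2.5, 3], right=False)
```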
Metrics. To evaluate the performance of our methods, we choose five representative and well-performing techniques as the benchmarks: logistic regression (LR), random forest (RF), naive Bayes (NB), XGBoost (XGB), and convolutional neural networks (CNN). Normally, according to industry requirements, 1% is the tolerable upper limit for the FPR (False Positive Rate), so an achieved TPR (True Positive Rate) with an FPR higher than 1% makes no sense in this work. We only focus on the meaningful part of the ROC curve instead of the whole AUC (Area Under the ROC Curve). In this part, we use Precision, Recall (TPR), Disturbance (FPR), and F1-score to comprehensively evaluate our methods.
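A hedged sketch of this evaluation protocol follows: instead of the full AUC, it reads the TPR (Recall) achievable at a fixed, small FPR (Disturbance) from the ROC curve. It assumes scikit-learn is available; the labels and scores in the example are random stand-ins.

```python
# TPR at a fixed FPR tolerance, read from the ROC curve.
import numpy as np
from sklearn.metrics import roc_curve


def tpr_at_fpr(y_true, y_score, target_fpr: float = 0.001) -> float:
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # largest TPR achievable without exceeding the tolerated FPR
    return float(tpr[fpr <= target_fpr].max())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=1000)
    scores = 0.7 * y + 0.3 * rng.random(1000)
    print(tpr_at_fpr(y, scores, 0.01))
```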
2.3.2.2 Parameter Sensitivity
In this set of experiments, we systematically evaluate the parameter sensitivity of our method. Different from k-fold cross validation, we select the last 1/3 of the training samples in time sequence as the validation samples, and use the other 2/3 to train the model during parameter tuning. Dividing the validation set in time sequence avoids the time-crossing problem and is more in line with the real application scenario than randomly selecting the validation set. Network Parameters. The parameter settings in Eq. (2.2) have a significant impact on the weights of edges in derivative networks. In our work, most of the edge weights are less than 1,000, and the largest weights are only about 10,000. So we intend to make the transformation of weights satisfy a set of ratios, where the ratio is calculated by S(w_e)/S(1). The set of ratios should satisfy the following rules: when the weight is less than 25, the ratio is close to w_e; when the weight is about 100, the ratio is close to 50; when the weight is very large, beyond 1,000, the ratio is close to 100. To determine the parameter settings, we examine such changes in weights. We vary the parameters α and θ to determine their impacts on weight changes. Except for the parameters being tested, all other parameters assume default values. We first examine different choices of the parameter α, choosing values of α from 1 to 3. The weight changes under different α are shown in Fig. 2.3a, which shows that different α slightly change the weight, but the overall trend remains similar. Next, we examine different choices of the parameter θ, choosing values of θ from 3 to 7. The weight changes under different θ are shown in Fig. 2.3a, which shows that different θ dramatically change the weight, especially when it is a huge value. Figure 2.3 also shows that the parameter α is positively correlated with S(w_e) and the parameter θ is negatively correlated with S(w_e). Finally, we observe that setting α and θ to 1.8 and 5, respectively, is an appropriate choice to reduce the huge gap between different weight values. From Table 2.6, we observe that nodes with the attribute account_number far outnumber nodes with other attributes. This imbalanced phenomenon leads to an imbalanced network structure. So we further study the average degree of these agents and find that the average degree of nodes with the attribute merchant_number is similar to that of nodes with most attributes, but is about 90 times that of nodes with the attribute account_number. Therefore, we introduce a scheme to balance the network structure.
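The following minimal sketch mirrors the tuning just described: it evaluates S(w_e) from Eq. (2.2) and prints the ratio S(w_e)/S(1) for a few candidate (α, θ) pairs and representative weights, so the target behaviour for small, medium, and very large weights can be checked. The candidate pairs and sample weights are illustrative.

```python
# Inspect how the smoothing of Eq. (2.2) compresses edge weights under
# different (alpha, theta) settings.
import math


def smooth(w: float, alpha: float, theta: float) -> float:
    return 1.0 / (1.0 + math.exp(-math.log(alpha * w) + theta))


for alpha, theta in [(1.0, 5.0), (1.8, 5.0), (3.0, 5.0), (1.8, 3.0), (1.8, 7.0)]:
    ratios = {w: round(smooth(w, alpha, theta) / smooth(1, alpha, theta), 1)
              for w in (10, 100, 1000, 10000)}
    print(f"alpha={alpha}, theta={theta}: {ratios}")
```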
Fig. 2.3 The TPR and F1-score under different FPR of integrations
Facing a node with a special attribute whose average degree is q times the minimum one, we set the weight of edges associated with the special attribute to q* (q* = 1 − q/2 × 0.01) times the weight corresponding to the minimum average degree. In this work, we set the weights of edges associated with the other attributes to 0.55 (= 1 − 90/2 × 0.01) times those of edges with the attribute account_number. Embedding Parameters. Parameter settings in network embedding methods usually make a difference to the performance of the node representation in an application. To tune appropriate settings, we vary the values of important parameters and observe how the performance changes under the population-level models. Dimensionality of Vector Space. First of all, Fig. 2.4a shows the impact of setting different dimensions d. Generally, a small d is not sufficient to capture the information embedded in the relationships among nodes, but a large d may introduce noise and cause overfitting. In our work, the best performance is achieved when d is 128. Generally, a larger network might need a larger d to capture the information embedded in the relationships between nodes. Length of Random Walks. A longer random walk can generate more sample data. Figure 2.4b shows that the performance continues to improve when the length of random walks l is increased (resulting in more sample data), and converges when l is large enough. Meanwhile, more sample data means more training time. When l is set to a very large value, it brings slight performance growth but a dramatic time increase. In our network, we set l to 160, since it achieves a balance between time consumption and performance. Length of Meta-paths. Figure 2.4c shows that the maximum length of meta-paths has a significant impact on the performance. Capturing meta-paths with a larger ω is crucial because some long meta-paths have an important semantic meaning. Note that a too large ω will bring in useless semantic information and affect performance. Setting ω to 4 or 5 is a good option in this work. We verify the performance of different network embedding schemes on the population-level model. Figure 2.4d shows the results of our experiment on the
Fig. 2.4 Parameter tuning in network embedding with different parameter pairs. Figures a, b, and c respectively show the model performance under different parameters d, l, and ω, where d, l, and ω are set to 128, 160, and 5, respectively, while the other parameters are being tested. Figure d shows the influence of different γ and β on the population-level model
XGBoost classifier, and explains why we propose a customized network to deal with labeled transactions. By setting the hyperparameters .β and .γ , we adjust the ratio of the edge weights of fraudulent transactions to normal transactions in the network at .1, .0, .−1, .−2, respectively. The ratio ‘.1’ represents that fraudulent and normal transactions are treated in the same way. That is equivalent to label-free networks. The ratio ‘.0’ represents that we only use normal transactions to build the network without fraudulent transactions. The ratios ‘.−1’ and ‘.−2’ represent the weights of edges generated by fraudulent transactions are .−1 or .−2 times that of normal transactions, respectively. We observe that the label-free network outperforms the label-aware network in the population-level models from Fig. 2.4d. We also observe that the ratios ‘.−1’ and ‘.−2’ have similar performance, which shows that small changes in the ratio have little impact on the model when the ratio is a negative value. In our naive individual-level model, it learns the normal behavioral pattern from the user historical pattern, and cannot exploit fraudulent transactions in the process of
building models. The label-aware network integrates the information of fraudulent transactions into the network structure, which is more suitable for individual-level models than label-free networks.
2.3.2.3 The Gain of Network Embedding
Performance Gain for Population-Level Models. We compare the performance of the five representative classification models described in Sect. 2.3.2.1 with their counterparts aided by the customized network embedding (NE) scheme. We set the parameters of network embedding as in Sect. 2.3.2.2. The ROC curves of the different classifiers are depicted in Fig. 2.5a. We observe that the models cooperating with network embedding, i.e., RF+NE, XGB+NE, LR+NE, NB+NE, and CNN+NE, all outperform their counterparts without network embedding. XGBoost gives the best results at different FPRs, followed by RF, CNN, LR, and NB with network embedding. When the FPR is 0.001, XGBoost with network embedding obtains a recall of 93.9%, which means that it can prevent about 94% of fraudulent transactions when the fraudster begins to act, while disturbing only 0.1% of legal transactions. Random forest performs second best, only slightly poorer than XGBoost when the FPR is smaller than 0.002. When we decrease the FPR to 0.0005, the performance of most methods does not change dramatically despite a partial drop in the TPR. It is worth noting that the performance of all models drops sharply as the FPR decreases to 0.0001; the TPR of XGBoost is then slightly lower than 50%. Except for the poor performance of NB and CNN, the other methods have almost similar recalls when FPR = 0.0001. We now have a basic understanding of the approximate performance of all candidate classifiers: XGBoost is outstanding among all candidate machine learning models at the same FPR.
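The following end-to-end sketch (not the authors' pipeline) shows how such a population-level comparison can be set up: fit an XGBoost classifier on embedding-based similarity features and report the recall at a 0.1% disturbance. It assumes the xgboost and scikit-learn packages; the feature matrices here are random stand-ins so the snippet runs on its own.

```python
# Fit a population-level classifier on similarity features and read TPR at FPR <= 0.001.
import numpy as np
from sklearn.metrics import roc_curve
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(5000, 30)), rng.integers(0, 2, size=5000)
X_test, y_test = rng.normal(size=(2000, 30)), rng.integers(0, 2, size=2000)

clf = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
clf.fit(X_train, y_train)

scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)
print("TPR at FPR<=0.001:", tpr[fpr <= 0.001].max())
```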
Fig. 2.5 The ROC curves of population-level models. Figure a shows the performance of different population-level models with or without NE. Figure b shows the impacts of different features on population-level models
To explore how much gain C2C transactions can bring to our model, we design the following four groups of experiments: (1) ‘B+C’, both using B2C and C2C transactions on our model; (2) ‘B2C’, only using B2C transactions on our model; (3) ‘Ori’, applying original transactions directly to the population-level model; (4) ‘Vec’, using B2C and C2C transactions to build the label-free network and adopting HIN2Vec method to get the embedding vectors, but detecting fraud by feeding a vector matrix, which consists of representations corresponding to attributes in a transaction, into the population-level model. From Fig. 2.5b, we find that our model is superior to other comparisons when the FPR is less than .0.15%. When the FPR is greater than .0.15%, the gain of our model decreases, and the performance is gradually consistent with other comparisons. Note that the poor performance on ‘Vec’ explains why we do not use the representation directly but introduce.sim(X, Y) for the subsequent tasks. We also observe that the C2C transactions are effectively utilized by our model. When the FPR is .0.75%, the gain of TPR reaches .2.5%. Performance Gain of Individual-Level Models. In this part, we evaluate the performance of the individual-level models in fraud detection with customized network embedding. We present the performance of single-agent behavioral models and discuss the improvements by the multi-agent model compared with the single-agent models. The improvements depend on the following two principles. The first is the completeness principle of multi-agent models. If a transaction has no historical data under a specific agent, then the single-agent model is impossible to detect the transaction. To give a more straightforward sense, we define a measure called check rate, which stands for the proportion of transactions that can be checked with fraud detection techniques under a given agent. The union of subsets of transactions that can be checked by single-agent models should be the complete set of transactions. That is, the check rate of the final multi-agent model should be .1. The second is the preferential principle of single-agent models under the completeness principle. Before integrating different single-agent models into the multi-agent model, we need to evaluate the performance of every single-agent model. If the performance of a single-agent model is too poor, it will harm the performance of the final multi-agent model. In the implementation of our proposed models, the multi-agent model can apply to all transactions, and the check rates under different single-agent models are shown in Table 2.7. By calculating the Precision and Recall with different fixed Disturbances, we investigate the performance of proposed single-agent models under the verifiable dataset as presented in Table 2.7. The Disturbances are fixed as .0.0010, .0.0015, .0.0020, .0.0050, .0.0075, and .0.0100, respectively. It is evident from Table 2.7 that these single-agent models have a stable and good performance in partial data which can be checked. When we compare the performance of multiple single-agent models and the multiagent model, we experiment with all test transactions for all behavioral models. We focus on the performance at different Disturbances between .0.001 and .0.0022. From Fig. 2.6, we can obtain three observations as follows:
Fig. 2.6 The performances of Precision (a) and Recall (b) under different fixed Disturbances in behavioral models

Table 2.7 Performance of single-agent models

Attribute/check_rate: account_number / 0.92633
Disturbance | 0.0010 | 0.0015 | 0.0020 | 0.0050 | 0.0075 | 0.0100
Precision | 0.81375 | 0.79173 | 0.74798 | 0.54692 | 0.44728 | 0.37581
Recall | 0.68418 | 0.91701 | 0.92648 | 0.93479 | 0.93677 | 0.93825

Attribute/check_rate: merchant_number / 0.53844
Disturbance | 0.0010 | 0.0015 | 0.0020 | 0.0050 | 0.0075 | 0.0100
Precision | 0.96880 | 0.95994 | 0.95147 | 0.90104 | 0.85672 | 0.81889
Recall | 0.66302 | 0.76501 | 0.84403 | 0.95988 | 0.96295 | 0.96485

Attribute/check_rate: place_number / 0.99997
Disturbance | 0.0010 | 0.0015 | 0.0020 | 0.0050 | 0.0075 | 0.0100
Precision | 0.93515 | 0.93433 | 0.91924 | 0.82686 | 0.76180 | 0.69708
Recall | 0.58559 | 0.86690 | 0.91517 | 0.96405 | 0.96582 | 0.96658
First, the single-agent models have a good performance on the partial data they can check, but do not have a stable performance on the whole data. Achieving a good performance on the partial data is a necessary but not sufficient condition for a good performance on the whole data. It is therefore worth considering the adoption of a multi-agent model that combines multiple complementary single-agent models. Second, the check rate of the single-agent model of place_number is very close to 1, yet the single-agent model of place_number underperforms the multi-agent model in terms of Precision and Recall. The reason why the multi-agent model is superior to the single-agent model of place_number is that the former combines the advantages of different single-agent models and makes a more complete judgment on the detected transactions.
Fig. 2.7 The performance of data enhancement and model enhancement in our model. Figure a shows the performance of different network embedding methods as data enhancement in the population-level model. Figure b shows the performance of model enhancement by combining the individual-level and population-level models
Third, we find that the merchant_number curve provides the most stable precision and recall, regardless of the disturbance rate. From Table 2.7, we observe that except for the single-agent model of merchant_number, which only has a check rate of about 50%, the other two single-agent models have a check rate of over 90%. In a real scenario, the single-agent model of merchant_number can not be used alone to implement anti-fraud tasks because of its low check rate. In our work, the singleagent model can only detect fraud, but can not ensure that non-fraud is normal. The high performance of the merchant_number model comes from its release of nearly half of transactions, so its performance is not representative and credible. In online payment services, the judgment results with high performance and low credibility are not acceptable. By combining the judgment of multiple single-agent models, we can make more accurate judgment results with the same credibility.
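A tiny sketch of the check-rate measure discussed above follows: the fraction of test transactions whose agent value (e.g., merchant_number) was seen in the training history, i.e., that a single-agent model is able to check at all. It assumes pandas; the function name and toy data are illustrative.

```python
# Check rate of a single-agent model: share of test transactions with a known agent value.
import pandas as pd


def check_rate(train: pd.Series, test: pd.Series) -> float:
    known = set(train.unique())
    return float(test.isin(known).mean())


if __name__ == "__main__":
    train = pd.Series(["m1", "m2", "m2", "m3"])
    test = pd.Series(["m2", "m4", "m3", "m5"])
    print(check_rate(train, test))  # 0.5
```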
2.3.2.4 Performance of Enhancement Scheme
Performance of Data Enhancement. The framework of the proposed data enhancement scheme is compatible with most network embedding methods. We compare the effects of state-of-the-art network embedding methods in the population-level model. Besides HIN2Vec, we also investigate the performance of node2vec [46], transE [47], and metapath2vec [48]. For similar parameters, we use the same values as for HIN2Vec, and we use the default values for the others. Figure 2.7a shows the ROC curves of the different network embedding methods in the population-level model. We observe that all models cooperating with the different network embedding methods have similar performance. HIN2Vec and metapath2vec perform better than node2vec and transE. The lower performance of node2vec mainly stems from its inability to distinguish the types of nodes. The transE method focuses on resolving relationships
Fig. 2.8 The performances of the population-level model (a) and individual-level model (b) under different similarity calculations. We use "#" to represent the versions with Euclidean distance on the population-level model and Cosine similarity on the individual-level model, respectively
between different entities but takes no account of the weights of relationships. Most embedding methods are feasible as data enhancement, with only slight differences in performance. We compare the performance of the behavioral models under different similarity calculations to illustrate the need to treat the two levels of behavioral models differently. Figure 2.8 shows the advantages of a proper similarity calculation. From Fig. 2.8a, we observe that the population-level models with the Cosine similarity perform better than those with the Euclidean distance in most cases, except that the XGB# and RF# curves provide better performance when the false positive rate exceeds 0.002. In reality, online payment services pay attention to low intervention for normal users, so we tend to adopt the Cosine similarity, which is stable in cooperation with other classifiers. In Fig. 2.8b, the individual-level models with the Euclidean distance are better than those with the Cosine similarity, especially when the Disturbance is low. This shows that the model with the Euclidean distance can better distinguish the characteristics of each individual. Performance of Model Enhancement. We compare the composite models of different population-level and individual-level models with the pure population-level models. Figure 2.7b shows the ROC curves of the five widely-used population-level models, as described in Sect. 2.3.2.1, cooperating with the individual-level model. We observe that the individual-level model has a smaller complementary effect on the population-level models with the highest performance, i.e., XGB and RF, than on those with lower performance. When the Disturbance is in the range of 0.002 to 0.005, the individual-level model can bring about a 1% Recall increase for XGB and RF. In addition to XGB and RF, the other population-level models are greatly improved; in particular, LR and NB have almost similar performance to XGB and RF when the Disturbance is greater than 0.002. Although CNN cannot match the other methods after cooperating with the individual-level model,
it has also achieved an obvious improvement. That means that the individual-level model can effectively improve the performance of population-level models.
2.4 Conclusion

2.4.1 Behavior Enhancement

For behavioral models in online payment fraud detection, we propose an effective data enhancement scheme by modelling co-occurrence relationships of transactional attributes. Accordingly, we design customized co-occurrence relation networks, and introduce the technique of heterogeneous network embedding to represent online transaction data for different types of behavioral models, e.g., the individual-level and population-level models. The methods are validated by the implementation on a real-world dataset. They outperform the state-of-the-art classifiers with lightweight feature engineering methods. Therefore, our methods can also serve as a feasible paradigm of automatic feature engineering.
2.4.2 Future Work

There are some useful and interesting issues left to study:
• An interesting future work is to extend the data enhancement scheme to other types of behavioral models beyond the population-level and individual-level models studied in this work, e.g., group-level models and generalized-agent-based models.
• It would be interesting to investigate dedicated enhancement schemes for more advanced individual-level models, since the adopted naive individual-level model does not fully capture the advantages of the proposed data representation scheme based on the techniques of heterogeneous network embedding.
• It is anticipated to demonstrate the generality of the proposed method by applying it to different real-life application scenarios.
Chapter 3
Horizontal Association Modeling: Deep Relation Modeling
3.1 Introduction to Horizontal Association Modeling in Online Services
With the emergence of fast and convenient e-commerce systems, online payment has become a hot topic, and people shop and trade online more than ever before. Online payment services have penetrated people's daily lives and brought great convenience. The increased convenience, though, comes with inherent security risks [1]. This popularity produces a large amount of electronic transaction data, accompanied by a rapid increase in the number of online payment frauds. In an era of information explosion and disclosure, obtaining a person's account information or actual credit card has become a common occurrence [2]. Thus, identity theft [3] and account takeover (ATO) [4] are among the most popular cybercrime activities in online payment services, and they partly explain the continuous occurrence of fraudulent transactions. Meanwhile, online payment services face many challenges, e.g., transaction fraud, where a fraudster who has stolen an account intends to transfer its funds quickly by purchasing merchandise on online shopping platforms. Transaction fraud causes huge economic losses to financial platforms every year [5, 6]. It is not only a threat to thousands of customers [7], but also a challenge to the security of the entire e-commerce system [8, 9]. It is therefore imperative to build more effective, efficient, and comprehensive online fraud prediction/detection systems for financial platforms to lock fraudsters out.
Cybercrime involving online payment services is often diversified, specialized, industrialized, concealed, scenario-specific, and cross-regional, which makes the security prevention and control of online payment extremely challenging [10]. Most existing methods for transaction fraud detection depend on real-time online payment behaviors. These methods actively examine every transaction when a user starts a payment, and try to detect and terminate any fraudulent transaction before fraudsters finish the transfer of funds. Among these methods, the rule-based system was once commonly used [11]; it is generally built on business experience and characteristic statistics of historical risk events. Nowadays, machine learning techniques are increasingly applied to active fraud detection and have shown great effectiveness [12]. Among them, supervised learning models have been widely used in real-time transaction fraud detection [13], often combined with transaction aggregation [14]. In the meantime, unsupervised and semi-supervised learning methods have also been considered [15, 16]. In addition, deep learning techniques, such as Recurrent Neural Networks (RNN) [17] and Convolutional Neural Networks (CNN) [18], have shown great advantages in transaction fraud detection.
3.1.1 Behavior Prediction
Most of these existing methods can be regarded as interim detection methods, which work passively and take effect only when a transaction occurs. We take a different point of view, however, and ask whether transaction fraud can be detected in an ex-ante manner. That is to say, can we predict a fraudulent transaction before its occurrence? We start our investigation from the fact that most transaction frauds in online payment services are caused by account compromise. Thus, we resort to the prediction of account theft based on user behaviors. By examining the collected real-world transaction data, we find that account compromise is highly associated with risky payment behaviors of users. In Fig. 3.1a, we show the percentages of time intervals between adjacent legal and fraudulent transactions in a real dataset of a commercial bank. For more than a quarter of the compromised accounts, the time interval between the last legal transaction and the first following fraudulent transaction is less than half an hour. This indicates that there is possibly a relation between a user's historical payment behavior and his/her account compromise.
Fig. 3.1 (a) The percentages of different intervals between adjacent legal and fraudulent transactions of the same account. (b) The percentages of different intervals between adjacent fraudulent transactions of the same account
Figure 3.1b shows the percentages of time intervals between adjacent fraudulent transactions of the same account. It reveals the behavior patterns of fraudsters to some extent: once a fraudster gets hold of an account, he/she will transfer the funds as quickly as possible.
Based on the above observations, our objective is to determine to what extent an account is at risk after its holder has generated a series of transactions. Accordingly, we propose a fraud risk prediction method based on transaction sequences. The fraud risk of an account can be determined by only several recent transactions performed by its holder. Obviously, in contrast to typical fraud detection techniques, which are mainly interim or ex-post, ours is an ex-ante one. We first adopt a feature aggregation method to deal with transaction sequences, and then feed the aggregated features into some state-of-the-art machine learning models to predict account risks. Surprisingly, upon doing so, our ex-ante method achieves nearly the same detection performance as the interim ones on a real B2C transaction dataset from a commercial bank. In addition, to exploit the complementary effects, we design an anti-fraud scheme that combines the ex-ante and interim paradigms to further improve the effectiveness.
3.1.2 Behavior Sequence Analysis
In the online payment scenario, the occurrence of a transaction produces a transaction record, which contains various information fields. The objective of fraud detection is to distinguish legitimate from fraudulent transactions with these scattered information fields. For example, if a customer buys something or makes a transaction that exceeds his/her trading amount limit at an unusual time, this trade operation could be detected and intercepted as an exception. Afterwards, the customer is alerted that his/her account is at risk. To this end, it is useful to figure out the latent anomaly patterns hidden in the user's recent transaction records. In the real world, fraudulent transactions tend to appear in clusters [19, 20]; in other words, similar fraudulent transactions repeatedly occur within a short range of time. This hints that normal and stolen accounts have different trading behavior patterns. To capture and represent the behavior patterns, we propose the concept of a sliding time window for each customer. Through the statistical analysis of trading characteristics within a time window, such as the number of transactions, the cumulative transaction amount, the mean and variance of the transaction amount difference, and the time interval, we can generate new window-relevant features to represent the behavior patterns and use them for fraud detection. Different sizes of the time window generate different features and can result in very different model performance. How to determine a proper size of the time window is an intractable problem. It is straightforward to exhaustively test each possible window size, but this is labor-intensive and time-consuming [21]. We innovatively propose to utilize learning automata [22] to solve this problem. With learning automata, our method is capable of selecting the most suitable time window size automatically in a limited time.
In the financial industry, machine learning models used in risk control systems are required to be sufficiently robust [23]. Meanwhile, robustness is a typical limitation of most rule-based systems [24] and other traditional machine learning-based techniques [25–27]. Once new fraud attack methods turn up, the performance of current models tends to decline, as the previous model variables and features portraying the fraud behavior patterns become obsolete. It is difficult for a fixed time window size to adapt to the variability of fraud attacks in this situation. Therefore, with the regular verification and feedback of fraud labels, the changing patterns of fraud attacks should be learned regularly to identify the risks more accurately and enhance the robustness of models. Another factor that affects the performance of a fixed time window is the so-called concept drift [28–30]. In dynamic online payment environments, the data distribution changes over time, yielding concept drift that has little association with fraud [31]. When the statistical properties of target variables change over time in an unpredictable way, the prediction accuracy of machine learning models decreases correspondingly. For example, customers' seasonally changing shopping behaviors will bring down the performance of a risk prediction model built on a fixed time window, making it work only intermittently. So instead of a fixed time window, we need a dynamic one, which can be adapted to the environment regularly and thus brings effectiveness and robustness to risk prediction models. In this work, we propose an online payment fraud detection method based on the devised Learning Automatic Window, and we call our method LAW for simplicity. By utilizing learning automata, we can optimize the size of the sliding time window automatically. We extract 14 new features from 8 raw fields of transaction data. Among these new features, 8 are related to the sliding time window and hence are called window-dependent features. With all generated features, we adopt a selected classifier to implement the risk prediction model. Extensive experiments on real-life data validate the performance gain of LAW in terms of detection efficiency and effectiveness. Particularly, with the help of a regular online updating scheme, the sliding time window in our method adapts itself to the dynamic environment regularly, which effectively ensures the robustness of LAW.
3.2 Related Work
With the advent of large-scale e-commerce platforms and online payment platforms, transaction fraud prediction/detection has become a widely studied research area and attracts a lot of attention from researchers. Fraud prediction/detection is highly related to anomaly detection. Traditional methods involve the extensive use of auditing, where fraudulent behaviors are manually observed and reported [32]. This paradigm is not only time-consuming, expensive, and inaccurate, but also impractical in the big data era. Not surprisingly, financial institutions have turned to automatic detection techniques based on statistical and computational methods. Here, we review recent and fruitful studies on fraud detection, especially on online payment fraud.
Traditional rule-based fraud detection systems usually have a single function and limited detection effect. Recently, many advanced association rule-based fraud detection systems [33–35] have shown great results. Supervised learning-based methods are popular for credit card fraud detection, such as Logistic Regression [36, 37], Bayesian-based models [38, 39], Random Forest [40–42], and Neural Networks [28, 43–45]. Some semi-supervised and graph-based fraud detection methods [46–48] are effective in the context of weak labels; they mine the relationships of users through the transaction graph to detect fraud. Meanwhile, some unsupervised methods [49, 50] have been exploited for fraud detection, intending to identify hidden structures in unlabeled transaction data. Behavioral models [51–54] are also increasingly applied to fraud detection; they mine users' trading behavior patterns and identify abnormal trading patterns.
3.2.1 Fraud Prediction by Account Risk Evaluation
Existing studies on real-time transaction fraud detection mainly work in an interim manner. Interim fraud detection tries to discover and identify fraudulent activities as they enter the detecting systems and report them to a system administrator [55]. Rule-based expert systems used to be the most widely used technique, incorporating various areas of knowledge such as economics, finance, and business practices [11, 56]. However, the capability of this approach is limited because it heavily depends on predefined rules. These rules can only be maintained by domain experts, which is quite labor-intensive. Moreover, the statistical properties of transactions may change over time, making the rules obsolete. Fraud detection methods based on machine learning and artificial intelligence techniques are becoming increasingly popular. Supervised learning techniques use users' behavior data to create a classification model that determines whether a user's ongoing behavior deviates from the global behavior [12]. From the point of view of the fraudster, Jing et al. [13] found some behavior patterns of fraudsters in raw data and combined these behavior patterns with machine learning models. As it is not always possible to label all data, unsupervised learning approaches are utilized to overcome this defect [15]. Later, deep learning techniques were increasingly applied to financial transaction fraud detection scenarios [17, 18]. Recently, Zhang et al. [57] applied the deep forest to the task of fraud detection. Zheng et al. [58] proposed one-class adversarial nets for fraud detection that use only benign users as training data to detect malicious users. Fraud detection methods based on networks and knowledge graphs have also attracted more and more attention [59–62].
The most similar work to our fraud prediction was done by [63]. It proposed an account risk evaluation system using link analysis, and achieved great success in pinpointing misstated accounts in their dataset. Their method is offline in nature and needs to be updated periodically for predictive effectiveness. Ours is a real-time method of risk prediction, so the risk of an account can be evaluated every time it generates a transaction. The evaluated risk can be further utilized to predict fraudulent transactions.
3.2.2 Fraud Detection by Optimizing Window-Based Features
3.2.2.1 Selection, Integration, and Optimization
We review some fraud detection methods based on model integration and optimization or on feature selection. Utkarsh et al. [64] assigned a consistency score to each data point and used an ensemble of clustering methods to detect outliers in large datasets. Sahil et al. [65] employed different supervised machine learning algorithms to detect fraudulent credit card transactions on a real-world dataset and combined these algorithms into a super classifier by ensemble learning. Sohony et al. [66] presented an ensemble method based on a combination of Random Forest and Neural Network; owing to the advantages of ensemble learning, their method achieved high accuracy and confidence for the label prediction of new samples. Alejandro et al. [40] expanded a transaction aggregation strategy and proposed to create a new set of features based on the analysis of the periodic behavior of transactions using the von Mises distribution.
3.2.2.2 Window Optimization
Our work focuses on a unique aspect of feature optimization, that is, generating the optimal feature sets by optimizing the size of a sliding time window. There have been enlightening studies on window optimization in many other fields. Liono et al. [67] pursued the optimal window size by optimizing a multi-objective function that balances impurity minimization and class separability maximization in temporal segments; it was utilized to recognize multiple activities from heterogeneous sensor streams. Inspired by them, our method selects the window size by optimizing a different objective function. Shibli et al. [68] proposed an empirical model that adaptively adjusts the window parameters for a narrowband signal using spectrum sensing; the appropriate window size is selected where the two closest sinusoids can be distinguished using a specific formula, which not only improves the spectrogram visualization but also effectively reduces the computation cost. Although both works were devoted to determining the optimal window size, they are hard to apply directly to our work because of the inconsistency of optimization objectives across application scenarios.
3.2.2.3 Learning Automata
Different from the window optimization methods introduced above, we obtain the proper time window size with the help of learning automata (LA). A learning automaton intends to learn the optimal action from a set of allowable actions; it operates by maximizing the probability of being rewarded based on interaction with a random environment. We now review the latest research on LA and its applications. Seyyed et al. [69] proposed to combine wrapper and filter ideas and used estimator learning automata to efficiently determine a feature subset; the subset was chosen to satisfy a desirable tradeoff between the accuracy and efficiency of the learning algorithm. Liu et al. [70] designed a method that combines the firefly algorithm and LA to select optimal features for motor imagery EEG. Erik et al. [71] explored the use of a learning automata algorithm to compute threshold selection for image segmentation; the algorithm can perform automatic multi-threshold selection. Zhang et al. [72] proposed a reverse philosophy called last-position elimination-based learning automata, where the action graded last in terms of estimated performance is penalized by decreasing its state probability and is eliminated when its state probability becomes zero; the proposed schemes achieve significantly faster convergence and higher accuracy than the classical ones. Inspired by these LA applications, we utilize LA to optimize the time window size for generating a suitable feature set.
3.3 Historical Transaction Sequence for High-Risk Behavior Alert
3.3.1 Fraud Prediction System Based on Behavior Prediction
We present a real-time account risk prediction method to prevent the occurrence of future fraudulent transactions.
3.3.1.1 Transaction Sequence Window
Given an account of a user, we can use a sequence in chronological order, i.e., T = {(x_1, y_1), (x_2, y_2), ..., (x_t, y_t)}, to represent its historical transaction records, where x_i is the i-th transaction of the account, y_i is the label of transaction x_i, x_t is the latest completed transaction of the account, and the risk value of the account is r_t. The risk of the account is closely related to the user's recent transaction behaviors. We devise a transaction sequence window W of size w for each account. The transaction sequence window of an account, denoted by W = {x_{t-(w-1)}, ..., x_{t-1}, x_t}, contains the sequence of the w transaction records most recently generated by this account.
Fig. 3.2 The process of updating transaction sequence window and account risk
We formulate the risk prediction problem as evaluating the probability of fraud in the next transaction:
r_t = Pr(y_{t+1} = 1 | W = {x_{t-(w-1)}, ..., x_{t-1}, x_t}).
When an account finishes a transaction, its transaction sequence window is updated by maintaining a first-in-first-out transaction queue. In order to make real-time fraud risk predictions, every time the transaction sequence window of an account is updated, the predicted risk of this account is immediately re-estimated. Based on the updated risk value, we can judge the risk of the account continuing to trade and prevent the occurrence of fraudulent transactions in advance. If the risk exceeds a given threshold RT, we immediately lock the account and prevent any of its subsequently requested transactions. Figure 3.2 illustrates the updating process of the window. The current status of the window is W_t = {x_{t-(w-1)}, ..., x_{t-1}, x_t} and the risk of the account is r_t; when a transaction x_{t+1} is completed, the window is updated to W_{t+1} = {x_{t-(w-2)}, ..., x_t, x_{t+1}}, and we re-estimate the fraud risk of the account as
r_{t+1} = Pr(y_{t+2} = 1 | W_{t+1} = {x_{t-(w-2)}, ..., x_t, x_{t+1}}).
If r_{t+1} is bigger than the threshold RT, the account is immediately locked and the transaction x_{t+2} will not happen; otherwise, the transaction x_{t+2} is allowed to occur. We vary the number of transaction records in a transaction sequence window by choosing different window sizes, and we examine the performance of window sizes ranging from 1 to 5. Note that the transaction sequence window is a rolling window and is updated with every transaction. This scheme ensures the feasibility of real-time fraud prediction; that is, we no longer have to wait until the next transaction is actually generated.
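To make the update rule above concrete, the following minimal sketch (illustrative code, not the authors' implementation; the class and method names are our own) maintains a fixed-size first-in-first-out window per account, re-estimates the risk after every completed transaction, and locks the account once the risk exceeds the threshold RT.

```python
from collections import deque

class AccountRiskMonitor:
    """Rolling transaction sequence window of Sect. 3.3.1.1 (sketch).

    `risk_model` is assumed to expose predict_risk(window) -> float,
    an estimate of Pr(y_{t+1} = 1 | W); any trained classifier whose
    probability output is wrapped this way can play that role.
    """

    def __init__(self, risk_model, w=2, risk_threshold=0.75):
        self.risk_model = risk_model
        self.w = w                            # window size
        self.risk_threshold = risk_threshold  # RT in the text
        self.windows = {}                     # account_id -> deque of recent transactions
        self.locked = set()                   # accounts blocked from further payments

    def on_transaction_completed(self, account_id, transaction):
        """Update W in first-in-first-out fashion and re-estimate the risk r_t."""
        if account_id in self.locked:
            return None
        window = self.windows.setdefault(account_id, deque(maxlen=self.w))
        window.append(transaction)            # the oldest record is evicted automatically
        if len(window) < self.w:              # prediction starts from the w-th transaction
            return None
        risk = self.risk_model.predict_risk(list(window))
        if risk > self.risk_threshold:        # lock the account, block follow-up transactions
            self.locked.add(account_id)
        return risk
```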
3.3.1.2 Feature Engineering
Feature Selection. A transaction record x = (x_1, ..., x_M) is a feature vector x ∈ R^M that contains various types of features, where x_i is the i-th feature of the transaction x. However, data sparsity is often a big challenge in the fraud detection task. People are becoming more and more aware of the importance of privacy and information security; as a result, platforms are not allowed to collect or utilize some sensitive information about users. To assess the importance of different features, we introduce the mean decrease impurity (MDI, [73]) as a standard metric. The metric is defined as
MDI(x_m) = (1/N_T) Σ_T Σ_{i ∈ T: s_i = x_m} p(i) Δf(s_i, i),
where x_m is a feature, N_T is the number of trees within a Random Forest, s_i is the feature selected at splitting node i in tree T, p(i) is the proportion of samples reaching node i, and Δf(s_i, i) is an impurity decrease measure. In our work, the Gini index [74] is adopted as the impurity decrease measure. The larger the MDI, the more important the feature. We retain features of high importance and remove features of low importance from all transaction records.
Feature Aggregation. Within the transaction sequence window of an account there are several independent transaction records, so we need to perform feature transformations on the contained records. A user's payment behavior changes across transactions, and the values of the same feature may differ across the records in the window. Therefore, we need to aggregate the records in a window into one record to feed the machine learning model. We adopt different feature extraction methods for different categories of features.
Discrete features. A discrete feature can be denoted by x_i = {f_i^1, f_i^2, ..., f_i^j, ...}, where x_i is the i-th feature of a transaction record and f_i^j is the j-th value of the feature. We aggregate each discrete feature by replacing its values with their occurrence frequencies in the window. For features with too many values, we group the values and count the occurrences of each group separately. We reassign the j-th value of feature x_i to be count(f_i^j)/w, where w is the size of the transaction sequence window.
Continuous features. We perform mathematical operations on the transaction amount and time, which are continuous features, including calculating the sum, average, variance, maximum, and minimum.
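As a concrete illustration, scikit-learn's Random Forest exposes a Gini-based mean decrease impurity score through its feature_importances_ attribute, so the selection step can be sketched as follows (the function and its parameters are illustrative, not taken from the book):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_features_by_mdi(X, y, feature_names, keep_top=10, n_trees=100):
    """Rank encoded transaction features by mean decrease impurity (Gini importance).

    X is a 2-D array of encoded transaction records and y holds the fraud labels (0/1);
    the `keep_top` most important feature names are returned.
    """
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X, y)
    # feature_importances_ is the MDI of each feature averaged over the trees
    order = np.argsort(forest.feature_importances_)[::-1]
    return [feature_names[i] for i in order[:keep_top]]
```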
Table 3.1 The example of transaction records

| Time | Amt | Aut | Merchant | Addr |
|---|---|---|---|---|
| 7:50 | 100.00 | Face | 7 | 25 |
| 8:27 | 20.50 | Password | 12 | 25 |
| 11:15 | 500.00 | Fingerprint | 7 | 25 |
| 11:17 | 125.35 | Password | | 25 |
| 20:37 | 135.00 | | | 25 |
Table 3.2 The example of feature aggregation

| Min(Interval) | ... | Sum(Amt) | Avg(Amt) | ... | Face | Password | Fingerprint | Mer7 | Mer12 | Addr |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | ... | 880.85 | 176.17 | ... | 0.2 | 0.4 | 0.2 | 0.4 | 0.2 | 25 |
Other features. We simply digitize users' demographic or registration information, whose values usually do not change across different transactions of the same account.
Table 3.1 presents an example of the transaction sequence in the window (w = 5) of an account. There are 5 features for each transaction: transaction time (Time), transaction amount (Amt), authentication method (Aut), online merchant (Merchant), and account registration address (Addr). These five features fall into three different types: Time and Amt are continuous features, Aut and Merchant are discrete features, and Addr is registration information. The values of Merchant and Addr are represented by numbers, which means that the values of these features have been grouped and the number denotes the group id of the value. Table 3.2 provides the results of feature aggregation.
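The aggregation step can be sketched as follows; the field names mirror Table 3.1 and are purely illustrative, and the statistics kept here are only a subset of those mentioned in the text.

```python
import statistics
from collections import Counter

def aggregate_window(records, w):
    """Aggregate the w transaction records of one window into a single feature dict.

    Each record is a dict such as
    {"time_min": 470, "amt": 100.00, "aut": "Face", "merchant": 7, "addr": 25},
    with the transaction time expressed in minutes for simplicity.
    """
    feats = {}
    # Continuous features: simple statistics over the window
    amounts = [r["amt"] for r in records]
    times = sorted(r["time_min"] for r in records)
    intervals = [b - a for a, b in zip(times, times[1:])]
    feats["Sum(Amt)"] = sum(amounts)
    feats["Avg(Amt)"] = sum(amounts) / w
    feats["Var(Amt)"] = statistics.pvariance(amounts)
    feats["Min(Interval)"] = min(intervals) if intervals else 0
    # Discrete features: replace each value by its occurrence frequency in the window
    for field, prefix in (("aut", ""), ("merchant", "Mer")):
        counts = Counter(r[field] for r in records if r.get(field) is not None)
        for value, count in counts.items():
            feats[f"{prefix}{value}"] = count / w
    # Registration information: constant across the window, copied as-is
    feats["Addr"] = records[0]["addr"]
    return feats
```

Applied to the five records of Table 3.1 with w = 5, this reproduces the frequencies and statistics of Table 3.2, e.g., Sum(Amt) = 880.85 and Password = 0.4.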
3.3.1.3 Sampling
Highly imbalanced data is a practical challenge that we face in fraud detection [75, 76]. Usually, there are far more non-fraudulent transactions than fraudulent ones. This problem seriously degrades the performance of classifiers, as they are inclined to be overwhelmed by the majority class and thus ignore the minority class. Besides, the fraudulent transactions usually occur in a small number of accounts. As a matter of fact, we need to balance the class labels in the training data before feeding it to machine learning models. We under-sample the legitimate transaction samples by random skipping and keep all the fraudulent samples. We define a class ratio CR as the ratio of the number of legitimate transactions to the number of fraudulent transactions; the ratio CR may be different for different classifiers.
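A random under-sampling step controlled by CR can be sketched as follows (illustrative code; the authors do not publish their sampling routine):

```python
import numpy as np

def undersample_by_class_ratio(X, y, cr, seed=0):
    """Randomly skip legitimate samples so that #legitimate / #fraudulent is about CR.

    y uses 1 for fraudulent and 0 for legitimate transactions;
    every fraudulent sample is kept.
    """
    rng = np.random.default_rng(seed)
    fraud_idx = np.flatnonzero(y == 1)
    legit_idx = np.flatnonzero(y == 0)
    n_keep = min(len(legit_idx), int(cr * len(fraud_idx)))
    kept_legit = rng.choice(legit_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([fraud_idx, kept_legit]))
    return X[keep], y[keep]
```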
Fig. 3.3 A real-time ex-ante risk prediction system
3.3.1.4 Ex-ante Risk Prediction System
We devise an architecture for the ex-ante risk prediction system, as illustrated in Fig. 3.3, that is composed of an offline training procedure and an online prediction procedure.
Offline Training. Given all transactions of each account stored in chronological order, we design a sliding window of fixed size for each account. Every time the window slides over the historical transaction sequence of an account, a training sample is generated. When the training set has been generated, we use the under-sampling technique to balance the ratio of legitimate and fraudulent transaction records. Afterwards, the data preprocessing, feature selection, and feature aggregation modules are sequentially performed on all training samples. Finally, we use the aggregated features to train a classifier, e.g., XGBoost [77], as the risk prediction model.
Online Prediction. When a user completes a transaction with his/her account, the system immediately updates the transaction sequence in the window and re-generates the desired features for the available classifier. The classifier outputs a risk value that represents the risk of the account caused by its recent transaction operations. When the risk value is higher than a predefined threshold, the system immediately locks the account and disables its follow-up transactions.
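Putting the pieces together, the offline/online split can be sketched with the scikit-learn wrapper of XGBoost; the hyperparameters below are illustrative rather than the values used in the book.

```python
import numpy as np
from xgboost import XGBClassifier

def train_risk_model(window_features, next_tx_labels):
    """Offline training: one row per sliding-window position, labelled by whether
    the *next* transaction of that account turned out to be fraudulent."""
    model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
    model.fit(window_features, next_tx_labels)
    return model

def predict_account_risk(model, aggregated_window):
    """Online prediction: probability that the account's next transaction is fraudulent."""
    return float(model.predict_proba(np.asarray([aggregated_window]))[0, 1])
```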
3.3.2 Experimental Evaluation
We evaluate the performance of our method with different benchmark classifiers and evaluate the influence of different parameter settings. Furthermore, we verify the performance of our ex-ante prediction method by comparing it with interim detection methods using state-of-the-art classifiers.
3.3.2.1 Data Description and Analysis
We collect 3.5 million real-world B2C transaction records from a commercial bank. The transaction records span the period from April 1, 2017 to June 30, 2017 and were generated by 0.1 million accounts. All these records have been labelled as legitimate/fraudulent manually, and the whole dataset is encrypted and desensitized for security and privacy reasons. Each transaction consists of 54 features, including 17 user features (users' demographic information and accounts' registration information) and 37 online payment behavior features. Although the adopted dataset has 54 features, many of them have invalid values for most transaction records, due to the limited permission offered by users and payment carriers. From Fig. 3.4, we can observe that more than half of the raw features have a missing rate over 60%. We apply the random forest feature importance to rank the other 52 features except account ID and merchant ID. The features (with the corresponding types) selected in this way include Account ID (Int), Transaction Amount (Float), Transaction Time (Datetime), Card Type (Int), Authentication (Int), Frequent-Used IP (Bool), Registered Addr (String), and Merchant ID (Int).
Data imbalance is another big challenge. In our data, there are only 65,291 fraudulent transactions, i.e., only 1.8% of transactions are fraudulent. Furthermore, only 8% of all accounts have been compromised. Additionally, the numbers of transactions and fraudulent transactions approximately follow power-law distributions. As shown in the left panel of Fig. 3.5, most accounts completed very few transactions, while only a few accounts completed most of the transactions. The right panel of Fig. 3.5 presents the distribution of fraudulent transactions among the 8% compromised accounts, and shows that the occurrences of fraudulent transactions concentrate on a few accounts.
Fig. 3.4 The missing rates of raw features (count of features by missing rate)
Fig. 3.5 The imbalanced distributions of transactions (left: number of accounts versus number of transactions) and fraudulent transactions (right: number of compromised accounts versus number of fraudulent transactions)
3.3.2.2 Experiment Setting
We partition the dataset into two parts: transaction records from April 1, 2017 to May 31, 2017 are used as training data, and transaction records from June 1, 2017 to June 30, 2017 as testing data. The objective is to make a real-time prediction of account risk based on historical transaction sequences. Clearly, the size of the transaction sequence window, denoted as w, is a critical parameter. It not only decides how much historical information of an account is used, but also determines the starting point of the risk prediction task. For each account, we set a fixed window size w; then, for each user, we only start predicting the account risk after the user completes his/her w-th transaction. For example, if we set w = 3, we can predict the risk of an account only after it generates three transactions.
3.3.2.3 Evaluation Metrics
In general, there are many metrics to evaluate the performance of binary classifiers, such as the AUC (Area Under the ROC Curve) score, F-measure, and KS (Kolmogorov-Smirnov) score. However, these metrics cannot directly reflect the economic influence of models, especially in the case of imbalanced data. With practical usage in mind, we use precision, recall, false positive rate, and F1-score as evaluation metrics. As a classification model often outputs the numerical probability of a transaction being fraudulent, we need to set a threshold to determine whether fraud occurs. The threshold actually provides a tradeoff between precision and recall, and may differ across classifiers. We adopt the false positive rate (FPR) as the principle for choosing thresholds. With a fixed value of FPR, we obtain different thresholds for different classifiers, and we can calculate their recall, precision, and F1-score under the fixed FPR. This can be used for model selection: for models with the same FPR value, larger values of the chosen metrics suggest better model performance. In our online payment scenario, the FPR must be smaller than 0.001; otherwise, the model is not acceptable in practice.

Table 3.3 Comparison of classifiers at different FPR

| FPR | Classification | Precision | Recall | F1-score |
|---|---|---|---|---|
| 0.001 | XGB | 0.9515 | 0.9421 | 0.9468 |
| 0.001 | RF | 0.9511 | 0.9370 | 0.9440 |
| 0.001 | LR | 0.8885 | 0.3839 | 0.5362 |
| 0.001 | DNN | 0.9499 | 0.7637 | 0.8467 |
| 0.0005 | XGB | 0.9746 | 0.9273 | 0.9504 |
| 0.0005 | RF | 0.9745 | 0.9208 | 0.9469 |
| 0.0005 | LR | 0.9306 | 0.3201 | 0.4762 |
| 0.0005 | DNN | 0.9582 | 0.4634 | 0.6247 |
| 0.0001 | XGB | 0.9879 | 0.4091 | 0.5786 |
| 0.0001 | RF | 0.9638 | 0.3259 | 0.4871 |
| 0.0001 | LR | 0.2293 | 0.0014 | 0.0028 |
| 0.0001 | DNN | 0.7529 | 0.0469 | 0.0883 |
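One way to implement this FPR-anchored threshold selection with scikit-learn is sketched below; the function name and the default target are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_at_fpr(y_true, y_score, target_fpr=0.001):
    """Return the decision threshold whose false positive rate does not exceed target_fpr,
    together with the recall (TPR) achieved at that operating point."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    admissible = np.flatnonzero(fpr <= target_fpr)
    best = admissible[-1]          # largest admissible FPR, i.e. the highest recall
    return thresholds[best], tpr[best]
```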
3.3.2.4 Comparison of Benchmark Classifiers
We compare the performance of four popular classification models: XGBoost (XGB), Random Forest (RF), Logistic Regression (LR), and Deep Neural Network (DNN). In the DNN model, the sigmoid function is used as the activation function, and there are 3 hidden layers in addition to the input and output layers; the numbers of neurons in the three hidden layers are 20, 30, and 20, respectively. We set the window size w = 2 and repeat the process 3 times. Table 3.3 shows the average precision, recall, and F1-score on the testing data for FPR = 0.001, FPR = 0.0005, and FPR = 0.0001, respectively. We observe that XGBoost not only performs better but also shows superior stability and robustness. We therefore choose XGBoost as the classifier of the benchmark interim method.
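The stated DNN configuration can be reproduced, for example, with scikit-learn's MLPClassifier; this is our reading of the description above, not the authors' code.

```python
from sklearn.neural_network import MLPClassifier

# Three hidden layers with 20, 30 and 20 neurons and sigmoid ("logistic") activations,
# matching the DNN configuration described in the text; other settings are defaults.
dnn = MLPClassifier(hidden_layer_sizes=(20, 30, 20), activation="logistic", max_iter=500)
```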
3.3.2.5 Impact of Window Size on Prediction
We need to analyze the impact of the transaction sequence window size on the prediction performance. We increase the window size w from 1 to 5 and compare the performance of the corresponding models. The size of the transaction sequence window decides when the risk evaluation can start. For example, when w = 1, we can predict the risk of an account after it generates one transaction; but when w = 5, we cannot make a prediction until the account finishes five transactions.
Fig. 3.6 Recall and F1-score at different w, for (a) FPR = 0.001 and (b) FPR = 0.0001 (CR = 30)
In order to analyze the impact of the window size reasonably, our prediction starts from the sixth transaction of each account in the testing data. Figure 3.6 shows that the window size has a significant impact on performance. The recall and F1-score of all models increase with the window size at a fixed FPR. The reason is that a larger window size provides more behavioral information about account holders. In practice, we can choose a properly large window size, but on our testing data, each increase of the window size by 1 makes almost 20 thousand additional transactions unpredictable. For example, when w = 5, about 0.1 million transactions cannot be predicted in our testing data, because most accounts have only a few transactions. Therefore, in our experiment, we do not consider window sizes larger than 5.
3.3.2.6 Impact of Class Ratio for Prediction
We compare the effect of the class ratio CR on the performance of classifiers. In our experiment, we increase the class ratio CR from 10 to 70 and set the window size w = 2. We repeat the process 3 times. Figure 3.7 presents the average recall and F1-score on the testing data for FPR = 0.001 and FPR = 0.0001, respectively, where the x-axis is the class ratio. From Fig. 3.7, we observe that when FPR = 0.001, the change of CR has no significant impact on any model.
Fig. 3.7 Recall and F1-score at different CR, for (a) FPR = 0.001 and (b) FPR = 0.0001 (w = 2)
However, when we decrease the FPR to 0.0001, the performance of XGBoost hardly changes with the increase of CR, whereas the performance of Random Forest decreases greatly. Although the change of CR has little effect on Logistic Regression and DNN, their performance is very poor when FPR = 0.0001.
3.3.2.7 Comparison with Interim Detection
In this part, we compare our ex-ante fraud detection method with the traditional interim fraud detection method. As stated before, we adopt XGBoost as the classifier in both methods. Active detection on the fly is a widely used fraud detection approach. Different from ours, the interim fraud detection method examines every currently ongoing transaction of an account by comparing it with the account's historical transactions. Once the ongoing transaction is determined to be fraudulent, the system immediately terminates it. In order to train an effective interim fraud detection model, it is important to extract useful features based on the data and business experience. The feature extraction process for interim fraud detection is described as follows.
Fig. 3.8 ROC curves of the interim detection and ex-ante prediction methods
User historical behaviour features. We use the most recent month and the most recent day as time intervals to characterize users' payment behaviour within them. We extract a set of user-behaviour-related features, including the statistics of transaction volume, the statistics of transaction amount, the trading period, and the statistics of other historical trading characteristics.
Merchant historical behaviour features. Similar to the users' historical behaviour feature extraction, we extract 9 statistical features related to merchants.
User-merchant features. We extract 2 features related to user-merchant pairs: the number of transactions per user at different merchants and the average amount per transaction.
Ongoing payment features. To extract features of an ongoing transaction, we digitize its non-numeric fields and calculate the differences between the ongoing transaction and its previous transaction, such as the time interval and amount difference. We extract 12 features related to the ongoing transaction in total.
Figure 3.8 compares the receiver operating characteristic (ROC) curves of the two methods. Although the interim fraud detection method outperforms ours, the advantage is very limited. The strong point of our method is that it can prevent fraudulent transactions from occurring while still keeping very good performance. This enables financial platforms to inform users of account risks in advance and remind them to protect their accounts in time.
3.3.3 Enhanced Anti-fraud Scheme
As fraud prevention and fraud detection are the two most commonly used schemes to combat fraud in practice, we design a hybrid anti-fraud system that combines both.
3.3.3.1 Hybrid System Architecture
As depicted in Fig. 3.9, the enhanced scheme consists of two modules as follows:
Fig. 3.9 A real-time fraud prevention and detection system
Ex-ante risk prediction module. When an account finishes a transaction x_t, the system immediately uses the ex-ante risk prediction method to predict the risk of the account. If the risk is too high, the system immediately locks the account and prohibits all subsequent transactions. Otherwise, the system allows the occurrence of the next transaction, say x_{t+1}.
Interim fraud detection module. If the ex-ante module does not lock the account and the account is making a new transaction x_{t+1}, the system uses the interim fraud detection module to check the ongoing transaction x_{t+1}. If x_{t+1} is determined to be fraudulent by this module, it is terminated immediately and the account is locked. Otherwise, the system waits for the ongoing transaction to complete, updates the transaction sequence of the account, and re-estimates its fraud risk using the ex-ante risk prediction module.
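The interplay of the two modules can be sketched as follows; `monitor` is an ex-ante risk module such as the AccountRiskMonitor sketched earlier, and `interim_model.is_fraud` stands for any interim classifier with a decision threshold already applied (all names are illustrative).

```python
def handle_payment_request(account_id, ongoing_tx, monitor, interim_model):
    """Decision flow of the hybrid scheme in Fig. 3.9 (sketch)."""
    if account_id in monitor.locked:
        return "rejected: account locked by the ex-ante module"
    # Interim fraud detection on the ongoing transaction x_{t+1}
    if interim_model.is_fraud(ongoing_tx):
        monitor.locked.add(account_id)
        return "terminated: interim module flagged the transaction"
    # The transaction completes; update the window and re-estimate the ex-ante risk
    monitor.on_transaction_completed(account_id, ongoing_tx)
    return "completed"
```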
3.3.3.2 Anti-fraud Performance Evaluation
We have applied the system to the bank's one-month B2C transaction data. There are 1,028,437 transaction records in this month, including 1,003,539 normal transactions and 24,898 fraudulent transactions. In the ex-ante risk prediction module, we set w = 2. The core of our anti-fraud system is to select an appropriate risk threshold (RT) for the ex-ante risk prediction model. Figure 3.10 presents the recall and FPR at different values of RT, from which we observe that the smaller the risk threshold, the larger the recall and the FPR. This means that we can prevent and detect more fraudulent transactions at the cost of interrupting more legitimate ones. When RT = 1, only the interim fraud detection method is used, without predicting the risk of accounts ahead of time; although this interrupts very few legitimate transactions, many fraudulent transactions cannot be detected.
Fig. 3.10 Recall and FPR with different thresholds
Fig. 3.11 The cumulative distribution of predicted frauds over time intervals (in seconds)
In practical applications, it is necessary to ensure that the false positive rate is less than 0.1%. Next, we present the performance of the ex-ante prediction module, the interim detection module, and the integrated system when RT = 0.75:
• Our real-time fraud prevention and detection integrated system can detect 94.46% of fraudulent transactions with less than 0.09% of legitimate transactions interrupted.
• Remarkably, 80.42% of fraudulent transactions can be prevented by the ex-ante module with less than 0.04% of legitimate transactions interrupted, leaving only about 14% of fraudulent transactions to be detected while they are ongoing.
• Figure 3.11 shows the distribution of fraudulent transactions over the time intervals by which the system can predict them ahead of their occurrence. We observe that nearly 70% of fraudulent transactions can be predicted more than 5 s before they occur.
3.4 Learning Automatic Windows for Sequence-Form Fraud Pattern
3.4.1 Fraud Detection System Based on Behavior Sequence Analysis
In this part, we present our method for online payment transaction fraud detection in real-life online payment scenarios. The whole workflow of the fraud detection system is depicted in Fig. 3.12. We first elaborate on the features that are used to distinguish legitimate and fraudulent transactions. Then, we introduce the design of the sliding time window. Finally, we describe how to use learning automata to generate a dynamic window adaptively. The devised Learning Automatic Window (LAW) is the core of our method and is the reason why we call the proposed method LAW.
3.4.1.1 Feature Preprocessing
First of all, we need to preprocess the original transaction data so as to obtain powerful features for learning models. Many of the original data fields do not contribute significantly to fraud detection, such as the media serial number, CPU number, operating system version, registration time, and so on. Therefore, we select and process some raw fields with high correlation with fraud detection. The selected raw fields are summarized in Table 3.4. We analyze the transaction dataset, which will be introduced in Sect. 3.4.2.1, by counting the time gap (in seconds) and money gap (in Chinese yuan) between two adjacent fraudulent transactions and between two adjacent legitimate transactions, respectively. Their distributions are shown in Fig. 3.13.
Fig. 3.12 Workflow of the fraud detection system. It contains three main modules: the automatic window size generation module, the risk prediction model training module, and the real-time streaming data processing module

Table 3.4 Descriptions of the raw data fields

| Feature | Data type | Description |
|---|---|---|
| User_ID | String | Transaction card account of customers; one account represents one User_ID |
| Transaction_Time | String | The time that the transaction happens, starting from the year and accurate to the second |
| Transaction_Amount | Float | The amount of money in a transaction, in units of yuan, the basic unit of Chinese currency (RMB) |
| Pre-trade_Account_Balance | Float | Balance of the account before the transaction happens, in units of yuan |
| Daily_Limit | Float | Maximum amount limit for daily transactions, in units of yuan |
| Single_Limit | Float | Maximum amount limit for a single transaction, in units of yuan |
| Check | String | The verification tool used in the transaction, like U-shield, electronic cipher, etc. |
| Frequent_IP | Boolean | A status bit indicating whether the current IP is consistent with the IP frequently used by the customer |

Fig. 3.13 The respective cumulative distribution functions of time gap and money gap between two adjacent transactions
In these plots, the cumulative distribution represents the proportion of transactions whose gap is less than a given value. We find that when the time gap and money gap between two adjacent transactions are small, fraudulent transactions account for the majority. This shows that fraudulent transactions tend to appear in clusters: they often occur sequentially and similarly within a short time. This is reasonable, as it reflects the psychology that fraudsters want to transfer money out of a compromised account as soon as possible. We can rely on the paradigm of the sliding time window to generate some specific features, called window-dependent features, which can be used to describe this kind of behavior pattern within a certain period.
Meanwhile, we can generate features that do not depend on the sliding time window, which we term window-independent features. Next, we discuss the generation of the two kinds of features separately.
Window-Independent Features. We extract six window-independent features from the selected raw fields. Two of them, Check and Frequent_IP, are obtained directly from the original fields without any changes. The others are either transformed directly from one of the fields or obtained through statistical calculations between different fields. A brief description of these features and their data types is presented in Table 3.5.
Window-Dependent Features. As mentioned above, the analysis of customers' behavior patterns within a period of time contributes to transaction fraud detection. For this purpose, we first define a transaction's sliding time window, which refers to a period of time before the transaction happens; we then extract eight new features, namely window-dependent features, by statistical analysis or transformation of the raw transaction data within the sliding time window. The window-dependent features are also summarized in Table 3.5, and a small sketch of their computation is given below.
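The sketch below assumes each transaction inside the window is available as a (timestamp in seconds, amount) pair; it mirrors Table 3.5 but is not the authors' implementation.

```python
import statistics

def window_dependent_features(window_txs):
    """Compute the eight window-dependent features of Table 3.5 for one account.

    `window_txs` is the chronologically ordered list of transactions currently
    inside the sliding time window, each given as (timestamp_seconds, amount).
    """
    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    def var(xs):
        return statistics.pvariance(xs) if len(xs) > 1 else 0.0

    times = [t for t, _ in window_txs]
    amounts = [a for _, a in window_txs]
    amt_gaps = [b - a for a, b in zip(amounts, amounts[1:])]
    time_gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "Times_Window": len(window_txs),
        "Avg_amt_window": mean(amounts),
        "Var_amt_window": var(amounts),
        "Acc_amt_window": sum(amounts),
        "Avg_amt_gap_window": mean(amt_gaps),
        "Avg_time_gap_window": mean(time_gaps),
        "Var_amt_gap_window": var(amt_gaps),
        "Var_time_gap_window": var(time_gaps),
    }
```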
3.4.1.2 Sliding Time Window Design
As discussed above, a user's normal and abnormal transaction behavior patterns change over time. In order to capture and discriminate these behavior patterns, we establish a sliding time window: when a user generates a new transaction, the window slides one step forward to update the transactions within it. Due to the fluctuations of users' trading behavior patterns, the sliding time window should be elastic to adapt to new environments and to enhance the robustness of the model; in other words, the size of the time window needs to be flexible and adaptable as time passes and transactions progress. Next, we describe the specific design of the sliding time window. In our method, we maintain a transaction list for each user, stored as a linked-list structure, so that head and tail operations are fast regardless of the data volume. For each user, the transaction list stores his/her transaction data within the time window. When a user generates a new transaction, the transaction list is updated by adding the transaction to its head. Once the time range of a transaction list exceeds that of the time window, the old transactions beyond the time window are removed from the list. In this work, we choose the REDIS [78] database to implement the sliding time window, which can quickly perform streaming calculations; note that REDIS is not essential and can be replaced by other efficient databases. An implementation of the transaction list and time window is illustrated in Fig. 3.14.
Table 3.5 Descriptions of the newly generated features

Window independent:
- Check (INT): Please refer to Table 3.4
- Frequent_IP (Boolean): Please refer to Table 3.4
- Overpaid_Limit (Boolean): Whether the single transaction amount exceeds the Single_Limit, or whether the daily transaction amount exceeds the Daily_Limit
- Overpaid_Left (Boolean): Whether the transaction amount exceeds the balance of the account
- Time_Gap (Float): Time difference between two adjacent transactions
- Money_Gap (Float): Amount difference between two adjacent transactions

Window dependent:
- Times_Window (INT): The number of transactions appearing in the time window
- Avg_amt_window (Float): The average amount of all transactions in the time window
- Var_amt_window (Float): The variance of the amount of all transactions in the time window
- Acc_amt_window (Float): The cumulative amount of all transactions in the time window
- Avg_amt_gap_window (Float): The average amount difference between any two adjacent transactions in the time window
- Avg_time_gap_window (Float): The average time difference between any two adjacent transactions in the time window
- Var_amt_gap_window (Float): The variance of the amount difference between any two adjacent transactions in the time window
- Var_time_gap_window (Float): The variance of the time difference between any two adjacent transactions in the time window
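To make Table 3.5 concrete, the following is a small illustrative sketch (not the authors' implementation) that computes the eight window-dependent features from the transactions currently held in a user's window. It assumes each transaction is a `(timestamp, amount)` pair ordered newest first, and taking absolute amount differences is an assumption.

```python
from statistics import mean, pvariance

def window_dependent_features(txns):
    """txns: list of (timestamp, amount) pairs, newest first, within the window."""
    amounts = [a for _, a in txns]
    # Differences between adjacent transactions inside the window.
    amt_gaps = [abs(amounts[i] - amounts[i + 1]) for i in range(len(amounts) - 1)]
    time_gaps = [txns[i][0] - txns[i + 1][0] for i in range(len(txns) - 1)]
    avg = lambda xs: mean(xs) if xs else 0.0
    var = lambda xs: pvariance(xs) if len(xs) > 1 else 0.0
    return {
        "Times_Window": len(txns),
        "Avg_amt_window": avg(amounts),
        "Var_amt_window": var(amounts),
        "Acc_amt_window": sum(amounts),
        "Avg_amt_gap_window": avg(amt_gaps),
        "Avg_time_gap_window": avg(time_gaps),
        "Var_amt_gap_window": var(amt_gaps),
        "Var_time_gap_window": var(time_gaps),
    }

# Example: three transactions currently in the window (newest first).
print(window_dependent_features([(1045.0, 500.0), (1020.0, 80.0), (1000.0, 120.0)]))
```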
Fig. 3.14 The architecture of sliding time window
From Fig. 3.13a, we observe that when the time gap between two adjacent transactions varies from 30 s to an hour, there is a large degree of distinction between the cumulative distributions of normal and fraudulent transactions. We therefore set the candidate time window sizes to range from 30 s to an hour and, for convenience of calculation, use a step size of 30 s.
3.4.1.3 Window Size Optimization Using LA
We aim to devise an automatic selection scheme for the size of the sliding time window. On the one hand, different window sizes lead to fluctuations in model performance, so a proper window size has to be chosen from a large number of options by optimization. Every time we choose a time window size, we must construct new window-dependent features, feed them together with the window-independent features to a classifier, and evaluate its predictive performance. Obviously, this optimization process is time-consuming and labor-intensive given the large number of candidate window sizes, so we use learning automata (LA) to select the window size automatically. On the other hand, to maintain the robustness of the predictive model, the time window should be flexible and adaptable to dynamic environments. In online payment fraud detection systems, the feedback signal indicating whether a transaction is fraudulent may be delayed by one month or more, so we add a dynamic online updating scheme, that is, we adjust the time window size periodically. We call the resulting method LAW. To make this work self-contained, we first introduce the basic concept of learning automata. Figure 3.15 shows the typical system architecture of LA. First, LA selects an action α(n) via a stochastic process; it then applies the selected action to the environment and receives a reward signal β(n) from the performance evaluation function. An action that achieves desirable performance is rewarded by an increase in its probability density, whereas underperforming actions are penalized or left unchanged depending on the reward strategy. Thus, during the learning
Fig. 3.15 The framework of learning automata
Table 3.6 Notations and parameters in LA

- x_t: The size of the time window selected at iteration t
- β_t: The reward obtained by LA at iteration t
- tpr_t: The weighted true positive rate of the classifier prediction under the current time window size at iteration t
- TPR_t^m: The list of the m weighted true positive rates in the previous m iterations at iteration t
- f_t(x): The probability density function of the time window size at iteration t
process, the internal probability density of LA is updated continuously. When the probability density converges, the action with the highest probability corresponds to the optimal action. Next, we discuss how to select the proper time window size automatically with LA. LA interacts with the environment continuously in order to find the optimal action for the current state. In our study, the environment corresponds to the whole transaction dataset together with the chosen classifier, and an action corresponds to a selection of the time window size. Table 3.6 lists the main components of LA and their corresponding notations. The size of the time window x_t acts as both the output of LA and the input of the environment; conversely, the reward value β_t is the output of the environment and the input of LA. The definition of the reward value β_t ∈ [0, 1] is critical, as it not only measures the utility of the currently selected time window size but also serves as the basis for updating the probability density function of the time window size during LA training. As indicated in Table 3.6, tpr_t plays an important role in the calculation of reward values. It is a manually defined performance metric: a weighted true positive rate taken at false positive rates of 0.05%, 0.1%, 0.5% and 1%, respectively. That is,
\[
tpr_t = 0.4 \cdot (tpr_t \mid fpr_t = 0.0005) + 0.3 \cdot (tpr_t \mid fpr_t = 0.001) + 0.2 \cdot (tpr_t \mid fpr_t = 0.005) + 0.1 \cdot (tpr_t \mid fpr_t = 0.01).
\]
We choose these false positive rate values because they are the ones to which the industry usually pays more attention [48, 79, 80]. When the false positive rate is
smaller, its weight should be bigger, since the goal of the optimization is to achieve a larger true positive rate and a smaller false positive rate at the same time. The setting of the weights 0.4, 0.3, 0.2, and 0.1 follows this principle, and their sum is 1, which restricts tpr_t ∈ [0, 1]. In addition to the list TPR_t^m = {tpr_t, tpr_{t-1}, ..., tpr_{t-m+1}}, we also define TPR_med and TPR_max as the median and maximum values in TPR_t^m. In our method, LA adopts the reward/inaction policy [81], which rewards actions with good performance and neither rewards nor punishes those with bad performance. Following this principle, when the current weighted true positive rate tpr_t is larger than the median true positive rate TPR_med in the list TPR_t^m, we conclude that the currently selected time window size x_t performs well, whereas tpr_t being smaller than or equal to TPR_med indicates a bad choice of time window size. Based on these settings, the reward value β_t is defined by

\[
\beta_t = \max\left\{ 0,\ \frac{tpr_t - TPR_{med}}{TPR_{max} - TPR_{med}} \right\}. \qquad (3.1)
\]
Obviously, β_t is always nonnegative and its value is restricted to the interval [0, 1]. During the training process of LA, an action is selected at each iteration based on the probability density function f_t(x) of the time window size, where f_t(x) ∈ [0, 1] and x ∈ [x_min, x_max]. Time window sizes with greater probabilities thus have more opportunity to be selected. Actions with good performance then receive an increase in probability via a neighborhood function N(x, x_t) (a probability distribution function). Given the reward value β_t, the probability density function f_t(x) is updated according to Eq. (3.2),

\[
f_{t+1}(x) = \alpha \cdot \left[ f_t(x) + \beta_t \cdot N(x, x_t) \right], \qquad (3.2)
\]

where α is used to re-normalize the probability distribution such that

\[
\int_{x_{min}}^{x_{max}} f_{t+1}(x)\, dx = 1.
\]
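To make the reward and the update concrete, here is a minimal sketch under the assumption that the density is discretized over the grid of candidate window sizes, so the integral of Eq. (3.2) becomes a sum. The neighborhood is passed in as a callable (a Gaussian choice is defined later in Eq. (3.3)); all numeric values are illustrative, not the book's settings.

```python
import numpy as np

def weighted_tpr(tpr_at_fpr):
    """Weighted true positive rate built from TPRs taken at the four reference FPRs."""
    weights = {0.0005: 0.4, 0.001: 0.3, 0.005: 0.2, 0.01: 0.1}
    return sum(w * tpr_at_fpr[f] for f, w in weights.items())

def reward(tpr_t, tpr_history):
    """Eq. (3.1): reward/inaction based on the median and maximum of recent weighted TPRs."""
    tpr_med, tpr_max = np.median(tpr_history), np.max(tpr_history)
    if tpr_max == tpr_med:                      # guard against a degenerate history
        return 0.0
    return max(0.0, (tpr_t - tpr_med) / (tpr_max - tpr_med))

def update_density(f, grid, x_t, beta_t, neighborhood):
    """Eq. (3.2) on a discrete grid: add reward-scaled neighborhood mass, then re-normalize."""
    f_new = f + beta_t * neighborhood(grid, x_t)
    return f_new / f_new.sum()                  # alpha plays the role of the re-normalizer

# Illustrative usage on the grid of candidate window sizes (30 s to 1 h, step 30 s).
grid = np.arange(30, 3601, 30, dtype=float)
f = np.full(len(grid), 1.0 / len(grid))                              # uniform start
gauss = lambda x, c: np.exp(-(x - c) ** 2 / (2 * 300.0 ** 2))        # placeholder neighborhood
beta = reward(tpr_t=0.83, tpr_history=[0.80, 0.81, 0.79, 0.84, 0.82])
f = update_density(f, grid, x_t=1800.0, beta_t=beta, neighborhood=gauss)
```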
The new probability density function determines the time window size to be selected at the next iteration. Note that a proper neighborhood function N(x, x_t) is usually chosen according to the characteristics of the specific application. Taking into account the performance distribution over window sizes, we adopt a Gaussian function as the neighborhood function. The reason is that a Gaussian is roughly consistent with the observation that window sizes close to a superior size tend to perform well too: the probability density is augmented in the vicinity of the superior action, and the further an action lies from it, the smaller the increase in its probability density. More specifically, the adopted Gaussian neighborhood function G(x, x_t) is a symmetric Gaussian centered on the current time window size x_t, and is defined in Eq. (3.3),
Fig. 3.16 Workflow of our LAW to select the most proper time window size
\[
N(x, x_t) := G(x, x_t) = \lambda \cdot \frac{1}{\sqrt{2\pi}\,\sigma_1} \cdot e^{-\frac{(x - x_t)^2}{2\sigma_1^2}}, \qquad (3.3)
\]
where the hyper-parameters λ and σ1 control the updating speed of the probability density function f_t(x). Based on the above settings of LA, we now describe the process of our method LAW in detail. Figure 3.16 depicts the workflow of selecting the most proper time window size, which corresponds to the learning automatic window module in Fig. 3.12. At the very beginning, the probability density function f_t(x) is initialized such that all window sizes have the same probability density and the total accumulated probability is 1. At each iteration, LA chooses a time window size x_t based on the probability density function. If x_t has already been selected in a previous iteration, there is no need to repeat the feature construction and model training steps, since this work has been done before; we can directly call the previously trained model and move to the model prediction performance evaluation step. Otherwise, we first construct the window-dependent features based on the window of the chosen size, then train a model on them together with the window-independent features, and evaluate its predictive performance, tpr_t, on an off-line testing dataset. LA then updates the list of the m most recent predictive evaluations TPR_t^m and obtains the reward value β_t according to Eq. (3.1). After that, it updates and re-normalizes the probability density function f_t(x) according to Eq. (3.2). As LA gradually converges towards the more worthy regions, the window sizes in those regions are evaluated more frequently; after a number of iterations, their probabilities are higher than those of others, which results in better discrimination between the selected actions and the unselected ones. To achieve significantly faster convergence and higher accuracy than classical pursuit schemes, we borrow from the Last-Position Elimination-Based Learning Automata [72]: before re-normalizing the probability density function at each iteration, we set a threshold and eliminate the probability mass below it. This makes the distinction between worthy areas and other areas more obvious and thus helps accelerate convergence. For the training of LA, we can either set an upper bound on the number of iterations or wait until convergence. Here, we choose the latter, thus the number of iterations for LA to reach
convergence is not fixed. When LA completely converges, we select the window size with the largest probability as the most suitable window size for the current stage. The most proper time window size selected above is applicable for a period of time. If we set one month as the length of a time stage, then when entering the next stage, i.e., one month later, the previous time window size will no longer work well and the performance of the model will decline. So we need to adjust the most proper time window size dynamically and regularly. To obtain a time window with a flexible size and thereby enhance the robustness of the whole model, our LAW also contains an online updating scheme. We now present how to adjust the size of the sliding time window automatically with this scheme. We utilize the converged probability density of the time window size from the previous stage to initialize LA in the current stage, as shown in Fig. 3.16. When entering the next stage, the new probability density function of LA is initialized as a Gaussian distribution centered on the most suitable time window size selected in the previous stage. Upon doing so, we aim to make a trade-off between the exploration and exploitation behaviors of LA. On the one hand, we need to exploit the converged probability density of the previous stage, since the probability density of the current LA has a great chance to converge around it; on the other hand, it is necessary to explore the areas far away from the previous proper time window size in case they yield good performance. The newly initialized probability density function f_Initial(x, x_l) is defined as:

\[
f_{Initial}(x, x_l) = \frac{1}{\sqrt{2\pi}\,\sigma_2} \cdot e^{-\frac{(x - x_l)^2}{2\sigma_2^2}}, \qquad (3.4)
\]
where x_l represents the most suitable window size in the last stage and σ2 controls the weight placed on x_l. The remaining iteration procedure is the same as in the previous stage, and again we choose the window size with the largest probability once LA converges. We provide the whole procedure of LAW in Algorithm 3.1.
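For illustration only, the re-initialization of Eq. (3.4) on a discretized grid could look as follows; the constant prefactor of Eq. (3.4) is absorbed by the normalization, and the values of `x_last` and `sigma2` below are assumptions expressed in seconds, not the book's settings.

```python
import numpy as np

def reinitialize_density(grid, x_last, sigma2):
    """Eq. (3.4): Gaussian mass centered on the best window size of the previous stage."""
    f = np.exp(-(grid - x_last) ** 2 / (2.0 * sigma2 ** 2))
    return f / f.sum()   # normalized over the discrete grid of candidate sizes

grid = np.arange(30, 3601, 30, dtype=float)                   # 30 s to 1 h in 30 s steps
f0 = reinitialize_density(grid, x_last=2070.0, sigma2=360.0)  # x_last = 34.5 min; sigma2 assumed
```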
3.4.2 Experimental Evaluation

In this part, we investigate the performance of the proposed method LAW for fraud detection in a real-life online payment scenario. For this purpose, we evaluate the overall performance and then analyze the gains brought by the time window and by LA, respectively.
3.4.2.1 Dataset Description
Our experiments are conducted on a dataset of real-life online payment records from a commercial bank. The records were collected over three months, from April 1, 2017, to June 30, 2017. The dataset is partitioned into two parts, with the records from April and May used as the training data and those from June as the testing data.
Algorithm 3.1: The Procedure of LAW

Input: the number of candidate window sizes, N; hyper-parameters of the Gaussian neighborhood function, λ and σ1; hyper-parameter of the newly initialized probability density function, σ2; the size m of the recent weighted true positive rate list TPR_t^m; the threshold t for last-position elimination.

if there is no previous stage then
    initialize the probability density function: f_t(x) = 1/N;
else
    initialize f_t(x) according to Eq. (3.4);
end
while f_t(x) has not converged do
    choose a time window size x_t based on the probability density function;
    if x_t is a repeated time window size chosen before then
        obtain the predictive evaluation tpr_t from the previously trained model;
    else
        construct the new window-dependent features, train a model on all features, and obtain tpr_t;
    end
    update the recent weighted true positive rate list TPR_t^m;
    compute the reward value β_t of the learning automata according to Eq. (3.1);
    update f_t(x) according to Eq. (3.2);
    eliminate the probability below the threshold t and re-normalize f_t(x);
end
select the most suitable window size x_optimal according to the final probability density function;
return x_optimal;
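The skeleton below mirrors Algorithm 3.1 as a hedged sketch, not the authors' code: the candidate sizes form a discrete grid, `evaluate_window` stands in for the feature-construction/model-training/evaluation step (assumed to return the weighted true positive rate), and the convergence test used here (the arg-max window size stops changing for a while) is a placeholder for the book's convergence criterion.

```python
import numpy as np

def law(grid, evaluate_window, neighborhood, m=10, elim_threshold=None,
        f_init=None, max_iters=5000, patience=200):
    """Sketch of Algorithm 3.1 on a discrete grid of candidate window sizes."""
    n = len(grid)
    f = np.full(n, 1.0 / n) if f_init is None else f_init.copy()
    elim_threshold = (1.0 / (4 * n)) if elim_threshold is None else elim_threshold
    cache, tpr_history = {}, []
    best, unchanged = None, 0

    for _ in range(max_iters):
        i = np.random.choice(n, p=f / f.sum())        # choose x_t from the density
        x_t = grid[i]
        if x_t not in cache:                          # repeated sizes reuse the old model
            cache[x_t] = evaluate_window(x_t)         # weighted TPR for this window size
        tpr_t = cache[x_t]
        tpr_history = (tpr_history + [tpr_t])[-m:]    # keep the m most recent evaluations

        tpr_med, tpr_max = np.median(tpr_history), np.max(tpr_history)
        beta = 0.0 if tpr_max == tpr_med else max(0.0, (tpr_t - tpr_med) / (tpr_max - tpr_med))

        f = f + beta * neighborhood(grid, x_t)        # Eq. (3.2) before normalization
        f[f < elim_threshold] = 0.0                   # last-position elimination
        f = f / f.sum()                               # re-normalization (alpha)

        # Placeholder convergence check: the most probable window size stops changing.
        current = grid[int(np.argmax(f))]
        unchanged = unchanged + 1 if current == best else 0
        best = current
        if unchanged >= patience:
            break
    return best
```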
The total number of transaction records used for training is 2,459,334, while the number for testing is 1,042,714. All transactions are labelled manually by the commercial bank with integers ranging from 0 to 10; the meanings of the labels are presented in Table 3.7. Among them, 0 and 1 represent normal transactions, which we call negative samples; 2–6 represent uncertain transactions; and 7–10 represent fraudulent transactions, which we call positive samples. We exclude the uncertain transactions, and the remaining records form the dataset actually adopted for training and testing. We further draw a number of transaction records from the last half month of the training data and use them as the off-line testing data. To simulate different time periods, we divide the testing data into two equal parts chronologically, resulting in two online testing datasets. The statistics of positive and negative transaction records in each dataset are summarized in Table 3.8.
Table 3.7 Meanings of labels

- 0: Normal transactions released by the system
- 1: Normal transactions confirmed by phone after having been intercepted by the system
- 2: Unmarked transactions
- 3: Customer call not answered
- 4: Invalid customer phone number
- 5: Customer unsure
- 6: Customer refuses to answer
- 7: Other kinds of fraud
- 8: Trojan virus
- 9: Telecommunications fraud
- 10: Phishing website
Table 3.8 Statistics of transaction records

- Training: 0–1 (negative) 1,806,720; 2–6 (discarded) 20,206; 7–10 (positive) 32,788
- Off-line testing: 0–1 (negative) 587,097; 2–6 (discarded) 4,918; 7–10 (positive) 7,605
- Online testing Part 1: 0–1 (negative) 534,033; 2–6 (discarded) 5,919; 7–10 (positive) 10,737
- Online testing Part 2: 0–1 (negative) 469,506; 2–6 (discarded) 8,358; 7–10 (positive) 14,163
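A hedged sketch of the label handling described above, assuming the records are loaded into a pandas DataFrame with a `label` column (the column name is an assumption): uncertain transactions (labels 2–6) are discarded and the rest are mapped to a binary fraud target.

```python
import pandas as pd

def prepare_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Drop uncertain transactions (labels 2-6) and derive a binary fraud label."""
    kept = df[~df["label"].between(2, 6)].copy()
    kept["is_fraud"] = (kept["label"] >= 7).astype(int)  # 7-10 -> positive, 0-1 -> negative
    return kept

# Example with a toy frame; in the book the data come from the bank's records.
toy = pd.DataFrame({"label": [0, 1, 3, 7, 10, 5]})
print(prepare_labels(toy)["is_fraud"].tolist())  # -> [0, 0, 1, 1]
```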
3.4.2.2 Setup and Metrics
Our experiments are conducted on a server with an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40 GHz and 128 GB of RAM. We verify our method LAW on the testing dataset Part 1 and compare LAW with and without the online updating scheme on the testing dataset Part 2. To evaluate the effectiveness of our method, we use the True Positive Rate (Sensitivity or Recall), False Positive Rate (1-Specificity, also called Disturb), Precision, F1-score, and ROC (Receiver Operating Characteristic) curve as the metrics to quantify detection performance over the testing datasets.
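For reference, the listed metrics can be computed with scikit-learn roughly as follows; in particular, the true positive rate at a fixed false positive rate (used by the weighted TPR of the reward) can be read off the ROC curve. The toy arrays are purely illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_curve

def tpr_at_fpr(y_true, y_score, target_fpr):
    """Largest achievable TPR whose FPR does not exceed target_fpr, read off the ROC curve."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    mask = fpr <= target_fpr
    return float(tpr[mask].max()) if mask.any() else 0.0

# Tiny synthetic example; in the experiments y_score would come from the classifier.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.55, 0.7, 0.8, 0.9])
y_pred = (y_score >= 0.5).astype(int)

print("Recall (TPR):", recall_score(y_true, y_pred))
print("Precision:   ", precision_score(y_true, y_pred))
print("F1-score:    ", f1_score(y_true, y_pred))
print("TPR at FPR<=0.1%:", tpr_at_fpr(y_true, y_score, 0.001))
```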
3.4.2.3 Feature Generation Using Time Window
The generation of window-independent features is easy and done once and for all, but the generation of window-dependent features is affected by the specific choice of time window size. Intuitively, the feature generation time is positively related to the window size. Here, we analyze the efficiency of feature generation under each time window size on the testing dataset Part 1. From Fig. 3.17, we observe that the time window size and the feature generation time depend approximately linearly on each other, where the generation time is measured over more than 500,000 transactions in total. For a single transaction record, the average feature generation time is about 1.1 ms. Moreover, across different time window sizes, the per-transaction feature generation time fluctuates within a range of no more than 0.05 ms, which is negligible compared with the model training time. We can therefore conclude that the choice of time window size has little impact on the overall learning time. To demonstrate the importance of the window-dependent features, we assess them with the Gini Index [82]. Choosing 30 s as the time window size, the importance of all features is shown in Fig. 3.18. We observe that, apart from Frequent_IP and Check, most window-dependent features are more important than window-independent features in improving model performance. This means the features generated with the time window really do play an important role in fraud detection. It is noteworthy that Frequent_IP and Check have high importance because their meaning is closely related to fraud. Specifically, Frequent_IP indicates whether a customer makes a transaction in a high-risk environment, such as from an infrequently used IP; in such an environment there is reason to suspect that the customer's account has been stolen by an attacker. The high importance of Check lies in the fact that different verification tools are vulnerable to targeted attacks by fraudsters who exploit their weaknesses.
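The Gini-based assessment can be reproduced in spirit with scikit-learn, whose random forest exposes impurity-based (Gini) importances; this is a sketch under the assumption that `X` holds the fourteen features of Table 3.5 and `y` the binary fraud labels, and the tree count here is an arbitrary choice, not the book's setting.

```python
from sklearn.ensemble import RandomForestClassifier

def gini_feature_importances(X, y, feature_names):
    """Fit a random forest and return Gini (impurity-based) feature importances, sorted."""
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X, y)
    return sorted(zip(feature_names, forest.feature_importances_),
                  key=lambda kv: kv[1], reverse=True)

# Usage (X, y, names assumed to be prepared elsewhere):
# for name, score in gini_feature_importances(X, y, names):
#     print(f"{name:22s} {score:.3f}")
```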
Fig. 3.17 Time consumption of feature generation on the testing dataset Part 1
Fig. 3.18 The feature importances
3.4.2.4 Performance Evaluation of Time Window Models
In this part, we investigate how the time window and its size affect fraud detection. First, we compare the performance of several popular classifiers with and without window-dependent features on the testing dataset Part 1, setting all window sizes to 30 s; we also analyze the training efficiency of each classifier. These models are all implemented with open source packages in the Python environment (available via https://scikit-learn.org/stable/supervised_learning and https://xgboost.readthedocs.io/en/latest): all models except XGBoost are implemented with the machine learning tool 'scikit-learn', and the XGBoost model is implemented by importing the 'xgboost' package. Note that the Neural Network is built with a multi-layer perceptron classifier (MLPClassifier) with two hidden layers. In general, k-fold cross validation is a widely used parameter selection method; however, in practical anti-fraud applications it may cause the so-called time-crossing problem, where future information is used to check past transactions. Therefore, we select the parameters through grid search on the off-line testing dataset, and then evaluate the model on the online testing Part 1 dataset. Grid search is an exhaustive search method in 'scikit-learn' that evaluates the model under each combination in a parameter list and takes the best-performing combination as the optimal parameters. Finally, 500 trees with a depth of 6 are selected for the Random Forest and XGBoost models; the minimum number of samples required to split a node is set to 100 in RF, and 80% of the features and samples are randomly selected to build each decision tree. The Logistic Regression uses the 'sag' solver and 'L2' penalty and is trained for 100 iterations. The remaining parameters are set to their defaults. The ROC curves are depicted in Fig. 3.19.
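As a rough sketch of the configurations just described (hyper-parameters quoted from the text; anything else, such as the hidden-layer widths of the MLP or the Gaussian variant of Naive Bayes, is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

models = {
    # 500 trees of depth 6; min 100 samples to split; 80% of features/samples per tree.
    "RF": RandomForestClassifier(n_estimators=500, max_depth=6, min_samples_split=100,
                                 max_features=0.8, max_samples=0.8, bootstrap=True),
    "XGB": XGBClassifier(n_estimators=500, max_depth=6,
                         subsample=0.8, colsample_bytree=0.8),
    "LR": LogisticRegression(solver="sag", penalty="l2", max_iter=100),
    "NB": GaussianNB(),
    # Two hidden layers, as stated in the text; the layer widths are assumptions.
    "NN": MLPClassifier(hidden_layer_sizes=(64, 32)),
}

# Each model is trained twice: once on the window-independent features only, and once
# with the window-dependent features added (the "+TW" variants).
# for name, model in models.items():
#     model.fit(X_train, y_train)
```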
Fig. 3.19 ROC curves of predictive models using different classifiers on the testing dataset Part 1
We can observe that with time windows, all involved methods, including Random Forest (RF + TW), XGBoost (XGB + TW), Logistic Regression (LR + TW), Naive Bayes (NB + TW) and the Neural Network with two hidden layers (NN + TW), outperform their counterparts without time windows. We now have a preliminary understanding of the approximate performance of all candidate classifiers, and we next need to choose one of them for use in our method LAW. As LA requires multiple learning iterations to converge, the chosen classifier should not only perform well but also be economical in training time. Figure 3.20 depicts the average training time of each classifier except the Neural Network, whose time consumption is far greater than that of the other classifiers even when a pre-trained model is considered. From Fig. 3.19, we see that RF + TW and XGB + TW have almost comparably high performance, but Fig. 3.20 suggests that training RF + TW costs less time than XGB + TW. Based on these facts, we choose Random Forest as the classifier for our method LAW in the following experiments. We then analyze the fluctuations of the true positive rate, precision and F1-score under different choices of time window size on the off-line testing data. In this test, the false positive rate is fixed at 0.0005, 0.001 and 0.005, respectively.
Fig. 3.20 The average training time consumptions of different classifiers: (a) without TW; (b) with TW
Fig. 3.21 The true positive rate, precision and F1-score when choosing different sizes of time window
We analyze time windows with sizes ranging from half a minute to 60 min. When selecting different time window sizes, we keep both the training samples and the settings of the Random Forest classifier unchanged, and obtain the statistics of the listed metrics shown in Fig. 3.21, which reflects how the time window size affects model performance. Comparing the performance under different false positive rates, we observe that the True Positive Rate, Precision and F1-score do not change in step with the time window size. Since the performance fluctuates considerably as the window size changes, selecting the most suitable window size is of great significance.
3.4.2.5 Performance Evaluation of LAW
As mentioned above, different choices of time window size result in different performance. In this part, we elaborate on the performance obtained with LAW. The input parameters of Algorithm 3.1 are set as follows. The number of candidate window sizes N is 120, obtained by searching all time window sizes between 0.5 and 60 min at intervals of 30 s. The size m of the recent weighted true positive rate list is set to 10 based on experience. Following the principle of last-position elimination used in each iteration for faster convergence, the elimination threshold t is set to a quarter of the initial probability density, i.e., 1/(4N). To better demonstrate the performance of the proposed method, we define a new metric, shown in Eq. (3.5):

\[
M_{relative} = \frac{TPR_{weighted} - TPR_{weighted}^{median}}{TPR_{weighted}^{optimal} - TPR_{weighted}^{median}}, \qquad (3.5)
\]
where TPR_weighted is the weighted true positive rate under our method, TPR_weighted^optimal is the optimal weighted true positive rate under the exhaustive method, and TPR_weighted^median is the median weighted true positive rate under the exhaustive method.
Fig. 3.22 Performance (weighted true positive rate) under the exhaustive method and ROC curves of predictive models with different methods
This relative performance metric takes values in the range [0, 1]: the closer it is to 1, the higher the relative performance, that is, the closer to the best performance under the exhaustive method. Furthermore, we select two fixed time window sizes as comparative methods (the optimal window size and the median-performance window size, respectively), and we also compare LAW with the method that does not use a time window. We use the testing dataset Part 1 to evaluate LAW without the updating scheme. Figure 3.22a shows the performance of the exhaustive method on the testing dataset Part 1. From this, we select the optimal window size (31 min) and the median-performance window size (39.5 min, which ranks 60th in performance among all 120 candidate window sizes) for comparison. The window size selected by our method (34.5 min) ranks 2/120 (top 2%) among all candidate window sizes on the testing dataset Part 1. According to the performance statistics under the exhaustive method, the relative performance value M_relative is 0.9574 on the testing dataset Part 1. From Fig. 3.22b, we conclude that our method LAW achieves better detection performance than all comparative methods except the optimal window size obtained under the exhaustive method. The performance of our approach is very close to the best performance under the exhaustive method; it does not cause a significant drop in accuracy (weighted true positive rate) compared with the exhaustive method.
In a similar way, we evaluate LAW without the updating scheme on the testing dataset Part 2. For full LAW, we use the testing dataset Part 1 as the off-line testing set; the procedure is almost the same as in the previous learning stage, except that the probability density function is re-initialized at the beginning, and we then evaluate LAW on the testing dataset Part 2. Figure 3.22c shows the performance of the exhaustive method on the testing dataset Part 2, from which we select the optimal window size (46.5 min) and the median-performance window size (56.5 min). The window size selected by our method (38 min) ranks 4/120 (top 4%) among all candidate window sizes on the testing dataset Part 2, and the relative performance value M_relative is 0.8214. In Fig. 3.22d, our method LAW outperforms the methods using the median-performance window size under the exhaustive method and the method without a time window; even LAW without the updating scheme performs better than these two. In addition, both LAW and LAW without the updating scheme achieve performance close to that of the optimal window size under the exhaustive method. Although LAW without the updating scheme appears comparable to LAW, when a very small false positive rate is required, say below 0.003, the full LAW shows non-negligible advantages. This validates the sensitivity of our method LAW to changing transaction patterns, allowing it to retain its performance gain when fluctuations are encountered.
3.4.2.6 Advantage Analysis of LAW
In this part, we further analyze the advantages of LAW based on the experiments in Sect. 3.4.2.5. We first analyze the parameters λ and σ1 in Eq. (3.3), which control the updating speed of the probability density function. We conduct groups of comparative experiments by varying the values of these two hyper-parameters, so that we can compare the convergence efficiency and find out how they work. From Table 3.9, we see that the number of iterations decreases roughly linearly as λ increases and increases roughly linearly as σ1 increases. Meanwhile, when λ is too large or σ1 is too small, LA cannot converge and the most suitable window size cannot be determined.
Table 3.9 The efficiency of the convergence using different hyper-parameters

- λ = 1/240, σ1 = 0.5: 763 iterations, does not converge
- λ = 1/240, σ1 = 1: 1527 iterations, converges
- λ = 1/240, σ1 = 1.5: 2355 iterations, converges
- λ = 1/120, σ1 = 1: 874 iterations, converges
- λ = 1/60, σ1 = 1: 173 iterations, does not converge
Fig. 3.23 Convergence process of probability density function of time window
Fig. 3.24 Comparison of the previous final probability and the initialized probability
In this work, we choose λ = 1/240 and σ1 = 1. It then takes 1527 iterations for LA to converge, and the most suitable window size is 34.5 min (2070 s). As the number of iterations increases, the probability density function converges to the most suitable window size; in other words, the probability of selecting the area around the most suitable time window size becomes larger than that of other areas. Detailed information on the convergence process of LAW without the updating scheme is given in Fig. 3.23a. With the help of the updating scheme, LA significantly reduces the number of iterations needed to converge. The detailed convergence behavior of LAW on the testing dataset Part 1 can be observed in Fig. 3.23b; it uses the most suitable window size selected in the previous learning stage as the basis for re-initialization. More specifically, a Gaussian distribution with σ2 = 6, centered on the most suitable time window size (34.5 min) selected in the previous learning stage, is used as the new probability density. Figure 3.24 compares the final converged probability density of the previous stage with the newly initialized probability density function. The new most suitable window size differs from that of the previous learning stage: it becomes a little larger, changing from 34.5 min (2070 s) to 38 min (2280 s). It takes fewer iterations (about 500) for LA to converge, which can mainly be explained by the fact that the newly initialized probability density builds on historical experience. This verifies that the online updating scheme helps reduce the number of iterations needed to find the most proper window size.
Fig. 3.25 Times of each window size selected in the whole convergence process
Next, we analyze the number of times each time window size is selected during the convergence process of LA. Since most of the iteration time is spent generating new features and manipulating classification models, higher efficiency is achieved when LA selects repeated time window sizes more often during the iterations. As shown in Fig. 3.25, only part of the candidate time window sizes are actually selected, so the time actually consumed is greatly reduced with the help of LA; it is only a fraction of the time required to evaluate all time window sizes with the exhaustive method. The ratio T_ratio between them is defined as:

\[
T_{ratio} = \frac{N_{non\text{-}repeating}}{N_{total}}, \qquad (3.6)
\]
where N_non-repeating represents the number of distinct time window sizes actually picked by LA (each counted once) and N_total represents the total number of candidate time window sizes. In Fig. 3.25a, 63 distinct time window sizes are selected by LA during the whole convergence process on the off-line testing dataset, so T_ratio = 63/120; this means the time spent manipulating the classification models is only 63/120 of the time used by the exhaustive method. These are the benefits that LA brings. Figure 3.25b shows the number of times each window size was selected on the testing dataset Part 1, where T_ratio = 81/120, saving about a third of the time compared with the exhaustive method. Moreover, no further time is needed when a chosen window size has already appeared in the previous learning stage: as shown in Fig. 3.25b, the window sizes within the red parts were already picked in the previous learning stage and need not be evaluated again, so the real ratio is T_ratio = 36/120, which is considerably time-saving. To show the time saving of our method more clearly, we also quantify the full end-to-end running time, in which most of the iteration time is occupied by generating new features and manipulating classification models while the running time of LAW itself is
Fig. 3.26 Comparison of the full end-to-end running time of the exhaustive method and LAW on the two testing datasets
negligible by comparison. As shown in Fig. 3.26, both experiments show that nearly half or more of the running time is saved compared with the exhaustive method. In summary, we conclude that the most suitable time window size selected by LA does improve the performance of the predictive models. In particular, the dedicated online updating scheme in our method LAW ensures good robustness of the whole model in dynamic environments. Moreover, compared with the exhaustive methods, ours is very time-saving in both selecting and adjusting the sizes of the time windows.
3.5 Conclusion

3.5.1 Behavior Prediction

Different from most existing studies, which usually aim to design fraud detection methods based on interim patterns, this work takes a different point of view and investigates whether transaction fraud can be detected in an ex-ante manner in an online payment service. We have obtained the insightful finding that there really is a relation between a user's historical payment behavior and his/her account compromise. Based on this finding, we can predict a fraudulent transaction before it occurs, without the on-going transaction behaviors that are necessary for any interim fraud detection method. On a real-world dataset, it is validated that this ex-ante fraud prediction can prevent more than 80% of fraudulent transactions before their actual occurrence. Moreover, utilizing their complementary effects, we have designed a real-time fraud prevention and detection system that combines the ex-ante and interim methods to further improve effectiveness.
3.5.2 Behavior Analysis

We have designed a method called LAW for online payment fraud detection. It searches for appropriate window-dependent features by automatically learning the most suitable time window size; these features represent customers' behavior patterns over a certain period. Our method selects the window size more economically, saving time compared with brute-force methods. Furthermore, LAW enables the window size to adapt to dynamic environments through an updating scheme. Experiments on a real-life online payment dataset from a commercial bank show that our method detects fraud more effectively than methods that use a preset fixed window size or no time window at all. In addition, with the dynamic updating scheme, LAW better ensures the robustness of the model.
3.5.3 Future Work

There are some useful and interesting issues left to study:
• We will address the cold-start problem of risk prediction for accounts whose historical transaction count is smaller than the size of the transaction sequence windows.
• It is also interesting to investigate how to overcome the concept drift problem [83] in risk prediction.
• One of our future works is to expand the range of candidate time window sizes and the search space to enhance the usability of the proposed method.
• To allocate iteration budgets more rationally and save more time, we will exploit another paradigm, called optimal computing budget allocation (OCBA), to maximize the probability of selecting the truly optimal actions.
• We could utilize different window sizes for different features, for example, a model with two sets of window-dependent features extracted from a small and a large window. We look forward to exploring this optimization issue under multiple windows.
• We are pursuing opportunities to cooperate with more banks and third-party fintech companies, so that we can obtain more and larger data samples (in both time span and data scale) to verify the proposed method.
References
References 1. B. Cao, M. Mao, S. Viidu, P.S. Yu, in Proceedings of the IEEE ICDM 2017, New Orleans, LA, USA, November 18–21, 2017 (2017), pp. 769–774 2. K. Thomas, F. Li, A. Zand, J. Barrett, J. Ranieri, L. Invernizzi, Y. Markov, O. Comanescu, V. Eranti, A. Moscicki, D. Margolis, V. Paxson, E. Bursztein, in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30–November 03, 2017 (2017), pp. 1421–1434. https://doi.org/10.1145/ 3133956.3134067 3. P. Hille, G. Walsh, M. Cleveland, J. Interact. Mark. 30, 1 (2015) 4. L. Coppolino, S. D’Antonio, V. Formicola, C. Massei, L. Romano, J. Ambient Intell. Human. Comput. 6(6), 753 (2015) 5. Q. Yang, X. Hu, Z. Cheng, K. Miao, X. Zheng, in Proceedings of the CloudComp 2014 (2014), pp. 98–106 6. S. Mittal, S. Tyagi, in Handbook of Computer Networks and Cyber Security, Principles and Paradigms (2020), pp. 653–681 7. K. Thomas, F. Li, C. Grier, V. Paxson, in Proceedings of the ACM SIGSAC 2014 (2014), pp. 489–500 8. N.F. Ryman-Tubb, P. Krause, W. Garn, Eng. Appl. AI 76, 130 (2018) 9. S. Xie, P.S. Yu, in Proceedings of the The 4th IEEE International Conference on Collaboration and Internet Computing (2018), pp. 279–282 10. M.A. Ali, B. Arief, M. Emms, A.P.A. van Moorsel, I.E.E.E. Secur, Privacy 15(2), 78 (2017) 11. A. Jarovsky, T. Milo, S. Novgorodov, W. Tan, in Proceedings of the IEEE ICDE 2018 (2018), pp. 125–136 12. S. Nami, M. Shajari, Expert Syst. Appl. 110, 381 (2018) 13. C. Jing, C. Wang, C. Yan, Proceedings of the International Conference on Financial Cryptography and Data Security 2019, pp. 140–155 14. E. Kim, J. Lee, H. Shin, H. Yang, S. Cho, S. Nam, Y. Song, J. Yoon, J. Kim, Expert Syst. Appl. 128, 214 (2019) 15. U. Porwal, S. Mukund, in Proceedings of the IEEE TrustCom/BigDataSE 2019 (2019), pp. 280–287 16. S. Elshaar, S. Sadaoui, Appl. Artif. Intell. 34(1), 47 (2020) 17. S. Wang, C. Liu, X. Gao, H. Qu, W. Xu, in Proceedings of the ECML PKDD 2017 (2017), pp. 241–252 18. A.D. Pozzolo, G. Boracchi, O. Caelen, C. Alippi, G. Bontempi, IEEE Trans. Neural Netw. Learn. Syst. 29(8), 3784 (2018) 19. Y. Ban, X. Liu, L. Huang, Y. Duan, X. Liu, W. Xu, in Proceedings of the WWW 2019, San Francisco, CA, USA, May 13–17, 2019 (2019), pp. 83–93 20. C. Jing, C. Wang, C. Yan, in Proceedings of the FC, Frigate Bay, St. Kitts and Nevis, February 18–22, 2019. Revised Selected Papers 2019, 588–604 (2019) 21. C. Jing, C. Wang, C. Yan, in Proceedings of the International Conference on Financial Cryptography and Data Security (2019), pp. 140–155 (2019) 22. M.S. Obaidat, G.I. Papadimitriou, A.S. Pomportsis, IEEE Trans. Syst. Man Cybern. Part B 32(6), 706 (2002) 23. R. Barskar, A.J. Deen, J. Bharti, G.F. Ahmed, CoRR (2010). http://arxiv.org/abs/1005.4266 24. C. Whitrow, D.J. Hand, P. Juszczak, D.J. Weston, N.M. Adams, Data Min. Knowl. Discov. 18(1), 30 (2009) 25. T. Fawcett, F.J. Provost, Data Min. Knowl. Discov. 1(3), 291 (1997) 26. R.J. Bolton, D.J. Hand, Stat. Sci. 17(3), 235 (2002) 27. C. Wang, H. Zhu, IEEE Trans. Depend. Secure Comput. 19(1), 301 (2022). https://doi.org/10. 1109/TDSC.2020.2991872 28. A.D. Pozzolo, G. Boracchi, O. Caelen, C. Alippi, G. Bontempi, IEEE Trans. Neural Netw. Learn. Syst. 29(8), 3784 (2018)
29. J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, A. Bouchachia, ACM Comput. Surv. 46(4), 44:1 (2014) 30. G. Widmer, M. Kubat, Mach. Learn. 23(1), 69 (1996) 31. D. Malekian, M.R. Hashemi, in Proceedings of the International ISC Conference on Information Security and Cryptology (2014), pp. 1–6 32. J. West, M. Bhattacharya, M.R. Islam, in Proceedings of the International Conference on Security and Privacy in Communication Networks (2014), pp. 186–203 33. A. Jarovsky, T. Milo, S. Novgorodov, W. Tan, in Proceedings of the IEEE ICDE 2018 (2018), pp. 125–136 34. A. Jarovsky, T. Milo, S. Novgorodov, W. Tan, Proc. VLDB Endowment 11(12), 1998 (2018) 35. T. Milo, S. Novgorodov, W. Tan, in Proceedings of the The 21th International Conference on Extending Database Technology (2018), pp. 265–276 36. S. Jha, M. Guillen, J.C. Westland, Expert Syst. Appl. 39(16), 12650 (2012) 37. Q. Yang, X. Hu, Z. Cheng, K. Miao, X. Zheng, in Proceedings of the The 5th International Conference on Cloud Computing (2014), pp. 98–106 38. A.G.C. de Sá, A.C.M. Pereira, G.L. Pappa, Eng. Appl. AI 72, 21 (2018) 39. S. Akila, U.S. Reddy, J. Comput. Sci. 27, 247 (2018) 40. A.C. Bahnsen, D. Aouada, A. Stojanovic, B.E. Ottersten, Expert Syst. Appl. 51, 134 (2016) 41. N. Carneiro, G. Figueira, M. Costa, Decis. Support Syst. 95, 91 (2017) 42. S. Nami, M. Shajari, Expert Syst. Appl. 110, 381 (2018) 43. Y. Wang, S. Adams, P.A. Beling, S. Greenspan, S. Rajagopalan, M.C. Velez-Rojas, S. Mankovski, S.M. Boker, D.E. Brown, in Proceedings of the The 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (2018), pp. 1070–1078 44. S. Yuan, X. Wu, J. Li, A. Lu, in Proceedings of the ACM CIKM 2017 (2017), pp. 2419–2422 45. K. Fu, D. Cheng, Y. Tu, L. Zhang, in Proceedings of the NeurIPS 2016 (2016), pp. 483–490 46. J. Qian, X. Li, C. Zhang, L. Chen, T. Jung, J. Han, IEEE Trans. Depend. Secure Comput. 16(4), 679 (2019) 47. B. Hooi, K. Shin, H.A. Song, A. Beutel, N. Shah, C. Faloutsos, ACM Trans. Knowl. Discovery Data 11(4), 44:1 (2017) 48. J.J. Ying, J. Zhang, C. Huang, K. Chen, V.S. Tseng, ACM Trans. Knowl. Discov. Data 12(6), 68:1 (2018) 49. D. Olszewski, Knowl.-Based Syst. 70, 324 (2014) 50. D. de Roux, B. Perez, A. Moreno, M. Villamil, C. Figueroa, in Proceedings of the ACM SIGKDD 2018 (2018), pp. 215–222 51. J. Jurgovsky, M. Granitzer, K. Ziegler, S. Calabretto, P. Portier, L. He-Guelton, O. Caelen, Expert Syst. Appl. 100, 234 (2018) 52. T. Wüchner, A. Cislak, M. Ochoa, A. Pretschner, IEEE Trans. Depend. Secure Comput. 16(1), 99 (2019) 53. A. Saracino, D. Sgandurra, G. Dini, F. Martinelli, IEEE Trans. Depend. Secure Comput. 15(1), 83 (2018) 54. M.U.K. Khan, H.S. Park, C. Kyung, IEEE Trans. Inf. Foren. Secur. 14(2), 541 (2019) 55. C. Wang, B. Yang, J. Cui, C. Wang, IEEE Trans. Comput. Soc. Syst. 6(4), 637 (2019) 56. T. Milo, S. Novgorodov, W. Tan, in Proceedings of the EDBT 2018 (2018), pp. 265–276 57. Y. Zhang, J. Zhou, W. Zheng, J. Feng, L. Li, Z. Liu, M. Li, Z. Zhang, C. Chen, X. Li, Y.A. Qi, Z. Zhou, ACM Trans. Intell. Syst. Technol. 10(5), 55:1 (2019) 58. P. Zheng, S. Yuan, X. Wu, J. Li, A. Lu, in Proceedings of the AAAI 2019 (2019), pp. 1286–1293 59. B. Cao, M. Mao, S. Viidu, P.S. Yu, in Proceedings of the IEEE ICDM 2017 (2017), pp. 769–774 60. J.J. Ying, J. Zhang, C. Huang, K. Chen, V.S. Tseng, ACM Trans. Knowl. Discov. Data 12(6), 68:1 (2018) 61. S. Liu, B. Hooi, C. Faloutsos, IEEE Trans. Knowl. Data Eng. 31(12), 2235 (2019) 62. A. Sangers, M. van Heesch, T. Attema, T. Veugen, M. 
Wiggerman, J. Veldsink, O. Bloemen, D. Worm, in Proceedings of the International Conference on Financial Cryptography and Data Security 2019 (2019), pp. 605–623 (2019)
63. M. McGlohon, S. Bay, M.G. Anderle, D.M. Steier, C. Faloutsos, in Proceedings of the ACM SIGKDD 2009 (2009), pp. 1265–1274 64. U. Porwal, S. Mukund, CoRR (2018). http://arxiv.org/abs/1811.02196 65. S. Dhankhad, E.A. Mohammed, B. Far, in Proceedings of the The 19thIEEE International Conference on Information Reuse and Integration, (2018), pp. 122–125 66. I. Sohony, R. Pratap, U. Nambiar, in Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (2018), pp. 289–294 67. J. Liono, A.K. Qin, F.D. Salim, in Proceedings of the The 13th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services (2016), pp. 10–19 68. S. Nisar, O.U. Khan, M. Tariq, Comput. Int. Neurosc. 2016, 6172453:1 (2016) 69. S.H. Seyyedi, B. Minaei-Bidgoli, Int. J. Commun. Syst. 31(8) (2018) 70. A. Liu, K. Chen, Q. Liu, Q. Ai, Y. Xie, A. Chen, Sensors 17(11), 2576 (2017) 71. E. Cuevas, D. Zaldivar, M.A.P. Cisneros, CoRR (2014). http://arxiv.org/abs/1405.7361 72. J. Zhang, C. Wang, M. Zhou, IEEE Trans. Cybern. 44(12), 2484 (2014) 73. L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees (Wadsworth, 1984) 74. R. Busa-Fekete, B. Szörényi, P. Weng, S. Mannor, in Proceedings of the IEEE ICML 2017 (2017), 625–634 (2017) 75. F. Wu, X. Jing, S. Shan, W. Zuo, J. Yang, in Proceedings of the AAAI 2017 (2017), pp. 1583– 1589 76. R.A. Mohammed, K.W. Wong, M.F. Shiratuddin, X. Wang, in Proceedings of the PRICAI 2018 (2018), pp. 237–246 77. T. Chen, C. Guestrin, in Proceedings of the ACM SIGKDD 2016 (2016), pp. 785–794 78. R. M.Lerner, Linux journal (Sep. TN.197), 20 (2010) 79. Y. Zhang, J. Zhou, W. Zheng, J. Feng, L. Li, Z. Liu, M. Li, Z. Zhang, C. Chen, X. Li, Y.A. Qi, Z. Zhou, ACM Trans. Intell. Syst. Technol. 10(5), 55:1 (2019) 80. Y. Zhu, D. Xi, B. Song, F. Zhuang, S. Chen, X. Gu, Q. He, in Proceedings of the WWW 2020, Taipei, Taiwan, April 20-24, 2020 (2022), pp. 928–938 81. B.J. Oommen, E.R. Hansen, IEEE Trans. Syst. Man Cybern. 14(3), 542 (1984) 82. D. Liu, Z. Li, K. Du, H. Wang, B. Liu, H. Duan, in Proceedings of the ACM CCS 2017, Dallas, TX, USA, October 30–November 03, 2017 (2017), pp. 537–552 83. J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, G. Zhang, IEEE Trans. Knowl. Data Eng. 31(12), 2346 (2019)
Chapter 4
Explicable Integration Techniques: Relative Temporal Position Taxonomy
4.1 Concepts and Challenges

With the rapid growth of electronic commerce, online payment services (OPSs) have gained popularity. People tend to purchase and trade through online payment services, which play an important role in life and business. However, the increased convenience that online payment services bring to modern life comes with inherent security risks from various cybercrimes involving online payment fraud [1]. Online payment frauds cause massive economic losses [2, 3] and are a serious threat to both customers [4] and platforms [5]. Therefore, it is urgent to develop effective and efficient online payment fraud detection systems [6] in order to improve the Quality of Experience (QoE) and Quality of Protection (QoP) in OPSs. Data-driven anti-fraud engineering is a promising paradigm due to its effectiveness and automation [7–10]. As a matter of fact, it is really difficult for a single-function module to achieve high fraud detection performance because of the complex and varied patterns of online payment frauds [11]. Consequently, the integration of functional modules is necessary for almost all anti-fraud systems [12]. Behind such an integration there is actually a trade-off between detection performance and decision explainability, both of which are important criteria for the security and dependability of OPSs [13]. Some methods based on deep learning [14] achieve good performance but provide no human-understandable basis for their judgments. For OPSs, the fraud detection system requires decision explainability, i.e., the ability to explain which behavior characteristics lead to the system's decisions. Researchers are also paying more and more attention to model explainability, since understanding the reasons behind predictions is quite important for assessing trust in a model [15]. To retain decision explainability, which is especially important for anti-fraud systems with self-enhancing capability and good user experience [16], it is really difficult for an integration to effectively improve detection performance. Some works, e.g., voting-based stacking [17], focus on training an explainable learner to integrate the results of various modules. However, this integration approach often
fails to match deep learning based approaches when performance is considered in isolation. In addition, integration tends to increase processing latency and computing consumption due to the invocation of multiple functional modules, which is usually more than large-scale OPSs can tolerate [18]. Since the integration system needs the input of each module, the upper limit of its efficiency is severely constrained by the most inefficient module. The efficiency of an integration system thus follows the "bucket effect", which limits the generalization ability of the system and places higher requirements on the modules. Therefore, a good integration is hard-won for the anti-fraud systems of OPSs, since it must improve detection performance while simultaneously ensuring decision explainability and limiting processing latency and computing consumption. In this work, we propose such a qualified integration system in terms of QoE and QoP, named CAeSaR, that satisfies all of these requirements. This hard-won gain is achieved by the cooperation of the two following innovative techniques.

The first is a novel taxonomy of function division, called TRTPT (Three-way Relative Temporal Position Taxonomy), based on the temporal positions of transactions relative to a reference fraudulent transaction. More specifically, CAeSaR divides anti-fraud function modules into three kinds according to the three possible states of an on-going transaction on the timeline once a suspected fraudulent transaction is adopted as a reference point. For an on-going transaction, in the first state it will be followed by the reference fraud; in the second state it follows the reference fraud; and in the third state it is not adjacent to the reference fraud. For the first and second states, the function module can utilize knowledge about the fraudulent characteristics of the on-going transaction as the follower and followee of the reference fraud, respectively; accordingly, we employ misuse detection for these two states by matching transaction features to known fraud patterns. For the third state, without the additional knowledge available in the other two states, we introduce anomaly detection, which detects frauds by first learning the characteristics of normal activity and then flagging what deviates from it. Based on the taxonomy TRTPT, CAeSaR introduces three corresponding kinds of anti-fraud function modules, i.e., Risk review (Rr), Subsequent analysis (Sa), and Association evaluation (Ae), to check the three kinds of transactions, respectively. The biggest advantage of TRTPT is that it yields two necessary conditions for high-performance anti-fraud integration systems, i.e., the completeness and complementarity of the function module division. The former is the ability to cover as many types of frauds as possible; the latter means that any two modules should overlap as little as possible, so that they effectively complement each other.

The second is an integration scheme, called TELSI (Triple Element Logical Stacking Integration). CAeSaR deploys it in a Center control (Cc) module to generate candidate decision strategies by combining the judgments from the three function modules with two basic logical connectives (conjunction and disjunction). With these two logical operations, CAeSaR can give justifications when decisions are made, which guarantees the decision explainability of the anti-fraud system. Furthermore, taking into account the credibility of the judgments from the three function modules, we define the
valid decision strategies and prove that their number is limited. Finally, by designing a stacking-based multi-classification algorithm, TELSI adaptively assigns the most effective decision strategy to each transaction, which further improves the integration. In summary, the taxonomy completeness of TRTPT and the integration complementarity of TELSI jointly contribute to the improvement in detection performance; the integration explicitness of TELSI ensures decision explainability with the help of the taxonomy complementarity of TRTPT; and the integration adaptivity of TELSI reduces processing latency and computing consumption while preserving qualified detection performance.
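Purely as an illustration of the idea, not the book's TELSI implementation, combining three boolean module judgments with conjunction and disjunction can be sketched as follows; the module names Rr, Sa and Ae follow the text, while the particular candidate strategies enumerated here are arbitrary examples.

```python
from itertools import product

# Candidate decision strategies: each combines the judgments of the three modules
# (Rr, Sa, Ae) using only conjunction (and) and disjunction (or), so every decision
# can be traced back to the module judgments that triggered it.
STRATEGIES = {
    "Rr or Sa or Ae":    lambda rr, sa, ae: rr or sa or ae,
    "Rr and Sa":         lambda rr, sa, ae: rr and sa,
    "(Rr or Sa) and Ae": lambda rr, sa, ae: (rr or sa) and ae,
    "Rr and Sa and Ae":  lambda rr, sa, ae: rr and sa and ae,
}

def explain(strategy_name, rr, sa, ae):
    """Apply one strategy and report which module judgments support the decision."""
    decision = STRATEGIES[strategy_name](rr, sa, ae)
    support = [name for name, flag in (("Rr", rr), ("Sa", sa), ("Ae", ae)) if flag]
    return decision, support

# Truth table of one strategy over all possible module outputs.
for rr, sa, ae in product([False, True], repeat=3):
    decision, support = explain("(Rr or Sa) and Ae", rr, sa, ae)
    print(rr, sa, ae, "->", decision, "supported by", support)
```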
4.2 Main Technical Means of Anti-fraud Integration System

We review related studies from three aspects: typical anti-fraud function module divisions, integration schemes, and explanation methods.
4.2.1 Anti-fraud Function Divisions

Anti-fraud problems have been investigated extensively from different aspects, e.g., behavioral characteristics, behavioral agents, and model functions. Some researchers have made efforts to model users' behaviors along the dimension of behavioral characteristics [19–21], such as online behavior, offline behavior, and social behavior. Zhu et al. [16] proposed a hierarchical explainable network to model users' behavior sequences based on e-commerce data. Oh and Iyengar [22] proposed an end-to-end framework using inverse reinforcement learning to identify anomalies in GPS trajectories. Wang et al. [23] utilized a guilt-by-association method on directed graphs to detect fraudulent users in online social networks. Modeling users in a single dimension ignores the relevance of different behavioral spaces; utilizing the complementarity among multi-dimensional behaviors is an effective means of building user models, since the behavioral characteristics of similar agents in different dimensions can be fused to improve the performance of fraud detection. Many researchers concentrate on the classification of behavioral agents [24], including the population level, individual level, and group level. Rzecki et al. [25] designed a system to collect data from the execution of single-finger gestures on a mobile device and identified the best classification method for person recognition. Mazzawi et al. [26] presented a novel approach for detecting malicious activity in databases by examining users' self-consistency and global consistency. Zhao et al. [27] trained a seller behavior model to predict sellers' fraudulent
behaviors and proposed a novel deep reinforcement learning algorithm to improve the platform's impression allocation mechanism. The above methods are devised to take effect when frauds occur; with respect to model function, they are detection methods [14, 28]. Some studies take a broader view, i.e., fraud prediction. Beutel et al. [29] utilized graph analysis tools for user behavior modeling and gave examples of research using the techniques to model, understand, and predict normal behaviors. Wang, Martins de Moraes, and Bari [30] developed a comprehensive predictive analysis framework to predict and adjust new fraud cases through experimentation with an ensemble of sampling algorithms, feature engineering methods, and an array of machine learning algorithms. Huang et al. [31] used cycle representation to capture pre-failure symptoms and employed a sequential recurrent neural network model and an averaged recurrent neural network model for anomaly prediction. Most of the above methods only consider the anti-fraud problem from a specific angle, which makes it difficult to adapt to complex and changeable frauds. It is necessary to solve the problem from a more comprehensive perspective.
4.2.2 Module Integration Schemes

Usually, single-function modules cannot work well due to the complexity of fraud behaviors and the limitation of data. Recent works have turned to integrating various modules to improve performance. Li et al. [32] proposed a sandwich-structured sequence learning architecture by stacking an ensemble model, a deep sequential learning model, and another top-layer ensemble classifier in proper order. Zhong et al. [33] utilized multi-view attributes to build a heterogeneous information network by considering a more general concept in credit risk. Sun, Wu, and Xu [34] fed three categories of features into different classic LSTM (Long Short-Term Memory) models, and integrated the output layers of the three LSTM models with a multilayer perceptron to detect financial statement fraud. Forough and Momtazi [35] proposed an ensemble model based on sequential modeling of data using deep recurrent neural networks and a novel voting mechanism based on an artificial neural network to detect fraudulent actions. In online payment services, human-understandable explanations should be given when decisions are made. The existing methods integrate various models or features for better performance but lack the consideration of explainability [36–38].
4.2.3 Explanation Methods

For an artificially intelligent system, an important ability is to explain the decisions, predictions, or actions made by it and the process through which they are made [39]. Explanation is important for user acceptance and satisfaction, especially in an online payment anti-fraud system.
One method is to design an inherently explainable model. Some approaches tried to create sparse models via feature selection or extraction to optimize explainability [40, 41]. Other studies focused on shallow rule-based models that are readily explainable by humans: decision lists and decision trees. Rudin et al. [42] introduced classifiers that use association rules [43] to learn efficiently from sparse data. Other studies concentrate on explaining existing models via post-hoc techniques [44]. LIME [15] was proposed to explain the predictions of any classifier by learning an explainable model locally around the prediction. Anchors [45] is able to compute these explanations efficiently for any black-box model with high-probability guarantees. Wu et al. [46] trained deep time-series models so that their class-probability predictions have high accuracy while being closely modeled by decision trees with few nodes. Generally, post-hoc explainable methods increase the extra calculation burden and time delay. Therefore, we focus on an inherently explainable model which can give explanations on fraudulent characteristics when decisions are made.
4.3 System Integration Architecture

The whole architecture of CAeSaR is illustrated in Fig. 4.1, where the Center control (Cc) module and the anti-fraud function modules, i.e., Association evaluation (Ae), Subsequent analysis (Sa), and Risk review (Rr), constitute the main framework of the system CAeSaR. The system predicts the fraud risk of transactions through the control of the Center control module and the coordination of the complementary function modules. The inputs of the system are the attributes of on-going transactions, and the outputs are the results of whether the checked transactions are fraudulent.
Fig. 4.1 The architecture of CAeSaR
Fig. 4.2 The workflow of CAeSaR
In this work, we choose the REDIS [47] database to store transaction records. (REDIS can quickly perform streaming calculations, which can accelerate feature processing. Note that the REDIS database in our system is not necessary and can be replaced by other efficient databases.) Figure 4.2 presents the workflow of CAeSaR. When the transaction records are taken from the database and fed into our system, the Cc module determines the decision strategies according to transaction features and distributes transactions to the corresponding function modules. Then these modules work in parallel and the Cc module turns to distribute the next transaction, which means the function module that is not invoked in the previous transaction can immediately begin to process the on-going transaction. The inputs of the function modules are transaction features that are further processed according to their own demands, and the outputs are their judgments. Based on the decision strategies and the judgments, the Cc module outputs whether the transactions are suspicious or not. According to their anomaly degrees, the suspicious transactions will be handed over to manual secondary verification, or be intercepted, or be released. Manual secondary verification will determine whether these remaining transactions are finally intercepted or released. In the anti-fraud function modules, we propose a three-way taxonomy of function division, named TRTPT (Three-way Relative Temporal Position Taxonomy). Under TRTPT, CAeSaR divides anti-fraud function modules into three kinds according to all three possible states of an on-going transaction on the timeline once a suspected fraudulent transaction is adopted as a reference point. The three kinds of transactions are in the charge of the three function modules, i.e., Rr, Sa, and Ae. Misuse detection is used in the Rr and Sa modules to identify frauds by the known fraudulent characteristics of transactions relative to the adjacent reference frauds. Anomaly detection is used in the Ae module to detect what deviates from normal activity. In the Center control (Cc) module, we devise a novel integration scheme, called TELSI (Triple Element Logical Stacking Integration). Based on the Rr, Sa, and Ae
modules, the Center control module generates the set of valid decision strategies, and TELSI can assign the most effective decision strategies to different transactions. We use conjunction and disjunction to combine the function modules so that, when inspecting individual predictions, we can explain which activities or characteristics cause the decisions of the system. Next, we introduce how to build or import the three different kinds of function modules, present our integration scheme TELSI in the Cc module, and describe our communication architecture.
4.3.1 Anti-fraud Function Modules

We first introduce the schemes of the three function modules. Based on our taxonomy, we mine the corresponding characteristics and construct the following three modules.
4.3.1.1 Risk review (Rr) Module
Evaluating the behavior risk is a feasible approach to predicting frauds [1]. For example, suppose there are frequent frauds related to a specific merchant. Then, when an account transfers to the same merchant, it is very likely that the account will be embezzled. In the first state, a fraudulent transaction will follow the high-risk on-going transaction. To prevent the occurrence of frauds, it is necessary to review the high-risk features of historical behaviors and predict the fraud possibility of on-going transactions. The transactions following high-risk transactions are identified as frauds with the signs "1" that are fed back to the Center control module.
4.3.1.2 Subsequent analysis (Sa) Module
In online payment services, a typical fraud pattern is that similar fraudulent transactions repeat within a short range of time [48]. This hints that fraudsters have an intrinsic motivation to move large amounts of money over a short period of time, which is different from normal trading behavior patterns. The sliding time window is a widely recognized and effective tool for capturing the patterns of sequential fraudulent behaviors. Through specific statistical analysis within a time window, some window-relevant features can be obtained to represent behavior patterns. In the second state, where the on-going transaction follows successive frauds, the function module has the knowledge that the on-going transaction is in the same time window as previous fraudulent transactions. Then, the sliding time window strategy can be adopted to aggregate sequential features into statistical sequential features. The transactions with fraudulent sequential features are marked as frauds, and the corresponding signs "1" are returned to the Center control module.
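To make the window-relevant features concrete, the following Python sketch aggregates, for each on-going transaction, simple statistics over the same account's transactions in the preceding time window. It is illustrative only: the column names, the 30-minute window, and the two statistics are assumptions, not the exact feature set of the method adopted in Sect. 4.4.2.

```python
from datetime import timedelta

import pandas as pd

WINDOW = timedelta(minutes=30)  # assumed window length

def window_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append count/sum of each account's prior transactions within WINDOW.
    Expects columns: account_id, timestamp (datetime64), amount."""
    df = df.sort_values(["account_id", "timestamp"]).reset_index(drop=True)
    counts, sums = [], []
    for _, group in df.groupby("account_id", sort=False):
        times = group["timestamp"].tolist()
        amounts = group["amount"].tolist()
        start = 0
        for i, t in enumerate(times):
            while times[start] < t - WINDOW:
                start += 1
            counts.append(i - start)            # prior transactions in the window
            sums.append(sum(amounts[start:i]))  # prior amount moved in the window
    df["win_count"] = counts
    df["win_amount"] = sums
    return df
```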
4.3.1.3 Association evaluation (Ae) Module
Generally, anti-fraud methods can be divided into two categories, i.e., the misuse detection and anomaly detection methods. Rr and Sa modules belong to the misuse detection methods, and aim at deriving a set of features from adjacent transactions to characterize frauds. In practice, fraudsters’ tricks are continuously evolving and the fraudulent transactions may be hard to obtain, which seriously limits the performance of misuse detection models. For an on-going transaction with the third state, the function module has no additional knowledge from the special temporal positions adjacent to the reference fraud. Therefore, anomaly detection is more suitable for the Association evaluation module. It can apply the anomaly detection method to identify the transactions that deviate greatly from their own behavior patterns as frauds, and feed their corresponding signs “1” into the Center control module.
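As a minimal, illustrative sketch of this anomaly-detection idea (a simplified stand-in for the association model actually adopted in Sect. 4.4.2; the feature representation and the threshold are assumptions), one can score how far an on-going transaction deviates from the account's own historical profile and return the sign "1" when the deviation is large:

```python
import numpy as np

def anomaly_score(history: np.ndarray, current: np.ndarray) -> float:
    """Deviation of the current feature vector from the account's own history,
    measured as a mean standardized distance (a crude per-user profile)."""
    mu = history.mean(axis=0)
    sigma = history.std(axis=0) + 1e-9          # avoid division by zero
    return float(np.abs((current - mu) / sigma).mean())

def ae_judgment(history, current, threshold=3.0) -> int:
    """Return 1 (fraud sign fed to Cc) when the deviation exceeds the threshold."""
    return int(anomaly_score(np.asarray(history), np.asarray(current)) > threshold)
```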
4.3.2 Center Control Module

The performance of a single-function module is limited since it can only identify some specific type of fraud. Therefore, integrating multiple function modules is necessary, and how to reasonably integrate them is significant. The task of the Center control (Cc) module is to use the corresponding decision strategy to give each transaction a composite judgment based on the invoked function modules. As one of the main contributions, we design an integration scheme, named TELSI (Triple Element Logical Stacking Integration), to realize the adaptive assignment from the set of candidate decision strategies, denoted by $\mathcal{S}$, to the sequence of unchecked transactions, denoted by $\mathcal{T}$.
4.3.2.1 Candidate Decision Strategy Set
The decision aims to use transaction fields and function modules to generate final judgments of whether the transactions are fraudulent or not. Let $\mathcal{T}$ represent the feature space of transactions and $p_1, p_2, \ldots, p_M$ denote the learned prediction functions of the $M$ function modules with $p_i: \mathcal{T} \mapsto \{0, 1\}, \forall i$. The final judgment is formulated as

\[ y = \pi(t_i, p_1(t_i), p_2(t_i), \ldots, p_M(t_i)), \]

where $t_i \in \mathcal{T}$ denotes the features of the $i$th transaction, and $\pi$ denotes the decision strategy. Generally, ensemble methods obtain the final judgment by blending the predictions of function modules, that is, they seek a blended prediction function $f$ to generate the judgment $y = f(p_1(t_i), p_2(t_i), \ldots, p_M(t_i))$. Linear stacking [49] trains a linear function $f$ of the form:
\[ f(p(t)) = \sum_i w_i p_i(t), \]

where each learned weight, $w_i$, is a real constant. The learned weight indicates the impact of the prediction of each function module. However, it is not straightforward for humans to understand. The more convincing method is to utilize binary operations, which can give a more explicit explanation. The majority voting rule adopts the majority of the predictions as the final judgment:

\[ y = c_j, \quad \text{with} \quad \sum_i h_i^j(t) > 0.5 \sum_k \sum_i h_i^k(t), \]
where $c_j$ denotes the $j$th class, and the indicator function $h_i^j(t)$ denotes whether $p_i(t)$ is equal to $c_j$. However, the majority is not always right, while in some situations the minority is decisive. Considering both the explainability and performance, the bitwise operation is chosen, and we employ multiple blended prediction functions in the candidate decision strategy set.

Now, we introduce the elements of $\mathcal{S}$, i.e., the candidate decision strategies. Each decision strategy is set to be a logical combination of some of the three modules by two simple operations: conjunction ($\wedge$) and disjunction ($\vee$). Every module mainly aims to improve the output credibility of a positive discrimination result, i.e., "fraudulent". In other words, a negative discrimination result is just a default output when the module does not determine a transaction as a fraudulent one. The modules have no confidence in a negative discrimination result. The fusing strategy will fail to guarantee its credibility of positive discrimination results whenever a negation operator is used. Besides, a negation operator negates the validity of the three modules, which violates our design objective for them. Therefore, we say a decision strategy (a logical combination of modules) is valid if it does not contain any negation operator. Conjunction and disjunction operations respectively indicate "and" and "or" in decision logic, which ensures the decision explainability in combination. Then, the so-called candidate decision strategy set $\mathcal{S}$ consists of all valid decision strategies. Next, we analyze how many valid decision strategies three modules can generate, i.e., the cardinality of $\mathcal{S}$. For this issue, we have Theorem 4.1.

Theorem 4.1 Let $\mathcal{V}$ denote the set of compound propositions formed by combining three propositional variables via only the conjunction ($\wedge$) and disjunction ($\vee$) (without the negation operator). Then, all compound propositions in $\mathcal{V}$ can be deduced to 18 different principal disjunctive normal forms.

Proof Since a variable in the minterm expression can be in either its direct or complemented form, there is a widely-used convention which assigns the value 1 to the direct form and 0 to the complemented form [50]. Then, we use $m_k$ to denote a minterm, where the binary encoding of $k$ denotes the digital representation of the minterm. We define a partial order relation $\preceq$, where $m_i \preceq m_j$ denotes that a principal disjunctive normal form must contain $m_j$ if it contains $m_i$. Thus, it holds that
\[ \preceq \; = \{\langle m_1, m_3\rangle, \langle m_1, m_5\rangle, \langle m_1, m_7\rangle, \langle m_2, m_3\rangle, \langle m_2, m_6\rangle, \langle m_2, m_7\rangle, \langle m_3, m_7\rangle, \langle m_4, m_5\rangle, \langle m_4, m_6\rangle, \langle m_4, m_7\rangle, \langle m_5, m_7\rangle, \langle m_6, m_7\rangle\}. \]

Fig. 4.3 The Hasse diagram of the partially ordered set $\langle \{m_1, m_2, \ldots, m_7\}, \preceq \rangle$ (Layer 0: $m_7$ (111); Layer 1: $m_3$ (011), $m_5$ (101), $m_6$ (110); Layer 2: $m_1$ (001), $m_2$ (010), $m_4$ (100))
In Fig. 4.3, we give the Hasse diagram of the partially ordered set $\langle \{m_1, m_2, m_3, m_4, m_5, m_6, m_7\}, \preceq \rangle$. Given a ternary formula that can be simplified to a negation-free form of conjunctions and disjunctions, its principal disjunctive normal form must have the following property: if it contains a minterm in Layer $i$ with $i \geq 1$, it must contain another minterm in Layer $(i-1)$ that has an edge connecting to the previous minterm in the Hasse diagram. Then, we can get the following kinds of combinations:

• Only with the minterm in Layer 0: 1 kind of combination.
• With the deepest minterm in Layer 1: $C_3^1 + C_3^2 + C_3^3 = 7$ kinds of combinations.
• With the deepest minterm in Layer 2: $C_3^1 \cdot (1 + C_1^1) + C_3^2 + C_3^3 = 10$ kinds of combinations.

Therefore, there are 18 kinds of combinations in total. ∎

From Theorem 4.1, by setting those three propositions to be Rr, Sa, and Ae, respectively, we can obtain that each kind of combination actually corresponds to a valid decision strategy. Hence, there are 18 valid decision strategies in $\mathcal{S}$. Furthermore, we summarize them in Table 4.1.
Table 4.1 All logical combinations of three schemes

ID  Logical combination    ID  Logical combination
1   Rr                     10  Rr | Sa | Ae
2   Sa                     11  Rr & Sa & Ae
3   Ae                     12  Rr | (Sa & Ae)
4   Rr | Sa                13  Sa | (Rr & Ae)
5   Rr | Ae                14  Ae | (Rr & Sa)
6   Sa | Ae                15  Rr & (Sa | Ae)
7   Rr & Sa                16  Sa & (Rr | Ae)
8   Rr & Ae                17  Ae & (Rr | Sa)
9   Sa & Ae                18  (Rr | Sa) & (Rr | Ae) & (Sa | Ae)
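The count in Theorem 4.1 can also be checked mechanically: a negation-free combination of the three module outputs is exactly a non-constant monotone Boolean function of three variables. The short Python sketch below (our own verification, not part of the original system) enumerates all truth tables and confirms that 18 of them are monotone and non-constant.

```python
from itertools import product

inputs = list(product([0, 1], repeat=3))   # the 8 possible (Rr, Sa, Ae) judgment triples

def is_monotone(table):
    """Check that x <= y componentwise implies table[x] <= table[y]."""
    for x in inputs:
        for y in inputs:
            if all(a <= b for a, b in zip(x, y)) and table[x] > table[y]:
                return False
    return True

valid = []
for bits in product([0, 1], repeat=len(inputs)):   # all 2^8 candidate functions
    table = dict(zip(inputs, bits))
    if is_monotone(table) and len(set(bits)) > 1:  # monotone and non-constant
        valid.append(table)

print(len(valid))   # expected: 18
```

Reading off the minimal true points of each surviving truth table recovers the 18 combinations listed in Table 4.1.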
Algorithm 4.1: Adaptive Decision Strategy Assignment

Input: Transaction fields T, labels L, function modules P_Rr, P_Sa, P_Ae, logic functions F = {f_j : P_Rr × P_Sa × P_Ae ↦ L | j = 1, 2, ..., m}
Output: Classifier g

X ← {}; Y ← {}; D ← {};
foreach t_i ∈ T do
    Input t_i into Rr, Sa, Ae and get predictions P_Rr(t_i), P_Sa(t_i), P_Ae(t_i);
    foreach f_j ∈ F do
        Input the predictions into f_j and get the judgment s_i^j;
    end
    D ← D ∪ {s_i^j | j = 1, ..., m};
end
foreach s_i^j ∈ D, l_i ∈ L do
    if s_i^j is consistent with l_i then
        X ← X ∪ {t_i}; Y ← Y ∪ {j};
    end
end
Train a classifier g : X ↦ Y;
Return g;
4.3.2.2 Adaptive Decision Strategy Assignment
The assignment is to find the optimal candidate decision strategy for every transaction to be checked. The decision strategies correlate with the judgments of the Rr, Sa, and Ae modules, and they adaptively vary with the characteristics of transactions. That is to say, for various transactions, the Cc module assigns different decision strategies. In this work, we reduce this assignment problem to a multi-classification task.
Fig. 4.4 The workflow of the Center control module. We first evaluate the consistency between the decision strategy $s$ and the label $l$ for each transaction $t$, and then input all fields $x$ and each correct strategy $s$ as a sample for the subsequent adaptive decision allocation subtask. The subscripts and superscripts are the indices of the objects, e.g., $s^i$ denotes the $i$th decision strategy in the set of decision strategies, and $t_i : s_i^j$ denotes the transaction record $t_i$ with strategy label $s_i^j$
The inputs of the Center control module are the fields of transactions, and the outputs are the optimal decision strategies. Our algorithm is composed of two parts: offline training and online testing. The workflow of the Cc module is illustrated in Fig. 4.4. The training process is described in Algorithm 4.1:

(1) Constructing the strategy matrix. Given the transaction fields $\mathcal{T}$ and labels $\mathcal{L}$, we input the fields of each transaction into the Rr, Sa, and Ae modules, and get the predictions of whether the transaction is fraudulent or not. Based on the decision strategies $\mathcal{F}$, we use the three predictions ($P^{Rr}$, $P^{Sa}$, $P^{Ae}$) to generate $|\mathcal{S}|$ judgments for each transaction. Then, we can get a $|\mathcal{T}| \times |\mathcal{S}|$ strategy matrix, where $|\cdot|$ denotes the cardinality of a set.

(2) Generating transactional samples. We use the fields of the transaction as features, and select the identifiers of decision strategies consistent with the ground truth as labels. If multiple decision strategies are right, we will generate multiple samples. For example, there is a transaction record with feature vector $t$ and label $l = 1$. The three function modules give their judgments of whether the transaction is fraudulent (Rr: 1, Sa: 0, Ae: 1). According to the specific decision strategies, some final decisions are generated ($s_1^1$: 1, $s_1^2$: 0, $s_1^3$: 1). The No. 1 and No. 3 decision strategies give the correct results. Then we generate 2 samples ($(t, 1)$ and $(t, 3)$), where the feature vector is $t$ and the label is the strategy number.

(3) Training the assignment model. We feed the generated samples into a selected commonly-used classifier to learn the decision strategy assignment scheme. Specifically, we adopt Naive Bayes [51], SVM [52], Random Forest [53], XGBoost [54], and so on, and select XGBoost according to their performance. After the assignment scheme is learned, we can make the optimal decision strategy for each incoming transaction adaptively. We deploy and test the trained strategy assignment model online. In the online testing, the Cc module will give the optimal decision strategy
for an incoming transaction. Then CAeSaR will invoke the selected function modules and output the final composite judgment. Note that the function modules are invoked only as the decision strategies require.
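A compact, hedged sketch of the offline training step of Algorithm 4.1 is given below. The data loading, the list of strategy functions, and the classifier choice are placeholders: the book selects XGBoost, but here a scikit-learn gradient boosting classifier with comparable settings stands in so the sketch has no extra dependency.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_strategy_assigner(fields, labels, module_preds, strategies):
    """fields: (n, d) transaction features; labels: (n,) ground truth in {0, 1};
    module_preds: (n, 3) judgments from Rr, Sa, Ae; strategies: list of functions
    mapping an (rr, sa, ae) triple to a 0/1 judgment (the valid strategies)."""
    X, Y = [], []
    for x, y, (rr, sa, ae) in zip(fields, labels, module_preds):
        for j, f in enumerate(strategies):
            if f(rr, sa, ae) == y:        # keep strategies that judge this transaction correctly
                X.append(x)
                Y.append(j)
    clf = GradientBoostingClassifier(max_depth=3, learning_rate=0.1,
                                     n_estimators=60, subsample=0.9)
    clf.fit(np.asarray(X), np.asarray(Y))
    return clf

# A few of the valid strategies, written with & (conjunction) and | (disjunction):
example_strategies = [
    lambda rr, sa, ae: rr,
    lambda rr, sa, ae: rr | sa,
    lambda rr, sa, ae: rr | (sa & ae),
    lambda rr, sa, ae: rr & sa & ae,
]
```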
4.3.2.3 Decision Strategy Assignment Update
The fraud patterns keep changing in practical applications. A fixed decision strategy assignment scheme cannot cope with them effectively. Also, as frauds change, the confidence of the function modules will vary. Therefore, we devise an updating mechanism for the Center control module. A feasible approach is reinforcement learning, which can dynamically adjust the assignment policy according to the feedback of previous decisions. In this work, we adopt a periodic updating mechanism in consideration of operability and efficiency. When some transactions are completed and the dataset is updated, the classifier in the Center control module will be retrained at specific intervals.
4.3.3 Communication Architecture

We devise a communication architecture for our integration scheme to further reduce processing latency. As illustrated in Fig. 4.2, there are two types of unidirectional queues in our system: the transaction queue and the judgment queue. A transaction queue element has the attributes of transaction ID and transaction features. A judgment queue element has the attributes of transaction ID, strategy ID, and judgments from Rr, Sa, and Ae. We use five queues to achieve the communication architecture, with a judgment queue for the Center control module and a transaction queue for each module. The incoming transactions are firstly sent to the transaction queue for the Center control (Cc) module. Then, the Cc module assigns an optimal strategy to the head of the transaction queue and sends the transaction ID to the corresponding function modules according to the invoking demands of the strategy. Next, the function modules take the transaction record corresponding to the ID from the database and enqueue the transaction element. The Cc module constructs a judgment queue element including the transaction ID and strategy ID of the currently processed transaction. Function modules process the head of the transaction queue and return their judgments to the judgment queue. The Cc module takes out the head of the judgment queue and gives the final output generated by the strategy ID and judgments. With this communication architecture, the modules in our system can work in parallel. The procedures of distributing transactions and generating outputs are independent. Therefore, they decrease unnecessary waits and achieve the demand for low processing latency.
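A minimal sketch of this five-queue layout, assuming Redis lists serve as the unidirectional queues (the queue names, the stored-record format, and the use of the redis-py client are illustrative assumptions, not the exact implementation):

```python
import json

import redis

r = redis.Redis()  # assumes a reachable Redis instance

# Five unidirectional queues: one transaction queue per module, one judgment queue for Cc.
TXN_QUEUE = {"cc": "txn:cc", "rr": "txn:rr", "sa": "txn:sa", "ae": "txn:ae"}
JUDGMENT_QUEUE = "judgment:cc"

def cc_dispatch(txn_id, needed_modules):
    """Cc forwards the transaction ID only to the modules the chosen strategy needs."""
    for name in needed_modules:
        r.rpush(TXN_QUEUE[name], txn_id)

def module_worker(name, judge):
    """A function module pops its own queue, judges, and reports to the judgment queue."""
    while True:
        _, txn_id = r.blpop(TXN_QUEUE[name])
        features = r.hgetall(f"txn:{txn_id.decode()}")   # record fetched from the database
        r.rpush(JUDGMENT_QUEUE, json.dumps(
            {"txn_id": txn_id.decode(), "module": name, "judgment": judge(features)}))
```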
4.4 Performance Analysis

We carry out experiments to validate the performance of CAeSaR in real-world scenarios. We compare the performance of CAeSaR with the state-of-the-art methods and perform an ablation study to discover the contributions of different integration methods. Furthermore, we conduct some exemplificative studies to manifest the characteristics of CAeSaR in terms of the function complementarity and decision explainability.
4.4.1 Experimental Set-Up

4.4.1.1 Datasets
The evaluation is implemented on a real-world online payment dataset from a prestigious commercial bank. The data span three consecutive months from April to June, 2017. In the raw dataset, each transaction is characterized by 64 fields. We first filter out the fields with sparse data (over about a 60% missing rate), which contribute little to model training. Then we assess the information gains of the fields. Specifically, we apply the feature importance of Random Forest to rank the fields, and we select the top K fields. Combining with the suggestions of the bank experts, we finally select and process some fields as shown in Table 4.2. Most of the fields are non-sensitive, and the bank desensitizes sensitive data because of the privacy protection policy. The frauds are labeled through telephone surveys, complaints, and backtracking rules by the bank. The frauds include the types of telecommunication fraud, phishing website, trojan virus, and so on. The labels indicate whether the frauds exist. We choose the transactions of April and May as Training Part 1, for training the three function modules. The transactions from June are divided into two parts, with the first 3/4 as Training Part 2 for training the Center control module and the other 1/4 as the testing data. The main statistics of transactions are summarized in Table 4.3. Using different training data for the Center control and function modules can avoid over-fitting. Dividing the dataset in time sequence avoids time-crossing and is more in line with the application scenario.
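A hedged sketch of such a time-ordered split, assuming the transactions sit in a pandas DataFrame with a datetime column named timestamp (the column name is an assumption):

```python
import pandas as pd

def temporal_split(df: pd.DataFrame):
    """Split in time order: Apr+May -> Training Part 1,
    first 3/4 of June -> Training Part 2, last 1/4 of June -> testing."""
    df = df.sort_values("timestamp")
    part1 = df[df["timestamp"].dt.month.isin([4, 5])]
    june = df[df["timestamp"].dt.month == 6]
    cut = int(len(june) * 0.75)
    return part1, june.iloc[:cut], june.iloc[cut:]
```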
4.4.1.2 Metrics
According to the industry requirement, a model with a high FPR (False Positive Rate) is intolerable [16]. A high FPR means that the system frequently intervenes in and interrupts normal users' behaviors, which brings users a terrible experience. Therefore, a TPR (True Positive Rate) achieved at a high FPR makes no sense in fraud detection. Considering both user experience and utility, we mainly take notice of the performance at around 0.1% FPR. In this work, we select TPR, FPR, and F1-score [55, 56] as the metrics to comprehensively verify our method.
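Operationally, the headline numbers reported later are TPR values read off at a target FPR. A small sketch using scikit-learn (illustrative; variable names are ours):

```python
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, scores, target_fpr=0.001):
    """Largest TPR whose corresponding FPR does not exceed the target (e.g., 0.1%)."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    mask = fpr <= target_fpr
    return float(tpr[mask].max()) if mask.any() else 0.0
```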
Table 4.2 Fields of dataset

Attribute     Data type  Description
User_ID       String     Transaction card account of customers
Merchant_ID   String     Transaction account of merchant
Place_number  String     Issuing area of banking cards used for transactions
Time          String     The time that the transaction happens, starting from the year, accurate to the second
Amount        Float      The amount of money in a transaction
Balance       Float      Balance of the account before the transaction happens
Daily_Limit   Float      Maximum amount limit for daily transactions
Single_Limit  Float      Maximum amount limit for a single transaction
Check         String     The verification tool at the process of the transaction, like U-shield, electronic cipher, etc.
Frequent_IP   Boolean    A status bit indicating whether the current IP is consistent with the IP frequently used by customers
Last_Result   Boolean    Judgment of the last transaction in the relevant account number
Trans_Type    Integer    The type of the transaction

Table 4.3 Statistics of transaction records

Dataset          Month        Normal   Fraudulent  Total
Training Part 1  April & May  1.81 M   33 K        1.84 M
Training Part 2  June         747 K    15 K        762 K
Testing          June         246 K    8 K         254 K
4.4.2 Implementation

Our experiments are conducted on a server with Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz and 128GB RAM.
4.4.2.1 Center Control Module
We employ a multi-classifier to learn the decision strategy assignment scheme. We select XGBoost to act as such a classifier according to the performance of different commonly-used classifiers. A grid search is adopted to select the setup with the highest performance. To avoid overfitting, we set the step size shrinkage η to 0.1. After each boosting step, the weights of new features can be directly obtained, and η shrinks the feature weights to make the boosting process more conservative. The minimum loss reduction γ = 0, the maximum depth of a tree max_depth = 3, the minimum sum of instance weight needed in a child min_child_weight = 1, the subsample
ratio of the training instances subsample = 0.9, and the number of boosting rounds num_round = 60.
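For reference, the settings above correspond roughly to the following configuration in the xgboost Python package (a sketch; the objective, num_class, and any parameter not listed in the text are assumptions based on standard usage):

```python
import xgboost as xgb

cc_params = {
    "objective": "multi:softmax",  # one class per valid decision strategy (assumed)
    "num_class": 18,               # 18 valid strategies (assumed one-to-one mapping)
    "eta": 0.1,                    # step size shrinkage
    "gamma": 0,                    # minimum loss reduction
    "max_depth": 3,
    "min_child_weight": 1,
    "subsample": 0.9,
}
# With dtrain = xgb.DMatrix(fields, label=strategy_labels):
# booster = xgb.train(cc_params, dtrain, num_boost_round=60)
```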
4.4.2.2 Anti-fraud Function Modules
Modularity is one of the guiding principles in the CAeSaR design. The function modules can be freely swapped out for more effective solutions or application-specific trade-offs. Imbalanced data is a practical challenge in fraud detection [57]. In our implementation, we under-sample the legitimate transaction samples by random skipping and keep all the fraudulent samples for the Rr and Sa modules. The anomaly detection method aims to learn the characteristics of normal activity and detect the deviated activity, so we keep all samples for the Ae module. In our implementation, we adopt the function modules as follows.

Rr (Risk review module). We adopt the ex-ante misuse detection method [2] that works before unauthorized behaviors occur. The objective is to check the risk level of an account after its holder's series of transactions. We choose account ID, merchant ID, transaction time, transaction amount, authentication type, frequent IP, and registered address to construct features. We first aggregate the features to deal with the transaction sequences. After the processing of the devised feature engineering, we employ XGBoost to predict the risk of accounts. The optimal model parameter settings are selected with a grid search: the step size shrinkage η = 0.01, the minimum loss reduction γ = 0.1, the maximum depth of a tree max_depth = 6, the minimum sum of instance weight needed in a child min_child_weight = 1, the subsample ratio of the training instances subsample = 0.7, and the number of boosting rounds num_round = 1000.

Sa (Subsequent analysis module). A misuse detection method named LAW [11] is adopted in this function module. One of the most significant features of fraudulent transactions is that they are usually exhibited in a sequential form. We adopt a sliding time window to extract features about transaction characteristics in order to capture the latent patterns hidden in transaction records. We choose the fields account ID, merchant ID, transaction time, daily limit, single limit, transaction amount, and frequent IP. We utilize a learning automaton to extract the window-dependent features. Combined with the window-independent features, we employ Random Forest to detect fraudulent transactions. The number of trees in the forest is set to 100 and the minimum number of samples required to split an internal node is set to 100.

Ae (Association evaluation module). We employ the individual behavioral model in [9] to detect anomalous activity. We adopt a heterogeneous relation network to represent the co-occurrences. Then, network representation learning is utilized to capture deep associations. We infer the potential associations by calculating the similarity between embedding vectors. Finally, we judge the deviated activity as fraud. We build the behavior model based on the following features: account ID, merchant ID, issuing place, transaction time, transaction amount, frequent IP, and
transaction type. The dimensionality of the vector space is d = 128, the length of random walks is l = 160, and the length of meta-paths is w = 4.
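The random-skipping under-sampling used for the Rr and Sa training sets can be written in a few lines (a sketch; the keep ratio is an assumption, since the text does not state the sampling rate):

```python
import numpy as np

def undersample_legitimate(X, y, keep_ratio=0.1, seed=0):
    """Keep every fraudulent sample (y == 1) and a random fraction of the legitimate ones."""
    rng = np.random.default_rng(seed)
    legit = np.where(y == 0)[0]
    fraud = np.where(y == 1)[0]
    kept = rng.choice(legit, size=int(len(legit) * keep_ratio), replace=False)
    idx = np.sort(np.concatenate([kept, fraud]))
    return X[idx], y[idx]
```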
4.4.3 Evaluation of System Performance

4.4.3.1 Comparison with Baselines
Besides our three function modules, we use the following methods as the baselines:

• P (Platform's method). It is a deep learning-based method adopted by the risk control department of the commercial bank that provides us with the data. It takes into account the manual features from expert knowledge, the deep characteristics based on network representation learning, and automatic feature engineering.
• LR (Logistic Regression) [19]. It is a representative model with the advantages of efficiency and explainability.
• RF (Random Forest) [58]. It is widely used in fraud detection and can effectively cope with missing, high-dimensional, and unbalanced transaction data.
• DNN (Deep Neural Network) [28]. It achieves great success due to its powerful fitting capability for complex non-linear relations.

Firstly, we investigate the effectiveness of CAeSaR. We conduct an exhaustive grid search to find the optimal parameters for the best fraud detection performance in terms of TPR and FPR. The partial ROC (Receiver Operating Characteristic) curve is depicted in Fig. 4.5. Compared with Rr, Sa, and Ae, CAeSaR achieves a significant performance improvement. From an FPR of 0.05% to 0.15%, CAeSaR presents the best results. When the FPR is 0.15%, the TPR of CAeSaR reaches more than 98%, which means that it can prevent 98% of frauds while interfering with only 0.15% of normal transactions.
Fig. 4.5 The partial ROC curves (True Positive Rate versus False Positive Rate for P, Rr, Sa, Ae, RF, DNN, and CAeSaR)
Fig. 4.6 Average time consumption (average time in ms per transaction for P, Rr, Sa, Ae, LR, RF, DNN, and CAeSaR)
The performance of Sa is very close to that of Rr. Note that Ae has a worse performance than most of the other methods when the FPR is below 0.13%, but it can obtain a TPR of 98%, about 2% higher than Sa, when the FPR increases to 0.2%. The worse performance of Ae at low FPR might result from the fact that some individuals do not have sufficient transactions at all. In addition, CAeSaR outperforms the platform's method P and the other three models. Moreover, LR obtains the worst TPR, which is lower than 60%, while RF and DNN achieve about 70% to 80%.

Secondly, we present the efficiency of CAeSaR. Due to the real-time requirement of OPSs, the fraud detection system needs to process each transaction with a very small latency. Therefore, we perform the experiments to simulate the real-life scenarios of online payment fraud detection. The transactions are generated and submitted to the fraud detection system one by one, then the system predicts their fraud risk in turn, which affects the risk control decisions. To be more convincing, we conduct 25 rounds of tests and calculate the mean values of the average time. The results are shown in Fig. 4.6. The platform's method (P) needs about 12 ms to process a transaction, and CAeSaR needs 19 ms accordingly. Our method can achieve a better performance with a slight increase in time consumption. Here, LR needs the least time but its accuracy is indeed unacceptable. Compared with Rr, Sa, and Ae, although CAeSaR utilizes all of them, it does not consume much more time on average, even less than Ae. In our design, the modules work in parallel, and only parts of the function modules are invoked for each transaction according to the selected logical combination.
4.4.3.2 Ablation Study
To investigate how CAeSaR with TELSI contributes to the overall performance, we perform the ablation study to compare different integration methods. We choose the following representative and well-performing integration methods:

• Voting [59]. It follows a voting rule and aggregates the preferences of individual learners into a collective decision.
• LS (Linear Stacking) [60]. It blends the predictions of multiple models to boost predictive accuracy by training a linear model.
• NLS (Non-linear Stacking) [61]. Different from LS, it uses a non-linear model as the meta learner.
• FWLS (Feature Weighted Linear Stacking) [62]. It uses both the outputs of individual learners and the features of samples to construct the meta learner.

To show the effectiveness of our integration method, we choose a set of fixed FPRs around 0.1% to compare their performance: 0.050, 0.075, 0.100, 0.125, and 0.150%. From Fig. 4.7a and b, CAeSaR outperforms all compared methods and achieves much better results than the others at the FPR of 0.001. It means that our integration strategy does have a significant advantage in improving the overall performance, especially when the FPR exceeds 0.001. Note that the performance of NLS-XGB is worse than that of CAeSaR, which signifies that our logical combination strategies do contribute to the improvement. To present the efficiency of our integration method, we also conduct experiments on time consumption. For a fair comparison, the other four integration methods also adopt our communication architecture. We test 10 times and calculate the mean values, which are presented in Fig. 4.8. CAeSaR consumes the least time. Voting, LS, and NLS need 4 ms more than CAeSaR. FWLS performs worst, taking about 54 ms to classify a sample because of its additional feature processing. The conclusion can be drawn that our integration method in the Center control module contributes to cutting down the time consumption, because it does not invoke all function modules for every transaction. The other integration methods follow the "bucket effect": they invoke all function modules and are constrained by the most inefficient module.
Fig. 4.7 The TPR and F1-score under different FPR of integrations: a TPR, b F1-score (Voting, LS, NLS-RF, NLS-XGB, FWLS, and CAeSaR)
Fig. 4.8 Average time consumption of integrations (average time in ms for Voting, LS, NLS-RF, NLS-XGB, FWLS, and CAeSaR)
Table 4.4 Comparison on the cost of computing resources between CAeSaR and other integration methods. Here, T(·) denotes the number of invocations in a round of simulation. Total cost is the total time of serialized scheduling

Method   T(Rr)   T(Sa)   T(Ae)   Total cost
CAeSaR   196 K   199 K   208 K   8.13 M
Others   254 K   254 K   254 K   10.16 M
We use the total time of invoking each function module serially to evaluate the cost of computing resources. We count the number of times each integration method invokes the Rr, Sa, and Ae modules in a round of simulation, and calculate the total cost. From Table 4.4, CAeSaR invokes the function modules fewer times, which reduces the computational cost by about 20%.
4.4.4 Exemplification of CAeSaR's Advantages

We aim to manifest the function complementarity and the decision explainability of our CAeSaR by using real-life examples.
4.4.4.1 Function Complementarity
To show the function complementarity, we study the diversity of the three function modules. For fraud detection, we focus more on the diversity in capturing positive (fraudulent) transactions.
Table 4.5 The disagreement metrics under different TPR. Here, AVG is the arithmetic mean of the other three values

TPR     85.0%   87.5%   90.0%   92.5%   95.0%
DM_RS   0.146   0.119   0.087   0.072   0.051
DM_RA   0.205   0.172   0.122   0.112   0.065
DM_SA   0.223   0.191   0.140   0.116   0.065
AVG     0.191   0.161   0.116   0.100   0.060
Accordingly, we propose a novel measure to evaluate the diversity of different models in identifying positive samples. First, we calculate the positive dissimilarity matrix. For all positive samples, we use $N_{11}$, $N_{10}$, $N_{01}$, $N_{00}$ to denote the amounts of samples that both $D_i$ and $D_j$ classify correctly, $D_i$ correctly but $D_j$ wrongly, $D_i$ wrongly but $D_j$ correctly, and both wrongly, respectively, where $D_i$ and $D_j$ are two different methods. Then, the disagreement measure $DM_{ij}$ can be calculated by

\[ DM_{ij} = \frac{N_{10} + N_{01}}{N_{11} + N_{10} + N_{01} + N_{00}}. \]

Obviously, the higher the disagreement measure is, the greater the diversity is. To show the diversity of the capability to detect frauds among Rr, Sa, and Ae, we present their disagreement metrics under different TPR in Table 4.5. We observe that the diversity increases as TPR decreases, and the average $DM$ can reach about 20% at the TPR of 0.85. Note that the average $DM$ is 0.06 at the TPR of 0.95. It signifies that even 6% of predictions of positive samples among different methods are inconsistent though 95% of frauds are detected. Benefiting from our partition of the fraud feature space, the three function modules can detect different types of frauds and our framework can identify as complete fraud features as possible.
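The disagreement measure is simple to compute from two methods' predictions restricted to the positive samples; a small sketch (variable names are ours):

```python
import numpy as np

def disagreement(pred_i, pred_j, y_true):
    """DM over positive samples: fraction on which exactly one of the two
    methods is correct, i.e., (N10 + N01) over all positives."""
    pos = np.asarray(y_true) == 1
    ci = np.asarray(pred_i)[pos] == 1     # method i correct on positives
    cj = np.asarray(pred_j)[pos] == 1     # method j correct on positives
    return float(np.mean(ci ^ cj))
```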
4.4.4.2 Decision Explainability
We perform the case study for the decision explainability. Figure 4.9 presents two cases of fraudulent samples (from our real dataset) that are determined by the logical combinations Ae and Rr|Sa, respectively. Record5 is the transaction being checked and record1-4 are the historical transaction sequence. We visualize the representative fields of the desensitized records: IP, and time-related, type-related, and amount-related fields. The colors show the magnitude of field values. The label denotes whether the transaction is fraudulent (dark) or not (light). Record5 in Fig. 4.9a is judged by Ae alone, which implies the transaction behavior deviates from the normal pattern of historical records. Backtracking the four historical records, all of them are normal and present a stable pattern. However, record5 is apparently different from the historical records, which does not conform to the normal behavior pattern. Thus, this transaction is intercepted by Ae.
Fig. 4.9 The case study of decision explainability. Record5 is the transaction being checked; record1-4 and record5 are transactions from the same user. Figures a and b show the different cases determined by the logical combinations Ae and Rr|Sa, respectively (visualized fields: time, IP, type, amount, label)
The Center control module only invokes Ae to identify the deviation, and does not invoke the other two modules, i.e., the Rr and Sa modules. Figure 4.9b shows the case of Rr|Sa, which implies the transaction has specific features referring to the adjacent frauds. For the Rr module, we find the two values of record1's type-related fields are high-risk. Denote them by $t_1$ and $t_2$, respectively. Then, from the statistics of samples, 91% of the samples with $t_1$ and 18% with $t_2$ are fraudulent transactions. This indicates that the following transactions are highly likely to be fraudulent though record1 is normal. For the Sa module, we can observe that record2-4 are all fraudulent with similar features, where the time intervals are small and the amounts are the same. It means that the fraudster is conducting replay attacks, so the following transactions are likely to be fraudulent.
4.5 Discussion

In this part, we discuss two additional issues. The first one is the explainability for fraud detection. There are diverse criteria for explainability evaluation in academia. We will discuss the justifications for our decision explainability among various notions of explainability. The second issue is online learning for real-time systems. Although our method can cope with the changing environments, we are interested in incremental learning to further address the concept drift problem at a lower cost.
4.5.1 Faithful Explanation

Explainability is an essential ability for an artificially intelligent system, especially an online payment anti-fraud system. The predictions cannot be acted upon with blind faith, which may have catastrophic consequences. When decisions are made, such as freezing accounts, the system should give justifications for why the actions are taken. The justification is important for both users and service providers. The users
should understand why the system produced such predictions, and make sure the decisions are justified. The machine learning practitioners or model users should inspect multiple predictions and their justifications to ensure that the model can be trusted and works well as expected. Historically, a typical explainable system is the rule-based expert system. Humans can often understand the sets of rules which infer the prediction [63]. Recently, machine learning techniques have all but replaced rule-based methods in order to achieve higher accuracy and the ability to handle more changeable and more complex problems. Our integration system, under the precondition of improving detection performance, can justify the predictions for each transaction. Based on the taxonomy of transactions, we can divide the characteristics of a transaction related to a reference fraud. Combining the judgments of the three function modules by logical operations, the final justifications can indicate the fraudulent characteristics of the transactions. Our method justifies the predictions by the combinations of three fraudulent characteristics, i.e., the characteristics of a transaction prior to or posterior to a reference fraud, or deviating from normal activity. Even if the users are not machine learning experts, they can also understand why such predictions are produced. There are other studies about explainable artificial intelligence. These works focus on how to achieve more faithfully [44] explainable models, which explain to humans their mechanism and process of generating predictions. However, deploying them requires more processing latency and computing consumption. We are interested in how to train a more faithfully explainable model without increasing consumption and latency in future work.
4.5.2 Online Learning

In the field of financial anti-fraud, an intractable challenge is concept drift. Non-stationary environments might change the target concept over time. Therefore, the fraud detection method should be able to self-adapt to the changing environments. Different from the traditional assumption of complete data availability, sequential or stream data arrive one by one and continuously accumulate. Although this problem is not our main concern, we adopt the periodic updating mechanism to adapt to the change of the data stream. We consider applying incremental and online algorithms, which fit naturally to this scheme, in the future, since they continuously incorporate information into their model and traditionally aim for minimal processing time and space.
4.6 Conclusion

This work has designed an effective online payment anti-fraud integration system with decision explainability, called CAeSaR. Two techniques contribute to the all-round performance gain of CAeSaR. The first is the function module division, TRTPT (Three-way Relative Temporal Position Taxonomy), based on the relative temporal position classification. Its function completeness and complementarity are the basic conditions for improving detection performance. The second is the designed integration scheme TELSI (Triple Element Logical Stacking Integration). It can improve the detection performance under the premise of ensuring decision explainability, limiting processing latency, and constraining computing consumption simultaneously. The advantages of our CAeSaR are sufficiently validated by the extensive experiments on real-life data. CAeSaR outperforms the state-of-the-art baselines and integration methods, including the platform's method with manual feature engineering by experts.

There are some issues left to study. Firstly, it is important to investigate self-learning integration schemes for the Center control module to cope with the changing patterns of online payment frauds. Secondly, we will explore the internal architecture optimization of specific function modules to deal with the typical problems of data-driven online payment anti-fraud engineering, e.g., unsupervised learning, weakly-supervised learning, and the extreme imbalance problem. Thirdly, we are seeking opportunities to cooperate with more banks and fintech companies in the future and are extending our method to more scenarios. Moreover, besides using desensitized data, we are interested in federated learning to further protect privacy. Last but not least, we should further address an important issue, i.e., the discrimination problem, which refers to unjustified distinctions in decisions against individuals based on their membership in a certain group. We intend to study fair representation to eliminate the impacts of potentially sensitive features.
References

1. P. Zheng, S. Yuan, X. Wu, in Proceedings of the AAAI 2019 (2019), pp. 1278–1285
2. C. Wang, in Proceedings of the IJCAI 2020 (2020), pp. 4611–4618
3. N.F. Ryman-Tubb, P. Krause, W. Garn, Eng. Appl. Artif. Intell. 76, 130 (2018)
4. K. Thomas, F. Li, C. Grier, V. Paxson, in Proceedings of the ACM SIGSAC 2014 (2014), pp. 489–500
5. Q. Guo, Z. Li, B. An, P. Hui, J. Huang, L. Zhang, M. Zhao, in Proceedings of the WWW 2019 (2019), pp. 616–626
6. C. Jing, C. Wang, C. Yan, in International Conference on Financial Cryptography and Data Security (2019), pp. 588–604
7. X.J. Li, X.Y. Shen, IEEE Trans. Ind. Inf. 16(9), 5806 (2019)
8. C. Liu, Q. Zhong, X. Ao, L. Sun, W. Lin, J. Feng, Q. He, J. Tang, in Proceedings of the ACM SIGKDD 2020 (2020), pp. 3035–3043
9. C. Wang, H. Zhu, IEEE Trans. Dependable Secure Comput. 19(1), 301 (2022)
10. Z. Li, Y. Zhao, R. Liu, D. Pei, in Proceedings of the IWQoS 2018 (2018), pp. 1–10
11. C. Wang, C. Wang, H. Zhu, J. Cui, IEEE Trans. Dependable Secure Comput. 18(5), 2122 (2021)
12. S. Carta, G. Fenu, D.R. Recupero, R. Saia, J. Inf. Secur. Appl. 46, 13 (2019)
13. S. Kraus, A. Azaria, J. Fiosina, M. Greve, N. Hazon, L. Kolbe, T.B. Lembcke, J.P. Muller, S. Schleibaum, M. Vollrath, in Proceedings of the AAAI 2020 (2020), pp. 13534–13538
14. D. Cheng, S. Xiang, C. Shang, Y. Zhang, F. Yang, L. Zhang, in Proceedings of the AAAI 2020 (2020), pp. 362–369
15. M.T. Ribeiro, S. Singh, C. Guestrin, in Proceedings of the ACM SIGKDD 2016 (2016), pp. 1135–1144
16. Y. Zhu, D. Xi, B. Song, F. Zhuang, S. Chen, X. Gu, Q. He, in Proceedings of the WWW 2020 (2020), pp. 928–938
17. X. Huang, J. Zhang, Z. Tan, D.F. Wong, H. Luan, J. Xu, M. Sun, Y. Liu, in Proceedings of the IJCAI 2020 (2020), pp. 3694–3701
18. Z. Li, J. Song, S. Hu, S. Ruan, L. Zhang, Z. Hu, J. Gao, in Proceedings of the IEEE ICDE 2019 (2019), pp. 1898–1903
19. S. Rosenthal, K. McKeown, in Proceedings of the ACL 2011 (2011), pp. 763–772
20. A. Breuer, R. Eilat, U. Weinsberg, in Proceedings of the WWW 2020 (2020), pp. 1287–1297
21. H. Sengar, X. Wang, H. Wang, D. Wijesekera, S. Jajodia, in Proceedings of the IWQoS 2009 (2009), pp. 1–9
22. M. Oh, G. Iyengar, in Proceedings of the ACM SIGKDD 2019 (2019), pp. 1480–1490
23. B. Wang, N.Z. Gong, H. Fu, in Proceedings of the ICDM 2017 (2017), pp. 465–474
24. M. Abouelenien, V. Pérez-Rosas, R. Mihalcea, M. Burzo, IEEE Trans. Inf. Forens. Secur. 12(5), 1042 (2016)
25. K. Rzecki, P. Pławiak, M. Niedźwiecki, T. Sośnicki, J. Leśkow, M. Ciesielski, Inf. Sci. 415, 70 (2017)
26. H. Mazzawi, G. Dalal, D. Rozenblatz, L. Ein-Dor, M. Ninio, O. Lavi, in Proceedings of the IEEE ICDE 2017 (2017), pp. 1140–1149
27. M. Zhao, Z. Li, B. An, H. Lu, Y. Yang, C. Chu, in Proceedings of the AAAI 2018 (2018), pp. 3940–3946
28. Y. Wang, W. Xu, Decis. Support Syst. 105, 87 (2018)
29. A. Beutel, L. Akoglu, C. Faloutsos, in Proceedings of the ACM SIGKDD 2015 (2015), pp. 2309–2310
30. J. Wang, R. Martins de Moraes, A. Bari, in Proceedings of the IEEE BigDataService 2020 (2020), pp. 104–108
31. S. Huang, C. Fung, K. Wang, P. Pei, Z. Luan, D. Qian, in Proceedings of the IWQoS 2016 (2016), pp. 1–10
32. X. Li, W. Yu, T. Luwang, J. Zheng, X. Qiu, J. Zhao, L. Xia, Y. Li, in Proceedings of the CSCWD 2018 (2018), pp. 467–472
33. Q. Zhong, Y. Liu, X. Ao, B. Hu, J. Feng, J. Tang, Q. He, in Proceedings of the WWW 2020 (2020), pp. 785–795
34. Y. Sun, Y. Wu, Y.C. Xu, in Proceedings of the PACIS 2020 (2020), p. 144
35. J. Forough, S. Momtazi, Appl. Soft Comput. 99, 106883 (2021)
36. S. Shih, P. Tien, Z. Karnin, in Proceedings of the ICML 2021 (2021), pp. 9592–9602
37. Z. Yang, A. Zhang, A. Sudjianto, IEEE Trans. Neural Netw. Learn. Syst. 32(6), 2610 (2021)
38. U. Ehsan, Q.V. Liao, M.J. Muller, M.O. Riedl, J.D. Weisz, in Proceedings of the CHI 2021 (2021), pp. 82:1–82:19
39. O. Biran, C. Cotton, in Proceedings of the IJCAI Workshop on Explainable AI 2017 (2017), pp. 8–13
40. B. Ustun, C. Rudin, Mach. Learn. 102(3), 349 (2016)
41. B. Kim, J.A. Shah, F. Doshi-Velez, in Proceedings of the NIPS 2015 (2015), pp. 2260–2268
42. C. Rudin, B. Letham, D. Madigan, J. Mach. Learn. Res. 14(1), 3441 (2013)
43. R. Agrawal, T. Imieliński, A. Swami, in Proceedings of the ACM SIGMOD 1993 (1993), pp. 207–216
44. A. Jacovi, Y. Goldberg, in Proceedings of the ACL 2020 (2020), pp. 4198–4205
45. M.T. Ribeiro, S. Singh, C. Guestrin, in Proceedings of the AAAI 2018 (2018), pp. 1527–1535
46. M. Wu, M. Hughes, S. Parbhoo, M. Zazzi, V. Roth, F. Doshi-Velez, in Proceedings of the AAAI 2018 (2018), pp. 1670–1678
47. R.M. Lerner, Linux J. 2010(197), 5 (2010)
48. B. Yikun, L. Xin, H. Ling, D. Yitao, L. Xue, X. Wei, in Proceedings of the WWW 2019 (2019), pp. 83–93
49. L. Breiman, Mach. Learn. 24(1), 49 (1996)
50. M. Middendorf, F. Pfeiffer, Discr. Math. 111(1–3), 393 (1993)
51. I. Rish et al., in Proceedings of the IJCAI Workshop on Empirical Methods in Artificial Intelligence 2001 (2001), pp. 41–46
52. C.C. Chang, C.J. Lin, ACM Trans. Intell. Syst. Technol. 2(3), 1 (2011)
53. L. Breiman, Mach. Learn. 45(1), 5 (2001)
54. T. Chen, C. Guestrin, in Proceedings of the ACM SIGKDD 2016 (2016), pp. 785–794
55. J.J. Ying, J. Zhang, C. Huang, K. Chen, V.S. Tseng, ACM Trans. Knowl. Discov. Data 12(6), 68:1 (2018)
56. B. Xu, H. Shen, B. Sun, R. An, Q. Cao, X. Cheng, in Proceedings of the AAAI 2021 (2021), pp. 4537–4545
57. X.Y. Jing, X. Zhang, X. Zhu, F. Wu, X. You, Y. Gao, S. Shan, J.Y. Yang, IEEE Trans. Pattern Anal. Mach. Intell. 43(1), 139 (2019)
58. Y. Li, C. Yan, W. Liu, M. Li, Appl. Soft Comput. 70, 1000 (2018)
59. R. Noothigattu, S. Gaikwad, E. Awad, S. Dsouza, I. Rahwan, P. Ravikumar, A. Procaccia, in Proceedings of the AAAI 2018 (2018), pp. 1587–1594
60. X. Xiang, Y. Tian, A. Reiter, G.D. Hager, T.D. Tran, in Proceedings of the IEEE ICIP 2018 (2018), pp. 928–932
61. Y. Wang, D. Wang, N. Geng, Y. Wang, Y. Yin, Y. Jin, Appl. Soft Comput. 77, 188 (2019)
62. V. Coscrato, M.H. de Almeida Inácio, R. Izbicki, Neurocomputing 399, 141 (2020)
63. O. Biran, K.R. McKeown, in Proceedings of the IJCAI 2017 (2017), pp. 1461–1467
Chapter 5
Multidimensional Behavior Fusion: Joint Probabilistic Generative Modeling
5.1 Online Identity Theft Detection Based on Multidimensional Behavioral Records

With the rapid development of the Internet, more and more affairs, e.g., mailing [1], health care [2], shopping [3], booking hotels and purchasing tickets, are handled online [4]. Meanwhile, the Internet also brings various potential risks of invasion, such as losing financial information [5], identity theft [6] and privacy leakage [3]. Online accounts serve as the agents of users in the cyber world. Online identity theft is a typical online crime, namely the deliberate use of another person's account [7], usually as a method to gain a financial advantage or obtain credit and other benefits in another person's name. As a matter of fact, compromised accounts are usually the portals of most cybercrimes [1], such as blackmail [5], fraud [8] and spam [9, 10]. Thus, identity theft detection is essential to guarantee users' security in the cyber world. Traditional identity authentication methods are mostly based on access control schemes, e.g., passwords and tokens [11, 12]. But users have some overheads in managing dedicated passwords or tokens. Accordingly, biometric identification [13–15] has been delicately introduced to start the era of password-free authentication. However, some disadvantages make these access control schemes still incapable of being effective in real-time online services [16, 17]: (1) They are not non-intrusive. Users have to spend extra time on the authentication. (2) They are not continuous. The defending system will fail to take further protection once the access control is broken. Behavior-based suspicious account detection [16, 18, 19] is a highly-anticipated solution to pursue a non-intrusive and continuous identity authentication for online services. It depends on capturing users' suspicious behavior patterns to discriminate the suspicious accounts. The problem can be divided into two categories: fake/sybil account detection [20] and compromised account detection [21]. The fake/sybil account's behaviors usually do not conform to the behavioral pattern of the majority. In contrast, the compromised account usually behaves in a pattern that does not
conform to its previous one, and may even behave like a fake/sybil account. This can be addressed by capturing mutations of users' behavioral patterns. Compared with detecting compromised accounts, detecting fake/sybil accounts is relatively easy, since the latter's behaviors are generally more detectable than the former's. It has been extensively studied and can be realized by various population-level approaches, e.g., clustering [22, 23], classification [5, 24–26] and statistical or empirical rules [8, 27, 28]. Thus, we focus only on compromised account detection, commonly called identity theft detection, based on individual-level behavioral models.

Recently, researchers have proposed individual-level identity theft detection methods based on suspicious behavior detection [9, 29–35]. The efficacy of these methods significantly depends on the sufficiency of behavior records. They usually suffer from low-quality behavior records due to data collection limitations or privacy issues [3]. In particular, when a method only utilizes a specific dimension of behavioral data, the damage to efficacy caused by poor data is possibly amplified and the scope of application is limited. Unfortunately, many existing works concentrate on just one specific dimension of users' behavior, such as keystrokes [29], clickstreams [32, 36], touch interaction [37] and user generated content (UGC) [9, 33, 34, 38].

In this paper, we propose an approach to detect identity theft by using multidimensional behavioral records which are possibly insufficient in each dimension. According to such characteristics, we choose the online social network (OSN) as a typical scenario where most users' behaviors are coarsely recorded [39]. In the Internet era, users' behaviors are composed of offline behaviors, online behaviors, social behaviors, and perceptual/cognitive behaviors. The behavioral data can be collected in many applications, such as offline check-ins in location-based services, online tip-posting in instant messaging services, and social relationship-making in online social services. Accordingly, we design our method based on users' composite behaviors in these categories.

In OSNs, user behavioral data that can be used for online identity theft detection are often of too low quality, or too restricted, to build qualified behavioral models, due to the difficulty of data collection, the requirement of user privacy, and the fact that some users have only a few behavioral records. We are devoted to proving that a high-quality (effective, quick-response, and robust) behavioral model can be obtained by integrally using multi-dimensional behavioral data, even though the data are extremely insufficient in each dimension.

Generally, there are two paradigms to integrate behavioral data: the fused and joint manners. Fused models are a relatively simple and straightforward kind of composite behavioral model. They first capture features in each behavior space respectively, and then form a comprehensive metric based on these features in different dimensions. With the possible complementary effect among different behavior spaces, they can act as a feasible solution for integration [7, 17]. However, the identification efficacy can be further improved, since fused models neglect potential links among different spaces of behaviors. Take an example where a person posted a picture in an OSN when he/she visited a park. If this composite behavior is simply separated into two
independent parts, i.e., he/she once posted a picture and he/she once visited a park, the difficulty in relocating him/her within a group of users is possibly increased, since more users satisfy these two simple conditions than satisfy the original composite condition. In contrast, the joint model can sufficiently exploit the correlations between behaviors in different dimensions, which increases the certainty of users' behavior patterns and contributes to a better identification efficacy.

The underlying logic for the difference between the joint and fused models can also be explained by the well-known Chain Rule for Entropy [40], which indicates that the entropy of multiple simultaneous events is no more than the sum of the entropies of each individual event, with equality if the events are independent. It shows that the joint behavior has lower uncertainty compared with the sum of the uncertainties of each component [41]. Therefore, to fully utilize the potential information in composite behaviors for user profiling, we propose a joint model, specifically, a joint probabilistic generative model based on Bayesian networks, called the Composite Behavioral Model (CBM). It offers a composition of the typical features in two different behavior spaces: check-in location in the offline behavior space and UGC in the online behavior space.

Considering the composite behavior of a user, we assume that the generative mechanism is as follows: When a user plans to visit a venue and simultaneously post tips online, he/she subconsciously selects a specific behavioral pattern according to his/her behavioral distribution. Then, he/she comes up with a topic and a targeted venue based on the present pattern's topic and venue distributions, respectively. Finally, his/her comments are generated following the corresponding topic-word distribution. To estimate the parameters of the mentioned distributions, we adopt collapsed Gibbs sampling [42].

Based on the joint model CBM, for each composite behavior, denoted by a triple-tuple (u, v, D), we can calculate the chance of user u visiting venue v and posting a tip online with a set of words D. Taking into account the different levels of activity of different users, we devise a relative anomalous score S_r to measure the occurrence rate of each composite behavior (u, v, D). By these approaches, we finally realize real-time detection (i.e., judging by only one composite behavior) of identity theft suspects.

We evaluate our joint model by comparing it with three typical models and their fused model [17] on two real-world OSN datasets: Foursquare [43] and Yelp [44]. We adopt the area under the receiver operating characteristic curve (AUC) as the measure of detection efficacy. Particularly, the recall (True Positive Rate) reaches up to 65.3% in Foursquare and 72.2% in Yelp, respectively, with the corresponding disturbance rate (False Positive Rate) below 1%, while the fused model can only achieve 60.8% and 60.4% under the same condition, respectively. Note that this performance can be achieved by examining only one composite behavior per authentication, which guarantees the low response latency of our detection method. As an insightful result, we learn that the complementary effect does exist among different dimensions of low-quality records for modeling users' behaviors.
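To make the entropy argument above precise, the chain rule for entropy [40] gives, for an offline behavior X (e.g., the visited venue) and an online behavior Y (e.g., the posted tip),

$$H(X, Y) \;=\; H(X) + H(Y \mid X) \;\le\; H(X) + H(Y),$$

with equality if and only if X and Y are independent. Modeling the two dimensions jointly therefore never carries more uncertainty than modeling them separately and adding the results, and carries strictly less whenever the dimensions are correlated.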
The main contributions are summarized as follows:
• We propose a joint model, CBM, to capture both online and offline features of a user's composite behavior to fully exploit coarse behavioral data.
• We devise a relative anomalous score S_r to measure the occurrence rate of each composite behavior for realizing real-time identity theft detection.
• We perform experiments on two real-world datasets to demonstrate the effectiveness of CBM. The results show that our model outperforms the existing models and has low response latency.
5.2 Overview of the Solution

Online identity theft occurs when a thief steals a user's personal data and impersonates the user's account. Generally, a thief first gathers information about a targeted user to steal his/her identity and then uses the stolen identity to interact with other people to get further benefits [4]. Criminals in different online services usually have different motivations. An OSN user's behavior is usually composed of online and offline behaviors occurring in different behavioral spaces [17]. Based on this fact, we aim at devising a joint model to embrace them in a unified model to deeply extract information. Before presenting our joint model, named the Composite Behavioral Model (CBM), we provide some concepts as preparation. The relevant notations are listed in Table 5.1.

Definition 5.1 (Composite Behavior) A composite behavior, denoted by a four-tuple (u, v, D, t), indicates that at time t, user u visits venue v and simultaneously posts a tip consisting of a set of words D online.
In this work, the representation of a composite behavior can be simplified into a triple-tuple (u, v, D). We remark that for a composite behavior, the occurring time t is a significant factor. Two types of time attributes play important roles in digging
Table 5.1 Notations of parameters

Variable      Description
w             The word in UGC
v             The venue or place
π_u           The community memberships of user u, expressed by a multinomial distribution over communities
θ_c           The interests of community c, expressed by a multinomial distribution over topics
ϑ_c           A multinomial distribution over spatial items specific to community c
φ_z           A multinomial distribution over words specific to topic z
α, β, γ, η    Dirichlet priors to the multinomial distributions θ_c, φ_z, π_u and ϑ_c, respectively
Fig. 5.1 Graphical representation of the joint model. The parameters and notations are explained in Table 5.1
potential information for improving the identification. The first is the sequential correlation of behaviors. However, in some OSNs, the time intervals between adjacent behavioral records are usually overlong, which means that sequential correlations cannot be captured effectively. The second is the temporal property of behaviors, e.g., periodicity and preference variance over time. However, in some OSNs, the occurring time is recorded with a low resolution, e.g., by day, which shields the possible dependency of a user's behavior on the occurring time. Thus, it is difficult to obtain reliable time-related features of users' behaviors. Since we aim to propose a practical method based on uncustomized datasets of user behaviors, we only concentrate on the dependency between a user's check-in location and tip-posting content of each behavior, taking no account of the impact of the specific occurring time in this work. Thus, the representation of a composite behavior can be simplified into a triple-tuple (u, v, D) without confusion in this paper.

The graphical representation of our joint model CBM is demonstrated in Fig. 5.1. Our model is mainly based on the following two assumptions: (1) Each user behaves in multiple patterns with different probabilities; (2) Users with similar behavioral patterns have similar interests in topics and places. To describe the features of users' behaviors, we first introduce the topic of tips.

Definition 5.2 (Topic [45]) Given a set of words W, a topic z is represented by a multinomial distribution over words, denoted by φ_z, each component of which, φ_{z,w}, denotes the probability of word w occurring in topic z.

Next, we formulate a specific behavioral pattern of users by a concept called community.

Definition 5.3 (Community) A community is a set of users with the same behavioral pattern. Let C denote the set of all communities. A community c ∈ C has two critical parameters:
(1) A topic distribution θ_c, whose component, say θ_{c,z}, indicates the probability that the users in community c send a message with topic z.
(2) A spatial distribution ϑ_c, whose component, say ϑ_{c,v}, represents the chance that users in community c visit venue v.
More specifically, we assume that a community is formed by the following procedure: Each user u is included in communities according to a multinomial distribution, denoted by π_u. That is, each component of π_u, say π_{u,c}, denotes u's affiliation degree to community c. Similarly, we allocate each community c with a topic distribution θ_c to represent its online topic preference and a spatial distribution ϑ_c to represent its offline mobility pattern.
5.3 Identity Theft Detection Solutions in Online Social Networks

5.3.1 Composite Behavioral Model

Generally, users take actions according to their regular behavioral patterns, which are represented by the corresponding communities (Definition 5.3). We present the behavioral generative process in Algorithm 5.1: When a user u is going to visit a venue and post online tips there, he/she subconsciously selects a specific behavioral pattern, denoted by community c, according to his/her community distribution π_u (Line 11). Then, he/she comes up with a topic z and a targeted venue v based on the present community's topic and venue distributions (θ_c and ϑ_c, respectively) (Lines 12–13). Finally, the words of his/her tips in D are generated following the topic-word distribution φ_z (Line 15).
Exact inference of our joint model CBM is difficult due to the intractable normalizing constant of the posterior distribution [42]. We adopt collapsed Gibbs sampling for approximately estimating the distributions (i.e., θ, ϑ, φ, and π). As for the hyperparameters, we take the fixed values, i.e., α = 50/Z, γ = 50/C, and β = η = 0.01, where Z and C are the numbers of topics and communities, respectively. In each iteration, for each composite behavior (u, v, D), we first sample community c according to Eq. (5.1):
$$P(c \mid c^{\neg}, z, v, u) \;\propto\; (n^{\neg}_{u,c} + \gamma)\cdot \frac{n^{\neg}_{c,z} + \alpha}{\sum_{z}\,(n^{\neg}_{c,z} + \alpha)}\cdot \frac{n^{\neg}_{c,v} + \eta}{\sum_{v}\,(n^{\neg}_{c,v} + \eta)}, \qquad (5.1)$$
where c^¬ denotes the community allocation for all composite behaviors except the current one; z denotes the topic allocation for all composite behaviors; n_{u,c} denotes the number of times that community c is generated by user u; n_{c,z} denotes the number of times that topic z is generated by community c; n_{c,v} denotes the number of times that venue v is visited by users in community c; a superscript ¬ denotes the corresponding count with the current behavior excluded.
Algorithm 5.1: Joint Probabilistic Generative Process
Input: Community set C, topic set Z, user set U, composite behavior set B_u, word set D
Output: The learned joint probabilistic model
1:  foreach community c ∈ C do
2:      Sample the distribution over topics θ_c ∼ Dirichlet(·|α)
3:      Sample the distribution over venues ϑ_c ∼ Dirichlet(·|η)
4:  end
5:  foreach topic z ∈ Z do
6:      Sample the distribution over words φ_z ∼ Dirichlet(·|β)
7:  end
8:  foreach user u ∈ U do
9:      Sample the distribution over communities π_u ∼ Dirichlet(·|γ)
10:     foreach composite behavior (u, v, D) ∈ B_u do
11:         Sample a community indicator c ∼ Multi(π_u)
12:         Sample a topic indicator z ∼ Multi(θ_c)
13:         Sample a venue v ∼ Multi(ϑ_c)
14:         foreach word w ∈ D do
15:             Sample a word w ∼ Multi(φ_z)
16:         end
17:     end
18: end
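As a companion to Algorithm 5.1, the following NumPy sketch draws synthetic composite behaviors from the generative process; the corpus sizes are illustrative assumptions rather than our experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the experimental settings)
C, Z, V, W, U = 5, 4, 50, 200, 10          # communities, topics, venues, words, users
alpha, beta, gamma, eta = 50 / Z, 0.01, 50 / C, 0.01

# Sample the model parameters from their Dirichlet priors (Lines 1-9)
theta = rng.dirichlet(np.full(Z, alpha), size=C)    # theta_c: topics per community
vartheta = rng.dirichlet(np.full(V, eta), size=C)   # vartheta_c: venues per community
phi = rng.dirichlet(np.full(W, beta), size=Z)       # phi_z: words per topic
pi = rng.dirichlet(np.full(C, gamma), size=U)       # pi_u: communities per user

def generate_behavior(u, n_words=5):
    """Generate one composite behavior (u, v, D) following Lines 11-16."""
    c = rng.choice(C, p=pi[u])                  # behavioral pattern (community)
    z = rng.choice(Z, p=theta[c])               # topic of the tip
    v = rng.choice(V, p=vartheta[c])            # targeted venue
    D = rng.choice(W, size=n_words, p=phi[z])   # words of the tip
    return v, D

print(generate_behavior(u=0))
```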
Then, given a community c, we sample topic z according to Eq. (5.2):

$$P(z \mid z^{\neg}, c, D) \;\propto\; (n^{\neg}_{c,z} + \alpha)\prod_{w \in D} \frac{n^{\neg}_{z,w} + \beta}{\sum_{w}\,(n^{\neg}_{z,w} + \beta)}, \qquad (5.2)$$
where n_{z,w} denotes the number of times that word w is generated by topic z.
The inference algorithm is presented in Algorithm 5.2. We first randomly initialize the topic and community assignments for each composite behavior (Lines 2–4). Then, we update the community and topic assignments for each composite behavior based on Eqs. (5.1) and (5.2) in each iteration (Lines 6–9). Finally, we estimate the parameters, test the coming cases, and update the training set every I_s iterations after the I_b-th iteration (Lines 10–13) to address concept drift.
Algorithm 5.2: Inference Algorithm of Joint Model CBM
Input: user composite behavior collection B, number of iterations I, start saving step I_b, saving lag I_s, start training sequence number N_b, end training sequence number N_e, hyperparameters α, β, γ and η
Output: estimated parameters θ̂, ϑ̂, φ̂, π̂
1:  Create temporary variables θ^sum, ϑ^sum, φ^sum and π^sum, initialize them with zero, set the testing sequence number N_t = 0, and let B(N_t) denote the corresponding training collection for the testing behaviors whose sequence number is N_t
2:  foreach composite behavior (u, v, D) ∈ B(N_t) do
3:      Sample community and topic randomly
4:  end
5:  foreach iteration = 1 to I do
6:      foreach behavior (u, v, D) ∈ B(N_t) do
7:          Sample community c according to Eq. (5.1)
8:          Sample topic z according to Eq. (5.2)
9:      end
10:     if (iteration > I_b) and (iteration mod I_s == 0) then
11:         Return model parameters as follows:
            θ_{c,z} = (n_{c,z} + α) / Σ_z (n_{c,z} + α),   ϑ_{c,v} = (n_{c,v} + η) / Σ_v (n_{c,v} + η),
            π_{u,c} = (n_{u,c} + γ) / Σ_c (n_{u,c} + γ),   φ_{z,w} = (n_{z,w} + β) / Σ_w (n_{z,w} + β)
12:         Evaluate the corresponding test cases and update N_t++; N_b++; N_e++
13:     end
    end
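For concreteness, a minimal NumPy sketch of one collapsed Gibbs sweep implementing Eqs. (5.1) and (5.2) follows; the way behaviors and count matrices are stored here is an illustrative assumption, not a released implementation.

```python
import numpy as np

def gibbs_sweep(behaviors, c_assign, z_assign,
                n_uc, n_cz, n_cv, n_zw,
                alpha, beta, gamma, eta, rng):
    """One sweep of collapsed Gibbs sampling over composite behaviors.
    behaviors: list of (u, v, words); words is a list of word ids.
    n_uc, n_cz, n_cv, n_zw: count matrices maintained incrementally."""
    C, Z = n_cz.shape
    V, W = n_cv.shape[1], n_zw.shape[1]
    for i, (u, v, words) in enumerate(behaviors):
        c_old, z_old = c_assign[i], z_assign[i]
        # Remove the current behavior from the counts (the "¬" terms)
        n_uc[u, c_old] -= 1
        n_cz[c_old, z_old] -= 1
        n_cv[c_old, v] -= 1
        for w in words:
            n_zw[z_old, w] -= 1

        # Eq. (5.1): sample a community for this behavior
        p_c = ((n_uc[u] + gamma)
               * (n_cz[:, z_old] + alpha) / (n_cz.sum(1) + Z * alpha)
               * (n_cv[:, v] + eta) / (n_cv.sum(1) + V * eta))
        c_new = rng.choice(C, p=p_c / p_c.sum())

        # Eq. (5.2): sample a topic given the new community
        p_z = n_cz[c_new] + alpha
        for w in words:
            p_z = p_z * (n_zw[:, w] + beta) / (n_zw.sum(1) + W * beta)
        z_new = rng.choice(Z, p=p_z / p_z.sum())

        # Add the behavior back with its new assignments
        c_assign[i], z_assign[i] = c_new, z_new
        n_uc[u, c_new] += 1
        n_cz[c_new, z_new] += 1
        n_cv[c_new, v] += 1
        for w in words:
            n_zw[z_new, w] += 1
```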
To overcome the problem of data insufficiency, we adopt tensor decomposition [46] to discover users' potential behaviors. In our experiment, we use TwitterLDA [47] to obtain each UGC's topic and construct a tensor A ∈ R^{N×M×L}, with the three dimensions standing for users, venues, and topics. Then, A(u, v, z) denotes the frequency with which user u posts a message on topic z in venue v. We can decompose A into the multiplication of a core tensor S ∈ R^{d_U×d_V×d_Z} and three matrices, U ∈ R^{N×d_U}, V ∈ R^{M×d_V}, and Z ∈ R^{L×d_Z}, if using a Tucker decomposition model, where d_U, d_V, and d_Z denote the numbers of latent factors and N, M, and L denote the numbers of users, venues, and topics. An objective function to control the errors is defined as:

$$L(S, U, V, Z) = \frac{1}{2}\left\|A - S \times_U U \times_V V \times_Z Z\right\|^2 + \frac{\lambda}{2}\left(\|S\|^2 + \|U\|^2 + \|V\|^2 + \|Z\|^2\right) + \sum_{(i,j)\in F} u_i^{T} u_j,$$

where F is a set of friend pairs (i, j). A* = S ×_U U ×_V V ×_Z Z is the potential frequency tensor, and A*(u, v, z) denotes the frequency with which user u may post a message on topic z in venue v. A higher A*(u, v, z) indicates that user u has a higher chance to exhibit this kind of behavior in the future. We limit the competition space to the behavior space of u's friends, i.e., {(u, v, z) | A(u', v, z) > 0, (u, u') ∈ F}, and select the top 20 behaviors as his/her latent behaviors to improve data quality.
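The sketch below illustrates only this final augmentation step: given the observed tensor A, a low-rank reconstruction A* produced by any off-the-shelf Tucker solver, and a user's friend list, it selects the top-20 latent behaviors inside the friends' behavior space. The function and variable names are ours, chosen for illustration.

```python
import numpy as np

def latent_behaviors(A, A_star, u, friends, top_k=20):
    """Pick top-k latent (venue, topic) behaviors for user u.
    A: observed frequency tensor (users x venues x topics);
    A_star: its low-rank reconstruction; friends: iterable of u's friend ids."""
    # Competition space: (v, z) pairs that at least one friend has exhibited
    candidate = np.zeros(A.shape[1:], dtype=bool)
    for f in friends:
        candidate |= A[f] > 0
    # Rank candidates by their predicted frequency for user u
    scores = np.where(candidate, A_star[u], -np.inf)
    flat_scores = scores.ravel()
    order = np.argsort(flat_scores)[::-1][:top_k]
    return [np.unravel_index(i, scores.shape)
            for i in order if np.isfinite(flat_scores[i])]
```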
5.3.2 Identity Theft Detection Scheme

By the parameters ψ̂ = {θ̂, ϑ̂, φ̂, π̂} learnt from the inference algorithm (Algorithm 5.2), we estimate the logarithmic anomalous score S_l of a composite behavior (u, v, D) by Eq. (5.3):

$$S_l(u, v, D) = -\lg P(v, D \mid u) = -\lg \sum_{c}\hat{\pi}_{u,c}\,\hat{\vartheta}_{c,v}\sum_{z}\hat{\theta}_{c,z}\Big(\prod_{w\in D}\hat{\phi}_{z,w}\Big)^{\frac{1}{|D|}}. \qquad (5.3)$$
However, we may mistake some normal behaviors occurring with low probability, e.g., the normal behaviors of users whose behavioral diversity and entropy are both high, for suspicious behaviors. Thus, we propose a relative anomalous score S_r to indicate the trust level of each behavior by Eq. (5.4):

$$S_r(u, v, D) = 1 - P(u \mid v, D) = 1 - \frac{P(v, D \mid u)P(u)}{\sum_{u'} P(v, D \mid u')P(u')}. \qquad (5.4)$$
To reduce computational complexity, we randomly select n = 40 users to estimate the relative anomalous score S_r for each composite behavior. The selection process of the hyper-parameter n is omitted due to space limitations. Our experimental results in Sect. 5.4 show that the approach based on S_r outperforms the one based on S_l.
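The following sketch shows how Eqs. (5.3) and (5.4) can be evaluated from the estimated parameters; treating P(u) as uniform over the sampled candidate users is an assumption consistent with the random selection described above.

```python
import numpy as np

def log_anomalous_score(u, v, words, theta, vartheta, phi, pi):
    """S_l of Eq. (5.3): negative log-probability of the composite behavior."""
    # Geometric mean of word probabilities per topic, as in Eq. (5.3)
    word_term = np.exp(np.log(phi[:, words]).mean(axis=1))     # shape (Z,)
    p_vD_given_c = vartheta[:, v] * (theta @ word_term)         # shape (C,)
    return -np.log10(float(pi[u] @ p_vD_given_c))

def relative_anomalous_score(u, v, words, theta, vartheta, phi, pi,
                             n_candidates=40, rng=None):
    """S_r of Eq. (5.4), with P(u) taken as uniform over the sampled users."""
    rng = rng or np.random.default_rng()
    others = rng.choice(pi.shape[0], size=n_candidates, replace=False)

    def p_vD_given(user):
        word_term = np.exp(np.log(phi[:, words]).mean(axis=1))
        return float(pi[user] @ (vartheta[:, v] * (theta @ word_term)))

    num = p_vD_given(u)
    den = num + sum(p_vD_given(o) for o in others if o != u)
    return 1.0 - num / den
```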
5.4 Evaluation and Analysis

In this part, we present the experimental results to evaluate the proposed joint model CBM, and validate the efficacy of the joint model for identity theft detection on real-world OSN datasets.
5.4.1 Datasets

Our experiments are conducted on two real-life OSN datasets: Foursquare [43] and Yelp [44], which are two well-known online social networking service providers. Foursquare is a location-based service (LBS) provider and encourages users to share their current locations and comments with others. The adopted Foursquare dataset contains the check-in history of 31,494 users in LA. Yelp is another popular location-based social networking service provider, which publishes crowd-sourced reviews about local businesses. The adopted Yelp dataset contains the tips of 80,593 users. In both datasets, there are no URLs or other sensitive terms. Both datasets contain users'
social ties and behavioral records. Each social tie contains a user-ID and a friend-ID. Each behavioral record contains a user-ID, venue-ID, timestamp, and UGC. The basic statistics are shown in Table 5.2. Examples of our behavior records are illustrated in Table 5.3. We count each user's records and present the results in Fig. 5.2. It shows that most users have fewer than 5 records in both datasets. The quality of these datasets is too poor to model individual-level behavioral patterns for the majority of users, which confronts our method with a big challenge.
Table 5.2 Statistics of the Foursquare and Yelp datasets

                 Foursquare   Yelp
# of users       31,493       80,592
# of venues      143,923      42,051
# of check-ins   267,319      491,393

Table 5.3 User's behavior records in the Foursquare dataset

User-ID (Anonymized)   Venue-ID (Anonymized)   Timestamp    Message content
1                      1                       1299135219   Pen is better!
1                      2                       1299135270   The class sizes are only 20:1!!!
1                      3                       1299135328   GO PANTHERS! GO PANTHERS! Save a whale, eat a Sea King!
1                      4                       1301004689   Best school in the world. PV High is an uber fail compared to here
1                      4                       1303711907   The best teachers only come from Pen
1                      5                       1303971421   DJ Mike too the rescue!

Fig. 5.2 The distribution of user record counts
5.4.2 Experiment Settings

5.4.2.1 The Simulation of Post-intrusion Behavior Pattern
The so-called post-intrusion behavior refers to the behavior of a thief via the compromised account. We simulate three typical kinds of post-intrusion behaviors based on different kinds of scenarios. Specifically, in each experiment, we randomly simulate 5% of all behavior records in the testing set as anomalous behaviors and repeat this 10 times.
Behavioral Displacement. Most thieves usually take actions with specific aims. Such actions can be easily detected since they are quite different from normal behaviors. We focus on a harder scenario, where thieves show no specific aims and just act as usual in the compromised account. Accordingly, we swap two users' behavioral records to simulate this scenario.
Behavioral Imitation. Some extremely cunning thieves try to imitate the normal user's behavioral pattern and maintain part of the victim's behavioral pattern to get further benefits from the victim's friends. It is harder to detect this kind of post-intrusion behavior. We note that in our work it would make no sense if thieves completely imitated the behavioral patterns of victims. Accordingly, we simulate two kinds of variations, i.e., the guise in venue and the guise in content. We swap two normal users' behavioral records that have similar venues (two venues with similar tags) to simulate the guise in venue. Besides, we swap two normal users' behavioral records that have a similar topic to simulate a scenario where thieves imitate victims' habits to cheat their friends.
Random Synthesis. To simulate intangible behavioral patterns, we randomly generate behavioral records as post-intrusion behaviors.
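A minimal sketch of this injection procedure is given below; the DataFrame column names and the similarity predicate are illustrative assumptions rather than part of our released tooling.

```python
import random
import pandas as pd

def simulate_attacks(test_df: pd.DataFrame, frac=0.05, mode="displacement",
                     is_similar=None, seed=0):
    """Flag `frac` of test records as post-intrusion behaviors.
    mode: "displacement" (swap with any other user's record), "imitation"
    (swap only with records judged similar by `is_similar(rec_i, rec_j)`),
    or "random" (randomly synthesized venue/message)."""
    rng = random.Random(seed)
    df = test_df.copy()
    df["label"] = 0
    targets = rng.sample(list(df.index), max(1, int(frac * len(df))))
    for i in targets:
        if mode == "random":
            # Random synthesis: fabricate the venue and the message
            df.at[i, "venue_id"] = rng.choice(df["venue_id"].tolist())
            df.at[i, "message"] = rng.choice(df["message"].tolist())
        else:
            # Displacement / imitation: swap with another user's record
            others = [j for j in df.index
                      if df.at[j, "user_id"] != df.at[i, "user_id"]
                      and (mode != "imitation" or is_similar(df.loc[i], df.loc[j]))]
            j = rng.choice(others)
            for col in ("venue_id", "message"):
                df.at[i, col], df.at[j, col] = df.at[j, col], df.at[i, col]
        df.at[i, "label"] = 1
    return df
```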
5.4.2.2 Representative Models
We compare our joint model CBM with some representative models in OSNs. For the two different dimensions of behaviors, we choose CF-KDE and LDA as baseline models, respectively. For offline check-in behaviors, the Mixture Kernel Density Estimate (MKDE) is a typical spatial model describing a user's offline behavioral pattern [48]. However, it assumes that users tend to behave like each of their friends with the same probability, and it does not quantify the potential influence of different friends. To improve its performance, we introduce a collaborative filtering method to cooperate with MKDE, which we name CF-KDE. For online UGC, Latent Dirichlet Allocation (LDA) has been successfully applied to analyzing text from user messages on online social networks [45, 49]. The LDA detection algorithm uses users' documents as input and detects the corresponding topics. For online social networks with short messages, topic detection is known to be less efficient. For this reason, we aggregate the UGC of each user and his/her friends in the training set as a document and then run topic modeling on the documents. In Table 5.4, we list the features of these models. Next, we give a detailed description of how to deploy them in this work.
Table 5.4 Behaviors adopted in different models

          Online UGC   Offline Check-in
CF-KDE    NO           YES
LDA       YES          NO
FUSED     YES          YES
JOINT     YES          YES
CF-KDE. Before presenting the CF-KDE model, we introduce MKDE as brief prior knowledge. MKDE mainly utilizes a bivariate density function in the following equations to capture the spatial distribution of each user:

$$f_{KDE}(e \mid E, h) = \frac{1}{n}\sum_{j=1}^{n} K_h\big(e - e^j\big), \qquad (5.5)$$

$$K_h(x) = \frac{1}{2\pi h^2}\exp\Big(-\frac{1}{2}x^{T}H^{-1}x\Big), \quad H = \begin{pmatrix} h & 0 \\ 0 & h \end{pmatrix}, \qquad (5.6)$$

$$f_{MKDE}(e \mid E, h) = \alpha f_{KDE}(e \mid E^1) + (1-\alpha) f_{KDE}(e \mid E^2). \qquad (5.7)$$

In Eq. (5.5), E = {e^1, …, e^n} is a set of historical behavioral records for a user and e^j = <x, y> is a two-dimensional spatial location (i.e., an offline behavior). Equation (5.6) is a kernel function and H is the bandwidth matrix. MKDE adopts Eq. (5.7), where E^1 is a set of an individual's historical behavioral records (individual component), E^2 is a set of his/her friends' historical behavioral records (social component), and α is the weight variable for the individual component. In this paper, to detect identity thieves, we compute a surprise index S_e in Eq. (5.8) for each behavior e, defined as the negative log-probability of individual u conducting behavior e:

$$S_e = -\log f_{MKDE}(e \mid E_u, h_u). \qquad (5.8)$$
Furthermore, we can select the top-N behaviors with the highest S_e as suspicious behaviors. We introduce a collaborative filtering method to improve performance. Based on the historical behavioral records, it establishes a user–venue matrix R ∈ R^{|U|×|V|}, where U and V are the numbers of users and venues, respectively; R_{ij} = 1 if user i has visited venue j in the training set, otherwise R_{ij} = 0. We adopt a matrix factorization method with the objective function in Eq. (5.9) to obtain feature vectors for each user and venue:

$$L = \min_{U,V} \frac{1}{2}\sum_{i=1}^{U}\sum_{j=1}^{V}\big(R_{ij} - u_i^{T}v_j\big)^2 + \frac{\lambda_1}{2}\sum_{i=1}^{U} u_i^{T}u_i + \frac{\lambda_2}{2}\sum_{j=1}^{V} v_j^{T}v_j. \qquad (5.9)$$
Specifically, we let u_i = (u_i^{(1)}, u_i^{(2)}, …, u_i^{(k)})^T and v_j = (v_j^{(1)}, v_j^{(2)}, …, v_j^{(k)})^T. We adopt a stochastic gradient descent algorithm in Eqs. (5.10) and (5.11) in the optimization process:

$$u_i^{(k)} \leftarrow u_i^{(k)} - \alpha\Big(\sum_{j=1}^{V}\big(R_{ij} - u_i^{T}v_j\big)v_j^{(k)} + \lambda_1 u_i^{(k)}\Big), \qquad (5.10)$$

$$v_j^{(k)} \leftarrow v_j^{(k)} - \alpha\Big(\sum_{i=1}^{U}\big(R_{ij} - u_i^{T}v_j\big)u_i^{(k)} + \lambda_2 v_j^{(k)}\Big). \qquad (5.11)$$

Consequently, we can figure out R̂ = U^T V, and use r̂_{ij} = u_i^T v_j as the weight variable for the KDE model. To detect anomalous behaviors, we use Eq. (5.12) to measure the surprise index for each behavior e:

$$S_e = -\log \frac{\sum_{j=1}^{n} \hat{r}_{uj}\, K_h\big(e - e^j\big)}{\sum_{j=1}^{n} \hat{r}_{uj}}. \qquad (5.12)$$

We assert that the top-N behaviors with the highest S_e are suspicious behaviors.
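As a concrete illustration of Eqs. (5.6) and (5.12), the sketch below computes the weighted surprise index with a Gaussian kernel; the input arrays and the way the CF weights are obtained from the factor matrices are assumptions consistent with the derivation above.

```python
import numpy as np

def cf_kde_surprise(e, history, r_hat_u, h):
    """Surprise index S_e of Eq. (5.12) for a new location e = (x, y).
    history: (n, 2) array of past locations e_j (the user's and friends');
    r_hat_u: (n,) collaborative-filtering weights r̂_{uj} for those records;
    h: kernel bandwidth, with H = h * I as in Eq. (5.6)."""
    diff = history - np.asarray(e)                                # (n, 2)
    sq_dist = (diff ** 2).sum(axis=1)
    kernel = np.exp(-0.5 * sq_dist / h) / (2 * np.pi * h ** 2)    # Eq. (5.6)
    density = (r_hat_u * kernel).sum() / r_hat_u.sum()            # Eq. (5.12)
    return -np.log(density)

# The weights come from the factorization of Eq. (5.9), r̂_{uj} = u_i^T v_j, e.g.
# r_hat_u = user_factors[u] @ venue_factors[history_venue_ids].T
```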
LDA. A user's online behavior pattern can be denoted as the mixing proportions over topics. We aggregate the UGC of each user and his/her friends in the training set as a document, then use LDA to obtain each user's historical topic distribution θ_his. To get the present behavioral topic distribution θ_new in the testing set, for each behavior, we count the number of words assigned to the k-th topic and denote it by n(k). The k-th component of the topic proportion vector can be computed by

$$\theta_{new}^{(k)} = \frac{n(k) + \alpha}{\sum_{i=1}^{K}\big(n(i) + \alpha\big)}, \qquad (5.13)$$

where K is the number of topics and α is a hyperparameter. To detect anomalous behaviors, we measure the distance between a user's historical and present topic distributions by using the Jensen–Shannon (JS) divergence in Eqs. (5.14) and (5.15):

$$D_{KL}(\theta_{his}, \theta_{new}) = \sum_{i=1}^{K}\theta_{his}^{(i)}\cdot\ln\frac{\theta_{his}^{(i)}}{\theta_{new}^{(i)}}, \qquad (5.14)$$

$$D_{JS}(\theta_{his}, \theta_{new}) = \frac{1}{2}\big[D_{KL}(\theta_{his}, M) + D_{KL}(\theta_{new}, M)\big], \qquad (5.15)$$

where M = (θ_his + θ_new)/2. We consider that the top-N behaviors with the highest D_JS(θ_his, θ_new) are suspicious behaviors.
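A compact sketch of Eqs. (5.13)–(5.15) follows; the smoothing constant eps is an implementation detail we add to avoid division by zero, not part of the original formulation.

```python
import numpy as np

def topic_proportion(word_topic_counts, alpha):
    """Eq. (5.13): smoothed topic proportions of one behavior's words."""
    n = np.asarray(word_topic_counts, dtype=float)
    return (n + alpha) / (n + alpha).sum()

def js_divergence(theta_his, theta_new, eps=1e-12):
    """Jensen-Shannon divergence of Eqs. (5.14)-(5.15) between a user's
    historical and present topic distributions."""
    p = np.asarray(theta_his, dtype=float) + eps
    q = np.asarray(theta_new, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))   # Eq. (5.14)
    return 0.5 * (kl(p, m) + kl(q, m))                   # Eq. (5.15)
```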
Fused Model. Egele et al. [7] propose COMPA, which directly combines users' explicit behavior features, e.g., languages, links, message sources, etc. In our case, we introduce a fused model [17], which combines users' implicit behavior features discovered by CF-KDE and LDA to detect identity theft. We try different thresholds for the CF-KDE model and the LDA model (i.e., different classifiers). For each pair (i.e., a CF-KDE model and an LDA model), we treat any behavior that fails to pass either identification model as a suspicious behavior, and compute the true positive rate and false positive rate to draw the ROC curve and estimate the AUC value.
5.4.2.3 Metrics
For the convenience of description, we first give a confusion matrix in Table 5.5. In the experiments, we set anomalous behaviors as positive instances and focus on the following four metrics, since identity theft detection is essentially an imbalanced binary classification problem [50].
True Positive Rate (TPR/Recall): TPR is computed by TP/(TP + FN), and indicates the proportion of true positive instances in all positive instances (i.e., the proportion of anomalous behaviors that are detected among all anomalous behaviors). It is also known as recall. Specifically, we name it the detection rate.
False Positive Rate (FPR): FPR is computed by FP/(FP + TN), and indicates the proportion of false positive instances in all negative instances (i.e., the proportion of normal behaviors that are mistaken for anomalous behaviors among all normal behaviors). Specifically, we name it the disturbance rate.
Precision: The precision is computed by TP/(TP + FP), and indicates the proportion of true positive instances in all predicted positive instances (i.e., the proportion of detected anomalous behaviors among all suspected cases).
AUC: Given a rank of all test behaviors, the AUC value can be interpreted as the probability that a classifier/predictor will rank a randomly chosen positive instance higher than a randomly chosen negative one.
Table 5.5 Confusion matrix for binary classification

True condition   Predicted positive      Predicted negative
Positive         True Positive (TP)      False Negative (FN)
Negative         False Positive (FP)     True Negative (TN)
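As a quick reference, a minimal helper that computes the first three metrics from the confusion matrix above might look as follows; AUC is obtained from the score ranking instead, e.g., with scikit-learn.

```python
def detection_metrics(tp, fp, fn, tn):
    """Core metrics used in our evaluation, from the confusion matrix above."""
    return {
        "TPR / recall (detection rate)": tp / (tp + fn),
        "FPR (disturbance rate)": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# AUC is computed from the ranking of anomalous scores, e.g. with
# sklearn.metrics.roc_auc_score(y_true, anomalous_scores).
```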
5.4.2.4 Threshold Selection
Threshold selection is an important issue in classification tasks. Specifically, we take a case where C = 30 and Z = 20 as an example to present the threshold selection strategy. The parameter sensitivity analysis will be conducted in the following Sect. 5.4.2.5. We compare the distribution of the logarithmic anomalous score S_l (or relative anomalous score S_r) for normal behaviors with that for anomalous behaviors. Figures 5.3 and 5.4 present the differences between normal and anomalous behaviors in terms of the distributions of S_l and S_r, respectively. They show that the differences are both significant, and the difference in terms of S_r is much more obvious. To obtain a reasonable threshold, we perform 5-fold cross validation on the training set and focus on the performance where the threshold changes from 0.975 to 1, since this range contains 81.5% (81.4%) of all anomalous behaviors and 3.9% (4.8%) of all normal behaviors in Foursquare (Yelp). The detailed trade-offs are demonstrated in Figs. 5.5 and 5.6 from different aspects. To optimize the trade-offs of detection performance, we define the detection Cost in Eq. (5.16):
$$\mathrm{Cost} = \frac{\#\ \text{of newly mistaken normal behaviors}}{\#\ \text{of newly identified anomalous behaviors}}. \qquad (5.16)$$
Fig. 5.3 The histogram of the logarithmic anomalous score S_l (defined in Eq. (5.3)) for each behavior
Fig. 5.4 The histogram of the relative anomalous score S_r (defined in Eq. (5.4)) for each behavior
Fig. 5.5 A part of the distribution of the relative anomalous score S_r (defined in Eq. (5.4)) for each behavior
Fig. 5.6 A part of the ROC (receiver operating characteristic) curve of identity theft detection
Fig. 5.7 Detection costs with different thresholds
We present the threshold-cost curve in Fig. 5.7. It shows that a smaller threshold usually corresponds to a larger cost. We select the minimum threshold satisfying that the corresponding cost is less than 1. Thus, we choose 0.989 and 0.992 as the thresholds for Foursquare and Yelp, respectively. Under them, our joint model CBM reaches 62.32% (68.75%) in TPR and 0.85% (0.71%) in FPR on Foursquare (Yelp). Please refer to Table 5.6 for details.
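A minimal sketch of this cost-guided threshold search, assuming the validation scores and simulated labels are already available as arrays, is:

```python
import numpy as np

def select_threshold(scores, labels, grid, max_cost=1.0):
    """Choose the smallest threshold whose marginal Cost (Eq. (5.16)) stays
    below `max_cost`. A behavior is flagged when its relative anomalous score
    S_r is at least the threshold; `grid` holds candidate thresholds, e.g.
    np.arange(0.975, 1.0, 0.001), and the data come from the validation folds."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    chosen = None
    prev_tp = prev_fp = 0
    for t in sorted(grid, reverse=True):                   # from strict to loose
        flagged = scores >= t
        tp = int(np.sum(flagged & (labels == 1)))
        fp = int(np.sum(flagged & (labels == 0)))
        new_tp, new_fp = tp - prev_tp, fp - prev_fp
        cost = new_fp / new_tp if new_tp > 0 else np.inf   # Eq. (5.16)
        if cost < max_cost:
            chosen = t                                     # keep lowering while cheap
        prev_tp, prev_fp = tp, fp
    return chosen
```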
5.4.2.5 Parameter Sensitivity Analysis
Parameter tuning is another important part of our work. The performance of our model is indeed sensitive to the numbers of communities (C) and topics (Z). Therefore, we study the impact of varying these parameters in our model. We select the relative anomalous score S_r as the test variable and evaluate the performance of our model by changing the values of C and Z. The experimental results are summarized in Table 5.7. From the results on both datasets, the detection efficacy stabilizes when Z reaches 20, and C has a larger impact on the efficacy. Thus, we set C = 30 and Z = 20 in our joint model, and present the receiver operating characteristic (ROC)
Table 5.6 A summary of different metrics with the threshold 0.989 for Foursquare and 0.992 for Yelp, respectively

                    Foursquare   Yelp
Precision (%)       79.91        83.55
Recall (TPR) (%)    62.32        68.75
FPR (%)             0.85         0.71
AUC                 0.956        0.947
TNR (%)             99.15        99.29
FNR (%)             37.68        31.25
Accuracy (%)        97.26        97.76
F1                  0.700        0.754

Table 5.7 AUC on the Foursquare (Yelp) dataset

          C = 10          C = 20          C = 30
Z = 10    0.876 (0.910)   0.945 (0.936)   0.953 (0.945)
Z = 20    0.917 (0.915)   0.946 (0.938)   0.956 (0.947)
Z = 30    0.922 (0.917)   0.947 (0.938)   0.957 (0.947)
Fig. 5.8 The ROC curves of identity theft detection via the joint model CBM
and Precision-Recall curves in Figs. 5.8 and 5.9, respectively. Specifically, we present detection rate (TPR) in Table 5.8, where the disturbance rate (FPR) reaches 1% and 0.1%, respectively.
Fig. 5.9 The Precision-Recall curves of identity theft detection via the joint model CBM

Table 5.8 Detection rates with disturbance rates

                          Foursquare (%)   Yelp (%)
Disturbance rate = 0.1%   30.8             31.7
Disturbance rate = 1.0%   65.3             72.2
5.4.3 Performance Comparison

We compare the performance of our method with the typical ones in terms of detection efficacy (AUC) and response latency. The latter denotes the number of behaviors in the test set that need to be accumulated to detect a specific identity theft case.
5.4.3.1 Detection Efficacy Analysis
In Fig. 5.10, we present the results of all comparison methods. Our joint model outperforms all other methods on the two datasets. The AUC value reaches 0.956 and 0.947 under normal behavior spam attacks in the Foursquare and Yelp datasets, respectively. There are three reasons for the outstanding performance. Firstly, it embraces different types of behaviors and exploits them in a unified model. Secondly, it takes advantage of the community members' and friends' behavior information to overcome the data insufficiency and concept drift [51] in individual-level behavioral patterns. Finally, it utilizes correlations among different behavioral spaces.
For partial behavioral imitation attacks, our joint model also shows a nice performance. The detailed results can be found in Figs. 5.11, 5.12 and 5.13. From the results, we have several other interesting observations: (1) The LDA model performs poorly on both datasets, which may indicate that its performance is strongly sensitive to data quality. (2) The CF-KDE and LDA models do not perform well on the Yelp dataset compared to the Foursquare dataset, but the fused model [17] observes a surprising
132
5 Multidimensional Behavior Fusion: Joint …
Fig. 5.10 Identity theft detection efficacy
Fig. 5.11 The detection efficacy (AUC) via joint model in different scenarios
Fig. 5.12 The detection rate (TPR) via the joint model in different scenarios with disturbance rate = 0.01
reversion. (3) The joint model based on the relative anomalous score S_r outperforms the model based on the logarithmic anomalous score S_l. (4) The joint model (i.e., JOINT-SR; in the rest of Sect. 5.4.3, "the joint model" refers to the joint model based on S_r) is indeed superior to the fused model.
For random behavior attacks, our joint model shows better performance. Specifically, we apply the logarithmic anomalous score S_l in Eq. (5.3) for detecting this kind of attack. Besides, we present the details in Table 5.9, where the disturbance rate (FPR) reaches 1% and 0.1%, respectively.
Fig. 5.13 The efficacy (AUC) and detection rate (TPR) of identity theft detection via the joint model CBM in different scenarios with disturbance rate = 0.01 (FPR = 0.01). Painted ones denote AUC and shaded ones denote TPR

Table 5.9 The performance for random behavior attack

             Precision (%)   Recall (%)   FPR (%)   AUC
Foursquare   83.08           93.99        1.00      0.995
             97.82           86.58        0.10
Yelp         82.43           88.95        1.00      0.995
             97.57           76.09        0.10

5.4.3.2 Response Latency Analysis
For each model, we also evaluate the relationship between the efficacy and the response latency (i.e., a response latency k means that the identity theft is detected based on k recent continuous behaviors). Figures 5.14 and 5.15 demonstrate the AUC values and TPRs under different response latencies for each model on both datasets. The experimental results indicate that our joint model CBM is superior to all other methods. The AUC values of our joint model can reach 0.998 in both Foursquare and
Fig. 5.14 Identity theft detection efficacy via different response latencies (i.e., the number of behaviors in the test set we accumulated)
Fig. 5.15 The detection rates (TPR) via different response latencies with disturbance rate = 0.01 (FPR = 0.01)
Yelp with 5 test behavioral records. The detection rates (TPRs) of our joint model can reach 93.8% in Foursquare and 97.0% in Yelp with 5 test behavioral records and disturbance rates (FPRs) of 1.0%.
5.5 Literature Review

To prevent and detect identity theft in online services, developers have designed various authentication methods to identify a user's identity. Traditional password-based (username–password) authentication methods are still widely used. But passwords are easy to leak, forget, and copy. Later, authentication methods adopted a physical token instead of a password [52], but the token was easy to lose. Since biological characteristics are hard to copy or change, more and more applications turn to biometric identification technologies, such as fingerprint, face, iris, and speech recognition, which are stable and do not vary with time, for authentication [13, 14]. Sitova et al. [53] introduced hand movement, orientation, and grasp (HMOG), a set of behavioral features to continuously authenticate smartphone users. Rajoub and Zwiggelaar [15] used thermal imaging to monitor the periorbital region's thermal variations and tested whether it can offer a discriminative signature for detecting deception. However, these biometric technologies usually require expensive hardware devices, which makes them inconvenient and difficult to popularize.
Another drawback of the methods above is that they add redundant steps, which require users to spend extra time passing identification. Besides, they are all disposable identification measures. Once a criminal breaks through the wall, the defending system will fail to take further protection. On the contrary, when attackers face a continuous authentication system, they have to spend a prolonged time fooling the
system. Increasingly, researchers and security experts have realized that they cannot ensure users' security just by building higher and stronger digital walls around everything [16]. Thus, it is urgent to establish a non-intrusive and continuous authentication system.
Recently, researchers have found that users' behavior can identify their identity and reflect their personality. A study on 3 months of credit card records for 1.1 million people showed that four spatiotemporal points are enough to uniquely reidentify 90% of individuals [3]. Research found that computers outpacing humans in personality judgment presents significant opportunities and challenges in the areas of psychological assessment, marketing, and privacy [54]. Abouelenien et al. [30] explored a multi-modal deception detection approach that relied on a novel dataset of 149 multi-modal recordings, and integrated multiple physiological, linguistic, and thermal features. These works indicate that users' behavior patterns can represent their identities. Many studies thus turn to utilizing users' behavior patterns for identification. Behavior-based methods were born at the right moment and play important roles in a wide range of tasks, including preventing and detecting identity theft.
Typically, behavior-based user identification includes two phases: user profiling and user identifying. User profiling is a process to characterize a user with his/her historical behavioral data. Some works focus on statistical characteristics, such as the mean, variance, median, or frequency of a variable, to establish the user profile. Naini et al. [55] studied the task of identifying users by matching the histograms of their data in an anonymous dataset with the histograms from the original dataset. But such approaches mainly rely on experts' experience, since different cases usually have different characteristics. Egele et al. [7] proposed a behavior-based method to identify compromises of individual high-profile accounts. However, it required high-profile accounts, which were difficult to obtain. Other researchers discovered other features, such as tracing patterns and topic and spatial distributions, to describe user identity. Ruan et al. [32] conducted a study on online user behavior by collecting and analyzing user clickstreams of a well-known OSN. Lesaege et al. [31] developed a topic model extending Latent Dirichlet Allocation (LDA) to identify active users. Viswanath et al. [56] presented a technique based on Principal Component Analysis (PCA) that accurately modeled the "like" behavior of normal users in Facebook and identified significant deviations from it as anomalous behaviors. Zaeem et al. [33] proposed an approach which involved the novel collection of online news stories and reports on the topic of identity theft. Lichman and Smyth [48] proposed the MKDE model to accurately characterize and predict the spatial pattern of an individual's events. Tsikerdekis and Zeadally [57] presented a detection method based on nonverbal behavior for identity deception, which can be applied to many types of social media. The methods above mainly concentrated on a specific dimension of the composite behavior and seldom considered utilizing multi-dimensional behavior data. Vedran et al. [58] explored the complex interaction between social and geospatial behavior and demonstrated that social behavior can be predicted with high precision. This indicated that composite behavior features can identify one's identity. Yin et al.
[42] proposed a probabilistic generative model combining
spatiotemporal data and semantic information to predict users' behavior. Nilizadeh et al. [49] presented POISED, a system that leverages the differences in propagation between benign and malicious messages on social networks to identify spam and other unwanted content. These studies implied that composite behavior features are possibly helpful for user identification.
User identifying is a process to match the same user in two datasets or to distinguish anomalous users/behaviors. It can be applied to a variety of tasks, such as detecting anomalous users or matching users across different data sources. Mazzawi et al. [59] presented a novel approach for detecting malicious user activity in databases by checking users' self-consistency and global-consistency. Shabtai et al. [60] presented a behavior-based anomaly detection system for detecting meaningful deviations in a mobile application's network traffic to protect mobile device users and cellular infrastructure companies from malicious applications. Lee and Kim [34] proposed a suspicious URL detection system for Twitter to detect anomalous behavior. Thomas et al. [9] leveraged Monarch's feature collection infrastructure to study distinctions among 11 million URLs drawn from email and Twitter. These works mainly detected population-level anomalous behaviors, which differ strongly from other behaviors. They did not concern themselves with the coherence of a user's behavioral records, since their works implied that anomalous accounts were created by criminals (i.e., fake accounts). Cao et al. [23] designed and implemented a malicious account detection system for detecting both fake and compromised real user accounts. Zhou et al. [61] designed an FRUI algorithm to match users among multiple OSNs. Hao et al. [62] proposed a novel framework for user identification in cyber-physical space. Most of the existing related works mainly considered specific dimensions of users' behavior, and sufficient high-quality data are necessary for them. In this study, we aim to build high-performance behavioral models based on low-quality behavioral data by integrating different dimensions of behavioral records that are usually too sparse to support qualified models. The most relevant work to our study is [17], where different dimensions of behavioral records are fused to build a composite behavioral model for detecting identity theft under a special pattern.
5.6 Conclusion

We investigate the feasibility of building a ladder from low-quality behavioral data to a high-performance behavioral model for user identification in online social networks (OSNs). By deeply exploiting the complementary effect among OSN users' multi-dimensional behaviors, we propose a joint probabilistic generative model by integrating online and offline behaviors. When the designed joint model is applied to identity theft detection in OSNs, its comprehensive performance, in terms of the detection efficacy, response latency, and robustness, is validated by extensive evaluations on real-life OSN datasets. Particularly, the joint model significantly outperforms the existing fused model.
Our behavior-based method mainly aims at detecting identity thieves after the access control of the account is broken. It is therefore natural and promising to incorporate our method into traditional methods to better solve the identity theft problem.
References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20.
21. 22. 23. 24. 25.
26. 27. 28. 29. 30. 31.
J. Onaolapo, E. Mariconti, G. Stringhini, in Proceedings of the ACM IMC 2016, pp. 65–79 A. Mohan, in Proceedings of the CTS 2014, pp. 428–435 Y.A. De Montjoye, L. Radaelli, V.K. Singh et al., Science 347(6221), 536 (2015) P. Hyman, Commun. ACM 56(3), 18 (2013) L. Bilge, T. Strufe, D. Balzarotti, E. Kirda, in Proceedings of the WWW 2009, pp. 551–560 J. Lynch, Berkeley Technol. Law J. 20(1), 259 (2005) M. Egele, G. Stringhini, C. Kruegel, G. Vigna, IEEE Trans. Dependable Secure Comput. 14(4), 447 (2017) T.C. Pratt, K. Holtfreter, M.D. Reisig, J. Res. Crime Delinquency 47(3), 267 (2010) K. Thomas, C. Grier, J. Ma, V. Paxson, D. Song, in Proceedings of the IEEE Security and Privacy 2011, pp. 447–462 H. Li, G. Fei, S. Wang, B. Liu, W. Shao, A. Mukherjee, J. Shao, in Proceedings of the WWW 2017, pp. 1063–1072 A.M. Marshall, B.C. Tompsett, Comput. Law & Secur. Rev. 21(2), 128 (2005) B. Schneier, Commun. ACM 48(4), 136 (2005) M.V. Ruiz-Blondet, Z. Jin, S. Laszlo, IEEE Trans. Inf. Forensics Secur. 11(7), 1618 (2016) R.D. Labati, A. Genovese, E. Muñoz, V. Piuri, F. Scotti, G. Sforza, ACM Comput. Surv. (CSUR) 49(2), 24 (2016) B.A. Rajoub, R. Zwiggelaar, IEEE Trans. Inf. Forensics Secur. 9(6), 1015 (2014) M.M. Waldrop, Nature 533(7602) (2016) C. Wang, B. Yang, J. Cui, C. Wang, IEEE Trans. Comput. Soc. Syst. 6(4), 637 (2019) C. Shen, Y. Li, Y. Chen, X. Guan, R.A. Maxion, IEEE Trans. Inf. Forensics Secur. 13(1), 48 (2018) C. Wang, H. Zhu, IEEE Trans. Dependable Secure Comput. https://doi.org/10.1109/TDSC. 2020.2991872 H. Zheng, M. Xue, H. Lu, S. Hao, H. Zhu, X. Liang, K.W. Ross, in 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18–21, 2018 (2018), pp. 259–300 R.T. Mercuri, Commun. ACM 49(5), 17 (2006) G. Stringhini, P. Mourlanne, G. Jacob, M. Egele, C. Kruegel, G. Vigna, in Proceedings of the USENIX Security 2015, pp. 563–578 Q. Cao, X. Yang, J. Yu, C. Palow, in Proceedings of the ACM SIGSAC 2014, pp. 477–488 G. Stringhini, C. Kruegel, G. Vigna, in Proceedings of the ACSAC 2010, pp. 1–9 Y. Yao, B. Viswanath, J. Cryan, H. Zheng, B.Y. Zhao, in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30–November 03, 2017 (2017), pp. 1143–1158 C.W. C. Wang, H. Zhu, IEEE Trans. Dependable Secure Comput. https://doi.org/10.1109/ TDSC.2020.3037784 F. Ahmed, M. Abulaish, Comput. Commun. 36(10), 1120 (2013) G.R. Milne, L.I. Labrecque, C. Cromer, J. Consum. Aff. 43(3), 449 (2009) A. Abo-Alian, N.L. Badr, M.F. Tolba, Concurr. Comput.: Pract. Exp. 28(9), 2567 (2016) M. Abouelenien, V. Pérez-Rosas, R. Mihalcea, M. Burzo, IEEE Trans. Inf. Forensics Secur. 12(5), 1042 (2017) C. Lesaege, F. Schnitzler, A. Lambert, J. Vigouroux, in Proceedings of the IEEE ICDM 2016, pp. 997–1002
32. X. Ruan, Z. Wu, H. Wang, S. Jajodia, IEEE Trans. Inf. Forensics Secur. 11(1), 176 (2016) 33. R.N. Zaeem, M. Manoharan, Y. Yang, K.S. Barber, Comput. & Secur. 65, 50 (2017) 34. S. Lee, J. Kim, in 19th Annual Network and Distributed System Security Symposium, NDSS 2012, San Diego, California, USA, February 5–8, 2012, pp. 1–13 35. H. Li, Y. Ge, R. Hong, H. Zhu, in Proceedings of the ACM SIGKDD 2016, pp. 975–984 36. C. Shen, Y. Chen, X. Guan, R. Maxion, IEEE Trans. Dependable Secure Comput. (to be published). https://doi.org/10.1109/TDSC.2017.2771295 37. C. Shen, Y. Zhang, X. Guan, R.A. Maxion, IEEE Trans. Inf. Forensics Secur. 11(3), 498 (2016) 38. N. Hernandez, M. Rahman, R. Recabarren, B. Carbunar, in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS 2018, Toronto, ON, Canada, October 15-19, 2018 (2018), pp. 115–130 39. C. Wang, J. Zhou, B. Yang, in Proceedings of the ACM SIGIR 2017, pp. 825–828 40. T.M. Cover, J.A. Thomas, Elem. Inf. Theory 2(1), 12 (1991) 41. C. Song, A.L. Barabási, Science 327(5968), 1018 (2010) 42. H. Yin, Z. Hu, X. Zhou, H. Wang, K. Zheng, N.Q.V. Hung, S.W. Sadiq, in Proceedings of the IEEE ICDE 2016, pp. 942–953 43. J. Bao, Y. Zheng, M.F. Mokbel, in Proceedings of the ACM SIGSPATIAL 2012, pp. 199–208 44. S.K. C., A. Mukherjee, in Proceedings of the WWW 2016, pp. 369–379 45. D.M. Blei, A.Y. Ng, M.I. Jordan, J. Mach. Learn. Res. 3, 993 (2003) 46. Y. Wang, Y. Zheng, Y. Xue, in Proceedings of the ACM SIGKDD 2014, pp. 25–34 47. W.X. Zhao, J. Jiang, J. Weng, J. He, E. Lim, H. Yan, X. Li, in Proceedings of the ECIR 2011, pp. 338–349 48. M. Lichman, P. Smyth, in Proceedings of the ACM SIGKDD 2014, pp. 35–44 49. S. Nilizadeh, F. Labreche, A. Sedighian, A. Zand, J.M. Fernandez, C. Kruegel, G. Stringhini, G. Vigna, in Proceedings of the ACM CCS 2017, pp. 1159–1174 50. H. He, E.A. Garcia, IEEE Trans. Knowl. & Data Eng. 21(9), 1263 (2009) 51. E. Bursztein, B. Benko, D. Margolis, T. Pietraszek, A. Archer, A. Aquino, A. Pitsillidis, S. Savage, in Proceedings of the ACM IMC 2014, pp. 347–358 52. E. Dauterman, H. Corrigan-Gibbs, D. Mazières, D. Boneh, D. Rizzo, in 2019 IEEE Symposium on Security and Privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019 (2019), pp. 398– 416 53. Z. Sitova, J. Sedenka, Q. Yang, G. Peng, G. Zhou, P. Gasti, K.S. Balagani, IEEE Trans. Inf. Forensics Secur. 11(5), 877 (2016) 54. W. Youyou, M. Kosinski, D. Stillwell, PNAS 112(4), 1036 (2015) 55. F.M. Naini, J. Unnikrishnan, P. Thiran, M. Vetterli, IEEE Trans. Inf. Forensics Secur. 11(2), 358 (2016) 56. B. Viswanath, M.A. Bashir, M. Crovella, S. Guha, K.P. Gummadi, B. Krishnamurthy, A. Mislove, in Proceedings of the USENIX Security 2014, pp. 223–238 57. M. Tsikerdekis, S. Zeadally, IEEE Trans. Inf. Forensics Secur. 9(8), 1311 (2014) 58. V. Sekara, A. Stopczynski, S. Lehmann, PNAS 113(36), 9977 (2016) 59. H. Mazzawi, G. Dalaly, D. Rozenblatz, L. Ein-Dor, M. Ninio, O. Lavi, in Proceedings of the IEEE ICDE 2017, pp. 1140–1149 60. A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, Y. Elovici, Comput. & Secur. 43, 1 (2014) 61. X. Zhou, X. Liang, H. Zhang, Y. Ma, IEEE Trans. Knowl. & Data Eng. 28(2), 411 (2016) 62. T. Hao, J. Zhou, Y. Cheng, L. Huang, H. Wu, in Proceedings of the ACM SIGSPATIAL 2016 (2016), pp. 71:1–71:4
Chapter 6
Knowledge Oriented Strategies: Dedicated Rule Engine
6.1 Online Anti-fraud Strategy Based on Semi-supervised Learning

The online credit loan service (OCLS) is one of the most basic functional patterns of Internet finance [1]. Beyond doubt, the lifeblood of OCLSs is the risk evaluation of bad debts. Bad debts, also called bad loans, refer to the loans that online lending platforms cannot recover from borrowers. They are usually of two types: (1) the loan whose borrower is unable to repay the due loan for certain reasons; (2) the loan whose beneficiary deliberately denies it. The latter is the fraudulent loan, i.e., the so-called fraud, where its beneficiary, i.e., the fraudster, usually uses forged or misused information to defraud the loan [2]. Actually, OCLSs conduct the operation of "credit evaluation" to predict the non-fraud bad loans under the confirmation that the loans are not frauds. Therefore, fraud prediction is really necessary for predicting all kinds of bad loans at the pre-loan stage. It is exactly the focus of this work.
For loan fraudsters, the forged or misused application information should be used as many times as possible in order to defraud more money. This inevitably generates associations among the application information of different loans. Fraud features then often lurk in some of these associations. To a large extent, fraud detection is indeed the mining of such associations. The challenges here come from two facts: (1) the implicit associations containing fraud information often hide deeply in extremely sparse and multidimensional association networks of application information; (2) most ongoing loans have no labels, since OCLSs cannot determine whether they are frauds until a certain amount of non-payment occurs. Traditional machine learning methods, e.g., SVM [3], deep learning [4], and tree-based methods [5], are not good at dealing with these two problems [6], though they have achieved qualified anti-fraud performance in other Internet financial scenarios.
To mine those deep associations, graph-based learning methods, e.g., graph embeddings (GEs) [7] and graph neural networks (GNNs) [8], are almost certainly the preferred choice. GEs, as a special case of transductive learning, are
difficult to satisfy the low-latency demand of OCLSs [9], since the embeddings need to be relearned whenever the graph is updated. Different from GEs, spatial-domain graph convolution networks, as one kind of inductive learning method [10–14], do not need to be retrained every time new data come, which makes them perfectly suited to fraud prediction by mining anomalous associations in real time for OCLSs.
Before GNNs can carry out fraud prediction for OCLSs, where data labels are grossly insufficient as mentioned above, they are confronted with another difficulty, i.e., the semi-supervised learning problem. The good news is that GNNs can be naturally applied to the semi-supervised classification task on the graph [6]. Existing methods focus on improving GNNs' ability of semi-supervised learning by fully mining the internal information of graph data. The performance improvement is then logically limited by the quality of the data, especially the credibility of labels. A semi-supervised learning method still requires a certain number of labels; otherwise the learned classifier will lack generalization capability due to the insufficient propagation of label information [15], which undermines the feasibility of applying GNNs to the fraud prediction problem in OCLSs.
To break through such a performance ceiling, a natural idea is to increase the amount of labeled data. As a straightforward solution, the preliminary review by subject matter experts (SMEs) can provide effective information from which rough labels of unlabeled data can be obtained. Specifically, a potentially effective SME-based method is active learning [16]. It selects the most informative part of unlabeled data, gives it to the SMEs to label, then adds these labeled data to the training set and repeats the above steps. By active learning, a model with fairly good performance can be built. The core of active learning is to make up for the lack of label quantity by selecting the most valuable data. As a matter of fact, active learning is relatively applicable to scenarios where the experts only need some concrete knowledge, e.g., the ability to classify given pictures into cats and dogs. Nevertheless, it has a significant disadvantage, especially for fraud prediction in OCLSs: it heavily depends on the anti-fraud expertise of SMEs due to the requirement of non-intuitive judgment and abstract knowledge for fraud prediction.
This work turns to anti-fraud rules to increase data labels, since SMEs have been designing a lot of rules based on experience and have formed rule engines [17]. Actually, compared with labeling training data, in real-life lending platforms, rules are more often adopted to build rule-based fraud prediction engines with black lists, where the main obstacle is the increasingly serious conflict among continuously accumulated rules. So this work is devoted to exploring how to use rules to increase labeled training data for boosting GNNs in the fraud prediction of OCLSs.
We first propose an effective solution based on a weak supervision framework, i.e., Snorkel [18], and call it the Snorkel-based Semi-Supervised Graph Neural Network (S3GNN). Particularly, we specially devise a dedicated version of rule engines, called Graph-Oriented Snorkel (GOS), a graph-specific extension of Snorkel. Like rule-based engines, Snorkel also utilizes SMEs' knowledge to design rules, or, in its terms, to write labeling functions.
A more important advance is that Snorkel adopts a generative model to resolve contradictions or dependencies among rules, while our GOS further considers graphical constraints. With the data labeled by GOS, we train the subsequent heterogeneous attention-based GNN. It is important to point out that, in the anti-fraud graph, there may be multiple types of edges between each node pair. Therefore, it is necessary for each node to learn a unique representation from different types of edges. To address this challenge, we propose the Multiple Edge-Types Based Attention to fuse the information of different edge types. It is worth mentioning that, to the best of our knowledge, we are the first to apply the Snorkel framework to graph data and make improvements to adapt it to graphs. To evaluate our S3GNN, we conduct experiments on a real-life dataset. It is demonstrated that S3GNN has superiority in fraud prediction of OCLSs. The contributions of our work are summarized as follows: (1) We introduce external knowledge into GNNs and achieve the cooperation of internal and external information of graphs, which provides a new perspective for solving semi-supervised problems in graph representation learning. (2) We customize Snorkel specifically for graph data and learning to devise a Graph-Oriented Snorkel (GOS). (3) We evaluate our S3GNN on a real-life dataset. The experiments demonstrate that S3GNN has superiority in fraud prediction of OCLSs.
6.2 Development and Present State

6.2.1 Anti-fraud in Online Services

Cyber fraud has become a serious social and economic problem [19]. Machine learning, as a promising data-driven paradigm, has been widely used in anti-fraud fields [20] for various network services through supervised learning approaches such as logistic regression, SVM, and neural networks [3–5]. Babaev et al. [21] adopted a deep learning method, RNNs, on fine-grained transactional data to compute credit scores for loan applicants, and produced significant financial gains for the banks. Suarez-Tangil et al. [22] combined a range of structured, unstructured, and deep-learned features to build a detection system which stopped scammers as they created fraudulent profiles or before they engaged with potential victims. The aforementioned methods regard each entity independently and extract features from different aspects. However, in financial scenarios, entities may have rich interactions with each other. Hence, a few works started to utilize graphs for fraud detection. Wang et al. [23] extracted fine-grained co-occurrence relationships by using a knowledge graph and realized data enhancement for diversified behavior models in online payment fraud detection. Li et al. [24] modeled transactions by using a multipartite graph, and detected the complete flow of money from source to destination for money laundering detection. Liang et al. [25] developed an automated solution for fraud detection based on graph learning algorithms to separate fraudsters from regular customers and uncover groups of organized fraudsters in insurance fraud detection. Cheng et al.
[26] investigated the problem of predicting repayment delinquency in networked-guarantee loans, and attempted to predict bank guarantee loan defaults by incorporating the temporal network structure in an end-to-end learning framework. Recently, a few works have started to use graph neural networks in fraud detection. GEM [27] adaptively learned embeddings from heterogeneous graphs based on two prime weaknesses of attackers for malicious account detection. SemiGNN [6] proposed a hierarchical attention mechanism similar to HAN [28], but made the results for similar nodes similar by proposing a graph-based loss. GRC [13] used a self-attention mechanism to characterize multiple types of relationships, and used conditional random fields to constrain users with the same role to have similar representations.
6.2.2 Graph Neural Networks

Graph neural networks (GNNs) can effectively exploit the structure information and node information of a graph to learn better node representations. Based on spectral theory, Bruna et al. [29] applied a learnable convolution operation to graphs for the first time. GCN [30] simplified the definition of spectrum-based graph convolution. GraphSAGE [10] applied an aggregation operator to gather neighbor information and achieve inductive learning. GAT [31] applied the attention mechanism to graph neural networks for the first time. All the works above target homogeneous graphs, but most real-world graphs are heterogeneous information networks with multi-typed nodes and edges. HAN [28] used a two-level attention mechanism in heterogeneous graphs and used meta-paths to transform a heterogeneous graph into homogeneous graphs. HetGNN [32] used a sampling strategy based on random walk with restart to convert heterogeneous graphs into homogeneous graphs. Both HetSANN [33] and GTN [34] explored directly encoding node information in heterogeneous graphs without using manually designed meta-paths. xFraud [35] trains a self-attention heterogeneous graph neural network on malicious transactions and outputs human-understandable explanations through its explainer, which is a big step for the interpretability of graph models. However, although most works mention that GNNs can be applied to semi-supervised problems, only Sun et al. [15] pointed out that too few labels will limit the performance ceiling of GNNs.
6.2.3 Weak Supervision

Weak supervision can be roughly divided into three categories [36]: incomplete supervision, inexact supervision, and inaccurate supervision. We mainly focus on incomplete supervision. Semi-supervised learning [37] makes certain assumptions to heuristically label the unlabeled data. Active learning [16] aims to label the most informative data for the model with expert intervention. Recently, many works have focused on weak supervision in fraud risk detection. SPFD [38] proposed constrained seed k-means for cash pre-loan fraud detection. MFEFD [39] was trained in a semi-supervised manner which combined supervised and unsupervised training for detecting electricity fraud. As a semi-supervised method, GCN [30] can adapt to insufficient labels, although this affects its performance limit. As the first-of-its-kind system that enables users to train state-of-the-art models without hand-labeling any training data, Snorkel [18] provides the possibility of weakly supervised learning on graphs.
6.3 Risk Prediction Measures in Online Lending Services

To overcome the significant weakness of labels, we propose a novel method for risk prediction in online lending services, called Snorkel-based Semi-Supervised Graph Neural Network (S3GNN). The overview of the whole method is shown in Fig. 6.1. Our method is divided into three steps. For an applicant-associated graph full of heterogeneous information, the first step is to design labeling functions based on expert knowledge, and to apply these labeling functions to label some unlabeled nodes through our devised Graph-Oriented Snorkel (GOS). In the second step, with the rough labels supplemented by GOS, we propose a heterogeneous graph neural network to learn the embeddings of applicant nodes on the entire graph. Finally, we use the focal loss to train the model due to the unbalanced distribution of labels.
6.3.1 Preliminary

6.3.1.1 Labeling Function
Rather than hand-labeling training data, users of Snorkel write labeling functions, which allow them to express various weak supervision sources such as patterns, heuristics, external knowledge bases, and more.
6.3.1.2 Meta-Path
A meta-path $\Upsilon$ [28] is defined as a path in the form of
$$A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1},$$
abbreviated as $A_1 A_2 \cdots A_{l+1}$, which describes a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between objects $A_1$ and $A_{l+1}$, where $\circ$ denotes the composition operator on relations.

Fig. 6.1 The framework of the proposed model S3GNN. (1) SMEs design labeling functions on different meta-paths and generate a label matrix. Through a generative model we obtain rough labels for unlabeled nodes. (2) Then the model encodes the attribute nodes, and the initial embedding of the applicant node (node $u$ in this case) is the aggregation of its neighbor attribute nodes together with deep features. (3) On different meta-paths, it aggregates the embeddings of neighbor nodes through an attention mechanism, and finally aggregates the embeddings on all meta-paths through a fully connected layer
6.3.1.3 Meta-Path Based Neighbors
Given a node $u$ and a meta-path $\Upsilon$ in a heterogeneous graph, the meta-path based neighbors $N_u^{\Upsilon}$ of node $u$ are defined as the set of nodes which connect with node $u$ via meta-path $\Upsilon$. Note that a node's meta-path based neighbors include the node itself.
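To make the definition concrete, the following minimal Python sketch (an illustration only, not part of the proposed system) collects the Applicant-Address-Applicant neighbors of a node on a toy heterogeneous graph built with networkx; all node IDs, type names, and edge labels are assumptions.

import networkx as nx

G = nx.Graph()
G.add_nodes_from(["u1", "u2", "u3"], ntype="applicant")
G.add_nodes_from(["addr1"], ntype="address")
G.add_edge("u1", "addr1", etype="has-homeaddress")
G.add_edge("u2", "addr1", etype="has-companyaddress")

def metapath_neighbors(graph, node, middle_type):
    # Return the node itself plus applicants reachable via one attribute node
    # of the given type (e.g., the Applicant-Address-Applicant meta-path).
    neighbors = {node}
    for attr in graph.neighbors(node):
        if graph.nodes[attr].get("ntype") == middle_type:
            for other in graph.neighbors(attr):
                if graph.nodes[other].get("ntype") == "applicant":
                    neighbors.add(other)
    return neighbors

print(metapath_neighbors(G, "u1", "address"))  # {'u1', 'u2'}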
6.3.2 Graph-Oriented Snorkel

The designed GOS is a weak supervision framework that can efficiently absorb expert knowledge. The original design of Snorkel [18] is not specifically for graphs. Considering the constraints imposed by edges in graphs, we modify its generative model accordingly. GOS provides effective ways to construct labeling functions. From the evaluation results of each node by the labeling functions, a label matrix is generated. By a generative model with graphical constraints, we can obtain rough labels for unlabeled nodes.
6.3.2.1 Writing Labeling Functions
We generate training labels by writing labeling functions. A labeling function $\lambda$ takes an unlabeled node $x \in X$ as input and outputs a label; formally, $\lambda : X \to Y \cup \{\emptyset\}$, where $Y = \{0, 1\}$ in the binary setting and $\emptyset$ denotes that the labeling function abstains. Our graph of loan applicants contains a variety of textual information, and using this information to write labeling functions requires real domain expertise. We design the labeling functions for each meta-path in the heterogeneous graph. In this method, we treat the heterogeneous graph as multiple homogeneous graphs related by meta-paths. Since each meta-path contains corresponding business information and expert rules, we mine these expert rules in each meta-path and design the corresponding labeling functions. For example, on the meta-path applyno-address-applyno, two applicant nodes are connected to the same address node but the edge types are has-companyaddress and has-homeaddress. Based on expert experience, both applications are at risk of fraud, because the same address exhibits two different functional attributes. For another example, as the labeling function $LF\_via\_ACA(x)$ shown in Fig. 6.2, multiple loan applicants all claim to own a company, but they have the same company name. Obviously, this is implausible even judged by common sense, so we determine that these applicants constitute gang fraud. Note that each labeling function can only cover a part of the data, which is why we need to use expert knowledge to design a sufficient number of labeling functions. Here is an example of a labeling function in Python 3:

def check_companyaddress(x):
    # Abstain unless all of the evidence below is observed.
    if (x.addedge_type == 'home'
            and x.addedge_neightype == 'company'
            and x.oneorder_fraud > 1):
        return BLACK
    return ABSTAIN

The code is explained as follows: if an edge of an applicant node is of type has-homeaddress while the other end of the edge is a company attribute node, and the number of fraud nodes among the first-order neighbors of the applicant node is greater than 1, then we determine that the application node is a fraud node; otherwise we abstain. Given $m$ unlabeled graph nodes and $n$ labeling functions, we can generate a label matrix $\Lambda^{m\times n}$, where $\Lambda_{i,j} = \lambda_j(x_i)$.
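As a minimal illustration (not the book's code), the label matrix described above can be assembled by applying every labeling function to every unlabeled node; the ABSTAIN placeholder value and the node list are assumptions.

import numpy as np

ABSTAIN, WHITE, BLACK = -1, 0, 1   # -1 marks an abstention

def build_label_matrix(nodes, labeling_functions):
    # Rows correspond to unlabeled nodes, columns to labeling functions.
    m, n = len(nodes), len(labeling_functions)
    L = np.full((m, n), ABSTAIN, dtype=int)
    for i, x in enumerate(nodes):
        for j, lf in enumerate(labeling_functions):
            L[i, j] = lf(x)
    return L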
Fig. 6.2 The design of labeling functions. This is a representative partial subgraph extracted from our real-world dataset. We hide the specific attribute values and retain the main topological relationships. The central applicant node $u$ is connected to other applicant nodes through multiple attribute nodes. We observe that these application nodes are associated with node $u$ via different meta-paths, and each meta-path contains different rule information. In this partial subgraph, there are mainly four meta-paths (represented by edges in different colors): Applicant-CompanyName-Applicant (ACA), Applicant-Address-Applicant (AAA), Applicant-Mobile-Applicant (AMA), and Applicant-Idno-Applicant (AIA). As shown in the figure, we design these simple labeling functions on different meta-paths based on some effective expert knowledge
6.3.2.2 Generative Model
Since the labeling functions we design are all empirical rules, they inevitably contain some contradictory or interdependent relationships. The original Snorkel models the true class label of a data point as a latent variable in a probabilistic model. For example, we can model each labeling function as an independent noisy voter, i.e., one that makes errors uncorrelated with the other labeling functions. We can also model statistical dependencies between the labeling functions to improve predictive performance. For example, if two labeling functions express similar heuristics, we can include this dependency in the model and avoid a double-counting problem. Then we select a set $C$ of labeling function pairs $(j, k)$ to model as correlated. This defines a generative model of the votes of the labeling functions as noisy signals about the true label. The original Snorkel encodes a factor graph to integrate these noisy signals by using three factor types to represent the labeling propensity, accuracy, and pairwise correlations of labeling functions [18]:
$$\phi_j^{Prop}(\Lambda_i, y_i) = \mathbb{1}\{\Lambda_{i,j} \neq \emptyset\}, \qquad (6.1)$$
$$\phi_j^{Acc}(\Lambda_i, y_i) = \mathbb{1}\{\Lambda_{i,j} = y_i\}, \qquad (6.2)$$
$$\phi_{j,k}^{Corr}(\Lambda_i, y_i) = \mathbb{1}\{\Lambda_{i,j} = \Lambda_{i,k}\}, \quad (j, k) \in C, \qquad (6.3)$$
where $\mathbb{1}\{\cdot\}$ is an indicator function and $\Lambda_i \in \mathbb{R}^{n\times 1}$ is the labeling vector of an unlabeled graph node $x_i$. Note that the true label $y_i$ of $x_i$ is a latent variable that cannot be observed. Snorkel defines the vector $\phi(\Lambda_i, y_i) \in \mathbb{R}^{(2n+|C|)\times 1}$ for $x_i$ as the concatenation of these factors over all labeling functions $j = 1, \ldots, n$ and potential correlations $C$. The generative model $P(\Lambda_i, y_i)$ is defined as a joint distribution:
$$P(\Lambda_i, y_i) = \frac{\exp\left(\theta^{T}\phi(\Lambda_i, y_i)\right)}{\sum_{y_i\in\{0,1\}}\prod_{i=1}^{m}\exp\left(\theta^{T}\phi(\Lambda_i, y_i)\right)}. \qquad (6.4)$$

6.3.2.3 Generative Model with Graph Constraints
As a matter of fact, Snorkel is not designed for graph-oriented applications. Although it considers the potential correlations between labeling functions in the generative model, it neglects the correlations between data points. Such correlations are precisely what makes graph data different from other types of data. According to the main features of frauds in OCLSs, we impose a constraint on all graph nodes: the closer two nodes are, the more likely they are to have the same label. Therefore, we define the weight matrix $W$:
$$w_{ij} = \begin{cases} \dfrac{1}{\sqrt{hop_{ij}}}\exp\left(-\dfrac{\|\Lambda_i - \Lambda_j\|^2}{\delta^2}\right) & \text{if } x_j \in N_k(x_i), \\[1mm] 0 & \text{otherwise,} \end{cases} \qquad (6.5)$$
where $\delta$ is a hyperparameter, $hop_{ij}$ is the number of hops between $x_i$ and $x_j$, and $N_k(x_i)$ denotes the set of $k$-order neighbors of $x_i$. Inspired by [40], we introduce an energy function $E(\Lambda)$ to measure the label similarity between unlabeled nodes on the entire graph:
$$E(\Lambda) = \sum_{y\in\{0,1\}}\sum_{i,j} w_{ij}\left\|\frac{P(\Lambda_i, y)}{P(\Lambda_i)} - \frac{P(\Lambda_j, y)}{P(\Lambda_j)}\right\|^2, \qquad (6.6)$$
$$P(\Lambda_i) = \sum_{y_i\in\{0,1\}} P(\Lambda_i, y_i). \qquad (6.7)$$
Here $P(\Lambda_i)$ is the marginal distribution. Then we learn the parameter $\theta \in \mathbb{R}^{(2n+|C|)\times 1}$ by minimizing the negative log marginal likelihood, which avoids using the true labels $y$:
$$\hat{\theta} = \arg\min_{\theta}\left(-\sum_{i=1}^{m}\log P(\Lambda_i) + \mu E(\Lambda)\right), \qquad (6.8)$$
where $\mu$ is a coefficient used to balance the marginal likelihood and the energy function. Then we convert the outputs of the generative model $\tilde{y}_i = P_{\hat{\theta}}(y_i \mid \Lambda_i)$ into rough labels $\hat{y}_i$, and these rough labels can then be used to train a discriminative model:
$$\hat{y}_i = \begin{cases} 1 & \text{if } \tilde{y}_i > 0.5, \\ 0 & \text{otherwise.} \end{cases} \qquad (6.9)$$
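For reference, the standard (non-graph) generative step that GOS extends can be run with the open-source Snorkel package; the sketch below is a generic illustration of that off-the-shelf LabelModel (v0.9-style API) and does not include the graph constraint described above. The toy label matrix is an assumption.

import numpy as np
from snorkel.labeling.model import LabelModel

# L: the m x n label matrix produced by the labeling functions,
# with -1 denoting an abstention.
L = np.array([[1, -1, 1],
              [0,  0, -1],
              [-1, 1, 1]])

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=123)

probs = label_model.predict_proba(L)             # estimates of P(y_i | Lambda_i)
rough_labels = (probs[:, 1] > 0.5).astype(int)   # thresholding as in Eq. (6.9)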
6.3.3 Heterogeneous Graph Neural Network

In this work we represent the applicant-information-associated graph as a heterogeneous graph. A heterogeneous graph is defined as $G = (V, E)$, where $V$ denotes the nodes and $E$ denotes the edges in the graph. The nodes $V$ are mainly divided into two types, applicant nodes $U$ and attribute nodes $T$, among which the attribute nodes can be further divided into more types. Each applicant node is connected to attribute nodes through different types of edges.
6.3.3.1 Feature Engineering
In the anti-fraud field of online lending, traditional machine learning methods rely heavily on manual features constructed by experts. Therefore, we adopt a combination of a small number of manual features and automated feature engineering, e.g., Deep Feature Synthesis (DFS) [41], to reduce manual effort and enhance the utility of features. Manual features and Deep Feature Synthesis: it should be pointed out that the so-called manual features are mainly graph features that we extract from the graph. After storing our heterogeneous graph in the Neo4j database, we use Cypher to quickly query the statistics of nodes and edges to generate graph features. Table 6.1 shows the main graph features we extract. The next step is to enable automated feature engineering and use the deep feature synthesis algorithm to automatically construct other risk features as input to the risk model.
Table 6.1 The extracted features in loan transactions

Features   Description
Degree     The number of neighbors of the node
ApplOne    The number of first-order neighbors of the node
ApplTwo    The number of second-order neighbors of the node
FraudOne   The number of fraud neighbors in the node's first-order neighbors
FraudTwo   The number of fraud neighbors in the node's second-order neighbors
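A minimal sketch of how such graph features can be queried with Cypher through the official neo4j Python driver. The node label "Applicant", the property names, the connection details, and the reading of "first-order fraud neighbors" as applicants reachable through one shared attribute node are all illustrative assumptions, not the book's actual schema.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

DEGREE_QUERY = """
MATCH (a:Applicant {app_id: $app_id})--(n)
RETURN count(n) AS degree
"""

FRAUD_ONE_QUERY = """
MATCH (a:Applicant {app_id: $app_id})--()--(b:Applicant {is_fraud: true})
RETURN count(DISTINCT b) AS fraud_one
"""

def graph_features(app_id):
    # Run the two statistics queries and return them as a feature dict.
    with driver.session() as session:
        degree = session.run(DEGREE_QUERY, app_id=app_id).single()["degree"]
        fraud_one = session.run(FRAUD_ONE_QUERY, app_id=app_id).single()["fraud_one"]
    return {"Degree": degree, "FraudOne": fraud_one}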
Deep Feature Synthesis (DFS) [41] is an algorithm for automatic feature construction. Its input has two main parts: one is a collection of related entities, and the other is a set of primitive functions, i.e., a series of basic mathematical functions. Constructing a feature requires many calculation and conversion functions; by decomposing these functions into primitives, the underlying structure of the constructed feature can be obtained. A basic aggregation function takes a column of features of related instances in one or more entity tables as input and returns a single value, which is equivalent to the aggregate function of a relational database, such as an Average or a Median function. A basic transformation function takes one or several columns of features of all instances in an entity table as input and returns a new column of features, which is equivalent to a common operator in a relational database, such as an Addition or an Equal function. Algorithms based on deep feature synthesis generate a large number of features, so we filter the features. We use the Pearson correlation coefficient to measure the linear correlation between features:
$$pcc = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}},$$
i.e., the quotient of the covariance and the product of the standard deviations of the two variables. The Pearson correlation coefficient is widely used to measure the linear correlation of two vectors. Its value varies from -1 to +1: a value of +1 means that the two vectors are positively correlated, a value of -1 means a negative correlation, and a value of 0 means that there is no linear correlation. Then we use $pcc$ as the threshold of density clustering to select the features we want to keep. We denote the feature vector of the applicant node $u \in U$ obtained in this step as $h_{DFS}$.

Encoding Attribute Nodes: although the automated feature engineering has obtained some useful features, in order to make full use of the attribute information contained in the nodes, we encode the attribute nodes. The attribute nodes contain text information or numerical features. For example, the address node contains the family, household, or company address, the company node contains the company name and a brief introduction, and the idno node holds the identity card number. For an attribute node $t \in T$ that has text information, we utilize FastText [42] to pre-train on the text information, and we denote the embedding of $t$ as $h_t$. For an attribute node $t$ that has numerical features, we use DeepWalk [43] to generate walk sequences over attribute nodes of the same type and then generate the embedding of the attribute node as $h_t \in \mathbb{R}^{d_h \times 1}$.
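A minimal sketch of this correlation-based filtering (not the book's implementation, which clusters features by density clustering): pairwise Pearson correlations are computed with pandas and one representative is kept from each group of highly correlated features; the 0.9 threshold is an assumption.

import pandas as pd

def filter_correlated_features(feature_matrix: pd.DataFrame, threshold: float = 0.9):
    # Absolute pairwise Pearson correlations between constructed features.
    corr = feature_matrix.corr(method="pearson").abs()
    kept = []
    for col in corr.columns:
        # Keep a feature only if it is not strongly correlated
        # with any feature that has already been kept.
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return feature_matrix[kept]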
6.3.3.2 Encoding Applicant Nodes
After we have the embeddings of the attribute nodes, there are several choices for how to represent the applicant nodes from the attribute embeddings. Directly using simple concatenation or averaging of the attribute embeddings might lose the correlations between attribute nodes. Inspired by HetGNN [32], we utilize a bi-directional LSTM (Bi-LSTM) [44] to capture the deep interaction information between attribute nodes. Given an applicant node $u \in U$ and its neighbor attribute nodes $T_u$, its initial embedding $z_u \in \mathbb{R}^{d\times 1}$ is calculated as follows:
$$h_u = \frac{1}{|T_u|}\sum_{t\in T_u}\tanh\left(\overrightarrow{LSTM}(h_t) \oplus \overleftarrow{LSTM}(h_t)\right), \qquad (6.10)$$
$$z_u = h_u \oplus h_{DFS}, \qquad (6.11)$$
where $\oplus$ denotes concatenation and the form of the LSTM is:
$$z_j = \sigma(U_z h_t + W_z q_{j-1} + b_z),$$
$$f_j = \sigma(U_f h_t + W_f q_{j-1} + b_f),$$
$$o_j = \sigma(U_o h_t + W_o q_{j-1} + b_o),$$
$$\tilde{c}_j = \tanh(U_c h_t + W_c q_{j-1} + b_c),$$
$$c_j = f_j \otimes c_{j-1} + z_j \otimes \tilde{c}_j,$$
$$q_j = \tanh(c_j) \otimes o_j,$$
where $U_j, W_j, b_j\ (j \in \{z, f, o, c\})$ are learnable parameters, $q_j \in \mathbb{R}^{d\times 1}$ denotes the output hidden state of $t_j$, $\sigma$ is an activation function, and $\otimes$ is the Hadamard product. After the deep interaction of the Bi-LSTM and the nonlinear transformation, we can better capture the correlations between the attribute nodes and achieve deeper integration.
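The following PyTorch sketch (an illustration under assumed dimensions, not the authors' code) mirrors Eq. (6.10): the attribute embeddings of one applicant are run through a Bi-LSTM, the two directions are concatenated, passed through tanh, and mean-pooled, then concatenated with the DFS feature vector as in Eq. (6.11).

import torch
import torch.nn as nn

class AttributeAggregator(nn.Module):
    def __init__(self, attr_dim: int, hidden_dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(attr_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, attr_embeddings: torch.Tensor) -> torch.Tensor:
        # attr_embeddings: (1, |T_u|, attr_dim) -- the neighbor attribute nodes
        # of a single applicant node, in an arbitrary order.
        out, _ = self.bilstm(attr_embeddings)         # (1, |T_u|, 2*hidden_dim)
        h_u = torch.tanh(out).mean(dim=1).squeeze(0)  # Eq. (6.10): tanh + mean
        return h_u

agg = AttributeAggregator(attr_dim=256, hidden_dim=64)
h_u = agg(torch.randn(1, 5, 256))          # five attribute neighbors
z_u = torch.cat([h_u, torch.randn(222)])   # Eq. (6.11): stand-in for h_DFS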
6.3.3.3 Multiple Edge-Types Based Attention
Previous graph neural networks usually study graphs with a single edge type between nodes [6, 28, 45], and the influence of multiplex edge types is rarely considered. For example, in the IMDB network, as shown in Fig. 6.3a, there is only one edge type between Actor and Movie. In our applicant-information-associated network, the type of an edge between nodes is not determined solely by the types of the node pair. As shown in Fig. 6.3b, with the meta-path Applicant-Address-Applicant, although two applicant nodes are connected to the same attribute node Address, the relationship categories of the two edges are has-HomeAddress-of and has-Compaddress-of. So it is necessary to add Multiple Edge-Types based Attention (MEA) to consider the cross-correlation between applicant nodes.
Fig. 6.3 The difference between the anti-fraud graph and the IMDB network. a The types of edges of the IMDB network are determined by the node pairs; b there are more types of edges in the anti-fraud graph, not determined only by the node pairs. For the same node pair, the edges can also be of a more fine-grained type
Given an applicant node $u \in U$ and a meta-path $\Upsilon$, we denote $N_u^{\Upsilon}$ as the meta-path based neighbors of node $u$ (including itself). Since the number of edge types is fixed, we use one-hot encoding for the edges between the applicant nodes and the attribute nodes. For example, given a pair of applicant nodes $u$ and $v$, there are two edges $e_u$ and $e_v$ connecting node $u$ and node $v$ to the same attribute node, and we denote $h_u^e, h_v^e$ as the one-hot embeddings of $e_u$ and $e_v$, respectively. Then the attention score $\alpha_{uv}^{\Upsilon}$ between nodes $u$ and $v$ is formulated as follows:
$$e_{uv}^{\Upsilon} = att(z_u, z_v, h_u^e, h_v^e), \quad \alpha_{uv}^{\Upsilon} = \frac{\exp(e_{uv}^{\Upsilon})}{\sum_{k\in N_u^{\Upsilon}}\exp(e_{uk}^{\Upsilon})}, \qquad (6.12)$$
where the attention mechanism $att$ is calculated as:
$$att(z_u, z_v, h_u^e, h_v^e) = \mathrm{LeakyReLU}\left(a_{\Upsilon}^{T}\left[z_u \oplus z_v \oplus h_u^e \oplus h_v^e\right]\right). \qquad (6.13)$$
Here $a_{\Upsilon}$ is a learnable vector. Now we can get the edge-types based embedding of node $u$, denoted as $z_u^{\Upsilon}$:
$$z_u^{\Upsilon} = \sigma\left(\sum_{v\in N_u^{\Upsilon}}\alpha_{uv}^{\Upsilon}\cdot z_v\right). \qquad (6.14)$$
To capture more information from different representation subspaces, we use multi-head attention, specifically, $K$ independent edge-types based attentions. Then we concatenate the $K$ embeddings as the edge-types based embedding:
$$z_u^{\Upsilon} = \mathop{\oplus}_{k=1}^{K}\sigma\left(\sum_{v\in N_u^{\Upsilon}}\alpha_{uv}^{\Upsilon}\cdot z_v\right). \qquad (6.15)$$
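A compact PyTorch sketch (an illustration with assumed dimensions, not the authors' code) of the edge-type-aware attention in Eqs. (6.12)-(6.14) for a single head: node embeddings and the two one-hot edge encodings are concatenated, scored by a learnable vector, normalized over the meta-path neighbors, and used to aggregate the neighbor embeddings ($\sigma$ is taken to be a sigmoid here).

import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeTypeAttention(nn.Module):
    def __init__(self, node_dim: int, num_edge_types: int):
        super().__init__()
        # a_Upsilon in Eq. (6.13): scores [z_u (+) z_v (+) h_u^e (+) h_v^e]
        self.a = nn.Linear(2 * node_dim + 2 * num_edge_types, 1, bias=False)
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, z_u, z_neighbors, e_u, e_neighbors):
        # z_u: (node_dim,), z_neighbors: (N, node_dim)
        # e_u: (num_edge_types,), e_neighbors: (N, num_edge_types) one-hot rows
        n = z_neighbors.size(0)
        cat = torch.cat([z_u.expand(n, -1), z_neighbors,
                         e_u.expand(n, -1), e_neighbors], dim=1)
        scores = self.leaky_relu(self.a(cat)).squeeze(-1)  # e_uv, Eq. (6.12)/(6.13)
        alpha = F.softmax(scores, dim=0)                   # alpha_uv, Eq. (6.12)
        aggregated = (alpha.unsqueeze(-1) * z_neighbors).sum(dim=0)
        return torch.sigmoid(aggregated)                   # Eq. (6.14)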
6.3.3.4 Aggregating Edge-Types Based Embedding
For a given set of meta-paths $\{\Upsilon_1, \Upsilon_2, \cdots, \Upsilon_q\}$, we have a corresponding set of embeddings $\{z_u^{\Upsilon_1}, z_u^{\Upsilon_2}, \cdots, z_u^{\Upsilon_q}\}$. The final applicant node embedding is represented by this set of edge-types based embeddings and is calculated as:
$$z_u = \frac{1}{q}\sum_{i=1}^{q}\sigma\left(W\cdot z_u^{\Upsilon_i} + b\right). \qquad (6.16)$$
6.3.4 Loss Function

In traditional graph node classification tasks, the most commonly used loss function is cross entropy. But in the field of online lending anti-fraud, our data distribution is extremely uneven: fraud data account for only a small part of all data. Therefore, the original cross-entropy function is no longer suitable. For all labeled nodes, we use the Focal Loss [46] to address the imbalance problem:
$$a_t = \begin{cases} a & \text{if } y = 1, \\ 1 - a & \text{otherwise,} \end{cases} \qquad (6.17)$$
$$p_t = \begin{cases} p & \text{if } y = 1, \\ 1 - p & \text{otherwise,} \end{cases} \qquad (6.18)$$
$$p = \sigma(\Theta\cdot z_u), \qquad (6.19)$$
$$Loss = -a_t(1 - p_t)^{\gamma}\log(p_t), \qquad (6.20)$$
where $\Theta$ is a learnable parameter, $a \in [0, 1]$ is a weighting factor, $p \in [0, 1]$ is the model's estimated probability for the class with label $y = 1$, and $\gamma$ is a focusing parameter.
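A minimal PyTorch implementation of the binary focal loss in Eqs. (6.17)-(6.20) (a generic sketch, not the authors' exact code); the defaults a = 0.2 and gamma = 2 follow the hyperparameters reported in Sect. 6.4.3.

import torch

def binary_focal_loss(logits: torch.Tensor, labels: torch.Tensor,
                      a: float = 0.2, gamma: float = 2.0) -> torch.Tensor:
    # logits: raw scores Theta . z_u; labels: 0/1 tensor of the same shape.
    p = torch.sigmoid(logits)                       # Eq. (6.19)
    p_t = torch.where(labels == 1, p, 1 - p)        # Eq. (6.18)
    a_t = torch.where(labels == 1,
                      torch.full_like(p, a),
                      torch.full_like(p, 1 - a))    # Eq. (6.17)
    loss = -a_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))  # Eq. (6.20)
    return loss.mean()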
6.4 Risk Assessment and Analysis

6.4.1 Datasets and Evaluation Metrics

6.4.1.1 Datasets
Our dataset comes from a large-scale Internet financial lending platform. We select all the data from January 1, 2017 to July 31, 2017. After excluding isolated samples, there are a total of 86,019 application records. In order to avoid the problem of time crossing, we select the records from January 1 to June 30, 2017 as the training set, and the data from July 1 to July 31, 2017 as the test set. The division of the datasets is shown in Table 6.2. The heterogeneous graph built with these data contains 9 types of nodes (1 type of applicant node and 8 types of attribute nodes, as shown in Table 6.3) and 19 types of edges (as shown in Table 6.4). Restricted by the privacy protection
Table 6.2 The division of datasets

Dataset    Legitimate   Fraudulent   Unlabeled   Total
Training   7751         1249         61568       70568
Valid      1787         206          0           1993
Test       1692         219          11547       13548
Table 6.3 The selected nodes in application transactions

Nodes        Attribute                  Description
Application  Name, Time                 The identifier of a transaction. We extract the applicant's name and the time of the transaction as attributes
ADDR         Province, City, District   The detailed address. We extract three levels of administrative areas as attribute description
CONAME       None                       The name of the company where the online loan applicant works
IDNO         Name                       The identity card number. Use holder's name as an attribute
DL           None                       The driving license plate number of the applicant
VIN          None                       The vehicle identification number of the loan applicant
ENGINE       None                       The engine number of the loan applicant
MP           Name                       The mobile phone number. Use holder's name as an attribute
TEL          Name                       The telephone number. Use holder's name as an attribute
policy, the characteristics of the attribute nodes are all string types, and the ID number and phone fields are all desensitized. In order to facilitate the storage of large-scale network structures, we choose Neo4j graph database to store heterogeneous graphs.
6.4.1.2 Metrics
There are many evaluation metrics for general binary classification problems, such as AUC, F-measure, and so on. However, the objective is not only to predict more fraud records, but also to reduce prediction errors as much as possible in an unbalanced data scenario such as anti-fraud. Therefore, we require a high recall rate and a low disturbance rate. In fraud prediction of OLSs, we use the KS value [47] as our main evaluation metric. The KS value is a de facto standard in the field of loan anti-fraud [6]. The KS value is the maximum difference between the recall rate and the disturbance rate under different thresholds. The larger the KS value is, the better the model can distinguish fraudulent applications from normal applications. Usually, the KS value can tolerate relatively high false positive rates.

Table 6.4 The extracted edges in loan transactions

Edges        Description
R_ADDR       The loan applicant's address
R_CO         The mobile number of the applicant's colleagues
R_CO_ADDR    The address of the applicant's company
R_CO_NAME    The loan applicant's company name
R_CO_TEL     The telephone number of the applicant's company
R_CPADDR     The loan applicant's residential address
R_CRMP       The mobile phone number of the applicant's general relatives
R_FCMP       The mobile phone number of the applicant's friends
R_IDNO       The loan applicant's license plate number
R_DL         The loan applicant's driving license plate number
R_VIN        The vehicle identification number of loan applicant's car
R_ENGINE     The engine number of loan applicant's car
R_LRMP       The mobile phone number of the applicant's immediate family members
R_MATEIDNO   The identity card number of the applicant's spouse
R_MP         The loan applicant's mobile phone number
R_OCMP       The mobile phone number of the applicant's other contacts
R_TEL        The loan applicant's telephone number
R_REGADDR    The loan applicant's native place
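Since the KS value is defined above as the maximum gap between the recall (true positive) rate and the disturbance (false positive) rate over all thresholds, it can be computed directly from model scores; the following generic scikit-learn sketch is not tied to the book's pipeline.

import numpy as np
from sklearn.metrics import roc_curve

def ks_value(y_true, y_score):
    # fpr plays the role of the disturbance rate, tpr the recall rate.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.max(tpr - fpr))

# Example: ks_value([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]) -> 0.5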
6.4.2 Baseline Methods

We use seven baseline methods as follows:
• XGBoost [48]: a model that has been widely used and has achieved good results in industry recently. We use the embedding of the applicant node after aggregating its neighbor attribute nodes as the input of the model.
• ASNE+MLP [49]: an attributed graph embedding model which can simultaneously learn structural information and attribute information. We use an MLP as the subsequent classifier.
• GCN [30]: a model that propagates the information of neighbor nodes to the node itself through the convolution operation. We test all the meta-paths for GCN and report the best performance.
• GAT [31]: an improved method which uses the attention mechanism. We test all the meta-paths for GAT and report the best performance.
• SemiGNN [6]: Similar to HAN, it uses a two-level attention mechanism. But in the design of the loss function, it encourages nearby nodes to have similar representations for solving the semi-supervised problem.
• HetGNN [32]: It uses a strategy of random walk with restart for node sampling instead of setting meta-paths.
• S2GNN: It is a reduced version of S3GNN that removes Graph-Oriented Snorkel.
6.4.3 Implementation Details

For our S3GNN, the embedding dimensions of FastText [42] and the edge-types based attention are set to 256 and 128, respectively. The number of attention heads K is 4. We optimize S3GNN with Adam, and the learning rate and dropout rate are set to 0.01 and 0.35, respectively. The number of meta-paths is 8. The length of the Bi-LSTM is variable because the number of neighbor attribute nodes of an applicant node is not fixed [32]. The weighting factor a and the focusing parameter γ are set to 0.2 and 2. We use early stopping with a patience of 50. Our experiments are conducted on Windows Server 2012 with an Intel Xeon E5-2640 v4 and 128 GB of RAM.
6.4.4 Performance Comparison

The results are shown in Table 6.5 and the KS curves are shown in Fig. 6.4. From them, we can draw the following analysis:
• The performance of our proposed method S3GNN is better than that of the other methods, which proves the superiority of our model. In the second stage of the online lending flow, the user data we collect is quite preliminary, and a KS value of 0.482 is already a pretty good performance, especially in the absence of labels and with only six months of data.1
• SemiGNN achieves the highest recall rate, which demonstrates the effectiveness of its loss function design; but it also brings a higher disturbance rate, which may be because the embeddings of normal nodes adjacent to fraud nodes are similar to the embeddings of fraud nodes. The better performance of S2GNN compared with HetGNN proves the necessity of considering the unbalanced data distribution.
• The performance of GCN and GAT is not good, which indicates that some relational information is lost after the heterogeneous graph is converted into a homogeneous
1 For a real-world online lending platform, according to the de facto standards, the ex-ante fraud prediction can be adopted as a qualified one if it has a KS value that is not much less than 0.5. But this standard is achieved when the platform has a considerable number of fraud samples. The lending platform that provided us with the data has accumulated seven years of data labels to achieve the goal of exceeding a KS value of 0.5.
S+ASNE+MLPDFS
14.7
26.9
S+XGBoost-DFS
Disturbance
Metrics
14.4
ASNE+MLP
26.3
XGBoost
Disturbance
Metrics
11.3
Disturbance
12.7
51.5
39.2
38.6
49.9
KS
12.1
S+ASNE+MLP
11.8
S+XGBoost
Disturbance
Metrics
Recall
37.7
49.8
34.5
46.3
KS
Recall
24.6
39.3
13.2
39.5
KS
Recall
24.3
38.7
12.5
39.4
KS
ASNE+MLP-DFS
XGBoost-DFS
Recall
Metrics
12.8
55.3
42.5
S+GCN
13.0
52.1
39.1
GCN
19.1
41.7
22.6
S+GCN-DFS
20.8
41.1
20.3
GCN-DFS
13.5
58.4
44.9
S+GAT
13.2
54.8
41.6
GAT
21.4
43.5
22.1
S+GAT-DFS
21.5
42.7
21.2
GAT-DFS
11.5
58.9
47.4
S+HetGNN
10.6
55.2
44.6
HetGNN
24.1
47.1
23.0
S+HetGNN-DFS
22.6
44.9
22.3
HetGNN-DFS
16.9
64.7
47.8
S+SemiGNN
19.3
64.5
45.2
SemiGNN
21.5
49.4
27.9
S+SemiGNN-DFS
21.9
48.3
26.4
SemiGNN-DFS
15.7
61.9
46.2
S3GNN-DFS
12.0
56.5
44.5
S2GNN-MEA
19.5
46.3
26.8
S3GNN-DFSMEA
21.1
44.2
23.1
S2GNN-DFSMEA
14.8
63.0
48.2
S3GNN
11.7
57.1
45.4
S2GNN
18.8
49.2
30.4
S3GNN-DFS
18.3
46.6
28.3
S2GNN-DFS
Table 6.5 Quantitative results (%). S+Baseline is the combined version of the baseline method and GOS, Baseline-DFS is the version of the baseline method with DFS removed (it does not include the input features of DFS), and Baseline-MEA is the version of the baseline method without Multiple Edge-Types based Attention
Fig. 6.4 The KS curves of S3GNN and baseline methods
graph. The comprehensive utilization of node information on multiple meta-paths is still an effective method.
• ASNE, as a traditional graph representation learning method, has lower performance than the graph neural network methods. This proves the importance of obtaining information from neighbor nodes.
• XGBoost has the worst performance, probably because of the lack of enough label information and its poor processing of attribute embeddings.
6.4.5 Ablation Study

• GOS: Our method mainly uses GOS to strengthen the labels of the data. In order to explore how GOS affects the performance of the model, we conduct ablation experiments. First, we explore the impact of not using GOS on the performance of the model. As shown in the second part of Table 6.5, the KS values of the S+ baselines are all improved. The TPR/FPR curves of S2GNN and S3GNN are shown in Fig. 6.5a, b. The KS value of S2GNN is 2.8% lower than that of
Fig. 6.5 The TPR/FPR curves of S2GNN, S3GNN-noGC and S3GNN
Fig. 6.6 The KS curves of S2GNN, S3GNN-noGC and S3GNN
Fig. 6.7 Correlation heatmap between partially constructed features
S3GNN, which proves that the introduction of GOS brings better performance to the model.
• GC: Since graph constraints (GC) are added to the generative model on top of the initial Snorkel, we explore the impact of the graph constraints on model performance. As shown in Figs. 6.5c and 6.6, the KS value of S3GNN-noGC is only 1.2% higher than that of S2GNN. This proves that the graph constraints better conform to the characteristics of the label distribution in the graph.
• DFS: In order to explore the influence of the manual features and the deep features generated from them, we design a comparative experiment. In the feature generation stage, we construct 2,793 features, and 222 features are left after feature selection by the density clustering algorithm. Figure 6.7 shows the heat map of the correlation between 17 deep features constructed by the Deep Feature Synthesis algorithm. The greener the color, the higher the correlation. We use density clustering to cluster the features corresponding to the green color into one cluster, which will be retained for subsequent tasks. From the results in Table 6.5, it can be seen that after removing the features generated by the DFS algorithm, the performance of all models declines sharply. This proves that in the anti-fraud task we cannot rely solely on the attribute information of the graph
nodes. The graph relationship features and the deep features generated from them are also of considerable importance.
• MEA: In order to explore the impact of the Multiple Edge-Types based Attention on model performance, we configure the model without MEA. We find that the model suffers performance degradation of varying degrees after removing MEA, which shows that we cannot ignore the multiple edge types between node pairs in the heterogeneous graph; at least, they cannot simply be regarded as the same type of edge.
6.4.6 Parameter Sensitivity

Figure 6.8c illustrates the impact of the number of labeling functions on performance. It can be observed that as the number of labeling functions increases, the performance gain quickly encounters a bottleneck. This is because the data covered by the labeling functions we design later are highly overlapped with those of the previous labeling functions and therefore cannot bring further performance improvement. If we could obtain more expert knowledge from the lending platform, the performance would be even better. We also conduct experiments on other hyperparameters, and the results are shown in Fig. 6.9. We can draw the following conclusions:
• The embedding dimension needs to be set reasonably. In this experiment, the best performance is achieved when the dimension is 128. But when the dimension further increases, the performance begins to drop sharply, since an overly large dimension leads to overfitting.
• The model achieves the best performance when the number of attention heads is 4. An excessive number of attention heads will also weaken the performance of the model. An appropriate number of attention heads is enough to capture different semantic information and keep the model stable.
Fig. 6.8 a Performance of S3GNN on different proportions of rough labels from Snorkel. b Performance of S2GNN on different proportions of true labels. c Performance of S3GNN on different numbers of labeling functions
Fig. 6.9 Parameter sensitivity of S3GNN w.r.t. the dimension of the edge-type based embedding $z_u^{\Upsilon}$ and the number of attention heads $K$
6.5 Conclusion

This work proposes a Snorkel-based semi-supervised graph neural network model for fraud prediction in online credit loan services (OCLSs). In the case that existing GNNs have reached their performance ceiling by mining only the internal information of the graph, we utilize a Graph-Oriented Snorkel to absorb external expert knowledge and raise the performance ceiling in scenarios with few labels. Then we design a heterogeneous graph neural network based on the attention mechanism. In particular, when calculating the attention between applicant nodes, we consider the impact of different types of edges on the attention. Our method achieves better experimental results than the baseline methods. In particular, our experiments prove the importance of manual features and deep features. For future work, we will further focus on the temporal evolution of graphs in OCLSs.
References

1. Y. Wang, W. Wang, J. Wang et al., J. Financ. Risk Manage. 6(01), 48 (2017)
2. K. Ren, A. Malik, in ACM WSDM 2019 (2019), pp. 510–518
3. J. Tang, J. Yin, in ICMLC 2005, vol. 6 (2005), pp. 3453–3457
4. P. Ravisankar, V. Ravi, G.R. Rao, I. Bose, Decis. Support Syst. 50(2), 491 (2011)
5. S. Bhattacharyya, S. Jha, K. Tharakunnel, J.C. Westland, Decis. Support Syst. 50(3), 602 (2011)
6. D. Wang, J. Lin, P. Cui, Q. Jia, Z. Wang, Y. Fang, Q. Yu, J. Zhou, S. Yang, Y. Qi, in IEEE ICDM 2019 (2019), pp. 598–607
7. P. Cui, X. Wang, J. Pei, W. Zhu, IEEE Trans. Knowl. Data Eng. 31(5), 833 (2018)
8. J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, M. Sun, arXiv:1812.08434 (2018)
9. D. Cheng, Z. Niu, Y. Tu, L. Zhang, in IEEE ICPR 2018 (2018), pp. 361–366
10. W. Hamilton, Z. Ying, J. Leskovec, in Advances in Neural Information Processing Systems (2017), pp. 1024–1034
11. Y. Dou, Z. Liu, L. Sun, Y. Deng, H. Peng, P.S. Yu, in Proceedings of the 29th ACM International Conference on Information and Knowledge Management (2020), pp. 315–324 12. Y. Liu, X. Ao, Q. Zhong, J. Feng, J. Tang, Q. He, in Proceedings of the 29th ACM International Conference on Information and Knowledge Management (2020), pp. 2125–2128 13. B. Xu, H. Shen, B. Sun, R. An, Q. Cao, X. Cheng, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35 (2021), pp. 4537–4545 14. Y. Liu, X. Ao, Z. Qin, J. Chi, J. Feng, H. Yang, Q. He, in Proceedings of the Web Conference 2021 (2021), pp. 3168–3177 15. K. Sun, Z. Lin, Z. Zhu, in AAAI 2020 (2020), pp. 5892–5899 16. B. Settles, Active learning literature survey, Technical report (University of Wisconsin-Madison Department of Computer Sciences, 2009) 17. F. Hayes-Roth, Commun. ACM 28(9), 921 (1985) 18. A. Ratner, S.H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré, in VLDB 2017, vol. 11 (2017), p. 269 19. S. Lee, J. Kim, IEEE Trans. Depend. Secure Comput. 10(3), 183 (2013) 20. L. Tong, B. Li, C. Hajaj, C. Xiao, N. Zhang, Y. Vorobeychik, in 28th .{USENIX.} Security Symposium (.{USENIX.} Security 19) (2019), pp. 285–302 21. D. Babaev, M. Savchenko, A. Tuzhilin, D. Umerenkov, in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2019), pp. 2183–2190 22. G. Suarez-Tangil, M. Edwards, C. Peersman, G. Stringhini, A. Rashid, M. Whitty, IEEE Trans. Inf. Foren. Secur. 15, 1128 (2019) 23. C. Wang, H. Zhu, IEEE Trans. Depend. Secure Comput. (2020). https://doi.org/10.1109/TDSC. 2020.2991872 24. X. Li, S. Liu, Z. Li, X. Han, C. Shi, B. Hooi, H. Huang, X. Cheng, in AAAI (2020), pp. 4731–4738 25. C. Liang, Z. Liu, B. Liu, J. Zhou, X. Li, S. Yang, Y. Qi, in Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (2019), pp. 1181–1184 26. D. Cheng, Y. Zhang, F. Yang, Y. Tu, Z. Niu, L. Zhang, in Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019), pp. 2547–2555 27. Z. Liu, C. Chen, X. Yang, J. Zhou, X. Li, L. Song, in ACM CIKM (2018), pp. 2077–2085 28. X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, P.S. Yu, in WWW 2019 (2019), pp. 2022–2032 29. J. Bruna, W. Zaremba, A. Szlam, Y. LeCun, arXiv:1312.6203 (2013) 30. T.N. Kipf, M. Welling, arXiv:1609.02907 (2016) 31. P. Veliˇckovi´c, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, arXiv:1710.10903 (2017) 32. C. Zhang, D. Song, C. Huang, A. Swami, N.V. Chawla, in ACM KDD 2019 (2019), pp. 793–803 33. H. Hong, H. Guo, Y. Lin, X. Yang, Z. Li, J. Ye, in AAAI 2020 (2020), pp. 4132–4139 34. S. Yun, M. Jeong, R. Kim, J. Kang, H.J. Kim, in Advances in Neural Information Processing Systems (2019), pp. 11,983–11,993 35. S.X. Rao, S. Zhang, Z. Han, Z. Zhang, W. Min, Z. Chen, Y. Shan, Y. Zhao, C. Zhang, arXiv preprint arXiv:2011.12193 (2020) 36. Z.H. Zhou, Natl. Sci. Rev. 5(1), 44 (2018) 37. O. Chapelle, B. Scholkopf, A. Zien, IEEE Trans. Neural Netw. 20(3), 542 (2009) 38. W. Sun, M. Chen, J.x. Ye, Y. Zhang, C.z. Xu, Y. Zhang, Y. Wang, W. Wu, P. Zhang, F. Qu, in ICPS 2019 (2019), pp. 635–640 39. T. Hu, Q. Guo, X. Shen, H. Sun, R. Wu, H. Xi, IEEE Trans. Neural Netw. Learn. Syst. 30(11), 3287 (2019) 40. X. Zhu, Z. Ghahramani, J.D. Lafferty, in ICML 2003 (2003), pp. 912–919 41. J.M. Kanter, K. Veeramachaneni, in 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (IEEE, 2015), pp. 1–10 42. P. Bojanowski, E. Grave, A. Joulin, T. 
Mikolov, Trans. Assoc. Comput. Linguist. 5, 135 (2017)
43. B. Perozzi, R. Al-Rfou, S. Skiena, in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2014), pp. 701–710 44. S. Hochreiter, J. Schmidhuber, Neural Comput. 9(8), 1735 (1997) 45. W. Chen, Y. Gu, Z. Ren, X. He, H. Xie, T. Guo, D. Yin, Y. Zhang, in IJCAI 2019, vol. 19 (2019), pp. 2116–2122 46. T.Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, in ICCV 2017 (2017), pp. 2980–2988 47. J.H. Friedman, IEEE Trans. Comput. 26(4), 404 (1977) 48. T. Chen, C. Guestrin, in ACM KDD 2016 (2016), pp. 785–794 49. L. Liao, X. He, H. Zhang, T.S. Chua, IEEE Trans. Knowl. Data Eng. 30(12), 2257 (2018)
Chapter 7
Enhancing Association Utility: Dedicated Knowledge Graph
7.1 Gang Fraud Prediction System Based on Knowledge Graph

Online lending services (OLSs) are becoming more popular because of their convenience [1]. Meanwhile, online lending fraud has gradually emerged. Gang fraud, as one of the typical cases of online lending fraud, often causes large losses to lending companies. Generally, loan applications require a long feedback period: lending companies often judge applications based on whether applicants repay on time. The disadvantage of this approach is that companies have already suffered losses by the time fraudulent applications are identified. An effective risk control system that can predict gang fraud is therefore critical to the proper functioning of online lending companies. Such a system can be built based on different anti-fraud methods, mainly including manual verification, expert rules, data analysis rules, and machine learning models [2]. In recent years, machine learning models, such as Support Vector Machines (SVM, [3]) and Random Forests [4], have become a mainstream type of anti-fraud method. As a matter of fact, these models rely on valid features and large amounts of labeled data. However, in the online lending scenario, there are some thorny problems that undermine the feasibility of these learning algorithms. The first one is the deficiency of information associations. The features of gang fraud in OLSs are mostly derived from associations among different pieces of information of applicants from fraud gangs [5–7]. Unfortunately, mining useful associations is extremely difficult, since they are usually sparse and insufficient due to low-quality data, i.e., very preliminary and coarse applicant information. Taking a real-world lending dataset with 2.3 million loan applications (detailed information in Sect. 7.4) as an example, there are only some basic information fields, and they have non-negligible deficiency rates. The second one is the weakness of data labels [8]. Accurate fraud labels are usually hard to obtain in OLSs due to the special business pattern, where a fraudulent loan can only be finally determined after periods of abnormal repayment. For instance, the labeled data account for only 13.39% of the above lending dataset. Hence, how to effectively
alleviate the aforementioned problems, which pose a huge gap between low-quality data and an efficient prediction model, is the key issue. In this work, we mainly address the above problems by enhancing the utility of associations (i.e., recovering missing associations and mining underlying associations) on a knowledge graph. A single knowledge graph stores and represents applicants' information and associations, but it cannot by itself solve the deficiency of useful information associations serving as gang fraud features when built from low-quality applicant data. Towards fraud prediction, the underlying relevances are hard to utilize directly, and the ambiguity of applicant information buries many critical relevances, especially unstructured textual address-related information. Hence, we mainly improve the data quality, i.e., enhance the utility of associations, based on a knowledge graph about applicant information, by devising two dedicated approaches as follows:

Recovering Missing Associations. We propose an efficient method of Chinese address disambiguation to enhance the relational graph. The method can recover the missing edges/relationships through Chinese address disambiguation. Chinese addresses usually differ from English addresses, which have obvious delimiters. If only string matching is adopted, some relationships in the knowledge graph will be lost, resulting in a very sparse knowledge graph. We design a disambiguation method for Chinese addresses, which can link multiple applications with similar addresses to fill in the missing relationships. More specifically, based on specially designed address trees, we perform address disambiguation in two steps: address element classification and hierarchical address matching. Particularly, to prevent the loss of address elements, we first encode each address and then propose an address element supplement scheme.

Mining Underlying Associations. We design an adaptive partial graph representation method, called the Adaptive Connected Component Embedding Simplification Scheme (ACCESS). Inspired by the graph representation learning scheme, we adopt network embedding algorithms to mine underlying relevances. ACCESS can adaptively implement embedding operations for different connected components depending on their sizes, in order to simplify the computation complexity of network embedding.

Thereafter, we adopt graph clustering algorithms to realize qualified gang fraud prediction by using the enhanced relational graph data in the case of weakness of data labels [7, 9, 10]. After the graph clustering, we can judge a community as abnormal based on whether the ratio of normal data to abnormal data in the community exceeds a certain threshold. Notably, such a judgment would need to traverse the entire graph each time. So we further design a HashMap-based structure [11] to save the traversal time. When each node in the graph is added to its own community, the statistics of the community ID are automatically updated, and after the clustering operation is completed, whether each community is abnormal can be obtained directly. The data structure can then retrieve the basic situation of each community with a time complexity of O(1) and obtain the statistical information of a specified community ID. Moreover, we propose a framework of gang fraud detection called RMCP by integrating the above techniques.
It consists of four steps, i.e., Recovering, Mining, Clustering, and Predicting, to automatically recover missing edges, mine gang fraud features, cluster loan applications, and efficiently predict gang frauds in OLSs.
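A minimal Python sketch (an illustration, not the book's implementation) of the HashMap-based bookkeeping described above: statistics are updated as each node joins its community, so querying a community after clustering is O(1). The field names, the threshold, and the abnormality criterion used here are assumptions that simplify the ratio test described in the text.

from collections import defaultdict

FRAUD_RATIO_THRESHOLD = 0.5   # assumed threshold

community_stats = defaultdict(lambda: {"total": 0, "labeled_fraud": 0, "labeled_normal": 0})

def add_node_to_community(community_id, label):
    # label: "fraud", "normal", or None for unlabeled nodes.
    stats = community_stats[community_id]
    stats["total"] += 1
    if label == "fraud":
        stats["labeled_fraud"] += 1
    elif label == "normal":
        stats["labeled_normal"] += 1

def is_abnormal(community_id):
    # O(1) lookup once clustering has finished.
    stats = community_stats[community_id]
    labeled = stats["labeled_fraud"] + stats["labeled_normal"]
    return labeled > 0 and stats["labeled_fraud"] / labeled >= FRAUD_RATIO_THRESHOLD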
We design the efficient methods under the RMCP framework by selectively combining network embedding algorithms and clustering algorithms. Extensive experiments on a real-world dataset have fully validated the effectiveness of our method. In the experiments, we compare the performance (KS values1) of various network embedding algorithms and unsupervised learning algorithms. We also compare them with some state-of-the-art classifiers based on supervised learning algorithms, e.g., XGBoost [14] and Random Forest [4]. Our main contributions can be summarized as follows:
• We propose to tackle the online lending gang fraud prediction problem with very preliminary and coarse applicant information by enhancing the utility of associations (i.e., recovering missing associations and mining underlying associations) on a knowledge graph.
• We design a Chinese address disambiguation method to recover the critical relevances and a novel association representation method, called the Adaptive Connected Component Embedding Simplification Scheme (ACCESS), to mine the implicit associations. Together, we introduce graph clustering algorithms and devise predicting schemes based on the enhanced association representations to predict gang fraud in the case of weakness of data labels.
• We implement our framework RMCP on a real-world online lending dataset. It is validated that our work significantly outperforms the state-of-the-art competitors in terms of the representative metric in online lending services. As a byproduct, we also design a visual decision support system named LongArms over the framework RMCP, which makes it convenient to view the clustering performance visually and provides dynamic statistics of basic fraud information.
7.2 Related Work

There have been a lot of efforts devoted to anti-fraud methods [15] in Internet finance. The traditional methods are rule-based expert systems [16–18]. In some simple and regular anomaly detection scenarios, expert systems are widely used due to their low implementation difficulty and good performance. Meanwhile, data integrity is one of the major disadvantages of expert systems, since the environment is constantly changing and the system has to be updated manually. This shortcoming of expert systems makes it impossible to adaptively detect new abnormal patterns, resulting in a certain degree of lag.
1 The Kolmogorov-Smirnov (KS, [12]) value is one of the typical performance metrics for evaluating risk control models, and is the most important one for fraud prediction in online lending services (OLSs). Generally, for real-world OLSs, according to the de facto standards, the ex-ante fraud prediction can be adopted as a qualified one if it has a KS value that is not less than 0.5 and an average processing latency within 10 seconds. This partly reflects the difficulty of the fraud prediction. Usually, it can tolerate relatively high false positive rates [13].
The excellent performance of machine learning algorithms have led researchers to explore the use of machine learning algorithms to build fraud detection platforms [19–21]. Yang et al. [22] designed a novel factor-graph based model to distinguish fraudsters, which can achieve a detection performance improvement on a large-scale mobile network by disclosing how fraudsters and non-fraudsters behave differently in mobile network. Han et al. [23] proposed a new information utilization method to solve credit card fraud detection problems, which can assist any available intelligent optimizers to improve their performance in solving multi-objective optimization problems. Cao et al. [24] designed a two-level attention model to capture the samplelevel information and feature-level information by integrating to data embeddings, which improves the quality of data representation to achieve better performance. Jurgovsky et al. [25] phrased the fraud detection problem as a sequence classification task and employed Long Short-Term Memory (LSTM) networks to incorporate transaction sequences. Liang et al. [26] developed an automated solution for fraud detection based on graph learning algorithms to separate fraudsters from regular customers and uncover groups of organized fraudsters. Zhang et al. [27] developed a fraud detection system that employs a deep learning architecture together with an advanced feature engineering process based on homogeneity-oriented behavior analysis (HOBA). Recently, an advanced method was proposed in [28], where many improvements are obtained by introducing the distributed version of the deep forest model. Particularly, the designed cost-based strategy performs well in handling extra-imbalanced data for cash-out fraud detection. All these studies are based on supervised learning algorithms which have some significant advantages over expert systems. However, their performances depend on the quality of labels. For anti-fraud detection in Internet finance, especially in online lending services, the quality of labels is usually insufficient, e.g., incomplete, inexact, and inaccuracy, for supervised learning algorithms. Semi-supervised and unsupervised learning algorithms aim to solve the weakness of data labels, and are thus adopted [29–34]. When relational features play a key role in fraud detection for a specific scenario, e.g., online lending services, the original data are usually represented by relational graph or knowledge graph [35]. Carcillo et al. [33] proposed an integration method of unsupervised techniques with supervised credit card fraud detection classifiers, which is the implementation and assessment of different levels of granularity for the definition of an outlier score. Veronique et al. [36] designed a fraud detection system for credit card transactions combining Recency Frequency Monetary (RFM) and the network of credit card holders and merchants. They proposed APATE, a novel approach to detect fraudulent credit card transactions conducted in online stores. Li et al. [37] proposed a semi-supervised learning algorithm that automatically tunes the prior and parameters in Markov random field while inferring labels for every node in the graph. A state-of-the-art approach was proposed in [38], where the correlation network is adopted to effectively detect cheating behavior hidden in business distribution channels. 
Two novel graph-cut methods were designed to convert the correlation network into a bipartite graph to rank cheating partners, which helps remove false-positive correlation pairs simultaneously. In particular, the proposed framework can rank nodes inside the system according to a specified ranking measure, which opens a wide application area of correlation network analysis and motivates more interesting applications [39]. The feasibility and effectiveness of most existing graph-based methods depend on the density of the generated graphs. For gang fraud prediction in online lending services, the only usable data are the preliminary and coarse applicant information. Such low-quality data cause sparsity in the generated graph, and hence a deficiency of information associations. Different from the existing works, our study in this work focuses on how to solve this sparsity.
7.3 Recovering-Mining-Clustering-Predicting Framework

In this part, we present the Recovering-Mining-Clustering-Predicting (RMCP) framework (as shown in Fig. 7.1). Under this framework, we devise dedicated schemes corresponding to four modules. More specifically, for the recovering module, we first provide efficient Chinese address disambiguation methods to recover edges of the sparse knowledge graph. For the mining module, we design a novel association representation method, called the Adaptive Connected Component Embedding Simplification Scheme (ACCESS), which adaptively implements embedding for different connected components depending on their sizes, so as to simplify the online computational complexity of network embedding. Then, for the clustering and predicting modules, we introduce a HashMap structure that improves the online performance of fraud prediction in terms of both high accuracy and low latency by facilitating the storing and retrieving of the communities generated by different clustering algorithms.
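To make the division of labor among the four modules concrete, the sketch below outlines how an RMCP-style pipeline could be wired together. Every class, method, and parameter name here (recover_edges, embed_components, and so on) is a hypothetical placeholder rather than the book's implementation.

```python
class RMCPPipeline:
    """Minimal sketch of the four RMCP modules; components are injected so that
    different schemes can be swapped per module."""

    def __init__(self, recoverer, miner, clusterer, predictor):
        self.recoverer = recoverer   # e.g., Chinese address disambiguation
        self.miner = miner           # e.g., ACCESS per-component embedding
        self.clusterer = clusterer   # e.g., CFSFDP / LPA / Louvain
        self.predictor = predictor   # e.g., fraud-proportion rule over communities

    def run(self, sparse_graph, applications):
        graph = self.recoverer.recover_edges(sparse_graph, applications)   # Recovering
        embeddings = self.miner.embed_components(graph)                    # Mining
        communities = self.clusterer.cluster(embeddings)                   # Clustering
        return self.predictor.flag_abnormal(communities)                   # Predicting
```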
7.3.1 Recovering Missing Associations

To disambiguate addresses, we first obtain all address elements of each address through word segmentation. Then we devise a method based on hierarchical matching for grading the address elements. To prevent the loss of address elements as much as possible, we use address encoding to determine the priority with which addresses build the address tree, and we propose an address element supplement scheme to further prevent the loss of address elements.

Chinese Address Disambiguation Matching Methods. An address serves as a carrier of the location information of applicants. If handled properly, abnormal groups can be effectively detected or even predicted. The same fraud group often fills in similar addresses when submitting loan applications. However, it is often inappropriate to directly use string matching to determine whether two addresses are the same: it would make the constructed knowledge graph lose effective relationships and become even sparser. Therefore, methods are needed to recover these associations. At the same time, addresses are ambiguous: the same address is often expressed differently by different applicants.
Fig. 7.1 The proposed RMCP framework for addressing two challenging technical problems
Processing Chinese addresses is usually more difficult than processing English addresses. Compared with English addresses, Chinese addresses lack obvious delimiters, so Chinese address elements cannot be obtained through methods such as regular expressions. An address element can be understood as each part that makes up an address. For example, an address expressed in English as “Chang An Avenue, Dong Cheng District, Beijing” contains three address elements, namely Beijing, Dong Cheng District, and Chang An Avenue, whereas the same address in Chinese reads “Beijing Dong Cheng District Chang An Avenue”; there are no delimiters such as commas in Chinese addresses. To obtain address elements, we first process each address with a Chinese word segmentation method based on Conditional Random Fields (CRF) [40]. CRF combines the advantages of Maximum Entropy and Hidden Markov Models and has achieved good performance in word segmentation tasks in recent years. However, after word segmentation, some address elements are split into multiple meaningless elements, so custom rules are applied to combine the meaningless elements into complete address elements. Figure 7.2 gives an illustration of Chinese address preprocessing. First, a Chinese address is extracted from the dataset, e.g., “Building 1, Guanshan Lake, Bihai Garden, Guiyang, Guizhou”. Chinese word segmentation is performed on the address to obtain multiple address elements, but elements like “Building 1” are split into two meaningless elements, “Building” and “1”. We combine the two meaningless elements into a complete address element by custom rules, such as combining the number in the Chinese address with the address element that follows it.
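As one illustration of such custom combination rules, the sketch below merges a bare number token produced by segmentation with the address element that follows it; the function name and the English token examples are illustrative only, and a real system would operate on the Chinese tokens themselves.

```python
def merge_number_tokens(tokens):
    """Attach a standalone number token to the element that follows it,
    e.g. ['1', 'Building'] -> ['1Building'] (in Chinese, '1' + '栋' -> '1栋')."""
    merged, i = [], 0
    while i < len(tokens):
        if tokens[i].isdigit() and i + 1 < len(tokens):
            merged.append(tokens[i] + tokens[i + 1])   # number + following element
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Example with hypothetical segmenter output for the address in Fig. 7.2:
# merge_number_tokens(['Guizhou', 'Guiyang', 'Bihai Garden', 'Guanshan Lake', '1', 'Building'])
```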
Fig. 7.2 Chinese address preprocessing
Fig. 7.3 Hierarchy of address elements
Finally, the processed multi-level addresses are saved. It is simple and convenient to distinguish two addresses by checking whether their address elements are equal. However, this method cannot identify similar addresses. At the same time, the address elements obtained after word segmentation belong to different levels; for example, we cannot treat Guizhou Province and Guiyang City as same-level address elements, so it is necessary to grade the address elements. Based on our statistics, address elements are divided into 12 levels, with no redundancy or ambiguity between levels. Figure 7.3 shows the levels of address elements, where a lower number indicates a higher-level element. The expressions of higher-level address elements are more uniform, which reduces the possibility of ambiguity. Address elements of levels 0∼6 are compared through exact matching, because they are simpler to describe and have a unified representation. Address elements at the 7th level need fuzzy matching: on the one hand, there is no unified expression for address elements at this level; on the other hand, these elements are easily affected by the writing habits of applicants, e.g., abbreviations, so 7th-level elements of the same address may be described differently. For address elements at levels 8∼11, exact matching is still applied.
Address elements at these levels are the detailed address information of applicants, which can uniquely identify their locations; if fuzzy matching were used, applications with different addresses would be connected together. Based on the matching method of a Trie tree from the root node to the leaf nodes, we propose to construct address trees for disambiguation. Address trees are constructed according to the address element levels in Fig. 7.3, and each level corresponds to a level of nodes in the address tree; that is, the 0th-level elements are the root nodes of address trees. If the address elements of one level are missing, the nodes corresponding to that level of the address tree are deleted accordingly. Because of such missing address elements, it is impossible to judge whether two addresses are the same by comparing address elements one by one, so the matching method of the Trie tree cannot be applied directly to address trees. To overcome this deficiency, we propose a hierarchical matching method. First, we rank address elements according to the hierarchical structure. When an address is matched against an address tree, only elements of the same level are compared. Figure 7.4 illustrates the process of hierarchical matching, and Fig. 7.5 shows examples of matching failures in different cases. In Fig. 7.4, the left side is an address tree and the right side is an address to be matched. When Beijing, as the root node, is successfully matched, the next pair of elements to be compared is not Tsinghua University and Haidian District but Tsinghua University and Tsinghua University: there is no address element named Haidian in the address tree, so the Haidian element is skipped during matching.
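A simplified sketch of this hierarchical, level-by-level comparison is given below; the dictionary representation, the specific level numbers in the example, and the containment-based fuzzy rule are assumptions made for illustration, not the book's exact procedure.

```python
def hierarchical_match(tree_levels, address_levels, fuzzy_level=7):
    """Compare two leveled addresses, skipping levels missing on either side.

    tree_levels / address_levels: dict mapping level (0-11) -> address element.
    Levels 0-6 and 8-11 are compared exactly; fuzzy_level uses a loose comparison.
    Returns True if every level present on both sides matches."""
    def fuzzy_equal(a, b):
        # Placeholder fuzzy rule: one string contains the other (e.g. abbreviations).
        return a in b or b in a

    matched_any = False
    for level in sorted(set(tree_levels) & set(address_levels)):
        a, b = tree_levels[level], address_levels[level]
        ok = fuzzy_equal(a, b) if level == fuzzy_level else a == b
        if not ok:
            return False
        matched_any = True
    return matched_any

# Example inspired by Fig. 7.4 (level numbers are illustrative): Haidian District
# exists only in the address, so its level is skipped instead of failing the match.
tree = {0: "Beijing", 8: "Tsinghua University"}
addr = {0: "Beijing", 2: "Haidian District", 8: "Tsinghua University"}
assert hierarchical_match(tree, addr)
```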
Fig. 7.4 Hierarchical matching process
Fig. 7.5 a Missing Haidian causes matching failure. b The Haidian element is missed due to the initial tree-creation order. c An address with more address elements but lower levels cannot create the address tree first, which causes element loss
Table 7.1 Use 5-bit binary encoding for two addresses

Address | Beijing | Haidian District | Default | Default | Peking University
Peking University, Beijing | 1 | 0 | 0 | 0 | 1
Peking University, Haidian District, Beijing | 1 | 1 | 0 | 0 | 1
Address Tree Construction Methods. The hierarchical matching method solves the address matching problem, and the more complete the constructed address trees are, the better the matching accuracy will be. Therefore, it is preferable to use addresses with more address elements to create address trees. After the address trees are created, an address element supplement scheme is required to recover missing address elements back into the address trees whenever addresses with fewer address elements were created first. To create address trees that contain as many address elements as possible, we first encode each address with a 12-bit code. According to the hierarchy of address elements, each bit of the code corresponds to one address element level: when the address element of a level exists, the corresponding bit is set to 1, otherwise it is set to 0. When all addresses have been encoded, they are sorted in descending order of encoded value. Table 7.1 shows “Peking University, Beijing” and “Peking University, Haidian District, Beijing” under a 5-bit encoding. If addresses were not encoded, “Peking University, Beijing” might be used first to create an address tree, causing the address element Haidian District to be lost. After encoding and sorting, “Peking University, Haidian District, Beijing” is given priority to create the address tree, thus making the created address tree more complete. Encoding and sorting can prevent the loss of address elements to some extent, but a problem remains when an address with more address elements, yet at lower levels, cannot create the address tree first, causing element loss, e.g., “SJTU, Minhang Area, Shanghai” versus “SJTU, No. 800, Dongchuan Road, Shanghai”. Since Minhang Area has a higher level than Dongchuan Road, its encoded value is larger, so “SJTU, Minhang Area, Shanghai” is created in the address tree first, and the address elements Dongchuan Road and No. 800 are lost. To solve this problem, we propose a scheme for supplementing address elements. When address elements fail to match, they are not discarded immediately but stored in a stack. If the two addresses are found to be the same by matching subsequent address elements, the address elements stored in the stack are added to the address tree in turn. Figure 7.6 shows the process of supplementing address elements. When the two address elements Dongchuan Road and No. 800 fail to match, they are not directly discarded but first stored in a stack; after the subsequent matching succeeds, the two address elements in the stack are used to supplement the missing nodes in the address tree.
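The encoding and sorting step can be made concrete with a short sketch; the bit layout, the helper names, and the toy level assignment below are illustrative assumptions rather than the book's code.

```python
def encode_address(levels, num_levels=12):
    """Encode a leveled address as an integer bitmask: bit i is 1 when an
    element exists at level i (level 0 is the most significant bit)."""
    code = 0
    for level in range(num_levels):
        code = (code << 1) | (1 if level in levels else 0)
    return code

def build_order(addresses, num_levels=12):
    """Sort addresses so that more complete ones create address trees first."""
    return sorted(addresses, key=lambda a: encode_address(a, num_levels), reverse=True)

# 5-bit toy example mirroring Table 7.1 (assumed levels: 0 Beijing,
# 1 Haidian District, 2-3 default, 4 Peking University):
pku_beijing = {0: "Beijing", 4: "Peking University"}                          # 10001
pku_haidian = {0: "Beijing", 1: "Haidian District", 4: "Peking University"}   # 11001
assert encode_address(pku_beijing, 5) == 0b10001
assert encode_address(pku_haidian, 5) == 0b11001
assert build_order([pku_beijing, pku_haidian], 5)[0] is pku_haidian
```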
Fig. 7.6 Supplement address elements scheme
7.3.2 Mining Underlying Associations

We first build a heterogeneous graph using the data fields. In addition to loan application nodes, it also includes a node for each valid feature value. For example, if the company names of two application nodes (denoted u and v) are the same, then u and v are both connected to the node of that company name. Next, a homogeneous derivative network is extracted from the constructed heterogeneous graph over the graph database. In the derivative network, u and v are directly connected, and the weight between them is set to 1; if u and v share both the same company name and the same applicant address, the weight of the relationship is set to 2. We use the Chinese address disambiguation methods to recover most of the missing edges in the knowledge graph. However, the weakness of data labels essentially reduces the feasibility of supervised learning algorithms, which is why we turn to graph clustering algorithms. Before implementing clustering algorithms on the derivative network, there are still two problems: the insufficiency of explicit associations and the intricacy of relational features. To solve both problems, we devise an effective network representation method based on multiple kinds of network embedding algorithms.

Categorizing Network Embedding Algorithms. Different combinations of network embedding algorithms can be selected in accordance with specific conditions. We summarize some commonly used network embedding algorithms and briefly discuss the advantages and disadvantages of each category. We categorize network embedding algorithms into the following three categories. Factorization-based algorithms [41] use matrices to represent connections between vertices and factorize the matrix to obtain embeddings. Random walk-based algorithms use random walks to cluster similar nodes in a graph; these algorithms are very effective when the graph is too large to be observed as a whole. With the development of deep learning technology, deep learning-based algorithms are used on graphs to reduce dimensionality by modeling non-linear structures in the data with deep autoencoders. The premise of using factorization-based algorithms is that we have a certain understanding of the dataset and can define the target function explicitly. One disadvantage of factorization-based algorithms is that they are not able to learn arbitrary functions. At the same time, because actual application scenarios are very complex, it is difficult to build one or several target functions that are
fully applicable to these scenarios. The advantage of this category of algorithms is that there is a clear objective function to optimize, and the embedding result is stable without large randomness. Random walk-based algorithms are particularly useful when the graph is too large to be evaluated as a whole. With these algorithms, a good network embedding effect can be obtained to some extent by changing the random walk parameters, which must be tuned and selected according to the specific application scenario. With the development of deep learning technology, many methods based on deep neural networks have been applied to network embedding. Deep autoencoders are used to reduce dimensionality because they can model nonlinear structures in data. Deep learning-based algorithms can model a wide range of functions following the universal approximation theorem and have shown great promise in capturing the inherent dynamics of the graph. However, these algorithms cannot exert their good performance on a sparse network; sufficient network information is the basis for them.

ACCESS: Adaptive Connected Component Embedding Simplification Scheme. According to the characteristics of the online lending scenario, we customize a dedicated algorithm called ACCESS under the proposed RMCP framework. We deploy the network embedding algorithms on each connected component, instead of the entire network, to mine useful association features. The feasibility of this method can be explained as follows. After address disambiguation recovers the missing edges of the network, there are still no edges between connected components. In this case, the probability of clustering two connected components into the same community is very low, so it is unnecessary to deploy the network embedding algorithm on the entire graph at the cost of reduced computational efficiency. The obtained benefits mainly include two aspects. The first is the improvement of computational efficiency: the cost of reviewing a newly submitted application is low, because when a new application is added to the network we only need to update the status of the component where the application is located, without updating the entire graph. The second is the reduction of processing latency: the low-latency requirements of online lending services can be met by running the simplified network embedding algorithms on each connected component in parallel. Although running network embedding algorithms on each connected component has obvious advantages, the number of nodes in each connected component often differs greatly due to the characteristics of online lending services. The embedding dimension is a critical parameter of network embedding algorithms, and since the number of nodes in each connected component cannot be determined in advance, the embedding dimension cannot be fixed either. In response to this problem, we propose two solutions. The first method is to omit the network embedding step and directly perform the subsequent clustering operations on a connected component when its number of nodes, denoted by V, is lower than a predefined threshold. When V is higher than the threshold but lower than the given embedding dimension, the embedding dimension of the connected component is changed to ⌈log2 V⌉. When V is greater than the embedding dimension, the embedding
dimension of the connected component remains unchanged. The second method is based on V: we set the embedding dimensions to ⌈V/2⌉, ⌈√V⌉, and ⌈log2 V⌉, respectively. Inspired by the characteristics of different categories of network embedding algorithms, we further propose a method to adaptively select the network embedding algorithm according to the characteristics of each connected component. Network embedding algorithms based on deep learning perform better when the network information is rich, while algorithms based on random walks are suitable for larger networks; unlike factorization-based algorithms, neither of them requires an explicit target function to be proposed and optimized. We want each connected component to select the corresponding network embedding algorithm according to its internal density. Based on the idea of modularity, we measure the density of a connected component as 2E/(V(V − 1)), where E represents the number of edges in the connected component; if there were edges between all pairs of nodes, the number of edges in the component would be V(V − 1)/2. When the density value of a connected component is greater than a predefined threshold, network embedding algorithms based on deep learning are used; otherwise, network embedding algorithms based on random walks are used. In this work, we choose GCN as the deep learning-based network embedding algorithm and node2vec as the random walk-based algorithm. High-order proximity embedding methods can mine more effective features, so as to preserve the hierarchical structure characteristics of the network to the greatest extent possible.
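The per-component decisions described above can be summarized in a short sketch; the threshold values, the base dimension, and the function name are illustrative assumptions, while the density formula and the GCN/node2vec split follow the description in the text.

```python
import math

def access_plan(num_nodes, num_edges, base_dim=64,
                min_nodes=8, density_threshold=0.3):
    """Decide, per connected component, whether to embed, with which embedding
    dimension, and with which family of embedding algorithm."""
    V, E = num_nodes, num_edges
    if V < min_nodes:                        # tiny component: skip embedding, cluster directly
        return {"embed": False}
    if V < base_dim:                         # shrink the dimension for small components
        dim = max(1, math.ceil(math.log2(V)))
    else:
        dim = base_dim
    density = 2 * E / (V * (V - 1))          # modularity-style density of the component
    algorithm = "GCN" if density > density_threshold else "node2vec"
    return {"embed": True, "dimension": dim, "algorithm": algorithm}

# Example: a dense component with 50 nodes and 500 edges
# -> {'embed': True, 'dimension': 6, 'algorithm': 'GCN'}
```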
7.3.3 Clustering and Predicting

After mining the underlying associations, clustering algorithms are implemented to obtain communities of similar nodes based on distance, and finally a community is predicted to be abnormal or not based on the influence of the abnormal nodes it contains. To describe these steps more clearly, we provide the following basic guidelines:
• Clustering. After using network embedding algorithms, a distance-based clustering algorithm is applied to the nodes of each connected component, based on the distance between them, to gather similar nodes into communities. The main reason for not using the number of hops between nodes as the distance is to reduce the storage space requirement: most edges of the sparse knowledge graph are missing, so storing hop distances would waste storage space on invalid data.
• Predicting. According to the proportion of fraud nodes in a community, we determine whether the community is abnormal. This method works automatically, and it can be replaced with different judgment methods according to specific scenarios. If nodes in the community are not labeled, the community can be marked as a gray community or a normal community according to specific requirements.
Table 7.2 List of cluster algorithms

Category | Method | Time complexity | Scalability | High dimensional data
Hierarchy | Newman [42] | O((E + V)V) | Yes | Yes
Hierarchy | BRICH [43] | O(V) | Yes | No
Hierarchy | GN [44] | O(V ∗ E²) | Yes | Yes
Hierarchy | Louvain [45] | O(V log V) | Yes | Yes
Partition | K-means [46] | O(kVt) | Yes | No
Partition | K-medoids [47] | O(k(V − k)²) | No | No
Density | DBSCAN [48] | O(V log V) | Yes | No
Density | Optics [49] | O(V log V) | Yes | No
Density | CFSFDP [50] | O(V log V) | Yes | Yes
Label propagation | LPA [51] | O(V + E) | Yes | Yes
Spectral graph theory | SM [52] | High | No | Yes
Spectral graph theory | NJW [53] | High | No | Yes
Distribution | GMM [54] | O(V² ∗ kt) | No | Yes
To facilitate the selection of specific clustering algorithms for our proposed framework, we divide these algorithms into six categories and list corresponding classic algorithms. Table 7.2 shows some commonly used clustering algorithms in these six categories. In the time complexity column, V, E, k, and t represent the number of data points, edges, clusters, and iterations, respectively. The hierarchy-based algorithms can be divided into agglomerative algorithms and divisive algorithms; the representative agglomerative algorithms are Newman, BRICH, and Louvain. The advantage of these algorithms is that there is no need to determine the number of clusters in advance. One drawback of this category is that the process is irreversible and more sensitive to outliers: merging two clusters is final and cannot be revoked. Partition-based algorithms regard the center of a group of data points as the center of the corresponding cluster. The advantage of these algorithms is that they are easy to understand and implement, but their biggest problem is that the number of clusters needs to be determined in advance, which requires a certain understanding of the dataset before use; otherwise it is difficult to determine the number of clusters. The hierarchy-based and partition-based algorithms are only suitable for convex clusters. To make up for this deficiency, density-based algorithms were proposed to find clusters of arbitrary shapes. These algorithms are not sensitive to noise, and the number of clusters does not need to be determined in advance. However, these
algorithms often perform poorly in high-dimensional data clustering, because the nodes are extremely sparse in high-dimensional data and the density is difficult to define. The advantages of label propagation-based algorithms are that they are easy to understand and easy to implement, but the clustering effect may be poor due to their strong randomness. Although the spectral graph theory-based and distribution-based algorithms perform well on high-dimensional data, they are not suitable for large-scale datasets due to their computational complexity. Take spectral-based algorithms as an example: their essence is matrix decomposition, and their computational bottleneck is computing the eigenvalues of the matrix. In online lending services, we recommend density-based clustering algorithms, mainly for the following two reasons:
• Discovering communities with arbitrary shapes. Since we cannot predict the distribution of the data and the characteristics of future data, we cannot infer the number of communities, and thus cannot give the number of clusters in advance. Density-based clustering does not require the number of clusters to be provided in advance, and its time complexity is small, which suits the characteristics of online lending services very well.
• Detecting noise. In actual scenarios, some noise appears occasionally due to various factors, for instance random noise caused by weak network signals. If the algorithm can automatically identify most of the noisy data, it can save subsequent data processing time and cost.
The poor performance of density-based algorithms on high-dimensional data can be resolved by our proposed framework: the network embedding method reduces the data dimension to the dimensionality required by these algorithms, so that they can achieve a better clustering effect.

Combining Clustering and Predicting by HashMap. We choose the CFSFDP algorithm [50] as our clustering algorithm and cluster each connected component according to the distances obtained by network embedding, so that similar nodes are grouped together. After clustering is completed, it is necessary to make a prediction for each community to determine whether the community is abnormal. One scheme is to judge a community as abnormal when the ratio of abnormal data to normal data in the community exceeds a certain threshold. This method is simple and effective, but it needs to traverse the entire network again, count the community of each node, and then calculate the ratio of abnormal data to normal data in each community. To save the traversal time of this step, we design a data structure based on a HashMap. Using this data structure, the IDs of abnormal communities can be obtained as soon as the clustering process is completed, so the abnormal communities can be identified directly.
The data structure can be described by HashMap<community ID, <normal count, abnormal count>>.
In the clustering process, once a node is assigned to a community, the information of that community is updated according to the node's label, and the ratio of normal to abnormal data in the community is recalculated to determine whether the community is abnormal at this time. In this way, the clustering process and the prediction process are combined: the pass that would re-traverse every node after clustering is saved, and the basic statistical information of a specified community ID can be queried with a time complexity of O(1).
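A minimal sketch of this bookkeeping is shown below; the class name, the 0/1 label convention, and the 0.5 abnormal-ratio threshold are illustrative assumptions rather than the book's exact implementation.

```python
from collections import defaultdict

class CommunityStats:
    """Community ID -> [normal count, abnormal count], updated while clustering runs."""

    def __init__(self, abnormal_ratio_threshold=0.5):
        self.counts = defaultdict(lambda: [0, 0])
        self.threshold = abnormal_ratio_threshold

    def add_node(self, community_id, label):
        # label: 0 = normal, 1 = abnormal; unlabeled nodes are ignored in the ratio
        if label == 0:
            self.counts[community_id][0] += 1
        elif label == 1:
            self.counts[community_id][1] += 1

    def is_abnormal(self, community_id):            # O(1) lookup
        normal, abnormal = self.counts[community_id]
        labeled = normal + abnormal
        return labeled > 0 and abnormal / labeled >= self.threshold

# During clustering: stats.add_node(cid, label) for each assigned node;
# afterwards, the abnormal community IDs are simply those with stats.is_abnormal(cid).
```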
7.4 Experimental Evaluation

To evaluate the performance of the proposed models in online lending services, we first introduce the dataset as well as the experiment settings, and then present the results of a variety of experiments. Through the empirical evaluation on real-world data, we mainly aim to answer the following three research questions:
RQ1: Does the implementation of our RMCP surpass the existing methods?
RQ2: How do the address disambiguation algorithms affect the implementation of our RMCP?
RQ3: How do the network embedding algorithms affect the implementation of our RMCP?
7.4.1 Dataset Description and Experiment Settings

Our experiments are conducted on a dataset of real-life loan records from a lending company. The records comprise more than 2.3 million applications collected within six years, from January 1, 2012 to July 23, 2018. Before data collection, we informed the users by statements subject to the local privacy policies, and we apply the users' data only to scientific research. The collected data have 8 types of data fields, as shown in Table 7.3. Among these fields, the address field is critical to gang fraud prediction in our work. All applications were manually labeled by the company with integers ranging from 0 to 2. Applications with label 0 are the normal ones and account for 12.28% of all data; applications with label 1 are the abnormal ones and account for about 1.11% of all data; the remaining unlabeled data account for 86.61%. To evaluate our method, we randomly extract 20,000 labeled records as a testing set and use the other data as a training set.
Table 7.3 Detail information of collected data fields

Data field | Missing rate (%) | Explanation
Address | 21.88 | The detailed address generally consists of province, city, district, and detail information
Company name | 0.73 | Applicant's company name
Engine | 94.05 | Applicant's engine number
License | 90.65 | Applicant's license plate
Chassis | 94.03 | Applicant's chassis number
Phone | 48.96 | Telephone number of applicants or companies
IdNo | 25.39 | ID number of applicants and their spouse, encrypted with MD5
Mobile | 31.12 | Mobile number of applicants or their family members
All the experiments are repeated ten times to obtain the average results. We adopt Accuracy, True Positive Rate (Recall), False Positive Rate (Disturb), and KS (Kolmogorov-Smirnov test) as the metrics to quantify the detection performance on the test dataset. Our experiments are conducted on a server with an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40 GHz and 128 GB RAM.
7.4.2 On Model Comparison

To investigate the performance of an implementation of our RMCP, we evaluate the performance of composite schemes obtained by fusing LPA, Louvain, and CFSFDP in the Predicting module. A straightforward fusion method is the majority voting scheme [55]. Besides, we adopt another simple fusion scheme based on ensemble learning and mathematical logic.

Implementation of the Fusion Scheme. We adopt a logical fusion scheme that directly combines schemes with two kinds of operations: conjunction (∧) and disjunction (∨). For simplicity, we use CFSFDP, LPA, and Louvain to denote the three schemes, respectively, and denote a specific fusion scheme by the corresponding logic expression (e.g., CFSFDP∧LPA). The logical fusion scheme can be regarded as an ensemble strategy similar to bagging [56]. All these unsupervised learning-based schemes only have confidence in their positive discrimination results, since a negative discrimination result is just the default output when a scheme does not determine a session to be an anomaly. In other words, they have no confidence in a negative discrimination result. The fusion model cannot guarantee the credibility of a positive discrimination result whenever a negation operator is used in the logical combination; that is, a negation operator negates the validity of the corresponding fusion. Therefore, we say a logical combination is valid if it does not contain any negation operator.
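The negation-free logical fusion can be sketched as follows; the tree encoding and the function name are illustrative assumptions, and the enumeration at the end only hints at how the outcomes of several combinations could be fed to a downstream classifier such as the XGBoost step described next.

```python
from itertools import product

def fuse(combo, votes):
    """Evaluate a negation-free logical combination of base schemes.
    combo: either a scheme name (str) or a tuple (op, left, right) with op in {'and', 'or'}.
    votes: dict mapping scheme name -> bool (True = flagged as abnormal)."""
    if isinstance(combo, str):
        return votes[combo]
    op, left, right = combo
    a, b = fuse(left, votes), fuse(right, votes)
    return a and b if op == "and" else a or b

# Example: CFSFDP ∧ (LPA ∨ Louvain)
combo = ("and", "CFSFDP", ("or", "LPA", "Louvain"))
print(fuse(combo, {"CFSFDP": True, "LPA": False, "Louvain": True}))  # True

# Enumerating all 2^3 assignments of the three base judgements:
truth_table = {
    assignment: fuse(combo, dict(zip(["CFSFDP", "LPA", "Louvain"], assignment)))
    for assignment in product([False, True], repeat=3)
}
```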
Fig. 7.7 KS curves for different prediction methods and fusion schemes
From Theorem 1 in [57], when we consider three candidate models to be fused, i.e., the models based on the CFSFDP, LPA, and Louvain algorithms, respectively, there are 18 kinds of valid logical combinations. We then train XGBoost to complete the classification task over the 18 logical combinations. Under such a scheme, different applications can be determined by different logical combinations of the three judgements according to their specific features.

Comparison of the Implementation of RMCP and Existing Methods. We evaluate the performance by comparing the implementations based on a single scheme with the different fusion schemes. Figure 7.7 shows the KS curves of our RMCP. From Fig. 7.7, we can observe that KS mutations occur in all three schemes when the threshold equals 0.45 and 1.05, respectively. The LPA and Louvain algorithms are unstable when the threshold is less than 0.15, while the CFSFDP algorithm is more stable except at the mutation points; the communities derived from CFSFDP are more accurate. After integrating the three schemes with the voting method, the KS value is improved compared with each individual algorithm, and reaches 0.52 when the threshold is greater than 0.25 and less than 0.4. However, when the threshold is less than 0.15, the voting method is affected by the LPA and Louvain algorithms, resulting in poorer performance than CFSFDP. The KS value of the ensemble method based on logical combination reaches 0.56, and when the threshold is less than 0.15 it is not affected by the poor performance of LPA and Louvain. The performance degradation of the ensemble method at the mutation points is also smaller than that of the other schemes. It can be concluded that our ensemble method is more robust. There is an interesting phenomenon: when the threshold is equal to 1.0, the KS value does not reach its maximum. In other words, even in abnormal communities, the number of abnormal applications is small. There are two intuitive explanations for this phenomenon. The first is that an applicant may be unable to repay the loan on time due to an unexpected event such as bankruptcy, which leads to the judgement as abnormal, even though the applicant's previous applications were repaid on time. Another explanation is that abnormal groups will test the company's risk control strategy before implementing fraudulent activities, in order to develop more effective fraud strategies.
Fig. 7.8 The performance comparison of supervised/unsupervised learning algorithms
Next, we apply supervised and unsupervised learning algorithms to the online lending scenario. Figure 7.8a shows the performance of five supervised learning algorithms [4, 14, 58–60]. The best performer is XGBoost, whose Recall is close to 40%, whose KS is close to 35%, and whose FPR is only 5%. Both the Recall and the KS value of XGBoost are larger than those of the other algorithms, and its FPR is smaller. However, Fig. 7.8b shows that these supervised learning-based algorithms perform poorly in online lending fraud prediction compared with the unsupervised learning-based methods. With few labels, supervised learning algorithms cannot capture the data features of the lending scenario, and thus show poor performance in anomaly detection.
7.4.3 On Address Disambiguation

Impact of Different Algorithms. Figure 7.9 shows the impact of different address disambiguation methods on KS values. NoAddress indicates that the address field is not used in constructing the knowledge graph, Match indicates that ambiguity is eliminated only by string matching, CRF indicates that ambiguity is eliminated after the address tree is constructed using the Conditional Random Field (CRF) algorithm, and Recover indicates the performance after applying our full address disambiguation. Many relationships between entities are hidden in the address data: if the address fields are not used, all four algorithms show poor anomaly detection performance. When only string matching is used to process the address data, some missing relationships are recovered, and the KS values of the four algorithms increase by about 0.02∼0.04. Applying the CRF algorithm recovers further missing relationships between entities in the knowledge graph and reduces its sparsity; the improvement in the KS values of the four algorithms is about 0.03∼0.05.
Fig. 7.9 Performance of different address disambiguation methods
Using our proposed methods, the KS values of the four algorithms are further improved compared with using only the CRF algorithm, which means that the anomaly detection performance is improved accordingly. When clustering the knowledge graph created by our proposed Chinese address disambiguation methods, the performance of the various algorithms improves by about 0.06∼0.09. The most improved is CFSFDP, which is based on density clustering; its KS value increases by about 0.09. This shows that our proposed Chinese address disambiguation methods can recover a large number of missing edges in the knowledge graph, so that density-based clustering algorithms can accurately find the dense sample points.

Impact of Different Parameters. The performance of the address disambiguation methods is mainly affected by two parameters: the leaf node association level and the non-leaf node association level, which are used to avoid the impact of incomplete addresses and of insufficiently considering the grade of address elements. The leaf node association level represents the lowest address element level at which an address is allowed to create an address tree. For example, the address “Haidian District, Beijing” covers a wide range, and many addresses could be matched to its address tree; by setting the leaf node association level, we can prevent such mismatches. Because the address level of “Haidian District, Beijing” is low, it does not satisfy the configured leaf node association level, so the address tree cannot be created. The non-leaf node association level indicates the lowest address level for determining whether two addresses are the same; for instance, if the parameter is set to 11, two addresses are judged to be the same when they match successfully at the 11th level. We use Lnal to denote the leaf node association level and NLnal to denote the non-leaf node association level. Figure 7.10 reports the KS values of the four algorithms with or without address disambiguation for different pairs of [Lnal, NLnal]. Lines marked UAD represent the baseline KS values of the four algorithms without address disambiguation, and lines marked with [Lnal, NLnal] indicate the performance after using the address disambiguation methods.
Fig. 7.10 The KS values for cluster algorithms with different parameters
By using the address disambiguation methods, the KS values of the four algorithms are improved by about 0.06∼0.09. At Lnal = 10 and NLnal = 10, the KS values of the four algorithms reach their maximum. For Lnal < 10, the KS values increase because the same but differently expressed addresses are gradually connected in the same address tree; however, for Lnal > 10, the matching conditions become too strict for such ambiguous addresses to be connected to the same address tree. The parameter NLnal has less impact on the KS values of the four algorithms, because the address level grading is strict and detailed. It can also be inferred from the figure that address elements beyond level 9 are rarely written by applicants.

Accuracy of Address Disambiguation. In this part, we test the effect of each part of the proposed address disambiguation method, including fuzzy match, hierarchical match, address encoding, and supplementing address elements. The performance of address disambiguation cannot be assessed independently by specific metrics, so we evaluate it by three values, i.e., the number of matching groups, the number of matching addresses, and the matching accuracy. The number of matching groups is the number of groups of associated addresses, and the number of matching addresses is the number of associated addresses. For example, suppose there are four addresses A, B, C, and D; if, after address disambiguation, A and B are judged to be the same address and C and D are judged to be the same address, then the number of matching groups is 2 and the number of matching addresses is 4. The matching accuracy is the percentage of matched addresses that are confirmed to be correctly matched after manual review. We set Lnal = 10 and NLnal = 10, and randomly extract 100,000 records from all addresses. Table 7.4 shows the number of matching groups, the number of matching addresses, and the matching accuracy of each part. (a) Using Fuzzy Match: We run two sets of experiments: one uses exact matching for every level of address elements, and the other uses fuzzy matching for the
Table 7.4 Address disambiguation performance of each part

A | Exact | Fuzzy
Group | 3618 | 3652
Address | 7615 | 7738
Accuracy | 100% | 99.54%

B | Non-Hierarchical | Hierarchical
Group | 3652 | 3707
Address | 7738 | 7862
Accuracy | 99.54% | 99.46%

C | Non-Encode | Encode
Group | 3629 | 3707
Address | 7888 | 7862
Accuracy | 98.86% | 99.46%

D | Non-Supplement | Supplement
Group | 3707 | 3707
Address | 7868 | 7862
Accuracy | 99.33% | 99.46%
seventh level while all other levels use exact matching. Table 7.4A shows the results of the two matching methods. It can be seen that after adding fuzzy matching, the numbers of matching groups and matching addresses increase slightly. (b) Using Hierarchical Match: Two groups of experiments are set up. The first uses a matching scheme that traverses in turn from the root nodes to the leaf nodes; the second utilizes hierarchical matching to compare child nodes of the same level. It can be observed from Table 7.4B that after using the hierarchical matching method, the accuracy decreases slightly, but both the number of matching groups and the number of matching addresses increase. (c) Using Address Element Encoding: Table 7.4C compares whether the address encoding method is used before address trees are created. As can be seen from the table, after encoding and sorting the addresses and then establishing the address trees, the matching accuracy increases, while the number of matching addresses is slightly reduced: more complete addresses are given priority to create address trees, and the remaining addresses are matched to them. (d) Using Supplement Address Elements: Two experiments are set up whose only difference is whether the supplement address elements scheme is used. From Table 7.4D, there is no significant change in the number of matching groups, there is a very small drop in the number of matching addresses, and the matching accuracy rises slightly. The main reason is that the added address elements make the address trees more complete, resulting in stricter address element matching requirements.
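For completeness, the three reported values can be derived from the set of matched address pairs roughly as in the sketch below; the union-find grouping and the pair-level approximation of manual-review accuracy are illustrative choices, not the book's exact procedure.

```python
from collections import defaultdict

def matching_metrics(matched_pairs, correct_pairs):
    """Compute (number of matching groups, number of matching addresses, accuracy)
    from pairs of addresses judged identical; correct_pairs is the manually
    reviewed subset of matched_pairs."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in matched_pairs:
        union(a, b)
    groups = defaultdict(set)
    for addr in parent:
        groups[find(addr)].add(addr)
    matched_groups = [g for g in groups.values() if len(g) > 1]

    n_groups = len(matched_groups)                       # number of matching groups
    n_addresses = sum(len(g) for g in matched_groups)    # number of matching addresses
    accuracy = len(set(correct_pairs)) / max(1, len(set(matched_pairs)))
    return n_groups, n_addresses, accuracy

# Example from the text: A~B and C~D give 2 groups and 4 matched addresses.
print(matching_metrics([("A", "B"), ("C", "D")], [("A", "B"), ("C", "D")]))  # (2, 4, 1.0)
```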
7.4.4 On Network Embedding

Performance of Different Algorithms/Parameters. We combine three different categories of network embedding algorithms with four clustering algorithms when applying our customized ACCESS to the four clustering algorithms. Figure 7.11 shows the performance of different network embedding algorithms with different embedding dimensions on LPA, Louvain, K-means, and CFSFDP. Compared with the KS values of the bare clustering algorithms, the performance of the four clustering algorithms combined with network embedding algorithms is further improved.
Fig. 7.11 The performance of RMCP with different clustering and network embedding algorithms
Fig. 7.12 The running time for NE algorithms with different parameters
In these figures, we implement the two customized embedding dimension determination methods included in ACCESS, and present the two methods in four separate panels. It can be observed that when the embedding dimension is small, such as 2 or ⌈log2 V⌉, the performance of the clustering algorithms is poor: the embedding dimension is too small to retain the information of the original network structure. The higher the embedding dimension, the better the anomaly detection performance of the algorithm and the richer the features that can be stored, but the longer the time required. Therefore, it is necessary to select an appropriate embedding dimension according to the specific application scenario.

Running Time of Different Algorithms/Parameters. It is known that the time complexity of node2vec has obvious advantages over HOPE and GCN. Figure 7.12a shows the time cost of node2vec and our ACCESS on networks of different node scales. ACCESS only performs network embedding on the most recently updated connected component, which effectively reduces the time required while ensuring performance. ACCESS also requires less time than running GCN alone on each connected component, as shown in Fig. 7.12b. ACCESS combines the advantages of random walk-based and deep learning-based network embedding algorithms.
Fig. 7.13 Clusters of partial nodes by ACCESS+CFSFDP
The dotted line in the figures indicates that using our proposed HashMap structure can save about 30% of the time. To intuitively reflect the effectiveness of our RMCP, we visualize the clustering results of partial nodes from ACCESS+CFSFDP on the test dataset with the help of t-SNE in Fig. 7.13. Blue nodes in the figure indicate normal applications, and red nodes indicate abnormal applications. Nodes in the same community are close to or even overlap each other, while the distance between different communities is large.
7.5 Conclusion

In this work, we investigate the online lending fraud prediction task based on low-quality data, i.e., very preliminary and coarse applicant information. To improve the data quality, we propose a knowledge graph-based method to enhance the utility of associations in two dedicated ways: recovering missing associations and mining underlying associations. Specifically, we introduce a framework, RMCP, to implement our method on a real-world dataset; it consists of four modules, i.e., Recovering, Mining, Clustering, and Predicting, which automatically recover missing edges, mine gang fraud features, cluster loan applications, and efficiently predict gang fraud in OLSs. Extensive experiments on a real-life online lending dataset validate the effectiveness of our method. Our work significantly outperforms the state-of-the-art competitors in terms of the representative metric in online lending services. As a byproduct, we also design a visual decision support system named LongArms over the RMCP framework, which makes it convenient to view the visual clustering performance and provides dynamic statistics of basic fraud information. In future work, we will investigate the performance of combinations of different advanced algorithms, e.g., graph construction methods, spectral clustering methods, and zero-shot event detection. Besides, we will deploy our LongArms more efficiently, e.g., with lower running time and energy consumption, and study its feasibility on large-scale data.
References 1. D. Xi, B. Song, F. Zhuang, Y. Zhu, S. Chen, T. Zhang, Y. Qi, Q. He, in Proceedings of AAAI 2021, Virtual Event (2–9 Feb 2021). pp. 14,957–14,965 2. R.A. Mohammed, K.W. Wong, M.F. Shiratuddin, X. Wang, in Proceedings of Pacific Rim International Conference on Artificial Intelligence (Springer, 2018), pp. 237–246 3. Z. Li, Y. Tian, K. Li, F. Zhou, W. Yang, Expert Syst. Appl. 74, 105 (2017) 4. M. Malekipirbazari, V. Aksakalli, Expert Syst. Appl. 42(10), 4621 (2015) 5. S. Chen, Y. Yuan, X.R. Luo, J. Jian, Y. Wang, Comput. Secur. 104, 102217 (2021) 6. Y. Lin, X. Wang, F. Hao, Y. Jiang, Y. Wu, G. Min, D. He, S. Zhu, W. Zhao, IEEE Trans. Syst. Man Cybern. Syst. 51(6), 3725 (2021) 7. B. Xu, H. Shen, B. Sun, R. An, Q. Cao, X. Cheng, in Proceedings of AAAI 2021, Virtual Event (2–9 Feb 2021), pp. 4537–4545 8. Z. Li, L. Yao, X. Chang, K. Zhan, J. Sun, H. Zhang, Pattern Recogn. 88, 595 (2019) 9. Z. Li, F. Nie, X. Chang, L. Nie, H. Zhang, Y. Yang, IEEE Trans. Neural Netw. Learn. Syst. 29(12), 6073 (2018). https://doi.org/10.1109/TNNLS.2018.2817538 10. Z. Li, F. Nie, X. Chang, Y. Yang, C. Zhang, N. Sebe, IEEE Trans. Neural Netw. Learn. Syst. 29(12), 6323 (2018) 11. J. Wang, X. Zhang, J. Yin, R. Wang, H. Wu, D. Han, IEEE Trans. Big Data 4(2), 231 (2018) 12. J.B. Millard, L. Kurz, IEEE Trans. Inf. Theory 13(2), 341 (1967) 13. D. Wang, Y. Qi, J. Lin, P. Cui, Q. Jia, Z. Wang, Y. Fang, Q. Yu, J. Zhou, S. Yang, in Proceedings of IEEE ICDM 2019, Beijing, China (8–11 Nov 2019), pp. 598–607 14. T. Chen, C. Guestrin, in Proceedings of ACM SIGKDD 2016, San Francisco, CA, USA (2016), pp. 785–794. (13–17 Aug 2016) 15. A. Abdallah, M.A. Maarof, A. Zainal, J. Netw. Comput. Appl. 68, 90 (2016) 16. D. Sreekantha, R. Kulkarni, Expert Syst. 29(1), 56 (2012)
17. A.C. Bahnsen, D. Aouada, A. Stojanovic, B. Ottersten, Expert Syst. Appl. 51, 134 (2016) 18. C. Wang, C. Wang, H. Zhu, J. Cui, IEEE Trans. Dependable Secur. Comput. 18(5), 2122 (2021) 19. D. Cheng, Z. Niu, Y. Zhang, in Proceedings KDD 2020, CA, USA (23–27 Aug 2020), pp. 2715–2723 20. B. Hu, Z. Zhang, J. Zhou, J. Fang, Q. Jia, Y. Fang, Q. Yu, Y. Qi, in Proceedings CIKM 2020, Virtual Event, Ireland (19–23 Oct 2020), pp. 2525–2532 21. C. Wang, H. Zhu, IEEE Trans. Dependable Secur. Comput. 19(1), 301 (2022) 22. Y. Yang, Y. Xu, Y. Sun, Y. Dong, F. Wu, Y. Zhuang, IEEE Trans. Knowl. Data Eng. 33(1), 169 (2021) 23. S. Han, K. Zhu, M. Zhou, X. Cai, IEEE Trans. Comput. Soc. Syst. 8(4), 856 (2021) 24. R. Cao, G. Liu, Y. Xie, C. Jiang, IEEE Trans. Comput. Soc. Syst. 8(6), 1291 (2021) 25. J. Jurgovsky, M. Granitzer, K. Ziegler, S. Calabretto, P.E. Portier, L. He-Guelton, O. Caelen, Expert Syst. Appl. 100, 234 (2018) 26. C. Liang, Z. Liu, B. Liu, J. Zhou, X. Li, S. Yang, Y. Qi, in Proceedings ACM SIGIR 2019, Paris, France (21–25 July 2019), pp. 1181–1184 27. X. Zhang, Y. Han, W. Xu, Q. Wang, Inf. Sci. 557, 302 (2021) 28. Y. Zhang, J. Zhou, W. Zheng, J. Feng, L. Li, Z. Liu, M. Li, Z. Zhang, C. Chen, X. Li, Y.A. Qi, Z. Zhou, ACM Trans. Intell. Syst. Technol. 10(5), 55:1 (2019) 29. H. Bostani, M. Sheikhan, Pattern Recogn. 62, 56 (2017) 30. D.J. Soemers, T. Brys, K. Driessens, M.H. Winands, A. Nowé, in Proceedings AAAI 2018, pp. 7831–7836 31. M. Carminati, M. Polino, A. Continella, A. Lanzi, F. Maggi, S. Zanero, A.C.M. Trans, Privacy Secur. 21(3), 1 (2018) 32. J. Kim, H.J. Kim, H. Kim, Appl. Intell. (299), 1 (2019) 33. F. Carcillo, Y.L. Borgne, O. Caelen, Y. Kessaci, F. Oblé, G. Bontempi, Inf. Sci. 557, 317 (2021) 34. S. Shehnepoor, R. Togneri, W. Liu, M. Bennamoun, IEEE Trans. Inf. Forensics Secur. 17, 280 (2022). https://doi.org/10.1109/TIFS.2021.3139771 35. B. Hooi, K. Shin, H.A. Song, A. Beutel, N. Shah, C. Faloutsos, A.C.M. Trans, Knowl. Discov. Data 11(4), 1 (2017) 36. V.V. Vlasselaer, C. Bravo, O. Caelen, T. Eliassi-Rad, L. Akoglu, M. Snoeck, B. Baesens, Decis. Support Syst. 75, 38 (2015) 37. L. Yuan, Y. Sun, N. Contractor, in Proceedings ACM ASONAM, vol. 2017 (2017), pp. 546–553 38. P. Luo, K. Shu, J. Wu, L. Wan, Y. Tan, ACM Trans. Intell. Syst. Technol. 11(1), 12:1 (2020) 39. Y. He, C. Wang, C. Jiang, IEEE Trans. Knowl. Data Eng. 30(3), 460 (2018) 40. L. Du, X. Li, C. Liu, R. Liu, X. Fan, J. Yang, D. Lin, M. Wei, in Proceeding IEEE IALP, vol. 2016 (2016), pp. 258–261 41. Y. He, C. Wang, C. Jiang, IEEE Trans. Knowl. Data Eng. 31(3), 451 (2019) 42. M.E. Newman, M. Girvan, Phys. Rev. E 69(2), 026113 (2004) 43. T. Zhang, R. Ramakrishnan, M. Livny, ACM SIGMOD Record 25(2), 103 (1996) 44. M. Girvan, M.E. Newman, Proc. Natl. Acad. Sci. (PNAS) 99(12), 7821 (2002) 45. S. Ghosh, M. Halappanavar, A. Tumeo, A. Kalyanaraman, H. Lu, D. Chavarria-Miranda, A. Khan, A. Gebremedhin, in IEEE IPDPS 2018 (2018), pp. 885–895 46. J.A. Hartigan, M.A. Wong, J. R. Stat. Soc. 28(1), 100 (1979) 47. H.S. Park, C.H. Jun, Expert Syst. Appl. 36(2), 3336 (2009) 48. M. Ester, H.P. Kriegel, J. Sander, X. Xu, et al., in Proceedings KDD 1996, Portland, Oregon, USA, 1996, vol. 96 (1996), pp. 226–231 49. M. Ankerst, M.M. Breunig, H.P. Kriegel, J. Sander, in ACM SIGMOD Record, vol. 28, no. 2 (1999), pp. 49 50. A. Rodriguez, A. Laio, Science 344(6191), 1492 (2014) 51. C. Gong, D. Tao, W. Liu, L. Liu, J. Yang, IEEE Trans. Neural Netw. Learn. Syst. 28(6), 1452 (2016) 52. J.M. Kleinberg, J. ACM 46(5), 604 (1999) 53. A.Y. Ng, M.I. 
Jordan, Y. Weiss, in Proceedings of NIPS 2001, Vancouver, British Columbia, Canada (2001), pp. 849–856
54. C. Rother, V. Kolmogorov, A. Blake, ACM Trans. Graph. (TOG) 23(3), 309 (2004)
55. T. Tian, J. Zhu, Y. Qiaoben, IEEE Trans. Pattern Anal. Mach. Intell. 41(10), 2480 (2019)
56. S. Madisetty, M.S. Desarkar, IEEE Trans. Comput. Soc. Syst. 5(4), 973 (2018)
57. C. Wang, B. Yang, J. Cui, C. Wang, IEEE Trans. Comput. Soc. Syst. 6(4), 637 (2019)
58. C. Cortes, V. Vapnik, Mach. Learn. 20(3), 273 (1995)
59. N.S. Altman, Am. Stat. 46(3), 175 (1992)
60. D.W. Hosmer Jr, S. Lemeshow, R.X. Sturdivant, Applied Logistic Regression, vol. 398 (Wiley, 2013)
Chapter 8
Associations Dynamic Evolution: Evolving Graph Transformer
8.1 Dynamic Fraud Detection Solution Based on Graph Transformer

Online lending services (OLSs) are becoming more popular because of their convenience [1]. Meanwhile, online lending fraud has gradually emerged. Gang fraud, as one of the typical cases of online lending fraud, often causes large losses to lending companies. Generally, loan applications require a long feedback period: lending companies often judge applications based on whether applicants repay on time, and the disadvantage of this method is that companies have already suffered losses by the time a fraudulent application is identified. An effective risk control system that can predict gang fraud is critical to the proper functioning of online lending companies. Such a system can be built on different anti-fraud methods, mainly including manual verification, expert rules, data analysis rules, and machine learning models [2]. In recent years, machine learning models, such as the Support Vector Machine (SVM, [3]) and Random Forest [4], have become a mainstream type of anti-fraud method. As a matter of fact, these models rely on valid features and large amounts of labeled data. Network-structured data, or graphs, have received widespread attention in the past few years because of their powerful representation capabilities. Real-life graphs are divided into static graphs and dynamic graphs. Static graphs can be understood as networks that do not change over time, such as the transportation network of a certain city at a certain moment. Compared with static graphs, dynamic graphs are more common in the real world, such as social networks, transfer transaction networks between accounts, and computer communication networks [5]. Some elements in these graphs may change at any time, and their changing patterns or characteristics can reveal abnormal behaviors because they differ from ordinary elements, such as offensive communication in computer networks [6], the dissemination of false information in social networks [7], and sudden cooperation between scholars in different fields in academic co-authoring networks [8]. Mining these abnormalities in the network as early as possible
is of great significance for maintaining social stability, defending against network attacks, and discovering emerging interdisciplinary directions [9–11].
Mining abnormal elements in a dynamic graph is a harder problem. A dynamic graph mainly has the following characteristics: 1. the graph structure changes unpredictably, with new nodes or edges added or deleted at every time step; 2. the attributes of the graph change unpredictably, so the attribute features of the same node or edge may differ across time steps. These characteristics make it impossible to solve the problem with traditional fraud detection algorithms designed for static networks. At the same time, the abnormalities in a graph may include node anomalies, edge anomalies, and subgraph anomalies; these different forms add complexity to anomaly detection on graphs [12].
Traditional methods mainly focus on the structural characteristics of the network and detect abnormal elements by looking for structural changes, for example via graph embedding. Graph embedding has been shown to be a powerful tool for learning low-dimensional representations of networks that capture and preserve the graph structure. However, most existing graph embedding approaches are designed for static graphs and thus may not be suitable for a dynamic environment in which the network representation has to be constantly updated. Only a few advanced embedding-based methods [13] can update the representation dynamically as the network evolves, but they require knowledge of the nodes over the whole time span and thus can hardly guarantee performance on new nodes in the future.
In addition, real-world graphs are not only dynamic but generally heterogeneous. Heterogeneous graphs usually have multiple types of edges and nodes, which greatly increases their complexity. Over the past decade, a significant line of research has explored mining heterogeneous graphs [14]. One classical paradigm is to define and use meta paths to model heterogeneous structures, such as PathSim [15] and metapath2vec [16]. Recently, in view of the success of graph neural networks (GNNs) [17–19], there have been several attempts to adopt GNNs to learn on heterogeneous graphs [20–23]. However, these works face several issues. First, most of them involve designing meta paths for each type of heterogeneous graph, which requires specific domain knowledge. Second, they either simply assume that different types of nodes/edges share the same feature and representation space, or keep distinct non-shared weights for node types or edge types alone, making them insufficient to capture the properties of heterogeneous graphs. Third, most of them ignore the dynamic nature of heterogeneous graphs.
Recently, there have been some efforts [24–27] to solve the above problems. For example, TGCN [24], as shown in Fig. 8.1, combines RNNs and GCNs: it uses GCNs as a feature extractor and RNNs for sequence learning over the extracted features (node embeddings). As a result, a single GCN model is learned for all graphs along the temporal axis. A limitation of these methods is that they require knowledge of the nodes over the whole time span and can hardly guarantee performance on new nodes in the future. In practice, besides new nodes that may emerge after training, nodes may also frequently appear and disappear, which renders node embedding approaches questionable, because it is challenging for RNNs to learn such irregular behaviors.
Fig. 8.1 Schematic diagram of TGCN structure [24]
To resolve these challenges, we propose the Evolving Graph Transformer (EGT) to capture the dynamism of heterogeneous graphs. In detail, our model addresses the following pain points:
• Avoiding meta-paths. The design of meta-paths relies on expert knowledge. In this model, we use the idea of meta-relations to model node pairs directly.
• Increasing distribution diversity. To better model the diversity of node and edge types in heterogeneous graphs, we set up a type-specific projection matrix for each type so that their distributions can be represented more faithfully.
• Capturing dynamics more stably. Instead of using an RNN to learn from the node representations output by a GNN, we use the RNN to directly evolve the parameters of the GNN. This approach effectively performs model adaptation, focusing on the model itself rather than on node embeddings; hence, changes in the node set impose no restriction. In addition, the parameters of the GNN no longer need to be trained directly.
To evaluate EGT, we conduct experiments on a real-life dataset and two public datasets, and the results demonstrate the superiority of EGT in fraud detection. The contributions of our work are summarized as follows: (1) We propose a technical framework for dynamic heterogeneous graphs, the most general data structure in the graph domain. (2) We circumvent several shortcomings of previous work on dynamic heterogeneous graphs and successfully apply the framework to the fraud detection scenario. (3) We evaluate EGT on the real-life dataset, and the experiments demonstrate that EGT outperforms existing methods in fraud detection.
8.2 Related Work

Graph neural networks (GNNs) can effectively extract the structural and node information of a graph to learn better node representations, and they are applied to a wide range of graph tasks due to their excellent performance. Because of the great success of CNNs on Euclidean data, it is natural to transfer the convolution operation to deep learning on graphs. Graph convolution networks have developed along two lines, spectral-domain methods and spatial-domain methods. Based on spectral theory, [28] applies a learnable convolution operation to graphs for the first time. GCN [18] simplifies the definition of spectrum-based graph convolution. GraphSAGE [17] applies an aggregation operator to gather neighbor information and achieve inductive learning. GAT [19] applies the attention mechanism to graph neural networks for the first time. All of the above works target homogeneous graphs, but most real-world graphs are heterogeneous information networks with multi-typed nodes and edges. HAN [23] uses a two-level attention mechanism on heterogeneous graphs and uses meta-paths to transform a heterogeneous graph into homogeneous graphs. HetGNN [21] uses a sampling strategy based on random walk with restart to convert heterogeneous graphs into homogeneous graphs. Both HetSANN [29] and GTN [22] explore directly encoding node information in heterogeneous graphs without manually designed meta-paths.
Fraud detection on homogeneous graphs is the task most GNNs focus on. AddGraph [30] proposes a general end-to-end anomalous edge detection framework using an extended temporal GCN. Fdgars [31] focuses on spam review detection through GCN. Weber et al. [32] also use GCN for anti-money laundering tasks. GCCAD [33] leverages graph contrastive coding to contrast abnormal nodes with normal ones in terms of their distances to the global context. SL-GAD [34] proposes a GNN model with generative and contrastive self-supervised learning. Fraud detection models on heterogeneous graphs soon followed: Wen et al. [35], Zhang et al. [36] and Liu et al. [37] all proposed models for suspicious user detection on heterogeneous graphs. After the great success of fraud detection on static graphs, attention quickly turned to dynamic graphs. DySAT [38] learns deep neural representations on dynamic homogeneous graphs via self-attention networks. TIMESAGE [39] learns entity representations from temporally weighted edges and learns interaction sequences by using temporal random walks. DHGReg [40] proposes a dynamic heterogeneous graph neural network composed of structural subgraphs and time-series subgraphs to capture suspicious mass registrations. The above methods either do not escape the limitation of relying on meta-paths, or do not escape the need to train a large number of graph neural network parameters. Therefore, we propose EGT to overcome these limitations.
8.3 Fraud Detection in Online Lending Services

In this part, we present our Evolving Graph Transformer (EGT) and apply it to fraud detection problems. For a dynamic graph that evolves along a time series, we use the Graph Transformer to compute the representation vector of each node at every time step, and then use an RNN to evolve the parameters of the Graph Transformer. To make the EGT model applicable to all types of graphs, we propose EGT-Het for heterogeneous graphs and EGT-Hom for homogeneous graphs; EGT-Hom can be seen as a simplified version of EGT-Het. Figure 8.2 shows the overall architecture of the Evolving Graph Transformer.
8.3.1 Preliminary

Homogeneous and Heterogeneous Graph. A graph, denoted as $G = (V, E)$, consists of an object set $V$ and an edge set $E$. A graph is associated with a node type mapping function $\phi: V \to A$ and a link type mapping function $\psi: E \to R$, where $A$ and $R$ denote the sets of predefined node types and link types, respectively. If $|A| + |R| = 2$, we call the graph a homogeneous graph; if $|A| + |R| > 2$, we call it a heterogeneous graph.
Fig. 8.2 Schematic illustration of evolving graph transformer
Meta Relation. In a heterogeneous graph, for an edge $e = (s, t)$ linked from source node $s$ to target node $t$, its meta relation is defined as $\langle \phi(s), \psi(e), \phi(t) \rangle$; the classical meta-path paradigm is defined as a sequence of such meta relations.
Dynamic Graph. To model the dynamic nature of real-world graphs, we assign each edge $e = (s, t)$ a timestamp $T$, where node $s$ connects to node $t$ at time $T$. If $s$ appears for the first time, $T$ is also assigned to $s$. Note that $s$ can be associated with multiple timestamps if it builds connections over time.
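To make these definitions concrete, the following is a minimal Python sketch of how such a dynamic heterogeneous graph could be stored; the class and field names are illustrative and are not taken from the book's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DynamicHeteroGraph:
    """Minimal container for a dynamic heterogeneous graph G = (V, E)."""
    node_type: dict = field(default_factory=dict)   # node id -> type in A, e.g. "Application"
    node_time: dict = field(default_factory=dict)   # node id -> timestamp of first appearance
    edges: list = field(default_factory=list)       # (source, target, edge type in R, timestamp T)

    def add_edge(self, s, t, r, timestamp):
        # a node inherits the timestamp of the first edge it takes part in
        self.node_time.setdefault(s, timestamp)
        self.node_time.setdefault(t, timestamp)
        self.edges.append((s, t, r, timestamp))

    def is_heterogeneous(self):
        # |A| + |R| > 2 means the graph is heterogeneous
        num_node_types = len(set(self.node_type.values()))
        num_edge_types = len({e[2] for e in self.edges})
        return num_node_types + num_edge_types > 2
```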
8.3.2 Graph Transformer

The Graph Transformer (GT) is inspired by [41, 42]. In this part, we first introduce the heterogeneous version of GT, namely GT-Het, and then its simplified version, GT-Hom. Given a sampled sub-graph, GT extracts all linked node pairs, where a target node $t$ is linked by a source node $s$ via an edge $e$. The goal of GT is to aggregate representations from the source nodes to obtain the embedding of the target node $t$. We divide this aggregation process into three steps: Mutual Attention, Message Passing, and Target Node Aggregation. We denote the output of the $l$-th GT layer as $H^{(l)}$, which is also the input of the $(l+1)$-th layer. By stacking $L$ layers, we obtain the node representations of the whole graph, $H^{(L)}$, which will be used as part of the input of the subsequent evolution process.
Mutual Attention. The classic attention mechanism in GNNs is as follows:
$$H^{l}[t] \leftarrow \underset{\forall s \in N(t),\, \forall e \in E(s,t)}{\mathrm{Aggregate}}\Big(\mathrm{Attention}(s, t) \cdot \mathrm{Message}(s)\Big),$$
where $N(t)$ denotes the set of all source nodes of node $t$ and $E(s, t)$ denotes the set of edges from node $s$ to node $t$. There are three basic operators: Attention measures the mutual importance of a node pair; Message is the representation extracted from the source node $s$; Aggregate combines the representations of the source nodes via some aggregation operator such as $\mathrm{mean}$, $\mathrm{sum}$, or $\mathrm{max}$. For example, the classic Graph Attention Network (GAT) [19] uses an additive mechanism for Attention, adopts a shared weight matrix for computing Message, and uses a simple average for Aggregate. In fact, we use vanilla GAT as our GT-Hom. Mathematically, we write
$$\mathrm{Attention}_{Hom}(s, t) = \underset{\forall s \in N(t)}{\mathrm{Softmax}}\Big(a\big(W H^{l-1}[t] \,\|\, W H^{l-1}[s]\big)\Big),$$
$$\mathrm{Message}_{Hom}(s) = W H^{l-1}[s],$$
$$\mathrm{Aggregate}_{Hom}(\cdot) = \sigma\big(\mathrm{Mean}(\cdot)\big).$$
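As a rough illustration of these three operators, here is a hedged PyTorch sketch of a single-head, GAT-style attention step over the neighborhood of one target node; the function name, tensor layout, and the ReLU nonlinearity are illustrative choices, not the book's reference code.

```python
import torch
import torch.nn.functional as F

def gat_hom_step(h_t, h_neighbors, W, a, sigma=torch.relu):
    """One GT-Hom (vanilla GAT-style) update for a single target node.

    h_t:          (d,) representation of the target node at layer l-1
    h_neighbors:  (k, d) representations of its k source nodes
    W:            (d_out, d) shared projection matrix
    a:            (2 * d_out,) additive attention vector
    """
    wh_t = W @ h_t                                         # W H^{l-1}[t]
    wh_s = h_neighbors @ W.T                               # W H^{l-1}[s] for every neighbor
    # additive attention a([W h_t || W h_s]), softmax over the neighborhood
    scores = F.leaky_relu(torch.cat([wh_t.expand_as(wh_s), wh_s], dim=-1) @ a)
    alpha = torch.softmax(scores, dim=0)                   # Attention_Hom(s, t)
    messages = wh_s                                        # Message_Hom(s)
    # Aggregate_Hom: sigma(Mean(attention * message))
    return sigma((alpha.unsqueeze(-1) * messages).mean(dim=0))
```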
It is easy to see that GT-Hom treats all nodes on a homogeneous graph as obeying the same distribution by using a single weight matrix $W$. This design is not suitable for heterogeneous graphs: given a target node $t$, its neighbors $s \in N(t)$ may follow different feature distributions. To solve this problem, we introduce a mutual attention mechanism based on meta-relations [41] to calculate the importance of different types of nodes, which forms the attention part of our GT-Het. Inspired by the attention mechanism in the Transformer [42], we derive a Query (Q) vector from the target node $t$ and a Key (K) vector from the source node $s$ to perform the attention calculation. In the original Transformer, only one set of projection matrices is used to transform the Q/K/V vectors. Considering that a heterogeneous graph contains multiple meta-relations, each meta-relation should have its own corresponding projection matrices. Specifically, for a meta-relation $\langle \phi(s), \psi(e), \phi(t) \rangle$, we introduce the corresponding projection matrices $W_{K\,\phi(s)}$, $W_{\psi(e)}$ and $W_{Q\,\phi(t)}$ to model the distribution differences as faithfully as possible. The reason why a projection matrix is also assigned to each edge type is that there may be multiple types of edges between the same node pair in a real-world graph. Mathematically, we write
$$\mathrm{Attention}_{Het}(s, e, t) = \underset{\forall s \in N(t)}{\mathrm{Softmax}}\left( \overset{N}{\underset{n=1}{\oplus}} ATT\text{-}head^{n}(s, e, t) \right),$$
$$ATT\text{-}head^{n}(s, e, t) = \left( K^{n}(s)\, W^{ATT}_{\psi(e)}\, Q^{n}(t)^{T} \right) \cdot \frac{u_{\langle \phi(s), \psi(e), \phi(t) \rangle}}{\sqrt{d}},$$
$$K^{n}(s) = H^{l-1}[s]\, W^{n}_{K\,\phi(s)}, \qquad Q^{n}(t) = H^{l-1}[t]\, W^{n}_{Q\,\phi(t)}.$$
Here we use $N$ attention heads and $\oplus$ denotes concatenation. We use the projection matrix $W^{n}_{K\,\phi(s)} \in \mathbb{R}^{d \times \frac{d}{N}}$ to generate the $n$-th Key vector $K^{n}(s) \in \mathbb{R}^{1 \times \frac{d}{N}}$ of the source node $s$. Similarly, we use $W^{n}_{Q\,\phi(t)}$ to generate the $n$-th Query vector of the target node $t$. Since not all meta-relations contribute equally to the target node $t$, we use a prior tensor $u \in \mathbb{R}^{|A| \times |R| \times |A|}$ to specify the importance of each meta-relation. Finally, we concatenate the $N$ attention heads to obtain the attention vector for each node pair. Then, for each target node $t$, we gather the attention vectors from all its neighbors $N(t)$ and apply softmax, so that $\sum_{\forall s \in N(t)} \mathrm{Attention}_{Het}(s, e, t) = \mathbf{1}_{h \times 1}$.
Message Passing. While calculating the mutual attention, we propagate the message of the source node $s$ to the target node $t$, and we augment each message with the preference of the meta relation of the edge. For an edge $e = (s, t)$, we calculate its multi-head Message as
$$\mathrm{Message}_{Het}(s, e, t) = \overset{N}{\underset{n=1}{\oplus}} MSG\text{-}head^{n}(s, e, t),$$
$$MSG\text{-}head^{n}(s, e, t) = H^{l-1}[s]\, W^{n}_{M\,\phi(s)}\, W^{MSG}_{\psi(e)}.$$
First, we project the representation of the source node $s$ to the $n$-th message vector with $W^{n}_{M\,\phi(s)} \in \mathbb{R}^{d \times \frac{d}{N}}$. Then the $n$-th message vector is linearly transformed by $W^{MSG}_{\psi(e)} \in \mathbb{R}^{\frac{d}{N} \times \frac{d}{N}}$ to incorporate the edge dependency. Finally, we concatenate the $N$ message heads to obtain the message from the source node $s$ to the target node $t$ via edge $e$.
Target Aggregation. After the mutual attention between each node pair and the corresponding messages have been calculated, we use the weighted sum of the source nodes' messages to update the target node's representation. Since the attention scores of the target node $t$ have been normalized, we can directly use them as weights to average the source nodes' information and obtain the updated vector $\bar{H}^{l}[t]$. Mathematically, we write
$$\bar{H}^{l}[t] = \sum_{\forall s \in N(t)} \mathrm{Attention}_{Het}(s, e, t) \cdot \mathrm{Message}_{Het}(s, e, t).$$
Then we obtain the $l$-th GT layer's output $H^{l}[t]$ for the target node $t$ by a residual connection:
$$H^{l}[t] = \sigma\big(\bar{H}^{l}[t]\big) + H^{l-1}[t].$$
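To ground these steps, the following is a hedged sketch of a single GT-Het attention head with type-specific projections. The class and parameter names are illustrative; the softmax over the neighborhood, the multi-head concatenation, the weighted aggregation, and the residual connection are assumed to be handled by the caller.

```python
import math
import torch
import torch.nn as nn

class HetAttentionHead(nn.Module):
    """One attention head of a GT-Het-style layer (illustrative sketch, not the book's code)."""
    def __init__(self, d, d_head, node_types, edge_types):
        super().__init__()
        self.sqrt_d = math.sqrt(d)
        # type-specific projections: one K/Q/M matrix per node type, one pair of matrices per edge type
        self.W_K = nn.ModuleDict({a: nn.Linear(d, d_head, bias=False) for a in node_types})
        self.W_Q = nn.ModuleDict({a: nn.Linear(d, d_head, bias=False) for a in node_types})
        self.W_M = nn.ModuleDict({a: nn.Linear(d, d_head, bias=False) for a in node_types})
        self.W_att = nn.ParameterDict({r: nn.Parameter(torch.eye(d_head)) for r in edge_types})
        self.W_msg = nn.ParameterDict({r: nn.Parameter(torch.eye(d_head)) for r in edge_types})

    def forward(self, h_s, h_t, s_type, e_type, t_type, prior=1.0):
        k = self.W_K[s_type](h_s)                          # K^n(s) = H^{l-1}[s] W^n_{K, phi(s)}
        q = self.W_Q[t_type](h_t)                          # Q^n(t) = H^{l-1}[t] W^n_{Q, phi(t)}
        # ATT-head^n(s,e,t) = (K^n(s) W^ATT_{psi(e)} Q^n(t)^T) * u_<phi(s),psi(e),phi(t)> / sqrt(d)
        score = (k @ self.W_att[e_type] @ q) * prior / self.sqrt_d
        # MSG-head^n(s,e,t) = H^{l-1}[s] W^n_{M, phi(s)} W^MSG_{psi(e)}
        message = self.W_M[s_type](h_s) @ self.W_msg[e_type]
        return score, message
```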
Note that since we stack $L$ GT layers, a node can receive information from all nodes within $L$ hops, which yields a highly semantic representation for each node.
Relative Temporal Encoding. To better capture the temporal relationship between node pairs at a given time step, we use Relative Temporal Encoding (RTE) to model this process. RTE is inspired by the positional encoding of the Transformer [42, 43], which has been successfully used to capture the sequential dependencies of words in texts. In detail, given a source node $s$ and a target node $t$ with corresponding timestamps $T(s)$ and $T(t)$, we define the time interval between them, $\Delta T(t, s) = T(t) - T(s)$, as an index to generate a relative temporal encoding $RTE(\Delta T(t, s))$. Since the training dataset will not contain all possible time intervals, RTE must generalize to unseen intervals. We therefore use trigonometric functions as the encoding basis. Mathematically, we write
$$\mathrm{Base}(\Delta T(t, s), 2i) = \sin\left(\frac{\Delta T(t, s)}{10000^{\frac{2i}{d}}}\right),$$
$$\mathrm{Base}(\Delta T(t, s), 2i + 1) = \cos\left(\frac{\Delta T(t, s)}{10000^{\frac{2i+1}{d}}}\right),$$
$$RTE(\Delta T(t, s)) = W_{RTE}\, \mathrm{Base}(\Delta T(t, s)),$$
where the projection matrix $W_{RTE} \in \mathbb{R}^{d \times d}$. Finally, the temporal encoding relative to the target node $t$ is added to the source node $s$ representation as follows:
$$\bar{H}^{l-1}[s] = H^{l-1}[s] + RTE(\Delta T(t, s)).$$
In this way, we have enhanced the time relationship in the output $\bar{H}^{l-1}$.
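A minimal PyTorch sketch of this encoding is given below, assuming a scalar time gap and a learnable projection matrix supplied by the caller; the function name and calling convention are our own.

```python
import torch

def relative_temporal_encoding(delta_t: float, d: int, W_rte: torch.Tensor) -> torch.Tensor:
    """Sinusoidal base encoding of the time gap Delta T, then the projection W_RTE (d x d)."""
    idx = torch.arange(d, dtype=torch.float32)
    # dimension 2i uses sin and dimension 2i+1 uses cos; in both cases the exponent is idx / d
    angles = delta_t / torch.pow(torch.tensor(10000.0), idx / d)
    base = torch.where(idx % 2 == 0, torch.sin(angles), torch.cos(angles))
    return W_rte @ base          # RTE(Delta T) = W_RTE * Base(Delta T)

# usage: add the encoding to the source-node representation
# h_s_bar = h_s + relative_temporal_encoding(t_target - t_source, d=256, W_rte=W_rte)
```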
8.3.3 Evolving Graph Transformer

The Graph Transformer proposed in the previous section handles the representation of a static graph well. However, dynamic graphs are more common in the real world, especially in fraud detection scenarios, so we need to capture the dynamism of the graph. Inspired by EvolveGCN [44], we use an RNN to evolve the parameters of the Graph Transformer at each time step. At time step $T$, we uniformly denote the parameters of the $l$-th Graph Transformer layer as $W^{l}_{T}$, without distinguishing among them. We treat $W^{l}_{T}$ as the output of a dynamical system that becomes the input at the subsequent time step, and use a long short-term memory (LSTM) cell to model the input-output relationship; the LSTM maintains the system information through its cell context. Abstractly, we write
$$W^{l}_{T} = \mathrm{LSTM}(W^{l}_{T-1}).$$
As we can see, the node representations are not used at all. The LSTM may be replaced by other recurrent architectures, as long as the roles of $W^{l}_{T}$ and $W^{l}_{T-1}$ are clear. We then define the evolving graph transformer unit (EGTU):

function $[H^{l+1}_{T}, W^{l}_{T}] = \mathrm{EGTU}(G_{T}, H^{l}_{T}, W^{l}_{T-1})$
    $W^{l}_{T} = \mathrm{LSTM}(W^{l}_{T-1})$
    $H^{l+1}_{T} = \mathrm{GraphTransformer}(G_{T}, H^{l}_{T}, W^{l}_{T})$
end function

The EGTU performs graph transformation along the layers and meanwhile evolves the weight matrices over time. Implementing the EGTU then only requires a straightforward extension of the standard LSTM from the vector version to the matrix version. The variables in the following pseudo-code are all local variables and should not be confused with the mathematical notation used above.
function $H_{T} = \mathrm{LSTM}(X_{T})$
    $F_{T} = \mathrm{sigmoid}(W_{F} X_{T} + U_{F} H_{T-1} + B_{F})$
    $I_{T} = \mathrm{sigmoid}(W_{I} X_{T} + U_{I} H_{T-1} + B_{I})$
    $O_{T} = \mathrm{sigmoid}(W_{O} X_{T} + U_{O} H_{T-1} + B_{O})$
    $\tilde{C}_{T} = \tanh(W_{C} X_{T} + U_{C} H_{T-1} + B_{C})$
    $C_{T} = F_{T} \otimes C_{T-1} + I_{T} \otimes \tilde{C}_{T}$
    $H_{T} = \tanh(C_{T}) \otimes O_{T}$
end function

where the input $X_{T}$ is taken to be the output $H_{T-1}$ of the previous time step. This setting, in which the weights evolve over time, is better suited to scenarios where node features are not very informative but changes in the graph structure are important.
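For intuition, here is a hedged PyTorch sketch of this weight-evolution step. It flattens the weight matrix and feeds it through a standard LSTM cell rather than implementing a literal matrix-gated LSTM, which is a simplification; the class name and the commented driver loop are illustrative, not the book's reference implementation.

```python
import torch
import torch.nn as nn

class WeightEvolver(nn.Module):
    """Evolves one GT layer's (d x d) weight matrix over time: W_T = LSTM(W_{T-1})."""
    def __init__(self, d: int):
        super().__init__()
        self.d = d
        self.cell = nn.LSTMCell(d * d, d * d)    # vector LSTM over the flattened matrix
        self.state = None                         # (hidden, cell) context carried across time steps

    def forward(self, W_prev: torch.Tensor) -> torch.Tensor:
        x = W_prev.reshape(1, -1)
        if self.state is None:
            self.state = (x, torch.zeros_like(x))
        h, c = self.cell(x, self.state)
        self.state = (h, c)
        return h.reshape(self.d, self.d)          # evolved weights W_T for the current step

# hypothetical driver over graph snapshots G_T with L layers:
# for G_T, H in snapshots:
#     for l in range(L):
#         W[l] = evolver[l](W[l])                      # W^l_T = LSTM(W^l_{T-1})
#         H = graph_transformer_layer(G_T, H, W[l])    # H^{l+1}_T
```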
8.4 Experimental Evaluation

8.4.1 Datasets and Metrics

Datasets. We conducted experiments on multiple datasets from different domains. Since there are few publicly available dynamic heterogeneous graph datasets suited to fraud detection scenarios, we used a dataset from a large online lending platform. In addition, to verify the performance of EGT-Hom, we also evaluated it on several dynamic homogeneous graph datasets.
• Online Lending Dataset: This dataset comes from a large-scale Internet financial lending platform. We select all the data from January 1, 2016 to July 31, 2017, a total of 910,197 application transactions. Only 170,608 of these transactions are labeled (0 for normal, 1 for fraud); the remaining transactions are unlabeled. To avoid temporal leakage, we use the transactions from January 1, 2016 to June 30, 2017 as the training set (880,614 transactions) and the data from July 1 to July 31, 2017 as the test set (29,583 transactions). The heterogeneous graph built from these data contains 9 types of nodes (1 type of applicant node and 8 types of attribute nodes, as shown in Table 8.1) and 19 types of edges (as shown in Table 8.2). Restricted by the privacy protection policy, the attribute node features are all strings, and the ID number and phone fields are desensitized. To facilitate the storage of large-scale network structures, we use the Neo4j graph database to store the heterogeneous graphs. Finally, we maintain a six-month time window with a 15-day sliding period to update the graph.
Table 8.1 The selected nodes in application transactions

Nodes | Attribute | Description
Application | Name, Time | The identifier of a transaction. We extract the applicant's name and the time of the transaction as attributes
ADDR | Province, City, District | The detailed address. We extract three levels of administrative areas as the attribute description
CONAME | None | The name of the company where the online loan applicant works
IDNO | Name | The identity card number. The holder's name is used as an attribute
DL | None | The driving license plate number of the applicant
VIN | None | The vehicle identification number of the loan applicant
ENGINE | None | The engine number of the loan applicant
MP | Name | The mobile phone number. The holder's name is used as an attribute
TEL | Name | The telephone number. The holder's name is used as an attribute
• Elliptic¹: Elliptic is a graph of Bitcoin transactions which maps transactions to real entities belonging to licit categories (exchanges, wallet providers, miners, licit services, etc.) versus illicit ones (scams, malware, terrorist organizations, ransomware, Ponzi schemes, etc.). The graph consists of 203,769 nodes and 234,355 edges. Each node has 166 features and is labeled as being created by a "licit", "illicit" or "unknown" entity. Two percent (4,545) of the nodes are labeled class 1 (illicit) and twenty-one percent (42,019) are labeled class 2 (licit); the remaining transactions are not labeled with regard to licit versus illicit. The statistical information of Elliptic is shown in Fig. 8.3.
• Bitcoin Alpha²: Bitcoin Alpha is a who-trusts-whom network of people who trade using Bitcoin on the platform https://www.bitcoin-otc.com. The dataset may be used for predicting the polarity of each rating and for forecasting whether a user will rate another one in the next time step.

Metrics. For the online lending dataset, we use the KS value [45] as our main evaluation metric. The KS value is a de facto standard in the field of loan anti-fraud [46]; it is the maximum difference between the recall rate and the disturbance rate over all thresholds. The larger the KS value, the better the model separates fraudulent applications from normal ones. In practice, the KS value tolerates relatively high false positive rates.

¹ https://www.kaggle.com/ellipticco/elliptic-data-set
² http://snap.stanford.edu/data/soc-sign-bitcoin-alpha.html
Table 8.2 The extracted edges in loan transactions

Edges | Description
R_ADDR | The loan applicant's address
R_CO | The mobile number of the applicant's colleagues
R_CO_ADDR | The address of the applicant's company
R_CO_NAME | The loan applicant's company name
R_CO_TEL | The telephone number of the applicant's company
R_CPADDR | The loan applicant's residential address
R_CRMP | The mobile phone number of the applicant's general relatives
R_FCMP | The mobile phone number of the applicant's friends
R_IDNO | The loan applicant's license plate number
R_DL | The loan applicant's driving license plate number
R_VIN | The vehicle identification number of loan applicant's car
R_ENGINE | The engine number of loan applicant's car
R_LRMP | The mobile phone number of the applicant's immediate family members
R_MATEIDNO | The identity card number of the applicant's spouse
R_MP | The loan applicant's mobile phone number
R_OCMP | The mobile phone number of the applicant's other contacts
R_TEL | The loan applicant's telephone number
R_REGADDR | The loan applicant's native place
Fig. 8.3 The distribution of the number of nodes in the Elliptic dataset over different time slices [32]
Then we use the Elliptic data set for node classification and the Bitcoin Alpha data set for edge classification. For both data sets and tasks, we use F1 scores to measure the final performance.
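As a reference for how the KS value described above can be computed from model scores, here is a hedged sketch; the function name and array conventions are our own.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """KS value: maximum gap between recall (TPR) and disturbance rate (FPR) over all thresholds."""
    order = np.argsort(-np.asarray(y_score))           # sort by descending fraud score
    y = np.asarray(y_true)[order]
    n_pos = max(int((y == 1).sum()), 1)
    n_neg = max(int((y == 0).sum()), 1)
    tpr = np.cumsum(y == 1) / n_pos                     # recall of the fraud class
    fpr = np.cumsum(y == 0) / n_neg                     # disturbance (false positive) rate
    return float(np.max(tpr - fpr))

# example: ks_statistic([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.3, 0.1]) ≈ 0.67
```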
8.4.2 Baseline Methods

We use seven baseline methods, as follows:
• GCN [18]: A model that propagates the information of neighbor nodes to the node itself through a convolution operation. We use a single GCN model for all time steps and accumulate the loss along the time axis.
• GAT [19]: An improved method that uses a multi-head attention mechanism over neighbors. We use a single GAT model for all time steps and accumulate the loss along the time axis.
• HAN [23]: It designs hierarchical attention to aggregate neighbor information via different meta paths.
• HetGNN [21]: It uses a random-walk-with-restart strategy for node sampling instead of predefined meta-paths, and adopts different Bi-LSTMs for different node types to aggregate neighbor information.
• EvolveGCN-O [44]: EvolveGCN uses an RNN to evolve a GCN in order to capture the dynamism of the graph. We use the EvolveGCN-O version for comparison.
• DynGEM [47]: An unsupervised node embedding approach based on graph autoencoders. The autoencoder parameters learned at the previous time step are used to initialize those of the current step for faster learning.
• dyngraph2vec [48]: An unsupervised method with several variants: dyngraph2vecAE, dyngraph2vecRNN, and dyngraph2vecAERNN. The first incorporates past node information for autoencoding; the others use an RNN to maintain past node information.
8.4.3 Implementation Details

We use a hidden dimension of 256 throughout the neural networks for all baselines. For all multi-head attention-based methods, we set the number of heads to 8. All GNNs have 3 layers so that the receptive fields of the networks are exactly the same. All baselines are optimized with the AdamW optimizer and a cosine annealing learning rate scheduler. Each model is trained for 200 epochs, and the checkpoint with the lowest validation loss is reported. We use the default parameters from the GNN literature and do not tune hyper-parameters.
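A hedged PyTorch sketch of this training setup is shown below. The model and the training/validation routines are placeholders of our own, while the optimizer, scheduler, epoch count, and checkpoint-selection rule follow the settings reported above.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, optimizer):          # placeholder training routine
    optimizer.zero_grad()
    loss = model(torch.randn(8, 256)).sum()     # stands in for the EGT loss on one snapshot batch
    loss.backward()
    optimizer.step()

def validate(model):                            # placeholder validation routine
    with torch.no_grad():
        return float(model(torch.randn(8, 256)).abs().mean())

model = nn.Linear(256, 2)                       # placeholder for the EGT model (hidden dim 256)
optimizer = torch.optim.AdamW(model.parameters())                      # default hyper-parameters
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

best_val = float("inf")
for epoch in range(200):
    train_one_epoch(model, optimizer)
    val_loss = validate(model)
    scheduler.step()
    if val_loss < best_val:                     # keep the checkpoint with the lowest validation loss
        best_val = val_loss
        torch.save(model.state_dict(), "egt_best.pt")
```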
8.4.4 Results for Node Classification

The results for node classification are shown in Tables 8.3 and 8.4. Specifically, for the Online Lending dataset, more detailed performance at different time steps is shown in Fig. 8.4.
Table 8.3 Quantitative results (%) on KS, recall and disturbance on the online lending dataset

Metrics | GCN | GAT | HAN | HetGNN | EvolveGCN-O | EGT-Het−RTE | EGT-Het
KS | 40.3 | 42.8 | 45.9 | 47.4 | 45.6 | 47.9 | 51.1
Recall | 56.5 | 58.7 | 60.2 | 61.9 | 62.3 | 65.1 | 68.9
Disturbance | 16.2 | 15.9 | 14.3 | 14.5 | 16.7 | 17.2 | 17.8
Table 8.4 Quantitative results (%) on F1 score on the Elliptic dataset at time step 42

Method | F1 score
GCN | 68.9 ± 2.3
GAT | 70.1 ± 1.8
DynGEM | 60.3 ± 4.1
dyngraph2vecAERNN | 65.2 ± 4.3
EvolveGCN-O | 71.2 ± 2.2
EGT-Hom−RTE | 71.4 ± 1.5
EGT-Hom | 73.5 ± 1.9
Fig. 8.4 Schematic illustration of Evolving Graph Transformer in the task of Online Lending dataset
From these results, we can draw the following observations:
• The proposed EGT-Het and EGT-Hom outperform the other methods, which demonstrates the superiority of our model. On the Online Lending dataset, the KS value of our model exceeds 50, reaching a level acceptable to the industry. EGT-Het's stronger performance on the online lending dataset may be due to two factors: 1. Gang fraud manifests itself not only in dense correlations among applications, but also in bursts of applications issued within a short period of time; if the associations between application forms are spread over a very long period, the likelihood of gang fraud is much smaller even when the associations look dense, which purely static analysis cannot capture. 2. A fraudster is likely to modify his basic information, so certain characteristics do indeed change across time steps.
• GCN and GAT perform quite well on the Elliptic dataset, with high mean performance and small variance. This is probably because the Elliptic dataset provides rich feature information, so even a simple model structure can capture abnormal patterns. However, their performance on the Online Lending dataset is not as good, mainly because they cannot exploit the gains brought by the heterogeneous structure.
• Similarly, EvolveGCN-O performs well on the Elliptic dataset but is not outstanding on the Online Lending dataset. Although it also uses an RNN to evolve the GNN parameters, the GCN architecture is not sufficient for heterogeneous graphs.
• Both HAN and HetGNN can capture abnormal information on the heterogeneous graph, but both lack the ability to process time-series information, so neither achieves higher performance.
• Both DynGEM and dyngraph2vecAERNN are unsupervised graph embedding methods and perform poorly on node classification tasks. This may be because the node embeddings generated at each time step are unstable, which also leads to a large variance in performance.
Let us look at the performance of these models from a broader time perspective. Figure 8.5 shows the performance of the methods on the Elliptic dataset, and some interesting phenomena emerge. At time step 43, all methods stop working and the F1 score drops to 0. This is when the dark market shutdown occurred. Such an emerging event causes performance degradation for all methods, with non-dynamic models suffering the most; even the dynamic models cannot perform reliably, because the emerging event has not been learned.
8.4.5 Results for Edge Classification The edge classification results on the Bitcoin Alpha dataset are shown in Table 8.5. EGT-Hom shows the best performance. The performance of the two unsupervised methods, DynGEM and dyngraph2vecAERNN, on the edge classification problem is still weaker than other supervised methods. However, GCN and GAT still perform well on this kind of homogenous graph.
Fig. 8.5 Performance of node classification on the Elliptic dataset over time. The F1 score is for the minority (illicit) class

Table 8.5 Quantitative results (%) on F1 score on the Bitcoin Alpha dataset for edge classification

Method | F1 score
GCN | 76.3 ± 3.5
GAT | 74.2 ± 1.9
DynGEM | 65.6 ± 2.3
dyngraph2vecAERNN | 67.1 ± 1.4
EvolveGCN-O | 75.2 ± 1.8
EGT-Hom−RTE | 76.3 ± 2.1
EGT-Hom | 79.1 ± 3.2
8.4.6 Ablation Study

• RTE: RTE is an important part of the EGT model; it encodes the relative order in which nodes appear in the graph. From Tables 8.3, 8.4 and 8.5, it can be seen that after removing RTE, performance degrades on both the node classification and the edge classification tasks, but it still remains at a relatively high level. This shows that even without RTE, the RNN evolution mechanism still captures the dynamics of the graph.
Fig. 8.6 Influence of hidden layer dimension and layer numbers of EGT on model performance
8.4.7 Parameter Sensitivity

This section explores how the hidden layer dimension and the number of layers of the neural network in EGT influence model performance. The results are shown in Fig. 8.6:
• Hidden layer dimension. As shown in Fig. 8.6a, the hidden layer dimension directly affects the representation capacity of the model. Based on multiple experiments and comparisons, we find that setting the hidden layer dimension to 256 yields better model performance. When the dimensionality is increased further, the training time grows significantly and the performance drops.
• Number of network layers. Unlike traditional deep networks, graph neural networks usually cannot be made very deep, otherwise they are prone to over-smoothing. As shown in Fig. 8.6b, we experimented with various numbers of layers and find that over-smoothing is best avoided with 3 layers. When the number of layers is increased further, the performance of EGT on the three datasets drops significantly, especially on the online lending dataset, which we attribute to overfitting.
8.5 Conclusion

In this work, we propose the Evolving Graph Transformer (EGT), a model that realizes fraud detection on dynamic heterogeneous graphs. EGT abandons the meta-paths used in traditional methods and adopts the more general notion of meta-relations, using multiple projection matrices to project meta-relations into different subspaces and thereby obtain strong expressive power. In addition, EGT reduces parameter training by using an RNN to evolve the GNN, which improves model training speed.
EGT also uses Relative Temporal Encoding to better capture the dynamics of the graph. Experiments on multiple datasets demonstrate the excellent performance of EGT.
References

1. D. Xi, B. Song, F. Zhuang, Y. Zhu, S. Chen, T. Zhang, Y. Qi, Q. He, in Proceedings of the AAAI 2021, Virtual Event, February 2–9, 2021 (2021), pp. 14,957–14,965
2. R.A. Mohammed, K.W. Wong, M.F. Shiratuddin, X. Wang, in Proc. Pacific Rim International Conference on Artificial Intelligence (Springer, 2018), pp. 237–246
3. Z. Li, Y. Tian, K. Li, F. Zhou, W. Yang, Expert Syst. Appl. 74, 105 (2017)
4. M. Malekipirbazari, V. Aksakalli, Expert Syst. Appl. 42(10), 4621 (2015)
5. K.M. Carley, Dynamic Network Analysis (na, 2003)
6. D. Eswaran, C. Faloutsos, in 2018 IEEE International Conference on Data Mining (ICDM) (IEEE, 2018), pp. 953–958
7. M. Gupta, J. Gao, Y. Sun, J. Han, in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2012), pp. 859–867
8. N.A. Heard, D.J. Weston, K. Platanioti, D.J. Hand, Ann. Appl. Stat. 645–662 (2010)
9. C.C. Noble, D.J. Cook, in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2003), pp. 631–636
10. S. Ranshous, S. Shen, D. Koutra, S. Harenberg, C. Faloutsos, N.F. Samatova, Wiley Interdiscip. Rev. Comput. Stat. 7(3), 223 (2015)
11. D. Savage, X. Zhang, X. Yu, P. Chou, Q. Wang, Soc. Netw. 39, 62 (2014)
12. L. Akoglu, H. Tong, D. Koutra, Data Min. Knowl. Discov. 29(3), 626 (2015)
13. W. Yu, W. Cheng, C.C. Aggarwal, K. Zhang, H. Chen, W. Wang, in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), pp. 2672–2681
14. Y. Sun, J. Han, Synthesis Lectures on Data Mining and Knowledge Discovery 3(2), 1 (2012)
15. Y. Sun, J. Han, X. Yan, P.S. Yu, T. Wu, Proc. VLDB Endowment 4(11), 992 (2011)
16. Y. Dong, N.V. Chawla, A. Swami, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017), pp. 135–144
17. W. Hamilton, Z. Ying, J. Leskovec, in Advances in Neural Information Processing Systems (2017), pp. 1024–1034
18. T.N. Kipf, M. Welling, arXiv:1609.02907 (2016)
19. P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, arXiv:1710.10903 (2017)
20. M. Schlichtkrull, T.N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, M. Welling, in European Semantic Web Conference (Springer, 2018), pp. 593–607
21. C. Zhang, D. Song, C. Huang, A. Swami, N.V. Chawla, in ACM KDD 2019 (2019), pp. 793–803
22. S. Yun, M. Jeong, R. Kim, J. Kang, H.J. Kim, in Advances in Neural Information Processing Systems (2019), pp. 11,983–11,993
23. X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, P.S. Yu, in WWW 2019 (2019), pp. 2022–2032
24. L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng, H. Li, IEEE Trans. Intell. Transp. Syst. 21(9), 3848 (2019)
25. Y. Seo, M. Defferrard, P. Vandergheynst, X. Bresson, in International Conference on Neural Information Processing (Springer, 2018), pp. 362–373
26. F. Manessi, A. Rozza, M. Manzo, Pattern Recognit. 97, 107000 (2020)
27. A. Narayan, P.H. Roe, IFAC-PapersOnLine 51(2), 433 (2018)
28. J. Bruna, W. Zaremba, A. Szlam, Y. LeCun, arXiv:1312.6203 (2013)
29. H. Hong, H. Guo, Y. Lin, X. Yang, Z. Li, J. Ye, in AAAI 2020 (2020), pp. 4132–4139
30. L. Zheng, Z. Li, J. Li, Z. Li, J. Gao, in IJCAI (2019), pp. 4419–4425
31. J. Wang, R. Wen, C. Wu, Y. Huang, J. Xion, in Companion Proceedings of The 2019 World Wide Web Conference (2019), pp. 310–316
32. M. Weber, G. Domeniconi, J. Chen, D.K.I. Weidele, C. Bellei, T. Robinson, C.E. Leiserson, arXiv:1908.02591 (2019)
33. B. Chen, J. Zhang, X. Zhang, Y. Dong, J. Song, P. Zhang, K. Xu, E. Kharlamov, J. Tang, arXiv:2108.07516 (2021)
34. Y. Zheng, M. Jin, Y. Liu, L. Chi, K.T. Phan, Y.P.P. Chen, IEEE Transactions on Knowledge and Data Engineering (2021)
35. R. Wen, J. Wang, C. Wu, J. Xiong, in Companion Proceedings of the Web Conference 2020 (2020), pp. 674–678
36. Y. Zhang, Y. Fan, Y. Ye, L. Zhao, C. Shi, in Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019), pp. 549–558
37. Z. Liu, C. Chen, X. Yang, J. Zhou, X. Li, L. Song, in Proceedings of the 27th ACM International Conference on Information and Knowledge Management (2018), pp. 2077–2085
38. A. Sankar, Y. Wu, L. Gou, W. Zhang, H. Yang, in Proceedings of the 13th International Conference on Web Search and Data Mining (2020), pp. 519–527
39. S. Shekhar, D. Pai, S. Ravindran, in Companion Proceedings of the Web Conference 2020 (2020), pp. 662–668
40. S.X. Rao, S. Zhang, Z. Han, Z. Zhang, W. Min, M. Cheng, Y. Shan, Y. Zhao, C. Zhang, arXiv:2012.10831 (2020)
41. Z. Hu, Y. Dong, K. Wang, Y. Sun, in Proceedings of the Web Conference 2020 (2020), pp. 2704–2710
42. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, arXiv:1706.03762 (2017)
43. P. Shaw, J. Uszkoreit, A. Vaswani, arXiv:1803.02155 (2018)
44. A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi, T. Kaler, T. Schardl, C. Leiserson, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34 (2020), pp. 5363–5370
45. J.H. Friedman, IEEE Trans. Comput. 26(4), 404 (1977)
46. D. Wang, J. Lin, P. Cui, Q. Jia, Z. Wang, Y. Fang, Q. Yu, J. Zhou, S. Yang, Y. Qi, in IEEE ICDM 2019 (2019), pp. 598–607
47. P. Goyal, N. Kamra, X. He, Y. Liu, arXiv:1805.11273 (2018)
48. P. Goyal, S.R. Chhetri, A. Canedo, Knowl. Based Syst. 187, 104816 (2020)