Proceedings in Adaptation, Learning and Optimization 18
Kaj-Mikael Björk Editor
Proceedings of ELM 2022 Theory, Algorithms and Applications
Proceedings in Adaptation, Learning and Optimization Series Editor Meng-Hiot Lim, Nanyang Technological University, Singapore, Singapore
The roles of adaptation, learning and optimization are becoming increasingly essential and intertwined. The capability of a system to adapt, either through modification of its physiological structure or via some revalidation process of internal mechanisms that directly dictate the response or behavior, is crucial in many real-world applications. Optimization lies at the heart of most machine learning approaches, while learning and optimization are two primary means to effect adaptation in various forms. They usually involve computational processes incorporated within the system that trigger parametric updating and knowledge or model enhancement, giving rise to progressive improvement. This book series serves as a channel to consolidate work related to topics linked to adaptation, learning and optimization in systems and structures. Topics covered under this series include:

• complex adaptive systems, including evolutionary computation, memetic computing, swarm intelligence, neural networks, fuzzy systems, tabu search, simulated annealing, etc.
• machine learning, data mining & mathematical programming
• hybridization of techniques that span across artificial intelligence and computational intelligence for a synergistic alliance of problem-solving strategies
• aspects of adaptation in robotics
• agent-based computing
• autonomic/pervasive computing
• dynamic optimization/learning in noisy and uncertain environments
• systemic alliance of stochastic and conventional search techniques
• all aspects of adaptation in man-machine systems

This book series bridges the dichotomy of modern and conventional mathematical and heuristic/meta-heuristic approaches to bring about effective adaptation, learning and optimization. It propels the maxim that the old and the new can come together and be combined synergistically to scale new heights in problem-solving.
To reach such a level, numerous research issues will emerge and researchers will find the book series a convenient medium to track the progresses made. Indexed by INSPEC, zbMATH. All books published in the series are submitted for consideration in Web of Science.
Editor Kaj-Mikael Björk Graduate School and Research Arcada University of Applied Sciences Helsinki, Finland
ISSN 2363-6084 ISSN 2363-6092 (electronic) Proceedings in Adaptation, Learning and Optimization ISBN 978-3-031-55055-3 ISBN 978-3-031-55056-0 (eBook) https://doi.org/10.1007/978-3-031-55056-0 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.
Contents
Distributed Memory-Efficient Algorithm for Extreme Learning Machines Based on Spark . . . . . 1
Anton Akusok, Leonardo Espinosa-Leal, Kaj-Mikael Björk, and Amaury Lendasse

Massive Offline Signature Forgery Detection with Extreme Learning Machines . . . . . 9
Leonardo Espinosa-Leal, Zhen Li, Renjie Hu, and Kaj-Mikael Björk

Importance of the Activation Function in Extreme Learning Machine for Acid Sulfate Soil Classification . . . . . 16
Virginia Estévez, Stefan Mattbäck, and Kaj-Mikael Björk

TinyThrow - Improved Lightweight Real-Time High-Rise Littering Object Detection Algorithm . . . . . 26
Tianyu Ji and Pengjiang Qian

Does Streaming Affect Video Game Popularity? . . . . . 37
Zhen Li, Leonardo Espinosa-Leal, Maria Olmedilla, Renjie Hu, and Kaj-Mikael Björk

Speech Dereverberation Based on Self-supervised Residual Denoising Autoencoder with Linear Decoder . . . . . 46
Tassadaq Hussain, Ryandhimas E. Zezario, Yu Tsao, and Amir Hussain

Application of ELM Model to the Motion Detection of Vehicles Under Moving Background . . . . . 58
Zixiao Zhu, Rongzihan Song, Xiaofan Jia, and Dongshun Cui

Predicting the Colorectal Cancer Mortality in the Region of Lleida, Spain: A Machine Learning Study . . . . . 70
Didac Florensa, Jordi Mateo, Francesc Solsona, Pere Godoy, and Leonardo Espinosa-Leal

Author Index . . . . . 81
Distributed Memory-Efficient Algorithm for Extreme Learning Machines Based on Spark

Anton Akusok1,2(B), Leonardo Espinosa-Leal1, Kaj-Mikael Björk1,2, and Amaury Lendasse3

1 Graduate School and Research, Arcada University of Applied Sciences, Jan-Magnus Janssons plats 1, 00560 Helsinki, Finland
{anton.akusok,leonardo.espinosaleal,kaj-mikael.bjork}@arcada.fi
2 Hanken School of Economics, Arkadiagatan 22, 00100 Helsinki, Finland
3 University of Houston, Houston, TX 77004, USA
[email protected]

Abstract. This work presents a distributed limited-memory algorithm for Extreme Learning Machines (ELM) with training data stored in Spark. The method runs batch matrix computations to obtain the direct non-iterative solution of ELM, reads data only once, and uses the least network bandwidth, making it very computationally efficient. This is achieved by extensive use of lazy evaluation and generators in Spark, deliberately avoiding operations that may lead to data caching. The method scales to virtually infinite datasets and any number of ELM neurons, with runtime being the only restricting factor. An experiment demonstrates the successful processing of 1 TB of text data on an average desktop in 6.5 h without using disk cache or running out of memory. The experimental code is linked on GitHub at https://github.com/akusok/pyspark-elm.

Keywords: ELM · Spark · Big Data

1 Introduction
Extreme Learning Machine (ELM) [5,8] is a fast and simple machine learning model. It is a valuable tool for practical data analysis with performance comparable [6,7] to the best shallow learning models like Random Forest [4]. The progress of computing hardware, cheaper storage, and the widespread Internet of Things lead to a ubiquitous collection of massive datasets. These datasets contain knowledge about processes happening in our world that could be turned into useful services. However, most machine learning models cannot learn directly from massive datasets on common hardware, requiring either large GPU-accelerated clusters in the case of Deep Learning or investing heavily into coding to reduce data size while keeping it relevant for the task at hand. The goal of this research is to find a method of training an ELM model directly on infinitely large datasets. Of course, no dataset can be infinite, but this
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K.-M. Björk (Ed.): ELM 2022, PALO 18, pp. 1–8, 2024. https://doi.org/10.1007/978-3-031-55056-0_1
abstraction produces a set of requirements that this work adopts. A method fulfilling these requirements would allow convenient processing of massive datasets without large extra work:

– read training data only once
– never store intermediate versions of the training data
– constant memory and disk cache demands irrespective of data size
– efficient use of CPU via matrix operations
– efficient use of distributed hardware
This paper selects Spark [13] as the data loading framework. It is the de facto industry standard in data analysis at a massive scale. Spark itself has no machine learning abilities, but it fits perfectly for data cleaning, preparation, and feature selection, including machine-learning-specific features like one-hot encoding in Spark MLLib [10]. There is an existing volume of research on ELM implementations on Spark. Most approaches [9,11] explore iterative training of ELM using Spark's distributed architecture. Others [12] investigate the parallelization of the least-squares solver part of ELM using Spark. This work focuses instead on a computationally efficient way of obtaining a non-iterative direct solution of ELM by processing batches of data in matrix form. It uses Spark to distribute batch computations while keeping network communications to a minimum, working in the "bring code to data" paradigm. The authors have not found similar existing research.

1.1 On Machine Learning in Spark
The aforementioned Spark MLLib includes some models and a distributed matrix data representation. However, its models and training methods are limited, lacking some basic features like a matrix transpose operation. In the opinion of the authors of this work, the reason for the poor machine learning usability of Spark MLLib is the wrong level of abstraction it operates at. The unit of work in Spark is a scalar value in a row of data, like the address of some person. Spark operates at a value level: it can filter out some values, transform them, and group one set of values with another. Spark also supports SQL, a database query language made specifically for convenient operation with values. Contrary to that, machine learning adopts a much higher level of abstraction. The unit of work in machine learning is a feature space, and its goal is to find a projection from one feature space into another that minimises some cost function. Machine learning pays no attention to particular values. Using Spark tools that access each value at column and row coordinates separately for solving machine learning tasks that take all available data as one feature space incurs a tremendous implementation overhead. It also makes some machine learning operations infeasible in Spark, such as matrix transposes for a Spark "matrix" represented as rows of a distributed dataset. This work proposes an explicit change in the level of abstraction in the processing algorithm. Data is loaded and prepared in Spark operating on particular
values. Then, the data is converted into a block matrix form and continues as matrix operations in Python, using Spark for job management but not the calculations. The final solution step is done in Python outside of Spark; it does not depend on data size, so Spark is not needed. The paper continues as follows. The Literature Review section gives an overview of existing Spark ELM implementations. The Methodology section introduces the algorithm and the specific decisions taken to allow the processing of arbitrarily large data. The Experiments section presents a use case for the analysis of 1 TB of text data on an average desktop, and the Summary section concludes this research while giving directions for further development.
2 Methodology
The methodology follows the general algorithm of the High-Performance Extreme Learning Machines paper [1]. It consists of two main steps:

1. Compute the Gram matrix Ω_h = HᵀH and the matrix Ω_t = HᵀY
2. Find the output weights β = solve(Ω_h, Ω_t) with a least-squares solver

Of these two steps, the first takes the most time to compute; it is also the only step that needs access to the training data. The proposed methodology presents a distributed algorithm for computing Ω_h and Ω_t that is memory- and CPU-efficient, based on Spark.

2.1 Designing Algorithm for Infinitely Large Data
Spark provides a multitude of tools for data analysis. One of its key concepts is "lazy evaluation": an operation on data does not start immediately but is instead delayed until some computational result is explicitly requested. Even then, only the requested amount of data is processed instead of everything available. A well-known example is the display() command, which prints a few dozen rows and loads data only for those rows, not for the whole dataset. Lazy evaluation helps the Spark ELM algorithm keep its memory requirements constant by loading data as it goes. Another important consideration is data granularity. Spark processes data one row at a time, but it cannot create a separate task for processing each row, as this would overwhelm the managing node with billions of tasks. Instead, Spark creates a relatively small number of tasks, each processing a relatively large amount of data. The task size is selected for efficient data exchange, with a general guideline of around 100 MB of data per task. Unfortunately, neither of these granularities satisfies the ELM algorithm: processing one row at a time would waste CPU resources that are much more efficient at matrix operations, and 100 MB of data corresponds to millions of rows, which is too large for matrix operations. Spark has a .repartition() operation that can change data granularity, but it shuffles the data randomly, so performing a repartition would load the
whole dataset and, worse, would try to save the repartitioned dataset in the disk cache. Loading the whole dataset and caching it back onto disk is impossible in "infinitely" large data processing. Instead, the proposed algorithm runs a Python function on the existing (too large) partitions that accumulates up to a set number of rows, converts them into a chunk of the input data matrix X, then yields this matrix. Yield is a lazy operation, so the input matrix chunks are generated on the fly as the algorithm proceeds, and there is never a need to fully load the dataset. The actual implementation includes computing the Gram matrix Ω_h = sigmoid(XW)ᵀ sigmoid(XW), where sigmoid(A) is an element-wise operation on the values of matrix A, and yields partial matrices Ω_h instead. In the same way, the algorithm yields partial matrices Ω_t. The final consideration is combining the partial matrices into Ω = Σᵢ Ωᵢ. The standard way of summing up values in Spark is to aggregate them. However, the aggregation function waits until all values are available before starting the aggregation. This is necessary to support arbitrary aggregation functions, but collecting all partial values in memory must not happen when processing an "infinite" dataset. Instead, the algorithm uses a similar function called .fold(). The folding operation is similar to aggregation but can be applied recursively to pairs of values, producing a value of the same type. In addition, fold has a neutral starting value that must not change the result, like the number 0 for summation or the number 1 for multiplication. Unlike aggregation, the fold operation runs instantly every time it receives two values to operate on. This means that as the algorithm generates new partial matrices Ωᵢ, a fold summation reduces them by folding pairs of partial matrices instantly, Ω_new = Ω_a + Ω_b, thus releasing the memory taken by the partial matrices Ωᵢ.

2.2 Algorithm
The distributed limited-memory algorithm for computing the matrices Ω_h and Ω_t for the first stage of the ELM model is presented in Fig. 1. It has five main stages, described below. The figure and description show only the matrix Ω_h for simplicity; the other matrix Ω_t is computed the same way.

1. Load data rows with Spark. Data can be loaded directly from a compressed text file or any other storage. Formatting or preprocessing of data features can be done at this step using Spark SQL.
2. A function is mapped onto Spark data partitions that reads data rows one at a time, collects up to a given number of rows, and converts them into NumPy matrices X of partial input data.
3. This function performs the main computational algorithm for Ω. It receives partial input data matrices X from the previous step and yields the partial matrices Ω, following the lazy evaluation approach.
4. A fold function sums the partial matrices, Ω = Σᵢ Ωᵢ, by folding. It runs instantly every time a new partial matrix is available, thus releasing the computer memory taken by that matrix.
Fig. 1. Algorithm for distributed limited memory ELM computation on Spark. The algorithm outputs Ω matrix, which is the heaviest part of ELM computation and the only one that needs access to raw data.
5. As the algorithm concludes, the folding function returns the full matrix Ω.

The distributed nature of the algorithm is expressed in the .fold() function of step 4, which combines partial matrices Ω from multiple machines. With folding, these matrices are first combined locally at the worker nodes and only transferred to the master node at the end for the final combination, greatly saving network bandwidth. The algorithm extracts parallelism directly from Spark by using its automatically generated partitions. By default, Spark creates one partition per file when loading data from disk. This fits datasets already saved as multiple files, as well as data formats like Parquet that split their data across multiple files. Spark may fail to run in parallel when loading data from a single file; in this case, the data should be prepared by splitting it into separate files, or loaded and split in Spark memory if the data is small. Really large datasets never come in a single file, so this should not be an issue in practice.
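The per-partition generator and its fold combination described above can be sketched in plain Python with NumPy; here `partial_omegas` plays the role of the function mapped onto Spark partitions (function and variable names are illustrative, not the authors' actual code):

```python
import numpy as np

def partial_omegas(rows, W, batch=5000):
    """Accumulate rows from one (large) partition into chunks of X,
    then lazily yield partial Gram matrices sigmoid(XW)^T sigmoid(XW)."""
    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) == batch:
            H = 1.0 / (1.0 + np.exp(-np.asarray(buf) @ W))  # sigmoid(XW)
            yield H.T @ H          # partial Omega_h, freed once folded
            buf = []
    if buf:                        # last, smaller batch of the partition
        H = 1.0 / (1.0 + np.exp(-np.asarray(buf) @ W))
        yield H.T @ H

# In Spark, this generator would be passed to rdd.mapPartitions(...), and the
# partial matrices summed with .fold(np.zeros((L, L)), np.add), which combines
# pairs of matrices as soon as they become available.
```

Because the generator yields one small L×L matrix per batch and the fold consumes it immediately, memory use stays constant regardless of how many rows a partition holds.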
2.3 Large ELM Model Extension
The algorithm exchanges partial matrices Ω, which are square matrices with a size equal to the number of hidden neurons in ELM. These matrices become massive with a large number of neurons, 10,000 and higher, leading to Spark processes running out of Java heap space. This is an undesired limitation because a large amount of data and distributed hardware would allow huge ELM models to be trained efficiently. The solution is to split the hidden-neuron matrix W into chunks and compute the matrices Ω in a two-dimensional grid of the corresponding chunks. The ELM solution can be computed from the same chunk structure; see the paper [2] presenting large ELM computation on mobile devices for more detail. Practically, in step 3, the algorithm yields a number of partial chunks, yield ((i, j), Ω_(i,j)), where (i, j) are chunk coordinates. Then the .fold() function is replaced with the .foldByKey() function, which performs the same operation but only on chunks with the same key. The final matrix Ω is assembled back from its chunks, or a chunk-based ELM solution is applied.
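A minimal sketch of this chunked variant, again in plain NumPy with Spark's .foldByKey() emulated by a local dictionary reduction (chunk size and function names are illustrative):

```python
import numpy as np
from collections import defaultdict

def partial_omega_chunks(H, chunk):
    """Yield ((i, j), H_i^T @ H_j) pairs over a 2-D grid of hidden-neuron chunks."""
    n = H.shape[1]
    for i, a in enumerate(range(0, n, chunk)):
        for j, b in enumerate(range(0, n, chunk)):
            yield (i, j), H[:, a:a + chunk].T @ H[:, b:b + chunk]

def fold_by_key(pairs):
    """Local stand-in for Spark's .foldByKey(0, np.add): sum chunks per key."""
    acc = defaultdict(lambda: 0)
    for key, val in pairs:
        acc[key] = acc[key] + val
    return dict(acc)
```

The full Ω can then be reassembled from the chunk grid with np.block, or the chunk-based ELM solution of [2] can be applied directly, so no single L×L matrix ever has to fit into one Java heap.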
3 Experiments
The experiments demonstrate computing the Gram matrix from 1 TB of text data. The data comes from the Criteo 1 TB Click Logs dataset¹. It is compressed to 350 GB of *.gz files, stored on an external mechanical HDD, and is processed directly, without extracting the archive and storing the extracted files. The data is stored in 24 separate files, providing Spark with 24 parallel tasks, which is enough to fully load a desktop computer with 8 threads. For a more exhaustive experiment, a code version is used that supports computing Ω in chunks of arbitrary size. The ELM model is set to have 200 neurons, the batch size of the Ω matrix is set to 120, and the batch size of the partial X matrices is set to a maximum of 5000 rows (plus the last batch in a Spark partition with however many rows are left). The experimental runtime on a desktop computer with an Intel Core i7-7700K 4-core CPU is shown in Fig. 2. There are 24 tasks corresponding to the 24 Criteo dataset files, with a total runtime of a little over 6.5 h. The total number of processed rows is 4.37 billion. The experimental code file in Jupyter Notebook format is available on GitHub². During the experiment, Spark used less than the available 24 GB of system memory and did not spill to disk. The same code would work for larger ELM models with more neurons, or for larger datasets on the same machine, at the cost of only extra runtime. Using a real Spark cluster with multiple machines would decrease the runtime proportionally to the total number of available CPU cores.
¹ https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/
² https://github.com/akusok/pyspark-elm/blob/main/process_criteo_all.ipynb
Fig. 2. Runtime for computing matrix Ωh from 1 TB of text data stored on an external HDD in compressed form. Simply reading all compressed files on the same machine in parallel takes 1 h.
4 Summary
This paper presents a distributed limited-memory algorithm for training the ELM model on virtually infinite datasets, with the only limitation being the total runtime on a given system. The algorithm shows low memory consumption for the largest datasets, does not need disk caching in Spark, and is demonstrated to compute the Gram matrix Ω of the ELM model in batches, allowing for an arbitrarily large number of neurons without running out of Java heap space in Spark. The algorithm is tested on a 1 TB Criteo Click dataset using a regular 4-core desktop, with data read from an external HDD directly from its compressed format. The demonstrated ease of use should make ELM a more approachable method for extremely large data processing for both industrial users on their Spark clusters and individual researchers on modest consumer hardware. Future work would aim towards integrating PySpark support into the existing Scikit-ELM library [3] and exploring the scalability boundaries of the ELM method on Spark.
References

1. Akusok, A., Björk, K.-M., Miche, Y., Lendasse, A.: High-performance extreme learning machines: a complete toolbox for big data applications. IEEE Access 3, 1011–1025 (2015)
2. Akusok, A., Leal, L.E., Björk, K.-M., Lendasse, A.: High-performance ELM for memory constrained edge computing devices with metal performance shaders. In: Cao, J., Vong, C.M., Miche, Y., Lendasse, A. (eds.) ELM 2019. PALO, vol. 14, pp. 79–88. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-58989-9_9
3. Akusok, A., Leal, L.E., Björk, K.-M., Lendasse, A.: Scikit-ELM: an extreme learning machine toolbox for dynamic and scalable learning. In: Cao, J., Vong, C.M., Miche, Y., Lendasse, A. (eds.) ELM 2019. PALO, vol. 14, pp. 69–78. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-58989-9_8
4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
5. Cambria, E., et al.: Extreme learning machines [trends & controversies]. IEEE Intell. Syst. 28(6), 30–59 (2013)
6. Ding, S., Guo, L., Hou, Y.: Extreme learning machine with kernel model based on deep learning. Neural Comput. Appl. 28(8), 1975–1984 (2017)
7. Huang, G.B., Bai, Z., Kasun, L.L.C., Vong, C.M.: Local receptive fields based extreme learning machine. IEEE Comput. Intell. Mag. 10(2), 18–29 (2015)
8. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1), 489–501 (2006). Selected papers from the 7th Brazilian Symposium on Neural Networks (SBRN 04)
9. Li, Y., Yang, R., Guo, P.: Spark-based parallel OS-ELM algorithm application for short-term load forecasting for massive user data. Electric Power Components Syst. 48(6–7), 603–614 (2020)
10. Meng, X., et al.: MLlib: machine learning in Apache Spark. J. Mach. Learn. Res. 17(1), 1235–1241 (2016)
11. Oneto, L., Bisio, F., Cambria, E., Anguita, D.: Statistical learning theory and ELM for big social data analysis. IEEE Comput. Intell. Mag. 11(3), 45–55 (2016)
12. Ragala, R., Bharadwaja Kumar, G.: Recursive block LU decomposition based ELM in Apache Spark. J. Intell. Fuzzy Syst. 39(6), 8205–8215 (2020)
13. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 2010) (2010)
Massive Offline Signature Forgery Detection with Extreme Learning Machines

Leonardo Espinosa-Leal1(B), Zhen Li1, Renjie Hu2, and Kaj-Mikael Björk1,3

1 Graduate School and Research, Arcada University of Applied Sciences, Jan-Magnus Janssons plats 1, 00560 Helsinki, Finland
{leonardo.espinosaleal,zhen.li,kaj-mikael.bjork}@arcada.fi
2 Department of Information and Logistics Technology, University of Houston, Houston, TX, USA
[email protected]
3 Hanken School of Economics, Arkadiagatan 22, 00100 Helsinki, Finland
Abstract. In this work, we present the results of different machine learning models for detecting offline signature forgeries, trained on the features obtained from a massive dataset of ten thousand users. Features for training are obtained from the last layer of two different convolutional neural networks: Inception21k and Signet. Optimisation of the number of neurons and activation functions of the Extreme Learning Machine (ELM) models is carried out using the Equal Error Rate (EER) as a metric. Our results align with the recent results of other machine learning models. Furthermore, we found that a general-purpose network, Inception21k, performs better for the writer-independent models created.

Keywords: Big Data · Signature Forgery · Extreme Learning Machines

1 Introduction
Biometrics have always been mainstream identification tools to verify identity and credibility. Among these, handwritten signatures are still a widely used identification method in several countries due to tradition, reliability and technological legacy [4]. Forging handwritten signatures is a well-known problem in bank transactions and the legalisation of documents, among many others. Detection of signature forgery falls into two categories: static (offline) and dynamic (online). The first is a post-mortem analysis of the signature, i.e. once the user has emitted the signature, usually on a device or paper that permits saving the trace and other unique information for posterior studies [11]. Dynamic analysis is usually done with devices in real time and on-site; these devices record both the signature trace and other features such as angle or pressure. Additionally, forgeries can be skilled or unskilled, depending on whether the forger has visual access to the signature that she wants to falsify. In this
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K.-M. Björk (Ed.): ELM 2022, PALO 18, pp. 9–15, 2024. https://doi.org/10.1007/978-3-031-55056-0_2
context, the need to design automatic means to detect fake signatures is of paramount importance [13]. In this work, we present a machine learning study using the largest dataset of signatures available for research. The obtained models are trained upon the features obtained via two pre-trained deep convolutional neural networks. We focus our study on the Extreme Learning Machine (ELM) method, comparing the performance of different numbers of neurons and types of activation functions.
2 Related Work
Different artificial intelligence methods have recently been used for identifying signature forgeries [10]. Among these, ELM has been used successfully in different works for offline signature verification [1,5]. The results obtained using ELM have been shown to align with the state of the art regarding the usual metrics, such as the Equal Error Rate (EER). This paper contributes the first massive study on predicting offline forgeries using a reliable and light machine learning model trained on features extracted from two different deep convolutional neural networks.
3 Methodology
In this work, we have used a Python library for efficient extreme learning machine calculations: scikit-ELM [3]. We have explored different parameters to find the optimal model for each configuration, including the number and types of neurons. The information obtained from the GPDSS10000 dataset [7] is images from signatures, both real and fake. A set of vectorial features from these images is extracted using a set of pre-trained neural networks: one trained for a general purpose (Inception21k) [6] and one trained specifically for signature recognition (Signet) [8,9]. These features are used to train an ELM model. A scheme of the pipeline is depicted in Fig. 1.
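Stripped of the CNN feature-extraction machinery, the training step of this pipeline is a random hidden projection followed by a regularised linear solve. A minimal NumPy sketch under that assumption (the paper itself uses the scikit-ELM library; the shapes, the ridge parameter alpha, and the function names here are illustrative):

```python
import numpy as np

def train_elm(F, y, n_neurons=256, activation=lambda a: a, alpha=1e-6, seed=0):
    """Train a basic ELM on CNN feature vectors F (one row per signature image).
    activation defaults to the identity, i.e. the 'linear' neuron type."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(F.shape[1], n_neurons))   # random, untrained input weights
    H = activation(F @ W)                          # hidden-layer representation
    # Ridge-regularised least squares for the output weights beta
    beta = np.linalg.solve(H.T @ H + alpha * np.eye(n_neurons), H.T @ y)
    return W, beta

def predict_elm(F, W, beta, activation=lambda a: a):
    return activation(F @ W) @ beta
```

Sweeping the number of neurons and the neuron type, as done in the experiments below, only changes the width of W and the element-wise nonlinearity, which is why such a grid search stays cheap even on half a million images.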
4 Experiments and Results
We have conducted the experiments on the large dataset GPDSS10000 with 10,000 users' signatures. For each user, there are 24 genuine signatures produced by the user and 30 skilled forgeries. Therefore, there are 540,000 signature images in total. We have experimented with four different feature extractors. For each feature extractor, four different neuron types are tried, and eight neuron numbers from 2² to 2⁹ are used. All the training runs are done with 5-fold cross-validation. This means we have processed about 240 million samples in total for our experiment. ELM is known to be fast to train and infer; therefore, such a big-scale experiment was feasible.
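For reference, the EER metric used throughout is the operating point where the false-acceptance rate (forgeries accepted) equals the false-rejection rate (genuine signatures rejected). A hedged sketch of how it can be estimated from classifier scores (not the authors' evaluation code; the threshold sweep is one common convention):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate EER from scores (higher = more likely genuine) and
    binary labels (1 = genuine, 0 = forgery)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))
    # False-acceptance and false-rejection rates at every candidate threshold
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))   # threshold where the two rates cross
    return (far[i] + frr[i]) / 2.0
```

A perfect classifier gives EER = 0, random scores give EER near 0.5, and the values in Table 1 are reported as percentages of this quantity.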
Fig. 1. General scheme of the pipeline used in this work. Features from signatures are obtained from different neural network architectures and then used to feed an ELM model.
The results are shown in Fig. 2. Interestingly, all the EER curves for a given neuron type and feature extractor pair share a similar trend with respect to the number of neurons. The best EER is achieved at 256 neurons, and there is an increase in EER from 8 to 32 neurons. After 256, the EER starts to rise again (see Table 1). Meanwhile, it is interesting to see that regardless of the feature extractor, linear neurons are much better than the other neuron types. Neuron types also share similar behaviour across feature extractor types, with linear performing best, followed by ReLU, then sigmoid, and hyperbolic tangent worst.

Table 1. EER % (standard deviation %) for the GPDSS10000 dataset with different numbers of neurons and various neuron types, using Inception21k and Signet as feature extractors.

        Inception21k                    Signet
n       lin     relu    sig     tanh    lin     relu    sig     tanh
4       34(17)  38(17)  40(17)  41(17)  37(17)  39(17)  40(16)  41(16)
8       29(17)  33(17)  36(17)  38(17)  31(17)  35(17)  36(16)  37(16)
16      29(17)  34(17)  37(16)  39(16)  32(16)  36(16)  37(16)  38(16)
32      32(17)  37(17)  39(16)  40(17)  35(17)  38(16)  39(16)  40(16)
64      27(17)  32(17)  36(17)  37(16)  29(16)  33(16)  36(16)  36(16)
128     26(16)  31(16)  35(16)  36(16)  28(16)  33(16)  35(16)  36(16)
256     26(16)  31(16)  35(17)  37(16)  27(16)  31(17)  35(16)  35(16)
512     29(17)  33(17)  37(17)  38(16)  31(17)  35(17)  37(16)  37(16)
L. Espinosa-Leal et al.
Fig. 2. EER as a function of the number of neurons for different neuron types and feature extractors
Fig. 3. EER distribution histogram with 256 linear neurons for Inception21k and Signet. The EER range of interest is between 0 and 0.4
This is probably because the extracted features are general enough, so there is no need for complicated non-linear functions to extract further information from the feature space. The observations that EER is best with 256 neurons and that linear neurons outperform the other neuron types are confirmed by a previous study [12] on the smaller MCYT-75 dataset. However, when comparing feature extractors, Inception21k is much better than the Signet family, with a best EER of 0.264. This is quite surprising, as it is the opposite of the observations drawn from previous studies [2,12] on the smaller MCYT-75 dataset. Within the Signet family, Signet is clearly better than Signet 300 dpi and Signet 600 dpi, with a best EER of 0.275; Signet 600 dpi is slightly better than Signet 300 dpi, with EERs of 0.308 and 0.322, respectively. This agrees with the previous studies [12]. The distribution histogram of EER for Inception21k and Signet can be seen in Fig. 3. The histogram shows that the EER distribution is wide whether Inception21k or Signet is used as the feature extractor; therefore, studies conducted on a small dataset can give biased results, especially when accurate EER estimates are needed. Big-data experiments give a better estimate of the effectiveness of different feature extractors for the writer-dependent signature classification task.
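For reference, the equal error rate (EER) reported throughout these results can be computed from per-model match scores roughly as follows. This is a minimal sketch, assuming genuine signatures should score high and forgeries low; it is not the authors' exact evaluation code.

```python
import numpy as np

def equal_error_rate(genuine_scores, forgery_scores):
    """Return the error rate at the threshold where the false acceptance
    rate (FAR, forgeries accepted) is closest to the false rejection
    rate (FRR, genuine signatures rejected)."""
    thresholds = np.sort(np.concatenate([genuine_scores, forgery_scores]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(forgery_scores >= t)
        frr = np.mean(genuine_scores < t)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Perfectly separated scores give an EER of 0
print(equal_error_rate(np.array([0.8, 0.9, 0.95]),
                       np.array([0.1, 0.2, 0.3])))  # → 0.0
```

In a writer-dependent setup, such an EER would be computed per user and the histogram in Fig. 3 would aggregate the per-user values.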
5 Discussions
In this study, we extended forged-signature classification experiments to a much bigger data set composed of 10k users' signatures. The scaled-up experiments confirm some of the findings observed in experiments on a very small data set, MCYT-75. Both the big-data experiments and the very small data experiments find that the linear neuron type is the best for signature classification tasks in a writer-dependent setup. In addition, 256 neurons perform best regardless of experiment scale and feature extractor type. The number
of neurons is probably determined by the feature space, which has 1024 dimensions for Inception21k and 2048 for the Signet family. The scaled-up experiment also strongly suggests that Inception21k is on average much better than the Signet family at extracting signature features for the signature classification task; the small data set experiment results were misleading on this point. Despite the large number of signatures in this study, the number of signature samples per user is still limited. Further study should be conducted to determine what happens when there are more genuine and forged signatures for each user.

Acknowledgments. The authors wish to acknowledge CSC – IT Center for Science, Finland, for computational resources.
References

1. Akusok, A., Espinosa Leal, L., Björk, K.M., Lendasse, A., Hu, R.: Handwriting features based detection of fake signatures. In: The 14th PErvasive Technologies Related to Assistive Environments Conference, pp. 86–89 (2021)
2. Akusok, A., Espinosa-Leal, L., Lendasse, A., Björk, K.M.: Unsupervised handwritten signature verification with extreme learning machines. In: Björk, K.M. (ed.) ELM 2021. PALO, vol. 16, pp. 124–134. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-21678-7_12
3. Akusok, A., Leal, L.E., Björk, K.-M., Lendasse, A.: Scikit-ELM: an extreme learning machine toolbox for dynamic and scalable learning. In: Cao, J., Vong, C.M., Miche, Y., Lendasse, A. (eds.) ELM 2019. PALO, vol. 14, pp. 69–78. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-58989-9_8
4. Bibi, K., Naz, S., Rehman, A.: Biometric signature authentication using machine learning techniques: current trends, challenges and opportunities. Multimed. Tools Appl. 79(1), 289–340 (2020)
5. Espinosa-Leal, L., Akusok, A., Lendasse, A., Björk, K.-M.: Extreme learning machines for signature verification. In: Cao, J., Vong, C.M., Miche, Y., Lendasse, A. (eds.) ELM 2019. PALO, vol. 14, pp. 31–40. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-58989-9_4
6. Espinosa-Leal, L., Akusok, A., Lendasse, A., Björk, K.-M.: Website classification from webpage renders. In: Cao, J., Vong, C.M., Miche, Y., Lendasse, A. (eds.) ELM 2019. PALO, vol. 14, pp. 41–50. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-58989-9_5
7. Ferrer, M.A., Diaz, M., Carmona-Duarte, C., Morales, A.: A behavioral handwriting model for static and dynamic signature synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1041–1053 (2016)
8. Hafemann, L.G., Sabourin, R., Oliveira, L.S.: Writer-independent feature learning for offline signature verification using deep convolutional neural networks. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp. 2576–2583. IEEE (2016)
9. Hafemann, L.G., Sabourin, R., Oliveira, L.S.: Learning features for offline handwritten signature verification using deep convolutional neural networks. Pattern Recogn. 70, 163–176 (2017)
10. Hafemann, L.G., Sabourin, R., Oliveira, L.S.: Offline handwritten signature verification - literature review. In: 2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 1–8. IEEE (2017)
11. Hameed, M.M., Ahmad, R., Kiah, M.L.M., Murtaza, G.: Machine learning-based offline signature verification systems: a systematic review. Signal Process.: Image Commun. 93, 116139 (2021)
12. Li, Z., Espinosa-Leal, L., Lendasse, A., Björk, K.M.: Extreme learning machines for offline forged signature identification. In: Björk, K.M. (ed.) ELM 2021. PALO, vol. 16, pp. 24–31. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-21678-7_3
13. Plamondon, R., Srihari, S.N.: Online and off-line handwriting recognition: a comprehensive survey. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 63–84 (2000)
Importance of the Activation Function in Extreme Learning Machine for Acid Sulfate Soil Classification

Virginia Estévez1(B), Stefan Mattbäck2,3, and Kaj-Mikael Björk1

1 Arcada University of Applied Sciences, 00560 Helsinki, Finland
[email protected]
2 Åbo Akademi University, Geology and Mineralogy, 20500 Åbo, Finland
3 Geological Survey of Finland, PO Box 97, 67101 Kokkola, Finland
Abstract. We have evaluated Extreme Learning Machine for the classification and prediction of acid sulfate soils. We have analyzed how different hyperparameters and their combinations affect the performance of Extreme Learning Machine. In addition to the activation functions, the impact of the number of hidden neurons as well as the number of features has been studied. We show that the performance of Extreme Learning Machine strongly depends on the hyperparameters. In some cases the model is unable to classify acid sulfate soils, while in others it is not only capable, but shows good accuracy in its predictions. This is largely due to the activation function. Whereas with the Sigmoid or Hyperbolic tangent activation functions the model can give good results, with ReLU it does not work for classifying acid sulfate soils. In general, Extreme Learning Machine performs better for a larger number of input features. Furthermore, there is a clear correlation between the number of hidden neurons and the number of features of the input layer in the cases in which the model performs best.

Keywords: extreme learning machine · activation functions · hyperparameters · acid sulfate soils · classification

1 Introduction
Acid sulfate (AS) soils may cause severe ecological damage due to the oxidation of naturally occurring sulfide minerals and the subsequent mobilization of toxic metals. Among the main problems are the deterioration of watercourses, which may lead to the annihilation of their aquatic life, and problems in agriculture or in the maintenance of infrastructure [1]. To minimize potential environmental problems, these soils must be used in the correct way, and for this their localization is essential. Today, machine learning plays a fundamental role in the creation of digital soil maps [2], because with machine learning techniques the mapping process is much faster, less expensive, and more objective and accurate than with traditional methods.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K.-M. Björk (Ed.): ELM 2022, PALO 18, pp. 16–25, 2024. https://doi.org/10.1007/978-3-031-55056-0_3

So far,
different machine learning techniques have been considered for the classification, prediction and mapping of AS soils: Artificial Neural Networks (ANN) [3–5], Fuzzy logic [6], Fuzzy k-means [7], Convolutional Neural Networks (CNN) [8,9], Random Forest (RF) [8,10,11], Gradient Boosting (GB) [8,10,11], Support Vector Machines (SVM) [8,10] and Extreme Learning Machine (ELM) [12]. Among these methods, RF and GB stand out: previous studies have shown that they are very good machine learning methods for the prediction and classification of AS soils [8,10,11]. ELM, on the contrary, has hardly been used for the classification of AS soils. Although ELM was capable of predicting AS soils, its results were worse than those obtained with RF and GB. This makes us wonder whether ELM is really a good method for predicting AS soils. In this study, we have analyzed ELM for the classification of AS soils and how its performance is affected by different hyperparameters, a dependency that has not yet been fully explored. The hyperparameters evaluated here are the activation function, the number of hidden neurons and the number of input features. For this analysis, we have considered two different datasets already used in previous works [8,10,11]. This has allowed us not only to analyze the suitability of the ELM model for the different parameters, but also to compare its results with those obtained by other machine learning methods. The document is organized as follows: the next two sections describe the datasets (Sect. 2) and the ELM method (Sect. 3). The performance of ELM and its dependence on hyperparameters for AS soil classification is explained and discussed in Sect. 4. The conclusions are presented in Sect. 5.
2 Datasets
In this study, we have considered two different datasets corresponding to a 905 km² study area located in southeastern Finland. The datasets consist of two different types of data: soil observations and environmental covariates. The soil samples have been provided by the Geological Survey of Finland (GTK) and were classified into AS and non-AS soils. This classification was carried out taking into account the soil-pH values and following the method explained in [13]. The soil samples are the same in both datasets. There are 186 samples, 93 for each class. For the modeling, the samples have been divided into two sets, 80% for training and 20% for validation; the same split has been applied within each class. The training and validation sets are the same as those used in previous works [10,11]. The environmental covariates are raster data created from remote sensing data, and can give useful information for the characterization of AS soils. The difference between the two datasets considered in this study is the number of covariates: the first dataset has five environmental covariates, whereas the second has 17. Both datasets have been used in previous studies, the first dataset in [8,10] and the second one in [11]. All the environmental covariates were created in a previous work [8]. In this study, several types of
environmental covariates have been used: Quaternary geology, aerogeophysics layers, a digital elevation model and terrain layers. In the first dataset, in addition to the Quaternary geology and the digital elevation model (DEM), two aerogeophysics layers and one terrain layer have been considered. In the second dataset, eleven more terrain layers and another aerogeophysics layer have been added. For more information, see previous works [10,11].
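The per-class 80/20 split described in this section can be sketched as follows. This is a minimal NumPy illustration with the sample counts from the text (186 samples, 93 per class); the `stratified_split` helper is hypothetical, not the authors' code.

```python
import numpy as np

def stratified_split(y, train_frac=0.8, seed=0):
    """Return train/validation index arrays, applying the same
    train_frac split inside each class label of y."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        cut = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:cut])
        val_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(val_idx)

# 186 samples, 93 per class, as in the datasets of this study
y = np.repeat([0, 1], 93)
tr, va = stratified_split(y)
print(len(tr), len(va))  # → 148 38
```

Stratifying the split per class keeps the training and validation sets as balanced as the full dataset.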
3 Method

3.1 Extreme Learning Machine
Extreme Learning Machine (ELM) [14] is an effective method whose main advantage is that it is extremely fast. ELM has been widely used for classification and regression problems, image processing, pattern recognition, forecasting and diagnostics [15]. In this study, we have applied ELM to the classification of AS soils. ELM is a learning algorithm for single hidden layer feedforward neural networks (SLFNs) [14]. Unlike other neural networks, ELM does not need gradient-based backpropagation. Figure 1 shows a typical schematic diagram of ELM. Its structure consists of an input layer, a hidden layer where the activation function is applied, and an output layer. A characteristic of ELM is the random assignment of the weights between the input layer and the hidden layer; these weights require no further adjustment. Once the input weights and biases are determined, the second step of the ELM algorithm is the calculation of the hidden layer output matrix, and the next step the calculation of the output weights. Another novelty of this method is that the weights between the hidden layer and the output layer are determined by the Moore-Penrose generalized inverse. On the other hand, the application of the activation function in the hidden layer gives the method its non-linear behavior.

3.2 Activation Functions
The activation function inside the hidden layer determines what is learned from the training dataset. Thus, the selection of the activation function is fundamental for a good performance of ELM. In this study, three different activation functions have been considered: ReLU, Sigmoid and Hyperbolic Tangent. Figure 2 shows these activation functions.

ReLU: The rectified linear activation function, better known as ReLU, is a frequently used activation function for hidden layers. This function converts negative input values to zero, while positive input values are unchanged (Fig. 2):

f(x) = max(0, x)   (1)
Fig. 1. Schematic diagram of an Extreme Learning Machine (ELM).
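The three steps of the ELM algorithm described in Sect. 3.1 (random input weights and biases, hidden layer output matrix, output weights via the Moore-Penrose pseudoinverse) can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the study; function names and defaults are ours.

```python
import numpy as np

def elm_train(X, y, n_hidden, activation=np.tanh, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random input weights and biases, never adjusted afterwards
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    # Step 2: hidden layer output matrix H
    H = activation(X @ W + b)
    # Step 3: output weights via the Moore-Penrose pseudoinverse
    beta = np.linalg.pinv(H) @ y
    return W, b, beta

def elm_predict(X, W, b, beta, activation=np.tanh):
    return activation(X @ W + b) @ beta

# Toy usage: fit a linear target; with enough hidden neurons the fit is near exact
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, 2.0, 3.0])
W, b, beta = elm_train(X, y, n_hidden=100)
```

Because only `beta` is solved for, in closed form, training reduces to one matrix product and one pseudoinverse, which is what makes ELM so fast.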
Sigmoid: The Sigmoid activation function is typically applied in binary classification problems. It maps the output values into the range [0, 1], see Fig. 2. This function is given by:

f(x) = 1 / (1 + e^(−x))   (2)
Hyperbolic tangent: The Hyperbolic tangent (Tanh) activation function is quite similar to Sigmoid, but it maps the output values into the range [−1, 1], see Fig. 2. This function is calculated as:

f(x) = (e^x − e^(−x)) / (e^x + e^(−x))   (3)

3.3 Metrics for Evaluation
The metrics used to evaluate the performance of the ELM model in the classification of AS soils are those related to the confusion matrix: precision, recall and F1-score [16]. These metrics are frequently used in binary classification problems such as the one analyzed in this work.
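A minimal sketch of how these per-class metrics follow from confusion-matrix counts; the counts in the example are hypothetical and not taken from the tables below.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from confusion-matrix counts:
    tp = true positives, fp = false positives, fn = false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 63 of 93 positives found, 31 false alarms
p, r, f = precision_recall_f1(tp=63, fp=31, fn=30)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.67 0.68 0.67
```

In a binary problem these metrics are computed once per class, treating that class as the positive one, which is why the tables below report a row for AS and a row for non-AS.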
Fig. 2. Activation Functions: rectified linear activation function (ReLU), sigmoid activation function (Sigmoid) and hyperbolic tangent activation function (Tanh).
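For reference, Eqs. (1)–(3) can be written directly in NumPy (a minimal sketch):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                                    # Eq. (1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                              # Eq. (2)

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # Eq. (3)

print(relu(-2.0), sigmoid(0.0), tanh(0.0))  # → 0.0 0.5 0.0
```

In practice `np.tanh` would be used directly for Eq. (3); the explicit form is shown only to match the equation.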
4 Results
In this study, we have analyzed how the classification of AS soils is affected by the different hyperparameters of ELM. For this, we have studied the dependence on the activation function, the number of hidden neurons and the number of features. All the code used in this study has been written in Python. First, we have considered the dataset with five environmental covariates, i.e., five features in the input layer. The number of hidden neurons analyzed is in the range between 10 and 100. The results obtained for the three activation functions are shown in Table 1, where the metrics related to the confusion matrix are given for both classes, AS and non-AS soils. Although the metrics are the same, there is a big difference between the activation functions. For example, when ReLU is the activation function, these results are obtained only in the case of 12 hidden neurons. In the rest of the cases, the model does not work: it is completely unable to distinguish between the two classes. Most of the time, the value of the recall is 1 for one class and 0 for the other, meaning that the model assigns all the samples to a single class. Thus, ELM is not suitable for the classification of AS soils with the ReLU activation function. Table 2 shows, for each activation function, the ranges of hidden neurons for which the model works best and for which it does not work.

Table 1. Metrics related to the confusion matrix for ELM in the case of 5 features.
Class    | Precision | Recall | F1-score
non-AS   | 0.67      | 0.63   | 0.65
AS       | 0.65      | 0.68   | 0.67
Table 2. Ranges of hidden neurons for which the model works best and for which it does not work, for each activation function, in the case of 5 features.

Activation Function | Hidden Neurons: ELM works best | Hidden Neurons: ELM does not work
ReLU                | 12                             | 14–100
Sigmoid             | 12–26                          | 40–100
Tanh                | 20–30                          | 40–100
With the Sigmoid activation function, ELM is more stable. The best results are obtained in a range between 12 and 26 hidden neurons. From 28 hidden neurons the results start to get worse, and from 40 hidden neurons the model stops working, see Table 2; as with ReLU, ELM then assigns all samples to the same class. In the case of the Tanh activation function, the results shown in Table 1 are obtained in the range between 20 and 30 hidden neurons, slightly higher than with Sigmoid (Table 2). For a smaller number of hidden neurons, ELM works better or worse depending on the exact number. In the range from 10 to 14 neurons, the results are between 0 and 5 percentage points worse; the same results are obtained from 32 to 38 neurons. However, for 16 and 18 hidden neurons, ELM gives its best results, improving the metrics by 0–6 percentage points, see Table 3. As with the other activation functions, ELM is not able to distinguish the classes for more than 40 hidden neurons. As can be seen, a larger number of hidden neurons does not necessarily improve the performance of ELM.

Table 3. Metrics related to the confusion matrix for the best results of ELM in the case of 5 features with the Tanh activation function.
Class    | Precision | Recall | F1-score
non-AS   | 0.71      | 0.63   | 0.67
AS       | 0.67      | 0.74   | 0.70
In this study, we have also analyzed the effect of the number of features on the performance of ELM for the classification of AS soils. For this, we have considered the dataset with 17 environmental covariates and carried out the same study as for five covariates: the three activation functions have been evaluated for different numbers of hidden neurons (between 10 and 100). In the case of the ReLU activation function, the best results are obtained in a range between 12 and 20 hidden neurons, see Tables 4 and 5. These results are equal to the best results obtained for five covariates with Tanh, but with the classes interchanged, see Table 3. From 22 neurons, ELM shows poorer results, and from 32 neurons ELM is not able to distinguish between the classes. The increase in the number of features has allowed the model to classify the AS soils in more cases. However, comparing the values of the recall in Tables 1 and 4, it can be
seen that the prediction of the model for AS soils is worse than in the case of 5 covariates; here, the results improve for non-AS soils instead. This is also observed for the other two activation functions for a low number of hidden neurons. Regardless of whether the results are better or worse, they are always better for non-AS soils. This is quite curious, since the dataset is balanced, and it shows that ELM is not working correctly in that range of hidden neurons (Table 5). In the case of the Sigmoid activation function, ELM classifies non-AS soils better for 10 to 24 hidden neurons. Between 26 and 32 neurons, the model starts to give better results also for AS soils. In the range between 34 and 38 neurons, the results obtained are the same as those obtained with Random Forest for the dataset with five environmental covariates in a previous work [10]. As the number of neurons increases, between 40 and 60, ELM behaves better, reaching the results obtained with RF and Gradient Boosting (GB) in a previous work for the dataset with 5 covariates [10]. These results can be seen in Table 4, and the corresponding ranges of hidden neurons in Table 5. From 70 hidden neurons, the results are worse and overfitting begins to be observed. Similar results are obtained for the same ranges of hidden neurons in the case of the Tanh activation function. The only difference is that with Tanh the model in some cases gives results that lie between those of RF and GB for the case of 5 environmental covariates, slightly better than RF but slightly worse than GB.

Table 4. Metrics related to the confusion matrix for ELM in the case of 17 features.
Activation Function | Class  | Precision | Recall | F1-score
ReLU                | non-AS | 0.67      | 0.74   | 0.70
                    | AS     | 0.71      | 0.63   | 0.67
Sigmoid, Tanh       | non-AS | 0.72      | 0.68   | 0.70
                    | AS     | 0.70      | 0.74   | 0.72
Sigmoid, Tanh       | non-AS | 0.78      | 0.74   | 0.76
                    | AS     | 0.75      | 0.79   | 0.77
As can be seen, ELM works better for a larger number of features. Furthermore, there is a correlation between the number of features in the input layer and the number of hidden neurons in the hidden layer: for a higher number of features, a larger number of hidden neurons is required for a good performance of the model. For five features, the range of hidden neurons where the model gives the best results is 12 to 26 with Sigmoid and 16 to 30 with Tanh, see Table 2. For 17 features, the best results are obtained in a range between 40 and 60 hidden neurons for both Sigmoid and Tanh, see Table 5.
ELM for Acid Sulfate Soil Classification
23
Table 5. Ranges of hidden neurons for which the model works best and for which it does not work, for each activation function, in the case of 17 features.

Activation Function | Hidden Neurons: ELM works best | Hidden Neurons: ELM does not work
ReLU                | 12–20                          | 32–100
Sigmoid             | 40–60                          | –
Tanh                | 40–60                          | –
On the other hand, the comparison of the results obtained in this study with previous works gives an idea of how the model performs in general. In the case of five features, ELM gives poorer results than RF and GB on the same dataset, specifically between 5 and 6 percentage points worse than RF and between 10 and 11 percentage points worse than GB [10]. ELM reaches the good results obtained with RF and GB for five covariates only when the number of features is increased to 17. However, for the dataset with 17 covariates, RF yields results that are between 0 and 5 percentage points better than the best results obtained with ELM [11]; the best results of ELM do manage to match the GB results in this case [11].
5 Conclusions
In this study, we have analyzed how different hyperparameters affect the ELM model in the classification of AS soils. The hyperparameters are the activation function, the number of hidden neurons and the number of input features. We have shown that the performance of ELM for the classification of AS soils strongly depends on the hyperparameter values. For some combinations of these hyperparameters the model does not work, as it is not able to distinguish between the two classes; in other cases, ELM classifies AS soils properly. The activation function plays a fundamental role in the performance of the model. With Sigmoid and Hyperbolic Tangent (Tanh), ELM has given very good results for specific numbers of hidden neurons in the case of 17 features. However, ReLU should not be used as the activation function in ELM for AS soil classification, because in most cases the model does not work, and when it does, the results are not good. In addition, we have shown that there is a correlation between the number of features and the number of hidden neurons: as the number of features increases, the number of hidden neurons for which ELM gives the best results also increases. We have shown that ELM is a suitable method for the classification of AS soils under some conditions. However, in the case of five input features, the RF and GB models always outperform ELM: the results achieved with RF and GB are between 5–6 and 10–11 percentage points better, respectively, than those of ELM. ELM only manages to match the results of the GB model when the number of covariates is increased to 17, which demonstrates that ELM works better for a larger number of features, although RF also outperforms ELM in this case.
24
V. Est´evez et al.
Acknowledgments. This work has been supported by Stiftelsen Arcada foundation (Finland) and Hasuriski project.
References

1. Michael, P.S.: Ecological impacts and management of acid sulphate soil: a review. Asian J. Water, Environ. Pollut. 10(4), 13–24 (2013)
2. McBratney, A., Mendonça Santos, M.L., Minasny, B.: On digital soil mapping. Geoderma 117, 3–52 (2003)
3. Beucher, A., Österholm, P., Martinkauppi, A., Edén, P., Fröjdö, S.: Artificial neural network for acid sulfate soil mapping: application to the Sirppujoki river catchment area, south-western Finland. J. Geochem. Explor. 125, 46–55 (2013)
4. Beucher, A., Siemssen, R., Fröjdö, S., Österholm, P., Martinkauppi, A., Edén, P.: Artificial neural network for mapping and characterization of acid sulfate soils: application to the Sirppujoki river catchment, southwestern Finland. Geoderma 247–248, 38–50 (2015)
5. Beucher, A., Adhikari, K., Breuning-Madsen, H., Greve, M.B., Österholm, P., Fröjdö, S., et al.: Mapping potential acid sulfate soils in Denmark using legacy data and LiDAR-based derivatives. Geoderma 308, 363–372 (2017)
6. Beucher, A., Fröjdö, S., Österholm, P., Martinkauppi, A., Edén, P.: Fuzzy logic for acid sulfate soil mapping: application to the southern part of the Finnish coastal areas. Geoderma 226–227, 21–30 (2014)
7. Huang, J., Nhan, T., Wong, V.N.L., Johnston, S.G., Lark, R.M., Triantafilis, J.: Digital soil mapping of a coastal acid sulfate soil landscape. Soil Res. 52, 327–339 (2014)
8. Estévez Nuño, V.: Machine learning methods for classification of acid sulfate soils in Virolahti. Master's thesis, Arcada University of Applied Sciences, Jan-Magnus Janssons plats 1, 00560 Helsinki, Finland (2020)
9. Beucher, A., Rasmussen, C.B., Moeslund, T.B., Greve, M.H.: Interpretation of convolutional neural networks for acid sulfate soil classification. Front. Environ. Sci. 9, 809995 (2022). https://doi.org/10.3389/fenvs.2021.809995
10. Estévez, V., et al.: Machine learning techniques for acid sulfate soil mapping in southeastern Finland. Geoderma 406, 115446 (2022). https://doi.org/10.1016/j.geoderma.2021.115446
11. Estévez, V., Mattbäck, S., Boman, A., Beucher, A., Björk, K.-M., Österholm, P.: Improving prediction accuracy for acid sulfate soil mapping by means of variable selection. Front. Environ. Sci. 11, 1213069 (2023). https://doi.org/10.3389/fenvs.2023.1213069
12. Akusok, A., Björk, K.M., Estévez, V., Boman, A.: Randomized model structure selection approach for extreme learning machine applied to acid sulfate soil detection. In: Björk, K.M. (ed.) ELM 2021. PALO, vol. 16, pp. 32–40. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-21678-7_4
13. Boman, A., et al.: Classification of acid sulphate soils in Finland and Sweden. Appendix 1, 8 p. In: Broman et al. 2019, Coastal watercourses - Methodological Development and Restoration. Final report, Interreg Nord 2014–2020, 189 p. (2019). https://www.lansstyrelsen.se/norrbotten/tjanster/publikationer/coastal-watercourses-methodological-development-and-restoration.html
14. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1), 489–501 (2006)
15. Ding, S., Xu, X., Nie, R.: Extreme learning machine and its applications. Neural Comput. Applic. 25, 549–556 (2014)
16. Powers, D.M.W.: Evaluation: from precision, recall, and F-measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2, 37–63 (2011)
TinyThrow - Improved Lightweight Real-Time High-Rise Littering Object Detection Algorithm

Tianyu Ji and Pengjiang Qian(B)

School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
[email protected]
Abstract. Due to the high image resolution and small object size in high-rise littering scenes, current object detection algorithms suffer from poor real-time performance and low detection accuracy. An improved lightweight real-time high-rise littering detection algorithm based on YOLOv5, TinyThrow, is proposed to overcome these challenges. The contribution of this paper is twofold: 1) C3Trans, a module that fuses a self-attention mechanism with the traditional convolution module, is included to strengthen the global feature filtering capability of our method. 2) An improved fast spatial-channel attention mechanism module, FCBAM, is adopted to improve our algorithm's ability to locate and identify key targets. The experimental results show that TinyThrow performs detection at a rate of 37.3 frames per second with an mAP@0.5 of 85.5%, which is 4.5% higher than the original algorithm, using a weight file of only 3.9 MB, meeting the requirements of the lightweight real-time high-rise littering object detection task.

Keywords: Object detection · Throwing objects from high altitude · YOLO · Attention mechanism
1 Introduction

High-rise littering refers to the act of throwing objects from a height, such as a building, to the ground. Real-time detection and early warning of this behavior has therefore become a current research priority. Early scholars used traditional machine learning methods based on image processing, preprocessing the original image and performing feature extraction manually, including inter-frame difference [1], background difference, optical flow analysis, mixed Gaussian background modeling, Kalman filter and mean shift algorithms [2]. Later, scholars started to apply deep learning to the task. Song et al. [3] introduced the FPN network [4] into the two-stage object detection algorithm Faster R-CNN [5] and used the ImageNet vehicle dataset for training to detect vehicle targets in complex scenes. Yuan et al. [6] proposed a DRN-VTWI method for vehicle window littering detection based on the deep residual network ResNet-20. The authors also tried to improve the one-stage algorithm YOLOv3 [7] to obtain a faster version.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K.-M. Björk (Ed.): ELM 2022, PALO 18, pp. 26–36, 2024. https://doi.org/10.1007/978-3-031-55056-0_4
The task of high-rise littering detection needs to identify as many objects as possible in a fleeting period. Two-stage algorithms take a long time to recognize a single image, which seriously affects real-time performance; one-stage algorithms are faster, but their accuracy is inferior to that of two-stage algorithms. In view of the above problems, a novel lightweight real-time high-rise littering object detection algorithm, TinyThrow, is proposed on the basis of YOLOv5. The following is an overview of this paper's primary contributions: 1) ViT is introduced to build the C3Trans module, so that the network can not only learn information in the current image, but also extract features from the context frames, improving the overall classification ability of the network. 2) A fast hybrid-domain attention mechanism, FCBAM, is proposed to speed up computation as well as enhance the capability of target localization.
2 Related Work

2.1 YOLOv5

Input, Backbone, Neck, and Head are the four components that make up YOLOv5. After pre-processing operations such as Mosaic data augmentation, auto-anchor, and unified scaling of image sizes, input images and initial parameters that meet the requirements of the network are obtained. The Backbone of YOLOv5 is CSPDarknet53, including CSPNet [8] and SPPF [9], which reduce calculation and parameters as well as improve the feature extraction ability. The Neck of YOLOv5 is PANet [10]; by upsampling and downsampling, it gives the low-level feature maps stronger semantic information, which greatly enhances the feature fusion capability so as to detect targets of different sizes. The loss function for anchor box regression in the Head is CIoU Loss [11].

2.2 Attention Mechanism in Computer Vision

The attention mechanism [12] first found use in natural language processing, and the domain of computer vision has seen significant advancements with it as well. Attention in CV originates from the human visual system. For example, in the high-rise littering object detection task, we tend to focus our attention on the thrown targets of interest and pay less attention to background objects such as buildings and trees. In contrast, traditional neural networks extract all image features equally, so the network can be overwhelmed by the much larger proportion of background information. If the network can devote more attention to the target region and suppress irrelevant background information, valuable features can be learned quickly from complicated image features.
T. Ji and P. Qian
3 Proposed Method

3.1 Overview

Based on the YOLOv5n model, with a depth scaling factor of 0.33 and a width scaling factor of 0.25, this paper proposes an improved lightweight real-time high-rise littering object detection algorithm, TinyThrow, which mainly introduces the C3Trans and FCBAM modules. The overall structure of TinyThrow is shown in Fig. 1.
Fig. 1. TinyThrow overall structure
3.2 C3Trans

Transformer [13] is a concept founded on the self-attention mechanism, which is commonly utilized in the area of NLP. The general expression of the self-attention mechanism is Scaled Dot-Product Self-Attention, as shown in (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

Firstly, the input word matrix A is multiplied with initialized random learnable matrices to obtain Query, Key, and Value respectively. Then Q is dotted with K^T to calculate the correlation of each word vector with the other word vectors, i.e., the attention score α. To improve the stability of gradient propagation during training, the calculated α is divided by √d_k and activated to generate the normalized α̂. Finally, the normalized correlation coefficient matrix is multiplied with the matrix V, i.e., a weighted sum over the word vectors at the matrix element level, to obtain the final attention output B.
Taking the calculation of the first element b1 of the matrix B as an example, the entire calculation flow is shown in Fig. 2, where the lowercase vectors are elements of the corresponding uppercase matrix.
Fig. 2. Self-attention calculation flow
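As a concrete illustration of (1), the following is a minimal NumPy sketch of scaled dot-product attention. It is not the paper's C3Trans implementation; the function name is illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (n, d_k) matrices of query, key, and value vectors.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise correlations (alpha)
    # Row-wise softmax gives the normalised coefficients (alpha-hat).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum -> output B
```

Each output row is a convex combination of the value vectors, weighted by how strongly its query correlates with each key.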
Dosovitskiy et al. [14] first introduced the Transformer into computer vision by proposing the Vision Transformer. A 3-dimensional image is split into n × n patches, which are embedded and serialized into a sequence, like words in NLP. Positional Encoding, which can be trained to carry positional information, is added to preserve the positional relationships among the patches in the image. Since this is an image classification task, an additional patch with a learnable class token “*” is added at position 0, and the sequence is finally fed to the Transformer encoder for feature learning. The Multi-Head Self-Attention module enables the network to learn not only from the current image patch but also from the other patches in the context to gather semantic information, enhancing the framework's capacity to retrieve characteristics at the global scale. Figure 3(a) displays the construction of the Transformer encoder. Figure 4 depicts the Vision Transformer.
Fig. 3. Transformer encoder and C3Trans
Inspired by the above literature, this paper introduces the self-attention mechanism of the Transformer into YOLOv5, replaces the Bottleneck in the original C3 module with the Transformer encoder containing Multi-Head Self-Attention, and names the result the C3Trans module, as shown in Fig. 3(b).

3.3 FCBAM

Figure 5(a) illustrates CBAM, a hybrid-domain attention mechanism module introduced by Woo et al. [15] that combines a channel attention module (CAM) and a spatial attention module (SAM). CAM attends to “what” in the image is meaningful. The input feature F is pooled by global maximum pooling and global average pooling respectively, and both results are delivered to a shared feedforward network. The channel attention map Mc is constructed by element-wise adding and activating the two outputs, as shown in Fig. 6(a).
Fig. 4. Vision Transformer (ViT) model diagram
SAM focuses on “where” in the image the meaningful information is. Mc (the CAM output) is multiplied by the input F to acquire the modified F′, which is fed into SAM. Similar to CAM, the two pooling methods are performed first in SAM, but a 7 × 7 convolution is then applied to amalgamate the two feature maps, and the spatial attention map Ms is generated by a Sigmoid activation function. As seen in Fig. 6(b), the eventual output F″ is created by multiplying Ms and F′.

Due to the high real-time requirements of the high-rise littering object detection scenario, it is necessary to reduce resource consumption and inference time. Without affecting the accuracy too much, this paper proposes a lighter FCBAM (Fast CBAM) module. FCBAM replaces the ordinary convolutions in the Conv modules of YOLOv5 and replaces every 7 × 7 convolution in SAM with three 3 × 3 convolutions without changing the output feature size. This improvement changes the original serial operation to a parallel operation and reduces the number of element-wise products from 2 to 1, which speeds up the computation with acceptable performance loss, as shown in Fig. 5(b).
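The CAM/SAM pipeline described above can be sketched as follows. This is an illustrative NumPy toy with random weights, not the authors' FCBAM implementation; `channel_attention` and `spatial_attention` are hypothetical helper names, and the single 7 × 7 kernel stands in for CBAM's fused convolution (FCBAM would use stacked 3 × 3 convolutions instead):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """CAM: attends to 'what' is meaningful. F has shape (C, H, W)."""
    max_pool = F.max(axis=(1, 2))                  # (C,)
    avg_pool = F.mean(axis=(1, 2))                 # (C,)
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)   # shared two-layer MLP
    Mc = sigmoid(mlp(max_pool) + mlp(avg_pool))    # channel map, (C,)
    return F * Mc[:, None, None]                   # reweight channels

def spatial_attention(F, kernel):
    """SAM: attends to 'where'. kernel has shape (2, k, k)."""
    C, H, W = F.shape
    maps = np.stack([F.max(axis=0), F.mean(axis=0)])   # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(maps, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((H, W))
    for i in range(H):                  # naive 'same' convolution
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    Ms = sigmoid(out)                   # spatial map, (H, W)
    return F * Ms[None, :, :]           # reweight spatial positions

rng = np.random.default_rng(0)
F = rng.normal(size=(8, 16, 16))        # toy (C, H, W) feature map
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 4))
refined = spatial_attention(channel_attention(F, W1, W2),
                            rng.normal(size=(2, 7, 7)))
```

Both attention maps lie in (0, 1), so they rescale rather than replace the feature responses; the feature map shape is preserved end to end.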
Fig. 5. CBAM and FCBAM
Fig. 6. CAM and SAM
4 Experimental Results and Analysis

4.1 Experimental Details and Dataset

The training set and validation set in this paper are composed of VOC2007, VOC2012, and a garbage classification competition dataset. A total of 3115 images are scaled to 640 × 640 and cover three classes: bottle, box, and garbage. The test set consists of actual high-rise littering photos, captured with a NIKON D7100 camera from a fixed position on the ground towards the windows of a tall building, with a resolution of 6000 × 4000. Table 1 presents a breakdown of the dataset.

Table 1. Dataset breakdown.

Dataset         Bottle  Box  Garbage  Background  Total
Training set    969     919  917      700         3505
Validation set  107     102  101      70          380
Test set        94      133  43       6           276
All experiments in this paper were conducted on a graphics workstation with an Intel® Core™ i7-6850K CPU @ 3.60 GHz × 12, 128 GB of RAM, and NVIDIA GeForce GTX 1080 Ti and NVIDIA TITAN X GPUs, both with 11 GB of video memory. The operating system is 64-bit Ubuntu 21.04, the code runtime is Python 3.9.10, the deep learning framework is PyTorch 1.10.2, and CUDA 11.3 is used for GPU acceleration. The training setup and hyperparameters are described in Table 2.

Table 2. Training setup and hyperparameters.

Optimizer  Epoch  Batch Size  Lr0   Momentum  Weight Decay
SGD        300    16          0.01  0.937     0.0005
4.2 Contrast Experiment

We compare TinyThrow with the common two-stage algorithm Faster R-CNN [5] and the one-stage algorithms SSD [16], RetinaNet [17], EfficientDet [18], and the YOLO series [7, 19]. Our algorithm achieves good performance among lightweight algorithms together with a real-time inference speed. The experiment results, compared by speed and resource consumption, are shown in Table 3 and Fig. 7.

Table 3. Contrast experiment results (YOLOv5n is the baseline).

Size         Name             Precision  Recall  mAP@0.5  FPS    Weight↓
Lightweight  TinyThrow        89.6%      78.9%   85.5%    37.3   3.9 MB
             YOLOv5n*         85.7%      73.3%   81.0%    144.9  3.9 MB
Small        YOLOv5s          90.2%      75.1%   83.0%    135.1  14.4 MB
             EfficientDet-P0  84.5%      65.5%   73.7%    25.6   15.8 MB
Medium       YOLOv5m          91.0%      76.9%   84.2%    85.5   42.2 MB
             EfficientDet-P3  89.1%      67.8%   77.2%    11.6   48.5 MB
Large        YOLOv5l          92.5%      78.0%   85.5%    57.8   92.8 MB
             SSD              92.9%      77.2%   81.0%    111.1  96.1 MB
             YOLOv3           91.5%      76.6%   83.5%    47.2   123.5 MB
             Faster R-CNN     63.7%      79.1%   78.0%    19.0   133.5 MB
             RetinaNet        84.5%      76.9%   82.9%    39.0   145.8 MB
             ScaledYOLOv4     93.4%      77.8%   85.4%    30.8   215.5 MB
Fig. 7. mAP@0.5 scatter diagram
4.3 Ablation Experiment

We introduced each module into YOLOv5n for ablation experiments, which show that TinyThrow is a noticeable upgrade over the original approach in detection accuracy and training stability. Table 4 presents the findings, and the training process curves are portrayed in Fig. 8.
Table 4. Ablation experiment results.

Name          Precision  Recall  mAP@0.5  Parameters  mAP@0.5↑
TinyThrow     89.6%      78.9%   85.5%    1.792M      ↑4.5%
-FCBAM        82.4%      78.0%   82.3%    1.779M      ↑1.3%
-C3Trans      80.7%      77.4%   81.7%    1.791M      ↑0.7%
-v6 features  85.7%      73.3%   81.0%    1.778M      baseline
Fig. 8. Training process curves
5 Conclusion

To address existing algorithms' poor real-time performance, low recall, and high missed detection rate, this paper proposes TinyThrow, an improved lightweight real-time high-rise littering object detection algorithm based on YOLOv5n. The self-attention module C3Trans is developed, which enhances the algorithm's ability to extract features, and the fast channel-spatial attention module FCBAM is proposed, which boosts target localization. According to the experimental findings, the optimized architecture achieves the performance of YOLOv5l without increasing the weight file size, improves the average accuracy (mAP@0.5) by 4.5% relative to the original YOLOv5n, and, as a lightweight algorithm, meets the requirements of the real-time high-rise littering object detection task in complex scenes. Extreme Learning Machines (ELM) [20] could be used as a feature extraction method in subsequent experiments. Further improving detection performance and speed in real scenarios and reducing missed detections in as many scenes as possible are directions for subsequent research.
References

1. Ribnick, E., et al.: Detection of thrown objects in indoor and outdoor scenes. In: 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE (2007)
2. Csordás, R., et al.: Detecting objects thrown over fence in outdoor scenes, pp. 593–599 (2015)
3. Song, H., et al.: Vehicle detection based on deep learning in complex scene. Appl. Res. Comput. 35(04), 1270–1273 (2018)
4. Lin, T.-Y., et al.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
5. Ren, S., et al.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
6. Yuan, K.: Real-time object recognition and tracking based on residual neural network. M.S. thesis, Jiangnan University (2021)
7. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
8. Wang, C.-Y., et al.: CSPNet: a new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2020)
9. He, K., et al.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1904–1916 (2015)
10. Liu, S., et al.: Path aggregation network for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
11. Zheng, Z., et al.: Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 52, 8574–8586 (2021)
12. Guo, M.-H., et al.: Attention mechanisms in computer vision: a survey. Comput. Vis. Media 8, 331–368 (2022)
13. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
14. Dosovitskiy, A., et al.: An image is worth 16 × 16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
15. Woo, S., et al.: CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
16. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
17. Lin, T.-Y., et al.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
18. Tan, M., Pang, R., Le, Q.V.: EfficientDet: scalable and efficient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)
19. Bochkovskiy, A., Wang, C.-Y., Liao, H.-Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)
20. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme learning machine: a new learning scheme of feedforward neural networks. In: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), Budapest, Hungary, vol. 2, pp. 985–990 (2004). https://doi.org/10.1109/IJCNN.2004.1380068
Does Streaming Affect Video Game Popularity?

Zhen Li1(B), Leonardo Espinosa-Leal1, Maria Olmedilla2, Renjie Hu3, and Kaj-Mikael Björk1,4

1 Graduate School and Research, Arcada University of Applied Sciences, Jan-Magnus Janssons plats 1, 00560 Helsinki, Finland
{zhen.li,leonardo.espinosaleal,kaj-mikael.bjork}@arcada.fi
2 SKEMA Business School, Lille, France
[email protected]
3 Department of Information and Logistics Technology, University of Houston, Houston, TX, USA
[email protected]
4 Hanken School of Economics, Arkadiagatan 22, 00100 Helsinki, Finland
Abstract. The influence of the game streaming industry on video games is non-negligible. However, work using both game streaming data and active player data is limited. There are many challenges with game-related data: Twitch data is noisy, and not every game's popularity is reflected by its active player number. In this work, we set up a forecasting task to predict Steam active player numbers for a selected set of games. The study shows that, with the extra information provided by Twitch data, we can forecast Steam active players slightly better in terms of MAPE. In addition, this quantitative study confirms the importance of the streaming industry to the gaming industry.
Keywords: Game Streaming · Video Games · Extreme Learning Machine

1 Introduction
The video games industry has grown rapidly over the years, and the global market value is expected to reach roughly 250 billion U.S. dollars by 2025 [14]. Video games also set up virtual worlds where people can communicate, socialise, collaborate, trade, and even get virtually married, as in World of Warcraft [3]. Advancements in networking technologies, hardware computing power, and simulation algorithms are making virtual worlds more and more realistic; one example is simulated fluid dynamics such as the liquid splashing effect [19]. With the assistance of augmented reality and virtual reality, video games will draw more and more attention.

In the meantime, the video game streaming industry is growing along with the video game industry. Twitch.tv, one of the main live streaming platforms, had

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K.-M. Björk (Ed.): ELM 2022, PALO 18, pp. 37–45, 2024. https://doi.org/10.1007/978-3-031-55056-0_5
about 41 million users in the U.S. in 2020, and the number is estimated to reach 51 million by 2024 [15]. Twitch.tv is not limited to video game streaming; however, video games take up a big portion of it. Streamers stream live video game play, and viewers can comment on and tip streamers on the platform. Research by Mark R. Johnson and Jamie Woodcock has shown the impact of Twitch.tv on the gaming industry [8]. Twitch.tv also reported its positive influence on game sales and retention time in 2016 [18]. However, no further quantitative studies have been conducted since then to leverage streaming data for predicting video game player experiences. In this article, we study in a quantitative manner how the streaming industry affects the activeness of video games on Steam, with the help of machine learning technology, i.e. Extreme Learning Machines.
2 Related Works
The applications of neural networks (NNs) are already everywhere in our daily lives, including recommender systems [10], face recognition systems [9], and traffic forecasting [7]. However, neural networks have not been utilised much in studying Steam and Twitch.tv data to better understand and predict game player experiences. Neural networks are computing structures mimicking the human brain's neural links. The activation functions used in neural networks, most often non-linear functions like the rectified linear unit (ReLU), are meant to create complexity so that neural network structures can fit more complex functions. Intuitively, activation functions mimic the activation of neural cells. In the case of ReLU, if the incoming signal is smaller than 0, it is not passed on; if it is greater than 0, it passes on as it is. Many variations of ReLU have been introduced to improve training convergence, but the concept remains the same. Neural networks normally consist of an input layer, hidden layers, and an output layer. The input layer is a numerical representation of the input information; if the input data is not numerical, such as letters, encoding needs to be done so that the network can process it. Hidden layers are normally considered black boxes where abstract features are extracted, waiting to be mapped to an output. The output layer maps the abstract features obtained from the hidden layers to the final numerical result; if the expected output is non-numerical, such as letters, a decoding process converts the numerical output back. Natural language processing-related deep neural network models use such encoding and decoding processes to convert non-numerical signals to numerical signals.
ELM has one of the most straightforward neural network structures, as there is only one hidden layer and no back-propagation is needed, which makes it much faster than other neural networks [6]. The simplicity in structure and fast computing make it applicable to many different tasks, such as website classification [5], mobile computing [1], and signature verification [4],
among others. In this work, we will focus on using ELM to study video game ratings.
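To make that structure concrete, here is a minimal NumPy sketch of an ELM regressor: a fixed random hidden layer followed by a single least-squares solve for the output weights. The function names are illustrative and this is not scikit-elm's API:

```python
import numpy as np

rng = np.random.default_rng(42)

def elm_fit(X, y, n_hidden=32):
    # The input weights and biases are random and are never trained.
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                 # hidden-layer outputs
    # The only "training" is one least-squares solve for the output
    # weights -- no back-propagation.
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Sanity check: regress a noisy sine wave.
X = np.linspace(-1, 1, 200)[:, None]
y = np.sin(np.pi * X[:, 0]) + 0.05 * rng.normal(size=200)
W, b, beta = elm_fit(X, y)
pred = elm_predict(X, W, b, beta)
```

Because the hidden layer is fixed, fitting reduces to a linear problem, which is what makes ELM training so fast.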
3 Methods and Experiments
With the two different time series at hand, i.e. Twitch game streaming data over time and Steam game statistics over time, we would like to forecast the average Steam player numbers of each game with the help of Twitch viewing data. This is a multivariate time series forecasting problem. By learning from the temporal patterns in both the Steam and Twitch time series, as shown in Fig. 1, our hypothesis is that the forecast for Steam can be improved.
Fig. 1. Autoregressive multivariate forecasting
Steam statistical data come from Kaggle, containing the average active players by month for the current top 100 games scraped from steamcharts.com [11]. This website is facilitated by a web frontend service and a data collector service that queries the Steam Web API. The average monthly player number time series is the main target time series; the forecast task is to predict the next month's average active player number. Twitch data include popular Twitch streamer statistics collected by sullygnome.com [16]. The Twitch streamer viewing data are considered a popularity measurement for the streaming industry. View counts are aggregated by month for each game:

A_g^t = Σ_{t_d} Σ_S V_g(t_d, S)    (1)
where A_g^t refers to the Attention (A) of game g in month t, t_d refers to the days in month t, V denotes view counts, and S ranges over the different top Twitch streamers, as rated by sullygnome.com [16]. The aggregation sums the views of the game streamed by the different top streamers within the month. Therefore, the same Twitch viewer watching twice in the same month, from the same or different streamers, is still counted as two views, since the viewer contributed attention both times.

There are many different games on Steam; however, not all games have a long lifespan. In addition, not all games can be measured by average active players, such as small franchises or one-off indie games. Only some renowned titles with fairly long lifetimes pay attention to maintaining and attracting regular active players to keep the community vigorous, such as Grand Theft Auto V and Counter-Strike. Dota 2 is another type of game: it is free, and most of its income comes from in-game transactions and the battle pass [13], a derivative of tournaments. Therefore, we focus on 36 games that have continuous Steam and Twitch statistics from 2018-01-01 to 2021-09-01 and that are likely to consider active player numbers important.

An autoregressive setup is used to capture temporal features, with a look-back horizon of 2 months. The periodicity of popular video games is not much limited by the season of the year, as publishers release big titles year-round: Elden Ring was published in late February 2022 and Red Dead Redemption 2 in late October 2018, and both Elden Ring [17] and Red Dead Redemption 2 [12] set great records in the gaming industry in terms of sales. The time series cross-validation is done as illustrated in Fig. 2. The first training batch includes data until 2020-01-01, validation proceeds by monthly increments until 2021-01-01, and the data from 2021-01-01 to 2021-09-01 form the test set.
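The monthly aggregation in (1) can be sketched in plain Python; the record layout and names below are assumptions for illustration, not the study's actual data schema:

```python
from collections import defaultdict

def monthly_attention(view_records):
    # view_records: iterable of (game, date 'YYYY-MM-DD', streamer, views).
    # Views are summed over all days and all top streamers in a month, so
    # the same viewer counted by two streamers contributes two views.
    attention = defaultdict(int)
    for game, date, streamer, views in view_records:
        month = date[:7]                    # 'YYYY-MM'
        attention[(game, month)] += views
    return dict(attention)

records = [
    ("Dota 2", "2021-03-01", "streamer_a", 1200),
    ("Dota 2", "2021-03-15", "streamer_b", 800),
    ("Dota 2", "2021-04-02", "streamer_a", 500),
]
print(monthly_attention(records))
# {('Dota 2', '2021-03'): 2000, ('Dota 2', '2021-04'): 500}
```

Note that the streamer identity is deliberately discarded after summation, matching the double-counting behaviour described above.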
We leverage the incremental fitting capability, i.e. partial fitting functionality, of scikit-elm [2] to conduct the experiments.
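The expanding-window cross-validation and the MAPE metric used throughout the paper might look like the following sketch; the function names and the month grid are assumptions, not the study's code:

```python
def mape(actual, forecast):
    # Mean absolute percentage error, the study's evaluation metric.
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def expanding_window_splits(dates, train_end, val_end):
    # dates: chronologically ordered month stamps. The training window
    # grows by one month per fold; each fold validates on the single
    # month that follows it. Months after val_end stay out as test data.
    first_val = dates.index(train_end) + 1
    last_val = dates.index(val_end)
    for v in range(first_val, last_val + 1):
        yield list(range(v)), [v]

months = [f"2019-{m:02d}-01" for m in range(1, 13)] + \
         [f"2020-{m:02d}-01" for m in range(1, 13)]
splits = list(expanding_window_splits(months, "2020-01-01", "2020-06-01"))
```

An expanding window (rather than random folds) keeps every validation month strictly after its training data, which is essential for honest time series evaluation.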
4 Results
The time series use a month as the unit time step, so each series contains only 45 data points; the limited amount of learning data is therefore another challenge of the task. As shown in Fig. 3, the information in historical Twitch viewer numbers helps the forecasting of the number of active players on Steam. This confirms that the streaming industry does exert influence on the gaming industry. At the same time, we can also see that not all game forecasts benefited from the Twitch viewer information, including Counter-Strike: Global Offensive and Dota 2, and the Twitch data barely helped with PUBG: Battlegrounds either. This might be due to data quality issues and the characteristics of the games. All three games are freely accessible; their selling points are in-game
Fig. 2. Time series cross-validation split
Fig. 3. Validation MAPE vs Test MAPE with and without Twitch viewer data
transactions and tournaments. Moreover, all three games more or less require teamwork, i.e. players form teams to play (PUBG: Battlegrounds has both team-up and individual modes).
Fig. 4. Validation MAPE changes with neuron number and type for Divinity: Original Sin 2
In addition to the forecasting performance, we found that different games' active player number time series prefer different neural network setups, as shown in Table 1. Despite the limited amount of data, some games need more neurons in the network to fit the time series, while most require fewer. For Divinity: Original Sin 2, as can be seen from Fig. 4, with a single neuron type the best validation MAPE is achieved around 6 ReLU neurons. The overall best validation MAPE for Divinity: Original Sin 2 is 0.120, with 9 sigmoid and 7 ReLU neurons; the corresponding test MAPE is 0.118.
5 Discussions
The paper presents the active player number forecasting performance, in terms of MAPE, for the 36 chosen Steam games. It is an attempt to quantitatively analyse the influence of the streaming industry on the gaming industry, which, to the best of the authors' knowledge, had not been done by the time of the study. From the results of the study, we can confirm that Twitch viewer numbers directly affect the active player numbers on Steam. For some games, such as Path of Exile, incorporating the streaming information can improve MAPE by more than 0.183.
Table 1. Best neuron structures and MAPE for 36 studied games.

Game Name                         Neuron numbers  Neuron types  MAPE
Arma 3                            (4)             (lin)         0.082
Black Desert                      (2)             (sigm)        0.204
Brawlhalla                        (5, 3)          (relu, relu)  0.062
Cities: Skylines                  (3)             (tanh)        0.095
Counter Strike: Global Offensive  (3, 2)          (sigm, relu)  0.058
Counter-Strike                    (3, 4)          (relu, relu)  0.043
Dark Souls III                    (9, 2)          (sigm, relu)  0.140
Dead by Daylight                  (2, 4)          (relu, relu)  0.081
Divinity: Original Sin 2          (9, 7)          (sigm, relu)  0.118
Dota 2                            (10, 8)         (relu, relu)  0.048
Dying Light                       (5, 5)          (relu, relu)  0.422
Europa Universalis IV             (7, 7)          (relu, relu)  0.089
Factorio                          (2)             (sigm)        0.089
Fallout 4                         (7, 2)          (sigm, sigm)  0.116
Garry's Mod                       (4, 3)          (relu, sigm)  0.046
Grand Theft Auto V                (7)             (relu)        0.152
PUBG: Battlegrounds               (3, 7)          (sigm, relu)  0.051
Path of Exile                     (3)             (sigm)        0.606
RimWorld                          (2)             (tanh)        0.140
Rocket League                     (2)             (relu)        0.068
Sid Meier's Civilization V        (5, 2)          (relu, relu)  0.062
Sid Meier's Civilization VI       (2)             (relu)        0.121
Slay the Spire                    (9, 5)          (sigm, sigm)  0.129
Stellaris                         (2)             (sigm)        0.168
Team Fortress 2                   (5, 4)          (relu, relu)  0.099
Terraria                          (2, 2)          (sigm, sigm)  0.122
The Binding of Isaac: Rebirth     (7, 6)          (relu, relu)  1.683
The Elder Scrolls Online          (3, 3)          (relu, relu)  0.061
The Elder Scrolls V: Skyrim       (4, 3)          (relu, relu)  0.082
The Witcher 3: Wild Hunt          (2)             (relu)        0.132
Tom Clancy's Rainbow Six Siege    (2)             (sigm)        0.119
Total War: WARHAMMER II           (2, 10)         (relu, sigm)  0.201
War Thunder                       (2, 4)          (relu, relu)  0.082
Warframe                          (2, 5)          (relu, sigm)  0.119
World of Tanks Blitz              (4, 2)          (sigm, relu)  0.091
World of Warships                 (3, 2)          (sigm, relu)  0.278
This study can be considered a benchmark or baseline for further studies. There is still much to improve and explore: for example, how to leverage the information in Twitch comments and emotes, which can be considered an emotion indicator; how to better leverage game characteristics or profiles to conduct transfer learning and predict the active player numbers of upcoming games; and, for game publishers, how to use streaming influencers efficiently to market games.

Acknowledgments. The work has been performed under Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme; in particular, M.O. gratefully acknowledges the support of Arcada University of Applied Sciences and the computer resources and technical support provided by CSC Finland.
References

1. Akusok, A., Leal, L.E., Björk, K.-M., Lendasse, A.: High-performance ELM for memory constrained edge computing devices with metal performance shaders. In: Cao, J., Vong, C.M., Miche, Y., Lendasse, A. (eds.) ELM 2019. PALO, vol. 14, pp. 79–88. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-58989-9_9
2. Akusok, A., Leal, L.E., Björk, K.-M., Lendasse, A.: Scikit-ELM: an extreme learning machine toolbox for dynamic and scalable learning. In: Cao, J., Vong, C.M., Miche, Y., Lendasse, A. (eds.) ELM 2019. PALO, vol. 14, pp. 69–78. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-58989-9_8
3. Blizzard Entertainment: World of Warcraft (2004). https://worldofwarcraft.com/en-us/
4. Espinosa-Leal, L., Akusok, A., Lendasse, A., Björk, K.-M.: Extreme learning machines for signature verification. In: Cao, J., Vong, C.M., Miche, Y., Lendasse, A. (eds.) ELM 2019. PALO, vol. 14, pp. 31–40. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-58989-9_4
5. Espinosa-Leal, L., Akusok, A., Lendasse, A., Björk, K.-M.: Website classification from webpage renders. In: Cao, J., Vong, C.M., Miche, Y., Lendasse, A. (eds.) ELM 2019. PALO, vol. 14, pp. 41–50. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-58989-9_5
6. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006)
7. Jiang, W., Luo, J.: Graph neural network for traffic forecasting: a survey. Expert Syst. Appl. 207, 117921 (2022)
8. Johnson, M.R., Woodcock, J.: The impacts of live streaming and Twitch.tv on the video game industry. Media Cult. Soc. 41(5), 670–688 (2019)
9. Lawrence, S., Giles, C.L., Tsoi, A.C., Back, A.D.: Face recognition: a convolutional neural-network approach. IEEE Trans. Neural Netw. 8(1), 98–113 (1997)
10. Lu, J., Wu, D., Mao, M., Wang, W., Zhang, G.: Recommender system application developments: a survey. Decis. Support Syst. 74, 12–32 (2015)
11. Ogozaly, J.: Steam player data (2021). https://www.kaggle.com/datasets/jackogozaly/steam-player-data
12. Parker, R.: Red Dead Redemption 2 breaks records with 725 million opening weekend (2018). https://www.hollywoodreporter.com/news/general-news/red-dead-redemption-2-breaks-records-725-million-opening-weekend-1156235/
13. Pilipovic, S.: Dota 2 player count and statistics 2023 (2023). https://levvvel.com/dota-2-statistics/
14. Statista: Global video game market value from 2020 to 2025 (2023). https://www.statista.com/statistics/292056/video-game-market-value-worldwide/
15. Statista: Number of Twitch users in the United States from 2021 to 2025 (2023). https://www.statista.com/statistics/532338/twitch-viewing-frequency-usa/
16. Sullygnome: Twitch statistics and analytics (2023). https://sullygnome.com/
17. Tailby, S.: Elden Ring sold more than 13.4 million copies in its first five weeks (2022). https://www.pushsquare.com/news/2022/05/elden-ring-sold-more-than-13-4-million-copies-in-its-first-five-weeks
18. Twitch: Game creator success on Twitch: hard numbers (2016). https://blog.twitch.tv/en/2016/07/13/game-creator-success-on-twitch-hard-numbers-688154815817/
19. Um, K., Hu, X., Thuerey, N.: Liquid splash modeling with neural networks. In: Computer Graphics Forum, vol. 37, pp. 171–182. Wiley Online Library (2018)
Speech Dereverberation Based on Self-supervised Residual Denoising Autoencoder with Linear Decoder

Tassadaq Hussain1(B), Ryandhimas E. Zezario2,3, Yu Tsao2, and Amir Hussain1

1 School of Computing, Edinburgh Napier University, 10 Colinton Road, Edinburgh EH10 5DT, UK
{t.hussain,a.hussain}@napier.ac.uk
2 Research Center for Information Technology Innovation, Academia Sinica, No. 128, Section 2, Academia Road, Taipei 115, Taiwan
{ryandhimas,yu.tsao}@citi.sinica.edu.tw
3 Department of Computer Science and Information Engineering, National Taiwan University, No. 1, Section 4, Roosevelt Road, Taipei 10617, Taiwan
Abstract. A recently proposed self-supervised denoising autoencoder with linear decoder (DAELD) speech enhancement system demonstrated promising potential in converting noisy speech signals to clean signals. In addition to additive noise, reverberation is another common source of distortion, caused by multi-path reflected copies of the speech signal. The characteristics of additive noise and reverberation are different, and there is an increasing demand for an effective approach that can tackle the combined effects of these two distortion sources. Based on the promising results achieved by DAELD, in this study we propose an extension of DAELD, called the residual DAELD system (rDAELD), which simultaneously performs speech dereverberation and denoising in a self-supervised learning manner. More specifically, the proposed rDAELD does not require paired training data to estimate the model parameters, making this method especially suitable for real-world applications. Experimental results confirm that rDAELD yields promising dereverberation and denoising performance under both matched and mismatched training-test conditions for simulated and measured impulse responses.

Keywords: speech dereverberation · self-supervised learning · deep denoising autoencoder · extreme learning machines

1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K.-M. Björk (Ed.): ELM 2022, PALO 18, pp. 46–57, 2024. https://doi.org/10.1007/978-3-031-55056-0_6

Speech reverberation refers to the collection of multi-path propagated sound in acoustic environments, which can dramatically affect the overall quality and intelligibility of speech in speech-related applications, such as automated speech recognition [1–5] and speaker identification systems [6,7], as well as for normal-hearing and hearing-impaired listeners [8–10]. Various signal processing-based approaches have been proposed to tackle the reverberation issue [11–14]. However, further improvements are required to enhance reverberant speech and improve speech intelligibility for both human hearing and computer perception in real-world acoustic environments. A common limitation of most conventional methods is that they rely on the statistical properties of the speech and reverberation signals. Consequently, these approaches may not perform well in unseen, real-world acoustic environments.

Recently, comprehensive research was conducted on nonlinear spectral mapping-based approaches, which demonstrated the great potential of deep neural models for speech dereverberation. For example, in [1], a denoising autoencoder (DAE)-based framework was proposed by stacking multiple encoder-decoder layers to generate robust speech features for noisy reverberant speech recognition. The multi-layered structure of the DAE (termed deep DAE, DDAE) has proven effective in handling noisy-reverberant environments. Deep neural network (DNN)-based systems were also proposed [15–21], where the frameworks were trained in a supervised manner to simultaneously perform dereverberation and denoising. Moreover, long short-term memory (LSTM)-based frameworks [22,23] have been proposed to effectively reduce reverberation and noise artifacts in both hearing-assistive and speech recognition devices. In addition to LSTM frameworks, a dereverberation framework based on bidirectional LSTM was used in [24] to learn the mapping from distant-talk reverberant channels to close-talk clean channels. A DDAE-based ensemble structure was recently implemented in [25] to manage a variety of matched and mismatched reverberant conditions through the integration of multiple reverberation models trained for different acoustic reverberant conditions.
More recently, a deep convolutive and self-attention-based system was proposed to exploit frequency-temporal context patterns for noise suppression and dereverberation [26]. Neural network-based architectures are trained using back-propagation (BP) techniques, in which the parameters are fine-tuned to optimize the neural architecture. In contrast to BP-based deep neural models, the authors in [27,28] introduced the hierarchical structure of the extreme learning machine (ELM), referred to as HELM, and its ensemble system for speech enhancement and speech dereverberation. The proposed frameworks showed that unseen and training-test mismatched noisy and reverberant conditions can still be handled efficiently using feed-forward-only neural models. The outcomes of these approaches indicate that nonlinear spectral mapping-based neural models provide strong regression capabilities and can achieve outstanding speech denoising and dereverberation results in real-world scenarios.

A notable limitation of deep neural-based dereverberation systems is that they generally require a large number of reverberant-anechoic training pairs and must be trained in a supervised manner. Supervised learning frameworks therefore need to first acquire a large amount of paired training data, covering as many environmental conditions as possible, to improve robustness against unknown test conditions. In contrast, self-supervised learning attempts to learn the spectral mapping without the need for paired reverberant-anechoic data, which effectively resolves the data requirement problem. Several self-supervised encoder-decoder style systems have been proposed for speech enhancement [29]. However, less attention has been paid to this learning problem in the speech dereverberation domain. Inspired by the promising results achieved by the denoising autoencoder with linear regression decoder (called DAELD) [30], in this study we propose a residual DAELD system (called rDAELD) for simultaneously performing speech dereverberation and denoising. A major goal of this research is to tackle the paired training data requirement based on the self-supervised learning technique. The performance of the proposed framework was evaluated on the Taiwan Mandarin hearing in noise test (TMHINT) corpus [31]. Experimental results showed that the proposed rDAELD system achieves better scores than the unprocessed reverberant-noisy speech and the DAELD system in terms of standardized measurement criteria under matched and mismatched testing conditions.

The rest of the paper is organized as follows. Section 2 presents the proposed rDAELD system. The experimental setup and results are discussed in Sect. 3. Finally, the conclusion is given in Sect. 4.
2 Proposed rDAELD System

Figure 1 shows the proposed rDAELD system, which follows the same encoder-decoder structure as in [30] with an additional residual architecture. The residual architecture is used to integrate low-level information into the decoder stage [28]. In the encoder stage, the input data are first converted to representative features by the encoder. The decoder then transforms the combined low-level information and representative features to generate the final output. In addition, the reverberant speech signals are used as the target output, which enables self-supervised learning. The overall rDAELD system consists of two stages, i.e., the offline and online stages. In the offline stage, the system estimates the parameters of the encoder and decoder. In the online stage, it obtains the dereverberated speech signals based on the parameters estimated in the offline stage. More details are given in the following sections.

2.1 Offline and Online Stages
In the offline stage, the logarithmic power spectral (LPS) features x_i = [x_i1, x_i2, ..., x_iL]^T ∈ R^L of the reverberant speech signals are extracted and fed into the rDAELD system. Similar to the DAELD system, the LPS features are processed by the hierarchical layers of the sparse autoencoder to extract the representative features via self-supervised learning. As described in [30], the DAELD system consists of several self-supervised sparse autoencoders (SAEs) and a linear decoder. Given the input sequence X = [x_1, x_2, ..., x_K], the output of the decoder can be written as

    H_d = H_e · B,    (1)

where H_d denotes the output matrix of the decoder layer, and H_e = [h_e1, ..., h_ek, ..., h_eK] is the hidden-layer output matrix of the encoder stage of the DAELD system. In self-supervised learning, H_d = X. Moreover, B is the linear transformation matrix calculated using the Moore-Penrose pseudo-inverse [32]. Unlike the gradient-based optimization used to compute the parameters of neural network models, the linear transformation matrix is computed over the entire set of training data.

In the rDAELD system, the output of the SAE is mapped identically through a short-cut (skip) connection to the decoder layer. Owing to the short-cut connection, the rDAELD framework computes the transformation matrix B taking the input into account. Consequently, the output-layer matrix of the self-supervised decoder can be represented as

    H_d(res) = ((w H_q) + H_e) · B,    (2)

where H_d(res) denotes the output matrix of the decoder, H_q is the output matrix of the previous layer in the encoder, and w is a randomly generated weight matrix for the linear projection to match the dimensions. Similar to DAELD, self-supervised learning is performed by setting H_d(res) = X.

In the online stage, the input features (reverberant LPS features) are first processed by the encoder, whose parameters were trained in an unsupervised learning manner, to obtain the representative features. Subsequently, the generated representative features are transformed by the transformation matrix B to obtain the output features (dereverberated LPS features).
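The closed-form offline solve of Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is an illustrative toy rather than the authors' implementation: the layer sizes, the sigmoid activation, and the purely random encoder weights are assumptions (the real system trains its SAEs, and uses the [1000 1000 8000] configuration reported later), but the last two lines show the key idea, i.e., B is obtained by a single pseudo-inverse solve with the reverberant input X itself as the target.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(H, layers):
    # Hierarchical encoder: each layer applies a projection followed by a
    # sigmoid activation, standing in here for the trained ELM-style SAE layers.
    for W, b in layers:
        H = 1.0 / (1.0 + np.exp(-(H @ W + b)))
    return H

# Toy dimensions (hypothetical): K frames of L-dimensional LPS features.
K, L, h1, h2 = 200, 64, 100, 100
X = rng.standard_normal((K, L))          # stand-in for reverberant LPS features

layer1 = [(rng.standard_normal((L, h1)), rng.standard_normal(h1))]
layer2 = [(rng.standard_normal((h1, h2)), rng.standard_normal(h2))]

H_q = encode(X, layer1)                  # output of the previous encoder layer
H_e = encode(H_q, layer2)                # encoder output H_e

# Random projection w matches H_q's width to H_e's for the skip path (Eq. (2)).
w = rng.standard_normal((h1, h2))

# Self-supervised target H_d(res) = X: solve (w H_q + H_e) B = X for B with
# the Moore-Penrose pseudo-inverse -- one linear solve, no back-propagation.
A = H_q @ w + H_e
B = np.linalg.pinv(A) @ X

X_hat = A @ B                            # online-stage feature reconstruction
```

Because B comes from a least-squares solve over the whole training set, there is no iterative fine-tuning loop; the online stage only re-runs the encoder and applies B.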
Fig. 1. Architecture of the proposed rDAELD framework
3 Experiments

3.1 Experimental Setup
We evaluated the proposed rDAELD on the TMHINT corpus [31], comprising a total of 320 sentences, each containing 10 Chinese words. Six speakers were recruited to record the utterances by reading the sentences included in the TMHINT. We selected 250 utterances recorded by three male speakers to prepare the training set, and 30 utterances recorded by another male and two female speakers for the test set. There was no overlap between the training and test utterances. We prepared the reverberant speech based on room impulse responses (RIRs) and reverberation times (RT60s). The clean training utterances were convolved with a single RIR for each of three room sizes (small, medium, and large) with different RT60s (RT60 = 0.25, 0.5, and 0.7 s), provided by the REVERB challenge organizers [33], to generate the reverberant training data. Two source-microphone distances, i.e., 50 cm (near) and 200 cm (far), were selected to generate the final training dataset. Subsequently, the reverberant data were contaminated with background noise at SNRs of 5, 10, and 15 dB to generate 250 (clean utterances) × 3 (speakers) × 3 (RT60s) × 2 (distances between the speaker and microphone) = 4500 reverberant and noisy utterances. We tested the proposed rDAELD system under both matched and mismatched training-test conditions. The 30 test utterances were convolved with three synthetically generated RIRs of the three rooms matched to the training set (i.e., small, medium, and large) with RT60s of 0.25, 0.5, and 0.7 s to produce reverberant utterances. Subsequently, the reverberant utterances were further contaminated with background noise at five SNR levels (−12, −6, 0, 6, and 12 dB) to produce 30 (clean utterances) × 3 (speakers) × 3 (RT60s) × 2 (distances between the speaker and microphone) × 5 (SNRs) = 2700 reverberant and noisy utterances; this test dataset is referred to as SimData.
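The data-generation recipe above (convolve clean speech with an RIR, then mix in noise at a target SNR) can be sketched as follows. The signals are synthetic stand-ins: the exponentially decaying noise burst is only a toy substitute for a REVERB-challenge RIR, and the function names are ours.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    # Scale the noise so the mixture has exactly the requested SNR, then mix.
    noise = noise[:len(speech)]
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

def make_reverberant_noisy(clean, rir, noise, snr_db):
    # Convolve clean speech with a room impulse response, then add noise.
    reverberant = np.convolve(clean, rir)[:len(clean)]
    return add_noise_at_snr(reverberant, noise, snr_db)

# Synthetic stand-ins for a clean TMHINT utterance, an RIR, and a noise track.
rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)
rir = np.exp(-np.linspace(0.0, 8.0, 4000)) * rng.standard_normal(4000)
noise = rng.standard_normal(16000)

mixtures = {snr: make_reverberant_noisy(clean, rir, noise, snr)
            for snr in (5, 10, 15)}     # the training SNRs used in the paper
```

Looping this over all utterances, RIRs, and distances yields the 4500-utterance training set described above.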
Three standard objective assessment metrics were used to evaluate the proposed algorithm, namely the cepstrum distance (Cep), log-likelihood ratio (LLR), and speech-to-reverberation modulation energy ratio (SRMR). The Cep measures the spectral difference between the estimated and clean speech signals, while the LLR calculates the disparity ratio between these signals. Unlike the Cep and LLR, the SRMR measures the speech-to-reverberation energy ratio of the estimated speech signal. Smaller Cep and LLR scores indicate fewer distortions and, thus, better speech quality. Higher SRMR scores indicate that the estimated speech is closer to the anechoic speech signal.

3.2 Supervised Versus Self-supervised DAELD
We first compare the performance of the self-supervised DAELD system [30] (termed DAELDu) and the supervised DAELD system (termed DAELDs). The same architecture was used for the DAELD models of both systems; the first two layers comprised ELM-based SAEs, followed by a linear regression decoder.
Table 1. Average Cep, LLR, and SRMR scores of Reverb and the DAELD systems with both supervised and self-supervised criteria for the SimData set.

Method   Measure  Room1         Room2         Room3         Avg.
                  Near   Far    Near   Far    Near   Far
Reverb   Cep      4.33   4.52   5.87   5.97   5.32   5.80   5.30
         LLR      0.90   0.93   1.17   1.20   1.24   1.35   1.13
         SRMR     5.57   4.49   2.37   1.70   2.42   1.62   3.03
DAELDs   Cep      3.79   3.84   5.03   5.15   5.42   5.43   4.78
         LLR      0.70   0.73   0.96   1.01   1.06   1.08   0.92
         SRMR     4.25   4.35   3.55   3.43   3.34   2.53   3.58
DAELDu   Cep      3.83   4.19   5.71   5.87   5.38   5.59   5.10
         LLR      0.70   0.81   1.05   1.12   0.97   1.05   0.95
         SRMR     6.64   6.57   4.38   4.13   4.31   2.60   4.77
The model architectures were [1000 1000 8000]. Both systems were trained using the 4500 reverberant and noisy training utterances. In a previous study [30], DAELDu demonstrated speech-quality improvements for speech enhancement. In this study, we first compared the performance of DAELDu and DAELDs under the designed dereverberant and noisy conditions. Table 1 summarizes the average Cep, LLR, and SRMR scores on SimData for the matched testing conditions. The results of the original reverberant and noisy speech are listed as "Reverb". The DAELD frameworks with both supervised and self-supervised learning outperform Reverb across the different conditions of the SimData set. Moreover, DAELDu provides comparable, and sometimes even better, performance than DAELDs.

3.3 DAELD Versus rDAELD
Next, we compare the performance of DAELD with the proposed rDAELD framework. For a fair comparison, the rDAELD frameworks with supervised learning (rDAELDs) and self-supervised learning (rDAELDu) were trained using the same configuration (i.e., [1000 1000 8000]). Table 2 displays the average Cep, LLR, and SRMR scores of rDAELDs and rDAELDu for the SimData set. From Table 2, it can be observed that rDAELDs yields better performance in terms of Cep and LLR in comparison to rDAELDu. Nevertheless, DAELDu yielded a slightly higher SRMR score than rDAELDs. Next, by comparing Tables 1 and 2, it is evident that rDAELDs and rDAELDu outperform DAELDs and DAELDu, respectively, across different RT60s, thereby confirming the benefits achieved by the residual architecture. To evaluate the robustness of the rDAELD system under mismatched test conditions, we used a collection of measured RIRs from different rooms given
by the Aachen Impulse Response database [34]. These RIRs are real recordings made in reverberant spaces. In this study, we considered the RIRs of the following four rooms: lecture, meeting, office, and stairways (denoted as Lect., Meet., Off., and Stair.) with different RT60s. Table 3 lists the acoustic parameters of each reverberant room (e.g., RT60) and the distance between the loudspeaker and microphone (dSM). For the mismatched test data, clean test utterances were convolved with the RIRs of the rooms shown in Table 3, followed by adding background noises at five different SNR levels to generate 30 (clean utterances) × 3 (speakers) × 1 (RT60) × 5 (SNRs) = 450 reverberant and noisy utterances for each room. This set of test data is termed the RealData set.

Table 2. Average Cep, LLR, and SRMR scores of the rDAELD systems with supervised and self-supervised learning criteria for the SimData set.

Method    Measure  Room1         Room2         Room3         Avg.
                   Near   Far    Near   Far    Near   Far
rDAELDs   Cep      3.76   3.81   5.06   5.18   5.43   5.48   4.79
          LLR      0.68   0.72   0.96   1.01   1.05   1.06   0.91
          SRMR     4.76   4.81   3.91   3.68   3.70   2.76   3.94
rDAELDu   Cep      3.78   4.16   5.38   5.59   5.71   5.87   5.10
          LLR      0.69   0.81   0.97   1.06   1.05   1.06   0.94
          SRMR     7.77   7.11   5.23   4.79   5.29   3.39   5.60
Table 3. RT60s and distances between the loudspeaker and microphones for each room in the RealData set.

Room type   dSM (m)   RT60 (s)
Lecture     2.25      0.70
Meeting     2.80      0.25
Office      3.00      0.48
Stairways   1.00      0.82
Table 4 summarizes the average Cep, LLR, and SRMR scores of the DAELD and rDAELD systems with both supervised and self-supervised learning for the RealData set. The results of the four acoustic rooms are presented individually (Lect., Meet., Off., and Stair.). The best score for each evaluation metric and room is marked with an asterisk (*). Note that the results shown in Table 4 were obtained under training-test mismatched conditions. From Table 4, we first note that both frameworks (DAELD and rDAELD) with supervised and self-supervised learning yield better scores for all three evaluation metrics than Reverb (Rev.). Next, in the case of supervised learning, rDAELDs achieved better performance than DAELDs on almost all evaluation metrics. Similar trends can be observed for the self-supervised learning case, where rDAELDu outperformed DAELDu on most evaluation metrics. These results confirm the effectiveness of the residual architecture. Finally, when comparing rDAELDs and rDAELDu, the self-supervised learning system yields comparable, and sometimes even better, performance than the supervised learning system, especially for the SRMR evaluation metric.

Table 4. Average Cep, LLR, and SRMR scores of DAELD and rDAELD with both supervised and self-supervised training criteria for the RealData set.

Room    Measure  Rev.    DAELDs   rDAELDs   DAELDu   rDAELDu
Lect.   Cep      5.45    4.80*    4.81      5.28     5.27
        LLR      1.13    0.94*    0.94*     1.07     1.07
        SRMR     2.59    3.17     3.59      3.56     4.25*
Meet.   Cep      5.03    4.45     4.42*     4.78     4.74
        LLR      1.04    0.83     0.82*     0.90     0.89
        SRMR     4.52    4.24     4.66      6.16     6.68*
Off.    Cep      5.21    4.53*    4.53*     4.97     4.95
        LLR      1.10    0.99     0.87      0.85*    0.99
        SRMR     4.75    3.53     3.91      4.23     5.06*
Stair.  Cep      5.42    4.74*    4.75      5.22     5.19
        LLR      1.12    0.92     0.91*     1.03     1.03
        SRMR     3.03    3.44     3.87      4.17     5.02*

3.4 Spectrogram Analysis
Finally, we visually investigated the speech dereverberation results produced by DAELD and rDAELD under the supervised and self-supervised learning criteria. Figure 2 shows the spectrograms of the enhanced speech signals yielded by the DAELD (DAELDs and DAELDu) and rDAELD (rDAELDs and rDAELDu) systems under a severe reverberation condition, i.e., RT60 = 0.7 s, for the Lecture room. The reverberant speech signal was further contaminated with background noise at SNR = −12 dB. The corresponding clean and reverberant utterances are also presented in Fig. 2 for comparison. From the spectrograms, it is evident that both DAELD and rDAELD with supervised and self-supervised learning successfully generated enhanced speech signals. Additionally, rDAELDu suppressed the reverberation and noise components more effectively while preserving the structure of the clean speech signal, especially in the middle-frequency regions (denoted by red blocks in Fig. 2).
From the figure, we first observe that the speech reconstructed by the rDAELD systems (rDAELDs and rDAELDu) exhibits patterns more similar to clean speech, with fewer distortions, than that of the DAELD systems (DAELDs and DAELDu). Comparing the spectrograms of the two rDAELD systems, we conclude that rDAELDu suppresses the effects of reverberation and noise more effectively. The findings from the spectrograms are consistent with the results presented in Table 4.
Fig. 2. Spectrogram plots of a sample utterance in six different conditions: Clean and Reverb, and enhanced ones by DAELDs , DAELDu , rDAELDs , and rDAELDu . The Reverb utterance was obtained with RT60 = 0.7 s, SNR = −6 dB, in a lecture room.
4 Conclusion
In this study, we proposed a self-supervised rDAELD system to perform speech dereverberation and denoising simultaneously. The main contributions of this study are three-fold: (1) preparing suitable paired training data that covers the effects of noise and reverberation requires a significant amount of effort and storage, and may therefore be unsuitable for real-world edge computing; the proposed self-supervised rDAELD system does not require paired training data, thereby overcoming the data preparation issue; (2) the proposed rDAELD system demonstrates better performance than the previously proposed DAELD system, confirming the effectiveness of the residual architecture; (3) the proposed rDAELD shows promising generalization capabilities for simultaneously handling the effects of reverberation and noise under training-test mismatched conditions. In future research, we will expand the applicability of the proposed self-supervised learning system to multimodal (such as audio-visual and bone-air microphone) speech dereverberation and denoising tasks.

Acknowledgements. This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC) programme grant COG-MHEAR (Grant reference EP/T021063/1).
References

1. Feng, X., Zhang, Y., Glass, J.: Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition. In: Proceedings of the ICASSP, pp. 1759–1763 (2014)
2. Li, J., Deng, L., Gong, Y., Haeb-Umbach, R.: An overview of noise-robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(4), 745–777 (2014)
3. Siniscalchi, S.M., Salerno, V.M.: Adaptation to new microphones using artificial neural networks with trainable activation functions. IEEE Trans. Neural Netw. Learn. Syst. 28(8), 1959–1965 (2017)
4. Yoshioka, T., et al.: Making machines understand us in reverberant rooms: robustness against reverberation for automatic speech recognition. IEEE Signal Process. Mag. 29(6), 114–126 (2012)
5. Gao, T., Du, J., Xu, Y., Liu, C., Dai, L.-R., Lee, C.-H.: Joint training of DNNs by incorporating an explicit dereverberation structure for distant speech recognition. EURASIP J. Adv. Signal Process. 2016(1), 86 (2016)
6. Jin, Q., Schultz, T., Waibel, A.: Far-field speaker recognition. IEEE Trans. Audio Speech Lang. Process. 15(7), 2023–2032 (2007)
7. Zhao, X., Wang, Y., Wang, D.: Robust speaker identification in noisy and reverberant conditions. IEEE/ACM Trans. Audio Speech Lang. Process. 22(4), 836–845 (2014)
8. Kokkinakis, K., Hazrati, O., Loizou, P.C.: A channel-selection criterion for suppressing reverberation in cochlear implants. J. Acoust. Soc. Am. 129(5), 3221–3232 (2011)
9. Hazrati, O., Sadjadi, S.O., Loizou, P.C., Hansen, J.H.: Simultaneous suppression of noise and reverberation in cochlear implants using a ratio masking strategy. J. Acoust. Soc. Am. 134(5), 3759–3765 (2013)
10. Healy, E.W., Delfarah, M., Johnson, E.M., Wang, D.: A deep learning algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker and reverberation. J. Acoust. Soc. Am. 145(3), 1378–1388 (2019)
11. Sadjadi, S.O., Hansen, J.H.: Hilbert envelope based features for robust speaker identification under reverberant mismatched conditions. In: Proceedings of the ICASSP, pp. 5448–5451 (2011)
12. Bees, D., Blostein, M., Kabal, P.: Reverberant speech enhancement using cepstral processing. In: Proceedings of the ICASSP, pp. 977–980 (1991)
13. Gillespie, B.W., Malvar, H.S., Florêncio, D.A.: Speech dereverberation via maximum-kurtosis subband adaptive filtering. In: Proceedings of the ICASSP, vol. 6, pp. 3701–3704 (2001)
14. Miyoshi, M., Kaneda, Y.: Inverse filtering of room acoustics. IEEE Trans. Acoust. Speech Signal Process. 36(2), 145–152 (1988)
15. Han, K., Wang, Y., Wang, D., Woods, W.S., Merks, I., Zhang, T.: Learning spectral mapping for speech dereverberation and denoising. IEEE/ACM Trans. Audio Speech Lang. Process. 23(6), 982–992 (2015)
16. Williamson, D.S., Wang, D.: Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM Trans. Audio Speech Lang. Process. 25(7), 1492–1501 (2017)
17. Schwarz, A., Huemmer, C., Maas, R., Kellermann, W.: Spatial diffuseness features for DNN-based speech recognition in noisy and reverberant environments. In: Proceedings of the ICASSP, pp. 4380–4384 (2015)
18. Nakatani, T., et al.: DNN-supported mask-based convolutional beamforming for simultaneous denoising, dereverberation, and source separation. In: Proceedings of the ICASSP, pp. 6399–6403 (2020)
19. Xiao, X., et al.: Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation. EURASIP J. Adv. Signal Process. 2016(1), 4 (2016)
20. Wang, Z.-Q., Wang, D.: Deep learning based target cancellation for speech dereverberation. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 941–950 (2020)
21. Giri, R., Seltzer, M.L., Droppo, J., Yu, D.: Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning. In: Proceedings of the ICASSP, pp. 5014–5018 (2015)
22. Mimura, M., Sakai, S., Kawahara, T.: Speech dereverberation using long short-term memory. In: Proceedings of the INTERSPEECH (2015)
23. Zhao, Y., Wang, D., Johnson, E.M., Healy, E.W.: A deep learning based segregation algorithm to increase speech intelligibility for hearing-impaired listeners in reverberant-noisy conditions. J. Acoust. Soc. Am. 144(3), 1627–1637 (2018)
24. Zhang, Z., Pinto, J., Plahl, C., Schuller, B., Willett, D.: Channel mapping using bidirectional long short-term memory for dereverberation in hands-free voice controlled devices. IEEE Trans. Consum. Electron. 60(3), 525–533 (2014)
25. Lee, W.-J., Wang, S.-S., Chen, F., Lu, X., Chien, S.-Y., Tsao, Y.: Speech dereverberation based on integrated deep and ensemble learning algorithm. In: Proceedings of the ICASSP, pp. 5454–5458 (2018)
26. Li, N., Ge, M., Wang, L., Dang, J.: A fast convolutional self-attention based speech dereverberation method for robust speech recognition. In: Gedeon, T., Wong, K.W., Lee, M. (eds.) ICONIP 2019. LNCS, vol. 11955, pp. 295–305. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-36718-3_25
27. Hussain, T., Siniscalchi, S.M., Lee, C.-C., Wang, S.-S., Tsao, Y., Liao, W.-H.: Experimental study on extreme learning machine applications for speech enhancement. IEEE Access 5, 25542–25554 (2017)
28. Hussain, T., Siniscalchi, S.M., Wang, H.-L.S., Tsao, Y., Salerno, V.M., Liao, W.-H.: Ensemble hierarchical extreme learning machine for speech dereverberation. IEEE Trans. Cogn. Dev. Syst. 12, 744–758 (2019)
29. Alamdari, N., Azarang, A., Kehtarnavaz, N.: Improving deep speech denoising by noisy2noisy signal mapping. arXiv preprint arXiv:1904.12069 (2019)
30. Zezario, R.E., Hussain, T., Lu, X., Wang, H.-M., Tsao, Y.: Self-supervised denoising autoencoder with linear regression decoder for speech enhancement. In: Proceedings of the ICASSP, pp. 6669–6673 (2020)
31. Huang, M.: Development of Taiwan Mandarin hearing in noise test. Department of Speech Language Pathology and Audiology, National Taipei University of Nursing and Health Science (2005)
32. Moore, E.H.: On the reciprocal of the general algebraic matrix. Bull. Am. Math. Soc. 26, 394–395 (1920)
33. Kinoshita, K., et al.: The REVERB challenge: a common evaluation framework for dereverberation and recognition of reverberant speech. In: Proceedings of the WASPAA, pp. 1–4 (2013)
34. Jeub, M., Schafer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: Proceedings of the DSP, pp. 1–5 (2009)
Application of ELM Model to the Motion Detection of Vehicles Under Moving Background

Zixiao Zhu1(B), Rongzihan Song2, Xiaofan Jia2, and Dongshun Cui2

1 Institute of Catastrophe Risk Management, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
[email protected]
2 School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
{song0199,xiaofan002}@e.ntu.edu.sg, [email protected]
Abstract. Artificial intelligence techniques can be applied to illegal parking detection. In this task, identifying the target vehicle's motion helps decrease false alarms. The optical flow algorithm is commonly used for motion detection: it recognizes motion by determining whether a pixel is displaced between two consecutive pictures. However, the algorithm cannot identify an actually moving vehicle in pictures photographed by a moving camera, since all pixels move. We propose a new motion detection system based on the combination of ELM and an optical flow algorithm. This system can handle pictures taken by a moving camera. In this paper, a new dataset focused on this application is built. After obtaining displacement information with the optical flow algorithm, an ELM neural network is used to learn the features that distinguish the changing background from the target vehicle. The system has been tested on our dataset, showing that ELM outperforms other machine learning models.

Keywords: Motion detection · Extreme learning machine · Optical flow · Vehicle motion data

1 Introduction and Related Work
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K.-M. Björk (Ed.): ELM 2022, PALO 18, pp. 58–69, 2024. https://doi.org/10.1007/978-3-031-55056-0_7

With vehicle ownership increasing, illegal parking is a social problem that governments need to pay attention to [1]. In Singapore, parking violation regulations are detailed, covering thirteen violations for light vehicles, heavy vehicles, and motorcycles. Typically, illegal parking is patrolled and inspected by the police, which is manpower-consuming. An automatic illegal parking recognition system is needed to improve inspection efficiency. The system aims to analyze the on-road vehicles photographed by patrol cars and identify whether a vehicle is parked illegally. However, as the analyzed picture is stationary, a false alarm occurs when the detected vehicle is actually moving when photographed. To eliminate this system error, a model that recognizes whether a vehicle is parked is fundamental to illegal parking detection.

Motion detection is a valuable tracking method that has been used in many applications, such as measuring social polarization and tracking daily human motion [2]. Hardware motion detection uses sensors and radar to measure changes in signals to determine motion [3], whereas software motion detection uses machine learning algorithms to analyze the features of a moving object against a static background [4]. Optical flow-based algorithms are widely used because of their efficiency in displacement detection. Optical flow fields were introduced to motion detection because they can easily transfer motion in space to image sequences through the gray-scale distribution [5]. In an optical flow field, an object is considered to be moving [6] once its velocity vector differs from the background vector. Horn and Schunck published their pioneering work in 1981, which used a gradient method to estimate the optical flow of the whole image and solve the optical flow constraint problem in x-y-t coordinates. Since then, many improvements have been made [7–12] using machine learning techniques. The low complexity of these networks allows the algorithm to be used as a solution for live-video motion detection. Nevertheless, most optical flow algorithms are mathematically designed for static backgrounds; that is, apart from the moving targets, background pixels have no picture-by-picture displacement. In practical applications, however, the surveillance camera is usually placed on a patrol car to maximize the detection range while reducing surveillance equipment, and it takes surveillance photos while the patrol car is in motion. Therefore, directly applying an optical flow algorithm to pictures captured by a moving camera cannot recognize a stationary vehicle, because all the pixels have motion values.
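The core difficulty can be made concrete with a small sketch: given a dense flow field (however it was computed), subtracting the dominant background motion exposes the genuinely moving pixels. This median-subtraction rule is only an illustration of the principle, not the classifier-based method proposed in this paper; the frame size, threshold, and "vehicle patch" below are invented for the example.

```python
import numpy as np

def moving_mask(flow, thresh=1.0):
    # Estimate the dominant (camera-induced) motion as the per-component
    # median of the flow field, then flag pixels whose residual motion
    # magnitude exceeds the threshold as genuinely moving.
    camera = np.median(flow.reshape(-1, 2), axis=0)
    residual = flow - camera
    return np.linalg.norm(residual, axis=-1) > thresh

# Synthetic flow field: the whole 100x100 frame pans by (3, 0) pixels
# (camera motion), while a 20x20 "vehicle" patch additionally moves by (0, 4).
flow = np.tile(np.array([3.0, 0.0]), (100, 100, 1))
flow[40:60, 40:60] += np.array([0.0, 4.0])

mask = moving_mask(flow)   # True only on the vehicle patch
```

Real camera motion (panning plus rotation) produces a spatially varying, nonlinear flow field, which is exactly why such a fixed rule breaks down and a learned classifier is preferred.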
Camera movements, such as panning and rotation, result in complex relative motion between cars. After performing the optical flow algorithm to identify the absolute motion, we need to decompose the movement of the background image and the foreground vehicles. Since the motion information of the background and cars in the optical flow is nonlinear, it is very challenging to distinguish the difference between them by manual judgment. Adding a machine learning classifier to recognize the motion of the background and the vehicle automatically becomes a better choice. This paper proposes a method that could identify the movement of target vehicles when the detection device is moving. We established the connection between different pictures through an optical flow method for obtaining the movement information of each pixel between different pictures. To eliminate the side effects on the optical flow method brought by the movement of the detection device, the Extreme Learning Machine (ELM) [13] based classification approach is adopted. We applied the machine learning method to compare the tiny difference between the movement information of the background and targets, enabling the whole system to realize the identification of movement with a not fixed detection device. We collect data for both moving and parking vehicles for training and testing to allow the ELM model to perform better in general
Z. Zhu et al.
situations. Our method can be applied to vehicle motion detection on on-road videos recorded at unknown speeds. The main contributions of this paper are:
1. We propose a motion detection method based on optical flow to achieve vehicle motion recognition against a non-static background. Our approach effectively learns the optical flow difference between a moving vehicle and the moving background by utilizing the ELM model.
2. Instead of using optical flow information directly, we propose a postprocessing method on optical flow. We also show that this method achieves a good data representation for the motion classification task.
3. We build a dataset for vehicle motion detection based on our approach.
The remainder of the paper is organized as follows: the theoretical algorithms used in the proposed vehicle motion detection system are summarized in Sect. 2. The details of the proposed ELM-based vehicle motion detection system and our vehicle motion detection dataset are illustrated in Sect. 3. Related experiments and results are shown in Sect. 4. Finally, the conclusion is in Sect. 5.
2 Theoretical Background

2.1 PWC-Net Optical Flow Algorithm
The optical flow algorithm used in our system is the PWC-Net algorithm [12], a fast algorithm that can be applied to real-time illegal parking detection. PWC-Net uses its warping network to construct a better image pyramid for the feature extractor and refinement network, eliminating errors when small objects disappear. Meanwhile, to speed up disparity detection, a partial cost volume is introduced to express the difference between the images. Following the warping process, the network can predict pixel displacement at high speed from image pairs. PWC-Net runs faster and has fewer parameters than other CNN models [11,14,15], which shows that it can be used in real-time vehicle motion detection applications.
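The partial cost volume mentioned above is essentially a correlation between the feature maps of the two images over a limited displacement search range. The following NumPy sketch is our own simplified illustration of that idea, not the paper's or PWC-Net's actual implementation (which operates on learned multi-scale features):

```python
import numpy as np

def partial_cost_volume(f1, f2, max_disp=2):
    """Correlation cost volume over a limited search range.

    f1, f2: feature maps of shape (C, H, W) from the two images.
    Returns an array of shape (D, H, W), where D = (2*max_disp + 1)**2
    is the number of candidate displacements (dy, dx).
    """
    C, H, W = f1.shape
    d = max_disp
    # Zero-pad the second feature map so every shift stays in bounds.
    f2p = np.pad(f2, ((0, 0), (d, d), (d, d)))
    costs = []
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = f2p[:, d + dy:d + dy + H, d + dx:d + dx + W]
            # Normalized dot product over channels at each pixel.
            costs.append((f1 * shifted).sum(axis=0) / C)
    return np.stack(costs)
```

Restricting the search to `max_disp` is what makes the cost volume "partial" and cheap enough for real-time use.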
2.2 Extreme Learning Machine Neural Network
The structure of the ELM model is shown in Fig. 1. Its main advantage is its simple structure. The weights αij from the input layer to the hidden layer are determined randomly once and do not need to be adjusted during the execution of the algorithm. The weights βi from the hidden layer to the output layer only need to be determined by solving a system of linear equations. As a result, the computational speed is greatly improved: its training is reported to be 1273 times faster than the SVM model [16] and 21 times faster than the BP model. The algorithm has good generalization performance [17], so it can be applied to binary classification tasks in multiple scenarios. We select this classifier to meet real-time application needs. We use the ELM to learn the displacement features of pixels so that the system can recognize a vehicle's motion even with non-fixed background information.
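The training procedure described above — input weights drawn once at random, output weights obtained by solving a linear system — can be sketched in NumPy as follows (class and variable names are ours, not from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ELM:
    """Single-hidden-layer feed-forward ELM for classification (sketch)."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Input-to-hidden weights and biases: drawn once, never updated.
        self.W = rng.standard_normal((n_inputs, n_hidden))
        self.b = rng.standard_normal(n_hidden)

    def fit(self, X, Y):
        # Hidden activations H = g(XW + b); the output weights beta are
        # the least-squares solution of the linear system H @ beta = Y.
        H = sigmoid(X @ self.W + self.b)
        self.beta = np.linalg.pinv(H) @ Y
        return self

    def predict(self, X):
        H = sigmoid(X @ self.W + self.b)
        return (H @ self.beta).argmax(axis=1)
```

Because no iterative back-propagation is involved, training reduces to a single matrix pseudoinverse, which is the source of the speed advantage quoted above.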
Application of ELM Model to the Motion Detection
Fig. 1. ELM network structure. The network structure in the figure is a single hidden layer feed-forward structure. It only contains one input layer, one hidden layer and one output layer. The number of neurons in each layer affects the performance of the neural network.
3 ELM Based Vehicle Motion Detection System
The architecture of the proposed motion detection system is shown in Fig. 2. The input of the system is the continuous pictures obtained from the patrol camera. We use PWC-Net [12] to analyze the movement of each pixel between two consecutive pictures, and YOLO v2 [18] to detect vehicle position information in the frame. The vehicle position information helps locate the target area in the optical flow output as a region of interest (ROI). After some mathematical processing, the ROI optical flow information is sent to the ELM model for motion classification.
Fig. 2. Overall architecture of our ELM-based vehicle motion detection system.
3.1 Vehicle Motion Detection
Data Preprocessing. The data used by the ELM classifier is the optical flow result. In a real application, the target vehicle occupies a different proportion of the picture depending on its distance from the camera. A tiny car against a large amount of background information can interfere with the recognition of the optical flow algorithm, and the optical flow information of a tiny target tends to be ignored. As shown in Fig. 3, the small car in the middle is far from the camera and has little pixel information; its optical flow results are almost zero. Therefore, from the vehicle recognition detection we obtain the position information of the vehicles in each picture, and we define twice the size of this position box as the region of interest (ROI). Instead of inputting the original picture's optical flow information into the ELM classifier, we cut out each vehicle's ROI optical flow information and then cut out the corresponding region in the next picture. This pair of vehicle optical flow crops is used as the input for motion classification. The double-sized area around the located vehicle keeps both the vehicle's and the corresponding background's displacement information; in this way, we avoid the imbalance between background and target information. Besides, selecting a large- or middle-sized vehicle also mitigates the effect of the coarse-to-fine approach in optical flow processing. In real motion detection applications, massive amounts of data are processed, so the algorithm's memory usage is vital. We preprocess the region-of-interest optical flow information to reduce the computational complexity and running memory while keeping the optical flow motion features. Normally, the optical flow outputs U = {u1, ..., uL} along the horizontal axis and V = {v1, ..., vL} along the vertical axis are concatenated to express motion changes [19]. However, the concatenation doubles the amount of computed data, which increases the running memory. We instead vectorize the optical flow information in the form √(U² + V²), which keeps the length of the data while still containing the displacement features. Table 1 compares the performance of the vectorized optical flow information and the concatenated optical flow
Fig. 3. Vehicle picture comparison. (a) A picture of parked vehicles; the red box marks the white car in the distance, which is moving. (b) The corresponding optical flow picture; the darker the color, the greater the displacement. The optical flow information at the position of the white car is so small that its movement cannot be discerned.
information and their running memory under the ELM model. The comparison shows that the vectorized information uses less memory while keeping high accuracy.

Table 1. Comparison of concatenated optical flow information and vectorized optical flow information. Tested with vectorized information, the classifier gains higher accuracy with a smaller memory footprint.

                 Concatenated (u, v)        √(U² + V²)
                 Acc (%)  Memory (MB)       Acc (%)  Memory (MB)
Test dataset 1   59.00    363.8             67.92    198.3
Test dataset 2   56.91    18.23             61.18    37.46
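The ROI construction and vectorization described above — doubling the detected box, cropping the flow fields, and collapsing (U, V) into a single magnitude channel — might look like the following sketch (the function names and the (x, y, w, h) box format are our assumptions, not the paper's):

```python
import numpy as np

def double_roi(box, img_h, img_w):
    """Expand an (x, y, w, h) detection box to twice its size around the
    same center, clamped to the image, so background flow is retained."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    x0, y0 = max(int(cx - w), 0), max(int(cy - h), 0)
    x1, y1 = min(int(cx + w), img_w), min(int(cy + h), img_h)
    return x0, y0, x1, y1

def roi_flow_feature(u, v, box):
    """Crop the doubled ROI from the flow fields and vectorize the
    magnitude sqrt(u^2 + v^2), halving the length vs. concatenation."""
    img_h, img_w = u.shape
    x0, y0, x1, y1 = double_roi(box, img_h, img_w)
    mag = np.sqrt(u[y0:y1, x0:x1] ** 2 + v[y0:y1, x0:x1] ** 2)
    return mag.ravel()
```

The magnitude crop from one frame, paired with the same region in the next frame, would then form one classifier input.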
ELM Motion Detection Classifier. In our detection system, the ELM is chosen as the final classifier for motion detection because of its good balance between generalization ability and processing speed compared with other classifiers. Besides, the ELM is particularly good at two-category classification problems, which suits the motion detection task. For a set of L regions of interest of size a × b, the optical flow algorithm outputs the matrices U = {u1, ..., uL} and V = {v1, ..., vL}, where U ∈ R^(a×b) is the horizontal optical flow information and V ∈ R^(a×b) is the vertical optical flow information. In the motion detection process, the vectorized optical flow information is the training data for the ELM model, denoted O = {o1, ..., oL}. The hidden state H = {h(o1), ..., h(oL)} is computed by the feed-forward layer of the ELM model:

h(o_i) = g(W^T · o_i + b)    (1)

The weight W and bias b are randomly generated. The hidden state h(o_i) ∈ R^M, where M is the number of hidden neurons, and g(·) is the sigmoid function. The hidden state is then fed into the linear layer with output weight matrix β ∈ R^(M×c) to obtain the classification result Y ∈ R^c:

Y = β · h(o_i)    (2)
3.2 New Vehicle Motion Detection (VMD) Dataset
Aiming to detect illegal parking in Singapore, our team collected a vehicle detection dataset in support of Singapore parking violation regulations [20]. From it, we derived a vehicle motion detection dataset that can be applied to our work. The dataset contains three vehicle types: heavy vehicles, light vehicles, and motorcycles. Each picture is 1080 × 1920 pixels, and consecutive pictures of the same scene are three frames apart. Some examples from the dataset are shown in Fig. 4.
Fig. 4. Example of data containing both moving and parked vehicles. (a) is three frames earlier than (b). The buses in the figures are stopped while the black car is moving.
All the data are grouped, using the anchor position information, into two groups: moving vehicles and parked vehicles. An anchor is one meaningful detected vehicle, and the optical flow information of each anchor is given for the motion detection task. The details are listed in Table 2.

Table 2. The composition of the VMD dataset.

           Pictures (with vehicle       Anchors (with optical
           position information)        flow information)
Moving     193                          3051
Still      440                          5352
Total      633                          8403
The features of the VMD dataset are summarized below:
1. The pictures in the dataset are photographed by our surveillance vehicle, which drives at an arbitrary speed under 45 km/h.
2. The dataset contains almost all kinds of on-road situations, such as crossroads, turning, vehicles moving in a narrow street, and reverse driving, as shown in Fig. 5.
3. The dataset contains vehicles of both large and small sizes, as shown in Fig. 6.
4. The optical flow information given in the dataset is in the √(U² + V²) form that we mentioned before.
Fig. 5. Some examples of the VMD dataset: (a) cross road, (b) turning situation, (c) narrow street, (d) reverse driving.
Fig. 6. Examples of vehicles at a proper distance: (a) a vehicle of large size, (b) a vehicle of small size. The blue box marks the recognized car and is drawn from the position information provided by our dataset. The optical flow information is obtained from a doubled area around this position.
4 Experiment Results

4.1 Dataset and Model Settings
The dataset we used is the VMD dataset described in the last section. We divided it into a training set and five testing sets. The camera moving speed is the same within a testing set but varies across testing sets; the different background moving speeds test the generalization of the proposed model. The training and testing data are summarized below:

1. Training set: 1500 moving anchors and 1500 stationary anchors.
2. Testing set 1: 220 moving anchors and 220 stationary anchors.
3. Testing set 2: 47 moving anchors and 469 stationary anchors.
4. Testing set 3: 107 moving anchors and 149 stationary anchors.
5. Testing set 4: 95 moving anchors and 477 stationary anchors.
6. Testing set 5: 198 moving anchors and 532 stationary anchors.
All the experiments are run on a Linux system with an Intel(R) Xeon(R) Gold 6128 CPU @ 3.40 GHz and a Tesla P100-PCIE-12GB GPU. The proposed algorithms are run in Python 3.6 with PyTorch 10.1.
4.2 Results Analysis
After running the optical flow algorithm, the system outputs the horizontal displacement U and vertical displacement V of the pixels. The optical flow information is vectorized in the form √(U² + V²). We compare the performance of two other machine learning algorithms, (1) SVM [17] and (2) the K-Means algorithm [21], with the ELM model. The testing results are listed in Table 3.

Table 3. Vehicle motion detection comparison based on three models. "M Acc" indicates the moving accuracy, "S Acc" indicates the stationary accuracy, and "W Acc" indicates the weighted accuracy.

test set  SVM                         KMeans                      ELM
          M Acc   S Acc   W Acc      M Acc   S Acc   W Acc      M Acc   S Acc   W Acc
1         89.55%  98.64%  94.09%    39.55%  71.36%  55.45%    91.36%  91.36%  91.36%
2         63.83%  68.66%  68.22%    74.47%  64.61%  65.50%    79.40%  86.79%  84.73%
3         88.79%  74.50%  88.47%    14.02%  60.40%  41.02%    80.21%  85.92%  84.16%
4         74.74%  71.91%  72.38%    29.47%  48.85%  45.63%    79.53%  81.98%  81.33%
5         74.24%  74.44%  74.38%    54.04%  65.60%  62.47%    79.31%  83.22%  82.18%
The results of our approach surpass the compared models with more than a 10% improvement in three out of five datasets. Besides, our system over-fits less to either the moving or the stationary class than the other two models: across all five datasets, the difference between moving accuracy and stationary accuracy is 3.89% on average for our model, against a 6.18% average difference for the SVM model and a 23.80% average difference for the K-Means model. In testing sets 1 and 3, SVM has a higher weighted accuracy than our model; however, its difference between moving and stationary accuracy in these two sets is 9.09% and 14.29%, respectively, which is too large for the actual application. For our model, the difference in these two sets is 0% and 5.71%, respectively. As a result, compared with the other two models, our approach is more suitable for implementation in an actual vehicle motion detection application.
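The paper does not spell out how "W Acc" is computed. Assuming it is the per-class accuracy weighted by each testing set's class counts (our assumption), the formula reproduces several Table 3 entries, e.g. the SVM result on testing set 2:

```python
def weighted_acc(m_acc, s_acc, n_moving, n_still):
    """Per-class accuracies weighted by the class sizes of a testing set."""
    return (m_acc * n_moving + s_acc * n_still) / (n_moving + n_still)

# SVM on testing set 2 (47 moving / 469 stationary anchors):
print(round(weighted_acc(63.83, 68.66, 47, 469), 2))  # → 68.22
```

The same formula also matches, for instance, the K-Means entry for testing set 4 (45.63%).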
5 Conclusion
We probed into recognizing the motion state of vehicles in Singapore. We established a new VMD dataset to better train a motion detection model on a moving device, and proposed a vehicle motion detection system based on optical flow and the Extreme Learning Machine. In this system, we use the PWC-Net algorithm to capture the motion features of vehicles in pictures with a relatively moving background, and apply the ELM model as the motion classifier. Our vehicle motion detection model achieves 84.75% average accuracy on the five tested datasets. Based on our algorithm, this system can be implemented for actual motion detection applications.
In future work, we will extend this method with deep learning and add depth features into the model to better represent the image. Adding mask detection could also capture the contours of the vehicles. Besides, increasing the variety of the dataset can make this system applicable to more practical situations.

Acknowledgment. The authors are very grateful for the support and help from the Singapore Urban Redevelopment Authority (URA).
References 1. Baum, C.L.: The effects of vehicle ownership on employment. J. Urban Econ. 66(3), 151–163 (2009) 2. Wang, J., Lu, C., Zhang, K.: Textile-based strain sensor for human motion detection. Energy Environ. Mater. 3(1), 80–100 (2020) 3. Gouveia, C., Vieira, J., Pinho, P.: A review on methods for random motion detection and compensation in bio-radar systems. Sensors 19(3), 604 (2019) 4. Singh, T., Sanju, S., Vijay, B.: A new algorithm designing for detection of moving objects in video. Int. J. Comput. Appl. 96(2) (2014) 5. Xin, Y., Hou, J., Dong, L., Ding, L.: A self-adaptive optical flow method for the moving object detection in the video sequences. Optik 125(19), 5690–5694 (2014) 6. Wang, Z., Yang, X.: Moving target detection and tracking based on pyramid LucasKanade optical flow. In: 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC), pp. 66–69. IEEE (2018) 7. Liu, C., Yuen, J., Torralba, A.: Sift flow: dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 978–994 (2010) 8. Timofte, R., Van Gool, L.: Sparse flow: sparse matching for small to large displacement optical flow. In: 2015 IEEE Winter Conference on Applications of Computer Vision, pp. 1100–1106. IEEE (2015) 9. Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: DeepFlow: large displacement optical flow with deep matching. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1385–1392 (2013) 10. Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: EpicFlow: edge-preserving interpolation of correspondences for optical flow. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1164–1172 (2015) 11. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2462–2470 (2017) 12. 
Sun, D., Yang, X., Liu, M.-Y., Kautz, J.: PWC-net: CNNs for optical flow using pyramid, warping, and cost volume. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943 (2018) 13. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006) 14. Bailer, C., Varanasi, K., Stricker, D.: CNN-based patch matching for optical flow with thresholded hinge embedding loss. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3250–3259 (2017) 15. Wulff, J., Sevilla-Lara, L., Black, M.J.: Optical flow in mostly rigid scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4671–4680 (2017)
16. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme learning machine: a new learning scheme of feedforward neural networks. In: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), vol. 2, pp. 985–990. IEEE (2004) 17. Jakkula, V.: Tutorial on support vector machine (SVM). School of EECS, Washington State University 37(2.5), 3 (2006) 18. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271 (2017) 19. Wang, L., Koniusz, P., Huynh, D.Q.: Hallucinating IDT descriptors and I3D optical flow features for action recognition with CNNs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8698–8708 (2019) 20. Peng, X., et al.: Real-time illegal parking detection algorithm in urban environments. IEEE Trans. Intell. Transp. Syst. 23, 20572–20587 (2022) 21. Selim, S.Z., Ismail, M.A.: K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Anal. Mach. Intell. 1, 81–87 (1984)
Predicting the Colorectal Cancer Mortality in the Region of Lleida, Spain: A Machine Learning Study Didac Florensa1(B) , Jordi Mateo1 , Francesc Solsona1 , Pere Godoy2 , and Leonardo Espinosa-Leal3 1
Department of Computer Engineering, University of Lleida, C/de Jaume II, 25001 Lleida, Spain {didac.florensa,jordi.mateo,francesc.solsona}@udl.cat 2 Institute of Biomedical Research of Lleida, Av. Alcalde Rovira Roure 80, 25198 Lleida, Spain [email protected] 3 Graduate School and Research, Arcada University of Applied Sciences, Jan-Magnus Janssons plats 1, 00560 Helsinki, Finland [email protected]
Abstract. Previous works have shown the risk factors associated with an increased likelihood of colorectal cancer (CRC) and its bad prognosis. This study aimed to build an efficient model to predict the mortality caused by CRC in Lleida, Spain. To this purpose, three different machine learning algorithms, Random Forest (RF), Neural Network (NN) and Extreme Learning Machine (ELM), were trained at different augmentation rates on a real dataset. It contained gender, age group, risk factors such as body mass index (BMI), smoking and alcohol consumption, and tumour staging. The study included 179 patients with a detected CRC, of whom 16 passed away. Furthermore, to balance the dataset, the Synthetic Minority Oversampling Technique (SMOTE) algorithm was used. The results show that Random Forest (RF) obtained an accuracy of 90% with the balanced dataset. The Extreme Learning Machine (ELM) achieved an accuracy similar to RF (around 90%), while the Neural Network (NN) decreased the performance and got an accuracy of 80%. Regarding precision, recall and F1-score, RF and ELM obtained similar outcomes. These results suggest the excellent performance of these models and support the use of oversampling to balance the dataset; they can be considered suitable algorithms for building a predictive model.
Keywords: Colorectal cancer · Mortality · Risk factors · Machine Learning · Predictive models
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K.-M. Björk (Ed.): ELM 2022, PALO 18, pp. 70–79, 2024. https://doi.org/10.1007/978-3-031-55056-0_8
Predicting the CRC Mortality in the Region of Lleida, Spain: A ML Study
1 Introduction
Colorectal cancer (CRC) is the third most common cancer type worldwide [12,13]. In Europe, around 250,000 new colon cases are diagnosed each year, accounting for around 9% of all malignancies, and, regarding mortality, CRC is the fourth most common cause of cancer death [5]. Despite this, screening programs and early detection have improved the surveillance of this illness [11,29]. In addition, CRC treatment has also played an essential role in the increase in survival for this cancer [28]. Lifestyle, specifically bad habits, is significantly associated with a higher incidence of CRC: risk factors such as body mass index (BMI), smoking or alcohol consumption increase the risk of CRC [19], and excess weight has been associated with this risk and with CRC mortality. Artificial intelligence (AI) is undergoing continuous enhancement, playing a pivotal role in automating processes and facilitating decision-making across industrial, medical, and business services [4,16]. Machine learning (ML) has emerged as a transformative tool capable of constructing predictive models that significantly streamline the comprehension and analysis of data. Depending on the specific problem at hand, various machine learning algorithms can be leveraged to develop models tailored for classification or regression tasks. In this study, our focus is on classification: we employed the Random Forest, Neural Network, and Extreme Learning Machine algorithms to craft predictive models assessing the mortality risk among colorectal cancer (CRC) patients. The implementation used the relevant risk factors, sociodemographic information, and tumor staging. This study's main contributions are building a model to predict CRC mortality and comparing several machine learning algorithms to analyse their performance on this kind of information.

1.1 Related Works
Some previous studies have been published on the risk of colorectal mortality related to the aforementioned risk factors. Shaukat et al. published a study on the association between the BMI index and colorectal cancer mortality [30]; the authors confirmed a relationship between BMI and long-term colorectal cancer mortality. Heavy alcohol drinking has also been associated with colorectal cancer death: Shaofang et al. concluded that an association exists between heavy alcohol drinking (>50 g/day of ethanol) and CRC mortality [8]. Tobacco smoking has also previously been related to CRC mortality: Chao et al. demonstrated the association between smoking and CRC death [9], and a recent study by Parajuli et al. reached similar conclusions [24]. These studies demonstrated these associations and motivated other researchers to use them to build predictive models using machine learning algorithms. Biglarian et al. analysed the use of artificial neural networks (ANN) to make a predictive model [7]; the results demonstrated that an ANN could be used to predict the risk of CRC mortality. A recent article did a
comparative study of several machine learning algorithms, one of which was Random Forest [23]. Their results suggested an excellent performance of several algorithms for predicting CRC mortality.
2 Research Methodology
Modelling cancer-related mortality using machine learning has become an essential practice in medical environments for decision-making. In this work, we aim to predict colorectal cancer mortality by comparing three state-of-the-art models trained with data augmented using a well-studied methodology.

2.1 Dataset
The collected data correspond to new cancer cases in the Lleida region (Spain) between 2012 and 2016, collected by the population cancer registry of Lleida. Lleida is the largest province in Catalonia (Spain), with a population density of 36 people per square kilometre; more specifically, the population was 438,001 in 2014 [1], 221,891 men and 216,110 women. The population of the Lleida region presents lifestyles, risk factors and work activities which can be linked to a specific incidence of certain types of illness. Nearly half of the population of Lleida province lives in rural areas; consequently, their lifestyle is different from that of the more urban populations in other Catalan provinces. A peculiarity of this region is the work environment: in rural areas, the main activity is the agrifood industry and, in urban areas, it is service-sector activities such as education, health and catering [14]. The overall data set consisted of 179 CRCs. The colorectal cancer patients who passed away numbered 16 (9.0%), against the 163 (91.0%) that survived. This represents an imbalanced dataset, making it challenging to build a predictive model for mortality. To solve this problem, we used different kinds of sampling techniques that balance the data. In addition, the final dataset included only those patients with a registered stage degree; patients with an unknown stage were discarded. More specifically, the training data set consisted of 80% of the total cases (143), and the test set was the remaining 20% (36). Each corpus register contains the following variables: age group (